Introduction to Hashing

 

Objectives of this lecture

q       Introduce the Hashing technique

q       Learn some methods of choosing Hashing functions

q       Learn some collision resolution methods

 

What is Hashing?

q       Recall from the previous lecture that an index table is an auxiliary array used to find data stored in another array.

q       One limitation of an index table is that it can only store array indices, so it can only be used to access tables whose key is the array index.

q       Hashing extends the index table to the general case where the key is no longer an array index. The resulting table is called a hash table.

q       It is used extensively in many applications and is considered one of the cleverest inventions of computer science.

q       The idea is to set up a correspondence between the keys by which we wish to retrieve information and the indices used to access an array.

q       This is done by developing suitable index functions, more appropriately called hash functions.

q       Ideally, a hash function should be one-to-one, i.e. distinct keys should be mapped to distinct indices.

q       Unfortunately, in most cases a hash function results in collisions, where two or more keys are mapped to the same index.

q       Thus hashing involves two main tasks:

Ø      Finding a good hash function

Ø      Determining how to resolve collisions.

 

q       Regardless of the hash function chosen and the collision resolution method adopted, the main tasks involved in hashing are:

Ø      Declare an array to hold the hash table, including a field for the key in each entry.

Ø      Initialize the table to empty; how an empty location is marked depends on the application.

Ø      To insert a record, first compute the hash function of its key. If the corresponding location is empty, insert the record there; if it already holds a record with an equal key, the insertion is not allowed; otherwise, collision resolution is necessary.

Ø      To retrieve the record with a given key, first evaluate the hash function. If the record is at that location, the search is done; otherwise, while the location is not empty and the record has not been found, follow the same steps used for collision resolution. If an empty position is reached or all locations have been considered, the record does not exist.
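The insert and retrieve steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hash function is key mod TABLESIZE, collisions are resolved by stepping forward one location at a time, and the names EMPTY, make_table, insert and retrieve are hypothetical.

```python
EMPTY = None  # assumed marker for an unused location


def make_table(size):
    """Declare the table and initialise every location to empty."""
    return [EMPTY] * size


def insert(table, key, value):
    """Insert (key, value); return False on a duplicate key or a full table."""
    size = len(table)
    home = key % size                  # assumed hash function: key mod TABLESIZE
    for step in range(size):
        slot = (home + step) % size    # assumed resolution: step forward by one
        if table[slot] is EMPTY:
            table[slot] = (key, value)
            return True
        if table[slot][0] == key:
            return False               # equal keys: insertion is not allowed
    return False                       # every location considered: table is full


def retrieve(table, key):
    """Return the value stored under key, or None if the record does not exist."""
    size = len(table)
    home = key % size
    for step in range(size):
        slot = (home + step) % size
        if table[slot] is EMPTY:
            return None                # empty position reached: key is absent
        if table[slot][0] == key:
            return table[slot][1]
    return None                        # all locations considered


table = make_table(7)
insert(table, 10, "ten")          # 10 mod 7 = 3
insert(table, 17, "seventeen")    # 17 mod 7 = 3, collides and moves to location 4
print(retrieve(table, 17))        # seventeen
print(retrieve(table, 24))        # None
```

Retrieval follows exactly the same probe order as insertion, which is why a search can stop as soon as it reaches an empty location.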

 

Choice of TABLESIZE

q       The choice of the size of a hash table, TABLESIZE, can affect the collision rate of a hash function as shown by the following examples:

 

Example 1a: Insert the numbers 10, 20, 30, 40, 50, 60, 70 and 80 into a hash table with TABLESIZE = 13, assuming the hash function:

f(key) = key mod TABLESIZE.

 

Index:   0   1   2   3   4   5   6   7   8   9  10  11  12
Key:    -1  40  80  -1  30  70  -1  20  60  -1  10  50  -1

(-1 marks an empty location.)

 

Example 1b: Same as Example 1a, but with TABLESIZE = 15.

 

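Index:   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14
Keys:   30  -1  -1  -1  -1  20  -1  -1  -1  -1  10  -1  -1  -1  -1
        60  -1  -1  -1  -1  50  -1  -1  -1  -1  40  -1  -1  -1  -1
        -1  -1  -1  -1  -1  80  -1  -1  -1  -1  70  -1  -1  -1  -1

Each column lists the keys that hash to that index (-1 marks no key). Every key lands on index 0, 5 or 10.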

 

q       The examples above show that the choice of TABLESIZE affects how many collisions occur. The best choice depends on the data set, but a prime TABLESIZE usually leads to fewer collisions.
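To check Example 1 mechanically, the sketch below groups the keys 10 through 80 (the keys that appear in the two tables) by the index that key mod TABLESIZE assigns to them. The helper name slots_used is hypothetical.

```python
from collections import defaultdict


def slots_used(keys, table_size):
    """Group keys by the index that key mod table_size assigns to them."""
    groups = defaultdict(list)
    for key in keys:
        groups[key % table_size].append(key)
    return dict(groups)


keys = [10, 20, 30, 40, 50, 60, 70, 80]
for size in (13, 15):
    groups = slots_used(keys, size)
    collisions = len(keys) - len(groups)   # keys that found their index taken
    print(size, collisions, sorted(groups.items()))
```

With TABLESIZE 13 each key gets its own index; with 15 only the indices 0, 5 and 10 are used. Every key is a multiple of 10 and gcd(10, 15) = 5, so only 15 / 5 = 3 indices are reachable, whereas the prime 13 shares no factor with 10.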

 

Choosing Hash functions:

q       The principal criteria in selecting hash functions are that:

Ø      It should be easy to compute the index of the key

Ø      It should distribute keys evenly

Ø      It should minimize collision

 

Some of the common methods of achieving these are:

Truncating

q       This involves ignoring part of a key and using the remaining part as the index.

e.g. 972136 can be mapped to 236 (keeping the third, fifth and sixth digits)

q       Truncation is very fast, but it often fails to distribute keys evenly
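A minimal sketch of truncation, assuming the kept digits are the third, fifth and sixth of a six-digit key (the choice that turns 972136 into 236). The function name truncate and its positions parameter are hypothetical.

```python
def truncate(key, positions=(2, 4, 5)):
    """Keep only the digits at the given zero-based positions of the key."""
    digits = str(key)
    return int("".join(digits[p] for p in positions))


print(truncate(972136))   # 236
print(truncate(152836))   # 236 as well: the ignored digits differ, the index does not
```

The second call shows why the spread can be poor: any two keys that agree in the kept digits collide, no matter how different the rest of the key is.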

 

Folding

q       This involves partitioning the key into several parts and combining the parts to obtain the index.

e.g. 972136 is mapped to 9 + 72 + 136 = 217

q       Since all information in the key can affect the value of a hash function, folding often achieves a better spread than truncation.
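The folding example can be reproduced with the sketch below. It assumes the key is split into parts of 1, 2 and 3 digits, as in 9 + 72 + 136; the name fold and the part_lengths parameter are hypothetical.

```python
def fold(key, part_lengths=(1, 2, 3)):
    """Split the key's digits into parts of the given lengths and add them."""
    digits = str(key)
    total, start = 0, 0
    for length in part_lengths:
        total += int(digits[start:start + length])
        start += length
    return total


print(fold(972136))        # 9 + 72 + 136 = 217
print(fold(972136) % 13)   # 9, if the sum is then reduced to a table of size 13
```

The final reduction mod 13 is an extra assumption, added because the folded sum may be larger than the table.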

Modular Arithmetic

q       This involves converting the key to an integer and taking the remainder on dividing by TABLESIZE.

q       The spread achieved by this method, as the examples above showed, depends on TABLESIZE. If the size is a power of a small integer such as 2 or 10, the chance of collision is higher, because only the low-order bits or digits of the key then affect the index.
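The following sketch illustrates both points. The conversion of string keys (reading the UTF-8 bytes as one large integer) is an assumption chosen for brevity, and the name mod_hash is hypothetical.

```python
def mod_hash(key, table_size):
    """Convert the key to an integer if needed, then reduce it mod table_size."""
    if isinstance(key, str):
        # Assumed conversion: read the UTF-8 bytes as one big base-256 number.
        key = int.from_bytes(key.encode("utf-8"), "big")
    return key % table_size


keys = [1200, 2300, 4500, 7800]
print([mod_hash(k, 100) for k in keys])   # [0, 0, 0, 0]
print([mod_hash(k, 97) for k in keys])    # [36, 69, 38, 40]
```

With a table size of 100 only the last two decimal digits of each key matter, so all four keys collide; with the prime 97 they spread to four different indices.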

 

Collision Resolution

There are many methods of collision resolution.  Some of these are:

Linear Probing

q       This involves starting with the hash-address and making a sequential search backward or forward for the desired key or an empty location.

q       The major drawback of linear probing is clustering: occupied locations form long runs, so the sequential search grows longer and longer as more insertions are made.

 

Example 2:

Consider the Hash function:

      f(Ln) = n mod TABLESIZE

 

where L is a letter of the English alphabet and n is the letter's position in the alphabet (A = 1, ..., Z = 26). Insert the following letters into a table of size 7 using decremental linear probing:

S, B, J, N, X, W

 

Solution:

 

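Index: 0    1    2    3    4    5    6
Entry: N14  X24  B2   J10  -    S19  W23

(- marks the empty location.) S, B, J and N go straight to their hash addresses 5, 2, 3 and 0. X (24 mod 7 = 3) finds locations 3 and 2 occupied and settles at 1. W (23 mod 7 = 2) finds 2, 1 and 0 occupied and wraps around to 6.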
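Example 2 can be replayed with a short sketch of decremental linear probing. The function names letter_number and insert_decremental are hypothetical, and the letters are assumed to be uppercase.

```python
def letter_number(letter):
    """Position of an uppercase letter in the alphabet: A = 1, ..., Z = 26."""
    return ord(letter) - ord("A") + 1


def insert_decremental(table, letter):
    """Place the letter at n mod TABLESIZE, stepping backward on collision."""
    size = len(table)
    n = letter_number(letter)
    slot = n % size
    for _ in range(size):
        if table[slot] is None:
            table[slot] = f"{letter}{n}"
            return slot
        slot = (slot - 1) % size   # one location back, wrapping from 0 to size - 1
    raise RuntimeError("table is full")


table = [None] * 7
for letter in "SBJNXW":
    insert_decremental(table, letter)
print(table)
# ['N14', 'X24', 'B2', 'J10', None, 'S19', 'W23']
```

Locations 0 to 3 end up as one contiguous run of occupied entries, which is the clustering effect described above.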

 

 

Quadratic Probing

q       If there is a collision at address h, this involves probing the locations (h + i^2) % TABLESIZE for i = 1, 2, 3, ...

q       This substantially reduces clustering, especially when the table size is prime.
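A small sketch of the quadratic probe sequence, with the hypothetical helper name quadratic_probes, makes the effect of the table size visible:

```python
def quadratic_probes(h, table_size, attempts):
    """Return the locations (h + i*i) % table_size for i = 1 .. attempts."""
    return [(h + i * i) % table_size for i in range(1, attempts + 1)]


print(quadratic_probes(3, 7, 3))                   # [4, 0, 5]
print(sorted(set(quadratic_probes(0, 11, 10))))    # [1, 3, 4, 5, 9]
print(sorted(set(quadratic_probes(0, 16, 15))))    # [0, 1, 4, 9]
```

Starting from address 0, a table of prime size 11 offers five distinct alternative locations, while a table of size 16 offers only three besides the original address. Neither sequence visits every location, so an insertion can fail even when the table is not full.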

Rehashing

q       This involves using a second hash function to obtain the second position.
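The notes do not fix a particular second function, so the sketch below uses one common variant, usually called double hashing, in which the second hash value serves as the step between successive positions. The functions h1, h2 and rehash_probes are assumptions made for illustration.

```python
def h1(key, size):
    """First hash function: the usual key mod TABLESIZE."""
    return key % size


def h2(key, size):
    """Assumed second hash function; always in 1 .. size - 1, so never zero."""
    return 1 + key % (size - 1)


def rehash_probes(key, size):
    """Locations tried for key: h1 first, then repeated steps of h2."""
    start, step = h1(key, size), h2(key, size)
    return [(start + i * step) % size for i in range(size)]


print(rehash_probes(10, 7))   # [3, 1, 6, 4, 2, 0, 5]
print(rehash_probes(17, 7))   # [3, 2, 1, 0, 6, 5, 4]
```

Both keys collide at address 3, but their second positions differ (1 and 2), so they do not follow each other along the same path as they would under linear probing.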

Random Probing

q       This involves using a pseudorandom number generator to obtain the increment.  Note that the generator must be such that it always generates the same sequence, given the same seed.
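A minimal sketch of random probing with Python's standard random module. Seeding the generator with the key itself, and shuffling the increments so that each location is visited once, are assumptions made for this illustration; the name random_probes is hypothetical.

```python
import random


def random_probes(key, table_size):
    """Locations tried for key, using increments from a generator seeded by key."""
    home = key % table_size
    rng = random.Random(key)       # same seed, hence the same sequence every time
    # Assumed scheme: a seeded shuffle of the increments 1 .. table_size - 1,
    # so every other location is visited exactly once.
    increments = list(range(1, table_size))
    rng.shuffle(increments)
    return [home] + [(home + inc) % table_size for inc in increments]


first = random_probes(10, 7)
second = random_probes(10, 7)
print(first == second)   # True: retrieval retraces the path used at insertion
```

Reproducibility is what makes retrieval possible: a search for a key must visit the same locations, in the same order, as the insertion did.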