Introduction to Hashing
Objectives of this lecture
• Introduce the hashing technique
• Learn some methods of choosing hash functions
• Learn some collision resolution methods
What is Hashing?
• Recall from the previous lecture that an index table is an auxiliary array used to find data stored in another array.
• One limitation of an index table is that it can only store array indices, so it can only be used to access tables whose key is the array index.
• Hashing extends the index table to the general situation where the key is no longer an array index. The resulting table is called a hash table.
• It is used extensively in many applications and is considered one of the cleverest inventions of computer science.
• The idea is to set up a one-to-one correspondence between the keys by which we wish to retrieve information and the indices used to access an array.
• This is done by developing suitable index functions, more appropriately called hash functions.
• Ideally a hash function should be one-to-one, i.e. distinct keys should be mapped to distinct indices.
• Unfortunately, in most cases hash functions result in collisions: two or more keys are mapped to the same index.
• Thus hashing involves two main tasks:
  - Finding a good hash function
  - Determining how to resolve collisions
• Regardless of the hash function chosen and the collision resolution method adopted, the main tasks involved in hashing are:
  - Declare an array to hold the hash table, including a field for the key.
  - Initialize the table to empty; how "empty" is represented depends on the application.
  - To insert a record, first compute the hash function of its key. If the corresponding location is empty, insert the record; else if the key stored there equals the new key, insertion is not allowed; otherwise collision resolution is necessary.
  - To retrieve the record with a given key, first evaluate the hash function. If the record is at that location, it is found. Otherwise, while the location is not empty and the record has not been found, follow the same steps used for collision resolution. If an empty position is reached or all locations have been considered, the record does not exist.
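The insert and retrieve steps above can be sketched as follows. This is only a minimal sketch under stated assumptions: the class name HashTable, the use of None as the "empty" marker, the hash function key mod size, and the simple "try the next location" collision rule are all chosen here for illustration and are not prescribed by the lecture.

```python
EMPTY = None  # assumed marker for an unused location


class HashTable:
    """Illustrative table of (key, record) pairs with integer keys."""

    def __init__(self, size=13):
        self.size = size
        self.slots = [EMPTY] * size  # initialize the table to empty

    def _probe(self, key):
        """Yield every location once, starting at the hash address."""
        start = key % self.size  # assumed hash function
        for i in range(self.size):
            yield (start + i) % self.size  # assumed collision rule: step forward by one

    def insert(self, key, record):
        for pos in self._probe(key):
            if self.slots[pos] is EMPTY:
                self.slots[pos] = (key, record)
                return True
            if self.slots[pos][0] == key:
                return False  # equal keys: insertion is not allowed
        raise OverflowError("hash table is full")

    def retrieve(self, key):
        for pos in self._probe(key):
            if self.slots[pos] is EMPTY:
                return None  # reached an empty location: record does not exist
            if self.slots[pos][0] == key:
                return self.slots[pos][1]
        return None  # all locations considered


table = HashTable(13)
table.insert(10, "first")   # 10 mod 13 = 10, location is empty
table.insert(23, "second")  # 23 mod 13 = 10 as well, so it moves on to location 11
print(table.retrieve(23))   # second
print(table.retrieve(36))   # None (36 also hashes to 10, but location 12 is empty)
```

Note that retrieval follows exactly the same sequence of locations as insertion, which is why it can stop at the first empty location.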
Choice of TABLESIZE
• The choice of the size of a hash table, TABLESIZE, can affect the collision rate of a hash function, as the following examples show.
Example 1a: Load (insert) the numbers 10, 20, 30, 40, 50, 60, 70 and 80 into a hash table with TABLESIZE 13, assuming the hash function:
f(key) = key mod TABLESIZE
Index |  0 |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 | 12 |
Key   | -1 | 40 | 80 | -1 | 30 | 70 | -1 | 20 | 60 | -1 | 10 | 50 | -1 |
(-1 marks an empty location)
Example 1b: Same as in part (a) but with TABLESIZE as 15.
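Index |  0 |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 | 12 | 13 | 14 |
Keys  | 30 | -1 | -1 | -1 | -1 | 20 | -1 | -1 | -1 | -1 | 10 | -1 | -1 | -1 | -1 |
      | 60 | -1 | -1 | -1 | -1 | 50 | -1 | -1 | -1 | -1 | 40 | -1 | -1 | -1 | -1 |
      | -1 | -1 | -1 | -1 | -1 | 80 | -1 | -1 | -1 | -1 | 70 | -1 | -1 | -1 | -1 |
(each column lists the keys that hash to that index; -1 marks an empty location)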
• These examples show that the choice of TABLESIZE affects how well the keys spread: with TABLESIZE 13 every key gets its own location, while with TABLESIZE 15 the keys crowd onto indices 0, 5 and 10. The best choice depends on the data set, but a prime TABLESIZE usually leads to fewer collisions.
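The effect of TABLESIZE in Examples 1a and 1b can be checked with a few lines of code. The sketch below counts how many keys land on an index that an earlier key has already claimed; the function name count_collisions is hypothetical, and the keys are the eight values that appear in the example tables.

```python
from collections import Counter


def count_collisions(keys, table_size):
    """Count keys that hash to an index already claimed by another key."""
    per_index = Counter(key % table_size for key in keys)
    return sum(n - 1 for n in per_index.values())


keys = [10, 20, 30, 40, 50, 60, 70, 80]
print(count_collisions(keys, 13))  # 0
print(count_collisions(keys, 15))  # 5
```

With TABLESIZE 15 the keys share the common factor 5 with the table size, so only indices 0, 5 and 10 are ever used.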
Choosing Hash functions:
• The principal criteria in selecting a hash function are:
  - It should be easy to compute the index from the key
  - It should distribute keys evenly
  - It should minimize collisions
Some common methods of achieving these are:
Truncating
• This involves ignoring part of the key and using the remaining part as the index,
e.g. 972136 can be mapped to 236 by keeping only its third, fifth and sixth digits.
• Truncation is very fast, but it often fails to distribute keys evenly.
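As a small illustration of truncation, the sketch below keeps three chosen digits of the key and ignores the rest. The function name truncate_hash and the default digit positions (counted from 0 on the left) are assumptions made to reproduce the example 972136 -> 236.

```python
def truncate_hash(key, positions=(2, 4, 5)):
    """Build the index from selected digits of the key, ignoring the others."""
    digits = str(key)
    return int("".join(digits[p] for p in positions))


print(truncate_hash(972136))  # 236
print(truncate_hash(112136))  # 236 (differs only in ignored digits, so it collides)
```

The second call shows why truncation can distribute keys poorly: any two keys that differ only in the ignored digits receive the same index.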
Folding
• This involves partitioning the key into several parts and combining the parts to obtain the index,
e.g. 972136 is mapped to 9 + 72 + 136 = 217.
• Since all the information in the key can affect the value of the hash function, folding often achieves a better spread than truncation.
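Folding can be sketched in the same way. The function name fold_hash and the part lengths (1, 2, 3) are assumptions chosen to match the example above; the sketch also assumes the key has exactly as many digits as the part lengths add up to. In practice the sum would usually still be reduced to the table range, for instance with a final mod TABLESIZE.

```python
def fold_hash(key, part_lengths=(1, 2, 3)):
    """Split the key's digits into parts and add the parts together."""
    digits = str(key)
    total, start = 0, 0
    for length in part_lengths:
        total += int(digits[start:start + length])
        start += length
    return total


print(fold_hash(972136))  # 217, i.e. 9 + 72 + 136
print(fold_hash(112136))  # 149, i.e. 1 + 12 + 136
```

Unlike truncation, changing any digit of the key changes the result, which is why the two keys above no longer collide.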
Modular Arithmetic:
• This involves converting the key to an integer and taking the remainder on division by TABLESIZE.
• The spread achieved by this method depends on TABLESIZE, as the examples above showed. If the size is a power of an integer such as 2 or 10, the chances of collision are higher, because the remainder then depends only on the low-order bits or digits of the key.
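A minimal sketch of the modular method is given below. The function name mod_hash is hypothetical, and the way a string key is converted to an integer (reading its UTF-8 bytes as one big number) is just one possible choice, not the one the lecture prescribes.

```python
def mod_hash(key, table_size):
    """Convert the key to an integer, then take the remainder by the table size."""
    if isinstance(key, str):
        key = int.from_bytes(key.encode("utf-8"), "big")  # assumed conversion
    return key % table_size


keys = [10, 20, 30, 40, 50, 60, 70, 80]
print([mod_hash(k, 10) for k in keys])  # [0, 0, 0, 0, 0, 0, 0, 0]
print([mod_hash(k, 13) for k in keys])  # [10, 7, 4, 1, 11, 8, 5, 2]
print(mod_hash(972136, 1000))           # 136, only the last three digits matter
```

With a table size of 10 every key above collides at index 0, while the prime size 13 gives each key its own index.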
Collision Resolution
There are many methods of collision resolution. Some of these are:
Linear Probing
• This involves starting at the hash address and searching sequentially, backward or forward, for the desired key or an empty location.
• The major drawback of linear probing is clustering: occupied locations form long runs, so the sequential search grows longer and longer as insertions are made.
Example 2:
Consider the hash function:
f(Ln) = n mod TABLESIZE
where L is a letter of the English alphabet and n is the letter's number in the alphabet. Insert the following into a table of size 7 using decremental linear probing:
S, B, J, N, X, W
Solution:
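Index |  0  |  1  |  2  |  3  |  4  |  5  |  6  |
Entry | N14 | X24 | B2  | J10 |     | S19 | W23 |
S, B, J and N go straight to their hash addresses 5, 2, 3 and 0. X (24 mod 7 = 3) finds locations 3 and 2 occupied and settles at 1. W (23 mod 7 = 2) finds 2, 1 and 0 occupied and wraps around to 6.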
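Example 2 can be reproduced with a short sketch of decremental linear probing. The helper names letter_number and insert_decremental are hypothetical, and None is assumed as the marker for an empty location.

```python
def letter_number(letter):
    """Position of the letter in the English alphabet: A = 1, ..., Z = 26."""
    return ord(letter.upper()) - ord("A") + 1


def insert_decremental(table, letter):
    """Insert at the hash address, stepping backward (with wrap-around) on collision."""
    size = len(table)
    pos = letter_number(letter) % size
    for _ in range(size):
        if table[pos] is None:
            table[pos] = letter
            return pos
        pos = (pos - 1) % size  # move one location back; index 0 wraps to size - 1
    raise OverflowError("table is full")


table = [None] * 7
for letter in "SBJNXW":
    insert_decremental(table, letter)
print(table)  # ['N', 'X', 'B', 'J', None, 'S', 'W']
```

The last two insertions show clustering at work: X needs three probes and W needs four, because the occupied locations 0 to 3 have merged into one run.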
Quadratic Probing
• If there is a collision at address h, this involves probing locations (h + i^2) % TABLESIZE for i = 1, 2, 3, ...
• This substantially reduces clustering, especially for a prime table size.
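A minimal sketch of insertion with quadratic probing follows. The function name quadratic_probe_insert, the hash function key mod size, and None as the empty marker are assumptions for illustration. Note that the probe sequence need not reach every location, so the sketch gives up after as many probes as there are locations.

```python
def quadratic_probe_insert(table, key):
    """Try locations (h + i^2) % size for i = 0, 1, 2, ... until one is free."""
    size = len(table)
    h = key % size  # assumed hash function
    for i in range(size):
        pos = (h + i * i) % size
        if table[pos] is None:
            table[pos] = key
            return pos
    return None  # no free location found on this probe sequence


table = [None] * 7
for key in (10, 17, 24, 31):  # all four keys hash to address 3
    quadratic_probe_insert(table, key)
print(table)  # [24, None, None, 10, 17, 31, None]
```

The colliding keys end up at 3, 4, 0 and 5 rather than in one contiguous run, which is how the growing step size breaks up clusters.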
Rehashing
• This involves using a second hash function to obtain the second position.
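One common way to realize this idea, often called double hashing, uses the second hash function as the step between probes. The sketch below is an assumption in that sense: the lecture does not say how the second function is applied, and the function name double_hash_insert and the particular second function 1 + key mod (size - 1) are hypothetical choices. The table size is assumed to be prime so that every step size eventually visits all locations.

```python
def double_hash_insert(table, key):
    """Probe from the first hash address in steps given by a second hash function."""
    size = len(table)            # assumed to be prime
    pos = key % size             # first hash function
    step = 1 + key % (size - 1)  # second hash function (assumed); never zero
    for _ in range(size):
        if table[pos] is None:
            table[pos] = key
            return pos
        pos = (pos + step) % size
    return None  # table is full


table = [None] * 7
for key in (10, 17, 24):  # all three keys hash to address 3
    double_hash_insert(table, key)
print(table)  # [None, None, 17, 10, 24, None, None]
```

Keys that collide at the same first address generally get different steps (6 for 17, 1 for 24 here), so they do not follow each other along the same path.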
Random Probing
• This involves using a pseudorandom number generator to obtain the increment. Note that the generator must always produce the same sequence when given the same seed, so that retrieval retraces the probes made at insertion.
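Random probing can be sketched with Python's random.Random, seeded by the key so that the same key always yields the same probe order. This is a sketch under assumptions: the function names are hypothetical, and instead of drawing increments one at a time it shuffles the offsets 1 to size - 1, which guarantees that every location is visited exactly once.

```python
import random


def probe_sequence(key, size):
    """Yield each location once, in a pseudorandom order determined by the key."""
    start = key % size
    offsets = list(range(1, size))
    random.Random(key).shuffle(offsets)  # same seed (the key) gives the same order
    yield start
    for offset in offsets:
        yield (start + offset) % size


def insert(table, key):
    for pos in probe_sequence(key, len(table)):
        if table[pos] is None:
            table[pos] = key
            return pos
    raise OverflowError("table is full")


def contains(table, key):
    for pos in probe_sequence(key, len(table)):
        if table[pos] is None:
            return False  # empty location reached: key is absent
        if table[pos] == key:
            return True
    return False


table = [None] * 7
for key in (10, 17, 24):
    insert(table, key)
print(contains(table, 24), contains(table, 31))  # True False
```

Because the search re-creates the generator from the same seed, it visits the locations in the order the insertion did and can stop at the first empty one.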