Hashing Analysis

Closed Hashing Analysis

1. Compute the average number of collisions encountered when inserting a key into a hash table, H, of size N containing m keys.
We will analyze the case where we have an unlimited number of very good hash functions, h1, h2, ...
We first use h1 to attempt to insert k into H.
If there is a collision then we use h2 to attempt to insert k into H, ...
Conceivably, there could be an indeterminate number of collisions using this method. A common method that is used to avoid this problem is to use h2 for double hashing to create a probe sequence that will specify offsets from the slot that caused the original collision so that every location in the array is probed without repetition. The performance of double hashing is generally not quite as good (although more practical) than the multiple hash function technique we are analyzing here and is more difficult to analyze. So, for clarity of presentation, we will analyze the multiple hash function approach.

Let d = m/N represent the load factor (density) of H.
The probability of a key, k, being inserted with no collisions is 1-d which is the percentage of opens slots in the table.
The probability of a key, k, being inserted with exactly one collision is (1-d)*d since d is the probability of a collision on the first probe and 1-d is the probability of no collision on the second probe.
...
The probability of a key, k, being inserted with exactly r collisions is (1-d)*dr since the probability of r consecutive collisions is dr and 1-d is the probability of suceeding on the r+1st probe.

If EC = number of expected collisions when inserting key k then
EC = 1*(1-d)*d + 2*(1-d)*d2 + ... = (1-d)*∑(i*di) (1 <= i <= infinity)

A closed form expression for the sum may be obtained using the following technique.

S = 1*d + 2*d2 + ... + i*di + ...
d*S = 1*d2 + 2*d3 + ... + + (i-1)*di + i*di+1 + ...
Matching terms in the S series with terms in the d*S series that have the same power of d and subtracting we obtain the following.

S - d*S = d + d2 + d3 + ... di + ...
The sum on the right is an infinite geometric series equal to d/(1-d)
Factoring we obtain S*(1-d) = d/(1-d) and
S = d/(1-d)2 and so
EC = (1-d)*d/(1-d)2 = d/(1-d)

2. Compute the expected number of probes requred to insert a key into a hash table, H, of size N containing m keys.

After a key insertion has experienced c collisions a final, collision-free probe is required to place the key. Therefore the expected number of probes, EP, equals the expected number of collisions plus one.
I.e., EP = EC + 1 = 1 + d/(1-d) = 1/(1-d).

3. Compute the average number of lookups required to find a key in hash table, H, of size N when H contains m keys.
For the purpose of this analysis we will assume that all keys were inserted before any lookups were requested and no keys are inserted after lookups have begun.
If key k were inserted on the first probe then it is located where hash function h1 placed it and the number of lookups required to find k is 1.
If key k were inserted after i-1 collisions then it is located where hash function hi placed it and the number of lookups required to find k is i.
Therefore the expected number of lookups, EL, required to find a key, k, is equal to the number of probes required to insert the key so the results above solve this problem.
I.e., EL = 1/(1-d)

4. If we have only a small number of hash functions, say j, available we may wish to compute the expected number of probes required when the number of collisions is j or fewer.
If EP(j) represents this quantity then
EP(j) = 1 + 1*(1-d)*d + 2*(1-d)*d2 + ... + j*(1-d)*dj = 1 + (1-d)*∑(i*di) (1 <= i <= j)

The same method employed above will solve this summation.
S = 1*d + 2*d2 + ... + j*dj
d*S = 1*d2 + 2*d3 + ... + (j-1)*dj + j*dj+1
Again, matching terms in the S series with terms in the d*S series that have the same power of d and subtracting we obtain the following.

S - d*S = d + d2 + d3 + ... + dj - j*dj+1
The sum on the right without the last term is a finite geometric series equal to (d-dj+1)/(1-d)
Factoring we obtain S*(1-d) = (d-dj+1)/(1-d) - j*dj+1 and
S = (d-dj+1)/(1-d)2 - j*dj+1/(1-d) and so
EP(j) = 1 + (1-d)*S = 1 + (d-dj+1)/(1-d) - j*dj+1 = (1 - dj+1)/(1-d) - j*dj+1

Obviously, if we are limited to j collisions then some keys may not find an open slot in the hash table. We may then use another form of closed hashing collision resolution such as linear probing or we may store these keys outside of the hash table. Storing keys outside the hash table is called open hashing and is addressed in the next section.


Open Hashing Analysis

Compute the average number of keys in a hash table, H, after we attempt to insert m keys into the hash table. Recall that in open hashing keys involved in attempted insertions that result in collisions are stored external to the hash table.

We shall assume that h is an excellent hashing function. I.e., h distributes the keys uniformly (as a random number generator would do).

Let HT be a hash table of size N

Let Ki be the expected number of items in HT after i attempted insertions.

The expected number of elements in the hash table after m-1 attempted insertions using hash function h is Km-1. Therefore the expected number of open slots at this time is N-Km-1.
The probability of inserting the mth key (i.e. no collision on the insertion of the mth key) is, then, (N-Km-1)/N.
Therefore Km = Km-1 + (N-Km-1)/N

Algebraic manipulation gives Km = 1 + ((N-1)/N)*Km-1

Hence, Km = ∑((N-1)/N)i where i ranges from 0 to m-1.

Applying the formula for the sum of a finite geometric series we have
Km = N*(1 - ((N-1)/N)m) = N - N*((N-1)/N)m

Let Cm equal the expected number of collisions after m attempted insertions into HT using h.
Since Km + Cm = m we have
Cm = m - (N - ( N*((N-1)/N)m) = m - N + N*((N-1)/N)m
Cm = m - N - N×(N-1
-----
N
)m = m - N + N*(N-1)/Nm


An example.
Consider the file ShortList.txt which is a text file containing English language words. ShortList.txt contains 36721 unique words.

If we hash these words into a hash table containing 73342 slots then N = 73342 and m=36721. Note that N = 2*m.
After attempting to insert all of the words into the table we should expect the table to contain approximately
N*(1 - ((N-1)/N)m-1) = 28888 words. This implies that the remaining 7833 words are expected to be involved in collisions and stored outside of the hash table.



Top of this page   Top of page      Home page   Home page