Thiago P.

Poly Hashing Strings

Posted in Computer Science, Programming, Snips

← back to the blog


Premise

After learning a bit on hashing I decided to program an experiment to see how table size, input size, and constant value, would affect the collisions generated by polynomial hashing of a string.

 

I obtained a list of 274918 common English words and put each one through a polynomial string hashing function to obtain an int hash for that word, and then inserted each one in an array (hash table) of fixed size, using modulo to resolve hash values greater than the array bounds.

I repeated this step for 25 different constant values to be used in the polynomial hashing function and recorded the number of collisions that occurred. Note that each constant value is a prime number.

NOTE:
The hash table does not resize.
The hash table uses separate chaining (but not with linked lists) to resolve collisions.

Observations

Overall, everything went as expected. If the size of the hash table (array) was less than or equal to the amount of words (input), the number of collisions would be large, around 45% of the amount of words.


 If the size of the table was larger than the word count, then the amount of collisions would obviously go down. This effect would continue, all the way up to table size values of 500000, 800000, and even 5000000. This is expected as polynomials are capable of generating large values and a larger table size would result in less use of the modulo function to resolve collisions.

At a table size of 5000000, the collision percentage could get as low as 3.6 percent.


One thing I noticed across all table sizes, is that the amount of collisions does not linearly correspond to the constant number used. In other words, the collisions did not increase as the constant increased and vice versa.

Constant prime numbers 11 through 31 the best results. This may be because the table sizes I chose were clean even numbers instead of odd prime numbers as is recommended for table sizes to be.

Code

Here is a link to the git repo Polyhash Experiments

(Is it wrong to waste git repo names on experiment projects like this? Probably not)