Thiago P.

Hashing

Posted in Computer Science

← back to the blog


Basics

Definitions

 

Hash Function Composition

 

Some Hashing Methods

Mem Addr

Return the objects address in memory and use that as the hash value. Good for complex objects with many properties. Bad for primitive objects like strings because 2 strings that have the same value will have different hashes.

Cast to int

Be carful, some values cast to an int may be a number greater than the default int size (32 bits).

Sum components

Split the value into groups of bits, let’s say 16 bits, and sum those groups together. Good for keys/input values that are numeric. You can ignore the overflow of the addition if the result is greater than the default int size (32 bits).

EXAMPLE: A number of type Long is represented in 8 bytes and we want to create a hash value for that long that is a 4 byte long int. Split the long down the middle into 2 4 byte values and add the two 4 byte halves together.

Sum char codes

Take the ASCII char codes of each value in a string and add them together. This is bad because different strings that have the same size and same letters will have the same sum. (this method does not take the position of the char into account)

Polynomial

Create a polynomial from the char codes of the chars in the strings.

EXAMPLE BELOW

The value a should be some prime number.

An algorithm called Horner’s Method can be used to more efficiently calculate the polynomial.

Don't forget to use a compression function to convert the output result into a value that fits inside the hash table/array.

Resolving Collisions

Open Addressing

If a collision occurs, find the next available, unused, spot in the table, for example if A[2] is occupied, try A[3], and so on. Using open addressing reduces the memory usage but there arises complications when performing operations on the hash table such as remove.

The process of finding an available spot in the table is called probing, and using the next available index in the table (i+1) is called linear probing.

One can also use quadratic probing, or use another hash function to probe with.

Uses less memory

Seperate Chaining

Where each index in the table is actually a bucket/array. If collisions are stored in those buckets/arrays. Ideally, a good hash function will minimize collisions so that each bucket stores 1 or zero elements.

The load factor of such a hash table is how full the table can get before it must be resized, usually doubled. For example, a table holding 16 buckets with a load factor of 0.75, will be resized when the table reaches 12 entries.

In other words, a load factor of 0.75 means a table will resize when it is 75% full. This resizing prevents any one bucket from getting to large and thus increasing the time to search and delete elements and to insert elements (if you are implementing the buckets as linked lists rather than arrays).

Higher load factors decrease space cost but increase time cost, and vice versa. 0.75 seems to be an agreed upon industry standard