String hashing


Reading time: 20 minutes | Coding time: 5 minutes

Hashing is an important technique which converts any object into an integer of a given range. Hashing is the key idea behind Hash Maps which provides searching in any dataset in O(1) time complexity. Hashing is widely used in a variety of problems as we can map any data to integer upon which we can do arithmetic operations or use it as an index for data structures. We will take a look at the techniques to hash a string that is to convert a string to an integer.

Ideally, an hashing technique has the following properties:

  • If S is the object and H is the hash function, then hash of S is denoted by H(S).
  • If there are two distinct objects S1 and S2, ideally, H(S1) should not be equal to H(S2).
  • In some cases, H(S1) can be equal to H(S2) which we call collision and can be minimized and taken care of as well
  • If there is a range of hash function H as 0 to M, then H(S) = H(S) mod M.

You can consider S as a string for our case.

One of the most common applications of hashing strings is to compare it. Comparing strings of length N takes O(N) time complexity but comparing integers take O(1) time complexity. Hence, comparing hash of strings take O(1) time complexity.

Hash of a string

For a string of length N, a strong hash function is defined as:

$$ H(S) = ( S[0] + S[1] * P + S[2] * P^{2} + ... + S[N-1] * P^{N-1} ) mod M $$

$$ H(S) = ( \sum_{i=0}^{N-1} S[i] * P^i ) mod M $$

where:

  • P is a prime number (say 3)
  • S[i] is the ascii value of the character at index i of S (String)
  • M is the upper limit/ range of the hash function

Why this hash function minimizes collision?

On careful inspection, you will notice that we are minimizes the number of factors involves in the equation and hence, trying to map each string to a unique prime number each time. As there are factors present in the ascii value of characters and some arise due to addition, we utilize more integers and minimize the collision greatly.

This hash function is commonly referred to as polynomial rolling hash function.

The probability that there will be a collision is equal to 1/M. Hence, M should be a large number.

Reduce collision further

To reduce collision further, we can compute two hash values using two different values of P (like 3 and 5) and use both for comparision.

Pseudocode

The pseudocode is as follows:

long hash(string s) {
    const int p = 3;
    const int m = 1000000009;
    long hash = 0;
    long pow = 1;
    for (for every character c in s) {
        hash = (hash + (c - 'a' + 1) * pow) % M;
        pow = (pow * p) % M;
    }
    return hash;
}

Complexity

The complexity of the above algorithm is:

Time complexity: O(N) where N is the length of the string

Space complexity: O(1)

The power of the prime factor can be precomputed to give a boost.

Implementations

#include<iostream> 
#include<string>
using namespace std; 

long long compute_hash(string s, int p) {
    const int m = 1e9 + 9;
    long long hash = 0;
    long long pow = 1;
    for (char c : s) {
        hash = (hash + (c - 'a' + 1) * pow) % m;
        pow = (pow * p) % m;
    }
    return hash;
}

int main() {
	cout << "Hash of opengenus is: " << compute_hash("opengenus", 3);
	return 0;
}

Output:

Hash of opengenus is: 183060

Applications

There are several benefits of being able to compute hash of strings. Some of them are:

  • Comparing two strings in O(1) time complexity
  • Rabin-Karp algorithm for pattern matching in a string can be done in O(N)
  • Calculating the number of different substrings of a string in O(N^2 * log N)
  • Calculating the number of palindromic substrings in a string