Rolling Hash


Reading time: 25 minutes | Coding time: 5 minutes

Hashing is used for efficient comparison of strings by converting them into integers and then comparing those strings on the basis of their integer values. Rolling hash is used to prevent rehashing the whole string while calculating hash values of the substrings of a given string. In rolling hash,the new hash value is rapidly calculated given only the old hash value.Using it, two strings can be compared in constant time.

Example

Consider the string abcd and we have to find the hash values of substrings of this string having length 3 ,i.e.,abc and bcd.
For simplicity let us take 5 as the base but in actual scenarios we should mod it with a large prime number to avoid overflow.The highest power of base is calculated as (len-1) where len is length of substring.

So,the hash value of first substring,H(abc) :-

H(abc) => a*(5^2) + b*(5^1) + c*(5^0) 
= 97*25 + 98*5 + 99*1 = 3014

And the hash value of the second substring,H(bcd) :-

H(bcd) => b*(5^2) + c*(5^1) + d*(5^0) 
= 98*25 + 99*5 + 100*1 = 3045

Rolling hash works on the priciple that while finding the hash value of the next substring, the window is moved forward only by dropping one character and adding another.In this case character 'a' is removed from substring abc and character 'd' is added next to it to get the substring bcd.

So, we do not need to rehash the string again.Instead, we can substract the hash code corresponding to the first character from the first hash value,multiply the result by the considered prime number and add the hash code corresponding to the next character to it.

For instance in this case,

H(bcd)=(H(abc)-a*(5^2))*5 + d*(5^0)=(3014-97*25)*5 + 100*1 = 3045

Thus we can find the hash value of next substring in O(1).

Formula

In general,the hash H can be defined as:-

H=( c1ak-1 + c2ak-2 + c3ak-3. . . . + cka0 ) % m
where a is a constant, c1,c2, ... ck are the input characters and m is a large prime number, since the probability of two random strings colliding is about ≈ 1/m.

Then, the hash value of next substring,Hnxt using rolling hash can be defined as:-

Hnxt=( ( H - c1ak-1 ) * a + ck+1a0 ) % m

Pseudo Code

To find the hash of a substring,we find the hash of previous substring and calculate the hash of next substring using the formula:

Hnxt=( ( H - c1ak-1 ) * a + ck+1a0 ) % m defined above.

// computes the hash value of the input string s
long long compute_hash(string s) {
    const int p = 31;   // base 
    const int m = 1e9 + 9; // large prime number
    long long hash_value = 0;
    long long p_pow = 1;
    for (char c : s) {
        hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
        p_pow = (p_pow * p) % m;  
    }
    return hash_value;
}
// finds the hash value of next substring given nxt as the ending character 
// and the previous substring prev 
long long rolling_hash(string prev,char nxt)
{
   const int p = 31;
   const int m = 1e9 + 9;
   long long H=compute_hash(prev);
   long long Hnxt=( ( H - pow(prev[0],prev.length()-1) ) * p + (int)nxt ) % m;
   return Hnxt;
}

Complexity

The rolling hash takes O(n) time compexity to find hash values of all substrings of a length k of a string of length n.Computing hash value for the first substring will take O(k) as we have to visit every character of the substring and then for each of the n-k characters we can find the hash in O(1) so the total time complexity would be O(k+n-k) i.e. O(n).Once all the hash values are found,substrings can be compared in O(1).

Applications

The various applications of Rolling Hash algorithm are:

  • Rabin-Karp algorithm for pattern matching in a string in O(n) time
  • Calculating the number of different substrings of a string in O(n2logn)
  • Calculating the number of palindromic substrings in a string.

With this, you have the complete knowledge of Rolling Hash technique. Enjoy and ace competitive programming.