Z Algorithm function

The Z-function for a string S of length N is an array of length N where the i th element is equal to the greatest number of characters starting from the position i that coincide with the first characters of S.

In other words, z[i] is the length of the longest common prefix between S and the suffix of S starting at i. We assume 0-based indexes; that is, the first character of S has index 0 and the last one has index N-1.

The first element of Z-functions, z[0], is generally not well-defined. In this article we will assume it is zero.

z [ 0 ] = 0

This article presents an algorithm for calculating the Z-function in O(N) time, as well as various of its applications.


Examples


For example, here are the values of the Z-function computed for different strings:

s = 'aaaaa'

z[0] z[1] z[2] z[3] z[4]
0 4 3 2 1

s = 'aaabaab'

z[0] z[1] z[2] z[3] z[4] z[5] z[6]
0 2 1 0 2 1 0

s = 'abacaba'

z[0] z[1] z[2] z[3] z[4] z[5] z[6]
0 0 1 0 3 0 1

Trivial algorithm


The formal definition can be represented in the following elementary implementation.


vector<int> z_function_trivial(string s) 
{
    int n = (int) s.length();
    vector<int> z(n);
    for (int i = 1; i < n; ++i)
        while (i + z[i] < n && s[z[i]] == s[i + z[i]])
            ++z[i];
    return z;
}

We just iterate through every position and update for each one of them, starting from and incrementing it as long as we do not find a mismatch (and as long as we do not reach the end of the line).

Of course, this is not an efficient implementation. We will now show the construction of an efficient implementation.


Efficient algorithm


To obtain an efficient algorithm, we will compute the values of z[i] in turn from i = 1 to N-1 but at the same time, when computing a new value, we will try to make the best use possible of the previously computed values.

Let us call segment matches those substrings that coincide with a prefix of S. For example, the value of the desired Z-function z[i] is the length of the segment match starting at position i (and that ends at position i + z[i] - 1).

To do this, we will keep the L, R indices of the rightmost segment match. Among all detected segments, we will keep the one that ends rightmost. In a way, the index can be seen as the "boundary" to which our string S has been scanned by the algorithm; everything beyond that point is not yet known.

Then, if the current index (for which we have to compute the next value of the Z-function) is i, we have one of two options:

  • L > R : the current position is outside of what we have already processed.

We will then compute z[i] with the trivial algorithm (that is, just comparing values one by one). Note that in the end, if z[i] > 0, we will have to update the indices of the rightmost segment, because it is guaranteed that the new r = i + z[i] - 1 is better than the previous r.

  • L <= R :The current position is inside the current segment match.

Then we can use the already calculated Z-values to "initialize" the value of z[i] to something (it sure is better than "starting from zero"), maybe even some big number.

For this, we observe that the substrings S[L...R] and S[0...R-L] match. This means that as an initial approximation for z[i] we can take the value already computed for the corresponding segment s[0...R-L], and that is z[i-L].

However, the value z[i-L] could be too large: when applied to position i it could exceed the index r. This is not allowed because we know nothing about the characters to the right of r.

Here is an example of a similar scenario:

s = 'aaaabaa'

When we get to the last position (i = 6), the current match segment will be [5:6]. Position 6 will then match position 6-5=1, for which the value of the Z-function is z[1]=3. Obviously, we cannot initialize z[6] to 3, it would be completely incorrect. The maximum value we could initialize it to is 1 because it is the largest value that does not bring us beyond the index r of the match segment [L:R].

Thus, as an initial approximation for z[i] we can take:

$$ Z_O[i] = minimum (R-i+1, z[i-L]) $$

After having $z[i]$ initialized to $z_o[i]$, we try to increment $z[i]$ by running the trivial algorithm because in general, after the border r, we cannot know if the segment will continue to match or not.

Thus, the whole algorithm is split in two cases, which differ only in the initial value of $z[i]$: in the first case it is assumed to be zero, in the second case it is determined by the previously computed values (using the above formula). After that, both branches of this algorithm can be reduced to the implementation of the trivial algorithm, which starts immediately after we specify the initial value.

The algorithm turns out to be very simple. Despite the fact that on each iteration the trivial algorithm is run, we have made significant progress, having an algorithm that runs in linear time. Later on we will prove that the running time is linear.


Implementations


Implementations of Z function in Java and C++:

  • Java
  • C++

Java


   // returns array z[] where z[i] is z-function of s[i]
int[] zFucntion(String s) {
    int n = s.length();
    int z[] = new int[n];
    int R = 0;
    int L = 0;
    for(int i = 1; i < n; i++) {
        z[i] = 0;
        if (R > i) {
            z[i] = Math.min(R - i, z[i - L]);
        }
        while (i + z[i] < n && s.charAt(i+z[i]) == s.charAt(z[i])) {
            z[i]++;
        }
        if (i + z[i] > R) {
            L = i;
            R = i + z[i];
        }
    }
    z[0] = n;
    return z;
}
   

C++


   vector<int> z_function(string s) {
    int n = (int) s.length();
    vector<int> z(n);
    for (int i = 1, l = 0, r = 0; i < n; ++i) {
        if (i <= r)
            z[i] = min (r - i + 1, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]])
            ++z[i];
        if (i + z[i] - 1 > r)
            l = i, r = i + z[i] - 1;
    }
    return z;
}

Complexity


  • Worst case time complexity: Θ(N)
  • Average case time complexity: Θ(N)
  • Best case time complexity: Θ(N)
  • Space complexity: Θ(log N)

Applications


Applications of Z algorithms are as follows:

  • Finding all occurrences of the pattern P inside the text T in O(length(T) + length(P))

  • Counting the number of distinct substrings of a string S in O(1)

  • Finding a string T of shortest length such that S can be represented as a concatenation of one or more copies of T