Suffix Array

Reading time: 25 minutes | Coding time: 10 minutes

A substring of a String S is defined as another String S1 that occurs "in" S. For example, "World" is a substring of the string "OpenGenus is the World".

A suffix is a special case of a substring. A suffix of String S is a substring that occurs in the end of S.

Formally, consider a string S of length N. The ith suffix of is defined as the substring:

$$ S\ [\ i\ ...\ n-1\ ]\ ,\ where\ i\ =\ 0\ to\ n-1\ $$

A Suffix Array contains integers that represent the starting indices of all the suffixes of a given string, after the aforementioned suffixes have been sorted.


Example


Let S = abaab. All suffixes are as follows:

Index Suffix
0 abaab
1 baab
2 aab
3 ab
4 b

After sorting these suffixes alphabetically, we get:

Index Suffix
2 aab
3 ab
0 abaab
4 b
1 baab

Suffix Array for S will be (2, 3, 0, 4, 1).


Algorithm


O(N^2 * log N) approach


The most common way to construct a Suffix Array is to get all the suffix strings and sort them using a sorting algorithm such as Quicksort or Mergesort, while simultaneously retaining their original indices.

Sorting typically merge sort has a complexity of O(N log N). This approach assumes that the elements being sorted can be compared in O(1) or constant time. While this is true for integers, in the case of strings, the comparison operation involved in sorting can end up taking O(N) in the worst case. Due to this, the worst case complexity of sorting suffix strings becomes O(N^2 log N).


O(N * log N * log N) approach

We can reduce this complexity by reducing the time taken by the comparison operation from O(N) to O(1) using the fact that the given suffix strings are not random strings; they are part of single string. Each suffix string has something in common with the other.

Step 1 - Sort the suffixes on the basis of their first character and assign them rank. If the first character of two suffixes are the same, they have the same rank.

Index Suffix
0 a|baab
0 a|ab
0 a|b
1 b|aab
1 b|

Step 2

Consider two characters from each string for sorting.

Now double the characters to take from each for sorting, i.e. 2.
When we take a string of two chars we can have two parts; first containing 1 char, other containing 1.

Let's compare abaab with baab, based on the first part, the first character; we can say that abaab will always be ranked above baab, so skip further comparison.

Now compare abaab with aab based on their first part; both have same rank. Now we will compare their second part. The second part of abaab is only b, and for aab be is a for these we already know their ranks; for b (i.e. whole baab) is 1 and a (i.e. whole ab) is 0.

Hence abaab will be ranked above baab.

For the strings not having a second part we will rank their second part the highest, i.e. -1. For example, b is not having 2nd character so its rank tuple will be (1, -1).

  1. aa | b
  2. ab | aab
  3. ab |
  4. b |
  5. ba | ab

In the next iteration, we sort 4-characters strings. This involves a lot of comparisons between different 4-characters strings.

How do we compare two 4-characters strings? Well, we could compare them character by character. That would be up to 4 operations per comparison.

Instead, we compare them by looking up the ranks of the two characters contained in them, using the rank table generated in the previous steps.

That rank represents the lexicographic rank from the previous 2-charater sort, so if any given 4-character string has a higher rank than another 4-character string, then it must be lexicographically greater somewhere in the first two characters.

Hence, if two 4-characters string's rank is identical, they must be identical in the first two characters.

In other words, two look-ups in the rank table are sufficient to compare all 4 characters of the two 4-characters strings.

Similarly, we can compare 8, 16, 32 strings in at most two integer comparisons that is in O(1) time complexity.

Hence, using this approach the comparison of two strings take place in O(log N) time complexity.

Therefore, using this approach, the entire time complexity is O(N * log N * log N).


Implementation


The implementation of Suffix Array is as follows:

  • C++

C++


#include bits/stdc++.h  
using namespace std;
// suffixRank is table hold the rank of each string on each iteration  
// suffixRank[i][j] denotes rank of jth suffix at ith iteration  
int suffixRank[20][int(1E6)];
// Example "abaab"  
// Suffix Array for this (2, 3, 0, 4, 1)  
// Create a tuple to store rank for each suffix  
struct myTuple {  
    int originalIndex;   // stores original index of suffix  
    int firstHalf;       // store rank for first half of suffix  
    int secondHalf;      // store rank for second half of suffix  
};
// function to compare two suffix in O(1)  
// first it checks whether first half chars of 'a' are equal to first half chars of b  
// if they compare second half  
// else compare decide on rank of first half  
int cmp(myTuple a, myTuple b) {  
    if(a.firstHalf == b.firstHalf) return a.secondHalf < b.secondHalf;  
    else return a.firstHalf < b.firstHalf;  
}
int main() {
    // Take input string
    // initialize size of string as N
    string s; cin >> s;
    int N = s.size();
    // Initialize suffix ranking on the basis of only single character
    // for single character ranks will be 'a' = 0, 'b' = 1, 'c' = 2 ... 'z' = 25
    for(int i = 0; i < N; ++i)
        suffixRank[0][i] = s[i] - 'a';
    // Create a tuple array for each suffix
    myTuple L[N];
    // Iterate log(n) times i.e. till when all the suffixes are sorted
    // 'stp' keeps the track of number of iteration
    // 'cnt' store length of suffix which is going to be compared
    // On each iteration we initialize tuple for each suffix array
    // with values computed from previous iteration
    for(int cnt = 1, stp = 1; cnt < N; cnt *= 2, ++stp) {
        for(int i = 0; i < N; ++i) {
            L[i].firstHalf = suffixRank[stp - 1][i];
            L[i].secondHalf = i + cnt < N ? suffixRank[stp - 1][i + cnt] : -1;
            L[i].originalIndex = i;
        }
        // On the basis of tuples obtained sort the tuple array
        sort(L, L + N, cmp);
        // Initialize rank for rank 0 suffix after sorting to its original index
        // in suffixRank array
        suffixRank[stp][L[0].originalIndex] = 0;
        for(int i = 1, currRank = 0; i < N; ++i) {
            // compare ith ranked suffix ( after sorting ) to (i - 1)th ranked suffix
            // if they are equal till now assign same rank to ith as that of (i - 1)th
            // else rank for ith will be currRank ( i.e. rank of (i - 1)th ) plus 1, i.e ( currRank + 1 )
            if(L[i - 1].firstHalf != L[i].firstHalf || L[i - 1].secondHalf != L[i].secondHalf)
                ++currRank;
            suffixRank[stp][L[i].originalIndex] = currRank;
        }
    }
    // Print suffix array
    for(int i = 0; i < N; ++i) cout << L[i].originalIndex << endl;
    return 0;
} 
   

Applications


The applications of Suffix Array are as follows:

  • The suffix array of a string can be used as an index to quickly locate every occurrence of a substring pattern P within the string S

  • There are several other applications of Suffix Array in conjunction with Suffix Tree and LCP array.


Questions


What is the Suffix Array of the String "ABCAB"?

(3, 0, 4, 1, 2)
(0, 1, 2, 3, 4)
(4, 3, 2, 1, 0)
(2, 4, 1, 0, 3)