Shortest Superstring problem

Free Linux Book

Get FREE domain for 1st year and build your brand new site

Hey Guys 🎉, in this article, we have explained three approaches to solve the Shortest Superstring problem. This involve concepts like DFS and Dynamic Programming.

Table of contents:

  1. Problem Statement
  2. Approach 1: Brute-Force (Back-Tracking)
  3. Approach 2: Search + Pre-Compiling
  4. Approach 3: Dynamic Programming

This is similar to Leetcode problem 943. Find the Shortest Superstring.

Problem Statement

Let's discuss a hard yet very intresting problem named "Shortest SuperString Problem".

The problem statement 🤔 is as follows:-

"Given an array of strings words, return the smallest string that contains each string in words as a substring. If there are multiple valid strings of the smallest length, return any of them.

You may assume that no string in words is a substring of another string in words."

The shortest superstring problem takes as input, several strings of different lengths and finds the shortest string that contains all the input strings as substrings.

There are many ways to solve this problem. We can use backtracking, dynamic programming with bitmasks, and even graphs.
Let's discuss each approach.

Example 1:

Input: words = ["alex","loves","leetcode"]
Output: "alexlovesleetcode"
Explanation: All permutations of "alex","loves","leetcode" would also be accepted.

Example 2:

Input: words = ["catg","ctaagt","gcta","ttca","atgcatc"]
Output: "gctaagttcatgcatc"


Let's think of an easy Brute-Force solution which involves Recursion 😈 + BackTracking 😈 + Greedy Thinking 🤑.

Approach 1: Brute-Force (Back-Tracking)


Algorithm:-

  1. Our First Approach is Backtracking.
  2. Here since we want to find the shortest superstring everytime we make sample vector of strings and compress it.
  3. Since we are solving this question recursively hence we make a function void and want to cover all the strings hence in each traversal we push the string inside the sample string, mark that index visited and call the recursive function again with index + 1.
  4. After returning from that recursion function, we just backtrack and pop a string from the vector of strings and also mark unvisited.
  5. At the moment when the size of sample vector of string reaches the given vector of string, we try to compress the strings.
  6. For compressing we travesre through all the strings in the vector of strings and find if the postfix of the current string is common with the prefix of the next string.
  7. For each iteration in this traversal we test for overlap for all the lengths.
  8. Greedily we find the maximum overlap and in the end return the final answer string.

Code:-

   // M: Number of strigs
   // L :  combined length of all strings
   string strCompress(vector<string>& svec) {
       string current = svec[0];
       // Find if the postfix of current string is common with prefix of next string
       for (int i = 0; i < svec.size() - 1; i++) {
           int clen = current.length();
           string next = svec[i + 1];            
           int j = 1;
           int max_overlap = 0;         
           // Test for an overlap for all lengths starting 1
           while (j <= min(current.length(), next.length())) {
               string postfix = next.substr(0, j); 
               string prefix = current.substr(clen - j, j);          
               if (postfix.compare(prefix) != 0) {
                   j++;
                   continue;
               }               
               max_overlap = max(max_overlap, j);
               j++;
           }           
           if (max_overlap == 0) {
               current += next;
               continue;
           }        
           current += next.substr(max_overlap);
       }       
       return current;
   }    

   
   void shortestSuperstringRec(vector<string>& A, string& result, vector<string>& svec, int idx, vector<bool>& visited) {
       if (svec.size() == A.size()) {
           string s = strCompress(svec);
           if (result.length() == 0 || s.length() < result.length())
               result = s;           
           return;
       }
       for (int i = 0; i < A.size(); i++) {        
           if (visited[i] == true)
               continue;          
           svec.push_back(A[i]);
           visited[i] = true;
           shortestSuperstringRec(A, result, svec, idx + 1, visited);
           // Backtrack
           svec.pop_back();
           visited[i] = false;
       }
   }  
   string shortestSuperstring(vector<string>& A) {
       string result;
       int len = 0;
       vector<string> svec;
       vector<bool> visited(A.size(), false);       
       if (A.size() == 1)
           result = A[0];      
       if (A.size() < 2)
           return result;        
       for (int i = 0; i < A.size(); i++)
           len += A[i].length();       
       shortestSuperstringRec(A, result, svec, 0, visited);
       return result;
   }

Time Complexity:-

  • The time complexity of this code would be O(M! x ML) 😨, where, M is the number of given strings ans L is combined length of all the strings.
  • Since this a brute force approach, we would get Time Limit Exceeded for some of the test cases.

To avoid TLE, let's discuss another approach 😉.


Approach 2: Search + Pre-Compiling


Algorithm:-

  1. Try all permutations.
  2. Pre-process the cost from word[i] to word[j] and store it in g[i][j].
  3. In simple words with the help of recursion(dfs) and pre-compilation, we are just trying to reducing the time complexity of the code.
  4. In the code, with the help of dfs function, we are able to try all the permutations recursively which means we dont have to make and check each and every permutation on our own but the compiler will do that for us.
  5. Since we are pre-compiling (storing the recursive results before hand rather than doing the same recursion again and again), our space complexity will be increased a little bit but time complexity will decrease to a great extent.
  6. We also used some bit manupulation functions like (used & (1<<i)) tells the ith bit of used is set or not. Hence a little bit masking is also used to solve this problem.

Code:-

  vector<vector<int>> g_;
  vector<int> best_path_;
  int best_len_;
  void dfs(const vector<string>& A, int d, int used, int cur_len, vector<int>& path) {
    if (cur_len >= best_len_) return;
    if (d == A.size()) {
      best_len_ = cur_len;
      best_path_ = path;
      return;
    }
    
    for (int i = 0; i < A.size(); ++i) {
      if (used & (1 << i)) continue;      
      path[d] = i;
      dfs(A,d + 1,used | (1 << i), d == 0 ? A[i].length(): cur_len + g_[path[d - 1]][i],path);
    }
  }
  string shortestSuperstring(vector<string>& A) {    
    const int n = A.size();
    g_ = vector<vector<int>>(n, vector<int>(n));
    for (int i = 0; i < n; ++i)
      for (int j = 0; j < n; ++j) {
        g_[i][j] = A[j].length();
        for (int k = 1; k <= min(A[i].length(), A[j].length()); ++k)
          if (A[i].substr(A[i].size() - k) == A[j].substr(0, k))            
            g_[i][j] = A[j].length() - k;
      }
    vector<int> path(n);
    best_len_ = INT_MAX;
    dfs(A, 0, 0, 0, path);    
    string ans = A[best_path_[0]];
    for (int k = 1; k < best_path_.size(); ++k) {
      int i = best_path_[k - 1];
      int j = best_path_[k];
      ans += A[j].substr(A[j].length() - g_[i][j]);
    }
    return ans;
  }

Time Complexity:-

  • The time complexity of this code would be Time complexity: O(n!), where n is the number of strings. 🎉
  • Better than previous one !!

But wait a minute 🤔.
Is there a more better or the best solution for this problem?
Let's analyse that too.

Approach 3: Dynamic Programming


  1. Let's now discuss the best approach one can think towards this problem.

  2. Say we have put some words down in our row, ending with word A[i]. Now say we put down word A[j] as the next word, where word j hasn't been put down yet. The overlap increases by overlap(A[i], A[j]).

  3. We can use dynamic programming to leverage this recursion.

  4. Let dp(mask, i) be the total overlap after putting some words down (represented by a bitmask mask), for which A[i] was the last word put down. Then, the key recursion is dp(mask ^ (1<<j), j) = max(overlap(A[i], A[j]) + dp(mask, i)), where the jth bit is not set in mask, and i ranges over all bits set in mask.

  5. Of course, this only tells us what the maximum overlap is for each set of words.

  6. We also need to remember each choice along the way (ie. the specific i that made dp(mask ^ (1<<j), j) achieve a minimum) so that we can reconstruct the answer.

  7. Lets see the implementation of this code which makes the logic more clear.

  8. Also, Don't forget to observe the comments inside the code, they make things more clear. 😊.

string shortestSuperstring(vector<string>& words) {
        int n=words.size();
        int overlap[n][n];
        memset(overlap,0,sizeof(overlap));
        for(int i=0;i<n;i++){
            for(int j=0;j<n;j++){
                if(i!=j){
                    int m=min(words[i].length(),words[j].length());
                    for(int k=m;k>=0;--k){
                        int sz=words[i].size();
                        if(words[i].substr(sz-k,k)==words[j].substr(0,k)){
                            overlap[i][j]=k;
                            break;
                        }
                    }
                }
            }
        }
        int dp[1<<n][n];
        int parent[1<<n][n];
        memset(dp,0,sizeof(dp));
        memset(parent,0,sizeof(parent));
        for(int mask=0;mask<(1<<n);mask++){
            for(int i=0;i<n;i++){
                parent[mask][i]=-1;
            }
            for(int bit=0;bit<n;bit++){
                if(((mask>>bit)&1)>0){
                    int pmask=mask^(1<<bit);
                    if(pmask==0)continue;
                    for(int i=0;i<n;i++){
                        if(((pmask>>i)&1)>0){
                            int val=dp[pmask][i]+overlap[i][bit];
                            if(val>dp[mask][bit]){
                                dp[mask][bit]=val;
                                parent[mask][bit]=i;
                            }
                        }
                    }
                }
            }
        }
        // # Answer will have length sum(len(A[i]) for i) - max(dp[-1])
        // Reconstruct answer, first as a sequence 'perm' representing
        // the indices of each word from left to right.
        int perm[n];
        bool seen[n];
        memset(perm,0,sizeof(perm));
        memset(seen,false,sizeof(seen));
        int t=0;
        int mask=(1<<n)-1;
        // p : last element of perm (last word written left to right)
        int p=0;
        for(int j=0;j<n;j++){
            if(dp[(1<<n)-1][j]>dp[(1<<n)-1][p]){
                p=j;
            }
        }
        //follow parents down backwards path that retains maximum overlap 
        while(p!=-1){
            perm[t++]=p;
            seen[p]=true;
            int p2=parent[mask][p];
            mask^=1<<p;
            p=p2;
        }
        //reverse perm
        for(int i=0;i<t/2;i++){
            swap(perm[i],perm[t-1-i]);
        }
        //fill remaining words not yet added 
        for(int i=0;i<n;i++){
            if(!seen[i]){
                perm[t++]=i;
            }
        }
        string ans=words[perm[0]];
        for(int i=1;i<n;i++){
            int ov=overlap[perm[i-1]][perm[i]];
            ans+=(words[perm[i]].substr(ov));
        }
        return ans;
    }
};

Time Complexity:-

  • The time complexity of this code would be Time complexity: O(n^2 * 2^n) 🥳🙌, where, n is the number of strings.
  • Hurray!! We have constructed the best algorithm for this problem.

Hope you liked this Article at OpenGenus 😊😉.
Thanks for Reading!!