Longest Common Substring in two strings


Reading time: 30 minutes | Coding time: 10 minutes

A substring is a contiguous sequence of characters within a string.For example,open is a substring of opengenus. Here,we have presented a dynamic programming approach to find the longest common substring in two strings in an efficient way.

Examples:

str1 = opengenus
str2 = genius
Output = gen
The longest common substring of str1(opengenus) and str2(genius) is "gen" of length 3.

str1 = carpenter
str2 = sharpener
Output = arpen
The longest common substring of str1(carpenter) and str2(sharpener) is "arpen" of length 5.

Approach

We have provided two approaches of solving the problem:-

  • Brute Force Approach O(N2 * M) time + O(1) space
  • Dynamic Programming Approach O(N * M) time + O(N + M) space

Where N and M are length of the two strings.

Brute Force Approach

The problem can be easily solved by finding all the substrings of one of the strings and then check whether the substring is present in the other string and returning the same with maximum length.

Substrings of the first string can be found using two loops in O(n2) where n is the length of first string and we can check whether it is present as a substring in second string in O(m) where m is the length of the second string.So the total complexity of the brute force approach would be O(n2 * m).

Following is the cpp implementation of the same:-

#include<bits/stdc++.h>
using namespace std;
void lcsubstr(string a,string b)
{
    string ans;
    int flg=0;
    for(int i=0;i<a.length();i++)
    {
        for(int j=i;j<a.length();j++)
        {
		  // substring of a of length (j-i+1)
            string x=a.substr(i,j-i+1);
            int t=0;
            for(int k=0;k<b.length();k++)
            {
				// if character of substring matches
				// that of other string
                if(b[k]==x[t])
                {
                    t++;
                }
				// if substring found
                else if(t==x.length())
                {
                    break;
                }
                else
                {
                    t=0;
                }
            }
            if(t==x.length())
            {
                flg=1;
				// if the length of found substring 
				// is greater than that of result 
				// update result
                if(ans.length()<x.length())
                    ans=x;
            }
        }
    }
    if(flg)
        cout << "Output="<<ans << '\n';
    else
        cout <<"No Common Substrings.\n";
  return;
}
int main()
{
	string str1,str2;
    cout << "str1=";
    cin  >> str1;
    cout << "str2=";
    cin >> str2;
    lcsubstr(str1,str2);
}
/*
Sample Input:- 
		str1=college
		str2=collage
Sample Output:-
		Output=coll
*/

Dynamic Programming Approach

Let the first string be s1 and second be s2.Suppose we are at DP state when the length of s1 is i and length of s2 is j, the result of which is stored in dp[i][j]. So, it means:

dp[i][j] = length of the longest substring in first i characters of S1 and first j characters of S2

Now if s1[i-1] == s2[j-1], then dp[i][j] = 1 + dp[i-1][j-1], that is result of current row in matrix dp[][] depends on values from previous row. Hence the required length of longest common substring can be obtained by maintaining values of two consecutive rows only, thereby reducing space requirements to O(2 * n).

dp[i][j] = 1 + dp[i-1][j-1]   if s1[i-1] == s2[j-1]

To print the longest common substring, we use variable end. When dp[i][j] is calculated, it is compared with res where res is the maximum length of the common substring.

If res is less than dp[i][j], then end is updated to i-1 to show that longest common substring ends at index i-1 in s1 and res is updated to dp[i][j]. The longest common substring then is from index end – res + 1 to index end in s1.

For instance,if we consider string doll as str1 and another string dog as str2 and loop through all characters of str1,position denoted by i, and str2,position denoted by j, to build the table dp[][].We consider the variable curr to state the current row which toggles its value to 0 and 1 at the end of every outer iteration.When i=0 we all the values of dp[][] corresponding to the first row is set 0.When i=1 and str1[i-1]!=str2[j-1], dp[curr][j] = 0.When str1[i-1] = str2[j-1] = 'd' (at j=1) , dp[cur][j] = dp[1][1] = dp[1-curr][j-1] + 1 = 0+1 =1. We store the maximum length of longest common substring (maximum value of dp[cur][j]) as res and the ending index of the substring as end. Here res=1 and end=0.

When i=2 and j=2, curr=0, str1[i-1] = 'o' and str2[j-1] = 'o'.Since str1[i-1] = str2[j-1] therefore dp[curr][j] = dp[1-curr][j-1] + 1 = dp[1][1] + 1 = 1 + 1 = 2.
Since res = 1 which is less than dp[curr][j], res = dp[curr][j] = 2 and end = i-1 = 1.

All other positions of dp is less than res.So finally the length of the largest common substring between doll and dog is 2.

To retrieve the substring iterate from position end-res+1 i.e. 1-2+1 = 0 till end i.e. 1 in str1 doll to get do as the largest common substring between dog and doll.

Pseudocode of Dynamic Programming approach:

Following is the pseudo code for dynamic programming implementation:-

longest_substring(X, Y, m, n)
  
   dp = [[0 for k in range(n+1)] for l in range(m+1)] 
      
    # To store the length of longest common substring 
    result = 0 
  
    # Following steps to build 
    # dp[m+1][n+1] in bottom up fashion 
    for i in range(m + 1): 
        for j in range(n + 1): 
            if (i == 0 or j == 0): 
                dp[i][j] = 0
            elif (X[i-1] == Y[j-1]): 
                dp[i][j] = dp[i-1][j-1] + 1
                if (result < dp[i][j]) 
                { 
                    result = dp[i][j]; 
                    r = i; 
                    c = j; 
                } 
            else: 
                dp[i][j] = 0
    while (dp[r][c] != 0)
    { 
        ans[--result] = X[r - 1]; 
        r--; 
        c--; 
    } 
    return ans 

Following is the complete cpp implementation:-

#include<bits/stdc++.h>
using namespace std;
#define f(i,n) for(int i=0;i<n;i++) 
#define fp(i,a,b) for(int i=a;i<=b;i++)
#define fr(i,a,b) for(int i=a;i>=b;i--)
string lcsubstr(string s1,string s2)
{
    int len1=s1.length(),len2=s2.length();
    int dp[2][len2+1];
    int curr=0,res=0,end=0;
    fp(i,0,len1)
    {
        fp(j,0,len2)
        {
            if(i==0 || j==0)
             dp[curr][j]=0;
            else if(s1[i-1]==s2[j-1])
            {
                dp[curr][j]=dp[1-curr][j-1] + 1;
                if(res<dp[curr][j])
                {
                    res=dp[curr][j];
                    end=i-1;
                }
            }
            else
            {
                dp[curr][j]=0;
            }
        }
        curr=1-curr;
    }
    if(res==0)
        return "0";
    string ans;
    ans=s1.substr(end-res+1,res);
    return ans;
}
int main()
{
    string s1,s2;
    cin >> s1 >> s2;
    string ans=lcsubstr(s1,s2);
   if(ans!="0")
        cout << ans << '\n';
    else
        cout << "No Common Substring\n";
    return 0;
}
/*
Sample Input:- 
		str1=college
		str2=collage
Sample Output:-
		Output=coll
*/

Time Complexity

  • Worst case time complexity: O(m*n)
  • Average case time complexity: O(m*n)
  • Best case time complexity: O(m*n)
  • Space complexity: O(2*n)

Since we are using two for loops for both the strings ,therefore the time complexity of finding the longest common substring using dynamic programming approach is O(n * m) where n and m are the lengths of the strings.Since this implemetation involves only two rows and n columns for building dp[][],therefore, the space complexity would be O(2 * n).

With this, you have the complete knowledge of Longest Common Substring in two strings using Dynamic Programming.