Longest Common Subsequence


Reading time: 30 minutes | Coding time: 10 minutes

A subsequence is any string formed by any collection of characters of the string based on their indices, like ogs is a subsequence of the string opengenus.We have presented an efficient way to find the longest common subsequence of two strings using dynamic programming.

Examples

str1 = ashutosh
str2 = amitesh
The largest common subsequence is "atsh" as it is present in both the strings.

str1 = opengenus
str2 = engineers
The largest common subsequence can either be "engns" or "enges" as the length of both the subsequences are equal i.e. 5.

Approach

We have presented two approaches to find the longest common subsequences:-

  • Naive Recursive approach in O(2n) time-complexity.
  • Dynamic Programming approach in O(n * m) time-complexity.

Recursive Approach

The naive solution for this problem is to generate all subsequences of both given sequences and find the longest matching subsequence. This solution is exponential in term of time complexity.Time complexity of the above naive recursive approach is O(2^n) in worst case and worst case happens when all characters of both the strings mismatch i.e., length of LCS is 0.

The pseudo-code for the recursive implementation is :-

function LCS(S, T)
   if S is empty or T is empty then
      return empty string

   if first char of S == first char of T then
       return (first char of S) + LCS(S - first char, T - first char)

   otherwise // first chars are different
       return longer of LCS(S - first char, T) and LCS(S, T - first char)

Dynamic Programming Approach

In dynamic programming approach we store the values of longest common subsequence in a two dimentional array which reduces the time complexity to O(n * m) where n and m are the lengths of the strings.

Let the input sequences be X and Y of lengths m and n respectively. And let dp[n][m] be the length of LCS of the two sequences X and Y.

We iterate through a two dimentional loops of lengths n and m and
use the following algorithm to update the table dp[][]:-

  • If any of the loop variable i or j is 0 , then dp[i][j] = 0.
  • if X[i-1] = Y[j-1] ,i.e., when the characters at ith and jth index matches, dp[i][j] = 1 + dp[i-1][j-1].
  • Otherwise, store the maximum value we get after considering either the charater X[i] or the character Y[j],i.e.,dp[i][j] = max(dp[i][j-1],dp[i-1][j]).

To retrieve the subsequence, follow a bottom-up approach.When the length of subsequence considering the Y[j] is greater than that considering X[i],decrement j, increment i when the length of subsequence considering the Y[j] is less than that considering X[i].Whenever the the length of subsequence considering the Y[j] is equal to that considering X[i],include the character in the answer.

For example:-

Consider the strings aabbc and abacc.Here length of both strings is 5.When we iterate through two nested loops where 0<=i<=5 and 0<=j<=5,When i=0 or j=0 ,dp[0][0..5] = 0 and d[0..5][0] = 0.When i=1 and j=1 ,since X[i-1] = Y[j-1] = 'a',therefore dp[1][1] = 1 + dp[0][0] = 1.

Since X[i-1] != X[j-1] for i=1 and j=2, therefore dp[1][2] = max(dp[0][2],dp[1][1]) = 1.

Similarly filling the table, the final table dp[][] would look like:-

a b a c c
0 0 0 0 0 0
a 0 1 1 1 1 1
a 0 1 1 2 2 2
b 0 1 2 2 2 2
b 0 1 2 2 2 2
c 0 1 2 2 3 3

To retrieve the longest common substring, let us create a character array ans[] of length ind equal to the length of longest common subseuence i.e. dp[5][5].So in our example ind=3.Let us initialize two variables i=5 and j=5(since length of both strings is 5).Let the string aabbc be a[] and abacc be b[].
Since a[i-1]=b[j-1],i.e, a[4]=b[4]='c', ans[ind-1]=a[i-1].So ans[2]=c and i,j and ind all are decremented.Now i=4,j=4,ind=2.

Now dp[i-1][j] = dp[3][4] = 2 and dp[i][j-1]=dp[4][3] = 2.Since dp[3][4] = dp[4][3],therefore, only j is decremented.Now i = 4,j = 3,ind=2.

Now dp[i-1][j] = dp[3][3] = 2 and dp[i][j-1] = dp[4][2] = 2.Since dp[3][3] = dp[4][2],therefore, only j is decremented.Now i = 4,j=2,ind=2.

Since a[i-1] = b[j-1],i.e, a[3] = b[1] = 'b', ans[ind-1] = a[i-1].So ans[1]=b and i,j and ind all are decremented.Now i=3,j=1,ind=1.

Now dp[i-1][j] = dp[2][1] = 1 and dp[i][j-1] = dp[3][0] = 0.Since dp[2][1]>dp[3][0],therefore, only i is decremented.Now i=2,j=1,ind=1.

Since a[i-1] = b[j-1],i.e, a[1] = b[0] = 'a', ans[ind-1]=a[i-1].So ans[0] = a and i,j and ind all are decremented.Now i = 2,j = 0,ind = 0.Since j=0,therefore the algorithm will stop.So, the longest common subsequence would be ans[]=abc.

Pseudo code of the above implementation:-

bottom-up-lcs() {
lcs = new array [0..m, 0..n]
for (i = 0 to m) lcs[i,0] = 0 // basis cases
for (j = 0 to n) lcs[0,j] = 0
for (i = 1 to m) { // fill rest of table
for (j = 1 to n) {
if (x[i] == y[j]) // take x[i] (= y[j]) for LCS
lcs[i,j] = lcs[i-1, j-1] + 1
else
lcs[i,j] = max(lcs[i-1, j], lcs[i, j-1])
}
}
return lcs[m, n] // final lcs length
}

Following is the cpp implementation of the above approach :-

#include<bits/stdc++.h>
using namespace std;
void lcs(string a,string b,int n,int m)
{
	int dp[n+1][m+1];
    // Building the table in bottom-up manner
	for(i=0;i<n+1;i++)
	{
		for(int j=0j<m+1;j++)
		{
			if(i==0 || j==0)
				dp[i][j]=0;
			else if(a[i-1]==b[j-1])
				dp[i][j]=1+dp[i-1][j-1];
			else
			{
				dp[i][j]=max(dp[i][j-1],dp[i-1][j]);
			}
		}
	}

	int ind=dp[n][m]; // length of longest common subsequence
	char ans[ind+1]; // longest common subsequence
	int i=n,j=m;
	ans[ind]='\0';
	while(i>0 && j>0)
	{
		if(a[i-1]==b[j-1])
			{
				ans[ind-1]=a[i-1];
				i--;j--;ind--;
			}
		else if(dp[i-1][j]>dp[i][j-1])
			i--;
		else
		{
		 j--;
		}
	}
	cout << ans;
}
int main()
{
	string a,b;
	cin >> a >> b;
	cout << "The largest common subsequence is " ;
	lcs(a,b,a.length(),b.length());
}
/*
Sample Input:-
    aabbc
    abacc
Sample Output:-
    The largest common subsequence is abc
*/

Time Complexity

  • Worst case time complexity: O(n*m)
  • Average case time complexity: O(n*m)
  • Best case time complexity: O(n*m)
  • Space complexity: O(n*m)

Since we are using two for loops for both the strings ,therefore the time complexity of finding the longest common subsequence using dynamic programming approach is O(n * m) where n and m are the lengths of the strings.Since this implemetation involves only n rows and m columns for building dp[][],therefore, the space complexity would be O(n * m).

With this, you have complete idea of effectively solving the Longest Common Subsequence problem using Dynamic Programming.