Find if a string is a sub-string of another string


Reading time: 35 minutes | Coding time: 15 minutes

A substring is a contiguous sequence of characters within a string. For instance, open is a substring of opengenus. We have presented two approaches to check whether a string is a substring of another string namely naive method which takes O(N^2) time and efficient method using the concept of Rolling Hash which takes linear time O(N).

Example

Input:-

str1 = opengenus
str2 = open

Output:-

Yes.
"open" exists as a substring in opengenus starting from index 0 and ending at index 3.

Input:-

str1 = opengenus
str2 = opera

Output:-

No.
No continuous sequence of characters exists in "opengenus" which is equal to "opera".

Approaches

We have provided two approaches:-

  • Naive Method O(N^2) time and O(1) space
  • Efficient method using rolling hash O(N) time and O(N) space

Naive Method

We can simply compare both strings character by character and stop when we find the entire second string in the first or when we reach the end of the first string. This would in worst case require to traverse the entire first string giving the time complexity of O(length(str1)) or O(n) for ease and comparing all substrings results in O(N^2) time complexity.

For example,suppose we want to find whether string iit is present as a substring in the string iiitian.We will start taking substring of length equal to 3,i.e., length of the string iit, from the string iiitian and check whether it is equal to the required substring iit or not.If they are equal,return true else check next substring.If none of them matches,return false.Here firstly, iii will be checked.Since iii is not equal to iit , therefore next 3-letters substring starting from index 1(0 based),i.e, iit will be considered.Since it is equal to the required substring therefore true is returned.

Pseudocode:

is_substring(string str1, string str2)
{
    for(int i=0; i<str1.length()-str2.length()+1; i++)
    {
        // substring of str1 of length of str2
        string x=str1.substr(i,str2.length()); 
        // takes O(N) time
        if(x == str2)
            return true;
    }
    // if str2 isn't a substring of str1
    return false;
}

String comparison take linear time O(N) as each character needs to be compared one by one. To learn more about this idea, go through this article: String Hashing by OpenGenus.

Following is the C++ implementation of the naive approach:-

#include<bits/stdc++.h>
using namespace std;
bool isSubstring(string str1,string str2)
{
    for(int i=0;i<str1.length()-str2.length()+1;i++)
    {
        // substring of str1 of length of str2
        string x=str1.substr(i,str2.length()); 
        if(x==str2)
            return true;
    }
    // if str2 isn't a substring of str1
    return false;
}
int main()
{
    string str1,str2;
    cout << "str1=";
    cin >> str1;
    cout << "str2=";
    cin >> str2;
    if(isSubstring(str1,str2))
        cout << "Yes\n";
    else
        cout << "No\n";
    return 0; 
}
/*
Sample Input 1:-
            str1= iiitian
            str2= iit
Sample Output 1:-
            Yes
Sample Input 2:-
            str1= opengenus
            str2= engine
Sample Output 2:-
            No
 */

Efficient Method (using Rolling Hash)

The idea is to compute the hash value of a string str1 which needs to be searched. Let this be H1. Then, we will generate hash value of subsequent sub-strings in str2 with the same length as str1 using the rolling hash method and compare it with H1.

We can efficiently find the hash values of all substrings of the first string str1 and store it in the array. Similarly find the hash value of the second string str2 and compare the hash values of the substrings of str1 with the hash value of the str2 using mapping to find whether the given string str2 exists as a substring in str1. The map m2 is used to mark the integer 1 at the hash value corresponding to str2. The map m1 maps the hash values corresponding to all the substrings of str1 to 1.So if the value of m1 at the hash value corresponding to str2 is 1 that means a substring is present in str1 corresponding to the same hash value,i.e.,str2 is present as substring in str1.

We can also use binary-search instead of maps by storing the hash values corresponding to all the substrings of str1 in a sorted form in an array and then search for the hash value corresponding to str2 in the array using binary search in O(log(n)) time complexity where n is the size of the array.You can read more about binary search here.

In general,the hash H can be defined as:-
H=( c1ak-1 + c2ak-2 + c3ak-3. . . . + cka0 ) % m
where a is a constant, c1,c2, ... ck are the input characters and m is a large prime number, since the probability of two random strings colliding is about ≈ 1/m.
Then, the hash value of next substring,Hnxt using rolling hash can be defined as:-
Hnxt=( ( H - c1ak-1 ) * a + ck+1a0 ) % m

Let's consider the same example, we want to find whether string iit is present as a substring in the string iiitian.Applying the above formula,the hash value of string iit would be 7955.So m2[7955]=1.

When we find the hash values of all 3-letters substrings of iiitian, the hash value of the first substring iii would be 7944 and that of the next substring iit would be 7955.Therefore m1[7944]=1 and m1[7955]=1.We can map these hash values and when the hash value of the substring matches with the string,return true.In this case since m2[7955]=m1[7955]=1,i.e., the hash value of the substring iit matches , therefore true will be returned.

Note that you can implement this without the use of maps.

Pseudocode:

bool is_substring(string str1, string str2)
{
    int j=0;
    string prev = str1.substr(j,str2.length());
    j++;
    map<long long,int> m1, m2;
    long long h2=compute_hash(str2);
    // map 1 to the hash-value corresponding to str2
    m2[h2]=1;
    long long h1=compute_hash(prev);
    // map 1 to the hash-values corresponding to all 
    // substrings of str1
    m1[h1]=1;
    for(int i=str2.length();i<str1.length();i++)
    {
        h1=rolling_hash(h1,prev,str1[i]);
        m1[h1]=1;
        // finds previously considered substring
        prev=str1.substr(j,str2.length());
        j++;
    }
    // if str2 isn't a substring of str1 returns 0 else 1
    return m1[h2] == m2[h2];
}

Following is the complete C++ implementation:-

#include<bits/stdc++.h>
using namespace std;
// computes the hash value of the input string s
long long compute_hash(string s) 
{
    const int p = 31;   // base 
    const int m = 1e9 + 9; // large prime number
    long long hash_value = 0;
    long long p_pow = (long long)pow(p,s.length()-1);
    for (char c : s) 
    {
        hash_value = (hash_value + (c - 'a') * p_pow) % m;
        p_pow = (p_pow / p);  
    }
    return hash_value;
}
// finds the hash value of next substring given nxt as the ending character 
// and the previous substring prev 
long long rolling_hash(long long H,string prev,char nxt)
{
   const int p = 31;
   const int m = 1e9 + 9;
   long long Hnxt=( ( H - (prev[0]-'a')*(long long)pow(p,prev.length()-1) ) * p + (nxt-'a') ) % m;
   return Hnxt;
}

bool isSubstring(string str1,string str2)
{
    int j=0;
    string prev=str1.substr(j,str2.length());
    j++;
    map<long long,int> m1,m2;
    long long h2=compute_hash(str2);
    // map 1 to the hash-value corresponding to str2
    m2[h2]=1;
    long long h1=compute_hash(prev);
    // map 1 to the hash-values corresponding to all 
    // substrings of str1
    m1[h1]=1;
    for(int i=str2.length();i<str1.length();i++)
    {
        h1=rolling_hash(h1,prev,str1[i]);
        m1[h1]=1;
        // finds previously considered substring
        prev=str1.substr(j,str2.length());
        j++;
    }
    // if str2 isn't a substring of str1 returns 0 else 1
    return m1[h2]==m2[h2];
}
int main()
{
    string str1,str2;
    cout << "str1=";
    cin >> str1;
    cout << "str2=";
    cin >> str2;
    if(isSubstring(str1,str2))
        cout << "Yes\n";
    else
        cout << "No\n";
    return 0; 
}
/*
Sample Input 1:-
            str1= iiitian
            str2= iit
Sample Output 1:-
            Yes
Sample Input 2:-
            str1= opengenus
            str2= engine
Sample Output 2:-
            No
 */

Complexity

  • Worst case time complexity: O(n)
  • Average case time complexity: O(log(n))
  • Best case time complexity: O(1)
  • Space complexity: O(n)

We have still require to get all characters of string. So the worst case time complexity will remain O(n) but with the use of rolling hash,two strings can be compared in O(1) time complexity and all substrings can be compared in O(log(n)) using maps or binary search. With the increase of the number of entries, the map's space will increase linearly. So space complexity is O(n).

References