Substring with Concatenation of All Words

Free Linux Book

Get FREE domain for 1st year and build your brand new site

In this article, we have explained two approaches to solve the problem Substring with Concatenation of All Words. This involves the idea of Hash Map and Two Pointer.

Table of Contents

  • Problem Statement
  • Approach 1: Hash Map
  • Approach 2: Two pointer
  • Comparison of methods

This is similar to Leetcode Problem 30. Substring with Concatenation of All Words.

Problem Statement

Let's say that we are given a list of strings L of N strings, and a string S. We need to find the substrings in S which are a concactenation of the strings in L, we then need to return the starting index of said strings, in any order.

For example:

L: ["jazz","quiz","open"]
S: "jazzquizopenquizjazzopenquiz"

Output:
(0, 8, 12, 16)

Reasoning:
At index 0: jazzquizopen
At index 8: openquizjazz
At index 12: quizjazzopen
At index 16: jazzopenquiz

Explanation:

Firstly, we need to calculate the possible permutations of the strings in L. Since there are 3 strings, that means 6 possiblilities:

  • jazzquizopen
  • jazzopenquiz
  • quizjazzopen
  • quizopenjazz
  • openjazzquiz
  • openquizjazz

Then we shall search for these permutations in our list S, and then return the starting indices.

Approach 1: Hash Map

An effective technique in solving this problem is through the use of a hash map. It will allow us to solve the problem, whilst maintaining a relatively efficient time complexity. It goes as follows:

  • Define hash map with all words inside L according to their location inside the list
  • Iterate through all possible substrings of S which are equal to the size of strings in L concactenated
  • Define a temporary hash map containing the original for all possible substrings
  • Find words in substring, if it is present in the temporary hash map, decrease the concerned count if not, break
  • Iterate through the temporary hash map, looking for keys with a count > 0, if there are no counts > 0, all the words in L have been found in S -> store starting index, if there are counts > 0, substring has not been fully traveresed.

Python Implementation

def substring_concact(L, S):

    word_size = len(L[0])
    total_words = len(L)
    chr_total = word_size*total_words

    result = []

    if chr_total > len(S):
        return result

    hash_map = {}

    for i in range(total_words):
        if L[i] in hash_map:
            hash_map[L[i]] += 1
        else:
            hash_map[L[i]] = 1
        
    for i in range(0, len(S) - chr_total + 1, 1):
        tmp_hash_map = hash_map.copy()
        j = i
        count = total_words

        while j < i + chr_total:
            word = S[j:j + word_size]

            if (word not in hash_map or tmp_hash_map[word] == 0):
                break
            else:
                tmp_hash_map[word] -= 1
                count -= 1
            j += word_size

            if count == 0:
                result.append(i)

    return result

L = ["jazz","quiz","open"]
S = "jazzquizopenquizjazzopenquiz"

print(substring_concact(L,S))

Explanation

L: ["jazz","quiz","open"]
S: "jazzquizopenquizjazzopenquiz"

Step 1: Define variables and results list:
- word_size (number of characters in a given string within L)
- total_words (total strings within L)
- chr_total (total characters in list L)
- results (list to store final indices)
Step 2: Create hash map which stores strings in list L against its occurances in the list.
Step 3: Create temporary hash map
Step 3: Iterate throguh the string using condition controlled loop:
- Extract the word
- If word is not in the hash map or its more than needed, break the loop
- Else, increment the current words count in hash map
Step 4: Append the starting index of given substring to result when all words in L are in S
Step 5: Return result

Output: [0, 8, 12, 16]

This approach has a time complexity of O(N-M)*M

Where N: Length of S
Where M: Length of chr_total (word_size * word_length)

Approach 2: Two pointer sliding window

To further optimise our original approach we can add two pointers in a sliding window fashion so that we can reduce the algorithm to constant time.

  • Define two dictionaries, one to store all words (dict) and one to store words of current substring (cur)
  • Define a running count to count words in current substring
  • Define two pointers, left and right:
    • right pointer tracks the word_size
    • substring = [right:right+word_size]
    • if the substring is in dict, add to cur and increment total
    • if substring is larger than expected, slide left pointer until it equals the expected result, increment total
    • we need to find all words in cur that appear less or equal to the expected amount of times, when the total = word_size, we can find the position and add it to the result, left pointer slides forward 1 place
    • if the substring is not in dict, reset cur, move left pointer to the position of right

Python Implementation

from collections import defaultdict

def find_substring(s, l):

    if not l or not s:
        return []
    
    dict = defaultdict(int)

    for i in l:
        dict[i] += 1
    
    result = []

    word_size = len(l[0])
    for i in range(word_size): 
        left = i
        right = left
        total = 0
        cur = defaultdict(int)

        while left <= len(s) - len(l) * word_size:
            if s[right:right + word_size] in dict and dict[s[right:right + word_size]] > cur[s[right:right + word_size]]:
                total += 1
                cur[s[right:right + word_size]] += 1
                right += word_size

            elif s[right:right + word_size] not in dict:
                left = right + word_size
                right = left
                cur = defaultdict(int)
                total = 0
            
            elif s[right:right + word_size] in dict and dict[s[right:right + word_size]] == cur[s[right:right + word_size]]:
                while dict[s[right:right + word_size]] == cur[s[right:right + word_size]]:
                    cur[s[left:left + word_size]] -= 1
                    total -= 1
                    left += word_size
                cur[s[right:right + word_size]] += 1
                right += word_size
                total += 1

            if total == len(l):
                result.append(left)
                cur[s[left:left + word_size]] -= 1
                total -= 1
                left = left + word_size

    return result

s = "jazzquizopenquizjazzopenquiz"
l = ["jazz","quiz","open"]

print(find_substring(s, l))

Explanation

s = "jazzquizopenquizjazzopenquiz"
l = ["jazz","quiz","open"]

Step 1: Define initial dictionary: dict containing all words in l according to their index
Step 2: Define results list
Step 3: Define left and right pointers along with current dictionary and total counter
Step 4: Create condition controlled loop, iterate through our 3 rules:

  • if substring is larger than the expected amount of appearances -> move left pointer until it reaches the expected times, increment our total by 1
  • if the substring isn't in our dictionary -> move left pointer to the position of right, reset current dictionary and set total to 0
  • if substring is smaller or equal to the expected amount of appearances -> move left pointer forward one, add position of substring to results
    Step 5: Return result

Output: [0, 8, 12, 16]

Using this two pointer approach, we can make the time complexity O(N) since we are only checking through elements once.

Comparison of methods

In this article at OpenGenus, we have gone over a few methods to find concactenation of strings as substrings both via hash map and by improving on said method with a two pointer technique. Our first method is reasonably effective, however we are making a string comparison where we have to try every possible starting point thus, the algorithm loses efficiency with larger data. Using two pointer method with a sliding window, we adjust where we are checking from depending on what has come before it, meaning we only iterate through each element one time, therefore approach 2 is the most efficient and will be our preferred approach for tackling this problem.

Thank you for reading!