Number of Substrings with distinct characters

Internship at OpenGenus

Get FREE domain for 1st year and build your brand new site

In this article, we have explored various algorithms that will help us in determining the number of substrings that a string can have with distinct characters.

We will be exploring the following algorithms:

  • Naive algorithm, with a time complexity of O(n³)
  • Optimized naive algorithm, with a time complexity of O(n²)
  • Optimized algorithm with hash mapping, with a time complexity of O(n)

Table of content:

  1. Understanding the problem
  2. (1) Naive algorithm
  3. (2) Optimized naive algorithm
  4. (3) Optimized algorithm with hash mapping

Understanding the problem

What exactly do you infer when the problem statement asks us to find out the number of substrings with distinct characters?

Are the substrings "abcd" and "cabd" different, since they have different ordering of the same set of characters? Or are they the same, since both the substrings consist of the same set of characters, its just that their ordering is different. More times than not, this should be mentioned in the problem statement.

In this article, we will be covering both of these problems.

  • Our first algorithm, i.e., the naive algorithm, will find out the total number of substrings with distinct characters and it will count "abcd" and "cabd" as two different substrings.
  • After a little optimization, our second algorithm, i.e., the optimized naive algorithm, will find out the total number of substrings with distinct characters, but in this algorithm "abcd" and "cabd" will be counted as the same substring.
  • Finally, we will construct a new optimized algorithm with hash mapping that will also find out the total number of substrings with distinct characters and the substrings "abcd" and "cabd" shall be counted as the same substring in this algorithm as well.

(1) Naive algorithm

This is a very simple and straight-forward solution to this problem that doesn't really take the efficiency, complexity and constraints of the problem into account. It is a brute force algorithm that will make use of nested loops to produce the desired output.
The steps of the algorithm have been illustrated below:

  1. Generate all substrings of the input word, starting from 0 to n
  2. Check if the characters in the input word are repeated in the generated substrings
    2.1 If the there is no repetition, add that character to substrings
    2.2 Keep count of the valid substrings
  3. Repeat step 2 as many times as required
  4. Finally return the length of substrings as the final answer

(a) Working/Explanation

  • In this algorithm, we will go through three loops of length n (n = length of input string).
  • Firstly, we initialize substrings as an empty list/array and n as the length of the input string, i.e., word.
  • {First loop} For each i from 0 to n, we check if the array/list substring contains the character at index i of word or not. If it does not contain it, then that character is appended to substrings.
  • {Second loop, nested inside first loop} Here, another nested loop is executed which will run as k from 0 to n.
    • If the character at index i of word is not equal to the character at index k of word and substrings does not contain the word that is a concatenation of character at index i and character at index k of word, then append the concatenation of character at index i and character at index k to substrings.
      For example, if word[i] = "s" and word[k] = "k", then their concatenation is "sk".
      That is what we are checking for in this step.
    • If temp_word does not contain the character at index i of word then append that character to temp_word.
    • If temp_word does not contain the character at index k of word then append that character to temp_word.
  • {Third loop, nested inside second loop} Now we are going to run another nested loop here, which is going to be inside the body of our previous loop and will run as j from 0 to n as well.
    • If temp_word contains the character at index j of word then go back to the last loop (j).
    • If temp_word does not contain the character at index j of word and substrings does not contain the word that is a concatenation of temp_word and the character at index j, then append the character at index j to temp_word and then append temp_word to substrings.
  • After the control comes out of the third loop (with j as its base variable), replace the value of temp_word with an empty string and send the control back to the second loop (k). This step ensures that the next loop will build a different substring with unique characters.
  • After the control comes out of the second loop (with k as its base variable), send the control back to the first loop (i).
  • Once the control comes out of the first loop, our algorithm would have been executed completely and the length of the array/list substrings is returned as our final answer.

(b) Implementation

Following is the implementation of our naive approach in Python:

def uniqueSubstringCount(word):
    substrings = []
    for i in range(len(word)):
        temp_word = ""
        if word[i] not in substrings:
            substrings.append(word[i])
        for k in range(len(word)):
            if (word[i] != word[k] and word[i] + word[k] not in substrings):
                substrings.append(word[i] + word[k])
            if (word[i] not in temp_word):
                temp_word += word[i]
            if (word[k] not in temp_word):
                temp_word += word[k]
            for j in range(len(word)):
                if word[j] in temp_word:
                    continue
                elif(word[j] not in temp_word and str(temp_word + word[j]) not in substrings):
                    temp_word += word[j]
                    substrings.append(temp_word)
            temp_word = ""
    return len(substrings)


word = "sky"
print("Word:", word)
print("Total substrings with unique characters:", uniqueSubstringCount(word.lower()))

(c) Output

Word: sky
Total substrings with unique characters: 15

(d) Complexity

  • Time complexity: In our algorithm, we have a total of three nested loops which run over n times with different base variables, which makes the complexity n³. Let's assume that a constant amount of work k is being done each time inside each loop (different operations such as comparison, addition, etc.), which makes our total complexity (n³ + k).
    We can ignore the constant work k.
    Hence, our overall complexity becomes O(n³).
  • Auxiliary space: In our implementation, since the size of the extra list that we have used to store the substrings is independent of the size of our input n, the space complexity of our algorithm will be O(1).

(2) Optimized naive algorithm

This algorithm is an optimized version of our previous naive algorithm. This is also a simple and straight-forward solution which is better than our previous algorithm in terms of efficiency and complexity, but it is still not the most optimal solution to this problem. This is also a brute-force algorithm that will make use of nested loops to produce the desired output.

The steps of the algorithm have been illustrated below:

  1. Create an empty list/array to store the generated substrings
  2. As long as no character is repeated in a particular substring, add it to a temporary variable
  3. For another character in the input word, if it is not repeated in the temporary variable, add it to the list/array
  4. Finally return the length of the list/array as the final answer

(a) Working/Explanation

  • In this algorithm, we will go through two loops of length n (n = length of input string).
  • Firstly, we initialize l as an empty list/array and n as the length of the input string, i.e., word
  • Initialize temp as an empty string
  • {First loop} For each i from 0 to n, we check if the array/list l does not contain the character at index i of word. If it does not contain that character, then that character is appended to l.
    After this, the value of temp is replaced with the character at index i of word.
  • {Second loop, nested inside first loop} Here, another nested loop is executed which will run as j from i + 1 to n.
    • If temp does not contain the character at index j in word, then append that character to temp.
    • If l does not contain temp, then append temp to l.
  • Once the control comes out of the second loop (with j as its base variable), send the control back to the first loop (i).
  • After the control comes out of the first loop, we can now return the length of the list l as our algorithm would have been executed completely.

(b) Implementation

Following is the implementation of our optimized naive approach in Python:

def uniqueSubstringCount(word):
    l = []
    n = len(word)
    for i in range(n):
        if(word[i] not in l):
            l.append(word[i])
        temp = word[i]
        for j in range(i + 1, n):
            if(word[j] not in temp):
                temp += word[j]
            if temp not in l:
                l.append(temp)
    return len(l)


word = "skkfab"
print("Word:", word)
print("Total substrings with unique characters:", uniqueSubstringCount(word))

(c) Output

Word: skkfab
Total substrings with unique characters: 15

(d) Complexity

  • Time complexity: In our algorithm, we have a total of two nested loops which run over n times with different base variables, which makes the complexity n². Let's assume that a constant amount of work k is being done each time inside each loop (different operations such as comparison, addition, etc.), which makes our total complexity (n² + k).
    We can ignore the constant work k.
    Hence, our overall complexity becomes O(n²).
  • Auxiliary space: In our implementation, since the size of the extra list that we have used to store the substrings is independent of the size of our input n, the space complexity of our algorithm will be O(1).

(3) Optimized algorithm with hash mapping

This algorithm is more efficient than both of our previous algorithms and will run in linear time. There won't be any nested loops here.
The steps of the algorithm have been illustrated below:

  1. Map the frequency of each character present in the input word onto a hash map/dictionary
  2. Iterate over the dictionary and keep a track of the occurrence of each character in the input word
  3. If a character is repeated (has a frequency of more than 1), then decrement its value by 1 in the hash map/dictionary
  4. Increment the value of counter appropriately
  5. Finally return the value of counter as the final answer

(a) Working/Explanation

  • Firstly, n is initialized with its value as the length of the input string, i.e., word.
  • Then d is initialized as a dictionary where its keys will range from 0 to 25 and the value mapped to each key will be 0 initially. The reason as to why we have 25 keys in this dictionary is because there are a total of 26 characters in English, and we will be using this dictionary to store the frequency of each character of the input word and it will also be used to calculate the number of substrings that can be constructed with distinct characters.
  • Three more variables ctr, i and j are declared with each of their values as 0. The variables i and j are going to be the two pointers that we are going to be using for looping and navigation purposes in the input string word. The variable ctr is a counter that is going to be used to store the actual number of substrings with distinct characters.
  • {While loop begins} We are going to run a while loop here and it will run until the value of i is smaller than the value of n, i.e., i < n.
    • Check if the value of j is less than the value of n and also if the value mapped to the key (ord(word[j] - ord("a"))) of d is 0. What does (ord(word[j] - ord("a"))) represent here? The function ord() returns the unicode value of the provided argument. Well, we initialized our dictionary with its keys ranging from 0 to 25 and their values as 0. However, the ASCII values of alphabets from 'a' to 'z' range from 97 to 122. If we used these values as our keys in the dictionary, the keys wouldn't be in sync with our 2 pointers (i and j) anymore and we don't want that because we have to perform comparisons in each iteration of the loop. Moreover, we don't know as to what string the user is going to send to our function as an argument, so we cannot assume anything. We can use 97-122 as our keys but that would make our code a bit longer unnecessarily. If this condition is true, then perform the following steps:
      (1) Increment the value that is mapped to the key (ord(word[j] - ord("a"))) of d by 1
      (2) Replace the value of ctr by (ctr + j - i + 1)
      (3) Increment the value of j by 1
    • Else perform the following steps:
      (1) Decrement the value that is mapped to the key (ord(word[i] - ord("a"))) by 1
      (2) Increment the value of i by 1
  • By the time we reach this step, we should have our final result stored in the variable ctr, so we can simply return ctr.

(b) Implementation

Following is the implementation of our optimized hash mapping approach in Python:

def uniqueSubstringCount(word):
    n, d = len(word), {i: 0 for i in range(26)}
    ctr, i, j = 0, 0, 0
    while(i < n):
        if(j < n and d[ord(word[j]) - ord("a")] == 0):
            d[ord(word[j]) - ord("a")] += 1
            ctr += j - i + 1
            j += 1
        else:
            d[ord(word[i]) - ord("a")] -= 1
            i += 1
    return ctr


word = "unique"
print("Word:", word)
print("Total substrings with unique characters:", uniqueSubstringCount(word))

(c) Output

Word: unique
Total substrings with unique characters: 19

(d) Complexity

  • Time complexity: In our algorithm, we run a single loop which runs over n times and a constant amount of work k is being through each iteration (various operations like comparisons, addition, etc.) of the loop which gives us a total complexity of (n + k).
    Ignoring the constant k, we get an overall complexity of O(n).
  • Auxiliary space: In our implementation, since the size of the extra dictionary that we have used to map the frequency of each alphabet is independent of the size of our input n, i.e., no matter what n is, the size of our dictionary will always be 26 and it won't be affected by our input in any way, the space complexity of our algorithm will be O(1)

THANK YOU!