Approaches for implementing Autocomplete Feature


Reading time: 40 minutes

With the dawn of modern browsers and search engines, Autocomplete has been an interesting UX Feature that derives heavily from the existing concepts of the Data Structure and Algorithm to implement a simple thing: Suggesting the user a list of search phrases as he starts typing in the query box. With the huge volumes of data being processed around the world in real-time, Search Queries were required to be accurate and fast which lead to the use of more efficient algorithms that can be used to retrieve and rank search phrases depending on the given input. The Autocomplete Feature are also expected to possess some Typo-Tolerance as it can implement a spell check in case the given query has been mispelled

In this article, we will discuss the various approaches to implement an Autocomplete Feature using various algorithms, from the most naive ones to the others which are complex yet intriguing to implement at the same time. These approaches will help the reader to better understand the working of the Autocomplete Feature under the covers as we gently unveil the complexities of these algorithms within the layers.

The various approaches are:

  • Linear Search (Brute Force)
  • Binary Search Approach
  • TRIE Data Structure
  • Ternary Search Tree Approach
  • Fuzzy Search
  • Burkhard Keller Tree Approach
  • Machine Learning Approach

We will now explore each technique in depth.

1. Linear Search (Brute Force)

Probably the most naive method to implement our Autocomplete Feature is the Linear Force where we can Brute Force through the entire list of search phrases looking for the best fit which can be returned back to the user as a Search Query is passed to it. In Linear Search Algorithm, the entire list of search phrases is searched for to check if any search phrase associated with the given query is present or not.

Time Complexity for this algorithm is O(N) where N is the number of search phrases in the corpus while Space Complexity is taken to be O(1). While this algorithm is good for providing search suggestions for a small number of elements, it is inefficient to implement where we have a large corpus of search phrases.

To improve the Time Complexity we can implement the concept of Levenshtein distance which can be understood as the value that is returned after two strings are passed and the minimum number of insertion, deletion and replacements are returned which are required to translate one string into the other. Levenshtein distance is also used in the Burkhard-Keller Trees for returning near-matches to a String query where the Time Complexity associated with the operation is O(NP²) where P is the length of the search query.

The associated Code in Python is as follows:

class LinearSearch(object):
    def __init__(self):
        self.data = []
        
    def index(self, suggestions: Sequence[str]) -> None:
        for i in suggestions:
            self.data.append(i)
            
    def search(self, pre: str) -> Generator[str, None, None]:
        for i in self.data:
            if i.startswith(pre):
                yield i

2. Binary Search Approach

Binary Search is yet another efficient approach to implement the Autocomplete Feature using the "Divide and Conquer paradigm. Binary Search algorithm is regarded as one of the most popular algorithm in which the list, wherein the element is to be searched, is to be divided into two and the query element is compared to the middle element. To ensure that the element is efficiently searched, the list of search queries is sorted before it is processed. A Binary Search Algorithm can be implemented either using Iteration or using Recurssion with a time complexity of O(log N) where N is the number of search phrases.

Binary Search Approach is suitable for moderate number of Search Phrases but as the complexity of the Search Phrases increase, it gets inefficient with lack of Typo-Tolerance. In next section we will discuss the TRIE Data Structure which is much more efficient for implementing the Autocomplete Feature in a more efficient and lesser complex way.

The associated Code in Python is as follows:

from bisect import bisect_left

class BinarySearch(object):
    def __init__(self):
        self.array = []
 
    def index(self, suggestions: Sequence[str]) -> None:
        for i in suggestions:
            self.array.append(i)

        self.array.sort()
       
    def search(self, pre: str) -> Generator[str, None, None]:
        start_position = bisect_left(self.array, pre)
      
        for i in self.array[start_position:]:
            if i.startswith(pre):
                yield i
            else:
                break

3. TRIE Data Structure

TRIE Data Structure also known as the Prefix Tree or Radix Tree is remarked as an excellent Data Structure for processing and storing the Strings and is primarily used for word retrieval. The nodes in the TRIE Data Structure are used for storing the associate keys of strings and values that are being stored in the TRIE.

What makes a TRIE Data Structure so optimized for performing such actions? Unlike the popular Binary Trees, each node in TRIE can contain a maximum of 26 nodes which are pointers to the 26 different alphabets of English Language. The position of each node in a TRIE defines the key with which the node is associated. This makes the TRIE indexing suitable for maintaining the variable sized key values. The actual key values is never stored but the key values are implied through links. A TRIE of Order 26 can maintain a whole English dictionary.

As we insert keys in the TRIE Data Structure, each input key is inserted as an individual node and perform the role of an index to further child nodes. If our input is a prefix of an existing string present in the TRIE, then we can mark the new node as a leaf node to the pre-defined key. Otherwise, the new keys are constructed and the last node is marked as the Leaf Node. Search Operation is relatively easy in TRIE as a Search Operation starts from the Root Node and moves ahead by comparing characters. If any character does not exist, then no such string is found in the TRIE Data Structure.

In a TRIE Data Structure, the best Case Time Complexity has been described as O(1) while the Average and Worst Case is described as O(Length-Of-Key). The TRIE Data Structure can be optimized to support larger collection of words using Compact TRIE Data Structure Implementations.

The associated Code in Python is as follows:

class trieNode:
    def __init__(self):
        self.next = {}
        self.leaf = False
    def add_item(self, item):
        i = 0
        while i < len(item):
            k = item[i]
            if not k in self.next:
                node = trieNode()
                self.next[k] = node
            self = self.next[k]
            if i == len(item) - 1: 
                self.leaf = True
            else:
                self.leaf = False
            i += 1
    def search(self, item):
        if self.leaf and len(item) == 0:
            return True
        first = item[:1]  
        str = item[1:]  
        if first in self.next:
            return self.next[first].search(str)
        else:
            return False
    def traversal(self, item):
        if self.leaf:
            print (item)
        for i in self.next:
            s = item + i
            self.next[i].traversal(s)
    def autocomplete(self, item):
        i = 0
        s = ''
        while i < len(item):
            k = item[i]
            s += k
            if k in self.next:
                self = self.next[k]
            else:
                return 'NOT FOUND'
            i += 1
        self.traversal(s)
        return 'END'

4. Ternary Search Tree Approach

Ternary Search Tree is an optimized TRIE Data Structure where the nodes can have a maximum 3 Child Nodes, unlike the TRIE where 26 nodes are optimally associated with a node. Each node in the Ternary Search Tree has three pointers: Left, Middle and Right along with an EndOfWord Bit and a Data Field where the character associated with the string is stored. The Left Pointer is used to point to the node whose value is comparatively lesser than the value in the current node while the Right Pointer is used to point to the point to the node whose value is comparatively greater than the value in the current node. The equal pointers points to the node whose value is equal to the current node value.

Ternary Search Tree is regarded as quite efficient compared to the earlier TRIE Data Structure and can perform insertion, deletion and search operation with the same efficiency as a Binary Search Tree. The Space Complexity associated with a Ternary Search Tree is equal to the length of the string that has been stored into it.

Insertion Operation is relatively easy in Ternary Search Tree and is similar to the one in Binary Search Tree. A character comparison is performed at each stage and the node character is compared to the character to be inserted which gives us three cases:

  • If the Character to be inserted is smaller than node value, then move left.
  • If the Character to be inserted is larger than node value, then move right.
  • If the Character to be inserted is equal to the node value, then move to middle position.

The Search Operation in the Ternary Search Tree is also performed similar to the operation in Binary Search Tree. The next character in the query string is matched to the character present in the current node. The center child is looked upon recursively as the first letter from each string is removed.

The Worst Case Complexity for performing an Autocomplete Operation with Ternary Search Tree is O(N) while the average Time complexity is O(log N).

The associated Code in Python is as follows:

class Node():
    def __init__(self, value=None, endOfWord=False):
        self.value = value
        self.left = None
        self.right = None
        self.equal = None
        self.endOfWord = endOfWord

class Autocomplete():

    def __init__(self, word_list):
        self.n = Node()
        for word in word_list:
            self.insert(word, self.n)

    def insert(self, word, node):
        char = word[0]
        if not node.value:
            node.value = char

        if char < node.value:
            if not node.left:
                node.left = Node()
            self.insert(word, node.left)
        elif char > node.value:
            if not node.right:
                node.right = Node()
            self.insert(word, node.right)
        else:
            if len(word) == 1:
                node.endOfWord = True
                return

            if not node.equal:
                node.equal = Node()
            self.insert(word[1:], node.equal)

    def all_suffixes(self, pattern, node):
        if node.endOfWord:
            yield "{0}{1}".format(pattern, node.value)

        if node.left:
            for word in self.all_suffixes(pattern, node.left):
                yield word
        if node.right:
            for word in self.all_suffixes(pattern, node.right):
                yield word
        if node.equal:
            for word in self.all_suffixes(pattern + node.value, node.equal):
                yield word

    def find(self, pattern):
        final_pattern = {pat: set([]) for pat in pattern}
        for pat in final_pattern.keys():
            word = self.find_(pat)
            if word == None:
                return None
            else:
                completions = {x for x in word}
            return list(completions)

    def find_(self, pattern):
        node = self.n
        for char in pattern:
            while True:
                if char > node.value:
                    node = node.right
                elif char < node.value:
                    node = node.left
                else:
                    node = node.equal
                    break
                if not node:
                    return None
        return self.all_suffixes(pattern, node)

5. Fuzzy Search

Autocomplete Feature can be implement using Fuzzy Search which is used to find the approximate matches for the search query and is used as a spell checker and in autocomplete operations by working on algorithms like Levenshtein distance, Damerau-Levenshtein distance, Bitap algorithm, Smith-Waterman algorithm to name a few. Fuzzy Search is currently implemented on various popular Search Engines like Google, Bing, Yahoo and Duck Duck Go.

Fuzzy Search is implemented using N-Gram Sequencing Technique where n-characters are sequenced in a larger string. Let us take a string "mobile" for which the 2-grams will be: ["mo","ob","bi","il","le"]. These n-grams will be indexed and is used for autocompletion and suggestion as these n-grams are compared with the prefixes.

N-grams index is used to identify the list of suitable search phrases and finally return the most appropriate ones that are found by performing filtering on the search phrases. N-grams technique is also used in N-gram tokenizer used in Elastic Search, wherein it breaks every search phrase into numerous N-grams tokens known as "shingles" and allows the autocomplete operation to suggest new phrases.

The associated Code in Python is as follows:


class NGramSuggester(object):
    def __init__(self, ngram_size=2):
        self.data = {}
        self.ngram_size = ngram_size
        
    def index(self, suggestions: Sequence[str]) -> None:
        for s in suggestions:
            grams = ngrams(s, self.ngram_size)
            
            for ngram in grams:
              
                if ngram not in self.data:
                    self.data[ngram] = [s]
                else:
                    self.data[ngram].append(s)
                    
    def search(self, prefix: str, match_percentage: float=0.9):
        suggestions = {}
        grams = list(ngrams(prefix, self.ngram_size))
        total_grams = len(grams)
        
        for ngram in grams:
            for s in self.data.get(ngram, []):
                if s in suggestions:
                    suggestions[s] += 1
                else:
                    suggestions[s] = 1
                    
        for s, gram_count in suggestions.items():
            percentage = gram_count * 1.0 / total_grams
            if percentage > match_percentage:
                yield s

6. Burkhard Keller Tree Approach

Burkhard Keller Tree or otherwise known as the BK-Tree is a Tree-Based Data Structure that is used to find the near-matches to a String Query. The BK-Trees were first proposed in a paper "Some approaches to best match file searching" by Burkhard and Keller and has been since used as an algorithm for performing Spell Check.

Unlike other algorithms, BK-Tree rests its concepts on the Levenshtein distance which can be understood as the value that is returned after two strings are passed and the minimum number of insertion, deletion and replacements are returned which are required to translate one string into the other. Burkhard Keller Tree are also utilized in implementing the Auto-Complete Feature in many Software and Web Applications and is regarded as more efficient compared to the TRIE and Ternary Search Tree that we have discussed earlier.

BK-Tree is a Tree-Based Data Structure and like any other Tree Data Structure it consists of nodes and edges that connect them. In a BK-Tree, each node consist of the individual words in the dictionary and the number of nodes in a BK-Tree is equal to the number of the words that are being inserted into the dictionary. Each edge that connects the nodes possess an integer value that is corresponsing to the Levenshtein distance.

To construct a BK-Tree we can take any node as the root node and then we have to add some element to the root node, by calculating their Levenshtein Distance and adding them as a left node or the right node, in the same manner as will add nodes in any other Binary Tree.

In the Search Operation, our primary aim is to find all the words that are closest given to a query word. This means that we will take an integer 'N' which we will call as the radius. To find any node in the BK-Tree we will just compute the Levenshtein Distance from the query string to the string stored in the root node. If the value that is returned is less than the radius, then the rest of the nodes are recursively processed by selecting all the children of the root node.

The associated Code in Python is as follows:

from collections import deque

class BKTree:
    def __init__(self, distance_func):
        self._tree = None
        self._distance_func = distance_func

    def add(self, node):
        if self._tree is None:
            self._tree = (node, {})
            return

        current, children = self._tree
        while True:
            dist = self._distance_func(node, current)
            target = children.get(dist)
            if target is None:
                children[dist] = (node, {})
                break
            current, children = target

    def search(self, node, radius):
        if self._tree is None:
            return []

        candidates = deque([self._tree])
        result = []
        while candidates:
            candidate, children = candidates.popleft()
            dist = self._distance_func(node, candidate)
            if dist <= radius:
                result.append((dist, candidate))

            low, high = dist - radius, dist + radius
            candidates.extend(c for d, c in children.items()
                              if low <= d <= high)
        return result

7. Machine Learning Approach

With the rise in use-cases of Machine Learning and Deep Learning in industry, Autocomplete Operation can be automated and made faster thanks to access to more powerful algorithms and more computational power. For this purpose we can use Long Short-Term Memory (LSTM) algorithm which can perform beam predictions to suggest autocomplete operations with much more efficiency than standard N-Grams Search or any other algorithm. However most of the Machine Learning and Deep Learning algorithm depends on the type of data passed to it and hence developing any autocomplete software on Machine Learning Concepts require heavily cleansed code which can be made possible by removing comments, strings and blank lines and the model can be trained on tokenized code which is more efficient than character level prediction with byte-pair encoding.

With this, you have the complete knowledge of different approaches that can be taken for Text Autocompletion in general. Enjoy.