×

Search anything:

First K maximum occurring words

Binary Tree book by OpenGenus

Open-Source Internship opportunity by OpenGenus for programmers. Apply now.

In this article, we will understand the different approaches to return the first k maximum occurring words from a given array. It includes core Data Structures and Algorithm topics such as Arrays, HashMap, Sorting, Priority Queue and Heap, Trie, and Bucket sort.

Table Of Content:

  1. Problem Statement
  2. Solutions
    1. Hashing and Sorting
    2. Heap
    3. Trie and Bucket Sort

Problem statement

Given an array of n string values and a positive integer k, one has to find the first k words which occur the most. In case more than one word has the same frequency, the words are ordered lexicographically. The String values in the given array consists of lowercase English alphabets only.

Example:

Input: ["when","the","going","gets","tough","the","tough","get","going"], k = 2
Output: ["going","the"]

Explanation: The frequency of unique words in the array are as follows:
when - 1, the - 2, going - 2, gets - 1, tough - 2, get - 1

Here, "the", "going", "tough" have the highest and equivalent frequency. The output should only contain two words. Lexicographically, the order of the words will be - "going", "the", "tough". Therefore, "tough" is excluded and the output is ["going","the"]

It is recommended to attempt to solve the question before going through the solution.

Solutions

Hashing and Sorting

Hashing is a method of mapping keys to corresponding values using a hash function. This allows retrieval of desired value in O(1) time complexity. Hashing is the easiest and fastest way to store and fetch the frequency of all the unique elements in the array.

Steps:

  1. Iterate through the array once and add or update the number of times a word occurs in the array in a hash table (O(n)).
  2. Add the unique words to the output list (O(n)).
  3. Sort the list using a custom comparator so that the words are arranged in descending order of their corresponding frequencies (O(nlogn)).
  4. Return the first k elements in the list.

Pseudocode:

SET output list result
SET hashmap map
FOR each word in input array
    insert word, frequency in map
END FOR
FOR each word in map keyset
    insert word in result 
END FOR
SORT result
RETURN first k elements in result 

Sample implementation in Java:

public List<String> topKFrequent(String[] words, int k) {
     
        List<String> result = new ArrayList<>();      // declare the output list
        Map<String, Integer> map = new HashMap<>();    //hash table to store the frequencies
        for (String word: words) {
            // the following function adds words to the hashmap if not already present, otherwise it increments the existing value by 1
            map.put(word, map.getOrDefault(word, 0)+1);
        }
        
        for (String word: map.keySet()) {
            // add unique words words to the list from the keys in the hashtable
            result.add(word);
        }
        
        //sort the list according to values of words in the list. If the values of two words are the same, sort it lexicographically
        result.sort((a, b) -> map.get(a) == map.get(b) ? a.compareTo(b) : map.get(b) - map.get(a));
        
        return result.subList(0, k);
    }

Time complexity : Time taken to create HashMap O(n) + Time taken to insert elements from HashMap to list O(n) + Time taken to sort list O(nlogn)
Time complexity : O(nlogn)

Space Complexity : HashMap O(n)
Space complexity : O(n)

Heap

Heap is a complete binary tree data structure. Heaps are of two types - Min Heap and Max Heap. In Min Heap, the value of the parent node is smaller than or equal to the values of its children. In Max Heap, the value of the parent node is smaller than or equal to the values of its children. Heaps are thus useful for sorting. Inserting an element in a heap takes O(logn).

Heaps can be implemented using Priority Queue. A Priority Queue is a Queue which processes elements according to priority, usually ascending order, instead of First-In-First-Out sequencing like in a regular Queue data structure.

Steps:

  1. Create a hash table as in the previous solution (O(n))
  2. Define a priority queue with a custom comparator. The comparator of the priority queue is customized such that it will arrange the elements according to the frequency.
  3. Add unique elements from the hash table in the priority queue (O(n)).
  4. Insert the first k words from the priority queue into the list (O(nlogk)).
  5. Return the list.

Pseudocode

SET output list result
SET hashmap map
FOR each word in input array
    insert word, frequency in map
END FOR
SET priority queue pq with custom comparator
FOR each word in map keyset
    insert word in pq 
END FOR
WHILE k-- > 0
    insert word in result 
END WHILE
RETURN result 

Sample implementation in Java:

public List<String> topKFrequent(String[] words, int k) {
        List<String> result = new ArrayList<>();      // declare the output list
        Map<String, Integer> map = new HashMap<>();    //hash table to store the frequencies
        for (String word: words) {
            // the following function adds words to the hashmap if not already present, otherwise it increments the existing value by 1
            map.put(word, map.getOrDefault(word, 0)+1);
        }
        
        //the comparator of the priority queue is customised using lambda expression. If two words have the same frequency, it is sorted lexicographically.
        PriorityQueue<String> pq = new PriorityQueue<>((a, b) -> map.get(b) == map.get(a) ? a.compareTo(b) : map.get(b) - map.get(a));  
        
        
        //keys from the hashtable are added to the priority queue
        for(String s : map.keySet()) pq.add(s);
              
        while(k-- > 0) result.add(pq.poll());
        
        return result;
    }

Time complexity : time taken to add strings and frequency to a hash table O(n) + Time taken to add elements to priority queue O(n) + time taken to retrieve k elements from priority queue O(nlogk)
Time complexity : O(nlogk)

Space complexity : HashMap O(n) + Priority Queue O(n)
Space complexity : O(n)

Trie and bucket sort

In this method, we make use of the Trie data structure and bucket sort.

Instead of using a hash table to store the frequency, a trie data structure can prove to be a better alternative. Trie is a special data structure used to store strings that can be visualized like a graph. A trie node consists of 26 children at most.

Bucket Sort is used to group elements and store them into individual buckets. In this case, the strings with the same frequency will be stored together in a bucket.

Steps:

  1. Create a Trie and method to insert strings into trie as well as a method to traverse through the trie data structure using Depth-First-Search.
  2. Add each word from the input array along with its frequency (O(n)).
  3. The buckets array is initialised which has the size of the maximum frequency.
  4. Traverse through the Trie using Depth-First-Search algorithm (O(n))
  5. Simultaneously, add the strings in the bucket at the corresponding frequency count as index number.
  6. The strings are stored in an array list.
  7. Iterate through the buckets array in reverse and add the k most frequently occurring string to the output list (O(k)).
  8. Return the list.

Since we have to output words lexicographically, we take advantage of Trie for it has no sorting costs.

Pseudocode

SET output list result
SET Trie head
FOR each word in input array
    insert word, frequency in trie
END FOR
SET buckets array
SET max_frequency
REPEAT
    REPEAT 
        DFS through Trie 
    UNTIL trienode count > 0
    insert string in respective bucket
    update max_frequency
UNTIL all strings are added
FOR i = max_frequency to 0
    REPEAT
        insert strings at bucket[i] into result 
        DECREMENT k with each insertion
    UNTIL k > 0
RETURN result 

Sample implementation in Java:

class Solution {
    
    //Definition of a node of a trie. It stores the children nodes, the word, and the frequency 
    class TrieNode {
        TrieNode[] children = new TrieNode[26];
        String word = "";
        int cnt = 0;       
    }
    
    TrieNode head;
    int max;
    List<String>[] buckets;
    public List<String> topKFrequent(String[] words, int k) {
        head = new TrieNode();
        max = 0;
        
        //create a trie
        insert(words, head);
        
        buckets = new List[max + 1];
        
        //store words in buckets of corresponding frequency size
        dfs(head);
        
        List<String> res = new ArrayList();
        
        //the buckets array is iterated through in reverse order to find the words with maximum frequency. 
        for (int i = max; i >= 0; i--){
            if (buckets[i] == null) continue;
            int j = 0;
            while(j < buckets[i].size() && k > 0){
                res.add(buckets[i].get(j));
                j++;
                k--;
            }
        }
        return res;
    }
    
    //the trie is traversed using DFS algorithm. Each word is traversed this way and it is stored in the bucket of corresponding frequency. It traverses each word lexicographically, hence no need to explicitly sort the arraylist of words.
    private void dfs(TrieNode head){
        TrieNode cur = head;
        if (cur.cnt > 0){
            if (buckets[cur.cnt] == null){
                buckets[cur.cnt] = new ArrayList();
            }
            buckets[cur.cnt].add(cur.word);
        }
        for (int i = 0; i <= 25; i++){
            if (cur.children[i] != null){
                dfs(cur.children[i]);
            }
        }
    }
    
    
    //this function constructs a complete trie that consists of all the words. At the end of each individual word trie, the frequency of the word is stored. The highest frequency is also determined after inserting each word. 
    private void insert(String[] words, TrieNode head){
        TrieNode cur = head;
        for(String w : words){
            cur = head;
            for (char c : w.toCharArray()){
                if(cur.children[c - 'a'] == null){
                    cur.children[c - 'a'] = new TrieNode();
                }
                cur = cur.children[c - 'a'];
            }
            cur.word = w;
            cur.cnt++;
            max = Math.max(max, cur.cnt);
        }
    }
}

Time complexity: Time to insert strings into trie O(n) + Time to traverse through trie and add strings to bucket O(n) + Time to add top elements into list O(k)

Space complexity: Trie O(alphabet_size * average_key_length * N) + Buckets array O(n) = O(26 * m * n) + O(n)

Time complexity : O(n + k)
Space complexity : O(m * n)

With this article at OpenGenus, you must have the complete idea of how to find First K maximum occurring words.

First K maximum occurring words
Share this