Top K Frequent Words

Open-Source Internship opportunity by OpenGenus for programmers. Apply now.

In this article, we will study about different approaches to solve the problem "Top K Frequent Words". This will involve the concept of Priority Queue, Trie and Bucket Sort.

Content:

Problem Statement
Solution using:
2.1. Priority Queue
2.2. Trie and Bucket Sort

Problem Statement:

We are given an array of strings words and an integer k, we have to return k strings which has highest frequency.

We need to return the answer which should be sorted by the frequency from highest to lowest and the words which has the same frequency sort them by their alphabetical order.

Example TestCases:

Input:
words = ["i","love","writing","i","love","coding"], k = 2
Output:
["i","love"]
Explanation: "i" and "love" are the two most frequent words.
Input:
words = ["it","is","raining","it","it","raining","it","is","is"], k = 3
Output:
["it","is","raining"]

Explanation: "it", "is" and "raining" are the three most frequent words, with the number of occurrence being 4, 3 and 2 respectively.

First Approach: Using Priority Queue

A priority queue is an abstract data type in which priority is associated with each elements. The element which has highest priority would come first in priority queue.

Algorithm:

First of all we will store the frequency of all the given words in an unordered map.
We will now declare a priority queue which can store k elements of highest frequency and the elements of same frequency in alphabetical order.
We will traverse in the unordered map and store every new pair of elements in a priority queue and if the size of priority queue becomes greater than k, we will remove the element which has lesser frequency or greater alphabetical order.
We will repeat this process until the end of the unordered map.
Now, initialise a vector of strings and store the k words of priority queue in it and remove them from the priority queue.

Code for Top K Frequent Words:

#include<bits/stdc++.h>
using namespace std;

  struct PqComp {
  // PqComp is the comparator so that we can get the decreasing order for the  
  //frequency of the strings and increasing order in the lexiographical way.
  //p1.first > p2.first will return true only when the frequency of first string 
  //is more than the second and p1.second < p2.second will return true when the 
  //string in p1 is alphabetically smaller then string in p2.
 
        bool operator() (const pair<int, string>& p1, const pair<int, string>& p2) {
            if(p1.first != p2.first) {
                return p1.first > p2.first;
            }
            else {
                return p1.second < p2.second;
            }
        }
    };

  vector<string> topKFreqWords(vector<string>& words, int k) {
        //mp is the unordered map to store the frequency of elements of
        //vector words
        unordered_map<string, int> mp; 
        for(string &word : words) {
            mp[word] += 1;
        }
        
        //Initialising the priority queue with custom comparator function PqComp
        // because we want to order the elements in a different way.
        priority_queue<pair<int, string>, vector<pair<int, string>>, PqComp> pq;
        
        //Iterating over unordered map to store elements in priority queue pq
        for(auto it: mp) {
            pq.push(make_pair(it.second, it.first));
            
            //if the size of pq is greater than k, remove the pair with lesser 
            //frequency or greater lexicographical order from pq.
            if(pq.size() > k) pq.pop(); 
        }
        //Initialising a vector of strings 
        vector<string> ans;
        //we will insert the k elements from priority queue,pq in ans vector and 
        //remove the words from pq.
        while(!pq.empty()) {
            ans.insert(ans.begin(), pq.top().second);
            pq.pop();
        }
        return ans;
    }

   int main(){
      int n, k; 
      cin >> n >> k;

      vector<string> arr(n);
  
      cout<<"Input:"<<endl;
      for(int i=0; i<n; i++) {
          string s; cin >> s;
          arr[i] = s;
      }

      vector<string> topKFreq(topKFreqWords(arr, k));
  
      cout<<"Output:"<<endl;
      for(string s : topKFreq) {
          cout << s << " ";
      }
      
      return 0;
      }

Input and Output:

6 2 
Input:
i love coding i love oranges
Output:
i love

Time Complexity:

The time complexity of this algorithm is O(nlogk), n is the length of the words.
Inserting n elements in unordered map will take O(n) time, Inserting elements in priority queue will take O(nlogk) time as it will take O(logk) each time.

Space Complexity:

The space complexity of this algorithm is O(n) because we are storing frequency of each word in the unordered map.

Another Approach: Using Trie

In this method, we will use Trie data structure and bucket sort.
Trie data structure is like a tree which is used to store strings. It is also known as the digital tree or prefix tree. There can be maximum of 26 children in each node of the trie and the root node is always the null node.

Bucket Sort algorithm is used for sorting elements by arranging them into multiple groups which are known as buckets.These buckets are sorted by using any sorting algorithm and then these sorted buckets are combined to form a complete sorted list.

Algorithm:

We will use unordered map to store the frequency of each word.
Since, in this question the least frequency of any word can be greater than or equal to one and the maximum frequency is less than or equal to the length of the given vector so we will use bucket sort to store words.
Then for each bucket we will define a trie to store the words of the same frequency.
We don't have to sort the words inside each bucket as the words will be arranged in the alphabetical order inside trie.

Code for Top K Frequent Words:

  
  #include<bits/stdc++.h>
  using namespace std;
  
  //we are implementing trie from scratch
  struct TrieNode{
        TrieNode* child[26]; //node of trie can have maximum of 26 children
        string word;
        
        TrieNode(){
            string word = "";
            for(int i=0;i<26;i++){
                child[i] = NULL;
            }
        }
    };
   
    void addWord(TrieNode* root, string word){
        TrieNode* curr = root;
        for(auto ch: word){
            int c = ch - 'a';
            if(curr->child[c] == NULL){
                curr->child[c] = new TrieNode();
            }
            curr = curr->child[c]; 
        }
        curr->word = word;
    }
    
    void getWord(TrieNode* node, vector<string> &ans, int k){
        if(node == NULL) return;
        if(node->word != "") {
            ans.push_back(node->word);
        }
        if(ans.size() == k) return;
        for(int i=0;i < 26;i++){ 
            if(node->child[i] == NULL) continue;
            getWord(node->child[i], ans, k);
        }
    }
    
    vector<string> topKFrequent(vector<string>& words, int k) {
    
        //initialising unordered map, mp to store the frequency of the given words
        unordered_map<string, int> mp;
        for(auto word: words){
            mp[word] ++;
        }
        
        int n = words.size();
        
        vector<TrieNode *> freq;
        
        for(int i=0;i<=n;i++){
            TrieNode * newNode = new TrieNode();
            freq.push_back(newNode);
        }
        
        for(auto j: mp){
            addWord(freq[j.second], j.first);
        }
        vector<string> ans;
        
        for(int i = n;i >= 1;i--){
            getWord(freq[i], ans, k);
        }
        return ans;
    }   
    
 int main(){
      int n, k; 
      cin >> n >> k;

      vector<string> arr(n);
  
      cout<<"Input:"<<endl;
      for(int i=0; i<n; i++) {
          string s; cin >> s;
          arr[i] = s;
      }

      vector<string> topKFreq(topKFreqWords(arr, k));
  
      cout<<"Output:"<<endl;
      for(string s : topKFreq) {
          cout << s << " ";
      }
      
      return 0;
      }

9 3
Input:
it is raining it it raining it is is
Output:
it is raining

Time Complexity:

The time complexity of this algorithm is O(n) where n is the length of words.

Space Complexity:

The space complexity of this algorithm is O(n) where n is the length of words.

With this article at OpenGenus, you must have the complete idea of how to find the Top K Frequent Words.

Top K Frequent Words

trie Algorithms

Content:

Problem Statement:

Example TestCases:

First Approach: Using Priority Queue

Algorithm:

Code for Top K Frequent Words:

Time Complexity:

Space Complexity:

Another Approach: Using Trie

Algorithm:

Code for Top K Frequent Words:

Time Complexity:

Space Complexity:

Shannon Fano coding

Cost Functions [6 types]