Find Duplicate File in System [Solved with hashmap]

In this article at OpenGenus, we will learn how to solve the problem "Find duplicate files in the system" using string streams and a hashmap.

Problem Statement

We are given a list of paths, each containing directory info. From these paths, we need to print the groups of duplicate files, where each group consists of at least two files that have the same content.

A single directory info string in the input list has the following format:

"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt ... fn.txt) with content (f1_content, f2_content ... fn_content) respectively in the directory "root/d1/d2/.../dm". Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.

In the output, each file path must have the following format:
"directory_path/file_name.txt", where directory_path starts at the root directory.

Example 1:

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"]

Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Here the content abcd appears in both root/a/1.txt and root/c/3.txt, so these two files are duplicates of each other; similarly, the content efgh appears in root/a/2.txt, root/c/d/4.txt, and root/4.txt.

Therefore our output contains the groups [root/a/2.txt, root/c/d/4.txt, root/4.txt] and [root/a/1.txt, root/c/3.txt].

Example 2:

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"]

Output: [["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Constraints:

1 <= paths.length <= 2 * 10^4
1 <= paths[i].length <= 3000
1 <= sum(paths[i].length) <= 5 * 10^5
paths[i] consist of English letters, digits, '/', '.', '(', ')', and ' '.

Approach

We are given a list of paths as input, where each path contains a directory, file names, and the content of each file in brackets. Since we need to find the files with duplicate content, the natural idea is to use a map keyed by content to group together all the files that share that content.

  1. Split each path into the directory path, file names, and file contents.
  2. Use a map to store each directory_path/file_name under its content as the key.
  3. After processing all the paths, traverse the map and keep only the contents mapped to more than one file, eliminating the non-duplicates.
  4. Return the paths of the duplicate files.

Here we use stringstream to implement the code: it converts the string into a stream so that we can easily extract the space-separated words within it. The first word is the directory path, i.e. root/d1/d2/..., so we store it for later use. The remaining words have the form file1.txt(file1_content), file2.txt(file2_content), and so on. For each such word, we extract file1.txt and file1_content, and in the map we store the value root/d1/d2/.../file1.txt under the key file1_content; the same is done for file2.txt(file2_content), file3.txt(file3_content), and so on.

Code

Following is the complete C++ implementation:

#include <bits/stdc++.h>
using namespace std;

void findDuplicate(vector<string>& paths) {
    vector<vector<string>> ans;
    unordered_map<string, vector<string>> m; // content -> full file paths
    for (const string& path : paths) {
        stringstream ss(path); // string stream over the directory-info string
        string root, word;
        ss >> root; // first word is the directory path
        while (ss >> word) {
            int i = 0, j = 0;
            // advance till '(' is encountered
            while (i < (int)word.length() && word[i] != '(') i++;
            string file = root + '/' + word.substr(0, i); // full file path
            j = i;
            // advance till ')' is encountered
            while (i < (int)word.length() && word[i] != ')') i++;
            string content = word.substr(j + 1, i - j - 1); // file content
            m[content].push_back(file);
        }
    }

    // keep only contents shared by at least two files (the duplicates)
    for (const auto& p : m) {
        if (p.second.size() > 1)
            ans.push_back(p.second);
    }
    for (int i = 0; i < (int)ans.size(); i++) {
        for (int j = 0; j < (int)ans[i].size(); j++) {
            cout << ans[i][j] << " ";
        }
        cout << endl;
    }
}

int main() {
    int n;
    cin >> n;
    cin.ignore(); // discard the newline left behind after reading n
    vector<string> paths;
    for (int i = 0; i < n; i++) {
        string st;
        getline(cin, st); // each path contains spaces, so read the whole line
        paths.push_back(st);
    }
    findDuplicate(paths);
    return 0;
}

Result

Input: 
4
root/a 1.txt(abcd) 2.txt(efgh)
root/c 3.txt(abcd)
root/c/d 4.txt(efgh)
root 4.txt(efgh)

Output:
[Success] Your code was executed successfully
root/a/2.txt root/c/d/4.txt root/4.txt 
root/a/1.txt root/c/3.txt 

Performance

Time Complexity: O(N*n), where N is the number of path strings and n is their average length; every character is processed once by the string stream.
Space Complexity: O(N*n), since the map stores every file path and every content string.
