System Design of File Uploading Service


In this article, we'll discuss what to keep in mind while designing a web-based file uploading service: how to restrict the number of uploads per user, how to handle version history in the database, and how to arrive at an efficient design.

Most of us have used Google Drive or Dropbox to upload files and share them across the web. But how do these applications work internally, and what should we keep in mind if we have to design a similar service? Building a system like these takes years, but we can come up with a viable, less complex solution for a small-scale service in less than an hour. System design is not just writing a function that uploads a document and re-uploads it whenever you make changes; there is a lot more to it. A good programmer not only solves the problem but solves it in an optimized way, keeping in mind the limits of the available resources and providing the best experience to the users.

Before getting into technicalities such as which frameworks or stack to use, it's necessary to look at the problem statement and analyze it.

What features should be included?

First of all, what should we expect from this service? What features do we expect to include in our solution? Deciding these things beforehand makes the design more systematic. I think the following features should be included:

The user can upload files and access them through a particular URL.
The user can update a file whenever needed.
The user can upload only a certain number of files to the system.
The system can handle multiple users/clients.

Now let us discuss some problems that we could face and how to solve and optimize them.

Space and bandwidth usage while uploading a file

Whenever we want to re-upload a large file (around 20-30 MB) to the cloud storage, uploading the whole file again and again is inefficient, as it increases both the bandwidth used and the time taken. A better approach is to upload a new file in smaller chunks so that, whenever the file changes, we only upload the chunks that have been modified. There should also be functionality to combine the chunks back into one file whenever the user wants to view it.
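As a rough illustration of the chunking idea, the sketch below splits a file into fixed-size chunks, hashes each chunk, and compares the hashes of the old and new versions to find which chunks need uploading. The 4 MiB chunk size, the use of SHA-256, and all function names are assumptions made for this example, not part of any particular service.

```python
import hashlib
from typing import List

CHUNK_SIZE = 4 * 1024 * 1024  # assumed chunk size (4 MiB); tune for your workload


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[bytes]:
    """Split a file's bytes into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def chunk_hashes(chunks: List[bytes]) -> List[str]:
    """Fingerprint each chunk so versions can be compared without the raw bytes."""
    return [hashlib.sha256(chunk).hexdigest() for chunk in chunks]


def changed_chunk_indexes(old_hashes: List[str], new_hashes: List[str]) -> List[int]:
    """Indexes of chunks in the new version that are new or differ from the old one."""
    return [
        i for i, digest in enumerate(new_hashes)
        if i >= len(old_hashes) or old_hashes[i] != digest
    ]


def combine_chunks(chunks: List[bytes]) -> bytes:
    """Reassemble the original file from its ordered chunks."""
    return b"".join(chunks)
```

Only the chunks returned by `changed_chunk_indexes` would need to be sent to storage. Note that with fixed-size chunks, inserting bytes near the start of a file shifts every later chunk, so all of them would be reported as changed; this simple scheme works best for in-place edits and appends.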

Access the file through URL

To keep track of the chunks of a file so that we can access them in order, we can create a metadata file that stores the indexes of the chunks and other related information, such as the URL used to access the file.
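One possible shape for such a metadata record is sketched below. The field names (`file_url`, `storage_key`), the example URL, and the dictionary standing in for cloud storage are all hypothetical; the point is only that the chunk index lets us rebuild the file in the right order.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ChunkRef:
    index: int        # position of the chunk within the file
    storage_key: str  # hypothetical key under which the chunk is stored


@dataclass
class FileManifest:
    file_url: str  # URL through which the user accesses the file
    chunks: List[ChunkRef] = field(default_factory=list)


def reassemble(manifest: FileManifest, chunk_store: Dict[str, bytes]) -> bytes:
    """Fetch every chunk listed in the manifest and join them in index order."""
    ordered = sorted(manifest.chunks, key=lambda ref: ref.index)
    return b"".join(chunk_store[ref.storage_key] for ref in ordered)


# Usage: chunks may be recorded in any order; the index restores the file.
store = {"k-0": b"hello ", "k-1": b"world"}
manifest = FileManifest(
    file_url="https://files.example/abc123",  # made-up URL for illustration
    chunks=[ChunkRef(1, "k-1"), ChunkRef(0, "k-0")],
)
restored = reassemble(manifest, store)
```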

Data consistency and handling multiple users

If a user updates a particular file, the system should update all the copies present in the storage system. A good way to do this is to keep the file metadata in a relational database system, whose transactions help ensure data integrity and consistency.
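To make the consistency point concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table layout (`files`, `file_versions`) and the helper names are assumptions chosen for illustration. Recording a new version and bumping the current version happen inside one transaction, so readers never see one change without the other.

```python
import sqlite3

# Assumed schema: one row per file, plus one row per stored version of it.
SCHEMA = """
CREATE TABLE files (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE,
    current_version INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE file_versions (
    file_id INTEGER NOT NULL REFERENCES files(id),
    version INTEGER NOT NULL,
    manifest TEXT NOT NULL,
    PRIMARY KEY (file_id, version)
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def create_file(conn: sqlite3.Connection, url: str) -> int:
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute("INSERT INTO files (url) VALUES (?)", (url,))
    return cur.lastrowid


def record_new_version(conn: sqlite3.Connection, file_id: int, manifest: str) -> int:
    """Store a new version and update the file's current version atomically."""
    with conn:
        row = conn.execute(
            "SELECT current_version FROM files WHERE id = ?", (file_id,)
        ).fetchone()
        if row is None:
            raise KeyError(f"unknown file id {file_id}")
        new_version = row[0] + 1
        conn.execute(
            "INSERT INTO file_versions (file_id, version, manifest) VALUES (?, ?, ?)",
            (file_id, new_version, manifest),
        )
        conn.execute(
            "UPDATE files SET current_version = ? WHERE id = ?",
            (new_version, file_id),
        )
    return new_version
```

Keeping every version row also gives a simple version history: an older version can be restored by pointing `current_version` back at it.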

Space utilization in the storage

It is necessary to limit the size and number of files each user can store, so that space can be shared among multiple users. It also helps to set a time limit for files that remain idle in the storage system; removing them frees up space for more users.
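These limits can be expressed as a couple of small checks. The sketch below is illustrative only: the specific limits (100 files, 30 MB per file, 90 idle days) and the names `StoredFile`, `can_upload`, and `expired_files` are assumptions, not recommendations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Assumed limits for illustration; a real service would make these configurable.
MAX_FILES_PER_USER = 100
MAX_FILE_SIZE_BYTES = 30 * 1024 * 1024
MAX_IDLE = timedelta(days=90)


@dataclass
class StoredFile:
    name: str
    size_bytes: int
    last_accessed: datetime


def can_upload(existing: List[StoredFile], new_size_bytes: int) -> bool:
    """Allow an upload only if both the file-count and file-size limits hold."""
    return (
        len(existing) < MAX_FILES_PER_USER
        and new_size_bytes <= MAX_FILE_SIZE_BYTES
    )


def expired_files(files: List[StoredFile], now: datetime) -> List[StoredFile]:
    """Files that have been idle for longer than MAX_IDLE and can be removed."""
    return [f for f in files if now - f.last_accessed > MAX_IDLE]
```

A periodic cleanup job could call `expired_files` and delete the results, while `can_upload` would run before accepting each upload request.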

Solution

Note: The solution can vary from person to person. There is no single fixed solution for designing any system.

Now, consider the following diagram:
[Diagram: Files]

This diagram depicts the rough design of a client that can be installed on a mobile/desktop device to work with our service. Let us see what functions the different components perform.

Whenever there's a change or an update in the file, the watcher notifies the divider/chunker and the indexer to maintain consistency.
Once the chunker receives the notification, it divides a new file into chunks and uploads them to the cloud storage. If an existing file has been updated, it identifies the modified chunks and saves only those.
The indexer is responsible for updating the indexes of the chunks and the URLs of where they are stored in the cloud.
The DB indexer saves all the data that the indexer sends.
A messaging queue service is used for synchronization and for handling multiple user requests simultaneously. Every time there is an update, it is broadcast to all the clients, and the clients update their files accordingly.
The synchronisation service is used by the clients either to send their updates or to receive updates from the messaging queue.
Any cloud storage service, such as Amazon S3, could be used here to store the files, as it significantly reduces the load on our web servers.
The metadata database, which we discussed before, stores all the necessary information related to the files, so we need a DBMS that keeps the data consistent across many clients. An RDBMS is a good option here.
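The broadcast behaviour of the messaging queue can be sketched in a single process with Python's standard `queue` module. This is only a toy stand-in for a real message broker: the class name `UpdateBroker`, the client ids, and the shape of the update message are all hypothetical.

```python
import queue
from typing import Dict


class UpdateBroker:
    """Toy in-process stand-in for a messaging queue service."""

    def __init__(self) -> None:
        self._queues: Dict[str, "queue.Queue[dict]"] = {}

    def subscribe(self, client_id: str) -> "queue.Queue[dict]":
        """Register a client and return the queue it should read updates from."""
        q: "queue.Queue[dict]" = queue.Queue()
        self._queues[client_id] = q
        return q

    def publish(self, sender_id: str, update: dict) -> None:
        """Broadcast an update to every subscribed client except the sender."""
        for client_id, q in self._queues.items():
            if client_id != sender_id:
                q.put(update)


# Usage: the laptop uploads a change, and the phone is told to sync it.
broker = UpdateBroker()
laptop_queue = broker.subscribe("laptop")
phone_queue = broker.subscribe("phone")
broker.publish("laptop", {"file": "notes.txt", "version": 2})
received = phone_queue.get_nowait()
```

In a deployed system, each client's synchronisation service would play the role of the queue reader, fetching the changed chunks after it receives such an update.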

So, these are the important points to keep in mind when designing a file uploading service. The remaining details can be chosen according to the programmer's preference.

Now, to get an idea of how such services are built in real life, let us look at the technologies that Dropbox has used for its service:

They use some C, but mostly Python, with frameworks and libraries such as Pylons, Cheetah, and Paste.
MySQL for storing the metadata.
All user objects are stored in S3/EC2.
They use HTTP server technologies such as NGINX.
memcached is used in front of the database and for handling inter-server coordination.

So, designing a file upload service is not a tough task; we just need to keep an optimal design in mind and work with an appropriate stack.