System Design of Pastebin


In this article, we will look into the system design of Pastebin, a content hosting service for simple text files. The scale at which this application is used makes its system design interesting.

Table of Contents:

What is Pastebin
Key Features/Use Cases
Constraints and Capacity Assumptions
Diving into The Design
Conclusion

What is Pastebin

Pastebin is an online content hosting service where users can store text (e.g., source code snippets) as "pastes" over the internet, either anonymously or attached to an account. Each paste has its own unique URL and can be shared with other users for quick access. Pastebin was created in 2002 and remains relevant and available today, with over 17 million daily active users.

Key Features/Use Cases

Functional Features:

These are the features that are absolutely essential for the user. They are the core provisions of the system. They are:

Users should be able to upload plain text ("pastes").
Upon uploading a paste, the user should be provided with a unique URL to access it.
Users should only be able to upload text.
Users should be able to set an expiry time for a URL, after which the content is deleted. By default, a paste does not expire unless an expiry time is specified.
Users are anonymous.
Users get access to a paste's content when they enter the paste's URL.

Non-Functional Features

These are quality constraints that the system must satisfy to ensure that the system as a whole performs well. They are:

The system should be highly reliable: uploaded data should not be lost.
The system should be highly available, so that users can always access their pastes.
Users should be able to access their pastes in real time with minimal latency.
Analytics: Monthly visit statistics, etc.

Constraints and Capacity Assumptions

Key assumptions

17 million users
10:1 read to write ratio
10 million paste writes per month
100 million paste reads per month
Fast paste retrieval

Traffic Estimates

We can expect our traffic to be less intensive than that of a web application like Twitter or Facebook. With our modest estimate of 100 million paste reads per month and a 10:1 read-to-write ratio, we get:

an average paste read rate of about 39 reads/sec (100,000,000 / (30 * 24 * 3600))
an average paste creation rate of about 4 writes/sec (10,000,000 / (30 * 24 * 3600))
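The arithmetic above can be checked with a few lines of Python. This is only a sanity check of the stated assumptions (a 30-day month, 100 million reads and 10 million writes per month); the variable names are illustrative.

```python
# Seconds in a 30-day month, as assumed in the estimate above.
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

reads_per_month = 100_000_000
writes_per_month = 10_000_000

reads_per_sec = reads_per_month / SECONDS_PER_MONTH
writes_per_sec = writes_per_month / SECONDS_PER_MONTH

print(f"{reads_per_sec:.1f} reads/sec")    # roughly 38.6, rounded up to 39
print(f"{writes_per_sec:.1f} writes/sec")  # roughly 3.9, rounded up to 4
```

These are averages; peak traffic can be several times higher, so the servers should be provisioned with headroom.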

Storage Estimates

We can set a limit of at most 10 MB per paste, but on average we can estimate the content of a paste to be about 10 KB. Adding the metadata attached to each paste (short link, created_at, paste_path, etc.) brings the average paste size to about 11 KB.
At 10 million new pastes per month, this sums up to about 110 GB of new pastes to be stored every month and about 3.96 TB of pastes in 3 years.
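The storage figures follow directly from the assumptions above. A minimal check, assuming decimal units (1 KB = 1,000 bytes) and 36 months in 3 years:

```python
AVG_PASTE_BYTES = 11_000       # about 10 KB of content plus about 1 KB of metadata
WRITES_PER_MONTH = 10_000_000  # from the key assumptions
MONTHS = 36                    # 3 years

bytes_per_month = AVG_PASTE_BYTES * WRITES_PER_MONTH
gb_per_month = bytes_per_month / 1e9
tb_in_3_years = bytes_per_month * MONTHS / 1e12

print(f"{gb_per_month:.0f} GB per month")       # 110 GB
print(f"{tb_in_3_years:.2f} TB over 3 years")   # 3.96 TB
```

The estimate ignores replication and deletions from expired pastes, so real usage could be higher or lower.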

Diving into The Design

High-Level Design

At the high level, we are going to have an application layer and a storage layer. The application layer will be made up of a web server that will forward requests to the read or write server. The storage layer will be made up of a relational database for storing meta information and an object store/file server for storing content.

Core Component Design

Handling Write Requests (New Paste Creation)

Upon receiving a write request, we generate a unique URL that acts as the key to the paste's content. In essence, we use a SQL database as a hash table: we store the unique URL as the key, along with meta information such as the creation time and expiration date and, most importantly, a link to where the content is stored (a file path if a file server is used, or the object's URL if we use a managed object store like Amazon S3).

When the client sends a create-paste request to the web server, the following steps take place:

The web server forwards the request to the write server.
The write server first generates a unique URL key, for example by Base64-encoding the MD5 hash of the current timestamp combined with a nonce (such as a random value or the client's IP address). A URL-safe Base64 alphabet should be used, since standard Base64 output can contain "+" and "/".
We check the uniqueness of the URL against the DB and keep regenerating URLs until we get a unique one.
The generated URL is then inserted into the SQL table along with the other meta information.
The paste data is then stored in the object store.
Finally, the write server returns the newly created URL.
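The write steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: an in-memory SQLite table stands in for the SQL database, a dict stands in for the object store, and the names generate_key, create_paste, expires_at and the 8-character key length are hypothetical. Instead of a separate lookup, the sketch relies on the primary-key constraint to detect collisions, which avoids a race between the check and the insert.

```python
import base64
import hashlib
import secrets
import sqlite3
import time

# Stand-ins for the real storage layer (assumptions for this sketch).
db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE pastes (
           short_link TEXT PRIMARY KEY,
           created_at REAL NOT NULL,
           expires_at REAL,            -- NULL means the paste never expires
           paste_path TEXT NOT NULL
       )"""
)
object_store = {}  # paste_path -> content; would be a file server or S3


def generate_key(length=8):
    """URL-safe Base64 of the MD5 hash of a timestamp plus a random nonce."""
    raw = f"{time.time_ns()}:{secrets.token_hex(8)}".encode()
    digest = hashlib.md5(raw).digest()
    return base64.urlsafe_b64encode(digest).decode()[:length]


def create_paste(content, ttl_seconds=None):
    """Store a paste and return its short link."""
    now = time.time()
    expires_at = now + ttl_seconds if ttl_seconds is not None else None
    while True:
        key = generate_key()
        path = f"pastes/{key}"
        try:
            with db:  # commits on success, rolls back on error
                db.execute(
                    "INSERT INTO pastes VALUES (?, ?, ?, ?)",
                    (key, now, expires_at, path),
                )
            break
        except sqlite3.IntegrityError:
            continue  # key already taken: regenerate and retry
    object_store[path] = content
    return key


link = create_paste("print('hello world')", ttl_seconds=3600)
print(link)  # an 8-character key; the value differs on every run
```

MD5 is used here only to spread keys evenly, not for security. Truncating the hash shortens the URL but makes collisions more likely, which is why the retry loop is needed.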

Handling Reads

When a client sends a read request (by opening a paste's URL), the web server forwards it to the read server.

The read server queries the SQL database for the URL.
If the URL exists, the paste content is retrieved from the object store.
Otherwise, it returns an error.
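The read path above can be sketched in the same style. As before, the in-memory SQLite table and the dict are stand-ins for the real database and object store, and read_paste, PasteNotFound and the sample key "abc123" are hypothetical names used only for illustration.

```python
import sqlite3

# Stand-ins for the metadata database and the object store (assumptions).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE pastes (short_link TEXT PRIMARY KEY, paste_path TEXT NOT NULL)"
)
db.execute("INSERT INTO pastes VALUES (?, ?)", ("abc123", "pastes/abc123"))
object_store = {"pastes/abc123": "print('hello')"}


class PasteNotFound(Exception):
    """Raised when no paste exists for the requested short link."""


def read_paste(short_link):
    """Look up the metadata row, then fetch the content it points to."""
    row = db.execute(
        "SELECT paste_path FROM pastes WHERE short_link = ?", (short_link,)
    ).fetchone()
    if row is None:
        raise PasteNotFound(short_link)
    return object_store[row[0]]


print(read_paste("abc123"))  # print('hello')
```

In a web server, PasteNotFound would typically be translated into an HTTP 404 response.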

Dealing with expired pastes

We can apply a lazy strategy to avoid overwhelming the database server with frequent deletions. We check for expiration when a paste is read; if the paste has expired, we mark it as such and return an error. A cron job or background service can then periodically delete the expired pastes from the database during periods of low activity.
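A minimal sketch of this lazy strategy is shown below. It keeps the pastes in a plain dict for brevity; the field names (expires_at, expired) and the functions read_paste and purge_expired are assumptions made for this example. The now parameter exists only so the behaviour can be demonstrated with fixed timestamps.

```python
import time

# short_link -> paste record; a stand-in for the metadata table.
pastes = {
    "keep01": {"content": "no expiry", "expires_at": None, "expired": False},
    "old001": {"content": "stale", "expires_at": 100.0, "expired": False},
}


def read_paste(short_link, now=None):
    """Return the content, or None if the paste is missing or expired."""
    now = time.time() if now is None else now
    paste = pastes.get(short_link)
    if paste is None or paste["expired"]:
        return None
    if paste["expires_at"] is not None and paste["expires_at"] <= now:
        paste["expired"] = True  # mark lazily; actual deletion happens later
        return None
    return paste["content"]


def purge_expired(now=None):
    """Periodic cleanup job: delete every paste that has expired."""
    now = time.time() if now is None else now
    doomed = [
        key
        for key, p in pastes.items()
        if p["expired"] or (p["expires_at"] is not None and p["expires_at"] <= now)
    ]
    for key in doomed:
        del pastes[key]
    return len(doomed)


print(read_paste("old001", now=200.0))  # None: expired and now marked
print(purge_expired(now=200.0))         # 1: the expired paste is removed
```

The cleanup job would also need to delete the paste content from the object store, which is omitted here.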

Scaling The System

We will need to scale our system to be able to handle our estimated numbers and still ensure high availability and consistency. We will have to horizontally scale and have multiple webservers, read servers and write servers. In addition to that, there are other considerations to be made.

Caching

We can introduce caching to deal with the large number of reads. We have already estimated that we will get about 10 times as many reads as writes. We can speed up reads by caching the more popular pastes in memory, so our read servers will first check the cache for a paste before hitting the DB.
Some factors to have in mind:

We have to ensure that our cache remains consistent with the original content in the case of modifications or deletions (for example, when a paste expires).
We have to provide a mechanism for replacing entries in the cache when it gets full. We can use the LRU (Least Recently Used) policy, which evicts the paste that was accessed longest ago, so that recently and frequently accessed pastes tend to stay in the cache.
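An LRU cache of this kind can be sketched with Python's collections.OrderedDict. The class name LRUPasteCache and its get/put/invalidate methods are hypothetical; a production system would more likely use a dedicated cache such as Memcached or Redis.

```python
from collections import OrderedDict


class LRUPasteCache:
    """Fixed-capacity cache that evicts the least recently used paste."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, short_link):
        if short_link not in self._items:
            return None  # cache miss: caller falls back to the DB
        self._items.move_to_end(short_link)  # mark as most recently used
        return self._items[short_link]

    def put(self, short_link, content):
        self._items[short_link] = content
        self._items.move_to_end(short_link)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used

    def invalidate(self, short_link):
        """Drop an entry, e.g. when the paste expires or is modified."""
        self._items.pop(short_link, None)


cache = LRUPasteCache(capacity=2)
cache.put("a", "first")
cache.put("b", "second")
cache.get("a")           # "a" is now the most recently used entry
cache.put("c", "third")  # evicts "b", the least recently used
print(cache.get("b"))    # None
print(cache.get("a"))    # first
```

The invalidate method is one simple way to address the consistency concern: the entry is removed whenever the underlying paste changes, so the next read repopulates it from the DB.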

Load Balancing

We can introduce a load balancer between the clients and the web servers to avoid a single point of failure and to distribute the traffic load efficiently across servers.
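As an illustration of how traffic can be distributed, here is a minimal round-robin sketch. Round-robin is only one possible policy, chosen here as an assumption; the class name and the server names are hypothetical, and a real deployment would use an existing load balancer with health checks rather than hand-written code.

```python
import itertools


class RoundRobinBalancer:
    """Hands out servers in a fixed rotating order."""

    def __init__(self, servers):
        servers = list(servers)
        if not servers:
            raise ValueError("at least one server is required")
        self._cycle = itertools.cycle(servers)

    def next_server(self):
        return next(self._cycle)


balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
print([balancer.next_server() for _ in range(4)])
# ['web-1', 'web-2', 'web-3', 'web-1']
```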

Database

We can add extra SQL read replicas to scale reads and handle cache misses. Since we have a relatively low volume of writes (paste creations), a single SQL write master with a slave for failover should suffice.

If we use a managed object store such as Amazon S3 for storing content, it can comfortably scale to handle our size estimates.

For Analytics, we could use existing solutions to scale, like Google BigQuery and Amazon Redshift.

Conclusion

With this article at OpenGenus, we have dissected how a system like Pastebin could be designed and looked at the factors to consider in order to scale that design.