In this article, we will look into the system design of GitHub, the protocols it supports, and how requests over each protocol are handled.
Table of Contents
- A little about Git
- Functional Requirements
- Non-Functional Requirements
- Protocols Supported by GitHub
- Handling HTTPS Requests
- Handling SSH Requests
- Handling GIT Requests
GitHub is a repository hosting service. It provides distributed version control and Source Code Management using Git. It helps team members to work together on a project and collaborate.
GitHub is one of the most popular software tools used by developers. It allows them to store source code remotely and keep track of changes. It works with all popular programming languages, streamlines the iteration process, and helps team members stay organized and on the same page. There are millions of repositories on GitHub, and more are added every minute.
With that said, GitHub is proprietary software and its internals are not public. Hence, relatively little is known about how the system works internally.
A little about Git
Git is an open-source, distributed version control system designed in 2005 by Linus Torvalds, the creator of the Linux kernel. The main goals of Git are tracking changes in a set of files and allowing collaboration among programmers. GitHub is deeply rooted in the principles of Git. It provides remote storage for Git repositories.
Functional Requirements
- Allow users to push, pull, clone, and fork repositories.
- Create new branches and merge branches into one.
- Track the development history and allow users to go back to an earlier point.
- Make private repositories accessible only to specified users.
- Allow registered users to contribute to repositories.
These are the bare minimum features a software hosting website like GitHub must offer. GitHub has many other features, such as hosting static websites (GitHub Pages), the GitHub Education program, GitHub Sponsors, etc.
Non-Functional Requirements
As the user base grows and the number of repositories increases, the addition of new nodes and databases must be painless.
The system should be able to handle hundreds of millions of requests to pull and push repositories with minimal delay. The requests may be queued for a short period if no server is available. It is worth noting that a website like GitHub is mainly write-heavy.
GitHub should be highly available with minimal downtime. Many companies around the world depend on GitHub, and prolonged downtime can cause losses and a bad user experience.
It must be ensured that the code base is always consistent with the latest commits.
Protocols Supported by GitHub
GitHub supports three different protocol requests, namely HTTPS, SSH, and GIT.
HTTPS allows you to do pretty much anything: access the GitHub website, push and pull repositories, edit account details, and so on. It uses a password for authentication and also allows pulling public repositories anonymously. HTTPS verifies the server automatically using certificates issued by SSL/TLS certificate authorities. However, certificate authorities have been compromised in the past, so some people consider this verification not secure enough.
The downside of using HTTPS is that, unless you configure a credential helper, you have to enter your credentials every time you push. If two-factor authentication is enabled, you use a Personal Access Token instead of your regular password. Plain HTTP is no longer supported by GitHub.
SSH uses public-key authentication. Using keys is more secure than passwords since you can have a separate key for each computer. To use SSH, you generate a key pair on your computer and add the public key to your GitHub account. SSH allows pushing and pulling repositories but not editing account details.
The downside of SSH is that whoever obtains your private SSH key can push to your repositories. Another downside is that authentication is needed for all connections, so you need a GitHub account even to pull or clone.
The Git protocol is similar to SSH but has no authentication at all and runs on port 9418. You cannot push over the Git protocol on GitHub. In general, a Git server can enable push access over this protocol but, without authentication, anyone who knows the project's URL can push to the repository.
Let's see how GitHub handles each of these requests behind the scenes.
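To make the three protocols concrete, here is a minimal Python sketch that classifies a remote URL by transport and fills in the default port. The function name `classify_remote` and the example repository are illustrative assumptions, not part of Git or GitHub; only the port numbers (443, 22, 9418) are standard.

```python
from urllib.parse import urlparse

# Default ports per transport; 9418 is the Git protocol port mentioned above.
DEFAULT_PORTS = {"https": 443, "ssh": 22, "git": 9418}


def classify_remote(url: str) -> dict:
    """Return protocol, host, port and repository path for a remote URL.

    Hypothetical helper for illustration only.
    """
    # scp-like SSH syntax (git@host:owner/repo.git) has no "://" scheme.
    if "://" not in url and "@" in url and ":" in url:
        user_host, path = url.split(":", 1)
        host = user_host.split("@", 1)[1]
        return {"protocol": "ssh", "host": host, "port": 22, "path": path}

    parsed = urlparse(url)
    if parsed.scheme not in DEFAULT_PORTS:
        # Plain http:// lands here, matching the note that HTTP is unsupported.
        raise ValueError(f"unsupported protocol: {parsed.scheme!r}")
    return {
        "protocol": parsed.scheme,
        "host": parsed.hostname,
        "port": parsed.port or DEFAULT_PORTS[parsed.scheme],
        "path": parsed.path.lstrip("/"),
    }
```

Calling `classify_remote("git@github.com:octocat/Hello-World.git")` takes the scp-like branch and reports the SSH transport, while an `http://` URL raises `ValueError`.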
Handling HTTPS Requests
The client sends a request, which first reaches the load balancer. The load balancer distributes the traffic across the servers in the pool, reducing the load on each individual server. GitHub uses a pair of Xen instances running ldirectord. One of them is the master and the other takes over if the master fails. The load balancer can also serve a simple static page in case of errors.
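The master/standby arrangement can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `Balancer` class, the `route` function, the round-robin policy, and the server names are all hypothetical, and it does not reflect how ldirectord is actually configured.

```python
from dataclasses import dataclass


@dataclass
class Balancer:
    """Hypothetical load balancer instance (master or standby)."""
    name: str
    backends: list
    healthy: bool = True
    _next: int = 0

    def pick_backend(self, is_up) -> str:
        """Round-robin over front-ends that pass the health check."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
            if is_up(backend):
                return backend
        # No front-end is reachable: fall back to the static error page.
        return "static-fallback-page"


def route(master: Balancer, standby: Balancer, is_up):
    """Use the master while it is healthy, otherwise the standby."""
    active = master if master.healthy else standby
    return active.name, active.pick_backend(is_up)
```

Setting `master.healthy = False` makes every later `route` call go through the standby, which mirrors the takeover described above.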
The load balancer hands the request over to one of the many front-end servers. NGINX, running on the front-end server, accepts the connection and passes it to a shared UNIX domain socket. Unicorn is an HTTP server for Rails applications. One of the sixteen Unicorn worker processes listening on that socket picks up the request and runs the Rails code needed to handle it.
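The idea of several workers pulling requests from one shared entry point can be illustrated with a small Python analogy. Note the assumptions: Unicorn uses forked processes on a shared socket, whereas this sketch uses threads and an in-memory queue, and the `serve` function and `handler` parameter are hypothetical names.

```python
import queue
import threading

WORKER_COUNT = 16  # the article mentions sixteen Unicorn workers


def serve(requests, handler, worker_count=WORKER_COUNT):
    """Process requests with a pool of workers sharing one queue."""
    shared = queue.Queue()          # stands in for the shared UNIX socket
    responses = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = shared.get()
            if item is None:        # sentinel: no more work
                break
            request_id, request = item
            result = handler(request)
            with lock:
                responses[request_id] = result

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for request_id, request in enumerate(requests):
        shared.put((request_id, request))
    for _ in threads:
        shared.put(None)            # one sentinel per worker
    for thread in threads:
        thread.join()
    # Return responses in the order the requests arrived.
    return [responses[i] for i in range(len(requests))]
```

Whichever worker is idle takes the next request, so no separate dispatcher is needed; that is the property the shared socket gives Unicorn.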
Data about the different pages is stored in a MySQL database server. The database server is organized in a master-slave configuration, and data replication is achieved with DRBD (Distributed Replicated Block Device).
If data about a Git repository is needed, for example when loading a repository page, and it is not already cached in Memcached, the Grit library is used to retrieve it. The calls that need to access the filesystem are abstracted into a Grit::Git object, which is replaced by a stub (substitute code) that makes RPC calls to a service called Smoke. Smoke has direct disk access to the repositories.
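The cache-then-RPC pattern described here can be sketched as follows. This is an illustration in Python rather than the Ruby that Grit is written in; `SmokeClient`, `GitStub`, and the cache key format are hypothetical names, and a plain dict stands in for Memcached.

```python
class SmokeClient:
    """Hypothetical RPC client; a dict stands in for the remote service."""

    def __init__(self, repositories):
        self._repositories = repositories
        self.calls = 0  # counts how many RPC round trips were made

    def call(self, repo, command):
        self.calls += 1
        return self._repositories[repo][command]


class GitStub:
    """Stub that replaces direct filesystem access with cache + RPC."""

    def __init__(self, cache, rpc):
        self.cache = cache  # dict standing in for Memcached
        self.rpc = rpc

    def run(self, repo, command):
        key = f"{repo}:{command}"
        if key in self.cache:            # cache hit: no RPC needed
            return self.cache[key]
        result = self.rpc.call(repo, command)
        self.cache[key] = result         # populate the cache for next time
        return result
```

Because the stub exposes the same `run` interface whether the data comes from the cache or from the remote service, the page-rendering code does not need to know where the repositories live.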
Smoke is a load-balanced hostname that maps back to the front-end servers. Each front-end runs several ProxyMachine instances behind HAProxy. The proxy finds the username for a given repository. It then uses a proprietary library called Chimney to find the route for the user. The route is simply the hostname of the file server where the user's repositories are kept. Chimney uses Redis to find the hostnames.
Once the hostname is determined, the Smoke service acts as a transparent proxy to the file server. The file servers are organized as four master-slave pairs. The result is sent back to the Grit stub, which returns the response.
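The routing step can be sketched as a lookup from repository owner to file server hostname. Chimney is proprietary, so everything below is an assumption made for illustration: the function names, the hostnames, and the use of a dict in place of Redis.

```python
# Stand-in for the Redis table: repository owner -> file server hostname.
ROUTES = {
    "octocat": "fs1.example.internal",
    "torvalds": "fs3.example.internal",
}


def owner_of(repo_path: str) -> str:
    """Extract the owner from a path such as 'octocat/Hello-World.git'."""
    return repo_path.strip("/").split("/", 1)[0]


def find_route(repo_path: str, routes=ROUTES) -> str:
    """Return the hostname of the file server holding the owner's repos."""
    owner = owner_of(repo_path)
    if owner not in routes:
        raise LookupError(f"no route for user {owner!r}")
    return routes[owner]


def smoke_proxy(repo_path, rpc_request, file_servers, routes=ROUTES):
    """Forward the RPC unchanged to the file server chosen by the route."""
    host = find_route(repo_path, routes)
    return file_servers[host](repo_path, rpc_request)
```

Keying the route on the owner rather than on the repository means all of a user's repositories live on the same file server, which is what the text above describes.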
Once Unicorn has finished handling the request, NGINX sends the response directly to the client without going back through the load balancer. The front-end server can also redirect the request to a static website hosted on GitHub Pages.
Handling SSH Requests
GitHub's SSH support relies on the fact that SSH allows you to execute commands on a remote server and view the output in your local terminal. Of course, allowing the execution of arbitrary commands on your remote server is not a good idea. For this reason, GitHub restricts access to a Git-only shell, similar to the git-shell included with Git. GitHub uses a proprietary replacement for git-shell with some added functionality, called Gerve (Git sERVE).
The SSH request hits the load balancer and is passed to one of the front-end servers. There, an SSHD (SSH daemon) accepts it. It then performs a lookup in the MySQL database to find the user corresponding to the public key. The user information, along with the original command, is sent to Gerve.
Gerve first verifies that the user has appropriate permissions to access the repository given in the command. It then uses Chimney to find the route for the user who owns the repository. Once the route has been found, Gerve simply replaces itself with another SSH call to the correct file server. For example:
ssh git@<route> <original_command> <arguments>
The front-end then acts as a transparent proxy between the client and the Git process running on the file server.
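The Gerve steps (validate the command, check permissions, look up the route, re-issue the command over SSH) can be sketched like this. Gerve itself is proprietary, so the function name, the permission table, and the hostnames are hypothetical; the three allowed commands are the ones the stock git-shell accepts. The sketch returns the argument list instead of actually replacing the process.

```python
import shlex

# Commands the stock git-shell accepts; anything else is rejected.
ALLOWED_COMMANDS = {"git-upload-pack", "git-receive-pack", "git-upload-archive"}


def gerve(user, original_command, permissions, routes):
    """Return the ssh argv that would replace this process.

    permissions: {(user, repo): {"read", "write"}}  (hypothetical table)
    routes:      {owner: file_server_hostname}      (stand-in for Chimney)
    """
    args = shlex.split(original_command)
    if len(args) != 2 or args[0] not in ALLOWED_COMMANDS:
        raise PermissionError("only git commands are allowed")
    command, repo = args

    # Pushing (git-receive-pack) needs write access; everything else reads.
    needed = "write" if command == "git-receive-pack" else "read"
    if needed not in permissions.get((user, repo), set()):
        raise PermissionError(f"{user} lacks {needed} access to {repo}")

    owner = repo.strip("/").split("/", 1)[0]
    route = routes[owner]
    # A real implementation would exec this (e.g. via os.execvp).
    return ["ssh", f"git@{route}", command, repo]
```

Returning the argument list keeps the sketch testable; passing it to `os.execvp("ssh", argv)` would perform the self-replacement described above.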
Handling GIT Requests
The Git protocol works similarly to SSH but uses a Git daemon instead of SSHD. The request first hits the load balancer and is passed to one of the front-ends. The front-end then passes it to one of the ProxyMachine instances. The proxy examines the request and uses Chimney to find the route for the user named in the request.
The repository name is then translated to its path on disk and sent to the file server. ProxyMachine now acts as a transparent proxy for the Git client: it speaks the Git protocol and streams the data back to the client.
Git protocol allows only pull functionality from GitHub.
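To show what "examining the request" might involve, here is a Python sketch that parses the first packet a Git client sends over `git://` and maps the repository name to a disk path. The pkt-line framing (four hex digits of length, then the payload) is part of the Git wire protocol; the storage root `/data/repositories`, the two-level layout, and the function names are assumptions made for illustration.

```python
import posixpath


def pkt_line(payload: bytes) -> bytes:
    """Frame a payload as a pkt-line: 4 hex digits of total length + data."""
    return f"{len(payload) + 4:04x}".encode("ascii") + payload


def parse_git_request(data: bytes):
    """Parse the initial git:// request into (command, path, host)."""
    length = int(data[:4].decode("ascii"), 16)
    payload = data[4:length]
    command_and_path, *extras = payload.split(b"\0")
    command, path = command_and_path.decode().split(" ", 1)
    if command != "git-upload-pack":
        # Only pulls are served over this protocol in this sketch.
        raise PermissionError(f"{command} is not allowed over git://")
    host = None
    for extra in extras:
        if extra.startswith(b"host="):
            host = extra[len(b"host="):].decode()
    return command, path, host


def repo_disk_path(path: str, root: str = "/data/repositories") -> str:
    """Translate '/owner/repo.git' to a path under a hypothetical root."""
    parts = [part for part in path.split("/") if part]
    if len(parts) != 2 or ".." in parts:
        raise ValueError(f"invalid repository path: {path!r}")
    return posixpath.join(root, *parts)
```

A request built with `pkt_line(b"git-upload-pack /octocat/Hello-World.git\0host=github.com\0")` parses back into its command, path, and host, and a `git-receive-pack` request is rejected, which matches the pull-only behaviour noted above.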
With this article at OpenGenus, you should now have a good idea of the system design of GitHub.