Table of contents
- Code Deployment System
- High-level design
- Database solution
- Capacity estimates
- Definitions and technologies
In this article at OpenGenus, we aim to design a code deployment system such as Github Actions or Jenkins.
The main task of such a product is to deploy code from development to production environments efficiently and reliably, as well as manage code changes when they happen.
Code Deployment System
Here are some of the main functionalities of a code deployment system, without which such a product couldn't operate:
- Building the code
  - Transforming the code into binary
  - Running database migrations
  - Configuring any environment variables ( if necessary )
- Storing binary
- Deploying the code
  - Accounting for all regions
  - Sharing code across machines
Let's go ahead and take each functionality and discuss it at a high level.
Pulling code from VCS
In order to do this, our program will have to run a command that pulls the latest version of the code from the Version Control System whenever the build button is pressed. We will also have to keep track of the repository in question, perhaps arranging for an external program to check it every 10 seconds for new changes. If there are any, the command is run again.
Here is what that command would look like if the user were on GitHub and had bundled all of the code on the main branch:
git pull origin main
Needless to say, when the user selects their VCS, they also have to be able to choose the repository that they want to pull from, as well as the target branch on that repository. This information should be stored in the database.
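To make the polling idea concrete, here is a minimal sketch in Python. The repository path, the branch name, and the 10-second interval are assumptions for illustration, and shelling out to `git` stands in for whatever VCS the user selected:

```python
import subprocess
import time

REPO_DIR = "/srv/app-repo"   # assumed path of the local clone
BRANCH = "main"              # branch selected by the user

def needs_pull(local_sha: str, remote_sha: str) -> bool:
    # A pull is needed whenever the local and remote heads differ.
    return local_sha != remote_sha

def head_sha(ref: str) -> str:
    # Resolve a ref ( e.g. "main" or "origin/main" ) to a commit hash.
    out = subprocess.run(["git", "rev-parse", ref], cwd=REPO_DIR,
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

def poll_forever() -> None:
    while True:
        # Refresh the remote refs without touching the working tree.
        subprocess.run(["git", "fetch", "origin", BRANCH],
                       cwd=REPO_DIR, check=True)
        if needs_pull(head_sha(BRANCH), head_sha(f"origin/{BRANCH}")):
            subprocess.run(["git", "pull", "origin", BRANCH],
                           cwd=REPO_DIR, check=True)
        time.sleep(10)  # the 10-second interval mentioned above
```

A real system would run this as a daemon per tracked repository, or use webhooks from the VCS provider instead of polling.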
Building the code
When talking about this category, we are going to refer to the steps that we mentioned beforehand:
Transforming the code into binary
This is, surely, the most lengthy discussion we're going to have here.
First of all, let's talk about binary, outline what it is and why we need it.
What is binary?
Binary is a base 2 numeral system, which only uses two symbols, namely 0 and 1. Usually, 1 is used to represent an "on" state, and 0 an "off" state.
Why do we need binary?
In computing, we need binary because it is the fundamental language of the computer. In other words, it represents the only language all of our computers understand. Therefore, we need to adhere to it in order to deploy our applications to machines.
We could talk about two types of languages here, those being low-level and high-level.
We are usually going to be targeting the transformation of high-level languages into fully functioning applications. We could also talk about the process involved in building a low-level language, though.
Low-level programming languages
An example of a low-level language would be Assembly. In order to build it to binary code, you would need to use an assembler, which you would choose based on CPU architecture and operating system. For a product such as ours, if we wanted to deal with low-level languages, we would have to provide multiple assemblers, suiting all needs.
Then, after figuring out the proper assembler, we would need to run a command specific to it. ( Look at Assembler for specifics )
After going through these steps, we should have an output file of binary code!
High-level programming languages
High-level languages are a bit harder to deal with than low-level ones. That is mainly because, as a language becomes more high-level, more code is abstracted away and moved closer to human language, which makes it harder and harder for the computer to understand.
Some popular high-level languages are Python, C++, Java, and many others. Of course, there are different degrees of high-level here, too. For example, compared to Python, C++ is definitely a lower-level language. Nevertheless, next to something like C, C++ is a high-level language. It's all about perspective.
Let's get back to our goal: transforming high-level languages into binary code that can be understood by all of our machines.
For compiled languages, what the compiler does is translate the code it's given into assembly code. Then, the process that we discussed for low-level languages can be applied.
For interpreted languages, on the other hand, there are a few more steps involved. First, the code is parsed and analyzed into an abstract syntax tree ( AST ). Then, the code is executed ( interpreted ) directly from the AST, or from bytecode generated from it, with machine code being produced only at run time.
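As a small illustration of the parse-to-AST-to-execution pipeline, Python's own `ast` module can show these stages on a tiny expression:

```python
import ast

source = "2 * (3 + 4)"
tree = ast.parse(source, mode="eval")   # parse the text into an abstract syntax tree
code = compile(tree, "<demo>", "eval")  # lower the AST to bytecode
result = eval(code)                     # execute the bytecode: 14
```

Here `tree.body` is a `BinOp` node whose children mirror the structure of the expression, which is exactly the representation an interpreter walks or compiles further.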
Running database migrations
As is almost everything we've discussed before, this is highly dependent on the language and database being used. Within our product, we need to automate this process by running queries that migrate the database on build.
We can't go over all of the possibilities, but what we can do is give a quick example of how this might happen:
For example, if we were to use something such as Django, with its default SQLite database, we would have to run database migrations in two steps:
python manage.py makemigrations
This generates the migration files from the model changes, but doesn't make any changes to the database just yet.
python manage.py migrate
This is the command that actually alters the database.
Configuring environment variables
In order to shorten the time spent typing out various operations, we might want to add environment variables. This way, we won't waste additional time hunting for a specific folder path every time we run a command needed to make the application work.
In our system, we could find the path and, based on the operating system, run a command that adds it to the environment variables.
Here is how we would do it in Windows:
setx VAR PATH
Here, "VAR" represents the variable's name, and "PATH" represents its value, in this case a folder path.
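A cross-platform sketch of this step might look as follows. The `~/.profile` location and the shape of the export line are assumptions for illustration; a production system would also handle other shells and permission errors:

```python
import os
import platform
import subprocess

def export_line(name: str, value: str) -> str:
    # The line we would append to a Unix shell profile.
    return f'export {name}="{value}"'

def set_env_var(name: str, value: str) -> None:
    if platform.system() == "Windows":
        # Persist the variable for future sessions via setx.
        subprocess.run(["setx", name, value], check=True)
    else:
        # On Unix-like systems, append an export line to the profile.
        with open(os.path.expanduser("~/.profile"), "a") as f:
            f.write("\n" + export_line(name, value) + "\n")
    # Make the variable visible to the current process as well.
    os.environ[name] = value
```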
The next question that probably comes to mind is, who will be doing all of the operations we have just talked about? Sure, we know how they need to be done, but we do need to have some kind of service that is set up to do exactly what we've talked about.
Here, we're going to go for a solution that makes use of multiple workers, considering the massive volume of content we need to convert to binary.
First of all, we will need to set up some kind of a table in our database to store the multiple versions of code. In other words, whenever the code is updated in the VCS, the code gets pulled and gets transformed into a job.
In order to store these jobs, we will be using a queue, which operates in a FIFO manner ( First In, First Out ). This queue will store all of the jobs created from code pulled from the VCS.
The multiple workers we talked about will operate in a predictable way - each of them will take the first job they find in the queue, process it, and send it to a storage location. Afterwards, it's just rinse and repeat for each of them.
A few questions that may pop up here:
- How do we make sure two workers don't pick the same job? => If we go by best practices and pick a SQL database, each worker can claim a job inside a single ACID transaction, so two workers can never dequeue the same job.
- What if a worker breaks down and we don't notice? => This is something that definitely has to be tackled. Perhaps we could add an extra field to each job, called checkup. Every 5 minutes, each worker updates this field, and an external system verifies that the update actually arrived. If it didn't, that system can restart the worker or simply raise an alert about the malfunction.
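A minimal sketch of such an atomic claim, using an in-memory SQLite queue. The table layout and status values are illustrative rather than the final schema:

```python
import sqlite3

def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("""CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        code_sha TEXT,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        status TEXT DEFAULT 'QUEUED')""")

def claim_next_job(conn: sqlite3.Connection, worker_id: str):
    with conn:  # a single transaction makes the claim atomic
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 'QUEUED' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # Re-checking the status in the WHERE clause means that if
        # another worker claimed the job first, zero rows are updated
        # and this worker simply gets nothing.
        cur = conn.execute(
            "UPDATE jobs SET status = ? WHERE id = ? AND status = 'QUEUED'",
            (f"RUNNING:{worker_id}", row[0]),
        )
        return row[0] if cur.rowcount == 1 else None
```

On a full-scale SQL database such as PostgreSQL, the same claim can be expressed even more directly with `SELECT ... FOR UPDATE SKIP LOCKED`.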
We briefly mentioned that the workers would need to store the binary code in some kind of location. The obvious solution here would be a blob store ( Blob Store ).
Here is a high-level representation of this functionality:
Accounting for all regions
We'd probably like the applications that we deploy to be available all over the world. This is not really possible if we only use storage in a single place: serving code from a blob store in the USA to a machine in France, for example, adds a lot of latency. Something more local, perhaps a server dedicated to the European region, would work better.
Thus, we can go ahead and split our system into smaller subsystems, each based in a different region.
We can do this by replicating the binary from the main blob store into smaller, regional blob stores that we create. In short, once the build succeeds and the binary is uploaded to the main blob store, we can replicate it asynchronously across regions.
How to keep track of whether the code has been replicated in a specific region?
Here, we would need to set up some kind of service whose job is to check the replication status of the binary code. We would probably need to store both the main build version and the subsystems' versions in our database. Then, the service's job would be to check whether those match up: if they do, it does nothing; otherwise, it updates that region's blob store.
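A minimal sketch of that checking service. Here `copy_blob` is a hypothetical stand-in for the blob store's copy API, and the region names are invented for illustration:

```python
def sync_regions(main_version: str, regions: dict, copy_blob) -> list:
    """Compare each region's replicated version with the main store's
    version and re-replicate any region that is stale.

    regions maps region name -> currently replicated version.
    Returns the list of regions that were updated.
    """
    updated = []
    for region, version in regions.items():
        if version != main_version:
            copy_blob(region, main_version)   # push the fresh binary
            regions[region] = main_version    # record the new version
            updated.append(region)
    return updated
```

In production this loop would run periodically ( or be triggered by a successful build ) and record the replication status back into the database.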
Sharing code across machines
In order to share code across machines, we have to find some kind of protocol through which we can do so efficiently. Just making other servers download the whole source code is unreasonable, considering that the average amount of content is 15GB.
We can use a P2P ( Peer to Peer ) network in order to reduce the load and improve download speeds.
How do we use a P2P protocol?
First of all, we need to select the proper technology for our product ( Check out P2P for suggestions ).
Then, we have to "seed" the file. Seeding means making the binary code available to others by hosting it on a server, which will be responsible for sharing it.
Usually, a P2P protocol will require a torrent file, which is a form of documentation of the binary code, including information such as its size and piece structure. When the torrent file is accessed, the user's machine begins downloading the code in small chunks.
Needless to say, each regional blob store will use this and make the said torrent file available to the machines closest to it in the network of blob stores.
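To make the torrent idea concrete, here is a hypothetical sketch of the core information such a file records: the binary is split into fixed-size pieces, and each piece's hash lets a downloader verify chunks fetched from different peers. Real torrents use much larger pieces ( e.g. 256 KB ) and a bencoded format:

```python
import hashlib

PIECE_SIZE = 4  # bytes, kept tiny for illustration

def make_manifest(binary: bytes) -> dict:
    # Split the binary into fixed-size pieces and hash each one.
    pieces = [binary[i:i + PIECE_SIZE]
              for i in range(0, len(binary), PIECE_SIZE)]
    return {
        "length": len(binary),
        "piece_size": PIECE_SIZE,
        "piece_hashes": [hashlib.sha1(p).hexdigest() for p in pieces],
    }
```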
Here is a summary of our discussion:
Throughout the explanation of each function, we've briefly mentioned tables that need to be created, but didn't expand further on what fields are required. Now, we'll focus on the optimal database design for our product.
As mentioned before, we're going to go with a SQL database, since it ensures concurrency, which we need in our operations.
First of all, we will need to keep track of the repository of each application. We are going to fill this table with an id, the current version, and a checkup. The latter is extremely important because, as discussed in the corresponding section, it is what prompts the external service to check for a new version after a predetermined time.
Secondly, we discussed the jobs queue, which is made up of job objects. We are going to need a job table that contains an id for the job, some kind of pointer to the current version of the code, such as a SHA, a timestamp at which the job was created, a status, and a checkup.
Lastly, we've mentioned another table in the deployment section. We will need to store a main storage record for each application, which would likely contain some kind of id and the build version.
Moreover, we'll need a table for the regional storages, which will have an id and a replication status that indicates whether or not the object is the exact replica of the main storage. The latter is extremely useful, because it is a clear indicator that an external system has to replicate the code from the main storage.
This is the overview of the database:
Let's take some average, reasonable measurements for our product and try to calculate the number of workers needed for the job. We'll say that there are 1000 deploys daily and each file has about 15GB worth of content. Let's consider the average building speed to be 10MB/s.
15,000 MB / 10 MB/s = 1500 seconds per build
86,400 s / 1500 s = 57.6 ~ 60 builds per worker daily
1000 deploys daily / 60 builds per worker = 16.6 ~ 20 workers
If the average file is 15GB, and the average building speed of a worker is 10MB / s, then, if we divide them, we'll get how many seconds each build takes on average. In our case, each build takes about 1500 seconds, which is equal to 25 minutes.
If we divide the number of seconds in a day ( 86400 ) by the number of seconds it takes to compute an average build, then we'll get the number of builds that each worker can perform daily. The result here is roughly 60 builds that can be performed by each individual server.
Then, our estimation for the app is 1000 deploys daily. If we divide that by the number of builds each worker can do daily, then we'll get the minimum number of workers required in order for the application to work.
We can also take a bit of a safety measure, and increase this number by a fourth:
Result: 25 workers
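The arithmetic above, reproduced as a quick script:

```python
avg_build_mb = 15_000      # 15 GB of content per build
speed_mb_s = 10            # average build speed per worker
deploys_per_day = 1000
seconds_per_day = 86_400

seconds_per_build = avg_build_mb / speed_mb_s            # 1500 s ( 25 minutes )
builds_per_worker = seconds_per_day / seconds_per_build  # 57.6, call it 60
min_workers = deploys_per_day / 60                       # 16.6, rounded up to 20
with_margin = 20 + 20 // 4                               # 25 workers with a 25% margin
```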
Definitions and technologies
Version Control Systems (VCSs)
A VCS keeps track of the changes to your files over time. It also allows you to collaborate with a team and keeps a historical record of the project's source code.
Some popular version control systems are Git, Mercurial, and Subversion.
We argued that the best database to use would be a relational one in this situation.
Here are some of the most popular SQL-based databases:
- Oracle Database
- MySQL
- PostgreSQL
- Microsoft SQL Server
The FIFO principle is a method of organizing data based on arrival order. In a FIFO system, the first item that is added is also the first one processed.
The FIFO principle is used in data structures such as queues and buffers.
First of all, we need to find the proper assembler for the user's operating system. Here are some of the most popular:
- NASM ( Netwide Assembler ) - supports Windows, Linux, macOS
- GAS ( GNU Assembler ) - supports Linux, macOS
- MASM ( Microsoft Macro Assembler ) - mainly supports Windows, but it can be used on various platforms with additional tools. For example, it can be used on Linux with the help of an external service such as Windows Subsystem for Linux ( WSL )
Then, after choosing the proper assembler, we would need to run some kind of a specific query, as we've talked about in depth previously.
Here is how that query would look like if we were using an assembler such as NASM:
nasm -f bin input.asm -o output.bin
Here, "-f bin" specifies the output format, and "input.asm" is the file containing the assembly code we want converted. The "-o" flag names the output file, in our example "output.bin".
"Blob store" is short for "binary large object store". Such systems are designed specifically for handling binary data, usually large files, which is why they are such a great fit for our product.
Some popular blob stores from the some of the most influential tech companies include:
- Amazon S3 ( Amazon )
- Azure Blob Storage ( Microsoft )
- Google Cloud Storage ( Google )
ACID stands for Atomicity, Consistency, Isolation and Durability. This is usually a characteristic of SQL databases.
Let's explain each word from the acronym:
- Atomicity => the transaction is treated as a single unit; no changes are applied unless all of them can be committed
- Consistency => transactions must follow predefined rules, such as foreign key relationships, so the database always ends up in a valid state
- Isolation => transactions do not interfere with each other
- Durability => once committed, changes are permanent
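Atomicity in particular is easy to demonstrate with SQLite, using an illustrative `builds` table: when one statement in a transaction fails, the whole unit is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE builds (id INTEGER PRIMARY KEY, version TEXT NOT NULL)")
try:
    with conn:  # everything inside the block is one transaction
        conn.execute("INSERT INTO builds (version) VALUES ('v1')")
        conn.execute("INSERT INTO builds (version) VALUES (NULL)")  # violates NOT NULL
except sqlite3.IntegrityError:
    pass  # the failed transaction was rolled back as a whole
# Neither row was committed: even the first, valid insert was undone.
count = conn.execute("SELECT COUNT(*) FROM builds").fetchone()[0]
```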
A Peer-to-Peer protocol is a set of conventions that establish how machines communicate. In a P2P network, each participant has an equal status and can be both a client and a server. In other words, each user can use products shared by a P2P protocol, or deploy such products.
A great P2P protocol is BitTorrent. This is by far the most popular, and it makes use of the torrent file we discussed previously.
One major family of technologies built on peer-to-peer networking is blockchain.
A code deployment system is no easy product to make. Each step has to be researched thoroughly, much more in depth than this article has gone into.
I hope this has made you curious to research additional secondary functionalities, or even improve on the ones discussed here.