In this article, we will focus on partition tolerance when designing a system and how the idea of partition tolerance changed along with the evolution of internet connection, this idea of choosing 2 of 3 items on the CAP theorem still exists? if so, is it still the same as it was in the 2000s? Is there a way of having partition tolerance without any failure? Let’s talk about it.
What is partition?
Partition is a network communication error that happens in distributed systems, an example: you were writing an article in a website that edits your documents, and your internet goes down, now you lost all your progress because you didn’t save what anything.
What is partition tolerance?
I’m sure you have heard about the CAP theorem before, but let’s refresh the concept of letter p. Partition is basically the tolerance to network partitions, this means that even when the system communication fails between some nodes, the whole system will continue to work. Want an example? Imagine you are playing a multiplayer game that has 14 severs around the world, you choose to play at the server that is the nearest to you, after a few minutes the server connection fails, when you exit the game and come back you have to choose another server, so you are able to play again, because the one you were playing before is not available now.
You can think to yourself, “What a bad game! How come the server will fail when I’m playing?”, sure the tech team has to work on this server fail, but at the same time the whole system continues to work, that is partition tolerance.
There is a point that we didn’t discuss in this example, when this server that failed comes back working, will the game that you played count in your rank? Of course, it won’t count, there was a failure! So you can think the more partition tolerance a database or a server is, the less consistent information it has and more available it is.
3 out of 3 in cap theorem
When people discuss having to choose between 2 items in the CAP theorem, almost every time there is a P there, have you thought about why is it like this?
Well think about it, how can a database be consistent and available but doesn’t have partition tolerance? Actually, there is a way! You can have a single monolithic server to store all the data there, the server doesn’t have to get or send information to any other server because he is monolithic.
Well, there is always some downsides… One of them is that you can not expand this system in a good way because it’ll be a single robust node that has all this information, if you want to have an organized large-scale storage you have to use the microservices concept, therefore loosing one of the letters in the CAP theorem.
Another bad aspect of this monolithic way of dealing with data is the latency in the network, a large monolithic database making operations frequently and at the same time, will cause latency because of the amount of operations.
Yeah, microservices are great, but when starting a new project that requires a database and server don’t think about performance or network latency, the best thing to do is creating a prototype first, if at some point your application has to increase performance or become a large-scale storage than you think about all this complex stuff.
Partition tolerance nowadays
Because of the evolution of network communication the CAP theorem choices are much more flexible, although partitions are rare system designers still have a way of turning them around.
A designer can have a strategy that detect partitions or even detect latency and enter a partition mode, recovering the information that was lost or restarting the request and making the functionality available again.
Of course this partition detecting plan has to be analyzed and though it through because in technologize choices has almost always a consequence. For example, a system with a small bound of partition detecting will enter partition-mode more frequently and consume more of the database and the communication network.
A strategy that can make partition less harmful is to use fault tolerance. Fault tolerance is a strategy to keep the system working even when a fail happens, it involves redundant data between multiple servers and databases making information more consistent, this system is configured so when there is a partition another server takes over the place of the primary server that broke down and the client will be sent to this substitute server.
Fault tolerance or partition tolerance
|Pros of fault tolerance
|Cons of fault tolerance
|Pros of Partition tolerance
|Cons of Partition tolerance
|System will never be unavailable
|High cost of hardware
|System won't crash even with failed nodes
|High cost of performance
|Possible restauration of data failure
|Partition detection is hard to configure
Each of these methods of preventing failures has their on possibilities and limits, so when considering adding one of them, think about the context in which the system is in.