How Server Outages Do Not Impact Netflix

The most important requirement when designing a scalable system for over 1B users is that it be resilient and fail-safe. Systems can suffer an outage, a period in which a service is unavailable, for various reasons. One common trigger is a sudden spike in user traffic.

For example, Amazon faces traffic spikes during the festive season but handles them efficiently. In 2011, AWS suffered an outage that had a significant impact. Netflix was exposed to an outage as well, but it had no practical impact on its service. Clearly, Netflix's architecture was special.

We will explore the key points of Netflix's architecture that make it outage-safe:

  1. Netflix uses a stateless service architecture. The idea is that there are multiple servers and any server can handle any request. Because of this, if a node fails, another node can take its place.

  2. Netflix keeps multiple copies of its data across multiple zones. Because the data is replicated, the chance of an outage bringing down the entire system or causing data loss is minimal. For large-scale systems, different zones may mean different countries.

  3. The technique of graceful degradation is used. It rests on three core principles:

  • Fail fast: If a system is struggling, it informs other systems as early as possible so that they can react before there is downtime.
  • Feature fallbacks: Features are layered. If one feature fails, a fallback feature is used in its place.
  • Dropping non-critical features: All non-critical features are switched off when the system is overloaded.
  4. Netflix uses N+1 redundancy. The idea is to set up more servers than required, which lets Netflix absorb sudden spikes in traffic easily. These extra servers stay idle most of the time. Amazon uses its idle servers to serve AWS.

  5. It is wise to build your system on top of strong existing systems. Netflix uses Amazon's S3 to store its data. S3 is resilient to zone failures and is highly reliable.

  6. Netflix designed special features that intelligently route traffic to zones with low load and ensure that traffic is evenly distributed. This keeps every system at a low load and helps avoid outages.

  7. Netflix designed a tool named Chaos Monkey. This tool deliberately kills running services so that weaknesses are found before they cause real outages. It confirms that the system can automatically restart stopped services without manual action.

  8. Netflix uses a novel load-balancing technique that lets it use all available systems to the fullest. The process is automated.
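The graceful-degradation principles listed above (fail fast, feature fallbacks, dropping non-critical features) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Netflix's implementation: it uses a circuit-breaker-style wrapper, and `CircuitBreaker`, `enabled_features`, `personalized_rows`, `popular_rows`, and all thresholds are hypothetical names and values chosen for the example.

```python
import time


class CircuitBreaker:
    """Fail fast: after repeated failures, skip the primary call for a cool-down period."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before the breaker opens
        self.reset_after = reset_after    # cool-down in seconds (assumed value)
        self.clock = clock                # injectable clock, handy for testing
        self.failures = 0
        self.opened_at = None

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: close the breaker and allow a fresh attempt.
            self.opened_at = None
            self.failures = 0
            return False
        return True

    def call(self, primary, fallback):
        if self.is_open():
            # Fail fast: do not touch the struggling dependency at all.
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            # Feature fallback: serve the simpler feature instead of an error.
            return fallback()
        self.failures = 0
        return result


def enabled_features(load, features, threshold=0.8):
    """Keep critical features always; keep non-critical ones only below the load threshold."""
    return [name for name, critical in features if critical or load < threshold]


# Hypothetical usage: a personalized feature that is failing, with a generic fallback.
def personalized_rows():
    raise TimeoutError("recommendation service is struggling")


def popular_rows():
    return ["Trending now"]


breaker = CircuitBreaker(max_failures=2)
for _ in range(4):
    rows = breaker.call(personalized_rows, popular_rows)
print(rows, breaker.is_open())
print(enabled_features(0.95, [("playback", True), ("ratings", False)]))
```

In this sketch the first two calls try the primary feature and fail, which opens the breaker. The remaining calls go straight to the fallback without contacting the failing service. Under high load, `enabled_features` keeps only the features marked critical.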

All of these features were implemented in 2011 and were quite innovative for their time. These techniques helped Netflix handle outages efficiently, with no practical impact on its systems.
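As a closing illustration, the stateless-service, spare-capacity, load-aware routing, and fault-injection ideas described above can be combined in a small simulation. This is a toy model under assumptions, not how Netflix or Chaos Monkey actually work: `Server`, `Pool`, and `kill_random` are hypothetical names, and "load" is approximated by the number of requests a server has handled.

```python
import random


class Server:
    def __init__(self, name, zone):
        self.name = name
        self.zone = zone
        self.alive = True
        self.handled = 0  # requests handled so far, used as a simple load proxy

    def handle(self, request):
        # Stateless: the response depends only on the request, not on which server runs it.
        self.handled += 1
        return f"response to {request}"


class Pool:
    def __init__(self, servers):
        self.servers = list(servers)

    def route(self, request):
        healthy = [s for s in self.servers if s.alive]
        if not healthy:
            raise RuntimeError("no healthy servers left")
        # Load-aware routing: send the request to the least-loaded healthy server.
        target = min(healthy, key=lambda s: s.handled)
        return target.handle(request)


def kill_random(pool, rng):
    """Chaos-style fault injection: stop one healthy server chosen at random."""
    healthy = [s for s in pool.servers if s.alive]
    if not healthy:
        return None
    victim = rng.choice(healthy)
    victim.alive = False
    return victim


# Assume three servers are enough for the load; the fourth is the "+1" spare.
rng = random.Random(0)
pool = Pool(Server(f"s{i}", zone) for i, zone in enumerate(["a", "a", "b", "b"]))
victim = kill_random(pool, rng)
responses = [pool.route(f"req-{i}") for i in range(9)]
print(victim.name, [(s.name, s.handled) for s in pool.servers])
```

Because any server can answer any request, killing one instance does not lose requests: the nine requests are spread evenly over the three survivors, and the stopped server handles none.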
