This article will cover the uses of Apache Kafka in System Design, its core capabilities, and how it works.
Tables of Contents:
- Core Capabilities
- What is Event Streaming
- Use Cases
- Companies that use Kafka
- Use Case: Youtube
- How it Works
- Why use it?
Introduction to Kafka
Apache Kafka is an open-source event streaming platform used for its streaming analytics, data pipelines, data integration, and more. The platform is commonly used among leading companies in various industries from manfacturing to insurance to telecom.
More specifically, the software is a framework implementation of a software bus using stream processing. It is an open-source software platform developed by the Apache Software Foundation written in Scala and Java.
Core Capabilities of Kafka are:
- High Throughput
- Permanent Storage
- High Availability
Broken down, Kafka allows us to first publish and read event streams. Kafa then stores those event streams which gives us the functionality of processing those event streams in real-time or with delay.
What is Event Streaming?
Take the example of a human's central nervous system -- that is what event streaming is for the digital world.
Event streaming allows us to capture data in real-time from an amalgamation of sources which can later be employed to for later retrieval.
Industry-wide examples of the uses cases of event streaming:
- Process payments and transactions in real-time
- Continous capture of sensor data from IoT devices and more
- Monitor logistics and transportation through trucks, shipment fleets, etc.
Companies that use Kafka
The software is used by over 60% of all Fortune 100 companies including Box, Goldman Sachs, Costco, Target, Intuit, Cisco and more. Originally developed at LinkedIn, Apache Kafka eventually branched out into its own platform.
Use Case: By Company
Kafka can be used for video streaming. Companies like Pinterest use Kafka to handle events like impressions, clicks, close-ups, and repins. Additionally, companies like LinkedIn use Kafka to monitor, track, newsfeed, and stream data.
The Kafka ecosystem consists of Kafka Core, Kafka Streams, Kafka Connect, Kafka REST Proxy and the Schema Registry. The five main APIs in Kafka in Java and Scala are the Producer API, Consumer API, Connector API, Streams API, and the Admin API.
- Producer API: Publishes streams of events to one or more Kafka topics.
- Consumer API: Allows us to read topics and to process their respective events streams.
- Connector API: Builds and runs connectors that import and export event streams so that they can interact with external applications and services.
- Streams API: Facilitates the development of applications and microservices for stream-processing.
- Admin API: Administrates, manages, and inspects Kafka topics, brokers and other Kafka objects.
How it Works
Kafka employs a distributed system of servers and clients which communicate through the use of TCP/IP network protocol. The software is able to be deployed in cloud environments as well as on bare-bones hardware (as well as more).
Additionally, Kafka stores key-value messages that come from processes known as producers. This data, in the form of key-value messages, can then be partitioned within different topics to the user's choosing. Other processes known as consumers can then read this data.
Kafka Servers can cover multiple datacenters as well as cloud regions. These servers form clusters and are designed to be fault-tolerant so that a loss in servers will not result in any data loss. Additionally, within, some of these servers, known as brokers, form a layer called the storage layer. The remaining servers are then used to import and export event streams through Kafka Connect.
Kafka clients allow the user to write applications that publish and read event streams as described in the Core Capabilities section. There are plenty of clients through the Kafka Community, and users may use the Kafka Streams Library to build applications and microservices to manage Kafka Clusters (stream-processing).
Currently, the alternatives to Kafka include a platform like RabbitMQ. RabbitMQ uses a messaging queue as opposed to Kafka's partitioned log model which combines a messaging queue and publish subscribe approaches. Additionally, Kafka using a TCP protocol as opposed to an advanced messaging queue protocol which is supported by the plugins MQTT and STOMP which RabbitMQ uses. Kafka also supports messages being read by multiple "consumers" while on RabbitMQ messages are removed as they are consumed, meaning that messages cannot be received by multiple consumers.
With this article at OpenGenus, you must have a strong idea of Apache Kafka in System Design.