EP 1: What is Kafka?

Kafka is a distributed event streaming platform for real-time data processing, enabling high-throughput messaging between producers and consumers.

Hello fellow Tech Monks!

Have you ever wondered how apps like Uber track rides in real time or how banks detect fraud instantly? That’s where Apache Kafka comes in! It’s a powerful distributed event streaming platform that enables real-time data flow between systems, making it ideal for analytics, messaging, and event-driven applications.

Let’s discover the WHATs, WHYs, and HOWs of Kafka in this blog today.

When you purchase a product on Amazon, you expect every notification in real time. You order some food, and you want to track exactly where the driver is. Real-time data streaming is everywhere these days. We want everything in real time, but we didn’t start out this way.

  • Early 2000s: Data generation was minimal, and batch processing was sufficient for weekly or daily analytics.

  • The Problem: Real-time applications (e.g., detecting fraud transactions) demanded instant action, but existing tools like RabbitMQ struggled with high data volume and velocity.

A Quick History

In 2010, LinkedIn, the professional networking platform, faced significant challenges as its rapidly growing user base generated massive amounts of data every second. To address the need for real-time data processing across distributed systems, LinkedIn engineers developed Apache Kafka, a robust distributed streaming platform capable of handling high-throughput, low-latency data streams. Kafka's architecture was specifically designed to process vast amounts of data efficiently in real-time.

Throughput

Throughput is the measure of how much data a system can process over a period of time, typically measured in units like bytes per second, messages per second, or transactions per second.

A system with high throughput can handle large amounts of data or many operations efficiently.

Factors affecting throughput:

  • System resources: CPU, memory, disk, and network bandwidth.

  • Parallelism: Distributed or partitioned workloads can significantly increase throughput.

  • I/O efficiency: Sequential vs. random disk writes, compression, and other optimizations.
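
To make throughput concrete, here’s a quick back-of-the-envelope calculation; the numbers are purely illustrative, not a benchmark:

```python
# Illustrative throughput math: 100,000 messages per second at ~1 KB each.
messages_per_second = 100_000
message_size_bytes = 1_024

throughput_bytes = messages_per_second * message_size_bytes
print(f"{throughput_bytes / 1_000_000:.0f} MB/s")  # ~102 MB/s
```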

Latency

Latency is the time delay between when an event occurs and when it is observed, processed, or responded to. It’s often measured in milliseconds (ms) or microseconds (µs).

A low-latency system responds quickly to inputs or events.

Factors affecting latency:

  • Network delay: Time taken for data to travel across the network.

  • Processing time: How long it takes to handle the event or request.

  • Queuing delays: Time spent waiting in queues due to system bottlenecks.
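
A simple way to see latency in practice is to timestamp an event when it occurs and compare against the clock once it has been handled. A minimal sketch, where handle() is just a stand-in for real work:

```python
import time

def handle(event):
    time.sleep(0.005)  # stand-in for real processing work (~5 ms)

event = {"created_at": time.time(), "payload": "..."}

start = time.time()
handle(event)
end = time.time()

# End-to-end latency: from when the event occurred to when handling finished.
print(f"latency:    {(end - event['created_at']) * 1000:.1f} ms")
# Processing time alone, excluding any queuing before handle() was called.
print(f"processing: {(end - start) * 1000:.1f} ms")
```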

In 2011, LinkedIn open-sourced Kafka under the Apache Software Foundation, making it accessible to organizations worldwide. Since then, Kafka has been adopted by leading companies for diverse use cases, including logging, monitoring, event sourcing, and stream processing.

By this time you might have understood that Kafka is used for real-time processing of enormous amounts of data. But you might be wondering: why do we really need it?

Why Do We Need Kafka?

You might have heard many people ask: when we already have databases, why do we need services like Kafka?

So before diving into Kafka, let’s address this important question first:

What is a Database?

90% of us will say that we need databases to store data. While this is true, it’s only part of the story. Databases are also designed to read, query, and manage data efficiently.

For example:

  • You can read a record by its ID.

  • You can speed up reads with indexes.

  • You can aggregate data (counts, sums, averages).

  • You can apply conditions (e.g., AND/OR filters) to get precise results.

So a database is not only used for storing data; it is also built for reading data efficiently.

Why aren’t databases fast? Why do we need services like Kafka and Redis?

The data in the database is durable.

What do you mean by durable?
It means you can trust the database: if you insert any data into the database, you can also read it back later, and you don’t have to worry about your data getting lost.

Say you have a database, for instance MongoDB or PostgreSQL, and you insert some data into it. Then you restart the server. When it comes back up, your data is still lying there, exactly as it was. That means databases are durable. This is because they store data on secondary memory (e.g., hard disks or SSDs).

We won’t go into the architecture of databases here. In short, databases use the hard disk to store data. You may remember from school that computers have primary memory and secondary memory.

  • Primary Memory (RAM):

    • Fast but volatile (data is lost when the system restarts).

  • Secondary Memory (Hard Disk/SSD):

    • Durable but slower for reads and writes.

Databases use secondary memory to ensure data persistence, but this makes them slower compared to in-memory systems like Redis.

The Need for Kafka

Kafka, unlike traditional databases, is built around an append-only log. It writes messages to disk sequentially, which is far faster than the random access a database performs, and serves most reads straight from the operating system’s page cache in RAM. That combination of sequential I/O and heavy use of memory makes it extremely fast for handling real-time data streams, without giving up durability.

If a database kept all of its data in RAM, it could achieve that kind of speed too, but if the server went down or got restarted, all your data would be lost, right? That is exactly the trade-off purely in-memory stores like Redis accept by default.

Kafka excels at handling unstructured or semi-structured data in real time, making it ideal for modern applications like:

  • Real-time analytics.

  • Log aggregation.

  • Messaging systems between services.

  1. Structured Data

    • Organized in rows and columns (think relational databases like PostgreSQL).

    • Each column has a specific data type (e.g., strings, numbers, dates).

    • This structure allows for fast and precise querying. For example, with structured data, you can quickly filter or sort based on conditions like WHERE id = 123.

  2. Unstructured Data

    • This refers to data without a predefined schema—like JSON, logs, or images.

    • Examples include real-time data streams generated by applications (e.g., location updates from delivery apps).

    • Storing unstructured data directly in a traditional database is inefficient. Instead, systems like Kafka or message queues are used to handle such data streams efficiently.

Real-life example

Imagine a food delivery app like Zomato. Every time a delivery person moves, their device generates latitude and longitude data, along with timestamps. This data is unstructured and generated in high volumes. If you tried to directly store every entry in a traditional database, it would quickly become overwhelmed, resulting in performance degradation.

To handle this, Kafka (or a similar service) can act as a buffer. The unstructured data is first consumed by Kafka and processed by a consumer service. This service validates the data, computes useful metrics (e.g., total travel time, start and end points), and then stores only the essential information in the database.

For example: Instead of storing 10,000 latitude and longitude entries, the system might calculate the delivery’s total duration and path summary, then store this in an Orders table.

This approach keeps the database clean, reduces unnecessary storage, and ensures that queries remain fast and efficient.
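
To make that concrete, here is a minimal sketch of the buffer-and-summarize idea using the kafka-python client. The broker address, the driver-locations topic, the message fields, and save_summary are all hypothetical stand-ins:

```python
import json
from kafka import KafkaConsumer

def save_summary(order_id, summary):
    # Stand-in: write the computed summary to the Orders table.
    print(order_id, summary)

consumer = KafkaConsumer(
    "driver-locations",                 # hypothetical topic of raw GPS events
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

points = {}  # order_id -> list of (timestamp, lat, lon)
for msg in consumer:
    e = msg.value
    points.setdefault(e["order_id"], []).append((e["ts"], e["lat"], e["lon"]))
    if e.get("status") == "delivered":
        trail = sorted(points.pop(e["order_id"]))
        # Store a handful of computed fields instead of thousands of raw GPS rows.
        save_summary(e["order_id"], {
            "duration_s": trail[-1][0] - trail[0][0],
            "start": trail[0][1:],
            "end": trail[-1][1:],
            "num_points": len(trail),
        })
```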

Why Use Systems Like Kafka?

Kafka is invaluable for scenarios where raw, unstructured data needs to be ingested, validated, and processed before being stored. It ensures that:

  • Data can be consumed in real time without overloading the primary database.

  • Consumers can process and structure the data before storage.

  • Applications remain scalable and efficient, even during high traffic.

For example, in the Zomato case, Kafka enables the real-time display of delivery progress to users while offloading the heavy lifting of data processing to background services. Once the order is completed, only the essential computed data is stored in the database, ensuring optimal performance.

What is Kafka?

If you open the official page of Kafka you will find this definition,

Apache Kafka is an open-source distributed event streaming platform.

What does it mean? Let’s break down these words to understand them better.

Event streaming involves two different tasks:

  • Create Real-time Stream

  • Process Real-time Stream

Let’s explain these with an example.

Example: Paytm Transactions

Imagine you’re using Paytm for a payment. When you make a transaction, this event is sent to Kafka. However, you're not the only Paytm user—millions of users across the globe are transacting at the same time. Kafka handles this massive flow of data efficiently. Sending this continuous stream of transaction events to Kafka is called creating a real-time stream.

Once Kafka receives the data, it needs to process it. For instance, Paytm might want to limit users to 10 transactions per day. To enforce this, a client application reads the data from Kafka, checks the transaction count for each user in real time, and sends notifications if limits are exceeded. This process of reading and acting on data from Kafka is called processing a real-time stream.

In simple terms: Event streaming is the continuous flow of events to Kafka, followed by their real-time processing.
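
The “create” half, sending each transaction event into Kafka as it happens, might look like this with the kafka-python client; the transactions topic and the event fields are made up for illustration:

```python
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    key_serializer=str.encode,
)

# Keying by user_id keeps each user's transactions in order,
# which makes a per-user "10 per day" check straightforward downstream.
event = {"user_id": "u-123", "amount": 499.0, "ts": time.time()}
producer.send("transactions", key=event["user_id"], value=event)
producer.flush()  # block until the broker has acknowledged the event
```

The “process” half is simply a consumer that reads this topic, keeps a per-user count for the day, and fires a notification once the count passes 10.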

Kafka is distributed, meaning it runs on multiple servers spread across regions to balance the load and ensure high availability.

  • If one server fails, another automatically takes over, preventing downtime.

  • Kafka’s distributed architecture ensures scalability and fault tolerance, making it ideal for high-traffic systems.

How Kafka Simplifies Complex Communication

Let’s say you have different applications that want to send different types of data to a database server. This looks simple, right? What is the problem here? Nothing, as of now.

But in the future, your application can grow, and you might have n services that all need to communicate with each other.

Imagine multiple applications (e.g., frontend, backend, security systems) communicating directly with one another. This can lead to:

  1. Data format mismatches – Different apps might require data in varied formats.

  2. Connection complexities – Each app connecting to multiple others can quickly become unmanageable.

  3. High connection count – Managing a large number of direct connections between apps is inefficient.

Kafka simplifies this by acting as a centralized messaging system. Instead of apps communicating directly, they send messages to Kafka, which stores and routes them appropriately. Instead of n services maintaining up to n×(n−1)/2 point-to-point links, each service keeps a single connection to Kafka, which reduces complexity and minimizes the number of connections.

How Kafka Works

Kafka operates on the Publisher-Subscriber (Pub-Sub) model:

  1. Publisher: Produces events (e.g., transactions) and sends them to Kafka.

  2. Message Broker: Kafka stores these events temporarily.

  3. Subscriber: Reads and processes the events from Kafka as needed.

Kafka mainly operates through four core APIs that enable real-time data streaming and processing:

  1. Producer API:

    • Creates and sends event streams to topics (ordered lists of events).

    • Messages in a topic can be retained for minutes, days, or indefinitely, depending on storage needs.

  2. Consumer API:

    • Subscribes to topics to ingest data, either in real-time or from stored records.

    • Multiple independent consumers can read the same topic without interfering with each other.

  3. Streams API:

    • Consumes from topics to analyze, aggregate, and transform data in real-time.

    • Produces transformed data streams to the same or new topics, powering advanced use cases like location tracking and analytics.

  4. Connector API:

    • Simplifies integration with external systems (e.g., MongoDB) by using reusable producers and consumers.

    • Developers only need to configure connectors rather than rewriting integrations.
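
Of these four, the Producer and Consumer APIs are the ones you’ll touch first. A minimal round trip with the kafka-python client, assuming a local broker and a hypothetical payments topic, looks like this:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer API: publish one event to the "payments" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("payments", {"user": "u-1", "amount": 250})
producer.flush()

# Consumer API: subscribe and read events back.
consumer = KafkaConsumer(
    "payments",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # start from the oldest retained event
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=5000,       # stop iterating when idle, for demo purposes
)
for msg in consumer:
    print(msg.topic, msg.partition, msg.offset, msg.value)
```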

Let’s look into some terminologies of Kafka.

Kafka Message

Every piece of data Kafka handles is a message. A Kafka message has three parts:

  1. Headers: Optional metadata about the message, as key-value pairs.

  2. Key: Determines which partition the message lands in; messages with the same key stay together, in order.

  3. Value: The actual payload.
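
In kafka-python, all three parts map onto arguments of a single send() call; the topic name, payload, and header below are made up for illustration:

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

producer.send(
    "orders",
    key=b"user-42",                        # key: drives partition assignment
    value=b'{"item": "book", "qty": 1}',   # value: the actual payload
    headers=[("source", b"mobile-app")],   # headers: metadata as (str, bytes) pairs
)
producer.flush()
```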

Now that you know what a message is, let’s see how Kafka organizes these messages into

Topics and Partitions

  • Topics: Messages are organized into topics. Topics are categories that structure the data streams.

  • Partitions: Sub-divisions within a topic that allow messages to be processed in parallel across multiple consumers, enabling high throughput.
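
You can see both concepts in kafka-python’s admin client. A sketch that creates a hypothetical topic with three partitions (broker address and topic name are assumptions):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# A topic ("ride-events") split into 3 partitions, so up to 3 consumers
# in one group can read it in parallel.
admin.create_topics([
    NewTopic(name="ride-events", num_partitions=3, replication_factor=1)
])
```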

Why Do Companies Choose Kafka?

Some of Kafka’s strengths include:

  1. Multiple Producers: Kafka handles simultaneous data streams without performance degradation.

  2. Multiple Consumers: Different consumer groups can read from the same topic independently.

  3. Consumer Offsets: Kafka tracks what has been consumed, allowing consumers to resume processing from where they left off in case of failure (see the sketch after this list).

  4. Retention Policies: Messages are retained based on time or size limits that we define; within that window, consumed messages are not deleted and can be re-read.

  5. Scalability: Start small and grow as needs expand.
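
Point 3 is worth seeing in code. With auto-commit turned off, a kafka-python consumer only advances its stored offset when you tell it to, so a crash mid-batch simply replays the unprocessed messages. The topic and group names here are hypothetical:

```python
from kafka import KafkaConsumer

def process(msg):
    # Stand-in for your real handling logic.
    print(msg.offset, msg.value)

consumer = KafkaConsumer(
    "payments",
    bootstrap_servers="localhost:9092",
    group_id="fraud-checker",
    enable_auto_commit=False,   # we decide when an offset counts as "done"
)

for msg in consumer:
    process(msg)
    consumer.commit()   # record progress; after a restart, we resume from here
```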

Kafka Producers

Producers are applications that create and send messages to Kafka:

  • Batching: Messages are batched together to reduce network traffic.

  • Partitioning:

    • Without a key, messages are spread across partitions to balance the load.

    • With a key, messages with the same key always go to the same partition, preserving their order.
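
Both behaviors are just arguments in kafka-python; the tuning numbers below are arbitrary examples, not recommendations:

```python
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    linger_ms=10,        # wait up to 10 ms to batch messages together
    batch_size=32_768,   # accumulate up to 32 KB per partition before sending
)

# No key: the producer spreads these across partitions.
producer.send("ride-events", value=b"ping")

# Same key: these land in the same partition, in order.
for i in range(3):
    future = producer.send("ride-events", key=b"driver-7", value=str(i).encode())

metadata = future.get(timeout=10)   # block for the broker's acknowledgement
print("partition:", metadata.partition, "offset:", metadata.offset)
producer.flush()
```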

Kafka Consumers and Consumer Groups

Consumers are applications that process messages:

  • Consumer Groups: Share responsibility for processing messages from partitions.

  • Partition Assignment: Each partition is assigned to only one consumer in a group at any given time.

If one consumer fails, another takes over its workload, ensuring uninterrupted processing. Kafka’s Group Coordinator handles partition redistribution when consumers join or leave a group.
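
The only code needed to get this behavior is a shared group_id. Run two copies of a sketch like this (names are illustrative) and Kafka splits the topic’s partitions between them; kill one copy, and the survivor picks up the freed partitions after the rebalance:

```python
from kafka import KafkaConsumer

# Every process started with group_id="trackers" joins the same group.
consumer = KafkaConsumer(
    "ride-events",
    bootstrap_servers="localhost:9092",
    group_id="trackers",
)

for msg in consumer:
    # Each message is handled by exactly one consumer in the group.
    print(f"partition {msg.partition}, offset {msg.offset}: {msg.value}")
```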

Kafka Cluster

A Kafka cluster consists of brokers:

  • Brokers: Servers that store and manage data.

  • Replication: Partitions are replicated across multiple brokers using a leader-follower model. If one broker fails, another becomes the leader without losing data.

  • Metadata Management:

    • Older versions use ZooKeeper for metadata management and leader election.

    • Newer versions use KRaft (Kafka Raft), which eliminates ZooKeeper, simplifies operations, and improves scalability.

I hope this gives you a clear picture of what Kafka is, why it is needed, and how it is used.

In a Nutshell:

Kafka was developed at LinkedIn in 2010 to handle massive real-time data and open-sourced in 2011 under the Apache Software Foundation.

Kafka is a distributed event streaming platform that enables high-throughput, low-latency real-time data processing. It acts as a message broker, allowing producers to send data to topics and consumers to process it efficiently.

Works on Producer-Consumer Model:

- Producers send messages to Kafka topics.

- Consumers read messages from topics, with consumer groups sharing the workload.

Use Cases:

- Real-Time Analytics: Monitoring user activity or transactions.

- Log Aggregation: Centralized log management across systems.

- Event Streaming: Capturing and reacting to user actions.

- Change Data Capture (CDC): Keeping databases synchronized.

Keep learning. You’ve got this!
