What is Apache Kafka?

Kasun Dissanayake
Geek Culture
Mar 20, 2022

In this tutorial, I am going to explain what Apache Kafka is, the history behind Apache Kafka, and the key concepts that make Apache Kafka such a unique platform, with some examples and a look at the Apache Kafka architecture.

Apache Kafka is a:

  • Distributed
  • Event Streaming
  • Message Publish/Subscribe
  • Open-source Platform

The history behind Apache Kafka

Kafka was originally developed at LinkedIn and was subsequently open-sourced in early 2011. Jay Kreps, Neha Narkhede, and Jun Rao helped co-create Kafka, which graduated from the Apache Incubator on 23 October 2012.

Key concepts of Apache Kafka Architecture

1) Publish-Subscribe Method

The publish-subscribe method, commonly known as pub/sub, is a popular messaging pattern that is widely used in today’s systems to help them distribute data efficiently and scale, among other things. The pub/sub messaging pattern can be easily implemented through an event broker such as Solace PubSub+, Kafka, RabbitMQ, or ActiveMQ. When you use an event broker, you have a set of applications known as producers and another set of applications known as consumers. Producers are responsible for publishing data to the broker, and consumers are responsible for consuming data from the event broker. By introducing a broker to our architecture, we remove the need for producers to communicate directly with the consumers, which gives us a loosely coupled architecture. Additionally, the broker becomes responsible for managing connections, security, and subscription interests, instead of that logic being implemented in the applications themselves.

To understand Kafka’s implementation of pub/sub, you need to understand producers and consumers.

Producer

A Kafka producer can be programmed to write data to a topic. If the topic has 3 partitions and messages are published without a key, data will be distributed across all 3 partitions in a round-robin fashion. This leads to a significant problem: because our data is scattered across multiple partitions, we don’t get ordering across partitions. Additionally, Kafka producer applications batch writes to improve performance as they publish data to Kafka brokers.
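
As a minimal sketch (assuming a local broker at localhost:9092 and a hypothetical topic named orders with 3 partitions), a producer that sends records without a key lets Kafka spread them across the partitions:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; adjust to your cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // No key is given, so records are spread across the topic's partitions.
                producer.send(new ProducerRecord<>("orders", "order-event-" + i));
            }
            producer.flush(); // sends are batched; flush forces them out
        }
    }
}
```

Note that recent Kafka versions use a “sticky” partitioner for unkeyed records, filling a batch for one partition before moving to the next, so the spread is per batch rather than strictly per record.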

Consumer

If we want to consume the data we just wrote to our topic, we will need to create a consumer application and connect it to our Kafka cluster. Your Kafka consumer can easily subscribe to a topic and consume the necessary data. However, as mentioned earlier, there is an issue caused by topic partitions. Because our data was written to 3 partitions, we have lost global message ordering. When our consumer subscribes to the topic and consumes the data, it may receive messages out of the order in which they were produced. Depending on your system, this can be a critical issue.

To overcome this issue, you will need to use a key when publishing data so that all the messages pertaining to a specific key always go to the same partition and hence preserve ordering. However, as you may have guessed already, with this workaround you lose the ability to have evenly balanced parallelization, resulting in some partitions overflowing while others are lightly used.

Furthermore, in Kafka, messages are polled by the consumer instead of being pushed to it. When you program your consumer application, you are expected to poll for data in a loop, typically with a timeout. As you can imagine, this can be highly inefficient if your application is frequently polling for data, especially when there is no data available.
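
A minimal consumer sketch, assuming the same local broker and the hypothetical orders topic from above, shows this poll loop; the consumer keeps asking the broker for data even when nothing new has arrived:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker
        props.put("group.id", "order-processors");          // consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // poll() blocks up to the timeout and is called in a loop (pull model).
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```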

This is the backbone of the Kafka architecture, because it gives you a platform that completely decouples the producer from the consumer. A real-world example we can relate this to is collecting luggage from the belt at an airport. We, as the consumers, just stand by the belt and wait for our turn to collect our luggage from the queue. There might be situations where we cannot reach it in time, but our luggage will not get lost; it will continue to rotate on the belt and we can pick it up easily on the next pass. When you have a central pipeline where everyone can dump data and everyone can take data out based on their needs, it gives you a lot of flexibility.

2) Kafka Event-based Architecture

To understand event-based architecture, let’s first get an idea of data-driven programming and event-driven programming.

Data-driven programming means that some general code exists. It does not contain any business logic and it does not control the flow; it is just a tool to read and process data and output results. What controls the flow and logic is the data itself. So, if you want to change the business logic (literally change the result of your program), you change the data, not the code. Your code is a kind of pipeline that executes commands depending on the input data. You can think of such code as the eval function in JavaScript.

Event-driven programming means that the logic is controlled by events. Data is only data, and all the business rules are placed in code. An event carries some data, and the logic can change depending on the event’s data, but the difference is where the changing logic rules are placed: in data or in code. In the case of event-driven programming, the logic is in code.

Example:

In event-driven integrations we are more interested in sharing a change in one system that might trigger an action in another system. We usually share a small amount of data (events are generally small containers of data). In essence, an event in one system triggers an action in another. An example off the top of my head: say a support call is completed and we want to trigger an anonymous employee satisfaction survey. If these are separate systems, once the support call is completed, an event may be fired to the other system to trigger the survey process. We probably only share the channel of communication with this second system and nothing about the support call; the second system doesn’t really care.

Event-Driven System

In Data-driven integrations we’re more interested in keeping systems in sync. Data in one system changes and we’re interested in replicating this change in data to other systems. It doesn’t necessarily cause an action on the second system. An example would be that we probably want to keep our CRM system and logistics system in sync in terms of customer shipping addresses. Once this info gets updated via CRM, data gets replicated to the logistics system. It’s possible that no further action is taken at that point in time.

Data-Driven System

It might help to think in terms of the scope of data exchange. Events are usually very small and carry just enough information to query further if needed. Say we fire an event from a document management system saying a document is approved; the event would contain only the document ID and the state change. If the receiving system needs more info, it would call back with this identifier for further details. In a data-driven exchange, we would likely replicate all the data.
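
As a rough sketch of how small such an event payload can be (the class name is hypothetical, and Java 16+ is assumed for the record syntax), the event below carries only the document ID and the new state; a receiving system that needs more would call back to the document service with that ID:

```java
// Hypothetical payload for a "document approved" event.
// It deliberately carries only an identifier and the new state, not the whole document.
public record DocumentStateChanged(String documentId, String newState) {

    public static void main(String[] args) {
        DocumentStateChanged event = new DocumentStateChanged("doc-4711", "APPROVED");
        // A consumer would use event.documentId() to fetch further details
        // from the document management system if it needs them.
        System.out.println(event);
    }
}
```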

In Kafka, we have an event-driven architecture. An event represents a business activity; events happen continuously and each one is unique. Unlike data, events are not updated or deleted; a new event is only appended at the end of the queue or ESB (enterprise service bus). Kafka gives you the flexibility to retain events for whatever time frame you want; it could span from 7 days to 1 year, so storage is also a very important feature. Unlike other middleware products such as an ESB, which push messages to the target, Kafka relies more on consumers pulling the messages.
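
Retention is configured per topic. As a hedged sketch with the Java Admin client (the topic name and the 30-day value are only for illustration), the code below creates a topic that keeps events for 30 days instead of the broker default:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (Admin admin = Admin.create(props)) {
            // 3 partitions, replication factor 1 (single-broker dev setup assumed).
            NewTopic topic = new NewTopic("customer-events", 3, (short) 1)
                    // Keep events for 30 days (in milliseconds); the default is 7 days.
                    .configs(Map.of("retention.ms", String.valueOf(30L * 24 * 60 * 60 * 1000)));

            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```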

3) The Commit Log in Kafka

Kafka uses the log data structure as its underlying architecture. At its simplest, Kafka is nothing but a log data structure where data is stored at the end of the log.

You cannot add an entry at the front; you can only append data at the end. In this log you have offsets. These offsets are pointers that indicate from where the data needs to be picked up.

Suppose there are 2 consumers, A and B, taking data from a particular log. A and B take the data from the log, and each needs to keep track of its position. Consumer A is pointing at offset 9 and Consumer B is pointing at offset 11, so the next time Consumer A will take data from offset 10. This is the log data structure: if you have seen any log file, it continuously appends data at the bottom. Logs never overwrite existing data; you cannot make an update in the log. Any new event coming into the log is appended at the end.
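
A small sketch of how offsets work as pointers (same assumed broker and hypothetical orders topic as before): the consumer is assigned partition 0 directly, seeks to offset 10, the position after Consumer A's last read at offset 9, and reads from there.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OffsetExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("orders", 0);
            consumer.assign(Collections.singletonList(partition));

            // Like Consumer A in the text: the last processed offset was 9,
            // so continue reading from offset 10.
            consumer.seek(partition, 10L);

            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
            // The consumer's position has advanced past the records it just read.
            System.out.println("next offset to read: " + consumer.position(partition));
        }
    }
}
```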

4) What are Kafka Messages, Topics, and Partitions?

As mentioned above, a message is a kind of event that we can store in Kafka. A particular message goes under a certain topic; the topic has to be defined, and inside that topic you can store the message. The message is the smallest unit of data in Kafka. You can relate a message to a row in a database table, for example, one row in a Customer Order Detail table that contains all the data related to Order Number 0001.

A topic could be a business activity or a business function, anything uniquely defined under which you want to store messages. Suppose the topic is Customer: any message (event) that qualifies for the Customer topic will go to this particular topic. You can think of the topic as a database table; the table contains multiple messages for all the customers.

A partition is a subset of a topic. One topic can have multiple partitions.

Here we have 3 partitions for one topic. These 3 partitions can live on different brokers, which gives you scalability and, together with replication (covered below), fault tolerance: if something happens to the broker holding one partition, the messages are not lost, because a replica of that partition on another broker can take over.

How do we make sure that events going into these partitions or queues are written in sequence? Suppose Customer A buys Product X, and then the same customer changes his mind and replaces that product with Product Y. These are 2 separate events that have to follow a certain sequence for data availability and data correctness. In Kafka, data or events can be ordered or sequenced within a partition only. To achieve this, you define a key with a unique ID (a Customer ID or Event ID, for instance); all events with the same key go to the same partition, so the partition always keeps the incoming events in the right sequence.
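
As a sketch of that idea (same assumed broker and hypothetical orders topic as before), both events below are keyed by the same customer ID, so the default partitioner sends them to the same partition and “bought product X” is stored before “replaced with product Y”:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderedByCustomerKey {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key ("customer-A") -> same partition -> ordering preserved.
            RecordMetadata first = producer.send(
                    new ProducerRecord<>("orders", "customer-A", "bought product X")).get();
            RecordMetadata second = producer.send(
                    new ProducerRecord<>("orders", "customer-A", "replaced with product Y")).get();

            System.out.printf("first:  partition=%d offset=%d%n", first.partition(), first.offset());
            System.out.printf("second: partition=%d offset=%d%n", second.partition(), second.offset());
            // Both records land in the same partition, the second at a higher offset.
        }
    }
}
```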

5) Kafka Broker and Kafka Cluster

In this diagram, you can see a Kafka cluster with multiple Kafka brokers. A single Kafka server is called a broker, and a Kafka cluster is made up of multiple brokers. The broker receives messages from the producer and stores them on a local disk. The broker also serves fetch requests coming from the consumers, providing the messages that have already been written to the underlying local disk. Depending on the hardware, one Kafka broker can handle thousands of partitions and millions of messages per second. One partition can be assigned to multiple brokers, but a single broker is the owner of the partition.

In this diagram, you can see that Partition 0 of Topic A is replicated to Broker 1 and Broker 2. In the same manner, Partition 1 of Topic A is replicated to Broker 1, Broker 2, and Broker 3. This provides redundancy for the messages stored in the partition. If any particular broker fails, a replica of the partition takes over leadership. However, in all these cases, every consumer and every producer always connects to the leader of the partition (the first replica in the list is always the preferred leader, no matter who the current leader is, and even if the replicas were reassigned to different brokers using the replica reassignment tool).
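
If you want to see the leaders and replicas for yourself, here is a sketch with the Java Admin client (assuming Kafka clients 3.1+ for allTopicNames(), and the hypothetical orders topic again) that prints the leader, replica list, and in-sync replicas for each partition:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class ShowPartitionLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker

        try (Admin admin = Admin.create(props)) {
            TopicDescription description = admin.describeTopics(Collections.singletonList("orders"))
                    .allTopicNames().get().get("orders");

            for (TopicPartitionInfo partition : description.partitions()) {
                System.out.printf("partition=%d leader=broker-%d replicas=%s isr=%s%n",
                        partition.partition(),
                        partition.leader().id(),
                        partition.replicas(),
                        partition.isr());
            }
        }
    }
}
```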

Products connected with Kafka

There are various products shipped with the overall Kafka platform, which provide multiple functionalities under one product. As part of the core product, you get messaging, caching, and storage facilities.

There is another product called Kafka Connect. Suppose there is a legacy platform, such as a mainframe, from which it is not very easy to stream data; you can use Kafka Connect, which enables you to connect such environments. Kafka Connect is a free, open-source component of Apache Kafka that works as a centralized data hub for simple data integration between databases, key-value stores, search indexes, and file systems.

KSQL is a structured query language specifically designed for Kafka. If you want to query data while it is being streamed into or stored as messages in Kafka, you can connect with KSQL and run those queries directly.

If you want to connect to the Kafka message broker from a client machine, you can use a Kafka client library.

Kafka Streams is a dedicated library for processing and transforming the streaming data held in Kafka topics.
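
As a minimal Kafka Streams sketch (the topic names here are hypothetical), the topology below reads the orders topic, keeps only events keyed by one customer, and writes them to another topic:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class CustomerOrderFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "customer-order-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");
        // Keep only events keyed by the hypothetical customer "customer-A".
        orders.filter((key, value) -> "customer-A".equals(key))
              .to("customer-a-orders");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Close the topology cleanly on shutdown.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```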

What are the benefits of using Kafka?

  1. Loose Coupling — Most applications today are microservices-based, with small services designed and deployed in containers. These need to be loosely coupled. That is why big companies like LinkedIn and Netflix use Kafka; it helps maintain a flexible environment.
  2. Fully Distributed — You can split your messages across several topics and partitions, and you have multiple brokers for fault tolerance. If one particular broker goes down, another will come up and serve you all the messages that are in the queue.
  3. Event-Based
  4. Zero Downtime — Kafka gives you zero downtime because of the scalable architecture on which it is built.
  5. Easy to Scale
  6. No Vendor Lock-In — Kafka is open-source software, so you can use it without any locking agreement.

I hope you now have a better understanding of what Kafka is, how it came into the world of technology, its key concepts, the products connected to Kafka, and its benefits. See you in another article.

Thank You!
