Topics and Partitions

Apache Kafka is a distributed robust publish/subscribe system. It acts as a message queue but it offers two major advantages over traditional pub/sub systems.

  1. Storing records with Fault Tolerance.

  2. Processing streams (data-streams) as they occur in real-time instead of in a batch.

cbp

Notice in the next figure, the difference in how the entities are stored in a Kafka topic VS how they are stored in an RDBMS. Events are always stored as they are applied in the log and they are never overridden, meanwhile, in RDBMS, you are storing a snapshot of an entity, when you update an entity you are modifying the original record.

dbvsevent

Other differences are:

  • Real Time vs Batch processing.

  • Distributed/Fault Tolerant vs "None" Distributed by construction.

What are Topics?

A topic is an ordered collection of events that are persisted to disk, replicated for fault-tolerance and they have a duration, as an event can be available for just some minutes or forever.

cbtp

What are Partitions?

A topic might be divided into several partitions to improve the performance in cases of heavy load.

The partitions of a topic are distributed across all the Apache Kafka brokers to maximize the parallelism when working with topics.

The sum of all events of all the partitions is what conforms a topic.

pbtpc

Each partition can be configured to be replicated across different Kafka brokers.

bp

Each partition can be replicated and one of the partition is designed the leader. All events are produced and consumed from the leader, and the other replicas just stay in sync with the leader. If the leader becomes unavailable, one of the synced replicas becomes the new leader.

From Apache Kafka 2.4.0, the previous statement is not strictly true. Although this is the most used behaviour, now a consumer can get messages from the closer replicas even though they aren’t leaders.

After KIP 392
Figure 1. Red Partitions are the leaders

Also a ReplicaSelector interface is provided so you can implement a custom selector of the replica from where messages are consumed.

What are Messages?

A message is a Key/Value pair that is stored inside a topic. This message is persisted and durable during the configured lifespan.

A message is a typical small chunk of data with two parts the key (optional) to identify the message and it is used by default to decide in which partition the message is stored, and the value which is the content of the message which can be in any format.

Moreover, each message contains a metadata timestamp attribute that is either set by the producer at creation time or by the broker on insertion time.

Although you can configure Apache Kafka to work with large messages, the default max size is 1 MB, being a few KB the recommended maximum size of a message.
pkp

Create a Topic

Apache Kafka auto-creates topics when a message is published to the topic.

You can also use kafka-topic.sh tool to create a topic manually.

This tool is bundled inside Kafka installation, so let’s exec a bash terminal inside the Kafka container.

docker exec -it $(docker ps -q --filter "label=com.docker.compose.service=kafka") /bin/bash

Inside the container, create a topic with name songs with a single partition and only one replica:

./bin/kafka-topics.sh --create --bootstrap-server localhost:29092 --replication-factor 1 --partitions 1 --topic songs
./bin/kafka-topics.sh --list --bootstrap-server localhost:29092
songs

Now that the topic is created and you’ve validated that the topic has been created, you can run the exit command:

exit

Get Topic information

You can get topic information by using kcat:

  • kcat

  • kcat in Docker

kcat -b localhost:29092 -L
docker run -it --network=host edenhill/kcat:1.7.0 kcat -b localhost:29092 -L
Metadata for all topics (from broker 0: localhost:29092/0):
 1 brokers:
  broker 0 at localhost:29092 (controller)
 1 topics:
  topic "songs" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0