Topics and Partitions

Apache Kafka is a distributed robust publish/subscribe system. It can act as a message queue but it offers two major advantages over traditional pub/sub systems.

  1. Storing records with Fault Tolerance.

  2. Processing data streams as they occur in near real-time instead of in large batches.

cbp

Notice in the next figure, the difference in how the entities are stored in a Kafka topic VS how they are stored in an RDBMS. Events are always stored as they are appended in the log and they are never overridden, whereas in an RDBMS, you are storing a snapshot of an entity and when you update an entity you are modifying the original record "in place".

dbvsevent

Other differences are:

  • Real Time vs batch processing

  • Distributed / Fault Tolerant vs none-distributed / single point of failure

What are Topics?

A topic is a collection of events that are persisted to disk. Topics support the concept of data retention such that events can be appended and kept for longer periods (e.g. days/weeks/months) or only be stored short-lived (e.g. minutes/hours)

cbtp

What are Partitions?

A topic is further divided into multipe partitions - at least one, but can be 10s or 100s - to improve the performance in cases of heavy load. The number of topic partitions defines the maximum parallelism a single consuming application consisting of multiple instances can achive for processing the stored events.

The partitions of a topic are distributed (i.e. replicated) across all the Kafka brokers to achieve fault-tolerance and to increase the parallelism when working with topics.

The sum of all events in all the topic’s partitions is what conforms a topic as a whole.

pbtpc

Topic partitions can be configured to be replicated across different Kafka brokers.

bp

For each replicated topic partition one of the partitions is the designated leader. Usually all events are produced to and consumed from the leader, and the other replicas stay in sync with the leader. If the leader becomes unavailable, one of the synced replicas becomes the new leader.

Starting with Apache Kafka 2.4.0 a consumer also can be configured to process messages from other replicas even though they aren’t leaders.

After KIP 392
Figure 1. Red Partitions are the leaders

Also a ReplicaSelector interface is provided so you can implement a custom selector of the replica from where messages are consumed.

What are Messages?

A message (also known as a Kafka record) is a key/value pair that is stored inside a topic partition. Message are persisted and durable in accordance with the configured retention settings for a topic.

A message is typically a "smallish" chunk of data consisting of its key (optional) to identify the message. If the key is present it is used by default to decide in which topic partition a message should get stored. The value represents the payload of the message. Both, keys and values, can support different serialization formats such as Avro, Json or Protobuf. Kafka brokers themselves don’t care about the data format at all and only store sequences of bytes.

Additionally, each message contains metadata such as a timestamp attribute that is either set by the producer at creation time or by the broker on insertion time.

Although you can configure Apache Kafka to work with larger messages, the default maximum size is 1 MB. Note that the Kafka record size is ideally within that limit and is often around a few 100 KBs.
pkp

Create a Topic

Apache Kafka can be configured to auto-create topics which means it is possible to publish a message to a non-existant topic that will be created on the fly. However, it is not recommended to work with topic auto-creation in production scenarios.

You can also use tools such as kafka-topic.sh to pre-create your topic(s) manually.

This tool is bundled with the Kafka container image, so let’s exec a bash terminal inside the running Kafka container.

docker exec -it $(docker ps -q --filter "label=com.docker.compose.service=kafka") /bin/bash

Inside the container, create a topic with the name songs having a single partition and only one replica:

./bin/kafka-topics.sh --create --bootstrap-server kafka:9092 --replication-factor 1 --partitions 1 --topic songs
./bin/kafka-topics.sh --list --bootstrap-server kafka:9092
songs

Now that you have valdiated the topic exists you can run the exit to leave the container

exit

Get Topic information

You can also get the topic information by using kcat from your host OS:

  • kcat

  • kcat in Docker

kcat -b localhost:29092 -L
Metadata for all topics (from broker 0: localhost:29092/0):
 1 brokers:
  broker 0 at localhost:29092 (controller)
 1 topics:
  topic "songs" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
docker run --rm -it --network=kafka-tutorial edenhill/kcat:1.7.1 kcat -b kafka:9092 -L
Metadata for all topics (from broker 0: kafka:9092/0):
 1 brokers:
  broker 0 at kafka:9092 (controller)
 1 topics:
  topic "songs" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0