
Part 2: Why are duplicates happening and JOINs slowing ClickHouse?

Learn the root causes of the duplication and JOIN issues in Kafka-to-ClickHouse pipelines.

Written by Armend Avdijaj · 28/03/2025, 13:59

In Part 1 you saw an overview of the different ways to ingest data from Kafka into ClickHouse. Before exploring the options for solving duplicates and slow JOINs later in this series, it is important to understand the root of each problem.

Duplicates: Kafka's at-least-once guarantee

Kafka provides an at-least-once delivery guarantee: every message will be delivered at least once, but there is no built-in guarantee that a message will be processed only once. Duplicates are therefore something you should expect. The most common reasons duplicates happen are listed below (a code sketch after the list shows the core failure mode):

  • Consumer failures: If a consumer crashes or hits a network issue after processing a message but before committing its offset, the message is redelivered and processed a second time.
  • Consumer rebalancing: When a Kafka consumer group scales up or down, partitions are reassigned, and if a consumer hasn’t committed its offset before reassignment, the new consumer may reprocess the same message, causing duplicates.
  • Timeouts from slow processing: If a consumer takes too long to process a message, Kafka may assume it has failed and reassign the partition to another consumer. The new consumer then processes the same message again, creating duplicates.
  • Producer Retries: Kafka producers retry sending messages if they don’t receive an acknowledgment from the broker (e.g., due to a temporary network issue). If the original message was actually received but the acknowledgment was lost, the producer sends the same message again.
  • Log Retention and Consumer Restarts: If a consumer is down for too long and its last committed offset gets deleted (due to Kafka’s log retention policy), it may start from an older offset when it restarts, leading to the reprocessing of old messages.
  • Lack of exactly-once support in the sink: Even though Kafka supports exactly-once semantics within the Kafka ecosystem, duplicates can still occur at the destination because ClickHouse doesn't natively support exactly-once writes.
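
The first few causes share the same shape: the work is done, but the offset commit that records the work is lost. Here is a minimal sketch of that failure mode, assuming the confluent-kafka Python client; the topic name, group id, and the insert_into_clickhouse helper are illustrative, not part of any real pipeline.

```python
# Minimal at-least-once consumer loop (confluent-kafka Python client).
# Topic name, group id, and the sink helper are illustrative assumptions.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "clickhouse-sink",
    "enable.auto.commit": False,  # commit manually, only after processing
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])

def insert_into_clickhouse(payload: bytes) -> None:
    ...  # hypothetical sink write, e.g. an HTTP INSERT into ClickHouse

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            raise RuntimeError(msg.error())

        insert_into_clickhouse(msg.value())
        # A crash HERE -- after the insert, before the commit -- means the
        # offset never advances. On restart (or after a rebalance) the same
        # message is consumed and inserted again: a duplicate in ClickHouse.
        consumer.commit(message=msg, asynchronous=False)
finally:
    consumer.close()
```

Flipping the order (commit first, then insert) would trade duplicates for potential data loss, which is why at-least-once pipelines accept duplicates and deal with them downstream instead.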

JOINs: ClickHouse isn’t a row-based system

JOINs can massively slow ClickHouse because it is optimized for fast reads and aggregations, not complex relational queries. ClickHouse is a columnar database: it doesn't use traditional indexes for quick row lookups the way row-based systems do. As a result, running a JOIN forces ClickHouse to load and process a lot of data in memory, which can be slow, especially with large tables. If the tables involved in the JOIN don't fit in memory, ClickHouse may have to spill data to disk or shuffle data across the network in distributed setups, both of which cause delays. You can read more about memory usage in the ClickHouse docs.
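
To make the memory pressure concrete, here is a sketch of running a large JOIN with a memory-conscious setting. It assumes the clickhouse-connect Python client, and the users and orders tables are illustrative; join_algorithm, however, is a real ClickHouse setting.

```python
# Sketch: a large JOIN with a memory-conscious join algorithm
# (clickhouse-connect client; table and column names are illustrative).
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost", port=8123)

# By default ClickHouse builds an in-memory hash table from the
# right-hand table of the JOIN; if `users` is large, that hash table
# alone can exhaust the query's memory budget.
query = """
    SELECT u.user_id, u.country, count() AS order_count
    FROM orders AS o
    INNER JOIN users AS u ON o.user_id = u.user_id
    GROUP BY u.user_id, u.country
"""

result = client.query(
    query,
    settings={
        # Sort-merge strategy that can spill to disk instead of holding
        # the whole right-hand table in one in-memory hash table.
        "join_algorithm": "partial_merge",
    },
)
for row in result.result_rows:
    print(row)
```

Two related knobs worth knowing: max_bytes_in_join caps the in-memory state of hash joins so a runaway query fails fast, and putting the smaller table on the right-hand side of the JOIN keeps the in-memory side small regardless of the algorithm.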


Next part: Part 3: ClickHouse ReplacingMergeTree and Materialized Views are not enough
