GlassFlow: ClickHouse Duplications and JOINs solved for Kafka Users
I am sure you know about ClickHouse. It is a real-time data warehouse that keeps growing in popularity. Users love it for its speed and cost efficiency. We at GlassFlow are big believers in ClickHouse and want it to succeed. But there is a reason why I am writing this article. I am seeing ClickHouse users facing the same challenges repeatedly. They have a nicely running Kafka setup, are ingesting data into ClickHouse via Kafka Connect, and then…booom 💥…duplicates. Everywhere. But that's not the end of it. When running JOINs within ClickHouse, query performance goes from top speed 🏎️ to 🐢.
Think about the consequences. If your data has duplicates, you may report wrong numbers, leading to wrong decisions and lost trust in the data team. If JOINs slow ClickHouse down, users will become unhappy with it and stop using it in their day-to-day work.
So why are duplicates even happening, and why are JOINs slowing ClickHouse down?
Kafka, as the default data streaming solution, guarantees at-least-once delivery, not exactly-once. Duplicates are simply a reality of streaming data: think of marketing technology systems emitting the same event through multiple paths, network retries, consumer restarts, and so on. This means you should expect duplicates in data streaming, and a deduplication strategy is necessary.
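To make the at-least-once point concrete, here is a minimal consumer-side deduplication sketch in Python. It is purely illustrative: the confluent-kafka client, the broker address, the topic name, and the `event_id` field are all assumptions, not part of the setup described above.

```python
import json
import time
from confluent_kafka import Consumer  # assumption: confluent-kafka client

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # hypothetical broker address
    "group.id": "clickhouse-ingest",        # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])  # hypothetical topic name

DEDUP_TTL = 7 * 24 * 3600      # remember seen IDs for 7 days
seen: dict[str, float] = {}    # event_id -> first-seen timestamp

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    event_id = event["event_id"]  # assumption: events carry a unique ID
    now = time.time()
    # Drop any event whose ID we already saw within the TTL window.
    if event_id in seen and now - seen[event_id] < DEDUP_TTL:
        continue
    seen[event_id] = now
    # ... insert the event into ClickHouse here ...
```

Note that even this toy version hints at the real operational burden: the seen-IDs store must be persistent, bounded, and shared across consumer instances in production, which is exactly what makes do-it-yourself deduplication painful.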
JOINs perform slowly because ClickHouse is built for fast reads and aggregations, not complex relational queries. Since it is a columnar database without the row-level indexes of a traditional OLTP database, JOINs require loading large amounts of data into memory. If the tables are too big, ClickHouse may spill data to disk or shuffle it across the network, which can cause significant slowdowns.
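As a rough illustration of the memory cost, here is a sketch using the clickhouse-connect Python client (the host, table, and column names are made up). ClickHouse's default hash join builds an in-memory hash table from the right-hand table, so placing the large table on the right is a common source of slowdowns:

```python
import clickhouse_connect  # assumption: clickhouse-connect client

client = clickhouse_connect.get_client(host="localhost")  # hypothetical host

# Default hash join: ClickHouse materializes the RIGHT-hand table in
# memory. A large `orders` table on the right can exhaust memory or
# force a spill to disk.
slow = client.query("""
    SELECT u.country, count() AS order_count
    FROM users AS u
    INNER JOIN orders AS o ON o.user_id = u.user_id
    GROUP BY u.country
""")

# Common mitigation: keep the smaller table on the right so the
# in-memory hash table stays small.
faster = client.query("""
    SELECT u.country, count() AS order_count
    FROM orders AS o
    INNER JOIN users AS u ON u.user_id = o.user_id
    GROUP BY u.country
""")
```

Reordering helps, but it only shrinks the problem; once both sides are large, the fundamental memory pressure remains.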
Yes, I know there are options from ClickHouse like ReplacingMergeTree, Materialized Views, etc., and outside systems like Apache Flink as possible solutions. Still, they are not cutting it, and ClickHouse itself even recommends minimizing the use of JOINs (link). These options address both problems with workarounds that require careful configuration or additional processing and come with performance trade-offs, especially for large datasets. No worries, I have prepared a comparison of the mentioned solutions in the last part of the series.
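For context on why ReplacingMergeTree is only a workaround, here is a minimal sketch (again via clickhouse-connect, with hypothetical table and column names). Duplicates are merged away only eventually in the background, so reads that must be correct right away need FINAL, which adds query-time cost:

```python
import clickhouse_connect  # assumption: clickhouse-connect client

client = clickhouse_connect.get_client(host="localhost")  # hypothetical host

# ReplacingMergeTree collapses rows with the same ORDER BY key only
# during background merges, i.e. eventually, not at insert time.
client.command("""
    CREATE TABLE IF NOT EXISTS events (
        event_id String,
        payload  String,
        version  UInt64
    )
    ENGINE = ReplacingMergeTree(version)
    ORDER BY event_id
""")

# Until a merge runs, duplicates are still visible in plain SELECTs.
# FINAL forces merge semantics at read time, at a noticeable cost.
rows = client.query("SELECT count() FROM events FINAL")
```

This is the trade-off in a nutshell: either you tolerate temporarily duplicated results, or you pay for FINAL on every query.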
Given this situation, we truly believe a more scalable, ready-to-use solution for deduplication and JOIN operations is still needed.
Meet GlassFlow for ClickHouse
At GlassFlow, we love ClickHouse and share the community’s passion for real-time analytics. We have seen streaming data users struggle repeatedly with duplicates and JOINs, so we focused on building GlassFlow for ClickHouse. Our goal is to help more Kafka users gain value from ClickHouse in the easiest way possible.
The idea is to build a solution that is easy to use, scales to millions of events, and connects natively with Kafka and ClickHouse. With GlassFlow, duplicates and JOINs are taken care of BEFORE events are ingested into ClickHouse.
Key features are:
- GlassFlow for ClickHouse is Open-Source
- Managed Connectors to ClickHouse and Kafka
- Built-in deduplication with a duplication check window of up to 7 days
- JOINs that perform at scale and are manageable through our UI
- Buffering logic for ClickHouse ingestion
You’ll find more details about the solution in part 5 of the series.
All parts of the series:
- Next >> Part 1: How do you usually ingest data from Kafka to ClickHouse?
- Part 2: Why are duplications happening and JOINs slowing ClickHouse down?
- Part 3: ClickHouse ReplacingMergeTree and Materialized Views are not enough
- Part 4: Can Apache Flink be the solution?
- Part 5: How GlassFlow will solve Duplications and JOINs for ClickHouse