
Part 5: How GlassFlow will solve Duplications and JOINs for ClickHouse

Learn the details on how GlassFlow will solve Duplications and JOINs.

Written by Armend Avdijaj, 28/03/2025, 15:41

Overall Conclusion: The existing solutions aren’t optimized for ClickHouse

For a summarized overview, I have prepared the table below listing the options for solving JOINs and duplicates in ClickHouse when using Kafka.


| Solution | Use Case | Advantages | Limitations | Best For |
|---|---|---|---|---|
| ReplacingMergeTree | Deduplication | ✅ Efficient for small datasets | ❌ Resource-intensive merging process; only retains the latest record; FINAL keyword slows down queries; no built-in way to schedule merges at specific times; can lead to outdated results during merges | Small to medium datasets |
| Materialized Views | Precomputed JOINs | ✅ Fast for static, unchanging data | ❌ Not suitable for real-time data; only works for new data, not historical changes; limited to known JOIN specifications; asynchronous updates | Real-time data with low update rates |
| Denormalization | Avoid JOINs by consolidating data | ✅ Improves speed, reduces JOINs | ❌ Requires an ETL process and increases storage; potential data redundancy | Batch or static data |
| Apache Flink | Stream processing (deduplication & JOINs) | ✅ Real-time, scalable, flexible | ❌ Complex setup and maintenance; requires configuring Kafka Connect and Flink; adds complexity to the data pipeline; high risk of misconfiguration; no managed connector to ClickHouse; debugging in Java | Large-scale streaming data |

Summary:

After looking at all of these options, it becomes clear that they are workarounds rather than built-in, easy-to-use features for handling duplications and JOINs in ClickHouse. They often require careful configuration, high effort, and additional processing, or come with performance trade-offs. Especially with bigger datasets (common use cases of ClickHouse), these limitations become more pronounced, making it challenging to manage duplications and perform JOINs efficiently. A solution that combines ease of use, built-in functionality, and scalability for both deduplication and JOIN operations is still needed for ClickHouse.

Introducing GlassFlow: The easiest way to do stream processing for ClickHouse

At GlassFlow, we love ClickHouse and share the community’s passion for real-time analytics. We have seen streaming data users struggle repeatedly with duplicates and JOINs, so we focused on building GlassFlow for ClickHouse. Our goal is to help more Kafka users gain value from ClickHouse in the easiest way possible.

Meet GlassFlow for ClickHouse: Open-source stream processing built to solve ClickHouse duplications and JOINs for Kafka users.

[Image: GlassFlow for ClickHouse positioning]

Our approach minimizes the users' configuration needs and lets our system handle concerns like memory usage, checkpointing, recovery, and updates. This way, users can get the system up and running in minutes without putting much effort into setup and maintenance.

Let’s get into the details to show you how smooth it is to solve duplications and JOINs for ClickHouse with GlassFlow.


Managed connector to Kafka

The first question is: how do I get my Kafka stream data into GlassFlow so I can run deduplication and JOINs? We aim to build a managed connector that is safe, reliable, and integrates seamlessly with Kafka. This connector is designed to handle two streams simultaneously. Key functionalities of the connector are:

  • Reliability: Built with fault-tolerant design to ensure your data flows without interruptions, even during network failures or system restarts.
  • GlassFlow-Managed Updates: GlassFlow automatically updates it, so you always have the latest features, bug fixes, and performance improvements with zero downtime.
  • Easy Setup through UI: Designed for easy configuration, the GlassFlow Managed Kafka Connector integrates quickly into your existing pipeline with minimal overhead, letting you focus on your core tasks.

Users of AWS MSK, Confluent Cloud and local users will be able to connect easily to our product.


Deduplication built-in with up to 7 days duplication check

GlassFlow makes deduplication in ClickHouse effortless. Simply select the deduplication key, for example trip_id, and GlassFlow takes care of the rest.

No need to configure memory, manage state, or tune performance. Our system automatically tracks unique records within a rolling time window, ensuring clean, accurate data without duplicates.

Setting up is seamless: Define your deduplication key and map fields to ClickHouse all within a few clicks on our UI. Once the selected time window closes, old duplicates are automatically cleaned up, preventing storage and memory issues while keeping ClickHouse fast and efficient.

Deduplication is executed instantly (benchmarks will follow). If another event with the same primary key comes in later, say 6 hours into a 24-hour time window, GlassFlow checks that the key was already seen and rejects the duplicated event.
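Conceptually, the rolling-window deduplication described above can be sketched in a few lines of Python. This is only an illustration of the technique, not GlassFlow's actual implementation; the event shape and the trip_id key are taken from the example above.

```python
import time

class WindowedDeduplicator:
    """Tracks keys seen within a rolling time window and rejects duplicates."""

    def __init__(self, key_field, window_seconds):
        self.key_field = key_field
        self.window = window_seconds
        self.seen = {}  # key -> timestamp when the key was first seen

    def _evict_expired(self, now):
        # Drop keys whose window has closed, so state stays bounded
        # and storage/memory issues are avoided.
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.window}

    def accept(self, event, now=None):
        now = time.time() if now is None else now
        self._evict_expired(now)
        key = event[self.key_field]
        if key in self.seen:
            return False  # duplicate within the window: reject
        self.seen[key] = now
        return True

# 24-hour window keyed on trip_id, as in the example above.
dedup = WindowedDeduplicator("trip_id", window_seconds=24 * 3600)
assert dedup.accept({"trip_id": "t1"}, now=0) is True
assert dedup.accept({"trip_id": "t1"}, now=6 * 3600) is False   # 6h later: rejected
assert dedup.accept({"trip_id": "t1"}, now=25 * 3600) is True   # window closed: accepted again
```

The key design point is that expired keys are evicted on every call, so the state never grows beyond the keys seen inside one window.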


JOINs are easier than ever

With GlassFlow you can perform JOIN operations between two Kafka topics before ingesting to ClickHouse. You start by defining a join key and specifying the join parameters. Users first select two topics (e.g., customer-orders-v1 and customer-list) and then specify the keys used to match records from each topic (e.g., user_id from customer-orders-v1 and customer_id from customer-list). The data type for the join key can be selected to ensure compatibility, and a time window for the join operation (up to 60 minutes) is configured to limit matching to events within a specific timeframe. Sample events from both topics help you select the right keys, and the system automatically handles memory usage for the JOINs. After configuring the JOIN keys, you select the destination table in ClickHouse at the next step, and the system executes your JOINs.
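A windowed stream join of this kind can be sketched as follows. This is an illustrative Python sketch of the technique, not GlassFlow's implementation; the topic contents, key names (user_id, customer_id), and the ts timestamp field are assumed for the example.

```python
from collections import defaultdict

def windowed_join(left_events, right_events, left_key, right_key, window_seconds):
    """Join two event streams on matching keys, keeping only pairs whose
    timestamps fall within the configured time window of each other."""
    # Index the right-hand stream by its join key for fast lookup.
    right_index = defaultdict(list)
    for ev in right_events:
        right_index[ev[right_key]].append(ev)

    joined = []
    for left in left_events:
        for right in right_index.get(left[left_key], []):
            if abs(left["ts"] - right["ts"]) <= window_seconds:
                # Merge both records into one row for ClickHouse.
                joined.append({**right, **left})
    return joined

orders = [{"user_id": "u1", "amount": 42, "ts": 100}]
customers = [{"customer_id": "u1", "name": "Ada", "ts": 80}]
rows = windowed_join(orders, customers, "user_id", "customer_id",
                     window_seconds=60 * 60)  # up to 60 minutes, as above
# Each row now carries both the order fields and the customer fields.
```

The time window bounds how long unmatched events from either side must be kept in memory, which is why capping it (at 60 minutes here) keeps the join's state manageable.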


Managed connector with Buffering for ClickHouse

GlassFlow simplifies buffering to ClickHouse through a fully managed connector, allowing users to fine-tune batch ingestion for optimal performance. Users can configure ingestion in one of two modes:

  1. Size-based: data is sent once a threshold of accumulated records is reached.
  2. Time-windowed: batches are flushed at fixed intervals.

This ensures efficient writing, reduces fragmentation and write amplification, improves query latency, and eliminates the operational headache of manually managing buffering.
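The two flush modes can be sketched together in a small buffer class. This is a simplified Python illustration of the batching technique, not the managed connector itself; the insert_fn callback stands in for a batched INSERT into ClickHouse.

```python
import time

class ClickHouseBuffer:
    """Buffers rows and flushes either when max_rows is reached (size-based)
    or when flush_interval seconds have elapsed (time-windowed)."""

    def __init__(self, insert_fn, max_rows=1000, flush_interval=5.0):
        self.insert_fn = insert_fn        # e.g. a batched INSERT into ClickHouse
        self.max_rows = max_rows
        self.flush_interval = flush_interval
        self.rows = []
        self.last_flush = time.monotonic()

    def add(self, row):
        self.rows.append(row)
        now = time.monotonic()
        if len(self.rows) >= self.max_rows or now - self.last_flush >= self.flush_interval:
            self.flush(now)

    def flush(self, now=None):
        if self.rows:
            # One batched write instead of many small inserts.
            self.insert_fn(self.rows)
            self.rows = []
        self.last_flush = time.monotonic() if now is None else now

batches = []
buf = ClickHouseBuffer(batches.append, max_rows=3, flush_interval=60.0)
for i in range(7):
    buf.add({"id": i})
buf.flush()  # flush the remainder on shutdown
# batches now holds three batches of 3, 3, and 1 rows
```

Batching like this is what avoids the many-small-parts problem in ClickHouse: fewer, larger inserts mean fewer parts to merge and less write amplification.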


The second step in connecting to ClickHouse is selecting the destination table and mapping the fields to the relevant data types before deploying the pipeline.


Wanna try? Get notified when the repo is available

I hope we made you curious to try GlassFlow yourself. If so, add your email to our notification list here.

In the meantime, we are working on the docs and are preparing benchmarks to give you an even more detailed understanding of what you can achieve with GlassFlow.



Get started today

Reach out and we'll show you how GlassFlow interacts with your existing data stack.

Book a demo