
Part 3: ClickHouse ReplacingMergeTree and Materialized Views are not enough

Deep dive on limitations of ReplacingMergeTree and Materialized Views.

Written by Armend Avdijaj, 28/03/2025, 14:42

OK, now that we have examined and understood JOINs and why duplications occur when ingesting from Kafka into ClickHouse, I would like to walk you through the potential solutions that ClickHouse provides.

Handling JOIN Performance in ClickHouse

Please keep in mind while reading this that, in general, JOINs on large data sets can have a huge impact on real-time analytics workloads, as most join algorithms hold the data in memory (link).

Using Materialized Views for Pre-Joined Data

So, one way to sidestep running JOINs at query time is to precompute them and store the results in materialized views. This only works if you know in advance exactly which data needs to be joined. That way, ClickHouse can return results faster.

Limitations of Materialized Views

  1. Only works for new data: A big limitation is that materialized views only work well if new data keeps arriving and historical data doesn’t change. If you frequently update the lookup table (e.g., user details, product info), the materialized view may contain outdated data. Since it does not automatically refresh on updates, you may end up querying stale data unless you manually refresh or rebuild the materialized view.
  2. Performance issues for larger datasets: If your dimension table is very large, it may not fit into memory, causing performance issues.
  3. Limited for real-time or streaming workloads: The view’s JOIN is evaluated only when new rows arrive in its source table, so it is not ideal for real-time or low-latency JOINs against reference data that changes independently.

Example:

Imagine you have 2 tables:

  • Orders: Storing order details
  • Customers: Storing customer names and locations

You create a materialized view that JOINs them:
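
A minimal sketch of what this could look like, assuming illustrative table and column names (orders, customers, orders_enriched, customer_id, and so on):

  CREATE TABLE orders
  (
      order_id UInt64,
      customer_id UInt64,
      amount Decimal(10, 2)
  )
  ENGINE = MergeTree
  ORDER BY order_id;

  CREATE TABLE customers
  (
      customer_id UInt64,
      name String,
      location String
  )
  ENGINE = MergeTree
  ORDER BY customer_id;

  -- Target table that stores the pre-joined rows.
  CREATE TABLE orders_enriched
  (
      order_id UInt64,
      customer_id UInt64,
      amount Decimal(10, 2),
      name String,
      location String
  )
  ENGINE = MergeTree
  ORDER BY order_id;

  -- The materialized view joins each incoming order against customers
  -- at insert time and writes the result into orders_enriched.
  CREATE MATERIALIZED VIEW orders_enriched_mv TO orders_enriched AS
  SELECT
      o.order_id,
      o.customer_id,
      o.amount,
      c.name,
      c.location
  FROM orders AS o
  LEFT JOIN customers AS c ON o.customer_id = c.customer_id;

Note that the materialized view only fires on inserts into orders, the left-most table of the JOIN, which is exactly what causes the behavior below.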

So what happens?

  • When a new order is inserted into the orders table, it is added to the materialized view.
  • If a new customer is added to the customers table, it will not update the existing rows in the materialized view.
  • If a customer's location is updated in the customers table, the materialized view will still contain the old location for existing records.

Denormalization instead of JOINs

Another option to avoid JOINs is denormalization, which means storing all necessary fields in a single table so that no real-time JOINs are needed. The limitation is that building this table requires an ETL tool that prepares the combined data and ingests it directly into ClickHouse. This way, queries can run on ClickHouse without involving any JOINs (a sketch of such a table follows below). I will explain later in the article how Flink and GlassFlow can be used here.
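
As a rough sketch, reusing the illustrative orders/customers fields from above, the denormalized target table could look like this; an upstream ETL or streaming job is responsible for filling in the customer fields before insert:

  -- One flat table that already contains the customer attributes,
  -- so queries never need a JOIN at all.
  CREATE TABLE orders_denormalized
  (
      order_id UInt64,
      amount Decimal(10, 2),
      customer_id UInt64,
      customer_name String,
      customer_location String
  )
  ENGINE = MergeTree
  ORDER BY order_id;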

Lookups through external Dictionaries

ClickHouse offers the option to use a key-value lookup mechanism instead of performing JOINs. What it means is that you can retrieve values from an external data source (another ClickHouse table, a file, or even an external database) in a way that is faster and more memory-efficient than performing a traditional JOIN. So, instead of joining a fact table (events) with a dimension table (users) at query time, a dictionary loads the reference data into memory and enables instant lookups.
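
As an illustration, and assuming the same customers table as above, a dictionary backed by that table could be defined roughly like this (names, layout, and refresh interval are placeholders):

  CREATE DICTIONARY customers_dict
  (
      customer_id UInt64,
      name String,
      location String
  )
  PRIMARY KEY customer_id
  SOURCE(CLICKHOUSE(TABLE 'customers'))
  LAYOUT(HASHED())
  LIFETIME(MIN 300 MAX 600);  -- reload from the source every 300-600 seconds

  -- Key-value lookup at query time instead of a JOIN:
  SELECT
      order_id,
      dictGet('customers_dict', 'location', customer_id) AS customer_location
  FROM orders;

The LIFETIME clause controls how often the dictionary is reloaded from its source, which is where the refresh-interval limitation below comes from.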

Limitations:

  • It works very well for small data sizes, but for large datasets, you can imagine that storing massive reference tables in memory can become very costly.
  • Changes in the source table are not instantly reflected; updates depend on the dictionary’s refresh interval.

Conclusion on solving JOINs directly on ClickHouse

  • The conclusion here is that handling JOINs within ClickHouse works great for small to mid-size data sets but reaches its limitations for larger data sets.
  • Using materialized views can improve query speed by precomputing and storing results, but they are limited in handling changes to historical data.
  • Denormalization offers an alternative by consolidating data into a single table, reducing the need for real-time JOINs, but it requires an ETL step before ingestion.
  • For smaller datasets, using external dictionaries provides a fast and memory-efficient way to perform lookups without traditional JOINs, though it may not scale well with large datasets.
  • Ultimately, the options for larger data sets are limited without adding additional processing before ingestion.

Handling Duplicates in ClickHouse

As duplicates can naturally occur with Kafka, users are forced to figure out a solution for them. The main option for solving this within ClickHouse is called ReplacingMergeTree.

Using ReplacingMergeTree for Deduplication

ClickHouse’s ReplacingMergeTree automatically removes duplicate records based on a specified column (e.g., an event ID). The concept allows you to define a column that will be used to determine which rows are considered duplicates (often a version or timestamp column). The engine works with the assumption that you have data arriving in a "streaming" fashion, and you want to keep only the latest row for each entity (e.g., user, order, etc.).

Example:
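
A minimal sketch, with illustrative names and a timestamp used as the versioning column:

  -- updated_at is the versioning column: during merges, ClickHouse keeps
  -- the row with the highest updated_at for each order_id.
  CREATE TABLE orders_dedup
  (
      order_id UInt64,
      status String,
      updated_at DateTime
  )
  ENGINE = ReplacingMergeTree(updated_at)
  ORDER BY order_id;

  -- Inserting two versions of the same order:
  INSERT INTO orders_dedup VALUES (1, 'created', '2025-03-01 10:00:00');
  INSERT INTO orders_dedup VALUES (1, 'shipped', '2025-03-02 09:30:00');
  -- After a background merge (or a query with FINAL), only the
  -- 'shipped' row remains for order_id = 1.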

How it works:

  1. Data Insertion: When new data is inserted into a ReplacingMergeTree table, multiple rows with the same primary key (the unique identifier for a record) may exist in the table. ReplacingMergeTree tracks the rows and ensures that only the latest one (based on the specified column) remains.
  2. Versioning Column: You specify a versioning column when creating the table. This could be a timestamp, a version number, or any other column that reflects the "freshness" of the data. If multiple rows with the same primary key (e.g., user ID) are inserted, the row with the largest version (based on the versioning column) will eventually replace the older ones.
  3. Merge Process: As data accumulates in the table, ClickHouse performs background merges to optimize storage and deduplicate the data. During a merge, the engine compares rows with the same primary key and keeps only the row with the highest value in the versioning column (i.e., the most recent row).
  4. Querying: Deduplication only becomes visible once the background merges have run. Until then, a query may still return several versions of the same record, unless you use the FINAL modifier to force deduplication at query time (more on its cost below).

Limitations:

Looking into the ReplacingMergeTree concept, the main limitation is the merging process, particularly as data sets grow in size.

  • The merging process: A merge scheduler runs in the background. It can become resource-intensive and time-consuming when dealing with large volumes of data. As more data is inserted, ClickHouse divides it into parts, and each part may contain multiple versions of records with the same primary key. The merge scheduler is configured to run automatically. As a user you can adjust some settings like merge size, frequency, resource usage and trigger the merges manually but there is no built-in way to schedule merges at specific times directly.
  • Only the latest record is retained: The merge process works to retain only the most recent version of each record by comparing versioning columns across parts. If the data set is large or if many parts exist, merging can take longer because the system needs to compare and rewrite large amounts of data, potentially affecting system performance during the merge.
  • Performance issues during the merge process: Until a merge is completed, queries might return outdated results, as they may include records from different parts that haven’t yet been merged. This can lead to inconsistent query results unless the FINAL modifier is used.
  • FINAL is very slow: The FINAL modifier forces ClickHouse to return the most recent, merged version of the data, ensuring that outdated records are excluded from the query results. But this comes at a cost: ClickHouse has to apply the merge logic at query time instead of in the background, which can slow down queries significantly, particularly on very large tables with frequent updates and merges.
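
To illustrate, using the orders_dedup sketch from above, forcing deduplication looks roughly like this:

  -- Deduplicate at query time; correct results, but slower on large tables:
  SELECT * FROM orders_dedup FINAL WHERE order_id = 1;

  -- Trigger an unscheduled merge of the table's parts; resource-intensive:
  OPTIMIZE TABLE orders_dedup FINAL;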

Merge duration:

Obviously, there are a lot of parameters that influence how long a merge takes, such as data volume, number of parts, part sizes, available CPU, disk, etc., but I would like to give you a rule of thumb here, as I have heard many users complain about it.

  • Small Datasets (Under 10GB):

For small datasets, merges generally happen quickly, often in a matter of seconds or minutes, especially if part sizes are small and merges are happening frequently.

  • Medium-Sized Datasets (10GB - 100GB):

As the dataset grows, merges will take longer. A table with tens of GBs of data may take several minutes to an hour to merge, depending on the number of parts, part sizes, and system resources. The merging process will likely involve processing hundreds of parts, so the merge time increases.

  • Large Datasets (100GB - 1TB):

For large datasets, merges can take much longer, ranging from several minutes to several hours, depending on the system configuration and hardware. The merge time can extend even further if there are a lot of small parts or the disk and CPU resources are not optimized.

  • Very Large Datasets (1TB and above):

Merges for datasets over 1TB can take hours or even days, especially if the system has many small parts. In such cases, optimizing part sizes and merge configurations is crucial to prevent bottlenecks. Additionally, using faster hardware (e.g., SSDs) and increasing the number of threads for merging can help speed up the process.

Conclusion on solving Duplicates directly on ClickHouse

In conclusion, ClickHouse’s ReplacingMergeTree engine is an effective way to deduplicate smaller datasets by keeping only the latest version of each record. But as data grows, the background merging process can significantly slow down performance. Without the option to schedule merges for a specific time, this becomes an uncontrollable risk: merges consume resources, and until they complete, users can face errors, outdated results, and very slow performance. While the FINAL modifier ensures up-to-date data, it further slows down queries, especially on larger datasets. This approach works well for smaller datasets but becomes inefficient as data scales, which is not ideal, as many companies use ClickHouse precisely to process larger data sets.


Next part: Part 4: Can Apache Flink be the solution?
