Deduplication is essential for streaming use cases: retries, network delays, and distributed-system inconsistencies can all produce duplicate events, so you need to ensure that no data point is processed more than once. ClickHouse's ReplacingMergeTree table engine is one solution to this problem.
This tutorial introduces you to ReplacingMergeTree and explains its role in deduplication within ClickHouse. We'll cover what this engine is, how it performs deduplication, and its specific limitations for this particular use case.
Introduction to ReplacingMergeTree for Deduplication
ClickHouse is renowned for its blazing-fast analytical processing, and ReplacingMergeTree is one of its unique table engines designed for background deduplication. Sounds interesting, right? Let's go deeper. In this article, we dive into:
- What exactly ReplacingMergeTree is.
- How its deduplication mechanism operates.
- The limitations and challenges of relying solely on it for deduplication.
What is ReplacingMergeTree?
ReplacingMergeTree is a specialized table engine created by ClickHouse to remove duplicate entries during background merge operations. In other words, the actual deduplication happens in the background at a time that is not known to the user (it is managed by ClickHouse). Although there are ways to force this process to run immediately, they can often be costly in terms of both reads and writes. So, if your use case doesn't need immediate deduplication of data and you would like to save some space, this might be a good option for you.
Note that it differs from the plain MergeTree engine in that it uses an Ordering Key rather than a Primary Key to perform deduplication. We will see how this affects the overall deduplication process in the next section. Take a look at Figure 1 below to get a better idea of how the entire process works.
Figure 1. How ReplacingMergeTree Works in ClickHouse (An Overview)
Now that we know the core concept, let’s dive deeper. In the next section, we’ll explore the various options ClickHouse offers for this table engine, how the deduplication process works, and the key considerations to keep in mind when choosing this approach for your data.
How Deduplication Works with ReplacingMergeTree
In the last section, we saw how ReplacingMergeTree facilitates background deduplication by retaining only the latest version of each row based on a specified sorting key. But how exactly does this deduplication process work, and what options do we have to further optimize it? Let's dive deeper into the mechanics and customization possibilities.
Understanding the Deduplication Process
As we have already seen in the previous section, when you insert data into a ReplacingMergeTree table, ClickHouse doesn't immediately eliminate duplicate entries. Instead, it stores all versions of rows that share the same sorting key.
The actual deduplication occurs during merge operations, which are background processes that consolidate data parts. During these merges, ClickHouse retains only the most recent version of each row, determined by the sorting key.
It's important to note that these merge operations are asynchronous and happen at unknown times. This means that immediately after inserting data, your table might still contain duplicates until the next merge occurs.
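To make this concrete, here is a minimal sketch (the table name and values are invented for illustration) showing that duplicates remain visible until a merge happens:

```sql
-- Minimal illustration: duplicates persist until a background merge runs
CREATE TABLE demo_dedup
(
    key   UInt64,
    value String
)
ENGINE = ReplacingMergeTree
ORDER BY key;

-- Two inserts with the same sorting key land in separate data parts
INSERT INTO demo_dedup VALUES (1, 'first');
INSERT INTO demo_dedup VALUES (1, 'second');

-- Before a merge this can return both rows; after a merge, only the
-- later row for key = 1 survives
SELECT * FROM demo_dedup;
```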
However, we do have some customization options to change things according to our requirements. Let’s see what those are.
Customization Options
ClickHouse's ReplacingMergeTree engine offers several customization features to tailor the deduplication process to your specific needs. By understanding and configuring these options, you can optimize data storage and retrieval for your use case.
Let’s discuss each of the available options.
Figure 2. ReplacingMergeTree Parameters Explained
ver
To manage which row to retain during deduplication, you can define a `ver` (version) column in your table schema. This column helps ClickHouse determine the most recent or relevant entry when duplicates are present.
It can be an integer or a date/datetime column. Note that this parameter is optional and can be skipped. However, if your application has a timestamp column you would like to use for tracking a particular entity, using it gives you more predictable deduplication than the default behavior of simply keeping the last inserted row.
is_deleted
If your application requires you to track deleted records and access them for downstream tasks, ClickHouse gives you the option to define an `is_deleted` column. However, it is important to note that this has to be declared during table creation, and it requires the version column to be specified.
This column essentially indicates the deletion status of a row during merge operations. It uses binary values (`0` or `1`) to mark rows as active or deleted. During merges, ClickHouse considers this column to determine which rows to keep or discard. It is important to note that no matter the operation on the data, the version should be increased. If two inserted rows have the same version number, the last inserted row is kept.
Here is an example query that creates a table in ClickHouse using the ReplacingMergeTree engine with the above-mentioned parameters (the table and column names below are illustrative placeholders):
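```sql
-- Illustrative schema; table and column names are placeholders
CREATE TABLE user_events
(
    user_id    UInt64,
    event_time DateTime,
    status     String,
    version    UInt64,   -- the "ver" parameter: the highest value wins on merge
    is_deleted UInt8     -- 1 marks a row as deleted
)
ENGINE = ReplacingMergeTree(version, is_deleted)
ORDER BY user_id;
```

During merges, rows sharing the same `user_id` are collapsed down to the entry with the highest `version`; rows marked with `is_deleted = 1` can then be filtered out at query time or cleaned up during merges. Note that the `is_deleted` parameter is only available in relatively recent ClickHouse releases.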
Understanding the technicalities of the deduplication process for ReplacingMergeTree gives you more control over it. However, this implementation falls short in many real-world use cases because of the way it was designed. Let's elaborate on the primary limitations of this approach.
Limitations of ReplacingMergeTree for Deduplication
Given the way this table engine works, there are three main limitations that can make it undesirable for a host of applications. Before we go into the details of each, let's look at a visual depiction of these drawbacks for a better idea.
Figure 3. Limitations of using ReplacingMergeTree for Deduplication in ClickHouse (Note: for Performance Considerations, the graphs shown are just estimates for understanding, not actual benchmarks)
Incomplete Deduplication
The primary disadvantage of using this technique is that it does not provide immediate deduplication, but rather eventual deduplication as shown in Figure 3.
This can cause problems for applications where immediate deduplication is crucial. For instance, if a warehouse inventory is managed on ClickHouse using this approach, duplicates may result in inconsistent revenue or stock numbers on the dashboard immediately after updates to the database. The simplest workaround is to add logic to your queries so that they fetch only one record per entry, as shown below, but that requires extra setup and can impact performance.
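For instance, one common way to express that query-side logic in ClickHouse is the `FINAL` modifier or an `argMax` aggregation (shown here against the illustrative `user_events` table from earlier):

```sql
-- FINAL applies merge semantics at read time, collapsing duplicates
-- in the result at extra query cost
SELECT *
FROM user_events FINAL
WHERE user_id = 42;

-- Alternative: explicitly pick the latest version for each key
SELECT
    user_id,
    argMax(status, version) AS latest_status
FROM user_events
GROUP BY user_id;
```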
Hence, this method is only suitable when deduplication does not need to be reflected immediately, which rules out tables that are constantly being queried for up-to-date results.
Performance Considerations For Larger Datasets
To address the above drawback, ClickHouse does provide some options to its users. More specifically, there are two ways to gain more control over merge operations:
- Adjusting Merge Settings: You can configure settings like `min_age_to_force_merge_seconds` to influence how quickly ClickHouse performs merges after data insertion. Setting this value to a lower threshold prompts more frequent merges, reducing the window during which duplicates might exist. However, this can have its own set of implications, like increased CPU and memory usage.
- Manual Merges with `OPTIMIZE FINAL`: If you need immediate deduplication, you can manually trigger a merge using the `OPTIMIZE TABLE ... FINAL` command. This forces ClickHouse to merge all parts, eliminating duplicates. However, keep in mind that this operation can be resource-intensive, especially on large tables, and might impact performance. Both options are sketched right after this list.
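As a rough sketch of what these two options look like in practice (again using the illustrative `user_events` table):

```sql
-- Option 1: nudge ClickHouse to merge parts that are older than 60 seconds
ALTER TABLE user_events
    MODIFY SETTING min_age_to_force_merge_seconds = 60;

-- Option 2: force a full merge immediately (resource-intensive on large tables)
OPTIMIZE TABLE user_events FINAL;
```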
So, while there are ways to overcome some of the disadvantages, they are quite resource-intensive, which could be a problem for large-scale applications.
Improper Historical Management of Deleted Records
While the ReplacingMergeTree engine in ClickHouse is adept at deduplicating data by retaining the latest version of records during background merge operations, it introduces a notable limitation: the inability to preserve historical versions of data. Once a merge occurs, previous iterations of a record are discarded, making it challenging to track changes over time, which is a crucial requirement for use cases like auditing and compliance assessment.
For instance, consider a system that monitors user activities over time. If a user performs multiple actions, each action represents a change of state. With ReplacingMergeTree, only the most recent action is preserved after merging, resulting in the loss of earlier activity records. As a result, we would be unable to leverage that lost data to analyze things like user behavior patterns.
That being said, there are solutions to this problem, like implementing a table that follows the Slowly Changing Dimension (SCD) Type 2 approach (see Figure 3). This method involves creating a new record for each change, complete with timestamps indicating the validity period of each version. By doing so, all historical states are preserved, enabling detailed analysis and reporting over time. However, it requires the data pipelines to be designed in a way that aligns with this table structure, so retrofitting it into an existing application means a lot of rework.
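As an illustration, a hypothetical SCD Type 2 table (the schema below is a sketch, not a prescribed design) might look like this:

```sql
-- Hypothetical SCD Type 2 layout: every change appends a new row,
-- so all historical states are preserved
CREATE TABLE user_status_history
(
    user_id    UInt64,
    status     String,
    valid_from DateTime,
    valid_to   Nullable(DateTime),  -- NULL while this row is the current version
    is_current UInt8
)
ENGINE = MergeTree
ORDER BY (user_id, valid_from);
```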
Conclusion
Hence, based on the discussion above, it's fair to say that ReplacingMergeTree is only useful for very specific applications, namely those where immediate data consistency is not a requirement, because of the way it is implemented.
So, what is the solution for this? Well, ClickHouse provides various other table engines to address frequent updates and data consistency. You can switch to CollapsingMergeTree or VersionedCollapsingMergeTree if your scenario allows you to use a dedicated column to work out which entry should be kept.
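For example, CollapsingMergeTree relies on a sign column, where a -1 row cancels out a previously inserted +1 row during merges. A minimal sketch with invented names:

```sql
-- Sketch of the CollapsingMergeTree approach: sign = 1 is a state row,
-- sign = -1 cancels the matching earlier row during merges
CREATE TABLE user_state
(
    user_id UInt64,
    status  String,
    sign    Int8
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY user_id;

-- "Update" user 42 by cancelling the old row and inserting the new state
INSERT INTO user_state VALUES (42, 'active', 1);
INSERT INTO user_state VALUES (42, 'active', -1), (42, 'inactive', 1);
```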
However, if you want to implement a comprehensive and robust deduplication solution, you would have to rely on custom implementations. You can develop these using tools and frameworks like Apache Flink, but that requires considerable expertise and effort. Isn't there an easier way? Don't worry, I have got you covered.
With Glassflow, you can set up a deduplication pipeline for your streaming data in minutes, and the good news is that almost everything will be managed by us!