Introduction
ClickHouse is a high-performance, column-oriented database designed for online analytical processing (OLAP). Its efficient insert operations, whether synchronous or asynchronous, are key to handling large-scale data ingestion. This article focuses on asynchronous inserts in ClickHouse: how they work, the benefits they offer, and the inherent limitations that make them less suitable for real-time data scenarios.
So, let’s start by getting a better idea of what asynchronous inserts are and how ClickHouse handles them.
How Does It Work?
This section starts with the fundamentals of asynchronous inserts and then delves into how ClickHouse has implemented that in their infrastructure. Finally, we will understand the entire process in detail with a simple example.
What are Asynchronous Inserts?
Figure 1. How asynchronous inserts work
At its core, asynchronous inserts let you add data to a database table without requiring it to be written immediately. Instead, the data is held in a temporary location, and at pre-set intervals all the data in that location is inserted into the corresponding tables (yes, it can hold data destined for multiple tables). Now, your question here might be: why this approach? Well, it has some real advantages:
- Improved Performance: By deferring the actual disk writes, the system can handle higher rates of incoming data without slowing down the client’s experience.
- Reduced Latency: Clients experience minimal waiting time since the response is sent as soon as the data is queued (note that this has to be configured in ClickHouse; more on this later).
- Efficient Resource Utilization: The background processing can be optimized to better manage disk I/O, balancing workload and resource consumption.
Synchronous vs Asynchronous Inserts
In contrast, synchronous inserts require the data to be immediately processed and written to the storage engine (i.e. the table). This direct approach:
- Forces the client to wait until the insert is fully completed.
- Can become a bottleneck under heavy load, as the write operation’s speed directly affects the query response time (system resources are in use for the inserts).
- Is more suitable for scenarios where immediate data availability is crucial, such as in real-time analytics.
So, while the async option sacrifices immediate consistency, it is a very good fit when near real-time data is sufficient and write volume is high enough that synchronous inserts would hurt performance.
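To make the contrast concrete, here is a minimal sketch in ClickHouse SQL, assuming a hypothetical metrics table; only the SETTINGS clause differs between the two modes:

```sql
-- Synchronous (default): the client blocks until the row is written as a part on disk.
INSERT INTO metrics (timestamp, name, value)
VALUES (now(), 'cpu_usage', 0.73);

-- Asynchronous: the row lands in an in-memory buffer first and is flushed to disk later.
INSERT INTO metrics (timestamp, name, value)
SETTINGS async_insert = 1
VALUES (now(), 'cpu_usage', 0.73);
```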
Now, let’s discuss how ClickHouse approaches this in their infrastructure.
ClickHouse Architecture Overview
ClickHouse’s architecture is designed to handle massively parallel data processing efficiently, and the asynchronous insert mechanism was built with this primary requirement in mind. Here is a diagram from ClickHouse that depicts their async insert process.
Figure 2. How ClickHouse handles Asynchronous Inserts (Source)
Here are the key components for this architecture:
- In-Memory Buffers: This is the temporary, in-memory storage location we discussed in the last section, where data is initially placed. When data is inserted asynchronously, it is first stored in these buffers before being written to the target tables.
- Background Flushes: A dedicated background flush process periodically writes buffered data to the main storage, ensuring that the system maintains high performance without manual intervention.
ClickHouse provides various configuration options that let you tune the behavior of asynchronous inserts. This includes settings like buffer size, flush intervals, and thresholds for triggering background merges.
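As a side note, you can observe this machinery through ClickHouse’s system tables. The sketch below assumes a recent ClickHouse version; column names may differ in yours, so check the system-tables reference for your release:

```sql
-- Inserts currently waiting in the in-memory buffer.
SELECT query, database, table, total_bytes
FROM system.asynchronous_inserts;

-- History of flushed async inserts (requires the asynchronous_insert_log to be enabled).
SELECT query, bytes, status, flush_time
FROM system.asynchronous_insert_log
ORDER BY flush_time DESC
LIMIT 10;
```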
To enable asynchronous insert processing in ClickHouse, you can configure the relevant settings at different levels:
- User Level: Enable asynchronous inserts for a specific user by setting async_insert=1.
- Query Level: Use the SETTINGS clause in your insert queries to specify asynchronous insert settings.
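For example, a query-level configuration might look like the following sketch (the metrics table and its columns are assumed here for illustration):

```sql
INSERT INTO metrics (timestamp, name, value)
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (now(), 'cpu_usage', 0.73);
```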
- Connection Level: Specify asynchronous insert settings as connection parameters when using a ClickHouse client, for instance with the ClickHouse Java JDBC driver.
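With the Java JDBC driver, for instance, the settings can be appended to the connection URL. The exact parameter syntax can vary by driver version, so treat this as an illustrative sketch and check your driver’s documentation:

```
jdbc:clickhouse://localhost:8123/default?async_insert=1&wait_for_async_insert=1
```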
Apart from this, various configurable parameters let you further customize the process to your requirements and use case. For instance, you can control when the flush operation occurs using the following two parameters; a flush is triggered when either:
- the buffer has reached a particular size (configurable via async_insert_max_data_size), or
- at least x ms have passed since the last buffer flush (x is configurable via async_insert_busy_timeout_max_ms).
Note: Here, the wait_for_async_insert parameter configures when the acknowledgement for your query is sent back to the client. When set to 1, it is sent after the flush operation for that particular data is complete. If it is set to 0, the acknowledgement is sent immediately (fire-and-forget behavior, as described by ClickHouse), but it has its own implications. Check out more here.
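Putting these settings together, a tuned fire-and-forget insert might look like this sketch (the table and the threshold values are illustrative, not recommendations):

```sql
INSERT INTO metrics (timestamp, name, value)
SETTINGS
    async_insert = 1,                        -- buffer this insert server-side
    wait_for_async_insert = 0,               -- ACK immediately (fire-and-forget)
    async_insert_max_data_size = 1048576,    -- flush once the buffer holds ~1 MiB
    async_insert_busy_timeout_max_ms = 1000  -- ...or at most 1000 ms after the last flush
VALUES (now(), 'cpu_usage', 0.73);
```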
Now, let’s go through a very simple example to understand the entire process better.
Technical Workflow
Here is a very simple example that we will be considering: adding metric information from the client application into a ClickHouse table named “metrics”. Below is the step-by-step process illustrated in the image:
Figure 3: ClickHouse Async Insert Technical Workflow Example
- Client Application Sends Insert Query: The client application submits an INSERT statement with asynchronous insert settings.
- ClickHouse Server Receives Data & Buffers It: If wait_for_async_insert is set to 0, ClickHouse immediately acknowledges (ACK) the insert request to the client without waiting for the data to be written to disk; otherwise, it waits until the flush occurs. The incoming data is stored in an in-memory buffer, structured in a binary format as shown in the flow.
- Background Flush Process: A background process monitors the buffer and triggers a flush when any of the configured thresholds (or their defaults, if none are set) is reached.
- Data is Written to Disk Storage: The buffered data is converted into columnar files and stored in MergeTree storage format as shown in the final stage of the flow.
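The steps above can be tied back to a concrete query. The insert in step 1 might look like the following (the metrics schema is assumed here for illustration):

```sql
INSERT INTO metrics (timestamp, name, value)
SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES
    (now(), 'cpu_usage', 0.73),
    (now(), 'mem_usage', 0.41);
```

With wait_for_async_insert = 0, the ACK in step 2 is returned as soon as these rows reach the in-memory buffer, well before they become MergeTree parts on disk.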
Don’t worry about technical details like file names and binary data; they are mentioned to make you aware of what goes on under the hood. This workflow enables high-throughput ingestion, leveraging ClickHouse's in-memory buffer and asynchronous processing to optimize performance.
Now, let’s discuss the advantages and disadvantages of this mode.
Pros and Cons of this Approach
As seen in the previous sections, asynchronous inserts in ClickHouse offer several advantages when designing your data ingestion strategy. On the flip side, there are plenty of scenarios where this approach would be a problem. So, let’s clear things up and discuss the pros and cons. The table below outlines the key points:
Table 1: Advantages and Disadvantages of ClickHouse Asynchronous Inserts
| Aspect | Advantages | Disadvantages |
| --- | --- | --- |
| Performance | Increased Throughput and Efficiency: By batching multiple small inserts into larger ones, asynchronous inserts reduce the overhead associated with frequent disk I/O operations, leading to higher data ingestion rates. | Data Latency Issues: Data is temporarily stored in memory buffers before being written to disk, introducing a delay between data ingestion and its availability for querying. |
| Resource Utilization | Reduced Write Latency Under Heavy Load: By handling insertions asynchronously, ClickHouse can manage resources more effectively, preventing bottlenecks during peak loads. | Potential Data Consistency Challenges: In scenarios where the server crashes before buffered data is flushed to disk, there is a risk of data loss, leading to inconsistencies. |
| Scalability | Enhanced Performance for Bulk Data Ingestion: Asynchronous inserts allow for efficient handling of large volumes of data by consolidating multiple insert operations, which is beneficial for high-ingestion scenarios. | Operational Considerations: Troubleshooting delayed inserts and managing system resources require careful monitoring and configuration to ensure optimal performance and data integrity. |
| Flexibility | Flexibility in Handling Large Data Volumes: Asynchronous inserts provide the ability to manage and process large datasets more effectively by leveraging server-side batching mechanisms. | Monitoring Complexity: Effective use of asynchronous inserts necessitates additional monitoring of system tables and logs to detect and address potential issues promptly. |
Hence, asynchronous inserts in ClickHouse are designed to enhance data ingestion efficiency by batching incoming data on the server side before writing it to disk. While this approach offers significant performance benefits for bulk data processing, it introduces challenges for streaming data applications that require immediate data availability.
Why Won’t This Work for Streaming Data?
Figure 4. Why asynchronous inserts are not a good choice for streaming data
- In synchronous insert mode, each INSERT statement is processed and written to the database storage immediately. This ensures that the data is available for querying as soon as the insert operation completes, making it suitable for real-time analytics where timely data access is critical. Conversely, asynchronous inserts involve temporarily storing incoming data in an in-memory buffer. The data is flushed to the database storage only when the buffer reaches a certain size or after a specified timeout. This buffering mechanism introduces a delay between data ingestion and its availability for querying, which can be problematic for applications that rely on up-to-the-moment data.
- For real-time analytics and decision-making processes, the latency introduced by asynchronous inserts can lead to outdated insights. In scenarios such as monitoring systems, financial transactions, or user activity tracking, decisions are often made based on the most recent data. The delay in asynchronous inserts means that the data being analyzed may not reflect the current state of the system, potentially leading to suboptimal or incorrect decisions.
Therefore, while asynchronous inserts offer advantages in terms of ingestion throughput and resource efficiency, they are not well-suited for use cases that demand real-time data availability. In such situations, synchronous inserts or other real-time data processing mechanisms should be employed to ensure that data is immediately accessible for analysis and decision-making.
Conclusion
So, while asynchronous inserts in ClickHouse can optimize ingestion throughput and resource efficiency, they introduce delays that make them less ideal for streaming data scenarios requiring real-time analytics. For applications where immediate data availability is critical, a more robust, real-time data processing strategy is necessary.
This is where Glassflow comes into play - offering a more effective solution designed specifically to meet the demands of streaming data environments. Glassflow enables users to achieve both high-speed data ingestion and instant data accessibility, ensuring that insights and decisions are always based on the most current information available. Check out what we are building: Link