How to Move Data from Postgres to Snowflake in Real Time
In modern data architectures, real-time data synchronization between transactional and analytical systems is crucial for making informed business decisions. Change Data Capture (CDC) is a technique that efficiently tracks and propagates database changes. This article will explore a practical use case for implementing CDC to stream customer order data from a PostgreSQL database to Snowflake using GlassFlow.
Note: This article assumes that you are familiar with CDC. If you need an overview, you can check out the previous article on the basics of CDC.
Use Case Overview
Imagine an e-commerce company that processes thousands of customer orders daily. The order data is stored in a PostgreSQL database, but the company needs to replicate the changes to Snowflake in real-time for advanced analytics and reporting. This ensures that analysts and business intelligence tools always have up-to-date information.
Architecture
The CDC pipeline consists of the following components:
- PostgreSQL (Source Database) — Stores customer orders.
- GlassFlow (Streaming Platform) — Acts as a message broker for change events.
- Snowflake (Target Database) — Stores the replicated data for analytics.
PostgreSQL Overview
PostgreSQL is a powerful, open-source relational database system known for its extensibility, performance, and strong support for ACID-compliant transactions. It is widely used in transactional applications due to its robust feature set, including logical replication, which makes it an excellent choice for implementing CDC. With support for JSON, indexing techniques, and stored procedures, PostgreSQL provides flexibility for modern application needs.
Snowflake Overview
Snowflake is a cloud-based data warehousing platform that offers high scalability, elasticity, and performance for analytical workloads. Unlike traditional databases, Snowflake uses a unique architecture that separates storage and compute, allowing businesses to scale their queries and storage independently. Its native support for semi-structured data, automatic optimization, and seamless integration with various ETL tools make it a preferred choice for real-time analytics and data warehousing.
Implementation Steps
Step 1: Configure PostgreSQL for CDC
There are many ways to set up a PostgreSQL server: you can run it on Azure, AWS, or any other cloud provider. In this article, I will create a PostgreSQL database on DigitalOcean, as I find this the most straightforward approach. There are two things to note when you set up your database. First, you need to enable logical decoding. Second, you need to whitelist GlassFlow's IP addresses so it can read from your database.
Enabling Logical Decoding
PostgreSQL supports logical decoding, which is required for CDC. Connect to your database and run the following command to enable logical decoding.
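On a self-managed PostgreSQL instance, enabling logical decoding means setting `wal_level` to `logical`. As a sketch (the change requires a server restart to take effect):

```sql
-- Allow logical decoding plugins to read row-level change events.
-- Requires a server restart before it takes effect.
ALTER SYSTEM SET wal_level = logical;

-- After restarting, verify the setting:
SHOW wal_level;
```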
You also need to grant the user the REPLICATION role. On DigitalOcean, `wal_level` and the REPLICATION role are already set up, so you can directly create the replication slot.
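As a sketch, granting the role and creating the slot might look like this (the user name `glassflow_user`, the slot name `glassflow_slot`, and the `wal2json` output plugin are illustrative; use the values your GlassFlow setup expects):

```sql
-- Grant the replication privilege to the user GlassFlow connects as
-- (user name is illustrative).
ALTER USER glassflow_user WITH REPLICATION;

-- Create a logical replication slot; the output plugin here
-- (wal2json) is an assumption.
SELECT * FROM pg_create_logical_replication_slot('glassflow_slot', 'wal2json');
```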
Whitelist GlassFlow IPs
You need to whitelist the following IPs:
35.198.78.140
34.107.63.104
Set up the Table
We will be using the following table.
| Column | Type |
| --- | --- |
| order_id | Serial Primary Key |
| user_id | Integer |
| timestamp | Timestamp |
| amount | Float |
| user_email | VarChar(255) |
| shipping_address | Text |
You can run the following SQL command to create the table.
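A `CREATE TABLE` statement matching the schema above might look like this (the table name `customer_orders` is an assumption, chosen to mirror the Snowflake target table):

```sql
CREATE TABLE customer_orders (
    order_id         SERIAL PRIMARY KEY,
    user_id          INTEGER,
    timestamp        TIMESTAMP,
    amount           FLOAT,
    user_email       VARCHAR(255),
    shipping_address TEXT
);
```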
For a more complete guide on how to set up PostgreSQL for CDC, please visit our GlassFlow Postgres CDC Guide.
Step 2: Set up Snowflake
Let’s create an `analytics` database and a table named `customer_orders`. You can run the following command in the Snowflake console.
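A sketch of the DDL, assuming the Snowflake table mirrors the PostgreSQL schema:

```sql
CREATE DATABASE IF NOT EXISTS analytics;
USE DATABASE analytics;

CREATE TABLE IF NOT EXISTS customer_orders (
    order_id         INTEGER,
    user_id          INTEGER,
    timestamp        TIMESTAMP,
    amount           FLOAT,
    user_email       VARCHAR(255),
    shipping_address STRING
);
```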
This step is optional, but in some cases you will need to whitelist the GlassFlow IPs. To do this, run the following command in the Snowflake console.
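For example, a new network policy covering the two GlassFlow IPs listed earlier might look like this (the policy name `glassflow_policy` is illustrative):

```sql
CREATE NETWORK POLICY glassflow_policy
  ALLOWED_IP_LIST = ('35.198.78.140', '34.107.63.104');
```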
Or, if you already have an existing network policy, you can update it instead.
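A sketch, assuming your policy is named `your_existing_policy`. Note that `SET ALLOWED_IP_LIST` replaces the whole list, so include the IPs that are already on it:

```sql
ALTER NETWORK POLICY your_existing_policy SET
  ALLOWED_IP_LIST = ('<your existing IPs>', '35.198.78.140', '34.107.63.104');
```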
Now you need to apply the network policy.
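For an account-level policy (the policy name `glassflow_policy` is illustrative; this requires a role with the appropriate privileges, such as SECURITYADMIN):

```sql
ALTER ACCOUNT SET NETWORK_POLICY = glassflow_policy;
```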
For a more in-depth guide on how to set up Snowflake, please visit our Snowflake Guide.
Step 3: Configure GlassFlow
Now we need to create the pipeline that connects the PostgreSQL database to Snowflake. Using the GlassFlow Python SDK, you can write the following code.
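A sketch of the pipeline setup. The connector kinds and configuration keys below are assumptions based on the SDK's general shape, not a definitive implementation; consult the GlassFlow documentation for the exact schema your version expects.

```python
import os

# Source and sink configuration. All keys below are illustrative --
# check the GlassFlow connector docs for the exact expected schema.
source_config = {
    "db_host": os.environ.get("POSTGRES_HOST", "your-db-host"),
    "db_port": os.environ.get("POSTGRES_PORT", "25060"),
    "db_user": os.environ.get("POSTGRES_USER", "doadmin"),
    "db_pass": os.environ.get("POSTGRES_PASSWORD", ""),
    "db_name": os.environ.get("POSTGRES_DB", "defaultdb"),
    "replication_slot": "glassflow_slot",
}
sink_config = {
    "account": os.environ.get("SNOWFLAKE_ACCOUNT", ""),  # e.g. GOOOWIB-DS12345
    "warehouse": os.environ.get("SNOWFLAKE_WAREHOUSE", ""),
    "username": os.environ.get("SNOWFLAKE_USER", ""),
    "password": os.environ.get("SNOWFLAKE_PASSWORD", ""),
    "database": "analytics",
    "table": "customer_orders",
}

# Only create the pipeline when a personal access token is available.
if os.environ.get("GLASSFLOW_PAT"):
    import glassflow

    client = glassflow.GlassFlowClient(
        personal_access_token=os.environ["GLASSFLOW_PAT"]
    )
    pipeline = client.create_pipeline(
        name="postgres-to-snowflake-cdc",
        transformation_file="transform.py",
        source_kind="postgres",
        source_config=source_config,
        sink_kind="snowflake_cdc_json",
        sink_config=sink_config,
    )
    print("Pipeline created:", pipeline.id)
```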
A little note about `SNOWFLAKE_ACCOUNT`: you should use the account identifier, which you can find in your Snowflake account details. It follows the format `<organization_name>-<account_name>`, for example `GOOOWIB-DS12345`.
The content of `transform.py` is the following. The code transforms the CDC event format into a format that Snowflake can understand.
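A minimal sketch of `transform.py`, assuming the source connector emits wal2json-style change events (one change per event); the exact event shape depends on your GlassFlow connector version:

```python
def handler(data, log):
    """Flatten a CDC change event into a plain record for Snowflake.

    Assumes a wal2json-style event, e.g.:
    {"kind": "insert", "table": "customer_orders",
     "columnnames": ["order_id", "user_id", ...],
     "columnvalues": [1, 42, ...]}
    """
    record = dict(zip(data.get("columnnames", []), data.get("columnvalues", [])))
    # Keep the operation type so downstream queries can distinguish
    # inserts from updates and deletes.
    record["operation"] = data.get("kind", "unknown")
    return record
```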
Step 4: Populating Order Data
We will use Python and the `faker` library to populate the table.
When you run the above code, your database will be populated, and any changes will be reflected in the Snowflake database. On the GlassFlow pipeline page, you will see the logs update.
Step 5: Querying the Data in Snowflake
Once the data is synced, you can go to the Snowflake console and run the following query to check the results.
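A simple check might be:

```sql
SELECT *
FROM analytics.customer_orders
ORDER BY order_id DESC
LIMIT 10;
```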
A fully functional example can be found on our Postgres-Snowflake examples repo.
Conclusion
By leveraging CDC with PostgreSQL, GlassFlow, and Snowflake, businesses can achieve real-time data replication for analytical workloads. This architecture enables faster insights and improved decision-making. Implementing this solution ensures that customer order data remains synchronized across transactional and analytical systems without significant performance overhead.