Pipeline
This page outlines pipeline concepts in GlassFlow.
What is a Pipeline in GlassFlow?
A pipeline in GlassFlow orchestrates the flow of data from various sources, through transformations, and ultimately sends it to destinations where the data is stored or further utilized. Configuring a pipeline involves specifying these elements and defining how data is processed at each stage.
GlassFlow integrates a custom function you define in Python into the pipeline and executes the transformation in real time as data passes through.
Pipeline components
Each pipeline consists of:
Data Sources
Points where data is ingested into the pipeline. This can be databases such as PostgreSQL or MongoDB, message queues/brokers like Amazon SQS or Google Pub/Sub, data streaming services like Amazon Kinesis, file systems, event-driven applications, or any other data sources.
Transformation
A custom function that processes and transforms the incoming data. These functions can clean, enrich, or analyze the data to extract meaningful insights.
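As a minimal sketch, a transformation function might look like the following. It assumes GlassFlow calls a function named `handler` with the incoming event as a Python dictionary plus a logger, and forwards whatever the function returns; verify the exact signature against the transformation documentation for your pipeline.

```python
# A minimal transformation sketch. GlassFlow is assumed to call a function
# named `handler` with the incoming event (a dict) and a logger; check the
# transformation documentation for the exact signature.
def handler(data, log):
    # Derive an enriched field before the event continues on to the sink.
    amount_usd = data.get("amount_usd", 0)
    data["amount_eur"] = round(amount_usd * 0.92, 2)  # illustrative fixed FX rate
    log.info("processed order %s", data.get("order_id"))
    return data
```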
Sinks
Destinations where the processed data is sent. This can include analytical databases such as ClickHouse or ChromaDB, storage systems such as Amazon S3 or Azure Blob Storage, data warehouses like Snowflake, Google BigQuery, or other services for further use.
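To make the source-to-sink flow concrete, the sketch below shows one way an event could be pushed into a pipeline and the transformed result read back using the Python SDK. The class names, constructor parameters, and methods shown (PipelineDataSource, PipelineDataSink, publish, consume) are assumptions for illustration, not the confirmed SDK surface; consult the SDK reference for the actual API.

```python
# Hypothetical end-to-end sketch: publish a raw event to the pipeline's source
# endpoint and read the transformed event from its sink endpoint. Class and
# method names below are assumptions, not the confirmed SDK surface.
from glassflow import PipelineDataSource, PipelineDataSink

PIPELINE_ID = "<your-pipeline-id>"        # generated when the pipeline is created
ACCESS_TOKEN = "<pipeline-access-token>"  # hypothetical credential parameter

source = PipelineDataSource(pipeline_id=PIPELINE_ID, pipeline_access_token=ACCESS_TOKEN)
source.publish({"order_id": "1001", "amount_usd": 25.0})

sink = PipelineDataSink(pipeline_id=PIPELINE_ID, pipeline_access_token=ACCESS_TOKEN)
event = sink.consume()  # the event after the transformation function ran
print(event)
```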
Best practices for naming spaces and pipelines
Remember:
When you create a resource (such as a space or a pipeline) by providing a name, GlassFlow generates a unique ID for the resource. The resource can then be accessed by this ID; a sketch at the end of this section illustrates this.
Descriptive Names: Choose names that clearly describe the purpose or function of the pipeline, making it easier to identify and manage multiple pipelines.
Consistent Naming Scheme: Adopt a consistent naming scheme across your pipelines, especially if you're managing many of them. This could involve prefixes or suffixes that indicate the pipeline's stage in the data processing workflow (e.g., ingest, transform, export) or its data source.
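For example, creating a named space and pipeline with the Python SDK might look like the following sketch: you choose descriptive, stage-suffixed names, and GlassFlow returns generated IDs that you use to access the resources afterwards. The client class, method names, and parameters shown (create_space, create_pipeline, transformation_file) are assumptions for illustration; check the SDK reference for the exact API.

```python
# Hypothetical sketch: you choose the names, GlassFlow generates the IDs.
# Method and parameter names are assumptions, not the confirmed SDK surface.
from glassflow import GlassFlowClient

client = GlassFlowClient(personal_access_token="<your-token>")

space = client.create_space(name="orders-prod")
pipeline = client.create_pipeline(
    name="orders-transform",             # stage suffix makes the purpose obvious
    space_id=space.id,                    # resources are referenced by generated ID
    transformation_file="transform.py",   # the Python function described above
)

print(space.id, pipeline.id)  # store these generated IDs to access the resources later
```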