Modern Data Pipeline

February 11, 2023

Data pipeline

A data pipeline is a set of tools and processes used to automate the movement and transformation of data between a source system and a target repository.

Source systems often have different methods of processing and storing data than target systems. Therefore, data pipeline software automates the process of extracting data from many disparate source systems, transforming, combining and validating that data, and loading it into the target repository.

Building data pipelines breaks down data silos and creates a 360-degree view of the business. Businesses can then apply BI and analytics tools to build data visualisations and dashboards, deriving and sharing actionable insights from the data.

Standard data pipelines include:

Batch data pipeline: A batch data pipeline periodically transfers bulk data from source to target; for example, the pipeline can run once every twelve hours. A batch pipeline can also be scheduled to run at a specific time each day when system traffic is low.

Streaming data pipeline: A streaming data pipeline continually flows data from source to target while translating the data into the target format in real time. This suits data that requires continuous updating, for example readings collected from a sensor tracking traffic.
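
To make the streaming case concrete, below is a minimal Python sketch. The simulated sensor_readings() generator, the transform() mapping and the print-based load() stand-in are illustrative assumptions rather than any specific product's API; in a real pipeline the source might be a message queue and the target a warehouse table or streaming topic. The same extract, transform and load functions triggered on a cron schedule over bulk data would give the batch variant.

```python
import random
import time
from datetime import datetime, timezone


def sensor_readings():
    """Simulated source system: yields one traffic-sensor reading per second."""
    while True:
        yield {
            "sensor_id": "junction-42",
            "vehicles_per_minute": random.randint(0, 120),
            "ts": datetime.now(timezone.utc).isoformat(),
        }
        time.sleep(1)


def transform(reading):
    """Translate the raw reading into the target format while it is in flight."""
    return {
        "sensor": reading["sensor_id"],
        "load_pct": round(reading["vehicles_per_minute"] / 120 * 100, 1),
        "observed_at": reading["ts"],
    }


def load(record):
    """Stand-in for writing to the target system (warehouse table, topic, API)."""
    print("loaded:", record)


if __name__ == "__main__":
    # The pipeline runs continuously, record by record, rather than in bulk.
    for raw in sensor_readings():
        load(transform(raw))
```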

ETL

ETL (extract, transform and load) is a data integration process that extracts raw data from various data sources, transforms the data on a separate data processing server, and then loads it into a target system such as a data warehouse, data mart or database for analysis or other purposes.
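
A minimal Python sketch of the three ETL stages is shown below. The in-memory sample records, the SQLite file warehouse.db and the orders table are hypothetical stand-ins for a real source system and target repository; the point is only that cleaning and validation happen before anything is loaded.

```python
import sqlite3


def extract():
    """Extract: raw records as they might arrive from a source system."""
    return [
        {"order_id": "1001", "amount": "25.50", "country": "de "},
        {"order_id": "1002", "amount": "n/a", "country": "FR"},
        {"order_id": "1003", "amount": "99.00", "country": "us"},
    ]


def transform(rows):
    """Transform: clean, validate and reshape the data before it reaches the target."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # validation step: drop rows whose amount cannot be parsed
        cleaned.append((row["order_id"], amount, row["country"].strip().upper()))
    return cleaned


def load(rows, db_path="warehouse.db"):
    """Load: write the already-transformed rows into the target repository."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL, country TEXT)"
    )
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()


if __name__ == "__main__":
    load(transform(extract()))  # extract -> transform -> load, in that order
```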

Data Pipeline vs. ETL

  1. An ETL pipeline is a series of processes that extract data from a source, transform it, and load it into the target system. A data pipeline, on the other hand, is a broader term that includes the ETL pipeline as a subset: it covers any set of processing tools that move data from one system to another.
  2. ETL pipelines transform data before loading it into the target system. Data pipelines, by contrast, can either transform data after loading it into the target system (ELT) or not transform it at all.
  3. ETL pipelines end after loading data into the target repository. But data pipelines can stream data, and therefore their load process can trigger processes in other systems or enable real-time reporting.

ELT

ELT stands for “extract, load, and transform”: the process a data pipeline uses to replicate data from a source system into a target system such as a cloud data warehouse.

ELT is a modern variation of the older process of extract, transform, and load (ETL), in which transformations take place before the data is loaded.

As mentioned above, the ETL process requires a separate processing engine to run transformations before data is loaded into a target system. ELT, on the other hand, uses the processing engine of the target system itself to transform data. Removing this intermediate step streamlines the data loading process.
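
For contrast with the ETL sketch above, here is a minimal ELT sketch in Python. SQLite stands in for the cloud data warehouse, and the raw_orders and orders tables are hypothetical: the raw data is loaded exactly as it arrives, and the transformation then runs as SQL inside the target's own engine rather than on a separate processing server.

```python
import sqlite3

# Raw events exactly as they arrive from the source system.
RAW_EVENTS = [
    ("1001", "25.50", "de "),
    ("1002", "n/a", "FR"),
    ("1003", "99.00", "us"),
]

con = sqlite3.connect("warehouse.db")

# Extract + Load: the raw data lands in the target without any transformation.
con.execute(
    "CREATE TABLE IF NOT EXISTS raw_orders (order_id TEXT, amount TEXT, country TEXT)"
)
con.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", RAW_EVENTS)

# Transform: runs inside the target's own engine (plain SQL here),
# so no separate processing server sits between source and target.
con.execute(
    """
    CREATE TABLE IF NOT EXISTS orders AS
    SELECT order_id,
           CAST(amount AS REAL)  AS amount,
           UPPER(TRIM(country))  AS country
    FROM raw_orders
    WHERE amount GLOB '[0-9]*'
    """
)
con.commit()
con.close()
```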

Since the ETL process transforms data prior to the loading stage, making data available in the target system typically takes longer than with ELT.

Modern Data Pipeline

As organisations adopt a data-driven approach to grow and create value for their business, the challenge with traditional methods is that getting the data required for analysis can take days or months before the business can use it…