
End-to-End Sales Data Engineering Pipeline: Azure Databricks, DLT & CI/CD

  • Writer: Jesse Pepple
  • Nov 28
  • 4 min read

This project showcases a complete end-to-end Azure Data Engineering solution, handling data from ingestion through transformation to delivery using modern, cloud-native best practices. The architecture leverages Azure Data Factory, Azure Databricks, and Delta Live Tables (DLT) to create a scalable, reliable, and production-ready data pipeline.

The pipeline is designed to manage incremental data ingestion, support schema evolution, implement change data capture (CDC), and enable dimensional modeling—making it fully equipped to handle real-world enterprise workloads efficiently.


Project Architecture

Problem Statement

Many organisations struggle with data that sits across multiple systems, arrives in different formats, and updates frequently. Traditional pipelines are often brittle, manual, and difficult to scale.

The challenge was to build a pipeline that:

• Handles continuous ingestion from source systems

• Automatically adapts to schema drift

• Supports Change Data Capture (CDC)

• Delivers curated data models for analytics

• Ensures reliability, auditability, and reproducibility

This project solves those challenges using a modern lakehouse approach.


Phase 1 Data Ingestion


The project kicked off by establishing a Git repository to enable version control and streamline deployment of the Azure Data Factory artifacts. A dedicated development branch was created to manage feature enhancements and isolate ongoing updates from the main production branch, ensuring a controlled and organized development workflow.


Git Repository Setup

During the ingestion phase, data was sourced from a website via a REST API. The pipeline was fully parameterized in Azure Data Factory and executed using a ForEach activity to dynamically handle multiple API calls. The ingested data was then written to Azure Data Lake Storage, with a timestamp column included to provide context for auditability and support downstream processing.


Our Pipeline Activity

As shown in the diagram, the data source was first validated using a Web Activity. After successful validation, the ingestion process was executed using parameterized ForEach activities to handle dynamic data retrieval. Once the ingestion completed, a pull request was created, reviewed, and merged into the main branch to maintain version control and ensure production readiness.
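
The ingestion pipeline itself is configured in Azure Data Factory rather than written as code, but conceptually each ForEach iteration performs something like the Python sketch below: call one REST endpoint, stamp the payload with an ingest timestamp, and land it in the data lake. The endpoint list, storage account, container, and paths are illustrative placeholders, not the project's actual values.

import json
from datetime import datetime, timezone

import requests
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

BASE_URL = "https://example.com/api"  # placeholder for the real source API
ENTITIES = ["customers", "products", "stores", "sales-data"]  # the ForEach items

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
container = service.get_file_system_client("bronze")

for entity in ENTITIES:  # mirrors the ADF ForEach activity
    records = requests.get(f"{BASE_URL}/{entity}", timeout=60).json()
    run_ts = datetime.now(timezone.utc)
    for row in records:
        row["ingest_timestamp"] = run_ts.isoformat()  # audit column added at ingestion
    path = f"raw/{entity}/{entity}_{run_ts:%Y%m%d_%H%M%S}.json"
    container.get_file_client(path).upload_data(json.dumps(records), overwrite=True)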


Parameterized Pipeline
Data Lake (please note our source data ends with "sales-data")


Phase 2 Transformations

The transformation process began with the initialization of our Databricks bundle project, setting up the environment and project structure needed for scalable and maintainable data processing. Following this, I created external locations within the Data Lake, establishing secure and organized storage paths for raw, processed, and curated datasets.
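
External locations like these are usually registered through the Catalog Explorer UI, but the same step can be scripted from a notebook with Unity Catalog SQL. A minimal sketch, assuming a pre-existing storage credential and using placeholder storage URLs:

# `spark` is the SparkSession provided by a Databricks notebook.
# Storage URLs and the credential name are placeholders, not the project's values.
for layer in ("bronze", "silver", "gold"):
    spark.sql(f"""
        CREATE EXTERNAL LOCATION IF NOT EXISTS {layer}_ext
        URL 'abfss://{layer}@<storage-account>.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL adls_credential)
    """)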

Catalog Explorer
Bronze External Location
Silver External Location
Gold External Location

The transformation process leveraged Databricks Auto Loader to ensure idempotent, incremental ingestion. Transformations primarily focused on casting each column to the correct data type to enforce the target schema, which required careful attention given the number of columns with varying types. Once the data was transformed, it was loaded into the Data Lake, and slowly changing dimension (SCD) modeling in the curated (Gold) layer was initiated using Delta Live Tables (DLT).
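
A minimal sketch of this bronze-to-silver step is shown below. It uses Auto Loader (the cloudFiles source) with a checkpoint so re-runs stay incremental and idempotent, and casts a handful of illustrative columns; the real dataset has many more columns, and the paths and column names here are placeholders.

# `spark` is the SparkSession provided by a Databricks notebook.
from pyspark.sql.functions import col

bronze_path = "abfss://bronze@<storage-account>.dfs.core.windows.net/raw/sales-data/"
silver_path = "abfss://silver@<storage-account>.dfs.core.windows.net/sales/"

raw = (
    spark.readStream.format("cloudFiles")                      # Databricks Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", silver_path + "_schema")
    .load(bronze_path)
)

# Enforce the target schema by casting each column to its expected type.
typed = (
    raw.withColumn("order_id", col("order_id").cast("int"))
       .withColumn("quantity", col("quantity").cast("int"))
       .withColumn("unit_price", col("unit_price").cast("double"))
       .withColumn("order_date", col("order_date").cast("date"))
)

(
    typed.writeStream.format("delta")
    .option("checkpointLocation", silver_path + "_checkpoint")
    .trigger(availableNow=True)                                 # process available files, then stop
    .outputMode("append")
    .start(silver_path)
)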


Auto Loader Ingestion

Phase 3 Curated Gold Layer With DLT


For the curated (Gold) layer, I leveraged Delta Live Tables (DLT) to implement a robust Lakehouse pipeline. Dimension tables were modeled using Slowly Changing Dimension (SCD) Type 2, while the fact table employed SCD Type 1. As the dimension tables were already well-prepared at the source, no new surrogate keys or additional modeling were required. The primary focus was on creating an automated CDC flow to manage both SCD Type 1 and Type 2 changes. Using DLT, I also defined data quality expectations on key tables to ensure accuracy before finalizing the SCD Type 2 implementation. Once validated, the curated data was loaded into the SQL Data Warehouse for analytics and reporting, and the curated tables were optimized with Z-ordering on primary keys to enhance query performance.
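
The sketch below shows the shape of this DLT CDC flow: a staging view with a quality expectation, then apply_changes into streaming tables using SCD Type 2 for a dimension and SCD Type 1 for the fact. Table, column, and key names are illustrative, not the project's exact definitions.

import dlt
from pyspark.sql.functions import col

@dlt.view(name="dim_customers_staging")
@dlt.expect_or_drop("valid_customer_id", "customer_id IS NOT NULL")
def dim_customers_staging():
    # Read the cleaned silver table as a stream.
    return spark.readStream.table("silver.customers")

dlt.create_streaming_table("dim_customers")

dlt.apply_changes(
    target="dim_customers",
    source="dim_customers_staging",
    keys=["customer_id"],
    sequence_by=col("ingest_timestamp"),
    stored_as_scd_type=2,            # keep full history on the dimension
)

@dlt.view(name="fact_sales_staging")
def fact_sales_staging():
    return spark.readStream.table("silver.sales")

dlt.create_streaming_table("fact_sales")

dlt.apply_changes(
    target="fact_sales",
    source="fact_sales_staging",
    keys=["order_id"],
    sequence_by=col("ingest_timestamp"),
    stored_as_scd_type=1,            # fact keeps only the latest state
)

After the pipeline run, a separate OPTIMIZE ... ZORDER BY statement on each gold table's primary key provides the query-performance boost mentioned above.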



After conducting a dry run and validating the data, I executed the DLT pipeline to apply the transformations and load the curated datasets into the Gold layer. Slowly Changing Dimension (SCD) Type 2 was implemented on the dimension tables to capture historical changes, while SCD Type 1 was applied to the fact tables to retain only the most recent state.



As illustrated in the diagram, the upsert logic executed successfully, and all defined data quality expectations were fully met.


Databricks SQL Warehouse


I validated the curated datasets in the Databricks SQL Warehouse and then created sample dashboards to showcase the usability and analytical value of the curated data.
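
A typical sanity check at this stage is to compare total versus current rows in an SCD Type 2 dimension using the __START_AT/__END_AT columns that apply_changes maintains. It can be run in the SQL editor, or from a notebook as sketched here (catalog and schema names are placeholders):

# `spark` is the SparkSession provided by a Databricks notebook.
spark.sql("""
    SELECT
        COUNT(*)                                          AS total_rows,
        SUM(CASE WHEN __END_AT IS NULL THEN 1 ELSE 0 END) AS current_rows
    FROM gold.dim_customers
""").show()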


DimCustomer
FactSales
DimDate
DimProducts
DimStores

Dashboard 1
Dashboard 2
Dashboard 3
Dashboard 4

After completing the Silver and Gold layers, I successfully deployed the Databricks bundle to the designated GitHub repository to maintain version control and enable collaborative development.



Attached are the notebooks and Git repository of our project





BI Reporting


Leveraging Databricks Partner Connect, I provided a BI connector for data analysts, enabling them to directly query and visualize the curated data in Power BI without relying on the SQL Data Warehouse. The curated datasets were also loaded into the Synapse Analytics Warehouse for additional reporting and analytics. Upon project completion, all notebooks and pipelines were deployed to the production folder in Databricks using Databricks Asset Bundles, and the entire project was version-controlled by pushing it to my GitHub repository.


Partner Connect


Loading Into Synapse


After successfully validating the data in the Databricks SQL Warehouse, I loaded the curated datasets into the Gold layer of our Data Lake and then pushed them to the Synapse Analytics Warehouse for enterprise-wide reporting and consumption.
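
The Synapse serverless SQL setup shown in the screenshots below is plain T-SQL; as a hedged sketch, the same steps could be scripted from Python with pyodbc roughly as follows. The server, database, credential, password, and storage names are placeholders, and it assumes the Gold layer is exposed as Parquet files readable by the workspace's managed identity.

import pyodbc

# autocommit=True so each DDL statement runs in its own implicit transaction.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<synapse-workspace>-ondemand.sql.azuresynapse.net;"
    "DATABASE=gold_db;"
    "Authentication=ActiveDirectoryInteractive;",
    autocommit=True,
)
cur = conn.cursor()

cur.execute("CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong-password>'")
cur.execute("CREATE DATABASE SCOPED CREDENTIAL gold_cred WITH IDENTITY = 'Managed Identity'")
cur.execute("""
    CREATE EXTERNAL DATA SOURCE gold_src WITH (
        LOCATION = 'https://<storage-account>.dfs.core.windows.net/gold',
        CREDENTIAL = gold_cred)
""")
cur.execute("CREATE EXTERNAL FILE FORMAT parquet_ff WITH (FORMAT_TYPE = PARQUET)")
cur.execute("""
    CREATE VIEW dbo.vw_dim_customers AS
    SELECT * FROM OPENROWSET(
        BULK 'dim_customers/',
        DATA_SOURCE = 'gold_src',
        FORMAT = 'PARQUET') AS q
""")

External tables follow the same pattern, adding WITH (LOCATION, DATA_SOURCE, FILE_FORMAT) to materialize results back to the lake.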

Creating Master Key, Credentials, Data Sources And File Format


Creating Views
Creating External Tables



