
Azure-Databricks-End-to-End-DataEngineering-Project-With-Azure-Devops

  • Writer: Jesse Pepple
  • Oct 9
  • 6 min read

This Data Engineering project demonstrates the full potential of Azure Databricks, Azure Data Factory (ADF), and Azure Data Lake, showcasing how modern data solutions can be built with scalability and automation in mind. The project incorporates key engineering practices such as Slowly Changing Dimensions (SCD) Type 1 — implemented manually — and Type 2, automated through Delta Live Tables (DLT), alongside Star Schema modeling and incremental data loading.

Using Spark Structured Streaming, I designed a seamless ingestion framework capable of efficiently processing new and updated data in real time. The transformation phase focused on building reliable and maintainable SCD Type 1 pipelines, ensuring data accuracy across updates. After curating the data from the enriched (silver) layer into the core (gold) layer, the final datasets were published to both Azure Synapse Analytics and Databricks SQL Warehouse, making them ready for high-performance analytics and business reporting.


Azure Services:

  • Azure Databricks

  • Azure Data Factory (ADF)

  • Azure Data Lake Storage (ADLS Gen2)

  • Azure Synapse Analytics


Databricks Components:


  • Delta Lake & Delta Tables

  • Delta Live Tables (DLT for SCD Type 2 automation)

  • Databricks SQL Warehouse


Languages & Tools:


  • Python (PySpark / notebooks)

  • SQL

  • Spark Structured Streaming (for incremental loads)




The Medallion Architecture

As shown in the image above, the data ingestion process begins with Azure Data Factory (ADF), which orchestrates the extraction of data and stores it in the source container within the Data Lake. From there, I leveraged Databricks Autoloader to seamlessly ingest the data into the staging (bronze) layer, setting the stage for subsequent transformations and preparing it for loading into the core (silver and gold) layers using SCD Type 1 and Type 2, with a Star Schema built around our fact table.



Creating And Ingesting Orders Source Data With ADF and Azure DevOps


To kick off the project, I began by setting up the foundation in Azure DevOps. After configuring the project details, I created a dedicated branch to manage version control and ensure smooth CI/CD integration. With the groundwork in place, I moved on to Azure Data Factory (ADF) to start ingesting the raw source data—marking the first step in building the end-to-end data pipeline.

Azure DevOps Overview

Configuring The Repository In Data Factory


After creating the branch, I returned to Azure Data Factory to configure the repository integration, ensuring proper version control between ADF and Azure DevOps. This setup established a seamless development workflow and prepared the environment for the data ingestion process.

Configuring Branch Repository

Source Ingestion


I proceeded to ingest the source data using the Lookup and ForEach activities in Azure Data Factory. During this process, I transformed the files from CSV to Parquet format to improve performance and storage efficiency. Once the ingestion pipeline executed successfully, I reviewed the results and prepared to merge the branch back into the main repository, ensuring version-controlled and reproducible workflows.


Our Pipeline
Source Container

Pipeline Template



As seen above, this is our pipeline code for ingesting the source data.


Commits



Merge Illustration

Our Merge Illustration



Phase 1 Ingestion To Staging Layer Utilising Spark Structured Streaming


Before initiating the ingestion process in Databricks, I created a Catalog to organize and manage the project’s data assets within a governed structure. I also configured External Locations to securely link Databricks with the containers in Azure Data Lake, ensuring controlled access and data lineage tracking. Since Unity Catalog was already enabled in my Databricks workspace (as shown in the image), this setup streamlined governance and made the transition to the transformation phase seamless.
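As a rough sketch of that setup, the catalog and external locations can be created with Unity Catalog SQL run from a notebook; the storage account, credential, and catalog names below are placeholders for illustration, not the exact ones used in the project.

# Hedged sketch: placeholder storage account, credential, and catalog names.
spark.sql("CREATE CATALOG IF NOT EXISTS dbx_project")

for container in ["source", "bronze", "silver", "gold"]:
    spark.sql(f"""
        CREATE EXTERNAL LOCATION IF NOT EXISTS {container}_loc
        URL 'abfss://{container}@<storageaccount>.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL dbx_credential)
    """)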



Unfortunately, I forgot to screenshot our Silver external location 😅 but nevertheless it was indeed created.


Ingestion With Autoloaders


The data ingestion using Spark Structured Streaming was successful on the first attempt, seamlessly loading data from the source container into the Bronze layer. One of the key benefits of Spark Structured Streaming is its ability to handle incremental data loads, which I leveraged in this project. By configuring the trigger with once=True, each run of the pipeline picks up only the data that has arrived in the Data Lake since the previous run, while already processed files are tracked through the checkpoint location, ensuring reliability and consistency across updates.
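The snippet below is a minimal sketch of that Autoloader stream for the orders data; the container paths, storage account, and checkpoint locations are placeholders rather than the exact values from the project.

# Minimal Autoloader sketch (placeholder paths and storage account).
orders_stream = (spark.readStream
    .format("cloudFiles")                       # Databricks Autoloader
    .option("cloudFiles.format", "parquet")     # files landed as Parquet by ADF
    .option("cloudFiles.schemaLocation",
            "abfss://bronze@<storageaccount>.dfs.core.windows.net/checkpoints/orders/schema")
    .load("abfss://source@<storageaccount>.dfs.core.windows.net/orders"))

(orders_stream.writeStream
    .format("delta")
    .option("checkpointLocation",
            "abfss://bronze@<storageaccount>.dfs.core.windows.net/checkpoints/orders")
    .trigger(once=True)                         # process whatever is new, then stop
    .outputMode("append")
    .start("abfss://bronze@<storageaccount>.dfs.core.windows.net/orders"))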



First Ingestion (Our Fact Table)


The first ingestion targeted the fact table, which was still being defined at this stage. The initial dataset contained 9,990 rows, and to demonstrate incremental loading, I added a new file with 10 additional rows to the Bronze container in the Data Lake. As soon as the new file arrived, Spark Structured Streaming automatically detected and processed it, seamlessly updating the dataset incrementally without reprocessing the existing data—showcasing the efficiency and reliability of the streaming pipeline.

Initial Data (9,990)
Adding Incremental Data

After adding the new data, Spark Structured Streaming successfully ingested the incremental load, automatically updating the dataset. During this process, a rescued data column was also created to capture any records that did not conform to the table schema, ensuring no data was lost and maintaining the integrity of the pipeline.
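A quick way to inspect those rescued records is to filter on the _rescued_data column that Autoloader adds; the catalog and table names here are illustrative assumptions.

# Rows whose fields did not match the expected schema end up in _rescued_data.
bronze_orders = spark.read.table("dbx_project.bronze.orders")
bronze_orders.filter("_rescued_data IS NOT NULL").show(truncate=False)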

Incremental Data


Preparing Incremental Data


Customers Incremental Data
Products Incremental Data

Ingesting Multiple Datasets In One Go Using a Parameterised Notebook


By using dbutils.widgets.text(), I parameterized the notebook to enable flexible and efficient ingestion of all datasets in a single execution. Leveraging a ForEach loop, I automated the loading of data into the Data Lake, ensuring each table—including newly added datasets—was ingested into its respective location. The Regions table was intentionally excluded, as it lacked meaningful context for the curated tables and core layer, maintaining focus on the most relevant data for analytics.
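A simplified sketch of that parameterised notebook is shown below; the widget name, catalog, and paths are assumptions for illustration, not the exact notebook code.

# The table name arrives as a widget, so a ForEach loop can call this notebook once per dataset.
dbutils.widgets.text("table_name", "")            # e.g. "customers", "products"
table_name = dbutils.widgets.get("table_name")

stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation",
            f"abfss://bronze@<storageaccount>.dfs.core.windows.net/checkpoints/{table_name}/schema")
    .load(f"abfss://source@<storageaccount>.dfs.core.windows.net/{table_name}"))

(stream.writeStream
    .format("delta")
    .option("checkpointLocation",
            f"abfss://bronze@<storageaccount>.dfs.core.windows.net/checkpoints/{table_name}")
    .trigger(once=True)
    .start(f"abfss://bronze@<storageaccount>.dfs.core.windows.net/{table_name}"))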


Parameterizing Our Notebook
Parameter Notebook
Preparing Pipeline

Our Load Was Successful

Customer Stream Was Successful
Product Stream Was Successful
Checkpoints Created in the Staging Layer

Region Data


As seen with the Regions table, no changes were detected since no new data was added. This demonstrates that the incremental load process is working as intended, efficiently processing only new or updated records while leaving unchanged data untouched.




Phase 2 Transformations


After performing the necessary transformations for the enriched (Silver) layer, particularly on the Orders fact table and two of the dimension tables, I implemented Slowly Changing Dimension (SCD) Type 1. For this project, I chose a manual approach, splitting the DataFrames into old and new datasets to create a pseudo table. This method allowed me to efficiently track Dimension Keys during incremental loads and quickly determine the maximum DimensionKey value, ensuring accuracy and consistency without added complexity.
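To make the approach concrete, here is an illustrative sketch of the old/new split for the customer dimension; the catalog, table, and column names are assumptions rather than the exact notebook code.

from pyspark.sql import functions as F, Window

# Existing dimension rows versus the latest curated (silver) data.
dim_old = spark.read.table("dbx_project.gold.DimCustomers")
src     = spark.read.table("dbx_project.silver.customers")

# Current maximum surrogate key (0 if the dimension is still empty).
max_key = dim_old.agg(F.max("DimCustomerKey")).collect()[0][0] or 0

# Business keys never seen before receive new surrogate keys after the current maximum.
new_rows = (src.join(dim_old, "customer_id", "left_anti")
               .withColumn("DimCustomerKey",
                           F.row_number().over(Window.orderBy("customer_id")) + F.lit(max_key)))

# Business keys that already exist keep their surrogate key but take the latest attributes.
old_rows = src.join(dim_old.select("customer_id", "DimCustomerKey"), "customer_id", "inner")

upsert_df = old_rows.unionByName(new_rows)   # the "pseudo table" used for the upsert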


Manual SCD Type 1 (Upsert)


After creating the DimCustomers table, I manually defined the upsert condition for our Type 1 logic.
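A hedged sketch of that upsert, continuing from the pseudo table built above (table and key names are again assumptions):

from delta.tables import DeltaTable

# Type 1 merge: match on the business key, overwrite attributes in place (no history kept).
dim_customers = DeltaTable.forName(spark, "dbx_project.gold.DimCustomers")

(dim_customers.alias("t")
    .merge(upsert_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()        # latest values simply overwrite the old ones
    .whenNotMatchedInsertAll()     # brand-new customers are inserted
    .execute())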


Manual SCD Type 1 (DimCustomers)

Automatic SCD Type 2


Using Delta Live Tables (DLT), a declarative ETL framework, I automated the SCD Type 2 logic with ease. Unlike Type 1, which was handled manually using incremental loads and a pseudo table, DLT simplified tracking historical changes while enabling data quality checks to ensure high standards. Despite the manual approach for Type 1, the solution was successful and met all project requirements.

As demonstrated, no data was dropped, confirming that the data quality checks were effective and the curated datasets remained complete and accurate.
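As a rough illustration of what such a DLT pipeline looks like for the product dimension, the sketch below uses an assumed source table, key, and sequencing column rather than the project's exact definitions.

import dlt
from pyspark.sql import functions as F

# Staging table with a data quality expectation: rows without a product_id are dropped.
@dlt.table(name="products_staged", comment="Cleaned products feeding the dimension")
@dlt.expect_or_drop("valid_product_id", "product_id IS NOT NULL")
def products_staged():
    return spark.readStream.table("dbx_project.silver.products")

# Target streaming table; apply_changes maintains the Type 2 history automatically.
dlt.create_streaming_table("DimProducts")

dlt.apply_changes(
    target="DimProducts",
    source="products_staged",
    keys=["product_id"],
    sequence_by=F.col("last_updated"),   # assumed ordering column for change sequencing
    stored_as_scd_type=2                 # DLT manages the __START_AT / __END_AT columns
)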




Phase 3 Star Schema (Orders_Facts)


With the two dimension tables loaded, the next step was to finalize the Star Schema by preparing the fact table. Once complete, the curated datasets were delivered to both Databricks SQL Warehouse and Azure Synapse Analytics, making them fully ready for analysis, reporting, and business intelligence.

After finalizing the Star Schema, I implemented SCD Type 1 logic on the fact table to handle upserts, ensuring that new data could be inserted and existing records updated efficiently—a crucial step given the table’s size and the ongoing incremental updates.
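A hedged sketch of that fact-table upsert is shown below; the surrogate keys are looked up from the dimensions first, and all table, key, and column names are illustrative assumptions.

from delta.tables import DeltaTable

# Resolve dimension surrogate keys for the incoming order rows.
dim_cust = spark.read.table("dbx_project.gold.DimCustomers").select("customer_id", "DimCustomerKey")
dim_prod = spark.read.table("dbx_project.gold.DimProducts").select("product_id", "DimProductKey")

fact_src = (spark.read.table("dbx_project.silver.orders")
    .join(dim_cust, "customer_id", "left")
    .join(dim_prod, "product_id", "left")
    .drop("customer_id", "product_id"))   # keep only surrogate keys in the fact table

# Type 1 merge on the order grain: reruns of the incremental load update rather than duplicate.
fact_orders = DeltaTable.forName(spark, "dbx_project.gold.Fact_Orders")

(fact_orders.alias("t")
    .merge(fact_src.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())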



Our data is now curated and ready for analysis and reporting.


Databricks SQL Warehouse


To validate the curated datasets, I tested them in the Databricks SQL Warehouse, running queries and exploring the data interactively. This process allowed me to create dashboards and verify that the data was accurate, well-modeled, and ready for business analysis, demonstrating the pipeline’s success in delivering analysis-ready datasets.




BI Partner Connect


Thanks to Databricks Partner Connect, I was able to provide a BI connector to the Data Analyst, allowing them to directly query and visualize the cleaned datasets in Power BI without relying on the SQL Data Warehouse. Following this, I loaded the curated data into Azure Synapse Analytics, completing the end-to-end pipeline and bringing the project to a successful conclusion.



For Code & Notebook visit this link




Loading Into Synapse


To load the curated data, I leveraged CETAS (CREATE EXTERNAL TABLE AS SELECT) statements and OPENROWSET to create External Tables, meticulously defining the data sources and file formats. This approach ensured that the tables were accurately structured and efficiently accessible for downstream analytics and reporting.


Code For Master Key, Credentials and File Format

Code ->

-- Create master key
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '';

-- Database scoped credential
CREATE DATABASE SCOPED CREDENTIAL jesscreds
WITH IDENTITY = 'Managed Identity';

-- External file formats
CREATE EXTERNAL FILE FORMAT parquet
WITH
(
    FORMAT_TYPE = PARQUET,
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
);

CREATE EXTERNAL FILE FORMAT delta
WITH
(
    FORMAT_TYPE = DELTA
);


Creating External Data Source
Creating Schema & Views
Gold Views
Creating Silver External Tables
Creating Gold External Tables
Querying Our Data (DimProducts)
Dim Customer (Brief Visualisation)
Dim Customers
Fact_Orders

 
 
 



