
Flights Azure Databricks End-to-End Data Engineering Project

  • Writer: Jesse Pepple
  • Oct 28
  • 4 min read


Resources

  • Azure Services

  • Azure Databricks

  • Spark Structured Streaming

  • Delta Live Tables, Delta Lake & Delta Tables

  • Databricks SQL Warehouse

  • Unity Catalog


Languages & Tools: Python (PySpark / notebooks) & SQL


Project Overview

This project is built entirely on Azure Databricks, leveraging Spark Structured Streaming for real-time data ingestion and PySpark for large-scale data transformations. I implemented Delta Live Tables to automate Slowly Changing Dimensions (SCDs) and ensure data consistency. The project concludes with dynamic dimensional modeling to create well-curated, analytics-ready datasets. Unlike my previous projects and workflows, which enlisted the help of Azure Data Factory, this one is purely an Azure Databricks project.

Decision Highlight: I chose Delta Live Tables because they provide built-in automation for data quality and SCD handling, which simplified the pipeline and reduced potential for errors.



Setting Up the Project


To kickstart the project, I provisioned the Azure Databricks workspace. With Unity Catalog already enabled, I created external locations for the Medallion Architecture layers (Bronze, Silver, and Gold), as well as a source container to store the ingested raw data.

Decision Highlight: Using Unity Catalog ensured centralized governance and access control across all layers, making the pipeline secure and maintainable.
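
External locations can be created either through the Catalog UI or in SQL from a notebook. Below is a minimal sketch of the SQL route, assuming a hypothetical storage credential named flights_cred and a storage account named flightsdata (neither is the project's actual resource):

```python
# Sketch only: names are illustrative, not the project's actual resources.
# Each Medallion layer (plus the raw source container) gets its own external location.
for layer in ["source", "bronze", "silver", "gold"]:
    spark.sql(f"""
        CREATE EXTERNAL LOCATION IF NOT EXISTS {layer}_ext_location
        URL 'abfss://{layer}@flightsdata.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL flights_cred)
    """)
```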


Bronze External Location
Silver External Location
Gold External Location

Phase 1: Ingestion With Auto Loader (Spark Structured Streaming)


For ingestion from the source to the staging layer, I implemented Spark Structured Streaming with Auto Loader, allowing for incremental, exactly-once processing. After testing the initial streaming load for the first table, I built a Job Pipeline to automate ingestion for the remaining datasets without hardcoding table-specific logic.

Decision Highlight: I chose Spark Structured Streaming because it supports idempotency, enabling seamless incremental loads. This allowed new datasets to be added without reprocessing existing data, which was critical for performance and scalability.
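
The following is a minimal sketch of the Auto Loader pattern described here, with illustrative paths, a CSV source format, and a processing timestamp column added during ingestion (all names are assumptions, not the project's exact code):

```python
from pyspark.sql.functions import current_timestamp

# Illustrative paths; the real project uses its own containers and dataset folders.
source_path = "abfss://source@flightsdata.dfs.core.windows.net/flights/"
bronze_path = "abfss://bronze@flightsdata.dfs.core.windows.net/flights/"

# Auto Loader (cloudFiles) picks up only files it has not seen before and
# tracks progress in the checkpoint, which is what gives the incremental,
# exactly-once behaviour.
raw_df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", f"{bronze_path}_schema/")
    .load(source_path)
    .withColumn("last_updated_ts", current_timestamp())
)

(
    raw_df.writeStream
    .format("delta")
    .option("checkpointLocation", f"{bronze_path}_checkpoint/")
    .outputMode("append")
    .trigger(availableNow=True)   # process everything new, then stop
    .start(bronze_path)
)
```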

Initially, the dataset contained 1,000 rows, and after streaming new data, it grew to 1,300 rows, confirming the pipeline worked correctly. Additional datasets were added later, further validating the incremental load process. This completed the first phase by populating the staging layer.

Decision Highlight: Using a Job Pipeline ensured that once one table was tested, other datasets could be ingested consistently and automatically.
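
The "without hardcoding" part comes down to parameterising a single ingestion notebook and letting the job pass a different dataset name to each task. A hedged sketch, with a hypothetical widget named src:

```python
# The job supplies "src" per task (e.g. flights, passengers, airports),
# so one notebook handles every dataset.
dbutils.widgets.text("src", "")
src = dbutils.widgets.get("src")

source_path = f"abfss://source@flightsdata.dfs.core.windows.net/{src}/"
bronze_path = f"abfss://bronze@flightsdata.dfs.core.windows.net/{src}/"
# ...the same Auto Loader read/write as above, pointed at these paths.
```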


WriteStream For Our Spark Structured Stream

Adding Data To The DataLake
Last Updated Processing TimeStamp



Job Pipeline Run Was Successful


Notebook Job Runs
Our Staging Layer Runs

The Job Pipeline executed successfully. To further validate the capabilities of Spark Structured Streaming, I added new datasets to the data lake and reran the pipeline, confirming that incremental loads were processed seamlessly.


Incremental Data Flights
Incremental Data Passengers
Incremental Data Airports
Current Rows For Our Three Tables

As shown in the images, the first run processed the initial rows, while the subsequent run successfully ingested additional rows, demonstrating the incremental load process in action.
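
A quick way to confirm the counts after each run is to read each staging table back and count its rows; the paths below are illustrative, not the project's exact ones:

```python
# Row counts before and after an incremental run show how many records were added.
for name in ["flights", "passengers", "airports"]:
    path = f"abfss://bronze@flightsdata.dfs.core.windows.net/{name}/"
    rows = spark.read.format("delta").load(path).count()
    print(f"{name}: {rows} rows")
```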


Running Incremental Data With Spark Structured Streaming

Running New Pipeline
Incremental Data Run Was Successful, New Records Added
Our Staging Layer After Being Overwritten

The Job Pipeline and incremental load were successfully executed and stored in the Staging Layer, marking the completion of the first phase of our end-to-end Databricks project.



Phase 2: Silver Layer Enrichment (Transformations & DLT)


For the Silver layer, I transformed data from the staging layer and stored it as Delta Live Tables. I defined table expectations and automated Slowly Changing Dimensions (SCDs) for enriched datasets.

Decision Highlight: Delta Live Tables were used because they provide built-in schema enforcement, data quality checks, and automated SCD handling, reducing the need for custom coding while improving reliability.

I initially developed the pipeline as notebooks before converting them to Python files, allowing me to validate transformations interactively before deploying them.

Decision Highlight: This iterative approach minimized errors and ensured transformations were correct before scaling to larger datasets.
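
Below is a condensed sketch of what the DLT Python source can look like, using hypothetical table and column names: one expectation guards data quality, and apply_changes takes care of the SCD logic.

```python
import dlt
from pyspark.sql.functions import col

# Streaming view over the staging table, with an expectation that drops rows
# missing their primary key (names here are illustrative).
@dlt.view(name="stage_flights")
@dlt.expect_or_drop("valid_flight_id", "flight_id IS NOT NULL")
def stage_flights():
    return spark.readStream.table("flights_catalog.bronze.flights")

# Target streaming table in the Silver layer.
dlt.create_streaming_table("silver_flights")

# DLT automates the SCD handling: keys identify a record, sequence_by orders
# incoming changes, and stored_as_scd_type selects type 1 or type 2 behaviour.
dlt.apply_changes(
    target="silver_flights",
    source="stage_flights",
    keys=["flight_id"],
    sequence_by=col("last_updated_ts"),
    stored_as_scd_type=1,
)
```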



DLT Pipeline Source Code
LakeFlow UI Declarator
Expectations For Data Quality Met

After completing the Delta Live Table (DLT), I wrote the transformed data to the Data Lake Enrichment Container.

Enrichment Container

Phase 3: Star Schema & Dynamic Dimensional Modelling


For the Gold layer, I implemented a Dynamic Dimensional Model for tables like Passengers, Airports, and Flights, functioning similarly to an SCD Builder. I started with the Airports table for testing, then dimensionally modeled the remaining tables and applied UPSERT operations to maintain consistency.

Decision Highlight: A dynamic modeling approach allowed the same logic to be reused across multiple tables, reducing development time and ensuring consistency in SCD handling.
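
Here is a hedged sketch of the dynamic pattern, with illustrative parameter and table names: the dimension and its business key arrive as notebook parameters, the table is created on first run, and later runs perform an SCD Type 1 upsert through a Delta MERGE (surrogate-key generation is left out for brevity).

```python
from delta.tables import DeltaTable

# Parameters make one notebook reusable for Passengers, Airports, Flights, ...
dbutils.widgets.text("dim_name", "airports")
dbutils.widgets.text("key_col", "airport_id")
dim_name = dbutils.widgets.get("dim_name")
key_col = dbutils.widgets.get("key_col")

source_df = spark.read.table(f"flights_catalog.silver.silver_{dim_name}")
target_name = f"flights_catalog.gold.dim_{dim_name}"

if not spark.catalog.tableExists(target_name):
    # First run: create the dimension table outright.
    source_df.write.format("delta").saveAsTable(target_name)
else:
    # Subsequent runs: SCD Type 1 upsert on the business key.
    (
        DeltaTable.forName(spark, target_name).alias("t")
        .merge(source_df.alias("s"), f"t.{key_col} = s.{key_col}")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```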


Parameters For Dynamic Notebook

Dynamic Notebook Workflow (Video)

Workflow Using Passenger Table

After successfully building the Dimensional Tables, I created the Fact Table and validated it to ensure data integrity and the absence of duplicates.
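
A short validation like the following confirms the fact table holds no duplicate keys (the table and column names are assumptions for illustration):

```python
# Compare total rows against distinct business keys; any gap means duplicates.
fact = spark.read.table("flights_catalog.gold.fact_bookings")

total = fact.count()
distinct = fact.select("booking_id").distinct().count()
assert total == distinct, f"{total - distinct} duplicate booking_id values found"
```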


Validating Distinct Data
SCD TYPE 1 (UPSERT)

If you're interested in the link to my code or notebooks:




Databricks SQL Warehouse


With the final stage complete, the curated datasets were loaded successfully into the Databricks SQL Warehouse. I validated data authenticity by querying them there, and the queries returned the curated datasets seamlessly.
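
The validation itself is just a matter of querying the curated tables from the warehouse; an equivalent spot-check can also be run from a notebook, for example (names illustrative):

```python
# Spot-check a curated Gold table: confirm it exists and returns rows.
spark.sql("""
    SELECT COUNT(*) AS row_count
    FROM flights_catalog.gold.dim_flights
""").show()
```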


Our Curated Tables
Dashboard 1
Dashboard 2
Dashboard 3


Loading Into Synapse Warehouse


Once the curated and enriched datasets were verified in Databricks, I integrated them into the Synapse Workspace. Using OPENROWSET, I created views for analytics and finally built external tables to make the data accessible for reporting and BI tools.

Decision Highlight: Using external tables in Synapse allowed seamless integration with BI tools without duplicating data, while views provided an abstraction layer for end users to query curated datasets safely.


Silver_Views
Silver External Table
Silver Bookings Table
Silver Passenger Table
Silver Flight Table

Gold Views
Gold Dim Bookings
Gold DimAirports Table
Gold Dim Passenger Table
Gold Dim Flights

Key Takeaways


  • Leveraged Medallion Architecture (Bronze → Silver → Gold) for structured, layered data processing.

  • Used Spark Structured Streaming for efficient incremental ingestion.

  • Automated SCDs with Delta Live Tables to maintain historical accuracy.

  • Implemented dynamic dimensional modeling for reusable and scalable design.

  • Ensured data quality through validation at every stage.

  • Enabled cross-platform integration by making curated datasets available in Synapse.

 
 
 
