
Microsoft Fabric Airbnb Data Engineering Project

  • Writer: Jesse Pepple
  • Jan 8
  • 5 min read

Updated: Jan 11

This project showcases end-to-end data engineering in Microsoft Fabric using Airbnb datasets. Data was ingested via Fabric Data Factory and stored in a Lakehouse following the Medallion Architecture (Bronze, Silver, Gold). Transformations were performed in Fabric Notebooks, with a dynamic SCD Type 2 applied in the Gold layer, and the final Gold layer was modelled as a Star Schema to support efficient analytics and reporting.


 Key Takeaways / Project Impact

  • Manual vs Automated SCDs: Manual SCD Type 2 works but is time-consuming and error-prone; DLT in Databricks could streamline this

  • Fabric Efficiency: Parameterized pipelines and email monitoring improved reliability and reduced operational risk

  • Dimensional Modeling: Star Schema and SCD Type 2 ensured historical tracking and query efficiency

  • Portfolio Impact: Demonstrates modern end-to-end pipeline design within Microsoft Fabric, showcasing adaptability to multiple cloud platforms


Architecture Diagram

Languages


Python & SQL


Architecture Summary


Data Ingestion: Fabric Data Factory


Fabric Data Factory (FDF) orchestrates and manages data ingestion activities.


Raw data is ingested into OneLake.


Ingested data lands in the staging (Bronze) layer in preparation for transformation.


Pipelines use triggers, datasets, and linked services to securely move data.


Unlike Azure Data Factory, which relies on Logic Apps for notifications, the pipeline was monitored with Fabric's built-in Email activity.


Enriched Layer Transformations


Transformations include modular cleaning, enrichment, and business logic application.


Curated Gold Layer: SCD Type 2


Since the project was built within Fabric rather than Databricks, DLT was not available; nevertheless, Slowly Changing Dimensions (SCD Type 2) were applied manually to dimension tables for historical tracking.


Upserts (SCD Type 1) are applied to fact tables to keep data up to date.


 Data Governance & Architecture

Goal: Ensure secure, centralized governance while implementing a scalable Medallion Architecture.

KPIs & Metrics:

  • Medallion Architecture: Bronze → Silver → Gold layers implemented

  • Governance: OneLake storage with parameterized pipelines and controlled access

  • Data Lineage: End-to-end lineage visualized in Fabric workspace

  • Monitoring: Email alerts configured for pipeline failures

Result Statement:

Successfully built a governed and traceable environment for Airbnb datasets, ensuring data quality, accessibility, and end-to-end visibility.

Phase 1: Ingestion via Fabric Data Factory


Our WorkSpace

Email Activity Outline

While using Fabric Data Factory, I noticed clear differences in pipeline orchestration that required some adaptation compared to Azure Data Factory. The main focus was on leveraging ForEach activities and parameterization effectively. I created a JSON control table of all parameterized entities in OneLake and used a Lookup activity to dynamically drive the ForEach loop, enabling successful ingestion of all datasets into their respective OneLake folders.
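
For illustration, such a control table can be as small as a JSON array with one entry per entity. The sketch below is only an assumption of its shape (the entity names, field names, and folder paths are illustrative, not the exact ones used in the pipeline); it writes the file to OneLake from a notebook so a Lookup activity can pick it up.

    # Hypothetical control table content; field names and paths are illustrative.
    import json

    control_table = [
        {"entity": "bookings",   "relative_url": "airbnb/bookings",   "sink_folder": "Files/bronze/bookings"},
        {"entity": "properties", "relative_url": "airbnb/properties", "sink_folder": "Files/bronze/properties"},
        {"entity": "hostings",   "relative_url": "airbnb/hostings",   "sink_folder": "Files/bronze/hostings"},
    ]

    # Persist to the default Lakehouse Files area so a Lookup activity can read it;
    # the Lookup output array then drives the ForEach items collection.
    with open("/lakehouse/default/Files/config/control_table.json", "w") as f:
        json.dump(control_table, f, indent=2)

Inside the loop, each iteration can then reference the current entry's fields (for example item().sink_folder, assuming that field name) in the Copy activity's source and sink settings.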


Our Pipeline
Our Raw Data

After the pipeline was successfully implemented, I added validation and monitoring by configuring an Email activity to trigger on failure, ensuring prompt notifications and easier management of any pipeline issues.


Goal: Efficiently ingest Airbnb source data into the Lakehouse.

KPIs & Metrics:

  • Fabric Data Factory: Orchestrated ingestion with parameterized ForEach activities

  • Dynamic Control: JSON control table enabled dynamic entity ingestion

  • Staging (Bronze): Raw datasets ingested into OneLake for downstream transformations

  • Error Handling: Email notification on pipeline failure for immediate intervention

Result Statement:

Delivered a fully automated ingestion process with dynamic configuration, providing reliable and monitored data flow from REST API sources to the Lakehouse.

Phase 2: Enriched Silver Layer Transformations


In the second phase of the project, I focused on data cleaning and transformations. This primarily involved adding new columns to provide analytical context and casting data types, particularly for our datasets ingested in Parquet format. Through these cleaning and transformation steps, I was able to derive meaningful insights using Fabric Notebooks.
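
As a rough sketch of the kind of cleaning performed in the notebooks (the dataset, column names, and derived field below are assumptions rather than the project's exact schema):

    # Illustrative Silver-layer cleaning: cast types, add analytical context, deduplicate.
    from pyspark.sql import functions as F

    bookings_raw = spark.read.parquet("Files/bronze/bookings")

    bookings_silver = (
        bookings_raw
        .withColumn("price", F.col("price").cast("double"))            # enforce numeric type
        .withColumn("booking_date", F.to_date("booking_date"))         # string -> date
        .withColumn("stay_length_days",
                    F.datediff("check_out_date", "check_in_date"))     # derived analytical column
        .dropDuplicates()
    )

    # Materialize as a Silver table for the downstream SCD Type 2 step.
    bookings_silver.write.mode("overwrite").format("delta").saveAsTable("silver_bookings")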


Silver Transformation Utils
Notebook Visualisation 1
Notebook Visualisation 2
Notebook Visualisation 3

After cleaning and transforming the data, the enriched datasets were written to the Silver Lakehouse and materialized as Silver tables to support the upcoming SCD Type 2 implementation. With Fabric automatically applying OPTIMIZE and Z-ORDER (V-Order in the Fabric context), manual optimizeWrite configuration was unnecessary, making the process seamless and efficient.

OPTIMIZE Maintenance Overview

Goal: Clean, enrich, and prepare datasets for analytical use.

KPIs & Metrics:

  • Data Cleaning: Added analytical context, casted data types, and resolved inconsistencies

  • Optimizations: Leveraged Fabric automatic optimizations (OPTIMIZE / Z-ORDER)

  • Materialization: Silver datasets written as tables to support SCD Type 2 implementation

Result Statement:

Produced enriched, analytics-ready datasets in the Silver Lakehouse, minimizing manual optimization effort.

Phase 3: Curated Gold Layer (SCD Type 2 and Star Schema)


To finalize the project, I implemented dynamic dimensional modeling with a focus on SCD Type 2 for historical tracking, effectively creating an all-in-one SCD Type 2 solution in a single notebook. I then curated the fact table to enable efficient joins and support downstream analytics.
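
The actual notebook logic is shown in the screenshots below; as a condensed sketch of the general SCD Type 2 pattern it follows (the table names, business key, and tracked columns here are assumptions):

    # Simplified SCD Type 2 pattern over Delta tables; all names are illustrative.
    from pyspark.sql import functions as F
    from delta.tables import DeltaTable

    key_col = "property_id"                       # assumed business key
    tracked = ["host_id", "city", "room_type"]    # assumed attributes tracked for history

    source = spark.table("silver_properties")
    target = DeltaTable.forName(spark, "gold_dim_properties")

    # Any tracked attribute differing from the current target version counts as a change
    change_cond = " OR ".join(f"t.{c} <> s.{c}" for c in tracked)

    # 1) Expire the current version of rows whose attributes changed
    (target.alias("t")
       .merge(source.alias("s"), f"t.{key_col} = s.{key_col} AND t.is_current = true")
       .whenMatchedUpdate(condition=change_cond,
                          set={"is_current": "false", "end_date": "current_timestamp()"})
       .execute())

    # 2) Append a fresh current version for brand-new keys and for keys expired in step 1
    current_keys = spark.table("gold_dim_properties").filter("is_current = true").select(key_col)
    new_versions = (source.join(current_keys, on=key_col, how="left_anti")
        .withColumn("is_current", F.lit(True))
        .withColumn("start_date", F.current_timestamp())
        .withColumn("end_date", F.lit(None).cast("timestamp")))

    new_versions.write.mode("append").format("delta").saveAsTable("gold_dim_properties")

Driving the key and tracked-column lists from configuration variables is what makes the same notebook reusable across all dimension tables.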


Configs Variables
SCD Type 2 Logic

For the fact table, I applied SCD Type 1 to handle updates and inserts, ensuring the data remained current while maintaining historical tracking in the dimensions.
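
A minimal sketch of that upsert pattern, assuming a Delta fact table and a listing_id key (both names illustrative):

    # SCD Type 1 upsert: overwrite changed facts in place, insert new ones (no history kept).
    from delta.tables import DeltaTable

    facts_source = spark.table("silver_listings")
    facts_target = DeltaTable.forName(spark, "gold_fact_listings")

    (facts_target.alias("t")
       .merge(facts_source.alias("s"), "t.listing_id = s.listing_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())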

SCD Type 1 Logic

Goal: Enable historical tracking and efficient analytics with curated datasets.

KPIs & Metrics:

  • SCD Type 2: Implemented dynamic historical tracking in dimension tables

  • SCD Type 1: Applied upserts to fact tables for real-time updates

  • Star Schema: Modeled fact and dimension tables for optimized querying and reporting

  • Curated Gold: Final datasets ready for warehouse ingestion and analytics

Result Statement:

Delivered a fully curated Gold layer with dynamic SCD Type 2 management, enabling historical tracking and star-schema-driven analytics.

Loading Into Fabric Data Warehouse


Lakehouse view of our curated datasets

Dim_Bookings
Dim_Properties
Dim_Hostings

Curated datasets were loaded into the warehouse using CTAS to create structured tables and schemas, providing better control and organization. After loading, I validated the datasets with SQL queries and visualized the data to support analysis.
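
A hedged example of that loading pattern, written against the warehouse SQL endpoint (the warehouse schema, Lakehouse name, and column list are assumptions):

    -- CTAS from the Lakehouse into the warehouse via a cross-database query (names illustrative)
    CREATE TABLE dbo.Dim_Properties
    AS
    SELECT property_id, host_id, city, room_type, start_date, end_date, is_current
    FROM   Airbnb_Lakehouse.dbo.gold_dim_properties;

    -- Quick post-load validation
    SELECT COUNT(*) AS row_count FROM dbo.Dim_Properties;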


Airbnb_Hostings
Facts_Listings
Airbnb_Bookings
Airbnb_Properties

Interested in the code and GitHub repository?




Data Visualisation


After loading the data, I visualized the curated Airbnb datasets to validate transformations in PowerBI within Fabric and uncover actionable insights.


PowerBI Visualisation 1
PowerBI Visualisation 2
PowerBI Visualisation 3

Goal: Make curated datasets available for reporting and BI consumption.

KPIs & Metrics:

  • Lakehouse to Warehouse: CTAS used to create structured tables for Fabric Data Warehouse

  • SQL Validation: Queries executed to confirm integrity and readiness for analysis

  • Data Visualization: Power BI dashboards created within Fabric to validate transformations and uncover insights

Result Statement:

Enabled seamless access to analytics-ready Airbnb datasets, validated and visualized, supporting business intelligence and reporting needs.

Semantic Model


This semantic model illustrates the Star Schema, showing the relationships between the fact table and its dimension tables to support efficient analytics and reporting.


Semantic Model


Data Lineage


This diagram shows our workspace and full data lineage, tracing data from ingestion through transformations to the final loading of curated datasets in the Fabric Data Warehouse, highlighting how each stage creates analytics-ready data.


Data Lineage Outline

Takeaways


In my view, integrating Databricks could have streamlined pipeline development, as idempotency is key in modern data engineering. Auto Loader on the Silver layer would handle schema evolution automatically, while a DLT/SDP Auto CDC flow could automate SCD Type 2, reducing manual coding and simplifying historical data management.
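
For comparison, a minimal sketch of what that declarative approach looks like with DLT's apply_changes API, assuming a streaming silver_properties source and an updated_at ordering column (both names illustrative):

    # Declarative SCD Type 2 in Databricks DLT (illustrative table and column names).
    import dlt
    from pyspark.sql.functions import col

    dlt.create_streaming_table("dim_properties")

    dlt.apply_changes(
        target="dim_properties",
        source="silver_properties",       # expected to be a streaming source
        keys=["property_id"],
        sequence_by=col("updated_at"),
        stored_as_scd_type=2,             # DLT manages start/end dates and current flags
    )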

Nevertheless, this doesn’t diminish the efficiency of Microsoft Fabric. As the saying goes: “Manual SCDs give you headaches; DLT lets you sleep.” This was the key takeaway from the project.

 
 
 

