
Microsoft Fabric Airbnb Data Engineering Project

  • Writer: Jesse Pepple
  • Jan 8
  • 5 min read

Updated: Jan 11

This project showcases end-to-end data engineering in Microsoft Fabric using Airbnb datasets. Data was ingested via Fabric Data Factory and stored in a Lakehouse following the Medallion Architecture (Bronze, Silver, Gold). Transformations were performed in Fabric Notebooks, with a dynamic SCD Type 2 applied in the Gold layer, and the final Gold layer was modelled as a Star Schema to support efficient analytics and reporting.


 Key Takeaways / Project Impact

  • Manual vs Automated SCDs: Manual SCD Type 2 works but is time-consuming and error-prone; DLT in Databricks could streamline this

  • Fabric Efficiency: Parameterized pipelines and email monitoring improved reliability and reduced operational risk

  • Dimensional Modeling: Star Schema and SCD Type 2 ensured historical tracking and query efficiency

  • Portfolio Impact: Demonstrates modern end-to-end pipeline design within Microsoft Fabric, showcasing adaptability to multiple cloud platforms


Architecture Diagram

Languages


Python & SQL


Architecture Summary


Data Ingestion: Fabric Data Factory


Fabric Data Factory (FDF) orchestrates and manages data ingestion activities.


Raw data is ingested into OneLake.


Ingested data lands in the staging (Bronze) layer in preparation for transformation.


Pipelines use triggers, datasets, and linked services to securely move data.


Unlike Azure Data Factory, which relies on Logic Apps for notifications, the pipeline was monitored with Fabric's built-in Email activity.


Enriched Layer Transformations


Transformations include modular cleaning, enrichment, and business logic application.


Curated Gold Layer: SCD Type 2


Since the project was built within Fabric rather than Databricks, DLT was not available; nevertheless, Slowly Changing Dimensions (SCD Type 2) were applied manually to dimension tables for historical tracking.


Upserts (SCD Type 1) are applied to fact tables to keep data up to date.


 Data Governance & Architecture

Goal: Ensure secure, centralized governance while implementing a scalable Medallion Architecture.

KPIs & Metrics:

  • Medallion Architecture: Bronze → Silver → Gold layers implemented

  • Governance: OneLake storage with parameterized pipelines and controlled access

  • Data Lineage: End-to-end lineage visualized in Fabric workspace

  • Monitoring: Email alerts configured for pipeline failures

Result Statement:

Successfully built a governed and traceable environment for Airbnb datasets, ensuring data quality, accessibility, and end-to-end visibility.

Phase 1: Ingestion via Fabric Data Factory


Our WorkSpace

Email Activity Outline

While using Fabric Data Factory, I noticed clear differences in pipeline orchestration that required some adaptation compared to Azure Data Factory. The main focus was on leveraging ForEach activities and parameterization effectively. I created a JSON control table of all parameterized entities in OneLake and used a Lookup activity to dynamically drive the ForEach loop, enabling successful ingestion of all datasets into their respective OneLake folders.
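
For illustration, such a control table can be as small as a JSON array with one entry per entity. The sketch below is only an assumption of its shape (the entity names, field names, and folder paths are illustrative, not the exact ones used in the pipeline); it writes the file to OneLake from a notebook so a Lookup activity can pick it up.

    # Hypothetical control table content; field names and paths are illustrative.
    import json

    control_table = [
        {"entity": "bookings",   "relative_url": "airbnb/bookings",   "sink_folder": "Files/bronze/bookings"},
        {"entity": "properties", "relative_url": "airbnb/properties", "sink_folder": "Files/bronze/properties"},
        {"entity": "hostings",   "relative_url": "airbnb/hostings",   "sink_folder": "Files/bronze/hostings"},
    ]

    # Persist to the default Lakehouse Files area so a Lookup activity can read it;
    # the Lookup output array then drives the ForEach items collection.
    with open("/lakehouse/default/Files/config/control_table.json", "w") as f:
        json.dump(control_table, f, indent=2)

Inside the loop, each iteration can then reference the current entry's fields (for example item().sink_folder, assuming that field name) in the Copy activity's source and sink settings.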


Our Pipeline
Our Raw Data

After the pipeline was successfully implemented, I added validation and monitoring by configuring an Email activity to trigger on failure, ensuring prompt notifications and easier management of any pipeline issues.


Goal: Efficiently ingest Airbnb source data into the Lakehouse.

KPIs & Metrics:

  • Fabric Data Factory: Orchestrated ingestion with parameterized ForEach activities

  • Dynamic Control: JSON control table enabled dynamic entity ingestion

  • Staging (Bronze): Raw datasets ingested into OneLake for downstream transformations

  • Error Handling: Email notification on pipeline failure for immediate intervention

Result Statement:

Delivered a fully automated ingestion process with dynamic configuration, providing reliable and monitored data flow from REST API sources to the Lakehouse.

Phase 2: Enriched Silver Layer Transformations


In the second phase of the project, I focused on data cleaning and transformations. This primarily involved adding new columns to provide analytical context and casting data types, particularly for our datasets ingested in Parquet format. Through these cleaning and transformation steps, I was able to derive meaningful insights using Fabric Notebooks.
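
As a rough sketch of the kind of cleaning performed in the notebooks (the dataset, column names, and derived field below are assumptions rather than the project's exact schema):

    # Illustrative Silver-layer cleaning: cast types, add analytical context, deduplicate.
    from pyspark.sql import functions as F

    bookings_raw = spark.read.parquet("Files/bronze/bookings")

    bookings_silver = (
        bookings_raw
        .withColumn("price", F.col("price").cast("double"))            # enforce numeric type
        .withColumn("booking_date", F.to_date("booking_date"))         # string -> date
        .withColumn("stay_length_days",
                    F.datediff("check_out_date", "check_in_date"))     # derived analytical column
        .dropDuplicates()
    )

    # Materialize as a Silver table for the downstream SCD Type 2 step.
    bookings_silver.write.mode("overwrite").format("delta").saveAsTable("silver_bookings")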


Silver Transformation Utils
Notebook Visualisation 1
Notebook Visualisation 2
Notebook Visualisation 3

After cleaning and transforming the data, the enriched datasets were written to the Silver Lakehouse and materialized as Silver tables to support the upcoming SCD Type 2 implementation. With Fabric automatically applying OPTIMIZE and Z-ORDER (V-Order in the Fabric context), manual optimizeWrite configuration was unnecessary, making the process seamless and efficient.

OPTIMIZE Maintenance Overview

Goal: Clean, enrich, and prepare datasets for analytical use.

KPIs & Metrics:

  • Data Cleaning: Added analytical context, casted data types, and resolved inconsistencies

  • Optimizations: Leveraged Fabric automatic optimizations (OPTIMIZE / Z-ORDER)

  • Materialization: Silver datasets written as tables to support SCD Type 2 implementation

Result Statement:

Produced enriched, analytics-ready datasets in the Silver Lakehouse, minimizing manual optimization effort.

Phase 3: Curated Gold Layer (SCD Type 2 and Star Schema)


To finalize the project, I implemented dynamic dimensional modeling with a focus on SCD Type 2 for historical tracking, effectively creating an all-in-one SCD Type 2 solution in a single notebook. I then curated the fact table to enable efficient joins and support downstream analytics.
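
The actual notebook logic is shown in the screenshots below; as a condensed sketch of the general SCD Type 2 pattern it follows (the table names, business key, and tracked columns here are assumptions):

    # Simplified SCD Type 2 pattern over Delta tables; all names are illustrative.
    from pyspark.sql import functions as F
    from delta.tables import DeltaTable

    key_col = "property_id"                       # assumed business key
    tracked = ["host_id", "city", "room_type"]    # assumed attributes tracked for history

    source = spark.table("silver_properties")
    target = DeltaTable.forName(spark, "gold_dim_properties")

    # Any tracked attribute differing from the current target version counts as a change
    change_cond = " OR ".join(f"t.{c} <> s.{c}" for c in tracked)

    # 1) Expire the current version of rows whose attributes changed
    (target.alias("t")
       .merge(source.alias("s"), f"t.{key_col} = s.{key_col} AND t.is_current = true")
       .whenMatchedUpdate(condition=change_cond,
                          set={"is_current": "false", "end_date": "current_timestamp()"})
       .execute())

    # 2) Append a fresh current version for brand-new keys and for keys expired in step 1
    current_keys = spark.table("gold_dim_properties").filter("is_current = true").select(key_col)
    new_versions = (source.join(current_keys, on=key_col, how="left_anti")
        .withColumn("is_current", F.lit(True))
        .withColumn("start_date", F.current_timestamp())
        .withColumn("end_date", F.lit(None).cast("timestamp")))

    new_versions.write.mode("append").format("delta").saveAsTable("gold_dim_properties")

Driving the key and tracked-column lists from configuration variables is what makes the same notebook reusable across all dimension tables.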


Configs Variables
SCD Type 2 Logic

For the fact table, I applied SCD Type 1 to handle updates and inserts, ensuring the data remained current while maintaining historical tracking in the dimensions.
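
A minimal sketch of that upsert pattern, assuming a Delta fact table and a listing_id key (both names illustrative):

    # SCD Type 1 upsert: overwrite changed facts in place, insert new ones (no history kept).
    from delta.tables import DeltaTable

    facts_source = spark.table("silver_listings")
    facts_target = DeltaTable.forName(spark, "gold_fact_listings")

    (facts_target.alias("t")
       .merge(facts_source.alias("s"), "t.listing_id = s.listing_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())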

SCD Type 1 Logic

Goal: Enable historical tracking and efficient analytics with curated datasets.

KPIs & Metrics:

  • SCD Type 2: Implemented dynamic historical tracking in dimension tables

  • SCD Type 1: Applied upserts to fact tables for real-time updates

  • Star Schema: Modeled fact and dimension tables for optimized querying and reporting

  • Curated Gold: Final datasets ready for warehouse ingestion and analytics

Result Statement:

Delivered a fully curated Gold layer with dynamic SCD Type 2 management, enabling historical tracking and star-schema-driven analytics.

Loading Into Fabric Data Warehouse


Lakehouse view of our curated datasets

Dim_Bookings
Dim_Properties
Dim_Hostings

Curated datasets were loaded into the warehouse using CTAS to create structured tables and schemas, providing better control and organization. After loading, I validated the datasets with SQL queries and visualized the data to support analysis.
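
A hedged example of that loading pattern, written against the warehouse SQL endpoint (the warehouse schema, Lakehouse name, and column list are assumptions):

    -- CTAS from the Lakehouse into the warehouse via a cross-database query (names illustrative)
    CREATE TABLE dbo.Dim_Properties
    AS
    SELECT property_id, host_id, city, room_type, start_date, end_date, is_current
    FROM   Airbnb_Lakehouse.dbo.gold_dim_properties;

    -- Quick post-load validation
    SELECT COUNT(*) AS row_count FROM dbo.Dim_Properties;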


Airbnb_Hostings
Facts_Listings
Airbnb_Bookings
Airbnb_Properties

Interested in the code and GitHub repository?




Data Visualisation


After loading the data, I visualized the curated Airbnb datasets to validate transformations in PowerBI within Fabric and uncover actionable insights.


PowerBI Visualisation 1
PowerBI Visualisation 2
PowerBI Visualisation 3

Goal: Make curated datasets available for reporting and BI consumption.

KPIs & Metrics:

  • Lakehouse to Warehouse: CTAS used to create structured tables for Fabric Data Warehouse

  • SQL Validation: Queries executed to confirm integrity and readiness for analysis

  • Data Visualization: Power BI dashboards created within Fabric to validate transformations and uncover insights

Result Statement:

Enabled seamless access to analytics-ready Airbnb datasets, validated and visualized, supporting business intelligence and reporting needs.

Semantic Model


This semantic model illustrates the Star Schema, showing the relationships between the fact table and its dimension tables to support efficient analytics and reporting.


Semantic Model


Data Lineage


This diagram shows our workspace and full data lineage, tracing data from ingestion through transformations to the final loading of curated datasets in the Fabric Data Warehouse, highlighting how each stage creates analytics-ready data.


Data Lineage Outline

Takeaways


In my view, integrating Databricks could have streamlined pipeline development, as idempotency is key in modern data engineering. Auto Loader on the Silver layer would handle schema evolution automatically, while a DLT/SDP Auto CDC flow could automate SCD Type 2, reducing manual coding and simplifying historical data management.
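
For comparison, a minimal sketch of what that declarative approach looks like with DLT's apply_changes API, assuming a streaming silver_properties source and an updated_at ordering column (both names illustrative):

    # Declarative SCD Type 2 in Databricks DLT (illustrative table and column names).
    import dlt
    from pyspark.sql.functions import col

    dlt.create_streaming_table("dim_properties")

    dlt.apply_changes(
        target="dim_properties",
        source="silver_properties",       # expected to be a streaming source
        keys=["property_id"],
        sequence_by=col("updated_at"),
        stored_as_scd_type=2,             # DLT manages start/end dates and current flags
    )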

Nevertheless, this doesn’t diminish the efficiency of Microsoft Fabric. As the saying goes: “Manual SCDs give you headaches; DLT lets you sleep.” This was the key takeaway from the project.

 
 
 

