Data Engineering

Azure Medallion Lakehouse for Hospitality Analytics

A production Azure data lakehouse for a large hospitality enterprise, ingesting reservation and revenue data from two parallel property management systems through a five-layer medallion architecture — Landing → Bronze → Silver → Gold — with an integrated data quality framework, star schema data warehouse, and Power BI consumption layer for enterprise business intelligence.

5-Layer

medallion lakehouse from raw CSV to BI-ready star schema

2 Systems

unified — legacy and cloud PMS into one data model

Full DQ

checks at every layer boundary with automated alerting

1The Problem

A large hospitality enterprise operated two parallel hotel property management systems — a legacy on-premise PMS and a newer cloud-based PMS — each producing reservation, revenue, and master data in incompatible schemas and formats. With no unified data platform, reporting was fragmented, reconciliation between systems was manual and error-prone, and there was no reliable single source of truth for occupancy, revenue, and discount calculations across properties. Undistributed revenue (discount) data arrived via web extraction, company files required external ID mapping, and all source data arrived as raw CSV files with no data quality enforcement. The absence of a structured pipeline meant that any downstream BI dashboard was only as trustworthy as the last manual data reconciliation — an operational risk where pricing and revenue decisions depend on accurate, timely data.

2The Approach

We designed and implemented a modern data lakehouse on Azure following the medallion architecture pattern, orchestrated end-to-end by Azure Data Factory. All source data — PMS reservation exports, master data files, undistributed revenue (discount) files, and company files — enters through a DROPPOINT zone as CSV. From there, ADF pipelines move data through a Landing → Bronze → Silver → Gold progression stored across ADLS Gen2 and Synapse Dedicated SQL Pool. Synapse Spark Pools handle all ETL transformation at each layer transition. The Silver layer is the critical consolidation point where the two PMS data streams are unified into a common data model using SCD Type-1 merge logic. The Gold layer is modelled as a star schema with fact tables, dimension tables, and consumption views that aggregate data for Power BI reporting. A metadata-driven data quality framework is embedded at every layer boundary, with DQ configurations stored in Azure SQL Database, results logged back to the metadata store, and email alerts fired on any DQ check failure.

Technical Architecture

Source & DropPoint: Raw CSV files from the legacy PMS (via R&A scripts), cloud PMS (extract scripts), master data files, undistributed revenue (discount) files via web extraction, and company files — all land at the DROPPOINT zone in ADLS Gen2

Landing Layer (CSV, ADLS Gen2): Files reorganised and made available for processing without modification — the immutable raw copy used for reprocessing and audit

Bronze Layer (Parquet, ADLS Gen2): CSV converted to Parquet for efficient columnar storage; type conversions, de-duplication, cleaning, filtering, sanitisation, and common ID assignment applied via Synapse Spark Pools

Silver Layer (Delta, ADLS Gen2): Consolidation layer where legacy and cloud PMS data streams are unified into a single Common Data Model using SCD Type-1 merge logic; schema standardisation resolves all differences between the two source systems

Gold Layer (SQL, Synapse Dedicated SQL Pool): Star schema with fact and dimension tables built by joining and loading from Silver; Consumption Views aggregate and summarise data specifically for Power BI dashboards

Orchestration (Azure Data Factory): ADF pipelines orchestrate all layer transitions (Ingest → Landing → ETL → Bronze → ETL → Silver → ETL → Gold Main → ETL → Gold Consumption); watermark-based incremental loading managed via metadata DB

Data Quality Framework: Metadata-driven DQ engine configured via dataset_dq_config table in Azure SQL Database; DQ checks injected at LandToBronze, BronzeToSilver, and SilverToGold boundaries; results logged to MetadataDB; email alerts fired on DQ check failure

Metadata Store (Azure SQL Database): Stores dataset configuration, pipeline logging, watermark state, and all data quality check configurations and results

Reporting Layer: Power BI dashboards consume Gold Consumption Views for enterprise BI; pre-aggregated views minimise query load on the Dedicated SQL Pool

Results

Unified two incompatible PMS reservation data streams into a single, reconciled star schema for the first time — eliminating manual cross-system reconciliation
Automated end-to-end pipeline from raw CSV ingestion to Power BI-ready consumption views, replacing manual data preparation processes
Metadata-driven DQ framework catches data quality issues at every layer boundary before they propagate to the Gold layer and downstream dashboards
Incremental watermark-based loading via ADF reduces processing time and compute cost compared to full reloads
Parquet (Bronze) and Delta (Silver) storage formats reduced storage footprint and improved Spark query performance over raw CSV
Star schema Gold layer and pre-aggregated consumption views deliver fast, consistent query performance for Power BI across all enterprise reporting use cases

Key Insights

“The Silver layer is the most important and most underestimated layer in a medallion architecture — it is where two incompatible source systems are forced into a single common data model, and getting the SCD Type-1 merge logic right here is what determines whether every downstream report is trustworthy or not.”

“Embedding data quality checks at every layer boundary rather than only at ingestion or only before Gold is the difference between catching a schema drift on day one and discovering it six weeks later when a finance dashboard produces wrong revenue numbers.”

“A metadata-driven DQ framework (configurations in a table, not hardcoded in pipelines) means new datasets and new checks can be added without touching pipeline code — the right architecture for a platform that needs to onboard new source systems over time.”

“Keeping the orchestration (ADF), compute (Synapse Spark), and storage (ADLS Gen2) as separate concerns makes each independently scalable and replaceable — a monolithic ETL tool that blends all three is a significant operational liability at enterprise scale.”

Tech Stack

Azure Data FactoryAzure Data Lake Storage Gen2Azure Synapse AnalyticsSynapse Spark PoolsSynapse Dedicated SQL PoolAzure SQL DatabaseDelta LakeApache ParquetPower BISQLPySpark

Interested in a similar project?

I'd love to discuss how we can apply this kind of approach to your specific business challenge.

Let's Talk