The Definitive Guide to Data Warehousing: From ERP Replicas to Cloud-Native Models

1. Why are Legacy Databases unsuited for modern reporting?

Ever wonder why it’s so hard to get a straight answer from a company’s main database (often called an ERP)? It’s because it was built for a different job.

Analogy: A Fast-Food Kitchen vs. a Research Lab.
An ERP is like a fast-food kitchen: its only job is to get one order out the door as fast as possible. It’s not designed to analyze every order from the past year. A data warehouse is the research lab, built specifically for that kind of deep analysis.

This “fast-food” design leads to two big problems for data analysts:

No Safety Nets (Lack of Referential Integrity): To keep things fast, old systems often didn’t have rules to ensure an `order` record was linked to a real `customer` record. This was a business trade-off: they chose speed and uptime over perfect data cleanliness.
Crazy Joins: Data is split into many tiny tables. Getting a complete picture of one customer might require joining 10 different tables, which is slow and complex.

2. The First Fix: The “Photocopy” (A Type 1 Warehouse)

When the main database gets too slow from reporters running queries, the first and easiest solution is to make a copy of it. This copy is what we call an ERP Replica or a Type 1 Data Warehouse.

Analogy: A Photocopy Machine. A Type 1 warehouse is like a photocopier. Every night, it takes a fresh picture of the main database and overwrites yesterday’s picture. You only ever have the most recent version.

The Good vs. The Bad of the “Photocopy” Approach

The Good (Why we do it)	The Bad (Why it’s not enough)
It’s Fast and Easy: Making a copy is simple and gives immediate relief to the main database.	History is Erased: If a customer moves, their old address is overwritten and gone forever. You can’t track changes.
It’s Familiar: The tables look exactly like the source, so people can start using it right away.	Joining is Still a Pain: You’ve copied the messy table structure, so analysis is still difficult.

3. The Big Upgrade: Building a “Time Machine” (A Type 2 Warehouse)

What if your boss asks, “How many customers moved from Sydney to Melbourne last year?” Your “photocopy” can’t answer this. You need a Type 2 Data Warehouse — a data time machine.

The Secret Ingredient: The Surrogate Key

To build a time machine, we need a special key.

Analogy: A Permanent Library Card Number. A surrogate key is like a permanent ID number you get from a library. Even if you change your name or address, your library card number stays the same forever. It uniquely identifies *you*, the person.

Example: Tracking a Customer Who Moves

Let’s follow our customer, Jane Doe. Her business ID is `CUST-123`.

When Jane signs up: We create a record in our `dim_customer` table.

SK	Business_ID	Name	Address	Is_Current
101	CUST-123	Jane Doe	123 Main St, Sydney	true

Jane moves to Melbourne: The “time machine” now kicks in. Instead of overwriting her old record, we do two things:

We “retire” her old record by setting `Is_Current` to `false`.
We create a new record for her with a new surrogate key (SK) and her new address.

SK	Business_ID	Name	Address	Is_Current	Comment
101	CUST-123	Jane Doe	123 Main St, Sydney	false	— Retired record
102	CUST-123	Jane Doe	456 Oak Ave, Melbourne	true	— The new, current record

Now, we have a perfect history! We can join old sales to her Sydney record and new sales to her Melbourne record.

4. The Great Migration: A Real-World Scenario

The Company: “Legacy Parts Inc.”

Imagine a company that sells spare parts for old industrial machines. Their ERP system is 20 years old. It has no enforced referential integrity.

Their Problems:

Inaccurate Sales Reports: When a customer company gets acquired and changes its name, all their past sales history is “lost” because the primary name field is overwritten.
Inventory Nightmares: They can’t reliably link inventory records to supplier records, so they don’t know which suppliers provide the most faulty parts.
Slow Performance: The ERP grinds to a halt every time the finance team tries to run a quarterly report.

The Solution: A Two-Phase Approach

Phase 1: Stop the Bleeding (Build the Type 1 Replica)

The first step is to get reporting off the main ERP. The data engineering team uses a tool like Pentaho to set up daily jobs that copy the key tables from their legacy ERP into a new PostgreSQL database. This is the “photocopy” warehouse.

Immediate Win: The ERP is fast again. The business can operate.
Lingering Problem: The reports are still inaccurate, but at least they don’t crash the system.

Phase 2: Build the Time Machine (Migrate to a True DWH)

Now that the fires are out, the team can think strategically. They decide the best long-term solution is to move to a modern cloud data warehouse. This solves their core problems.

Fix Referential Integrity: They can’t fix the old ERP, but they can build the rules in their new system. They use a tool like dbt to write transformation logic that says, “Only create a sales record if a valid customer exists in the customer table.” They build the safety nets that the ERP lacked.
Implement SCD Type 2: They use dbt’s `snapshot` feature or write custom SQL models to build the “time machine” for their customer and supplier data. When “Legacy Parts Inc.” customer “ABC Corp” becomes “Global Mega Corp,” they now have a full history for both names.
Leverage the Cloud: By using a cloud platform like Snowflake or BigQuery, they get virtually unlimited storage for history and can pay for computing power only when they need it to run their transformations.

The Best Approach: The cloud-native ELT (Extract, Load, Transform) approach is the modern gold standard. It allows companies like “Legacy Parts Inc.” to bypass the limitations of their old systems and build a reliable, scalable, and historically accurate data asset without a massive upfront investment.

5. Your Interview Cheat Sheet: The FAQ

1. In simple terms, what’s the difference between a Type 1 and Type 2 DWH?

A Type 1 DWH is like a photograph; it only shows the most recent picture of the data and overwrites the old one. A Type 2 DWH is like a photo album; it keeps every picture, allowing you to see how the data has changed over time.

2. Why would a company have an old database without “referential integrity”?

Usually for two reasons: speed and practicality. Enforcing those rules can slow down a high-speed system. Also, sometimes they had to load data from other messy systems and chose to accept the data rather than have the load fail. It was a business decision to prioritize operations over perfect data cleanliness.

3. When is it okay to use Type 1 (overwriting)?

It’s okay for data where history doesn’t matter. The most common example is fixing a simple typo. If a customer’s name was misspelled “Jnae,” you can just correct it to “Jane.” You don’t need to track that a typo once existed.

4. How has the cloud changed data warehousing?

It has made it cheaper and faster. We can now store massive amounts of historical data without breaking the bank. More importantly, we use an “ELT” approach: we load raw data into a powerful cloud warehouse like Snowflake first, and then use its massive computing power to transform the data using tools like dbt. This is more scalable and flexible than the old way.

The Data Warehouse Evolution: A Visual Journey

%% Compact, Vertically-Aligned Flowchart of Data Warehouse Evolution
graph TD

    %% Phase 1: Legacy System
    subgraph "START HERE: Legacy System"
        LegacyERP["🗃️ Legacy ERP DB"]:::problem
        LegacyERP -->|Nightly Job| PentahoETL["⚙️ Pentaho ETL"]:::tool
        PentahoETL -->|"Creates a\n'Photocopy'"| PG_Type1["📦 PG DWH\n(Type 1)"]:::storage
    end

    subgraph "Problems with Type 1 DWH"
        PG_Type1 --> P1["✅ Quick Setup"]:::solution
        PG_Type1 --> P2["❌ No History"]:::problem
        PG_Type1 --> P3["❌ Unreliable Joins"]:::problem
    end

    %% Transition
    P2 --> Goal{"🎯 Need a 'Time Machine'\n(Type 2 History)"}:::goal

    %% Method 1
    subgraph "METHOD A: On-Prem (Legacy)"
        Goal -->|Path A| SCD2Job["⚙️ Pentaho SCD2"]:::tool
        SCD2Job --> PG_Type2["📦 PG DWH\n(Type 2)"]:::storage
        PG_Type2 --> C1["🐌 Slow & Brittle"]:::problem
        PG_Type2 --> C2["📦 Hard to Maintain"]:::problem
    end

    %% Method 2
    subgraph "METHOD B: Cloud Modern (✅)"
        Goal -->|Path B| LoadRaw["1️⃣ Load Raw\nData to Cloud"]:::tool
        LoadRaw --> Snowflake["☁️ Cloud DWH\n(Snowflake / BQ)"]:::storage
        Snowflake -->|2️⃣ Transform| dbtSQL["⚙️ dbt + SQL"]:::tool
        dbtSQL -->|"Builds Clean\nHistorical Tables"| Snowflake
        Snowflake --> B1["🚀 Fast & Scalable"]:::solution
        Snowflake --> B2["✅ Easy Testing & Versioning"]:::solution
    end

    %% Styling
    classDef problem fill:#ffdddd,stroke:#c0392b,stroke-width:2px,color:#c0392b,font-size:14px
    classDef solution fill:#d1f7e3,stroke:#2ecc71,stroke-width:2px,color:#145a32,font-size:14px
    classDef tool fill:#e3f2fd,stroke:#2980b9,stroke-width:2px,color:#0b3c5d,font-size:14px
    classDef storage fill:#f0f0f0,stroke:#555,stroke-width:2px,font-size:14px
    classDef goal fill:#fff9c4,stroke:#f39c12,stroke-width:2px,color:#b9770e,font-size:14px

Discover more from Data Engineer Journey

Subscribe to get the latest posts sent to your email.

The Definitive Guide to Data Warehousing: From ERP Replicas to Cloud-Native Models

1. Why are Legacy Databases unsuited for modern reporting?

2. The First Fix: The “Photocopy” (A Type 1 Warehouse)

The Good vs. The Bad of the “Photocopy” Approach

3. The Big Upgrade: Building a “Time Machine” (A Type 2 Warehouse)

The Secret Ingredient: The Surrogate Key

Example: Tracking a Customer Who Moves

4. The Great Migration: A Real-World Scenario

The Company: “Legacy Parts Inc.”

Their Problems:

The Solution: A Two-Phase Approach

Phase 1: Stop the Bleeding (Build the Type 1 Replica)

Phase 2: Build the Time Machine (Migrate to a True DWH)

5. Your Interview Cheat Sheet: The FAQ

The Data Warehouse Evolution: A Visual Journey

Like this:

Related

Discover more from Data Engineer Journey

Leave a Comment Cancel Reply

1. Why are Legacy Databases unsuited for modern reporting?

2. The First Fix: The “Photocopy” (A Type 1 Warehouse)

The Good vs. The Bad of the “Photocopy” Approach

3. The Big Upgrade: Building a “Time Machine” (A Type 2 Warehouse)

The Secret Ingredient: The Surrogate Key

Example: Tracking a Customer Who Moves

4. The Great Migration: A Real-World Scenario

The Company: “Legacy Parts Inc.”

Their Problems:

The Solution: A Two-Phase Approach

Phase 1: Stop the Bleeding (Build the Type 1 Replica)

Phase 2: Build the Time Machine (Migrate to a True DWH)

5. Your Interview Cheat Sheet: The FAQ

The Data Warehouse Evolution: A Visual Journey

Share this:

Like this:

Related

Discover more from Data Engineer Journey

Leave a Comment Cancel Reply

Discover more from Data Engineer Journey