
1. Why are Legacy Databases unsuited for modern reporting?
Ever wonder why it’s so hard to get a straight answer from a company’s main database (often called an ERP)? It’s because it was built for a different job.
An ERP is like a fast-food kitchen: its only job is to get one order out the door as fast as possible. It’s not designed to analyze every order from the past year. A data warehouse is the research lab, built specifically for that kind of deep analysis.
This “fast-food” design leads to two big problems for data analysts:
- No Safety Nets (Lack of Referential Integrity): To keep things fast, old systems often didn’t have rules to ensure an `order` record was linked to a real `customer` record. This was a business trade-off: they chose speed and uptime over perfect data cleanliness.
- Crazy Joins: Data is split into many tiny tables. Getting a complete picture of one customer might require joining 10 different tables, which is slow and complex.
2. The First Fix: The “Photocopy” (A Type 1 Warehouse)
When the main database gets too slow from reporters running queries, the first and easiest solution is to make a copy of it. This copy is what we call an ERP Replica or a Type 1 Data Warehouse.
The Good vs. The Bad of the “Photocopy” Approach
The Good (Why we do it) | The Bad (Why it’s not enough) |
---|---|
It’s Fast and Easy: Making a copy is simple and gives immediate relief to the main database. | History is Erased: If a customer moves, their old address is overwritten and gone forever. You can’t track changes. |
It’s Familiar: The tables look exactly like the source, so people can start using it right away. | Joining is Still a Pain: You’ve copied the messy table structure, so analysis is still difficult. |
3. The Big Upgrade: Building a “Time Machine” (A Type 2 Warehouse)
What if your boss asks, “How many customers moved from Sydney to Melbourne last year?” Your “photocopy” can’t answer this. You need a Type 2 Data Warehouse — a data time machine.
The Secret Ingredient: The Surrogate Key
To build a time machine, we need a special key.
Example: Tracking a Customer Who Moves
Let’s follow our customer, Jane Doe. Her business ID is `CUST-123`.
When Jane signs up: We create a record in our `dim_customer` table.
SK | Business_ID | Name | Address | Is_Current |
---|---|---|---|---|
101 | CUST-123 | Jane Doe | 123 Main St, Sydney | true |
Jane moves to Melbourne: The “time machine” now kicks in. Instead of overwriting her old record, we do two things:
- We “retire” her old record by setting `Is_Current` to `false`.
- We create a new record for her with a new surrogate key (SK) and her new address.
SK | Business_ID | Name | Address | Is_Current | Comment |
---|---|---|---|---|---|
101 | CUST-123 | Jane Doe | 123 Main St, Sydney | false | — Retired record |
102 | CUST-123 | Jane Doe | 456 Oak Ave, Melbourne | true | — The new, current record |
Now, we have a perfect history! We can join old sales to her Sydney record and new sales to her Melbourne record.
4. The Great Migration: A Real-World Scenario
The Company: “Legacy Parts Inc.”
Imagine a company that sells spare parts for old industrial machines. Their ERP system is 20 years old. It has no enforced referential integrity.
Their Problems:
- Inaccurate Sales Reports: When a customer company gets acquired and changes its name, all their past sales history is “lost” because the primary name field is overwritten.
- Inventory Nightmares: They can’t reliably link inventory records to supplier records, so they don’t know which suppliers provide the most faulty parts.
- Slow Performance: The ERP grinds to a halt every time the finance team tries to run a quarterly report.
The Solution: A Two-Phase Approach
Phase 1: Stop the Bleeding (Build the Type 1 Replica)
The first step is to get reporting off the main ERP. The data engineering team uses a tool like Pentaho to set up daily jobs that copy the key tables from their legacy ERP into a new PostgreSQL database. This is the “photocopy” warehouse.
- Immediate Win: The ERP is fast again. The business can operate.
- Lingering Problem: The reports are still inaccurate, but at least they don’t crash the system.
Phase 2: Build the Time Machine (Migrate to a True DWH)
Now that the fires are out, the team can think strategically. They decide the best long-term solution is to move to a modern cloud data warehouse. This solves their core problems.
- Fix Referential Integrity: They can’t fix the old ERP, but they can build the rules in their new system. They use a tool like dbt to write transformation logic that says, “Only create a sales record if a valid customer exists in the customer table.” They build the safety nets that the ERP lacked.
- Implement SCD Type 2: They use dbt’s `snapshot` feature or write custom SQL models to build the “time machine” for their customer and supplier data. When “Legacy Parts Inc.” customer “ABC Corp” becomes “Global Mega Corp,” they now have a full history for both names.
- Leverage the Cloud: By using a cloud platform like Snowflake or BigQuery, they get virtually unlimited storage for history and can pay for computing power only when they need it to run their transformations.
The Best Approach: The cloud-native ELT (Extract, Load, Transform) approach is the modern gold standard. It allows companies like “Legacy Parts Inc.” to bypass the limitations of their old systems and build a reliable, scalable, and historically accurate data asset without a massive upfront investment.
5. Your Interview Cheat Sheet: The FAQ
1. In simple terms, what’s the difference between a Type 1 and Type 2 DWH?
A Type 1 DWH is like a photograph; it only shows the most recent picture of the data and overwrites the old one. A Type 2 DWH is like a photo album; it keeps every picture, allowing you to see how the data has changed over time.
2. Why would a company have an old database without “referential integrity”?
Usually for two reasons: speed and practicality. Enforcing those rules can slow down a high-speed system. Also, sometimes they had to load data from other messy systems and chose to accept the data rather than have the load fail. It was a business decision to prioritize operations over perfect data cleanliness.
3. When is it okay to use Type 1 (overwriting)?
It’s okay for data where history doesn’t matter. The most common example is fixing a simple typo. If a customer’s name was misspelled “Jnae,” you can just correct it to “Jane.” You don’t need to track that a typo once existed.
4. How has the cloud changed data warehousing?
It has made it cheaper and faster. We can now store massive amounts of historical data without breaking the bank. More importantly, we use an “ELT” approach: we load raw data into a powerful cloud warehouse like Snowflake first, and then use its massive computing power to transform the data using tools like dbt. This is more scalable and flexible than the old way.
The Data Warehouse Evolution: A Visual Journey
%% Compact, Vertically-Aligned Flowchart of Data Warehouse Evolution graph TD %% Phase 1: Legacy System subgraph "START HERE: Legacy System" LegacyERP["🗃️ Legacy ERP DB"]:::problem LegacyERP -->|Nightly Job| PentahoETL["⚙️ Pentaho ETL"]:::tool PentahoETL -->|"Creates a\n'Photocopy'"| PG_Type1["📦 PG DWH\n(Type 1)"]:::storage end subgraph "Problems with Type 1 DWH" PG_Type1 --> P1["✅ Quick Setup"]:::solution PG_Type1 --> P2["❌ No History"]:::problem PG_Type1 --> P3["❌ Unreliable Joins"]:::problem end %% Transition P2 --> Goal{"🎯 Need a 'Time Machine'\n(Type 2 History)"}:::goal %% Method 1 subgraph "METHOD A: On-Prem (Legacy)" Goal -->|Path A| SCD2Job["⚙️ Pentaho SCD2"]:::tool SCD2Job --> PG_Type2["📦 PG DWH\n(Type 2)"]:::storage PG_Type2 --> C1["🐌 Slow & Brittle"]:::problem PG_Type2 --> C2["📦 Hard to Maintain"]:::problem end %% Method 2 subgraph "METHOD B: Cloud Modern (✅)" Goal -->|Path B| LoadRaw["1️⃣ Load Raw\nData to Cloud"]:::tool LoadRaw --> Snowflake["☁️ Cloud DWH\n(Snowflake / BQ)"]:::storage Snowflake -->|2️⃣ Transform| dbtSQL["⚙️ dbt + SQL"]:::tool dbtSQL -->|"Builds Clean\nHistorical Tables"| Snowflake Snowflake --> B1["🚀 Fast & Scalable"]:::solution Snowflake --> B2["✅ Easy Testing & Versioning"]:::solution end %% Styling classDef problem fill:#ffdddd,stroke:#c0392b,stroke-width:2px,color:#c0392b,font-size:14px classDef solution fill:#d1f7e3,stroke:#2ecc71,stroke-width:2px,color:#145a32,font-size:14px classDef tool fill:#e3f2fd,stroke:#2980b9,stroke-width:2px,color:#0b3c5d,font-size:14px classDef storage fill:#f0f0f0,stroke:#555,stroke-width:2px,font-size:14px classDef goal fill:#fff9c4,stroke:#f39c12,stroke-width:2px,color:#b9770e,font-size:14px
Discover more from Data Engineer Journey
Subscribe to get the latest posts sent to your email.