This article summarizes the core arguments from Modern Data 101's piece, “Data Products: A Case Against Medallion Architecture.” The original article contrasts the traditional Medallion Architecture (Bronze, Silver, Gold) for data lakes with a Data Product approach. It argues that Medallion, while intended to simplify data management, often leads to bottlenecks and quality issues. The authors advocate for Data Products and a “Lakehouse with usable data instead of ALL Data.” Here’s a streamlined breakdown:
Visual Comparison: Pull vs Push Mechanisms
Image illustrating the key differences in data flow between the Medallion (Pull) and Data Product (Push) architectures.
Key Takeaways
- Medallion Architecture (Key Issues):
  - Enforces a strict, often unnecessary, pipeline structure.
  - Data Engineers pull ALL source data without specific context.
  - Results in bottlenecks and increased storage/compute costs.
- Data Product Architecture (Key Benefits):
  - Emphasizes a “push” mechanism, driving business context upstream.
  - Enables a lean-pull system, moving only the data needed for specific purposes.
  - Productized data is high-quality and governed from the start.
- Key Arguments Against Medallion:
  - Shifts work and quality responsibility to data consumers.
  - Incurs unnecessary data movement and costs.
  - Lacks business context until late stages.
- Recommendations:
  - Focus on a Model-Driven Data Product approach based on business needs.
  - Prioritize a “Lakehouse with usable data” not ALL data.
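
To make the “pull ALL data” versus “lean-pull” contrast concrete, here is a minimal Python sketch. It is an illustration only, not something taken from the original article: `DataProductSpec`, `pull_all`, `lean_pull`, and the sample order records are all hypothetical names and data. The assumption is that a data product declares up front which columns and rows its use case needs, so only that slice is moved.

```python
from dataclasses import dataclass
from typing import Callable

Record = dict[str, object]


@dataclass(frozen=True)
class DataProductSpec:
    """Hypothetical spec: the business use case declares what it needs."""
    name: str
    columns: tuple[str, ...]
    row_filter: Callable[[Record], bool]


def pull_all(source: list[Record]) -> list[Record]:
    """Bronze-style ingestion as the summary describes it: copy everything."""
    return [dict(row) for row in source]


def lean_pull(source: list[Record], spec: DataProductSpec) -> list[Record]:
    """Move only the rows and columns the spec asks for."""
    return [
        {col: row[col] for col in spec.columns}
        for row in source
        if spec.row_filter(row)
    ]


# Made-up source records, including a column no consumer has asked for.
source = [
    {"order_id": 1, "region": "EU", "amount": 120.0, "debug_blob": "..."},
    {"order_id": 2, "region": "US", "amount": 80.0, "debug_blob": "..."},
    {"order_id": 3, "region": "EU", "amount": 45.5, "debug_blob": "..."},
]

eu_revenue = DataProductSpec(
    name="eu_revenue",
    columns=("order_id", "amount"),
    row_filter=lambda row: row["region"] == "EU",
)

bronze = pull_all(source)                    # every row, every column
product_rows = lean_pull(source, eu_revenue)  # only what the use case needs
```

In this sketch the business context (the spec) exists before any data moves, which is the “push” idea from the list above; the pull itself stays small because it is driven by that context.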
Medallion Architecture vs. Data Products
Feature | Medallion Architecture | Data Product Architecture |
---|---|---|
Data Flow | Pull Mechanism: Data is pulled through predefined layers (Bronze, Silver, Gold). | Push Mechanism: Business context is pushed to upstream layers, guiding data integration. |
Data Focus | Transformation Stage: Data is categorized based on its refinement level. | Business Need/Use Case: Data is shaped according to specific analytical or operational requirements. |
Data Quality | Incremental Refinement: Quality improvements are applied progressively at each layer. | Embedded Quality: Quality controls and governance are enforced from the outset (Shift-Left). |
Context | Limited upstream context; business understanding is primarily in the Gold layer. | Business context is embedded as early as possible. |
Data Movement | High data movement between layers (ETL processes), leading to redundant copies. | Minimized data movement; purpose-driven data flows based on specific business needs. |
Flexibility | Limited flexibility; predefined pathways and batch-based delivery. | High flexibility; diverse consumption patterns (batch, streaming, APIs). Self-service data consumption. |
Bottlenecks | Creates bottlenecks at each layer due to dependencies and predefined processes. | Eliminates bottlenecks by shifting responsibility leftward and providing consumer-driven data. |
Cost | Higher operational costs due to unnecessary data movement and storage of multiple copies. | Lower operational costs due to reduced data movement, storage, and processing. |
Outcome | Data consumers often have to engineer their own transformations and handle complex quality issues. | Consumers interact with high-quality, reliable data that meets their specific needs. |
Data Governance | Often complicated, lineage tied to pipeline stages rather than business meaning. | Enforces well-defined SLAs, contracts, and ownership of data from the source. |
Key Principle | “Lakehouse with ALL data” | “Lakehouse with usable data” – reducing unnecessary layering with purpose-driven storage and processing |
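
The “Embedded Quality” and “Data Governance” rows can be sketched as a contract that is checked before data is published, rather than cleaned up layer by layer afterwards. The Python below is a hedged illustration under assumptions of my own: `DataContract`, `violations`, `publish`, the field names, and the 24-hour freshness SLA are hypothetical and do not come from the original article or any specific tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract: named owner, required fields, freshness SLA."""
    owner: str
    required_fields: tuple[str, ...]
    max_staleness: timedelta


def violations(record: dict, contract: DataContract, now: datetime) -> list[str]:
    """Return every contract violation found in a single record."""
    problems = []
    for field in contract.required_fields:
        if record.get(field) is None:
            problems.append(f"missing field: {field}")
    updated_at = record.get("updated_at")
    if isinstance(updated_at, datetime) and now - updated_at > contract.max_staleness:
        problems.append("stale record: freshness SLA exceeded")
    return problems


def publish(records: list[dict], contract: DataContract, now: datetime):
    """Shift-left check: only records that satisfy the contract are published."""
    accepted, rejected = [], []
    for record in records:
        issues = violations(record, contract, now)
        if issues:
            rejected.append((record, issues))
        else:
            accepted.append(record)
    return accepted, rejected


# A fixed "now" keeps the example deterministic.
now = datetime(2024, 1, 2, tzinfo=timezone.utc)
contract = DataContract(
    owner="customer-domain-team",
    required_fields=("customer_id", "updated_at"),
    max_staleness=timedelta(hours=24),
)
records = [
    {"customer_id": "c-1", "updated_at": now - timedelta(hours=1)},
    {"customer_id": None, "updated_at": now - timedelta(hours=2)},
    {"customer_id": "c-3", "updated_at": now - timedelta(days=3)},
]

accepted, rejected = publish(records, contract, now)
```

Rejected records carry their reasons with them, so the owning team named in the contract can fix problems at the source instead of leaving them for downstream consumers.
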
This summary provides a clear comparison of the two architectures, highlighting the benefits of adopting a Data Product approach over the traditional Medallion Architecture.