
Duplicate records are one of the most common data quality issues in a data warehouse. They can lead to misleading reports, incorrect KPIs, and poor decision-making. In this guide, we’ll explore how to identify duplicate records, prevent them during ETL, and resolve them effectively using SQL and best practices.
🔍 Identifying Duplicate Records
Duplicate records can arise due to:
- Source systems sending redundant data.
- ETL pipelines failing to validate uniqueness.
- Poorly designed delta or incremental loads.
Common SQL techniques to identify duplicates:
1. GROUP BY with HAVING COUNT > 1
Used to find repeated rows based on selected columns (e.g. business keys):
SELECT col1, col2, COUNT(*)
FROM my_table
GROUP BY col1, col2
HAVING COUNT(*) > 1;
2. ROW_NUMBER() window function
Used to assign row numbers to duplicates and isolate the extras:
SELECT *
FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY business_key ORDER BY load_date DESC) AS rn
FROM my_table
) t
WHERE rn > 1;
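Once the extras are flagged, the same pattern can resolve them. The sketch below keeps the most recent row per business key and deletes the rest; it assumes the same business_key and load_date columns as above and uses the SQL Server style of deleting through a CTE, while other engines need a DELETE ... WHERE ... IN (subquery) variant:
-- Keep the newest row per business_key and delete the older copies
WITH ranked AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY business_key ORDER BY load_date DESC) AS rn
FROM my_table
)
DELETE FROM ranked
WHERE rn > 1;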
3. Hashing (MD5/SHA)
Hash key columns to generate a fingerprint for each row and identify exact matches.
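As a quick sketch, assuming an engine that exposes MD5() and CONCAT_WS() (PostgreSQL and MySQL do), with col1 and col2 as placeholder key columns:
-- Fingerprint each row from its key columns; repeated hashes indicate exact matches
-- (wrap nullable columns in COALESCE so NULLs hash consistently)
SELECT MD5(CONCAT_WS('|', col1, col2)) AS row_hash, COUNT(*) AS dup_count
FROM my_table
GROUP BY MD5(CONCAT_WS('|', col1, col2))
HAVING COUNT(*) > 1;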
🚫 Preventing Duplicate Records
Once you’ve identified potential duplicates, the next step is to prevent them from entering your warehouse. Here are some effective strategies:
1. Schema Design
- Enforce Primary Keys or Unique Constraints where applicable.
- Use surrogate keys for dimension tables when business keys are not reliable, together with load timestamps for traceability.
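A minimal sketch of such a dimension table, using PostgreSQL-style DDL and hypothetical column names (note that some cloud warehouses accept but do not enforce unique constraints, so ETL-level checks are still needed):
-- Surrogate key as the primary key, unique constraint on the business key
CREATE TABLE dim_customer (
    customer_sk   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_bk   VARCHAR(50)  NOT NULL,
    customer_name VARCHAR(200),
    load_date     TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    CONSTRAINT uq_dim_customer_bk UNIQUE (customer_bk)
);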
2. ETL Layer
- Use DISTINCT in SQL or .dropDuplicates() in Spark to eliminate repeated rows.
- In Talend, use tUniqRow to filter out duplicates and tMap to apply business logic.
- Always load into a staging area before inserting into the main warehouse.
- Implement hash columns to identify and reject already-seen records.
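As a sketch of the hash-column idea, assuming a row_hash column is populated in both staging and target (table and column names are illustrative):
-- Load only staging rows whose hash has not been seen in the target
INSERT INTO target_table (business_key, col1, col2, row_hash, load_date)
SELECT s.business_key, s.col1, s.col2, s.row_hash, s.load_date
FROM staging_table s
WHERE NOT EXISTS (
    SELECT 1 FROM target_table t WHERE t.row_hash = s.row_hash
);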
3. Use MERGE or UPSERT instead of INSERT
Rather than inserting blindly, use a merge strategy:
MERGE INTO target_table AS t
USING staging_table AS s
ON t.business_key = s.business_key
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;
4. Audit & Batch Controls
- Add batch identifiers or timestamps (batch_id, load_date) to help track data lineage.
- Design for idempotency – re-running a job should not reinsert or modify existing records incorrectly.
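A minimal sketch of an idempotent batch reload, assuming each run carries a batch_id (shown here as a bind parameter, :batch_id) and that deleting and re-inserting a whole batch is acceptable for the table:
-- Re-running the same batch first removes whatever it loaded previously
DELETE FROM target_table
WHERE batch_id = :batch_id;

INSERT INTO target_table (business_key, col1, col2, batch_id, load_date)
SELECT business_key, col1, col2, :batch_id, CURRENT_TIMESTAMP
FROM staging_table;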
💬 Key Interview Question
Q: How do you identify duplicate records, and how do you prevent or resolve them?
Answer:
I start by identifying duplicates using SQL. I typically use GROUP BY with HAVING COUNT(*) > 1 or the ROW_NUMBER() function to flag extra records within a group based on business keys. For large datasets, I may generate MD5 or SHA hashes to detect exact row matches efficiently.
To prevent duplicates, I load data into a staging area first, where I validate and clean the records. In ETL tools like Talend, I use tUniqRow to remove duplicates and tMap to apply custom rules. For loading into the warehouse, I use MERGE or UPSERT logic to ensure only new or updated records are processed. I also use audit fields like load_date and batch_id for traceability and to manage reprocessing. This approach has helped me maintain clean and reliable data pipelines.
📌 FAQ
Q1: What causes duplicates in a data warehouse?
A: Repeated data loads, missing primary keys, poor delta logic, or unclean source files.
Q2: What is the difference between a business key and a surrogate key?
A: A business key comes from the source system (like order_id), while a surrogate key is generated in the warehouse (like order_sk) and used as a unique identifier.
Q3: How can Talend help prevent duplicates?
A: You can use tUniqRow to filter duplicates and tMap to apply transformation logic or hash generation. Use staging tables to validate before inserting.
Q4: What is idempotency in ETL?
A: Idempotency means you can run the ETL process multiple times without producing duplicate records or corrupting data.
Q5: Why are staging tables important?
A: Staging tables serve as a buffer to validate, deduplicate, and transform incoming data before loading it into the production schema. This minimizes the risk of introducing bad or repeated data.