A complete guide to detecting, removing, and preventing duplicate data with real-world examples, query analysis, and tricky interview questions.
The Scenario: A Problematic `users` Table
Imagine a users
table containing duplicate entries based on id
and email
. Crucially, it also includes NULL
values, which are a common source of errors in duplication logic.
-- Meticulously aligned for clarity
id | email | name | updated_at
------+-----------------+---------+------------
1 | a@example.com | Alice | 2023-01-01
2 | b@example.com | Bob | 2023-01-03
1 | a@example.com | Alicia | 2023-01-04
3 | c@example.com | Charlie | 2023-01-01
NULL | x@example.com | Xavier | 2023-01-01
NULL | x@example.com | Xav | 2023-01-02
The Flawed Approach: Why `GROUP BY` and `IN` Fails
A common first attempt is to find duplicate keys with GROUP BY
and then select them with an IN
clause. This approach has a critical flaw.
SELECT * FROM users WHERE (id, email) IN (
SELECT id, email FROM users
GROUP BY id, email
HAVING COUNT(*) > 1
);
-- Notice the rows with id = NULL are missing!
id | email | name | updated_at
------+---------------+--------+------------
1 | a@example.com | Alice | 2023-01-01
1 | a@example.com | Alicia | 2023-01-04
- Simple and intuitive for non-nullable columns.
- Critically fails when key columns contain
NULL
, asNULL = NULL
evaluates to unknown (false) in standard SQL comparisons.
The Correct Fix: Find Duplicates with `IS NOT DISTINCT FROM`
To correctly handle NULL
values, use PostgreSQL’s NULL-safe comparison operator: IS NOT DISTINCT FROM
. This treats two NULL
s as equal.
WITH duplicate_keys AS (
SELECT id, email FROM users
GROUP BY id, email
HAVING COUNT(*) > 1
)
SELECT u.* FROM users u
JOIN duplicate_keys d
ON u.id IS NOT DISTINCT FROM d.id
AND u.email IS NOT DISTINCT FROM d.email;
-- Success! All duplicate rows are correctly identified.
id | email | name | updated_at
------+---------------+--------+------------
1 | a@example.com | Alice | 2023-01-01
1 | a@example.com | Alicia | 2023-01-04
NULL | x@example.com | Xavier | 2023-01-01
NULL | x@example.com | Xav | 2023-01-02
Strategy 1: Keep the Latest Record with `ROW_NUMBER()`
This is the most flexible strategy for de-duplication. Use the ROW_NUMBER()
window function to rank rows within each duplicate group and select only the one you want to keep.
WITH ranked_users AS (
SELECT *, ROW_NUMBER() OVER (
PARTITION BY id, email ORDER BY updated_at DESC
) AS rn
FROM users
)
SELECT id, email, name, updated_at FROM ranked_users WHERE rn = 1;
-- A clean, de-duplicated set of data.
id | email | name | updated_at
------+---------------+---------+------------
1 | a@example.com | Alicia | 2023-01-04
2 | b@example.com | Bob | 2023-01-03
3 | c@example.com | Charlie | 2023-01-01
NULL | x@example.com | Xav | 2023-01-02
- Extremely flexible: you can change the
ORDER BY
clause to define the “best” record based on any criteria. - Non-destructive: This
SELECT
statement doesn’t alter the original table.
- This only *selects* the clean data. To actually remove unwanted rows, you must write a separate
DELETE
statement.
Strategy 2: Prevent Duplicates with `ON CONFLICT`
The best long-term solution is to prevent duplicates from ever being inserted. This is achieved by adding a UNIQUE
constraint and then using the PostgreSQL-specific ON CONFLICT
clause.
-- Step 1: Enforce uniqueness at the database level.
ALTER TABLE users ADD CONSTRAINT users_id_email_key UNIQUE (id, email);
-- Step 2: Safely ignore any new rows that violate the constraint.
INSERT INTO users (...) VALUES (...)
ON CONFLICT (id, email) DO NOTHING;
-- Or, update the conflicting row (an "upsert").
INSERT INTO users (...) VALUES (...)
ON CONFLICT (id, email)
DO UPDATE SET name = EXCLUDED.name, updated_at = EXCLUDED.updated_at;
- The most robust solution. Data integrity is enforced by the database itself.
- Concise, efficient, and highly readable for common use cases.
- PostgreSQL-specific syntax; not portable to other databases.
- Requires a
UNIQUE
constraint, which you must create first.
Strategy 3: The Standard SQL `MERGE` Statement
For more complex logic and better cross-database compatibility, the standard SQL MERGE
statement (available in Postgres 15+) is the ultimate tool. It can perform inserts, updates, and even deletes in one atomic operation.
-- Assumes a staging_users table with new/updated data.
MERGE INTO users AS target
USING staging_users AS source
ON target.id IS NOT DISTINCT FROM source.id
AND target.email IS NOT DISTINCT FROM source.email
-- If a user exists and the new record is more recent, update it.
WHEN MATCHED AND source.updated_at > target.updated_at THEN
UPDATE SET name = source.name, updated_at = source.updated_at
-- If the user does not exist, insert it.
WHEN NOT MATCHED THEN
INSERT (id, email, name, updated_at)
VALUES (source.id, source.email, source.name, source.updated_at);
- SQL Standard: This syntax is highly portable across modern databases (SQL Server, Oracle, etc.).
- Extremely powerful: Supports multiple conditional
WHEN MATCHED
clauses and can also performDELETE
actions.
- More verbose and complex than
ON CONFLICT
for simple upserts. - Only available in PostgreSQL 15 and newer.
Tricky Interview Questions
A: A direct `DELETE` with a subquery can cause locking issues. A better, more scalable approach is:
1. Create a new, clean table: `CREATE TABLE users_clean AS SELECT … FROM ranked_users WHERE rn = 1;`
2. Add all necessary indexes and constraints to `users_clean`.
3. In a maintenance window, swap the tables: `ALTER TABLE users RENAME TO users_old; ALTER TABLE users_clean RENAME TO users;`
4. Drop `users_old` after verifying success. This minimizes downtime.
A: By default, PostgreSQL treats `NULL`s as the highest possible value. So, `ORDER BY updated_at DESC` will place `NULL` values first. If you want to treat them as the oldest, you must be explicit: `ORDER BY updated_at DESC NULLS LAST`.
Frequently Asked Questions (FAQ)
A: `EXCLUDED` is a special pseudo-table that represents the row that *failed* to be inserted due to the conflict. It allows you to access the values from the new, rejected row and use them in your `UPDATE` logic.
A: Use `ON CONFLICT` for simple, common “insert or update” / “insert or ignore” patterns in PostgreSQL. Use `MERGE` when you need SQL-standard portability or have more complex conditional logic, like updating only under certain conditions or deleting matched rows.
Final Takeaways
- Always use
IS NOT DISTINCT FROM
for finding duplicates whenNULL
s are possible. - Use
ROW_NUMBER()
for flexible, read-only de-duplication and analysis. - Use
ON CONFLICT
for efficient, PostgreSQL-native upserts. - Use
MERGE
for powerful, SQL-standard conditional logic.
Clean data isn’t a luxury—it’s the foundation of a reliable system. Build with integrity.
SQL Feature Compatibility Cheat Sheet
While the strategies are conceptually similar, the exact syntax varies significantly across major database systems. Here’s a comparative cheat sheet.
Feature | PostgreSQL | MySQL / MariaDB | SQL Server | Oracle | Snowflake |
---|---|---|---|---|---|
NULL-Safe Equals |
Supported
IS NOT DISTINCT FROM
|
Supported
<=>
|
Workaround
INTERSECT
|
Workaround
LNNVL(a = b)
|
Supported
EQUAL_NULL(a, b)
|
Window Functions |
Supported
ROW_NUMBER() |
Supported (v8.0+)
ROW_NUMBER() |
Supported
ROW_NUMBER() |
Supported
ROW_NUMBER() |
Supported
ROW_NUMBER() |
Atomic Upsert |
Supported
ON CONFLICT |
Supported
ON DUPLICATE KEY UPDATE |
Use MERGE
MERGE |
Use MERGE
MERGE |
Use MERGE
MERGE |
Merge Statement |
Supported (v15+)
MERGE |
Supported (MariaDB)
MERGE |
Supported
MERGE |
Supported
MERGE |
Supported
MERGE |
Unique Constraint (Multiple NULLs) |
Supported
UNIQUE |
Supported
UNIQUE |
Allows one NULL only
UNIQUE |
Supported
UNIQUE |
Supported
UNIQUE |
Discover more from Data Engineer Journey
Subscribe to get the latest posts sent to your email.