The Definitive Guide to Handling Duplicates in PostgreSQL

The Definitive Guide to Handling Duplicates in PostgreSQL

A complete guide to detecting, removing, and preventing duplicate data with real-world examples, query analysis, and tricky interview questions.

The Scenario: A Problematic `users` Table

Imagine a users table containing duplicate entries based on id and email. Crucially, it also includes NULL values, which are a common source of errors in duplication logic.

Initial Data
-- Meticulously aligned for clarity
 id   | email           | name    | updated_at
------+-----------------+---------+------------
 1    | a@example.com   | Alice   | 2023-01-01
 2    | b@example.com   | Bob     | 2023-01-03
 1    | a@example.com   | Alicia  | 2023-01-04
 3    | c@example.com   | Charlie | 2023-01-01
 NULL | x@example.com   | Xavier  | 2023-01-01
 NULL | x@example.com   | Xav     | 2023-01-02

The Flawed Approach: Why `GROUP BY` and `IN` Fails

A common first attempt is to find duplicate keys with GROUP BY and then select them with an IN clause. This approach has a critical flaw.

SELECT * FROM users WHERE (id, email) IN (
  SELECT id, email FROM users
  GROUP BY id, email
  HAVING COUNT(*) > 1
);
Output of this Query
-- Notice the rows with id = NULL are missing!
 id   | email         | name   | updated_at
------+---------------+--------+------------
 1    | a@example.com | Alice  | 2023-01-01
 1    | a@example.com | Alicia | 2023-01-04
Analysis
  • Simple and intuitive for non-nullable columns.
  • Critically fails when key columns contain NULL, as NULL = NULL evaluates to unknown (false) in standard SQL comparisons.

The Correct Fix: Find Duplicates with `IS NOT DISTINCT FROM`

To correctly handle NULL values, use PostgreSQL’s NULL-safe comparison operator: IS NOT DISTINCT FROM. This treats two NULLs as equal.

WITH duplicate_keys AS (
  SELECT id, email FROM users
  GROUP BY id, email
  HAVING COUNT(*) > 1
)
SELECT u.* FROM users u
JOIN duplicate_keys d
  ON u.id IS NOT DISTINCT FROM d.id
 AND u.email IS NOT DISTINCT FROM d.email;
Output of this Query
-- Success! All duplicate rows are correctly identified.
 id   | email         | name   | updated_at
------+---------------+--------+------------
 1    | a@example.com | Alice  | 2023-01-01
 1    | a@example.com | Alicia | 2023-01-04
 NULL | x@example.com | Xavier | 2023-01-01
 NULL | x@example.com | Xav    | 2023-01-02

Strategy 1: Keep the Latest Record with `ROW_NUMBER()`

This is the most flexible strategy for de-duplication. Use the ROW_NUMBER() window function to rank rows within each duplicate group and select only the one you want to keep.

WITH ranked_users AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY id, email ORDER BY updated_at DESC
  ) AS rn
  FROM users
)
SELECT id, email, name, updated_at FROM ranked_users WHERE rn = 1;
Output of this Query
-- A clean, de-duplicated set of data.
 id   | email         | name    | updated_at
------+---------------+---------+------------
 1    | a@example.com | Alicia  | 2023-01-04
 2    | b@example.com | Bob     | 2023-01-03
 3    | c@example.com | Charlie | 2023-01-01
 NULL | x@example.com | Xav     | 2023-01-02
Analysis
  • Extremely flexible: you can change the ORDER BY clause to define the “best” record based on any criteria.
  • Non-destructive: This SELECT statement doesn’t alter the original table.
  • This only *selects* the clean data. To actually remove unwanted rows, you must write a separate DELETE statement.

Strategy 2: Prevent Duplicates with `ON CONFLICT`

The best long-term solution is to prevent duplicates from ever being inserted. This is achieved by adding a UNIQUE constraint and then using the PostgreSQL-specific ON CONFLICT clause.

-- Step 1: Enforce uniqueness at the database level.
ALTER TABLE users ADD CONSTRAINT users_id_email_key UNIQUE (id, email);

-- Step 2: Safely ignore any new rows that violate the constraint.
INSERT INTO users (...) VALUES (...)
ON CONFLICT (id, email) DO NOTHING;

-- Or, update the conflicting row (an "upsert").
INSERT INTO users (...) VALUES (...)
ON CONFLICT (id, email)
DO UPDATE SET name = EXCLUDED.name, updated_at = EXCLUDED.updated_at;
Analysis
  • The most robust solution. Data integrity is enforced by the database itself.
  • Concise, efficient, and highly readable for common use cases.
  • PostgreSQL-specific syntax; not portable to other databases.
  • Requires a UNIQUE constraint, which you must create first.

Strategy 3: The Standard SQL `MERGE` Statement

For more complex logic and better cross-database compatibility, the standard SQL MERGE statement (available in Postgres 15+) is the ultimate tool. It can perform inserts, updates, and even deletes in one atomic operation.

-- Assumes a staging_users table with new/updated data.
MERGE INTO users AS target
USING staging_users AS source
  ON target.id IS NOT DISTINCT FROM source.id
 AND target.email IS NOT DISTINCT FROM source.email
-- If a user exists and the new record is more recent, update it.
WHEN MATCHED AND source.updated_at > target.updated_at THEN
  UPDATE SET name = source.name, updated_at = source.updated_at
-- If the user does not exist, insert it.
WHEN NOT MATCHED THEN
  INSERT (id, email, name, updated_at)
  VALUES (source.id, source.email, source.name, source.updated_at);
Analysis
  • SQL Standard: This syntax is highly portable across modern databases (SQL Server, Oracle, etc.).
  • Extremely powerful: Supports multiple conditional WHEN MATCHED clauses and can also perform DELETE actions.
  • More verbose and complex than ON CONFLICT for simple upserts.
  • Only available in PostgreSQL 15 and newer.

Tricky Interview Questions

Q: You need to delete duplicates from a table with billions of rows. The `ROW_NUMBER()` approach is too slow. What do you do?

A: A direct `DELETE` with a subquery can cause locking issues. A better, more scalable approach is:

1. Create a new, clean table: `CREATE TABLE users_clean AS SELECT … FROM ranked_users WHERE rn = 1;`

2. Add all necessary indexes and constraints to `users_clean`.

3. In a maintenance window, swap the tables: `ALTER TABLE users RENAME TO users_old; ALTER TABLE users_clean RENAME TO users;`

4. Drop `users_old` after verifying success. This minimizes downtime.

Q: In the `ROW_NUMBER()` example, how are `NULL` values handled in the `ORDER BY` clause?

A: By default, PostgreSQL treats `NULL`s as the highest possible value. So, `ORDER BY updated_at DESC` will place `NULL` values first. If you want to treat them as the oldest, you must be explicit: `ORDER BY updated_at DESC NULLS LAST`.

Frequently Asked Questions (FAQ)

Q: What exactly is `EXCLUDED` in an `ON CONFLICT` statement?

A: `EXCLUDED` is a special pseudo-table that represents the row that *failed* to be inserted due to the conflict. It allows you to access the values from the new, rejected row and use them in your `UPDATE` logic.

Q: When should I use `ON CONFLICT` vs. `MERGE`?

A: Use `ON CONFLICT` for simple, common “insert or update” / “insert or ignore” patterns in PostgreSQL. Use `MERGE` when you need SQL-standard portability or have more complex conditional logic, like updating only under certain conditions or deleting matched rows.

Final Takeaways

  • Always use IS NOT DISTINCT FROM for finding duplicates when NULLs are possible.
  • Use ROW_NUMBER() for flexible, read-only de-duplication and analysis.
  • Use ON CONFLICT for efficient, PostgreSQL-native upserts.
  • Use MERGE for powerful, SQL-standard conditional logic.

Clean data isn’t a luxury—it’s the foundation of a reliable system. Build with integrity.

Brand-Aware SQL Feature Compatibility Cheat Sheet

SQL Feature Compatibility Cheat Sheet

While the strategies are conceptually similar, the exact syntax varies significantly across major database systems. Here’s a comparative cheat sheet.

Feature PostgreSQL MySQL / MariaDB SQL Server Oracle Snowflake
NULL-Safe Equals
Supported IS NOT DISTINCT FROM
Supported <=>
Workaround INTERSECT
Workaround LNNVL(a = b)
Supported EQUAL_NULL(a, b)
Window Functions
SupportedROW_NUMBER()
Supported (v8.0+)ROW_NUMBER()
SupportedROW_NUMBER()
SupportedROW_NUMBER()
SupportedROW_NUMBER()
Atomic Upsert
SupportedON CONFLICT
SupportedON DUPLICATE KEY UPDATE
Use MERGEMERGE
Use MERGEMERGE
Use MERGEMERGE
Merge Statement
Supported (v15+)MERGE
Supported (MariaDB)MERGE
SupportedMERGE
SupportedMERGE
SupportedMERGE
Unique Constraint (Multiple NULLs)
SupportedUNIQUE
SupportedUNIQUE
Allows one NULL onlyUNIQUE
SupportedUNIQUE
SupportedUNIQUE

Discover more from Data Engineer Journey

Subscribe to get the latest posts sent to your email.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top

Discover more from Data Engineer Journey

Subscribe now to keep reading and get access to the full archive.

Continue reading