The Data Engineer’s Definitive SQL Interview Playbook

The Data Engineer’s Definitive SQL Interview Playbook (Comprehensive Edition)

This guide is a comprehensive deep dive into the SQL questions that define a modern Data Engineering interview. We’ll move beyond basic syntax to explore performance, data integrity, and complex transformations, focusing on the critical thinking that separates a great engineer from a good one.

Pillar 1: Data Integrity & Cleansing

Scenario: Handling Duplicate Records

Q: Your pipeline has ingested a `customers` table with potential duplicates. Outline your systematic approach to identify, inspect, and resolve this data quality issue.

My approach is a three-step process: identify, inspect, and then execute a safe cleaning strategy.

Step 1: How do you identify which emails are duplicated?

Use `GROUP BY` and `HAVING` to find keys that appear more than once.

Input `customers` table:

idnameemailsigned_up_date
1John Smithjohn.s@email.com2023-01-10
2Jane Doejane.d@email.com2023-02-15
3Johnathan Smithjohn.s@email.com2023-03-20
SELECT email, COUNT(*) AS occurrences FROM customers GROUP BY email HAVING COUNT(*) > 1;

Output: `john.s@email.com` with 2 occurrences.

Step 2: How do you inspect the full details of these duplicate rows?

Use a subquery or CTE to filter the main table for the problematic emails.

SELECT * FROM customers WHERE email IN (
    SELECT email FROM customers GROUP BY email HAVING COUNT(*) > 1
);

Output: Rows with `id` 1 and 3.

Step 3: How do you delete the duplicates but keep the earliest signed-up record?

The safest and most flexible method is using a CTE with the `ROW_NUMBER()` window function.

WITH RankedCustomers AS (
    SELECT id, ROW_NUMBER() OVER(PARTITION BY email ORDER BY signed_up_date ASC) as rn
    FROM customers
)
DELETE FROM customers WHERE id IN (SELECT id FROM RankedCustomers WHERE rn > 1);

Why It Matters for a Data Engineer

Duplicate data breaks downstream processes that rely on unique keys, corrupts analytics, and violates data quality contracts. A DE must design idempotent pipelines that can either handle duplicates gracefully or clean them according to defined business rules. This demonstrates an understanding of robust pipeline design.

The ETL Connection (Talend)

In Talend, the `tUniqRow` component is designed for this. It splits a data flow into “uniques” and “duplicates”. For cleaning, you’d direct the “uniques” output (which keeps the first record for each key) to your target table and log or discard the “duplicates” flow.

Pillar 2: Analytical Queries with Window Functions

Scenario: Ranking, Segmentation, and Running Calculations

Q: Explain the ranking functions. Then, solve two problems: finding the “Top N per group” and calculating a running total.

Window functions perform calculations across a set of related rows. The ranking functions (`ROW_NUMBER`, `RANK`, `DENSE_RANK`) are a core part of this family.

Problem 1: Find the top 2 highest-paid employees in each department.

Input `employees` table:

namedepartmentsalary
AliceEngineering120000
BobEngineering110000
CharlieEngineering110000
DavidSales95000

`DENSE_RANK()` is the right choice to include all employees who tie for a position.

WITH RankedSalaries AS (
    SELECT name, department, salary, DENSE_RANK() OVER(PARTITION BY department ORDER BY salary DESC) as rnk
    FROM employees
)
SELECT name, department, salary FROM RankedSalaries WHERE rnk <= 2;

Output: Alice, Bob, and Charlie from Engineering, and David from Sales.

Problem 2: Calculate a monthly running total of revenue.

Input `monthly_revenue` table:

monthrevenue
2023-01-0150000
2023-02-0180000
2023-03-0160000
SELECT month, revenue, SUM(revenue) OVER (ORDER BY month) as running_total_revenue FROM monthly_revenue;

Output:

monthrevenuerunning_total_revenue
2023-01-015000050000
2023-02-0180000130000
2023-03-0160000190000

Interviewer's Deep Dive

Pillar 3: Structuring & Joining Data

Scenario 1: Joining, Combining, and Finding Mismatches

Q: Explain the difference between `INNER JOIN`, `LEFT JOIN`, `UNION`, and `UNION ALL`. Provide a use case for each.

`JOIN`s combine columns from different tables horizontally, while `UNION`/`UNION ALL` stack rows from different tables vertically.

`INNER JOIN` vs. `LEFT JOIN`

An `INNER JOIN` returns only the intersection of two tables. A `LEFT JOIN` returns all records from the left table, and matched records from the right.

Use Case: Use `INNER JOIN` when you only care about records that have a valid relationship in both tables (e.g., `orders` and `customers`). Use `LEFT JOIN` when you need all records from one table, regardless of whether they have a match in the other (e.g., finding all customers, even those who haven't placed an order yet).

`UNION` vs. `UNION ALL`

`UNION ALL` simply appends datasets. `UNION` does the same, but then adds a costly de-duplication step.

Why It Matters for a Data Engineer

Performance. `UNION` can be a huge bottleneck in a pipeline due to its sort/hash de-duplication. Always default to `UNION ALL` unless you have an explicit business need to remove duplicates at that step. It's significantly faster and less resource-intensive.

Interviewer's Deep Dive

Pillar 4: Advanced Transformations & Query Structuring

Scenario: Pivoting, Unpivoting, and Readability

Q: Explain the purpose of a CTE. Then, show how you would pivot data from a row-based format to a columnar report.

CTEs vs. Subqueries

A Common Table Expression (`WITH` clause) creates a temporary, named result set. It is superior to nested subqueries because it makes complex queries highly readable and modular. You can also reference a CTE multiple times in the same query.

Pivoting Data

Pivoting transforms rows into columns, often for reporting. The standard method is conditional aggregation.

Input `quarterly_sales` table:

productquartersales
LaptopQ150000
LaptopQ260000
MouseQ15000
SELECT
    product,
    SUM(CASE WHEN quarter = 'Q1' THEN sales ELSE 0 END) AS q1_sales,
    SUM(CASE WHEN quarter = 'Q2' THEN sales ELSE 0 END) AS q2_sales
FROM quarterly_sales
GROUP BY product;

Pivoted Output:

productq1_salesq2_sales
Laptop5000060000
Mouse50000

Pillar 5: Advanced Data Modeling & Hierarchies

Scenario: Working with Hierarchical Data

Q: Your `employees` table has an `id` and a `manager_id`, representing a hierarchy. How would you write a query to show the entire reporting chain for a specific employee?

This requires a **Recursive CTE**, which is designed specifically for traversing hierarchical or graph-like data structures.

Input `employees` table:

idnamemanager_id
1CEONULL
2VP of Eng1
3Director2
4Engineer3

Goal: Find the reporting chain for the 'Engineer' (id = 4).

WITH RECURSIVE EmployeeHierarchy AS (
    -- Anchor Member: Start with the employee in question
    SELECT id, name, manager_id, 0 AS level
    FROM employees
    WHERE id = 4

    UNION ALL

    -- Recursive Member: Join back to the employees table to find the manager
    SELECT e.id, e.name, e.manager_id, eh.level + 1
    FROM employees e
    JOIN EmployeeHierarchy eh ON e.id = eh.manager_id
)
SELECT * FROM EmployeeHierarchy;

Output:

idnamemanager_idlevel
4Engineer30
3Director21
2VP of Eng12
1CEONULL3

Why It Matters for a Data Engineer

Many real-world datasets are hierarchical (product categories, organizational charts, geographic regions). Knowing how to write a recursive CTE demonstrates you can handle complex data structures beyond simple flat tables. It's a powerful tool that many SQL users are not familiar with.

Pillar 6: Transactions & Data Definition Language (DDL)

Scenario: Ensuring Data Integrity During Writes

Q: Explain the difference between `DELETE`, `TRUNCATE`, and `DROP`. When and why would you use each?

These commands all remove data, but they operate at different levels and have huge implications for performance and recoverability.

  • `DELETE`: A DML (Data Manipulation Language) command. It removes rows one by one and logs each deletion.
    • When to use: When you need to remove a subset of rows based on a `WHERE` clause.
    • Why: It's slow for large tables but safe, as it's part of a transaction and can be rolled back. It also fires triggers.
  • `TRUNCATE`: A DDL (Data Definition Language) command. It deallocates the data pages of the table, instantly removing all rows.
    • When to use: To quickly delete all rows from a large staging table.
    • Why: It's extremely fast and uses minimal system resources. However, it's typically not logged in the same way, so it cannot be easily rolled back and does not fire `DELETE` triggers.
  • `DROP`: A DDL command. It removes the entire table structure and all its data permanently.
    • When to use: When you are completely finished with a table and want to remove it from the database.
    • Why: It's the most destructive command. There's no coming back from a `DROP TABLE` without a backup.

Interviewer's Deep Dive

Pillar 7: Performance & Optimization

Scenario: Diagnosing a Slow Query

Q: A critical query in your pipeline is running slowly. What is your systematic approach to diagnosing and fixing it?

  1. Analyze the Execution Plan: My first action is to use `EXPLAIN` to ask the database for its execution plan. This reveals if it's using a "Full Table Scan" on a large table, which is a major red flag.
  2. Verify Indexes: The execution plan will guide me, but I will confirm that columns in `WHERE` clauses, `JOIN` conditions, and `ORDER BY` clauses are properly indexed. This is the most common cause of poor performance.
  3. Check for Non-Sargable Queries: I'll look for functions applied to columns in the `WHERE` clause, like `WHERE YEAR(order_date) = 2023`. This prevents the database from using an index. The fix is to rewrite it as `WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'`.

Why It Matters for a Data Engineer

A Data Engineer is responsible for building pipelines that are correct, efficient, and cost-effective. A slow query can delay data availability, miss SLAs, and drive up cloud computing costs. The ability to systematically diagnose and tune query performance is a fundamental and highly valued skill.


Discover more from Data Engineer Journey

Subscribe to get the latest posts sent to your email.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top

Discover more from Data Engineer Journey

Subscribe now to keep reading and get access to the full archive.

Continue reading