This guide is a comprehensive deep dive into the SQL questions that define a modern Data Engineering interview. We’ll move beyond basic syntax to explore performance, data integrity, and complex transformations, focusing on the critical thinking that separates a great engineer from a good one.
Pillar 1: Data Integrity & Cleansing
Scenario: Handling Duplicate Records
Q: Your pipeline has ingested a `customers` table with potential duplicates. Outline your systematic approach to identify, inspect, and resolve this data quality issue.
My approach is a three-step process: identify, inspect, and then execute a safe cleaning strategy.
Step 1: How do you identify which emails are duplicated?
Use `GROUP BY` and `HAVING` to find keys that appear more than once.
Input `customers` table:
id | name | email | signed_up_date |
---|---|---|---|
1 | John Smith | john.s@email.com | 2023-01-10 |
2 | Jane Doe | jane.d@email.com | 2023-02-15 |
3 | Johnathan Smith | john.s@email.com | 2023-03-20 |
SELECT email, COUNT(*) AS occurrences FROM customers GROUP BY email HAVING COUNT(*) > 1;
Output: `john.s@email.com` with 2 occurrences.
Step 2: How do you inspect the full details of these duplicate rows?
Use a subquery or CTE to filter the main table for the problematic emails.
SELECT * FROM customers WHERE email IN (
SELECT email FROM customers GROUP BY email HAVING COUNT(*) > 1
);
Output: Rows with `id` 1 and 3.
Step 3: How do you delete the duplicates but keep the earliest signed-up record?
The safest and most flexible method is using a CTE with the `ROW_NUMBER()` window function.
WITH RankedCustomers AS (
SELECT id, ROW_NUMBER() OVER(PARTITION BY email ORDER BY signed_up_date ASC) as rn
FROM customers
)
DELETE FROM customers WHERE id IN (SELECT id FROM RankedCustomers WHERE rn > 1);
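In warehouses where row-level `DELETE`s are slow or unsupported, a common alternative is to rebuild the table rather than delete in place. A minimal sketch, assuming your platform supports `CREATE TABLE AS` (the name `customers_dedup` is illustrative):
CREATE TABLE customers_dedup AS
SELECT id, name, email, signed_up_date
FROM (
    -- Rank each row within its email group, earliest signup first
    SELECT *,
           ROW_NUMBER() OVER(PARTITION BY email ORDER BY signed_up_date ASC) AS rn
    FROM customers
) ranked
WHERE rn = 1;
You would then swap the new table in for the old one (e.g., via renames) where the database allows it.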
Why It Matters for a Data Engineer
Duplicate data breaks downstream processes that rely on unique keys, corrupts analytics, and violates data quality contracts. A DE must design idempotent pipelines that can either handle duplicates gracefully or clean them according to defined business rules. This demonstrates an understanding of robust pipeline design.
The ETL Connection (Talend)
In Talend, the `tUniqRow` component is designed for this. It splits a data flow into “uniques” and “duplicates”. For cleaning, you’d direct the “uniques” output (which keeps the first record for each key) to your target table and log or discard the “duplicates” flow.
Pillar 2: Analytical Queries with Window Functions
Scenario: Ranking, Segmentation, and Running Calculations
Q: Explain the ranking functions. Then, solve two problems: finding the “Top N per group” and calculating a running total.
Window functions perform calculations across a set of related rows. The ranking functions (`ROW_NUMBER`, `RANK`, `DENSE_RANK`) are a core part of this family: `ROW_NUMBER` assigns a unique sequential number even to tied rows, `RANK` gives ties the same number but leaves gaps afterwards, and `DENSE_RANK` gives ties the same number with no gaps.
Problem 1: Find the top 2 highest-paid employees in each department.
Input `employees` table:
name | department | salary |
---|---|---|
Alice | Engineering | 120000 |
Bob | Engineering | 110000 |
Charlie | Engineering | 110000 |
David | Sales | 95000 |
`DENSE_RANK()` is the right choice to include all employees who tie for a position.
WITH RankedSalaries AS (
SELECT name, department, salary, DENSE_RANK() OVER(PARTITION BY department ORDER BY salary DESC) as rnk
FROM employees
)
SELECT name, department, salary FROM RankedSalaries WHERE rnk <= 2;
Output: Alice, Bob, and Charlie from Engineering, and David from Sales.
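To make the difference concrete, here is a sketch running all three ranking functions over the same data. With Bob and Charlie tied at 110000, `ROW_NUMBER` arbitrarily orders them 2 and 3, `RANK` gives both 2 and skips to 4, and `DENSE_RANK` gives both 2 with the next rank being 3:
SELECT name, department, salary,
       ROW_NUMBER() OVER(PARTITION BY department ORDER BY salary DESC) AS row_num,
       RANK()       OVER(PARTITION BY department ORDER BY salary DESC) AS rnk,
       DENSE_RANK() OVER(PARTITION BY department ORDER BY salary DESC) AS dense_rnk
FROM employees;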
Problem 2: Calculate a monthly running total of revenue.
Input `monthly_revenue` table:
month | revenue |
---|---|
2023-01-01 | 50000 |
2023-02-01 | 80000 |
2023-03-01 | 60000 |
SELECT month, revenue, SUM(revenue) OVER (ORDER BY month) as running_total_revenue FROM monthly_revenue;
Output:
month | revenue | running_total_revenue |
---|---|---|
2023-01-01 | 50000 | 50000 |
2023-02-01 | 80000 | 130000 |
2023-03-01 | 60000 | 190000 |
Interviewer's Deep Dive
A: "I would add a `PARTITION BY product_line` clause to the `OVER()` statement. This resets the calculation for each new product line."
SUM(revenue) OVER (PARTITION BY product_line ORDER BY month)
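As a full query, assuming `monthly_revenue` also carried a `product_line` column (not shown in the sample table above):
SELECT product_line, month, revenue,
       SUM(revenue) OVER (PARTITION BY product_line ORDER BY month) AS running_total_revenue
FROM monthly_revenue;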
Pillar 3: Structuring & Joining Data
Scenario 1: Joining, Combining, and Finding Mismatches
Q: Explain the difference between `INNER JOIN`, `LEFT JOIN`, `UNION`, and `UNION ALL`. Provide a use case for each.
`JOIN`s combine columns from different tables horizontally, while `UNION`/`UNION ALL` stack rows from different tables vertically.
`INNER JOIN` vs. `LEFT JOIN`
An `INNER JOIN` returns only the rows with matching keys in both tables (the intersection). A `LEFT JOIN` returns every record from the left table plus matched records from the right; unmatched left rows carry `NULL`s in the right table's columns.
Use Case: Use `INNER JOIN` when you only care about records that have a valid relationship in both tables (e.g., `orders` and `customers`). Use `LEFT JOIN` when you need all records from one table, regardless of whether they have a match in the other (e.g., finding all customers, even those who haven't placed an order yet).
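A sketch of the second use case, assuming an `orders` table with an `id` and a `customer_id` foreign key: customers with no orders still appear, with an order count of zero.
SELECT c.id, c.name, COUNT(o.id) AS order_count
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.id, c.name;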
`UNION` vs. `UNION ALL`
`UNION ALL` simply appends datasets. `UNION` does the same, but then adds a costly de-duplication step.
Why It Matters for a Data Engineer
Performance. `UNION` can be a huge bottleneck in a pipeline due to its sort/hash de-duplication. Always default to `UNION ALL` unless you have an explicit business need to remove duplicates at that step. It's significantly faster and less resource-intensive.
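For example, stacking two identically-shaped staging extracts (table names are illustrative):
SELECT order_id, amount FROM staging_orders_2023
UNION ALL
SELECT order_id, amount FROM staging_orders_2024;
Swapping `UNION ALL` for `UNION` here would force a de-duplication sort/hash over the entire combined set.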
Interviewer's Deep Dive
A: "I would use a `LEFT JOIN ... WHERE key IS NULL` or `NOT EXISTS`. I avoid `NOT IN` because it fails unpredictably if the subquery contains any `NULL` values. `NOT EXISTS` is often the most performant as it can short-circuit, but `LEFT JOIN` is sometimes more readable."
Pillar 4: Advanced Transformations & Query Structuring
Scenario: Pivoting, Unpivoting, and Readability
Q: Explain the purpose of a CTE. Then, show how you would pivot data from a row-based format to a columnar report.
CTEs vs. Subqueries
A Common Table Expression (`WITH` clause) creates a temporary, named result set. It is superior to nested subqueries because it makes complex queries highly readable and modular. You can also reference a CTE multiple times in the same query.
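A small sketch of the readability win, reusing the `employees` table from Pillar 2: the aggregation gets a descriptive name, and the final query reads like a sentence.
WITH dept_avg AS (
    -- One clearly-named building block: average salary per department
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
)
SELECT e.name, e.department, e.salary, d.avg_salary
FROM employees e
JOIN dept_avg d ON d.department = e.department
WHERE e.salary > d.avg_salary;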
Pivoting Data
Pivoting transforms rows into columns, often for reporting. The standard method is conditional aggregation.
Input `quarterly_sales` table:
product | quarter | sales |
---|---|---|
Laptop | Q1 | 50000 |
Laptop | Q2 | 60000 |
Mouse | Q1 | 5000 |
SELECT
product,
SUM(CASE WHEN quarter = 'Q1' THEN sales ELSE 0 END) AS q1_sales,
SUM(CASE WHEN quarter = 'Q2' THEN sales ELSE 0 END) AS q2_sales
FROM quarterly_sales
GROUP BY product;
Pivoted Output:
product | q1_sales | q2_sales |
---|---|---|
Laptop | 50000 | 60000 |
Mouse | 5000 | 0 |
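The reverse operation, unpivoting, turns columns back into rows. The portable method is one `SELECT` per column glued together with `UNION ALL`; here it is run against the pivoted output above, assuming it was saved as a table named `pivoted_sales` (an illustrative name):
SELECT product, 'Q1' AS quarter, q1_sales AS sales FROM pivoted_sales
UNION ALL
SELECT product, 'Q2' AS quarter, q2_sales AS sales FROM pivoted_sales;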
Pillar 5: Advanced Data Modeling & Hierarchies
Scenario: Working with Hierarchical Data
Q: Your `employees` table has an `id` and a `manager_id`, representing a hierarchy. How would you write a query to show the entire reporting chain for a specific employee?
This requires a **Recursive CTE**, which is designed specifically for traversing hierarchical or graph-like data structures. (PostgreSQL and MySQL require the `WITH RECURSIVE` keyword shown below; SQL Server and Oracle use plain `WITH`.)
Input `employees` table:
id | name | manager_id |
---|---|---|
1 | CEO | NULL |
2 | VP of Eng | 1 |
3 | Director | 2 |
4 | Engineer | 3 |
Goal: Find the reporting chain for the 'Engineer' (id = 4).
WITH RECURSIVE EmployeeHierarchy AS (
-- Anchor Member: Start with the employee in question
SELECT id, name, manager_id, 0 AS level
FROM employees
WHERE id = 4
UNION ALL
-- Recursive Member: Join back to the employees table to find the manager
SELECT e.id, e.name, e.manager_id, eh.level + 1
FROM employees e
JOIN EmployeeHierarchy eh ON e.id = eh.manager_id
)
SELECT * FROM EmployeeHierarchy;
Output:
id | name | manager_id | level |
---|---|---|---|
4 | Engineer | 3 | 0 |
3 | Director | 2 | 1 |
2 | VP of Eng | 1 | 2 |
1 | CEO | NULL | 3 |
Why It Matters for a Data Engineer
Many real-world datasets are hierarchical (product categories, organizational charts, geographic regions). Knowing how to write a recursive CTE demonstrates you can handle complex data structures beyond simple flat tables. It's a powerful tool that many SQL users are not familiar with.
Pillar 6: Transactions & Data Definition Language (DDL)
Scenario: Ensuring Data Integrity During Writes
Q: Explain the difference between `DELETE`, `TRUNCATE`, and `DROP`. When and why would you use each?
These commands all remove data, but they operate at different levels and have huge implications for performance and recoverability. A side-by-side sketch follows the list.
- `DELETE`: A DML (Data Manipulation Language) command. It removes rows one by one and logs each deletion.
- When to use: When you need to remove a subset of rows based on a `WHERE` clause.
- Why: It's slow for large tables but safe, as it's part of a transaction and can be rolled back. It also fires triggers.
- `TRUNCATE`: A DDL (Data Definition Language) command. It deallocates the data pages of the table, instantly removing all rows.
- When to use: To quickly delete all rows from a large staging table.
- Why: It's extremely fast and uses minimal system resources. However, it's minimally logged and does not fire `DELETE` triggers, and in some databases (e.g., MySQL) it cannot be rolled back at all; PostgreSQL and SQL Server do allow `TRUNCATE` inside a transaction.
- `DROP`: A DDL command. It removes the entire table structure and all its data permanently.
- When to use: When you are completely finished with a table and want to remove it from the database.
- Why: It's the most destructive command. There's no coming back from a `DROP TABLE` without a backup.
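A side-by-side sketch, using a hypothetical `staging_orders` table:
-- DELETE: remove only a subset of rows, transactionally
DELETE FROM staging_orders WHERE load_date < '2023-01-01';

-- TRUNCATE: instantly empty the whole table, keeping its structure
TRUNCATE TABLE staging_orders;

-- DROP: remove the table and its data entirely
DROP TABLE staging_orders;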
Interviewer's Deep Dive
A: "I would wrap the entire series of statements in a `TRANSACTION`. I'd start with `BEGIN TRANSACTION`, run my statements, and then finish with `COMMIT` if everything is successful. If any step fails, I can issue a `ROLLBACK` to undo all the changes made since the transaction began. This ensures atomicity, a key ACID property."
Pillar 7: Performance & Optimization
Scenario: Diagnosing a Slow Query
Q: A critical query in your pipeline is running slowly. What is your systematic approach to diagnosing and fixing it?
- Analyze the Execution Plan: My first action is to use `EXPLAIN` to ask the database for its execution plan. This reveals if it's using a "Full Table Scan" on a large table, which is a major red flag.
- Verify Indexes: The execution plan will guide me, but I will confirm that columns in `WHERE` clauses, `JOIN` conditions, and `ORDER BY` clauses are properly indexed. This is the most common cause of poor performance.
- Check for Non-Sargable Queries: I'll look for functions applied to columns in the `WHERE` clause, like `WHERE YEAR(order_date) = 2023`. This prevents the database from using an index on `order_date`. The fix, as sketched below, is to rewrite it as a range predicate: `WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'`.
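The fix in context, assuming a hypothetical `orders` table:
-- Ensure the filter column is indexed
CREATE INDEX idx_orders_order_date ON orders (order_date);

-- Non-sargable: the function hides order_date from the index
SELECT * FROM orders WHERE YEAR(order_date) = 2023;

-- Sargable rewrite: the range predicate allows an index range scan
SELECT *
FROM orders
WHERE order_date >= '2023-01-01'
  AND order_date < '2024-01-01';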
Why It Matters for a Data Engineer
A Data Engineer is responsible for building pipelines that are correct, efficient, and cost-effective. A slow query can delay data availability, miss SLAs, and drive up cloud computing costs. The ability to systematically diagnose and tune query performance is a fundamental and highly valued skill.