This guide is a comprehensive deep dive into the SQL questions that define a modern Data Engineering interview. We’ll move beyond basic syntax to explore performance, data integrity, and complex transformations, focusing on the critical thinking that separates a great engineer from a good one.
Pillar 1: Data Integrity & Cleansing
Scenario: Handling Duplicate Records
Q: Your pipeline has ingested a `customers` table with potential duplicates. Outline your systematic approach to identify, inspect, and resolve this data quality issue.
My approach is a three-step process: identify, inspect, and then execute a safe cleaning strategy.
Step 1: How do you identify which emails are duplicated?
Use `GROUP BY` and `HAVING` to find keys that appear more than once.
Input `customers` table:
id | name | email | signed_up_date |
---|---|---|---|
1 | John Smith | john.s@email.com | 2023-01-10 |
2 | Jane Doe | jane.d@email.com | 2023-02-15 |
3 | Johnathan Smith | john.s@email.com | 2023-03-20 |
SELECT email, COUNT(*) AS occurrences FROM customers GROUP BY email HAVING COUNT(*) > 1;
Output: `john.s@email.com` with 2 occurrences.
Step 2: How do you inspect the full details of these duplicate rows?
Use a subquery or CTE to filter the main table for the problematic emails.
SELECT * FROM customers WHERE email IN (
SELECT email FROM customers GROUP BY email HAVING COUNT(*) > 1
);
Output: Rows with `id` 1 and 3.
Step 3: How do you delete the duplicates but keep the earliest signed-up record?
The safest and most flexible method is using a CTE with the `ROW_NUMBER()` window function.
WITH RankedCustomers AS (
SELECT id, ROW_NUMBER() OVER(PARTITION BY email ORDER BY signed_up_date ASC) as rn
FROM customers
)
DELETE FROM customers WHERE id IN (SELECT id FROM RankedCustomers WHERE rn > 1);
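In warehouses where row-level `DELETE`s are slow or unsupported, a common alternative is to rebuild the table rather than delete in place. A minimal sketch, assuming your platform supports `CREATE TABLE AS` (the name `customers_dedup` is illustrative):
CREATE TABLE customers_dedup AS
SELECT id, name, email, signed_up_date
FROM (
    -- Rank each row within its email group, earliest signup first
    SELECT *,
           ROW_NUMBER() OVER(PARTITION BY email ORDER BY signed_up_date ASC) AS rn
    FROM customers
) ranked
WHERE rn = 1;
You would then swap the new table in for the old one (e.g., via renames) where the database allows it.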
Why It Matters for a Data Engineer
Duplicate data breaks downstream processes that rely on unique keys, corrupts analytics, and violates data quality contracts. A DE must design idempotent pipelines that can either handle duplicates gracefully or clean them according to defined business rules. This demonstrates an understanding of robust pipeline design.
The ETL Connection (Talend)
In Talend, the `tUniqRow` component is designed for this. It splits a data flow into “uniques” and “duplicates”. For cleaning, you’d direct the “uniques” output (which keeps the first record for each key) to your target table and log or discard the “duplicates” flow.
Pillar 2: Analytical Queries with Window Functions
Scenario: Ranking, Segmentation, and Running Calculations
Q: Explain the ranking functions. Then, solve two problems: finding the “Top N per group” and calculating a running total.
Window functions perform calculations across a set of related rows. The ranking functions (`ROW_NUMBER`, `RANK`, `DENSE_RANK`) are a core part of this family: `ROW_NUMBER` assigns a unique sequential number even to tied rows, `RANK` gives ties the same number but leaves gaps afterwards, and `DENSE_RANK` gives ties the same number with no gaps.
Problem 1: Find the top 2 highest-paid employees in each department.
Input `employees` table:
name | department | salary |
---|---|---|
Alice | Engineering | 120000 |
Bob | Engineering | 110000 |
Charlie | Engineering | 110000 |
David | Sales | 95000 |
`DENSE_RANK()` is the right choice to include all employees who tie for a position.
WITH RankedSalaries AS (
SELECT name, department, salary, DENSE_RANK() OVER(PARTITION BY department ORDER BY salary DESC) as rnk
FROM employees
)
SELECT name, department, salary FROM RankedSalaries WHERE rnk <= 2;
Output: Alice, Bob, and Charlie from Engineering, and David from Sales.
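To make the difference concrete, here is a sketch running all three ranking functions over the same data. With Bob and Charlie tied at 110000, `ROW_NUMBER` arbitrarily orders them 2 and 3, `RANK` gives both 2 and skips to 4, and `DENSE_RANK` gives both 2 with the next rank being 3:
SELECT name, department, salary,
       ROW_NUMBER() OVER(PARTITION BY department ORDER BY salary DESC) AS row_num,
       RANK()       OVER(PARTITION BY department ORDER BY salary DESC) AS rnk,
       DENSE_RANK() OVER(PARTITION BY department ORDER BY salary DESC) AS dense_rnk
FROM employees;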
Problem 2: Calculate a monthly running total of revenue.
Input `monthly_revenue` table:
month | revenue |
---|---|
2023-01-01 | 50000 |
2023-02-01 | 80000 |
2023-03-01 | 60000 |
SELECT month, revenue, SUM(revenue) OVER (ORDER BY month) as running_total_revenue FROM monthly_revenue;
Output:
month | revenue | running_total_revenue |
---|---|---|
2023-01-01 | 50000 | 50000 |
2023-02-01 | 80000 | 130000 |
2023-03-01 | 60000 | 190000 |
Interviewer's Deep Dive
A: "I would add a `PARTITION BY product_line` clause to the `OVER()` statement. This resets the calculation for each new product line."
SUM(revenue) OVER (PARTITION BY product_line ORDER BY month)
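As a full query, assuming `monthly_revenue` also carried a `product_line` column (not shown in the sample table above):
SELECT product_line, month, revenue,
       SUM(revenue) OVER (PARTITION BY product_line ORDER BY month) AS running_total_revenue
FROM monthly_revenue;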
Pillar 3: Structuring & Joining Data
Scenario 1: Joining, Combining, and Finding Mismatches
Q: Explain the difference between `INNER JOIN`, `LEFT JOIN`, `UNION`, and `UNION ALL`. Provide a use case for each.
`JOIN`s combine columns from different tables horizontally, while `UNION`/`UNION ALL` stack rows from different tables vertically.
`INNER JOIN` vs. `LEFT JOIN`
An `INNER JOIN` returns only the rows with matching keys in both tables (the intersection). A `LEFT JOIN` returns every record from the left table plus matched records from the right; unmatched left rows carry `NULL`s in the right table's columns.
Use Case: Use `INNER JOIN` when you only care about records that have a valid relationship in both tables (e.g., `orders` and `customers`). Use `LEFT JOIN` when you need all records from one table, regardless of whether they have a match in the other (e.g., finding all customers, even those who haven't placed an order yet).
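A sketch of the second use case, assuming an `orders` table with an `id` and a `customer_id` foreign key: customers with no orders still appear, with an order count of zero.
SELECT c.id, c.name, COUNT(o.id) AS order_count
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
GROUP BY c.id, c.name;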
`UNION` vs. `UNION ALL`
`UNION ALL` simply appends datasets. `UNION` does the same, but then adds a costly de-duplication step.
Why It Matters for a Data Engineer
Performance. `UNION` can be a huge bottleneck in a pipeline due to its sort/hash de-duplication. Always default to `UNION ALL` unless you have an explicit business need to remove duplicates at that step. It's significantly faster and less resource-intensive.
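For example, stacking two identically-shaped staging extracts (table names are illustrative):
SELECT order_id, amount FROM staging_orders_2023
UNION ALL
SELECT order_id, amount FROM staging_orders_2024;
Swapping `UNION ALL` for `UNION` here would force a de-duplication sort/hash over the entire combined set.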
Interviewer's Deep Dive
A: "I would use a `LEFT JOIN ... WHERE key IS NULL` or `NOT EXISTS`. I avoid `NOT IN` because it fails unpredictably if the subquery contains any `NULL` values. `NOT EXISTS` is often the most performant as it can short-circuit, but `LEFT JOIN` is sometimes more readable."
Pillar 4: Advanced Transformations & Query Structuring
Scenario: Pivoting, Unpivoting, and Readability
Q: Explain the purpose of a CTE. Then, show how you would pivot data from a row-based format to a columnar report.
CTEs vs. Subqueries
A Common Table Expression (`WITH` clause) creates a temporary, named result set. It is superior to nested subqueries because it makes complex queries highly readable and modular. You can also reference a CTE multiple times in the same query.
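A small sketch of the readability win, reusing the `employees` table from Pillar 2: the aggregation gets a descriptive name, and the final query reads like a sentence.
WITH dept_avg AS (
    -- One clearly-named building block: average salary per department
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
)
SELECT e.name, e.department, e.salary, d.avg_salary
FROM employees e
JOIN dept_avg d ON d.department = e.department
WHERE e.salary > d.avg_salary;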
Pivoting Data
Pivoting transforms rows into columns, often for reporting. The standard method is conditional aggregation.
Input `quarterly_sales` table:
product | quarter | sales |
---|---|---|
Laptop | Q1 | 50000 |
Laptop | Q2 | 60000 |
Mouse | Q1 | 5000 |
SELECT
product,
SUM(CASE WHEN quarter = 'Q1' THEN sales ELSE 0 END) AS q1_sales,
SUM(CASE WHEN quarter = 'Q2' THEN sales ELSE 0 END) AS q2_sales
FROM quarterly_sales
GROUP BY product;
Pivoted Output:
product | q1_sales | q2_sales |
---|---|---|
Laptop | 50000 | 60000 |
Mouse | 5000 | 0 |
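The reverse operation, unpivoting, turns columns back into rows. The portable method is one `SELECT` per column glued together with `UNION ALL`; here it is run against the pivoted output above, assuming it was saved as a table named `pivoted_sales` (an illustrative name):
SELECT product, 'Q1' AS quarter, q1_sales AS sales FROM pivoted_sales
UNION ALL
SELECT product, 'Q2' AS quarter, q2_sales AS sales FROM pivoted_sales;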
Pillar 5: Advanced Data Modeling & Hierarchies
Scenario: Working with Hierarchical Data
Q: Your `employees` table has an `id` and a `manager_id`, representing a hierarchy. How would you write a query to show the entire reporting chain for a specific employee?
This requires a **Recursive CTE**, which is designed specifically for traversing hierarchical or graph-like data structures. (PostgreSQL and MySQL require the `WITH RECURSIVE` keyword shown below; SQL Server and Oracle use plain `WITH`.)
Input `employees` table:
id | name | manager_id |
---|---|---|
1 | CEO | NULL |
2 | VP of Eng | 1 |
3 | Director | 2 |
4 | Engineer | 3 |
Goal: Find the reporting chain for the 'Engineer' (id = 4).
WITH RECURSIVE EmployeeHierarchy AS (
-- Anchor Member: Start with the employee in question
SELECT id, name, manager_id, 0 AS level
FROM employees
WHERE id = 4
UNION ALL
-- Recursive Member: Join back to the employees table to find the manager
SELECT e.id, e.name, e.manager_id, eh.level + 1
FROM employees e
JOIN EmployeeHierarchy eh ON e.id = eh.manager_id
)
SELECT * FROM EmployeeHierarchy;
Output:
id | name | manager_id | level |
---|---|---|---|
4 | Engineer | 3 | 0 |
3 | Director | 2 | 1 |
2 | VP of Eng | 1 | 2 |
1 | CEO | NULL | 3 |
Why It Matters for a Data Engineer
Many real-world datasets are hierarchical (product categories, organizational charts, geographic regions). Knowing how to write a recursive CTE demonstrates you can handle complex data structures beyond simple flat tables. It's a powerful tool that many SQL users are not familiar with.
Pillar 6: Transactions & Data Definition Language (DDL)
Scenario: Ensuring Data Integrity During Writes
Q: Explain the difference between `DELETE`, `TRUNCATE`, and `DROP`. When and why would you use each?
These commands all remove data, but they operate at different levels and have huge implications for performance and recoverability. A side-by-side sketch follows the list.
- `DELETE`: A DML (Data Manipulation Language) command. It removes rows one by one and logs each deletion.
- When to use: When you need to remove a subset of rows based on a `WHERE` clause.
- Why: It's slow for large tables but safe, as it's part of a transaction and can be rolled back. It also fires triggers.
- `TRUNCATE`: A DDL (Data Definition Language) command. It deallocates the data pages of the table, instantly removing all rows.
- When to use: To quickly delete all rows from a large staging table.
- Why: It's extremely fast and uses minimal system resources. However, it's minimally logged and does not fire `DELETE` triggers, and in some databases (e.g., MySQL) it cannot be rolled back at all; PostgreSQL and SQL Server do allow `TRUNCATE` inside a transaction.
- `DROP`: A DDL command. It removes the entire table structure and all its data permanently.
- When to use: When you are completely finished with a table and want to remove it from the database.
- Why: It's the most destructive command. There's no coming back from a `DROP TABLE` without a backup.
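A side-by-side sketch, using a hypothetical `staging_orders` table:
-- DELETE: remove only a subset of rows, transactionally
DELETE FROM staging_orders WHERE load_date < '2023-01-01';

-- TRUNCATE: instantly empty the whole table, keeping its structure
TRUNCATE TABLE staging_orders;

-- DROP: remove the table and its data entirely
DROP TABLE staging_orders;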
Interviewer's Deep Dive
A: "I would wrap the entire series of statements in a `TRANSACTION`. I'd start with `BEGIN TRANSACTION`, run my statements, and then finish with `COMMIT` if everything is successful. If any step fails, I can issue a `ROLLBACK` to undo all the changes made since the transaction began. This ensures atomicity, a key ACID property."
Pillar 7: Performance & Optimization
Scenario: Diagnosing a Slow Query
Q: A critical query in your pipeline is running slowly. What is your systematic approach to diagnosing and fixing it?
- Analyze the Execution Plan: My first action is to use `EXPLAIN` to ask the database for its execution plan. This reveals if it's using a "Full Table Scan" on a large table, which is a major red flag.
- Verify Indexes: The execution plan will guide me, but I will confirm that columns in `WHERE` clauses, `JOIN` conditions, and `ORDER BY` clauses are properly indexed. This is the most common cause of poor performance.
- Check for Non-Sargable Queries: I'll look for functions applied to columns in the `WHERE` clause, like `WHERE YEAR(order_date) = 2023`. This prevents the database from using an index on `order_date`. The fix, as sketched below, is to rewrite it as a range predicate: `WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01'`.
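The fix in context, assuming a hypothetical `orders` table:
-- Ensure the filter column is indexed
CREATE INDEX idx_orders_order_date ON orders (order_date);

-- Non-sargable: the function hides order_date from the index
SELECT * FROM orders WHERE YEAR(order_date) = 2023;

-- Sargable rewrite: the range predicate allows an index range scan
SELECT *
FROM orders
WHERE order_date >= '2023-01-01'
  AND order_date < '2024-01-01';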
Why It Matters for a Data Engineer
A Data Engineer is responsible for building pipelines that are correct, efficient, and cost-effective. A slow query can delay data availability, miss SLAs, and drive up cloud computing costs. The ability to systematically diagnose and tune query performance is a fundamental and highly valued skill.