
Calculating a cumulative sum, or a "running total," is one of the most practical and frequently asked SQL questions in data analytics and business intelligence interviews. It demonstrates your understanding of modern SQL features and your ability to solve real-world business problems.
This guide will walk you through every scenario, from the basics to advanced challenges, all framed from the perspective of a technical interview.
The Setup: Our Sample `sales` Table
For all our examples, we'll work with a simple `sales` table. This is our ground truth.
Input Table: `sales`
CREATE TABLE sales (
order_id INT PRIMARY KEY,
product_category VARCHAR(50),
order_date DATE,
revenue DECIMAL(10, 2)
);
INSERT INTO sales (order_id, product_category, order_date, revenue) VALUES
(1, 'Electronics', '2023-10-01', 150.00),
(2, 'Books', '2023-10-01', 25.50),
(3, 'Electronics', '2023-10-02', 200.00),
(4, 'Home Goods', '2023-10-02', 75.00),
(5, 'Books', '2023-10-03', 40.00),
(6, 'Electronics', '2023-10-03', 500.00),
(7, 'Books', '2023-10-03', 15.00);
order_id | product_category | order_date | revenue |
---|---|---|---|
1 | Electronics | 2023-10-01 | 150.00 |
2 | Books | 2023-10-01 | 25.50 |
3 | Electronics | 2023-10-02 | 200.00 |
4 | Home Goods | 2023-10-02 | 75.00 |
5 | Books | 2023-10-03 | 40.00 |
6 | Electronics | 2023-10-03 | 500.00 |
7 | Books | 2023-10-03 | 15.00 |
Scenario 1: The Basic Cumulative Sum
The Solution: Using Window Functions
The most efficient and modern way to solve this is with the `SUM() OVER()` window function.
SELECT
order_date,
revenue,
SUM(revenue) OVER (ORDER BY order_date ASC) AS cumulative_revenue
FROM
sales
ORDER BY
order_date;
Output Table:
order_date | revenue | cumulative_revenue |
---|---|---|
2023-10-01 | 150.00 | 175.50 |
2023-10-01 | 25.50 | 175.50 |
2023-10-02 | 200.00 | 450.50 |
2023-10-02 | 75.00 | 450.50 |
2023-10-03 | 40.00 | 1005.50 |
2023-10-03 | 500.00 | 1005.50 |
2023-10-03 | 15.00 | 1005.50 |
💡 Logical Reasoning (Your Interview Answer)
"I'll solve this using a window function, specifically `SUM() OVER()`, because it's highly efficient and designed for this type of calculation.
- `SUM(revenue)` specifies the aggregation I want to perform.
- The `OVER()` clause defines the 'window' or set of rows for the sum.
- `ORDER BY order_date ASC` inside the `OVER()` clause is the key. It tells SQL to sum the revenue for the current row and all preceding rows based on the date.
- Notice that rows with the same date are treated as peers. The cumulative value is the same for both rows on '2023-10-01' because the window includes all rows up to that date. The default window frame includes all peers of the current row."
For the row with revenue 200.00 on 2023-10-02, the window looks like this:
150.00
25.50
200.00
75.00
Sum = 150 + 25.50 + 200 + 75 = 450.50
Scenario 2: Cumulative Revenue Per Day (Aggregated)
The Solution: Aggregate First with a CTE
This is a two-step problem. The best practice is to use a Common Table Expression (CTE) to first aggregate the data, then apply the window function.
WITH daily_revenue AS (
-- Step 1: Aggregate revenue by day
SELECT
order_date,
SUM(revenue) AS total_daily_revenue
FROM
sales
GROUP BY
order_date
)
-- Step 2: Calculate the cumulative sum on the aggregated data
SELECT
order_date,
total_daily_revenue,
SUM(total_daily_revenue) OVER (ORDER BY order_date ASC) AS cumulative_revenue
FROM
daily_revenue
ORDER BY
order_date;
Output Table:
order_date | total_daily_revenue | cumulative_revenue |
---|---|---|
2023-10-01 | 175.50 | 175.50 |
2023-10-02 | 275.00 | 450.50 |
2023-10-03 | 555.00 | 1005.50 |
💡 Logical Reasoning (Your Interview Answer)
"This requires a two-step logical process, which I'll implement using a Common Table Expression (CTE) for readability and maintainability.
- The CTE `daily_revenue`: First, I need to get the total revenue for each day. I do this by `GROUPING BY order_date` and using `SUM(revenue)`. This creates a temporary table with one row per day.
- The Final `SELECT`: Then, I can treat this CTE as a regular table. I apply the same `SUM() OVER (ORDER BY ...)` window function to the `total_daily_revenue` column. This correctly calculates the running total on the daily aggregated figures.
Scenario 3: Cumulative Revenue with Partitions
The Solution: The `PARTITION BY` Clause
This is the perfect use case for `PARTITION BY`, which acts like a `GROUP BY` for window functions.
SELECT
order_date,
product_category,
revenue,
SUM(revenue) OVER (PARTITION BY product_category ORDER BY order_date ASC) AS category_cumulative_revenue
FROM
sales
ORDER BY
product_category,
order_date;
Output Table:
order_date | product_category | revenue | category_cumulative_revenue |
---|---|---|---|
2023-10-01 | Books | 25.50 | 25.50 |
2023-10-03 | Books | 40.00 | 65.50 |
2023-10-03 | Books | 15.00 | 80.50 |
2023-10-01 | Electronics | 150.00 | 150.00 |
2023-10-02 | Electronics | 200.00 | 350.00 |
2023-10-03 | Electronics | 500.00 | 850.00 |
2023-10-02 | Home Goods | 75.00 | 75.00 |
💡 Logical Reasoning (Your Interview Answer)
"Absolutely. I can achieve this by adding the `PARTITION BY` clause to the window function.
- `PARTITION BY product_category`: This clause logically splits the data into separate partitions, one for 'Books', one for 'Electronics', and one for 'Home Goods'.
- The `SUM()` calculation is then performed independently within each of these partitions. When the query moves from the 'Books' partition to the 'Electronics' partition, the running total resets and starts from scratch.
- The `ORDER BY order_date` still dictates the summation order, but it operates *within* each partition."
25.50 → 25.50
40.00 → 65.50
15.00 → 80.50
150.00 → 150.00
200.00 → 350.00
500.00 → 850.00
75.00 → 75.00
The calculations in each box are completely separate.
🚀 Advanced Scenarios & Bonus Points
Scenario 4: Handling Days with No Sales
The Solution: Calendar Table + `LEFT JOIN`
This is a senior-level question that tests your ability to handle sparse data. The key is to generate a complete date series (a calendar) and `LEFT JOIN` your sales data to it.
-- This CTE generates a date series. Syntax may vary by SQL dialect.
-- (This is a portable recursive CTE version)
WITH RECURSIVE calendar AS (
SELECT CAST('2023-10-01' AS DATE) AS report_date
UNION ALL
SELECT DATE_ADD(report_date, INTERVAL 1 DAY) -- Use DATEADD for SQL Server
FROM calendar
WHERE report_date < '2023-10-31'
),
daily_revenue AS (
SELECT
order_date,
SUM(revenue) AS total_daily_revenue
FROM sales
GROUP BY order_date
),
-- Join the complete calendar to our actual sales data
filled_daily_revenue AS (
SELECT
cal.report_date,
COALESCE(dr.total_daily_revenue, 0) AS daily_revenue
FROM calendar cal
LEFT JOIN daily_revenue dr ON cal.report_date = dr.order_date
)
-- Finally, calculate the cumulative sum on the complete, filled-in data
SELECT
report_date,
daily_revenue,
SUM(daily_revenue) OVER (ORDER BY report_date ASC) AS cumulative_revenue
FROM
filled_daily_revenue
ORDER BY
report_date;
💡 Logical Reasoning (Your Interview Answer)
"This is a common data warehousing problem. My approach involves four steps to ensure every day is represented:
- Generate a Calendar: First, I'll create a complete list of dates for the desired period using a recursive CTE. This acts as my master list.
- Aggregate Sales Data: I'll use the same daily aggregation CTE from before to get the total revenue for days that actually had sales.
- `LEFT JOIN` and `COALESCE`: The crucial step is to `LEFT JOIN` the calendar to the sales data. This keeps all dates from the calendar. For days with no sales, the revenue will be `NULL`. I then use `COALESCE(revenue, 0)` to convert these `NULL`s to zero, which is essential for the sum to work correctly.
- Final Cumulative Sum: Now that I have a clean, complete dataset with a row for every day, I can apply the standard `SUM() OVER()` window function to get the correct running total that carries forward over the gaps.
Scenario 5: The "Old School" Method (No Window Functions)
The Solution: Correlated Subquery
This tests your knowledge of more fundamental SQL techniques. A correlated subquery is a classic way to solve this, though less efficient.
WITH daily_revenue AS (
SELECT
order_date,
SUM(revenue) AS total_daily_revenue
FROM sales
GROUP BY order_date
)
SELECT
d1.order_date,
d1.total_daily_revenue,
(SELECT SUM(d2.total_daily_revenue)
FROM daily_revenue d2
WHERE d2.order_date <= d1.order_date) AS cumulative_revenue
FROM
daily_revenue d1
ORDER BY
d1.order_date;
💡 Logical Reasoning (Your Interview Answer)
"Without window functions, I'd use a correlated subquery. Here's how it works:
- First, I'd still aggregate the daily revenue into a CTE, just like before.
- For the main query, I select the date and daily revenue from the CTE, which I'll alias as `d1`.
- The cumulative revenue column is calculated with a subquery. For each row (each day) in the outer query (`d1`), the subquery runs again. It sums up all `total_daily_revenue` from the CTE (aliased as `d2`) where the date (`d2.order_date`) is less than or equal to the current row's date (`d1.order_date`).
- Crucial Follow-up: It's important to note that while this works, it's far less performant than window functions, especially on large datasets. The subquery is executed for every single row, leading to a complexity that approaches O(N²), whereas window functions are typically much more optimized, closer to O(N log N) or O(N)."
🏆 Top 5 Interview Questions: Quick Recap
1. How do you calculate a basic running total?
Use SUM(column) OVER (ORDER BY date_column)
.
2. How do you find the cumulative sum per day (not per transaction)?
Use a CTE or subquery to GROUP BY
day first, then apply the window function to the aggregated daily totals.
3. How do you restart the running total for different groups (e.g., categories)?
Add PARTITION BY category_column
inside the OVER()
clause.
4. How do you handle missing days in a report?
Generate a calendar table, LEFT JOIN
your sales data to it, use COALESCE
to replace NULLs with 0, then apply the window function.
5. Why are window functions preferred over self-joins/correlated subqueries for running totals?
Performance. Window functions are significantly more efficient and scalable as they typically require only one pass over the data, while self-joins can lead to quadratic complexity (O(N²)) and become very slow on large tables.
❓ Frequently Asked Questions (FAQs)
What's the difference between `ROWS` and `RANGE` in the window frame?
This is an advanced detail. The default frame is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`. With `RANGE`, rows with the same `ORDER BY` value are treated as peers (as seen in Scenario 1). If you used `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, it would only sum up to the physical current row, giving a different result if the ordering of peer rows is not guaranteed. For most cumulative totals by date, `RANGE` is what you want.
Can you use other functions like `AVG()` or `COUNT()` as window functions?
Yes! You can calculate a cumulative average (`AVG() OVER (...)`), a cumulative count (`COUNT() OVER (...)`), or use other functions like `MIN()`, `MAX()`, `LAG()`, and `LEAD()` to get amazing insights across your data window.
What if my `order_date` column is actually a `DATETIME` or `TIMESTAMP` with time?
That makes the `ORDER BY` more precise. If you want to group by day, you should first `CAST(order_timestamp AS DATE)` in your aggregation CTE. If you want a running total by the exact timestamp, you can just use `ORDER BY order_timestamp` directly.
Discover more from Data Engineer Journey
Subscribe to get the latest posts sent to your email.