I have a large dataset of a company's sales and payments received.
For each customer, how can I quickly find the earliest sale that is not covered by that customer's total payments?
E.g., suppose the data table looks like this:
Date | Customer ID | Type | Amount |
---|---|---|---|
2024-01-01 | 1 | SALE | 100 |
2024-01-08 | 1 | PAYMENT | 50 |
2024-01-15 | 1 | SALE | 100 |
2024-01-22 | 1 | PAYMENT | 100 |
2024-01-02 | 2 | SALE | 200 |
2024-01-09 | 2 | SALE | 200 |
2024-01-16 | 2 | PAYMENT | 150 |
For Customer 1 the answer is 2024-01-15: total payments are 150, so the first sale (100) is fully covered but the second is not. For Customer 2 it is 2024-01-02: the single payment of 150 cannot even cover the first sale of 200.
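To make the rule concrete, here is a plain-Python sketch of the intended logic over the example rows (this is only an illustration with hard-coded data, not part of the actual pipeline):

```python
from collections import defaultdict

# (date, customer_id, type, amount) rows from the example table
entries = [
    ("2024-01-01", 1, "SALE", 100),
    ("2024-01-08", 1, "PAYMENT", 50),
    ("2024-01-15", 1, "SALE", 100),
    ("2024-01-22", 1, "PAYMENT", 100),
    ("2024-01-02", 2, "SALE", 200),
    ("2024-01-09", 2, "SALE", 200),
    ("2024-01-16", 2, "PAYMENT", 150),
]

def earliest_uncovered(entries):
    """Oldest SALE whose cumulative sales exceed the customer's total payments."""
    total_payments = defaultdict(int)
    for _, cust, typ, amt in entries:
        if typ == "PAYMENT":
            total_payments[cust] += amt
    result = {}
    cum_sales = defaultdict(int)
    # ISO date strings sort chronologically
    for date, cust, typ, amt in sorted(entries):
        if typ == "SALE":
            cum_sales[cust] += amt
            if cum_sales[cust] > total_payments[cust] and cust not in result:
                result[cust] = date
    return result

print(dict(sorted(earliest_uncovered(entries).items())))
# {1: '2024-01-15', 2: '2024-01-02'}
```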
So far I use window functions to compute each customer's cumulative sales and total payments, then use the generated table to find the earliest uncovered sale along with other statistics:
with cumsum_entry as (
  select
    id, date, customer_id, type, amount,
    -- total payments for the customer minus the sales accumulated up to this row;
    -- coalesce guards against customers with no payments (or rows before the first sale)
    coalesce(sum(amount) filter (where type = 'PAYMENT') over win_customer, 0)
      - coalesce(sum(amount) filter (where type = 'SALE') over win_cumsum, 0) as cover
  from entry
  window
    win_customer as (partition by customer_id),
    win_cumsum as (partition by customer_id order by date
                   rows between unbounded preceding and current row)
)
select
  customer_id,
  sum(amount) filter (where date >= '2024-01-01') as monthly_total,
  min(date) filter (where cover < 0) as earliest_uncovered
from cumsum_entry
where date <= '2024-01-31'
group by customer_id;
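As a sanity check of the query's semantics (including the `monthly_total` filter), here is a plain-Python mirror of the two-step computation over the example rows; the date window is hard-coded to January 2024 as in the query:

```python
from collections import defaultdict

# (date, customer_id, type, amount) rows from the example table
entries = [
    ("2024-01-01", 1, "SALE", 100), ("2024-01-08", 1, "PAYMENT", 50),
    ("2024-01-15", 1, "SALE", 100), ("2024-01-22", 1, "PAYMENT", 100),
    ("2024-01-02", 2, "SALE", 200), ("2024-01-09", 2, "SALE", 200),
    ("2024-01-16", 2, "PAYMENT", 150),
]

def query_result(entries, start="2024-01-01", end="2024-01-31"):
    # win_customer: total payments per customer over the whole partition
    payments = defaultdict(int)
    for _, cust, typ, amt in entries:
        if typ == "PAYMENT":
            payments[cust] += amt
    out, cum_sales = {}, defaultdict(int)
    for date, cust, typ, amt in sorted(entries):  # ISO dates sort chronologically
        if typ == "SALE":
            cum_sales[cust] += amt                # win_cumsum: running sales total
        cover = payments[cust] - cum_sales[cust]
        if date > end:                            # outer "where date <= ..." clause
            continue
        row = out.setdefault(cust, {"monthly_total": 0, "earliest_uncovered": None})
        if date >= start:
            row["monthly_total"] += amt           # sum(amount) filter (where date >= ...)
        if cover < 0 and row["earliest_uncovered"] is None:
            row["earliest_uncovered"] = date      # min(date) filter (where cover < 0)
    return out

for cust, row in sorted(query_result(entries).items()):
    print(cust, row)
```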
In fact I store the generated table `cumsum_entry` as a view in PostgreSQL and use Django's ORM to do the rest of the work, but the core logic is the same as above.
However, I am not satisfied with the response time (about 2 seconds for ~10^3 customers / ~10^5 entries). How can I reduce it further?