I have a table with 10 million unique Product_IDs. These unique IDs are tested by HR, and the metric takes the values [good: -1, neutral: 0, bad: 1]; I call this column hr_status. I want to develop a workflow that computes a clamped running sum of these integers and decides whether to flag the product for inspection.
The logic is as follows:
TOP SUM:
If the cumulative sum of these hr_status values reaches 5, flag the product for inspection, BUT do not go over 5. In other words, 5 is our upper cap: if more (bad: 1) signals come in, I will not keep adding, and instead leave the running sum at 5.
BOTTOM:
For the case where many good signals in a row happen, I do not want to go below 0, so 0 is our lowest value for the sum.
Sample Table:
Product_ID | time | hr_status |
---|---|---|
AA | 6/10/2024 01:10:10 AM | -1 |
AA | 6/10/2024 02:10:10 AM | 0 |
AA | 6/10/2024 03:10:10 AM | 1 |
AA | 6/10/2024 04:10:10 AM | 1 |
AA | 6/10/2024 05:10:10 AM | 1 |
AA | 6/10/2024 06:10:10 AM | 1 |
AA | 6/10/2024 07:10:10 AM | 1 |
AA | 6/10/2024 08:10:10 AM | 1 |
AA | 6/10/2024 09:10:10 AM | 1 |
My expected CUM_SUM column for the above example should look like this (assuming we start from 0):
CUM_SUM |
---|
0 |
0 |
1 |
2 |
3 |
4 |
5 |
5 |
5 |
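For reference, the rule can be expressed as a clamped running sum: at each row, add hr_status to the previous total and clip the result to [0, 5]. A minimal Python sketch that reproduces the expected column (the function name `clamped_cumsum` and its parameters are my own, not part of any library):

```python
from itertools import accumulate

def clamped_cumsum(values, lo=0, hi=5, start=0):
    """Running sum of hr_status values, clipped to [lo, hi] at every step."""
    def step(total, x):
        # Add the new signal, then clamp the result into [lo, hi].
        return min(hi, max(lo, total + x))
    # accumulate yields the start value first; drop it so output aligns with input rows.
    return list(accumulate(values, step, initial=start))[1:]

hr_status = [-1, 0, 1, 1, 1, 1, 1, 1, 1]
print(clamped_cumsum(hr_status))  # [0, 0, 1, 2, 3, 4, 5, 5, 5]
```

Note this is order-dependent: because each step clamps before the next addition, it cannot be rewritten as a plain `SUM(...) OVER (...)` window, which is why a naive window function does not work.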
My previous attempt was a POC, but it was just Python with for loops inside for loops, and that won't scale to big data.
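For what it's worth, the nested loops can be replaced by a single O(n) pass over the rows sorted by (Product_ID, time), keeping one running total per product. A hedged Python sketch (the names `clamped_cumsum_by_product` and the (product_id, hr_status) tuple layout are illustrative assumptions):

```python
def clamped_cumsum_by_product(rows, lo=0, hi=5):
    """rows: iterable of (product_id, hr_status), assumed sorted by (product_id, time).

    Returns a list of (product_id, clamped_running_sum), one per input row.
    """
    totals = {}  # running total per Product_ID
    out = []
    for pid, status in rows:
        total = totals.get(pid, 0)          # start each product at 0
        total = min(hi, max(lo, total + status))  # add, then clamp to [lo, hi]
        totals[pid] = total
        out.append((pid, total))
    return out
```

A single pass like this still runs on one machine, but it at least shows the state that a distributed version (e.g. a per-partition aggregation in PySpark) would have to carry per Product_ID.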
I want to solve this in BigQuery, but there is also the possibility of using PySpark.
So far I have tried a combination of ROW_NUMBER() OVER (PARTITION BY ...) but have not been successful. I wonder whether recursive SQL is an option worth exploring.