I am trying to use the pandas cumsum() function in a way that handles duplicated values in the ID column: when an ID appears more than once, only its most recent value should count towards the cumulative sum, and all of its earlier values should be ignored.
Example code is below (I can't share the real code, as it's for work).
import pandas as pd
import numpy as np
id = ['a','b','c','a','b','e','f','a','b','k']
value = [12,14,3,13,16,7,4,6,10,18]
df = pd.DataFrame({'id':id, 'value':value})
df["cumsum_of_value"] = df['value'].cumsum()
df["desired_output"] = [
12,26,29,30,32,39,43,36,30,48
]
df["comments"] = [""]*len(df)
df.loc[df.index==0, "comments"]="standard cumsum"
df.loc[df.index==1, "comments"]="standard cumsum"
df.loc[df.index==2, "comments"]="standard cumsum"
df.loc[df.index==3, "comments"]="cumsum of rows 1-3, ignore row 0"
df.loc[df.index==4, "comments"]="cumsum of rows 2-4, ignore rows 0, 1"
df.loc[df.index==5, "comments"]="cumsum of rows 2-5, ignore rows 0, 1"
df.loc[df.index==6, "comments"]="cumsum of rows 2-6, ignore rows 0, 1"
df.loc[df.index==7, "comments"]="cumsum of rows 2,4-7, ignore rows 0, 1, 3"
df.loc[df.index==8, "comments"]="cumsum of rows 2,5-8, ignore rows 0, 1, 3, 4"
df.loc[df.index==9, "comments"]="cumsum of rows 2,5-9, ignore rows 0, 1, 3, 4"
print(df)
In this example, there are 6 unique values in the ID column (a, b, c, e, f, k), so the cumsum should only ever sum a maximum of 6 records for its output on any row.
Is this possible without an iterative loop, using a combination of functions such as cumsum(), groupby(), duplicated() or drop_duplicates()?
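For clarity, here is a loop-based version of the rule I'm describing, which does reproduce desired_output on this example by keeping only the most recent value seen for each id (this is the kind of iteration I'm hoping to avoid; the names latest, running and loop_output are just for illustration):
latest = {}                       # most recent value seen so far for each id
running = []
for key, val in zip(df["id"], df["value"]):
    latest[key] = val             # overwrite any earlier value for this id
    running.append(sum(latest.values()))
df["loop_output"] = running       # matches desired_output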
I've tried the following:
df["duped"] = np.where(df["id"].duplicated(keep='last'),0,1)
df["value_duped"] = df["duped"] * df["value"]
df["desired_output_attempt"] = df["cumsum_of_value"] - df["value_duped"]
but it doesn't come close to the correct answer, and I can't think of a way to get something like this to produce the desired output without iterating.
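For reference, as far as I can tell the attempt gives:
print(df["desired_output_attempt"].tolist())
# [12, 26, 26, 42, 58, 58, 65, 69, 75, 85]
print(df["desired_output"].tolist())
# [12, 26, 29, 30, 32, 39, 43, 36, 30, 48]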