I wish to sum specific rows of a column based on the value of other columns.
e.g., example file (comma-separated):
Id | Date | Time | value |
---|---|---|---|
11001 | 20240901 | 01:00:00 | 0.5 |
11001 | 20240901 | 02:00:00 | 0.5 |
11001 | 20240901 | 03:00:00 | 0.8 |
11001 | 20240901 | 04:00:00 | 0.5 |
11001 | 20240901 | 05:00:00 | 0.0 |
11001 | 20240901 | 06:00:00 | 0.5 |
11001 | 20240901 | 07:00:00 | 0.2 |
11001 | 20240901 | 08:00:00 | 0.5 |
11001 | 20240901 | 09:00:00 | 0.1 |
11001 | 20240901 | 10:00:00 | 0.2 |
11001 | 20240901 | 11:00:00 | 0.2 |
11001 | 20240901 | 12:00:00 | 0.7 |
11002 | 20240901 | 01:00:00 | 0.1 |
11002 | 20240901 | 02:00:00 | 0.3 |
11002 | 20240901 | 03:00:00 | 0.1 |
11002 | 20240901 | 04:00:00 | 0.6 |
11002 | 20240901 | 05:00:00 | 0.9 |
11002 | 20240901 | 06:00:00 | 0.1 |
11002 | 20240901 | 07:00:00 | 0.15 |
11002 | 20240901 | 08:00:00 | 0.3 |
11002 | 20240901 | 09:00:00 | 0.2 |
11002 | 20240901 | 10:00:00 | 0.7 |
11002 | 20240901 | 11:00:00 | 0.5 |
11002 | 20240901 | 12:00:00 | 0.3 |
So separately for each ID, I want to sum the values from 01:00:00 – 06:00:00, 07:00:00 – 12:00:00, 13:00:00 – 18:00:00 and 19:00:00 – 00:00:00 (of the next day).
My final output with the example above should look as such:
Id | Date | Time | value |
---|---|---|---|
11001 | 20240901 | 06:00:00 | 2.8 |
11001 | 20240901 | 12:00:00 | 1.9 |
11002 | 20240901 | 06:00:00 | 2.1 |
11002 | 20240901 | 12:00:00 | 2.15 |
Note that the Time
column here contains the ending time for each period.
Any advice on how to do this most efficiently? Thanks
14
I Would proceed as follows:
1. Create “Bins” so:
01:00:00 – 06:00:00 = Bin 1
07:00:00 – 12:00:00 = Bin 2
13:00:00 – 18:00:00 = Bin 3
19:00:00 – 23:59:59 = Bin 4
You can do this with the pandas cut module: Here
example code:
import pandas as pd
#Create Bins Variable
bins = [pd.to_datetime('06:00:00').time(), pd.to_datetime('12:00:00').time(),
pd.to_datetime('18:00:00').time(), pd.to_datetime('23:59:59').time()]
#Define the labels for the bins
labels = ['Bin 1', 'Bin 2', 'Bin 3']
#Create bins
df['bins'] = pd.cut(df['col_a'].dt.time, bins=bins, labels=labels, right=False)
2. Use the Group By function to sum by each bin
Again a pandas module can help here, simply group by the ID and bins column and sum the column of interest, then depending on how you want the resulting table to look like, you might have to take some extra steps.
example code:
grouped_df = df.groupby(['bins', 'Id'])['col_a'].sum().reset_index()
My step by step:
Step 1 and Starting Table
Step 2
Table after Step 2
Step 3
Here’s one approach using pandas
:
- Use
pd.read_csv
withparse_dates
to createDate_Time
indatetime
format:
df = pd.read_csv(StringIO(CSV), parse_dates=[['Date', 'Time']])
df.head()
Date_Time Id value
0 2024-09-01 01:00:00 11001 0.5
1 2024-09-01 02:00:00 11001 0.5
2 2024-09-01 03:00:00 11001 0.8
3 2024-09-01 04:00:00 11001 0.5
4 2024-09-01 05:00:00 11001 0.0
- Use
df.groupby
with ‘Id’ +pd.Grouper
and getgroupby.sum
:
out = (
df.groupby(
['Id', pd.Grouper(
key='Date_Time', freq='6H',
closed='right', label='right'
)]
)['value'].sum()
)
Output:
Id Date_Time
11001 2024-09-01 06:00:00 2.80
2024-09-01 12:00:00 1.90
11002 2024-09-01 06:00:00 2.10
2024-09-01 12:00:00 2.15
Name: value, dtype: float64
If you want the original Date
and Time
columns back, you could do something like this:
out = (
df.groupby(
['Id', pd.Grouper(
key='Date_Time', freq='6H',
closed='right', label='right'
)],
as_index=False
)['value'].sum()
)
out[['Date', 'Time']] = (out.pop('Date_Time')
.dt.strftime('%Y%m%d %H:%M:%S')
.str.split(' ', expand=True)
)
out = out.loc[:, ['Id', 'Date', 'Time', 'value']]
Output:
Id Date Time value
0 11001 20240901 06:00:00 2.80
1 11001 20240901 12:00:00 1.90
2 11002 20240901 06:00:00 2.10
3 11002 20240901 12:00:00 2.15
Note, incidentally, that pd.Grouper
will correctly handle range 19:00:00 - 00:00:00
(of the next day). E.g.:
# some values for range `19:00:00 - 00:00:00`
extra_data = [[pd.Timestamp('2024-09-01 22:00:00'),11001,1],
[pd.Timestamp('2024-09-01 23:00:00'),11001,2]]
# add to original `df`
df = pd.concat([df, pd.DataFrame(extra_data, columns=df.columns)],
axis=0, ignore_index=True)
# applying `out = (df.groupby(...)['value'].sum())`
out
Id Date_Time
11001 2024-09-01 06:00:00 2.80
2024-09-01 12:00:00 1.90
2024-09-02 00:00:00 3.00 # next day
11002 2024-09-01 06:00:00 2.10
2024-09-01 12:00:00 2.15
Name: value, dtype: float64
Data used
import pandas as pd
from io import StringIO
CSV = """Id,Date,Time,value
11001,20240901,01:00:00,0.5
11001,20240901,02:00:00,0.5
11001,20240901,03:00:00,0.8
11001,20240901,04:00:00,0.5
11001,20240901,05:00:00,0.0
11001,20240901,06:00:00,0.5
11001,20240901,07:00:00,0.2
11001,20240901,08:00:00,0.5
11001,20240901,09:00:00,0.1
11001,20240901,10:00:00,0.2
11001,20240901,11:00:00,0.2
11001,20240901,12:00:00,0.7
11002,20240901,01:00:00,0.1
11002,20240901,02:00:00,0.3
11002,20240901,03:00:00,0.1
11002,20240901,04:00:00,0.6
11002,20240901,05:00:00,0.9
11002,20240901,06:00:00,0.1
11002,20240901,07:00:00,0.15
11002,20240901,08:00:00,0.3
11002,20240901,09:00:00,0.2
11002,20240901,10:00:00,0.7
11002,20240901,11:00:00,0.5
11002,20240901,12:00:00,0.3"""
2