I have a Pandas dataset of phone call logs, from several caller numbers to several destination numbers. Calls are logged as soon as they are completed.
Main features:
- unix_timestamp (index)
- caller_number
- dest_number (US Toll Free)
- real_duration
Sample df
unix_ts | caller_number | dest_number | real_duration |
---|---|---|---|
1674567050.6435 | 16175962600.0 | 18448213248 | 435.0 |
1680545624.27747 | 14103914538.0 | 18775057141 | 497.0 |
1681923808.21773 | 19182890899.0 | 18006485010 | 132.0 |
1684535428.48401 | 15202197200.0 | 18883507446 | 450.0 |
1697056646.38694 | 13236327145.0 | 18888544812 | 390.0 |
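For reference, the sample above can be rebuilt with the snippet below (values copied from the table, unix_ts set as the index):
import pandas as pd
# sample data copied from the table above; unix_ts becomes the index
df = pd.DataFrame(
    {
        'unix_ts': [1674567050.6435, 1680545624.27747, 1681923808.21773,
                    1684535428.48401, 1697056646.38694],
        'caller_number': [16175962600.0, 14103914538.0, 19182890899.0,
                          15202197200.0, 13236327145.0],
        'dest_number': [18448213248, 18775057141, 18006485010,
                        18883507446, 18888544812],
        'real_duration': [435.0, 497.0, 132.0, 450.0, 390.0],
    }
).set_index('unix_ts')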
I need to identify the destination numbers that have limited the call duration, especially those where at least the last five calls have the same duration (a flat line).
The pseudo-code to identify the flat-line condition is as follows:
- Split stage: group by dest_number, without any calculations
- Apply stage: for each group frame after the split, add 5 new columns with shifted values of real_duration, calculate the mean and the std of the last 5 calls' real_duration across those columns (per row), then add a column ratio_pre = std / mean (identical durations give std ≈ 0, so a flat line shows up as a ratio_pre close to zero)
- Combine stage: the output is a df indexed by dest_number, with one column holding the last ratio_pre of each frame
After these transformations, the resulting dataframe should look like this:
dest_number | real_duration_sum | last_ratio_pre |
---|---|---|
18008259452 | 107278.0 | 2.1310975127055425 |
18773310081 | 94171.0 | 0.001336663795067271 |
18885436765 | 56977.0 | 0.03930810818390873 |
18009423141 | 40031.0 | 0.14554363343695811 |
18886450451 | 22803.0 | 0.9798228336647965 |
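Not covered by the pseudo-code: real_duration_sum in the expected output is, presumably, just the total call duration per destination, so the skeleton of the result (the df_dest frame that the code below writes into) can be built with a plain aggregation, for example:
# assumed starting point for df_dest: per-destination total duration, indexed by dest_number
df_dest = df.groupby('dest_number', sort=True).real_duration.sum().to_frame('real_duration_sum')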
After extensive research, the code with which I implemented the solution is as follows:
# create a pandas groupby object, without applying any calculation
df_dest_temp = df.groupby('dest_number', sort=True)
# add column last_ratio_pre to df_dest (indexed by dest_number) and fill it with ones as placeholders
df_dest['last_ratio_pre'] = 1.0
# define how many previous values to look at
control = 5
# iterate over the groupby components, and assign to df_dest one by one *** SUPER SLOW!! ***
for num, frame in df_dest_temp:
    # create a temp df (a copy, to avoid SettingWithCopyWarning)
    df_temp = frame[['dest_number', 'real_duration']].copy()
    # create 5 new columns that hold the shifted values of real_duration
    for i in range(control):
        df_temp['pre_' + str(i + 1)] = df_temp.real_duration.shift(i + 1)
    # select only the columns related to real_duration (current value plus the 5 shifted ones)
    cols = ['real_duration'] + [col for col in df_temp.columns if col.startswith('pre_')]
    # calculate mean of the current and previous real_durations, row-wise across those columns
    mean_pre = df_temp[cols].mean(axis=1)
    # calculate std of the current and previous real_durations, row-wise across those columns
    std_pre = df_temp[cols].std(axis=1)
    # assign results to new columns
    df_temp['mean_pre'] = mean_pre
    df_temp['std_pre'] = std_pre
    # calculate the ratio between std and mean
    df_temp['ratio_pre'] = std_pre / mean_pre
    # group the frame by dest_number and keep the last value of ratio_pre
    df_temp_grouped = df_temp.groupby('dest_number').ratio_pre.last()
    # assign the (single) result to df_dest
    df_dest.loc[num, 'last_ratio_pre'] = df_temp_grouped.iloc[0]
As can be seen, the code above solves the problem, but in a VERY slow way. I did not find a way to fit all the functionality into a lambda function, or to express it in a vectorized way.
I will certainly welcome comments, remarks and suggestions to make the code more efficient.
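For reference, the kind of vectorized direction I am hoping for is groupby combined with rolling, along these lines (a rough sketch, assuming df is sorted by unix_ts and df_dest is built as above; I have not verified that it reproduces the loop exactly):
control = 5
window = control + 1  # current call plus the 5 previous ones
# per-destination rolling mean/std of real_duration over the last `window` calls
g = df.sort_index().groupby('dest_number', sort=True).real_duration
roll_mean = g.rolling(window, min_periods=1).mean()
roll_std = g.rolling(window, min_periods=1).std()
# ratio per call, then keep only the most recent call of each destination
ratio_pre = roll_std / roll_mean
df_dest['last_ratio_pre'] = ratio_pre.groupby(level='dest_number').last()
The idea is that the rolling window of size control + 1 would play the role of the current value plus the 5 shifted columns, and the final groupby(...).last() would be the combine stage.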