I have a table similar to the one below where I have a summary column with groups and an associated value separated by a colon where each separate group with its value is separated by a comma. (group1:value1,group2:value2…).
Area DTTM ID ID2 Summary
0 M 1/1/1960 A A 1:2980
1 M 3/8/2021 A B 1:2980
2 M 5/12/2024 A B 1:8732
3 M 1/1/1960 A C 1:8732
4 H 3/8/2021 B C 1:53174,9:2332
5 H 5/12/2024 B D 1:22017,9:118
6 H 1/1/1960 B A 1:4239, 6:184, 9:1243, 14:482
I am trying to separate the summary column into separate columns for each group (the number before the colons) and include the value (number after the colons) in each column for each row. How can I go about this? A sample output is show below:
Area DTTM ID ID2 Bin1 Bin6 Bin9 Bin14
0 M 1/1/1960 A A 2980 0 0 0
1 M 3/8/2021 A B 2980 0 0 0
2 M 5/12/2024 A B 8732 0 0 0
3 M 1/1/1960 A C 8732 0 0 0
4 H 3/8/2021 B C 53174 0 2332 0
5 H 5/12/2024 B D 22017 0 118 0
6 H 1/1/1960 B A 4239 184 1243 482
2
IIUC, extractall
and pivot
, then join
:
out = df.join(df['Summary'].str.extractall(r'(d+):(d+)')
.astype(int)
.droplevel(1).pivot(columns=0, values=1)
.fillna(0).astype(int)
.add_prefix('Bin')
).drop(columns='Summary')
Output:
Area DTTM ID ID2 Bin1 Bin6 Bin9 Bin14
0 M 1/1/1960 A A 2980 0 0 0
1 M 3/8/2021 A B 2980 0 0 0
2 M 5/12/2024 A B 8732 0 0 0
3 M 1/1/1960 A C 8732 0 0 0
4 H 3/8/2021 B C 53174 0 2332 0
5 H 5/12/2024 B D 22017 0 118 0
6 H 1/1/1960 B A 4239 184 1243 482
5
Option 1 (literal_eval
+ pd.json_normalize
)
from ast import literal_eval
df = pd.DataFrame(data)
out = (df.join(
pd.json_normalize(
('{' + df.pop('Summary') + '}').apply(literal_eval)
)
.fillna(0)
.astype(int)
.sort_index(axis=1)
.add_prefix('Bin')
)
)
Output:
Area DTTM ID ID2 Bin1 Bin6 Bin9 Bin14
0 M 1/1/1960 A A 2980 0 0 0
1 M 3/8/2021 A B 2980 0 0 0
2 M 5/12/2024 A B 8732 0 0 0
3 M 1/1/1960 A C 8732 0 0 0
4 H 3/8/2021 B C 53174 0 2332 0
5 H 5/12/2024 B D 22017 0 118 0
6 H 1/1/1960 B A 4239 184 1243 482
Explanation
- Add curly brackets to values of
df['Summary']
to turn them into dictionary strings. Applyliteral_eval
and pass topd.json_normalize
. Usedf.pop
to drop the original column. - Apply
df.fillna
and change the dtype toint
(df.astype
). - Now, sort the columns with
df.sort_index
and prefix ‘Bin’ (df.add_prefix
). - Finally, use
df.join
to attach the result todf
.
Option 2 (Series.str.extractall
+ df.pivot_table
)
df = pd.DataFrame(data)
out2 = df.join(
df.pop('Summary')
.str.extractall('(?P<col>d+):(d+)')
.astype(int)
.reset_index()
.pivot_table(index='level_0', columns='col', values=1,
aggfunc='first', fill_value=0)
.add_prefix('Bin')
)
Output:
out2.equals(out)
# True
Explanation
- Use
df.pop
as above, and applySeries.str.extractall
to get the digits before and after:
in two separate columns (naming the firstcol
), addingdf.astype
to get integers. Regex explained here. - Now, use
df.reset_index
and applydf.pivot_table
withaggfunc='first'
(thus benefiting fromfill_value
parameter absent fromdf.pivot
). - Finally, as above: add the prefix, and join to original
df
.
Data used
import pandas as pd
data = {'Area': {0: 'M', 1: 'M', 2: 'M', 3: 'M', 4: 'H', 5: 'H', 6: 'H'},
'DTTM': {0: '1/1/1960', 1: '3/8/2021', 2: '5/12/2024', 3: '1/1/1960',
4: '3/8/2021', 5: '5/12/2024', 6: '1/1/1960'},
'ID': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'B', 5: 'B', 6: 'B'},
'ID2': {0: 'A', 1: 'B', 2: 'B', 3: 'C', 4: 'C', 5: 'D', 6: 'A'},
'Summary': {0: '1:2980', 1: '1:2980', 2: '1:8732', 3: '1:8732',
4: '1:53174,9:2332', 5: '1:22017,9:118',
6: '1:4239, 6:184, 9:1243, 14:482'}
}
A possible solution:
summary = (
df.pop("Summary")
.str.split(r",s*", regex=True)
.explode()
.str.split(":", expand=True)
)
summary[0] = "Bin" + summary[0]
summary = summary.pivot(columns=0, values=1).fillna(0)
out = df.join(summary)
Area DTTM ID ID2 Bin1 Bin14 Bin6 Bin9
0 M 1/1/1960 A A 2980 0 0 0
1 M 3/8/2021 A B 2980 0 0 0
2 M 5/12/2024 A B 8732 0 0 0
3 M 1/1/1960 A C 8732 0 0 0
4 H 3/8/2021 B C 53174 0 0 2332
5 H 5/12/2024 B D 22017 0 0 118
6 H 1/1/1960 B A 4239 482 184 1243
I’d suggest you use pure python to process the strings in the Summary
column, and then reassign back to the original dataframe. You should get better performance, since these are strings.
from collections import defaultdict
from itertools import zip_longest
box = defaultdict(list)
keys = {}
for num, entry in zip(df.index, df.Summary):
split = entry.split(',')
for value in split:
value = value.strip().split(':')
key, val = value
box[num].append(val)
keys[f"Bin{key}"] = 1
values = zip_longest(*box.values(), fillvalue=0)
dictionary = dict(zip(keys, values))
(df
.drop(columns='Summary')
.assign(**dictionary)
)
Area DTTM ID ID2 Bin1 Bin9 Bin6 Bin14
0 M 1/1/1960 A A 2980 0 0 0
1 M 3/8/2021 A B 2980 0 0 0
2 M 5/12/2024 A B 8732 0 0 0
3 M 1/1/1960 A C 8732 0 0 0
4 H 3/8/2021 B C 53174 2332 0 0
5 H 5/12/2024 B D 22017 118 0 0
6 H 1/1/1960 B A 4239 184 1243 482
Alternativey, you can create the dataframe, convert the dtypes, befor assigning back to the original dataframe:
dictionary = pd.DataFrame(dictionary).astype(int)
(df
.drop(columns='Summary')
.assign(**dictionary)
)
Area DTTM ID ID2 Bin1 Bin9 Bin6 Bin14
0 M 1/1/1960 A A 2980 0 0 0
1 M 3/8/2021 A B 2980 0 0 0
2 M 5/12/2024 A B 8732 0 0 0
3 M 1/1/1960 A C 8732 0 0 0
4 H 3/8/2021 B C 53174 2332 0 0
5 H 5/12/2024 B D 22017 118 0 0
6 H 1/1/1960 B A 4239 184 1243 482