I have a df which has 5 columns: Region, Country, AREA_CODE, AREA_SUB_CODE_1 and AREA_SUB_CODE_2.
Region   Country   AREA_CODE   AREA_SUB_CODE_1   AREA_SUB_CODE_2
================================================================
AMER     US        A1          A1_US_1           A1_US_2
AMER     CANADA    A1          A1_CA_1           A1_CA_2
AMER     US        B1          B1_US_1           B1_US_2
AMER     US        A1          A1_US_1           A1_US_2
Is there a way to get both AREA_SUB_CODE_1 and AREA_SUB_CODE_2 as lists nested under each of the preceding column values, something like the below?
{
    "AREA_SUB_CODE_1": {
        "AMER": {
            "US": {
                "A1": ["A1_US_1"],
                "B1": ["B1_US_1"]
            },
            "CANADA": {
                "A1": ["A1_CA_1"]
            }
        }
    },
    "AREA_SUB_CODE_2": {
        "AMER": {
            "US": {
                "A1": {"A1_US_1": ["A1_US_2"]},
                "B1": {"B1_US_1": ["B1_US_2"]}
            },
            "CANADA": {
                "A1": {"A1_CA_1": ["A1_CA_2"]}
            }
        }
    }
}
So far I have tried a groupby involving three columns, which works:
tTmp = {}
for (k1, k2), v in df.groupby(['Region', 'Country'])['AREA_CODE']:
    tTmp.setdefault(k1, {})[k2] = sorted(v.unique())
But when I try to involve four columns, it throws ValueError: too many values to unpack (expected 2):
for (k1, k2), v in df.groupby(['Region', 'Country', 'AREA_CODE'])['AREA_SUB_CODE_1']:
    tTmp.setdefault(k1, {})[k2] = sorted(v.unique())
How can I apply groupby across four or five columns? Or is there any other way to achieve this?
I think we can achieve this with the following recursive function:
f = lambda s: ({k: f(s[k]) for k in s.index.levels[0]}
               if s.index.nlevels > 1
               else {k: s.loc[[k]].unique().tolist()
                     for k in s.index.unique()})
Here, s is expected to be a pandas.Series with a hierarchical index. At each indexing level, we map the keys to the corresponding depth of the resulting dictionary. At the last level, we extract the unique values into a list. The double square brackets in s.loc[[k]] ensure the output is a Series, the following unique method returns a numpy.ndarray with the unique values of the series, and tolist converts the array into a Python list.
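To sanity-check the idea, here is a tiny self-contained run on a toy two-level series (the data here is made up purely for illustration):
import pandas as pd

s = pd.Series([1, 1, 2],
              index=pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'x'), ('a', 'y')]))

f = lambda s: ({k: f(s[k]) for k in s.index.levels[0]}
               if s.index.nlevels > 1
               else {k: s.loc[[k]].unique().tolist()
                     for k in s.index.unique()})

print(f(s))  # {'a': {'x': [1], 'y': [2]}}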
If we know there’s exactly one unique value at the final level, we can simplify the function:
f = lambda s: ({k: f(s[k]) for k in s.index.levels[0]}
               if s.index.nlevels > 1
               else s.to_dict())
In this case, we skip creating a list at the end. But if needed, we can wrap each value in a list with an additional mapping: s.map(lambda x: [x]).to_dict().
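For completeness, that list-wrapping variant spelled out in full would look like this (a sketch following the same pattern as the function above):
f = lambda s: ({k: f(s[k]) for k in s.index.levels[0]}
               if s.index.nlevels > 1
               else s.map(lambda x: [x]).to_dict())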
Before applying either of the functions above, we have to transform the data into a properly indexed series:
inner = ['Region', 'Country', 'AREA_CODE']
values = df.melt(inner).set_index(['variable', *inner]).squeeze()
Here, 'variable' is the default name of the new column that holds the remaining column names (those not in the inner list) after melting. The final answer is f(values).
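To see the shape of this intermediate series, here is a quick self-contained check on the question's sample data (the full example follows below):
import pandas as pd

df = pd.DataFrame({
    'Region': ['AMER', 'AMER', 'AMER', 'AMER'],
    'Country': ['US', 'CANADA', 'US', 'US'],
    'AREA_CODE': ['A1', 'A1', 'B1', 'A1'],
    'AREA_SUB_CODE_1': ['A1_US_1', 'A1_CA_1', 'B1_US_1', 'A1_US_1'],
    'AREA_SUB_CODE_2': ['A1_US_2', 'A1_CA_2', 'B1_US_2', 'A1_US_2']})

inner = ['Region', 'Country', 'AREA_CODE']
values = df.melt(inner).set_index(['variable', *inner]).squeeze()

print(values.index.nlevels)  # 4
print(values.index.names)    # ['variable', 'Region', 'Country', 'AREA_CODE']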
Let’s see the example:
df = pd.DataFrame({
'Region': ['AMER', 'AMER', 'AMER', 'AMER'],
'Country': ['US', 'CANADA', 'US', 'US'],
'AREA_CODE': ['A1', 'A1', 'B1', 'A1'],
'AREA_SUB_CODE_1': ['A1_US_1x', 'A1_CA_1', 'B1_US_1', 'A1_US_1y'],
'AREA_SUB_CODE_2': ['A1_US_2', 'A1_CA_2', 'B1_US_2', 'A1_US_2']})
f = lambda s: ({k: f(s[k]) for k in s.index.levels[0]}
               if s.index.nlevels > 1
               else {k: s.loc[[k]].unique().tolist()
                     for k in s.index.unique()})
inner = ['Region', 'Country', 'AREA_CODE']
values = df.melt(inner, var_name='sub_code').set_index(['sub_code', *inner]).squeeze()
answer = f(values)
Note that in this example we have 2 different values for the key set ('AREA_SUB_CODE_1', 'AMER', 'US', 'A1') and 2 equal ones for the key set ('AREA_SUB_CODE_2', 'AMER', 'US', 'A1'), so the latter ends up as a list with a single value in the final answer:
{'AREA_SUB_CODE_1': {'AMER': {'CANADA': {'A1': ['A1_CA_1']},
'US': {'A1': ['A1_US_1x', 'A1_US_1y'],
'B1': ['B1_US_1']}}},
'AREA_SUB_CODE_2': {'AMER': {'CANADA': {'A1': ['A1_CA_2']},
'US': {'A1': ['A1_US_2'], 'B1': ['B1_US_2']}}}}
If we drop the last record in the example data, then we can use the alternative function with s.to_dict()
at the end.
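A sketch of that variant, reusing df and inner from the example above (the function is renamed to g here just to keep both versions around):
g = lambda s: ({k: g(s[k]) for k in s.index.levels[0]}
               if s.index.nlevels > 1
               else s.to_dict())

values2 = (df.iloc[:-1]  # drop the last record, so every full key is unique
             .melt(inner, var_name='sub_code')
             .set_index(['sub_code', *inner])
             .squeeze())
answer2 = g(values2)
# the leaves are now bare strings, e.g. {'A1': 'A1_CA_1'} instead of {'A1': ['A1_CA_1']}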
You could melt, then use a single groupby and a recursive defaultdict:
from collections import defaultdict
d = lambda: defaultdict(d)
out = d()
for (k1, k2, k3), g in (df.melt(['Region', 'Country', 'AREA_CODE'])
                          .set_index('AREA_CODE')
                          .groupby(['variable', 'Region', 'Country'])
                          [['value']]):
    out[k1][k2][k3] = g.T.to_dict('list')
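The trick here is the recursive defaultdict: every missing key materializes another defaultdict, so arbitrarily deep paths can be assigned in one go. A minimal illustration:
from collections import defaultdict

d = lambda: defaultdict(d)
demo = d()
demo['a']['b']['c'] = 1     # intermediate levels are created on the fly
print(demo['a']['b']['c'])  # 1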
Then, if desired, convert the nested defaultdict to a regular dict using this recipe:
def default_to_regular(d):
    if isinstance(d, defaultdict):
        d = {k: default_to_regular(v) for k, v in d.items()}
    return d
out = default_to_regular(out)
Output:
{'AREA_SUB_CODE_1': {'AMER': {'CANADA': {'A1': ['A1_CA_1']},
'US': {'A1': ['A1_US_1'],
'B1': ['B1_US_1']}}},
'AREA_SUB_CODE_2': {'AMER': {'CANADA': {'A1': ['A1_CA_2']},
'US': {'A1': ['A1_US_2'],
'B1': ['B1_US_2']}}}}
Same logic using a recursive function and nested groupby (might be less efficient):
def recurs_grp(df, keys):
    if len(keys) > 1:
        return {k: recurs_grp(g, keys[1:]) for k, g in df.groupby(keys[0])}
    else:
        return {k: g.T.to_dict('list')
                for k, g in df.groupby(keys[0])[['value']]}

out = recurs_grp(df.melt(['Region', 'Country', 'AREA_CODE'])
                   .set_index('AREA_CODE'),
                 ['variable', 'Region', 'Country'])
If you don’t need the hierarchical dict structure, you could try:
grouped = df.groupby(['Region', 'Country', 'AREA_CODE'])
pd.concat([grouped.AREA_SUB_CODE_1.apply(list), grouped.AREA_SUB_CODE_2.apply(list)], axis=1)
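Equivalently, the same flat result can be written with a single agg call using named aggregation (a sketch; this syntax is available in pandas 0.25+):
out = df.groupby(['Region', 'Country', 'AREA_CODE']).agg(
    AREA_SUB_CODE_1=('AREA_SUB_CODE_1', list),
    AREA_SUB_CODE_2=('AREA_SUB_CODE_2', list),
)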