I’m new to Python and have been struggling with a data transformation.
I have a dataframe with data as below.
<code>UserId PurchaseCnt
u1 Bread:6, Milk:11
u2 Water:3
</code>
I want to translate this into a dataframe as follows. How would I do that?
<code>UserId Bread Milk Water
u1 6 11 0
u2 0 0 3
</code>
I need to do this for a large amount of data, so the code needs to be reasonably efficient.
What Python code would do this?
Code
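<code>import re
import pandas as pd
# Capture each "item:count" pair, e.g. "Bread:6"
pat = r'(\w+):(\d+)'
# One {item: count} dict per row
s = df['PurchaseCnt'].map(lambda x: dict(re.findall(pat, x)))
# Expand the dicts into columns; items missing from a row become NaN
out = df[['UserId']].join(pd.json_normalize(s).astype('float'))
</code>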
out:
<code> UserId Bread Milk Water
0 u1 6.0 11.0 NaN
1 u2 NaN NaN 3.0
</code>
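The question asks for zeros rather than NaN. A minimal sketch, assuming an absent item means a count of 0 and that every count is a whole number, fills the gaps and casts to integers. The names `wide` and `out_int` are only illustrative, and the dicts are passed as a plain list:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'UserId': ['u1', 'u2'],
    'PurchaseCnt': ['Bread:6, Milk:11', 'Water:3'],
})

pat = r'(\w+):(\d+)'
# One {item: count} dict per row
s = df['PurchaseCnt'].map(lambda x: dict(re.findall(pat, x)))

# Expand the dicts into columns; items missing from a row become NaN
wide = pd.json_normalize(s.tolist()).astype('float')

# Assumption: a missing item means 0 purchases, and counts are integers
out_int = df[['UserId']].join(wide.fillna(0).astype(int))
print(out_int)
```

This joins on the default 0..n-1 index, so if `df` has a different index it may be worth calling `reset_index(drop=True)` first.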
Example Code
<code>import pandas as pd
data = {
'UserId': ['u1', 'u2'],
'PurchaseCnt': ['Bread:6, Milk:11', 'Water:3']
}
df = pd.DataFrame(data)
</code>
IIUC, use str.extractall to separate the item name and number, then pivot. Optionally fillna with zeros:
<code>out = (df.drop(columns='PurchaseCnt')
         .join(df['PurchaseCnt']
               .str.extractall(r'(?P<col>\w+):(\d+)')
               .droplevel('match').astype({1: 'Int64'})
               .pivot(columns='col', values=1)
               .fillna(0)
               )
       )
</code>
Output:
<code> UserId Bread Milk Water
0 u1 6 11 0
1 u2 0 0 3
</code>
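For comparison, here is a regex-free sketch of the same reshaping using plain string splits plus explode. It assumes the pairs are always comma-separated and each pair is exactly `item:count`; the names `pairs`, `parts`, `wide` and `out_alt` are illustrative. Whether it is faster than extractall on a large frame would need to be timed on the real data.

```python
import pandas as pd

df = pd.DataFrame({
    'UserId': ['u1', 'u2'],
    'PurchaseCnt': ['Bread:6, Milk:11', 'Water:3'],
})

# One "item:count" string per row; the original row index is repeated
pairs = df['PurchaseCnt'].str.split(',').explode().str.strip()

# Split each pair into an item column and a count column
parts = pairs.str.split(':', expand=True)
parts.columns = ['item', 'count']
parts['count'] = parts['count'].astype(int)

# Items become columns; assumption: a missing item means 0 purchases
wide = parts.pivot(columns='item', values='count').fillna(0).astype(int)

out_alt = df[['UserId']].join(wide)
print(out_alt)
```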