There are many answers out there to this question, but I couldn’t find one that applies to my case.
I have a dataframe that contains ID’s:
df = pd.DataFrame({"id": [0, 1, 2, 3, 4]})
Now, I query a REST API for each ID’s to get additional attributes that are returned to me as a dictionary:
{"id": 0, "values": {"first_name": "Bob", "last_name": "Smith"}}
What I want is to add the content of values
as additional columns to the matching row of the dataframe.
An important point is that, at each iteration, I may get different attributes, so I don’t know how many columns will be added in the end, or even their names. So, sometimes I need to add a column (which I would do with pd.concat
), but sometimes I need to set the value to an existing one.
id | first_name | last_name | something | something_else |
---|---|---|---|---|
0 | Bob | Smith | ||
… | ||||
4 |
Any thought?
Code
your example
import pandas as pd
df = pd.DataFrame({"id": [0, 1, 2, 3, 4]})
data = {"id": 0, "values": {"first_name": "Bob", "last_name": "Smith"}}
code
# make data dict to dataframe (For multiple dict, use concat)
d = pd.json_normalize(data)
# merge
out = df.merge(d, how='left')
out:
id values.first_name values.last_name
0 0 Bob Smith
1 1 NaN NaN
2 2 NaN NaN
3 3 NaN NaN
4 4 NaN NaN
If you want to remove value.
from a column name, you can use the rename
function or the str.replace
function.
I agree there are many ways out there to do this. You can do it this way with List Comprehension approach which will be Faster I guess, and will handle dynamic columns well.
import pandas as pd
df = pd.DataFrame({"id": [0, 1, 2, 3, 4]})
# Simulated API response
def get_api_data(id):
data = {
0: {"first_name": "Bob", "last_name": "Smith"},
1: {"first_name": "Alice", "something": "extra"},
2: {"last_name": "Jones", "something_else": "value"},
3: {"first_name": "Charlie", "age": 30},
4: {} # No data available for this ID
}
return data.get(id, {})
def update_dataframe(df):
all_columns = set()
data_list = []
for _, row in df.iterrows():
api_data = get_api_data(row['id'])
all_columns.update(api_data.keys())
row_data = row.to_dict()
row_data.update(api_data)
data_list.append(row_data)
result_df = pd.DataFrame(data_list, columns=list(all_columns) + list(df.columns))
return result_df
updated_df = update_dataframe(df)
print(updated_df)
Output:
id first_name last_name something something_else age
0 0 Bob Smith NaN NaN NaN
1 1 Alice NaN extra NaN NaN
2 2 NaN Jones NaN value NaN
3 3 Charlie NaN NaN NaN 30.0
4 4 NaN NaN NaN NaN NaN
Another possible solution, whose steps are:
-
First, it converts the dictionary
d
into a dataframe usingpd.DataFrame
. -
Then, it resets the index of this dataframe with
reset_index
. -
Next, it pivots the dataframe using
pivot
to reshape it, settingid
as the index,values
as the values, andindex
as the columns. -
After pivoting, it resets the index again and removes the axis name with
rename_axis
. -
Finally, it merges this transformed dataframe with the original dataframe
df
on theid
column usingmerge
.
df.merge(
pd.DataFrame(d).reset_index()
.pivot(index='id', values='values', columns='index')
.reset_index().rename_axis(None, axis=1), on='id', how='left')
Output:
id first_name last_name
0 0 Bob Smith
1 1 NaN NaN
2 2 NaN NaN
3 3 NaN NaN
4 4 NaN NaN