My question is an expansion of the one answered here:
Adding two pandas dataframes
Assuming the same dataframes but with a new column containing strings.
import pandas as pd
df1 = pd.DataFrame([('Dog',1,2),('Cat',3,4),('Rabbit',5,6)], columns=['Animal','a','b'])
df2 = pd.DataFrame([('Dog',100,200),('Cat',300,400),('Rabbit',500,600)], columns=['Animal','a','b']
Using the solution would create this:
df_add = df1.add(df2, fill_value=0)
Out:
Animal a b
0 DogDog 101 202
1 CatCat 303 404
2 RabbitRabbit 505 606
A potential solution could be just to index the Animal column and then run the .add function and then unindex the animal column again. But is there a more simple way that just adjusts this formula df_add = df1.add(df2, fill_value=0)so that the following solution is given:
Out:
Animal a b
0 Dog 101 202
1 Cat 303 404
2 Rabbit 505 606
I tried df_add.iloc[:,1:] = df1.iloc[:,1:].add(df2.iloc[:,1:], fill_value=0) and it did not work.
3
Your question is not fully clear. Pandas performs addition operations after aligning the indexes (index + columns), thus if you want to ensure Dog is added to Dog irrespective of its position, setting the index is the way to go:
key = ['Animal']
out = df1.set_index(key).add(df2.set_index(key), fill_value=0).reset_index()
Output:
Animal a b
0 Dog 101 202
1 Cat 303 404
2 Rabbit 505 606
This will ensure that there is no mismatch. For example:
df1 = pd.DataFrame([('Cat',3,4),('Dog',1,2),('Parrot',8,9)], columns=['Animal','a','b'])
df2 = pd.DataFrame([('Dog',100,200),('Cat',300,400),('Rabbit',800,00)], columns=['Animal','a','b'])
key = ['Animal']
df_add = df1.set_index(key).add(df2.set_index(key), fill_value=0).reset_index()
Output:
Animal a b
0 Cat 303.0 404.0
1 Dog 101.0 202.0
2 Parrot 8.0 9.0
3 Rabbit 500.0 600.0
Now, if your DataFrames are already aligned. I.e. the Animals are in the same order and the DataFrame indices are identical. You could use quick trick to ignore the Animal column: set that of df2
as empty string:
df_add = df1.add(df2.assign(Animal=''))
Output:
Animal a b
0 Dog 101 202
1 Cat 303 404
2 Rabbit 505 606
This is however risky if you’re not fully sure if the animals and DataFrame indices are identical.
1
I would recommend you to keep the columns for the key and the aggregation as separate variables so the code would be prettier and more flexible:
key_columns = ['Animal']
sum_columns = ['a', 'b']
After that all you need to do is to concatenate the dataframes and aggregate them by the key columns:
pd.concat([df1, df2], ignore_index=True)
.fillna(0)
.groupby(key_columns)
.agg({col: 'sum' for col in sum_columns})
.reset_index()
In case if you don’t want to list all the columns in sum_columns
, you can calculate it from all column names in df1.columns
and key_columns
like this, for example:
key_columns = ['Animal']
sum_columns = list(set(df1.columns) - set(key_columns))
Enjoy!
PS: fillna(0)
fills NaN values with 0, reset_index()
restores Animal
as a column, because it became an index after groupby...agg
procedure.
2