I have many dataframe like:
df1 df2 and so on...
gene | counts gene | counts
KRAS 136 KRAS 96
DNAH5 3 DNAH5 4
TP53 105 TP53 20
I want to merge them and sum the column ‘counts’ so I end with only one dataframe
merged_df
gene | counts
KRAS 232
DNAH5 7
TP53 125
I have tried to use pd.merge but it only accepts 2 dataframes at once, I have 14 dataframes. I used pd.concat for multiple dataframes but can’t sum them after.
You can try this :
dataframes = [df1, df2, df3] # Add your dataframes here
concatenated_df = pd.concat(dataframes)
result_df = concatenated_df .groupby('gene', as_index=False).sum()
Avoiding the need for df.groupby
(as here) or renaming columns (as here):
- Use a list comprehension to apply
df.set_index
to eachdf
- Pass the list to
pd.concat
withaxis=1
- Apply
df.sum
onaxis=1
anddf.reset_index
import pandas as pd
df1 = pd.DataFrame({'gene': ['KRAS', 'DNAH5', 'TP53'], 'counts': [136, 3, 105]})
df2 = pd.DataFrame({'gene': ['KRAS', 'DNAH5', 'TP53'], 'counts': [96, 4, 20]})
dfs = [df1, df2]
out = (
pd.concat([df.set_index('gene') for df in dfs], axis=1)
.sum(axis=1)
.reset_index(name='counts')
)
Output:
gene counts
0 KRAS 232
1 DNAH5 7
2 TP53 125
Indeed, pd.merge
only merges 2 dataframes. But pd.join
can join many, if they have the same index:
# Some example data. Note the None in `df3`. We want our code to handle that well.
df1 = pd.DataFrame({'gene': ['KRAS', 'DNAH5', 'TP53'], 'counts': [136, 3, 105]})
df2 = pd.DataFrame({'gene': ['KRAS', 'DNAH5', 'TP53'], 'counts': [96, 4, 20]})
df3 = pd.DataFrame({'gene': ['KRAS', 'DNAH5', 'TP53'], 'counts': [1000, None, 3000]})
dfs = [df1, df2, df3]
# We need the same index for the pd.DataFrame.join to work
dfs = [df.set_index('gene') for df in dfs]
# All the non-index columns need unique names, so chanding `columns` to `columns_0`, `columns_1`
dfs = [df.rename(columns={'counts': f'counts_{i}'}) for i, df in enumerate(dfs)]
# Actual join. We are joining first with the rest
df = dfs[0].join(dfs[1:], how='outer')
# Since we don't have any other data, we can just sum all columns.
df.sum(1)
This prints:
gene
KRAS 1232.0
DNAH5 7.0
TP53 3125.0
1
You probably want to use the df.groupby()
function in the end.
Here is how you do it:
First you df = df.concat([df1, df2, df3...])
your dataframes, so you end up with one big dataframe.
After that you need to use df = df.group_by('gene').sum().reset_index()
Basically you group all the rows based on the unique ‘gene’ column and then sum up all the ‘counts’ values for each ‘gene’.
Reset index just converts the grouped ‘gene’ values back into regular column.
You can try running each command one-by-one to understand what’s happening.