I have a Dataframe df1
with the columns. I need to compare the headers of columns in df1
with a list of headers from df2
df1 =['a','b','c','d','f']
df2 =['a','b','c','d','e','f']
I need to compare the df1
with df2
and if any missing columns, I need to add them to df1
with blank values.
I tried concat
and also append
and both didn’t work. with concat
, I’m not able to add the column e
and with append
, it is appending all the columns from df1
and df2
. How would I get only missing column added to df1
in the same order?
df1_cols = df1.columns
df2_cols = df2._combine_match_columns
if (df1_cols == df2_cols).all():
df1.to_csv(path + file_name, sep='|')
else:
print("something is missing, continuing")
#pd.concat([my_df,flat_data_frame], ignore_index=False, sort=False)
all_list = my_df.append(flat_data_frame, ignore_index=False, sort=False)
I wanted to see the results as
a|b|c|d|e|f - > headers
1|2|3|4||5 -> values
pandas.DataFrame.align
df1.align(df2, axis=1)[0]
- By default this does an
'outer'
join - By specifying
axis=1
we focus on columns - This returns a
tuple
of both an aligneddf1
anddf2
with the calling dataframe being the first element. So I grab the first element with[0]
pandas.DataFrame.reindex
df1.reindex(columns=df1.columns | df2.columns)
- You can treat
pandas.Index
objects like sets most of the time. Sodf1.columns | df2.columns
is the union of those two index objects. I thenreindex
using the result.
Lets first create the two dataframes as:
import pandas as pd, numpy as np
df1 = pd.DataFrame(np.random.random((5,5)), columns = ['a','b','c','d','f'])
df2 = pd.DataFrame(np.random.random((5,7)), columns = ['a','b','c','d','e','f','g'])
Now add those columns of df2 to df1 (with nan values), which are not in df1:
for i in list(df2):
if i not in list(df1):
df1[i] = np.nan
Now display the columns of df1 alphabetically:
df1 = df1[sorted(list(df1))]