I see lots of similar questions but not quite my situation.
I have 2 dataframes, each with many many columns. A very simplified reproduction is:
df1:
serial_num name
0 39jr93j Phyl
1 <NA> Gil
2 21pr12n Ann
df2:
serial_num name alternate_nombre different_name nm_alternate
0 5c2v40l Jill Jyl. Jill Smith. Gil Smyth
1 21pr12n Karen Caren. Karen Ann. Caryn Anne
2 39jr93j Phyllis Fillus Phil Phyllus
Desired result:
serial_num name
0 39jr93j Phyl
1 5c2v40l Gil
2 21pr12n Ann
DF1 is missing some of the serial numbers but the only name column in df1
may not be the same spelling used in df2
. I want to look up all the rows that are missing df1['serial_num']
in df2
to fill it in. But in df2
, there are a significant number of alternate name columns such that it’s not feasible to itemize which columns to look in. And the name in df1
may be a substring of a name/alternate name in df2
. So I want to look in the whole df2
as though the df1['name']
was a substring.
Then I want to fill in the corresponding df1['serial_num']
with the df2['serial_num']
.
I can’t wrap my head around how to do this without iterating over the rows.
I have created a subset of rows that are missing df['serial_num']
with the following:
missing_serials = df1[df1['serial_num'].isnull()]
I see some solutions for searching a whole dataframe for a substring looking like this:
df2[df2.apply(lambda r: r.str.contains('specific static substring', case=False).any(), axis=1)]
But I can’t seem to figure out how to replace the 'specific static substring'
with my missing_serials['name']
without iterating through the rows or how to combine my missing_serials
and df2
when I can’t predict which column will match. I’m starting to wonder if using the missing_serials
sub-dataframe is even the right way to go because I don’t know how i’ll get the values back into df1
. Maybe there’s some kind of df1['serial_num'].fillna
approach that’s better?