The merge
function in pandas supports a cross-join (how='cross'). I was hoping to
avoid a full cross-join when pairing each record of a large dataframe
with up to (say) a dozen records in another dataframe (possibly the
same dataframe), based on complex criteria involving a function of
multiple fields, not necessarily an equivalence relationship between
“key” columns.
Below is a simple example of matching based on inequality
relationships. I’ve commented out some of the input records to keep the
output small, but they can easily be added back in for
experimenting with stricter matching criteria:
import pandas as pd
# Input dataframe
df_in = pd.DataFrame([ [ 1, 1, 4, 1 ],
[ 3, 4, 2, 4 ],
[ 0, 2, 0, 0 ],
[ 3, 1, 4, 4 ],
[ 0, 4, 4, 3 ],
# [ 1, 4, 2, 2 ],
# [ 3, 2, 0, 2 ],
# [ 3, 3, 1, 1 ],
# [ 0, 2, 2, 2 ],
# [ 4, 1, 2, 4 ],
# [ 3, 1, 1, 2 ],
# [ 2, 2, 3, 2 ],
# [ 2, 1, 3, 1 ],
# [ 2, 4, 1, 2 ],
# [ 1, 0, 0, 3 ],
[ 2, 1, 1, 3 ],
[ 2, 0, 1, 1 ],
[ 4, 1, 2, 1 ],
[ 3, 2, 2, 4 ],
[ 3, 1, 0, 0 ] ],
columns=['A','B','C','D'] )
# Full cross-join
df_out = df_in.merge( df_in, how='cross' )
# Select a very small subset
df_out = df_out[ ( df_out.A_x < df_out.B_y ) &
                 ( df_out.C_x > df_out.D_y ) ]
    A_x  B_x  C_x  D_x  A_y  B_y  C_y  D_y
2     1    1    4    1    0    2    0    0
4     1    1    4    1    0    4    4    3
34    3    1    4    4    0    4    4    3
40    0    4    4    3    1    1    4    1
42    0    4    4    3    0    2    0    0
44    0    4    4    3    0    4    4    3
45    0    4    4    3    2    1    1    3
47    0    4    4    3    4    1    2    1
49    0    4    4    3    3    1    0    0
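One workaround I have considered is to build the cross product in blocks so that only a slice of it exists in memory at any one time; it caps memory but still does the full amount of comparison work. This is only a sketch (the helper name and chunk size are my own invention), repeating the uncommented rows from above:

```python
import pandas as pd

def filtered_cross_join(left, right, predicate, chunk=1000):
    """Cross-join in blocks of `chunk` left rows, filtering each
    block immediately, so at most chunk * len(right) merged rows
    are held in memory at once."""
    pieces = []
    for start in range(0, len(left), chunk):
        block = left.iloc[start:start + chunk].merge(right, how='cross')
        pieces.append(block[predicate(block)])
    return pd.concat(pieces, ignore_index=True)

df_in = pd.DataFrame([[1, 1, 4, 1], [3, 4, 2, 4], [0, 2, 0, 0],
                      [3, 1, 4, 4], [0, 4, 4, 3], [2, 1, 1, 3],
                      [2, 0, 1, 1], [4, 1, 2, 1], [3, 2, 2, 4],
                      [3, 1, 0, 0]], columns=['A', 'B', 'C', 'D'])

# Same criteria as above, expressed as a predicate on the merged block
matches = filtered_cross_join(
    df_in, df_in,
    lambda m: (m.A_x < m.B_y) & (m.C_x > m.D_y),
    chunk=3)
```

This trades peak memory for repeated merge overhead, so it addresses only half of my question.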
In SQL, I never needed to create a full cross-join before filtering
away the unwanted matches. I could incorporate the filtering criteria
into the WHERE clause of a join, essentially in place of an ON clause.
The complex join criterion could involve mathematical or string
functions, or even user-defined functions that I wrote in VBA (at
least in Microsoft Access).
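For comparison, here is roughly what I mean, sketched with Python’s built-in sqlite3 (the table name and the small inline dataframe are just for illustration): the inequality criteria go straight into the join condition, and the engine decides how to evaluate them without ever handing me a cross product.

```python
import sqlite3
import pandas as pd

df_in = pd.DataFrame([[1, 1, 4, 1], [3, 4, 2, 4], [0, 2, 0, 0],
                      [3, 1, 4, 4], [0, 4, 4, 3], [2, 1, 1, 3],
                      [2, 0, 1, 1], [4, 1, 2, 1], [3, 2, 2, 4],
                      [3, 1, 0, 0]], columns=['A', 'B', 'C', 'D'])

con = sqlite3.connect(':memory:')
df_in.to_sql('msgs', con, index=False)

# The non-equi criteria live in the ON clause of a self-join
df_sql = pd.read_sql_query("""
    SELECT x.A AS A_x, x.B AS B_x, x.C AS C_x, x.D AS D_x,
           y.A AS A_y, y.B AS B_y, y.C AS C_y, y.D AS D_y
    FROM msgs AS x JOIN msgs AS y
      ON x.A < y.B AND x.C > y.D
""", con)
```

This produces the same nine rows as the pandas filter above, but whether SQLite internally avoids the full scan is up to its optimizer.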
These joins didn’t take long despite the large number of records.
I strongly suspect that the query engine optimized the join so that
nothing close to a full cross-join was ever constructed.
Is there any way to incorporate complex join criteria into pandas
joins without the memory and time requirements of an intermediate
cross-join? How flexible can the criteria be, e.g., can they
incorporate arbitrary functions, including user defined functions?
Background
The dataframe consists of many messages with various sender IDs. The
IDs are typed in by hand, so the data is very dirty. I plan to use
Levenshtein distance to group together IDs that are likely to be the
same, but each ID needs to be tested against every other ID, even
though matches will only occur within small subsets of the IDs.
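To illustrate the kind of criterion I have in mind, here is a sketch of the ID-matching step (the IDs and threshold are made up, and the distance function is a plain dynamic-programming implementation rather than a library call):

```python
from itertools import combinations

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

ids = ['ab123', 'ab12e', 'xyz9', 'ab1z3']  # hypothetical sender IDs
pairs = [(s, t) for s, t in combinations(ids, 2)
         if levenshtein(s, t) <= 2]
```

Even this toy version makes the scaling problem obvious: combinations() walks every pair, which is exactly the quadratic work I’d like the join machinery to help me avoid.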
That is only the starting point for a matching criterion. The
messages contain travel destinations, various time stamps, and other
identifying data. The join criterion will be developed to make
use of matching conditions between other fields as well.