I'm trying to combine two tables, each of which has a person ID column (a person can appear multiple times in each table). I would like to combine the two data sets so that each person_ID with a unique date of surgery is assigned all of their potential infection dates (made-up values, in table 2).
I have tried various joins in PySpark, but that has not gone well: the result tends to exclude too many rows.
I'm not sure whether there is a function for this or whether I need to write a loop over all the rows (>40k rows in each table).
example of the tables & desired outcome
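To make the desired outcome concrete, here is a minimal plain-Python sketch of the pairing I'm after. The column names (`surgery_date`, `infection_date`), the sample values, and the helper `combine` are all made up for illustration; only `person_ID` comes from my actual tables. Every surgery row is repeated once per infection date for the same person, and a person with no infection dates keeps their surgery row with an empty infection date.

```python
from collections import defaultdict

# Hypothetical sample rows; column names and values are assumptions.
surgeries = [
    {"person_ID": 1, "surgery_date": "2023-01-10"},
    {"person_ID": 1, "surgery_date": "2023-06-02"},
    {"person_ID": 2, "surgery_date": "2023-03-15"},
    {"person_ID": 3, "surgery_date": "2023-04-01"},
]
infections = [
    {"person_ID": 1, "infection_date": "2023-01-20"},
    {"person_ID": 1, "infection_date": "2023-06-30"},
    {"person_ID": 2, "infection_date": "2023-03-18"},
]


def combine(surgeries, infections):
    """Pair every surgery row with all infection dates of the same person."""
    # Group infection dates by person so each lookup is cheap.
    by_person = defaultdict(list)
    for row in infections:
        by_person[row["person_ID"]].append(row["infection_date"])

    combined = []
    for surgery in surgeries:
        # Keep surgeries without any infection date, using None as a placeholder.
        dates = by_person.get(surgery["person_ID"]) or [None]
        for date in dates:
            combined.append({**surgery, "infection_date": date})
    return combined


result = combine(surgeries, infections)
for row in result:
    print(row)
```

If I understand correctly, this is the shape a left join on `person_ID` should produce, so in PySpark I would expect something like `surgeries_df.join(infections_df, on="person_ID", how="left")` to behave the same way (the DataFrame names here are hypothetical).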