I have two dataframes created from ingested JSON data, with the schemas below:
provider:
{
    npi: "...",
    name: "...",
    location: {
        address: "...",
        insurances: ["...", "..."],
        ...
    },
    ...
}
insurance:
{
    id: ...,
    ...
}
I would like to join the provider dataframe to the insurance dataframe where provider.location.insurances contains insurance.id, and add the matching insurances as a new array field named insurances. Is this possible?
So the resulting data structure would be something like this:
{
    npi: "...",
    name: "...",
    location: {
        address: "...",
        insurances: [123, ...],
        ...
    },
    insurances: [{id: 123, ...}, ...]
}
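To make the intended semantics concrete, here is a minimal plain-Python sketch on made-up records. The sample values and the helper name attach_insurances are hypothetical and only for illustration; the real data lives in Spark dataframes and has more fields than shown.

```python
# Hypothetical sample records; all values are invented for illustration.
providers = [
    {"npi": "1", "name": "A", "location": {"address": "x", "insurances": [123, 456]}},
    {"npi": "2", "name": "B", "location": {"address": "y", "insurances": [456]}},
]
insurances = [
    {"id": 123, "carrier_name": "Carrier A"},
    {"id": 456, "carrier_name": "Carrier B"},
    {"id": 789, "carrier_name": "Carrier C"},
]


def attach_insurances(providers, insurances):
    """Return copies of the providers with a top-level 'insurances' list added."""
    # Index insurance records by id so each lookup is a dictionary access.
    by_id = {ins["id"]: ins for ins in insurances}
    result = []
    for provider in providers:
        # Keep only the ids that actually exist in the insurance data.
        matched = [
            by_id[ins_id]
            for ins_id in provider["location"]["insurances"]
            if ins_id in by_id
        ]
        # location.insurances stays untouched; the full records go in a new field.
        result.append({**provider, "insurances": matched})
    return result


joined = attach_insurances(providers, insurances)
```

Each provider keeps its original location.insurances id list, and the new insurances field holds the full insurance records for those ids.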
I was able to accomplish this with the code below, but please let me know if you see a more memory-efficient way to do it. Thanks.
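from pyspark.sql import functions as F
# Pack the insurance columns into one struct so they can be collected per provider (uuid is the insurance id here)
insuranceDF = insuranceDF.withColumn("insurance", F.struct("uuid", "carrier_name"))
joinDF = providerDF.join(insuranceDF, F.expr("array_contains(location.insurances, uuid)")).groupBy("npi", "location").agg(F.collect_list("insurance").alias("insurances"))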