I have two streams in Kafka topics that I need to join (Stream1.join(Stream2)) on a common key. I have already applied a watermark and window to both streams on the event-timestamp field, and I can see the joined results.
The use-case requirement is: when the join yields one-to-many records (a DataFrame), I have to pivot them into a single record right after the join.
Stream1:
Name  Standard  Eventtime
XXX   SSLC      2024-05-05:09:30:00AM
YYY   HSC       2024-05-04:09:35:00AM

Stream2:
Name  Subject  Eventtime
XXX   Sub1     2024-05-05:10:15:00AM
XXX   Sub2     2024-05-05:10:15:00AM
XXX   Sub3     2024-05-05:10:15:00AM
YYY   Sub1     2024-05-05:10:15:00AM
YYY   Sub2     2024-05-05:10:15:00AM
YYY   Sub3     2024-05-05:10:15:00AM
YYY   Sub4     2024-05-05:10:15:00AM
Step 1: joinedDF = stream1.join(stream2, "Name")
Step 2: resultDF = joinedDF.groupBy("Name").pivot("Subject").agg(first("Subject"))
Result:
Name  Sub1  Sub2  Sub3  Sub4
XXX   Sub1  Sub2  Sub3
YYY   Sub1  Sub2  Sub3  Sub4
I understand that the "multiple streaming aggregations not supported" limitation may arise at Step 2.
What is the best way to implement this as a stateful Spark Structured Streaming job?