Tag Archive for apache-spark, pyspark, apache-spark-sql

Dynamic partition pruning between large and small table without extra filter conditions

I have a large partitioned fact table and a small dimension table.
The partition column of the large table is the key column of the dimension table.
I would like to use the small table to reduce the number of partitions I read from the large table.
The only condition is that each large table partition I read must have a corresponding record in the small table.
Both ‘INNER JOIN’ and ‘LEFT SEMI JOIN’ are acceptable here.
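
One possible workaround, sketched here under assumptions rather than taken from the question: since the dimension table is small, its keys can be collected to the driver and turned into a literal IN-list filter on the partition column, which Spark can generally use for ordinary static partition pruning. The function name `prune_fact_by_dim_keys` and its parameter names are hypothetical; only the standard DataFrame methods `select`, `distinct`, `collect`, `where` and `Column.isin` are relied on.

```python
def prune_fact_by_dim_keys(fact_df, dim_df, partition_col, key_col):
    """Keep only the fact rows whose partition value appears in the dimension table.

    fact_df and dim_df are expected to be PySpark DataFrames (assumption);
    partition_col is the partition column of the fact table and key_col is
    the key column of the dimension table.
    """
    # Collect the distinct dimension keys to the driver.
    # This assumes the dimension table is small enough to fit in driver memory.
    keys = [row[0] for row in dim_df.select(key_col).distinct().collect()]

    # Filter the fact table with a literal IN-list on its partition column.
    return fact_df.where(fact_df[partition_col].isin(keys))


# Hypothetical usage (paths and column names are placeholders):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   fact = spark.read.parquet("/path/to/fact")
#   dim = spark.read.parquet("/path/to/dim")
#   pruned = prune_fact_by_dim_keys(fact, dim, "part_col", "key")
#   pruned.explain()  # inspect the plan to check which partition filters are applied
```

This trades one extra small job (the `collect`) for a filter that is known at planning time. Whether it actually reduces the partitions scanned should be confirmed in the physical plan for the specific data source.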

Finding overlap in groups and sorting into new distinct groups

Initially I thought this was an easy problem, but I just can’t figure it out.
Here is a simplified example. I have 8 different people buying some items from a store. Afterwards I want to look at all the items and sort them into groups so that purchases that overlap, directly or through a chain of shared items, end up in the same new group.
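
The grouping described above can be read as a connected-components problem: two items belong together if some chain of purchases links them. A minimal sketch in plain Python using a union-find structure follows. The function name `merge_overlapping` and the sample baskets are invented for illustration and are not the asker's data. On Spark, the same idea would typically be expressed with a graph library such as GraphFrames and its connected-components routine.

```python
def merge_overlapping(baskets):
    """Merge item sets that share at least one item, transitively.

    baskets maps a buyer to an iterable of items. Returns a sorted list of
    sorted item groups, where each group is one connected component.
    """
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk up to the root
        # while shortening the path (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for items in baskets.values():
        items = list(items)
        for item in items:
            find(item)  # make sure single-item baskets are registered too
        for other in items[1:]:
            union(items[0], other)  # everything in one basket shares a group

    groups = {}
    for item in parent:
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical sample data: 8 buyers and the items each one bought.
sample_baskets = {
    "p1": ["apple", "bread"],
    "p2": ["bread", "cheese"],
    "p3": ["milk"],
    "p4": ["eggs", "flour"],
    "p5": ["flour", "milk"],
    "p6": ["soap"],
    "p7": ["soap", "towel"],
    "p8": ["apple"],
}
new_groups = merge_overlapping(sample_baskets)
```

With this sample, the eight baskets collapse into three groups: apple/bread/cheese, eggs/flour/milk and soap/towel. Each original basket can then be assigned to the group that contains any one of its items.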