I have a table like
rowid | date | x | y |
and the values of x and y can range from -1 to 10, or be null.
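For concreteness, here is a minimal sketch of how such a table could be built in PySpark (the sample rows and the date format are just assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample rows: x and y are in [-1, 10] or null
df = spark.createDataFrame(
    [(1, "2024-01-01", -1, 3), (2, "2024-01-01", 0, None), (3, "2024-01-02", 10, 5)],
    ["rowid", "date", "x", "y"],
)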
If I want to get the frequency of each value from -1 to 10 occurring in both the x and y columns per date,
I think I have to use pivot, and because the values can range from -1 to 10 or be null,
the column names will be like
date | x_-1 | x_0 | ... | y_-1 | y_0 | ... | y_10 |
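For reference, a minimal sketch of how that wide layout could be produced, assuming PySpark (the stack-based unpivot and the key naming are just one possible way to write it, not necessarily what I should be doing):

from pyspark.sql.functions import col, concat_ws, count, expr

# unpivot x and y into (col_name, value) rows, build keys like "x_-1",
# then pivot once per date; rows with a null value collapse to a key of just "x" or "y"
long_df = df.select("date", expr("stack(2, 'x', x, 'y', y) as (col_name, value)"))
wide_df = (
    long_df
    .withColumn("key", concat_ws("_", col("col_name"), col("value").cast("string")))
    .groupBy("date")
    .pivot("key")
    .agg(count("*"))
)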
This approach is very slow, as I have lots of data (~TB per day), and the set of columns like x and y can grow, meaning I could have x, y, z, aa, bb, ... to pivot on in the future.
I thought about doing the aggregates and writing out the intermediate data, like

from pyspark.sql.functions import count

x_df = df.groupBy("date").pivot("x").agg(count("*"))
x_df.cache()
y_df = df.groupBy("date").pivot("y").agg(count("*"))
y_df.cache()
# then join x_df and y_df later on date
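For completeness, a sketch of that join step (assuming the pivoted value columns are prefixed first so the names from x_df and y_df don't collide):

x_wide = x_df.select("date", *[x_df[c].alias(f"x_{c}") for c in x_df.columns if c != "date"])
y_wide = y_df.select("date", *[y_df[c].alias(f"y_{c}") for c in y_df.columns if c != "date"])
result = x_wide.join(y_wide, on="date", how="outer")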
But it still ends up really slow even with caching.
What is the best way to handle this kind of data? Am I treating this big data too much like RDBMS data?