I have very large table groupedby_x_y
(2B rows) with the columns:
date,
x,
y,
z_count
This is the result of counting this z
value from a more raw table and grouped by columns x
and y
. Turns out we are also computing from the underlying raw table another table groupedby_x
(500M rows) which has the columns:
date,
x,
z_count
The way I see it groupedby_x
should just be groupedby_x_y
with z_count
summed over y
:
recover_groupedby_x =
SELECT date, x, sum(z_count) as z_count from groupedby_x_y
group by x
But when I do this, the two tables groupedby_x
and recover_groupedby_x
are nearly identical but not quite – they differ by about 5M and I cannot figure out why. The groupedby_..
tables are perfectly and fully rectangular with no nulls or funny malformed data etc.
Is my idea just wrong or might I be missing something a little subtle here ?