As part of a project I’m working on – I came across an interesting phenomena while using bq console.
consider given SQL:
WITH
users AS (
SELECT
*
FROM
my-gcp-proj.my_dataset.users
WHERE
MOD(id, 97) != 0),
teams_events AS (
SELECT
*
FROM
my-gcp-proj.my_dataset.events
-- my_dataset.events
WHERE
team_id IS NOT NULL),
helper1 AS (
SELECT
users.id AS user_id,
timestamp,
event_name,
event_name = 'Cartoon' AS cartoon
FROM
users
INNER JOIN
my_dataset.teams
ON
teams.id = users.team_id
LEFT JOIN
teams_events
ON
teams_events.team_id = teams.id
AND timestamp >= '2023-06-01 00:00:00'
AND timestamp < '2023-11-01 00:00:00' ),
helper2 AS (
SELECT
user_id AS entity_id,
timestamp,
event_name,
event_name = 'activation' AS activation,
NULL AS cartoon
FROM
my_dataset.events
WHERE
timestamp >= '2023-06-01 00:00:00'
AND timestamp < '2023-11-01 00:00:00'
UNION ALL
SELECT
user_id AS entity_id,
timestamp,
event_name,
NULL AS activation,
cartoon
FROM
helper1
WHERE
timestamp >= '2023-06-01 00:00:00'
AND timestamp < '2023-11-01 00:00:00' ),
helper3 AS (
SELECT
users.id AS entity_id,
SUM(CASE
WHEN activation OR cartoon THEN 1
ELSE 0
END
) AS custom_signal
FROM
users
LEFT JOIN
helper2
ON
users.id = helper2.entity_id
GROUP BY
1 )
SELECT
*
FROM
helper3
The amount of bytes will be processed is 128.07 MB
But, If I switch the table name being used in line 13, the amount of bytes will be processed changes to 120.2 MB.
Mind that the tables being used are equal!
One variation is using the gcp project as prefix, the other does not.
The 2 SQLs return same output.
Can someone explain how this is possible?
and perhaps what Can I do to avoid such difference for next time?
(I do understand this SQL vague – but just wanted to supply the most accurate example I can)