I am working with survey response data, where a user could have taken the survey multiple times. Users have a “weight” assigned to them, which is an integer. I am trying to calculate the sum of user_weight by city, making sure I don’t double count the user_weight for users who have taken the survey multiple times.
Using the example data I provided here, note that each of the users here took the survey twice.
I can calculate the correct city_weight using a two-pass approach. Below is snowflake sql code using a CTE that de-dupes the (city, user, user_weight) tuples first and then sums up the user_weight by city. This approach works, and I get the correct weights (of 150 for chicago, and 55 for denver).
Is there a one-pass approach to solve this problem?
create or replace temp table temp as
select 1 as response_id, 1 as user_id, 50 as user_weight, 'chicago' as city
union select 2, 1, 50, 'chicago'
union select 3, 2, 100, 'chicago'
union select 4, 2, 100, 'chicago'
union select 5, 3, 30, 'denver'
union select 6, 3, 30, 'denver'
union select 7, 4, 25, 'denver'
union select 8, 4, 25, 'denver'
;
with base as (
select
distinct
city,
user_id,
user_weight
from
temp
)
select
city,
sum(user_weight) as city_weight
from
base
group by city
;