I have to make a data quality function that marks a row if a set of columns don’t have a one to one mapping column with another set of columns in the same table.
For example, consider the following table
> CREATE OR REPLACE TABLE t1
AS VALUES (1,2,4,5), (1,2,3,7), (1,2,NULL,5) AS (c1,c2,c3,c4);
> SELECT *
FROM t1
Output
c1 c2 c3 c4
1 2 4 5
1 2 3 7
1 2 Null 5
Now suppose the one to one mapping is in between (c1,c2) and (c3,c4). So I want to present something like –
‘One to one error (c1,c2) has multiple mappings to (c3,c4) -> (NULL,5), (4,5), (3,7)’
I do need the NULL string or atleast something to represent null value.
I have made it with three different ways with the difference being in the arguments of collect_set. Here are the three versions
All three have used the following CTE expression
WITH cte
AS (
SELECT c1, c2, COALESCE(c3, 'NULL') AS c3, COALESCE(c4, 'NULL') AS c4
FROM t1
)
First, using string in the collect_set
SELECT CONCAT('One to one error (c1,c2) has multiple mappings to (c3,c4) -> ',
ARRAY_JOIN(COLLECT_SET(CONCAT('(', c3,',', c4,')')),', '))
FROM cte
GROUP BY c1, c2
HAVING COUNT(DISTINCT c3, c4) > 1
Second, using struct in the collect_set
SELECT CONCAT('One to one error (c1,c2) has multiple mappings to (c3,c4) -> ',
ARRAY_JOIN(TRANSFORM(COLLECT_SET(STRUCT(c3,c4)), st -> CONCAT('(', st.c3, ',', st.c4, ')')),', '))
FROM cte
GROUP BY c1, c2
HAVING COUNT(DISTINCT c3, c4) > 1
Third is using array in the collect_set
SELECT CONCAT('One to one error (c1,c2) has multiple mappings to (c3,c4) -> ',
ARRAY_JOIN(TRANSFORM(COLLECT_SET(ARRAY(c3,c4)), ar -> CONCAT('(', ar[0], ',', ar[1], ')')),', '))
FROM cte
GROUP BY c1, c2
HAVING COUNT(DISTINCT c3, c4) > 1
Is there any difference in performance of these three? I think that struct would be better than using string in the collect_set as it has a structure and might help in comparing things faster.
Also, if possible can it be done in a better way?
The number of columns are not fixed in both sides and I will generate this query dynamically.