I have a table in a Snowflake SQL database table:
CREATE TABLE accounts (
ACCOUNT_ID VARCHAR,
FIRST_NAME VARCHAR,
LAST_NAME VARCHAR,
DATE_OF_BIRTH VARCHAR
EMAIL VARCHAR,
PHONE_NUMBER VARCHAR
);
INSERT INTO accounts VALUES
( 'A847290', 'BENJIMAN', 'SISKO', '2332-01-01', '[email protected]', '999999999' ),
( 'B234839', 'BENJIMAN', 'SISKO', NULL, '[email protected]', '999999999' ),
( 'C123594', 'BENJIMAN', 'SISKO', NULL, NULL, NULL ),
( 'A239865', 'BENJIMAN', 'SISKO', '2332-01-01', '[email protected]', NULL ),
( 'Z389202', 'BENJIMAN', 'SISKO', '2332-01-01', NULL, '999999999' );
I would like to use SQL to create an identifier for records which have more than 4 matches for each of the fields. For example, my expected output would be:
ACCOUNT_ID | ID
-----------|--------------
A847290 | I12304
B234839 | I12304
A239865 | I12304
Z389202 | I12304
The end goal is to be able to build a table that would group accounts to an identifier that would represent an individual of multiple accounts. I also expect there is a many to many relationship between account and individuals.
Thanks!
I have tried to:
- Create a self-cartesian join
- Filter records where the match number is > 3
- Create a hash of the values for a identifier
SELECT a1.ACCOUNT_ID
, a2.FIRST_NAME
, a2.LAST_NAME
, a2.DATE_OF_BIRTH
, a2.EMAIL
, a2.PHONE_NUMBER
, MD5(a2.FIRST_NAME, a2.LAST_NAME, a2.DATE_OF_BIRTH, a2.EMAIL, a2.PHONE_NUMBER) AS ID
FROM accounts a1
JOIN accounts a2
ON (IFF(a1.FIRST_NAME = a2.FIRST_NAME, 1, 0)
+ IFF(a1.LAST_NAME = a2.LAST_NAME, 1, 0)
+ IFF(a1.DATE_OF_BIRTH = a2.DATE_OF_BIRTH, 1, 0)
+ IFF(a1.EMAIL = a2.EMAIL, 1, 0)
+ IFF(a1.PHONE_NUMBER = a2.PHONE_NUMBER, 1, 0)) > 3
The problem is that in the example above each record will get a different ID as the match fields are not consistent even though there is a match > 3.
schone is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.