I have a table in Bigquery with 1.4 million records. This table has 3 columns – AccNum which is primary key, AccDt which is a date field and the final column is Instances which has a random number ranging from 0 to 100,000. Out of 1.4M records, 1.1M records have ‘Instances’ populated, other 300k are null. I want to split this field into 3 segments – ‘High’ (Top 33%), ‘Medium’ (Next 33%) and ‘Low’ (Last 34%).
When calculating the split, I want to ignore the nulls, only records which has populated ‘Instances’ value must be considered. But for the final query, I need to get all records including the nulls. Currently, I’m using the following code:
WITH SegmentedData AS (
SELECT
AccNum,
AccDt,
Instances,
CASE
WHEN Instances IS NULL THEN 0
ELSE NTILE(3) OVER (ORDER BY Instances DESC)
END AS segment
FROM
table_name
)
SELECT
AccNum,
AccDt,
Instances,
CASE
WHEN Instances IS NULL THEN 'NA'
WHEN segment = 1 THEN 'High'
WHEN segment = 2 THEN 'Medium'
ELSE 'Low'
END AS segment_category
FROM
SegmentedData;
This solution almost works, but when considering the split, NTILE also considers ‘null’ values into the picture. I need to consider only 1.1million records when segmenting into the 3 categories, but when fetching final results, I need all 1.4million records. How do I achieve this?