I’m getting started with Dataform and working on a project where I need to transform a source table containing a JSON column into a flattened and aggregated table.
My goal:
First of all, i would like to extract the top-level keys from the JSON column in the source table.
Then, dynamically turn these keys into columns in a new table.
And to conclude, extract the corresponding values for these keys and structure the result as a table where:
- Columns: Represent the top-level keys from the JSON.
- Rows: Contain the corresponding values for each record.
At the moment, what I’ve done so far :
A SQL query to extract all unique keys from the source table using JSON_KEYS()
:
function get_keys(table_alias) {
return `WITH key_extraction AS (
SELECT DISTINCT
key
FROM (
SELECT DISTINCT
JSON_KEYS(event_data, 1) AS all_keys
FROM
${table_alias}
), UNNEST(all_keys) as key
)
SELECT key as keys_array
FROM key_extraction`;
}
This query retrieves a list of top-level keys from the JSON column.
-> I would like to use the output of this query inside my next function that dynamicaly generate parameters to build my flattened table.
The JavaScript function to dynamically generate a SQL query based on the extracted keys :
function generate_aggregated_queries(all_keys, table_alias) {
const keys = all_keys;
if (!Array.isArray(keys)) {
throw new Error('Keys must be an array :', toString(keys));
}
const queries = keys.map((key) => {
return `JSON_EXTRACT(event_data, '$.${key}') AS ${key.replace('-', '_')}`;
});
return `SELECT ${queries} FROM ${table_alias}`;
}
When I manually pass a static array of keys (for example [‘items’, ‘value’]) to this function in a new file like definitions/bronze_layer/json_keys.sqlx :
config {
type: "operations",
hasOutput: true,
schema: "bronze_layer",
name: "json_keys_staging"
}
${helpers.generate_aggregated_queries(['items', 'value'],ref("source_table"))}
I get this example output :
row | items | value |
---|---|---|
1 | [{item_id: “1234”, item_name: “test”, price: 10}, {item_id: “1234”, item_name: “test”, price: 10}] | 20 |
2 | [{item_id: “1234”, item_name: “test”, price: 10}, {item_id: “1234”, item_name: “test”, price: 10}] | 20 |
it works perfectly. The generated query produces a table with columns named after the keys, containing the corresponding values. But used keys in this example do not match all the keys retrieved in the get_keys() method because I wanna dynamise the ingestion of the keys.
I want to automate this process by dynamically passing the array of keys extracted from the get_keys() query into my JavaScript function. However, I’m stuck on how to use the output of the get_keys() query as an input for the generate_aggregated_queries() function.
Do you have any suggestions or ideas to help me solve this ?
Thank you in advance for your help! 🙏
Julien BUCHARD is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.