There is an Athena table in which two of the columns contain values as JSON data.
The requirement is to flatten these columns.
I am currently following the manual method of flattening the data:
SELECT
    names['first'] AS first_name,
    names['last'] AS last_name,
    department
FROM dataset
CROSS JOIN UNNEST(people) AS t (names)
But the data in these columns is huge (around 5,000 fields in a single JSON column), so the manual method of selecting each field individually is taking a lot of time.
Is there a faster alternative way to flatten the JSON column into individual columns?
This is a limitation of SQL. All columns must be known before a query runs; dynamic generation of columns is not supported in SQL (at least not in ANSI/standard SQL). Wildcards like SELECT * are allowed because the * can be expanded during query planning using the table schema.
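As a quick illustration (the column names here are just an assumption based on the example table defined further down), a wildcard query like

SELECT * FROM dataset

is effectively rewritten during planning into the explicit column list from the schema:

SELECT department, people FROM dataset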
I don’t know the schema of your table, which makes answering a bit tricky, but assuming you are querying JSON data, you can declare the people column as a struct and include all the properties. This will give the query planner enough information for your queries to be able to wildcard-expand all the property names.
For example:
CREATE EXTERNAL TABLE dataset (
    department string,
    people array<struct<first:string,last:string>>
)
…
With the table above, you can do this:
SELECT department, person.*
FROM dataset
CROSS JOIN UNNEST(people) AS t (person)
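To make the wildcard expansion concrete (this is just a sketch of what the planner does, assuming the struct definition above), person.* ends up being equivalent to listing the struct fields explicitly:

SELECT department, person.first, person.last
FROM dataset
CROSS JOIN UNNEST(people) AS t (person)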
Here is a working example that uses a CTE to provide the data. row is the same thing as struct; you just need to use the latter in DDL and the former in DML, since DDL uses Hive syntax and naming while DML uses Trino syntax and naming:
with
    example_data (people) as (
        values
            '[{"first":"f1","last":"l1"},{"first":"f2","last":"l2"}]',
            '[{"first":"f3","last":"l3"}]'
    ),
    parsed_data as (
        select cast(json_parse(people) as array(row(first varchar, last varchar))) as people
        from example_data
    )
select t.person.*
from parsed_data
cross join unnest(people) as t (person)
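For reference, given the example data above, I would expect the unnested output to be three rows:

first | last
------+-----
f1    | l1
f2    | l2
f3    | l3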
Now, will this be practical with 5000 columns? No. You will also run into other issues with that many columns. Anything above a couple of hundred columns risks running into internal limits. Query plans can’t be infinite in size, and the list of column names has a definite limit (which isn’t related to the number of columns, but to the memory usage of the compiled form, which means that long names and larger types reach the limit faster).