I am trying to flatten a JSON response that has nested arrays of objects with a 1:M mapping. Please refer to the JSON response below.
{
    "result": [
        {
            "data": [
                {
                    "dimensions": ["SYNTHETIC-1"],
                    "timestamps": [17064756000, 17064752000, 17064751000, 17064750000],
                    "values": [null, null, 100, 99.6354]
                },
                {
                    "dimensions": ["SYNTHETIC-2"],
                    "timestamps": [17064755000, 17064754000, 17064753000, 17064752000],
                    "values": [101, null, 100, 99.6354]
                }
            ]
        }
    ]
}
The above JSON response is the output of a web activity in Azure Synapse. I used a Data Flow to flatten this JSON, unrolling first by 'result' and then by 'data'. That gives me the dimensions[], timestamps[] and values[] arrays, but the output is not right: instead of pairing each timestamp with its value by position, it produces a cross product. The output I am getting is below:
dimensions timestamps values
SYNTHETIC-1 17064756000 null
SYNTHETIC-1 17064756000 null
SYNTHETIC-1 17064756000 100
SYNTHETIC-1 17064756000 99.6354
SYNTHETIC-1 17064752000 null
SYNTHETIC-1 17064752000 null
SYNTHETIC-1 17064752000 100
SYNTHETIC-1 17064752000 99.6354
SYNTHETIC-1 17064751000 null
SYNTHETIC-1 17064751000 null
SYNTHETIC-1 17064751000 100
SYNTHETIC-1 17064751000 99.6354
SYNTHETIC-1 17064750000 null
SYNTHETIC-1 17064750000 null
SYNTHETIC-1 17064750000 100
SYNTHETIC-1 17064750000 99.6354
SYNTHETIC-2 17064755000 101
SYNTHETIC-2 17064755000 null
SYNTHETIC-2 17064755000 100
SYNTHETIC-2 17064755000 99.6354
SYNTHETIC-2 17064754000 101
SYNTHETIC-2 17064754000 null
SYNTHETIC-2 17064754000 100
SYNTHETIC-2 17064754000 99.6354
SYNTHETIC-2 17064753000 101
SYNTHETIC-2 17064753000 null
SYNTHETIC-2 17064753000 100
SYNTHETIC-2 17064753000 99.6354
SYNTHETIC-2 17064752000 101
SYNTHETIC-2 17064752000 null
SYNTHETIC-2 17064752000 100
SYNTHETIC-2 17064752000 99.6354
The output I am expecting is like below:
dimensions timestamps values
SYNTHETIC-1 17064756000 null
SYNTHETIC-1 17064752000 null
SYNTHETIC-1 17064751000 100
SYNTHETIC-1 17064750000 99.6354
SYNTHETIC-2 17064755000 101
SYNTHETIC-2 17064754000 null
SYNTHETIC-2 17064753000 100
SYNTHETIC-2 17064752000 99.6354
Can someone please help me to achieve this?
TIA.
Currently, Data Flow doesn't have a built-in feature to flatten parallel nested arrays element-wise like this. In your case, you can try the workaround below, using a combination of flatten and join transformations.
For some reason, your timestamps array gave me only null values when typed as an integer array, so I set it to a string array in the source dataset schema. Change this type back as per your requirement.
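For context, the result you are after is an element-wise zip of the parallel timestamps and values arrays, not a cross join. A minimal plain-Python sketch of that target output (illustrative only, not a Synapse API):

```python
# Sample response in the same shape as the web activity output.
response = {
    "result": [{
        "data": [
            {"dimensions": ["SYNTHETIC-1"],
             "timestamps": [17064756000, 17064752000, 17064751000, 17064750000],
             "values": [None, None, 100, 99.6354]},
            {"dimensions": ["SYNTHETIC-2"],
             "timestamps": [17064755000, 17064754000, 17064753000, 17064752000],
             "values": [101, None, 100, 99.6354]},
        ]
    }]
}

rows = []
for result in response["result"]:
    for entry in result["data"]:
        dim = entry["dimensions"][0]  # single-element array in this response
        # zip pairs timestamps[i] with values[i] instead of crossing them
        for ts, val in zip(entry["timestamps"], entry["values"]):
            rows.append((dim, ts, val))

print(rows[:2])  # [('SYNTHETIC-1', 17064756000, None), ('SYNTHETIC-1', 17064752000, None)]
```

This is the pairing the flatten-and-join chain below reproduces inside the Data Flow.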
First, use a flatten transformation to flatten the data array and surface the inner arrays as columns; use rule-based mapping for this. Then chain the transformations as follows:
- Use a flatten transformation flatten2 to flatten the dimensions array. In the flatten, give the dimensions array for both the Unroll by and Unroll root options. It will flatten this array.
- Then take a select transformation select2 and remove the timestamps column.
- Take a new branch from flatten2, add another select transformation select1, and here remove the values column.
- Now add another flatten transformation flatten3 after select2 and flatten the array column values. For both Unroll by and Unroll root, give the values column only.
- Similarly, add flatten4 after select1 and do the same for the timestamps array column.
- After that, add a SurrogateKey transformation surrogateKey1 after flatten3 and create a key column key1 with start value 1 and step 1.
- Do the same after the flatten4 transformation, giving the same key column name key1.
- Now inner join both surrogate key transformations using a join transformation on the common column key1.
- The join now has the required rows; use another select transformation to remove the duplicate columns and the extra key1 column.
Now, you will get the desired result.
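The surrogate-key join above works because both branches explode their arrays in the same row order, so the i-th value row lines up with the i-th timestamp row. A Python sketch of that mechanism (names are illustrative; the surrogate key plays the role of the row index):

```python
# One row out of flatten2: a dimension plus its parallel arrays (shortened sample).
rows_after_flatten2 = [
    ("SYNTHETIC-1", [17064756000, 17064752000], [None, 100]),
    ("SYNTHETIC-2", [17064755000, 17064754000], [101, None]),
]

# Branch 1 (select2 -> flatten3): keep dimension + values, explode values.
values_branch = [(dim, v) for dim, _, vals in rows_after_flatten2 for v in vals]
# Branch 2 (select1 -> flatten4): keep dimension + timestamps, explode timestamps.
ts_branch = [(dim, t) for dim, ts, _ in rows_after_flatten2 for t in ts]

# surrogateKey1/surrogateKey2: attach a 1-based row number to each branch.
keyed_values = {i + 1: row for i, row in enumerate(values_branch)}
keyed_ts = {i + 1: row for i, row in enumerate(ts_branch)}

# join1 on key1: rows pair up only because both branches preserved row order.
joined = [(keyed_ts[k][0], keyed_ts[k][1], keyed_values[k][1]) for k in keyed_values]
print(joined)
```

Note the assumption this makes explicit: the two surrogate keys must be generated over rows in the same order, so avoid repartitioning between the branch point and the join.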
Use the below dataflow script to build your dataflow.
source(output(
result as (data as (dimensions as string[], timestamps as string[], values as string[])[])[]
),
allowSchemaDrift: true,
validateSchema: false,
ignoreNoFilesFound: false,
documentForm: 'singleDocument') ~> source1
source1 foldDown(unroll(result.data, result),
mapColumn(
each(result.data,match(true()))
),
skipDuplicateMapInputs: false,
skipDuplicateMapOutputs: false) ~> flatten1
flatten1 foldDown(unroll(dimensions, dimensions),
mapColumn(
dimensions,
timestamps,
values
),
skipDuplicateMapInputs: false,
skipDuplicateMapOutputs: false) ~> flatten2
flatten2 select(mapColumn(
dimensions,
timestamps
),
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> select1
flatten2 select(mapColumn(
dimensions,
values
),
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> select2
select2 foldDown(unroll(values, values),
mapColumn(
dimensions,
values
),
skipDuplicateMapInputs: false,
skipDuplicateMapOutputs: false) ~> flatten3
select1 foldDown(unroll(timestamps, timestamps),
mapColumn(
dimensions,
timestamps
),
skipDuplicateMapInputs: false,
skipDuplicateMapOutputs: false) ~> flatten4
flatten3 keyGenerate(output(key1 as long),
startAt: 1L,
stepValue: 1L) ~> surrogateKey1
flatten4 keyGenerate(output(key1 as long),
startAt: 1L,
stepValue: 1L) ~> surrogateKey2
surrogateKey1, surrogateKey2 join(surrogateKey1@key1 == surrogateKey2@key1,
joinType:'inner',
matchType:'exact',
ignoreSpaces: false,
broadcast: 'auto')~> join1
join1 select(mapColumn(
dimensions = flatten3@dimensions,
timestamps,
values
),
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> select3
select3 sink(allowSchemaDrift: true,
validateSchema: false,
input(
result as (data as (dimensions as string[], timestamps as string[], values as string[])[])[]
),
umask: 0022,
preCommands: [],
postCommands: [],
skipDuplicateMapInputs: true,
skipDuplicateMapOutputs: true) ~> sink1