I am new to Apache Spark (Java) and am trying to create a text file consisting of multiple JSON objects, each representing a combination of the two datasets below. The firstToSecondGeneration dataset is very wide, so I have omitted some of its columns.
Here are the two datasets I’m trying to join:
Dataset firstToSecondGeneration:
name|ch1|ch2 |ch3 |…..|ch99
Bob|Joe|James| | |
Sue|Joe|James| | |
John| | | |Johnny
Dataset secondToThirdGeneration:
chName|gChname
Joe|Joe Jr.
Joe|Josephine
James|James Jr.
James|Jamie
Johnny|Johnny Jr.
And here is the result I expect, with one JSON object per name:
{
  "name": "Bob",
  "children": [
    {
      "childName": "Joe",
      "grandChildren": [
        {
          "grandChildName": "Joe Jr."
        },
        {
          "grandChildName": "Josephine"
        }
      ]
    },
    {
      "childName": "James",
      "grandChildren": [
        {
          "grandChildName": "James Jr."
        },
        {
          "grandChildName": "Jamie"
        }
      ]
    }
  ]
}
{
  "name": "Sue",
  "children": [
    {
      "childName": "Joe",
      "grandChildren": [
        {
          "grandChildName": "Joe Jr."
        },
        {
          "grandChildName": "Josephine"
        }
      ]
    },
    {
      "childName": "James",
      "grandChildren": [
        {
          "grandChildName": "James Jr."
        },
        {
          "grandChildName": "Jamie"
        }
      ]
    }
  ]
}
{
  "name": "John",
  "children": [
    {
      "childName": "Johnny",
      "grandChildren": [
        {
          "grandChildName": "Johnny Jr."
        }
      ]
    }
  ]
}
I have a working solution in which I simply collect all the chName values into a list and build the JSON strings by string concatenation, but I want to avoid that because I feel it does not fully leverage Spark.
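To make the nesting I'm after concrete, here is a minimal plain-Java sketch (no Spark) of the grouping applied to the sample rows. It is for illustration only and is not my actual code: the class name FamilyJsonSketch, the methods groupByChild and toJson, and the hard-coded sample data are all assumptions made for this example. It also assumes names never contain characters that need JSON escaping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of Spark or of my real code.
class FamilyJsonSketch {

    // Groups (chName, gChname) pairs by chName, preserving input order.
    public static Map<String, List<String>> groupByChild(String[][] pairs) {
        Map<String, List<String>> byChild = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            byChild.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return byChild;
    }

    // Builds one JSON object for a single row of firstToSecondGeneration.
    // Assumption: values contain no quotes or backslashes, so no escaping is done.
    public static String toJson(String name, List<String> children,
                                Map<String, List<String>> grandChildrenByChild) {
        String childrenJson = children.stream()
            // Skip the empty ch1..ch99 columns.
            .filter(c -> c != null && !c.trim().isEmpty())
            .map(c -> {
                String grandJson = grandChildrenByChild
                    .getOrDefault(c, Collections.<String>emptyList())
                    .stream()
                    .map(g -> "{\"grandChildName\":\"" + g + "\"}")
                    .collect(Collectors.joining(","));
                return "{\"childName\":\"" + c + "\",\"grandChildren\":[" + grandJson + "]}";
            })
            .collect(Collectors.joining(","));
        return "{\"name\":\"" + name + "\",\"children\":[" + childrenJson + "]}";
    }

    public static void main(String[] args) {
        // Sample rows from secondToThirdGeneration.
        Map<String, List<String>> grandChildren = groupByChild(new String[][] {
            {"Joe", "Joe Jr."}, {"Joe", "Josephine"},
            {"James", "James Jr."}, {"James", "Jamie"},
            {"Johnny", "Johnny Jr."}
        });
        // Sample rows from firstToSecondGeneration (empty columns kept as "").
        System.out.println(toJson("Bob", Arrays.asList("Joe", "James", ""), grandChildren));
        System.out.println(toJson("Sue", Arrays.asList("Joe", "James", ""), grandChildren));
        System.out.println(toJson("John", Arrays.asList("", "", "", "Johnny"), grandChildren));
    }
}
```

Each call to toJson produces one compact JSON object per line, which is the structure shown in the expected result without the pretty-printing. What I'd like is for Spark to perform this grouping and nesting itself rather than having it done on the driver.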