A response.json()
returns a JSON in the following format:
{
"workbooks": [
{
"name": "WORKBOOK_A",
"embeddedDatasources": [
{
"upstreamTables": [
{"name": "WORKBOOK_A_TABLE_A"}]},
{
"upstreamTables": [
{"name": "WORKBOOK_A_TABLE_B"},
{"name": "WORKBOOK_A_TABLE_C"}]},
{
"upstreamTables": []}]},
{
"name": "WORKBOOK_B",
"embeddedDatasources": [
{
"upstreamTables": [
{"name": "WORKBOOK_B_TABLE_A"},
{"name": "WORKBOOK_B_TABLE_B"}]},
{
"upstreamTables": [
{"name": "WORKBOOK_B_TABLE_C"},
{"name": "WORKBOOK_B_TABLE_D"}]}]}]}
I am trying to convert it to a dataframe like this:
workbooks | upstreamTables |
---|---|
WORKBOOK_A | WORKBOOK_A_TABLE_A |
WORKBOOK_A | WORKBOOK_A_TABLE_B |
WORKBOOK_A | WORKBOOK_A_TABLE_C |
WORKBOOK_B | WORKBOOK_B_TABLE_A |
WORKBOOK_B | WORKBOOK_B_TABLE_B |
WORKBOOK_B | WORKBOOK_B_TABLE_C |
WORKBOOK_B | WORKBOOK_B_TABLE_D |
"upstreamTables": []
should be ignored in this case.
Playing around with json_normalize
df = pd.json_normalize(json_data)
didn’t play out so far and extracting the data as separate dataframes and rejoining them seems too convulsive.
7
Here’s one approach:
- Pass
resp
(i.e.,response.json()
) topd.json_normalize
with bothrecord_path
andmeta
. Addmeta_prefix
to avoid aValueError: Conflicting metadata
. Cf. this post. Otherwise we would end up with 2name
columns. - Use
df.rename
to rename the columns and re-order them.
import pandas as pd
# resp = {...}
df = (pd.json_normalize(resp['workbooks'],
record_path=['embeddedDatasources', 'upstreamTables'],
meta='name',
meta_prefix='meta_'
)
.rename(columns={'name': 'upstreamTables',
'meta_name': 'workbooks'})
[['workbooks', 'upstreamTables']]
)
Output:
workbooks upstreamTables
0 WORKBOOK_A WORKBOOK_A_TABLE_A
1 WORKBOOK_A WORKBOOK_A_TABLE_B
2 WORKBOOK_A WORKBOOK_A_TABLE_C
3 WORKBOOK_B WORKBOOK_B_TABLE_A
4 WORKBOOK_B WORKBOOK_B_TABLE_B
5 WORKBOOK_B WORKBOOK_B_TABLE_C
6 WORKBOOK_B WORKBOOK_B_TABLE_D
2
to keep it compact I used list comprehensions, I hope it is readable.
import pandas as pd
responseData={}
for item in response['workbooks']:
embbeddedDataList=item['embeddedDatasources']
response_elements=[listElem['upstreamTables'] for listElem in embbeddedDataList if not listElem['upstreamTables']==[]]
tabular_elements=[elem['name'] for elementList in response_elements for elem in elementList]
responseData[item['name']]=tabular_elements
workbooks=[] ; upstreamTables=[]
for workbook in responseData:
for streamEntry in responseData[workbook]:
workbooks.append(workbook)
upstreamTables.append(streamEntry)
tabularResponse=pd.DataFrame()
tabularResponse['workbooks']=workbooks
tabularResponse['upstreamTables']=upstreamTables
0
To convert the JSON response into the desired DataFrame, you can iterate through the JSON structure
import pandas as pd
json_data = {
"workbooks": [
{
"name": "WORKBOOK_A",
"embeddedDatasources": [
{"upstreamTables": [{"name": "WORKBOOK_A_TABLE_A"}]},
{"upstreamTables": [{"name": "WORKBOOK_A_TABLE_B"}, {"name": "WORKBOOK_A_TABLE_C"}]},
{"upstreamTables": []}
]
},
{
"name": "WORKBOOK_B",
"embeddedDatasources": [
{"upstreamTables": [{"name": "WORKBOOK_B_TABLE_A"}, {"name": "WORKBOOK_B_TABLE_B"}]},
{"upstreamTables": [{"name": "WORKBOOK_B_TABLE_C"}, {"name": "WORKBOOK_B_TABLE_D"}]}
]
}
]
}
data = []
for workbook in json_data['workbooks']:
workbook_name = workbook['name']
for datasource in workbook['embeddedDatasources']:
for table in datasource['upstreamTables']:
# Add workbook name and table name to the data list
data.append({
"workbooks": workbook_name,
"upstreamTables": table['name']
})
df = pd.DataFrame(data)
print(df)
This would give the result in the desired table structure. But make sure to always pass the correct JSON or use any online tool like JSON Reader or any tool