When running an AWS Glue Visual ETL job that converts data to Parquet format, with the following flow:
AWS Glue Catalog => Change Schema => S3 Bucket
the following error is generated:
Error Category: UNCLASSIFIED_ERROR; An error occurred while calling
o171.pyWriteDynamicFrame. Inconsistent data type results in choice
type
I have an XML structure that I am converting to Parquet so that I can query the data with AWS Athena. The XML has a repeating address structure: in some cases there is one address, and in other cases there could be two, three, or four addresses, like the following:
<root>
  <dataRecord>
    <addressList>
      <address>
        <add1>123 Street</add1>
        <add2>Some Town</add2>
        <postCode>A1 1AA</postCode>
      </address>
    </addressList>
  </dataRecord>
  <dataRecord>
    <addressList>
      <address>
        <add1>456 Street</add1>
        <add2>Other Town</add2>
        <postCode>B1 1BB</postCode>
      </address>
      <address>
        <add1>789 Street</add1>
        <add2>New Town</add2>
        <postCode>C1 1CC</postCode>
      </address>
    </addressList>
  </dataRecord>
</root>
When the XML is crawled, each repeating dataRecord has an addressList containing one or more addresses. When it comes to inferring the data type for address, AWS Glue therefore produces a choice type: either 'array' or 'struct'.
Based on what I've read in other posts, I have tried adding a custom code step in the visual editor with the following code to resolve the choice, but I'm still getting the error:
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0])
    df_resolved = df.resolveChoice(specs=[('addressList.address', 'cast:array')])
    return DynamicFrameCollection({"CustomTransform0": df_resolved}, glueContext)
However, I am not sure what path to give for the field to resolve, and I have tried a number of variations.
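To make concrete what I am trying to achieve (independent of Glue), here is a plain-Python sketch of the normalisation I want: when only one address is present the parsed record holds a single struct, and when several are present it holds an array, so I want address to always end up as an array. The record shapes and the helper name here are my own illustration, not Glue APIs:

```python
def normalise_address(record):
    """Ensure addressList.address is always a list, never a lone dict.

    This mimics what resolving the Glue choice type to 'array' should do.
    The dict shapes below mirror the crawled XML; names are illustrative.
    """
    address = record["addressList"]["address"]
    if isinstance(address, dict):
        # A single <address> element parses as one struct; wrap it so the
        # schema is consistent with records that have multiple addresses.
        record["addressList"]["address"] = [address]
    return record

# One address -> parsed as a single struct (dict)
one = {"addressList": {"address": {
    "add1": "123 Street", "add2": "Some Town", "postCode": "A1 1AA"}}}

# Two addresses -> parsed as an array (list) of structs
two = {"addressList": {"address": [
    {"add1": "456 Street", "add2": "Other Town", "postCode": "B1 1BB"},
    {"add1": "789 Street", "add2": "New Town", "postCode": "C1 1CC"},
]}}

print(len(normalise_address(one)["addressList"]["address"]))  # 1
print(len(normalise_address(two)["addressList"]["address"]))  # 2
```

After this normalisation both records share one schema (address is always an array of structs), which is the consistent shape I need for writing Parquet and querying with Athena.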