The DDB table was exported from DynamoDB to S3 and I would like to run some Athena queries against the table. Table’s items have two keys (PK
and pk
) which are the same when they are in lowercase.
Currently tables has two items only:
{"Item":{"PK":{"S":"c#hpD_dnHyoGsgGgNdb8xq36zieC2bdz-QgEyUmVXffuw"},"SK":{"S":"VPZQ+OIh7YXE8qt+GFUoOQ==#jrZ1h4vLmvwmxBgjXoQM5w"},"pk":{"S":"4"},"dc":{"S":"28"}}}
{"Item":{"PK":{"S":"c#qf_hDkjVLu96FICbwRfsUKZqs89D3LT5dG2KrfKz_M8"},"SK":{"S":"LyQg1ehf6lJaWGis7wKPHQ==#PfG2IemA1VVjGZmHrOgKtA"},"pk":{"S":"4"},"dc":{"S":"7"}}}
I’ve read following docs how to resolve the same keys issue in Athena table
- https://repost.aws/knowledge-center/json-duplicate-key-error-athena-config
- https://docs.aws.amazon.com/athena/latest/ug/openx-json-serde.html (case.insensitive section)
and based on recommendations had created the following schema for the table with "case.insensitive" = "FALSE"
property and remapping for the duplicated fields
CREATE EXTERNAL TABLE `test`(
`item` struct<PK:struct<S:string>,pk:struct<S:string>,sk:struct<S:string>,dc:struct<S:string>>)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
"case.insensitive" = "FALSE",
"mapping.pk1" = "PK",
"mapping.pk2" = "pk",
'paths'='Item')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://bucket/310724/AWSDynamoDB/id/test/'
TBLPROPERTIES (
'classification'='json',
'compressionType'='none',
'objectCount'='1',
'typeOfData'='file')
Now when I query the table to preview the data (only two rows are in the table in this moment)
SELECT * FROM "test" limit 10;
the response doesn’t show any data however returns correct number of rows in the table (two items)
# item
1
2
When I try to query particular column I get
SELECT Item.PK.s FROM "test" limit 10;
>>> AMBIGUOUS_NAME: line 1:13: Ambiguous row field reference: PK
SELECT Item.pk.s FROM "test" limit 10;
>>> AMBIGUOUS_NAME: line 1:13: Ambiguous row field reference: pk
SELECT Item.pk2.s FROM "test" limit 10;
>>> COLUMN_NOT_FOUND: line 1:8: Column 'item.pk2' cannot be resolved or requester is not authorized to access requested resources
SELECT Item.dc.s FROM "test" limit 10;
>>> two empty rows
Any advice what I’m doing wrong?