Glue 5.0
I'm having an issue where a DynamicFrame join returns no data.
As shown in the code below:
- It flattens the data with relationalize.
- It then tries to join root_component.supporting_info to root, matching root's component.supporting_info column against the id column of the child frame.
Below is the sample code and its output. I'm going to experiment with a raw Spark DataFrame to see whether it's reproducible there.
unnested = dyf.relationalize("root", "s3://mybucket/temp/")
print("#####KEYS####")
print(unnested.keys())
print("#####KEYS####")
print("")
dyf_root = unnested.select("root")
print("#####ROOT DATA####")
dyf_root.toDF().show(vertical=True)
print("#####ROOT DATA####")
print("")
dyf_root_supporting_info = unnested.select("root_component.supporting_info")
print("#####ROOT SUPPORTING INFO####")
dyf_root_supporting_info.toDF().show(vertical=True)
print("#####ROOT SUPPORTING INFO####")
print("")
dyf_joined = dyf_root.join(paths1=["component.supporting_info"], paths2=["id"], frame2=dyf_root_supporting_info)
print("#####JOINED SCHEMA#####")
dyf_joined.printSchema()
print("#####JOINED SCHEMA#####")
print("")
print("#####JOINED DATA#####")
dyf_joined.toDF().show(vertical=True)
print("#####JOINED DATA#####")
Output:
#####KEYS####
dict_keys(['root', 'root_component.tuple.licenses', 'root_component.supporting_info', 'root_image_tags', 'root_component.raw_tuple.licenses', 'root_component.original_tuple.licenses', 'root_component.supporting_info.val.cpes'])
#####KEYS####
#####ROOT DATA####
-RECORD 0-------------------------------------------------------------
image_id | 1127318
product_id | 600
scan_id | 1727750
action_id | 665a693e3d03044f1...
action_timestamp | 2024-06-01T00:20:...
image_tags | 1
component.found_by | protecode
component.image_id | 1127318
component.raw_tuple.component_type | Open Source
component.raw_tuple.latest_known_version | 2.7.3
component.raw_tuple.maven_id | xalan:xalan
component.raw_tuple.licenses | 1
component.raw_tuple.name | xalan
component.raw_tuple.version | 2.7.2
component.raw_tuple.supplier | apache
component.raw_tuple.codetype | java
component.raw_tuple.tuple_id | b9fff82793e1ddb9f...
component.raw_tuple.homepage | null
component.reusable_module_id | 1127318
component.tuple.component_type | Open Source
component.tuple.latest_known_version | 2.7.3
component.tuple.maven_id | xalan:xalan
component.tuple.licenses | 1
component.tuple.name | xalan
component.tuple.version | 2.7.2
component.tuple.supplier | apache
component.tuple.codetype | java
component.tuple.tuple_id | b9fff82793e1ddb9f...
component.tuple.homepage | null
component.origin_id | 665a693e3d03044f1...
component.original_tuple.component_type | Open Source
component.original_tuple.latest_known_version | 2.7.3
component.original_tuple.maven_id | xalan:xalan
component.original_tuple.licenses | 1
component.original_tuple.name | xalan
component.original_tuple.version | 2.7.2
component.original_tuple.supplier | apache
component.original_tuple.codetype | java
component.original_tuple.tuple_id | b9fff82793e1ddb9f...
component.original_tuple.homepage | null
component.supporting_info | 1
component.id | 665a693e3d03044f1...
gc_image | true
-RECORD 1-------------------------------------------------------------
image_id | 1127318
product_id | 600
scan_id | 1727750
action_id | 665a693e3d03044f1...
action_timestamp | 2024-06-01T00:20:...
image_tags | 2
component.found_by | protecode
component.image_id | 1127318
component.raw_tuple.component_type | null
component.raw_tuple.latest_known_version | 6.9.0
component.raw_tuple.maven_id | org.apache.bcel:bcel
component.raw_tuple.licenses | 2
component.raw_tuple.name | bcel
component.raw_tuple.version | 2.7.2
component.raw_tuple.supplier | apache
component.raw_tuple.codetype | java
component.raw_tuple.tuple_id | c4c44c1c3ad99644d...
component.raw_tuple.homepage | https://commons.a...
component.reusable_module_id | 1127318
component.tuple.component_type | Open Source
component.tuple.latest_known_version | 6.9.0
component.tuple.maven_id | org.apache.bcel:bcel
component.tuple.licenses | 2
component.tuple.name | bcel
component.tuple.version | 2.7.2
component.tuple.supplier | apache
component.tuple.codetype | java
component.tuple.tuple_id | c4c44c1c3ad99644d...
component.tuple.homepage | https://commons.a...
component.origin_id | 665a693e3d03044f1...
component.original_tuple.component_type | Open Source
component.original_tuple.latest_known_version | 6.9.0
component.original_tuple.maven_id | org.apache.bcel:bcel
component.original_tuple.licenses | 2
component.original_tuple.name | bcel
component.original_tuple.version | 2.7.2
component.original_tuple.supplier | apache
component.original_tuple.codetype | java
component.original_tuple.tuple_id | c4c44c1c3ad99644d...
component.original_tuple.homepage | https://commons.a...
component.supporting_info | 2
component.id | 665a693e3d03044f1...
gc_image | true
#####ROOT DATA####
#####ROOT SUPPORTING INFO####
-RECORD 0-------------------------------------------------------------
id | 1
index | 0
component.supporting_info.val.sha1 | 9ee6066b9f7152234...
component.supporting_info.val.path | alu-sr-cli-8.49/p...
component.supporting_info.val.package_name | xalan:xalan
component.supporting_info.val.confidence | 0.67
component.supporting_info.val.matching_method | signature
component.supporting_info.val.cpes | 1
-RECORD 1-------------------------------------------------------------
id | 2
index | 0
component.supporting_info.val.sha1 | 9ee6066b9f7152234...
component.supporting_info.val.path | alu-sr-cli-8.49/p...
component.supporting_info.val.package_name | org.apache.bcel:bcel
component.supporting_info.val.confidence | 0.77
component.supporting_info.val.matching_method | signature
component.supporting_info.val.cpes | 2
#####ROOT SUPPORTING INFO####
#####JOINED SCHEMA#####
root
#####JOINED SCHEMA#####
#####JOINED DATA#####
(0 rows)
#####JOINED DATA#####
# Relationalize the data
unnested = dyf.relationalize("root", "s3://mybucket/temp/")
# Print the keys
print("#####KEYS####")
print(unnested.keys())
print("#####KEYS####")
print("")
# Select the root data
dyf_root = unnested.select("root")
# Print the root data
print("#####ROOT DATA####")
dyf_root.toDF().show(truncate=False, vertical=True)
print("#####ROOT DATA####")
print("")
# Select the supporting info data
dyf_root_supporting_info = unnested.select("root_component.supporting_info")
# Print the supporting info data
print("#####ROOT SUPPORTING INFO####")
dyf_root_supporting_info.toDF().show(truncate=False, vertical=True)
print("#####ROOT SUPPORTING INFO####")
print("")
# Join the root data with the supporting info data
dyf_joined = dyf_root.join(
paths1=["component.supporting_info"],
paths2=["id"],
frame2=dyf_root_supporting_info
)
# Print the joined schema
print("#####JOINED SCHEMA#####")
dyf_joined.printSchema()
print("#####JOINED SCHEMA#####")
print("")
# Print the joined data
print("#####JOINED DATA#####")
dyf_joined.toDF().show(truncate=False, vertical=True)
print("#####JOINED DATA#####")
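One hypothesis I want to rule out (an assumption on my part, not something I have confirmed for Glue): after relationalize, the join column in root is literally named component.supporting_info. If the join path were interpreted as "field supporting_info inside struct component", it would resolve to nothing in the flattened frame, and every row would be dropped. The plain-Python sketch below only illustrates that difference using the key values from the output above; the helper names (get_literal, get_nested, inner_join) are hypothetical and are not Glue APIs.

```python
# Rows copied from the printed output above (join-relevant columns only).
root_rows = [
    {"component.supporting_info": 1, "component.tuple.name": "xalan"},
    {"component.supporting_info": 2, "component.tuple.name": "bcel"},
]
supporting_rows = [
    {"id": 1, "component.supporting_info.val.package_name": "xalan:xalan"},
    {"id": 2, "component.supporting_info.val.package_name": "org.apache.bcel:bcel"},
]


def get_literal(row, name):
    """Treat the whole string as one flat column name."""
    return row.get(name)


def get_nested(row, path):
    """Treat dots as nesting: 'a.b' means field b inside struct a."""
    current = row
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def inner_join(left, right, left_key, right_key, lookup):
    """Naive inner join; a left row whose key resolves to None never matches."""
    joined = []
    for left_row in left:
        key = lookup(left_row, left_key)
        if key is None:
            continue
        for right_row in right:
            if right_row.get(right_key) == key:
                joined.append({**left_row, **right_row})
    return joined


literal = inner_join(root_rows, supporting_rows,
                     "component.supporting_info", "id", get_literal)
nested = inner_join(root_rows, supporting_rows,
                    "component.supporting_info", "id", get_nested)

print(len(literal))  # 2: flat column name matches both rows
print(len(nested))   # 0: dotted path finds no nested field, so nothing joins
```

If this does turn out to be the cause, renaming the column to a dot-free name before the join, or quoting the name with backticks in a Spark DataFrame join, might be worth trying. I have not verified either yet.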