I have the Python script below. It currently generates several gzip files of about 4 MB each in an S3 bucket, which is the default behavior of the script AWS Glue created. I now want it to create multiple files of a specific size, around 100-250 MB each, in the S3 bucket. I tried the logic below in the Python script, but it did not work and still creates several 4 MB gzip files.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import datetime
args = getResolvedOptions(sys.argv, ['target_BucketName', 'JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
outputbucketname = args['target_BucketName']
timestamp = datetime.datetime.now().strftime("%Y%m%d")
filename = f"tbd{timestamp}"
output_path = f"{outputbucketname}/{filename}"
# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node075257312 = glueContext.create_dynamic_frame.from_catalog(
    database="ardt",
    table_name="_ard_tbd",
    transformation_ctx="AWSGlueDataCatalog_node075257312",
)
# Script generated for node Amazon S3
AmazonS3_node075284688 = glueContext.write_dynamic_frame.from_options(
    frame=AWSGlueDataCatalog_node075257312,
    connection_type="s3",
    format="csv",
    format_options={"separator": "|"},
    connection_options={
        "path": output_path,
        "compression": "gzip",
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "100000000",
    },
    transformation_ctx="AmazonS3_node075284688",
)
job.commit()
Use pandas for this:
# Convert DynamicFrame to PySpark DataFrame
df_spark = AWSGlueDataCatalog_node075257312.toDF()
# Convert PySpark DataFrame to Pandas DataFrame
df_pandas = df_spark.toPandas()
# Determine the number of rows per file based on your desired file size
rows_per_file = 100000 # Adjust this number based on your data and desired file size
# Split the DataFrame into smaller DataFrames and save each to S3 in gzip format
for i in range(0, len(df_pandas), rows_per_file):
    df_chunk = df_pandas[i:i + rows_per_file]
    df_chunk.to_csv(
        f's3://your-bucket/your-output-path/part_{i // rows_per_file}.csv.gz',
        index=False,
        compression='gzip'
    )
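If you go this route, one way to pick rows_per_file is to derive it from the DataFrame's in-memory size and a target file size. A rough sketch, assuming a 150 MB target (an arbitrary midpoint of your 100-250 MB range; gzip will shrink the actual files further, so you may want to aim higher):
# Estimate average in-memory bytes per row, then the row count that
# adds up to roughly 150 MB of uncompressed data per file.
target_bytes = 150 * 1024 * 1024
avg_row_bytes = df_pandas.memory_usage(deep=True).sum() / max(1, len(df_pandas))
rows_per_file = max(1, int(target_bytes / avg_row_bytes))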
The groupFiles option that you're using only applies to reading data more efficiently; it does not control the size of the output files.
You have a few options here. The easier one is to repartition the data, but in that case you have to guesstimate the number of partitions beforehand, as in the sketch below.
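A minimal sketch of the repartition approach, assuming you convert the DynamicFrame from your job to a Spark DataFrame first; the partition count of 8 is a placeholder you would tune based on your total data volume divided by the target file size:
# Convert the DynamicFrame to a DataFrame, shrink it to a guessed number of
# partitions, and write one gzip-compressed CSV file per partition.
df = AWSGlueDataCatalog_node075257312.toDF()
df.repartition(8) \
    .write \
    .option("sep", "|") \
    .option("compression", "gzip") \
    .csv(output_path)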
Another possibility is to use the maxRecordsPerFile Spark option:
df.write.option("maxRecordsPerFile", 10000).save("s3://some-bucket/my-prefix/")
Again, this would require estimating the number of records required to obtain the desired file size.
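One rough way to get that estimate is to measure the average in-memory row size of a small sample and derive the record count for a size target. A hedged sketch; the 10,000-row sample and the 200 MB target are arbitrary choices, and since the in-memory size overstates the gzip-compressed size, the actual files will come out smaller, so adjust the target accordingly:
# Estimate bytes per row from a small sample, then derive maxRecordsPerFile.
df = AWSGlueDataCatalog_node075257312.toDF()
sample = df.limit(10000).toPandas()
avg_row_bytes = max(1, int(sample.memory_usage(deep=True).sum() / max(1, len(sample))))

target_file_bytes = 200 * 1024 * 1024          # aim for roughly 200 MB per file
records_per_file = max(1, target_file_bytes // avg_row_bytes)

df.write \
    .option("maxRecordsPerFile", int(records_per_file)) \
    .option("sep", "|") \
    .option("compression", "gzip") \
    .csv(output_path)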
Finally, the most bulletproof way to do this (but also the most complex one) is to write the output of your job without regard for file sizes, look at the average file size, then re-read, repartition, and write again.
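A hedged sketch of that two-pass idea; the bucket, prefixes, and the 200 MB target below are placeholders, not values from your job:
import boto3

# Pass 1 has already happened: the job wrote its output without any size
# tuning. List those objects to find the total compressed output size.
s3 = boto3.client("s3")
source_bucket = "your-bucket"           # placeholder: bucket the job wrote to
source_prefix = "your-output-path/"     # placeholder: prefix the job wrote to

total_bytes = 0
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=source_bucket, Prefix=source_prefix):
    for obj in page.get("Contents", []):
        total_bytes += obj["Size"]

# Pass 2: re-read the output, repartition so each partition carries roughly
# 200 MB worth of the already-compressed data, and write to a new prefix.
target_bytes = 200 * 1024 * 1024
num_partitions = max(1, int(total_bytes // target_bytes))

df = spark.read.option("sep", "|").csv(f"s3://{source_bucket}/{source_prefix}")
df.repartition(num_partitions) \
    .write \
    .option("sep", "|") \
    .option("compression", "gzip") \
    .csv(f"s3://{source_bucket}/your-output-path-resized/")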