I have an Apache Flink application that reads messages from Google PubSub via the PubSubSource connector and writes them to Hadoop/HDFS in Parquet format.
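The ingestion side is wired up roughly like this (a simplified sketch: project and subscription names are placeholders, and SimpleStringSchema stands in for my real deserialization schema):

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.gcp.pubsub.PubSubSource;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Checkpointing is enabled so the PubSub source can acknowledge messages.
env.enableCheckpointing(60_000);

// Placeholder project and subscription names.
SourceFunction<String> source = PubSubSource.newBuilder()
        .withDeserializationSchema(new SimpleStringSchema())
        .withProjectName("my-project")
        .withSubscriptionName("my-subscription")
        .build();

DataStream<String> messages = env.addSource(source);
```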
The application runs fine, but the files it writes to HDFS are consistently under 100 KB, and that many small files hurts HDFS performance.
For the file output I'm using FileSink.forBulkFormat(…). To address the small-file issue, I tried a custom rolling policy (extending CheckpointRollingPolicy) so that a new file is started only once the current part file exceeds 150 MB (see the sketch below), but this hasn't worked.
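Roughly what the sink and the rolling-policy attempt look like (a simplified sketch: the String bucket ID assumes the default bucket assigner, `schema` stands for my Avro schema, the HDFS path is a placeholder, and the Parquet writer factory class may differ by Flink version):

```java
import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetWriters;
import org.apache.flink.streaming.api.functions.sink.filesystem.PartFileInfo;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.CheckpointRollingPolicy;

// Attempted policy: only roll to a new part file once the current one exceeds ~150 MB.
public class SizeBasedRollingPolicy<IN> extends CheckpointRollingPolicy<IN, String> {
    private static final long MAX_PART_SIZE_BYTES = 150L * 1024 * 1024;

    @Override
    public boolean shouldRollOnEvent(PartFileInfo<String> partFileState, IN element) throws IOException {
        return partFileState.getSize() >= MAX_PART_SIZE_BYTES;
    }

    @Override
    public boolean shouldRollOnProcessingTime(PartFileInfo<String> partFileState, long currentTime) {
        return false;
    }
}

// Sink wiring; `schema` is the Avro schema of the ingested messages.
FileSink<GenericRecord> sink = FileSink
        .forBulkFormat(new Path("hdfs:///data/pubsub"), AvroParquetWriters.forGenericRecord(schema))
        .withRollingPolicy(new SizeBasedRollingPolicy<>())
        .build();
```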
Could you suggest a way to resolve this issue?