I have a few Flink 1.17 workloads that write checkpoints to S3. They run on Kubernetes and are managed by the Flink Kubernetes Operator. I have also set RETAIN_ON_CANCELLATION for safety.
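For context, the retention setting is passed through the operator's `flinkConfiguration`; roughly like this (bucket name is illustrative):

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
spec:
  flinkConfiguration:
    # checkpoint root; each job writes under {job-id}/ below this
    state.checkpoints.dir: s3://my-bucket/user-defined-checkpoint-dir
    # keep externalized checkpoints when a job is cancelled
    execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
```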
I can see the expected structure in S3:
```
/user-defined-checkpoint-dir
    /{job-id}
        |
        + --shared/
        + --taskowned/
        + --chk-1/
        + --chk-2/
        + --chk-3/
        ...
```
However, the number of files under `shared/` seems to grow without bound. Some objects are months old, even though my state TTL is at most one day. Here is a breakdown by the S3 `LastModified` property, extracted today (2024-09-05):
Period starting 2024-07-18: 130545.63 MB
Period starting 2024-07-25: 1014.31 MB
Period starting 2024-08-01: 9.93 MB
Period starting 2024-08-08: 2267.53 MB
Period starting 2024-08-15: 104870.59 MB
Period starting 2024-08-22: 92.28 MB
Period starting 2024-08-29: 95672.27 MB
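The per-period totals above come from grouping the objects under the `shared/` prefix into 7-day buckets by `LastModified`. A sketch of that grouping (bucket and prefix names are placeholders, not my real ones):

```python
from collections import defaultdict
from datetime import date, datetime, timedelta

def weekly_size_mb(objects, week_anchor):
    """Group (last_modified, size_bytes) pairs into 7-day buckets
    starting at week_anchor; return MB totals keyed by bucket start date."""
    buckets = defaultdict(int)
    for last_modified, size in objects:
        days = (last_modified.date() - week_anchor).days
        start = week_anchor + timedelta(days=(days // 7) * 7)
        buckets[start] += size
    return {start: round(total / (1024 * 1024), 2)
            for start, total in sorted(buckets.items())}

# With boto3 the pairs would come from a paginated listing, e.g.:
#   s3 = boto3.client("s3")
#   paginator = s3.get_paginator("list_objects_v2")
#   for page in paginator.paginate(
#           Bucket="my-bucket",
#           Prefix=f"user-defined-checkpoint-dir/{job_id}/shared/"):
#       for obj in page.get("Contents", []):
#           objects.append((obj["LastModified"], obj["Size"]))
```
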
Is there a way to reliably expire these objects without corrupting the checkpoints?