While reviewing the billing reports for Project B, I noticed that we are paying heavy costs for Regional Standard Class A Operations on some GCS buckets within the project. It's not the storage itself that is driving the cost.
A high-level overview:
Project A – a GCS bucket acts as the landing zone for incoming files.
Project B – runs Cloud Composer and Dataproc, which read files from Project A and write or update data in BigQuery datasets defined in Project A.
We have many Airflow DAGs that run very frequently and submit Spark jobs to a Dataproc cluster. Those jobs read the files from the GCS buckets, read and write data in BigQuery, and then submit subsequent jobs for aggregations and other calculations.
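For context, here is a stripped-down sketch of the kind of DAG we run (the DAG ID, schedule, cluster name, and bucket paths below are placeholders, not our actual configuration):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocSubmitJobOperator,
)

# Placeholder values - our real DAGs use different names, paths, and schedules.
PROJECT_B = "project-b"
REGION = "us-central1"
CLUSTER_NAME = "shared-dataproc-cluster"

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_B},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        # Reads landed files from the Project A bucket and writes to BigQuery.
        "main_python_file_uri": "gs://project-a-landing-zone/jobs/ingest.py",
    },
}

with DAG(
    dag_id="ingest_and_aggregate",
    start_date=datetime(2023, 1, 1),
    schedule_interval="*/10 * * * *",  # runs very frequently
    catchup=False,
) as dag:
    ingest = DataprocSubmitJobOperator(
        task_id="submit_ingest_job",
        project_id=PROJECT_B,
        region=REGION,
        job=PYSPARK_JOB,
    )
```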
Our concern is about the buckets defined in Project B:
the Cloud Composer bucket, where DAGs are stored and logs are written (a quick check of its contents is sketched after this list)
the Dataproc cluster staging bucket, which stores Spark job logs
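For reference, this is roughly how I have been checking what the Composer bucket actually contains, counting objects per top-level prefix (the bucket name is a placeholder):

```python
from collections import Counter

from google.cloud import storage

COMPOSER_BUCKET = "us-central1-composer-env-bucket"  # placeholder

client = storage.Client()
counts = Counter()
for blob in client.list_blobs(COMPOSER_BUCKET):
    # Group by top-level prefix, e.g. dags/, logs/, data/
    top_level = blob.name.split("/", 1)[0]
    counts[top_level] += 1

for prefix, count in counts.most_common():
    print(f"{prefix}: {count} objects")
```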
Cloud Composer bucket observability graph:
Dataproc staging bucket observability graph:
I don't understand these graphs very well, but it seems that the Composer bucket is the one contributing more to the cost. I'm not sure which operations are contributing to such high usage. Any pointers on what to look at?
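In case it helps, I assume the per-method breakdown behind those graphs can be pulled from the storage.googleapis.com/api/request_count metric in Cloud Monitoring; this is a rough sketch of how I would query it (the project ID and bucket name are placeholders):

```python
import time

from google.cloud import monitoring_v3

# Placeholders - not our real project or bucket names.
PROJECT_ID = "project-b"
COMPOSER_BUCKET = "us-central1-composer-env-bucket"

client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 24 * 3600}, "end_time": {"seconds": now}}
)

# Sum the request counts per GCS API method over the last 24 hours.
aggregation = monitoring_v3.Aggregation(
    {
        "alignment_period": {"seconds": 3600},
        "per_series_aligner": monitoring_v3.Aggregation.Aligner.ALIGN_SUM,
        "cross_series_reducer": monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
        "group_by_fields": ["metric.labels.method"],
    }
)

series_list = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": (
            'metric.type = "storage.googleapis.com/api/request_count" '
            f'AND resource.labels.bucket_name = "{COMPOSER_BUCKET}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        "aggregation": aggregation,
    }
)

for series in series_list:
    method = series.metric.labels["method"]
    total = sum(point.value.int64_value for point in series.points)
    print(f"{method}: {total} requests in the last 24h")
```

I'm not sure whether this metric maps cleanly onto the billed Class A operations, so corrections are welcome.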