I am reading product content in JSON format from Kafka and creating files in a GCS bucket. The file content is JSON, and one of the fields is the update time. I am creating one file per product.
I want to make sure that I do not overwrite a file if the data already in the GCS bucket is more recent than the incoming data (based on the update time).
I can read the file, parse the JSON, compare the update times, and then overwrite the file. But I want to know if there is a more efficient approach.
If you’re happy to use the last update time of the GCS object itself, you can get that from the object’s standard metadata, without reading the content.
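For example, with the Python client library, a metadata-only fetch is enough to see when the object was last written (the bucket name and object path below are placeholders):

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # placeholder bucket name

# get_blob() fetches only the object's metadata, not its content.
blob = bucket.get_blob("products/product-123.json")
if blob is not None:
    print(blob.updated)  # datetime of the last write to the object
```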
If you need to record the update time from elsewhere, you could include that as custom Cloud Storage metadata when you upload the object.
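For instance, a sketch with the Python client (the `update-time` key is just a name chosen for this example):

```python
from google.cloud import storage

record_json = '{"id": "product-123", "updateTime": "2024-05-01T12:34:56Z"}'

client = storage.Client()
blob = client.bucket("my-bucket").blob("products/product-123.json")

# Attach the record's update time as custom metadata so later reads
# can see it without downloading the content.
blob.metadata = {"update-time": "2024-05-01T12:34:56Z"}
blob.upload_from_string(record_json, content_type="application/json")
```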
Keeping information you need for filtering in object metadata is an effective way to avoid loading the content just to get at one piece of information. The only downside is that you do need to keep the metadata consistent with the actual file content. (You’ve basically got two sources of truth at that point. If they diverge, you could run into problems. It’s worth considering how big those problems might be, and how to mitigate them.)
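Putting that together, here is a hedged sketch of the compare-then-write flow. It assumes the update times are ISO-8601 UTC strings in a uniform format, so string comparison matches chronological order, and it uses `if_generation_match` to guard against a concurrent writer changing the object between the metadata read and the upload:

```python
from google.cloud import storage

def write_if_newer(bucket_name: str, object_name: str,
                   record_json: str, record_update_time: str) -> bool:
    """Upload the record only if it is newer than the stored object.

    Assumes update times are ISO-8601 UTC strings with a uniform
    format, so lexicographic comparison matches chronological order.
    """
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    existing = bucket.get_blob(object_name)  # metadata-only fetch
    if existing is not None:
        stored_time = (existing.metadata or {}).get("update-time", "")
        if stored_time >= record_update_time:
            return False  # stored data is at least as new; skip the write

    blob = bucket.blob(object_name)
    blob.metadata = {"update-time": record_update_time}
    # if_generation_match makes the write conditional on the object
    # being unchanged since we read its metadata; 0 means "must not exist".
    generation = existing.generation if existing is not None else 0
    blob.upload_from_string(record_json, content_type="application/json",
                            if_generation_match=generation)
    return True
```

If another writer does update the object in between, the precondition fails with a 412 (`google.api_core.exceptions.PreconditionFailed`), and you can re-read the metadata and retry the decision.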