I have a very large netCDF dataset consisting of daily files from April 1985 to April 2024. Since the arrays are split into daily files, I usually open them with `ds = xr.open_mfdataset('*.nc')`. The entire dataset is about 1.07 TB, far more than I can load into memory.
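For reference, this is roughly how I open the files (the path pattern is just a placeholder for my actual directory):

```python
import xarray as xr

# Open all daily files as one lazy dataset (path pattern is illustrative)
ds = xr.open_mfdataset("data/*.nc", combine="by_coords")

# Total size on disk is about 1.07 TB
print(f"{ds.nbytes / 1e12:.2f} TB")
```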
By selecting a single lat/lon coordinate with `ds.sel(latitude=y, longitude=x, method='nearest')`, I get one pixel's full time series, which is far lighter than the original dataset and lets me perform the analysis I need.
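The selection itself looks roughly like this (the coordinates below are hypothetical):

```python
# Example coordinates, not my real ones
x, y = -47.5, -15.8  # longitude, latitude

# Nearest-neighbour selection of a single pixel's time series (still lazy at this point)
point = ds.sel(latitude=y, longitude=x, method="nearest")
```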
However, even though the sliced dataset is now very light, it still takes a very long time (more than 1 hour) to load into memory with `ds.load()`. This would not be a big deal if I didn't need to repeat the operation more than 100,000 times, which would take an incredible 10 years to finish!
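A simplified version of the loop I need to run is below (the list of points is hypothetical; in reality there are more than 100,000 of them):

```python
import time

# Hypothetical coordinate pairs; the real list has >100,000 entries
points = [(-47.5, -15.8), (-46.9, -16.1)]

for x, y in points:
    t0 = time.time()
    # Loading the single-pixel time series is the slow step (>1 h per point)
    series = ds.sel(latitude=y, longitude=x, method="nearest").load()
    # ... per-pixel analysis on `series` goes here ...
    print(f"pixel ({x}, {y}) loaded in {time.time() - t0:.0f} s")
```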
I don't have a powerful machine, but it's decent enough for the tasks I need. Although I expected this to take some time, I'd really like to finish it before becoming a dad. Besides moving to a more powerful machine (which I don't think would bring the required time down to the order of days), is there any way I can optimize this task?