The xarray documentation for the `open_mfdataset` function states that you can use the `preprocess` argument to apply a function to each dataset before concatenation. The NetCDF datasets I have do not have coordinates assigned when you open them one by one, so I was attempting to assign them before concatenation with `combine='by_coords'` in the `open_mfdataset` call.
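For reference, `preprocess` is documented to be applied to each dataset right after it is opened and before the datasets are combined. A quick way to check that it is running at all is to make it print something. Below is a minimal sketch using an in-memory stand-in for one of the files; `check_preprocess` is a hypothetical helper, not part of my real code:

```python
import numpy as np
import xarray as xr

def check_preprocess(ds):
    # Print each dataset's dimensions as it is opened, to confirm that
    # preprocess really is called once per file before combining.
    print("preprocess saw dims:", dict(ds.sizes))
    return ds

# In-memory stand-in for one of the files (no coordinates assigned):
sample = xr.Dataset({"RAPID_mn_cfs": (("comid", "time_mn"), np.zeros((3, 2)))})
out = check_preprocess(sample)
```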
This is what a single one of the datasets looks like if you open it:
path = 'path/to/my/file/file.nc'
ds = xr.open_dataset(path, decode_times=False)
ds
# <xarray.Dataset> Size: 1GB
# Dimensions: (comid: 2677612, time_mn: 120, time_yr: 10)
# Dimensions without coordinates: comid, time_mn, time_yr
# Data variables:
# COMID (comid) int32 11MB ...
# Time_mn (time_mn) int32 480B ...
# Time_yr (time_yr) int32 40B ...
# RAPID_mn_cfs (comid, time_mn) float32 1GB ...
# RAPID_yr_cfs (comid, time_yr) float32 107MB ...
To use `open_mfdataset`, my code looks like this. The `assignCoordinates` function works as intended, but opening the datasets still fails.
def assignCoordinates(df):
    df = df.assign_coords({
        "comid": df['COMID'],
        # fd.calcDatetimes converts the unusual time units used in these
        # files into datetimes; the function works properly.
        "time_mn": fd.calcDatetimes(df, 'Time_mn', df.sizes['time_mn']),
        "time_yr": fd.calcDatetimes(df, 'Time_yr', df.sizes['time_yr'])
    })
    return df
path = "path/to/files/*.nc"
ds = xr.open_mfdataset(path, preprocess=assignCoordinates, combine='by_coords', decode_times=False)
ds
This is the error I receive:
ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation
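If I understand this error correctly, `combine='by_coords'` found no dimension coordinates to order the files by; dimensions that merely exist (the ones listed under "Dimensions without coordinates" above) are not enough. A small sketch of the difference, using an in-memory dataset shaped like the files above:

```python
import numpy as np
import xarray as xr

# A dataset with dimensions but no coordinates on them, like my files:
no_coords = xr.Dataset({"RAPID_yr_cfs": (("comid", "time_yr"), np.zeros((4, 2)))})
print(list(no_coords.coords))  # nothing for by_coords to order on

# After promoting values to a dimension coordinate, by_coords has
# something it can use to order the datasets:
with_coords = no_coords.assign_coords(comid=("comid", np.arange(4)))
print(list(with_coords.coords))
```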
I assume the preprocessed files are not actually being used by `open_mfdataset`, but then I don't really understand what the point of that argument is. My suspicion is reinforced by the fact that, if the preprocessed datasets were being used, I should be able to remove `decode_times=False`, because after running through `assignCoordinates` the times are calculated in a way that makes sense and could be decoded. But if I remove it, I get an error saying the times cannot be decoded.
Is there a way to do what I want, or do I really have to open each dataset individually?
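For completeness, the per-file fallback I'm trying to avoid would look roughly like this: open each dataset, assign its dimension coordinates, and hand the list to `xr.combine_by_coords`. The sketch below uses two in-memory datasets in place of files, and a simplified stand-in (`promote_comid`) for my real `assignCoordinates`:

```python
import numpy as np
import xarray as xr

def promote_comid(ds):
    # Simplified stand-in for assignCoordinates: promote the data
    # variable "COMID" to a dimension coordinate on "comid".
    return ds.assign_coords(comid=ds["COMID"])

# Two in-memory datasets stand in for two files on disk:
parts = []
for start in (0, 3):
    raw = xr.Dataset({
        "COMID": ("comid", np.arange(start, start + 3)),
        "RAPID_mn_cfs": (("comid", "time_mn"), np.full((3, 2), float(start))),
    })
    parts.append(promote_comid(raw))

# With dimension coordinates assigned, by-coords combining works:
combined = xr.combine_by_coords(parts)
print(combined.sizes)  # comid is now 6 after concatenation
```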
Minimum Reproducible Example
Copy this code, and fill out the export path. This will create three .nc
files in the directory you specify.
import xarray as xr
import numpy as np
import pandas as pd
np.random.seed(0)
temperature = 15 + 8 * np.random.randn(2, 3, 4)
precipitation = 10 * np.random.rand(2, 3, 4)
lon = [-99.83, -99.32]
lat = [42.25, 42.21]
instruments = ["manufac1", "manufac2", "manufac3"]
time = pd.date_range("2014-09-06", periods=4)
reference_time = pd.Timestamp("2014-09-05")
ds = xr.Dataset(
data_vars=dict(
temperature=(["loc", "instrument", "time"], temperature),
precipitation=(["loc", "instrument", "time"], precipitation),
),
attrs=dict(description="Weather related data."),
)
for i in range(1, 4):
    ds.to_netcdf(f'yourdirectory/test{i}.nc') #### EDIT HERE #####
After doing the above, run this code (remember to alter the directory to where you saved the files created above):
def assignCoordinates(df):
    df = df.assign_coords({
        "loc": df['loc'],
        "instrument": df['instrument'],
        "time": df['time']
    })
    return df
ds = xr.open_mfdataset('yourdirectory/*.nc', preprocess=assignCoordinates, combine='by_coords') #### EDIT HERE #####
ds