Optimal way to convert an xr.Dataset to a Dask DataFrame?
I am running a function on a Dask cluster to compute the normalized difference between two of the data variables in my xarray Dataset. However, the data needs to be in Dask DataFrame form before I can do the computation. (I realize I could do this while keeping the object as an xarray Dataset; I'm testing future functions for which I need to convert from an xarray Dataset to a Dask DataFrame.) So I call xr.Dataset.to_dask_dataframe() and then run my code.

I compared the performance of running this function on the xarray Dataset directly versus on the Dask DataFrame, and the Dask DataFrame case uses significantly more resources because of all the extra tasks Dask schedules to repartition/rechunk my data (see screenshots: the first is the xarray run, the second is the Dask DataFrame run). I understand that repartitioning and rechunking are very expensive operations. Since xr.Dataset.to_dask_dataframe() does not carry my xarray chunks over into DataFrame partitions, I run a repartition step in my code. Also, excuse all of the red task streams and the white space between tasks; this is not optimal by any means.

What are my options for efficiently converting the xarray Dataset to a Dask DataFrame, and potentially reducing the number of tasks in my task stream? I suspect it comes down to how I chunk/partition the data, but I'm not sure.
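For reference, here is a minimal sketch of what I'm doing. The file name, the variable names ("nir" and "red"), and the chunk/partition sizes are placeholders, not my real values:

```python
import xarray as xr

# Open the dataset lazily with Dask-backed chunks
# (file name and chunk size are placeholders).
ds = xr.open_dataset("my_data.nc", chunks={"time": 1000})

# Convert to a Dask DataFrame. In my runs this does not map the
# xarray chunks onto DataFrame partitions, so I repartition
# manually afterwards; this repartition is what floods the task
# stream with extra tasks.
ddf = ds.to_dask_dataframe()
ddf = ddf.repartition(npartitions=100)  # placeholder partition count

# The actual computation: normalized difference between two
# placeholder data variables.
ndiff = (ddf["nir"] - ddf["red"]) / (ddf["nir"] + ddf["red"])
result = ndiff.compute()
```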