I have two groupby operations that I expected to produce the same groups.
One uses pandas:
dg_cmp = pd.DataFrame(data).groupby(header)
where data is a numpy.ndarray of shape (120120, 1500) and header is a pandas.core.series.Series of length 120120:
<class 'pandas.core.series.Series'>
1 1
2 2
3 3
4 4
5 5
...
120116 2138
120117 2139
120118 2140
120119 2141
120120 2142
Name: CDP, Length: 120120
The groups returned look like:
{1.0: array([1]), 2.0: array([2]), 3.0: array([ 3, 121]), 4.0: array([ 4, 122]),
The other uses xarray.DataArrays:
dg_cmp = data.groupby(cdp_dataarray)
where data is
<xarray.DataArray 'data' (cdp: 120120, twt: 1500)> Size: 721MB
dask.array<open_dataset-data, shape=(120120, 1500), dtype=float32, chunksize=(51860, 647), chunktype=numpy.ndarray>
Coordinates:
* cdp (cdp) int64 961kB 0 1 2 3 4 ... 120115 120116 120117 120118 120119
* twt (twt) float64 12kB 0.0 4.0 8.0 ... 5.988e+03 5.992e+03 5.996e+03
and cdp_dataarray is:
<xarray.DataArray 'CDP' (trace: 120120)> Size: 240kB
array([ 1, 2, 3, ..., 2140, 2141, 2142], dtype=uint16)
Coordinates:
* trace (trace) int64 961kB 0 1 2 3 4 ... 120116 120117 120118 120119
The DataArrayGroupBy groups returned are:
{1: [0], 2: [1], 3: [2, 120], 4: [3, 121], 5: [4, 122, 240],
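These are not the same as the pandas groups: in the entries shown, every position is one lower than in the pandas result, and the keys are integers where pandas has floats. Can anyone spot why this is, and suggest how to get the xarray groupby to match the pandas one?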
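To narrow this down, here is a minimal sketch of the pandas side using toy data. The 6x2 array, the CDP values and the names by_label and by_position are made up for illustration and are not from my real dataset. It assumes the difference comes from header being labelled 1..N while pd.DataFrame(data) gets a default 0..N-1 index, since pandas aligns a Series grouper on index labels rather than on position:

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical): 6 traces with 2 samples each.
data = np.zeros((6, 2), dtype="float32")
cdp = [1, 2, 3, 4, 3, 4]

# Series labelled 1..6, mimicking the header printed above.
header = pd.Series(cdp, index=range(1, 7), name="CDP")

# Grouping by a Series aligns it to the frame's index labels (0..5),
# so row 0 has no matching label and becomes NaN.
by_label = pd.DataFrame(data).groupby(header).indices

# Grouping by the raw values ignores labels and is purely positional.
by_position = pd.DataFrame(data).groupby(header.to_numpy()).indices

print({k: v.tolist() for k, v in by_label.items()})
print({k: v.tolist() for k, v in by_position.items()})
```

With this toy data the label-aligned grouping has float keys and puts CDP 3 at rows [3, 5], while the positional grouping has integer keys and puts it at rows [2, 4]. That is the same one-row shift and float/int difference as in my real output, so it may be the pandas result, not the xarray one, that is offset.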