Working with embedded software, I use numpy arrays with explicit dtypes to save memory, among other reasons. I'm on numpy 1.23.4 and Python 3.10.0, but this seems to be an issue with the latest versions too. In one task I noticed that sorting such an array with np.sort(..., order='id')
is very slow. This came as a surprise: if we sort just the ID column separately with np.sort,
it is handled quickly, and no other column is relevant to the sort order.
A small example:
import numpy as np

dtypes = [('x', 'u1'), ('id', '<u8')]  # one small payload field plus a 64-bit id
x_values = [n % 2 for n in range(5000000)]
ids = list(range(5000000, 0, -1))  # ids in descending order, so the sort does real work
x = np.array(list(zip(x_values, ids)), dtype=dtypes)
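For concreteness, this is roughly how I measured it (a sketch using timeit and the array x from above; exact numbers will of course vary by machine):

import timeit

t_full = timeit.timeit(lambda: np.sort(x, order='id'), number=1)  # full structured-array sort
t_col = timeit.timeit(lambda: np.sort(x['id']), number=1)  # sorting the id column alone
print(f"order='id': {t_full:.2f} s, id column only: {t_col:.2f} s")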
Now, running np.sort(x, order='id') takes around 15-20 seconds on my machine. Compared to np.sort(x['id']), which finishes in roughly 1/100th of that time, this seems unreasonably slow, and it suggests working around the problem rather than sorting the complete array directly.
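One such workaround, sketched here: use np.argsort on the id column alone to get the sorting permutation, then apply it to the whole array in a single indexing step. That way the comparison sort only ever touches a plain uint64 column, and each full record is moved exactly once:

perm = np.argsort(x['id'], kind='stable')  # cheap: argsort on the plain integer column
x_sorted = x[perm]  # one gather applies the permutation to the full records

This avoids the slow structured-array sort entirely, but it is exactly the kind of workaround I would rather not need.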
What is it about a structured (multi-dtype) array that makes the sort so much more expensive?