I have a numpy.array items
of N~10^7 integers and a numpy.array coefficients
of N floats.
items
contains repeated entries, and only ~sqrt(N) unique entries.
I would like an array of unique entries unique_items
and an array of accumulated coefficients unique_coefficients
such that unique_coefficients[i]
is the sum over j
of coefficients[j]
such that items[j]=unique_items[i]
.
For example if we have
items = [1,1,2,2,3,4]
coefficients = [1,2,3,4,5,6]
then
unique_items = [1,2,3,4]
unique_coefficients = [3,7,5,6]
The first entry of unique_coefficients
is the sum of the 2 first entries of coefficients
because the two first entries of items
are 1 etc. Note that items
is not necessarily sorted.
For now, I first compute a dict where each key is a unique element of items
, and each value is a list of indices of the element in items
, then I sum over the indices stored in this dict:
d = {}
for index, item in enumerate(items):
if not item in d:
d[item] = []
d[item].append(index)
unique_items = d.keys()
unique_coefficients = [np.sum(coefficients[i]) for i in d.values()]
Is there a way to avoid doing this python for loop and stay in numpy instead ? This for loop is where my code spends the most time.