Efficient algorithm/implementation for k-maximum inner product search
I am in a situation where I have a query space Q and a key space K, both filled with N d-dimensional real vectors (N ≈ 10^6, d ≈ 50). For each query q, I want to find the k≈10 keys k_i that have the highest inner products with q. One may assume that both keys and queries are normally distributed. Ideally, I would have a solution that solves this problem for N=10^6 in +- 1 second on a powerful GPU. An approximate solution is acceptable.