I am working on a high-performance sorting algorithm to sort hundreds of millions of uint64_t data. To improve efficiency, I want to use parallel execution with the x86simdsort library. I can provide an array of pivots which can approximately divide the data into n equal parts. Below is a simplified version of my implementation:
cpp
extern "C" void parallel_sort(uint64_t * arr, size_t size, const std::array<uint64_t, 3>& pivots) {
auto partition = [](uint64_t* arr, size_t low, size_t high, uint64_t pivot) {
size_t i = low;
size_t j = high;
while (true) {
while (i <= high && arr[i] <= pivot) i++;
while (j >= low && arr[j] > pivot) j--;
if (i >= j) break;
std::swap(arr[i], arr[j]);
i++;
j--;
}
return i;
};
size_t mid1 = partition(arr, 0, size - 1, pivots[1]);
#pragma omp parallel sections
{
#pragma omp section
{size_t low_mid = partition(arr, 0, mid1 - 1, pivots[0]);}
#pragma omp section
{size_t high_mid = partition(arr, mid1, size - 1, pivots[2]);}
}
#pragma omp parallel sections
{
#pragma omp section
{x86simdsort::qsort(arr, low_mid, false);}
#pragma omp section
{x86simdsort::qsort(arr + low_mid, mid1 - low_mid, false);}
#pragma omp section
{x86simdsort::qsort(arr + mid1, high_mid - mid1, false);}
#pragma omp section
{x86simdsort::qsort(arr + high_mid, size - high_mid, false);}
}
}
However, I found that my custom partition function is very inefficient, taking up more than 60% of the total sorting time. This makes the overall speed not much faster than single-threaded x86simdsort::qsort.
My Goal
I want to use the highly efficient partition_unrolled function from the x86simdsort library to replace my partitioning step. The function is defined in xss-common-qsort.h as follows:
template <typename vtype,
typename comparator,
int num_unroll,
typename type_t = typename vtype::type_t>
X86_SIMD_SORT_INLINE arrsize_t partition_unrolled(type_t *arr,
arrsize_t left,
arrsize_t right,
type_t pivot,
type_t *smallest,
type_t *biggest);
My Question
How can I directly call the partition_unrolled function in my code? The structure of the x86simdsort library is quite complex. I need guidance on how to expose this function and make it usable in my parallel sorting implementation. What is the best approach to call the partition_unrolled function from the library in my project?
Additional Information
I am using Meson to build the project, as recommended in the project’s README