Virtual functions can be used to implement dynamic polymorphism. I know that most C++ compilers implement them with a vtable: each object carries a pointer to its class's vtable, and finding the correct function definition costs one extra vtable read (one extra memory access). I suspect this is not efficient on the GPU (CUDA). Since I don't know how nvcc handles virtual functions, I assume it works the same way as on the CPU, so the vtable, together with the code it points to, resides in global memory. The extra memory access on the GPU can then be slow, since it has little chance of being coalesced and it goes to slow global memory.
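For reference, the virtual-function version I have in mind looks roughly like the following sketch (my own illustration, not taken from any library; BaseOp/MulOp/AddOp are made-up names). One detail worth noting: the polymorphic objects have to be constructed on the device, because a vtable pointer written by a host-side constructor would not be a valid device address.

#include <cuda_runtime.h>
#include <cstdio>

struct BaseOp {
    __device__ virtual int operation(int opr1, int opr2) const = 0;
    __device__ virtual ~BaseOp() {}
};

struct MulOp : public BaseOp {
    int a, b;
    __device__ MulOp(int _a, int _b): a(_a), b(_b) {}
    __device__ int operation(int opr1, int opr2) const override {
        return (a + opr1) * (b + opr2);
    }
};

struct AddOp : public BaseOp {
    int a, b;
    __device__ AddOp(int _a, int _b): a(_a), b(_b) {}
    __device__ int operation(int opr1, int opr2) const override {
        return a * opr1 + b * opr2;
    }
};

// Objects are built on the device heap so their vtable pointers are valid in device code.
__global__ void construct_kernel(BaseOp** objects) {
    int i = threadIdx.x;
    if (i % 2 == 0)
        objects[i] = new MulOp(i / 2, 1);
    else
        objects[i] = new AddOp(i / 2, 1);
}

// Each virtual call first loads the vtable entry from global memory, then jumps to it.
__global__ void virtual_kernel(BaseOp** objects, int* result_buffer) {
    result_buffer[threadIdx.x] = objects[threadIdx.x]->operation(1, 1);
}

__global__ void destroy_kernel(BaseOp** objects) {
    delete objects[threadIdx.x];
}

int main() {
    BaseOp** objects;
    int* res_buffer;
    cudaMallocManaged(&objects, sizeof(BaseOp*) * 8);
    cudaMallocManaged(&res_buffer, sizeof(int) * 8);
    construct_kernel<<<1, 8>>>(objects);
    virtual_kernel<<<1, 8>>>(objects, res_buffer);
    cudaDeviceSynchronize();
    for (int i = 0; i < 8; i++)
        printf("%d\n", res_buffer[i]);
    destroy_kernel<<<1, 8>>>(objects);
    cudaDeviceSynchronize();
    cudaFree(objects);
    cudaFree(res_buffer);
    return 0;
}

This is exactly what I want to avoid: two levels of indirection (the object pointer and the vtable) that both live in global memory.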
I therefore opted for a variant type, in the spirit of std::variant, to avoid virtual functions on the GPU. Although CUDA does not directly support std::variant, I did find an implementation that can be used in CUDA, and here is the code I used for testing:
#include <variant/variant.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Simple CUDA error-checking helper.
#define CUDA_CHECK_RETURN(value) {                                      \
    cudaError_t err = (value);                                          \
    if (err != cudaSuccess) {                                           \
        fprintf(stderr, "Error %s at line %d in file %s\n",             \
                cudaGetErrorString(err), __LINE__, __FILE__);           \
        exit(1);                                                        \
    }                                                                   \
}

struct Type1 {
    int a, b;
    __host__ __device__ Type1(int _a = 0, int _b = 0): a(_a), b(_b) {}
    __host__ __device__ int operation(int opr1, int opr2) const {
        return (a + opr1) * (b + opr2);
    }
};

struct Type2 {
    int a, b;
    __host__ __device__ Type2(int _a = 0, int _b = 0): a(_a), b(_b) {}
    __host__ __device__ int operation(int opr1, int opr2) const {
        return a * opr1 + b * opr2;
    }
};

// Visitor applied to the variant: the right overload is picked per alternative.
struct TypeVisitor {
    int opr1, opr2;
    __host__ __device__ TypeVisitor(int op1 = 1, int op2 = 1): opr1(op1), opr2(op2) {}
    __host__ __device__ int operator()(const Type1& t) const { return t.operation(opr1, opr2); }
    __host__ __device__ int operator()(const Type2& t) const { return t.operation(opr1, opr2); }
};

using VarType = variant::variant<Type1, Type2>;

__global__ void kernel_op(VarType* objects, int* result_buffer) {
    result_buffer[threadIdx.x] = variant::apply_visitor(TypeVisitor(), objects[threadIdx.x]);
}

int main() {
    VarType* vars;
    CUDA_CHECK_RETURN(cudaMallocManaged(&vars, sizeof(VarType) * 8));
    vars[0] = Type1(0, 1);
    vars[1] = Type2(0, 1);
    vars[2] = Type1(1, 1);
    vars[3] = Type2(1, 1);
    vars[4] = Type1(2, 1);
    vars[5] = Type2(2, 1);
    vars[6] = Type1(1, 2);
    vars[7] = Type2(1, 2);
    int* res_buffer;
    CUDA_CHECK_RETURN(cudaMallocManaged(&res_buffer, sizeof(int) * 8));
    kernel_op<<<1, 8>>>(vars, res_buffer);
    CUDA_CHECK_RETURN(cudaDeviceSynchronize());
    for (int i = 0; i < 8; i++)
        printf("%d\n", res_buffer[i]);
    CUDA_CHECK_RETURN(cudaFree(vars));
    CUDA_CHECK_RETURN(cudaFree(res_buffer));
    return 0;
}
My questions are:

- Is variant here based on static polymorphism, so that the object type information is known at compile time? What if I need to load the objects from a vector whose contents are not known until run time (see the sketch after this list for what I mean)?
- People say there is "no free lunch", so what is the cost of variant, and is it really better than virtual functions?
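To make the first question concrete, here is a hypothetical sketch of the run-time case, reusing VarType, Type1 and Type2 from the test program above; the per-element tag stands in for information that only becomes available at run time (from a file, user input, etc.):

#include <cstdlib>

// Hypothetical sketch: the concrete type of each element is decided by a tag
// that is only known at run time (faked here with rand()).
void fill_from_runtime_tags(VarType* vars, int n) {
    for (int i = 0; i < n; i++) {
        int tag = rand() % 2;          // run-time information
        if (tag == 0)
            vars[i] = Type1(i, 1);     // the variant records which alternative it holds
        else
            vars[i] = Type2(i, 1);
    }
}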