I have a C++ library that runs kernels on GPUs (it uses Kokkos). I would like to expose it in Python and couple it with PyTorch, so I am using pybind11.
To avoid adding a dependency on the PyTorch C++ API to my library, I am simply passing raw pointers across the C++/Python boundary.
I have examples here:
C++ code:
https://github.com/jtchilders/python_kokkos_tests/blob/main/tensor_example.cpp
Python code:
https://github.com/jtchilders/python_kokkos_tests/blob/main/tensor_example.py
I can successfully fill a PyTorch tensor and pass it into the pybind11-bound C++ function via:
tensor = torch.randn(10, device='cuda')
result = tenex.process_tensor(tensor.data_ptr(), tensor.size(0))
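For context, the binding is roughly along these lines. This is only a simplified sketch, not the exact code from the repo (the real binding lives in tensor_example.cpp linked above); TensorData is a plain struct carrying the returned device pointer and element count.

// Simplified sketch of the binding, not copied verbatim from the repo.
// The module name "tenex" matches the Python snippet above.
#include <pybind11/pybind11.h>
#include <cstddef>
#include <cstdint>

namespace py = pybind11;

struct TensorData {
    std::uintptr_t data_ptr;  // device pointer, handed to Python as an integer
    std::size_t size;         // number of float elements
};

TensorData process_tensor(std::uintptr_t input_data_ptr, std::size_t size);

PYBIND11_MODULE(tenex, m) {
    py::class_<TensorData>(m, "TensorData")
        .def_readonly("data_ptr", &TensorData::data_ptr)
        .def_readonly("size", &TensorData::size);
    m.def("process_tensor", &process_tensor,
          py::arg("input_data_ptr"), py::arg("size"));
}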
The C++ function itself:
TensorData process_tensor(uintptr_t input_data_ptr, std::size_t size) {
    // Convert the integer handed over from Python back to a device float*
    float* actual_data_ptr = reinterpret_cast<float*>(input_data_ptr);

    // Wrap the raw device pointer in an unmanaged Kokkos view
    Kokkos::View<float*, Kokkos::DefaultExecutionSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged>> input_view(actual_data_ptr, size);

    // Allocate device-side memory for the output using Kokkos
    float* output_data_ptr = static_cast<float*>(Kokkos::kokkos_malloc<Kokkos::DefaultExecutionSpace>(size * sizeof(float)));

    // Wrap the new allocation in an unmanaged Kokkos view as well.
    // I use unmanaged views so that Kokkos does not try to clean up the
    // memory automatically; ownership should end up on the Python side.
    Kokkos::View<float*, Kokkos::DefaultExecutionSpace, Kokkos::MemoryTraits<Kokkos::Unmanaged>> output_view(output_data_ptr, size);

    // Perform the operation on the GPU
    Kokkos::parallel_for("scale_data", Kokkos::RangePolicy<Kokkos::Cuda>(0, size), KOKKOS_LAMBDA(const int i) {
        output_view(i) = input_view(i) * 2.0f;  // example operation
    });
    Kokkos::fence();  // ensure the kernel has completed

    // Return the output pointer and size as a TensorData (the struct sketched above)
    TensorData result;
    result.data_ptr = reinterpret_cast<uintptr_t>(output_view.data());
    result.size = size;
    return result;
}
How can I create a PyTorch tensor from a raw device pointer and an array size?
Keep performance in mind: every step above is meant to avoid unnecessary host/device copies, which slow things down, so I want the PyTorch tensor to simply take ownership of the device memory behind the returned pointer. Ideally that memory would also be freed automatically when the tensor is destroyed at the end of the Python code.
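For completeness, I know I could expose an explicit free function and call it manually from Python once I'm done with the result, roughly like the hypothetical sketch below (free_device_array is a name I made up for illustration; it is not in the repo), but that is exactly the lifetime bookkeeping I'd like the tensor to handle for me.

// Hypothetical fallback, for illustration only (not in the repo): expose a
// manual free so Python can release the allocation made by process_tensor.
#include <Kokkos_Core.hpp>
#include <cstdint>

void free_device_array(std::uintptr_t data_ptr) {
    // kokkos_free releases memory obtained from kokkos_malloc in the
    // memory space backing the default execution space.
    Kokkos::kokkos_free<Kokkos::DefaultExecutionSpace::memory_space>(
        reinterpret_cast<void*>(data_ptr));
}

// and in PYBIND11_MODULE:  m.def("free_device_array", &free_device_array);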