I’m currently porting a B+ Tree data structure and some of its operations to CUDA. My focus is on speeding up a SELECT * operation by leveraging the parallelism of a many-core processor. I’m wondering if anybody here has ideas or insights that I’m missing.
My approach represents the B+ Tree as an array and uses braided parallelism, mapping each thread to one possible child of a node, i.e. for a tree of order 128 (127 keys/128 children), I use 128 threads. With this thread mapping, I can traverse the data noticeably faster than on the CPU, but when I “perform” the SELECT by printing out the key-value pairs of the B+ Tree, it becomes slower than the CPU, because the GPU is strong at ALU operations, not at printing. If anybody has ideas for possible improvements, feedback would be appreciated.
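
To make the thread mapping concrete, here is a minimal sketch of it, not the actual implementation. It assumes a hypothetical flat `Leaf` layout in which leaves are stored in one array and chained by index, and the names `Leaf`, `Pair`, `select_all` and `select_all_host` are made up for illustration. One block of 128 threads walks the leaf chain, and thread `t` handles slot `t` of the current leaf. In place of printing inside the kernel, the sketch writes the pairs to an output buffer and copies them back to the host in one transfer, so any printing would happen on the CPU; whether that actually helps would need to be measured. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int ORDER = 128;           // children per node, as in the post
constexpr int MAX_KEYS = ORDER - 1;  // 127 keys per node

// Hypothetical array-based leaf layout (an assumption, not the real one).
struct Leaf {
    int count;             // number of valid key-value pairs in this leaf
    int next;              // index of the next leaf in the array, -1 at the end
    int keys[MAX_KEYS];
    int values[MAX_KEYS];
};

struct Pair { int key; int value; };

// Launched with a single block of ORDER threads.
// Thread t copies slot t of each leaf into the output buffer.
__global__ void select_all(const Leaf* leaves, int first_leaf,
                           Pair* out, int* out_count) {
    int t = threadIdx.x;
    int written = 0;  // same value in every thread: pairs emitted so far
    for (int leaf = first_leaf; leaf != -1; leaf = leaves[leaf].next) {
        int n = leaves[leaf].count;
        if (t < n) {
            out[written + t] = Pair{leaves[leaf].keys[t], leaves[leaf].values[t]};
        }
        written += n;
    }
    if (t == 0) *out_count = written;
}

// Host wrapper: uploads the leaves, runs the kernel, and returns all pairs
// in leaf-chain order. Assumes the chain has no cycles.
std::vector<Pair> select_all_host(const std::vector<Leaf>& leaves, int first_leaf) {
    size_t total = 0;  // upper bound on the number of output pairs
    for (const Leaf& l : leaves) total += l.count;

    Leaf* d_leaves = nullptr;
    Pair* d_out = nullptr;
    int* d_count = nullptr;
    cudaMalloc(&d_leaves, (leaves.empty() ? 1 : leaves.size()) * sizeof(Leaf));
    cudaMalloc(&d_out, (total ? total : 1) * sizeof(Pair));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_leaves, leaves.data(), leaves.size() * sizeof(Leaf),
               cudaMemcpyHostToDevice);

    select_all<<<1, ORDER>>>(d_leaves, first_leaf, d_out, d_count);

    int n = 0;
    cudaMemcpy(&n, d_count, sizeof(int), cudaMemcpyDeviceToHost);  // waits for the kernel
    std::vector<Pair> result(n);
    cudaMemcpy(result.data(), d_out, n * sizeof(Pair), cudaMemcpyDeviceToHost);

    cudaFree(d_leaves);
    cudaFree(d_out);
    cudaFree(d_count);
    return result;
}
```

Calling `select_all_host(leaves, first_leaf)` returns the pairs as a `std::vector<Pair>`, which the CPU can then print or consume in a single pass. The sketch uses only one block, so it shows the per-slot mapping rather than a tuned design; assigning several leaves to separate blocks would be a separate step.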