I implemented my computation logic in C++ (specifically, C++ code generated by PyTorch's Inductor), using OpenMP to accelerate the tensor computations. For example, one of the generated functions looks like this:
extern "C" void cpp_fused_convolution_0(const float* in_ptr0,
const float* in_ptr1,
float* out_ptr0,
float* out_ptr1,
const long ks0)
{
#pragma omp parallel num_threads(64)
{
int t_nums = omp_get_num_threads();
int tid = omp_get_thread_num();
printf("Current thread ID: %d, all thread nums: %dn", tid, t_nums);
{
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(ks0); x0+=static_cast<long>(1L))
{
#pragma GCC ivdep
for(long x1=static_cast<long>(0L); x1<static_cast<long>(3L); x1+=static_cast<long>(1L))
{
#pragma GCC ivdep
for(long x2=static_cast<long>(0L); x2<static_cast<long>(50176L); x2+=static_cast<long>(1L))
{
auto tmp0 = in_ptr0[static_cast<long>(x2 + (50176L*x1) + (150528L*x0))];
out_ptr0[static_cast<long>(x1 + (3L*x2) + (150528L*x0))] = tmp0;
}
}
}
}
#pragma omp single
{
{
#pragma GCC ivdep
for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(1L))
{
#pragma GCC ivdep
for(long x1=static_cast<long>(0L); x1<static_cast<long>(3L); x1+=static_cast<long>(1L))
{
#pragma GCC ivdep
for(long x2=static_cast<long>(0L); x2<static_cast<long>(49L); x2+=static_cast<long>(1L))
{
auto tmp0 = in_ptr1[static_cast<long>(x2 + (49L*x1) + (147L*x0))];
out_ptr1[static_cast<long>(x1 + (3L*x2) + (147L*x0))] = tmp0;
}
}
}
}
}
}
}
I configured 64 threads for parallel computation!
I added a line of code to print the thread ID and the total number of threads: printf("Current thread ID: %d, all thread nums: %d\n", tid, t_nums);
Then I compiled this code into a shared library with GCC, adding the -fopenmp flag to enable OpenMP; testing in the Python environment later showed that the library behaves as expected.
g++ /home/admin/zy429782/alibaba/aios/basic_ops/testdata/pytorch_models/resnet18_cpu/data/t7/ct73jxe2z6yybfbkncpbmdx6egd236o66jam7hou27boj5eumdpl.cpp -fPIC -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -D_GLIBCXX_USE_CXX11_ABI=0 -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/include -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/include/TH -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/include/THC -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/include/python3.8 -mavx512f -mavx512dq -mavx512vl -mavx512bw -mfma -D CPU_CAPABILITY_AVX512 -O3 -DNDEBUG -ffast-math -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -march=native -fopenmp -D TORCH_INDUCTOR_CPP_WRAPPER -D C10_USING_CUSTOM_GENERATED_MACROS -c -o /home/admin/zy429782/alibaba/aios/basic_ops/testdata/pytorch_models/resnet18_cpu/data/t7/ct73jxe2z6yybfbkncpbmdx6egd236o66jam7hou27boj5eumdpl.o
g++ /home/admin/zy429782/alibaba/aios/basic_ops/testdata/pytorch_models/resnet18_cpu/data/t7/ct73jxe2z6yybfbkncpbmdx6egd236o66jam7hou27boj5eumdpl.o /home/admin/zy429782/alibaba/aios/basic_ops/testdata/pytorch_models/resnet18_cpu/data/7e/c7eavtek6ibammokniksmjwkgefuz6cjrgpw46g7vvm5ts6rb42t.o -shared -fPIC -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -D_GLIBCXX_USE_CXX11_ABI=0 -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/include -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/include/TH -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/include/THC -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/include/python3.8 -L/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/lib -L/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib -L/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/lib -ltorch -ltorch_cpu -lgomp -lc10 -mavx512f -mavx512dq -mavx512vl -mavx512bw -mfma -D CPU_CAPABILITY_AVX512 -O3 -DNDEBUG -ffast-math -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -march=native -fopenmp -D TORCH_INDUCTOR_CPP_WRAPPER -D C10_USING_CUSTOM_GENERATED_MACROS -o resnet18_cpu.so
Since I used PyTorch’s aot_compile feature to generate the C++ code, I can load this dynamic library directly in the Python environment. Essentially, it’s still calling the functions within the library, just like any regular dynamic library loading.
import torch
model = torch._export.aot_load('resnet18_cpu.so', 'cpu')
print(model(torch.ones(4, 3, 224, 224)))
The output looks like this, and everything works as expected:
Current thread ID: 24, all thread nums: 64
Current thread ID: 48, all thread nums: 64
Current thread ID: 6, all thread nums: 64
Current thread ID: 49, all thread nums: 64
Current thread ID: 40, all thread nums: 64
Current thread ID: 46, all thread nums: 64
Current thread ID: 58, all thread nums: 64
Current thread ID: 1, all thread nums: 64
.................
Current thread ID: 17, all thread nums: 64
Current thread ID: 57, all thread nums: 64
Current thread ID: 11, all thread nums: 64
Current thread ID: 25, all thread nums: 64
Current thread ID: 43, all thread nums: 64
Current thread ID: 44, all thread nums: 64
Current thread ID: 62, all thread nums: 64
Current thread ID: 34, all thread nums: 64
Current thread ID: 20, all thread nums: 64
Current thread ID: 8, all thread nums: 64
Current thread ID: 42, all thread nums: 64
Current thread ID: 14, all thread nums: 64
Current thread ID: 36, all thread nums: 64
Current thread ID: 56, all thread nums: 64
Current thread ID: 22, all thread nums: 64
tensor([[-0.0391, 0.1145, -1.7968, ..., -1.5152, 0.1724, 0.1825],
[-0.0391, 0.1145, -1.7968, ..., -1.5152, 0.1724, 0.1825],
[-0.0391, 0.1145, -1.7968, ..., -1.5152, 0.1724, 0.1825],
[-0.0391, 0.1145, -1.7968, ..., -1.5152, 0.1724, 0.1825]])
>>>
However, when I write a custom TensorFlow operator in C++ and load this shared library for inference, the code looks something like this:
_aotiModelContainerRunner = std::make_shared<torch::inductor::AOTIModelContainerRunnerCpu>(modelFilePath);
vector<at::Tensor> realOutputs = _aotiModelContainerRunner->run(realInputs);
Strangely, all the printed thread IDs are 0, yet the total number of threads is indeed 64.
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
...........................
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Moreover, the computation results are wildly incorrect, with NaN values appearing.
"output tensors: (0th) Shape:[4,1000], Content:[-nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan
I believe some setting must be affecting OpenMP's expected behavior. Even adding export OMP_NUM_THREADS=64 didn't help. I also created a symlink for /lib64/libgomp.so.0.1 pointing to the libgomp library in my conda environment, but that didn't work either. Something is definitely interfering with OpenMP's expected behavior, but after searching for a long time I still haven't found a solution. Does anyone have any ideas about possible causes?
$ldd /home/admin/zy429782/.sandbox/bazel-sandbox.1866e40fc46b8e58d1839b91358c2661f1ec6978b8ad743585739725887aba2e/linux-sandbox/13/execroot/com_taobao_aios/bazel-out/k8-fastbuild/bin/aios/basic_ops/basic_ops/ops/model/pytorch_model_predict_op_test.runfiles/com_taobao_aios/aios/basic_ops/basic_ops/ops/model/pytorch_model_predict_op_test | grep gomp
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f97c1644000)
$ldd resnet18_cpu.so | grep omp
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f0aafce2000)
I hope to find the cause of the bug or a way to work around it.
The OpenMP multithreading issue arises because PyTorch bundles its own copy of libgomp. For example, if you download torch-2.3.0-cp38-cp38-manylinux1_x86_64.whl from pytorch.org, rename it to .zip, and extract it, you will find libgomp-a34b3233.so.1 in the torch/lib directory. Some of PyTorch's behavior, especially in libtorch_cpu.so, depends on this bundled OpenMP library.
However, there is another libgomp.so.1 in /lib64, and it behaves differently. The forward logic of the nn.Module compiled by Torch Inductor is built with the system gcc and therefore links against /lib64/libgomp.so.1, so my executable ends up loading this system library first.
To fix this, preload the bundled libgomp-a34b3233.so.1 before starting the program, for example via LD_PRELOAD, as in the command below.
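For example, using the libgomp bundled in my conda environment's torch/lib directory (adjust the path to your own installation; the binary name here is just a placeholder for however you launch your program):

$ LD_PRELOAD=/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1 ./pytorch_model_predict_op_test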