I implemented my computation logic in C++ (specifically, C++ code generated by PyTorch's Inductor), using OpenMP to accelerate the tensor computations. For example, one of the generated functions looks like this:
extern "C" void cpp_fused_convolution_0(const float* in_ptr0,
const float* in_ptr1,
float* out_ptr0,
float* out_ptr1,
const long ks0)
{
#pragma omp parallel num_threads(64)
{
int t_nums = omp_get_num_threads();
int tid = omp_get_thread_num();
printf("Current thread ID: %d, all thread nums: %dn", tid, t_nums);
{
#pragma omp for
for(long x0=static_cast<long>(0L); x0<static_cast<long>(ks0); x0+=static_cast<long>(1L))
{
#pragma GCC ivdep
for(long x1=static_cast<long>(0L); x1<static_cast<long>(3L); x1+=static_cast<long>(1L))
{
#pragma GCC ivdep
for(long x2=static_cast<long>(0L); x2<static_cast<long>(50176L); x2+=static_cast<long>(1L))
{
auto tmp0 = in_ptr0[static_cast<long>(x2 + (50176L*x1) + (150528L*x0))];
out_ptr0[static_cast<long>(x1 + (3L*x2) + (150528L*x0))] = tmp0;
}
}
}
}
#pragma omp single
{
{
#pragma GCC ivdep
for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(1L))
{
#pragma GCC ivdep
for(long x1=static_cast<long>(0L); x1<static_cast<long>(3L); x1+=static_cast<long>(1L))
{
#pragma GCC ivdep
for(long x2=static_cast<long>(0L); x2<static_cast<long>(49L); x2+=static_cast<long>(1L))
{
auto tmp0 = in_ptr1[static_cast<long>(x2 + (49L*x1) + (147L*x0))];
out_ptr1[static_cast<long>(x1 + (3L*x2) + (147L*x0))] = tmp0;
}
}
}
}
}
}
}
I configured 64 threads for parallel computation!
I added a line of code to print the thread ID and the total number of threads: printf("Current thread ID: %d, all thread nums: %d\n", tid, t_nums);
Then I compiled this code into a shared library with GCC, adding the -fopenmp flag to enable OpenMP; testing in the Python environment later showed that the library behaves as expected.
g++ /home/admin/zy429782/alibaba/aios/basic_ops/testdata/pytorch_models/resnet18_cpu/data/t7/ct73jxe2z6yybfbkncpbmdx6egd236o66jam7hou27boj5eumdpl.cpp -fPIC -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -D_GLIBCXX_USE_CXX11_ABI=0 -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/include -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/include/TH -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/include/THC -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/include/python3.8 -mavx512f -mavx512dq -mavx512vl -mavx512bw -mfma -D CPU_CAPABILITY_AVX512 -O3 -DNDEBUG -ffast-math -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -march=native -fopenmp -D TORCH_INDUCTOR_CPP_WRAPPER -D C10_USING_CUSTOM_GENERATED_MACROS -c -o /home/admin/zy429782/alibaba/aios/basic_ops/testdata/pytorch_models/resnet18_cpu/data/t7/ct73jxe2z6yybfbkncpbmdx6egd236o66jam7hou27boj5eumdpl.o
g++ /home/admin/zy429782/alibaba/aios/basic_ops/testdata/pytorch_models/resnet18_cpu/data/t7/ct73jxe2z6yybfbkncpbmdx6egd236o66jam7hou27boj5eumdpl.o /home/admin/zy429782/alibaba/aios/basic_ops/testdata/pytorch_models/resnet18_cpu/data/7e/c7eavtek6ibammokniksmjwkgefuz6cjrgpw46g7vvm5ts6rb42t.o -shared -fPIC -Wall -std=c++17 -Wno-unused-variable -Wno-unknown-pragmas -D_GLIBCXX_USE_CXX11_ABI=0 -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/include -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/include/TH -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/include/THC -I/home/admin/zy429782/miniforge3/envs/torch240_cuda121/include/python3.8 -L/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/lib -L/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib -L/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/lib -ltorch -ltorch_cpu -lgomp -lc10 -mavx512f -mavx512dq -mavx512vl -mavx512bw -mfma -D CPU_CAPABILITY_AVX512 -O3 -DNDEBUG -ffast-math -fno-finite-math-only -fno-unsafe-math-optimizations -ffp-contract=off -march=native -fopenmp -D TORCH_INDUCTOR_CPP_WRAPPER -D C10_USING_CUSTOM_GENERATED_MACROS -o resnet18_cpu.so
Since I used PyTorch’s aot_compile feature to generate the C++ code, I can load this dynamic library directly in the Python environment. Essentially, it’s still calling the functions within the library, just like any regular dynamic library loading.
import torch
model = torch._export.aot_load('resnet18_cpu.so', 'cpu')
print(model(torch.ones(4, 3, 224, 224)))
The output looks like this, and everything works as expected:
Current thread ID: 24, all thread nums: 64
Current thread ID: 48, all thread nums: 64
Current thread ID: 6, all thread nums: 64
Current thread ID: 49, all thread nums: 64
Current thread ID: 40, all thread nums: 64
Current thread ID: 46, all thread nums: 64
Current thread ID: 58, all thread nums: 64
Current thread ID: 1, all thread nums: 64
.................
Current thread ID: 17, all thread nums: 64
Current thread ID: 57, all thread nums: 64
Current thread ID: 11, all thread nums: 64
Current thread ID: 25, all thread nums: 64
Current thread ID: 43, all thread nums: 64
Current thread ID: 44, all thread nums: 64
Current thread ID: 62, all thread nums: 64
Current thread ID: 34, all thread nums: 64
Current thread ID: 20, all thread nums: 64
Current thread ID: 8, all thread nums: 64
Current thread ID: 42, all thread nums: 64
Current thread ID: 14, all thread nums: 64
Current thread ID: 36, all thread nums: 64
Current thread ID: 56, all thread nums: 64
Current thread ID: 22, all thread nums: 64
tensor([[-0.0391, 0.1145, -1.7968, ..., -1.5152, 0.1724, 0.1825],
[-0.0391, 0.1145, -1.7968, ..., -1.5152, 0.1724, 0.1825],
[-0.0391, 0.1145, -1.7968, ..., -1.5152, 0.1724, 0.1825],
[-0.0391, 0.1145, -1.7968, ..., -1.5152, 0.1724, 0.1825]])
>>>
However, when I write a custom TensorFlow operator in C++ and load this shared library for inference, the code looks something like this:
_aotiModelContainerRunner = std::make_shared<torch::inductor::AOTIModelContainerRunnerCpu>(modelFilePath);
vector<at::Tensor> realOutputs = _aotiModelContainerRunner->run(realInputs);
Strangely, all the printed thread IDs are 0, yet the total number of threads is indeed 64.
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
...........................
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Current thread ID: 0, all thread nums: 64
Moreover, the computation results are wildly incorrect, with NaN values appearing.
"output tensors: (0th) Shape:[4,1000], Content:[-nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan -nan
I believe some setting must be affecting OpenMP's expected behavior. Even adding export OMP_NUM_THREADS=64 didn't help. I also created a symlink for /lib64/libgomp.so.0.1 pointing to the libgomp library in my conda environment, but that didn't work either. Something is definitely interfering with OpenMP's expected behavior, but after searching for a long time I still haven't found a solution. Does anyone have any ideas about possible causes?
$ldd /home/admin/zy429782/.sandbox/bazel-sandbox.1866e40fc46b8e58d1839b91358c2661f1ec6978b8ad743585739725887aba2e/linux-sandbox/13/execroot/com_taobao_aios/bazel-out/k8-fastbuild/bin/aios/basic_ops/basic_ops/ops/model/pytorch_model_predict_op_test.runfiles/com_taobao_aios/aios/basic_ops/basic_ops/ops/model/pytorch_model_predict_op_test | grep gomp
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f97c1644000)
$ldd resnet18_cpu.so | grep omp
libgomp.so.1 => /lib64/libgomp.so.1 (0x00007f0aafce2000)
I hope to find the cause of the bug or a way to work around it.
The OpenMP multithreading issue arises because PyTorch bundles its own copy of libgomp. For example, if you download torch-2.3.0-cp38-cp38-manylinux1_x86_64.whl from pytorch.org, rename it to .zip, and extract it, you will find libgomp-a34b3233.so.1 in the torch/lib directory. Some of PyTorch's behavior, especially in libtorch_cpu.so, depends on this bundled OpenMP library.
However, there is another libgomp.so.1 in /lib64, and it behaves differently. The forward logic of the nn.Module compiled by Torch Inductor is built with the system gcc and therefore links against /lib64/libgomp.so.1, so my executable ends up loading this system library first.
To fix this, preload the bundled libgomp-a34b3233.so.1 before starting the program, for example via LD_PRELOAD, as in the command below.
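For example, using the libgomp bundled in my conda environment's torch/lib directory (adjust the path to your own installation; the binary name here is just a placeholder for however you launch your program):

$ LD_PRELOAD=/home/admin/zy429782/miniforge3/envs/torch240_cuda121/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1 ./pytorch_model_predict_op_test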