I am optimizing a CUDA kernel that processes an image, and the amount of work done for each source pixel changes during the processing. For example, at some point I have a 2D loop, something like
float sum = 0;
for (int i = 0; i < 5; i++)
    for (int j = 0; j < 5; j++)
        sum += calc_value(x + i * alpha, y + j * beta, vec, something_else);
and less intensive fragments like
Vec3 vec = calc_vector(x, y, something_else_2);
The code is actually much more complicated and includes outer loops and gathering of results from such 2D loops, but in general it is something like
__global__ void process_pixel(Data data)
{
    int x = threadIdx.x;
    int y = threadIdx.y;
    Vec3 results[8][10];
    for (int delta = 0; delta < 8; delta++)
        for (int stage = 0; stage < 10; stage++)
        {
            Vec3 vec = calc_vector(x, y, stage);
            Vec3 vec2 = calc_vector2(x, y, delta);
            float sum = 0;
            for (int i = 0; i < 5; i++)
                for (int j = 0; j < 5; j++)
                    sum += calc_value(x + i * alpha, y + j * beta, vec, vec2, delta, stage);
            results[delta][stage] = func(sum);
        }
    find_best(results); // And so on
}
Indeed, a lot of memory is used by each thread, actually much more than I have shown, so there is a problem with the lack of registers. And there are calculations that could be hoisted because they do not depend on delta, or on stage, or on i, but doing so requires even more per-thread memory.
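To illustrate, hoisting those loop-invariant calls in the one-thread-per-pixel kernel would look roughly like this (same schematic names as above; stage_vecs and delta_vecs are just illustrative names, and they are exactly the extra per-thread memory I am talking about):

__global__ void process_pixel_hoisted(Data data)
{
    int x = threadIdx.x;
    int y = threadIdx.y;

    // Hoisted out of the loops: 10 + 8 extra Vec3 per thread on top of results[8][10].
    Vec3 stage_vecs[10];
    for (int stage = 0; stage < 10; stage++)
        stage_vecs[stage] = calc_vector(x, y, stage);

    Vec3 delta_vecs[8];
    for (int delta = 0; delta < 8; delta++)
        delta_vecs[delta] = calc_vector2(x, y, delta);

    Vec3 results[8][10];
    for (int delta = 0; delta < 8; delta++)
        for (int stage = 0; stage < 10; stage++)
        {
            float sum = 0;
            for (int i = 0; i < 5; i++)
                for (int j = 0; j < 5; j++)
                    sum += calc_value(x + i * alpha, y + j * beta,
                                      stage_vecs[stage], delta_vecs[delta], delta, stage);
            results[delta][stage] = func(sum);
        }
    find_best(results);
}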
So I decided to regroup the threads so that each group of 5 threads works in parallel on one pixel; that way the required memory can be spread among them. Instead of, e.g., 16 * 8 threads each processing one pixel, I started to use 16 * 10 thread blocks and added an additional outer loop over pixels, so that when we enter for (int delta = 0; delta < 8; delta++) we process only 16 * 2 pixels together. This gave some improvement in the occupancy of the memory and compute GPU subsystems, but they are still far from optimal, around 40%. And another problem has arisen: I now need to calculate some things with only part of the threads, or use a lot of shared memory:
int tid = threadIdx.x + threadIdx.y * blockDim.x;
int pix = threadIdx.x + blockDim.x * (threadIdx.y / 5); // which of the block's 16 * 2 pixels this thread works on
for (int delta = 0; delta < 8; delta++)
    for (int stage = 0; stage < 10; stage++)
    {
        // Only block_size / 5 threads compute the per-pixel vectors; the rest wait at the barrier.
        __shared__ Vec3 vecs[block_size / 5];
        __shared__ float sums[block_size / 5];
        if (tid < block_size / 5)
        {
            vecs[tid] = calc_vector(x, y, stage);
            sums[tid] = 0;
        }
        __syncthreads();

        Vec3 vec = vecs[pix];
        float sum = 0;
        int i = threadIdx.y % 5; // each of the 5 threads takes one row of the 5x5 window
        for (int j = 0; j < 5; j++)
            sum += calc_value(x + i * alpha, y + j * beta, vec, vec2, delta, stage);
        atomicAdd(&sums[pix], sum); // Another bad thing, but let us not discuss it now
        ...
Although the approach generally looks logical, it seems that doing this part of the work with only 32 threads out of 160, plus the __syncthreads(), produces a big performance bottleneck. In one code fragment I even got worse results with this approach than when computing the same thing redundantly in all threads. I thought it would not be a problem that 128 threads are idle, because many other thread blocks should be running in parallel (I was able to get a theoretical occupancy of about 80%). But it looks like it is a problem (although I could have made a mistake in the experiments and profiling), and the assembly also contains a lot of shared-memory stores and loads, whereas the "non-optimized" code is mostly FMUL/FFMA instructions and that's all.
And then there is the fact that shared memory is even smaller than the register file... If we leave aside the RTX 4090, we have, e.g., 96 KB per SM. So if we want 2048 threads per SM, we have to limit ourselves to 48 bytes, i.e. 12 float values, per thread. That is even less than the 32 registers (128 bytes) per thread that a 64 K-entry register file gives at the same occupancy...
So the question is: what other approaches could be applied here? Something simple and effective... I see the following ways. Since we already have variant 1 (simply compute everything redundantly in all threads) and variant 2 (compute with a small subset of threads), the following would be variants 3 and 4.
- Use a bigger shared memory buffer and precalculate things with all threads (a fuller sketch of what I mean is the first code fragment after this list), so

  __shared__ Vec3 vecs[5][block_size / 5];
  for (int stage = 0; stage < 10; stage++)
  {
      if (stage == 0 || stage == 5)
      {
          vecs[...][...] = calc_vector...
          __syncthreads();
      }
      ...
- Try hard to parallelize calc_vector as well, like we did above with the for (int i = 0; i < 5; i++) loop: spread its results across local variables and then exchange them among neighboring threads with __shfl_up_sync and so on (second code fragment below). Looks too monstrous.
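Spelling out variant 3 a bit more, this is roughly what I have in mind; the pix/lane indexing and the stage + lane trick for filling five stages at once are my assumptions about how it could work, with x and y being the coordinates of the pixel the group handles:

// Inside the loop over delta, as before.
__shared__ Vec3 vecs[5][block_size / 5];
int pix  = threadIdx.x + blockDim.x * (threadIdx.y / 5); // which pixel this group of 5 threads handles
int lane = threadIdx.y % 5;                              // this thread's slot within the group

for (int stage = 0; stage < 10; stage++)
{
    if (stage == 0 || stage == 5)
    {
        __syncthreads(); // make sure nobody is still reading the previous five vectors
        // All 160 threads cooperate: each one computes one (stage, pixel) entry.
        vecs[lane][pix] = calc_vector(x, y, stage + lane);
        __syncthreads(); // make the new vectors visible to the whole block
    }
    Vec3 vec = vecs[stage % 5][pix];
    // ... the 5x5 accumulation with vec, as before ...
}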
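And a very rough sketch of variant 4, mostly to show why it looks monstrous. I assume here that Vec3 is three floats, that the thread-to-pixel mapping is rearranged so the 5 threads of a group occupy consecutive lanes of one warp (wasting 2 lanes per warp), and I use __shfl_sync with an explicit source lane instead of __shfl_up_sync:

int lane  = (threadIdx.x + threadIdx.y * blockDim.x) % 32; // lane within the warp
int group = lane / 5;                                      // six groups of 5 lanes per warp
int slot  = lane % 5;                                      // which of the 5 vectors this thread owns

if (group < 6) // lanes 30 and 31 stay idle
{
    unsigned mask = 0x1Fu << (group * 5); // the 5 lanes of this group

    for (int stage_base = 0; stage_base < 10; stage_base += 5)
    {
        // Each thread computes one vector and keeps it purely in registers.
        Vec3 my_vec = calc_vector(x, y, stage_base + slot);

        for (int stage = stage_base; stage < stage_base + 5; stage++)
        {
            int src = group * 5 + (stage - stage_base); // lane that owns this stage's vector
            Vec3 vec;
            vec.x = __shfl_sync(mask, my_vec.x, src);
            vec.y = __shfl_sync(mask, my_vec.y, src);
            vec.z = __shfl_sync(mask, my_vec.z, src);
            // ... the 5x5 accumulation with vec, as before ...
        }
    }
}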
Any other ideas?