How to correctly simulate `atomicAdd` on `u64` by using two `u32` buffers?

I’m trying to do atomic operations on u64. But since it’s not supported, the number is stored in TWO u32 buffers

The issue is that I’m not sure how to do atomicAdd correctly to simulate the effect it would have had on u64. All while avoiding memory modification by other threads between loading and storing the values.

my current idea is this:


fn tou64(value: u32) -> vec2u {
        return vec2u(u32(value / BASE), value % BASE);
}

fn add(a: vec2u, b: vec2u) -> vec2u {
    let x = a.x + b.x + u32((a.y + b.y) / BASE);
    let y = (a.y + b.y) % BASE;
    return vec2u(x, y);
}

fn main() {
// .....

// convert the value from u32 to 2-buffer representation of u64
let b: vec2u = tou64(value);
// fetch the old value from the 2 buffers
var a = vec2u(0); 
a.x = atomicLoad(&buffer[index]);
a.y = atomicLoad(&buffer[index+1]);
// add the value to the buffer value
let result = add(a, b);
// store back the buffer results 
atomicStore(&buffer[index], result.x);
atomicStore(&buffer[index+1], result.y);
}

This works only when no other thread modifies the buffer at the same index. But it’s a very weak implementation otherwise. Thread 1 could change the value of buffer[index+1] while thread 2 just read the old buffer[index] value and the new buffer[index+1] value

Edit: in CUDA’s guide, it’s noted that:

Note that any atomic operation can be implemented based on atomicCAS() (Compare And Swap)

and this example of AtomicAdd on double is provided

#if __CUDA_ARCH__ < 600
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
                              (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;

    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                               __longlong_as_double(assumed)));

    // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);

    return __longlong_as_double(old);
}
#endif

can this be applied to webgpu as well using atomicExchange? this answer shows how to do atomic operations on user-defined types. How can I do something similar but for webgpu?

Here is a solution for atomicAddU64, thanks to @PeterCordes!
Manually do the carry between a u32 low half and a u32 high half, with each u32 addition being an atomic RMW. This requires the fetch_add / atomicAdd return value from the low half so we can check for carry-out.
Since we only have 2 halves, not a wider integer, we don’t need to handle carry-in and carry-out from the same add. Just the sum = a+b; carry_out = sum<b; trick.

The total count at the end will be correct, but there’s no way to read a correct snapshot of the current count while other threads are adding. Multiple threads might have added to the low half and still be waiting to add to the high half, for example.

(It might be better to have each thread add into per-thread local accumulators to sum once at the end, instead of having all threads access a single shared variable all the time.)

fn main() {
    // .....

    // convert the value from u32 or f32 to 2-buffer representation of u64
    let b: vec2u = tou64(value);

    // low: no need for modulu since it will wrap by itself
    // take old value on the buffer to check for carry
    let oldValue = atomicAdd(&buffer[low_index], b.y);
    // high: add high part + carry
    // check if the sum cause value to wrap
    atomicAdd(&buffer[high_index], b.x + u32((oldValue + b.y) < b.y));
}

To convert u64 to vec2u and back:

// this is just a pseudocode! wgsl doesn't support u64 yet
// do this operation in c++/js or while data processing
fn u64ToVec2u(value: u64) -> vec2u {
    let low = u32(value);
    let high = u32(value >> 32);
    return vec2u(high, low);
}

fn vec2uToU64(value: vec2u) -> u64 {
    return (value.x << 32) + value.y; // x = high, y = low
}

and here’s a solution for float computations as well, can be done in wgsl

// to save 2 ^ 10 = 1024 (ie 3 digits after the decimal point)
const DEGREE_TO_SAVE = 10; 
fn tou64(value: f32) -> vec2u {
    // modulu is important here!! 
    // because converting from float to u32 will not automatically wrap
    let low = u32((value * pow(2., DEGREE_TO_SAVE)) % pow(2., 32));
    let high = u32(value /  pow(2., 32 - DEGREE_TO_SAVE));
    return vec2u(high, low);
}

fn tof32(value: vec2u) -> f32 {
    return f32(value.x) * pow(2., 32 - DEGREE_TO_SAVE) + f32(value.y) / pow(2., DEGREE_TO_SAVE);

}

As others have said, the simple answer is no. There is no easy way to get a 64-bit add using the primitives that WGSL provides.

However, if your data is sufficiently well behaved, you can get around this problem. For example, rather than having a single 64-bit adder, use 4 16-bit adders. Have your program split the 64-bit number into (x & 0xFFFF), ((x >> 16) & 0xFFFF), ((x >> 32) & 0xFFFF) and ((x >> 48) & 0xFFFF). Atomically add each of the pieces into four locations a0, a1, a2, a3.

Your JavaScript program can then put these together. a0 + (a1 << 16) + (a2 << 32) + (a3 << 48).

This works if you’re doing less that 2^16 additions, so that there is no chance of overflow. If you’re doing, say, 2^24 additions, then break the data into 8 8-bit pieces.

Essentially you’re looking the GPU just do addition, and letting JS take care of the carries.

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: Kiến thức lập trình - @ 11:27

Thẻ: cudaatomicuint64webgpuwgsl

Thiết kế website giá rẻ

Danh mục

How to correctly simulate `atomicAdd` on `u64` by using two `u32` buffers?