I am building a tracer for NCCL and have a problem with time synchronization between the CPU and the GPU. The approach I found is to use cudaHostAlloc() to allocate a buffer holding the CPU timestamp, so that both the CPU and the GPU can access it: the CPU writes to it while the GPU reads from it, and the GPU therefore knows the current CPU timestamp. A dedicated CPU thread loops and keeps updating the timestamp. On the GPU side, the kernel reads the CPU timestamp and records the corresponding GPU timestamp using clock64(). Matching the two values gives the CPU-GPU clock synchronization.
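For reference, the shared timestamp buffer is allocated roughly like this (a simplified sketch; the variable names and exact flags here are illustrative, not the real setup code):

// Pinned, mapped host memory: the CPU writes through cpuTimestampHost and the
// kernel reads the same physical location through cpuTimestampDev.
uint64_t* cpuTimestampHost = nullptr;
uint64_t* cpuTimestampDev = nullptr;
cudaHostAlloc(&cpuTimestampHost, sizeof(uint64_t), cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&cpuTimestampDev, cpuTimestampHost, 0);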
Here is the thread that updates the CPU timestamp:
void NpKit::CpuTimestampUpdateThread() {
  uint64_t curr_steady_clock = 0;
  while (!cpu_timestamp_update_thread_should_stop_) {
    curr_steady_clock = std::chrono::steady_clock::now().time_since_epoch().count();
    NpKit::atomic_cpu_timestamp_->store((curr_steady_clock + NpKit::init_clock_offset), std::memory_order_seq_cst);
    // cudaDeviceSynchronize();
    // std::atomic_thread_fence(std::memory_order_seq_cst);
  }
}
Here is how the code running on the GPU does synchronization:
#if defined(ENABLE_NPKIT) && defined(ENABLE_NPKIT_EVENT_TIME_SYNC_CPU)
  if (tid == 0) {
    __threadfence_system();
    volatile uint64_t* cpuTimestamp = ncclShmem.comm.cpuTimestamp;
    NpKit::CollectGpuEvent(NPKIT_EVENT_TIME_SYNC_CPU, 0, 0, atomicAdd((unsigned long long*)cpuTimestamp, 0),
                           ncclShmem.comm.npKitEventCollectContexts + npKitCtxIdx);
  }
#endif

#if defined(ENABLE_NPKIT) && defined(ENABLE_NPKIT_EVENT_TIME_SYNC_GPU)
  if (tid == 0) {
    NpKit::CollectGpuEvent(NPKIT_EVENT_TIME_SYNC_GPU, 0, 0, clock64(),
                           ncclShmem.comm.npKitEventCollectContexts + npKitCtxIdx);
  }
#endif
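The pair of events matters because it is what ties the GPU clock to the CPU timeline in post-processing, roughly along these lines (an illustrative sketch, not the actual post-processing code; it assumes the GPU clock frequency is known, e.g. queried via cudaDeviceGetAttribute with cudaDevAttrClockRate):

// Map a raw clock64() value to CPU time (ns), given one sync pair.
// All names here are illustrative.
uint64_t GpuClockToCpuTimeNs(uint64_t gpu_clk, uint64_t sync_gpu_clk,
                             uint64_t sync_cpu_ts_ns, double gpu_clock_hz) {
  double elapsed_ns = (double)(gpu_clk - sync_gpu_clk) / gpu_clock_hz * 1e9;
  return sync_cpu_ts_ns + (uint64_t)elapsed_ns;
}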
However, I found that the CPU timestamp read by the GPU seems to always be earlier than the actual CPU timestamp. I have the following assumptions:
- Although we use volatile for every read and write of the CPU timestamp, the cache coherence in the system may not be strong enough to ensure that every CPU write reaches host memory; some writes may stay in the cache, so the GPU only reads a stale CPU timestamp.
- The CPU thread responsible for updating the CPU timestamp may get paused by the scheduler, or one loop iteration may take too long, so the GPU cannot see the most up-to-date CPU timestamp (a quick check for this is sketched below).
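To check the second assumption, the update loop could track the longest gap between two consecutive timestamp writes (a minimal sketch, reusing the members from the thread shown above):

// Track the worst-case stall of the update thread between two writes.
uint64_t prev_ns = std::chrono::steady_clock::now().time_since_epoch().count();
uint64_t max_gap_ns = 0;
while (!cpu_timestamp_update_thread_should_stop_) {
  uint64_t now_ns = std::chrono::steady_clock::now().time_since_epoch().count();
  if (now_ns - prev_ns > max_gap_ns) max_gap_ns = now_ns - prev_ns;
  prev_ns = now_ns;
  NpKit::atomic_cpu_timestamp_->store(now_ns + NpKit::init_clock_offset, std::memory_order_seq_cst);
}
// max_gap_ns bounds how stale the shared timestamp could have become
// from scheduling delays alone.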
Does anyone know how to solve this?
I have tried changing the flags passed to cudaHostAlloc() and switching everything that accesses the CPU timestamp to atomic operations. I also tried adding a fence after the CPU updates the value, e.g. cudaDeviceSynchronize(), std::atomic_thread_fence(std::memory_order_seq_cst), and __sync_synchronize().
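For example, one of the fence variants on the host side looked roughly like this (simplified from the loop above):

curr_steady_clock = std::chrono::steady_clock::now().time_since_epoch().count();
NpKit::atomic_cpu_timestamp_->store(curr_steady_clock + NpKit::init_clock_offset,
                                    std::memory_order_seq_cst);
// Fence after the store; I also tried __sync_synchronize() and
// cudaDeviceSynchronize() in this position.
std::atomic_thread_fence(std::memory_order_seq_cst);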
I expect every write by the CPU to become visible to the GPU so that the synchronization is accurate.