I was running TensorFlow 2.15 + Keras 2.15 and training worked fine, but I ran into an error when loading the model after training. I found out that bug was resolved in TensorFlow 2.17 and Keras 3.4.1, so I decided to upgrade. Once I did, my GPU (RTX 4090) no longer appeared to be used during training: monitoring it in the Activity Monitor, utilization sat at roughly 0-3% while GPU memory was at 100%, and each epoch took about 100x longer. Before the upgrade, on TF 2.15, an epoch took at most 2 minutes. I then did a clean install of CUDA Toolkit 12.3 + cuDNN 8.9.7 (as suggested by the TensorFlow documentation) but found no change.
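In case it's useful, here is a small standalone sanity check I can run (this is just a sketch, not my actual training code) to see whether TensorFlow can place and run work on the GPU at all, and whether it's faster there than on the CPU:

```python
import time
import tensorflow as tf

# Sanity check (not my training script): is the GPU visible, and does a
# large matmul actually run faster there than on the CPU?
print("TF version:", tf.__version__)
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))

def timed_matmul(device, n=4096):
    with tf.device(device):
        a = tf.random.normal([n, n])
        b = tf.random.normal([n, n])
        tf.matmul(a, b).numpy()  # warm-up: first call pays one-time setup cost
        start = time.perf_counter()
        tf.matmul(a, b).numpy()  # .numpy() forces the op to finish before timing stops
    return time.perf_counter() - start

print("CPU matmul:", timed_matmul("/CPU:0"), "s")
print("GPU matmul:", timed_matmul("/GPU:0"), "s")
```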
I’ve tried multiple things: installing CUDA 11.8, CUDA 12.2, and CUDA 12.3.2, all with cuDNN 8.9.7; installing the CUDA libraries that ship alongside TensorFlow (`pip install tensorflow[gpu]`); and using standalone installs of CUDA and cuDNN. The issue persists in every case.
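To be clear about which CUDA/cuDNN the installed wheel expects versus what is on the system, I can print the versions TensorFlow was built against with something like the following (my understanding of `tf.sysconfig.get_build_info()`; the exact keys may vary between releases):

```python
import tensorflow as tf

# Show which CUDA/cuDNN versions this TensorFlow wheel was built against,
# to compare with the toolkit/cuDNN installed on the system.
info = tf.sysconfig.get_build_info()
print("is_cuda_build:", info.get("is_cuda_build"))
print("cuda_version :", info.get("cuda_version"))
print("cudnn_version:", info.get("cudnn_version"))
```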
I initially had NVIDIA Driver version 546.17 installed and since have updated it to 552.44 and still the issue persists. I’m hesitant to continue upgrading as I’m not sure if this might break other things.
Please let me know if there are other details you need. I’m at a loss right now.
Additional Details:
OS Version: Ubuntu 22.04 in WSL2 (Kernel Version: 5.15)
NVIDIA Driver Version: 552.44 (nvidia-smi header: NVIDIA-SMI 550.76.01, Driver Version 552.44, CUDA Version 12.4)
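If it helps narrow things down, I can also rerun with per-op device placement logging enabled at the very top of the script, before any ops are created (again just a sketch of how I would enable it):

```python
import tensorflow as tf

# Log, to stderr, which device (CPU or GPU) every op is placed on.
# Must be called before any TensorFlow ops are created.
tf.debugging.set_log_device_placement(True)
print("Visible GPUs:", tf.config.list_physical_devices("GPU"))
```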
Errors at the beginning of training:
2024-08-12 22:11:19.969062: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-08-12 22:11:20.184444: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-12 22:11:20.265092: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-12 22:11:20.288201: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-12 22:11:20.455104: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-12 22:11:21.229471: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1723515082.702010 401 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1723515082.862328 401 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1723515082.862383 401 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
Num GPUs Available: 1
I0000 00:00:1723515082.865279 401 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1723515082.865315 401 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1723515082.865337 401 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1723515082.995930 401 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
I0000 00:00:1723515082.995986 401 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-08-12 22:11:22.996004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2112] Could not identify NUMA node of platform GPU id 0, defaulting to 0. Your kernel may not have been built with NUMA support.
I0000 00:00:1723515082.996040 401 cuda_executor.cc:1001] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2024-08-12 22:11:22.996531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2021] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21458 MB memory: -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:01:00.0, compute capability: 8.9
Epoch 1/2
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1723515085.116771 533 service.cc:146] XLA service 0x7efd50007590 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1723515085.116798 533 service.cc:154] StreamExecutor device (0): NVIDIA GeForce RTX 4090, Compute Capability 8.9
2024-08-12 22:11:25.159139: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-08-12 22:11:25.318807: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:531] Loaded cuDNN version 8907
2024-08-12 22:11:26.353627: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_5', 88 bytes spill stores, 88 bytes spill loads
2024-08-12 22:11:26.462451: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_38', 24 bytes spill stores, 24 bytes spill loads
2024-08-12 22:11:26.679557: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_40', 96 bytes spill stores, 96 bytes spill loads
2024-08-12 22:11:26.820076: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_40', 80 bytes spill stores, 80 bytes spill loads
2024-08-12 22:11:26.835967: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_33', 112 bytes spill stores, 112 bytes spill loads
2024-08-12 22:11:27.036687: I external/local_xla/xla/stream_executor/cuda/cuda_asm_compiler.cc:393] ptxas warning : Registers are spilled to local memory in function 'gemm_fusion_dot_33', 112 bytes spill stores, 112 bytes spill loads
I0000 00:00:1723515088.423223 533 device_compiler.h:188] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
7/10740 ━━━━━━━━━━━━━━━━━━━━ 3:58:36 1s/step - loss: 883.6491 - mse: 883.5972
Segmentation fault
PS: Please forgive me, this is my first time posting on Stack Overflow.