I am trying to implement this repo with the Kvasir-SEG dataset. The error occurs at the model.fit call (line 81).
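For context, the failing line is a plain Keras model.fit. Structurally it is roughly the following (a heavily simplified stand-in with dummy data, not the repo's actual code; the batch size of 4 and 256x256 inputs match the shapes in the log below):

import tensorflow as tf

# Dummy stand-ins just to show the call shape; the real run.py builds
# ResUNet++ and tf.data pipelines from the Kvasir-SEG images and masks.
images = tf.random.uniform((8, 256, 256, 3))
masks = tf.cast(tf.random.uniform((8, 256, 256, 1)) > 0.5, tf.float32)
dataset = tf.data.Dataset.from_tensor_slices((images, masks)).batch(4)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(256, 256, 3)),
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(1, 1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# The equivalent call in run.py (line 81) is where the crash happens;
# the real script trains for 200 epochs.
model.fit(dataset, epochs=1)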
This link is close to my problem, but none of the answers there worked, so I have included more of the error log below to give answerers enough context.
When I run python3 run.py, it sometimes completes with no problem, but in most cases I get the following error:
Epoch 1/200
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1726636494.249767 16271 service.cc:146] XLA service 0x7f1f90001ce0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1726636494.249802 16271 service.cc:154] StreamExecutor device (0): NVIDIA GeForce RTX 3070 Laptop GPU, Compute Capability 8.6
2024-09-18 10:44:54.852933: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-09-18 10:44:56.370033: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:531] Loaded cuDNN version 8907
2024-09-18 10:44:59.966977: E external/local_xla/xla/service/gpu/buffer_comparator.cc:153] Difference at 648122: 570.72, expected 698.733
2024-09-18 10:44:59.967805: E external/local_xla/xla/service/gpu/conv_algorithm_picker.cc:697] Results mismatch between different convolution algorithms. This is likely a bug/unexpected loss of precision in cudnn.
(f32[4,256,32,32]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,256,32,32]{3,2,1,0}, f32[256,256,3,3]{3,2,1,0}, f32[256]{0}), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0},"force_earliest_schedule":false} for eng26{k2=0,k13=2,k14=3,k18=0,k22=0,k23=0} vs eng11{k2=0,k3=0}
2024-09-18 10:44:59.967826: E external/local_xla/xla/service/gpu/conv_algorithm_picker.cc:312] Device: NVIDIA GeForce RTX 3070 Laptop GPU
2024-09-18 10:44:59.967832: E external/local_xla/xla/service/gpu/conv_algorithm_picker.cc:313] Platform: Compute Capability 8.6
2024-09-18 10:44:59.967837: E external/local_xla/xla/service/gpu/conv_algorithm_picker.cc:314] Driver: 12060 (INVALID_ARGUMENT: expected %d.%d, %d.%d.%d, or %d.%d.%d.%d form for driver version; got "1")
2024-09-18 10:44:59.967842: E external/local_xla/xla/service/gpu/conv_algorithm_picker.cc:315] Runtime: <undefined>
2024-09-18 10:44:59.967850: E external/local_xla/xla/service/gpu/conv_algorithm_picker.cc:320] cudnn version: 8.9.7
2024-09-18 10:45:00.530038: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1857] failed to synchronize the stop event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
E0000 00:00:1726636500.530095 16271 gpu_timer.cc:156] INTERNAL: Could not synchronize CUDA stream: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
E0000 00:00:1726636500.530124 16271 gpu_timer.cc:162] INTERNAL: Error destroying CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
E0000 00:00:1726636500.530136 16271 gpu_timer.cc:168] INTERNAL: Error destroying CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-09-18 10:45:00.530143: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1652] error deallocating host memory at 0x205200200: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-09-18 10:45:00.569431: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:1886] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered ::
Failed to determine best cudnn convolution algorithm for:
%cudnn-conv-bias-activation.169 = (f32[4,64,256,256]{3,2,1,0}, u8[0]{0}) custom-call(f32[4,16,256,256]{3,2,1,0} %maximum.63, f32[64,16,3,3]{3,2,1,0} %transpose.577, f32[64]{0} %arg189.190), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_target="__cudnn$convBiasActivationForward", metadata={op_type="Conv2D" op_name="functional_1/conv2d_29_1/convolution" source_file="/home/suhas/ResUNetPlusPlus/.venv/lib/python3.10/site-packages/tensorflow/python/framework/ops.py" source_line=1177}, backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"cudnn_conv_backend_config":{"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0},"force_earliest_schedule":false}
Original error: INTERNAL: Failed to synchronize GPU for autotuning conv instruction
To ignore this failure and try to use a fallback algorithm (which may have suboptimal performance), use XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false. Please also file a bug for the root cause of failing autotuning.
[[{{node StatefulPartitionedCall}}]] [Op:__inference_one_step_on_iterator_45815]
2024-09-18 10:45:00.913684: W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]
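I have not yet tried the fallback that the message itself suggests, since it says this may just mask the root cause, but for reference I believe it would be enabled like this (the flag has to be set before TensorFlow/XLA initializes):

import os

# Enable the fallback conv-algorithm picker mentioned in the error message.
# Set the environment variable before importing TensorFlow.
os.environ["XLA_FLAGS"] = "--xla_gpu_strict_conv_algorithm_picker=false"

import tensorflow as tf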
nvidia-smi output for my RTX 3070 laptop GPU:
Fri Sep 20 10:15:02 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 561.09 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3070 ... On | 00000000:01:00.0 Off | N/A |
| N/A 64C P0 33W / 130W | 4MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
I have tried restarting my PC multiple times, which sometimes lets the script run. I have also tried rm -rf ~/.nv/ to clear the NVIDIA cache.
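The only other thing I am considering, but have not tried yet, is switching TensorFlow to on-demand GPU memory allocation, in case the CUDA_ERROR_ILLEGAL_ADDRESS is memory related; roughly:

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all up front.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)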
My setup:
CUDA Toolkit 12.3
cuDNN 8.9.7
Python 3.10.12
TensorFlow 2.17.0
Running in WSL2
Installed TensorFlow with python3 -m pip install tensorflow[and-cuda]
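For what it's worth, the StreamExecutor line in the log above shows that TensorFlow/XLA does detect the GPU; a quick way to confirm the pip-installed CUDA wheels are being picked up in WSL2 is:

import tensorflow as tf

# Sanity check: print the TensorFlow version and the detected GPU(s).
print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))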