I’m trying to train a Mask R-CNN model on aerial imagery. To speed this up, I run everything on the GPU with CUDA, but this produces a few errors. Here is my code:
# Python
import torch
import torchvision
from torchvision.models.detection import MaskRCNN
import gc
import torch.nn as nn
from torchvision.models.detection.rpn import AnchorGenerator
from torch.cuda.amp import GradScaler
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
gc.collect()
torch.cuda.empty_cache()
# Define the model
resnet_net = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
modules = list(resnet_net.children())[:-1]
backbone = nn.Sequential(*modules)
backbone.out_channels = 512
# Define the anchor generator
anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
# Define the model with the configured backbone and anchor generator
model = MaskRCNN(backbone=backbone, num_classes=91, rpn_anchor_generator=anchor_generator)
# Move the model to the GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
# Define the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005)
scaler = GradScaler()
# Train the model
num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    counter = 0
    for images, height, targets, names in train_ds:
        print(counter)
        counter += 1
        # Move the batch to the same device as the model
        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        optimizer.zero_grad()
        # Forward pass under mixed precision; the model returns a dict of losses
        with torch.cuda.amp.autocast():
            loss_dict = model(images, targets)
            losses = sum(loss for loss in loss_dict.values())
        # Scaled backward pass and optimizer step
        scaler.scale(losses).backward()
        scaler.step(optimizer)
        scaler.update()
If I run this code on the GPU, at some point I get this error:
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with "TORCH_USE_CUDA_DSA" to enable device-side assertions.
And if I run it on the CPU, I get this error:
[error] Disposing session as kernel process died ExitCode: 3221225477, Reason:
0.00s - Debugger warning: It seems that frozen modules are being used, which may
0.00s - make the debugger miss breakpoints. Please pass -Xfrozen_modules=off
0.00s - to python to disable frozen modules.
0.00s - Note: Debugging will proceed. Set PYDEVD_DISABLE_FILE_VALIDATION=1 to disable this validation.
I have run into CUDA memory problems with this code before, and this seems related. What are these frozen modules, and is it safe to turn them off? Also, I tried to enable TORCH_USE_CUDA_DSA in my code by adding this:
os.environ["TORCH_USE_CUDA_DSA"] = "1"
But that didn’t solve it. Strangely, I also had one run where I didn’t encounter any of these problems and the code ran smoothly on the GPU.
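Reading the errors again, I wonder whether ordering matters here, i.e. whether these environment variables have to be set before torch is imported rather than afterwards (in my script they are set after the imports). A sketch of what I mean; I haven’t verified that this actually changes anything:
# Python
import os
# CUDA_LAUNCH_BLOCKING makes kernel launches synchronous, so the traceback
# should point at the operation that actually fails.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# I'm not sure whether TORCH_USE_CUDA_DSA does anything at runtime,
# or whether it needs a PyTorch build compiled with that flag.
os.environ["TORCH_USE_CUDA_DSA"] = "1"

import torch  # only imported after the variables are set
Would setting them this way make any difference?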