I’m working on a model that was written to run exclusively on either GPU or CPU, but since I’m working on an HPC cluster, I’m running with both.
I’ve been trying to adapt the code to my setup, but I keep getting the error message "Expected all tensors to be on the same device".
The code is built on the Lightning library, and the original setup uses these parameters:
slurm_args = dict()
if os.environ.get("SLURM_NODELIST") is not None:
    # Add SLURM arguments for distributed training
    slurm_args = {
        "accelerator": "gpu",
        "devices": int(os.environ["SLURM_GPUS_ON_NODE"]),
        "num_nodes": int(os.environ["SLURM_NNODES"]),
        "strategy": "ddp",
    }
…
trainer_args = {
    "max_epochs": args.nb_epochs,
    "log_every_n_steps": LOGGING_STEPS,
    "val_check_interval": VAL_CHECK_STEPS,
    "logger": wandb_logger,
    "callbacks": callbacks,
    "accelerator": accelerator,
    **slurm_args,
}
…
trainer = lightning.Trainer(**trainer_args)
…
trainer.fit(**train_args, datamodule=datamodule)
The call model.to("cuda") is not present anywhere in the code, and adding it doesn’t change anything.
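My current suspicion is that somewhere in the model a tensor gets created directly on the CPU rather than on the module’s device. I haven’t isolated the exact line, so this is only a hypothetical sketch of the kind of pattern I mean (MyModel, mask, and batch are made-up names, not from my actual code):

import torch
import lightning

class MyModel(lightning.LightningModule):  # hypothetical module, not my real code
    def training_step(self, batch, batch_idx):
        x, y = batch  # Lightning moves the batch to the GPU
        # A tensor created like this stays on the CPU and would trigger
        # "Expected all tensors to be on the same device" when combined with x:
        mask = torch.ones(x.shape[0])
        # whereas creating it on self.device keeps everything together:
        # mask = torch.ones(x.shape[0], device=self.device)
        ...

Is that the sort of thing that would explain why model.to("cuda") has no effect, or is this more likely a Trainer configuration problem?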
Does anyone know which settings I need to change, and to what, in order to run under these parameters:
#SBATCH --time=60:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=180000
#SBATCH --cpus-per-task=20
#SBATCH --gres=gpu:1
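For reference, under this job script I’d expect the SLURM branch above to evaluate to roughly the following (assuming SLURM_GPUS_ON_NODE=1 and SLURM_NNODES=1, which is what --gres=gpu:1 and --nodes=1 should give):

slurm_args = {
    "accelerator": "gpu",
    "devices": 1,      # from SLURM_GPUS_ON_NODE under --gres=gpu:1
    "num_nodes": 1,    # from SLURM_NNODES under --nodes=1
    "strategy": "ddp",
}

So I’m not sure whether the problem is in these Trainer settings themselves (e.g. the "ddp" strategy on a single GPU, or the "accelerator" key being set twice) or somewhere inside the model code.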