I'm working with very large tensors in PyTorch, and as a result of a certain operation my tensor needs to hold very large values (which represent indices). However, overflow causes every number in the tensor to become -9223372036854775808.
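For context, here is a minimal sketch (my own illustration, not code from the repo) of how int64 overflow can produce exactly that value: arithmetic that exceeds 2**63 - 1 wraps around to the int64 minimum in practice.

```python
import torch

# Hypothetical illustration (not the repo's code): multiplying an index
# value past the int64 maximum wraps around to the int64 minimum.
idx = torch.tensor([2**62], dtype=torch.int64)
print(idx * 2)  # tensor([-9223372036854775808])
```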
This is the repo: https://github.com/SamuelMastrelli/neural-astar. When I try to launch train_maps.py, the output clearly indicates some overflow, since the variable holding the negative number I printed is supposed to be a location index (loc) into another tensor:
scripts/train_maps.py:21: UserWarning:
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
@hydra.main(config_path="config", config_name="train_maps")
torch.Size([1, 1, 300, 300]) torch.Size([1, 1, 300, 300]) torch.Size([1, 1, 300, 300]) torch.Size([1, 1, 300, 300])
tensor([[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]]])
tensor([[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]]])
tensor([[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]]])
tensor([[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]]])
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: model/maps/lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
-----------------------------------------------
0 | planner | NeuralAstar | 391 K
1 | vanilla_astar | VanillaAstar | 9
-----------------------------------------------
391 K Trainable params
18 Non-trainable params
391 K Total params
1.566 Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/torch/functional.py:478: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2894.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
tensor([-9223372036854775808], device='cuda:0')
torch.Size([1, 90000])
../aten/src/ATen/native/cuda/IndexKernel.cu:91: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Error executing job with overrides: []
Traceback (most recent call last):
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
results = self._run_stage()
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
self._run_train()
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1190, in _run_train
self._run_sanity_check()
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1262, in _run_sanity_check
val_loop.run()
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
output = self._evaluation_step(**kwargs)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 390, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/neural_astar/utils/training.py", line 68, in validation_step
outputs = self.forward(map_designs, start_maps, goal_maps)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/neural_astar/utils/training.py", line 53, in forward
return self.planner(map_designs, start_maps, goal_maps)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/neural_astar/planner/astar.py", line 207, in forward
return self.perform_astar(
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/neural_astar/planner/astar.py", line 63, in perform_astar
astar_outputs = astar(
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/neural_astar/planner/differentiable_astar.py", line 260, in forward
path_maps = backtrack(start_maps, goal_maps, parents, t)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/neural_astar/planner/differentiable_astar.py", line 128, in backtrack
loc = parents[range(num_samples), loc]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "scripts/train_maps.py", line 63, in main
trainer.fit(module, train_loader, val_loader)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
call._call_and_handle_interrupt(
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 63, in _call_and_handle_interrupt
trainer._teardown()
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1161, in _teardown
self.strategy.teardown()
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/pytorch_lightning/strategies/strategy.py", line 496, in teardown
self.lightning_module.cpu()
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/lightning_lite/utilities/device_dtype_mixin.py", line 78, in cpu
return super().cpu()
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 738, in cpu
return self._apply(lambda t: t.cpu())
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 602, in _apply
param_applied = fn(param)
File "/home/mastrelli/neural-astar/.venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 738, in <lambda>
return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
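In case it helps narrow things down, below is a hedged sketch of a check one could add right before the failing indexing line in differentiable_astar.py (backtrack), assuming parents has shape [num_samples, H*W] and loc is meant to be a valid flat index into it. The assertion and its message are my additions, not part of the repo.

```python
# Hypothetical debugging check (my addition, not part of the repo):
# confirm loc is a valid flat index into parents before the gather that fails.
assert loc.min() >= 0 and loc.max() < parents.size(1), (
    f"loc out of range: min={loc.min().item()}, max={loc.max().item()}, "
    f"parents has {parents.size(1)} columns"
)
loc = parents[range(num_samples), loc]
```

Running with CUDA_LAUNCH_BLOCKING=1 (or on CPU) should also make the failing line report synchronously instead of at a later API call.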