I am trying to set up the environment on Linux (a Linux VM running on Windows) for this GitHub project: https://github.com/NJU-PCALab/OpenVid-1M
I am unable to install the NVIDIA Apex package with this command:
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git
I have CUDA 12.5 with the following libraries (I also tried CUDA 11.8 and Torch 2.3.1):
colossalai==0.4.0
accelerate==0.32.1
diffusers==0.29.2
ftfy==6.2.0
gdown==5.2.0
mmengine==0.10.4
pre-commit==3.7.1
pyav==12.2.0
tensorboard==2.17.0
timm==1.0.7
tqdm==4.66.4
transformers==4.39.3
wandb==0.17.4
torch==2.2.2
torchvision==0.17.2
packaging==24.1
ninja==1.11.1.1
After running the above command I get this error:
[2024-07-12 04:57:25,233] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Traceback (most recent call last):
File "/OpenVid-1M/scripts/inference.py", line 5, in <module>
from mmengine.runner import set_random_seed
File "/conda/envs/venv/lib/python3.10/site-packages/mmengine/runner/__init__.py", line 2, in <module>
from ._flexible_runner import FlexibleRunner
File "/conda/envs/venv/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 14, in <module>
from mmengine._strategy import BaseStrategy
File "/conda/envs/venv/lib/python3.10/site-packages/mmengine/_strategy/__init__.py", line 4, in <module>
from .base import BaseStrategy
File "/conda/envs/venv/lib/python3.10/site-packages/mmengine/_strategy/base.py", line 19, in <module>
from mmengine.model.wrappers import is_model_wrapper
File "/conda/envs/venv/lib/python3.10/site-packages/mmengine/model/__init__.py", line 6, in <module>
from .base_model import BaseDataPreprocessor, BaseModel, ImgDataPreprocessor
File "/conda/envs/venv/lib/python3.10/site-packages/mmengine/model/base_model/__init__.py", line 2, in <module>
from .base_model import BaseModel
File "/conda/envs/venv/lib/python3.10/site-packages/mmengine/model/base_model/base_model.py", line 9, in <module>
from mmengine.optim import OptimWrapper
File "/conda/envs/venv/lib/python3.10/site-packages/mmengine/optim/__init__.py", line 2, in <module>
from .optimizer import (OPTIM_WRAPPER_CONSTRUCTORS, OPTIMIZERS,
File "/conda/envs/venv/lib/python3.10/site-packages/mmengine/optim/optimizer/__init__.py", line 5, in <module>
from .builder import (OPTIM_WRAPPER_CONSTRUCTORS, OPTIMIZERS,
File "/conda/envs/venv/lib/python3.10/site-packages/mmengine/optim/optimizer/builder.py", line 174, in <module>
TRANSFORMERS_OPTIMIZERS = register_transformers_optimizers()
File "/conda/envs/venv/lib/python3.10/site-packages/mmengine/optim/optimizer/builder.py", line 165, in register_transformers_optimizers
from transformers import Adafactor
File "/conda/envs/venv/lib/python3.10/site-packages/transformers/__init__.py", line 26, in <module>
from . import dependency_versions_check
File "/conda/envs/venv/lib/python3.10/site-packages/transformers/dependency_versions_check.py", line 16, in <module>
from .utils.versions import require_version, require_version_core
File "/conda/envs/venv/lib/python3.10/site-packages/transformers/utils/__init__.py", line 33, in <module>
from .generic import (
File "/conda/envs/venv/lib/python3.10/site-packages/transformers/utils/generic.py", line 478, in <module>
_torch_pytree.register_pytree_node(
AttributeError: module 'torch.utils._pytree' has no attribute 'register_pytree_node'. Did you mean: '_register_pytree_node'?
[2024-07-12 04:57:30,430] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 112) of binary: /conda/envs/venv/bin/python
Traceback (most recent call last):
File "/conda/envs/venv/bin/torchrun", line 10, in <module>
sys.exit(main())
File "/conda/envs/venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/conda/envs/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
run(args)
File "/conda/envs/venv/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/conda/envs/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/conda/envs/venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
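The AttributeError in the traceback is raised when transformers calls torch.utils._pytree.register_pytree_node, so it may help to confirm which package versions the active environment actually resolves and whether that attribute exists in the torch being imported. Below is a minimal sketch for that check. The helper names (installed_version, pytree_has_public_register, report) are made up for illustration and are not part of any library.

```python
import importlib
from importlib import metadata


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def pytree_has_public_register():
    """Return True/False if torch imports, or None if torch cannot be imported."""
    try:
        pytree = importlib.import_module("torch.utils._pytree")
    except Exception:
        # torch is missing or broken in this interpreter
        return None
    return hasattr(pytree, "register_pytree_node")


def report():
    """Collect the versions involved in the traceback plus the attribute check."""
    names = ("torch", "torchvision", "transformers", "mmengine")
    info = {name: installed_version(name) for name in names}
    info["pytree.register_pytree_node"] = pytree_has_public_register()
    return info


if __name__ == "__main__":
    for key, value in report().items():
        print(f"{key}: {value}")
```

If the torch version printed here differs from the one in the list above, the interpreter used by torchrun is probably picking up a different installation than expected.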
I have tried a lot of things to debug this. I tried to install the Apex package with CUDA instead of pip, and I have tried to build Apex by cloning various older branches, but nothing works: every attempt ends in a "failed to build wheel for apex" error. I have seen others on Stack Overflow and GitHub with similar questions, but none of the solutions seem to work for me.
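Since every attempt ends in a wheel build failure, one thing that may be worth ruling out is a mismatch between the CUDA toolkit used for compilation (nvcc) and the CUDA version the installed PyTorch was built against. Below is a rough sketch of that comparison, assuming nvcc is on PATH. The function names are hypothetical, and the parsing assumes the usual "release X.Y" text in the output of nvcc --version.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract 'X.Y' from nvcc --version output, or None if no release is found."""
    match = re.search(r"release\s+(\d+)\.(\d+)", text)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def nvcc_release():
    """Return the CUDA release reported by nvcc, or None if nvcc is not on PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


def torch_cuda():
    """Return the CUDA version torch was built with, or None if unavailable."""
    try:
        import torch
    except Exception:
        # torch is missing or broken in this interpreter
        return None
    return torch.version.cuda  # None for CPU-only builds


if __name__ == "__main__":
    print("nvcc CUDA release:", nvcc_release())
    print("torch built for CUDA:", torch_cuda())
```

If the two values differ, that difference would be a reasonable first suspect for the build failure, though it is not necessarily the only cause.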
I would greatly appreciate some help with this. Thanks in advance!