I’m trying to train a YOLOv5 model on SageMaker using data stored in an S3 bucket.
The total size of the training data stored in S3 exceeds 100 GB, so I am trying to use Pipe mode to load the data.
The main code is:
import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput
sagemaker_session = sagemaker.Session()
bucket = 'sdij-ml'
prefix = 'yolov5-training'
role = sagemaker.get_execution_role()
estimator = PyTorch(
    entry_point='train.py',
    source_dir='/root/yolov5',
    role=role,
    framework_version='2.2',
    py_version='py310',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    hyperparameters={
        'img': 960,
        'batch': 32,
        'epochs': 100,
        'data': 'data.yaml',
        'device': 0,
        'weights': 'yolov5l.pt',
        'project': '/opt/ml/model',
        'name': 'SDIJ_Textbook'
    }
)
train_input = TrainingInput(
    f's3://{bucket}/textbook/Dataset/train/',
    distribution='FullyReplicated',
    content_type='application/x-image',
    s3_data_type='S3Prefix',
    input_mode='Pipe'
)
val_input = TrainingInput(
    f's3://{bucket}/textbook/Dataset/valid/',
    distribution='FullyReplicated',
    content_type='application/x-image',
    s3_data_type='S3Prefix',
    input_mode='Pipe'
)
estimator.fit({'train': train_input, 'val': val_input})
The data.yaml file is:
names:
- Problem
- Problem_no
- Meta_data
- Problem_source
- Page_num
- Theme
- Solution_no
- Solution
- Solution_broken
- Problem_broken
nc: 10
train: /opt/ml/input/data/train
val: /opt/ml/input/data/val
test: /opt/ml/input/data/test
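To check what the container actually exposes at these paths, a small diagnostic could be run at the top of train.py. This is only a sketch: `describe_path` and `report_dataset_paths` are hypothetical helper names, not part of YOLOv5 or the SageMaker SDK, and the path dictionary simply mirrors the `train`, `val`, and `test` entries of the data.yaml above.

```python
import os
import stat


def describe_path(path):
    """Classify a path as 'missing', 'directory', 'fifo', 'file', or 'other'."""
    try:
        # os.stat only inspects metadata, so it does not block on a named pipe.
        mode = os.stat(path).st_mode
    except (FileNotFoundError, NotADirectoryError):
        return "missing"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISFIFO(mode):
        return "fifo"
    if stat.S_ISREG(mode):
        return "file"
    return "other"


def report_dataset_paths(paths):
    """Map each dataset split name to the kind of filesystem entry found."""
    return {split: describe_path(path) for split, path in paths.items()}


if __name__ == "__main__":
    # Paths copied from data.yaml (hypothetical hard-coding for the check).
    dataset_paths = {
        "train": "/opt/ml/input/data/train",
        "val": "/opt/ml/input/data/val",
        "test": "/opt/ml/input/data/test",
    }
    for split, kind in report_dataset_paths(dataset_paths).items():
        print(f"{split}: {dataset_paths[split]} -> {kind}")
```

If a split is reported as `missing` or `fifo` rather than `directory`, the directory-based dataset check in YOLOv5 would not find it at that path.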
The error appears as follows. I believe the dataset downloaded from S3 is not being recognized properly. Which part of the code might be causing this issue?
Traceback (most recent call last):
  File "/opt/ml/code/train.py", line 852, in <module>
    main(opt)
  File "/opt/ml/code/train.py", line 627, in main
    train(opt.hyp, opt, device, callbacks)
  File "/opt/ml/code/train.py", line 176, in train
    data_dict = data_dict or check_dataset(data)  # check if None
  File "/opt/ml/code/utils/general.py", line 563, in check_dataset
    raise Exception("Dataset not found ❌")
Exception: Dataset not found ❌
2024-07-02 18:20:27,719 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.
2024-07-02 18:20:27,719 sagemaker-training-toolkit INFO Done waiting for a return code. Received 1 from exiting process.
2024-07-02 18:20:27,719 sagemaker-training-toolkit ERROR Reporting training FAILURE
2024-07-02 18:20:27,720 sagemaker-training-toolkit ERROR ExecuteUserScriptError:
ExitCode 1
ErrorMessage "raise Exception("Dataset not found ❌")
Exception: Dataset not found ❌"
Command "/opt/conda/bin/python3.10 train.py --batch 32 --data data.yaml --device 0 --epochs 100 --img 960 --name SDIJ_Textbook --project /opt/ml/model --weights yolov5l.pt"
2024-07-02 18:20:27,720 sagemaker-training-toolkit ERROR Encountered exit_code 1
2024-07-02 18:20:36 Uploading - Uploading generated training model
2024-07-02 18:20:45 Failed - Training job failed
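For reference, here is a minimal sketch of how a Pipe mode channel would be consumed, assuming (as the SageMaker documentation describes) that Pipe mode exposes each channel as a named pipe at `/opt/ml/input/data/<channel>_<epoch>` instead of a directory of files. The helper names `pipe_channel_path` and `read_stream` are hypothetical, and the stock YOLOv5 train.py does nothing like this; it expects image and label directories on disk.

```python
import os


def pipe_channel_path(channel, epoch=0, root="/opt/ml/input/data"):
    """Build the assumed Pipe mode path for a channel, e.g. .../train_0."""
    return os.path.join(root, f"{channel}_{epoch}")


def read_stream(path, chunk_size=1 << 20):
    """Yield raw byte chunks from a stream until it is exhausted."""
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            yield chunk


if __name__ == "__main__":
    path = pipe_channel_path("train", epoch=0)
    if os.path.exists(path):
        total = sum(len(chunk) for chunk in read_stream(path))
        print(f"read {total} bytes from {path}")
    else:
        print(f"{path} does not exist in this environment")
```

The point of the sketch is that a Pipe mode channel is a byte stream read sequentially, so a path such as `/opt/ml/input/data/train` in data.yaml would not resolve to a folder of images under that assumption.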