I have the following project structure in a Python project:
nn-project
- .env
- data
- raw
- boston_housing_price
- src
- models
- bird-model
- env.py
- train_model.py
My .env file contains the following:
PROJECT_ROOT_FOLDER = ../
In my env.py, I do the following:
import os
from pathlib import Path

project_root = os.environ.get('PROJECT_ROOT_FOLDER')
if not project_root:
    raise ValueError("PROJECT_ROOT_FOLDER environment variable is not set.")
absolute_path = os.path.abspath(project_root)
data_dir = Path(os.path.join(absolute_path, 'data/raw/boston_housing_price/'))
models_dir = Path(os.path.join(absolute_path, 'models/boston_housing_price/'))
print('***************** LOAD ENVIRONMENT ********************+')
print("Project Root DIR", project_root)
print("Project Root DIR abs", absolute_path)
print("Project Data DIR", data_dir)
print("Models Dump DIR", models_dir)
print('***************** LOAD ENVIRONMENT ********************+')
Running this prints the following:
***************** LOAD ENVIRONMENT ********************+
Project Root DIR ../nn-project/
Project Root DIR abs /home/user/Projects/Private/ml-projects/nn-project
Project Data DIR /home/user/Projects/Private/ml-projects/nn-project/data/raw/boston_housing_price
Models Dump DIR /home/user/Projects/Private/ml-projects/nn-project/models/boston_housing_price
***************** LOAD ENVIRONMENT ********************+
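For context, `os.path.abspath` resolves a relative path against the current working directory of the running process, not against the location of `.env` or `env.py`. The following is a minimal sketch of that behaviour, not code from the project above: `resolve_from` is a hypothetical helper, and the directory layout is recreated inside a temporary folder purely for illustration.

```python
import os
import tempfile
from pathlib import Path


def resolve_from(cwd: str, relative: str) -> str:
    """Return os.path.abspath(relative) as computed from working directory cwd."""
    previous = os.getcwd()
    os.chdir(cwd)
    try:
        return os.path.abspath(relative)
    finally:
        # Always restore the original working directory
        os.chdir(previous)


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp).resolve()
    src = base / "nn-project" / "src"  # hypothetical layout
    src.mkdir(parents=True)

    # The same relative value resolves differently depending on the cwd
    from_src = resolve_from(str(src), "../")
    from_project = resolve_from(str(src.parent), "../")
    print(from_src)
    print(from_project)
```

If the script is started from a different folder, the same `PROJECT_ROOT_FOLDER` value can therefore point somewhere else.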
I then have the following method in train_model.py that is supposed to load the dataset:
def load_data(data_dir):
    print(data_dir)
    # Check if the dataset file exists in the data directory
    dataset_file = Path(os.path.join(data_dir, env.boston_dataset))
    print(dataset_file)
    if os.path.exists(dataset_file):
        # If the dataset file exists, load it directly
        raw_df = pd.read_csv(dataset_file, sep=r"\s+", skiprows=22, header=None)
    else:
        # If the dataset file doesn't exist, fetch it from the URL
        response = requests.get(env.boston_dataset_url)
        if response.status_code == 200:
            # Parse the CSV data from the response content
            csv_data = response.text
            raw_df = pd.read_csv(StringIO(csv_data), sep=r"\s+", skiprows=22, header=None)
            # Save the dataset to the data directory for future use
            raw_df.to_csv(dataset_file, index=False)
        else:
            print("Failed to fetch data from URL:", env.boston_dataset_url)
            return None
    return raw_df
I call it like this:
boston = train_model.load_data(data_dir=env.data_dir)
But it fails when I run it with the message:
OSError: Cannot save file into a non-existent directory: '../nn-project/data/raw/boston_housing_price'
My question: why does it not respect the absolute path of the data_dir that I pass in as a parameter to the method?
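Independently of how the path is resolved, pandas raises this `OSError` whenever the parent directory of the target file does not exist, because `to_csv` does not create directories. A minimal sketch of creating the directory before saving, assuming a hypothetical helper `save_dataframe` and a placeholder file name `housing.csv` (neither appears in the project above):

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_dataframe(df: pd.DataFrame, target: Path) -> Path:
    """Write df to target as CSV, creating missing parent directories first."""
    # Resolve to an absolute path so later cwd changes do not matter
    target = Path(target).resolve()
    # to_csv does not create directories, so create them explicitly
    target.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(target, index=False)
    return target


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder data and path, for illustration only
    out = save_dataframe(
        pd.DataFrame({"a": [1, 2]}),
        Path(tmp) / "data" / "raw" / "housing.csv",
    )
    saved_ok = out.exists()
    round_trip = pd.read_csv(out)
```

Printing the resolved `target` inside such a helper would also show which absolute location the relative path actually maps to at the moment of saving.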