I am training a deep learning model for food image classification using TensorFlow/Keras, but I am running into a strange issue where some epochs appear to be skipped: the training logs show 0 accuracy and 0 loss for those epochs. Here is what I have observed:
• Some epochs report 0 accuracy and 0 loss, while others seem to train correctly.
• The data generators are set up correctly, and the number of images in the training and validation datasets is as expected.
Environment:
• TensorFlow version: 2.x
• Python version: 3.8
• Dataset: Food-101 (with 101 classes)
What I Have Tried:
1. Verified Data Directory Structure: Ensured that the directory structure is correct and contains the expected number of images.
2. Simplified Model: Used a simpler model to rule out complexity-related issues.
3. Removed Callbacks: Removed all callbacks to check if they were causing the problem.
4. Calculated Steps per Epoch: Ensured that steps_per_epoch and validation_steps are correctly calculated based on the number of samples and batch size.
5. Monitored Batch Processing: Added debugging callbacks to monitor batch processing and ensure consistency.
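For reference, here is roughly how I did the directory check in step 1 (a minimal sketch assuming the one-subfolder-per-class layout that flow_from_directory expects; count_images_per_class is just an illustrative helper name):

```python
from pathlib import Path

def count_images_per_class(data_dir, exts=(".jpg", ".jpeg", ".png")):
    """Count image files in each class subdirectory of data_dir."""
    counts = {}
    for class_dir in sorted(Path(data_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1 for p in class_dir.iterdir()
                if p.suffix.lower() in exts
            )
    return counts
```

With Food-101 this reports 101 keys, and each class should contain 1,000 images.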
My code:
# Required imports (TensorFlow 2.x / tf.keras)
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense, Dropout
from tensorflow.keras.optimizers import Adam

# Create an ImageDataGenerator for data augmentation
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=30,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest',
    validation_split=0.2
)

# Create training and validation data generators
train_generator = train_datagen.flow_from_directory(
    data_dir,
    target_size=(150, 150),  # smaller target size for faster debugging
    batch_size=32,
    class_mode='categorical',
    subset='training'
)

validation_generator = train_datagen.flow_from_directory(
    data_dir,
    target_size=(150, 150),
    batch_size=32,
    class_mode='categorical',
    subset='validation'
)
# Verify the number of samples
print(f"Number of training samples: {train_generator.samples}")
print(f"Number of validation samples: {validation_generator.samples}")
# Build a simpler model for debugging
model = Sequential([
    Flatten(input_shape=(150, 150, 3)),
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(101, activation='softmax')  # Food-101 has 101 classes
])

# Compile the model
model.compile(
    optimizer=Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
# Calculate steps per epoch and validation steps
steps_per_epoch = train_generator.samples // train_generator.batch_size
validation_steps = validation_generator.samples // validation_generator.batch_size
print(f"Steps per epoch: {steps_per_epoch}")
print(f"Validation steps: {validation_steps}")
# Train the model without callbacks
history = model.fit(
    train_generator,
    steps_per_epoch=steps_per_epoch,
    validation_data=validation_generator,
    validation_steps=validation_steps,
    epochs=50
)
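One thing I am unsure about is the floor division in my steps calculation, which silently drops the last partial batch each epoch. Since flow_from_directory loops over the data indefinitely, using math.ceil to include the partial batch should be safe. A sketch with Food-101's 101,000 images and a 0.2 validation split:

```python
import math

def steps_for(num_samples, batch_size):
    # Include the final partial batch; floor division would drop it.
    return math.ceil(num_samples / batch_size)

# Food-101: 101,000 images, validation_split=0.2
train_samples = 80800  # 80% of 101,000
val_samples = 20200    # 20% of 101,000

print(steps_for(train_samples, 32))  # 2525 (80800 / 32 exactly)
print(steps_for(val_samples, 32))    # 632 (631 full batches + 1 partial)
```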
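One additional check that might help diagnose the zero-loss epochs: pulling a single batch straight from the generator and validating its shapes and labels. A sketch (check_batch is an illustrative helper name, called as check_batch(*next(train_generator))):

```python
import numpy as np

def check_batch(images, labels, num_classes=101):
    """Sanity-check one (images, labels) batch from a generator."""
    # Images should be (batch, height, width, channels), rescaled to [0, 1]
    assert images.ndim == 4, f"unexpected image shape {images.shape}"
    assert 0.0 <= images.min() and images.max() <= 1.0, "rescale not applied?"
    # Labels should be one-hot vectors with num_classes columns
    assert labels.shape[1] == num_classes, f"unexpected label shape {labels.shape}"
    assert np.allclose(labels.sum(axis=1), 1.0), "labels are not one-hot"
    return True

# Usage with the generators above:
# check_batch(*next(train_generator))
```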
Please help with this issue; I am not sure what to try next.