I trained a CNN model based on InceptionV3 to classify chest X-ray images for the detection of pulmonary tuberculosis.
The problem is that the metrics looked good during training, but when I evaluate the model there is quite a difference.
First, I evaluated the model with model.evaluate, using the validation data generator:
## MODEL EVALUATION
loss, accuracy, precision, recall, auc = model.evaluate(valid_generator)
print(f'Loss: {loss}, Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, AUC: {auc}')
And it gave me the following result:
38/38 [==============================] - 5s 128ms/step - loss: 0.1684 - accuracy: 0.9339 - precision: 0.9978 - recall: 0.8515 - auc: 0.9952
Loss: 0.16840478777885437, Accuracy: 0.9339389204978943, Precision: 0.9977973699569702, Recall: 0.8515037298202515, AUC: 0.9952080845832825
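For reference, my understanding is that model.evaluate pairs each batch of predictions with the labels yielded by that same batch and thresholds the sigmoid output at 0.5. A minimal sketch of how I would recompute the accuracy by hand along those lines (assuming valid_generator is a Keras Sequence, such as one returned by flow_from_directory):
import numpy as np

# Sketch only: pair each batch's predictions with the labels the generator
# yields for that same batch, then apply the same 0.5 threshold Keras uses.
y_true_parts, y_pred_parts = [], []
for i in range(len(valid_generator)):          # one full pass over the validation set
    x_batch, y_batch = valid_generator[i]      # Sequence indexing returns (images, labels)
    y_true_parts.append(y_batch)
    y_pred_parts.append(model.predict(x_batch, verbose=0).ravel())

y_true_check = np.concatenate(y_true_parts)
y_pred_check = (np.concatenate(y_pred_parts) >= 0.5).astype(int)
print("Manual accuracy:", (y_true_check == y_pred_check).mean())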
But I also evaluated the model using sklearn.metrics with this code:
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
import sklearn.metrics
##Getting predictions for the validation data set
valid_generator.reset()
Y_pred = model.predict(valid_generator, steps=len(valid_generator), verbose=1)
Y_pred = np.round(Y_pred)
##Convert true labels to array format
Y_true = valid_generator.classes
print("Classification Report:n", classification_report(Y_true, Y_pred))
# confusion matrix
conf_mat = confusion_matrix(Y_true, Y_pred)
print("Confusion Matrix:n", conf_mat)
And it gave me this result:
38/38 [==============================] - 5s 130ms/step
Classification Report:
              precision    recall  f1-score   support

      Normal       0.57      0.64      0.60       679
Tuberculosis       0.46      0.39      0.42       532

    accuracy                           0.53      1211
   macro avg       0.51      0.51      0.51      1211
weighted avg       0.52      0.53      0.52      1211
Confusion Matrix:
[[432 247]
[325 207]]
There is clearly a discrepancy between the results of these two evaluation methods. Which one is correct?
I compiled and trained my model as follows:
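One thing I want to double-check, in case it matters here: Y_true = valid_generator.classes is in directory (file) order, while model.predict returns predictions in whatever order the generator yields batches, so if the validation generator was created with shuffle=True the two arrays would not line up. A sketch of the re-evaluation I have in mind (assuming the generators come from ImageDataGenerator.flow_from_directory; valid_datagen and valid_dir are placeholders for my actual objects):
# Sketch: rebuild the validation generator with shuffle=False so that
# .classes and predict() refer to the samples in the same order.
valid_generator_ordered = valid_datagen.flow_from_directory(
    valid_dir,                       # placeholder for the real validation directory
    target_size=(299, 299),
    batch_size=32,
    class_mode='binary',
    shuffle=False
)

Y_pred_ordered = (model.predict(valid_generator_ordered, verbose=1) >= 0.5).astype(int).ravel()
Y_true_ordered = valid_generator_ordered.classes
print(classification_report(Y_true_ordered, Y_pred_ordered,
                            target_names=['Normal', 'Tuberculosis']))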
# InceptionV3 base with a custom binary classification head
import tensorflow as tf
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

base_model = InceptionV3(weights='imagenet', include_top=False, input_shape=(299, 299, 3))
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(1, activation='sigmoid')(x)
model = Model(inputs=base_model.input, outputs=predictions)

# Freeze the pretrained backbone
for layer in base_model.layers:
    layer.trainable = False

model.compile(optimizer=Adam(learning_rate=0.0001),
              loss='binary_crossentropy',
              metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall(), tf.keras.metrics.AUC()])

history = model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // train_generator.batch_size,
    validation_data=valid_generator,
    validation_steps=valid_generator.samples // valid_generator.batch_size,
    epochs=50
)
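As a side note (just a sketch of a variant I was also considering, not a claim that it explains the gap): for a Keras Sequence, len(generator) already rounds up to include the final partial batch, whereas samples // batch_size drops it, so the fit call could also be written as:
history = model.fit(
    train_generator,
    steps_per_epoch=len(train_generator),      # ceil(samples / batch_size), includes the last partial batch
    validation_data=valid_generator,
    validation_steps=len(valid_generator),
    epochs=50
)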
Keep in mind that I am classifying only two classes.
I hope you can help me clear up this doubt: which evaluation metrics should I use, and which results are correct? Thanks in advance.