I have trained a two-stage ML model for binary classification. In the first stage, I trained an autoencoder consisting of an encoder and a decoder. The encoder reduces the input dimension from 1000 features to 32; the decoder reconstructs the input, expanding from 32 back to 1000 features. In the second stage, I trained a DNN on the encoder output to classify whether the object is a flower or a fruit.
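For reference, my setup looks roughly like this (the hidden layer sizes other than the stated 1000 and 32 are illustrative placeholders):

```python
import torch
import torch.nn as nn

# Autoencoder: compress 1000 input features to a 32-dim encoding.
encoder = nn.Sequential(
    nn.Linear(1000, 256), nn.ReLU(),
    nn.Linear(256, 32),
)
decoder = nn.Sequential(
    nn.Linear(32, 256), nn.ReLU(),
    nn.Linear(256, 1000),
)
# DNN head on the encoding: flower vs. fruit.
classifier = nn.Sequential(
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1), nn.Sigmoid(),
)

x = torch.randn(4, 1000)   # batch of 4 samples
z = encoder(x)             # (4, 32) latent encoding
x_hat = decoder(z)         # (4, 1000) reconstruction
prob = classifier(z)       # (4, 1) predicted probability
```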
Now I want to calculate feature importance scores for this two-stage model. For this, I am using SHAP values.
I computed the SHAP values for the DNN model using shap.DeepExplainer. Then I backpropagated these SHAP values through the encoder part of the autoencoder using this code:
import numpy as np
import torch

def reverse_map_shap(shap_values, layer):
    # A Linear layer's weight has shape (out_features, in_features),
    # so this maps SHAP values from the layer's output space back to
    # its input space.
    weights = layer.weight.detach().numpy()
    return np.dot(shap_values, weights)

def calculate_shap_value(encoder_layers, shap_values):
    # Walk the encoder's linear layers in reverse order
    # (latent -> input) so the matrix shapes line up.
    shap_values_original = shap_values
    for layer in reversed(list(encoder_layers)):
        if isinstance(layer, torch.nn.Linear):
            shap_values_original = reverse_map_shap(shap_values_original, layer)
    return shap_values_original
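To check the shapes, here is a minimal NumPy sketch of the back-mapping I have in mind (assuming a hypothetical 1000 → 256 → 32 encoder; note that it walks the layers in reverse and ignores the nonlinearities between them, so it is only a linear approximation):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(256, 1000))  # first encoder layer, 1000 -> 256
W2 = rng.normal(size=(32, 256))    # second encoder layer, 256 -> 32

# SHAP values computed on the 32-dim encoding (stand-in values).
shap_latent = rng.normal(size=(4, 32))

# Back-map in REVERSE layer order (latent -> input):
shap_input = shap_latent @ W2      # (4, 256)
shap_input = shap_input @ W1       # (4, 1000)
print(shap_input.shape)            # (4, 1000)
```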
My purpose is to explain which input features have the most impact on the binary classification. Is my method correct? And how can I compute the SHAP values to find the top 5 or top 10 features that most influence the binary classification model?
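One approach I considered for the top-k step is to rank features by their mean absolute SHAP value across samples (shap_input below is a random placeholder for the real back-mapped values):

```python
import numpy as np

rng = np.random.default_rng(42)
# Placeholder for back-mapped SHAP values of shape (n_samples, 1000).
shap_input = rng.normal(size=(100, 1000))

mean_abs = np.abs(shap_input).mean(axis=0)    # importance per feature
top_k = 10
top_idx = np.argsort(mean_abs)[::-1][:top_k]  # indices of the top-10 features
print(top_idx)
```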