I am new to ML and SHAP. I am trying to draw a SHAP summary plot for a random forest trained on the Titanic survival data, but I cannot get it to work properly. Instead of the beeswarm plot, it generates an interaction plot (with two sets of x/y axes). The SHAP value array has two values for each feature, one positive and one negative, as shown below. Could you help me understand what is happening here, please? Thank you!
This is the code I have used:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import shap
train = pd.read_csv("train.csv")
#select 3 features for training
train_imp = train[['Survived', 'Pclass','Sex','Embarked']].copy()  # copy to avoid SettingWithCopyWarning on the assignments below
## assign data type
train_imp.Pclass = train_imp.Pclass.astype('category')
train_imp.Embarked = train_imp.Embarked.astype('category')
train_imp.Sex = train_imp.Sex.astype('category')
## replace categories with numbers
categories = {"female": 1, "male": 0}
train_imp['Sex']= train_imp['Sex'].map(categories)
categories = {"S": 1, "C": 2, "Q": 3}
train_imp['Embarked']= train_imp['Embarked'].map(categories)
# define X and y
X = train_imp.drop(['Survived'], axis=1)
y = train_imp['Survived'].values
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12)
#train model and predict
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))  # argument order is (y_true, y_pred)
[image: classification report]
#shap value and summary plot
explainer = shap.Explainer(rf)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values,X_test)
[image: shap value summary plot]
print(shap_values)
[image: shap value array]
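A likely explanation, offered as a sketch rather than a confirmed diagnosis: for a binary classifier, tree explainers report one set of SHAP values per class, and because the two class probabilities sum to 1, the values for class 0 and class 1 are negatives of each other. Depending on the shap version, that comes back either as a list of two 2D arrays or as one 3D array of shape (n_samples, n_features, n_classes), and a 3D array passed to `summary_plot` appears to be treated as interaction values. The helper below (`select_class_shap` is a hypothetical name, not part of shap) picks out the values for a single class in either layout. It runs on synthetic numbers so that it does not depend on `train.csv` or on a particular shap version.

```python
import numpy as np

def select_class_shap(shap_values, class_index=1):
    """Hypothetical helper: return a 2D (n_samples, n_features) array for one class.

    Assumes shap_values is either a list with one 2D array per class,
    or a 3D array shaped (n_samples, n_features, n_classes).
    """
    if isinstance(shap_values, list):
        return np.asarray(shap_values[class_index])
    arr = np.asarray(shap_values)
    if arr.ndim == 3:
        return arr[:, :, class_index]
    return arr  # already 2D, nothing to select

# Synthetic stand-in for real SHAP output: 5 samples, 3 features, 2 classes.
rng = np.random.default_rng(0)
class1 = rng.normal(size=(5, 3))
stacked = np.stack([-class1, class1], axis=-1)  # 3D layout, shape (5, 3, 2)
as_list = [-class1, class1]                     # list-per-class layout

sv_from_array = select_class_shap(stacked)
sv_from_list = select_class_shap(as_list)
print(sv_from_array.shape)  # (5, 3)
```

With the objects from the question, something like `shap.summary_plot(select_class_shap(shap_values), X_test)` would then be expected to draw the beeswarm plot for the "survived" class, assuming class index 1 corresponds to `Survived == 1`.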