I have two models trained on a number of tags and use them to predict the genre of a game. I noticed that, due to how the models were trained, the same input data can sometimes make the two models output wildly different genres.
I would like to limit one model's predictions to labels compatible with what the other model suggested, but I don't know how to do this. Example below:
Model1_labels = ["JRPG", "Horror", "FPS", "Platformer"]
Model2_labels = ["Mario", "War_shooter", "fantasy_rpg"]
training_data =
Label1  Label2        Tags
JRPG    fantasy_rpg   open_world, action, level-up, fantasy
JRPG    fantasy_rpg   level-up, turn-based, fantasy
FPS     War_shooter   open-world, 1st person, tanks, planes, shooter
FPS     War_shooter   1st person, war, shooter, level-up
JRPG    Mario         level-up, turn-based, shooter
…
From the example, War_shooter can only ever be an FPS, since a war shooter is by definition an FPS game set during a war.
But how do I limit the predictions?
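Conceptually, what I'm after is something like the sketch below: use model1's prediction to look up which model2 labels are allowed, mask out the rest of model2's probabilities, and renormalize. The `COMPATIBLE` map and the helper names are made up for illustration; only the sklearn `predict`/`predict_proba`/`classes_` calls are from my actual setup.

```python
import numpy as np

# Hypothetical map from a model1 genre to the model2 labels it allows.
COMPATIBLE = {
    "FPS": {"War_shooter"},
    "JRPG": {"fantasy_rpg", "Mario"},
}

def mask_and_renormalize(classes, probs, allowed):
    """Zero out probabilities of labels outside `allowed`, rescale the rest to sum to 1."""
    mask = np.array([c in allowed for c in classes], dtype=float)
    masked = probs * mask
    total = masked.sum()
    if total == 0.0:
        # no allowed label received any probability; fall back to the raw scores
        return probs
    return masked / total

def constrained_predict(model1, model2, tags):
    """Predict with model2, restricted to labels compatible with model1's output."""
    genre1 = model1.predict([tags])[0]
    allowed = COMPATIBLE.get(genre1, set(model2.classes_))
    probs = mask_and_renormalize(model2.classes_, model2.predict_proba([tags])[0], allowed)
    return model2.classes_[int(probs.argmax())]
```
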
The code for how I train and predict is below:
SDG_PARAMS_DICT: Final[Dict[str, Any]] = dict(alpha=1e-5, penalty="l2", max_iter=1000, loss="log_loss")
VECTORIZER_PARAMS_DICT: Final[Dict[str, Any]] = dict(ngram_range=(1, 4), min_df=5, max_df=0.8)
def build_model(x_data, y_data) -> Pipeline:
    game_predict_pipeline = Pipeline(
        [
            # ** is needed to unpack the params dict into keyword arguments
            ("vect", CountVectorizer(**VECTORIZER_PARAMS_DICT)),
            ("tfidf", TfidfTransformer()),
            ("clf", SelfTrainingClassifier(SGDClassifier(**SDG_PARAMS_DICT), verbose=True)),
        ]
    )
    # X_test / y_test are currently unused
    X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, train_size=0.3)
    game_predict_pipeline.fit(X_train, y_train)
    return game_predict_pipeline
game_data = pd.read_excel("c:/my_game_data.xlsx", keep_default_na=False)
model1 = build_model(game_data["Tag"], game_data["Label1"])
model2 = build_model(game_data["Tag"], game_data["Label2"])
test_tags = "level-up, open-world, shooter"
model1.predict([test_tags])
model2.predict([test_tags])
results:
model1 - correct: FPS
model2 - incorrect: Mario
I thought about calling predict_proba and removing labels from the list, but this doesn't change the probability of the prediction, resulting in many scores not reaching a theoretical cutoff:
comparison_dict: Dict = {"FPS": ["War_shooter"]}
model1_prediction: str = model1.predict([test_tags])[0]
prediction2: np.ndarray = model2.predict_proba([test_tags])
classes: np.ndarray = model2.classes_
prediction_dict: Dict = {}
for idx, model_cls in enumerate(classes):
    if model_cls in comparison_dict.get(model1_prediction, []):
        if prediction2[0][idx] >= 0.6:  # cutoff
            prediction_dict[model_cls] = prediction2[0][idx]
output with cutoff:
None
output without cutoff:
"War_shooter": 0.42