I ran an experiment in which I trained and evaluated eight ML and DL models on survival analysis tasks, with hyperparameter optimization for each model. After tuning, each model was trained once on the training data and evaluated once on the test data, so I have eight c-index scores in total, one per model.
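To make clear what each of the eight scores measures, here is a minimal sketch of a concordance index, assuming Harrell's pairwise definition. The function name `harrell_c_index` and the toy arrays are made up for illustration and are not my actual pipeline or data. Pairs with tied event times are simply skipped here, which is a simplification.

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs that the risk scores order correctly.

    times:       observed time (event or censoring) per subject
    events:      1/True if the event was observed, 0/False if censored
    risk_scores: model output, higher means the event is expected sooner
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: ignore pairs with tied times
        # a is the subject with the shorter observed time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk for the earlier event: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy test set: five subjects, two of them censored.
times = [5, 8, 12, 3, 9]
events = [1, 0, 1, 1, 0]
risk = [0.9, 0.4, 0.2, 0.8, 0.5]
c = harrell_c_index(times, events, risk)
print(f"c-index: {c:.3f}")
```

In my setup, a calculation of this kind is run once per model on the same test set, which is where the eight numbers come from.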
Now I want to determine whether the differences in performance between these models are statistically significant. Given that I have eight models but only a single test score per model, which statistical hypothesis test should I use? Is the Kruskal-Wallis test or ANOVA appropriate here, or should I use something else? Also, how should I interpret the results of the chosen test? Any insights or guidance would be greatly appreciated.