I want to bin a score that indicates the likelihood of default (default = 1). The bins should be found automatically using glmtree from the partykit package in R, so that each bin contains scores with a similar default rate.
This worked before, but the score has since become much better at predicting defaults, and now partykit does not seem to find a solution.
Here is an example using synthetic data that is close to mine:
library(tidyverse)

# Logistic function mapping a score to a default probability
sigmoid <- function(x) {
  1 / (1 + exp(-x))
}

set.seed(1)  # fixed seed so the example is reproducible
n_sample <- 10^5
score <- runif(n_sample, min = -5, max = 5)
defaults <- rbinom(length(score), size = 1, prob = sigmoid(score))
df <- tibble(score = score, default_flag = defaults) %>%
  mutate(score_bin = cut(score, breaks = 100))

# This is the call that does not find a solution
partykit::glmtree(formula = default_flag ~ score,
                  data = as.data.frame(df),
                  family = binomial)
The average default rate looks as follows when the score is cut into 100 equal-width bins:
df %>%
  mutate(score_bin = cut(score, breaks = 100)) %>%
  group_by(score_bin) %>%
  summarise(default_rate = sum(default_flag) / n()) %>%
  plot()
My intuition is that partykit does not find a solution because many different cut points would work very well, i.e. each of them would split the data into one group with more defaults and one with fewer. Does this make sense?
How can I make partykit::glmtree find a binning for this example?
I have tried:
- increasing maxit to 100
- increasing minsize
- using ctree instead, which finds a solution quickly
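For reference, the ctree attempt from the last bullet can be sketched roughly as below. This is only an illustration under a few assumptions that are not part of my original code: a smaller sample (10^4 instead of 10^5) and a fixed seed to keep it quick and reproducible, the terminal node id used as the bin label, and the helper names bin and bins, which are made up for this sketch.

```r
library(partykit)

set.seed(1)  # assumption: seed added for reproducibility
sigmoid <- function(x) 1 / (1 + exp(-x))

# Assumption: smaller synthetic sample than in the question, to keep this quick
n_sample <- 10^4
score <- runif(n_sample, min = -5, max = 5)
df <- data.frame(
  score = score,
  default_flag = rbinom(n_sample, size = 1, prob = sigmoid(score))
)

# Conditional inference tree with score as the only split variable
ct <- ctree(default_flag ~ score, data = df)

# The terminal node id of each observation serves as its bin label
df$bin <- predict(ct, newdata = df, type = "node")

# Score range and observed default rate per bin
bins <- data.frame(
  bin = sort(unique(df$bin)),
  score_min = as.numeric(tapply(df$score, df$bin, min)),
  score_max = as.numeric(tapply(df$score, df$bin, max)),
  default_rate = as.numeric(tapply(df$default_flag, df$bin, mean))
)
bins <- bins[order(bins$score_min), ]
print(bins)
```

Each row of bins is one terminal node of the tree, so the score_min and score_max columns give the approximate cut points that I would like glmtree to produce as well.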