I am migrating a process from R to PySpark. It builds a regression tree, currently implemented with R's rpart package.
While configuring this in PySpark, I can't find an option to specify a custom impurity function. My dataset is skewed, and instead of using the mean and variance/standard deviation as the criterion for a node's impurity, I want to use a metric better suited to skewed data.
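For reference, here is a minimal sketch of my current setup. The column names, the toy DataFrame, and the hyperparameters are placeholders; my real data is larger and skewed:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import DecisionTreeRegressor

spark = SparkSession.builder.getOrCreate()

# Placeholder training data; the real dataset has a heavily skewed label.
train = spark.createDataFrame(
    [(1.0, 2.0, 10.0), (2.0, 3.0, 12.0), (3.0, 5.0, 50.0), (4.0, 6.0, 400.0)],
    ["x1", "x2", "label"],
)
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
train = assembler.transform(train)

# Per the docs, "variance" is the only supported value for impurity,
# and there is no parameter that accepts a custom criterion.
dt = DecisionTreeRegressor(featuresCol="features", labelCol="label",
                           impurity="variance", maxDepth=5)
model = dt.fit(train)
```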
How can I define a custom impurity function in PySpark?
I've looked at the documentation for Decision Tree Regression, and the doc string for the impurity parameter only mentions support for variance:
impurity = Param(parent='undefined', name='impurity', doc='Criterion used for information gain calculation (case-insensitive). Supported options: variance')
Is there any workaround to define a custom impurity function?
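To make the ask concrete, here is a hypothetical sketch of what I was hoping to write. Neither a callable impurity= argument nor custom_impurity exists in PySpark; this is purely to illustrate the intent:

```python
import numpy as np
from pyspark.ml.regression import DecisionTreeRegressor

def custom_impurity(labels: np.ndarray) -> float:
    # Example criterion for skewed data: median absolute deviation
    # (spread around the median) instead of variance around the mean.
    return float(np.median(np.abs(labels - np.median(labels))))

# HYPOTHETICAL API: DecisionTreeRegressor does NOT accept a callable here;
# this is the kind of hook I was hoping to find.
dt = DecisionTreeRegressor(featuresCol="features", labelCol="label",
                           impurity=custom_impurity)
```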