My training data is a Spark DataFrame. One column of the DataFrame is a sparse vector built with SparseVector, and the vector's length is 10,000. When I train the model, I hit a poll timeout even after sampling the training set down to only 10,000 records.
My pipeline looks like this:
indexer = StringIndexer(inputCols=ohe_cols, outputCols=index_out_cols, handleInvalid='keep')
encoder = OneHotEncoder(inputCols=index_out_cols, outputCols=ohe_out_cols)
va = VectorAssembler(inputCols=va_input_cols, outputCol='features')
xgb = XGBoostClassifier(
    featuresCol='features',
    labelCol='is_convert',
    weightCol='weight',
    objective='reg:logistic',
    evalMetric='auc',
    numRound=5,
    eta=0.1,
    maxDepth=2,
    missing=0.0,
    numWorkers=10,
    timeoutRequestWorkers=900,
    colsampleBytree=0.5
)
pp = Pipeline(stages=[indexer, encoder, va, xgb])
Is this happening because sparkxgb cannot handle this many features?