I originally asked this question here, but no one has answered.
catboost version: 1.2.3
Operating System: Ubuntu Linux 20.04, 5.15 kernel
CPU: x86_64
GPU: N/A
PySpark version: 3.5.0
Python version: 3.9
Spark jar package: ai.catboost:catboost-spark_3.5_2.12:1.2.3
I am saving/serializing a PySpark CrossValidator instance. The CrossValidator's estimator is a PySpark ML Pipeline with many stages, and catboost_spark.CatBoostRegressor is the last stage. The CrossValidator appears to save/serialize without issue. Examining the saved CrossValidator artifacts/files shows the following metadata for the CatBoostRegressor stage:
{"class":"ai.catboost.spark.CatBoostRegressionModel","timestamp":1714510495510,"sparkVersion":"3.5.0","uid":"CatBoostRegressionModel_c919e14ac4a5"}
I then use the same PySpark environment, with the same config settings for properties like spark.jar*, spark.*, etc., to load the previously saved CrossValidator. All stages appear to deserialize successfully except the CatBoostRegressor stage, which fails with the following error:
Traceback (most recent call last):
File "/home/myaccount/Projects/my-project/src/models/predict_model.py", line 303, in <module>
main()
File "/home/myaccount/Projects/my-project/src/my_project/utils/logger_utils.py", line 33, in wrapper
return func(*args, **kwargs)
File "/home/myaccount/.pyenv/versions/my-project/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/home/myaccount/.pyenv/versions/my-project/lib/python3.9/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/home/myaccount/.pyenv/versions/my-project/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/myaccount/.pyenv/versions/my-project/lib/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/home/myaccount/Projects/my-project/src/models/predict_model.py", line 219, in main
predict_model(spark_session, enable_hive_support, metastore_database, metastore_table_processed_data,
File "/home/myaccount/Projects/my-project/src/my_project/utils/logger_utils.py", line 33, in wrapper
return func(*args, **kwargs)
File "/home/myaccount/Projects/my-project/src/models/predict_model.py", line 240, in predict_model
regression_models: list[catboost_spark.CatBoostRegressionModel] = load_cross_validator_models_from_file(
File "/home/myaccount/Projects/my-project/src/my_project/utils/pyspark_dataframe_file_utils.py", line 200, in load_cross_validator_models_from_file
models.append(CrossValidatorModel.load(path))
File "/home/myaccount/.pyenv/versions/my-project/lib/python3.9/site-packages/pyspark/ml/util.py", line 369, in load
return cls.read().load(path)
File "/home/myaccount/.pyenv/versions/my-project/lib/python3.9/site-packages/pyspark/ml/tuning.py", line 549, in load
metadata, estimator, evaluator, estimatorParamMaps = _ValidatorSharedReadWrite.load(
File "/home/myaccount/.pyenv/versions/my-project/lib/python3.9/site-packages/pyspark/ml/tuning.py", line 434, in load
estimator: Estimator = DefaultParamsReader.loadParamsInstance(estimatorPath, sc)
File "/home/myaccount/.pyenv/versions/my-project/lib/python3.9/site-packages/pyspark/ml/util.py", line 650, in loadParamsInstance
instance = py_type.load(path)
File "/home/myaccount/.pyenv/versions/my-project/lib/python3.9/site-packages/pyspark/ml/util.py", line 369, in load
return cls.read().load(path)
File "/home/myaccount/.pyenv/versions/my-project/lib/python3.9/site-packages/pyspark/ml/pipeline.py", line 249, in load
uid, stages = PipelineSharedReadWrite.load(metadata, self.sc, path)
File "/home/myaccount/.pyenv/versions/my-project/lib/python3.9/site-packages/pyspark/ml/pipeline.py", line 439, in load
stage: "PipelineStage" = DefaultParamsReader.loadParamsInstance(stagePath, sc)
File "/home/myaccount/.pyenv/versions/my-project/lib/python3.9/site-packages/pyspark/ml/util.py", line 649, in loadParamsInstance
py_type: Type[RL] = DefaultParamsReader.__get_class(pythonClassName)
File "/home/myaccount/.pyenv/versions/my-project/lib/python3.9/site-packages/pyspark/ml/util.py", line 556, in __get_class
return getattr(m, parts[-1])
AttributeError: module 'ai.catboost.spark' has no attribute 'CatBoostRegressor'
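For reference, the load side reduces to essentially this single call (the path is a placeholder; the helper in the traceback does little more than call CrossValidatorModel.load and collect the results):

from pyspark.ml.tuning import CrossValidatorModel
import catboost_spark  # catboost_spark is imported in this environment as well

cv_model = CrossValidatorModel.load("/some/path/cv_model")  # raises the AttributeError above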
What’s going on? Does CatBoost not work with PySpark CrossValidator deserialization? Is this related to a similar issue described here?
I also get a similar error when I save/load with a PySpark PipelineModel instead of a CrossValidator.
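That PipelineModel variant continues the placeholder sketch above (same pipeline, train_df, and placeholder paths) and fails on load with a similar AttributeError:

from pyspark.ml import PipelineModel

pipeline_model = pipeline.fit(train_df)  # same placeholder pipeline as above
pipeline_model.write().overwrite().save("/some/path/pipeline_model")

# In a separate run, with the same Spark config and catboost_spark imported:
loaded = PipelineModel.load("/some/path/pipeline_model")  # fails with a similar AttributeError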