I've been using scikit-learn's Gaussian process regressor for a while, working with adaptively constructed models in which the existing GP is used to select the next datapoints to add to it. Recently I've run into the following error in some cases:
    mu, sig = gprMdl.predict(candidate_pts, return_std=True)
  File "filepath_hidden/python/3.9.8/lib/python3.9/site-packages/sklearn/gaussian_process/_gpr.py", line 371, in predict
    X = self._validate_data(X, ensure_2d=ensure_2d, dtype=dtype, reset=False)
  File "filepath_hidden/python/3.9.8/lib/python3.9/site-packages/sklearn/base.py", line 561, in _validate_data
    X = check_array(X, **check_params)
  File "filepath_hidden/python/3.9.8/lib/python3.9/site-packages/sklearn/utils/validation.py", line 792, in check_array
    _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
  File "filepath_hidden/python/3.9.8/lib/python3.9/site-packages/sklearn/utils/validation.py", line 114, in _assert_all_finite
    raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Previous answers, such as those to "ValueError: Input contains NaN, infinity or a value too large for dtype('float64') while fitting in model", would imply that a NaN value somewhere in the dataset is causing this. However, the input data is generated by drawing random samples from a Gaussian distribution, and the output data is restricted to lie within [-1, 1]. I have checked the data used to produce the GP fit with np.where(np.isnan(data)), and this confirms that there are no NaN values in the GP's dataset prior to the evaluation that produced this error.
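One thing the traceback does show is that the check that raises is run on the points passed to predict (candidate_pts), not on the training set, and that it rejects inf as well as NaN. Here is a sketch of the fuller check I can run; X_train and y_train are placeholder names for the GP's dataset:

import numpy as np

# X_train/y_train are placeholders for the GP's dataset; candidate_pts
# is the array passed to predict(). The validation that raises rejects
# infinities as well as NaNs, so both are worth checking in all three.
for name, arr in [("X_train", X_train), ("y_train", y_train),
                  ("candidate_pts", candidate_pts)]:
    arr = np.asarray(arr, dtype=np.float64)
    print(name,
          "NaNs:", np.isnan(arr).sum(),
          "infs:", np.isinf(arr).sum(),
          "all finite:", np.isfinite(arr).all())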
My next instinct was that this was caused by some sort of conditioning issue, since the tanh filter applied to the data could produce a large number of output values clustered near -1 or 1. But this does not appear to be the case either: the condition number of the GP kernel evaluated at its input datapoints remains steady at roughly 1e7 across all previous models constructed with the same data minus the most recent datapoint, and removing or loosening the filter has had no effect.
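For reference, the conditioning check looks roughly like this; a sketch assuming the fitted GaussianProcessRegressor gprMdl, using its fitted kernel_, X_train_, and alpha attributes:

import numpy as np

# Kernel matrix at the training inputs of the fitted model, with the
# same diagonal jitter (alpha) that fit() adds before factorizing.
K = gprMdl.kernel_(gprMdl.X_train_)
K[np.diag_indices_from(K)] += gprMdl.alpha
print("condition number:", np.linalg.cond(K))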
The last thing I can imagine this coming from is that it may simply be an issue of evaluating the model at too many points relative to its input datapoints (a roughly 100-datapoint GP evaluated at 25,000 points), but this does not seem beyond the package's typical use cases, nor have I had this problem previously with the same code on a different dataset.
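To rule this out, and to narrow down which candidate points actually trip the check, I can predict in chunks; a rough sketch, where batch_size is an arbitrary choice:

import numpy as np

# Predict in chunks so that if one batch raises the ValueError, the
# offending candidate points can be inspected directly.
batch_size = 1000
for start in range(0, len(candidate_pts), batch_size):
    batch = candidate_pts[start:start + batch_size]
    try:
        mu, sig = gprMdl.predict(batch, return_std=True)
    except ValueError as err:
        print(f"failed on rows {start}:{start + len(batch)}: {err}")
        print("non-finite rows:",
              np.where(~np.isfinite(batch).all(axis=1))[0] + start)
        break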
Therefore, I have to ask: what could possibly be causing this?