I am trying to model nitrate concentrations in streams in Bavaria, Germany, using a Random Forest model. I am using Python, primarily sklearn. I have data from 490 water quality stations, and I am following the methodology of the paper by Longzhu Q. Shen et al., which can be found here: https://www.nature.com/articles/s41597-020-0478-7
I want to split my dataset into training and testing sets such that the spatial distribution of the data in both sets is identical. The idea is that if the split ignores the spatial distribution, the training set might end up with a concentration of points from densely gauged areas while sparser areas are left out. This can skew the model's learning, making it less accurate and less generalizable across the entire area of interest. sklearn's train_test_split just divides the data randomly and does not consider spatial patterns in the data.
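For reference, this is the plain random split I am doing at the moment (X, y, and the 70/30 ratio here are just placeholders for my actual feature matrix and target):

```python
from sklearn.model_selection import train_test_split

# Plain random split -- every station has the same chance of landing in
# either set, regardless of where it sits on the map
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```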
The paper I mentioned above follows this methodology: “We split the full dataset into two sub-datasets, training and testing respectively. To consider the heterogeneity of the spatial distribution of the gauge stations, we employed the spatial density estimation technique in the data splitting step by building a density surface using Gaussian kernels with a bandwidth of 50 km (using v.kernel available in GRASS GIS [33]) for each species and season. The pixel values of the resultant density surface were used as weighting factors to split the data into training and testing subsets that possess identical spatial distributions.”
I want to follow the same methodology, but instead of using GRASS GIS, I am building the density surface myself in Python. I have also extracted the probability density values and the weights for the stations (see the attached figure).
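For context, this is roughly how I build the density surface and extract the per-station weights (a simplified sketch of my code; I use sklearn's KernelDensity, and I assume here that the lon/lat coordinates have already been reprojected to metric 'x'/'y' columns so that the 50 km bandwidth makes sense):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# coords: (n_stations, 2) array of projected station coordinates in metres
# (df is my stations dataframe; 'x'/'y' are the reprojected lon/lat columns)
coords = df[["x", "y"]].to_numpy()

# Gaussian KDE with a 50 km bandwidth, as in the paper
kde = KernelDensity(kernel="gaussian", bandwidth=50_000).fit(coords)

# Density value at each station, normalised so the weights sum to 1
weights = np.exp(kde.score_samples(coords))
weights /= weights.sum()
df["weight"] = weights
```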
Now the only problem I am facing is how to use these weights to split the data into training and testing sets. I checked, and there is no keyword argument in sklearn's train_test_split that takes weights into account. I also went back and forth with ChatGPT-4, but it was not able to give me a clear answer, nor did I find anything concrete on the internet about this. Maybe I am missing something.
Is there another function I can use to do this, or will I have to write my own splitting algorithm? In case of the latter, could you please suggest an approach so I can code it myself?
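One workaround I have considered (I am not sure it reproduces the paper's method) is to discretise the density values into quantile bins and pass them to the stratify keyword of train_test_split, so that each density stratum is represented proportionally in both sets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Bin stations into 10 density quantiles (each bin holds ~49 of the
# 490 stations), then stratify the split on those bins
density_bins = pd.qcut(df["weight"], q=10, labels=False)

train_df, test_df = train_test_split(
    df, test_size=0.3, stratify=density_bins, random_state=42
)
```

Would this count as “identical spatial distributions” in the sense of the paper, or does the paper's weighting imply sampling with probabilities proportional to the weights (e.g., via numpy.random.choice with p=weights)?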
In the attached figure you can see the locations of the stations and the probability density surface generated using kernel density estimation (with Gaussian kernels).
I am also attaching a screenshot of my dataframe to give you an idea of the data structure (all columns after the longitude (‘lon’) column are used as features; the NO3 column is the target variable).
I will be grateful for any answers.
Please find the attached images for reference.
Probability density surface generated using kernel density estimation with Gaussian kernels.
The dataset I am using to model the nitrate concentrations.