I realise that the values of `sample_size` in `boost_tree()` represent the proportion (for the xgboost engine; an absolute number otherwise) of observations to be subsampled per iteration (tree). What I don't understand is how the proportions can be so high (e.g. 0.49) when sampling is supposed to be random without replacement. Wouldn't a proportion of 0.49 of the dataset per tree mean repetition across 1000 trees? I don't know if this is due to my lack of understanding of how subsamples are selected, but any help would be greatly appreciated!
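To make my confusion concrete, here is how I currently picture per-tree subsampling (a toy base R sketch, not the actual xgboost internals; `n_obs` and `n_trees` are made-up numbers). If each tree draws a fresh subsample like this, the same observation inevitably shows up in many trees:

```r
# Toy sketch of how I picture subsampling with sample_size = 0.49:
# each tree draws its own 49% subsample without replacement.
set.seed(1)
n_obs   <- 1000   # made-up dataset size
n_trees <- 10     # made-up number of trees

subsamples <- lapply(seq_len(n_trees), function(i) {
  sample(n_obs, size = floor(0.49 * n_obs), replace = FALSE)
})

# Count how many of the trees' subsamples contain observation 1 --
# with 49% per tree, it lands in roughly half of them.
sum(vapply(subsamples, function(s) 1 %in% s, logical(1)))
```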
I looked at the various parameter combinations in the tuning grid to understand how subsampling works (I've used k-fold cross-validation resamples and I'm trying to understand how those interact with subsampling during training). I inspected `sample_size` to try to understand how subsamples are selected, but found the proportions to be much higher than I expected, given that boosting samples randomly without replacement (so, as I understood it, each observation would only be used once).
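For reference, my setup looks roughly like this (a minimal sketch on a built-in dataset; my real data, grid values, and engine arguments differ):

```r
library(tidymodels)

# Tunable xgboost spec: sample_size is the per-tree subsampling proportion
spec <- boost_tree(trees = 1000, sample_size = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# 5-fold cross-validation resamples
folds <- vfold_cv(mtcars, v = 5)

# Candidate subsampling proportions, including the surprisingly high 0.49
grid <- tibble(sample_size = c(0.25, 0.49, 0.75, 1.0))

res <- tune_grid(spec, mpg ~ ., resamples = folds, grid = grid)
collect_metrics(res)
```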