For context:
Stata has a command called ‘splitsample’ that splits the current dataset into random partitions. nsplit() defines how many groups to create, and gen() names the new variable that stores each observation's group identifier.
There are additional options that can be passed, such as balance() and rround:
balance() forces the split groups to have roughly equal proportions/means of the variables you list
rround randomly rounds the group sample sizes so that they satisfy the requested sample-size ratios
splitsample, gen(group) nsplit(2) balance(clicks networkrevenue) rround
Here’s an example of the output in Stata.
Since I specified rround, the proportions are near 50/50 but not exact:
tab group
      group |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      1,903       50.44       50.44
          2 |      1,870       49.56      100.00
------------+-----------------------------------
      Total |      3,773      100.00
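The equivalent frequency check in pandas (assuming the data and group assignments ended up in a DataFrame called df with a column called group, which is just my placeholder setup) would be something like:
# df is assumed to be a pandas DataFrame holding the same data,
# with the group assignment stored in a column called "group"
counts = df["group"].value_counts().sort_index()                    # analogous to the Freq. column of `tab group`
percents = df["group"].value_counts(normalize=True).sort_index() * 100   # analogous to the Percent column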
And because I also specified balance(clicks networkrevenue), when I look at the summary statistics for my two groups, their means are also similar:
tab group, sum(clicks)
            |      Summary of Clicks
      group |        Mean   Std. dev.       Freq.
------------+------------------------------------
          1 |    30.31319   108.58514       1,903
          2 |   29.740107   98.621471       1,870
------------+------------------------------------
      Total |   30.029155   103.75319       3,773
tab group, sum(networkrevenue)
            |  Summary of NetworkRevenue
      group |        Mean   Std. dev.       Freq.
------------+------------------------------------
          1 |   44194.721   242880.98       1,903
          2 |   44970.651   220502.77       1,870
------------+------------------------------------
      Total |   44579.293   232029.24       3,773
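For reference, the equivalent check of the group means in pandas (again assuming a DataFrame df with the same variable names, which is only my placeholder setup) would be something like:
# group-wise mean, std. dev., and count, analogous to `tab group, sum(var)`
summary = df.groupby("group")[["clicks", "networkrevenue"]].agg(["mean", "std", "count"])
print(summary)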
What I’m looking for is a Python equivalent to this command, with behavior similar to the balance() and rround options.
I’ve done some cursory research into the available Python sampling libraries, but they all behave differently from Stata’s splitsample.
I’ve looked at imbalanced-learn’s over- and under-sampling methods and scikit-learn’s sampling methods.
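To make the behavior I’m after concrete, here is a rough sketch of the kind of thing I’m imagining: bin the balance variables into quantile strata, then split within each stratum with randomly rounded group sizes. This is only my guess at the behavior, not Stata’s actual algorithm, and the function name, DataFrame df, and column names are all placeholders from my example.
import numpy as np
import pandas as pd

def split_balanced(df, balance_cols, nsplit=2, nbins=10, seed=None):
    """Assign each row of df to one of nsplit groups, trying to keep the means of
    balance_cols similar across groups by splitting within quantile bins of those
    variables, with random rounding of the within-bin group sizes.
    A sketch only, not a reimplementation of Stata's splitsample."""
    rng = np.random.default_rng(seed)
    group = pd.Series(0, index=df.index, name="group")  # rows with missing balance values would keep 0
    # form strata from quantile bins of each balance variable
    strata = [pd.qcut(df[c], q=nbins, duplicates="drop") for c in balance_cols]
    for _, idx in df.groupby(strata, observed=True).groups.items():
        idx = rng.permutation(np.asarray(idx))
        base, rem = divmod(len(idx), nsplit)
        sizes = np.full(nsplit, base)
        if rem:
            # randomly pick which groups receive the leftover rows ("random rounding")
            sizes[rng.choice(nsplit, size=rem, replace=False)] += 1
        group.loc[idx] = np.repeat(np.arange(1, nsplit + 1), sizes)
    return group

# hypothetical usage with the column names from my example
# df["group"] = split_balanced(df, ["clicks", "networkrevenue"], nsplit=2, seed=12345)
That gets me near-equal group sizes and roughly similar means, but it feels hand-rolled, so I’d rather use an existing library function if one behaves like splitsample.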
Any help is greatly appreciated!