I am writing a Python library for scientific calculations. The user should have the possibility to try out these calculations on some test data that (ideally) ships with the Python package. For example:
from mypackage.data import dataset1
from mypackage.science import do_stuff
ds = dataset1() # Downloaded on demand
# result = do_stuff(ds)
The test data are too large (10-100 MB) to be hosted in a GitHub repository. For development purposes, I am using dvc and hosting the data on a Google Cloud Storage bucket. However, this does not work in production.
Is there a Python library that automatically fetches datasets from a remote bucket? Or more generally speaking, are there any conventions and best practices when shipping test data with Python packages?
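For context, the pattern I have in mind is roughly the following. This is only a minimal sketch using the standard library; the base URL, cache location, and filenames are placeholders, not an actual hosting setup:

```python
# Minimal sketch of a download-on-demand dataset loader using only the
# standard library. BASE_URL and CACHE_DIR are hypothetical placeholders.
import urllib.request
from pathlib import Path

# Hypothetical public bucket URL; replace with the real storage location.
BASE_URL = "https://storage.googleapis.com/mypackage-data/"
# Per-user cache directory so the package itself stays small.
CACHE_DIR = Path.home() / ".cache" / "mypackage"


def fetch(filename: str) -> Path:
    """Return a local path to *filename*, downloading it on first use."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    local = CACHE_DIR / filename
    if not local.exists():
        # Only hit the network if the file is not cached yet.
        urllib.request.urlretrieve(BASE_URL + filename, local)
    return local
```

A `dataset1()` function could then call `fetch("dataset1.nc")` and load the returned path. A real implementation would also want checksum verification so a corrupted or changed remote file is detected.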