I’m doing data analysis, and for reproducibility I want to make sure my results are marked with the version of the code used to produce them. I do this with:
import git
repo = git.Repo(search_parent_directories=True)
last_commit_hex = repo.head.object.hexsha
But that’s not necessarily the actual state of the code if there are uncommitted changes, so I want an automatic check in the program just before it starts running the analysis. I can use repo.is_dirty() to check for uncommitted changes in the repo, but that checks too much. Besides the core package, the repo also holds a bunch of miscellaneous files: tests, other analysis scripts that use the core package, etc. Those don’t need to be fully committed, but they will trigger is_dirty(). How do I check only the relevant files?
So, if my file system looks like this:
Repo folder
|- package_folder
|  |- __init__.py
|  |- module1.py
|  |- module2.py
|  |- ...
|- some_analysis_script1.py
|- some_analysis_script2.py
|- some_testing.py
|- ...
I would want to check only the contents of package_folder. Getting the list of files can be done with os.listdir, but how do I check which ones have been changed?
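To make the goal concrete, here is a rough sketch of the kind of check I have in mind. It assumes the git command-line tool is on the PATH and asks `git status --porcelain` about a single pathspec instead of listing files by hand. The helper names (parse_porcelain, changed_files, ensure_clean) are made up for illustration, and the parsing is deliberately simple: it does not handle paths that git quotes because of special characters.

```python
import subprocess


def parse_porcelain(output):
    """Extract file paths from `git status --porcelain` (v1) output."""
    paths = []
    for line in output.splitlines():
        if not line:
            continue
        # Each line is "XY <path>": two status characters, a space, the path.
        entry = line[3:]
        # Renames are reported as "old -> new"; keep the new path.
        if " -> " in entry:
            entry = entry.split(" -> ", 1)[1]
        paths.append(entry)
    return paths


def changed_files(repo_root, subpath):
    """Return modified, staged, or untracked paths under `subpath` only."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--", subpath],
        cwd=repo_root,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_porcelain(result.stdout)


def ensure_clean(repo_root, subpath):
    """Raise if anything under `subpath` differs from the last commit."""
    dirty = changed_files(repo_root, subpath)
    if dirty:
        raise RuntimeError(f"Uncommitted changes in {subpath}: {dirty}")
```

With the layout above, calling ensure_clean(repo.working_tree_dir, "package_folder") just before the analysis starts would ignore changes to the scripts outside package_folder. Untracked files inside the folder are also reported, which may or may not be what is wanted.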