Background
I have a Drupal site project in a Git repo. I inherited this project and noticed that the repo itself is much bigger than it should be. As this Git repo is part of the deployment, the hosting company asked me to do something about it.
The problem
My understanding is that only customizations should be part of the repository (configuration, custom modules and themes, working files, etc.), so things like Drupal core files, vendor projects, and contributed modules should not be in it.
By executing some commands, I realized that there are a lot of files in my repo that should not be there: files that, at some point, were added to Git before the person responsible at the time realized they should be ignored (added to .gitignore).
The scope
To find out which files I had in my repo, I ran the following command:
git rev-list --objects --all |
git cat-file --batch-check='%(objectname) %(objecttype) %(objectsize) %(rest)' |
sed -n 's/^.* blob //p' |
awk '{printf "%.3f MB\t%s\n", $1/1048576, $2}' |
sort -n -k 1 > git-list-file-with-size.txt
The result was a large TXT file (close to 100,000 lines) showing some interesting results:
- a bunch of ZIP files from /.composer-cache folder
- a bunch of files from folders that should be (or eventually were) added to .gitignore: folders like /vendor, /core, /modules/contrib, and /themes/contrib
Here are a couple of lines from the TXT file:
18.408 MB vendor/phpstan/phpstan/phpstan.phar
29.474 MB .composer-cache/files/drupal/core/c9e0b432f7eeb1b6ece89b59f659cf09ab4421f5.zip
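To see which top-level folders account for most of that listing, the sizes could be totalled per folder. This is only a minimal sketch: size_by_top_dir is a hypothetical helper name, and it assumes the listing format shown above (size, "MB", then the path). The sizes are the uncompressed blob sizes reported by git cat-file, not the space used on disk.

```shell
# Hypothetical helper: sum the sizes in the listing per top-level folder.
# $1: listing file, one entry per line, e.g. "18.408 MB<TAB>vendor/..."
size_by_top_dir() {
  LC_ALL=C awk '
    {
      # The first path component is the top-level folder (or the file name
      # itself for files in the repository root).
      split($3, parts, "/")
      total[parts[1]] += $1
    }
    END {
      for (dir in total) printf "%.3f MB\t%s\n", total[dir], dir
    }
  ' "$1" | LC_ALL=C sort -rn -k 1
}
```

Running size_by_top_dir git-list-file-with-size.txt would print one line per top-level folder, largest first.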
It is my understanding that even if these files are not in the current state of any branch, they still exist in the repo (in case we have to revert to a point in time when they existed).
Of the 100,000 lines, roughly 90,000 correspond to the examples mentioned above, i.e., from my understanding, files that should not be in my repo. Is that correct?
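That estimate could be checked against the listing itself. The sketch below is an assumption-laden example rather than what I actually ran: count_unwanted is a hypothetical helper, and the folder prefixes are the ones I later pass to git filter-repo.

```shell
# Hypothetical helper: count listing entries that sit under a folder
# that should not be tracked.
# $1: listing file, one entry per line, e.g. "18.408 MB<TAB>vendor/..."
count_unwanted() {
  LC_ALL=C awk '
    BEGIN {
      # Assumed prefixes, taken from the filter-repo command in this post.
      n = split(".composer-cache vendor docroot/core docroot/libraries docroot/modules/contrib docroot/themes/contrib web/libraries", prefixes, " ")
    }
    {
      total++
      for (i = 1; i <= n; i++) {
        # Match only whole path components, so "vendor/" but not "vendorish/".
        if (index($3, prefixes[i] "/") == 1) { unwanted++; break }
      }
    }
    END { printf "%d of %d entries are under unwanted paths\n", unwanted, total }
  ' "$1"
}
```

It would be called as count_unwanted git-list-file-with-size.txt.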
The solution
After doing some research, these are the steps I came up with to “clean” the repository:
- Install git filter-repo (https://github.com/newren/git-filter-repo), make it executable and move it to the bin folder.
- Mirror-clone my repo to a new location in my local environment (git clone --mirror).
- Execute git filter-repo, listing all folders and files that should be excluded. Based on everything that I found, I ended up with this full command:
git filter-repo --path .composer-cache --path vendor --path docroot/core --path docroot/libraries --path docroot/modules/contrib --path docroot/themes/contrib --path web/libraries --invert-paths
- Confirm the cleanup. Execute git rev-list --objects --all and it should show that all those folders and their content have disappeared.
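Scanning the rev-list output by eye is error-prone with this many objects, so the confirmation step could be scripted. This is a sketch under assumptions: leftover_paths is a hypothetical helper, and the prefixes simply repeat the --path values from the filter-repo command above.

```shell
# Hypothetical helper: reads `git rev-list --objects --all` output on stdin
# and prints every object path still located under a removed folder.
leftover_paths() {
  LC_ALL=C awk '
    BEGIN {
      # Assumed prefixes: the --path values passed to git filter-repo.
      n = split(".composer-cache vendor docroot/core docroot/libraries docroot/modules/contrib docroot/themes/contrib web/libraries", prefixes, " ")
    }
    {
      # Commit lines have no path, so $2 is empty and never matches.
      for (i = 1; i <= n; i++) {
        if ($2 == prefixes[i] || index($2, prefixes[i] "/") == 1) {
          print $2
          break
        }
      }
    }
  '
}
```

Inside the filtered mirror clone, git rev-list --objects --all | leftover_paths should print nothing if the cleanup worked. Comparing git count-objects -vH before and after gives a rough idea of how much space was reclaimed.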
The result
I believe that, as a result of these changes, when I first push to the cloud repo I will have to force the push:
git push origin --force --all
git push origin --force --tags
Question
Is all this correct? Am I missing anything? Am I overcomplicating this?
Thanks,
Alex