Rewriting History To Remove Unwanted Binaries
We’re in the middle of a TFVC (TFS) to Git migration that I’ll probably blog about more completely later. Right now, I want to cover one thing that we’re cleaning up in the process.
These Aren’t The Packages You’re Looking For
We got into a bad situation where we were checking in our `packages` folder instead of letting NuGet restore handle it.
- Some of this was because we were using older build templates in VSO and NuGet restore wasn’t working how we expected
- Some of this was because we have private packages that aren’t on any feed that VSO can currently access
Rewriting History
If we’re going to have a new project in VSO, and we’re going to be creating a new repo based on the TFVC history, we might as well move up to the new templates and pretend those packages were never there.
The naive approach would be to just do the `git tf` migration, then do a follow-up commit that deletes all the packages. But they’d still be in the history, the repo would still be oversized, and we would still take the hit of having all those binaries hanging around.
Here’s a command that would do the trick for the `Newtonsoft.Json.6.0.1` package alone.
DANGER! WARNING! WE’RE DOING THIS TO A NEW REPO, DURING MIGRATION, BEFORE IT BECOMES SHARED HISTORY. READ THE MANUAL AND CAVEATS BEFORE YOU DO THIS TO EXISTING, SHARED REPOS.
git filter-branch -f --index-filter "git rm -rf --cached --ignore-unmatch packages/Newtonsoft.Json.6.0.1" HEAD
Breakdown
tl;dr this is going to roll through all of our commits and, for each one, remove the Newtonsoft.Json.6.0.1 directory and everything inside of it from the index, so the rewritten commit is recorded without it. It NEVER EXISTED.
- `packages/Newtonsoft.Json.6.0.1` is the most self-explanatory part. This is the sub-directory we’re going to pretend was never added.
- `-rf` should be similarly self-explanatory for anyone with a bit of *nix background. Recursive. Force.
- `--ignore-unmatch` sparked a bit of discussion between Michael and me. What it boils down to is that this instructs `git rm` to exit with a success code even if no files match the pattern. Otherwise it would exit as a failure in every commit where the directory doesn’t exist, and the rewrite would stop there.
- `git rm` says to remove the files.
- `--cached` removes it from the index but leaves the working tree alone.
- This is all wrapped in quotes because it’s a parameter to `git filter-branch`.
- `git filter-branch` on the left is going to rewrite history for every commit.
- `--index-filter` is better explained in the docs*
- `HEAD` on the right side says which history to rewrite: every commit reachable from HEAD, with the current branch moved to the rewritten commits.
- `-f` forces. I will cover why we added this, but not just yet.
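If you want to watch this work before pointing it at a real migration, here is a minimal shell sketch that does it to a throwaway repo. Everything in it is made up for the demo (the `demo_purge` name, the `Foo.1.0.0` package, the file contents), and it assumes `git` is on the path. It counts the commits on HEAD that touch the package directory before and after the rewrite.

```shell
#!/bin/sh
# Throwaway demo: every name and path here is hypothetical.
demo_purge() {
    repo=$(mktemp -d) || return 1
    (
        cd "$repo" || exit 1
        git init -q .
        git config user.name "Demo"
        git config user.email "demo@example.invalid"

        # One commit that checks in a "package" next to real code,
        # then a second commit that only touches the code.
        mkdir -p packages/Foo.1.0.0/lib src
        echo binary > packages/Foo.1.0.0/lib/Foo.dll
        echo code > src/app.txt
        git add .
        git commit -q -m "add code and a checked-in package"
        echo more >> src/app.txt
        git commit -q -am "change code"

        before=$(git rev-list --count HEAD -- packages/Foo.1.0.0)

        # Same command shape as above, pointed at the demo directory.
        FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --index-filter \
            "git rm -rf --cached --ignore-unmatch packages/Foo.1.0.0" HEAD \
            >/dev/null 2>&1

        after=$(git rev-list --count HEAD -- packages/Foo.1.0.0)
        echo "before=$before after=$after"
    )
    rm -rf "$repo"
}
```

Calling `demo_purge` prints the two counts; after the rewrite, no commit on HEAD should touch the package directory any more. `FILTER_BRANCH_SQUELCH_WARNING` only silences the startup warning that newer Git versions print.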
But That’s Just One Of Them
We want to do this for all the public packages, and we want to do it for none of our private ones. We’ll need to get a list.
From the packages directory
git log --diff-filter=A --summary . `
|? { $_ -ne $null -and $_ -match 'create mode \d+ (.+?)/lib/' } `
|% { $($matches[1]) } `
| select -unique | sort `
|% { "git filter-branch -f --index-filter ""git rm -rf --cached --ignore-unmatch $_"" HEAD" }
Breakdown
- The first line dumps all of the adds that happened in the current directory (`packages`).
- The second line filters out blank lines, and matches the regex of created files, capturing the path as a group.
- The third line pulls out the matched path. There are dlls, pdbs, the nupkgs themselves, and who knows what else that might have been added, but our `filter-branch` + `rm` above is going `-rf` on it anyway.
- The fourth line dedupes them, and sorts just for convenience.
- The fifth line writes the command we want to execute to the output stream. This doesn’t execute it, it just outputs what you want to execute.
The output looks something like this
git filter-branch -f --index-filter "git rm -rf --cached --ignore-unmatch packages/bar" HEAD
git filter-branch -f --index-filter "git rm -rf --cached --ignore-unmatch packages/foo" HEAD
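For anyone not on PowerShell, roughly the same list can be built in plain shell. This is only a sketch under a few assumptions: `emit_purge_commands` is a made-up name, and it expects the `git log --diff-filter=A --summary` output to be piped in on stdin. It keeps everything before the first `/lib/` in each created path, dedupes, sorts, and prints the commands without running them.

```shell
#!/bin/sh
# Hypothetical helper: turns `git log --diff-filter=A --summary` output
# (read from stdin) into one filter-branch command per package directory.
emit_purge_commands() {
    awk '$1 == "create" && $2 == "mode" {
        path = $0
        # Strip the leading " create mode 100644 " part.
        sub(/^ *create mode [0-9]+ /, "", path)
        # Keep only what comes before the first "/lib/".
        i = index(path, "/lib/")
        if (i > 1) print substr(path, 1, i - 1)
    }' |
        sort -u |
        while IFS= read -r dir; do
            printf 'git filter-branch -f --index-filter "git rm -rf --cached --ignore-unmatch %s" HEAD\n' "$dir"
        done
}
```

Usage would be along the lines of `git log --diff-filter=A --summary . | emit_purge_commands`, run from the packages directory. Like the PowerShell version, it skips anything that never had a `lib` folder.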
You could go so far as to not wrap the command in quotes and dump it to the output stream. You could just execute it directly. But then, you wouldn’t have accounted for the private packages, so you’d want to have some kind of `$safePackages = ("x","y","z")` and add a `|?` to strip them. Your choice, but until we’ve run this through its paces, I kind of like the safety of having the chance to review what’s going to be done.
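The same "strip the safe ones" idea could look something like this in shell. `drop_protected` and the package names in the usage line are hypothetical; the assumption is that candidate paths come in on stdin, the private packages are passed as arguments, and only exact matches are dropped.

```shell
#!/bin/sh
# Hypothetical filter: prints every stdin line that is NOT one of the
# protected paths given as arguments.
drop_protected() {
    awk 'BEGIN {
             # Remember each argument, then reset ARGC so awk reads stdin
             # instead of treating the arguments as input files.
             for (i = 1; i < ARGC; i++) keep[ARGV[i]] = 1
             ARGC = 1
         }
         !($0 in keep)' "$@"
}
```

Something like `printf '%s\n' packages/bar packages/Private.Lib | drop_protected packages/Private.Lib` would let `packages/bar` through and hold back the private one. Exact matching is deliberate: a prefix match could protect a public package that merely shares a name prefix with a private one.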
The Final Stroke, And Why We -f It
Each `filter-branch` run works in a `.git-rewrite` temporary directory and leaves a backup of the original refs under `refs/original/`. While those leftovers exist, the next `filter-branch` will refuse to run and ask you to clean up first. We are going to be doing this a whole bunch of times and then clean it up after all of them are done, so we just `-f` through it.
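Then, at the end, you do want to clean it up and run a `gc`, with explicit `now` cutoffs so the just-orphaned objects don’t sit out the default reflog and prune grace periods. How?
rm -rf .git/refs/original/ && git reflog expire --expire=now --all && git gc --aggressive --prune=now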
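If the `.git` directory doesn’t seem to shrink, it helps to check that the old objects are really gone. The sketch below is one way to do the cleanup so that it can be checked; `purge_rewrite_leftovers` is a made-up name, and it deletes the backup refs with `git update-ref -d` rather than `rm` so that packed refs are covered too. It assumes it is run from inside the rewritten repo.

```shell
#!/bin/sh
# Hypothetical cleanup helper; run it from inside the rewritten repo.
purge_rewrite_leftovers() {
    # Delete the backup refs that filter-branch keeps under refs/original/.
    git for-each-ref --format='%(refname)' refs/original/ |
        while IFS= read -r ref; do
            git update-ref -d "$ref"
        done
    # Drop every reflog entry, then prune unreachable objects right away
    # instead of waiting out the default grace periods.
    git reflog expire --expire=now --all
    git gc --quiet --prune=now
}
```

To verify, note a blob id from the old history before the rewrite (for example with `git rev-parse HEAD:path/to/file`), then check that `git cat-file -e <id>` fails once the cleanup has run.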
Congrats! By the time you push your fancy, newly migrated Git repo, it’ll be a lot lighter without all those unnecessary binaries. You could certainly consider applying this procedure to all kinds of other directories where you’ve been putting garbage into source control.
* Hat tip to an old post by David Underhill that got us started.
** More thanks to Josh for responding quite quickly to my call for his favorite way to find subdirectories that used to exist but got deleted. I ended up going a different way, and just looking for everything that was ever added, whether it exists or not. And using PowerShell. He says he’s considering blogging his solution soon!!