David Ruttka

I make computers do things

Rewriting History to Remove Unwanted Binaries


We’re in the middle of a TFVC (TFS) to Git migration that I’ll probably blog about more completely later. Right now, I want to cover one thing that we’re cleaning up in the process.

These Aren’t The Packages You’re Looking For

We got into a bad situation where we were checking in our packages folder instead of letting NuGet restore handle it.

  • Some of this was because we were using older build templates in VSO and NuGet restore wasn’t working how we expected
  • Some of this was because we have private packages that aren’t on any feed that VSO can currently access

Rewriting History

If we’re going to have a new project in VSO, and we’re going to be creating a new repo based on the TFVC history, we might as well move up to the new templates and pretend those packages were never there.

The naive approach would be to just do the git tf migration, then do a follow-up commit that deletes all the packages. But they’d still be in the history, the repo would still be oversized, and every clone and fetch would still take the hit of having all those binaries hanging around.

Here’s a command that would do the trick for the Newtonsoft.Json.6.0.1 package alone.

DANGER! WARNING! WE’RE DOING THIS TO A NEW REPO, DURING MIGRATION, BEFORE IT BECOMES SHARED HISTORY. READ THE MANUAL AND CAVEATS (THE git filter-branch DOCUMENTATION) BEFORE YOU DO THIS TO EXISTING, SHARED REPOS.

git filter-branch -f --index-filter "git rm -rf --cached --ignore-unmatch packages/Newtonsoft.Json.6.0.1" HEAD

Breakdown

tl;dr this is going to roll through all of our commits, and for each one, remove the Newtonsoft.Json.6.0.1 directory and everything inside of it from that commit’s index, so the rewritten commit no longer contains it. It NEVER EXISTED.

  • packages/Newtonsoft.Json.6.0.1 is the most self-explanatory part. This is the sub-directory we’re going to pretend was never added.
  • -rf should be similarly self-explanatory for anyone with a bit of *nix background. Recursive. Force.
  • --ignore-unmatch sparked a bit of discussion between Michael and me. What it boils down to is that this instructs git rm to exit with a success code even if no files match the pattern. Otherwise it would exit as a failure if no files were matched and removed, and since the package doesn’t exist in every commit, that failure would abort the whole filter-branch.
  • git rm says to remove the files.
  • --cached removes it from the index but leaves the working tree alone.
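  • This is all wrapped in quotes because it’s a parameter to git filter-branch
  • git filter-branch on the left is going to rewrite history for every commit
  • --index-filter runs the quoted command against each commit’s index without checking out its files, which is much faster than a --tree-filter. It is better explained in the docs*
  • HEAD on the right side says which commits we want rewritten: everything reachable from HEAD. The current branch is then updated to point at the rewritten history.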
  • -f forces. I will cover why we added this, but not just yet.
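If you want to watch the whole thing work before pointing it at a real migration, here is a minimal sketch that builds a throwaway repo in a temp directory and runs the same filter-branch against it. The file names, commit messages, and demo identity are made up for illustration. FILTER_BRANCH_SQUELCH_WARNING only silences the warning banner that newer Git versions print, and older versions simply ignore it.

```shell
#!/bin/sh
set -eu

# Throwaway repo in a temp dir (all names below are made up for the demo).
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# Commit 1: code only, so the package path does not exist yet.
# This is the commit that needs --ignore-unmatch.
mkdir src
echo "code" > src/app.cs
git add .
git commit -q -m "Add code"

# Commit 2: the package gets checked in by mistake.
mkdir -p packages/Newtonsoft.Json.6.0.1/lib
echo "pretend dll" > packages/Newtonsoft.Json.6.0.1/lib/Newtonsoft.Json.dll
git add .
git commit -q -m "Check in packages"

# Commit 3: more code.
echo "more code" >> src/app.cs
git commit -q -am "Change code"

# Rewrite every commit reachable from HEAD, dropping the package from each index.
export FILTER_BRANCH_SQUELCH_WARNING=1
git filter-branch -f --index-filter \
  "git rm -rf --cached --ignore-unmatch packages/Newtonsoft.Json.6.0.1" \
  HEAD >/dev/null 2>&1

commits=$(git rev-list --count HEAD)
touching=$(git log --format= --name-only -- packages)
tracked=$(git ls-files)

echo "commits: $commits"
echo "files under packages in history: ${touching:-none}"
echo "tracked files: $tracked"
```

Note that the "Check in packages" commit survives as an empty commit in this sketch. If you would rather drop commits that end up empty, filter-branch has a --prune-empty option for that.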

But That’s Just One Of Them

We want to do this for all the public packages, and we want to do it for none of our private ones. We’ll need to get a list.

From the packages directory, in PowerShell

git log --diff-filter=A --summary . `
    |? { $_ -ne $null -and $_ -match 'create mode \d+ (.*)?(/lib/.*)' } `
    |% { $($matches[1]) } `
    | select -unique | sort `
    |% { "git filter-branch -f --index-filter ""git rm -rf --cached --ignore-unmatch $_"" HEAD" }

Breakdown

  1. The first line dumps all of the adds that happened in the current directory (packages)
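  2. The second line drops nulls and keeps only the lines that match the regex for created files under a package’s lib folder, capturing the package path as a group.
  3. The third line pulls out the matched path
  4. The fourth line dedupes the paths, and sorts them just for convenience. Each package shows up many times (dlls, pdbs, the nupkgs themselves, who knows what else might have been added), but our filter-branch + rm above is going -rf on the whole directory anyway.
  5. The fifth line writes the command we want to execute to the output stream. This doesn’t execute it, it just outputs what you want to execute. The captured paths are relative to the repo root, so that is where the generated commands should be run.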
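For anyone not on PowerShell, roughly the same pipeline can be sketched in POSIX shell. This is a translation under assumptions, not what we actually ran: emit_filter_commands is a made-up function name, it expects the output of git log --diff-filter=A --summary on stdin, and it assumes package paths contain no spaces or quotes.

```shell
#!/bin/sh
# Reads `git log --diff-filter=A --summary` output on stdin and prints one
# filter-branch command per directory that ever had a /lib/ folder added.
# (emit_filter_commands is a hypothetical helper name.)
emit_filter_commands() {
  # Keep only "create mode" lines and capture everything before /lib/,
  # mirroring the PowerShell regex; then dedupe and sort.
  sed -n -E 's#^ *create mode [0-9]+ (.*)/lib/.*$#\1#p' \
    | LC_ALL=C sort -u \
    | while IFS= read -r dir; do
        # Print the command instead of running it, so it can be reviewed.
        printf 'git filter-branch -f --index-filter "git rm -rf --cached --ignore-unmatch %s" HEAD\n' "$dir"
      done
}
```

Usage would look like git log --diff-filter=A --summary . | emit_filter_commands, and as with the PowerShell version, nothing is executed until you copy the output and run it yourself.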

The output looks something like this

git filter-branch -f --index-filter "git rm -rf --cached --ignore-unmatch packages/bar" HEAD
git filter-branch -f --index-filter "git rm -rf --cached --ignore-unmatch packages/foo" HEAD

You could go further: instead of wrapping the command in quotes and dumping it to the output stream, you could just execute it directly. But then you wouldn’t have accounted for the private packages, so you’d want to have some kind of $safePackages = ("x","y","z") and add a |? to strip them. Your choice, but until we’ve run this through its paces, I kind of like the safety of having the chance to review what’s going to be done.
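If you do go the safe-list route, the filter itself is small. Here is a rough shell sketch of the idea, assuming the candidate package directories arrive one per line on stdin. SAFE_PACKAGES and strip_safe_packages are hypothetical names, and packages/x, packages/y, packages/z stand in for whatever your private packages are really called.

```shell
#!/bin/sh
# Hypothetical list of private package directories that must stay in history.
# Space separated, so this sketch assumes the paths contain no spaces.
SAFE_PACKAGES="packages/x packages/y packages/z"

# Reads candidate package directories on stdin, one per line, and prints
# only those that are not on the safe list.
strip_safe_packages() {
  while IFS= read -r dir; do
    skip=no
    for safe in $SAFE_PACKAGES; do
      if [ "$dir" = "$safe" ]; then
        skip=yes
      fi
    done
    if [ "$skip" = no ]; then
      printf '%s\n' "$dir"
    fi
  done
}
```

This slots in before the step that turns paths into filter-branch commands, so the private packages never get a command generated for them in the first place.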

The Final Stroke, And Why We -f It

Each filter-branch works in a temporary .git-rewrite directory and saves a backup of the original refs under refs/original/. While that backup exists, the next filter-branch will refuse to run and ask you to clean it up. We are going to be doing this a whole bunch of times and then clean it up after all of them are done, so we just -f through it.

Then, at the end, you do want to clean it up and run a gc. How?

rm -rf .git/refs/original/ && git reflog expire --expire=now --all && git gc --aggressive --prune=now
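It is worth checking that the cleanup really throws the binaries away, rather than just hiding them. The sketch below does that on a throwaway repo with made-up file names: it records the blob id of a fake dll, rewrites history, and asks Git whether the object still exists before and after cleanup. Two things differ from the one-liner on purpose, and both are borrowed from the checklist in the git filter-branch documentation: the backup refs are deleted with git update-ref rather than rm, which also covers packed refs, and the expiry flags are spelled --expire=now and --prune=now so that nothing is kept around for a grace period.

```shell
#!/bin/sh
set -eu

# Throwaway repo in a temp dir (names are made up for the demo).
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

mkdir -p packages/foo/lib
echo "pretend this is a dll" > packages/foo/lib/foo.dll
echo "code" > app.cs
git add .
git commit -q -m "Add code and a package"

# Remember the object id of the binary so we can look for it later.
blob=$(git rev-parse HEAD:packages/foo/lib/foo.dll)

export FILTER_BRANCH_SQUELCH_WARNING=1
git filter-branch -f --index-filter \
  "git rm -rf --cached --ignore-unmatch packages/foo" HEAD >/dev/null 2>&1

# After the rewrite alone, the backup ref and reflog still keep the blob alive.
if git cat-file -e "$blob" 2>/dev/null; then before=present; else before=gone; fi

# Cleanup: drop the backup refs, expire every reflog entry, prune right now.
git for-each-ref --format='%(refname)' refs/original/ | xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc --quiet --aggressive --prune=now

if git cat-file -e "$blob" 2>/dev/null; then after=present; else after=gone; fi

echo "blob before cleanup: $before"
echo "blob after cleanup:  $after"
```

On a real repo, running git count-objects -vH before and after the cleanup gives a quick view of how much space was reclaimed.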

Congrats! Now when you push your fancy, newly migrated Git repo, it’ll be a lot lighter without all those unnecessary binaries. You could certainly consider applying this procedure to all kinds of other directories where you’ve been putting garbage into source control.

* Hat tip to an old post by David Underhill that got us started.

** More thanks to Josh for responding quite quickly to my call for his favorite way to find subdirectories that used to exist but got deleted. I ended up going a different way, and just looking for everything that was ever added, whether it exists or not. And using PowerShell. He says he’s considering blogging his solution soon!!
