Rewriting git history

I recently had to revive a git repository by removing some accidental commits of large binary files. Overnight, a fresh clone of the repository grew from a few hundred MB to several GB and this had a particularly nasty effect on our CI system since it stores a fresh clone for every build. In the process, I got a much better understanding of how git works.

Published history

Since a git clone contains a full copy of the entire history of the repository, even if the large files have been deleted in a subsequent commit they are still lurking in the history, taking up space. So these commits need to be removed from the history. When figuring out how to do this I came across Seth Robertson’s git choose your own adventure, which led me to the conclusion, ‘I am a bad person and must rewrite published history’. The word published is important here: rewriting history to remove bad commits or for other reasons is fine if you have never pushed that history. But after pushing, someone else may have pulled those changes, and that means they could get pushed back into the shared repository and (in our case) reintroduce the problematic large binary commits.

I have spoken to teams who routinely rewrite published history on feature branches in the name of producing a cleaner commit history, using push --force to change the published history. But this is a risky habit, and it’s safer to prevent force pushes with a git hook.
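As a rough illustration of what such a hook could look like, here is a minimal sketch of a server-side pre-receive hook. It assumes you can install hooks on the server and that the repository uses SHA-1 object names (40 hex digits); the function name is made up for this example.

```shell
#!/bin/sh
# Sketch of a pre-receive hook that rejects non-fast-forward updates,
# i.e. force pushes. Git feeds the hook one line per updated ref:
#   <old-sha> <new-sha> <ref-name>
# Assumption: SHA-1 object names, so an absent ref is 40 zeros.
zero=0000000000000000000000000000000000000000

reject_forced_updates() {
  while read -r old new ref; do
    # Creating or deleting a ref is not treated as a rewrite here.
    if [ "$old" = "$zero" ] || [ "$new" = "$zero" ]; then
      continue
    fi
    # A fast-forward means the old tip is an ancestor of the new tip.
    if ! git merge-base --is-ancestor "$old" "$new"; then
      echo "rejected: non-fast-forward update of $ref" >&2
      return 1
    fi
  done
  return 0
}

# Installed as hooks/pre-receive, the body of the hook would be:
#   reject_forced_updates || exit 1
```

If all you need is a blanket ban, git also has a built-in server-side setting, receive.denyNonFastForwards, which does the same job without a custom hook.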

Basic git internals

After reading the excellent Pro Git some time ago, I had a basic understanding of how git worked internally, but my recent experience really drove home two points:

  1. Every commit in git is an independent representation of the state of all files in your working directory at that point in time. Git does not store diffs, and whenever they are needed they are calculated on the fly. To avoid storing a full copy of every file in every commit, it instead uses hashes to reference a single compressed copy of a file’s contents where it is the same for multiple commits [git-internals].
  2. Commits are immutable, and can only be destroyed if they are no longer referenced. The hash for a commit is calculated using both the contents of that commit and the hashes of all parent commits. While most commits have a single parent, a merge commit has more than one parent, and at the point a branch is created a single commit will have multiple children. This means that any change to any commit would invalidate all descendants, so git doesn’t give you tools to do that; instead it allows you to create equivalent commits using rebase.
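
Both points can be seen in a throwaway repository. The sketch below is only an illustration: the file names and identity values are invented, and it runs in a fresh temporary directory rather than in any existing clone.

```shell
# Throwaway repository; nothing here touches an existing clone.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
git init -q "$repo"

echo "unchanged" > "$repo/stable.txt"
echo "v1" > "$repo/changing.txt"
git -C "$repo" add .
git -C "$repo" commit -q -m "first"

echo "v2" > "$repo/changing.txt"
git -C "$repo" commit -q -am "second"

# A commit records a full tree (snapshot) plus the hash of its parent.
git -C "$repo" cat-file -p HEAD

# An unchanged file is stored once: both snapshots name the same blob.
blob_first=$(git -C "$repo" rev-parse HEAD~1:stable.txt)
blob_second=$(git -C "$repo" rev-parse HEAD:stable.txt)
echo "stable.txt: $blob_first $blob_second"
```

The cat-file output contains a tree line and a parent line, which is why changing any ancestor changes the hash of every commit after it.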

Commits reference each other

The most common type of reference to a commit is from its direct descendant commit(s). For a problematic commit to become unreferenced, all of its downstream commits must also be unreferenced.

Branches reference commits

Other than the references from other commits, the most obvious kind of commit reference is a branch pointer. This is, in fact, all that a branch is in git: a reference to a particular commit, along with some workflow logic which means that when new commits are added to that branch, the reference is updated to point to the new HEAD commit [git book]. In our case, the changes were on the master branch, so we needed to move the master branch pointer to reference a commit with clean lineage, excluding the problematic commits.
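
To make the "a branch is just a pointer" idea concrete, here is a small sketch in a throwaway repository. The branch name feature and the identity values are invented for the example.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" commit -q --allow-empty -m "A"
git -C "$repo" commit -q --allow-empty -m "B"
commit_a=$(git -C "$repo" rev-parse HEAD~1)
commit_b=$(git -C "$repo" rev-parse HEAD)

# Creating a branch only writes a new reference; no commits are copied.
git -C "$repo" branch feature "$commit_a"
feature_tip=$(git -C "$repo" rev-parse refs/heads/feature)

# Moving the pointer is just as cheap: point 'feature' at B instead.
git -C "$repo" branch -f feature "$commit_b"
moved_tip=$(git -C "$repo" rev-parse refs/heads/feature)

echo "feature was $feature_tip, now $moved_tip"
```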

Interactive rebase master

One tool for rewriting history in git is the interactive rebase: git rebase -i. This allows the state of each commit in the history to be changed, with the option to leave it intact, modify or remove it entirely. We wanted to remove some commits and leave all others intact. It’s important to understand that what this actually does is to create a new commit corresponding to each commit we wish to keep by calculating and applying diffs.

Simplifying slightly, we started with:

  • Some commits on master that should not have been there, adding and removing large binary files (C and D below)
  • A feature-1 branch from a point before the bad commits
  • A feature-2 branch from a point after the bad commits

[Figure 1: Starting point, with some bad commits and branches from points both before and after the bad commits.]

Before performing the interactive rebase on master, we create a reference to the current master, since we’ll need that in later steps:

git branch old-master master

Then we perform the interactive rebase to create new commits corresponding to each of the commits we want to keep on master.

# Hash identifies last good commit (B in diagram)
git rebase -i 2e03b96d

Git performs the rebase by calculating the diffs from D to F and F to H and applying these onto B, creating new commits F’ and H’, and pointing the master branch pointer at the new HEAD.
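
Interactive rebase needs an editor, so it is awkward to show in a script. As a non-interactive approximation (not the exact command we ran), the sketch below rebuilds the same shape of history in a throwaway repository and drops C and D with git rebase --onto. The commit letters follow the diagram; the file names and helper function are invented.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1
git symbolic-ref HEAD refs/heads/master

add_commit() {   # add_commit <label>: commit a small file named after the label
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}

add_commit A
add_commit B
last_good=$(git rev-parse HEAD)
echo "pretend this is huge" > big.bin
git add big.bin
git commit -q -m "C"          # bad: adds the binary
git rm -q big.bin
git commit -q -m "D"          # bad: removes it again
last_bad=$(git rev-parse HEAD)
add_commit F
add_commit H

git branch old-master master
# Replay the commits after D (that is, F and H) onto B.
git rebase -q --onto "$last_good" "$last_bad" master

git log --format=%s master        # H, F, B, A (newest first)
git log --format=%s old-master    # H, F, D, C, B, A
```

The F and H on master are new commits with new hashes; the originals are still reachable from old-master.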

[Figure 2: After interactive rebase, master points to a clean lineage of commits.]

Rebase problematic branches

As well as the master branch, we also need to rebase any other branches which indirectly reference the problematic commits. We ensure that all remote branches have corresponding local tracking branches, since we’ll need to operate across all of them:

# List all remote branches | exclude some special cases
#   | strip the leading "origin/" from each name
#   | print 'git branch --track <branch-name> origin/<branch-name>' for each
#   | execute that command for each
git branch -r | grep -v -e HEAD -e master | sed -e 's/^ *origin\///' \
  | awk '{ print "git branch --track " $1 " origin/" $1 }' \
  | bash -f
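
An alternative that avoids parsing the human-readable output of git branch -r is to ask git for the ref names directly. This is a sketch rather than what we actually ran: the function names are invented, it assumes the remote is called origin, and the lstrip format option needs a reasonably recent git.

```shell
# Print the short names of branches on origin, excluding HEAD and master.
list_remote_only_branches() {
  git for-each-ref --format='%(refname:lstrip=3)' refs/remotes/origin |
    grep -v -x -e HEAD -e master
}

# Create a local tracking branch for each one that does not exist yet.
track_remote_branches() {
  list_remote_only_branches | while read -r branch; do
    git show-ref --verify --quiet "refs/heads/$branch" ||
      git branch -q --track "$branch" "origin/$branch"
  done
}
```

Run track_remote_branches from inside the clone. Unlike the awk version, this also copes with branch names containing slashes and skips branches that already exist locally.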

We find the affected branches using a combination of git rev-list and git branch --contains, and rebase each onto the new master based on their differences relative to old-master.

# List all commit hashes between divergence point and old-master (HEAD)
#   | print "git branch --contains <hash>" for each
#   | execute that command
#   | sort, only keeping unique results
#   | exclude old-master itself
#   | print "git rebase --onto master old-master <branch-name>" for each
#   | execute that command for each
git rev-list 2e03b96d..old-master \
  | awk '{ print "git branch --contains " $1 }' | bash -f | sort -u \
  | grep -v old-master \
  | awk '{ print "git rebase --onto master old-master " $1 }' | bash -f

After this, the affected branches have been transplanted to the HEAD of the new master branch by creating new corresponding commits I’ and J’. The old commits I and J are now eligible for garbage collection, an internal git process which usually runs automatically but can also be triggered manually, as we will see later.

[Figure 3: After rebasing the problematic feature-2 branch onto the new master, it no longer has bad commits in its history.]

We have now removed all the branch references that were preventing the problematic commits from being purged from the repository via garbage collection, but we still have some other references to consider.
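
A quick way to check what is still holding on to a bad commit is to ask git which refs contain it. The helper below is a sketch with an invented name; it looks at local branches and tags only, and you could add refs/remotes to the patterns to include remote-tracking branches.

```shell
# Print every local branch and tag that still has the given commit
# somewhere in its history. Prints nothing when no such ref remains.
refs_still_containing() {
  git for-each-ref --format='%(refname)' --contains "$1" refs/heads refs/tags
}
```

For example, refs_still_containing followed by the hash of a bad commit lists old-master until that branch is deleted, plus any tags on the old lineage.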

Tags hold references

Like a branch, a tag is just a reference to a commit, which means we also need to remove any tags referencing problematic commits or any downstream commits. If we want to keep any particular tags (for example, because they identify a candidate release which was deployed to production) then we need to move them across to the corresponding new commit manually by deleting the old tag and creating a new one. Then we can remove any remaining tags on the old-master branch, both locally and on the remote:

# Show all commits on old-master but not master
#   | grab all the tag names
#   | print "git tag -d <tag-name> && git push --delete origin <tag-name>" for each
#   | execute that command for each
git log --decorate=short --simplify-by-decoration --pretty=oneline master..old-master \
  | grep -oh 'tag: v[0-9]*\.[0-9]*\.[0-9]*' \
  | awk '{ print "git tag -d " $2 " && git push --delete origin " $2 }' \
  | bash -f
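
For the tags we wanted to keep, the manual move mentioned above amounts to a delete followed by a re-create. The sketch below uses an invented helper name and assumes lightweight tags; an annotated tag would need to be re-created with its message.

```shell
# Move an existing local tag to a different commit.
# Arguments: <tag-name> <new-commit>
move_tag() {
  git tag -d "$1" >/dev/null &&
    git tag "$1" "$2"
}

# To publish the moved tag, the remote copy has to be replaced as well,
# for example:
#   git push origin ":refs/tags/<tag-name>"    # delete the old remote tag
#   git push origin "refs/tags/<tag-name>"     # push the re-created tag
```

Identifying which new commit corresponds to the old tagged commit is still a manual step, typically by matching commit messages between old-master and master.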

We have now finished with the old-master branch, so we remove it:

git branch -D old-master

And we now have:

[Figure 4: Tag of interest moved to new master; old-master removed. Commits now eligible for garbage collection indicated.]

Garbage collection

We now need to publish our rewritten history using push --force. If we’re following the good practice of forbidding this then we need to temporarily disable the git hook which prevents it.

# Set a sane default branch matching policy
git config --global push.default matching

# Push changes to branches and new tags:
git push --force --all
git push --force --tags

# Bring our local record of remote branches up to date
git fetch

We can now force a garbage collection to remove the unreferenced commits from storage. The slight trick to this is that we have to tackle the last set of references which would otherwise keep commits alive: the git reflog, which records recent changes to where references such as branches and HEAD point. These reflog entries themselves count as references, so before forcing a garbage collection, we need to explicitly expire them:

git reflog expire --expire=now --all

I also found that I had to limit the memory that git will use to avoid it trying to allocate every last byte on my machine:

git config --global pack.windowMemory "1024m" && git config --global pack.packSizeLimit "1024m"

And finally we reclaim our wasted disk space by forcing the garbage collection:

git gc --prune=now && git repack -a -d -l

[Figure 5: Final state after removing all trace of the problematic commits.]
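
To convince yourself that the objects really are gone, you can check for one of them by hash before and after the collection. The sketch below reproduces the situation in miniature in a throwaway repository; the branch name, file name and variable names are invented for the example.

```shell
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1
git symbolic-ref HEAD refs/heads/master
git commit -q --allow-empty -m "good"

# Commit a stand-in for the large binary on a branch, then delete the branch.
git checkout -q -b bad
echo "pretend this is a large binary" > big.bin
git add big.bin
git commit -q -m "bad"
bad_blob=$(git rev-parse HEAD:big.bin)
git checkout -q master
git branch -q -D bad

# No branch or tag references the blob now, but HEAD's reflog still does.
if git cat-file -e "$bad_blob" 2>/dev/null; then before=present; else before=gone; fi

git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$bad_blob" 2>/dev/null; then after=present; else after=gone; fi
echo "before gc: $before, after gc: $after"

# Summary of what is left in the object store.
git count-objects -vH
```

On a real repository, comparing the size-pack figure from git count-objects -vH before and after gives a direct measure of the space reclaimed.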

Human considerations

I mentioned above that bad commits are hard to destroy once published, because if they have been pulled by other users then a later push from any of them can push the commits back in. To reduce the risk of this, we took a couple of precautions:

  1. Before starting, we asked all developers to push all changes and delete their local clones of the repository (and watched them do it, to be sure).
  2. Rather than working on a direct clone of the problematic repository, we used a fork of the problematic repository within our managed git system and removed permissions on the original repository. This can’t completely prevent any remaining clones from being repointed at the new fork and causing havoc, but it makes that an act of deliberate sabotage instead of casual error.