In a previous article, I discussed how we converted a large SVN repository to Git. This article explores our next project—cleaning up our new Git repository.
We got to a usable state after the transition to Git, but we can do better. Now, we will clean up the repository by rewriting all of the commit messages, and removing a lot of extra junk.
Given how long many of these operations take—especially the just-completed SubGit conversion—and the number of iterations I went through to make each piece perfect, I strongly advise making a copy of the repository after each major step. Feed the new copy into the next step. That lets you rerun the later steps without worrying about corrupting previous work.
To help with that and to keep your sanity intact, you’ll almost certainly want to combine the various commands presented here into a script personalized for your situation. I ran through various parts of the process a hundred times and was extremely grateful for a consistent script.
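As a starting point for such a script, here is a minimal sketch of a helper that copies the repository into a fresh folder before each step. The WORK_DIR location and the step names are hypothetical; pick whatever suits your setup.

```shell
# Where snapshots are kept. Hypothetical default; override before calling.
WORK_DIR="${WORK_DIR:-/tmp/repo-cleanup}"

# snapshot SRC STEP_NAME
# Copies the repository at SRC to $WORK_DIR/STEP_NAME, replacing any earlier
# copy with that name, and prints the path of the new copy.
snapshot() {
    snap_src=$1
    snap_dest="$WORK_DIR/$2"
    rm -rf "$snap_dest"
    mkdir -p "$WORK_DIR"
    cp -R "$snap_src" "$snap_dest"
    printf '%s\n' "$snap_dest"
}

# Example flow (commented out): every step works on its own copy.
#   repo=$(snapshot /path/to/converted-repo 01-converted)
#   ... run step 1 against "$repo" ...
#   repo=$(snapshot "$repo" 02-messages-rewritten)
```

If a later step goes wrong, you can delete its folder and start again from the previous snapshot instead of repeating the whole conversion.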
Let’s review our requirements from last time.
Our project requirements
- Convert an SVN repo to a Git repo, maintaining history for git blame.
- Minimize repo size as much as possible. Target is 2 GB.
- Developers need to be able to keep working in SVN until we’re ready to switch.
- Reformat commit messages to remove obnoxious template and to include the original SVN revision number.
Our tools
- BFG Repo-Cleaner: Cleans out bad files and commits from your history much more efficiently than Git’s built-in filter-branch command.
- Incremental-filter-branch: Another tool for efficiently filtering Git commits.
- Git LFS: Git Large File Storage, a Git add-on, for storing large files outside of the repository.
- JFrog Artifactory: Storage for binaries. We’ll use it as our LFS store, although there are other options available.
Step 1: Rewrite Commit Messages
Our SVN repository required that commit messages follow a bulky XML-esque form, representing noble intentions gone wrong. I knew that I wanted to reformat the messages in Git to be cleaner and easier to read. The incremental-filter-branch project provides a relatively quick way to do this:
incremental-git-filterbranch --no-lock -- $REPO_PATH "--msg-filter \"$PATH_TO_SOME_SCRIPT\"" $NEW_REPO_PATH
$PATH_TO_SOME_SCRIPT is executed once per commit, receiving the current commit message on STDIN and expecting the modified message on STDOUT. I wrote a Ruby script to parse our template and rework the message into something more palatable. Feel free to use your language of choice. As the “incremental” portion of the name implies, if you rerun this command against the same repository, it will only process commits added since the previous run.
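To illustrate the STDIN-to-STDOUT contract, here is a small sketch of a message filter written in shell. The tag-based template it strips is invented for the example, since our real template isn't reproduced here, and it does not attempt to add the SVN revision number.

```shell
# clean_message
# Reads a commit message on stdin and writes a cleaned version to stdout.
# ASSUMPTION: the template wraps text in XML-like tags such as
# <summary>...</summary>; adjust the patterns for your own template.
clean_message() {
    # Remove anything that looks like a tag, then trim each line.
    sed -e 's/<[^>]*>//g' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' |
    # Drop empty lines, then put one blank line between the first line
    # (the subject) and the rest (the body), as Git conventions expect.
    awk 'NF { if (n == 1) print ""; print; n++ }'
}

# When saved as the filter script, end the file with a call to the function:
#   clean_message
```

Because the filter only touches stdin and stdout, you can try it on a handful of sample messages from the command line before running it across the whole history.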
Step 2: Remove Large Objects
(Note: In retrospect, I might have skipped this step in favor of just including everything in Git LFS. Depending on your situation, you might still prefer to get rid of old, large files, particularly if some are egregiously large.)
Our repository contained a lot of very large files. I was able to strip out many of them with SubGit’s configuration, but that’s an all-or-nothing process. I want to delete all large files (let’s say over 1 MB) that are no longer in use on the head of main or a few other important branches. This gives a compromise between repo size and commit integrity that was acceptable for our situation, and if we really need a deleted file, we could boot the SVN server. So far, this hasn’t happened.
There’s a great tool called BFG Repo-Cleaner that provides many options for selectively deleting files from the repository. If main is the only branch you care about preserving, you can call BFG directly and it should just work. However, because there are multiple branches we need to keep healthy, our situation requires a little more work. We need to generate a list of all of the objects that are currently in use on main or any other important branch.
git ls-tree -r main | cut -f 1 | cut -d ' ' -f 3 >> /tmp/active-objects.list
Repeat the above line, replacing main with the name of each of your important branches. git ls-tree lists information on every object currently used by that branch. The cut calls trim those lines down to just the object ID, and >> appends the IDs to the file.
We ought to sort and remove duplicates from our list:
sort /tmp/active-objects.list | uniq > /tmp/active-objects-sorted.list
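If you have more than a couple of branches to protect, the two commands above can be folded into one helper. This is a sketch that assumes it runs from inside the repository; the function name and the branch names in the usage comment are placeholders.

```shell
# collect_active_blobs OUTFILE BRANCH...
# Writes the sorted, de-duplicated object IDs found at the tip of each
# listed branch to OUTFILE.
collect_active_blobs() {
    active_out=$1
    shift
    for branch in "$@"; do
        # Same pipeline as above: keep only the object ID column.
        git ls-tree -r "$branch" | cut -f 1 | cut -d ' ' -f 3
    done | sort -u > "$active_out"
}

# Example (branch names are hypothetical):
#   collect_active_blobs /tmp/active-objects-sorted.list main release-1.0
```

Writing the file in one pass also avoids stale entries left behind by an earlier run, which the >> append would otherwise accumulate.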
Now that we have our list of active objects, we can use some more command-line magic to generate a list of all objects that are larger than 1 MB and are not currently in use on our important branches. Thanks to Stack Overflow contributors for figuring this one out:
comm -23 \
<(git rev-list --objects --all | git cat-file --batch-check="%(objecttype) %(objectname) %(objectsize) %(rest)" | grep ^blob | awk '$3 > 1024 * 1024 { print $2 }' | sort) \
/tmp/active-objects-sorted.list \
> /tmp/large-blobs.list
Great: /tmp/large-blobs.list now holds the IDs of the offending blobs. Let’s nuke them with the BFG. From the root folder of our latest repo copy:
java -jar ~/bfg.jar --private -bi /tmp/large-blobs.list
Alright! Our repo should be looking much leaner now. If you check out an old commit, it might be missing some of its large files wherever the old version differs from the one on the branch head. But that was kind of the point here!
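One way to sanity-check the result is to count how many of the listed blob IDs are still reachable from any ref. The helper below is a sketch with a made-up name; it assumes the list file is sorted, which is true of the comm output used here.

```shell
# count_reachable_listed_blobs LISTFILE
# Prints how many object IDs from the sorted LISTFILE are still reachable
# from any ref in the current repository.
count_reachable_listed_blobs() {
    blob_list=$1
    git rev-list --objects --all |
        cut -d ' ' -f 1 |
        sort -u |
        comm -12 - "$blob_list" |
        wc -l | tr -d ' '
}

# Example:
#   count_reachable_listed_blobs /tmp/large-blobs.list
```

A result of 0 means no commit refers to those blobs any more. Anything higher suggests a ref or protected commit is still holding on to some of them.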
Step 3: Convert to LFS
Next, we’re going to hook up Git LFS. We’ve removed many of the large objects from our repository, but there are still plenty left. High-resolution PNGs, the occasional committed binary, and some large test files can all add up to significant disk space. Git LFS works by moving the actual file to an external storage solution, and replacing it in the repository with a pointer to its new location. This keeps the size of the repository down so your initial git clone of the repository is fast, and you only need to download the full file when it’s needed for the commit you want to check out.
You’ll want to create a new repository in Artifactory (or your binary storage solution of choice), give out appropriate permissions, and then tell Git to use it. There are two ways to do this. You can modify your local Git settings via:
git config lfs.url ssh://git@artifactory.company.com/artifactory/myrepo-lfs
Or, you can create a file named .lfsconfig in the root of your repository and put a similar entry in it:
[lfs]
url = ssh://git@artifactory.company.com/artifactory/myrepo-lfs
There are pros and cons to both options. If you use the .lfsconfig file, everyone gets the correct settings just by checking out the branch. However, unless you rewrite history to include that file in your old commits, anyone who checks out an older commit will have broken LFS URLs and need to remember to run the git config command.
Since it’s rare for us to go back in time, I went with the .lfsconfig approach, and committed it to all active branches. In retrospect, I should have rewritten the repository’s initial commit to include .lfsconfig. This would have been a more thorough solution that avoids the occasional confused developer with missing LFS files.
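Committing the file to several branches by hand gets tedious, so here is a sketch of a loop that does it. The function name and commit message are placeholders, and it assumes a clean working tree. It writes the file with git config -f, which produces the same [lfs] section shown above.

```shell
# commit_lfsconfig URL BRANCH...
# Checks out each branch in turn, writes lfs.url into .lfsconfig, and
# commits the file if that changed anything.
commit_lfsconfig() {
    lfs_url=$1
    shift
    for branch in "$@"; do
        git checkout -q "$branch" || return 1
        git config -f .lfsconfig lfs.url "$lfs_url"
        git add .lfsconfig
        # Skip the commit when the branch already has an identical file.
        git diff --cached --quiet ||
            git commit -q -m "Add .lfsconfig" ||
            return 1
    done
}

# Example (URL from above, branch names hypothetical):
#   commit_lfsconfig ssh://git@artifactory.company.com/artifactory/myrepo-lfs main release-1.0
```

The helper leaves you on the last branch in the list, so check out your usual branch again afterwards.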
Next, we need to populate the .gitattributes file with the list of file types we want to store in LFS. The contents will look something like the following, with one entry for each file pattern you want LFS to handle:
*.exe filter=lfs diff=lfs merge=lfs -text
*.dll filter=lfs diff=lfs merge=lfs -text
In this case, we’re moving all exe and dll files over to LFS. The filter, diff, and merge attributes tell Git to use the special lfs versions of those operations to handle the files appropriately. -text tells Git not to treat the file as text, so it skips line-ending conversion.
As with .lfsconfig, you’ll need to commit .gitattributes to your repository, and you’ll probably want to include it in the initial commit unless you aren’t worried about people checking out pre-conversion commits.
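Before committing, it can be worth confirming that the patterns match the paths you expect. The sketch below uses git check-attr, which reports the attributes Git would apply to a path, whether or not the file exists. The function name and sample paths are placeholders.

```shell
# lfs_filter_for PATH
# Prints the value of the filter attribute for PATH: "lfs" when a
# .gitattributes pattern routes it through LFS, "unspecified" otherwise.
lfs_filter_for() {
    git check-attr filter -- "$1" | sed 's/.*: filter: //'
}

# Example (paths are hypothetical):
#   lfs_filter_for tools/setup.exe
#   lfs_filter_for README.md
```

Note that a pattern without a slash, such as *.exe, matches files in any directory, so there is no need to list subfolders separately.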
As an aside: while you’re poking around Git LFS things, you might bump into smudge and clean, which is how Git refers to the filters it runs when checking a file out into the working tree and when staging a file for commit, respectively. LFS uses smudge to replace the “pointer” file with the actual contents from the LFS store, while clean takes the new version of an LFS-aware file, saves its contents in the local LFS cache, and replaces the file in the repository with the new pointer. The cached contents are uploaded to the LFS store when you push.
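To make the “pointer” idea concrete, here is a sketch that prints the pointer text for a given file: a version line, the SHA-256 of the contents, and the size in bytes. It is for illustration only, since Git LFS generates pointers itself. It assumes the sha256sum utility is available (on macOS, shasum -a 256 is the usual substitute).

```shell
# lfs_pointer FILE
# Prints the three-line pointer that would stand in for FILE.
# ASSUMPTION: sha256sum is installed.
lfs_pointer() {
    ptr_oid=$(sha256sum "$1" | cut -d ' ' -f 1)
    ptr_size=$(wc -c < "$1" | tr -d ' ')
    printf 'version https://git-lfs.github.com/spec/v1\n'
    printf 'oid sha256:%s\n' "$ptr_oid"
    printf 'size %s\n' "$ptr_size"
}

# Example (file name is hypothetical):
#   lfs_pointer build/installer.exe
```

A pointer like this is what actually lives in the Git history, which is why even a multi-gigabyte binary adds only a few lines to the repository.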
Well, we’ve finished our configuration so we can run the conversion.
git lfs migrate import --everything --include="*.exe,*.dll"
That’ll run for a while and convert your designated large objects into LFS pointers, keeping the real contents in the local LFS cache until you push. The --everything flag tells it to convert all reachable commits, but you can be more selective if you’d like. Note that .gitattributes tells Git how to handle these types of files going forward, but isn’t used by the conversion process, so we still have to specify the file types with --include.
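If you aren't sure which file types belong in --include, a rough tally of blob sizes by extension across the whole history can help. The sketch below uses only plain Git and awk. The function name is a placeholder, the sizes are uncompressed, and it assumes paths without embedded newlines.

```shell
# size_by_extension
# Prints total blob bytes per file extension over all history, largest first.
size_by_extension() {
    git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" {
            path = $0
            sub(/^[^ ]+ [^ ]+ /, "", path)   # strip type and size, keep path
            n = split(path, parts, "/")
            name = parts[n]
            ext = "(none)"
            # Use the last dot-suffix, ignoring dotfiles such as .gitattributes.
            if (match(name, /\.[^.]+$/) && RSTART > 1) ext = substr(name, RSTART)
            total[ext] += $2
         }
         END { for (e in total) printf "%.0f %s\n", total[e], e }' |
    sort -rn
}

# Example: show the ten heaviest extensions.
#   size_by_extension | head -n 10
```

The extensions at the top of that list are the natural candidates for both the --include argument and the .gitattributes entries.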
When it’s done, you might consider running:
git reflog expire --expire-unreachable=now --all
git gc --prune=now
This tells Git to run garbage collection and get rid of the old unnecessary copies of your files, keeping only the LFS version. If you check the size of your .git folder now, it should be about the same size as it was before the LFS conversion, but your large files will now be in the lfs subfolder, which is not part of the repository proper.
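To see that split for yourself, you can compare the size of Git's packed objects with the size of the local LFS cache. This is a small sketch with a made-up function name; both figures are in KiB.

```shell
# repo_sizes
# Prints the size of packed Git objects and of the local LFS cache (KiB)
# for the repository in the current directory.
repo_sizes() {
    sizes_git_dir=$(git rev-parse --git-dir)
    pack_kib=$(git count-objects -v | awk '$1 == "size-pack:" { print $2 }')
    if [ -d "$sizes_git_dir/lfs" ]; then
        lfs_kib=$(du -sk "$sizes_git_dir/lfs" | cut -f 1)
    else
        lfs_kib=0
    fi
    printf 'packed objects: %s KiB\n' "$pack_kib"
    printf 'lfs cache: %s KiB\n' "$lfs_kib"
}
```

The packed-objects figure is the part that a fresh clone has to download in full, so that is the number to compare against your size target.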
Step 4: Share
That’s it! We’ve cleaned up our commit messages, removed a ton of unneeded data, and relocated the remaining big stuff to a better home. All that’s left is to share our work with the team:
git remote set-url origin ssh://git@git.mycompany.com/myrepo.git
git push --all
git push --tags
We took our massive 135 GB Subversion repository and mashed it down to a much more manageable 2 GB Git repository, thanks to careful SubGit conversion, BFG Repo-Cleaner, and Git LFS. Good luck with your own repository-management endeavors!