Welcome to the third installment of our Git in 2016 retrospective! In part two, we looked at improvements made to Git's diff subsystem throughout the year. In part three, we'll look at the co-evolution of Git's smudge and clean filter system and Git LFS (Large File Storage), Git's companion project for tracking large binary content.
A Primer on Git LFS
Git is a distributed version control system, meaning the entire history of the repository is transferred to the client during the cloning process. For projects that contain large files — particularly large files that are modified regularly — the initial clone can be expensive, as every version of every file has to be downloaded by the client. Git LFS is a Git extension developed by Atlassian, GitHub, and a few other open source contributors that reduces the impact of large files in your repository by downloading the relevant versions of them lazily. Specifically, large files are downloaded as needed during the checkout process rather than during cloning or fetching.
Alongside Git's five huge releases in 2016, Git LFS had four feature-packed releases of its own: v1.2 through v1.5. You could write a retrospective series on Git LFS in its own right, but for this article, I'm going to focus on one of the most important themes tackled in 2016: speed. A series of improvements to both Git and Git LFS has greatly improved the performance of transferring files to and from the server:
In April, Git LFS 1.2 shipped with the git lfs clone command, which greatly speeds up the initial clone of your repository.
July's Git LFS 1.3 release implemented support for custom transfer adapters, allowing Git LFS providers such as Bitbucket to implement optimized protocols for transferring large objects to and from storage.
Finally, Git 2.11 released support for long-lived filter processes, an implementation of which shipped with Git LFS 1.5 to drastically speed up git add and git checkout commands within Git LFS repositories.
Long-Running Filter Processes
When you git add a file, Git's system of clean filters can be used to transform the file’s contents before they are written to the Git object store. Git LFS reduces your repository size by using a clean filter to squirrel away large file content in the LFS cache, adding a tiny “pointer” file to the Git object store instead.
Smudge filters are the opposite of clean filters, hence the name. When file content is read from the Git object store during a git checkout, smudge filters have a chance to transform it before it’s written to the user’s working copy. The Git LFS smudge filter transforms pointer files by replacing them with the corresponding large file, either from your LFS cache or by reading through to your Git LFS store on Bitbucket.
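To make the clean/smudge round trip concrete, here is a minimal Python sketch. The pointer layout (version, oid, and size lines) follows the Git LFS pointer file format as I understand it. The in-memory `lfs_cache` dict is a hypothetical stand-in for the on-disk LFS cache, and a real smudge filter would fall back to the LFS server on a cache miss rather than raise an error.

```python
import hashlib

# First line of an LFS pointer file (assumed from the Git LFS pointer spec).
POINTER_VERSION = "version https://git-lfs.github.com/spec/v1"


def clean(content: bytes, lfs_cache: dict) -> str:
    """Stash the large content in the cache and return a tiny pointer."""
    oid = hashlib.sha256(content).hexdigest()
    lfs_cache[oid] = content  # hypothetical stand-in for the local LFS cache
    return f"{POINTER_VERSION}\noid sha256:{oid}\nsize {len(content)}\n"


def smudge(pointer: str, lfs_cache: dict) -> bytes:
    """Swap a pointer back for the large content it refers to."""
    fields = dict(line.split(" ", 1) for line in pointer.splitlines())
    oid = fields["oid"].split(":", 1)[1]
    # A real smudge filter would download from the LFS store on a cache miss.
    return lfs_cache[oid]


cache = {}
large_file = b"\x00" * 1_000_000    # pretend this is a big binary asset
pointer = clean(large_file, cache)  # what Git keeps in its object store
restored = smudge(pointer, cache)   # what lands in the working copy
```

The pointer is a little over a hundred bytes regardless of how large the original file is, which is why the Git object store stays small.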
Traditionally, smudge and clean filter processes were invoked once for each file that was being added or checked out. So, a project with 1,000 files tracked by Git LFS invoked the git-lfs-smudge command 1,000 times for a fresh checkout! While each operation is relatively quick, the overhead of spinning up 1,000 individual smudge processes is costly.
As of Git 2.11 (and Git LFS 1.5), smudge and clean filters can be defined as long-running processes that are invoked once for the first filtered file, then fed subsequent files that need smudging or cleaning until the parent Git operation exits. Lars Schneider, who contributed long-running filters to Git, neatly summarized the impact of the change on Git LFS performance:
The filter process is 80x faster on macOS and 58x faster on Windows for the test repo with 12k files. On Windows, that means the tests runs in 57 seconds instead of 55 minutes!
That’s a seriously impressive performance gain!
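One way to picture how a single long-lived process can serve many files: Git frames each request on the filter's standard input, so the filter knows where one file ends and the next begins. The sketch below is a simplified Python model of that framing, based on my reading of the pkt-line format described in Git's documentation. It omits the initial handshake and capability negotiation, and the helper names (`pkt_line`, `frame_request`, `read_until_flush`) are my own, not part of Git or Git LFS.

```python
import io

FLUSH = b"0000"      # pkt-line flush packet: marks the end of a packet list
MAX_PAYLOAD = 65516  # largest payload that fits in one packet (65520 - 4)


def pkt_line(payload: bytes) -> bytes:
    """Prefix payload with its total length (header included) as 4 hex digits."""
    if not 0 < len(payload) <= MAX_PAYLOAD:
        raise ValueError("payload must be 1..65516 bytes")
    return b"%04x" % (len(payload) + 4) + payload


def frame_request(command: str, pathname: str, content: bytes) -> bytes:
    """Frame one file: header packets, flush, content packets, flush."""
    out = [
        pkt_line(f"command={command}\n".encode()),
        pkt_line(f"pathname={pathname}\n".encode()),
        FLUSH,
    ]
    for i in range(0, len(content), MAX_PAYLOAD):
        out.append(pkt_line(content[i:i + MAX_PAYLOAD]))
    out.append(FLUSH)
    return b"".join(out)


def read_until_flush(stream):
    """Read packets from the stream until the next flush packet."""
    packets = []
    while True:
        header = stream.read(4)
        if len(header) < 4:
            raise EOFError("stream ended mid-packet")
        length = int(header, 16)
        if length == 0:
            return packets
        packets.append(stream.read(length - 4))


# Two files arrive back to back on the SAME stream: no new process needed.
stream = io.BytesIO(
    frame_request("smudge", "art/hero.psd", b"pointer for hero.psd")
    + frame_request("smudge", "art/logo.psd", b"pointer for logo.psd")
)
served = []
for _ in range(2):
    headers = read_until_flush(stream)
    body = b"".join(read_until_flush(stream))
    served.append((headers[1].decode().strip(), body))
```

Because every request is self-delimiting, the filter can loop over its input for the lifetime of the parent Git operation instead of being spawned once per file.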
Specialized LFS Clones
Long-running smudge and clean filters are great for speeding up reads and writes to the local LFS cache, but they do little to speed up the transfer of large objects to and from your Git LFS server. Each time the Git LFS smudge filter can't find a file in the local LFS cache, it has to make two HTTP calls to retrieve it: one to locate the file and one to download it. During a git clone, your local LFS cache is empty, so Git LFS will naively make two HTTP calls for every LFS-tracked file in your repository.
Fortunately, Git LFS 1.2 shipped the specialized git lfs clone command. Rather than downloading files one at a time, git lfs clone disables the Git LFS smudge filter, waits until the checkout is complete, and then downloads any required files as a batch from the Git LFS store. This allows downloads to be parallelized and halves the number of required HTTP requests.
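The "halves the number of requests" claim can be checked with some back-of-the-envelope arithmetic. This Python sketch assumes one "locate" call can cover many files at once. The `batch_size` parameter is a hypothetical value chosen for illustration, not a documented Git LFS default.

```python
import math


def calls_per_file(n_files: int) -> int:
    """Smudge-filter path: one locate call plus one download call per file."""
    return 2 * n_files


def calls_batched(n_files: int, batch_size: int) -> int:
    """Batched path: a few shared locate calls, then one download per file."""
    locate_calls = math.ceil(n_files / batch_size)
    return locate_calls + n_files


per_file = calls_per_file(1000)     # 2 * 1000 = 2000 requests
batched = calls_batched(1000, 100)  # 10 + 1000 = 1010 requests
```

With 1,000 tracked files, the batched path needs roughly half as many requests, and the remaining downloads no longer have to happen one after another.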
Custom Transfer Adapters
As discussed earlier, Git LFS shipped support for long-running filter processes in v1.5. However, support for another type of pluggable process actually shipped earlier in the year. Git LFS v1.3 included support for pluggable transfer adapters so that different Git LFS hosting services could define their own protocols for transferring files to and from LFS storage.
As of the end of 2016, Bitbucket is the only hosting service to implement their own Git LFS transfer protocol via the Bitbucket LFS Media Adapter. This was done to take advantage of a unique feature of Bitbucket's LFS storage API called chunking. Chunking means large files are broken down into 4MB chunks before uploading or downloading.
Chunking gives Bitbucket's Git LFS support three big advantages:
Parallelized downloads and uploads. By default, Git LFS transfers up to three files in parallel. However, if only a single file is being transferred (which is the default behavior of the Git LFS smudge filter), it is transferred via a single stream. Bitbucket's chunking allows multiple chunks from the same file to be uploaded or downloaded simultaneously, often dramatically improving transfer speed.
Resumable chunk transfers. File chunks are cached locally, so if your download or upload is interrupted, Bitbucket's custom LFS media adapter will resume transferring only the missing chunks the next time you push or pull.
Deduplication. Git LFS, like Git itself, is content addressable; each LFS file is identified by a SHA-256 hash of its contents. So, if you flip a single bit, the file's SHA-256 changes and you have to re-upload the entire file. Chunking allows you to re-upload only the sections of the file that have actually changed. To illustrate, imagine we have a 41MB spritesheet for a video game tracked in Git LFS. If we add a new 2MB layer to the spritesheet and commit it, we'd typically need to push the entire new 43MB file to the server. However, with Bitbucket's custom transfer adapter, we only need to push ~7MB: the first 4MB chunk (because the file's header information will have changed) and the last 3MB chunk containing the new layer we've just added! The other unchanged chunks are skipped automatically during the upload process, saving a huge amount of bandwidth and time.
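The spritesheet arithmetic above can be reproduced with a small Python sketch of fixed-size chunk deduplication. This is an illustration of the general technique, not Bitbucket's actual adapter, whose internals aren't described here. To keep it quick to run, it uses 4KB chunks and a 41KB file as scaled-down stand-ins for the 4MB chunks and 41MB file. All function names are hypothetical.

```python
import hashlib
import random

KB = 1024
CHUNK = 4 * KB  # scaled-down stand-in for the 4MB chunks described above


def chunk_hashes(data: bytes, chunk_size: int = CHUNK):
    """SHA-256 of each fixed-size chunk, in file order."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def chunks_to_upload(old: bytes, new: bytes, chunk_size: int = CHUNK):
    """Return (index, size) for each chunk of `new` the server lacks."""
    already_stored = set(chunk_hashes(old, chunk_size))
    todo = []
    for index, digest in enumerate(chunk_hashes(new, chunk_size)):
        if digest not in already_stored:
            start = index * chunk_size
            todo.append((index, len(new[start:start + chunk_size])))
    return todo


rng = random.Random(2016)  # fixed seed so the example is repeatable


def random_bytes(n: int) -> bytes:
    return bytes(rng.getrandbits(8) for _ in range(n))


old_sheet = random_bytes(41 * KB)
# Rewrite the 8-byte "header" and append a new 2KB "layer" at the end.
new_sheet = b"NEWHEADR" + old_sheet[8:] + random_bytes(2 * KB)

pending = chunks_to_upload(old_sheet, new_sheet)
uploaded_kb = sum(size for _, size in pending) // KB
```

Only the first chunk (changed header) and the final, now 3KB, chunk differ, so 7KB of the 43KB file is uploaded, mirroring the ~7MB of 43MB in the example.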
Customizable transfer adapters are a great feature for Git LFS, as they allow different hosts to experiment with optimized transfer protocols to suit their services without overloading the core project.
Next: Pump Up the Rebase
Both Git and Git LFS shipped some impressive performance improvements in 2016. I suspect we'll see a similar trend in 2017 as developers demand faster speeds from repositories with larger files and deeper histories. We may even see some of the improvements from Bitbucket's LFS adapter (i.e., chunking) being rolled into Git LFS core for other providers to take advantage of. As always, if you have any Git LFS tips or feedback on the article, please hit me up on Twitter! I'm @kannonboy. Stay tuned for the next article in our retrospective series, covering enhancements made to one of Git's most powerful and misunderstood features: rebasing. If you want to get a head start, check out Atlassian's tutorial on
If you stumbled on these articles out of order, you can check out the other topics covered in our Git in 2016 retrospective below:
Or, if you've read 'em all and still want more, check out Atlassian's Git tutorials (I'm a regular contributor there) for some tips and tricks to improve your workflow.