Why packs make git even greater

Ok, so Git is an astounding version control system, and there’s no doubt about is speed, but what’s beneath the trunk so that it can get so fast?

Well, many things are greatly optimized, starting with the object structure, but today let’s talk about git packs.

Git packs are files that are automatically created by git in order to reduce space and I/O bandwidth in ways that should be very notorious in any large-history project, but also in some other edge cases.

To understand where do git packs fit let’s first look at how are changes in git tracked in the first place.

Plainly, git’s backing store for content itself is a folder structure in .git/objects.

Git stores mainly 3 kinds of objects: commits, blobs, and trees.

Let’s explain each a little bit:

commit
A commit is a type of object containing the author, date, the commit message, the parent (or parents, in case of a merge), and one or more trees, which represent folders of changes.
tree
There’s at least one tree per non-empty commit, and a tree is a structure used to group more trees and blobs. Think of trees as git’s representation of folders. Trees contain hash references to other trees and blobs, as well as the subtree/blob name and its access mode (i.e., if it is a symlink, an executable file, a regular file…)
blob
A blob represents a non-structured content. Git expects files’ contents to be saved as blobs.

So, given these definitions, and a repository such as this:

% ls -a

.  ..  .git  file1 file2 file3
%

Let’s suppose we have just done a git init ., but we haven’t done anything yet.

Now you can ask yourself what will happen to git’s object store if you do git add . && git commit -m "Initial commit".

Well, some objects will be created, namely:

  • Three blobs, one per file, with the contents of each file
  • A tree for the / folder of the repository, which will encapsulate the three former blobs
  • One commit with the “Initial commit” message, date, author and an entry to the tree

Now, git will seem to be fast enough, becase our computers are actually quite fast, but it’s not the best way of tracking files.

That is, if we modified file1 and did another git add . && git commit -m "file1 mod", we would be storing both versions of file1 fully.

But modifications usually don’t alter much the file completely, so that’s a waste of space!

Git can perform a lot better in this.

Each time we add a file git computes the blob object that corresponds to that file, as well as the necessary tree to keep the staging area consistent (the staging area has the “root tree” of the next commit).

When the files are stored in that way, we say they are “loose objects”, because they are not packed.

But then, there would be a waste, both of physical space and bandwidth whenever we pushed some refs, because transfering files involves a lot of metadata which isn’t important for the commit.

So there are git packs, which is a format that can store several objects in a single file, and index it by object in a separate index file.

Packs can be computed either for references missing from a repo (when pushing, so that only missing references are transmitted, and that’s done in a way that at the receiving end the packs can be more efficiently repacked), or for existing references, in order to save space.

Part of the packing mechanics also split the blobs in deltas, in a way such that a blob can be resolved as baseblob + delta. Then, a single-line change in a 10KB source file will just result in a line-change blob plus one 10KB blob, and not two 10KB blob.

Git can also apply deltas to already-deltaified blobs, with a depth chain that is customizable, but that is usually limited by default to 50 when running git repack by default so that the unpacking side is not too hard.

Oh, and besides, the packs are compressed with zlib, so text gets really compressed.

So, what makes packs fast?

  1. Space. As they use less space than uncompressed blobs, they are read from disk much faster.
  2. Indexing. Packs consist on two files (X.pack and X.idx), and the index file allows for random-access to the pack as if it were a list of files in the filesystem, except the index is already computed, so no jumping between inodes.
  3. Deduplication. Changes that can be reduced as deltas to other changes are arranged in that way, so that the base change is not duplicated, and thus improving file space.

For most operations in git, the worst part of it is traversing the object store and finding the related links. With packs, most of that can be random-accessed much faster due to the index.

Besides, there’s a lot less space wasted in internal fragmentation due to filesystem’s block size (even if ext4 and company help with that in some cases with their inline extents), but by using a single file for the whole pack, a 50MB pack file is a lot more efficient than 15000 random-size-not-multiple-of-4kb objects.

Leave a comment