Wednesday, April 29, 2015

git-fat for large files

GitHub recently added support for large files. If you want to share repos globally, that's fine; but for work within a corporate network, I like git-fat. It has few dependencies -- just plain Python and rsync -- and it caches files in /tmp. It's much simpler than alternatives like git-annex or git-lfs, which are better when you need the option to store files in S3, etc.
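
From memory, setting it up looks roughly like this; the rsync host and path are placeholders for your own file server, and model.bin stands in for whatever large file you track:

    # .gitfat (committed to the repo) tells git-fat where the real contents live.
    printf '[rsync]\nremote = fileserver.example.com:/share/fat-store\n' > .gitfat

    # Route matching files through the fat filter.
    echo '*.bin filter=fat -crlf' >> .gitattributes

    git fat init          # install the clean/smudge filter in this clone
    git add .gitattributes .gitfat model.bin
    git commit -m 'track model.bin with git-fat'
    git fat push          # rsync the actual contents to the remote store
    # collaborators: clone, then 'git fat init' and 'git fat pull'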

However, there is one problem with all of these: they still compute expensive checksums for many operations. That is partly to keep things simple -- letting git operate on files directly -- but even just copying large files is slow, let alone checksumming them. It's much faster to store URLs and let the plugin update symlinks (or hardlinks) and handle caching. If you want a checksum, it can be encoded into the URL. Another advantage is that you can store whole directories, rather than individual files.

That is a plan I am working on with a friend. All large files would be read-only, unless explicitly "opened".
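Here is a minimal sketch of that idea, assuming pointer files with a .url suffix, a shared cache under /tmp, and a checksum encoded as a path component of the URL. The names and layout are made up for illustration, not the real plugin:

    import os
    import shutil
    import urllib.request

    CACHE_DIR = "/tmp/fat-cache"   # shared cache, like git-fat's use of /tmp

    def checkout(pointer_path):
        """Turn a pointer file's target into a read-only symlink into the cache."""
        with open(pointer_path) as f:
            url = f.read().strip()  # e.g. https://host/blobs/sha256-3a7b.../model.bin
        # Key the cache on the checksum component of the URL.
        cached = os.path.join(CACHE_DIR, url.rsplit("/", 2)[-2])
        if not os.path.exists(cached):          # download only on a cache miss
            os.makedirs(CACHE_DIR, exist_ok=True)
            urllib.request.urlretrieve(url, cached)
            os.chmod(cached, 0o444)             # read-only unless explicitly "opened"
        target = pointer_path[:-len(".url")]
        if os.path.lexists(target):
            os.remove(target)
        os.symlink(cached, target)

    def open_for_editing(target):
        """Explicitly "open" a large file: swap the symlink for a writable private copy."""
        cached = os.path.realpath(target)
        os.remove(target)                       # drop the symlink
        shutil.copy(cached, target)             # private copy the user can modify
        os.chmod(target, 0o644)

    # Example: checkout("data/model.bin.url"); later, open_for_editing("data/model.bin")

Nothing here ever re-checksums the payload: the URL is the file's identity, which is what makes checkout cheap.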

For now, git-fat works pretty well.