DunnDunnDunn: vcs

Showing posts with label vcs. Show all posts

Wednesday, April 29, 2015

git-fat for large files

GitHub recently added support for largefiles. If you want to share repos globally, that's fine. But for work within a corporate network, I like git-fat. It has few dependencies -- just plain python and rsync -- and caches files in /tmp. It's much simpler than alternatives like git-annex or git-lfs, which are better when you need the option to store files in S3, etc..

However, there is one problem with all these: They still deal with expensive checksums for many operations. This is partly to keep things simple -- letting git operate on files directly. But even just copying large files is slow, let alone checksumming. It's much faster to store URLs and to let the plugin update symlinks (or hardlinks) and handle caching. If you want a checksum, that can be encoded into the URL. Another advantage is that you can store whole directories, rather than individual files.

That is a plan I am working on with a friend. All large files would be read-only, unless explicitly "opened".

For now, git-fat works pretty well.

Friday, September 12, 2014

Git: Rebase, then merge (--no-ff)

In case you think you should always either rebase or merge, check out this reddit post.

And this blog shows how to rebase a stale GitHub "pull request" (as opposed to using the "Merge Pull Request" button).

Friday, July 4, 2014

Storing dotfiles in Git: symlink + worktree

I use 3 tricks:

core.worktree = /Users/cdunn
symlink .git -> dotfiles/.git/
In .gitignore: *

The beauty of my approach compared to many others is that regular git commands work directly in my home-directory. It's easier to explain with an example:

https://github.com/cdunn2001/dotfiles

To add new ones, pass the '-f' flag to 'git add'.

Tuesday, February 7, 2012

Git at large companies

https://news.ycombinator.com/item?id=3548824

That mentions Amazon, which has embraced git without prohibiting other VCSs. That helped us to solve several problems with scale (100k small repos vs. 3 large repos). I can't comment much, but I can re-post what was said:

Amazon uses Perforce at the moment, and for the most part developers are unhappy with it, as well as the team that has to support it (single giant server prone to outages which block up a couple thousand developers, etc). We're in the process of moving to Git for all of our source.
On the other hand, what you're describing as a problem (what Facebook is describing as going to be a problem) is less likely to be one for Amazon as, with some exceptions that are in the process of fixing the issue, the majority of software at Amazon is developed as a service. Services are segregated into their own package, with most services being broken up into cohesive subpackages (a service my team is building will probably have ~10-13 packages when done), and we have a dependency modeling system for packages baked into everything, from build through deploy, which eliminates most of the cognitive overhead of breaking our services up this way.

All of this translates very well into different Git repositories. What we lose is cohesive atomic commits across packages, which we do get with Perforce. The upshot is we have a team developing a system to handle that specific case.

There you go. SOA (Service Oriented Architecture) solves the repo problem along with a bunch of others.

The problems I've had with git at Amazon are minor, and we do have a very large code-base.

Sunday, August 7, 2011

Git: What is the purpose of `git reset`?

Scott Chacon, the author of ProGit, also has a helpful blog, and this particular entry on "git reset" is absolutely invaluable. Also see this blogpost by Mark Dominus.

Friday, July 15, 2011

Git: Why should I use git instead of Subversion, CVS, etc?

Since the announcement that GoogleCode now supports git, many people are wondering why it's preferable to Subversion or even CVS. Here is my opinion:

I saw part of an interesting [1] video in which a YUI dev claimed that her productivity went up after switching from svn to git. YMMV.

For me, the advantages are:

Distributed repositories
- At first, a central repo seems more appealing to a Project Manager, but eventually you may prefer the Integration Manager model which a DVCS facilitates. Also, a DVCS allows one to commit while offline.
Private branches
- Keep your dirty laundry to yourself. With svn, many devs avoid frequent commits for this reason.
Simpler branch-merging
- When it's easy, people do it.
Rebasing
- The "killer" feature of git. (Also available in Mercurial.) Lets you consolidate groups of commits and pretend that you did them all after the most recent update.
The .git directory
- Very unobtrusive, unlike CVS/ and .svn/. Perforce is even worse, requiring a specific directory for the check-out. With git/hg/bzr/etc., you can version-control any sub-directory in your filesystem at any time, very easily, without setting up a central repo. I sometimes run git init inside a working area for Subversion, for a one-day project. Remember: With Subversion you cannot hide your dirty laundry.
The "stash"
- Unique to git. Syntactic sugar for temporary branching.

"rerere" (reuse recorded resolution)
- Pure magic. Caches merge-conflict resolution, so you never have to resolve manually the identical conflict again.

The biggest advantage of git over mercurial is the [3] index, which is the genius of Linus Torvalds (at least to recognize the value). Otherwise, mercurial is very good and in some ways better.

And what's the biggest disadvantage of git? Large files can make it really slow. With default settings, it's for source-code only. If you want to store big files in git, try git-annex, which even allows the files to be stored on remotes such as rsync, the web (RESTfully), or Amazon S3. Also consider git-media. I wouldn't bother with git-bigfiles.

Saturday, June 25, 2011

Version-control for the home directory dot-files

Lots of people revision-control the dotfiles in their home directories: .bashrc, .vim, etc. That works ok as long as you can ignore any files not controlled, and most VCSs allow that.

But what if you have several homedirs, and you want to maintain some common files between them. Of course, you also have files that differ. I think I've found an elegant solution: 2-tiered VCS.

Each homedir gets a git repo, which is pulled from one in Dropbox. (You'll see why I use Dropbox instead of GitHub in a minute.) I have a branch for each machine, so I can do some comparisons if I want. The master branch has only the common files, which can be used for seeding a new branch on a new machine.

What if I change a common file? I'd hate to have to merge it on each machine. I could forget easily, and that's a lot of work for every little change.

Instead, I keep the common files in cvs, also in Dropbox. Each local cvs workspace is also added to git. (That's not strictly necessary, but it makes setting up a new machine trivial.) When I change a common file, I just 'cvs commit' that file. On any machine, I can run 'cvs update' at any time.

One of the keys to this is the presence of 'CVS/Entries.Static' in the homedir. Otherwise, 'cvs update' could wreak havoc, as some common files are over-ridden on specific machines. (That's why a simpler solution does not work.) Cvs creates that file for you automatically if you 'cvs co' a single file. Otherwise, you can just 'touch CVS/Entries.Static', and remove unwanted files/directories from 'CVS/Entries'.

Another helpful thing is to commit a file called 'cvsignore' (no dot) into the CVSROOT directory (which is in the repo on Dropbox). It has just a single '*', which means to 'ignore everything not listed explicitly in CVS/Entries'. For sub-directories (e.g. .vim/), add a file called .cvsignore with just a single character, '!', to let cvs see all files there.

Also put '*' in '~/.gitignore', and add/commit that file. Henceforth, you will need 'git add -f' for any new files, but that's not really a bad thing.

The most difficult -- and dangerous -- part is setting up the local git repo. Normally, 'cd ~; git clone URL .' will set up a clone in the current directory, but that only works when the directory is empty. Instead, I came up with this sequence of steps:

git init
git remote add origin ~/Dropbox/homedir-repo
git fetch origin
git checkout -f -B mymachine origin/mymachine

Of course, the homedir-repo is 'bare', and the relevant branch was set-up safely in a different directory, with lots of testing. We don't want to destroy our homedir by accident!

So far, this is working extremely well for me, and I have not seen any better ideas out there.

This is helpful in ~/.git/config:

[gc]
auto = 0

That way, git will not pack stuff on Dropbox. Pushing to the remote repo will then only add files. Very little will change. (That's the problem with hosting CVS on Dropbox; files are edited or appended for every commit.)

Monday, April 18, 2011

git: A decision on merging.

A lot of people don't really understand why branch-merging is an ambiguous operation. Here is a good explanation:

http://bramcohen.livejournal.com/74462.html

Friday, November 12, 2010

Git: Why?

Here is a funny and informative blog on problems with git, and why the benefits outweigh them. I liked this part the best:

... by far the best justification I’ve ever seen for git rebase (or git lie, as I prefer to call it).

Anyway, the comments are where the real information can be found.

Wednesday, May 5, 2010

Problems with Perforce (p4)

Gentle Reader,

First I should say that p4 is great for many jobs. In particular, it's efficient for large files or large numbers of files. It also fits well with a common work-flow: Several projects checked out, with several branches, all in one working directory.

Besides, with a title that conjures Shakespeare, it is too great to be by me gainsaid. If it works well enough for you, then you don't need this weblog. Get thee to a nunnery. Parting is such sweet sorrow, but get thee gone. Stop reading!

If on the other hand you are required to use Perforce by your employer and wish it were not so, then like the Duke of Clarence, have patience; you must perforce. Hopefully, after you show this blog to your co-workers, your imprisonment shall not be long.

To sleep, perforce to dream

P4 has a reputation for being fast. Well, it is fast on the server, but communicating with the server, not so much ado.

Suppose you need to run 'p4 fstat' or 'p4 diff' on a huge number of files. And remember: P4 is supposed to be great on large numbers of files.

p4 diff files*

That will print a bunch of info. Great....

Now suppose this is part of a script. You want to learn about all files simultaneously. The output has errors for some files, and some files are not mentioned at all. Consider 5 paths, in 5 different states:

ls non-existent not-added unmapped-but-changed opened-up-to-date unopened-up-to-date
[STDOUT]
 not-added
 unmapped-but-changed
 opened-up-to-date
 unopened-up-to-date
[STDERR]
 ls: non-existent: No such file or directory

Here are several flavors of 'p4 diff':

p4 diff -sa non-existent not-added unmapped-but-changed opened-up-to-date unopened-up-to-date
[STDOUT]
 /home/wshakes/work/opened-up-to-date
[STDERR]
 non-existent - file(s) not opened on this client.
 not-added - file(s) not opened on this client.
 unmapped-but-changed - file(s) not opened on this client.

p4 diff -sr non-existent not-added unmapped-but-changed opened-up-to-date unopened-up-to-date
[STDOUT]
 /home/wshakes/work/unopened-up-to-date
[STDERR]
 non-existent - file(s) not opened on this client.
 not-added - file(s) not opened on this client.
 unmapped-but-changed - file(s) not opened on this client.

p4 diff -se non-existent not-added unmapped-but-changed opened-up-to-date unopened-up-to-date
[STDOUT]
 not-added - file(s) not on client.
 unmapped-but-changed - file(s) not on client.
 opened-up-to-date - file(s) up-to-date.
 unopened-up-to-date - file(s) up-to-date.
[STDERR]
 unmapped-but-changed

It is very difficult to match each section of output to the corresponding file on the command-line. First, you have to parse stderr and stdout. Then, you have to figure out how to map the filename listed in the output back to the filename on the command-line, which can be very tricky in sub-directories.

That's way too much work, especially the file-path mapping, so you decide to run the command on one file at a time.

for f in files*; do
p4 diff -sa $f > $f.diff-sa
done

But soft! For large numbers of files, that will take minutes, or worse. So you decide to use a Perl API. (The C API does not prove to be any more helpful.)

use P4;
$p4 = new P4;
$p4->Connect();
for $file (@files) {
  $fdiffs = ($p4->Run('diff -sa', $file))[0]
  if ($p4->ErrorCount()) {
    print $p4->Errors();
  }
  Process($fdiffs, $file);
}

Most excellent, i' faith! $fdiffs is a hash of the fields that would have gone to stdout. You still have the pesky stderr output, but you know what everything refers to. Only there's one thing wanting ...

Behold! It's still slow -- not as slow, since it now maintains the server connection, but nowhere near so fast as 'p4 diff files*' all at once. Fine. You can pass multiple filenames to the Run() command.

use P4;
$p4 = new P4;
$p4->Connect();
@fdiffs = $p4->Run('diff -sa', @files);
if ($p4->ErrorCount()) {
  print $p4->Errors();
}

for $file (@files) ...

Hark! Not only are you back to the problem of parsing stderr, but you also need to map @fdiffs back to @files in order to know which files were ignored.

This is incredible. The API returns an array of data-structures, but the size of the array does not match the size of the request. What would be so wrong with returning 'undef' to denote the missing files, and maybe '{}' for errors?