Tuesday, April 15, 2008

File Synchronization with Unison

April 14th, 2008 by Mike Diehl in HOWTOs

Keeping the files on multiple machines synchronized seems to be a recurring problem for many computer users. Until I discovered Unison (http://www.cis.upenn.edu/~bcpierce/unison/), I never really had a completely satisfactory solution.

What we'd like to be able to do is efficiently keep two or more servers completely synchronized with each other no matter what gets changed on any of the servers. In the simplest case, we have a production server and a backup server that we need to keep in sync. We might have a cluster of servers used in a load balancing configuration. In the worst case, we might have a group of computers where changes are occurring on any or all of the devices. Consider the case where we have a computer at the office, a laptop, and a work computer at home. We want to be able to work from any computer at any time.

One solution is to simply use scp (http://www.openssh.com/) to copy the files from one computer to the others. This solution requires that we designate one computer to be the “master,” and only changes that occur on the master are propagated to the other, “slave,” computers. Besides a lack of flexibility, this solution has one serious drawback: it copies every file from the master to each slave computer every time the synchronization process is started. On a slow network link, or with a large directory structure, this often proves untenable.

A slightly better solution is to use rsync (http://samba.anu.edu.au/rsync/). The rsync program only transfers those files that are different. In fact, rsync only transfers those parts of a given file that are different. This mechanism is quite efficient, but it still suffers from the same master/slave architecture as scp.

There are solutions that depend upon kernel services such as FAM (http://oss.sgi.com/projects/fam/faq.html) or clustered filesystems like Coda (http://coda.cs.cmu.edu/doc/html/index.html). These solutions can require a kernel recompilation, which seems like a lot of work to simply keep a couple of servers synchronized.

So far, unison is the simplest and most effective solution I've found. Unison will correctly synchronize two servers even if changes occur on both servers. If a change occurs in the same file on both servers, this causes a conflict, and unison will display an error message. File content as well as permissions and ownership can be synchronized. Unison even allows you to keep Linux machines and Windows machines in sync. For those of you who have slow network links, it's nice to know that unison works like rsync in that it only transfers those parts of a file that have been changed, when possible.

Installing unison is trivial. The package management system in most Linux distributions can automatically install unison for you. Otherwise, simply download the source and compile it. You will need OCaml installed, though.

Unison can be configured to use a native network protocol, or to use OpenSSH in order to transfer files. The native protocol is neither authenticated nor encrypted, so it isn't nearly as secure as the ssh configuration. I recommend using the ssh configuration, and that's the configuration my example will use. For automated synchronization, you will probably want to set up key-based authentication for ssh. There are many easy-to-follow instructions on the Internet that describe how to set this up, so I won't cover that here.

Once you have unison installed, and ssh configured, it's time to start synchronizing! But first, we should discuss, briefly, how unison works, especially the first time it is run against a particular file repository. The first time you use unison on a file repository, the program makes a note of the modification timestamp, permissions, ownership and i-node number of each file in both repositories. Then, based on this information, it decides which files need to be updated. The program stores all of this information in the ~/.unison directory. The next time unison is run on the file repository, changes are trivial to detect. Intuitively, you might expect that unison is examining the file's contents to see if the file has changed, but that isn't what is happening. If a file's modification timestamp or i-node number has changed, the file needs to be updated. This is a very fast calculation and scales well, even on very large files.
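The metadata-only idea can be illustrated with a few lines of shell. This is a rough sketch of the principle, not Unison's actual implementation: the function names snapshot and changed_since are made up for this example, and it assumes GNU find and stat (the BSD and macOS versions of stat take different flags).

```shell
# Print one "mtime inode path" line for every regular file under directory $1.
# Assumes GNU stat; %Y = modification time, %i = i-node number, %n = path.
snapshot() {
    find "$1" -type f -exec stat -c '%Y %i %n' {} +
}

# List the files under directory $2 whose recorded metadata differs from the
# saved snapshot file $1 (changed, added or removed). No file contents are read.
changed_since() {
    # A line present in both snapshots appears twice and is dropped by uniq -u;
    # whatever remains belongs to a file whose timestamp or i-node changed.
    { cat "$1"; snapshot "$2"; } | sort | uniq -u | cut -d' ' -f3- | sort -u
}

# Example use (the snapshot file must live outside the directory it describes):
#   snapshot /home/mdiehl/Development > /tmp/dev.snapshot
#   changed_since /tmp/dev.snapshot /home/mdiehl/Development
```

Saving the output of snapshot after one run and handing it to changed_since later lists only the files whose timestamp or i-node number moved in between, which is why this kind of check stays cheap even when the files themselves are large.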

Here is a quick example from one of my computers:

unison /home/mdiehl/Development ssh://10.0.1.56///home/mdiehl/Development/ -owner -group -batch -terse

This should all be on one line. I do a lot of software development and, in this example, I'm using unison to synchronize the development directory from my Internet-accessible server to my workstation on my private network. This example is fairly intuitive, and it doesn't get much more complicated than this, but let's take a closer look anyway.

The example synchronizes /home/mdiehl/Development on my server to the same directory on my workstation, whose IP address is 10.0.1.56. The ssh protocol is used for the file comparison and transfer. Since this is a bi-directional process, it doesn't matter where the script runs as long as the two machines can reach each other over the network; it's just more convenient to run my scripts on the server, but I could just as easily run this script from my workstation if I change the IP address in the script.

The “-owner” and “-group” parameters tell unison to attempt to synchronize the user and group ownership. You need to make sure that the owners and groups exist on all of the machines you intend to synchronize. For example, if you are syncing a directory owned by the user “bob,” whose uid is 500, you need to be sure that “bob” exists on every server. Otherwise, you will find that unison will create an entire directory structure owned by uid 500. This is messy, but easily resolved.

Since I run this example command from cron, I use the “-batch” parameter, which tells unison not to ask the user any questions and to simply do what it can, skipping any conflicts. Similarly, the “-terse” parameter keeps unison from filling up my cron log with a bunch of unnecessary output.
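For a cron job it can help to wrap the command so that the outcome is logged and the exit status is preserved. The following is a minimal sketch, not the script used in this article: UNISON_BIN, SYNC_LOG and run_sync are illustrative names, the log location is an assumption, and the exit-status meanings in the comment are taken from the Unison manual, so check them against your version.

```shell
# UNISON_BIN, SYNC_LOG and run_sync are illustrative names, not part of Unison.
UNISON_BIN=${UNISON_BIN:-unison}
SYNC_LOG=${SYNC_LOG:-$HOME/unison-cron.log}

# Run the sync from the article non-interactively and log the outcome.
run_sync() {
    status=0
    "$UNISON_BIN" /home/mdiehl/Development \
        ssh://10.0.1.56///home/mdiehl/Development/ \
        -owner -group -batch -terse >>"$SYNC_LOG" 2>&1 || status=$?
    # Per the Unison manual: 0 = in sync, 1 = some files skipped,
    # 2 = non-fatal failures, 3 = fatal error or interruption.
    if [ "$status" -ne 0 ]; then
        echo "$(date): unison exited with status $status" >>"$SYNC_LOG"
    fi
    return "$status"
}
```

A script that defines run_sync and ends by calling it could then be scheduled from a crontab entry; the schedule and the script's location are up to you.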

When I run the example above, I am presented with a list of updates that are being made between the two computers. The final lines are the most important, though:

UNISON finished propagating changes at 01:05:15 on 13 Apr 2008
Synchronization complete (8 items transferred, 0 skipped, 0 failures)

As you can see, 8 files needed to be transferred in order to synchronize the two servers. Fortunately, there were no problems, all 8 files were transferred, and my two machines are back in sync. If there were files with conflicting changes, then we would see that in the “skipped” tally. If there had been file permissions or network problems, those would have shown up as failures. Either way, we'd want to go back through the log to find out what happened.
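When the job runs unattended, the same check can be done by a script. Here is a small sketch that reads unison's output and succeeds only if the summary reports nothing skipped and nothing failed. The function name summary_is_clean is invented for this example, and the pattern assumes the summary wording shown above; other Unison versions may phrase it differently.

```shell
# Read unison output on stdin. Returns 0 if the last summary line reports
# 0 skipped and 0 failures, 1 if it reports any, 2 if no summary was found.
summary_is_clean() {
    summary=$(grep 'Synchronization complete' | tail -n 1)
    [ -n "$summary" ] || return 2
    # Pull out the number that precedes each keyword in the summary line.
    skipped=$(printf '%s\n' "$summary" | sed -n 's/.* \([0-9][0-9]*\) skipped.*/\1/p')
    failures=$(printf '%s\n' "$summary" | sed -n 's/.* \([0-9][0-9]*\) failure.*/\1/p')
    [ "$skipped" = 0 ] && [ "$failures" = 0 ]
}

# Example use, with a hypothetical log file name:
#   summary_is_clean < /var/log/unison-cron.log || echo "check the unison log"
```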

In the several years that I've been using unison, I've only had a few problems with it. As mentioned earlier, the most common problem stems from having conflicting file changes. For example, if you make a change to a file on one server and then change the corresponding file on the other server and the files don't end up being identical, unison sees that as a conflicting change and flags it. The way I usually resolve this problem is by deciding which version I want to keep and using the “-prefer” option to tell unison which version it should... prefer... when there is a conflict. In the example above, if I wanted to have the local version overwrite the remote version, I would add:

-prefer /home/mdiehl/Development

to the end of the command line.
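Spelled out in full, the conflict-resolving run might look like the sketch below. The variable and function names are illustrative only. The Unison manual asks that the value given to “-prefer” name one of the two roots as written on the command line, which is why the sketch reuses the same variable for both.

```shell
# UNISON_BIN, LOCAL_ROOT, REMOTE_ROOT and the function name are illustrative.
UNISON_BIN=${UNISON_BIN:-unison}
LOCAL_ROOT=/home/mdiehl/Development
REMOTE_ROOT=ssh://10.0.1.56///home/mdiehl/Development/

# Re-run the sync, resolving every conflict in favor of the local copy.
sync_preferring_local() {
    "$UNISON_BIN" "$LOCAL_ROOT" "$REMOTE_ROOT" \
        -owner -group -batch -terse -prefer "$LOCAL_ROOT"
}
```

Because this overwrites the remote version of every conflicting file, it is probably best run by hand after looking at the log rather than left permanently in a cron job.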

The very first problem I had with unison was when I tried to synchronize two directories that had several tens of thousands of files in them. Unison simply ran out of memory. If I had one complaint about unison, it would be that I have to break large file repositories into smaller pieces in order to use unison to synchronize them. It doesn't seem to me that it should take that much memory to do the bookkeeping, but I can't argue with the fact that the tool works and I've never lost a file with it.
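One way to break a large repository into pieces is to run unison once per top-level subdirectory. The sketch below shows the idea under some stated assumptions: the names UNISON_BIN and sync_in_pieces are made up, every subdirectory is assumed to exist on both machines, and files sitting directly in the top-level directory (as well as hidden directories) are not covered and would need a run of their own.

```shell
# UNISON_BIN and sync_in_pieces are illustrative names, not part of Unison.
UNISON_BIN=${UNISON_BIN:-unison}

# Synchronize each top-level subdirectory as its own unison run, so that no
# single run has to keep track of the whole tree.
# $1 = local root, $2 = remote root (for example ssh://host//path)
sync_in_pieces() {
    local_root=${1%/}
    remote_root=${2%/}
    overall=0
    for dir in "$local_root"/*/; do
        [ -d "$dir" ] || continue    # the glob matched nothing
        name=$(basename "$dir")
        "$UNISON_BIN" "$local_root/$name" "$remote_root/$name" \
            -owner -group -batch -terse || overall=$?
    done
    # Report the status of the last run that did not exit cleanly, if any.
    return "$overall"
}

# Example use:
#   sync_in_pieces /home/mdiehl/Development ssh://10.0.1.56///home/mdiehl/Development/
```

Each pair of roots gets its own record in ~/.unison, so the pieces are tracked independently of one another.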

The unison website indicates that unison is no longer under active development. This is unfortunate, but it shouldn't dissuade you from using and trusting the program. I've found it to be quite mature, and it is still actively supported via the unison mailing list. I've had a few occasions to ask for help on the mailing list, and I've found the list to be extremely helpful.

Unison is a very effective means of synchronizing servers. It can be used in a “star” topology to keep multiple servers in sync. It can also be used in a “ring,” or any other topology you might need. The documentation is quite extensive and well written. I hope you find it as effective and easy to use as I have.

Mike Diehl is a Linux Administrator for Orion International at Sandia National Laboratories in Albuquerque, New Mexico. Mike lives with his wife and two small boys. Mike can be reached via email at: mdiehl@diehlnet.com
