Tuesday, August 22, 2006

Cooperative backup and archiving

After testing my company's shiny, new 100 Mbps internet service from Cogent, I was struck with an idea. Plenty of companies out there offer internet-based backup service, but most are priced per-gigabyte-per-month and are prohibitively expensive, especially considering that our full backup sets run about 600 GB in compressed form, and we of course want to keep daily incrementals for several months.

My idea: find another Cogent subscriber, and enter into a mutual backup agreement. That is, we buy a storage server to sit at their site, and they buy one to sit at ours. We can then exchange backup traffic, eliminating the need to shuffle tapes and send them off-site. Many large enterprises already do this with SAN hardware replication; however, this would be a budget "roll your own" solution.

There are several open-source distributed storage systems that might help; however, none seem ready for prime time yet. I feel quite a lot could be accomplished with judicious use of native backup tools and open-source encryption software.

Ideally, we could use something like rsync or rdiff-backup to send only the changed data each night. However, since the backup server will be at an untrusted site, all data must be encrypted before it leaves our network, and no decryption keys can exist at the untrusted backup site. Since all good encryption methods produce entirely different byte streams with each encryption, tools like rsync won't gain us anything.
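To see why rsync can't help here, consider what any sound encryption scheme does: it mixes in a fresh random nonce (or session key) on every run, so encrypting the same file twice yields two ciphertexts with essentially nothing in common, and rsync's rolling-checksum delta detection finds no shared blocks. A toy sketch (the hash-based stream cipher below is for illustration only, not a vetted cipher):

```python
import hashlib
import os

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive a keystream by hashing key + nonce + counter.
    Toy construction for illustration only -- not a real cipher."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    nonce = os.urandom(16)  # fresh random nonce on every encryption
    ks = keystream(key, nonce, len(plaintext))
    return nonce + bytes(p ^ k for p, k in zip(plaintext, ks))

key = b"shared-secret"
data = b"nightly backup payload, mostly unchanged " * 50

c1 = encrypt(key, data)
c2 = encrypt(key, data)

# Same plaintext, same key -- yet the two ciphertexts differ almost
# everywhere, so a delta-transfer tool sees two unrelated files.
print(c1 != c2)  # True
```

GnuPG behaves the same way in practice: each run generates a random session key, so last night's encrypted backup shares no byte ranges with tonight's, even if the underlying data barely changed.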

My current thoughts are to use native backup tools to create a local file-based backup, and then use GnuPG or a similar tool to compress and encrypt full backup files for transmission via FTP. With weekly full backups and daily incrementals, we'll be transmitting a lot more data than we would with rsync, but a 100 Mbps connection could make it workable (~20 hours for a full backup over the weekend, and much shorter incremental times).
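The ~20-hour figure checks out with back-of-the-envelope arithmetic, assuming the link sustains roughly 65% of its rated throughput once protocol overhead and competing traffic are accounted for (the 65% figure is my assumption, not a measurement):

```python
# Rough transfer-time estimate for the weekly full backup.
full_backup_gb = 600   # compressed full backup set
link_mbps = 100        # Cogent line rate
efficiency = 0.65      # assumed effective utilization

megabits = full_backup_gb * 8 * 1000          # GB -> megabits (decimal units)
hours = megabits / (link_mbps * efficiency) / 3600
print(f"{hours:.1f} hours")  # about 20.5 hours
```

At full line rate the same transfer would take a little over 13 hours, so even pessimistic utilization still fits comfortably in a weekend window.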

One thing that could drastically reduce the amount of data in transit is some form of single-instance storage. Using native tools, we'd be storing dozens of copies of many identical binaries (OS and application files). However, I can't see how this could be made to work when encryption has to happen before transmission.
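The duplicate-detection half of the problem, at least, is easy to sketch: hash every file's contents on our own (trusted) network before anything is encrypted, store each unique blob once, and keep a manifest mapping paths to hashes. This collapses duplicates within our backup set; it does nothing for the harder question of deduplicating across already-encrypted data at the remote site. All names below are hypothetical:

```python
import hashlib

def build_single_instance_store(files: dict) -> tuple:
    """Collapse duplicate file contents before encryption/transmission.

    `files` maps path -> content bytes. Returns (blobs, manifest), where
    `blobs` holds each unique content exactly once keyed by its SHA-256
    digest, and `manifest` maps every path to the digest of its blob.
    Hashing happens on the trusted side, so duplicates are found before
    anything is encrypted.
    """
    blobs = {}
    manifest = {}
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        blobs.setdefault(digest, content)   # store each unique blob once
        manifest[path] = digest
    return blobs, manifest

# Two hosts with largely identical OS files: the shared binary is stored once.
files = {
    "host1/bin/ls": b"ELF...ls-binary",
    "host2/bin/ls": b"ELF...ls-binary",
    "host1/etc/fstab": b"/dev/sda1 / ext3",
}
blobs, manifest = build_single_instance_store(files)
print(len(blobs))  # 2 unique blobs for 3 files
```

Only the unique blobs (plus the manifest) would then be encrypted and shipped, so identical OS files across dozens of machines cost one transmission instead of dozens.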

Any thoughts?