Backing up Large Files

by Paul Zarucki, Electronic Equipments Ltd., 2008-06-27. Updated 2008-07-02.


There are quite a few programs for backing up files: some with graphical interfaces, some with web interfaces and others which work from the command line. They are mostly designed to do scheduled backups to a server or hard disc, but I wanted a straightforward way to make one-off backups of some large files to DVDs or a USB drive. I decided, therefore, to use a simple command line procedure to do most of the work. It takes only two commands to create the archive and three to restore the archive and check it for errors.

The beauty of the command line approach is that, once you have a working procedure, you can save the commands in a text file (called a shell script) and link it to an icon on your desktop. Any time you need to repeat the procedure, simply click the icon! The shell script also serves to document the procedure, and you can easily edit it if you need to change it.
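
As a rough illustration of what such a shell script might contain, here is a minimal sketch. The function name "backup_dir" and its arguments are made up for this example; the two commands it wraps are the ones explained in the rest of this article.

```shell
#!/bin/bash
# Hypothetical wrapper: backup_dir is an illustrative name, not a standard command.
# Usage: backup_dir DIRECTORY ARCHIVE_NAME [CHUNK_SIZE]
backup_dir() {
    local src="$1" name="$2" chunk="${3:-1000m}"
    # Pack and compress the directory, cutting the stream into chunks.
    tar -cz "$src" | split -d -b "$chunk" - "$name.tgz."
    # Record a checksum for every chunk.
    md5sum "$name".tgz.* > "$name.md5"
}
```

With this function defined, typing "backup_dir mydir myarchive" would run both steps with the default chunk size of 1000m.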

If you are new to the command line, I highly recommend the beginner's introduction by Rosalyn Hunter.

Software Used

I use the Debian 4.0 GNU/Linux system but the methods described here should work on most GNU/Linux and Unix systems with little or no modification. The references to copy-and-paste assume you are using an X-windows based graphical desktop system (I use Gnome but the principles apply to almost any of the desktop systems available for GNU/Linux and Unix).

Command Line

This tutorial makes use of the command line. If you are only used to point-and-click methods, don't be afraid! The command line is your friend and, for some tasks, it is quicker and easier than the available graphical software. To start typing commands, open a console or terminal window (if you are using the Gnome desktop look under Applications -> Accessories -> Terminal).

You can copy-and-paste each command from this page into the console window then edit it to suit your needs before pressing the return key. The left and right arrow keys move the cursor, the delete and backspace keys delete text. Anything you type or paste will be inserted at the cursor position.

TIP: to copy, select some text with the mouse then, to paste, simply move the mouse to the destination and middle-click (press either the middle button or the scroll wheel on your mouse).


Typical Problem

I have a directory containing some files for a virtual computer. The files that hold the data for the virtual computer's hard discs are not only big, they are also "sparse" files, which means that they use only enough disc space for the data that was actually written to the file. For example, the virtual computer may have a 30GB drive of which 2GB has been used. Even though it uses only 2GB on my hard disc, a program that reads the file may see it as a 30GB file. This type of file can be tricky to back up because, when you copy it, you can end up with a 30GB file, or it might simply fail to copy, depending on the type of file system used on the backup storage.
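
To see the difference between the apparent size of a sparse file and the space it really occupies, a small experiment along the following lines may help. It is only a sketch: it assumes the GNU versions of truncate, stat and du, and the file name "sparse.img" is made up for the example.

```shell
# Work in a scratch directory so nothing important is touched.
demo=$(mktemp -d)
# Create a file with an apparent size of 100M (104857600 bytes) without writing any data.
truncate -s 100M "$demo/sparse.img"
# Apparent size in bytes, as seen by a program reading the file.
apparent=$(stat -c %s "$demo/sparse.img")
# Space actually allocated on disc, in kilobytes.
allocated=$(du -k "$demo/sparse.img" | cut -f1)
echo "apparent size: $apparent bytes, allocated: $allocated KB"
```

On a file system that supports sparse files, the allocated figure should be far smaller than the apparent size.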

It would also be nice to preserve links as well as the ownership and permission properties of the files, all of which might be lost if they were simply copied to a DVD or USB drive.

The Solution

A simple way to create an archive that uses no more space than is necessary for sparse files, as well as preserving links, file ownerships and permissions, is to use the venerable tar program. This packs the files into a single "archive" file and can compress the files as well. Think of it as the Unix equivalent of a ZIP file. The command would look something like

tar -czf myarchive.tgz mydir

where mydir is the directory to be archived and myarchive.tgz is the name of the archive file to be created.
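
If your tar is the GNU version, it also has an -S (--sparse) option, which records the holes in sparse files so that they can be re-created as sparse files on extraction. The following sketch tries this out on a small test file in a scratch directory; the name "disk.img" is made up, and the use of -S is an addition to the command shown above rather than part of it.

```shell
demo=$(mktemp -d)
cd "$demo"
mkdir mydir
truncate -s 50M mydir/disk.img        # sparse test file, apparent size 50M
echo "some data" >> mydir/disk.img    # a little real data at the end
# -S (--sparse) is a GNU tar option: holes are recorded rather than stored as zeros.
tar -czSf myarchive.tgz mydir
# List the archive contents without extracting anything.
tar -tzvf myarchive.tgz
# Extract into a separate directory and see how much disc space the copy uses.
mkdir restored
tar -xzf myarchive.tgz -C restored
du -k restored/mydir/disk.img
```

The -t option (list) is a handy way to check what an archive contains before restoring it.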

I want the archive to be easy to copy to a variety of storage media, such as a USB flash drive or the different formats of DVD. These impose a limit on the maximum size of a single file, which can be as small as 1GB, so I use the split program to divide the archive into 1000MB chunks. The result is a set of files, each no larger than 1000MB, which can easily be copied to and moved between different types of storage. This also makes it easy to copy the archive onto multiple DVDs if it is too big to fit on one disc. I could use the following command to split the file "myarchive.tgz"

split -d -b 1000m myarchive.tgz myarchive.tgz.

This would leave me with the original file (myarchive.tgz) plus a series of files named "myarchive.tgz.00", "myarchive.tgz.01", etc., each no larger than 1000MB.
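
The behaviour of split is easy to try out on a small scale first. The sketch below uses the same command as above but with 1000k chunks instead of 1000m, on a made-up test file called "sample.bin", and then checks that the chunks joined together are identical to the original.

```shell
demo=$(mktemp -d)
cd "$demo"
# 2560000 bytes of test data standing in for a large archive.
head -c 2560000 /dev/urandom > sample.bin
# Chunks of 1000k (1024000 bytes) each; the last chunk holds whatever is left over.
split -d -b 1000k sample.bin sample.bin.
ls -l sample.bin.*
# Joining the chunks in name order should reproduce the original exactly.
cat sample.bin.* | cmp - sample.bin && echo "chunks match the original"
```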

In practice it would be better to use a pipe to feed the output of the tar command to the split command and create the 1000MB files directly, without the need to save the archive as a single file first. This halves the amount of disc space and time required for the job. The above commands would then be replaced by the single line

tar -cz mydir | split -d -b 1000m - myarchive.tgz.

where "|" tells the computer to pipe the output of the tar program to the split program.
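
One thing to be aware of when using a pipe is that the shell normally reports only the exit status of the last command in the pipeline (split, in this case), so a failure in tar could go unnoticed in a script. The following sketch assumes the bash shell, whose "pipefail" option makes the whole pipeline report a failure if any command in it fails. The directory name "no-such-dir" is deliberately one that does not exist.

```shell
#!/bin/bash
# Assumes bash: pipefail is not available in every shell.
set -o pipefail
demo=$(mktemp -d)
cd "$demo"
# tar fails because the directory is missing; with pipefail the pipeline as a whole fails too.
if tar -cz no-such-dir 2>/dev/null | split -d -b 1000m - broken.tgz.; then
    result="pipeline succeeded"
else
    result="pipeline failed"
fi
echo "$result"
```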

I then use the md5sum program to calculate the checksum of each file, which will make it easy to check for corrupted files when restoring the archive. The command would be

md5sum myarchive.tgz.* > myarchive.md5

which creates the file myarchive.md5 containing the checksums of each of the files produced by the split command.

Restoring the Archive

When I want to restore the archive, I copy the files from the backup medium into a temporary directory. In the console window, I change to the temporary directory and type

md5sum -c myarchive.md5

which calculates new checksums for the archive and compares them with the original checksums in the file myarchive.md5. If all is OK I then re-combine the chunks into a single archive and extract the original files:

cat myarchive.tgz.* | tar -xz

(Again we use a pipe to avoid the need for an intermediate file, saving time and disc space.)

The current directory now has a sub-directory containing a perfect copy of the original files.
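
The check and the extraction can also be chained with "&&" so that tar runs only if every checksum matches. The sketch below builds a small test archive in a scratch directory, restores it, then deliberately damages one chunk to show that the second attempt stops at the checksum test. The helper name "restore_checked" is made up, and the --quiet option assumes the GNU version of md5sum.

```shell
demo=$(mktemp -d)
cd "$demo"
mkdir mydir && head -c 300000 /dev/urandom > mydir/data.bin
tar -cz mydir | split -d -b 100k - myarchive.tgz.
md5sum myarchive.tgz.* > myarchive.md5
mkdir restored

# restore_checked is a made-up helper name: extract only if all checksums match.
restore_checked() {
    md5sum -c --quiet myarchive.md5 && cat myarchive.tgz.* | tar -xz -C restored
}

restore_checked && echo "restored OK"

# Simulate a damaged chunk: the checksum test now fails and tar is never run.
echo "damage" >> myarchive.tgz.01
rm -r restored/mydir
restore_checked || echo "checksum mismatch, nothing extracted"
```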

An Example

Creating the Archive

I have a directory "vm" with a sub-directory "oldpc" that contains some files that I want to archive. I'll call the archive "oldpc2008-06-29". I type the following command to create the archive:

tar -cz vm/oldpc | split -d -b 1000m - oldpc2008-06-29.tgz.

The result is a series of files named "oldpc2008-06-29.tgz.00", "oldpc2008-06-29.tgz.01", etc., each no larger than 1000MB.

To create the checksum file for this archive, I type

md5sum oldpc2008-06-29.tgz.* > oldpc2008-06-29.md5

That's it! In two lines we have created a set of archive and checksum files ready to be copied to the backup storage of our choice. You can use your favourite DVD writing program (GnomeBaker, K3b, etc.) to write the files to DVD or, alternatively, plug in a USB drive and simply copy the files to it.

Restoring the Archive

To restore the files from the archive, I create a temporary directory, say "restored", and copy the archive files into it. This is something I can do using my normal file manager program. I then open a console window and type the following commands:

cd restored
md5sum -c oldpc2008-06-29.md5

The first line does a "change directory" to select the directory containing the archive files. The second line verifies the checksums.

If all is OK, I then type

cat oldpc2008-06-29.tgz.* | tar -xz

The directory "restored" now has a sub-directory "vm/oldpc" containing a perfect copy of the original files.
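
Before relying on the procedure for real data, it may be worth doing a dry run on some small test files while the originals are still available for comparison. The sketch below does the whole round trip in a scratch directory and compares the result with "diff -r". The file names and the archive name "oldpc-test" are made up for the example.

```shell
demo=$(mktemp -d)
cd "$demo"
# Small stand-ins for the real files.
mkdir -p vm/oldpc
head -c 250000 /dev/urandom > vm/oldpc/disk.img
echo "memory = 512" > vm/oldpc/settings.cfg
# Create the archive chunks and the checksum file.
tar -cz vm/oldpc | split -d -b 100k - oldpc-test.tgz.
md5sum oldpc-test.tgz.* > oldpc-test.md5
# Copy everything to a separate directory, then verify and extract there.
mkdir restored
cp oldpc-test.* restored/
( cd restored && md5sum -c oldpc-test.md5 && cat oldpc-test.tgz.* | tar -xz )
# diff -r prints nothing and returns success when both trees have identical contents.
diff -r vm restored/vm && echo "restored copy matches the original"
```

Note that "diff -r" compares file contents only; it does not check ownership or permissions.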

A Note about Very Large Files

The tar archive format imposes an upper limit of 68GB on the size of any single file that can be added to a tar archive. I believe the GNU version of the tar program will handle files up to this size but some other versions may have a limit of 8GB. If you have files larger than the maximum then, as long as they are not sparse files, you can overcome the size limitation by compressing and/or splitting these files before making the archive (e.g. using the gzip and split commands). As far as I know, there is no upper limit on the total size of the tar archive or the number of files it may contain.
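
To find out whether a directory contains any files that might run into such a limit, something like the following could be used. It is a sketch only: "check_large" is a made-up helper name, the size syntax is that of GNU find, and the default of 8G simply reflects the lower of the two limits mentioned above.

```shell
# check_large is a hypothetical helper: list regular files over a given size.
# The size uses GNU find syntax, e.g. +8G means "more than 8 GiB".
check_large() {
    find "$1" -type f -size "+${2:-8G}" -print
}
```

Typing "check_large mydir" would list any files under mydir larger than 8G; if it prints nothing, no file exceeds that size.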

Further Reading

The tar and split commands are very versatile and this tutorial shows just one way of using them. More information on any of the commands used here can be found in the manual pages. To view the "man page" for the tar command, for example, type:

man tar

Scroll the man page using the up/down arrow, PageUp, PageDown, Home and End keys. To finish, press the letter "q".

TIP: You can have multiple console windows open at the same time, one for entering commands and another for viewing man pages, for example. You can also copy and paste between them (or between almost any two windows) using the select-then-middle-click method.


Credits

Thanks to Kurt Jürgen Andereya for pointing out the 68GB limit on the size of any single file that can be added to a tar archive.

Thanks to Michael Crider for pointing out that the md5sum program can also compare the calculated checksums with pre-existing ones.


Change Log

2008-07-02: added a note about file size limits (very large files).

2008-07-01: corrected mistake in the restore procedure (should read "md5sum -c myarchive.md5" rather than "md5sum -c myarchive.md5 myarchive.tgz.*").

2008-06-29: used md5sum with the -c option to eliminate the need to use the diff command.

2008-06-27: first version.



Feedback and comments to paulatelectronic-equipments.co.uk are welcome.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation, with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.