Compressing files in Unix
Understanding how disk space is allocated can help you get more from your drives
This month we'll cover the basics of file compression in Unix. We'll explain why small files may take up more disk space than their size suggests, why one file may be better than two, and what you can do to squeeze more space out of your disk, including how to use the tape archive (tar) utility to better compress small files. (2,100 words)
You might expect finding out how much disk space a file uses to be pretty simple. Type ls -l and the directory listing tells you how many bytes are in the file. In the example listing below, minutes.txt is 3 bytes long (must have been a short meeting) and note.txt is 1,201 bytes long.
$ ls -l
total 6
-rw-r--r-- 1 mjb group    3 Feb 04 23:31 minutes.txt
-rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
But those are the file sizes, not the amount of space used on the disk. To see the space used on the disk, add the -s switch by typing ls -ls. The new listing (shown below) includes an initial column that contains the number of blocks used on the disk by the file. A block is a unit of 512 bytes. The first file, minutes.txt, uses 2 blocks, or 1,024 bytes (suddenly that meeting doesn't seem so short), and note.txt uses 4 blocks, a whopping 2,048 bytes.
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group    3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
What's happening here? It would be nice if a 3-byte file actually used 3 bytes, plus maybe a few bytes for the file name and other information in the directory. Unfortunately that's never been a practical way to organize a disk. The overhead in keeping track of the directory would become a load on the system. Also, as a file expanded and contracted, due to editing and data entry, it would become heavily fragmented. The file's first 3 bytes would be down on track 12, the next 14 bytes over on track 25, and future additions spread out to track 64. The directory would become a hodgepodge of pointers, and loading files into the editor would require the read heads to scramble all over the disk collecting directory information and tiny bits of file.
To handle this problem, a compromise was reached in disk organization. A convenient number of bytes was selected as the minimum amount that could be allocated to a file. This amount could be called an allocation unit. If a file didn't use all the space in its allocation unit, the remainder of the unit would be set aside for future expansion. As a file expanded, so long as it didn't exceed the number of bytes in its allocation unit, all new information was stored in the reserved space on the disk. Once the file exceeded that space, another allocation unit was grabbed and reserved. Any spillover from the first allocation unit would be tucked into the new one, and so on. Now the directory had only to locate the first allocation unit. This method is used in all major operating systems in one form or another.
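To make the arithmetic concrete, here is a quick sketch, assuming purely for illustration a 1,024-byte allocation unit like the one on the example system below: round the file size up to the next whole unit, and that is the space actually reserved. The expr utility can do the rounding:

$ expr \( 1201 + 1023 \) / 1024
2

So a 1,201-byte file consumes 2 allocation units, and 2 x 1,024 = 2,048 bytes are reserved on disk for it.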
Earlier Unix systems used an allocation unit of 512 bytes. These 512 bytes made up 1 block. As disk sizes grew, the basic allocation unit was increased to 1,024 bytes on most systems (larger on some), but many utilities, such as ls, continue to report file sizes or disk use in 512-byte blocks. This block size remains the standard for many utilities, even though the actual size of an allocation unit has increased to 2 or more blocks.
Black holes
With this background, let's now look at the ls -ls listing again. A 3-byte file like minutes.txt will occupy a 512-byte block, but more importantly for disk usage, it will occupy 1 allocation unit, which on the system in the example below is 2 blocks, or 1,024 bytes. The ls -ls listing correctly indicates 2 blocks used on the disk. Similarly, note.txt is 1,201 bytes and should therefore occupy 3 blocks (2 complete blocks hold the first 1,024 bytes, and the remaining 177 bytes go into a third block). The note.txt file actually uses 2 allocation units, or 4 blocks, as indicated in the listing.
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group    3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
This seems dreadfully wasteful. In fact, 99.7 percent of the space allocated for minutes.txt is unused, and 41.4 percent of the space for note.txt is wasted. Multiply this by the number of files on the system, and you'll begin to imagine vast black holes of disk space that cannot be reached except by forcing all users to create and fill files that are exact multiples of 1,024 bytes.
Before you start hyperventilating, remember that the high percentage of waste occurs only on very small files; the larger the file, the more efficient the allocation scheme becomes. If you allow that your system is probably working fairly well, you'll recognize that the allocation system is a good compromise between efficient use of disk space and speed of disk access.
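If you want to check those percentages yourself, here is a rough sketch using awk. It assumes a 1,024-byte allocation unit and the standard ls -l column layout (size in column 5, name in column 9), and it will be confused by file names containing spaces:

$ ls -l *.txt | awk '{ unit = 1024; alloc = unit * int(($5 + unit - 1) / unit)
  printf "%-12s %4d of %4d bytes unused (%.1f%%)\n", $9, alloc - $5, alloc, 100 * (alloc - $5) / alloc }'
minutes.txt  1021 of 1024 bytes unused (99.7%)
note.txt      847 of 2048 bytes unused (41.4%)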
One useful task is to establish the allocation unit size of your system. You can probably plow through manuals for this information, but a simpler method is to read the manpage for ls. Establish the block size used by the -s option (usually 512 bytes), then use vi to create a file with only a few bytes in it and close the file. Type ls -ls to look at the number of blocks used for that small file, multiply it by the block size, and you have your basic allocation unit size.
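That whole procedure fits in three commands; here is a minimal sketch, assuming ls -s reports in 512-byte blocks (tiny.txt is just a throwaway name):

$ echo hi > tiny.txt      # creates a 3-byte file ("hi" plus a newline)
$ ls -ls tiny.txt         # on the example system this shows 2 blocks in the first column
$ rm tiny.txt             # 2 blocks x 512 bytes = a 1,024-byte allocation unit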
One other useful note: ls -l, ls -s, and similar variations display a total line as the first record in a directory display. The total 6 in the listing below is in fact the sum of the blocks displayed by typing ls -ls.
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group    3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
Compression
Now that you've identified one type of disk-eating file, what can you do about it?
You've probably heard of or even used some of the various file-compression utilities, such as pack, compress, and the GNU (which stands for GNU's Not Unix) software utility, gzip. These utilities work very well on large files, but perform poorly on small files. In the sample listing below, compress is applied to each of the files and the results are displayed. The compress utility correctly recognizes that it can't do any good on minutes.txt, and leaves it alone. It does, however, compress note.txt to 188 bytes. Note that compress appends .Z to a file name when it compresses the file. The effects of compress are reversed by using uncompress, or compress -d file.ext. You don't need to include the .Z in the file name.
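For reference, the round trip looks like this; a minimal sketch using the article's note.txt (compress -d works identically to uncompress):

$ compress note.txt       # replaces note.txt with note.txt.Z
$ uncompress note.txt     # restores note.txt; the .Z may be omitted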
In this case we've eliminated 2 blocks, as note.txt compressed down to 2 blocks from 4. If you follow this logic through, you begin to realize that a small file can never be compressed below 2 blocks (or the default allocation unit for your system).
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group    3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
$ compress minutes.txt
$ compress note.txt
$ ls -ls
total 4
2 -rw-r--r-- 1 mjb group    3 Feb 04 23:31 minutes.txt
2 -rw-r--r-- 1 mjb group  188 Feb 04 23:25 note.txt.Z
$
Compressing files with tar
If you have a directory of small files that are little used, but need to remain on the system, one way to handle them is to combine them into one file, then remove the originals. If the files can be strung together, all the little files can be packed into one larger file. The obvious candidate for this combining action is the tar (tape archive) utility, so we'll try it.
Study the following listing for a moment. The tar command uses key letters to signify actions to be performed. These are a bit like command line switches, but are not preceded by a -. In this instance, the tar arguments are:
c | create a new archive
v | verbose, provide information on what you are doing
f | the next argument is the name of the archive to create
txt.tar | the archive that is being created
*.txt | the list of files to include in the archive
Immediately after you type the tar command, tar informs you that it has appended minutes.txt, which would take 1 tape block (a minutes.txt 1 tape block), and appended note.txt, which would take 3 tape blocks (a note.txt 3 tape blocks). So tar reports its results in 512-byte blocks rather than 1,024-byte double blocks.
However, there is a bit of a shock in the ls -ls command issued after the tar is complete. The new archive txt.tar is 8 blocks long. That's longer than the original 6 blocks used by the two files. The tar utility is a bit mindless: It doesn't actually string files end to end; rather, it strings blocks end to end. It also has to add directory information into txt.tar, so it's not unusual (in fact it's common) for a tar archive to be larger than the sum of its parts.
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group    3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
$ tar cvf txt.tar *.txt
a minutes.txt 1 tape block
a note.txt 3 tape blocks
$ ls -ls
total 14
2 -rw-r--r-- 1 mjb group    3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
8 -rw-r--r-- 1 mjb group 4096 Feb 05 01:40 txt.tar
$
This makes things look grim. Fortunately, although the manual says tar fills the empty space in those blocks with garbage, in practice the garbage usually takes the form of hex zeroes, or NULs. Long runs of identical bytes compress extremely well, which makes a tar archive an excellent candidate for compression.
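If you want to see that padding for yourself, od (the octal dump utility) will show it; a small sketch applied to the txt.tar archive created above:

$ od -c txt.tar | tail -4     # padding shows up as runs of \0; od collapses repeated zero lines into a *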
Proceeding to the next logical step, the following listing compresses the tar archive. The resulting file txt.tar.Z is 404 bytes long (1 allocation unit, or 2 blocks). Then, by removing the original text files, the directory contents are reduced to only 2 blocks, saving 66 percent of the space previously used.
$ ls -ls
total 14
2 -rw-r--r-- 1 mjb group    3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
8 -rw-r--r-- 1 mjb group 4096 Feb 05 01:40 txt.tar
$ compress txt.tar
$ ls -ls
total 8
2 -rw-r--r-- 1 mjb group    3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
2 -rw-r--r-- 1 mjb group  404 Feb 05 01:40 txt.tar.Z
$ rm *.txt
$ ls -ls
total 2
2 -rw-r--r-- 1 mjb group  404 Feb 05 01:40 txt.tar.Z
$
The following listing shows you how to reverse the tar-and-compress process. The tar key argument for extracting from an archive is x. The other key arguments are the same as in the earlier tar command.
$ ls -ls
total 2
2 -rw-r--r-- 1 mjb group  404 Feb 05 01:40 txt.tar.Z
$ uncompress txt.tar
$ ls -ls
total 8
8 -rw-r--r-- 1 mjb group 4096 Feb 05 01:40 txt.tar
$ tar xvf txt.tar
x minutes.txt, 3 bytes, 1 tape block
x note.txt, 1201 bytes, 3 tape blocks
$ ls -ls
total 14
2 -rw-r--r-- 1 mjb group    3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
8 -rw-r--r-- 1 mjb group 4096 Feb 05 01:40 txt.tar
$ rm txt.tar
$ ls -ls
total 6
2 -rw-r--r-- 1 mjb group    3 Feb 04 23:31 minutes.txt
4 -rw-r--r-- 1 mjb group 1201 Feb 04 23:25 note.txt
So you can use tar and compress to save lots of disk space on files that are rarely used. The GNU gzip utility offers a few extra options and generally compresses better than compress, especially on large files.
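A minimal sketch of the equivalent gzip commands, applied to the same txt.tar (the -9 flag asks for maximum compression at the cost of a little extra CPU time):

$ gzip -9 txt.tar         # produces txt.tar.gz and removes txt.tar
$ gzip -d txt.tar.gz      # restores txt.tar; gunzip txt.tar.gz does the same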
Once you've located stashes of files that are rarely used, but which must remain available, you should archive and compress them. For directories with many little files, tar them and then compress or gzip the archive. For directories with a few large files, simply compress or gzip them; you may tar them first if you like, but that will probably make little difference in the space used, though it may make them easier to administer.
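If you prefer to skip the intermediate uncompressed archive, one common approach, sketched here with the same sample files, is to write the tar archive to standard output and pipe it straight into gzip:

$ tar cvf - *.txt | gzip > txt.tar.gz     # archive and compress in one step
$ gzip -dc txt.tar.gz | tar xvf -         # later, restore everything from the archive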
Good luck squeezing more space out of those disk drives.
Resources
The gzip utility for SPARC machines: http://sunsite.queensu.ca/sun/solaris_2.5.html
GNU tar: ftp://prep.ai.mit.edu/pub/gnu/
About the author
Mo Budlong is president of King Computer Services Inc. and has been
involved in Unix development on Sun and other platforms for over 15 years.
Reach Mo at mo.budlong@sunworld.com.