Processing files with awk
The awk processing utility can practically be used as a programming language, but first you need to learn its simpler features. In the first of two columns on awk, we show you how it breaks records into fields and how to execute more than one set of commands on a record.
Awk is a text processing utility that runs through a text file by reading and processing a record at a time. We start with the basics and move to executing more than one set of commands with awk. We give you multiple examples of awk processes. (2,700 words)
There have been several requests for information on awk, and I happen to like it as a utility, so this column and next month's column will cover awk.
Awk is a flexible text processing utility that can be used almost as a programming language. You can do a great deal with awk once you learn just a few of its simple features.
Awk runs through a text file by reading and processing one record at a time. Its commands are written with the intention that they act repetitively on each record as it is read into awk. A record that has been read by awk is broken into separate fields, and actions can be performed on the separate fields as well as on the whole record.
The actions or steps to be performed on the fields in each record or on the whole record make up an awk program or an awk script.
When you type awk as a command, you must also provide two additional pieces of information or arguments. The first is the program or script to be executed, and the second is some method of identifying the file on which to perform the actions. Awk can also be used in a pipeline, in which case it reads the output of the previous command and the file does not need to be explicitly named on the command line.
Starting with the basics
Let's start with a simple awk command in Figure 1 to get a better
idea of how it works.
Figure 1
ls -l|awk '{print}'
The output of the ls -l command has been piped into awk and is the "file" to be processed. There is no need to name a file in the awk portion of the command line. The awk program or script is one command, {print}. This example doesn't do much. It takes the whole record that was sent to awk and prints it on the screen. This simple command does partially illustrate the record-by-record action of awk. For each record received by the awk program (each line of the output of the ls -l command), the print instruction is executed. It is important to remember this action by awk. Each record is read, then for each record, the instructions in the program are executed.
The output of this program is pretty uninteresting and will look something like Figure 2 depending on the contents of your directory.
Figure 2
-rw-r--r-- 1 mjb group  109 Mar 09 18:32 store.dat
-rw-r--r-- 1 mjb group   93 Mar 09 18:31 store.sav
-rwxr-xr-x 1 mjb group 3058 Mar 09 18:29 store.txt
-rw-r--r-- 1 mjb group   89 Mar 09 18:32 sort.dat
-rw-r--r-- 1 mjb group  193 Mar 09 18:31 sort.sav
-rwxr-xr-x 1 mjb group 2068 Mar 09 18:29 sort.txt
-rw-r--r-- 1 mjb group   20 Mar 09 18:31 palet.txt
So far nothing very exciting has happened. In fact, this is exactly the same output as the simple ls -l command. Obviously there must be more to awk.
Onward! Breaking down into fields
Awk automatically breaks a record into fields. By default, awk treats any run of spaces or tabs as the delimiter between fields. In Figure 2, field 1 is "-rw-r--r--" for the first record, field 2 is "1," field 3 is "mjb," and so on.
When awk reads in a record and breaks the contents of the record into fields, it assigns a variable name to each field. These variable names are a dollar sign ($) followed by the number of the field, counting from left to right. The variable $1 represents the contents of field 1, which for the first record in Figure 2 is "-rw-r--r--"; $2 represents field 2, which is "1" in Figure 2, and so on. The awk variables $1 or $2 through $nn represent the fields of each record and should not be confused with shell variables that use the same style of names. Inside an awk script $1 refers to field 1 of a record; $2 to field 2 of a record.
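To make the field numbering concrete, here is a minimal sketch that feeds awk a single record and prints three of its fields. The record is a hypothetical sample typed in the style of Figure 2 rather than real ls output, and the variable names line and fields are only illustrative.

```shell
# Hypothetical sample record in the style of Figure 2 (not real ls output).
line='-rw-r--r-- 1 mjb group 109 Mar 09 18:32 store.dat'

# Feed the record to awk and print fields 1, 3, and 9, one per output line.
fields=$(printf '%s\n' "$line" | awk '{print $1; print $3; print $9}')
printf '%s\n' "$fields"
```

The three output lines should be the permissions string, the owner, and the file name, in that order.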
In the first awk example, the print command on its own caused the entire record to be printed. The print command followed by specific field variables prints only the fields named by those variables, instead of the entire record. Let's look at an example. To extract the owner, size, and file name from the output of an ls -l file listing, you need to print only fields 3, 5, and 9. The command for doing this is illustrated in Figure 3. Note that $3, $5, and $9 appear inside the awk script '{print $3 $5 $9}' and are therefore interpreted by awk as awk field variables. The single quotes protect the awk field variables from the shell, so the shell makes no attempt to expand them. It is good practice to get in the habit of putting opening and closing single quotes around awk commands to protect them from shell expansion.
Figure 3
ls -l|awk '{print $3 $5 $9}'
The problem with the output of this command is shown in Figure 4. There are no spaces between the fields, because awk joins values that are simply written one after another in a print command.
Figure 4
mjb109store.dat
mjb93store.sav
mjb3058store.txt
mjb89sort.dat
mjb193sort.sav
mjb2068sort.txt
mjb20palet.txt
One way around this is to embed literals in the print line as in Figure 5, which puts spaces in the output lines, producing the output shown in Figure 6.
Figure 5
ls -l|awk '{print $3 " " $5 " " $9}'
Figure 6
mjb 109 store.dat
mjb 93 store.sav
mjb 3058 store.txt
mjb 89 sort.dat
mjb 193 sort.sav
mjb 2068 sort.txt
mjb 20 palet.txt
This provides some spacing, but the fields don't line up very well. One simple way to improve alignment is to embed tabs in the literals instead of spaces. Repeat the command line in Figure 5, but instead of pressing the space bar between the double quotes, press the TAB key. You will not see any characters on the screen, but the double quotes will be separated by what appear to be more spaces. These "more" spaces are actually a tab character. The result will look something like Figure 7. Figure 8 is an example of the output.
Figure 7
ls -l|awk '{print $3 " " $5 " "$9}'
Figure 8
mjb     109     store.dat
mjb     93      store.sav
mjb     3058    store.txt
mjb     89      sort.dat
mjb     193     sort.sav
mjb     2068    sort.txt
mjb     20      palet.txt
In one more variation, we can switch the order of the fields during printing as in the listing in Figure 9 and the output in Figure 10. In this and subsequent examples I will use the ^ (caret) character to indicate a press of the TAB key.
Figure 9
ls -l|awk '{print $9 " ^" $5 " ^"$3}' (<-- note ^ = TAB key)
Figure 10
store.dat       109     mjb
store.sav       93      mjb
store.txt       3058    mjb
sort.dat        89      mjb
sort.sav        193     mjb
sort.txt        2068    mjb
palet.txt       20      mjb
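Typing a literal tab between the quotes can be awkward, because an interactive shell may use the TAB key for something else. As a possible alternative, awk string literals accept the "\t" escape for a tab. The sketch below assumes a small hypothetical input in place of ls -l, so the result does not depend on your directory; the function name sample and the variable tabbed are made up for illustration.

```shell
# Hypothetical sample input standing in for ls -l output (same shape as Figure 2).
sample() {
  printf '%s\n' \
    '-rw-r--r-- 1 mjb group 109 Mar 09 18:32 store.dat' \
    '-rwxr-xr-x 1 mjb group 3058 Mar 09 18:29 store.txt'
}

# "\t" inside an awk string literal is a tab, so no TAB key press is needed.
tabbed=$(sample | awk '{print $9 "\t" $5 "\t" $3}')
printf '%s\n' "$tabbed"
```

The output has the same shape as Figure 10: file name, size, and owner, separated by tabs.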
Executing more than one set of commands
Figure 11 adds two more features of awk. You may execute more than
one set of commands on a record by separating the commands with a
semicolon (;), and awk allows flexible use of user-defined
variables within scripts. In this example a variable is used to keep
a running record of the total number of bytes displayed in each line
so far. As each record is processed, field $5 is summed into the
variable ttl before the printing takes place; then as the fields are
printed, the ttl variable is printed on each line as a running total
of bytes for the sizes of files.
The variable ttl is initialized to zero the first time it is used. Since the ttl variable is accessed once each time a record is read, it is accessed for the first time when the first record is read. When this first read happens, and the first reference to variable ttl is made, ttl is automatically set to zero. The syntax "ttl += $5" is borrowed from C. In other programming languages it would be necessary to write something like this:
add $5 to ttl
or
ttl = ttl + $5
Awk uses += as a shorthand for "add to."
Awk initializes all variables to 0 when they are used for numbers and to "" when they are used for string storage. Awk is flexible about its variables, and you do not have to identify them as numeric or string types before using them. The ttl variable could have been used as a string holder, but since it is used for numeric information it starts life as a zero when the first record is read, and thereafter immediately has the contents of field $5 added to it.
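The automatic initialization can be checked with a small experiment. This is only a sketch: the input numbers are made up, and the variable names running, unset_demo, s, and n are hypothetical. Neither ttl, s, nor n is assigned a value before its first use.

```shell
# ttl is never assigned before its first use, so it starts as 0.
running=$(printf '%s\n' 10 20 30 | awk '{ttl += $1; print ttl}')
printf '%s\n' "$running"

# An unassigned variable used as a string is empty; used as a number it is 0.
unset_demo=$(echo x | awk '{print "[" s "]"; print n + 0}')
printf '%s\n' "$unset_demo"
```

The first command prints the running totals 10, 30, and 60. The second prints an empty pair of brackets followed by 0.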
As a note on Figure 11, press the TAB key after the double quote but before "Total."
Figure 11
ls -l|awk '{ttl+=$5; print $9 " ^" $5 " ^"$3 " ^Total " ttl " bytes"}'
Figure 12 is the output of Figure 11.
Figure 12
store.dat       109     mjb     Total 109 bytes
store.sav       93      mjb     Total 202 bytes
store.txt       3058    mjb     Total 3260 bytes
sort.dat        89      mjb     Total 3349 bytes
sort.sav        193     mjb     Total 3542 bytes
sort.txt        2068    mjb     Total 5610 bytes
palet.txt       20      mjb     Total 5630 bytes
Line splitting
Awk examples are gradually getting too long for a single line, so we
will have to start splitting the lines. If you are not using the C
shell, one way to do this is to press enter after you have typed the
initial opening single quote before the awk commands. The line will
be continued allowing you to enter one or more commands until the
final closing single quote is typed. This can be used to break an
awk script or program into several separate lines. Figure 13 is an
example. In Figure 14 the output is identical to Figure 12.
Figure 13
ls -l|awk '          <- once the open quote is typed, press enter
{ttl+=$5;            <- and continue on the next lines
print $9 " ^" $5 " ^"$3 " ^Total " ttl " bytes"} '          <- until the final closing quote
Figure 14
store.dat       109     mjb     Total 109 bytes
store.sav       93      mjb     Total 202 bytes
store.txt       3058    mjb     Total 3260 bytes
sort.dat        89      mjb     Total 3349 bytes
sort.sav        193     mjb     Total 3542 bytes
sort.txt        2068    mjb     Total 5610 bytes
palet.txt       20      mjb     Total 5630 bytes
For the C shell, use the backslash as the line continuation character as shown in Figure 15. Further examples will assume that you are using sh, ksh, or one of their derivatives. If you are using csh, be sure to include the backslash continuation characters.
Figure 15
ls -l|awk ' \          <- use the backslash to force a continuation
{ttl+=$5; \            <- on each line
print $9 " ^" $5 " ^"$3 " ^Total " ttl " bytes"} \
'                      <- until the final closure, then press enter
A running total is fine, but what I really wanted here was a total bytes count at the end of the listing.
Although the awk default is to perform all commands on each record, awk also allows actions to be performed before the first record is read, and/or after the last record is processed. Commands to be executed before the first record or after the last record are set off by the keywords BEGIN and END. Figure 16 is an example of the END keyword. The values in field $5 are still accumulated in the ttl variable, but the total in ttl is printed as part of the END action instead of with each record.
Figure 16
ls -l|awk '
{ttl+=$5; print $9 " ^" $5 " ^"$3}
END{print "Total " ttl " bytes"}'
Figure 17 is the output of Figure 16 and you will see that the total is printed as a final line after the last directory entry.
Figure 17
store.dat       109     mjb
store.sav       93      mjb
store.txt       3058    mjb
sort.dat        89      mjb
sort.sav        193     mjb
sort.txt        2068    mjb
palet.txt       20      mjb
Total 5630 bytes
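If you want to try the END action without depending on the contents of your own directory, a sketch along the following lines may help. The function listing is a hypothetical stand-in for ls -l that prints three fixed records, the variable report is illustrative, and the "\t" escape is used in place of a typed tab.

```shell
# Hypothetical stand-in for ls -l output, so the result does not depend on your directory.
listing() {
  printf '%s\n' \
    '-rw-r--r-- 1 mjb group 109 Mar 09 18:32 store.dat' \
    '-rw-r--r-- 1 mjb group 93 Mar 09 18:31 store.sav' \
    '-rwxr-xr-x 1 mjb group 3058 Mar 09 18:29 store.txt'
}

# Sum field 5 on every record; print the sum once, after the last record.
report=$(listing | awk '
{ttl+=$5; print $9 "\t" $5 "\t" $3}
END{print "Total " ttl " bytes"}')
printf '%s\n' "$report"
```

With these three records the last line reads "Total 3260 bytes," since 109 + 93 + 3058 = 3260.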
Figure 18 adds the use of the BEGIN keyword, and Figure 19 shows the output with the heading created by the BEGIN action.
Figure 18
ls -l|awk '
BEGIN{print "Custom Directory Listing"}
{ttl+=$5; print $9 " ^" $5 " ^"$3}
END{print "Total " ttl " bytes"}'
Figure 19
Custom Directory Listing

store.dat       109     mjb
store.sav       93      mjb
store.txt       3058    mjb
sort.dat        89      mjb
sort.sav        193     mjb
sort.txt        2068    mjb
palet.txt       20      mjb
Total 5630 bytes
Figure 20 is a pseudo-listing of the three parts of the awk script. The middle section is marked "each record," but this is not an awk keyword. It is inserted to make the pseudo-listing clearer.
Figure 20
ls -l|awk '
BEGIN        {print "Custom Directory Listing"}
each record  {ttl+=$5;print $9 " ^" $5 " ^"$3}
END          {print "Total " ttl " bytes"}'
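The order in which the three parts run can be seen with a tiny experiment. This is only an illustrative sketch: the two input records "a" and "b" and the printed messages are made up, and the variable order is hypothetical.

```shell
# Two made-up records, "a" and "b", are enough to show the order of execution.
order=$(printf '%s\n' a b | awk '
BEGIN {print "before any record"}
      {print "record: " $1}
END   {print "after the last record"}')
printf '%s\n' "$order"
```

The BEGIN message appears once at the top, the middle action runs once for each record, and the END message appears once at the bottom.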
Take another look at Figure 19 for an additional problem that can be fixed with a feature of awk. There is a blank line between "Custom Directory Listing" and the line containing the first file. Why? I fudged a bit in the earlier part of this article. The real result of an ls -l actually looks more like Figure 21. The total blocks are listed on the first line.
Figure 21
total 18
-rw-r--r-- 1 mjb group  109 Mar 09 18:32 store.dat
-rw-r--r-- 1 mjb group   93 Mar 09 18:31 store.sav
-rwxr-xr-x 1 mjb group 3058 Mar 09 18:29 store.txt
-rw-r--r-- 1 mjb group   89 Mar 09 18:32 sort.dat
-rw-r--r-- 1 mjb group  193 Mar 09 18:31 sort.sav
-rwxr-xr-x 1 mjb group 2068 Mar 09 18:29 sort.txt
-rw-r--r-- 1 mjb group   20 Mar 09 18:31 palet.txt
Awk sees the line containing "total 18" as the first record that it processes. This first record only has fields $1 and $2, so fields $3, $5, and $9 are blank for the first record. The print command on this first record is actually printing 3 blank fields from the first record. These show up as a single blank line, but this single line provides an opportunity to show another part of the awk language.
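One way to confirm that the first record is short is to ask awk how many fields it found. The sketch below relies on NF, a built-in awk variable that holds the number of fields in the current record; NF is not otherwise used in this column. The two input lines are hypothetical samples in the style of Figure 21, and the variable counts is illustrative.

```shell
# NF is awk's built-in count of fields in the current record.
# The brackets make an empty $3 visible in the output.
counts=$(printf '%s\n' \
  'total 18' \
  '-rw-r--r-- 1 mjb group 109 Mar 09 18:32 store.dat' |
  awk '{print NF ": [" $3 "]"}')
printf '%s\n' "$counts"
```

The "total 18" record reports 2 fields and an empty $3, while the file record reports 9 fields and the owner name.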
"If" tests and conditions
An if test can be used to eliminate an unwanted record. Figure 22 includes an if test, which uses the next statement, on line 3. The if test is straightforward except that awk uses "==" (equal equal) for "is equal to." In English this would read, "If the first field is equal to "total"..."
The next statement causes awk to skip all further actions on this record and to loop back to the top of the logic that reads the next input record.
Figure 22
ls -l|awk '
BEGIN{print "Custom Directory Listing"}
{if($1 == "total") next;
ttl+=$5; print $9 " ^" $5 " ^"$3}
END{print "Total " ttl " bytes"}'
Figure 23 is an illustration of the steps that happen in awk record processing as the if condition is tested, and what the next statement does. Note that step 1 in the illustration, read a record, is the automatic default action of awk, and there is no awk command to read a record.
Figure 23 The logic in an if-next statement
1. read a record              < the automatic action in awk
2. { if ($1 == "total")       < test the first field
3.     next;                  < if true go to step 1
4.   ttl += $5;               < otherwise continue
5.   (rest of the code)
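The effect of the if test and next can be tried on fixed input. In this sketch the function listing_with_total is a hypothetical stand-in for ls -l that includes the leading "total" line, the variable filtered is illustrative, and "\t" is used in place of a typed tab.

```shell
# Hypothetical stand-in for ls -l output, including the leading "total" line.
listing_with_total() {
  printf '%s\n' \
    'total 18' \
    '-rw-r--r-- 1 mjb group 109 Mar 09 18:32 store.dat' \
    '-rw-r--r-- 1 mjb group 93 Mar 09 18:31 store.sav'
}

# Skip the "total" record, then accumulate and print as before.
filtered=$(listing_with_total | awk '
BEGIN{print "Custom Directory Listing"}
{if($1 == "total") next;
 ttl+=$5; print $9 "\t" $5 "\t" $3}
END{print "Total " ttl " bytes"}')
printf '%s\n' "$filtered"
```

The heading is now followed directly by the first file, with no blank line, and the final line reads "Total 202 bytes."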
Even simple if tests such as the one shown here can add a powerful tool to awk processes.
This is about all I have space for in this edition, so join me next month for some more advanced features in awk, including better formatting and processing of files whose fields are not separated by spaces.
About the author
Mo Budlong is president of King Computer Services, Inc. and has been involved in Unix development on Sun and other platforms for over 15 years. King Computer Services, Inc. specializes in Unix and client/server consulting and training and currently publishes the COBOL Just In Time Course, a crash course for the Year 2000 problem.
Reach Mo at mo.budlong@sunworld.com.
URL: http://www.sunworld.com/swol-04-1997/swol-04-unix101.html