Vous êtes sur la page 1sur 11

[3.

1] USING AWK FROM THE COMMAND LINE * The Awk programming language was designed to be simple but powerful. It allows a user to perform relatively sophisticated text-manipulation operations through Awk programs written on the command line. For example, suppose I want to turn a document with single-spacing into a docume nt with double-spacing. I could easily do that with the following Awk program: awk '{print ; print ""}' infile > outfile Notice how single-quotes (' ') are used to allow using double-quotes (" ") withi n the Awk expression. This "hides" special characters from the shell. We could a lso do this as follows: awk "{print ; print \"\"}" infile > outfile -- but the single-quote method is simpler. This program does what it supposed to, but it also doubles every blank line in t he input file, which leaves a lot of empty space in the output. That's easy to f ix, just tell Awk to print an extra blank line if the current line is not blank: awk '{print ; if (NF != 0) print ""}' infile > outfile * One of the problems with Awk is that it is ingenious enough to make a user wan t to tinker with it, and use it for tasks for which it isn't really appropriate. For example, we could use Awk to count the number of lines in a file: awk 'END {print NR}' infile -- but this is dumb, because the "wc (word count)" utility gives the same answer with less bother: Use the right tool for the job. Awk is the right tool for slightly more complicated tasks. Once I had a file con taining an email distribution list. The email addresses of various different gro ups were placed on consecutive lines in the file, with the different groups sepa rated by blank lines. If I wanted to quickly and reliably determine how many peo ple were on the distribution list, I couldn't use "wc", since, it counts blank l ines, but Awk handled it easily: awk 'NF != 0 {++count} END {print count}' list * Another problem I ran into was determining the average size of a number of fil es. I was creating a set of bitmaps with a scanner and storing them on a disk. T he disk started getting full and I was curious to know just how many more bitmap s I could store on the disk. I could obtain the file sizes in bytes using "wc -c" or the "list" utility ("ls -l" or "ll"). A few tests showed that "ll" was faster. Since "ll" lists the file size in the fifth field, all I had to do was sum up the fifth field and divide by NR. There was one slight problem, however: the first line of the output of "l l" listed the total number of sectors used, and had to be skipped. No problem. I simply entered: ll awk 'NR!=1 {s+=$5} END {print "Average: " s/(NR-1)}'

This gave me the average as about 40 KB per file. * Awk is useful for performing simple iterative computations for which a more so phisticated language like C might prove overkill. Consider the Fibonacci sequenc e: 1 1 2 3 5 8 13 21 34 ...

Each element in the sequence is constructed by adding the two previous elements together, with the first two elements defined as both "1". It's a discrete formu la for exponential growth. It is very easy to use Awk to generate this sequence: awk 'BEGIN {a=1;b=1; while(++x<=10){print a; t=a;a=a+b;b=t}; exit}' This generates the following output data: 1 2 3 5 8 13 21 34 55 89 BACK_TO_TOP [3.2] AWK PROGRAM FILES * Sometimes an Awk program needs to be used repeatedly. In that case, it's simpl e to execute the Awk program from a shell script. For example, consider an Awk s cript to print each word in a file on a separate line. This could be done with a script named "words" containing: awk '{c=split($0, s); for(n=1; n<=c; ++n) print s[n] }' $1 "Words" could them be made executable (using "chmod +x words") and the resulting shell "program" invoked just like any other command. For example, "words" could be invoked from the "vi" text editor as follows: :%!words This would turn all the text into a list of single words. * For another example, consider the double-spacing program mentioned previously. This could be slightly changed to accept standard input, using a "-" as describ ed earlier, then copied into a file named "double": awk '{print; if (NF != 0) print ""}' -- and then could be invoked from "vi" to double-space all the text in the edito r. * The next step would be to also allow "double" to perform the reverse operation : To take a double-spaced file and return it to single-spaced, using the option: undouble The first part of the task is, of course, to design a way of stripping out the e xtra blank lines, without destroying the spacing of the original single-spaced f ile by taking out all the blank lines. The simplest approach would be to delete every other blank line in a continuous block of such blank lines. This won't nec essarily preserve the original spacing, but it will preserve spacing in some for m. The method for achieving this is also simple, and involves using a variable name d "skip". This variable is set to "1" every time a blank line is skipped, to tel l the Awk program NOT to skip the next one. The scheme is as follows: BEGIN {set skip to 0} scan the input:

if skip == 0

if skip == 1

if line is blank skip = 1 else print the line get next line of input print the line skip = 0 get next line of input

This translates directly into the following Awk program: BEGIN {skip = 0} skip == 0 {if (NF == 0) {skip = 1} else {print}; next} skip == 1 {print; skip = 0; next} This program could be placed in a separate file, named, say, "undouble.awk", wit h the shell script "undouble" written as: awk -f undouble.awk It could also be embedded directly in the shell script, using single-quotes to e nclose the program and backslashes ("\") to allow for multiple lines: awk 'BEGIN {skip = 0} \ skip == 0 {if (NF == 0) {skip = 1} \ else {print}; \ next} \ skip == 1 {print; \ skip = 0; \ next}' Remember that when "\" is used to embed an Awk program in a script file, the pro gram appears as one line to Awk. A semicolon must be used to separate commands. For a more sophisticated example, I have a problem that when I write text docume nts, occasionally I'll somehow end up with the same word typed in twice: "And th e result was also also that ... " Such duplicate words are hard to spot on proof reading, but it is straightforward to write an Awk program to do the job, scanni ng through a text file to find duplicate; printing the duplicate word and the li ne it is found on if a duplicate is found; or otherwise printing "no duplicates found". BEGIN { dups=0; w="xy-zzy" } { for( n=1; n<=NF; n++) { if ( w == $n ) { print w, "::", $0 ; dups = 1 } ; w = $n } } END { if (dups == 0) print "No duplicates found." } The "w" variable stores each word in the file, comparing it to the next word in the file; w is initialized to "xy-zzy" since that is unlikely to be a word in th e file. The "dup" variable is initialized to 0 and set to 1 if a duplicate is fo und; if it's still 0 at the end of the end, the program prints the "no duplicate found" message. As with the previous example, we could put this into a separate file or embed it into a script file. * These last examples use variables to allow an Awk program to keep track of wha t it has been doing. Awk, as repeatedly mentioned, operates in a cycle: get a li

ne, process it, get the next line, process it, and so on; to have an Awk program remember things between cycles, it needs to leave a little message for itself i n a variable. For example, say we want to match on a line whose first field has the value 1,00 0 -- but then print the next line. We could do that as follows: BEGIN $1 == 1000 flag == 1 {flag = 0} {flag = 1; next} {print; flag = 0; next}

This program sets a variable named "flag" when it finds a line starting with 1,0 00, and then goes and gets the next line of input. The next line of input is pri nted, and then "flag" is cleared so the line after that won't be printed. If we wanted to print the next five lines, we could do that in much the same way using a variable named, say, "counter": BEGIN $1 == 1000 counter > 0 {counter = 0} {counter = 5; next} {print; counter--; next}

This program initializes a variable named "counter" to 5 when it finds a line st arting with 1,000; for each of the following 5 lines of input, it prints them an d decrements "counter" until it is zero. This approach can be taken to as great a level of elaboration as needed. Suppose we have a list of, say, five different actions on five lines of input, to be ta ken after matching a line of input; we can then create a variable named, say, "s tate", that stores which item in the list to perform next. The scheme is general ly as follows: BEGIN {set state to 0} scan the input: if match set state to 1 get next line of input if state == 1 do the first thing in the list state = 2 get next line of input if state == 2 do the second thing in the list state = 3 get next line of input if state == 3 do the third thing in the list state = 4 get next line of input if state == 4 do the fourth thing in the list state = 5 get next line of input if state == 5 do the fifth (and last) thing in the list state = 0 get next line of input This is called a "state machine". In this case, it's performing a simple list of actions, but the same approach could also be used to perform a more complicated branching sequence of actions, such as we might have in a flowchart instead of a simple list.

We could assign state numbers to the blocks in the flowchart and then use if-the n tests for the decision-making blocks to set the state variable to indicate whi ch of the alternate actions should be performed next. However, few Awk programs require such complexities, and going into more elaborate examples here would pro bably be more confusing than it's worth. The essential thing to remember is that an awk program can leave messages for itself in a variable on one line-scan cyc le to tell it what to do on later line-scan cycles. BACK_TO_TOP [3.3] A NOTE ON AWK IN SHELL SCRIPTS * Awk is an excellent tool for building UNIX/Linux shell scripts, but there are potential pitfalls. Say we have a scriptfile named "testscript", and it takes tw o filenames as parameters: testscript myfile1 myfile2 If we're executing Awk commands from a file, handling the two filenames isn't ve ry difficult. We can initialize variables on the command line as follows: cat $1 $2 awk -f testscript.awk f1=$1 f2=$2 > tmpfile The Awk program will use two variables, "f1" and "f2", that are initialized from the script command line variables "$1" and "$2". Where this measure gets obnoxious is when we are specifying Awk commands directl y, which is preferable if possible since it reduces the number of files needed t o implement a script. The problem is that "$1" and "$2" have different meanings to the scriptfile and to Awk. To the scriptfile, they are command-line parameter s, but to Awk they indicate text fields in the input. The handling of these variables depends on how Awk print fields are defined -- e ither enclosed in double-quotes (" ") or in single-quotes (' '). If we invoke Aw k as follows: awk "{ print \"This is a test: \" $1 }" $1 -- we won't get anything printed for the "$1" variable. If we instead use single -quotes to ensure that the scriptfile leaves the Awk positional variables alone, we can insert scriptfile variables by initializing them to variables on the com mand line: awk '{ print "This is a test: " $1 " / parm2 = " f }' f=$2 < $1 This provides the first field in "myfile1" as the first parameter and the name o f "myfile2" as the second parameter. Remember that Awk is relatively slow and clumsy and should not be regarded as th e default tool for all scriptfile jobs. We can use "cat" to append to files, "he ad" and "tail" to cut off a given number of lines of text from the front or back of a file, "grep" or "fgrep" to find lines in a particular file, and "sed" to d o search-replaces on the stream in the file. BACK_TO_TOP [3.4] NAWK * The original version of Awk was developed in 1977. It was optimized for throwi ng together "one-liners" or short, quick-and-dirty programs. However, some users liked Awk so much that they used it for much more complicated tasks. To quote t he language's authors: "Our first reaction to a program that didn't fit on one p age was shock and amazement." Some users regarded Awk as their primary programmi ng tool, and many had in fact learned programming using Awk.

After the authors got over their initial consternation, they decided to accept t he fact, and enhance Awk to make it a better general-purpose programming tool. T he new version of Awk was released in 1985. The new version is often, if not alw ays, known as Nawk ("New Awk") to distinguish it from the old one. * Nawk incorporates several major improvements. The most important is that users can define their own functions. For example, the following Nawk program impleme nts the "signum" function: {for (field=1; field<=NF; ++field) {print signum($field)}}; function signum(n) { if (n<0) return -1 else if (n==0) return 0 else return 1} Function declarations can be placed in a program wherever a match-action clause can. All parameters are local to the function. Local variables can be defined in side the function. * A second improvement is a new function, "getline", that allows input from file s other than those specified in the command line at invocation (as well as input from pipes). "Getline" can be used in a number of ways: getline getline myvar getline myfile getline myvar myfile command getline command getline myvar Loads Loads Loads Loads Loads Loads $0 from "myvar" $0 from "myvar" $0 from "myvar" current input. from current input. "myfile". from "myfile". output of "command". from output of "command".

* A related function, "close", allows a file to be closed so it can be read from the beginning again: close("myfile") * A new function, "system", allows Awk programs to invoke system commands: system("rm myfile") * Command-line parameters can be interpreted using two new predefined variables, ARGC and ARGV, a mechanism instantly familiar to C programmers. ARGC ("argument count") gives the number of command-line elements, and ARGV ("argument vector") is an array whose entries store the elements individually. * There is a new conditional-assignment expression, known as "?:", which is used as follows: status = (condition == "green")? "go" : "stop" This translates to: if (condition=="green") {status = "go"} else {status = "stop"} This construct should also be familiar to C programmers. * There are new math functions, such as trig and random-number functions: sin(x) cos(x) atan2(y,z) rand() srand() Sine, with x in radians. Cosine, with x in radians. Arctangent of y/x, in range -PI to PI. Random number, with 0 <= number < 1. Seed for random-number generator.

* There are new string functions, such as match and substitution functions:

match(<target string>,<search string>) Search the target string for the search string; return 0 if no match, return sta rting index of search string if match. Also sets built-in variable RSTART to the starting index, and sets built-in variable RLENGTH to the matched string's leng th. sub(<regular expression>,<replacement string>) Search for first match of regular expression in $0 and substitute replacement st ring. This function returns the number of substitutions made, as do the other su bstitution functions. sub(<regular expression>,<replacement string>,<target string>) Search for first match of regular expression in target string and substitute rep lacement string. gsub(<regular expression>,<replacement string>) Search for all matches of regular expression in $0 and substitute replacement st ring. gsub(<regular expression>,<replacement string>,<target string>) Search for all matches of regular expression in target string and substitute rep lacement string. * There is a mechanism for handling multidimensional arrays. For example, the fo llowing program creates and prints a matrix, and then prints the transposition o f the matrix: BEGIN {count = 1; for (row = 1; row <= 5; ++row) { for (col = 1; col <= 3; ++col) { printf("%4d",count); array[row,col] = count++; } printf("\n"); } printf("\n"); for (col = 1; col <= 3; ++col) { for (row = 1; row <= 5; ++row) { printf("%4d",array[row,col]); } printf("\n"); } exit; } This yields: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1 2 3 4 5 6 7 10 13 8 11 14 9 12 15

Nawk also includes a new "delete" function, which deletes array elements: delete(array[count]) * Characters can be expressed as octal codes. "\033", for example, can be used t

o define an "escape" character. * A new built-in variable, FNR, keeps track of the record number of the current file, as opposed to NR, which keeps track of the record number of the current li ne of input, regardless of how many files have contributed to that input. Its be havior is otherwise exactly identical to that of NR. * While Nawk does have useful refinements, they are generally intended to suppor t the development of complicated programs. My own feeling is that Nawk represent s overkill for all but the most dedicated Awk users, and in any case would requi re a substantial document of its own to do its capabilities justice. Those who w ould like to know more about Nawk are encouraged to read THE AWK PROGRAMMING LAN GUAGE by Aho / Weinberger / Kernighan. This short, terse, detailed book outlines the capabilities of Nawk and provides sophisticated examples of its use. BACK_TO_TOP [3.5] AWK QUICK REFERENCE GUIDE * This final section provides a convenient lookup reference for Awk programming. * Invoking Awk: awk [-F<ch>] {pgm} -- where: ch: pgm: pgm file: vars: data file: {-f <pgm file>} [<vars>] [- <data file>]

Field-separator character. Awk command-line program. File containing an Awk program. Awk variable initializations. Input data file. program: {<initializations>} {<program actions>} {<program actions>} {<final actions>}

* General form of Awk BEGIN <search pattern 1> <search pattern 2> ... END

* Search patterns: /<string>/ Search for string. /^<string>/ Search for string at beginning of line. /<string>$/ Search for string at end of line. The search can be constrained to particular fields: $<field> ~ /<string>/ Search for string in specified field. $<field> !~ /<string>/ Search for string \Inot\i in specified field. Strings can be ORed in a search: /(<string1>) (<string2>)/ The search can be for an entire range of lines, bounded by two strings: /<string1>/,/<string2>/ The search can be for any condition, such as line number, and can use the follow ing comparison operators: == != < > <= >= Different conditions can be ORed with " " or ANDed with "&&". [<charlist or range>] Match on any character in list or range.

[^<charlist or range>] Match on any character not in list or range. . Match any single character. * Match 0 or more occurrences of preceding string. ? Match 0 or 1 occurrences of preceding string. + Match 1 or more occurrences of preceding string. If a metacharacter is part of the search string, it can be "escaped" by precedin g it with a "\". * Special characters: \n Newline (line feed).

Backspace. \r Carriage return. \f Form feed. A "\" can be embedded in a string b y entering it twice: "\\". * Built-in variables: $0; $1,$2,$3,... NR NF FILENAME FS RS OFS ORS OFMT Field variables. Number of records (lines). Number of fields. Current input filename. Field separator character (default: " "). Record separator character (default: "\n"). Output field separator (default: " "). Output record separator (default: "\n"). Output format (default: "%.6g").

* Arithmetic operations: + Addition. - Subtraction. * Multiplication. / Division. % Mod. ++ Increment. -- Decrement. Shorthand x += 2 x -= 2 x *= 2 x /= 2 x %= 2 assignments: -- is the same -- is the same -- is the same -- is the same -- is the same as: as: as: as: as: x x x x x = = = = = x x x x x + * / % 2 2 2 2 2

* The only unique string operation is concatenation, which is performed simply b y listing two strings connected by a blank space. * Arithmetic functions: sqrt() log() exp() int() Square root. Base \Ie\i log. Power of \Ie\i. Integer part of argument.

* String functions: length() Length of string. substr(<string>,<start of substring>,<max length of substring>) Get substring.

split(<string>,<array>,[<field separator>]) Split string into array, with initial array index being 1. index(<target string>,<search string>) Find index of search string in target string. sprintf() Perform formatted print into string. * Control structures: if (<condition>) <action 1> [else <action 2>] while (<condition>) <action> for (<initial action>;<condition>;<end-of-loop action>) <action> Scanning through an associative array with "for": for (<variable> in <array>) <action> Unconditional break continue next exit control statements: Break out of "while" or "for" loop. Perform next iteration of "while" or "for" loop. Get and scan next line of input. Finish reading input and perform END statements. Print items separated by OFS; end with newline. Print items concatenated; end with newline.

* Print: print <i1>, <i2>, ... print <i1> <i2> ... * Printf(): General format:

printf(<string with format codes>,[<parameters>]) Newlines must be explicitly specified with a "\n". General form of format code: %[<number>]<format code> The optional "number" can consist of: A leading "-" for left-justified output. An integer part that specifies the minimum output width. (A leading "0" causes t he output to be padded with zeroes.) A fractional part that specifies either the maximum number of characters to be p rinted (for a string), or the number of digits to be printed to the right of the decimal point (for floating-point formats). The format codes are: d o x c s e f Prints Prints Prints Prints Prints Prints Prints a a a a a a a number in decimal format. number in octal format. number in hexadecimal format. character, given its numeric code. string. number in exponential format. number in floating-point format.

Prints a number in exponential or floating-point format.

* Awk can perform output redirection (using ">" and ">>") and piping (using " ") from both "print" and "printf".