Monday, November 9, 2009

AWK Help

UNIX Utilities - awk
The awk command line is:
awk [program | -f programfile] [flags/variables] [files]
Command line flags
-f file -- Read the awk script from the specified file rather than the command line
-F re -- Use the given regular expression re as the field separator rather than the default "white space"
variable=value -- Initialize the awk variable with the specified value
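For example, a typical invocation might combine several of these pieces; the program file name and the variable here are made up for illustration:
awk -F: -f report.awk dept=sales /etc/passwd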
An awk program consists of one or more awk commands separated by either \n or semicolons.
The structure of awk commands
Each awk command consists of a selector and/or an action; at least one of the two must be present. Braces surround the action.
selector only -- the implied action is print
{action} only -- the implied selector is every line
selector {action} -- perform action on each line where selector is true
Each action may have multiple statements separated from each other by semicolons or \n
Line selection
A selector is either zero, one, or two selection criteria; in the latter case the criteria are separated by commas
A selection criterion may be either an RE or a boolean expression (BE) which evaluates to true or false
Commands which have no selection criteria are applied to each line of the input data set
Commands which have one selection criterion are applied to every line which matches or makes true the criterion depending upon whether the criterion is an RE or a BE
Commands which have two selection criteria are applied to the first line which matches the first criterion, the next line which matches the second criterion and all the lines between them.
Unless a prior applied command has a next in it, every selector is tested against every line of the input data set.
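As a small illustration of the two-criterion (range) form, the following one-liner prints every line from the first line matching START through the next line matching END; the marker strings are hypothetical:
/START/,/END/ { print }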
Processing
The BEGIN block(s) is(are) run (mawk's -v runs first)
Command line variables are assigned
For each line in the input data set:
    the line is read and NR, NF, $1, etc. are set
    for each command, its criteria are evaluated
    if the criteria match (or evaluate to true), the command is executed
After the input data set is exhausted, the END block(s) is(are) run
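A minimal skeleton showing that flow, assuming a file of numbers with the value in the first field (the file name is made up):
awk 'BEGIN { total = 0 }            # runs before any input is read
     { total += $1 }                # runs once for every input line
     END { print "total:", total }  # runs after the last line
' numbers.txt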
Elementary awk programming
Constants
Strings are enclosed in quotes (")
Numbers are written in the usual decimal way;
non-integer values are indicated by including a period (.) in the representation.
REs are delimited by /
Variables
Need not be declared
May contain any type of data, their data type may change over the life of the program
Are named as any token beginning with a letter and continuing with letters, digits and underscores
As in C, case matters; since the built-in variables are all uppercase, avoid that form for your own names.
Some of the commonly used built-in variables are:
NR -- The current line's sequential number
NF -- The number of fields in the current line
FS -- The input field separator;
defaults to whitespace and is reset by the -F command line parameter
Fields - Each record is separated into fields named $1, $2, etc
$0 is the entire record
NF contains the number of fields in the current line
FS contains the field separator RE; it defaults to the white space RE, /[ \t]+/
Fields may be accessed either by $n or by $var where var contains a value between 0 and NF
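Two quick one-line programs illustrating both forms of field access (each is a complete awk program on its own):
{ print $NF }          # $n form with a built-in variable: the last field of each line
{ n = 2; print $n }    # $var form: n holds 2, so this is the same as print $2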
print/printf
print prints each of the values of $1 through $NF separated by OFS then prints a \n onto stdout; the default value of OFS is a blank
print value value ... prints the value(s) in order and then puts out a \n onto stdout;
printf(format,value,value,...) prints the value(s) using the format supplied onto stdout, just like C. There is no default \n for each printf so multiples can be used to build a line. There must be as many values in the list as there are item descriptors in format.
Values in print or printf may be constants, variables, or expressions in any order
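A small printf sketch, assuming the second and third fields are numeric (the widths are arbitrary):
{ printf("%-8s %6d %8.2f\n", $1, $2, $3) }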
Operators - awk has many of the same operators as C, excepting the bit operators. It also adds some text processing operators.
Built-in functions
substr(s,p,l) -- The substring of s starting at p and continuing for l characters
index(s1,s2) -- The first location of s2 within s1; 0 if not found
length(e) -- The length of e, converted to character string if necessary, in bytes
sin, cos, tan -- Standard C trig functions
atan2(y,x) -- Standard quadrant oriented arctangent function
exp, log -- Standard C exponential functions
srand(s), rand() -- Random number seed and access functions
Elementary examples and uses
length($0)>72 -- print all of the lines whose length exceeds 72 bytes
{$2="";print} -- remove the second field from each line
{print $2} -- print only the second field of each line
/Ucast/{print $1 "=" $NF} -- for each line which contains the string 'Ucast' print the first variable, an equal sign and the last variable (awk code to create awk code; a common trick)
BEGIN{FS="/"};NF<4 -- set the field separator to "/" before any input is read, then print every line that has fewer than four fields
Advanced awk programming
Program structure (if, for, while, etc.)
if(boolean) statement1 else statement2 if the boolean expression evaluates to true execute statement1, otherwise execute statement 2
for(v=init;boolean;v change) statement Standard C for loop, assigns v the value of init then while the boolean expression is true executes the statement then the v change
for(v in array) statement Assigns to v each of the values of the subscripts of array, not in any particular order, then executes statement
while(boolean) statement While the boolean expression is true, execute the statement
do statement while(boolean) execute statement, evaluate the boolean expression and if true, repeat
statement in any of the above constructs may be either a simple statement or a series of statements enclosed in {}, again like C; a further requirement is that the opening { must be on the line with the beginning keyword (if, for, while, do) either physically or logically via \ .
break -- exit from an enclosing for or while loop
continue -- restart the enclosing for or while loop from the top
next -- stop processing the current record, read the next record and begin processing with the first command
exit -- terminate all input processing and, if present, execute the END command
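A short sketch using next and exit together; the two-line header and the 'total' marker line are assumptions about the input:
NR <= 2   { next }    # skip a two-line header
/^total/  { exit }    # stop reading here; the END block, if any, still runs
{ print $1 }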
Arrays
There are two types of arrays in awk - standard and generalized
Standard arrays take the usual integer subscripts, starting at 0 and going up; multidimensional arrays are allowed and behave as expected
Generalized arrays take any type of variable(s) as subscripts, but the subscript(s) are treated as one long string expression.
The use of for(a in x) on a generalized array will return all of the valid subscripts in some order, not necessarily the one you wished.
The subscript separator is called SUBSEP and has a default value of "\034" (a non-printing character)
Elements can be deleted from an array via the delete array[subscript] statement
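A minimal sketch of a generalized array, using SUBSEP to take the subscript back apart; it assumes the first two fields form a meaningful pair:
{ ct[$1,$2]++ }                    # generalized (two-part) subscript
END {
    for (k in ct) {
        split(k, p, SUBSEP)        # recover the two subscript parts
        print p[1], p[2], ct[k]
    }
}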
Built-in variables
FILENAME -- The name of the file currently being processed
OFS -- Output Field Separator default ' '
RS -- Input Record Separator default \n
ORS -- Output Record Separator default \n
FNR -- Current line's number with respect to the current file
OFMT -- Output format for printed numbers default %.6g
RSTART -- The location of the data matched using the match built-in function
RLENGTH -- The length of the data matched using the match built-in function
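A small sketch using FNR and FILENAME, assuming two or more input files are named on the command line:
FNR == 1 { print "=== " FILENAME " ===" }   # banner at the start of each file
{ print FNR ": " $0 }                       # per-file line numbering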
Built-in functions
gsub(re,sub,str) -- replace, in str, each occurrence of the regular expression re with sub; return the number of substitutions performed
int(expr) -- return the value of expr with all fractional parts removed
match(str,re) -- return the location in str where the regular expression re occurs and set RSTART and RLENGTH; if re is not found return 0
split(str,arrname,sep) -- split str into pieces using sep as the separator and assign the pieces in order to the elements from 1 up of arrname; use FS if sep is not given
sprintf(format,value,value,...) -- write the values, as the format indicates, into a string and return that string
sub(re,sub,str) -- replace, in str, the first occurrence of the regular expression re with sub; return 1 if successful, 0 otherwise
system(command) -- pass command to the local operating system to execute and return the exit status code returned by the operating system
tolower(str) -- return a string similar to str with all capital letters changed to lower case
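A sketch exercising a few of these functions together; the input format is arbitrary and the colon split is only for illustration:
{
    copy = $0
    sub(/^[ \t]+/, "", copy)                 # strip leading whitespace, if any
    if (match(copy, /[0-9]+/))
        printf("first number at column %d, length %d\n", RSTART, RLENGTH)
    k = split(copy, parts, ":")
    print "colon-separated pieces:", k
}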
Other file I/O
print and printf may have > (or >>) filename or command appended and the output will be sent to the named file or command; once a file is opened, it remains open until explicitly closed
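A minimal sketch of routing output to more than one file and closing both at the end; the file names and the numeric third field are assumptions:
$3 >  100 { print $1 > "big.txt" }
$3 <= 100 { print $1 > "small.txt" }
END { close("big.txt"); close("small.txt") }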
getline var < "file" -- read the next record from the named file into var
Advanced examples and uses
{ split($1,t,":")
$1 = (t[1]*60+t[2])*60+t[3]
print
}
Replaces an HH:MM:SS time stamp in the first field with a seconds since midnight value which can be more easily plotted, computed with, etc.
{ for(i = 1; i <= NF; i++) ct[$i] += 1 }
END { for(w in ct) printf("%6d %s\n", ct[w], w) }

This reads a file of text and creates a file containing each unique word along with the number of occurrences of the word in the text.

NR==1 { t0 = $1; tp = $1; for(i = 1; i <= nv; i++) dp[i] = $(i+1); next }
{ dt = $1 - tp; tp = $1
  printf("%d ", $1 - t0)
  for(i = 1; i <= nv; i++) {
    printf("%d ", ($(i+1) - dp[i]) / dt)
    dp[i] = $(i+1)
  }
  printf("\n")
}

Take a set of time stamped data and convert the data from absolute time and counts to relative time and average counts; nv is the number of count columns and is expected to be set on the command line. The data is presumed to be all amenable to treatment as integers. If not, formats other than %d must be used.

BEGIN { printf("set term postscript\n") > "plots"
{ if(system("test -s " $1 ".r")) {
print "process1 " $1 ".r " $2
printf("plot '%s.data' using 2:5 title '%s'",\
$1,$3) >> "plots"
}
}
END { print "gnuplot < i =" index(" a="$0" hold =" hold" hold ="" hold = "\f" hold = "\n" hold = "\n\n" unfirst =" 1" hold =" hold" b="" ll="="0)" ll="72" nf="="0" b="" b="substr(b,1,length(b)-1)" b="b">ll) {
i = ll
while(substr(b,i,1)=" ") I--
print substr(b,1,i-1)
b = substr(b,i+1)
}
}
END { print b; print "" }

This will take an arbitrary stream of text (where paragraphs are indicated by consecutive \n) and make all the lines approximately the same length. The default output line length is 72, but it may be set via a parameter on the awk command line. Both long and short lines are taken care of but extra spaces/tabs within the text are not correctly handled.

BEGIN { FS = "\t" # make tab the field separator
printf("%10s %6s %5s %s\n\n",
"COUNTRY", "AREA", "POP", "CONTINENT")
}
{ printf("%10s %6d %5d %s\n", $1, $2, $3, $4)
area = area +$2
pop = pop + $3
}
END { printf("\n%10s %6d %5d\n", "TOTAL", area, pop) }

This will take a variable width table of data with four tab separated fields and print it as a fixed length table with headings and totals.
Important things which will bite you
$1 inside the awk script is not $1 of the shell script; use variable assignment on the command line to move data from the shell to the awk script.
Actions are within {}, not selections
Every selection is applied to each input line after the previously selected actions have occurred; this means that a previous action can cause unexpected selections or selection misses.
Operators " " The blank is the concatenation operator
+ - * / % All of the usual C arithmetic operators, add, subtract, multiply, divide and mod.
== != < <= > >= All of the usual C relational operators: equal, not equal, less than, less than or equal, greater than and greater than or equal
&& || The C boolean operators and and or
= += -= *= /= %= The C assignment operators
~ !~ Matches and doesn't match
?: C conditional value operator
^ Exponentiation
++ -- Variable increment/decrement
Note the absence of the C bit operators &, |, << and >>
[s]printf format items
Format strings in the printf statement and sprintf function consist of three different types of items: literal characters, escaped literal characters and format items. Literal characters are just that: characters which will print as themselves. Escaped literal characters begin with a backslash (\) and are used to represent control characters; the common ones are: \n for new line, \t for tab and \r for return. Format items are used to describe how program variables are to be printed.
All format items begin with a percent sign (%). The next part is an optional length and precision field. The length is an integer indicating the minimum field width of the item; if it is negative, the data is left-justified (padded with blanks on the right) rather than right-justified. If the length field begins with a zero (0), then instead of padding the value with leading blanks, the item will be padded with leading 0s. The precision is a decimal point followed by the number of digits to be displayed after the decimal point in the various floating point representations. Next is an optional source field size modifier, usually 'l' (ell). The last item is the actual source data type, commonly one of the list below:
d Integer
f Floating point in fixed point format
e Floating point in exponential format
g Floating point in "best fit" format: integer, fixed point, or exponential, depending on the exact value
s Character string
c Integer to be interpreted as a character
x Integer to be printed as hexadecimal
Examples: %-20s Print a string in the left portion of a 20 character field
%d Print an integer in however many spaces it takes
%6d Print an integer in at least 6 spaces; used to format pretty output
%9ld Print a long integer in at least 9 spaces
%09ld Print a long integer in at least 9 spaces with leading 0s, not blanks
%.6f Print a float with 6 digits after the decimal and as many before it as needed
%10.6f Print a float in a 10 space field with 6 digits after the decimal
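Combining a few of the items above in one statement; the field layout is made up and assumes numeric second and third fields:
{ printf("%-10s|%08d|%10.3e\n", $1, $2, $3) }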
Awk is a very nice language with a very strange name. In this first article of a three-part series, Daniel Robbins will quickly get your awk programming skills up to speed. As the series progresses, more advanced topics will be covered, culminating with an advanced real-world awk application demo.
");
}
}
}
//-->
In defense of awk
In this series of articles, I'm going to turn you into a proficient awk coder. I'll admit, awk doesn't have a very pretty or particularly "hip" name, and the GNU version of awk, called gawk, sounds downright weird. Those unfamiliar with the language may hear "awk" and think of a mess of code so backwards and antiquated that it's capable of driving even the most knowledgeable UNIX guru to the brink of insanity (causing him to repeatedly yelp "kill -9!" as he runs for the coffee machine).
Sure, awk doesn't have a great name. But it is a great language. Awk is geared toward text processing and report generation, yet it has many well-designed features that allow for serious programming. And, unlike some languages, awk's syntax is familiar, and borrows some of the best parts of languages like C, python, and bash (although, technically, awk was created before both python and bash). Awk is one of those languages that, once learned, will become a key part of your strategic coding arsenal.

The first awk
$ awk '{ print }' /etc/passwd

You should see the contents of your /etc/passwd file appear before your eyes. Now, for an explanation of what awk did. When we called awk, we specified /etc/passwd as our input file. When we executed awk, it evaluated the print command for each line in /etc/passwd, in order. All output is sent to stdout, and we get a result identical to catting /etc/passwd.
Now, for an explanation of the { print } code block. In awk, curly braces are used to group blocks of code together, similar to C. Inside our block of code, we have a single print command. In awk, when a print command appears by itself, the full contents of the current line are printed.
$ awk '{ print $0 }' /etc/passwd

In awk, the $0 variable represents the entire current line, so print and print $0 do exactly the same thing.
$ awk '{ print "" }' /etc/passwd

$ awk '{ print "hiya" }' /etc/passwd

Running this script will fill your screen with hiya's. :)
Multiple fields
Awk is really good at handling text that has been broken into multiple logical fields. Since /etc/passwd is colon-delimited, we pass -F":" so that awk splits each line on colons. This prints a list of all the usernames on the system:
$ awk -F":" '{ print $1 }' /etc/passwd

We can also print more than one field; here, $1 is the username and $3 the uid:
$ awk -F":" '{ print $1 $3 }' /etc/passwd

halt7
operator11
root0
shutdown6
sync5
bin1
....etc.

Awk printed the two fields with nothing between them, because adjacent values in a print statement are simply concatenated. To put a space between them, insert a literal " " string:
$ awk -F":" '{ print $1 " " $3 }' /etc/passwd

Any string can be inserted this way. This version labels each field:
$ awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd

username: halt uid:7
username: operator uid:11
username: root uid:0
username: shutdown uid:6
username: sync uid:5
username: bin uid:1
....etc.
External scripts
Passing your script to awk as a command line argument can be very handy for small one-liners, but for longer, multi-line programs it's easier to put the script in an external file and point awk at it with the -f option, e.g. awk -f myscript.awk myfile.in. Putting the following in a file reproduces our username listing:
BEGIN { FS=":" }
{ print $1 }

The difference between these two methods has to do with how we set the field separator. In this script, the field separator is specified within the code itself (by setting the FS variable), while our previous example set FS by passing the -F":" option to awk on the command line. It's generally best to set the field separator inside the script itself, simply because it means you have one less command line argument to remember to type. We'll cover the FS variable in more detail later in this article.
The BEGIN and END blocks
Normally, awk executes each block of your script's code once for each input line. However, there are many programming situations where you may need to execute initialization code before awk begins processing the text from the input file. For such situations, awk allows you to define a BEGIN block. We used a BEGIN block in the previous example. Because the BEGIN block is evaluated before awk starts processing the input file, it's an excellent place to initialize the FS (field separator) variable, print a heading, or initialize other global variables that you'll reference later in the program.
Awk also provides another special block, called the END block. Awk executes this block after all lines in the input file have been processed. Typically, the END block is used to perform final calculations or print summaries that should appear at the end of the output stream.
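For instance, a minimal END-block summary that counts lines and characters (the message wording is made up):
{ lines++; chars += length($0) }
END { printf("%d lines, %d characters\n", lines, chars) }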
Regular expressions and blocks
A regular expression can be used by itself as a selector. This one-liner prints every input line that contains the character sequence foo:
/foo/ { print }

Regular expressions can be much more involved. This script prints only lines that contain a floating point number:
/[0-9]+\.[0-9]*/ { print }
Expressions and blocks
You can also use boolean expressions as selectors. This script prints the third field of every line whose first field is exactly fred:
$1 == "fred" { print $3 }
And this one prints the third field of every line whose fifth field contains the string root:
$5 ~ /root/ { print $3 }
Conditional statements
Awk also offers very nice, C-like if statements. The previous example can be rewritten using an if:
{ if ( $5 ~ /root/ ) { print $3 } }

Both scripts function identically. In the first example, the boolean expression is placed outside the block, while in the second example, the block is executed for every input line, and we selectively perform the print command by using an if statement. Both methods are available, and you can choose the one that best meshes with the other parts of your script.
Here's a more complicated example that uses nested if statements:
{ if ( $1 == "foo" ) {
      if ( $2 == "foo" ) { print "uno" }
      else { print "one" }
  }
  else if ( $1 == "bar" ) { print "two" }
  else { print "three" }
}

An if statement can also replace a negated regular expression selector. The following two scripts are equivalent:
! /matchme/ { print $1 $3 $4 }
{ if ( $0 !~ /matchme/ ) { print $1 $3 $4 } }

Both scripts will output only those lines that don't contain a matchme character sequence. Again, you can choose the method that works best for your code. They both do the same thing.
( $1 == "foo" ) && ( $2 == "bar" ) { print }

This example will print only those lines where field one equals foo and field two equals bar.
Numeric variables!
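The script being discussed counts blank lines in a file; a minimal sketch of it (the exact summary message is a guess):
BEGIN { x=0 }
/^$/  { x=x+1 }
END   { print "I found " x " blank lines. :)" }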
In the BEGIN block, we initialize our integer variable x to zero. Then, each time awk encounters a blank line, awk will execute the x=x+1 statement, incrementing x. After all the lines have been processed, the END block will execute, and awk will print out a final summary, specifying the number of blank lines it found.
Stringy variables
One of the neat things about awk variables is that they are "simple and stringy": awk stores everything internally as a string, but converts a numeric-looking string to a number whenever arithmetic is called for. Consider this example:
x=1.01
x=x+1
print x

This will print 2.01. Although x was assigned the string "1.01", awk converts it to a number so that 1 can be added to it. The same applies to fields; this script prints the square of the first field, plus one:
{ print ($1^2)+1 }

If you do a little experimenting, you'll find that if a particular variable doesn't contain a valid number, awk will treat that variable as a numerical zero when it evaluates your mathematical expression.
Lots of operators
Another nice thing about awk is its full complement of mathematical operators. In addition to standard addition, subtraction, multiplication, and division, awk allows us to use the previously demonstrated exponent operator "^", the modulo (remainder) operator "%", and a bunch of other handy assignment operators borrowed from C.
These include pre- and post-increment/decrement ( i++, --foo ), add/sub/mult/div assign operators ( a+=3, b*=2, c/=2.2, d-=6.2 ). But that's not all -- we also get handy modulo/exponent assign ops as well ( a^=2, b%=4 ).
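A small sketch using a couple of these, assuming a numeric first field (the output wording is made up):
NR % 2 == 0 { evens++ }          # modulo used in a selector: every second line
{ total += $1 }                  # add-and-assign
END { printf("%d even-numbered lines, total %.2f\n", evens, total) }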
Field separators
Awk has its own complement of special variables. Some of them allow you to fine-tune how awk functions, while others can be read to glean valuable information about the input. We've already touched on one of these special variables, FS. As mentioned earlier, this variable allows you to set the character sequence that awk expects to find between fields. When we were using /etc/passwd as input, FS was set to ":". While this did the trick, FS allows us even more flexibility.
FS="\t+"

Above, we use the special "+" regular expression character, which means "one or more of the previous character".
FS="[[:space:]+]"

While this assignment will do the trick, it's not necessary. Why? Because by default, FS is set to a single space character, which awk interprets to mean "one or more spaces or tabs." In this particular example, the default FS setting was exactly what you wanted in the first place!
FS="foo[0-9][0-9][0-9]"
Number of fields
The NF variable always contains the number of fields in the current record. This script only prints lines that contain more than two fields:
{ if ( NF > 2 ) { print $1 " " $2 ":" $3 } }

Record number
The record number (NR) is another handy variable; it always contains the number of the current record (counting from 1), which for one-record-per-line input is simply the current line number. This script uses it to skip a ten-line header:
{
    # skip header
    if ( NR > 10 ) {
        print "ok, now for the real information!"
    }
}

Awk provides additional variables that can be used for a variety of purposes. We'll cover more of these variables in later articles.
We've come to the end of our initial exploration of awk. As the series continues, I'll demonstrate more advanced awk functionality, and we'll end the series with a real-world awk application. In the meantime, if you're eager to learn more, check out the resources listed below.

Multi-line records
Awk is an excellent tool for reading in and processing structured data, such as the system's /etc/passwd file. /etc/passwd is the UNIX user database, and is a colon-delimited text file, containing a lot of important information, including all existing user accounts and user IDs, among other things. I showed you how awk could easily parse this file. All we had to do was to set the FS (field separator) variable to ":".
By setting the FS variable correctly, awk can be configured to parse almost any kind of structured data, as long as there is one record per line. However, just setting FS won't do us any good if we want to parse a record that exists over multiple lines. In these situations, we also need to modify the RS record separator variable. The RS variable tells awk when the current record ends and a new record begins.
As an example, let's look at how we'd handle the task of processing an address list of Federal Witness Protection Program participants:
Jimmy the Weasel
100 Pleasant Drive
San Francisco, CA 12345

Big Tony
200 Incognito Ave.
Suburbia, WA 67890

Ideally, we'd like awk to recognize each 3-line address as an individual record, rather than as three separate records. It would make our code a lot simpler if awk would recognize the first line of the address as the first field ($1), the street address as the second field ($2), and the city, state, and zip code as field $3. The following code will do just what we want:
BEGIN { FS="\n" RS=""}

Above, setting FS to "\n" tells awk that each field appears on its own line. By setting RS to "", we also tell awk that each address record is separated by a blank line. Once awk knows how the input is formatted, it can do all the parsing work for us, and the rest of the script is simple. Let's look at a complete script that will parse this address list and print out each address record on a single line, separating each field with a comma.
BEGIN { FS="\n" RS=""}{ print $1 ", " $2 ", " $3}

If this script is saved as address.awk, and the address data is stored in a file called address.txt, you can execute this script by typing "awk -f address.awk address.txt". This code produces the following output:
Jimmy the Weasel, 100 Pleasant Drive, San Francisco, CA 12345
Big Tony, 200 Incognito Ave., Suburbia, WA 67890
OFS and ORS
In address.awk's print statement, you can see that awk concatenates (joins) strings that are placed next to each other on a line. We used this feature to insert a comma and a space (", ") between the three address fields that appeared on the line. While this method works, it's a bit ugly looking. Rather than inserting literal ", " strings between our fields, we can have awk do it for us by setting a special awk variable called OFS. Take a look at this code snippet.
print "Hello", "there", "Jim!"

The commas on this line are not part of the actual literal strings. Instead, they tell awk that "Hello", "there", and "Jim!" are separate fields, and that the OFS variable should be printed between each string. By default, awk produces the following output:
Hello there Jim!

This shows us that by default, OFS is set to " ", a single space. However, we can easily redefine OFS so that awk will insert our favorite field separator. Here's a revised version of our original address.awk program that uses OFS to output those intermediate ", " strings:
BEGIN { FS="\n" RS="" OFS=", "}{ print $1, $2, $3}

Awk also has a special variable called ORS, called the "output record separator". By setting ORS, which defaults to a newline ("\n"), we can control the character that's automatically printed at the end of a print statement. The default ORS value causes awk to output each new print statement on a new line. If we wanted to make the output double-spaced, we would set ORS to "\n\n". Or, if we wanted records to be separated by a single space (and no newline), we would set ORS to " ".
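For example, a two-line sketch that double-spaces whatever it reads by setting ORS:
BEGIN { ORS="\n\n" }
{ print }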
Multi-line to tabbed
Let's say that we wrote a script that converted our address list to a single-line per record, tab-delimited format for import into a spreadsheet. After using a slightly modified version of address.awk, it would become clear that our program only works for three-line addresses. If awk encountered the following address, the fourth line would be thrown away and not printed:
Cousin Vinnie
Vinnie's Auto Shop
300 City Alley
Sosueme, OR 76543

To handle situations like this, it would be good if our code took the number of records per field into account, printing each one in order. Right now, the code only prints the first three fields of the address. Here's some code that does what we want:
BEGIN { FS="\n" RS="" ORS="" }
{ x=1
while ( xfor ( initial assignment; comparison; increment ) {code block}

Here's a quick example:
for ( x = 1; x <= 4; x++ ) { print "iteration",x }

This snippet will print:
iteration 1
iteration 2
iteration 3
iteration 4

Break and continue
Again, just like C, awk provides break and continue statements. These statements provide better control over awk's various looping constructs. Here's a code snippet that desperately needs a break statement:
while (1) { print "forever and ever..." }

Because 1 is always true, this while loop runs forever. Here's a loop that only executes ten times:
x=1
while(1) {
    print "iteration",x
    if ( x == 10 ) { break }
    x++
}
Here, the break statement is used to "break out" of the innermost loop. "break" causes the loop to immediately terminate and execution to continue at the line after the loop's code block.
The continue statement complements break, and works like this:

x=1
while (1) {
    if ( x == 4 ) {
        x++
        continue
    }
    print "iteration",x
    if ( x > 20 ) {
        break
    }
    x++
}
This code will print "iteration 1" through "iteration 21", except for "iteration 4". When x equals 4, x is incremented and the continue statement is called, which immediately causes awk to start the next loop iteration without executing the rest of the code block. The continue statement works for every kind of awk iterative loop, just as break does. When used in the body of a for loop, continue will cause the loop control variable to be automatically incremented. Here's an equivalent for loop:
for ( x=1; x<=21; x++ ) {
    if ( x == 4 ) { continue }
    print "iteration",x
}

It wasn't necessary to increment x just before calling continue as it was in our while loop, since the for loop increments x automatically.
Arrays
You'll be pleased to know that awk has arrays. However, under awk, it's customary to start array indices at 1, rather than 0:
myarray[1]="jim"
myarray[2]=456
When awk encounters the first assignment, myarray is created and the element myarray[1] is set to "jim". After the second assignment is evaluated, the array has two elements.
Iterating over arrays
Once you've defined an array, awk has a handy mechanism for iterating over its elements, as follows:
for ( x in myarray ) {print myarray[x]}

This code will print out every element in the array myarray. When you use this special "in" form of a for loop, awk will assign every existing index of myarray to x (the loop control variable) in turn, executing the loop's code block once after each assignment. While this is a very handy awk feature, it does have one drawback -- when awk cycles through the array indices, it doesn't follow any particular order. That means that there's no way for us to know whether the output of above code will be:
jim
456
or
456
jim

To loosely paraphrase Forrest Gump, iterating over the contents of an array is like a box of chocolates -- you never know what you're going to get. This has something to do with the "stringiness" of awk arrays, which we'll now take a look at.
Array index stringiness
As mentioned earlier, awk actually stores numeric values in string format. While awk performs the necessary conversions to make this work, it does open the door for some odd-looking code:
a="1"b="2"c=a+b+3
After this code executes, c is equal to 6. Since awk is "stringy", adding strings "1" and "2" is functionally no different than adding the numbers 1 and 2. In both cases, awk will successfully perform the math. Awk's "stringy" nature is pretty intriguing -- you may wonder what happens if we use string indexes for arrays. For instance, take the following code:
myarr["1"]="Mr. Whipple"print myarr["1"]
As you might expect, this code will print "Mr. Whipple". But how about if we drop the quotes around the second "1" index?

myarr["1"]="Mr. Whipple"print myarr[1]

Guessing the result of this code snippet is a bit more difficult. Does awk consider myarr["1"] and myarr[1] to be two separate elements of the array, or do they refer to the same element? The answer is that they refer to the same element, and awk will print "Mr. Whipple", just as in the first code snippet. Although it may seem strange, behind the scenes awk has been using string indexes for its arrays all this time!
After learning this strange fact, some of us may be tempted to execute some wacky code that looks like this:

myarr["name"]="Mr. Whipple"print myarr["name"]

Not only does this code not raise an error, but it's functionally identical to our previous examples, and will print "Mr. Whipple" just as before! As you can see, awk doesn't limit us to using pure integer indexes; we can use string indexes if we want to, without creating any problems. Whenever we use non-integer array indices like myarr["name"], we're using associative arrays. Technically, awk isn't doing anything different behind the scenes than when we use a string index (since even if you use an "integer" index, awk still treats it as a string). However, you should still call 'em associative arrays -- it sounds cool and will impress your boss. The stringy index thing will be our little secret. ;)

Array tools
When it comes to arrays, awk gives us a lot of flexibility. We can use string indexes, and we aren't required to have a continuous numeric sequence of indices (for example, we can define myarr[1] and myarr[1000], but leave all other elements undefined). While all this can be very helpful, in some circumstances it can create confusion. Fortunately, awk offers a couple of handy features to help make arrays more manageable.
First, we can delete array elements. If you want to delete element 1 of your array fooarray, type:

delete fooarray[1]

And, if you want to see if a particular array element exists, you can use the special "in" boolean operator as follows:

if ( 1 in fooarray ) {print "Ayep! It's there."} else {print "Nope! Can't find it."}

Resources
If you'd like a good old-fashioned book, O'Reilly's sed & awk, 2nd Edition is a wonderful choice.
Be sure to check out the comp.lang.awk FAQ. It also contains lots of additional awk links.
Patrick Hartigan's awk tutorial is packed with handy awk scripts.
