It is spelled AWK because it stands for the names Aho, Weinberger and Kernighan, all AT&T Bell Labs employees in 1977 when awk first appeared.
AWK is an amazing tool. It's a UNIX tool, but even WINDOWS addicts know it. Despite the existence of its successor perl it does not die out. (Maybe perl code is a little hard to read? Maybe the attitude "We let you do it in many ways" is not so useful for source code?-)
AWK is one of the utilities occurring in almost any shell script bigger than 100 lines.
Whenever the UNIX shell-script interpreter reaches its limits, AWK helps out.
I would not say that AWK code is easy to read, but not as hard as perl,
and even a bit easier than shell scripts.
Its syntax feels like a free form of C without pointers (how relieving!).
Warning: it is an interpreted script language and has no explicitly declared data types!
The Shortest Useful Program is an AWK Program
Let's see how many awk variants are installed. Open a terminal prompt on your LINUX or WINDOWS + CYGWIN system, and enter the following command line (without "$", this is the system's prompt):
$ ls -l /usr/bin/*awk
lrwxrwxrwx 1 root root     21 Jan 17 22:26 /usr/bin/awk -> /etc/alternatives/awk
-rwxr-xr-x 1 root root 538224 Jul  2  2013 /usr/bin/dgawk
-rwxr-xr-x 1 root root 441512 Jul  2  2013 /usr/bin/gawk
-rwxr-xr-x 1 root root   3188 Jul  2  2013 /usr/bin/igawk
-rwxr-xr-x 1 root root 117768 Mar 24  2014 /usr/bin/mawk
lrwxrwxrwx 1 root root     22 Jan 17 22:26 /usr/bin/nawk -> /etc/alternatives/nawk
-rwxr-xr-x 1 root root 445608 Jul  2  2013 /usr/bin/pgawk
Maybe you need to enter ls -la /bin/*awk instead; this depends on your LINUX variant.
When you pipe the output of the ls file-list command into an awk filter, you see this:
$ ls -l /usr/bin/*awk | awk '{print $9, $11}'
/usr/bin/awk /etc/alternatives/awk
/usr/bin/dgawk
/usr/bin/gawk
/usr/bin/igawk
/usr/bin/mawk
/usr/bin/nawk /etc/alternatives/nawk
/usr/bin/pgawk
We used AWK as a column filter, to see only columns 9 and 11 (when present). Column 11 holds the target when the file node is a symbolic link.
This is really a short program: {print $9, $11}, don't you think?
What can we learn from that?
- AWK reads its input line by line, although you could set the RS (record separator) variable to something other than newline
- AWK splits every input record (line) into parts, using the FS (field separator) variable that defaults to whitespace, and makes the parts available as $1 - $N; the whole line is in $0
- we need to enclose such an AWK program in 'single quotes', else the shell would see all $-variables and would try to substitute them, most likely with empty content
That is what all people do with AWK: feed in lines of a file and convert column contents to some new shape.
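As a small sketch of the FS point above: the -F option sets the field separator from the command line. The function name first_and_last and the colon-separated sample line are made up for illustration, they are not part of the examples in this article.

```shell
# Hypothetical helper: print the first and the last field of colon-separated lines.
first_and_last() {
    # -F: sets FS to ":"; NF is the number of fields, so $NF is the last field
    awk -F: '{ print $1, $NF }'
}

printf 'root:x:0:0:root:/root:/bin/bash\n' | first_and_last
```

Setting FS in a BEGIN section would have the same effect; -F is just shorter for one-liners.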
AWK Program in a File
Put {print $9, $11} into a file named NameAndLink.awk (the extension is not obligatory):
{ print $9, $11 }
and then do this:
$ ls -l /usr/bin/*awk | awk -f NameAndLink.awk # same result as above
Within the file you do not need the 'single quotes' any more. For bigger AWK applications, a separate file for the source code is highly recommended.
Another thing you can do is to put a line naming the command interpreter at the top of the file, so that it can be executed as a script:
NameAndLink.awk
#!/usr/bin/awk -f
{ print $9, $11 }
Mind that now you need to set execute permissions on it, and to call it with a path (here ./) unless its directory is in your PATH:
$ chmod u+x NameAndLink.awk
$ ls -l /usr/bin/*awk | ./NameAndLink.awk # same result as above
A Standard AWK Program
Here is a skeleton of what most AWK programs look like.
awk '
BEGIN {
    print "Starting";
}
/a/ {
    print "Got a";
}
/b/ || /c/ {
    print "Got b or c, see yourself: " $0
}
/b/ {
    print "Got b"
}
{
    print "Generally I got " $0;
}
END {
    print "Ending"
}
' <file.txt
Assuming we have a file.txt with content
a
b
c
d
we would see the following output:
Starting
Got a
Generally I got a
Got b or c, see yourself: b
Got b
Generally I got b
Got b or c, see yourself: c
Generally I got c
Generally I got d
Ending
What can we learn from that?
- The BEGIN section is executed before the first line is read, the END section after the last line has been read
- the { brace section } that has no pattern is executed for every line, even when that line was matched against other patterns
- the sections headed by /regular expression/ patterns are executed for every line that matches their pattern
- when some input matches several patterns, their sections are executed in the order they occur in source code
- the trailing ";" semicolon is optional
Similarities to XSLT and CSS are obvious: it is a pattern-matching language.
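Patterns are not limited to /regular expressions/; any expression can select lines. The following is a minimal sketch of that, the function name big_entries, the limit of 100 and the two-column sample input are assumptions for illustration only.

```shell
# Hypothetical helper: report input lines whose second column exceeds 100.
big_entries() {
    awk '
        $2 > 100 { print "big: " $1; count++ }        # a comparison used as the pattern
        END      { print count + 0, "big entries" }   # + 0 yields 0 even if count was never set
    '
}

printf 'a 50\nb 150\nc 200\n' | big_entries
```

For the sample input this prints the lines for b and c, followed by the summary line from the END section.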
AWK Capabilities in an Example
Instead of starting "yet another AWK tutorial", I want to demonstrate the power of it in a little application I needed recently. That application should process a Maven pom.xml file and enrich it with version numbers.
Inputs to Process
pom.xml
<?xml version="1.0"?>
<project>
	<groupId>com.mycompany.app</groupId>
	<artifactId>my-module</artifactId>
	<version>1.0.0-SNAPSHOT</version>
	<parent>
		<groupId>com.mycompany.app</groupId>
		<artifactId>my-app</artifactId>
		<version>1.0-SNAPSHOT</version>
		<relativePath>../parent/pom.xml</relativePath>
	</parent>
	<dependencies>
		<dependency>
			<groupId>fri.example.test</groupId>
			<artifactId>module-one</artifactId>
		</dependency>
		<dependency>
			<artifactId>module-two</artifactId>
			<groupId>fri.example.test</groupId>
		</dependency>
		<dependency>
			<artifactId>module-hundred</artifactId>
			<groupId>fri.example.test</groupId>
		</dependency>
		<dependency>
			<groupId>fri.example.test</groupId>
			<artifactId>module-three</artifactId>
			<exclusions>
				<exclusion>
					<groupId>fri.example.test</groupId>
					<artifactId>module-five</artifactId>
				</exclusion>
			</exclusions>
		</dependency>
		<dependency>
			<artifactId>module-four</artifactId>
			<groupId>fri.example.test</groupId>
		</dependency>
	</dependencies>
</project>
There is a parent referenced; let's assume that it holds a dependencyManagement section where versions for modules are defined, and that we extracted these into the file maven-resolve.txt from the output of mvn dependency:resolve (for example by using another awk script ;-).
module-four 1.4
module-two 1.2
module-three 1.3
module-one 1.1
Output to Achieve
We want the associated versions to be put into the module dependency elements like this:
<?xml version="1.0"?>
<project>
	<groupId>com.mycompany.app</groupId>
	<artifactId>my-module</artifactId>
	<version>1.0.0-SNAPSHOT</version>
	<parent>
		<groupId>com.mycompany.app</groupId>
		<artifactId>my-app</artifactId>
		<version>1.0-SNAPSHOT</version>
		<relativePath>../parent/pom.xml</relativePath>
	</parent>
	<dependencies>
		<dependency>
			<groupId>fri.example.test</groupId>
			<artifactId>module-one</artifactId>
			<version>1.1</version>
		</dependency>
		<dependency>
			<artifactId>module-two</artifactId>
			<groupId>fri.example.test</groupId>
			<version>1.2</version>
		</dependency>
		<dependency>
			<artifactId>module-hundred</artifactId>
			<groupId>fri.example.test</groupId>
		</dependency>
		<dependency>
			<groupId>fri.example.test</groupId>
			<artifactId>module-three</artifactId>
			<exclusions>
				<exclusion>
					<groupId>fri.example.test</groupId>
					<artifactId>module-five</artifactId>
				</exclusion>
			</exclusions>
			<version>1.3</version>
		</dependency>
		<dependency>
			<artifactId>module-four</artifactId>
			<groupId>fri.example.test</groupId>
			<version>1.4</version>
		</dependency>
	</dependencies>
</project>
As you can see, there are some subtleties in the example pom.xml.
To trip up the script, there is an exclusion tag containing an artifactId tag.
Sometimes the groupId and artifactId tags are swapped.
And there is a module-hundred for which we have no version.
The Program
We need to read the maven-resolve.txt file at BEGIN.
Then we will read the pom.xml file and insert a version tag wherever a matching module occurs.
The resulting AWK source is amazingly short.
I wrote it as a shell script, to show how shell variables can be integrated into the AWK program.
#!/bin/bash

pom=pom.xml
versions=maven-resolve.txt
newPom=pom-with-versions.xml

awk '
BEGIN {
    # read versions from file in format "module version" into associative array
    while ((getline < "'$versions'") > 0) {
        moduleVersions[$1] = $2;
    }
}
/^[ \t]*<exclusions>/ {
    # set a state for artifactId tags appearing here
    inExclusion = 1;
}
/^[ \t]*<\/exclusions>/ {
    # reset state
    inExclusion = 0;
}
{
    if (inExclusion != 1) {
        if (match($1, "^<artifactId>(.+)</artifactId>$", matchArray)) {
            currentVersion = moduleVersions[matchArray[1]];
        }
        else if (length(currentVersion) > 0 && $1 == "</dependency>") {
            # having version for artifact, print it
            print "\t\t\t<version>" currentVersion "</version>";
            currentVersion = "";
        }
    }
    print $0; # print out any line of POM
}
' < $pom > $newPom
First we read maven-resolve.txt line by line using the built-in getline command.
We can use an input redirection with the getline command like in a shell script.
Mind that the input redirection file name needs to be wrapped into "double quotes" inside the AWK program.
But the shell variable needs to be outside the AWK program, so it is enclosed in single quotes, which 'splits' the AWK program and exposes $versions to the shell for substitution.
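A common alternative to splitting the quotes is awk's -v option, which assigns an awk variable before the BEGIN section runs. Here is a minimal sketch of that; the function name load_versions and the variable name versionsFile are assumptions for illustration and do not occur in the script above.

```shell
# Hypothetical helper: read a "module version" file and print it as module=version.
load_versions() {
    # -v copies the value of the first shell argument into the awk variable versionsFile
    awk -v versionsFile="$1" '
        BEGIN {
            # getline with a target variable leaves $0 and the fields untouched
            while ((getline line < versionsFile) > 0) {
                split(line, parts, " ")      # a single blank means: split on whitespace
                print parts[1] "=" parts[2]
            }
            close(versionsFile)
        }
    '
}
```

Called as load_versions maven-resolve.txt it would print one module=version line per input line. Mind that awk interprets backslash escapes in a value passed with -v, so this suits plain file names best.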
When reading the file, AWK works as usual by splitting the line and providing $1 - $N.
We use this to fill up the "associative array" moduleVersions with our module/version information.
In AWK you don't need to declare variables, you simply use them. AWK will create them when needed.
Apart from function parameters, they're all global.
So after this we have a map of module/version associations.
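To illustrate how such a map can be queried, here is a small sketch that fills an associative array from standard input and looks up one key. The function name lookup_version and the variable names are made up for this example.

```shell
# Hypothetical helper: look up the version of one module in "module version" lines.
lookup_version() {
    awk -v wanted="$1" '
        { versions[$1] = $2 }          # fill the map, one "module version" pair per line
        END {
            if (wanted in versions)    # "in" tests for a key without creating it
                print versions[wanted]
            else
                print "no version for " wanted
        }
    '
}

printf 'module-one 1.1\nmodule-two 1.2\n' | lookup_version module-two
```

Merely reading versions[wanted] for an unknown key would silently create an empty entry, which is why the sketch tests with "in" first.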
Now we process every line of pom.xml.
Because we need the module name from within the artifactId tags, I decided to match them in the { common braces }.
The built-in match() function can give us the text within the artifactId tags, because I enclosed that in ( parentheses ). That enclosed text will appear in the third parameter matchArray; mind that this three-argument form of match() is a gawk extension.
Element 0 of that array holds the whole matched text and elements 1-n hold the parenthesized groups, so I get the name of the module from matchArray[1]. And I query the map with that module name to get its version.
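Where only a plain awk without the array parameter of match() is available, the same text can be cut out using the RSTART and RLENGTH variables that match() sets. This is a sketch of that variation, not the approach used in the script above; the function name artifact_ids and the variables openTag and closeTag are assumptions.

```shell
# Hypothetical helper: print the text inside <artifactId>...</artifactId> lines.
artifact_ids() {
    awk '
        BEGIN { openTag = "<artifactId>"; closeTag = "</artifactId>" }
        # match() returns the start position, or 0 when nothing matched,
        # so it can serve as the pattern itself
        match($1, /^<artifactId>.+<\/artifactId>$/) {
            # skip the opening tag and leave out the closing tag
            print substr($1, RSTART + length(openTag), RLENGTH - length(openTag) - length(closeTag))
        }
    '
}

printf '  <artifactId>module-one</artifactId>\n' | artifact_ids
```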
When a line then appears that contains a closing dependency tag, the script checks whether a version exists for the artifactId seen before, and inserts a version tag if so.
Finally every line of the pom.xml gets printed out unchanged by print $0.
The exclusions patterns are there because artifactId tags can also appear within a Maven exclusion. And these exclusions are within dependency elements.
Such a nested artifactId would overwrite the artifact version found just before, and thus the script uses the inExclusion state to avoid this. Without that state, module-three would miss its version.
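The state technique can be tried in isolation. The following sketch prints only those artifactId lines that are not inside an exclusions block; the function name outer_artifact_lines is made up, and like the script above it assumes that every tag sits on its own line.

```shell
# Hypothetical helper: print artifactId lines that are outside <exclusions> blocks.
outer_artifact_lines() {
    awk '
        /<exclusions>/   { inExclusion = 1 }          # entering an exclusions block
        /<\/exclusions>/ { inExclusion = 0 }          # leaving it again
        !inExclusion && /<artifactId>/ { print $1 }   # only report while outside
    '
}

printf '<artifactId>a</artifactId>\n<exclusions>\n<artifactId>b</artifactId>\n</exclusions>\n<artifactId>c</artifactId>\n' | outer_artifact_lines
```

For the sample input this reports the lines for a and c, but not the one for b.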
Please mind that awk does not understand XML, so e.g. an XML comment in the wrong place could break the script. This is just a quick line-reader solution!
That's it, AWK contains a lot more, so use it and enjoy the brevity. I've also seen bigger applications written in AWK, but like all script languages it lacks encapsulation, and thus is not suitable for big modular projects.