
Wednesday, 27 May 2015

The Immortal AWK Language

The name AWK is formed from the initials of Aho, Weinberger and Kernighan, all of them AT&T Bell Labs employees in 1977, when awk first appeared.

AWK is an amazing tool. It's a UNIX tool, but even WINDOWS addicts know it. Despite the existence of its successor perl, it refuses to die out. (Maybe perl code is a little hard to read? Maybe the attitude "We let you do it in many ways" is not so useful for source code? ;-)

AWK is one of the utilities that occur in almost any shell script bigger than 100 lines. Whenever the UNIX shell reaches its limits, AWK helps out. I would not say that AWK code is easy to read, but it is not as hard as perl, and even a bit easier than shell scripts. Its syntax feels like a free form of C without pointers (what a relief!).
Warning: it is an interpreted scripting language and has no explicitly declared data types!

The Shortest Useful Program Is an AWK Program

Let's see how many awk variants are installed. Open a terminal on your LINUX or WINDOWS + CYGWIN system and enter the following command line (without the "$", which is the system's prompt):

$ ls -l /usr/bin/*awk

lrwxrwxrwx 1 root root     21 Jan 17 22:26 /usr/bin/awk -> /etc/alternatives/awk
-rwxr-xr-x 1 root root 538224 Jul  2  2013 /usr/bin/dgawk
-rwxr-xr-x 1 root root 441512 Jul  2  2013 /usr/bin/gawk
-rwxr-xr-x 1 root root   3188 Jul  2  2013 /usr/bin/igawk
-rwxr-xr-x 1 root root 117768 Mar 24  2014 /usr/bin/mawk
lrwxrwxrwx 1 root root     22 Jan 17 22:26 /usr/bin/nawk -> /etc/alternatives/nawk
-rwxr-xr-x 1 root root 445608 Jul  2  2013 /usr/bin/pgawk

You may need to enter ls -la /bin/*awk instead, depending on your LINUX variant.
When you pipe the output of the ls file-list command through an awk filter, you see this:

$ ls -l /usr/bin/*awk | awk '{print $9, $11}'

/usr/bin/awk /etc/alternatives/awk
/usr/bin/dgawk 
/usr/bin/gawk 
/usr/bin/igawk 
/usr/bin/mawk 
/usr/bin/nawk /etc/alternatives/nawk
/usr/bin/pgawk 

We used AWK as a column filter to see only columns 9 and 11 (when present). Column 11 is the target when the file node is a symbolic link.

This is really a short program: {print $9, $11}, don't you think?
What can we learn from that?

  1. AWK reads its input line by line, although you could set the RS (record separator) variable to something other than newline
  2. AWK splits every input record (line) into fields, using the FS (field separator) variable, which defaults to whitespace, and makes the fields available as $1 - $N; the whole line is in $0
  3. we need to enclose such an AWK program in 'single quotes', else the shell would see all the $-variables and try to substitute them, most likely with empty content

That is what most people do with AWK: feed in the lines of a file and convert the column contents to some new shape.
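The three points above can be tried without depending on what ls prints on a particular system. The following is only a small sketch with made-up sample data: the colon-separated records and the ";" record separator are assumptions chosen for illustration, and -F is the command-line way to set FS.

```shell
# Hypothetical sample data: colon-separated fields, one record per line.
# -F: sets the field separator FS to ":".
users=$(printf 'root:x:0\nalice:x:1000\n' | awk -F: '{ print $1, $3 }')
echo "$users"

# Setting RS changes what counts as a record; here records are separated by ";".
# NR holds the number of the current record.
records=$(printf 'a b;c d;e f' | awk 'BEGIN { RS = ";" } { print NR ": " $2 }')
echo "$records"
```

The first command should print the first and third field of each record, the second one the second word of each ";"-separated record together with its record number.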

AWK Program in a File

Put {print $9, $11} into a file named NameAndLink.awk (the extension is not obligatory) ....

NameAndLink.awk
{ print $9, $11 }

.... and then do this:

$ ls -l /usr/bin/*awk | awk -f NameAndLink.awk

# same result as above

Within the file you do not need the 'single quotes' any more. For bigger AWK applications, a separate file for the source code is highly recommended.
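To try the file variant without relying on the local ls output, the sketch below writes the one-line program into a temporary directory and feeds it a single line copied from the listing above. The temporary directory and the variable names are assumptions made for this illustration only.

```shell
# Create the program file in a scratch directory (hypothetical location).
tmpdir=$(mktemp -d)
cat > "$tmpdir/NameAndLink.awk" <<'EOF'
{ print $9, $11 }
EOF

# One line of ls -l output, taken from the listing above.
line='lrwxrwxrwx 1 root root 21 Jan 17 22:26 /usr/bin/awk -> /etc/alternatives/awk'
out=$(printf '%s\n' "$line" | awk -f "$tmpdir/NameAndLink.awk")
echo "$out"

rm -r "$tmpdir"
```

Field 9 is the file name and field 11 the link target, because the "->" arrow counts as field 10.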

Another thing you can do is to name the command interpreter in the first line of the file, so that it can be executed as a script:

NameAndLink.awk
#!/usr/bin/awk -f
{ print $9, $11 }

Mind that you now need to set execute permissions on it, and to call it with its path (./) unless its directory is in your PATH:

$ chmod u+x NameAndLink.awk
$ ls -l /usr/bin/*awk | ./NameAndLink.awk

# same result as above

A Standard AWK Program

Here is a skeleton of what most AWK programs look like.

awk '
  BEGIN {
    print "Starting";
  }
  /a/ {
    print "Got a";
  }
  /b/ || /c/ {
    print "Got b or c, see yourself: " $0
  }
  /b/ {
    print "Got b"
  }
  {
    print "Generally I got " $0;
  }
  END {
    print "Ending"
  }
' <file.txt

Assuming we have a file.txt with content

a
b
c
d

we would see the following output:

Starting
Got a
Generally I got a
Got b or c, see yourself: b
Got b
Generally I got b
Got b or c, see yourself: c
Generally I got c
Generally I got d
Ending

What can we learn from that?

  1. the BEGIN section is executed before the first line is read, the END section after the last line has been read
  2. the { brace section } that has no pattern is executed for every line, even when that line has also matched other patterns
  3. the sections headed by /regular expression/ patterns are executed for every line that matches their pattern
  4. when a line matches several patterns, their sections are executed in the order in which they occur in the source code
  5. the trailing ";" semicolon is optional

Similarities to XSLT and CSS are obvious: it is a pattern-matching language.
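Because every matching section runs in source order, it is sometimes useful to stop processing a line early. The following sketch is a variation of the skeleton above, not part of it: it uses the standard next statement to skip the remaining sections for a line, and prints the record counter NR in the END section. The input is the same a/b/c/d sample, generated inline instead of read from file.txt.

```shell
out=$(printf 'a\nb\nc\nd\n' | awk '
  /a/        { print "Got a"; next }        # next: skip the remaining sections for this line
  /b/ || /c/ { print "Got b or c: " $0 }
             { print "Generally I got " $0 }
  END        { print NR " lines read" }
')
echo "$out"
```

Line "a" no longer reaches the pattern-less section, while all other lines still do.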

AWK Capabilities in an Example

Instead of starting "yet another AWK tutorial", I want to demonstrate its power with a little application I needed recently. That application should process a Maven pom.xml file and enrich it with version numbers.

Inputs to Process

pom.xml
<?xml version="1.0"?>

<project>
  <groupId>com.mycompany.app</groupId>
  <artifactId>my-module</artifactId>
  <version>1.0.0-SNAPSHOT</version>
 
  <parent>
    <groupId>com.mycompany.app</groupId>
    <artifactId>my-app</artifactId>
    <version>1.0-SNAPSHOT</version>
    <relativePath>../parent/pom.xml</relativePath>
  </parent>

  <dependencies>
  
    <dependency>
      <groupId>fri.example.test</groupId>
      <artifactId>module-one</artifactId>
    </dependency>
  
    <dependency>
      <artifactId>module-two</artifactId>
      <groupId>fri.example.test</groupId>
    </dependency>
  
    <dependency>
      <artifactId>module-hundred</artifactId>
      <groupId>fri.example.test</groupId>
    </dependency>
  
    <dependency>
      <groupId>fri.example.test</groupId>
      <artifactId>module-three</artifactId>
      <exclusions>
        <exclusion>
          <groupId>fri.example.test</groupId>
          <artifactId>module-five</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  
    <dependency>
      <artifactId>module-four</artifactId>
      <groupId>fri.example.test</groupId>
    </dependency>
  
  </dependencies>

</project>

A parent is referenced. Let's assume that it holds a dependencyManagement section where the versions for modules are defined, and that we extracted these into the file maven-resolve.txt from the output of mvn dependency:resolve (for example by using another awk script ;-).

maven-resolve.txt
module-four 1.4
module-two 1.2
module-three 1.3
module-one 1.1

Output to Achieve

We want the associated versions to be put into the module dependency elements like this:

<?xml version="1.0"?>

<project>
  <groupId>com.mycompany.app</groupId>
  <artifactId>my-module</artifactId>
  <version>1.0.0-SNAPSHOT</version>
 
  <parent>
    <groupId>com.mycompany.app</groupId>
    <artifactId>my-app</artifactId>
    <version>1.0-SNAPSHOT</version>
    <relativePath>../parent/pom.xml</relativePath>
  </parent>

  <dependencies>
  
    <dependency>
      <groupId>fri.example.test</groupId>
      <artifactId>module-one</artifactId>
      <version>1.1</version>
    </dependency>
  
    <dependency>
      <artifactId>module-two</artifactId>
      <groupId>fri.example.test</groupId>
      <version>1.2</version>
    </dependency>
  
    <dependency>
      <artifactId>module-hundred</artifactId>
      <groupId>fri.example.test</groupId>
    </dependency>
  
    <dependency>
      <groupId>fri.example.test</groupId>
      <artifactId>module-three</artifactId>
      <exclusions>
        <exclusion>
          <groupId>fri.example.test</groupId>
          <artifactId>module-five</artifactId>
        </exclusion>
      </exclusions>
      <version>1.3</version>
    </dependency>
  
    <dependency>
      <artifactId>module-four</artifactId>
      <groupId>fri.example.test</groupId>
      <version>1.4</version>
    </dependency>
  
  </dependencies>

</project>

As you can see, there are some subtleties in the example pom.xml. To confuse the script, there is an exclusion tag containing an artifactId tag. Sometimes the groupId and artifactId tags are swapped. And there is a module-hundred for which we have no version.

The Program

We need to read the maven-resolve.txt file at BEGIN. Then we read the pom.xml file and insert a version tag wherever a matching module occurs. The resulting AWK source is amazingly short. I wrote it as a shell script to show how shell variables can be integrated into the AWK program.

#!/bin/bash

pom=pom.xml
versions=maven-resolve.txt
newPom=pom-with-versions.xml

gawk '    # gawk is needed here: the three-argument match() used below is a gawk extension
  BEGIN  {    # read versions from file in format "module version" into associative array
    while ((getline < "'$versions'") > 0)  {
      moduleVersions[$1] = $2;
    }
  }
  
  /^[ \t]*<exclusions>/  {    # set a state for artifactId tags appearing here
    inExclusion = 1;
  }
  /^[ \t]*<\/exclusions>/  {    # reset state
    inExclusion = 0;
  }
  
  {
    if (inExclusion != 1)  {
      if (match($1, "^<artifactId>(.+)</artifactId>$", matchArray))  {
        currentVersion = moduleVersions[matchArray[1]];
      }
      else if (length(currentVersion) > 0 && $1 == "</dependency>")  {    # having version for artifact, print it
        print "      <version>" currentVersion "</version>";    # six spaces, matching the indentation of the POM
        currentVersion = "";
      }
    }
    
    print $0;    # print out any line of POM
  }
' < "$pom" > "$newPom"

First we read maven-resolve.txt line by line using the built-in getline command. We can use an input redirection with the getline command, like in a shell script. Mind that the file name of the input redirection needs to be wrapped in "double quotes" inside the AWK program. But the shell variable needs to be outside the AWK program, so the single quotes are closed just before it and reopened just after it, which 'splits' the AWK program and exposes $versions to the shell for substitution.
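Splitting the quotes is not the only way to hand a shell value to AWK. As a sketch of an alternative (not what the script above does), the standard -v option assigns an AWK variable before the program starts, so the program text can stay inside one pair of single quotes. The variable name versionsFile and the temporary file are assumptions made for this illustration.

```shell
# Hypothetical stand-in for maven-resolve.txt
versions=$(mktemp)
printf 'module-one 1.1\nmodule-two 1.2\n' > "$versions"

looked_up=$(awk -v versionsFile="$versions" '
  BEGIN {
    # getline < file reads one line into $0 and splits it into $1 - $N
    while ((getline < versionsFile) > 0) {
      moduleVersions[$1] = $2
    }
    close(versionsFile)
    print moduleVersions["module-two"]
  }
')
echo "$looked_up"

rm -f "$versions"
```

A program that consists only of a BEGIN section does not read standard input, so no input redirection is needed here.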

When reading the file, AWK works as usual by splitting the line and providing $1 - $N. We use this to fill the "associative array" moduleVersions with our module/version information. In AWK you don't need to declare variables, you simply use them. AWK will create them when needed. They're all global (only function parameters are local).
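A small sketch may make the "use without declaring" behaviour more tangible. The module names fed in below are sample data for illustration only. The array count springs into existence on first use, an unset element behaves like 0 when incremented, and the in operator tests for a key without creating it.

```shell
out=$(printf 'module-one\nmodule-two\nmodule-one\n' | awk '
  { count[$1]++ }                      # count is created on first use
  END {
    print "module-one", count["module-one"]
    print "module-two", count["module-two"]
    if (!("module-hundred" in count))  # in does not create the element
      print "module-hundred not seen"
  }
')
echo "$out"
```

Merely referencing an element, as in moduleVersions[name], creates it with an empty value, which is why a lookup for an unknown module yields an empty string rather than an error.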

So after this we have a map of module/version associations. Now we process every line of pom.xml. Because we need the module name from within the artifactId tags, I decided to match them in the { common braces }. The built-in match() function can give us the text within the artifactId tags, because I enclosed it in ( parentheses ). That enclosed text appears in the third parameter, matchArray. Mind that this three-argument form of match() is a gawk extension and is not available in other awk variants such as mawk. Element matchArray[0] holds the whole matched text and matchArray[1] the first parenthesized group, so I get the name of the module from matchArray[1]. And I query the map with that module name to get its version.
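Since the three-argument match() is a gawk extension, here is a sketch of how the module name could be extracted with functions that every awk provides: index(), the two-argument match() with its RSTART variable, and substr(). This is an alternative under the assumption that the opening and closing tag sit on the same line, as in the example pom.xml; the variable openTag is introduced just for this illustration.

```shell
line='      <artifactId>module-one</artifactId>'
name=$(printf '%s\n' "$line" | awk '
  BEGIN { openTag = "<artifactId>" }
  # $1 must start with the opening tag and end with the closing tag
  index($1, openTag) == 1 && match($1, /<\/artifactId>$/) {
    # RSTART is where the closing tag begins, so the name lies between the two tags
    print substr($1, length(openTag) + 1, RSTART - length(openTag) - 1)
  }
')
echo "$name"
```

The leading indentation does not matter, because the default field splitting already removes it from $1.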

When a line containing a closing dependency tag then appears, the script checks whether a version exists for the artifactId seen before, and inserts a version tag if so. Finally, every line of the pom.xml gets printed out unchanged by print $0.

The exclusions patterns are there because artifactId tags can also appear within a Maven exclusion, and these exclusions sit inside dependency elements. Such a tag would overwrite the artifact version found just before, so the script uses the inExclusion state to ignore it. Without that state, module-three would miss its version.
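A state flag is one way to ignore a block of lines. As a sketch of another option (not what the script above does), AWK also has range patterns of the form /start/, /end/ that match every line from a start line to the next end line. The input below is a shortened, hypothetical dependency element, and the program only lists the artifactId lines outside of exclusions instead of rewriting the POM.

```shell
out=$(printf '%s\n' \
  '<dependency>' \
  '<artifactId>module-three</artifactId>' \
  '<exclusions>' \
  '<artifactId>module-five</artifactId>' \
  '</exclusions>' \
  '</dependency>' | awk '
  /<exclusions>/, /<\/exclusions>/ { next }   # skip every line of the exclusions block
  /<artifactId>/ { print }
')
echo "$out"
```

In the real script the lines inside the exclusions block still have to be printed, so there the range section would have to print $0 before calling next.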

Please mind that awk does not understand XML, so e.g. an XML comment in the wrong place could break the script. This is just a quick line-reader solution!


That's it. AWK contains a lot more, so use it and enjoy the brevity. I've also seen bigger applications written in AWK, but like all scripting languages it lacks encapsulation, and thus is not suitable for big modular projects.



