Friday, January 8, 2021

A Subtle Problem with AWK Pipe Statements

While using my video scripts, I came across a subtle problem with AWK pipe statements. If you forget to close a pipe command, a later one may silently deliver wrong results under certain circumstances, which is what makes this problem subtle.

I observed this with GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0) on Ubuntu 20.04.1 with Linux kernel 5.4.0-59.

AWK is a predecessor of Perl and a great tool for quick data processing; I find it better than Perl because it is simpler. But, like most scripting languages, it has peculiarities that may cause undetected bugs. In my case, a video was reported as too short to contain a given timestamp, which was not true, so I had to check the responsible awk script.

AWK Pipe Statement

Example:

file = "a.txt"
sizeCommand = "stat --printf='%s' " file
sizeCommand | getline size
print "Size of " file " is " size

This code fetches the size of the file a.txt through an external command that is piped into getline to read the first line of its output.
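Each getline call on a pipe reads just one line, and calling it again with the same command string continues reading from the same running command rather than starting it anew. The following is a minimal sketch of that behavior, assuming a POSIX shell with awk on the PATH; the echo command is only a hypothetical stand-in for a real external command:

```shell
#!/bin/sh
# Sketch: successive getline calls on the same command string read
# successive lines from ONE run of that command.
lines=$(awk '
BEGIN {
  cmd = "echo one; echo two; echo three"   # hypothetical stand-in command
  cmd | getline first                      # starts the command, reads "one"
  cmd | getline second                     # pipe is still open, reads "two"
  print "first=" first ", second=" second
  while ((cmd | getline rest) > 0)         # read whatever is left
    print "rest=" rest
  close(cmd)                               # release the pipe
}')
echo "$lines"
```

This should print first=one, second=two followed by rest=three. The pipe is identified by its command string, which is why that same string is what gets passed to close().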

The GNU documentation puts a close() immediately after the pipe statement. So the correct form would be:

....
sizeCommand | getline size
close(sizeCommand)
....

→ If you forget the close(), you may get strange results!

Problem Reproduction

Here is a reproduction of what I encountered. You need two text files:

  1. a.txt containing the single character 'a' with no trailing newline (size = 1), and
  2. ab.txt containing 'ab' with no trailing newline (size = 2).

Then put the following AWK script pipe-statement-problem.awk into the same directory and make it executable:

#!/usr/bin/awk -f

BEGIN {
  files[0] = "a.txt"
  files[1] = "ab.txt"
  files[2] = "a.txt"
  
  sizes[0] = 1
  sizes[1] = 2
  sizes[2] = 1
  
  for (i = 0; i < 3; i++) {   # counted loop, because "for (i in files)" has no guaranteed order
     externalCommand = "stat --printf='%s' "  files[i]
     externalCommand | getline size
     print "size of " files[i] " = " size
     if (size != sizes[i])
       print "ERROR in size of " files[i] ", should be " sizes[i]
  }
}

The shebang #!/usr/bin/awk -f in the first line tells the system to run the script with /usr/bin/awk.

Because the script processes no input data, everything happens in the BEGIN rule, which is executed once at script start.

The script builds two arrays, one for file names and one for the expected sizes of these files.

The for loop runs stat on each file in the array, which are a.txt, ab.txt, and again a.txt, and fetches their sizes. An ERROR message is printed if a size doesn't match the expected value.
The script has no practical use by itself; it is something like a unit test for the pipe statement.

Output is ('$' is the UNIX command prompt):

$ pipe-statement-problem.awk
size of a.txt = 1
size of ab.txt = 2
size of a.txt = 2
ERROR in size of a.txt, should be 1

The error occurs when the pipe for a.txt is executed a second time. For some reason awk then reports the size of ab.txt, the file that was handled by the preceding pipe statement.

Fix: Close Any Pipe!

The bug can be fixed by inserting a close() immediately after the pipe statement externalCommand | getline size (line 14 of the script):

#!/usr/bin/awk -f

BEGIN {
  files[0] = "a.txt"
  files[1] = "ab.txt"
  files[2] = "a.txt"
  
  sizes[0] = 1
  sizes[1] = 2
  sizes[2] = 1
  
  for (i = 0; i < 3; i++) {   # counted loop, because "for (i in files)" has no guaranteed order
     externalCommand = "stat --printf='%s' "  files[i]
     externalCommand | getline size
     close(externalCommand)   # always close the pipe once you have read from it
     print "size of " files[i] " = " size
     if (size != sizes[i])
       print "ERROR in size of " files[i] ", should be " sizes[i]
  }
}

Running the fixed script gives:

$ pipe-statement-problem.awk
size of a.txt = 1
size of ab.txt = 2
size of a.txt = 1

This is the right output. Now the size of file a.txt has been read correctly.
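One way to make the close() hard to forget is to put the read and the close into a single function. The following is only a sketch under some assumptions: a POSIX shell, GNU stat (as on Ubuntu), and a temporary directory that the script fills with its own sample files. The name first_line() is a hypothetical helper of mine, not an awk built-in.

```shell
#!/bin/sh
# Sketch: wrap "command | getline" and close() in one awk function.
workdir=$(mktemp -d)
printf 'a'  > "$workdir/a.txt"     # 1 byte, no trailing newline
printf 'ab' > "$workdir/ab.txt"    # 2 bytes, no trailing newline

cat > "$workdir/sizes.awk" <<'EOF'
# first_line() is a hypothetical helper: it returns the first output line
# of cmd, or "" if nothing could be read, and always closes the pipe.
function first_line(cmd,    line, status) {
  status = (cmd | getline line)
  close(cmd)
  return (status > 0) ? line : ""
}

BEGIN {
  n = split("a.txt ab.txt a.txt", files, " ")
  for (i = 1; i <= n; i++)    # counted loop keeps the order fixed
    print files[i] " = " first_line("stat --printf='%s' " dir "/" files[i])
}
EOF

sizes=$(awk -v dir="$workdir" -f "$workdir/sizes.awk")
echo "$sizes"
rm -rf "$workdir"
```

With the sample files above this should report 1, 2 and 1 again. Returning an empty string on failure is just one possible design; a caller that must tell "no output" apart from an error would need the getline status itself.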

Conclusion

When I first got to know the AWK pipe statement, I didn't even know that you can (or must) close it. The resulting problems may stay undetected for a long time, because there is no warning and no error message; the result of getline is simply wrong. I didn't find out why the result is always that of the preceding pipe, and why it happens only when repeating a pipe that was already executed once.
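A likely explanation, stated here as an assumption based on the getline semantics described in the GNU Awk manual rather than something verified on this exact setup: awk keeps a pipe open under its command string until close() is called. A second getline with the same string does not rerun the command but tries to read the next line from the old one, finds end of input, returns 0, and leaves the variable untouched. The variable then still holds whatever the last successful read stored, which in the script above was the size of ab.txt. The sketch below makes the return value visible; it assumes a POSIX shell and GNU stat, and the variable names are mine.

```shell
#!/bin/sh
# Sketch: print the getline return value to see what a repeated,
# unclosed pipe actually does.
workdir=$(mktemp -d)
printf 'a' > "$workdir/a.txt"      # 1 byte, no trailing newline

trace=$(awk -v file="$workdir/a.txt" '
BEGIN {
  cmd = "stat --printf=%s " file

  first = (cmd | getline size)      # runs the command, reads "1"
  print "first: status=" first " size=" size

  size = "stale"                    # marker to show whether size gets overwritten
  second = (cmd | getline size)     # same open pipe, nothing left to read
  print "second: status=" second " size=" size

  close(cmd)                        # forget the old pipe
  third = (cmd | getline size)      # command runs again
  print "after close: status=" third " size=" size
  close(cmd)
}')
echo "$trace"
rm -rf "$workdir"
```

The second read should report status 0 with size still set to the marker value, and the read after close() should report status 1 with the real size again. Checking that status, in addition to calling close(), would have turned the silent wrong result into a detectable error.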



