Sunday, April 21, 2019

LINUX tar link to zero size

This blog post is about an incident that emptied lots of files and cost me half a day of searching for the cause. It is about UNIX tar ("tape archiver").

Circumstances

If you pack an archive with tar and unpack it with a tool that doesn't handle hard links, you may, under certain circumstances, lose data.

This happens when the list of files to pack (that you pass to tar) contains a directory that in turn contains a file which is also listed on its own. In that case tar packs the file once with its content, and stores the second occurrence as a hard link with size zero. (It does so although there was no link of any kind in the source file system!) On unpacking with the wrong tool, the link entry overwrites the real file and truncates it to zero size.

Example

Here is an example directory with files to pack:

tar-test
    peter.txt
    people.txt
        paul.txt
        mary.txt

peter.txt, paul.txt and mary.txt are regular files. people.txt is a directory whose name happens to end in .txt, just like the files.

Now we want to pack an archive from these files, and we want all *.txt files to be in the archive, so we apply a find command with a wildcard:

cd tar-test
tar -cvzf people.tgz `find . -name '*.txt'`

The tar -cvzf command packs an archive (z is for gzip compression), people.tgz is the name of the resulting archive file, and the command substitution `find . -name '*.txt'` generates the names of the files to pack into the archive.

The find command delivers the correct list:

find . -name '*.txt'
./people.txt
./people.txt/mary.txt
./people.txt/paul.txt
./peter.txt

Unfortunately, the directory people.txt also matches the pattern '*.txt' and is in that list.

Now when tar processes the directory people.txt, it also packs every file inside it into the archive. Because these files are in the find list too, tar encounters each of them a second time and stores that second occurrence as a hard link(!) inside the archive.
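A quick way to see which matches are directories is to ask find for them explicitly. The following is a minimal sketch that rebuilds the example tree in a scratch directory; the mktemp location and the file contents are assumptions for illustration, not taken from the original setup:

```shell
#!/bin/sh
set -eu

# Rebuild the example tree in a throwaway directory (location is arbitrary).
work=$(mktemp -d)
mkdir -p "$work/tar-test/people.txt"
cd "$work/tar-test"
printf 'mary data\n'  > people.txt/mary.txt
printf 'paul data\n'  > people.txt/paul.txt
printf 'peter data\n' > peter.txt

# List only the directories that match the wildcard.
dirs=$(find . -type d -name '*.txt')
echo "matching directories: $dirs"
```

Any name printed here is a directory that tar would pack recursively.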

Look at the result:

tar -tvf people.tgz
drwxrwxrwx root/root 0 2019-04-21 19:19 ./people.txt/
-rwxrwxrwx root/root 11 2019-04-21 19:20 ./people.txt/mary.txt
-rwxrwxrwx root/root 10 2019-04-21 19:20 ./people.txt/paul.txt
hrwxrwxrwx root/root 0 2019-04-21 19:20 ./people.txt/mary.txt link to ./people.txt/mary.txt
hrwxrwxrwx root/root 0 2019-04-21 19:20 ./people.txt/paul.txt link to ./people.txt/paul.txt
-rwxrwxrwx root/root 11 2019-04-21 19:20 ./peter.txt

The tar -tvf command lists the archive people.tgz. All files inside the people.txt directory (that also matched the '*.txt' wildcard) were packed twice. We see that all link entries (those whose mode string starts with h) have size zero.
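The behaviour can be reproduced with a short script. This is a sketch that assumes GNU tar (other tar implementations may not create the link entries); the scratch directory and the file contents are made up for illustration:

```shell
#!/bin/sh
set -eu

# Rebuild the example tree in a throwaway directory (location is arbitrary).
work=$(mktemp -d)
mkdir -p "$work/tar-test/people.txt"
cd "$work/tar-test"
printf 'mary data\n'  > people.txt/mary.txt
printf 'paul data\n'  > people.txt/paul.txt
printf 'peter data\n' > peter.txt

# Same command as in the post, with the archive written outside the tree.
# The unquoted substitution is split into one argument per file name.
tar -czf "$work/people.tgz" $(find . -name '*.txt')

# Count the entries whose type flag in the verbose listing is "h" (hard link).
hardlinks=$(tar -tvzf "$work/people.tgz" | grep -c '^h' || true)
echo "hard link entries: $hardlinks"
```

With GNU tar this should report two hard link entries, one for each file inside people.txt.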

Note that unpacking such an archive with tar -xvzf people.tgz works fine and leaves no trace of any links in the resulting file system. You can check this with the ls -l command, which lists the link count in the second column, and we see that it is just 1:
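To check this without reading the ls -l output by eye, the link count and the size can be picked out with awk. Again a sketch with assumed scratch paths and file contents:

```shell
#!/bin/sh
set -eu

# Rebuild the example tree and the archive in a throwaway directory.
work=$(mktemp -d)
mkdir -p "$work/tar-test/people.txt" "$work/out"
cd "$work/tar-test"
printf 'mary data\n'  > people.txt/mary.txt
printf 'paul data\n'  > people.txt/paul.txt
printf 'peter data\n' > peter.txt
tar -czf "$work/people.tgz" $(find . -name '*.txt')

# Unpack with tar itself into a separate directory.
tar -xzf "$work/people.tgz" -C "$work/out"

# In ls -l output, field 2 is the link count and field 5 the size in bytes.
links=$(ls -l "$work/out/people.txt/paul.txt" | awk '{print $2}')
size=$(ls -l "$work/out/people.txt/paul.txt" | awk '{print $5}')
echo "paul.txt: link count $links, size $size"
```

A link count of 1 and the original size mean that tar restored the file intact.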

ls -l
-rwxrwxrwx 1 root root 11 Apr 21 19:20 mary.txt
-rwxrwxrwx 1 root root 10 Apr 21 19:20 paul.txt

But unpacking with a tool that does not recognize links will do the following:

  1. first unpack the non-empty paul.txt file (first because the link depends on it)
  2. then unpack the link to paul.txt which has size zero
  3. because the empty link has been written over the non-empty file, the file is empty now

Damage done! If the files are not verified after unpacking, you may not detect the data loss for a long time.
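One inexpensive safeguard is to look for empty files right after unpacking. The sketch below does not run a real link-unaware unpacker; it simulates the damage by truncating one file in a made-up scratch tree and then lets find report the files of size zero:

```shell
#!/bin/sh
set -eu

# Stand-in for an unpacked tree (location and contents are arbitrary).
work=$(mktemp -d)
mkdir -p "$work/out/people.txt"
cd "$work/out"
printf 'mary data\n'  > people.txt/mary.txt
printf 'paul data\n'  > people.txt/paul.txt
printf 'peter data\n' > peter.txt

# Simulated damage: what remains after a zero-size link entry
# has been written over the real file.
: > people.txt/paul.txt

# Report regular files that are empty after unpacking.
empty=$(find . -type f -size 0)
echo "empty files: $empty"
```

Legitimately empty files show up too, so the result is a hint to compare against the source, not proof of damage.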

Fix

Here is a way to fix it:

tar -cvzf people.tgz `find . -type f -name '*.txt'`

The find -type f command finds only regular files, no directories, so people.txt, although it matches the '*.txt' pattern, is not in the result list.
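The backtick form splits the find output on whitespace, so file names containing spaces would break it. A possible variant, assuming GNU tar and a find that supports -print0, passes the names NUL-separated through --null -T - instead; the scratch directory and the file contents are again made up:

```shell
#!/bin/sh
set -eu

# Rebuild the example tree in a throwaway directory (location is arbitrary).
work=$(mktemp -d)
mkdir -p "$work/tar-test/people.txt"
cd "$work/tar-test"
printf 'mary data\n'  > people.txt/mary.txt
printf 'paul data\n'  > people.txt/paul.txt
printf 'peter data\n' > peter.txt

# Pack only regular files; the names are read NUL-separated from stdin.
find . -type f -name '*.txt' -print0 | tar -czf "$work/people.tgz" --null -T -

# Verify the result: count all entries and the hard link entries.
listing=$(tar -tvzf "$work/people.tgz")
entries=$(printf '%s\n' "$listing" | wc -l)
hardlinks=$(printf '%s\n' "$listing" | grep -c '^h' || true)
echo "entries: $entries, hard links: $hardlinks"
```

The archive should then hold exactly the three regular files and no link entries.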

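Summary

Different operating systems have different file systems, and not every tool supports every file system feature. Hard and symbolic links are an everyday feature on UNIX systems, while on WINDOWS they exist only on some file systems such as NTFS and are rarely used. Java has offered Files.createLink and Files.createSymbolicLink in java.nio.file only since Java 7, so older Java tools that are built on java.io.File cannot create links of any kind and may treat a link entry in an archive like an ordinary file.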

Here is a fix for Java (that simply ignores links inside a tar-archive), referring to the com.ice.tar library:

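// Fragment from an extractor subclass; extractEntry(dir, e) is the superclass method that writes the entry to disk.
TarEntry e = tarInputStream.getNextEntry();
if (e != null) {
  // linkName is empty for regular files and directories, and set for link entries
  if (e.getHeader().linkName == null || e.getHeader().linkName.length() <= 0) {
    File created = super.extractEntry(dir, e);
    created.setLastModified(e.getModTime().getTime());
  }
  else {
    // skip the zero-size link entry so that it cannot truncate the file extracted earlier
    System.err.println("Did not extract link: "+e.getName()+" -> "+e.getHeader().linkName);
  }
}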


