Saturday, April 4, 2015

Extract Google Blog Export using Java

I spend a lot of hours writing this Blog. Should Google decide to close its Blogger site, all that work would be lost. This is not very likely to happen, but I want to back up my work anyway; maybe I will publish parts of it elsewhere.

So I was looking for a way to save my Blog to my local disk, and found a nice export utility that Google provides. You can find it under

"Settings" - "Other" - "Import & back up" - "Backup Content"

When you download the generated XML file to your local disk, you have a backup of your Blog. But that XML file is big and hard to view. And it does not contain the images you attached.

Here I introduce a short Java application that converts that Google XML into normal, viewable HTML files residing in a "blog" sub-directory. I will develop this utility step by step. You can find the full source code in a folding element at the bottom of this page.

If you used JavaScript in your Blog, be sure that it contains NO end-of-line // comments!
The Google exporter removes all newlines and thus would break that code!
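
To see why this matters, the following minimal sketch simulates the newline removal on a two-line script. The class name and the sample script are made up for illustration, and the exporter is only approximated here by deleting all line breaks.

```java
/** Hypothetical demo: what happens to a // comment when line breaks vanish. */
final class LineCommentDemo
{
  /** Approximates the exporter by deleting all line breaks (an assumption). */
  static String stripNewlines(String script) {
    return script.replace("\r", "").replace("\n", "");
  }

  public static void main(String[] args) {
    final String script = "var a = 1; // first value\nvar b = 2;";
    // after joining the lines, "var b = 2;" sits behind the comment marker
    System.out.println(stripNewlines(script));
  }
}
```

In the joined line the second statement has become part of the comment. A block comment like /* first value */ would have survived.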

Java Application Starter

Here is the base of a normal Java application that can be started from the command line. It demands exactly one argument, the Google export file. When the argument count is wrong, it prints a syntax message and terminates with exit code 1 (by convention, a non-zero exit code signals an error on UNIX platforms). Otherwise it prints its working directory to the error output stream.

public class BlogSaver
{
  public static void main(String[] args) throws Exception {
    if (args.length != 1)
      syntax();
    
    System.err.println("Working in "+System.getProperty("user.dir"));
  }
  
  private static void syntax() {
    System.err.println("SYNTAX: java "+BlogSaver.class.getName()+" blog-exportfile.xml");
    System.exit(1);
  }
}

Save this source code to a file named BlogSaver.java in the same directory where your Blog export resides. Install a recent Java Development Kit (JDK); it is freely available. Compile the Java class by opening a terminal window, changing to that directory, and typing

javac BlogSaver.java

Run this class, passing the export file as its only argument, by typing

java BlogSaver blog-exportfile.xml

With the fully implemented utility you will then find a sub-directory "blog" in your working directory, containing all posts from the Google XML as HTML pages, with image sub-directories named like the pages.

You can also install a Java IDE such as NetBeans or Eclipse to code and run this utility.

Parsing XML

I will develop this top-down. First we need XML parsing. Second we need HTML writing. Everything we need is covered by the standard Java library; we won't need any additional one.

I will implement nested classes instead of creating further Java files, because the code will not be long. I will list all needed imports later. If you use an IDE, you can always organize imports automatically.

To process the big XML file I use a SAX parser. It streams through the document and reports elements as events, so it needs less memory than a DOM parser, which builds the whole document tree first, and it is usually faster.

/**
 * Saves my Google Blog export file as HTML to subdir "blog".
 * @author fritzthecat 2015-04-04
 */
public class BlogSaver
{
  public static void main(String[] args) throws Exception {
    if (args.length != 1)
      syntax();
    
    System.err.println("Working in "+System.getProperty("user.dir"));
    final String fileName = args[0];
    final String basePath = new File(fileName).getParent();
    final BlogXmlHandler handler = new BlogXmlHandler(basePath == null ? "." : basePath);
    final SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(new BufferedInputStream(new FileInputStream(fileName)), handler);
  }

  // TODO: implement SAX handler BlogXmlHandler 

}

This code does not check whether the file exists. (Java will throw a FileNotFoundException that reports the cause of the problem sufficiently.) It sets up a SAX parser and feeds it the input file and a SAX handler, which must be coded now.
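
Before writing the real handler, the event-driven nature of SAX can be tried out with a tiny warm-up. This is only a sketch: the class EntryCounter and the in-memory sample XML are made up for illustration and are not part of BlogSaver.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical warm-up: count entry elements with a SAX handler. */
final class EntryCounter extends DefaultHandler
{
  private int entries;

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attributes) {
    // the parser calls this once for every opening tag it reads
    if (qName.equals("entry"))
      entries++;
  }

  /** Parses the given XML text and returns the number of entry elements. */
  static int count(String xml) {
    try {
      final EntryCounter handler = new EntryCounter();
      final SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
      parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
      return handler.entries;
    }
    catch (Exception e) { // parser configuration, SAX and I/O errors
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(count("<feed><entry/><entry/></feed>"));
  }
}
```

The handler never sees the document as a whole; it only keeps the counter, which is why memory use stays small no matter how big the export file is.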

The SAX Event Handler

The SAX handler is responsible for reading and interpreting the entire XML file. The Google Blog XML consists of entry elements in a feed root element. Many of these entry elements do not contain posts but background information about the Blog. Here is an outline of the XML; I show just those elements that we need to read in the SAX handler:

<feed>
  ....
  
  <entry>
    .....
        
    <published>2015-03-24T15:56:00.000-07:00</published>
    <updated>2015-03-25T05:35:01.729-07:00</updated>
    <category scheme="http://schemas.google.com/g/2005#kind" term="http://schemas.google.com/blogger/2008/kind#post"/>
    <title type="text">JS Slide Show</title>
    <content type="html">(Here is the blog post text as escaped HTML)</content>
    
    ....
    
  </entry>

</feed>

Only those entry elements matter that have a category sub-element whose term attribute ends with kind#post. Here is the SAX handler that

  1. collects element content text, and
  2. sets its state according to the elements passing by.

  private static class BlogXmlHandler extends DefaultHandler
  {
    private final String basePath;
    private final StringBuffer characters = new StringBuffer();
    
    private boolean collectText;
    private boolean wantThis;
    private String published;
    private String updated;
    private String title;

    public BlogXmlHandler(String basePath) {
      this.basePath = basePath;
    }

    @Override
    public void characters(char [] chars, int start, int length) throws SAXException {
      if (collectText)
        characters.append(chars, start, length);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
      if (qName.equals("entry")) {
        collectText = true;
        wantThis = false;
      }
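      else if (qName.equals("category")) {
        // an entry may carry several category elements; only switch on, reset happens at next entry
        final String term = attributes.getValue("term");
        if (term != null && term.endsWith("kind#post"))
          wantThis = true;
      }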
      characters.setLength(0);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
      if (collectText)
        if (qName.equals("entry"))
          collectText = false;
        else if (qName.equals("published"))
          published = characters.toString();
        else if (qName.equals("updated"))
          updated = characters.toString();
        else if (qName.equals("title"))
          title = characters.toString().trim();
        else if (wantThis && qName.equals("content"))
          process(characters.toString());
    }

    // TODO: implement process()

  }

This is a SAX handler that recognizes the Blog elements and manages its state according to them. When an entry starts, it begins collecting text. When a category element shows the attribute value mentioned above, the entry is a post we want to write to HTML. This is done by a process() method that is not implemented yet.
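
The state handling can be tried out on a small in-memory feed. The following is a simplified sketch, not the handler from above: the class PostTitleCollector is made up, it collects only titles, and the term of the first sample entry is just a stand-in for some non-post entry.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo: collect the titles of those entries that are posts. */
final class PostTitleCollector extends DefaultHandler
{
  private final List<String> titles = new ArrayList<String>();
  private final StringBuilder characters = new StringBuilder();
  private boolean isPost;

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attributes) {
    if (qName.equals("entry")) {
      isPost = false; // state is reset for every entry
    }
    else if (qName.equals("category")) {
      final String term = attributes.getValue("term");
      if (term != null && term.endsWith("kind#post"))
        isPost = true;
    }
    characters.setLength(0); // start collecting text for the new element
  }

  @Override
  public void characters(char[] chars, int start, int length) {
    characters.append(chars, start, length);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    // relies on category preceding title, as in the outline above
    if (isPost && qName.equals("title"))
      titles.add(characters.toString().trim());
  }

  static List<String> postTitles(String xml) {
    try {
      final PostTitleCollector handler = new PostTitleCollector();
      SAXParserFactory.newInstance().newSAXParser().parse(
          new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
      return handler.titles;
    }
    catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    final String xml =
        "<feed>" +
        "<entry><category term=\"http://schemas.google.com/blogger/2008/kind#settings\"/>" +
        "<title>A setting</title></entry>" +
        "<entry><category term=\"http://schemas.google.com/blogger/2008/kind#post\"/>" +
        "<title>JS Slide Show</title></entry>" +
        "</feed>";
    System.out.println(postTitles(xml));
  }
}
```

Only the second entry is reported, because only its category term ends with kind#post.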

Mind that static nested classes do not hold a reference to an instance of their enclosing class. If you are a Java programmer you may know that implementing static things is not recommended. But on a class declaration the static keyword has a different meaning: it only says that the class needs no outer instance. Thus I prefer static nested classes to normal inner classes, which always need a reference to an instance of their outer class.
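
The difference shows up when you create instances. A minimal sketch, with all class names invented for this illustration:

```java
/** Hypothetical demo of the two kinds of nested classes. */
final class NestingDemo
{
  private final String name = "outer instance";

  /** Static nested class: no hidden reference to a NestingDemo instance. */
  static class Nested {
    String describe() { return "nested, standalone"; }
  }

  /** Inner class: every instance is bound to one NestingDemo instance. */
  class Inner {
    String describe() { return "inner of " + name; }
  }

  public static void main(String[] args) {
    final Nested nested = new NestingDemo.Nested();     // no outer object needed
    final Inner inner = new NestingDemo().new Inner();  // needs an outer object
    System.out.println(nested.describe());
    System.out.println(inner.describe());
  }
}
```

Inner can read the field name of its outer object, Nested cannot. BlogXmlHandler needs nothing from a BlogSaver instance, so the static variant is sufficient there.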

Processing the Blog Post

Here are the remaining methods of BlogXmlHandler that process the content element.

    private void process(String blogPost) {
      HtmlWriter page = null;
      try {
        page = new HtmlWriter(toFileName(title), title, published, updated);
        page.writeHtml(blogPost);
      }
      catch (IOException ex) {
        throw new RuntimeException(ex);
      }
      finally {
        try {
          if (page != null)
            page.close();
        }
        catch (Exception ex) { // ignore close error
        }
      }
    }
    
    private String toFileName(String title) {
      final String path = basePath + "/" + "blog";
      new File(path).mkdirs();
      final int maxLength = 64 - ".html".length();
      title = replace(title);
      title = (title.length() <= maxLength) ? title : title.substring(0, maxLength);
      final String fileName = path+"/"+title;
      return fileName+".html";
    }
    
    private String replace(String title) {
      final StringBuffer sb = new StringBuffer();
      for (int i = 0; i < title.length(); i++) {
        final char c = title.charAt(i);
        sb.append(Character.isLetterOrDigit(c) ? c : '_');
      }
      return sb.toString();
    }

The collected element contents published, updated and title are passed to an instance of HtmlWriter for further processing. The HTML file name is built from the post title by replacing every character that is not a letter or digit with an underscore. The title part is cut so that the name, including the .html extension, is at most 64 characters long. This name, without the extension, will also be the name of the image sub-directory. The SAX handler calls two methods on every HTML writer: writeHtml() and close().
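
Here is the file name logic as a stand-alone sketch, so it can be tried out without creating directories. The class FileNameDemo and the constant MAX_LENGTH are made up for this illustration; the rules are the ones from toFileName() and replace() above.

```java
/** Hypothetical stand-alone copy of the file name logic, without directories. */
final class FileNameDemo
{
  static final int MAX_LENGTH = 64; // total length, including ".html"

  static String toFileName(String title) {
    final StringBuilder sb = new StringBuilder();
    for (int i = 0; i < title.length(); i++) {
      final char c = title.charAt(i);
      sb.append(Character.isLetterOrDigit(c) ? c : '_');
    }
    // leave room for the extension, then cut the title part if necessary
    final int maxBase = MAX_LENGTH - ".html".length();
    final String base = (sb.length() <= maxBase) ? sb.toString() : sb.substring(0, maxBase);
    return base + ".html";
  }

  public static void main(String[] args) {
    System.out.println(toFileName("JS Slide Show"));   // JS_Slide_Show.html
    System.out.println(toFileName("Extract Google Blog Export using Java"));
  }
}
```

Mind that two titles that differ only in special characters, or only after the 59th character, map to the same file name, so the later post would overwrite the earlier one.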

Solve peripheral problems in a wrapper class, pass core problems to a delegate class, until the onion is peeled :-)

HTML Writer

An instance of this class is responsible for creating one HTML file that represents one post, including its images. Here is its outline that writes the basic HTML code and title and dates.

  private static class HtmlWriter extends FileWriter
  {
    private final String pagePath;
    private final String pageName;
    private final boolean noSmallImagePreview = true;
    
    public HtmlWriter(String fileName, String title, String published, String updated) throws IOException {
      super(fileName);
      
      final File file = new File(fileName);
      pagePath = file.getParent();
      pageName = file.getName();
      
      write("<!DOCTYPE html>\n");
      write("<html>\n");
      write("<head>\n");
      write("  <meta charset=\"UTF-8\" />\n");
      write("  <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\" />\n");
      write("  <title>"+title+"</title>\n");
      write("</head>\n");
      write("<body>\n");
      write("  <h1 style='text-align: center;'>"+title+"</h1>\n");
      write("  <hr/>\n");
      write("  <p>Published: "+published.substring(0, 10)+"<br/>\n");
      write("     Updated: "+updated.substring(0, 10)+"</p>\n");
      write("  <hr/>\n");
    }
    
    public void writeHtml(String html) throws IOException {
      // TODO
    }
    
    @Override
    public void close() throws IOException {
      write("</body>\n");
      write("</html>\n");
      super.close();
    }
  }
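
One thing to watch: the page declares the charset UTF-8, but before Java 18 a FileWriter encodes with the platform default charset, which is not UTF-8 on every system. Below is a minimal sketch of writing with an explicit encoding, assuming Java 7 or later. The class and method names are made up for this illustration and are not part of BlogSaver.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical demo: write a file as UTF-8 regardless of the platform default. */
final class Utf8WriteDemo
{
  static void writeUtf8(Path file, String text) throws IOException {
    try (Writer writer = new OutputStreamWriter(
        new FileOutputStream(file.toFile()), StandardCharsets.UTF_8)) {
      writer.write(text);
    }
  }

  /** Writes the text to a temporary file and reads it back as UTF-8. */
  static String roundTrip(String text) {
    try {
      final Path file = Files.createTempFile("utf8-demo", ".html");
      try {
        writeUtf8(file, text);
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
      }
      finally {
        Files.delete(file);
      }
    }
    catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip("<p>Gr\u00fc\u00dfe</p>"));
  }
}
```

To apply this idea to HtmlWriter, the class could extend OutputStreamWriter instead of FileWriter and pass such a stream and charset to the super constructor.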

The essential writeHtml() method is the subject of the next section.

Saving and Linking Images

When saving the HTML content I wanted to

  1. insert newlines wherever possible, to make the HTML human readable
  2. replace image URLs by local URLs, which includes downloading the images.

The following code does not parse HTML to find image URLs; it relies on a certain naming convention of the Blog and does its work using regular expressions. Here is a link to a very short introduction on regular expressions for programmers.

An example Blog image URL would be

https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjqpzcphIKt7d8Ptatz1sudnRIlfAeZkRvXoLum8qc6MqnBxjCXt3qd9BIn3XeaRM-pfWqhMU8v901nsoiqt5EnYDZCpByKFD_zLUnTTHzNJ883Z46-7gTtxBgJ9csUKlDWmodA-h2fC-vg/s400/IMG_4528.JPG

So all images are available under a URL of the form

http://*.bp.blogspot.com/*

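This translates to the regular expression

(http://[^>]*\.bp\.blogspot\.com/[^"]+)"

where the parentheses capture the URL itself, without the closing quote. It looks like this when written in Java, where backslashes and quotes must be escaped inside a string literal:

(http://[^>]*\\.bp\\.blogspot\\.com/[^\"]+)\"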

Regular expressions are what some people call "write-only". Sorry for that, but sometimes they do a really good job. So here is the code to process the HTML and replace image URLs by local URLs.

    public void writeHtml(String html) throws IOException {
      // handle all google image URLs by downloading the image
      // to a local folder and replacing the URL by a local URL
      final Pattern pattern = Pattern.compile(
        noSmallImagePreview ?
          "<a href=\"(https?://[^>]*\\.bp\\.blogspot\\.com/[^\"]+)\"[^>]*><img[^>]*></a>"
        :
          "=\"(https?://[^>]*\\.bp\\.blogspot\\.com/[^\"]+)\""
      );
      final Matcher matcher = pattern.matcher(html);
      final StringBuffer sb = new StringBuffer();
      while (matcher.find()) {
        final String imageUrl = matcher.group(1);
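        // quoteReplacement() keeps '$' and '\' in a file name from being read as group references
        matcher.appendReplacement(
          sb,
          Matcher.quoteReplacement(
            noSmallImagePreview ?
              "<img src=\""+saveImage(imageUrl)+"\"/>"
            :
              "=\""+saveImage(imageUrl)+"\""
          )
        );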
      }
      matcher.appendTail(sb);
      html = sb.toString();
      
      // write HTML to file
      write(html);
    }
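
To watch the find-and-replace loop at work without touching the network, here is a stand-alone sketch. It uses the first of the two patterns; the class ImageLinkDemo, the method localName() that stands in for saveImage(), and the sample URL are all made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo of the replacement loop; nothing is downloaded. */
final class ImageLinkDemo
{
  private static final Pattern LINKED_IMAGE = Pattern.compile(
      "<a href=\"(https?://[^>]*\\.bp\\.blogspot\\.com/[^\"]+)\"[^>]*><img[^>]*></a>");

  /** Stand-in for saveImage(): derives a local path from the URL. */
  static String localName(String imageUrl) {
    return "images/" + imageUrl.substring(imageUrl.lastIndexOf('/') + 1);
  }

  static String localize(String html) {
    final Matcher matcher = LINKED_IMAGE.matcher(html);
    final StringBuffer sb = new StringBuffer();
    while (matcher.find()) {
      // group(1) is the URL captured by the parentheses in the pattern
      final String replacement = "<img src=\"" + localName(matcher.group(1)) + "\"/>";
      // copies the text before the match, then the (quoted) replacement
      matcher.appendReplacement(sb, Matcher.quoteReplacement(replacement));
    }
    matcher.appendTail(sb); // copies whatever follows the last match
    return sb.toString();
  }

  public static void main(String[] args) {
    final String html =
        "<p>Before</p>" +
        "<a href=\"http://1.bp.blogspot.com/-abc/s1600/photo.png\" imageanchor=\"1\">" +
        "<img src=\"http://1.bp.blogspot.com/-abc/s400/photo.png\"></a>" +
        "<p>After</p>";
    System.out.println(localize(html));
  }
}
```

The whole link, including the preview image inside it, is replaced by a single img element that points to the local copy of the big image.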

I implemented two options for saving the HTML:

  1. with a size-reduced image preview that magnifies when clicked upon, like the Blog does it
  2. with NO image preview and only the originally sized image showing (this is default)

So there are two different regular expressions for these options, and two different replacements according to them. You can toggle between them using the noSmallImagePreview flag in HtmlWriter.

The code uses Pattern and Matcher to process the HTML with these regular expressions. Newlines before and after some standard elements like <p>, <ul> etc. are inserted with regular expressions as well; that step is not part of the listing above.

This is not the fastest way to process, but it needs only little code.
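
The newline insertion mentioned above could be sketched as follows. This is only an assumption of how such a step might look, not the code of the full utility; the class name and the list of tags are made up.

```java
/** Hypothetical sketch: put some block-level tags on lines of their own. */
final class NewlineDemo
{
  static String insertNewlines(String html) {
    return html
        // wrap every listed opening or closing tag in newlines
        .replaceAll("(?i)(</?(?:p|ul|ol|li|div|h[1-6])\\b[^>]*>)", "\n$1\n")
        // collapse the doubled newlines between adjacent tags
        .replaceAll("\n{2,}", "\n")
        .trim();
  }

  public static void main(String[] args) {
    System.out.println(insertNewlines("<p>One</p><ul><li>Two</li></ul>"));
  }
}
```

A caveat of this simple approach: a listed tag nested inside a <pre> element would get line breaks too, which changes how the preformatted text renders.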

Downloading the Image

What remains is to download each image, store it in a sub-directory named like the page, and return a URL string to substitute for the original URL in the HTML.

The following code could have been shorter, had there not been the need to support both image options mentioned above. The Google Blog stores the size-reduced image preview in a sub-directory "s400", and the original one in "s1600". So we need to carry these directory names over to our local directory, and integrate them into the returned URL string.

    private String saveImage(String urlString) {
      final String imageDir = pageName.substring(0, pageName.length() - ".html".length());
      
      final int lastSlash = urlString.lastIndexOf('/');
      final String firstUrlPart = urlString.substring(0, lastSlash);
      final String lastSubdir = firstUrlPart.substring(firstUrlPart.lastIndexOf('/') + 1);
      final String fileName = urlString.substring(lastSlash + 1);
      
      final File imagePath = new File(new File(pagePath, imageDir), lastSubdir);
      imagePath.mkdirs();
      
      final File imageFile = new File(imagePath, fileName);
      // try-with-resources closes the network channel and the file, even when the download fails
      try (ReadableByteChannel byteChannel = Channels.newChannel(new URL(urlString).openStream());
           FileOutputStream output = new FileOutputStream(imageFile)) {
        output.getChannel().transferFrom(byteChannel, 0, Long.MAX_VALUE);
        
        final String newUrl = imageDir+"/"+lastSubdir+"/"+fileName;
        System.err.println("Saved image "+newUrl);
        return newUrl;
      }
      catch (IOException e) { // includes MalformedURLException
        throw new RuntimeException(e);
      }
    }
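
The string arithmetic at the start of saveImage() is easier to follow with a concrete value. The following sketch repeats just that part; the class ImageUrlParts and the sample URL are made up for illustration.

```java
/** Hypothetical demo: split an image URL like saveImage() does. */
final class ImageUrlParts
{
  /** Returns { size sub-directory, file name } of an image URL. */
  static String[] split(String urlString) {
    final int lastSlash = urlString.lastIndexOf('/');
    // everything before the file name; its last segment is the size directory
    final String firstUrlPart = urlString.substring(0, lastSlash);
    final String lastSubdir = firstUrlPart.substring(firstUrlPart.lastIndexOf('/') + 1);
    final String fileName = urlString.substring(lastSlash + 1);
    return new String[] { lastSubdir, fileName };
  }

  public static void main(String[] args) {
    final String[] parts = split("http://1.bp.blogspot.com/-abc/s1600/photo.png");
    System.out.println(parts[0] + " / " + parts[1]);   // s1600 / photo.png
  }
}
```

For a page named JS_Slide_Show.html, saveImage() would thus store this image as blog/JS_Slide_Show/s1600/photo.png and return the relative URL JS_Slide_Show/s1600/photo.png.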

The download via an NIO channel was inspired by Stack Overflow, which has become an indispensable support for programmers. Finally, after 13 years of its existence, we could slowly take note of Java NIO :-)

Full Source Code

In the expandable fold below you will find the full Java code of the class BlogSaver, including its imports.


  Full Source Code



