Sunday, 29 January 2017

How to Read and Write XML

XML is a "data language" that has been around for about 20 years now. Its purpose was (and still is) to hold content, not layout.

The idea was to convert XML to HTML on the fly using XSLT. To bring HTML nearer to XML, the XHTML standard was introduced, but it has been given up in favour of HTML5, which in general can no longer be read by XML parsers. So this idea has been washed away by JavaScript and CSS frameworks. What remains is XML as a content store and as a communication format between browser and HTTP server, as an alternative to JSON. It is also used heavily for application configuration.

Originally an SGML dialect, it mainly uses these control characters:

< > / & " '

Read XML

Following is valid XML:

<hello-world />

Unlike in HTML, where the semantics of every element are well-defined, XML elements can have any tag name, so I called it hello-world. The trailing slash ('/') closes the element immediately, so it has no content; it carries only semantics (meaning). The space before the slash is optional.

Markup, Text, Elements, Tags ...

XML consists of markup and text.

<reminder>Write XML Crash Course</reminder>

The text "Write XML Crash Course" is the content, the element reminder is markup. You need an opening tag before the content, and a closing tag after it. When I say element, I mean the element including its content; when I say tag, I mean just the markup of the element.

XML forms a hierarchical structure: elements can contain sub-elements, and elements can carry attributes.

<reminders>
  <reminder>Write XML Crash Course</reminder>
  <reminder important="true">Play with your children</reminder>
</reminders>

The important="true" is an attribute of the element reminder.
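To see this hierarchy from a program's point of view, here is a minimal sketch that reads the reminders document. It assumes Python and its standard xml.etree.ElementTree module, which is my choice for illustration only; any XML parser would do. The function name read_reminders and the "false" fallback for a missing attribute are made up for this example.

```python
import xml.etree.ElementTree as ET

DOCUMENT = """<reminders>
  <reminder>Write XML Crash Course</reminder>
  <reminder important="true">Play with your children</reminder>
</reminders>"""


def read_reminders(xml_text):
    """Return a list of (text, important) pairs, one per reminder element."""
    root = ET.fromstring(xml_text)
    result = []
    for reminder in root.findall("reminder"):
        # A missing attribute counts as "false" (an assumption of this sketch).
        important = reminder.get("important", "false") == "true"
        result.append((reminder.text, important))
    return result


reminders = read_reminders(DOCUMENT)
for text, important in reminders:
    print(text, "(important)" if important else "")
```

The sub-elements are reached by walking down from the root, while the attribute is looked up by name on the element that carries it.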

Mind that you cannot have unclosed elements in XML, although they are allowed in HTML5. Thus the following XML would be invalid (strictly speaking: not well-formed):

<reminders>
  <reminder>I forgot to close this element ...
  <br>
  <hr>
</reminders>
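A parser reports such a document as an error instead of guessing what was meant. The following sketch, again assuming Python's standard xml.etree.ElementTree module, wraps that check into a small helper. The name is_well_formed is hypothetical, not part of any library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The unclosed <reminder>, <br> and <hr> make this document unreadable for XML parsers.
broken = "<reminders><reminder>I forgot to close this element ...<br><hr></reminders>"

# Closing every element, here with the short form <br/>, repairs it.
repaired = "<reminders><reminder>Now it is closed</reminder><br/><hr/></reminders>"

print(is_well_formed(broken))
print(is_well_formed(repaired))
```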

But XML allows text to be between elements:

<reminders>
  Text can be here ...
  <reminder>Write XML Crash Course</reminder>
  and also here, ....
  <reminder important="true">Play with your children</reminder>
  and also here, everywhere inside the root element, but not before or after it!
</reminders>

This is called "mixed content". It was mainly introduced to support HTML-style inline text markup like <b>, <i> etc. (A reminder of how difficult it is to separate content from layout!)
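How a program sees mixed content depends on the parser API. As one illustration, Python's standard xml.etree.ElementTree module (an assumption of this sketch, not something the format prescribes) stores the text before the first child in the parent's text property, and the text following a child in that child's tail property.

```python
import xml.etree.ElementTree as ET

document = (
    "<reminders>Text can be here ..."
    "<reminder>Write XML Crash Course</reminder>and also here"
    '<reminder important="true">Play with your children</reminder>and here'
    "</reminders>"
)

root = ET.fromstring(document)

# Text before the first sub-element belongs to the parent.
print(root.text)

# Text after a sub-element is attached to that sub-element as its "tail".
print(root[0].tail)
print(root[1].tail)

# itertext() walks through all text pieces in document order.
all_text = list(root.itertext())
print(all_text)
```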

XML does not allow more than one element on the first level (there is only one root element), thus the following XML is invalid:

<?xml version="1.0"?>
<reminders>
  <reminder>Write XML Crash Course</reminder>
</reminders>
<reminders>
  <reminder>Play with your children</reminder>
</reminders>

I used an XML heading (officially called the "XML declaration") here to express that this is a complete XML document. The next section explains this heading.

Encoding

To form completely valid XML that should be accepted by any parser, we need this minimal heading:

<?xml version="1.0"?>
<reminders>
  ....

Mind that there must not be any leading space or newline between start of the file and this heading (except an optional byte order mark).

The XML version is a hint for XML parsers which rules to apply (there is a rarely used version 1.1, but no version 2.0). Most XML parsers also accept XML without a heading, but then the encoding of the XML text must be UTF-8 (or UTF-16 with a byte order mark).

The real value of the heading is that you can tell which encoding the file uses (JSON has no such mechanism):

<?xml version="1.0" encoding="ISO-8859-1"?>

The encoding attribute tells the parser how to decode the bytes of the file. In Unicode encodings such as UTF-8, one character can consist of several bytes. Different operating systems use different default encodings (also called character sets). For example, Linux uses UTF-8 by default (although you can configure it to use another), and Windows traditionally uses CP-1252 in Western locales (largely matching the Western European ISO-8859-1). Apart from the XML encoding attribute or a byte order mark (which is not very readable or popular), there is no reliable way to find out in which encoding a file was written.
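The effect of the encoding attribute can be tried out with raw bytes. This is a sketch assuming Python's standard xml.etree.ElementTree module; the sample word is made up. The letter "ä" is the single byte 0xE4 in ISO-8859-1, which is not a valid byte sequence in UTF-8.

```python
import xml.etree.ElementTree as ET

# With the declaration, the parser knows how to decode the byte 0xE4.
latin1_bytes = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<reminder>M\xe4rz</reminder>"
)
root = ET.fromstring(latin1_bytes)
print(root.text)

# Without a declaration, the same bytes are taken as UTF-8 and rejected.
undeclared_bytes = b"<reminder>M\xe4rz</reminder>"
try:
    ET.fromstring(undeclared_bytes)
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False
print(undeclared_accepted)
```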

Comments

Comments look like this:

<?xml version="1.0"?>
<!-- This is a comment.
     It can be anywhere except inside element tags or attribute contents, or before the XML heading.
     It must not contain two consecutive hyphens, even if a lenient parser tolerates them.
-->
<hello-world/>
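How strictly the double-hyphen rule is enforced depends on the parser. As a data point, the parser behind Python's standard xml.etree.ElementTree module (my choice for this sketch) skips ordinary comments silently, but refuses a comment that contains two consecutive hyphens.

```python
import xml.etree.ElementTree as ET

# An ordinary comment between the heading and the root element is simply skipped.
with_comment = '<?xml version="1.0"?><!-- This is a comment --><hello-world/>'
root = ET.fromstring(with_comment)
print(root.tag)

# Two consecutive hyphens inside a comment are reported as an error.
double_hyphen = "<hello-world><!-- not -- allowed --></hello-world>"
try:
    ET.fromstring(double_hyphen)
    double_hyphen_accepted = True
except ET.ParseError:
    double_hyphen_accepted = False
print(double_hyphen_accepted)
```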

Entity References

Because XML uses control characters such as '<', there must be a way to write them as plain text when needed. For this purpose "entity references" were provided.

<reminder>
  How can I write a '<' when this would open an XML-tag?
  I can use the character entity reference &lt;
</reminder>

There are character entities and entities declared in a DTD. The latter can be internal (a name for a piece of replacement text) or external, which serves for including external XML snippets (structured XML authoring).
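When XML is generated by a program, the escaping is usually left to a library function. A minimal sketch, assuming Python and its standard xml.sax.saxutils.escape function, which replaces '&', '<' and '>' by their entity references. The parser turns them back into plain characters when reading.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw_text = "How can I write a '<' or an '&' as plain text?"

# escape() replaces the control characters by entity references.
escaped = escape(raw_text)
print(escaped)

# Reading the element gives back the original characters.
root = ET.fromstring("<reminder>" + escaped + "</reminder>")
print(root.text)
```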

Quotes

For attribute definitions, you can use single (') or double (") quotes, as you like.

<?xml version="1.0"?>
<hello-world world = "universe" planet='earth'/>

Attribute contents must be enclosed in quotes. An attribute without a value is not possible. You can't close an opening double quote with a single quote, but you can put single quotes inside double quotes, or vice versa.
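The quoting rules can be checked with a few lines of code. This sketch assumes Python's standard library: xml.sax.saxutils.quoteattr picks a quote character that does not clash with the attribute content, and xml.etree.ElementTree reads the result back. The attribute value is made up for the example.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET

value = 'the "blue" planet'

# The value contains double quotes, so quoteattr() encloses it in single quotes.
quoted = quoteattr(value)
print(quoted)

root = ET.fromstring("<hello-world planet=" + quoted + "/>")
print(root.get("planet"))


def parses(xml_text):
    """Return True if the parser accepts the text (helper for this sketch only)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# An opening double quote closed by a single quote is an error,
# and so is an attribute without a value.
print(parses("<hello-world planet=\"earth'/>"))
print(parses("<hello-world planet/>"))
```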

Namespaces

XML-elements can contain namespace-prefixes:

<my-namespace:reminders>
  <my-namespace:reminder>Write XML Crash Course</my-namespace:reminder>
  <my-namespace:reminder important="true">Play with your children</my-namespace:reminder>
</my-namespace:reminders>

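This is a way to reuse the same element name, e.g. <meeting>, bound to different contexts, e.g. <time:meeting> and <location:meeting>, and then use both in the same XML document. In a complete document, a namespace-aware parser additionally expects every prefix to be bound to a namespace URI through an xmlns:prefix attribute on the element or one of its ancestors (omitted in the snippet above).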
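A namespace-aware parser wants the prefix to be bound to a namespace URI through an xmlns attribute. The following sketch assumes Python's standard xml.etree.ElementTree module, and the URI http://example.org/reminders is purely hypothetical. This particular module reports element names in the form {uri}local-name and rejects a prefix that was never declared.

```python
import xml.etree.ElementTree as ET

NAMESPACE = "http://example.org/reminders"  # hypothetical URI, for illustration only

declared = (
    '<my-namespace:reminders xmlns:my-namespace="' + NAMESPACE + '">'
    "<my-namespace:reminder>Write XML Crash Course</my-namespace:reminder>"
    "</my-namespace:reminders>"
)
root = ET.fromstring(declared)

# The prefix is replaced by the URI it was bound to.
print(root.tag)

# Searching needs the qualified name, too.
texts = [r.text for r in root.findall("{" + NAMESPACE + "}reminder")]
print(texts)

# A prefix without an xmlns declaration is an error for this parser.
try:
    ET.fromstring("<my-namespace:reminders/>")
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False
print(undeclared_accepted)
```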

Arbitrary Content (Escaped)

There is a way to integrate data of any kind and structure into XML. This is called a CDATA section:

<content>
  <![CDATA[
Any markup here will not be interpreted by the XML parser, except the closing-token below.
Use it e.g. for embedded XML or HTML source code!
]]>
</content>
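For the reading program, a CDATA section is just text. A small sketch, assuming Python's standard xml.etree.ElementTree module: the markup inside the section arrives as literal characters, and no sub-elements are created.

```python
import xml.etree.ElementTree as ET

document = "<content><![CDATA[<b>bold</b> & <i>italic</i>]]></content>"
root = ET.fromstring(document)

# The tags and the ampersand were not interpreted, they are plain text.
print(root.text)

# Consequently the content element has no child elements.
print(len(root))
```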

Write XML

To write XML, you can use any text editor. Watch out for your operating system's encoding: if it is not UTF-8 (the XML default), you need either to write your encoding into the XML heading, or to tell your editor to save the file in UTF-8. Some text editors may manage this automatically for you.
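When a program writes the XML instead of an editor, the same rule applies: the bytes and the heading must agree. Here is a minimal sketch assuming Python's standard xml.etree.ElementTree module, which writes the heading and encodes the text in one step. The element content is made up, and the in-memory buffer stands in for a real file.

```python
import io
import xml.etree.ElementTree as ET

root = ET.Element("reminders")
reminder = ET.SubElement(root, "reminder", {"important": "true"})
reminder.text = "Play with your children in the café"

# Write into an in-memory buffer; a file name would work the same way.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding="ISO-8859-1", xml_declaration=True)
data = buffer.getvalue()
print(data)

# Reading the bytes back uses the declared encoding.
round_trip = ET.fromstring(data)
print(round_trip.find("reminder").text)
```

Passing encoding="utf-8" instead would match the default that parsers assume when no encoding is declared.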

How can you know which elements and attributes you are allowed to use?

When you use no predefined document type, you can use any tag name and attribute name you can think of. Just make sure that your document is well-formed: all tags and quotes must be closed.

In case you use a document-type, there are two options:

  1. DTD (Document-Type-Definition), or
  2. XML-schema

A DTD cannot describe a document as precisely as an XML schema can, but it is much more readable.

Your Own Document Design

So, where to put the content: into elements or into attributes? This is a frequently discussed question. My advice is to put into attributes only things that define the semantics of the element more precisely (meta-information). Mind also that attribute content is more restricted than element content: it cannot hold sub-elements, the enclosing quote character must be escaped, and parsers turn line breaks in it into spaces.

Predefined Document Design

When your XML editor does not support one of the type-definitions mentioned above, you are forced to understand that document-type-definition and manually write XML that conforms to it. This can be quite demanding when the type is big and complex, so check the web for free XML editors that support your type-definition!

And ...?

XML authors can convert their work to HTML, PDF, Word, or any other document format, provided they have separated content from layout. Chapter structures and text markup such as emphasis would not count as layout.

DocBook (or Simplified DocBook) is the way most of them are going. There are also frameworks like DITA, suitable for help authors.

There are free XML-to-PDF converters on the web. Manually it can be done via Apache FOP, but you will need programming knowledge. XML Mind is a quite friendly XML editor.



