Developers’ Weblog

Sponsored by
HostEurope Logo

Developers’ Weblog

⚠ This page contains old, outdated, obsolete, … historic or WIP content! No warranties e.g. for correctness!

All 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38

Valid UTF-8 but invalid XML

2014-11-19 by t.glaser@tarent.de (EvolvisForge blog)
Tags: work debian

Another PSA: something surprising about XML.

As you might all know, XML must be valid UTF-8 (or UTF-16 (or another encoding supported by the parser, but one which yields valid Unicode codepoints when read and converted)). Some characters, such as the ampersand ‘&’, must be escaped (“&#38;” or “&#x26;”, although “&amp;” may also work, depending on the domain) or put into a CDATA section (“<![CDATA[&]]>”).

A bit surprisingly, a literal backspace character (ASCII 08h, Unicode U+0008) is not allowed in the text. I filed a bugreport against libxml2, asking it to please encode these characters.

A bit more research followed. Surprisingly, there are characters that are not valid in XML “documents” in any way, not even as entities or in CDATA sections. (xmlstarlet, by the way, errors out somewhat nicely for an unescaped literal or entity-escaped backspace, but behaves absolutely hilarious for a literal backspace in a CDATA section.) Basically, XML contains a whitelist for the following Unicode codepoints:

  • U+0009
  • U+000A
  • U+000D
  • U+0020‥U+D7FF
  • U+E000‥U+FFFD
  • U-00010000‥U-0010FFFF

Additionally, a certain number of codepoints is discouraged: U+007F‥U+0084 (IMHO wise), U+0086‥U+009F (also wise, but why allow U+0085?), U+FDD0‥U+FDEF (a bit surprisingly, but consistent with disallowing the backspace character), and the last two codepoints of every plane (U+FFFE and U+FFFF were already disallowed, but U-0001FFFE, U-0001FFFF, …, U-0010FFFF weren’t; this is extremely wise).

The suggestion seems to be to just strip these characters silently from the XML “document”.

I’m a bit miffed about this, as I don’t even use XML directly (I’m extending a PHP “webapplication” that is a SOAP client and talks to a Java™ SOAP-WS) and would expect this to preserve my strings, but, oh my. I’ve forwarded the suggestion to just strip them silently to the libxml2 maintainers in the aforementioned bug report, for now, and may even hack that myself (on customer-paid time). More robust than hacking the PHP thingy to strip them first, anyway — I’ve got no control over the XML after all.

Sharing this so that more people know that not all UTF-8 is valid in XML. Maybe it saves someone else some time. (Now wondering whether to address this in my xhtml_escape shell function. Probably should. Meh.)

MirOS Logo