Developers’ Weblog

Sponsored by
HostEurope Logo

Developers’ Weblog

⚠ This page contains old, outdated, obsolete, … historic or WIP content! No warranties e.g. for correctness!

All 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

Proposed extensions to the JSON specification

2012-12-01 by tg@
Tags: debian

The following proposal extends the JSON specification, with the idea of using JSON as an information interchange format, rather than just a way of writing certain ECMAscript values. They do not add anything but only restrict valid JSON content and encoders with some rationale.

First of, I’d like to remind everyone, including JSON’s author, that JSON is case-sensitive, except in the four hexdigits after a backslash-u sequence in a String.

Second, I’d like to remind everyone that JSON is not binary-safe. No way around that, it implements Unicode (actually, 16-bit UCS-2, and it doesn’t guarantee that UTF-16 surrogates are correctly paired) text. I also consider only UTF-{8,{16,32}{B,L}E} valid encodings for JSON. (No PDP endian, either. Sorry, guys.)

For my first proposal, I’d like to point out CVE-2011-4815 which was about overflowing hashtables. The obvious fix is to randomise the hash per hashtable; to ensure this doesn’t leak, we sort ASCIIbetically the keys of an Object in the encoder. (Using Unicode is good here — we can just sort the keys as UTF-8 strings by their uint8_t value or as Unicode (UCS-2 or even UCS-4 or UTF-16) strings by the codepoints.) JSON was never preserving the order of elements in an Object anyway so we make it standardised (we still accept any order, and, when parsing, in collision cases, the later value wins). This also helps diffs.

For my second proposal, I’d like to forbid \u0000, \uFFFE, \uFFFF in strings. The first because many implementations use C strings, and for an information interchange format this is better; it also has security implications to allow NUL in a string. The other two, but not unpaired UTF-16 surrogates (as ECMAscript uses UCS-2 and got UTF-16 only later) because they’re not valid Unicode; JSON was not binary-safe already so why bother. Among other benefits, this also helps implementations.

For my third proposal, I’d like to agree that implementations should impose a nesting depth limit that may be user-defined, and in the face of which, cyclic checking may be ignored by an encoder. I emit nesting depth overflows as literalnull; might also throw an error. Since I was asked, the common “standard” value is to restrict nesting depth is 32, unless the user specified one. (I also saw 8, but 32 WFM pretty well.) Most seem to use it even if it may seem low at first. Only specialised applications probably need more, and they can always pass a value.

For my forth proposal, backslash-escape U+007F‥U+009F always. It may upset humans, editors, databases, etc. (This paragraph is newly added, after some IRC discussion.)

All these do not permit anything that wasn’t accepted to be accepted afterwards. I’ve got a fifth proposal which changes acceptance rules — but only for a subset of parsers: formally JSON is defined in ECMA-262 as industry standard that, in contrast to RFC 4627, always allowed any Value as top-level element of a JSON text. I’d like to make it so, and ignore the RFC’s requirement for it to be an Object or Array. Even so, the first two characters (after the BOM, if any) of a JSON text always are in the non-NUL 7-bit ASCII range, allowing for encoding detection. (This is done by the NUL octet pattern in the first four octets.)

JSON has only taken off because it’s a tightly defined simple format that can be used “everywhere” and isn’t too awful for humans (escaping not needed for U+0020‥U+D7FF and U+E000‥U+FFFD after all, although I’d also take the C1 control characters out, see my forth proposal above). I’ve started to use a trailing comma in indexed and associative arrays in code I write at work, when the array values are one a line, to help version control systems to do their diffs, but refrain from asking for a JSON extension to permit that in order to not endanger compatibility any (no comment needed, it’s just not worth it), but I’d like my above proposals to be followed by implementators (and I’m one of them).

Some more discussion with Jonathan pointed out that JSON5 allows for trailing commata in Object and Array; IMHO the only feature of it that is not bad or outright harmful. I’ll probably keep from accepting them because, on their own, they’re not that useful, and I usually would run JSON texts, even configs, through a parser/encoder roundtrip to pretty-print them which would lose them anyway.

As for binary-safeness: probably best to just use base64 and let the outer layers worry about compression. The data is usually unrelated to the JSON-encoded structure, and even if it’s related to other data the base64 representation is usually similar (unless misaligned).

Update 02.12.2012 — Wrong I was about the first two characters: “"€"” is a valid JSON text. Still possible to peek at four octets and determine the encoding by ordering the tests; updated my notes.

MirBSD Logo