[JT] Jochen Topf's Blog
Thu 2011-03-17 00:00

Javascript and Unicode

The Unicode character set contains somewhat over one million code points, from 0 to hex 10ffff. That wasn’t always so. Unicode started out with only 16-bit characters, or about 65,000 code points. At some point it was decided that this wasn’t enough, and version 2.0, released in 1996, switched to the larger character set.

Unfortunately, Javascript (invented in 1995) never got that message. 15 years later it still doesn’t properly support the full Unicode character set. Most of the Javascript documentation I could find on the web ignores this issue, so I dug a bit deeper.

This is what the ECMA-262 language specification (5th ed.), which is the basis for Javascript, has to say about this: “4.3.16 String value: primitive value that is a finite ordered sequence of zero or more 16-bit unsigned integer. NOTE: A String value is a member of the String type. Each integer value in the sequence usually represents a single 16-bit unit of UTF-16 text.” Note that it says “16-bit unit”, not “16-bit character”, and note the “usually”, which is unusually vague for a spec. It goes on with: “However, ECMAScript does not place any restrictions or requirements on the values except that they must be 16-bit unsigned integers.” Later the spec describes the charCodeAt(position) method of the String object as returning an integer between 0 and 2^16-1 (chapter 15.5.4.5). So it does not work with characters that need more than 16 bits. The same problem exists for the Unicode escape sequence, described as “\u plus four hexadecimal digits” (chapter 6): enough for a 16-bit character, but not for the full Unicode range. So everything is fine as long as you stick to the Basic Multilingual Plane (BMP), i.e. the first hex ffff characters of Unicode, which can be expressed in 16 bits.
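A small illustration of what those 16-bit units mean in practice (the character U+1D11E, the musical G clef, lies outside the BMP; the variable names are mine):

```js
// There is no \u escape for U+1D11E, so it has to be written as two
// 16-bit units, a so-called surrogate pair:
var clef = "\uD834\uDD1E";

clef.length;                      // 2 -- two units, not one character
clef.charCodeAt(0).toString(16);  // "d834" (high surrogate)
clef.charCodeAt(1).toString(16);  // "dd1e" (low surrogate)

// A BMP character behaves as expected:
"\u20AC".length;                  // 1 (EURO SIGN)
```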

If you want full Unicode support, it is more than a bit messy. Mozilla has some suggestions on how to work around this problem, which surfaces especially in regular expressions. That’s all rather ugly, but at least it can be done.
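To give an idea of the sort of workaround this amounts to (a sketch, not Mozilla’s exact code): a code point above hex ffff has to be matched as an explicit pair of 16-bit units, because “.” and ordinary character classes only see single units.

```js
// Matches any code point above U+FFFF, written as high surrogate + low surrogate:
var astral = /[\uD800-\uDBFF][\uDC00-\uDFFF]/;

var clef = "\uD834\uDD1E";   // U+1D11E again
astral.test(clef);           // true
/^.$/.test(clef);            // false -- "." matches only a single 16-bit unit
```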

The spec doesn’t say what a Javascript engine does internally; all it describes is what you see using the Javascript language. But from the spec it would make sense to always use 16-bit characters. That’s what the obsolete UCS-2 encoding is for. The UTF-16 encoding is similar but not the same: it still uses 16-bit units, but a character can be one or two units long.
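The arithmetic behind those one-or-two-unit characters is simple enough. A sketch of the decoding direction (the helper name is mine):

```js
// Combine a two-unit UTF-16 sequence (surrogate pair) back into its code point:
function codePointFromPair(hi, lo) {
    return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
}

var clef = "\uD834\uDD1E";
codePointFromPair(clef.charCodeAt(0), clef.charCodeAt(1)).toString(16);  // "1d11e"
```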

I am using Google’s V8 Javascript engine, which adds another complication, because you have to get those strings from C++ to Javascript and back. So I checked what it does when encountering Unicode characters above hex ffff. It seems it will silently drop the upper bits. At least that is the behaviour in older versions (I used version 2.2.18 in Ubuntu 10.10); it has since been changed. There is a ticket addressing this problem, but it was closed as “fixed” with the comment “V8 won’t not support characters outside the Basic Multilingual Plane for now.” (Ignore the obviously wrong double negative here.)
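Just to illustrate what dropping the upper bits does to such a character (this is only the arithmetic, not a claim about V8 internals):

```js
var cp = 0x1D11E;             // MUSICAL SYMBOL G CLEF
(cp & 0xFFFF).toString(16);   // "d11e" -- a completely different BMP character
```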

There are some discussions on this issue here and here. So it seems the only solution would be to do the UTF-8 to UTF-16 conversion myself and feed that to the V8 methods.
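For what it’s worth, the UTF-16 half of such a conversion is not much code. A sketch (the UTF-8 decoding step and the C++ plumbing are left out, and the function name is mine):

```js
// Turn a code point into one or two 16-bit UTF-16 units:
function toUTF16Units(cp) {
    if (cp <= 0xFFFF) return [cp];      // BMP: a single unit
    cp -= 0x10000;
    return [0xD800 + (cp >> 10),        // high surrogate
            0xDC00 + (cp & 0x3FF)];     // low surrogate
}

toUTF16Units(0x1D11E).map(function(u) { return u.toString(16); });  // ["d834", "dd1e"]
```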

I must say that I am pretty disappointed that there is such a glaring hole in Javascript and its implementations, key pieces of our modern web infrastructure.

Tags: dev · javascript · unicode