OPTU8TO16(3) BSD Programmer's Manual OPTU8TO16(3)
optu8to16, optu8to16vis - converts multibyte characters to wide characters preserving octets
#include <wchar.h>

size_t
optu8to16(wchar_t *pwc, const char *s, size_t n, mbstate_t *ps);

#include <mbfun.h>

size_t
optu8to16vis(wchar_t *pwc, const char *s, size_t n, mbstate_t *ps);    /* deprecated */
The optu8to16() function usually converts the multibyte character pointed to by s to a wide character, and stores the wide character in the wchar_t object pointed to by pwc if pwc is non-null and s points to a valid character in the CESU-8 multibyte encoding, similar to mbrtowc() in a UTF-8 locale. If s does not point to a valid character, the first octet is transliterated either to its ISO_646.irv:1991 (ASCII) mapping in UCS-2 (U+0000 .. U+007F) or to the OPTU-16 raw octet range (U+EF80 .. U+EFFF).

The optu8to16vis() function behaves the same, except that raw octets are mapped into the normal UCS range as if they had been encoded in the legacy 8-bit codepage.

The conversion happens in accordance with the conversion state described in the mbstate_t object pointed to by ps. Note that raw octet conversion is stateful.

The function examines at most n bytes of the array beginning at s. If n is 0, the function behaves as if end of input (not a null character) had been read and ignores s.

If s points to a valid character and the character corresponds to a null wide character, the function places the mbstate_t object pointed to by ps in the initial conversion state.

These are the special cases:

pwc == NULL    The conversion from a multibyte character to a wide character takes place and the conversion state may be affected, but the resulting wide character is discarded.

s == NULL      optu8to16() sets the conversion state object pointed to by ps to the initial state and always returns 0. In this case, optu8to16() ignores pwc but not n, and is equivalent to the following call:

                     optu8to16(NULL, "", 1, ps);

n == 0         Read end of input (not a null character, but an epsilon in the sense of automata theory) and ignore s. optu8to16() will still emit up to two wide characters and return 0 if the conversion state contains information about these, and return (size_t)-2 otherwise.

               Application note: If the end of input has been reached, call optu8to16() with n == 0 until it returns (size_t)-2, and process the remaining wide characters emitted. This ensures that no raw octets in the OPTU-8 encoded source are lost.

ps == NULL     optu8to16() uses its own internal state object to keep the conversion state, instead of the ps mentioned in this manual page. Calling any other function in libc never changes the internal state of optu8to16(), which is initialised at program startup time.
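The transliteration of a single invalid octet described above can be sketched as a few small helpers. This is an illustration only, not the MirBSD implementation: all function names are hypothetical, and the exact offset (octet 0x80 + k mapping to U+EF80 + k) is an assumption inferred from the size of the range given in this page. Valid multibyte sequences and the conversion state are not modelled.

```c
#include <assert.h>

/*
 * Hypothetical helper: transliterate one invalid octet as described
 * in this page. Octets below 0x80 keep their ASCII value; the rest
 * are ASSUMED to map linearly onto U+EF80 .. U+EFFF.
 */
static unsigned int
raw_octet_to_optu16(unsigned int octet)
{
	assert(octet <= 0xFFU);
	if (octet < 0x80U)
		return (octet);
	return (0xEF00U | octet);
}

/* Hypothetical helper: does wc lie in the OPTU-16 raw octet range? */
static int
is_optu16_raw_octet(unsigned int wc)
{
	return (wc >= 0xEF80U && wc <= 0xEFFFU);
}

/* Hypothetical inverse; only valid if is_optu16_raw_octet(wc) holds. */
static unsigned int
optu16_to_raw_octet(unsigned int wc)
{
	assert(is_optu16_raw_octet(wc));
	return (wc & 0xFFU);
}
```

Under this assumption, a stray octet 0xFF would be emitted as U+EFFF, and an encoder could recover the original octet from that code point without loss.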
0 or positive  Number of bytes read from s. If 0, the state contained enough information to emit a wide character; if positive, the bytes form a valid multibyte character in the OPTU-8 encoding.

(size_t)-2     s points to a byte sequence that may contain part of a valid multibyte character but is incomplete. All n bytes of the input have been processed and stored in ps.

(size_t)-1     Generic error condition; should not happen in the current implementation. errno is set to indicate the error.
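The return values above, together with the advice to keep calling with n == 0 at end of input, suggest a decoding loop along the following lines. This is a sketch under stated assumptions, not code from MirBSD: decode_buffer() and stand_in_decoder() are hypothetical names. The presence check relies on the same-named macro that this page says the extended functions declare. Where the macro is absent, the sketch falls back to the stand-in, which treats every byte as one character and therefore only exercises the control flow.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

#ifdef optu8to16
#define OPTU_DECODE optu8to16
#else
/*
 * Hypothetical stand-in for systems without optu8to16(): one byte is
 * one character, bytes >= 0x80 are treated as raw octets, and no
 * output is ever pending. It is NOT an OPTU-8 decoder.
 */
static size_t
stand_in_decoder(wchar_t *pwc, const char *s, size_t n, mbstate_t *ps)
{
	unsigned char c;

	(void)ps;
	if (n == 0)
		return ((size_t)-2);
	c = (unsigned char)*s;
	if (pwc != NULL)
		*pwc = (wchar_t)(c < 0x80 ? c : 0xEF00 | c);
	return (1);
}
#define OPTU_DECODE stand_in_decoder
#endif

/*
 * Decode len bytes from src into dst, which has room for cap wide
 * characters; output beyond cap is dropped. Returns the number of
 * wide characters stored, or (size_t)-1 on error.
 */
static size_t
decode_buffer(const char *src, size_t len, wchar_t *dst, size_t cap)
{
	mbstate_t st;
	wchar_t wc;
	size_t r, in = 0, out = 0;

	assert(src != NULL && dst != NULL);
	memset(&st, 0, sizeof(st));	/* initial conversion state */

	while (in < len && out < cap) {
		r = OPTU_DECODE(&wc, src + in, len - in, &st);
		if (r == (size_t)-1)
			return ((size_t)-1);
		if (r == (size_t)-2)
			break;		/* the rest now lives in st */
		dst[out++] = wc;	/* r == 0: emitted from state */
		in += r;
	}

	/* End of input: call with n == 0 until (size_t)-2 comes back. */
	while (out < cap) {
		r = OPTU_DECODE(&wc, src, 0, &st);
		if (r != 0)
			break;
		dst[out++] = wc;
	}
	return (out);
}
```

The loop never uses a return value of 0 to detect a null character; a null character in the input simply shows up as a stored L'\0' in dst.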
The optu8to16() function is designed to be as robust as possible and, in contrast to mbrtowc(), never fails with EILSEQ. While EINVAL to indicate an invalid or uninitialised mbstate_t object is theoretically possible, neither this nor other processing errors should ever happen with the current implementation.
iswoctet(3), mbrtowc(3), wcrtomb(3)
At present, MirBSD is limited to the UCS BMP (Basic Multilingual Plane); thus OPTU-8 is limited to the common subset of CESU-8 and UTF-8.

The optu8to16() function was standardised by MirBSD and has been designed to behave as closely as possible to its ISO/IEC 9899:1999 ("ISO C99") equivalent mbrtowc(), with the following intentional exceptions:

-   If n is 0, s is ignored, even if it is NULL, not the other way round.

-   The return value 0 does not indicate that a null character was processed; use pwc for that. It indicates that no byte of the input has been read.

-   The optu8to16vis() function assumes codepage 1252 and maps holes into distinguishable codepoints.

All these extended functions declare macros with the same name that can be used to check for their presence.
The optu8to16 function first appeared in MirBSD #11. Later attempts to tackle similar or related problems are PEP 383 (2009) https://www.python.org/dev/peps/pep-0383/ and the Wobbly Transformation Format (2014) https://simonsapin.github.io/wtf-8/ which all differ in which encoding is used for nonstandard codes. The range used in the OPTU encoding is registered with CSUR: http://www.evertype.com/standards/csur/conscript-table.html
Thorsten Glaser <tg@mirbsd.de> wrote the entire internationalisation implementation in MirBSD. He is also the steward for the OPTU encoding.
On a system whose wide character type is only 16 bits wide, as opposed to the 31 bits of ISO 10646, the OPTU encoder and decoder are permitted not to de- and recompose any surrogates encountered and to pass them through as if they were regular wide characters with no special function. Since MirBSD is such a system, the reference implementation does not care about UTF-16 surrogates posing as OPTU-16 characters at all; a planes-aware application using the Universal Coded Character Set is required to handle surrogates by itself. For compatibility purposes, optu8to16 should always be assumed not to treat surrogates specially; applications must take care not to produce invalid surrogates unless they are limited to the BMP.

MirBSD December 11, 2021