[Proposal] UTF-16x (Apr. 30, 1999)
==================================================================

Title: UTF-16x (UTF-16 extension)
UCS transformation format, 16-bit form, upward compatible with UTF-16

Introduction

This document describes the UTF-16x format. UTF-16x stands for "UTF-16 extension"; it can address all planes of ISO-10646-UCS-4. The format is built from 16-bit quantities.

Background

This format is proposed for the following reasons.

* If The Unicode Standard aims to conform to ISO-10646, it needs to be able to process Planes 17 - 32767.
* A large amount of text encoded in UCS-2/UTF-16 is already in circulation. This data cannot be ignored, and it would be difficult to convert all of it to UTF-8 should The Unicode Standard ever define a character in Plane 17.
* Some research societies are already examining use of the ISO-10646 private use area (Planes 224-255, Groups 96-127). If they circulate data containing code points in Plane 17 or above in UTF-8, serious conversion problems will arise between UTF-8 and UTF-16.
* If The Unicode Standard does not define a conversion rule between ISO-10646-UCS-4 and UTF-16x, vendors will probably each define their own nonstandard conversion rules, and the resulting confusion will make it difficult for The Unicode Standard to conform to ISO-10646 in the future.

Purpose

* UTF-16x must be upward compatible with UTF-16.
* UTF-16x must be correctly convertible from ISO-10646-UCS-4.
* UTF-16x does not aim to be implemented by all processors.
* UTF-16x need not be rendered by processors.
* UTF-16x does not aim to be processed efficiently.
* UTF-16x may be used to distinguish ISO-10646 text from The Unicode Standard text easily.

UTF-16x definition

UTF-16x applies the same rules as UTF-16 to the code points from 0x00000 to 0x10FFFF, except that some code points are reserved for super surrogates.
For example:

super high surrogate   : 0x000EE000-0x000EE7FF
super middle surrogate : 0x000EE800-0x000EEBFF
super low surrogate    : 0x000EEC00-0x000EEFFF

The code points from 0x00110000 to 0x7FFFFFFF are represented by a super surrogate trio (high, middle, low). This means that each such UTF-16x sequence takes 12 octets and always has one of the following two forms (low surrogate: 0xDC00-0xDFFF):

0xDB78 + low surrogate + 0xDB7A + low surrogate + 0xDB7B + low surrogate
or
0xDB79 + low surrogate + 0xDB7A + low surrogate + 0xDB7B + low surrogate

Encoding UTF-16x

First, in binary notation:

UCS-4 (binary):
  0wxxxxxx-xxxxyyyy-yyyyyyzz-zzzzzzzz
UTF-16x (binary):
  11011011-0111100w, 110111xx-xxxxxxxx,
  11011011-01111010, 110111yy-yyyyyyyy,
  11011011-01111011, 110111zz-zzzzzzzz

Next, in hexadecimal notation:

UCS-4 range: 0x00110000-0x3FFFFFFF
UTF-16x expression: U+DB78 + low surrogate + U+DB7A + low surrogate + U+DB7B + low surrogate

UCS-4 range: 0x40000000-0x7FFFFFFF
UTF-16x expression: U+DB79 + low surrogate + U+DB7A + low surrogate + U+DB7B + low surrogate

==================================================================
Masahiko Maedera, ???????????????????????
Masahiko_Maedera@???????????
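
The bit layout given under "Encoding UTF-16x" can be sketched in Python as follows. This is only an illustration of the proposal, not a reference implementation. The function names encode_utf16x and decode_utf16x are hypothetical. Two behaviours are assumptions that the proposal does not spell out: the encoder refuses the reserved range 0x000EE000-0x000EEFFF and the ordinary surrogate range, and the decoder refuses a trio that decodes to a value below 0x00110000.

```python
from typing import List

SUPER_HIGH = 0xDB78    # 11011011-0111100w with w = 0 (0xDB79 when w = 1)
SUPER_MIDDLE = 0xDB7A  # 11011011-01111010
SUPER_LOW = 0xDB7B     # 11011011-01111011


def _is_low(unit: int) -> bool:
    """True for an ordinary UTF-16 low surrogate (0xDC00-0xDFFF)."""
    return 0xDC00 <= unit <= 0xDFFF


def encode_utf16x(cp: int) -> List[int]:
    """Encode one UCS-4 code point as a list of 16-bit code units."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("outside the UCS-4 range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code point")
    if 0xEE000 <= cp <= 0xEEFFF:
        # Assumption: reserved super surrogate code points are not encodable.
        raise ValueError("reserved for super surrogates")
    if cp < 0x10000:
        return [cp]
    if cp <= 0x10FFFF:
        v = cp - 0x10000  # ordinary UTF-16 surrogate pair
        return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
    w = cp >> 30            # 1 bit
    x = (cp >> 20) & 0x3FF  # 10 bits
    y = (cp >> 10) & 0x3FF  # 10 bits
    z = cp & 0x3FF          # 10 bits
    return [SUPER_HIGH | w, 0xDC00 | x,
            SUPER_MIDDLE, 0xDC00 | y,
            SUPER_LOW, 0xDC00 | z]


def decode_utf16x(units: List[int]) -> List[int]:
    """Decode a list of 16-bit code units into UCS-4 code points."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if u in (SUPER_HIGH, SUPER_HIGH | 1):
            chunk = units[i:i + 6]
            if (len(chunk) != 6 or chunk[2] != SUPER_MIDDLE
                    or chunk[4] != SUPER_LOW
                    or not all(_is_low(chunk[k]) for k in (1, 3, 5))):
                raise ValueError("malformed super surrogate trio")
            cp = ((u & 1) << 30) | ((chunk[1] & 0x3FF) << 20) \
                | ((chunk[3] & 0x3FF) << 10) | (chunk[5] & 0x3FF)
            if cp < 0x110000:
                # Assumption: values UTF-16 can already express are rejected.
                raise ValueError("trio encodes a value below 0x00110000")
            out.append(cp)
            i += 6
        elif u in (SUPER_MIDDLE, SUPER_LOW):
            raise ValueError("super middle/low surrogate out of place")
        elif 0xD800 <= u <= 0xDBFF:
            if i + 1 >= len(units) or not _is_low(units[i + 1]):
                raise ValueError("unpaired high surrogate")
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif _is_low(u):
            raise ValueError("unpaired low surrogate")
        else:
            out.append(u)
            i += 1
    return out
```

For instance, encode_utf16x(0x00110000) yields the six units 0xDB78 0xDC01 0xDB7A 0xDC40 0xDB7B 0xDC00, and encode_utf16x(0x7FFFFFFF) yields 0xDB79 0xDFFF 0xDB7A 0xDFFF 0xDB7B 0xDFFF. A plain UTF-16 decoder would read each of these as three surrogate pairs in the reserved range 0x000EE000-0x000EEFFF.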