(keitai-l) Re: Supported Character Sets for I-mode

From: Joe Bowbeer <joe.bowbeer_at_gmail.com> Date: 01/10/06 Message-ID: <31f2a7bd0601092105i5c409a15icd3460066372d939@mail.gmail.com>

On 1/9/06, Paul Lester <paul@thetamusic.com> wrote:
> According to this UTF-8 is 3 bytes and SJIS is 2 bytes... am I right?
> Why have I always thought UTF-8 was 2 bytes?  I think I'm going crazy.
> When I encode UTF-8 I could have sworn when I look at the file in a hex
> editor each character was always 2 bytes!
>

UTF-16 would give you 2-byte encodings for most characters, even ASCII
(only characters outside the Basic Multilingual Plane take 4 bytes).

UTF-8, on the other hand, is designed to yield 1-byte encodings for
7-bit ASCII, but requires 2 to 4 bytes for every other character.
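
If you want to check this without a hex editor, here is a minimal
Python sketch. The helper name byte_lengths and the sample characters
are just my picks for illustration; "utf-16-be" is used so the
byte-order mark doesn't get counted.

```python
def byte_lengths(ch):
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    # "utf-16-be" avoids the 2-byte BOM that plain "utf-16" would prepend.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


# Sample characters (arbitrary choices): ASCII, Latin-1, Hiragana, Katakana.
samples = ["A", "\u00e9", "\u3042", "\u30ab"]

for ch in samples:
    utf8_len, utf16_len = byte_lengths(ch)
    print(f"U+{ord(ch):04X}  UTF-8: {utf8_len}  UTF-16: {utf16_len}")
```

The ASCII letter should come out as 1 byte in UTF-8 and 2 in UTF-16,
and the two kana as 3 bytes in UTF-8 and 2 in UTF-16.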

This table compares the different encodings:

http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

Japanese Katakana and Hiragana, for example, both fall in the
U+3040-U+30FF range, and UTF-8 encodes everything from U+0800 through
U+FFFF as 3 bytes.
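
The 3-byte result follows from the code point ranges UTF-8 uses. A
rough sketch, assuming Python's built-in "utf-8" and "shift_jis" codecs
(utf8_length is a made-up helper name, not a library function):

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for a code point, by range."""
    if code_point < 0x80:       # 7-bit ASCII
        return 1
    if code_point < 0x800:      # Latin-1 supplement, Greek, Cyrillic, ...
        return 2
    if code_point < 0x10000:    # rest of the BMP, including kana and kanji
        return 3
    return 4                    # supplementary planes


hiragana_a = "\u3042"   # HIRAGANA LETTER A
katakana_ka = "\u30ab"  # KATAKANA LETTER KA

for ch in (hiragana_a, katakana_ka):
    # The range rule should agree with what the codec actually produces.
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}"
          f"  UTF-8: {len(ch.encode('utf-8'))}"
          f"  Shift_JIS: {len(ch.encode('shift_jis'))}")
```

Both kana come out as 3 bytes in UTF-8 and 2 bytes in Shift_JIS, which
matches the 3-versus-2 comparison in the question.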

For more about dissatisfaction with UTF-8, see:

http://www-128.ibm.com/developerworks/unicode/library/u-secret.html