On 1/9/06, Paul Lester <paul@thetamusic.com> wrote:
> According to this UTF-8 is 3 bytes and SJIS is 2 bytes... am I right?
> Why have I always thought UTF-8 was 2 bytes? I think I'm going crazy.
> When I encode UTF-8 I could have sworn when I look at the file in a hex
> editor each character was always 2 bytes!
>
UTF-16 gives you 2-byte encodings for every character in the Basic
Multilingual Plane, even ASCII (characters outside it take 4 bytes).
UTF-8, on the other hand, is designed to yield 1-byte encodings for
7-bit ASCII, but requires 2 to 4 bytes for all other characters.
This table compares the different encodings:
http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
Japanese Katakana and Hiragana, for example, are both in the U+30xx
range (U+3040 to U+30FF), and everything from U+0800 to U+FFFF takes
3 bytes in UTF-8.
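
If you want to check the byte counts yourself rather than squint at a
hex editor, here is a minimal sketch in Python. The helper name
encoded_lengths is made up for illustration; the codec names are the
ones Python ships with, and "utf-16-be" is used only so that the
byte-order mark is not counted.

```python
def encoded_lengths(ch):
    """Return the encoded size in bytes of one character per encoding.

    A value of None means the character has no mapping in that encoding.
    """
    sizes = {}
    for codec in ("utf-8", "utf-16-be", "shift_jis"):
        try:
            sizes[codec] = len(ch.encode(codec))
        except UnicodeEncodeError:
            sizes[codec] = None
    return sizes


# ASCII letter, accented Latin letter, Hiragana A, Katakana KA
for ch in ("A", "\u00e9", "\u3042", "\u30ab"):
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

The Hiragana and Katakana samples should come out as 3 bytes in UTF-8
and 2 bytes in both UTF-16 and Shift_JIS, while the ASCII letter is
1 byte in UTF-8 and 2 bytes in UTF-16.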
For more about dissatisfaction with UTF-8, see:
http://www-128.ibm.com/developerworks/unicode/library/u-secret.html
Received on Tue Jan 10 07:05:36 2006