On 10 Jan 2006, at 12:46, Paul Lester wrote:
> According to this UTF-8 is 3 bytes and SJIS is 2 bytes... am I
> right?
As I understand it, UTF-8 is variable width, 1 to 4 bytes per
character, with 1 byte used for each of the lower 128 US-ASCII
characters. Kanji and similar scripts take 3.
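
A quick way to see the variable width for yourself is the Python
sketch below. The helper name utf8_len and the sample characters are
my own picks for illustration, not anything from the thread.

```python
# Sketch: count how many bytes one character needs in UTF-8.
# utf8_len and the sample characters are illustrative choices.

def utf8_len(ch: str) -> int:
    """Return the number of bytes used to encode ch as UTF-8."""
    return len(ch.encode("utf-8"))

samples = [
    ("\u0041", "US-ASCII letter A"),
    ("\u00e9", "Latin letter e with acute accent"),
    ("\u6f22", "kanji (U+6F22)"),
    ("\U0001F600", "character outside the Basic Multilingual Plane"),
]

for ch, description in samples:
    print(f"{description}: {utf8_len(ch)} byte(s)")
```

The four samples should come out at 1, 2, 3 and 4 bytes respectively.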
From Wikipedia (http://en.wikipedia.org/wiki/UTF-8):
UTF-8 is generally larger than the appropriate legacy encoding for
everything except diacritic-free, Latin-alphabet text. Most
alphabetic scripts had only a single byte per character in legacy
encodings but their letters take at least two bytes in UTF-8.
Ideographic scripts generally had two bytes per character in their
legacy encodings yet take three bytes per character in UTF-8.
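
The size comparison in that passage can be checked with a short
Python sketch. The function name encoded_sizes and the sample strings
are assumptions of mine; "shift_jis" and "iso8859_7" are the names
Python's standard codecs use for SJIS and for a legacy Greek encoding.

```python
# Sketch: compare the encoded size of the same text in a legacy
# encoding and in UTF-8. encoded_sizes is an illustrative helper.

def encoded_sizes(text: str, encodings: list[str]) -> dict[str, int]:
    """Map each encoding name to the byte length of text in it."""
    return {enc: len(text.encode(enc)) for enc in encodings}

# Three kanji: 2 bytes each in SJIS, 3 bytes each in UTF-8.
kanji = "\u65e5\u672c\u8a9e"
print(encoded_sizes(kanji, ["shift_jis", "utf-8"]))

# Three Greek letters: 1 byte each in ISO 8859-7, 2 each in UTF-8.
greek = "\u03b1\u03b2\u03b3"
print(encoded_sizes(greek, ["iso8859_7", "utf-8"]))
```

Note that SJIS is itself variable width (ASCII and half-width katakana
take 1 byte), so the 2-byte figure applies to kanji and kana rather
than to every character.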
Nick
Received on Tue Jan 10 06:30:29 2006