(keitai-l) Re: Supported Character Sets for I-mode

From: Nick May <nick_at_kyushu.com> Date: 01/13/06 Message-Id: <062A6FB6-1D9F-4BF9-9095-77422C22BF0D@kyushu.com>

On 13 Jan 2006, at 14:51, Joe Bowbeer wrote:

> While he is not rigorous in his analysis, I suspect he is correct  
> that:
>
> 1. the actual size gain of UTF-8 compared to UTF-16 probably isn't  
> so large

He gives two reasons.

1) He states that <, >, &, =, ", ', and space are all smaller in  
UTF-8 than in UTF-16. So even though UTF-16 is smaller for Japanese  
characters, it is fatter for these markup characters, which are used  
often.

This does not apply to sjis or eucjp, as I recollect (or are these  
characters 2 bytes in sjis/eucjp?), so it is strictly a UTF-16  
point... (???)
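
A quick way to settle the byte-count question is to encode each character and count. This is only a sketch: it assumes Python 3 and its stock codec names ("shift_jis", "euc_jp"), uses "utf-16-le" so that no byte-order mark inflates the count, and the kana at the end is just my own sample character for contrast.

```python
# Bytes needed for each markup character in each encoding.
MARKUP = ['<', '>', '&', '=', '"', "'", ' ']
ENCODINGS = ['utf-8', 'utf-16-le', 'shift_jis', 'euc_jp']


def encoded_sizes(ch):
    """Return {encoding name: number of bytes used to encode ch}."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


# The hiragana 'あ' is an illustrative sample, not taken from the article.
for ch in MARKUP + ['あ']:
    print(repr(ch), encoded_sizes(ch))
```

The markup characters are plain ASCII, so the run should show whether only UTF-16 pays the two-byte price for them.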

> 2. the expansion may be offset by the natural compression of  
> ideographic scripts

or as he puts it "ideographic scripts are parsimonious with  
characters when compared with Latin scripts."

This may be true (I VERY much doubt it overall) but is irrelevant to  
a comparison of encodings WITHIN a script. It is a specious argument.

He then says

>> If compression is really what you're after, then zip or gzip the XML.

Yes, of course - we do this anyway. What I am interested in is the  
size "post compression". A 10% difference there is a 10% saving in  
bandwidth for that content.

and then he makes the interesting and relevant claim...

> Compressed UTF-8 will likely be close in size to compressed UTF-16,
> regardless of the initial size difference.
> Whichever one is larger initially will have more redundancy for the
> compression algorithm to reduce.

I think we - and he - need to do some testing before making or  
accepting such a claim.

> but it seems reasonable to think
> that UTF-8 and UTF-16 would compress to about the same size, given
> that the information is the same in both cases.

I am not sure what you mean by "information" here, so it is difficult  
to judge.

Anyone interested in helping me test this?

We could take a JP text file, add in a dollop of common markup  
characters, convert it to each of the encodings under discussion,  
then gzip each version and see what the percentage difference in  
size is.
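
Something like the following would do as a first pass at that test. It is a sketch under assumptions, not a proper benchmark: it assumes Python 3 with the standard gzip module, and SAMPLE is a made-up placeholder (one short marked-up sentence repeated 200 times), which is far more redundant than real content. For numbers that mean anything, swap in a real JP document.

```python
import gzip

ENCODINGS = ['utf-8', 'utf-16-le', 'shift_jis', 'euc_jp']

# Placeholder input: invented markup around an invented sentence.
# Replace with the contents of a real file before drawing conclusions.
SAMPLE = '<p class="msg">これはテストです。漢字とかなが混ざった文章。</p>\n' * 200


def compare(text):
    """Return {encoding: (raw byte count, gzipped byte count)} for text."""
    results = {}
    for enc in ENCODINGS:
        raw = text.encode(enc)
        packed = gzip.compress(raw)
        results[enc] = (len(raw), len(packed))
    return results


for enc, (raw, packed) in compare(SAMPLE).items():
    print(f'{enc:10s} raw={raw:7d} gzip={packed:6d}')
```

Comparing the gzip column across encodings, as a percentage of the smallest, gives the "post compression" difference I am after.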

Nick