On 13 Jan 2006, at 14:51, Joe Bowbeer wrote:
> While he is not rigorous in his analysis, I suspect he is correct
> that:
>
> 1. the actual size gain of UTF-8 compared to UTF-16 probably isn't
> so large
He gives two reasons.
1) He states that the UTF-8 representations of <, >, &, =, ", ', and
space are all smaller than their UTF-16 representations. So even though
UTF-16 is smaller for the ideographic characters, it is fatter for these
markup characters, which are used often.
This does not apply to Shift_JIS or EUC-JP, as I recollect (or are these
characters 2 bytes in Shift_JIS/EUC-JP?), so it is strictly a UTF-16
point... (???)
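
A quick way to settle the byte-size question is to ask a codec library directly. The following is only a sketch: the function name `markup_sizes` and the sample character "あ" are my own choices for illustration, and `utf-16-le` is used so that the byte-order mark is not counted.

```python
# Byte length of common markup characters (plus one kana for contrast)
# in each encoding. Names here are illustrative, not from the article.
SAMPLE_CHARS = ['<', '>', '&', '=', '"', "'", ' ', 'あ']
ENCODINGS = ['utf-8', 'utf-16-le', 'shift_jis', 'euc_jp']


def markup_sizes(chars=SAMPLE_CHARS, encodings=ENCODINGS):
    """Return {encoding: {char: encoded length in bytes}}."""
    return {
        enc: {ch: len(ch.encode(enc)) for ch in chars}
        for enc in encodings
    }


if __name__ == "__main__":
    for enc, sizes in markup_sizes().items():
        print(enc, sizes)
```

With Python's standard codecs, the markup characters come out as 1 byte each in UTF-8, Shift_JIS and EUC-JP, and 2 bytes each in UTF-16, so the point does look specific to UTF-16.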
> 2. the expansion may be offset by the natural compression of
> ideographic scripts
or as he puts it "ideographic scripts are parsimonious with
characters when compared with Latin scripts."
This may be true (I VERY much doubt it overall) but is irrelevant to
a comparison of encodings WITHIN a script. It is a specious argument.
He then says
>> If compression is really what you're after, then zip or gzip the
>> XML.
Yes of course - we do this anyway. I am interested in "post
compression". A 10% difference is a 10% saving in bandwidth as it
relates to that content.
and then he makes the interesting and relevant claim...
> Compressed UTF-8 will likely be close in size to compressed UTF-16,
> regardless of the initial size difference.
> Whichever one is larger initially will have more redundancy for the
> compression algorithm to reduce.
I think we - and he - need to do some testing before making or
accepting such a claim.
> but it seems reasonable to think
> that UTF-8 and UTF-16 would compress to about the same size, given
> that the information is the same in both cases.
I am not sure what you mean by "information" here, so it is difficult
to judge.
Anyone interested in helping me test this?
We could take a JP text file, add in a dollop of common markup
characters, convert it to each of the encodings specified, then gzip
each version and see what the percentage difference in size is.
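
The test could be sketched roughly as below. This is a minimal sketch, not a rigorous benchmark: `compressed_sizes` is a hypothetical helper name, the encoding list is my assumption about which encodings we care about, and the sample string is a placeholder that should be replaced with a real JP document (repeated text compresses unrealistically well).

```python
import gzip

# Assumed set of encodings to compare; utf-16-le avoids counting a BOM.
ENCODINGS = ['utf-8', 'utf-16-le', 'shift_jis', 'euc_jp']


def compressed_sizes(text, encodings=ENCODINGS):
    """Return {encoding: (raw_bytes, gzipped_bytes)} for the given text."""
    results = {}
    for enc in encodings:
        raw = text.encode(enc)
        results[enc] = (len(raw), len(gzip.compress(raw)))
    return results


if __name__ == "__main__":
    # Placeholder only: substitute the contents of a real JP file with markup.
    sample = '<p class="note">これはテストです。日本語の文章を入れてください。</p>\n' * 200
    for enc, (raw, packed) in compressed_sizes(sample).items():
        print(f"{enc:10s} raw={raw:8d} gzip={packed:8d}")
```

Comparing the gzip column across encodings, as a percentage of the smallest, would give the "post compression" difference I am interested in.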
Nick
Received on Fri Jan 13 08:24:58 2006