On Fri, 13 Jan 2006, Curt Sampson wrote:
> Let's have a look at slashdot.co.jp's top page and an article page with
> 150+ comments, compressed and uncompressed, in various encodings.
>
> compressed  uncompressed  ratio  uncompressed_name
>      21808        142018  84.6%  comments.utf-8.html
>      20226        130989  84.5%  comments.euc-jp.html
>      20359        130989  84.4%  comments.sjis.html
>      15648         61434  74.5%  top.utf-8.html
>      14616         56637  74.1%  top.euc-jp.html
>      14632         56637  74.1%  top.sjis.html
I should mention, as I forgot to earlier, that these were compressed
with gzip at the default compression level. Just for a quick comparison,
if anybody's curious, bzip2 -9 gives:
comments.utf-8.html:   87.92% saved, 142018 in, 17157 out.
comments.euc-jp.html:  87.11% saved, 130989 in, 16879 out.
comments.sjis.html:    87.17% saved, 130989 in, 16808 out.
top.utf-8.html:        77.36% saved,  61434 in, 13907 out.
top.euc-jp.html:       75.93% saved,  56637 in, 13635 out.
top.sjis.html:         75.89% saved,  56637 in, 13655 out.
One way of looking at this is that on a 130 KB EUC-JP file, the
difference between that and UTF-8 after compression is 278 bytes, or
0.21% of the original (smaller EUC-JP) file size.
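
The arithmetic behind that figure can be checked with a few lines of
Python. This is only a worked calculation using the bzip2 -9 sizes listed
above; the variable names are made up for illustration.

```python
# Sizes in bytes, copied from the bzip2 -9 figures above.
euc_jp_in, euc_jp_out = 130989, 16879
utf8_in, utf8_out = 142018, 17157

raw_gap = utf8_in - euc_jp_in        # size difference before compression
packed_gap = utf8_out - euc_jp_out   # size difference after compression

print(f"uncompressed gap: {raw_gap} bytes ({raw_gap / euc_jp_in:.2%} of the EUC-JP file)")
print(f"compressed gap:   {packed_gap} bytes ({packed_gap / euc_jp_in:.2%} of the EUC-JP file)")
```

Before compression the UTF-8 file is 11029 bytes larger than the EUC-JP
one; after compression the gap shrinks to the 278 bytes mentioned above.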
I think that this pretty much explodes any arguments about UTF-8 versus
EUC-JP if your main concern is data size; what you do in terms of
compression makes much, much more difference.
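
For anyone who wants to repeat this kind of comparison, here is a
minimal sketch in Python. It is not how the numbers above were produced
(those came from the gzip and bzip2 command-line tools run on saved
pages). The sample string is a made-up, highly repetitive stand-in, so
its ratios will look far better than a real page's; substitute real
page contents to get meaningful figures. The function name `sizes` is
hypothetical.

```python
import bz2
import zlib

# Hypothetical sample: repetitive Japanese markup standing in for a real page.
sample = "<p>これは日本語のテキストです。圧縮率を比較します。</p>\n" * 500

def sizes(text, encoding):
    """Return (raw, deflate, bzip2) byte counts for text in the given encoding."""
    raw = text.encode(encoding)
    # zlib level 6 corresponds to gzip's default level; the gzip file
    # format adds a small fixed header and trailer on top of this.
    deflated = zlib.compress(raw, 6)
    bzipped = bz2.compress(raw, 9)
    return len(raw), len(deflated), len(bzipped)

results = {enc: sizes(sample, enc) for enc in ("utf-8", "euc_jp", "shift_jis")}
for enc, (raw, deflated, bzipped) in results.items():
    print(f"{enc:10} raw={raw:7} deflate={deflated:6} bzip2={bzipped:6}")
```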
I should also note, for those who might bring up the issue of CPU speed,
that on modern computers, decompressing the compressed text on the fly
as you process it is likely to be significantly faster than processing
the uncompressed text directly: the compressed text is smaller, so less
of it has to be fetched from main memory, and a main memory access costs
dozens of times as much as a cache hit.
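
As an illustration of what "on the fly" means here, the following Python
sketch decompresses and decodes a gzip stream incrementally while
counting lines, without ever holding the full uncompressed text. It does
not demonstrate the speed claim, which depends on the workload and the
hardware and would have to be measured. The data is a tiny in-memory
stand-in and `count_lines` is a hypothetical helper.

```python
import gzip
import io

# Hypothetical data: build a small gzip stream in memory instead of reading a real file.
text = "一行目\n二行目\n三行目\n"
compressed = gzip.compress(text.encode("utf-8"))

def count_lines(stream):
    """Decompress and decode incrementally, one line at a time."""
    count = 0
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for _ in lines:
            count += 1
    return count

n = count_lines(io.BytesIO(compressed))
print(n)
```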
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974
*** Contribute to the Keitai Developers' Wiki! ***
*** http://www.keitai-dev.net/ ***
Received on Fri Jan 13 09:47:00 2006