(keitai-l) Re: Supported Character Sets for I-mode

From: Nick May <nick_at_kyushu.com> Date: 01/11/06 Message-Id: <FD0D57B4-93BE-4920-8E15-C2C60488C59B@kyushu.com>

[I would be curious to hear from Nihon-jin readers of Keitai-l about  
how they, and other Japanese, currently perceive UTF-8 and UNICODE in  
general. Please don't be afraid of jumping in - we would love to hear  
from you and we are all - mostly all - benjo-slipper  trained]  Nick

I'm replying to points made in two separate posts by CJS. Not so much  
to those posts in fact, but to the general view they reflect, which I  
have heard expressed in similar terms before.

Let's start with the (oft expressed) ad-hominem:

> shows the real problem these people have with Unicode: NIH.

It may well be that some irrational people dislike UNICODE because  
it was not invented "here" - in this case - Japan. Possibly those  
same irrational people WOULD like UNICODE were it to have been  
invented by someone from "here".**

More fool them. In either case. (Aside: for the life of me I can't  
imagine many locals coming up with UniHan as a basis for encoding, or  
other locals liking it...)

Why should we bother ourselves with the views of irrational people  
when some objections people have to UNICODE historically have been  
perfectly reasonable and rational, even if not always rationally  
expressed? Indeed, many of them are rehearsed, albeit thoroughly  
condescendingly in places, in the IBM article. (And echoed in tone -  
alas - in Wikipedia.)

It is worth asking first what we want from, and thus how we should  
judge, an encoding and this depends on our perspective.

Do we, for example want  (I quote from CJS's post.)

> 	a general-purpose multi-lingual character set,

If we are making an OS - hell yes. That sounds peachy. Have an  
encoding that can handle most of most languages. (99% of 99%, I think  
was how it was put.)  Sell a LOT of computers that way. Solve a lot  
of day to day practical problems for computer techies.

Techies everywhere like something

>  that offers [such] a good compromise of clear standards,
> ease of use, intertranslation with other character sets and reasonably
> compact character encodings.

Note the word "compromise".  It's the "one ring to bind them" view.

But that is not the only perspective. There is another perspective -  
which is where the TRON article was useful.  Let's say we have a  
legacy encoding scheme with various problems and constraints (EUCJP or  
SJIS, for the sake of argument). We want to fix all those problems.  
We don't frankly give a hoot whether in fixing those problems for OUR  
language, we also gain the ability to encode Persian (for example).   
We DO want to be able to encode all kanji in our national literature  
so that it can be kept in electronic form, we DO have to be able to  
handle every possible name that may appear on a driving license. 99%  
of 99%? Not even CLOSE! (I appreciate the figure was rather plucked  
out of the air, but let's keep it for now for - admittedly -  
disingenuously dramatic purposes...)  Imagine telling 1 in a hundred  
Americans they can't use their own name on their bank account...   
Compromise? Not possible. Why should we? This is the "encoding to  
serve and preserve a culture" view.

These are two VERY different perspectives.

In the first case 99% of 99% is fine.  In the second case, it isn't.  
If we are going to go through the pain of abandoning our legacy  
encoding, we want to do so to something that fixes ALL *OUR*  
problems, as far as possible, in a way that serves OUR linguistic  
culture.

NOTE: AFAIK For roman-text languages, UNICODE, from the start, has  
fixed 100% of the problems in 100% of the cases. So - for roman-text  
languages - whichever perspective you look at it from - it is a win.  
For two byte languages - as I recollect - the initial proposals for  
UNICODE did not.

And that is where a lot of the animosity comes from. I have strong  
memories of being on mailing lists in the mid to late 90's with non- 
Japanese people stating bluntly that Japanese would simply have to  
stop using certain kanji.  I don't claim they were informed people,  
or in a position to make decisions - but they were vociferous and it  
was not an uncommon view. Were the Japanese on the list upset? Damned  
right. Me too - about as cross as if someone had decided that (to  
choose an inexact analogy) henceforth all dictionaries in the world  
would be UNABLE to represent English English spellings, only American  
spellings....

Now - I am sure there is a queue of people anxious to point out that  
all these problems have been solved or can be solved. Well good.  
("Solved elegantly?" we inquire, sotto voce... And I am suspicious of  
problems that CAN be solved, but haven't in fact been...) After lord  
only knows how many flame wars (polite flame wars, occasionally,  
these were academics - but the vitriol was there)  we are finally, X  
years later, at a stage where we have a system that JUST about, but  
probably not quite, "works" from both perspectives for Japanese.   
Even in its current state, could it be used to encode the entire  
Japanese driving license database? Not just "in theory" but "in  
practice"?  (I would be curious to know this, if anyone has info...)

It is tiresome that now, in 2006 there are still people (not saying  
that anyone on this list IS such a person) who ignore all the work  
that has had to be done to remedy the deficiencies of UNICODE from the  
second perspective above - as an encoding to serve and preserve a  
culture - and pretend that Japanese dislike of UNICODE has always  
been completely irrational and hence that the history of bad blood  
and mistrust on the subject is completely irrational.  (Of course it  
is also tiresome that people on the other side don't  acknowledge  
that many of the initial problems have been addressed.)

My main practical objection to UNICODE now - UTF-8 at least - is that  
it is fat-arsed. The Oompah Loompah of encodings.  Want to send data  
to keitai? You will send FAR fewer bytes with SJIS. It doesn't matter  
whether we convert at the gateway or not - the fact is that SJIS/ 
EUCJP is better suited to lowish bandwidth environments. It must have  
been a no-brainer for DOCOMO as to which encoding they went with when  
they first launched imode - they would have been lunatic to go with  
UTF-8. It would have cut their capacity by 33% at a stroke and  
significantly increased costs to the user.
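
To put a rough number on that: kana and kanji take 2 bytes each in SJIS and EUCJP but 3 bytes each in UTF-8, so pure Japanese text comes out 50% bigger in UTF-8, which is the same thing as the pipe carrying a third less text. A minimal Python sketch, assuming a made-up sample string (the string and the names below are mine, for illustration only, not anything from DoCoMo):

```python
# Compare encoded sizes of the same Japanese text in three encodings.
# The sample string is a hypothetical example: 13 kana/kanji characters,
# all of which exist in JIS X 0208 and so encode in SJIS and EUCJP.
sample = "日本語の文字コードについて"

def encoded_sizes(text):
    """Return the number of bytes the text occupies in each encoding."""
    return {name: len(text.encode(name))
            for name in ("shift_jis", "euc_jp", "utf-8")}

sizes = encoded_sizes(sample)

# How much bigger UTF-8 is than SJIS for this text.
growth = sizes["utf-8"] / sizes["shift_jis"]

# Fraction of text capacity lost on a fixed-size pipe by moving to UTF-8.
capacity_loss = 1 - sizes["shift_jis"] / sizes["utf-8"]

for name, size in sizes.items():
    print(f"{name:10s} {size} bytes")
print(f"UTF-8 / SJIS size ratio: {growth:.2f}")
print(f"capacity lost on a fixed pipe: {capacity_loss:.0%}")
```

Bear in mind this is the worst case. ASCII characters (markup, URLs, digits) are 1 byte in all three encodings, so a real page with a lot of tags in it will show a smaller gap than pure Japanese text does.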

I think the keitai domain is STILL in that "1%" for which UNICODE is  
less appropriate. Could we extend UNICODE to make it 2-byte?  Technically,  
perhaps - but - realistically, not much chance of that happening now.

As for the wider web - it has its place in xml and so on - but for  
webserving pages, I don't see any great advantage over legacy  
encoding schemes that outweighs the additional fat. Running a  
webserver on a 1mbit/s "up" ADSL line? You are in the 1% of the 1%...  
SJIS or EUCJP is the way to go.... Being charged for your bandwidth?  
Cut bandwidth costs - avoid UTF-8. You are in the 1% too...

> Unicode is one of the best technical successes I've ever seen in the
> computer industry. It does 99% of what people need 99% of the time.  
> This
> is why I find the Japanese criticism of it so annoying.

A failing of UNICODE to handle 1% of what I need 1% of the time, in  
English may well constitute a technical success - but it would MORE  
than damn it for me were I a librarian, running a big DP operation,  
or just concerned about my culture and being able to render, and  
preserve, all of its literature in electronic form.

Pistols at dawn in fact...

Nick

(**One need only look at the utter IDIOCY of Monbusho style  
romanization to get a feel for how something utterly and abysmally  
cretinous can be perpetrated by supposedly sane people...)