[I would be curious to hear from Nihon-jin readers of Keitai-l about
how they, and other Japanese, currently perceive UTF-8 and UNICODE in
general. Please don't be afraid of jumping in - we would love to hear
from you and we are all - mostly all - benjo-slipper trained] Nick
I'm replying to points made in two separate posts by CJS. Not so much
to those posts in fact, but to the general view they reflect, which I
have heard expressed in similar terms before.
Let's start with the (oft-expressed) ad hominem:
> shows the real problem these people have with Unicode: NIH.
It may well be that some irrational people dislike UNICODE because
it was not invented "here" - in this case - Japan. Possibly those
same irrational people WOULD like UNICODE were it to have been
invented by someone from "here".**
More fool them. In either case. (Aside: for the life of me I can't
imagine many locals coming up with UniHan as a basis for encoding, or
other locals liking it...)
Why should we bother ourselves with the views of irrational people
when some of the objections people have historically raised against
UNICODE have been perfectly reasonable and rational, even if not
always rationally expressed? Indeed, many of them are rehearsed,
albeit thoroughly condescendingly in places, in the IBM article.
(And echoed in tone - alas - in Wikipedia.)
It is worth asking first what we want from, and thus how we should
judge, an encoding - and this depends on our perspective.
Do we, for example, want (I quote from CJS's post):
> a general-purpose multi-lingual character set,
If we are making an OS - hell yes. That sounds peachy. Have an
encoding that can handle most of most languages. (99% of 99%, I
think, was how it was put.) Sell a LOT of computers that way. Solve
a lot of day-to-day practical problems for computer techies.
Techies everywhere like something
> that offers [such] a good compromise of clear standards,
> ease of use, intertranslation with other character sets and reasonably
> compact character encodings.
Note the word "compromise". It's the "one ring to bind them" view.
But that is not the only perspective. There is another perspective -
which is where the TRON article was useful. Let's say we have a
legacy encoding scheme with various problems and constraints (EUCJP
or SJIS, for the sake of argument). We want to fix all those problems.
We don't frankly give a hoot whether in fixing those problems for OUR
language, we also gain the ability to encode Persian (for example).
We DO want to be able to encode all kanji in our national literature
so that it can be kept in electronic form, and we DO have to be able
to handle every possible name that may appear on a driving license.
99% of 99%? Not even CLOSE! (I appreciate the figure was rather
plucked out of the air, but let's keep it for now for - admittedly -
disingenuously dramatic purposes...) Imagine telling one in a hundred
Americans they can't use their own name on their bank account...
Compromise? Not possible. Why should we? This is the "encoding to
serve and preserve a culture" view.
These are two VERY different perspectives.
In the first case 99% of 99% is fine. In the second case, it isn't.
If we are going to go through the pain of abandoning our legacy
encoding, we want to move to something that fixes ALL *OUR*
problems, as far as possible, in a way that serves OUR linguistic
culture.
NOTE: AFAIK, for roman-text languages, UNICODE has, from the start,
fixed 100% of the problems in 100% of the cases. So - for roman-text
languages - whichever perspective you look at it from - it is a win.
For two-byte languages - as I recollect - the initial proposals for
UNICODE did not.
And that is where a lot of the animosity comes from. I have strong
memories of being on mailing lists in the mid-to-late 90s with non-
Japanese people stating bluntly that Japanese would simply have to
stop using certain kanji. I don't claim they were informed people,
or in a position to make decisions - but they were vociferous and it
was not an uncommon view. Were the Japanese on the list upset? Damned
right. Me too - about as cross as if someone had decided that (to
choose an inexact analogy) henceforth all dictionaries in the world
would be UNABLE to represent English English spellings, only American
spellings....
Now - I am sure there is a queue of people anxious to point out that
all these problems have been solved or can be solved. Well, good.
("Solved elegantly?" we inquire, sotto voce... And I am suspicious of
problems that CAN be solved, but haven't in fact been...) After lord
only knows how many flame wars (polite flame wars, occasionally;
these were academics - but the vitriol was there) we are finally, X
years later, at a stage where we have a system that JUST about, but
probably not quite, "works" from both perspectives for Japanese.
Even in its current state, could it be used to encode the entire
Japanese driving license database? Not just "in theory" but "in
practice"? (I would be curious to know this, if anyone has info...)
It is tiresome that now, in 2006, there are still people (not saying
that anyone on this list IS such a person) who ignore all the work
that has had to be done to remedy the deficiencies of UNICODE from
the second perspective above - as an encoding to serve and preserve a
culture - and pretend that Japanese dislike of UNICODE has always
been completely irrational, and hence that the history of bad blood
and mistrust on the subject is completely irrational. (Of course it
is also tiresome that people on the other side don't acknowledge
that many of the initial problems have been addressed.)
My main practical objection to UNICODE now - UTF-8 at least - is that
it is fat-arsed. The Oompa Loompa of encodings. Want to send data
to keitai? You will send FAR fewer bytes with SJIS. It doesn't matter
whether we convert at the gateway or not - the fact is that SJIS/
EUCJP is better suited to lowish-bandwidth environments. It must have
been a no-brainer for DOCOMO as to which encoding to go with when
they first launched imode - they would have been lunatics to go with
UTF-8. It would have cut their capacity by 33% at a stroke and
significantly increased costs to the user.
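(For anyone who wants to check that arithmetic, here is a toy sketch
in modern Python - my own made-up sample string, not anything DOCOMO
actually measured - comparing byte counts under the three encodings:)

    # Kana/kanji that take 2 bytes in SJIS or EUC-JP take 3 bytes in UTF-8.
    text = "今日は良い天気ですね"  # 10 characters, all double-byte in JIS X 0208

    for codec in ("shift_jis", "euc_jp", "utf-8"):
        print(codec, len(text.encode(codec)), "bytes")

    # Prints: shift_jis 20 bytes / euc_jp 20 bytes / utf-8 30 bytes.
    # The same message costs 50% more bytes in UTF-8 - or, put the other
    # way round, a fixed pipe carries a third fewer characters.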
I think the keitai domain is STILL in that "1%" for which UNICODE is
less appropriate. Could UNICODE be extended with a more compact two-
byte form? Technically, perhaps - but realistically, not much chance
of that happening now.
As for the wider web - UTF-8 has its place in XML and so on - but for
serving web pages, I don't see any great advantage over legacy
encoding schemes that outweighs the additional fat. Running a
webserver on a 1Mbit/s "up" ADSL line? You are in the 1% of the 1%...
SJIS or EUCJP is the way to go.... Being charged for your bandwidth?
Cut bandwidth costs - avoid UTF-8. You are in the 1% too...
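(Again, purely as a sketch, and in modern Python rather than anything
from 2006: serving a page in a legacy encoding is just a matter of
encoding the body and declaring the charset in the Content-Type
header. The handler and port below are my own invention.)

    from http.server import BaseHTTPRequestHandler, HTTPServer

    PAGE = "<html><body>こんにちは、世界</body></html>"

    class SJISHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = PAGE.encode("shift_jis")  # 2 bytes per kana/kanji vs 3 in UTF-8
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=Shift_JIS")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8000), SJISHandler).serve_forever()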
> Unicode is one of the best technical successes I've ever seen in the
> computer industry. It does 99% of what people need 99% of the time.
> This is why I find the Japanese criticism of it so annoying.
A failing of UNICODE to handle 1% of what I need 1% of the time, in
English, may well constitute a technical success - but it would MORE
than damn it for me were I a librarian, running a big DP operation,
or just concerned about my culture and being able to render, and
preserve, all of its literature in electronic form.
Pistols at dawn in fact...
Nick
(**One need only look at the utter IDIOCY of Monbusho-style
romanization to get a feel for how something utterly and abysmally
cretinous can be perpetrated by supposedly sane people...)