On Friday, June 21, 2002, at 08:28 , James Santagata wrote:
>> Depends on how you interpret the meaning of "character". What I meant
>> to
>> say was that the number of graphical representations that need to be
>> encoded or dealt with increases.
>
>
> You are right about semantics and the import of how a "character"
> is defined. As a former US President once put it, "it all
> depends on what the meaning of 'is' is."
:-) I knew that this one would come up ;-)
> So, for the roman character "a", that would be one code point,
> but the glyphs could carry the representation of various styles like
> a block letter, cursive style writing and so on.
That is because the Roman writing system is very simple and very rational.
> In life, though, it always seems that great ideas face three roadblocks
> - technical, monetary and political.
>
> The technical aspects of this are pretty straightforward, and it isn't
> going to really break anyone's bank (I think it actually saves
> orders of magnitude of time and money when dealing with
> internationalization of apps). But it seems politics
> raises its ugly head again especially with the "Han Unification"
> aspect, which in my mind is quite senseless.
Although I agree with you that there is a lot of politics that gets in
the way and should not be allowed to, I can also see that the
technical aspects of some writing systems are not that straightforward -
that is, if "technical" includes "functional".
It is one thing to define a technical specification so that it is
"technically straightforward" - it is another matter still to define it
so that it also fulfils the user requirements.
For us Westerners, reading and writing is a very rational thing. We
therefore see the task of implementing writing systems into machinery on
a very rational basis. Users and engineers are therefore very likely to
be in agreement over the requirements. Simple, rational, minimal
resources.
For users of some other writing systems, particularly those which use
pictograms and ideograms rather than phonograms, the user requirements
may not be so well aligned with the goal of ease of technical
implementation.
From a Western view, we are likely to say "this and that character -
they are really the same", but from an Oriental viewpoint this may not
be the case, even in a rational sense. Often, old and new characters are
used within the same writing system, and this often signals a subtle
difference in meaning. From that viewpoint, the argument that two
characters which supposedly should be the same are in fact different
characters may not always be so easily dismissed, and it may be a
practical requirement compelling enough to justify the use of different
codes.
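To make that concrete with a pair of my own choosing (not one from this discussion): the modern Japanese form and the traditional form of the character for "country" are encoded as two distinct code points, so a text using both keeps the distinction:

```python
# Illustration: the new and old forms of "country" are separately
# encoded in Unicode, so a text mixing both preserves the distinction.
new_form = "国"   # modern (shinjitai/simplified) form
old_form = "國"   # traditional form

for ch in (new_form, old_form):
    print(f"{ch}  U+{ord(ch):04X}")
# 国  U+56FD
# 國  U+570B
```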
Basically, what I am saying here is "who are we to tell them how to
classify their characters?" - but please take this with a grain of salt,
because Western rationality has in the past indeed made significant
contributions to classifying and ordering Chinese characters, for which
Nelson's and Halpern's classification systems and dictionaries are
proof, as they are well appreciated even by Oriental scholars.
In any event this shows that the task is not as easy as it may seem from
a pure engineering point of view, even if politics were left out of the
equation. Obviously, in the face of such genuine difficulties, politics
add difficulties of a kind which one could easily do without.
> And I think the Han Unification is a hugely important aspect
> because there are so many characters - I personally thank
> God every day I wake up for the 26-character alphabet.
:-)
> For the opponents of Han Unification, I attribute a lot of
> the opposition to one of two things:
>
> 1) A misunderstanding: "Hey, they aren't going to have
> my representation of 'Qi' as a character! #$#@#@!!"
you may call that the Talibanisation of Unicode ;-)
> or
>
> 2) What I label the "Cooties" factor (I was going to label it the
> "French Factor", but thought I'd get too many flames from Francophiles
> so I'll stick with "Cooties" factor).
>
> As in:
>
> "I don't want to have my [insert country with pride/animosity]
> bundled together with [insert neighboring country with reciprocal
> pride/animosity]'s ideogram! #$#@#@!!"
Ah, but that's the "Japan-Korea factor", also known in its more violent
form as the "Arab-Israel factor" (Please note: order strictly Roman
alphabetic)
a "French Factor" would be quite the opposite, as in
"We need to *exclude* all their characters in *our* standard and replace
them with new ones we invent specifically for that purpose; then we need
to make sure all of our own characters are mandatorily *included* in
*anyone else's* standard." (Note: I do consider myself a Francophile)
>> Anyway, if it says 20000 characters are covered by the Unicode standard
>> so far, then the question is, does that mean 20000 graphical
>> representations of characters or does it mean actual characters ?
>
> That would be 20,000 codepoints or actual characters. But any one
> codepoint could have tens of different glyphs associated with it.
This is OK if it is presented that way, but in a newspaper article it
would probably be reduced even beyond layman's terms and simply turn
out as either
- "The Unicode standard covers (number of codes) characters"; or
- "The Unicode standard covers (number of glyphs) characters"; or
- "While proponents of the Unicode standard claim (number of codes plus
number of glyphs) characters have been standardised so far, there are
critics who say ...."
In other words, we are back at the question of what the meaning of 'is'
is ;-)
>> Clearly, where different graphical representations are represented by
>> the same code, all the characters that are "ancient versions" of
>> existing characters don't really need to be explicitly covered. Where
>> they are represented by different codes, they would need to be covered.
>
> Correct. As long as it is determined/agreed that the character
> "Qi" is a specific code point, it doesn't really matter (in my opinion)
> whether we unify all of the physical representations of that character
> over the centuries, or between countries that use the same/similar
> ideogram as part of their orthography, into one codepoint.
Emphasis on "determined/agreed" and "same/similar" ;-)
> Others may, and frequently do argue that this is misguided
> for a number of reasons (usually political).
>> It would seem that part of the work involved in encoding is to decide
>> which characters are considered the same and which are considered to be
>> stand-alone. A task that may at least in some cases turn out to be
>> difficult because scholars may have different opinions.
>
> Yes, this has been a big issue - in my mind mostly political.
Well, perhaps they should define and agree on a standard process for
determining whether characters are to be considered same/similar or
different ;-) Hopefully it would filter out the political arguments as
disqualified, but it would probably do the exact opposite ;-)
>> That is my understanding in principle too, but it seems to me that it is
>> not always clear what constitutes "the same character". For example, on
>> my system Qi, Ki and qi produce different codes. Thus, if I have a text
>> with the character Qi in it, the character does not change from its
>> Traditional representation to a Japanese representation when I change
>> it
>> from a Chinese to a Japanese font. Although, I see how this could be
>> achieved even if the codes are different.
>
> May I ask how you input your characters? Assuming that "Qi" was
> one code point for a discrete character, that was shared
> in China, Taiwan, Japan and Korea, then changing the font should
> determine how the character is rendered on your monitor.
The Japanese "Ki" I produced using a romanised front-end input
processor, i.e. you type "ki" and it turns into hiragana ki; then you
hit the space bar and it converts into a Kanji for Ki, and if it's not
the one you are looking for you keep hitting space until the one you're
after shows up.
It is similar with the Chinese FEPs, but I often have trouble with
those, and then I look characters up by the radical method from a table,
i.e. you choose radical input, then the radical stroke count, then the
remaining stroke count, et voilà, you get a list of characters (often
still too large to find your desired character instantly, but ...) from
which you choose the desired one by clicking on it, and it will be
inserted at the cursor position.
In other words, I didn't specify the codes - the input FEPs did that
for me.
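Incidentally, it is easy to check that those input methods really did pick three different code points - a quick sketch, with the simplified, traditional and Japanese forms of qi/ki:

```python
# The three forms of qi/ki carry genuinely different code points,
# which is why changing the font alone cannot turn one into another.
forms = {
    "气": "Simplified Chinese",
    "氣": "Traditional Chinese",
    "気": "Japanese (shinjitai)",
}
for ch, origin in forms.items():
    print(f"{ch}  U+{ord(ch):04X}  {origin}")
# 气  U+6C14  Simplified Chinese
# 氣  U+6C23  Traditional Chinese
# 気  U+6C17  Japanese (shinjitai)
```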
>> Do you have the three writing systems installed on your system ? I'd be
>> interested to learn if it is any different on yours ...
>
> I was able to see clearly the 3 different 'Qi' characters you input.
That's because you have support for all three script systems installed.
If one were missing, that particular character would not be displayed
and would probably show up as garbled Roman text.
> On this computer, I'm running Windows 98 (hold your groans
> please) and have installed the Japanese, Korean, Simp Chinese
> and Trad Chinese MS IME software and I'm also running a
> Vietnamese IME (I think VNI is on this computer).
This looks to me as if they have a system there similar to the one the
old MacOS used, where you had to install so-called Language Kits for
each language. A Language Kit consisted of the fonts and display
capabilities plus one or more input methods. It worked, but it was
something of a bolted-on thing which sometimes caused funny side
effects. For example, Japanese window titles were always preceded by a
katakana "ME" and had a trailing katakana "MO". I think those were the
katakana whose codes are identical to the opening and closing quotes in
Roman scripts. There was even a shareware utility called "MEMO Busters"
to get rid of it ;-)
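Here is a plausible reconstruction of that ME/MO effect (my guess at the mechanism, not verified against the old MacOS): in MacRoman the bytes 0xD2/0xD3 are the curly quotes, while in Shift-JIS those same single bytes are the half-width katakana ME and MO, so a title wrapped in MacRoman curly quotes and misread as Shift-JIS sprouts exactly that ME...MO pair:

```python
# Same bytes, two encodings: MacRoman curly quotes vs Shift-JIS
# half-width katakana ME/MO around an otherwise ASCII window title.
raw = b"\xd2My Window\xd3"
print(raw.decode("mac_roman"))  # “My Window”
print(raw.decode("shift_jis"))  # ﾒMy Windowﾓ
```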
On the new OSX which I am using on this machine, this is all pretty much
integrated and you don't need to add Language Kits anymore. I have not
come across any side effects; the integration seems very well done and
robust. Support for the most common languages is present unless you
explicitly remove it, and all you have to do is tick the boxes for the
desired languages in the language preference pane within System
Preferences.
However, as this OS has only been around for about a year (in this form)
there are still a number of languages missing - I don't think they have
Vietnamese yet.
What I like particularly is the way in which applications separate
program code from the text for menus and dialogs. In OSX an application
is actually a directory (an inheritance from NeXT), and that directory
has one subdirectory branch for the code and another for resources.
Within the resource subtree there is a subdirectory for each language's
resources. Those contain all the text of the application, and they are
just plain vanilla text files. You can edit them and change the names of
menus and the text in dialogs as you please. This requires no technical
expertise (other than using a text editor), and no recompilation or
linking is needed. So, you can do your own localisation by copying an
original language resource directory, e.g. English.lproj, naming the
copy after the target language, e.g. Japanese.lproj, and then editing
the new language resource files and simply translating everything in
there.
When you are done, you log out and back in again, and et voilà, your
application is now Japanese (provided your user preference is set so
that Japanese is higher up on the list than whatever other language has
a resource file - the first language on your preferred list to match a
resource will be the one used for display).
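A minimal sketch of that do-it-yourself localisation, assuming a hypothetical application called SomeApp.app (the Contents/Resources/<Language>.lproj layout and the Localizable.strings file name follow the usual OSX conventions; a toy bundle is created first so the sketch runs on its own, and the file is written as UTF-8 for simplicity):

```python
import shutil
from pathlib import Path

# Build a toy bundle so the sketch is self-contained; a real
# application ships with this layout already.
res = Path("SomeApp.app/Contents/Resources")
(res / "English.lproj").mkdir(parents=True, exist_ok=True)
(res / "English.lproj" / "Localizable.strings").write_text(
    '"Quit" = "Quit";\n', encoding="utf-8")

# The trick itself: copy the English resources under the new name ...
shutil.copytree(res / "English.lproj", res / "Japanese.lproj")

# ... then translate the values in the copied plain-text file.
ja = res / "Japanese.lproj" / "Localizable.strings"
ja.write_text('"Quit" = "終了";\n', encoding="utf-8")
print(ja.read_text(encoding="utf-8"))
```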
This is very neat, and I have done it a few times for Japanese users who
wanted to use software that didn't come with Japanese resources (mostly
shareware products).
The notable exception seems to be Microsoft apps. They must have somehow
worked around this mechanism - maybe they coded a rule into their
packages that says "if you are an English package then ignore any
non-English language resource files".
It seems that the benefits which Unicode - and, with it, the other
multi-lingual techniques in its entourage - has brought are not always
appreciated. Why sell one multi-lingual package if you can sell one
package per language and charge a multi-lingual user twice or more? ;-)
regards
benjamin
Received on Fri Jun 21 14:38:15 2002