On Mon, 17 Jan 2005, Alex Shinn wrote:
> At Mon, 17 Jan 2005 14:57:59 +0900 (JST), Curt Sampson wrote:
> >
> > This is not true, because sorts based on the numerical representation of
> > a kana can't give dakuon a lower precedence than kana following the kana
> > with dakuon. For example, 「じゃきょう」 sorts before 「しゃく」 in my
> > dictionary, but with a sort based on character codes, じ (0x3058) comes
> > after し (0x3057), and so じゃきょう would sort after even 「しんぬ」.
>
> Oops, sorry, don't mind me; I was asleep when I replied :(

I have made the exact same mistake on this list in the past.

> I think for hiragana only your algorithm works.

Right. But you could translate katakana in the same way, if you wanted,
with a little tweak or two to deal with elongation marks and so on, and
maybe adding a fourth digit if you really care to sort katakana after
hiragana when the words are otherwise exactly the same.
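
To make that concrete, here is a rough Python sketch of a multi-level
sort key. It is not the algorithm from my page or the JIS collation, just
an illustration of the idea: the name kana_sort_key and the three levels
(base kana, then voicing, then script) are my own assumptions for the
example, and it ignores elongation marks and small-kana folding entirely.

```python
import unicodedata

DAKUTEN = "\u3099"     # combining voiced sound mark
HANDAKUTEN = "\u309a"  # combining semi-voiced sound mark


def kana_sort_key(word):
    """Hypothetical key: compare base kana first, then voicing, then script."""
    base, voicing, script = [], [], []
    for ch in word:
        is_katakana = "\u30a1" <= ch <= "\u30f6"
        if is_katakana:
            ch = chr(ord(ch) - 0x60)  # fold katakana onto the hiragana block
        # NFD splits e.g. じ into し plus a combining dakuten.
        decomposed = unicodedata.normalize("NFD", ch)
        mark = decomposed[1:]
        base.append(decomposed[0])
        voicing.append(1 if mark == DAKUTEN else 2 if mark == HANDAKUTEN else 0)
        script.append(1 if is_katakana else 0)  # hiragana before katakana
    return (base, voicing, script)


words = ["しんぬ", "しゃく", "じゃきょう"]
by_code = sorted(words)                    # しゃく, しんぬ, じゃきょう
by_key = sorted(words, key=kana_sort_key)  # じゃきょう, しゃく, しんぬ
```

The script level only comes into play when everything before it ties, so
sorting ["カナ", "かに", "かな"] with this key puts かな first, then カナ,
then かに.
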
> Including kanji, katakana and romaji the JIS standard includes 5
> collation levels - you can see an open source implementation of the
> full collation in Perl's Lingua::JA::Sort::JIS:

Ah, right. That was actually linked from my page, though an HTML error
made the link hard to see. I didn't really understand the algorithm it
was using, though.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974
*** Contribute to the Keitai Developers' Wiki! ***
*** http://www.keitai-dev.net/wiki/ ***
Received on Mon Jan 17 10:26:55 2005