(keitai-l) Re: PHP and Japanese characters: Looking for wisdom

From: Stephen Chasey <src_at_ubit.com>
Date: 06/02/07
Message-ID: <op.ts9yyxtelpjjyh@src-note.lan>

Hello,

You may be able to check content type headers and meta encoding tags to  
determine the charset.
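
As a rough illustration of that first idea, here is a minimal Python sketch that pulls a declared charset out of a Content-Type header value or, failing that, out of a meta tag near the top of the page. The function names are made up for this example, and the regular expressions are a simplification rather than a full HTML parser, so treat it as a starting point only. Also bear in mind that a declared charset can be missing or wrong.

```python
import re

# Hypothetical helpers for illustration; the names are not from any library.
_HEADER_RE = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)
_META_RE = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def charset_from_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    if not content_type:
        return None
    match = _HEADER_RE.search(content_type)
    return match.group(1).lower() if match else None


def charset_from_meta(html_bytes):
    """Return the charset declared in a <meta> tag near the start of the page, or None."""
    # Only scan the first few KB; charset declarations normally sit in <head>.
    match = _META_RE.search(html_bytes[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def declared_charset(content_type, html_bytes):
    """Prefer the HTTP header; fall back to the meta tag."""
    return charset_from_header(content_type) or charset_from_meta(html_bytes)
```

For example, declared_charset("text/html", page_bytes) falls through to the meta tag because the header carries no charset parameter.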

Also, if mb_detect_encoding is failing, you could try a quick script in  
another language such as Perl to fill the gap.

I did some work with Asian language processing a few years ago.  If you  
are only dealing with Japanese for the moment, there are ways to detect  
the charset of Japanese text algorithmically from its byte sequences.
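
One simple way to approximate byte-sequence detection is to try strict decoding with each candidate charset and keep the first one that succeeds. The Python sketch below does that using the standard library codecs; the function name and the candidate order are my own assumptions, not an established algorithm, and a real detector would also weigh character frequencies.

```python
def guess_japanese_encoding(data):
    """Return the first candidate codec that strictly decodes `data`, or None.

    Hypothetical helper: the candidate list and its order are assumptions.
    """
    if b"\x1b" in data:
        # ISO-2022-JP is a 7-bit encoding that switches character sets
        # with escape sequences, so only try it when ESC is present.
        candidates = ("iso2022_jp", "utf_8", "euc_jp", "shift_jis")
    elif all(byte < 0x80 for byte in data):
        return "ascii"
    else:
        candidates = ("utf_8", "euc_jp", "shift_jis")

    for codec in candidates:
        try:
            data.decode(codec)  # strict mode raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        return codec
    return None
```

UTF-8 is tried first because its byte structure is strict enough that EUC-JP or Shift_JIS text rarely passes as valid UTF-8. EUC-JP and Shift_JIS overlap more (half-width katakana in particular), so short strings can be ambiguous and the result should be treated as a guess.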

A quick search turned up this Python module which might do the job:
http://chardet.feedparser.org/

This Perl module also has some emoji support:
http://search.cpan.org/~hio/Unicode-Japanese-0.37/lib/Unicode/Japanese.pm

Steve

On Fri, 01 Jun 2007 22:14:15 +0900, Erick Papadakis <erick.papa@gmail.com>  
wrote:

> Hi,
>
> Seeing as how this list is aflutter with tech savvy folk, I hope
> someone can shed some light on this problem.
>
> We're developing something in Japanese that needs input from a
> JavaScript "escaped" string. JavaScript is unfortunately a must
> because the text comes from the client side via a bookmarklet. (If it
> could be a regular POST or GET, then there'd be no issues).
>
> My problem is that Japan seems to have had a devil of a time
> standardizing its character sets! Some big sites like isize.com use
> Shift_JIS, others such as Goo or Mixi use EUC-JP, and several
> of the more modern ones (such as blogs) use UTF-8.
>
> When we capture the TITLE (document.title) from these websites, and
> then "rawurldecode" the received text in PHP, the string comes up
> jumbled. If we knew the character set beforehand, we could
> have used the right mb_convert_encoding call, but we don't, so this is
> now an issue. We tried using JavaScript's "document.defaultCharset"
> thingie, but that doesn't work either -- I wonder if that's a
> deprecated element of the document object?
>
> Would appreciate any insight into how you have handled incoming text
> in different encodings in your programs. The PHP function
> "mb_detect_encoding" is totally useless. Given a string, it always
> seems to return UTF-8.
>
> Many thanks in advance!
>
> .ep

-- 
Ubit Europe B.V.
Stephen Chasey
Mobile: +81 80 5505 7932
Tel: +31 20 408 1481
Fax: +31 84 711 5404