On Fri, 1 Jun 2007, Erick Papadakis wrote:
> My problem is that Japan seems to have had a devil of a time getting
> to standardize its character sets! Some big sites like isize.com use
> Shift_JIS, while others such as Goo or Mixi use EUC-JP, while several
> of the more modern ones (such as blogs) use UTF-8.
Some use all sorts. Starling's sites generally use UTF-8, but we convert
everything to Shift_JIS (on the fly) for Docomo phones.
> When we capture the TITLE (document.title) from these websites, and
> then "rawurldecode" the received text in PHP, the string comes up
> jumbled. If we knew the standard character set before hand, we could
> have used the right mb_convert_encoding and such, but this is now an
> issue.
Ideally, you use the character set encoding from the Content-Type
header. If you can figure out how to get this from JavaScript, I'd be
pretty interested to hear about it.
If there's a META tag, as Christopher pointed out, you can give that a
try. But not everybody uses it (for good reason, actually, for those of
us who do on-the-fly conversion, since a fixed META tag would no longer
match the converted output), and the encoding from the Content-Type
header overrides it, anyway.
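As a rough sketch of that precedence (header first, then META, then a
guess), here is one way it could look if you fetch the page yourself on
the server side. It is written in Python rather than PHP purely for
illustration; the function names, the 4096-byte scan window and the
UTF-8 fallback are all assumptions of mine, not anything standard.

```python
import re

# Hypothetical helpers for illustration only.
# Matches charset=... inside a Content-Type header value.
_CHARSET_IN_HEADER = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)
# Matches charset=... inside a META tag in the raw (undecoded) page bytes.
_CHARSET_IN_META = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    match = _CHARSET_IN_HEADER.search(content_type)
    return match.group(1) if match else None


def charset_from_meta(body, scan_bytes=4096):
    """Return the charset named in a META tag near the top of the page, or None."""
    match = _CHARSET_IN_META.search(body[:scan_bytes])
    return match.group(1).decode("ascii") if match else None


def decode_page(content_type, body, fallback="utf-8"):
    """Decode page bytes; the header wins over META, META over the fallback."""
    charset = charset_from_header(content_type) or charset_from_meta(body) or fallback
    try:
        return body.decode(charset, errors="replace")
    except LookupError:
        # The declared charset name is not one this runtime knows about.
        return body.decode(fallback, errors="replace")
```

Once the page is decoded with the right character set, pulling out the
TITLE is the easy part. This does not help with the JavaScript side,
where the header still isn't visible to you.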
> Would appreciate any insight into how you have solved the issue of
> different in-coming text into programs.
For me over the past seven or eight years, mostly, it's been about
dealing with forms, and I generally just put a hidden text field in the
form with the character set encoding. (Browsers always post using the
encoding in which they received the page containing the form from which
they're posting.)
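A minimal sketch of the hidden-field trick, again in Python just for
illustration. It assumes the form carries a hidden field that I've
arbitrarily called "charset", filled in by the server with the encoding
it used to send the page; that name and the UTF-8 fallback are my own
choices here.

```python
from urllib.parse import parse_qs, urlencode


def decode_form(raw_body, charset_field="charset", fallback="utf-8"):
    """Decode a urlencoded form body using the charset named in a hidden field."""
    # First pass: latin-1 maps every byte to some character, so this cannot
    # fail, and the ASCII-only hidden field comes through unchanged.
    first_pass = parse_qs(raw_body, keep_blank_values=True, encoding="latin-1")
    charset = first_pass.get(charset_field, [fallback])[0]
    try:
        # Second pass: decode the percent-escaped bytes with the real charset.
        return parse_qs(raw_body, keep_blank_values=True,
                        encoding=charset, errors="replace")
    except LookupError:
        # The field named a charset this runtime does not know about.
        return parse_qs(raw_body, keep_blank_values=True,
                        encoding=fallback, errors="replace")


# Simulate what a browser would post from a page served as Shift_JIS.
sample_body = urlencode({"charset": "Shift_JIS", "title": "日本語"},
                        encoding="shift_jis")
fields = decode_form(sample_body)
```

The point of the two passes is that the hidden field's value is plain
ASCII, so it can be read before you know how to decode anything else.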
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974
Mobile sites and software consulting: http://www.starling-software.com
Received on Tue Jun 5 12:58:58 2007