2011/07/27

Headache of ISO-8859-1 to UTF8

In Perl, you can use encode('UTF-8', $iso_text_str) to convert an ISO-8859-1 encoded string to a UTF-8 encoded string. In PHP, you can use utf8_encode($iso_text_str) to do the conversion.
However, you may be out of luck: after the conversion, some characters can look funky and not be what you wanted to see. I think this is because some byte values in ISO-8859-1 are invisible control characters, namely \x80-\x9F (see an ISO-8859-1 character table). Text labelled ISO-8859-1 is often really Windows-1252, which puts printable characters such as ™ (\x99) in that range.
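To see the difference for a single byte, here is a minimal sketch that decodes \x99 under both encodings and prints the resulting code point. The helper name code_point_of is made up for illustration; decode comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode one byte with the given encoding
# and report the Unicode code point it maps to.
sub code_point_of {
    my ($encoding, $byte) = @_;
    return sprintf('U+%04X', ord(decode($encoding, $byte)));
}

my $byte = "\x99";
print code_point_of('ISO-8859-1', $byte), "\n";   # U+0099, a C1 control character
print code_point_of('cp1252',     $byte), "\n";   # U+2122, the trade mark sign
```

So if the source data really came from a Windows machine, decoding it as cp1252 instead of ISO-8859-1 may be all that is needed.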





Because I only care about regular characters, such as \x20-\x7F, \xA9 (©), \xAE (®) and \x99 (™ in Windows-1252), I strip all other characters before applying the encoding function.
In Perl:
use Encode;
# Inside [...] the characters ( | ) are literals, so list the ranges directly.
$content =~ s/[^\x20-\x7F\xA9\xAE\x99]+//g;
$content = encode('UTF-8', $content);

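In PHP:
// Inside [...] the characters ( | ) are literals, so list the ranges directly.
$content = preg_replace('/[^\x20-\x7F\xA9\xAE\x99\n]+/', "", $content);
// utf8_encode() is deprecated as of PHP 8.2; mb_convert_encoding($content, 'UTF-8', 'ISO-8859-1') is the equivalent.
$content = utf8_encode($content);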

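UPDATE: Actually, I found that in Perl the encode function does not convert \x99 to ™. That is because \x99 is a control character in ISO-8859-1, and ™ sits at \x99 only in Windows-1252. In the end, my solution was the following: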
use utf8;   # the ™ ® © literals below are UTF-8 in the script source
open(FILE, '>', $your_file) || die "couldn't write to epcmf file\n";
   binmode(FILE, ":encoding(UTF-8)");

   $title =~ s/[^\x20-\x7F\xA9\xAE\x99]+//g;
   $title =~ s/\x99/™/g;
   $title =~ s/\xAE/®/g;
   $title =~ s/\xA9/©/g;
   print FILE $title;
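As an alternative to substituting each symbol by hand, the same result can probably be reached by treating the input as Windows-1252. The sketch below assumes the data really is Windows-1252; the helper name cp1252_to_utf8 and the sample title are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: keep printable ASCII plus the (c), (R) and TM bytes,
# decode what is left as Windows-1252, and return UTF-8 encoded bytes.
sub cp1252_to_utf8 {
    my ($bytes) = @_;
    $bytes =~ s/[^\x20-\x7F\xA9\xAE\x99]+//g;   # same filter as above
    my $chars = decode('cp1252', $bytes);       # bytes -> Perl characters
    return encode('UTF-8', $chars);             # characters -> UTF-8 bytes
}

# Sample input: "Acme", TM byte, space, (c) byte, " 2011", and a stray control byte.
my $title = "Acme\x99 \xA9 2011\x01";
print cp1252_to_utf8($title), "\n";
```

Because the function returns already-encoded bytes, it should be printed to a handle without an encoding layer; otherwise the text gets encoded twice.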

Note:

  1. You should edit and save your script in UTF-8. For example, in PuTTY you can change the character set to UTF-8 under Configuration > Window > Translation.
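  2. In Perl, UTF-8 is not the same as utf8: UTF-8 is the strict, standard encoding, while utf8 is Perl's own lax variant. Also, ":UTF-8" is not a valid layer name, so make sure you write it as binmode(FILE, ":encoding(UTF-8)");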
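
One way to check the second note is to ask the Encode module which encoding object each name resolves to. This is only a small sketch using find_encoding; according to the Encode documentation the two names should report utf-8-strict and utf8 respectively.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Look up both spellings and print the canonical name Encode gives them.
my $strict = find_encoding('UTF-8');
my $lax    = find_encoding('utf8');

print "UTF-8 resolves to: ", $strict->name, "\n";
print "utf8  resolves to: ", $lax->name,    "\n";
```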
