"The file
written to the disk has the text rendered just fine. Any explanations why?"
Because you wrote the file in the same way you read it: no encoding was
specified on either handle, so the new file is a byte-for-byte copy of the
original. But it *would* look wrong if you had opened the output file with
">:encoding(UTF-8)".
Here's how it happens:
1. You tell Perl to read a file with Hindi without specifying the
encoding, i.e. as if it contained only ANSI. You now have a string in
which each byte of the file is a separate character, and all of them
happen to be characters that appear in ANSI: Latin letter a with grave,
currency symbol, etc.
2. You tell Perl to write the string to a new file as ANSI. It is
identical to the original file.
3. You read the new file as UTF-8 in your text editor. Unlike Perl's read,
the editor decodes UTF-8, so it interprets the byte sequences which in ANSI
represent Latin letter a with grave, currency symbol, etc. as Hindi
letters, not as separate letters from an 8-bit codepage.
4. Meanwhile, you tell Perl to write the string to a web page as UTF-8.
Perl sends the UTF-8 values of the characters like à¤,
NOT their ANSI (byte) values.
5. You read the page in your browser. Your page displays Latin letter a
with grave, currency symbol, etc. because you have sent the UTF-8 values
for these and not the raw bytes.
The 8-bit values the server sends are now actually à ¤¸à ¥‚à ¤°à ¤œ
à ¤¸à ¥‡ à ¤•à ¤¿à ¤°à ¤¨à ¥‹
à ¤ªà ¤° à ¤†à ¤ˆ, à ¤†à ¤•à ¤°
à ¤›à ¤¤ à ¤ªà ¤° à ¤ à ¤¹à ¤°à ¥€
à ¤§à ¥‚à ¤ª (which is also probably how your string
is stored by Perl).
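The steps above can be reproduced in memory. This is only a rough sketch:
it uses the Encode module in place of real file handles, and the variable
names ($file_bytes, $raw, $sent) are made up for illustration. The sample
bytes are the UTF-8 encoding of the first four-character word of the text.

```perl
use strict;
use warnings;
use Encode qw(encode);

# UTF-8 bytes of a four-character Hindi word, as stored in the file.
my $file_bytes = "\xE0\xA4\xB8\xE0\xA5\x82\xE0\xA4\xB0\xE0\xA4\x9C";

# Step 1: a plain "<" handle returns the bytes undecoded, so Perl sees
# one character per byte (0xE0 is a with grave, 0xA4 is the currency sign).
my $raw        = $file_bytes;
my $raw_length = length $raw;

# Step 2: a plain ">" handle writes those bytes back unchanged.
my $copy_is_identical = ($raw eq $file_bytes) ? 1 : 0;

# Step 4: encoding on output turns each of those one-byte characters
# into its own two-byte UTF-8 sequence.
my $sent        = encode('UTF-8', $raw);
my $sent_length = length $sent;

printf "read: %d chars, copy identical: %d, sent: %d bytes\n",
    $raw_length, $copy_is_identical, $sent_length;
```

Every byte in the sample is above 0x7F, so each one doubles in size on the
way out, which is why the browser shows two Latin-1 characters for every
byte of the original file.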
"So, I
got lulled by the documentation that says that all I have to do is to
set the `charset utf-8` in config.yml, and Dancer would take care of everything."
The documentation is strictly correct, but you asked Dancer to do
something you didn't mean. By the time you gave Dancer the string, there
was no Hindi in it, just a load of currency symbols and accented Latin
characters, which Dancer faithfully passed on to the browser, in UTF-8.
Dancer had no way of knowing that it came from a file which had been read
as ANSI.
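For contrast, here is a small sketch of what happens when the bytes are
decoded on the way in. It assumes the same sample bytes as before and uses
Encode::decode to stand in for a "<:encoding(UTF-8)" read; the variable
names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes of a four-character Hindi word, as stored in the file.
my $file_bytes = "\xE0\xA4\xB8\xE0\xA5\x82\xE0\xA4\xB0\xE0\xA4\x9C";

# Decode at the input boundary: Perl now holds real Hindi characters.
my $text       = decode('UTF-8', $file_bytes);
my $char_count = length $text;

# Encode once at the output boundary, as the charset setting does for
# the response. The result matches the original file byte for byte.
my $response_bytes = encode('UTF-8', $text);
my $round_trip_ok  = ($response_bytes eq $file_bytes) ? 1 : 0;

printf "characters: %d, response bytes: %d, matches file: %d\n",
    $char_count, length $response_bytes, $round_trip_ok;
```

With the string decoded first, the single encode on output is exactly what
you want, and nothing gets doubled.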
Daniel
From: Puneet Kishor <punk.kish@gmail.com>
To: "Stefan Hornburg (Racke)" <racke@linuxia.de>
Cc: dancer-users@perldancer.org
Date: 23/12/2011 21:42
Subject: Re: [Dancer-users] utf-8 issues
Sent by: dancer-users-bounces@perldancer.org
On Dec 23, 2011, at 3:28 PM, Stefan Hornburg (Racke) wrote:
> On 12/23/2011 03:39 AM, Puneet Kishor wrote:
>> Fellow Dancers,
>>
>> I am mystified by the following issue.
>> My Dancer-powered web site converts utf-8 encoded, plain text
>> files formatted with Markdown into html.
>
> How do you open these plain text files inside your Dancer application?
> If you use Perl's open function or File::Slurp, you have to tell them
> that your file is UTF-8. There is no way around that.
>
That was it. Thanks. Here is what I had to do:
- open my $fh, "<", $full_path_to_page
+ open my $fh, "<:encoding(UTF-8)", $full_path_to_page
Then I got an error in my customized Markdown.pm where `md5_hex` croaked,
so I had to change that:
- my $key = md5_hex($tag);
+ my $key = md5_hex(encode_utf8($tag));
It works now. So, I got lulled by the documentation that says that all
I have to do is to set the `charset utf-8` in config.yml, and Dancer would
take care of everything.
Another interesting thing -- before I made the above changes, as I noted
in my earlier email, I just wrote out the output to a file on disk before
sending it back to the browser. The file written to the disk has the text
rendered just fine. Any explanations why?
In any case, all's well now.
--
Puneet Kishor