[Dancer-users] utf-8 issues
Daniel Perrett
dperrett at cambridge.org
Tue Jan 3 13:42:35 CET 2012
"The file written to the disk has the text rendered just fine. Any
explanations why?"
Because you wrote the file in the same way you read it: no encoding
specified, so the new file is a copy of the original. But it *would* look
wrong if you wrote the file as ">:encoding(UTF-8)".
Here's how it happens:
1. You tell Perl to read a file with Hindi without specifying the
encoding, i.e. as if it contained only ANSI, so every byte becomes one
character. You now have a Perl string which happens to represent only
characters which appear in ANSI: Latin letter a with acute, currency
symbol, etc.
2. You tell Perl to write the string to a new file as ANSI. It is
identical to the original file.
3. You read the new file as UTF-8 in your text editor. Unlike Perl's read,
this reads as UTF-8 so interprets the byte sequences which in ANSI
represent Latin letter a with acute, currency symbol, etc as Hindi
letters, not separate letters from an 8-bit codepage.
4. Meanwhile, you tell Perl to write the string to a web page as UTF-8.
Perl sends the UTF-8 values of the characters like à¤, NOT their ANSI
(byte) values.
5. You read the page in your browser. Your page displays Latin letter a
with acute, currency symbol, etc. because you have sent the UTF-8 values
for these and not the raw bytes. The 8-bit values the server sends are now
actually à ¤¸à ¥‚à ¤°à ¤œ à ¤¸à ¥‡
à ¤•à ¤¿à ¤°à ¤¨à ¥‹ à ¤ªà ¤° à ¤†à ¤ˆ,
à ¤†à ¤•à ¤° à ¤›à ¤¤ à ¤ªà ¤° à ¤ à ¤¹à ¤°à ¥€
à ¤§à ¥‚à ¤ª (which is also probably how your string is stored by
Perl).
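The steps above can be reproduced in memory, without any files. This is only a minimal sketch: the sample word and all variable names are made up for illustration, and it assumes the core Encode module. It compares what reaches the browser when the bytes were never decoded with what reaches it when they were.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical sample: a four-letter Hindi word as a Perl character string.
my $hindi = "\x{938}\x{942}\x{930}\x{91C}";

# The bytes as they sit on disk in a UTF-8 file (3 bytes per letter here).
my $bytes_on_disk = encode('UTF-8', $hindi);

# Step 1: reading with no encoding layer gives one "character" per byte.
my $misread = $bytes_on_disk;

# Step 2: writing with no layer emits the same bytes, so the copy is intact.
my $file_copy = $misread;

# Step 4: encoding the misread string as UTF-8 encodes every byte again.
my $sent_to_browser = encode('UTF-8', $misread);

# The fix: decode on input, so encoding on output restores the original bytes.
my $decoded        = decode('UTF-8', $bytes_on_disk);
my $sent_correctly = encode('UTF-8', $decoded);

printf "on disk: %d bytes, double-encoded: %d bytes, correct: %d bytes\n",
    length $bytes_on_disk, length $sent_to_browser, length $sent_correctly;
# prints "on disk: 12 bytes, double-encoded: 24 bytes, correct: 12 bytes"
```

The doubled byte count is the accented-letter-and-currency-symbol text the browser shows. The file copy in step 2 is still byte-for-byte the original.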
"So, I got lulled by the documentation that says that all I have to do
is to set the `charset utf-8` in config.yml, and Dancer would take care of
everything."
The documentation is strictly correct, but you just asked Dancer to do
something you didn't mean. By the time you gave Dancer the string, there
was no Hindi in it, just a load of currency symbols and accented Latin
characters, which Dancer faithfully passed on to the browser, in UTF-8.
Dancer had no way of knowing that it came from a file which had been read
as ANSI.
Daniel
From: Puneet Kishor <punk.kish at gmail.com>
To: "Stefan Hornburg (Racke)" <racke at linuxia.de>
Cc: dancer-users at perldancer.org
Date: 23/12/2011 21:42
Subject: Re: [Dancer-users] utf-8 issues
Sent by: dancer-users-bounces at perldancer.org
On Dec 23, 2011, at 3:28 PM, Stefan Hornburg (Racke) wrote:
> On 12/23/2011 03:39 AM, Puneet Kishor wrote:
>> Fellow Dancers,
>>
>> I am mystified by the following issue.
>> My Dancer-powered web site converts utf-8 encoded, plain text files
>> formatted with Markdown into html.
>
> How do you open these plain text files inside your Dancer application?
> If you use Perl's open function or File::Slurp, you have to tell them
> that your file is UTF-8. There is no way around that.
>
That was it. Thanks. Here is what I had to do:
- open my $fh, "<", $full_path_to_page
+ open my $fh, "<:encoding(UTF-8)", $full_path_to_page
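A small sketch of what that one-line change does. The temporary file and the names $path, $raw, $as_bytes and $as_chars are invented for the illustration, and it assumes no PERL_UNICODE setting or `open` pragma is changing the default layers.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a hypothetical test file holding the UTF-8 bytes of a 4-letter Hindi word.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xE0\xA4\xB8\xE0\xA5\x82\xE0\xA4\xB0\xE0\xA4\x9C";
close $out or die "close failed: $!";

# Without a layer, each byte comes back as a separate character.
open my $raw, '<', $path or die "open failed: $!";
my $as_bytes = do { local $/; <$raw> };
close $raw;

# With the layer, the bytes are decoded into four Devanagari characters.
open my $fh, '<:encoding(UTF-8)', $path or die "open failed: $!";
my $as_chars = do { local $/; <$fh> };
close $fh;

printf "no layer: %d, with layer: %d\n", length $as_bytes, length $as_chars;
# prints "no layer: 12, with layer: 4"
```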
Then I got an error in my customized Markdown.pm where `md5_hex` croaked,
so I had to change that:
- my $key = md5_hex($tag);
+ my $key = md5_hex(encode_utf8($tag));
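Why the second change became necessary, as a rough sketch: once the file is decoded, $tag can hold characters above 255, and Digest::MD5 only accepts bytes. The sample tag text below is invented, and the real Markdown.pm code around it is not shown here.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Hypothetical decoded tag text containing Devanagari characters.
my $tag = "\x{938}\x{942}\x{930}\x{91C}";

# md5_hex works on bytes; a string with code points above 255 makes it croak.
my $died = !eval { md5_hex($tag); 1 };

# Encoding to UTF-8 bytes first gives the digest a well-defined input.
my $key = md5_hex(encode_utf8($tag));

print "croaked on characters: ", ($died ? "yes" : "no"), "\n";
print "key: $key\n";
```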
It works now. So, I got lulled by the documentation that says that all I
have to do is to set the `charset utf-8` in config.yml, and Dancer would
take care of everything.
Another interesting thing -- before I made the above changes, as I noted
in my earlier email, I just wrote out the output to a file on disk before
sending it back to the browser. The file written to the disk has the text
rendered just fine. Any explanations why?
In any case, all's well now.
--
Puneet Kishor