[Dancer-users] utf-8 issues

Daniel Perrett dperrett at cambridge.org
Tue Jan 3 13:42:35 CET 2012


"The file written to the disk has the text rendered just fine. Any 
explanations why?"
Because you wrote the file in the same way you read it: no encoding 
specified, so the new file is a copy of the original. But it *would* look 
wrong if you wrote the file as ">:encoding(UTF-8)".

Here's how it happens:

1. You tell Perl to read a file with Hindi without specifying the 
encoding, i.e. as if it contained only ANSI (Perl treats each byte as one 
Latin-1 character). You now have a string which happens to contain only 
characters which appear in ANSI: Latin letter a with grave, currency 
symbol, etc.
2. You tell Perl to write the string to a new file as ANSI. It is 
identical to the original file.
3. You read the new file as UTF-8 in your text editor. Unlike Perl's read, 
this reads as UTF-8, so it interprets the byte sequences which in ANSI 
represent Latin letter a with grave, currency symbol, etc. as Hindi 
letters, not separate letters from an 8-bit codepage.
4. Meanwhile, you tell Perl to write the string to a web page as UTF-8. 
Perl sends the UTF-8 values of the characters like à¤, NOT their ANSI 
(byte) values.
5. You read the page in your browser. Your page displays Latin letter a 
with grave, currency symbol, etc. because you have sent the UTF-8 values 
for these and not the raw bytes. The 8-bit values the server sends are now 
actually à ¤¸à ¥‚à ¤°à ¤œ à ¤¸à ¥‡ 
à ¤•à ¤¿à ¤°à ¤¨à ¥‹ à ¤ªà ¤° à ¤†à ¤ˆ, 
à ¤†à ¤•à ¤° à ¤›à ¤¤ à ¤ªà ¤° à ¤ à ¤¹à ¤°à ¥€ 
à ¤§à ¥‚à ¤ª (which is also probably how your string is stored by 
Perl).
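
The steps above can be reproduced in a few lines. This is only a sketch: it uses in-memory filehandles in place of the real file and web page, a single Devanagari letter in place of the full text, and the helper `hex_of` is a made-up name for illustration.

```perl
use strict;
use warnings;

# Hex dump helper (hypothetical name) to make the bytes visible.
sub hex_of { return join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

# UTF-8 bytes of DEVANAGARI LETTER SA (U+0938), standing in for the Hindi file.
my $file_bytes = "\xE0\xA4\xB8";

# Step 1: read with no encoding layer, so each byte becomes one character.
open my $in, '<', \$file_bytes or die "open: $!";
my $as_read = do { local $/; <$in> };
close $in;

# Step 2: write with no layer; the bytes come out unchanged.
open my $plain, '>', \my $copy or die "open: $!";
print {$plain} $as_read;
close $plain;

# Step 4: write through a UTF-8 layer; each of the three characters
# is encoded separately.
open my $web, '>:encoding(UTF-8)', \my $page or die "open: $!";
print {$web} $as_read;
close $web;

print length($as_read), "\n";   # 3
print hex_of($copy), "\n";      # E0 A4 B8
print hex_of($page), "\n";      # C3 A0 C2 A4 C2 B8
```

The copy is byte-for-byte the original, which is why the file on disk looks fine in an editor, while the "web" output has grown from three bytes to six.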

"So, I got lulled by the documentation that says that all I have to do 
is to set the `charset utf-8` in config.yml, and Dancer would take care of 
everything."
The documentation is strictly correct, but you just asked Dancer to do 
something you didn't mean. By the time you gave Dancer the string, there 
was no Hindi in it, just a load of currency symbols and accented Latin 
characters, which Dancer faithfully passed on to the browser, in UTF-8. 
Dancer had no way of knowing that it came from a file which had been read 
as ANSI.
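
To make that concrete, here is a rough sketch of the difference between handing over an undecoded and a decoded string. It assumes the core Encode module, and `encode('UTF-8', ...)` is only a stand-in for whatever a framework does on output when the charset is utf-8; it is not Dancer's actual code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes of DEVANAGARI LETTER SA (U+0938), as they sit in the file.
my $file_bytes = "\xE0\xA4\xB8";

# Read as raw bytes: three characters (U+00E0, U+00A4, U+00B8), no Hindi.
my $undecoded = $file_bytes;

# Decoded on the way in: one character, U+0938.
my $decoded = decode('UTF-8', $file_bytes);

# Stand-in for the output step with charset utf-8.
my $sent_wrong = encode('UTF-8', $undecoded);
my $sent_right = encode('UTF-8', $decoded);

print length($decoded), "\n";                              # 1
print length($sent_wrong), "\n";                           # 6
print $sent_right eq $file_bytes ? "same\n" : "differ\n";  # same
```

Only the decoded string round-trips to the original bytes; the output step cannot tell the two inputs apart except by the characters they contain.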

Daniel



From:   Puneet Kishor <punk.kish at gmail.com>
To:     "Stefan Hornburg (Racke)" <racke at linuxia.de>
Cc:     dancer-users at perldancer.org
Date:   23/12/2011 21:42
Subject:        Re: [Dancer-users] utf-8 issues
Sent by:        dancer-users-bounces at perldancer.org




On Dec 23, 2011, at 3:28 PM, Stefan Hornburg (Racke) wrote:

> On 12/23/2011 03:39 AM, Puneet Kishor wrote:
>> Fellow Dancers,
>> 
>> I am mystified by the following issue.
>> My Dancer-powered web site converts utf-8 encoded, plain text files
>> formatted with Markdown into html.
> 
> How do you open these plain text files inside your Dancer application?
> If you use Perl's open function or File::Slurp, you have to tell them
> that your file is UTF-8. There is no way around that.
> 

That was it. Thanks. Here is what I had to do:

                 - open my $fh, "<", $full_path_to_page
                 + open my $fh, "<:encoding(UTF-8)", $full_path_to_page

Then I got an error in my customized Markdown.pm where `md5_hex` croaked, 
so I had to change that

                 - my $key = md5_hex($tag);
                 + my $key = md5_hex(encode_utf8($tag));
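
For reference, a small sketch of why the second change is needed once the file is decoded. It assumes `md5_hex` comes from Digest::MD5 and `encode_utf8` from Encode (the message does not say which modules are in use), and the value of `$tag` here is made up.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# A decoded string containing a character above U+00FF (hypothetical value).
my $tag = "\x{0938}";

# md5_hex works on bytes, so it dies when given a wide character.
my $ok = eval { md5_hex($tag); 1 };
print $ok ? "hashed\n" : "croaked\n";   # croaked

# Encoding to UTF-8 bytes first gives it something it can digest.
my $key = md5_hex(encode_utf8($tag));
print length($key), "\n";               # 32
```

The digest is then computed over the same bytes that are in the file, so keys stay stable regardless of how the string is held in memory.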

It works now. So, I got lulled by the documentation that says that all I 
have to do is to set the `charset utf-8` in config.yml, and Dancer would 
take care of everything.

Another interesting thing -- before I made the above changes, as I noted 
in my earlier email, I just wrote out the output to a file on disk before 
sending it back to the browser. The file written to the disk has the text 
rendered just fine. Any explanations why?


In any case, all's well now.

--
Puneet Kishor

