"The file written to the disk has the text rendered just fine. Any explanations why?"
Because you wrote the file in the same way you read it: no encoding specified, so the new file is a copy of the original. But it *would* look wrong if you wrote the file as ">:encoding(UTF-8)".

Here's how it happens:

1. You tell Perl to read a file containing Hindi without specifying the encoding, i.e. as if it contained only ANSI. Perl treats every byte as a separate character, so you now have a string which happens to contain only characters which appear in ANSI: Latin letter a with grave, currency symbol, etc.
2. You tell Perl to write the string to a new file as ANSI. It is identical to the original file.
3. You read the new file as UTF-8 in your text editor. Unlike Perl's read, the editor decodes UTF-8, so it interprets the byte sequences which in ANSI represent Latin letter a with grave, currency symbol, etc. as Hindi letters, not as separate letters from an 8-bit codepage.
4. Meanwhile, you tell Perl to write the string to a web page as UTF-8. Perl sends the UTF-8 encoding of characters like à¤, NOT their ANSI (byte) values.
5. You read the page in your browser. The page displays Latin letter a with grave, currency symbol, etc. because you have sent the UTF-8 values for these and not the raw bytes. The 8-bit values the server sends are now actually सूरज से किरनो पर आई, आकर छत पर ठहरी धूप (which is also probably how your string is stored by Perl).
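
The steps above can be reproduced on a single letter. This is only a minimal sketch using the core Encode module; the sample byte string (the UTF-8 encoding of the Devanagari letter SA, U+0938) and all variable names are made up for illustration and are not taken from your application.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: the three UTF-8 bytes of DEVANAGARI LETTER SA (U+0938).
my $bytes = "\xE0\xA4\xB8";

# Steps 1-2: with no encoding layer, Perl sees one character per byte
# (U+00E0, U+00A4, U+00B8). Written back with no layer, the bytes are unchanged.
my $as_read = $bytes;

# Steps 4-5: encoding those three characters as UTF-8 gives six bytes,
# which a browser shows as accented letters and symbols.
my $double = encode('UTF-8', $as_read);

# What was intended: decode on input, encode on output.
my $text = decode('UTF-8', $bytes);   # one character, U+0938
my $good = encode('UTF-8', $text);    # the original three bytes again

printf "read without a layer: %d characters\n", length $as_read;
printf "sent to the browser:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $double;
printf "decoded first:        %d character, %s\n",
    length $text, join ' ', map { sprintf '%02X', ord } split //, $good;
```

The second line of output should show six bytes (C3 A0 C2 A4 C2 B8) where the original file had three, which is the double encoding described in step 4.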

"So, I got lulled by the documentation that says that I all I have to do is to set the `charset utf-8` in config.yml, and Dancer would take care of everything."
The documentation is strictly correct, but you just asked Dancer to do something you didn't mean. By the time you gave Dancer the string, there was no Hindi in it, just a load of currency symbols and accented Latin characters, which Dancer faithfully passed on to the browser, in UTF-8. Dancer had no way of knowing that it came from a file which had been read as ANSI.

Daniel



From:        Puneet Kishor <punk.kish@gmail.com>
To:        "Stefan Hornburg (Racke)" <racke@linuxia.de>
Cc:        dancer-users@perldancer.org
Date:        23/12/2011 21:42
Subject:        Re: [Dancer-users] utf-8 issues
Sent by:        dancer-users-bounces@perldancer.org





On Dec 23, 2011, at 3:28 PM, Stefan Hornburg (Racke) wrote:

> On 12/23/2011 03:39 AM, Puneet Kishor wrote:
>> Fellow Dancers,
>>
>> I am mystified by the following issue.
>> My Dancer-powered web site converts utf-8 encoded, plain text files formatted with Markdown into html.
>
> How do you open these plain text files inside your Dancer application?
> If you use Perl's open function or
> File::Slurp, you have to tell them
> that your file is UTF-8. There is no way around that.
>

That was it. Thanks. Here is what I had to do:

                - open my $fh, "<", $full_path_to_page
                + open my $fh, "<:encoding(UTF-8)", $full_path_to_page
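
For anyone who wants to see the difference between the two `open` calls in isolation, here is a rough sketch. It writes a scratch file with File::Temp and reads it back both ways; the file contents and variable names are assumptions chosen for the demonstration, not part of the actual site code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file: the UTF-8 bytes of U+0938 followed by a newline.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xE0\xA4\xB8\n";
close $tmp or die "Cannot close $path: $!";

# Old call: no layer, so every byte is read as its own character.
open my $raw, '<', $path or die "Cannot open $path: $!";
my $raw_line = <$raw>;
close $raw;
chomp $raw_line;

# New call: the layer decodes UTF-8 into characters while reading.
open my $fh, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my $line = <$fh>;
close $fh;
chomp $line;

printf "without layer: %d characters\n", length $raw_line;
printf "with layer:    %d character, U+%04X\n", length $line, ord $line;
```

With the layer in place the string holds one real Devanagari character, so whatever encodes it later as UTF-8 produces the right bytes.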

Then I got an error in my customized Markdown.pm where `md5_hex` croaked, so I had to change that

                - my $key = md5_hex($tag);
                + my $key = md5_hex(encode_utf8($tag));
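
The reason for that second change, as far as I can tell, is that `md5_hex` works on bytes and refuses decoded characters above 255. A small sketch with a stand-in value for `$tag` (the real tags in Markdown.pm will differ):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Stand-in value: one decoded character, DEVANAGARI LETTER SA.
my $tag = "\x{0938}";

# Passing decoded text straight to md5_hex dies with a "Wide character" error.
my $ok = eval { md5_hex($tag); 1 };
print "md5_hex on characters failed: $@" unless $ok;

# Encoding to UTF-8 bytes first gives md5_hex the input it expects.
my $key = md5_hex(encode_utf8($tag));
print "key: $key\n";
```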

It works now. So, I got lulled by the documentation that says that all I have to do is to set the `charset utf-8` in config.yml, and Dancer would take care of everything.

Another interesting thing -- before I made the above changes, as I noted in my earlier email, I just wrote out the output to a file on disk before sending it back to the browser. The file written to the disk has the text rendered just fine. Any explanations why?


In any case, all's well now.

--
Puneet Kishor