HTML Tidy output with accented characters


    Guest, 27th Jun 2012 9:00 am

    I have been experiencing problems with Tidy since installing HTML-Kit Tools ver. 20120605a as a fresh install. The code I have is as follows:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">

    <html>
    <head>
    <title>Untitled</title>
    </head>
    <body>

    Escape codes test

    áéíóúýàèìòùâêîôûãõç""

    </body>
    </html>

    The output from Tidy is as follows:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
    <html>
    <head>
    <meta name="generator"
    content="HTML-Kit Tools HTML Tidy plugin">
    <title>
    Untitled
    </title>
    </head>
    <body>
    <p>
    Escape codes test
    </p>
    <p>
    áéíóúýàèìòùâêîôûãõç""
    </p>
    </body>
    </html>

    As you can see, the accented characters are not being converted to their respective HTML entities but to other characters.

    How can I fix this permanently?

    • HTML-Kit Support, 27th Jun 2012 3:44 pm

      On 6/27/2012 9:00 AM, Guest wrote:

      I have been experiencing problems with Tidy since installing HTMLKit
      Tools ver. 20120605a, as fresh install. The code I have is as
      follows:

      <snip>

      Escape codes test

      áéíóúýàèìòùâêîôûãõç""

      <snip>

      The output from Tidy is as follows:

      <snip>

       
      

      <p>
      Escape codes test
      </p>
      <p>
      áéíóúýà èìòùâêîôûãõç""
      </p>

      <snip>

      As you can see, the accented characters are not being converted to
      their respective HTML entities but to other characters.

      How to fix this in a permanent way?

      Hi,

      What's happening here is that Tidy is converting accented characters to
      UTF-8 encoding, which is the default encoding suggested by W3C.
      Unfortunately, Tidy's use of UTF-8 has caused some confusion because
      some UTF-8 encoded characters look like gibberish even though there's
      nothing wrong with them.

      http://www.w3.org/International/O-charset.en.php

      If most of your documents use Latin-1 characters, you can create a
      custom Tidy config to preserve accented characters by adding this option:

      output-encoding: latin1

      as described on:

      http://www.html-kit.com/support/tools/tidy-config/
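To make the difference between the two output encodings concrete, here is a small Python sketch (not part of HTML-Kit or Tidy; the helper name `encoded_sizes` is made up for illustration). It only shows how the same accented text is stored as bytes under Latin-1 and under UTF-8.

```python
def encoded_sizes(text):
    """Return the byte length of text in Latin-1 and in UTF-8 (illustrative helper)."""
    return len(text.encode("latin-1")), len(text.encode("utf-8"))


sample = "áéç"
print(sample.encode("latin-1"))  # one byte per accented character
print(sample.encode("utf-8"))    # two bytes per accented character
print(encoded_sizes(sample))
```

Both byte sequences represent the same text; what matters is that the program reading the file assumes the same encoding that was used to write it.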

      I've also added a switch to better control Tidy's use of UTF-8. This
      will be in the next update. If you'd like to get an early test version,
      let me know what your User Assistant username is and I'll enable
      TreeHouse test versions for you.

      Hope this helps!

      Chami

      • Guest, 28th Jun 2012 4:06 am

        Thank you for the reply; the problem, as I see it, is that accented characters are not being converted by Tidy to their respective HTML entity names but to some other UTF-8 characters. In fact whenever I paste the Tidy output to the editor and save the page, the accented characters are converted to the UTF-8 gibberish. I will try to change my configuration and run Tidy again, and if nothing works I'll let you know. It could be a good idea to grab the test version; my username is cmpsalvestrini.

      • Steve, 28th Jun 2012 5:46 am

        HTML-Kit Support wrote:

        On 6/27/2012 9:00 AM, Guest wrote:

        I have been experiencing problems with Tidy since installing HTMLKit
        Tools ver. 20120605a, as fresh install. The code I have is as

        [snip]

        What's happening here is that Tidy is converting accented characters
        to UTF-8 encoding, which is the default encoding suggested by W3C.
        Unfortunately, Tidy's use of UTF-8 has caused some confusion because
        some UTF-8 encoded characters look like gibberish even though there's
        nothing wrong with them.

        Further information to Chami's explanation: The file with the UTF-8
        characters is being read as having a Windows-1252 character set.

        Encoding Problem: Treating UTF-8 Bytes as Windows-1252 or ISO-8859-1
        http://www.i18nqa.com/debug/bug-utf-8-latin1.html

        UTF-8 Encoding Debugging Chart
        http://www.i18nqa.com/debug/utf8-debug.html
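The misreading described here can be reproduced outside the editor. The following Python sketch is only an illustration, and the function names `misread_utf8` and `repair` are hypothetical. It encodes text as UTF-8, decodes the bytes as Windows-1252 to produce the "gibberish", and then reverses the mistake.

```python
def misread_utf8(text, wrong_encoding="cp1252"):
    """Encode as UTF-8, then decode the bytes with the wrong character set."""
    return text.encode("utf-8").decode(wrong_encoding, errors="replace")


def repair(garbled, wrong_encoding="cp1252"):
    """Undo the misreading, assuming no bytes were lost along the way."""
    return garbled.encode(wrong_encoding).decode("utf-8")


print(misread_utf8("é"))            # Ã©
print(repair(misread_utf8("áéç")))  # áéç
```

The round trip works because the bytes in the file are intact; only the character set used to display them was wrong.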

        If most of your documents use Latin-1 characters, you can create a
        custom Tidy config to preserve accented characters by adding this
        option:
        output-encoding: latin1

        Also:

        preserve-entities: yes ¹

        This will prevent Tidy from converting coded entities into characters.

        Another possibility is to write a Unicode Byte Order Mark character
        (BOM) at the beginning of the output:

        output-bom: yes ²

        This will let UTF-aware software know that the file contains UTF-8
        characters. An unfortunate side effect of this setting is that if the
        software is not UTF-aware, it will see more "junk" characters at the
        beginning of the file. :-(
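As a rough illustration of what the BOM looks like on disk (a Python sketch, not Tidy's own code; `with_bom` is a made-up helper), the snippet below writes the three-byte UTF-8 BOM in front of some text and then reads it back both with a UTF-aware decoder and as Windows-1252.

```python
import codecs


def with_bom(text):
    """Encode text as UTF-8 with a leading byte order mark (illustrative helper)."""
    return text.encode("utf-8-sig")


data = with_bom("<p>á</p>")
print(data.startswith(codecs.BOM_UTF8))  # True
print(data.decode("utf-8-sig"))          # BOM recognized and dropped
print(data.decode("cp1252"))             # BOM shows up as extra characters
```

The last line shows the "junk" side effect: software that does not recognize the BOM displays it as three extra characters at the start of the file.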


        1. http://tidy.sourceforge.net/docs/quickref.html#preserve-entities

        2. http://tidy.sourceforge.net/docs/quickref.html#output-bom

        --
        Steve

        In heaven all the interesting people are missing. -Friedrich Nietzsche

        • HTML-Kit Support, 28th Jun 2012 7:46 am

          On 6/28/2012 5:46 AM, Steve wrote:

          Steve

          Steve! Nice to see you (and the quotes) again on the newsgroup :)

          Chami

          • cmpsalvestrini, 5th Jul 2012 12:30 pm

            I guess I must explain the source of my frustration. I want Tidy to convert my accented characters into entities and at this time Tidy is not converting them; instead it's converting them to other accented characters. I'm going to try the solutions given here, and if all else fails I'll get back to you. Cheers, and thank you for the reply.

            • HTML-Kit Support, 5th Jul 2012 5:16 pm

              On 7/5/2012 12:30 PM, cmpsalvestrini wrote:

              I guess I must explain the source of my frustration. I want Tidy to
              convert my accented characters into entities and at this time Tidy is
              not converting them; instead it's converting them to other accented
              characters. I'm going to try the solutions given here, and if all
              else fails I'll get back to you. Cheers, and thank you for the
              reply.

              Hi,

              I did add you to TreeHouse if you'd like to play with the latest test
              build. But I'm afraid I'm still thinking how best to address this.

              I think the issue here is that Tidy converts accented characters to the
              UTF-8 equivalent instead of converting them to HTML entities. In some
              cases that's the preferred way to handle this because if a character
              encoding can handle a particular character then it could be represented
              in that encoding instead of defaulting to HTML entities (which some
              people prefer not to use). On the other hand, HTML entities would work
              regardless of the selected encoding and some people, including myself,
              prefer HTML entities.

              So... I'm not sure which should be the default behavior.

              If there were an easy way to select a bunch of text, or the whole
              document, and convert accented characters to HTML entities, would that
              help? This way you could apply HTML entities before running the
              document through Tidy, which Tidy would then preserve.

              Chami
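The pre-processing step suggested above could look something like the following Python sketch. This is not an HTML-Kit feature, and `to_entities` is a hypothetical name. It assumes the text is already decoded correctly, leaves ASCII (including markup) untouched, and falls back to numeric references for characters that have no HTML 4 entity name.

```python
from html.entities import codepoint2name


def to_entities(text):
    """Replace non-ASCII characters with named or numeric HTML entities."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                        # plain ASCII, keep as-is
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};") # named entity, e.g. &aacute;
        else:
            out.append(f"&#{cp};")                # no name known, use numeric form
    return "".join(out)


print(to_entities("<p>áéç</p>"))  # <p>&aacute;&eacute;&ccedil;</p>
```

The result contains only ASCII, so it displays the same whichever encoding the file is later read with, and Tidy's preserve-entities option mentioned earlier in the thread should then leave the entities alone.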

      • Donpastor, 3rd Nov 2012 3:29 pm

        That worked like a charm. Thanks!