
Encoding Autodetection Fails With Large HTML String

PostPosted: Tue Aug 24, 2021 12:13 am
by djrecipe
First of all, I just want to point out that "Where to post what" (viewtopic.php?f=4&t=2) is out of date and should point to https://bitbucket.org/chromiumembedded/cef/issues instead.

Issue
Chrome's auto-detection of character encoding only checks the first X characters of the HTML string. If there are UTF-16 characters at the end of a long HTML string, those characters will not render properly in Chrome/CEF. The same characters render properly if the HTML is shortened.

Example
Given this 180,000-character HTML file: https://pastebin.com/LjtHdDs2 , open it in the Chrome browser or render it via CEF, and the Chinese text at the end of the HTML string will be garbled. The Chinese characters render properly if any of the following is done:
1. Many of the <rect></rect> elements are removed, resulting in a shorter HTML string overall.
2. A single UTF-16 character is added somewhere near the beginning of the HTML string.
3. <meta charset="utf-16"/> is added at the beginning of the file.

Theory
Chrome checks the first X characters of the HTML string to autodetect the encoding. This number appears to be somewhere around the unsigned short max (65535). If no special characters are found in that window, it seems to default to UTF-8 (?).
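To illustrate the theory, here is a minimal sketch of prefix-limited encoding sniffing. This is not Chromium's actual implementation; the window size and the ASCII check are assumptions based on the behavior described above:

```python
# Hypothetical sketch of prefix-limited encoding detection (assumption: the
# sniffer only examines a fixed-size prefix of the document).
DETECTION_WINDOW = 65536  # roughly unsigned short max, per the theory above

def prefix_is_plain_ascii(html_bytes: bytes) -> bool:
    """Return True if the sampled prefix contains only ASCII bytes."""
    prefix = html_bytes[:DETECTION_WINDOW]
    return all(b < 0x80 for b in prefix)

# A long ASCII-only body with non-ASCII (Chinese) text past the window,
# mimicking the pastebin example above:
html = (
    b"<html><body>"
    + b"<rect></rect>" * 10000          # ~130,000 bytes of pure ASCII
    + "中文".encode("utf-8")             # multi-byte text beyond the window
    + b"</body></html>"
)

# The sniffer never sees the Chinese characters, so it would fall back to a
# default encoding for the whole page and garble the tail.
print(prefix_is_plain_ascii(html))  # True
```

This matches the observed workarounds: shortening the document or moving a non-ASCII character into the prefix changes what the sniffer sees.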

I'm not sure this is really a "bug" per se, but it is strange behavior that comes with no warning and can confuse developers.

Re: Encoding Autodetection Fails With Large HTML String

PostPosted: Tue Aug 24, 2021 9:49 am
by magreenblatt
Your analysis is likely correct. Evaluating the whole contents of a large HTML file before parsing/rendering would be bad for performance. I suggest using one of the many available techniques for explicitly specifying the character encoding.
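One such technique is declaring the encoding in the markup itself, so no autodetection is needed. A minimal sketch, assuming you control the HTML string before handing it to CEF (the helper name and the sample page are illustrative):

```python
# Prepend an explicit <meta charset> declaration so the browser never has
# to sniff the encoding. Hypothetical helper, not a CEF API.
def with_charset(html: str, charset: str = "utf-8") -> str:
    """Insert a <meta charset> tag right after <head>, or prepend one."""
    meta = f'<meta charset="{charset}"/>'
    if "<head>" in html:
        return html.replace("<head>", "<head>" + meta, 1)
    return meta + html

page = "<html><head></head><body>中文</body></html>"
print(with_charset(page))
```

Per the HTML spec, such a declaration should appear within the first 1024 bytes of the document, which also sidesteps the prefix-window issue described above.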

Re: Encoding Autodetection Fails With Large HTML String

PostPosted: Thu Oct 21, 2021 6:50 am
by djrecipe
Yeah, in the end we are treating it as something that needs to be communicated and worked around. Thanks for your input, @magreenblatt.