Encoding Autodetection Fails With Large HTML String


Post by djrecipe » Tue Aug 24, 2021 12:13 am

First of all, I just want to point out that "Where to post what" (viewtopic.php?f=4&t=2) is out of date and should point to https://bitbucket.org/chromiumembedded/cef/issues instead.

Issue
Chrome's auto-detection of character encoding only checks the first X characters of the HTML string. If non-ASCII characters (Chinese text, in our case) first appear near the end of a long HTML string, they will not render properly in Chrome/CEF. The same characters render properly if the HTML is shortened.

Example
Given this 180,000-character HTML file: https://pastebin.com/LjtHdDs2 , if you open it in the Chrome browser or render it via CEF, the Chinese text at the end of the HTML string will be garbled. The Chinese characters will be rendered properly if any of the following is done:
1. Many of the <rect></rect> elements are removed, resulting in a shorter HTML string overall.
2. A single non-ASCII (e.g. Chinese) character is added somewhere towards the beginning of the HTML string.
3. <meta charset="utf-16"/> is added at the beginning of the file.

Theory
Chrome checks the first X characters of the HTML string to auto-detect the encoding. X appears to be somewhere around the unsigned short maximum (65535). If no special characters are found, it defaults to UTF-8 (?).
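
To check whether a given HTML string could run into this, one rough approach is to find where the first non-ASCII byte sits and compare that offset against the suspected window. The sketch below is only an illustration of the theory above: the function names are made up for this example, and the 65535 window is the guess from this post, not a value confirmed from the Chrome source.

```cpp
#include <cstddef>
#include <string>

// Returns the byte offset of the first non-ASCII byte in `html`,
// or std::string::npos if the string is pure ASCII.
std::size_t FirstNonAsciiOffset(const std::string& html) {
  for (std::size_t i = 0; i < html.size(); ++i) {
    if (static_cast<unsigned char>(html[i]) >= 0x80) {
      return i;
    }
  }
  return std::string::npos;
}

// Assumed sniffing window (the guess from the post, not a confirmed value).
constexpr std::size_t kAssumedSniffWindow = 65535;

// True if the string contains non-ASCII bytes, but only at or beyond the
// assumed window, i.e. where autodetection would presumably never see them.
bool MayBeMissedByAutodetection(const std::string& html,
                                std::size_t window = kAssumedSniffWindow) {
  const std::size_t offset = FirstNonAsciiOffset(html);
  return offset != std::string::npos && offset >= window;
}
```

A string for which MayBeMissedByAutodetection returns true is a candidate for an explicit encoding declaration before it is handed to the browser.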

I'm not sure this is really a "bug" per se, but it is somewhat strange behavior that occurs without any warning and can cause confusion for developers.

Re: Encoding Autodetection Fails With Large HTML String

Post by magreenblatt » Tue Aug 24, 2021 9:49 am

Your analysis is likely correct. Evaluating the whole contents of a large HTML file before parsing/rendering would be bad for performance. I suggest using one of the many available techniques for explicitly specifying the character encoding.
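
One such technique, sketched here under the assumption that the HTML string is already UTF-8 encoded, is to prepend a UTF-8 byte order mark (the bytes EF BB BF) before loading it, since HTML parsers treat a leading BOM as an explicit encoding declaration. The helper name WithUtf8Bom is hypothetical and not part of CEF. Other options include a <meta charset="utf-8"> tag near the top of the document or a charset in the Content-Type header.

```cpp
#include <string>

// UTF-8 byte order mark: the three bytes EF BB BF.
const char kUtf8Bom[] = "\xEF\xBB\xBF";

// Returns `html` with a UTF-8 BOM at the front, unless one is already
// present. Assumes `html` already holds UTF-8 encoded bytes; this does
// not convert from any other encoding.
std::string WithUtf8Bom(const std::string& html) {
  if (html.compare(0, 3, kUtf8Bom) == 0) {
    return html;  // Already starts with a BOM; leave it unchanged.
  }
  return std::string(kUtf8Bom) + html;
}
```

The returned string can then be passed to whatever loading path the application already uses.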

Re: Encoding Autodetection Fails With Large HTML String

Post by djrecipe » Thu Oct 21, 2021 6:50 am

Yeah, in the end we are treating it as something that needs to be communicated and worked around. Thanks for your input, @magreenblatt.

