A potential CEF cache corruption scenario

Posted: Fri Jul 03, 2020 11:14 pm
by chintansolanki
We have a .NET WPF container app in which we host several web apps using the CEFSharp.WinForms control. At times, for some users, some JavaScript resource requests fail with the ERR_CONTENT_DECODING_FAILED error. The issue is resolved if we reload the app after either clearing the CEF cache or disabling the cache from the network tab in the developer tools window. Please note that this issue isn't confined to a specific subset of resource files: we have seen it happen sporadically for a variety of JavaScript resource files (some hosted on Apache, others on IIS servers).

While the usual cause of an ERR_CONTENT_DECODING_FAILED error is a server-side content-encoding issue, in this specific case we believe it could be related to CEF browser caching. Please see the analysis section below for our reasons.

Application Setup

When we initialize the CEF settings, we set the MultiThreadedMessageLoop setting to true and set the CachePath property to a location under %localappdata% on a Windows 10 machine. When the container app starts, it creates three CEF web browser controls and launches web apps in them. All three apps load concurrently. After that, more CEF web browsers are created as the user visits more apps. The user also reloads some of these apps over time. All the web apps are internal apps sharing the same domain but physically hosted on different web servers. The JavaScript resource files in question usually have a caching policy that allows them to be cached for a week.

CEFSharp version - 79.1.360.0
CEF version - r79.1.36+g90301bd+chromium-79.0.3945.130
Chromium version - 79.0.3945.130

Our Analysis so far

    1. We checked the web-server logs for the failing JavaScript resources. In most cases, the impacted user's last server requests for those resource files had been made a few days before the failure. The users are usually able to use the application without problems for some days before they sporadically start getting this error.

    2. We checked the network logs (*.HAR file). For the failing JavaScript resource, _transferSize is 0, which seems to indicate that the response was served from the cache (as indicated here and here).

    3. When the error occurs, it gets resolved when we reload the app after either clearing the cache or disabling the cache from the network tab.

    4. We tried simulating this error artificially. We used Fiddler's AutoResponder feature to deliberately return a bad server response (the content was gzip-encoded, but we changed the Content-Encoding header to indicate 'br'). This reproduced the ERR_CONTENT_DECODING_FAILED error. In the network logs, _transferSize was a non-zero value. We also observed that Chrome did not cache the bad response (when we turned off the AutoResponder, it made a fresh server request). This test indicates that when the original JavaScript response was cached by the browser, it must have been a correctly encoded response, or else the browser would not have cached it.
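
The HAR check in point 2 can be automated. The following is only a sketch: it assumes that `_transferSize` and `_error` are Chrome-specific extension fields stored on each entry's `response` object, and the function name and the sample data are made up for illustration.

```python
import json


def cache_served_entries(har: dict) -> list:
    """List requests whose HAR entry reports a transfer size of zero."""
    hits = []
    for entry in har.get("log", {}).get("entries", []):
        response = entry.get("response", {})
        # Assumption: "_transferSize" and "_error" are non-standard fields
        # that the browser writes on the response object.
        if response.get("_transferSize") == 0:
            hits.append({
                "url": entry.get("request", {}).get("url"),
                "status": response.get("status"),
                "error": response.get("_error"),
            })
    return hits


# Hand-written sample standing in for json.load() on a real *.har file.
sample_har = json.loads("""
{"log": {"entries": [
  {"request": {"url": "https://apps.example/a.js"},
   "response": {"status": 200, "_transferSize": 0,
                "_error": "net::ERR_CONTENT_DECODING_FAILED"}},
  {"request": {"url": "https://apps.example/b.js"},
   "response": {"status": 200, "_transferSize": 5120}}
]}}
""")

for hit in cache_served_entries(sample_har):
    print(hit["url"], hit["error"])
```

Entries with a zero transfer size that also carry an error are the ones worth comparing against the web-server logs.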

All of the above points lead us to believe that the JavaScript resource files were downloaded (with correct encoding) and cached in the CEF cache. The user was also able to use the apps for some time. After that, however, in certain scenarios, some of these files potentially got corrupted in the CEF cache, leading to the content decoding error.
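
To illustrate why a body that was valid when it was stored could later fail to decode, here is a minimal sketch using Python's standard gzip module. It is not how CEF reads its cache; it only shows that a single damaged byte or a truncated stream is enough to make decoding fail. The helper name is hypothetical.

```python
import gzip
import zlib


def decodes_as_gzip(body: bytes) -> bool:
    """Return True if body is a complete, checksum-valid gzip stream."""
    try:
        gzip.decompress(body)
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers bad headers and CRC mismatches, EOFError covers
        # truncation, zlib.error covers an invalid deflate stream.
        return False


original = gzip.compress(b"console.log('hello from a cached script');" * 50)

# Simulate on-disk damage: flip one byte of the CRC32 field in the trailer.
damaged = bytearray(original)
damaged[-8] ^= 0xFF
damaged = bytes(damaged)

# Simulate an incomplete write: drop the last 20 bytes.
truncated = original[:-20]

print(decodes_as_gzip(original), decodes_as_gzip(damaged), decodes_as_gzip(truncated))
```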

We tried using the CEF response filter mechanism, as explained here, to capture the bad response when the content decoding error occurs. Unfortunately, the dataIn stream passed to the filter function is null when the response fails with this error.

Summary and Questions

This is a sporadic issue that our users are facing. We haven't found a way to reproduce it deterministically. However, based on our analysis so far, we believe some JavaScript files may be getting corrupted in the CEF cache over time. We are not sure whether hosting several CEF web browsers and loading them concurrently plays a role in causing this issue.

Has anyone else observed/reported a similar issue? Do you have any idea if we are missing or overlooking something here or going in the wrong direction? Any pointers will be greatly appreciated.

Re: A potential CEF cache corruption scenario

Posted: Sat Jul 04, 2020 11:31 am
by magreenblatt
Do you run multiple concurrent instances of the main application process sharing the same cache_path? If so, that will cause corruption. Only a single main application process can use a given cache_path at a time.

Re: A potential CEF cache corruption scenario

Posted: Sat Jul 04, 2020 2:17 pm
by HarmlessDave
Just to add to that, if you do not enforce a single instance for your application then you can set the cache path manually for each instance.
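
As a language-neutral illustration of the "one process per cache path" rule, the sketch below claims a directory with an exclusive lock file before using it. This is not a CEF or CefSharp API: the function name and the lock file name are hypothetical, and a real implementation would also need to handle stale locks left behind by a crashed process.

```python
import os
import tempfile


def claim_cache_dir(cache_dir: str) -> bool:
    """Try to claim cache_dir for this process; False if already claimed."""
    os.makedirs(cache_dir, exist_ok=True)
    lock_path = os.path.join(cache_dir, "owner.lock")  # hypothetical name
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for diagnostics
    return True


demo_dir = os.path.join(tempfile.mkdtemp(), "cache")
first = claim_cache_dir(demo_dir)   # first claimant succeeds
second = claim_cache_dir(demo_dir)  # second claimant is refused
print(first, second)  # True False
```

A process that fails to claim the directory would pick a different cache path instead of sharing the one already in use.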

Re: A potential CEF cache corruption scenario

Posted: Mon Jul 06, 2020 2:45 am
by chintansolanki
The .NET container is multi-process. Each process (or instance) has a unique instance name, and it initializes the CEF settings class exactly once with a unique CachePath value. The CachePath value is as follows.


Each process thus has a unique CachePath location (though they all share a common parent directory). Each process then loads several CEF web browsers, some of which load concurrently when that process is launched. Is there a potential for cache corruption in this scenario?

Re: A potential CEF cache corruption scenario

Posted: Mon Jul 06, 2020 9:32 am
by magreenblatt
That sounds fine. Have you tried loading the web app in multiple Google Chrome windows/tabs to see if the problem reproduces there?

Re: A potential CEF cache corruption scenario

Posted: Mon Jul 06, 2020 3:20 pm
by amaitland
I'd suggest trying a newer version; 79 is quite old now. Version 83 is the current supported version.

Re: A potential CEF cache corruption scenario

Posted: Tue Jul 07, 2020 11:26 am
by chintansolanki
Thanks for the replies.

We have been trying to reproduce this issue in our container app using a crude automation test. We launch a process, load several apps in CEF browsers (to simulate concurrent loading), wait for a few seconds, and refresh all the apps together. The test performs several hundred such iterations. We have run several such tests on different test machines, but we have not been able to reproduce the issue this way. So far only our actual end users have experienced and reported it.

We do plan to update to the latest CEF version sometime in September. However, please note that we experienced this issue on an older version of CEF (v63) as well. Is anyone aware of a specific fix in v83 that might address this issue?

One thing I may have missed stating in my original post is that users may shut down or restart the container app in between. In practice, the container app may also crash at times. I was reading this post, which seems to suggest that abnormal process termination can leave the CEF cache in a corrupt state. This seems different from our scenario, as in our case only a few specific JavaScript files get corrupted as opposed to the entire cache. I was, however, wondering whether abnormal process termination can in any way lead to this specific error. Does this issue seem internal to CEF? Or could other external factors (antivirus, for example) be causing the corruption? Just throwing out a few things here :)

Re: A potential CEF cache corruption scenario

Posted: Tue Jul 07, 2020 12:04 pm
by magreenblatt
How many users are reporting the problem? If it’s only one or a few users it could also be a hardware issue (hdd going bad, for example).

Re: A potential CEF cache corruption scenario

Posted: Wed Jul 08, 2020 9:55 am
by chintansolanki
A hardware issue seems less likely for a few reasons. While this issue impacts a small percentage of the user population, it is happening for random users. None of the users use physical machines; they use virtual machines hosted in data centers. Also, before we started using CEF, we used the .NET WPF web browser control (which internally uses the IE engine) for hosting web apps, and we never encountered such cache corruption issues.

While we continue to come up with a theory that can explain what might be causing this, I was also thinking about the following two questions.

1. If we are able to get hold of the 'corrupted' CEF cache folder, is there a way to analyze the cached files and try to figure out what might have gone wrong? Are there any other things we should look for to confirm signs of corruption? If we simply replace our local machine's CEF cache folder with that corrupted folder, is there a chance we might be able to reproduce the same error? Does the Chrome browser use the same cache structure? If yes, can we replace the Chrome browser cache folder with this corrupted cache folder and try to reproduce the issue in Chrome? I will try to replace my local CEF cache with a corrupt cache folder and report whether I am able to reproduce the issue.

2. When we reproduced the issue artificially using Fiddler's AutoResponder feature, we were able to detect this error by overriding the 'OnResourceLoadComplete' method in a custom ResourceRequestHandler and checking the response.ErrorCode property for the value 'CefErrorCode.ContentDecodingFailed'. Once we detect this error, is there an elegant way to handle or resolve it (for instance, by reinitiating the request and bypassing the CEF cache)?
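
The recovery idea in question 2 (detect the decoding failure, then retry while bypassing the cache) can be sketched independently of CEF. In the sketch below a plain dict stands in for the cache and a callable stands in for the network request; none of these names are CefSharp APIs, and how to trigger the equivalent refetch from a ResourceRequestHandler is left open.

```python
import gzip
import zlib


def load_script(url: str, cache: dict, fetch) -> str:
    """Return decoded script text, refetching once if the cached copy is bad."""
    body = cache.get(url)
    if body is not None:
        try:
            return gzip.decompress(body).decode("utf-8")
        except (OSError, EOFError, zlib.error):
            # The cached copy cannot be decoded: evict it and go to the network.
            del cache[url]
    body = fetch(url)
    # A failure here would point at the server response, not the cache.
    text = gzip.decompress(body).decode("utf-8")
    cache[url] = body
    return text


good_body = gzip.compress(b"console.log('ok');")
bad_body = bytearray(good_body)
bad_body[-8] ^= 0xFF  # damage the CRC32 field to mimic a corrupted entry

fetch_calls = []


def fake_fetch(url: str) -> bytes:
    """Hypothetical stand-in for a fresh, cache-bypassing request."""
    fetch_calls.append(url)
    return good_body


cache = {"https://apps.example/a.js": bytes(bad_body)}
script = load_script("https://apps.example/a.js", cache, fake_fetch)
print(script, len(fetch_calls))
```

Limiting the recovery to a single refetch keeps a genuinely broken server response from causing a retry loop.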

Re: A potential CEF cache corruption scenario

Posted: Tue Jul 14, 2020 12:08 pm
by chintansolanki
An update: one of our users recently reported the issue. We copied the CEF cache folder from the user's machine to ours and could reproduce the same issue. This confirms that the JavaScript file in the cache was indeed corrupted.