Causes and solutions for whitescreens + failure to update on release

Status

In progress

Author(s)

Joyce Sato-Reinhold

Updated

2023-12-13

Related Jira Ticket

https://lite-farm.atlassian.net/browse/LF-3319

Objective

The goal of this document is to delineate the various caching strategies in effect in our application, how they could each contribute to either whitescreen crashes or failure to see new code changes until users reload their page (or, in the worst cases, clear their local storage + cache), and a plan to address them.

Summary

I recommend the following:

Motivation

  • On each of our recent releases (3.3.0-3.5.1) and on every frontend code merge to beta, the application ends up in a state where one of the following occurs:

    • Old UI persists until tab reload and/or cache clear

    • Whitescreen due to TypeError and/or MIME Type Error until tab reload

    • Whitescreen TypeError due to missing Redux entities until reload or cache clear

  • These issues occur on releases to production despite our current strategy of logging users out of the application the next time they call the API (by invalidating all access tokens), which is already a fairly aggressive and disruptive user experience

  • As we move to more frequent releases, these issues will be encountered more frequently by our users, leading to more disruption

Causes of the current situation

The current caching strategy of our application can be considered at several levels:

Nginx

Nginx serves the actual index.html file. We can add headers instructing the browser not to cache this page (as we did in this PR, already live on production as of the 3.5.0 release).

This PR did not have any discernible effect, however, presumably because index.html is served by the pwa service worker, and not from the network:

Further investigation into vite-pwa suggests that our changes to /index.html were actually in the right direction, but that the same treatment was needed for the other pwa-essential files as well (most particularly the service worker file itself, which controls the cache on /index.html).

 

The complete nginx configuration required by vite-pwa is described here: Vite PWA Documentation | Deployment | Nginx, and was implemented in
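The caching policy discussed in this section can be summarized as a lookup from request path to Cache-Control value. The following is a minimal sketch in TypeScript rather than nginx syntax, so it only illustrates the intent of the rules. The function name and constants are hypothetical, and the long-lived "immutable" value for hashed assets is an assumption based on the general approach in the Vite PWA deployment docs; the real values should be taken from the nginx configuration itself.

```typescript
// Hypothetical helper: mirrors the intent of the nginx rules, not their syntax.

// Files that must be revalidated on every request so a release is detected promptly.
const REVALIDATE = "public, max-age=0, must-revalidate";

// Assumption: content-hashed files can be cached long-term because their
// file names change whenever their content changes.
const IMMUTABLE = "public, max-age=31536000, immutable";

function cacheControlFor(path: string): string {
  // Vite emits content-hashed bundles into /assets/.
  if (path.startsWith("/assets/")) {
    return IMMUTABLE;
  }
  // Everything else (index.html, sw.js, registerSW.js, manifest.webmanifest,
  // un-hashed css) is revalidated against the server before being reused.
  return REVALIDATE;
}
```

The key point is that sw.js and index.html fall into the revalidated group: if either is served from a stale HTTP cache, the browser has no way to learn that a new build exists.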

Vite-PWA

The main entity responsible for the caching behaviour of the app is vite-pwa. This plugin generates, upon vite build, a service worker file (sw.js) which requests the caching of the assets in the current Vite build:

sw.js; generated in dist/ folder upon running vite build

/**
 * The precacheAndRoute() method efficiently caches and responds to
 * requests for URLs in the manifest.
 * See https://goo.gl/S9QRab
 */
workbox.precacheAndRoute([
  { "url": "assets/_commonjs-dynamic-modules-302442b1.js", "revision": null },
  { "url": "assets/AbandonManagementPlan-7fea6bc6.js", "revision": null },
  { "url": "assets/AddCropVariety-776d788b.js", "revision": null },
  { "url": "assets/AddCropVariety-9ab8909b.css", "revision": null },
  // many more assets
  { "url": "assets/withTranslation-eca24d6f.js", "revision": null },
  { "url": "css/react-table.css", "revision": "b430ff9111421527c34dc69d711f8fc3" },
  { "url": "css/ReactWeather.css", "revision": "f59bc68e3c7e408c25c7528eed3c2999" },
  { "url": "global.css", "revision": "a09e80b366711b0fb7006f4b0feba05e" },
  { "url": "index.html", "revision": "81668a27b09daad4bfb1fcc4bf9c4e26" },
  { "url": "registerSW.js", "revision": "1872c500de691dce40960bb85481de07" },
  { "url": "manifest.webmanifest", "revision": "c457d54bd7f9f8099ead82b74418837a" }
], {});

workbox.cleanupOutdatedCaches();

workbox.registerRoute(new workbox.NavigationRoute(workbox.createHandlerBoundToURL("index.html")));


These assets are stored in Cache Storage within the workbox-precache object, viewable from the Application tab of Chrome Dev Tools:

Which service worker is installed and active determines what code the user is seeing (if they have a service worker cache built already – a first time user would have no cache and would request all of these assets from the network).
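To make the role of the "revision" field in the manifest above more concrete, here is a minimal sketch of how two precache manifests could be compared. It is a simplified model of Workbox's behaviour as I understand it (hashed URLs with a null revision are keyed by URL alone, other URLs by URL plus revision), not Workbox's actual code. The names PrecacheEntry, cacheKey and diffManifests are hypothetical.

```typescript
interface PrecacheEntry {
  url: string;
  revision: string | null;
}

interface ManifestDiff {
  toFetch: string[]; // cache keys the new service worker must download on install
  toDelete: string[]; // cache keys that become outdated once it activates
}

// Simplified cache-key rule: a null revision means the hash is already in the
// file name, so the URL alone identifies the content.
function cacheKey(entry: PrecacheEntry): string {
  return entry.revision === null
    ? entry.url
    : `${entry.url}?__WB_REVISION__=${entry.revision}`;
}

function diffManifests(
  oldEntries: PrecacheEntry[],
  newEntries: PrecacheEntry[],
): ManifestDiff {
  const oldKeys = oldEntries.map(cacheKey);
  const newKeys = newEntries.map(cacheKey);
  return {
    toFetch: newKeys.filter((key) => oldKeys.indexOf(key) === -1),
    toDelete: oldKeys.filter((key) => newKeys.indexOf(key) === -1),
  };
}
```

Under this model, any change to a lazy-loaded route produces a new hashed file name, so the old file ends up in toDelete and the new one in toFetch. A page still running the old code will keep asking for the old file name.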

Whitescreen MIME Type Error crash

The (former) whitescreen MIME Type Error crash (see video #1 below) we were seeing after the new service worker installs is specific to the incorrect configuration of the pwa service worker and nginx, see .


See also:

  • (lower down in thread)

According to those issue threads, this should be fixed by altering our nginx config as described.

  • Update: it was indeed fixed in


Video #1 (MIME type crash) - status: resolved

Timing of Reload

Once registering the new service worker no longer crashes the application as above, the next step is to work on the timing of the new service worker 1) being detected, and 2) taking control of the application. Some threads to follow and test here:

  • The recommended nginx configuration for vite-pwa implemented in adds the Cache-Control header "public, max-age=0, must-revalidate" (with the always flag) to the sw.js file. This means the browser should always check the file for updates before reusing a cached copy, so a new service worker should be detected as soon as the code change goes live.

    • Update: confirmed, and observable in video #4 below

  • (Update: completed in LF-3923) Test the results of using the registerSW({ immediate: true }) option in the root of the application to force page reload, see Vite PWA | Guide | Automatic Reload. My understanding is that this will change when the service worker takes effect (not when it begins installing)

    • Update: confirmed, and observable in videos #2 and #3 below. Service worker install requires a reload or tab close in both configurations


Here is the timecourse of update with registerSW({ immediate: true }) applied, current on beta as of merge of



Video #2 (automatic page reload) - status: this is beta’s current behaviour and patch 3.5.2 candidate

In this video:

  • I close the tab and re-open beta to begin installing the new service worker (#1000)

  • While the service worker installs, the old code (here, it is the transparent header) is still active

  • As soon as #1000 finishes installing (at 0:19) there is an automatic page reload, and the new code (white background header) is visible immediately

Delaying cache/code update

Update: in Standup on December 12, 2023 we looked at both video #2 and video #3 (below) and the group preferred including automatic page reload (video #2). But keeping these notes as a reference:

  • @Duncan Brain made a point about form submission here that reminded me that, by default, a service worker keeps control of the page until the user has navigated off, to allow users to continue whatever they were doing at the time.

  • With the way we currently release, this is completely pointless (users are logged off when they submit the form anyway), but if we want to consider not logging users off, this might be a gentler and better user experience. In a nutshell, it would delay code update by one visit/refresh, i.e.

    • User’s first fresh visit after release: service worker installs but does not take control; no code change is visible and the old cached app is used until visit concludes (the tab is closed) or page is manually refreshed

    • User’s second fresh visit after release: new service worker takes control; code change is visible

  • To implement, we would:

    • Not invalidate API tokens

    • Keep autoUpdate, but not automatic page reload

    • Be fairly careful (this would be on us, and probably would be the most challenging part of going this route) to not invalidate API endpoints (we would need a little bit of backwards-compatibility, in other words) so that the old cached frontend could continue to communicate with the updated backend
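The difference between the two configurations (automatic page reload versus the delayed, two-visit update) can be captured in a small toy model. This is only a sketch of the behaviour observed in the videos, under the simplifying assumption that a visit lasts long enough for the install to complete; it does not call any service worker API, and the names ClientState and visit are hypothetical.

```typescript
interface ClientState {
  activeVersion: string; // version of the service worker controlling the page
  waitingVersion: string | null; // installed, but not yet in control
}

interface VisitResult {
  seen: string; // version of the code visible at the end of the visit
  next: ClientState; // state carried over to the next visit
}

// One "visit" runs from opening (or manually refreshing) the tab until it is
// closed or refreshed again.
function visit(
  state: ClientState,
  deployedVersion: string,
  autoReload: boolean,
): VisitResult {
  // A worker left waiting by the previous visit takes control on a fresh visit.
  let active = state.waitingVersion ?? state.activeVersion;
  let waiting: string | null = null;

  if (deployedVersion !== active) {
    if (autoReload) {
      // Automatic page reload: the page reloads as soon as install completes.
      active = deployedVersion;
    } else {
      // No reload: the new worker installs but waits; old code stays visible.
      waiting = deployedVersion;
    }
  }
  return { seen: active, next: { activeVersion: active, waitingVersion: waiting } };
}
```

With autoReload set to false, a user on the old version sees the old code for the whole first visit after a release and the new code on the second, which is why the backend would need to stay compatible with the previous frontend for at least one visit.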

 

Video #3 (autoUpdate without automatic page reload) - status: neither on beta nor production

Re-create at any point by removing registerSW({ immediate: true }) from application root

Notes:

  • Video is demonstrating that neither navigating around the application nor logging out triggers the new service worker to begin installing

  • At 0:29 I close the tab (this could have been a manual page reload instead); new service worker begins installing

  • Service worker installation completes at 1:00, but there is no refresh and the old code is still active on the page

  • Once installed, logging out also does not reflect the code change (which would have been the header gradient disappearing from the farm selection page here)

  • [Not shown in video but confirmed in previous merge shown below]; a page refresh (or closing the tab a second time) would have shown the code change:

Video #3b (exact same config as above) - status: same as above

  • This is a demo of when the code change does take effect after service worker install completes

  • The new code is visible immediately upon manual refresh (0:07); closing the tab would have also worked

 

User prompting

  • In a brief team discussion, prompting the user for update was considered the best UX, and you can often see it in action when navigating the Vite PWA docs (it’s very snazzy!)

  • However, the docs also carry this warning, and it should be noted our current implementation is autoUpdate:

  • I’m also not 100% sure how changing this behaviour would interact with releasing (do we have to do a pre-release patch updating just the service worker behaviour, and what happens to users who don’t interact with the application between patch and release?). It may be safer to stick with the current setting and try to work within it by, e.g., adding automatic page reload and better detection
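For reference, a prompt-based update could look roughly like the sketch below. It assumes the registerSW function from vite-pwa's virtual:pwa-register module, which accepts an onNeedRefresh callback and returns an update function; here it is passed in as a parameter so the sketch stays self-contained. The names setupUpdatePrompt and askUser are hypothetical, and as far as I understand the docs this flow would require moving away from our current autoUpdate setting.

```typescript
// Subset of the options accepted by registerSW (only the fields used here).
interface RegisterSWOptions {
  immediate?: boolean;
  onNeedRefresh?: () => void;
}

// registerSW returns a function that activates the waiting worker and,
// when called with true, reloads the page.
type UpdateSW = (reloadPage?: boolean) => Promise<void>;
type RegisterSW = (options?: RegisterSWOptions) => UpdateSW;

function setupUpdatePrompt(registerSW: RegisterSW, askUser: () => boolean): void {
  const updateSW: UpdateSW = registerSW({
    // Called once a new service worker has installed and is waiting.
    onNeedRefresh: () => {
      if (askUser()) {
        // The user accepted: activate the new worker and reload the page.
        void updateSW(true);
      }
      // Otherwise the old code keeps running until the next visit.
    },
  });
}
```

In the application this might be wired up as setupUpdatePrompt(registerSW, () => window.confirm("A new version is available. Reload now?")), with a nicer UI component replacing the confirm dialog.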

 

TypeError: Failed to fetch dynamically imported module

This is an error caused by Vite’s handling of assets, and the fact that each lazy loaded path in our Routes file (that is, every one of them) is a dynamically loaded module and so its own JavaScript file. It is an open issue on the Vite repo, and an active one.

From my understanding of that discussion, there are several main ways devs have been dealing with this:

  1. An error boundary around the app that listens for this error and reloads the page when detected

  2. Maintaining the previous deployment's /assets alongside the new ones. Helpful description here in this issue comment.

  3. Via Service Worker Detection; this might be the same as what we already have with automatic page reload (not 100% sure on this one because the code snippet doesn’t seem to do all of what the issue comment describes). What I consider the ideal implementation is automatic refresh upon new SW detection = at the point of code merge.

  4. Removing lazy loading altogether. This means all paths go into the main.js bundle

 

Of these, I would prefer #1 or #3. If we stay on a slow release cycle, #2 might not be that bad either.
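As an illustration of option #1, here is a minimal sketch of the logic an error boundary could delegate to: detect the dynamic-import failure and reload the page once, with a guard against reload loops. The error message patterns are assumptions (the wording differs between browsers and the list may be incomplete), and the names isChunkLoadError, reloadOnceOnChunkError and RELOAD_FLAG are hypothetical. The storage and reload function are passed in so the sketch does not depend on browser globals.

```typescript
// Assumed error messages for a failed dynamic import, per browser family.
const CHUNK_ERROR_PATTERNS: RegExp[] = [
  /Failed to fetch dynamically imported module/i, // Chromium
  /error loading dynamically imported module/i, // Firefox
  /Importing a module script failed/i, // Safari
];

function isChunkLoadError(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return CHUNK_ERROR_PATTERNS.some((pattern) => pattern.test(message));
}

// Minimal subset of the Web Storage interface (e.g. sessionStorage).
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const RELOAD_FLAG = "chunk-error-reloaded"; // hypothetical storage key

// Returns true if a reload was triggered.
function reloadOnceOnChunkError(
  error: unknown,
  store: KeyValueStore,
  reload: () => void,
): boolean {
  if (!isChunkLoadError(error)) {
    return false; // unrelated error: let the normal error UI handle it
  }
  if (store.getItem(RELOAD_FLAG) === "1") {
    return false; // already reloaded once; avoid an infinite reload loop
  }
  store.setItem(RELOAD_FLAG, "1");
  reload();
  return true;
}
```

In the app, this could be called from an error boundary's componentDidCatch with sessionStorage and () => window.location.reload(). The flag would need to be cleared after a successful load so that a later release can trigger a reload again.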

 

Scope of currently affected users

Here are the Sentry logs of this error for the last 90 days

(It’s actually QUITE a few users, and a bit of IP address search confirms it is definitely not just LiteFarm core team)

Here is another view with a 90d timecourse (you can see the spike after each release). Unfortunately this is the longest time period available, so we don’t have any data from right after the August release.

Why does it happen?

My best guess is that this happens anytime an out of date resource is requested from the network either because 1) the pwa cache is incomplete, or 2) there is no active service worker. It requires an out-of-date version of the main application to be in memory, and therefore will always be cleared by a page reload.

Additionally, it could NOT occur for a first-time visitor (i.e. one whose first visit is after the most recent release). And I cannot re-create it (even with lots of navigating during all parts of the new service worker install) in a browser that has a complete PWA cache already. Therefore I’m a bit surprised that both @Sayaka Ono and @Duncan Brain saw this live last week as well (was there any possibility you had recently cleared your cache or had the network throttled?) and I still find that a bit worrisome…!

 

These errors absolutely happen days or even weeks after release, so being active on the app at the time of update is not necessary; I guess any user who has loaded a webapp but doesn’t have the full cache of all routes (maybe they weren’t on app long enough for all the files to transfer?) will crash like this if they navigate before the service worker installs, disable their service worker, or aren’t using service workers (I don’t know how many users or browser environments would disable this functionality).

Reproduction:


This error is easy to re-create with Slow 3G (@Anto Sgarlatta sorry I mistakenly said that I don’t think network speed impacts service worker install, but that is totally wrong, because the service worker install completes when all the resources have been fetched and cached! So by throttling the network speed it can be prolonged almost indefinitely).

You can also cause it instantly by unregistering the active service worker if you have not reloaded since an update.


Video #4 (Failed to fetch crash) - Status: current issue on both beta and app

In this video:

  • No service worker has ever become active

  • (As an aside, you can see the original service worker halt install as soon as the new webapp Docker container goes up! The detection is actually instant!)

  • Because the webapp has already been loaded into browser memory (with all its references to old js files), as soon as the new webapp container is live, those old assets are non-existent and all routing fails

  • The PWA would have helped this situation by keeping the old files in Cache storage even when they were not on the server, but in this case (due to slow network), the PWA never finished caching the resources, so they were simply never available to this user

Video #5 (Failed to fetch crash reproduction while testing patch) - December 13, 2023 – status: would only occur in this manner until one user page refresh

This story is a little bit different than in Video #4 because a service worker is installed. Also, LiteFarm’s configuration is now the same as the code in Videos #3 and #3b above, but this is the PR that is implementing the nginx changes, as opposed to one PR afterwards. In other words, the nginx changes are most likely not applicable to the currently served pwa files, which were requested before the container was rebuilt and are now in browser memory.

  • An old service worker had been active (as opposed to absent as in Video #4)

  • We are looking at a (mostly) cached version of the webpage

    • This is an aside, but an important one for explaining what we saw today: translations are not part of the cache (because in the built webapp they are in /locales not in /dist) and are always network requested.

  • Installation finishes and navigation crashes

    • If this were reproducible, with this timing, with the correct nginx configuration already applied before the vite build, I would be tempted to blame the service worker lifecycle cleaning up the outdated cache too soon

    • However, because there is no such crash in Videos #3 and #3b (the cached webapp is entirely stable to navigate, even when the next service worker is installed), I think this crash is still on us and our faulty nginx config


This crash could be prevented by:

  • The updated nginx config, which will presumably take effect the next time sw.js, the webmanifest, or the never-expiring assets are requested. That timing is not under our control, because our previous nginx config did not specify a cache lifetime for these files

  • Refreshing after install (before navigation), as happens with the registerSW change in root

  • Error boundary around routing

But because all of these changes require new frontend code, the only way to prevent it entirely for the build that implements any/all of these is to serve old assets for a period of time.

Redux + Redux-Persist

The persist:root data object created and kept in local storage by redux-persist was originally considered the root of the whitescreen crashes (see comments of ). However, during the forced logout of our current release process, this object is completely purged from local storage:

export const logout = () => {
  localStorage.removeItem('id_token');
  purgeState(); // this function calls persistor.purge()
  return history.push('/');
};

 

The persist:root is then regenerated (on the homepage, before login) in the shape defined by the initialState logic in the current version of the Redux store, but empty of data. Therefore:

  • If a cached version of store.js is used, the shape of the newly regenerated persist:root will still follow the old store definition, see . This means that handling application caching via vite-pwa is also responsible for getting the right Redux store shape.

  • Purging persist:root does NOT clear the Redux store – only the persisted store data. We may wish to also reset the Redux store (). However, I think this will be handled by automatic page reload if we go that route with vite-pwa, in which case it would not be necessary.

  • ONLY IF we want to move off the current logout + purge strategy, we may have to start writing redux-persist migrations. The purpose of migrations is to combine old persisted data with new persisted data. (Edit: that’s too strong a point. We could potentially remove logout and still purge the store, so I think there are other options here. Also, the timing of the logout and purge is not effective right now anyway.)

  • Beginning to write redux-persist migrations would create a fair amount of development overhead. The documentation is basically non-existent and a maintainer code example demonstrates that the migrations can get extraordinarily extensive over time, creating more boilerplate for our Redux code.

  • Therefore, I think we should continue purging the persist:root on update as we do now
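If we did want to reset the in-memory Redux store in addition to purging persist:root, one common pattern is to wrap the root reducer so that a dedicated action discards the current state. The sketch below is written without importing Redux so that it stays self-contained; the action type RESET_STORE and the helper withReset are hypothetical names, not something that exists in our store today.

```typescript
interface AnyAction {
  type: string;
}

// Same shape as a Redux reducer: undefined state means "use initial state".
type Reducer<S> = (state: S | undefined, action: AnyAction) => S;

const RESET_STORE = "RESET_STORE"; // hypothetical action type

// Wraps a root reducer so that dispatching RESET_STORE rebuilds the state
// from each slice's initial state.
function withReset<S>(rootReducer: Reducer<S>): Reducer<S> {
  return (state, action) =>
    rootReducer(action.type === RESET_STORE ? undefined : state, action);
}
```

Dispatching { type: RESET_STORE } during logout, next to the existing purge, would leave both the persisted and in-memory state empty. As noted above, an automatic page reload achieves the same result, so this would only matter if we drop the reload.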

But in any case, adjusting the handling of our Redux store state specifically should come second to the handling of the cached assets (= new code). Once the MIME type errors are resolved and the code is updating at the correct time, we can see if there are any redux errors that remain – there very well might not be!