Prod server deployment failure due to hung migration (Release 3.7.7: June 18, 2025)
What Happened:
On June 18, 2025, ~11:45 am PT, deployment to prod failed on a hung migration.
We reviewed the litefarm-api logs, and it looked as though the migration was hung at the stage where it should lock the database for migration
The database was manually checked for its lock status, which showed 0: not locked
Attempts to restart the litefarm-api container were unsuccessful
Attempts to use knex cli tools to unlock or re-migrate were unsuccessful and resulted in the same error from the logs
AI was used to determine there may be a hung process
Attempts to terminate the process were unsuccessful, likely due to the container using npm run migrate:dev:db as an entrypoint to the docker container
Riddhi from the data team showed up to ask if the database was down
Prod was down for approximately 1 hour 45 minutes
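The mismatch noted above, a lock flag of 0 while the migration still stalled at the locking step, is consistent with (though does not prove) the migration's session waiting on a database-level lock held by another session. A minimal sketch of the two checks a responder could run, assuming PostgreSQL 9.6+ and knex's default lock table name knex_migrations_lock. triageHungMigration is a hypothetical helper, not part of knex or LiteFarm:

```javascript
// Sketch only: assumes PostgreSQL 9.6+ and knex's default lock table name.

// 1. What knex itself believes about the migration lock.
const LOCK_STATE_SQL = 'SELECT is_locked FROM knex_migrations_lock';

// 2. Sessions that are currently waiting on a lock held by another session.
const BLOCKED_SESSIONS_SQL = `
  SELECT pid, state, wait_event_type, query_start,
         pg_blocking_pids(pid) AS blocked_by, query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0`;

// Hypothetical helper: turns the two result sets into a suggested next step.
function triageHungMigration(lockRows, blockedSessions) {
  const knexLocked = lockRows.some((row) => Number(row.is_locked) === 1);

  if (blockedSessions.length > 0) {
    // Collect the distinct pids that are holding other sessions up.
    const blockers = [
      ...new Set(blockedSessions.flatMap((s) => s.blocked_by || [])),
    ];
    return { cause: 'blocked-by-session', blockers };
  }
  if (knexLocked) {
    // Flag left behind by an earlier run; the knex unlock command targets this.
    return { cause: 'stale-knex-lock', blockers: [] };
  }
  return { cause: 'unknown', blockers: [] };
}
```

The rows would come from running the two statements through psql or knex.raw. Any pids listed in blockers can then be looked up in pg_stat_activity before deciding whether to cancel them.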
How We Resolved:
The hung migration was resolved. The exact reason is unknown, as two corrective actions were taken simultaneously by different groups:
Joyce and team on the troubleshooting call manually refactored the Dockerfile to change the entrypoint so that the migration process could be killed
Anto restarted the litefarm-db container and used the previously tried knex unlock command
Lessons Learned:
ROOT CAUSE - coordinate with data team before updating prod
The suspicion, not confirmed, is that large data operations from the data team may have interfered with the migration process
If there are only two developers and one has a hard stop, we should release earlier in the day, as troubleshooting alone can be stressful
Easy troubleshooting steps are often overlooked in favour of finding the solution. Let's remember to go slow while troubleshooting and follow the basic steps. Basic steps help 90% of the time.
If we get help from others async, let's ask them to coordinate so that corrective actions are not taken simultaneously, which could cause further issues or leave the outcome inconclusive
Action Items:
Add release checklist item for "Coordinate with data team before releasing"
Consider adding troubleshooting dropdown subsection to release checklist (see dropdown below):
Consider if the Dockerfile ENTRYPOINT could be improved
Look into the “Under maintenance” screen we used to serve when updating certificates – can we use this for outages?
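One possible direction for the ENTRYPOINT item above is to give the migration step a time budget so that a hung migration fails fast instead of holding the container indefinitely. A minimal sketch using Node's child_process.spawnSync. The runWithTimeout helper and the five-minute budget are assumptions for illustration, not the current LiteFarm entrypoint, and signal handling through npm run would need to be verified in the real container:

```javascript
// Sketch only: helper name and time budget are assumptions.
const { spawnSync } = require('child_process');

// Runs a command to completion, but gives up once timeoutMs has elapsed.
function runWithTimeout(command, args, timeoutMs) {
  const result = spawnSync(command, args, {
    stdio: 'inherit',      // keep migration output in the container logs
    timeout: timeoutMs,    // spawnSync sends killSignal when this elapses
    killSignal: 'SIGTERM',
  });
  // spawnSync reports a timeout as an error with code ETIMEDOUT.
  const timedOut = Boolean(result.error && result.error.code === 'ETIMEDOUT');
  return {
    ok: !timedOut && result.status === 0,
    timedOut,
    status: result.status,
  };
}

// Example entrypoint usage (commented out so the sketch has no side effects):
// const migration = runWithTimeout('npm', ['run', 'migrate:dev:db'], 5 * 60 * 1000);
// if (!migration.ok) {
//   console.error(migration.timedOut ? 'Migration timed out' : 'Migration failed');
//   process.exit(1);
// }
// ...then start the API server.
```

Exiting non-zero on a timeout makes the failure visible in the deployment logs, though it would not by itself release whatever was blocking the migration on the database side.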