Prod server deployment and failure due to hung migration (Release 3.7.7: June 18, 2025)

What Happened:

  • On June 18, 2025, ~11:45 am PT, deployment to prod failed on a hung migration.

  • We reviewed the litefarm-api logs; the migration appeared to be hung at the stage where Knex acquires the migration lock on the database

    • The migration lock status was checked manually in the database; it showed 0 (not locked)

    • Attempts to restart the litefarm-api container were unsuccessful

    • Attempts to use the Knex CLI to unlock the migration table or re-run the migration were unsuccessful and produced the same error seen in the logs

    • An AI assistant was consulted and suggested there might be a hung process

      • Attempts to terminate the process were unsuccessful, likely because npm run migrate:dev:db is the Docker container's entrypoint, so the migration runs as the container's main process

  • Riddhi from the data team showed up to ask if the database was down

  • Prod was down for approximately 1 hour 45 minutes
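
The checks above can be sketched as shell commands. This is a sketch only: the container names come from this incident, knex_migrations_lock is Knex's default lock table (is_locked = 0 means not locked), and the Postgres user and database names (postgres, litefarm) are assumptions, not confirmed values from our setup.

```shell
# Tail the API logs to see where the migration is stuck
docker logs --tail 100 litefarm-api

# Check the Knex migration lock directly in the database
# (user/db names here are assumptions; adjust to the real ones)
docker exec litefarm-db psql -U postgres -d litefarm \
  -c "SELECT is_locked FROM knex_migrations_lock;"

# Force-release the lock and retry the migration via the Knex CLI
docker exec litefarm-api npx knex migrate:unlock
docker exec litefarm-api npx knex migrate:latest
```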

How We Resolved:

  • The hung migration was resolved – the exact reason is unknown, as two corrective actions were taken simultaneously by different groups:

    • Joyce and team, on the troubleshooting call, manually edited the Dockerfile to change the entrypoint so the migration process could be killed

    • Anto restarted the litefarm-db container and re-ran the previously attempted Knex unlock command
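
One way the entrypoint could be restructured is sketched below. This is a sketch, not the change actually made on the call: the script name is hypothetical, and npm start stands in for whatever command actually starts the API. The idea is that when the migration is the container's entrypoint it is the main process, so killing it kills the container; running it as a child of a wrapper script lets us kill a hung migration without taking the container down.

```shell
#!/bin/sh
# entrypoint.sh (hypothetical name) - wrapper entrypoint sketch.
# The migration runs as a child process, so an operator can kill it
# (e.g. via docker exec) without killing the container's main process.
npm run migrate:dev:db &
MIGRATE_PID=$!

# Wait for the migration to finish (or be killed by an operator)
wait "$MIGRATE_PID"

# Then hand PID 1 over to the long-running server process
# (npm start is an assumption; substitute the real start command)
exec npm start
```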

Lessons Learned:

  • ROOT CAUSE - coordinate with data team before updating prod

    • Suspicion is that large data operations from the data team may have interfered with the migration process - not confirmed

  • If there are only two developers and one has a hard stop, we should release earlier in the day; it can be stressful to troubleshoot alone

  • Easy troubleshooting steps are often overlooked in favour of jumping to a solution. Let's remember to go slow while troubleshooting and follow the basic steps – they help 90% of the time.

  • If we get help from others async, let's ask them to coordinate so corrective actions are not taken simultaneously, which can cause further issues or make results inconclusive

Action Items:

  1. Add a release checklist item: coordinate with the data team before releasing
  2. Consider adding a troubleshooting dropdown subsection to the release checklist (see dropdown below):

    • Retrieve logs from ALL containers if possible: docker logs CONTAINER
      • litefarm-api:
      • litefarm-db:
      • litefarm-web:
    • STOP: Take time to think and come up with a list of possible causes (prompt: why does it work on beta but not on prod?)
      • Cause 1:
        • Way to test theory 1:
        • What happened:
      • Cause 2:
        • Way to test theory 2:
        • What happened:
    • Do any of the possible causes make sense or are they testable? Maybe nothing is wrong with the code!
    • Try restarting the problem container
      • What happened:
    • Try restarting the containers that are fine
      • What happened:
    • Nothing is working!
      • Can we roll back the release?
        • Yes, good – no worries then
        • We cannot roll back the release:
          • All hands on deck! Get previous employees and contributors to jump on
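
The log-collection step above could be scripted in one go. A sketch, assuming the three container names from this incident:

```shell
# Collect recent logs from all three containers into timestamped files
# so they can be compared (and shared) while troubleshooting
for c in litefarm-api litefarm-db litefarm-web; do
  docker logs --tail 500 "$c" > "logs-$c-$(date +%Y%m%d-%H%M).txt" 2>&1
done
```
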
  3. Consider if the Dockerfile ENTRYPOINT could be improved

  4. Look into the “Under maintenance” screen we used to serve when updating certificates – can we use this for outages?