Five habits for shipping big database migrations

If you have ever written code to move a hundred million rows from one schema to another, you know the feeling. The migration looks fine in staging. Production is two orders of magnitude larger and its access patterns are different. The migration takes five hours instead of fifteen minutes. Two thirds of the way through, you discover that one rare row triggers an exception the test suite never saw. The on-call engineer gets paged at 03:00.

Big migrations are a class of engineering problem that does not yield to ordinary discipline. They reward a specific set of habits.

Habit 1: prefer additive change

Whenever possible, write a migration that adds before it removes. New column, dual-write, backfill, switch reads, then remove the old column. It takes longer, but each step before the final removal can be undone on its own. Reversibility is the only thing that matters at 03:00.
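The first three steps of that sequence can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 module, not a production recipe. The users table, the email_normalized column, and the function names are all hypothetical, and a real database will need its own batching and locking strategy.

```python
import sqlite3


def expand(conn):
    # Step 1: add the new column next to the old one. Nothing is removed yet.
    conn.execute("ALTER TABLE users ADD COLUMN email_normalized TEXT")
    conn.commit()


def update_email(conn, user_id, email):
    # Step 2: dual-write. Every application write fills both columns,
    # so rows touched after this point never need a backfill.
    conn.execute(
        "UPDATE users SET email = ?, email_normalized = ? WHERE id = ?",
        (email, email.strip().lower(), user_id),
    )
    conn.commit()


def backfill(conn, batch_size=1000):
    # Step 3: fill in rows written before dual-write began, one small
    # batch per transaction so no single transaction holds locks for long.
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id, email FROM users "
            "WHERE email_normalized IS NULL AND email IS NOT NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return total
        conn.executemany(
            "UPDATE users SET email_normalized = ? WHERE id = ?",
            [(email.strip().lower(), user_id) for user_id, email in rows],
        )
        conn.commit()
        total += len(rows)
```

Switching reads and dropping the old column are left out on purpose. They are separate deploys, and the drop is the one step you cannot take back.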

Habit 2: estimate the long tail before you run

Run the slowest single-row case ten thousand times in a tight loop on production-shape data. Divide the elapsed time by ten thousand to get a per-row cost, then multiply by your row count. That is your floor estimate. Anything that depends on a remote call to a third party (a payments provider, an identity provider, a webhook) is going to be slower than the lab number. Plan for it.
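The arithmetic is simple enough to script. The sketch below times a per-row function in a tight loop and extrapolates to the full table. The names migrate_row and worst_row are placeholders for whatever your migration does to its slowest known row, and the result assumes rows are processed one at a time with no remote calls.

```python
import time


def extrapolate(elapsed_seconds, iterations, total_rows):
    # Per-row cost scaled to the whole table. Multiplying before dividing
    # keeps the floating-point error small for round inputs.
    return elapsed_seconds * total_rows / iterations


def estimate_runtime_seconds(migrate_row, worst_row, total_rows, iterations=10_000):
    # Time the slowest known single-row case in a tight loop.
    start = time.perf_counter()
    for _ in range(iterations):
        migrate_row(worst_row)
    elapsed = time.perf_counter() - start
    return extrapolate(elapsed, iterations, total_rows)
```

As a worked number: if ten thousand iterations take two seconds, a hundred million rows come out at 20,000 seconds, or roughly five and a half hours, before any third-party latency is added.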

Habit 3: ship the rollback before you ship the migration

If your rollback is "restore from backup," you do not have a rollback. Write a script that reverses the migration, run it on staging until it is boring, and ship it next to the migration script with the same code review. The rollback exists before the migration runs in production.
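One way to keep yourself honest is to ship the forward and reverse scripts as a pair, with a round-trip check that runs in staging. The sketch below uses sqlite3 and a made-up accounts table whose balance is stored as a two-decimal string. Both the schema and the function names are assumptions for illustration only.

```python
import sqlite3
from decimal import Decimal


def migrate(conn):
    # Forward: copy balances into a new table as integer cents.
    conn.execute(
        "CREATE TABLE accounts_v2 (id INTEGER PRIMARY KEY, balance_cents INTEGER NOT NULL)"
    )
    rows = conn.execute("SELECT id, balance FROM accounts").fetchall()
    conn.executemany(
        "INSERT INTO accounts_v2 (id, balance_cents) VALUES (?, ?)",
        [(acc_id, int(Decimal(balance) * 100)) for acc_id, balance in rows],
    )
    conn.commit()


def rollback(conn):
    # Reverse: write every new-table row back in the old format, then drop
    # the new table. Assumes the old format always had two decimal places.
    rows = conn.execute("SELECT id, balance_cents FROM accounts_v2").fetchall()
    conn.executemany(
        "INSERT OR REPLACE INTO accounts (id, balance) VALUES (?, ?)",
        [
            (acc_id, str((Decimal(cents) / 100).quantize(Decimal("0.01"))))
            for acc_id, cents in rows
        ],
    )
    conn.execute("DROP TABLE accounts_v2")
    conn.commit()


def round_trip_is_clean(conn):
    # Migrate, roll back, and confirm the old table is unchanged.
    query = "SELECT id, balance FROM accounts ORDER BY id"
    before = conn.execute(query).fetchall()
    migrate(conn)
    rollback(conn)
    return conn.execute(query).fetchall() == before
```

Run round_trip_is_clean against a staging copy until it stops surprising you. Note that the rollback writes rows back rather than just dropping the new table, so anything written only to the new table during the migration window is not lost.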

Habit 4: feature-flag the read path, not the migration

Do not put the migration script behind a feature flag. Put the read path behind one. The migration runs to completion; the application chooses which schema to read from on a per-request basis. Roll out the read flag at one percent, ten percent, fifty percent, then one hundred percent, with a kill switch that sends every read back to the old schema.
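A percentage flag with a kill switch does not need much machinery. The following is a rough sketch rather than any particular flag library's API. ReadFlag, read_with_flag, and the idea of hashing a stable request key (a user ID, say) are assumptions made for the example. Hashing the key means a given caller stays on the same side of the flag as the percentage grows.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ReadFlag:
    percent: int = 0           # share of requests that read the new schema
    kill_switch: bool = False  # when True, every read uses the old schema

    def use_new_schema(self, request_key: str) -> bool:
        if self.kill_switch:
            return False
        # Map the key to a stable bucket in 0..99. A key that is in at
        # 10 percent is still in at 50 percent.
        digest = hashlib.sha256(request_key.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % 100
        return bucket < self.percent


def read_with_flag(flag, request_key, read_old, read_new):
    # The application picks a read path per request. The migration itself
    # is never gated by the flag.
    if flag.use_new_schema(request_key):
        return read_new()
    return read_old()
```

In use, the flag values would come from whatever configuration system you already have, so that flipping kill_switch does not require a deploy.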

Habit 5: a migration is a project, not a ticket

Big migrations have a tech lead, a runbook, a tabletop rehearsal, a comms plan, an on-call rotation, a post-mortem template, and a written rollback decision authority. They are not assigned to one engineer with a Friday deadline. Treating them as ordinary work saves an hour of senior engineering time and costs a major incident that could have been avoided.

What we do at 888

Every database migration that touches a table with more than ten million rows goes through a written design review with the on-call lead. Every such migration ships with a tested rollback. Every rollout uses feature-flagged reads. We have not had a migration-driven incident since adopting these rules. We expect that to change one day, and we will write about it when it does.