Platform engineering · playbook

Elasticsearch v7 → v9. Multi-cluster. Live platform. Zero regression.

The migration ran behind a shape-compare utility that diffed every index per service before each cutover — the safety net that made "zero regression" a target we could actually hit. This is the playbook, not a vendor case study.

v7 → v9

Major version jump

Multi-cluster

Running topology

Per-service

Cutover granularity

0

Regressions

The problem

Many mature platforms end up on a multi-cluster Elasticsearch topology — one legacy cluster on v7, a couple of newer ones already on v9, one or two for specific products. The legacy cluster is EOL, blocking query-performance wins, and nobody wants to be the person who starts "the ES migration" on a Friday.

Migrations like this are easy on paper ("just reindex") and hell in practice. Services get written against the v7 Java-client API and make subtle assumptions about response shape. Document schemas drift over time — fields added in v7.8 behave differently in v9.0. Some services store their own copy of `mapping.json` and re-apply it at boot. Others don't.

The bar for "success" isn't just "v7 is gone." It's: every service that reads from Elasticsearch returns identical results to what it returned the day before the cutover. Across every endpoint. For the full production dataset. That's the only definition of done that matters.

The approach — three pillars

Big-bang cutovers fail silently. Phased migrations that drag for months fail loudly. The sweet spot: a per-service cutover with a pre-flight check that has to return green before we move.

1. One canonical v9 client

A single `ElasticV9Service` that wraps `@elastic/elasticsearch` v9 and exposes the same method surface services already use against v7. Dual-compatible during the migration window: if v9 returns a shape-mismatch error, the service can fall back to v7 for read paths. Writes always go to v9 first. This lets you migrate reads per-endpoint, not per-service.

2. Shape-compare utility as the safety net

Before any service flips, a small Node script runs both clusters against a sampled set of queries and diffs the response document shape field-by-field. Any non-zero diff blocks the cutover until we either (a) fix the service, (b) fix the v9 mapping, or (c) document a tolerated diff. Wired as a CI job on every PR that touches Elasticsearch code.

3. Per-service SPEC → BUILD → QA

Every service migration went through my Forge harness: a SPEC document defined the service, the endpoints, the expected diffs, and the acceptance criteria. Build ran the shape-compare and the implementation change. QA ran the test plan with the owner of the affected service — we only flipped after a green score. Full audit trail per migration.

What the shape-compare actually checks

The utility runs a set of representative queries against both clusters, pulls the top-N documents from each, and compares shape at the field level — not values. Values drift constantly as new data lands; shape is the invariant.

Field presence

Every top-level and nested field in the v7 document must exist in the v9 document. Missing field → cutover blocked with a clear message: "field X present in v7, missing in v9 for index Y."

Field type

v7 and v9 handle `keyword` vs `text` differently for some analyzers. If a field was `keyword` in v7 and `text` in v9, downstream services that do exact-match filters silently break. The tool flags any type diff.

Nested array shape

Deeply nested arrays (Cosmos events, audit trails, any doc with `items[].attributes[].value`) are the classic landmine. If a nested field goes missing in v9 because the ingestion path dropped empty strings, services iterating on it throw. Shape check catches it before a single user query does.

Tolerated diffs

Some diffs are fine — e.g. v9 added a `_source` excludes config that suppresses a field we never used. Tolerated diffs go in a YAML allowlist per index. No silent ignoring; every accepted change is explicit and reviewable.

Cutover phases (per service)

One service at a time. Reads flip endpoint-by-endpoint. Writes dual-write until all reads are on v9. Every phase is reversible up until the final decommission.

  1. Phase 0

    Shape-compare green on every index the service reads. SPEC signed off by the service owner.

  2. Phase 1

    Dual-write enabled: every ingestion path writes to v7 AND v9. Services still read from v7.

  3. Phase 2

    Endpoint-by-endpoint read cutover. One endpoint at a time, with a 24h bake period each. Any observed regression → immediate flip back.

  4. Phase 3

    All reads on v9. Writes still dual. Watch for 7 days. Validate analytics / downstream dashboards against the new cluster.

  5. Phase 4

    Writes stop going to v7. Service is fully on v9.

  6. Phase 5

    v7 cluster decommissioned per index (not all at once). Any orphaned query path surfaces as a 404 in logs — fix, then re-decommission.

What I'd do differently

Write the shape-compare utility first, not last

I initially thought "I'll eyeball the indices." Two days into the first service I was eyeballing at 2am. Shape-compare took a day to write and saved weeks of debugging. Any stateful migration across ~3+ services should have one from day one.

Tolerated-diffs YAML is the real documentation

The allowlist of acceptable shape diffs becomes the most useful document in the project. It's a machine-readable record of every decision about what was OK to change. Future-me re-reads it every time a field is touched.

Reversibility is the feature

Every phase has a flip-back path. The per-endpoint bake in Phase 2 means you actually exercise the flip-back at least once — in our case, a field where v9's `_source` was being filtered differently. 30-second flip, zero customer impact. Big-bang would have been a 3am cluster rollback.

Don't trust "works on my laptop" for ES

Elasticsearch in dev mode uses different defaults than prod (replicas, `refresh_interval`, per-tier settings). Shape matches in dev but not in prod. Shape-compare always runs against staging — never local.

Migrating Elasticsearch (or any stateful service)?

I do migration consulting for stateful data systems: Elasticsearch v7/v8 → v9, Postgres major-version upgrades with DR validation, Kafka protocol migrations, Redis hardening. The pattern is the same — shape-compare safety net, per-service SPEC → Build → QA, reversible phases. Remote, paid engagements — typically 3–8 weeks per migration.

Email [email protected]

Kubernetes from scratch

The other hard-infra story: a 10-script reproducible Kubernetes bootstrap on bare metal with Cilium eBPF + ArgoCD GitOps + Vault-backed secrets + Cloudflared Zero Trust.