Agent-Assisted Schema Evolution
Automating backwards-compatible schema evolution could recover almost $4 million in engineering capacity for a typical tech platform company.
Each schema change in a source data asset must propagate through all derived assets. Even backwards-compatible changes, such as adding a field, require modifying every downstream asset. The associated engineering cost scales rapidly.
Companies like Netflix, Uber, and Spotify operate thousands of microservices, each producing unique data. A conservative estimate is therefore 1,000 source data assets, excluding third-party integrations and ERPs. Total data assets typically exceed source assets by a factor of 10. Airbnb is an outlier with millions of data assets, nearly unmaintainable for an organization of ~8,000 employees.
Let’s assume one FTE costs $150,000 annually, i.e. $75/hour at 40 hours/week and 50 weeks/year.
Given:
- \(S=1000\) source data assets
- \(D=10\) lineage depth (i.e. total assets / source assets)
- \(N=5\) changes per asset per year
- \(T=1\) hour per change per affected data asset
- \(C_{\mathrm{hour}}=75\) dollars per hour
The annual engineering cost is:
\[\begin{eqnarray} C_{\mathrm{total}} &=& S \cdot D \cdot N \cdot T \cdot C_{\mathrm{hour}} \\ &=& 1000 \cdot 10 \cdot 5 \cdot 1 \cdot 75 = \$3,750,000. \end{eqnarray}\]

One hour per change accounts for the full cycle: reviewing the upstream docs, checking out the repository, editing the affected data asset, running local tests, pushing, and waiting for a PR review. It’s all mechanical work, but it quickly compounds. Batching and no-ops may reduce costs, but schema changes within derived assets offset any such discounts.
Plug in your organization’s numbers to estimate your cost!
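The estimate above is a single product, so it is easy to parameterize. A minimal sketch (the function name and signature are my own, not from any library):

```python
def annual_schema_evolution_cost(
    source_assets: int,       # S: number of source data assets
    lineage_depth: int,       # D: total assets / source assets
    changes_per_year: int,    # N: schema changes per asset per year
    hours_per_change: float,  # T: hours per change per affected asset
    hourly_rate: float,       # C_hour: fully loaded engineering cost per hour
) -> float:
    """Annual engineering cost of propagating schema changes downstream."""
    return source_assets * lineage_depth * changes_per_year * hours_per_change * hourly_rate

# The post's numbers: 1000 * 10 * 5 * 1 * 75 = 3,750,000 dollars per year.
print(annual_schema_evolution_cost(1000, 10, 5, 1, 75))
```

Swap in your own \(S\), \(D\), \(N\), \(T\), and hourly rate to see where your organization lands.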
The problem
The cost of schema evolution stems not only from breaking changes but also from backwards-compatible ones; human interpretation is the bottleneck. This work preserves coherence but creates little value, if any. And the cost scales with the size of the organization.
The solution: constrained agents
The solution is a constrained agent that proposes edits, runs tests, and opens pull requests. It does not merge code or propagate changes silently. If tests fail or a reviewer rejects changes, it stops. At worst, the system provides early detection and structured notification.
This background coding agent operates under strict external gates, producing small, reviewable diffs. CI and human review remain the arbiters of correctness. Constrained agency with verification scales better than humans alone or unbounded automation.
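The gating logic per downstream asset is simple to state in code. A minimal sketch, assuming hypothetical `propose_patch`, `run_tests`, and `open_pr` helpers that wrap the agent, the test runner, and the code host:

```python
from dataclasses import dataclass

@dataclass
class Outcome:
    status: str   # "pr_opened", "tests_failed", or "no_change_needed"
    detail: str = ""

def propagate_one_edge(asset, schema_diff, propose_patch, run_tests, open_pr):
    """Gate a single downstream asset. The agent only proposes: if no patch
    is needed it stops; if tests fail it stops and reports; otherwise the
    change still has to pass CI and a human-reviewed pull request."""
    patch = propose_patch(asset, schema_diff)
    if patch is None:
        return Outcome("no_change_needed")
    if not run_tests(asset, patch):
        # Worst case: early detection and a structured notification.
        return Outcome("tests_failed", detail=f"tests failed for {asset}")
    return Outcome("pr_opened", detail=open_pr(asset, patch))
```

Note that there is no merge path in this function at all: the only terminal states are a stop or an open PR awaiting review.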
Here’s how it works:
- Schema changes are detected via a schema registry.
- Lineage is queried (e.g. OpenLineage) to identify immediate downstream consumers.
- For each downstream asset, the system:
- Checks out the repository.
- Inspects the schema diff.
- Applies deterministic transformers where possible; otherwise, invokes an LLM to generate a minimal patch.
- Runs tests and static analyses.
- Opens a PR with a structured explanation of the inferred intent and applied change.
- Propagation proceeds one dependency edge at a time to minimize the blast radius.
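The outer loop over the steps above can be sketched as follows; `SchemaChange` and the `immediate_downstream` callable are hypothetical stand-ins for a schema registry event and an OpenLineage query, and `handle_asset` is whatever per-asset pipeline (checkout, diff, patch, test, PR) you plug in:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SchemaChange:
    asset: str   # upstream asset whose schema changed
    diff: dict   # structured schema diff from the registry

def propagate_schema_change(
    change: SchemaChange,
    immediate_downstream: Callable[[str], list],
    handle_asset: Callable[[str, dict], str],
) -> dict:
    """One dependency edge at a time: touch only the immediate consumers of
    the changed asset. Their own consumers are handled in a later round,
    once these PRs merge and surface as fresh schema changes, which keeps
    the blast radius of any single run small."""
    return {asset: handle_asset(asset, change.diff)
            for asset in immediate_downstream(change.asset)}
```

Because each run only crosses one edge, a bad patch can at most affect the direct consumers of one asset before review catches it.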
Why not rules?
Rules-based tooling handles trivial cases: new fields (with defaults), field renaming, or type constraint relaxation. These deterministic cases are often already managed internally by wire formats or storage technologies.
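For the trivial cases, a deterministic transformer is a pure function of the schema. A minimal sketch for the add-a-field-with-a-default case, using a made-up dict-based schema representation rather than any particular format:

```python
def add_field_with_default(schema: dict, name: str, type_: str, default) -> dict:
    """Deterministic transformer for a backwards-compatible field addition.
    The new field carries a default, so existing consumers keep working
    and new consumers can opt in. Returns a new schema; the input is
    left untouched, and re-applying the change is a no-op."""
    if name in schema["fields"]:
        return schema  # no-op: the field already exists
    return {
        **schema,
        "fields": {**schema["fields"], name: {"type": type_, "default": default}},
    }
```

This is exactly the kind of change that wire formats already encode; the hard cases are the ones a function like this cannot express.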
The residual cost exists due to environmental heterogeneity across formats, runtimes, repositories, and implicit contracts. Exhaustively encoding all variations as rules does not scale.
This system does not replace engineers. It reallocates them from mechanical reconciliation to review, where domain-specific judgment is critical. At scale, schema evolution cost is inevitable. A constrained agentic system with cheap-to-verify output amortizes that cost.
This post was written by Ian, edited by Crossfire ($0.28, < 2 min), and subsequently tweaked by Ian.