Schema Evolution Strategies for Delta Lake, Iceberg, and Avro
A source team renames a column from user_id to userId on a Friday afternoon. By Monday your pipeline is producing nulls for 40% of rows, three dashboards are broken, and no one filed a ticket. Schema changes are the silent killer of data pipelines.
Schema evolution is the set of strategies that let your data systems absorb those changes gracefully — without manual intervention, data loss, or 2 AM pages.
Why Schema Evolution Is Hard
Columnar formats designed for analytical storage (Parquet, ORC) do embed a schema in each file's footer — but that schema is frozen per file. A table is a collection of files written over months, and nothing at the file level reconciles their schemas. When the schema changes, readers that haven't been updated produce incorrect results or fail silently.
Modern table formats (Delta Lake, Apache Iceberg) and serialization frameworks (Avro, Protobuf) attack this problem differently. Understanding their models helps you choose the right tool for each scenario.
The Common Types of Schema Change
| Change Type | Risk Level | Compatible? |
|---|---|---|
| Add nullable column | Low | Backward-compatible |
| Add non-nullable column | High | Breaking change |
| Remove column | Medium | Backward-compatible (if readers ignore unknowns) |
| Rename column | High | Breaking (logically a drop + add) |
| Widen type (int to long) | Low-Medium | Usually safe |
| Narrow type (long to int) | High | Breaking — data loss risk |
| Change type (string to int) | High | Breaking |
The central question for any schema change: which side is the constraint — the writer or the reader?
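One way to make the table above operational is a small compatibility check that runs before a change is deployed. Here is a minimal sketch — the change labels and the `is_backward_compatible` helper are our own names derived from the table, not any library's API:

```python
# Compatibility rules mirroring the table above.
# "Backward-compatible" means existing readers keep working
# without being updated for the new schema.
RULES = {
    "add_nullable_column": ("low", True),
    "add_required_column": ("high", False),
    "remove_column":       ("medium", True),   # only if readers ignore unknown fields
    "rename_column":       ("high", False),    # logically a drop + add
    "widen_type":          ("low", True),      # e.g. int -> long
    "narrow_type":         ("high", False),    # e.g. long -> int, data loss risk
    "change_type":         ("high", False),    # e.g. string -> int
}

def is_backward_compatible(change: str) -> bool:
    """Return True if the change can ship without updating readers first."""
    risk, compatible = RULES[change]
    return compatible
```

A check like this in CI forces the question the table raises: if the answer is `False`, the readers are the constraint and must be migrated before the writer changes.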
Schema Evolution in Delta Lake
Delta Lake stores the table schema in the transaction log. Every write must conform to the current schema by default — this is the "schema enforcement" mode. Schema evolution requires explicitly opting in.
Automatic Schema Merging
```python
# PySpark — Delta Lake schema merge on write
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("schema_evolution") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# New DataFrame has an extra column: 'country'
new_data = spark.createDataFrame([
    ("u001", "Alice", "DE"),
    ("u002", "Bob", "US"),
], ["user_id", "name", "country"])

# mergeSchema=True adds the new column to the table schema
new_data.write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/delta/users")
```
With mergeSchema=True, Delta Lake adds the new country column to the table schema. Existing rows will return NULL for country. This is safe for additive changes.
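Conceptually, mergeSchema computes a union of the table schema and the incoming schema, with NULL backfill for rows that predate the new column. A pure-Python sketch of that semantics (a simplified model, not Delta's actual implementation):

```python
def merge_schema(table_schema: list, incoming_schema: list) -> list:
    """Union of columns: existing order first, new columns appended at the end."""
    return table_schema + [c for c in incoming_schema if c not in table_schema]

def read_with_schema(rows: list, merged_schema: list) -> list:
    """Old rows report None (NULL) for columns they were written without."""
    return [{col: row.get(col) for col in merged_schema} for row in rows]
```

This is why the change is safe: no existing file is touched, and readers simply see NULLs where the new column has no data yet.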
Overwriting Schema
For destructive changes (reordering columns, changing types), use overwriteSchema:
```python
# Delta Lake — overwrite the entire schema
# WARNING: This rewrites the schema; existing data is read with the new definition
new_data.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save("/delta/users")
```
overwriteSchema is destructive — use it only when you're intentionally replacing the schema, not just evolving it.
Column Mapping (Delta 2.0+)
Delta Lake's column mapping feature (enabled in Delta 2.0+) allows physical column renaming without rewriting data files:
```sql
-- Spark SQL (Delta Lake) — rename column without rewriting files
ALTER TABLE users SET TBLPROPERTIES (
    'delta.columnMapping.mode' = 'name',
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5'
);

ALTER TABLE users RENAME COLUMN user_id TO userId;
```
Column mapping stores a logical-name-to-physical-name mapping in the Delta log. Old files are still valid — only the metadata changes. This is the right way to rename columns in Delta Lake.
Schema Evolution in Apache Iceberg
Iceberg was designed with schema evolution as a first-class feature. Its metadata model tracks column IDs separately from column names — which is the fundamental difference from Delta Lake.
Iceberg's Column ID Model
In Iceberg, every column has a stable integer ID assigned at creation. When you rename a column, you're changing the name — but the ID (which maps to the physical file data) stays the same. Readers that use column IDs (the default) see the rename transparently.
```sql
-- Iceberg SQL (Spark SQL dialect with Iceberg catalog)

-- Add a column
ALTER TABLE catalog.db.users ADD COLUMNS (country STRING);

-- Rename a column — safe, no data rewrite needed
ALTER TABLE catalog.db.users RENAME COLUMN user_id TO userId;

-- Drop a column
ALTER TABLE catalog.db.users DROP COLUMN legacy_phone;

-- Widen a type (int to bigint) — safe
ALTER TABLE catalog.db.users ALTER COLUMN age TYPE bigint;
```
All of these operations are metadata-only in Iceberg — no data files are rewritten.
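The column-ID model can be illustrated with a toy reader: data files key values by stable column ID, and the table schema maps names to IDs, so a rename edits only the mapping. A deliberately simplified sketch — this is not Iceberg's real metadata layout:

```python
# Data files store values keyed by immutable column ID, never by name.
data_file = {1: ["u001", "u002"], 2: ["Alice", "Bob"]}  # column ID -> values

# The schema maps logical names to IDs; a rename touches only this dict.
schema = {"user_id": 1, "name": 2}

def read_column(name: str) -> list:
    """Resolve a logical name to its column ID, then read by ID."""
    return data_file[schema[name]]

# Rename user_id -> userId: metadata-only, the data file is untouched.
schema["userId"] = schema.pop("user_id")
```

Because readers resolve names to IDs before touching files, every file written before the rename remains readable under the new name.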
Iceberg Schema Evolution Rules
| Operation | Data Rewrite? | Safe? |
|---|---|---|
| Add column | No | Yes |
| Drop column | No | Yes (existing files keep the column, readers ignore it) |
| Rename column | No | Yes (ID-based tracking) |
| Reorder columns | No | Yes |
| Widen type (int to long, float to double) | No | Yes |
| Narrow type (long to int) | — | Not allowed — Iceberg only permits widening promotions |
| Change type (string to int) | — | Not allowed — not a valid type promotion |
Iceberg is more permissive than Delta Lake about column operations, with column ID tracking making renames genuinely safe.
Schema Evolution in Avro
Avro uses JSON-defined schemas embedded in or alongside data files. It was built for streaming (Kafka) and RPC, where producer and consumer schemas may differ at any point in time.
Avro Compatibility Modes
Compatibility for Avro schemas is enforced by a schema registry (Confluent Schema Registry is the de facto standard), which checks each new schema version against prior versions under a configured mode — BACKWARD, FORWARD, FULL, or NONE:
```python
# Python — Avro schema evolution example

# V1 schema
schema_v1 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "name", "type": "string"}
    ]
}

# V2 schema — adds optional field with default (BACKWARD compatible)
schema_v2 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "name", "type": "string"},
        # New field MUST have a default value for backward compatibility
        {"name": "country", "type": ["null", "string"], "default": None}
    ]
}

# V2 can read V1 data (country -> null via the default): BACKWARD compatible
# V1 can also read V2 data (Avro readers skip unknown writer fields),
# so adding an optional field with a default is in fact FULL compatible
```
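The resolution rule behind BACKWARD compatibility can be sketched without a real Avro library: when the reader's schema declares a field the writer never produced, the field's default is substituted; if there is no default, resolution fails. A simplified model of that rule (not the `avro` library's internals):

```python
def resolve(record: dict, reader_fields: list) -> dict:
    """Apply Avro-style schema resolution to one decoded record:
    missing fields take their declared default; no default means the
    schemas are incompatible for this data."""
    out = {}
    for field in reader_fields:
        if field["name"] in record:
            out[field["name"]] = record[field["name"]]
        elif "default" in field:
            out[field["name"]] = field["default"]
        else:
            raise ValueError(f"no value and no default for {field['name']}")
    return out
```

Running a V1 record through the V2 field list yields `country = None` — exactly the behavior the schema registry is guaranteeing when it accepts V2 as BACKWARD compatible.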
Avro Compatibility Matrix
| Mode | New schema reads old data? | Old schema reads new data? |
|---|---|---|
| BACKWARD | Yes | No |
| FORWARD | No | Yes |
| FULL | Yes | Yes |
| NONE | No checks | No checks |
Rules for BACKWARD compatibility (the most common mode):
- Adding a field: it must have a default value
- Removing a field: only allowed for fields that had a default
- Renaming a field: don't rename — add the new name with an alias for the old one
- Changing a type: generally not allowed (only Avro's built-in promotions, e.g. int to long)
```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {
      "name": "userId",
      "type": "string",
      "aliases": ["user_id"]
    }
  ]
}
```
The aliases field tells Avro readers: "if you encounter a field called user_id in old data, map it to userId." This is how you rename fields without breaking backward compatibility.
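The alias lookup can be modeled the same way as default substitution: the reader tries the field's current name first, then each alias, and only then falls back to a default. A sketch of that resolution order (our own simplification, not the Avro library's code):

```python
def resolve_field(record: dict, field: dict):
    """Look up a field by its current name, then by any alias (old names),
    then by its default — in that order."""
    if field["name"] in record:
        return record[field["name"]]
    for alias in field.get("aliases", []):
        if alias in record:
            return record[alias]
    if "default" in field:
        return field["default"]
    raise ValueError(f"unresolvable field: {field['name']}")
```

Old records written with `user_id` and new records written with `userId` both resolve to the same logical field, which is what keeps the rename non-breaking.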
Choosing the Right Strategy
| Scenario | Recommended Approach |
|---|---|
| Lakehouse with batch pipelines, Databricks/Spark | Delta Lake + column mapping for renames |
| Lakehouse requiring true column renames + type evolution | Apache Iceberg |
| Kafka streaming with producer/consumer schema independence | Avro + Confluent Schema Registry (BACKWARD mode) |
| Need to support multiple schema versions in parallel | Iceberg + time-travel; Avro + schema registry versions |
| Unknown future schema changes from external sources | Contract-based: enforce a data contract at ingestion and quarantine non-conforming records |
Common Mistakes
1. Treating a column rename as universally safe. In plain Parquet (no table format), a rename is logically a drop + add — the old column's data is gone. In Iceberg, or in Delta Lake with column mapping, it's metadata-only. Know what layer you're operating at.
2. Adding non-nullable Avro fields without defaults. This breaks BACKWARD compatibility immediately. Always add nullable union types ["null", "string"] with "default": null for new Avro fields.
3. Ignoring type widening limits. int -> long is safe. float -> double is safe. string -> int is never safe, even if the values happen to be numeric today — one malformed value will break your pipeline.
4. Schema changes without versioning. Every schema change should go through version control (dbt model diffs, Avro schema registry). "Quick fix" schema changes made directly in production without tracking are the root cause of most schema-related incidents.
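Mistake 3 is easy to demonstrate: narrowing silently mangles values that don't fit the smaller type. Using `ctypes` to emulate what a fixed-width 32-bit integer column would store:

```python
import ctypes

big = 5_000_000_000                    # fits comfortably in a long (int64) column
narrowed = ctypes.c_int32(big).value   # what an int32 column would actually hold

# The value wraps modulo 2**32 with no error raised:
# 5_000_000_000 becomes 705_032_704 — narrowing is silent data loss.
```

Real engines may raise instead of wrapping, but either outcome breaks the pipeline; the point is that long -> int is never a metadata-only change.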
Schema Change Checklist
Before any schema change in production:
- Classify the change type (add/remove/rename/type change)
- Identify all downstream consumers (pipelines, dashboards, ML models)
- Test backward/forward compatibility with your chosen format
- Communicate to downstream owners before deploying
- Use column mapping (Delta) or column IDs (Iceberg) for renames instead of drop+add
- Add a schema migration entry to your data catalog or change log
The Bottom Line
Schema evolution is not a theoretical concern — it's a weekly reality for any team managing production data pipelines. Delta Lake and Iceberg handle additive changes gracefully and make column renames safe at the metadata level. Avro's schema registry gives Kafka-based pipelines the compatibility guarantees needed for independent producer/consumer deployments.
The worst schema migrations happen when teams don't have a strategy. The best ones are invisible to downstream consumers.
Next step: Combine schema evolution with a formal change process — read Data Contracts for Teams to see how contracts prevent breaking changes before they reach production.
Continue Reading
Data Deduplication Strategies: Hash, Fuzzy, and Record Linkage
Airflow vs Dagster vs Prefect: An Honest Comparison
An unbiased comparison of Airflow, Dagster, and Prefect — covering architecture, DX, observability, and real trade-offs to help you pick the right orchestrator.
Change Data Capture Explained
A practical guide to CDC patterns — log-based, trigger-based, and polling — with Debezium configuration examples and Kafka Connect integration.