Incremental Processing
Rocky supports multiple incremental processing strategies to avoid reprocessing data that has not changed. The simplest is watermark-based append, where only rows newer than a stored timestamp are copied. More advanced strategies include partition-level checksums and column-level change propagation.
Materialization strategies
Section titled “Materialization strategies”Every model (replication or transformation) declares a materialization strategy:
| Strategy | Behavior | Use case |
|---|---|---|
full_refresh | CREATE OR REPLACE TABLE ... AS SELECT ... | Small tables, schema changes, initial loads |
incremental | INSERT INTO ... SELECT ... WHERE ts > watermark | Append-only data with a reliable timestamp |
merge | MERGE INTO ... USING ... ON key WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT | Mutable data with a unique key |
time_interval | Per-partition INSERT OVERWRITE with @start_date/@end_date placeholders | Time-series data with partition-level reprocessing |
microbatch | time_interval alias with hourly defaults | dbt-compatible partition processing |
ephemeral | No table; inlined as CTE in downstream models | Lightweight intermediate transformations |
delete_insert | DELETE WHERE partition_key IN (...); INSERT ... | Partition-replace when MERGE overhead isn’t needed |
See Model Format for the full configuration of each strategy.
Watermark-based incremental
Section titled “Watermark-based incremental”The default incremental strategy for replication. It relies on a monotonically increasing timestamp column (typically _fivetran_synced) to determine which rows are new.
Lifecycle
Section titled “Lifecycle”-
Read watermark. Rocky reads the last stored watermark for the table from the state store. The watermark is keyed by the fully qualified table name (
catalog.schema.table). -
No watermark (first run). If no watermark exists, Rocky performs a full refresh — it copies all rows from the source. This establishes the baseline.
-
Watermark exists. Rocky generates an incremental query that filters to rows newer than the watermark:
SELECT *, CAST(NULL AS STRING) AS _loaded_byFROM source_catalog.source_schema.ordersWHERE _fivetran_synced > TIMESTAMP '2025-03-15T14:30:00Z'- Update watermark. After a successful copy, Rocky writes the new watermark (the maximum timestamp seen in this batch) to the state store. The next run picks up from this point.
Configuration
Section titled “Configuration”Replication strategy and watermark column live on the pipeline:
[pipeline.bronze]type = "replication"strategy = "incremental"timestamp_column = "_fivetran_synced"The timestamp column must exist in the source table and contain monotonically increasing values. If the source system backfills historical data with old timestamps, those rows will be missed by incremental processing.
Merge strategy
Section titled “Merge strategy”For mutable data where rows can be updated after initial insert, the merge strategy uses a unique key to upsert:
[strategy]type = "merge"unique_key = ["customer_id"]update_columns = ["name", "email", "updated_at"]This generates:
MERGE INTO target_catalog.target_schema.customers AS targetUSING (SELECT ... FROM source WHERE ...) AS sourceON target.customer_id = source.customer_idWHEN MATCHED THEN UPDATE SET name = source.name, email = source.email, updated_at = source.updated_atWHEN NOT MATCHED THEN INSERT *If update_columns is omitted, all columns are updated on match.
Partition-level checksums
Section titled “Partition-level checksums”Beyond watermark-based incremental (which only handles appends), Rocky supports partition-level checksums to detect changes in existing rows. This is implemented in the incremental module of rocky-core.
How it works
Section titled “How it works”- Each partition of a model (e.g., one partition per date) gets a checksum computed from its data (a hash of the partition contents and row count).
- On subsequent runs, Rocky compares current checksums against stored checksums from the previous run.
- Only partitions with changed checksums are reprocessed. Unchanged partitions are skipped entirely.
Previous run: { "2026-03-28": 0xABCD, "2026-03-29": 0x1234 }Current run: { "2026-03-28": 0xABCD, "2026-03-29": 0x5678, "2026-03-30": 0x9999 }
Result: Changed: ["2026-03-29", "2026-03-30"] Unchanged: ["2026-03-28"]This catches scenarios that watermarks miss: backfills, late-arriving corrections, and retroactive updates to historical data.
Column-level change propagation
Section titled “Column-level change propagation”The compiler’s semantic graph (see The Rocky Compiler) tracks column-level lineage across the entire DAG. Rocky uses this lineage to skip downstream models that do not depend on any changed columns.
Example
Section titled “Example”Consider three models:
orders (source) → orders_summary (uses: amount, customer_id) → orders_audit (uses: status, updated_at)If an upstream schema change only affects the status column, Rocky determines:
orders_summarydoes not depend onstatus— skiporders_auditdepends onstatus— recompute
This is a PropagationDecision: either Recompute or Skip { reason }. The skip reason is logged so you can verify the decision.
Column propagation works with any incremental strategy. It adds a layer of intelligence on top of watermarks or checksums by pruning the DAG execution to only what actually needs to change.
Time-interval processing
Section titled “Time-interval processing”For time-series data that requires partition-level control, the time_interval strategy processes data in discrete time partitions. The model SQL uses @start_date and @end_date placeholders:
SELECT event_date, event_type, COUNT(*) AS event_countFROM events.page_viewsWHERE event_date >= @start_date AND event_date < @end_dateGROUP BY event_date, event_type[strategy]type = "time_interval"time_column = "event_date"granularity = "day"lookback = 3How it works
Section titled “How it works”- Rocky determines which partitions to process based on CLI flags (
--partition,--from/--to,--latest,--missing,--lookback). - For each partition, it substitutes
@start_dateand@end_datewith quoted timestamp literals. - The generated SQL uses
INSERT OVERWRITEsemantics (atomic on Databricks via Delta, multi-statement transaction on Snowflake). - Per-partition state is tracked in the state store for gap discovery (
--missing).
Per-warehouse SQL
Section titled “Per-warehouse SQL”- Databricks:
INSERT INTO <target> REPLACE WHERE <filter> <select>(single atomic statement via Delta) - Snowflake:
BEGIN; DELETE FROM <target> WHERE <filter>; INSERT INTO <target> <select>; COMMIT;(4 statements) - DuckDB: Same shape as Snowflake
CLI flags
Section titled “CLI flags”Note: as of engine v1.33, the canonical form is
rocky planfollowed byrocky apply <plan-id>. Every partition-selection flag below is accepted on bothrocky planand the legacyrocky runalias;rocky runcontinues to work and emits a one-line[deprecated]notice to stderr (silence withROCKY_SUPPRESS_DEPRECATION=1).
rocky plan --partition 2026-04-01 && rocky apply <plan-id> # Process one partitionrocky plan --from 2026-03-01 --to 2026-04-01 && rocky apply <plan-id> # Date rangerocky plan --latest && rocky apply <plan-id> # Most recent partitionrocky plan --missing && rocky apply <plan-id> # Discover and fill gapsrocky plan --lookback 7 && rocky apply <plan-id> # Reprocess last N partitionsrocky plan --parallel 4 && rocky apply <plan-id> # Parallelize partitionsFull refresh fallback
Section titled “Full refresh fallback”Rocky falls back to full refresh in two situations:
Schema drift
Section titled “Schema drift”When the schema drift detector (in rocky-core/drift.rs) finds a type mismatch between source and target columns, it triggers DropAndRecreate. The target table is dropped and rebuilt from scratch. This is necessary because inserting rows with incompatible types would fail at the warehouse level.
Source: orders.amount (DECIMAL(10,2))Target: orders.amount (STRING)→ Schema drift detected → DROP TABLE → Full refreshMissing watermark
Section titled “Missing watermark”If the state store doesn’t contain a watermark for a table — either because the table is new, the state backend was wiped, or the table was renamed — Rocky treats the next run as a first run and performs a full refresh, establishing a new baseline watermark.
State store
Section titled “State store”Watermarks and partition checksums are stored in an embedded key-value store backed by redb. See the State Management page for details on the state store, remote persistence backends, and the state lifecycle.
The state store tracks:
- Watermarks — last successfully replicated timestamp per table
- Check history — historical row counts for anomaly detection
- Run history — metadata about previous runs
- Partition checksums — per-partition hashes for checksum-based incremental
- DAG snapshots — previous DAG structure for change detection
All state is scoped per environment. Dev, staging, and prod maintain independent state with no cross-environment coordination.