Skip to content

State Management

Rocky uses an embedded key-value store to track watermarks and run history. No external database is required — state is stored in a local file that Rocky manages automatically.

Rocky uses redb, an embedded key-value store written in Rust. Think of it as SQLite for key-value data: a single file, no server process, ACID transactions, and zero configuration.

By default, Rocky stores state in .rocky-state.redb in the current directory. You can override this with the --state-path flag:

Terminal window
plan_id=$(rocky --config rocky.toml --state-path /var/lib/rocky/state.redb plan --output json | jq -r .plan_id)
rocky --state-path /var/lib/rocky/state.redb apply "$plan_id"

Each table’s watermark tracks the last successfully replicated timestamp:

Key: "acme_warehouse.staging__us_west__shopify.orders"
Value: {
last_value: "2025-03-15T14:30:00Z",
updated_at: "2025-03-15T14:35:12Z"
}
  • last_value — The maximum value of the timestamp column (e.g., _fivetran_synced) seen in the last successful run
  • updated_at — When the watermark was last written

Watermarks are keyed by the fully qualified table name: catalog.schema.table.

Historical row counts are stored for anomaly detection:

Key: "acme_warehouse.staging__us_west__shopify.orders"
Value: [
{ count: 150432, timestamp: "2025-03-13T10:00:00Z" },
{ count: 151200, timestamp: "2025-03-14T10:00:00Z" },
{ count: 152100, timestamp: "2025-03-15T10:00:00Z" }
]

At the start of each table’s replication, Rocky reads the watermark from the state store.

If no watermark exists for a table, Rocky performs a full refresh — it copies all rows from the source.

If a watermark exists, Rocky generates an incremental query:

SELECT *, CAST(NULL AS STRING) AS _loaded_by
FROM fivetran_catalog.src__acme__us_west__shopify.orders
WHERE _fivetran_synced > TIMESTAMP '2025-03-15T14:30:00Z'

Only rows with a timestamp newer than the stored watermark are copied.

After a successful copy, Rocky updates the watermark to the current timestamp. The next run will pick up from this point.

Use rocky state to view the current state:

Terminal window
rocky state

This displays all stored watermarks and their values, useful for debugging incremental runs.

Removing a watermark for a table causes Rocky to perform a full refresh on the next run. This is useful when you need to backfill data or recover from issues:

Terminal window
rocky state delete --table "acme_warehouse.staging__us_west__shopify.orders"

Rocky compares the current row count of each table against a historical moving average. If the deviation exceeds a configurable threshold (e.g., 50%), Rocky flags it as an anomaly in the run output.

This catches problems like:

  • A source table was truncated (count drops to near zero)
  • A bad sync duplicated data (count spikes dramatically)
  • A connector stopped syncing (count stays flat when it should be growing)

The threshold is configured per pipeline in rocky.toml:

[pipeline.bronze.checks]
enabled = true
row_count = true
freshness = { threshold_seconds = 86400 }

By default, state is stored locally on disk. On ephemeral environments (e.g., EKS pods, CI runners), the local file is lost between runs. Rocky supports remote state backends to persist watermarks across deployments.

BackendConfigUse Case
localDefaultDevelopment, persistent VMs
s3s3_bucketDurable storage, multi-region
valkeyvalkey_urlLow-latency, shared state
tieredBothValkey for speed, S3 for durability
[state]
backend = "s3"
s3_bucket = "${ROCKY_STATE_BUCKET}"
s3_prefix = "rocky/state/" # default
[state]
backend = "valkey"
valkey_url = "${VALKEY_URL}"
valkey_prefix = "rocky:state:" # default
[state]
backend = "tiered"
valkey_url = "${VALKEY_URL}"
s3_bucket = "${ROCKY_STATE_BUCKET}"

The tiered backend combines Valkey (fast) with S3 (durable):

  • Download: Try Valkey first. If miss or error, fall back to S3.
  • Upload: Write to both Valkey (best-effort) and S3 (required).

This gives you sub-millisecond reads from Valkey in the common case, with S3 as a reliable fallback if Valkey is unavailable.

When backend is not local, Rocky syncs state automatically:

  1. Before run: Download remote state → local .redb file
  2. During run: Read/write from local .redb (fast, no network)
  3. After run: Upload local .redb → remote storage

If download fails, Rocky logs a warning and starts fresh from target-table metadata. Upload failure behaviour is governed by the retry + failure policy below.

Every remote transfer (upload or download) runs inside a wall-clock budget with exponential-backoff retries and a three-state circuit breaker — the same machinery the Databricks and Snowflake adapters already use. Configuration lives under [state.retry] in rocky.toml; the full field list is in the configuration reference.

[state]
backend = "s3"
s3_bucket = "${ROCKY_STATE_BUCKET}"
transfer_timeout_seconds = 300 # total wall-clock ceiling — retries share this budget
on_upload_failure = "skip" # "skip" (default) or "fail"
[state.retry]
max_retries = 3 # defaults shown; omit the block to use them
circuit_breaker_threshold = 5

on_upload_failure controls what happens when retries and the circuit breaker are both exhausted:

ModeBehaviourWhen to use
"skip" (default)Log a warning, mark the run successful, leave remote state stale. The next run re-derives watermarks from target-table metadata.Most callers — the de-facto pre-1.13 behaviour. Trades state durability for run liveness.
"fail"Propagate a StateSyncError::RetryBudgetExhausted or CircuitOpen to the caller; the run fails.Strict environments where re-deriving watermarks is prohibitively expensive (long-running backfills, multi-hour syncs).

Terminal outcomes are structured. Every state.upload / state.download event now carries an outcome field so you can alert on state-layer health without log-message regex:

outcomeMeaning
okTransfer completed successfully.
absentRemote state was empty — first run against this backend.
timeoutHit transfer_timeout_seconds wall-clock cap.
error_then_freshExistence check failed; Rocky started fresh.
transient_exhaustedmax_retries exhausted on transient errors.
budget_exhaustedmax_retries_per_run exhausted across transfers.
circuit_openBreaker is open; transfer skipped without attempting.
skipped_after_failureUpload failed, on_upload_failure = "skip" applied.

Run rocky doctor --check state_rw at cold start to catch IAM / reachability problems before they show up as end-of-run upload failures.

Each environment (dev, staging, prod) maintains its own state. There is no shared state across environments.

This means:

  • A fresh deployment starts with no watermarks (full refresh on first run)
  • Dev environments can be reset independently by deleting the state file
  • Remote backends allow state to survive pod restarts in ephemeral environments
  • No coordination required between environments