Data Governance

Rocky provides a governance layer that enforces data quality, schema stability, access control, masking, retention, and auditability. Most governance features are declarative: you configure them in rocky.toml (or a model sidecar) and they execute automatically as part of rocky apply. Two governance features are exposed as standalone commands for CI gating: rocky compliance (classification vs. masking rollup) and rocky retention-status (per-model retention report).

The five governance pillars live on the pipeline target, in project-level blocks, and in model sidecars:

Grants – declarative catalog and schema ACLs reconciled against Unity Catalog.
Column classification + masking – per-column classification tags plus project-level [mask] / [mask.<env>] strategies.
Compliance rollup – rocky compliance static resolver for CI gating.
Role-graph reconciliation – hierarchical [role.<name>] declarations flattened and reconciled.
Data retention – model-sidecar retention = "<N>[dy]" applied as adapter-native TBLPROPERTIES.

1. Schema Patterns

Schema patterns control how source schemas map to target catalogs and schemas. They are the foundation of Rocky’s multi-tenant routing.

Configuration

Schema patterns live on the pipeline source; templates live on the pipeline target. Both reference the same component names.

[pipeline.bronze.source.schema_pattern]
prefix = "src__"
separator = "__"
components = ["client", "regions...", "connector"]

[pipeline.bronze.target]
adapter = "prod"
catalog_template = "{client}_warehouse"
schema_template = "staging__{regions}__{connector}"

How parsing works

Given a source schema src__acme__us_west__shopify:

Rocky strips the prefix src__
Splits on the separator __ to get segments: ["acme", "us_west", "shopify"]
Maps segments to components:
- client = "acme" (single segment)
- regions = ["us_west"] (variable-length, marked with ...)
- connector = "shopify" (terminal segment)
Resolves target templates:
- {client}_warehouse becomes acme_warehouse
- staging__{regions}__{connector} becomes staging__us_west__shopify
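
The parsing steps above can be sketched in a few lines of Python. This is an illustrative model, not Rocky's implementation: parse_schema and render are hypothetical names, and the sketch assumes a pattern has at most one variable-length component.

```python
from typing import Dict, List, Union

Component = Union[str, List[str]]


def parse_schema(name: str, prefix: str, separator: str,
                 components: List[str]) -> Dict[str, Component]:
    """Map a source schema name onto named components.

    A component ending in "..." is variable-length and absorbs every
    segment not claimed by the fixed components around it.
    """
    if not name.startswith(prefix):
        raise ValueError(f"{name!r} does not start with prefix {prefix!r}")
    segments = name[len(prefix):].split(separator)

    variadic = [c for c in components if c.endswith("...")]
    if len(variadic) > 1:
        raise ValueError("this sketch supports one variable-length component")
    fixed = len(components) - len(variadic)
    # A variable-length component must capture at least one segment.
    enough = len(segments) >= fixed + 1 if variadic else len(segments) == fixed
    if not enough:
        raise ValueError(f"{name!r} does not match components {components}")

    parsed: Dict[str, Component] = {}
    pos = 0
    for comp in components:
        if comp.endswith("..."):
            take = len(segments) - fixed
            parsed[comp[:-3]] = segments[pos:pos + take]
            pos += take
        else:
            parsed[comp] = segments[pos]
            pos += 1
    return parsed


def render(template: str, parsed: Dict[str, Component], separator: str) -> str:
    """Fill a target template; multi-valued components are joined by the separator."""
    values = {k: separator.join(v) if isinstance(v, list) else v
              for k, v in parsed.items()}
    return template.format(**values)


parsed = parse_schema("src__acme__us_west__shopify", "src__", "__",
                      ["client", "regions...", "connector"])
print(render("{client}_warehouse", parsed, "__"))               # acme_warehouse
print(render("staging__{regions}__{connector}", parsed, "__"))  # staging__us_west__shopify
```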

Multi-region examples

The ... suffix on regions marks it as variable-length: it captures one or more segments between the fixed components:

Source Schema	client	regions	connector
`src__acme__us_west__shopify`	`acme`	`["us_west"]`	`shopify`
`src__acme__us_west__us_east__shopify`	`acme`	`["us_west", "us_east"]`	`shopify`
`src__globex__emea__france__paris__zendesk`	`globex`	`["emea", "france", "paris"]`	`zendesk`

Multi-valued regions are joined by the separator in the target schema:

staging__us_west__us_east__shopify
staging__emea__france__paris__zendesk

Custom patterns

The component names are configurable. Use whatever matches your naming convention:

[pipeline.bronze.source.schema_pattern]
prefix = "raw__"
separator = "__"
components = ["environment", "department", "system"]

[pipeline.bronze.target]
adapter = "prod"
catalog_template = "{environment}_analytics"
schema_template = "{department}__{system}"

This maps raw__prod__finance__sap to prod_analytics.finance__sap.

2. Data Contracts

Data contracts enforce schema stability at compile time. They declare which columns must exist, what types they must have, and which columns are protected from removal.

Create a contract

Create a <model_name>.contract.toml file in the contracts/ directory. The part of the file name before .contract.toml should match the model name:

[[columns]]
name = "order_date"
type = "Date"
nullable = false

[[columns]]
name = "category"
type = "String"
nullable = false

[[columns]]
name = "revenue"
type = "Decimal"
nullable = false

[[columns]]
name = "order_count"
type = "Int64"
nullable = false

[rules]
required = ["order_date", "category", "revenue", "order_count"]
protected = ["order_date", "revenue"]

Contract rules

Rule	Description
required	Column must exist in the model’s output with the specified type. Compilation fails if missing or wrong type.
protected	Column cannot be removed from the model in future changes. If a protected column disappears, compilation fails with error `E013`.
nullable	When `false`, the compiler verifies the column is non-nullable in the type system.

Compile with contracts

rocky compile --models models --contracts contracts

Violations produce compiler errors:

  error[E011]: column 'revenue' type mismatch: contract expects Decimal, got String
    = help: CAST `revenue` to Decimal in the SELECT, or update the contract's expected type

  error[E013]: protected column 'order_date' has been removed
    = help: restore `order_date` in the SELECT, or remove it from `[rules] protected`
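
To make the rule semantics concrete, here is a minimal Python sketch of the checks a contract implies. It is not the Rocky compiler: check_contract is a hypothetical helper, the sketch assumes the previous column list of the model is known, and only the E011 and E013 codes come from the examples above (the message for a missing required column is left without a code because the source does not state one).

```python
from typing import Dict, List


def check_contract(output: Dict[str, str], previous: List[str],
                   contract: Dict[str, str], required: List[str],
                   protected: List[str]) -> List[str]:
    """Return contract violations for one model.

    output   -- column name -> inferred type of the current model
    previous -- column names the model produced before this change (assumed known)
    contract -- column name -> type declared under [[columns]]
    """
    errors: List[str] = []
    for col in required:
        if col not in output:
            errors.append(f"required column '{col}' is missing")
        elif col in contract and output[col] != contract[col]:
            errors.append(f"error[E011]: column '{col}' type mismatch: "
                          f"contract expects {contract[col]}, got {output[col]}")
    for col in protected:
        # A protected column that existed before must still be present.
        if col in previous and col not in output:
            errors.append(f"error[E013]: protected column '{col}' has been removed")
    return errors


contract = {"order_date": "Date", "category": "String",
            "revenue": "Decimal", "order_count": "Int64"}
for line in check_contract(
        output={"category": "String", "revenue": "String", "order_count": "Int64"},
        previous=list(contract),
        contract=contract,
        required=list(contract),
        protected=["order_date", "revenue"]):
    print(line)
```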

Contract validation in CI

Add contract validation to your CI pipeline:

rocky ci --models models --contracts contracts

This catches contract violations before code reaches production.

3. Grants (Pillar 1 of 5)

Rocky manages Databricks Unity Catalog permissions declaratively. Define desired grants in rocky.toml and Rocky reconciles them during each rocky apply.

Catalog-level grants

Applied to every managed catalog created by the pipeline:

[[pipeline.bronze.target.governance.grants]]
principal = "data_engineers"
permissions = ["USE CATALOG", "MANAGE"]

[[pipeline.bronze.target.governance.grants]]
principal = "analysts"
permissions = ["BROWSE", "USE CATALOG"]

[[pipeline.bronze.target.governance.grants]]
principal = "ml_team"
permissions = ["BROWSE", "USE CATALOG", "SELECT"]

Schema-level grants

Applied to every managed schema created by the pipeline:

[[pipeline.bronze.target.governance.schema_grants]]
principal = "data_engineers"
permissions = ["USE SCHEMA", "SELECT", "MODIFY"]

[[pipeline.bronze.target.governance.schema_grants]]
principal = "analysts"
permissions = ["USE SCHEMA", "SELECT"]

Reconciliation flow

During rocky apply, for each managed catalog and schema:

Read desired permissions from [pipeline.<name>.target.governance.grants] and [pipeline.<name>.target.governance.schema_grants]
Query current state with SHOW GRANTS ON CATALOG and SHOW GRANTS ON SCHEMA
Compute diff: Determine which grants to add and which to revoke
Apply the diff. On Databricks, Rocky reconciles catalog- and schema-level grants through the Unity Catalog permissions API as a single batched request per securable, grouped by principal. On warehouses without a REST permissions API, it emits the equivalent GRANT and REVOKE SQL. The privilege effect is identical either way; only the transport differs (you will see PATCH requests in Databricks audit logs rather than GRANT statements).

-- Equivalent SQL (the form emitted on SQL-only warehouses)
GRANT SELECT ON CATALOG `acme_warehouse` TO `analysts`;
GRANT USE SCHEMA ON SCHEMA `acme_warehouse`.`staging__us_west__shopify` TO `analysts`;
REVOKE MODIFY ON CATALOG `acme_warehouse` FROM `temp_access`;

Managed vs skipped permissions

Managed (Rocky controls)	Skipped (Rocky ignores)
`BROWSE`	`OWNERSHIP`
`USE CATALOG`	`ALL PRIVILEGES`
`USE SCHEMA`	`CREATE SCHEMA`
`SELECT`
`MODIFY`
`MANAGE`

Skipped permissions are never granted or revoked by Rocky. This prevents Rocky from interfering with ownership or admin-level grants.
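
The diff step of the reconciliation flow, combined with the managed/skipped split, can be illustrated as plain set arithmetic. This is a sketch under assumptions: diff_grants is a hypothetical name, grants are modelled as a principal-to-permissions map for a single securable, and skipped permissions are simply filtered out on both sides.

```python
from typing import Dict, Set, Tuple

# Permissions Rocky manages, per the table above; everything else is skipped.
MANAGED = {"BROWSE", "USE CATALOG", "USE SCHEMA", "SELECT", "MODIFY", "MANAGE"}

Grants = Dict[str, Set[str]]  # principal -> permissions on one securable


def diff_grants(desired: Grants, current: Grants) -> Tuple[Grants, Grants]:
    """Return (to_grant, to_revoke), considering managed permissions only."""
    to_grant: Grants = {}
    to_revoke: Grants = {}
    for principal in sorted(set(desired) | set(current)):
        want = desired.get(principal, set()) & MANAGED
        have = current.get(principal, set()) & MANAGED
        if want - have:
            to_grant[principal] = want - have
        if have - want:
            to_revoke[principal] = have - want
    return to_grant, to_revoke


desired = {"analysts": {"BROWSE", "USE CATALOG"}}
current = {"analysts": {"USE CATALOG"},
           "temp_access": {"MODIFY", "ALL PRIVILEGES"}}
to_grant, to_revoke = diff_grants(desired, current)
print(to_grant)   # {'analysts': {'BROWSE'}}
print(to_revoke)  # {'temp_access': {'MODIFY'}}
```

Because ALL PRIVILEGES is not in the managed set, it never appears in either side of the diff, which matches the rule that skipped permissions are neither granted nor revoked.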

Principal validation

Principal names must match the pattern ^[a-zA-Z0-9_ \-\.@]+$. In generated SQL, principals are always wrapped in backticks to handle spaces and special characters:

GRANT USE CATALOG ON CATALOG acme_warehouse TO `data engineers`
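
As an illustration of the validate-then-quote rule, the following Python sketch applies the pattern quoted above before wrapping the principal in backticks. quote_principal is a hypothetical helper, not a Rocky function.

```python
import re

# Pattern quoted in the text above.
PRINCIPAL_RE = re.compile(r"^[a-zA-Z0-9_ \-\.@]+$")


def quote_principal(principal: str) -> str:
    """Validate a principal name, then wrap it in backticks for SQL."""
    # fullmatch also rejects a trailing newline, which "$" alone would tolerate.
    if PRINCIPAL_RE.fullmatch(principal) is None:
        raise ValueError(f"invalid principal name: {principal!r}")
    return f"`{principal}`"


print(f"GRANT USE CATALOG ON CATALOG acme_warehouse TO {quote_principal('data engineers')}")
```

Since the pattern excludes backticks, quotes, and semicolons, a validated name cannot break out of the backtick-quoted identifier.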

4. Column Classification and Masking (Pillar 2 of 5)

Classification tags identify sensitive columns; masking strategies decide how those columns are obfuscated in the warehouse. Rocky splits the two concerns so teams can tag columns for discovery and lineage without committing to a specific obfuscation policy, then map tags to strategies in one place (with per-environment overrides).

Shipped in engine-v1.16.0. Currently implemented on Databricks; other adapters default to no-op.

Tag columns in the model sidecar

Classification tags live in the model’s .toml sidecar under a [classification] block. Keys are column names, values are free-form tag strings – Rocky does not enforce a fixed vocabulary:

name = "customers"

[classification]
pii_email = "pii"
phone = "pii"
ssn = "confidential"
home_address = "pii"

The tag strings (pii, confidential, and so on) are matched against the project-level [mask] block to pick a masking strategy. Teams can coin new tags (financial, health, internal) without touching the engine.

Map tags to masking strategies

Project-level [mask] in rocky.toml binds classification tags to masking strategies. A scalar value sets the workspace default; a nested [mask.<env>] table overrides strategies for a specific environment:

[mask]
pii = "hash"             # default: SHA-256 hash of the value
confidential = "redact"  # default: replace with '***'

[mask.prod]
pii = "none"             # prod override: do not mask pii
confidential = "partial" # keep first/last 2 chars, mask the middle

Rocky resolves per-environment masks via RockyConfig::resolve_mask_for_env: top-level scalars become defaults, then any matching [mask.<env>] table overlays same-key values. When no env is passed, only the defaults apply.
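
The overlay rule can be sketched in Python as follows. This mirrors the described behaviour of RockyConfig::resolve_mask_for_env but is not its source: the function below is a hypothetical stand-in that takes the parsed [mask] block as a dictionary, where scalar values are defaults and nested tables are per-environment overrides.

```python
from typing import Dict, Optional


def resolve_mask_for_env(mask: Dict[str, object],
                         env: Optional[str] = None) -> Dict[str, str]:
    """Resolve tag -> strategy for one environment."""
    # Top-level scalars are the defaults.
    resolved = {tag: s for tag, s in mask.items() if isinstance(s, str)}
    # A matching [mask.<env>] table overlays same-key values.
    override = mask.get(env) if env is not None else None
    if isinstance(override, dict):
        resolved.update(override)
    return resolved


mask = {
    "pii": "hash",
    "confidential": "redact",
    "prod": {"pii": "none", "confidential": "partial"},
}
print(resolve_mask_for_env(mask))          # {'pii': 'hash', 'confidential': 'redact'}
print(resolve_mask_for_env(mask, "prod"))  # {'pii': 'none', 'confidential': 'partial'}
```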

Supported strategies

Strategy	Emitted SQL behaviour
`"hash"`	SHA-256 hash of the column value.
`"redact"`	Replace with the literal `'***'`.
`"partial"`	Keep the first and last 2 characters; mask the middle.
`"none"`	Explicit identity – no masking applied. Counts as masked for compliance.

Unknown strategy spellings (e.g., "mask", "obfuscate") hard-fail at config load time. Rocky never silently accepts a strategy it cannot emit SQL for.

Allowed unmasked tags

The [classifications] block carries an escape hatch for tags that are used purely for discovery/lineage and are not expected to have a matching [mask] strategy:

[classifications]
allow_unmasked = ["internal", "public"]

Any tag listed here suppresses the W004 “tag has no masking strategy” compiler warning. This is advisory only: it silences the warning but does not mask the columns or mark them as enforced.

How apply works

After the DAG completes successfully, rocky apply iterates each model’s [classification] block and calls the governance adapter’s apply_column_tags and apply_masking_policy hooks. Both are best-effort: failures emit warn! and the pipeline continues, mirroring the apply_grants semantics.

On Databricks, Rocky uses Unity Catalog column tags plus CREATE MASK / SET MASKING POLICY, with one statement per column – UC rejects multi-column masking DDL in a single statement. BigQuery, Snowflake, and DuckDB silently no-op until adapter-specific coverage lands.

See the configuration reference for the full schema of the [mask] and [classifications] blocks.

5. Compliance Rollup (Pillar 3 of 5)

rocky compliance is a static resolver that answers one question: are all classified columns masked wherever policy says they should be?

It is a thin rollup over the classifications and masks configuration. No warehouse calls, no network round-trips. Shipped in engine-v1.16.0.

Basic usage

rocky compliance

Compliance report (env: <all>)
  models scanned:       42
  classified columns:   87
  with strategy:        84
  exceptions:           3

EXCEPTIONS:
  customers.pii_email    (prod)  no strategy for classification 'pii'
  orders.card_last_four  (prod)  no strategy for classification 'financial'
  users.ssn              (dev)   no strategy for classification 'confidential'

Flags

Flag	Purpose
`--env <name>`	Scope the report to a single environment. Without it, Rocky expands across the defaults plus every `[mask.<env>]` override.
`--exceptions-only`	Filter the `per_column` table to rows that produced at least one exception. The `exceptions` list itself is always shown.
`--fail-on exception`	Exit with code `1` when any exception is emitted. Wire this into CI to block merges that leave classified columns unmasked.
`--models <dir>`	Models directory to scan (defaults to `models/`).

Exit codes

Exit code	Meaning
`0`	Report produced. Exceptions may or may not be present – exit stays 0 unless `--fail-on exception` is passed.
`1`	`--fail-on exception` was set and at least one exception was emitted.

How `none` counts

MaskStrategy::None (explicit identity) counts as masked for compliance purposes. The rationale: choosing “do not mask” is a deliberate policy decision, not a gap. A tag with no mapping in [mask] at all is the gap that produces an exception.

The [classifications] allow_unmasked = [...] list suppresses exceptions for tags you’ve deliberately excluded from the mask policy, without pretending the columns are enforced.
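
The rollup logic described in this section can be approximated by the Python sketch below. It is a simplified model rather than the rocky compliance implementation: compliance_exceptions is a hypothetical name, classified columns are passed in as a (model, column)-to-tag map, and the "default" label for the no-environment case is an assumption.

```python
from typing import Dict, List, Optional, Tuple


def compliance_exceptions(
    classified: Dict[Tuple[str, str], str],  # (model, column) -> classification tag
    mask: Dict[str, object],                 # parsed [mask] block
    allow_unmasked: List[str],
    env: Optional[str] = None,
) -> List[str]:
    """List classified columns whose tag has no strategy in some environment."""
    defaults = {k: v for k, v in mask.items() if isinstance(v, str)}
    overrides = {k: v for k, v in mask.items() if isinstance(v, dict)}
    # Without an env, expand across the defaults plus every [mask.<env>] override.
    envs = [env] if env is not None else [None, *sorted(overrides)]

    exceptions: List[str] = []
    for (model, column), tag in sorted(classified.items()):
        if tag in allow_unmasked:
            continue  # deliberately excluded from the mask policy
        for e in envs:
            resolved = {**defaults, **overrides.get(e, {})}
            # Any mapping counts as masked, including the explicit "none".
            if tag not in resolved:
                label = e if e is not None else "default"
                exceptions.append(
                    f"{model}.{column} ({label}) no strategy for classification '{tag}'")
    return exceptions


mask = {"pii": "hash", "prod": {"pii": "none"}}
classified = {("customers", "pii_email"): "pii",
              ("orders", "card_last_four"): "financial",
              ("docs", "title"): "internal"}
for line in compliance_exceptions(classified, mask, ["internal", "public"], env="prod"):
    print(line)  # orders.card_last_four (prod) no strategy for classification 'financial'
```

A CI wrapper in the spirit of --fail-on exception would exit non-zero whenever the returned list is non-empty.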

JSON output

rocky compliance --env prod --output json

The JSON payload is the ComplianceOutput schema: a summary block with counters, a per_column array, and an exceptions array. Use this for dashboards and CI step summaries.

6. Role-Graph Reconciliation (Pillar 4 of 5)

Rocky supports hierarchical role declarations that flatten into a resolved permission set per role. Inheritance is declarative and composable; cycles and unknown parents are rejected at config-load time.

Shipped in engine-v1.16.0. When a SCIM client is configured, the Databricks adapter provisions rocky_role_* SCIM groups and emits add-only per-catalog GRANT statements from the flattened role graph. Groups and grants are never deleted — removal requires manual cleanup. When SCIM is not configured, the adapter falls back to log-only: it validates the flattened graph and emits debug! events without touching the warehouse.

Declare roles in `rocky.toml`

[role.reader]
permissions = ["SELECT", "USE CATALOG", "USE SCHEMA"]

[role.analytics_engineer]
inherits = ["reader"]
permissions = ["MODIFY"]

[role.admin]
inherits = ["analytics_engineer"]
permissions = ["MANAGE"]

Each [role.<name>] block declares:

inherits – a list of immediate parent roles. Rocky walks these transitively.
permissions – a list of canonical Rocky permission strings ("SELECT", "USE CATALOG", "MODIFY", "MANAGE", …).

Roles with empty permissions are legal – they act as grouping nodes that exist only for inheritance.

Resolution semantics

At reconcile time, Rocky calls RockyConfig::role_graph() which flattens the [role.*] map into a deterministic name → ResolvedRole map:

Walk the inherits DAG via DFS with cycle detection.
Union this role’s permissions with every transitive ancestor’s permissions.
Reject unknown parents (e.g., inherits = ["nonexistent_role"]).
Reject unknown permission spellings.

Cycles and unknown parents are caught at config-load time, regardless of whether the target adapter supports role-graph reconcile. This means the resolver catches misconfiguration even on warehouses where the adapter silently no-ops.
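
The resolution steps above amount to a depth-first walk with cycle detection. The Python sketch below illustrates the idea; it is not RockyConfig::role_graph itself. flatten_roles is a hypothetical name, and the set of known permission spellings is assumed to be the managed permissions listed in the Grants section.

```python
from typing import Dict, List, Set

# Assumed vocabulary: the managed permissions from the Grants section.
KNOWN = {"BROWSE", "USE CATALOG", "USE SCHEMA", "SELECT", "MODIFY", "MANAGE"}


def flatten_roles(roles: Dict[str, Dict[str, List[str]]]) -> Dict[str, Set[str]]:
    """Resolve each role to the union of its own and inherited permissions."""
    resolved: Dict[str, Set[str]] = {}
    visiting: Set[str] = set()  # roles on the current DFS path

    def visit(name: str) -> Set[str]:
        if name in resolved:
            return resolved[name]
        if name in visiting:
            raise ValueError(f"cycle detected at role '{name}'")
        visiting.add(name)
        role = roles[name]
        perms = set(role.get("permissions", []))
        unknown = perms - KNOWN
        if unknown:
            raise ValueError(f"role '{name}' has unknown permissions: {sorted(unknown)}")
        for parent in role.get("inherits", []):
            if parent not in roles:
                raise ValueError(f"role '{name}' inherits unknown role '{parent}'")
            perms |= visit(parent)
        visiting.discard(name)
        resolved[name] = perms
        return perms

    for name in sorted(roles):  # sorted for deterministic traversal
        visit(name)
    return resolved


roles = {
    "reader": {"permissions": ["SELECT", "USE CATALOG", "USE SCHEMA"]},
    "analytics_engineer": {"inherits": ["reader"], "permissions": ["MODIFY"]},
    "admin": {"inherits": ["analytics_engineer"], "permissions": ["MANAGE"]},
}
print(sorted(flatten_roles(roles)["admin"]))
# ['MANAGE', 'MODIFY', 'SELECT', 'USE CATALOG', 'USE SCHEMA']
```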

Databricks reconcile

The Databricks reconcile_role_graph validates each flattened role’s rocky_role_<name> principal syntax and, when a SCIM client is configured, runs a two-pass reconcile:

Pass 1 — create a rocky_role_<name> SCIM group per role (best-effort per role).
Pass 2 — emit an add-only per-catalog GRANT <permission> ON CATALOG ... for every (role, catalog, permission) triple.

These are add-only (v1) semantics: groups and grants are never revoked, so removing a role or permission from rocky.toml requires manual cleanup on the warehouse. When no SCIM client is configured, the adapter falls back to log-only — it validates and logs the resolved permission set without emitting any GRANTs. Other adapters default to no-op.

7. Data Retention (Pillar 5 of 5)

Data retention policies tell the warehouse how long to keep historical data for each table. Rocky expresses retention as a single sidecar key; each adapter translates it to the warehouse-native TBLPROPERTIES or session parameter.

Shipped in engine-v1.16.0.

Declare retention on a model

Model sidecars take a top-level retention key:

name = "events_daily"
retention = "90d"   # grammar: \d+[dy] -- days or years

Grammar:

<N>d – N days
<N>y – N years; flat-multiplied to 365 days each (no leap-year math)

Garbage inputs ("abc", "90", "-3d") are rejected at sidecar parse time via ModelError::InvalidRetention.

Omitting the retention key (or setting it to null) disables retention management for that model – Rocky leaves the warehouse’s default behaviour in place.
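
The retention grammar is small enough to sketch directly. The Python below is an illustration of the stated rules, not Rocky's parser: parse_retention is a hypothetical name, and a plain ValueError stands in for ModelError::InvalidRetention.

```python
import re

# <N>d or <N>y, where N is one or more ASCII digits.
_RETENTION_RE = re.compile(r"([0-9]+)([dy])")


def parse_retention(value: str) -> int:
    """Return the retention period in days."""
    m = _RETENTION_RE.fullmatch(value)
    if m is None:
        raise ValueError(f"invalid retention: {value!r}")
    n, unit = int(m.group(1)), m.group(2)
    # Years are flat-multiplied to 365 days (no leap-year math).
    return n * 365 if unit == "y" else n


print(parse_retention("90d"))  # 90
print(parse_retention("1y"))   # 365
```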

Adapter translation

Adapter	Translation
Databricks	Paired Delta TBLPROPERTIES: `delta.logRetentionDuration = '<N> days'` and `delta.deletedFileRetentionDuration = '<N> days'`. Applied via `ALTER TABLE ... SET TBLPROPERTIES`.
Snowflake	`DATA_RETENTION_TIME_IN_DAYS = <N>` via `ALTER TABLE ... SET`.
BigQuery	Default-unsupported. No first-class retention knob; sidecar ignored with a `warn!`.
DuckDB	Default-unsupported. Sidecar ignored with a `warn!`.

Retention apply runs after the DAG completes, in the same post-run reconcile loop as classification + masking. Failures emit warn! and never abort the run.

Inspecting configured retention: `rocky retention-status`

rocky retention-status

MODEL              CONFIGURED   WAREHOUSE   IN SYNC
──────────────────────────────────────────────────────
events_daily       90 days      -           no
orders             365 days     -           yes
customers          -            -           yes

Without --drift, the warehouse is not probed: the WAREHOUSE column shows - and IN SYNC is computed with no observed warehouse value to compare against.

Flags:

Flag	Purpose
`--models <dir>`	Models directory (defaults to `models/`).
`--model <name>`	Scope the report to a single model.
`--drift`	Probe the warehouse for the applied retention, fill `warehouse_days`, and filter the report to models with a declared policy.

`--drift` probes the warehouse

With --drift, Rocky resolves a governance adapter per model and reads the currently applied TBLPROPERTIES / session parameter, filling warehouse_days and recomputing in_sync so teams can detect drift between the retention declared in the model sidecar and the live table. The probe is implemented for Databricks and Snowflake only: DuckDB and BigQuery inherit the default no-observation impl, so --drift leaves warehouse_days empty on those targets. Probe errors surface per model on stderr but do not fail the command.

8. Workspace Isolation

Rocky can isolate catalogs to specific Databricks workspaces using the Unity Catalog workspace bindings API. Each binding declares both a workspace ID and an access level (READ_WRITE or READ_ONLY).

[pipeline.bronze.target.governance.isolation]
enabled = true

[[pipeline.bronze.target.governance.isolation.workspace_ids]]
id = 123456789
binding_type = "READ_WRITE"

[[pipeline.bronze.target.governance.isolation.workspace_ids]]
id = 987654321
binding_type = "READ_ONLY"

binding_type defaults to "READ_WRITE" if omitted and maps to the Databricks API values BINDING_TYPE_READ_WRITE and BINDING_TYPE_READ_ONLY.

When enabled, Rocky:

Sets each managed catalog’s isolation mode to ISOLATED via PATCH /api/2.1/unity-catalog/catalogs/{name}
Binds each catalog to the specified workspaces with their declared access level via PATCH /api/2.1/unity-catalog/bindings/catalog/{name}

This prevents other workspaces from accessing the catalog. Only the listed workspaces can read (or, where READ_WRITE, write) data.

When to use isolation

Multi-workspace environments: Different teams or environments have separate workspaces
Compliance requirements: Data must not be accessible from unauthorized workspaces
Development/production separation: Prevent dev workspaces from touching production catalogs

Isolation is applied on a best-effort basis – if an API call fails (e.g., a workspace ID does not exist), Rocky logs a warning and continues the run.

9. Tagging Strategy

Tags are key-value pairs applied to catalogs, schemas, and tables using Databricks ALTER ... SET TAGS SQL.

Configuration

[pipeline.bronze.target.governance.tags]
managed_by = "rocky"
data_owner = "analytics-team"
environment = "production"
cost_center = "CC-1234"

What gets tagged

Tags are applied at three levels during rocky apply:

Level	SQL	Applied Tags
Catalogs	`ALTER CATALOG ... SET TAGS (...)`	Governance tags + parsed schema components
Schemas	`ALTER SCHEMA ... SET TAGS (...)`	Governance tags + parsed schema components
Tables	`ALTER TABLE ... SET TAGS (...)`	Governance tags only

Example generated SQL

ALTER CATALOG acme_warehouse SET TAGS (
    'managed_by' = 'rocky',
    'data_owner' = 'analytics-team',
    'environment' = 'production',
    'client' = 'acme'
);

ALTER SCHEMA acme_warehouse.staging__us_west__shopify SET TAGS (
    'managed_by' = 'rocky',
    'data_owner' = 'analytics-team',
    'connector' = 'shopify',
    'regions' = 'us_west'
);
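
For illustration, statements of this shape can be rendered from a tag map with a few lines of Python. set_tags_sql is a hypothetical helper, not part of Rocky; to stay clear of SQL escaping rules, the sketch rejects keys and values containing quotes or backslashes instead of trying to escape them.

```python
from typing import Dict


def set_tags_sql(kind: str, name: str, tags: Dict[str, str]) -> str:
    """Render an ALTER <kind> ... SET TAGS statement from a tag map."""
    if not tags:
        raise ValueError("SET TAGS needs at least one tag")
    for text in [*tags.keys(), *tags.values()]:
        if "'" in text or "\\" in text:
            raise ValueError(f"unsupported character in tag text: {text!r}")
    pairs = ",\n".join(f"    '{k}' = '{v}'" for k, v in tags.items())
    return f"ALTER {kind} {name} SET TAGS (\n{pairs}\n);"


governance = {"managed_by": "rocky", "data_owner": "analytics-team"}
components = {"client": "acme"}
# Catalogs receive the governance tags plus the parsed schema components.
print(set_tags_sql("CATALOG", "acme_warehouse", {**governance, **components}))
```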

Using tags for discovery

Rocky uses tags to discover managed catalogs. The managed_by = "rocky" tag is queried via:

SELECT catalog_name
FROM system.information_schema.catalog_tags
WHERE tag_name = 'managed_by' AND tag_value = 'rocky'

This means you can deploy Rocky across multiple catalogs and discover all managed catalogs by their tag.

Tagging best practices

Always include managed_by = "rocky" so Rocky can discover its own catalogs
Use environment to distinguish dev/staging/prod
Use data_owner to track responsibility
Use cost_center for chargeback and FinOps
Add custom tags for compliance (e.g., pii = "true", data_classification = "internal")

10. Config Groups and Enforcement

A config group is one definition that a fan-out of models opts into by name (group = "<name>" in the sidecar). It supplies shared routing (schema_template) and a shared strategy, so a set of models route and materialize the same way without repeating the config. The full reference lives in the model format guide; this section covers the governance angle.

Enforced config groups

By default a group is an overridable default: a member model can pin its own target.schema or strategy and the local value wins. Set enforce = true to make the group’s fields binding instead. A member that locally pins a field the group controls then fails the load rather than quietly routing or materializing itself differently from the rest of the group:

enforce = true
schema_template = "mart_{region}"

[strategy]
type = "merge"
unique_key = ["id"]

Enforcement covers exactly the two fields the group owns: the target schema (when the group sets schema_template) and the strategy. A member that locally sets either one fails the load with a GroupOverride error. The model can still supply its own [args] to fill the template and set any field the group does not own (such as target.catalog); it just cannot override the schema routing or materialization that the group governs.

This is a load-time guarantee in the same family as data contracts: the check runs when the model graph loads, so an off-policy override is rejected before any SQL reaches the warehouse rather than surfacing as drift later. Enforcement is strictly opt-in. Without enforce, groups stay overridable defaults.

It applies to every model in the group regardless of whether the model is written in SQL or the .rocky DSL. The group governs routing and materialization, not the model body.
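
The enforcement rule can be pictured as a load-time check along the following lines. This Python sketch is an assumption-laden model: check_member is a hypothetical function, sidecars and groups are plain dictionaries, and the GroupOverride class here only borrows the name of the error mentioned above.

```python
from typing import Dict


class GroupOverride(Exception):
    """Raised when a member pins a field that an enforced group owns."""


def check_member(model: str, sidecar: Dict[str, object],
                 group: Dict[str, object]) -> None:
    """Reject local overrides of group-owned fields when enforce = true."""
    if not group.get("enforce", False):
        return  # without enforce, group values are overridable defaults
    target = sidecar.get("target", {})
    if "schema_template" in group and isinstance(target, dict) and "schema" in target:
        raise GroupOverride(f"{model}: target.schema is owned by the group")
    if "strategy" in group and "strategy" in sidecar:
        raise GroupOverride(f"{model}: strategy is owned by the group")


group = {"enforce": True, "schema_template": "mart_{region}",
         "strategy": {"type": "merge", "unique_key": ["id"]}}

# Allowed: target.catalog and [args] are not owned by the group.
check_member("fct_orders", {"target": {"catalog": "prod"}, "args": {"region": "emea"}}, group)

try:
    check_member("fct_orders", {"strategy": {"type": "full_refresh"}}, group)
except GroupOverride as err:
    print(err)  # fct_orders: strategy is owned by the group
```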

11. Model Tags

Model tags are free-form governance attributes that describe a model as a whole (domain, tier, owner, anything your governance model needs). They are distinct from the tagging strategy in section 9: those tags live under [pipeline.*.target.governance.tags] and land on Unity Catalog catalogs, schemas, and tables via ALTER ... SET TAGS, used for catalog discovery. Model tags live in the model sidecar (or its config group) and flow into Rocky’s model graph, the orchestrator’s asset tags, and the rocky compile JSON.

Sidecar `[tags]`

Declare model tags in a [tags] block in the model’s .toml sidecar. Keys and values are free-form strings:

name = "fct_orders"

[tags]
domain = "finance"
tier = "gold"
owner = "data-eng"

Config-group `[tags]` baseline

A config group can declare its own [tags] block. Every member model inherits the group’s tags as a shared baseline, so a governance attribute applied once on the group lands on the whole fan-out:

schema_template = "mart_{region}"

[tags]
domain = "finance"
tier = "gold"

Sidecar over group, per key

When a model belongs to a group, its resolved tags are the group’s [tags] with the model’s own [tags] merged on top, per key. A member can override a single inherited key without dropping the rest of the group’s tags: a model in the finance group above can set tier = "silver" in its sidecar and still inherit domain = "finance". Precedence mirrors the rest of the group resolution, sidecar over group.
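
The per-key precedence is equivalent to a dictionary merge in which the sidecar is applied last. A minimal Python sketch, with resolve_tags as a hypothetical name:

```python
from typing import Dict


def resolve_tags(group_tags: Dict[str, str],
                 sidecar_tags: Dict[str, str]) -> Dict[str, str]:
    """Group tags form the baseline; sidecar tags win per key."""
    return {**group_tags, **sidecar_tags}


group_tags = {"domain": "finance", "tier": "gold"}
print(resolve_tags(group_tags, {"tier": "silver"}))
# {'domain': 'finance', 'tier': 'silver'}
```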

Projection to Dagster

Resolved tags are emitted on rocky compile --output json as models_detail[].tags. The dagster-rocky integration projects them onto each derived asset’s first-class Dagster tags, so the same attribute is usable in asset selection (for example tag:domain=finance). Alongside the governance tags, the translator synthesizes rocky/-namespaced tags for the model name, target catalog, target schema, and strategy. The rocky/ prefix keeps those from ever colliding with a governance key. The result is that a tag applied once in a sidecar or group is visible end-to-end, from the typed model graph through rocky compile to the orchestrator.

Per-model warehouse tags: `[governance.tags]`

Model [tags] are orchestrator-facing and never touch the warehouse. When you want a tag written onto a model’s own target table or view in Unity Catalog, declare a [governance.tags] block in the model sidecar:

name = "fct_orders"

[governance.tags]
domain = "finance"
tier = "gold"

After the model materializes, rocky apply emits view-aware tag DDL against its target securable — ALTER VIEW ... SET TAGS (...) for view-format models, ALTER TABLE ... SET TAGS (...) otherwise. Keys and values are applied verbatim (no prefix). This is the per-model counterpart to the catalog- and schema-level tagging strategy above ([pipeline.*.target.governance.tags]).

The three tag surfaces are independent and reach different consumers — keep them apart:

Block	Where it lives	What it does
`[tags]`	Model sidecar / config group	Dagster asset tags + `rocky compile` JSON. Never written to the warehouse.
`[governance.tags]`	Model sidecar	`ALTER VIEW/TABLE ... SET TAGS` on the model’s own securable, post-materialize.
`[pipeline.*.target.governance.tags]`	Pipeline target	`ALTER CATALOG/SCHEMA/TABLE ... SET TAGS` during replication, used for catalog discovery.

Application of [governance.tags] is best-effort: a failure warns but never aborts the run, matching the classification and retention governance posture. An empty block is skipped (Unity Catalog rejects SET TAGS ()).

12. Quality Checks

Rocky runs data quality checks inline during replication. Checks execute immediately after each table is copied, and results are included in the run output.

Configuration

[pipeline.bronze.checks]
enabled = true
row_count = true
column_match = true
freshness = { threshold_seconds = 86400 }
anomaly_threshold_pct = 50.0

Check types

Row count

Compares COUNT(*) between source and target tables. Uses batched UNION ALL queries (200 tables per batch) for efficiency:

{
  "name": "row_count",
  "passed": true,
  "source_count": 15000,
  "target_count": 15000
}

Column match

Compares column sets between source and target (case-insensitive). Reports missing or extra columns. Uses cached columns from drift detection – no additional query needed:

{
  "name": "column_match",
  "passed": false,
  "missing": ["new_column"],
  "extra": []
}
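
A case-insensitive column comparison of this kind can be sketched as below. column_match is a hypothetical function, and the sketch assumes that "missing" means present in the source but absent from the target, with "extra" the reverse.

```python
from typing import Dict, Iterable


def column_match(source_cols: Iterable[str],
                 target_cols: Iterable[str]) -> Dict[str, object]:
    """Compare column sets case-insensitively and report differences."""
    src = {c.lower() for c in source_cols}
    tgt = {c.lower() for c in target_cols}
    missing = sorted(src - tgt)  # in source, not yet in target (assumed meaning)
    extra = sorted(tgt - src)    # in target, no longer in source
    return {"name": "column_match",
            "passed": not missing and not extra,
            "missing": missing,
            "extra": extra}


print(column_match(["ID", "Email", "new_column"], ["id", "email"]))
# {'name': 'column_match', 'passed': False, 'missing': ['new_column'], 'extra': []}
```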

Freshness

Checks the time since the last data update by comparing MAX(timestamp_column) against the current time:

freshness = { threshold_seconds = 86400 }  # 24 hours

A table that has not received new data within the threshold is flagged:

{
  "name": "freshness",
  "passed": false,
  "lag_seconds": 172800,
  "threshold_seconds": 86400
}

Null rate

Samples the table using TABLESAMPLE and calculates the null percentage per column:

[pipeline.bronze.checks]
null_rate = { columns = ["email", "phone"], threshold = 0.05, sample_percent = 10 }

The sample_percent keeps the query fast even on large tables.

Anomaly detection

Compares the current row count against a historical moving average. If the deviation exceeds the threshold, Rocky flags it:

anomaly_threshold_pct = 50.0  # Flag if count changes by more than 50%

This catches:

Source tables being truncated (count drops to near zero)
Bad syncs duplicating data (count spikes)
Connectors stopping (count stays flat)
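
The threshold comparison can be sketched as a percentage deviation from the mean of recent counts. This is one plausible formulation, not Rocky's exact algorithm: row_count_anomaly is a hypothetical name, and the window of history and the handling of an empty or zero baseline are assumptions.

```python
from typing import Sequence


def row_count_anomaly(current: int, history: Sequence[int],
                      threshold_pct: float) -> bool:
    """Flag the current row count if it deviates too far from the historical mean."""
    if not history:
        return False  # assumption: nothing to compare against on the first run
    baseline = sum(history) / len(history)
    if baseline == 0:
        return current != 0  # assumption: any rows after an empty baseline is a change
    deviation_pct = abs(current - baseline) / baseline * 100
    return deviation_pct > threshold_pct


history = [15000, 15200, 14980]
print(row_count_anomaly(15432, history, 50.0))  # False (about 2.5% above the mean)
print(row_count_anomaly(100, history, 50.0))    # True  (truncated source)
print(row_count_anomaly(31000, history, 50.0))  # True  (duplicated sync)
```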

Custom checks

User-provided SQL queries with a {target} placeholder:

[[pipeline.bronze.checks.custom]]
name = "no_future_dates"
sql = "SELECT COUNT(*) FROM {target} WHERE order_date > CURRENT_DATE()"
threshold = 0

[[pipeline.bronze.checks.custom]]
name = "revenue_positive"
sql = "SELECT COUNT(*) FROM {target} WHERE revenue < 0"
threshold = 0

The check passes if the query result is less than or equal to the threshold.
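
The evaluation rule for a custom check is simple enough to show in Python. run_custom_check is a hypothetical helper, and the execute callable stands in for whatever runs the query against the warehouse and returns a single number.

```python
from typing import Callable, Dict


def run_custom_check(name: str, sql: str, threshold: float, target: str,
                     execute: Callable[[str], float]) -> Dict[str, object]:
    """Substitute {target}, run the query, and compare the result to the threshold."""
    query = sql.replace("{target}", target)
    result = execute(query)
    return {"name": name, "passed": result <= threshold,
            "result": result, "threshold": threshold}


# Stub executor for illustration: pretends the query found 3 offending rows.
outcome = run_custom_check(
    name="revenue_positive",
    sql="SELECT COUNT(*) FROM {target} WHERE revenue < 0",
    threshold=0,
    target="acme_warehouse.staging__us_west__shopify.orders",
    execute=lambda query: 3,
)
print(outcome["passed"])  # False
```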

13. Audit Trail

Rocky stores run history and quality metrics in the embedded state store (redb), providing a queryable audit trail. Every rocky apply now stamps eight extra governance fields on its RunRecord (shipped in engine-v1.16.0); the full trail is available via rocky history --audit.

`rocky history --audit` and the 8 audit fields

The default rocky history output stays compact for byte-stability with schema v5 consumers. Pass --audit to expand every governance field in text or JSON:

rocky history --audit
rocky history --audit --output json

Each RunRecord carries:

Field	Source
`triggering_identity`	Auth principal that kicked off the run.
`session_source`	Auto-detected: `Cli` / `Dagster` / `Lsp` / `HttpApi`.
`git_commit`	Resolved at run start from the current repo.
`git_branch`	Resolved at run start from the current repo.
`idempotency_key`	Echoed from `rocky plan --idempotency-key <KEY>` (or the single-step `rocky run --idempotency-key` alias) when passed.
`target_catalog`	The catalog(s) the run wrote to.
`hostname`	The host that executed the run.
`rocky_version`	The CLI version that produced the record.

Schema version v5 → v6 (forward-deserialize)

The audit trail expansion bumped the redb schema version from v5 to v6. The migration is forward-deserialize only – no in-place blob rewrite – so existing stores open cleanly. Defaults filled in on v5 rows:

hostname = "unknown"
rocky_version = "<pre-audit>"
session_source = Cli

This means old runs still render correctly under rocky history --audit; they simply show these placeholder values in the three defaulted fields.

View run history

rocky history

RUN ID       STARTED                  STATUS     MODELS   TRIGGER
────────────────────────────────────────────────────────────────────
abc12345678  2026-03-30 10:00:00      Completed  42       Scheduled
def98765432  2026-03-29 10:00:00      Completed  42       Scheduled
ghi11111111  2026-03-28 14:30:00      Failed     38       Manual

Total runs: 3

Filter by date

rocky history --since 2026-03-29

View model execution history

rocky history --model fct_daily_revenue

STARTED                  DURATION   ROWS         STATUS         SQL HASH
────────────────────────────────────────────────────────────────────────────
2026-03-30 10:00:00      2300ms     15432        succeeded      a1b2c3d4
2026-03-29 10:00:00      2100ms     15200        succeeded      a1b2c3d4
2026-03-28 14:30:00      0ms        -            failed         a1b2c3d4

Total executions: 3

View quality metrics

rocky metrics fct_daily_revenue

Latest snapshot (run: abc12345678):
  Row count: 15432
  Freshness lag: 300s
  Null rates:
    email: 2.10%
    phone: 15.30%

View quality trends

rocky metrics fct_daily_revenue --trend

TIMESTAMP                ROW COUNT    RUN ID     FRESHNESS
──────────────────────────────────────────────────────────────
2026-03-30 10:00:00      15432        abc123456  300s
2026-03-29 10:00:00      15200        def987654  280s
2026-03-28 10:00:00      14980        ghi111111  310s

View column-specific metrics

rocky metrics fct_daily_revenue --column email --alerts

Quality alerts

Pass --alerts to see quality issues:

rocky metrics fct_daily_revenue --alerts

Latest snapshot (run: abc12345678):
  Row count: 15432

ALERTS:
  [WARNING] null rate 25.0% exceeds 20% threshold (column: phone)

Alert severity levels:

critical: Null rate exceeds 50%
warning: Null rate exceeds 20%, or freshness lag exceeds 24 hours
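The severity rules above can be expressed as a short sketch. This is illustrative Python under stated assumptions, not Rocky's implementation: the function names are hypothetical, and "exceeds" is read as a strict greater-than comparison.

```python
from typing import Optional

# Thresholds taken from the severity list above.
CRITICAL_NULL_RATE = 0.50
WARNING_NULL_RATE = 0.20
WARNING_FRESHNESS_LAG_SECONDS = 24 * 60 * 60  # 24 hours


def null_rate_severity(null_rate: float) -> Optional[str]:
    """Classify a column null rate given as a fraction (0.25 means 25%)."""
    if null_rate > CRITICAL_NULL_RATE:
        return "critical"
    if null_rate > WARNING_NULL_RATE:
        return "warning"
    return None  # below both thresholds: no alert


def freshness_severity(lag_seconds: int) -> Optional[str]:
    """Classify a freshness lag; only a warning level is documented."""
    if lag_seconds > WARNING_FRESHNESS_LAG_SECONDS:
        return "warning"
    return None
```

Under this reading, the 25.0% null rate on phone in the sample output lands in the warning band, while the 300s freshness lag from the earlier snapshot raises nothing.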

JSON output

All history and metrics commands support JSON output for programmatic consumption:

rocky history -o json
rocky metrics fct_daily_revenue --trend -o json

14. Complete Governance Configuration

Here is a full pipeline target with every governance feature enabled. Governance lives under each pipeline’s target so different pipelines can have different policies:

[pipeline.bronze.target.governance]
auto_create_catalogs = true
auto_create_schemas = true

# Tags applied to all managed catalogs, schemas, and tables
[pipeline.bronze.target.governance.tags]
managed_by = "rocky"
environment = "production"
data_owner = "analytics-team"

# Catalog-level grants
[[pipeline.bronze.target.governance.grants]]
principal = "data_engineers"
permissions = ["USE CATALOG", "MANAGE"]

[[pipeline.bronze.target.governance.grants]]
principal = "analysts"
permissions = ["BROWSE", "USE CATALOG"]

[[pipeline.bronze.target.governance.grants]]
principal = "ml_team"
permissions = ["BROWSE", "USE CATALOG", "SELECT"]

# Schema-level grants
[[pipeline.bronze.target.governance.schema_grants]]
principal = "data_engineers"
permissions = ["USE SCHEMA", "SELECT", "MODIFY"]

[[pipeline.bronze.target.governance.schema_grants]]
principal = "analysts"
permissions = ["USE SCHEMA", "SELECT"]

# Workspace isolation
[pipeline.bronze.target.governance.isolation]
enabled = true

[[pipeline.bronze.target.governance.isolation.workspace_ids]]
id = 123456789
binding_type = "READ_WRITE"

[[pipeline.bronze.target.governance.isolation.workspace_ids]]
id = 987654321
binding_type = "READ_ONLY"

Combined with quality checks (also under the pipeline):

[pipeline.bronze.checks]
enabled = true
row_count = true
column_match = true
freshness = { threshold_seconds = 86400 }
anomaly_threshold_pct = 50.0
null_rate = { columns = ["email"], threshold = 0.05, sample_percent = 10 }

[[pipeline.bronze.checks.custom]]
name = "no_future_dates"
sql = "SELECT COUNT(*) FROM {target} WHERE order_date > CURRENT_DATE()"
threshold = 0

Masking policy and the role graph are project-level blocks that live outside the pipeline target, while classification and retention are declared per model in the sidecar. Together they complete the picture:

# Project-level masking policy
[mask]
pii = "hash"
confidential = "redact"

[mask.prod]
pii = "none"
confidential = "partial"

[classifications]
allow_unmasked = ["internal"]

# Project-level role graph
[role.reader]
permissions = ["SELECT", "USE CATALOG", "USE SCHEMA"]

[role.analytics_engineer]
inherits = ["reader"]
permissions = ["MODIFY"]

[role.admin]
inherits = ["analytics_engineer"]
permissions = ["MANAGE"]

Paired with a model sidecar:

name = "customers"
retention = "365d"

[classification]
pii_email = "pii"
phone = "pii"
ssn = "confidential"

This single configuration exercises every governance feature in this guide: schema routing, declarative grants with reconciliation, workspace isolation, classification and masking per environment, role-graph validation at config load, retention, inline quality checks, and a full audit trail.
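To make the role-graph part of this configuration concrete, here is a minimal sketch of how inherited permissions could be flattened. It is illustrative Python, not Rocky's resolver: `flatten_role` is a hypothetical helper, the role data is copied from the [role.*] blocks above, and rejecting cycles and unknown parents with an error is an assumption about what validation at config load involves.

```python
# Role declarations copied from the [role.<name>] blocks in the example above.
ROLES = {
    "reader": {"permissions": ["SELECT", "USE CATALOG", "USE SCHEMA"]},
    "analytics_engineer": {"inherits": ["reader"], "permissions": ["MODIFY"]},
    "admin": {"inherits": ["analytics_engineer"], "permissions": ["MANAGE"]},
}


def flatten_role(name, roles, _path=()):
    """Return the role's own permissions plus everything it inherits.

    Hypothetical helper. Raises ValueError on an inheritance cycle or on a
    reference to an undeclared role (assumed validation behaviour).
    """
    if name in _path:
        raise ValueError("inheritance cycle: " + " -> ".join(_path + (name,)))
    if name not in roles:
        raise ValueError(f"unknown role: {name}")
    role = roles[name]
    permissions = set(role.get("permissions", []))
    for parent in role.get("inherits", []):
        # Walk up the graph, remembering the path to detect cycles.
        permissions |= flatten_role(parent, roles, _path + (name,))
    return permissions
```

With the three roles above, `flatten_role("admin", ROLES)` yields the union of all five permissions, and `flatten_role("reader", ROLES)` yields only the three it declares directly.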

15. CI Gate Example

The CI gate pattern wires rocky compliance --fail-on exception into a pipeline step that blocks merges when classified columns are unmasked. For quieter local runs, drop --fail-on and add --exceptions-only so the output skips the per-column table when nothing is wrong.

GitHub Actions

name: Rocky Compliance

on:
  pull_request:
    paths:
      - 'models/**'
      - 'rocky.toml'

jobs:
  compliance:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Rocky
        run: |
          curl -fsSL https://raw.githubusercontent.com/rocky-data/rocky/main/engine/install.sh | bash
          echo "$HOME/.local/bin" >> $GITHUB_PATH
      - name: Run compliance gate
        run: rocky compliance --env prod --fail-on exception

The gate exits 0 when every classified column has a resolved strategy (or is listed in allow_unmasked), and exits 1 – failing the job – the moment any exception is emitted.
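The exit-code rule can be sketched as follows. This is illustrative Python that restates the rule in the paragraph above, not the rocky compliance source: the function names, the tuple layout, and the `orders.notes` column with its `restricted` tag are invented for the example, and the sketch deliberately does not model how a `none` strategy is counted.

```python
def find_exceptions(classified_columns, mask, allow_unmasked):
    """List classified columns with neither a masking strategy nor an allowance.

    classified_columns: iterable of (model, column, tag) tuples.
    mask: tag -> strategy mapping, already resolved for the target env.
    allow_unmasked: tags that may stay unmasked.
    """
    exceptions = []
    for model, column, tag in classified_columns:
        if tag in mask or tag in allow_unmasked:
            continue  # covered by a strategy or explicitly allowed
        exceptions.append({"model": model, "column": column, "tag": tag})
    return exceptions


def gate_exit_code(exceptions):
    """0 when nothing is flagged, 1 as soon as any exception exists."""
    return 1 if exceptions else 0


# Sample inputs: the customers sidecar from this guide plus one made-up offender.
columns = [
    ("customers", "pii_email", "pii"),
    ("customers", "phone", "pii"),
    ("customers", "ssn", "confidential"),
    ("orders", "notes", "restricted"),  # hypothetical column with no strategy
]
mask = {"pii": "hash", "confidential": "redact"}
allow_unmasked = ["internal"]
```

With these inputs only `orders.notes` is flagged, so the gate would fail; removing that row brings the exit code back to 0.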

Local quiet-mode run

rocky compliance --env prod --exceptions-only

When everything is compliant this prints just the summary counters; when exceptions exist, the per_column table is filtered to the offending rows.

Machine-readable gate

For dashboards and custom policy engines, emit JSON and pipe it into jq:

rocky compliance --env prod --output json \
  | jq '.exceptions[] | {model, column, env, reason}'

The ComplianceOutput schema is stable across minor versions – wire downstream tooling against the JSON payload rather than the text-table renderer.
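For tooling that prefers a script over jq, the same payload can be consumed in a few lines. This is a hedged sketch: only the `exceptions` array and its `model`, `column`, `env`, and `reason` keys are taken from the jq filter above. The sample payload, its values (including the reason text), and the helper name `exceptions_per_model` are made up for illustration, and the real ComplianceOutput may carry additional fields.

```python
import json
from collections import Counter

# Hypothetical payload shaped after the jq filter above; values are invented.
SAMPLE_PAYLOAD = """
{
  "exceptions": [
    {"model": "customers", "column": "ssn", "env": "prod", "reason": "example reason"},
    {"model": "customers", "column": "phone", "env": "prod", "reason": "example reason"},
    {"model": "orders", "column": "notes", "env": "prod", "reason": "example reason"}
  ]
}
"""


def exceptions_per_model(payload: str) -> Counter:
    """Count compliance exceptions per model from a JSON document string."""
    document = json.loads(payload)
    # Treat a missing "exceptions" key as "no exceptions" rather than an error.
    return Counter(exc["model"] for exc in document.get("exceptions", []))
```

In practice the string would come from the command's standard output, for example by piping `rocky compliance --env prod --output json` into a script that reads stdin.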
