
All Features

Rocky organizes its capabilities around seven trust dimensions — the failure modes that quietly cost data teams the most. Each dimension below maps to specific commands, diagnostic codes, and configuration knobs.

  1. Typed compiler: rocky compile, 35+ diagnostics
  2. Column-level lineage: rocky lineage, rocky lineage-diff
  3. Branches + replay: rocky branch, rocky replay
  4. Cost attribution: RunOutput.cost_summary, [budget], rocky cost
  5. AI gated through the compiler: rocky ai, rocky ai-sync
  6. Dialect-divergence lint: P001 portability lint
  7. Declarative governance: RBAC diff, masking, classification, rocky compliance

The capability detail follows. For positioning vs other tools, see Comparison. For honest scope (Databricks-first, Snowflake/BigQuery Beta, Iceberg writes via Delta UniForm shipped, first-class Iceberg-native writes on the roadmap), see Where Rocky is today.

| Warehouse | Status | Auth methods |
| --- | --- | --- |
| Databricks | Production | Unity Catalog, SQL Statement API, OAuth M2M |
| Snowflake | Beta | PAT, key-pair JWT, OAuth, password |
| BigQuery | Beta | Service account, application default |
| DuckDB | Production | Embedded (local dev/test, no credentials) |

Source adapters: Fivetran (REST API discovery), DuckDB (information_schema), Manual (config-defined).

| Strategy | Description |
| --- | --- |
| full_refresh | Drop and recreate table (default) |
| incremental | Append past a timestamp watermark |
| merge | Upsert via MERGE INTO with unique key |
| time_interval | Partition-keyed with per-partition state, lookback, parallel execution |
| ephemeral | Inlined as CTE in downstream queries, no table created |
| delete_insert | Delete by partition key, then insert fresh data |
| microbatch | Alias for time_interval with hourly defaults (dbt-compatible) |

Separately, the top-level format key on a model selects warehouse-managed table shapes: Delta tables, Iceberg tables, materialized views, streaming tables, and plain views. SCD Type 2 snapshots are run through a dedicated type = "snapshot" pipeline (see rocky snapshot).

Partition-keyed materialization with:

  • --partition KEY — run a single partition
  • --from / --to — run a date range
  • --latest — run the partition containing now()
  • --missing — discover and fill gaps from state store
  • --lookback N — recompute N previous partitions for late-arriving data
  • --parallel N — concurrent partition processing
  • Per-partition state tracking (Computed / Failed / InProgress)
  • Atomic writes: Databricks uses INSERT INTO ... REPLACE WHERE; Snowflake/DuckDB use transactional DELETE + INSERT
  • Static type inference across the full DAG at compile time
  • Column type tracking through JOINs, GROUP BYs, window functions
  • 35+ diagnostic codes (E001-E026, W001-W003, W010-W011, E010-E013, V001-V020) with actionable fix suggestions
  • Safe type widening detection: INT → BIGINT, FLOAT → DOUBLE, VARCHAR expansion — handled via zero-copy ALTER TABLE
  • NULL-safe equality: != compiles to IS DISTINCT FROM
  • SELECT * expansion with deduplication
  • Parallel type checking via rayon across DAG execution layers
  • Data contracts with required columns, protected columns (prevent removal), allowed type changes (widening whitelist), and nullability constraints
  • Seeded type inference: rocky compile --with-seed runs data/seed.sql through an in-memory DuckDB so leaf .sql models pick up real source types instead of Unknown
  • P001 dialect portability — opt-in lint rejecting SQL that won’t run on the chosen target (databricks / snowflake / bigquery / duckdb). Catches NVL, DATEADD, QUALIFY, ILIKE, FLATTEN, and friends.
  • P002 blast-radius SELECT * — always-on warning when a model uses SELECT * AND a downstream model references specific columns of its output. Leaf-model SELECT * is intentionally not flagged.
  • Project-wide allow list ([portability] allow = [...]) for blanket exemptions.
  • Per-model -- rocky-allow: CONSTRUCT, OTHER pragma for targeted opt-outs.
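
The way the P001 lint interacts with the project-wide allow list and the per-model pragma can be pictured with a small sketch. Everything here is an assumption made for illustration: the rule table, the function names, and the regex-based matching are hypothetical and are not Rocky's implementation. Only the pragma text (-- rocky-allow: CONSTRUCT, OTHER) and the target names come from the list above.

```python
import re

# Illustrative rule table (an assumption, not Rocky's real rule set):
# construct name -> targets on which this sketch treats it as non-portable.
NON_PORTABLE = {
    "NVL": {"bigquery"},
    "ILIKE": {"bigquery"},
}

# Matches a per-model pragma such as `-- rocky-allow: NVL, ILIKE`.
PRAGMA = re.compile(r"--\s*rocky-allow:\s*(.+)", re.IGNORECASE)


def model_allowances(sql: str) -> set[str]:
    """Collect constructs exempted by rocky-allow pragmas in one model."""
    allowed: set[str] = set()
    for match in PRAGMA.finditer(sql):
        allowed.update(name.strip().upper() for name in match.group(1).split(","))
    return allowed


def lint_portability(sql: str, target: str,
                     project_allow: frozenset = frozenset()) -> list[str]:
    """Return the constructs this sketch would flag for the given target."""
    allowed = {name.upper() for name in project_allow} | model_allowances(sql)
    # Drop line comments so a pragma that names a construct is not flagged itself.
    code = re.sub(r"--[^\n]*", "", sql)
    findings = []
    for construct, targets in NON_PORTABLE.items():
        if target not in targets or construct in allowed:
            continue
        if re.search(rf"\b{construct}\b", code, re.IGNORECASE):
            findings.append(construct)
    return findings
```

A real lint would work on the parsed SQL rather than on regexes; the sketch only shows the precedence: a construct is reported when it is non-portable for the chosen target and neither the project allow list nor a model pragma exempts it.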

Full reference: Linters.

  • Computed at compile time — no warehouse query, no catalog rebuild
  • Per-column trace through SQL and Rocky DSL transformations
  • Transform tracking: Direct, Cast, Aggregation, Expression
  • Output formats: JSON, Graphviz DOT, human-readable text
  • CLI: rocky lineage model.column
  • Automatic detection at compile time
  • Safe type widening: ALTER TABLE for compatible changes (INT → BIGINT, FLOAT → DOUBLE, VARCHAR expansion, numeric → STRING)
  • Graduated response: ALTER for safe changes, full refresh for unsafe changes
  • Column addition/removal detected with recommended actions
  • Shadow mode: Write to _rocky_shadow tables for validation before production
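
The graduated response above (ALTER for safe changes, full refresh for unsafe ones) amounts to a small decision function. The following is a minimal sketch under stated assumptions: the function names, the returned action labels, and the set of types counted as numeric are hypothetical, and only the widening pairs themselves are taken from the list.

```python
import re

# Widening pairs named in the list above; any other change is treated as unsafe.
SAFE_WIDENINGS = {("INT", "BIGINT"), ("FLOAT", "DOUBLE")}
# Assumed set of numeric types, used only for the numeric -> STRING rule.
NUMERIC_TYPES = {"INT", "BIGINT", "FLOAT", "DOUBLE"}


def varchar_length(type_name: str):
    """Return N for a type written as VARCHAR(N), else None."""
    match = re.fullmatch(r"VARCHAR\((\d+)\)", type_name)
    return int(match.group(1)) if match else None


def drift_action(old: str, new: str) -> str:
    """Classify a column type change as 'none', 'alter', or 'full_refresh'."""
    old, new = old.upper(), new.upper()
    if old == new:
        return "none"
    if (old, new) in SAFE_WIDENINGS:
        return "alter"
    old_len, new_len = varchar_length(old), varchar_length(new)
    if old_len is not None and new_len is not None and new_len > old_len:
        return "alter"  # VARCHAR expansion
    if old in NUMERIC_TYPES and new == "STRING":
        return "alter"  # numeric -> STRING
    return "full_refresh"  # narrowing or incompatible change
```

For example, drift_action("INT", "BIGINT") returns "alter", while the reverse direction returns "full_refresh" because narrowing can lose data.
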

| Check | Description |
| --- | --- |
| Row count | Source vs target row count match |
| Column match | Missing/extra column detection (case-insensitive, with exclusion list) |
| Freshness | Timestamp lag against configurable threshold |
| Null rate | Per-column null percentage with TABLESAMPLE support |
| Custom SQL | Threshold-based checks with {target} placeholder |
| Anomaly detection | Row count deviation from historical baselines |

Model-level [[assertions]] blocks cover the full DQX surface:

| Kind | Description |
| --- | --- |
| not_null, unique | Null / duplicate detection |
| accepted_values | Value set membership |
| relationships | Referential integrity against another table |
| expression | Custom SQL boolean per row |
| row_count_range | Table-level row-count bounds |
| in_range | Numeric column bounds |
| regex_match | Dialect-specific regex (REGEXP / RLIKE / REGEXP_LIKE / REGEXP_CONTAINS) |
| aggregate | SUM/COUNT/AVG/MIN/MAX(col) cmp value |
| composite | Multi-column uniqueness |
| not_in_future, older_than_n_days | Time-window sugar |

Every assertion supports severity (error / warning), fail_on_error (pipeline-level override), and a per-check filter SQL predicate.

Row-level assertions can route failing rows with [quarantine] mode = "split" | "tag" | "drop". split writes a sibling <target>__quarantine table; tag adds __dqx_valid boolean; drop discards.
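
The three quarantine modes differ only in what happens to rows that fail an assertion. A minimal in-memory sketch of those semantics follows; the function name and the dict-based row representation are assumptions made for illustration, and Rocky itself applies the modes in SQL against warehouse tables.

```python
def apply_quarantine(rows, is_valid, mode):
    """Return (target_rows, quarantine_rows) for one quarantine mode.

    rows     -- list of dicts, one per row
    is_valid -- predicate standing in for a row-level assertion
    mode     -- "split", "tag", or "drop"
    """
    if mode == "split":
        # Failing rows go to the sibling <target>__quarantine table.
        good = [row for row in rows if is_valid(row)]
        bad = [row for row in rows if not is_valid(row)]
        return good, bad
    if mode == "tag":
        # Every row is kept and gains a __dqx_valid boolean column.
        return [{**row, "__dqx_valid": is_valid(row)} for row in rows], []
    if mode == "drop":
        # Failing rows are discarded.
        return [row for row in rows if is_valid(row)], []
    raise ValueError(f"unknown quarantine mode: {mode!r}")
```

With rows [{"id": 1}, {"id": None}] and a not-null check on id, split returns one row in each list, tag returns both rows with the flag set to True and False respectively, and drop returns only the first row.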

All checks run inline during pipeline execution, not as a separate step. Full reference: Data Quality Checks.

Pipeline-oriented syntax that compiles to standard SQL:

| Step | Description |
| --- | --- |
| from | Load from source model or table |
| where | Filter rows |
| derive | Create computed columns |
| group | Aggregate with grouping |
| select | Project columns |
| join | Join another model |
| sort | Order rows |
| take | Limit row count |
| distinct | Remove duplicates |

Plus: window functions with PARTITION BY / ORDER BY / frame specs, match expressions (→ CASE WHEN), date literals (@2025-01-01), IS NULL / IS NOT NULL, IN lists.

init · validate · discover · plan · run · compare · state · branch (incl. branch approve, branch promote --allow-breaking --base-ref --models --skip-approval) · preview create · preview diff · preview cost

compile · lineage · lineage-diff · test · ci · ci-diff (with --semantic for typed-IR breaking-change findings) · export-schemas

ai (generate) · ai-sync · ai-explain · ai-test

playground · serve · lsp · init-adapter · test-adapter · import-dbt · validate-migration

doctor · history · history --rolling-stats · replay · trace · cost · metrics · optimize · compact · profile-storage · archive · state retention sweep · bench · hooks list · hooks test

  • 15-event hook lifecycle — pipeline / discover / compile / per-model / per-table / checks / drift / anomaly / state-sync events fire through the hook registry. See Hooks.
  • rocky trace <run_id|latest> — Gantt-style timeline of a completed run with concurrency lanes, rendered from the state-store RunRecord.
  • rocky replay <run_id|latest> — flat per-model dump of SQL hashes, row counts, bytes, and timings captured at execution time.
  • Timed half-open circuit breaker — three-state (Closed / Open / HalfOpen) breaker shared across Databricks + Snowflake adapters. Fires circuit_breaker_tripped / circuit_breaker_recovered events.
  • OTLP metrics export (feature-gated via --features otel) — rocky apply exports in-process counters and histograms to any OTLP-compatible collector.
  • Run-level budgets: [budget] max_usd + max_duration_ms + max_bytes_scanned with on_breach = "warn" | "error"; a breach on any single dimension fires the budget_breach event. See [budget].
  • Pre-merge budget projection: rocky preview cost projects budget breaches against the branch totals before merge, so reviewers see “this PR would breach max_usd if merged” in the PR comment. Output field projected_budget_breaches mirrors RunOutput.budget_breaches; the Markdown flips between advisory and “would fail the run” based on [budget].on_breach.
  • Rolling stats + model health score: rocky history --model <name> --rolling-stats [--window N] augments ModelHistoryOutput with rolling z-score on rows_affected + duration_ms over the most recent N successful runs (default 20), plus a composite health_score (1.0 − clamp((max(|z|) − 2) / 4, 0, 1)).
  • State-store retention sweep: [state.retention] {max_age_days, min_runs_kept, applies_to} config knob + rocky state retention sweep [--dry-run] subcommand. Sweeps history (runs), lineage (dag_snapshots), and audit (quality_history); explicitly leaves operational state (schema cache, watermarks, partitions) untouched.
  • Exhaustive checksum-bisection diff: rocky preview diff --algorithm bisection walks the chunk lattice over a single-column integer / numeric primary key for exhaustive row-level coverage at bounded scan cost (O(K · log_K(N)) chunks examined for a single-row change). Per-adapter native row-hashes: DuckDB hash, BigQuery FARM_FINGERPRINT, Databricks Spark xxhash64. Snowflake stays on the sampled fallback until a consumer drives the override.
  • Per-run cost attribution: RunOutput.cost_summary carries per-run total cost; per-materialization cost_usd flows through MaterializationMetadata.
  • rocky cost <run_id|latest> — historical rollup over stored runs. Reads the same RunRecord as replay / trace; recomputes per-model cost via the adapter-appropriate formula (duration × DBU for Databricks/Snowflake; bytes × $/TB for BigQuery; zero for DuckDB).
  • See the combo in action — POC #17 (trace + cost + replay against the same run_id): examples/playground/pocs/06-developer-experience/17-trace-replay-cost-combo
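
The health_score formula quoted in the rolling-stats item, 1.0 − clamp((max(|z|) − 2) / 4, 0, 1), can be worked through in a few lines. This is a sketch, not Rocky's code: health_score follows the formula as written, while z_score is an assumption (a population standard deviation with a zero-variance guard), since the exact estimator is not stated here.

```python
from statistics import mean, pstdev


def z_score(history, latest):
    """Z-score of the latest value against a window of earlier values.

    Assumption: population standard deviation; a flat history scores 0.0.
    """
    sigma = pstdev(history)
    if sigma == 0:
        return 0.0
    return (latest - mean(history)) / sigma


def health_score(z_scores):
    """1.0 - clamp((max(|z|) - 2) / 4, 0, 1), per the formula above."""
    worst = max(abs(z) for z in z_scores)
    penalty = min(max((worst - 2.0) / 4.0, 0.0), 1.0)
    return 1.0 - penalty
```

Read this way, any model whose worst z-score stays within 2 standard deviations scores 1.0, the score falls linearly from there, and it bottoms out at 0.0 once the worst deviation reaches 6. For instance, z-scores of 4.0 on rows_affected and 1.0 on duration_ms give a score of 0.5.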

Full LSP implementation via rocky lsp:

  • Diagnostics (live compile errors/warnings)
  • Hover (column types and lineage)
  • Go to Definition
  • Find References
  • Completions (models, columns, functions)
  • Rename (across all files)
  • Code Actions (quick fixes)
  • Inlay Hints (inline type annotations)
  • Semantic Tokens (syntax highlighting)
  • Signature Help (function parameters)
  • Document Symbols (outline)

Published VS Code extension with TextMate grammar + semantic tokens.

  • dagster-rocky package with RockyResource and RockyComponent
  • Auto-discovery: Rocky discover → Dagster asset definitions
  • Dagster Pipes protocol: Hand-rolled emitter (no external dependency) — reports materializations, check results, drift observations, anomaly alerts
  • 55 typed JSON output schemas with auto-generated Pydantic v2 models and TypeScript interfaces via rocky export-schemas
  • Freshness policies auto-attached from [checks.freshness] config
  • Column lineage attached to derived model asset metadata
  • Catalog lifecycle: Auto-create catalogs with configurable tags
  • Schema lifecycle: Auto-create schemas with tags
  • RBAC: 6 permission types with declarative GRANT management
  • Permission reconciliation: SHOW GRANTS → diff → GRANT/REVOKE
  • Workspace isolation: Databricks catalog binding (READ_WRITE, READ_ONLY)
  • Resource tagging: Component-derived tags (client, region, source) + static tags
  • Multi-tenant patterns: Schema pattern routing (src__client__region__connector)
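
The permission reconciliation step (SHOW GRANTS → diff → GRANT/REVOKE) reduces to a set difference between the declared grants and the grants currently in effect. The sketch below assumes a hypothetical (principal, privilege, securable) tuple shape and emits Databricks-style statements as plain strings; the names are illustrative and do not reflect Rocky's internal types.

```python
# (principal, privilege, securable), e.g. ("analysts", "SELECT", "SCHEMA main.sales")
Grant = tuple[str, str, str]


def reconcile(desired: set[Grant], actual: set[Grant]) -> list[str]:
    """Return the statements that move `actual` grants to `desired`.

    `actual` stands in for the parsed output of SHOW GRANTS.
    """
    statements = []
    # Declared but not yet present: issue GRANT.
    for principal, privilege, securable in sorted(desired - actual):
        statements.append(f"GRANT {privilege} ON {securable} TO `{principal}`")
    # Present but no longer declared: issue REVOKE.
    for principal, privilege, securable in sorted(actual - desired):
        statements.append(f"REVOKE {privilege} ON {securable} FROM `{principal}`")
    return statements
```

Grants that appear in both sets produce no statement, so running the reconciliation twice in a row yields an empty plan the second time.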

Embedded redb state store (.rocky-state.redb) with:

| Table | Purpose |
| --- | --- |
| WATERMARKS | Per-table incremental progress |
| PARTITIONS | Per-partition lifecycle (Computed/Failed/InProgress) |
| CHECK_HISTORY | Row count snapshots for anomaly detection |
| RUN_HISTORY | Full run execution records |
| QUALITY_HISTORY | Check results and metrics |
| DAG_SNAPSHOTS | Compilation snapshots |
| RUN_PROGRESS | In-flight run progress (checkpoint/resume) |

Remote sync to S3 or Valkey/Redis (tiered backend). No manifest file.

Configurable shell commands or webhooks triggered on:

  • Pipeline: start, discover_complete, compile_complete, complete, error
  • Table: before_materialize, after_materialize, materialize_error
  • Model: before_model_run, after_model_run, model_error
  • Checks: before_checks, check_result, after_checks
  • State: drift_detected, anomaly_detected, state_synced

Each hook supports command, timeout_ms, and on_failure behavior. Webhook hooks send POST with JSON templates.

Powered by Claude (via ANTHROPIC_API_KEY):

| Command | Description |
| --- | --- |
| rocky ai <intent> | Generate a model from natural language |
| rocky ai-sync | Detect schema changes and update models guided by stored intent |
| rocky ai-explain | Generate intent description from existing code |
| rocky ai-test | Generate test assertions from intent |

All AI commands are CLI-native, not cloud-dependent. Intent is stored as metadata in model TOML files.

| Metric (10k models) | Value |
| --- | --- |
| Compile | 1.00 s |
| Per-model cost | 100 µs |
| Peak memory | 147 MB |
| Lineage | 0.84 s |
| Warm compile | 0.72 s |
| Startup | 14 ms |
| Config validation | 15 ms |
| SQL generation | 200 ms (50k tables/sec) |

Linear scaling verified from 1k to 50k models. See benchmarks for full comparison with dbt-core and dbt-fusion.