Architecture

Rocky is a Cargo workspace composed of several crates, each with a focused responsibility.

How data flows

A rocky run touches every layer of the stack, from config file to warehouse result:

rocky.toml
    │  parse config + env-var substitution
    ▼
Config struct
    │  call DiscoveryAdapter
    ▼
Source metadata  ◀── Fivetran REST API / DuckDB info_schema / manual config
    │  compile .sql + .toml models
    ▼
rocky-compiler  →  diagnostics (E001–E035 / W001–W031)
    │  produces typed ProjectIr on success
    ▼
ProjectIr  (one ModelIr per model, fully typed and validated)
    │  topological sort → execution layers
    ▼
Layer 0: [raw_a, raw_b]   ← run in parallel
Layer 1: [enriched]
Layer 2: [summary]
    │  per model: drift detect → skip gate → SQL gen → execute → checks
    ▼
WarehouseAdapter
    ├── rocky-databricks  →  Databricks SQL Statement API
    ├── rocky-snowflake   →  Snowflake REST API
    ├── rocky-bigquery    →  BigQuery REST API
    ├── rocky-trino       →  Trino /v1/statement
    └── rocky-duckdb      →  DuckDB in-process
    │  write results to warehouse; update state store
    ▼
State store (redb, embedded)
    │  watermarks, run progress, partitions, idempotency
    ▼
Hooks + JSON output on stdout

For a detailed walkthrough of each step, see Execution Flow.
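
The "topological sort → execution layers" step in the diagram can be sketched as follows. This is a minimal illustration, not Rocky's implementation: the function name execution_layers is hypothetical, and the dependency edges between raw_a, raw_b, enriched, and summary are assumptions chosen to match the layers shown above.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: group models into layers so that every model runs
/// only after all of its dependencies. Returns None on a cycle or on a
/// dependency that is not itself a key in `deps`.
fn execution_layers<'a>(deps: &BTreeMap<&'a str, Vec<&'a str>>) -> Option<Vec<Vec<String>>> {
    let mut remaining = deps.clone();
    let mut done: Vec<&str> = Vec::new();
    let mut layers = Vec::new();

    while !remaining.is_empty() {
        // A model is ready once every one of its dependencies has finished.
        let ready: Vec<&str> = remaining
            .iter()
            .filter(|(_, d)| d.iter().all(|dep| done.contains(dep)))
            .map(|(name, _)| *name)
            .collect();
        if ready.is_empty() {
            return None; // nothing can make progress: cycle or unknown dependency
        }
        for name in &ready {
            remaining.remove(name);
        }
        done.extend(&ready);
        layers.push(ready.iter().map(|s| s.to_string()).collect());
    }
    Some(layers)
}

fn main() {
    // Assumed edges: enriched reads both raw models, summary reads enriched.
    let mut deps = BTreeMap::new();
    deps.insert("raw_a", vec![]);
    deps.insert("raw_b", vec![]);
    deps.insert("enriched", vec!["raw_a", "raw_b"]);
    deps.insert("summary", vec!["enriched"]);

    for (i, layer) in execution_layers(&deps).unwrap().iter().enumerate() {
        println!("Layer {i}: {layer:?}");
    }
}
```

Models inside one layer have no edges between them, which is what makes it safe to run them in parallel.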

Adapter model

Rocky separates concerns through two adapter types:

Source adapters handle discovery: finding what schemas and tables exist and are available for processing. They do NOT extract data. The data must already be in the warehouse, landed by an ingestion tool (Fivetran, Airbyte, etc.) or loaded manually.

rocky-fivetran: calls the Fivetran REST API to list connectors and their enabled tables in the destination
rocky-duckdb: queries information_schema to discover schemas and tables in a local DuckDB database
Manual source (built into rocky-core): reads schema/table definitions from rocky.toml

Warehouse adapters handle execution: running SQL, managing catalog lifecycle, and applying governance (tags, permissions, workspace isolation).

rocky-databricks: executes via the Databricks SQL Statement API and manages Unity Catalog, with adaptive concurrency (AIMD algorithm)
rocky-snowflake: executes via the Snowflake REST API with OAuth, key-pair JWT, and password auth
rocky-bigquery: executes via the BigQuery REST API with service account / ADC auth
rocky-trino: executes via Trino’s /v1/statement REST polling state machine with HTTP Basic / JWT bearer auth (Beta as of engine v1.28.0)
rocky-duckdb: local in-process execution for development, testing, and CI
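
The adaptive concurrency mentioned for rocky-databricks follows the AIMD pattern (additive increase, multiplicative decrease). The sketch below shows the general technique only; the AimdLimit type, the +1 step, and the halving factor are assumptions for illustration and are not taken from the rocky-databricks source.

```rust
/// Hypothetical AIMD limiter: grow the concurrency limit by one after each
/// success, halve it when the warehouse signals overload.
struct AimdLimit {
    current: usize,
    min: usize,
    max: usize,
}

impl AimdLimit {
    /// `min` must not exceed `max`; `clamp` panics otherwise.
    fn new(start: usize, min: usize, max: usize) -> Self {
        Self { current: start.clamp(min, max), min, max }
    }

    /// Additive increase, capped at `max`.
    fn on_success(&mut self) {
        self.current = (self.current + 1).min(self.max);
    }

    /// Multiplicative decrease, floored at `min`.
    fn on_overload(&mut self) {
        self.current = (self.current / 2).max(self.min);
    }
}

fn main() {
    let mut limit = AimdLimit::new(8, 1, 16);
    limit.on_success(); // one statement succeeded
    limit.on_overload(); // e.g. a rate-limit response
    println!("concurrency limit: {}", limit.current);
}
```

The asymmetry is the point: the limit backs off quickly under pressure and recovers slowly, so a burst of throttling does not immediately repeat.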

rocky-ir defines the typed Intermediate Representation (ModelIr, MaterializationStrategy, source/target descriptors). It was extracted from rocky-core in engine v1.30.0 so adapter crates can depend on the IR without pulling in the SQL-generation surface.

rocky-core sits between adapters and IR. It owns warehouse-agnostic logic (DAG resolution, schema pattern parsing, SQL generation templates, checks, contracts, state management) and depends on rocky-ir. rocky-core has no knowledge of Databricks, Fivetran, or any specific system.

This architecture means adding a new warehouse or a new source requires only a new adapter crate; the core engine stays unmodified.

Monorepo

Rocky is a monorepo with six subprojects:

rocky-data/
├── engine/                  # Rust CLI + engine (this section)
├── sdk/python/              # rocky-sdk — standalone typed Python client over the rocky CLI
├── integrations/dagster/    # dagster-rocky Python package (thin adapter over rocky-sdk)
├── editors/vscode/          # VS Code extension (LSP client)
├── examples/playground/     # POC catalog + benchmarks
└── docs/                    # Documentation site (Astro + Starlight)

The playground catalog currently ships 99 POCs across 8 categories. See the playground guide for the full list.

Crate overview

engine/
├── crates/
│   ├── rocky-ir/            # Typed Intermediate Representation (ModelIr, MaterializationStrategy)
│   ├── rocky-core/          # Generic SQL transformation engine (depends on rocky-ir)
│   ├── rocky-catalog-core/  # Catalog client abstraction (Iceberg REST, Unity Catalog, Polaris, Nessie)
│   ├── rocky-sql/           # SQL parsing + typed AST
│   ├── rocky-lang/          # Rocky DSL parser (.rocky files)
│   ├── rocky-compiler/      # Type checking + semantic analysis
│   ├── rocky-adapter-sdk/   # Adapter SDK + conformance tests
│   ├── rocky-databricks/    # Databricks warehouse adapter
│   ├── rocky-snowflake/     # Snowflake warehouse adapter
│   ├── rocky-bigquery/      # BigQuery warehouse adapter (Beta)
│   ├── rocky-trino/         # Trino warehouse adapter (Beta as of v1.28.0)
│   ├── rocky-fivetran/      # Fivetran source adapter
│   ├── rocky-airbyte/       # Airbyte source adapter
│   ├── rocky-iceberg/       # Iceberg + content-addressed Delta UniForm writer
│   ├── rocky-duckdb/        # DuckDB local execution adapter
│   ├── rocky-engine/        # Local execution engine (DuckDB-backed)
│   ├── rocky-server/        # HTTP API + LSP server
│   ├── rocky-cache/         # Three-tier caching
│   ├── rocky-ai/            # AI intent layer
│   ├── rocky-mcp/           # MCP server backing `rocky mcp`
│   ├── rocky-verify/        # Standalone offline verifier for rocky-manifest attestations
│   ├── rocky-observe/       # Observability
│   ├── rocky-wasm/          # WebAssembly exports for browser/edge
│   └── rocky-cli/           # CLI framework + Dagster Pipes
├── rocky/                   # Binary crate (the `rocky` CLI)
└── rocky-lsp/               # Binary crate (the `rocky lsp` language server)

rocky-ir

The typed Intermediate Representation.

ModelIr: single in-memory plan representation; replaced the legacy Plan enum + ReplicationPlan / TransformationPlan structs (deleted in v1.30.0).
MaterializationStrategy: the strategy enum (full list in the MaterializationStrategy section below).
Source / target descriptors and partition windows.

rocky-core

The warehouse-agnostic transformation engine, consuming IR from rocky-ir and producing dialect-specific SQL.

Key modules:

schema.rs: configurable schema pattern parsing (e.g., src__acme__us_west__shopify into structured components)
drift.rs: schema drift detection (compares column types between source and target)
sql_gen.rs: IR to dialect-specific SQL generation
state.rs: embedded state store backed by redb for watermarks and run history
state_sync.rs: remote state persistence; downloads and uploads state to and from S3, Valkey, or a tiered setup (Valkey + S3)
catalog.rs: catalog and schema lifecycle management (CREATE IF NOT EXISTS, tagging)
checks.rs: inline data quality checks (row counts, column matching, freshness, null rate, custom)
contracts.rs: data contracts (required columns, protected columns, allowed type changes)
unified_dag.rs / dag_executor.rs: unified DAG construction and layered execution across pipeline stages (topological sort itself lives in rocky-ir::dag)
models.rs: SQL model loading (sidecar .sql + .toml files)
source.rs: source adapter traits and manual source configuration
config.rs: TOML configuration parsing with environment variable substitution (${VAR} and ${VAR:-default})
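
The ${VAR} and ${VAR:-default} substitution handled by config.rs can be illustrated with a small standalone function. This is a sketch under assumptions, not the config.rs code: the substitute function and its error messages are hypothetical, and treating an unset variable without a default as an error is a design choice made here for the example.

```rust
/// Hypothetical substitution pass: replaces `${VAR}` and `${VAR:-default}`
/// using `lookup` (e.g. the process environment).
fn substitute(input: &str, lookup: impl Fn(&str) -> Option<String>) -> Result<String, String> {
    let mut out = String::new();
    let mut rest = input;
    while let Some(start) = rest.find("${") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let end = after.find('}').ok_or("unterminated ${ in config")?;
        let expr = &after[..end];
        // Split `NAME:-default` into its two halves, if a default is present.
        let (name, default) = match expr.split_once(":-") {
            Some((n, d)) => (n, Some(d)),
            None => (expr, None),
        };
        match lookup(name).or(default.map(str::to_string)) {
            Some(value) => out.push_str(&value),
            None => return Err(format!("environment variable {name} is not set")),
        }
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}

fn main() {
    // Resolve against the real process environment.
    let line = "host = \"${DB_HOST:-localhost}\"";
    println!("{:?}", substitute(line, |name| std::env::var(name).ok()));
}
```

Passing the lookup as a closure keeps the function testable without mutating the process environment.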

rocky-sql

SQL parsing and validation built on sqlparser-rs.

parser.rs: wraps sqlparser-rs with typed extensions for Rocky’s needs
dialect.rs: Databricks SQL dialect support
validation.rs: SQL identifier validation using strict regex patterns. All identifiers must pass through this module before being interpolated into SQL. This prevents SQL injection by rejecting anything that doesn’t match ^[a-zA-Z0-9_]+$.
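
The identifier rule above, ^[a-zA-Z0-9_]+$, can be expressed without a regex engine. The sketch below is illustrative only: is_valid_identifier and qualified_name are hypothetical names, not the validation.rs API.

```rust
/// Equivalent of matching `^[a-zA-Z0-9_]+$`: non-empty, ASCII letters,
/// digits, and underscores only.
fn is_valid_identifier(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'_')
}

/// Hypothetical helper: build a dotted name, refusing any part that fails
/// validation so nothing unchecked is ever interpolated into SQL.
fn qualified_name(parts: &[&str]) -> Result<String, String> {
    for part in parts {
        if !is_valid_identifier(part) {
            return Err(format!("invalid SQL identifier: {part:?}"));
        }
    }
    Ok(parts.join("."))
}

fn main() {
    println!("{:?}", qualified_name(&["main", "sales", "orders"]));
    println!("{:?}", qualified_name(&["main", "sales", "orders; DROP TABLE x"]));
}
```

Because the allowed alphabet excludes quotes, whitespace, semicolons, and dots, a validated identifier cannot terminate a statement or escape into another clause.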

rocky-databricks

The Databricks warehouse adapter. Implements the warehouse traits defined in rocky-core.

connector.rs: SQL Statement Execution REST API client (POST /api/2.0/sql/statements, polling for results)
catalog.rs: Unity Catalog CRUD operations, tagging, and catalog isolation
permissions.rs: GRANT/REVOKE execution, SHOW GRANTS parsing
workspace.rs: workspace binding management for catalog isolation
auth.rs: authentication with auto-detection: tries PAT (DATABRICKS_TOKEN) first, falls back to OAuth M2M (DATABRICKS_CLIENT_ID + DATABRICKS_CLIENT_SECRET)
batch.rs: batched information_schema queries using UNION ALL (batches of 200)
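
The auto-detection order described for auth.rs (PAT first, then OAuth M2M) might look like the following. Only the three environment variable names come from the description above; the Auth enum and detect_auth function are hypothetical, and token exchange is out of scope for the sketch.

```rust
/// Hypothetical result of credential detection.
#[derive(Debug, PartialEq)]
enum Auth {
    Pat(String),
    OAuthM2m { client_id: String, client_secret: String },
}

/// Try a personal access token first; fall back to OAuth machine-to-machine
/// credentials only when both the client ID and secret are present.
fn detect_auth(env: impl Fn(&str) -> Option<String>) -> Option<Auth> {
    if let Some(token) = env("DATABRICKS_TOKEN") {
        return Some(Auth::Pat(token));
    }
    match (env("DATABRICKS_CLIENT_ID"), env("DATABRICKS_CLIENT_SECRET")) {
        (Some(client_id), Some(client_secret)) => {
            Some(Auth::OAuthM2m { client_id, client_secret })
        }
        _ => None,
    }
}

fn main() {
    // Report only which method was found, never the secret itself.
    match detect_auth(|name| std::env::var(name).ok()) {
        Some(Auth::Pat(_)) => println!("using PAT"),
        Some(Auth::OAuthM2m { .. }) => println!("using OAuth M2M"),
        None => println!("no Databricks credentials found"),
    }
}
```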

rocky-fivetran

The Fivetran source adapter. Discovers what schemas and tables exist in the Fivetran destination. This is a metadata-only operation; the actual data is already in the warehouse, landed by Fivetran’s sync process.

client.rs: async REST client using reqwest with Basic Auth
connector.rs: connector discovery and filtering
schema.rs: schema configuration parsing (nested JSON structures from Fivetran’s API)
pagination.rs: cursor-based pagination for large result sets
sync.rs: sync detection via timestamp comparison (determines if new data is available)

rocky-cache

Three-tier caching system that reduces API calls and speeds up repeated operations.

memory.rs: in-process LRU cache with configurable TTL
valkey.rs: Valkey/Redis distributed cache with distributed locks
tiered.rs: fallback chain: memory -> Valkey -> API. A cache miss at one tier populates all tiers above it.
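
The memory -> Valkey -> API fallback chain can be sketched with two in-process maps. This is an illustration under assumptions: TieredCache is a hypothetical type, a plain HashMap stands in for Valkey, and TTL, LRU eviction, and distributed locking are all omitted.

```rust
use std::collections::HashMap;

/// Hypothetical two-tier cache in front of an API call.
struct TieredCache {
    memory: HashMap<String, String>,
    shared: HashMap<String, String>, // stand-in for Valkey
}

impl TieredCache {
    /// Look in memory, then the shared tier, then call the API. A hit at a
    /// lower tier is copied into every tier above it.
    fn get(&mut self, key: &str, fetch_from_api: impl FnOnce(&str) -> String) -> String {
        if let Some(hit) = self.memory.get(key) {
            return hit.clone();
        }
        if let Some(hit) = self.shared.get(key).cloned() {
            self.memory.insert(key.to_string(), hit.clone());
            return hit;
        }
        let value = fetch_from_api(key);
        self.shared.insert(key.to_string(), value.clone());
        self.memory.insert(key.to_string(), value.clone());
        value
    }
}

fn main() {
    let mut cache = TieredCache { memory: HashMap::new(), shared: HashMap::new() };
    let first = cache.get("connectors", |k| format!("api:{k}"));
    let second = cache.get("connectors", |_| unreachable!("served from memory"));
    println!("{first} {second}");
}
```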

rocky-duckdb

DuckDB local execution adapter. Minimal implementation providing a local warehouse backend for development and testing without requiring a remote warehouse connection.

rocky-observe

Observability infrastructure.

metrics.rs: in-process metrics collection: counters (tables processed/failed, statements executed, retries, anomalies) and duration histograms (p50/p95/max for tables and queries). Thread-safe via atomics, serialized to JSON in run output.
tracing_setup.rs: structured JSON logging via the tracing crate
events.rs: event broadcasting over Valkey Pub/Sub for real-time monitoring
otel.rs: feature-gated OpenTelemetry exporter. When the engine is built with --features otel and OTEL_EXPORTER_OTLP_ENDPOINT is set, rocky run exports in-process metrics as OTLP via an OtelGuard RAII handle. Send metrics to any OTLP-compatible collector (Honeycomb, Datadog, Grafana Tempo, Prometheus via the OTel collector).

rocky-lang

Rocky DSL parser for .rocky files. Converts pipeline-oriented syntax into an AST that lowers to standard SQL.

lexer.rs: token scanner built on the logos crate
parser.rs: recursive descent parser producing a typed AST
lowering.rs: AST to SQL lowering

rocky-compiler

Type checking and semantic analysis for Rocky models.

typecheck.rs: column-level type inference across the DAG
semantic.rs: tracks column lineage, dependencies, and contracts
diagnostic.rs: compiler errors and warnings with source locations and suggestions

rocky-adapter-sdk

Stable, versioned traits for building custom warehouse adapters:

WarehouseAdapter: execute SQL, describe tables, manage catalog objects
SqlDialect: format SQL for a specific warehouse
DiscoveryAdapter: discover connectors and tables
GovernanceAdapter: manage tags, grants, workspace bindings
Includes 26 conformance tests (18 always-run + 8 capability-gated)

rocky-snowflake

Snowflake warehouse adapter.

auth.rs: OAuth, password, RS256 key-pair JWT auth (auto-detection)
connector.rs: Snowflake REST API client
dialect.rs: Snowflake SQL dialect (dynamic tables, multi-statement transactions)

rocky-trino

Trino warehouse adapter (Beta as of engine v1.28.0). It is the first first-party warehouse adapter built natively against rocky-adapter-sdk.

connector.rs: async REST client driving Trino’s POST /v1/statement + nextUri polling state machine. Same-origin guard on every nextUri follow-up (same scheme + host + port as the configured coordinator).
dialect.rs: double-quoted identifiers, three-part <catalog>.<schema>.<table> references, DESCRIBE <table> column introspection, ANSI INSERT INTO / CREATE TABLE AS, TABLESAMPLE BERNOULLI.
auth.rs: HTTP Basic + JWT bearer with RedactedString-wrapped credentials.
adapter.rs: WarehouseAdapter impl: dialect, execute_statement, execute_query, describe_table.

A Docker conformance harness is gated behind the trino-conformance cargo feature: cargo test -p rocky-trino --features trino-conformance -- --ignored drives the adapter end-to-end against a real trinodb/trino coordinator. The default cargo test -p rocky-trino run stays credential- and network-free.

v0 limitations: no MERGE (Trino’s MERGE is connector-dependent), no OAuth/Kerberos, no governance, no checksum-bisection support.

rocky-iceberg

Apache Iceberg + Delta UniForm writer surface. Powers materialization = "content_addressed" (engine v1.30.0, Phases 1–5):

discover() reads the bootstrap Delta commit and surfaces schema + partition spec + rowTracking config.
write_batch() emits content-addressed Parquet files (blake3-hashed file names) and a Delta log commit referencing them.
sync_iceberg_metadata() keeps Iceberg-compatible readers in sync after each commit.
Partitioned + rowTracking + post-ALTER schema evolution all supported.

See Content-Addressed Materialization for the user-facing model.

rocky-engine

Local execution engine backed by DuckDB (via rocky-duckdb). Powers rocky test, branching, and CI-style local runs without a warehouse connection.

rocky-server

HTTP API and Language Server Protocol (LSP) server.

REST API (via axum): model metadata, lineage, DAG endpoints for rocky serve
LSP (via tower-lsp): diagnostics, hover, completion, go-to-definition, rename for rocky lsp

rocky-ai

AI intent layer using Claude for model generation, intent extraction, schema change sync, and test generation.

Implements the compile-verify loop (up to 3 retries on compilation failure)
Requires ANTHROPIC_API_KEY
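
The compile-verify loop can be sketched generically. Everything here is an assumption for illustration: compile_verify is a hypothetical function, the generator and compiler are passed in as closures rather than calling Claude or rocky-compiler, and "up to 3 retries" is read as one initial attempt plus at most three more.

```rust
/// Hypothetical compile-verify loop: generate a candidate, compile it, and
/// feed the diagnostic back into the next attempt.
fn compile_verify<G, C>(mut generate: G, compile: C, max_retries: usize) -> Result<String, String>
where
    G: FnMut(Option<&str>) -> String,
    C: Fn(&str) -> Result<(), String>,
{
    let mut feedback: Option<String> = None;
    // One initial attempt plus `max_retries` retries.
    for _ in 0..=max_retries {
        let candidate = generate(feedback.as_deref());
        match compile(&candidate) {
            Ok(()) => return Ok(candidate),
            Err(diagnostic) => feedback = Some(diagnostic),
        }
    }
    // Out of attempts: surface the last diagnostic.
    Err(feedback.unwrap_or_default())
}

fn main() {
    // Stub generator: wrong on the first try, corrected once it sees feedback.
    let result = compile_verify(
        |feedback| match feedback {
            None => "SELEC 1".to_string(),
            Some(_) => "SELECT 1".to_string(),
        },
        |sql| {
            if sql.starts_with("SELECT") {
                Ok(())
            } else {
                Err("expected SELECT".to_string())
            }
        },
        3,
    );
    println!("{result:?}");
}
```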

rocky-cli

CLI framework built on clap.

commands/: 80+ command implementations organized by category
output.rs: typed JSON output structs (80+ exported schemas) with JsonSchema derivation for codegen
pipes.rs: Dagster Pipes protocol emitter (activates when DAGSTER_PIPES_CONTEXT is set)

rocky (binary)

The rocky binary crate. Contains only main.rs, which wires all the library crates together and dispatches CLI commands.

Intermediate Representation (IR)

Rocky compiles configuration and SQL into an intermediate representation before generating executable SQL. This separation means the core engine never deals with raw strings; everything is typed and validated. The IR lives in the rocky-ir crate.

ModelIr

ModelIr is the single in-memory plan representation. As of engine v1.30.0, the legacy Plan enum (along with ReplicationPlan, TransformationPlan, and the From<&Plan> bridge layer) is gone. ModelIr directly carries the source/target descriptors, the materialization strategy, and any dependencies, sources, contracts, and checks the model declares. Both replication and transformation pipelines compile to the same ModelIr shape, with the kind discriminated by which descriptors are populated.

MaterializationStrategy

Controls how data is written to the target table. Variants:

FullRefresh: CREATE OR REPLACE TABLE ... AS SELECT .... Rebuilds the entire table on every run.
Incremental: INSERT INTO ... SELECT ... WHERE ts > watermark. Only processes new rows. The watermark is read from the embedded state store at SQL-generation time (not carried on the strategy itself, so recipe-hash inputs stay runtime-state-free).
Merge: MERGE INTO ... USING (...) ON key WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT. Upserts based on a unique key.
View: CREATE OR REPLACE VIEW ... AS SELECT .... No physical storage; every read re-executes the SELECT. Supported on every warehouse.
MaterializedView: CREATE OR REPLACE MATERIALIZED VIEW ... AS SELECT .... Databricks-specific.
DynamicTable: CREATE OR REPLACE DYNAMIC TABLE ... TARGET_LAG = '...' AS SELECT .... Snowflake-specific.
TimeInterval: partition-keyed materialization. The model SQL uses @start_date and @end_date placeholders; the runtime substitutes per-partition timestamps. Supports --partition, --from/--to, --latest, --missing CLI flags.
Ephemeral: not materialized; inlined as a CTE in downstream consumers.
DeleteInsert: delete rows matching the partition key, then insert fresh data. Cheaper alternative to Merge when the partition key identifies the rows being rewritten.
Microbatch: alias for TimeInterval with hour-granularity defaults; dbt-compatible naming.
ContentAddressed: content-addressed Parquet + Delta log commit via the rocky-iceberg writer. Designed for cross-engine reads (DuckDB iceberg_scan, Trino, Spark). SQL generation does not run for this strategy; the runtime drives rocky-iceberg::uniform_writer directly. See Content-Addressed Materialization.

The full enum lives in rocky-ir::ir::MaterializationStrategy. SQL generation (where applicable) lives in rocky-core::sql_gen.
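
To make the strategy-to-SQL relationship concrete, here is a reduced sketch covering four of the variants. It is not the rocky-ir enum or the rocky-core generator: the Strategy type, its timestamp_column field, the generate_sql signature, and the exact SQL shapes are assumptions. It does mirror one property described above: the watermark is supplied at SQL-generation time instead of being stored on the strategy.

```rust
/// Hypothetical, reduced stand-in for MaterializationStrategy.
enum Strategy {
    FullRefresh,
    Incremental { timestamp_column: String },
    View,
    Ephemeral,
}

/// Returns None when the strategy produces no statement of its own.
/// `target` and `timestamp_column` are assumed to be validated identifiers.
fn generate_sql(
    strategy: &Strategy,
    target: &str,
    select: &str,
    watermark: Option<&str>, // read from the state store by the caller
) -> Option<String> {
    match strategy {
        Strategy::FullRefresh => Some(format!("CREATE OR REPLACE TABLE {target} AS {select}")),
        Strategy::Incremental { timestamp_column } => Some(match watermark {
            // Only rows newer than the stored watermark.
            Some(wm) => format!(
                "INSERT INTO {target} SELECT * FROM ({select}) AS src WHERE {timestamp_column} > '{wm}'"
            ),
            // First run: no watermark yet, load everything.
            None => format!("INSERT INTO {target} {select}"),
        }),
        Strategy::View => Some(format!("CREATE OR REPLACE VIEW {target} AS {select}")),
        // Inlined as a CTE downstream, so nothing to execute here.
        Strategy::Ephemeral => None,
    }
}

fn main() {
    let strategy = Strategy::Incremental { timestamp_column: "updated_at".to_string() };
    let sql = generate_sql(&strategy, "analytics.summary", "SELECT * FROM enriched", Some("2024-01-01"));
    println!("{sql:?}");
}
```

Keeping the watermark out of the enum means two runs of the same model carry an identical strategy value, which fits the stated goal of keeping recipe-hash inputs free of runtime state.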

Adapter SDK

The rocky-adapter-sdk crate provides the stable, versioned traits (WarehouseAdapter, SqlDialect, DiscoveryAdapter, GovernanceAdapter) plus an AdapterManifest that declares which traits each adapter implements.

Adapters can be built in Rust (direct trait implementation) or in any language via the process adapter protocol (JSON-RPC over stdio).

Scaffold a new adapter:

rocky init-adapter bigquery

Run conformance tests:

rocky test-adapter --adapter duckdb