Architecture
Rocky is a Cargo workspace composed of several crates, each with a focused responsibility. This page describes what each crate does and how they connect.
Adapter model
Rocky separates concerns through two adapter types:
Source adapters handle discovery — finding what schemas and tables exist and are available for processing. They do NOT extract data. The data must already be in the warehouse, landed by an ingestion tool (Fivetran, Airbyte, etc.) or loaded manually.
- `rocky-fivetran` — Calls the Fivetran REST API to list connectors and their enabled tables in the destination
- `rocky-duckdb` — Queries `information_schema` to discover schemas and tables in a local DuckDB database
- Manual source (built into `rocky-core`) — Reads schema/table definitions from `rocky.toml`
Warehouse adapters handle execution — running SQL, managing catalog lifecycle, and applying governance (tags, permissions, workspace isolation).
- `rocky-databricks` — Executes via the Databricks SQL Statement API and manages Unity Catalog, with adaptive concurrency (AIMD algorithm)
- `rocky-snowflake` — Executes via the Snowflake REST API with OAuth, key-pair JWT, and password auth
- `rocky-duckdb` — Local in-process execution for development, testing, and CI
rocky-core sits between them. It defines the Intermediate Representation (IR) and all warehouse-agnostic logic (DAG resolution, schema pattern parsing, SQL generation templates, checks, contracts, state management). rocky-core has no knowledge of Databricks, Fivetran, or any specific system.
This architecture means adding a new warehouse (e.g., Snowflake) or a new source (e.g., Airbyte) requires implementing an adapter crate without modifying the core engine.
Monorepo
Rocky is a monorepo with four subprojects:
```
rocky-data/
├── engine/               # Rust CLI + engine (this section)
├── integrations/dagster/ # dagster-rocky Python package
├── editors/vscode/       # VS Code extension (LSP client)
├── examples/playground/  # POC catalog (28 POCs) + benchmarks
└── docs/                 # Documentation site (Astro + Starlight)
```
Crate overview
```
engine/
├── crates/
│   ├── rocky-core/        # Generic SQL transformation engine
│   ├── rocky-sql/         # SQL parsing + typed AST
│   ├── rocky-lang/        # Rocky DSL parser (.rocky files)
│   ├── rocky-compiler/    # Type checking + semantic analysis
│   ├── rocky-adapter-sdk/ # Adapter SDK + conformance tests
│   ├── rocky-databricks/  # Databricks warehouse adapter
│   ├── rocky-snowflake/   # Snowflake warehouse adapter
│   ├── rocky-fivetran/    # Fivetran source adapter
│   ├── rocky-duckdb/      # DuckDB local execution adapter
│   ├── rocky-engine/      # Local query engine (DataFusion + Arrow)
│   ├── rocky-server/      # HTTP API + LSP server
│   ├── rocky-cache/       # Three-tier caching
│   ├── rocky-ai/          # AI intent layer
│   ├── rocky-observe/     # Observability
│   └── rocky-cli/         # CLI framework + Dagster Pipes
└── rocky/                 # Binary crate
```
rocky-core
The warehouse-agnostic transformation engine. This crate has no knowledge of specific warehouses or sources — it works entirely through adapter abstractions, producing IR that warehouse adapters consume.
Key modules:
- ir.rs — Intermediate Representation types: `Plan`, `ReplicationPlan`, `TransformationPlan`, `MaterializationStrategy`
- schema.rs — Configurable schema pattern parsing (e.g., `src__acme__us_west__shopify` into structured components)
- drift.rs — Schema drift detection (compares column types between source and target)
- sql_gen.rs — IR to dialect-specific SQL generation
- state.rs — Embedded state store backed by `redb` for watermarks and run history
- state_sync.rs — Remote state persistence: download/upload state from S3, Valkey, or tiered (Valkey + S3)
- catalog.rs — Catalog and schema lifecycle management (CREATE IF NOT EXISTS, tagging)
- checks.rs — Inline data quality checks (row counts, column matching, freshness, null rate, custom)
- contracts.rs — Data contracts (required columns, protected columns, allowed type changes)
- dag.rs — DAG resolution for model dependencies (topological sort)
- models.rs — SQL model loading (sidecar `.sql` + `.toml` files)
- source.rs — Source adapter traits and manual source configuration
- config.rs — TOML configuration parsing with environment variable substitution (`${VAR}` and `${VAR:-default}`)
rocky-sql
SQL parsing and validation built on sqlparser-rs.
- parser.rs — Wraps sqlparser-rs with typed extensions for Rocky’s needs
- dialect.rs — Databricks SQL dialect support
- validation.rs — SQL identifier validation using strict regex patterns. All identifiers must pass through this module before being interpolated into SQL. This prevents SQL injection by rejecting anything that doesn’t match `^[a-zA-Z0-9_]+$`.
rocky-databricks
The Databricks warehouse adapter. Implements the warehouse traits defined in rocky-core.
- connector.rs — SQL Statement Execution REST API client (`POST /api/2.0/sql/statements`, polling for results)
- catalog.rs — Unity Catalog CRUD operations, tagging, and catalog isolation
- permissions.rs — GRANT/REVOKE execution, SHOW GRANTS parsing
- workspace.rs — Workspace binding management for catalog isolation
- auth.rs — Authentication with auto-detection: tries PAT (`DATABRICKS_TOKEN`) first, falls back to OAuth M2M (`DATABRICKS_CLIENT_ID` + `DATABRICKS_CLIENT_SECRET`)
- batch.rs — Batched information_schema queries using UNION ALL (batches of 200)
rocky-fivetran
The Fivetran source adapter. Discovers what schemas and tables exist in the Fivetran destination. This is a metadata-only operation — the actual data is already in the warehouse, landed by Fivetran’s sync process.
- client.rs — Async REST client using reqwest with Basic Auth
- connector.rs — Connector discovery and filtering
- schema.rs — Schema configuration parsing (nested JSON structures from Fivetran’s API)
- pagination.rs — Cursor-based pagination for large result sets
- sync.rs — Sync detection via timestamp comparison (determines if new data is available)
rocky-cache
Three-tier caching system that reduces API calls and speeds up repeated operations.
- memory.rs — In-process LRU cache with configurable TTL
- valkey.rs — Valkey/Redis distributed cache with distributed locks
- tiered.rs — Fallback chain: memory -> Valkey -> API. A value fetched from a lower tier is written back into every tier above it, so subsequent reads hit the fastest tier.
rocky-duckdb
DuckDB local execution adapter. Minimal implementation providing a local warehouse backend for development and testing without requiring a Databricks connection.
rocky-observe
Observability infrastructure.
- metrics.rs — In-process metrics collection: counters (tables processed/failed, statements executed, retries, anomalies) and duration histograms (p50/p95/max for tables and queries). Thread-safe via atomics, serialized to JSON in run output.
- tracing.rs — Structured JSON logging via the `tracing` crate
- events.rs — Event broadcasting over Valkey Pub/Sub for real-time monitoring
rocky-lang
Rocky DSL parser for `.rocky` files. Converts pipeline-oriented syntax into an AST that lowers to standard SQL.
- lexer.rs — Token scanner built on the `logos` crate
- parser.rs — Recursive descent parser producing a typed AST
- lowering.rs — AST to SQL lowering
rocky-compiler
Type checking and semantic analysis for Rocky models.
- typecheck.rs — Column-level type inference across the DAG
- semantic_graph.rs — Tracks column lineage, dependencies, and contracts
- diagnostics.rs — Compiler errors and warnings with source locations and suggestions
rocky-adapter-sdk
Stable, versioned traits for building custom warehouse adapters:
- `WarehouseAdapter` — execute SQL, describe tables, manage catalog objects
- `SqlDialect` — format SQL for a specific warehouse
- `DiscoveryAdapter` — discover connectors and tables
- `GovernanceAdapter` — manage tags, grants, workspace bindings
- Includes 26 conformance tests (19 core + 7 optional)
rocky-snowflake
Snowflake warehouse adapter.
- auth.rs — OAuth, password, RS256 key-pair JWT auth (auto-detection)
- connector.rs — Snowflake REST API client
- dialect.rs — Snowflake SQL dialect (dynamic tables, multi-statement transactions)
rocky-engine
Local query engine built on DataFusion + Arrow. Powers `rocky test` and type inference without a warehouse connection.
rocky-server
HTTP API and Language Server Protocol (LSP) server.
- REST API (via `axum`) — model metadata, lineage, DAG endpoints for `rocky serve`
- LSP (via `tower-lsp`) — diagnostics, hover, completion, go-to-definition, rename for `rocky lsp`
rocky-ai
AI intent layer using Claude for model generation, intent extraction, schema change sync, and test generation.
- Implements the compile-verify loop (up to 3 retries on compilation failure)
- Requires `ANTHROPIC_API_KEY`
rocky-cli
CLI framework built on clap.
- commands/ — 35+ command implementations organized by category
- output.rs — Typed JSON output structs (28 schemas) with `JsonSchema` derivation for codegen
- pipes.rs — Dagster Pipes protocol emitter (activates when `DAGSTER_PIPES_CONTEXT` is set)
rocky (binary)
The rocky binary crate. Contains only main.rs, which wires all the library crates together and dispatches CLI commands.
Intermediate Representation (IR)
Rocky compiles configuration and SQL into an intermediate representation before generating executable SQL. This separation means the core engine never deals with raw strings — everything is typed and validated.
The top-level Plan enum represents what Rocky will execute:
```rust
enum Plan {
    Replication(ReplicationPlan),
    Transformation(TransformationPlan),
}
```
- ReplicationPlan — A config-driven copy from source to target (the bronze layer). Contains source/target table references, the incremental strategy, metadata columns, and quality checks.
- TransformationPlan — A user-written SQL model (the silver layer). Contains the parsed SQL, target table, materialization strategy, and dependency references.
MaterializationStrategy
Controls how data is written to the target table:
```rust
enum MaterializationStrategy {
    FullRefresh,
    Incremental {
        timestamp_column: String,
        watermark: Option<DateTime>,
    },
    Merge {
        unique_key: Vec<String>,
        update_columns: Option<Vec<String>>,
    },
    MaterializedView,
    DynamicTable {
        target_lag: String,
    },
    TimeInterval {
        time_column: String,
        granularity: String, // hour, day, month, year
        lookback: Option<u32>,
        batch_size: Option<u32>,
        first_partition: Option<String>,
    },
}
```
- FullRefresh — `CREATE OR REPLACE TABLE ... AS SELECT ...`. Rebuilds the entire table on every run.
- Incremental — `INSERT INTO ... SELECT ... WHERE ts > watermark`. Only processes new rows. The watermark is read from the embedded state store.
- Merge — `MERGE INTO ... USING (...) ON key WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT`. Upserts based on a unique key.
- MaterializedView — `CREATE OR REPLACE MATERIALIZED VIEW ... AS SELECT ...`. Databricks-specific.
- DynamicTable — `CREATE OR REPLACE DYNAMIC TABLE ... TARGET_LAG = '...' AS SELECT ...`. Snowflake-specific.
- TimeInterval — Partition-keyed materialization. The model SQL uses `@start_date` and `@end_date` placeholders; the runtime substitutes per-partition timestamps. Supports `--partition`, `--from`/`--to`, `--latest`, `--missing` CLI flags.
The IR is generated by rocky-core and consumed by sql_gen.rs, which produces dialect-specific SQL strings ready for execution.
Adapter SDK
The rocky-adapter-sdk crate provides stable, versioned traits for building custom warehouse adapters:
- `WarehouseAdapter` — execute SQL, describe tables, manage catalog objects
- `SqlDialect` — format SQL for a specific warehouse (table refs, DDL, DML, type mapping)
- `DiscoveryAdapter` — discover connectors and tables from a source
- `GovernanceAdapter` — manage tags, grants, and workspace bindings
- `AdapterManifest` — declares capabilities per adapter (which traits it implements)
Adapters can be built in Rust (direct trait implementation) or in any language via the process adapter protocol (JSON-RPC over stdio).
Scaffold a new adapter:
```sh
rocky init-adapter bigquery
```

Run conformance tests:

```sh
rocky test-adapter --adapter duckdb
```