System architecture
A technical deep dive into the Colrows engine. The system is partitioned into four computational domains - intent parsing, semantic resolution, logical planning, and physical execution - each isolated to ensure determinism, reproducibility, and independent scalability.
Architecture thesis
Colrows is a deterministic semantic compiler for enterprise analytics. Unlike traditional query proxies or BI semantic layers that resolve meaning at presentation time, Colrows formalizes business semantics prior to physical execution. The system separates semantic reasoning from physical planning and treats business logic as a versioned, dependency-aware graph that can be validated, compiled, and proven correct before interacting with any underlying SQL engine.
The core architectural premise: ambiguity must be resolved at compile time, not at runtime. Every query undergoes semantic proof, constraint validation, and policy enforcement before cost-based planning and dialect generation occur.
Layered architecture
Incoming requests are processed by stateless platform nodes responsible for syntactic parsing and AST construction. The resulting AST is not immediately lowered to SQL. Instead, it is transformed into a semantic intermediate representation that abstracts business intent independently of any physical schema. This intermediate form is the input to the semantic control plane, where validation and graph resolution occur. Only after semantic correctness is established does the SQL engine perform physical planning and dialect specialization.
This strict staging prevents leakage of physical structure into business reasoning and preserves warehouse independence.
The compilation pipeline
Colrows follows a compile-then-execute model similar to modern programming-language compilers. The pipeline has seven deterministic stages.
1. Syntactic parsing & AST construction
The incoming query is parsed into an abstract syntax tree. The parser validates grammar and structural correctness but does not resolve business meaning. Identifiers - metrics, dimensions, entities - are preserved symbolically.
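The key property of this stage - identifiers surviving parsing as unresolved symbols - can be sketched with a toy grammar. The `Symbol` and `QueryAST` types and the `metrics by dimensions` syntax below are illustrative assumptions, not the actual Colrows grammar:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Symbol:
    # An unresolved identifier; meaning is assigned later, during semantic binding.
    name: str

@dataclass
class QueryAST:
    metrics: list      # list[Symbol]
    dimensions: list   # list[Symbol]

def parse(query: str) -> QueryAST:
    # Toy grammar: "<metric>[, <metric>...] by <dimension>[, <dimension>...]".
    # Only structure is validated here; no business meaning is resolved.
    head, _, tail = query.partition(" by ")
    if not head.strip():
        raise SyntaxError("query must name at least one metric")
    metrics = [Symbol(t.strip()) for t in head.split(",")]
    dims = [Symbol(t.strip()) for t in tail.split(",") if t.strip()]
    return QueryAST(metrics=metrics, dimensions=dims)
```

The output carries `Symbol("revenue")`, not a column reference - the binding stage decides what "revenue" means.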
2. Semantic binding
Symbolic references are resolved against the Consensus Semantic Layer. This stage performs name resolution, scope validation, and entity disambiguation. Resolution is graph-based, not catalog-based - metrics are nodes within a dependency graph, with explicit edges to source entities, constraints, and derivation logic.
If a symbol cannot be resolved to a unique semantic node under the requesting persona's scope, compilation fails.
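A minimal sketch of scope-aware binding, assuming a hypothetical candidate registry (`GRAPH`) that maps a name to candidate node IDs and the scopes in which each is visible; the real layer resolves against the full dependency graph:

```python
# Hypothetical registry: name -> [(node_id, visible_scopes)].
GRAPH = {
    "revenue": [("metric.finance.revenue", {"finance", "exec"})],
    "margin":  [("metric.finance.margin", {"finance"}),
                ("metric.sales.margin",   {"sales", "finance"})],
}

class BindingError(Exception):
    pass

def bind(symbol: str, persona_scope: str) -> str:
    # Keep only candidates visible to the requesting persona's scope.
    candidates = [node for node, scopes in GRAPH.get(symbol, [])
                  if persona_scope in scopes]
    if len(candidates) != 1:
        # Zero candidates: unknown or out of scope. Two or more: ambiguous.
        raise BindingError(f"{symbol!r} does not resolve to a unique node")
    return candidates[0]
```

Note that ambiguity is a hard failure, not a best-effort guess: `margin` binds for a `sales` persona but fails for `finance`, which can see two candidates.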
3. Join path proof
When a metric references entities spanning multiple datasets, the engine must prove the existence of a deterministic join path. This is solved as a constrained graph traversal problem:
- Entities and datasets form a directed multigraph where edges encode cardinality, foreign-key relationships, and grain compatibility.
- The engine computes candidate paths using bounded breadth-first search.
- Paths violating declared grain are discarded; cardinality-expanding paths beyond a threshold are pruned; cycles are eliminated using visited-state tracking with relationship-type awareness.
- If multiple valid paths exist, a deterministic ranking heuristic prioritizes minimal hop count, declared canonical relationships, and explicit anchor definitions.
- Ambiguity that cannot be resolved by explicit semantic anchors fails compilation.
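The traversal above can be sketched as a bounded breadth-first search over a toy entity graph. The `EDGES` table and cardinality tags are illustrative assumptions; the real engine also tracks grain compatibility and relationship types:

```python
from collections import deque

# Toy entity graph: each edge carries a cardinality tag used for pruning.
EDGES = {
    "orders":     [("customers", "many_to_one"), ("line_items", "one_to_many")],
    "customers":  [("regions", "many_to_one")],
    "line_items": [("products", "many_to_one")],
}

def join_paths(src, dst, max_hops=3):
    # Bounded BFS: prune cardinality-expanding hops, eliminate cycles via
    # visited-state tracking, then rank deterministically.
    paths, queue = [], deque([(src, [src])])
    while queue:
        node, path = queue.popleft()
        if node == dst:
            paths.append(path)
            continue
        if len(path) > max_hops:        # bound the search depth
            continue
        for nxt, card in EDGES.get(node, []):
            if card == "one_to_many":   # would expand the grain; discard
                continue
            if nxt in path:             # cycle elimination
                continue
            queue.append((nxt, path + [nxt]))
    # Deterministic ranking: minimal hop count, then lexical tiebreak.
    return sorted(paths, key=lambda p: (len(p), p))
```

Determinism comes from the explicit sort at the end - given the same graph state, the same path always wins.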
4. Constraint solving
Constraints represent grain, time filters, cardinality boundaries, and policy restrictions. During compilation, constraints are treated as formal predicates attached to metric and dimension nodes. Constraint solving proceeds in three phases:
- Aggregation grain validation - requested dimensions must be compatible with metric grain.
- Filter compatibility analysis - contradictory predicates are detected.
- Scope enforcement - persona-level restrictions are honored.
The solver operates over the dependency graph, not over SQL fragments - so semantic inconsistencies are caught before any SQL is generated.
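The first two phases can be sketched with simple predicates. The representations below - grain as a set of dimensions, filters as `(column, op, value)` triples - are deliberate simplifications of the richer constraint language:

```python
def grain_compatible(metric_grain: set, requested_dims: set) -> bool:
    # Phase 1: requested dimensions must not be finer than the metric's
    # declared grain.
    return requested_dims <= metric_grain

def contradictory(filters) -> bool:
    # Phase 2: intersect per-column intervals for range predicates
    # (op in {">=", "<="}); an empty intersection is a contradiction.
    lo, hi = {}, {}
    for col, op, val in filters:
        if op == ">=":
            lo[col] = max(lo.get(col, float("-inf")), val)
        else:
            hi[col] = min(hi.get(col, float("inf")), val)
    return any(lo.get(c, float("-inf")) > hi.get(c, float("inf"))
               for c in set(lo) | set(hi))
```

Because these checks run over graph-attached predicates, a query filtering `ts >= 10` and `ts <= 5` is rejected before any SQL exists to execute.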
5. Logical plan construction
Once semantic validation succeeds, the engine lowers the semantic graph into a logical relational plan. Metrics are expanded into expression trees. Derived metrics are topologically sorted by dependency order. Common subexpression elimination is performed across metric trees, and predicate pushdown rules are applied symbolically before dialect specialization. The logical plan at this stage remains warehouse-agnostic.
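The dependency-ordered expansion of derived metrics can be sketched with the standard library's topological sorter; the `DERIVED` graph below is an illustrative example, not a Colrows definition:

```python
from graphlib import TopologicalSorter

# Derived-metric dependency graph: metric -> metrics it is derived from.
DERIVED = {
    "margin_pct": {"margin", "revenue"},
    "margin":     {"revenue", "cost"},
    "revenue":    set(),
    "cost":       set(),
}

# Base metrics come first, so every expansion sees its inputs already lowered.
order = list(TopologicalSorter(DERIVED).static_order())
```

Expanding in this order guarantees that when `margin_pct` is lowered into its expression tree, the trees for `margin` and `revenue` already exist and can be shared via common subexpression elimination.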
6. Cost estimation & physical planning
The logical plan is transformed into a physical execution plan. Cost estimation incorporates table statistics, partition metadata, and index availability captured during ingestion. The planner performs join reordering with cost-based heuristics, projection pruning, predicate pushdown validation, and aggregation strategy selection. Because semantic correctness has already been proven, the physical planner can focus exclusively on efficiency rather than correctness inference.
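As a deliberately simplified stand-in for cost-based join reordering, a greedy heuristic that joins the smallest estimated relations first conveys the shape of the decision; the real planner weighs statistics, partitions, and indexes rather than raw row counts:

```python
def greedy_join_order(tables, est_rows):
    # Greedy heuristic: repeatedly pick the remaining table with the lowest
    # estimated row count, keeping intermediate results small.
    remaining = set(tables)
    order = []
    while remaining:
        nxt = min(remaining, key=lambda t: est_rows[t])
        order.append(nxt)
        remaining.discard(nxt)
    return order
```

The important architectural point stands independently of the heuristic: the planner optimizes a plan already known to be semantically correct.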
7. Dialect specialization
The final stage translates the physical plan into dialect-aware SQL. A dialect abstraction layer handles syntax variation, function mapping, and quoting semantics. Dialect translation never alters semantic intent - semantic resolution has already been finalized upstream.
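One slice of this abstraction - identifier quoting - can be sketched concretely. The delimiter table reflects the actual conventions of these engines (double quotes for PostgreSQL and Snowflake, backticks for BigQuery), but real dialect layers must also handle escaping rules and function-name mapping:

```python
# Identifier delimiter per dialect; a small slice of a dialect abstraction layer.
QUOTE = {"postgres": '"', "snowflake": '"', "bigquery": "`"}

def qualified(dialect: str, table: str, column: str) -> str:
    # Quote each identifier part with the dialect's delimiter. Escaping of
    # embedded delimiters is intentionally omitted and varies by dialect.
    q = QUOTE[dialect]
    return f"{q}{table}{q}.{q}{column}{q}"
```

The function never inspects meaning - by this stage, semantic resolution is complete, and only surface syntax varies.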
The Consensus Semantic Layer as a deterministic graph
The Consensus Semantic Layer is implemented as a typed, directed graph. Nodes include business entities, metrics, dimensions, datasets, columns, constraints, policies, personas, and scopes. Edges encode semantic relationships such as derivation, measurement, anchoring, and governance applicability.
Every node is versioned. Changes do not overwrite prior definitions - they create new semantic states, enabling reproducibility of historical queries. Dependency management is handled via graph traversal with topological ordering. Cycles in metric derivation are detected through depth-first search with back-edge detection. Compilation fails immediately upon cycle discovery.
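The back-edge detection described above is the classic three-color depth-first search; a minimal sketch over a plain adjacency dict:

```python
def has_cycle(graph) -> bool:
    # Three-color DFS: an edge into a GRAY (in-progress) node is a back edge,
    # which proves a cycle in the derivation graph.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in graph}

    def dfs(n):
        color[n] = GRAY
        for m in graph.get(n, ()):
            if color.get(m, WHITE) == GRAY:
                return True                      # back edge found
            if color.get(m, WHITE) == WHITE and dfs(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and dfs(n) for n in graph)
```

In the compiler, a `True` result aborts compilation immediately rather than attempting a partial expansion.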
Because all business meaning is encoded structurally, semantic reasoning becomes a graph problem rather than a string-substitution problem.
Drift & conflict detection
Semantic drift detection compares current structural metadata against historical graph states:
- Column distribution shifts are evaluated using statistical fingerprinting.
- Schema evolution is detected via structural diffing of dataset nodes.
- Conflict detection operates through structural equivalence analysis - metrics are equivalent only if their normalized expression trees and dependency sets match under canonical ordering.
Vector similarity is used only for candidate identification; structural comparison determines final equivalence. This hybrid approach prevents semantic duplication while avoiding over-reliance on embedding similarity.
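The structural-equivalence check can be sketched with canonical ordering over expression trees. The `(op, *args)` tuple encoding is an illustrative assumption; the idea is that commutative operators normalize their argument order before comparison:

```python
def normalize(expr):
    # Canonicalize an expression tree encoded as (op, *args): recursively
    # normalize children, then sort arguments of commutative operators.
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [normalize(a) for a in args]
    if op in {"+", "*"}:       # commutative: argument order is irrelevant
        args.sort(key=repr)
    return (op, *args)

def equivalent(a, b) -> bool:
    # Two metrics are duplicates only if their normalized trees match.
    return normalize(a) == normalize(b)
```

Under this scheme `revenue + tax` and `tax + revenue` collapse to one metric, while `a - b` and `b - a` correctly remain distinct - something embedding similarity alone cannot guarantee.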
Governance as compile-time enforcement
Governance rules are embedded directly into the semantic graph as policy nodes attached to metrics, dimensions, and datasets. During semantic binding, persona scope is resolved into an allowed subgraph. Compilation occurs within this constrained subgraph. If a metric depends on a node outside the permitted scope, resolution fails. Policy enforcement is structural rather than procedural - no post-execution masking or filtering is required, because unauthorized plans cannot be generated.
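Structural enforcement reduces to a closure check: a metric compiles only if its entire dependency closure lies inside the persona's allowed subgraph. A minimal sketch, with the graph and scope sets as illustrative inputs:

```python
def resolvable(graph, allowed, metric) -> bool:
    # Walk the metric's dependency closure; any node outside the permitted
    # scope makes the whole metric unresolvable - no plan is ever generated.
    stack, seen = [metric], set()
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        if n not in allowed:
            return False
        seen.add(n)
        stack.extend(graph.get(n, []))
    return True
```

There is nothing to mask after execution because, for an out-of-scope persona, the unauthorized plan is never constructed in the first place.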
Scalability model
Stateless parsing and planning nodes scale horizontally behind load balancers. The semantic graph store scales independently through partitioned graph storage and indexed lookup paths for entity resolution. Because semantic validation precedes execution, warehouse load is reduced - invalid or ambiguous queries are rejected before reaching compute-intensive stages. The separation of semantic state from physical planning lets each subsystem scale independently without cross-layer coupling.
Storage substrate
Colrows uses three specialized stores, each matched to the access pattern it serves:
| Store | Role | What it holds |
|---|---|---|
| Neo4j | Lineage & graph reasoning | Dependency DAGs, impact analysis, traversal paths. |
| Weaviate | Vector recall | Synonym similarity, concept discovery, multi-vector embeddings. |
| MongoDB | Semantic state | Metric definitions, dimensions, examples, version history. |
Deterministic semantics as infrastructure
Colrows reframes enterprise analytics as a compilation problem, not a dashboard problem. Business meaning is modeled as structured state. Queries are compiled against that state using graph traversal, constraint solving, and cost-based optimization. The result is an engine where correctness is proven before execution, governance is structural, and warehouse independence is preserved.
Colrows therefore operates not as a reporting tool, but as a semantic compiler embedded within the enterprise data control plane.