sydraQL Design
This document defines the initial design for sydraQL, the native time-series query language for SydraDB. It complements the engineering roadmap in Development: sydraQL roadmap and sets the scope for lexer/parser, planning, and execution work.
For the broader post-v0.4.0 operator-facing product direction that should sit on top of these query primitives, see Operator-first user-facing surface.
Implementation reference (current code):
- HTTP surface: HTTP API – `POST /api/v1/sydraql` and `handleSydraql`
- Query pipeline: overview, lexer, parser, validator
- Planning/execution: logical plan, optimizer, physical plan, operators
- PostgreSQL translation: translator and compatibility matrix
Note: this doc is a design target; the “Source Reference” pages describe what is implemented today.
Goals
- Time-series first: expressive range scans, tag filters, downsampling, and rates without exposing the full SQL surface.
- Deterministic: explicit defaults (time zone, fill policy, ordering) to keep queries reproducible.
- Composable: modular AST and logical plan nodes that translate cleanly to execution iterators.
- Translatable: the PostgreSQL compatibility layer can map supported SQL fragments into sydraQL.
- Optimisable: language constructs map directly to storage features (rollups, tag indexes, WAL metadata).
Non-Goals (v0)
- Arbitrary cross-series joins or cartesian products.
- User-defined functions, stored procedures, or arbitrary SQL expressions.
- Multi-statement transactions or DDL.
- Full SQL compatibility (handled by the translator with best-effort coverage).
Data Model Assumptions
- Series addressed by `series_id` or `(namespace, metric)` plus tags.
- Points stored as `(timestamp, value[, field map])`.
- Tags / labels represented as string key-value pairs.
- Rollups (e.g. 1m, 5m, 1h) maintained in storage and selectable by the planner.
Query Model
sydraQL exposes a small number of statement types:
| Statement | Shape | Notes |
|---|---|---|
| `SELECT` | `select <expr_list> from <selector> [where …] [group by …] [fill …] [order by …] [limit …]` | range scans, filtering, aggregation |
| `INSERT` | `insert into <series> [(tags)] values (<ts>, <value>[, <json>])` | streaming ingest |
| `DELETE` | `delete from <series> where time >= … and time < … [and tags …]` | retention / manual deletes |
| `EXPLAIN` | `explain <select>` | planner debugging (future) |
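For concreteness, here are hedged sketches of the main statement shapes; the series name `weather.room1` and all timestamps and values are hypothetical:

```sql
-- SELECT: bucketed aggregate over one series
select time_bucket(5m, time) as bucket, avg(value)
from weather.room1
where time >= 2024-03-01T00:00:00Z and time < 2024-03-02T00:00:00Z
group by time_bucket(5m, time)

-- INSERT: streaming ingest of one point
insert into weather.room1 values (2024-03-01T00:00:00Z, 21.5)

-- DELETE: manual retention trim over a half-open time range
delete from weather.room1
where time >= 2024-01-01T00:00:00Z and time < 2024-02-01T00:00:00Z
```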
Stable v0 subset
For the current v0.4.0 alpha cycle, the user-facing contract is intentionally narrower than the aspirational grammar sketches that follow.
Supported and exercised today:
- `SELECT` against one resolved series, `by_id(...)`, or one metric family on the legacy path
- projections over `time`, `value`, `tag.*`, and `label.*`
- raw-row scalar functions currently exercised on the compiled path: `abs`, `ceil`, `floor`, `round`, `sqrt`, `ln`, `pow`, `coalesce`
- aggregate queries using: `min`, `max`, `avg`, `sum`, `count`, `first`, `last`, `percentile`, `delta`, `rate`, `irate`
- grouping by:
  - one `time_bucket(...)`
  - one selector tag such as `tag.host`
  - or the combination of `time_bucket(...)` plus one selector tag
- native grouped time-bucket fill strategies: `fill(previous)`, `fill(null)`, and constant fills such as `fill(0)`
- ordering by projection aliases or projected expressions on the supported raw-row path
- `EXPLAIN BYTECODE` and `EXPLAIN TABLES_USED`
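The two supported `EXPLAIN` variants can be sketched as follows; the series names are hypothetical and the exact output format is implementation-defined:

```sql
explain bytecode
select time_bucket(1m, time) as bucket, avg(value)
from weather.room1
where time >= now() - 1h
group by time_bucket(1m, time)

explain tables_used
select last(value)
from by_id(42)
where time >= now() - 5m
```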
Explicitly out of the current compiled subset:
- selector `tag_filter` syntax on the `FROM` selector itself
- interpolating fill strategies such as `fill(linear)`
- broader window-function coverage (`moving_avg`, `ema`, `lag`, `lead`)
- regex predicates as part of the compiled VM path
- multi-series expressions / joins
When a query falls outside the compiled subset, the current contract is to fall back visibly or fail explicitly rather than pretend broader support.
Telemetry-first notes:
- `label.<k>` is a preferred alias for `tag.<k>`.
- Bare metric names may resolve to a metric family on the legacy path.
- Raw row queries over a metric family must narrow to exactly one series after label filters; aggregate/grouped queries may scan multiple series.
Selectors
```
selector   := series_ref
series_ref := by_id(<int>) | 'namespace.metric'
```
The older `selector [tag_filter]` grammar sketch remains a future design idea only. It is not part of the v0.4.0 public subset; selector narrowing should use the regular `WHERE tag.<k> ...` predicate surface instead.
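In practice that means tag narrowing moves into the `WHERE` clause; a sketch with a hypothetical metric family:

```sql
-- narrow a metric family to exactly one series via label predicates
select time, value
from http.requests_total
where label.host = 'web-01'
  and label.env = 'prod'
  and time >= now() - 1h
```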
Temporal Predicates
```sql
time >= 2024-03-01T00:00:00Z and time < 2024-03-02T00:00:00Z
time between now() - 1h and now()
```
Tag Predicates
```sql
tag.city = "ams" and tag.host != "cache-01"
tag.env =~ /prod|staging/ or tag.team !~ /^infra/
```
Aggregations & Windows
```sql
group by time_bucket(5m, time [, origin]), tag.datacenter
select avg(value), max(value), percentile(value, 0.99)
```
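Combined into one query, a grouped aggregate on the supported subset might look like this; the metric name, tag, and window are hypothetical:

```sql
select time_bucket(5m, time) as bucket,
       avg(value) as mean,
       max(value) as peak,
       percentile(value, 0.99) as p99
from api.latency_ms
where time >= now() - 6h
group by time_bucket(5m, time), tag.datacenter
```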
Fill Policies
```sql
fill(previous)  -- carry forward last value
fill(null)      -- emit null buckets
fill(0)         -- constant
fill(linear)    -- interpolate between buckets
```
For v0.4.0, only fill(previous), fill(null), and constant fills stay on the native path. fill(linear) is the stable unsupported exemplar used by demos, benchmarks, and fallback tests.
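A sketch of a native-path fill, carrying the last observed value into empty buckets (series name hypothetical):

```sql
select time_bucket(1m, time) as bucket, last(value)
from sensor.temperature
where time >= now() - 1h
group by time_bucket(1m, time)
fill(previous)
```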
Ordering & Limits
```sql
order by time asc
limit 1000 [offset N]
```
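On the supported raw-row path, ordering can target a projection alias; a hypothetical paginated scan:

```sql
select time, value as v
from by_id(7)
where time >= now() - 15m
order by v desc
limit 100 offset 200
```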
Syntax Overview
- Keywords are case-insensitive.
- Unquoted identifiers follow `[A-Za-z_][A-Za-z0-9_]*`; dotted identifiers such as `weather.room1` and `tag.host` are accepted.
- Quoted identifiers are supported.
- Literals currently exercised in parser/tests:
  - integers and floats (`123`, `3.14`)
  - strings (`'foo'`)
  - durations (`5m`, `1h`)
  - ISO-8601 timestamps
  - booleans (`true`, `false`)
  - `null`
- Comments: `-- line comment`, `/* block comment */`.
- JSON literal handling remains restricted to insert-style payloads rather than general query expressions.
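One query touching most of the literal forms and both comment styles; the identifiers and values are hypothetical:

```sql
-- line comment: scan the last five minutes
select time, coalesce(value, 0) as v /* block comment */
from weather.room1
where time between now() - 5m and now()
  and tag.city = 'ams'
```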
Grammar Sketch (EBNF)
```ebnf
query        = select_stmt | insert_stmt | delete_stmt | explain_stmt ;
select_stmt  = "select" select_list "from" selector [where_clause] [group_clause]
               [fill_clause] [order_clause] [limit_clause] ;
select_list  = select_item { "," select_item } ;
select_item  = expr [ "as" ident ] ;
selector     = series_ref ;
series_ref   = ident | "by_id" "(" int_lit ")" ;
where_clause = "where" bool_expr ;
group_clause = "group" "by" group_item { "," group_item } ;
group_item   = "time_bucket" "(" duration "," expr ["," expr] ")" | expr ;
fill_clause  = "fill" "(" fill_spec ")" ;
order_clause = "order" "by" order_item { "," order_item } ;
limit_clause = "limit" int_lit [ "offset" int_lit ] ;
order_item   = expr [ "asc" | "desc" ] ;
bool_expr    = bool_term { ("or" | "||") bool_term } ;
bool_term    = bool_factor { ("and" | "&&") bool_factor } ;
bool_factor  = ["not"] bool_primary ;
bool_primary = comparison | "(" bool_expr ")" ;
comparison   = expr comp_op expr | tag_predicate | time_predicate ;
comp_op      = "=" | "!=" | "<" | "<=" | ">" | ">=" | "=~" | "!~" ;
expr         = additive_expr ;
```
The current handwritten parser and generated shadow parser already implement a concrete precedence ladder for the supported subset.
Expression precedence (current contract)
Highest to lowest:
- parenthesized expressions and function calls
- unary `+`, unary `-`, `not`
- `*`, `/`, `%`
- `+`, `-`
- comparison operators: `=`, `!=`, `<`, `<=`, `>`, `>=`, `=~`, `!~`
- logical `and` / `&&`
- logical `or` / `||`
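Under this ladder an unparenthesized predicate groups as shown in the comment below; the identifiers are hypothetical:

```sql
where value * 2 + 1 > 10 and tag.dc = 'ams' or tag.dc = 'fra'
-- groups as:
-- ((((value * 2) + 1) > 10) and (tag.dc = 'ams')) or (tag.dc = 'fra')
```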
Notes:
- `ORDER BY` direction defaults to ascending when omitted.
- `LIMIT` may include `OFFSET`.
- `time_bucket(step, time [, origin])` is the only bucketed grouping form in the supported compiled subset.
Function Library
- Compiled today
  - Aggregates: `min`, `max`, `avg`, `sum`, `count`, `last`, `first`
  - Scalar transforms: `abs`, `ceil`, `floor`, `round`, `pow`, `ln`, `sqrt`, `coalesce`
  - Time utility: `time_bucket(step, ts [, origin])`
- Declared in metadata but not part of the supported compiled subset yet: `percentile`, `rate`, `irate`, `delta`, `integral`, `moving_avg`, `ema`, `lag`, `lead`, `fill_forward`
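The compiled scalar transforms compose inside projections; a sketch (series hypothetical, and exact arities are defined by the function registry):

```sql
select time,
       round(sqrt(abs(value))) as magnitude,
       coalesce(value, 0) as filled
from by_id(3)
where time >= now() - 10m
```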
The source of truth for the registry is Source reference: src/sydra/query/functions.zig, but the supported compiled subset above is the public contract for this alpha.
Execution Semantics
- Implicit ordering: do not rely on implicit ordering as a stable contract; use `order by` when ordering matters.
- Time zone: all timestamps are normalised to UTC; `now()` uses the server clock in UTC.
- Bucket exclusivity: `time_bucket(step, ts)` aligns to `[start, start+step)` half-open intervals.
- Fill evaluation: `fill(...)` parses today but is not part of the supported compiled subset.
- Null handling: aggregates follow SQL semantics (`count` ignores nulls, `sum` returns null if all inputs are null).
- Limits: apply after aggregation/`order by`; the planner should push down `LIMIT` when possible.
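The half-open rule means a point exactly on a bucket boundary opens the next bucket: with 5-minute buckets, a point at 10:05:00 lands in `[10:05, 10:10)`, not `[10:00, 10:05)`. A sketch (series name hypothetical):

```sql
select time_bucket(5m, time) as bucket, count(value)
from sensor.temperature
where time >= now() - 1h
group by time_bucket(5m, time)
order by bucket asc
```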
Error Model
- Semantic errors map to dedicated codes (e.g., `ERR_TIME_RANGE_REQUIRED`, `ERR_UNSUPPORTED_FILL`).
- The parser surfaces helpful spans; the translator remaps SQLSTATE to sydraQL codes where necessary.
- Hard caps on result points (`MAX_POINTS`) and runtime (`MAX_QUERY_DURATION`) fail with informative messages.
Integration & Compatibility
- PostgreSQL translator targets sydraQL AST: only SQL constructs with direct translation are accepted; others produce explicit SQLSTATE warnings.
- Document mapping table: PostgreSQL compatibility matrix enumerating supported SQL features and their sydraQL equivalents.
- HTTP API responds with JSON: metadata (execution stats, trace ids) plus rows.
Implementation Roadmap
- Language spec – finalise this doc, publish examples, crosslink to backlog.
- Lexer & parser – implement zero-copy tokenizer, recursive-descent parser, semantic validation hooks.
- AST → logical plan – define nodes (`Scan`, `Filter`, `Project`, `Aggregate`, `JoinTime`, `Limit`, `Sort`).
- Planner passes – predicate pushdown, rollup selection, projection pruning.
- Operator execution – iterators leveraging storage segments, rollups, and in-memory aggregation buffers.
- HTTP & CLI – expose `/api/v1/sydraql`, update the CLI to submit queries, stream responses.
- Testing – golden queries, fuzzing, planner snapshots, integration with storage fixtures.
- Observability – metrics (parse/plan/exec timings), structured logs, explain output for debugging.
Execution Telemetry
- HTTP responses from `/api/v1/sydraql` stream a `stats` object alongside rows. The object reports elapsed times for each pipeline phase (parse/validate/optimize/physical/pipeline), `rows_emitted`, `rows_scanned`, a random `trace_id`, and an `operators` array containing `{name, rows_out, elapsed_ms}` entries for every operator in the execution tree.
- pgwire command completion tags mirror the summary metrics (`rows`, `scanned`, `stream_ms`, `plan_ms`, optional `trace_id`). Additional `NOTICE` messages emit one line per operator with the same row counts and timings so libpq-compatible clients can surface diagnostics.
- These fields are intended for dashboards and tracing; clients should treat them as part of the public API.
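An illustrative, non-normative sketch of the `stats` payload: the phase-timing key names (`parse_ms`, etc.), the operator names, and every value below are assumptions; only the field names called out above come from this document:

```json
{
  "stats": {
    "parse_ms": 0.12,
    "validate_ms": 0.03,
    "optimize_ms": 0.08,
    "physical_ms": 0.05,
    "pipeline_ms": 1.4,
    "rows_emitted": 360,
    "rows_scanned": 21600,
    "trace_id": "9f3c2b7a",
    "operators": [
      { "name": "Scan", "rows_out": 21600, "elapsed_ms": 0.9 },
      { "name": "Aggregate", "rows_out": 360, "elapsed_ms": 0.4 }
    ]
  }
}
```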
Open Questions
- How far do we want to widen the compiled subset before `v0.4.0` versus leaving the rest as explicit fallback?
- How do we expose rollup metadata (system tables vs. planner introspection) for users to inspect?
- Should inserts accept structured fields beyond a single numeric value in v0?
- What retention/TTL semantics should `DELETE` enforce when interacting with compaction?
Feedback welcome—update this document as decisions land or scope evolves.