src/sydra/query/lexer.zig
Purpose
Tokenizes sydraQL source text into a stream of tokens for the parser.
Definition index (public)
pub const TokenKind = enum { ... }
Full set of token kinds (as implemented):
- Identifiers and literals: identifier, quoted_identifier (double-quoted identifiers), number, string (single-quoted strings), keyword
- Punctuation: comma, period, semicolon, colon, l_paren, r_paren, l_bracket, r_bracket, l_brace, r_brace
- Arithmetic: plus, minus, star, slash, percent, caret
- Comparisons and matching: equal, bang_equal, less, less_equal, greater, greater_equal, regex_match (=~), regex_not_match (!~)
- Logical tokens: and_and (&&), or_or (||)
- Misc: arrow (->), eof, unknown
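Spelled as a declaration, the enum plausibly reads as follows (an abridged sketch assembled from the list above; ordering and elisions are editorial):

pub const TokenKind = enum {
    identifier,
    quoted_identifier,
    number,
    string,
    keyword,
    comma,
    // ... remaining punctuation, arithmetic, and comparison kinds ...
    regex_match,
    regex_not_match,
    and_and,
    or_or,
    arrow,
    eof,
    unknown,
};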
pub const Keyword = enum { ... }
Recognized sydraQL keywords, including:
select, from, where, group, by, fill, order, limit, offset,
insert, into, values, delete, explain, as,
tag, time, now, between,
logical_and, logical_or, logical_not,
previous, linear,
asc, desc,
boolean_true, boolean_false,
null_literal
Notes:
- Keyword recognition is case-insensitive.
- The lexer maps and/or/not to logical_and/logical_or/logical_not (as keywords).
- Boolean and null literals are recognized as keywords: true, false, null.
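The lookup machinery itself is internal (see the helper list near the end). A minimal sketch consistent with the case-insensitive behavior and the word mapping, assuming std is imported at file scope and with the entry list abridged:

const KeywordEntry = struct { word: []const u8, kw: Keyword };

const keyword_table = [_]KeywordEntry{
    .{ .word = "select", .kw = .select },
    .{ .word = "and", .kw = .logical_and },
    .{ .word = "or", .kw = .logical_or },
    .{ .word = "not", .kw = .logical_not },
    .{ .word = "true", .kw = .boolean_true },
    .{ .word = "false", .kw = .boolean_false },
    .{ .word = "null", .kw = .null_literal },
    // ... one entry per remaining keyword ...
};

fn keywordFromSlice(slice: []const u8) ?Keyword {
    for (keyword_table) |entry| {
        // case-insensitive comparison, matching the documented behavior
        if (std.ascii.eqlIgnoreCase(entry.word, slice)) return entry.kw;
    }
    return null;
}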
pub const Token = struct { ... }
- kind: TokenKind
- lexeme: []const u8 (slice of the original source)
- span: Span (byte offsets into the original source)
- keyword: ?Keyword (set only when kind == .keyword)
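Reassembled as a declaration (a sketch; the null default on keyword is an assumption, inferred from the next() excerpt below constructing tokens without it):

pub const Token = struct {
    kind: TokenKind,
    lexeme: []const u8,
    span: common.Span,
    keyword: ?Keyword = null, // assumed default; only set for .keyword tokens
};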
pub const LexError = error { ... }
Returned for malformed literals:
- InvalidLiteral
- UnterminatedString
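As a declaration, that is simply:

pub const LexError = error{
    InvalidLiteral,
    UnterminatedString,
};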
pub const Lexer = struct { ... }
Key fields:
- allocator: std.mem.Allocator
- source: []const u8
- index: usize
Public methods:
- init(allocator, source) Lexer: constructs a lexer at index = 0
- next() LexError!Token: returns the next token (skipping whitespace and comments)
- peek() LexError!Token: lookahead without consuming (next + rewind)
- reset() void: rewinds to the start (index = 0)
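A minimal driver loop under this API (hypothetical caller code; the import path and the dumpTokens name are illustrative):

const std = @import("std");
const lexer = @import("lexer.zig"); // illustrative path

fn dumpTokens(allocator: std.mem.Allocator, source: []const u8) lexer.LexError!void {
    var lx = lexer.Lexer.init(allocator, source);
    while (true) {
        const tok = try lx.next();
        if (tok.kind == .eof) break;
        std.debug.print("{s} '{s}'\n", .{ @tagName(tok.kind), tok.lexeme });
    }
}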
Lexer.next() (excerpt)
pub fn next(self: *Lexer) LexError!Token {
    self.skipWhitespaceAndComments();
    if (self.index >= self.source.len) {
        return eofToken(self.source.len);
    }
    const start = self.index;
    const ch = self.source[self.index];
    if (isIdentifierStart(ch)) return self.scanIdentifier(start);
    if (ch == '"' or ch == '\'') return self.scanString(start, ch);
    if (isDigit(ch)) return self.scanNumber(start);
    switch (ch) {
        ',' => return self.makeSimpleToken(TokenKind.comma, start, 1),
        '-' => {
            if (self.matchChar('>')) return self.makeSimpleToken(TokenKind.arrow, start, 2);
            return self.makeSimpleToken(TokenKind.minus, start, 1);
        },
        '=' => {
            if (self.matchChar('~')) return self.makeSimpleToken(TokenKind.regex_match, start, 2);
            return self.makeSimpleToken(TokenKind.equal, start, 1);
        },
        // ... many more single/two-char tokens ...
        else => {},
    }
    self.index += 1;
    return Token{
        .kind = TokenKind.unknown,
        .lexeme = self.source[start..self.index],
        .span = common.Span.init(start, self.index),
    };
}
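matchChar and makeSimpleToken are not excerpted. Given that makeSimpleToken is handed an explicit length, a plausible reading (an inference, not the verbatim source) is that matchChar only peeks and makeSimpleToken does the consuming:

// peek at the byte after the current one without consuming anything
fn matchChar(self: *Lexer, expected: u8) bool {
    return self.index + 1 < self.source.len and self.source[self.index + 1] == expected;
}

// consume len bytes starting at start and wrap them in a token
fn makeSimpleToken(self: *Lexer, kind: TokenKind, start: usize, len: usize) Token {
    self.index = start + len;
    return Token{
        .kind = kind,
        .lexeme = self.source[start..self.index],
        .span = common.Span.init(start, self.index),
    };
}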
Lexing rules (as implemented)
Whitespace and comments
Whitespace: spaces, tabs, \n, and \r are skipped.
Comments:
- -- line comments
- /* block comments */ (an unterminated block comment falls through to EOF)
Whitespace + comment skipping (excerpt)
fn skipWhitespaceAndComments(self: *Lexer) void {
    while (self.index < self.source.len) {
        const ch = self.source[self.index];
        switch (ch) {
            ' ', '\t', '\n', '\r' => {
                self.index += 1;
            },
            '-' => {
                if (self.matchAhead("--")) {
                    self.index += 2;
                    self.skipLineComment();
                    continue;
                }
                return;
            },
            '/' => {
                if (self.matchAhead("/*")) {
                    self.index += 2;
                    self.skipBlockComment();
                    continue;
                }
                return;
            },
            else => return,
        }
    }
}
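skipBlockComment itself is not excerpted; a minimal sketch consistent with the stated fall-through behavior:

fn skipBlockComment(self: *Lexer) void {
    // index already points past the opening "/*"
    while (self.index + 1 < self.source.len) {
        if (self.source[self.index] == '*' and self.source[self.index + 1] == '/') {
            self.index += 2; // consume the closing "*/"
            return;
        }
        self.index += 1;
    }
    self.index = self.source.len; // unterminated: run to EOF, as documented
}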
Identifiers and keywords
- Identifiers match:
  - start: [A-Za-z_]
  - body: [A-Za-z0-9_]
- Keyword lookup is case-insensitive and uses a static keyword_table.
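The two character classes map directly onto std.ascii helpers; a sketch of the predicates (names taken from the internal-helper list near the end):

fn isIdentifierStart(ch: u8) bool {
    return std.ascii.isAlphabetic(ch) or ch == '_'; // [A-Za-z_]
}

fn isIdentifierBody(ch: u8) bool {
    return std.ascii.isAlphanumeric(ch) or ch == '_'; // [A-Za-z0-9_]
}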
Strings vs quoted identifiers
The lexer uses the delimiter to decide the token kind:
"double quotes"→quoted_identifier'single quotes'→string
Escaping:
- The delimiter is escaped by doubling it:
  - 'can''t' includes a literal '
  - "a""b" includes a literal "
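A sketch of the doubled-delimiter scan (hypothetical body; assumes the lexeme keeps its delimiters and that Token.keyword defaults to null as noted above):

fn scanString(self: *Lexer, start: usize, delim: u8) LexError!Token {
    self.index += 1; // skip the opening delimiter
    while (self.index < self.source.len) {
        if (self.source[self.index] == delim) {
            // a doubled delimiter is an escaped literal, not the end
            if (self.index + 1 < self.source.len and self.source[self.index + 1] == delim) {
                self.index += 2;
                continue;
            }
            self.index += 1; // consume the closing delimiter
            const kind: TokenKind = if (delim == '"') .quoted_identifier else .string;
            return Token{
                .kind = kind,
                .lexeme = self.source[start..self.index],
                .span = common.Span.init(start, self.index),
            };
        }
        self.index += 1;
    }
    return LexError.UnterminatedString;
}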
Numbers
Numbers are scanned with support for:
- integer form (42)
- decimal form (3.14)
- exponent suffix (1e6, 1.2E-3)
Parsing into i64 vs f64 happens in the parser (not in the lexer).
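A sketch of the scan (assumes scanDigits advances index over a run of ASCII digits; the InvalidLiteral paths are an assumption about where malformed literals are rejected):

fn scanNumber(self: *Lexer, start: usize) LexError!Token {
    self.scanDigits(); // integer part (at least one digit, by the caller's check)
    // optional decimal part
    if (self.index < self.source.len and self.source[self.index] == '.') {
        self.index += 1;
        if (self.index >= self.source.len or !isDigit(self.source[self.index]))
            return LexError.InvalidLiteral;
        self.scanDigits();
    }
    // optional exponent: e/E, optional sign, then digits
    if (self.index < self.source.len and (self.source[self.index] == 'e' or self.source[self.index] == 'E')) {
        self.index += 1;
        if (self.index < self.source.len and (self.source[self.index] == '+' or self.source[self.index] == '-'))
            self.index += 1;
        if (self.index >= self.source.len or !isDigit(self.source[self.index]))
            return LexError.InvalidLiteral;
        self.scanDigits();
    }
    return Token{
        .kind = .number,
        .lexeme = self.source[start..self.index],
        .span = common.Span.init(start, self.index),
    };
}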
Operators and punctuation
Notable multi-character tokens:
- -> → arrow
- =~ → regex_match
- !~ → regex_not_match
- != → bang_equal
- <= / >= → less_equal / greater_equal
- && / || → and_and / or_or
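These follow the same matchChar pattern as the '-' and '=' arms in the next() excerpt above. For example, the elided '!' arm plausibly reads like this fragment (a sketch; since there is no bare bang kind, a lone ! would fall out of the switch and become unknown):

'!' => {
    if (self.matchChar('=')) return self.makeSimpleToken(TokenKind.bang_equal, start, 2);
    if (self.matchChar('~')) return self.makeSimpleToken(TokenKind.regex_not_match, start, 2);
},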
EOF and unknown tokens
- When index >= source.len, next() returns an eof token with an empty lexeme and a zero-width span at the end.
- Any unrecognized character produces an unknown token for that single byte.
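That corresponds to an eofToken helper along these lines (sketch):

fn eofToken(end: usize) Token {
    return Token{
        .kind = .eof,
        .lexeme = "", // empty lexeme
        .span = common.Span.init(end, end), // zero-width span at the end
    };
}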
Internal helpers (non-public)
These are worth knowing when debugging tokenization:
- skipWhitespaceAndComments, skipLineComment, skipBlockComment
- scanIdentifier, scanString, scanNumber, scanDigits
- keywordFromSlice (case-insensitive, backed by keyword_table)
- eofToken, isIdentifierStart, isIdentifierBody, isDigit
Tests
Inline tests cover:
- emitting EOF for empty input
- keyword recognition (SELECT → .keyword / .select)
- scanning numbers and strings
- skipping -- and /* */ comments
- peek() not consuming tokens
- unterminated strings returning LexError.UnterminatedString
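For flavor, the EOF case might be written like this (illustrative, not copied from the file):

test "empty input yields eof" {
    var lx = Lexer.init(std.testing.allocator, "");
    const tok = try lx.next();
    try std.testing.expectEqual(TokenKind.eof, tok.kind);
    try std.testing.expectEqual(@as(usize, 0), tok.lexeme.len);
}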