Version: v0.3.0

src/sydra/query/lexer.zig

Purpose

Tokenizes sydraQL source text into a stream of tokens for the parser.

Definition index (public)

pub const TokenKind = enum { ... }

Full set of token kinds (as implemented):

  • Identifiers and literals:
    • identifier
    • quoted_identifier (double-quoted identifiers)
    • number
    • string (single-quoted strings)
    • keyword
  • Punctuation:
    • comma, period, semicolon, colon
    • l_paren, r_paren
    • l_bracket, r_bracket
    • l_brace, r_brace
  • Arithmetic:
    • plus, minus, star, slash, percent, caret
  • Comparisons and matching:
    • equal, bang_equal
    • less, less_equal, greater, greater_equal
    • regex_match (=~), regex_not_match (!~)
  • Logical tokens:
    • and_and (&&), or_or (||)
  • Misc:
    • arrow (->)
    • eof
    • unknown

pub const Keyword = enum { ... }

Recognized sydraQL keywords, including:

select, from, where, group, by, fill, order, limit, offset,
insert, into, values, delete, explain, as,
tag, time, now, between,
logical_and, logical_or, logical_not,
previous, linear,
asc, desc,
boolean_true, boolean_false,
null_literal

Notes:

  • Keyword recognition is case-insensitive.
  • The lexer maps and/or/not to logical_and/logical_or/logical_not (as keywords).
  • Boolean and null literals are recognized as keywords: true, false, null.
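
Since recognition is case-insensitive and table-driven, the lookup helper (keywordFromSlice, listed under internal helpers below) plausibly walks a static table. A minimal sketch, where the entry shape and the spellings shown are assumptions:

keywordFromSlice (sketch)
const std = @import("std");

const KeywordEntry = struct {
    text: []const u8,
    keyword: Keyword,
};

// Assumed shape of the static keyword_table; only a few entries spelled out.
const keyword_table = [_]KeywordEntry{
    .{ .text = "select", .keyword = .select },
    .{ .text = "from", .keyword = .from },
    .{ .text = "and", .keyword = .logical_and },
    .{ .text = "true", .keyword = .boolean_true },
    // ... remaining keywords ...
};

fn keywordFromSlice(lexeme: []const u8) ?Keyword {
    for (keyword_table) |entry| {
        // Case-insensitive comparison, so SELECT, select, and Select all match.
        if (std.ascii.eqlIgnoreCase(entry.text, lexeme)) return entry.keyword;
    }
    return null;
}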

pub const Token = struct { ... }

  • kind: TokenKind
  • lexeme: []const u8 — slice of the original source
  • span: Span — byte offsets into the original source
  • keyword: ?Keyword — set only when kind == .keyword
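
In Zig terms the struct plausibly reads as below; the null default on keyword is an assumption that matches the "set only when kind == .keyword" behavior:

Token (sketch)
pub const Token = struct {
    kind: TokenKind,
    lexeme: []const u8,
    span: common.Span,
    keyword: ?Keyword = null, // assumed default; filled in only for keyword tokens
};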

pub const LexError = error { ... }

Returned for malformed literals:

  • InvalidLiteral
  • UnterminatedString

pub const Lexer = struct { ... }

Key fields:

  • allocator: std.mem.Allocator
  • source: []const u8
  • index: usize

Public methods:

  • init(allocator, source) Lexer — constructs a lexer at index = 0
  • next() LexError!Token — returns the next token (skipping whitespace and comments)
  • peek() LexError!Token — lookahead without consuming (next + rewind)
  • reset() void — rewinds to the start (index = 0)
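
The "next + rewind" description of peek() suggests it simply saves and restores index around a call to next(); a minimal sketch under that assumption:

Lexer.peek() (sketch)
pub fn peek(self: *Lexer) LexError!Token {
    const saved = self.index; // remember the current position
    const tok = try self.next(); // scan normally
    self.index = saved; // rewind so the next call re-scans the same token
    return tok;
}

Because tokens carry slices into source rather than copies, rewinding index does not invalidate the returned token.
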
Lexer.next() (excerpt)
pub fn next(self: *Lexer) LexError!Token {
    self.skipWhitespaceAndComments();
    if (self.index >= self.source.len) {
        return eofToken(self.source.len);
    }

    const start = self.index;
    const ch = self.source[self.index];

    if (isIdentifierStart(ch)) return self.scanIdentifier(start);
    if (ch == '"' or ch == '\'') return self.scanString(start, ch);
    if (isDigit(ch)) return self.scanNumber(start);

    switch (ch) {
        ',' => return self.makeSimpleToken(TokenKind.comma, start, 1),
        '-' => {
            if (self.matchChar('>')) return self.makeSimpleToken(TokenKind.arrow, start, 2);
            return self.makeSimpleToken(TokenKind.minus, start, 1);
        },
        '=' => {
            if (self.matchChar('~')) return self.makeSimpleToken(TokenKind.regex_match, start, 2);
            return self.makeSimpleToken(TokenKind.equal, start, 1);
        },
        // ... many more single/two-char tokens ...
        else => {},
    }

    self.index += 1;
    return Token{
        .kind = TokenKind.unknown,
        .lexeme = self.source[start..self.index],
        .span = common.Span.init(start, self.index),
    };
}
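
Driving the lexer is a simple pull loop; a usage sketch (the import path and the dumpTokens name are hypothetical):

Driving the lexer (sketch)
const std = @import("std");
const lexer = @import("lexer.zig"); // hypothetical import path

fn dumpTokens(allocator: std.mem.Allocator, source: []const u8) !void {
    var lx = lexer.Lexer.init(allocator, source);
    while (true) {
        const tok = try lx.next();
        if (tok.kind == .eof) break; // next() returns eof once index reaches the end
        std.debug.print("{s} '{s}'\n", .{ @tagName(tok.kind), tok.lexeme });
    }
}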

Lexing rules (as implemented)

Whitespace and comments

Whitespace: spaces, tabs, \n, and \r are skipped.

Comments:

  • -- line comments
  • /* block comments */ (unterminated block comment falls through to EOF)
Whitespace + comment skipping (excerpt)
fn skipWhitespaceAndComments(self: *Lexer) void {
    while (self.index < self.source.len) {
        const ch = self.source[self.index];
        switch (ch) {
            ' ', '\t', '\n', '\r' => {
                self.index += 1;
            },
            '-' => {
                if (self.matchAhead("--")) {
                    self.index += 2;
                    self.skipLineComment();
                    continue;
                }
                return;
            },
            '/' => {
                if (self.matchAhead("/*")) {
                    self.index += 2;
                    self.skipBlockComment();
                    continue;
                }
                return;
            },
            else => return,
        }
    }
}
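
Because an unterminated block comment falls through to EOF, skipBlockComment presumably scans for the closing */ and stops quietly at end of input; a sketch of that behavior:

skipBlockComment (sketch)
fn skipBlockComment(self: *Lexer) void {
    while (self.index + 1 < self.source.len) {
        if (self.source[self.index] == '*' and self.source[self.index + 1] == '/') {
            self.index += 2; // consume the closing "*/"
            return;
        }
        self.index += 1;
    }
    self.index = self.source.len; // unterminated: silently consume the rest
}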

Identifiers and keywords

  • Identifiers match:
    • start: [A-Za-z_]
    • body: [A-Za-z0-9_]
  • Keyword lookup is case-insensitive and uses a static keyword_table.
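
These character classes map directly onto the isIdentifierStart/isIdentifierBody helpers listed below; a sketch:

Identifier predicates (sketch)
fn isIdentifierStart(ch: u8) bool {
    return (ch >= 'a' and ch <= 'z') or (ch >= 'A' and ch <= 'Z') or ch == '_';
}

fn isIdentifierBody(ch: u8) bool {
    return isIdentifierStart(ch) or (ch >= '0' and ch <= '9');
}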

Strings vs quoted identifiers

The lexer uses the delimiter to decide the token kind:

  • "double quotes"quoted_identifier
  • 'single quotes'string

Escaping:

  • The delimiter is escaped by doubling it:
    • 'can''t' includes a literal '
    • "a""b" includes a literal "

Numbers

Numbers are scanned with support for:

  • integer form (42)
  • decimal form (3.14)
  • exponent suffix (1e6, 1.2E-3)

Parsing into i64 vs f64 happens in the parser (not in the lexer).
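
The scan shape these forms imply, with scanDigits consuming each digit run; the fraction backtracking and the exact InvalidLiteral conditions are assumptions:

scanNumber (sketch)
fn scanNumber(self: *Lexer, start: usize) LexError!Token {
    self.scanDigits(); // integer part, e.g. 42
    // Fraction: consume '.' only when a digit follows, so a bare trailing
    // period can still lex as a period token (an assumption).
    if (self.index + 1 < self.source.len and
        self.source[self.index] == '.' and isDigit(self.source[self.index + 1]))
    {
        self.index += 1;
        self.scanDigits();
    }
    // Exponent: e/E, optional sign, then at least one digit.
    if (self.index < self.source.len and
        (self.source[self.index] == 'e' or self.source[self.index] == 'E'))
    {
        self.index += 1;
        if (self.index < self.source.len and
            (self.source[self.index] == '+' or self.source[self.index] == '-'))
        {
            self.index += 1;
        }
        if (self.index >= self.source.len or !isDigit(self.source[self.index])) {
            return LexError.InvalidLiteral; // e.g. "1e" or "1.2E-"
        }
        self.scanDigits();
    }
    return Token{
        .kind = .number,
        .lexeme = self.source[start..self.index],
        .span = common.Span.init(start, self.index),
    };
}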

Operators and punctuation

Notable multi-character tokens:

  • -> → arrow
  • =~ → regex_match
  • !~ → regex_not_match
  • != → bang_equal
  • <= / >= → less_equal / greater_equal
  • && / || → and_and / or_or
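
These two-character operators all go through the one-byte lookahead seen in the next() excerpt; sketches of matchChar and makeSimpleToken consistent with those call sites (both shapes are inferred, not verbatim):

matchChar + makeSimpleToken (sketch)
fn matchChar(self: *Lexer, expected: u8) bool {
    // Look one byte past the character currently under the cursor.
    return self.index + 1 < self.source.len and self.source[self.index + 1] == expected;
}

fn makeSimpleToken(self: *Lexer, kind: TokenKind, start: usize, len: usize) Token {
    self.index = start + len; // consume the token's bytes
    return Token{
        .kind = kind,
        .lexeme = self.source[start .. start + len],
        .span = common.Span.init(start, start + len),
    };
}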

EOF and unknown tokens

  • When index >= source.len, next() returns an eof token with an empty lexeme and a zero-width span at the end.
  • Any unrecognized character produces an unknown token for that single byte.

Internal helpers (non-public)

These are worth knowing when debugging tokenization:

  • skipWhitespaceAndComments, skipLineComment, skipBlockComment
  • scanIdentifier, scanString, scanNumber, scanDigits
  • keywordFromSlice (case-insensitive, backed by keyword_table)
  • eofToken, isIdentifierStart, isIdentifierBody, isDigit

Tests

Inline tests cover:

  • emitting EOF for empty input
  • keyword recognition (SELECT → kind .keyword, keyword .select)
  • scanning numbers and strings
  • skipping -- and /* */ comments
  • peek() not consuming tokens
  • unterminated strings returning LexError.UnterminatedString
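
As an illustration of the style, the last bullet's test might read roughly like this (a sketch, not the verbatim test):

Unterminated string test (sketch)
const std = @import("std");

test "unterminated string yields LexError.UnterminatedString" {
    var lx = Lexer.init(std.testing.allocator, "'never closed");
    try std.testing.expectError(LexError.UnterminatedString, lx.next());
}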