# Tokenizer
tokenize() is a pure function that turns a FlyQL expression into a flat list of typed tokens. It is the shared foundation every FlyQL syntax highlighter, pretty-printer, or formatter can build on. The function is available in all three languages with byte-identical semantics — the same input produces the same tokens in JavaScript, Python, and Go.
Use it when you need token-level control over rendering (ANSI console output, custom HTML, editor decorations, token-aware linters). If you only need an HTML string, prefer the higher-level `flyql/highlight` helper (JavaScript only).
## Token shape

Every token is a plain record with four fields:
| Field | Type | Description |
|---|---|---|
| `text` | string | The slice of the original input this token covers. |
| `type` | string | One of the `CharType` values (see below). |
| `start` | int | Zero-based character offset of the first character in `text`. |
| `end` | int | Zero-based character offset one past the last character — `end == start + len(text)`. |
Tokens satisfy two invariants you can rely on:
- Round-trip: concatenating every token's `text` in order reproduces the original input exactly.
- Gap-free offsets: `tokens[0].start == 0`, `tokens[i].start == tokens[i-1].end` for all `i > 0`, and `tokens[-1].end == len(input)` (for non-empty inputs).
These two properties together mean a formatter can walk the token list, emit decorations around each token's `text`, and never miss a character or overlap a boundary.
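The two invariants can be checked mechanically. A minimal sketch in Python, using a stand-in `Token` tuple rather than the real flyql dataclass:

```python
from typing import NamedTuple


class Token(NamedTuple):
    # Stand-in for the real flyql Token record; same four fields.
    text: str
    type: str
    start: int
    end: int


def check_invariants(tokens: list[Token], source: str) -> None:
    # Round-trip: concatenated text reproduces the input exactly.
    assert "".join(t.text for t in tokens) == source
    if not tokens:
        return
    # Gap-free offsets: zero-based, contiguous, covering the whole input.
    assert tokens[0].start == 0
    assert tokens[-1].end == len(source)
    for prev, cur in zip(tokens, tokens[1:]):
        assert cur.start == prev.end
    # Each token's span length matches its text length.
    for t in tokens:
        assert t.end == t.start + len(t.text)


check_invariants(
    [
        Token("status", "flyqlKey", 0, 6),
        Token("=", "flyqlOperator", 6, 7),
        Token("200", "number", 7, 10),
    ],
    "status=200",
)
```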
`tokenize` accepts an input string and an optional mode. Only `"query"` mode is implemented in all three languages; `"columns"` mode is JavaScript-only.
## JavaScript

```js
import { tokenize } from 'flyql/tokenize'

const tokens = tokenize("status=200 and region='us-east'")
// [
//   { text: 'status', type: 'flyqlKey', start: 0, end: 6 },
//   { text: '=', type: 'flyqlOperator', start: 6, end: 7 },
//   { text: '200', type: 'number', start: 7, end: 10 },
//   { text: ' ', type: 'space', start: 10, end: 11 },
//   { text: 'and', type: 'flyqlOperator', start: 11, end: 14 },
//   { text: ' ', type: 'space', start: 14, end: 15 },
//   { text: 'region', type: 'flyqlKey', start: 15, end: 21 },
//   { text: '=', type: 'flyqlOperator', start: 21, end: 22 },
//   { text: "'us-east'", type: 'string', start: 22, end: 31 },
// ]
```
```js
// Columns-mode tokens (JavaScript only)
const columnTokens = tokenize('id as user_id', { mode: 'columns' })
```

## Python

```python
from flyql import tokenize

tokens = tokenize("status=200 and region='us-east'")
for t in tokens:
    print(t.text, t.type.value, t.start, t.end)

# Python Token is a frozen dataclass with fields: text, type, start, end
```

Columns mode is not supported in Python and raises `ValueError`:

```python
tokenize("id", mode="columns")
# ValueError: columns mode is only available in the JavaScript package
```

## Go

```go
import "github.com/iamtelescope/flyql/golang"

tokens, err := flyql.Tokenize("status=200 and region='us-east'", "query")
if err != nil {
    // non-nil for unsupported modes; "query" never errors
}
for _, t := range tokens {
    fmt.Println(t.Text, t.Type, t.Start, t.End)
}
```

Columns mode returns a non-nil error: `columns mode is only available in the JavaScript package`.
## Token types

The tokenizer emits `CharType` constants from the core parser. In query mode, the interesting types are:
| Type | Example | When it's produced |
|---|---|---|
| `flyqlKey` | `status`, `meta.region` | Field names on the left of an operator |
| `flyqlOperator` | `=`, `!=`, `>=`, `and`, `or`, `not` | Comparison and boolean operators |
| `number` | `200`, `-42`, `3.14`, `1e5` | Numeric literals (see the canonical numeric rule) |
| `string` | `'ok'`, `"hello"` | Quoted string literals |
| `flyqlBoolean` | `true`, `false` | Boolean literals |
| `flyqlNull` | `null` | Null literal |
| `flyqlColumn` | `alice`, `Infinity` | Unquoted value that is not a literal — treated as a column reference |
| `flyqlPipe` | `\|` | The transformer-separator pipe |
| `flyqlTransformer` | `upper`, `chars` | Transformer names |
| `flyqlArgument` | `25` (inside `chars(25)`) | Transformer argument |
| `flyqlParameter` | `$name`, `$1` | Parameter placeholders |
| `flyqlWildcard` | `*` | Wildcard inside values |
| `space` | | Whitespace between tokens |
| `flyqlError` | `y` in `x!y` | Tail of input the parser could not consume |
Columns mode uses its own set: `column`, `alias`, `aliasOperator`, `operator`, `transformer`, `argument`, `space`, `error`.
Note that `flyqlValue` never appears in the output — every raw value token from the parser is upgraded by the rule below before `tokenize` returns.
## The VALUE upgrade rule

The parser initially tags every value character with `flyqlValue`. Before returning, `tokenize` upgrades each value token to a more specific type:
- `"true"` or `"false"` → `flyqlBoolean`
- `"null"` → `flyqlNull`
- matches the canonical numeric rule → `number`
- starts with `'` or `"` → `string`
- otherwise → `flyqlColumn`
This logic runs in query mode only. Columns mode has no value concept.
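The upgrade step can be sketched as a plain function. This is an illustrative reimplementation, not the library's code; the type names come from the table above and the regex from the canonical numeric rule below:

```python
import re

# The canonical numeric rule shared by all three languages.
NUMBER_RE = re.compile(r"^-?\d+(\.\d+)?([eE][+-]?\d+)?$")


def upgrade_value(text: str) -> str:
    # Order matters: literal keywords first, then numbers, then quoted strings.
    if text in ("true", "false"):
        return "flyqlBoolean"
    if text == "null":
        return "flyqlNull"
    if NUMBER_RE.match(text):
        return "number"
    if text.startswith(("'", '"')):
        return "string"
    # Anything else is treated as a column reference.
    return "flyqlColumn"
```

For example, `upgrade_value("Infinity")` falls through every branch and returns `"flyqlColumn"`, matching the table above.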
## Canonical numeric rule

A value token is classified as `number` only when it matches this regex exactly:

```
^-?\d+(\.\d+)?([eE][+-]?\d+)?$
```

This is intentionally narrower than JavaScript `parseFloat`, Python `float()`, or Go `strconv.ParseFloat`. The following inputs are not numbers and fall through to `flyqlColumn`:
- `Infinity`, `-Infinity`
- `NaN`
- hex literals like `0x1F`
- whitespace-padded numerics like ` 42`
All three languages share the same regex, so `val=Infinity` tokenizes identically in JavaScript, Python, and Go.
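The gap between the canonical rule and general-purpose float parsing is easy to demonstrate in Python (a sketch; `is_flyql_number` is a hypothetical helper wrapping the regex above):

```python
import re

NUMBER_RE = re.compile(r"^-?\d+(\.\d+)?([eE][+-]?\d+)?$")


def is_flyql_number(text: str) -> bool:
    return NUMBER_RE.match(text) is not None


# Accepted by Python's float() but rejected by the canonical rule,
# so they would tokenize as flyqlColumn:
for text in ("Infinity", "-Infinity", "NaN", " 42"):
    float(text)                       # parses fine as a Python float
    assert not is_flyql_number(text)  # but is not a flyql number

# Accepted by both:
for text in ("200", "-42", "3.14", "1e5"):
    assert is_flyql_number(text)
```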
## Trailing ERROR token

If the parser halts before consuming the full input, `tokenize` appends one final token covering the unparsed tail:

```js
tokenize('x!y')
// [
//   { text: 'x', type: 'flyqlKey', start: 0, end: 1 },
//   { text: '!', type: 'flyqlOperator', start: 1, end: 2 },
//   { text: 'y', type: 'flyqlError', start: 2, end: 3 },
// ]
```

This preserves the round-trip invariant even for malformed input. Formatters can render the trailing `flyqlError` token (or columns `error` token) with a distinct style — the JavaScript `highlight()` helper uses the CSS class `flyql-error`.
Empty input returns an empty list, never a trailing error.
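As an example of the token-level rendering mentioned in the introduction (ANSI console output), here is a sketch of a formatter that leans on the round-trip invariant. The `Token` stand-in and the color choices are illustrative, not part of the library:

```python
from typing import NamedTuple


class Token(NamedTuple):
    # Stand-in for the real flyql Token record.
    text: str
    type: str
    start: int
    end: int


# Arbitrary color mapping; the keys match the token types documented above.
ANSI = {
    "flyqlKey": "\033[36m",       # cyan
    "flyqlOperator": "\033[33m",  # yellow
    "number": "\033[32m",         # green
    "string": "\033[32m",         # green
    "flyqlError": "\033[41m",     # red background for the unparsed tail
}
RESET = "\033[0m"


def render_ansi(tokens) -> str:
    # Emitting every token's text in order reproduces the input exactly,
    # with escape codes wrapped around each slice.
    out = []
    for t in tokens:
        color = ANSI.get(t.type)
        out.append(f"{color}{t.text}{RESET}" if color else t.text)
    return "".join(out)


print(render_ansi([
    Token("x", "flyqlKey", 0, 1),
    Token("!", "flyqlOperator", 1, 2),
    Token("y", "flyqlError", 2, 3),
]))
```

Because the token stream is gap-free, the formatter never needs to consult `start`/`end` to know what lies between tokens — there is nothing between them.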
## Limitations

- Columns mode is JavaScript-only. Only the JavaScript columns parser records typed-character output; the Python and Go columns parsers don't, so `tokenize()` in those languages can't extend to columns mode.
- ASCII-only guarantees. Cross-language offset parity is guaranteed for ASCII input. Non-ASCII inputs may produce different `start`/`end` values across languages because of byte vs. codepoint vs. UTF-16 code unit differences.
- No line or column numbers. Tokens expose only character offsets. If you need `(line, column)` pairs for an editor, compute them from `start` against the source text.
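The `(line, column)` computation described in the last point can be sketched as follows, assuming 1-based lines and 0-based columns (a common editor convention; adjust to your editor's):

```python
import bisect


def line_col(source: str, offset: int) -> tuple[int, int]:
    """Map a character offset to a (line, column) pair.

    Lines are 1-based, columns 0-based.
    """
    # Offsets at which each line begins.
    line_starts = [0]
    for i, ch in enumerate(source):
        if ch == "\n":
            line_starts.append(i + 1)
    # The line containing `offset` is the last start <= offset.
    line = bisect.bisect_right(line_starts, offset) - 1
    return line + 1, offset - line_starts[line]
```

For a single-line FlyQL expression this always returns line 1, so the column equals the token's `start` — the computation only matters for multi-line editor buffers.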
## See also

- Cross-Language API Matrix — the canonical function name for each language.
- JavaScript Quickstart — full list of package subpath imports, including `flyql/highlight` for HTML output.
- AST & Custom Generators — the higher-level tree you get from `parse()`. `tokenize()` is a complement, not a replacement: it gives you a flat view where `parse()` gives you a tree.