Tokenizer

tokenize() is a pure function that turns a FlyQL expression into a flat list of typed tokens. It is the shared foundation every FlyQL syntax highlighter, pretty-printer, or formatter can build on. The function is available in all three languages with identical semantics — the same input produces the same tokens in JavaScript, Python, and Go (for ASCII input; see the limitations below).

Use it when you need token-level control over rendering (ANSI console output, custom HTML, editor decorations, token-aware linters). If you only need an HTML string, prefer the higher-level flyql/highlight helper (JavaScript only).

Every token is a plain record with four fields:

| Field | Type | Description |
| --- | --- | --- |
| `text` | string | The slice of the original input this token covers. |
| `type` | string | One of the CharType values (see below). |
| `start` | int | Zero-based character offset of the first character in `text`. |
| `end` | int | Zero-based character offset one past the last character — `end == start + len(text)`. |

Tokens satisfy two invariants you can rely on:

  • Round-trip: concatenating every token’s text in order reproduces the original input exactly.
  • Gap-free offsets: tokens[0].start == 0, tokens[i].start == tokens[i-1].end for all i > 0, and tokens[-1].end == len(input) (non-empty inputs).

These two properties together mean a formatter can walk the token list, emit decorations around each text, and never miss a character or overlap a boundary.
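As a sketch, both invariants can be checked mechanically against any token list. The helper below is illustrative and not part of the flyql API; the hand-built token list stands in for real tokenize() output:

```javascript
// Illustrative invariant checker — not part of the flyql API.
function checkInvariants(input, tokens) {
  // Round-trip: concatenated token texts reproduce the input exactly.
  const roundTrip = tokens.map((t) => t.text).join('') === input
  // Gap-free: offsets start at 0, chain without gaps, and end at input length.
  const gapFree =
    tokens.every((t, i) => (i === 0 ? t.start === 0 : t.start === tokens[i - 1].end)) &&
    (tokens.length === 0 || tokens[tokens.length - 1].end === input.length)
  return roundTrip && gapFree
}

// A hand-built token list mirroring tokenize()'s shape for the input "x=1":
const input = 'x=1'
const tokens = [
  { text: 'x', type: 'flyqlKey', start: 0, end: 1 },
  { text: '=', type: 'flyqlOperator', start: 1, end: 2 },
  { text: '1', type: 'number', start: 2, end: 3 },
]
console.log(checkInvariants(input, tokens)) // true
```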

tokenize accepts an input string and an optional mode. Only "query" mode is implemented in all three languages; "columns" mode is JavaScript-only.

```javascript
import { tokenize } from 'flyql/tokenize'

const tokens = tokenize("status=200 and region='us-east'")
// [
//   { text: 'status', type: 'flyqlKey', start: 0, end: 6 },
//   { text: '=', type: 'flyqlOperator', start: 6, end: 7 },
//   { text: '200', type: 'number', start: 7, end: 10 },
//   { text: ' ', type: 'space', start: 10, end: 11 },
//   { text: 'and', type: 'flyqlOperator', start: 11, end: 14 },
//   { text: ' ', type: 'space', start: 14, end: 15 },
//   { text: 'region', type: 'flyqlKey', start: 15, end: 21 },
//   { text: '=', type: 'flyqlOperator', start: 21, end: 22 },
//   { text: "'us-east'", type: 'string', start: 22, end: 31 },
// ]

// Columns-mode tokens (JavaScript only)
const columnTokens = tokenize('id as user_id', { mode: 'columns' })
```
```python
from flyql import tokenize

tokens = tokenize("status=200 and region='us-east'")
for t in tokens:
    print(t.text, t.type.value, t.start, t.end)

# The Python Token is a frozen dataclass with fields: text, type, start, end
```

Columns mode is not supported in Python and raises ValueError:

```python
tokenize("id", mode="columns")
# ValueError: columns mode is only available in the JavaScript package
```
```go
import (
	"fmt"

	"github.com/iamtelescope/flyql/golang"
)

tokens, err := flyql.Tokenize("status=200 and region='us-east'", "query")
if err != nil {
	// err is non-nil only for unsupported modes; "query" never errors
}
for _, t := range tokens {
	fmt.Println(t.Text, t.Type, t.Start, t.End)
}
```

Columns mode returns a non-nil error: `columns mode is only available in the JavaScript package`.

The tokenizer emits CharType constants from the core parser. In query mode, the interesting types are:

| Type | Example | When it’s produced |
| --- | --- | --- |
| `flyqlKey` | `status`, `meta.region` | Field names on the left of an operator |
| `flyqlOperator` | `=`, `!=`, `>=`, `and`, `or`, `not` | Comparison and boolean operators |
| `number` | `200`, `-42`, `3.14`, `1e5` | Numeric literals (see the canonical numeric rule) |
| `string` | `'ok'`, `"hello"` | Quoted string literals |
| `flyqlBoolean` | `true`, `false` | Boolean literals |
| `flyqlNull` | `null` | Null literal |
| `flyqlColumn` | `alice`, `Infinity` | Unquoted value that is not a literal — treated as a column reference |
| `flyqlPipe` | `\|` | The transformer-separator pipe |
| `flyqlTransformer` | `upper`, `chars` | Transformer names |
| `flyqlArgument` | `25` (inside `chars(25)`) | Transformer argument |
| `flyqlParameter` | `$name`, `$1` | Parameter placeholders |
| `flyqlWildcard` | `*` | Wildcard inside values |
| `space` | | Whitespace between tokens |
| `flyqlError` | `y` in `x!y` | Tail of input the parser could not consume |

Columns mode uses its own set: `column`, `alias`, `aliasOperator`, `operator`, `transformer`, `argument`, `space`, `error`.

Note that flyqlValue never appears in the output — every raw value token from the parser is upgraded by the rule below before tokenize returns.

The parser initially tags every value character with flyqlValue. Before returning, tokenize upgrades each value token to a more specific type:

  1. `"true"` or `"false"` → `flyqlBoolean`
  2. `"null"` → `flyqlNull`
  3. matches the canonical numeric rule → `number`
  4. starts with `'` or `"` → `string`
  5. otherwise → `flyqlColumn`

This logic runs in query mode only. Columns mode has no value concept.
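The upgrade order can be sketched as a plain function. This is an illustration of the rule as documented, using the canonical numeric regex defined below — it is not the library's actual source:

```javascript
// Illustrative sketch of the value-upgrade order; not flyql's internal code.
const NUMERIC = /^-?\d+(\.\d+)?([eE][+-]?\d+)?$/

function upgradeValueType(text) {
  if (text === 'true' || text === 'false') return 'flyqlBoolean'
  if (text === 'null') return 'flyqlNull'
  if (NUMERIC.test(text)) return 'number'
  if (text.startsWith("'") || text.startsWith('"')) return 'string'
  return 'flyqlColumn'
}

console.log(upgradeValueType('false'))     // 'flyqlBoolean'
console.log(upgradeValueType('1e5'))       // 'number'
console.log(upgradeValueType("'us-east'")) // 'string'
console.log(upgradeValueType('Infinity'))  // 'flyqlColumn'
```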

A value token is classified as number only when it matches this regex exactly:

```
^-?\d+(\.\d+)?([eE][+-]?\d+)?$
```

This is intentionally narrower than JavaScript parseFloat, Python float(), or Go strconv.ParseFloat. The following inputs are not numbers and fall through to flyqlColumn:

  • `Infinity`, `-Infinity`
  • `NaN`
  • hex literals like `0x1F`
  • whitespace-padded numerics like `" 42"`

All three languages share the same regex, so val=Infinity tokenizes identically in JavaScript, Python, and Go.
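A quick sanity check of the rule against the examples above — a standalone sketch, independent of the library:

```javascript
// The canonical numeric rule as given in this document.
const NUMERIC = /^-?\d+(\.\d+)?([eE][+-]?\d+)?$/

// Every canonical numeric form matches...
const accepted = ['200', '-42', '3.14', '1e5'].every((s) => NUMERIC.test(s))
// ...and none of the excluded forms do.
const rejected = ['Infinity', '-Infinity', 'NaN', '0x1F', ' 42'].some((s) => NUMERIC.test(s))

console.log(accepted) // true: all canonical numerics match
console.log(rejected) // false: no excluded form matches
```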

If the parser halts before consuming the full input, tokenize appends one final token covering the unparsed tail:

```javascript
tokenize('x!y')
// [
//   { text: 'x', type: 'flyqlKey', start: 0, end: 1 },
//   { text: '!', type: 'flyqlOperator', start: 1, end: 2 },
//   { text: 'y', type: 'flyqlError', start: 2, end: 3 },
// ]
```

This preserves the round-trip invariant even for malformed input. Formatters can render the trailing flyqlError token (or columns error token) with a distinct style — the JavaScript highlight() helper uses the CSS class flyql-error.
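For example, a minimal ANSI console renderer might map a few token types to colors and pass everything else through unchanged. The color table here is an assumption for illustration — flyql itself only defines the `flyql-error` CSS class in its HTML helper:

```javascript
// Illustrative ANSI renderer; the color mapping is not defined by flyql.
const ANSI = { flyqlKey: 36, flyqlOperator: 33, number: 32, string: 35, flyqlError: 31 }

function renderAnsi(tokens) {
  // Round-trip invariant guarantees join('') reproduces the input text.
  return tokens
    .map((t) => (ANSI[t.type] ? `\x1b[${ANSI[t.type]}m${t.text}\x1b[0m` : t.text))
    .join('')
}

// Malformed input renders with the error tail highlighted in red:
const out = renderAnsi([
  { text: 'x', type: 'flyqlKey', start: 0, end: 1 },
  { text: '!', type: 'flyqlOperator', start: 1, end: 2 },
  { text: 'y', type: 'flyqlError', start: 2, end: 3 },
])
```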

Empty input returns an empty list, never a trailing error.

  • Columns mode is JavaScript-only. Only the JavaScript columns parser records typed-character output; the Python and Go columns parsers don’t, so tokenize() in those languages can’t extend to columns mode.
  • ASCII-only guarantees. Cross-language offset parity is guaranteed for ASCII input. Non-ASCII inputs may produce different start/end values across languages because of byte-vs-codepoint-vs-UTF16-unit differences.
  • No line or column numbers. Tokens expose only character offsets. If you need (line, column) pairs for an editor, compute them from start against the source text.
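Deriving a (line, column) pair from a character offset takes only a few lines. This helper is illustrative and not part of flyql:

```javascript
// Convert a zero-based character offset into a 1-based line and
// 0-based column, by scanning the source text up to the offset.
// Illustrative helper; flyql exposes only offsets.
function lineCol(source, offset) {
  const before = source.slice(0, offset)
  const line = before.split('\n').length                  // 1-based line
  const column = offset - (before.lastIndexOf('\n') + 1)  // 0-based column
  return { line, column }
}

console.log(lineCol('a=1\nb=2', 4)) // { line: 2, column: 0 }
```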
  • Cross-Language API Matrix — the canonical function name for each language.
  • JavaScript Quickstart — full list of package subpath imports, including flyql/highlight for HTML output.
  • AST & Custom Generators — the higher-level tree you get from parse(). tokenize() is a complement, not a replacement: it gives you a flat view where parse() gives you a tree.