# Why Tree-Sitter Is Inadequate for Program Analysis
Source: [https://www.cubix-framework.com/tree-sitter-limitations.html](https://www.cubix-framework.com/tree-sitter-limitations.html)
A great thing about Cubix is that it integrates well with many third-party parsers. As of the latest update, if you have a Tree-sitter grammar for your language, you can get Cubix support for it relatively quickly.
So you might see this and ask "I already have a Tree-sitter parser, and I only care about one language; why can't I just use Tree-sitter directly?"
Unless you're building a syntax highlighter: terrible, terrible choice.
Here's why building Cubix's Tree-sitter integration was many times harder than expected, and a taste of all the pain you avoid by using Cubix with a Tree-sitter parser instead of using Tree-sitter directly.
## You Can't Tell Addition From Multiplication
Consider this Sui Move code:
```
let a = x + y;
let b = x * y;
```
Parse both lines with Tree-sitter. Look at the AST for the right-hand sides. They are **identical**:
```
(binary_expression
  left: (identifier)
  right: (identifier))
```
The `+` and `*` tokens are gone. Tree-sitter classified them as "anonymous nodes" and every tool in the ecosystem silently discards them. You cannot write a refactoring tool, a static analyzer, or a formula extractor that distinguishes addition from multiplication (the most basic semantic distinction in arithmetic) because Tree-sitter doesn't preserve it.
This is not a bug in a particular grammar. Tree-sitter is full of constructs that throw away information, and grammars throughout the ecosystem use them. After all, if you just want a syntax highlighter or go-to-definition, then this is wasted information. But any kind of deeper analysis needs it.
Want more examples? You also can't tell `move x` from `copy x`. You can't tell `public` from `public(friend)` from `public(package)`. You can't tell which of the four abilities (`copy`, `drop`, `store`, `key`) a type declares. You can't even tell `true` from `false`. All of these depend on tokens that Tree-sitter discards.
## Background
Tree-sitter is a popular incremental parsing library designed for **syntax highlighting and editor features**. It produces a Concrete Syntax Tree (CST) optimized for the needs of text editors: fast incremental re-parsing, fault tolerance on incomplete input, and enough structure to colorize tokens. It was never designed for semantic analysis, program transformation, or roundtrip source code manipulation.
The problems fall into three categories: major issues where Tree\-sitter actively destroys information you need, structural issues where a CST is the wrong representation for the job, and minor issues that add friction\.
This article is going to be full of references to the Sui Move grammar, as that is the first language supported via our Tree-sitter integration. But the problems here show up across languages.
## Major Issues
### Anonymous nodes are silently discarded
Tree-sitter distinguishes between **named nodes** (like `binary_expression`, `function_definition`) and **anonymous nodes** (operators `+`, `*`, `||`, punctuation `(`, `)`, `,`, keywords `let`, `if`). All mainstream Tree-sitter libraries (GitHub's [semantic](https://github.com/github/semantic), the newer [hs-tree-sitter](https://github.com/wenkokke/hs-Tree-sitter)) only traverse named nodes. Anonymous tokens are thrown away.
This is catastrophic for any tool that needs to understand what code *does*:
- **Operators vanish.** `x + y` and `x * y` become the same tree. `a == b` and `a != b` become the same tree. You cannot build a symbolic executor, a formula extractor, or even a linter that checks operator usage.
- **Punctuation needed for roundtripping vanishes.** Parentheses, brackets, commas, semicolons: all gone. You cannot reconstruct source code from the AST.
- **Keywords that distinguish constructs vanish.** In Sui Move, `modifier` is a named node, but the keywords inside it (`public`, `package`, `friend`, `entry`, `native`) are all anonymous. A tool walking named children sees a `modifier` node with zero children for all five visibility levels. The same pattern recurs throughout the grammar: `ability` wraps `copy`/`drop`/`store`/`key` with no named children; `primitive_type` wraps nine different types (`u8` through `u256`, `bool`, `address`, `signer`) with no named children; `move_or_copy_expression` makes move and copy semantics indistinguishable; even `bool_literal` makes `true` and `false` identical.
The Sui Move grammar makes the operator problem especially severe. Binary expressions are defined in `grammar.js` using JavaScript spread syntax that bakes each operator into a separate alternative:
```
...table.map(([operator, precedence, associativity]) =>
  prec[associativity](precedence, seq(
    field('left', $._expression),
    field('operator', operator), // Anonymous token -- discarded!
    field('right', $._expression)
  ))
)
```
Twenty different binary operators, all producing structurally identical AST nodes. The *only* way to distinguish them is through the anonymous token that every library drops.
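To make the loss concrete, here is a minimal sketch that models the situation with plain JavaScript objects. It does not call the Tree-sitter API: the `named`, `text`, and `children` properties and both rendering functions are assumptions made for illustration. It only shows that a traversal restricted to named nodes cannot separate the two expressions, while a traversal over every child can.

```javascript
// Hand-built stand-in for the CST of `x + y` and `x * y`.
// `named: false` marks what the article calls an anonymous node.
function binaryExpression(left, operator, right) {
  return {
    type: "binary_expression",
    named: true,
    children: [
      { type: "identifier", named: true, text: left, children: [] },
      { type: operator, named: false, text: operator, children: [] },
      { type: "identifier", named: true, text: right, children: [] },
    ],
  };
}

// Render a tree while visiting only named nodes (the lossy view).
function namedOnly(node) {
  const kids = node.children.filter((c) => c.named).map(namedOnly);
  return kids.length ? `(${node.type} ${kids.join(" ")})` : `(${node.type})`;
}

// Render a tree while visiting every node, anonymous tokens included.
function allNodes(node) {
  if (!node.named) return JSON.stringify(node.text);
  const kids = node.children.map(allNodes);
  return kids.length ? `(${node.type} ${kids.join(" ")})` : `(${node.type})`;
}

const sum = binaryExpression("x", "+", "y");
const product = binaryExpression("x", "*", "y");

console.log(namedOnly(sum) === namedOnly(product)); // true: the operator is lost
console.log(allNodes(sum));     // (binary_expression (identifier) "+" (identifier))
console.log(allNodes(product)); // (binary_expression (identifier) "*" (identifier))
```

Any analysis that needs the operator has to work from the second kind of traversal, which is exactly the data the named-only view leaves out.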
### No pretty\-printer — no roundtripping
Tree-sitter is a one-way street. It parses source code into a tree. It provides **zero** mechanism for going back, from a tree to source code.
The roundtrip property `parse(pretty(parse(text))) = parse(text)` is a fundamental requirement for any program transformation tool. If you can't render your modified tree back to valid source, your tool is useless. Tree-sitter provides no support for the `pretty` half of this equation.
A custom pretty\-printer must be written from scratch for every language, working from the same grammar definition but in the opposite direction\. Tree\-sitter offers no help\.
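As an illustration of what the roundtrip property demands, the sketch below implements both halves for a toy expression language (identifiers, `+`, `*`, parentheses) and checks the property on one input. Everything here is an assumption made for the example: the toy grammar, the tree shape, and the function names have nothing to do with Tree-sitter or with Cubix's actual printer.

```javascript
// Toy expression language: identifiers, `+`, `*`, and parentheses.
function tokenize(text) {
  return text.match(/[A-Za-z_]\w*|[+*()]/g) ?? [];
}

// Recursive-descent parser; `*` binds tighter than `+`, both left-associative.
function parse(text) {
  const tokens = tokenize(text);
  let pos = 0;

  function atom() {
    const tok = tokens[pos++];
    if (tok === "(") {
      const inner = sum();
      if (tokens[pos++] !== ")") throw new Error("expected )");
      return inner;
    }
    if (tok === undefined || !/^[A-Za-z_]/.test(tok)) {
      throw new Error(`unexpected token ${tok}`);
    }
    return { kind: "var", name: tok };
  }
  function product() {
    let left = atom();
    while (tokens[pos] === "*") {
      pos++;
      left = { kind: "binary", op: "*", left, right: atom() };
    }
    return left;
  }
  function sum() {
    let left = product();
    while (tokens[pos] === "+") {
      pos++;
      left = { kind: "binary", op: "+", left, right: product() };
    }
    return left;
  }

  const tree = sum();
  if (pos !== tokens.length) throw new Error("trailing input");
  return tree;
}

// Pretty-printer: the inverse direction. It parenthesizes every nested binary
// expression, which is ugly but guarantees the text parses back to the same tree.
function pretty(node) {
  if (node.kind === "var") return node.name;
  const wrap = (n) => (n.kind === "binary" ? `(${pretty(n)})` : pretty(n));
  return `${wrap(node.left)} ${node.op} ${wrap(node.right)}`;
}

// The roundtrip property from the text: parse(pretty(parse(text))) = parse(text).
function roundtrips(text) {
  const tree = parse(text);
  return JSON.stringify(parse(pretty(tree))) === JSON.stringify(tree);
}

console.log(pretty(parse("a + b * (c + d)"))); // a + (b * (c + d))
console.log(roundtrips("a + b * (c + d)"));    // true
```

Even for a grammar this small, the printer has to know about precedence and parenthesization. For a full language, that knowledge has to be written down by hand for every construct.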
### Evil aliases
For some reason, Tree-sitter supports an `alias()` rule that lets one grammar rule appear under a different name in the CST. We never figured out why this is useful, but enough grammars have it that it must have some use.
For example, the Sui Move grammar uses aliases to give contextual names to shared rules: a generic `_variable_identifier` is aliased to `bind_var` in binding position, giving downstream tools a way to know that this identifier is being used as a binder rather than a reference.
But aliases cannot be straightforwardly processed by downstream tools because they introduce a layer of indirection that isn't cleanly reflected in `grammar.json`.
We wound up having to use a short `jq` script that preprocesses grammars to remove aliases, the only place where we had to change a Tree-sitter grammar.
This means that every contextual name the grammar author assigned (every attempt to say "this identifier is a `bind_var`, not just an `identifier`") is erased before processing begins. The semantic distinctions that the grammar author carefully encoded through aliasing are destroyed.
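The preprocessing step can be sketched in a few lines. This is not the `jq` script mentioned above, which the article does not show. It is a JavaScript approximation that assumes `ALIAS` nodes in `grammar.json` have the shape `{ type: "ALIAS", content, named, value }`, and the grammar fragment it runs on is hypothetical.

```javascript
// Replace every ALIAS node in a grammar.json-style structure with the rule it wraps.
function stripAliases(rule) {
  if (Array.isArray(rule)) return rule.map(stripAliases);
  if (rule === null || typeof rule !== "object") return rule;
  if (rule.type === "ALIAS") return stripAliases(rule.content);
  return Object.fromEntries(
    Object.entries(rule).map(([key, value]) => [key, stripAliases(value)])
  );
}

// Hypothetical fragment, shaped like grammar.json output for a binding rule.
const grammar = {
  name: "example",
  rules: {
    let_statement: {
      type: "SEQ",
      members: [
        { type: "STRING", value: "let" },
        {
          type: "ALIAS",
          content: { type: "SYMBOL", name: "_variable_identifier" },
          named: true,
          value: "bind_var",
        },
      ],
    },
  },
};

const stripped = stripAliases(grammar);
console.log(JSON.stringify(stripped.rules.let_statement.members[1]));
// {"type":"SYMBOL","name":"_variable_identifier"}
```

The output shows the cost: after stripping, nothing records that the identifier sat in a `bind_var` position.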
### FFI memory safety hazards
So we were pretty far along working with the Haskell bindings for Tree\-sitter when we slammed into a segfault\. Uh oh\. How?
You see, Tree-sitter is a C library. Its `TSNode` struct contains raw pointers back to the `TSTree` and `TSLanguage` objects that created it. When accessed from a garbage-collected language, these pointers create a hidden dependency: if the runtime garbage-collects the tree while node references still exist, you get **non-deterministic segfaults**.
The standard Haskell bindings use `ForeignPtr` with a finalizer that calls `ts_tree_delete`. The GC doesn't see the dependency between nodes and trees, so it can free the tree while nodes still point into it, leaving them with dangling pointers. The resulting crashes are intermittent, appear to correlate with unrelated code changes, and took weeks to diagnose.
Any language with automatic memory management faces a version of this problem when integrating with Tree\-sitter at a low level\.
## You still want an AST, not a CST
Compilers, static analyzers, and programming tools of all stripes have historically relied on ASTs (**abstract syntax trees**). Abstract syntax trees condense a program into its core meaningful syntax. Non-semantic differences such as extra parentheses, or the difference between `0xFF` and `255`, get stripped away, so that tools work with something simpler. They can also perform other normalizations, such as removing the difference between an if-statement with no else block and an if-statement with an empty else block.
But Tree-sitter does not provide an AST. It instead produces CSTs (**concrete syntax trees**). Before the introduction of Tree-sitter, CSTs were virtually unknown outside of researchers in language engineering. CSTs, as generated by tools such as Rascal and SDF, are very useful for applications that require reconstructing the original source code. They can also be generated directly from a grammar, reducing the need for additional information about what parts of a syntax to ignore. Unlike ASTs, they can also preserve comments.
Tree-sitter, in introducing concrete syntax trees to a larger audience, has made a number of interesting choices that make it very effective for building syntax highlighters, while reducing its usefulness for most other applications. Unlike traditional CSTs, Tree-sitter trees are very lossy (as explained above), which reduces memory consumption but destroys their utility for analysis and transformation. They also lose a lot of the information in the grammar, which allows a simplified API and further reduces memory consumption, at the price of making analysis extra difficult.
My mentor Ira Baxter, who has been building program transformation tools for about 40 years, [wrote](https://stackoverflow.com/a/1685297): "Having a parser [and getting an AST] is like climbing the foothills of the Himalayas when the problem is climbing Everest." But today, thanks to Tree-sitter, many tool builders do not even get that far.
Here are some more issues with Tree-sitter, all related to the fact that it does not produce an AST.
### Children are just a flat list
Tree-sitter's grammar definitions encode rich structure (`seq`, `repeat`, `optional`, `choice`) that describes precisely how children are grouped and ordered. But the CST throws all of this away. Every node's children are just a flat, untyped list.
Consider how the Sui Move grammar defines a block:
```
block: $ => seq(
  '{',
  repeat($.use_declaration),
  repeat($.block_item),
  optional($._expression),
  '}'
)
```
The grammar says: first come use declarations, then block items (statements ending with `;`), then optionally a trailing expression (the block's return value, without `;`). But Tree-sitter's `node-types.json` describes the `block` node as having `"fields": {}` and children that can be any of 40+ types in any order. The `repeat`/`optional`/`seq` structure is completely erased.
Or consider function signatures, which allow up to three optional modifiers:
```
_function_signature: $ => seq(
  optional($.modifier),
  optional($.modifier),
  optional($.modifier),
  'fun',
  ...
)
```
The grammar defines three distinct modifier slots. The CST gives you 0–3 `modifier` children with no positional information about which slot each came from.
This means a tool consuming the CST must **re-derive** the grammar's grouping logic. Given a `block` node, you must figure out on your own where the statements end and the trailing return expression begins. Given a function, you must figure out which modifiers are present by inspecting their content rather than their position.
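A minimal sketch of that re-derivation for `block`, assuming the tool has already collected the node's named children as objects with a `type` property. The `EXPRESSION_TYPES` table is a hypothetical, deliberately incomplete stand-in for the list of node types the hidden `_expression` rule can produce.

```javascript
// Hypothetical table: named node types that the grammar's hidden
// `_expression` rule can produce. A real tool would need the full list.
const EXPRESSION_TYPES = new Set([
  "call_expression",
  "binary_expression",
  "name_expression",
]);

// Re-derive seq(repeat(use_declaration), repeat(block_item), optional(_expression))
// from the flat list of named children of a `block` node.
function splitBlock(children) {
  let i = 0;
  const useDeclarations = [];
  while (i < children.length && children[i].type === "use_declaration") {
    useDeclarations.push(children[i++]);
  }
  const blockItems = [];
  while (i < children.length && children[i].type === "block_item") {
    blockItems.push(children[i++]);
  }
  let trailingExpression = null;
  if (i < children.length && EXPRESSION_TYPES.has(children[i].type)) {
    trailingExpression = children[i++];
  }
  // Anything left over violates the grouping the grammar promised.
  if (i !== children.length) {
    throw new Error(`unexpected child ${children[i].type} at index ${i}`);
  }
  return { useDeclarations, blockItems, trailingExpression };
}

const parts = splitBlock([
  { type: "use_declaration" },
  { type: "block_item" },
  { type: "block_item" },
  { type: "call_expression" },
]);
console.log(parts.useDeclarations.length);  // 1
console.log(parts.blockItems.length);       // 2
console.log(parts.trailingExpression.type); // call_expression
```

This function is a hand-written copy of one grammar rule, run in reverse. Every node type with internal grouping needs its own version, and each one can silently drift out of sync when the grammar changes.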
A proper AST \(and, really, a proper CST too\) makes the structure explicit in the type:
```
data Block e l where
  Block
    :: e [UseDeclarationL]
    -> e [BlockItemL]
    -> e (Maybe HiddenExpressionL)
    -> Block e BlockL
```
Statements and the return expression are in separate fields\. Pattern matching enforces the distinction\. There is no ambiguity to resolve at runtime\.
### The grammar is richer than the parse output
The flat-children problem is a symptom of a deeper issue: **Tree-sitter's grammar definition encodes far more structure than its CST preserves.**
The grammar uses `choice()` to define alternatives:
```
block_item: $ => seq(
  choice(
    $._expression,
    $.let_statement,
  ),
  ';'
)
```
This says a block item is either an expression or a let statement, followed by a semicolon. But the CST has no wrapper node for the `choice()`. You see the concrete child directly (a `let_statement` or a `call_expression`) with no indication that these were alternatives in a two-way choice.
A proper AST extracts this into a named sum type:
```
data BlockItemInner e l where
  BlockItemExpression :: e ExpressionL -> BlockItemInner e BlockItemInnerL
  BlockItemLetStatement :: e LetStatementL -> BlockItemInner e BlockItemInnerL
```
Tree-sitter's grammar also uses hidden rules (prefixed with `_`) like `_expression`, `_type`, `_bind`. These define important categorical groupings ("an expression is one of: call, binary, if, while, ...") but Tree-sitter **actively suppresses** these nodes in the CST. Where the grammar says `_expression`, the CST just shows the concrete child (`call_expression`, `binary_expression`, etc.) with no wrapper.
This means the grammar author's intent ("these 16 node types are all expressions") is lost. A tool must reconstruct these categories by maintaining its own tables of which concrete node types belong to which abstract categories.
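One way to build such a table is to derive it from the grammar rather than write it by hand. The sketch below assumes `grammar.json` rules expressed as `CHOICE` and `SYMBOL` nodes and handles only those two shapes. The rule fragment and the node type names in it are hypothetical.

```javascript
// Collect the named node types a hidden rule can expand to, following
// nested hidden rules (names starting with `_`) transitively.
// Only CHOICE and SYMBOL nodes are handled; other shapes are ignored.
function concreteTypes(rules, ruleName, seen = new Set()) {
  const result = new Set();
  if (seen.has(ruleName) || !rules[ruleName]) return result;
  seen.add(ruleName);
  const visit = (rule) => {
    if (rule.type === "CHOICE") {
      rule.members.forEach(visit);
    } else if (rule.type === "SYMBOL" && rule.name.startsWith("_")) {
      for (const t of concreteTypes(rules, rule.name, seen)) result.add(t);
    } else if (rule.type === "SYMBOL") {
      result.add(rule.name);
    }
  };
  visit(rules[ruleName]);
  return result;
}

// Hypothetical fragment of a grammar with two hidden rules.
const rules = {
  _expression: {
    type: "CHOICE",
    members: [
      { type: "SYMBOL", name: "call_expression" },
      { type: "SYMBOL", name: "binary_expression" },
      { type: "SYMBOL", name: "_unary_expression" },
    ],
  },
  _unary_expression: {
    type: "CHOICE",
    members: [
      { type: "SYMBOL", name: "borrow_expression" },
      { type: "SYMBOL", name: "dereference_expression" },
    ],
  },
};

const expressionTypes = concreteTypes(rules, "_expression");
console.log(expressionTypes.has("borrow_expression")); // true
console.log(expressionTypes.has("let_statement"));     // false
console.log(expressionTypes.size);                     // 4
```

The resulting set answers "is this node an expression?", a question the grammar already answered and the CST no longer does.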
### Primitive and literal information is structurally invisible
Tree-sitter represents leaf nodes like number literals as opaque `pattern` nodes: text matching a regex. The number `255`, the hex literal `0xFF`, and the separator-formatted `1_000_000` are all just a `pattern` node. There is no structural distinction.
For syntax highlighting, this is fine: they're all numbers, color them blue. For program analysis, it's a problem. A tool that needs to normalize numeric representations, verify literal formats, or preserve the programmer's formatting intent cannot get this information from the Tree-sitter CST without falling back to raw text matching. An AST can represent these as distinct constructors with parsed values.
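The raw text matching a tool ends up writing looks roughly like the sketch below. It covers only the three literal forms mentioned above and assumes suffix-free literals. The function name and the returned record shape are illustrative assumptions, not part of any library.

```javascript
// Turn the raw text of a numeric literal into a structured value.
// Assumes suffix-free decimal or `0x` hexadecimal literals, with optional
// `_` separators after the first digit.
function parseNumericLiteral(text) {
  if (!/^(0x[0-9a-fA-F][0-9a-fA-F_]*|[0-9][0-9_]*)$/.test(text)) {
    throw new Error(`not a numeric literal: ${text}`);
  }
  const digits = text.replace(/_/g, "");
  return {
    radix: digits.startsWith("0x") ? 16 : 10,
    hasSeparators: text.includes("_"),
    // BigInt understands the 0x prefix and does not overflow on wide values.
    value: BigInt(digits),
  };
}

const decimal = parseNumericLiteral("255");
const hex = parseNumericLiteral("0xFF");
const million = parseNumericLiteral("1_000_000");

console.log(decimal.value === hex.value);    // true: same number
console.log(decimal.radix, hex.radix);       // 10 16: different spelling
console.log(million.hasSeparators);          // true
console.log(million.value === 1000000n);     // true
```

The record keeps the parsed value apart from the formatting details, so an analysis can compare numbers while a printer can still reproduce the programmer's spelling.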
## Minor Issues
### `sepBy` doesn't mean `sepBy`
Tree-sitter's `sep()`/`sep1()` combinators, which are supposed to represent comma-separated lists and similar patterns, actually implement `sepEndBy` semantics: they **allow trailing separators**. The documentation doesn't call this out.
This seems minor but causes real parsing failures when building tools that assume standard semantics. A parameter list `(a, b, c)` and `(a, b, c,)` parse identically, and a tool that consumes the Tree-sitter output and applies standard `sepBy` logic will fail on trailing commas.
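The difference between the two semantics is small enough to show on a flat token array. This is a standalone sketch of the conventional `sepBy` and `sepEndBy` behaviors from parser-combinator libraries, written as plain functions. It is not the grammar's own helper code.

```javascript
// Strict sepBy: items separated by `sep`, no trailing separator allowed.
function sepBy(tokens, sep) {
  const items = [];
  for (let i = 0; i < tokens.length; i += 2) {
    if (tokens[i] === sep) throw new Error(`expected item at index ${i}`);
    items.push(tokens[i]);
    if (i + 1 < tokens.length) {
      if (tokens[i + 1] !== sep) throw new Error(`expected separator at index ${i + 1}`);
      if (i + 2 >= tokens.length) throw new Error("trailing separator");
    }
  }
  return items;
}

// sepEndBy: the same list, but one trailing separator is accepted.
function sepEndBy(tokens, sep) {
  const endsWithSep = tokens.length > 0 && tokens[tokens.length - 1] === sep;
  const trimmed = endsWithSep ? tokens.slice(0, -1) : tokens;
  if (endsWithSep && trimmed.length === 0) throw new Error("separator without an item");
  return sepBy(trimmed, sep);
}

const plain = ["a", ",", "b", ",", "c"];
const trailing = ["a", ",", "b", ",", "c", ","];

console.log(sepBy(plain, ","));       // [ 'a', 'b', 'c' ]
console.log(sepEndBy(trailing, ",")); // [ 'a', 'b', 'c' ]
try {
  sepBy(trailing, ",");
} catch (e) {
  console.log(e.message);             // trailing separator
}
```

A consumer written against the strict version rejects input that the parser has already accepted, which is the failure described above.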
### The grammar source is JavaScript, not data
Tree-sitter grammars are defined in JavaScript files that use the full power of the language: higher-order functions, spread operators, computed tables. The `grammar.json` that Tree-sitter produces from these is a flattened, desugared version that has lost the high-level structure.
The Sui Move grammar's binary expression definition uses `table.map()` with spread, which is clear and readable in JavaScript, but the resulting `grammar.json` contains 20 nearly-identical alternatives with no trace of the table structure. Any tool consuming the grammar must reverse-engineer patterns that were obvious in the source.
## The Bottom Line
Tree\-sitter was designed to make editors fast and responsive\. It excels at that\. But for program analysis and transformation, it is actively hostile:
| What you need | What Tree-sitter gives you |
| --- | --- |
| Semantic distinctions between operators | Identical nodes for `+`, `*`, `==`, `!=` |
| Roundtrip parse/print | One-way parsing only |
| Structured, typed children | Flat list with grouping information erased |
| Contextual node names via aliases | Aliases that must be stripped to process the grammar |
| Stable memory management | Dangling pointers across GC boundaries |
| Rich structural representations | Hidden rules suppressed, inline choices flattened |

We found that Tree-sitter is useful as a **tokenizer**: a fast, reliable way to break source code into a token stream. But every layer above that (structural parsing, type-safe representation, pretty-printing, roundtripping, transformation) must be built from scratch on top of it. The gap between what Tree-sitter provides and what program analysis requires is far larger than it appears.