TOTEN: Knowledge-Based Ontological Tokenization of Physical Quantities and Technical Notation in Brazilian Portuguese
Summary
TOTEN is a knowledge-based ontological tokenization framework that replaces statistical tokenization with declarative classification grounded in a formal ontology of engineering entities, achieving high ontological atomicity and numerical reconstruction for physical quantities and technical notation in Brazilian Portuguese.
# TOTEN: Knowledge-Based Ontological Tokenization of Physical Quantities and Technical Notation in Brazilian Portuguese

Source: [https://arxiv.org/html/2606.19626](https://arxiv.org/html/2606.19626)

[Antonio de Sousa Leitão Filho](https://orcid.org/0009-0002-1705-3611)<sup>1,2,∗</sup>, [Allan Kardec Duailibe Barros Filho](https://orcid.org/0000-0002-1654-0955)<sup>2</sup>, [Fabrício Saul Lima](https://orcid.org/0009-0005-1837-8751)<sup>1,2</sup>, [Selby Mykael Lima dos Santos](https://orcid.org/0009-0006-6627-6503)<sup>1,2</sup>, [Rejani Bandeira Vieira Sousa](https://orcid.org/0009-0000-7888-7324)<sup>1,3</sup>

<sup>1</sup> Aia Context, São Luís, Brazil

<sup>2</sup> Universidade Federal do Maranhão, São Luís, Maranhão, Brazil

<sup>3</sup> Universidade de São Paulo, São Paulo, Brazil

<sup>∗</sup> Corresponding author: antonio@aiacontext.com

###### Abstract

Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary subwords. We present TOTEN, a knowledge-based ontological tokenization framework that replaces statistical derivation with declarative classification grounded in a formal ontology of engineering entities (OEE).
We formalize TOTEN as the triple $\langle \mathcal{O}, \mathrm{classify}, \{\mathrm{inst}_\tau\} \rangle$: the ontology gathers types, structural principles, composition relations, and preservable invariants; the classification function maps raw text into typed regions; and the indexed family of instantiators produces a self-descriptive structured representation. Robustness derives from deterministic coupling with three consolidated external oracles: Pint (dimensional), the Unicode Character Database (typographic), and RSLP (Portuguese morphology). The intrinsic evaluation covers four properties verifiable by construction (ontological atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction) over an internally generated and physically validated benchmark (EngQuant, $N = 800$) and four external corpora in Brazilian Portuguese ($N = 1\,771$ cases eligible for numerical reconstruction). We additionally report detection *recall*, distinguishing coverage from conditional atomicity. Compared to eight representative state-of-the-art systems, TOTEN achieves unit ontological atomicity in all contrasts and numerical reconstruction of 0.775 to 0.904 on external corpora, against 0.627–0.703 for the best baseline (Quantulum3); on the internal benchmark, 0.780 against 0.340. Differences in atomicity and reconstruction are statistically significant (McNemar with Holm correction). The Spearman rank correlation between internal and external corpus rankings confirms the concurrent validity of the control benchmark. Dimensional equivalence shows statistical parity with Pint, the oracle from which the system inherits dimensional authority.

*Keywords* Ontological tokenization · Knowledge-based systems · Ontological engineering · Knowledge representation · NLP in Portuguese · Intrinsic evaluation

## 1 Introduction

The symbolic representation of technical entities in scientific text remains an unsolved problem in contemporary language models. Statistical tokenization algorithms such as Byte-Pair Encoding \[[1](https://arxiv.org/html/2606.19626#bib.bib1)\], WordPiece, and SentencePiece \[[2](https://arxiv.org/html/2606.19626#bib.bib2)\] are derived from predominantly English generalist corpora and produce vocabularies whose granularity is optimized for statistical compression, not semantic preservation. When applied to technical text in Brazilian Portuguese, these tokenizers fragment semantically atomic entities (physical quantities, locale-specific numbers, compound dimensional units, symbolic expressions) into subword sequences whose recomposition depends entirely on a downstream model a posteriori. Adjacent cases such as normative identifiers (NBR, ABNT) and hierarchical references to legal articles and paragraphs suffer from the same structural problem; their evaluation on an annotated open corpus is, however, left as an extension of this work.

Recent studies empirically document the consequences of this fragmentation. Singh and Strouse \[[3](https://arxiv.org/html/2606.19626#bib.bib3)\] demonstrate that right-to-left digit grouping substantially increases arithmetic accuracy in GPT-3.5, indicating that digit structure preserved in the input is associated with better downstream arithmetic performance independently of parameter scale. Yang et al. \[[4](https://arxiv.org/html/2606.19626#bib.bib4)\] catalogue systematic gaps in numerical reasoning whose origin is attributed to inadequate tokenization, not training.
These findings motivate the design of an input representation that preserves the semantic structure of numbers and units; direct verification of the downstream effect on consumer models is outside the scope of this study and is left as future work.

Domain-specific literature on quantitative extraction for English \[[5](https://arxiv.org/html/2606.19626#bib.bib5),[6](https://arxiv.org/html/2606.19626#bib.bib6)\] offers partial solutions that do not adequately model the technical vocabulary of Brazilian Portuguese; dimensional libraries such as Pint \[[7](https://arxiv.org/html/2606.19626#bib.bib7)\] and udunits-2 \[[8](https://arxiv.org/html/2606.19626#bib.bib8)\] operate on already-isolated unit strings, without performing textual recognition; generic entity recognition models \[[9](https://arxiv.org/html/2606.19626#bib.bib9)\] are trained on categories such as person, location, and organization, ignoring technical-scientific vocabulary.

This work proposes an alternative grounded in ontological engineering. Rather than deriving vocabulary statistically, we explicitly declare a formal ontology of engineering entities (OEE), comprising primary types, structural principles, composition relations, and preservable invariants. Over this ontology, we define TOTEN (*Typed Ontological Tokenization*), a knowledge-based tokenization framework operating in three functional layers and coupled to three consolidated external oracles.

The scientific contributions of this work are:

(C1) A *formalization* of ontological tokenization as the triple $\langle \mathcal{O}, \mathrm{classify}, \{\mathrm{inst}_\tau\} \rangle$, implementation-independent and amenable to evaluation via verifiable properties, categorically distinguishing it from statistical subword tokenization.

(C2) A *formal ontology of engineering entities* (OEE) with primary types defined by intrinsic properties, eight structural principles expressed as axioms (Appendix [A](https://arxiv.org/html/2606.19626#A1)), and declared composition relations, extensible under the *open-for-extension, closed-for-modification* principle.

(C3) A computationally inexpensive *intrinsic evaluation* based on four properties verifiable by construction (atomicity, dimensional equivalence, typographic robustness, and numerical reconstruction), reported alongside detection (*recall*) and replicated over five corpora (one physically validated internal and four external PT-BR), with oracle ablation and cross-corpus ranking consistency validation, demonstrating a statistically significant advantage in atomicity and numerical reconstruction against eight state-of-the-art systems.

Section [2](https://arxiv.org/html/2606.19626#S2) establishes the theoretical foundations in ontological engineering and tokenization. Section [3](https://arxiv.org/html/2606.19626#S3) formalizes the OEE. Section [4](https://arxiv.org/html/2606.19626#S4) presents the architecture of TOTEN. Section [5](https://arxiv.org/html/2606.19626#S5) characterizes the output language. Section [6](https://arxiv.org/html/2606.19626#S6) describes the experimental protocol. Section [7](https://arxiv.org/html/2606.19626#S7) presents the results. Section [8](https://arxiv.org/html/2606.19626#S8) discusses implications. Section [9](https://arxiv.org/html/2606.19626#S9) concludes.

## 2 Theoretical Foundations

### 2.1 Ontological Engineering

A formal ontology, in the sense established by Gruber \[[10](https://arxiv.org/html/2606.19626#bib.bib10)\], is an explicit specification of a shared conceptualization of a domain. Studer et al.
\[[11](https://arxiv.org/html/2606.19626#bib.bib11)\] characterize ontological engineering as a discipline that produces reusable formal artifacts for knowledge representation, distinguishing *lightweight* ontologies (taxonomies with few constraints) from *heavyweight* ones (axiomatic, with vocabulary rigorously constrained by logical axioms). Guarino \[[12](https://arxiv.org/html/2606.19626#bib.bib12)\] introduces the criterion of *ontological commitment* as a theory's obligation to the structure of the reality it describes.

We adopt the classical formalization of an ontology as a quadruple

$$\mathcal{O} = \langle \mathcal{T},\ \mathcal{P},\ \mathcal{R},\ \mathcal{I} \rangle, \tag{1}$$

where $\mathcal{T}$ is the finite set of primary types, $\mathcal{P}$ is the set of structural principles (axioms), $\mathcal{R} \subseteq \mathcal{T} \times \mathcal{T}$ is the composition relation between types, and $\mathcal{I}$ is the set of invariants that any valid representation of instances must preserve. This formulation is compatible with \[[11](https://arxiv.org/html/2606.19626#bib.bib11)\] and admits incremental extension: given $\mathcal{O}_n$ at version $n$, version $n+1$ satisfies $\mathcal{T}_n \subseteq \mathcal{T}_{n+1}$ and $\mathcal{P}_n \subseteq \mathcal{P}_{n+1}$, without any prior types or principles being removed.

### 2.2 Tokenization: Formal Definition

Let $\Sigma$ be a finite alphabet and $\Sigma^*$ the set of all finite strings over $\Sigma$. A *tokenization* is a function

$$\mathrm{tok}: \Sigma^* \longrightarrow V^*, \tag{2}$$

where $V$ is a token vocabulary.
Two families are categorically distinct: statistical tokenization, in which $V$ is induced from a corpus $\mathcal{C}$ by a compression procedure (BPE, WordPiece, SentencePiece); and ontological tokenization, in which $V$ is a language $\mathcal{M}$ defined over an ontology $\mathcal{O}$, and the function factors into two components: $\mathrm{tok} = \mathrm{ext} \circ \mathrm{classify}$, where $\mathrm{classify}: \Sigma^* \to \mathcal{P}(\mathcal{R})$ identifies typed regions and $\mathrm{ext}: \mathcal{P}(\mathcal{R}) \to \mathcal{M}^*$ produces the structured representation. This distinction is fundamental: statistical tokenization is *semantically blind* to the domain because the vocabulary emerges from distributional properties; ontological tokenization is *semantically committed* to a conceptualization of the domain, explicitly inherited from the ontology $\mathcal{O}$.

### 2.3 Related Work

#### 2.3.1 Subword Tokenization and Its Semantic Impact

The dominant family of statistical tokenizers (BPE \[[1](https://arxiv.org/html/2606.19626#bib.bib1)\], WordPiece \[[13](https://arxiv.org/html/2606.19626#bib.bib13)\], and SentencePiece \[[2](https://arxiv.org/html/2606.19626#bib.bib2)\]) shares a methodological assumption: optimal vocabularies emerge from distributional properties of a corpus, under a compression criterion. Recent studies problematize this assumption on three complementary fronts. Bostrom and Durrett \[[14](https://arxiv.org/html/2606.19626#bib.bib14)\] show that unigram LM segmentation produces units more aligned with morphology than BPE, indicating that the greedy compression criterion fragments legitimate morphemes. Rust et al.
\[[15](https://arxiv.org/html/2606.19626#bib.bib15)\] demonstrate, across nine typologically diverse languages, that a dedicated monolingual tokenizer contributes to monolingual performance as much as the volume of pre-training data, isolating the effect of tokenization from scale. Schmidt et al. \[[16](https://arxiv.org/html/2606.19626#bib.bib16)\] introduce the PathPiece tokenizer and empirically establish that fewer tokens do not imply better downstream performance, dissolving the informal equation between compression and quality.

The question of cross-language equity is addressed by Petrov et al. \[[17](https://arxiv.org/html/2606.19626#bib.bib17)\], who document differences of up to an order of magnitude in tokenization length between languages for informationally equivalent content, with consequences for cost, latency, and effective context window. Wegmann et al. \[[18](https://arxiv.org/html/2606.19626#bib.bib18)\] extend the argument to intralinguistic variation, showing that pre-tokenization decisions interact with orthographic and dialectal variants; in morphologically rich languages, Toraman et al. \[[19](https://arxiv.org/html/2606.19626#bib.bib19)\] demonstrate that tokenizer choice affects downstream performance comparably to scale increases, with direct implications for the tokenization of technical PT-BR. Land and Bartolo \[[20](https://arxiv.org/html/2606.19626#bib.bib20)\] further catalogue *glitch tokens* (tokens present in the vocabulary but virtually absent from training) as a class of systematic failure induced by the disconnect between tokenizer construction and model training. For TOTEN, this body of evidence is convergent: statistical vocabulary derivation introduces biases, fragmentations, and artifacts that are not corrected a posteriori by the consumer model.

#### 2.3.2 Numerical Representation and Reasoning in Language Models

The specific literature on numbers in NLP, synthesized by Thawani et al.
\[[21](https://arxiv.org/html/2606.19626#bib.bib21)\] across seven subtasks, identifies numerical representation as a weak and unstable emergent capability in generic models. Spithourakis and Riedel \[[22](https://arxiv.org/html/2606.19626#bib.bib22)\] show that hierarchical architectures treating numerals as a distinct class reduce perplexity by two to four orders of magnitude on numerical subsets. Wallace et al. \[[23](https://arxiv.org/html/2606.19626#bib.bib23)\] establish, via probing, that standard embeddings capture magnitude only for integers up to three digits, collapsing for larger scales. Geva et al. \[[24](https://arxiv.org/html/2606.19626#bib.bib24)\] propose injecting numerical ability via synthetic arithmetic data generation during pre-training, an approach complementary to the input reformulation explored by Singh and Strouse \[[3](https://arxiv.org/html/2606.19626#bib.bib3)\]. The limitations documented by Yang et al. \[[4](https://arxiv.org/html/2606.19626#bib.bib4)\] reinforce that the problem persists in frontier models. TOTEN contributes to this discussion from a distinct angle: rather than injecting numeracy via training or redesigning architectures, it operates on the input representation, preserving numerical structure (sign, mantissa, exponent, locale, right-to-left digit grouping per \[[3](https://arxiv.org/html/2606.19626#bib.bib3)\]) as ontologically typed information before any consumer model.

Structured quantity extraction has a parallel trajectory. Roy et al. \[[25](https://arxiv.org/html/2606.19626#bib.bib25)\] formalize the problem of *Quantity Entailment* and reasoning with quantities in natural language; Saha et al. \[[26](https://arxiv.org/html/2606.19626#bib.bib26)\] present BONIE, the first numerical extractor in Open Information Extraction, inferring implicit relations from contextual cues (e.g., the unit km² suggesting area). Almasian et al.
\[[5](https://arxiv.org/html/2606.19626#bib.bib5)\] consolidate this line with CQE, a hybrid system with symbolic and statistical components evaluated on scientific corpora in English. Zaratiana et al. \[[6](https://arxiv.org/html/2606.19626#bib.bib6)\] generalize entity recognition to the open-set regime with GLiNER. These systems partially cover the technical space but treat the normative PT-BR vocabulary (NBR, ABNT, hierarchical legal identifiers) as noise or generic naming, without explicit ontological modeling.

#### 2.3.3 Ontological Engineering and Knowledge-Based Extraction

Research on *Ontology-Based Information Extraction* (OBIE), systematized by Wimalasuriya and Dou \[[27](https://arxiv.org/html/2606.19626#bib.bib27)\], established architectures in which declared ontologies guide the identification and classification of textual entities. Maedche and Staab \[[28](https://arxiv.org/html/2606.19626#bib.bib28)\] propose the ontology learning cycle (import, extract, prune, refine, evaluate), and Cimiano \[[29](https://arxiv.org/html/2606.19626#bib.bib29)\] consolidates methods and metrics in a reference treatise. Upper ontologies such as SUMO \[[30](https://arxiv.org/html/2606.19626#bib.bib30)\] provide a foundation for cross-domain integration. TOTEN inherits from this tradition the epistemic commitment to an explicit formal ontology (OEE), but inverts the typical causal direction: rather than learning an ontology from a corpus, it declares it a priori and classifies textual regions according to primary types whose invariants are verifiable by construction. This inversion is deliberate and compatible with domains in which the ontology already exists institutionally: engineering has centuries of dimensional, normative, and symbolic codification that pre-exist any particular corpus.

#### 2.3.4 Knowledge-Based Systems and the Neurosymbolic Debate

The epistemological defense of systems committed to structure received an influential argument from Bender and Koller \[[31](https://arxiv.org/html/2606.19626#bib.bib31)\], according to which models trained solely on form have no mechanism for learning meaning. Lake et al. \[[32](https://arxiv.org/html/2606.19626#bib.bib32)\], in a peer-reviewed target article in *Behavioral and Brain Sciences*, argue that robust systems require explicit causal and compositional models articulated with statistical learning. The current neurosymbolic synthesis, surveyed by Hitzler et al. \[[33](https://arxiv.org/html/2606.19626#bib.bib33)\] and Sarker et al. \[[34](https://arxiv.org/html/2606.19626#bib.bib34)\], provides the contemporary framework for hybrid approaches; the modern program of *Inductive Logic Programming* \[[35](https://arxiv.org/html/2606.19626#bib.bib35)\] illustrates that declarative paradigms continue to evolve methodologically. TOTEN does not compete with neural models: it positions itself as a symbolic pre-processing layer whose output, an ontologically typed language $\mathcal{M}$, can be consumed both by statistical models and by downstream symbolic agents, in an arrangement compatible with the neurosymbolic taxonomy of type *symbolic\[neural\]* \[[34](https://arxiv.org/html/2606.19626#bib.bib34)\].

#### 2.3.5 Brazilian Portuguese Processing and Benchmarks

The PT-BR ecosystem has progressively matured: Souza et al. \[[36](https://arxiv.org/html/2606.19626#bib.bib36)\] consolidate the statistical foundation with BERTimbau; the shared tasks ASSIN \[[37](https://arxiv.org/html/2606.19626#bib.bib37)\] and ASSIN 2 \[[38](https://arxiv.org/html/2606.19626#bib.bib38)\] provide semantic similarity and textual inference datasets; LeNER-Br \[[39](https://arxiv.org/html/2606.19626#bib.bib39)\] illustrates named entity recognition in the Brazilian legal domain with specific classes (legislation, case law).
Recent academic benchmarks \[[40](https://arxiv.org/html/2606.19626#bib.bib40),[41](https://arxiv.org/html/2606.19626#bib.bib41),[42](https://arxiv.org/html/2606.19626#bib.bib42)\] enable evaluation in the national domain. None of these resources explicitly models ontological entities in the OEE sense; LeNER-Br approaches this by introducing normative classes, but maintains statistical treatment of mentions. TOTEN is designed to accommodate, in the OEE, the normative-technical vocabulary (NBR, ABNT, compositional unit identifiers, hierarchical references) that remains absent from consolidated PT-BR benchmarks; its validation on an annotated open corpus for that vocabulary is left as an extension (Section [9](https://arxiv.org/html/2606.19626#S9)).

The positioning of TOTEN therefore remains orthogonal: it does not compete with BPE in compression, with Pint \[[7](https://arxiv.org/html/2606.19626#bib.bib7)\] or udunits-2 \[[8](https://arxiv.org/html/2606.19626#bib.bib8)\] in dimensional conversion, with quantitative extractors \[[5](https://arxiv.org/html/2606.19626#bib.bib5),[26](https://arxiv.org/html/2606.19626#bib.bib26)\] in generic recall, nor with BERTimbau in distributed representation. It acts as an ontological classification layer that consumes external oracles and produces a representation semantically committed to the domain, recovering the intrinsic properties of technical entities that purely statistical pipelines systematically lose.

## 3 Ontology of Engineering Entities

### 3.1 Primary Types

The OEE declares a finite set $\mathcal{T}$ of primary types, each characterized by a signature $\langle \pi_\tau,\ \iota_\tau \rangle$ where $\pi_\tau$ is the set of intrinsic properties and $\iota_\tau$ is the set of invariants that any instance of $\tau$ must preserve.
The primary types comprise physical quantities, technical prose, technical identifiers, formal operators, universal constants, structural relations, symbolic expressions, pure numbers, and hierarchical references. This enumeration is closed with respect to the ontology's principles but open to extension as demanded empirically by the domain.

For example, the type *Physical Quantity* has the signature

$$\pi = \langle \mathrm{value},\ \mathrm{unit},\ \mathrm{dim} \rangle, \tag{3}$$

where $\mathrm{value} \in \mathbb{R} \cup \{\bot\}$, $\mathrm{unit}$ is a string in a compositional unit language, and $\mathrm{dim} \in \mathbb{Z}^7$ is the dimensional vector in the canonical order of the International System \[[43](https://arxiv.org/html/2606.19626#bib.bib43)\]. The essential invariant is *dimensional homogeneity*: two instances may be combined by addition only when their dimensional vectors coincide.

### 3.2 Structural Principles

The ontology rests on eight structural principles that govern recognition, instantiation, and composition. We state here the four central structural axioms ($A_1$, $A_3$, $A_4$, and $A_5$) in their normative form; the complete set $A_1$–$A_8$ is reproduced in Appendix [A](https://arxiv.org/html/2606.19626#A1):

###### Axiom ($A_1$: Intrinsicity).

A type $\tau \in \mathcal{T}$ is defined by its intrinsic properties $\pi_\tau$, not by pragmatic criteria or empirical frequency in a corpus.

###### Axiom ($A_3$: Mediated composition).

For $\tau_1, \tau_2 \in \mathcal{T}$, the composition $\tau_1 \circ \tau_2$ is defined if and only if $(\tau_1, \tau_2) \in \mathcal{R}$. Free concatenation is prohibited.

###### Axiom ($A_4$: Categorical error).
Applying instantiation $\mathrm{inst}_{\tau'}$ to a region classified as $\tau \neq \tau'$ constitutes a categorical error, not a gradual loss of quality.

###### Axiom ($A_5$: Closed-for-modification extensibility).

$\mathcal{O}_n$ admits extension to $\mathcal{O}_{n+1}$ provided $\mathcal{T}_n \subseteq \mathcal{T}_{n+1}$, $\mathcal{P}_n \subseteq \mathcal{P}_{n+1}$, and no invariant in $\mathcal{I}_n$ is violated by $\mathcal{O}_{n+1}$.

The four remaining principles, $A_2$ (invariant preservation by valid representation), $A_6$ (typographic convention as intrinsic property of notation), $A_7$ (structural anchoring of symbolic expressions in adjacent formal operators), and $A_8$ (distinctive mathematical mark in every compound symbol), are stated in complete normative form in Appendix [A](https://arxiv.org/html/2606.19626#A1).

### 3.3 Composition Relations

The relation $\mathcal{R}$ is defined explicitly by enumeration. Notably, $(\mathrm{UniversalConstant},\ \mathrm{PhysicalQuantity}) \in \mathcal{R}$: every named universal constant composes a physical quantity with an SI unit (e.g., $k_B = 1.38 \times 10^{-23}$ J/K). Analogously, $(\mathrm{SymbolicExpression},\ \mathrm{FormalOperator}) \in \mathcal{R}$: every symbolic expression in prose requires anchoring by an adjacent relational or calculus operator.

## 4 TOTEN Architecture

TOTEN is an operational instantiation of the OEE ontology in three functional layers, coupled to three consolidated external oracles. Figure [1](https://arxiv.org/html/2606.19626#S4.F1) summarizes the architecture.
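
Before the layers are described, the declarative core of Section 3 can be pictured as plain data. The following is a minimal sketch, not the authors' implementation: the `Ontology` class, the function names, and the type labels are hypothetical, and `is_valid_extension` covers only the set-inclusion part of $A_5$ (it does not check that invariants remain satisfied).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ontology:
    """Toy stand-in for the quadruple O = <T, P, R, I> of Eq. (1)."""
    types: frozenset        # T: primary types
    principles: frozenset   # P: labels of structural principles
    relation: frozenset     # R: enumerated pairs (tau1, tau2)
    invariants: frozenset   # I: labels of invariants


def can_compose(o: Ontology, tau1: str, tau2: str) -> bool:
    """A3: composition is defined iff the ordered pair is enumerated in R."""
    return (tau1, tau2) in o.relation


def is_valid_extension(old: Ontology, new: Ontology) -> bool:
    """Inclusion part of A5: no type or principle of `old` is removed."""
    return old.types <= new.types and old.principles <= new.principles


# Hypothetical version 1, using two pairs named in Section 3.3.
oee_v1 = Ontology(
    types=frozenset({"PhysicalQuantity", "UniversalConstant",
                     "SymbolicExpression", "FormalOperator"}),
    principles=frozenset({"A1", "A3", "A4", "A5"}),
    relation=frozenset({("UniversalConstant", "PhysicalQuantity"),
                        ("SymbolicExpression", "FormalOperator")}),
    invariants=frozenset({"dimensional_homogeneity"}),
)

# Hypothetical version 2 adds a type without removing anything.
oee_v2 = Ontology(
    types=oee_v1.types | {"PureNumber"},
    principles=oee_v1.principles,
    relation=oee_v1.relation,
    invariants=oee_v1.invariants,
)

print(can_compose(oee_v1, "UniversalConstant", "PhysicalQuantity"))
print(is_valid_extension(oee_v1, oee_v2))
```

Because $\mathcal{R}$ is a set of ordered pairs, the reversed pair is not composable unless it is enumerated separately, which is how the sketch reads the prohibition of free concatenation.
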

Figure 1: Architecture of TOTEN. The ontological classification layer maps raw text into typed regions by consulting the three consolidated external oracles (Pint, Unicode Character Database, RSLP) and the declarative specification of the OEE ontology. The instantiation layer comprises an indexed family of functions, one per type, producing the structured representation in Mode B.

### 4.1 Ontological Classification Layer

The classification layer is a total function

$$\mathrm{classify}: \Sigma^* \longrightarrow \mathcal{P}(\mathcal{R}_\Sigma), \tag{4}$$

where

$$\mathcal{R}_\Sigma = \{\, (\tau,\ [s,e),\ w[s\!:\!e]) \ \mid\ \tau \in \mathcal{T},\ 0 \leq s < e \leq |w| \,\} \tag{5}$$

is the set of typed regions in $w \in \Sigma^*$. A region $(\tau, [s,e), c)$ associates a type $\tau$, a position interval $[s,e) \subset [0,|w|)$, and the literal content $c = w[s\!:\!e]$. The image $\mathrm{classify}(w)$ is a set linearly ordered by starting position, with overlap resolution determined by a precedence relation $\succ_{\mathcal{T}}$ declared in $\mathcal{O}$. Monotonicity of the function with respect to substring inclusion is preserved: for all $w' \sqsubseteq w$, $\mathrm{classify}(w') \subseteq \mathrm{classify}(w) \cap \mathcal{R}_{w'}$.

### 4.2 Instantiator Family

The instantiation layer is the indexed family $\{\mathrm{inst}_\tau\}_{\tau \in \mathcal{T}}$ where each component

$$\mathrm{inst}_\tau:\ \mathcal{R}_\tau \longrightarrow \mathcal{M} \tag{6}$$

maps regions of type $\tau$ into strings of the output language $\mathcal{M}$.
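
The typed regions of Eq. (5) and the precedence-based overlap resolution of Section 4.1 can be pictured with a small sketch. Everything here is an illustrative assumption: the regular expressions, the two-type precedence list, and the greedy acceptance rule are stand-ins for the oracle-backed classifier, and this toy version makes no claim to satisfy the monotonicity property.

```python
import re
from typing import List, NamedTuple


class Region(NamedTuple):
    tau: str      # type label
    start: int    # s, inclusive
    end: int      # e, exclusive
    content: str  # literal content w[s:e]


# Hypothetical recognizers, listed from highest to lowest precedence.
RECOGNIZERS = [
    ("PhysicalQuantity", re.compile(r"\d+(?:,\d+)?\s?(?:kN|MPa|mm|m)\b")),
    ("PureNumber", re.compile(r"\d+(?:,\d+)?")),
]


def classify(w: str) -> List[Region]:
    """Return non-overlapping typed regions of w, ordered by start position."""
    accepted: List[Region] = []
    for tau, pattern in RECOGNIZERS:  # higher precedence is processed first
        for m in pattern.finditer(w):
            cand = Region(tau, m.start(), m.end(), m.group())
            # Accept the candidate only if it is disjoint from every region
            # already claimed by a type of higher (or equal) precedence.
            if all(cand.end <= r.start or cand.start >= r.end for r in accepted):
                accepted.append(cand)
    return sorted(accepted, key=lambda r: r.start)


text = "A viga de 6 m suporta 12,5 kN em 3 apoios."
for region in classify(text):
    print(region)
```

In this sketch the number inside `12,5 kN` is never emitted as a separate pure number, because the physical-quantity region claims the span first; only the bare `3` survives as a `PureNumber`.
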
The composition $\mathrm{inst}_\tau \circ \mathrm{classify}|_{\mathcal{R}_\tau}$ produces the ordered sequence of type-$\tau$ tags corresponding to a text $w$. The final concatenated result, interleaved with unclassified residual text, constitutes the representation $\mathrm{tok}(w) \in \mathcal{M}^*$.

The categorical separation between classification and instantiation implements the *single authority* principle: the classification layer decides the type of each region; the instantiation layer does not re-decide, it only formats. Type errors, per Axiom $A_4$ (Categorical error), propagate as categorical exceptions.

### 4.3 Coupling with External Oracles

The ontological robustness of the framework derives from systematic coupling with three consolidated external oracles, replacing manual enumeration of cases by delegation to established authorities.

*Dimensional domain.* The Pint library \[[7](https://arxiv.org/html/2606.19626#bib.bib7)\] is the external authority on units of measure. Dimensional atoms are materialized deterministically from Pint's unit registry, with expansion by International System prefixes, validation of positive conversion factors, and exclusion of physical constants that belong to another ontological type. Dimensional composition over the $\mathbb{Z}^7$ vector is delegated to Pint.

*Typographic domain.* The Unicode Character Database \[[44](https://arxiv.org/html/2606.19626#bib.bib44)\] is queried to identify typographic markers without character-by-character enumeration. Portuguese ordinals are identified by the decomposition property of type *super* combined with a specific Latin letter; numeric superscripts by the same property applied to digits; mathematical operators by the general category *Sm*.
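
The typographic queries just described can be reproduced with Python's standard `unicodedata` module, which exposes the Unicode Character Database. This is a sketch of the idea rather than the system's code: the helper names are hypothetical, and restricting ordinal markers to the letters `a` and `o` is an assumption about which "specific Latin letter" is meant.

```python
import unicodedata


def superscript_base(ch: str) -> str:
    """Return the base character if ch decomposes as '<super> XXXX', else ''."""
    parts = unicodedata.decomposition(ch).split()
    if len(parts) == 2 and parts[0] == "<super>":
        return chr(int(parts[1], 16))
    return ""


def is_ordinal_marker(ch: str) -> bool:
    """Superscript decomposition onto 'a' or 'o' (assumed ordinal letters)."""
    base = superscript_base(ch)
    return base != "" and base in "ao"


def is_numeric_superscript(ch: str) -> bool:
    """Superscript decomposition onto an ASCII digit."""
    base = superscript_base(ch)
    return base != "" and base in "0123456789"


def is_math_operator(ch: str) -> bool:
    """General category Sm (Symbol, math)."""
    return unicodedata.category(ch) == "Sm"


# U+00BA masculine ordinal indicator vs. U+00B0 degree sign, which look alike.
print(is_ordinal_marker("\u00ba"), is_ordinal_marker("\u00b0"))
# U+00B2 superscript two; U+00D7 multiplication sign.
print(is_numeric_superscript("\u00b2"), is_math_operator("\u00d7"))
```

The degree sign has no `<super>` decomposition, so the property query separates it from the visually similar ordinal indicator without listing either character explicitly.
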

*Morphological domain.* The *RSLP* algorithm \[[45](https://arxiv.org/html/2606.19626#bib.bib45)\], the established standard for Portuguese morphology, reduces any gender, number, or derivational inflection to the lemmatic root. TOTEN employs it to detect contextual technical anchors associated with single-ASCII-letter units, allowing occurrences such as *temperatura* (temperature), *tensão* (tension), or *potências* (powers) to confirm the technical use of an ambiguous unit without the need to manually enumerate all Portuguese inflections.

## 5 Output Language

The output language $\mathcal{M}$ is defined over an extended alphabet $\Sigma \cup \mathcal{D}$, where $\mathcal{D}$ is a set of structural delimiters. The production in Backus-Naur Form (BNF) is

$$
\begin{aligned}
\mathrm{tag} &\to \texttt{[}\,\tau\ \mathrm{attributes}\,\texttt{]} &\quad& (7)\\
\mathrm{attributes} &\to \mathrm{attribute} \mid \mathrm{attribute}\ \mathrm{attributes} && (8)\\
\mathrm{attribute} &\to \mathrm{key}\,\texttt{=}\,\mathrm{value} && (9)
\end{aligned}
$$

where $\tau \in \mathcal{T}$ identifies the type, $\mathrm{key}$ is an alphanumeric identifier, and $\mathrm{value}$ is a quoted string, a normalized number, or an integer vector.

The complete representation $\mathrm{tok}(w) \in \mathcal{M}^*$ is the ordered concatenation of tags with the unclassified residual text between them, preserved verbatim. This property, the *literal preservation* of the source text outside typed regions, distinguishes $\mathcal{M}$ from annotation markup languages typically used in linguistic corpora, in which the source text is replaced or rewritten.

Each type $\tau$ defines a signature $\Pi_\tau$ of mandatory and optional attributes.
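
The grammar of Eqs. (7)–(9) and the literal-preservation property can be sketched as a tiny serializer. Several details are assumptions made for illustration, since the text does not fix them: the spelling of the type name, the concrete syntax of the integer vector, the base-dimension order (L, M, T, I, Θ, N, J), and whether `value` is stored in the written unit. The attribute names `value`, `unit`, and `dim` follow the Physical Quantity signature of Eq. (3).

```python
def format_value(v):
    """Render a value as a quoted string, an integer vector, or a number."""
    if isinstance(v, str):
        return '"' + v.replace('"', '\\"') + '"'
    if isinstance(v, (list, tuple)):
        return "(" + ",".join(str(int(x)) for x in v) + ")"
    return repr(float(v))


def make_tag(tau, attributes):
    """tag -> [tau attributes]; the grammar requires at least one attribute."""
    if not attributes:
        raise ValueError("a tag needs at least one key=value attribute")
    body = " ".join(f"{key}={format_value(val)}" for key, val in attributes.items())
    return f"[{tau} {body}]"


def render(w, tagged_spans):
    """Replace each (start, end, tag) span by its tag; copy other text verbatim."""
    out, cursor = [], 0
    for start, end, tag in sorted(tagged_spans):
        out.append(w[cursor:start])  # residual text, untouched
        out.append(tag)
        cursor = end
    out.append(w[cursor:])
    return "".join(out)


w = "carga de 12,5 kN aplicada"
span = "12,5 kN"
start = w.index(span)
# Assumed dimension order (L, M, T, I, Theta, N, J): force = L^1 M^1 T^-2.
tag = make_tag(
    "PhysicalQuantity",
    {"value": 12.5, "unit": "kN", "dim": (1, 1, -2, 0, 0, 0, 0)},
)
rendered = render(w, [(start, start + len(span), tag)])
print(rendered)
```

Only the typed span is rewritten; the words before and after it pass through unchanged, which is the behaviour the literal-preservation property describes.
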
For physical quantity, the mandatory attributes are $\{\mathrm{value}, \mathrm{unit}, \mathrm{dim}\}$ and the optional ones are $\{\mathrm{r2l}, \mathrm{ambig}, \mathrm{alternatives}\}$. For pure number, the mandatory attributes are $\{\mathrm{value}, \mathrm{locale}, \mathrm{repr}, \mathrm{original}\}$. For hierarchical reference, $\mathrm{hierarchy}$ is mandatory. For technical identifier, $\mathrm{slug}$ is mandatory. The optional attributes $\mathrm{ambig}$ and $\mathrm{alternatives}$ are reserved for a future contextual ambiguity resolution layer (see Conclusion); they are not exercised in this study.

The invariance of *value* under alternative IEEE 754 representations is guaranteed by deterministic canonicalization that maps the original number string to a unique float, modulo machine precision. The *original* attribute in number tags preserves the exact form written by the author, maintaining locale (Brazilian or English), thousands separator, and representation (decimal, scientific, fractional, percentage, ordinal).

## 6 Intrinsic Evaluation

The evaluation adopts an intrinsic protocol based on four properties verifiable by construction, formally defined below.

### 6.1 Verifiable Properties

Let $S$ be an evaluated tokenization system, $G$ the set of entities annotated in the ground truth of a corpus $\mathcal{C}$, and $S(g)$ the representation produced by $S$ for entity $g \in G$.

###### Definition 1 (Ontological atomicity).

Property $H_1(S,g)$ is true if and only if $S(g)$ is a single indivisible tag corresponding to the correct type of $g$.
Formally,

$$H_1(S,g)=1 \;\Leftrightarrow\; |S(g)|=1 \;\wedge\; \tau(S(g))=\tau(g). \qquad (10)$$

###### Definition 2 (Dimensional equivalence).

For a pair $(g_1,g_2)$ of dimensionally equivalent physical quantities,

$$H_2(S,g_1,g_2)=1 \;\Leftrightarrow\; \mathrm{dim}(S(g_1))=\mathrm{dim}(S(g_2)), \qquad (11)$$

with equality in $\mathbb{Z}^{7}$.

###### Definition 3 (Typographic robustness).

For a group $\mathcal{V}$ of semantically equivalent notational variants,

$$H_3(S,\mathcal{V})=1 \;\Leftrightarrow\; \forall v,v'\in\mathcal{V}:\ \tau(S(v))=\tau(S(v')). \qquad (12)$$

###### Definition 4 (Numerical reconstruction).

$H_4(S,g)=1$ if and only if the pair $(\mathrm{value}(S(g)), \mathrm{unit}(S(g)))$ is programmatically extractable and satisfies

$$
\begin{aligned}
|\mathrm{value}(S(g))-\mathrm{value}(g)| &< \varepsilon, && (13)\\
\mathrm{dim}(\mathrm{unit}(S(g))) &= \mathrm{dim}(\mathrm{unit}(g)), && (14)
\end{aligned}
$$

with fixed tolerance $\varepsilon=10^{-6}$ (absolute error over the IEEE 754 canonicalized value).

### 6.2 Statistical Metrics

For binary per-instance hypotheses, paired contrasts between systems are evaluated by the McNemar test with exact computation [[46](https://arxiv.org/html/2606.19626#bib.bib46)]. Confidence intervals for proportions use the Wilson [[47](https://arxiv.org/html/2606.19626#bib.bib47)] formula. Effect size between paired proportions is quantified by Cohen's $h$ coefficient [[48](https://arxiv.org/html/2606.19626#bib.bib48)].
Correction for multiple comparisons follows the Holm [[49](https://arxiv.org/html/2606.19626#bib.bib49)] procedure.

### 6.3 Corpora

The internal corpus is the EngQuant benchmark, with $N=800$ cases generated procedurally over five structural typologies (cantilever beam, simply supported beam, simple plane frame, truss, and element under combined load), with physical validation via the OpenSeesPy simulator [[50](https://arxiv.org/html/2606.19626#bib.bib50)]. Compositional diversity covers seven independent generation dimensions. Cross-corpus validation employs four external corpora in Brazilian Portuguese: MMMLU PT\_BR, a professional translation of the MMLU benchmark [[51](https://arxiv.org/html/2606.19626#bib.bib51)] by OpenAI, with 595 cases eligible for numerical reconstruction; BLUEX [[40](https://arxiv.org/html/2606.19626#bib.bib40)], aggregating USP and UNICAMP entrance examinations from 2018 to 2023, with 151 eligible cases; ENEM Maritaca, with 83 eligible cases from the National High School Exam (ENEM) from 2022 to 2024; and Alvorada-Bench [[41](https://arxiv.org/html/2606.19626#bib.bib41)], aggregating FUVEST, IME, and ITA, with 942 eligible cases.

### 6.4 Comparative Systems

We compare TOTEN against eight representative systems in four families. Statistical tokenizers: cl100k and o200k [[52](https://arxiv.org/html/2606.19626#bib.bib52)]. Specialized quantitative extractors: Quantulum3, CQE [[5](https://arxiv.org/html/2606.19626#bib.bib5)], and GLiNER [[6](https://arxiv.org/html/2606.19626#bib.bib6)]. Dimensional libraries: Pint [[7](https://arxiv.org/html/2606.19626#bib.bib7)] and udunits-2 [[8](https://arxiv.org/html/2606.19626#bib.bib8)]. Generic entity recognition in Portuguese: spaCy, model pt-core-news-md [[9](https://arxiv.org/html/2606.19626#bib.bib9)].

## 7 Results

Figure [2](https://arxiv.org/html/2606.19626#S7.F2) summarizes the consolidated contrasts between TOTEN and each comparative system on the internal benchmark.
Table [1](https://arxiv.org/html/2606.19626#S7.T1) presents absolute values per property and system.

Figure 2: Consolidated summary of paired contrasts on the internal benchmark. Points represent the proportion difference $S_{\text{TOTEN}}-S_{\text{baseline}}$; bars represent 95% Wilson confidence intervals. Differences in $H_1$ and $H_4$ are statistically significant by McNemar with Holm correction in all contrasts ($p<0.001$).

Table 1: Consolidated results on the internal EngQuant benchmark.

### 7.1 Detection and Ontological Atomicity

TOTEN achieves unit atomicity over the $|G|=31\,674$ annotated ground-truth entities evaluated. To avoid a tautological reading of $H_1=1.000$, we decompose recognition into three metrics (Table [2](https://arxiv.org/html/2606.19626#S7.T2)): detection *recall* ($|R_S|/|G|$, where $R_S$ is the set recognized by system $S$ and $G$ is the ground truth), conditional structural atomicity ($A_{\mathrm{cond}}$, the fraction of recognized regions emitted as a single indivisible tag, *without* type requirement), and effective structural atomicity ($A_{\mathrm{eff}}=\mathrm{Recall}\cdot A_{\mathrm{cond}}$, which penalizes non-detection). Ontological atomicity $H_1$ (Definition [1](https://arxiv.org/html/2606.19626#Thmdefinition1), Table [1](https://arxiv.org/html/2606.19626#S7.T1)) is stricter: it additionally requires that the single tag receive the correct type, so $H_1\leq A_{\mathrm{eff}}$. The separation between detection and correct classification is the standard evaluative convention in ontology-based extraction [[27](https://arxiv.org/html/2606.19626#bib.bib27)].
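
The relationship between the three recognition metrics and $H_1$ can be illustrated with a minimal sketch. It assumes a hypothetical per-entity record, `Outcome`, that stores the ground-truth type, the number of tags a system emitted for the entity, and the predicted type; these names, the type strings, and the sample data are illustrative and do not come from the TOTEN implementation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Outcome:
    """Hypothetical record of how one system handled one ground-truth entity."""
    gold_type: str                        # type annotated in the ground truth
    n_tags: int                           # tags emitted for the entity; 0 means not recognized
    predicted_type: Optional[str] = None  # type of the emitted tag, when n_tags == 1


def atomicity_metrics(outcomes: Sequence[Outcome]) -> dict:
    """Compute Recall, A_cond, A_eff and H1 over a non-empty list of outcomes."""
    total = len(outcomes)
    recognized = [o for o in outcomes if o.n_tags > 0]
    single = [o for o in recognized if o.n_tags == 1]
    typed = [o for o in single if o.predicted_type == o.gold_type]

    recall = len(recognized) / total
    a_cond = len(single) / len(recognized) if recognized else 0.0
    return {
        "recall": recall,
        "a_cond": a_cond,
        "a_eff": recall * a_cond,  # penalizes non-detection
        "h1": len(typed) / total,  # single tag AND correct type
    }


# Illustrative data only: one correct tag, one mistyped tag,
# one fragmented entity, and one entity that was not detected.
sample = [
    Outcome("PhysicalQuantity", n_tags=1, predicted_type="PhysicalQuantity"),
    Outcome("PhysicalQuantity", n_tags=1, predicted_type="Number"),
    Outcome("Number", n_tags=3),
    Outcome("TechnicalIdentifier", n_tags=0),
]
metrics = atomicity_metrics(sample)
```

Because the entities counted for $H_1$ are a subset of those counted for $A_{\mathrm{eff}}$, the inequality $H_1\leq A_{\mathrm{eff}}$ holds for any input to this sketch.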
BPE tokenizers have unit recall but emit a single tag for about two-thirds of entities ($A_{\mathrm{eff}}\approx 0.66$), satisfying $H_1$ (that is, with the correct type) for roughly half (Table [1](https://arxiv.org/html/2606.19626#S7.T1)); dimensional libraries have zero textual recall (they operate on isolated unit strings); specialized extractors show partial recall. TOTEN combines unit recall over the declared OEE closure with unit structural and ontological atomicity. The advantage is categorical: recall *and* atomicity simultaneously. Entities outside the OEE closure are not recognized by construction, in conformance with Axiom $A_1$ (Intrinsicity). Figure [3](https://arxiv.org/html/2606.19626#S7.F3) stratifies the result by ontological type.

Table 2: Detection recall and structural atomicity on the internal EngQuant benchmark.

Figure 3: Atomicity by system and ontological type. TOTEN is the only system that combines unit *recall* and atomicity across all evaluated types (physical quantity, technical identifier, formal operator, symbolic expression, and number).

### 7.2 Dimensional Equivalence

TOTEN achieves conditional dimensional accuracy of $0.968$ ($61$ correct answers in $63$ answered pairs, among the $70$ dimensional pairs evaluated), against $0.985$ for Pint over the same sample. The difference $\Delta=-0.017$ is not statistically significant by the McNemar test ($p=1.0$). The interpretation is direct: TOTEN consumes Pint as a dimensional oracle and therefore inherits its authority. The residual difference reflects operation over continuous text (TOTEN) versus isolated unit strings (Pint).
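
The paired tests reported here and in Section 6.2 can be sketched with the Python standard library alone. The sketch assumes the usual two-sided exact formulation of the McNemar test, a binomial test on the two discordant-pair counts $b$ and $c$, together with Holm step-down adjustment; the function names are illustrative, and the paper does not state which implementation produced its $p$-values.

```python
from math import comb
from typing import List, Sequence


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # P(X <= k) for X ~ Binomial(n, 1/2), computed with exact integers.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def holm(pvalues: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the rank-th smallest p-value by (m - rank), cap at 1,
        # and enforce monotonicity across the ordered sequence.
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[idx]))
        adjusted[idx] = running_max
    return adjusted
```

Under this formulation a contrast with very few discordant pairs cannot reach significance (a single discordant pair gives $p=1.0$), whereas a fully one-sided split such as the $b=5\,955$, $c=0$ reported for the Pint ablation gives a $p$-value that underflows to zero in floating point.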
We report $H_2$ in two conventions: $H_2^{\mathrm{cond}}$, conditional on pairs with a valid response (the value $0.968$ in Table [1](https://arxiv.org/html/2606.19626#S7.T1)), and $H_2^{\mathrm{eff}}$, effective over all $70$ dimensional pairs with non-response counted as error ($61/70=0.871$ for the full configuration in the ablation of Table [5](https://arxiv.org/html/2606.19626#S7.T5)); the difference between the two forms reflects only coverage, not conditional dimensional accuracy. Figure [4](https://arxiv.org/html/2606.19626#S7.F4) shows the relationship between coverage and conditional accuracy for the three systems with an explicit dimensional vector.

Figure 4: Coverage and accuracy in dimensional equivalence. TOTEN balances coverage and accuracy, with a non-significant difference relative to Pint.

### 7.3 Typographic Robustness

Figure [5](https://arxiv.org/html/2606.19626#S7.F5) presents robustness stratified by variant type. We report $H_3$ in two forms (Table [3](https://arxiv.org/html/2606.19626#S7.T3)): $H_3^{\mathrm{global}}$ over all 43 variant groups in the benchmark and $H_3^{\mathrm{scoped}}$ over the 42 groups within the closure declared *a priori* by the OEE coupled to the oracles. TOTEN achieves $H_3^{\mathrm{global}}=0.326$ (vs. $H_3^{\mathrm{scoped}}=0.325$; removing the single locale\_thousands group, deferred to a future phase, changes the result by $-0.0002$), particularly on composition with centered dot and compositional space per [[43](https://arxiv.org/html/2606.19626#bib.bib43)], and on multiple pressure-by-liquid-column variants. Pint achieves $0.295$ ($0.287$ scoped); remaining systems fragment or ignore the entity at first contact with a notational variant not exactly anticipated.
The absolute value $0.326$ reflects that typographic coverage is limited to the declared OEE closure coupled to the oracles and Unicode UCD delegation: variants outside this closure are not normalized by construction, and expanding the catalogue is incremental work. The relative advantage over comparatives is preserved under both denominators. The 43 variant groups are distributed across the notational categories of Figure [5](https://arxiv.org/html/2606.19626#S7.F5) (e.g., Unicode superscript, decimal separator, compositional centered dot, pressure variants); each category aggregates multiple semantically equivalent groups.

Table 3: Typographic robustness in two evaluation forms.

Figure 5: Typographic robustness stratified by variant type. For each category of notational variant (e.g., Unicode superscript, PT-BR decimal separator), metric $H_3$ quantifies the fraction of groups whose variants receive an identical type after tokenization.

### 7.4 Numerical Reconstruction

Numerical reconstruction is the most discriminative property of the framework. TOTEN achieves $0.780$ on the internal benchmark, against $0.340$ for CQE, the best internal baseline, and $0.220$ for Quantulum3 and Pint. Systems without explicit quantitative extraction achieve zero. The difference relative to CQE, $\Delta=+0.440$, is statistically significant by the McNemar test with Holm correction ($p<10^{-4}$). Figure [6](https://arxiv.org/html/2606.19626#S7.F6) stratifies the result by numerical subtype.

Figure 6: Numerical reconstruction stratified by subtype. TOTEN leads or ties in the majority of evaluated numerical subtypes (wins in $11$ and ties in $8$ of the $20$ subtypes), with perfect accuracy on fractions, percentages, PT-BR locale decimals, and Unicode scientific notation; comparative systems cover only partial subsets, with heterogeneous performance across subtypes.
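
The per-entity check behind these scores is Definition 4. A minimal sketch of that check follows, under stated assumptions: the dimension table, its base-dimension ordering, and the function name `h4` are illustrative stand-ins for the dimensional oracle, and a system that extracts nothing is represented by `None`.

```python
from typing import Optional, Tuple

EPSILON = 1e-6  # fixed absolute tolerance from Definition 4

# Illustrative dimension vectors in Z^7. The base-dimension order
# (length, mass, time, current, temperature, amount, luminous intensity)
# and the unit list are assumptions made for this sketch only.
DIMENSIONS = {
    "m":   (1, 0, 0, 0, 0, 0, 0),
    "mm":  (1, 0, 0, 0, 0, 0, 0),
    "N":   (1, 1, -2, 0, 0, 0, 0),
    "kN":  (1, 1, -2, 0, 0, 0, 0),
    "MPa": (-1, 1, -2, 0, 0, 0, 0),
}

Quantity = Tuple[float, str]  # (canonical value, unit symbol)


def h4(predicted: Optional[Quantity], gold: Quantity) -> bool:
    """Return True when the (value, unit) pair satisfies Eqs. (13) and (14)."""
    if predicted is None:  # nothing programmatically extractable
        return False
    value, unit = predicted
    gold_value, gold_unit = gold
    if unit not in DIMENSIONS or gold_unit not in DIMENSIONS:
        return False  # unit outside this sketch's table
    close = abs(value - gold_value) < EPSILON                 # Eq. (13)
    same_dim = DIMENSIONS[unit] == DIMENSIONS[gold_unit]      # Eq. (14)
    return close and same_dim
```

The sketch follows the definition literally: values are compared as written and only the dimension vectors of the units are compared, so no rescaling between units of the same dimension is attempted.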
### 7.5 Validation on External Corpora

Table [4](https://arxiv.org/html/2606.19626#S7.T4) summarizes numerical reconstruction results on the four external Brazilian Portuguese corpora. TOTEN maintains leadership across all: $0.866$ on MMMLU PT\_BR, $0.775$ on BLUEX, $0.904$ on ENEM Maritaca, and $0.790$ on Alvorada-Bench. The best comparative system, Quantulum3, achieves between $0.627$ and $0.703$. Figure [7](https://arxiv.org/html/2606.19626#S7.F7) presents the consolidated forest plot of contrasts on external corpora. Figure [8](https://arxiv.org/html/2606.19626#S7.F8) consolidates the final cross-corpus synthesis.

Table 4: Numerical reconstruction on external Brazilian Portuguese corpora.

Figure 7: Numerical reconstruction on four external Brazilian Portuguese corpora. TOTEN leads in all, with differences significant by McNemar with Holm correction.

Figure 8: Comparative synthesis on the internal EngQuant benchmark. Panel (A): effective structural atomicity ($A_{\mathrm{eff}}$, which does not require correct type; cf. Table [2](https://arxiv.org/html/2606.19626#S7.T2)); Panel (B): numerical reconstruction ($H_4$). TOTEN maintains unit or near-unit atomicity and reconstruction against all comparative systems.

### 7.6 Oracle Ablation

Table [5](https://arxiv.org/html/2606.19626#S7.T5) reports the *leave-one-oracle-out* ablation on the internal EngQuant: for each of the three external oracles coupled to TOTEN (Pint, Unicode UCD, and RSLP) we disable the oracle, keep the rest of the ontological architecture, and re-measure the five metrics. Removing Pint produces the largest drop, concentrated in Recall ($\Delta=-0.188$; paired McNemar $b=5\,955$, $c=0$, $p<10^{-3}$) and in $H_2$ ($\Delta=-0.871$; $p<10^{-3}$): without the dimensional oracle no unit atoms exist in the recognizer, so physical quantities cease to be detected as atomic entities.
The drop in $H_4$ is smaller ($\Delta=-0.040$; $p=0.5$), because the OEE Number type captures pure numerical values independently of dimensional recognition, preserving most of the reconstruction fidelity. Removing UCD produces a residual effect on the internal benchmark ($\Delta\,\mathrm{Recall}=-0.003$, $p<10^{-3}$; $\Delta H_3=0.000$, $p=1.0$): the contribution of UCD is localized to Unicode mathematical operators ($\approx$, $\leq$, $\geq$, $\times$) and typographic exclusion of *super*/*sub* letters, events infrequent in the structural engineering corpus evaluated. Removing RSLP has no measurable effect in this benchmark ($\Delta=0$ in all metrics; $p=1.0$): EngQuant does not contain regions where single-letter units (K, A, V, W, …) appear ambiguously, so the RSLP technical anchor function is never invoked. This null result is expected: the contribution of RSLP materializes in corpora where such ambiguities exist, whose domain-specific validation is left as future work.

Table 5: Leave-one-oracle-out ablation on the internal EngQuant benchmark.

### 7.7 Concurrent Validity of the Internal Benchmark

To verify that the system ordering induced by the internal benchmark is consistent with that of real corpora, we compute the Spearman rank correlation coefficient ($\rho$, with mean-rank correction for ties) between the ranking of nine systems by $H_4$ on EngQuant and the ranking on each of the four external corpora. We report Kendall's $\tau_b$ for robustness, $p$-values by permutation ($10^{5}$ label resamplings per system) and 95% bootstrap confidence intervals ($10^{4}$ within-corpus resamplings per case). The analysis validates the comparative consistency of the control benchmark; label correctness is assured by construction (deterministic generator and physical validation via OpenSeesPy), not by correlation.
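
The rank-correlation step can be sketched as follows, using only the standard library. The sketch computes Spearman's $\rho$ as the Pearson correlation of mean ranks (the tie correction named above) and a one-sided permutation $p$-value; the function names, the default number of resamplings, and the score vectors at the end are illustrative assumptions, and Kendall's $\tau_b$ and the bootstrap intervals are not shown.

```python
import random
from statistics import mean
from typing import List, Sequence


def average_ranks(scores: Sequence[float]) -> List[float]:
    """Rank scores in descending order; tied scores share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's rho with mean-rank tie handling."""
    return pearson(average_ranks(a), average_ranks(b))


def permutation_p(a: Sequence[float], b: Sequence[float],
                  n_resamples: int = 10_000, seed: int = 0) -> float:
    """One-sided permutation p-value for rho, shuffling the labels of b."""
    rng = random.Random(seed)
    observed = spearman(a, b)
    shuffled = list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(shuffled)
        if spearman(a, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)  # add-one correction avoids p = 0


# Hypothetical H4 scores for five systems on two corpora (not from the paper).
internal = [0.80, 0.35, 0.20, 0.20, 0.00]
external = [0.85, 0.60, 0.65, 0.05, 0.00]
rho = spearman(internal, external)
```

With only a handful of systems the permutation distribution is coarse, which is one reason to report a second rank statistic alongside $\rho$.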
Table [6](https://arxiv.org/html/2606.19626#S7.T6) shows that the system ranking by numerical reconstruction on the internal benchmark correlates strongly with those of external corpora ($\rho\geq 0.856$ across all four corpora, $p_{\mathrm{perm}}<0.05$), indicating that EngQuant is a faithful proxy for comparative evaluation. The residual displacement is concentrated in Pint, whose extraction collapses in continuous external text (from $H_4\approx 0.22$ on the internal benchmark to $H_4\approx 0$ on external corpora), expected behavior given that it operates on isolated unit strings rather than continuous text.

Table 6: Concurrent validity of the internal benchmark.

## 8 Discussion

**Categorical advantage in atomicity.** The difference between TOTEN and statistical systems in $H_1$ is categorical, not gradual. Statistical tokenizers have no explicit concept of ontological entity; dimensional libraries do not perform textual recognition; English-specialized extractors cover fragments of the PT-BR vocabulary. The unit atomicity of TOTEN reflects the *ontological commitment* in the sense of Guarino [[12](https://arxiv.org/html/2606.19626#bib.bib12)] declared in $\mathcal{O}$ and materialized by the categorical separation between classification and instantiation typical of ontology-based extraction [[27](https://arxiv.org/html/2606.19626#bib.bib27)].

**Near-parity in dimensional equivalence with Pint.** The non-significant difference in $H_2$ against Pint is methodologically expected and desirable. Pint is the reference oracle for units, and TOTEN consumes it systematically to sustain its dimensional domain. Surpassing Pint in pure dimensional equivalence would be indicative of a methodological error. The observed parity confirms that coupling to the external oracle preserves dimensional authority without introducing distortions.

**Ontological orchestration *vs.* Pint *wrapper*.**
The ablation in Table [5](https://arxiv.org/html/2606.19626#S7.T5) supports the claim that TOTEN is not a Pint wrapper. Three facts converge: (i) Pint in isolation reports $H_4\approx 0$ on continuous text, because it operates on already-isolated unit strings; the ontological tokenization that delivers those strings to Pint is what makes a high $H_4$ achievable. (ii) The $-$Pint ablation leaves the rest of the architecture intact and shows that the dominant loss is in Recall and $H_2$, not in $H_4$: Pint is the dimensional authority, and the OEE delimits that role rather than masking it in a single aggregate. (iii) The $-$UCD and $-$RSLP ablations affect axes orthogonal to Pint's (mathematical symbols and unit-letter ambiguity, respectively); their effect is small or nil in EngQuant because the internal benchmark is structural-mechanical, without Unicode relational operators in continuous prose nor isolated single-letter units. The architecture is therefore an ontological orchestration that (a) delivers to Pint the object over which it is authoritative, (b) delegates Unicode mathematical symbol classification to UCD, and (c) uses Portuguese morphology (RSLP) to resolve unit-letter ambiguity via semantic anchoring: three demonstrably complementary axes.

**Semantic preservation in numerical reconstruction.** The advantage in $H_4$ replicates across all five corpora. TOTEN preserves, in the input representation, IEEE 754 normalized value, numerical locale, representation (decimal, scientific, fractional, percentage, ordinal), and the literal form written by the author, which is the structure that Singh and Strouse [[3](https://arxiv.org/html/2606.19626#bib.bib3)] and Yang et al. [[4](https://arxiv.org/html/2606.19626#bib.bib4)] associate with better arithmetic performance in consumer models. Systems built on English corpora lose these dimensions when operating on PT-BR vocabulary.
Direct verification of the downstream effect of this representation on consumer models is outside the scope of this study and is left as future work.

**Limitations.** The system deliberately preserves literal form for typographically degraded notation, without active normalization; resolution of these cases is left for a contextual ambiguity layer reserved as future work. Dimensionless coverage is partial by construction: the set of non-SI-dimensional units is included via an explicit whitelist, avoiding collision with Portuguese words that coincide lexically with terms such as *grade*, *byte*, or *cycle*. Residual cases with incorrect typography in normative text may still produce unresolved ambiguous interpretation in this version.

**Theoretical implications.** The formalization of ontological tokenization as the triple $\langle\mathcal{O},\mathrm{classify},\{\mathrm{inst}_{\tau}\}\rangle$ admits evaluation via verifiable properties without dependence on costly generative models. This feature is compatible with reproducibility requirements typical of knowledge-based systems [[11](https://arxiv.org/html/2606.19626#bib.bib11),[53](https://arxiv.org/html/2606.19626#bib.bib53)] and allows the framework to be replicated, audited, and extended by other groups without inference cost on an external API.

## 9 Conclusion

This work formalized and evaluated TOTEN, a knowledge-based ontological tokenization framework for physical quantities and technical notation in Brazilian Portuguese. The central contribution is the categorical separation between statistical vocabulary derivation, typical of BPE tokenizers, and declarative classification grounded in formal ontology, materialized in a triple composed of an ontology, a classification function, and an indexed family of instantiators.
Evaluation on five distinct corpora demonstrated statistically significant advantage in ontological atomicity and numerical reconstruction against eight representative state-of-the-art systems, with dimensional parity relative to Pint, the external oracle from which the framework derives its dimensional authority. Future directions include: (i) the integration of a human-in-the-loop ambiguity resolution protocol, formally specified but not evaluated here, sustained by the $\mathrm{ambig}$ and $\mathrm{alternatives}$ attributes already reserved in the output language; (ii) external validation on a Brazilian normative/legal corpus (e.g., NBR, ABNT, legal text), contingent on the availability of a publicly open benchmark with ontological annotation for that domain (precedents such as LeNER-Br [[39](https://arxiv.org/html/2606.19626#bib.bib39)] indicate the viability of the vehicle, though without explicit ontological coverage); (iii) the materialization of the structured representation in a small language model trained natively on the ontological vocabulary, following a complementary methodological program to consolidated monolingual approaches in Portuguese [[36](https://arxiv.org/html/2606.19626#bib.bib36)]; and (iv) the *downstream* evaluation of the effect of this representation on consumer models, orthogonal to the empirical evidence gathered by Singh and Strouse [[3](https://arxiv.org/html/2606.19626#bib.bib3)] and Yang et al. [[4](https://arxiv.org/html/2606.19626#bib.bib4)], outside the scope of this study.

## Acknowledgements

The authors thank the research team at Aia Context and the Universidade Federal do Maranhão (UFMA) for institutional support.

## Declaration of Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
## Data and Code Availability

## Appendix A OEE Axioms

The Ontology of Engineering Entities is governed by eight structural principles, constituting the set $\mathcal{P}$ of the quadruple $\mathcal{O}=\langle\mathcal{T},\ \mathcal{P},\ \mathcal{R},\ \mathcal{I}\rangle$ defined in ([1](https://arxiv.org/html/2606.19626#S2.E1)). Section [3](https://arxiv.org/html/2606.19626#S3) presents in the body of the paper the four central structural axioms (Intrinsicity, Mediated composition, Categorical error, and Closed-for-modification extensibility) for their architectural centrality; this appendix reproduces them for self-containment and states in complete normative form the four remaining operational axioms governing invariant preservation, typographic convention, structural anchoring of symbolic expressions, and distinctive mathematical mark. The numbering $A_1$–$A_8$ is stable and referenced by the declarative specification data/oee-v1.yaml under labels $P_1$–$P_8$.

###### Axiom ($A_1$: Intrinsicity).

For every type $\tau\in\mathcal{T}$, the identity of $\tau$ is determined exclusively by its intrinsic signature $\langle\pi_{\tau},\ \iota_{\tau}\rangle$: neither empirical frequency in corpus $\mathcal{C}$ nor subsequent pragmatic criteria may redefine $\tau$. Formally, $\tau=\tau'\iff\pi_{\tau}=\pi_{\tau'}\wedge\iota_{\tau}=\iota_{\tau'}$. By construction, the signatures of primary types in $\mathcal{T}$ are mutually distinct, so this identity relation does not collapse distinct primary types.

###### Axiom ($A_2$: Invariant preservation).

Let $\mathrm{inst}_{\tau}(r)\in\mathcal{M}$ be the representation produced by the instantiator family for a region $r$ classified as $\tau$.
For every invariant $i\in\iota_{\tau}$ and every valid representation $m=\mathrm{inst}_{\tau}(r)$, we require $i(m)=i(r)$. Equivalently, the diagram $r\xrightarrow{\mathrm{inst}_{\tau}}m\xrightarrow{i}v$ commutes with $r\xrightarrow{i}v$ for all $i\in\iota_{\tau}$; a representation that violates any invariant in $\iota_{\tau}$ is inadmissible and must be rejected by the instantiation layer.

###### Axiom ($A_3$: Mediated composition).

For $\tau_1,\tau_2\in\mathcal{T}$, the composition $\tau_1\circ\tau_2$ is defined if and only if $(\tau_1,\tau_2)\in\mathcal{R}$. Free concatenation is prohibited: given a pair $(\tau_1,\tau_2)\notin\mathcal{R}$, no instance of $\tau_1$ may syntactically or semantically contain an instance of $\tau_2$ as a structural component.

###### Axiom ($A_4$: Categorical error).

Applying instantiation $\mathrm{inst}_{\tau'}$ to a region $r$ such that $\tau(r)=\tau\neq\tau'$ constitutes a categorical error, not a gradual loss of quality. Under $A_2$, such application necessarily produces a violation of at least one invariant in $\iota_{\tau}$ or $\iota_{\tau'}$, and the resulting output is inadmissible in $\mathcal{M}$.

###### Axiom ($A_5$: Closed-for-modification extensibility).

For every evolution $\mathcal{O}_n\rightsquigarrow\mathcal{O}_{n+1}$, we simultaneously require $\mathcal{T}_n\subseteq\mathcal{T}_{n+1}$, $\mathcal{P}_n\subseteq\mathcal{P}_{n+1}$, $\mathcal{R}_n\subseteq\mathcal{R}_{n+1}$, and that no invariant in $\mathcal{I}_n$ be violated by instances produced under $\mathcal{O}_{n+1}$.
The ontology is therefore open for extension and closed for modification, in the sense analogous to the *open-for-extension, closed-for-modification* principle in software engineering.

###### Axiom ($A_6$: Typographic convention as intrinsic property).

For every type $\tau\in\mathcal{T}$, the identity of $\tau$ is invariant under typographic transformations declared as equivalent by the external authority *Unicode Character Database* (UCD). For every region $r$ and every notational variant $r'$ obtained by finite composition of operations within the declared closure, namely Unicode normalization, canonical decomposition (Decomposition\_Type $\in\{\text{super, sub, compat, font}\}$), general category $\mathrm{Sm}$ for mathematical operators, and combining marks ($\mathrm{Mn}$), we have $\mathrm{classify}(r)=\mathrm{classify}(r')$. Typographic canonicalization is therefore a function of the classification layer, never of instantiation; variants outside the declared closure constitute an extension to $\mathcal{O}_{n+1}$ under $A_5$, and not an *ad hoc* correction to $\mathrm{inst}_{\tau}$.

###### Axiom ($A_7$: Structural anchoring of symbolic expressions).

Every region $r$ classified as $\tau=\mathrm{SymbolicExpression}$ requires, within a declared contextual window $\Delta$ ($|\Delta|\leq 2$ adjacent characters), the presence of at least one structural anchor $\alpha$ belonging to the set declared in $\mathcal{O}$: (i) formal relational or calculus operator ($=$, $\approx$, $\leq$, $\geq$, $<$, $>$, $\propto$, $\equiv$, $\sum$, $\int$, $\partial$, $\nabla$); (ii) structural relation; (iii) formal index or subscript; or (iv) explicit mathematical delimiter.
In the absence of $\alpha$ in $\Delta$, the region remains in $\mathrm{TechnicalProse}$; the axiom eliminates by construction the fundamental ambiguity between variable-in-expression and letter-in-natural-word without resorting to an external prose dictionary.

###### Axiom ($A_8$: Distinctive mathematical mark in compound symbol).

Let $r$ be a candidate region for $\tau\in\{\mathrm{SymbolicExpression},\ \mathrm{PhysicalQuantity}\}$ whose content is mediated composition (under $A_3$) over potentially ambiguous ASCII operators $\{/,\ *,\ \hat{}\,,\ +,\ -,\ (,\ )\}$. For such $r$ to be admitted in $\tau$, the presence of at least one *categorical mathematical mark* $\mu$ in $r$ is required, belonging to the derived (not enumerated) closure $M=\mathrm{digit}\,\cup\,\mathrm{ASCII\ subscript}\,\cup\,\mathrm{Unicode\ super/subscript\ (UCD)}\,\cup\,\mathrm{Greek\ letter}\,\cup\,\mathrm{unambiguous\ Sm\ operator}\,\cup\,\mathrm{applied\ function}\,\cup\,\text{LaTeX markup}$. Compositions based exclusively on ambiguous ASCII operators over tokens that admit a non-mathematical prosaic interpretation (e.g., and/or) are rejected. Axioms $A_3$ and $A_8$ are complementary: $A_3$ admits the composition, $A_8$ requires a categorical mathematical signature for the composition to be recognized as a formal entity.

The correspondence between axioms and the declarative specification is direct: $A_k$ corresponds to label $P_k$ in data/oee-v1.yaml for $k\in\{1,\ldots,7\}$, with $A_8$ added by the specification in the operational derivation section of the classifier.
Axioms $A_1$, $A_3$, $A_4$, and $A_5$ are structural and govern the form of the ontology $\mathcal{O}$; $A_2$ governs the relation between $\mathcal{I}$ and $\{\mathrm{inst}_{\tau}\}_{\tau\in\mathcal{T}}$; $A_6$, $A_7$, and $A_8$ are operational and govern, respectively, the typographic robustness of the function $\mathrm{classify}$, the recognition of $\mathrm{SymbolicExpression}$ in continuous text, and compositional disambiguation under ambiguous ASCII operators. Together, $A_1$–$A_8$ constitute the ontological commitment [[12](https://arxiv.org/html/2606.19626#bib.bib12)] of TOTEN.

## References

- Sennrich et al. [2016] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In *Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 1715–1725. Association for Computational Linguistics, 2016. doi:[10.18653/v1/P16-1162](https://doi.org/10.18653/v1/P16-1162).
- Kudo and Richardson [2018] Taku Kudo and John Richardson. SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In *Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations*, pages 66–71. Association for Computational Linguistics, 2018. doi:[10.18653/v1/D18-2012](https://doi.org/10.18653/v1/D18-2012).
- Singh and Strouse [2024] Aaditya K. Singh and DJ Strouse. Tokenization counts: the impact of tokenization on arithmetic in frontier large language models. *Transactions on Machine Learning Research*, 2024. Featured Certification.
- Yang et al. [2025] Haotong Yang, Yi Yu, Wei Zhang, et al. Number cookbook: number understanding of language models and how to improve it. In *Proceedings of the International Conference on Learning Representations (ICLR)*, pages 1–22, 2025.
- Almasian et al. [2023] Satya Almasian, Dennis Aumiller, and Michael Gertz. CQE: a comprehensive quantity extractor. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 12845–12859. Association for Computational Linguistics, 2023. doi:[10.18653/v1/2023.emnlp-main.792](https://doi.org/10.18653/v1/2023.emnlp-main.792).
- Zaratiana et al. [2024] Urchade Zaratiana, Nadi Tomeh, Pierre Holat, and Thierry Charnois. GLiNER: generalist model for named entity recognition using bidirectional transformer. In *Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pages 5364–5376. Association for Computational Linguistics, 2024. doi:[10.18653/v1/2024.naacl-long.300](https://doi.org/10.18653/v1/2024.naacl-long.300).
- Grecco et al. [2022] Hernan E. Grecco, Jonas L. Chase, Lisandro D. Dalcin, et al. Pint: a Python package to define, operate and manipulate physical quantities. *Journal of Open Source Software*, 7(78):4574, 2022. doi:[10.21105/joss.04574](https://doi.org/10.21105/joss.04574).
- Hankin [2020] Robin K. S. Hankin. The udunits package for dimensional analysis in R. *Journal of Statistical Software*, 93:1–14, 2020. doi:[10.18637/jss.v093.i06](https://doi.org/10.18637/jss.v093.i06).
- Honnibal et al. [2020] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: industrial-strength natural language processing in Python, 2020.
- Gruber [1993] Thomas R. Gruber. A translation approach to portable ontology specifications. *Knowledge Acquisition*, 5(2):199–220, 1993. doi:[10.1006/knac.1993.1008](https://doi.org/10.1006/knac.1993.1008).
- Studer et al. [1998] Rudi Studer, V. Richard Benjamins, and Dieter Fensel. Knowledge engineering: principles and methods. *Data & Knowledge Engineering*, 25(1–2):161–197, 1998. doi:[10.1016/S0169-023X(97)00056-6](https://doi.org/10.1016/S0169-023X(97)00056-6).
- Guarino [1998] Nicola Guarino. Formal ontology and information systems. In *Proceedings of the First International Conference on Formal Ontology in Information Systems (FOIS ’98)*, pages 3–15. IOS Press, 1998.
- Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In *Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pages 4171–4186. Association for Computational Linguistics, 2019. doi:[10.18653/v1/N19-1423](https://doi.org/10.18653/v1/N19-1423).
- Bostrom and Durrett [2020] Kaj Bostrom and Greg Durrett. Byte pair encoding is suboptimal for language model pretraining. In *Findings of the Association for Computational Linguistics: EMNLP 2020*, pages 4617–4624. Association for Computational Linguistics, 2020. doi:[10.18653/v1/2020.findings-emnlp.414](https://doi.org/10.18653/v1/2020.findings-emnlp.414).
- Rust et al. [2021] Phillip Rust, Jonas Pfeiffer, Ivan Vulić, Sebastian Ruder, and Iryna Gurevych. How good is your tokenizer? On the monolingual performance of multilingual language models. In *Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)*, pages 3118–3135. Association for Computational Linguistics, 2021. doi:[10.18653/v1/2021.acl-long.243](https://doi.org/10.18653/v1/2021.acl-long.243).
- Schmidt et al. [2024] Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, and Chris Tanner. Tokenization is more than compression. In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 678–702. Association for Computational Linguistics, 2024. doi:[10.18653/v1/2024.emnlp-main.40](https://doi.org/10.18653/v1/2024.emnlp-main.40).
- Petrov et al. [2023] Aleksandar Petrov, Emanuele La Malfa, Philip H. S. Torr, and Adel Bibi. Language model tokenizers introduce unfairness between languages. In *Advances in Neural Information Processing Systems*, volume 36, pages 36963–36990, 2023.
- Wegmann et al. [2025] Anna Wegmann, Dong Nguyen, and David Jurgens. Tokenization is sensitive to language variation. In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 10958–10983. Association for Computational Linguistics, 2025.
- Toraman et al. [2023] Cagri Toraman, Eyup Halit Yilmaz, Furkan Şahinuç, and Oguzhan Ozcelik. Impact of tokenization on language models: an analysis for Turkish. *ACM Transactions on Asian and Low-Resource Language Information Processing*, 22(4):116:1–116:21, 2023. doi:[10.1145/3578707](https://doi.org/10.1145/3578707).
- Land and Bartolo [2024] Sander Land and Max Bartolo. Fishing for Magikarp: automatically detecting under-trained tokens in large language models. In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 11631–11646. Association for Computational Linguistics, 2024. doi:[10.18653/v1/2024.emnlp-main.649](https://doi.org/10.18653/v1/2024.emnlp-main.649).
- Thawani et al. [2021] Avijit Thawani, Jay Pujara, Filip Ilievski, and Pedro Szekely. Representing numbers in NLP: a survey and a vision. In *Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pages 644–656. Association for Computational Linguistics, 2021. doi:[10.18653/v1/2021.naacl-main.53](https://doi.org/10.18653/v1/2021.naacl-main.53).
- Spithourakis and Riedel [2018] Georgios P. Spithourakis and Sebastian Riedel. Numeracy for language models: evaluating and improving their ability to predict numbers. In *Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 2104–2115. Association for Computational Linguistics, 2018. doi:[10.18653/v1/P18-1196](https://doi.org/10.18653/v1/P18-1196).
- Wallace et al. [2019] Eric Wallace, Yizhong Wang, Sujian Li, Sameer Singh, and Matt Gardner. Do NLP models know numbers? Probing numeracy in embeddings. In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)*, pages 5307–5315. Association for Computational Linguistics, 2019. doi:[10.18653/v1/D19-1534](https://doi.org/10.18653/v1/D19-1534).
- Geva et al. [2020] Mor Geva, Ankit Gupta, and Jonathan Berant. Injecting numerical reasoning skills into language models. In *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, pages 946–958. Association for Computational Linguistics, 2020. doi:[10.18653/v1/2020.acl-main.89](https://doi.org/10.18653/v1/2020.acl-main.89).
- Roy et al. [2015] Subhro Roy, Tim Vieira, and Dan Roth. Reasoning about quantities in natural language. *Transactions of the Association for Computational Linguistics*, 3:1–13, 2015. doi:[10.1162/tacl_a_00118](https://doi.org/10.1162/tacl_a_00118).
- Saha et al. [2017] Swarnadeep Saha, Harinder Pal, and Mausam. Bootstrapping for numerical Open IE. In *Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)*, pages 317–323. Association for Computational Linguistics, 2017. doi:[10.18653/v1/P17-2050](https://doi.org/10.18653/v1/P17-2050).
- Wimalasuriya and Dou [2010] Daya C. Wimalasuriya and Dejing Dou. Ontology-based information extraction: an introduction and a survey of current approaches. *Journal of Information Science*, 36(3):306–323, 2010. doi:[10.1177/0165551509360123](https://doi.org/10.1177/0165551509360123).
- Maedche and Staab [2001] Alexander Maedche and Steffen Staab. Ontology learning for the Semantic Web. *IEEE Intelligent Systems*, 16(2):72–79, 2001. doi:[10.1109/5254.920602](https://doi.org/10.1109/5254.920602).
- Cimiano [2006] Philipp Cimiano. *Ontology Learning and Population from Text: Algorithms, Evaluation and Applications*. Springer, New York, NY, 2006. ISBN 978-0-387-30632-2. doi:[10.1007/978-0-387-39252-3](https://doi.org/10.1007/978-0-387-39252-3).
- Niles and Pease [2001] Ian Niles and Adam Pease. Towards a standard upper ontology. In *Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS ’01)*, pages 2–9. ACM, 2001. doi:[10.1145/505168.505170](https://doi.org/10.1145/505168.505170).
- Bender and Koller [2020] Emily M. Bender and Alexander Koller. Climbing towards NLU: on meaning, form, and understanding in the age of data. In *Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics*, pages 5185–5198. Association for Computational Linguistics, 2020. doi:[10.18653/v1/2020.acl-main.463](https://doi.org/10.18653/v1/2020.acl-main.463).
- Lake et al. [2017] Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. *Behavioral and Brain Sciences*, 40:e253, 2017. doi:[10.1017/S0140525X16001837](https://doi.org/10.1017/S0140525X16001837).
- Hitzler et al. [2022] Pascal Hitzler, Aaron Eberhart, Monireh Ebrahimi, Md Kamruzzaman Sarker, and Lu Zhou. Neuro-symbolic approaches in artificial intelligence. *National Science Review*, 9(6):nwac035, 2022. doi:[10.1093/nsr/nwac035](https://doi.org/10.1093/nsr/nwac035).
- Sarker et al. [2021] Md Kamruzzaman Sarker, Lu Zhou, Aaron Eberhart, and Pascal Hitzler. Neuro-symbolic artificial intelligence: current trends. *AI Communications*, 34(3):197–209, 2021. doi:[10.3233/AIC-210084](https://doi.org/10.3233/AIC-210084).
- Cropper and Dumančić [2022] Andrew Cropper and Sebastijan Dumančić. Inductive logic programming at 30: a new introduction. *Journal of Artificial Intelligence Research*, 74:765–850, 2022. doi:[10.1613/jair.1.13507](https://doi.org/10.1613/jair.1.13507).
- Souza et al. [2020] Fábio Souza, Rodrigo Nogueira, and Roberto Lotufo. BERTimbau: pretrained BERT models for Brazilian Portuguese. *Lecture Notes in Computer Science*, 12319:403–417, 2020. doi:[10.1007/978-3-030-61377-8_28](https://doi.org/10.1007/978-3-030-61377-8_28).
- Fonseca et al. [2016] Erick Fonseca, Leandro Santos, Marcelo Criscuolo, and Sandra Aluísio. Visão geral da avaliação de similaridade semântica e inferência textual. *Linguamática*, 8(2):3–13, 2016.
- Real et al. [2020] Livy Real, Erick Fonseca, and Hugo Gonçalo Oliveira. The ASSIN 2 shared task: a quick overview. In *Computational Processing of the Portuguese Language: 14th International Conference, PROPOR 2020*, pages 406–412. Springer, 2020. doi:[10.1007/978-3-030-41505-1_39](https://doi.org/10.1007/978-3-030-41505-1_39).
- Luz de Araujo et al. [2018] Pedro H. Luz de Araujo, Teófilo E. de Campos, Renato R. R. de Oliveira, Matheus Stauffer, Samuel Couto, and Paulo Bermejo. LeNER-Br: a dataset for named entity recognition in Brazilian legal text. In *Computational Processing of the Portuguese Language: 13th International Conference, PROPOR 2018*, pages 313–323. Springer, 2018. doi:[10.1007/978-3-319-99722-3_32](https://doi.org/10.1007/978-3-319-99722-3_32).
- Almeida et al. [2023a] Thales Sales Almeida, Thiago Laitz, Giovana Kerche Bonas, and Rodrigo Nogueira. BLUEX: a benchmark based on Brazilian leading universities entrance examinations. In *Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track*, pages 1–12, 2023a.
- Nunes et al. [2023] Desnes Nunes, Ricardo Primi, Ramon Pires, Roberto Lotufo, and Rodrigo Nogueira. Evaluating GPT-4’s vision capabilities on Brazilian university admission exams, 2023.
- Almeida et al. [2023b] Thales Sales Almeida, Hugo Abonizio, Rodrigo Nogueira, and Ramon Pires. Sabiá: Portuguese large language models. In *Proceedings of the Brazilian Conference on Intelligent Systems (BRACIS)*, pages 226–240. Springer, 2023b. doi:[10.1007/978-3-031-45392-2_15](https://doi.org/10.1007/978-3-031-45392-2_15).
- Bureau International des Poids et Mesures [2019] Bureau International des Poids et Mesures. *The International System of Units (SI)*. BIPM, 9th edition, 2019.
- Unicode Consortium [2024] Unicode Consortium. *The Unicode Standard, Version 16.0*. Unicode Consortium, 2024.
- Orengo and Huyck [2001] Viviane Moreira Orengo and Christian R. Huyck. A stemming algorithm for the Portuguese language. In *Proceedings of the Eighth International Symposium on String Processing and Information Retrieval (SPIRE)*, pages 186–193. IEEE, 2001. doi:[10.1109/SPIRE.2001.989755](https://doi.org/10.1109/SPIRE.2001.989755).
- Dietterich [1998] Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. *Neural Computation*, 10(7):1895–1923, 1998. doi:[10.1162/089976698300017197](https://doi.org/10.1162/089976698300017197).
- Wilson [1927] Edwin B. Wilson. Probable inference, the law of succession, and statistical inference. *Journal of the American Statistical Association*, 22(158):209–212, 1927. doi:[10.1080/01621459.1927.10502953](https://doi.org/10.1080/01621459.1927.10502953).
- Cohen [1988] Jacob Cohen. *Statistical power analysis for the behavioral sciences*. Lawrence Erlbaum Associates, 2nd edition, 1988.
- Holm [1979] Sture Holm. A simple sequentially rejective multiple test procedure. *Scandinavian Journal of Statistics*, 6(2):65–70, 1979.
- Zhu et al. [2018] Minjie Zhu, Frank McKenna, and Michael H. Scott. OpenSeesPy: Python library for the OpenSees finite element framework. *SoftwareX*, 7:6–11, 2018. doi:[10.1016/j.softx.2017.10.009](https://doi.org/10.1016/j.softx.2017.10.009).
- Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In *Proceedings of the International Conference on Learning Representations (ICLR)*, pages 1–27, 2021.
- Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. Language models are few-shot learners. In *Advances in Neural Information Processing Systems*, volume 33, pages 1877–1901, 2020.
- Hitzler et al. [2018] Pascal Hitzler, Adila Krisnadhi, and Krzysztof Janowicz. Towards a simple but useful ontology design pattern representation language. *Lecture Notes in Computer Science*, 11136:2–17, 2018. doi:[10.1007/978-3-030-00671-6_1](https://doi.org/10.1007/978-3-030-00671-6_1).