Number-aware embeddings

Reddit r/LocalLLaMA Models

Summary

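A technique to make embedding models aware of number ordering by overriding the tokenizer and prediction head for numbers, then MLM fine-tuning. The resulting model correctly sorts sentence triplets 59% of the time on the author's custom benchmark, vs. 38% for ModernBERT and 34% for BGE-base-v1.5.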

If you look at the cosine sim between the embeddings of "a 500 hp car", "a 1,200 hp car" and "a 73 hp car", you'll soon see that embedding models have no sense of number ordering at all. (I tested Qwen and ModernBERT-based embeddings.)

It mostly comes from how the tokenizer and the log-likelihood loss reward exact prediction over order-of-magnitude prediction during the MLM pre-training phase.

I've tried to mitigate this by overriding the default tokenizer/prediction head for numbers, and MLM fine-tuning the modified architecture on 300M tokens (of which ~4M are numbers). And it works.

The idea is to regex number patterns and represent them in log magnitude. Each number then gets smooth-encoded into 128 bins (linear interpolation between adjacent bins), with an embedding dict entry for each of these 128 bins. Decoding works much the same: I've used a classification-regression head, with 128 output bins and smooth CE loss.

Making the MLM-pre-trained model into an embedding model was the most interesting part. I've tried JEPA and it failed, so I went for an encoder/decoder setup, which worked fine.

End result, after 6 H100-hours of training: on my custom benchmarks (this sentence is a complete red flag, isn't it?), it's able to correctly sort triplets of sentences 59% of the time, vs. 38% for ModernBERT (mean-pooling) and 34% for BGE-base-v1.5 (CLS). It's also quite good at extracting structured/quantitative data from number-heavy HTML tables.

The (rather undertrained) model is here: [https://huggingface.co/edereynal/financial_bert](https://huggingface.co/edereynal/financial_bert)

If you're interested in the full engineering, please check the blog post. It's quite dense, technically speaking, but I think it's interesting: [https://www.eloidereynal.com/p/i-spent-1-year-trying-to-predict](https://www.eloidereynal.com/p/i-spent-1-year-trying-to-predict)
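
For readers who want something concrete, the log-magnitude binning described in the post could look roughly like the sketch below. It is not the author's code. The regex, the bin range (`LOG_MIN`, `LOG_MAX`), the function names, and the reading of "smooth CE" as cross-entropy against the two-bin soft target are all assumptions made for illustration. Only the 128 bins and the linear interpolation between adjacent bins come from the post. Sign and zero handling are left out.

```python
import math
import re

NUM_BINS = 128                    # from the post
LOG_MIN, LOG_MAX = -6.0, 12.0     # assumed log10 range; not stated in the post

# Hypothetical pattern: digits with optional thousands separators and decimals.
NUMBER_RE = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def find_numbers(text):
    """Return the numeric values found in text, in order of appearance."""
    return [float(m.group().replace(",", "")) for m in NUMBER_RE.finditer(text)]


def soft_bins(value):
    """Spread a positive number over two adjacent log-magnitude bins."""
    if value <= 0:
        raise ValueError("this sketch only handles positive numbers")
    # Clamp the log magnitude, then map it to a fractional bin position.
    log_mag = min(max(math.log10(value), LOG_MIN), LOG_MAX)
    pos = (log_mag - LOG_MIN) / (LOG_MAX - LOG_MIN) * (NUM_BINS - 1)
    lo = min(int(pos), NUM_BINS - 2)
    frac = pos - lo
    weights = [0.0] * NUM_BINS
    weights[lo] = 1.0 - frac      # linear interpolation between neighbours
    weights[lo + 1] = frac
    return weights


def decode(weights):
    """Map a distribution over bins back to a number via its expected position."""
    pos = sum(i * w for i, w in enumerate(weights))
    return 10 ** (LOG_MIN + pos / (NUM_BINS - 1) * (LOG_MAX - LOG_MIN))


def soft_cross_entropy(logits, target):
    """Cross-entropy between predicted logits and a soft target (assumed loss)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target))
```

With this encoding, "73", "500" and "1,200" land at increasing bin positions, so a prediction that is off by one bin costs less than one that is off by an order of magnitude. On the input side, the same two weights could mix the two corresponding rows of the 128-entry embedding table.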

Similar Articles

Numbers Already Carry Their Own Embeddings

arXiv cs.LG

Introduces Adelic operation-preserved embeddings (AOE), a training-free representation that encodes numbers by combining real value with p-adic expansions, preserving additive and multiplicative structure. Achieves perfect accuracy on the Weaving Pattern benchmark.

Transformers Learn the Mestre-Nagao Heuristic

arXiv cs.LG

This paper trains a two-layer transformer encoder to classify rational elliptic curves by rank from Frobenius traces, achieving >99% accuracy. Mechanistic interpretability reveals the model learns the Mestre-Nagao heuristic and concentrates attention on prime positions, demonstrating that transformers can learn number-theoretic algorithms.