Number-aware embeddings

Reddit r/LocalLLaMA 05/19/26, 12:34 PM Models

number-awareness embedding tokenizer fine-tuning mlm custom-benchmark

Summary

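A technique to make embedding models aware of number ordering by overriding the tokenizer and prediction head for numbers, then MLM fine-tuning the modified model. On the author's custom benchmark it correctly sorts triplets of sentences 59% of the time, vs. 38% for ModernBERT and 34% for BGE-base-v1.5.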

If you look at the cosine sim between the embeddings of "a 500 hp car", "a 1,200 hp car" and "a 73 hp car", you'll soon see that embedding models have no sense of number ordering at all. (I tested Qwen and ModernBERT-based embeddings.)

It mostly comes from how, during the MLM pre-training phase, the tokenizer and the log-likelihood loss reward exact prediction far more than order-of-magnitude prediction.

I've tried to mitigate this by overriding the default tokenizer and prediction head for numbers, and MLM fine-tuning the modified architecture on 300M tokens (of which ~4M are numbers). And it works.

The idea is to regex number patterns and represent them in log magnitude. Each number then gets smooth-encoded into 128 bins (linear interpolation between adjacent bins), with an embedding dict entry for each of these 128 bins. Decoding works much the same: I've used a classification-regression head, with 128 output bins and smooth CE loss.

Turning the MLM-pre-trained model into an embedding model was the most interesting part. I tried JEPA and it failed, so I went for an encoder/decoder setup, which worked fine.

End result, after 6 H100-hours of training: on my custom benchmarks (this sentence is a complete red flag, isn't it?), it's able to correctly sort triplets of sentences 59% of the time, vs. 38% for ModernBERT (mean-pooling) and 34% for BGE-base-v1.5 (CLS). It's also quite good at extracting structured/quantitative data from number-heavy HTML tables.

The (rather undertrained) model is here: [https://huggingface.co/edereynal/financial_bert](https://huggingface.co/edereynal/financial_bert)

If you're interested in the full engineering, please check the blog post. It's quite dense, technically speaking, but I think it's interesting: [https://www.eloidereynal.com/p/i-spent-1-year-trying-to-predict](https://www.eloidereynal.com/p/i-spent-1-year-trying-to-predict)
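To make the log-magnitude binning described in the post more concrete, here is a minimal pure-Python sketch. It is not the author's code. The regex, the log10 range (`LOG_MIN`, `LOG_MAX`), the function names, and the positive-numbers-only restriction are all assumptions made for illustration; the post only specifies 128 bins, linear interpolation between adjacent bins, and a smooth CE loss.

```python
import math
import re

NUM_BINS = 128
# Assumed log10 range covered by the bins; the post does not state it.
LOG_MIN, LOG_MAX = -6.0, 12.0

# Hypothetical number pattern: digits with optional thousands commas and decimals.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def find_numbers(text):
    """Return the numbers matched in text, ignoring thousands separators."""
    return [float(m.group().replace(",", "")) for m in NUMBER_RE.finditer(text)]


def smooth_bins(value):
    """Spread a positive number over two adjacent bins in log10 space."""
    if value <= 0:
        raise ValueError("this sketch only handles positive numbers")
    log_v = min(max(math.log10(value), LOG_MIN), LOG_MAX)
    pos = (log_v - LOG_MIN) / (LOG_MAX - LOG_MIN) * (NUM_BINS - 1)
    lo = min(int(math.floor(pos)), NUM_BINS - 2)
    frac = pos - lo
    weights = [0.0] * NUM_BINS
    weights[lo] = 1.0 - frac      # weight on the lower bin
    weights[lo + 1] = frac        # weight on the upper bin
    return weights


def bin_position(weights):
    """Expected (fractional) bin index under the given weights."""
    return sum(i * w for i, w in enumerate(weights))


def decode(weights):
    """Map bin weights back to a number via the expected bin index."""
    log_v = LOG_MIN + bin_position(weights) / (NUM_BINS - 1) * (LOG_MAX - LOG_MIN)
    return 10.0 ** log_v


def smooth_cross_entropy(logits, target_weights):
    """Cross-entropy between softmax(logits) and the soft two-bin target."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(w * (x - log_z) for x, w in zip(logits, target_weights) if w > 0)


if __name__ == "__main__":
    for sentence in ["a 73 hp car", "a 500 hp car", "a 1,200 hp car"]:
        value = find_numbers(sentence)[0]
        print(sentence, "->", round(bin_position(smooth_bins(value)), 3))
```

On the input side, the number's embedding would presumably be the weighted sum of the two active bin embeddings, so nearby magnitudes get nearby vectors. On the output side, the soft two-bin target means a prediction that lands one bin off is penalized much less than one that is many orders of magnitude away, which is the property the post says plain token-level log-likelihood lacks.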

Similar Articles

Numbers Already Carry Their Own Embeddings

Speaking Numbers to LLMs: Multi-Wavelet Number Embeddings for Time Series Forecasting

Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models

Mitigating Position Bias in Transformers via Layer-Specific Positional Embedding Scaling

Transformers Learn the Mestre-Nagao Heuristic
