MultiHashFormer: Hash-based Generative Language Models

arXiv cs.CL Papers

Summary

MultiHashFormer is a hash-based generative language model that represents each token as a unique hash signature, a short sequence of discrete hash IDs, enabling parameter-efficient autoregression. It outperforms standard Transformer LMs at the 100M, 1B, and 3B parameter scales and supports multilingual vocabulary expansion with a constant parameter footprint.

arXiv:2606.28057v1

Abstract: Language models (LMs) represent tokens using embedding matrices that scale linearly with the vocabulary size. To constrain the parameter footprint, prior work proposes hashing many tokens into a single vector within encoder-only models. While this offers parameter efficiency, many-to-one collisions prevent its use in causal LMs. In this paper, we propose MultiHashFormer, a new framework that allows hash-based autoregression. Each token is represented as a unique hash signature, a short sequence of discrete hash IDs, generated by multiple independent hash functions. A Hash Encoder compresses this signature into a single latent vector for processing by a Transformer decoder. Then, a Hash Decoder generates the hash signature of the next token, which is then mapped back to text. We evaluate our approach at the 100M, 1B and 3B parameter scales, demonstrating that MultiHashFormer consistently outperforms standard Transformer LMs across multiple benchmarks. Furthermore, we show that our model handles multilingual vocabulary expansion with a constant parameter footprint without any modifications.

Cached at: 06/29/26, 05:25 AM

# MultiHashFormer: Hash-based Generative Language Models
Source: [https://arxiv.org/abs/2606.28057](https://arxiv.org/abs/2606.28057)
[View PDF](https://arxiv.org/pdf/2606.28057)

> Abstract: Language models (LMs) represent tokens using embedding matrices that scale linearly with the vocabulary size. To constrain the parameter footprint, prior work proposes hashing many tokens into a single vector within encoder-only models. While this offers parameter efficiency, many-to-one collisions prevent its use in causal LMs. In this paper, we propose MultiHashFormer, a new framework that allows hash-based autoregression. Each token is represented as a unique hash signature, a short sequence of discrete hash IDs, generated by multiple independent hash functions. A Hash Encoder compresses this signature into a single latent vector for processing by a Transformer decoder. Then, a Hash Decoder generates the hash signature of the next token, which is then mapped back to text. We evaluate our approach at the 100M, 1B and 3B parameter scales, demonstrating that MultiHashFormer consistently outperforms standard Transformer LMs across multiple benchmarks. Furthermore, we show that our model handles multilingual vocabulary expansion with a constant parameter footprint without any modifications.
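
The abstract does not spell out how signatures are built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes seeded BLAKE2b digests as the "independent hash functions", and the names `hash_signature`, `NUM_HASHES`, `NUM_BUCKETS`, `D_MODEL`, and the toy vocabulary are all hypothetical. It contrasts a single hash (many-to-one collisions are unavoidable once the vocabulary exceeds the bucket count) with a multi-hash signature, and compares the embedding parameter count of a dense table against one small table per hash function, which is itself an assumed layout.

```python
import hashlib

# All constants below are illustrative assumptions, not values from the paper.
NUM_HASHES = 4      # length of each hash signature
NUM_BUCKETS = 1024  # number of distinct IDs per hash function
D_MODEL = 512       # embedding width used for the parameter comparison


def hash_signature(token, num_hashes=NUM_HASHES, num_buckets=NUM_BUCKETS):
    """Map a token to a tuple of bucket IDs, one per seeded hash function."""
    ids = []
    for seed in range(num_hashes):
        # A different salt per seed gives a differently keyed hash of the token.
        digest = hashlib.blake2b(
            token.encode("utf-8"),
            digest_size=8,
            salt=seed.to_bytes(8, "little"),
        ).digest()
        ids.append(int.from_bytes(digest, "little") % num_buckets)
    return tuple(ids)


# Toy vocabulary standing in for a real tokenizer vocabulary.
vocab = [f"token_{i}" for i in range(50_000)]
signatures = {tok: hash_signature(tok) for tok in vocab}

# A single hash can use at most NUM_BUCKETS IDs, so collisions are guaranteed here.
single_hash_ids = {sig[0] for sig in signatures.values()}
# Full signatures draw from NUM_BUCKETS ** NUM_HASHES possible combinations.
distinct_signatures = set(signatures.values())

# Dense embedding table grows with the vocabulary; hashed tables do not.
dense_params = len(vocab) * D_MODEL
hashed_params = NUM_HASHES * NUM_BUCKETS * D_MODEL

print(f"distinct single-hash IDs: {len(single_hash_ids)} for {len(vocab)} tokens")
print(f"distinct signatures:      {len(distinct_signatures)}")
print(f"dense embedding params:   {dense_params:,}")
print(f"hashed embedding params:  {hashed_params:,}")
```

In this sketch, adding tokens to `vocab` leaves `hashed_params` unchanged, which mirrors the constant-footprint claim in spirit. Random hashing alone does not guarantee that every signature is unique, so a real system would need some mechanism to enforce uniqueness; the abstract does not describe one.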

## Submission history

From: Huiyin Xue [[view email](https://arxiv.org/show-email/3c52616c/2606.28057)] **[v1]** Fri, 26 Jun 2026 13:03:29 UTC (4,031 KB)
