BERTomelo: Your Portuguese Encoder Best Friend

arXiv cs.CL Papers

Summary

This paper introduces BERTomelo, a monolingual Portuguese encoder pre-trained from scratch on the ModernBERT architecture. The authors report that it outperforms previous Portuguese encoders and multilingual models on downstream tasks such as semantic textual similarity (STS) and named entity recognition (NER).

arXiv:2606.28999v1. Announce type: new.

Abstract: Encoders have become the state of the art for multiple NLP tasks, especially those requiring deep contextual understanding. While multilingual models offer broad coverage, dedicated monolingual encoders are essential for capturing the unique lexical and syntactic nuances of specific languages. For Portuguese, however, existing monolingual options like BERTimbau and Albertina have not kept pace with recent architectural breakthroughs, often lagging behind English benchmarks in scalability and efficiency. This work introduces BERTomelo, a next-generation monolingual encoder pre-trained from scratch and specifically optimized for the Portuguese language. By leveraging the ModernBERT architecture, BERTomelo overcomes the limitations of previous models, offering Base and Large versions with a 1,024-token context window and hardware-level optimizations like FlashAttention and alternating attention mechanisms. The model was trained on ClassiCC-PT, a massive, high-quality Portuguese corpus of 106 million documents, ensuring superior alignment with the language's contemporary usage. The results demonstrate that BERTomelo not only outperforms previous Portuguese encoders but also provides a more robust and efficient alternative to massive multilingual models in downstream tasks such as STS and NER.
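
The "alternating attention mechanisms" mentioned in the abstract can be illustrated with a minimal sketch. It assumes the scheme described for ModernBERT, in which every third layer attends globally and the remaining layers use a local sliding window of 128 tokens. The abstract does not state BERTomelo's own values, so `window`, `global_every`, and all function names here are illustrative assumptions, not the authors' implementation.

```python
def is_global_layer(layer_idx: int, global_every: int = 3) -> bool:
    """Return True if this layer uses global attention.

    Assumption: global attention every `global_every` layers, starting at 0.
    """
    return layer_idx % global_every == 0


def attention_mask(seq_len: int, layer_idx: int,
                   window: int = 128, global_every: int = 3) -> list[list[bool]]:
    """Build a bidirectional mask; mask[i][j] is True if token i may attend to token j."""
    if is_global_layer(layer_idx, global_every):
        # Global layer: every token sees every other token.
        return [[True] * seq_len for _ in range(seq_len)]
    # Local layer: each token sees window // 2 neighbours on either side.
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]


def attended_pairs(mask: list[list[bool]]) -> int:
    """Count the (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    seq_len = 1024  # context window stated in the abstract
    for layer in range(4):
        kind = "global" if is_global_layer(layer) else "local"
        pairs = attended_pairs(attention_mask(seq_len, layer))
        print(f"layer {layer}: {kind}, {pairs} attended pairs")
```

Local layers allow far fewer query-key pairs than global ones, which is where the efficiency gain at longer context lengths would come from under these assumptions.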

Source: [https://arxiv.org/abs/2606.28999](https://arxiv.org/abs/2606.28999)
[View PDF](https://arxiv.org/pdf/2606.28999)

## Submission history

From: Luís Paulo Faina Garcia

**[v1]** Sat, 27 Jun 2026 16:23:17 UTC (204 KB)

Similar Articles

m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

arXiv cs.CL

This paper introduces m3BERT, a multilingual bidirectional encoder with a novel pretraining strategy that jointly optimizes representations across transformer layers and multiple embedding dimensions, enabling a single model to be adapted to varied resource constraints. It significantly outperforms state-of-the-art models on the Bing-Click industrial retrieval dataset.

BamiBERT: A New BERT-based Language Model for Vietnamese

arXiv cs.CL

BamiBERT is a new BERT-based pre-trained language model for Vietnamese that addresses limitations of PhoBERT, supporting longer context and operating without word segmentation, achieving state-of-the-art results on multiple Vietnamese benchmarks.