BERTomelo: Your Portuguese Encoder Best Friend

arXiv cs.CL 06/30/26, 04:00 AM Papers

portuguese-encoder monolingual-model modernbert pre-training natural-language-processing ner sts

Summary

This paper introduces BERTomelo, a monolingual Portuguese encoder pre-trained from scratch on the ModernBERT architecture. The authors report that it outperforms previous Portuguese encoders and offers a more efficient alternative to large multilingual models on downstream tasks such as semantic textual similarity (STS) and named entity recognition (NER).
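As a rough illustration of how an encoder is often applied to STS, the sketch below mean-pools token vectors into one sentence vector and scores a sentence pair by cosine similarity. This is an assumption about a common recipe, not the evaluation protocol of the paper, which is not described here. The token vectors are toy values standing in for encoder outputs, and the function names `mean_pool` and `cosine_similarity` are illustrative.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy "encoder outputs" (hypothetical values): the third token of
    # sentence A is padding and is masked out before pooling.
    sent_a = mean_pool([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]], [1, 1, 0])
    sent_b = mean_pool([[2.0, 2.0]], [1])
    print(sent_a, sent_b, cosine_similarity(sent_a, sent_b))
```

In practice the similarity scores would be compared against human-annotated scores, typically with a correlation coefficient, and many systems fine-tune the encoder rather than using raw pooled embeddings.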

arXiv:2606.28999v1. Announce type: new.

Abstract: Encoders have become the state of the art for multiple NLP tasks, especially those requiring deep contextual understanding. While multilingual models offer broad coverage, dedicated monolingual encoders are essential for capturing the unique lexical and syntactic nuances of specific languages. For Portuguese, however, existing monolingual options like BERTimbau and Albertina have not kept pace with recent architectural breakthroughs, often lagging behind English benchmarks in scalability and efficiency. This work introduces BERTomelo, a next-generation monolingual encoder pre-trained from scratch and specifically optimized for the Portuguese language. By leveraging the ModernBERT architecture, BERTomelo overcomes the limitations of previous models, offering Base and Large versions with a 1,024-token context window and hardware-level optimizations like FlashAttention and alternating attention mechanisms. The model was trained on ClassiCC-PT, a massive, high-quality Portuguese corpus of 106 million documents, ensuring superior alignment with the language's contemporary usage. The results demonstrate that BERTomelo not only outperforms previous Portuguese encoders but also provides a more robust and efficient alternative to massive multilingual models in downstream tasks such as STS and NER.
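The "alternating attention" mentioned in the abstract can be pictured as a stack in which some layers attend over the full sequence and the rest attend only within a local sliding window. The sketch below builds the corresponding boolean masks in plain Python. The layer pattern (one global layer out of every three) and the window size are assumptions borrowed from the general ModernBERT design for illustration only; the abstract does not state BERTomelo's actual configuration, and the names `layer_kinds`, `attention_mask`, and `attended_pairs` are hypothetical.

```python
def layer_kinds(num_layers, global_every=3):
    """Label each layer: index 0 and every `global_every`-th layer after it
    are 'global'; all other layers are 'local'. (Assumed pattern.)"""
    return ["global" if i % global_every == 0 else "local"
            for i in range(num_layers)]


def attention_mask(seq_len, kind, window=128):
    """Return mask[i][j] == True when token i may attend to token j.

    'global' layers see every position; 'local' layers see only positions
    within window // 2 tokens on either side (bidirectional, as in an encoder).
    """
    if kind == "global":
        return [[True] * seq_len for _ in range(seq_len)]
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)]
            for i in range(seq_len)]


def attended_pairs(mask):
    """Count the (query, key) pairs that a mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    print(layer_kinds(6))
    # Tiny example: 5 tokens, window of 2 (one neighbour on each side).
    local = attention_mask(5, "local", window=2)
    full = attention_mask(5, "global")
    print(attended_pairs(local), "local pairs vs", attended_pairs(full), "global pairs")
    # prints: 13 local pairs vs 25 global pairs
```

The point of the design is cost: a local layer's work grows roughly linearly with sequence length for a fixed window, while a global layer's grows quadratically, so interleaving them keeps long-range mixing while reducing total attention compute.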

Cached at: 06/30/26, 05:29 AM

# BERTomelo: Your Portuguese Encoder Best Friend
Source: [https://arxiv.org/abs/2606.28999](https://arxiv.org/abs/2606.28999)
[View PDF](https://arxiv.org/pdf/2606.28999)

## Submission history

From: Luís Paulo Faina Garcia [[view email](https://arxiv.org/show-email/542375cc/2606.28999)]

**[v1]** Sat, 27 Jun 2026 16:23:17 UTC (204 KB)

Similar Articles

m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification

Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction

BamiBERT: A New BERT-based Language Model for Vietnamese
