BERTomelo: Your Portuguese Encoder Best Friend
Summary
This paper introduces BERTomelo, a monolingual Portuguese encoder pre-trained from scratch on the ModernBERT architecture. On downstream tasks such as semantic textual similarity (STS) and named entity recognition (NER), it outperforms previous Portuguese encoders and offers a more robust and efficient alternative to large multilingual models.
# BERTomelo: Your Portuguese Encoder Best Friend

Source: [https://arxiv.org/abs/2606.28999](https://arxiv.org/abs/2606.28999) | [View PDF](https://arxiv.org/pdf/2606.28999)

> Abstract: Encoders have become the state of the art for multiple NLP tasks, especially those requiring deep contextual understanding. While multilingual models offer broad coverage, dedicated monolingual encoders are essential for capturing the unique lexical and syntactic nuances of specific languages. For Portuguese, however, existing monolingual options like BERTimbau and Albertina have not kept pace with recent architectural breakthroughs, often lagging behind English benchmarks in scalability and efficiency. This work introduces BERTomelo, a next-generation monolingual encoder pre-trained from scratch and specifically optimized for the Portuguese language. By leveraging the ModernBERT architecture, BERTomelo overcomes the limitations of previous models, offering Base and Large versions with a 1,024-token context window and hardware-level optimizations like FlashAttention and alternating attention mechanisms. The model was trained on ClassiCC-PT, a massive, high-quality Portuguese corpus of 106 million documents, ensuring superior alignment with the language's contemporary usage. The results demonstrate that BERTomelo not only outperforms previous Portuguese encoders but also provides a more robust and efficient alternative to massive multilingual models in downstream tasks such as STS and NER.

## Submission history

From: Luís Paulo Faina Garcia

**[v1]** Sat, 27 Jun 2026 16:23:17 UTC (204 KB)
Similar Articles
m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder
This paper introduces m3BERT, a multilingual bidirectional encoder with a novel pretraining strategy that jointly optimizes representations across transformer layers and multiple embedding dimensions, enabling a single model to be adapted to varied resource constraints. It significantly outperforms state-of-the-art models on the Bing-Click industrial retrieval dataset.
LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification
Researchers release LegalBench-BR, the first public benchmark for evaluating LLMs on Brazilian legal text classification, showing LoRA-fine-tuned BERTimbau dramatically outperforms GPT-4o mini and Claude 3.5 Haiku.
Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese
TOTEN is a knowledge-based ontological tokenization framework that replaces statistical tokenization with declarative classification grounded in a formal ontology of engineering entities, achieving high ontological atomicity and numerical reconstruction for physical quantities and technical notation in Brazilian Portuguese.
UR-BERT: Scaling Text Encoders for Massively Multilingual TTS Through Universal Romanization and Speech Token Prediction
UR-BERT proposes a Romanized transcription-based text encoder for massively multilingual TTS, scaling to 495 languages by using universal Romanization and a speech token prediction objective to enhance phonetic alignment and generalization to unseen languages.
BamiBERT: A New BERT-based Language Model for Vietnamese
BamiBERT is a new BERT-based pre-trained language model for Vietnamese that addresses limitations of PhoBERT, supporting longer context and operating without word segmentation, achieving state-of-the-art results on multiple Vietnamese benchmarks.