BamiBERT: A New BERT-based Language Model for Vietnamese

arXiv cs.CL Papers

Summary

BamiBERT is a new BERT-based pre-trained language model for Vietnamese that addresses limitations of PhoBERT, supporting longer context and operating without word segmentation, achieving state-of-the-art results on multiple Vietnamese benchmarks.

arXiv:2607.02259v1 Announce Type: new Abstract: In this paper, we introduce BamiBERT, a new BERT-based pre-trained language model for Vietnamese that addresses key limitations of PhoBERT -- the current de facto Vietnamese text encoder. Trained from scratch on a 129GB corpus of general-domain Vietnamese text for 20 epochs, BamiBERT supports an extended context length of up to 2048 tokens and operates directly on raw input, eliminating the need for external word segmentation. Across 8 Vietnamese benchmarks, it achieves the best score on 11 of 15 metrics and the second-best on 3 others, setting a new state of the art among "base"-sized Vietnamese encoders and demonstrating strong cross-domain generalization. We release BamiBERT at: https://huggingface.co/Qualcomm-AI-Research/BamiBERT

# BamiBERT: A New BERT-based Language Model for Vietnamese
Source: [https://arxiv.org/html/2607.02259](https://arxiv.org/html/2607.02259)
Dat Quoc Nguyen¹, Thinh Pham², Chi Tran¹, Linh The Nguyen¹

¹Qualcomm AI Research, ²Virginia Tech

\{datnq, chitran, linhnt\}@qti\.qualcomm\.com, thinhphp@vt\.edu

Qualcomm Vietnam Company Limited\. Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc\. This work was completed while all authors were at Movian AI, Vietnam\. All datasets and models were downloaded, trained, and evaluated using Movian AI’s resources\.

###### Abstract

In this paper, we introduce BamiBERT, a new BERT\-based pre\-trained language model for Vietnamese that addresses key limitations of PhoBERT—the current de facto Vietnamese text encoder\. Trained from scratch on a 129GB corpus of general\-domain Vietnamese text for 20 epochs, BamiBERT supports an extended context length of up to 2048 tokens and operates directly on raw input, eliminating the need for external word segmentation\. Across 8 Vietnamese benchmarks, it achieves the best score on 11 of 15 metrics and the second\-best on 3 others, setting a new state of the art among "base"\-sized Vietnamese encoders and demonstrating strong cross\-domain generalization\.

## 1 Introduction

In today’s LLM\-driven era, BERT\-based models \(Devlin et al\., [2019](https://arxiv.org/html/2607.02259#bib.bib5); Liu et al\., [2019](https://arxiv.org/html/2607.02259#bib.bib10)\) remain essential for tasks that demand high precision and low latency, such as span labeling, classification, and information retrieval\. Their lightweight nature makes them particularly well\-suited for resource\-constrained applications\. Rather than competing with large language models \(LLMs\), BERT\-based models often serve as core components in hybrid systems, delivering strong performance at a fraction of the computational cost while effectively complementing LLMs \(Fan et al\., [2024](https://arxiv.org/html/2607.02259#bib.bib7)\)\. As a result, the BERT family continues to evolve, with recent additions such as ModernBERT \(Warner et al\., [2025](https://arxiv.org/html/2607.02259#bib.bib25)\) and NeoBERT \(Breton et al\., [2025](https://arxiv.org/html/2607.02259#bib.bib1)\)\.

While English benefits from a rich ecosystem of pre\-trained BERT\-based models, the development of Vietnamese counterparts remains comparatively limited\. The multilingual model XLM\-RoBERTa \(Conneau et al\., [2019](https://arxiv.org/html/2607.02259#bib.bib4)\) achieves competitive performance on a wide range of Vietnamese NLP tasks by leveraging 138GB of CC100 Vietnamese text in its multilingual pre\-training corpus\. PhoBERT \(Nguyen and Nguyen, [2020](https://arxiv.org/html/2607.02259#bib.bib14)\) was the first large\-scale monolingual BERT model pre\-trained from scratch specifically for Vietnamese, using 20GB of text\. Subsequent general\-domain monolingual models include viBERT and vELECTRA \(Bui et al\., [2020](https://arxiv.org/html/2607.02259#bib.bib2)\), pre\-trained on 60GB of Vietnamese text, and ViDeBERTa \(Tran et al\., [2023](https://arxiv.org/html/2607.02259#bib.bib19)\), which uses the same 138GB CC100 Vietnamese corpus as XLM\-RoBERTa\. More recently, CafeBERT \(Do et al\., [2024](https://arxiv.org/html/2607.02259#bib.bib6)\) was continually pre\-trained from the XLM\-RoBERTa “large” model on 18GB of Vietnamese text collected prior to 2021\. Several domain\-specific monolingual models have also emerged: VnLawBERT \(Chau et al\., [2020](https://arxiv.org/html/2607.02259#bib.bib3)\) for legal text, ViHealthBERT \(Minh et al\., [2022](https://arxiv.org/html/2607.02259#bib.bib13)\) and ViPubmedDeBERTa \(Tran\-Tien et al\., [2023](https://arxiv.org/html/2607.02259#bib.bib20)\) for health and biomedical text, and ViSoBERT \(Nguyen et al\., [2023b](https://arxiv.org/html/2607.02259#bib.bib18)\) for social media text\.

Among these monolingual models, PhoBERT has become the default choice for many Vietnamese NLP tasks thanks to its strong and consistent performance\. Since its release, it has gained widespread adoption, with over 200K monthly downloads on HuggingFace and active use across the NLP community, whereas all other Vietnamese monolingual models receive fewer than 5K monthly downloads each\. Despite its popularity, PhoBERT has several limitations: it supports only a short maximum context length of 256 subword tokens, and it requires input text to be word\-segmented by an external tool prior to processing\. These limitations motivate the development of a new Vietnamese BERT\-based model that supports a longer context and operates directly on raw text\.

In this paper, we introduce BamiBERT, a new pre\-trained language model for Vietnamese, trained from scratch on a large corpus of 129GB of uncompressed text for 20 epochs, with an extended maximum context length of 2048 tokens\. Unlike PhoBERT, which requires Vietnamese text to be pre\-segmented by an external word segmenter, BamiBERT operates directly on raw input text, making it more flexible and easier to integrate into a wider range of downstream applications\. Experimental results on 8 Vietnamese benchmark datasets show that BamiBERT delivers state\-of\-the\-art or near\-state\-of\-the\-art performance \(ranking \#1 on 11/15 metrics, \#2 on 3/15 metrics, and \#3 on the remaining one\), demonstrating strong cross\-domain generalization\. We release BamiBERT at: [https://huggingface\.co/Qualcomm\-AI\-Research/BamiBERT](https://huggingface.co/Qualcomm-AI-Research/BamiBERT)\.

## 2 Pre\-trained language model BamiBERT

This section presents how we pre\-train our new BERT\-based language model from scratch\.

#### Architecture:

We pre\-train a text encoder, named "BamiBERT" \("Bami" denotes "bánh mì", a popular type of sandwich in Vietnam\), from scratch, employing BERT’s "base" architecture with 12 Transformer block layers \(Devlin et al\., [2019](https://arxiv.org/html/2607.02259#bib.bib5)\)\. To pre\-train BamiBERT, we use the masked language modeling objective \(Devlin et al\., [2019](https://arxiv.org/html/2607.02259#bib.bib5)\) and the RoBERTa pre\-training approach \(Liu et al\., [2019](https://arxiv.org/html/2607.02259#bib.bib10)\), which optimizes BERT with a dynamic masking strategy and without the next sentence prediction objective\. For tokenization, we extend PhoGPT’s Vietnamese\-specific byte\-level BPE tokenizer \(Nguyen et al\., [2023a](https://arxiv.org/html/2607.02259#bib.bib16)\) with an additional "<mask\>" token, resulting in a final vocabulary of 20481 token types\. We set a maximum sequence length of 2048 tokens\.
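As a rough illustration of the dynamic masking idea mentioned above, the sketch below re-samples the mask pattern every time a sequence is served, so each pass over the same text sees different masked positions. The 15% masking rate and the 80/10/10 replacement rule are the standard BERT settings and are assumptions here, as is the helper name `dynamic_mask`; the paper does not state these values. Whitespace tokens stand in for the byte\-level BPE tokens purely to keep the example short.

```python
import random

MASK_TOKEN = "<mask>"


def dynamic_mask(tokens, vocab, mask_prob=0.15, rng=random):
    """Return (masked_tokens, labels) using a freshly sampled mask pattern.

    mask_prob and the 80/10/10 split are assumed BERT defaults,
    not values reported for BamiBERT.
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)         # 80%: replace with <mask>
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)  # position is ignored by the loss
            masked.append(tok)
    return masked, labels


# Toy whitespace "tokens"; the real model operates on byte-level BPE units.
tokens = "tôi thích ăn bánh mì vào buổi sáng".split()
vocab = sorted(set(tokens))
rng = random.Random(0)

# Two calls on the same sequence draw two independent mask patterns.
view_a = dynamic_mask(tokens, vocab, rng=rng)
view_b = dynamic_mask(tokens, vocab, rng=rng)
```

With static masking, by contrast, the mask pattern would be fixed once during preprocessing and reused in every epoch.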

#### Pre\-training dataset:

We use a clean, 129 GB dataset of uncompressed, general\-domain text\.

#### Optimization:

The model is optimized using Adam\(Kingma and Ba,[2015](https://arxiv.org/html/2607.02259#bib.bib9)\)\. We use a batch size of 1024 sequence blocks distributed across 8 A100 GPUs \(each with 40GB of memory\) and a peak learning rate of 0\.00015\. The pre\-training process runs for 20 epochs, with the initial 2 epochs dedicated to warming up the learning rate\.
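The schedule described above can be sketched as a simple function of the training step. Only the peak learning rate, the 20 total epochs, and the 2 warm\-up epochs come from the text; the linear warm\-up shape, the linear decay to zero afterwards, and the `steps_per_epoch` value used in the example are assumptions made for illustration.

```python
PEAK_LR = 1.5e-4     # peak learning rate from the paper (0.00015)
TOTAL_EPOCHS = 20    # from the paper
WARMUP_EPOCHS = 2    # from the paper


def learning_rate(step, steps_per_epoch):
    """Learning rate at a given optimizer step.

    Assumes linear warm-up followed by linear decay to zero; the paper
    does not specify the shape of either phase.
    """
    warmup_steps = WARMUP_EPOCHS * steps_per_epoch
    total_steps = TOTAL_EPOCHS * steps_per_epoch
    if step < warmup_steps:
        return PEAK_LR * (step / warmup_steps)
    remaining = max(total_steps - step, 0)
    return PEAK_LR * (remaining / (total_steps - warmup_steps))


# Hypothetical number of optimizer steps per epoch, chosen for readability.
steps_per_epoch = 1000
schedule = [learning_rate(e * steps_per_epoch, steps_per_epoch)
            for e in range(TOTAL_EPOCHS + 1)]
```

Under these assumptions the rate climbs from zero to the peak over the first 10% of training and then falls back to zero by the end of epoch 20.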

## 3 Experiments

### 3\.1 Setup

Table 1: Statistics of 8 experimental datasets\.

Table 2: Results of pre\-trained "base"\-architecture models\. † denotes results extracted from previous works\.

We conduct experiments to compare our model BamiBERT with previous strong, publicly available pre\-trained "base"\-architecture models for Vietnamese: the Vietnamese\-specific models ViDeBERTa\-base \(Tran et al\., [2023](https://arxiv.org/html/2607.02259#bib.bib19)\), ViSoBERT \(Nguyen et al\., [2023b](https://arxiv.org/html/2607.02259#bib.bib18)\) and PhoBERT\-base \(Nguyen and Nguyen, [2020](https://arxiv.org/html/2607.02259#bib.bib14)\), as well as the multilingual XLM\-RoBERTa\-base \(Conneau et al\., [2019](https://arxiv.org/html/2607.02259#bib.bib4)\)\. \(ViDeBERTa, ViSoBERT and XLM\-RoBERTa were trained using a maximum sequence length of 512 tokens\.\) Here, BamiBERT, ViSoBERT and XLM\-RoBERTa take raw texts as input, while ViDeBERTa and PhoBERT are Vietnamese word\-level models\. That is, a Vietnamese word segmentation tool must be applied to produce word\-segmented texts before feeding them to the word\-level ViDeBERTa and PhoBERT\. For ViDeBERTa and PhoBERT experiments, we utilize the RDRSegmenter component \(Nguyen et al\., [2018a](https://arxiv.org/html/2607.02259#bib.bib15)\) from the VnCoreNLP toolkit \(Vu et al\., [2018](https://arxiv.org/html/2607.02259#bib.bib24)\) for Vietnamese word segmentation\.
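To make the raw\-text versus word\-level distinction concrete, here is a small illustration. It assumes the VnCoreNLP output convention in which the syllables of a multi\-syllable word are joined with underscores. The word boundaries below are specified by hand and `to_word_level` is a hypothetical helper; in the experiments this step is performed by RDRSegmenter.

```python
def to_word_level(words):
    """Join the syllables of each word with '_' (assumed VnCoreNLP-style output)."""
    return " ".join("_".join(syllables) for syllables in words)


# Raw input, as consumed directly by BamiBERT, ViSoBERT and XLM-RoBERTa.
raw_text = "Chúng tôi là những nghiên cứu viên ."

# Hand-specified segmentation standing in for a segmenter's output.
segmented = [["Chúng", "tôi"], ["là"], ["những"], ["nghiên", "cứu", "viên"], ["."]]

# Word-segmented input, as expected by word-level models such as PhoBERT.
word_level_text = to_word_level(segmented)
print(word_level_text)  # Chúng_tôi là những nghiên_cứu_viên .
```

A model that reads `raw_text` directly avoids both the extra preprocessing step and any errors the segmenter might introduce.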

We employ the following experimental benchmark datasets: ViNLI, a Vietnamese dataset for open\-domain natural language inference \(Huynh et al\., [2022](https://arxiv.org/html/2607.02259#bib.bib8)\); PhoNER\_COVID19, a dataset for recognizing COVID\-19 related named entities in Vietnamese \(Truong et al\., [2021](https://arxiv.org/html/2607.02259#bib.bib21)\); UIT\-VSFC \(Sentiment\) and UIT\-VSFC \(Topic\), Vietnamese students’ feedback benchmarks for sentiment\-based and topic\-based classification \(Nguyen et al\., [2018b](https://arxiv.org/html/2607.02259#bib.bib17)\); ViSpamReviews, a dataset for spam review detection on Vietnamese e\-commerce websites \(Van Dinh et al\., [2022](https://arxiv.org/html/2607.02259#bib.bib22)\); UIT\-ViSFD, a Vietnamese aspect\-based sentiment analysis dataset of feedback and comments for smartphone e\-commerce \(Luc Phan et al\., [2021](https://arxiv.org/html/2607.02259#bib.bib12)\); and UIT\-ABSA \(Hotel\) and UIT\-ABSA \(Restaurant\), Vietnamese aspect\-based sentiment analysis datasets for the hotel and restaurant domains \(Van Thin et al\., [2021](https://arxiv.org/html/2607.02259#bib.bib23)\)\. ViNLI and PhoNER\_COVID19 are based on general\-domain texts, whereas the remaining benchmarks are derived from social media and forum discussions\. See Table [1](https://arxiv.org/html/2607.02259#S3.T1) for the statistics of these datasets\.

For all experimental models, we employ `transformers` \(Wolf et al\., [2020](https://arxiv.org/html/2607.02259#bib.bib26)\) to fine\-tune them using the AdamW optimizer \(Loshchilov and Hutter, [2019](https://arxiv.org/html/2607.02259#bib.bib11)\) and set the batch size to 32\. We also perform a grid search on the validation set to select the initial learning rate for AdamW from \{1e\-5, 2e\-5, 5e\-5\}\. We train for 30 epochs on the training set, compute F1 on the validation set after each training epoch, and select the model checkpoint with the best F1 to report final metric scores on the test set\.
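The selection procedure above amounts to an arg\-max over learning rates and epochs on validation F1. The following is a minimal sketch of that logic only, not the authors' training script: `select_checkpoint` is a hypothetical helper, and the validation curves are made\-up placeholder values (three epochs instead of 30), not results from the paper.

```python
def select_checkpoint(val_f1_by_lr):
    """Return the (learning_rate, epoch, f1) triple with the highest validation F1.

    val_f1_by_lr maps each candidate learning rate to a list of validation
    F1 scores, one per training epoch (epochs are numbered from 1).
    """
    best = None
    for lr, f1_per_epoch in val_f1_by_lr.items():
        for epoch, f1 in enumerate(f1_per_epoch, start=1):
            if best is None or f1 > best[2]:
                best = (lr, epoch, f1)
    return best


# Placeholder validation F1 curves for the three learning rates in the grid.
val_f1_by_lr = {
    1e-5: [70.1, 72.4, 72.0],
    2e-5: [71.3, 73.8, 73.5],
    5e-5: [69.0, 71.2, 70.8],
}

best_lr, best_epoch, best_f1 = select_checkpoint(val_f1_by_lr)
# Only the checkpoint saved at (best_lr, best_epoch) is evaluated on the test set.
```

Because the test set plays no part in the selection, the reported test scores are not tuned on the data they are measured on.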

### 3\.2 Main Results

Table [2](https://arxiv.org/html/2607.02259#S3.T2) reports the performance of BamiBERT and four baselines \(ViDeBERTa, ViSoBERT, XLM\-RoBERTa, and PhoBERT\) across eight Vietnamese benchmarks\. BamiBERT achieves the best performance on 11 of the 15 evaluation metrics, ranks second on three metrics, and ranks third on the remaining metric, establishing a new state of the art for "base"\-sized Vietnamese BERT\-based language models\.
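The rank tally can be checked against the per\-dataset descriptions in the rest of this section. The sketch below hard\-codes BamiBERT's rank on each of the 15 metrics as read from that prose and counts them; the metric labels are shorthand chosen here for illustration, not column names from Table 2.

```python
from collections import Counter

# BamiBERT's rank per (benchmark, metric), as described in the per-dataset paragraphs.
bamibert_ranks = {
    ("ViNLI", "Accuracy"): 1,
    ("ViNLI", "F1"): 1,
    ("PhoNER_COVID19", "F1"): 1,
    ("UIT-VSFC (Sentiment)", "Accuracy"): 2,   # trails PhoBERT by 0.24
    ("UIT-VSFC (Sentiment)", "F1"): 1,
    ("UIT-VSFC (Topic)", "Accuracy"): 1,
    ("UIT-VSFC (Topic)", "F1"): 1,
    ("ViSpamReviews", "Accuracy"): 2,          # trails ViSoBERT by 0.23
    ("ViSpamReviews", "F1"): 2,                # trails ViSoBERT by 0.86
    ("UIT-ViSFD", "Aspect detection F1"): 1,
    ("UIT-ViSFD", "Sentiment classification F1"): 1,
    ("UIT-ABSA (Hotel)", "Aspect detection F1"): 1,
    ("UIT-ABSA (Hotel)", "Sentiment classification F1"): 3,  # behind ViSoBERT and PhoBERT
    ("UIT-ABSA (Restaurant)", "Aspect detection F1"): 1,
    ("UIT-ABSA (Restaurant)", "Sentiment classification F1"): 1,
}

tally = Counter(bamibert_ranks.values())
print(len(bamibert_ranks), tally[1], tally[2], tally[3])  # 15 11 3 1
```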

#### ViNLI

BamiBERT achieves the best performance on both metrics \(81\.01 Accuracy and 81\.15 F1\), yielding substantial absolute gains of \+3\.01 Accuracy and \+3\.10 F1 over the second\-ranked PhoBERT \(78\.00/78\.05\)\. XLM\-RoBERTa follows in third place \(76\.83/77\.01\), trailing PhoBERT by roughly one point on both metrics\. ViSoBERT \(67\.70/67\.82\) and ViDeBERTa \(61\.08/60\.71\) lag considerably behind, with gaps of 13–20 points relative to BamiBERT\.

#### PhoNER\_COVID19

BamiBERT obtains the highest F1 score \(94\.90\), outperforming ViDeBERTa by 0\.40 points and PhoBERT by 0\.70 points\. Its advantage widens against the remaining baselines, reaching \+2\.00 F1 over ViSoBERT \(92\.90\) and \+2\.40 F1 over XLM\-RoBERTa \(92\.50\)\.

#### UIT\-VSFC \(Sentiment\)

With 93\.86 Accuracy and 83\.41 F1, BamiBERT ranks among the top\-performing models\. It is essentially on par with PhoBERT, trailing by 0\.24 Accuracy but leading by 0\.14 F1\. BamiBERT maintains consistent margins over XLM\-RoBERTa \(\+0\.30 Accuracy, \+1\.21 F1\) and ViSoBERT \(\+0\.71 Accuracy, \+1\.92 F1\), while ViDeBERTa underperforms substantially on both metrics\.

#### UIT\-VSFC \(Topic\)

BamiBERT again leads on both metrics \(89\.34 Accuracy and 79\.90 F1\), outperforming PhoBERT by \+0\.10 on both, XLM\-RoBERTa by \+0\.16/\+0\.34, and ViSoBERT by \+0\.54/\+0\.04\. Although the top four models are tightly clustered \(within 0\.54 Accuracy and 0\.34 F1\), BamiBERT remains the most consistent\. ViDeBERTa, in contrast, trails markedly \(83\.94/66\.49\)\.

#### ViSpamReviews

BamiBERT ranks second overall \(90\.76 Accuracy and 78\.20 F1\), narrowly behind ViSoBERT \(90\.99/79\.06\) by 0\.23 Accuracy and 0\.86 F1\. Nevertheless, it clearly outperforms XLM\-RoBERTa \(\+0\.60 Accuracy, \+1\.65 F1\) and PhoBERT \(\+0\.93 Accuracy, \+2\.02 F1\), while ViDeBERTa lags well behind \(86\.21/67\.04\)\. BamiBERT remains highly competitive at the top tier and continues to surpass other established baselines by clear margins\.

#### UIT\-ViSFD

BamiBERT delivers the strongest performance on both subtasks\. For aspect detection, it attains 89\.14 F1, ahead of ViSoBERT \(88\.63; −0\.51\) and PhoBERT \(86\.03; −3\.11\), with XLM\-RoBERTa \(82\.73\) and ViDeBERTa \(75\.53\) trailing further\. For aspect\-based sentiment classification, BamiBERT achieves 84\.24 F1, again surpassing ViSoBERT \(83\.55; −0\.69\) and PhoBERT \(78\.76; −5\.48\)\. The consistent gains over PhoBERT \(\+3\.11 and \+5\.48 points\) underscore BamiBERT’s robustness across both subtasks\.

#### UIT\-ABSA \(Hotel\)

On aspect detection, BamiBERT obtains the highest F1 \(79\.99\), outperforming ViSoBERT \(79\.41; −0\.58\) and PhoBERT \(79\.16; −0\.83\), while XLM\-RoBERTa \(77\.70\) and ViDeBERTa \(72\.05\) remain less competitive\. On aspect\-based sentiment classification, however, ViSoBERT takes the lead \(74\.24 F1\), followed by PhoBERT \(73\.73; −0\.51\) and BamiBERT \(72\.65; −1\.59\); XLM\-RoBERTa \(71\.23\) and ViDeBERTa \(62\.97\) trail by a wide margin\.

#### UIT\-ABSA \(Restaurant\)

BamiBERT produces the strongest end\-to\-end performance, with 88\.01 F1 on aspect detection and 74\.89 F1 on aspect\-based sentiment classification\. These results exceed those of ViSoBERT \(86\.86/−1\.15 and 73\.87/−1\.02\) and PhoBERT \(86\.53/−1\.48 and 73\.52/−1\.37\), while XLM\-RoBERTa \(82\.18/71\.58\) and ViDeBERTa \(73\.56/63\.78\) lag considerably behind\.

## 4 Discussion

#### Overall performance

Across 8 Vietnamese benchmarks and different subtasks \(Table [2](https://arxiv.org/html/2607.02259#S3.T2)\), BamiBERT consistently delivers SOTA or near\-SOTA results\. The most substantial improvement appears on ViNLI, where BamiBERT surpasses the next\-best PhoBERT by \+3\.01 Accuracy and \+3\.10 F1, while outpacing ViSoBERT/ViDeBERTa by 13–20 points, demonstrating strong sentence\-pair semantics and cue\-word sensitivity\.

#### Domain effects

Note that PhoNER\_COVID19 and ViNLI represent general\-domain text, while the remaining benchmarks reflect social media and forum content\. BamiBERT exhibits strong cross\-domain generalization, outperforming or closely matching the social\-media\-focused ViSoBERT across both general\-domain \(e\.g\., ViNLI, PhoNER\_COVID19\) and social\-domain benchmarks\. Its consistent top\-tier performance across diverse tasks—NER, span detection, and classification—demonstrates resilience to domain shift and label granularity\. This robustness positions BamiBERT as a reliable choice for NLP pipelines operating under domain heterogeneity and distributional uncertainty\.

#### Detection vs\. classification

In aspect\-based sentiment analysis pipelines, BamiBERT often excels in *detection* \(e\.g\., Hotel: 79\.99 F1; Restaurant: 88\.01 F1\), while its *classification* performance is strongest in the Restaurant domain \(74\.89 F1\) but trails ViSoBERT/PhoBERT in the Hotel domain \(72\.65 F1\)\. This pattern suggests complementary strengths: BamiBERT appears particularly effective at span/target localization and boundary\-sensitive cues, whereas domain\-specific sentiment nuances in the Hotel domain may benefit more from domain\-adapted pretraining \(ViSoBERT\)\. Combined with the new SOTA results on UIT\-ViSFD, this indicates BamiBERT’s robust end\-to\-end capability for fine\-grained social content analysis\.

#### Takeaway

BamiBERT delivers SOTA or near\-SOTA performance across diverse Vietnamese benchmarks, excelling in NLI, topic classification, and fine\-grained sentiment tasks\. Its cross\-domain stability and strong results on both detection and classification make it a robust default for real\-world Vietnamese NLP applications\.

## 5 Conclusion

In this paper, we have presented BamiBERT, a new pre\-trained language model for Vietnamese designed to address key limitations of existing monolingual encoders\. Unlike PhoBERT, the current de facto choice of Vietnamese text encoder, BamiBERT is trained from scratch on a 129 GB corpus of general\-domain text for 20 epochs, supports an extended maximum context length of 2048 tokens, and operates directly on raw text, removing the dependency on external word segmentation\. Experiments on 8 Vietnamese benchmark datasets show that BamiBERT ranks first on 11 of the 15 evaluation metrics and outperforms PhoBERT on 13 of them, establishing a new state of the art among “base”\-sized Vietnamese encoders with strong cross\-domain generalization\.

## References

- Breton et al\. \(2025\) Lola Le Breton, Quentin Fournier, Mariam El Mezouar, John X\. Morris, and Sarath Chandar\. 2025\. NeoBERT: A Next\-Generation BERT\.
- Bui et al\. \(2020\) The Viet Bui, Thi Oanh Tran, and Phuong Le\-Hong\. 2020\. Improving Sequence Tagging for Vietnamese Text using Transformer\-based Neural Models\. In *Proceedings of PACLIC*, pages 13–20\.
- Chau et al\. \(2020\) Chieu\-Nguyen Chau, Truong\-Son Nguyen, and Le\-Minh Nguyen\. 2020\. VNLawBERT: A Vietnamese Legal Answer Selection Approach Using BERT Language Model\. In *Proceedings of NICS*, pages 298–301\.
- Conneau et al\. \(2019\) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov\. 2019\. Unsupervised Cross\-lingual Representation Learning at Scale\. *arXiv preprint*, arXiv:1911\.02116\.
- Devlin et al\. \(2019\) Jacob Devlin, Ming\-Wei Chang, Kenton Lee, and Kristina Toutanova\. 2019\. BERT: Pre\-training of Deep Bidirectional Transformers for Language Understanding\. In *Proceedings of NAACL*, pages 4171–4186\.
- Do et al\. \(2024\) Phong Nguyen\-Thuan Do, Son Quoc Tran, Phu Gia Hoang, Kiet Van Nguyen, and Ngan Luu\-Thuy Nguyen\. 2024\. VLUE: A New Benchmark and Multi\-task Knowledge Transfer Learning for Vietnamese Natural Language Understanding\. In *Findings of NAACL*, pages 211–222\.
- Fan et al\. \(2024\) Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat\-Seng Chua, and Qing Li\. 2024\. A Survey on RAG Meeting LLMs: Towards Retrieval\-Augmented Large Language Models\. In *Proceedings of KDD*, pages 6491–6501\.
- Huynh et al\. \(2022\) Tin Van Huynh, Kiet Van Nguyen, and Ngan Luu\-Thuy Nguyen\. 2022\. ViNLI: A Vietnamese Corpus for Studies on Open\-Domain Natural Language Inference\. In *Proceedings of COLING*, pages 3858–3872\.
- Kingma and Ba \(2015\) Diederik P\. Kingma and Jimmy Ba\. 2015\. Adam: A Method for Stochastic Optimization\. In *Proceedings of ICLR*\.
- Liu et al\. \(2019\) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov\. 2019\. RoBERTa: A Robustly Optimized BERT Pretraining Approach\. *arXiv preprint*, arXiv:1907\.11692\.
- Loshchilov and Hutter \(2019\) Ilya Loshchilov and Frank Hutter\. 2019\. [Decoupled weight decay regularization](https://openreview.net/forum?id=Bkg6RiCqY7)\. In *Proceedings of ICLR*\.
- Luc Phan et al\. \(2021\) Luong Luc Phan, Phuc Huynh Pham, Kim Thi\-Thanh Nguyen, Sieu Khai Huynh, Tham Thi Nguyen, Luan Thanh Nguyen, Tin Van Huynh, and Kiet Van Nguyen\. 2021\. SA2SL: From Aspect\-Based Sentiment Analysis to Social Listening System for Business Intelligence\. In *Proceedings of KSEM*, pages 647–658\.
- Minh et al\. \(2022\) Nguyen Minh, Vu Hoang Tran, Vu Hoang, Huy Duc Ta, Trung Huu Bui, and Steven Quoc Hung Truong\. 2022\. ViHealthBERT: Pre\-trained Language Models for Vietnamese in Health Text Mining\. In *Proceedings of LREC*, pages 328–337\.
- Nguyen and Nguyen \(2020\) Dat Quoc Nguyen and Anh Tuan Nguyen\. 2020\. PhoBERT: Pre\-trained language models for Vietnamese\. In *Findings of EMNLP 2020*, pages 1037–1042\.
- Nguyen et al\. \(2018a\) Dat Quoc Nguyen, Dai Quoc Nguyen, Thanh Vu, Mark Dras, and Mark Johnson\. 2018a\. A Fast and Accurate Vietnamese Word Segmenter\. In *Proceedings of LREC 2018*, pages 2582–2587\.
- Nguyen et al\. \(2023a\) Dat Quoc Nguyen, Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen, Dinh Phung, and Hung Bui\. 2023a\. PhoGPT: Generative Pre\-training for Vietnamese\. *arXiv preprint*, arXiv:2311\.02945\.
- Nguyen et al\. \(2018b\) Kiet Van Nguyen, Vu Duc Nguyen, Phu X\. V\. Nguyen, Tham T\. H\. Truong, and Ngan Luu\-Thuy Nguyen\. 2018b\. UIT\-VSFC: Vietnamese Students’ Feedback Corpus for Sentiment Analysis\. In *Proceedings of KSE*, pages 19–24\.
- Nguyen et al\. \(2023b\) Nam Nguyen, Thang Phan, Duc\-Vu Nguyen, and Kiet Nguyen\. 2023b\. ViSoBERT: A Pre\-Trained Language Model for Vietnamese Social Media Text Processing\. In *Proceedings of EMNLP*, pages 5191–5207\.
- Tran et al\. \(2023\) Cong Dao Tran, Nhut Huy Pham, Anh Tuan Nguyen, Truong Son Hy, and Tu Vu\. 2023\. ViDeBERTa: A powerful pre\-trained language model for Vietnamese\. In *Findings of EACL 2023*, pages 1071–1078\.
- Tran\-Tien et al\. \(2023\) Manh Tran\-Tien, Huu\-Loi Le, Dang Nhat Minh, T\. Tran Khang, Huy\-The Vu, and Nguyen Minh\-Tien\. 2023\. ViPubmedDeBERTa: A Pre\-trained Model for Vietnamese Biomedical Text\. In *Proceedings of PACLIC*, pages 831–840\.
- Truong et al\. \(2021\) Thinh Hung Truong, Mai Hoang Dao, and Dat Quoc Nguyen\. 2021\. COVID\-19 Named Entity Recognition for Vietnamese\. In *Proceedings of NAACL*\.
- Van Dinh et al\. \(2022\) Co Van Dinh, Son T\. Luu, and Anh Gia\-Tuan Nguyen\. 2022\. Detecting Spam Reviews on Vietnamese E\-Commerce Websites\. In *Proceedings of ACIIDS*, pages 595–607\.
- Van Thin et al\. \(2021\) Dang Van Thin, Ngan Luu\-Thuy Nguyen, Tri Minh Truong, Lac Si Le, and Duy Tin Vo\. 2021\. Two New Large Corpora for Vietnamese Aspect\-based Sentiment Analysis at Sentence Level\. *ACM Trans\. Asian Low\-Resour\. Lang\. Inf\. Process\.*, 20\(4\)\.
- Vu et al\. \(2018\) Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras, and Mark Johnson\. 2018\. VnCoreNLP: A Vietnamese natural language processing toolkit\. In *Proceedings of NAACL: Demonstrations*, pages 56–60\.
- Warner et al\. \(2025\) Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Griffin Thomas Adams, Jeremy Howard, and Iacopo Poli\. 2025\. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference\. In *Proceedings of the ACL*, pages 2526–2547\.
- Wolf et al\. \(2020\) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M\. Rush\. 2020\. Transformers: State\-of\-the\-Art Natural Language Processing\. In *Proceedings of EMNLP: System Demonstrations*, pages 38–45\.
