A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering

arXiv cs.CL Papers

Summary

This paper presents a comparative evaluation of embedding models and generator backends for Khmer-language retrieval-augmented question answering in the telecom domain, finding that BGE-M3 performs best for retrieval while generator strengths vary across metrics.

arXiv:2605.22099v1 Announce Type: new Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for grounding large language model (LLM) outputs in retrieved evidence, thereby reducing hallucination and improving factual accuracy. Its efficacy, however, remains largely unexamined for low-resource, non-Latin-script languages such as Khmer. In this paper, we present a RAG-based question answering system for Khmer-language telecom-domain documents. We conduct a two-phase comparative evaluation. First, we benchmark three embedding models: BGE-M3 (567M), Jina-Embeddings-v3 (570M), and Qwen3-Embedding (597M), for dense retrieval over Khmer documents. BGE-M3 consistently performs best, achieving a Hit Rate@3 of 0.285, File Hit Rate@3 of 0.700, MRR@3 of 0.221, and Precision@3 of 0.112, substantially outperforming the other retrievers. Second, using BGE-M3 as the selected retriever, we evaluate five generator backends: Qwen3 (8B), Qwen3.5 (9B), Sailor2-8B-Chat, SeaLLMs-v3-7B-Chat, and Llama-SEA-LION-v2-8B-IT, on a curated golden dataset of 200 Khmer question-answer pairs. To quantify system performance, we apply six RAGAS-inspired metrics: faithfulness, answer relevance, context relevance, factual correctness, answer similarity, and answer correctness. The results show no single model dominates across all metrics: Qwen3.5-9B achieves the highest faithfulness (0.859) and context relevance (0.726), Qwen3-8B attains the highest factual correctness (0.380), and SeaLLMs-v3-7B-Chat performs best on answer relevance (0.867), answer similarity (0.836), and answer correctness (0.599). These findings highlight that retriever choice remains a major bottleneck for Khmer RAG, while generator strengths vary depending on whether the priority is grounding, factual precision, or semantic similarity.
Original Article
View Cached Full Text

Cached at: 05/22/26, 08:45 AM

# A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering
Source: [https://arxiv.org/html/2605.22099](https://arxiv.org/html/2605.22099)
Sereiwathna RosPhannet PovDepartment of Big Data, Chungbuk National University, Cheongju, South KoreaRatanaktepi ChhorDepartment of Big Data, Chungbuk National University, Cheongju, South KoreaKimleang LyGeneral Department of Information and Communication Technology, Ministry of Post and Telecommunications, Phnom Penh, CambodiaWan\-Sup ChoDepartment of Management Information Systems, Chungbuk National University, Cheongju, South KoreaBigDatalabs Co\., Ltd, Cheongju, South KoreaSaksonita KhoeurnCorresponding author: saksonita@chungbuk\.ac\.krDepartment of Management Information Systems, Chungbuk National University, Cheongju, South KoreaBigDatalabs Co\., Ltd, Cheongju, South Korea

###### Abstract

Retrieval\-Augmented Generation \(RAG\) has emerged as a promising paradigm for grounding large language model \(LLM\) outputs in retrieved evidence, thereby reducing hallucination and improving factual accuracy\. Its efficacy, however, remains largely unexamined for low\-resource, non\-Latin\-script languages such as Khmer\. In this paper, we present a RAG\-based question answering system for Khmer\-language telecom\-domain documents\. We conduct a two\-phase comparative evaluation\. First, we benchmark three embedding models—BGE\-M3 \(567M\), Jina\-Embeddings\-v3 \(570M\), and Qwen3\-Embedding \(597M\)—for dense retrieval over Khmer documents\. BGE\-M3 consistently performs best, achieving a Hit Rate@3 of 0\.285, File Hit Rate@3 of 0\.700, MRR@3 of 0\.221, and Precision@3 of 0\.112, substantially outperforming the other retrievers\. Second, using BGE\-M3 as the selected retriever, we evaluate five generator backends—Qwen3 \(8B\), Qwen3\.5 \(9B\), Sailor2\-8B\-Chat, SeaLLMs\-v3\-7B\-Chat, and Llama\-SEA\-LION\-v2\-8B\-IT—on a curated golden dataset of 200 Khmer question\-answer pairs\. To quantify system performance, we apply six RAGAS\-inspired metrics: faithfulness, answer relevance, context relevance, factual correctness, answer similarity, and answer correctness\. The results show no single model dominates across all metrics: Qwen3\.5\-9B achieves the highest faithfulness \(0\.859\) and context relevance \(0\.726\), Qwen3\-8B attains the highest factual correctness \(0\.380\), and SeaLLMs\-v3\-7B\-Chat performs best on answer relevance \(0\.867\), answer similarity \(0\.836\), and answer correctness \(0\.599\)\. These findings highlight that retriever choice remains a major bottleneck for Khmer RAG, while generator strengths vary depending on whether the priority is grounding, factual precision, or semantic similarity\.

Keywords:retrieval\-augmented generation, RAG evaluation, RAGAS metrics, Khmer question answering, Khmer NLP, local LLMs, dense retrieval, low\-resource languages

## 1Introduction

Retrieval\-Augmented Generation \(RAG\)\[[10](https://arxiv.org/html/2605.22099#bib.bib15)\]has become a common approach for question answering over domain\-specific document collections because it combines external retrieval with the generative capabilities of Large Language Models \(LLMs\)\. In such systems, performance depends not only on the generator, but also on the quality of retrieval and on whether the generated answer remains grounded in the retrieved evidence\. As a result, evaluating RAG is inherently multi\-dimensional: a system may fail because relevant evidence is not retrieved, because the model does not use the retrieved context effectively, or because it hallucinates unsupported content\. Existing evaluation practices, however, remain heavily shaped by English\-centric settings and do not always transfer cleanly to low\-resource languages or institutionally sensitive domains\[[6](https://arxiv.org/html/2605.22099#bib.bib20),[4](https://arxiv.org/html/2605.22099#bib.bib13),[18](https://arxiv.org/html/2605.22099#bib.bib28)\]\.

These limitations are especially important for Khmer document Question Answering \(QA\)\. Khmer is a low\-resource language written in a complex Abugida script, with limited annotated resources and weak standardization for word segmentation\. These characteristics introduce challenges at multiple stages of a RAG pipeline, including text extraction, document preprocessing, retrieval, and answer evaluation\. In institutional settings, the problem is further compounded by the need for trustworthy responses grounded in authoritative documents, since hallucinated or weakly supported answers can undermine public confidence\[[8](https://arxiv.org/html/2605.22099#bib.bib25)\]\. Despite growing interest in multilingual and low\-resource Natural Language Processing \(NLP\), it remains unclear which retrieval models, which locally deployable generators, and which automated evaluation signals are most suitable for Khmer\-language RAG systems\.

In this paper, we present a systematic study of retrieval\-augmented question answering over Khmer institutional documents\. Our focus is not only on end\-to\-end answer quality, but also on the interaction between retrieval, generation, and automated evaluation in a low\-resource, non\-Latin\-script setting\. We study a privacy\-preserving RAG pipeline built for locally hosted deployment over Khmer telecom\-domain documents, allowing us to examine the practical requirements of document\-grounded question answering under data\-sovereignty constraints\.

Our contributions are threefold:

1. 1\.We benchmark dense retrievers for Khmer document retrieval and analyze their effectiveness on noisy, domain\-diverse institutional text\.
2. 2\.We compare five locally deployable generator models, including both general\-purpose multilingual LLMs and Southeast Asian\-focused models, to assess whether regional specialization yields measurable gains for Khmer question answering\.
3. 3\.We examine six adapted RAGAS\-style metrics in this setting and discuss their usefulness and limitations for evaluating Khmer RAG pipelines\.

To support this study, we construct a gold evaluation set of 200 Khmer question–answer pairs derived from authoritative documents across multiple institutional subdomains\. Using this benchmark, we provide an empirical account of retrieval quality, answer quality, and metric behavior in Khmer document QA\. Our findings show that retriever choice has a substantial effect on downstream performance, while the relative strengths of generator models vary across grounding\-oriented and similarity\-oriented evaluation measures\. More broadly, this paper highlights the need for language\-aware RAG evaluation practices in low\-resource settings, and offers evidence that methods validated in English should not be assumed to behave identically for Khmer\.

## 2Related Work

RAG combines a retriever with an LLM so that answers are generated from retrieved evidence rather than from parametric memory alone\[[10](https://arxiv.org/html/2605.22099#bib.bib15)\]\. In a typical pipeline, documents are collected, segmented into passages, indexed, and retrieved at inference time to condition answer generation\[[6](https://arxiv.org/html/2605.22099#bib.bib20)\]\. Because end\-to\-end performance depends jointly on retrieval quality and response grounding, RAG has become a widely used framework for knowledge\-intensive tasks\. Early systems often relied on dense retrieval methods such as Dense Passage Retrieval \(DPR\)\[[9](https://arxiv.org/html/2605.22099#bib.bib19)\], while more recent multilingual retrievers such as BGE\-M3\[[1](https://arxiv.org/html/2605.22099#bib.bib7)\]aim to improve transfer across languages and scripts\. These developments are especially relevant in low\-resource settings, where retrieval can be degraded by limited training data, Optical Character Recognition \(OCR\) artifacts, and script\-sensitive preprocessing challenges\[[7](https://arxiv.org/html/2605.22099#bib.bib8)\]\.

Traditional reference\-based metrics such as Bilingual Evaluation Understudy \(BLEU\) and Recall\-Oriented Understudy for Gisting Evaluation \(ROUGE\) measure lexical overlap with gold references\[[17](https://arxiv.org/html/2605.22099#bib.bib30),[11](https://arxiv.org/html/2605.22099#bib.bib31)\], while embedding\-based metrics such as Bidirectional Encoder Representations from Transformers Score \(BERTScore\) better capture semantic similarity through contextualized representations\[[23](https://arxiv.org/html/2605.22099#bib.bib32)\]\. However, these approaches are not fully adequate for RAG because they do not directly assess whether a response is supported by the retrieved evidence\. This limitation is particularly important in low\-resource settings, where high\-quality reference sets are costly to construct and correct answers may show substantial lexical and syntactic variation\.

Recent work therefore moves beyond answer similarity toward explicit assessment of grounding and context use\. GEval demonstrates that LLM\-as\-a\-judge methods can support flexible rubric\-based evaluation\[[12](https://arxiv.org/html/2605.22099#bib.bib9)\]\. RAGAS adapts this idea to RAG pipelines through metrics such as faithfulness, answer relevance, and context\-related measures\[[4](https://arxiv.org/html/2605.22099#bib.bib13),[5](https://arxiv.org/html/2605.22099#bib.bib11)\]\. Related benchmarks such as RGB further stress\-test RAG systems under noisy retrieval and counterfactual conditions\[[2](https://arxiv.org/html/2605.22099#bib.bib12)\]\. In an applied telecom setting, Roychowdhury et al\.\[[18](https://arxiv.org/html/2605.22099#bib.bib28)\]likewise report that grounding\-oriented measures such as faithfulness and factual correctness align more closely with expert judgment than similarity\-based metrics\. Taken together, this literature suggests that RAG evaluation should account not only for output quality, but also for evidential support\.

Although RAG and LLM evaluation have advanced rapidly, most evidence still comes from English and other high\-resource languages\. For Khmer, this creates an important gap: document processing and retrieval must contend with a non\-Latin script, inconsistent word segmentation, OCR noise, and limited task\-specific resources\. Regionally focused models such as SEA\-LION\[[14](https://arxiv.org/html/2605.22099#bib.bib3)\]and Sailor2\[[3](https://arxiv.org/html/2605.22099#bib.bib4)\]indicate growing support for Southeast Asian languages, but they do not by themselves establish how well retrieval models, generator models, and automated evaluation metrics behave in Khmer institutional QA settings\.

Ly et al\.\[[13](https://arxiv.org/html/2605.22099#bib.bib1)\]conducted a study in which they prepared a dataset of questions and corresponding Khmer answers to perform fine\-tuning experiments on large language models \(LLMs\) for the Khmer language\. To evaluate the generated answers, the authors employed similarity\-based metrics that compare the model outputs with reference answers\. Specifically, ROUGE\-1, ROUGE\-2, and ROUGE\-L were used to measure unigram overlap, bigram overlap, and the longest common subsequence, respectively\.

![Refer to caption](https://arxiv.org/html/2605.22099v1/x1.png)Figure 1:System architecture of the RAG pipeline\.These metrics provide insights into how closely the generated responses align with the ground truth answers\. In contrast, our work generates answers based on retrieved contextual information rather than relying on direct comparison with predefined target answers\. Therefore, traditional similarity\-based evaluation metrics may not fully capture the quality of the responses, as they do not adequately reflect the relevance and faithfulness of the generated answers to the provided context\.

More broadly, prior work on Khmer and related low\-resource language processing suggests that language\-specific preprocessing and resource constraints can materially affect downstream system performance\. Our study builds on this perspective by evaluating retrieval quality, answer generation, and automated RAG assessment jointly in a Khmer\-language institutional document environment\.

Overall, this work is positioned at the intersection of multilingual RAG, grounding\-aware evaluation, and low\-resource language processing\. Unlike prior work centered on English, our focus is not only whether RAG works, but whether its retrieval and evaluation assumptions remain reliable in a Khmer\-language setting\.

## 3Methodology

Our experimental setup follows a standard RAG pipeline comprising a retriever followed by a generator module\. Figure[1](https://arxiv.org/html/2605.22099#S2.F1)shows the schematic of our pipeline\. The input to the system is a user query in Khmer or English, which is processed through dense retrieval followed by LLM\-based answer generation conditioned on the retrieved context\.

### 3\.1Dataset

We collected open\-source data from websites that publish official documents related to Information and Communication Technology \(ICT\)\. The data sources include notifications, guidelines, laws, announcements, decrees, sub\-decrees, government documents, press releases, Q&A documents, decisions, and general information\. The corpus is mostly Khmer language documents with embedded English technical terms\. The focus on closed\-domain institutional documents enables this research to create a specific and verifiable knowledge domain, which is critical in building question answering systems in a low\-resource language setting\.

The documents are preprocessed into Markdown format and recursively chunked into segments that preserve semantic coherence\. Each chunk carries metadata including the source document identifier and a unique chunk ID for provenance tracking\. The resulting corpus contains over 7,000 chunks\.

For evaluation, we curate a golden dataset comprising 200 question–answer pairs derived from the telecom domain document corpus\. The questions span multiple domains and reflect realistic citizen queries written in Khmer\. Each entry consists of: \(1\) a question, \(2\) a target answer \(ground truth\) composed by domain experts, and \(3\) domain metadata \(document ID, question ID, domain category\) for stratified analysis\.

### 3\.2Retriever Models

The retriever module is based on dense passage retrieval\. An encoder\-based language model computes embeddings for both the query and the document chunks\. For every query embedding, the retriever outputs the top\-kkmost similar chunk embeddings using cosine similarity\.

We evaluate three embedding models to determine which provides the most effective dense retrieval for Khmer documents:

1. 1\.BGE\-M3\(567M\)\[[1](https://arxiv.org/html/2605.22099#bib.bib7)\]: Supports multi\-lingual, multi\-functionality, and multi\-granularity embeddings through self\-knowledge distillation, achieving competitive performance across 100\+ languages including Southeast Asian scripts\. Served through Ollama\[[15](https://arxiv.org/html/2605.22099#bib.bib17)\]\.
2. 2\.Jina\-Embeddings\-v3\(570M\)\[[20](https://arxiv.org/html/2605.22099#bib.bib33)\]: An embedding model supporting 89\+ languages with task\-specific adapters for retrieval, classification, and semantic similarity\. Served through Ollama\.
3. 3\.Qwen3\-Embedding\(597M\)\[[25](https://arxiv.org/html/2605.22099#bib.bib5)\]: A compact embedding model from the Qwen family designed for semantic search across multiple languages\. Served through Ollama\.

Document chunks are embedded in batches during the offline stage, and the complete vector database is serialized for efficient runtime loading\. At query time, the top\-kkchunks \(defaultk=3k\{=\}3\) are selected and concatenated with similarity scores and source metadata into a structured context string\. The retriever is evaluated using standard information retrieval metrics: Hit Rate@kk, File\-level Hit Rate@kk, Mean Reciprocal Rank \(MRR@kk\), and Precision@kk\.

### 3\.3Generator Models

Once the relevant context has been retrieved for a query, the query and context are passed to the LLM for generating the response\. A system prompt constrains the LLM to use only the provided context and not generate information beyond it\.

Table 1:Summary of models used in the experimental setup\.We evaluate five LLM backends with different sizes and linguistic focus:

1. 1\.Qwen3 \(8B\)\[[21](https://arxiv.org/html/2605.22099#bib.bib2)\]: A general\-purpose model from the Qwen family with broad language coverage\. Served locally via Ollama\.
2. 2\.Qwen3\.5 \(9B\)\[[22](https://arxiv.org/html/2605.22099#bib.bib6)\]: A newer\-generation model from the Qwen family with improved instruction\-following capabilities\. Served locally via Ollama\.
3. 3\.Sailor2\-8B\-Chat\[[3](https://arxiv.org/html/2605.22099#bib.bib4)\]: An 8B\-parameter model specifically trained for Southeast Asian language understanding\. Inference is performed using HuggingFace Transformers\.
4. 4\.SeaLLMs\-v3\-7B\-Chat\[[24](https://arxiv.org/html/2605.22099#bib.bib24)\]: A 7B chat model optimized for Southeast Asian languages\. Inference is performed using HuggingFace Transformers\.
5. 5\.Llama\-SEA\-LION\-v2\-8B\-IT\[[14](https://arxiv.org/html/2605.22099#bib.bib3)\]: An 8B instruction\-tuned SEA\-LION family model targeting Southeast Asian multilingual use\. Inference is performed using HuggingFace Transformers\.

All local inference is performed on a single machine with one NVIDIA GPU H200\.

### 3\.4Evaluation Metrics

We focus on six adapted metrics inspired by the RAGAS framework\[[4](https://arxiv.org/html/2605.22099#bib.bib13)\]\. Higher value is better for all of them\. The evaluation employs GPT\-4o\-mini\[[16](https://arxiv.org/html/2605.22099#bib.bib27)\]as the LLM judge and BGE\-M3\[[1](https://arxiv.org/html/2605.22099#bib.bib7)\]for computing semantic similarity\. We use the notation of Roychowdhury et al\.\[[18](https://arxiv.org/html/2605.22099#bib.bib28)\]: given questionqqand contextc​\(q\)c\(q\)retrieved from the corpus, the LLM generates answera​\(q\)a\(q\)\. The ground truth answer is denotedg​t​\(q\)gt\(q\)\.

- •Faithfulness \(𝐹𝑎𝑖𝐹𝑢𝑙\\mathit\{FaiFul\}\):Checks if the generated statements froma​\(q\)a\(q\)are present in the retrieved contextc​\(q\)c\(q\)through verdicts; the ratio of valid verdictsVVto total number of statementsS​\(q\)S\(q\)is the answer’s faithfulness: 𝐹𝑎𝑖𝐹𝑢𝑙=\|V\|\|S​\(q\)\|\\mathit\{FaiFul\}=\\frac\{\|V\|\}\{\|S\(q\)\|\}\(1\)
- •Answer Relevance \(𝐴𝑛𝑠𝑅𝑒𝑙\\mathit\{AnsRel\}\):The average cosine similarity of the user’s questionqqwithN=3N\{=\}3generated questionsq~i\\tilde\{q\}\_\{i\}, usinga​\(q\)a\(q\)as reference, is the answer relevance: 𝐴𝑛𝑠𝑅𝑒𝑙=1N​∑i=1Nsim​\(E​\(q\),E​\(q~i\)\)\\mathit\{AnsRel\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\mathrm\{sim\}\(E\(q\),E\(\\tilde\{q\}\_\{i\}\)\)\(2\)
- •Context Relevance \(𝐶𝑜𝑛𝑅𝑒𝑙\\mathit\{ConRel\}\):The semantic similarity between the questionqqand the retrieved contextc​\(q\)c\(q\)computed using BGE\-M3 embeddings: 𝐶𝑜𝑛𝑅𝑒𝑙=sim​\(E​\(q\),E​\(c​\(q\)\)\)\\mathit\{ConRel\}=\\mathrm\{sim\}\(E\(q\),E\(c\(q\)\)\)\(3\)
- •Answer Similarity \(𝐴𝑛𝑠𝑆𝑖𝑚\\mathit\{AnsSim\}\):The similarity between the embedding ofa​\(q\)a\(q\)and the embedding ofg​t​\(q\)gt\(q\): 𝐴𝑛𝑠𝑆𝑖𝑚=sim​\(E​\(a​\(q\)\),E​\(g​t​\(q\)\)\)\\mathit\{AnsSim\}=\\mathrm\{sim\}\(E\(a\(q\)\),E\(gt\(q\)\)\)\(4\) Table 2:Retriever evaluation results on the 200\-question golden dataset\. Bold values indicate the best score per metric\.↑\\uparrowindicates higher is better\.
- •Factual Correctness \(𝐹𝑎𝑐𝐶𝑜𝑟\\mathit\{FacCor\}\):The F1\-Score of statements ina​\(q\)a\(q\)classified as True Positive \(TP\), False Positive \(FP\) and False Negative \(FN\) by the LLM judge with respect tog​t​\(q\)gt\(q\): 𝐹𝑎𝑐𝐶𝑜𝑟=\|TP\|\|TP\|\+0\.5×\(\|FP\|\+\|FN\|\)\\mathit\{FacCor\}=\\frac\{\|\\mathrm\{TP\}\|\}\{\|\\mathrm\{TP\}\|\+0\.5\\times\(\|\\mathrm\{FP\}\|\+\|\\mathrm\{FN\}\|\)\}\(5\)
- •Answer Correctness \(𝐴𝑛𝑠𝐶𝑜𝑟\\mathit\{AnsCor\}\):Determines correctness ofa​\(q\)a\(q\)with respect tog​t​\(q\)gt\(q\)as a weighted sum of factual correctness and answer similarity: 𝐴𝑛𝑠𝐶𝑜𝑟=w1×𝐹𝑎𝑐𝐶𝑜𝑟\+w2×𝐴𝑛𝑠𝑆𝑖𝑚\\mathit\{AnsCor\}=w\_\{1\}\\times\\mathit\{FacCor\}\+w\_\{2\}\\times\\mathit\{AnsSim\}\(6\)with weights\[w1,w2\]=\[0\.5,0\.5\]\[w\_\{1\},w\_\{2\}\]=\[0\.5,0\.5\]in our implementation\.

For each of the 200 questions, the evaluation pipeline: \(1\) retrieves the top\-3 document chunks from the vector database, \(2\) generates an LLM answer conditioned on the retrieved context, \(3\) records the question, target answer, LLM answer, and retrieved chunks, and \(4\) computes the six metrics\. The evaluation is conducted separately for each of the five LLM backends\. For Qwen3 and Qwen3\.5, evaluation is run via Ollama\. For Sailor2\-8B\-Chat, SeaLLMs\-v3\-7B\-Chat, and Llama\-SEA\-LION\-v2\-8B\-IT, inference is performed using HuggingFace Transformers\. The generation evaluation uses BGE\-M3 as the retriever, which was identified as the best\-performing embedding model in our retriever comparison \(Section[4\.1](https://arxiv.org/html/2605.22099#S4.SS1)\)\.

## 4Results and Discussion

The results of the RAG evaluation are presented in two parts\. First, we compare the three embedding models for retrieval quality \(Table[2](https://arxiv.org/html/2605.22099#S3.T2)\)\. Then, using BGE\-M3 as the selected retriever, we evaluate the five generator models on the six RAGAS metrics \(Table[3](https://arxiv.org/html/2605.22099#S4.T3)\)\.

### 4\.1Retriever Evaluation

Table[2](https://arxiv.org/html/2605.22099#S3.T2)presents the retrieval performance of the three embedding models evaluated on the 200\-question golden dataset\. We report Hit Rate \(the fraction of queries for which the correct chunk appears in the top\-kkresults\), File\-level Hit Rate \(the fraction for which a chunk from the correct source file appears\), Mean Reciprocal Rank \(MRR\), and Precision at different values ofkk\.

Table 3:RAGAS metrics for the 200\-question golden dataset withk=3k\{=\}3retrieved contexts \(BGE\-M3 retriever\)\. Numbers are mean scores\. Bold values indicate the best score per metric\.↑\\uparrowindicates higher is better\.BGE\-M3 achieves the highest scores across all retrieval effectiveness metrics, with a Hit Rate@3 of 0\.285—more than double that of Jina\-Embeddings\-v3 \(0\.135\) and substantially higher than Qwen3\-Embedding \(0\.175\)\. The advantage is even more pronounced at the file level: BGE\-M3 retrieves a chunk from the correct source file 70% of the time atk=3k\{=\}3, compared to 52\.5% for Qwen3\-Embedding and 48\.5% for Jina v3\. Interestingly, Jina v3 achieves the highest top\-1 cosine similarity \(0\.759\), yet this does not translate into better retrieval accuracy, consistent with cautions about interpreting raw cosine similarity as a quality signal\[[19](https://arxiv.org/html/2605.22099#bib.bib29)\]\. Based on these results, we select BGE\-M3 as the retriever for the generator evaluation\.

### 4\.2Generator Evaluation

Table[3](https://arxiv.org/html/2605.22099#S4.T3)presents the RAGAS metrics for the five generator models on the 200\-question golden dataset using BGE\-M3 as the retriever\.

### 4\.3Discussion on Metrics

We discuss our findings about the six RAGAS metrics and their behaviour in the context of Khmer document QA\.

#### Faithfulness \(𝐹𝑎𝑖𝐹𝑢𝑙\\mathit\{FaiFul\}\)\.

Qwen3\.5\-9B achieves the highest faithfulness score of 0\.859, with SeaLLMs\-v3\-7B\-Chat close behind at 0\.846\. Qwen3\-8B and Sailor2\-8B remain in a similar middle range \(0\.780 and 0\.758\), whereas Llama\-SEA\-LION\-v2\-8B\-IT drops markedly to 0\.556\. This ranking suggests that strong multilingual or regional coverage alone is insufficient; what matters is whether the model can stay tightly grounded in the retrieved evidence\. The result is consistent with the findings of Roychowdhury et al\.\[[18](https://arxiv.org/html/2605.22099#bib.bib28)\], who report that faithfulness is generally concordant with manual evaluation by subject matter experts\.

Table 4:Summary of key findings per metric\.
#### Answer Relevance \(𝐴𝑛𝑠𝑅𝑒𝑙\\mathit\{AnsRel\}\)\.

SeaLLMs\-v3\-7B\-Chat achieves the highest answer relevance \(0\.867\), followed by Llama\-SEA\-LION\-v2\-8B\-IT \(0\.831\), Qwen3\-8B \(0\.808\), Sailor2\-8B \(0\.802\), and Qwen3\.5\-9B \(0\.779\)\. Unlike the earlier three\-model comparison, the inclusion of two additional Southeast Asian\-focused models shows that regional specialization can improve topical responsiveness to Khmer questions\. However, as noted in the literature\[[19](https://arxiv.org/html/2605.22099#bib.bib29),[18](https://arxiv.org/html/2605.22099#bib.bib28)\], this metric relies on cosine similarity between generated and original question embeddings and should therefore be interpreted cautiously as a relative signal rather than an absolute measure\.

#### Context Relevance \(𝐶𝑜𝑛𝑅𝑒𝑙\\mathit\{ConRel\}\)\.

All five models achieve nearly identical context relevance scores, ranging only from 0\.717 to 0\.726, with Qwen3\.5\-9B slightly ahead at 0\.726\. This is expected because context relevance primarily reflects retrieval quality, which is determined by the shared BGE\-M3 embedding model and is largely independent of the LLM backend\. The tight clustering confirms that retrieval quality remains a system\-level bottleneck rather than a differentiating property of the generators themselves\. As observed by Roychowdhury et al\.\[[18](https://arxiv.org/html/2605.22099#bib.bib28)\], context relevance is mainly indicative and dependent on context length, making it difficult to interpret as a standalone measure of answer quality\.

#### Factual Correctness \(𝐹𝑎𝑐𝐶𝑜𝑟\\mathit\{FacCor\}\)\.

Qwen3\-8B achieves the highest factual correctness \(0\.380\), with SeaLLMs\-v3\-7B\-Chat second at 0\.352, followed by Qwen3\.5\-9B \(0\.303\), Sailor2\-8B \(0\.258\), and Llama\-SEA\-LION\-v2\-8B\-IT \(0\.217\)\. The overall low scores across all models reflect the difficulty of the task: documents contain highly specific information such as phone numbers, URLs, and legal article numbers, where partial matches yield low F1 scores\. Importantly, the ranking differs from faithfulness: Qwen3\.5\-9B is best grounded in retrieved context, but Qwen3\-8B more accurately reproduces factual details from the ground truth\. Together with faithfulness, this metric has been found to be most aligned with expert evaluation\[[18](https://arxiv.org/html/2605.22099#bib.bib28)\]and remains one of the most informative measures in our setting\.

#### Answer Similarity \(𝐴𝑛𝑠𝑆𝑖𝑚\\mathit\{AnsSim\}\) and Answer Correctness \(𝐴𝑛𝑠𝐶𝑜𝑟\\mathit\{AnsCor\}\)\.

SeaLLMs\-v3\-7B\-Chat achieves the highest answer similarity \(0\.836\) and also the highest composite answer correctness \(0\.599\), indicating the strongest overall balance between semantic closeness to the reference answer and factual adequacy\. Llama\-SEA\-LION\-v2\-8B\-IT also scores relatively high on answer similarity \(0\.766\) but falls back to 0\.488 on answer correctness because of its weak factual correctness\. By contrast, Qwen3\-8B remains competitive on answer correctness \(0\.521\) despite lower answer similarity, because its stronger factual correctness compensates for the gap\. This again shows that semantic similarity and factual accuracy are distinct dimensions of quality\. Table[4](https://arxiv.org/html/2605.22099#S4.T4)summarizes the key conclusions\.

In summary, our results indicate that of these metrics,𝐹𝑎𝑖𝐹𝑢𝑙\\mathit\{FaiFul\}and𝐹𝑎𝑐𝐶𝑜𝑟\\mathit\{FacCor\}\(and hence𝐴𝑛𝑠𝐶𝑜𝑟\\mathit\{AnsCor\}\) are perhaps best aligned with human expert judgment for our domain; scores for𝐴𝑛𝑠𝑆𝑖𝑚\\mathit\{AnsSim\},𝐴𝑛𝑠𝑅𝑒𝑙\\mathit\{AnsRel\}and𝐶𝑜𝑛𝑅𝑒𝑙\\mathit\{ConRel\}are subject to inherent variations as discussed above\. This echoes the findings of Roychowdhury et al\.\[[18](https://arxiv.org/html/2605.22099#bib.bib28)\]for the telecom domain\.

### 4\.4Retriever Model Selection

A key finding is that BGE\-M3 substantially outperforms both Jina\-Embeddings\-v3 and Qwen3\-Embedding across all retrieval effectiveness metrics\. Atk=3k\{=\}3, BGE\-M3 achieves more than double the Hit Rate of Jina v3 \(0\.285 vs\. 0\.135\) and a 63% higher Hit Rate than Qwen3\-Embedding \(0\.285 vs\. 0\.175\)\. The gap widens at the file level, where BGE\-M3 retrieves from the correct source file 70% of the time compared to 52\.5% \(Qwen3\-Embedding\) and 48\.5% \(Jina v3\)\.

An interesting observation is that Jina v3 achieves the highest mean cosine similarity for its top\-1 retrieved chunk \(0\.759\), yet produces the worst retrieval accuracy\. This highlights a known limitation of relying on raw cosine similarity as a retrieval quality indicator\[[19](https://arxiv.org/html/2605.22099#bib.bib29)\]: a model may assign high similarity scores to semantically related but non\-matching passages\. BGE\-M3’s self\-knowledge distillation approach appears to produce embeddings that better discriminate between truly relevant and merely related content for Khmer documents\.

Nevertheless, even the best retriever \(BGE\-M3\) achieves a relatively modest Hit Rate@3 of 0\.285, indicating that retrieval quality remains a primary bottleneck in the overall RAG pipeline\. This is reflected in the tightly clustered𝐶𝑜𝑛𝑅𝑒𝑙\\mathit\{ConRel\}scores \(∼\\sim0\.72\) observed across all five generator models\.

### 4\.5Language Focus vs\. General Multilingual Models

With the addition of SeaLLMs\-v3\-7B\-Chat and Llama\-SEA\-LION\-v2\-8B\-IT, the comparison between general\-purpose and Southeast Asian language\-focused models becomes more nuanced\. Regional specialization does not uniformly help or hurt Khmer RAG performance\. SeaLLMs is the strongest model on answer relevance, answer similarity, and answer correctness, while Qwen3\.5\-9B remains best on faithfulness and Qwen3\-8B remains best on factual correctness\.

Within the Qwen family, an interesting trade\-off remains: Qwen3\.5\-9B achieves higher faithfulness \(0\.859 vs\. 0\.780\) and answer similarity \(0\.661 vs\. 0\.648\), while Qwen3\-8B leads on factual correctness \(0\.380 vs\. 0\.303\) and answer correctness \(0\.521 vs\. 0\.480\)\. This suggests that the newer, slightly larger Qwen3\.5 model is better at grounding answers in retrieved context, while Qwen3 more precisely reproduces factual details from the ground truth\.

The three Southeast Asian\-focused models also separate clearly from one another\. SeaLLMs appears to transfer well to Khmer answer generation, whereas Sailor2 and especially Llama\-SEA\-LION struggle more on grounding\-oriented metrics\. Several hypotheses may explain this spread: \(1\) the amount and quality of Khmer data may differ substantially across SEA\-focused pretraining corpora; \(2\) instruction\-following ability under constrained RAG prompting may matter more than regional coverage alone; and \(3\) differences in tokenizer behaviour, chat templates, and HuggingFace inference settings may influence answer style and grounding\. The broader conclusion is that regional language targeting can help, but the effect is strongly model\-specific rather than guaranteed across all models in the same family\.

From a deployment perspective, model choice depends on the target objective\. For applications where faithfulness to retrieved context is paramount, Qwen3\.5\-9B is the best choice\. For applications where factual accuracy matters most, Qwen3\-8B is preferable\. For applications prioritizing overall answer quality and semantic closeness to the target response, SeaLLMs\-v3\-7B\-Chat is the strongest option among the evaluated models\.

A key limitation of this study is that all reported scores are based on automated evaluation\. Although prior work in the telecom domain suggests that faithfulness and factual correctness align more closely with expert judgment than similarity\-based metrics, we did not conduct a dedicated human evaluation for Khmer in this study\. Therefore, our conclusions about metric reliability should be interpreted as evidence from automated analysis rather than definitive validation against human annotations\.

### 4\.6Limitations

- •Fixed retrieval\.The current system uses a single dense retrieval pass with fixedk=3k\{=\}3\. Adaptive retrieval or iterative refinement could improve context quality\.
- •Automated evaluation bias\.All metrics rely on LLM judges and English\-centric embedding models, potentially introducing systematic bias for Khmer evaluation\.
- •Limited ground truth\.The 200\-question golden dataset, while carefully curated, may not capture the full distribution of citizen queries\.
- •No human evaluation\.We did not conduct human evaluation of answer quality, which would provide a more reliable assessment\.
- •Inference setup differences\.Sailor2\-8B was evaluated with a different inference framework \(HuggingFace\) than the Qwen models \(Ollama\), which may introduce confounding factors\.

## 5Conclusion and Future Work

In this work, we presented a retrieval\-augmented generation system for Khmer document question answering using locally deployed language models\. We conducted a two\-fold comparative evaluation: a retriever comparison of three embedding models, and a generator comparison of five LLMs \(Qwen3\-8B, Qwen3\.5\-9B, Sailor2\-8B, SeaLLMs\-v3\-7B\-Chat, and Llama\-SEA\-LION\-v2\-8B\-IT\) on 200 question–answer pairs across six RAGAS\-inspired metrics\. Our results show that verdict\-based RAGAS metrics \(𝐹𝑎𝑖𝐹𝑢𝑙\\mathit\{FaiFul\}and𝐹𝑎𝑐𝐶𝑜𝑟\\mathit\{FacCor\}\) provide more reliable evaluation signals for Khmer text than similarity\-based metrics \(𝐴𝑛𝑠𝑅𝑒𝑙\\mathit\{AnsRel\}and𝐴𝑛𝑠𝑆𝑖𝑚\\mathit\{AnsSim\}\), which remain sensitive to embedding\-model choice and the inherent limitations of cosine similarity\. This is consistent with findings from the telecom domain\[[18](https://arxiv.org/html/2605.22099#bib.bib28)\]and suggests that these concerns also extend to low\-resource, non\-Latin\-script language settings\. Among the retrievers, BGE\-M3 substantially outperforms Jina\-Embeddings\-v3 and Qwen3\-Embedding on all retrieval metrics, achieving more than double the Hit Rate@3 of Jina v3\. Notably, high cosine similarity alone, as observed with Jina v3, does not necessarily translate into strong retrieval accuracy, reinforcing the importance of task\-specific retriever evaluation\. For generation, performance cannot be explained by a simple general\-purpose\-versus\-regional distinction\. Qwen3\.5\-9B achieves the highest faithfulness \(0\.859\), Qwen3\-8B leads in factual correctness \(0\.380\), and SeaLLMs\-v3\-7B\-Chat performs best on answer relevance \(0\.867\), answer similarity \(0\.836\), and answer correctness \(0\.599\), whereas Llama\-SEA\-LION\-v2\-8B\-IT underperforms on faithfulness and factual correctness\. These findings indicate that regional specialization can benefit Khmer question answering, but its impact is strongly model\-dependent\. Additionally, retrieval quality emerges as a significant bottleneck, as reflected in the modest Hit Rate@3 \(0\.285\) and the tightly clustered context relevance scores \(∼\\sim0\.72\) across all five generator models\.

Future work will focus on: \(1\) improving retrieval through hybrid dense\+sparse methods and query expansion; \(2\) incorporating Khmer\-specific tokenization and embedding models to improve both retrieval and evaluation quality; \(3\) conducting human evaluation studies with citizens to validate automated metrics against expert judgments; \(4\) exploring domain adaptation through instruction fine\-tuning of smaller models on Khmer data, which has been shown to improve metric concordance with expert evaluation\[[18](https://arxiv.org/html/2605.22099#bib.bib28)\]; and \(5\) extending the system to support cross\-lingual QA where questions and documents may be in different languages\.

## References

- \[1\]J\. Chen, S\. Xiao, P\. Zhang, K\. Luo, D\. Lian, and Z\. Liu\(2024\)BGE M3\-embedding: multi\-lingual, multi\-functionality, multi\-granularity text embeddings through self\-knowledge distillation\.External Links:2402\.03216Cited by:[§2](https://arxiv.org/html/2605.22099#S2.p1.1),[item 1](https://arxiv.org/html/2605.22099#S3.I1.i1.p1.1),[§3\.4](https://arxiv.org/html/2605.22099#S3.SS4.p1.4)\.
- \[2\]J\. Chen, H\. Lin, X\. Han, and L\. Sun\(2024\)Benchmarking large language models in retrieval\-augmented generation\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,Vancouver, Canada,pp\. 17754–17762\.Cited by:[§2](https://arxiv.org/html/2605.22099#S2.p3.1)\.
- \[3\]L\. Dou, Q\. Liu, F\. Zhou, C\. Chen, Z\. Wang, Z\. Jin, Z\. Liu, T\. Zhu, C\. Du, P\. Yang, H\. Wang, J\. Liu, Y\. Zhao, X\. Feng, X\. Mao, M\. T\. Yeung, K\. Pipatanakul, F\. Koto, M\. S\. Thu, H\. Kydlíček, Z\. Liu, Q\. Lin, S\. Sripaisarnmongkol, K\. Sae\-Khow, N\. Thongchim, T\. Konkaew, N\. Borijindargoon, A\. Dao, M\. Maneegard, P\. Artkaew, Z\. Yong, Q\. Nguyen, W\. Phatthiyaphaibun, H\. H\. Tran, M\. Zhang, S\. Chen, T\. Pang, C\. Du, X\. Wan, W\. Lu, and M\. Lin\(2025\)Sailor2: sailing in south\-east asia with inclusive multilingual llms\.External Links:2502\.12982,[Link](https://arxiv.org/abs/2502.12982)Cited by:[§2](https://arxiv.org/html/2605.22099#S2.p4.1),[item 3](https://arxiv.org/html/2605.22099#S3.I2.i3.p1.1)\.
- \[4\]S\. Es, J\. James, L\. E\. Anke, and S\. Schockaert\(2024\)Ragas: automated evaluation of retrieval augmented generation\.InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations,St\. Julian’s, Malta,pp\. 150–158\.Cited by:[Appendix A](https://arxiv.org/html/2605.22099#A1.p1.4),[§1](https://arxiv.org/html/2605.22099#S1.p1.1),[§2](https://arxiv.org/html/2605.22099#S2.p3.1),[§3\.4](https://arxiv.org/html/2605.22099#S3.SS4.p1.4)\.
- \[5\]Exploding Gradients\(2026\)RAGAS: retrieval augmented generation assessment\.Note:[https://github\.com/vibrantlabsai/ragas](https://github.com/vibrantlabsai/ragas)GitHub repository, accessed 2026\-03\-12Cited by:[§2](https://arxiv.org/html/2605.22099#S2.p3.1)\.
- \[6\]Y\. Gao, Y\. Xiong, X\. Gao, K\. Jia, J\. Pan, Y\. Bi, Y\. Dai, J\. Sun, M\. Wang, and H\. Wang\(2024\)Retrieval\-augmented generation for large language models: a survey\.External Links:2312\.10997Cited by:[§1](https://arxiv.org/html/2605.22099#S1.p1.1),[§2](https://arxiv.org/html/2605.22099#S2.p1.1)\.
- \[7\]S\. B\. Hosseinbeigi, S\. Asghari, M\. A\. S\. Kashani, M\. H\. Shalchian, and M\. A\. Abbasi\(2025\)Advancing retrieval\-augmented generation for persian: development of language models, comprehensive benchmarks, and best practices for optimization\.External Links:2501\.04858Cited by:[§2](https://arxiv.org/html/2605.22099#S2.p1.1)\.
- \[8\]Z\. Ji, N\. Lee, R\. Frieske, T\. Yu, D\. Su, Y\. Xu, E\. Ishii, Y\. Bang, A\. Madotto, and P\. Fung\(2023\)Survey of hallucination in natural language generation\.ACM Computing Surveys55\(12\),pp\. 1–38\.Cited by:[§1](https://arxiv.org/html/2605.22099#S1.p2.1)\.
- \[9\]V\. Karpukhin, B\. Oguz, S\. Min, P\. Lewis, L\. Wu, S\. Edunov, D\. Chen, and W\. Yih\(2020\)Dense passage retrieval for open\-domain question answering\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Online,pp\. 6769–6781\.Cited by:[§2](https://arxiv.org/html/2605.22099#S2.p1.1)\.
- \[10\]P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. K"uttler, M\. Lewis, W\. Yih, T\. Rockt"aschel, S\. Riedel, and D\. Kiela\(2020\)Retrieval\-augmented generation for knowledge\-intensive NLP tasks\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.33,Red Hook, NY, USA,pp\. 9459–9474\.Cited by:[§1](https://arxiv.org/html/2605.22099#S1.p1.1),[§2](https://arxiv.org/html/2605.22099#S2.p1.1)\.
- \[11\]C\. Lin\(2004\)ROUGE: a package for automatic evaluation of summaries\.InText Summarization Branches Out,Barcelona, Spain,pp\. 74–81\.Cited by:[§2](https://arxiv.org/html/2605.22099#S2.p2.1)\.
- \[12\]Y\. Liu, D\. Iter, Y\. Xu, S\. Wang, R\. Xu, and C\. Zhu\(2023\)G\-eval: nlg evaluation using gpt\-4 with better human alignment\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Singapore,pp\. 2511–2522\.Cited by:[§2](https://arxiv.org/html/2605.22099#S2.p3.1)\.
- \[13\]K\. Ly, D\. Valy, and P\. Kong\(2024\)Fine\-tuning for question answering in low\-resource languages: a case study on khmer\.In2024 17th International Congress on Advanced Applied Informatics \(IIAI\-AAI\-Winter\),Kitakyushu, Japan,pp\. 162–165\.Cited by:[§2](https://arxiv.org/html/2605.22099#S2.p5.1)\.
- \[14\]R\. Ng, T\. N\. Nguyen, Y\. Huang, N\. C\. Tai, W\. Y\. Leong, W\. Q\. Leong, X\. Yong, J\. G\. Ngui, Y\. Susanto, N\. Cheng, H\. Rengarajan, P\. Limkonchotiwat, A\. V\. Hulagadri, K\. W\. Teng, Y\. Y\. Tong, B\. Siow, W\. Y\. Teo, W\. Lau, C\. M\. Tan, B\. Ong, Z\. H\. Ong, J\. R\. Montalan, A\. Chan, S\. Antonyrex, R\. Lee, E\. Choa, D\. O\. Tat\-Wee, B\. J\. D\. Liu, W\. C\. Tjhi, E\. Cambria, and L\. Teo\(2025\)SEA\-lion: southeast asian languages in one network\.External Links:2504\.05747,[Link](https://arxiv.org/abs/2504.05747)Cited by:[§2](https://arxiv.org/html/2605.22099#S2.p4.1),[item 5](https://arxiv.org/html/2605.22099#S3.I2.i5.p1.1)\.
- \[15\]Ollama\(2024\)Ollama: run large language models locally\.Note:[https://ollama\.com](https://ollama.com/)Cited by:[item 1](https://arxiv.org/html/2605.22099#S3.I1.i1.p1.1)\.
- \[16\]OpenAI\(2024\)GPT\-4o mini\.Note:[https://openai\.com/index/gpt\-4o\-mini\-advancing\-cost\-efficient\-intelligence](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence)Cited by:[§3\.4](https://arxiv.org/html/2605.22099#S3.SS4.p1.4)\.
- \[17\]K\. Papineni, S\. Roukos, T\. Ward, and W\. Zhu\(2002\)BLEU: a method for automatic evaluation of machine translation\.InProceedings of the 40th Annual Meeting of the ACL,Philadelphia, PA, USA,pp\. 311–318\.Cited by:[§2](https://arxiv.org/html/2605.22099#S2.p2.1)\.
- \[18\]S\. Roychowdhury, S\. Soman, H\. G\. Ranjani, N\. Gunda, V\. Chhabra, and S\. K\. Bala\(2024\)Evaluation of RAG metrics for question answering in the telecom domain\.InICML 2024 Workshop on Foundation Models in the Wild,Vienna, Austria,pp\. 1–7\.Note:arXiv:2407\.12873Cited by:[Appendix A](https://arxiv.org/html/2605.22099#A1.p1.4),[§1](https://arxiv.org/html/2605.22099#S1.p1.1),[§2](https://arxiv.org/html/2605.22099#S2.p3.1),[§3\.4](https://arxiv.org/html/2605.22099#S3.SS4.p1.4),[§4\.3](https://arxiv.org/html/2605.22099#S4.SS3.SSS0.Px1.p1.1),[§4\.3](https://arxiv.org/html/2605.22099#S4.SS3.SSS0.Px2.p1.1),[§4\.3](https://arxiv.org/html/2605.22099#S4.SS3.SSS0.Px3.p1.1),[§4\.3](https://arxiv.org/html/2605.22099#S4.SS3.SSS0.Px4.p1.1),[§4\.3](https://arxiv.org/html/2605.22099#S4.SS3.SSS0.Px5.p2.6),[§5](https://arxiv.org/html/2605.22099#S5.p1.5),[§5](https://arxiv.org/html/2605.22099#S5.p2.1)\.
- \[19\]H\. Steck, C\. Ekanadham, and N\. Kallus\(2024\)Is cosine\-similarity of embeddings really about similarity?\.InCompanion Proceedings of the ACM on Web Conference 2024,New York, NY, USA,pp\. 887–890\.Cited by:[§4\.1](https://arxiv.org/html/2605.22099#S4.SS1.p2.1),[§4\.3](https://arxiv.org/html/2605.22099#S4.SS3.SSS0.Px2.p1.1),[§4\.4](https://arxiv.org/html/2605.22099#S4.SS4.p2.1)\.
- \[20\]S\. Sturua, I\. Mohr, M\. K\. Akram, M\. Günther, B\. Wang, M\. Krimmel, F\. Wang, G\. Mastrapas, A\. Koukounas, N\. Wang, and H\. Xiao\(2024\)Jina\-embeddings\-v3: multilingual embeddings with task lora\.External Links:2409\.10173Cited by:[item 2](https://arxiv.org/html/2605.22099#S3.I1.i2.p1.1)\.
- \[21\]Q\. Team\(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[item 1](https://arxiv.org/html/2605.22099#S3.I2.i1.p1.1)\.
- \[22\]Q\. Team\(2026\-02\)Qwen3\.5: accelerating productivity with native multimodal agents\.External Links:[Link](https://qwen.ai/blog?id=qwen3.5)Cited by:[item 2](https://arxiv.org/html/2605.22099#S3.I2.i2.p1.1)\.
- \[23\]T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi\(2019\)BERTScore: evaluating text generation with BERT\.External Links:1904\.09675Cited by:[§2](https://arxiv.org/html/2605.22099#S2.p2.1)\.
- \[24\]W\. Zhang, H\. P\. Chan, Y\. Zhao, M\. Aljunied, J\. Wang, C\. Liu, Y\. Deng, Z\. Hu, W\. Xu, Y\. K\. Chia, X\. Li, and L\. Bing\(2024\)SeaLLMs 3: open foundation and chat multilingual large language models for southeast asian languages\.External Links:2407\.19672Cited by:[item 4](https://arxiv.org/html/2605.22099#S3.I2.i4.p1.1)\.
- \[25\]Y\. Zhang, M\. Li, D\. Long, X\. Zhang, H\. Lin, B\. Yang, P\. Xie, A\. Yang, D\. Liu, J\. Lin, F\. Huang, and J\. Zhou\(2025\)Qwen3 embedding: advancing text embedding and reranking through foundation models\.arXiv preprint arXiv:2506\.05176\.Cited by:[item 3](https://arxiv.org/html/2605.22099#S3.I1.i3.p1.1)\.

## Appendix AComputation of RAGAS Metrics

We refer the reader to Es et al\.\[[4](https://arxiv.org/html/2605.22099#bib.bib13)\]and Roychowdhury et al\.\[[18](https://arxiv.org/html/2605.22099#bib.bib28)\]for details on the metrics, but for completeness, we summarize the prompts and steps involved in our adapted implementation\. The notation is as follows: given questionqqand contextc​\(q\)c\(q\)retrieved from the corpus, the LLM generates answera​\(q\)a\(q\)\. The ground truth answer isg​t​\(q\)gt\(q\)\.

### A\.1Faithfulness \(FaiFul\)

Faithfulness is computed in two steps\. First, the LLM judge decomposesa​\(q\)a\(q\)into atomic statementsS​\(q\)S\(q\)using the prompt:*“Given a question and answer, create one or more statements from each sentence in the given answer\.”*Second, for each statements∈S​\(q\)s\\in S\(q\), the LLM judge determines a binary verdictv​\(s,c​\(q\)\)v\(s,c\(q\)\)indicating whether the statement is supported by the context\. Faithfulness is the ratio of supported verdicts to total statements \(Equation 1\)\.

### A\.2Answer Relevance \(AnsRel\)

The LLM judge generatesNNquestions froma​\(q\)a\(q\)using the prompt:*“Generate a question for the given answer\.”*The cosine similarity between the embedding of the original questionqqand each generated questionq~i\\tilde\{q\}\_\{i\}is computed, and the average is reported as answer relevance \(Equation 2\)\.

### A\.3Context Relevance \(ConRel\)

In our implementation, context relevance is computed directly as the cosine similarity between the BGE\-M3 embedding of the questionqqand the BGE\-M3 embedding of the concatenated retrieved contextc​\(q\)c\(q\)\. Unlike the original LLM\-based extraction variant of RAGAS, this implementation does not require the judge model to extract relevant sentences from the context\.

### A\.4Factual Correctness \(FacCor\) and Answer Correctness \(AnsCor\)

The LLM judge classifies statements froma​\(q\)a\(q\)andg​t​\(q\)gt\(q\)into True Positives \(TP\), False Positives \(FP\), and False Negatives \(FN\)\. Factual correctness is the F1 score \(Equation 5\)\. Answer correctness is the weighted sum of factual correctness and answer similarity \(Equation 6\)\.

### A\.5Answer Similarity \(AnsSim\)

Answer similarity is the cosine similarity between BGE M3\-embedding ofa​\(q\)a\(q\)andg​t​\(q\)gt\(q\)\(Equation 4\)\. This metric does not involve the LLM judge\.

Table 5\. Single\-sample metric trace in English\.

Similar Articles

When Retrieval Doesn't Help: A Large-Scale Study of Biomedical RAG

arXiv cs.CL

A large-scale study across 5 models (7B–72B), 10 biomedical QA datasets, 4 retrieval methods, and 4 corpora finds that RAG yields only small and inconsistent gains (1–2 points) over no-retrieval baselines in biomedical question answering. The study concludes that the main bottleneck is not retrieval quality but models' limited ability to effectively use retrieved evidence.