QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard
- 🔍 The Problem: Arabic NLP Evaluation Is Fragmented and Unvalidated
- ⛰ What’s in QIMMA?
- 🔬 The Quality Validation Pipeline
  - Stage 1: Multi-Model Automated Assessment
  - Stage 2: Human Annotation and Review
- ⚠️ What We Found: Systematic Quality Problems
  - By the Numbers
  - Taxonomy of Issues Found
- 💻 Code Benchmark: A Different Kind of Quality Work
- ⚙️ Evaluation Setup
  - Evaluation Framework
  - Metrics by Task Type
  - Prompt Templates
- 🏆 Leaderboard Results
  - The Size-Performance Relationship
- 🌟 What Makes QIMMA Different
- 🔗 Resources
- 🔖 Citation
QIMMA validates benchmarks before evaluating models, ensuring reported scores reflect genuine Arabic language capability in LLMs.
If you’ve been tracking Arabic LLM evaluation, you’ve probably noticed a growing tension: the number of benchmarks and leaderboards is expanding rapidly, but are we actually measuring what we think we’re measuring?
We built QIMMA قِمّة (Arabic for “summit”) to answer that question systematically. Instead of aggregating existing Arabic benchmarks as-is and running models on them, we applied a rigorous quality validation pipeline before any evaluation took place. What we found was sobering: even widely used, well-regarded Arabic benchmarks contain systematic quality issues that can quietly corrupt evaluation results.
This post walks through what QIMMA is, how we built it, what problems we found, and what the model rankings look like once you clean things up.
🔍 The Problem: Arabic NLP Evaluation Is Fragmented and Unvalidated
Arabic is spoken by over 400 million people across diverse dialects and cultural contexts, yet the Arabic NLP evaluation landscape remains fragmented. A few key pain points have motivated this work:
**Translation issues.** Many Arabic benchmarks are translations from English. This introduces distributional shifts: questions that feel natural in English become awkward or culturally misaligned in Arabic, making benchmark data less representative of how Arabic is naturally used.
**Absent quality validation.** Even native Arabic benchmarks are often released without rigorous quality checks. Annotation inconsistencies, incorrect gold answers, encoding errors, and cultural bias in ground-truth labels have all been documented in established resources.
**Reproducibility gaps.** Evaluation scripts and per-sample outputs are rarely released publicly, making it hard to audit results or build on prior work.
**Coverage fragmentation.** Existing leaderboards cover isolated tasks and narrow domains, making holistic model assessment difficult.
To illustrate where QIMMA sits relative to existing platforms:
| Leaderboard | Open Source | Native Arabic | Quality Validation | Coding Eval | Public Outputs |
|---|---|---|---|---|---|
| OALL v1 | ✅ | Mixed | ❌ | ❌ | ✅ |
| OALL v2 | ✅ | Mostly | ❌ | ❌ | ✅ |
| BALSAM | Partial | 50% | ❌ | ❌ | ❌ |
| AraGen | ✅ | 100% | ✅ | ❌ | ❌ |
| SILMA ABL | ✅ | 100% | ✅ | ❌ | ✅ |
| ILMAAM | Partial | 100% | ✅ | ❌ | ❌ |
| HELM Arabic | ✅ | Mixed | ❌ | ❌ | ✅ |
| ⛰ QIMMA | ✅ | 99% | ✅ | ✅ | ✅ |

QIMMA is the only platform combining all five properties: open source, predominantly native Arabic content, systematic quality validation, code evaluation, and public per-sample inference outputs.
⛰ What’s in QIMMA?
QIMMA consolidates 109 subsets from 14 source benchmarks into a unified evaluation suite of over 52,000 samples, spanning 7 domains:
| Domain | Benchmarks | Task Types |
|---|---|---|
| Cultural | AraDiCE-Culture, ArabCulture, PalmX | MCQ |
| STEM | ArabicMMLU, GAT, 3LM STEM | MCQ |
| Legal | ArabLegalQA, MizanQA | MCQ, QA |
| Medical | MedArabiQ, MedAraBench | MCQ, QA |
| Safety | AraTrust | MCQ |
| Poetry & Literature | FannOrFlop | QA |
| Coding | 3LM HumanEval+, 3LM MBPP+ | Code |

A few things stand out about this design:
- **99% native Arabic content.** The only exception is code evaluation, which is inherently language-agnostic.
- **First Arabic leaderboard with code evaluation.**QIMMA integrates Arabic-adapted versions of HumanEval+ and MBPP+, making it possible to assess coding capability with Arabic-language problem statements.
- **Diversity in Domains and Tasks.**QIMMA evaluates real-world competency areas including education, governance, healthcare, creative expression, and software development.
🔬 The Quality Validation Pipeline
This is the methodological heart of QIMMA. Before running a single model, we applied a multi-stage validation pipeline to every sample in every benchmark.
Stage 1: Multi-Model Automated Assessment
Each sample was independently evaluated by two state-of-the-art LLMs:
- Qwen3-235B-A22B-Instruct
- DeepSeek-V3-671B
We chose two models with strong Arabic capability but different training data compositions, so that their combined judgment is more robust than either alone.
Each model scores a sample against a 10-point rubric, with binary scores (0 or 1) per criterion:

A sample is eliminated if either model scores it below 7/10. Samples where both models agree on elimination are dropped immediately. However, where only one model flags a sample, it proceeds to human review in Stage 2.
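To make the decision rule concrete, here is a minimal sketch of how the Stage 1 filtering could be wired up, assuming each judge returns ten binary rubric scores per sample. The function and variable names are illustrative, not QIMMA’s released code.

```python
# Minimal sketch of the Stage 1 decision rule described above.
# The score layout and function name are illustrative only.

THRESHOLD = 7  # a sample must score at least 7/10 with every judge model


def stage1_decision(judge_scores: dict[str, list[int]]) -> str:
    """judge_scores maps a judge model name to its 10 binary rubric scores.

    Returns "keep", "drop", or "human_review" following the rule:
    - no judge flags the sample        -> keep
    - both judges flag it (< 7/10)     -> drop immediately
    - exactly one judge flags it       -> send to Stage 2 human review
    """
    flagged = [sum(scores) < THRESHOLD for scores in judge_scores.values()]
    if not any(flagged):
        return "keep"
    if all(flagged):
        return "drop"
    return "human_review"


# Example: one judge passes the sample, the other flags it -> human review.
print(stage1_decision({
    "Qwen3-235B-A22B-Instruct": [1] * 9 + [0],                   # 9/10
    "DeepSeek-V3-671B": [1, 1, 1, 0, 0, 0, 1, 0, 1, 1],          # 6/10
}))  # -> "human_review"
```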
Stage 2: Human Annotation and Review
Flagged samples are reviewed by native Arabic speakers with cultural and dialectal familiarity. Human annotators make final calls on:
- Cultural context and regional variation
- Dialectal nuance
- Subjective interpretation
- Subtle quality issues automated assessment may miss
For culturally sensitive content, multiple perspectives are considered, since “correctness” can genuinely vary across Arab regions.
⚠️ What We Found: Systematic Quality Problems
The pipeline revealed recurring quality issues across benchmarks: not isolated errors, but systematic patterns reflecting gaps in how the benchmarks were originally constructed.
By the Numbers

| Benchmark | Total Samples | Discarded | Discard Rate |
|---|---|---|---|
| ArabicMMLU | 14,163 | 436 | 3.1% |
| MizanQA | 1,769 | 41 | 2.3% |
| PalmX | 3,001 | 25 | 0.8% |
| MedAraBench | 4,960 | 33 | 0.7% |
| FannOrFlop | 6,984 | 43 | 0.6% |
| ArabCulture | 3,482 | 7 | 0.2% |
| MedArabiQ | 499 | 1 | 0.2% |
| GAT | 13,986 | 1 | ~0.0% |
| 3LM STEM | 2,609 | 1 | ~0.0% |
| AraDiCE-Culture | 180 | 0 | 0.0% |
| ArabLegalQA | 79 | 0 | 0.0% |
| AraTrust | 522 | 0 | 0.0% |
Taxonomy of Issues Found
⚖️ Answer Quality
False or mismatched gold indices, factually wrong answers, missing or raw text answers.
📄 Text & Formatting Quality
Corrupt or illegible text, spelling and grammar errors, and duplicate samples.
💬 Cultural Sensitivity
Stereotype reinforcement and monolithic generalizations about diverse communities.
🤝 Gold Answer Compliance
Misalignment of gold answers with evaluation protocols.
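Some of these issue classes lend themselves to cheap programmatic pre-checks before any LLM judge runs. The sketch below is illustrative only; the field names (question, choices, gold_index) are assumptions for the example and not the actual benchmark schemas.

```python
# Illustrative pre-checks for a few of the issue classes above.
# Field names are assumed for the sketch; real benchmark schemas vary.

def basic_checks(sample: dict, seen_questions: set[str]) -> list[str]:
    issues = []

    # Gold answer compliance: the gold index must point at an existing choice.
    if not (0 <= sample["gold_index"] < len(sample["choices"])):
        issues.append("gold index out of range")

    # Text & formatting quality: replacement characters signal encoding damage.
    if "\ufffd" in sample["question"]:
        issues.append("corrupt/illegible text")

    # Duplicate samples within a benchmark.
    if sample["question"] in seen_questions:
        issues.append("duplicate question")
    seen_questions.add(sample["question"])

    return issues
```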
💻 Code Benchmark: A Different Kind of Quality Work
Code benchmarks required a different intervention. Rather than discarding samples, we refined the Arabic problem statements in 3LM’s Arabic adaptations of HumanEval+ and MBPP+, leaving task identifiers, reference solutions, and test suites completely unchanged.
The modification rates were striking:
| Benchmark | Total Prompts | Modified | Unchanged | Modification Rate |
|---|---|---|---|---|
| 3LM HumanEval+ | 164 | 145 | 19 | 88% |
| 3LM MBPP+ | 378 | 308 | 70 | 81% |

Modifications fell into five categories:
- Linguistic refinement: normalizing toward natural Modern Standard Arabic and consistent imperative style
- Clarity improvements: fixing ambiguous instructions and unclear constraints
- Consistency normalization: standardizing mathematical terminology, punctuation, and example formatting
- Structural corrections: fixing broken triple-quoted strings, indentation errors, corrupted text fragments
- Semantic refinements: clarifying whether ranges are inclusive/exclusive, preserving task intent
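Because the refinement touches only the Arabic problem statements, the invariant (identical task identifiers, reference solutions, and test suites) can be checked mechanically. Here is a hedged sketch of such a check; the field names follow the usual HumanEval/MBPP layout and are assumptions about the released files, not guarantees.

```python
# Sketch of a consistency check for the refined code benchmarks: only the
# Arabic problem statement should differ between the original and refined
# sets. Field names ("task_id", "prompt", "canonical_solution", "test")
# are assumed here and may not match the released datasets exactly.

def check_only_prompts_changed(original: list[dict], refined: list[dict]) -> None:
    orig_by_id = {s["task_id"]: s for s in original}
    modified = 0
    for ref in refined:
        orig = orig_by_id[ref["task_id"]]
        # Reference solutions and test suites must be byte-identical.
        assert ref["canonical_solution"] == orig["canonical_solution"]
        assert ref["test"] == orig["test"]
        if ref["prompt"] != orig["prompt"]:
            modified += 1
    print(f"{modified}/{len(refined)} prompts modified "
          f"({100 * modified / len(refined):.0f}%)")
```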
⚙️ Evaluation Setup
Evaluation Framework
QIMMA uses LightEval, EvalPlus, and FannOrFlop as its evaluation frameworks, chosen for consistency, multilingual community adoption, and reproducibility.
Metrics by Task Type

| Task Type | Metric | Benchmarks |
|---|---|---|
| MCQ | Normalized Log-Likelihood Accuracy | AraDiCE-Culture, ArabicMMLU, ArabCulture, PalmX, 3LM STEM, MedArabiQ, GAT, MedAraBench, AraTrust |
| Multi-select MCQ | Probability Mass on Gold Choices | MizanQA |
| Generative QA | F1 BERTScore (AraBERT v02) | MedArabiQ, ArabLegalQA, FannOrFlop |
| Code | Pass@1 | 3LM HumanEval+, 3LM MBPP+ |
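Two of these metrics are easy to sketch in a few lines. The normalization below divides each choice’s log-likelihood by its length, a common convention for “normalized” log-likelihood accuracy, and the pass@k estimator is the standard unbiased formula with k = 1; QIMMA’s exact implementations may differ.

```python
import math

# Sketch of two metrics from the table above. Length normalization and the
# item layout are assumptions for illustration, not QIMMA's exact code.

def mcq_normalized_accuracy(items: list[dict]) -> float:
    """Each item: {"choice_logliks": [...], "choice_lengths": [...], "gold": int}.
    The predicted choice maximizes log-likelihood per answer token/character."""
    correct = 0
    for it in items:
        normed = [ll / max(length, 1)
                  for ll, length in zip(it["choice_logliks"], it["choice_lengths"])]
        if max(range(len(normed)), key=normed.__getitem__) == it["gold"]:
            correct += 1
    return correct / len(items)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n generated samples, c of them correct.
    With one sample per task (n = k = 1) this reduces to plain pass/fail."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```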
Prompt Templates
QIMMA standardizes prompting by question format, with six template types:
- MCQ: generic multiple choice
- MCQ-C: multiple choice with a context passage
- MCQ-I: multiple choice with specific instructions (GAT analogy/completion)
- QA: generic open-ended QA
- QA-C: QA with context
- QA-F: fill-in-the-blank QA
All prompts are in Arabic. For MizanQA and ArabCulture, benchmark-specific system prompts from the original papers are preserved.
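As an illustration of how the six template types might be dispatched, here is a sketch with English placeholders standing in for the Arabic instructions; the template strings, field names, and function are hypothetical and not QIMMA’s actual prompts.

```python
# Sketch of dispatching by question format. The real QIMMA prompts are in
# Arabic; the placeholder strings and field names below are illustrative only.

TEMPLATES = {
    "MCQ":   "<Arabic MCQ instruction>\n{question}\n{choices}",
    "MCQ-C": "<Arabic MCQ instruction>\n{context}\n{question}\n{choices}",
    "MCQ-I": "<Arabic task-specific instruction, e.g. GAT analogies>\n{question}\n{choices}",
    "QA":    "<Arabic open-ended QA instruction>\n{question}",
    "QA-C":  "<Arabic QA instruction>\n{context}\n{question}",
    "QA-F":  "<Arabic fill-in-the-blank instruction>\n{question}",
}

def build_prompt(template_type: str, **fields: str) -> str:
    return TEMPLATES[template_type].format(**fields)
```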
🏆 Leaderboard Results
Results as of April 2026, covering the top 10 evaluated models. Visit the live leaderboard for current rankings.
| Rank | Model | Average | AraDiCE-Culture | ArabicMMLU | ArabCulture | PalmX | 3LM STEM | AraTrust | MizanQA | MedArabiQ | ArabLegalQA | GAT | MedAraBench | HumanEval+ | MBPP+ | FannOrFlop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 1 | Qwen/Qwen3.5-397B-A17B-FP8 | 68.06 | 82.78 | 77.54 | 61.75 | 83.91 | 88.67 | 90.04 | 73.36 | 47.30 | 54.94 | 55.89 | 47.97 | 67.68 | 76.72 | 44.33 |
| 🥈 2 | Applied-Innovation-Center/Karnak | 66.20 | 73.33 | 80.94 | 53.49 | 81.40 | 93.10 | 89.08 | 55.92 | 55.78 | 71.58 | 61.06 | 54.19 | 33.54 | 64.55 | 58.91 |
| 🥉 3 | inceptionai/Jais-2-70B-Chat | 65.81 | 78.89 | 81.29 | 83.24 | 83.73 | 87.96 | 90.23 | 71.78 | 52.79 | 69.60 | 51.67 | 50.89 | 19.51 | 43.65 | 56.13 |
| 4 | Qwen/Qwen2.5-72B-Instruct | 65.75 | 77.22 | 73.78 | 63.83 | 77.77 | 87.55 | 88.51 | 63.49 | 50.06 | 70.74 | 55.90 | 44.19 | 37.20 | 72.75 | 57.51 |
| 5 | Applied-Innovation-Center/AIC-1 | 65.37 | 73.33 | 72.02 | 77.52 | 76.11 | 88.13 | 90.61 | 56.36 | 53.75 | 68.96 | 62.11 | 50.78 | 28.05 | 69.58 | 47.83 |
| 6 | Qwen/Qwen3.5-122B-A10B | 64.84 | 74.44 | 73.17 | 37.78 | 81.46 | 86.18 | 86.97 | 64.01 | 47.04 | 55.11 | 50.90 | 52.49 | 65.24 | 72.43 | 60.54 |
| 7 | Sakalti/Ultiima-72B | 64.49 | 78.33 | 72.28 | 68.79 | 76.75 | 83.70 | 89.08 | 60.44 | 44.58 | 69.12 | 46.91 | 42.25 | 39.02 | 74.07 | 57.56 |
| 8 | meta-llama/Llama-3.3-70B-Instruct | 63.96 | 77.22 | 71.57 | 78.05 | 77.95 | 88.28 | 85.63 | 67.44 | 56.25 | 64.00 | 51.13 | 54.86 | 27.44 | 71.16 | 24.43 |
| 9 | Qwen/Qwen2.5-32B-Instruct | 63.26 | 70.56 | 68.76 | 75.80 | 72.07 | 81.03 | 85.82 | 53.78 | 48.08 | 69.27 | 56.94 | 36.51 | 34.15 | 72.75 | 93.10 |
| 10 | FreedomIntelligence/AceGPT-v2-32B-Chat | 61.14 | 76.67 | 70.62 | 79.79 | 74.46 | 84.88 | 86.97 | 63.89 | 49.96 | 71.46 | 56.04 | 47.32 | 23.78 | 54.50 | 15.56 |
- **Scale does not guarantee best performance.** The top 10 spans models from 32B to 397B parameters, with several mid-size models outperforming larger ones on specific domains.
- **Arabic-specialized models lead on cultural and linguistic tasks.** Jais-2-70B-Chat ranks highest on ArabicMMLU and ArabCulture, while Karnak leads on 3LM STEM and ArabLegalQA.
- **Coding remains the hardest domain for Arabic-specialized models.** The top HumanEval+ and MBPP+ scores belong to multilingual models, with Qwen3.5-397B leading both.
The Size-Performance Relationship
Across the full leaderboard (46 models), a clear but imperfect size-performance correlation emerges. However, there are interesting exceptions:
- Arabic-specialized models often outperform size-matched multilingual models
- Instruction-tuned models consistently outperform their base counterparts, with Qwen3 as the exception
- Some smaller Arabic-specialized models (Fanar-1-9B, ALLaM-7B) outperform much larger multilingual models on specific domains
🌟 What Makes QIMMA Different
To summarize the distinctive properties of QIMMA:
| Property | Details |
|---|---|
| Quality-first philosophy | Validation runs before evaluation, not as an afterthought |
| Multi-model validation | Two LLMs with different training + human review for flagged cases |
| 99% native Arabic | Avoids translation artifacts almost entirely |
| Multi-domain, multi-task | 7 domains, 3 task types (MCQ, QA, code), 109 subsets |
| Code evaluation | First Arabic leaderboard to include code generation |
| Full transparency | Per-sample inference outputs publicly released, not just aggregate scores |
| LightEval-based | Unified, reproducible evaluation codebase |
| Dialectal awareness | Explicit handling of MSA vs. dialectal variation in prompts and rubrics |
🔗 Resources
- 🏆 Leaderboard: QIMMA Leaderboard
- 💻 Code: GitHub
- 📄 Paper: Are Arabic Benchmarks Reliable? QIMMA’s Quality-First Approach to LLM Evaluation
🔖 Citation
@misc{alqadi2026arabicbenchmarksreliableqimmas,
title={Are Arabic Benchmarks Reliable? QIMMA's Quality-First Approach to LLM Evaluation},
author={Leen AlQadi and Ahmed Alzubaidi and Mohammed Alyafeai and Hamza Alobeidli and Maitha Alhammadi and Shaikha Alsuwaidi and Omar Alkaabi and Basma El Amel Boussaha and Hakim Hacid},
year={2026},
eprint={2604.03395},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.03395},
}