Little Brains, Big Feats: Exploring Compact Language Models
Summary
This paper benchmarks 17 compact language models (1B-8B parameters) as generators in Russian-language RAG systems under CPU-only inference, finding that Qwen-family models offer strong quality-latency tradeoffs for private, GPU-free deployment.
Source: https://huggingface.co/papers/2606.30062
Can small language models be strong enough for practical RAG generation without GPUs?
We benchmark 17 compact language models from 1B to 8B parameters as generators in Russian-language Retrieval-Augmented Generation systems. All candidate models were evaluated as local GGUF variants, including Q4_K_M and Q5_K_M quantized models, under CPU-only inference constraints.
The evaluation uses a 500-sample benchmark built from five Russian-language QA datasets, including open-source and proprietary domain-specific data. Responses are assessed with a multi-judge LLM-as-a-Judge setup across correctness, answer relevance, faithfulness, context relevance, and latency.
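As a rough illustration of how a multi-judge setup like this might be scored, the sketch below averages each quality metric over several judges for one response, then over all samples. The abstract does not say how judge scores are combined, so the plain mean, the [0, 1] score range, and the function names `aggregate_judges` and `aggregate_benchmark` are assumptions made for this example, not the authors' method.

```python
from statistics import mean

# Quality metrics named in the abstract (latency is measured separately).
METRICS = ("correctness", "answer_relevance", "faithfulness", "context_relevance")


def aggregate_judges(judge_scores):
    """Average each metric over the judges for a single response.

    judge_scores: list of dicts, one per judge, mapping metric -> score.
    ASSUMPTION: scores lie in [0, 1] and judges are weighted equally.
    """
    return {m: mean(scores[m] for scores in judge_scores) for m in METRICS}


def aggregate_benchmark(per_sample_judge_scores):
    """Average the per-response means over all benchmark samples."""
    per_sample = [aggregate_judges(js) for js in per_sample_judge_scores]
    return {m: mean(row[m] for row in per_sample) for m in METRICS}


if __name__ == "__main__":
    # Hypothetical scores from two judges on two responses (not paper data).
    sample_1 = [dict.fromkeys(METRICS, 0.5), dict.fromkeys(METRICS, 1.0)]
    sample_2 = [dict.fromkeys(METRICS, 0.25), dict.fromkeys(METRICS, 0.25)]
    print(aggregate_benchmark([sample_1, sample_2]))
```

A median or majority vote would be an equally plausible way to combine judges; the mean is used here only because it is the simplest to read.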
A clear pattern emerges: Qwen-family models dominate the top-performing SLM tier in this setting. Qwen3-8B-Q4_K_M achieved the strongest overall SLM quality, reaching 0.72 correctness and 0.83 faithfulness, approaching the GPT-5-mini baseline on correctness. At the same time, Qwen3-4B-Instruct-2507-Q5_K_M provided the best practical quality–latency trade-off, with 0.71 correctness, 0.89 answer relevance, 0.80 faithfulness, and substantially lower CPU latency than the 8B model. Qwen2.5-7B-Instruct-Q4_K_M was also a strong candidate, showing high answer relevance and faithfulness with moderate latency.
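One way to make a quality-latency trade-off concrete is to keep only the models that no other model beats on both quality and latency (a Pareto front). The sketch below is an illustration under stated assumptions: the abstract does not describe how the trade-off was judged, and the `Candidate` type, the `pareto_front` function, and all numbers in the usage example are hypothetical placeholders rather than results from the paper.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    quality: float    # e.g. mean correctness, higher is better
    latency_s: float  # e.g. mean CPU latency in seconds, lower is better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Return candidates that no other candidate dominates.

    A candidate is dominated if another one is at least as good on both
    axes and strictly better on at least one of them.
    """
    front = []
    for c in cands:
        dominated = any(
            o.quality >= c.quality
            and o.latency_s <= c.latency_s
            and (o.quality > c.quality or o.latency_s < c.latency_s)
            for o in cands
        )
        if not dominated:
            front.append(c)
    # Sort fastest first so the trade-off reads left to right.
    return sorted(front, key=lambda c: c.latency_s)


if __name__ == "__main__":
    # Placeholder values for illustration only (NOT from the paper).
    models = [
        Candidate("model_a", 0.70, 10.0),
        Candidate("model_b", 0.60, 12.0),  # slower and worse than model_a
        Candidate("model_c", 0.75, 30.0),
        Candidate("model_d", 0.50, 5.0),
    ]
    for m in pareto_front(models):
        print(m.name, m.quality, m.latency_s)
```

With the placeholder values above, `model_b` drops out because `model_a` is both faster and more accurate. The remaining models each represent a different point on the trade-off curve.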
Our findings suggest that carefully selected quantized SLMs, especially from the Qwen family, can be competitive RAG generators while enabling local, private, and GPU-free deployment. The work is especially relevant for on-device AI, privacy-sensitive applications, edge deployment, and production RAG systems with limited compute budgets.
Accepted to ECML PKDD 2026 Applied Data Science Track. Author’s preprint version.
Similar Articles
LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language
The article introduces LLiMba, a 3B parameter model adapted from Qwen2.5 for Sardinian using continued pretraining and supervised fine-tuning on a single consumer GPU. It evaluates various LoRA configurations, finding that adapter capacity significantly impacts performance and factual accuracy in low-resource language adaptation.
Diffusion Language Models: An Experimental Analysis
A systematic experimental analysis evaluating eight state-of-the-art Diffusion Language Models across multiple benchmarks, analyzing trade-offs between generation quality and computational efficiency.
Exploring Lightweight Large Language Models for Court View Generation
This paper systematically explores the capabilities of lightweight (<2B) large language models for criminal court view generation, investigating trade-offs between model architecture, size, and impact on charge prediction. The authors also introduce CVGEvalKit, an evaluation framework with three public datasets.
Large Language Models Are Overkill For Some Marketing Tasks. Enter The Small Language Model
ZeroGPU launches specialized small language models (SLMs) for ad tech tasks, offering lower costs and faster performance compared to large language models. The SLMs run on CPUs and have already reduced expenses for early adopter Dappier by 50%.
@cjzafir: VLMs (Vertical Language Models) are beating top LLMs. These small 7B to 15B niche-focused models are beating SoTA model…
The author demonstrates that small vertical language models (6B-15B) can outperform top LLMs on niche benchmarks through cost-effective fine-tuning using open-source models and Codex orchestration, achieving results with a $300 dataset.