Tag
This paper introduces Pre-Flight, an open-source benchmark of 300 multiple-choice questions designed to evaluate large language models on aviation operational knowledge, covering international regulations and ground operations. Results show that even the best models in 2026 score 82.7%, significantly below the expert reference of ~95%, highlighting a persistent reliability gap.
This paper investigates the effects of domain-specific expert pruning on both utility and factual reliability of Mixture-of-Experts (MoE) models in the biomedical domain. It finds that moderate pruning preserves in-domain utility without immediate reliability loss, but extreme pruning increases hallucination risks, and generalization degrades rapidly in cross-domain settings.
Emphasizes the importance of open-source AI models for domain-specific tasks, local deployment, and continuous improvement, advocating for owning intelligence rather than renting it.
This paper proposes a modular pipeline that uses a domain-specific knowledge graph to generate multi-hop QA pairs and fine-tune a reasoning LLM (Qwen3-4B) for the travel domain, achieving 82.4% exact match accuracy, significantly outperforming the baseline.
This paper presents an empirical study and benchmark for evaluating tool-augmented LLM agents on real-world energy analytics tasks, comprising 243 expert-curated problems across market data retrieval, knowledge interpretation, and quantitative modeling.
Presents DV-DPO, a method to fine-tune Qwen2.5-7B on domain-specific tasks using only ~$3 in API calls and zero human labelers, achieving 96% of Claude Haiku's composite performance via adversarial cross-examination.
This paper introduces ChristBERT, a family of domain-specific RoBERTa-based language models for German clinical NLP, and evaluates three domain adaptation strategies (continued pre-training, pre-training from scratch, and vocabulary adaptation) on medical named entity recognition and text classification tasks, achieving state-of-the-art results.
Proposes KOFF, a framework that decomposes pretrained LLMs into a sparse shared backbone and domain-specific external memories using structured pruning and LoRA adapters, achieving 12% sparsity without significant performance loss.
This paper introduces MechVQA, a dataset with 3.3k high-density mechanical engineering drawings and 21k question-answer pairs, along with the MechVL model that outperforms existing baselines by 7.57 percentage points on the MechVQA total score, advancing multimodal LLM understanding of mechanical drawings.
DOMINO is a novel framework that learns minimal sufficient domain representations from reference examples to synthesize domain-specific data for LLMs, improving code benchmark performance without requiring explicit domain descriptions.
This paper introduces MultiSeismo, a large-scale multimodal seismic dataset with over 16K events integrating waveforms, intensity maps, and metadata, along with the MISCE instruction set and SeisModal, a fine-tuned multimodal model for cross-modal seismic understanding.
FAB-Bench is a benchmark framework for evaluating Retrieval-Augmented Generation (RAG) systems in semiconductor manufacturing, with six diagnostic metrics and analysis across context windows. It provides 200 curated query-answer pairs and reveals context-scaling behaviors and attention dilution issues.
Palette proposes a modular framework for selectively relaxing safety refusal behaviors in LLMs for authorized professional domains, using multi-objective search and lightweight adaptation to avoid costly retraining.
Agentic search models are LLMs trained specifically for orchestrating search tasks, offering smaller, faster, and domain-specific alternatives to general models like GPT-5. They unbundle the traditional monolithic search stack by allowing an intelligent model to manage the entire retrieval process.
BAGEL is a new benchmark for evaluating animal-related knowledge in large language models, constructed from diverse scientific sources and covering taxonomy, morphology, habitat, behavior, and species interactions through closed-book question-answer pairs. The benchmark enables fine-grained analysis across taxonomic groups and knowledge categories, providing insights into model strengths and failure modes for biodiversity applications.