EnergyLens is an end-to-end framework for predictive energy-aware optimization of multi-GPU LLM inference, validated on Llama3 and Qwen3-MoE, achieving mean absolute percentage errors between 9.25% and 13.19% and revealing significant energy variation across configurations.
This paper introduces Bot-Mod, a moderation framework that identifies malicious intent in multi-agent systems through multi-turn dialogue and Gibbs-based sampling, and presents a dataset from Moltbook for evaluation.
A paper introducing 'Context Is Not Control', an evaluation benchmark for assessing source-boundary failures when LLMs consume controlled, text-mediated evidence. It includes replication packages for open-weight and frontier API models.
This paper investigates methods for improving LLM accuracy in chart data extraction, finding that spatial priming via coordinate grids significantly outperforms semantic prompting strategies.
This paper investigates smoothness degradation in extremely quantized Large Language Models, arguing that preserving smoothness is crucial for maintaining performance beyond numerical accuracy.
This paper investigates whether standard benchmarks underestimate LLM performance by re-evaluating hallucination detection datasets using an LLM-first, human-adjudicated assessment method. The study finds that incorporating LLM reasoning into the adjudication process improves agreement and suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.
This paper introduces UniCharacter, a two-stage training framework for Customized Multimodal Role-Play (CMRP) that enables unified customization of persona, dialogue style, and visual identity. It presents the RoleScape-20 dataset and demonstrates that the model can achieve coherent cross-modal generation with minimal data.
This article introduces Magis-Bench, a benchmark for evaluating large language models on magistrate-level legal tasks such as judicial reasoning and sentence drafting, using data from Brazilian judicial exams.
Researchers propose applying the "global ignition" consciousness mechanism from cognitive science to long-context engineering, introducing the MiA-Signature method that uses submodular selection of high-level concepts to cover the activation space. Applied to RAG and agentic systems, it delivers consistent performance improvements across multiple long-context tasks.
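The submodular selection step could be approximated, purely for illustration, with a greedy facility-location objective, a standard submodular surrogate for coverage; the concept embeddings and the paper's actual objective are assumptions here, not MiA-Signature's real formulation:

```python
import numpy as np

def greedy_submodular_select(concept_vecs, k):
    """Greedily pick k concept vectors maximizing coverage of the
    activation space, measured here as the sum over all concepts of
    their best cosine similarity to any selected concept
    (a facility-location objective, monotone submodular)."""
    # normalize rows so dot products are cosine similarities
    V = concept_vecs / np.linalg.norm(concept_vecs, axis=1, keepdims=True)
    sims = V @ V.T                          # pairwise similarities
    selected, best_cover = [], np.zeros(len(V))
    for _ in range(k):
        # coverage if each candidate were added, minus current coverage
        gains = np.maximum(sims, best_cover).sum(axis=1) - best_cover.sum()
        gains[selected] = -np.inf           # never re-pick a concept
        j = int(np.argmax(gains))
        selected.append(j)
        best_cover = np.maximum(best_cover, sims[j])
    return selected
```

Because facility location is monotone submodular, this greedy loop carries the classic (1 - 1/e) approximation guarantee, which is presumably why such objectives suit "covering" an activation space.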
This paper introduces PRISM, a framework that integrates Vision-Language Models and Large Language Models through a dynamic question-answering pipeline to improve sequential decision-making in embodied AI tasks.
This position paper analyzes sycophancy in LLMs as a boundary failure between social alignment and epistemic integrity, proposing a new framework and taxonomy to classify and mitigate these behaviors.
This paper introduces Steering via Key-Orthogonal Projections (SKOP), a method to control LLM behavior by preventing attention rerouting, thereby reducing utility degradation while maintaining steering efficacy.
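One way to read "key-orthogonal" is that the steering vector is projected onto the orthogonal complement of the key projection's row space, so adding it to a hidden state cannot change the resulting keys or attention scores. This is an illustrative sketch of that linear-algebra idea, not SKOP's actual procedure:

```python
import numpy as np

def key_orthogonal(steer, W_k):
    """Project `steer` onto the orthogonal complement of the row space
    of key projection W_k, so W_k @ (h + steer_perp) == W_k @ h and
    attention scores computed from keys are untouched."""
    # orthonormal basis for W_k's row space: QR on its transpose
    Q, _ = np.linalg.qr(W_k.T)       # columns of Q span the row space
    return steer - Q @ (Q.T @ steer) # remove the in-row-space component
```

The behavioral component of the steering direction that lies outside the key subspace survives, which matches the paper's stated goal of steering without rerouting attention.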
Introduces LoVer, an unsupervised verifier that uses logical rules (negation consistency, intra-group and inter-group consistency) to improve LLM reasoning without labeled data, achieving performance close to supervised verifiers on reasoning benchmarks.
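A minimal example of one such rule: under negation consistency, a model's "yes" probability on a question and on its negation should roughly sum to 1, and large deviations flag unreliable answers. The threshold and exact rule form are assumptions for illustration, not LoVer's published formulation:

```python
def negation_consistent(p_yes_q, p_yes_neg_q, tol=0.2):
    """Unsupervised logical check: P(yes | question) and
    P(yes | negated question) should be near-complementary.
    Returns True when the pair is consistent within `tol`."""
    return abs((p_yes_q + p_yes_neg_q) - 1.0) <= tol
```

For instance, (0.9, 0.15) passes while (0.9, 0.8) fails; no labels are needed, which is the point of rule-based verification.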
This paper introduces ROPD, a rubric-based on-policy distillation framework that achieves superior sample efficiency compared to traditional logit-based methods. It enables model alignment in black-box scenarios by using structured semantic rubrics instead of teacher logits.
Researchers from Bangladesh University of Engineering and Technology present CBRS, a multi-platform framework that filters and parses blood donation requests from social media using a dual-layer architecture and a novel 11K bilingual Bengali-English dataset. Their LoRA-fine-tuned Llama-3.2-3B model achieves 99% filtering accuracy and 92% zero-shot parsing accuracy, outperforming GPT-4o-mini and other LLMs while using 35× fewer tokens.
Researchers from Beihang University and other institutions propose HalluSAE, a framework using sparse autoencoders and phase transition theory to detect hallucinations in LLMs by modeling generation as trajectories through a potential energy landscape and identifying critical transition zones where factual errors occur.
Research introduces Skill-RAG, a novel approach that integrates Skills with Retrieval-Augmented Generation, addressing the inefficiency of traditional RAG systems that retrieve on every query regardless of whether the model actually needs the information.