An independent analysis tested 100+ LLMs on 117 political questions to map their ideological alignment, revealing that DeepSeek and Grok lean left while most other models cluster near the center or right.
Physics-intern is an agentic framework for theoretical physics that improves Gemini 3.1 Pro's performance on the CritPt benchmark from 17.7% to 31.4%, achieving a new state-of-the-art.
A benchmark study demonstrates that using LLMs to analyze entire codebases is cost-effective, identifying DeepSeek V4 Flash as the optimal default model due to its low cost and comparable accuracy to premium options like Claude Opus.
PACT introduces a head-to-head negotiation benchmark for LLMs using a 20-round buyer-seller bargaining game to test persuasion and adaptation. Top performers include GPT-5.5 and Opus 4.7, with ratings computed via Glicko-2 on an Elo-like scale.
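Glicko-2 extends Elo with a rating deviation and volatility term per player, but the underlying scale is the same 400-point logistic one. As a rough illustration of how such ratings move after head-to-head results (a simplified plain-Elo update with invented match outcomes, not PACT's actual scoring code):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats B on a 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32):
    """Update both ratings after one match; score_a is 1 (win), 0.5 (tie), or 0."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# Invented results: model A scores two wins and a tie against model B.
ra, rb = 1500.0, 1500.0
for score in (1.0, 1.0, 0.5):
    ra, rb = elo_update(ra, rb, score)
print(round(ra), round(rb))  # 1528 1472
```

Note that the updates are zero-sum: points gained by one side are lost by the other, so the pool average stays fixed. Glicko-2 additionally shrinks the step size as a player's rating becomes more certain.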
The paper introduces IndustryBench, a benchmark evaluating LLMs on industrial procurement QA in Chinese against national standards, highlighting safety compliance gaps. It reveals that extended reasoning often lowers safety-adjusted scores and reshuffles model rankings when safety violations are considered.
Netizens question StepFun's (阶跃星辰) premature push toward commercialization, while praising Xiaomi MiMo's AI coding experience as on par with or better than Claude's, and faster.
A developer shares local inference benchmarks and systemd configurations for running the Qwen3.6-27B model on an NVIDIA RTX Pro 4500 Blackwell GPU using llama.cpp. The post requests optimization tips for throughput and explores potential use cases for larger models.
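For context, a systemd unit for llama.cpp's server typically looks like the sketch below. The paths, user, and tuning flags here are placeholders, not the poster's actual configuration; `-ngl 99` offloads all layers to the GPU and `-c` sets the context length.

```ini
# Hypothetical /etc/systemd/system/llama-server.service — all paths
# and flag values are illustrative placeholders.
[Unit]
Description=llama.cpp server for Qwen3.6-27B
After=network.target

[Service]
Type=simple
User=llama
ExecStart=/usr/local/bin/llama-server \
    -m /opt/models/qwen3.6-27b-q4_k_m.gguf \
    -ngl 99 -c 8192 --host 127.0.0.1 --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After placing the file, `systemctl daemon-reload && systemctl enable --now llama-server` starts the server and keeps it running across reboots.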
An evaluation of four LLMs (including Qwen, MiniMax, and GLM variants) using Claude as a prompter for the Opencode agent tool reveals that a smaller local model (Qwen 27B on a 3090) outperforms a larger pruned model in coding quality and reliability.
AlignCultura introduces CulturaX, a UNESCO-grounded dataset and two-stage pipeline for culturally aligning LLMs, showing 4–6% HHH gains and 18% fewer cultural failures on Qwen3-8B and DeepSeek-R1-Distill-Qwen-7B.
An empirical study of four 30B-class dense and MoE models shows that Gemma-4 26B MoE delivers equal accuracy at 1.9–15 Wh, while dense and larger MoE variants consume up to 34 Wh on the same reasoning tasks.
Opus 4.7 has taken the #1 spot on the LLM Debate Benchmark, surpassing Sonnet 4.6 by 106 BT points with a perfect record of 51 wins, 4 ties, and 0 losses in side-swapped matchups. The model wins by identifying and controlling the central hinge of debates, forcing opponents onto its terms.
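Assuming the leaderboard uses the standard Bradley-Terry logistic model on a 400-point base (the usual Elo-like convention; not confirmed by the source), a 106-point gap translates to roughly a 65% expected win rate for the higher-rated model:

```python
def win_prob(gap: float, base: float = 400.0) -> float:
    """Expected score of the higher-rated model given a rating gap,
    under the standard logistic (Bradley-Terry) model."""
    return 1.0 / (1.0 + 10 ** (-gap / base))

print(f"{win_prob(106):.3f}")  # 0.648
```

The reported 51-0-4 record is far lopsided even for that gap, consistent with the ratings still converging toward the true skill difference.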
This paper introduces LifeDialBench, a novel benchmark for evaluating memory capabilities in continuous lifelog scenarios using wearable devices, and proposes an online evaluation protocol that enforces temporal causality. Key finding: sophisticated memory systems underperform simple RAG baselines, highlighting the importance of high-fidelity context preservation over lossy compression.
A user reports achieving impressive results with Qwen 3.6 35B running a 'Browser OS' implementation locally, highlighting the model's capability for complex task execution without cloud dependencies.