llm-benchmark

#llm-benchmark

Which AI is closest to your political views? I tested 100+ LLMs on the same 117 questions

Reddit r/ArtificialInteligence ↗ · 7h ago

An independent analysis tested 100+ LLMs on 117 political questions to map their ideological alignment, revealing that DeepSeek and Grok lean left while most other models cluster near the center or right.


@dlouapre: Meet physics-intern, our agentic framework for theoretical physics. It takes Gemini 3.1 Pro from 17.7% to 31.4% on Crit…

X AI KOLs Following ↗ · yesterday

Physics-intern is an agentic framework for theoretical physics that improves Gemini 3.1 Pro's performance on the CritPt benchmark from 17.7% to 31.4%, achieving a new state-of-the-art.


We use LLMs to analyze every file in your codebase. Everyone told us this was a stupid idea because of cost, but it wasn't.

Reddit r/ArtificialInteligence ↗ · yesterday

A benchmark study argues that using LLMs to analyze entire codebases is cost-effective, and identifies DeepSeek V4 Flash as the best default model because of its low cost and accuracy comparable to premium options such as Claude Opus.


PACT, head-to-head LLM negotiation benchmark. 20-round buyer-seller bargaining game: each round the AIs can message, the buyer submits a bid and the seller submits an ask. If bid ≥ ask, trade clears at the midpoint. Thousands of matchups.

Reddit r/singularity ↗ · 2d ago

PACT introduces a head-to-head negotiation benchmark for LLMs using a 20-round buyer-seller bargaining game to test persuasion and adaptation. Top performers include GPT-5.5 and Opus 4.7, with ratings computed via Glicko-2 on an Elo-like scale.
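
The clearing rule described in the post (a trade happens when bid ≥ ask, at the midpoint) can be sketched in a few lines of Python. This is only an illustration, not PACT's actual code: the names `clear_round` and `first_trade` are hypothetical, and stopping at the first cleared round is an assumption the post does not confirm.

```python
from typing import Optional, Sequence, Tuple


def clear_round(bid: float, ask: float) -> Optional[float]:
    """Return the trade price if the round clears, else None.

    Rule taken from the post: if bid >= ask, the trade clears at the midpoint.
    """
    if bid >= ask:
        return (bid + ask) / 2
    return None


def first_trade(
    bids: Sequence[float], asks: Sequence[float], max_rounds: int = 20
) -> Optional[Tuple[int, float]]:
    """Return (round number, price) for the first round that clears.

    ASSUMPTION: the game ends at the first cleared trade. The post only says
    the game has 20 rounds; it does not say what happens after a trade.
    """
    rounds = list(zip(bids, asks))[:max_rounds]
    for round_no, (bid, ask) in enumerate(rounds, start=1):
        price = clear_round(bid, ask)
        if price is not None:
            return round_no, price
    return None  # no round cleared within the limit
```

For example, with bids of 40, 48, 55 and asks of 70, 60, 53, the first two rounds do not clear and the third clears at the midpoint of 55 and 53.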


IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs

Hugging Face Daily Papers ↗ · 2d ago

The paper introduces IndustryBench, a benchmark evaluating LLMs on industrial procurement QA in Chinese against national standards, highlighting safety compliance gaps. It reveals that extended reasoning often lowers safety-adjusted scores and reshuffles model rankings when safety violations are considered.


@seclink: 阶跃星辰 is not a consumer product at all, messing with phones and car systems, pushing it to market before C-end users are even satisfied with trials. They're really in a hurry... I feel Xiaomi Mimo is the most stable. After actual testing, the AI coding experience is comparable to the Claude model and even faster. And…

X AI KOLs Following ↗ · 3d ago

A commenter questions 阶跃星辰 (StepFun) for rushing to commercialize on phones and car systems before consumer users are satisfied with trials, and praises Xiaomi Mimo's AI coding experience as comparable to Claude and faster.


RTX Pro 4500 Blackwell - Qwen 3.6 27B?

Reddit r/LocalLLaMA ↗ · 4d ago

A developer shares local inference benchmarks and systemd configurations for running the Qwen3.6-27B model on an NVIDIA RTX Pro 4500 Blackwell GPU using llama.cpp. The post requests optimization tips for throughput and explores potential use cases for larger models.


@kapicode: I've been using Claude as the "human" prompting @opencode to rebuild reference projects, evaluating four LLMs on the sa…

X AI KOLs Following ↗ · 5d ago

An evaluation of four LLMs (from Qwen, MiniMax, and GLM), with Claude acting as the prompter for the opencode agent tool, finds that a smaller local model (Qwen 27B on a 3090) outperforms a larger pruned model in coding quality and reliability.


AlignCultura: Towards Culturally Aligned Large Language Models?

arXiv cs.CL ↗ · 2026-04-22

AlignCultura introduces CulturaX, a UNESCO-grounded dataset and two-stage pipeline for culturally aligning LLMs, showing 4–6% HHH gains and 18% fewer cultural failures on Qwen3-8B and DeepSeek-R1-Distill-Qwen-7B.


I ran an experiment on the 30b class of gemma4 and qwen3.5 models to try to learn about energy cost and performance tradeoffs. In other words, which models use more energy to give the same answer quality?

Reddit r/LocalLLaMA ↗ · 2026-04-21

An empirical study of four 30B-class dense and MoE models finds that Gemma-4 26B MoE delivers equal accuracy at 1.9–15 Wh, while dense and larger MoE variants consume up to 34 Wh on the same reasoning tasks.


Opus 4.7 (high) takes #1 on the LLM Debate Benchmark, leading the previous champion, Sonnet 4.6 (high), by 106 BT points. Incredibly, it has not lost a single completed side-swapped matchup: 51 wins, 4 ties, and 0 losses.

Reddit r/singularity ↗ · 2026-04-20

Opus 4.7 has taken the #1 spot on the LLM Debate Benchmark, surpassing Sonnet 4.6 by 106 BT points with an undefeated record of 51 wins, 4 ties, and 0 losses in completed side-swapped matchups. The model wins by identifying and controlling the central hinge of each debate, forcing opponents onto its terms.


Evaluating Memory Capability in Continuous Lifelog Scenario

arXiv cs.CL ↗ · 2026-04-20

This paper introduces LifeDialBench, a novel benchmark for evaluating memory capabilities in continuous lifelog scenarios using wearable devices, and proposes an online evaluation protocol that enforces temporal causality. Key finding: sophisticated memory systems underperform simple RAG baselines, highlighting the importance of high-fidelity context preservation over lossy compression.


"Browser OS" implemented by Qwen 3.6 35B: The best result I ever got from a local model

Reddit r/LocalLLaMA ↗ · 2026-04-19

A user reports that Qwen 3.6 35B, running locally, implemented a "Browser OS" that was the best result they had ever gotten from a local model, highlighting the model's ability to handle complex tasks without cloud dependencies.
