The author built a benchmark harness to evaluate local LLMs for autonomous Go code generation, focusing on log parser generation for SIEM pipelines, and published results comparing quality vs. speed.
Researchers from the Specula team created SysMoBench, a benchmark evaluating whether LLMs can faithfully model real-world computing systems in TLA+ or merely recite textbook specifications. The benchmark tests 11 systems across four phases and reveals systematic gaps in current LLMs' ability to accurately model system implementations versus reference papers.
A developer shares their experience of a single system prompt change degrading LLM response quality without triggering traditional monitoring alerts, and describes internal tooling they built to monitor semantic quality in production LLM applications.
This study investigates the use of student-written counterarguments to AI-generated content to foster critical thinking in an educational context, and finds that frontier LLMs can evaluate such submissions with moderate agreement to human assessors.
OBLIQ-Bench is a new benchmark that exposes weaknesses in current retrieval systems when handling oblique queries requiring latent or implicit reasoning, showing that even sophisticated retrieval pipelines fail to surface relevant documents that reasoning LLMs can easily verify.
This article discusses best practices for LLM application development using Arize Phoenix, specifically highlighting the importance of using train/validation/test splits for honest evaluation and tracking regressions.
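The split discipline the article advocates is straightforward to apply to LLM eval sets; below is a minimal, framework-agnostic sketch (a hypothetical helper, not a Phoenix API) of carving out fixed validation and test splits so reported scores aren't overfit to the examples used while iterating on prompts.

```python
import random

def split_eval_cases(cases, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle eval cases once, then carve out fixed validation and test sets.

    Prompt/pipeline changes are tuned against the validation split; the test
    split is scored only when declaring a final result, so reported numbers
    are not overfit to the examples used during iteration.
    """
    rng = random.Random(seed)            # fixed seed -> reproducible split
    shuffled = cases[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]    # few-shot pool / prompt-tuning set
    return train, val, test
```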
Researchers analyzed 50 LLMs across 45 psychometric questionnaires, identifying a 'Pinocchio Dimension' that measures the extent to which models endorse claims of inner experience rather than reflect genuine personality traits.
The author introduces the site plan for effectiveTPS, a tool designed to compare local AI models using a new 'effective TPS' metric alongside raw speed and latency. It aims to provide a simple leaderboard that highlights useful output quality over raw marketing numbers.
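Since the 'effective TPS' formula isn't spelled out here, the following is one plausible reading, sketched as code: raw throughput discounted by an output-quality score, so fast-but-useless generations rank below slower correct ones. The interfaces (generate, quality_judge) and the multiplicative form are assumptions for illustration, not the site's published method.

```python
import time

def measure_effective_tps(generate, prompt, quality_judge):
    """Assumed reading of 'effective TPS': raw tokens/sec scaled by a 0-1
    quality score. The leaderboard's actual formula may differ.

    generate(prompt) -> list of output tokens   (hypothetical interface)
    quality_judge(tokens) -> float in [0, 1]    (hypothetical interface)
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    raw_tps = len(tokens) / elapsed            # the 'marketing number'
    return raw_tps * quality_judge(tokens)     # discounted by usefulness
```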
This paper introduces a method for detecting hallucinations in large language models by leveraging the confidence of the first generated token, requiring only a single decode step.
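The mechanism is simple enough to sketch: one forward pass, then read off the probability the model assigns to its top choice for the first generated token. The model choice (gpt2) and the 0.5 threshold below are illustrative assumptions; the paper's setup will differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small model purely for illustration; swap in the model under test.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def first_token_confidence(prompt: str) -> float:
    """One forward pass (a single decode step): probability the model
    assigns to its own top choice for the first generated token."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # [1, seq_len, vocab]
    next_dist = torch.softmax(logits[0, -1], dim=-1)
    return next_dist.max().item()

# Threshold is an assumption; low confidence flags a likely hallucination.
if first_token_confidence("The capital of Zubrowkia is") < 0.5:
    print("low first-token confidence -> flag as possible hallucination")
```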
The paper introduces CreativityBench, a benchmark for evaluating large language models' ability to creatively repurpose tools based on affordance reasoning. It highlights that current models struggle with creative problem-solving despite strong general reasoning capabilities.
Researchers introduce Memora, a benchmark that evaluates LLMs’ ability to retain, update, and forget long-term user memories over conversations spanning weeks to months, revealing frequent reuse of obsolete memories.
HumorRank introduces a tournament-based leaderboard that ranks LLMs on humor generation via pairwise evaluations and Bradley-Terry MLE, showing that humor quality tracks comedic skill rather than model scale.
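Bradley-Terry MLE itself is standard and compact; here is a minimal fit via the classic minorization-maximization updates, with made-up win counts standing in for the leaderboard's pairwise humor judgments.

```python
import numpy as np

def bradley_terry_mle(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths from a pairwise win matrix using the
    classic MM updates. wins[i, j] = number of times model i beat model j."""
    games = wins + wins.T                      # total matchups per pair
    p = np.ones(wins.shape[0])                 # initial strengths
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p = wins.sum(axis=1) / denom.sum(axis=1)
        p /= p.sum()                           # fix scale (identifiability)
    return p

# Toy example (made-up counts): three models judged pairwise on humor.
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]])
print(bradley_terry_mle(wins))  # higher strength = funnier under the BT model
```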
Introduces a framework to quantify how LLMs overstate certainty through rhetorical devices, revealing model-agnostic patterns of epistemic-rhetorical miscalibration.
CulturALL introduces a 2,610-sample benchmark spanning 14 languages and 51 regions to evaluate LLMs on real-world, culturally grounded tasks; the top model scores only 44.48%, highlighting large room for improvement.
Researchers introduce HoWToBench, a large-scale Chinese writing benchmark with 1,302 instructions across 12 genres, and Tree-of-Writing (ToW), a tree-structured evaluation method that achieves 0.93 Pearson correlation with human judgments while mitigating biases in LLM writing assessment.
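The headline 0.93 figure is a standard meta-evaluation: correlate the automatic scores with human ratings over the same set of writing samples. A minimal sketch with made-up scores (the real study uses the benchmark's own data):

```python
from scipy.stats import pearsonr

# Made-up per-sample scores: automatic (ToW-style) vs. human ratings.
auto_scores  = [3.2, 4.1, 2.5, 4.8, 3.9, 2.2]
human_scores = [3.0, 4.3, 2.8, 4.6, 4.0, 2.1]

r, p_value = pearsonr(auto_scores, human_scores)
print(f"Pearson r = {r:.2f}")  # values near 1.0 mean the metric tracks humans
```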
Researchers from Utah State and Vanderbilt benchmark GPT-4, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT on three social-media tasks (authorship verification, post generation, and user attribute inference), introducing new sampling protocols and taxonomies to reduce bias and enable reproducible evaluation.
Researchers introduce MORPHOGEN, a multilingual benchmark testing LLMs’ ability to rewrite first-person sentences in the opposite gender while preserving meaning across French, Arabic, and Hindi.
Researchers release LegalBench-BR, the first public benchmark for evaluating LLMs on Brazilian legal text classification, showing LoRA-fine-tuned BERTimbau dramatically outperforms GPT-4o mini and Claude 3.5 Haiku.
QIMMA is a new quality-first Arabic LLM leaderboard introduced by TII UAE that validates benchmarks before evaluation to ensure accurate performance measurement. It addresses systematic quality issues in existing Arabic NLP benchmarks through a rigorous multi-stage validation pipeline.
Researchers propose PRISM, a diagnostic benchmark that decomposes LLM hallucinations into four dimensions (missing knowledge, knowledge errors, reasoning errors, and instruction-following errors) across three generation stages (memory, instruction, reasoning), evaluating 24 LLMs to reveal trade-offs among mitigation strategies.