llm-evaluation

Tag · Cards List

Testing Local LLMs in Practice: Code Generation, Quality vs. Speed

Reddit r/LocalLLaMA · 23h ago

The author built a benchmark harness to evaluate local LLMs for autonomous Go code generation, focusing on log parser generation for SIEM pipelines, and published results comparing quality vs. speed.
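
The post does not reproduce the harness itself; a minimal sketch of that kind of quality-vs-speed loop, assuming an Ollama-style local endpoint and a prepared Go test module, might look like the following (the URL, model handling, and file layout are illustrative assumptions, not the author's setup):

```python
# Hypothetical sketch: ask a local model for a Go log parser, check whether the
# result compiles and passes tests (quality), and record wall time (speed).
import pathlib
import subprocess
import time

import requests


def generate(prompt: str, model: str) -> tuple[str, float]:
    """Call a local Ollama-style endpoint and return (generated code, seconds taken)."""
    t0 = time.time()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    return resp.json()["response"], time.time() - t0


def passes_tests(go_source: str, test_dir: str) -> bool:
    """Quality signal: drop the generated parser into a prepared Go module and run its tests."""
    target = pathlib.Path(test_dir) / "parser.go"
    target.write_text(go_source)
    return subprocess.run(["go", "test", "./..."], cwd=test_dir).returncode == 0
```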

Can LLMs model real-world systems in TLA+?

Hacker News Top · yesterday Cached

Researchers from the Specula team created SysMoBench, a benchmark evaluating whether LLMs can faithfully model real-world computing systems in TLA+ or merely recite textbook specifications. The benchmark tests 11 systems across four phases and reveals systematic gaps in current LLMs' ability to accurately model system implementations versus reference papers.

One line system prompt change dropped model quality from 84% to 52%. How are people monitoring semantic quality in production?

Reddit r/AI_Agents · yesterday

A developer shares their experience of a single system prompt change degrading LLM response quality without triggering traditional monitoring alerts, and describes internal tooling they built to monitor semantic quality in production LLM applications.
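
The poster's internal tooling is not shown; as a rough illustration of the idea, a sampled LLM-as-judge monitor compared against a rolling baseline could look like this (judge prompt, sample size, and threshold are placeholders, not the actual tooling):

```python
# Sketch of sampled LLM-as-judge quality monitoring with a baseline alert.
import random
from statistics import mean

JUDGE_PROMPT = (
    "Rate how well the RESPONSE answers the QUESTION on a 1-5 scale. "
    "Reply with a single digit.\n\nQUESTION:\n{q}\n\nRESPONSE:\n{r}"
)


def judge_score(question: str, response: str, call_llm) -> float:
    """Score one production response with a judge model (1 = worst, 5 = best)."""
    reply = call_llm(JUDGE_PROMPT.format(q=question, r=response))
    digits = [c for c in reply if c.isdigit()]
    return float(digits[0]) if digits else 1.0


def monitor_batch(samples, call_llm, baseline: float, max_drop: float = 0.15) -> float:
    """Judge a random sample of recent traffic and alert on a relative score drop."""
    picked = random.sample(samples, min(50, len(samples)))
    current = mean(judge_score(q, r, call_llm) for q, r in picked)
    if current < baseline * (1 - max_drop):
        print(f"ALERT: semantic quality {current:.2f} vs baseline {baseline:.2f}")
    return current
```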

Counterargument for Critical Thinking as Judged by AI and Humans

arXiv cs.CL · yesterday Cached

This study investigates the use of student-written counterarguments to AI-generated content to foster critical thinking in an educational context, and finds that frontier LLMs can evaluate such submissions with moderate agreement to human assessors.

@_reachsumit: OBLIQ-Bench: Exposing Overlooked Bottlenecks in Modern Retrievers with Latent and Implicit Queries @dianetc_ et al pres…

X AI KOLs Following · yesterday Cached

OBLIQ-Bench is a new benchmark that exposes weaknesses in current retrieval systems when handling oblique queries requiring latent or implicit reasoning, showing that even sophisticated retrieval pipelines fail to surface relevant documents that reasoning LLMs can easily verify.

@ArizePhoenix: One of the oldest lessons in ML is still one of the most useful for working with LLM apps: Don’t evaluate on the same d…

X AI KOLs Following · yesterday Cached

This article discusses best practices for LLM application development using Arize Phoenix, specifically highlighting the importance of using train/validation/test splits for honest evaluation and tracking regressions.
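
The thread's own examples are not reproduced here; a minimal sketch of applying that split discipline to an LLM app's eval set might be the following (an illustrative helper, not an Arize Phoenix API):

```python
# Illustrative split helper: iterate prompts against the validation split and
# reserve the held-out test split for final, untouched regression checks.
import random


def split_examples(examples, seed=42, val_frac=0.2, test_frac=0.2):
    """Shuffle once, then carve out validation and held-out test subsets."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    dev = shuffled[n_test + n_val:]
    return dev, val, test


dev, val, test = split_examples(range(100))
# Tune prompts and few-shot examples on `val`; score `test` only when reporting results.
```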

We gave 45 psychological questionnaires to 50 LLMs. What we found was not “personality.”

Reddit r/artificial · yesterday

Researchers analyzed 50 LLMs across 45 psychometric questionnaires, identifying a 'Pinocchio Dimension' that measures how readily models endorse statements about inner experiences, rather than reflecting genuine personality traits.

eTPS Site Plan – Simple Leaderboard + What You’ll Actually See

Reddit r/artificial · yesterday

The author introduces the site plan for effectiveTPS, a tool designed to compare local AI models using a new 'effective TPS' metric alongside raw speed and latency. It aims to provide a simple leaderboard that highlights useful output quality over raw marketing numbers.
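
The post does not spell out the metric's formula, so the following is only one plausible reading, assuming "effective TPS" discounts raw decode speed by how usable the output actually is:

```python
# One plausible reading of "effective TPS" (assumed, not the author's definition):
# raw tokens-per-second scaled by a quality score in [0, 1] such as a pass rate.
def effective_tps(total_tokens: int, wall_seconds: float, quality: float) -> float:
    raw_tps = total_tokens / wall_seconds
    return raw_tps * quality


# 1200 tokens in 60 s at 0.55 judged quality -> 20 raw TPS, 11 effective TPS.
print(effective_tps(1200, 60.0, 0.55))
```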

The First Token Knows: Single-Decode Confidence for Hallucination Detection

Hugging Face Daily Papers · 3d ago Cached

This paper introduces a method for detecting hallucinations in large language models by leveraging the confidence of the first generated token, requiring only a single decode step.
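
The summary does not detail the paper's exact confidence estimator; one natural instantiation is to take the maximum softmax probability of the first token the model would generate, sketched here with Hugging Face transformers:

```python
# Minimal sketch of first-token confidence scoring (an assumed instantiation,
# not necessarily the paper's exact estimator).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")


def first_token_confidence(prompt: str) -> float:
    """Max softmax probability over the first token the model would generate."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    return torch.softmax(next_token_logits, dim=-1).max().item()


# Flag a possible hallucination when first-token confidence falls below a tuned threshold.
conf = first_token_confidence("Q: Who wrote 'The Selfish Gene'?\nA:")
print(conf, "flag" if conf < 0.5 else "keep")
```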

CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing

Hugging Face Daily Papers · 3d ago Cached

The paper introduces CreativityBench, a benchmark for evaluating large language models' ability to creatively repurpose tools based on affordance reasoning. It highlights that current models struggle with creative problem-solving despite strong general reasoning capabilities.

From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents

arXiv cs.CL · 2026-04-23 Cached

Researchers introduce Memora, a benchmark that evaluates LLMs’ ability to retain, update, and forget long-term user memories over conversations spanning weeks to months, revealing frequent reuse of obsolete memories.

HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models

arXiv cs.CL · 2026-04-23 Cached

HumorRank introduces a tournament-based leaderboard using pairwise evaluations and Bradley-Terry MLE to rank LLMs on humor generation, showing that humor quality depends on comedic mastery rather than model scale.
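
The fitting code is not included in the summary; the standard way to turn such pairwise wins into a ranking is the Bradley-Terry maximum-likelihood fit via minorization-maximization updates, sketched generically below (not the paper's implementation):

```python
# Generic Bradley-Terry MLE (MM updates) for turning pairwise "A beat B"
# judgments into a leaderboard.
from collections import defaultdict


def bradley_terry(pairs, n_iter=200):
    """pairs: list of (winner, loser) model names. Returns normalized strengths."""
    models = sorted({m for pair in pairs for m in pair})
    wins = defaultdict(float)       # total wins per model
    games = defaultdict(float)      # comparisons per unordered pair
    for w, l in pairs:
        wins[w] += 1.0
        games[frozenset((w, l))] += 1.0
    p = {m: 1.0 for m in models}
    for _ in range(n_iter):
        new_p = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (p[i] + p[j])
                for j in models
                if j != i and games[frozenset((i, j))] > 0
            )
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}
    return dict(sorted(p.items(), key=lambda kv: -kv[1]))


print(bradley_terry([("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")]))
```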

Saying More Than They Know: A Framework for Quantifying Epistemic-Rhetorical Miscalibration in Large Language Models

arXiv cs.CL · 2026-04-23 Cached

The paper introduces a framework to quantify how LLMs overstate certainty through rhetorical devices, revealing model-agnostic patterns of epistemic-rhetorical miscalibration.

CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks

arXiv cs.CL · 2026-04-22 Cached

CulturALL introduces a 2,610-sample benchmark spanning 14 languages and 51 regions to evaluate LLMs on real-world, culturally grounded tasks; the top model scores only 44.48%, highlighting large room for improvement.

HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

arXiv cs.CL · 2026-04-22 Cached

Researchers introduce HoWToBench, a large-scale Chinese writing benchmark with 1,302 instructions across 12 genres, and Tree-of-Writing (ToW), a tree-structured evaluation method that achieves 0.93 Pearson correlation with human judgments while mitigating biases in LLM writing assessment.

Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest

arXiv cs.CL · 2026-04-22 Cached

Researchers from Utah State and Vanderbilt benchmark GPT-4, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT on three social-media tasks (authorship verification, post generation, and user attribute inference), introducing new sampling protocols and taxonomies to reduce bias and enable reproducible benchmarks.

MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation

arXiv cs.CL · 2026-04-22 Cached

Researchers introduce MORPHOGEN, a multilingual benchmark testing LLMs’ ability to rewrite first-person sentences in the opposite gender while preserving meaning across French, Arabic, and Hindi.

LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification

arXiv cs.CL · 2026-04-22 Cached

Researchers release LegalBench-BR, the first public benchmark for evaluating LLMs on Brazilian legal text classification, showing LoRA-fine-tuned BERTimbau dramatically outperforms GPT-4o mini and Claude 3.5 Haiku.

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

Hugging Face Blog · 2026-04-21 Cached

QIMMA is a new quality-first Arabic LLM leaderboard introduced by TII UAE that validates benchmarks before evaluation to ensure accurate performance measurement. It addresses systematic quality issues in existing Arabic NLP benchmarks through a rigorous multi-stage validation pipeline.

PRISM: Probing Reasoning, Instruction, and Source Memory in LLM Hallucinations

arXiv cs.CL · 2026-04-21 Cached

Researchers propose PRISM, a diagnostic benchmark that breaks down LLM hallucinations into four dimensions (missing knowledge, knowledge errors, reasoning errors, and instruction-following errors) across three generation stages (memory, instruction, and reasoning), evaluating 24 LLMs to reveal trade-offs among mitigation strategies.
