reference-free

#reference-free

LLM Judges Can Be Too Generous When There Is No Reference Answer

arXiv cs.CL · 2026-07-15

This paper shows that LLM judges tend to over-credit incorrect answers when no reference answer is provided. Adding a reference can flip up to 85% of verdicts and brings them closer to human judgments. The authors propose calibration steps for using LLM judges in reference-free settings.

More Convincing, Not More Correct: Self-Play Reward Hacking of Reference-Free LLM Judges

arXiv cs.LG · 2026-07-08

This paper identifies a structural flaw in reference-free LLM judges used in self-play training: they score plausibility rather than correctness, which leads to reward hacking where policies learn to produce plausible-but-wrong answers. The authors propose a hidden-anchor audit and a de-anchored reward to mitigate this issue.

PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference

arXiv cs.CL · 2026-06-11

This paper introduces PoQ-Judge, a multi-architecture evaluation framework with reference-free judge models (TextCNN, MiniLM, DeBERTa) for cost-aware Proof-of-Quality in decentralized LLM inference. It achieves high correlation with ground-truth proxies while eliminating the need for reference answers.

Granuscore: A Reference-Free Measure of Granularity for Text Analysis and Question Answering

arXiv cs.CL · 2026-05-27

Granuscore is a reference-free measure of granularity for text analysis and question answering. It uses hierarchical embedding spaces to capture fine-grained vs. coarse language and demonstrates consistent differences in model behavior across QA benchmarks.

Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective

arXiv cs.CL · 2026-05-18

This paper applies Group Relative Policy Optimization (GRPO) to encoder-decoder Seq2Seq models for machine translation fine-tuning, using reference-free rewards (LaBSE and COMET-Kiwi) that require no parallel data, and achieves consistent improvements across 13 languages.
