quality-scoring

#quality-scoring

PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference

arXiv cs.CL · 14h ago

Introduces PoQ-Judge, a multi-architecture evaluation framework that uses reference-free judge models (TextCNN, MiniLM, DeBERTa) for cost-aware Proof-of-Quality in decentralized LLM inference. The judges achieve high correlation with ground-truth proxies without requiring reference answers.


#quality-scoring

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

Hugging Face Daily Papers · 2026-05-13

AgentLens is a framework for process-level assessment of software engineering agent trajectories. It reveals that over 10% of passing trajectories exhibit 'Lucky Pass' behavior. It also introduces AgentLens-Bench, a dataset annotated with quality scores, and shows that ranking models by quality score can shift their rankings significantly.

