Tag
Introduces PoQ-Judge, a multi-architecture evaluation framework with reference-free judge models (TextCNN, MiniLM, DeBERTa) for cost-aware Proof-of-Quality in decentralized LLM inference, achieving high correlation with ground-truth proxies while eliminating the need for reference answers.
AgentLens is a framework for process-level assessment of software engineering agent trajectories, revealing that over 10% of passing trajectories exhibit 'Lucky Pass' behavior. It introduces AgentLens-Bench, a dataset annotated with quality scores, and shows that ranking models by quality score can significantly shift their relative order.