标签
This article introduces Magis-Bench, a benchmark for evaluating large language models on magistrate-level legal tasks such as judicial reasoning and sentence drafting, using data from Brazilian judicial exams.
本文讨论了使用Arize Phoenix调试和评估LLM评判者所面临的挑战,Arize Phoenix通过OpenTelemetry追踪评估者运行过程,以检查决策逻辑、成本和潜在偏差。