Tag
The author built an open-source rubric tool that evaluates agentic AI vendor documentation on three criteria: tool-call correctness, loop termination, and multi-step state coherence. The author used it to score five vendors (Anthropic, OpenAI, LangGraph, Sierra, Salesforce) and requests feedback on the methodology and on whether the rubric is biased toward vendors with deeper public documentation.
This paper describes the development of an LLM-based tool using OpenAI's GPT models to evaluate approximately 1,200 Statements of Purpose for Purdue's SURF program, processing them in 4.6 hours and accelerating the review process compared to traditional human grading.
This paper proposes a learner model-based rubric to evaluate the adaptivity of Vision Language Models (VLMs) in mathematics education. Experiments show measurable differences in adaptivity across models and reveal that current VLMs struggle to produce consistent learner-adaptive instructional responses.