Tag
This paper presents an empirical study of 57 ML evaluation harnesses. It identifies common operational challenges and their root causes across five workflow stages, and argues that evaluation engineering should be treated as a distinct software engineering concern.