This paper introduces DiagFlowBench, a benchmark dataset of 1,676 multi-turn diagnostic conversations derived from industrial flowcharts, designed to evaluate how well language models handle off-procedure inputs and abstain from giving inappropriate advice.