Tag
RoboSemanticBench is a benchmark that diagnoses semantic grounding in action prediction for vision-language-action models, revealing that while robots can grasp objects, they fail to select semantically correct targets based on instruction semantics.
Introduces CAFE, a benchmark for evaluating whether promptable segmentation models truly understand concepts by using counterfactual attribute manipulation, revealing that accurate mask prediction does not guarantee faithful semantic grounding.