Tag
This paper introduces Act2Answer, a protocol to evaluate knowledge retention in Vision-Language-Action (VLA) models by requiring agents to answer questions through physical actions. It finds that VLAs retain basic knowledge but show gaps on richer semantic categories, and that VQA co-training helps.
This analysis compares cloud-based and local AI agents, arguing that local agents offer a better user experience because of their richer environmental access, while the LLM layer becomes commoditized.
This position paper argues that advancing robot intelligence requires integrating unstructured behavioral data through specialized interfaces for labeling, embodiment mapping, world modeling, and reward inference, rather than relying solely on scaling Vision-Language-Action (VLA) models and world models.
Project CETI used LLM architectures to decode sperm whale clicks, revealing a phonetic alphabet while also highlighting that AI's statistical pattern matching lacks true comprehension. The article argues that AGI requires embodied, multimodal grounding rather than just scaling text-based models.