Release of Wall-OSS-0.5, an open-weights vision-language-action model that reaches over 80% task progress on 4 of 17 real-robot tasks without task-specific fine-tuning, including a deformable rope task not seen during pretraining. By the release's own evaluation, the model preserves general vision-language ability while improving embodied grounding.
Sharing this because it is an embodied AI release that tries to make the pretrained checkpoint itself measurable, instead of only showing results after task-specific tuning.

The video is a reel from Wall-OSS-0.5, a vision-language-action model released with open-source resources. Every clip in the reel carries the same "Autonomous w/o Fine-Tuning" watermark in the corner. The robot opens a pot lid and drops fruit inside, covers blocks with a cloth, sorts items by color, puts drinks into specific containers in a specified order, shreds paper, and places a cup to the right of a calculator. According to the release, these clips come from the pretrained checkpoint rather than from task-specific fine-tuning.

What is interesting compared with the usual humanoid demo cycle is the evaluation framing. They report 4 of 17 real-robot tasks above 80 percent task progress zero-shot, including a deformable rope-tightening task that was not in the pretraining set. They also show pretraining task progress rising across checkpoints, with held-out tasks tracking seen tasks. That is the kind of curve people keep asking for in embodied AI, even if it is still early.

The other part I found notable is that the model seems to preserve general image/language ability while improving embodied grounding, at least by their evaluation. That matters because a lot of robot policies feel like they gain control ability by becoming narrower.

Code: [https://github.com/X-Square-Robot/wall-x](https://github.com/X-Square-Robot/wall-x). Paper: [https://x2robot.com/api/files/file/wall\_oss\_05.pdf](https://x2robot.com/api/files/file/wall_oss_05.pdf). Hugging Face org: [https://huggingface.co/x-square-robot](https://huggingface.co/x-square-robot).

The caveat is that the harder tasks are still not solved. Towel folding, charger insertion, and table setting are still very low zero-shot, so pretraining alone is not magic.
The real test is whether outside groups can run the checkpoint on their own arms and see similar strengths and failures. Reel is attached. Original demo is on their project page.
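
For anyone who wants to line up their own runs against the "N of 17 tasks above 80 percent" framing, here is a rough Python sketch of that tally. It assumes task progress is logged as one averaged fraction in [0, 1] per task, which is my reading and not something confirmed from the release. The function name `summarize_zero_shot`, the task names, and the numbers are all made up for illustration; this is not the project's evaluation code and the values are not results from the paper.

```python
from typing import Mapping


def summarize_zero_shot(progress: Mapping[str, float], threshold: float = 0.8) -> dict:
    """Count tasks whose mean task progress is strictly above `threshold`.

    `progress` maps a task name to a fraction in [0, 1] (assumed format).
    """
    if not progress:
        raise ValueError("progress must contain at least one task")
    above = sorted(name for name, p in progress.items() if p > threshold)
    return {
        "n_tasks": len(progress),
        "n_above": len(above),
        "tasks_above": above,
        "mean_progress": sum(progress.values()) / len(progress),
    }


# Placeholder numbers for illustration only; NOT results from the paper.
example = {"task_a": 0.9, "task_b": 0.85, "task_c": 0.4, "task_d": 0.05}
summary = summarize_zero_shot(example)
print(f"{summary['n_above']} of {summary['n_tasks']} tasks above 80% task progress")
```

Reporting the per-task list alongside the count matters here, since a headline like "4 of 17" hides which tasks are near zero.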
X Square Robot releases Wall-OSS-0.5, a 4B open-source VLA robot foundation model evaluated on a 17-task real-robot zero-shot suite without task-specific fine-tuning, aiming to directly measure pretraining capability.
Open_MOSS released MOSS-VL, an 11B Apache 2.0 vision-language model using cross-attention and XRoPE that outperforms Qwen3-VL-8B by 8.3 points on VSI-bench.
VoLoAgent integrates vision-language models with robot capabilities for open-vocabulary long-horizon manipulation tasks, introducing a physical orchestrator that plans, monitors, and recovers using interruptible tools, and a benchmark called RoboVoLo for evaluation.
HyVLA-0.5 is an end-to-end robotic learning system that integrates data collection, model design, pre-training, fine-tuning, and reinforcement learning for real-world deployment.