@zhijianliu_: Reasoning VLAs can think. They just can't think fast. Until now. Introducing FlashDrive 716 ms → 159 ms on RTX PRO 6000…
Summary
FlashDrive cuts inference latency for reasoning vision-language-action (VLA) models from 716 ms to 159 ms on an RTX PRO 6000 (a reported speedup of up to 5.7×) with zero reported accuracy loss, enabling real-time autonomous applications.
Cached at: 04/21/26, 09:00 AM
Reasoning VLAs can think. They just can’t think fast. Until now.
Introducing FlashDrive
716 ms → 159 ms on RTX PRO 6000 (up to 5.7×)
Zero accuracy loss
FlashDrive = streaming inference + DFlash speculative reasoning + ParoQuant W4A8
Real-time reasoning for autonomous
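As a quick sanity check on the numbers quoted in the post, the short sketch below computes the speedup implied by the two headline latencies and the corresponding decision rates. The helper names `speedup` and `rate_hz` are illustrative only and do not come from FlashDrive. The headline pair works out to roughly 4.5×, so the "up to 5.7×" figure presumably refers to a different configuration or workload that the post does not specify.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of baseline latency to optimized latency (higher is faster)."""
    return baseline_ms / optimized_ms


def rate_hz(latency_ms: float) -> float:
    """Inferences per second if each one takes latency_ms and runs back to back."""
    return 1000.0 / latency_ms


# Latencies quoted in the post for RTX PRO 6000.
baseline_ms = 716.0
optimized_ms = 159.0

headline_speedup = speedup(baseline_ms, optimized_ms)
print(f"headline speedup: {headline_speedup:.2f}x")
print(f"baseline rate:    {rate_hz(baseline_ms):.2f} Hz")
print(f"optimized rate:   {rate_hz(optimized_ms):.2f} Hz")
```

Under these figures the model moves from about 1.4 to about 6.3 decisions per second, assuming inferences run sequentially with no other pipeline overhead.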
Similar Articles
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.
@AdinaYakup: Step-3.7-Flash New VL model from @StepFun_ai 198B / 11B active - MoE 256K context 3 reasoning levels Up to 400 tokens/sec
StepFun releases Step-3.7-Flash, a new large vision-language MoE model with 198B parameters (11B active), 256K context, and up to 400 tokens/sec inference speed.
VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization
This paper introduces a paradigm where Vision-Language Models (VLMs) act as test-time teachers to guide Video Generation Models (VGMs) via differentiable rewards and LoRA optimization, achieving a 16.7-point average improvement on video reasoning benchmarks.
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving at answer-only inference latency. It is the first latent CoT method to surpass explicit CoT across four benchmarks.
Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving
Fast-dDrive is a block-diffusion VLA model for end-to-end autonomous driving that achieves state-of-the-art trajectory accuracy while delivering over 12x throughput speedup over autoregressive baselines, addressing the trade-off between high-fidelity planning and efficient inference for edge deployment.