vlm

#vlm

@Prince_Canuma: Quick update on the water situation M3 Ultra and Titan (RTX6000 Pro) seem to have recovered with little to no visible d…

X AI KOLs Timeline · 2026-05-18

A personal update on recovering hardware from water damage: the M3 Ultra and the RTX6000 Pro appear to have recovered. The post also shows MLX-VLM serving Qwen3-4B-Instruct locally on the RTX6000 Pro at ~300 tok/s for autocomplete and git commit generation in the Zed IDE.

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Hugging Face Daily Papers · 2026-05-14

MemEye is a visual-centric evaluation framework that assesses multimodal agent memory by measuring visual evidence granularity and retrieval complexity across 8 life-scenario tasks, revealing that current architectures struggle to preserve fine-grained visual details and reason about state changes over time.

FragileFlow: Spectral Control of Correct-but-Fragile Predictions for Foundation Model Robustness

arXiv cs.CL · 2026-05-12

This paper introduces FragileFlow, a plug-in regularizer that improves the robustness of LLMs and VLMs by controlling 'correct-but-fragile' predictions through spectral analysis and PAC-Bayes bounds.

World Action Models: The Next Frontier in Embodied AI

Hugging Face Daily Papers · 2026-05-12

This survey paper introduces World Action Models (WAMs), a unified framework for embodied AI that integrates predictive state modeling with action generation. It provides a taxonomy of existing methods, analyzes the data ecosystem, and outlines evaluation protocols for this emerging paradigm.

Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

Hugging Face Daily Papers · 2026-05-08

This paper introduces the Auto-Rubric as Reward (ARR) framework, which externalizes implicit preference knowledge into explicit rubrics for multimodal alignment. It proposes Rubric Policy Optimization (RPO) to stabilize policy gradients, achieving better performance in text-to-image and image editing tasks.

@jerryjliu0: ParseBench is the first benchmark to include VLM chart understanding over enterprise documents. Existing benchmarks (Ch…

X AI KOLs Timeline · 2026-04-21

ParseBench is presented as the first benchmark to evaluate vision-language models on chart comprehension within full enterprise documents, addressing gaps in prior chart-only benchmarks.

@nomadicai: The future of computer vision is agentic. 1/ We built Nomadic around a gap we kept seeing in video understanding: VLMs …

X AI KOLs Following · 2026-04-21

NomadicAI is building an agentic computer-vision product to fix VLMs' weak grounding in actual video content.

@jerryjliu0: A downside with using VLMs to parse PDFs is guaranteeing that the output text is correct and output in the correct re…

X AI KOLs Following · 2026-04-18

Jerry Liu discusses challenges with using Vision Language Models for PDF parsing, particularly around ensuring text correctness and maintaining proper reading order while avoiding hallucinations.

PersonaVLM: Long-Term Personalized Multimodal LLMs

Hugging Face Daily Papers · 2026-03-20

PersonaVLM introduces a personalized multimodal LLM framework that enables long-term user adaptation through memory retention, multi-turn reasoning, and response alignment, outperforming GPT-4o by 5.2% on the new Persona-MME benchmark.
