vlm

#vlm

FragileFlow: Spectral Control of Correct-but-Fragile Predictions for Foundation Model Robustness

arXiv cs.CL · yesterday

This paper introduces FragileFlow, a plug-in regularizer that improves the robustness of LLMs and VLMs by controlling 'correct-but-fragile' predictions through spectral analysis and PAC-Bayes bounds.

#vlm

World Action Models: The Next Frontier in Embodied AI

Hugging Face Daily Papers · yesterday

This survey paper introduces World Action Models (WAMs), a unified framework for embodied AI that integrates predictive state modeling with action generation. It provides a taxonomy of existing methods, analyzes the data ecosystem, and outlines evaluation protocols for this emerging paradigm.

#vlm

Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

Hugging Face Daily Papers · 5d ago

This paper introduces the Auto-Rubric as Reward (ARR) framework, which externalizes implicit preference knowledge into explicit rubrics for multimodal alignment. It proposes Rubric Policy Optimization (RPO) to stabilize policy gradients, achieving better performance in text-to-image and image editing tasks.

#vlm

@jerryjliu0: ParseBench is the first benchmark to include VLM chart understanding over enterprise documents. Existing benchmarks (Ch…

X AI KOLs Timeline · 2026-04-21

ParseBench introduces the first benchmark evaluating vision-language models on chart comprehension within full enterprise documents, addressing gaps in prior chart-only benchmarks.

#vlm

@nomadicai: The future of computer vision is agentic. 1/ We built Nomadic around a gap we kept seeing in video understanding: VLMs …

X AI KOLs Following · 2026-04-21

NomadicAI is building an agentic computer-vision product to fix VLMs' weak grounding in actual video content.

#vlm

@jerryjliu0: A downside with using VLMs to parse PDFs is guaranteeing that the output text is correct and output in the correct re…

X AI KOLs Following · 2026-04-18

Jerry Liu discusses the challenges of using vision-language models for PDF parsing: guaranteeing that the output text is correct, preserving the proper reading order, and avoiding hallucinations.

#vlm

PersonaVLM: Long-Term Personalized Multimodal LLMs

Hugging Face Daily Papers · 2026-03-20

PersonaVLM introduces a personalized multimodal LLM framework that enables long-term user adaptation through memory retention, multi-turn reasoning, and response alignment, outperforming GPT-4o by 5.2% on the new Persona-MME benchmark.
