ParseBench introduces the first benchmark evaluating vision-language models on chart comprehension within full enterprise documents, addressing gaps in prior chart-only benchmarks.
NomadicAI is building an agentic computer-vision product to fix VLMs' weak grounding in actual video content.
Jerry Liu discusses the challenges of using vision-language models for PDF parsing, particularly ensuring text correctness and preserving reading order while avoiding hallucinations.
PersonaVLM introduces a personalized multimodal LLM framework that enables long-term user adaptation through memory retention, multi-turn reasoning, and response alignment, outperforming GPT-4o by 5.2% on the new Persona-MME benchmark.