Tag
H Company releases Holo3.1, a family of Vision-Language Models (0.8B to 35B parameters) for computer-use agents, supporting web, desktop, and mobile automation with native function calling and optimized quantized checkpoints for local deployment.
This paper presents MAOAM, a unified vision-language model framework that enables precise object and material selection through text or click interactions for interactive image editing. It introduces a scalable data generation pipeline and shows emergent improvement when combining text and clicks at inference.
PaddleOCR-VL-1.6 improves document parsing by identifying and refining under-optimized regions via targeted data optimization and progressive post-training, achieving state-of-the-art 96.33% on OmniDocBench v1.6.
MLX-VLM v0.6.0 is released, adding speculative decoding, an agent-ready server compatible with Anthropic's API, new models (DeepSeek V4, ZAYA1-VL, etc.), image generation/editing, and audio input support, enabling local AI agents on Apple devices.
MMG2Skill converts web-based procedural guides into executable skills for agents through closed-loop learning, improving performance across GUI control, gameplay, and card play tasks with macro-average gains of +12.8 to +25.3 percentage points.
PhyDrawGen is a neuro-symbolic pipeline that generates physically accurate diagrams from natural language by combining LLM-based scene understanding with a deterministic constraint solver and a VLM-based verify loop, outperforming existing models on a benchmark of physics problems.
This paper introduces 3DCodeBench, a benchmark for evaluating vision-language models on procedural 3D modeling via code, and 3DCodeArena, a ranking platform based on pairwise human preferences.
This paper introduces the PiSAR benchmark for screen-conditioned action prediction and compares supervised fine-tuned models against frontier zero-shot baselines. Key findings show a fine-tuned Qwen3-VL-8B achieves 0.783 semantic similarity, significantly outperforming Claude Opus 4.7 and GPT-5.5 (0.459 and 0.482), but the same fine-tuning recipe on a larger reasoning-tuned Gemma model yields only 0.441, indicating a model-recipe mismatch.
This paper proposes VFEAgent, a multi-agent system that automates finite element analysis by integrating vision-language models with a verification-first code synthesis framework, enabling end-to-end simulation from images and problem descriptions.
Stable-Layers is a reinforcement learning framework that fine-tunes a pretrained image layer decomposition model using VLM feedback instead of paired supervision, employing Flow-GRPO with LoRA and a two-stage reward calibration pipeline to improve layer quality on the Crello dataset.
This paper introduces VisAnomReasoner, a parameter-efficient vision-language model fine-tuned on a novel benchmark (VisAnomBench) with natural-language rationales, achieving an improvement of more than 21 percentage points in precision and F1 for time-series anomaly detection, along with strong cross-benchmark generalization.
This paper proposes MedExpMem, an experience memory framework that enables medical vision-language models to accumulate and retrieve discriminative diagnostic experience from past cases, improving differential diagnosis accuracy by up to 7.0% on a radiology benchmark.
InstructSAM presents a unified framework for multi-instance segmentation using instruction-driven queries that bridge vision-language models and SAM3, achieving strong results across complex benchmarks.
OpenBMB thanks @_akhaliq for contributing a Hugging Face demo for MiniCPM-V 4.6 that uses a Gradio server to allow flexible frontend customization.
Numind released NuExtract3, a 4B open-weight vision-language model based on Qwen3.5-4B, designed for document-image-to-Markdown conversion, OCR, and structured data extraction. It is Apache-2.0 licensed and self-hostable, with quantized versions for low-VRAM hardware.
This paper investigates using vision-language models to assess nursing competency from egocentric video recorded during simulation. It finds that recognition accuracy is inversely related to competency level, which suggests a pedagogically informative signal.
This paper describes the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. It uses a two-stage pipeline, with Qwen2.5-VL for Spanish captioning and retrieval-augmented Gemini 2.5 Flash for target-language translation, and achieves significant improvements over the baseline.
SimGym is a framework that simulates A/B tests on e-commerce storefronts using vision-language model agents, reducing experimental cycles from weeks to under an hour while achieving 77% directional alignment with real buyer behavior.
AutoRubric-T2I automatically generates and selects explicit rubrics to guide Vision-Language Model judges for text-to-image generation, achieving high-quality reward signals with minimal human annotation and improving generation quality in downstream tasks.
Aurora is an agentic video editing framework that pairs a tool-augmented vision-language model agent with a diffusion transformer to automatically resolve textual and visual underspecification in user requests, enabling unified video editing tasks like replacement, removal, style transfer, and reference-driven insertion.