vision-language-model

#vision-language-model

Holo3.1 35B/9B/4B/0.8B (Qwen 3.5 finetunes)

Reddit r/LocalLLaMA ↗ · 2026-06-03

H Company releases Holo3.1, a family of Vision-Language Models (0.8B to 35B) for computer use agents, supporting web, desktop, and mobile automation with native function calling and optimized quantized checkpoints for local deployment.

MAOAM: Unified Object and Material Selection with Vision-Language Models

Hugging Face Daily Papers ↗ · 2026-06-02 Cached

This paper presents MAOAM, a unified vision-language model framework that enables precise object and material selection through text or click interactions for interactive image editing. It introduces a scalable data generation pipeline and shows emergent improvement when combining text and clicks at inference.

PaddleOCR-VL-1.6: Expanding the Frontier of Document Parsing with Under-Optimized Region Refinement and Progressive Post-Training

Hugging Face Daily Papers ↗ · 2026-06-02 Cached

PaddleOCR-VL-1.6 improves document parsing by identifying and refining under-optimized regions via targeted data optimization and progressive post-training, achieving state-of-the-art 96.33% on OmniDocBench v1.6.

@Prince_Canuma: Today we're shipping our biggest MLX-VLM release yet: v0.6.0 ...and we are raising This one's about turning your Apple …

X AI KOLs Following ↗ · 2026-06-01 Cached

MLX-VLM v0.6.0 is released, adding speculative decoding, an agent-ready server compatible with Anthropic's API, new models (DeepSeek V4, ZAYA1-VL, etc.), image generation/editing, and audio input support, enabling local AI agents on Apple devices.

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

Hugging Face Daily Papers ↗ · 2026-06-01 Cached

MMG2Skill converts web-based procedural guides into executable skills for agents through closed-loop learning, improving performance across GUI control, gameplay, and card play tasks with macro-average gains of +12.8 to +25.3 percentage points.

PhyDrawGen: Physically Grounded Diagram Generation from Natural Language

arXiv cs.AI ↗ · 2026-06-01 Cached

PhyDrawGen is a neuro-symbolic pipeline that generates physically accurate diagrams from natural language by combining LLM-based scene understanding with a deterministic constraint solver and a VLM-based verify loop, outperforming existing models on a benchmark of physics problems.

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

Hugging Face Daily Papers ↗ · 2026-05-31 Cached

This paper introduces 3DCodeBench, a benchmark for evaluating vision-language models on procedural 3D modeling via code, and 3DCodeArena, a ranking platform based on pairwise human preferences.

Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark

arXiv cs.AI ↗ · 2026-05-29 Cached

This paper introduces the PiSAR benchmark for screen-conditioned action prediction and compares supervised fine-tuned models against frontier zero-shot baselines. A fine-tuned Qwen3-VL-8B reaches 0.783 semantic similarity, outperforming Claude Opus 4.7 (0.459) and GPT-5.5 (0.482). The same fine-tuning recipe applied to a larger reasoning-tuned Gemma model yields only 0.441, which points to a model-recipe mismatch.

VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis

arXiv cs.AI ↗ · 2026-05-29 Cached

This paper proposes VFEAgent, a multi-agent system that automates finite element analysis by integrating vision-language models with a verification-first code synthesis framework, enabling end-to-end simulation from images and problem descriptions.

Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning

Hugging Face Daily Papers ↗ · 2026-05-28

Stable-Layers is a reinforcement learning framework that fine-tunes a pretrained image layer decomposition model using VLM feedback instead of paired supervision, employing Flow-GRPO with LoRA and a two-stage reward calibration pipeline to improve layer quality on the Crello dataset.

Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

Hugging Face Daily Papers ↗ · 2026-05-28 Cached

This paper introduces VisAnomReasoner, a parameter-efficient vision-language model fine-tuned on a novel benchmark (VisAnomBench) with natural-language rationales. It improves precision and F1 for time-series anomaly detection by over 21 percentage points and generalizes well across benchmarks.

MedExpMem: Adapting Experience Memory for Differential Diagnosis

arXiv cs.LG ↗ · 2026-05-25 Cached

This paper proposes MedExpMem, an experience memory framework that enables medical vision-language models to accumulate and retrieve discriminative diagnostic experience from past cases, improving differential diagnosis accuracy by up to 7.0% on a radiology benchmark.

InstructSAM: Segment Any Instance with Any Instructions

Hugging Face Daily Papers ↗ · 2026-05-25 Cached

InstructSAM presents a unified framework for multi-instance segmentation using instruction-driven queries that bridge vision-language models and SAM3, achieving strong results across complex benchmarks.

@OpenBMB: Thanks to @_akhaliq for contributing MiniCPM-V 4.6 Hugging Face demo, which allowed us to test the gradio.Server featur…

X AI KOLs Following ↗ · 2026-05-23 Cached

OpenBMB thanks @_akhaliq for contributing a Hugging Face demo for MiniCPM-V 4.6, which let the team test the gradio.Server feature for flexible frontend customization.

NuExtract3 released: open-weight 4B VLM for Markdown, OCR and structured extraction (self-hostable) [P]

Reddit r/MachineLearning ↗ · 2026-05-22

Numind released NuExtract3, a 4B open-weight vision-language model based on Qwen3.5-4B, designed for converting document images to Markdown, OCR, and structured data extraction. It is Apache-2.0 licensed and self-hostable with quantized versions for low VRAM.

AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education

arXiv cs.AI ↗ · 2026-05-22 Cached

This paper investigates using vision-language models to assess nursing competency from egocentric video during simulation, finding that recognition accuracy inversely relates to competency level, suggesting a pedagogically informative signal.

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

arXiv cs.CL ↗ · 2026-05-21 Cached

The University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages uses a two-stage pipeline: Qwen2.5-VL for Spanish captioning, followed by retrieval-augmented Gemini 2.5 Flash for target-language translation. It reports significant improvements over the baseline.

SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

arXiv cs.AI ↗ · 2026-05-20 Cached

SimGym is a framework that simulates A/B tests on e-commerce storefronts using vision-language model agents, reducing experimental cycles from weeks to under an hour while achieving 77% directional alignment with real buyer behavior.

AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

Hugging Face Daily Papers ↗ · 2026-05-20 Cached

AutoRubric-T2I automatically generates and selects explicit rubrics to guide Vision-Language Model judges for text-to-image generation, achieving high-quality reward signals with minimal human annotation and improving generation quality in downstream tasks.

Aurora: Unified Video Editing with a Tool-Using Agent

Hugging Face Daily Papers ↗ · 2026-05-18 Cached

Aurora is an agentic video editing framework that pairs a tool-augmented vision-language model agent with a diffusion transformer to automatically resolve textual and visual underspecification in user requests, enabling unified video editing tasks like replacement, removal, style transfer, and reference-driven insertion.
