multimodal-llm

#multimodal-llm

@PKUCXK: You can try the following two prompts in the Thinking mode (via web/app) to get a better model experience in certain do…

X AI KOLs Timeline · 5d ago

Xiaokang Chen shares two prompts, 'Think with Grounding' and 'Think with Pointing', that improve the model experience in Thinking mode for certain domains, such as counting. The prompts use bounding boxes and points to make the MLLM's reasoning more human-like.

MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

Hugging Face Daily Papers · 2026-06-18

MemGUI-Agent introduces proactive context management for long-horizon mobile GUI tasks, using Context-as-Action (ConAct) to maintain critical information. It includes the MemGUI-3K dataset and achieves state-of-the-art performance on MemGUI-Bench and MobileWorld benchmarks with an 8B model.

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

Hugging Face Daily Papers · 2026-06-18

A new benchmark called StylisticBias systematically evaluates attribute-level social bias in multimodal large language models, finding that a small set of visual cues, such as fashion style, drives most biases.

Are you speaking my languages? On spoken language adherence in multimodal LLMs

arXiv cs.CL · 2026-06-17

This paper addresses the problem of spoken language adherence in multimodal LLMs for ASR, proposing a soft prompting approach and novel metric to quantify language violations. It evaluates three mitigation strategies—zero-shot prompting, supervised fine-tuning, and chain-of-thought reasoning—across multiple languages to improve transcription fidelity.

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

arXiv cs.LG · 2026-06-17

This paper introduces MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE multimodal LLMs that addresses biases in expert importance estimation by decomposing selection frequency by modality and filtering redundant vision tokens, achieving minimal performance loss under aggressive quantization.

Forced Deferral: Manipulating Routing Decisions in Multimodal LLM Cascades

arXiv cs.AI · 2026-06-16

This paper introduces the Forced Deferral Attack (FDA), an adversarial image attack that manipulates confidence scores in multimodal LLM cascades, causing queries to be unnecessarily routed to stronger (more expensive) models, thereby shifting compute costs to the provider without degrading answer correctness.

Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning

arXiv cs.AI · 2026-06-16

Visual-Seeker proposes a visual-native multimodal deep search agent that actively reasons over fine-grained visual details and synthesizes multimodal evidence, achieving state-of-the-art performance on five challenging multimodal search benchmarks.

@xichen_pan: Modern text-to-image models are increasingly powered by large pretrained LLMs. But there is a curious mismatch: the LLM…

X AI KOLs Following · 2026-06-16

RepFusion introduces a method to use pretrained multimodal LLMs as noisy representation encoders in diffusion transformers for text-to-image generation, outperforming baselines with similar compute.

Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Hugging Face Daily Papers · 2026-06-13

The SAGA framework uses frozen multimodal large language models to provide attribute-aware supervision for vision encoders via Group Relative Policy Optimization, improving zero-shot image retrieval by 3–6 points on fine-grained benchmarks.

Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach

arXiv cs.AI · 2026-06-12

The paper introduces UXBench, a multimodal benchmark for evaluating MLLMs on mobile UX reasoning tasks, and presents UI-UX, a fine-tuned MLLM based on Qwen3-VL-4B-Thinking that achieves state-of-the-art performance on this benchmark.

SVoT: State-aware Visualization-of-Thought for Spatial Reasoning via Reinforcement Learning

arXiv cs.AI · 2026-06-11

The paper proposes SVoT, a reinforcement learning framework that generates interleaved, verifiable intermediate states and visualizations for multi-hop spatial reasoning in MLLMs, achieving significant accuracy gains on new benchmarks involving multi-object interactions and numerical reasoning.

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

Hugging Face Daily Papers · 2026-06-10

ART (Art-based Reinforcement Training) enables parameter-efficient fine-tuning of frozen multimodal LLMs by optimizing raw visual input via gradient backpropagation, achieving performance comparable to LoRA while supporting pre-compiled computational graphs for high-throughput engines like vLLM.

Mitigating Manifold Departure: Uncertainty-Aware Subspace Rectification for Trustworthy MLLM Decoding

arXiv cs.LG · 2026-06-10

This paper introduces MGAP, a training-free decoding method that reduces hallucinations in Multimodal Large Language Models by adaptively suppressing only the harmful parts of language priors while preserving the model's semantic manifold. The method outperforms prior baselines on POPE and CHAIR benchmarks.

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

arXiv cs.AI · 2026-06-10

This paper studies how audio and visual information flow inside Audio-Visual Large Language Models (AVLLMs), revealing that AVLLMs follow sequential or parallel routing depending on input configuration, and that some tokens can be discarded after information transfer for efficiency.

Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM

arXiv cs.CL · 2026-06-04

This paper proposes a query-based cross-modal projector that compresses visual tokens via cross-attention to improve Mamba-based multimodal LLMs, boosting both performance and throughput on vision-language benchmarks while eliminating the need for manual 2D scan order design.

A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs

arXiv cs.CL · 2026-06-04

Researchers from Jilin University systematically evaluate positional bias in multi-video summarization using MLLMs, constructing a benchmark from ActivityNet and News videos and assessing nine models with metrics including Coverage, Directional Positional Bias, and Middle-Edge Gap. Results show positional effects are domain- and model-dependent, and increasing visual or generation budget does not uniformly resolve the imbalance.

VCIFBench: Evaluating Complex Instruction Following for Video Understanding

arXiv cs.CL · 2026-06-04

VCIFBench is a new benchmark for evaluating complex instruction following in video understanding, featuring 306 test instructions with content, format, style, and structure constraints, plus a DPO preference dataset. Experiments on 10 MLLMs reveal that joint constraint satisfaction remains challenging, and DPO training on the benchmark data improves instruction-following performance.

BiNSGPS: Geometry Problem Solving via Bidirectional Neuro-Symbolic Interaction

arXiv cs.AI · 2026-06-04

BiNSGPS is a framework that introduces bidirectional interaction between a multimodal LLM adviser and a symbolic solver for geometry problem solving, allowing feedback from the solver to correct errors and generate auxiliary hypotheses. It achieves state-of-the-art performance of 90.5% on Geometry3K and 90.1% on PGPS9K benchmarks.

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

arXiv cs.AI · 2026-06-04

VAMPS is a new benchmark of 1,168 multimodal bilingual math problems designed to evaluate whether LLMs can benefit from constructing and reasoning over graphs/visualizations. Key finding: direct analytical solving surprisingly outperforms tool-enabled visual solving even on problems where plotting is a natural strategy.

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

Hugging Face Daily Papers · 2026-06-04

Future-L1 is an interleaved latent visual reasoning framework that improves video event prediction by maintaining visual semantics in latent space. It achieves state-of-the-art results on the FutureBench and TwiFF-Bench benchmarks.
