Tag
SpeechEQ introduces a benchmark and dataset for evaluating emotional intelligence in speech-language models, covering 15 EQ subscales across 2,265 dialogues. Experiments reveal current models struggle with paralinguistic cues, exhibiting text-reliant shortcuts and other limitations.
This paper introduces an atomistic language model that integrates a 3D atom encoder, Qwen LLM, and diffusion crystal generator to natively handle multimodal materials data, achieving state-of-the-art crystal structure prediction and de novo generation.
A tweet announces the Gemma 4 31B multimodal model, highlighting its speed and calling it a step towards superintelligence.
A comparison claiming that Google's Gemini outperforms Anthropic's Claude in vision and world knowledge tasks.
This paper introduces the first public multimodal dataset of 100 Turkish scam and benign phone calls, evaluating seven LLMs under raw audio, ASR transcripts, and human-corrected transcripts. Results show transcript-based inputs outperform direct audio, highlighting the need for inclusive AI safety research in low-resource languages.
Introduces PHANTOM, a large-scale open-source dataset of pre-generated adversarial attacks for vision-language models, covering 10 high-level categories and 55 subcategories of harmful intents with 47,524 adversarial samples. The dataset aims to lower the barrier for adversarial research and enable systematic evaluation of VLM robustness and safety.
AVOC introduces a retrieval-inspired token compression method for omni-modal LLMs that effectively handles hour-long audio-video inputs by selecting informative tokens based on relevance, importance, and diversity. The framework achieves state-of-the-art results on long-form audio-video understanding benchmarks, surpassing prior methods by significant margins.
MedBench v5 is a dynamic, process-oriented benchmark for clinical multimodal models that integrates hallucination detection and stress testing, moving beyond static QA to evaluate reasoning and stability under information-flow stressors.
ByteDance's Seed 2.1 model achieved strong results on the Claw-Eval multimodal agentic benchmark and the Video-MME long-video understanding benchmark, though a gap remains between its perception and agentic capabilities.
V-Zero is a novel label-free framework for fine-grained visual reasoning that uses contrastive evidence gating and on-policy distillation to improve performance without annotated answer labels, achieving faster training than traditional methods.
Wan-Streamer is a unified end-to-end multimodal model for real-time audio-visual interaction using causal attention and integrated processing of visual, audio, and text modalities, achieving sub-second latency.
ReMMD introduces a realistic multilingual multi-image agentic verification framework for multimodal misinformation detection, including a benchmark (ReMMDBench) with 500 samples and 2,756 images, and an agent (ReMMD-Agent) that achieves superior veracity performance with reduced costs.
FlowR2A proposes a novel method that combines dense reward supervision with dynamic proposal generation using a flow-matching decoder for multimodal driving planning, achieving state-of-the-art results on the NAVSIM benchmarks.
SingGuard is a multimodal guardrail system from Ant Group that treats safety policy as an input, allowing dynamic adaptation via natural language. It is released under Apache 2.0 and covers text and image modalities.
VeriEvol is a novel framework for scaling reinforcement learning in visual mathematical reasoning by ensuring reliable reward labels through a two-axis approach separating prompt difficulty from answer reliability, using evolutionary operators and hypothesis-testing verification. It achieves significant accuracy gains on a five-benchmark visual-math suite.
UniverSat introduces a Universal Patch Encoder for Vision Transformers that enables robust, sensor-agnostic spatial feature extraction across diverse Earth Observation data types, achieving strong results on classification and segmentation benchmarks.
SupraLabs released Supra-A2A-Nano-Exp, a small any-to-any autoregressive model that unifies text and image tokenization into a single Transformer, serving as an educational prototype rather than a production-ready system.
This paper systematically evaluates multimodal Chain-of-Thought reasoning across 12 tasks, finding it selectively effective for reasoning tasks but detrimental for perception tasks, and identifies a 'Look Light, Think Heavy' pattern where visual introspection declines during reasoning.
GLM 5.2, a text-only model, outperforms Fable 5 in website design when paired with Browser Use v2 multimodal QA subagents, enabling iterative improvement at low cost.
Unsloth enables free fine-tuning of a 31B-parameter multimodal model on Kaggle using 4-bit quantization, requiring only 22-24 GB of VRAM for local runs.