multimodal

#multimodal

SpeechEQ: Benchmarking Emotional Intelligence Quotient in Socially Aware Voice Conversational Models

arXiv cs.CL · 7h ago

SpeechEQ introduces a benchmark and dataset for evaluating emotional intelligence in speech-language models, covering 15 EQ subscales across 2,265 dialogues. Experiments reveal current models struggle with paralinguistic cues, exhibiting text-reliant shortcuts and other limitations.

@askalphaxiv: "Atomistic Language Models Understand and Generate Materials" Most materials AI still treats crystals and language sepa…

X AI KOLs Timeline · 17h ago

This paper introduces an atomistic language model that integrates a 3D atom encoder, a Qwen LLM, and a diffusion crystal generator to handle multimodal materials data natively, achieving state-of-the-art crystal structure prediction and de novo generation.

@RayFernando1337: Gemma 4 31B MULTIMODAL!!!! at ROCKET speeds. WHAT DA FAHH!!! I can't contain myself rn. Speed is the first step to supe…

X AI KOLs Timeline · 18h ago

A tweet praising the speed of the Gemma 4 31B multimodal model and describing speed as a step toward superintelligence.

Claude vision v/s Gemini vision (Gemini is much better in vision and world knowledge)

Reddit r/singularity · yesterday

A comparison claiming that Google's Gemini outperforms Anthropic's Claude in vision and world knowledge tasks.

Poster: Exploring the Limits of Audio-Based Detection of Turkish Phone Call Scams

arXiv cs.CL · yesterday

This paper introduces the first public multimodal dataset of 100 Turkish scam and benign phone calls, evaluating seven LLMs under raw audio, ASR transcripts, and human-corrected transcripts. Results show transcript-based inputs outperform direct audio, highlighting the need for inclusive AI safety research in low-resource languages.

PHANTOM: A Large-Scale Dataset of Multimodal Adversarial Attacks for Vision-Language Models

arXiv cs.AI · yesterday

Introduces PHANTOM, a large-scale open-source dataset of pre-generated adversarial attacks for vision-language models, covering 10 high-level categories and 55 subcategories of harmful intents with 47,524 adversarial samples. The dataset aims to lower the barrier for adversarial research and enable systematic evaluation of VLM robustness and safety.

AVOC: Enhancing Hour-Level Audio-Video Understanding in Omni-Modal LLMs via Retrieval-Inspired Token Compression

arXiv cs.CL · yesterday

AVOC introduces a retrieval-inspired token compression method for omni-modal LLMs that effectively handles hour-long audio-video inputs by selecting informative tokens based on relevance, importance, and diversity. The framework achieves state-of-the-art results on long-form audio-video understanding benchmarks, surpassing prior methods by significant margins.

MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models

arXiv cs.CL · yesterday

MedBench v5 is a dynamic, process-oriented benchmark for clinical multimodal models that integrates hallucination detection and stress testing, moving beyond static QA to evaluate reasoning and stability under information-flow stressors.

@_TobiasLee: Seed 2.1 from Bytedance achieved impressive results on two of our benchmarks. Claw-Eval (Multimodal, https://claw-eval.…

X AI KOLs Timeline · yesterday

ByteDance's Seed 2.1 model achieved strong results on multimodal agentic (Claw-Eval) and long video understanding (Video-MME) benchmarks, though a gap remains between perception and agentic capabilities.

V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

Hugging Face Daily Papers · yesterday

V-Zero is a novel label-free framework for fine-grained visual reasoning that uses contrastive evidence gating and on-policy distillation to improve performance without annotated answer labels, achieving faster training than traditional methods.

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

Hugging Face Daily Papers · 2d ago

Wan-Streamer is a unified end-to-end multimodal model for real-time audio-visual interaction using causal attention and integrated processing of visual, audio, and text modalities, achieving sub-second latency.

ReMMD: Realistic Multilingual Multi-Image Agentic Verification for Multimodal Misinformation Detection

Hugging Face Daily Papers · 2d ago

ReMMD introduces a realistic multilingual multi-image agentic verification framework for multimodal misinformation detection, including a benchmark (ReMMDBench) with 500 samples and 2,756 images, and an agent (ReMMD-Agent) that achieves superior veracity performance with reduced costs.

FlowR2A: Learning Reward-to-Action Distribution for Multimodal Driving Planning

Hugging Face Daily Papers · 2d ago

FlowR2A proposes a novel method that combines dense reward supervision with dynamic proposal generation using a flow-matching decoder for multimodal driving planning, achieving state-of-the-art results on the NAVSIM benchmarks.

@AdinaYakup: SingGuard from Ant Group @AntLingAGI A multimodal guardrail where the safety policy is an input, not a fixed weight. - …

X AI KOLs Timeline · 2d ago

SingGuard is a multimodal guardrail system from Ant Group that treats safety policy as an input, allowing dynamic adaptation via natural language. It is released under Apache 2.0 and covers text and image modalities.

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

Hugging Face Daily Papers · 3d ago

VeriEvol is a novel framework for scaling reinforcement learning in visual mathematical reasoning by ensuring reliable reward labels through a two-axis approach separating prompt difficulty from answer reliability, using evolutionary operators and hypothesis-testing verification. It achieves significant accuracy gains on a five-benchmark visual-math suite.

UniverSat: Resolution- and Modality-Agnostic Transformers for Earth Observation

Hugging Face Daily Papers · 3d ago

UniverSat introduces a Universal Patch Encoder for Vision Transformers that enables robust, sensor-agnostic spatial feature extraction across diverse Earth Observation data types, achieving strong results on classification and segmentation benchmarks.

[NEW MODEL] SupraLabs started the Any2Any model family!

Reddit r/LocalLLaMA · 4d ago

SupraLabs released Supra-A2A-Nano-Exp, a small any-to-any autoregressive model that unifies text and image tokenization into a single Transformer, serving as an educational prototype rather than a production-ready system.

Look Light, Think Heavy: What Multimodal Chain-of-Thought Reasoning Can and Cannot Do

Hugging Face Daily Papers · 4d ago

This paper systematically evaluates multimodal Chain-of-Thought reasoning across 12 tasks, finding it selectively effective for reasoning tasks but detrimental for perception tasks, and identifies a 'Look Light, Think Heavy' pattern where visual introspection declines during reasoning.

@browser_use: GLM 5.2 just beat Fable 5 at website design. The crazy part: GLM is text-only. It can build the site, but it can’t insp…

X AI KOLs Following · 4d ago

GLM 5.2, a text-only model, outperforms Fable 5 in website design when paired with Browser Use v2 multimodal QA subagents, enabling iterative improvement at low cost.

@0x0SojalSec: Imagine fine-tuning a 31B parameter multimodal model for free,, on Kaggle. Now you can train this massive 31B dense mul…

X AI KOLs Timeline · 5d ago

Unsloth enables free fine-tuning of a 31B parameter multimodal model on Kaggle using 4-bit quantization, requiring only 22-24GB VRAM for local runs.
