benchmark-dataset

Tag

#benchmark-dataset

@drfeifei: I’m very excited by this new benchmark dataset for visual generation that is suitable for the modern era of large scale…

X AI KOLs Following · 2026-05-29

Introducing GPIC (Giant Permissive Image Corpus), a large-scale dataset of 100M VLM-captioned image-text pairs for training and 1M pairs for benchmarking, fully permissive for research and commercial use.

EmoS: A High-Fidelity Multimodal Benchmark for Fine-grained Streaming Emotional Understanding

arXiv cs.CL · 2026-05-12

This article introduces EmoS, a high-fidelity multimodal benchmark designed for fine-grained streaming emotional understanding, addressing limitations in ecological validity and labeling reliability found in existing datasets.

SynopticBench: Evaluating Vision-Language Models on Generating Weather Forecast Discussions of the Future

arXiv cs.CL · 2026-04-21

This paper introduces SynopticBench, a dataset of 1.3M+ weather forecast discussions paired with meteorological images, and SPACE, a novel evaluation framework for assessing VLM-generated weather forecasts.

RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

arXiv cs.CL · 2026-04-20

RedBench introduces a universal dataset aggregating 37 benchmark datasets with 29,362 samples across 22 risk categories and 19 domains to enable standardized and comprehensive red teaming evaluation of large language models. The work addresses inconsistencies in existing red teaming datasets and provides baselines, evaluation code, and open-source resources for assessing LLM robustness against adversarial prompts.

Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants

arXiv cs.CL · 2026-04-20

This paper introduces the first parallel Arabic cultural QA benchmark spanning Modern Standard Arabic and multiple dialects, converting multiple-choice questions to open-ended formats and evaluating LLMs with chain-of-thought reasoning to address gaps in culturally grounded and dialect-specific knowledge.

Is this chart lying to me? Automating the detection of misleading visualizations

arXiv cs.CL · 2026-04-20

This paper introduces Misviz, a benchmark dataset of 2,604 real-world visualizations and 57,665 synthetic ones annotated with 12 types of misleading design violations, enabling automated detection of deceptive charts. The work evaluates state-of-the-art multimodal LLMs and rule-based systems on this challenging task, addressing the gap in resources for training AI models to combat data visualization misinformation.

UsefulBench: Towards Decision-Useful Information as a Target for Information Retrieval

arXiv cs.CL · 2026-04-20

UsefulBench introduces a domain-specific benchmark dataset that distinguishes document relevance from document usefulness in information retrieval. It shows that similarity-based IR systems conflate the two, while LLMs can separate them but lack domain expertise.

MUSCAT: MUltilingual, SCientific ConversATion Benchmark

arXiv cs.CL · 2026-04-20

MUSCAT is a new multilingual scientific conversation benchmark dataset for evaluating ASR systems on challenging scenarios, including code-switching, domain-specific vocabulary, and mixed-language input. The dataset consists of bilingual discussions of scientific papers between speakers using different languages, and the results show that current state-of-the-art systems struggle with these challenges.

PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection

arXiv cs.CL · 2026-04-20

PIIBench presents a unified multi-source benchmark corpus for detecting personally identifiable information (PII) across diverse data sources. This resource addresses the need for standardized evaluation in PII detection tasks, which is critical for privacy-preserving NLP applications.

"Excuse me, may I say something..." CoLabScience, A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations

arXiv cs.CL · 2026-04-20

CoLabScience introduces a proactive LLM assistant for biomedical research that autonomously intervenes in scientific discussions using PULI (Positive-Unlabeled Learning-to-Intervene), a novel reinforcement learning framework that determines when and how to contribute context-aware insights. The work includes BSDD, a new benchmark dataset of simulated research dialogues with intervention points derived from PubMed articles.
