GPIC (Giant Permissive Image Corpus) is a large-scale dataset of 100M VLM-captioned image-text pairs for training and 1M pairs for benchmarking, fully permissive for research and commercial use.
This article introduces EmoS, a high-fidelity multimodal benchmark designed for fine-grained streaming emotional understanding, addressing limitations in ecological validity and labeling reliability found in existing datasets.
This paper introduces SynopticBench, a dataset of 1.3M+ weather forecast discussions paired with meteorological images, and SPACE, a novel evaluation framework for assessing VLM-generated weather forecasts.
RedBench is a universal dataset that aggregates 37 benchmark datasets, with 29,362 samples across 22 risk categories and 19 domains, to enable standardized and comprehensive red-teaming evaluation of large language models. The work addresses inconsistencies in existing red-teaming datasets and provides baselines, evaluation code, and open-source resources for assessing LLM robustness against adversarial prompts.
This paper introduces the first parallel Arabic cultural QA benchmark spanning Modern Standard Arabic and multiple dialects. It converts multiple-choice questions to open-ended formats and evaluates LLMs with chain-of-thought reasoning to address gaps in culturally grounded and dialect-specific knowledge.
This paper introduces Misviz, a benchmark dataset of 2,604 real-world visualizations and 57,665 synthetic ones annotated with 12 types of misleading design violations, enabling automated detection of deceptive charts. The work evaluates state-of-the-art multimodal LLMs and rule-based systems on this challenging task, addressing the gap in resources for training AI models to combat data visualization misinformation.
UsefulBench is a domain-specific benchmark dataset that distinguishes between document relevance and usefulness for information retrieval. It shows that similarity-based IR systems conflate the two concepts, while LLMs can separate them but lack domain expertise.
MUSCAT is a new multilingual scientific conversation benchmark dataset for evaluating ASR systems on challenging scenarios including code-switching, domain-specific vocabulary, and mixed-language input. The dataset consists of bilingual discussions of scientific papers between speakers using different languages, and results show that current state-of-the-art systems struggle with these multilingual challenges.
PIIBench presents a unified multi-source benchmark corpus for detecting personally identifiable information (PII) across diverse data sources. This resource addresses the need for standardized evaluation in PII detection tasks, which is critical for privacy-preserving NLP applications.
CoLabScience introduces a proactive LLM assistant for biomedical research that autonomously intervenes in scientific discussions using PULI (Positive-Unlabeled Learning-to-Intervene), a novel reinforcement learning framework that determines when and how to contribute context-aware insights. The work includes BSDD, a new benchmark dataset of simulated research dialogues with intervention points derived from PubMed articles.