dataset

#dataset

MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs

arXiv cs.CL ↗ · 2d ago Cached

This paper introduces MedAction, a framework for training LLMs on active, multi-turn clinical diagnosis by simulating iterative test ordering and hypothesis updates. It presents a new dataset, MedAction-32K, and demonstrates state-of-the-art performance for open-source models on medical benchmarks.

#dataset

MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

arXiv cs.CL ↗ · 2d ago Cached

The paper introduces MIST, a synthetic dataset and framework for training multimodal voice assistants to control IoT devices in smart homes. It highlights significant performance gaps between open and closed-weight models in handling complex, speech-based tool-calling tasks.

#dataset

MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation

Hugging Face Daily Papers ↗ · 4d ago Cached

MuSS introduces a large-scale dataset and benchmark for multi-shot subject-to-video generation, addressing narrative logic and copy-paste issues in cinematic storytelling.

#dataset

100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts

Hugging Face Daily Papers ↗ · 4d ago Cached

This paper introduces a multilingual dataset of over 100,000 movie reviews from Kazakhstan, containing Russian, Kazakh, and code-switched texts. It benchmarks classical and transformer-based models on sentiment polarity and score classification tasks.

#dataset

@bclavie: This might be the best IR release of the year. Text benchmarking is (was) broken, DL19/DL20/BEIR no longer hold valuabl…

X AI KOLs Following ↗ · 5d ago Cached

A new IR benchmark release addresses broken text benchmarking in DL19/DL20/BEIR, so that improvements from current-era training methods can be measured meaningfully.

#dataset

BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models

arXiv cs.CL ↗ · 5d ago Cached

BioTool introduces a comprehensive biomedical tool-calling dataset with 34 tools and 7,040 human-verified query-API pairs, enabling fine-tuned LLMs to outperform GPT-5.1 on biomedical tool use and significantly enhance answer quality.

#dataset

HumanNet: Scaling Human-centric Video Learning to One Million Hours

Hugging Face Daily Papers ↗ · 6d ago Cached

HumanNet is a large-scale human-centric video dataset with one million hours of annotated footage, designed to train vision-language-action models. It demonstrates that egocentric human video can effectively replace robot data for embodied intelligence tasks.

#dataset

Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance

Hugging Face Daily Papers ↗ · 6d ago Cached

This paper introduces Sparkle, a new dataset and benchmark for instruction-guided video background replacement, addressing the lack of high-quality training data in this domain. It proposes a scalable pipeline with decoupled guidance to generate realistic foreground-background interactions.

#dataset

TT4D: A Pipeline and Dataset for Table Tennis 4D Reconstruction From Monocular Videos

Hugging Face Daily Papers ↗ · 2026-05-02 Cached

This paper introduces TT4D, a novel pipeline and large-scale dataset for reconstructing table tennis gameplay in 4D from monocular videos. It features a unique lift-first approach that estimates 3D ball trajectories and spin before time segmentation, enabling robust reconstruction even with occlusions.

#dataset

EDU-CIRCUIT-HW: Evaluating Multimodal Large Language Models on Real-World University-Level STEM Student Handwritten Solutions

Hugging Face Daily Papers ↗ · 2026-04-30 Cached

This paper introduces the EDU-CIRCUIT-HW dataset for evaluating multimodal large language models on real-world, university-level handwritten STEM solutions. It reveals significant recognition limitations and proposes a hybrid approach that combines automated recognition with minimal human oversight to make grading more robust.

#dataset

MIT scientists build the world’s largest collection of Olympiad-level math problems, and open it to everyone

MIT News — Artificial Intelligence ↗ · 2026-04-24 Cached

MIT researchers, in collaboration with KAUST and HUMAIN, have released MathNet, the largest open-source dataset of Olympiad-level math problems, containing over 30,000 expert-authored problems from 47 countries.

#dataset

3.4M Solar Panels

Hacker News Top ↗ · 2026-04-22 Cached

Version 2 of the GM-SEUS open dataset now maps 3.4 million U.S. solar panels plus new rooftop arrays, up from 2.9 million in v1.

#dataset

MIT & the IMO released MathNet, the world’s largest dataset of International Math Olympiad problems & solutions. MathNet is 5x larger than previous datasets & is sourced from over 40 countries across 4 decades

Reddit r/LocalLLaMA ↗ · 2026-04-22

MIT and the IMO release MathNet, a massive dataset of International Math Olympiad problems and solutions spanning 40 years and 40+ countries, 5x larger than prior datasets.

#dataset

@Chenyang_Lyu: Excited to publicly release LongSpeech, which will be presented at #ICASSP2026 ! Most Audio LLMs are at short audio but…

X AI KOLs Following ↗ · 2026-04-21

Researchers release LongSpeech, a 100k-segment dataset of ~10-min clips to benchmark long-form audio understanding across 8 tasks, to be presented at ICASSP 2026.

#dataset

BIASEDTALES-ML: A Multilingual Dataset for Analyzing Narrative Attribute Distributions in LLM-Generated Stories

arXiv cs.CL ↗ · 2026-04-21 Cached

Researchers introduce BIASEDTALES-ML, a large-scale multilingual dataset of ~350,000 LLM-generated children's stories across eight languages, designed to analyze narrative attribute distributions and cross-lingual bias patterns in language model outputs. The work reveals significant cross-lingual variability, highlighting limitations of English-centric bias evaluations.

#dataset

CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark

arXiv cs.CL ↗ · 2026-04-21 Cached

Researchers from Peking University introduce CFMS, the first fine-grained Chinese multimodal sarcasm detection benchmark with 2,796 image-text pairs and a triple-level annotation framework (sarcasm identification, target recognition, explanation generation), along with a novel RL-augmented in-context learning method (PGDS) that significantly outperforms existing baselines.

#dataset

KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

arXiv cs.CL ↗ · 2026-04-20 Cached

KMMMU is a native Korean benchmark for evaluating multimodal understanding with 3,466 questions across nine disciplines and visual modality categories, addressing the gap of English-centric benchmarks by testing performance on Korean-specific cultural and institutional contexts.

#dataset

Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review

arXiv cs.CL ↗ · 2026-04-20 Cached

This paper introduces the Re3Align dataset, the REspGen framework, and the REspEval evaluation suite for author-in-the-loop response generation in peer review, integrating author expertise and intent signals. The work addresses gaps in how NLP formulates scientific rebuttal writing, contributing comprehensive datasets, a controllable generation framework, and multi-dimensional evaluation metrics.

#dataset

Learning to Reason with Insight for Informal Theorem Proving

arXiv cs.CL ↗ · 2026-04-20 Cached

This paper proposes DeepInsightTheorem, a hierarchical dataset and Progressive Multi-Stage SFT training strategy to improve LLMs' informal theorem proving by teaching them to identify and apply core techniques through insight-aware reasoning.

#dataset

Sentiment Analysis of German Sign Language Fairy Tales

arXiv cs.CL ↗ · 2026-04-20 Cached

This paper presents a dataset and an XGBoost-based model for sentiment analysis of German Sign Language (DGS) fairy tales, using facial and body motion features extracted via MediaPipe. The model achieves 63.1% balanced accuracy, and the results show that both facial and body movements matter for communicating sentiment in sign language.
