Tag
This paper introduces MedAction, a framework for training LLMs on active, multi-turn clinical diagnosis by simulating iterative test ordering and hypothesis updates. It presents a new dataset, MedAction-32K, and demonstrates state-of-the-art performance for open-source models on medical benchmarks.
The paper introduces MIST, a synthetic dataset and framework for training multimodal voice assistants to control IoT devices in smart homes. It highlights significant performance gaps between open and closed-weight models in handling complex, speech-based tool-calling tasks.
MuSS introduces a large-scale dataset and benchmark for multi-shot subject-to-video generation, addressing narrative logic and copy-paste issues in cinematic storytelling.
This paper introduces a multilingual dataset of over 100,000 movie reviews from Kazakhstan, containing Russian, Kazakh, and code-switched texts. It benchmarks classical and transformer-based models on sentiment polarity and score classification tasks.
A new IR benchmark release addresses broken text benchmarking in DL19/DL20/BEIR, enabling meaningful measurement of improvements from current-era training methods.
BioTool introduces a comprehensive biomedical tool-calling dataset with 34 tools and 7,040 human-verified query-API pairs, enabling fine-tuned LLMs to outperform GPT-5.1 on biomedical tool use and significantly enhance answer quality.
HumanNet is a large-scale human-centric video dataset with one million hours of annotated footage, designed to train vision-language-action models. It demonstrates that egocentric human video can effectively replace robot data for embodied intelligence tasks.
This paper introduces Sparkle, a new dataset and benchmark for instruction-guided video background replacement, addressing the lack of high-quality training data in this domain. It proposes a scalable pipeline with decoupled guidance to generate realistic foreground-background interactions.
This paper introduces TT4D, a novel pipeline and large-scale dataset for reconstructing table tennis gameplay in 4D from monocular videos. It features a unique lift-first approach that estimates 3D ball trajectories and spin before time segmentation, enabling robust reconstruction even with occlusions.
This paper introduces the EDU-CIRCUIT-HW dataset for evaluating multimodal large language models on real-world, university-level handwritten STEM solutions. It reveals significant recognition limitations and proposes a hybrid approach that combines automated recognition with minimal human oversight to make grading more robust.
MIT researchers, in collaboration with KAUST and HUMAIN, have released MathNet, the largest open-source dataset of Olympiad-level math problems, containing over 30,000 expert-authored problems from 47 countries.
Version 2 of the GM-SEUS open dataset now maps 3.4 million U.S. solar panels plus new rooftop arrays, up from 2.9 million in v1.
MIT and the IMO release MathNet, a massive dataset of International Math Olympiad problems and solutions spanning 40 years and 40+ countries, 5x larger than prior datasets.
Researchers release LongSpeech, a 100k-segment dataset of ~10-min clips to benchmark long-form audio understanding across 8 tasks, to be presented at ICASSP 2026.
Researchers introduce BIASEDTALES-ML, a large-scale multilingual dataset of ~350,000 LLM-generated children's stories across eight languages, designed to analyze narrative attribute distributions and cross-lingual bias patterns in language model outputs. The work reveals significant cross-lingual variability, highlighting limitations of English-centric bias evaluations.
Researchers from Peking University introduce CFMS, the first fine-grained Chinese multimodal sarcasm detection benchmark with 2,796 image-text pairs and a triple-level annotation framework (sarcasm identification, target recognition, explanation generation), along with a novel RL-augmented in-context learning method (PGDS) that significantly outperforms existing baselines.
KMMMU is a native Korean benchmark for evaluating multimodal understanding with 3,466 questions across nine disciplines and visual modality categories, addressing the gap of English-centric benchmarks by testing performance on Korean-specific cultural and institutional contexts.
This paper introduces the Re3Align dataset, REspGen framework, and REspEval evaluation suite for author-in-the-loop response generation in peer review, integrating author expertise and intent signals. The work addresses gaps in the NLP formulation of scientific rebuttal writing with comprehensive datasets, controllable generation frameworks, and multi-dimensional evaluation metrics.
This paper proposes DeepInsightTheorem, a hierarchical dataset, and a Progressive Multi-Stage SFT training strategy to improve LLMs' informal theorem proving by teaching them to identify and apply core techniques through insight-aware reasoning.
This paper presents a dataset and an XGBoost-based model for sentiment analysis of German Sign Language (DGS) fairy tales, using facial and body motion features extracted via MediaPipe. The model achieves 63.1% balanced accuracy, and the results demonstrate the importance of both facial and body movements for communicating sentiment in sign language.