A new IR benchmark release addresses broken text benchmarking in DL19/DL20/BEIR, enabling meaningful measurement of improvements from current-era training methods.
BioTool introduces a comprehensive biomedical tool-calling dataset with 34 tools and 7,040 human-verified query-API pairs, enabling fine-tuned LLMs to outperform GPT-5.1 on biomedical tool use and significantly enhance answer quality.
This paper introduces Sparkle, a new dataset and benchmark for instruction-guided video background replacement, addressing the lack of high-quality training data in this domain. It proposes a scalable pipeline with decoupled guidance to generate realistic foreground-background interactions.
This paper introduces TT4D, a novel pipeline and large-scale dataset for reconstructing table tennis gameplay in 4D from monocular videos. It features a unique lift-first approach that estimates 3D ball trajectories and spin before time segmentation, enabling robust reconstruction even with occlusions.
This paper introduces the EDU-CIRCUIT-HW dataset for evaluating multimodal large language models on real-world university-level STEM handwritten solutions, revealing significant recognition limitations and proposing a hybrid approach that combines automated recognition with minimal human oversight to enhance grading robustness.
MIT researchers, in collaboration with KAUST and HUMAIN, have released MathNet, the largest open-source dataset of Olympiad-level math problems, containing over 30,000 expert-authored problems from 47 countries.
Version 2 of the GM-SEUS open dataset now maps 3.4 million U.S. solar panels plus new rooftop arrays, up from 2.9 million in v1.
MIT and the IMO release MathNet, a massive dataset of International Math Olympiad problems and solutions spanning 40 years and 40+ countries, 5x larger than prior datasets.
Researchers release LongSpeech, a 100k-segment dataset of ~10-min clips to benchmark long-form audio understanding across 8 tasks, to be presented at ICASSP 2026.
Researchers introduce BIASEDTALES-ML, a large-scale multilingual dataset of ~350,000 LLM-generated children's stories across eight languages, designed to analyze narrative attribute distributions and cross-lingual bias patterns in language model outputs. The work reveals significant cross-lingual variability, highlighting limitations of English-centric bias evaluations.
Researchers from Peking University introduce CFMS, the first fine-grained Chinese multimodal sarcasm detection benchmark with 2,796 image-text pairs and a triple-level annotation framework (sarcasm identification, target recognition, explanation generation), along with a novel RL-augmented in-context learning method (PGDS) that significantly outperforms existing baselines.
KMMMU is a native Korean benchmark for evaluating multimodal understanding, with 3,466 questions spanning nine disciplines and multiple visual modality categories; it addresses the gap left by English-centric benchmarks by testing performance on Korean-specific cultural and institutional contexts.
This paper introduces the Re3Align dataset, REspGen framework, and REspEval evaluation suite for author-in-the-loop response generation in peer review, integrating author expertise and intent signals. The work addresses gaps in NLP formulation of scientific rebuttal writing with comprehensive datasets, controllable generation frameworks, and multi-dimensional evaluation metrics.
This paper proposes DeepInsightTheorem, a hierarchical dataset and Progressive Multi-Stage SFT training strategy to improve LLMs' informal theorem proving by teaching them to identify and apply core techniques through insight-aware reasoning.
This paper presents a dataset and an XGBoost-based model for sentiment analysis of German Sign Language (DGS) fairy tales using facial and body motion features extracted via MediaPipe, achieving 63.1% balanced accuracy and demonstrating that both facial and body movements matter for sentiment communication in sign language.
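The pipeline described in that item (MediaPipe landmarks in, gradient-boosted classifier out) can be sketched in a few lines. This is a minimal, hypothetical illustration of the feature-construction step only: per-frame face and pose landmarks are pooled over time into one fixed-length vector per clip. The mean/std pooling and the use of standard MediaPipe landmark counts (468 face-mesh points, 33 pose points) are assumptions, not the paper's exact feature design.

```python
import numpy as np

# Standard MediaPipe landmark counts (face mesh: 468, pose: 33),
# each landmark a 3-D coordinate. Assumed here for illustration.
N_FACE, N_POSE, DIMS = 468, 33, 3

def clip_features(face_lms: np.ndarray, pose_lms: np.ndarray) -> np.ndarray:
    """Pool per-frame landmarks of shape (T, N, 3) into one fixed-length vector.

    Mean/std pooling over time is one simple way to summarize both
    average configuration and movement variability; the paper's actual
    feature design may differ.
    """
    feats = []
    for lms in (face_lms, pose_lms):
        flat = lms.reshape(lms.shape[0], -1)  # (T, N*3)
        feats.append(flat.mean(axis=0))       # average position
        feats.append(flat.std(axis=0))        # motion variability
    return np.concatenate(feats)

# Synthetic 50-frame clip standing in for real MediaPipe output.
rng = np.random.default_rng(0)
face = rng.normal(size=(50, N_FACE, DIMS))
pose = rng.normal(size=(50, N_POSE, DIMS))
x = clip_features(face, pose)
print(x.shape)  # one row per clip, ready for an XGBoost-style classifier
```

Each clip then becomes one training row for the boosted-tree classifier; stacking rows from many clips yields the usual (n_clips, n_features) design matrix.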
SGOCR is an open-source dataset pipeline for generating spatially-grounded, OCR-focused visual question answering (VQA) tuples with rich metadata to support diverse VLM training. The pipeline uses a multi-stage approach combining models like Nvidia's nemotron-ocr-v2, Gemma4, Qwen3-VL, and Gemini-2.5-Flash, along with an agentic optimization loop.
Researchers release Terminal Wrench, a dataset of 331 reward-hackable terminal environments with 3,632 exploit trajectories spanning sysadmin, ML, and security tasks.
VEFX-Bench introduces a large-scale human-annotated video editing dataset (5,049 examples) with multi-dimensional quality labels and a specialized reward model for standardized evaluation of video editing systems. The paper addresses the lack of comprehensive benchmarks in AI-assisted video creation by providing VEFX-Dataset, VEFX-Reward, and a 300-video-prompt benchmark that reveals gaps in current editing models.
This paper presents the NTIRE 2026 Challenge on Video Saliency Prediction, introducing a novel dataset of 2,000 diverse videos with saliency maps collected via crowdsourced mouse tracking from over 5,000 assessors. Over 20 teams participated, with 7 passing the final phase, and all data is made publicly available.
OpenAI trained a system using verifiers to solve grade school math word problems at roughly 90% of child-level accuracy, nearly doubling fine-tuned GPT-3 performance. The approach addresses language models' weakness in multistep reasoning by training verifiers to evaluate candidate solutions and select the best one.
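The selection scheme described there (sample many candidate solutions, score each with a trained verifier, keep the top-scoring one) reduces to a best-of-n loop. The sketch below illustrates only that loop; the generator and verifier are stand-in stubs, since the real system uses fine-tuned language models for both roles.

```python
import random

def generate_candidates(problem: str, n: int = 100) -> list[str]:
    """Stand-in for sampling n candidate solutions from a generator LM."""
    rng = random.Random(0)  # fixed seed so repeated calls agree
    return [f"answer = {rng.randint(0, 20)}" for _ in range(n)]

def verifier_score(problem: str, solution: str) -> float:
    """Stand-in for a trained verifier's estimated correctness.

    Here, solutions closer to 12 score higher (purely illustrative);
    a real verifier would be a model scoring problem-solution pairs.
    """
    value = int(solution.split("=")[1])
    return 1.0 / (1.0 + abs(value - 12))

def solve(problem: str, n: int = 100) -> str:
    """Best-of-n: sample candidates, return the verifier's top pick."""
    candidates = generate_candidates(problem, n)
    return max(candidates, key=lambda s: verifier_score(problem, s))

problem = "If Amy has 3 apples and buys 9 more, how many does she have?"
print(solve(problem))
```

The key property is that test-time compute (more samples n) buys accuracy, since the verifier only has to rank candidates rather than produce a correct solution itself.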