Tag
ShotcreteDepth is a bi-modal dataset of stereo RGB and LiDAR data from construction environments, designed to support research in depth perception under challenging conditions. The dataset includes 11,252 samples, 220 of which are annotated, and is accompanied by a lightweight annotation tool.
Tmax introduces a simplified RL training recipe for terminal agents, achieving state-of-the-art performance with a 9B parameter model using a novel data generation taxonomy and an expanded open-source dataset.
Counsel is the first public dataset of human meta-evaluations of LLM critiques for agentic tasks, designed to improve the calibration and reliability of automated evaluation methods.
Introduces ViewSuite, a benchmark with 6DoF camera control and ~165K tasks for evaluating VLMs' ability to plan camera moves. Finds a planning gap where models can track but not compose plans, and proposes View Graph Distillation (RL-Graph-SFT) to improve success from 2.5% to 47.8%.
A community member proposes creating a crowdsourced coding dataset for local LLMs to enable collaborative model training and fine-tuning, addressing concerns about future availability of open-weight models.
This paper introduces PEC-Home, a simulated home dataset for interpreting progressively elliptical commands in smart homes, and finds that current LLM-based assistants struggle with such commands due to referential and intention ambiguity.
This paper introduces ComRate, a large-scale dataset of community notes and ratings from X, and proposes MultiCom, a persona-guided multi-agent framework for simulating community note evaluation. The approach achieves 84.7% accuracy in predicting note helpfulness.
ThousandWorlds is a benchmark dataset for machine-learning emulation of exoplanet climates, containing approximately 1,800 simulations from five global climate models. Gaussian process methods outperform deep learning baselines in this low-data, multi-simulator regression task.
JamSet and JamBench are introduced as a dataset and benchmark for project-level game code generation on the Godot engine, derived from Game Jam projects, with evaluation showing a capability cliff for AI models as project scale increases.
Introduces DF3DV-1K, a large-scale real-world dataset with 1,048 scenes and 89,924 images for distractor-free novel view synthesis, along with a benchmark of nine methods and an application improving radiance field methods via fine-tuning a diffusion-based 2D enhancer.
Researchers from Stanford, UC, and Nanjing University release SEFD, a dataset of 152B tokens from SEC filings converted to layout-faithful MultiMarkdown, preserving table structure for LLM training with minimal overlap with Common Crawl.
This paper introduces CAPRI, a dataset to evaluate whether LLMs can infer a user's cultural background from conversational cues and adapt their responses (e.g., using appropriate measurement units). Experiments show LLMs can infer cultural context but often fail to apply it unless explicitly prompted.
This paper introduces a structured ontology for untranslatability in machine translation, along with a taxonomy of compensation strategies and a multilingual dataset. Human preference studies show that perceived translation quality depends on the strategy used, with a preference for explanatory translations.
This paper introduces MathVis-Fine, a framework for fine-grained visual dependency modeling in multimodal mathematical reasoning, along with a new dataset and a two-stage progressive training paradigm that balances answer correctness and visual grounding rewards based on each sample's intrinsic visual dependency level.
Introduces FllumaOne, a multimodal CAD dataset of 100,000 samples generated by executable Python programs in the Flluma CAD system, providing feature trees, STEP geometry, point clouds, and language descriptions to support editable CAD research.
Reports downloading and compiling 16,459 CUDA kernels from SakanaAI's open-source dataset and benchmarking their performance.
A new initiative called Trace Commons aims to collect coding agent traces into an open CC-BY-4.0 dataset to help train open-weight and open-source models, countering the data advantage of proprietary models from Anthropic and OpenAI.
This paper introduces MoDiCoL, a modular diagnostic continual learning dataset for robust speech recognition, enabling controlled analysis of linguistic content, speaker characteristics, and acoustic environments, and proposes a continual learning curriculum to study how robustness is acquired, transferred, and forgotten.
A user discusses building a small autocomplete model (25M parameters) as a learning project, mentions hardware constraints (32GB VRAM) and data requirements (~100M tokens), and seeks advice on datasets and data formatting for autocomplete-style training.
Presents BioStance, a context-aware dataset of 39,600 annotated Reddit post-comment pairs for stance detection in bioethical controversies, covering six targets across three dimensions of bioethical debate.