dataset

#dataset

ShotcreteDepth: A Bi-modal Dataset for Robust Robotic Depth Perception in Shotcrete Construction Environments

Hugging Face Daily Papers · 2d ago

ShotcreteDepth is a bi-modal dataset of stereo RGB and LiDAR data from construction environments, designed to support research in depth perception under challenging conditions. The dataset includes 11,252 samples, 220 of which are annotated, and is accompanied by a lightweight annotation tool.

#dataset

Tmax: A simple recipe for terminal agents

Hugging Face Daily Papers · 2d ago

Tmax introduces a simplified RL training recipe for terminal agents, achieving state-of-the-art performance with a 9B parameter model using a novel data generation taxonomy and an expanded open-source dataset.

#dataset

Counsel: A Meta-Evaluation Dataset for Agentic Tasks

Hugging Face Daily Papers · 5d ago

Counsel is the first public dataset of human meta-evaluations of LLM critiques for agentic tasks, designed to improve the calibration and reliability of automated evaluation methods.

#dataset

@ManlingLi_: Planning with the views: Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We int…

X AI KOLs Following · 5d ago

Introduces ViewSuite, a benchmark with 6DoF camera control and ~165K tasks for evaluating VLMs' ability to plan camera moves. Finds a planning gap where models can track but not compose plans, and proposes View Graph Distillation (RL-Graph-SFT) to improve success from 2.5% to 47.8%.

#dataset

LocalLLaMA crowdsourced coding dataset

Reddit r/LocalLLaMA · 5d ago

A community member proposes creating a crowdsourced coding dataset for local LLMs to enable collaborative model training and fine-tuning, addressing concerns about future availability of open-weight models.

#dataset

PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes

arXiv cs.CL · 5d ago

This paper introduces PEC-Home, a simulated home dataset for interpreting progressively elliptical commands in smart homes, and finds that current LLM-based assistants struggle with such commands due to referential and intention ambiguity.

#dataset

Towards Multi-Agent-Simulation-Based Community Note Evaluation

arXiv cs.AI · 5d ago

This paper introduces ComRate, a large-scale dataset of community notes and ratings from X, and proposes MultiCom, a persona-guided multi-agent framework for simulating community note evaluation. The approach achieves 84.7% accuracy in predicting note helpfulness.

#dataset

ThousandWorlds: A benchmark for climate emulation of potentially habitable exoplanets

arXiv cs.LG · 5d ago

ThousandWorlds is a benchmark dataset for machine-learning emulation of exoplanet climates, containing approximately 1800 simulations from five global climate models. Gaussian process methods outperform deep learning baselines in this low-data, multi-simulator regression task.

#dataset

JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines

Hugging Face Daily Papers · 6d ago

JamSet and JamBench are introduced as a dataset and benchmark for project-level game code generation on the Godot engine, derived from Game Jam projects, with evaluation showing a capability cliff for AI models as project scale increases.

#dataset

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

Hugging Face Daily Papers · 6d ago

Introduces DF3DV-1K, a large-scale real-world dataset with 1,048 scenes and 89,924 images for distractor-free novel view synthesis, along with a benchmark of nine methods and an application improving radiance field methods via fine-tuning a diffusion-based 2D enhancer.

#dataset

@rohanpaul_ai: This was long needed for AI in finance. Making SEC filings readable for machines without flattening the accounting logi…

X AI KOLs Following · 6d ago

Researchers from Stanford, UC, and Nanjing University release SEFD, a dataset of 152B tokens from SEC filings converted to layout-faithful MultiMarkdown, preserving table structure for LLM training with minimal overlap with Common Crawl.

#dataset

LLMs Infer Cultural Context but Fail to Apply It When Responding

arXiv cs.CL · 6d ago

This paper introduces CAPRI, a dataset to evaluate whether LLMs can infer a user's cultural background from conversational cues and adapt their responses (e.g., using appropriate measurement units). Experiments show LLMs can infer cultural context but often fail to apply it unless explicitly prompted.

#dataset

Translating the Untranslatable: An Operationalizable Ontology for Untranslatability

arXiv cs.CL · 6d ago

This paper introduces a structured ontology for untranslatability in machine translation, along with a taxonomy of compensation strategies and a multilingual dataset. Human preference studies show translator quality depends on the strategy used, with a preference for explanatory translations.

#dataset

MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning

arXiv cs.AI · 6d ago

This paper introduces MathVis-Fine, a framework for fine-grained visual dependency modeling in multimodal mathematical reasoning, along with a new dataset and a two-stage progressive training paradigm that balances answer correctness and visual grounding rewards based on each sample's intrinsic visual dependency level.

#dataset

FllumaOne: A Code-Native Multimodal CAD Dataset with Executable Programs and Kernel-Validated Feature Histories

arXiv cs.AI · 6d ago

Introduces FllumaOne, a multimodal CAD dataset of 100,000 samples generated by executable Python programs in the Flluma CAD system, providing feature trees, STEP geometry, point clouds, and language descriptions to support editable CAD research.

#dataset

@elliotarledge: just downloaded 16,459 kernels from a @SakanaAILabs dataset and compiling + benchmarking them. great open source datase…

X AI KOLs Following · 2026-06-16

The author downloaded 16,459 CUDA kernels from SakanaAI's open-source dataset and is compiling and benchmarking them.

#dataset

Donate your coding sessions to an open CC-BY-4.0 dataset to help train open-weight and open source models

Reddit r/LocalLLaMA · 2026-06-16

A new initiative called Trace Commons aims to collect coding agent traces into an open CC-BY-4.0 dataset to help train open-weight and open-source models, countering the data advantage of proprietary models from Anthropic and OpenAI.

#dataset

MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition

arXiv cs.CL · 2026-06-15

This paper introduces MoDiCoL, a modular diagnostic continual learning dataset for robust speech recognition, enabling controlled analysis of linguistic content, speaker characteristics, and acoustic environments, and proposes a continual learning curriculum to study how robustness is acquired, transferred, and forgotten.

#dataset

Want to build a custom model

Reddit r/LocalLLaMA · 2026-06-14

A user discusses building a small autocomplete model (25M parameters) as a learning project, mentions hardware constraints (32GB VRAM), data requirements (~100M tokens), and seeks advice on datasets and data formatting for autocomplete-style training.

#dataset

A Context-Aware Dataset for Stance Detection in Bioethical Controversies on Reddit

arXiv cs.CL · 2026-06-12

Presents BioStance, a context-aware dataset of 39,600 annotated Reddit post-comment pairs for stance detection in bioethical controversies, covering six targets across three dimensions of bioethical debate.
