KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning
Summary
KinDER is a new open-source benchmark for physical reasoning in robotics, featuring procedurally generated environments and baselines to evaluate kinematic and dynamic constraint challenges.
Cached at: 05/08/26, 07:47 AM
Source: https://huggingface.co/papers/2604.25788
Abstract
KinDER is a benchmark for physical reasoning in robotics that includes procedurally generated environments and baselines spanning multiple learning paradigms to address kinematic and dynamic constraint challenges.
Robotic systems that interact with the physical world must reason about kinematic and dynamic constraints imposed by their own embodiment, their environment, and the task at hand. We introduce KinDER, a benchmark for Kinematic and Dynamic Embodied Reasoning that targets physical reasoning challenges arising in robot learning and planning. KinDER comprises 25 procedurally generated environments, a Gymnasium-compatible Python library with parameterized skills and demonstrations, and a standardized evaluation suite with 13 implemented baselines spanning task and motion planning, imitation learning, reinforcement learning, and foundation-model-based approaches. The environments are designed to isolate five core physical reasoning challenges: basic spatial relations, nonprehensile multi-object manipulation, tool use, combinatorial geometric constraints, and dynamic constraints, disentangled from perception, language understanding, and application-specific complexity. Empirical evaluation shows that existing methods struggle to solve many of the environments, indicating substantial gaps in current approaches to physical reasoning. We additionally include real-to-sim-to-real experiments on a mobile manipulator to assess the correspondence between simulation and real-world physical interaction. KinDER is fully open-sourced and intended to enable systematic comparison across diverse paradigms for advancing physical reasoning in robotics. Website and code: https://prpl-group.com/kinder-site/
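The abstract notes that KinDER environments are exposed through the standard Gymnasium API. As a minimal illustration only, the sketch below shows that interaction loop with a stand-in environment class (the `StubEnv` name and its dynamics are hypothetical; real task IDs and observation spaces come from the KinDER library itself):

```python
import random


class StubEnv:
    """Hypothetical stand-in for a KinDER task environment.

    Mirrors the Gymnasium interface: reset() returns (obs, info),
    step(action) returns (obs, reward, terminated, truncated, info).
    """

    def __init__(self, horizon=50):
        self.horizon = horizon  # episode length before truncation
        self.t = 0

    def reset(self, seed=None):
        random.seed(seed)
        self.t = 0
        return [0.0, 0.0], {}  # initial observation, empty info dict

    def step(self, action):
        self.t += 1
        obs = [random.uniform(-1, 1), random.uniform(-1, 1)]
        reward = -abs(action)            # toy cost on action magnitude
        terminated = False               # no success condition in this stub
        truncated = self.t >= self.horizon
        return obs, reward, terminated, truncated, {}


# Standard Gymnasium-style rollout with a random policy.
env = StubEnv()
obs, info = env.reset(seed=0)
total = 0.0
done = False
while not done:
    action = random.uniform(-1, 1)  # random-policy baseline action
    obs, reward, terminated, truncated, info = env.step(action)
    total += reward
    done = terminated or truncated
print(f"episode return: {total:.2f}")
```

A real baseline would swap the random policy for a learned or planned one; the surrounding loop stays the same, which is what makes the benchmark's evaluation suite comparable across paradigms.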
Get this paper in your agent:
hf papers read 2604.25788
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 4
#### kinder-bench/kinder-openpi-checkpoints Robotics • Updated about 12 hours ago
#### kinder-bench/kinder-DP-checkpoints Robotics • Updated about 12 hours ago
#### kinder-bench/kinder-DPES-checkpoints Robotics • Updated about 12 hours ago
#### kinder-bench/kinder-mbrl-checkpoints Robotics • Updated about 12 hours ago
Datasets citing this paper: 1
#### kinder-bench/kinder-datasets Updated about 12 hours ago • 39
Spaces citing this paper: 0
Collections including this paper: 0
Similar Articles
RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies
RoboLab is a high-fidelity simulation benchmarking framework for evaluating task-generalist robotic policies, introducing the RoboLab-120 benchmark with 120 tasks across visual, procedural, and relational competency axes. It enables scalable, realistic task generation and systematic analysis of policy behavior under controlled perturbations to assess true generalization capabilities.
Safety Gym
OpenAI introduces Safety Gym, a new benchmark environment and toolkit for studying constrained reinforcement learning and safe exploration. The platform features multiple robots and tasks designed to quantify and measure safe exploration through cost functions alongside reward functions.
ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes
ShapeCodeBench is a synthetic benchmark for perception-to-program reconstruction where models generate executable drawing programs from raster images, evaluated on metrics like exact match and pixel accuracy. The benchmark is designed to be renewable via seeded RNG, and current models still achieve low exact match rates, indicating room for improvement.
Benchmarking safe exploration in deep reinforcement learning
OpenAI proposes standardizing constrained RL as the formalism for safe exploration and introduces Safety Gym, a benchmark suite for evaluating safe deep RL algorithms in high-dimensional continuous control tasks with safety constraints.
RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
RoboMemArena introduces a large-scale benchmark for evaluating robotic memory across 26 complex tasks with real-world validation, alongside PrediMem, a dual-system vision-language-action model that improves memory management through predictive coding.