Discovering Reinforcement Learning Interfaces with Large Language Models
Summary
This paper introduces LIMEN, an LLM-guided evolutionary framework that automatically discovers reinforcement learning interfaces by jointly optimizing observation mappings and reward functions from raw simulator states. The approach reduces manual engineering effort and demonstrates that co-designing observations and rewards outperforms optimizing either component alone.
Source: https://huggingface.co/papers/2605.03408
Abstract
Automated reinforcement learning interface discovery using LLM-guided evolutionary algorithms that jointly optimize observation mappings and reward functions from raw simulator state.
Reinforcement learning systems rely on environment interfaces that specify observations and reward functions, yet constructing these interfaces for new tasks often requires substantial manual effort. While recent work has automated reward design using large language models (LLMs), these approaches assume fixed observations and do not address the broader challenge of synthesizing complete task interfaces. We study RL task interface discovery from raw simulator state, where both observation mappings and reward functions must be generated. We propose LIMEN (code available at https://github.com/Lossfunk/LIMEN), an LLM-guided evolutionary framework that produces candidate interfaces as executable programs and iteratively refines them using policy training feedback. Across novel discrete gridworld tasks and continuous control domains spanning locomotion and manipulation, joint evolution of observations and rewards discovers effective interfaces given only a trajectory-level success metric, while optimizing either component alone fails on at least one domain. These results demonstrate that automatic construction of RL interfaces from raw state can substantially reduce manual engineering and that observation and reward components often benefit from co-design, as single-component optimization fails catastrophically on at least one domain in our evaluation suite.
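The loop the abstract describes (propose candidate interfaces as programs, train a policy on each, score with a trajectory-level success metric, keep the best) can be pictured with a toy sketch. This is not the LIMEN implementation: the 1-D corridor task, the candidate pools, `propose_variants` (a stand-in for the LLM) and `train_policy` (a stand-in for RL training) are all hypothetical names and simplifications chosen for illustration.

```python
from dataclasses import dataclass

SIZE = 5              # toy corridor with cells 0..4 (assumption, not from the paper)
ACTIONS = (0, -1, 1)  # stay, left, right

# Hypothetical pools standing in for LLM-written programs over raw state.
OBSERVATIONS = {
    "constant": lambda s: 0,                          # discards all information
    "signed_offset": lambda s: s["goal"] - s["pos"],  # direction and distance to goal
}
REWARDS = {
    "zero": lambda s: 0.0,                            # no learning signal
    "neg_distance": lambda s: -abs(s["goal"] - s["pos"]),
}

@dataclass(frozen=True)
class Interface:
    obs_name: str
    rew_name: str
    def observe(self, state):
        return OBSERVATIONS[self.obs_name](state)
    def reward(self, state):
        return REWARDS[self.rew_name](state)

def step(state, action):
    """Raw simulator transition: move and clamp to the corridor."""
    pos = min(max(state["pos"] + action, 0), SIZE - 1)
    return {"pos": pos, "goal": state["goal"]}

def train_policy(iface):
    """Stand-in for RL training: one-step greedy action per observation."""
    policy = {}
    for goal in range(SIZE):
        for pos in range(SIZE):
            s = {"pos": pos, "goal": goal}
            best = max(ACTIONS, key=lambda a: iface.reward(step(s, a)))
            policy[iface.observe(s)] = best  # aliased observations overwrite each other
    return policy

def success_rate(iface):
    """Trajectory-level metric: fraction of (start, goal) pairs that reach the goal."""
    policy = train_policy(iface)
    wins, episodes = 0, 0
    for goal in range(SIZE):
        for start in range(SIZE):
            s = {"pos": start, "goal": goal}
            for _ in range(SIZE):
                if s["pos"] == s["goal"]:
                    break
                s = step(s, policy[iface.observe(s)])
            wins += s["pos"] == s["goal"]
            episodes += 1
    return wins / episodes

def propose_variants(parent):
    """Hypothetical stand-in for the LLM: rewrite observation, reward, or both."""
    return [Interface(o, r) for o in OBSERVATIONS for r in REWARDS
            if (o, r) != (parent.obs_name, parent.rew_name)]

def evolve(generations=2, keep=2):
    population = [Interface("constant", "zero")]
    for _ in range(generations):
        candidates = list(population)
        for parent in population:
            candidates.extend(propose_variants(parent))
        candidates = list(dict.fromkeys(candidates))  # drop duplicates, keep order
        candidates.sort(key=success_rate, reverse=True)
        population = candidates[:keep]
    best = population[0]
    return best, success_rate(best)

if __name__ == "__main__":
    print(evolve())
```

In this toy setting only the candidate that pairs an informative observation with an informative reward solves every episode; changing one component while leaving the other uninformative leaves the agent standing still. That mirrors the co-design argument in spirit only and says nothing about the paper's actual results.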
Similar Articles
EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning
EvoTrainer introduces an autonomous training framework that co-evolves LLM policies and training harnesses through empirical feedback, outperforming human-engineered RL baselines on mathematical reasoning, code generation, and long-horizon software engineering tasks.
Generative Floor Plan Design with LLMs via Reinforcement Learning with Verifiable Rewards
This paper introduces a text-based approach for generative floor plan design that fine-tunes a large language model with reinforcement learning and verifiable rewards to improve adherence to topological and numerical constraints, achieving significant improvements over existing methods.
Generating Robust Portfolios of Optimization Models using Large Language Models
Proposes a method to generate portfolios of optimization models using LLMs, with theoretical guarantees and empirical validation.
Evolution through large models
This paper demonstrates that large language models trained on code can significantly enhance genetic programming mutation operators, enabling the generation of hundreds of thousands of functional Python programs for robot design in the Sodarace domain without prior training data. The approach, called Evolution through Large Models (ELM), combines LLMs with MAP-Elites to bootstrap new conditional models for context-specific artifact generation.
Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling
This paper formulates adaptive sampling for large language models as a Markov decision process and trains a lightweight RL controller to balance correctness, latency, and computational cost, achieving improved trade-offs.