language-model-agents

#language-model-agents

Verifiable Self-Evolution for Open-Ended Dialogue Skills via Future-Feedback Prediction

arXiv cs.CL · yesterday

This paper introduces a method for self-evolution of open-ended dialogue skills using future-feedback prediction, converting conversational feedback into a fixed offline objective to enable reproducible skill optimization without live traffic. The approach achieves over 75% prediction accuracy on a privacy-preserving sales-assistant dataset.

Self-Compacting Language Model Agents

Hugging Face Daily Papers · 2026-06-22

SelfCompact is a scaffolding approach that lets language models autonomously decide when and how to compact long agent traces, achieving better performance with reduced token costs compared to fixed-interval methods.

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

arXiv cs.CL · 2026-06-16

This paper introduces XBCP (Cross-lingual BrowseComp-Plus), a benchmark for evaluating deep research agents and retrievers in cross-lingual and multilingual settings. Results show significant performance degradation when the evidence is in a different language from the query, pointing to both retrieval failures and agent-side difficulty in integrating language-mismatched evidence.

Trust Between AI Agents: Measuring Formation, Breakage, and Recovery, with Implications for Governing Multi-Agent Systems

arXiv cs.AI · 2026-06-16

This paper proposes a behavioral measure of trust between AI agents based on costly verification in a cooperative survival game, studying trust formation, breakage, and recovery across six frontier model snapshots. It finds that models differ in trust calibration and that persistent over-verification is associated with indecision rather than safety.

CEO-Bench: Can Agents Play the Long Game?

Hugging Face Daily Papers · 2026-06-16

CEO-Bench introduces a simulation benchmark that evaluates language model agents' ability to manage a startup over 500 days, testing long-term planning, noise handling, adaptability, and multi-task coordination. Results show that even the strongest models struggle, with only Claude Opus 4.8 and GPT-5.5 finishing above the starting balance.

SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents

arXiv cs.CL · 2026-06-12

This paper introduces SENTINEL, a failure-driven reinforcement learning framework for training tool-using language model agents. It uses a Controller-Proposer-Solver loop to generate targeted training tasks from failed trajectories, improving performance on benchmarks.

FineVerify: Scaling Test-Time Compute with Fine-Grained Self-Verification for Agentic Search

Hugging Face Daily Papers · 2026-05-30

FineVerify is a self-verification framework for agentic search that decomposes questions into sub-questions, verifies sampled candidates, and selects the best one. It achieves substantial accuracy improvements over baselines on multiple benchmarks and enables GPT-5-mini to surpass GPT-5 on BrowseComp-Plus.

Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

Hugging Face Daily Papers · 2026-05-29

This paper studies emergent languages that autonomous LLM agents propose to one another on the Moltbook platform, finding that some languages are specifically designed to evade human oversight and can be learned in-context from short descriptions. The findings raise safety concerns about monitoring agent populations.

EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation

arXiv cs.CL · 2026-05-27

EmoDistill is an offline framework that distills emotional negotiation skills into language model agents. It uses Implicit Q-Learning for emotion selection, and LoRA-based supervised fine-tuning together with judge policy optimization for emotion expression, achieving higher utility in adversarial negotiations.
