SWE Context Bench just proved something I think a lot of coding agent users already feel

Reddit r/AI_Agents Papers

Summary

A new benchmark paper 'SWE Context Bench' tests whether coding agents can reuse knowledge across tasks, highlighting a gap in existing benchmarks that only evaluate isolated problem-solving. The author discusses solutions like external memory and mentions tools such as langmem, mem0, supermemory, and Greplica.

I just read the new benchmark paper "SWE Context Bench: A Benchmark for Context Learning in Coding" (arXiv 2602.08316, May 2026). The core finding is pretty obvious once stated out loud: current benchmarks like SWE-bench only test whether an agent can solve a task in isolation. They don't test whether an agent can reuse what it learned on related tasks to work faster and cheaper next time.

Would love to know:

1. How do you think this problem will be solved: external memory? In-harness solutions? Or will models just get better at it?
2. How are you currently trying to work around agent amnesia?
3. How do solutions like langmem / mem0 / supermemory help here, if at all?

I'm working on Greplica, a lightweight graph-memory layer for coding agents. The idea is simple: capture claims, components, flows, and code anchors from engineering sessions, and let the agent query that graph across sessions instead of starting blind.
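
To make the graph-memory idea concrete, here is a rough sketch of what a cross-session store could look like. This is not Greplica's actual API or data model. The class `MemoryGraph`, its methods, the relation names, and the JSON file layout are all hypothetical and only illustrate the four node kinds mentioned above (claims, components, flows, code anchors) being saved in one session and queried in a later one.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path

# Node kinds taken from the post; everything else here is an illustrative assumption.
NODE_KINDS = {"claim", "component", "flow", "code_anchor"}


@dataclass
class MemoryGraph:
    """Hypothetical cross-session memory: typed nodes plus labeled edges."""
    nodes: dict = field(default_factory=dict)  # node_id -> {"kind": ..., "text": ...}
    edges: list = field(default_factory=list)  # [src_id, relation, dst_id]

    def add_node(self, node_id, kind, text):
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[node_id] = {"kind": kind, "text": text}

    def add_edge(self, src, relation, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must already exist")
        self.edges.append([src, relation, dst])

    def neighbors(self, node_id):
        """Return (relation, other_node_id) pairs touching node_id."""
        out = []
        for src, rel, dst in self.edges:
            if src == node_id:
                out.append((rel, dst))
            elif dst == node_id:
                out.append((rel, src))
        return out

    def search(self, keyword, kind=None):
        """Case-insensitive substring match over node text, optionally by kind."""
        kw = keyword.lower()
        return sorted(
            nid for nid, n in self.nodes.items()
            if kw in n["text"].lower() and (kind is None or n["kind"] == kind)
        )

    def save(self, path):
        # Persist to disk so a later session can pick up where this one ended.
        Path(path).write_text(json.dumps({"nodes": self.nodes, "edges": self.edges}))

    @classmethod
    def load(cls, path):
        data = json.loads(Path(path).read_text())
        return cls(nodes=data["nodes"], edges=data["edges"])
```

In this sketch, one session would call `add_node` / `add_edge` and then `save`, and the next session would start with `MemoryGraph.load(...)` and use `search` and `neighbors` to pull up relevant claims and their code anchors before touching the repo. A real system would presumably need better retrieval than substring matching, plus some way to expire claims that the code has since invalidated.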
