@bcherny: Seeing a number of benchmarks showing Opus is the best model for long-running work. Five tips for running Opus autonomo…

X AI KOLs Following 06/08/26, 01:16 AM Models

claude-opus autonomous-agents long-running software-engineering tips coding-agents benchmark

Summary

Practical tips for running Anthropic's Claude Opus autonomously for hours or days, such as using auto mode, dynamic workflows, and self-verification; also references the SWE-Marathon benchmark for long-horizon software tasks.

Cached at: 06/08/26, 03:23 PM

Seeing a number of benchmarks showing Opus is the best model for long-running work.

Five tips for running Opus autonomously for hours/days:

1. Use auto mode for permissions, so Claude doesn’t ask for approval
2. Use dynamic workflows, to have Claude orchestrate hundreds/thousands of agents to get a task done
3. Use /goal or /loop, to nudge Claude to keep going until it’s done
4. Use Claude Code in the cloud, so you can close your laptop (easiest way is the desktop or mobile app)
5. Make sure Claude has a way to self-verify its work end to end: Claude in Chrome browser extension for web, iOS/Android sim MCP for mobile, a way to start the full web server or service for backend work
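
For the backend case in the last tip, one possible shape for "a way to start the full web server or service" is a single script the agent can run that boots the service, hits it over HTTP, and reports pass or fail. The sketch below is only an illustration: `HealthHandler`, the `/health` route, and `verify_end_to_end` are hypothetical names standing in for a real service and its checks, and nothing here is part of Claude Code itself.

```python
import http.server
import json
import threading
import urllib.request


class HealthHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical stand-in for a real backend: serves one /health route."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # silence per-request logging so the check output stays readable


def verify_end_to_end():
    """Start the server, call it over real HTTP, and return True on success."""
    # Port 0 asks the OS for any free port, so parallel runs do not collide.
    server = http.server.HTTPServer(("127.0.0.1", 0), HealthHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # An empty ProxyHandler bypasses any proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"http://127.0.0.1:{port}/health", timeout=5) as resp:
            payload = json.loads(resp.read())
            return resp.status == 200 and payload == {"status": "ok"}
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print("PASS" if verify_end_to_end() else "FAIL")
```

The point of the design is that the check exercises the running service rather than mocked internals, and gives the agent one unambiguous signal to act on.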

Nice!

Context rot isn’t a thing with 4.8 imo, but curious if that’s been your experience also

Most important thing I’ve found is self-verification + dynamic workflows prompted with something like “use a workflow to test the result e2e in a browser using claude in chrome mcp. Especially look for edge cases and ui issues”

A few things I’ve used very long running sessions for:

Building complex features
Migrating code from language X to Y
Migrating code from framework X to Y
Repeatedly profiling and optimizing code to hit a specific memory or CPU target
Finding and fixing flaky tests in CI
Profiling CI to make it faster
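
As a rough illustration of the flaky-test item in the list above, a long-running session needs some repeatable way to tell "flaky" apart from "broken". A minimal sketch, assuming tests are plain callables that raise `AssertionError` on failure; `find_flaky` and `make_every_third_call_fails` are hypothetical helpers, not part of any test framework:

```python
import itertools


def find_flaky(tests, runs=20):
    """Run each test `runs` times and return {name: failure_rate} for tests
    that both passed and failed at least once (always-failing is not flaky)."""
    flaky = {}
    for name, test in tests.items():
        failures = 0
        for _ in range(runs):
            try:
                test()
            except AssertionError:
                failures += 1
        if 0 < failures < runs:
            flaky[name] = failures / runs
    return flaky


def make_every_third_call_fails():
    """Build a deterministic 'flaky' test for demonstration purposes."""
    calls = itertools.count(1)

    def test():
        assert next(calls) % 3 != 0

    return test


if __name__ == "__main__":
    demo = {
        "stable": lambda: None,
        "every_third_call_fails": make_every_third_call_fails(),
    }
    print(find_flaky(demo, runs=9))
```

In a real project the callables would be replaced by invocations of the actual test runner, and the resulting failure rates give the agent a measurable target to drive to zero.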

I think of it in terms of ROI rather than absolute cost: how much would it have cost to do the same work manually? Often the answer is weeks or even months of engineering time
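
The ROI framing above comes down to one division. A small sketch, where `roi_multiple` is a hypothetical helper and every input in the example is a placeholder rather than real pricing or salary data:

```python
def roi_multiple(manual_hours, hourly_rate, agent_cost):
    """Return how many times cheaper the agent run was than manual work."""
    if agent_cost <= 0:
        raise ValueError("agent_cost must be positive")
    manual_cost = manual_hours * hourly_rate
    return manual_cost / agent_cost


if __name__ == "__main__":
    # Placeholder inputs: 4 weeks of one engineer at 40 h/week and a
    # made-up hourly rate, against a made-up agent bill.
    print(roi_multiple(manual_hours=160, hourly_rate=100.0, agent_cost=2000.0))
```

A result above 1 means the autonomous run cost less than doing the same work by hand, which is the comparison the post suggests instead of looking at the absolute bill.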

These are not designed for people to invoke, though you can if you want. Just tell the model what you want to happen, and it will invoke the right skills for you

I don’t see that with Opus 4.8 anymore, do you?

Run /usage to see a breakdown of the specific skills, mcps, and plugins that are using your tokens

Just tell claude to use a workflow

Yes. It’s more powerful and more token-efficient

Enterprise seat limits are configurable, maybe ask your admin to increase limits?

We do both! Depends if it’s a one-off or something you want to run on future PRs

@bcherny Many people try to achieve this through an orchestration layer. When are you planning an overlay/supervisor agent that monitors, dispatches, summarizes, and manages other sessions?

Agent View is great, but jumping between sessions is getting frustrating - especially when active sessions quietly fall into Completed instead of surfacing Needs Input.
