Tag
llama.cpp's web UI now supports executing model-generated JavaScript in a sandboxed iframe via Web Workers, enabling lightweight agentic code execution as an opt-in feature.
Lenny Bogdonoff, an early OpenAI employee, rebuilt the Jupyter code execution environment before GPT-4 was trained and ChatGPT launched. This work became the prototype for the later 'AI computer' concept, though its significance was not recognized at the time.
VELA is a tool for securely executing AI-generated and untrusted code, providing a sandbox environment to prevent malicious actions.
Greptile introduces TREX, an AI code reviewer that executes code and detects runtime bugs, going beyond static analysis by spinning up parallel agents to investigate issues and generate artifacts like screenshots.
CODA-BENCH is a new benchmark for evaluating code agents on data-intensive tasks, bridging the gap between code-centric and data-centric evaluations. It includes over 1,000 tasks from 31 communities, with realistic data scale and noise, and reveals that even top agents achieve only a 61.1% success rate.
A security vulnerability in objdump -g allows arbitrary code execution via a crafted FR30 object file due to a missing bounds check in the FR30 relocation handler, with a single-shot exploit that defeats ASLR and other mitigations.
Config files for IDEs, AI coding agents, and package managers can execute code automatically, creating a supply chain security blindspot. The article details the Miasma worm attack that uses such config files to drop malware, and provides examples of injection vectors.
LangChain introduces LangSmith Sandboxes, providing each AI agent with its own isolated computer environment for safe code execution, addressing security risks of running untrusted code in containers or locally.
China released OpenSandbox, an open-source sandbox runtime for AI agents, supporting multiple SDKs and secure execution environments with Docker/Kubernetes isolation.
LangChain's newsletter announces major product launches from Interrupt 2026: LangSmith Engine for automated agent failure diagnosis and fixes, and Sandboxes GA for secure code execution, alongside a new LangChain Labs research initiative and upcoming events.
This paper evaluates three approaches (pure chain-of-thought reasoning, single-shot code execution, and iterative code execution) on 1,000 GSM-Symbolic problems using Claude Haiku 4.5, finding that chain-of-thought is the most robust to perturbation, while code execution does not improve reasoning robustness on grade-school math problems.
HOL Guard is an open-source security tool that identifies, intercepts, and audits dangerous commands issued by development agents such as Codex and Claude Code. It supports multiple protection levels and a local approval center to prevent risks like accidental deletion or modification.
Discusses whether to isolate the tool or the agent when running agents that execute arbitrary code, concluding that isolating the agent is superior because it leaves the agent with zero secrets and puts a control-plane proxy in front of it.
Introduces ast-guard, an open-source AST-based security tool that prevents malicious code execution from LLM-generated Python strings by parsing them into an abstract syntax tree and applying node-level whitelisting and context-aware safety checks.
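The whitelisting idea described above can be illustrated with Python's standard `ast` module. This is only a minimal sketch under stated assumptions, not ast-guard's actual API or rule set: the names `ALLOWED_NODES`, `ALLOWED_CALLS`, `check_expression`, and `safe_eval` are hypothetical, and the sketch handles single arithmetic expressions only.

```python
import ast
import builtins

# Hypothetical whitelist for illustration; NOT ast-guard's real rule set.
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
    ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub,
    ast.Name, ast.Load, ast.Call,
)
ALLOWED_CALLS = {"abs", "min", "max", "round"}


def check_expression(source: str) -> ast.Expression:
    """Parse an untrusted string and reject any node outside the whitelist."""
    tree = ast.parse(source, mode="eval")
    for node in ast.walk(tree):
        # Node-level whitelisting: anything not listed is refused
        # (attributes, subscripts, lambdas, comprehensions, keywords, ...).
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed node: {type(node).__name__}")
        # Context-aware checks on the nodes that are allowed.
        if isinstance(node, ast.Call) and not isinstance(node.func, ast.Name):
            raise ValueError("only direct calls to named functions are allowed")
        if isinstance(node, ast.Name) and node.id not in ALLOWED_CALLS:
            raise ValueError(f"disallowed name: {node.id}")
        if isinstance(node, ast.Constant) and (
            isinstance(node.value, bool) or not isinstance(node.value, (int, float))
        ):
            raise ValueError("only int and float literals are allowed")
    return tree


def safe_eval(source: str):
    """Evaluate a checked expression with only the whitelisted functions visible."""
    tree = check_expression(source)
    env = {"__builtins__": {}}
    env.update({name: getattr(builtins, name) for name in ALLOWED_CALLS})
    return eval(compile(tree, "<llm-generated>", "eval"), env)
```

With this sketch, `safe_eval("max(1, 2) * 3")` evaluates normally, while `safe_eval("__import__('os')")` and `safe_eval("(1).__class__")` raise `ValueError` before anything runs. A whitelist like this restricts what code can express; it does not bound CPU or memory use, so it complements process-level sandboxing rather than replacing it.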
Built a GitHub Issue Triage Agent using a single curl call to the Gemini API; the agent clones repos, fetches issues, classifies them, and executes reproducer code to confirm bugs, without any orchestration framework.
Harrison Chase announces a lightweight code execution environment called code interpreter that enables RLMs and programmatic tool calling without needing a full sandbox, with more use cases to be detailed.
Deep Agents introduces interpreters: small embedded runtimes that allow agents to write and execute code inside the agent loop, enabling multi-step logic and intermediate state management without full sandbox overhead.
Phil Schmid announces Managed Agents in the Gemini API, enabling one-call agents with code execution, web browsing, and file management in isolated sandboxes, powered by Gemini 3.5 Flash.
This paper introduces ThinC (Thinking in Code), a framework where language models use code blocks exclusively for reasoning after a brief natural language planning step, outperforming existing tool-integrated reasoning baselines on math benchmarks.
Anthropic's 'Code Mode' reframes the MCP vs CLI debate by having AI agents write code to call tools via a runtime rather than loading full schemas into context, drastically reducing token usage. This approach combines MCP's typed contracts with lazy loading, proving the protocol is evolving rather than dying.