@PyTorch: Enable smarter, longer-thinking agents Scale agentic AI and reinforcement learning by shortening CPU execution time, in…
Summary
NVIDIA introduces the Vera CPU with a neural branch predictor to accelerate agentic AI and reinforcement learning workloads by reducing CPU execution time and increasing throughput in AI factories.
Cached at: 06/11/26, 07:41 PM
Enable smarter, longer-thinking agents
Scale agentic AI and reinforcement learning by shortening CPU execution time, increasing task throughput, and improving overall AI factory output.
The @nvidia custom Olympus core in the NVIDIA Vera CPU uses a neural branch predictor to reduce stalls in branch-heavy code. Combined with other prediction mechanisms, it can sustain two taken branches per cycle with zero penalty, maintaining throughput for deep software stacks such as PyTorch, graph workloads, and scripting engines.
Read the complete blog post:
NVIDIA Vera CPU Sets a New Standard for Agentic Workloads in AI Factories
Source: https://developer.nvidia.com/blog/nvidia-vera-cpu-sets-a-new-standard-for-agentic-workloads-in-ai-factories/
Each wave of AI has created a new scaling law. Pretraining scaled intelligence through larger datasets, more parameters, and massively parallel GPU systems. Post-training scaled usefulness through instruction tuning and re-balancing GPUs for generative inference. Test-time scaling improved reasoning by giving models more generated tokens for thinking.
Now, agentic AI and reinforcement learning scale actions. Models take more steps, call more tools, run more evaluations, and interact with execution environments to perform tasks.
This blog explains how NVIDIA Vera CPUs help AI factories to scale agentic AI and reinforcement learning by shortening CPU execution time, increasing task throughput, improving overall AI factory output, and enabling smarter, longer-thinking agents.
Figure 1. CPU execution becomes part of the AI loop
Why CPUs matter more in the agentic era
GPUs remain essential for model inference and training. But across agentic AI, reinforcement learning, and data-intensive AI services, much of the execution surrounding the model runs on CPUs, such as:
- Sandboxed code and tool execution
- Data retrieval and data processing
- Results computation
- Scheduling and orchestration
This is a precise loop:
- A prompt (either from a user, reasoning tokens, or a previous turn’s result) kicks off generation: “I should compile and run hello.c.”
- The GPU generates the parameters of the tool call to be performed on the CPU: gcc -o hello hello.c ; ./hello
- The CPU executes the tool call, producing results that are fed back to the GPUs to update weights during reinforcement learning, or used by the agent to generate the next prompt: Output: ‘Hello, world!’ – Task Returned (0) – Successful
- The GPU generates reasoning tokens prompted by the result: “Hmm! It looks like that worked!”
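One turn of this loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not NVIDIA's implementation: `generate_tool_call` is a hypothetical stand-in for the GPU-side model, and the command invokes the current Python interpreter instead of gcc so the sketch also runs on machines without a C compiler.

```python
import subprocess
import sys


def generate_tool_call(prompt: str) -> list[str]:
    """Hypothetical stand-in for the model: returns the argv of a tool call."""
    # A real agent would derive this command from generated tokens.
    return [sys.executable, "-c", "print('Hello, world!')"]


def execute_tool_call(argv: list[str]) -> tuple[str, int]:
    """CPU-side step: run the tool, capture its output and exit status."""
    done = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    return done.stdout.strip(), done.returncode


def agent_step(prompt: str) -> str:
    """One turn: generate a call, execute it, build the next prompt."""
    argv = generate_tool_call(prompt)
    output, status = execute_tool_call(argv)
    return f"Output: {output!r} - Task Returned ({status})"


next_prompt = agent_step("I should compile and run hello.c.")
print(next_prompt)
```

Everything inside `execute_tool_call` is CPU work (process creation, interpretation or compilation, I/O), and the model cannot continue this request until it returns.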
As agents become more capable, they take more steps, call more tools, and run more checks. CPU time compounds across the request.
This makes the CPU part of the critical path. It’s no longer just a host processor feeding the GPU. It shapes latency, accelerator utilization, and AI factory output per watt and per dollar.
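A back-of-the-envelope model shows how CPU time compounds. The sketch below assumes a strictly serial loop in which the accelerator waits for each tool call and no other request fills the gap; the step count and per-step times are hypothetical values chosen for illustration, not measurements.

```python
def request_latency(steps: int, gpu_s: float, cpu_s: float) -> float:
    """End-to-end latency when every step is GPU generation plus CPU execution."""
    return steps * (gpu_s + cpu_s)


def accelerator_utilization(gpu_s: float, cpu_s: float) -> float:
    """Fraction of each step the accelerator spends working rather than waiting."""
    return gpu_s / (gpu_s + cpu_s)


# Hypothetical agent: 20 steps, 2.0 s of generation per step.
slow_cpu = request_latency(steps=20, gpu_s=2.0, cpu_s=1.0)
fast_cpu = request_latency(steps=20, gpu_s=2.0, cpu_s=0.5)

print(f"latency with 1.0 s tool calls: {slow_cpu:.0f} s")
print(f"latency with 0.5 s tool calls: {fast_cpu:.0f} s")
print(f"utilization: {accelerator_utilization(2.0, 1.0):.0%} vs "
      f"{accelerator_utilization(2.0, 0.5):.0%}")
```

Under these assumptions, halving CPU time per step shortens the whole request and raises accelerator utilization at the same time; real systems overlap requests, so the effect will be smaller than this serial model suggests.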
For the last decade, much of the data center CPU market optimized around cloud economics of more cores, more virtual machines, and lower cost per core. This remains important for general-purpose cloud services, but performance per core has not improved at the same rate.
This is further compounded by the end of Moore’s law, which limited generation-on-generation performance improvements in CPUs, even while GPU architectures and workloads benefited from a continuous cycle of co-optimization.
AI factories shift the metric from cores per dollar to tokens per dollar—from how many CPU cores a data center can rent, to how much AI output it can produce.
This demands a new CPU design point for AI factories:
- High core counts to run thousands of concurrent agents, RL environments, sandboxes, and services.
- High per-core performance, because each agentic step is gated by sequential execution.
- Energy-efficient memory bandwidth to keep data moving without turning CPU infrastructure into a bottleneck.
Figure 2. AI creates a need for a new CPU
The NVIDIA Vera CPU: Built for AI agents
The NVIDIA Vera CPU is designed for the reality of modern workloads, with fast per-core performance, high concurrency, and power-efficient memory bandwidth to keep the AI factory moving.
The Vera CPU combines 88 NVIDIA Olympus cores with up to 1.2 TB/s of LPDDR5X memory bandwidth to keep cores fed through tool calls, sandboxed execution of both native code and languages like Python or JavaScript, data retrieval, data processing, and orchestration.
The key requirement is fast per-core performance, sustained at all times. Unlike cloud virtual machines, the CPU sockets stay fully loaded, doing the work of many concurrent agents. Cores that remain fast under high system load reduce task completion time, delivering faster results while freeing up resources to serve the next request.
For agents, this means lower latency across multistep requests. For reinforcement learning, this means more completed evaluations and more data from each training window, helping models reach a higher quality bar faster. For AI factories, fast cores keep accelerators from waiting on orchestration, tool execution, or data movement.
Delivering this requires the core, memory subsystem, and fabric to be designed together for branch-heavy code, high-bandwidth data movement, and predictable performance under load.
This starts with the NVIDIA custom Olympus core inside the Vera CPU.
Figure 3. The Vera CPU is built for the agentic design point
NVIDIA Olympus core and memory subsystem
The NVIDIA Olympus core delivers up to 50% higher IPC than NVIDIA Grace, combining a wide front end, advanced branch prediction, deep out-of-order instruction scheduling, and specialized memory prefetching to sustain high throughput on branch-heavy, memory-sensitive agentic code.
Olympus uses a neural branch predictor to reduce stalls in branch-heavy code. Combined with other prediction mechanisms, it can sustain two taken branches per cycle with zero penalty, maintaining throughput for deep software stacks such as PyTorch, graph workloads, and scripting engines.
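The post does not describe the internals of the Olympus predictor. To illustrate the general idea of a predictor that learns weights over branch history, the sketch below implements the classic perceptron branch predictor from the academic literature (Jiménez and Lin). It is a textbook technique shown under that assumption, not NVIDIA's design; the table size, history length, and branch address are arbitrary.

```python
class PerceptronPredictor:
    """Textbook perceptron branch predictor (illustrative only)."""

    def __init__(self, history_len: int = 8, table_size: int = 64) -> None:
        # Training threshold suggested in the perceptron-predictor paper.
        self.theta = int(1.93 * history_len + 14)
        # One weight vector per table entry; index 0 is the bias weight.
        self.weights = [[0] * (history_len + 1) for _ in range(table_size)]
        # Global history, newest first: +1 = taken, -1 = not taken.
        self.history = [-1] * history_len

    def _output(self, pc: int) -> int:
        w = self.weights[pc % len(self.weights)]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc: int) -> bool:
        """Predict taken when the weighted sum is non-negative."""
        return self._output(pc) >= 0

    def update(self, pc: int, taken: bool) -> None:
        """Train on a misprediction or a low-confidence output, then shift history."""
        y = self._output(pc)
        t = 1 if taken else -1
        if (y >= 0) != taken or abs(y) <= self.theta:
            w = self.weights[pc % len(self.weights)]
            w[0] += t
            for i, hi in enumerate(self.history, start=1):
                w[i] += t * hi
        self.history = [t] + self.history[:-1]


# Hypothetical branch at address 0x40 that alternates taken / not taken.
predictor = PerceptronPredictor()
total = 200
correct = 0
for i in range(total):
    taken = (i % 2 == 0)
    if predictor.predict(0x40) == taken:
        correct += 1
    predictor.update(0x40, taken)

accuracy = correct / total
print(f"accuracy on alternating branch: {accuracy:.1%}")
```

After a short warm-up, the weights encode the correlation between the previous outcome and the next one, which is the kind of pattern a simple saturating counter cannot capture.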
Olympus also includes a 10-wide decode unit and a deep out-of-order engine designed to sustain high instructions per cycle. Large buffers and advanced instruction scheduling help the core maintain forward progress as code paths, dependencies, and memory access patterns shift.
Sustaining high IPC under load requires keeping the cores fed with data. Vera CPUs deliver up to 1.2 TB/s of LPDDR5X memory bandwidth, sustaining over 90% of peak memory bandwidth under load. They also offer 40% lower peak memory latency compared to x86 CPUs, ensuring Olympus cores are fed on time through retrieval, analytics, sandbox execution, and orchestration.
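The bandwidth figures can be turned into rough per-socket and per-core numbers. The arithmetic below uses only values quoted in this post (1.2 TB/s peak, over 90% sustained, 88 cores); the even split across cores and the decimal unit conversion (1 TB = 1000 GB) are simplifying assumptions.

```python
PEAK_BW_TBPS = 1.2          # peak LPDDR5X bandwidth quoted in the post
SUSTAINED_FRACTION = 0.90   # "over 90%", so this is a lower bound
CORES = 88                  # Olympus cores per Vera CPU

# Lower bound on sustained bandwidth under load.
sustained_floor_tbps = PEAK_BW_TBPS * SUSTAINED_FRACTION

# Peak bandwidth per core if all cores draw an equal share (an assumption).
peak_per_core_gbps = PEAK_BW_TBPS * 1000 / CORES

print(f"sustained bandwidth: at least {sustained_floor_tbps:.2f} TB/s")
print(f"peak share per core: about {peak_per_core_gbps:.1f} GB/s")
```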
Olympus also adds a novel graph prefetcher built for indirect memory access patterns common in graph analytics and agent memory traversal. Combined with high per-core memory bandwidth, Vera CPUs deliver more than 3x performance on graph traversal workloads compared with x86-based architectures.
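To make "indirect memory access" concrete, the sketch below runs a breadth-first search over a small graph stored in compressed sparse row (CSR) form. It only illustrates the access pattern; it says nothing about how the Olympus prefetcher works, and the graph itself is a hypothetical example.

```python
from collections import deque

# Hypothetical 5-node directed graph in CSR form.
# Neighbors of node v are col_idx[row_ptr[v]:row_ptr[v + 1]].
row_ptr = [0, 2, 4, 5, 5, 5]
col_idx = [1, 2, 3, 4, 4]


def bfs_order(row_ptr: list[int], col_idx: list[int], source: int) -> list[int]:
    """Return nodes in the order a breadth-first search visits them."""
    visited = [False] * (len(row_ptr) - 1)
    visited[source] = True
    order = []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for j in range(row_ptr[v], row_ptr[v + 1]):
            u = col_idx[j]       # sequential load: easy to prefetch
            if not visited[u]:   # indirect load: address depends on the value u
                visited[u] = True
                queue.append(u)
    return order


order = bfs_order(row_ptr, col_idx, source=0)
print(order)
```

The lookup `visited[u]` is the indirect access: its address is known only after `col_idx[j]` has been loaded, so a stride-based prefetcher has no regular pattern to follow.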
The NVIDIA Scalable Coherency Fabric (SCF) connects all cores and a unified cache across a monolithic mesh, delivering predictable latency and 50% faster core-to-core data movement compared with CPUs that fragment compute across dies. For reinforcement learning and agentic AI, that predictability helps keep evaluation loops sustained under full load.
Together, the Olympus core, NVIDIA SCF, and LPDDR5X memory subsystem enable the Vera CPU to deliver more than 1.8x higher sandbox performance across agentic workloads under full load compared with the competition, as shown in Figure 4.
Figure 4. The Vera CPU delivers industry-leading agentic sandbox performance
System efficiency
Beyond performance, agentic AI places increasing pressure on infrastructure efficiency. As AI factories scale to thousands of CPUs, memory power can become a major contributor to platform power, cooling demand, and operating cost.
The Vera CPU pairs its architecture with high-bandwidth SOCAMM LPDDR5X memory to reduce memory power compared with traditional DDR server designs. The LPDDR5X subsystem typically consumes less than 30 watts, compared with well over 100 watts for DDR5 configurations. MRDIMM-based systems can drive memory power even higher.
With a configurable 250 W to 450 W TDP range, the Vera CPU reduces combined CPU and memory subsystem power while delivering the bandwidth needed for agentic inference and reinforcement learning environments. For AI factories, this translates into better performance per watt, lower operating costs, and more efficient use of power and cooling infrastructure.
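The memory power figures above imply a floor on fleet-level savings. The sketch below uses the two bounds quoted in this post (less than 30 W for LPDDR5X, well over 100 W for DDR5) and assumes one memory subsystem per CPU socket; the fleet size is a hypothetical value for illustration.

```python
LPDDR5X_MEMORY_W = 30   # upper bound from the post ("less than 30 watts")
DDR5_MEMORY_W = 100     # lower bound from the post ("well over 100 watts")


def min_memory_power_saved_kw(sockets: int) -> float:
    """Lower bound on memory power saved across a fleet, in kilowatts."""
    return sockets * (DDR5_MEMORY_W - LPDDR5X_MEMORY_W) / 1000


# Hypothetical AI factory with 10,000 CPU sockets.
saved_kw = min_memory_power_saved_kw(10_000)
print(f"memory power saved: at least {saved_kw:.0f} kW")
```

Because both inputs are bounds in the conservative direction, the result is a minimum; it excludes cooling overhead and any difference in CPU package power.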
The AI factory CPU for agents
The era of agentic AI requires a shift in CPU design, from maximizing cores per dollar to maximizing AI factory output per watt and per dollar. The NVIDIA Vera CPU is the CPU for agents, combining fast per-core performance, high concurrency, and power-efficient memory bandwidth. With the custom Olympus core, LPDDR5X memory, and NVIDIA Scalable Coherency Fabric, the Vera CPU delivers more than 1.8x higher agentic sandbox performance than traditional x86 architectures, helping AI factories complete more tool calls, return more evaluations, and keep accelerators moving.
Learn more about the Vera CPU, the NVIDIA Vera Rubin NVL2, and the Vera CPU benchmarking by Phoronix.
Relative performance based on measured data, and subject to change. NVIDIA Vera CPU with LPDDR5X performance baselined to the latest x86 CPU.