NVIDIA's full-stack inference software, codesigned with hardware, has reduced token costs by up to 5x on the Blackwell platform in just one month, enabling lower cost per token for AI factories. Companies like Baseten, Cognition, Deep Infra, and Together AI are using the stack to optimize inference performance.
<div id="bsf_rt_marker"></div><p><span style="font-weight: 400;">As organizations move from AI pilots to production AI factories, infrastructure decisions have shifted from peak chip specifications to cost per </span><a href="https://blogs.nvidia.com/blog/ai-tokens-explained/"><span style="font-weight: 400;">token</span></a><span style="font-weight: 400;">: how many useful tokens they can deliver per dollar, per watt and within required latency targets.</span></p>
<p><span style="font-weight: 400;">Codesigned with NVIDIA GPUs, CPUs, networking and systems, and strengthened by a broad open source ecosystem, NVIDIA’s full-stack inference software continuously improves hardware performance. On the </span><a target="_blank" href="https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/"><span style="font-weight: 400;">NVIDIA Blackwell</span></a><span style="font-weight: 400;"> platform, the software stack has already reduced token costs by up to 5x on the DeepSeek V4 model in just one month.</span></p>
<figure id="attachment_95787" aria-describedby="caption-attachment-95787" style="width: 1920px" class="wp-caption alignnone"><img fetchpriority="high" decoding="async" class="wp-image-95787 size-full" src="https://blogs.nvidia.com/wp-content/uploads/2026/06/semi-analysis-inference-x-5x.jpg" alt="" width="1920" height="1080" srcset="https://blogs.nvidia.com/wp-content/uploads/2026/06/semi-analysis-inference-x-5x.jpg 1920w, https://blogs.nvidia.com/wp-content/uploads/2026/06/semi-analysis-inference-x-5x-960x540.jpg 960w, https://blogs.nvidia.com/wp-content/uploads/2026/06/semi-analysis-inference-x-5x-1680x945.jpg 1680w, https://blogs.nvidia.com/wp-content/uploads/2026/06/semi-analysis-inference-x-5x-1280x720.jpg 1280w, https://blogs.nvidia.com/wp-content/uploads/2026/06/semi-analysis-inference-x-5x-1536x864.jpg 1536w, https://blogs.nvidia.com/wp-content/uploads/2026/06/semi-analysis-inference-x-5x-1290x725.jpg 1290w, https://blogs.nvidia.com/wp-content/uploads/2026/06/semi-analysis-inference-x-5x-630x354.jpg 630w, https://blogs.nvidia.com/wp-content/uploads/2026/06/semi-analysis-inference-x-5x-300x169.jpg 300w, https://blogs.nvidia.com/wp-content/uploads/2026/06/semi-analysis-inference-x-5x-400x225.jpg 400w" sizes="(max-width: 1920px) 100vw, 1920px" /><figcaption id="caption-attachment-95787" class="wp-caption-text">SemiAnalysis InferenceX results comparing token cost and interactivity for NVIDIA GB300 NVL72 systems with SGLang and the NVIDIA Dynamo inference framework.</figcaption></figure>
<p><span style="font-weight: 400;">Leading companies and inference providers are already seeing the compounding value of NVIDIA’s inference software stack on Blackwell: </span></p>
<ul>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;"><a target="_blank" href="https://www.baseten.co/products/model-apis/">Baseten</a> used the NVIDIA TensorRT-LLM open source library to serve DeepSeek V4 Pro on Blackwell GPUs for reasoning, coding and long-context workloads, applying proprietary runtime optimizations to deliver up to 50% more tokens per second.</span></li>
<li style="font-weight: 400;" aria-level="1"><a target="_blank" href="https://cognition.com/blog/swe-1-6"><span style="font-weight: 400;">Cognition</span></a><span style="font-weight: 400;"> is using the NVIDIA Dynamo inference framework to manage inference GPUs, giving its team a ready-made path to scale reinforcement learning workloads without needing to build that infrastructure from scratch. </span></li>
<li style="font-weight: 400;" aria-level="1"><a target="_blank" href="https://deepinfra.com/blog/deepinfra-nvidia-inference-stack"><span style="font-weight: 400;">Deep Infra</span></a><span style="font-weight: 400;"> uses the NVIDIA inference software stack to serve frontier open source models performantly on Blackwell from day zero, including DeepSeek V4. </span></li>
<li style="font-weight: 400;" aria-level="1"><a target="_blank" href="https://youtu.be/10Kb3IB0d70"><span style="font-weight: 400;">Together AI</span></a><span style="font-weight: 400;"> used NVIDIA TensorRT-LLM on Blackwell to help Cursor accelerate the path from model optimizations to production endpoints for its real-time coding experience. </span></li>
</ul>
<h2><strong>Why Software Matters for Inference Economics</strong></h2>
<p><span style="font-weight: 400;">Traditional web, search and software-as-a-service workloads were relatively predictable: A user might load a page, refresh a feed or update a business record. These requests typically followed similar software paths, reading from or writing to a database, and scaled by adding more of the same servers. </span></p>
<p><span style="font-weight: 400;">Agentic AI is different.</span></p>
<figure id="attachment_95793" aria-describedby="caption-attachment-95793" style="width: 1920px" class="wp-caption alignnone"><img decoding="async" class="wp-image-95793 size-full" src="https://blogs.nvidia.com/wp-content/uploads/2026/06/traditional-vs-agentic-think-smart.jpg" alt="" width="1920" height="1080" srcset="https://blogs.nvidia.com/wp-content/uploads/2026/06/traditional-vs-agentic-think-smart.jpg 1920w, https://blogs.nvidia.com/wp-content/uploads/2026/06/traditional-vs-agentic-think-smart-960x540.jpg 960w, https://blogs.nvidia.com/wp-content/uploads/2026/06/traditional-vs-agentic-think-smart-1680x945.jpg 1680w, https://blogs.nvidia.com/wp-content/uploads/2026/06/traditional-vs-agentic-think-smart-1280x720.jpg 1280w, https://blogs.nvidia.com/wp-content/uploads/2026/06/traditional-vs-agentic-think-smart-1536x864.jpg 1536w, https://blogs.nvidia.com/wp-content/uploads/2026/06/traditional-vs-agentic-think-smart-1290x725.jpg 1290w, https://blogs.nvidia.com/wp-content/uploads/2026/06/traditional-vs-agentic-think-smart-630x354.jpg 630w, https://blogs.nvidia.com/wp-content/uploads/2026/06/traditional-vs-agentic-think-smart-300x169.jpg 300w, https://blogs.nvidia.com/wp-content/uploads/2026/06/traditional-vs-agentic-think-smart-400x225.jpg 400w" sizes="(max-width: 1920px) 100vw, 1920px" /><figcaption id="caption-attachment-95793" class="wp-caption-text">Agentic AI runs distributed, stateful workflows that span LLMs, tools, memory, security, networking and accelerated computing across the data center.</figcaption></figure>
<p><span style="font-weight: 400;">Agents can reason, plan, call tools, spin up specialist subagents and manage massive context across multi-turn workflows. They turn a single request into a distributed computing problem that can span hundreds of subagents, thousands of tasks and multiple large language models, running across GPUs, CPUs, DPUs and storage systems. </span></p>
<p><span style="font-weight: 400;">The software stack determines whether that complexity turns into wasted capacity or lower </span><a href="https://blogs.nvidia.com/blog/lowest-token-cost-ai-factories/"><span style="font-weight: 400;">cost per token</span></a><span style="font-weight: 400;">.</span></p>
<p><span style="font-weight: 400;">Lower cost per token comes from turning individual optimizations into system-level performance. NVIDIA’s inference software stack does this by connecting three layers: </span></p>
<ul>
<li style="font-weight: 400;" aria-level="1"><b>Production Operation:</b><span style="font-weight: 400;"> Coordinates distributed serving, orchestration, autoscaling and memory management so inference can run across the right compute and storage resources.</span></li>
<li style="font-weight: 400;" aria-level="1"><b>Application Acceleration: </b><span style="font-weight: 400;">Runs models with high performance while giving developers room to tune and customize, using runtime optimizations such as overlapping compute and communication and kernel fusion.</span></li>
<li style="font-weight: 400;" aria-level="1"><b>Infrastructure Access:</b><span style="font-weight: 400;"> Exposes NVIDIA GPU, networking, memory and system capabilities without requiring developers to manage every device instruction set or data-transfer protocol directly.</span></li>
</ul>
<figure id="attachment_95854" aria-describedby="caption-attachment-95854" style="width: 1921px" class="wp-caption alignnone"><img decoding="async" class="wp-image-95854 size-full" src="https://blogs.nvidia.com/wp-content/uploads/2026/06/inference-social-os-ai-sw-anference-beat-5244100-v9_Slide3-1-1.jpg" alt="" width="1921" height="862" srcset="https://blogs.nvidia.com/wp-content/uploads/2026/06/inference-social-os-ai-sw-anference-beat-5244100-v9_Slide3-1-1.jpg 1921w, https://blogs.nvidia.com/wp-content/uploads/2026/06/inference-social-os-ai-sw-anference-beat-5244100-v9_Slide3-1-1-960x431.jpg 960w, https://blogs.nvidia.com/wp-content/uploads/2026/06/inference-social-os-ai-sw-anference-beat-5244100-v9_Slide3-1-1-1680x754.jpg 1680w, https://blogs.nvidia.com/wp-content/uploads/2026/06/inference-social-os-ai-sw-anference-beat-5244100-v9_Slide3-1-1-1280x574.jpg 1280w, https://blogs.nvidia.com/wp-content/uploads/2026/06/inference-social-os-ai-sw-anference-beat-5244100-v9_Slide3-1-1-1536x689.jpg 1536w, https://blogs.nvidia.com/wp-content/uploads/2026/06/inference-social-os-ai-sw-anference-beat-5244100-v9_Slide3-1-1-630x283.jpg 630w" sizes="(max-width: 1921px) 100vw, 1921px" /><figcaption id="caption-attachment-95854" class="wp-caption-text">The NVIDIA software stack spans model serving, runtime scheduling, kernels, communication libraries and hardware-aware optimizations, enabling rapid performance gains and lower serving costs as improvements compound across layers.</figcaption></figure>
<p><span style="font-weight: 400;">When these layers work as one system, individual optimizations compound.</span></p>
<p><span style="font-weight: 400;">Disaggregated serving, large expert parallelism over </span><a target="_blank" href="https://www.nvidia.com/en-us/data-center/nvlink/"><span style="font-weight: 400;">NVIDIA NVLink</span></a><span style="font-weight: 400;"> interconnect technology, NVFP4 precision and multi-token prediction each deliver meaningful gains on their own. Combined, they increase throughput by up to 20x.</span></p>
<p><span style="font-weight: 400;">The chart below shows the result. Capturing that gain in production is complex, requiring coordination across the full inference stack — from production operations and model runtimes to kernels, communication libraries and hardware access. NVIDIA’s inference software stack is designed to make those layers work together so each optimization can build on the others. </span></p>
<figure id="attachment_95796" aria-describedby="caption-attachment-95796" style="width: 1920px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="wp-image-95796 size-full" src="https://blogs.nvidia.com/wp-content/uploads/2026/06/stacking-software-optimizations-think-smart.jpg" alt="" width="1920" height="1080" srcset="https://blogs.nvidia.com/wp-content/uploads/2026/06/stacking-software-optimizations-think-smart.jpg 1920w, https://blogs.nvidia.com/wp-content/uploads/2026/06/stacking-software-optimizations-think-smart-960x540.jpg 960w, https://blogs.nvidia.com/wp-content/uploads/2026/06/stacking-software-optimizations-think-smart-1680x945.jpg 1680w, https://blogs.nvidia.com/wp-content/uploads/2026/06/stacking-software-optimizations-think-smart-1280x720.jpg 1280w, https://blogs.nvidia.com/wp-content/uploads/2026/06/stacking-software-optimizations-think-smart-1536x864.jpg 1536w, https://blogs.nvidia.com/wp-content/uploads/2026/06/stacking-software-optimizations-think-smart-1290x725.jpg 1290w, https://blogs.nvidia.com/wp-content/uploads/2026/06/stacking-software-optimizations-think-smart-630x354.jpg 630w, https://blogs.nvidia.com/wp-content/uploads/2026/06/stacking-software-optimizations-think-smart-300x169.jpg 300w, https://blogs.nvidia.com/wp-content/uploads/2026/06/stacking-software-optimizations-think-smart-400x225.jpg 400w" sizes="auto, (max-width: 1920px) 100vw, 1920px" /><figcaption id="caption-attachment-95796" class="wp-caption-text">Stacking software optimizations compounds performance gains, increasing NVIDIA Blackwell token throughput per GPU from baseline to up to 20x with disaggregated serving, large expert parallelism (Large EP), NVFP4 and multi-token prediction (MTP).</figcaption></figure>
<h2><strong>Open Source Amplifies the Full-Stack Advantage</strong></h2>
<p><span style="font-weight: 400;">That same full-stack foundation is amplified by the open source ecosystem. Many of today’s most widely used open source AI frameworks and inference projects are built natively on </span><a target="_blank" href="https://developer.nvidia.com/cuda"><span style="font-weight: 400;">NVIDIA CUDA</span></a><span style="font-weight: 400;">, which means new research and software optimizations run with leading performance on NVIDIA GPUs from day zero.</span></p>
<p><span style="font-weight: 400;">PyTorch is a leading example. Launched in 2016 with native CUDA support, PyTorch has coevolved with NVIDIA’s architecture, giving developers access to innovations such as Tensor Cores, Transformer Engine and NVFP4 directly through a familiar framework. </span></p>
<p><span style="font-weight: 400;">When breakthroughs such as </span><a target="_blank" href="https://developer.nvidia.com/blog/boost-inference-performance-up-to-15x-on-nvidia-blackwell-using-dflash-speculative-decoding/"><span style="font-weight: 400;">DFlash speculative decode</span></a><span style="font-weight: 400;">, which delivers up to 15x more throughput on existing hardware, or </span><a target="_blank" href="https://haoailab.com/blogs/fastvideo_realtime_1080p/"><span style="font-weight: 400;">FastVideo</span></a><span style="font-weight: 400;">, which generates 1080p videos in less than five seconds, land in PyTorch, they can run instantly on NVIDIA, helping AI factories convert research progress into lower token costs.</span></p>
<figure id="attachment_95790" aria-describedby="caption-attachment-95790" style="width: 1920px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="wp-image-95790 size-full" src="https://blogs.nvidia.com/wp-content/uploads/2026/06/pytorch-nvidia-codevelopment-think-smart.jpg" alt="" width="1920" height="1080" srcset="https://blogs.nvidia.com/wp-content/uploads/2026/06/pytorch-nvidia-codevelopment-think-smart.jpg 1920w, https://blogs.nvidia.com/wp-content/uploads/2026/06/pytorch-nvidia-codevelopment-think-smart-960x540.jpg 960w, https://blogs.nvidia.com/wp-content/uploads/2026/06/pytorch-nvidia-codevelopment-think-smart-1680x945.jpg 1680w, https://blogs.nvidia.com/wp-content/uploads/2026/06/pytorch-nvidia-codevelopment-think-smart-1280x720.jpg 1280w, https://blogs.nvidia.com/wp-content/uploads/2026/06/pytorch-nvidia-codevelopment-think-smart-1536x864.jpg 1536w, https://blogs.nvidia.com/wp-content/uploads/2026/06/pytorch-nvidia-codevelopment-think-smart-1290x725.jpg 1290w, https://blogs.nvidia.com/wp-content/uploads/2026/06/pytorch-nvidia-codevelopment-think-smart-630x354.jpg 630w, https://blogs.nvidia.com/wp-content/uploads/2026/06/pytorch-nvidia-codevelopment-think-smart-300x169.jpg 300w, https://blogs.nvidia.com/wp-content/uploads/2026/06/pytorch-nvidia-codevelopment-think-smart-400x225.jpg 400w" sizes="auto, (max-width: 1920px) 100vw, 1920px" /><figcaption id="caption-attachment-95790" class="wp-caption-text">NVIDIA and PyTorch codevelopment helps bring new AI software innovations to developers, helping turn CUDA-native advances into production performance as PyTorch adoption grows.</figcaption></figure>
<p><span style="font-weight: 400;">The same open source momentum is why when a new frontier open model like DeepSeek V4 is released, leading inference frameworks like </span><span style="font-weight: 400;">vLLM </span><span style="font-weight: 400;">and </span><span style="font-weight: 400;">SGLang </span><span style="font-weight: 400;">have </span><a target="_blank" href="https://developer.nvidia.com/blog/build-with-deepseek-v4-using-nvidia-blackwell-and-gpu-accelerated-endpoints/"><span style="font-weight: 400;">day-zero deployment recipes</span></a><span style="font-weight: 400;"> for the NVIDIA Blackwell architecture — making the model accessible across millions of Blackwell GPUs. It’s also why DeepSeek V4 performance on Blackwell improved by up to 5x within about a month across vLLM and </span><a target="_blank" href="https://pytorch.org/blog/serving-deepseek-v4-on-gb300-with-sglang-5x-higher-throughput-at-the-same-interactivity-since-day-0/"><span style="font-weight: 400;">SGLang</span></a> <span style="font-weight: 400;">frameworks, cutting token costs to roughly one-fifth of previous levels.</span></p>
<figure id="attachment_95784" aria-describedby="caption-attachment-95784" style="width: 1280px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="wp-image-95784 size-full" src="https://blogs.nvidia.com/wp-content/uploads/2026/06/think-smart-software-optimizations.png" alt="" width="1280" height="720" srcset="https://blogs.nvidia.com/wp-content/uploads/2026/06/think-smart-software-optimizations.png 1280w, https://blogs.nvidia.com/wp-content/uploads/2026/06/think-smart-software-optimizations-960x540.png 960w, https://blogs.nvidia.com/wp-content/uploads/2026/06/think-smart-software-optimizations-630x354.png 630w, https://blogs.nvidia.com/wp-content/uploads/2026/06/think-smart-software-optimizations-300x169.png 300w, https://blogs.nvidia.com/wp-content/uploads/2026/06/think-smart-software-optimizations-400x225.png 400w" sizes="auto, (max-width: 1280px) 100vw, 1280px" /><figcaption id="caption-attachment-95784" class="wp-caption-text">SemiAnalysis InferenceX results comparing token throughput at same interactivity for NVIDIA GB200 NVL72 systems with vLLM and the NVIDIA Dynamo inference framework.</figcaption></figure>
<p><span style="font-weight: 400;">That’s the open source flywheel: more developers optimize CUDA-native inference paths, more production deployments feed back into the ecosystem and each software improvement increases delivered token output while lowering cost per token over time.</span></p>
<p><i><span style="font-weight: 400;">Explore how software multiplies hardware performance in this </span></i><a target="_blank" href="https://www.youtube.com/watch?v=zNuOOMM20Tk"><i><span style="font-weight: 400;">NVIDIA AI Podcast on tokenomics</span></i></a><i><span style="font-weight: 400;"> and this </span></i><a target="_blank" href="https://www.nvidia.com/en-us/solutions/ai/inference/"><i><span style="font-weight: 400;">inference solutions page</span></i></a><i><span style="font-weight: 400;">. </span></i></p>
# How NVIDIA’s Inference Software Stack Powers the Lowest Token Cost
Source: [https://blogs.nvidia.com/blog/inference-software-lowest-token-cost/](https://blogs.nvidia.com/blog/inference-software-lowest-token-cost/)
As organizations move from AI pilots to production AI factories, infrastructure decisions have shifted from peak chip specifications to cost per[token](https://blogs.nvidia.com/blog/ai-tokens-explained/): how many useful tokens they can deliver per dollar, per watt and within required latency targets\.
Codesigned with NVIDIA GPUs, CPUs, networking and systems, and strengthened by a broad open source ecosystem, NVIDIA’s full\-stack inference software continuously improves hardware performance\. On the[NVIDIA Blackwell](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/)platform, the software stack has already reduced token costs by up to 5x on the DeepSeek V4 model in just one month\.
SemiAnalysis InferenceX results comparing token cost and interactivity for NVIDIA GB300 NVL72 systems with SGLang and the NVIDIA Dynamo inference framework\.Leading companies and inference providers are already seeing the compounding value of NVIDIA’s inference software stack on Blackwell:
- [Baseten](https://www.baseten.co/products/model-apis/)used the NVIDIA TensorRT\-LLM open source library to serve DeepSeek V4 Pro on Blackwell GPUs for reasoning, coding and long\-context workloads, applying proprietary runtime optimizations to deliver up to 50% more tokens per second\.
- [Cognition](https://cognition.com/blog/swe-1-6)is using the NVIDIA Dynamo inference framework to manage inference GPUs, giving its team a ready\-made path to scale reinforcement learning workloads without needing to build that infrastructure from scratch\.
- [Deep Infra](https://deepinfra.com/blog/deepinfra-nvidia-inference-stack)uses the NVIDIA inference software stack to serve frontier open source models performantly on Blackwell from day zero, including DeepSeek V4\.
- [Together AI](https://youtu.be/10Kb3IB0d70)used NVIDIA TensorRT\-LLM on Blackwell to help Cursor accelerate the path from model optimizations to production endpoints for its real\-time coding experience\.
## **Why Software Matters for Inference Economics**
Traditional web, search and software\-as\-a\-service workloads were relatively predictable: A user might load a page, refresh a feed or update a business record\. These requests typically followed similar software paths, reading from or writing to a database, and scaled by adding more of the same servers\.
Agentic AI is different\.
Agentic AI runs distributed, stateful workflows that span LLMs, tools, memory, security, networking and accelerated computing across the data center\.Agents can reason, plan, call tools, spin up specialist subagents and manage massive context across multi\-turn workflows\. They turn a single request into a distributed computing problem that can span hundreds of subagents, thousands of tasks and multiple large language models, running across GPUs, CPUs, DPUs and storage systems\.
The software stack determines whether that complexity turns into wasted capacity or lower[cost per token](https://blogs.nvidia.com/blog/lowest-token-cost-ai-factories/)\.
Lower cost per token comes from turning individual optimizations into system\-level performance\. NVIDIA’s inference software stack does this by connecting three layers:
- **Production Operation:**Coordinates distributed serving, orchestration, autoscaling and memory management so inference can run across the right compute and storage resources\.
- **Application Acceleration:**Runs models with high performance while giving developers room to tune and customize, using runtime optimizations such as overlapping compute and communication and kernel fusion\.
- **Infrastructure Access:**Exposes NVIDIA GPU, networking, memory and system capabilities without requiring developers to manage every device instruction set or data\-transfer protocol directly\.
The NVIDIA software stack spans model serving, runtime scheduling, kernels, communication libraries and hardware\-aware optimizations, enabling rapid performance gains and lower serving costs as improvements compound across layers\.When these layers work as one system, individual optimizations compound\.
Disaggregated serving, large expert parallelism over[NVIDIA NVLink](https://www.nvidia.com/en-us/data-center/nvlink/)interconnect technology, NVFP4 precision and multi\-token prediction each deliver meaningful gains on their own\. Combined, they increase throughput by up to 20x\.
The chart below shows the result\. Capturing that gain in production is complex, requiring coordination across the full inference stack — from production operations and model runtimes to kernels, communication libraries and hardware access\. NVIDIA’s inference software stack is designed to make those layers work together so each optimization can build on the others\.
Stacking software optimizations compounds performance gains, increasing NVIDIA Blackwell token throughput per GPU from baseline to up to 20x with disaggregated serving, large expert parallelism \(Large EP\), NVFP4 and multi\-token prediction \(MTP\)\.## **Open Source Amplifies the Full\-Stack Advantage**
That same full\-stack foundation is amplified by the open source ecosystem\. Many of today’s most widely used open source AI frameworks and inference projects are built natively on[NVIDIA CUDA](https://developer.nvidia.com/cuda), which means new research and software optimizations run with leading performance on NVIDIA GPUs from day zero\.
PyTorch is a leading example\. Launched in 2016 with native CUDA support, PyTorch has coevolved with NVIDIA’s architecture, giving developers access to innovations such as Tensor Cores, Transformer Engine and NVFP4 directly through a familiar framework\.
When breakthroughs such as[DFlash speculative decode](https://developer.nvidia.com/blog/boost-inference-performance-up-to-15x-on-nvidia-blackwell-using-dflash-speculative-decoding/), which delivers up to 15x more throughput on existing hardware, or[FastVideo](https://haoailab.com/blogs/fastvideo_realtime_1080p/), which generates 1080p videos in less than five seconds, land in PyTorch, they can run instantly on NVIDIA, helping AI factories convert research progress into lower token costs\.
NVIDIA and PyTorch codevelopment helps bring new AI software innovations to developers, helping turn CUDA\-native advances into production performance as PyTorch adoption grows\.The same open source momentum is why when a new frontier open model like DeepSeek V4 is released, leading inference frameworks likevLLMandSGLanghave[day\-zero deployment recipes](https://developer.nvidia.com/blog/build-with-deepseek-v4-using-nvidia-blackwell-and-gpu-accelerated-endpoints/)for the NVIDIA Blackwell architecture — making the model accessible across millions of Blackwell GPUs\. It’s also why DeepSeek V4 performance on Blackwell improved by up to 5x within about a month across vLLM and[SGLang](https://pytorch.org/blog/serving-deepseek-v4-on-gb300-with-sglang-5x-higher-throughput-at-the-same-interactivity-since-day-0/)frameworks, cutting token costs to roughly one\-fifth of previous levels\.
SemiAnalysis InferenceX results comparing token throughput at same interactivity for NVIDIA GB200 NVL72 systems with vLLM and the NVIDIA Dynamo inference framework\.That’s the open source flywheel: more developers optimize CUDA\-native inference paths, more production deployments feed back into the ecosystem and each software improvement increases delivered token output while lowering cost per token over time\.
*Explore how software multiplies hardware performance in this*[*NVIDIA AI Podcast on tokenomics*](https://www.youtube.com/watch?v=zNuOOMM20Tk)*and this*[*inference solutions page*](https://www.nvidia.com/en-us/solutions/ai/inference/)*\.*
NVIDIA argues that cost per token is the most critical metric for AI Total Cost of Ownership, surpassing traditional measures like FLOPS per dollar, to better reflect real-world inference efficiency and profitability.
Five Chinese AI labs cut inference token prices by up to 99% in a price war, making frontier inference nearly free and shifting the competitive advantage from models to distribution and tooling.
NVIDIA CEO Jensen Huang highlighted an inflection point in AI inference during the GTC keynote, while Supermicro is partnering with NVIDIA to deliver turnkey 'AI Factory' infrastructure solutions built around the Blackwell platform.
Tensordyne announced a breakthrough inference system using logarithmic math in hardware, claiming 17x more tokens per watt and 13x higher throughput than NVIDIA Blackwell, achieved by replacing complex multiplication with simple addition in log space.