Portability Is a Myth: Why the Best AI Stacks Will Never Be Hardware-Agnostic (15 minute read)

TLDR AI 05/18/26, 12:00 AM News


Summary

This opinion piece argues that AI kernel portability across different hardware (TPU, GPU, etc.) is structurally impossible due to fundamental hardware differences, and that the best AI stacks will always require hardware-specific DSLs for optimal performance, despite the industry's desire for portability.

AI kernel portability is structurally impossible because TPU's Pallas, NVIDIA's CuTile and CUTLASS, AWS's NKI, AMD's FlyDSL, and Tenstorrent's tt-Metalium each expose hardware-specific concepts that no universal DSL can unify. The evidence: MaxText's MoE grouped matmul ships as 282 lines of Pallas on TPU while flashinfer's equivalent for Blackwell SM100 takes 4 million lines of generated CUDA, with zero shared code because the algorithms themselves diverge across hardware.

Original Article


Cached at: 05/19/26, 12:20 AM


Portability Is a Myth: Why the Best AI Stacks Will Never Be Hardware-Agnostic

Connect on LinkedIn: https://www.linkedin.com/in/patrick-toulme-150b041a5/

Follow on X: https://x.com/PatrickToulme

Motivation

Disclaimer: This blog is more opinionated than my prior blog posts.

The AI industry says it wants portable kernels. It keeps building hardware-specific ones.

TPUs have Pallas. Trainium has NKI. NVIDIA has CUDA C, CUTLASS, Triton, CuTile, CuTe — five DSLs and counting. AMD adopted Triton, the closest thing to a “portable” kernel DSL — then built FlyDSL anyway. Tenstorrent has tt-Metalium. Mojo targets NVIDIA and AMD.

If portability worked, we’d have one universal DSL. We have many — and the number keeps growing.

282 lines of Pallas Python on TPU. 4 million lines of generated CUDA C++ on Blackwell. Same operation — MoE grouped matrix multiplication. Zero shared code. Not because anyone failed at portability, but because the hardware is different enough that the optimal algorithms, tiling strategies, memory staging, and synchronization are all different. Portability at the performance layer was never possible.

And yet teams keep chasing it. Startups burn months building “hardware-agnostic” training stacks that run everywhere and run fast nowhere. Hardware vendors contort their silicon to fit someone else’s programming model. Engineers write portable kernels that achieve 30% MFU on two platforms instead of 90% MFU on one — and at frontier scale, that gap is millions of dollars in wasted compute.

This post argues that the best AI stacks of the future will not be portable — and that this is a good thing. It lays out why portability at the kernel level is a myth, why every hardware vendor will end up building their own DSL, and what those DSLs should look like to maximize both performance and usability.

Three Layers of the AI Stack

Every AI training or inference stack has three layers. Understanding which layers are portable — and which aren’t — is the key to understanding why the industry keeps rebuilding infrastructure from scratch.

Layer 1: The Math (Portable)

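```python
import jax
import jax.numpy as jnp
import optax

# This runs on TPU, GPU, CPU, Trainium. It’s the same math everywhere.
# x, w, logits, and labels are arrays defined elsewhere.
y = jnp.dot(x, w)  # matrix multiply
probs = jax.nn.softmax(logits)  # softmax
loss = optax.softmax_cross_entropy(logits, labels)
```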

PyTorch and JAX are portable at this layer. torch.matmul and jnp.dot express a mathematical operation — multiply two matrices. The framework doesn’t specify how. This is real portability. It works, and it matters.

But jnp.dot is not what runs on the hardware. What runs on the hardware is the output of a compiler that made thousands of hardware-specific decisions about tiling, memory placement, instruction selection, synchronization, and scheduling. The math is the same. Everything else is different.

Layer 2: The Compiler (The Portability Illusion)

XLA, torch.compile, Triton’s compiler — these sit between your portable math and the hardware. They take jnp.dot and produce TPU VLIW bundles, or CUDA PTX, or Trainium NeuronCore instructions. The API is portable. The output is not.

This is where the portability illusion lives. **When someone says “JAX runs on TPU and GPU,” what they mean is: JAX’s tracing and HLO generation are shared, but XLA-TPU and XLA-GPU are different compilers that produce fundamentally different code for fundamentally different hardware.** They share an IR format. They don’t share a backend.

The same is true for PyTorch. torch.compile with the Inductor backend targets NVIDIA GPUs. torch.compile with the XLA backend targets TPUs. The torch.matmul call is the same. Everything below it diverges.
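To make the shared-front-end, separate-back-end split concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical: the registry, `trace`, `compile_for`, and the instruction strings are illustrative stand-ins, not XLA or torch.compile APIs. It shows only the shape of the argument: one traced op, two unrelated lowerings.

```python
from typing import Callable, Dict, List

# Hypothetical backend registry. None of these names are real XLA or
# torch.compile APIs; they only illustrate the layering.
Lowering = Callable[[str], List[str]]
BACKENDS: Dict[str, Lowering] = {}


def register_backend(target: str) -> Callable[[Lowering], Lowering]:
    """Decorator that records a lowering function under a target name."""
    def wrap(fn: Lowering) -> Lowering:
        BACKENDS[target] = fn
        return fn
    return wrap


def trace(op: str, *operands: str) -> str:
    """Shared front end: turn a math call into one target-independent IR string."""
    return f"{op}({', '.join(operands)})"


@register_backend("tpu")
def lower_tpu(ir: str) -> List[str]:
    # Made-up instruction names standing in for a TPU-style lowering.
    return [f"dma hbm->vmem for {ir}", "mxu matmul", "dma vmem->hbm"]


@register_backend("gpu")
def lower_gpu(ir: str) -> List[str]:
    # Made-up instruction names standing in for a GPU-style lowering.
    return [f"tma load to smem for {ir}", "tensor-core mma", "store to global memory"]


def compile_for(target: str, op: str, *operands: str) -> List[str]:
    """Same traced IR in, target-specific program out."""
    return BACKENDS[target](trace(op, *operands))


tpu_program = compile_for("tpu", "dot", "x", "w")
gpu_program = compile_for("gpu", "dot", "x", "w")
```

The two programs start from the identical string `dot(x, w)` and share no instruction afterwards, which is the point of the paragraph above in miniature.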

Layer 3: The Hardware-Native Code (Not Portable)

This is where MFU is won or lost.

On TPU, a fused attention kernel is a Pallas program that manages VMEM explicitly, tiles across 8 sublanes × 128 lanes, issues DMA operations for HBM↔VMEM transfers, and produces VLIW bundles that pack MXU matmul, VPU elementwise, and XLU cross-lane ops into single cycles.

On Blackwell, the same logical operation is a CuTile program that allocates TMEM columns with atomic contention handling, elects a warp leader with elect.sync, issues tcgen05 MMA instructions, manages 20 mbarrier objects for async pipelining, and backs off with NANOSLEEP on TMEM allocation failure.

On Trainium, that operation is an NKI kernel that partitions work across NeuronCores, stages data through SBUF (on-chip SRAM) and PSUM (accumulation memory), and issues tensor engine instructions with explicit partition-dimension tiling.

These are not different implementations of the same algorithm. They are different algorithms driven by different hardware constraints. You cannot port one to another. You rewrite from scratch.

The Evidence: Same Math, Zero Shared Code

Here’s the most concrete version of this argument. Mixture-of-Experts grouped matrix multiplication — the same operation compiled for two different hardware targets.

On TPU: MaxText MegaBlox (Pallas)

From my last post: MaxText’s MoE layer compiles into 29 GMM Pallas kernel calls. Each one is a tpu_custom_call that:

Runs on a 3D grid (tiles_n, num_active_tiles, tiles_k)
DMA-fetches [512, 1024] tiles of sorted tokens from HBM to VMEM
Selects the correct expert’s weight tile via group_ids[grid_id]
Accumulates dot(lhs_tile, rhs_tile) in VMEM f32 scratch
Applies group boundary masks and stores results back to HBM

The backward pass uses two separate kernels: gmm with transpose_rhs=True for input gradients, and tgmm — a structurally different kernel — for weight gradients. Tiling is (512, 1024, 1024), tuned for the TPU MXU’s 256×256 systolic arrays.

Total handwritten kernel code: 282 lines of Pallas Python targeting TPU memory hierarchy and MXU.
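As a reading aid, the semantics of a grouped matmul fit in a few lines of plain Python. This is not MegaBlox code and makes no attempt at performance. `gmm_reference` and `gmm_tiled` are hypothetical names; the tiled version only mimics the idea of visiting row tiles per group and masking rows outside the group boundary, and it assumes the group sizes sum to the number of rows.

```python
from typing import List

Matrix = List[List[float]]


def _row_times_matrix(row: List[float], w: Matrix) -> List[float]:
    """Multiply one [k] row by a [k][n] matrix."""
    return [sum(row[k] * w[k][n] for k in range(len(w))) for n in range(len(w[0]))]


def gmm_reference(lhs: Matrix, rhs: List[Matrix], group_sizes: List[int]) -> Matrix:
    """Rows of lhs are sorted by expert; group g owns the next group_sizes[g] rows."""
    out: Matrix = []
    row = 0
    for g, size in enumerate(group_sizes):
        for r in range(row, row + size):
            out.append(_row_times_matrix(lhs[r], rhs[g]))
        row += size
    return out


def gmm_tiled(lhs: Matrix, rhs: List[Matrix], group_sizes: List[int], tile_m: int) -> Matrix:
    """Same result, visiting (group, row tile) pairs and masking rows outside the group."""
    m, n = len(lhs), len(rhs[0][0])
    out: Matrix = [[0.0] * n for _ in range(m)]
    start = 0
    for g, size in enumerate(group_sizes):
        end = start + size
        if size > 0:
            # Only row tiles that overlap this group are visited.
            for t in range(start // tile_m, (end - 1) // tile_m + 1):
                for r in range(t * tile_m, min((t + 1) * tile_m, m)):
                    if start <= r < end:  # group boundary mask
                        out[r] = _row_times_matrix(lhs[r], rhs[g])
        start = end
    return out


# Five sorted tokens, two experts: the first three rows go to expert 0.
lhs = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 2]]
rhs = [[[1, 2], [3, 4]], [[10, 0], [0, 10]]]
sizes = [3, 2]
result = gmm_tiled(lhs, rhs, sizes, tile_m=2)
```

With a tile height of 2 the middle tile straddles both experts, so it is visited twice with different masks. That boundary case is what the masking step in the list above exists for.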

On NVIDIA: flashinfer MoE (CUDA)

flashinfer PR #2917: 300 files of generated CUDA kernels for the same logical operation — MoE batched GEMM. Each file targets SM100 (Blackwell) and encodes its optimization parameters directly in the filename:

```text
batched_gemm_e2_sm100_s128x128x128_…bf16_tma_warpspecialized_cooperative_align.cu
batched_gemm_e2_sm100_s128x256x128_…fp8_tma_warpspecialized_pingpong_align.cu
```

Tile sizes (128×128×128, 128×256×128). Data types (bf16, fp8). Memory access patterns (TMA). Scheduling strategies (cooperative, pingpong). Architecture target (SM100). All baked in. The PR adds approximately 4 million lines of generated CUDA — not because the developers enjoy writing CUDA, but because peak performance on SM100 requires SM100-specific code.
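To show how much configuration those filenames carry, here is a small sketch that pulls the visible fields back out with regular expressions. `parse_kernel_filename` is a hypothetical helper written for this post, not part of flashinfer, and the patterns are assumptions based only on the two names quoted above. The elided middle of each name is left as the literal `…` and is not interpreted.

```python
import re
from typing import Dict, List

# The two filenames quoted above, with the elided middle kept verbatim.
EXAMPLES: List[str] = [
    "batched_gemm_e2_sm100_s128x128x128_…bf16_tma_warpspecialized_cooperative_align.cu",
    "batched_gemm_e2_sm100_s128x256x128_…fp8_tma_warpspecialized_pingpong_align.cu",
]


def parse_kernel_filename(name: str) -> Dict[str, object]:
    """Extract the fields that are visible in the name; missing fields become None."""
    arch = re.search(r"_sm(\d+)_", name)
    tile = re.search(r"_s(\d+)x(\d+)x(\d+)_", name)
    dtype = re.search(r"(bf16|fp8)", name)
    schedule = re.search(r"(cooperative|pingpong)", name)
    return {
        "arch": f"sm{arch.group(1)}" if arch else None,
        "tile": tuple(int(d) for d in tile.groups()) if tile else None,
        "dtype": dtype.group(1) if dtype else None,
        "uses_tma": "_tma_" in name,
        "schedule": schedule.group(1) if schedule else None,
    }


parsed = [parse_kernel_filename(name) for name in EXAMPLES]
```

Each combination of these fields is a separate generated file, which is one way a single logical operation grows to hundreds of files.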

Why a Portable DSL Can’t Fix This

The obvious response is: “What if we had one DSL that abstracts over these differences?”

This is the promise of Triton, and to some extent the MLIR ecosystem. Write your kernel once in a hardware-agnostic DSL, and the compiler lowers it to TPU, GPU, or Trainium.

**The problem is that a portable DSL only works if the compiler is sufficiently complex to bridge the gap between abstract operations and hardware-specific instructions.** And that compiler would need to do something no compiler has ever done.

What the Portable Compiler Would Need to Do

Take tile_matmul(a, b) in some abstract DSL. The compiler must decide:

For TPU:

Allocate VMEM buffers for both operands
Issue DMA copy-start/copy-done pairs for HBM→VMEM transfers
Tile to (512, 1024, 1024) for the 256×256 MXU systolic arrays
Pack MXU matmul + VPU elementwise + XLU cross-lane shuffle into VLIW bundles
Double-buffer DMA and compute across loop iterations

For Blackwell:

Allocate TMEM columns with atomic UTCATOMSWS.FIND_AND_SET
Elect a warp leader with elect.sync before every MMA
Issue tcgen05 MMA instructions through the leader thread only
Manage 20 mbarrier objects with parity-based reuse for async pipelining
Back off with 100ns NANOSLEEP on TMEM allocation contention
Await results with 10ms barrier wait timeouts

For Trainium:

Stage data through SBUF (on-chip SRAM) with explicit partition-dimension tiling
Issue tensor engine instructions across NeuronCore partitions
Accumulate in PSUM memory space
Manage NeuronCore-level parallelism

These aren’t different instruction selections for the same algorithm. They’re different algorithms. The double-buffered DMA approach on TPU, the leader-elected async MMA on Blackwell, and the partition-tiled tensor engine on Trainium are structurally different programs that solve the same mathematical problem through entirely different mechanisms.
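The double-buffered DMA pattern from the TPU list can be sketched as an event order. This is a plain-Python simulation under simple assumptions (two buffer slots, one copy in flight at a time); `double_buffered_schedule` and the event strings are hypothetical and do not correspond to real Pallas calls. It shows only the interleaving: the copy for tile i+1 starts before the wait and compute for tile i.

```python
from typing import List


def double_buffered_schedule(num_tiles: int) -> List[str]:
    """Return the event order for a two-slot pipeline over num_tiles tiles."""
    events: List[str] = []
    if num_tiles <= 0:
        return events
    # Prologue: nothing to overlap with yet, so just start the first copy.
    events.append("start copy tile 0 -> slot 0")
    for i in range(num_tiles):
        slot = i % 2
        if i + 1 < num_tiles:
            # Prefetch the next tile into the other slot, whose previous
            # contents were consumed in the prior iteration.
            events.append(f"start copy tile {i + 1} -> slot {1 - slot}")
        events.append(f"wait copy tile {i} in slot {slot}")
        events.append(f"compute tile {i} from slot {slot}")
    return events


schedule = double_buffered_schedule(3)
```

The start/wait split mirrors the copy-start/copy-done pairing named in the list. Nothing in this schedule resembles leader election or partition-dimension tiling, which is why the three target lists do not reduce to one program with different instruction names.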

The Union Problem

A portable DSL that exposes enough hardware detail to write high-performance kernels would need to contain the union of all hardware concepts:

TPU: VMEM, HBM, DMA engines, MXU, VPU, XLU, sublanes/lanes, VLIW bundles
Blackwell: SMEM, TMEM, L2 cache, tensor cores, tcgen05, mbarriers, TMA, warp specialization
Trainium: SBUF, PSUM, HBM, tensor engine, NeuronCores, partition dimensions

That’s not a DSL — it’s three DSLs wearing a trench coat. Every kernel written in it would need if target == TPU: … elif target == NVIDIA: … elif target == TRAINIUM: … escape hatches that defeat the purpose of a shared abstraction.

Alternatively, a portable DSL that hides these differences produces generic code that doesn’t exploit any of them — and you’re back at the compiler layer, hoping the compiler is smart enough to recover the performance you left on the table.
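A quick way to see the union problem is to treat the three concept lists above as sets. The sketch below copies those lists exactly as written and compares them by name only, so it says nothing about concepts a list happens to omit. `CONCEPTS` is just a dictionary made up for this illustration.

```python
from typing import Dict, Set

# The three lists from this section, copied as written.
CONCEPTS: Dict[str, Set[str]] = {
    "TPU": {"VMEM", "HBM", "DMA engines", "MXU", "VPU", "XLU",
            "sublanes/lanes", "VLIW bundles"},
    "Blackwell": {"SMEM", "TMEM", "L2 cache", "tensor cores", "tcgen05",
                  "mbarriers", "TMA", "warp specialization"},
    "Trainium": {"SBUF", "PSUM", "HBM", "tensor engine", "NeuronCores",
                 "partition dimensions"},
}

# What a DSL exposing everything would have to contain.
union = set().union(*CONCEPTS.values())

# What a DSL exposing only common ground could contain.
shared_by_all = set.intersection(*CONCEPTS.values())

# The only name that appears in more than one list.
tpu_and_trainium = CONCEPTS["TPU"] & CONCEPTS["Trainium"]
```

By name, the union has 21 entries and the three-way intersection is empty. The single pairwise overlap is HBM, the one level of the hierarchy a kernel author is trying to stay out of.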

The DSL Proliferation Is the Proof

If portability at the kernel level were achievable, the market would have converged on one DSL by now. Instead, every hardware vendor has independently built its own.

Counting CUTLASS and the CuTe DSL as one entry, NVIDIA alone has four DSLs for writing kernels: CUDA C, CUTLASS/CuTe DSL, Triton, and CuTile. They span from thread-level C to tile-level Python. That is four abstraction levels for one vendor’s hardware, because even within a single architecture family, one DSL isn’t enough.

AMD’s trajectory is particularly telling. They adopted Triton, the closest thing the industry has to a “portable” kernel DSL, and forked it for ROCm. But adopting Triton wasn’t enough. AMD still built FlyDSL: a Python DSL with explicit register-level layouts, lane-level operations, and CuTe-style tiling algebra, lowered through their own MLIR Fly dialect to ROCDL machine code. Even when you adopt a “portable” DSL, you still end up building a hardware-native one to get the performance the portable one can’t deliver.

This isn’t a failure of engineering. It’s evidence that hardware-specific kernel programming is irreducibly hardware-specific.

For Highest MFU: Write the ISA in Python

Here’s the principle that unifies all the successful DSLs: the highest-performing kernel DSLs are thin Python wrappers around the hardware’s instruction set.

Pallas doesn’t abstract away the MXU — it gives you dot() that maps directly to MXU matmul instructions, load()/store() that map to DMA operations, and Ref types that correspond to VMEM/HBM/SMEM memory spaces.

CuTile doesn’t abstract away tcgen05 — it gives you tile_matmul() that maps to tcgen05 MMA, tile_load() that maps to TMA descriptors, and automatic TMEM allocation that generates UTCATOMSWS.FIND_AND_SET instructions.

NKI doesn’t abstract away the tensor engine — it gives you nisa.nc_matmul() that maps to tensor engine instructions and explicit SBUF/PSUM memory staging. When you write NKI, you’re writing the Trainium ISA in Python.

The value of these DSLs is not abstraction — it’s ergonomics. They let you write hardware-specific code in Python instead of C or assembly. The key is that they don’t hide the hardware. They expose it through a familiar syntax.

This is the right trade-off. C-level ISA access (CUDA C++, raw PTX, assembly) gives you maximum control but poor iteration speed and readability. An abstract portable DSL gives you ergonomics but takes away the hardware control that determines MFU. A Python DSL over the hardware ISA gives you both: hardware control with Python iteration speed.

The flashinfer PR is 4 million lines of generated CUDA C++. MaxText’s MegaBlox kernel is 282 lines of Pallas Python. Both achieve high MFU on their respective hardware. The difference in developer experience is staggering — but the reason Pallas works isn’t that it’s more abstract than CUDA. It’s that it maps Python directly to TPU hardware concepts, so the 282 lines express the same level of hardware-specific intent as the generated CUDA, just more concisely.

Recommendations for Hardware Vendors

If you’re building AI accelerator hardware and want developers to achieve peak performance on your silicon, here’s what the evidence from every successful AI hardware platform tells us:

1. Build Your Own Python DSL

Don’t wait for an open-source DSL to support your hardware. Google built Pallas. NVIDIA built CuTile. AWS built NKI. In every case, the vendor-built DSL outperforms third-party alternatives because the DSL designers have deep knowledge of the hardware’s performance model.

Your DSL should let developers write kernels that map to your hardware’s execution model in Python. Not C. Not a custom language. Python — because the ML ecosystem lives in Python, and the fastest path from researcher insight to high-performance kernel is a Python file that directly expresses your hardware’s concepts.

Build the DSL for the silicon. Don’t build the silicon for the DSL.

2. Expose the ISA, Not Just High-Level Abstractions

Your DSL should operate at two levels:

Low level: virtual ISA access. Let developers express the hardware’s instruction set: your matmul unit’s tile operations, your memory hierarchy’s explicit staging, your synchronization primitives. This is where peak MFU comes from. Developers who understand your hardware should be able to write near-ISA-level code in Python and know exactly what instructions will be generated.

High level: compiler-lowered operations. Not all your users will want to write ISA — and that’s fine. For developers who don’t need peak performance or are prototyping, provide higher-level operations (tile_matmul, tile_load) that the compiler lowers to the ISA. This is where Pallas’s dot() and CuTile’s tile_matmul() live — high-level enough to be ergonomic, low-level enough that an expert can predict the generated code.

Critically, the two levels must be mixable in the same kernel. Pallas GPU gets this right: you can write a kernel that uses high-level jnp operations for the parts where the compiler is good enough, and drop into explicit tcgen05 instructions for the parts where you need direct hardware control, all in the same function and the same Python file. The developer chooses the abstraction level per operation, not per kernel.

This is how you scale adoption. New users start with the high-level ops and get reasonable performance. Expert users drop into ISA-level control for the hot path. A DSL that only provides the high level interface leaves performance on the table for expert users. A DSL that only provides ISA access has too steep a learning curve for adoption. The DSLs that win will let developers slide between both levels seamlessly.
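Here is a toy model of what mixing the two levels in one kernel could look like. `KernelBuilder`, its methods, and the instruction mnemonics are all hypothetical; this is not the Pallas or CuTile API, only a recorder that makes the expansion visible. The high-level ops expand to a fixed sequence, and the low-level `isa` call emits exactly one instruction.

```python
from typing import List


class KernelBuilder:
    """Toy recorder for a hypothetical two-level kernel DSL."""

    def __init__(self) -> None:
        self.instructions: List[str] = []

    # Low level: one call emits exactly one (made-up) instruction.
    def isa(self, instruction: str) -> None:
        self.instructions.append(instruction)

    # High level: each op expands into a fixed, predictable sequence.
    def tile_load(self, name: str) -> None:
        self.isa(f"copy.async {name}")
        self.isa(f"wait {name}")

    def tile_matmul(self, a: str, b: str, acc: str) -> None:
        self.tile_load(a)
        self.tile_load(b)
        self.isa(f"mma {acc}, {a}, {b}")


def mixed_kernel() -> List[str]:
    kb = KernelBuilder()
    # High level: accept the default expansion for the first tile.
    kb.tile_matmul("a0", "b0", "acc")
    # Low level: hand-schedule the second tile so both copies are
    # issued before either wait.
    kb.isa("copy.async a1")
    kb.isa("copy.async b1")
    kb.isa("wait a1")
    kb.isa("wait b1")
    kb.isa("mma acc, a1, b1")
    return kb.instructions


program = mixed_kernel()
```

The first five instructions come from one high-level call and serialize the two loads. The last five are written by hand and overlap them. Both styles sit in the same function, which is the property the recommendation asks for.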

3. Do Not Constrain Your Hardware to an Open-Source DSL

This is the most counterintuitive recommendation, but the evidence is clear.

Triton is a good DSL for NVIDIA GPUs that are not Blackwell or Rubin. It is not a good target for designing your hardware around. If you build your chip to be “Triton-compatible,” you’ve constrained your hardware architecture to the abstractions Triton exposes — shared memory, thread blocks, warp-level operations. Those abstractions reflect NVIDIA’s hardware design choices. They may not reflect yours.

NVIDIA itself doesn’t constrain Blackwell to Triton. CuTile exposes tcgen05, TMEM, and mbarrier, hardware features that don’t exist in Triton’s programming model. Triton will eventually support them, but CuTile had day-one access. If NVIDIA doesn’t limit itself to the open-source DSL it helped create, you shouldn’t limit yourself to it either.

The risk of designing hardware for an existing DSL: **your chip becomes a less-efficient implementation of someone else’s hardware model.** The opportunity of designing a DSL for your hardware: your chip can exploit architectural innovations that no existing DSL anticipated.

Build the DSL for the silicon. Don’t build the silicon for the DSL.

4. Control Your Own IR

When you own your DSL, you own the intermediate representation, and the IR is where hardware-specific semantics live. When you adopt someone else’s DSL, you inherit their IR’s assumptions about how hardware works.

If your chip ships a new memory space or a novel synchronization primitive, and you own the IR, you add an op and it’s available to users on day one. The compiler understands it, can optimize around it, can fuse across it, and can schedule it relative to other operations. Your new hardware feature is a first-class citizen in the optimization pipeline.

If you’re lowering from someone else’s IR — say Triton’s TTIR — your options are worse. You either wait for upstream to add support, hack around it with extension intrinsics, or fork the IR and maintain it yourself.

Triton’s tl.extra.nvidia and tl.extra.cuda extensions illustrate the problem. They give you access to hardware-specific instructions from within Triton, but those calls are escape hatches, not integrated operations. The Triton compiler can’t reason about them, can’t fuse across them, and can’t schedule them relative to surrounding Triton ops. You get the syntax of a high-level DSL with the semantics of inline assembly.

Compare that to CuTile, where tcgen05 is a first-class IR concept. The compiler understands it, generates TMEM allocation and mbarrier synchronization automatically, and schedules it as part of the overall kernel optimization. The difference isn’t access to the instruction — it’s whether the compiler understands the instruction.

That’s the cost of building on someone else’s IR: your hardware-specific features will always be second-class citizens, bolted on through extensions rather than integrated into the optimization pipeline. This is why NVIDIA built CuTile instead of extending Triton for Blackwell — and it’s why you should own yours too.
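The difference between a first-class op and an escape hatch can be shown with a toy fusion pass. This is a deliberately crude model, not how Triton or CuTile work internally: `Op`, `fuse_regions`, and the op names are hypothetical, and the pass simply assumes any run of consecutive ops it understands can be fused into one region.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Op:
    name: str
    known: bool = True  # False models an opaque extension call


def fuse_regions(ops: List[Op]) -> List[List[str]]:
    """Group consecutive known ops into fused regions.

    An opaque op ends the current region and stands alone, because the
    pass cannot prove it is safe to move or merge anything across it.
    """
    regions: List[List[str]] = []
    current: List[str] = []
    for op in ops:
        if op.known:
            current.append(op.name)
        else:
            if current:
                regions.append(current)
                current = []
            regions.append([op.name])
    if current:
        regions.append(current)
    return regions


# Same four ops; the only change is whether the IR understands new_sync.
first_class = [Op("load"), Op("new_sync"), Op("add"), Op("store")]
escape_hatch = [Op("load"), Op("new_sync", known=False), Op("add"), Op("store")]

fused_first_class = fuse_regions(first_class)
fused_escape_hatch = fuse_regions(escape_hatch)
```

When the IR knows the op, the whole sequence is one region. When the same op arrives as an opaque call, it splits the kernel into three, even though the user wrote the same four lines.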

5. Build Your Compiler in MLIR — Open Source the Frontend, Keep the Backend

MLIR is often cited as the solution to the portability problem. It’s not. It’s a framework for building compilers, and the compilers people build with it are hardware-specific. But that’s exactly why you should use it.

MLIR gives you common infrastructure: parsing, verification, transformation utilities, and a pass pipeline framework that every compiler needs. Instead of building all of that from scratch, you define your own MLIR dialect with ops that match your hardware’s semantics, and plug into the existing ecosystem. Your DSL lowers to your dialect. Your dialect lowers to your backend. MLIR handles the plumbing.

The smart split is: open source your frontend compiler, keep your backend codegen closed. The frontend — the DSL to MLIR lowering, frontend optimization passes, the dialect definition — should be open. This is what developers interact with, debug against, and build tooling around. Transparency here builds trust and adoption.

The backend (the final lowering from your MLIR dialect to machine code, the instruction selection, the register allocation, the hardware-specific scheduling) can stay closed. This is where your proprietary IP lives, and keeping it closed doesn’t hurt developer experience because users don’t need to read your codegen to write high-performance kernels.

We’re seeing this pattern emerge across the industry. The Mosaic compiler in the JAX repository is open source — developers can read the TPU and GPU Pallas frontend passes, understand how their kernels are lowered, and debug compilation issues. The backend TPU codegen (libtpu) is closed. AWS’s NKI compiler is on the path to being open sourced at the frontend level, with the Trainium backend remaining proprietary.

This is the right model. MLIR makes it cheap to build a non-portable compiler. Open-sourcing the frontend makes it usable. Keeping the backend closed protects your IP. Everyone wins.

The Implication for Frontier Labs

If you’re building a frontier training or inference stack, the evidence from every successful large-scale deployment says the same thing:

Pick your hardware. Go deep. Don’t plan to port.

The “we’ll stay hardware-agnostic and port later” strategy sounds prudent but costs you twice: you lose MFU today because you avoid hardware-specific optimizations, and you spend engineering time later porting code that can’t actually be ported. It has to be rewritten.

MaxText achieves high performance on TPU v6e for MoE pretraining because it uses TPU-native Pallas kernels, TPU-native XLA fusion, and TPU-native SPMD partitioning. The flashinfer MoE kernels will achieve peak MFU on SM100 because they’re 4 million lines of SM100-specific CUDA. Neither can run on the other’s hardware. Neither should.

The math is portable. Write your model in JAX or PyTorch — the jnp.dot and torch.matmul calls work everywhere. But the moment you need peak performance — the attention kernel, the MoE routing, the collective overlap, the memory scheduling — you’re writing for one hardware target. Accept it early, and you’ll ship faster than the team that’s still trying to write kernels that run everywhere and run fast nowhere.

Inference: Where This Matters Even More

The portability argument applies to training, but it’s even more acute for inference. In training, you amortize kernel performance over weeks or months of runs. In inference, kernel performance is your cost per token, and cost per token is the metric that determines whether your product is economically viable at scale. Every percentage point of MFU you leave on the table by using a “portable” kernel instead of a hardware-native one translates directly into higher serving costs, multiplied by every token you generate, forever.

The Cost of Kernel Development Is Collapsing

One historical argument for portability was that writing hardware-specific kernels was too expensive. If it takes a team of experts six months to write a custom attention kernel, you can’t afford to do it for every hardware target. Better to write it once and port.

**That calculus is changing.** Agentic AI coding tools are **dramatically reducing the cost** of kernel development. An engineer with a Python DSL and an AI coding assistant can iterate on kernel implementations in hours instead of weeks. The DSLs themselves (Pallas, CuTile, NKI) already reduced the barrier from “CUDA C++ expert” to “Python developer who understands the hardware.” AI tooling reduces it further: generate a kernel skeleton, profile it, have the agent iterate on tiling and memory staging, and converge on high-MFU code in a fraction of the time.

When kernel development cost was high, portability was an economic argument: write once, amortize across targets. When kernel development cost is low, that argument collapses. It becomes cheaper to write two hardware-native kernels that each achieve 90%+ MFU than to write one “portable” kernel that achieves 30% MFU everywhere and needs constant tuning to close the gap.
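The compute side of that trade is simple arithmetic. MFU is achieved useful FLOP/s divided by peak FLOP/s, so the device time for a fixed workload scales as 1/MFU. In the sketch below, `device_hours` is a hypothetical helper, and the workload and peak rate are placeholder numbers chosen only to make the ratio visible; they are not measurements of any real system.

```python
def device_hours(useful_flops: float, peak_flops_per_second: float, mfu: float) -> float:
    """Device-hours needed to execute useful_flops at a given MFU."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    seconds = useful_flops / (peak_flops_per_second * mfu)
    return seconds / 3600.0


# Placeholder workload and peak rate (assumptions, not real figures).
WORK_FLOPS = 1e24
PEAK_FLOPS_PER_SECOND = 1e15

portable_hours = device_hours(WORK_FLOPS, PEAK_FLOPS_PER_SECOND, 0.30)
native_hours = device_hours(WORK_FLOPS, PEAK_FLOPS_PER_SECOND, 0.90)

# The placeholders cancel: the ratio depends only on the two MFU values.
ratio = portable_hours / native_hours
```

At 30% versus 90% MFU the portable kernel needs about three times the device-hours for the same work, whatever the workload size. A second native kernel pays for itself once its development cost is below the two extra device-hour bills it avoids.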

Cost Per Token Is All That Matters

At frontier scale, the economics are driven by one metric: cost per token. Frontier labs and organizations will use whatever silicon provides the best cost per token and is usable. That second part is the bottleneck. The silicon that wins isn’t necessarily the silicon with the best theoretical FLOP/s. It’s the silicon where an engineer can write a high-MFU kernel in a week instead of a quarter. A solid Python DSL that lets you write the ISA directly is a massive step in the usability direction, and usability is what determines whether your hardware gets adopted or sits in a rack collecting dust.

Conclusion

**The frameworks are portable.** jnp.dot runs on TPU, GPU, and Trainium. That’s a genuine achievement, and it matters: your model architecture, your training recipe, and your evaluation harness all transfer across hardware.

The kernels are not. 282 lines of Pallas and 4 million lines of CUDA solve the same math with zero shared code, and that gap isn’t closing. It’s widening with every generation of silicon. Blackwell introduced tcgen05 and TMEM. TPU v6e has a different MXU layout than v5e. Trainium’s tensor engine has its own tiling constraints. Each generation adds hardware-specific concepts that reward hardware-specific code.

The industry should stop treating this as a problem to solve and start treating it as a reality to build around. Hardware vendors: build your own Python DSL, control your own IR, open source your frontend compiler. Frontier labs: pick your hardware and go deep — the cost of writing native kernels is collapsing, and the cost of not writing them is millions in wasted compute per year.

The race isn’t to build the most portable stack. It’s to build the deepest one.


