Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines

arXiv cs.LG 06/17/26, 04:00 AM Papers
jet-tagging transformer reconfigurable-computing low-latency quantization open-source particle-physics
Summary
This paper presents a quantized, integer-only transformer implementation for jet tagging on AMD Versal AI Engines, including a reusable open-source framework that maps transformer layers to AIE tiles for low-latency trigger systems at CERN LHC.
arXiv:2606.17500v1 Announce Type: new
# Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines
Source: [https://arxiv.org/html/2606.17500](https://arxiv.org/html/2606.17500)
###### Abstract

Transformer-based models achieve strong performance for jet tagging at the CERN LHC, but deploying them in low-latency, resource-constrained trigger systems is challenging. We present an initial implementation of a quantized, integer-only transformer for jet tagging on the AMD Versal AI Engine (AIE), mapping dense and multi-head attention (MHA) layers to AIE tiles. The main contribution is a reusable software framework that represents transformer layers as composable AIE building blocks and automatically generates the corresponding Vitis graph code from a high-level Python model description. This framework provides a foundation for future research and is released as open-source software at https://github.com/KastnerRG/particle_transformer_aie.

## I. Problem and Motivation

At the CERN Large Hadron Collider \(LHC\), proton–proton collisions occur at a rate of 40 MHz, producing an enormous stream of particle jets that must be filtered in real time by the Level\-1 Trigger \(L1T\) system\[[3](https://arxiv.org/html/2606.17500#bib.bib1),[4](https://arxiv.org/html/2606.17500#bib.bib3)\]\. Jet tagging is the task of classifying jets according to their originating physics process\. While accurate jet tagging is essential for selecting interesting events, latency and throughput must be kept within tight budgets to enable real\-time processing\[[4](https://arxiv.org/html/2606.17500#bib.bib3)\]\. Recent work has shown that transformer\-based architectures, such as the Particle Transformer \(ParT\), can achieve high accuracy by operating directly on sets of per\-particle features\[[8](https://arxiv.org/html/2606.17500#bib.bib6)\]\. However, such transformer models were deployed on GPUs in an offline setting, where millisecond\-level latencies and high power consumption are acceptable\.

![Refer to caption](https://arxiv.org/html/2606.17500v1/images/data_pipeline.png)

Figure 1: Trigger and inference latency scales at the LHC, from 40 MHz Level-1 decisions to higher-level triggers and offline analyses.

For online triggering, the constraints are far more stringent: the L1 trigger must produce accept/reject decisions within a few microseconds while limiting the output rate to at most $\mathcal{O}(10^{5})$ events per second and operating under tight on-detector power and resource budgets [[4](https://arxiv.org/html/2606.17500#bib.bib3), [2](https://arxiv.org/html/2606.17500#bib.bib2)]. This makes deploying transformer models on traditional CPU/GPU platforms impractical for real-time edge inference. The AI Engine (AIE) array on AMD Versal SoCs offers a promising platform for low-latency, high-throughput ML inference.

In this work, we design and implement a transformer\-based jet tagging accelerator on the AMD Versal VCK190 platform\. Our main contributions are as follows:

- We introduce a reusable software framework that represents transformer layers as composable AIE building blocks and automatically generates the corresponding Vitis graph code from a high-level Python model description.
- We implement a fully quantized, integer-only transformer, including dense layers, residual connections, and an integer-only softmax, tailored to the arithmetic and memory constraints of the AIE array.
- We propose and evaluate a per-head multi-head attention (MHA) mapping that assigns heads and their sub-projections to parallel AIE tiles, achieving improved throughput.

![Refer to caption](https://arxiv.org/html/2606.17500v1/images/system_diagram.png)

Figure 2: Code-generation and verification flow from `model.forward()` using the AIEModel framework.

## II. Prior and Related Work

Transformer\-based architectures such as the Particle Transformer \(ParT\) have set the state of the art for jet tagging performance in offline analyses\[[8](https://arxiv.org/html/2606.17500#bib.bib6)\]\. Recent work has begun to explore deploying transformers in real\-time trigger environments, including sub\-microsecond FPGA implementations for jet tagging using aggressive quantization and specialized dataflows\[[7](https://arxiv.org/html/2606.17500#bib.bib7)\]\. In parallel, several frameworks target transformers or AI workloads on Versal\-class architectures and AI Engines, such as the CAT customized transformer accelerator framework\[[9](https://arxiv.org/html/2606.17500#bib.bib8)\], and Vitis ONNX Execution Provider backends that map high\-level models to the AIE array of Ryzen NPUs\[[1](https://arxiv.org/html/2606.17500#bib.bib11)\]\. In contrast, CAT jointly exploits both PL and AIE resources, and the ONNX backends focus on Ryzen devices rather than Versal SoCs, whereas our work maps high\-level models using an integer\-only code\-generation framework targeting only the AIE on the VCK190\.

## III. System Overview

Our work focuses on building a lightweight code-generation framework that maps model definitions written in Python onto the AIE array of the Versal SoC, and on supporting within this framework the quantized computational layers required by the jet tagging transformer.

### III-A. Target Platform

Our implementation targets the AMD Versal VCK190 SoC, with computation mapped exclusively onto the on-chip AI Engine (AIE) array. The AIE tiles provide a spatially distributed, VLIW-style SIMD compute network with local memories and high-bandwidth streaming interconnects, which we exploit to implement the transformer layers. The design is developed using AMD/Xilinx Vitis 2024.1.

All neural network computations are performed using int8 quantized, integer\-only arithmetic, with int32 accumulators and fixed\-point rescaling\. At the low level, we implement AIE kernels in C/C\+\+ and connect them using the Vitis dataflow graph programming model, where kernels are nodes and streaming channels between AI Engine tiles are edges\. We primarily validate the design using AIE hardware emulation, which allows cycle\-accurate evaluation of the generated graph\.

### III-B. Code-Generation Framework

Porting nontrivial models onto the AI Engine is labor\-intensive\. Developers must hand\-write both the C/C\+\+ kernel code and the Vitis graph description that instantiates kernels, wires streams, and places computation across tiles\. As the dataflow grows to include multiple attention heads, stacked attention blocks, and residual connections, manually managing ports, FIFOs, and tile workloads becomes error\-prone, and debugging mismatches between intended and actual dataflow is difficult\. These challenges motivate a code\-generation framework that separates high\-level model specification from low\-level AIE graph construction\.

Although a transformer block and the full model appear complex, they are composed of a small set of modular, highly repetitive building blocks: dense projections, multi\-head attention, residual additions, and simple nonlinearities\. This regular structure is well suited to a templated code\-generation approach, where each logical layer encapsulates both its numerical behavior and its hardware mapping\.

Our framework exposes a user-facing Python API centered on an `AIEModel` class, in which a model is defined as a sequence of `AIELayer` instances (e.g., `Dense`, `ResAdd`, `MHA`), in a style similar to common deep learning libraries. At build time, each `AIELayer` consumes a set of parameters (tensor shapes, quantization scales, and tiling choices) and emits the corresponding fragments of the Vitis graph: kernel instances, input/output ports, and stream connections. These fragments are then composed into a complete AIE graph, automatically bound to a library of pre-written C/C++ kernels that implement quantized dense layers, residual additions, and MHA. To ensure numerical correctness and simplify debugging, the same high-level Python model definition is also used to construct a NumPy-based reference implementation of the transformer. For any given input, the framework runs both the NumPy "golden" model and the AIE implementation and compares their numerical outputs.
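
The layer-sequence idea can be illustrated with a minimal sketch. The class names `AIEModel`, `AIELayer`, `Dense`, and `ResAdd` come from the text above, but every method, field, and the textual graph format emitted here are hypothetical stand-ins, not the framework's actual interface or real Vitis graph code.

```python
from dataclasses import dataclass, field


@dataclass
class AIELayer:
    """Hypothetical base class: each layer emits its own graph fragment."""
    name: str

    def emit(self, src):
        """Return (list of graph lines, name of this layer's output stream)."""
        raise NotImplementedError


@dataclass
class Dense(AIELayer):
    in_features: int = 0
    out_features: int = 0

    def emit(self, src):
        out = f"{self.name}_out"
        return [
            f"kernel {self.name} = dense<{self.in_features},{self.out_features}>",
            f"connect {src} -> {self.name}.in",
            f"connect {self.name}.out -> {out}",
        ], out


@dataclass
class ResAdd(AIELayer):
    skip: str = ""  # stream that bypasses the intermediate layers

    def emit(self, src):
        out = f"{self.name}_out"
        return [
            f"kernel {self.name} = res_add",
            f"connect {src} -> {self.name}.in0",
            f"connect {self.skip} -> {self.name}.in1",
            f"connect {self.name}.out -> {out}",
        ], out


@dataclass
class AIEModel:
    layers: list = field(default_factory=list)

    def generate_graph(self, input_stream="graph_in"):
        """Chain the layer fragments into one textual graph description."""
        lines, stream = [], input_stream
        for layer in self.layers:
            fragment, stream = layer.emit(stream)
            lines.extend(fragment)
        lines.append(f"connect {stream} -> graph_out")
        return "\n".join(lines)


# Illustrative model: two dense layers with a residual path around the second.
model = AIEModel([
    Dense("fc1", 8, 64),
    Dense("fc2", 64, 64),
    ResAdd("res1", skip="fc1_out"),
])
graph_text = model.generate_graph()
print(graph_text)
```

Keeping the emission logic inside each layer means a new layer type only has to describe its own kernels and ports; the model class never needs to know what a layer contains.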

The core user-facing entry point is the `model.forward(input)` call. When invoked, this routine (i) constructs the corresponding Vitis AIE graph for the given high-level model, (ii) triggers compilation and AIE hardware emulation to obtain the fixed-point accelerator output, and (iii) in parallel builds an equivalent NumPy reference model and runs a forward pass on the same input. After emulation completes, the framework compares the AIE output against the NumPy reference to validate numerical correctness of the AIE design. Figure [2](https://arxiv.org/html/2606.17500#S1.F2) sketches this flow, from high-level Python model description through code generation, compilation, AIE emulation, and NumPy-based numerical validation.
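
The final comparison step might look like the following sketch. The function name `compare_outputs` and the report fields are hypothetical, and the text does not say whether the framework requires exact equality or allows a tolerance; exact integer equality is assumed here.

```python
def compare_outputs(aie_out, ref_out):
    """Element-wise comparison of two 2-D integer outputs (lists of rows)."""
    if len(aie_out) != len(ref_out) or any(
        len(a) != len(r) for a, r in zip(aie_out, ref_out)
    ):
        raise ValueError("output shapes differ")
    # Collect (row, column, aie value, reference value) for every mismatch.
    mismatches = [
        (i, j, a, r)
        for i, (row_a, row_r) in enumerate(zip(aie_out, ref_out))
        for j, (a, r) in enumerate(zip(row_a, row_r))
        if a != r
    ]
    return {
        "match": not mismatches,
        "num_mismatches": len(mismatches),
        "max_abs_diff": max((abs(a - r) for _, _, a, r in mismatches), default=0),
        "first_mismatch": mismatches[0] if mismatches else None,
    }


# Toy outputs that differ in a single element.
report = compare_outputs([[1, 2], [3, 4]], [[1, 2], [3, 5]])
print(report)
```

Reporting the position of the first mismatch, and not only a pass/fail flag, helps trace an error back to the kernel that produced that output element.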

The framework is deliberately simple and modular, so that individual layers can be replaced with more optimized kernels, alternative tiling strategies, or richer quantization schemes without changing user\-facing model code\. Distinct, intermediate model definitions can easily be configured for testing and benchmarking\.

## IV. Architecture and AIE Mapping

### IV-A. Jet Tagging Transformer Model

The features and dimensions of the jet tagging transformer motivated the development of our framework. The input of the model is an int8 tensor of shape $(160, 8)$, which represents a quantized particle jet after padding for tiling compatibility. Fig. [3](https://arxiv.org/html/2606.17500#S4.F3) shows the design of the network: a compact transformer encoder with two stacked 4-head self-attention (MHA) layers using 57 AIE tiles. Each transformer block consists of an MHA layer followed by a position-wise feed-forward subnetwork with a 64-dimensional hidden layer, implemented using dense layers with ReLU activations. Across the entire model, there are seven dense layers and four residual-add connections, corresponding to the attention and feed-forward sublayers and the final classification head, respectively. We do not apply any normalization layers (e.g., layer normalization or batch normalization). Instead, stability is maintained via the quantization scheme described in Section [IV-C](https://arxiv.org/html/2606.17500#S4.SS3). Normalization scales can be folded into upstream transforms prior to quantization [[5](https://arxiv.org/html/2606.17500#bib.bib12)].

![Refer to caption](https://arxiv.org/html/2606.17500v1/images/PartT.png)

Figure 3: High-level jet tagging transformer model.

### IV-B. AIE Design and Mapping

We focus on demonstrating a functional mapping of the transformer to the VCK190 AIE array and on exercising the full end-to-end flow of our code-generation framework. All compute-intensive operations (dense layers, multi-head attention (MHA), and residual additions) are implemented as a set of kernels and mapped to separate AIE tiles. We introduce an AIE mapping of MHA layers via head-level parallelism. For each attention block, the per-head query, key, and value projections are assigned to separate tiles, so that different head sub-projections can be processed in parallel. After each head computes its contribution, a small set of concatenation kernels collect the per-head outputs and form the full `d_model`-dimensional representation, which is then passed through an output projection kernel. The mapping of MHA kernels is illustrated in Figure [4](https://arxiv.org/html/2606.17500#S4.F4). This highly parallel design exploits the large tile count of the VCK190 AIE and results in a FIFO latency speedup proportional to the number of heads defined in the MHA layer, as shown in Table [I](https://arxiv.org/html/2606.17500#S5.T1).
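
A rough sketch of such a head-level assignment is given below. The function `plan_mha_tiles`, the kernel names, the single concatenation kernel, and the value `d_model = 64` are all assumptions made for illustration; the per-head attention-score and softmax kernels are omitted, and the real placement is decided by the generated Vitis graph.

```python
def plan_mha_tiles(num_heads, d_model, first_tile=0):
    """Assign each head's Q/K/V projection to its own tile (illustrative only)."""
    if d_model % num_heads:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    plan, tile = [], first_tile
    for head in range(num_heads):
        # Each head owns a contiguous slice of the d_model output columns.
        cols = (head * d_head, (head + 1) * d_head)
        for role in ("q_proj", "k_proj", "v_proj"):
            plan.append({"tile": tile, "kernel": f"head{head}_{role}", "cols": cols})
            tile += 1
    # Per-head outputs are gathered and then projected back to d_model.
    for kernel in ("concat", "out_proj"):
        plan.append({"tile": tile, "kernel": kernel, "cols": (0, d_model)})
        tile += 1
    return plan


plan = plan_mha_tiles(num_heads=4, d_model=64)
for entry in plan:
    print(entry)
```

Because no projection tile depends on another head's data, all twelve projection kernels in this toy plan could run concurrently, which is the property the head-parallel mapping relies on.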

Dense layers are implemented as single AIE kernels that perform int8 matrix–matrix multiplications with int32 accumulation and bias addition\. After accumulation, requantized values are streamed to downstream kernels\. Residual connections are realized as separate AIE kernels that consume two int8 input streams and perform elementwise additions\.
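
A plain-Python reference for these two kernels might look as follows. It only illustrates the arithmetic, not the vectorized AIE kernels: the function names are hypothetical, and saturating the residual sum to the int8 limits is an assumption, since the text does not state how overflow is handled.

```python
INT8_MIN, INT8_MAX = -128, 127


def saturate_int8(value):
    """Clamp an integer to the int8 range (assumed overflow behaviour)."""
    return max(INT8_MIN, min(INT8_MAX, value))


def dense_int32(x, w, bias):
    """int8 matrix product with int32-style accumulation plus bias.

    x is rows x K, w is K x N, bias has N entries; the result is rows x N.
    Python integers do not overflow, so the accumulator width is implicit.
    """
    k, n = len(w), len(w[0])
    return [
        [sum(row[i] * w[i][j] for i in range(k)) + bias[j] for j in range(n)]
        for row in x
    ]


def res_add_int8(a, b):
    """Element-wise sum of two int8 tensors, saturated back to int8."""
    return [
        [saturate_int8(p + q) for p, q in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


acc = dense_int32([[1, 2], [3, 4]], [[1, 0], [0, 1]], [10, -10])
summed = res_add_int8([[100, -100]], [[100, -100]])
print(acc, summed)
```

The accumulators returned by `dense_int32` are still wide integers; they would need a requantization step before being streamed to the next int8 kernel.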

![Refer to caption](https://arxiv.org/html/2606.17500v1/images/mha_design.png)

Figure 4: Code-generated Vitis AIE graph for an MHA layer. Purple boxes indicate individual compute tiles, and arrows indicate single-stream int8 data connections.

### IV-C. Quantization and Integer-Only Computations

Deploying attention models efficiently on our AIE requires integer operations\. This makes both quantization of linear layers and the realization of non\-linear operations such as softmax difficult to implement efficiently, since we must control quantization error while conforming to the hardware’s integer\-friendly execution model\.

All neural network computations on the AIE use symmetric int8 quantization with int32 accumulation. For each weight tensor, we define a per-tensor scale based on the maximum absolute value observed over calibration and map real values to the integer range $[-127, 127]$. To match the hardware datapath, the effective real-valued requantization factor is implemented as a dyadic rational $M/2^{b}$, where $M$ and $b$ are integers precomputed offline.
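
The offline part of this scheme can be sketched as below. The helper names and the 15-bit multiplier width are assumptions, and the way the real-valued factor is formed (commonly input scale times weight scale divided by output scale) is not specified in the text.

```python
import math


def symmetric_scale(values, qmax=127):
    """Per-tensor scale so that the largest magnitude maps to +/- qmax."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, qmax=127):
    """Round to the nearest integer and clamp to the symmetric range."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def to_dyadic(factor, mult_bits=15):
    """Approximate a positive real factor as M / 2**b with M < 2**mult_bits."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    mantissa, exponent = math.frexp(factor)  # factor == mantissa * 2**exponent
    m = round(mantissa * (1 << mult_bits))
    b = mult_bits - exponent
    if m == (1 << mult_bits):  # mantissa rounded up to 1.0: renormalize
        m >>= 1
        b -= 1
    if b < 0:
        raise ValueError("factor too large for a right shift")
    return m, b


weights = [1.27, -0.64, 0.1]
scale = symmetric_scale(weights)
q_weights = quantize(weights, scale)
m, b = to_dyadic(0.0123)  # 0.0123 is an arbitrary example factor
print(q_weights, m, b, m / 2 ** b)
```

Normalizing the mantissa into `[0.5, 1)` before rounding keeps the multiplier at full precision for the chosen bit width, whatever the magnitude of the factor.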

At runtime, the forward pass applies an integer matrix multiplication accumulated in int32, followed by a requantization step that multiplies by $M$ and right-shifts by $b$ using banker's rounding (`conv_even`). This design keeps all intermediate values in integer formats while closely approximating standard floating-point inference, thereby improving efficiency on the AIE.
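
A minimal integer-only sketch of this requantization step follows. It implements round-half-to-even on a right shift in software; the function names and the final clamp to $[-127, 127]$ are assumptions, and the code has not been checked for bit-exact agreement with the AIE rounding mode.

```python
def rshift_round_even(value, shift):
    """Return value / 2**shift rounded to nearest, ties to even, integers only."""
    if shift == 0:
        return value
    quotient = value >> shift                 # arithmetic shift floors toward -inf
    remainder = value - (quotient << shift)   # always in [0, 2**shift)
    half = 1 << (shift - 1)
    # Round up when above the midpoint, or exactly at it with an odd quotient.
    if remainder > half or (remainder == half and quotient & 1):
        quotient += 1
    return quotient


def requantize(acc, m, b, qmax=127):
    """Scale an int32 accumulator by M / 2**b and clamp to the int8 range."""
    return max(-qmax, min(qmax, rshift_round_even(acc * m, b)))


halves = [rshift_round_even(v, 1) for v in (5, 7, -5, -7, 6)]
print(halves)                          # 2.5, 3.5, -2.5, -3.5, 3.0 after rounding
print(requantize(1000, 25795, 21))     # 1000 * 0.0123 is about 12.3
```

Ties-to-even avoids the systematic upward bias of round-half-up, which matters when many requantized values are accumulated in later layers.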

Softmax is particularly challenging in this setting because exponential operations are typically implemented in floating point\. We adopt an integer\-only variant of the Int\-Softmax algorithm introduced in I\-BERT\[[6](https://arxiv.org/html/2606.17500#bib.bib10)\], which reformulates the exponential into a sequence of fixed\-point operations that efficiently map to our integer hardware\. For numerical stability, we first compute the maximum and subtract it from each element\. We then decompose each shifted input as

$$x_i = -k_i \ln 2 + r_i,$$

where $k_i$ is an integer quotient and $r_i \in [-\ln 2, 0]$. This allows the exponential to be approximated as

$$e^{x_i} \approx 2^{-k_i} \cdot p(r_i),$$

where $2^{-k_i}$ is implemented as a bit-shift and $p(r_i)$ is a second-order polynomial. All quantities ($k_i$, $r_i$, polynomial coefficients, and intermediate products) are represented in fixed-point integer formats with shared scaling factors. The resulting approximation introduces a maximum error of $1.3 \times 10^{-3}$, which is small relative to the overall quantization error of the network.
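
The decomposition above can be sketched in integer arithmetic as follows. The polynomial coefficients are the ones reported in I-BERT; the function names, the input scale of 1/32, and the 8-bit output format are assumptions for illustration. This is not the AIE kernel, and it is not expected to reproduce the error figure quoted above.

```python
import math

# I-BERT's second-order fit on (-ln 2, 0]: exp(r) ~= A * (r + B)**2 + C
A, B, C = 0.3585, 1.353, 0.344


def int_exp(q, scale):
    """Integer exponential for q <= 0, where the real input is q * scale.

    The result q_out satisfies exp(q * scale) ~= q_out * (A * scale**2).
    The three constants below depend only on `scale` and could be
    precomputed offline; only integer operations touch the data.
    """
    q_ln2 = math.floor(math.log(2) / scale)    # ln 2 expressed in units of scale
    if q_ln2 < 1:
        raise ValueError("scale must be smaller than ln 2")
    q_b = math.floor(B / scale)
    q_c = math.floor(C / (A * scale * scale))
    k = (-q) // q_ln2                           # integer quotient k_i
    r = q + k * q_ln2                           # remainder in (-q_ln2, 0]
    poly = (r + q_b) ** 2 + q_c                 # p(r_i) as an integer
    return poly >> k                            # multiply by 2**-k_i via a shift


def int_softmax(q_logits, scale, out_bits=8):
    """Softmax over quantized logits; outputs are in units of 2**-out_bits."""
    q_max = max(q_logits)                       # subtract the max for stability
    exps = [int_exp(q - q_max, scale) for q in q_logits]
    total = sum(exps)
    return [(e << out_bits) // total for e in exps]


scale = 1 / 32
logits = [1.0, 2.0, 3.0, 0.5]
q_logits = [round(x / scale) for x in logits]
probs = [o / 256 for o in int_softmax(q_logits, scale)]

exact = [math.exp(x) for x in logits]
exact = [e / sum(exact) for e in exact]
print(probs)
print(exact)
```

The common factor `A * scale**2` cancels between numerator and denominator, so the normalization needs no floating-point arithmetic at all.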

Our quantization and softmax design demonstrates that a fully integer\-oriented attention stack is feasible on the AIE\. However, our integer softmax introduces an order\-of\-magnitude increase in latency\. Optimizing these implementations for higher throughput is a primary direction for future work\.

## V. Evaluation and Results

All AIE experiments in this work are conducted with randomly initialized weights and randomly generated inputs, and are intended as implementation validation studies. Quantization is exercised in two modes across the test suite: a dynamic mode, used primarily for models with randomly initialized weights and inputs, where quantization factors are obtained by calibration from the reference forward pass; and a static mode, where fixed factors are chosen by the user ahead of time to provide a numerically informative operating range for validation. Under both modes, verification is based on agreement between the AIE implementation and the corresponding integer reference computations, as described in Sec. [IV-C](https://arxiv.org/html/2606.17500#S4.SS3) and illustrated by the code-generation and validation flow in Fig. [2](https://arxiv.org/html/2606.17500#S1.F2).

We first evaluate a simplified "skeleton" version of the jet tagging transformer. This model retains the full MHA and feed-forward structure but omits softmax and bias addition, allowing us to isolate the impact of dense projections and MHA on latency and scaling behavior. In this setting, we compare single-head and four-head configurations under dynamic quantization. As shown in Table [I](https://arxiv.org/html/2606.17500#S5.T1), head-parallel execution on the AIE yields roughly a $4\times$ reduction in end-to-end graph latency (from $7.35 \times 10^{5}\,\text{ns}$ for 1 head to $1.87 \times 10^{5}\,\text{ns}$ for 4 heads) and a $4\times$ increase in throughput.
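
As a quick check on the roughly four-fold figure, the ratio of the two quoted latencies is

$$\frac{7.35 \times 10^{5}\,\text{ns}}{1.87 \times 10^{5}\,\text{ns}} \approx 3.93,$$

which is about 98% of the ideal four-fold speedup for four parallel heads.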

TABLE I: Latency and throughput for the jet tagging skeleton model (no bias, no softmax).

To understand the performance impact of our integer-only softmax implementation, we conduct a set of benchmarks on a small base model. The base model consists of a single dense layer, which is computationally equivalent to a $160 \times 64 \times 64$ matrix multiplication. We then incrementally augment this model with bias addition and integer-only softmax to measure their effects on runtime latency and throughput. As shown in Table [II](https://arxiv.org/html/2606.17500#S5.T2), adding bias alone increases latency by nearly an order of magnitude relative to the dense-only baseline, while introducing softmax causes a slowdown of two orders of magnitude and a corresponding drop in throughput. These results highlight that the integer-only softmax and associated data movement currently form a dominant bottleneck in our design and motivate future work on more efficient kernel implementations and dataflows for these operations within our framework.

TABLE II: Latency and throughput for a base test model consisting of a single dense layer.

## VI. Conclusion and Future Work

In this work, we presented a preliminary implementation of a fully quantized, integer\-oriented transformer for jet tagging on the AMD Versal VCK190 AI Engine\. This includes a novel, head\-parallel mapping of MHA layers onto the AIE grid\. Our primary contribution is a lightweight, reusable code\-generation framework that maps high\-level Python model descriptions onto the AIE as composable building blocks, automatically producing Vitis AIE graphs\.

Key directions for future work include: \(i\) developing more optimized kernels and graph dataflows to reduce end\-to\-end latency and mitigate the current softmax and bias bottlenecks; \(ii\) extending the framework to support additional transformer components, such as integer\-only normalization and pooling layers; and \(iii\) improving the quantization scheme beyond symmetric per\-tensor scaling and rigorously evaluating the system with trained jet tagging models and realistic calibration data\. The framework’s modular structure is intended to make each of these enhancements incremental and composable, enabling sustained co\-design of models, quantization, and AIE mappings\.

## References

- [1] AMD and Microsoft (2026). Vitis AI Execution Provider for ONNX Runtime. https://onnxruntime.ai/docs/execution-providers/Vitis-AI-ExecutionProvider.html. Accessed: 2026-04-14.
- [2] N. Bartosik et al. (2009). The CMS Level-1 trigger at LHC and Super-LHC. JINST 4, P04010. arXiv:0810.4133.
- [3] CERN (2016). LHC collisions every 25 nanoseconds. https://cms.cern/news/lhc-collisions-every-25-nanoseconds. Accessed: 2026-04-14.
- [4] C. Collaboration (2022). Real time analysis with the CMS Level-1 trigger. https://cms.cern/news/real-time-analysis-cms-level-1-trigger. Accessed: 2026-04-14.
- [5] B. Jacob et al. (2018). Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR.
- [6] S. Kim, A. Gholami, Z. Yao, M. W. Mahoney, and K. Keutzer (2021). I-BERT: integer-only BERT quantization. In Proceedings of the 38th International Conference on Machine Learning. arXiv:2101.01321. [Link](https://arxiv.org/abs/2101.01321)
- [7] L. Laatu et al. (2025). Sub-microsecond transformers for jet tagging on FPGAs. arXiv preprint arXiv:2510.24784. [Link](https://arxiv.org/abs/2510.24784)
- [8] H. Qu, C. Li, and S. Qian (2022). Particle Transformer for jet tagging. arXiv preprint arXiv:2202.03772. [Link](https://arxiv.org/abs/2202.03772)
- [9] W. Zhang, Y. Liu, and Z. Bao (2024). Customized transformer accelerator framework on Versal ACAP. arXiv preprint arXiv:2409.09689. [Link](https://arxiv.org/abs/2409.09689)