TRINE: A Token-Aware, Runtime-Adaptive FPGA Inference Engine for Multimodal AI
Summary
TRINE is a single-bitstream FPGA accelerator and compiler for end-to-end multimodal inference, unifying diverse layers and incorporating runtime-adaptive compute modes, token pruning, and dependency-aware offloading, achieving up to 22.57x latency reduction over an RTX 4090 at 20-21W.
# TRINE: A Token-Aware, Runtime-Adaptive FPGA Inference Engine for Multimodal AI

Source: [https://arxiv.org/html/2603.22867](https://arxiv.org/html/2603.22867)

Hyunwoo Oh¹, Hanning Chen¹, Sanggeon Yun¹, Yang Ni², Suyeon Jang¹, Behnam Khaleghi³, Fei Wen⁴, and Mohsen Imani¹

¹University of California, Irvine; ²Purdue University Northwest; ³Qualcomm; ⁴Samsung

hyunwooo, m.imani@uci.edu

###### Abstract

Multimodal stacks that mix ViTs, CNNs, GNNs, and transformer NLP strain embedded platforms because their compute/memory patterns diverge and hard real-time targets leave little slack. TRINE is a single-bitstream FPGA accelerator and compiler that executes end-to-end multimodal inference without reconfiguration. Layers are unified as DDMM/SDDMM/SpMM and mapped to a mode-switchable engine that toggles at runtime among weight/output-stationary systolic, $1{\times}C_S$ SIMD, and a routable adder tree (RADT) on a shared PE array. A width-matched, two-stage top-$k$ unit enables in-stream token pruning, while dependency-aware layer offloading (DALO) overlaps independent kernels across reconfigurable processing units to sustain utilization. Evaluated on Alveo U50 and ZCU104, TRINE reduces latency by up to 22.57× vs. RTX 4090 and 6.86× vs. Jetson Orin Nano at 20–21 W; token pruning alone yields up to 7.8× on ViT-heavy pipelines, and DALO contributes up to 79% throughput improvement. With int8 quantization, accuracy drops remain <2.5% across representative tasks, delivering state-of-the-art latency and energy efficiency for unified vision, language, and graph workloads in one bitstream.

## 1. Introduction

Multimodal artificial intelligence (AI) is reshaping computer vision (CV), language understanding, and relational reasoning by jointly modeling images, text, and graphs.
Systems that combine Vision Transformers (ViTs), Graph Neural Networks (GNNs), Convolutional Neural Networks (CNNs), and transformer-based Natural Language Processing (NLP) have enabled text-conditioned detection and vision–language grounding (Kamath et al., [2021](https://arxiv.org/html/2603.22867#bib.bib16); Cherti et al., [2023](https://arxiv.org/html/2603.22867#bib.bib8)) as well as graph-augmented cognition (Yun et al., [2024](https://arxiv.org/html/2603.22867#bib.bib31)). Yet these gains come with heterogeneous compute and memory behaviors that complicate utilization, workload balance, and real-time inference on embedded platforms. ViTs are a particular pain point: fixed-length token processing and attention-heavy blocks inflate the cost of both feed-forward networks and attention matrices, especially in CV pipelines with many tokens and moderate embedding widths (Kamath et al., [2021](https://arxiv.org/html/2603.22867#bib.bib16); Chang et al., [2023](https://arxiv.org/html/2603.22867#bib.bib4); Chen et al., [2024b](https://arxiv.org/html/2603.22867#bib.bib6)). Token-pruning methods reduce this cost by discarding low-importance tokens at inference time (Tang et al., [2022](https://arxiv.org/html/2603.22867#bib.bib26); Fayyaz et al., [2022](https://arxiv.org/html/2603.22867#bib.bib11); Rao et al., [2021](https://arxiv.org/html/2603.22867#bib.bib23); Xu et al., [2022](https://arxiv.org/html/2603.22867#bib.bib29); Kong et al., [2022](https://arxiv.org/html/2603.22867#bib.bib18)). However, on GPUs the resulting irregular sparsity often degrades utilization, so measured speedups fall short of theoretical potential (Kong et al., [2022](https://arxiv.org/html/2603.22867#bib.bib18); Dong et al., [2023](https://arxiv.org/html/2603.22867#bib.bib9); You et al., [2023](https://arxiv.org/html/2603.22867#bib.bib30)).

Prior hardware efforts have accelerated isolated pieces (e.g., ViT-only or NLP-only pruning) or targeted a subset of kernels, leaving end-to-end multimodal pipelines underserved (Dong et al., [2023](https://arxiv.org/html/2603.22867#bib.bib9); You et al., [2023](https://arxiv.org/html/2603.22867#bib.bib30); Parikh et al., [2024](https://arxiv.org/html/2603.22867#bib.bib21); Lu et al., [2021](https://arxiv.org/html/2603.22867#bib.bib20); Wang et al., [2021](https://arxiv.org/html/2603.22867#bib.bib27)). FPGAs are a natural fit for such heterogeneity, but frequent bitstream swapping introduces prohibitive reconfiguration delays and complicates deployment (AMD, [2024](https://arxiv.org/html/2603.22867#bib.bib2)). While some designs support multiple workloads (Zhang et al., [2024a](https://arxiv.org/html/2603.22867#bib.bib32), [b](https://arxiv.org/html/2603.22867#bib.bib33)), a unified solution that adapts at runtime across the full diversity of multimodal AI remains elusive. Moreover, replicating separate dense and sparse engines inflates area and complicates timing closure, often depressing maximum clock frequency.

What is missing is a single bitstream that (i) executes ViT/CNN/GNN/NLP end-to-end, (ii) embraces dynamic token sparsity, and (iii) sustains high utilization without reconfiguration. This work takes the view that multimodal layers can be expressed as three matrix kernels, namely dense–dense (DDMM), sampled dense–dense (SDDMM), and sparse–dense (SpMM), provided the hardware can change its dataflow at runtime via switchable interconnects and per-PE operation modes. This abstraction lets a single PE array serve all three kernels instead of instantiating duplicated datapaths.
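
To make the three-kernel abstraction concrete, the following is a minimal pure-Python sketch of DDMM, SDDMM, and SpMM on nested lists. The function names and the toy attention example are illustrative assumptions, not taken from the paper, and nothing here models the hardware dataflow.

```python
# Toy definitions of the three kernels on nested lists (no hardware modeling).
# All names and values below are hypothetical and for illustration only.

def ddmm(a, b):
    """Dense x dense: every entry of A @ B is computed."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]

def sddmm(a, b, sample):
    """Sampled dense-dense: compute (A @ B)[i][j] only for (i, j) in `sample`."""
    inner = len(b)
    return {(i, j): sum(a[i][t] * b[t][j] for t in range(inner))
            for (i, j) in sample}

def spmm(sparse, b, n_rows):
    """Sparse x dense: `sparse` maps (row, col) -> value; zeros are never visited."""
    cols = len(b[0])
    out = [[0] * cols for _ in range(n_rows)]
    for (i, t), value in sparse.items():
        for j in range(cols):
            out[i][j] += value * b[t][j]
    return out

q = [[1, 2], [3, 4]]
k_t = [[1, 0], [0, 1]]   # K transposed (identity, to keep the numbers readable)
v = [[5, 6], [7, 8]]

full_scores = ddmm(q, k_t)                     # unpruned attention scores
kept_scores = sddmm(q, k_t, {(0, 1), (1, 0)})  # only the surviving (query, key) pairs
context = spmm(kept_scores, v, n_rows=2)       # sparse scores times dense V
print(full_scores, kept_scores, context)
```

In this toy view, unpruned attention is a DDMM, while pruned attention computes only the sampled score entries (SDDMM) and then multiplies the resulting sparse matrix by a dense one (SpMM).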

We therefore introduce TRINE, an FPGA-based accelerator and compiler that map DDMM/SDDMM/SpMM onto a shared datapath with a mode-switchable engine capable of weight- and output-stationary systolic execution, $1{\times}C_S$ SIMD, and a routable adder tree (RADT) for highly sparse reductions. A width-matched, two-stage top-$k$ unit performs in-stream token pruning so attention scores are filtered as they are produced, avoiding oversized global sorters and off-chip detours. Above the engine, dependency-aware layer offloading (DALO) overlaps independent kernels across multiple reconfigurable processing units (RPUs). RADT and DALO act at complementary levels: RADT handles intra-kernel sparsity and reductions, while DALO exposes inter-kernel concurrency. Together they enable runtime adaptation within a single bitstream without sacrificing efficiency.

- TRINE unifies ViT/CNN/GNN/NLP layers as DDMM/SDDMM/SpMM and executes them end-to-end on one bitstream without reconfiguration.
- A shared PE array switches among systolic (WS/OS), $1{\times}C_S$ SIMD, and RADT via fine-grained routing, matching kernel structure and sparsity without duplicating engines.
- In-stream top-$k$ pruning is coupled with RPU-level scheduling: the array-width-matched sorter prunes tokens in flight, and DALO overlaps independent kernels to raise utilization on multi-RPU fabrics.

We evaluate TRINE on Xilinx Alveo U50 and ZCU104. Across TinyCLIP (Wu et al., [2023](https://arxiv.org/html/2603.22867#bib.bib28)), MDETR (Kamath et al., [2021](https://arxiv.org/html/2603.22867#bib.bib16)), and MissionGNN (Yun et al., [2024](https://arxiv.org/html/2603.22867#bib.bib31)) workloads, TRINE reduces latency by up to 22.57× versus an RTX 4090 and 6.86× versus a Jetson Orin Nano while operating at ∼20–21 W; pruning alone yields up to 7.8× speedup on ViT-heavy cases, and DALO improves throughput by up to 79%.
To our knowledge, this is the first work to provide state-of-the-art latency and energy efficiency across vision, language, and graph workloads within a single bitstream.

## 2. Related Works and Motivations

### 2.1. Multimodal AI Models

Multimodal systems combine ViTs, CNNs, GNNs, and transformer NLP to align and reason over images, text, and graphs. MDETR fuses CNN/ViT with language for text-conditioned detection (Kamath et al., [2021](https://arxiv.org/html/2603.22867#bib.bib16)); CLIP/TinyCLIP learn vision–language embeddings for zero-shot tasks (Radford et al., [2021](https://arxiv.org/html/2603.22867#bib.bib22); Cherti et al., [2023](https://arxiv.org/html/2603.22867#bib.bib8); Wu et al., [2023](https://arxiv.org/html/2603.22867#bib.bib28)); ImageBind extends to additional modalities (Girdhar et al., [2023](https://arxiv.org/html/2603.22867#bib.bib13)). On the graph side, MissionGNN converts visual evidence into a knowledge graph processed by a GNN (Yun et al., [2024](https://arxiv.org/html/2603.22867#bib.bib31)); ML-CGN mixes CNNs and GNNs for scene/relational understanding (Chen et al., [2019](https://arxiv.org/html/2603.22867#bib.bib7)); TaskCLIP targets fine-grained segmentation (Chen et al., [2024a](https://arxiv.org/html/2603.22867#bib.bib5)). Despite this progress, the workload heterogeneity across NLP/GNN/CNN/ViT complicates efficient acceleration.

### 2.2. Token Pruning for ViTs

ViT cost grows with the token count, and attention in particular scales quadratically with it. Dynamic pruning keeps only the most informative tokens using top-$k$ selection (Rao et al., [2021](https://arxiv.org/html/2603.22867#bib.bib23); Xu et al., [2022](https://arxiv.org/html/2603.22867#bib.bib29)), but on CPUs/GPUs the induced irregular sparsity limits realized speedups. SPViT replaces top-$k$ with Gumbel-Softmax for differentiable selection (Kong et al., [2022](https://arxiv.org/html/2603.22867#bib.bib18)), which introduces nonlinear kernels that are also unfriendly to general-purpose hardware.
These trends motivate specialized support for pruning that reduces compute *and* preserves utilization.

### 2.3. Domain-Specific Accelerators

#### 2.3.1. Hardware Pruning

Accelerators for sparse kernels (SpMM/SDDMM) and pruned transformers improve throughput but are typically model- or kernel-specific (Gerogiannis et al., [2023](https://arxiv.org/html/2603.22867#bib.bib12); Lu et al., [2021](https://arxiv.org/html/2603.22867#bib.bib20)). Token-pruning hardware exists, e.g., threshold-based engines for NLP (Wang et al., [2021](https://arxiv.org/html/2603.22867#bib.bib27)), compile-time pruning for ViTs (You et al., [2023](https://arxiv.org/html/2603.22867#bib.bib30)), and run-time Gumbel-Softmax pruning (Dong et al., [2023](https://arxiv.org/html/2603.22867#bib.bib9)), yet each targets a subset of models or fixes sparsity patterns, limiting generality.

#### 2.3.2. Multimodality-Aware Accelerators

Mode-switchable FPGA designs covering multiple modalities have emerged: CNN+GNN (Zhang et al., [2024a](https://arxiv.org/html/2603.22867#bib.bib32)) and extensions that add ViT support (Zhang et al., [2024b](https://arxiv.org/html/2603.22867#bib.bib33)). However, coverage remains incomplete (not all four modalities) and most lack hardware-accelerated *run-time* token pruning, leaving end-to-end multimodal pipelines under-optimized.

### 2.4. Motivations

As summarized in [Table 1](https://arxiv.org/html/2603.22867#S2.T1), prior work either narrows to a few modalities or omits run-time token pruning essential for ViT-heavy pipelines. We aim for a single-bitstream framework that spans ViT/GNN/CNN/NLP and provides scalable, hardware top-$k$ pruning so emerging pruning strategies can be executed in real time within unified multimodal graphs.

Table 1. FPGA accelerators vs. supported workloads, sparse-kernel support, and token-pruning capability.

| Work | AI Workload | Sparse Kernel | Token Pruning |
|---|---|---|---|
| This work | ViT+GNN+CNN+NLP | Yes | Run-time |
| (Zhang et al., [2024b](https://arxiv.org/html/2603.22867#bib.bib33)) | CNN+GNN or ViT | Yes | No |
| (Zhang et al., [2024a](https://arxiv.org/html/2603.22867#bib.bib32)) | CNN+GNN | No | No |
| (Dong et al., [2023](https://arxiv.org/html/2603.22867#bib.bib9)) | ViT | No | Run-time∗ |
| (You et al., [2023](https://arxiv.org/html/2603.22867#bib.bib30)) | ViT | No | Compile-time∗ |
| (Wang et al., [2021](https://arxiv.org/html/2603.22867#bib.bib27)) | NLP | Yes | Compile-time∗ |
| (Lu et al., [2021](https://arxiv.org/html/2603.22867#bib.bib20)) | General MM | Yes | No |
| (Gerogiannis et al., [2023](https://arxiv.org/html/2603.22867#bib.bib12)) | General MM | Yes | No |

∗Compile/run-time denotes when the sparsity pattern is determined.

Figure 1. TRINE overview and mode-switchable engine (MSE). (a) Accelerator with an RPU grid and local inter-RPU buffers to localize traffic and pipeline inter-tile exchange. (b) Each RPU integrates a shared-datapath MSE, a width-matched two-stage top-$k$, compact nonlinear units, and a lightweight feed scheduler. (c) One PE array time-shares four dataflows via small interconnect muxes and per-PE op control: (1) Systolic (WS/OS) for dense DDMM; (2) $1{\times}C_S$ SIMD for moderately sparse SDDMM/SpMM; (3) RADT for highly sparse/irregular reductions; (4) normal SIMD for element-wise ops. Selection policy (runtime): DDMM → WS/OS (WS for small-token/high weight reuse; OS for wide feature maps/many tokens). SDDMM/SpMM → $1{\times}C_S$ when active operands per row/col $\lesssim C_S$ and fairly uniform; switch to RADT as sparsity grows or degree skews. Feed scheduling provides systolic delay insertion and sparsity-aware indexed reads without host-side packing.
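
The selection policy in the Figure 1 caption is stated qualitatively. As a minimal sketch, assuming hypothetical numeric thresholds (`MANY_TOKENS`, `NEAR_FRACTION`, `MAX_SKEW`) and a hypothetical `select_mode` helper that the paper does not define, it could be written as:

```python
from enum import Enum

class Mode(Enum):
    WS = "weight-stationary systolic"
    OS = "output-stationary systolic"
    SIMD_1xCS = "1xC_S SIMD"
    RADT = "routable adder tree"

# Hypothetical thresholds: the paper gives the policy only in qualitative terms.
MANY_TOKENS = 256     # at or above this token count, prefer OS over WS
NEAR_FRACTION = 0.5   # "near C_S" read as mean activity >= 50% of the array width
MAX_SKEW = 2.0        # "fairly uniform" read as max activity <= 2x the mean

def select_mode(kernel, c_s, tokens=0, active_per_row=None):
    """Pick a dataflow mode from kernel type, array width, and observed activity."""
    if kernel == "DDMM":
        # Dense work: few tokens favor weight reuse, many tokens favor OS.
        return Mode.OS if tokens >= MANY_TOKENS else Mode.WS
    if kernel not in ("SDDMM", "SpMM"):
        raise ValueError(f"unknown kernel type: {kernel}")
    counts = list(active_per_row or [])
    if not counts or max(counts) == 0:
        return Mode.RADT
    mean = sum(counts) / len(counts)
    near_width = mean >= NEAR_FRACTION * c_s
    uniform = max(counts) <= MAX_SKEW * mean
    return Mode.SIMD_1xCS if (near_width and uniform) else Mode.RADT

print(select_mode("DDMM", 32, tokens=16))
print(select_mode("SpMM", 32, active_per_row=[30, 28, 32, 31]))
print(select_mode("SpMM", 32, active_per_row=[1, 0, 2, 1]))
```

The point of the sketch is only the shape of the decision: dense kernels never consult sparsity statistics, while sparse kernels fall back to RADT whenever per-row activity is either low or skewed.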

## 3. Hardware Architecture

TRINE unifies ViT, GNN, NLP, and CNN inference on a single bitstream by combining a grid of reconfigurable processing units (RPUs), a mode-switchable compute engine (MSE) that switches dataflows on one PE array, and an in-stream, width-matched top-$k$ unit for run-time token pruning (Fig. [1](https://arxiv.org/html/2603.22867#S2.F1)).

### 3.1. System Overview

As shown in Fig. [1](https://arxiv.org/html/2603.22867#S2.F1)(a), an $r{\times}c$ RPU grid with local inter-RPU buffers minimizes off-chip traffic and enables pipelined inter-tile exchange. A host APU/CPU writes compact control blocks (mode, tiling, routing masks) and uses AXI DMA for bulk I/O. Mode changes drain the in-flight pipeline and update a few registers; since kernels run for thousands of cycles, the overhead is negligible in wall-clock time. This separation of control and dataflow allows per-kernel adaptation at runtime without any bitstream reconfiguration.

### 3.2. Mode-Switchable Engine (MSE)

Within each RPU (Fig. [1](https://arxiv.org/html/2603.22867#S2.F1)(b)), the MSE is a single PE array that time-shares multiple dataflows using small interconnect multiplexers and per-PE operation control. Each PE provides west/north inputs, a partial-sum path, a tiny register file for reuse, and a three-function ALU (MAC/ADD/PASS). One multiplexer steers the $\mathsf{B}$ stream to realize output- or weight-stationary wavefronts; another selects the PE operation (multiply–accumulate, reduction for RADT, or pass-through). Lightweight row/column broadcasts and a short cross-row tap enable SIMD and tree reductions without duplicating engines or introducing long global wires. Consolidating dense and sparse dataflows into one array avoids a second large macro and its routing pressure, which helps timing closure and keeps $F_{\max}$ stable when pruning is enabled.

The four execution modes are summarized in Fig. [1](https://arxiv.org/html/2603.22867#S2.F1)(c):

- **Systolic (WS/OS).** Dense DDMM (conv-as-GEMM, MLP, full attention) with reuse balanced between weights (WS, few tokens/high weight reuse) and outputs (OS, wide feature maps/many tokens).
- **$1{\times}C_S$ SIMD.** A single active row of width $C_S$ (*array width in columns*) issues only scheduled nonzeros; effective for SDDMM/SpMM at moderate sparsity or bounded degrees (GNN).
- **RADT.** The same PEs form a programmable multi-stage reduction tree; active products are injected where they occur and accumulated along routed paths, preferred under high or skewed sparsity.
- **Normal SIMD.** Element-wise work such as bias, activation, and edge functions.

Mode selection uses kernel shape $(N,M,K)$, observed token/edge sparsity $p$, and array width $C_S$: dense layers map to WS/OS; pruned attention and message passing use $1{\times}C_S$ when activity per row is near $C_S$ and fairly uniform, and switch to RADT when activity is much smaller or heavily skewed. In our builds, sharing one array instead of duplicating dense/sparse macros reduced logic/routing by roughly 20–30% while maintaining $F_{\max}$.

### 3.3. Top-$k$ Engine (In-Stream Pruning)

Figure 2. Two-stage, width-matched top-$k$. A fixed-width (up to $C_S$) pipelined bitonic stage matches array width; a lightweight $C_S{\rightarrow}k$ merge completes selection. Values/indices stream from MSE into a center buffer (CB) and sparse queue buffer (SQB), avoiding off-chip detours and scaling better than single large bitonic networks.

Token and score selection happens immediately after compute. The unit is split into a fixed-width (up to $C_S$) pipelined bitonic stage followed by a small $C_S{\rightarrow}k$ merge sorter.
Matching the first stage to the array’s output width (Fig. [1](https://arxiv.org/html/2603.22867#S2.F1)(c)) caps the number of compare-and-swap units; the merge completes selection with little extra logic. Values and indices stream through a center buffer and sparse queue buffer, so pruning neither stalls the pipeline nor detours through DRAM. Compared with a single large bitonic network whose area grows as $O(n\log^{2} n)$, this width-matched split preserves near-streaming throughput at substantially lower area, and the microarchitecture can be swapped (dual-layer or FLiMS) on resource-constrained devices.

Figure 3. Feed scheduler. (a) Pipelined delay insertion aligns rows/columns for WS/OS systolic waves with no host-side data reshaping. (b) Sparsity-aware indexed reads via a small BRAM-backed sparse queue (SQB) and address generator feed only active pairs to $1{\times}C_S$ SIMD and RADT, eliminating bubbles and DRAM detours after top-$k$ pruning.

### 3.4. Lightweight Feed and Nonlinear Units

The feed scheduler inserts per-row delays for correct systolic alignment (Fig. [3](https://arxiv.org/html/2603.22867#S3.F3)(a)) and performs sparsity-aware indexed reads using a small BRAM-backed sparse queue (Fig. [3](https://arxiv.org/html/2603.22867#S3.F3)(b)), so inputs need no host-side repacking while $1{\times}C_S$ and RADT consume only active pairs. Compact nonlinear units (SoftMax, GELU/ELU, normalization) use polynomial or piecewise-linear approximations and reuse a tree-style accumulate pattern; they are sized to sustain the array’s streaming rate with modest resources.
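
The two-stage top-$k$ of Section 3.3 can be illustrated in software. The sketch below is a behavioral model under stated assumptions, not the RTL: `two_stage_topk` is a hypothetical name, Python's built-in sort stands in for the pipelined bitonic stage, and `heapq.merge` stands in for the $C_S{\rightarrow}k$ merge. `bitonic_comparators` uses the standard comparator count of a bitonic sorting network, $\tfrac{n}{4}\log_2 n\,(\log_2 n + 1)$ for $n$ a power of two, and the width $C_S = 32$ is an assumption based on the 32×32 PE array reported later in the evaluation.

```python
import heapq
from itertools import islice

def _key(pair):
    # Sort by descending score, breaking ties by ascending token index.
    score, index = pair
    return (-score, index)

def two_stage_topk(scores, k, c_s):
    """Return the k largest (score, index) pairs, consuming c_s scores at a time."""
    best = []  # running top-k, kept sorted by _key
    for start in range(0, len(scores), c_s):
        chunk = [(s, start + i) for i, s in enumerate(scores[start:start + c_s])]
        # Stage 1: fixed-width sort (stand-in for the width-matched bitonic stage).
        chunk.sort(key=_key)
        # Stage 2: merge the sorted chunk into the running list and keep k entries.
        best = list(islice(heapq.merge(best, chunk, key=_key), k))
    return best

def bitonic_comparators(n):
    """Comparator count of a full bitonic sorting network; n must be a power of two."""
    log_n = n.bit_length() - 1
    if n < 2 or 1 << log_n != n:
        raise ValueError("n must be a power of two >= 2")
    return n * log_n * (log_n + 1) // 4

scores = [0.1, 0.9, 0.3, 0.7, 0.5, 0.8, 0.2, 0.6]
kept = [index for _, index in two_stage_topk(scores, k=3, c_s=4)]
print(kept)
print(bitonic_comparators(256), bitonic_comparators(32))
```

Under these assumptions a 256-input bitonic network needs 4608 compare-and-swap units, while a 32-wide first stage needs 240. The merge stage is not counted here, so the comparison only indicates the direction of the area saving.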

### 3.5. Putting It Together

At runtime, TRINE applies a simple policy: dense DDMM uses WS or OS according to reuse and width (Fig. [1](https://arxiv.org/html/2603.22867#S2.F1)(c)); SDDMM/SpMM use $1{\times}C_S$ when activity per row is near $C_S$ and uniform, and RADT when activity is much smaller or skewed (also Fig. [1](https://arxiv.org/html/2603.22867#S2.F1)(c)); element-wise stages use normal SIMD; top-$k$ runs in stream so pruning immediately reduces downstream work. This shared-datapath approach maintains utilization across dense and sparse regimes and requires no bitstream reconfiguration.

## 4. Software Stack

TRINE provides a vertically integrated stack that takes a structured model description and produces pruning-aware, runtime-adaptive execution on a single-bitstream FPGA. The stack enables multimodal offloading (vision, language, graph), mixes static and dynamic flows (e.g., variable token counts), and chooses efficient MSE modes per kernel. It comprises a model specification interface, a multi-stage compiler that emits compact instruction blocks, and a lightweight runtime that instantiates fuzzy layers, manages pruning, and dispatches work across the RPU grid (Fig. [4](https://arxiv.org/html/2603.22867#S4.F4)).

### 4.1. Compiler Infrastructure

The model is described in a compact configuration that lists layer types (Attention, MatMul, GELU, SDDMM, etc.), tensor shapes, dataflow hints, and whether token pruning is enabled. During parsing, layers are classified as *predictable* (fixed shapes and static dataflow, e.g., MLPs or static convolutions) or *fuzzy* (runtime-dependent, e.g., variable-length NLP sequences or dynamically pruned attention). This split allows the compiler to generate final binaries for predictable layers and templates for fuzzy ones whose missing fields will be filled at runtime.
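
The predictable/fuzzy split can be sketched as follows. This is an illustrative model only: `LayerSpec`, `is_fuzzy`, and `instantiate` are hypothetical names, and representing a runtime-dependent dimension as `None` is an assumption, since the paper does not give the configuration schema.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

@dataclass(frozen=True)
class LayerSpec:
    """Hypothetical layer record; None marks a dimension known only at runtime."""
    name: str
    kind: str
    shape: Tuple[Optional[int], ...]
    token_pruning: bool = False

def is_fuzzy(layer):
    """A layer is fuzzy if pruning is enabled or any dimension is unresolved."""
    return layer.token_pruning or any(dim is None for dim in layer.shape)

def instantiate(layer, runtime_dims):
    """Fill unresolved dimensions, in order, with values observed at runtime."""
    dims = iter(runtime_dims)
    shape = tuple(next(dims) if dim is None else dim for dim in layer.shape)
    return replace(layer, shape=shape)

mlp = LayerSpec("mlp_fc1", "MatMul", (197, 768, 3072))
attn = LayerSpec("attn_scores", "Attention", (None, 768), token_pruning=True)

print(is_fuzzy(mlp), is_fuzzy(attn))
print(instantiate(attn, [138]).shape)   # e.g. 138 tokens kept after pruning
```

Here the MLP would compile to a final binary, while the attention layer stays a template until the kept-token count is known.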

Kernels are unified as DDMM, SDDMM, and SpMM and assigned an MSE mode using a simple scoring policy driven by layer shape and expected sparsity: dense kernels map to WS/OS systolic; pruned attention (SDDMM) and message passing (SpMM) map to $1{\times}C_S$ SIMD when activity per row is near the array width and uniform, and to RADT when activity is much smaller or skewed. To keep compile time small, RADT exploration is restricted to a short list of homogeneous partitions consistent with the array dimensions; the compiler evaluates only a handful of candidates rather than a combinatorial space.

Figure 4. TRINE compile–run flow. The compiler parses a structured model, classifies layers (predictable vs. fuzzy), maps them to DDMM/SDDMM/SpMM, selects MSE modes via a sparsity/shape policy, and emits compact instruction blocks plus a dependency DAG. At runtime, an APU-backed controller fills fuzzy templates (e.g., token counts), configures top-$k$, and schedules ready blocks across RPUs using dependency-aware layer offloading (DALO). Pruning indices flow forward to shrink subsequent kernels.

The compiler then emits compact instruction blocks that encode mode ID, loop bounds/tiling, buffer and operand IDs, pruning options, and dependency tags. A dependency graph (DAG) across blocks enables dependency-aware layer offloading (DALO): independent kernels (e.g., the $Q$, $K$, and $V$ projections) are placed on different RPUs and overlapped to raise utilization. Static placement respects RPU capabilities (presence of top-$k$, nonlinear units) and co-locates producer–consumer pairs when it reduces memory traffic (e.g., top-$k$ adjacent to its consumer SDDMM/SpMM).

### 4.2. Runtime Control and Dynamic Execution

At runtime, a lightweight APU controller streams instruction blocks and fills in fuzzy templates with observed parameters (token counts, pruning ratios, actual sparsity) before dispatch.
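
The dependency-aware offloading (DALO) idea, dispatching ready blocks of the DAG to idle RPUs, can be illustrated with a toy greedy list scheduler. This is a sketch under stated assumptions, not TRINE's runtime: `dalo_schedule`, the block names, and the unit durations are hypothetical, and placement constraints, memory traffic, and backpressure are ignored.

```python
def dalo_schedule(durations, deps, num_rpus):
    """Greedy list scheduling of a block DAG onto num_rpus identical RPUs.

    durations: {block: time}; deps: {block: [blocks that must finish first]}.
    Returns (makespan, {block: (rpu, start_time)}).
    """
    finish, placement = {}, {}
    rpu_free = [0.0] * num_rpus
    pending = list(durations)           # insertion order acts as program order
    while pending:
        ready = [b for b in pending if all(d in finish for d in deps.get(b, ()))]
        if not ready:
            raise ValueError("dependency cycle or unknown prerequisite")
        for block in ready:
            data_ready = max((finish[d] for d in deps.get(block, ())), default=0.0)
            rpu = min(range(num_rpus), key=lambda r: rpu_free[r])  # earliest-free RPU
            start = max(data_ready, rpu_free[rpu])
            finish[block] = start + durations[block]
            rpu_free[rpu] = finish[block]
            placement[block] = (rpu, start)
            pending.remove(block)
    return max(finish.values()), placement

# Toy attention block: three independent projections feed one attention kernel.
durations = {"q_proj": 1, "k_proj": 1, "v_proj": 1, "attention": 2}
deps = {"attention": ["q_proj", "k_proj", "v_proj"]}

single, _ = dalo_schedule(durations, deps, num_rpus=1)
grid, placement = dalo_schedule(durations, deps, num_rpus=4)
print(single, grid, placement)
```

With one RPU the toy graph takes 5 time units; with four RPUs the three projections overlap and it takes 3. The numbers are arbitrary, but they show qualitatively why a single-RPU device cannot benefit from this kind of overlap.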
Pruning thresholds and $k$-values are programmed into the top-$k$ unit; the resulting indices are propagated forward as compact streams to shrink subsequent SDDMM/SpMM and to guide sparse reads. The controller tracks RPU availability, honors the DAG to preserve dependencies, and opportunistically overlaps independent work across the grid; if observed sparsity deviates from compile-time estimates, it can switch the next block’s MSE mode (e.g., $1{\times}C_S$ to RADT) by updating a small control block without any reconfiguration.

- **Template instantiation:** complete fuzzy blocks with runtime shapes (e.g., sequence length, kept tokens).
- **Pruning control:** load top-$k$ parameters; consume index streams to gate downstream compute and memory.
- **DALO scheduling:** dispatch ready blocks to idle RPUs, co-locate hot producer–consumer pairs, and maintain backpressure via shared queues.

This hybrid compile–run design keeps the critical logic in the compiler (kernel unification, mode scoring, DAG construction) while leaving low-cost decisions (template fill-in, $k$ selection, mode flips under sparsity drift) to the runtime. In practice, this preserves high utilization across dense and sparse regimes, allows pruning to immediately reduce downstream work, and maintains adaptation within a single bitstream.

## 5. Evaluation

We evaluate TRINE on multimodal workloads with latency and power as primary metrics. Experiments are run on two FPGA targets, Alveo U50 and ZCU104, to show scalability, with comparisons to an RTX 4090 and Jetson Orin Nano. We report (i) end-to-end latency on multimodal graphs, (ii) ViT token-pruning impact, and (iii) cross-platform results. We implement TRINE using a Chisel-based RTL generator (Bachrach et al., [2012](https://arxiv.org/html/2603.22867#bib.bib3)) and synthesize/place-and-route in Vivado 2022.1.
Unless noted otherwise, the Alveo U50 runs at 300 MHz and the ZCU104 at 200 MHz; all experiments use int8 quantization. The top-$k$ engine supports up to 256 elements (bitonic on U50, dual-layer on ZCU104). [Table 2](https://arxiv.org/html/2603.22867#S5.T2) summarizes the configurations and resource utilization.

Table 2. Hardware configurations and resource utilization.

| | Alveo U50 | ZCU104 |
|---|---|---|
| RPU Grid | 2×2 | 1×1 |
| MSE PE | 32×32 | 32×32 |
| Top-$k$ Eng. | Bitonic, Max-$k$=256 | Dual-Layer, Max-$k$=256 |
| Clock Freq. | 300 MHz | 200 MHz |
| LUT | 693,092 / 871,680 (79.51%) | 163,112 / 230,400 (70.80%) |
| FFs | 379,608 / 1,743,360 (21.77%) | 84,064 / 460,800 (18.24%) |
| DSPs | 4,564 / 5,952 (76.68%) | 1,141 / 1,728 (66.03%) |
| BRAM | 553 / 1,344 (41.15%) | 139 / 312 (44.55%) |
| URAM | 78 / 640 (12.19%) | 20 / 96 (20.83%) |
| Power | 20.99 W | 8.05 W |

Model configurations are listed in [Table 3](https://arxiv.org/html/2603.22867#S5.T3), spanning TinyCLIP (ViT/NLP), MDETR (CNN/NLP), and MissionGNN (ViT/GNN, CNN/GNN). Together, these cover ViT, NLP, GNN, and CNN.

Table 3. AI model configurations used in evaluation.

| Model | Label | Workload 1 | Workload 2 | Param. Count (M) |
|---|---|---|---|---|
| TinyCLIP (ViT + NLP) | A | ViT-8M/16 | TEXT-3M | 11 |
| | B | ViT-39M/16 | TEXT-19M | 58 |
| | C | ViT-40M/32 | TEXT-19M | 59 |
| | D | ViT-61M/32 | TEXT-29M | 90 |
| MDETR (CNN + NLP) | E | ResNet-18 | RoBERTa | 152 |
| | F | ResNet-50 | RoBERTa | 166 |
| | G | ResNet-101 | RoBERTa | 185 |
| MissionGNN (ViT + GNN) | H | ViT-S/16 | GNN-104 | 6 |
| | I | ViT-B/32 | GNN-104 | 22 |
| | J | ViT-H/14 | GNN-104 | 632 |
| MissionGNN (CNN + GNN) | K | ResNet-18 | GNN-104 | 12 |
| | L | ResNet-101 | GNN-104 | 45 |

### 5.1. Multimodal Workload Analyses

Fig. [5](https://arxiv.org/html/2603.22867#S5.F5) shows the impact of dependency-aware layer offloading (DALO). On U50, TinyCLIP A–D complete in 4.2–19.4 ms; DALO improves throughput by up to 79.2%, keeping latency under 25 ms for interactive rates (>40 FPS). For MDETR E–G, the RoBERTa branch dominates; DALO still yields 1.0–1.2× gains by overlapping early text/vision blocks on U50, while single-RPU ZCU104 shows no benefit (sequential execution).
MissionGNN H–I finish in <12 ms; the largest, J, reaches 414 ms without pruning, indicating pruning is essential at this scale. Overall, DALO converts inter-branch independence into wall-clock savings on multi-RPU devices and degrades gracefully to correct sequential execution on small devices.

Figure 5. End-to-end latency with/without DALO; labels refer to [Table 3](https://arxiv.org/html/2603.22867#S5.T3). Token pruning disabled to isolate scheduling effects.

### 5.2. Token Pruning Impact on Latency

Figure 6. Speedup from ViT token pruning (DynamicViT); labels per [Table 3](https://arxiv.org/html/2603.22867#S5.T3).

We apply DynamicViT (Rao et al., [2021](https://arxiv.org/html/2603.22867#bib.bib23)) with pruning rates $p \in \{0.1, 0.2, 0.3\}$ (baseline $p = 0$), with DALO enabled. Fig. [6](https://arxiv.org/html/2603.22867#S5.F6) shows up to 7.3× (U50) and 7.8× (ZCU104) speedups, exceeding GPU-reported gains (∼1.6×). Larger-token ViTs (A, B, H, J) benefit most; MissionGNN (H–J) sees higher speedups than TinyCLIP due to ViT’s share of total compute. On ZCU104, pruning yields higher relative gains because scheduling is not the bottleneck. In all cases, pruning immediately reduces downstream SDDMM/SpMM via the in-stream top-$k$ and raises utilization through mode switching.

Figure 7. Accuracy of FP32 vs. int8 quantization; labels per [Table 3](https://arxiv.org/html/2603.22867#S5.T3).

### 5.3. Quantization Effects on Accuracy

As summarized in Fig. [7](https://arxiv.org/html/2603.22867#S5.F7), we quantize transformers with Bitsandbytes and use PyTorch static quantization for CNN/GNN.
TinyCLIP (A–D) shows 1.0–2.2% top-1 and 1.0–1.8% top-5 drops on ImageNet (Russakovsky et al., [2014](https://arxiv.org/html/2603.22867#bib.bib24)); MDETR (E–G) loses 0.1–0.2% on RefCOCO/+/g (Kazemzadeh et al., [2014](https://arxiv.org/html/2603.22867#bib.bib17)); MissionGNN (J) decreases by 0.5% mAUC on UCF-Crime (Sultani et al., [2018](https://arxiv.org/html/2603.22867#bib.bib25)). Across all cases in Fig. [7](https://arxiv.org/html/2603.22867#S5.F7) (blue: FP32, red: int8), degradation remains under 2.5% while reducing model size by 4× and aligning with TRINE’s int8 datapath.

Table 4. Comparison with other FPGA-based accelerators including CNN, GNN, ViT, and NLP transformers.

| | Unit | AMD DPU (Inc., [2021](https://arxiv.org/html/2603.22867#bib.bib15)) | VisionAGILE TPDS’24 (Zhang et al., [2024b](https://arxiv.org/html/2603.22867#bib.bib33)) | GCV-Turbo FCCM’24 (Zhang et al., [2024a](https://arxiv.org/html/2603.22867#bib.bib32)) | HPTA FPL’23 (Han and Liu, [2023](https://arxiv.org/html/2603.22867#bib.bib14)) | FTRANS ISLPED’20 (Li et al., [2020](https://arxiv.org/html/2603.22867#bib.bib19)) | HeatViT HPCA’23 (Dong et al., [2023](https://arxiv.org/html/2603.22867#bib.bib9)) | SWAT ASP-DAC’24 (Dong et al., [2024](https://arxiv.org/html/2603.22867#bib.bib10)) | TRINE (this work) |
|---|---|---|---|---|---|---|---|---|---|
| ML Workload | - | CNN, Transformer | CNN, ViT, GNN | CNN, GNN | Transformer | Transformer | ViT \* | Swin | CNN, GNN, ViT, NLP |
| Scheduling Mechanism | - | Sequential | Sequential | Sequential | Sequential | Sequential | Sequential | Sequential | Inter-Model Parallel |
| Number Data Type | - | fixed-point8 | - | float16 | int8 | fixed-point16 | fixed-point8 | int8 | int8 |
| FPGA Platform | - | ZCU104 | Alveo U250 | Alveo U250 | ZCU102 | VCU118 | ZCU102 | Alveo U50 | Alveo U50 |
| Memory Bandwidth | GB/s | 19.2 | 77 | 77 | 19.2 | - | 19.2 | 316 | 316 |
| Operating Frequency | MHz | 600/300 \*\* | 600/300 \*\* | 600/300 \*\* | 200 | - | 150 | 200 | 300 |
| Peak Throughput | TOPS | 2.40 | - | 1.08 | - | - | 0.37 | 0.31 | 1.30 |
| DSP Usage | - | - | 9091 | - | 2307 | 6531 | 2307 | 1968 | 4564 |
| Maximum Power | W | 8.85 | - | - | 20 | 25.13 | 9.45 | 14.35 | 20.99 |
| **TinyCLIP (ViT + NLP): ViT-8M/16 + TEXT-3M** | | | | | | | | | |
| Latency | ms | - | 4.26 | - | 6.38 | 9.94 | 7.46 | 22.04 | 1.60 |
| Norm. Latency \*\*\* | ms | - | 7.92 | - | 3.51 | 10.21 | 3.55 | 17.83 | 1.60 |
| Energy | mJ | - | - | - | 127.54 | 249.72 | 70.54 | 316.27 | 33.58 |
| **MDETR (CNN + NLP): ResNet-101 + RoBERTa** | | | | | | | | | |
| Latency | ms | - | 30.54 | - | - | - | - | - | 28.40 |
| Norm. Latency \*\*\* | ms | - | 58.60 | - | - | - | - | - | 28.40 |
| Energy | mJ | - | - | - | - | - | - | - | 596.12 |
| **MissionGNN (CNN + GNN): ResNet-101 + GNN-104** | | | | | | | | | |
| Latency | ms | 21.18 | 9.42 | 8.96 | - | - | - | - | 7.40 |
| Norm. Latency \*\*\* | ms | 9.50 | 17.68 | 16.75 | - | - | - | - | 7.40 |
| Energy | mJ | 187.44 | - | - | - | - | - | - | 155.33 |

\* The original work only supports ViT, but we assume it can support RoBERTa when padding the input sentence tokens with image token size.
\*\* 600 MHz for computation units and 300 MHz for data buffer.
\*\*\* Normalized the latency based on the utilized DSP count.

### 5.4. Cross-Platform Analyses

#### 5.4.1. Comparison with FPGA Accelerators

[Table 4](https://arxiv.org/html/2603.22867#S5.T4) compares TRINE with recent FPGA accelerators across CNN, ViT, GNN, and NLP. Since most baselines are single-modal, we report their strongest per-modal numbers and use FLOPs-scaled estimates when needed (e.g., TinyCLIP on HeatViT (Dong et al., [2023](https://arxiv.org/html/2603.22867#bib.bib9))). We also provide DSP-normalized latency to account for device size; this is wall-clock latency divided by the fraction of DSPs used, so designs that achieve more with fewer DSPs are credited appropriately. Unless stated otherwise, all figures are batch size 1, single-sample medians after warm-up, and energy is computed as $E = \text{Power} \times \text{Latency}$.

For TinyCLIP (configuration A), TRINE achieves 1.6 ms, 2.7× faster than VisionAGILE (Zhang et al., [2024b](https://arxiv.org/html/2603.22867#bib.bib33)) in raw latency and 2.2× per-DSP. Energy per inference is 33.6 mJ, a 2.1× improvement over HeatViT. The gap comes from in-stream top-$k$ (downstream SDDMM/SpMM immediately shrink) and DALO overlapping $Q/K/V$ with the text branch across RPUs, which raises utilization without extra memory traffic.
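
As a quick arithmetic check of the energy definition $E = \text{Power} \times \text{Latency}$, the sketch below recomputes the TinyCLIP (configuration A) figures from the power and latency values in Table 4. The helper name `energy_mj` is hypothetical; results can differ from the table in the last digit because the tabulated latencies are rounded.

```python
def energy_mj(power_w, latency_ms):
    """Energy per inference: watts times milliseconds gives millijoules."""
    return power_w * latency_ms

trine = energy_mj(20.99, 1.60)    # TRINE on Alveo U50
heatvit = energy_mj(9.45, 7.46)   # HeatViT (FLOPs-scaled estimate in the paper)

print(round(trine, 2))            # close to the 33.58 mJ reported in Table 4
print(round(heatvit, 2))          # close to the reported 70.54 mJ
print(round(heatvit / trine, 1))  # energy improvement over HeatViT
print(round(4.26 / 1.60, 1))      # raw latency ratio vs. VisionAGILE
```

The same product reproduces the other energy entries in the table, for example 20.99 W × 28.40 ms ≈ 596.12 mJ for MDETR on TRINE.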
For MDETR, TRINE exceeds VisionAGILE by 1.1× raw and 2.1× per-DSP despite lower external bandwidth. WS/OS mode switching matches layer reuse, and DALO overlaps early CNN and text blocks before fusion. Gains are modest because the unpruned text encoder dominates, but they are consistent across configurations and devices.

For MissionGNN, TRINE is 1.2× faster in raw latency and 1.3× per-DSP than GCV-Turbo (Zhang et al., [2024a](https://arxiv.org/html/2603.22867#bib.bib32)) and AMD DPU (Inc., [2021](https://arxiv.org/html/2603.22867#bib.bib15)), with 1.2× better energy efficiency. Native SDDMM/SpMM support in the MSE (switching to $1{\times}C_S$ SIMD or RADT as sparsity skews) sustains utilization on irregular graphs without reconfiguration and avoids host-side packing.

Overall, TRINE is, to our knowledge, the only accelerator that executes CNN, ViT, GNN, and NLP end-to-end on a single bitstream. Competing works often require FPGA reconfiguration (100–150 ms (AMD, [2024](https://arxiv.org/html/2603.22867#bib.bib2))) or CPU offload; those costs are typically excluded from reported latencies, so full-pipeline advantages are larger in practice.

Table 5. Latency comparison between TRINE and GPUs (ms); ViT uses the $p{=}0.3$ configuration.

| Workload | TinyCLIP | TinyCLIP | MDETR | MDETR | MissionGNN | MissionGNN |
| --- | --- | --- | --- | --- | --- | --- |
| Cfg. | A | B | F | G | J | L |
| Alveo U50∗ | 1.6 | 6.5 | 18.7 | 23.0 | 56.9 | 7.4 |
| RTX 4090∗ | 37.1 | 49.9 | 21.7 | 28.4 | 50.1 | 44.1 |
| Speedup (×) | 22.57 | 7.64 | 1.16 | 1.24 | 0.88 | 5.98 |

| Workload | TinyCLIP | TinyCLIP |
| --- | --- | --- |
| Cfg. | A | B |
| ZCU104∗ | 4.4 | 20.8 |
| Orin Nano∗ | 30 | 40.6 |
| Speedup (×) | 6.86 | 1.96 |

Labels refer to [Table 3](https://arxiv.org/html/2603.22867#S5.T3). ∗ Numbers are latencies (ms).

#### 5.4.2. GPU Comparison

[Table 5](https://arxiv.org/html/2603.22867#S5.T5) compares TRINE with an RTX 4090 and a Jetson Orin Nano at batch size 1 and int8 where applicable; ViT uses $p{=}0.3$, and transfers are overlapped when supported. For TinyCLIP, the Alveo U50 is 22.57× faster than the RTX 4090 on configuration A and 7.64× on B.
On the embedded side, ZCU104 exceeds Orin Nano by 6.86× and 1.96×. GPUs underutilize on irregular sparsity after pruning; the MSE's sparse-friendly modes and on-chip top-$k$ immediately reduce downstream work.

For MDETR, the U50 delivers 1.16–1.24× over the RTX 4090. The text encoder is unpruned and CNNs map well to tensor cores, so the margin is smaller.

For MissionGNN, configuration L achieves 5.98× versus RTX 4090 due to efficient SpMM, while the very large ViT in configuration J yields 0.88× (limited by FPGA resources rather than dataflow). Even here, performance per watt remains favorable: FPGA boards operate at or below 21 W, whereas the RTX 4090 draws roughly 430 W.

## 6. Conclusion

We presented TRINE, a single-bitstream, runtime-adaptive accelerator and compiler that unifies ViT/CNN/GNN/NLP as DDMM/SDDMM/SpMM on a shared mode-switchable PE array; the MSE switches among WS/OS, $1{\times}C_S$ SIMD, and RADT, while a width-matched in-stream top-$k$ and DALO sustain utilization without reconfiguration. Across TinyCLIP, MDETR, and MissionGNN on Alveo U50 and ZCU104, TRINE cuts latency by up to 22.57× and 6.86× (vs. RTX 4090 and Orin Nano, respectively) at ≤21 W; pruning alone yields up to 7.8× on ViT-heavy cases and DALO adds up to 79% throughput, with <2.5% accuracy loss under int8. Future work includes extending run-time sparsity to NLP/CNN, exploring heterogeneous RADT partitions with fast cost models, and finer-grained kernel splitting with telemetry-guided policies to push TRINE toward a broadly deployable, energy-efficient substrate for multimodal AI.

## References

- AMD (2024) AMD. 2024. *Configuration Time*. [https://docs.amd.com/r/en-US/ug909-vivado-partial-reconfiguration/Configuration-Time](https://docs.amd.com/r/en-US/ug909-vivado-partial-reconfiguration/Configuration-Time) Vivado Design Suite User Guide: Dynamic Function eXchange (UG909), Version 2024.2.
- Bachrach et al. (2012) Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avižienis, John Wawrzynek, and Krste Asanović. 2012. Chisel: constructing hardware in a Scala embedded language. In *49th Annual Design Automation Conference (DAC)*. ACM, San Francisco, CA, USA, 1216–1225. [doi:10.1145/2228360.2228584](https://doi.org/10.1145/2228360.2228584)
- Chang et al. (2023) Shuning Chang, Pichao Wang, Ming Lin, Fan Wang, David Junhao Zhang, Rong Jin, and Mike Zheng Shou. 2023. Making Vision Transformers Efficient from A Token Sparsification View. In *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*. Los Alamitos, CA, USA, 6195–6205. [doi:10.1109/CVPR52729.2023.00600](https://doi.org/10.1109/CVPR52729.2023.00600)
- Chen et al. (2024a) Hanning Chen, Wenjun Huang, Yang Ni, Sanggeon Yun, Yezi Liu, Fei Wen, Alvaro Velasquez, Hugo Latapie, and Mohsen Imani. 2024a. TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection. [doi:10.48550/arXiv.2403.08108](https://doi.org/10.48550/arXiv.2403.08108) arXiv:2403.08108 [cs].
- Chen et al. (2024b) Zhiyang Chen, Yousong Zhu, Zhaowen Li, Fan Yang, Chaoyang Zhao, Jinqiao Wang, and Ming Tang. 2024b. The Devil is in Details: Delving Into Lite FFN Design for Vision Transformers. In *IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*. Seoul, Korea, 4130–4134. [doi:10.1109/ICASSP48485.2024.10447756](https://doi.org/10.1109/ICASSP48485.2024.10447756)
- Chen et al. (2019) Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, and Yanwen Guo. 2019. Multi-Label Image Recognition with Graph Convolutional Networks. In *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*.
- Cherti et al. (2023) Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. 2023. Reproducible Scaling Laws for Contrastive Language-Image Learning. In *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*. Vancouver, BC, Canada, 2818–2829. [doi:10.1109/CVPR52729.2023.00276](https://doi.org/10.1109/CVPR52729.2023.00276)
- Dong et al. (2023) Peiyan Dong, Mengshu Sun, Alec Lu, Yanyue Xie, Kenneth Liu, Zhenglun Kong, Xin Meng, Zhengang Li, Xue Lin, Zhenman Fang, and Yanzhi Wang. 2023. HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers. In *2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)*. Montreal, QC, Canada, 442–455. [doi:10.1109/HPCA56546.2023.10071047](https://doi.org/10.1109/HPCA56546.2023.10071047)
- Dong et al. (2024) Qiwei Dong, Xiaoru Xie, and Zhongfeng Wang. 2024. SWAT: An Efficient Swin Transformer Accelerator Based on FPGA. In *29th Asia and South Pacific Design Automation Conference (ASP-DAC)*. 515–520. [doi:10.1109/ASP-DAC58780.2024.10473931](https://doi.org/10.1109/ASP-DAC58780.2024.10473931)
- Fayyaz et al. (2022) Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari, Sunando Sengupta, Hamid Reza Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, and Juergen Gall. 2022. Adaptive Token Sampling for Efficient Vision Transformers. In *European Conference on Computer Vision (ECCV)*. Tel Aviv, Israel, 396–414. [doi:10.1007/978-3-031-20083-0\_24](https://doi.org/10.1007/978-3-031-20083-0_24)
- Gerogiannis et al. (2023) Gerasimos Gerogiannis, Serif Yesil, Damitha Lenadora, Dingyuan Cao, Charith Mendis, and Josep Torrellas. 2023. SPADE: A Flexible and Scalable Accelerator for SpMM and SDDMM. In *50th Annual International Symposium on Computer Architecture (ISCA)*. Orlando, FL, USA, 1–15. [doi:10.1145/3579371.3589054](https://doi.org/10.1145/3579371.3589054)
- Girdhar et al. (2023) Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. ImageBind: One Embedding Space To Bind Them All. In *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*. Vancouver, BC, Canada.
- Han and Liu (2023) Yuntao Han and Qiang Liu. 2023. HPTA: A High Performance Transformer Accelerator Based on FPGA. In *33rd International Conference on Field-Programmable Logic and Applications (FPL)*. 27–33. [doi:10.1109/FPL60245.2023.00012](https://doi.org/10.1109/FPL60245.2023.00012)
- Inc. (2021) Xilinx Inc. 2021. *Zynq DPU Product Guide (PG338)*. [https://docs.amd.com/r/3.3-English/pg338-dpu](https://docs.amd.com/r/3.3-English/pg338-dpu) Document ID: PG338.
- Kamath et al. (2021) Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. 2021. MDETR - Modulated Detection for End-to-End Multi-Modal Understanding. In *IEEE/CVF International Conference on Computer Vision (ICCV)*. Montreal, QC, Canada, 1760–1770. [doi:10.1109/ICCV48922.2021.00180](https://doi.org/10.1109/ICCV48922.2021.00180)
- Kazemzadeh et al. (2014) Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014. ReferItGame: Referring to Objects in Photographs of Natural Scenes. In *Conference on Empirical Methods in Natural Language Processing (EMNLP)*. Doha, Qatar, 787–798.
- Kong et al. (2022) Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Wei Niu, Mengshu Sun, Xuan Shen, Geng Yuan, Bin Ren, Hao Tang, Minghai Qin, and Yanzhi Wang. 2022. SPViT: Enabling Faster Vision Transformers via Latency-Aware Soft Token Pruning. In *17th European Conference on Computer Vision (ECCV)* (Tel Aviv, Israel). 620–640. [doi:10.1007/978-3-031-20083-0\_37](https://doi.org/10.1007/978-3-031-20083-0_37)
- Li et al. (2020) Bingbing Li, Santosh Pandey, Haowen Fang, Yanjun Lyv, Ji Li, Jieyang Chen, Mimi Xie, Lipeng Wan, Hang Liu, and Caiwen Ding. 2020. FTRANS: energy-efficient acceleration of transformers using FPGA. In *ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED)*. Boston, MA, USA, 175–180. [doi:10.1145/3370748.3406567](https://doi.org/10.1145/3370748.3406567)
- Lu et al. (2021) Liqiang Lu, Yicheng Jin, Hangrui Bi, Zizhang Luo, Peng Li, Tao Wang, and Yun Liang. 2021. Sanger: A Co-Design Framework for Enabling Sparse Attention using Reconfigurable Architecture. In *54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-54)*. Virtual, 977–991. [doi:10.1145/3466752.3480125](https://doi.org/10.1145/3466752.3480125)
- Parikh et al. (2024) Dhruv Parikh, Shouyi Li, Bingyi Zhang, Rajgopal Kannan, Carl Busart, and Viktor Prasanna. 2024. Accelerating ViT Inference on FPGA through Static and Dynamic Pruning. In *IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)*. Orlando, FL, USA, 78–89. [doi:10.1109/FCCM60383.2024.00018](https://doi.org/10.1109/FCCM60383.2024.00018)
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In *38th International Conference on Machine Learning*.
- Rao et al. (2021) Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. 2021. DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification. In *35th International Conference on Neural Information Processing Systems (NeurIPS)* (Virtual). 13937–13949.
- Russakovsky et al. (2014) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. 2014. ImageNet Large Scale Visual Recognition Challenge. [doi:10.48550/arXiv.1409.0575](https://doi.org/10.48550/arXiv.1409.0575) arXiv:1409.0575 [cs.CV].
- Sultani et al. (2018) Waqas Sultani, Chen Chen, and Mubarak Shah. 2018. Real-world Anomaly Detection in Surveillance Videos. In *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*. Salt Lake City, UT, USA, 6479–6488.
- Tang et al. (2022) Shitao Tang, Jiahui Zhang, Siyu Zhu, and Ping Tan. 2022. Quadtree Attention for Vision Transformers. In *International Conference on Learning Representations (ICLR)*. Virtual, 1–16.
- Wang et al. (2021) Hanrui Wang, Zhekai Zhang, and Song Han. 2021. SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. In *IEEE International Symposium on High-Performance Computer Architecture (HPCA)*. Seoul, Korea, 97–110. [doi:10.1109/HPCA51647.2021.00018](https://doi.org/10.1109/HPCA51647.2021.00018)
- Wu et al. (2023) Kan Wu, Houwen Peng, Zhenghong Zhou, Bin Xiao, Mengchen Liu, Lu Yuan, Hong Xuan, Michael Valenzuela, Xi Stephen Chen, Xinggang Wang, Hongyang Chao, and Han Hu. 2023. TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance. In *IEEE/CVF International Conference on Computer Vision (ICCV)*. 21913–21923. [doi:10.1109/ICCV51070.2023.02008](https://doi.org/10.1109/ICCV51070.2023.02008)
- Xu et al. (2022) Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun. 2022. Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer. In *36th AAAI Conference on Artificial Intelligence (AAAI)* (Virtual). 2964–2972. [doi:10.1609/AAAI.V36I3.20202](https://doi.org/10.1609/AAAI.V36I3.20202)
- You et al. (2023) Haoran You, Zhanyi Sun, Huihong Shi, Zhongzhi Yu, Yang Zhao, Yongan Zhang, Chaojian Li, Baopu Li, and Yingyan Lin. 2023. ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design. In *IEEE International Symposium on High-Performance Computer Architecture (HPCA)*. Montreal, QC, Canada, 273–286. [doi:10.1109/HPCA56546.2023.10071027](https://doi.org/10.1109/HPCA56546.2023.10071027)
- Yun et al. (2024) Sanggeon Yun, Ryozo Masukawa, Minhyoung Na, and Mohsen Imani. 2024. MissionGNN: Hierarchical Multimodal GNN-based Weakly Supervised Video Anomaly Recognition with Mission-Specific Knowledge Graph Generation. [doi:10.48550/arXiv.2406.18815](https://doi.org/10.48550/arXiv.2406.18815) arXiv:2406.18815 [cs].
- Zhang et al. (2024a) Bingyi Zhang, Rajgopal Kannan, Carl Busart, and Viktor Prasanna. 2024a. GCV-Turbo: End-to-end Acceleration of GNN-based Computer Vision Tasks on FPGA. In *IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)*. Orlando, FL, USA, 66–77. [doi:10.1109/FCCM60383.2024.00017](https://doi.org/10.1109/FCCM60383.2024.00017)
- Zhang et al. (2024b) Bingyi Zhang, Rajgopal Kannan, Carl Busart, and Viktor K. Prasanna. 2024b. VisionAGILE: A Versatile Domain-Specific Accelerator for Computer Vision Tasks. *IEEE Transactions on Parallel and Distributed Systems* 35, 12 (Dec. 2024), 2405–2422. [doi:10.1109/TPDS.2024.3466891](https://doi.org/10.1109/TPDS.2024.3466891)