TRAM: Training Approximate Multiplier Structures for Low-Power AI Accelerators
Summary
This paper introduces TRAM, a method that jointly optimizes approximate multiplier structures and AI model parameters to reduce power consumption in AI accelerators while maintaining accuracy.
# TRAM: Training Approximate Multiplier Structures for Low-Power AI Accelerators

Source: [https://arxiv.org/html/2605.08231](https://arxiv.org/html/2605.08231)

- Hanyu Wang, University of California, Los Angeles, Los Angeles, USA (hanyuwang@g.ucla.edu)
- Yuyang Ye, Chinese University of Hong Kong, Hong Kong, China (yuyangye@cuhk.edu.hk)
- Mingfei Yu, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland (mingfei.yu@epfl.ch)
- Wayne Burleson, University of Massachusetts Amherst, Amherst, USA (burleson@umass.edu)
- Giovanni De Micheli, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland (giovanni.demicheli@epfl.ch)

###### Abstract

Reducing power consumption in AI accelerators is increasingly important. Approximate computing can reduce power consumption while keeping the accuracy loss small. Since multipliers are power-hungry components in AI models, this paper focuses on synthesizing low-power approximate multipliers (AxMs). Unlike prior works that design AxMs separately from AI model training, we present TRAM, which jointly optimizes the AxM structure and AI model parameters to lower power with small accuracy loss. Experiments show that compared to state-of-the-art AxMs, TRAM achieves up to 25.05% AxM power reduction on CNNs with CIFAR-10, and reduces power by up to 27.09% on vision transformers with ImageNet.
Keywords: approximate multiplier, hardware-software co-optimization, low-power, AI accelerator

## 1. Introduction

The wide deployment of AI accelerators raises concerns about power consumption and creates an urgent need for low-power computing solutions (Schwartz et al., [2020](https://arxiv.org/html/2605.08231#bib.bib49)). Approximate computing reduces power consumption by allowing inaccuracies in computations, making it a promising approach to addressing these concerns (Leon et al., [2025](https://arxiv.org/html/2605.08231#bib.bib27)). Since multipliers are among the most power-consuming components in AI accelerators (Armeniakos et al., [2022](https://arxiv.org/html/2605.08231#bib.bib51)), this paper studies the automatic synthesis of low-power approximate multipliers (AxMs).

Many studies have investigated both automatic synthesis and manual design of AxMs (Wu et al., [2024](https://arxiv.org/html/2605.08231#bib.bib35)). For example, Mrazek et al. (Mrazek et al., [2017](https://arxiv.org/html/2605.08231#bib.bib16)) proposed a genetic programming-based method to synthesize AxMs and later extended it to convolutional neural networks (CNNs) (Mrazek et al., [2020](https://arxiv.org/html/2605.08231#bib.bib33)). Xiao et al. (Xiao et al., [2022](https://arxiv.org/html/2605.08231#bib.bib4)) formulated AxM synthesis as an integer programming problem and produced low-cost AxMs. Hu et al. (Hu et al., [2024](https://arxiv.org/html/2605.08231#bib.bib11)) manually designed AxMs for CNNs using partial product speculation. Furthermore, approximate logic synthesis tools, such as those proposed in (Wang et al., [2023](https://arxiv.org/html/2605.08231#bib.bib23); Ma et al., [2021](https://arxiv.org/html/2605.08231#bib.bib30); Meng et al., [2026](https://arxiv.org/html/2605.08231#bib.bib28)), can synthesize AxMs as well.
However, all the above methods overlook the specific context of AI models, which can lead to suboptimal results when deploying AxMs in accelerators. First, many existing methods do not consider the data distribution used by AI models. For instance, Xiao et al. (Xiao et al., [2022](https://arxiv.org/html/2605.08231#bib.bib4)) assumed a uniform input distribution, while real data distributions vary across layers. Ignoring this variability can lead to suboptimal designs. Second, most existing works design or synthesize AxMs using local error metrics such as error rate or error distance. However, a small local error does not always translate into a small final accuracy loss in the AI model.

Figure 1. A 4-bit unsigned array multiplier. The red crosses indicate candidate signals that may be approximated. The structure parameter $\theta_i$ controls the approximation degree of the $i$-th accumulation column. We assume that at most $P=3$ columns can be approximated.

Figure 2. TRAM framework overview.

To address these issues, we present TRAM, a hardware-software co-optimization framework that TRains Approximate Multiplier structures for low-power AI accelerators. TRAM formulates AxM synthesis as a joint optimization problem that updates the AxM structure and AI model parameters together during training. By using real training data, TRAM captures the statistics seen by each multiplier and optimizes both the multiplier structure and model parameters with respect to the final accuracy loss.

Our contributions are summarized as follows:

- We introduce a parameterization of the AxM structure, in which each column in the compressor tree is assigned a continuous structure parameter that controls the approximation degree of the column. These parameters are optimized using gradient descent.
- We devise an analytic power model that estimates multiplier power from the structure parameters and provides useful hardware-aware guidance during training.
- We propose an efficient mapping method that converts the optimized structure parameters into concrete AxM designs.

Experimental results show that, compared to state-of-the-art AxM designs, TRAM reduces AxM power by up to 25.05% on CNNs with CIFAR-10 at the same accuracy level, and by up to 27.09% on vision transformers with ImageNet. Since TRAM allows different structure parameters for different model layers, it naturally supports layer-wise application of different AxMs. Compared to the state-of-the-art layer-wise AxM exploration methods, TRAM reduces AxM energy by 40.86%. Our work is open source and available at [https://github.com/changmg/TRAM](https://github.com/changmg/TRAM).

The rest of this paper is organized as follows. Section [2](https://arxiv.org/html/2605.08231#S2) describes preliminaries. Sections [3](https://arxiv.org/html/2605.08231#S3)–[5](https://arxiv.org/html/2605.08231#S5) detail the TRAM framework. Section [6](https://arxiv.org/html/2605.08231#S6) discusses the experimental results. Section [7](https://arxiv.org/html/2605.08231#S7) concludes this paper.

## 2. Integer Multiplier Preliminaries

This paper focuses on *unsigned integer multipliers* that are widely used in AI accelerators (Simon et al., [2021](https://arxiv.org/html/2605.08231#bib.bib9); Jain et al., [2022](https://arxiv.org/html/2605.08231#bib.bib10); Meng et al., [2025a](https://arxiv.org/html/2605.08231#bib.bib12); Zheng et al., [2022](https://arxiv.org/html/2605.08231#bib.bib3)). Hereafter, we refer to unsigned integer multipliers simply as multipliers.

AxMs are usually obtained by modifying accurate multipliers (AccMuls). A $B$-bit AccMul computes the exact product of two unsigned integer inputs $W$ and $X$, which are represented in binary as $W = w_{B-1} w_{B-2} \ldots w_0$ and $X = x_{B-1} x_{B-2} \ldots x_0$. The multiplier contains $2B$ accumulation columns.
The $c$-th column accumulates the partial products as $S_c = \sum_{i=0}^{c} pp_{i,c-i}$, where $pp_{i,j} = w_i \cdot x_j$ is the partial product of $w_i$ and $x_j$ (taken as 0 when $i$ or $j$ exceeds $B-1$), and $0 \leq c \leq 2B-1$ is the column index. The final product is obtained by summing the weighted accumulation results from all columns as $Y = \sum_{c=0}^{2B-1} S_c \cdot 2^c$.

For example, Fig. [1](https://arxiv.org/html/2605.08231#S1.F1) shows a 4-bit array multiplier with 8 accumulation columns. Each column generates partial products and accumulates them using half adders (HAs) and full adders (FAs). Approximation can be introduced to the partial products or to the sum and carry-out signals of the half adders and full adders in these columns, as shown by the red crosses in Fig. [1](https://arxiv.org/html/2605.08231#S1.F1).

To evaluate the accuracy of a $B$-bit AxM, common error metrics include error rate (ER), normalized mean error distance (NMED), and maximum error distance (MaxED) (Jiang et al., [2020](https://arxiv.org/html/2605.08231#bib.bib46)), defined as

$$\textit{ER} = \sum_{1 \leq i \leq 2^{2B}:\; Y^{(i)} \neq Y_{\textit{acc}}^{(i)}} p_i, \quad \textit{NMED} = \sum_{i=1}^{2^{2B}} \frac{\left| Y^{(i)} - Y_{\textit{acc}}^{(i)} \right| \cdot p_i}{2^{2B}-1}, \quad \textit{MaxED} = \max_{1 \leq i \leq 2^{2B}} \left| Y^{(i)} - Y_{\textit{acc}}^{(i)} \right|,$$

where $Y^{(i)}$ and $Y_{\textit{acc}}^{(i)}$ are the outputs of the AxM and the AccMul under the $i$-th input combination, $p_i$ is the probability of the $i$-th input combination, and $2^{2B}$ is the total number of input combinations.
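The column sums and the three error metrics can be checked by exhaustive enumeration for small bit-widths. The following is a minimal sketch under stated assumptions: a uniform input distribution ($p_i = 2^{-2B}$), and a hypothetical AxM (`truncated_product`) that simply drops the least significant columns. The function names are illustrative and do not come from the paper or its code.

```python
from itertools import product

def column_sums(w, x, B):
    """Return [S_0, ..., S_{2B-1}]: the sum of partial products in each column."""
    wb = [(w >> i) & 1 for i in range(B)]
    xb = [(x >> j) & 1 for j in range(B)]
    sums = [0] * (2 * B)
    for i in range(B):
        for j in range(B):
            sums[i + j] += wb[i] & xb[j]  # pp_{i,j} lands in column i + j
    return sums

def exact_product(w, x, B):
    """Y = sum_c S_c * 2^c, which equals w * x."""
    return sum(s << c for c, s in enumerate(column_sums(w, x, B)))

def truncated_product(w, x, B, dropped):
    """Hypothetical AxM: ignore the `dropped` least significant columns."""
    return sum(s << c for c, s in enumerate(column_sums(w, x, B)) if c >= dropped)

def error_metrics(axm, B):
    """ER, NMED, MaxED over all 2^(2B) inputs, assuming uniform p_i."""
    n = 2 ** (2 * B)
    errs = [abs(axm(w, x) - w * x) for w, x in product(range(2 ** B), repeat=2)]
    er = sum(e != 0 for e in errs) / n
    nmed = sum(errs) / n / (2 ** (2 * B) - 1)
    return er, nmed, max(errs)

# Dropping only column 0 of a 4-bit multiplier errs exactly when w_0 = x_0 = 1.
er, nmed, maxed = error_metrics(lambda w, x: truncated_product(w, x, 4, 1), 4)
```

For a non-uniform distribution, the `1 / n` factor would be replaced by per-input probabilities, which is where layer-specific data statistics would enter.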
## 3. TRAM Overview and Multiplier Structure Parameterization

Figure 3. Dataflow for computing the objective function in Eq. ([2](https://arxiv.org/html/2605.08231#S4.E2)). The upper part computes the power loss $\mathcal{L}_{\textit{power}}$ (Section [4.2](https://arxiv.org/html/2605.08231#S4.SS2)), and the lower part computes the AI model loss $\mathcal{L}_{\textit{AI\_model}}$ (Section [4.3](https://arxiv.org/html/2605.08231#S4.SS3)).

### 3.1. TRAM Framework Overview

TRAM aims to generate low-power AxMs for AI accelerators. The overall flow of TRAM is shown in Fig. [2](https://arxiv.org/html/2605.08231#S1.F2). It starts from a pretrained floating-point AI model, which is then quantized into an integer model. To further reduce power, AxMs replace the AccMuls in the quantized model. To explore the AxM design space, we represent AxM structures using the structure parameters collected in $\Theta$. Changing the AxM structure corresponds to updating $\Theta$. The detailed parameterization of $\Theta$ is presented in Section [3.2](https://arxiv.org/html/2605.08231#S3.SS2). Based on this parameterization, we propose a three-phase method to generate low-power AxMs for high-accuracy AI models:

Phase 1. Design space exploration (details in Section [4](https://arxiv.org/html/2605.08231#S4)). This phase explores the AxM design space defined by $\Theta$ through model retraining and balances power and accuracy.

Phase 2. AxM structure mapping (details in Section [5](https://arxiv.org/html/2605.08231#S5)). This phase maps the optimized continuous structure parameters in $\Theta^*$ from phase 1 to specific AxM structures for each layer of the model.

Phase 3. Accuracy recovery. After mapping the structure parameters to AxM structures in phase 2, we apply these AxMs to the AI model and retrain it to recover the accuracy.
### 3.2. AxM Structure Parameterization

We parameterize the AxM structure using continuous structure parameters, enabling gradient-based AxM structure optimization through model retraining. Let $\Theta = \{\Theta^{(l)}\}$ $(1 \leq l \leq L)$ denote the collection of structure parameters for all $L$ layers in the model. We assume each layer uses one AxM structure for all multiplications in that layer, described by $\Theta^{(l)}$. $\Theta^{(l)}$ has $P$ parameters: $\Theta^{(l)} = [\theta^{(l)}_0, \theta^{(l)}_1, \ldots, \theta^{(l)}_{P-1}]$ (see Fig. [1](https://arxiv.org/html/2605.08231#S1.F1)), where $\theta^{(l)}_c$ ($0 \leq c \leq P-1$) describes the approximation degree of column $c$. Here, $P$ is a user-defined maximum number of columns that can be approximated.

Each $\theta^{(l)}_c \in [0,1]$ is a continuous structure parameter that controls the approximation degree of column $c$ in the AxM of layer $l$. A value of 0 means the column is kept fully accurate, while a value of 1 means that the column is entirely removed. An intermediate value $0 < \theta^{(l)}_c < 1$ represents a partial approximation, where only a subset of the partial products or compressors in column $c$ is removed.

Next, we explain how the structure parameters control the functional behavior of the AxM. For the $l$-th layer, the approximation error of the $c$-th accumulation column is defined as $E_c = \theta^{(l)}_c \cdot S_c = \theta^{(l)}_c \cdot \sum_{i=0}^{c} pp_{i,c-i}$, where $S_c$ is the exact accumulation result of column $c$.
The total approximation error over all $P$ approximated columns is computed by summing the column errors multiplied by their weights $2^c$, i.e., $E_{total} = \sum_{c=0}^{P-1} E_c \cdot 2^c$. The AxM output is then obtained by subtracting this error from the exact product:

$$Y = WX - E_{total} = WX - \sum_{c=0}^{P-1} \theta^{(l)}_c \cdot S_c \cdot 2^c, \tag{1}$$

where $WX$ is the exact product of $W$ and $X$. Eq. ([1](https://arxiv.org/html/2605.08231#S3.E1)) enables smooth adjustment of the approximation degree in each column $c$ by varying the structure parameter $\theta^{(l)}_c$ in the range $[0,1]$. A larger $\theta^{(l)}_c$ leads to a larger approximation error and reduces power consumption, while a smaller $\theta^{(l)}_c$ reduces the error and increases power. This formulation can be extended beyond array multipliers.

## 4. Phase 1: Design Space Exploration through AI Model Retraining

### 4.1. Problem Formulation

The structure parameters $\Theta$ define the AxM design space. Different choices of $\Theta$ lead to AxMs with different power consumption and different AI model accuracy. To balance power consumption and accuracy, we formulate the following optimization problem:

$$\min_{\Theta, \mathbf{W}} \left( \mathcal{L}_{\textit{power}}(\Theta) \cdot \lambda + \mathcal{L}_{\textit{AI\_model}}(\Theta, \mathbf{W}, \mathbf{X}) \right). \tag{2}$$

Eq. ([2](https://arxiv.org/html/2605.08231#S4.E2)) consists of two loss terms: the power loss $\mathcal{L}_{\textit{power}}$ and the AI model loss $\mathcal{L}_{\textit{AI\_model}}$. $\mathcal{L}_{\textit{power}}$ maps the structure parameters $\Theta$ to the total power consumed by all AxMs in the AI accelerator. $\mathcal{L}_{\textit{AI\_model}}$ is the original model loss (e.g., cross-entropy loss for classification) and depends on the structure parameters $\Theta$, the model weights $\mathbf{W}$, and the inputs $\mathbf{X}$. A trade-off parameter $\lambda$ is introduced to balance these two losses. Increasing $\lambda$ gives more weight to $\mathcal{L}_{\textit{power}}$, which lowers power consumption but increases model loss (i.e., lower model accuracy). By tuning $\lambda$, we can explore different power-accuracy trade-offs.

We solve the optimization problem in Eq. ([2](https://arxiv.org/html/2605.08231#S4.E2)) through model retraining. During retraining, the dataflow for computing the objective in Eq. ([2](https://arxiv.org/html/2605.08231#S4.E2)) is shown in Fig. [3](https://arxiv.org/html/2605.08231#S3.F3). The upper part computes $\mathcal{L}_{\textit{power}}$ and the lower part computes $\mathcal{L}_{\textit{AI\_model}}$. Sections [4.2](https://arxiv.org/html/2605.08231#S4.SS2) and [4.3](https://arxiv.org/html/2605.08231#S4.SS3) describe how $\mathcal{L}_{\textit{power}}$ and $\mathcal{L}_{\textit{AI\_model}}$ are computed within this dataflow.
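To illustrate how $\lambda$ steers the trade-off in Eq. (2), the sketch below combines the parameterized multiplier of Eq. (1) with two stand-in loss terms. Everything numeric here is an assumption made for illustration: the column powers are invented, the model loss is replaced by the mean squared product error over all inputs, and a coarse grid search replaces the gradient-based retraining that the paper uses.

```python
from itertools import product

B, P = 4, 2              # toy bit-width and number of approximable columns
COL_POWER = [1.0, 3.0]   # hypothetical power of columns 0 and 1
TOTAL_POWER = 40.0       # hypothetical power of the accurate multiplier

def axm_output(w, x, theta):
    """Eq. (1): exact product minus the theta-weighted sums of the low columns."""
    err = 0.0
    for c, th in enumerate(theta):
        # S_c: sum of partial products w_i * x_{c-i} in column c
        s_c = sum(((w >> i) & 1) * ((x >> (c - i)) & 1) for i in range(c + 1))
        err += th * s_c * 2 ** c
    return w * x - err

def power_term(theta):
    """Stand-in for L_power: normalized power after removing theta-fractions."""
    saved = sum(th * p for th, p in zip(theta, COL_POWER))
    return (TOTAL_POWER - saved) / TOTAL_POWER

def model_term(theta):
    """Stand-in for L_AI_model: mean squared product error over all inputs."""
    pairs = list(product(range(2 ** B), repeat=2))
    return sum((axm_output(w, x, theta) - w * x) ** 2 for w, x in pairs) / len(pairs)

def objective(theta, lam):
    """Eq. (2) evaluated with the two stand-in terms."""
    return lam * power_term(theta) + model_term(theta)

def best_theta(lam, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid search over theta (an assumption; the paper uses gradient descent)."""
    return min(product(grid, repeat=P), key=lambda th: objective(th, lam))
```

With $\lambda = 0$ the search keeps both columns exact, and with a very large $\lambda$ it removes both; intermediate values of $\lambda$ land between these extremes, which is the behavior the formulation relies on.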
### 4.2. Computation of Power Loss

The power loss $\mathcal{L}_{\textit{power}}(\Theta)$ estimates the power consumed by all AxMs in all layers of the AI model as follows:

$$\mathcal{L}_{\textit{power}}(\Theta) = \sum_{l=1}^{L} f_{\textit{power}}(\Theta^{(l)}) \times \frac{\text{\#mults at layer } l}{\text{\#mults in all layers}}, \tag{3}$$

where $\Theta = \{\Theta^{(l)}\}$ $(1 \leq l \leq L)$ is the collection of structure parameters for the $L$ layers. Here, $\Theta^{(l)}$ is the structure parameter vector for layer $l$, and $f_{\textit{power}}(\Theta^{(l)})$ denotes the power of the AxM used in that layer. Eq. ([3](https://arxiv.org/html/2605.08231#S4.E3)) is a weighted sum of the AxM power over all layers. The weight for layer $l$ is the ratio of its number of multiplication operations to the total multiplication count of the whole model, and it approximates the fraction of total inference latency spent on layer $l$. Thus, the weighted sum estimates the time-averaged AxM power during inference.

We propose an analytical method to compute $f_{\textit{power}}(\Theta^{(l)})$. We first estimate the power of an AccMul as follows:

$$\textit{Power}_{\textit{AccMul}} = \sum_{c=0}^{2B-1} \textit{Power}_c = \sum_{c=0}^{2B-1} \sum_{k=1}^{K_c} \textit{cost}_{c,k} \cdot N_{c,k}. \tag{4}$$

Here, $B$ is the multiplier bit-width, which results in $2B$ accumulation columns in the AccMul. $\textit{Power}_c$ is the power of the $c$-th accumulation column, computed by summing the power of all $K_c$ component types in that column.
Example component types include logic AND gates for partial-product generation and various compressors for accumulation. $\textit{cost}_{c,k}$ is the power of a type-$k$ component in column $c$, pre-characterized using the standard cell library. $N_{c,k}$ is the number of type-$k$ components in column $c$ of the AccMul.

For example, consider the array multiplier in Fig. [1](https://arxiv.org/html/2605.08231#S1.F1). Column 2 contains 3 AND gates for partial-product generation, and 1 HA and 1 FA for accumulation. Thus, column 2 has $K_2 = 3$ component types, labelled as type 1 (AND gate), type 2 (HA), and type 3 (FA). The number of each component type is $N_{2,1} = 3$, $N_{2,2} = 1$, and $N_{2,3} = 1$. Assume the power of each type is $\textit{cost}_{2,1} = 1$, $\textit{cost}_{2,2} = 2$, and $\textit{cost}_{2,3} = 3$. Then, the power consumption of column 2 is $\textit{Power}_2 = 1 \times 3 + 2 \times 1 + 3 \times 1 = 8$.

Recall that $\Theta^{(l)}$ contains $P$ structure parameters $\theta^{(l)}_c$ $(0 \leq c \leq P-1)$ for layer $l$. From Eq. ([1](https://arxiv.org/html/2605.08231#S3.E1)), $\theta^{(l)}_c \in [0,1]$ specifies the fraction of logic components removed from the $c$-th column of layer $l$. Based on this, we estimate the normalized AxM power for layer $l$ as follows:

$$f_{\textit{power}}(\Theta^{(l)}) = \frac{\textit{Power}_{\textit{AccMul}} - \sum_{c=0}^{P-1} \theta^{(l)}_c \cdot \textit{Power}_c}{\textit{Power}_{\textit{AccMul}}}, \tag{5}$$

where $P$ is the maximum number of approximated columns, and $\textit{Power}_c$ is the power of the $c$-th accurate accumulation column.
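The power model in Eqs. (3) to (5) amounts to a few weighted sums. The following sketch reproduces the column-2 example and then evaluates the normalized power and the layer-weighted loss. Only the column-2 costs and counts come from the text; the other column powers and the per-layer multiplication counts are hypothetical values chosen for illustration.

```python
def column_power(costs, counts):
    """Inner sum of Eq. (4): Power_c = sum_k cost_{c,k} * N_{c,k}."""
    return sum(cost * n for cost, n in zip(costs, counts))

def f_power(theta, col_powers):
    """Eq. (5): normalized AxM power; theta covers only the first P columns."""
    total = sum(col_powers)
    saved = sum(th * p for th, p in zip(theta, col_powers))
    return (total - saved) / total

def power_loss(thetas, col_powers, mults_per_layer):
    """Eq. (3): f_power per layer, weighted by the layer's share of multiplications."""
    total_mults = sum(mults_per_layer)
    return sum(f_power(th, col_powers) * m / total_mults
               for th, m in zip(thetas, mults_per_layer))

# Column 2 of the 4-bit example: 3 AND gates, 1 HA, 1 FA with costs 1, 2, 3.
power_2 = column_power(costs=[1, 2, 3], counts=[3, 1, 1])

# Hypothetical powers of all 8 columns (only the entry for column 2 is from the text).
col_powers = [1, 4, power_2, 11, 9, 6, 3, 1]

# Two hypothetical layers: one keeps all columns, one removes columns 0 to 2.
loss = power_loss(thetas=[[0, 0, 0], [1, 1, 1]],
                  col_powers=col_powers,
                  mults_per_layer=[1000, 3000])
```

Because `f_power` is linear in each $\theta^{(l)}_c$, its gradient with respect to a structure parameter is the constant $-\textit{Power}_c / \textit{Power}_{\textit{AccMul}}$, so columns with more hardware pull harder toward removal.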
Scaling $\textit{Power}_c$ by $\theta^{(l)}_c$ provides an estimate of the power reduction gained from approximating that column. When $\theta^{(l)}_c$ is close to 0, most components in column $c$ are preserved and the power remains high. When $\theta^{(l)}_c$ is close to 1, most components in column $c$ are removed and the power becomes low.

### 4.3. Computation of the AI Model Loss

The AI model loss function $\mathcal{L}_{\textit{AI\_model}}$, such as cross-entropy loss for classification tasks, measures the difference between the model output and the ground truth. It depends on the structure parameters $\Theta$, the model parameters $\mathbf{W}$, and the input data $\mathbf{X}$. For illustration, we present the computation of $\mathcal{L}_{\textit{AI\_model}}$ in the context of a CNN. As shown in the bottom part of Fig. [3](https://arxiv.org/html/2605.08231#S3.F3), the forward propagation of $\mathcal{L}_{\textit{AI\_model}}$ processes the layers of the model in order. For CNNs, we replace the accurate multiplications in convolutional layers with AxMs, while for transformer-based models, we replace the accurate multiplications in all linear layers of the attention and feed-forward blocks with AxMs. In what follows, we first present how to simulate the AxM behavior. Since the AxM requires quantized inputs, we then present how to simulate the quantization process.

AxM Simulation: As shown in the center of Fig. [3](https://arxiv.org/html/2605.08231#S3.F3), at the $l$-th layer, an AxM takes the integer activation $X_q$, the integer weight $W_q$, and the structure parameter vector $\Theta^{(l)}$ as inputs. The AxM produces a quantized integer output $Y_q$. As described in Section [3.2](https://arxiv.org/html/2605.08231#S3.SS2), $\Theta^{(l)}$ is a vector of real values in $[0,1]$ with length $P$, where $P$ is the maximum number of approximated columns. Using Eq. ([1](https://arxiv.org/html/2605.08231#S3.E1)), we can express the AxM computation in closed form as:

$$Y_q = W_q X_q - \sum_{c=0}^{P-1} \left[ 2^c \cdot \theta_c^{(l)} \cdot \sum_{i=0}^{c} \left( W_q[i] \cdot X_q[c-i] \right) \right], \tag{6}$$

where $W_q[i]$ is the $i$-th bit of $W_q$, and $X_q[j]$ is the $j$-th bit of $X_q$. The term $W_q[i] \cdot X_q[c-i]$ is the partial product of $W_q[i]$ and $X_q[c-i]$. The sum of partial products in column $c$ is scaled by the continuous structure parameter $\theta_c^{(l)} \in [0,1]$. If $\theta_c^{(l)} = 0$, the $c$-th accumulation column is kept exactly. If $\theta_c^{(l)} = 1$, the $c$-th accumulation column is fully removed. If $0 < \theta_c^{(l)} < 1$, logic components in column $c$ are partially removed, and the corresponding accumulation error is estimated by scaling the exact accumulation result using $\theta_c^{(l)}$.

Quantization Simulation: Since the inputs and outputs of an AxM are integers, quantization is required before the AxM operation. We apply the traditional fake quantization technique (Jacob et al., [2018](https://arxiv.org/html/2605.08231#bib.bib14)) during training. The quantization functions $Q(x)$ and $Q(w)$, and the dequantization function $DQ(Y_q)$ in Fig. [3](https://arxiv.org/html/2605.08231#S3.F3) follow (Shao et al., [2024](https://arxiv.org/html/2605.08231#bib.bib38)).

## 5. Phase 2: AxM Structure Mapping

Phase 1 produces a set of continuous structure parameters $\Theta^* = \{\Theta^{*(1)}, \Theta^{*(2)}, \ldots, \Theta^{*(L)}\}$, where $\Theta^{*(l)}$ corresponds to the AxM used in layer $l$.
Let $\Theta^{*(l)} = [\theta^*_0, \theta^*_1, \ldots, \theta^*_{P-1}]$ (layer index $l$ omitted for brevity), where $P$ is the maximum number of columns that can be approximated. These continuous structure parameters $\theta^*_c$ cannot be directly implemented in hardware. The goal of phase 2 is to map $\Theta^{*(l)}$ into a concrete AxM netlist used in each layer $l$ so that the resulting circuit behaves as closely as possible to the behavior implied by $\Theta^{*(l)}$. Specifically, during training, $\Theta^{*(l)}$ controls the amount of error added to each column of the AxM. The hardware mapping aims to reproduce this same error behavior.

To guide this mapping, we compute the expected AxM output under $\Theta^{*(l)}$ using the closed-form model in Eq. ([6](https://arxiv.org/html/2605.08231#S4.E6)). For all input combinations, we compute the reference output:

$$Y_{\mathrm{ref}} = W_q X_q - \sum_{c=0}^{P-1} \Bigg[ 2^c \, \theta^*_c \sum_{i=0}^{c} \big( W_q[i] \cdot X_q[c-i] \big) \Bigg]. \tag{7}$$

Eq. ([7](https://arxiv.org/html/2605.08231#S5.E7)) defines the target behavior that the hardware should match. Thus, given $\Theta^{*(l)}$, our task is to construct an AxM whose output function approximates $Y_{\mathrm{ref}}$ as closely as possible.

To obtain such an AxM, we begin with an AccMul. In our implementation, the initial multiplier is an array-based AccMul. For each accumulation column $c$, the candidates for approximation are the sum and carry outputs of compressors such as HAs and FAs in column $c$. Each candidate can be tentatively replaced by constant 0. Such replacements act as discrete forms of the continuous error implied by $\theta^{*(l)}_c$.
Inspired by existing approximate logic synthesis methods that assess local replacements using the errors they induce (Meng et al., [2020](https://arxiv.org/html/2605.08231#bib.bib17), [2023](https://arxiv.org/html/2605.08231#bib.bib57); Hashemi et al., [2018](https://arxiv.org/html/2605.08231#bib.bib26); Meng et al., [2025b](https://arxiv.org/html/2605.08231#bib.bib34)), we define an error-based metric to evaluate whether a constant replacement of a candidate signal is beneficial. Let $Y^{\text{curr}}_{\mathrm{circ}}$ be the output of the current circuit and $Y^{\text{new}}_{\mathrm{circ}}$ the output after a tentative replacement. We evaluate the errors of $Y^{\text{curr}}_{\mathrm{circ}}$ and $Y^{\text{new}}_{\mathrm{circ}}$ with respect to $Y_{\mathrm{ref}}$ using the same input patterns. A replacement is accepted only if it strictly decreases the mean squared error (MSE) from the reference:

$$\operatorname{MSE}\big(Y^{\text{new}}_{\mathrm{circ}},\, Y_{\mathrm{ref}}\big) < \operatorname{MSE}\big(Y^{\text{curr}}_{\mathrm{circ}},\, Y_{\mathrm{ref}}\big).$$

Based on this evaluation metric, we propose a greedy column-wise mapping flow. We traverse the multiplier from the least significant column to the most significant column. For each column, all approximation candidates are considered before moving to the next column. Each candidate is tentatively replaced by constant 0 and evaluated using the $\operatorname{MSE}$ criterion. If a tentative substitution reduces the MSE, it is permanently applied to the circuit. Otherwise, the tentative substitution is undone. After processing all columns, we obtain a circuit whose output is close to the reference output $Y_{\mathrm{ref}}$. Finally, the resulting AxM netlist is emitted as Verilog.
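The greedy flow above can be sketched on a toy 4-bit multiplier. This is a simplified illustration, not the paper's implementation: the candidates here are the partial-product bits of each column rather than the sum and carry outputs of compressors, the circuit is modelled functionally instead of as a netlist, and all function names are hypothetical.

```python
from itertools import product

B = 4  # toy bit-width (assumption for this sketch)

def bit(v, i):
    return (v >> i) & 1

def reference_output(w, x, theta):
    """Y_ref from Eq. (7) for one input pair."""
    err = 0.0
    for c, th in enumerate(theta):
        s_c = sum(bit(w, i) * bit(x, c - i) for i in range(c + 1))
        err += (2 ** c) * th * s_c
    return w * x - err

def circuit_output(w, x, removed):
    """Multiplier whose partial products listed in `removed` are tied to constant 0."""
    return sum((bit(w, i) * bit(x, j)) << (i + j)
               for i in range(B) for j in range(B) if (i, j) not in removed)

def mse(removed, theta):
    """Mean squared error between the circuit and Y_ref over all input pairs."""
    pairs = list(product(range(2 ** B), repeat=2))
    return sum((circuit_output(w, x, removed) - reference_output(w, x, theta)) ** 2
               for w, x in pairs) / len(pairs)

def greedy_map(theta):
    """Visit columns from LSB to MSB; keep a replacement only if MSE strictly drops."""
    removed = set()
    best = mse(removed, theta)
    for c in range(len(theta)):
        for i in range(c + 1):
            j = c - i
            if i >= B or j >= B:
                continue  # no such partial product in a B-bit multiplier
            trial = removed | {(i, j)}
            trial_mse = mse(trial, theta)
            if trial_mse < best:
                removed, best = trial, trial_mse  # accept permanently
            # otherwise the tentative replacement is simply discarded
    return removed, best
```

When every $\theta^*_c$ is 0 the procedure returns the accurate multiplier, and when every $\theta^*_c$ is 1 it removes all partial products of the first $P$ columns; fractional values yield a subset chosen by the MSE test.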
Note that this procedure always yields a feasible design, since it starts from an AccMul and only applies constant replacements.

## 6. Experimental Results

We implement the TRAM framework using PyTorch 2.4 (Paszke et al., [2019](https://arxiv.org/html/2605.08231#bib.bib53)) and test it on a single NVIDIA A100 GPU. Experiments are conducted with CNN and ViT models on the CIFAR-10 (Krizhevsky and Hinton, [2009](https://arxiv.org/html/2605.08231#bib.bib54)) and ImageNet (Deng et al., [2009](https://arxiv.org/html/2605.08231#bib.bib22)) datasets. We evaluate two quantization schemes: w8a8 (8-bit weights and activations) and w4a4 (4-bit weights and activations), covering both standard and low-bitwidth regimes commonly used in AI accelerators. The initial w8a8-quantized model is obtained through post-training quantization, while the initial w4a4-quantized model is prepared using quantization-aware training. Channel-wise quantization is applied to the weights, and layer-wise quantization is applied to the activations.

Unless otherwise specified, all experiments use the following default settings. We use a batch size of 256 and the SGD optimizer with momentum 0.9 and weight decay 5e-4. The retraining epochs for design space exploration (phase 1) and accuracy recovery (phase 3) are both set to 10, which we found sufficient for convergence. Phase 1 uses a fixed learning rate of 5e-4, while phase 3 uses a cosine annealing schedule that decreases the learning rate from 5e-4 to 0. The parameter $P$ (see Section [3.2](https://arxiv.org/html/2605.08231#S3.SS2)), the maximum number of accumulation columns allowed to be approximated, is set to 8. Initially, the AxM in each layer $l$ removes 4 accumulation columns by setting $\theta^{(l)}_c = 1$ for $c = 0, 1, 2, 3$ and $\theta^{(l)}_c = 0$ for all other columns, serving as a moderate starting point for the optimizer.
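For concreteness, the phase 3 schedule and the initial structure parameters can be written out as below. This is a sketch under assumptions: it uses the standard cosine annealing formula, and whether the schedule advances per step or per epoch is not stated in the text, so `step` and `total_steps` are left generic. The function names are illustrative.

```python
import math

def cosine_lr(step, total_steps, lr_max=5e-4, lr_min=0.0):
    """Standard cosine annealing from lr_max (step 0) down to lr_min (last step)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))

def initial_theta(P=8, removed=4):
    """Initial per-layer structure parameters: columns 0..removed-1 fully removed."""
    return [1.0 if c < removed else 0.0 for c in range(P)]

# Learning rate at the start, middle, and end of a 10-unit schedule.
schedule = [cosine_lr(t, 10) for t in (0, 5, 10)]
theta0 = initial_theta()
```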
To evaluate area, delay, and power, we synthesize the AxMs using a commercial logic synthesis tool with the ASAP 7nm standard cell library (Clark et al., [2016](https://arxiv.org/html/2605.08231#bib.bib42)). Power measurements assume a 100 MHz clock frequency.

We evaluate TRAM under two scenarios: 1) all layers in the AI model share the same AxM type, to compare individual AxM designs against baselines, and 2) different layers can use different AxM types, to evaluate the benefit of layer-wise optimization.

### 6.1. Experiments with a Single AxM Type

This set of experiments assumes that all layers in the AI model use the same type of AxM. To achieve this in TRAM, we restrict the structure parameters of all layers to be identical, i.e., $\Theta^{(1)} = \Theta^{(2)} = \ldots = \Theta^{(L)}$, where $L$ is the number of layers.

The baseline AxM designs are "Evo", "OPACT", and "AMPPS", taken from (Mrazek et al., [2020](https://arxiv.org/html/2605.08231#bib.bib33)), (Xiao et al., [2022](https://arxiv.org/html/2605.08231#bib.bib4)), and (Hu et al., [2024](https://arxiv.org/html/2605.08231#bib.bib11)), respectively. The tested 8-bit unsigned multipliers and their errors, areas, and delays are listed in Table [1](https://arxiv.org/html/2605.08231#S6.T1). The ER, NMED, and MaxED metrics of the AxMs (see Section [2](https://arxiv.org/html/2605.08231#S2)) are measured by enumerating all possible input combinations under a uniform distribution. We use the open-source tool in (Meng et al., [2024](https://arxiv.org/html/2605.08231#bib.bib25)) to obtain the ER and NMED metrics.

For a fair comparison, for each Evo and OPACT AxM, we use the same number of epochs of AxM-aware retraining as TRAM. We reimplement this retraining process following the method in (Danopoulos et al., [2022](https://arxiv.org/html/2605.08231#bib.bib47)).
However, the AMPPS AxMs include an encoding process that is not supported by the existing AxM-aware retraining methods. Thus, we do not perform retraining for the AMPPS AxMs, and we directly take the final CNN accuracy reported in (Hu et al., [2024](https://arxiv.org/html/2605.08231#bib.bib11)).

Table 1. Tested 8-bit unsigned multipliers.

| Multiplier | Area/$\mu m^{2}$ | Delay/ps | Power/mW | ER/% | NMED/% | MaxED |
|---|---|---|---|---|---|---|
| AccMul | 27.2 | 496.5 | 0.0031 | 0.0 | 0.000 | 0 |
| OPACT_1 | 18.6 | 499.6 | 0.0019 | 23.0 | 0.022 | 516 |
| OPACT_18 | 14.1 | 495.1 | 0.0011 | 51.0 | 0.619 | 7684 |
| Evo_0AB | 21.3 | 499.6 | 0.0022 | 97.7 | 0.057 | 115 |
| Evo_1DMU | 16.9 | 499.2 | 0.0015 | 66.0 | 0.650 | 4084 |
| Evo_GJM | 12.1 | 499.5 | 0.0010 | 74.9 | 1.543 | 9124 |
| AMPPS_S2 | 17.0 | 499.9 | 0.0019 | 96.3 | 0.076 | 417 |
| AMPPS_S3 | 18.9 | 497.9 | 0.0020 | 95.9 | 0.073 | 417 |
| AMPPS_S4 | 20.8 | 499.7 | 0.0022 | 95.7 | 0.070 | 417 |

#### 6.1.1. Experiments on CIFAR-10

Figure 4. Comparison of final accuracy and AxM power consumption using different 8-bit multipliers on CIFAR-10.

This set of experiments compares the CNN accuracy and the AxM power consumption on CIFAR-10. We first use the w8a8 quantization scheme and test the ResNet18 and ResNet34 models. Fig. [4](https://arxiv.org/html/2605.08231#S6.F4) plots the final model accuracy versus the AxM power consumption for the tested AxMs on ResNet18 and ResNet34. Each figure includes the 8-bit AccMul result in the upper right corner for reference. A better AxM achieves higher accuracy and lower power, which appears toward the upper left region of the plot. We vary the trade-off parameter $\lambda$ in Eq. ([2](https://arxiv.org/html/2605.08231#S4.E2)) to generate different AxM designs using TRAM. As $\lambda$ increases, both the AxM power consumption and the accuracy decrease. The results show that TRAM achieves a better trade-off between accuracy and power consumption than the Evo, AMPPS, and OPACT AxMs for both ResNet18 and ResNet34 models. For example, compared with OPACT_1 on ResNet18, TRAM with $\lambda=0.9$ reaches a similar accuracy, while reducing power by 15.44%.
On ResNet34, TRAM with $\lambda=0.9$ reaches an accuracy similar to that of Evo_0AB and OPACT_1, while reducing power by 25.05% and 15.43%, respectively.

As for runtime, TRAM includes two retraining phases (phases 1 and 3 in Fig. [2](https://arxiv.org/html/2605.08231#S1.F2)) and one AxM structure mapping phase (phase 2). Phase 2 uses a greedy algorithm and takes less than 1 minute to generate an 8-bit AxM. The phase 2 runtime is independent of the CNN model size. With 20 training epochs in total (10 epochs for phase 1 and 10 epochs for phase 3), TRAM takes about 17 and 30 minutes for ResNet18 and ResNet34, respectively.

We also evaluate the impact of $\lambda$ on the DenseNet161 model under the w4a4 quantization scheme. The maximum number of approximated columns, $P$, is set to 4. The results are shown in Fig. [5](https://arxiv.org/html/2605.08231#S6.F5). As in the previous results, increasing $\lambda$ reduces both the AxM power and the model accuracy. The runtime of TRAM for DenseNet161 under w4a4 is about 1.66 hours for 20 training epochs (phases 1 and 3).

Figure 5. Impact of $\lambda$ on DenseNet161 accuracy and AxM power under w4a4. Power is normalized to the 4-bit AccMul.

#### 6.1.2. Experiments on ImageNet

This set of experiments evaluates TRAM on the ImageNet dataset (Deng et al., [2009](https://arxiv.org/html/2605.08231#bib.bib22)). The baseline method is TransAxx (Danopoulos et al., [2025](https://arxiv.org/html/2605.08231#bib.bib29)), an AxM-aware retraining method for transformer models. TransAxx tested the power-accuracy trade-off of several 8-bit AxMs from the EvoApprox library (Mrazek et al., [2020](https://arxiv.org/html/2605.08231#bib.bib33)), and we compare the TRAM-generated AxMs with these designs. We adopt the TransAxx training settings, which use up to 15 epochs for AxM-aware training: we perform phase 1 (design space exploration) of TRAM for 5 epochs and phase 3 (accuracy recovery) for 10 epochs.
Following the TransAxx setup, we sample 100,000 images from the 1.28 million training images for efficient training, and use the full validation set for evaluation. We use the Adam optimizer with a learning rate of 5e-5 for both phases and set the batch size to 64. Note that TRAM uses channel-wise weight quantization and layer-wise activation quantization, whereas TransAxx uses layer-wise quantization for both weights and activations. Therefore, the comparison with TransAxx should be viewed as a reference rather than a strictly matched apples-to-apples comparison. To save runtime, we set the parameter $P$, which is the maximum number of approximated columns, to 6.

Table 2. Results on ImageNet with vision transformers.

| Model | AxM | Accuracy | Norm. AxM Power |
|---|---|---|---|
| DeiT-S (8-bit AccMul acc. 79.34%) | Ours ($\lambda=1$) | 76.78% | 82.53% |
| | Ours ($\lambda=100$) | 76.31% | 66.08% |
| | Evo_1KV9 | 70.16% | 93.17% |
| | Evo_1L2H | 67.01% | 73.04% |
| Swin-S (8-bit AccMul acc. 81.83%) | Ours ($\lambda=1$) | 79.54% | 82.53% |
| | Ours ($\lambda=100$) | 79.15% | 66.08% |
| | Evo_1KV9 | 79.25% | 93.17% |
| | Evo_1L2H | 76.64% | 73.04% |

Table [2](https://arxiv.org/html/2605.08231#S6.T2) compares TRAM-generated AxMs with the Evo_1KV9 and Evo_1L2H AxMs from the EvoApprox library. We copy the accuracy results of these designs from the TransAxx paper and measure their power using the ASAP 7nm library. Other AxMs tested in TransAxx are excluded due to their impractically large accuracy loss. Table [2](https://arxiv.org/html/2605.08231#S6.T2) suggests that TRAM provides a favorable accuracy-power trade-off under our setting. For DeiT-S and Swin-S, using $\lambda=1$ keeps the accuracy acceptable compared to that of the 8-bit quantized model using AccMuls (a drop of 2.56 and 2.29 percentage points, respectively), while reducing AxM power by 17.47%. Increasing $\lambda$ to $100$ provides further power reduction with a small decrease in accuracy. In contrast, Evo_1KV9 consumes much more power, while Evo_1L2H achieves lower power at the cost of a much larger accuracy loss.
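
As a quick arithmetic check, the power figures discussed for Table 2 can be recomputed from its normalized-power column. The sketch below is illustrative only: the helper names are hypothetical, the values are copied from the table, and the last helper computes a derived quantity (saving relative to the baseline AxM rather than to the AccMul) that the paper does not report.

```python
# Normalized AxM power in percent of the 8-bit AccMul power, copied from Table 2.
NORM_POWER = {
    "Ours (lambda=1)": 82.53,
    "Ours (lambda=100)": 66.08,
    "Evo_1KV9": 93.17,
    "Evo_1L2H": 73.04,
}


def saving_vs_accmul(name):
    """Power saved relative to the AccMul, in percent of the AccMul power."""
    return 100.0 - NORM_POWER[name]


def gap_in_points(ours, baseline):
    """Difference in normalized power, in percentage points of the AccMul power."""
    return NORM_POWER[baseline] - NORM_POWER[ours]


def saving_vs_baseline(ours, baseline):
    """Power saved relative to the baseline AxM itself, in percent (derived here)."""
    return 100.0 * (1.0 - NORM_POWER[ours] / NORM_POWER[baseline])


if __name__ == "__main__":
    print(round(saving_vs_accmul("Ours (lambda=1)"), 2))
    print(round(gap_in_points("Ours (lambda=100)", "Evo_1KV9"), 2))
    print(round(saving_vs_baseline("Ours (lambda=100)", "Evo_1KV9"), 2))
```

The first two helpers correspond to the 17.47% and 27.09% figures quoted in the text, which indicates that both are expressed relative to the AccMul power.
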
For Swin-S, our AxM with $\lambda=100$ achieves accuracy comparable to Evo_1KV9 (79.15% versus 79.25%) while lowering the normalized AxM power by a further 27.09 percentage points (66.08% versus 93.17%).

Table 3. Comparison of ResNet50 accuracy, normalized AxM energy, and runtime using different layer-wise AxM exploration methods on CIFAR-10. “N/A” means not applicable.

| Method | ResNet50 Accuracy | Norm. Energy | Runtime/hour |
|---|---|---|---|
| FP32 | 93.65% | N/A | N/A |
| AccMul | 93.56% | 100.00% | N/A |
| Ours ($\lambda=1$) | 93.71% | 74.97% | 0.69 |
| Ours ($\lambda=10$) | 93.65% | 55.64% | 0.70 |
| Ours ($\lambda=100$) | 92.79% | 35.90% | 0.69 |
| Ours ($\lambda=1000$) | 92.97% | 35.81% | 0.69 |
| MARLIN-1 | 92.14% | 80.67% | 111.1 |
| MARLIN-2 | 91.70% | 76.67% | 111.1 |
| ALWANN-1 | 89.08% | 78.47% | N/A |
| ALWANN-2 | 88.58% | 70.02% | N/A |

### 6.2. Experiments with Layer-Wise Different AxM Types

This set of experiments assumes that different layers can use different types of AxMs. To support this in TRAM, we allow structure parameters to differ across layers, i.e., $\Theta^{(1)}, \Theta^{(2)}, \ldots, \Theta^{(L)}$ can be different. We compare TRAM with existing mixed-type AxM selection frameworks for CNNs, i.e., MARLIN (Guella et al., [2024](https://arxiv.org/html/2605.08231#bib.bib41)) and ALWANN (Mrazek et al., [2019](https://arxiv.org/html/2605.08231#bib.bib7)). Both MARLIN and ALWANN select different AxM types across layers to improve energy efficiency. We use data reported in the MARLIN paper, which also includes results for ALWANN. Although MARLIN is evaluated in a 65 nm technology and our evaluation is based on the ASAP 7nm library, the AxM energy normalized to the AccMul energy still provides a meaningful basis for comparison.

Table [3](https://arxiv.org/html/2605.08231#S6.T3) compares TRAM with MARLIN and ALWANN on CIFAR-10 using the ResNet50 model and the w8a8 quantization scheme. The AxM energy consumption refers to the energy consumed by all AxMs during inference of one input image.
It is estimated by accumulating the energy of approximate multiplications in convolutional layers. The normalized AxM energy is estimated from the number of multiplications and the AxM power in each layer. Since the tested 8-bit multipliers have nearly identical latency, we assume the same delay for all multipliers, so the delay term cancels out in the normalization. Therefore, the normalized energy is computed from $\sum_{l} \#\textit{mults}^{(l)} \times \textit{AxMPower}^{(l)}$, normalized to the 8-bit AccMul.

From Table [3](https://arxiv.org/html/2605.08231#S6.T3), our method achieves a better accuracy-energy trade-off than both MARLIN and ALWANN; for $\lambda \geq 10$, it is both more accurate and lower in energy than every MARLIN and ALWANN configuration listed. When $\lambda=1$, our method reaches 93.71% accuracy, which slightly exceeds the baseline 8-bit AccMul accuracy of 93.56%. Meanwhile, AxM energy consumption is reduced by 25.03%. When $\lambda=1000$, our method achieves 92.97% accuracy with 35.81% normalized energy. Comparing the case of $\lambda=1000$ with MARLIN-2, TRAM improves accuracy by 1.27 percentage points (92.97% versus 91.70%) while reducing normalized energy by 40.86 percentage points (35.81% versus 76.67%).

One reason for the large energy savings is that TRAM explores a larger AxM design space. With the parameterization in Section [3.2](https://arxiv.org/html/2605.08231#S3.SS2) and the mapping in Section [5](https://arxiv.org/html/2605.08231#S5), TRAM applies constant-0 replacements to HA and FA sum and carry outputs. In contrast, MARLIN explores a more restricted design space based on removing entire columns of partial products.

Comparing runtime, TRAM requires 0.7 hours on a single NVIDIA A100 GPU for the 20 training epochs, while MARLIN consumes over 111 hours using 16 threads of a Ryzen 5950X CPU and an NVIDIA RTX A5000 GPU.

## 7. Conclusion

In conclusion, to achieve a good trade-off between power consumption and accuracy of AxM-based AI accelerators, we propose TRAM to directly explore the AxM design space through model retraining.
At the same accuracy level, TRAM significantly reduces AxM power on several CNNs and vision transformers on the CIFAR-10 and ImageNet datasets. In the future, we will extend TRAM to large language models.

## References

- [1] G. Armeniakos, G. Zervakis, D. Soudris, and J. Henkel (2022) Hardware approximate techniques for deep neural network accelerators: a survey. ACM Computing Surveys (CSUR) 55(4), pp. 1–36.
- [2] L. T. Clark, V. Vashishtha, L. Shifren, A. Gujja, S. Sinha, B. Cline, C. Ramamurthy, and G. Yeric (2016) ASAP7: a 7-nm FinFET predictive process design kit. Microelectronics Journal 53, pp. 105–115.
- [3] D. Danopoulos, G. Zervakis, K. Siozios, D. Soudris, and J. Henkel (2022) AdaPT: fast emulation of approximate DNN accelerators in PyTorch. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) 42(6), pp. 2074–2078.
- [4] D. Danopoulos, G. Zervakis, D. Soudris, and J. Henkel (2025) TransAxx: efficient transformers with approximate computing. IEEE Transactions on Circuits and Systems for Artificial Intelligence 2(4), pp. 288–301.
- [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255.
- [6] F. Guella, E. Valpreda, M. Caon, G. Masera, and M. Martina (2024) MARLIN: a co-design methodology for approximate reconfigurable inference of neural networks at the edge. IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS-I) 71(5), pp. 2105–2118.
- [7] S. Hashemi, H. Tann, and S. Reda (2018) BLASYS: approximate logic synthesis using boolean matrix factorization. In Design Automation Conference (DAC), pp. 1–6.
- [8] X. Hu, A. Liu, X. Geng, Z. Wei, K. Jiang, and H. Jiang (2024) A configurable approximate multiplier for CNNs using partial product speculation. In Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1–6.
- [9] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2704–2713.
- [10] P. Jain, S. Huda, M. Maas, J. E. Gonzalez, I. Stoica, and A. Mirhoseini (2022) Learning to design accurate deep learning accelerators with inaccurate multipliers. In Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 184–189.
- [11] H. Jiang, F. J. H. Santiago, H. Mo, L. Liu, and J. Han (2020) Approximate arithmetic circuits: a survey, characterization, and recent applications. Proceedings of the IEEE 108(12), pp. 2108–2135.
- [12] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical Report, University of Toronto.
- [13] V. Leon, M. A. Hanif, G. Armeniakos, X. Jiao, M. Shafique, K. Pekmestzi, and D. Soudris (2025) Approximate computing survey, part II: application-specific & architectural approximation techniques and applications. ACM Computing Surveys 57(7), pp. 1–36.
- [14] J. Ma, S. Hashemi, and S. Reda (2021) Approximate logic synthesis using Boolean matrix factorization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) 41(1), pp. 15–28.
- [15] C. Meng, W. Burleson, W. Qian, and G. De Micheli (2025) Gradient approximation of approximate multipliers for high-accuracy deep neural network retraining. In Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1–7.
- [16] C. Meng, A. Mishchenko, W. Qian, and G. De Micheli (2025) Efficient resubstitution-based approximate logic synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 44(6), pp. 2040–2053.
- [17] C. Meng, W. Qian, and G. De Micheli (2026) Simulation-guided approximate logic synthesis under the maximum error constraint. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD).
- [18] C. Meng, W. Qian, and A. Mishchenko (2020) ALSRAC: approximate logic synthesis by resubstitution with approximate care set. In Design Automation Conference (DAC), pp. 1–6.
- [19] C. Meng, H. Wang, Y. Mai, W. Qian, and G. De Micheli (2024) VECSEM: verifying average errors in approximate circuits using simulation-enhanced model counting. In Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1–6.
- [20] C. Meng, Z. Zhou, Y. Yao, S. Huang, Y. Chen, and W. Qian (2023) HEDALS: highly efficient delay-driven approximate logic synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) 42(11), pp. 3491–3504.
- [21] V. Mrazek, R. Hrbacek, Z. Vasicek, and L. Sekanina (2017) EvoApprox8b: library of approximate adders and multipliers for circuit design and benchmarking of approximation methods. In Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 258–261.
- [22] V. Mrazek et al. (2019) ALWANN: automatic layer-wise approximation of deep neural network accelerators without retraining. In International Conference on Computer Aided Design (ICCAD), pp. 1–8.
- [23] V. Mrazek, L. Sekanina, and Z. Vasicek (2020) Libraries of approximate circuits: automated design and application in CNN accelerators. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 10(4), pp. 406–418.
- [24] A. Paszke, S. Gross, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In International Conference on Neural Information Processing Systems (NeurIPS), pp. 8026–8037.
- [25] R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni (2020) Green AI. Communications of the ACM 63(12), pp. 54–63.
- [26] W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y. Qiao, and P. Luo (2024) OmniQuant: omnidirectionally calibrated quantization for large language models. In International Conference on Learning Representations (ICLR), pp. 1–25.
- [27] W. A. Simon, V. Ray, A. Levisse, G. Ansaloni, M. Zapater, and D. Atienza (2021) Exact neural networks from inexact multipliers via Fibonacci weight encoding. In Design Automation Conference (DAC), pp. 805–810.
- [28] X. Wang, Z. Yan, C. Meng, Y. Shi, and W. Qian (2023) DASALS: differentiable architecture search-driven approximate logic synthesis. In International Conference on Computer Aided Design (ICCAD), pp. 1–9.
- [29] Y. Wu, C. Chen, W. Xiao, X. Wang, C. Wen, J. Han, X. Yin, W. Qian, and C. Zhuo (2024) A survey on approximate multiplier designs for energy efficiency: from algorithms to circuits. ACM Transactions on Design Automation of Electronic Systems (TODAES) 29(1), pp. 1–37.
- [30] W. Xiao, C. Zhuo, and W. Qian (2022) OPACT: optimization of approximate compressor tree for approximate multiplier. In Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 178–183.
- [31] S. Zheng, Z. Li, Y. Lu, J. Gao, J. Zhang, and L. Wang (2022) HEAM: high-efficiency approximate multiplier optimization for deep neural networks. In IEEE International Symposium on Circuits and Systems (ISCAS), pp. 3359–3363.