QuIDE: Mastering the Quantized Intelligence Trade-off via Active Optimization
Summary
This paper introduces QuIDE, a framework featuring an Intelligence Index to evaluate the trade-offs between compression, accuracy, and latency in quantized neural networks. It demonstrates that optimal bit-widths vary by task, with 4-bit being ideal for LLMs and simple tasks, while 8-bit is better for complex CNNs.
# QuIDE: Mastering the Quantized Intelligence Trade-off via Active Optimization
Source: [https://arxiv.org/html/2605.10959](https://arxiv.org/html/2605.10959)
Xiantao Jiang, College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China, xtjiang@shmtu.edu.cn
###### Abstract
There is currently no unified metric for evaluating the efficiency of quantized neural networks. We propose QuIDE, built around the Intelligence Index $I=(C\times P)/\log_{2}(T+1)$, which collapses the compression-accuracy-latency trade-off into a single score. Experiments across six settings, covering SimpleCNN (MNIST, CIFAR), ResNet-18 (ImageNet-1K), and Llama-3-8B, show a task-dependent Pareto Knee. 4-bit quantization is optimal for MNIST and large LLMs, while 8-bit is the sweet spot for complex CNN tasks (ResNet-18 on ImageNet), where 4-bit PTQ collapses accuracy catastrophically. The accuracy-gated variant $I^{\prime}$ correctly flags these non-viable configurations that the raw $I$ would reward. QuIDE provides a reproducible evaluation protocol and a ready-to-use fitness function for mixed-precision search.
## 1 Introduction
Deploying deep neural networks on edge devices requires model compression, and quantization is among the most effective techniques Gholami et al. ([2022](https://arxiv.org/html/2605.10959#bib.bib1)). Choosing the right bit-width means balancing three competing objectives: compression ratio, predictive accuracy, and inference latency Deng et al. ([2020](https://arxiv.org/html/2605.10959#bib.bib3)). There is no standard way to make this trade-off.
Current evaluation practice treats these three axes separately or in pairs. Accuracy-vs-model-size curves Wang et al. ([2020](https://arxiv.org/html/2605.10959#bib.bib7)) and latency benchmarks Kim et al. ([2021](https://arxiv.org/html/2605.10959#bib.bib9)) are reported in isolation. MLPerf Tiny Banbury et al. ([2021](https://arxiv.org/html/2605.10959#bib.bib19)) tabulates accuracy, latency, and energy side by side but leaves the final judgment to the user. The result is that bit-width selection becomes an exercise in subjective curve-reading. Is a 2-bit model with $16\times$ compression better than an 8-bit model that retains near-full accuracy? The answer depends entirely on how one weighs storage against task performance, and no formal framework exists to make this weighing consistent.
We propose the Quantized Intelligence and Deployment Efficiency (QuIDE) framework. Its core is the Intelligence Index,
$$I=\frac{C\times P}{\log_{2}(T+1)},$$
which collapses the three-dimensional compression-accuracy-latency trade-off into a single scalar. Compression gains only count when accuracy is preserved ($C\times P$ is multiplicative), and latency is penalized with diminishing marginal weight ($\log_{2}$ damping). The resulting score is higher for models that achieve more efficiency across all three dimensions.
We validate QuIDE through PTQ experiments on six conditions: SimpleCNN (MNIST, CIFAR-10, CIFAR-100), ResNet-18 (CIFAR-10, ImageNet-1K), and Llama-3-8B. The results reveal a task-dependent Pareto Knee. For simple tasks (MNIST) and large-parameter LLMs, 4-bit quantization is optimal. For deep CNNs on complex vision tasks (ImageNet), 4-bit PTQ collapses accuracy catastrophically, making 8-bit the practical sweet spot. The accuracy-gated variant $I^{\prime}$ is essential for detecting this: the raw index $I$ can be inflated by extreme compression even when the model is non-functional, while $I^{\prime}$ suppresses such configurations. Our contributions are:
- The Intelligence Index $I=(C\times P)/\log_{2}(T+1)$ and its accuracy-gated variant $I^{\prime}$, a composite metric that unifies compression-accuracy-latency evaluation.
- The QuIDE framework, a standardized protocol for measuring quantized model efficiency across scales from CNNs to 8B-parameter LLMs.
- An empirical finding of a complexity-dependent Pareto Knee across six conditions: 4-bit is optimal for simple tasks and large LLMs, while 8-bit is required for deep CNNs on complex vision tasks (e.g., ImageNet). We show that $I^{\prime}$ is necessary to gate non-viable configurations that $I$ would incorrectly reward.
## 2 Related Work
Quantization Techniques. Model quantization reduces numerical precision to accelerate inference and reduce memory footprint. Strategies range from Post-Training Quantization (PTQ), which calibrates on a small data subset without retraining Wu et al. ([2020](https://arxiv.org/html/2605.10959#bib.bib6)); Nagel et al. ([2020](https://arxiv.org/html/2605.10959#bib.bib18)); Cai et al. ([2020](https://arxiv.org/html/2605.10959#bib.bib2)), to Quantization-Aware Training (QAT), which integrates precision constraints into the optimization loop Gholami et al. ([2022](https://arxiv.org/html/2605.10959#bib.bib1)); Esser et al. ([2020](https://arxiv.org/html/2605.10959#bib.bib16)). Advanced PTQ methods such as BRECQ Li et al. ([2021](https://arxiv.org/html/2605.10959#bib.bib13)) use block-wise reconstruction to minimize output error layer by layer, while AdaRound Nagel et al. ([2020](https://arxiv.org/html/2605.10959#bib.bib18)) learns rounding decisions to reduce accumulated quantization error. Binary and ternary networks achieve extreme compression at the cost of representational capacity Qin et al. ([2020](https://arxiv.org/html/2605.10959#bib.bib4)). For large language models, GPTQ Frantar et al. ([2023](https://arxiv.org/html/2605.10959#bib.bib17)) applies one-shot weight quantization via second-order information. QuIDE is algorithm-agnostic: it evaluates the efficiency of any quantized model regardless of the algorithm used to produce it.
Mixed-Precision Quantization. Assigning different bit-widths to different layers can substantially improve the accuracy-compression trade-off relative to uniform quantization. HAQ Wang et al. ([2019](https://arxiv.org/html/2605.10959#bib.bib12)) trains a DDPG agent to search per-layer bit-widths subject to hardware latency constraints. HAWQ Dong et al. ([2019](https://arxiv.org/html/2605.10959#bib.bib15)) uses Hessian eigenvalue sensitivity to rank layer quantizability and guide mixed-precision assignment. APQ Wang et al. ([2020](https://arxiv.org/html/2605.10959#bib.bib7)) jointly searches architecture, pruning, and quantization. These methods demonstrate that layer-wise heterogeneity is essential for reaching the Pareto frontier, but they each require a task-specific reward signal. QuIDE contributes a complementary tool: the gated Intelligence Index $I^{\prime}$ provides a single, hardware-grounded scalar that can serve as a unified fitness function across any of these search paradigms without re-engineering the reward.
Efficiency Metrics and Pareto Optimization. Assessing model efficiency requires balancing parameter count, memory footprint, and latency Shuvo et al. ([2022](https://arxiv.org/html/2605.10959#bib.bib5)). Benchmarks such as MLPerf Tiny Banbury et al. ([2021](https://arxiv.org/html/2605.10959#bib.bib19)) report these metrics separately, delegating synthesis to the practitioner. Multi-objective NAS methods navigate accuracy-latency or accuracy-size frontiers Wang et al. ([2020](https://arxiv.org/html/2605.10959#bib.bib7)) but leave the final operating-point selection unspecified. Existing scalarization schemes rely on arbitrary weights. QuIDE fills this gap by introducing the Intelligence Index as an information-theoretically motivated scalar that unifies compression, accuracy, and latency into a single ranking criterion, and it operationalizes $I^{\prime}$ as a ready-to-use fitness function for evolutionary mixed-precision search (MPS).
### 2.1 Information-Theoretic Perspectives and Metric Design
The Information Bottleneck principle Tishby and Zaslavsky ([2015](https://arxiv.org/html/2605.10959#bib.bib10)) frames representation learning as compressing the input while preserving task-relevant information. The Minimum Description Length (MDL) framework Grünwald ([2007](https://arxiv.org/html/2605.10959#bib.bib11)) says the best model minimizes the total description length of the model plus the data it encodes. Quantization is a hardware-constrained instance of these ideas: lower bit-width means shorter descriptions, and accuracy measures information preservation. The Intelligence Index formalizes this connection by defining spatial utility $U(b)=C^{(b)}\times P^{(b)}$, that is, the preserved information per unit of description cost.
In practice, metric design must also account for hardware constraints. Prior quantization-aware NAS methods Wang et al. ([2019](https://arxiv.org/html/2605.10959#bib.bib12)); Dong et al. ([2019](https://arxiv.org/html/2605.10959#bib.bib15)) optimize a weighted sum of accuracy and latency, leaving the weighting to task-specific tuning. The Intelligence Index replaces arbitrary weights with a shape grounded in deployment physics (compression scales with bit-width ratios, latency follows a diminishing-return profile), so the metric transfers across architectures without re-tuning.
## 3 Methodology
We define $I$ and $I^{\prime}$ formally, then describe the measurement protocol.
### 3.1 Problem Formulation
Let $\mathcal{M}_{FP}$ be a full-precision model with parameters $\boldsymbol{\theta}_{FP}\in\mathbb{R}^{d}$ stored in 32-bit floats. Quantization produces $\mathcal{M}^{(b)}$ with parameters $\boldsymbol{\theta}_{Q}^{(b)}$ at $b$ bits. We evaluate efficiency along three axes:
- Compression ($C$): memory ratio $C^{(b)}=32/b$ (for uniform quantization).
- Predictive Accuracy ($P$): classification accuracy on the held-out test set $D_{test}$.
- Computational Cost ($T$): mean inference latency per forward pass on a fixed hardware platform.
The goal is to identify the bit-width $b^{*}$ that balances these objectives. No existing metric provides a consistent way to compare configurations across all three dimensions simultaneously.
### 3.2 Design Rationale and Formulation of the Intelligence Index
We derive $I$ from three design choices, each grounded in information theory or systems engineering, and validate them against alternative formulations in Section [3.4](https://arxiv.org/html/2605.10959#S3.SS4).
Spatial Utility (Compression $\times$ Accuracy): Motivated by the Minimum Description Length (MDL) framework, the central objective of quantization is to minimize the network's description length while preserving its predictive information. For a model subjected to uniform quantization with a bit-width $b$, the compression factor is $C^{(b)}=32/b$. We define the spatial utility $U(b)$ as the product of the compression ratio and the predictive accuracy:
$$U(b)=C^{(b)}\times P^{(b)} \tag{1}$$
where $P^{(b)}$ denotes the task-specific accuracy expressed as a fraction in $[0,1]$. The multiplicative form is chosen so that a model yielding zero accuracy contributes zero utility regardless of compression, a property not guaranteed by additive formulations. This design choice is empirically validated in Section [3.4](https://arxiv.org/html/2605.10959#S3.SS4), where we contrast $U(b)$ against additive alternatives. The quantity $U(b)$ thus represents the effective predictive payload per normalized unit of memory footprint.
Temporal Penalty (Logarithmic Latency Damping): Within edge computing, inference latency $T$ is the primary physical constraint on real-time utility. A suitable penalty function $f(T)$ should satisfy $f(0)=0$, $f^{\prime}(T)>0$ (monotonicity), and $f^{\prime\prime}(T)<0$ (diminishing marginal cost: the relative severity of an extra millisecond decreases as baseline latency grows). Among the family of functions satisfying these axioms, the logarithm $f(T)=\log_{2}(T+1)$ offers three pragmatic advantages: (i) it is sub-additive, so combining two latency sources is penalized less than their sum; (ii) its growth rate is substantially slower than linear, reflecting the empirical observation that latency improvements exhibit diminishing returns at lower bit-widths as memory bandwidth becomes the bottleneck rather than arithmetic precision; and (iii) it yields interpretable, bounded values over typical edge latency ranges. We note that the choice of $\log_{2}$ over other bases is a scaling convention; the ranking of configurations is invariant to the base. Normalizing the time constant to 1 ms establishes the penalty as $f(T)=\log_{2}(T+1)$ with $T$ in milliseconds, which we adopt throughout.
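The stated properties of the penalty can be checked numerically. The following is a minimal sketch, not part of the paper's code; the function name `latency_penalty` and the sample latencies are illustrative choices (the two values happen to match the SimpleCNN latency range reported later in Section 5).

```python
import math


def latency_penalty(t_ms: float, base: float = 2.0) -> float:
    """Temporal penalty f(T) = log_base(T + 1), with T in milliseconds."""
    return math.log(t_ms + 1.0, base)


a, b = 0.48, 1.16  # illustrative latencies in ms

# f(0) = 0 and f is increasing.
zero_at_origin = latency_penalty(0.0) == 0.0
increasing = latency_penalty(a) < latency_penalty(b)

# Sub-additivity: f(a + b) <= f(a) + f(b).
subadditive = latency_penalty(a + b) <= latency_penalty(a) + latency_penalty(b)

# Changing the base rescales every penalty by the same constant, so ratios
# between configurations (and hence rankings) are unchanged.
ratio_base_2 = latency_penalty(a, 2.0) / latency_penalty(b, 2.0)
ratio_base_e = latency_penalty(a, math.e) / latency_penalty(b, math.e)
base_invariant = math.isclose(ratio_base_2, ratio_base_e)

print(zero_at_origin, increasing, subadditive, base_invariant)
```

Because the base only rescales the denominator of the index by a constant, any base gives the same ordering of configurations.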
Composite Scalarization: Combining spatial utility and temporal penalty as a ratio gives the Intelligence Index:
$$I^{(b)}=\frac{U(b)}{f(T^{(b)})}=\frac{C^{(b)}\times P^{(b)}}{\log_{2}(T^{(b)}+1)} \tag{2}$$
Higher $I$ means better efficiency. This is not the only possible scalarization of a three-objective trade-off, but it is well-motivated and we validate it against alternatives in Section [3.4](https://arxiv.org/html/2605.10959#S3.SS4).
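Eq. (2) is easy to evaluate directly. The sketch below is illustrative (the helper names `compression` and `intelligence_index` are not from the paper) and plugs in the ResNet-18/CIFAR-10 numbers quoted later in the text: 94.84% accuracy at 8-bit, 53.69% at 4-bit, and 23.34 ms latency for both.

```python
import math


def compression(bits: int, fp_bits: int = 32) -> float:
    """Compression ratio C = 32 / b for uniform quantization."""
    return fp_bits / bits


def intelligence_index(bits: int, accuracy: float, latency_ms: float) -> float:
    """Raw Intelligence Index I = (C * P) / log2(T + 1).

    accuracy is a fraction in [0, 1]; latency_ms must be > 0.
    """
    return compression(bits) * accuracy / math.log2(latency_ms + 1.0)


# ResNet-18 on CIFAR-10, values as quoted in the text.
i_8b = intelligence_index(8, 0.9484, 23.34)
i_4b = intelligence_index(4, 0.5369, 23.34)

print(f"I(8-bit) = {i_8b:.3f}")
print(f"I(4-bit) = {i_4b:.3f}")
```

With these inputs the raw index comes out higher for the 4-bit model than for the 8-bit model, which reproduces the inflation effect that motivates the gated variant.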
Accuracy-Gated Refined Index ($I^{\prime}$): The raw index $I$ can be inflated by extreme compression at low bit-widths even when accuracy collapses to near-random levels. To prevent rewarding such pathological configurations, we introduce an accuracy-gating mechanism: a minimum viability threshold $P_{thresh}$ below which a model is treated as non-functional. The refined index $I^{\prime}$ applies a hard penalty via a shifted ReLU:
$$I^{\prime(b)}=\frac{C^{(b)}\times\max(P^{(b)}-P_{thresh},0)}{\log_{2}(T^{(b)}+1)} \tag{3}$$
For configurations where $P^{(b)}>P_{thresh}$, the index rewards accuracy proportional to its margin above the viability floor; for configurations where $P^{(b)}\leq P_{thresh}$, the index collapses to zero, removing them from the actionable Pareto frontier.
We formalize $P_{thresh}$ via a two-component rule:
$$P_{thresh}=\max\left(\frac{1}{K},\;P_{FP}-\delta\right) \tag{4}$$
where $K$ is the number of classes (so $1/K$ is random-chance accuracy), $P_{FP}$ is the full-precision model accuracy, and $\delta$ is a task-specific tolerance parameter encoding how much accuracy degradation is acceptable for the target application. A small $\delta$ enforces near-lossless deployment (e.g., safety-critical LLM inference), while a larger $\delta$ accommodates aggressive compression when some accuracy loss is tolerable.
The $\delta$ values used in this work are: MNIST $\delta=0.19$ (tolerating a 19 pp drop from $99.2\%$), CIFAR-10 $\delta=0.30$, CIFAR-100 $\delta=0.43$ (near the random floor), ImageNet-1K $\delta=0.60$, ResNet-18/CIFAR-10 $\delta=0.45$, and Llama-3-8B $\delta=0.07$ (tight tolerance). The sensitivity of $I^{\prime}$ to the choice of $P_{thresh}$ is analyzed through ablation in Appendix [A](https://arxiv.org/html/2605.10959#A1), and the ranking of viable configurations is stable across a wide range of $\delta$ values for non-collapse bit-widths.
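Eqs. (3) and (4) can be sketched in a few lines. The function names below are illustrative, and the inputs are the ResNet-18/CIFAR-10 figures quoted elsewhere in the text ($P_{FP}=94.93\%$, $\delta=0.45$, $K=10$, 23.34 ms latency); the outputs land close to the reported $I^{\prime}$ values, though small differences are possible since the paper averages over seeds.

```python
import math


def viability_threshold(num_classes: int, p_fp: float, delta: float) -> float:
    """Eq. (4): P_thresh = max(1/K, P_FP - delta), all accuracies as fractions."""
    return max(1.0 / num_classes, p_fp - delta)


def gated_index(bits: int, accuracy: float, latency_ms: float, p_thresh: float) -> float:
    """Eq. (3): I' = C * max(P - P_thresh, 0) / log2(T + 1), with C = 32 / b."""
    margin = max(accuracy - p_thresh, 0.0)
    return (32.0 / bits) * margin / math.log2(latency_ms + 1.0)


# ResNet-18 on CIFAR-10 (K = 10 classes).
p_thresh = viability_threshold(num_classes=10, p_fp=0.9493, delta=0.45)
gated_8b = gated_index(8, 0.9484, 23.34, p_thresh)
gated_4b = gated_index(4, 0.5369, 23.34, p_thresh)

# Any accuracy at or below the threshold is gated to exactly zero.
gated_collapsed = gated_index(2, 0.10, 23.34, p_thresh)

# CIFAR-100 (K = 100): the tolerance leaves a floor just above random chance.
p_thresh_cifar100 = viability_threshold(num_classes=100, p_fp=0.4798, delta=0.43)

print(f"P_thresh = {p_thresh:.4f}, I'(8-bit) = {gated_8b:.3f}, I'(4-bit) = {gated_4b:.3f}")
```

The gate reverses the raw ordering: the 8-bit score stays well above the 4-bit score because the 4-bit accuracy sits only a few points above the viability floor.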
### 3.3 The QuIDE Measurement
Model Architecture and Training. To isolate the effects of numerical quantization from secondary architectural complexities, we use a small but representative convolutional neural network (CNN) as the experimental baseline. The architecture is a four-layer sequence: two convolutional layers with 32 and 64 filters respectively ($3\times 3$ kernels, ReLU activations), each followed by a $2\times 2$ max-pooling operation, and two fully connected layers (128 units and 10 units for dataset-specific classification). The model is trained independently for each target dataset using the Adam optimizer with a learning rate of 0.001 and a batch size of 128. We optimize the cross-entropy objective over 50 epochs, and the checkpoint with the peak validation accuracy is designated as the full-precision baseline, $\mathcal{M}_{FP}$.
Post-Training Quantization (PTQ). We perform uniform post-training quantization (PTQ) on the full-precision baseline, $\mathcal{M}_{FP}$, using a standardized graph-mode optimization framework. The quantization process targets both weights and activations across all convolutional and linear layers for bit-widths $b\in\{32,16,8,4,2\}$. To determine the quantization parameters, we calculate the dynamic range of each layer by evaluating a representative calibration subset of 512 training images drawn uniformly at random (independently per random seed). Following established PTQ best practices Wu et al. ([2020](https://arxiv.org/html/2605.10959#bib.bib6)), we apply symmetric linear quantization to the weight tensors and affine (asymmetric) quantization to the activation maps to accommodate the non-negative distribution resulting from ReLU activations. This procedure generates the set of quantized models, $\mathcal{M}^{(b)}$, that serves as the basis for the subsequent efficiency analysis.
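To make the two quantizers concrete, here is a minimal pure-Python sketch of simulated (quantize then dequantize) symmetric and affine quantization. It is not the graph-mode framework used in the paper; the function names, the min/max range calibration, and the toy tensors are assumptions for illustration only.

```python
def quantize_symmetric(values, bits):
    """Symmetric linear quantization for weights: zero maps to integer 0."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8-bit, 1 for 2-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def quantize_affine(values, bits, lo, hi):
    """Affine (asymmetric) quantization for activations over a calibrated [lo, hi]."""
    qmax = 2 ** bits - 1
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep real zero exactly representable
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = round(-lo / scale)
    out = []
    for v in values:
        q = max(0, min(qmax, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out


def max_abs_error(original, quantized):
    return max(abs(a - b) for a, b in zip(original, quantized))


# Toy tensors (hypothetical values, not from the paper).
weights = [-0.5, 0.1, 0.25, 0.5]
activations = [0.0, 0.3, 1.2, 2.0]  # non-negative, as after ReLU

# The range would normally come from the 512-image calibration subset;
# here it is simply the min/max of the toy activations.
act_lo, act_hi = min(activations), max(activations)

err_w8 = max_abs_error(weights, quantize_symmetric(weights, 8))
err_w2 = max_abs_error(weights, quantize_symmetric(weights, 2))
acts_q8 = quantize_affine(activations, 8, act_lo, act_hi)
err_a8 = max_abs_error(activations, acts_q8)

print(f"weight error: 8-bit {err_w8:.5f}, 2-bit {err_w2:.5f}; activation error 8-bit {err_a8:.5f}")
```

The affine form spends the whole integer range on the non-negative activation interval, whereas a symmetric quantizer would leave the negative half unused after ReLU.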
Metric Measurement Procedure. Each component is measured with a standardized protocol. We calculate Compression ($C$) as the bit-width ratio $C^{(b)}=32/b$, with the full-precision 32-bit model as the unit anchor ($C=1.0$). Predictive Accuracy ($P$) is measured by evaluating each quantized model, $\mathcal{M}^{(b)}$, on the complete, unmodified test set $D_{test}$; it is reported as the percentage of correctly classified instances and enters the index as a fraction in $[0,1]$. To obtain stable Inference Time ($T$) measurements, all latency measurements are performed on an NVIDIA RTX 3090 GPU (24 GB VRAM, CUDA 12.1, PyTorch 2.1) with asynchronous CUDA compilation disabled. Each model is placed in `torch.inference_mode()` and evaluated on a fixed batch of 64 samples. Every measurement consists of 100 warm-up iterations followed by 1,000 timed inference passes; each pass is bracketed by `torch.cuda.synchronize()` calls to ensure accurate GPU timing. We report the median latency per batch across the 1,000 passes, while minimizing background system processes to keep the computational environment isolated and consistent.
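The timing protocol above (warm-up, synchronized timed passes, median) can be sketched as a small harness. This is an illustrative stdlib-only version, not the authors' script: `median_latency_ms`, `run_once`, and `sync` are assumed names, and the demo workload is a CPU stand-in for a model forward pass.

```python
import statistics
import time


def median_latency_ms(run_once, warmup=100, iters=1000, sync=None):
    """Median wall-clock latency of run_once() in milliseconds.

    sync is an optional callable that blocks until queued device work has
    finished; on a CUDA device this would be torch.cuda.synchronize.
    """
    def wait():
        if sync is not None:
            sync()

    for _ in range(warmup):  # warm-up passes are executed but not timed
        run_once()
    wait()

    samples = []
    for _ in range(iters):
        wait()  # bracket each pass so timing covers only this pass
        start = time.perf_counter()
        run_once()
        wait()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


# CPU stand-in workload so the sketch runs anywhere.
demo_ms = median_latency_ms(lambda: sum(range(10_000)), warmup=10, iters=50)
print(f"median latency: {demo_ms:.4f} ms")
```

For the GPU setting described in the text, one would pass `sync=torch.cuda.synchronize` and make `run_once` a forward pass on the fixed 64-sample batch inside a `torch.inference_mode()` context, keeping the default 100 warm-up and 1,000 timed iterations.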
All experiments use three random seeds (0, 42, 123) to quantify calibration-induced variance. Results report mean $\pm$ std across seeds.
### 3.4 Baseline and Alternative Formulations
To validate the necessity of the Intelligence Index ($I$), we compare it against two prevalent yet incomplete heuristic formulations that serve as alternative efficiency baselines. The first, Accuracy-Compression Product (ACP), defined as $\text{ACP}^{(b)}=C^{(b)}\times P^{(b)}$, represents a purely spatial-functional trade-off that fails to account for computational latency. The second, Accuracy per Log-Second (ALS), formulated as $\text{ALS}^{(b)}=P^{(b)}/\log_{2}(T^{(b)}+1)$, provides a temporal-functional perspective but omits the influence of model compression. By contrasting $I$ with these partial metrics, we demonstrate that the proposed formulation can navigate the unified three-dimensional Pareto surface. Throughout this evaluation, the full-precision 32-bit configuration serves as the primary baseline, with the corresponding index $I^{(32)}$ establishing the efficiency floor for deployment analysis.
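The two partial metrics follow directly from their definitions. The sketch below uses illustrative function names and the ResNet-18/CIFAR-10 numbers quoted elsewhere in the text (8-bit: 94.84%, 4-bit: 53.69%, both at 23.34 ms) to show how each one sees only part of the trade-off.

```python
import math


def acp(bits: int, accuracy: float) -> float:
    """Accuracy-Compression Product: C * P, ignoring latency."""
    return (32.0 / bits) * accuracy


def als(accuracy: float, latency_ms: float) -> float:
    """Accuracy per Log-Second: P / log2(T + 1), ignoring compression."""
    return accuracy / math.log2(latency_ms + 1.0)


acp_8b, acp_4b = acp(8, 0.9484), acp(4, 0.5369)
als_8b, als_4b = als(0.9484, 23.34), als(0.5369, 23.34)

# The raw index combines both views: I = ACP / log2(T + 1) = C * ALS.
i_8b = acp_8b / math.log2(23.34 + 1.0)

print(f"ACP: 8-bit {acp_8b:.3f}, 4-bit {acp_4b:.3f}")
print(f"ALS: 8-bit {als_8b:.3f}, 4-bit {als_4b:.3f}")
```

At equal latency, ACP prefers the 4-bit model because compression doubles, while ALS prefers the 8-bit model because it only sees accuracy.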
## 4 Experiments
We conduct three experiments: (EXP-1) PTQ sweep on SimpleCNN (MNIST, CIFAR-10, and CIFAR-100, bit-widths $\{32,16,8,4,2\}$, seeds $\{0,42,123\}$); (EXP-2) PTQ sweep on ResNet-18 (CIFAR-10, same protocol); and (EXP-3) Genetic Algorithm Mixed-Precision Search using $I^{\prime}$ as the fitness function. Together these address: (i) whether $I^{\prime}$ produces a consistent, non-monotonic efficiency landscape; (ii) where the task-dependent Pareto Knee lies across tasks of increasing difficulty; and (iii) whether $I^{\prime}$ correctly gates non-viable configurations that $I$ rewards.
### 4.1 Experimental Setup
Models and Datasets. EXP-1 uses a four-layer SimpleCNN (2 conv + 2 FC) trained for 50 epochs with Adam ($\text{lr}=10^{-3}$, weight decay $10^{-4}$, batch 128) on MNIST ($28\times 28$ grayscale, 60k/10k split), CIFAR-10 ($32\times 32$ RGB, 50k/10k, with random crop and horizontal flip augmentation), and CIFAR-100 ($32\times 32$ RGB, 50k/10k, same augmentation, 100 classes). EXP-2 replaces the backbone with ResNet-18 (CIFAR-10 adapted: $3\times 3$ stem, no first maxpool) trained for 100 epochs with SGD + cosine annealing ($\text{lr}=0.1$, weight decay $5\times 10^{-4}$).
PTQ Protocol. Symmetric weight quantization and affine activation quantization are applied post-training, with quantization ranges calibrated on 512 training images sampled independently per seed via uniform random draw from the training set. Latency $T$ is the median of 1,000 forward passes (batch=64, 100 warm-up, GPU synchronized with `torch.cuda.synchronize()`) on an NVIDIA RTX 3090 (CUDA 12.1, PyTorch 2.1, `torch.inference_mode()`) with asynchronous compilation disabled. Full-precision baselines: $P_{\text{FP}}=79.81\%$ (CIFAR-10 SimpleCNN), $47.98\%$ (CIFAR-100 SimpleCNN), $99.21\%$ (MNIST), $94.93\%$ (ResNet-18 on CIFAR-10).
MPS Agent (EXP-3). A Genetic Algorithm (population=20, 30 generations, mutation rate $\mu=0.15$) searches layer-wise assignments $b\in\{16,8,4,2\}$ using $I^{\prime}$ as the sole fitness function on the saved CIFAR-10 SimpleCNN checkpoint (seed=0). In Table [2](https://arxiv.org/html/2605.10959#S4.T2), HAQ Wang et al. ([2019](https://arxiv.org/html/2605.10959#bib.bib12)) results come from our re-implementation of its search logic, and BRECQ Li et al. ([2021](https://arxiv.org/html/2605.10959#bib.bib13)) results are estimated bounds from published layer-sensitivity heuristics; neither is a fully controlled baseline. The complete algorithm is in Appendix [B](https://arxiv.org/html/2605.10959#A2).
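A search loop with these hyperparameters might look like the sketch below. The paper's actual algorithm is in its Appendix B and is not reproduced here: tournament selection, uniform crossover, and single-individual elitism are assumptions, and `toy_fitness` is a hypothetical surrogate (made-up layer sizes and accuracy penalties, latency held constant) standing in for evaluating $I^{\prime}$ on a real quantized model.

```python
import random

BIT_CHOICES = (16, 8, 4, 2)


def genetic_search(fitness, num_layers, pop_size=20, generations=30,
                   mutation_rate=0.15, seed=0):
    """Search per-layer bit-widths; returns (best_config, best_fitness_history)."""
    rng = random.Random(seed)
    population = [tuple(rng.choice(BIT_CHOICES) for _ in range(num_layers))
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]

    def tournament():
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best]  # elitism: the best configuration always survives
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            child = [rng.choice(pair) for pair in zip(p1, p2)]  # uniform crossover
            child = [rng.choice(BIT_CHOICES) if rng.random() < mutation_rate else g
                     for g in child]  # per-gene mutation
            children.append(tuple(child))
        population = children
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history


def toy_fitness(bits_per_layer):
    """HYPOTHETICAL surrogate for I' (not measured data)."""
    sizes = (1, 20, 400, 1)  # made-up relative parameter counts per layer
    penalty = {16: 0.0, 8: 0.01, 4: 0.08, 2: 0.30}  # made-up accuracy loss per layer
    accuracy = 0.80 - sum(penalty[b] for b in bits_per_layer)
    stored_bits = sum(s * b for s, b in zip(sizes, bits_per_layer))
    compression = 32.0 * sum(sizes) / stored_bits
    return compression * max(accuracy - 0.50, 0.0)  # gated numerator of I'


best, history = genetic_search(toy_fitness, num_layers=4)
print(best, round(history[-1], 3))
```

In a real run, the fitness callable would quantize the checkpoint with the candidate bit-widths, measure accuracy and latency, and return $I^{\prime}$; caching fitness values per configuration would avoid repeated evaluations.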
### 4.2 Results: EXP-1 (SimpleCNN PTQ) and EXP-3 (MPS)
Figure 1: $I$ vs. $I^{\prime}$ across bit-widths for all five experimental conditions.
Figure 2: Accuracy ($P$) and Latency ($T$) component dynamics across precision levels.
Figure [1](https://arxiv.org/html/2605.10959#S4.F1) shows the $I$/$I^{\prime}$ landscape; Figure [2](https://arxiv.org/html/2605.10959#S4.F2) decomposes accuracy and latency. Radar projections and the $P_{thresh}$ ablation are in Appendix [A](https://arxiv.org/html/2605.10959#A1).
Quantization Performance Statistics. Table [1](https://arxiv.org/html/2605.10959#S4.T1) consolidates all five experimental conditions. Variance is low at 8-bit and above ($\sigma_{P}<0.80\%$) and rises at 4-bit for CIFAR-100 ($\sigma_{P}=2.29\%$). On ImageNet, 4-bit accuracy collapses almost completely, as the harder calibration surface amplifies seed sensitivity. The Pareto Knee (peak $I^{\prime}$) for each condition is highlighted in bold.
Table 1: Unified PTQ results across all five experimental conditions. Formula: $I=(C\times P)/\log_{2}(T+1)$, $P\in[0,1]$, $T$ in ms. $P_{thresh}$ is set per Eq. ([4](https://arxiv.org/html/2605.10959#S3.E4)) with dataset-specific $\delta$ values reported in §[3.2](https://arxiv.org/html/2605.10959#S3.SS2). Mean $\pm$ std over seeds $\{0,42,123\}$. Bold: Pareto Knee (peak $I^{\prime}$) per condition.
CIFAR-100 and ImageNet-1K Validate the Complexity-Dependent Pareto Knee. For CIFAR-100 ($P_{\text{FP}}=47.98\pm 0.75\%$, $P_{thresh}=0.05$), 8-bit is again the Pareto Knee ($I^{\prime}=1.841\pm 0.173$), while 4-bit PTQ collapses accuracy to $19.59\%$. At the ImageNet-1K scale (ResNet-18, $P_{\text{FP}}=69.76\%$), this trend is punctuated by a near-total collapse at 4-bit ($0.18\%$), while 8-bit preserves near-full accuracy ($69.36\%$). Together with the CIFAR and MNIST findings, these results confirm a monotonic trend: as task complexity grows (MNIST $\to$ CIFAR $\to$ ImageNet), the severity of 4-bit collapse increases and, for every task harder than MNIST, the Pareto Knee stays at 8-bit.
Large-Scale Validation on ResNet-18. The ResNet-18 rows of Table [1](https://arxiv.org/html/2605.10959#S4.T1) ($P_{\text{FP}}=94.93\pm 0.19\%$) extend the evaluation beyond toy-scale architectures. The index correctly identifies representational collapse at 2-bit ($I^{\prime}=0$), while 8-bit preserves $94.84\%$ accuracy with negligible overhead. The sharp accuracy cliff between 8-bit ($94.84\%$) and 4-bit ($53.69\%$) further validates the $I^{\prime}$ viability gate: the raw $I$ falsely promotes 4-bit ($I_{4b}=0.933>I_{8b}=0.824$), while $I^{\prime}$ correctly demotes it ($I^{\prime}_{4b}=0.064\ll I^{\prime}_{8b}=0.390$).
Inter-metric correlation heatmaps are in Appendix [A](https://arxiv.org/html/2605.10959#A1).
Figure [2](https://arxiv.org/html/2605.10959#S4.F2) decomposes the $P$ and $T$ components across all five conditions. Accuracy $P$ is the dominant driver of $I$: it remains stable through 8-bit and collapses abruptly at 4-bit for CIFAR-10, CIFAR-100, ResNet-18 (CIFAR), and ImageNet-1K. The $I$/$I^{\prime}$ divergence in Figure [1](https://arxiv.org/html/2605.10959#S4.F1) is most visible at 4-bit for these conditions, where the viability gate suppresses the artificially inflated raw index. For MNIST, no such collapse occurs through 4-bit, and $I$ and $I^{\prime}$ track together.
QuIDE-Guided Mixed-Precision Search Results. As detailed in Table [2](https://arxiv.org/html/2605.10959#S4.T2), QuIDE identifies efficient heterogeneous topologies. For CIFAR-10 (SimpleCNN), the uniform 4-bit PTQ baseline achieves only $58.18\%$ accuracy, a severe degradation. The QuIDE-guided Genetic Algorithm (population=20, 30 generations, $\mu=0.15$), using $I^{\prime}$ as the sole fitness function, discovers the heterogeneous topology 8-8-8-4 (Conv1/Conv2/FC1 at 8-bit, FC2 at 4-bit). Evaluated on the full test set, this configuration achieves $75.02\%$ accuracy with a $4.57\times$ compression ratio and $I^{\prime}=0.983$, improving over both the uniform 4-bit baseline ($I^{\prime}=0.633$) and the full-precision reference ($I^{\prime}=0.529$).
On the ImageNet-1K scale (ResNet-18), where 4-bit PTQ collapses to near-zero, QuIDE discovers a mixed-precision topology 8-8-8-8-8-16. This configuration maintains $69.36\%$ accuracy while reaching $I^{\prime}=0.350$. For reference, we re-implemented the HAQ search objective Wang et al. ([2019](https://arxiv.org/html/2605.10959#bib.bib12)) within our experimental framework; QuIDE achieves a higher $I^{\prime}$ than this baseline ($0.350$ vs. $0.225$). This suggests that the $I^{\prime}$ fitness function may offer advantages over traditional latency-penalized rewards for navigating large-scale representational trade-offs, although a direct comparison is limited by differences in search protocol.
Table 2: QuIDE-guided MPS results. HAQ results are from our re-implementation of the search logic in Wang et al. ([2019](https://arxiv.org/html/2605.10959#bib.bib12)) (same PTQ protocol but different search objective); BRECQ figures are *estimated bounds* derived from publicly reported layer-sensitivity heuristics in Li et al. ([2021](https://arxiv.org/html/2605.10959#bib.bib13)). Neither constitutes a fully controlled baseline; direct score comparisons across columns should be treated as approximate.
Figure 3: Pareto Frontier: Mapping of structural boundaries against task accuracy.
Figure 4: Metric Benchmarking: QuIDE index vs. legacy metrics (ACP, ALS).
Figure [4](https://arxiv.org/html/2605.10959#S4.F4) visualizes the configurations explored by the GA. The search improves $I^{\prime}$ over uniform-precision baselines, indicating that $I^{\prime}$ carries a useful fitness signal for bit-width assignment. The HAQ and BRECQ rows in Table [2](https://arxiv.org/html/2605.10959#S4.T2) are approximate references only: they differ in search protocol, objective function, and scope, so direct comparison is not meaningful.
Metric ranking consistency across $I$, $I^{\prime}$, ACP, and ALS is visualized in Appendix [A](https://arxiv.org/html/2605.10959#A1).
### 4.3 Scaling to Deep Architectures (EXP-2: ResNet-18)
The ResNet-18 rows of Table [1](https://arxiv.org/html/2605.10959#S4.T1) ($94.93\pm 0.19\%$ FP) show three findings: (i) 16-bit is lossless ($\Delta P<0.01\%$); (ii) 8-bit is the Pareto Knee ($I^{\prime}=0.390$, retaining $94.84\%$); (iii) 4-bit causes catastrophic collapse ($53.69\%$, a 41 pp drop vs. 21 pp in SimpleCNN), and critically, the raw $I$ falsely ranks 4-bit above 8-bit ($I_{4b}=0.933>I_{8b}=0.824$) while $I^{\prime}$ correctly reverses this ($I^{\prime}_{8b}=0.390\gg I^{\prime}_{4b}=0.064$). This is the sharpest demonstration that the viability gate of $I^{\prime}$ is indispensable.
### 4.4 Scaling to Foundation Models: Case Study on Llama-3-8B
To evaluate whether QuIDE scales to modern transformer-based architectures, we apply the framework to the quantization of the Llama-3-8B-Instruct model. Unlike the CNN benchmarks, where predictive utility $P$ is task accuracy, for LLMs we adopt the MMLU (Massive Multitask Language Understanding) score as the primary utility metric.
As shown in Table [3](https://arxiv.org/html/2605.10959#S4.T3), we map performance across bit-widths from FP16 to 2-bit using state-of-the-art quantization methods (bitsandbytes, AWQ). For this high-stakes reasoning task, we set a conservative threshold $P_{thresh}=60\%$, reflecting the requirement to maintain performance within a marginal drop of the full-precision baseline ($66.6\%$).
Table 3: QuIDE Quantification of Llama-3-8B Quantization Manifold. Predictive utility $P$ is represented by MMLU accuracy. Latency $T$ is estimated per-token throughput on standard hardware. $I^{\prime}$ identifies the 4-bit sweet spot ($I^{\prime}=2.58$) and correctly flags sub-4-bit regimes as non-viable.
The analysis reveals a consequential ranking divergence. The raw Intelligence Index $I$ increases monotonically as bit-widths decrease, falsely identifying 2-bit quantization as the most efficient configuration ($I=70.0$) despite it causing the model to collapse to a near-unusable accuracy of $35.0\%$. Conversely, the accuracy-gated index $I^{\prime}$ identifies 4-bit as the global efficiency peak ($I^{\prime}=2.58$) and correctly nullifies the utility for 3-bit and 2-bit regimes. This alignment with the "4-bit quantization floor" consensus in the LLM community Frantar et al. ([2023](https://arxiv.org/html/2605.10959#bib.bib17)); Lin et al. ([2024](https://arxiv.org/html/2605.10959#bib.bib14)) supports QuIDE as an indicator of deployment viability across task scales, from simple digit recognition to complex language reasoning.
## 5Discussion: Theory–Experiment Consistency Analysis
We check how well the experimental results match the design assumptions behind $I$\.
Claim 1: the multiplicative $C\times P$ form correctly nullifies utility under accuracy collapse\. This claim is fully supported\. In every experimental condition, 2\-bit PTQ collapses accuracy to random chance \(reaching nearly 0\.1% on ImageNet\) while the raw index $I$ remains nominally positive due to the $16\times$ compression factor\. The $I'$ gate drives all of these cases to zero, confirming that the accuracy\-gating mechanism functions as intended across tasks scaling from MNIST to ImageNet\-1K\.
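A minimal Python sketch may make the gating behaviour concrete\. It assumes the gated form $I'=C\cdot\max(P-P_{thresh},0)/\log_{2}(T+1)$ that appears as the fitness in Algorithm 1, with $P$ expressed as a fraction and $T$ in milliseconds\. The function names and the threshold of 0\.5 are illustrative choices, not values taken from the paper's code\.

```python
import math

def intelligence_index(compression, accuracy, latency_ms):
    """Raw index I = (C * P) / log2(T + 1)."""
    return compression * accuracy / math.log2(latency_ms + 1)

def gated_index(compression, accuracy, latency_ms, p_thresh):
    """Gated index I' = C * max(P - P_thresh, 0) / log2(T + 1)."""
    return compression * max(accuracy - p_thresh, 0.0) / math.log2(latency_ms + 1)

# 2-bit PTQ on ImageNet as described in the text: 16x compression,
# accuracy near 0.1%, latency 23.34 ms. P_THRESH = 0.5 is an assumption.
P_THRESH = 0.5
raw = intelligence_index(16, 0.001, 23.34)
gated = gated_index(16, 0.001, 23.34, P_THRESH)

print(raw > 0)   # True: the raw index stays nominally positive
print(gated)     # 0.0: the gate removes the collapsed configuration
```

Any configuration whose accuracy falls at or below the threshold scores exactly zero under the gate, regardless of how large its compression factor is\.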
Claim 2: the $\log_{2}(T+1)$ denominator reflects diminishing marginal cost of latency\. This claim is partially supported, with a nuance\. For SimpleCNN on CIFAR\-10, latency varies only from $0.48$ to $1.16$ ms across all bit\-widths, yielding a $\log_{2}$ denominator ratio of $1.96\times$\. For ResNet\-18, latency is even less discriminating \(ratio $1.39\times$, ranging from $8.87$ ms at 16\-bit to $23.34$ ms at 8/4/2\-bit\)\. In both cases, the latency term exerts a meaningful but secondary influence; the index is predominantly shaped by the $C\times P$ numerator\. This behaviour is consistent with the design rationale, in which latency provides a correction factor rather than a primary ranking signal, but practitioners should be aware that in settings where latency varies widely \(e\.g\., heterogeneous hardware\), the denominator will play a larger role\.
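The two denominator ratios quoted above can be reproduced directly from the reported latencies\. The short sketch below does so; the helper names are hypothetical\.

```python
import math

def log_latency(latency_ms):
    """Denominator of the index: log2(T + 1)."""
    return math.log2(latency_ms + 1)

def denominator_ratio(fast_ms, slow_ms):
    """How much larger the denominator is at the slow end than at the fast end."""
    return log_latency(slow_ms) / log_latency(fast_ms)

simplecnn = denominator_ratio(0.48, 1.16)   # SimpleCNN on CIFAR-10
resnet18 = denominator_ratio(8.87, 23.34)   # ResNet-18

print(f"{simplecnn:.2f} {resnet18:.2f}")    # 1.96 1.39
```

For comparison, the raw latency ratios are roughly $2.4\times$ and $2.6\times$, so the logarithm compresses the ResNet\-18 gap more strongly than the SimpleCNN gap\.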
Claim 3: $I$ and $I'$ agree in ranking except at accuracy\-collapse boundaries\. The ranking analysis reveals a critical case of disagreement\. For MNIST, the rankings are identical: both $I$ and $I'$ order bit\-widths as $[4,8,16,32,2]$\. For CIFAR\-10 \(SimpleCNN\), the rankings diverge: $I$ recommends $[4,8,16,2,32]$ while $I'$ recommends $[8,16,4,32,2]$\. For CIFAR\-100 \(SimpleCNN\), $I$ inflates 4\-bit over 16\-bit \($I_{4b}=1.516>I_{16b}=1.510$\), while $I'$ correctly ranks 8\-bit first and demotes 4\-bit below 16\-bit \($I'_{8b}=1.841\gg I'_{4b}=1.128$\)\. For ResNet\-18, the divergence is sharpest: $I$ incorrectly ranks 4\-bit above 8\-bit \($I_{4b}=0.933>I_{8b}=0.824$\), while $I'$ correctly reverses this \($I'_{8b}=0.390\gg I'_{4b}=0.064$\)\. The divergence always occurs near the accuracy viability boundary, precisely where $I'$ is designed to intervene\. Outside collapse regimes, both metrics agree, validating the formulation\.
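The ResNet\-18 reversal can be checked with a few lines of Python\. The sketch uses only the four scores quoted above for 4\-bit and 8\-bit; the `ranking` helper is a hypothetical name\.

```python
def ranking(scores):
    """Return bit-widths ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

# ResNet-18 scores quoted in the text (4-bit and 8-bit only).
resnet18_raw = {4: 0.933, 8: 0.824}
resnet18_gated = {4: 0.064, 8: 0.390}

raw_order = ranking(resnet18_raw)
gated_order = ranking(resnet18_gated)

print(raw_order)    # [4, 8]
print(gated_order)  # [8, 4]
```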
Claim 4: the optimal bit\-width is a universal 4\-bit “sweet spot\.” This claim is *not* supported by the data and has been revised accordingly\. The optimal bit\-width is architecture\- and task\-dependent\. MNIST and Llama\-3\-8B peak at 4\-bit for $I'$\. CIFAR\-10, CIFAR\-100, ResNet\-18 \(CIFAR\), and ImageNet\-1K all peak at 8\-bit\. The more accurate characterisation, now supported by *six* experimental conditions, is: *the Pareto Knee shifts toward higher bit\-widths as model depth and task complexity increase*, and stabilises at 8\-bit for tasks involving high\-resolution imagery or massive label spaces, while returning to 4\-bit for massive\-scale language models where parameter redundancy is elevated\.
Structural implication\. The experiments reveal that the primary value of $I'$ lies not in the magnitude of its absolute score but in its *ranking correction* relative to $I$ near viability boundaries\. Wherever $I$ and $I'$ agree, the choice of metric is inconsequential\. Where they diverge, as in ResNet\-18 at 4\-bit, $I'$ prevents a consequential deployment error\. This asymmetric utility suggests that $I'$ should always be reported alongside $I$, and that $P_{thresh}$ should be set conservatively \(i\.e\., at or above the minimum acceptable accuracy for the target application\) rather than at the random\-chance floor\.
Practical deployment guidelines derived from experimental findings\. The six experimental conditions yield a set of concrete, empirically grounded recommendations for practitioners deploying quantized models on edge hardware\.
*Guideline 1 \(Start at 16\-bit\):* FP16 quantization is universally lossless across all tested models \($\Delta P<0.01\%$, $2\times$ compression\)\. It should be the default first deployment step, as it recovers half the bit\-width budget at negligible accuracy cost\. Prior work on INT8 inference Wu et al\. \([2020](https://arxiv.org/html/2605.10959#bib.bib6)\); Kim et al\. \([2021](https://arxiv.org/html/2605.10959#bib.bib9)\) corroborates this finding for higher\-precision operating points\.
*Guideline 2 \(Probe task complexity before choosing 4\-bit vs\. 8\-bit\):* The Pareto Knee depends on both model depth and task difficulty\. For shallow models on simple tasks \(e\.g\., SimpleCNN on MNIST\), 4\-bit PTQ is safe and maximises $I'$\. For harder tasks or deeper models \(SimpleCNN on CIFAR\-100; ResNet\-18 on CIFAR\-10\), 4\-bit PTQ causes catastrophic collapse \(28 pp and 41 pp accuracy drops respectively\); 8\-bit is the safe boundary\. A quick two\-point probe, running PTQ at 4\-bit and 8\-bit and computing $I'$, is sufficient to determine which regime applies before committing to a bit\-width allocation\.
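The two\-point probe can be sketched as follows, assuming $C=32/\text{bits}$ and the gated form of $I'$ from Algorithm 1\. The accuracy and latency pairs are hypothetical placeholders chosen to mimic a hard and an easy task; they are not measurements from the paper, and `two_point_probe` is an illustrative name\.

```python
import math

def gated_index(bits, accuracy, latency_ms, p_thresh):
    """I' with uniform compression C = 32 / bits (accuracy as a fraction)."""
    compression = 32 / bits
    return compression * max(accuracy - p_thresh, 0.0) / math.log2(latency_ms + 1)

def two_point_probe(measurements, p_thresh):
    """measurements maps bit-width -> (accuracy, latency_ms); returns the best bit-width."""
    scores = {bits: gated_index(bits, acc, lat, p_thresh)
              for bits, (acc, lat) in measurements.items()}
    return max(scores, key=scores.get), scores

# Hypothetical probe results, for illustration only.
hard_task = {4: (0.54, 23.34), 8: (0.95, 23.34)}  # large accuracy drop at 4-bit
easy_task = {4: (0.98, 1.0), 8: (0.99, 1.0)}      # 4-bit nearly lossless

print(two_point_probe(hard_task, p_thresh=0.5)[0])  # 8
print(two_point_probe(easy_task, p_thresh=0.5)[0])  # 4
```

When 4\-bit accuracy stays close to the 8\-bit value, the doubled compression factor wins; when it falls near the threshold, the gate shrinks the 4\-bit score and 8\-bit is selected\.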
*Guideline 3 \(Use mixed\-precision for the 4\-bit–8\-bit gap\):* When 4\-bit uniform PTQ is unacceptable but 8\-bit is unnecessarily conservative, mixed\-precision search with $I'$ as the fitness function \(as demonstrated by our GA, which found the 8\-8\-8\-4 topology at $I'=0.983$\) provides a principled middle path\. The GA result exceeds both the uniform 4\-bit baseline \($I'=0.633$\) and the full\-precision baseline \($I'=0.529$\) while achieving $4.57\times$ compression, indicating that heterogeneous assignments can be preferable to conservative uniform choices Wang et al\. \([2019](https://arxiv.org/html/2605.10959#bib.bib12)\); Dong et al\. \([2019](https://arxiv.org/html/2605.10959#bib.bib15)\)\.
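The $4.57\times$ figure follows from the compression definition in step 6 of Algorithm 1, $C(c)=32/\text{mean}(c)$, applied to the 8\-8\-8\-4 topology:

$$C(8,8,8,4)=\frac{32}{(8+8+8+4)/4}=\frac{32}{7}\approx 4.57$$

Under the same definition, uniform 8\-bit gives $C=4$ and uniform 4\-bit gives $C=8$, so the mixed assignment sits between the two uniform choices\.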
*Guideline 4 \(Always gate with $I'$, not $I$\):* As shown by the ResNet\-18 4\-bit case, the raw index $I$ can be misleading when compression dominates accuracy in the numerator\. Reporting $I$ alone without $I'$ risks endorsing a non\-deployable configuration\. The $I'$ viability gate should be treated as a mandatory check, with $P_{thresh}$ set to the minimum acceptable task accuracy for the target application rather than the random\-chance floor\.
## 6Conclusion
We proposed QuIDE, built around the Intelligence Index $I=(C\times P)/\log_{2}(T+1)$ and its gated variant $I'$, and validated it on SimpleCNN, ResNet\-18, and Llama\-3\-8B across tasks from MNIST to MMLU\. Three findings stand out\. \(1\) The Pareto Knee depends on the task: 4\-bit PTQ works for simple tasks and large LLMs, while 8\-bit is the safe choice for deep CNNs, especially at ImageNet scale where 4\-bit accuracy collapses entirely\. \(2\) $I'$ fixes a real failure mode of $I$: the raw index can be inflated by extreme compression even when the model is useless; $I'$ suppresses this via an accuracy gate\. \(3\) 16\-bit quantization is lossless across all conditions tested, making it a safe default\. The $I'$ score also serves as a fitness function for mixed\-precision search\. An open question is how to set $P_{thresh}$ automatically rather than per\-dataset\.
## References
- C\. Banbury, V\. J\. Reddi, P\. Torelli, J\. Holleman, N\. Jeffries, C\. Kiraly, P\. Montino, D\. Kanter, S\. Ahmed, D\. Pau, U\. Thakker, A\. Torrini, P\. Warden, J\. Cordaro, G\. D\. Guglielmo, J\. Duarte, S\. Gibellini, V\. Parekh, H\. Tran, N\. Tran, N\. Wenxu, and X\. Xuesong \(2021\)MLPerf tiny benchmark\.arXiv preprint arXiv:2106\.07597\.External Links:2106\.07597,[Link](https://arxiv.org/abs/2106.07597)Cited by:[§1](https://arxiv.org/html/2605.10959#S1.p2.1),[§2](https://arxiv.org/html/2605.10959#S2.p3.1)\.
- ZeroQ: a novel zero shot quantization framework\.InIEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),External Links:[Document](https://dx.doi.org/10.1109/CVPR42600.2020.01371)Cited by:[§2](https://arxiv.org/html/2605.10959#S2.p1.1)\.
- L\. Deng, G\. Li, S\. Han, L\. Shi, and Y\. Xie \(2020\)Model compression and hardware acceleration for neural networks: a comprehensive survey\.InProceedings of the IEEE,External Links:[Document](https://dx.doi.org/10.1109/jproc.2020.2976475),[Link](https://doi.org/10.1109/jproc.2020.2976475)Cited by:[§1](https://arxiv.org/html/2605.10959#S1.p1.1)\.
- Z\. Dong, Z\. Yao, A\. Gholami, M\. W\. Mahoney, and K\. Keutzer \(2019\)HAWQ: hessian AWare quantization of neural networks with mixed\-precision\.InIEEE/CVF International Conference on Computer Vision \(ICCV\),External Links:[Document](https://dx.doi.org/10.1109/iccv.2019.00038)Cited by:[§2\.1](https://arxiv.org/html/2605.10959#S2.SS1.p2.1),[§2](https://arxiv.org/html/2605.10959#S2.p2.1),[§5](https://arxiv.org/html/2605.10959#S5.p10.5)\.
- S\. K\. Esser, J\. L\. McKinstry, D\. Bablani, R\. Appuswamy, and D\. S\. Modha \(2020\)Learned step size quantization\.InInternational Conference on Learning Representations \(ICLR\),External Links:[Link](https://openreview.net/forum?id=rkgO66VKDS)Cited by:[§2](https://arxiv.org/html/2605.10959#S2.p1.1)\.
- E\. Frantar, S\. Ashkboos, T\. Hoefler, and D\. Alistarh \(2023\)GPTQ: accurate post\-training quantization for generative pre\-trained transformers\.InInternational Conference on Learning Representations \(ICLR\),External Links:[Link](https://openreview.net/forum?id=tcbBPnfwxS)Cited by:[§2](https://arxiv.org/html/2605.10959#S2.p1.1),[§4\.4](https://arxiv.org/html/2605.10959#S4.SS4.p3.5)\.
- A\. Gholami, S\. Kim, Z\. Dong, Z\. Yao, M\. W\. Mahoney, and K\. Keutzer \(2022\)A survey of quantization methods for efficient neural network inference\.InLow\-Power Computer Vision: Improving the Efficiency of Artificial Intelligence,External Links:[Document](https://dx.doi.org/10.1201/9781003162810-13),[Link](https://doi.org/10.1201/9781003162810-13)Cited by:[§1](https://arxiv.org/html/2605.10959#S1.p1.1),[§2](https://arxiv.org/html/2605.10959#S2.p1.1)\.
- P\. D\. Grünwald \(2007\)The minimum description length principle\.MIT Press\.External Links:[Document](https://dx.doi.org/10.7551/mitpress/4643.001.0001)Cited by:[§2\.1](https://arxiv.org/html/2605.10959#S2.SS1.p1.1)\.
- S\. Kim, G\. Park, and Y\. Yi \(2021\)Performance evaluation of int8 quantized inference on mobile gpus\.IEEE Access\.External Links:[Document](https://dx.doi.org/10.1109/access.2021.3133100),[Link](https://doi.org/10.1109/access.2021.3133100)Cited by:[§1](https://arxiv.org/html/2605.10959#S1.p2.1),[§5](https://arxiv.org/html/2605.10959#S5.p8.2)\.
- Y\. Li, R\. Gong, X\. Tan, Y\. Yang, P\. Hu, Q\. Zhang, F\. Yu, W\. Wang, and S\. Gu \(2021\)BRECQ: pushing the limit of post\-training quantization by block reconstruction\.InInternational Conference on Learning Representations \(ICLR\),External Links:[Link](https://openreview.net/forum?id=POWv6hDd9XH)Cited by:[§2](https://arxiv.org/html/2605.10959#S2.p1.1),[§4\.1](https://arxiv.org/html/2605.10959#S4.SS1.p3.3),[Table 2](https://arxiv.org/html/2605.10959#S4.T2)\.
- J\. Lin, J\. Tang, H\. Tang, S\. Yang, W\. Chen, W\. Wang, G\. Xiao, X\. Dang, C\. Gan, and S\. Han \(2024\)AWQ: activation\-aware weight quantization for on\-device LLM compression and acceleration\.InProceedings of Machine Learning and Systems \(MLSys\),External Links:[Link](https://proceedings.mlsys.org/paper_files/paper/2024/hash/652b31c3d0a00a3817ce8f68a7fa3561-Abstract-Conference.html)Cited by:[§4\.4](https://arxiv.org/html/2605.10959#S4.SS4.p3.5)\.
- M\. Nagel, R\. A\. Amjad, M\. Van Baalen, C\. Louizos, and T\. Blankevoort \(2020\)Up or down? adaptive rounding for post\-training quantization\.InInternational Conference on Machine Learning \(ICML\),External Links:[Link](https://proceedings.mlr.press/v119/nagel20a.html)Cited by:[§2](https://arxiv.org/html/2605.10959#S2.p1.1)\.
- H\. Qin, R\. Gong, X\. Liu, X\. Bai, J\. Song, and N\. Sebe \(2020\)Binary neural networks: a survey\.Pattern Recognition\.External Links:[Document](https://dx.doi.org/10.1016/j.patcog.2020.107281),[Link](https://doi.org/10.1016/j.patcog.2020.107281)Cited by:[§2](https://arxiv.org/html/2605.10959#S2.p1.1)\.
- M\. M\. H\. Shuvo, S\. K\. Islam, J\. Cheng, and B\. I\. Morshed \(2022\)Efficient acceleration of deep learning inference on resource\-constrained edge devices: a review\.InProceedings of the IEEE,External Links:[Document](https://dx.doi.org/10.1109/jproc.2022.3226481),[Link](https://doi.org/10.1109/jproc.2022.3226481)Cited by:[§2](https://arxiv.org/html/2605.10959#S2.p3.1)\.
- N\. Tishby and N\. Zaslavsky \(2015\)Deep learning and the information bottleneck principle\.In2015 IEEE Information Theory Workshop \(ITW\),Vol\.,pp\. 1–5\.External Links:[Document](https://dx.doi.org/10.1109/ITW.2015.7133169)Cited by:[§2\.1](https://arxiv.org/html/2605.10959#S2.SS1.p1.1)\.
- K\. Wang, Z\. Liu, Y\. Lin, J\. Lin, and S\. Han \(2019\)HAQ: hardware\-aware automated quantization with mixed precision\.InIEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),External Links:[Document](https://dx.doi.org/10.1109/cvpr.2019.00545)Cited by:[§2\.1](https://arxiv.org/html/2605.10959#S2.SS1.p2.1),[§2](https://arxiv.org/html/2605.10959#S2.p2.1),[§4\.1](https://arxiv.org/html/2605.10959#S4.SS1.p3.3),[§4\.2](https://arxiv.org/html/2605.10959#S4.SS2.p8.6),[Table 2](https://arxiv.org/html/2605.10959#S4.T2),[§5](https://arxiv.org/html/2605.10959#S5.p10.5)\.
- T\. Wang, K\. Wang, H\. Cai, J\. Lin, Z\. Liu, H\. Wang, Y\. Lin, and S\. Han \(2020\)APQ: joint search for network architecture, pruning and quantization policy\.InIEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),External Links:[Document](https://dx.doi.org/10.1109/cvpr42600.2020.00215),[Link](https://doi.org/10.1109/cvpr42600.2020.00215)Cited by:[§1](https://arxiv.org/html/2605.10959#S1.p2.1),[§2](https://arxiv.org/html/2605.10959#S2.p2.1),[§2](https://arxiv.org/html/2605.10959#S2.p3.1)\.
- H\. Wu, P\. Judd, X\. Zhang, M\. Isaev, and P\. Micikevicius \(2020\)Integer quantization for deep learning inference: principles and empirical evaluation\.arXiv \(Cornell University\)\.External Links:[Document](https://dx.doi.org/10.48550/arxiv.2004.09602),[Link](https://doi.org/10.48550/arxiv.2004.09602)Cited by:[§2](https://arxiv.org/html/2605.10959#S2.p1.1),[§3\.3](https://arxiv.org/html/2605.10959#S3.SS3.p2.3),[§5](https://arxiv.org/html/2605.10959#S5.p8.2)\.
## Appendix AExtended Visualization and Component Analysis
The multi\-dimensional trade\-offs between accuracy, compression, and latency are further elucidated through the visualizations and rankings presented in this section\.
Visualization Suite\. The holistic footprint of QuIDE components is further explored through radar projections and correlation heatmaps\. Figure [5](https://arxiv.org/html/2605.10959#A1.F5) demonstrates the adaptability of the Intelligence Index to heterogeneous deployment requirements, revealing how the calibration of the accuracy threshold \($P_{thresh}$\) shifts the efficiency peak between high\-throughput and safety\-critical regimes\. The radar projection in Figure [7](https://arxiv.org/html/2605.10959#A1.F7) offers a geometric intuition for model capacity, where the inscribed area correlates with the scalar Intelligence Index score\. The complex interplay between constituent metrics is further elucidated by the inter\-dependency heatmap in Figure [8](https://arxiv.org/html/2605.10959#A1.F8), which highlights the stochastic coupling between precision transitions\.
Figure 5: Ablation Analysis of $P_{thresh}$: Impact of accuracy thresholds on peak efficiency\.

Figure 6: The Accuracy\-Compression Boundary: Visualization of the trade\-off manifold\.

Figure 7: Holistic Performance Footprint: Radar visualization of QuIDE components\.

Figure 8: Correlation Analysis: Inter\-dependency heatmap visualizing stochastic coupling\.

Relative Metric Rankings\. We evaluate the consistency of the Intelligence Index against alternative composite metrics by ranking the relative performance of the quantized model ensemble, as detailed in Table [4](https://arxiv.org/html/2605.10959#A1.T4)\. The Accuracy\-Compression Product \($C\times P$\) mirrors the trajectory of the raw Intelligence Index, failing to penalize catastrophic accuracy loss\. Conversely, Accuracy per Log\-Latency \($P/\log_{2}(T+1)$\) favors high\-precision configurations, failing to reward efficiency gains at intermediate bit\-widths\. The refined index $I'$ provides a balanced, dataset\-dependent ranking, supporting its use as a deterministic guide for optimal bit\-width selection\.
Table 4: Bit\-width Priority Ranking: rank 1 = highest metric score\. Bold marks the metric’s top\-ranked bit\-width\. $I$ and $ACP$ agree where no accuracy collapse occurs; $I'$ corrects them near collapse boundaries\. $ALS=P/\log_{2}(T+1)$ consistently favours low\-latency high\-precision configurations\.
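The qualitative behaviour of the four composite metrics can be illustrated with a small Python sketch\. The accuracy and latency values below are hypothetical placeholders, chosen so that 4\-bit sits just above an assumed threshold of 0\.5; they are not the entries of Table 4\. The gated form of $I'$ is taken from the fitness in Algorithm 1\.

```python
import math

# Hypothetical (accuracy, latency_ms) per bit-width, for illustration only.
MEASUREMENTS = {32: (0.95, 10.0), 16: (0.95, 9.0), 8: (0.94, 20.0),
                4: (0.55, 20.0), 2: (0.10, 20.0)}
P_THRESH = 0.5  # assumed viability threshold

def metrics(bits, accuracy, latency_ms):
    """Compute the four composite metrics for one uniform bit-width."""
    compression = 32 / bits
    log_t = math.log2(latency_ms + 1)
    return {
        "I": compression * accuracy / log_t,
        "I_gated": compression * max(accuracy - P_THRESH, 0.0) / log_t,
        "ACP": compression * accuracy,   # Accuracy-Compression Product
        "ALS": accuracy / log_t,         # Accuracy per Log-Latency
    }

def rank_by(metric):
    """Bit-widths ordered from best to worst under the named metric."""
    scores = {b: metrics(b, p, t)[metric] for b, (p, t) in MEASUREMENTS.items()}
    return sorted(scores, key=scores.get, reverse=True)

for name in ("I", "ACP", "I_gated", "ALS"):
    print(name, rank_by(name))
```

On this toy data $I$ and $ACP$ both place 4\-bit first, $ALS$ places 16\-bit first, and the gated index places 8\-bit first, which matches the pattern described in the text\.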
## Appendix BGenetic Algorithm for Mixed\-Precision Search \(QuIDE\-GA\)
To ensure full reproducibility of the adversarial benchmark presented in Section 5, we detail the complete evolutionary trajectory governed by the QuIDE fitness function\. The Mixed\-Precision Search \(MPS\) operates over the discrete configuration manifold $\mathcal{A}=\{16,8,4,2\}^{L}$, where $L=4$ represents the architectural depth of the CNN constraint environment\.
Algorithm [1](https://arxiv.org/html/2605.10959#alg1) defines the search procedure, which is stochastic in its initialization, parent selection, and mutation steps\. It shows how the Intelligence Index \($I'$\) moves from a passive ranking metric to an active, gradient\-free structural loss function capable of guiding the agent past heuristic boundaries\.
Algorithm 1 QuIDE\-Guided Mixed\-Precision Search
Input: Full\-Precision Model $\mathcal{M}_{FP}$, Search Manifold $\mathcal{A}$, Latency Penalty $\log_{2}(T+1)$, Accuracy Threshold $P_{thresh}$
Input: Iterations $G=30$, Population Size $N=20$, Mutation Probability $\mu=0.15$
1: Initialize population $\mathcal{P}\leftarrow\{c_{1},c_{2},\dots,c_{N}\}$, where each underlying topology $c_{i}\sim\mathcal{A}$ is sampled randomly
2: for $g=1$ to $G$ do
3:   $\mathcal{S}\leftarrow\emptyset$ \{Fitness buffer\}
4:   for each topology configuration $c_{i}\in\mathcal{P}$ do
5:     Obtain discrete quantized constraint networks $\mathcal{M}^{(c_{i})}$
6:     Calculate Geometric Compression: $C(c_{i})=32/\text{mean}(c_{i})$
7:     Evaluate Task Accuracy: $P(c_{i})$ on Target Benchmark
8:     Measure Edge\-Emulated Hardware Latency: $T(c_{i})$
9:     Compute Fitness: $F(c_{i})=\frac{C(c_{i})\times\max(P(c_{i})-P_{thresh},0)}{\log_{2}(T(c_{i})+1)}$
10:     Append $(c_{i},F(c_{i}))$ to $\mathcal{S}$
11:   end for
12:   Sort $\mathcal{P}$ descending by intelligence scores in $\mathcal{S}$
13:   $\mathcal{P}_{next}\leftarrow\text{Top}(K=5)$ configurations \{Strict Elitism Preservation\}
14:   while $|\mathcal{P}_{next}|<N$ do
15:     Select structural parents $p_{1},p_{2}\propto F(c)$ using Roulette Wheel
16:     Crossover $c_{child}\leftarrow\text{UniformCrossover}(p_{1},p_{2})$
17:     if $\text{rand}()<\mu$ then
18:       $c_{child}\leftarrow\text{RandomMutation}(c_{child},\mathcal{A})$ \{Jump local minima\}
19:     end if
20:     $\mathcal{P}_{next}\leftarrow\mathcal{P}_{next}\cup\{c_{child}\}$
21:   end while
22:   Update Environment: $\mathcal{P}\leftarrow\mathcal{P}_{next}$
23: end for
24: return $c_{best}=\arg\max_{c\in\mathcal{P}}F(c)$ \{Yields optimal heterogeneous topology \(e\.g\. 8\-4\-4\-2\)\}
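As a rough illustration of Algorithm 1, the following Python sketch runs the same loop \(elitism with $K=5$, roulette\-wheel selection, uniform crossover, random mutation\) using the stated hyperparameters\. The `toy_evaluate` function is a hypothetical stand\-in for steps 5, 7, and 8: its accuracy and latency models are invented for demonstration and do not reproduce the paper's measurements\. The threshold of 0\.5 and the seed are likewise assumptions\.

```python
import math
import random

BITS = (16, 8, 4, 2)   # per-layer bit-width choices
NUM_LAYERS = 4         # L = 4 in the text

def toy_evaluate(config):
    """Hypothetical stand-in for quantize/evaluate/measure: (accuracy, latency_ms)."""
    drop = {16: 0.0, 8: 0.01, 4: 0.08, 2: 0.30}    # made-up per-layer accuracy loss
    accuracy = max(0.95 - sum(drop[b] for b in config), 0.0)
    latency_ms = 1.0 + 0.01 * sum(config)           # made-up latency model
    return accuracy, latency_ms

def fitness(config, p_thresh=0.5, evaluate=toy_evaluate):
    """F(c) = C(c) * max(P(c) - P_thresh, 0) / log2(T(c) + 1), with C = 32 / mean(c)."""
    compression = 32 / (sum(config) / len(config))
    accuracy, latency_ms = evaluate(config)
    return compression * max(accuracy - p_thresh, 0.0) / math.log2(latency_ms + 1)

def search(generations=30, pop_size=20, mutation_prob=0.15, elite=5, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.choice(BITS) for _ in range(NUM_LAYERS))
                  for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        # Small epsilon keeps roulette-wheel weights valid if all fitnesses are zero.
        weights = [fitness(c) + 1e-9 for c in ranked]
        next_pop = ranked[:elite]                    # elitism: keep the top K unchanged
        while len(next_pop) < pop_size:
            p1, p2 = rng.choices(ranked, weights=weights, k=2)
            child = [rng.choice(pair) for pair in zip(p1, p2)]  # uniform crossover
            if rng.random() < mutation_prob:
                child[rng.randrange(NUM_LAYERS)] = rng.choice(BITS)
            next_pop.append(tuple(child))
        population = next_pop
    return max(population, key=fitness), history

best, history = search()
print(best, round(fitness(best), 3))
```

Because the top five configurations are carried over unchanged and the toy evaluator is deterministic, the best fitness per generation never decreases\. In a real setting `toy_evaluate` would be replaced by actual quantization, benchmark evaluation, and latency measurement\.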