Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2

Summary

This paper presents Tail-Aware HiFloat4, a W4A4 post-training quantization method for the Wan2.2 text-to-video diffusion model, which uses activation-tail-aware percentile calibration to mitigate outlier effects while preserving HiFloat4 arithmetic.

arXiv:2605.26628v1

# W4A4 Post-Training Quantization for Wan2.2Zhanfeng Feng and Shuai Guo contributed equally to this work.
Source: [https://arxiv.org/html/2605.26628](https://arxiv.org/html/2605.26628)
## Tail\-Aware HiFloat4: W4A4 Post\-Training Quantization for Wan2\.2

###### Abstract

This report describes Tail\-Aware HiFloat4, our submission to the low\-bit text\-to\-video generation quantization challenge\. Our method adapts the public ViDiT\-Q post\-training quantization pipeline to Wan2\.2 under the HiFloat4 numerical format\. We quantize the main linear layers in both Wan2\.2 transformer modules with W4A4 HiFloat4 fake quantization, keep numerically sensitive boundary modules in high precision, and introduce an activation\-tail\-aware percentile calibration module for channel\-mask construction\. Together with compact PTQ\-state restoration, this design reduces the influence of rare calibration outliers while keeping the runtime HiFloat4 arithmetic and sampling pipeline unchanged\.

## I Introduction

Diffusion models have become an important family of visual generators\[[8](https://arxiv.org/html/2605.26628#bib.bib1),[18](https://arxiv.org/html/2605.26628#bib.bib2)\], and diffusion transformers further improve scalability for high\-resolution image and video synthesis\[[17](https://arxiv.org/html/2605.26628#bib.bib3),[20](https://arxiv.org/html/2605.26628#bib.bib4),[9](https://arxiv.org/html/2605.26628#bib.bib5),[7](https://arxiv.org/html/2605.26628#bib.bib6),[12](https://arxiv.org/html/2605.26628#bib.bib7),[21](https://arxiv.org/html/2605.26628#bib.bib8)\]\. Their quality is largely supported by large transformer blocks, wide feed\-forward networks, and attention projections that are repeatedly evaluated over denoising timesteps\. This makes inference memory\-intensive and computationally expensive\. The cost is especially visible in video generation, where temporal length and spatial resolution multiply the number of latent tokens processed by the model\.

Low\-bit inference is therefore attractive for deployment\. Post\-training quantization \(PTQ\) is particularly relevant because it does not require retraining a large generative model\. However, aggressive W4A4 quantization is difficult for diffusion transformers\. Activations vary across denoising timesteps, spatial and temporal tokens, prompt conditions, and classifier\-free guidance branches\. A calibration statistic that is too conservative can preserve rare outliers but waste the limited 4\-bit representation range on values that appear only occasionally\. Conversely, overly aggressive clipping may damage semantic fidelity and temporal consistency\.

The ICME 2026 low\-bit large model quantization challenge focuses on this practical regime\. In the HiFloat4 track, the model must use the provided 4\-bit floating\-point format and keep at most a small number of transformer blocks in high precision\. Our solution follows the public ViDiT\-Q PTQ workflow for diffusion transformers\[[27](https://arxiv.org/html/2605.26628#bib.bib11)\], but adapts it to the Wan2\.2 challenge pipeline\. Wan2\.2\-I2V\-A14B\[[22](https://arxiv.org/html/2605.26628#bib.bib9)\] contains two transformer modules and is evaluated through an image\-to\-video interface\. Since the task is prompt\-driven, we keep a fixed placeholder image condition for both calibration and inference, so that the comparison isolates the effect of quantization\.

This paper presents Tail\-Aware HiFloat4, a concise W4A4 HiFloat4 PTQ system for Wan2\.2\. The system has three design goals\. First, it should obey the challenge numerical format by using HiFloat4 for both weights and activations in the target linear layers\. Second, it should be stable enough for the dual\-transformer Wan2\.2 backbone by retaining boundary modules, such as embedding and output projection layers, in high precision\. Third, it should make the calibration stage less sensitive to activation tails through percentile\-based statistics\.

The contributions are summarized as follows:

- We adapt a ViDiT\-Q\-style PTQ pipeline to Wan2\.2\-I2V\-A14B and quantize the main linear layers in both Wan2\.2 transformer modules with HiFloat4 W4A4 fake quantization\.
- We introduce tail\-aware percentile activation statistics for channel\-mask construction, replacing a hard maximum statistic in the SmoothQuant\-style balancing path\.
- We provide a compact PTQ checkpoint format that stores inference\-time quantization deltas rather than duplicating the full BF16 transformer weights\.
- We evaluate the resulting W4A4 system under matched Wan2\.2 generation settings and report the main quality trade\-offs across imaging quality, aesthetic quality, consistency, and motion smoothness metrics\.

## II Related Work

### II\-A Post\-Training Quantization

Post\-training quantization compresses pretrained neural networks without the cost of full retraining\. Early deployment\-oriented work showed that quantized integer inference can substantially reduce memory traffic and arithmetic cost for convolutional models\[[11](https://arxiv.org/html/2605.26628#bib.bib12)\]\. Subsequent PTQ methods improved calibration and rounding behavior, including adaptive rounding of pretrained weights\[[16](https://arxiv.org/html/2605.26628#bib.bib13)\] and layer\-wise reconstruction for very large transformers\[[5](https://arxiv.org/html/2605.26628#bib.bib14)\]\. Transformer PTQ has also studied activation outliers and mixed numerical paths\. Dettmers et al\.\[[4](https://arxiv.org/html/2605.26628#bib.bib15)\] isolate outlier features for large transformers, while ZeroQuant combines group\-wise quantization and distillation for large transformers\[[25](https://arxiv.org/html/2605.26628#bib.bib16)\]\. SmoothQuant\[[24](https://arxiv.org/html/2605.26628#bib.bib17)\] transfers part of the activation quantization difficulty into the weights through per\-channel scaling, AWQ\[[14](https://arxiv.org/html/2605.26628#bib.bib18)\] further emphasizes activation\-aware protection of salient weights, and QuaRot\[[1](https://arxiv.org/html/2605.26628#bib.bib19)\] removes outliers through equivalent rotations\. Our method follows the activation\-aware scaling viewpoint, but focuses on how the activation statistic should be estimated for W4A4 video generation under a fixed HiFloat4 format\.

### II\-B Quantization for Diffusion Transformers

Diffusion models differ from standard discriminative networks because the same network is evaluated over many denoising timesteps and conditioning states\. PTQ4DM\[[19](https://arxiv.org/html/2605.26628#bib.bib20)\] and Q\-Diffusion\[[13](https://arxiv.org/html/2605.26628#bib.bib21)\] show that timestep\-dependent activation distributions make diffusion quantization substantially different from standard image classification PTQ\. EfficientDM\[[6](https://arxiv.org/html/2605.26628#bib.bib22)\] studies efficient quantization\-aware adaptation for low\-bit diffusion models\. More recent work focuses on transformer\-based generators: PTQ4DiT\[[23](https://arxiv.org/html/2605.26628#bib.bib23)\], Q\-DiT\[[2](https://arxiv.org/html/2605.26628#bib.bib24)\], and VQ4DiT\[[3](https://arxiv.org/html/2605.26628#bib.bib25)\] study post\-training quantization for diffusion transformers, where attention and feed\-forward projections dominate the arithmetic\. ViDiT\-Q\[[27](https://arxiv.org/html/2605.26628#bib.bib11)\] further identifies multiple axes of activation variation, including timestep\-wise, token\-wise, condition\-wise, and channel\-wise variation\. It combines calibration, channel balancing, optional rotation, and mixed precision for sensitive modules\. We use ViDiT\-Q as the public PTQ foundation, but target the Wan2\.2 challenge setting and the HiFloat4 format\.

### II\-C Video Generation Evaluation

Text\-to\-video generation has evolved from early transformer and diffusion systems to larger open video foundation models\[[20](https://arxiv.org/html/2605.26628#bib.bib4),[9](https://arxiv.org/html/2605.26628#bib.bib5),[7](https://arxiv.org/html/2605.26628#bib.bib6),[12](https://arxiv.org/html/2605.26628#bib.bib7),[21](https://arxiv.org/html/2605.26628#bib.bib8)\]\. Its quality is multi\-dimensional: a quantized model can maintain frame\-level aesthetic quality while degrading subject identity, prompt alignment, or temporal coherence\. VBench\[[10](https://arxiv.org/html/2605.26628#bib.bib26)\] evaluates video generation with dimensions such as imaging quality, aesthetic quality, overall consistency, subject consistency, and motion smoothness\. These metrics are useful for diagnosing low\-bit video generation because the failure modes of W4A4 quantization are not captured by a single scalar score\.

### II\-D HiFloat4 Format

HiFloat4 is a 4\-bit floating\-point\-oriented numerical format for low\-bit inference\[[15](https://arxiv.org/html/2605.26628#bib.bib27)\]\. In the challenge setting, the numerical format is fixed, so the method should not redefine the representation\. We therefore treat HiFloat4 as a prescribed quantize\-dequantize map for both weights and activations\. The proposed percentile calibration only changes the PTQ statistics used to prepare channel masks; it does not change the HiFloat4 arithmetic\.

## III Preliminaries

### III\-A PTQ Notation

We consider post\-training quantization of a pretrained generator $f_{\theta}$ without updating the original parameters by gradient training\. For a tensor $z$, a generic fake\-quantization operator can be written as

$$\mathcal{D}(\mathcal{Q}(z;\Delta,\mathcal{C});\Delta), \tag{1}$$

where $\mathcal{Q}$ maps values to a low\-precision code set $\mathcal{C}$ under scale or format parameter $\Delta$, and $\mathcal{D}$ maps the low\-precision code back to floating point for simulation\. Integer PTQ typically chooses a uniform code set with explicit scale and zero point\. In this work, the code set and dequantization behavior are determined by the HiFloat4 toolkit\. We therefore write the weight and activation paths as

$$\hat{W}=\operatorname{HiF4}(W),\quad \hat{x}=\operatorname{HiF4}(x), \tag{2}$$

where $\operatorname{HiF4}(\cdot)$ denotes the released quantize\-dequantize operator\. The PTQ algorithm is responsible for preparing transformed tensors that are easier for this fixed operator to represent\.
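The quantize-dequantize structure of Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not the HiFloat4 toolkit: the code set below uses the magnitudes of an E2M1-style 4-bit float as a stand-in, and the names `CODE_SET`, `quantize`, `dequantize`, and `fake_quant` are assumptions introduced only for this example.

```python
# Hypothetical stand-in code set: magnitudes of an E2M1-style 4-bit float.
# This is NOT the HiFloat4 grid; it only illustrates the shape of Eq. (1).
MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
CODE_SET = [-m for m in reversed(MAGNITUDES[1:])] + MAGNITUDES


def quantize(z, delta):
    """Q: scale by 1/delta and return the index of the nearest code."""
    return [
        min(range(len(CODE_SET)), key=lambda k: abs(CODE_SET[k] - v / delta))
        for v in z
    ]


def dequantize(codes, delta):
    """D: map code indices back to floating point."""
    return [CODE_SET[k] * delta for k in codes]


def fake_quant(z, delta):
    """Quantize then dequantize, so the result stays in floating point."""
    return dequantize(quantize(z, delta), delta)


z = [0.2, 1.2, 5.5, -7.0]
z_hat = fake_quant(z, delta=1.0)  # -7.0 saturates at the largest magnitude
```

Because the code set is fixed, the only freedom left to a PTQ method is how the tensor is transformed before it reaches `fake_quant`, which is the role of the channel masks discussed next.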

### III\-B Channel Balancing

Channel balancing uses a diagonal mask to redistribute dynamic range between activations and weights before quantization\. For a linear layer, the floating\-point computation is unchanged by

$$xW^{T}=(xM)(WM^{-1})^{T},\quad M=\operatorname{diag}(m). \tag{3}$$

After quantization, however, the choice of $m$ changes the approximation error because $\operatorname{HiF4}(xM)$ and $\operatorname{HiF4}(WM^{-1})$ have different effective ranges\. SmoothQuant\-style methods construct $m$ from weight and activation magnitudes\. This makes calibration statistics central to W4A4 performance: an activation estimate that over\-emphasizes rare outliers can make the common activation range too coarse, while an estimate that clips too aggressively can remove important semantic signal\.
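The floating-point equivalence in Eq. (3) is easy to check numerically. The sketch below uses small hand-picked matrices and a hand-picked mask (all illustrative, with powers of two so the arithmetic is exact); `linear` and `balance` are hypothetical helper names.

```python
def linear(x, W):
    """Compute x W^T for x of shape (tokens, in) and W of shape (out, in)."""
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in W] for row in x]


def balance(x, W, m):
    """Apply M = diag(m) to the activations and M^{-1} to the weights."""
    x_scaled = [[v * mi for v, mi in zip(row, m)] for row in x]
    W_scaled = [[v / mi for v, mi in zip(w_row, m)] for w_row in W]
    return x_scaled, W_scaled


# Channel 1 of x has a much larger magnitude than channel 0.
x = [[1.0, -8.0], [0.5, 4.0]]
W = [[0.25, 0.5], [-1.0, 2.0]]
m = [2.0, 0.25]  # illustrative mask, not derived from calibration

x_s, W_s = balance(x, W, m)
y_ref = linear(x, W)
y_bal = linear(x_s, W_s)

# Largest activation magnitude before and after balancing.
range_before = max(abs(v) for row in x for v in row)
range_after = max(abs(v) for row in x_s for v in row)
```

Here `y_bal` matches `y_ref` while the activation range shrinks, at the cost of a wider weight range. Once both tensors pass through a 4-bit quantizer, that trade-off is what determines the error.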

### III\-C Wan2\.2 Challenge Setting

Wan2\.2\-I2V\-A14B\[[22](https://arxiv.org/html/2605.26628#bib.bib9)\] is an image\-to\-video model with two transformer modules\. In the challenge protocol, text prompts provide the semantic condition and a fixed placeholder image provides the required image condition\. The target layers are linear projections inside the transformer modules\. The method does not change the scheduler, prompt processing, video resolution, number of frames, or HiFloat4 arithmetic\. This separation is important for reproducibility: differences between BF16 and W4A4 runs should come from quantization and PTQ state restoration, not from a changed sampling protocol\.

## IV Method

### IV\-A Overview

Our pipeline consists of calibration, PTQ state construction, and quantized inference\. Calibration runs the BF16 Wan2\.2 model on a small prompt set and estimates input activation statistics for linear layers in both transformer modules\. PTQ replaces the selected linear projections with HiFloat4 quantized counterparts, constructs per\-channel masks from the collected statistics, and forms a compact quantization state\. Quantized inference applies this state to the original Wan2\.2 checkpoint and generates videos with the same sampling configuration\.

Fig\. [1](https://arxiv.org/html/2605.26628#S4.F1) shows the procedure\. The calibration and inference interfaces both use the same fixed placeholder image condition because the Wan2\.2 generator follows an image\-to\-video formulation\. Consequently, the BF16 and W4A4 runs share the same sampling protocol, and the remaining differences are attributable to quantization and prompt variation\. Algorithm [1](https://arxiv.org/html/2605.26628#alg1) gives the layer\-wise PTQ procedure used to build the compact state\.

![Refer to caption](https://arxiv.org/html/2605.26628v1/fig2.png)

Figure 1: Pipeline of the proposed Tail\-Aware HiFloat4 W4A4 PTQ system for Wan2\.2\. Calibration collects activation statistics from the BF16 model, PTQ builds tail\-aware percentile\-calibrated channel masks and compact states, and inference restores those states while keeping the base sampler unchanged\.

Algorithm 1: Tail\-Aware Percentile\-Calibrated HiF4 PTQ

Input: base model $f_{\theta}$, calibration prompts $\mathcal{P}$, percentile $p$, balancing coefficient $\alpha$

Output: compact PTQ state $\mathcal{S}$

1. for each target linear layer $\ell$ do
2. collect input activations $x_{\ell}$ on $\mathcal{P}$
3. $a_{i}\leftarrow Q_{p}(\{|x_{\ell,j,i}|\}_{j})$
4. $w_{i}\leftarrow\max_{o}|W_{\ell,o,i}|$
5. $m_{i}\leftarrow w_{i}^{\alpha}/(a_{i}+\epsilon)^{1-\alpha}$
6. $\tilde{W}_{\ell}\leftarrow W_{\ell}\operatorname{diag}(m)^{-1}$
7. $\hat{W}_{\ell}\leftarrow\operatorname{HiF4}(\tilde{W}_{\ell})$
8. store the layer mask and quantization state in $\mathcal{S}$
9. end for
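The per-layer loop of Algorithm 1 can be sketched as plain Python under some assumptions: the percentile uses linear interpolation between order statistics (the paper does not specify the estimator), the mask is clamped to `eps` to avoid division by zero, and `toy_hif4` is a hypothetical rounding stand-in for the released HiFloat4 operator. The value `p=50` is chosen only so that the three-sample toy data has an obvious answer.

```python
import math


def percentile(values, p):
    """Empirical p-th percentile (p in [0, 100]), linear interpolation (assumed)."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def ptq_layer(x, W, p, alpha, hif4, eps=1e-8):
    """One loop iteration: x is (samples, in), W is (out, in)."""
    n_in = len(W[0])
    # Step 3: percentile of |x| for each input channel.
    a = [percentile([abs(row[i]) for row in x], p) for i in range(n_in)]
    # Step 4: maximum |W| over the output dimension.
    w = [max(abs(w_row[i]) for w_row in W) for i in range(n_in)]
    # Step 5: channel mask; the clamp to eps is an added safeguard.
    m = [max(w[i] ** alpha / (a[i] + eps) ** (1.0 - alpha), eps) for i in range(n_in)]
    # Step 6: move the inverse mask into the weights.
    W_tilde = [[w_row[i] / m[i] for i in range(n_in)] for w_row in W]
    # Step 7: quantize-dequantize the balanced weights.
    W_hat = [[hif4(v) for v in w_row] for w_row in W_tilde]
    return {"mask": m, "W_hat": W_hat}


def toy_hif4(v):
    """Hypothetical stand-in for HiF4: round to a grid of step 0.25."""
    return round(v * 4.0) / 4.0


x = [[0.5, -2.0], [1.0, 4.0], [-1.5, 100.0]]  # channel 1 has one outlier
W = [[1.0, 0.5], [-2.0, 0.25]]
# Step 8: store the per-layer result in the state S.
S = {"layer0": ptq_layer(x, W, p=50.0, alpha=0.5, hif4=toy_hif4)}
```

With the percentile statistic, the outlier value 100.0 in channel 1 does not drive the mask; a maximum statistic would have used it directly.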

### IV\-B HiFloat4 Quantized Linear Layers

Let a floating\-point linear layer be

$$y=xW^{T}+b, \tag{4}$$

where $x$ is the input activation, $W$ is the weight matrix, and $b$ is the bias\. For each selected projection, the weight path applies HiFloat4 quantize\-dequantize to the channel\-balanced weight, while the activation path applies the same numerical format to the online input activation\. The quantization axes follow the natural matrix dimensions of the linear operation: weights are quantized over output channels and activations over feature channels\.

Wan2\.2\-I2V\-A14B contains two transformer modules, and we apply the same conversion rule to both\. Matched attention projections and feed\-forward projections are quantized\. Normalization layers, embedding\-related layers, patch embedding, final projection, and output\-head style modules remain in BF16 for stability\. The configuration also supports retaining at most two transformer blocks in high precision across both transformer modules, consistent with the HiFloat4 challenge rule\. In the submitted configuration, no whole transformer block is retained in high precision\.
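The selection rule above can be pictured as a name-based filter over linear layers. The sketch below is only an illustration: the keyword lists and the layer names are hypothetical and are not taken from the Wan2.2 code base, whose module names may differ.

```python
# Hypothetical keyword lists; real Wan2.2 module names may differ.
KEEP_HIGH_PRECISION = ("norm", "embedding", "patch_embed", "head", "proj_out")
QUANTIZE = ("attn", "ffn")


def select_precision(layer_names):
    """Assign each layer either HiFloat4 W4A4 or BF16 based on its name."""
    plan = {}
    for name in layer_names:
        if any(key in name for key in KEEP_HIGH_PRECISION):
            plan[name] = "bf16"  # boundary or normalization module
        elif any(key in name for key in QUANTIZE):
            plan[name] = "hif4_w4a4"  # attention or feed-forward projection
        else:
            plan[name] = "bf16"  # default to high precision when unsure
    return plan


names = [
    "blocks.0.attn.q",
    "blocks.0.ffn.0",
    "blocks.0.norm1",
    "time_embedding.0",
    "head.head",
]
plan = select_precision(names)
```

Checking the high-precision keywords first means a normalization layer nested inside an attention module would still stay in BF16, which matches the stated preference for stability.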

### IV\-C Tail\-Aware Percentile Calibration

For each target linear layer, hooks collect the absolute value of the input activations\. Let $x_{j,i}$ denote the activation value of channel $i$ at token/sample index $j$\. A conservative calibration statistic is the maximum absolute value:

$$a_{i}^{\max}=\max_{j}|x_{j,i}|. \tag{5}$$

This statistic avoids clipping calibration observations but can be dominated by a small number of rare outliers\. Under W4A4 activation quantization, such tails are costly because the representable values are sparse\.

We instead support a high\-percentile statistic:

$$a_{i}^{p}=Q_{p}(\{|x_{j,i}|\}_{j}), \tag{6}$$

where $Q_{p}$ is the empirical $p$\-th percentile over the calibration activations of that channel\. When statistics are accumulated over multiple calibration records, the same percentile rule is applied to the aggregated activation samples before channel masks are constructed\. This gives a single robust range estimate per input channel while avoiding dependence on the partitioning of the calibration set\.

Percentile calibration changes the calibration behavior rather than the runtime numerical format\. It allows a controlled fraction of rare activation extremes to be clipped, while allocating more effective resolution to the main body of the activation distribution\. This is suitable for low\-bit video generation because quality metrics are often more sensitive to systematic rounding noise across many tokens and frames than to a few isolated calibration extremes\.
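A small synthetic example makes the difference between the two statistics concrete. The data, the percentile value 99.9, and the interpolating percentile estimator are all assumptions for illustration; the paper does not report the percentile it uses.

```python
import math


def percentile(values, p):
    """Empirical p-th percentile (p in [0, 100]), linear interpolation (assumed)."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


# One channel: 1000 typical magnitudes in [0, 1) plus a single rare outlier.
samples = [k / 1000.0 for k in range(1000)] + [50.0]

a_max = max(samples)              # maximum statistic, Eq. (5)
a_p = percentile(samples, 99.9)   # percentile statistic, Eq. (6)

# How many calibration samples would exceed the percentile range.
clipped = sum(1 for v in samples if v > a_p)
# How much wider the range becomes when the outlier sets it.
range_ratio = a_max / a_p

print(f"a_max={a_max:.3f}  a_p={a_p:.3f}  clipped={clipped}  ratio={range_ratio:.1f}")
```

In this toy channel a single sample out of 1001 widens the max-based range by roughly a factor of fifty, which is the kind of tail the percentile statistic is meant to ignore.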

### IV\-D Channel Mask Construction

The collected activation statistic is used to build a SmoothQuant\-style channel mask\. For input channel $i$, let $w_{i}$ be the maximum absolute weight magnitude over the output dimension:

$$w_{i}=\max_{o}|W_{o,i}|. \tag{7}$$

Given activation statistic $a_{i}$, balancing coefficient $\alpha$, and numerical stabilizer $\epsilon$, the mask is

$$m_{i}=\frac{w_{i}^{\alpha}}{(a_{i}+\epsilon)^{1-\alpha}}. \tag{8}$$

In the conservative variant, $a_{i}=a_{i}^{\max}$\. In our percentile\-calibrated variant, $a_{i}=a_{i}^{p}$\. The scaled linear computation is equivalent before quantization:

$$xW^{T}=(x\operatorname{diag}(m))(W\operatorname{diag}(m)^{-1})^{T}. \tag{9}$$

The benefit appears after quantization: activation and weight ranges become better balanced before HiFloat4 quantize\-dequantize is applied\.
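A short worked example shows how the choice of activation statistic changes the mask. The numbers are illustrative and not taken from the paper; the example assumes $\alpha = 1/2$ and neglects $\epsilon$, so that the mask formula reduces to $m_i = \sqrt{w_i / a_i}$.

```latex
% Illustrative numbers only; alpha = 1/2 and epsilon is neglected.
\begin{align*}
\text{Percentile statistic: } w_i = 2,\ a_i^{p} = 8
  &\;\Rightarrow\; m_i = \sqrt{2/8} = 0.5, \\
\text{scaled activation range } a_i^{p}\, m_i &= 8 \times 0.5 = 4, \\
\text{scaled weight range } w_i / m_i &= 2 / 0.5 = 4. \\[4pt]
\text{Maximum statistic: } w_i = 2,\ a_i^{\max} = 32
  &\;\Rightarrow\; m_i = \sqrt{2/32} = 0.25, \\
\text{scaled weight range } w_i / m_i &= 2 / 0.25 = 8, \\
\text{typical activations (up to 8) after scaling} &\le 8 \times 0.25 = 2.
\end{align*}
```

With the percentile statistic both sides end up with the same range, $\sqrt{a_i w_i} = 4$. When a rare outlier sets $a_i^{\max}$, the weight range for that channel doubles while typical activations occupy only a quarter of the range allotted to them.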

### IV\-E Compact PTQ State

A naive deployment artifact could duplicate a full transformed copy of the Wan2\.2 transformer weights\. Instead, we store a compact PTQ state containing the per\-layer channel masks and quantization descriptors needed to reproduce the W4A4 model\. At inference time, this state is applied to the original floating\-point checkpoint, and each affected weight is transformed and quantized under the restored mask\. This design reduces storage overhead and keeps the quantized model explicitly tied to the declared base checkpoint\.
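The idea of storing masks instead of transformed weights can be sketched as follows. The state layout, the `format` descriptor string, the layer names, and `toy_hif4` are hypothetical; the actual checkpoint format is not described in detail in the text.

```python
def build_state(masks, fmt="hif4-w4a4"):
    """Store only per-layer masks plus a format descriptor (layout assumed)."""
    return {"format": fmt, "layers": {name: {"mask": list(m)} for name, m in masks.items()}}


def restore(base_weights, state, hif4):
    """Rebuild quantized weights from the base checkpoint and the compact state."""
    restored = {}
    for name, W in base_weights.items():
        entry = state["layers"].get(name)
        if entry is None:
            restored[name] = W  # layer kept in high precision, left untouched
            continue
        m = entry["mask"]
        # Apply diag(m)^{-1} to the weight, then quantize-dequantize.
        restored[name] = [[hif4(v / mi) for v, mi in zip(row, m)] for row in W]
    return restored


def toy_hif4(v):
    """Hypothetical stand-in for HiF4: round to a grid of step 0.25."""
    return round(v * 4.0) / 4.0


base = {
    "blocks.0.attn.q": [[1.0, 0.5], [-2.0, 0.25]],
    "head": [[3.0, 3.0]],
}
state = build_state({"blocks.0.attn.q": [2.0, 0.25]})
restored = restore(base, state, toy_hif4)

# The state holds one number per input channel, not one per weight.
stored = sum(len(e["mask"]) for e in state["layers"].values())
total = sum(len(row) for W in base.values() for row in W)
```

At runtime the same mask would also be applied to the incoming activations before they are quantized, so the stored mask serves both paths.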

## V Experiments

### V\-A Experimental Setup

We evaluate the proposed PTQ pipeline on the text\-to\-video task\. The base Wan2\.2 model is loaded in BF16 and serves as the reference model\. Calibration uses the OpenS2V\-5M\-derived JSON prompt file\[[26](https://arxiv.org/html/2605.26628#bib.bib10)\], where each sample contains a prompt under the *cap* field\. Since the organizer\-required generation pipeline is image\-to\-video while the evaluation task is prompt\-driven text\-to\-video generation, we provide an explicit blank placeholder image as the image input\. This behavior is fixed during both calibration and inference\.

The generation protocol uses resolution $720\times 1280$, 61 frames, 40 denoising steps, and classifier\-free guidance scale 3\.5\. Calibration uses 16 prompts with random seed 42\. The W4A4 model uses the same base checkpoint, prompt set, placeholder\-image condition, resolution, frame count, and sampling steps as the BF16 baseline\.

TABLE I: Default calibration and generation settings\.

### V\-B Implementation Details

Table [II](https://arxiv.org/html/2605.26628#S5.T2) summarizes the layer conversion\. Each transformer has 400 HiFloat4 quantized linear layers and 6 high\-precision linear layers\. Across both transformer modules, 800 linear layers are quantized\. The retained high\-precision layers are boundary modules such as time/text embedding projections and output projections\. No full transformer block is retained in high precision in the default configuration, leaving the explicit high\-precision block budget unused\.

TABLE II: Layer conversion summary for the default W4A4 configuration\.

The overall procedure follows the three stages in Fig\. [1](https://arxiv.org/html/2605.26628#S4.F1)\. First, calibration estimates activation statistics on the BF16 model\. Second, PTQ converts the selected projections, applies channel balancing, and records the compact quantization state\. Third, inference restores the quantized model from the base checkpoint and the PTQ state before generating videos\. This staged design separates statistical calibration, model conversion, and evaluation, making the reported comparison reproducible without changing the sampling protocol\.

### V\-C Main Results

Table [III](https://arxiv.org/html/2605.26628#S5.T3) reports the main comparison under matched generation and evaluation settings\. The HiFloat4 W4A4 model reduces the unweighted mean score from 0\.6800 to 0\.5880, corresponding to a drop of 0\.0920\. The largest degradation appears in subject consistency, indicating that aggressive 4\-bit activation quantization affects identity and object preservation\. In contrast, aesthetic quality and overall consistency remain comparable to the BF16 baseline, and motion smoothness has only a small drop\.

TABLE III: Main results on the official video generation evaluation\. Higher is better for all metrics\.

### V\-D Qualitative Results

Fig\. [2](https://arxiv.org/html/2605.26628#S5.F2) shows qualitative W4A4 generations with their prompts and sampled frames\. The examples illustrate that the quantized model preserves scene layout and plausible motion, while fine subject details and local object geometry remain sensitive under 4\-bit activation quantization, consistent with the subject\-consistency drop in Table [III](https://arxiv.org/html/2605.26628#S5.T3)\.

![Refer to caption](https://arxiv.org/html/2605.26628v1/fig3.png)

Figure 2: Qualitative W4A4 examples\. Each row shows the prompt and sampled frames from one generated video\.

### V\-E Analysis

The metric profile suggests that the main failure mode of W4A4 Wan2\.2 is not temporal smoothness\. Motion smoothness remains close to BF16, which indicates that the denoising trajectory still produces temporally coherent frame transitions\. Aesthetic quality is also stable, suggesting that global visual style and frame\-level appeal are not the most sensitive components under this PTQ configuration\.

Subject consistency is the primary bottleneck\. This is expected for W4A4 activation quantization because subject identity depends on repeated preservation of semantic and local visual details across frames\. Small quantization errors in attention and feed\-forward projections can accumulate over denoising steps and appear as subject drift, local blur, texture collapse, or object inconsistency\. Percentile calibration is designed to reduce systematic range over\-expansion caused by rare activation tails, but it does not eliminate all errors introduced by quantizing both weights and activations to 4 bits\.

The compact PTQ state is also important for reproducibility\. Since it stores transformation parameters rather than full transformed weights, inference remains tied to the declared base Wan2\.2 checkpoint\. This makes the base model, quantization configuration, prompt source, resolution, and frame count explicit, and it also reduces the storage overhead of comparing multiple PTQ variants\.

### V\-F Limitations

The method relies on calibration prompts and percentile hyperparameters\. If the calibration prompts are too few or distributionally different from the evaluation prompts, the selected percentile may under\-cover important activation modes\. The current system does not use the optional ViDiT\-Q rotation path, so part of the channel imbalance may remain unresolved\. Finally, W4A4 quantization remains a severe setting for video generation; preserving subject identity under 4\-bit activations is still challenging even when global quality and motion metrics remain stable\.

## VI Conclusion

We presented Tail\-Aware HiFloat4, a percentile\-calibrated W4A4 PTQ pipeline for Wan2\.2 text\-to\-video generation\. The method adapts ViDiT\-Q\-style calibration and channel balancing to the dual\-transformer Wan2\.2 pipeline, uses HiFloat4 quantize\-dequantize for both weights and activations in target linear layers, and stores only compact PTQ deltas for inference\. The results show that several global video\-quality metrics remain close to the BF16 baseline, while subject consistency is the dominant remaining degradation\. Future improvements should focus on identity\-sensitive layers, calibration prompt coverage, and selective use of rotation or high\-precision retention under the challenge constraints\.

## References

- \[1\] S\. Ashkboos, M\. L\. Croci, M\. G\. do Nascimento, T\. Hoefler, and J\. Hensman \(2024\) QuaRot: outlier\-free 4\-bit inference in rotated llms\. In Advances in Neural Information Processing Systems, Vol\. 37\.
- \[2\] L\. Chen, Y\. Meng, C\. Tang, X\. Ma, J\. Jiang, X\. Wang, Z\. Wang, and W\. Zhu \(2025\) Q\-dit: accurate post\-training quantization for diffusion transformers\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 28306–28315\.
- \[3\] J\. Deng, S\. Li, Z\. Wang, H\. Gu, K\. Xu, and K\. Huang \(2025\) VQ4DiT: efficient post\-training vector quantization for diffusion transformers\. Proceedings of the AAAI Conference on Artificial Intelligence 39\(15\), pp\. 16226–16234\.
- \[4\] T\. Dettmers, M\. Lewis, Y\. Belkada, and L\. Zettlemoyer \(2022\) GPT3\.int8\(\): 8\-bit matrix multiplication for transformers at scale\. In Advances in Neural Information Processing Systems, Vol\. 35, pp\. 30318–30332\.
- \[5\] E\. Frantar, S\. Ashkboos, T\. Hoefler, and D\. Alistarh \(2023\) GPTQ: accurate post\-training quantization for generative pre\-trained transformers\. In International Conference on Learning Representations\.
- \[6\] Y\. He, J\. Liu, W\. Wu, H\. Zhou, and B\. Zhuang \(2024\) EfficientDM: efficient quantization\-aware fine\-tuning of low\-bit diffusion models\. In International Conference on Learning Representations\.
- \[7\] J\. Ho, W\. Chan, C\. Saharia, J\. Whang, R\. Gao, A\. Gritsenko, D\. P\. Kingma, B\. Poole, M\. Norouzi, D\. J\. Fleet, and T\. Salimans \(2022\) Imagen video: high definition video generation with diffusion models\. arXiv preprint arXiv:2210\.02303\.
- \[8\] J\. Ho, A\. Jain, and P\. Abbeel \(2020\) Denoising diffusion probabilistic models\. In Advances in Neural Information Processing Systems, Vol\. 33, pp\. 6840–6851\.
- \[9\] W\. Hong, M\. Ding, W\. Zheng, X\. Liu, and J\. Tang \(2023\) CogVideo: large\-scale pretraining for text\-to\-video generation via transformers\. In International Conference on Learning Representations\.
- \[10\] Z\. Huang, Y\. He, J\. Yu, F\. Zhang, C\. Si, Y\. Jiang, Y\. Zhang, T\. Wu, Q\. Jin, N\. Chanpaisit, Y\. Wang, X\. Chen, L\. Wang, D\. Lin, Y\. Qiao, and Z\. Liu \(2024\) VBench: comprehensive benchmark suite for video generative models\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 21807–21818\.
- \[11\] B\. Jacob, S\. Kligys, B\. Chen, M\. Zhu, M\. Tang, A\. Howard, H\. Adam, and D\. Kalenichenko \(2018\) Quantization and training of neural networks for efficient integer\-arithmetic\-only inference\. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp\. 2704–2713\.
- \[12\] W\. Kong, Q\. Tian, Z\. Zhang, R\. Min, Z\. Dai, J\. Zhou, J\. Xiong, X\. Li, et al\. \(2024\) HunyuanVideo: a systematic framework for large video generative models\. arXiv preprint arXiv:2412\.03603\.
- \[13\] X\. Li, Y\. Liu, L\. Lian, H\. Yang, Z\. Dong, D\. Kang, S\. Zhang, and K\. Keutzer \(2023\) Q\-diffusion: quantizing diffusion models\. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp\. 17535–17545\.
- \[14\] J\. Lin, J\. Tang, H\. Tang, S\. Yang, W\. Chen, W\. Wang, G\. Xiao, X\. Dang, C\. Gan, and S\. Han \(2024\) AWQ: activation\-aware weight quantization for llm compression and acceleration\. In Proceedings of Machine Learning and Systems, Vol\. 6, pp\. 87–100\.
- \[15\] Y\. Luo, J\. Huang, Y\. Cheng, Z\. Yu, K\. Tang, X\. Ma, X\. Wang, A\. Tong, G\. Hu, Y\. Xu, M\. Taghian, P\. Wu, G\. Li, Y\. Peng, T\. Hu, M\. Chen, M\. B\. Mi, H\. Liu, X\. Zhou, J\. Wang, Q\. Lin, and H\. Liao \(2026\) HiFloat4 format for language model inference\. arXiv preprint arXiv:2602\.11287\.
- \[16\] M\. Nagel, R\. A\. Amjad, M\. van Baalen, C\. Louizos, and T\. Blankevoort \(2020\) Up or down? adaptive rounding for post\-training quantization\. In Proceedings of the 37th International Conference on Machine Learning, pp\. 7197–7206\.
- \[17\] W\. Peebles and S\. Xie \(2023\) Scalable diffusion models with transformers\. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp\. 4195–4205\.
- \[18\] R\. Rombach, A\. Blattmann, D\. Lorenz, P\. Esser, and B\. Ommer \(2022\) High\-resolution image synthesis with latent diffusion models\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 10684–10695\.
- \[19\] Y\. Shang, Z\. Yuan, B\. Xie, B\. Wu, and Y\. Yan \(2023\) Post\-training quantization on diffusion models\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 1972–1981\.
- \[20\] U\. Singer, A\. Polyak, T\. Hayes, X\. Yin, J\. An, S\. Zhang, Q\. Hu, H\. Yang, O\. Ashual, O\. Gafni, D\. Parikh, S\. Gupta, and Y\. Taigman \(2023\) Make\-a\-video: text\-to\-video generation without text\-video data\. In International Conference on Learning Representations\.
- \[21\] Wan Team, A\. Wang, B\. Ai, B\. Wen, C\. Mao, C\. Xie, D\. Chen, F\. Yu, et al\. \(2025\) Wan: open and advanced large\-scale video generative models\. arXiv preprint arXiv:2503\.20314\.
- \[22\] Wan\-AI \(2025\) Wan2\.2\-I2V\-A14B\. Hugging Face model repository \[Online\]\. Available: https://huggingface\.co/Wan\-AI/Wan2\.2\-I2V\-A14B\. Accessed: May 19, 2026\.
- \[23\] J\. Wu, H\. Wang, Y\. Shang, M\. Shah, and Y\. Yan \(2024\) PTQ4DiT: post\-training quantization for diffusion transformers\. In Advances in Neural Information Processing Systems, Vol\. 37\.
- \[24\] G\. Xiao, J\. Lin, M\. Seznec, H\. Wu, J\. Demouth, and S\. Han \(2023\) SmoothQuant: accurate and efficient post\-training quantization for large language models\. In Proceedings of the 40th International Conference on Machine Learning, pp\. 38087–38099\.
- \[25\] Z\. Yao, R\. Y\. Aminabadi, M\. Zhang, X\. Wu, C\. Li, and Y\. He \(2022\) ZeroQuant: efficient and affordable post\-training quantization for large\-scale transformers\. In Advances in Neural Information Processing Systems, Vol\. 35, pp\. 27168–27183\.
- \[26\] S\. Yuan, X\. He, Y\. Deng, Y\. Ye, J\. Huang, B\. Lin, J\. Luo, and L\. Yuan \(2025\) OpenS2V\-nexus: a detailed benchmark and million\-scale dataset for subject\-to\-video generation\. arXiv preprint arXiv:2505\.20292\.
- \[27\] T\. Zhao, T\. Fang, H\. Huang, R\. Wan, W\. Soedarmadji, E\. Liu, S\. Li, Z\. Lin, G\. Dai, S\. Yan, H\. Yang, X\. Ning, and Y\. Wang \(2025\) ViDiT\-q: efficient and accurate quantization of diffusion transformers for image and video generation\. In International Conference on Learning Representations\.
