Tag
This paper reveals that zeroth-order fine-tuning of LLMs is dominated by a single decoding layer, which can be identified by activation outliers, and fine-tuning only that layer matches or exceeds full-model fine-tuning with up to 4.52x speedup.
Introduces Flatness metric and Bidirectional Diagonal Quantization (BDQ) for post-training quantization of large language models, achieving near-lossless 4-bit weight and activation quantization and substantial improvements at extreme low-bit settings.