Tag
InsightTok introduces content-aware perceptual losses to improve discrete visual tokenization for better text and face reconstruction, enhancing autoregressive image generation quality.
This paper introduces DRoRAE, a method that improves visual tokenization by fusing multi-layer features from pretrained vision encoders rather than relying solely on the last layer. It demonstrates significant improvements in reconstruction and generation quality on ImageNet and establishes a scaling law between fusion capacity and performance.