Image Tokenizer Needs Post-Training
This repository contains RobusTok, a novel image tokenizer presented in the paper Image Tokenizer Needs Post-Training.
About RobusTok
Recent image generative models typically rely on a frozen image tokenizer to capture the image distribution in a latent space. However, a significant discrepancy exists between the reconstruction and generation distributions, because current tokenizers prioritize the reconstruction task without fully accounting for the errors a generator introduces during sampling.
RobusTok addresses this by proposing a novel tokenizer training scheme that includes both main-training and post-training:
- Main training: Constructs a robust latent space by simulating sampling noise and unexpected tokens.
- Post-training: Further optimizes the tokenizer decoder with respect to a well-trained generative model, mitigating the distribution difference between generated and reconstructed tokens.
This approach significantly enhances the robustness of the tokenizer, boosting generation quality and convergence speed.
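As a rough illustration of what "simulating sampling noise and unexpected tokens" could look like for a discrete tokenizer, the sketch below replaces a fraction of token ids with random codebook ids before they would be passed to the decoder. This is a minimal sketch under stated assumptions, not the repository's implementation: the helper name `perturb_tokens`, the `ratio` parameter, and the uniform replacement strategy are all hypothetical.

```python
import random


def perturb_tokens(tokens, codebook_size, ratio, rng=None):
    """Replace a fraction `ratio` of token ids with random codebook ids.

    Hypothetical helper for illustration only; the perturbation actually
    used by RobusTok may be defined differently.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    rng = rng or random.Random()
    num_to_replace = round(ratio * len(tokens))
    # Pick distinct positions to overwrite.
    positions = rng.sample(range(len(tokens)), num_to_replace)
    perturbed = list(tokens)  # copy so the input sequence is left untouched
    for pos in positions:
        # The random id may coincide with the original id by chance.
        perturbed[pos] = rng.randrange(codebook_size)
    return perturbed, sorted(positions)


# Toy example: 8 token ids drawn from a codebook of size 8, 25% perturbed.
tokens = [3, 7, 7, 1, 0, 5, 2, 6]
noisy, changed = perturb_tokens(tokens, codebook_size=8, ratio=0.25,
                                rng=random.Random(0))
```

A decoder trained to reconstruct the image from `noisy` rather than `tokens` sees inputs closer to what an imperfect generator might sample, which is the intuition behind a more robust latent space.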
Key Highlights of Post-Training
- Better generative quality: Achieves notable improvements in gFID (e.g., 1.60 → 1.36 gFID with a ~400M generator).
- Generalizability: Applicable to both autoregressive and diffusion models.
- Efficiency: Provides strong results with relatively small generative models.
Model Zoo
| Generator \ Tokenizer | RobusTok w/o post-training | RobusTok w/ post-training |
|---|---|---|
| Base (weights) | gFID = 1.83 | gFID = 1.60 |
| Large (weights) | gFID = 1.60 | gFID = 1.36 |
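To put the table in relative terms, the short sketch below computes the percentage reduction in gFID from post-training for each generator. The numbers are copied from the table above; the dictionary layout and the function name `relative_reduction` are illustrative choices, not part of the repository.

```python
# gFID values copied from the Model Zoo table (lower is better).
GFID = {
    "Base": {"without_pt": 1.83, "with_pt": 1.60},
    "Large": {"without_pt": 1.60, "with_pt": 1.36},
}


def relative_reduction(before, after):
    """Return the relative reduction, in percent, going from `before` to `after`."""
    return 100.0 * (before - after) / before


reductions = {
    name: round(relative_reduction(v["without_pt"], v["with_pt"]), 1)
    for name, v in GFID.items()
}
print(reductions)  # {'Base': 12.6, 'Large': 15.0}
```

By this arithmetic, post-training lowers gFID by roughly 12.6% for the Base generator and 15% for the Large generator.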
Usage
Because RobusTok's tokenizer and generator rely on a specialized training and inference pipeline, detailed usage instructions, installation guides, and code examples are provided in the official GitHub repository. This includes scripts for:
- Environment setup and package installation.
- Dataset preparation.
- Main training for the tokenizer.
- Training code for the generator.
- Post-training for the tokenizer.
- Inference and evaluation (see Inference Code).
Visualization
Visualization of 256×256 image generation before (top) and after (bottom) post-training. Three improvements are observed: (a) OOD mitigation, (b) color fidelity, (c) detail refinement.
Citation
If our work assists your research, feel free to give us a star ⭐ or cite us using:
@misc{qiu2025imagetokenizerneedsposttraining,
title={Image Tokenizer Needs Post-Training},
author={Kai Qiu and Xiang Li and Hao Chen and Jason Kuen and Xiaohao Xu and Jiuxiang Gu and Yinyi Luo and Bhiksha Raj and Zhe Lin and Marios Savvides},
year={2025},
eprint={2509.12474},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.12474},
}