GRN: Generative Refinement Networks for Visual Synthesis

Jian Han, Jinlai Liu, Jiahuan Wang, Bingyue Peng, Zehuan Yuan
ByteDance

Introduction

GRN is a new visual synthesis paradigm that is neither diffusion nor autoregressive. It allocates computation adaptively by sample complexity and progressively refines outputs globally.

  • 🎬 Near-Lossless Tokenization: Hierarchical Binary Quantization (HBQ) preserves visual information better than lossy tokenization pipelines.
  • ⚙️ Global Refinement: GRN refines all tokens progressively across the full canvas, like a human artist polishing the full image.
  • 📊 Adaptive-Step Generation: Entropy-guided sampling dynamically allocates generation steps according to content complexity.
  • 🏆 Strong Results: GRN achieves strong reconstruction and generation performance, and scales effectively to text-to-image and text-to-video tasks.
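The hierarchical binary quantization idea can be illustrated with a minimal residual-quantization sketch: each level encodes the current residual as a signed binary code with a scalar scale, then passes the remaining residual to the next level, so reconstruction sharpens level by level. This is an illustrative assumption about how such a scheme could work, not the paper's exact HBQ design; `hbq_encode` and `hbq_decode` are hypothetical names.

```python
import numpy as np

def hbq_encode(x, num_levels=8):
    """Hierarchical binary quantization sketch (illustrative only):
    at each level, quantize the current residual to a signed binary
    code scaled by its mean magnitude, then carry the remaining
    residual to the next level."""
    residual = x.astype(np.float64).copy()
    codes, scales = [], []
    for _ in range(num_levels):
        scale = np.abs(residual).mean()   # per-level scalar scale
        code = np.sign(residual)          # signed code in {-1, 0, +1}
        codes.append(code)
        scales.append(scale)
        residual -= scale * code          # shrink the residual
    return codes, scales

def hbq_decode(codes, scales):
    """Reconstruct by summing the scaled binary codes across levels."""
    return sum(s * c for c, s in zip(codes, scales))
```

Decoding a prefix of the levels gives a coarse reconstruction; using all levels drives the error toward zero, which is the sense in which such a hierarchy can approach near-lossless behavior.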

Generative Refinement Framework

How GRN Refines

GRN starts from a random token map, then iteratively fills and refines tokens globally. Compared with fixed-compute generation, this process adapts computation to content complexity and improves synthesis quality progressively.

GRN Refinement Framework

1. Random Token Map

Generation begins from an initially random token layout.

2. Select Predictions

At each step, the model commits only its most confident predictions; the remaining tokens stay open for later refinement.

3. Refine Globally

All input tokens are refined together rather than generated one by one.

4. Adapt by Complexity

Entropy-guided sampling spends more steps where content is harder.
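The four steps above can be sketched as a small refinement loop. This is a minimal sketch under assumptions: `predict_fn` is a hypothetical stand-in for the trained model, the confident-half commit rule and the entropy threshold are illustrative choices, not the official sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def token_entropy(probs):
    """Shannon entropy per token position, used to gauge difficulty."""
    return -(probs * np.log(probs + 1e-9)).sum(axis=-1)

def refine(predict_fn, num_tokens, vocab_size, max_steps=16, entropy_stop=0.1):
    """Sketch of GRN-style global refinement:
      1. start from a random token map,
      2. re-predict ALL tokens jointly at every step,
      3. commit only the most confident predictions,
      4. stop early once average entropy falls below a threshold,
         so easy samples use fewer steps than hard ones."""
    tokens = rng.integers(0, vocab_size, size=num_tokens)    # 1. random init
    for step in range(max_steps):
        probs = predict_fn(tokens)                           # 3. global re-prediction
        conf = probs.max(axis=-1)
        keep = conf >= np.quantile(conf, 0.5)                # 2. confident half
        tokens = np.where(keep, probs.argmax(axis=-1), tokens)
        if token_entropy(probs).mean() < entropy_stop:       # 4. adaptive stop
            break
    return tokens, step + 1
```

With a sharp (low-entropy) predictor the loop exits after very few steps, while a diffuse predictor keeps iterating, which is the adaptive-compute behavior described above.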

Why GRN

Diffusion Inefficiency

Diffusion models spend the same compute on every sample, regardless of how complex it is. GRN addresses this with adaptive refinement.

Autoregressive Limitations

Lossy tokenization and accumulated prediction error can hurt quality. GRN addresses both: near-lossless HBQ tokenization preserves visual detail, and global refinement revisits earlier predictions instead of committing to them permanently.

Neither diffusion nor autoregressive — GRN is a third way. 🧠 Refines globally like an artist. ⚡ Generates adaptively by complexity. 🏆 New SOTA across image & video. The visual generation paradigm just got rewritten.

Class-to-Image

GRN demonstrates strong class-conditional generation quality and reconstruction performance. The official implementation reports state-of-the-art results and provides checkpoints and scripts for reproducible class-to-image experiments.

GRN Class-to-Image Examples

Text-to-Image

The GRN pipeline supports straightforward text-to-image inference with configurable guidance scale, temperature, inference steps, and output resolution. Official examples show 1024x1024 generation from a single prompt.
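For intuition about how the guidance-scale and temperature knobs interact at sampling time, here is a minimal sketch of classifier-free guidance with temperature sampling for a single token position. The function name and signature are assumptions for illustration, not the official GRN API; the official scripts should be consulted for actual usage.

```python
import numpy as np

def guided_sample(cond_logits, uncond_logits, guidance_scale=7.5,
                  temperature=1.0, rng=None):
    """Illustrative classifier-free guidance + temperature sampling
    for one token position (hypothetical; the official sampler may
    differ). Guidance pushes logits toward the prompt-conditioned
    prediction: logits = uncond + s * (cond - uncond)."""
    rng = rng or np.random.default_rng(0)
    logits = uncond_logits + guidance_scale * (cond_logits - uncond_logits)
    logits = logits / max(temperature, 1e-6)      # temperature sharpening
    probs = np.exp(logits - logits.max())         # stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```

Higher guidance scale amplifies the gap between conditional and unconditional predictions (stronger prompt adherence), while lower temperature sharpens the distribution toward the top prediction.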

GRN Text-to-Image Examples

GRN-2B Text-to-Video

GRN scales beyond class-to-image and text-to-image toward text-to-video, while preserving the same refinement-first philosophy: globally update representations and allocate compute adaptively where complexity requires it.

GRN-8B Text-to-Video

As we observe clear scaling behavior in GRN, we are encouraged to scale GRN from 2B to 8B parameters for the most challenging video generation tasks.

GRN-8B Image-to-Video

Apart from text-to-video, GRN-8B also supports image-to-video generation.

Citations

If you find this work useful, please cite:

@misc{han2026grn,
  title={Generative Refinement Networks for Visual Synthesis},
  author={Jian Han and Jinlai Liu and Jiahuan Wang and Bingyue Peng and Zehuan Yuan},
  year={2026},
  eprint={2604.13030},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.13030}
}