GRN: Generative Refinement Networks for Visual Synthesis

Jian Han, Jinlai Liu, Jiahuan Wang, Bingyue Peng, Zehuan Yuan
ByteDance

Introduction

GRN is a new visual synthesis paradigm that is neither diffusion nor autoregressive. It allocates computation adaptively by sample complexity and progressively refines outputs globally.

  • 🎬 Near-Lossless Tokenization: Hierarchical Binary Quantization (HBQ) preserves visual information better than lossy tokenization pipelines.
  • ⚙️ Global Refinement: GRN refines all tokens progressively across the full canvas, like a human artist polishing the full image.
  • 📊 Adaptive-Step Generation: Entropy-guided sampling dynamically allocates generation steps according to content complexity.
  • 🏆 Strong Results: GRN achieves strong reconstruction and generation performance, and scales effectively to text-to-image and text-to-video tasks.
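The hierarchical binary quantization idea can be illustrated with a minimal residual-quantization sketch: each level encodes the current residual as a signed binary code with a scalar scale, then passes the remaining residual to the next level, so reconstruction sharpens level by level. This is an illustrative assumption about how such a scheme could work, not the paper's exact HBQ design; `hbq_encode` and `hbq_decode` are hypothetical names.

```python
import numpy as np

def hbq_encode(x, num_levels=8):
    """Hierarchical binary quantization sketch (illustrative only):
    at each level, quantize the current residual to a signed binary
    code scaled by its mean magnitude, then carry the remaining
    residual to the next level."""
    residual = x.astype(np.float64).copy()
    codes, scales = [], []
    for _ in range(num_levels):
        scale = np.abs(residual).mean()   # per-level scalar scale
        code = np.sign(residual)          # signed code in {-1, 0, +1}
        codes.append(code)
        scales.append(scale)
        residual -= scale * code          # shrink the residual
    return codes, scales

def hbq_decode(codes, scales):
    """Reconstruct by summing the scaled binary codes across levels."""
    return sum(s * c for c, s in zip(codes, scales))
```

Decoding a prefix of the levels gives a coarse reconstruction; using all levels drives the error toward zero, which is the sense in which such a hierarchy can approach near-lossless behavior.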

Generative Refinement Framework

How GRN Refines

GRN starts from a random token map, then iteratively fills and refines tokens globally. Compared with fixed-compute generation, this process adapts computation to content complexity and improves synthesis quality progressively.

GRN Refinement Framework

1. Random Token Map

Generation begins from an initially random token layout.

2. Select Predictions

At each step, the model commits only its most confident predictions; the remaining tokens stay open for later refinement.

3. Refine Globally

All input tokens are refined together rather than generated one by one.

4. Adapt by Complexity

Entropy-guided sampling spends more steps where content is harder.
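The four steps above can be sketched as a small refinement loop. This is a minimal sketch under assumptions: `predict_fn` is a hypothetical stand-in for the trained model, the confident-half commit rule and the entropy threshold are illustrative choices, not the official sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def token_entropy(probs):
    """Shannon entropy per token position, used to gauge difficulty."""
    return -(probs * np.log(probs + 1e-9)).sum(axis=-1)

def refine(predict_fn, num_tokens, vocab_size, max_steps=16, entropy_stop=0.1):
    """Sketch of GRN-style global refinement:
      1. start from a random token map,
      2. re-predict ALL tokens jointly at every step,
      3. commit only the most confident predictions,
      4. stop early once average entropy falls below a threshold,
         so easy samples use fewer steps than hard ones."""
    tokens = rng.integers(0, vocab_size, size=num_tokens)    # 1. random init
    for step in range(max_steps):
        probs = predict_fn(tokens)                           # 3. global re-prediction
        conf = probs.max(axis=-1)
        keep = conf >= np.quantile(conf, 0.5)                # 2. confident half
        tokens = np.where(keep, probs.argmax(axis=-1), tokens)
        if token_entropy(probs).mean() < entropy_stop:       # 4. adaptive stop
            break
    return tokens, step + 1
```

With a sharp (low-entropy) predictor the loop exits after very few steps, while a diffuse predictor keeps iterating, which is the adaptive-compute behavior described above.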

Why GRN

Diffusion Inefficiency

Diffusion models spend the same compute on every sample, regardless of how complex it is. GRN addresses this with adaptive refinement.

Autoregressive Limitations

Lossy tokenization and accumulated prediction error can hurt quality. GRN addresses both: near-lossless HBQ tokenization preserves visual detail, and global refinement revisits earlier predictions instead of committing to them permanently.

Neither diffusion nor autoregressive — GRN is a third way. 🧠 Refines globally like an artist. ⚡ Generates adaptively by complexity. 🏆 New SOTA across image & video. The visual generation paradigm just got rewritten.

Class-to-Image

GRN demonstrates strong class-conditional generation quality and reconstruction performance. The official implementation reports state-of-the-art results and provides checkpoints and scripts for reproducible class-to-image experiments.

GRN Class-to-Image Examples

Text-to-Image

The GRN pipeline supports straightforward text-to-image inference with configurable guidance scale, temperature, inference steps, and output resolution. Official examples show 1024x1024 generation from a single prompt.
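For intuition about how the guidance-scale and temperature knobs interact at sampling time, here is a minimal sketch of classifier-free guidance with temperature sampling for a single token position. The function name and signature are assumptions for illustration, not the official GRN API; the official scripts should be consulted for actual usage.

```python
import numpy as np

def guided_sample(cond_logits, uncond_logits, guidance_scale=7.5,
                  temperature=1.0, rng=None):
    """Illustrative classifier-free guidance + temperature sampling
    for one token position (hypothetical; the official sampler may
    differ). Guidance pushes logits toward the prompt-conditioned
    prediction: logits = uncond + s * (cond - uncond)."""
    rng = rng or np.random.default_rng(0)
    logits = uncond_logits + guidance_scale * (cond_logits - uncond_logits)
    logits = logits / max(temperature, 1e-6)      # temperature sharpening
    probs = np.exp(logits - logits.max())         # stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```

Higher guidance scale amplifies the gap between conditional and unconditional predictions (stronger prompt adherence), while lower temperature sharpens the distribution toward the top prediction.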

GRN Text-to-Image Examples

GRN-2B Text-to-Video

GRN scales beyond class-to-image and text-to-image toward text-to-video, while preserving the same refinement-first philosophy: globally update representations and allocate compute adaptively where complexity requires it.

GRN-8B Text-to-Video

As we observe clear scaling behavior in GRN, we are encouraged to scale GRN from 2B to 8B parameters for the most challenging video generation tasks.

GRN-8B Image-to-Video

Apart from text-to-video, GRN-8B also supports image-to-video generation.

Citations

If you find this work useful, please cite:

@misc{han2026grn,
  title={Generative Refinement Networks for Visual Synthesis},
  author={Jian Han and Jinlai Liu and Jiahuan Wang and Bingyue Peng and Zehuan Yuan},
  year={2026},
  eprint={2604.13030},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.13030}
}