Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions

‡ Project leader

1 JD Explore Academy    2 USTC    3 PKU    4 THU    5 BUAA    6 FDU    7 HKUST    8 HKU    9 CUHK

Background: 2K video generated in real-time by Ultra Flash

Ultra Flash is the first framework to achieve real-time high-resolution streaming video generation, producing 1K video at ~30 FPS and 2K video at ~18 FPS on a single GPU through a cascaded causal streaming architecture.

1K Resolution (960×1664): ~30 FPS
2K Resolution (1440×2496): ~18 FPS
Single Device: 1 GPU (B200 / H200)
Denoising: Single-step SR

Abstract

While recent autoregressive video diffusion models achieve remarkable streaming quality, they remain confined to low resolutions (e.g., 480P), leaving efficient, scalable, real-time high-resolution video generation a fundamental open challenge. To bridge this gap, we present Ultra Flash, a cascaded streaming framework capable of real-time high-resolution video generation. Ultra Flash achieves ~30 FPS at 1K resolution and ~18 FPS at 2K resolution on a single B200 GPU through three key contributions: (1) an architecture-preserving T2V-to-TV2V super-resolution training paradigm coupled with an AIGC-oriented data degradation pipeline that effectively preserves the generative capability of the base model; (2) a causal streaming latent upsampler paired with a high-resolution decoder, which enhances spatiotemporal coherence while enabling efficient latent spatial scaling; and (3) a cascade high-resolution streaming video generation optimization scheme that performs hybrid-reward-enhanced sparse causalization and single-step distillation, then introduces cascaded streaming self-forcing preference optimization with dynamic cache management, jointly enhancing overall coherence, improving quality, and enabling real-time streaming.

Ultra Flash overview: framework and quality-speed comparison

(a) Ultra Flash: Real-Time & High-Resolution streaming framework. (b) Quality–speed comparison with prior methods.

Key Contributions

Three core innovations that enable real-time high-resolution streaming video generation.

T2V-to-TV2V SR Training

Architecture-preserving paradigm that converts any pre-trained T2V model into a generative super-resolution model without architectural modification, trained with an AIGC-oriented degradation pipeline.

Causal Streaming Upsampler

Ultralight causal memory network (~2M params) that upsamples latents with spatiotemporal coherence, adding <5% pipeline overhead.
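To illustrate what "causal" and "streaming" mean for such an upsampler, here is a minimal sketch that is not the Ultra Flash network. It assumes a toy temporal filter over scalar "frames" (stand-ins for latents) and shows that processing chunk by chunk with a small memory of past frames reproduces the offline result exactly. The names `causal_filter_offline` and `StreamingCausalFilter`, and the filter weights, are hypothetical.

```python
from collections import deque


def causal_filter_offline(frames, weights):
    """Reference pass: output[t] uses only frames[t-k+1 .. t]; the past is zero-padded."""
    k = len(weights)
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for j in range(k):
            idx = t - (k - 1) + j
            if idx >= 0:  # never look at future frames
                acc += weights[j] * frames[idx]
        out.append(acc)
    return out


class StreamingCausalFilter:
    """Same filter, fed one chunk at a time, keeping the last k-1 frames as memory."""

    def __init__(self, weights):
        self.weights = list(weights)
        k = len(self.weights)
        # Memory starts as zeros, matching the zero-padding of the offline pass.
        self.memory = deque([0.0] * (k - 1), maxlen=k - 1)

    def push_chunk(self, chunk):
        out = []
        for x in chunk:
            window = list(self.memory) + [x]
            out.append(sum(w * v for w, v in zip(self.weights, window)))
            self.memory.append(x)  # oldest frame is evicted automatically
        return out


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0, 4.0, 5.0]
    weights = [0.25, 0.25, 0.5]  # hypothetical weights, most recent frame last
    stream = StreamingCausalFilter(weights)
    streamed = []
    for chunk in ([1.0, 2.0], [3.0], [4.0, 5.0]):
        streamed += stream.push_chunk(chunk)
    print(streamed == causal_filter_offline(frames, weights))
```

Because each output depends only on current and past inputs, the chunk boundaries do not affect the result, which is the property that lets a causal module run inside a streaming pipeline.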

Cascaded Streaming Optimization

Hybrid-reward sparse causalization and single-step distillation, cascaded streaming preference optimization (DPO), and dynamic cache management for real-time inference.
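This page does not specify the cache policy, so the following is only a sketch of one common way to keep a streaming cache bounded: retain a few initial "sink" frames permanently and a sliding window of the most recent frames. The class `RollingFrameCache` and its parameters `num_sink` and `window` are assumptions for illustration, not the Ultra Flash implementation.

```python
from collections import deque


class RollingFrameCache:
    """Bounded cache: a fixed set of early 'sink' entries plus a sliding recent window."""

    def __init__(self, num_sink, window):
        self.num_sink = num_sink
        self.sink = []                       # first num_sink entries, never evicted
        self.recent = deque(maxlen=window)   # most recent entries, oldest evicted first

    def add(self, entry):
        if len(self.sink) < self.num_sink:
            self.sink.append(entry)
        else:
            self.recent.append(entry)

    def visible(self):
        """Entries a new frame would be allowed to attend to."""
        return self.sink + list(self.recent)


if __name__ == "__main__":
    cache = RollingFrameCache(num_sink=1, window=3)
    for frame_id in range(10):
        cache.add(frame_id)
    print(cache.visible())
```

The cache size stays at most `num_sink + window` regardless of how long the stream runs, which keeps per-frame cost constant over time.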

Method

Ultra Flash Training and Optimization Pipeline

(a) T2V-to-TV2V SR Training. (b) Causal Latent Upsampler & High-Resolution Decoder. (c) Hybrid-Reward-Enhanced Sparse Causalization and Single-Step Distillation. (d) Cascade High-Resolution Streaming Video Generation Optimization.

Results

Efficiency Comparison

Method                     Resolution         Steps   FPS ↑   Latency (ms) ↓   Streaming
CausVid                    480×832            4       24.4    123
Self Forcing               480×832            4       32.0    125
Causal Forcing             480×832            4       31.2    128
DummyForcing               480×832            4       28.0    143
Self Forcing + FlashVSR    768×1408           5       8.0     500              ×
Ultra Flash (Ours)         960×1664 (1K)      4       30.0    40
Ultra Flash (Ours)         1440×2496 (2K)     4       18.0    67
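FPS alone understates the gap between rows, because the resolutions differ. As a rough worked comparison, the sketch below converts resolution and FPS into pixels generated per second, using the Self Forcing 480×832 row as the baseline. The helper `pixels_per_second` is hypothetical, and the figures are read from the efficiency table above.

```python
def pixels_per_second(height, width, fps):
    """Raw output pixel rate for a given resolution and frame rate."""
    return height * width * fps


# (height, width, FPS) as read from the efficiency table.
rows = {
    "Self Forcing 480x832": (480, 832, 32.0),
    "Ultra Flash 1K": (960, 1664, 30.0),
    "Ultra Flash 2K": (1440, 2496, 18.0),
}

base_h, base_w, base_fps = rows["Self Forcing 480x832"]
base_rate = pixels_per_second(base_h, base_w, base_fps)

for name, (h, w, fps) in rows.items():
    scale = h / base_h  # per-axis upscaling factor relative to 480x832
    rate = pixels_per_second(h, w, fps)
    print(f"{name}: {scale:.0f}x per axis, {rate / 1e6:.1f} Mpx/s, "
          f"{rate / base_rate:.2f}x baseline pixel rate")
```

The 1K and 2K settings are exact 2× and 3× per-axis scalings of 480×832, so they carry 4× and 9× the pixels per frame.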

Citation

@inproceedings{luxury2026ultraflash,
  title={Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions},
  author={Luxury and Huang, Jie and Fan, Zihao and Ma, Xiaoxiao and Li, Yuming and Zhuang, Jun-hao and Xue, Zeyue and Fu, Siming and Li, Haoran and Zhong, Mingchen and Zhang, Guohui and Ma, Shichen and Liu, Yijun and Shi, Jiaqi and Ma, Yanwen and Su, Yaofeng and Wang, Haoyu and Li, Yaowei and Zhang, Songchun and Jin, Weiyang and Bian, Yuxuan and Zhang, Shiyi and Xu, Haojun and Lu, Shuai and Han, Xin and Tang, Wei and Huang, Haoyang and Duan, Nan},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2026}
}