Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions

Luxury¹, Jie Huang¹‡, Zihao Fan², Xiaoxiao Ma², Yuming Li³, Jun-hao Zhuang¹,

Zeyue Xue¹, Siming Fu¹, Haoran Li¹, Mingchen Zhong², Guohui Zhang²,

Shichen Ma¹, Yijun Liu⁴, Jiaqi Shi², Yanwen Ma⁵, Yaofeng Su⁶, Haoyu Wang⁴,

Yaowei Li³, Songchun Zhang⁷, Weiyang Jin⁸, Yuxuan Bian⁹, Shiyi Zhang⁴, Haojun Xu⁵,

Shuai Lu¹, Xin Han¹, Wei Tang¹, Haoyang Huang¹, Nan Duan¹

‡ Project leader

¹ JD Explore Academy ² USTC ³ PKU ⁴ THU ⁵ BUAA ⁶ FDU ⁷ HKUST ⁸ HKU ⁹ CUHK


Background: 2K video generated in real time by Ultra Flash

Abstract

While recent autoregressive video diffusion models achieve remarkable streaming quality, they remain confined to low resolutions (e.g., 480P), leaving efficient, scalable, real-time high-resolution video generation a fundamental open challenge. To bridge this gap, we present Ultra Flash, a cascaded streaming framework capable of real-time high-resolution video generation. Ultra Flash achieves ~30 FPS at 1K resolution and ~18 FPS at 2K resolution on a single B200 GPU through three key contributions: (1) an architecture-preserving T2V-to-TV2V super-resolution training paradigm, coupled with an AIGC-oriented data degradation pipeline, that preserves the generative capability of the base model; (2) a causal streaming latent upsampler paired with a high-resolution decoder, which enhances spatiotemporal coherence while enabling efficient latent spatial scaling; and (3) a cascaded optimization scheme for high-resolution streaming video generation that first performs hybrid-reward-enhanced sparse causalization and single-step distillation, then applies cascaded streaming self-forcing preference optimization with dynamic cache management, jointly improving coherence and quality while enabling real-time streaming.

Ultra Flash overview: framework and quality-speed comparison

(a) Ultra Flash: Real-Time & High-Resolution streaming framework. (b) Quality–speed comparison with prior methods.
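The cascaded streaming structure can be pictured as a chain of lazy stages, each consuming chunks from the previous one as soon as they are available. The sketch below is only an illustration under stated assumptions: the stage order, the `Chunk` type, all function names, and the 480×832 base resolution are hypothetical and are not taken from the released code. It tracks nominal resolutions only and contains no model.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Chunk:
    """A block of frames moving through the cascade (shape bookkeeping only)."""
    index: int
    height: int
    width: int
    stage: str


def base_generator(num_chunks: int, height: int = 480, width: int = 832) -> Iterator[Chunk]:
    # Stand-in for a low-resolution streaming generator (assumed base size).
    for i in range(num_chunks):
        yield Chunk(i, height, width, "base")


def latent_upsampler(chunks: Iterator[Chunk], scale: int) -> Iterator[Chunk]:
    # Stand-in for spatial scaling in latent space; only the shape changes here.
    for c in chunks:
        yield Chunk(c.index, c.height * scale, c.width * scale, "upsampled")


def super_resolver(chunks: Iterator[Chunk]) -> Iterator[Chunk]:
    # Stand-in for the generative refinement stage; resolution is unchanged.
    for c in chunks:
        yield Chunk(c.index, c.height, c.width, "refined")


def decoder(chunks: Iterator[Chunk]) -> Iterator[Chunk]:
    # Stand-in for decoding latents to high-resolution frames.
    for c in chunks:
        yield Chunk(c.index, c.height, c.width, "decoded")


def cascade(num_chunks: int, scale: int) -> Iterator[Chunk]:
    """Compose the stages; generators keep the whole chain chunk-by-chunk."""
    return decoder(super_resolver(latent_upsampler(base_generator(num_chunks), scale)))


if __name__ == "__main__":
    for chunk in cascade(num_chunks=3, scale=3):
        print(chunk)
```

Because every stage is a generator, the first decoded chunk is produced before the second chunk is generated, which is the property a streaming cascade relies on. With the assumed 480×832 base, a scale of 2 gives 960×1664 and a scale of 3 gives 1440×2496, matching the 1K and 2K sizes reported for Ultra Flash.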

Key Contributions

Three core innovations that enable real-time high-resolution streaming video generation.

T2V-to-TV2V SR Training

Architecture-preserving paradigm that converts any pre-trained T2V model into a generative super-resolution model without architectural modification, trained with an AIGC-oriented data degradation pipeline.

Causal Streaming Upsampler

Ultralight causal memory network (~2M params) that upsamples latents with spatiotemporal coherence, adding <5% pipeline overhead.

Cascaded Streaming Optimization

Hybrid-reward sparse distillation, cascaded preference optimization (DPO), and dynamic cache management for real-time inference.
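This page names dynamic cache management without describing the policy. As a generic illustration of why a bounded cache matters for unbounded streaming, the sketch below keeps a few initial entries plus a sliding window of recent ones, so memory stays constant however long the stream runs. The `RollingCache` class, its parameters, and the eviction rule are assumptions made for this example, not the authors' method.

```python
from collections import deque
from typing import Any, Deque, List


class RollingCache:
    """Bounded cache: the first `num_sink` entries plus the latest `window` entries.

    Hypothetical policy for illustration only.
    """

    def __init__(self, num_sink: int, window: int) -> None:
        self.num_sink = num_sink
        self.sink: List[Any] = []
        # A deque with maxlen discards its oldest item on overflow.
        self.recent: Deque[Any] = deque(maxlen=window)

    def append(self, entry: Any) -> None:
        # Fill the permanent slots first, then rotate through the window.
        if len(self.sink) < self.num_sink:
            self.sink.append(entry)
        else:
            self.recent.append(entry)

    def context(self) -> List[Any]:
        # Entries a new chunk would be conditioned on, oldest first.
        return self.sink + list(self.recent)

    def __len__(self) -> int:
        return len(self.sink) + len(self.recent)


if __name__ == "__main__":
    cache = RollingCache(num_sink=1, window=3)
    for chunk_id in range(10):
        cache.append(chunk_id)
    print(cache.context(), len(cache))
```

In a real model the entries would be per-chunk key/value tensors rather than integers; the bookkeeping is the same.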

Method	Resolution	Steps	FPS ↑	Latency (ms) ↓	Streaming
Wan2.1	480×832	50	0.78	103,000	×
CausVid	480×832	4	29.4	34	✓
Self Forcing	480×832	4	32.0	31	✓
Causal Forcing	480×832	4	31.2	32	✓
DummyForcing	480×832	4	28.0	36	✓
Self Forcing + FlashVSR	768×1408	5	15.0	67	✓
Ultra Flash (Ours)	960×1664 (1K)	4	30.0	40	✓
Ultra Flash (Ours)	1440×2496 (2K)	4	18.0	56	✓
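FPS alone understates the comparison because the rows differ in resolution. One way to put them on a common scale is pixel throughput, height × width × FPS. The short script below computes it from the table values; the metric and the helper name `megapixels_per_second` are introduced here for illustration and are not reported by the authors.

```python
# Resolution and FPS values copied from the comparison table.
ROWS = [
    ("Self Forcing", 480, 832, 32.0),
    ("Self Forcing + FlashVSR", 768, 1408, 15.0),
    ("Ultra Flash (1K)", 960, 1664, 30.0),
    ("Ultra Flash (2K)", 1440, 2496, 18.0),
]


def megapixels_per_second(height: int, width: int, fps: float) -> float:
    """Pixels generated per second, in millions."""
    return height * width * fps / 1e6


THROUGHPUT = {
    name: megapixels_per_second(h, w, fps) for name, h, w, fps in ROWS
}

if __name__ == "__main__":
    baseline = THROUGHPUT["Self Forcing"]
    for name, value in THROUGHPUT.items():
        print(f"{name:26s} {value:6.1f} MP/s  ({value / baseline:.2f}x Self Forcing)")
```

By this measure the 2K configuration works out to roughly 64.7 MP/s against about 12.8 MP/s for Self Forcing at 480×832, a ratio of about 5. The 1K and 2K sizes are exactly 2× and 3× the 480×832 dimensions per side, so they carry 4× and 9× the pixels per frame.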

Gallery

Real-time 2K streaming video generation results, listed by text prompt.

A child in a cozy winter outfit, blowing gently on a steaming mug of hot cocoa to cool it down.

A cinematic closeup and detailed portrait of a reindeer standing in a snowy forest at sunset.

A cinematic over-the-shoulder shot of a writer sitting at a cluttered desk, lost in thought as they gaze out the window.

A close-up portrait of a woman, her face illuminated by the side lighting, capturing her delicate features.

A close-up shot of a futuristic cybernetic German Shepherd, showcasing its striking brown and black metallic features.

A cyberpunk-style digital illustration of a metal skull growing muscle tendons and flesh, set in a dark futuristic laboratory.

A cyberpunk-style illustration depicting a lone robot navigating a neon-lit cityscape.

A detailed and warm moment captured in a traditional Chinese ink wash painting style.

A detailed illustration in a realistic style depicting a raccoon wearing a classic detective's hat.

A close-up shot of a confident fashion influencer in a chic winter outfit, posing for a photo shoot.

A close-up shot of a woman gently kissing a baby on the cheek, leaving a subtle lipstick mark.

A scientific laboratory scene in a detailed digital painting style, featuring a panda wearing a white lab coat.

A close-up shot of a stunning diamond ring, showcasing its intricate facets and brilliant cut.

A vintage film-style photograph captures a moment between two women, one whispering a secret into the other's ear.

A close-up shot of a young woman deeply engrossed in solving a complex puzzle, her forehead creased in concentration.

A melancholic scene from a vintage film-style photograph captures a woman's lips trembling with sadness.

A close-up shot of a Victoria crowned pigeon in a naturalistic wildlife photography style, showcasing its elegant blue feathers.

A Japanese animated film-style scene of a young woman standing on a ship, looking back at the camera with wind in her hair.

A close-up of honey being drizzled onto pancakes, the thick golden liquid flowing slowly and smoothly.

A serene mountain lake at night, reflecting a starry sky, with a small boat gliding silently across the water.

A dramatic space scene in the style of a sci-fi movie poster, featuring a sleek silver spaceship being pulled into a black hole.

A dynamic and explosive basketball moment captured in a high-energy action style, showcasing a basketball player dunking.

A fantastical aerial landscape in the style of a high-fantasy illustration, depicting a floating island with waterfalls cascading into clouds.

A vibrant and lively vlog-style photo of a corgi in tropical Maui, showcasing the dog energetically playing on the beach.

An arc shot around a couple standing under a blooming cherry blossom tree, with petals gently falling around them.

A surreal and dreamlike scene in the style of a cyberpunk film, depicting New York City submerged underwater.

A post-apocalyptic city scene in a gritty, realistic style, where nature has reclaimed the urban landscape.

A dramatic landscape painting in the style of a Chinese ink wash, depicting a vast desert with towering sand dunes.

A vibrant anime illustration in a dynamic, thick-line painting style of a young girl blowing a kiss.

A slow-motion video captures a drop of liquid mercury, gleaming with a silvery sheen, bouncing gracefully on a surface.

A full-body shot of a man crafted entirely from rocks, walking through a dense forest.

An adorable kangaroo wearing vibrant purple overalls and stylish cowboy boots takes a leisurely stroll down a city street.
