1.1 — From Pixels to Latents: Video Diffusion Models
Video diffusion models extend image diffusion to the temporal domain. The core idea is learning to iteratively denoise a noisy video signal, progressively recovering clean frames from Gaussian noise. Two paradigms emerged — and only one survived at scale.
Pixel-space video diffusion, pioneered by Ho et al. in 2022 [Ho et al., 2022], applied the denoising process directly in pixel space. This proved computationally prohibitive at high resolutions and long durations because the data tensor grows with the product of height, width, and frame count. A 10-second 1080p clip at 24 fps, for example, holds about 2 million pixels per frame and roughly 500 million pixels in total, before counting color channels. The approach simply couldn't scale.
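As a rough check on the scale involved, the arithmetic can be worked through directly. This is a minimal sketch: the 24 fps frame rate and the three RGB channels are assumptions chosen for illustration, and `clip_pixels` is a hypothetical helper name.

```python
# Raw size of an uncompressed clip (assumed: 1920x1080, 24 fps, 10 seconds).
def clip_pixels(width, height, fps, seconds):
    """Return (pixels per frame, frame count, total pixels) for a clip."""
    per_frame = width * height
    frames = fps * seconds
    return per_frame, frames, per_frame * frames

per_frame, frames, total = clip_pixels(1920, 1080, fps=24, seconds=10)
print(per_frame)   # 2073600
print(frames)      # 240
print(total)       # 497664000
print(total * 3)   # 1492992000 values once RGB channels are counted
```

Every denoising step in pixel space has to process a tensor of this size, which is the cost that latent compression is meant to avoid.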
Latent Video Diffusion (LVD) became the dominant paradigm. The core insight: encode video into a compressed latent space via a Variational Autoencoder (VAE), then run the diffusion process in that latent space. Stable Video Diffusion (SVD) [Blattmann et al., 2023] brought this approach to scale, extending Stable Diffusion 2.1's UNet backbone with temporal convolution and temporal attention layers inserted after every spatial block.
SVD adds approximately 656M temporal parameters on top of ~900M spatial parameters, for a total exceeding 1.5B. Its three-stage training pipeline, (1) text-to-image pretraining, (2) large-scale video pretraining on 150M+ curated clips, and (3) high-quality finetuning on ~1M handpicked videos, established a template that many subsequent models have followed.
Training Objectives: The Shift to Flow Matching
The field has undergone a decisive shift in how models learn to generate. Two training frameworks dominated:
DDPM/EDM-style denoising: The model learns to predict noise (or the clean signal) at each diffusion step. SVD used the EDM framework [Karras et al., 2022] for noise preconditioning. Samplers in this family are commonly run with 50–200 inference steps, making generation slow.
Flow matching / Rectified Flow: Increasingly preferred in 2025–2026. Rather than simulating a stochastic diffusion process, flow matching trains a velocity field whose ODE deterministically transports noise to data [Lipman et al., 2023]; with the commonly used linear interpolant, the training paths are straight lines. Rectified flow pushes the learned trajectories toward nearly straight paths, which reduces ODE discretization error and enables generation in as few as 1–10 steps, roughly a 10–50× speedup in inference. Meta's Movie Gen, HunyuanVideo, Open-Sora 2.0, and Alibaba's Wan 2.1 all use flow matching.
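The flow-matching objective and the effect of straight paths on step count can be illustrated with a toy example. This is a sketch under stated assumptions, not any model's actual training code: it uses the convention that t=0 is noise and t=1 is data, the linear interpolant, and a hypothetical `oracle_velocity` that stands in for a trained network by assuming the data distribution is a single point.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity network: constant along the path."""
    return [b - a for a, b in zip(x0, x1)]

def fm_loss(predicted_v, x0, x1):
    """Mean squared error between a predicted velocity and the target."""
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted_v, target)) / len(target)

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy assumption: if the data were a single point, the exact velocity field
# is (data - x) / (1 - t). A trained network would replace this function.
DATA = [2.0, -1.0, 0.5]

def oracle_velocity(x, t):
    return [(d - xi) / (1.0 - t) for d, xi in zip(DATA, x)]

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in DATA]
one_step = euler_sample(oracle_velocity, noise, num_steps=1)
four_step = euler_sample(oracle_velocity, noise, num_steps=4)
```

Because the path in this toy case is exactly straight, one Euler step lands on the data point just as four steps do. Real learned fields are only approximately straight, which is why a handful of steps is still typical.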
1.2 — The Transformer Takeover: DiT Architecture
The migration from UNet to Transformer backbones is nearly complete. Almost every leading 2025–2026 video model uses a transformer. UNet reportedly persists only in Google Veo's hybrid approach and the legacy SVD model.
DiT (Diffusion Transformers), introduced by Peebles and Xie in 2023 [Peebles & Xie, 2023], replaced the UNet with a standard Transformer operating on patchified latent tokens. The key design innovation was Adaptive LayerNorm with zero initialization (AdaLN-Zero): each block regresses its LayerNorm scale and shift, plus a gate on the residual branch, from the timestep and class embeddings, and the zero initialization makes every block start out as the identity function. DiT demonstrated that the UNet's inductive bias, long assumed to be crucial for diffusion quality, was in fact unnecessary. This architecture now underpins Sora, Kling, HunyuanVideo, Hailuo, and most leading models.
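The conditioning mechanism can be sketched for a single token in plain Python. This is an illustrative simplification rather than the DiT reference implementation: the function names, the single linear modulation layer, and the `sublayer` callable (standing in for attention or an MLP) are all assumptions.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weight, bias, vec):
    """Plain matrix-vector product plus bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def adaln_zero_block(token, cond, mod_weight, mod_bias, sublayer):
    """Residual block whose norm is modulated by the conditioning vector."""
    d = len(token)
    params = linear(mod_weight, mod_bias, cond)          # 3*d modulation values
    shift, scale, gate = params[:d], params[d:2 * d], params[2 * d:]
    h = [n * (1.0 + s) + b for n, s, b in zip(layer_norm(token), scale, shift)]
    h = sublayer(h)
    return [t + g * hi for t, g, hi in zip(token, gate, h)]

token = [1.0, 2.0, 3.0, 4.0]
cond = [0.5, -0.5]                                        # e.g. timestep + class embedding
zero_w = [[0.0, 0.0] for _ in range(3 * len(token))]      # zero-initialized modulation
zero_b = [0.0] * (3 * len(token))
double = lambda h: [2.0 * v for v in h]                   # stand-in sublayer

at_init = adaln_zero_block(token, cond, zero_w, zero_b, double)
```

With the modulation layer zero-initialized, the gate is zero and `at_init` equals `token`, so the block contributes nothing until training moves the gate away from zero.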
Handling the Temporal Dimension
ViViT [Arnab et al., 2021] established foundational approaches to handling video's temporal dimension in transformers. Three of the attention designs it compared remain in use today:
1. Joint space-time attention — full self-attention across all spatiotemporal tokens. The most powerful approach but O(n²) in token count, making it prohibitive for long videos. Used by Sora and HunyuanVideo in high-quality settings.
2. Factorized encoder — two transformers in series: one models spatial interactions per-frame, the second models temporal interactions across frames.
3. Factorized self-attention — within each block, attention is computed first spatially, then temporally. Most production models use variants of this for efficiency.
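The cost gap between joint and factorized attention in the list above can be made concrete by counting query-key pairs per layer. The sizes below (32 latent frames, 1024 spatial tokens per frame) are assumed values for illustration, and `attention_pairs` is a hypothetical helper.

```python
def attention_pairs(frames, tokens_per_frame):
    """Count query-key pairs per attention layer for two layouts."""
    n = frames * tokens_per_frame
    joint = n * n                                  # every token attends to every token
    spatial = frames * tokens_per_frame ** 2       # each frame attends within itself
    temporal = tokens_per_frame * frames ** 2      # each spatial site attends across frames
    return joint, spatial + temporal

joint, factorized = attention_pairs(frames=32, tokens_per_frame=1024)
print(joint)        # 1073741824
print(factorized)   # 34603008
```

At these sizes the factorized layout computes about 31 times fewer pairs, and the gap widens as the clip gets longer. The trade-off is that no single layer lets a token attend to a different location in a different frame.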
Dual-Stream Architectures
HunyuanVideo and Open-Sora 2.0 use a "dual-stream to single-stream" design [Tencent, 2024]. In the dual-stream phase, video and text tokens are processed independently through separate transformer blocks, learning modality-specific representations. In the single-stream phase, the tokens are concatenated for cross-modal fusion. SkyReels V4 extended this further with an MMDiT (Multimodal Diffusion Transformer) design that adds separate streams for audio alongside video and text.
1.3 — Autoregressive vs. Diffusion: The Great Convergence
One of the most active research debates in 2025–2026 is whether video should be generated all-at-once (diffusion) or frame-by-frame (autoregressive). The answer, increasingly, is both.
Bidirectional diffusion generates all frames simultaneously via iterative denoising. It produces high quality but suffers from high latency (in the comparison reported by CausVid, a 128-frame video takes approximately 219 seconds), and outputs are fixed-length.
Autoregressive diffusion generates frames sequentially or in chunks, conditioning each new segment on previously generated frames. The advantages are transformative: in the same comparison, initial frame latency drops to ~1.3 seconds, continuous generation runs at ~9.4 FPS (interactive framerate), and sliding-window inference enables arbitrarily long videos despite training on short clips.
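The sliding-window rollout described above can be sketched as a simple loop. This is a hedged illustration of the control flow only: `generate_long_video` and `toy_chunk` are hypothetical names, and the toy generator (which just counts upward) stands in for a real chunk-level denoiser.

```python
from collections import deque

def generate_long_video(generate_chunk, first_frames, total_frames, window, chunk):
    """Roll out a video chunk by chunk, conditioning on a bounded context."""
    context = deque(first_frames, maxlen=window)   # oldest frames fall out automatically
    video = list(first_frames)
    while len(video) < total_frames:
        new = generate_chunk(list(context), chunk)
        new = new[: total_frames - len(video)]     # trim the final chunk if needed
        video.extend(new)
        context.extend(new)
    return video

def toy_chunk(context, n):
    """Stand-in for a denoiser: continue the sequence from the last context frame."""
    assert len(context) <= 4                       # the model never sees more than the window
    last = context[-1]
    return [last + i + 1 for i in range(n)]

video = generate_long_video(toy_chunk, [0], total_frames=10, window=4, chunk=3)
print(video)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The output length is set by the loop rather than by the model's context size, which is why training on short clips does not cap the length at inference time. The first chunk is also available as soon as it finishes, instead of after the whole video.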
The 2025–2026 convergence point is hybrid models. CausVid [CVPR 2025] distills bidirectional diffusion transformers into few-step autoregressive generators via distribution matching distillation — achieving diffusion-quality output in just 4 steps. Runway's GWM-1 (January 2026) is an autoregressive model built atop Gen-4.5 that generates frame-by-frame in real time with interactive control. NOVA achieves a VBench score of 80.1 at 2.75 FPS, trained in only 342 GPU days.
1.4 — Temporal Consistency & 3D VAE Innovation
The Video VAE is a critical and often underappreciated bottleneck. It determines compression quality, latent dimensionality, and downstream generation efficiency. Every improvement in the VAE cascades through the entire pipeline.
Maintaining Temporal Coherence
Keeping generated video consistent across frames is the central engineering challenge. Current approaches form a hierarchy of sophistication:
1. Temporal attention layers inserted after spatial blocks (SVD and other early approaches).
2. Full spatiotemporal attention across all space-time tokens: the gold standard, but quadratically expensive.
3. Causal attention with sliding windows, used by autoregressive models.
4. MemoryPack (2025), which integrates FramePack (short-term motion context) with SemanticPack (long-term semantic features from distant frames).
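The causal sliding-window variant mentioned above comes down to a frame-level attention mask. The following is a minimal sketch with a hypothetical helper name. It works at frame granularity, whereas real models apply the same pattern to blocks of tokens.

```python
def causal_window_mask(num_frames, window):
    """mask[i][j] is True when frame i may attend to frame j.

    Frame i sees itself and at most window - 1 earlier frames, never the future.
    """
    return [
        [0 <= i - j < window for j in range(num_frames)]
        for i in range(num_frames)
    ]

mask = causal_window_mask(num_frames=5, window=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# .xx..
# ..xx.
# ...xx
```

Each row has at most `window` allowed positions, so per-frame attention cost stays constant however long the video grows.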
3D VAE Breakthroughs
HunyuanVideo's 3D VAE, built on CausalConv3D, compresses video 4× temporally and 8× spatially and encodes the result into 16 latent channels. But the real innovation wave came at CVPR and ICLR 2025:
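The effect of these compression factors on tensor shape can be worked out directly. This sketch assumes a causal VAE that keeps the first frame on its own and compresses the remaining frames in groups, so T+1 input frames become T/4 + 1 latent frames. The function names and the 129-frame 720p input are illustrative assumptions.

```python
def latent_shape(frames, height, width, t_factor=4, s_factor=8, latent_channels=16):
    """Latent (frames, channels, height, width) for a causal 3D VAE.

    Assumes the first frame is encoded alone and the rest in groups of t_factor.
    """
    latent_frames = (frames - 1) // t_factor + 1
    return latent_frames, latent_channels, height // s_factor, width // s_factor

def num_values(shape):
    """Total number of scalar entries in a tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count

shape = latent_shape(frames=129, height=720, width=1280)
print(shape)                                   # (33, 16, 90, 160)
print(num_values((129, 3, 720, 1280)))         # 356659200 RGB input values
print(num_values(shape))                       # 7603200 latent values
```

The latent tensor holds roughly 47 times fewer values than the RGB input, even though the channel count grows from 3 to 16. The savings come entirely from the temporal and spatial factors.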
IV-VAE introduced a dual-branch architecture: a 2D branch for keyframe compression and a 3D branch for temporal compression. Counterintuitively, the paper reports that initializing from image VAEs with matching latent dimensions actually suppresses temporal compression capability.
DLFR-VAE adaptively determines optimal compression frame rates based on information-theoretic content complexity. High-motion segments get more latent frames; static scenes get fewer — an elegant solution to the one-size-fits-all problem.
Progressive Growing of Video Tokenizers achieved 16× temporal compression (versus the standard 4×), enabling generation of 4× longer videos within the same token budget.
VidTwin decouples latent representations into Structure Latents (semantic/spatial information) and Dynamics Latents (motion information), enabling independent control over appearance and movement — a direct path toward fine-grained editing.