The Architecture of Moving Pictures

Abstract

Video generation has undergone a fundamental architectural shift in the span of eighteen months. UNet backbones, the workhorse of image diffusion, have been almost entirely replaced by Diffusion Transformers (DiT). Flow matching has overtaken DDPM as the dominant training paradigm, enabling generation in as few as 1–10 denoising steps versus 50–200. Joint audio-video generation is now table stakes among leading models. And the framing has shifted from "video generation" to "world simulation," with Runway's GWM-1 generating interactively in real time and NVIDIA's Cosmos targeting physical-AI simulation. This report traces the technical evolution from pixel-space diffusion to latent DiT architectures, profiles every major model and its design choices, maps the research breakthroughs of 2025–2026, and translates them into production reality, from the $1B Disney–OpenAI licensing deal to projections that 75% of marketing videos will be AI-generated or AI-assisted by the end of 2026.

Section 01

Core Technical Foundations

From pixel-space denoising to latent diffusion transformers — the architectural stack that powers modern video generation.

1.1 — From Pixels to Latents: Video Diffusion Models

Video diffusion models extend image diffusion to the temporal domain. The core idea is learning to iteratively denoise a noisy video signal, progressively recovering clean frames from Gaussian noise. Two paradigms emerged — and only one survived at scale.

Pixel-space video diffusion, pioneered by Ho et al. [Ho et al., 2022], applied the denoising process directly in pixel space. This proved computationally prohibitive at high resolutions and long durations because the data tensor grows with the product of height, width, and frame count. A single 1080p frame holds about 2 million pixels, so a 10-second clip at 24 fps contains roughly 500 million. The approach simply couldn't scale.
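
The arithmetic behind that scaling problem can be sketched in a few lines. The 24 fps frame rate and the function name are assumptions chosen for illustration; the count is per colour channel.

```python
def raw_pixel_count(width: int, height: int, frames: int) -> int:
    """Pixel positions in an uncompressed clip (per colour channel)."""
    return width * height * frames


# One 1080p frame, then a 10-second clip at an assumed 24 fps.
per_frame = raw_pixel_count(1920, 1080, frames=1)
clip = raw_pixel_count(1920, 1080, frames=24 * 10)

print(per_frame)  # 2073600
print(clip)       # 497664000
```

Doubling the resolution along both axes multiplies the count by four, and doubling the duration doubles it again, which is why denoising directly on pixels becomes impractical so quickly.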

Latent Video Diffusion (LVD) became the dominant paradigm. The core insight: encode video into a compressed latent space via a Variational Autoencoder (VAE), then apply the diffusion process in that latent space. Stable Video Diffusion (SVD) [Blattmann et al., 2023] pioneered this at scale, extending Stable Diffusion 2.1's UNet backbone with temporal convolution and temporal attention layers inserted after every spatial convolution and attention layer.

Architecture Detail

SVD adds approximately 656M temporal parameters on top of ~900M spatial parameters, for a total of roughly 1.5B. Its three-stage training pipeline established a template that many later models followed: (1) text-to-image pretraining, (2) large-scale video pretraining on 150M+ curated clips, and (3) high-quality finetuning on ~1M handpicked videos.

Training Objectives: The Shift to Flow Matching

The field has undergone a decisive shift in how models learn to generate. Two training frameworks dominated:

DDPM/EDM-style denoising: The model learns to predict noise (or the clean signal) at each diffusion step. SVD used the EDM framework [Karras et al., 2022] for noise preconditioning. This approach requires 50–200 inference steps, making generation slow.

Flow matching / Rectified Flow: Increasingly preferred in 2025–2026. Rather than a stochastic diffusion process, flow matching defines deterministic straight-line transport paths between noise and data distributions [Lipman et al., 2023]. Rectified flow forces nearly straight transport paths, reducing ODE discretization error and enabling generation in as few as 1–10 steps — a 10–50× speedup in inference. Meta's Movie Gen, HunyuanVideo, Open-Sora 2.0, and Alibaba's Wan 2.1 all use flow matching.

Noise z₀ → Straight-line ODE → Video z₁
vs.
Noise z₀ → 50–200 stochastic steps → Video z₁
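
To make the straight-line picture concrete, here is a minimal sketch of the rectified-flow interpolation, its velocity target, and a fixed-step Euler sampler. It assumes an oracle velocity for a single noise/data pair in place of a trained network, and all function names are illustrative; a real model only approximates this field, so it typically needs more than one step.

```python
import random


def interpolate(z0, z1, t):
    """Point on the straight path x_t = (1 - t) * z0 + t * z1."""
    return [(1 - t) * a + t * b for a, b in zip(z0, z1)]


def target_velocity(z0, z1):
    """Regression target along a straight path: the constant z1 - z0."""
    return [b - a for a, b in zip(z0, z1)]


def euler_sample(z0, velocity_fn, steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = list(z0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
z0 = [random.gauss(0.0, 1.0) for _ in range(4)]  # noise sample
z1 = [0.5, -1.0, 2.0, 0.0]                       # stand-in for a data latent


def oracle_velocity(x, t):
    """Assumption: the exact velocity for this one pair, not a learned model."""
    return target_velocity(z0, z1)


one_step = euler_sample(z0, oracle_velocity, steps=1)
ten_steps = euler_sample(z0, oracle_velocity, steps=10)
```

Because the velocity is constant along a perfectly straight path, one Euler step and ten Euler steps land on the same endpoint here. The straighter the learned paths, the smaller the discretization error at low step counts.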

1.2 — The Transformer Takeover: DiT Architecture

The migration from UNet to Transformer backbones is now nearly complete. Almost every leading 2025–2026 video model uses a transformer; UNet persists only in Google Veo's hybrid approach and the legacy SVD model.

DiT (Diffusion Transformers), introduced by Peebles and Xie in 2023 [Peebles & Xie, 2023], replaced the UNet with a standard Transformer operating on patchified latent tokens. The key design innovation was Adaptive LayerNorm with zero initialization (AdaLN-Zero), which conditions the transformer on timestep and class embeddings. DiT demonstrated that the UNet's inductive bias — long assumed to be crucial for diffusion quality — was in fact unnecessary. This architecture now underpins Sora, Kling, HunyuanVideo, Hailuo, and most leading models.
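
The conditioning mechanism can be sketched as follows. This is a simplified, dependency-free illustration of the AdaLN-Zero idea rather than the DiT implementation: a linear map regresses shift, scale, and gate values from the conditioning vector, and because that map starts at zero, each block initially behaves as the identity. The helper names and the toy sublayer are assumptions.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, c):
    """y = W c + b, with W stored as a list of rows."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weights, bias)]


def zero_init_modulation(dim, cond_dim):
    """Zero-initialized projection producing 3 * dim values (shift, scale, gate)."""
    return [[0.0] * cond_dim for _ in range(3 * dim)], [0.0] * (3 * dim)


def adaln_zero_block(x, cond, weights, bias, sublayer):
    """x + gate * sublayer(LN(x) * (1 + scale) + shift)."""
    d = len(x)
    params = linear(weights, bias, cond)
    shift, scale, gate = params[:d], params[d:2 * d], params[2 * d:]
    h = [n * (1 + s) + sh for n, s, sh in zip(layer_norm(x), scale, shift)]
    return [xi + g * oi for xi, g, oi in zip(x, gate, sublayer(h))]


x = [1.0, -2.0, 0.5, 3.0]
cond = [0.3, -0.7]  # stand-in for a timestep + class embedding
w, b = zero_init_modulation(dim=4, cond_dim=2)


def toy_sublayer(h):
    """Placeholder for the attention or MLP sublayer."""
    return [2 * v for v in h]


out_at_init = adaln_zero_block(x, cond, w, b, toy_sublayer)
```

With zero-initialized modulation the gate is zero, so `out_at_init` equals `x`; training then moves the gate away from zero as the sublayer becomes useful.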

Handling the Temporal Dimension

ViViT [Arnab et al., 2021] established foundational approaches to handling video's temporal dimension in transformers. Three of the attention designs it described remain in use today:

1. Joint space-time attention — full self-attention across all spatiotemporal tokens. The most powerful approach but O(n²) in token count, making it prohibitive for long videos. Used by Sora and HunyuanVideo in high-quality settings.

2. Factorized encoder — two transformers in series: one models spatial interactions per-frame, the second models temporal interactions across frames.

3. Factorized self-attention — within each block, attention is computed first spatially, then temporally. Most production models use variants of this for efficiency.
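
The efficiency gap between the joint and factorized designs above comes down to how many query-key pairs each one scores. The sketch below counts those pairs for a hypothetical clip of 16 latent frames with 1,024 tokens per frame; both numbers are illustrative assumptions.

```python
def joint_pairs(frames: int, tokens_per_frame: int) -> int:
    """Every token attends to every other token across space and time."""
    n = frames * tokens_per_frame
    return n * n


def factorized_pairs(frames: int, tokens_per_frame: int) -> int:
    """Spatial attention within each frame, then temporal attention per location."""
    spatial = frames * tokens_per_frame ** 2
    temporal = tokens_per_frame * frames ** 2
    return spatial + temporal


joint = joint_pairs(16, 1024)
factorized = factorized_pairs(16, 1024)
print(joint)       # 268435456
print(factorized)  # 17039360
```

In this configuration the factorized scheme scores roughly 16 times fewer pairs, at the price of never letting a token attend directly to a different location in a different frame within a single attention operation.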

Dual-Stream Architectures

HunyuanVideo and Open-Sora 2.0 introduced a "dual-stream to single-stream" design [Tencent, 2024]. In the dual-stream phase, video and text tokens are processed independently through separate transformer blocks, learning modality-specific representations. In the single-stream phase, tokens are concatenated for cross-modal fusion. SkyReels V4 extended this further with MMDiT (Multimodal Diffusion Transformer), adding separate streams for audio alongside video and text.

1.3 — Autoregressive vs. Diffusion: The Great Convergence

One of the most active research debates in 2025–2026 is whether video should be generated all-at-once (diffusion) or frame-by-frame (autoregressive). The answer, increasingly, is both.

Bidirectional diffusion generates all frames simultaneously via iterative denoising. It produces high quality but suffers from high latency (in CausVid's reported comparison, a 128-frame video takes approximately 219 seconds), and outputs are fixed-length.

Autoregressive diffusion generates frames sequentially or in chunks, conditioning each new segment on previously generated frames. The advantages are transformative: CausVid reports initial frame latency of ~1.3 seconds and continuous generation at ~9.4 FPS (an interactive frame rate), while sliding-window inference enables arbitrarily long videos despite training on short clips.
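
The chunked, sliding-window generation loop can be sketched as below. The `generate_chunk` function is a hypothetical stand-in for a few-step causal denoiser and only returns frame indices; the chunk length and window size are arbitrary illustrative values.

```python
from collections import deque


def generate_chunk(context, chunk_len, next_id):
    """Hypothetical denoiser stand-in: returns frame ids instead of pixels."""
    return list(range(next_id, next_id + chunk_len))


def stream_video(total_frames, chunk_len=4, window=8):
    """Yield (chunk, context_size) pairs, keeping only the most recent frames."""
    context = deque(maxlen=window)  # older frames fall out automatically
    produced = 0
    while produced < total_frames:
        n = min(chunk_len, total_frames - produced)
        chunk = generate_chunk(list(context), n, produced)
        context.extend(chunk)
        produced += n
        yield chunk, len(context)


for chunk, context_size in stream_video(total_frames=10):
    print(chunk, context_size)
```

Because the context never grows past the window, per-chunk cost stays constant no matter how long the stream runs, and the first chunk can be shown to the viewer before later chunks exist.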

Key Insight

The 2025–2026 convergence point is hybrid models. CausVid [CVPR 2025] distills bidirectional diffusion transformers into few-step autoregressive generators via distribution matching distillation — achieving diffusion-quality output in just 4 steps. Runway's GWM-1 (January 2026) is an autoregressive model built atop Gen-4.5 that generates frame-by-frame in real time with interactive control. NOVA achieves a VBench score of 80.1 at 2.75 FPS, trained in only 342 GPU days.

1.4 — Temporal Consistency & 3D VAE Innovation

The Video VAE is a critical and often underappreciated bottleneck. It determines compression quality, latent dimensionality, and downstream generation efficiency. Every improvement in the VAE cascades through the entire pipeline.

Maintaining Temporal Coherence

Keeping generated video consistent across frames is the central engineering challenge. Current approaches form a hierarchy of sophistication:

1. Temporal attention layers inserted after spatial blocks (SVD, early approaches).

2. Full spatiotemporal attention across all space-time tokens: the gold standard, but quadratically expensive.

3. Causal attention with sliding windows for autoregressive models.

4. MemoryPack (2025), which integrates FramePack (short-term motion context) with SemanticPack (long-term semantic features from distant frames).
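
As an illustration of the causal sliding-window option, the sketch below builds a frame-level attention mask. It is a generic construction, not any specific model's implementation, and the window size is an arbitrary assumption.

```python
def causal_window_mask(num_frames: int, window: int):
    """mask[i][j] is True when frame i may attend to frame j.

    Frame i sees itself and up to window - 1 earlier frames, never later ones.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(num_frames)]
        for i in range(num_frames)
    ]


mask = causal_window_mask(num_frames=5, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row contains at most `window` allowed positions, so attention cost per frame stays bounded as the video grows.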

3D VAE Breakthroughs

HunyuanVideo's 3D VAE uses CausalConv3D to achieve 4× temporal compression and 8× spatial compression, encoding video into 16 latent channels. But the real innovation wave came at CVPR and ICLR 2025:
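
A rough sketch of what those factors mean for tensor shapes follows. It assumes the common causal convention in which the first frame is encoded on its own and the remaining frames in groups of four, and it treats 16 as the latent channel count; the 129-frame 720p input is an illustrative choice.

```python
def causal_vae_latent_shape(frames, height, width,
                            t_factor=4, s_factor=8, latent_channels=16):
    """Latent shape (C, T', H', W') for a causal 3D VAE under the stated assumptions."""
    if (frames - 1) % t_factor or height % s_factor or width % s_factor:
        raise ValueError("input dimensions must align with the compression factors")
    return (
        latent_channels,
        (frames - 1) // t_factor + 1,  # first frame kept, the rest compressed 4x
        height // s_factor,
        width // s_factor,
    )


def element_count(shape):
    """Total number of values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total


pixels = (3, 129, 720, 1280)  # RGB input clip
latent = causal_vae_latent_shape(129, 720, 1280)
print(latent)  # (16, 33, 90, 160)
print(element_count(pixels) / element_count(latent))
```

Even though the channel count rises from 3 to 16, the temporal and spatial downsampling leaves the latent far smaller than the pixel tensor, and every later stage of the pipeline operates on that smaller tensor.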

IV-VAE introduced a dual-branch architecture — a 2D branch for keyframe compression and a 3D branch for temporal compression. Counterintuitively, the paper discovered that initializing from image VAEs with matching latent dimensions actually suppresses temporal compression capability.

DLFR-VAE adaptively determines optimal compression frame rates based on information-theoretic content complexity. High-motion segments get more latent frames; static scenes get fewer — an elegant solution to the one-size-fits-all problem.

Progressive Growing of Video Tokenizers achieved 16× temporal compression (versus the standard 4×), enabling generation of 4× longer videos within the same token budget.

VidTwin decouples latent representations into Structure Latents (semantic/spatial information) and Dynamics Latents (motion information), enabling independent control over appearance and movement — a direct path toward fine-grained editing.

16×: temporal compression (new SOTA)
47×: VAE training speedup (RAE)
80K+: tokens per 5s 720p video

Section 02

The Model Landscape

A comparative analysis of every major video generation model — architecture, capability, and competitive position as of April 2026.

Model	Organization	Params	Architecture	Max Res	Duration	Audio	Elo
Seedance 2.0	ByteDance	—	Dual-Branch Diffusion	1080p	15s	Native	1,273
SkyReels V4	SkyReels	—	Dual-stream MMDiT	1080p / 32fps	15s	Native	1,245
Kling 3.0	Kuaishou	—	DiT + 3D VAE	4K / 30fps	15s	Native (5 lang)	1,242
PixVerse V6	PixVerse	—	—	1080p	15s	Native	1,242
Veo 3	Google	—	3D Latent Diffusion	1080p	8s	Native	—
Sora 2	OpenAI	—	DiT (spacetime patches)	—	25s	Yes	—
Movie Gen	Meta	30B	Llama 3-style + Flow Match	1080p	16s	Yes	—
HunyuanVideo 1.5 (open)	Tencent	8.3B	Dual→Single DiT	—	—	—	—
Wan 2.1 (open)	Alibaba	14B	DiT + Flow Match + T5	720p	5s	—	—
Hailuo 02	MiniMax	—	DiT + NCR	1080p	10s	—	—
GWM-1	Runway	—	Autoregressive (on Gen-4.5)	—	Real-time	Yes	—
LTX-2 (open)	Lightricks	19B	Asymmetric Dual-stream DiT	4K / 50fps	—	Native	—

ByteDance Seedance 2.0 — The New #1

As of March 2026, ByteDance's Seedance 2.0 holds the top position on the Artificial Analysis Video Arena with an Elo of 1,273 for text-to-video and 1,351 for image-to-video. Its unified multimodal architecture accepts text, image, audio, and video inputs simultaneously — up to 12 input files. The Dual-Branch Diffusion architecture generates video and audio in parallel, enabling native audiovisual output in a single pass.

The strategic significance is platform integration: Seedance 2.0 shipped directly into CapCut (ByteDance's editing platform) in March 2026, creating a seamless pipeline from AI generation to social media distribution. No other leading model has shipped inside a consumer app at ByteDance's scale.

Kuaishou Kling 3.0 — The Feature Leader

Kling 3.0 represents the most feature-complete video generation system available. Native 4K output at 30 FPS. 3–15 second multi-shot clips. Built-in multilingual audio (5 languages). But the differentiator is production control: motion brush for drawing motion paths directly on frames, motion capture extraction from reference videos (3–30 seconds), director-level camera controls (pan, tilt, zoom, dolly, rack focus), and physics simulation for gravity, balance, deformation, and collision.

The progression from Kling 2.5 through O1 to 3.0 tells the architectural story of the field. Kling O1 (December 2025) introduced the MVL (Multimodal Visual Language) framework — a unified architecture that handles reference-based generation, text-to-video, start/end frame generation, inpainting, style re-rendering, and shot extension in one model. Kling 3.0 built production tools on top of that unified foundation [Kuaishou, 2026].

OpenAI Sora — The Paradigm Setter

Sora's technical report [OpenAI, 2024] reframed video generation as "world simulation" — a framing that the entire field subsequently adopted. The key innovation was spacetime patches: analogous to how ViT treats images as 16×16 patches, Sora extends this to 3D volumes of video latent codes, enabling training on videos and images of variable resolutions, durations, and aspect ratios within a single model.

Sora 2 (2025) enhanced realism with more accurate physics, sharper visuals, synchronized audio, and multi-shot instruction following. At 25 seconds, it supports the longest single-generation duration among major models. Internal architecture details — parameter count, training compute, dataset composition — remain undisclosed.

Meta Movie Gen — The Open Blueprint

Movie Gen is notable less for its capabilities than for its transparency. The 92-page paper [Meta, 2024] documents a 30B-parameter Transformer closely following the Llama 3 design, trained with a maximum context length of 73K video tokens (16 seconds at 16 FPS). It uses flow matching as the training objective. Training data: 1 billion image-text pairs plus 100 million video-text pairs.
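
One way to see where a context of about 73K tokens can come from is the back-of-the-envelope sketch below. The 8× temporal and spatial compression, the 2×2 spatial patch size, and the 768×768 training resolution are assumptions used for illustration and are not stated in this report.

```python
def video_tokens(seconds, fps, height, width,
                 t_factor=8, s_factor=8, patch=2):
    """Token count after latent compression and spatial patchification.

    All factor defaults are illustrative assumptions.
    """
    latent_frames = (seconds * fps) // t_factor
    tokens_per_frame = (height // s_factor // patch) * (width // s_factor // patch)
    return latent_frames * tokens_per_frame


tokens = video_tokens(seconds=16, fps=16, height=768, width=768)
print(tokens)  # 73728
```

Under these assumptions, 256 frames become 32 latent frames of 48×48 tokens each, which lands on the 73K figure. Changing any one factor shifts the total substantially, so the sequence length is as much a tokenizer decision as a model decision.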

The paper provides unprecedented detail on latent spaces, training recipes, data curation pipelines, evaluation protocols, parallelization strategies, and inference optimizations. Despite the model not being publicly released, it has become the most-cited reference architecture in the field.

The Open-Source Contenders

Tencent HunyuanVideo 1.5 trimmed the original 13B+ model to 8.3B parameters while maintaining quality — crucially, it runs on consumer GPUs. The key innovation is SSTA (Selective and Sliding Tile Attention), which selectively focuses compute on important regions with sliding-window spatial processing, delivering nearly 2× inference speed without quality loss [Tencent, 2025].

Alibaba Wan 2.1 with its VACE extension became the first open-source unified model for both video generation and editing, supporting multi-modal inputs, video repainting, area modification, and spatiotemporal extension. It is available in 14B and 1.3B variants under the Apache 2.0 license. With the 1.3B variant, a 5-second 480p video generates in under 4 minutes on a single RTX 4090.

Open-Sora 2.0 is perhaps the most remarkable efficiency story: an 11B parameter model achieving commercial-level quality, trained for only $200K — 5–10× more cost-efficient than Movie Gen or Step-Video [Open-Sora, 2025].

MiniMax Hailuo — The Efficiency Innovator

MiniMax's core contribution is Noise-aware Compute Redistribution (NCR), which fundamentally reimagines how computational resources are allocated during diffusion. NCR redistributes compute according to noise levels in the diffusion process, achieving 2.5× training and inference efficiency at comparable parameter scale [MiniMax, 2025]. This is an underappreciated innovation — as models scale to 10B+ parameters, efficiency techniques like NCR become as important as architectural improvements.

Section 04

Production Translation

How research breakthroughs map to real-world deployment — the economics, workflows, and industry shifts already underway.

4.1 — Advertising & Marketing

The most immediate and largest commercial application. Industry projections indicate 75% of marketing videos will be AI-generated or AI-assisted by end of 2026. The economics are straightforward: a 15-second product video that previously required a $10K–$50K production budget can now be generated for under $5 in compute, iterated upon in minutes, and localized across markets in hours.

Disney's $1 billion deal with OpenAI (December 2025) licenses 200+ characters for Sora-based generation — the largest single licensing deal in AI history and a signal that major IP holders see generative video as a distribution channel, not a threat [Disney, 2025].

Kling 3.0 is specifically positioned for this use case: product video, multi-shot commercial sequences, and multilingual content with native audio in five languages.

4.2 — Film Pre-visualization & VFX

VFX pre-viz is the most immediately adopted use case in professional filmmaking — rapid iteration on shot design before committing to expensive production. Directors can now test camera angles, lighting setups, and action sequences in minutes rather than days.

AI-driven lip-sync dubbing reduces localization timelines by 60–70% [McKinsey, 2025]. MARS AI captures original speaker personality and transfers it to dubbed content. Virtual production is becoming economically viable for mid-budget productions, with hardware costs down 40% since 2022.

Industry Nuance

The "Script to Screen" workflow — structured prompt-to-multi-shot generation — is emerging but remains limited by long-form coherence. Current practical limits are approximately 60 seconds of coherent single-generation video. Multi-shot approaches extend this but require careful prompt engineering and manual review at each cut point. AI video is augmenting professional production workflows, not replacing them.

4.3 — Gaming & Physical AI

Runway's Game Worlds (September 2025) and GWM-1 specifically target interactive, real-time world simulation — generating game environments that respond to player actions. NVIDIA's Cosmos World Foundation Models are designed for physical AI training and simulation, enabling robots and autonomous vehicles to learn from generated scenarios [NVIDIA, 2025].

4.4 — E-Commerce & Product Video

Image-to-video capabilities (Kling 3.0, Seedance 2.0) enable automated product demonstration videos from still product photography. The workflow: photograph a product once, generate multiple videos showing it in different contexts, environments, and use cases. The 4K output quality of Kling 3.0 meets e-commerce display requirements.

4.5 — Content at Scale

Enterprise APIs enable automated script-to-video pipelines, brand-safe templating, and bulk rendering. PixVerse V6's CLI accessibility for agentic workflows enables programmatic content generation — an AI agent can generate, review, and publish video without human intervention at each step. ByteDance's CapCut integration of Seedance 2.0 creates a pipeline from generation directly into social media editing and distribution.

Industry prediction: 45% of all video content will be AI-generated by 2027.

75%

Marketing videos AI-generated (2026 est.)

$1B

Disney–OpenAI licensing deal

60–70%

Localization time reduction

$0.50–$2

Compute cost per 10s video

Section 05

Open Frontiers

The unsolved problems that define the next generation of research — where current models fail, and what it would take to fix them.

Physics Accuracy

Current models approximate visual physics but fail systematically on conservation laws (momentum, energy), multi-body interactions and collisions, material property consistency (rigidity, elasticity, fluidity), and cause-and-effect chains beyond simple scenarios. The gap between "visually plausible" and "physically correct" is the difference between a content tool and a world simulator. WISA and PhyGenesis represent early attacks on this problem, but the gap remains large.

Long-Form Coherence

Despite breakthroughs (LoL at 12 hours, StreamingT2V at 2 minutes), long-form coherent storytelling remains unsolved. Character appearance drift, scene inconsistency, and narrative incoherence accumulate over duration. Current practical limits: ~60 seconds for high-quality single-generation. Multi-shot approaches extend this but require manual intervention at cut points. Maintaining a character's identity, clothing, and environment across a 5-minute narrative remains beyond current capabilities.

Computational Cost

A 5-second 720p video requires processing 80,000+ tokens. Video attention operations consume 85%+ of inference time with quadratic scaling. A 5-second 720p video takes ~17 minutes on a single H100 without optimization. Memory requirements: 20–80GB per generation. Video generation requires 10–100× more compute than LLMs. TurboDiffusion (100–200× speedup) and SSTA (nearly 2×) are making progress, but real-time high-quality generation at 1080p+ remains out of reach for consumer hardware.
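
The quadratic term is easy to underestimate, so here is a small sketch of the attention score matrix size for a sequence of 80,000 tokens. It assumes 2-byte (half-precision) scores and a naive implementation that materializes the full matrix for one head in one layer; fused attention kernels avoid storing it, but the number of pairs to score is unchanged.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens


def score_matrix_bytes(num_tokens: int, bytes_per_value: int = 2) -> int:
    """Memory for one fully materialized score matrix (one head, one layer)."""
    return attention_pairs(num_tokens) * bytes_per_value


print(attention_pairs(80_000))     # 6400000000
print(score_matrix_bytes(80_000))  # 12800000000
print(attention_pairs(160_000) // attention_pairs(80_000))  # 4
```

Doubling the clip length or the token density quadruples the work, which is why sparse, tiled, and windowed attention schemes are the main lever for inference speed.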

Evaluation Metrics

FVD (Fréchet Video Distance) has only moderate agreement with human quality judgments, failing on temporal flicker, semantic correctness, and spatial relationships. VBench-2.0 (March 2025) extends to 18 capabilities across Human Fidelity, Creativity, Controllability, Physics, and Commonsense, using VLM/LLM pipelines with specialist detectors [VBench, 2025]. The Artificial Analysis Video Arena provides Elo ratings via blind pairwise comparisons. But no single metric captures all dimensions. Human evaluation remains the gold standard, though it is expensive and slow.
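
To interpret arena ratings, it helps to convert a rating gap into an expected preference rate. The sketch below uses the classic Elo formula with a 400-point scale and a K-factor of 32; both are textbook defaults assumed here, and the arena's actual rating procedure may differ.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Probability that A is preferred over B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def elo_update(rating: float, expected: float, actual: float, k: float = 32.0) -> float:
    """New rating after one comparison (actual is 1 for a win, 0 for a loss)."""
    return rating + k * (actual - expected)


# Ratings taken from the model table: 1,273 versus 1,242.
p = elo_expected_score(1273, 1242)
print(round(p, 3))  # about 0.54
```

Under this model a 31-point lead corresponds to being preferred in a little over half of blind comparisons, so small gaps near the top of a leaderboard are best read as near ties.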

Copyright & Licensing

In a May 2025 report, the U.S. Copyright Office concluded that using copyrighted works for AI training may constitute prima facie infringement, with some uses qualifying as fair use and others not. Bartz v. Anthropic resulted in a $1.5 billion settlement, 2025's largest AI copyright case. Universal Music Group and Udio settled with a licensing agreement for authorized AI music training. The Disney–OpenAI deal established a licensing precedent. The U.S. Supreme Court (March 2, 2026) declined to hear the Thaler appeal, leaving in place the rule that works without a human creator are ineligible for copyright protection. The legal landscape is rapidly crystallizing around licensing models rather than fair-use defenses.

Section 06

What Comes Next

The trajectories that are already locked in, and the open questions that will define the next 12 months.

The Locked-In Trajectories

World models as the convergence point. The reframing from "video generation" to "world simulation" is not marketing — it reflects a genuine architectural convergence. Runway's GWM-1, NVIDIA's Cosmos, and the autoregressive hybrid approaches are all building toward models that don't generate clips but simulate environments. The implications extend far beyond content creation into robotics, autonomous vehicles, and scientific simulation.

Real-time interactive generation. CausVid's 4-step inference at 9.4 FPS and GWM-1's frame-by-frame interactive control signal that the latency problem is being solved. Within 12 months, expect interactive video generation at rates indistinguishable from real-time rendering — enabling AI-powered game engines, virtual production environments, and live content creation.

10-second clips are commoditized. The competitive differentiation has moved upstream. All leading models produce high-quality short clips. The new frontiers are long-form coherence, fine-grained control, native multi-modal output, and production integration. The $0.50–$2.00 per-clip compute cost will continue falling as efficiency techniques like NCR, SSTA, and TurboDiffusion proliferate.

The Open Questions

Open-source gap: 6 months or permanent? Open-source models have consistently closed the gap to frontier models within 6–12 months. But as models scale to 30B+ parameters and require billions in training data, the capital barrier may widen the gap. The $200K Open-Sora 2.0 result is encouraging, but it's unclear if this efficiency scales to the next capability threshold.

Production control vs. creative freedom. Kling 3.0 and PixVerse V6 have pushed cinematographic control further than anyone expected. But there's an inherent tension: the more control parameters you expose, the more you're asking users to manually specify what the model should infer. The models that win in production will be the ones that balance precise control with intelligent defaults.

The licensing model. Disney's $1B deal with OpenAI may establish the template — major IP holders licensing their content for AI training and generation in exchange for revenue share and creative control. If this model scales, it resolves the copyright tension. If it doesn't, litigation will define the boundaries instead.

We are not building a video generator. We are building a world simulator. The video is just what you see when you look at the world from one camera angle. — Cristóbal Valenzuela, CEO, Runway, on GWM-1 (January 2026)

The trajectory is clear: video generation is dissolving into something larger. Within two years, the term itself may feel as quaint as "text-to-image" does today. What we're building are machines that understand — and can instantiate — visual reality. The architecture of moving pictures is becoming the architecture of simulated worlds.

References

Ho, J., Salimans, T., Gritsenko, A., et al. "Video Diffusion Models." arXiv:2204.03458, 2022.
Blattmann, A., Dockhorn, T., Kulal, S., et al. "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets." arXiv:2311.15127, 2023.
Karras, T., Aittala, M., Aila, T., Laine, S. "Elucidating the Design Space of Diffusion-Based Generative Models." NeurIPS, 2022.
Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., et al. "Flow Matching for Generative Modeling." ICLR, 2023.
Peebles, W. & Xie, S. "Scalable Diffusion Models with Transformers." ICCV, 2023.
Arnab, A., Dehghani, M., Heigold, G., et al. "ViViT: A Video Vision Transformer." ICCV, 2021.
OpenAI. "Video Generation Models as World Simulators." Technical Report, 2024.
Polyak, A., Zohar, A., Brown, A., et al. "Movie Gen: A Cast of Media Foundation Models." Meta AI Research, arXiv:2410.13720, 2024.
Tencent Hunyuan. "HunyuanVideo: A Systematic Framework For Large Video Generative Models." 2024.
Tencent Hunyuan. "HunyuanVideo 1.5: Selective and Sliding Tile Attention." 2025.
Alibaba. "Wan 2.1: Open-Source Video Generation Foundation Model." 2025.
Alibaba. "Wan 2.1-VACE: Unified Model for Video Creation and Editing." 2025.
Open-Sora Team. "Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k." arXiv:2503.09642, 2025.
MiniMax. "Hailuo 02: Noise-aware Compute Redistribution for Video Generation." 2025.
NVIDIA. "Cosmos Predict 2.5: World Foundation Models for Physical AI." CoRL, 2025.
Runway. "Introducing GWM-1: A General World Model." 2026.
Google DeepMind. "Veo 3: Joint Audio-Video Generation." 2025.
Lightricks. "LTX-2: Asymmetric Dual-Stream Video-Audio Generation." 2026.
Kuaishou. "Kling O1: World's First Unified Multimodal Video Model." 2025.
Kuaishou. "Kling 3.0: Native 4K Video Generation with Director Controls." 2026.
ByteDance. "Seedance 2.0: Unified Multimodal Video Generation." 2026.
ShengShu Technology & Tsinghua University. "TurboDiffusion: Real-Time AI Video Generation." 2025.
CausVid Team. "CausVid: Few-Step Causal Video Generation." CVPR, 2025.
Henschel, R., et al. "StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text." CVPR, 2025.
Huang, Z., et al. "VBench: Comprehensive Benchmark Suite for Video Generative Models." CVPR, 2024.
Huang, Z., et al. "VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness." arXiv:2503.21755, 2025.
PixVerse. "PixVerse V6: Cinematic Lens Controls for AI Video." 2026.
Luma Labs. "Ray3 / Dream Machine 2.0: Scalable Video Transformer." 2026.
SkyReels. "SkyReels V4: Dual-Stream MMDiT for Audio-Visual Generation." 2026.
McKinsey & Company. "What AI Could Mean for Film and TV Production." 2025.
U.S. Copyright Office. "Copyright and Artificial Intelligence Part 3: Generative AI Training." 2025.
Artificial Analysis. "Text-to-Video Leaderboard." Accessed April 2026.