NVIDIA Cosmos 3 (2026): Unified Physical AI Model Released

TL;DR

NVIDIA released Cosmos 3, the first unified omni-model for physical AI that combines video generation, physical reasoning, and action prediction in a single architecture
Two model sizes available now on Hugging Face: Cosmos 3 Nano (8B parameters, runs on RTX PRO 6000) and Cosmos 3 Super (32B parameters, requires Hopper/Blackwell GPUs)
Replaces multiple specialized models with one Mixture-of-Transformers architecture that handles text, image, video, audio, and action inputs through joint autoregressive and diffusion processing
Ships with complete tooling: Diffusers integration, post-training scripts on GitHub, and synthetic data generation datasets for physical AI applications

What Happened

NVIDIA shipped Cosmos 3 on Hugging Face today, replacing a workflow that previously spanned four separate models: Cosmos Predict for world generation, Cosmos Transfer for controlled generation, Cosmos Reason for scene understanding, and Cosmos Policy for action generation. Each needed its own inference pipeline and integration work.

Cosmos 3 collapses this complexity into one model. It uses a Mixture-of-Transformers (MoT) architecture that processes all modalities—text, image, video, audio, and action—within a unified forward pass. The architecture splits processing into two subsequences: an autoregressive path for reasoning and understanding, and a diffusion path for generation. Both paths share attention mechanisms but use separate parameters, letting the model switch between acting as a vision-language model, video generator, forward dynamics predictor, or robot policy without architectural changes.
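To make "shared attention, separate parameters" concrete, here is a minimal pure-Python sketch of one Mixture-of-Transformers attention step. It is an illustration under stated assumptions, not NVIDIA's implementation: the function names, the tiny dimensions, the random weights, and the use of full (unmasked) attention are all hypothetical choices made for the example. Each token is projected with the weights of its own path ("ar" or "diff"), and attention is then computed once over the whole mixed sequence.

```python
import math
import random


def matvec(weights, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_path_params(rng, dim):
    """Hypothetical per-path parameters: one Q, K and V projection each."""
    def rand_matrix():
        return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]
    return {"q": rand_matrix(), "k": rand_matrix(), "v": rand_matrix()}


def mot_attention(tokens, paths, params):
    """Project each token with its own path's weights, then attend jointly.

    tokens: list of vectors; paths: one label per token ("ar" or "diff").
    Assumption: full attention with no causal mask, to keep the sketch short.
    """
    dim = len(tokens[0])
    q = [matvec(params[p]["q"], x) for x, p in zip(tokens, paths)]
    k = [matvec(params[p]["k"], x) for x, p in zip(tokens, paths)]
    v = [matvec(params[p]["v"], x) for x, p in zip(tokens, paths)]

    outputs = []
    for q_i in q:
        # Shared attention: every token scores against all tokens of both paths.
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(dim) for k_j in k]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v_j[c] for w, v_j in zip(weights, v)) for c in range(dim)]
        )
    return outputs


rng = random.Random(0)
DIM = 4
params = {"ar": make_path_params(rng, DIM), "diff": make_path_params(rng, DIM)}
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(5)]
paths = ["ar", "ar", "diff", "diff", "diff"]  # e.g. text tokens, then video latents
mixed = mot_attention(tokens, paths, params)
```

Because the attention step is shared, changing only the "diff" weights also changes the outputs at the "ar" positions. That coupling is what would let a reasoning path condition on generated content and vice versa while each path keeps its own parameters.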

The release includes Cosmos 3 Nano (8B parameters) for workstation deployment and Cosmos 3 Super (32B parameters) for large-scale synthetic data generation. Both models are available now with Apache 2.0 licensing, full Diffusers integration, and post-training scripts on GitHub.

Why It Matters

Physical AI systems need to understand causality, physics, and motion—not just recognize objects in images. Training a robot to fold laundry or simulating edge-case driving scenarios requires models that predict how the physical world responds to actions. Before Cosmos 3, developers stitched together multiple models, each handling one piece of this reasoning.

Cosmos 3 matters because it reduces the inference and integration burden. One model handles the full pipeline from understanding a scene to generating plausible future states to predicting required actions. For robotics teams, this means faster iteration cycles. For autonomous vehicle developers, it means unified simulation pipelines. For anyone generating synthetic training data, it means one model to deploy and maintain.

The open release matters even more. NVIDIA isn’t gating this behind API access or enterprise licenses. Cosmos 3 Nano runs on developer-accessible hardware (RTX PRO 6000 GPUs), and the post-training scripts let teams adapt the model to specific robots, environments, and tasks. This is NVIDIA positioning itself as the foundation layer for physical AI, following the same playbook that made CUDA dominant—make the tools accessible, let the ecosystem build on top.

Key Details

Model Specifications

Model	Parameters	Reasoning Size	Generator Size	Hardware Requirements
Cosmos 3 Nano	8B	8B	8B	RTX PRO 6000 (workstation)
Cosmos 3 Super	32B	32B	32B	Hopper/Blackwell GPUs
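A rough way to read the hardware column is to estimate weight memory from the parameter count. The sketch below assumes bfloat16 weights (2 bytes per parameter, matching the dtype in the inference example) and counts weights only; activations, latents, and caches add to this. The helper name is hypothetical, and the table does not say whether the reasoning and generator sizes are included in the parameter count or additive, so treat the figures as a lower bound.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory for model weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


nano_gb = weight_memory_gb(8e9)    # 8B parameters in bfloat16
super_gb = weight_memory_gb(32e9)  # 32B parameters in bfloat16
print(f"Nano: {nano_gb:.0f} GB, Super: {super_gb:.0f} GB")
# prints "Nano: 16 GB, Super: 64 GB"
```

If the two paths each carry the listed size separately, the weight footprint would double to 32 GB and 128 GB respectively.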

Supported Capabilities

Video generation: Text/image/video to video (world modeling)
Vision-language understanding: Text/video to text (scene reasoning)
Forward dynamics: Action/image/text to video (predict outcomes)
Inverse dynamics: Text/video to action (infer required actions)
Policy generation: Image/text to video and action (unified robot control)
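The capability list above can be captured as a small lookup table, which is handy when deciding which mode a pipeline stage needs. This is a sketch for illustration only: the `Capability` class and `capabilities_producing` helper are hypothetical names, not part of any Cosmos API, and each input set is read as "the modalities this mode accepts" since the list does not say whether all of them are required together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    name: str
    inputs: frozenset   # modalities the mode accepts
    outputs: frozenset  # modalities the mode produces


CAPABILITIES = [
    Capability("video generation",
               frozenset({"text", "image", "video"}), frozenset({"video"})),
    Capability("vision-language understanding",
               frozenset({"text", "video"}), frozenset({"text"})),
    Capability("forward dynamics",
               frozenset({"action", "image", "text"}), frozenset({"video"})),
    Capability("inverse dynamics",
               frozenset({"text", "video"}), frozenset({"action"})),
    Capability("policy generation",
               frozenset({"image", "text"}), frozenset({"video", "action"})),
]


def capabilities_producing(modality):
    """Return the names of all modes whose outputs include the given modality."""
    return [c.name for c in CAPABILITIES if modality in c.outputs]


print(capabilities_producing("action"))
# prints ['inverse dynamics', 'policy generation']
```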

Availability

Models: nvidia/Cosmos3-Nano and nvidia/Cosmos3-Super on Hugging Face
Diffusers integration: Available in latest Diffusers release
Training scripts: Cosmos Framework on GitHub
License: Apache 2.0 (open source)
Datasets: Synthetic data generation datasets released on Hugging Face

Inference Example

import torch
from diffusers import Cosmos3OmniPipeline

# Load the Nano checkpoint in bfloat16 and place it on the GPU.
pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# Text prompt describing the scene to generate.
prompt = "A robotic arm mounted on a workbench, gripper positioned above colored objects..."
# num_frames=1 requests a single 1280x720 frame.
result = pipe(prompt=prompt, num_frames=1, height=720, width=1280)

Implications

Cosmos 3 signals NVIDIA’s bet that physical AI, not just language models, will define the next wave of AI deployment. The timing aligns with increasing investment in robotics (Wayve, Figure AI, 1X Technologies all raised significant rounds in 2025) and a perceived plateau in the returns from scaling pure language models.

The unified architecture matters strategically. Training separate models for perception, prediction, and control creates data efficiency problems. Each model needs its own training corpus, and transferring learned physics knowledge between models is difficult. A unified world foundation model trained on diverse physical scenarios—driving, manipulation, navigation—can share representations across domains. This is the same insight that made large language models effective: scale and diversity in training data matter more than task-specific architectures.

The open release creates ecosystem lock-in. By making Cosmos 3 accessible and providing post-training infrastructure, NVIDIA encourages developers to build on its stack. Once teams invest in training data pipelines, deployment infrastructure, and domain-specific fine-tuning for Cosmos 3, switching to competing world foundation models becomes costly.

Our Take

Cosmos 3 is significant, but not for the reasons NVIDIA emphasizes in its marketing. The unified architecture is impressive engineering, but the real story is the open release strategy and hardware targeting.

NVIDIA is playing the long game on physical AI infrastructure. By releasing Cosmos 3 under Apache 2.0 and targeting workstation GPUs with the Nano variant, it is seeding adoption before the market proves out. This is smart: world foundation models are still early, and the killer applications haven’t emerged yet. Rather than charge for API access to a model that might not fit most use cases, NVIDIA is betting that widespread adoption drives demand for its hardware and software ecosystem (GPUs, NIM microservices, Omniverse simulation).

The MoT architecture deserves attention. Mixing autoregressive and diffusion processing in one model is technically interesting, but the real test is whether the choice holds up as the field matures. If separate specialized models (pure diffusion models for generation, pure autoregressive transformers for reasoning) prove more efficient or capable, the unified architecture becomes technical debt rather than an advantage.

What to watch: Adoption by robotics companies over the next six months. If major robotics labs (Boston Dynamics, Agility, etc.) integrate Cosmos 3 into their simulation and training pipelines, that validates the approach. If they don’t, it suggests the model isn’t yet production-ready or that specialized models still outperform it. Also watch for competing releases from Google DeepMind (likely through their Genie/Gemini work) and OpenAI (which has been notably quiet on physical AI).