Coming Soon

Unleash Your Creativity with a Transformer Video Model

Experience the world's first *unified* multi-modal video foundation model. Generate, edit, and extend videos like never before.

Unified Workflow
Consistent Characters
AI-Powered Editing
Kling O1 Interface

Kling O1: The Only Transformer Video Model You'll Need

From initial concept to final cut, experience a unified workflow powered by the Multi-modal Visual Language framework.

All-in-One Video Engine

Kling O1 merges text-to-video, image-to-video, subject-to-video, and more into one unified model operating in a single semantic space. No more switching between different tools!
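The practical upshot of a single semantic space is that one request shape can cover every task. Kling has not published a public API schema for O1, so every field name below is invented; this is only a minimal sketch of what a unified request could look like.

```python
# Hypothetical sketch of a unified generation request.
# None of these field names come from Kling; they only illustrate
# how one payload could cover text-, image-, and subject-to-video.
from dataclasses import dataclass, field

@dataclass
class UnifiedVideoRequest:
    prompt: str                          # text-to-video instruction
    init_image: str | None = None        # image-to-video starting frame
    subject_refs: list[str] = field(default_factory=list)  # subject references
    duration_s: int = 5

# The same request type serves all three tasks -- no tool switching:
text_only = UnifiedVideoRequest(prompt="a fox running through snow")
image_seeded = UnifiedVideoRequest(prompt="animate this scene",
                                   init_image="scene.png")
subject_driven = UnifiedVideoRequest(
    prompt="the same character walks into a cafe",
    subject_refs=["hero_front.png", "hero_side.png"],
)
```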

Conversational Video Workflow

Edit your videos using natural language commands. Remove objects, change styles, and adjust scenes with unprecedented ease, thanks to pixel-level semantic reconstruction.
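To make the conversational idea concrete, here is a minimal sketch of an edit session driven by plain-language commands. The class and method names are invented for illustration; a real session would re-render the clip at each step rather than just record the command.

```python
# Hypothetical sketch: a conversational edit session. Each edit is a
# plain-language command -- no masks, no keyframes. EditSession is an
# invented placeholder, not a real Kling interface.
class EditSession:
    def __init__(self, video_path: str):
        self.video_path = video_path
        self.history: list[str] = []

    def apply(self, command: str) -> None:
        # A real model would re-render the clip here via pixel-level
        # semantic reconstruction; this sketch just logs the command.
        self.history.append(command)
        print(f"applied: {command!r}")

session = EditSession("draft.mp4")
session.apply("remove the passerby in the background")
session.apply("change it to pixel art style")
```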

Consistent Characters, Every Shot

Maintain character faces, clothing, and props consistently across multiple shots using up to 5 reference images. Finally, character consistency that actually works!
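The only hard constraint stated above is the cap of 5 reference images per subject. As a sketch, a client could enforce that cap before sending a request; the validation logic below is ours, not Kling's, and the dictionary shape is hypothetical.

```python
# Hypothetical sketch of the "up to 5 reference images" constraint
# described above; the field names are illustrative only.
MAX_SUBJECT_REFS = 5

def build_subject(name: str, reference_images: list[str]) -> dict:
    if not 1 <= len(reference_images) <= MAX_SUBJECT_REFS:
        raise ValueError(f"provide 1-{MAX_SUBJECT_REFS} reference images")
    return {"name": name, "references": reference_images}

hero = build_subject("hero", [
    "hero_face.png", "hero_outfit.png", "hero_prop_sword.png",
])
print(hero)
```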

Create Videos in Three Simple Steps

See how easy it is to bring your vision to life with Kling O1, the transformer video model.

1

Input Your References

Provide text prompts, reference images, and videos to define your desired content.

2

Generate and Refine

Let Kling O1 generate your initial video. Use natural language commands to refine and edit the output.

3

Extend and Share

Extend your video with advanced shot extension capabilities and native audio synchronization. Share your creations with the world!
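Putting the three steps together, here is a minimal end-to-end sketch. Kling has not published an O1 client library, so every name below (`KlingO1`, `generate`, `refine`, `extend`) is a hypothetical stand-in for whatever interface eventually ships; the structure simply mirrors the steps above.

```python
# Hypothetical end-to-end flow mirroring the three steps above.
# KlingO1 and its methods are invented placeholders, not a real SDK.
class KlingO1:
    def generate(self, prompt: str, refs: list[str]) -> str:
        return f"video from {prompt!r} with {len(refs)} refs"

    def refine(self, video: str, command: str) -> str:
        return f"{video} | edited: {command!r}"

    def extend(self, video: str, seconds: int) -> str:
        return f"{video} | extended by {seconds}s with synced audio"

model = KlingO1()

# Step 1: input your references
video = model.generate("a chase scene at dusk", refs=["hero_face.png", "city.jpg"])

# Step 2: generate and refine with natural language
video = model.refine(video, "make the lighting warmer")

# Step 3: extend and share
video = model.extend(video, seconds=5)
print(video)
```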

Frequently Asked Questions

Learn more about using Kling O1 as your transformer video model.

What is a unified multi-modal video model?
It's a single AI model that handles many video generation and editing tasks, such as text-to-video, image-to-video, and video editing, all within the same semantic space. Kling O1 (also known as Omni One), from Kuaishou Technology, is the first model of its kind.

How does Kling O1 keep characters consistent across shots?
Kling O1 uses a subject-based reference system: you can input up to 5 reference images to maintain consistent character faces, clothing, and props across multiple shots, even as camera angles change.

How do I edit videos with natural language?
You can perform pixel-level semantic reconstruction using natural language commands like 'remove the passerby in the background' or 'change it to pixel art style', with no manual masking or keyframing.

What is the Multi-modal Visual Language (MVL) framework?
MVL treats text, images, videos, and subjects as combinable instructions within the same large model. This eliminates switching between sub-models and tools, streamlining the video creation process.

Ready to transform your video creation process?

Experience the future of video creation with a unified multi-modal workflow.