Stable Video Diffusion Img2Vid XT is an image-to-video latent diffusion model from Stability AI that generates short video clips from a single static image. It produces 25 frames at 576x1024 resolution and was fine-tuned from the earlier 14-frame SVD model to improve temporal consistency. The model takes no text prompt; a single conditioning frame guides generation, making it well suited to stylized motion and animation. It includes both a standard frame-wise decoder and a fine-tuned f8-decoder that enhances coherence across frames. Output videos are short (under 4 seconds) and not always fully photorealistic: faces and realistic motion may be rendered inconsistently, and the model cannot generate legible writing. It is suited for creative video generation, research, and educational applications under a community license, with watermarking applied to output frames by default.
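As a minimal sketch of how generation from a single conditioning frame might look in practice, the snippet below assumes the model is loaded through the Hugging Face diffusers `StableVideoDiffusionPipeline`; exact arguments, the input path, and the seed are illustrative and may differ between diffusers versions.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the XT checkpoint in half precision for GPU inference.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# The single conditioning frame; resized to the model's native 1024x576 (width x height).
image = load_image("input.jpg")  # hypothetical input path
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

# 25 frames exported at ~7 fps yields a clip of roughly 3.5 seconds.
export_to_video(frames, "generated.mp4", fps=7)
```

Lowering `decode_chunk_size` trades speed for reduced peak memory when decoding the latent frames.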
Features
- Converts a single image into a 25-frame video
- Fine-tuned from the SVD 14-frame model for smoother motion
- Outputs videos at 576x1024 resolution
- Includes both a standard frame-wise decoder and a fine-tuned f8-decoder for improved temporal coherence
- Supports latent diffusion for efficient generation
- Intended for artistic, educational, and research purposes
- Inference code includes watermarking via imWatermark (see the sketch after this list)
- Developed with safety filtering and red-team evaluations for responsible use
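The following is a hedged sketch of how per-frame watermarking with the imWatermark (invisible-watermark) package could be applied to generated frames; the payload string and helper function are illustrative assumptions, not the official inference code.

```python
import cv2
import numpy as np
from imwatermark import WatermarkEncoder

# Configure the encoder once; the payload below is a hypothetical example.
encoder = WatermarkEncoder()
encoder.set_watermark("bytes", b"StableVideoDiffusion")

def watermark_frames(frames):
    """Embed the watermark into each RGB frame (PIL images or uint8 arrays)."""
    marked = []
    for frame in frames:
        # imWatermark operates on BGR arrays, so convert from RGB and back.
        bgr = cv2.cvtColor(np.asarray(frame), cv2.COLOR_RGB2BGR)
        bgr = encoder.encode(bgr, "dwtDct")  # DWT+DCT embedding method
        marked.append(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))
    return marked
```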