CogVideo is an open-source text-, image-, and video-to-video generation project that hosts the CogVideoX family of diffusion-transformer models and end-to-end tooling. The repo includes SAT and Diffusers implementations, turnkey demos, and fine-tuning pipelines (including LoRA) designed to run across a wide range of NVIDIA GPUs, from desktop cards (e.g., RTX 3060) to data-center hardware (A100/H100). Current releases cover CogVideoX-2B, CogVideoX-5B, and the upgraded CogVideoX1.5-5B variants, plus image-to-video (I2V) models, with BF16/FP16/FP32 precision options and INT8 quantized inference via TorchAO for memory-constrained setups. The codebase emphasizes practical deployment: prompt-optimization utilities (LLM-assisted long-prompt expansion), Colab notebooks, a Gradio web app, and multiple performance knobs (tiling/slicing, CPU offload, torch.compile, multi-GPU, and FA3 backends via partner projects).
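The Diffusers path can be exercised in a few lines. Below is a minimal text-to-video sketch: the model IDs and sampling settings follow the publicly documented CogVideoX examples, and the `RECOMMENDED_DTYPE` mapping and `generate` helper are illustrative conveniences, not part of the repo's API.

```python
# Recommended inference dtypes per the CogVideoX model cards
# (2B: FP16; 5B and 1.5-5B: BF16). Stored as strings so the
# mapping can be inspected without torch or a GPU present.
RECOMMENDED_DTYPE = {
    "THUDM/CogVideoX-2b": "float16",
    "THUDM/CogVideoX-5b": "bfloat16",
    "THUDM/CogVideoX1.5-5B": "bfloat16",
}


def generate(model_id: str, prompt: str, out_path: str = "output.mp4"):
    """Run one text-to-video generation (downloads weights on first use)."""
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(
        model_id, torch_dtype=getattr(torch, RECOMMENDED_DTYPE[model_id])
    )
    pipe.enable_model_cpu_offload()  # lowers peak VRAM at some speed cost

    video = pipe(
        prompt=prompt,
        num_inference_steps=50,
        guidance_scale=6.0,
        num_frames=49,
    ).frames[0]
    export_to_video(video, out_path, fps=8)


if __name__ == "__main__":
    generate("THUDM/CogVideoX-2b", "A panda playing guitar in a bamboo forest")
```

For the 1.5 variants, frame counts and resolutions differ from the 49-frame default shown here; check the corresponding model card before reusing these settings.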
Features
- Multiple tasks: text-to-video, image-to-video, and video-to-video generation.
- Dual stacks: SAT implementations and Diffusers pipelines with shared demos.
- Fine-tuning recipes (including LoRA), plus cogvideox-factory for single-GPU (e.g., RTX 4090) training.
- Quantized inference (INT8 via TorchAO) and memory optimizations (CPU offload, tiling, slicing).
- Ready-to-run assets: Colab notebooks, CLI demos, and a Gradio web UI with super-resolution and frame-interpolation tools.
- Utilities & ecosystem: weight converters (SAT→HF), captioning tools, and third-party integrations (ComfyUI, ControlNet, xDiT, VideoSys).
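The memory optimizations in the list above (CPU offload, VAE tiling, VAE slicing) map onto a handful of standard Diffusers calls. A sketch of wiring them together, assuming `pipe` is an already-loaded CogVideoX pipeline; the `apply_memory_optimizations` helper is a hypothetical name, not something the repo exports:

```python
def apply_memory_optimizations(pipe, offload: bool = True):
    """Apply low-VRAM knobs to a loaded CogVideoX Diffusers pipeline."""
    if offload:
        # Streams submodules to GPU one at a time: lowest VRAM, slowest.
        pipe.enable_sequential_cpu_offload()
    # Decode the video latents in spatial tiles rather than all at once.
    pipe.vae.enable_tiling()
    # Decode frames in slices to cap peak activation memory.
    pipe.vae.enable_slicing()
    return pipe
```

With all three enabled, generation is slower but fits on much smaller cards; INT8 weight quantization via TorchAO can be layered on top for further savings.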