Implementation of NÜWA, state of the art attention network for text-to-video synthesis, in Pytorch. It also contains an extension into video and audio generation, using a dual decoder approach. It seems as though a diffusion-based method has taken the new throne for SOTA. However, I will continue on with NUWA, extending it to use multi-headed codes + hierarchical causal transformer. I think that direction is untapped for improving on this line of work. In the paper, they also present a way to condition the video generation based on segmentation mask(s). You can easily do this as well, given you train a VQGanVAE on the sketches beforehand. Then, you will use NUWASketch instead of NUWA, which can accept the sketch VAE as a reference. This repository will also offer a variant of NUWA that can produce both video and audio. For now, the audio will need to be encoded manually.
Features
- Train the VAE
- Conditioning on Sketches
- Text to video and audio
- Te audio will need to be encoded manually
- This library will offer some utilities to make training easier
- This library depends on this vector quantization library