CascadeV | An Implementation of the Würstchen Architecture for High-Resolution Video Generation
News
[2024.07.17] We release the code and pretrained weights of a DiT-based video VAE, which supports video reconstruction with a high compression factor (1x32x32=1024). The T2V model is still on the way.
Introduction
CascadeV is a video generation pipeline built upon the Würstchen architecture. By using a highly compressed latent representation, we can generate longer videos with higher resolution.
Video VAE
Comparison of Our Cascade Approach with Other VAEs (on Latent Space of Shape 8x32x32)
Video Reconstruction: Original (left) vs. Reconstructed (right) | Click to view the videos
1. Model Architecture
1.1 DiT
We use PixArt-Σ as our base model with the following modifications:
- Use the semantic compressor from StableCascade to provide the low-resolution latent input.
- Remove the text encoder and all multi-head cross-attention layers, since we do not use text conditioning.
- Replace all 2D attention layers with 3D attention. We find that 3D attention outperforms 2+1D (i.e., alternating spatial and temporal attention), especially in temporal consistency.
Comparison of 2+1D Attention (left) vs. 3D Attention (right)
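The difference between the two layouts can be sketched with plain self-attention over reshaped token grids (a minimal single-head NumPy sketch under our own naming; the actual model uses multi-head attention inside DiT blocks):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(x):
    # single-head self-attention over a (tokens, dim) sequence
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def attn_3d(x):
    # full spatio-temporal attention: every (t, h, w) token sees all others
    t, h, w, d = x.shape
    return attend(x.reshape(t * h * w, d)).reshape(t, h, w, d)

def attn_2plus1d(x):
    # 2+1D: spatial attention within each frame, then temporal
    # attention at each spatial location
    t, h, w, d = x.shape
    xs = np.stack([attend(f) for f in x.reshape(t, h * w, d)])
    xt = xs.transpose(1, 0, 2)                 # (h*w, t, d)
    out = np.stack([attend(s) for s in xt])
    return out.transpose(1, 0, 2).reshape(t, h, w, d)

x = np.random.default_rng(0).standard_normal((4, 8, 8, 16))
print(attn_3d(x).shape, attn_2plus1d(x).shape)  # (4, 8, 8, 16) (4, 8, 8, 16)
```

In the 3D case a single pass mixes information across all frames and positions, while the 2+1D factorization only propagates temporal information along fixed spatial locations, which is one plausible reason for its weaker temporal consistency.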
1.2. Grid Attention
Using 3D attention requires much more computation than 2D or 2+1D attention, especially at higher resolutions. As a compromise, we replace some 3D attention layers with alternating spatial and temporal grid attention.
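A minimal sketch of what a spatial grid-attention pass can look like (assumptions: single head, and a strided g x g grid in which tokens at the same offset modulo g attend to each other, as in MaxViT-style grid attention; the paper's exact grid layout may differ):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(x):
    # single-head self-attention over a (tokens, dim) sequence
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def grid_attention(x, g=2):
    # Strided g x g grid: within each frame, tokens sharing the same
    # offset modulo g attend to one another, so each attention call
    # sees only (h/g * w/g) tokens instead of h * w.
    t, h, w, d = x.shape
    xg = x.reshape(t, h // g, g, w // g, g, d)
    xg = xg.transpose(0, 2, 4, 1, 3, 5).reshape(t * g * g, (h // g) * (w // g), d)
    out = np.stack([attend(s) for s in xg])
    out = out.reshape(t, g, g, h // g, w // g, d)
    return out.transpose(0, 3, 1, 4, 2, 5).reshape(t, h, w, d)

x = np.random.default_rng(0).standard_normal((2, 8, 8, 16))
print(grid_attention(x).shape)  # (2, 8, 8, 16)
```

Because each grid group is a factor of g² shorter than the full frame, the quadratic attention cost drops accordingly, which is what makes it a cheaper stand-in for some of the 3D layers.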
2. Evaluation
Dataset: We perform a quantitative comparison with other baselines on the Inter4K dataset, sampling the first 200 videos from Inter4K to create a video dataset with a resolution of 1024x1024 at 30 FPS.
Metrics: We use PSNR, SSIM, and LPIPS to evaluate per-frame quality (and the similarity between the original and reconstructed videos), and VBench to evaluate video quality independently.
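For reference, per-frame PSNR can be computed as below (a minimal sketch assuming 8-bit frames with a peak value of 255; SSIM and LPIPS come from standard library implementations):

```python
import numpy as np

# Per-frame PSNR for 8-bit video frames (assumes a peak value of 255).
def psnr(original, reconstructed, max_val=255.0):
    err = original.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(err ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

a = np.zeros((64, 64), dtype=np.uint8)
b = np.full((64, 64), 16, dtype=np.uint8)
print(round(psnr(a, b), 2))  # 24.05
```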
2.1 PSNR/SSIM/LPIPS
Diffusion-based VAEs (like StableCascade and our model) perform poorly on reconstruction metrics: they tend to produce videos with more fine-grained detail, but those details are less similar to the original frames.
| Model / Compression Factor | PSNR↑ | SSIM↑ | LPIPS↓ |
| --- | --- | --- | --- |
| Open-Sora-Plan v1.1 / 4x8x8=256 | 25.7282 | 0.8000 | 0.1030 |
| EasyAnimate v3 / 4x8x8=256 | 28.8666 | 0.8505 | 0.0818 |
| StableCascade / 1x32x32=1024 | 24.3336 | 0.6896 | 0.1395 |
| Ours / 1x32x32=1024 | 23.7320 | 0.6742 | 0.1786 |
2.2 VBench
Our approach achieves performance comparable to previous VAEs in both frame-wise and temporal quality, even with a much larger compression factor.