
CogVideoX

CogVideoX is a text-to-video generation model focused on producing more coherent videos from a prompt. It achieves this through several methods.

  • A 3D Variational Autoencoder that compresses videos along both spatial and temporal dimensions, improving the compression rate and video fidelity.

  • An expert transformer block to help align text and video, and a 3D full attention module to capture and create spatially and temporally accurate videos.

Practical testing across video-instruction dimensions found that CogVideoX performs well on consistent subjects, dynamic information, consistent backgrounds, object information, smooth motion, color, scenes, appearance style, and temporal style, but fails to produce good results for human actions, spatial relationships, and multiple objects.

Fine-tuning with Diffusers can help compensate for these weaker results.

Data Preparation

The training script accepts data in two formats.

The first format is suited for small-scale training, while the second uses a CSV file and is better suited for streaming data in large-scale training. In the future, Diffusers will support the <video> tag.

Small format

Two files where one file contains line-separated prompts and another file contains line-separated paths to video data (the path to video files must be relative to the path you pass when specifying --instance_data_root). Let's take a look at an example to understand this better!

Assume you've specified --instance_data_root as /dataset, and that this directory contains the files: prompts.txt and videos.txt.

The prompts.txt file should contain line-separated prompts:

A black and white animated sequence featuring a rabbit, named Rabbity Ribfried, and an anthropomorphic goat in a musical, playful environment, showcasing their evolving interaction.
A black and white animated sequence on a ship's deck features a bulldog character, named Bully Bulldoger, showcasing exaggerated facial expressions and body language. The character progresses from confident to focused, then to strained and distressed, displaying a range of emotions as it navigates challenges. The ship's interior remains static in the background, with minimalistic details such as a bell and open door. The character's dynamic movements and changing expressions drive the narrative, with no camera movement to distract from its evolving reactions and physical gestures.
...

The videos.txt file should contain line-separated paths to video files. Note that the paths must be relative to the --instance_data_root directory.

videos/00000.mp4
videos/00001.mp4
...

Overall, if you run the tree command on the dataset root directory, your dataset should look like this:

/dataset
├── prompts.txt
├── videos.txt
├── videos
    ├── 00000.mp4
    ├── 00001.mp4
    ├── ...

When using this format, --caption_column must be prompts.txt and --video_column must be videos.txt.
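
If you want to sanity-check this layout before launching training, a short script like the one below can verify that every line in videos.txt resolves to an existing file relative to the dataset root. This is only a sketch against the hypothetical /dataset example above, not part of the training script:

python
from pathlib import Path

# Hypothetical dataset root from the example above (what you pass as --instance_data_root).
root = Path("/dataset")
prompts = (root / "prompts.txt").read_text().strip().splitlines()
videos = (root / "videos.txt").read_text().strip().splitlines()

# Every prompt needs a matching video, and every path must resolve relative to the root.
assert len(prompts) == len(videos), "prompts.txt and videos.txt must have the same number of lines"
missing = [v for v in videos if not (root / v).is_file()]
if missing:
    raise FileNotFoundError(f"These paths in videos.txt do not exist under {root}: {missing}")
print(f"Found {len(videos)} prompt/video pairs")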

Streaming format

You can use a single CSV file. For the sake of this example, assume you have a metadata.csv file. The expected format is:

<CAPTION_COLUMN>,<PATH_TO_VIDEO_COLUMN>
"""A black and white animated sequence featuring a rabbit, named Rabbity Ribfried, and an anthropomorphic goat in a musical, playful environment, showcasing their evolving interaction.""","""00000.mp4"""
"""A black and white animated sequence on a ship's deck features a bulldog character, named Bully Bulldoger, showcasing exaggerated facial expressions and body language. The character progresses from confident to focused, then to strained and distressed, displaying a range of emotions as it navigates challenges. The ship's interior remains static in the background, with minimalistic details such as a bell and open door. The character's dynamic movements and changing expressions drive the narrative, with no camera movement to distract from its evolving reactions and physical gestures.""","""00001.mp4"""
...

In this case, --instance_data_root should be the location where the videos are stored, and --dataset_name should be either a path to a local folder or a [~datasets.load_dataset]-compatible dataset hosted on the Hub. Assuming you have videos of Minecraft gameplay at https://huggingface.co/datasets/my-awesome-username/minecraft-videos, you would specify my-awesome-username/minecraft-videos.

When using this format, --caption_column must be <CAPTION_COLUMN> and --video_column must be <PATH_TO_VIDEO_COLUMN>.

You are not strictly restricted to the CSV format. Any format works as long as the load_dataset method supports the file format to load a basic <path_to_video_column> and <caption_column>. The reason for going through these dataset organization gymnastics for loading video data is because load_dataset does not fully support all kinds of video formats.
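
As a quick check that your file can be read by load_dataset and exposes the columns you plan to pass as --caption_column and --video_column, you can load it locally first. The snippet below is a sketch; the column names caption and video are assumptions standing in for the <CAPTION_COLUMN> and <PATH_TO_VIDEO_COLUMN> placeholders above:

python
from datasets import load_dataset

# Load the local metadata.csv; any file format that load_dataset understands works here.
dataset = load_dataset("csv", data_files="metadata.csv", split="train")

# Assumed column names; replace them with whatever your CSV header actually uses.
caption_column, video_column = "caption", "video"
assert caption_column in dataset.column_names and video_column in dataset.column_names

print(dataset[0][caption_column])  # the prompt
print(dataset[0][video_column])    # video path, relative to --instance_data_root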

NOTE

CogVideoX works best with long and descriptive LLM-augmented prompts for video generation. We recommend pre-processing your videos by first generating a summary using a VLM and then augmenting the prompts with an LLM. To generate the above captions, we use MiniCPM-V-26 and Llama-3.1-8B-Instruct. A very barebones and no-frills example for this is available here. The official recommendation for augmenting prompts is ChatGLM and a length of 50-100 words is considered good.
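
As a rough illustration of the LLM-augmentation step (not the official pipeline), short captions can be expanded with an instruction-tuned model via transformers. The model choice and system prompt below are assumptions; Llama-3.1-8B-Instruct is gated on the Hub, and depending on your transformers version you may need to apply the chat template manually:

python
import torch
from transformers import pipeline

# Any instruction-tuned chat model can be swapped in here.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

short_caption = "A rabbit and a goat play music together on a ship."
messages = [
    {"role": "system", "content": "Expand the given short video caption into a detailed, 50-100 word prompt describing the subjects, motion, background, and style."},
    {"role": "user", "content": short_caption},
]

# The pipeline returns the conversation with the assistant's reply appended as the last message.
result = generator(messages, max_new_tokens=200)
augmented_prompt = result[0]["generated_text"][-1]["content"]
print(augmented_prompt)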

NOTE

It is expected that your dataset is already pre-processed. If not, some basic pre-processing can be done by playing with the following parameters: --height, --width, --fps, --max_num_frames, --skip_frames_start and --skip_frames_end. Presently, all videos in your dataset should contain the same number of video frames when using a training batch size > 1.


Training

You will also need to set up your development environment by installing the necessary dependencies. The following packages are required:

  • Torch 2.0 or above, depending on the training features you are using (the latest or nightly versions may be required for quantized/DeepSpeed training)
  • pip install diffusers transformers accelerate peft huggingface_hub for everything related to modeling and training
  • pip install datasets decord for loading the video training data
  • pip install bitsandbytes for memory-optimized training with the 8-bit Adam or AdamW optimizers
  • Optionally, pip install wandb to monitor training logs
  • Optionally, pip install deepspeed for DeepSpeed training
  • Optionally, pip install prodigyopt if you would like to train with the Prodigy optimizer

To make sure you can successfully run the latest versions of the example scripts, we highly recommend installing from source and keeping the installation up to date, since we frequently update the example scripts and install example-specific requirements. To do this, execute the following steps in a new virtual environment:

Before running the scripts, make sure to install the library from source:

bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install -e .

Then navigate to the example folder containing the training script and install the required dependencies for the script you're using:

  • PyTorch
bash
cd examples/cogvideo
pip install -r requirements.txt

And initialize an 🤗 Accelerate environment with:

bash
accelerate config

Or for a default accelerate configuration without answering questions about your environment:

bash
accelerate config default

Or if your environment doesn't support an interactive shell (e.g., a notebook):

python
from accelerate.utils import write_basic_config
write_basic_config()

When running accelerate config, enabling torch.compile can give significant speedups. The PEFT library is used as the backend for LoRA training, so make sure peft>=0.6.0 is installed in your environment.

If you want to push the model to the Hub with a neat model card after training is complete, make sure you're logged in:

bash
huggingface-cli login

# Alternatively, you could upload your model manually using:
# huggingface-cli upload my-cool-account-name/my-cool-lora-name /path/to/awesome/lora

Make sure your data is prepared as described in Data Preparation. Once it is, you're ready to start training!

Assuming you are training on 50 videos of a similar concept, we have found that 1500-2000 steps work well. The official recommendation, however, is 100 videos with a total of 4000 steps. Assuming you are training on a single GPU with a --train_batch_size of 1:

  • 1500 steps on 50 videos corresponds to 30 training epochs
  • 4000 steps on 100 videos corresponds to 40 training epochs
  (in general, epochs ≈ steps × train_batch_size × gradient_accumulation_steps / number of videos)
bash
#!/bin/bash

GPU_IDS="0"

accelerate launch --gpu_ids $GPU_IDS examples/cogvideo/train_cogvideox_lora.py \
  --pretrained_model_name_or_path THUDM/CogVideoX-2b \
  --cache_dir <CACHE_DIR> \
  --instance_data_root <PATH_TO_WHERE_VIDEO_FILES_ARE_STORED> \
  --dataset_name my-awesome-name/my-awesome-dataset \
  --caption_column <CAPTION_COLUMN> \
  --video_column <PATH_TO_VIDEO_COLUMN> \
  --id_token <ID_TOKEN> \
  --validation_prompt "<ID_TOKEN> Spiderman swinging over buildings:::A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance" \
  --validation_prompt_separator ::: \
  --num_validation_videos 1 \
  --validation_epochs 10 \
  --seed 42 \
  --rank 64 \
  --lora_alpha 64 \
  --mixed_precision fp16 \
  --output_dir /raid/aryan/cogvideox-lora \
  --height 480 --width 720 --fps 8 --max_num_frames 49 --skip_frames_start 0 --skip_frames_end 0 \
  --train_batch_size 1 \
  --num_train_epochs 30 \
  --checkpointing_steps 1000 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-3 \
  --lr_scheduler cosine_with_restarts \
  --lr_warmup_steps 200 \
  --lr_num_cycles 1 \
  --enable_slicing \
  --enable_tiling \
  --optimizer Adam \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --max_grad_norm 1.0 \
  --report_to wandb

To better track our training experiments, we're using the following flags in the command above:

  • --report_to wandb will ensure the training runs are tracked on Weights and Biases. To use it, be sure to install wandb with pip install wandb.
  • validation_prompt and validation_epochs allow the script to do a few validation inference runs. This lets us qualitatively check whether training is progressing as expected.

Setting the <ID_TOKEN> is not necessary. From some limited experimentation, we found that it works better (as it resembles [Dreambooth](https://huggingface.co/docs/diffusers/en/training/dreambooth) training) than without. When provided, the <ID_TOKEN> is prepended to each prompt. So, if your <ID_TOKEN> was "DISNEY" and your prompt was "Spiderman swinging over buildings", the effective prompt used in training would be "DISNEY Spiderman swinging over buildings". When not provided, you would either be training without any additional token or could augment your dataset to apply the token where you wish before starting the training.
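
Purely as an illustration of that behavior (this is not the script's actual code), the effective training prompt is formed roughly like this:

python
id_token = "DISNEY"  # value passed via --id_token (or None if not provided)
prompt = "Spiderman swinging over buildings"

# When --id_token is provided, it is prepended to every training prompt.
effective_prompt = f"{id_token} {prompt}" if id_token else prompt
print(effective_prompt)  # DISNEY Spiderman swinging over buildings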

NOTE

You can pass --use_8bit_adam to reduce the memory requirements of training.

IMPORTANT

The following settings have been tested at the time of adding CogVideoX LoRA training support:

  • Our testing was primarily done on CogVideoX-2b. We will work on CogVideoX-5b and CogVideoX-5b-I2V soon
  • One dataset comprising 70 training videos of resolution 200 x 480 x 720 (F x H x W). From this, by using frame skipping in data preprocessing, we created two smaller 49-frame and 16-frame datasets for faster experimentation, and because the maximum frame count recommended by the CogVideoX team is 49. Out of the 70 videos, we created three groups of 10, 25, and 50 videos. All videos were similar in nature with respect to the concept being trained.
  • 25+ videos worked best for training new concepts and styles.
  • We found that it is better to train with an identifier token that can be specified as --id_token. This is similar to Dreambooth-like training but normal finetuning without such a token works too.
  • The trained concept seemed to work decently well when combined with completely unrelated prompts. We expect even better results if CogVideoX-5B is finetuned.
  • The original repository uses a lora_alpha of 1. We found this to be unsuitable in many runs, possibly due to differences in modeling backends and training settings. Our recommendation is to set lora_alpha to either rank or rank // 2.
  • If you're training on data whose captions generate bad results with the original model, a rank of 64 and above is good and also the recommendation by the team behind CogVideoX. If the generations are already moderately good on your training captions, a rank of 16/32 should work. We found that setting the rank too low, say 4, is not ideal and doesn't produce promising results.
  • The authors of CogVideoX recommend 4000 training steps and 100 training videos overall to achieve the best result. While that might yield the best results, we found from our limited experimentation that 2000 steps and 25 videos could also be sufficient.
  • When using the Prodigy optimizer for training, one can follow the recommendations from this blog. Prodigy tends to overfit quickly. From our very limited testing, we found a learning rate of 0.5 to be suitable, in addition to --prodigy_use_bias_correction, --prodigy_safeguard_warmup and --prodigy_decouple.
  • The recommended learning rate by the CogVideoX authors and from our experimentation with Adam/AdamW is between 1e-3 and 1e-4 for a dataset of 25+ videos.

Note that our testing is not exhaustive due to limited time for exploration. Our recommendation would be to play around with the different knobs and dials to find the best settings for your data.


Inference

Once you have trained a LoRA model, inference can be done by simply loading the LoRA weights into the CogVideoXPipeline.

python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
# pipe.load_lora_weights("/path/to/lora/weights", adapter_name="cogvideox-lora") # Or,
pipe.load_lora_weights("my-awesome-hf-username/my-awesome-lora-name", adapter_name="cogvideox-lora") # If loading from the HF Hub
pipe.to("cuda")

# Assuming lora_alpha=32 and rank=64 for training. If different, set accordingly
pipe.set_adapters(["cogvideox-lora"], [32 / 64])

prompt = "A vast, shimmering ocean flows gracefully under a twilight sky, its waves undulating in a mesmerizing dance of blues and greens. The surface glints with the last rays of the setting sun, casting golden highlights that ripple across the water. Seagulls soar above, their cries blending with the gentle roar of the waves. The horizon stretches infinitely, where the ocean meets the sky in a seamless blend of hues. Close-ups reveal the intricate patterns of the waves, capturing the fluidity and dynamic beauty of the sea in motion."
frames = pipe(prompt, guidance_scale=6, use_dynamic_cfg=True).frames[0]
export_to_video(frames, "output.mp4", fps=8)
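
Alternatively, instead of set_adapters, newer diffusers versions let you fuse the LoRA into the base weights so that no adapter math runs at inference time. This is an optional sketch under the assumption that fuse_lora() is supported for this pipeline in your diffusers version; the 32 / 64 scale mirrors the lora_alpha / rank ratio used above:

python
# Call this instead of pipe.set_adapters(...) above to bake the LoRA into the weights.
pipe.fuse_lora(lora_scale=32 / 64)
frames = pipe(prompt, guidance_scale=6, use_dynamic_cfg=True).frames[0]
export_to_video(frames, "output_fused.mp4", fps=8)
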

Reducing memory usage

All optimizations included in the diffusers library were enabled during testing. This scheme has not been tested for actual memory usage on devices outside of the NVIDIA A100 / H100 architectures; in general, it should be adaptable to all devices with the NVIDIA Ampere architecture and above. If the optimizations are disabled, memory consumption increases significantly, with peak memory usage roughly 3 times the values in the table below, although speed increases by about 3-4x. You can selectively disable some of the optimizations, including:

python
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
  • For multi-GPU inference, the enable_sequential_cpu_offload() optimization needs to be disabled.
  • Using an INT8 model slows down inference. This is done so that GPUs with less memory can run the model while keeping the loss in video quality minimal, at the cost of a significant drop in inference speed.
  • The CogVideoX-2B model was trained in FP16 precision, and all CogVideoX-5B models were trained in BF16 precision. We recommend running inference in the precision the model was trained in.
  • PytorchAO and Optimum-quanto can be used to quantize the text encoder, transformer, and VAE modules to lower the memory requirements of CogVideoX. This makes it possible to run the model on a free T4 Colab or on GPUs with less memory! Also note that TorchAO quantization is fully compatible with torch.compile, which can significantly improve inference speed. FP8 precision must be used on NVIDIA H100 and above devices, which requires installing the torch, torchao, diffusers and accelerate Python packages from source. CUDA 12.4 is recommended. A minimal quantization sketch is shown after this list.
  • The inference speed tests also used the memory optimization scheme above; without memory optimizations, inference is roughly 10% faster. Only the diffusers version of the model supports quantization.
  • The model only supports English input; prompts in other languages can be translated into English by a large language model during refinement.
  • Memory usage for model fine-tuning was tested in an 8 * H100 environment, and the program automatically uses Zero 2 optimization. If a specific number of GPUs is marked in the table, that number (or more) of GPUs must be used for fine-tuning.
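
As a rough sketch of the TorchAO route mentioned in the list above, the transformer weights can be quantized to INT8 before running the pipeline. Exact APIs and savings depend on your torch/torchao/diffusers versions, so treat this as a starting point rather than a reference implementation:

python
import torch
from diffusers import CogVideoXPipeline
from torchao.quantization import quantize_, int8_weight_only

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Quantize the transformer weights to INT8 in place to lower VRAM usage;
# the text encoder and VAE can be quantized the same way if needed.
quantize_(pipe.transformer, int8_weight_only())

pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

video = pipe("A panda playing a guitar in a bamboo forest", num_frames=49).frames[0]
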
| Attribute | CogVideoX-2B | CogVideoX-5B |
|---|---|---|
| Model Name | CogVideoX-2B | CogVideoX-5B |
| Inference Precision | FP16* (recommended), BF16, FP32, FP8*, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported |
| Single GPU Inference VRAM | FP16: 12.5 GB* (diffusers); INT8: 7.8 GB* (diffusers with torchao) | BF16: 20.7 GB* (diffusers); INT8: 11.4 GB* (diffusers with torchao) |
| Multi GPU Inference VRAM | FP16: 10 GB* (diffusers) | BF16: 15 GB* (diffusers) |
| Inference Speed | Single A100: ~90 seconds; Single H100: ~45 seconds | Single A100: ~180 seconds; Single H100: ~90 seconds |
| Fine-tuning Precision | FP16 | BF16 |
| Fine-tuning VRAM Consumption | 47 GB (bs=1, LoRA); 61 GB (bs=2, LoRA); 62 GB (bs=1, SFT) | 63 GB (bs=1, LoRA); 80 GB (bs=2, LoRA); 75 GB (bs=1, SFT) |