InstructVideo

Align Generated Videos with Human Preferences

InstructVideo: Instructing Video Diffusion Models with Human Feedback

Hangjie Yuan1, Shiwei Zhang2, Xiang Wang2, Yujie Wei2, Tao Feng3
Yining Pan4, Yingya Zhang2, Ziwei Liu5, Samuel Albanie6, Dong Ni1

1Zhejiang University, 2Alibaba Group, 3Tsinghua University, 4Singapore University of Technology and Design
5S-Lab, Nanyang Technological University, 6CAML Lab, University of Cambridge

🥳 Accepted to CVPR 2024 🥳

Diffusion models have emerged as the de facto paradigm for video generation. However, their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts. To tackle this problem, we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning. InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video, InstructVideo requires only partial inference of the DDIM sampling chain, reducing fine-tuning cost while improving fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences, we repurpose established image reward models, e.g., HPSv2. To this end, we propose Segmental Video Reward, a mechanism to provide reward signals based on segmental sparse sampling, and Temporally Attenuated Reward, a method that mitigates temporal modeling degradation during fine-tuning. Extensive experiments, both qualitative and quantitative, validate the practicality and efficacy of using image reward models in InstructVideo, significantly enhancing the visual quality of generated videos without compromising generalization capabilities. Code and models will be made publicly available.
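
To make the two ingredients above concrete, the sketch below illustrates one possible reward fine-tuning step in PyTorch. This is a minimal illustration under stated assumptions, not the released implementation: denoiser, reward_model, edit_steps, num_segments, and the attenuation factor gamma are hypothetical stand-ins, and the exact frame-sampling and attenuation schemes used in the paper may differ from this form.

# Minimal sketch of reward fine-tuning as editing (assumptions noted above).
import torch

def ddim_step(denoiser, x_t, t, t_prev, alphas_cumprod, cond):
    # One deterministic DDIM update from timestep t to t_prev.
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    eps = denoiser(x_t, t, cond)                        # predicted noise
    x0 = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()    # predicted clean video
    return a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps

def reward_finetune_loss(denoiser, reward_model, video, cond,
                         alphas_cumprod, edit_steps, num_segments=4, gamma=0.5):
    # video: (batch, channels, frames, height, width); edit_steps: descending
    # timesteps covering only the tail of the DDIM schedule, ending at 0.

    # 1) Reward fine-tuning as editing: corrupt the sampled video with the
    #    forward diffusion process up to an intermediate timestep.
    a = alphas_cumprod[edit_steps[0]]
    x = a.sqrt() * video + (1 - a).sqrt() * torch.randn_like(video)

    # 2) Partial DDIM inference instead of the full sampling chain.
    for t, t_prev in zip(edit_steps[:-1], edit_steps[1:]):
        x = ddim_step(denoiser, x, t, t_prev, alphas_cumprod, cond)

    # 3) Segmental Video Reward: score one frame per temporal segment with an
    #    image reward model (reward_model returns a per-sample scalar reward).
    b, c, f, h, w = x.shape
    seg_len = f // num_segments
    frame_ids = [i * seg_len + seg_len // 2 for i in range(num_segments)]
    rewards = torch.stack([reward_model(x[:, :, i], cond) for i in frame_ids], dim=-1)

    # 4) Temporally Attenuated Reward: down-weight frames far from the centre
    #    (the exact attenuation used in the paper may differ from this form).
    centre = (num_segments - 1) / 2
    weights = torch.tensor([gamma ** abs(i - centre) for i in range(num_segments)])
    weights = weights / weights.sum()

    # Maximize the weighted reward by minimizing its negative.
    return -(rewards * weights.to(rewards)).sum(dim=-1).mean()

In a training loop, loss.backward() followed by an optimizer step on the denoiser's parameters completes one update; because only the tail of the DDIM chain is run, the compute and memory cost stays well below that of fine-tuning through the full sampling chain.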

Overview of Generated Videos
The Evolution of Generated Videos
Videos are optimized towards human preferences.
Comparison with the Base Model
Video quality is significantly boosted after reward fine-tuning.
Comparison with Other Reward Fine-tuning Methods
Reward fine-tuning as editing exhibits superior fine-tuning efficacy.
Generalization Capabilities
InstructVideo exhibits superior generalization to unseen text prompts.
Generation with 50-step DDIM inference
A model fine-tuned with 20-step DDIM inference also performs well when generating with 50 steps.

Consider citing our works if they inspire you.

@article{2023InstructVideo,
         title={InstructVideo: Instructing Video Diffusion Models with Human Feedback},
         author={Yuan, Hangjie and Zhang, Shiwei and Wang, Xiang and Wei, Yujie and Feng, Tao and Pan, Yining and Zhang, Yingya and Liu, Ziwei and Albanie, Samuel and Ni, Dong},
         journal={arXiv preprint arXiv:2312.12490},
         year={2023}
}

@article{2023I2VGen-XL,
         title={I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models},
         author={Zhang, Shiwei and Wang, Jiayu and Zhang, Yingya and Zhao, Kang and Yuan, Hangjie and Qing, Zhiwu and Wang, Xiang and Zhao, Deli and Zhou, Jingren},
         journal={arXiv preprint arXiv:2311.04145},
         year={2023}
}

@article{2023DreamVideo,
         title={DreamVideo: Composing Your Dream Videos with Customized Subject and Motion},
         author={Wei, Yujie and Zhang, Shiwei and Qing, Zhiwu and Yuan, Hangjie and Liu, Zhiheng and Liu, Yu and Zhang, Yingya and Zhou, Jingren and Shan, Hongming},
         journal={arXiv preprint arXiv:2312.04433},
         year={2023}
}


@article{2023VideoComposer,
         title={VideoComposer: Compositional Video Synthesis with Motion Controllability},
         author={Wang, Xiang and Yuan, Hangjie and Zhang, Shiwei and Chen, Dayou and Wang, Jiuniu and Zhang, Yingya and Shen, Yujun and Zhao, Deli and Zhou, Jingren},
         journal={arXiv preprint arXiv:2306.02018},
         year={2023}
}

@article{2023ModelScopeT2V,
         title={ModelScope Text-to-Video Technical Report},
         author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},
         journal={arXiv preprint arXiv:2308.06571},
         year={2023}
}

@article{2023TF-T2V,
         title={A Recipe for Scaling up Text-to-Video Generation with Text-free Videos},
         author={Wang, Xiang and Zhang, Shiwei and Yuan, Hangjie and Qing, Zhiwu and Gong, Biao and Zhang, Yingya and Shen, Yujun and Gao, Changxin and Sang, Nong},
         journal={arXiv preprint},
         year={2023}
}