InstructVideo

InstructVideo: Instructing Video Diffusion Models with Human Feedback

Hangjie Yuan¹, Shiwei Zhang², Xiang Wang², Yujie Wei², Tao Feng³
Yining Pan⁴, Yingya Zhang², Ziwei Liu⁵, Samuel Albanie⁶, Dong Ni¹

¹Zhejiang University, ²Alibaba Group, ³Tsinghua University, ⁴Singapore University of Technology and Design
⁵S-Lab, Nanyang Technological University, ⁶CAML Lab, University of Cambridge

🥳 Accepted to CVPR 2024 🥳

Diffusion models have emerged as the de facto paradigm for video generation. However, their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts. To tackle this problem, we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning. InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain, we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video, InstructVideo requires only partial inference of the DDIM sampling chain, reducing fine-tuning cost while improving fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences, we repurpose established image reward models, e.g., HPSv2. To this end, we propose Segmental Video Reward, a mechanism to provide reward signals based on segmental sparse sampling, and Temporally Attenuated Reward, a method that mitigates temporal modeling degradation during fine-tuning. Extensive experiments, both qualitative and quantitative, validate the practicality and efficacy of using image reward models in InstructVideo, significantly enhancing the visual quality of generated videos without compromising generalization capabilities. Code and models will be made publicly available.
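The two ingredients described above can be illustrated with a minimal NumPy sketch. It is not the released implementation. It assumes a standard DDPM forward process for the corruption step, a uniformly spaced DDIM schedule, the centre frame of each segment as the sparse sample, and an exponential decay of frame weights away from the temporal centre. All function names, the decay factor `lam`, and the per-frame reward array (standing in for the output of an image reward model such as HPSv2) are hypothetical.

```python
import numpy as np

def corrupt_video(x0, t, alphas_cumprod, rng):
    """Forward-diffuse a clean video latent x0 to timestep t (standard DDPM q(x_t | x_0))."""
    a = alphas_cumprod[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

def partial_ddim_timesteps(num_train_steps, num_ddim_steps, num_edit_steps):
    """Return only the low-noise tail of a uniformly spaced DDIM schedule.

    Assumption: the video is corrupted to the first returned timestep and then
    denoised through the remaining ones, so only num_edit_steps of the
    num_ddim_steps sampling steps are executed.
    """
    schedule = np.arange(0, num_train_steps, num_train_steps // num_ddim_steps)
    return schedule[:num_edit_steps][::-1]

def segment_frame_indices(num_frames, num_segments):
    """Segmental sparse sampling: pick one frame (here, the centre) from each segment."""
    segments = np.array_split(np.arange(num_frames), num_segments)
    return np.array([seg[len(seg) // 2] for seg in segments])

def attenuated_weights(num_segments, lam):
    """Weights that decay with distance from the temporal centre (assumed exponential form)."""
    positions = np.arange(num_segments)
    centre = (num_segments - 1) / 2.0
    weights = lam ** np.abs(positions - centre)
    return weights / weights.sum()

def segmental_video_reward(frame_rewards, weights):
    """Aggregate per-frame image rewards into a single video-level reward."""
    return float(np.sum(weights * frame_rewards))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alphas_cumprod = np.linspace(0.999, 0.01, 1000)  # toy noise schedule, for illustration only
    video = rng.standard_normal((16, 4, 8, 8))       # 16 frames of a toy latent

    steps = partial_ddim_timesteps(1000, 20, 5)      # 5 of 20 DDIM steps
    noisy = corrupt_video(video, steps[0], alphas_cumprod, rng)

    idx = segment_frame_indices(num_frames=16, num_segments=4)
    fake_rewards = rng.uniform(size=len(idx))        # placeholder for image reward model scores
    reward = segmental_video_reward(fake_rewards, attenuated_weights(len(idx), lam=0.5))
    print(steps, idx, reward)
```

In an actual fine-tuning loop the denoiser would be run over the returned timesteps and the reward would be backpropagated through the decoded frames; both are omitted here because they depend on the specific model and reward network.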

Overview of Generated Videos

The Evolution of Generated Videos

Videos are optimized towards human preferences.

ModelScopeT2V

InstructVideo (5k steps)

InstructVideo (10k steps)

InstructVideo (15k steps)

InstructVideo (20k steps)

Comparison with the Base Model

Video quality is significantly boosted after reward fine-tuning.

Cat walking to the camera.

ModelScopeT2V (D=20)

ModelScopeT2V (D=50)

InstructVideo (D=20)

Mountain goat grazing on a cliff by the sea.

Bee collecting sunflower pollen closeup footage.

Parrot fish in sea underwater eating stone coral.

Small bird during rain in india.

Bee collects honey in flower at morning.

Bighorn sheep ovis canadensisram between snowcovered sage bushes.

Bantam or chicken on the garden.

Great tit in bird feeder.

Comparison with Other Reward Fine-tuning Methods

Reward fine-tuning as editing exhibits superior fine-tuning efficacy.

White butterfly on violet flower closeup macro view.

RWR

DDPO

DRaFT

InstructVideo

Mule deer in Nebraska running across landscape.

Duck walks by the lake.

Horse grazing on pasture and eating green grass.

Close up grey rabbit eating corn.

Brush turkey head comes toward bower.

Generalization Capabilities

InstructVideo generalizes better to unseen text prompts.

Land snail in southern of Thailand moving forward.

RWR

DDPO

DRaFT

InstructVideo

Fly over teen on rocks revealing frosted mountain.

Side view of elderly woman petting cat in window.

Portrait of cheetah acinonyx jubatus.

Drone fly up near old historical white Christianity church.

Lobster moth caterpillar is eating leaf of host plant.

Generation with 50-step DDIM inference

Fine-tuning with 20-step DDIM inference works well in the 50-step setting.

Close up of llama in autumn

ModelScopeT2V (D=50)

InstructVideo (D=20)

InstructVideo (D=50)

Water bird on the lake in spring podiceps cristatus.

Goat in a green summer.

Fish swimming by in kelp.

Consider citing our works if they inspire you.

@article{2023InstructVideo,
         title={InstructVideo: Instructing Video Diffusion Models with Human Feedback},
         author={Yuan, Hangjie and Zhang, Shiwei and Wang, Xiang and Wei, Yujie and Feng, Tao and Pan, Yining and Zhang, Yingya and Liu, Ziwei and Albanie, Samuel and Ni, Dong},
         journal={arXiv preprint arXiv:2312.12490},
         year={2023}
}

@article{2023I2VGen-XL,
         title={I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models},
         author={Zhang, Shiwei and Wang, Jiayu and Zhang, Yingya and Zhao, Kang and Yuan, Hangjie and Qing, Zhiwu and Wang, Xiang and Zhao, Deli and Zhou, Jingren},
         journal={arXiv preprint arXiv:2311.04145},
         year={2023}
}

@article{2023DreamVideo,
         title={DreamVideo: Composing Your Dream Videos with Customized Subject and Motion},
         author={Wei, Yujie and Zhang, Shiwei and Qing, Zhiwu and Yuan, Hangjie and Liu, Zhiheng and Liu, Yu and Zhang, Yingya and Zhou, Jingren and Shan, Hongming},
         journal={arXiv preprint arXiv:2312.04433},
         year={2023}
}

@article{2023VideoComposer,
         title={VideoComposer: Compositional Video Synthesis with Motion Controllability},
         author={Wang, Xiang and Yuan, Hangjie and Zhang, Shiwei and Chen, Dayou and Wang, Jiuniu and Zhang, Yingya and Shen, Yujun and Zhao, Deli and Zhou, Jingren},
         journal={arXiv preprint arXiv:2306.02018},
         year={2023}
}

@article{2023ModelScopeT2V,
         title={Modelscope Text-to-Video Technical Report},
         author={Wang, Jiuniu and Yuan, Hangjie and Chen, Dayou and Zhang, Yingya and Wang, Xiang and Zhang, Shiwei},
         journal={arXiv preprint arXiv:2308.06571},
         year={2023}
}

@article{2023TF-T2V,
         title={A Recipe for Scaling up Text-to-Video Generation with Text-free Videos},
         author={Wang, Xiang and Zhang, Shiwei and Yuan, Hangjie and Qing, Zhiwu and Gong, Biao and Zhang, Yingya and Shen, Yujun and Gao, Changxin and Sang, Nong},
         journal={arXiv preprint},
         year={2023}
}
