FlexiFilm: Long Video Generation with Flexible Conditions

Zhejiang University, Peking University, Tsinghua University, Oxford University, BAAI

FlexiFilm pipeline. Subplot (a) shows that the backbone of FlexiFilm is a 3D U-Net operating in the latent space of a VAE, with a temporal conditioner that supplies multi-modal references (text, image, or video) for video frame generation. Subplot (b) shows the workflow of the proposed temporal conditioner, in which visual content (an image or frames) and text content are fused to guide video generation with both spatial and temporal information.


Structure of the video projector. In the proposed video projector, the condition frames first pass through IP samplers separately, so that each frame yields independent spatial information. They then pass through temporal transformers together to learn inter-frame temporal information. The resulting projected feature carries both spatial and temporal information.
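The two-stage flow described in this caption can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. It assumes the IP sampler behaves like a small set of latent queries that attend over one frame's patch tokens, and that the temporal transformer reduces to plain self-attention across frames, with no learned projections, normalization, or feed-forward layers. The names `attention`, `project_video`, and `latent_queries` are hypothetical.

```python
import math
import random


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return out


def project_video(frames, latent_queries):
    """Map [num_frames][num_patches][dim] to [num_frames][num_queries][dim]."""
    # Spatial step (assumed stand-in for the IP sampler): each frame is
    # summarized on its own by the same latent queries.
    per_frame = [attention(latent_queries, patches, patches) for patches in frames]

    # Temporal step (assumed stand-in for the temporal transformer): for each
    # query slot, the per-frame tokens attend to one another across time.
    num_frames, num_queries = len(per_frame), len(latent_queries)
    projected = [[None] * num_queries for _ in range(num_frames)]
    for q in range(num_queries):
        track = [per_frame[t][q] for t in range(num_frames)]
        mixed = attention(track, track, track)
        for t in range(num_frames):
            projected[t][q] = mixed[t]
    return projected


# Toy sizes chosen only for illustration.
rng = random.Random(0)
dim, num_patches, num_frames, num_queries = 8, 16, 4, 2
frames = [
    [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
    for _ in range(num_frames)
]
latent_queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

features = project_video(frames, latent_queries)
print(len(features), len(features[0]), len(features[0][0]))  # 4 2 8
```

Keeping the spatial step per frame means each frame's summary depends only on that frame, while the temporal step is the only place where information crosses frames. That matches the order stated in the caption.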


Generating long and consistent videos has emerged as a significant yet challenging problem. Most existing diffusion-based video generation models are derived from image generation models and perform well on short videos. However, their simple conditioning mechanism and sampling strategy, both originally designed for image generation, cause severe performance degradation when adapted to long video generation, resulting in prominent temporal inconsistency and overexposure. In this work, we introduce FlexiFilm, a new diffusion model tailored for long video generation. Our framework incorporates a temporal conditioner to establish a more consistent relationship between the generated video and multi-modal conditions, and a resampling strategy to tackle overexposure. Empirical results demonstrate that FlexiFilm generates long and consistent videos, each over 30 seconds in length, outperforming competitors in both qualitative and quantitative analyses.

Long & Short Video Generation

Long Video Generation. (Caption: "The video shows a car driving on a highway in a sunny daytime, passing by buildings and trees on the side, encountering a front car, and slowing down when encountering the front car and passing through the traffic light.")

