DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization

1Princeton University, 2Adobe

Text-to-Video Samples

Fast text-to-video generation with 4-step inference on a video diffusion model

Abstract

Diffusion probabilistic models have shown significant progress in video generation; however, their computational efficiency is limited by the large number of sampling steps required. Reducing sampling steps often compromises video quality or generation diversity. In this work, we introduce a distillation method that combines variational score distillation and consistency distillation to achieve few-step video generation, maintaining both high quality and diversity. We also propose a latent reward model (LRM) fine-tuning approach to further enhance video generation performance according to any specified reward metric. This approach reduces memory usage and does not require the reward to be differentiable.

Our method demonstrates state-of-the-art performance in few-step generation for 10-second videos (128 frames at 12 FPS). The distilled student model achieves a score of 82.57 on VBench, surpassing the teacher model as well as the baseline models Gen-3, Gen-2, T2V-Turbo, Kling, and Pika. One-step distillation accelerates the teacher model's diffusion sampling by up to 278.6 times, enabling near real-time generation. Human evaluations further validate the superior performance of our 4-step student models compared to the teacher model with 50-step DDIM sampling.

Comparison with Baselines

The teacher model is a pre-trained text-to-video (T2V) model without any distillation; it samples with 50 steps following the DDIM schedule. Variational score distillation (VSD) is a baseline distillation method, originally used to improve diffusion efficiency in 3D and image synthesis, that aligns the student's output distribution with the teacher diffusion model's. Consistency distillation (CD) is another distillation method that enforces a consistency loss against a pre-trained teacher model. Denoising diffusion policy optimization (DDPO) is a reinforcement learning method that optimizes the diffusion sampling policy against a reward.
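To make the step-count comparison concrete, here is a minimal, illustrative DDIM-style sampling loop. The `denoiser` function and the linear noise schedule are toy stand-ins (not the paper's networks or schedule); the point is only that per-sample cost scales with the number of network evaluations, which is why a 4-step student is far cheaper than a 50-step teacher.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x, alpha_t):
    # Stand-in for a trained noise-prediction network (illustrative only).
    return 0.1 * x

def ddim_sample(num_steps, dim=8):
    """Deterministic DDIM-style loop; one denoiser call per step."""
    x = rng.standard_normal(dim)
    alphas = np.linspace(0.05, 0.999, num_steps)  # toy cumulative-alpha schedule
    for i in range(num_steps - 1):
        a_t, a_s = alphas[i], alphas[i + 1]
        eps = denoiser(x, a_t)
        # Predict the clean latent, then jump deterministically to the next level.
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_s) * x0 + np.sqrt(1.0 - a_s) * eps
    return x

teacher_latent = ddim_sample(50)  # teacher: 50 network evaluations
student_latent = ddim_sample(4)   # distilled student: ~12x fewer evaluations
```

Since network evaluation dominates runtime, the wall-clock speedup is roughly the ratio of step counts, before any additional per-step optimizations.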

Spider Plot Comparison

Our DOLLAR

Below are 4-step sampling results by our DOLLAR, which distills student models with variational score distillation, consistency distillation and latent reward model optimization.

Gen-3

Gen-2

Kling

Pika

Teacher (50 steps)

Below are 50-step sampling results by the teacher model.

VSD

Below are 4-step sampling results by the distilled student model via variational score distillation.

VSD+DDPO

Below are 4-step sampling results by the distilled student model via variational score distillation and reward fine-tuning with denoising diffusion policy optimization.

Overview of Training Pipeline

Overview of the training pipeline of our DOLLAR. The few-step generator is trained to map random noise to high-quality samples in latent space, guided by a combination of variational score distillation (VSD), consistency distillation (CD), and latent reward model (LRM) fine-tuning objectives. The VSD loss enhances sample quality, albeit with a risk of mode collapse, while the CD loss increases sample diversity without compromising generation quality. The LRM enables reward-based optimization that further improves sample quality: by scoring latents directly, it bypasses the large pixel-space reward model and the decoder, reducing memory usage and removing the need for a differentiable reward model.
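The combined objective above can be sketched as follows. All functions here (`generator`, `teacher_score`, `fake_score`, `latent_reward`) are toy stand-ins for the paper's networks, and the loss weights are arbitrary; this illustrates only how the three signals combine on a latent sample, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(noise, scale):
    # Few-step student generator acting in latent space (toy stand-in).
    return np.tanh(scale * noise)

def teacher_score(z):
    # Frozen teacher's score estimate; here, score of a unit Gaussian (toy).
    return -z

def fake_score(z):
    # Auxiliary "fake" score network tracking the student's distribution (toy).
    return -0.5 * z

def latent_reward(z):
    # Latent reward model (LRM): scores latents directly, no pixel decoder (toy).
    return -np.mean(z ** 2)

noise = rng.standard_normal((4, 8))
sample = generator(noise, scale=1.0)

# VSD: the gradient signal is the gap between teacher and fake scores at the
# sample; pairing it with the sample gives a surrogate loss for illustration.
vsd_loss = np.mean((teacher_score(sample) - fake_score(sample)) * sample)

# CD: the student's outputs from adjacent noise levels should agree.
sample_adjacent = generator(0.9 * noise, scale=1.0)
cd_loss = np.mean((sample - sample_adjacent) ** 2)

# LRM: maximize the reward predicted directly in latent space.
lrm_loss = -latent_reward(sample)

total_loss = vsd_loss + cd_loss + lrm_loss  # weights omitted for clarity
```

Because the LRM consumes latents rather than decoded frames, the reward term never touches the decoder or a pixel-space reward model, which is the source of the memory savings described above.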

Automatic Evaluation on VBench

Our DOLLAR (HPSv2) and DOLLAR (PickScore) models are evaluated against baselines on the VBench evaluation benchmark, which comprises 16 different metrics where higher values indicate better performance. The highest value in each metric is marked in bold, and the 2nd and 3rd highest are underlined. Our methods, using only 4 inference steps, achieve higher total scores than the baselines, including Pika, Gen-2, Gen-3, Kling, T2V-Turbo, and our teacher model. The highest semantic scores of our models indicate a significant improvement over the baselines in text-video alignment.

Various Styles and Motions

Artistic Styles

Camera Motions

Ablation Study

Explore the details of our Ablation Study on a separate page.

BibTeX

@article{ding2024dollar,
  title={DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization},
  author={Ding, Zihan and Jin, Chi and Liu, Difan and Zheng, Haitian and Singh, Krishna Kumar and Zhang, Qiang and Kang, Yan and Lin, Zhe and Liu, Yuchen},
  journal={arXiv preprint arXiv:2412.15689},
  year={2024}
}