Stable Video Diffusion: experimental image-to-video synthesis

Stability AI's latest early-stage, experimental "open-weights" image-to-video synthesis tool, Stable Video Diffusion, consists of two models:

  • one, "SVD", performs image-to-video synthesis at 14 frames per clip
  • another, "SVD-XT", generates 25 frames per clip

They can operate at varying speeds from 3 to 30 frames per second, and output short (typically 2-4 second) MP4 video clips at 576×1024 resolution. Some parts of the results are static while others display movement (panning, zooming, or animated fire and smoke, for example).
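Those numbers are easy to sanity-check: clip duration is simply frame count divided by frame rate. The snippet below is an illustrative calculation (not part of any official tooling) showing how 14- and 25-frame clips map onto the 3-30 fps range:

```python
# Illustrative calculation: clip duration = frames / fps.
# Frame counts come from the article; the fps values sample the stated 3-30 range.
def clip_duration(num_frames: int, fps: int) -> float:
    """Return clip length in seconds for a given frame count and frame rate."""
    return num_frames / fps

for model, frames in [("SVD", 14), ("SVD-XT", 25)]:
    for fps in (3, 7, 30):
        print(f"{model}: {frames} frames @ {fps} fps = {clip_duration(frames, fps):.2f} s")
```

At a mid-range 7 fps, for instance, a 25-frame SVD-XT clip runs about 3.6 seconds, consistent with the "typically 2-4 second" figure above.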

Stability AI cautions: "While we eagerly update our models with the latest advancements and work to incorporate your feedback, this model is not intended for real-world or commercial applications at this stage. Your insights and feedback on safety and quality are important to refining this model for its eventual release."


Processing can be performed on a local host computer equipped with an NVIDIA GPU. In tests performed by Ars Technica using an NVIDIA RTX 3060, generating a 14-frame clip took about 30 minutes. Cloud-based services such as Hugging Face and Replicate can speed this up.

The Stable Video Diffusion research paper states that the team started from "a large video dataset comprising roughly 600 million samples", which was "curated into the Large Video Dataset (LVD), which consists of 580 million annotated video clips that span 212 years of content in duration". The Stable Video Diffusion source and weights are available on GitHub.
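Those dataset figures imply an average clip length that is easy to back out: 212 years of footage spread over 580 million clips. The quick check below is my own arithmetic, not a figure from the paper:

```python
# Back-of-the-envelope check of the LVD statistics quoted above.
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year in seconds
total_seconds = 212 * SECONDS_PER_YEAR
num_clips = 580_000_000
avg_clip_seconds = total_seconds / num_clips
print(f"Average clip length: {avg_clip_seconds:.1f} s")  # roughly 11.5 s per clip
```

An average of roughly 11-12 seconds per clip is plausible for short annotated video segments, which lends some credibility to the quoted totals.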

Mondatum provides advice, guidance and support on the use of generative AI and machine learning in general, particularly in the content industry. To start a conversation, get in touch via email –