
Stable Video Diffusion: experimental image-to-video synthesis


Stability AI’s latest release, Stable Video Diffusion, is an early, experimental “open-weights” image-to-video synthesis tool consisting of two models:

  • “SVD” generates 14 frames of video from a single input image
  • “SVD-XT” extends this to 25 frames

Both models can run at frame rates from 3 to 30 frames per second and output short (typically 2–4 second) MP4 clips at 576×1024 resolution. Parts of each result are often static while others show movement (panning, zooming, or animated fire and smoke, for example).
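For readers who want to experiment, the sketch below shows how these models are typically driven through Hugging Face’s diffusers library. The article does not prescribe a toolchain, so the library choice, model identifier and parameter values here are assumptions rather than Stability AI’s official workflow; a recent diffusers release and a CUDA-capable GPU with ample memory are assumed.

  # Hedged sketch: image-to-video with the 25-frame SVD-XT model via diffusers.
  # Assumes diffusers >= 0.24, PyTorch with CUDA, and enough GPU memory;
  # the model ID and settings are illustrative, not taken from the article.
  import torch
  from diffusers import StableVideoDiffusionPipeline
  from diffusers.utils import load_image, export_to_video

  pipe = StableVideoDiffusionPipeline.from_pretrained(
      "stabilityai/stable-video-diffusion-img2vid-xt",  # SVD-XT (25 frames)
      torch_dtype=torch.float16,
      variant="fp16",
  )
  pipe.enable_model_cpu_offload()  # move idle sub-models to system RAM

  # Any still image can condition the video; resize to the 1024x576 training resolution.
  image = load_image("input.png").resize((1024, 576))

  generator = torch.manual_seed(42)  # fixed seed for reproducible motion
  frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
  export_to_video(frames, "generated.mp4", fps=7)  # write a short MP4 clip

Swapping in “stabilityai/stable-video-diffusion-img2vid” would select the 14-frame SVD variant instead.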

Stability AI cautions: “While we eagerly update our models with the latest advancements and work to incorporate your feedback, this model is not intended for real-world or commercial applications at this stage. Your insights and feedback on safety and quality are important to refining this model for its eventual release.”


Processing can be performed locally on a computer with an NVIDIA GPU. In tests performed by Ars Technica on an NVIDIA RTX 3060, generating a 14-frame clip took about 30 minutes. Cloud-based services such as Hugging Face and Replicate can speed this up.
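On consumer GPUs, memory is often the limiting factor. Continuing the sketch above (and still assuming the diffusers toolchain, which the article does not mention), the library documents a few settings that trade generation speed for lower peak VRAM:

  # Memory-saving options for the SVD pipelines; illustrative values, not from the article.
  pipe.enable_model_cpu_offload()      # keep only the active sub-model on the GPU
  pipe.unet.enable_forward_chunking()  # chunk the temporal feed-forward layers
  frames = pipe(
      image,
      decode_chunk_size=2,             # decode fewer frames at once (slower, less VRAM)
      generator=generator,
  ).frames[0]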

The Stable Video Diffusion research paper reveals that the models were trained on “a large video dataset comprising roughly 600 million samples”, “curated into the Large Video Dataset (LVD), which consists of 580 million annotated video clips that span 212 years of content in duration”. The Stable Video Diffusion source code is available on GitHub, and the model weights are published on Hugging Face.

Mondatum provides advice, guidance and support on the use of generative AI and machine learning, particularly in the content industry. To start a conversation, get in touch via email – contact@mondatum.com.


