Chinese text-to-video tool Vidu takes a bow and trains its sights on Sora

At the Zhongguancun Forum in Beijing last weekend, Chinese startup Shengshu Technology and Tsinghua University unveiled Vidu, a new text-to-video model positioned to compete with OpenAI’s Sora. Vidu generates “realistic” 1080p clips up to 16 seconds long (compared with Sora’s 60 seconds) “with dynamic camera movements, detailed facial expressions, and natural lighting and shadows”. The team built Vidu on its own vision transformer architecture, called Universal Vision Transformer (U-ViT).

“Vidu is the latest achievement of self-reliant innovation, with breakthroughs in many areas. It is imaginative, can simulate the physical world, and produces 16-second videos with consistent characters, scenes, and timeline.”

— Zhu Jun, chief scientist at Shengshu and deputy dean at Tsinghua’s Institute for AI

It takes eight NVIDIA A100 GPUs three hours of inference to generate one of Sora’s 60-second clips, and US export controls bar the sale of such high-end NVIDIA GPUs in China. This might explain why it has taken a while for Vidu to emerge and why it lags behind Sora in output length. However, there are suggestions that Vidu already exceeds Sora in terms of temporal consistency, for example with “the natural movement of water or the bustling activity of a cityscape at night”.

If you have been reading about generative AI, and perhaps even trying it out for yourself, and would like advice and guidance on using it safely and responsibly in a professional context, why not get in touch for a chat? At Mondatum, Colin Birch and John Rowe are your initial points of contact.

Source: Medium