OpenAI, the highly valued startup behind the popular text-to-image generator DALL-E, has announced the release of Point-E, which can produce 3D point clouds directly from text prompts in a minute or two on a computer with a sufficiently powerful GPU.
How is it done?
“To produce a 3D object from a text prompt, we first sample an image using the text-to-image model, and then sample a 3D object conditioned on the sampled image. Both of these steps can be performed in a number of seconds, and do not require expensive optimization procedures,” the OpenAI researchers explain in the accompanying paper.
Point-E first generates a synthetic rendered view of the scene described in the text prompt, then runs that generated image through a second set of diffusion models to create an RGB point cloud of the object: first a coarse 1,024-point cloud, then a finer 4,096-point cloud upsampled from it.
“In practice, we assume that the image contains the relevant information from the text, and do not explicitly condition the point clouds on the text,” the researchers note.
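For a sense of how that second stage works in practice, here is a minimal sketch of the image-to-point-cloud step, adapted from the example notebooks in the open-source point-e repository. The model names, sampler arguments, and input image path are assumptions drawn from those notebooks rather than a definitive recipe, and the conditioning image would normally come from the text-to-image stage.

```python
# Sketch of Point-E's image-to-point-cloud stage, adapted from the example
# notebooks in the open-source repo (https://github.com/openai/point-e).
# Model names and sampler arguments follow the notebooks and may change.
import torch
from PIL import Image
from tqdm.auto import tqdm

from point_e.diffusion.configs import DIFFUSION_CONFIGS, diffusion_from_config
from point_e.diffusion.sampler import PointCloudSampler
from point_e.models.configs import MODEL_CONFIGS, model_from_config
from point_e.models.download import load_checkpoint

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Base model: produces the coarse 1,024-point cloud from the image.
base_name = 'base40M'  # larger checkpoints (e.g. base300M) also exist
base_model = model_from_config(MODEL_CONFIGS[base_name], device)
base_model.eval()
base_model.load_state_dict(load_checkpoint(base_name, device))
base_diffusion = diffusion_from_config(DIFFUSION_CONFIGS[base_name])

# Upsampler: refines the coarse cloud up to 4,096 points.
upsampler_model = model_from_config(MODEL_CONFIGS['upsample'], device)
upsampler_model.eval()
upsampler_model.load_state_dict(load_checkpoint('upsample', device))
upsampler_diffusion = diffusion_from_config(DIFFUSION_CONFIGS['upsample'])

# Chain the two diffusion stages: 1,024 coarse points, then 3,072 more.
sampler = PointCloudSampler(
    device=device,
    models=[base_model, upsampler_model],
    diffusions=[base_diffusion, upsampler_diffusion],
    num_points=[1024, 4096 - 1024],
    aux_channels=['R', 'G', 'B'],  # per-point RGB channels
    guidance_scale=[3.0, 3.0],
)

# Condition the point-cloud diffusion on a rendered view
# (hypothetical path; in the full pipeline this image is generated
# from the text prompt by a text-to-image model).
img = Image.open('synthetic_view.png')

samples = None
for x in tqdm(sampler.sample_batch_progressive(
        batch_size=1, model_kwargs=dict(images=[img]))):
    samples = x

pc = sampler.output_to_point_clouds(samples)[0]  # 4,096-point RGB cloud
```

The `num_points` argument mirrors the coarse-then-fine split described above: 1,024 points from the base model, then 3,072 added by the upsampler to reach 4,096.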
The diffusion models were trained on “millions” of 3D models, all converted into a standardised format.
“While our method performs worse on this evaluation than state-of-the-art techniques, it produces samples in a small fraction of the time,” the researchers acknowledge.
Text-to-3D, especially in real time, is still largely a research area, but Point-E has one advantage over its competitors in the sector: its parent company’s text and image data resources.
OpenAI has posted the open-source Point-E code on GitHub.
Source: Mashable