OpenAI’s iGPT bridges the language–image gap

Researchers at OpenAI have trained a machine learning model on sequences of pixels to generate coherent images. This is another step toward bridging the gap between computer vision and language understanding techniques.
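The core idea is autoregressive modeling: an image is flattened into a sequence of pixels, and the model predicts each pixel from the ones before it, exactly as a language model predicts the next word. The sketch below illustrates that factorization with a toy count-based model over hypothetical 2x2 binary images; iGPT itself uses a GPT-2-style transformer, but the next-pixel prediction scheme is the same.

```python
# Minimal sketch of autoregressive next-pixel modeling, the idea behind iGPT:
# flatten an image to a pixel sequence, then model p(x) as the product of
# p(pixel_i | pixels before i). A count-based table stands in for the
# transformer here; the training images are made-up toy data.
from collections import defaultdict
import random

def flatten(img):
    return [p for row in img for p in row]

# Toy "training set" of 2x2 binary images (hypothetical data).
train = [[[0, 0], [0, 0]], [[1, 1], [1, 1]], [[1, 1], [1, 0]], [[0, 0], [0, 1]]]

# Count next-pixel frequencies conditioned on the prefix of pixels seen so far.
counts = defaultdict(lambda: defaultdict(int))
for img in train:
    seq = flatten(img)
    for i, px in enumerate(seq):
        counts[tuple(seq[:i])][px] += 1

def sample_image(rng):
    """Generate a 2x2 image one pixel at a time, left to right, top to bottom."""
    seq = []
    for _ in range(4):
        dist = counts[tuple(seq)]
        total = sum(dist.values())
        # Sample the next pixel in proportion to its conditional frequency.
        r, acc = rng.random() * total, 0
        for px, c in sorted(dist.items()):
            acc += c
            if r < acc:
                seq.append(px)
                break
    return [seq[:2], seq[2:]]

generated = sample_image(random.Random(0))
```

Because the table only stores continuations seen in training, this toy model can only reproduce training images; the point of scaling up to a transformer is to generalize across prefixes it has never seen verbatim.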

Google’s BERT, Facebook’s RoBERTa, and OpenAI’s GPT-3 have made great strides on language tasks, but have so far been less successful at generating or classifying images. The new model learns characteristics such as object appearance and category without any hand-coded features or labels.

Writing on VentureBeat, Kyle Wiggers reports that OpenAI trained three versions of its image-generating, GPT-2-based model:

  • iGPT-S – 76 million parameters
  • iGPT-M – 455 million parameters
  • iGPT-L – 1.4 billion parameters

plus another, much bigger version

  • iGPT-XL – 6.8 billion parameters

The results show that image feature quality increased sharply with network depth before mildly decreasing. The researchers also found that scaling up the models and training for more iterations both improved image quality.