A Picture is Made of a Thousand Words
Note - some blogs will be technically detailed, some will be overviews, and some will be thought pieces. This is an overview of an interesting development in generative AI.
The speed at which text-to-image generation has developed is nothing short of staggering. An excellent generative deep learning textbook from 2019 doesn’t even mention it, yet just three years later there are enough easily accessible, powerful online tools that anyone can generate whatever they can dream up. And they have.
We used DALLE-2 to generate all the origami images for our website to allow us to rapidly create content and keep a coherent brand theme. We left our logo and branding in the hands of professionals (thanks Visions!), but the tools are good enough now that generating high quality pictures is a very simple process.
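As a rough illustration of just how simple, here is a minimal sketch of the kind of request involved, assuming the openai Python package as it stood around DALL·E 2’s launch and a valid API key (the key is a placeholder, and parameter names may differ in later SDK versions):

```python
# pip install openai  (0.x-era SDK assumed)
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder - substitute your own key

# Ask for a single 1024x1024 image matching the prompt
response = openai.Image.create(
    prompt="Origami camera made out of yellow paper on a white background",
    n=1,
    size="1024x1024",
)

# The API returns a URL to the generated image
print(response["data"][0]["url"])
```

The human effort mostly goes into iterating on the prompt and picking the best of a handful of outputs.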
We’re even at the stage where this is so accessible that it’s banned from some online communities. That’s one of the quickest hype cycles I’ve ever seen. The number of papers published is steadily increasing too, with 2021 seeing nearly 10x as many as 2016.
Annual Text-To-Image Generation Papers
Evolution Revolution
The research and available tools have come an incredibly long way in a short time. Below are outputs from a number of different services spanning just 2020-2022, all given the same prompt:
Origami camera made out of yellow paper on a white background
Over a very short space of time, both the graphical fidelity and the linguistic comprehension of the prompt have increased dramatically. You can see initial attempts at drawing a camera and origami separately begin to coalesce into a camera made out of origami, until you finally reach a near-perfect representation of the prompt from DALLE-2.
This co-development is no coincidence - increased research into coupling language and vision models (kickstarted by OpenAI’s CLIP in January 2021) has allowed rapid advancements in both fields to be exploited concurrently. Both the image and language models improve over time in the image above, moving from BERT-style transformers and GANs through to, most recently, diffusion models.
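To make that coupling concrete, here’s a rough sketch of what CLIP actually provides: a shared latent space in which an image and a caption can be scored against each other. The sketch assumes the Hugging Face transformers port of OpenAI’s public checkpoint and a local image file (the filename is purely illustrative):

```python
# pip install transformers torch pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load OpenAI's public CLIP weights via the Hugging Face port
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("origami_camera.png")  # any local image will do
captions = [
    "Origami camera made out of yellow paper on a white background",
    "A photograph of a dog playing in a park",
]

# Embed the image and both captions into the same latent space and compare
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)  # how well each caption matches
print(dict(zip(captions, probs[0].tolist())))
```

Guided generation pipelines effectively run this in reverse: rather than ranking fixed captions against a fixed image, they nudge the image (or its latent) until its score against the prompt goes up.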
Model size has also drastically increased, going from the ~450 million parameter CLIP-BigGAN to the 3.5 billion parameter DALLE-2. This scaling of both model size and the associated data requirements moves the training of these models further out of reach for most of us, at least for general-purpose use. Specialised models built with transfer learning will likely spring up in some communities.
Controversies
As with any rapid advancement, there have been a number of issues with how people are using these tools. From angry artists and copyright infringement, through bias and stereotyping, to non-consensual NSFW deepfakes, there are clearly societal issues that need to be addressed with this new technology. OpenAI has tried to actively manage this, and has only recently allowed face editing after adding safety controls, but as models become more prevalent and easier to deploy these controls will likely be easy to circumvent.
The legal system will likely have to do some quick thinking to adapt to these creations; if a model is capable of producing illegal content, at what point is the binary blob itself considered a representation of that content? The border between legal and illegal is difficult enough to draw in real life, let alone in the state space of a generative model.
So What’s Next?
Alongside text-to-image generation, researchers are starting to generate 3D point clouds and textured meshes from text. Whilst image and language research generally progresses faster due to the sheer volume of easily interpretable data, other fields quickly capitalise on these advances. I would expect a text-to-music model to arrive on the heels of OpenAI’s Jukebox.
Once we start talking in common latent spaces, however, there’s nothing stopping more interesting combinations from coming along. Audio from image? Robotic control from text? Watch this space.