An image generated by DALL-E 2 based on the text prompt "Teddy bears working on new AI research underwater with 1990s technology"

| Original author(s) | OpenAI |
| --- | --- |
| Initial release | January 5, 2021 |
| Type | Transformer language model |
| Website | openai |
DALL-E (stylized as DALL·E) and DALL-E 2 are transformer models developed by OpenAI to generate digital images from natural language descriptions. DALL-E was revealed by OpenAI in a blog post in January 2021, and uses a version of GPT-3[1] modified to generate images. In April 2022, OpenAI announced DALL-E 2, a successor designed to generate more realistic images at higher resolutions that "can combine concepts, attributes, and styles".[2]
OpenAI has not released source code for either model, although output from a limited selection of sample prompts is available on OpenAI's website.[1] As of 20 July 2022, DALL-E had entered a beta phase, with invitations sent to 1 million waitlisted individuals.[3][4] Access was previously restricted to pre-selected users for a research preview due to concerns about ethics and safety.[5][6] Despite this, several open-source imitations trained on smaller amounts of data have been released by others.[7][8][9]
The software's name is a portmanteau of the names of animated Pixar character WALL-E and the Spanish surrealist artist Salvador Dalí.[10][1]
The Generative Pre-trained Transformer (GPT) model was initially developed by OpenAI in 2018,[11] using the Transformer architecture. The first iteration, GPT, was scaled up to produce GPT-2 in 2019;[12] in 2020 it was scaled up again to produce GPT-3, with 175 billion parameters.[13][1][14] DALL-E's model is a multimodal implementation of GPT-3[15] with 12 billion parameters[1] which "swaps text for pixels", trained on text-image pairs from the Internet.[16] DALL-E 2 uses 3.5 billion parameters, a smaller number than its predecessor.[17]
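The description above, of a GPT-style model that "swaps text for pixels", can be illustrated with a toy sketch. The code below is not OpenAI's implementation: it assumes a hypothetical discrete image-token codebook whose indices a small decoder-only transformer samples autoregressively after the text tokens, and all sizes and names are illustrative. In such a scheme the returned codebook indices would still need a separate image decoder to be turned back into pixels.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # illustrative vocabulary sizes
MAX_TEXT, N_IMAGE_TOKENS = 64, 256      # illustrative sequence lengths

class TinyTextToImageTransformer(nn.Module):
    """Toy decoder-only transformer over a joint text + image-token vocabulary."""
    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, d_model)
        self.pos = nn.Embedding(MAX_TEXT + N_IMAGE_TOKENS, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.to_logits = nn.Linear(d_model, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):
        seq_len = tokens.shape[1]
        # Causal mask: each position may only attend to earlier positions.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        x = self.embed(tokens) + self.pos(torch.arange(seq_len))
        return self.to_logits(self.blocks(x, mask=mask))

@torch.no_grad()
def generate_image_tokens(model, text_tokens):
    """Sample image tokens one at a time, conditioned on the text prompt."""
    seq, n_text = text_tokens, text_tokens.shape[1]
    for _ in range(N_IMAGE_TOKENS):
        image_logits = model(seq)[:, -1, TEXT_VOCAB:]          # next image-token logits
        next_tok = torch.multinomial(image_logits.softmax(-1), 1) + TEXT_VOCAB
        seq = torch.cat([seq, next_tok], dim=1)
    return seq[:, n_text:] - TEXT_VOCAB  # codebook indices for an image decoder
```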
DALL-E was developed and announced to the public in conjunction with CLIP (Contrastive Language-Image Pre-training).[16] CLIP is a separate model, based on zero-shot learning, that was trained on 400 million pairs of images with text captions scraped from the Internet.[1][16][18] Its role is to "understand and rank" DALL-E's output by predicting which caption, from a list of 32,768 captions randomly selected from the dataset (of which one is correct), is the most appropriate for an image. This model is used to filter a larger initial list of images generated by DALL-E and select the most appropriate outputs.[10][16]
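As a concrete illustration of this reranking step (a sketch, not OpenAI's pipeline), the code below scores candidate image files against a prompt using the publicly released CLIP weights via Hugging Face Transformers and keeps the highest-scoring ones; the file paths and the number kept are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Publicly released CLIP weights; loading this particular checkpoint is an illustrative choice.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rerank(prompt, image_paths, keep=4):
    """Return the `keep` candidate images that CLIP scores as best matching the prompt."""
    images = [Image.open(path) for path in image_paths]
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image[i, 0] is the image-text similarity of candidate i and the prompt.
    scores = outputs.logits_per_image[:, 0].tolist()
    ranked = sorted(zip(image_paths, scores), key=lambda pair: pair[1], reverse=True)
    return [path for path, _ in ranked[:keep]]

# Hypothetical usage with placeholder file names:
# best = rerank("an armchair in the shape of an avocado",
#               ["candidate_0.png", "candidate_1.png", "candidate_2.png"])
```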
DALL-E can generate imagery in multiple styles, including photorealistic imagery, paintings, and emoji.[1] It can "manipulate and rearrange" objects in its images,[1] and can correctly place design elements in novel compositions without explicit instruction. Thom Dunn, writing for BoingBoing, remarked that "For example, when asked to draw a daikon radish blowing its nose, sipping a latte, or riding a unicycle, DALL-E often draws the handkerchief, hands, and feet in plausible locations."[19] DALL-E has shown the ability to "fill in the blanks" and infer appropriate details without specific prompts, such as adding Christmas imagery to prompts commonly associated with the celebration[20] and appropriately placed shadows to images that did not mention them.[21] Furthermore, DALL-E exhibits a broad understanding of visual and design trends.[citation needed]
DALL-E is able to produce images for a wide variety of arbitrary descriptions from various viewpoints[22] with only rare failures.[10] Mark Riedl, an associate professor at the Georgia Tech School of Interactive Computing, found that DALL-E could blend concepts (described as a key element of human creativity) and that this represented a major advance.[23][24]
Its visual reasoning ability is sufficient to solve Raven's Matrices (visual tests often administered to humans to measure intelligence).[25]
DALL-E's reliance on public datasets influences its results and leads to algorithmic bias in some cases, such as generating higher numbers of men than women for requests that do not mention gender.[26] DALL-E 2's training data was filtered to remove violent and sexual imagery, but this was found to increase bias in some cases, such as reducing the frequency with which women were generated.[27] OpenAI hypothesizes that this may be because women were more likely to be sexualized in the training data, which caused the filter to influence results.[27]
A concern about DALL-E and similar image-generation models is that they could be used to propagate deepfakes and other forms of misinformation.[28][29] In an attempt to mitigate this, the software rejects prompts involving public figures and uploads containing human faces.[30] Prompts containing potentially objectionable content are blocked, and uploaded images are analyzed to detect offensive material.[31] A disadvantage of prompt-based filtering is that it is easy to bypass with alternative phrases that produce similar output: for example, the word "blood" is filtered, but "ketchup" and "red liquid" are not.[32][31]
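The ease of bypassing term-based filtering can be seen with a toy example. The sketch below is a hypothetical keyword blocklist, not OpenAI's actual moderation system; it only shows why swapping a blocked word for a visually similar phrase defeats simple term matching.

```python
# Hypothetical blocklist for illustration only; not OpenAI's moderation system.
BLOCKED_TERMS = {"blood"}

def prompt_allowed(prompt: str) -> bool:
    """Reject a prompt if any blocked term appears as a word in it."""
    words = set(prompt.lower().split())
    return BLOCKED_TERMS.isdisjoint(words)

print(prompt_allowed("a puddle of blood on the floor"))       # False - blocked
print(prompt_allowed("a puddle of ketchup on the floor"))     # True  - bypassed
print(prompt_allowed("a puddle of red liquid on the floor"))  # True  - bypassed
```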
DALL-E's language understanding has limits. It is unable to distinguish "A yellow book and a red vase" from "A red book and a yellow vase", or "A panda making latte art" from "Latte art of a panda".[33] It generates images of "an astronaut riding a horse" when presented with the prompt "a horse riding an astronaut".[34] It also fails to generate correct images in a variety of other circumstances: requests involving more than three objects, negation, numbers, or connected sentences may result in mistakes, and object features may appear on the wrong object.[22]
Most coverage of DALL-E focuses on a small subset of "surreal"[16] or "quirky"[23] outputs. DALL-E's output for "an illustration of a baby daikon radish in a tutu walking a dog" was mentioned in pieces from Input,[35] NBC,[36] Nature,[37] and other publications.[1][38][39] Its output for "an armchair in the shape of an avocado" was also widely covered.[16][24]
ExtremeTech stated "you can ask DALL-E for a picture of a phone or vacuum cleaner from a specified period of time, and it understands how those objects have changed".[20] Engadget also noted its unusual capacity for "understanding how telephones and other objects change over time".[21]
According to MIT Technology Review, one of OpenAI's objectives was to "give language models a better grasp of the everyday concepts that humans use to make sense of things".[16]
There have been several attempts to create open-source implementations of DALL-E.[7][40] Released in 2022, Craiyon (formerly DALL-E Mini, until a name change was requested by OpenAI in June 2022) is an AI model based on the original DALL-E that was trained on unfiltered data from the Internet. It attracted substantial media attention after its release, owing to its capacity for producing humorous imagery.[41][42][43]