Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, Kfir Aberman, CVPR 2023
Generating realistic images from text prompts has made impressive strides thanks to large text-to-image models. However, these models have a notable drawback: they struggle to faithfully reproduce the appearance of a specific subject across different contexts. The presented work aims to overcome this limitation by introducing a method to personalize text-to-image diffusion models, enabling the creation of photorealistic images of a given subject in various scenes, poses, and lighting conditions.
The main idea is to fine-tune a pre-trained text-to-image diffusion model with a small set of images (typically 3-5) of a particular subject. This implants the subject into the model’s output domain, so that new images of the subject can be synthesized using a unique identifier. The method relies on a loss function called the autogenous class-specific prior preservation loss, which leverages the semantic prior the model already holds over the subject’s class. This ensures that the fine-tuned model can produce varied renditions of the subject without drifting from its original appearance or from the characteristics of its class.
Subject Embedding: Achieved by fine-tuning the model using images of the subject paired with text prompts containing a unique identifier followed by a class name (e.g., “A [V] dog”), embedding the subject into the model’s output domain.
Rare-token Identifiers: Use of rare-token identifiers to minimize the chance that the chosen identifier carries a strong pre-existing association within the model, so that the existing prior does not interfere with the subject’s learned appearance.
Prior Preservation Loss: Introduction of a class-specific prior-preservation loss to counteract language drift and maintain output diversity, which is crucial for generating the subject in varied contexts and viewpoints. Specifically, prior data is generated as $x_{\text{pr}} = \hat{x}(z_{t_1}, c_{\text{pr}})$ by running the ancestral sampler on the frozen pre-trained diffusion model with random initial noise $z_{t_1} \sim \mathcal{N}(0, I)$ and conditioning vector $c_{\text{pr}} := \Gamma(f(\text{“a [class noun]”}))$; the combined training objective is sketched below.
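For reference, the combined fine-tuning objective from the paper, with $\hat{x}_\theta$ the model’s denoised prediction, $\alpha_t$, $\sigma_t$, $w_t$ the noise-schedule terms, $\epsilon, \epsilon' \sim \mathcal{N}(0, I)$, and $\lambda$ weighting the prior-preservation term, takes the form:

$$
\mathbb{E}_{x, c, \epsilon, \epsilon', t}\Big[\, w_t \left\lVert \hat{x}_\theta(\alpha_t x + \sigma_t \epsilon, c) - x \right\rVert_2^2 \;+\; \lambda\, w_{t'} \left\lVert \hat{x}_\theta(\alpha_{t'} x_{\text{pr}} + \sigma_{t'} \epsilon', c_{\text{pr}}) - x_{\text{pr}} \right\rVert_2^2 \Big]
$$

The first term pulls the model toward the subject’s reference images, while the second keeps its outputs for the plain class prompt close to what the frozen model would have produced.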
[Figures: “Model Training Approach” and “Model Overview”]
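To make the fine-tuning step concrete, below is a minimal PyTorch-style sketch of one training iteration with the prior-preservation term. The component names (`unet`, `text_encoder`, `add_noise`), the identifier prompt, and the λ weight are illustrative assumptions rather than the authors’ released code; the model is assumed to predict the clean image $\hat{x}$, matching the paper’s formulation.

```python
import torch
import torch.nn.functional as F

# Hypothetical components standing in for a pre-trained text-to-image diffusion model:
#   text_encoder(prompts)  -> conditioning vectors c
#   unet(noisy_x, t, c)    -> prediction of the clean image x_hat
#   add_noise(x, noise, t) -> alpha_t * x + sigma_t * noise under the noise schedule
# These names are assumptions for illustration, not the paper's released code.

LAMBDA_PRIOR = 1.0            # relative weight of the prior-preservation term (assumed)
SUBJECT_PROMPT = "a [V] dog"  # rare identifier [V] followed by the class noun
CLASS_PROMPT = "a dog"        # class prompt used to generate the prior images

def dreambooth_step(unet, text_encoder, add_noise, optimizer,
                    subject_images, prior_images, num_timesteps=1000):
    """One fine-tuning step combining subject reconstruction and prior preservation."""
    c_subject = text_encoder([SUBJECT_PROMPT] * len(subject_images))
    c_prior = text_encoder([CLASS_PROMPT] * len(prior_images))

    # Sample timesteps and noise independently for the two branches.
    t = torch.randint(0, num_timesteps, (len(subject_images),))
    t_pr = torch.randint(0, num_timesteps, (len(prior_images),))
    noise = torch.randn_like(subject_images)
    noise_pr = torch.randn_like(prior_images)

    # Reconstruction loss on the few subject images.
    pred = unet(add_noise(subject_images, noise, t), t, c_subject)
    loss_subject = F.mse_loss(pred, subject_images)

    # Prior-preservation loss on images the frozen model generated for "a [class noun]".
    pred_pr = unet(add_noise(prior_images, noise_pr, t_pr), t_pr, c_prior)
    loss_prior = F.mse_loss(pred_pr, prior_images)

    loss = loss_subject + LAMBDA_PRIOR * loss_prior
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this setup the prior images $x_{\text{pr}}$ would be generated once, before fine-tuning, by sampling the frozen model with the class prompt, and the subject branch always uses the same few reference photos paired with the identifier prompt.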
The paper presents thorough experiments that highlight the flexibility of the technique, showcasing its ability to recontextualize subjects, modify their properties, and create artistic renditions.
The authors created artistic versions of their subject dog, mimicking the styles of renowned painters. It’s noteworthy that numerous poses generated were not part of the training set, like the renditions inspired by Van Gogh and Warhol. Additionally, some of the interpretations displayed original composition, closely resembling the painter’s style, hinting at a degree of creativity.
The method can also generate images featuring a cat from specified viewpoints, including top, bottom, side, and back views. It’s important to mention that the poses in the generated images differ from the input poses, and the background changes realistically with each pose variation. The authors also emphasize the ability to maintain intricate fur patterns on the cat’s forehead.
In the image below, the first row demonstrates changes in color using prompts like “a [color] [V] car,” while the second row explores hybrids between a particular dog and various animals, employing prompts such as “a cross of a [V] dog and a [target species].” Notably, the approach retains the distinct visual traits that define the subject’s identity, even when modifying specific properties.
The method can also dress the subject dog in various accessories. The dog’s identity remains intact while the authors try out different outfits and accessories using prompts like “a [V] dog wearing a police/chef/witch outfit.” Notably, the interaction between the dog and the outfits or accessories is realistic, offering a wide range of creative possibilities.
The research paper introduces an innovative way to customize text-to-image diffusion models, allowing the creation of incredibly lifelike and diverse images of particular subjects. However, it’s important to note that the paper acknowledges some limitations. These include difficulties in certain situations where the model’s performance might decline, either due to weak priors or challenges in accurately producing the intended environment.