Researchers at Google have developed Imagic, a neural network model that edits images according to a text description. For example, it can change a photo of a dog so that it is standing instead of sitting, while preserving all other details. A paper describing the algorithm has been published on arXiv.org.
Over the past two years, machine learning researchers have made great strides in algorithms that generate fairly realistic images (and, more recently, videos) from a textual description. These capabilities were quickly integrated into graphic editors and even gave rise to new services for designers built on generative neural networks. For example, there is a Photoshop plug-in based on the Stable Diffusion neural network that allows users to generate or draw images.
Developers at Google, led by Michal Irani, went further and taught a neural network to edit images without any manual manipulation at all, requiring only a textual description of the desired edit from the user. Like many recent generative models, the new algorithm creates images by diffusion: starting from an image that contains only noise, it progressively refines it over dozens of steps. You can read more about how such generative models work in our other note.
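The diffusion idea described above can be illustrated with a toy sketch: start from random noise and repeatedly apply a correction step. Everything here is invented for illustration; a real diffusion model would use a trained neural network to predict and remove noise at each step, not the hand-written "denoiser" below.

```python
import random

def toy_denoise_step(x, step, total_steps):
    # Toy "denoiser": nudges each value toward a fixed target signal.
    # A real model would predict the noise with a neural network instead.
    target = [0.5] * len(x)
    alpha = 1.0 / (total_steps - step)  # correction grows stronger near the end
    return [xi + alpha * (ti - xi) for xi, ti in zip(x, target)]

def generate(size=4, total_steps=50, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]  # start from pure noise
    for step in range(total_steps):
        x = toy_denoise_step(x, step, total_steps)
    return x

sample = generate()  # after dozens of steps, the noise has been refined away
```

The loop structure (pure noise in, dozens of gradual refinement steps, image out) is the part that mirrors real diffusion samplers.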
The main innovation of the new algorithm concerns not the generation itself, but its "precursors". Text does not reach the generative neural network directly: it first passes through an encoder, which converts it into a compressed vector representation (an embedding) that captures its meaning, so that sentences with similar meanings have similar embeddings. The researchers decided not to modify the generated image itself, but to work with these text embeddings, says N+1.
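To make the embedding idea concrete, here is a minimal sketch using an invented bag-of-words encoder in place of a real neural text encoder. It only shows the key property the article relies on: texts with similar meanings (here, approximated by shared words) map to nearby vectors.

```python
import math

def embed(text, vocab):
    # Hypothetical bag-of-words encoder: a crude stand-in for the deep
    # text encoder used in a real diffusion pipeline.
    words = text.lower().split()
    vec = [float(words.count(w)) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit-normalized embedding

def cosine(a, b):
    # Vectors are unit-normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

texts = [
    "a dog sitting on the lawn",
    "a dog standing on the lawn",
    "quantum chromodynamics lecture notes",
]
vocab = sorted({w for t in texts for w in t.lower().split()})
e1, e2, e3 = (embed(t, vocab) for t in texts)
# e1 and e2 share most of their words, so cosine(e1, e2) is high,
# while the unrelated e3 has zero overlap with e1.
```

A learned encoder captures similarity of meaning rather than of surface words, but the geometric picture, similar inputs landing close together in embedding space, is the same.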
The algorithm works in three stages. The user supplies the original image and a textual description of what needs to change, such as a photo of a dog standing on a lawn and the text "dog sitting". In the first stage, this phrase is converted into an embedding, which is then optimized so that the image generated from it resembles the original. In the second stage, the diffusion network itself is fine-tuned so that, given the optimized embedding, it generates images close to the original. In the third stage, the algorithm linearly interpolates between the target and optimized embeddings and feeds the result to the fine-tuned network. NIXSolutions notes that in tests this scheme changed only the requested details of the image, leaving the rest almost untouched.
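The three stages above can be sketched numerically. This is a deliberately tiny analogy: 3-element vectors stand in for the image and embeddings, and a linear map with a trainable bias stands in for the diffusion model, so every name and number here is illustrative rather than the actual Imagic method.

```python
def lerp(a, b, eta):
    # Stage 3 core: linear interpolation between two embeddings.
    return [(1 - eta) * ai + eta * bi for ai, bi in zip(a, b)]

# Toy setup: the "image" and "embeddings" are 3-vectors,
# and the "generator" is gen(e) = e + bias.
image = [1.0, 0.0, 0.0]   # original photo (e.g. dog standing)
e_tgt = [0.0, 1.0, 0.0]   # embedding of the target text ("dog sitting")

bias = [0.0, 0.0, 0.0]
gen = lambda e: [ei + bi for ei, bi in zip(e, bias)]

# Stage 1: optimize the embedding so that gen(e_opt) reproduces the image.
e_opt = list(e_tgt)
for _ in range(100):
    out = gen(e_opt)
    e_opt = [ei - 0.1 * 2 * (oi - xi) for ei, oi, xi in zip(e_opt, out, image)]

# Stage 2: fine-tune the generator (here, just its bias) on (e_opt, image).
for _ in range(100):
    out = gen(e_opt)
    bias = [bi - 0.1 * 2 * (oi - xi) for bi, oi, xi in zip(bias, out, image)]

# Stage 3: interpolate between the two embeddings and generate the edit.
edited = gen(lerp(e_opt, e_tgt, eta=0.7))
# "edited" blends the original content with the text-described change.
```

The interpolation weight (eta=0.7 here) controls the trade-off the article describes: closer to the optimized embedding preserves the original image, closer to the target embedding applies the edit more strongly.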