Semantically Multi-modal Image Synthesis

Zhen Zhu, Zhiliang Xu, Ansheng You, Xiang Bai, CVPR 2020


The paper focuses on the semantically multi-modal image synthesis (SMIS) task, namely, generating multi-modal images at the semantic level. It proposes a novel network for this task, called GroupDNet (Group Decreasing Network). The network unconventionally adopts group convolutions throughout and progressively decreases the group numbers of the convolutions in the decoder, considerably improving training efficiency over other possible solutions such as multiple generators.
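The decreasing-group idea can be sketched in plain NumPy. A grouped convolution splits the channels into independent blocks, and stacking layers whose group count shrinks (here a hypothetical 8 → 4 → 2 → 1 schedule; the paper's actual layer sizes differ) lets initially class-separated features mix gradually:

```python
import numpy as np

def grouped_conv1x1(x, weight, groups):
    """Grouped 1x1 convolution. x: (C_in, H, W); weight: (C_out, C_in // groups).
    Each of the `groups` output channel blocks sees only its own input block."""
    c_in, h, w = x.shape
    c_out = weight.shape[0]
    in_g, out_g = c_in // groups, c_out // groups
    out = np.zeros((c_out, h, w))
    for g in range(groups):
        w_g = weight[g * out_g:(g + 1) * out_g]                # (out_g, in_g)
        x_g = x[g * in_g:(g + 1) * in_g].reshape(in_g, -1)     # (in_g, H*W)
        out[g * out_g:(g + 1) * out_g] = (w_g @ x_g).reshape(out_g, h, w)
    return out

# Decoder-like stack with decreasing group numbers (toy sizes):
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 4, 4))
for groups in (8, 4, 2, 1):
    w = rng.standard_normal((16, 16 // groups))
    x = grouped_conv1x1(x, w, groups)
print(x.shape)  # (16, 4, 4)
```

With `groups=8` the 16 channels form 8 independent streams; by the final `groups=1` layer every output channel can draw on all input channels, which is the mixing behavior the decreasing schedule is meant to provide.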

For each semantic class, there is a specific controller. By adjusting the controller of a given class, only the corresponding areas of the image change accordingly.


For label-to-image translation, the generator G requires the semantic mask M as conditional input to generate images. However, to support multi-modal generation, another input source is needed to control generation diversity. Normally, an encoder is applied to extract a latent code Z as this controller. Given these two inputs, the output image O is yielded through O = G(Z, M). In the SMIS task, however, the goal is to produce semantically diverse images by perturbing a class-specific latent code that independently controls the diversity of its corresponding class.
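The intended behavior can be illustrated with a toy stand-in for G (not the paper's network): each pixel is produced from the latent code of its own semantic class, so perturbing one class's code alters only that class's region of O.

```python
import numpy as np

def toy_generator(z, mask):
    """Toy G: z is (num_classes, d) per-class latent codes, mask is (H, W)
    class labels. Each pixel's value comes only from its own class's code."""
    return z[mask].sum(axis=-1)  # (H, W) "image"

mask = np.array([[0, 0, 1],
                 [0, 1, 1],
                 [1, 1, 1]])
z = np.ones((2, 4))
base = toy_generator(z, mask)

z2 = z.copy()
z2[1] += 1.0                         # perturb only class 1's latent code
out = toy_generator(z2, mask)

changed = (out != base)
assert (changed == (mask == 1)).all()  # only class-1 pixels changed
```

This is exactly the SMIS property: the class-0 region is untouched because its latent code was not perturbed.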

For the SMIS task, the key is to divide Z into a series of class-specific latent codes, each of which controls the generation of only one semantic class. A traditional convolutional encoder is not an optimal choice because the feature representations of all classes are entangled inside the latent code. This observation motivates architectural modifications in both the encoder and the decoder to accomplish the task more effectively.
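A minimal sketch of such a disentangled encoder (an assumed simplification of the paper's group-convolution encoder): the input is split into per-class masked copies and each copy is encoded independently, so the code z_k depends only on pixels of class k.

```python
import numpy as np

def class_wise_encode(image, mask, num_classes, weights):
    """image: (H, W); mask: (H, W) class labels; weights: (num_classes, d),
    one toy per-class "encoder" each. Returns z of shape (num_classes, d)."""
    z = np.zeros((num_classes, weights.shape[1]))
    for k in range(num_classes):
        region = image * (mask == k)        # keep only class-k pixels
        z[k] = weights[k] * region.sum()    # encode that region alone
    return z

img = np.arange(9.0).reshape(3, 3)
mask = np.array([[0, 0, 1],
                 [0, 1, 1],
                 [1, 1, 1]])
W = np.ones((2, 4))
z = class_wise_encode(img, mask, 2, W)

img2 = img.copy()
img2[0, 0] += 5.0                           # change a class-0 pixel
z2 = class_wise_encode(img2, mask, 2, W)
assert (z2[1] == z[1]).all()                # class-1 code unaffected
assert not (z2[0] == z[0]).all()            # class-0 code changed
```

In contrast, a standard encoder would mix both regions into one Z, so the same pixel change would perturb every class's share of the code.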


Main Contributions

Our two cents