Perception.AI - synthesizing highly realistic images from text descriptions

Class imbalance is a serious predicament in AI. With Perception.AI, we aim to solve this problem by generating images for rare classes that otherwise occur infrequently, or not at all.
Written on Jan 5, 2023 in Applications

“All our knowledge has its origins in our perceptions.” ~ Leonardo da Vinci

Perception is the organization, identification, and interpretation of sensory information in order to represent and understand the presented information or environment. It can differ from person to person, be shaped by one’s surroundings, or closely resemble someone else’s perception.

Background

Automatic synthesis of realistic images from text would be interesting and useful, and recent years have seen AI systems built for it. GAN-INT-CLS, an RNN text encoder paired with a GAN decoder published in 2016, was the first work to propose text-to-image synthesis using generative adversarial modeling. Since then, generic and powerful recurrent neural network architectures have been developed to learn discriminative text feature representations, while deep convolutional generative adversarial networks (GANs) have begun to generate highly compelling images of specific categories, such as faces, album covers, and room interiors; models such as GAWWN and StackGAN build on these advances.

Introduction

Automatically generating images according to natural language descriptions is a fundamental problem in many applications, such as art generation and computer-aided design. It also drives research progress in multimodal learning and inference across vision and language, and, perhaps most importantly, in biomedical imaging and research, one of the most active research areas in recent years.

Class imbalance is a serious predicament in Artificial Intelligence. With Perception.AI, we aim to address this problem by generating images for rare classes that otherwise occur infrequently, or not at all.

In this project, we propose to implement a text-to-image generative adversarial model that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation.

The project is split into three parts. First, we review Deep Learning concepts for data and modeling and apply them to different tasks, including vision and language tasks. Next, we move to development, where we incorporate the trained models into a real-world application. Finally, we deploy the application on Google Cloud Platform (GCP).

The Data

We used the Caltech-UCSD Birds 200 (CUB-200) dataset. It is an image dataset with photos of 200 bird species (mostly North American).

The dataset contains approximately 12,000 images, each annotated with a bounding box for object detection and 10 corresponding text captions.
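For illustration, here is a minimal PyTorch-style sketch of pairing CUB images with their captions. The directory layout (an `images/` folder plus a `text/` folder with one caption file per image) is an assumption for the sketch, not the dataset's official packaging.

```python
# Minimal sketch: pair each CUB image with its ~10 captions.
# The images/ and text/ layout below is assumed for illustration only.
import os
from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms as T

class CUBTextImageDataset(Dataset):
    def __init__(self, root, image_size=256):
        self.image_dir = os.path.join(root, "images")
        self.text_dir = os.path.join(root, "text")
        self.ids = sorted(
            os.path.splitext(f)[0]
            for f in os.listdir(self.image_dir)
            if f.endswith(".jpg")
        )
        self.transform = T.Compose([
            T.Resize(image_size),
            T.CenterCrop(image_size),
            T.ToTensor(),
        ])

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        img_id = self.ids[idx]
        image = Image.open(os.path.join(self.image_dir, img_id + ".jpg")).convert("RGB")
        with open(os.path.join(self.text_dir, img_id + ".txt")) as f:
            captions = [line.strip() for line in f if line.strip()]  # ~10 captions per image
        return self.transform(image), captions
```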

Components and Pipeline Enablement

The components of this work listed below cover the frontend, backend, and all aspects of pipeline enablement:

Tech stack used while working on the project
  • Python, PyTorch, and TensorFlow for coding and deep learning.
  • Ansible for automating Docker and operationalizing the process of building and deploying containers.
  • Docker for containerization of code and product.
  • Google Cloud for Compute Engine and bucket storage.
  • Google Colab and Drive for interactive development and collaboration.
  • React for the app frontend.

How the Project Works

THE BASELINE MODEL:

We took StackGAN as our baseline model and implemented it from scratch.

Baseline model architecture (StackGAN)

It is built around two important ideas:

  • A Conditioning Augmentation block samples latent conditioning variables from a distribution derived from the text embedding, which makes the generator more robust at capturing various objects and poses while adding randomness to the network.
  • Two generative models are stacked on top of each other to produce a high-resolution image (see the sketch after this list).
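These two ideas can be summarised in a small PyTorch sketch: a coarse Stage-I generator driven by the conditioning vector plus noise, and a Stage-II generator that refines its output into a higher-resolution image. The module internals below are placeholders for illustration, not our exact implementation.

```python
# Sketch of the StackGAN stacking idea: Stage-I draws a coarse 64x64 image
# from the conditioning vector c plus noise z; Stage-II conditions on that
# image and the same c to produce a refined 256x256 image.
import torch
import torch.nn as nn

class StageIGenerator(nn.Module):
    def __init__(self, cond_dim=128, z_dim=100):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(cond_dim + z_dim, 512 * 4 * 4), nn.ReLU(True))
        self.upsample = nn.Sequential(
            nn.Upsample(scale_factor=16, mode="nearest"),   # 4x4 -> 64x64
            nn.Conv2d(512, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, c, z):
        h = self.fc(torch.cat([c, z], dim=1)).view(-1, 512, 4, 4)
        return self.upsample(h)                              # coarse 64x64 image

class StageIIGenerator(nn.Module):
    def __init__(self, cond_dim=128):
        super().__init__()
        self.encode = nn.Sequential(nn.Conv2d(3, 128, 4, 2, 1), nn.ReLU(True))  # 64 -> 32
        self.refine = nn.Sequential(
            nn.Conv2d(128 + cond_dim, 128, 3, padding=1), nn.ReLU(True),
            nn.Upsample(scale_factor=8, mode="nearest"),     # 32 -> 256
            nn.Conv2d(128, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, low_res_img, c):
        h = self.encode(low_res_img)
        c_map = c[:, :, None, None].expand(-1, -1, h.size(2), h.size(3))
        return self.refine(torch.cat([h, c_map], dim=1))     # refined 256x256 image
```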

This was the baseline model that we implemented from scratch (find the code in the GitHub link provided at the end of the article). A drawback of StackGAN is that only a single sentence embedding is used as input, so there is no word-level association between the sentence and the image.

Therefore, we researched further and switched to AttnGAN for an improved model.

IMPROVED MODEL:

With a novel attentional generative network, the AttnGAN can synthesize fine-grained details at different subregions of the image by paying attention to the relevant words in the natural language description. In addition, a deep attentional multimodal similarity model is proposed to compute a fine-grained image-text matching loss for training the generator. It supports two important ideas:

  • Word-level features are used to train the model along with the sentence embedding.
  • The final image is passed through an encoder and the DAMSM loss is calculated, which helps the model relate parts of the image to words in the sentence.
The architecture of AttnGAN (the model used in this project)

The working model:

AttnGAN builds on StackGAN by adding an attention network, which allows it to capture word-level information alongside the broader sentence-level information that StackGAN already uses in the form of an embedding. AttnGAN does this by passing the sentence through a bidirectional LSTM that outputs sentence-level and word-level features. The sentence-level feature is a D-dimensional vector, whereas the word-level features form a D × T matrix, where T is the number of words in the text description and D is the embedding dimension.
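A minimal sketch of such a text encoder, with illustrative dimensions rather than our exact configuration:

```python
# Bidirectional LSTM text encoder: returns word-level features of shape
# (batch, D, T) and a D-dimensional sentence feature per sentence.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, feature_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Half the feature dim per direction so forward+backward concat to D.
        self.lstm = nn.LSTM(embed_dim, feature_dim // 2,
                            batch_first=True, bidirectional=True)

    def forward(self, tokens):                       # tokens: (batch, T) word indices
        out, (h, _) = self.lstm(self.embed(tokens))  # out: (batch, T, D)
        word_features = out.transpose(1, 2)          # (batch, D, T)
        sent_features = torch.cat([h[0], h[1]], 1)   # (batch, D) from the two directions
        return word_features, sent_features
```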

The first step is to pass the sentence-level embedding through conditioning augmentation: a conditioning vector is sampled from a Gaussian distribution whose mean and covariance are derived from the sentence embedding. This makes the model more robust by exposing it to a larger variety of conditioning samples.
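A minimal sketch of conditioning augmentation under that description; the layer sizes are illustrative:

```python
# Conditioning augmentation: map the sentence feature to a mean and log-variance,
# then sample the conditioning vector with the reparameterization trick.
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    def __init__(self, sent_dim=256, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(sent_dim, cond_dim * 2)

    def forward(self, sent_features):
        mu, logvar = self.fc(sent_features).chunk(2, dim=1)
        std = torch.exp(0.5 * logvar)
        c = mu + std * torch.randn_like(std)   # sampled conditioning vector
        # KL term keeps the latent distribution close to a standard Gaussian.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return c, kl
```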

We then concatenate the sentence-level representation with a random noise vector and pass it through the first feature generator, which is responsible for most of the upsampling and outputs a hidden state.

This hidden state is passed on to the next stage of the feature generator along with the word-level embeddings. The attention component outputs, for each image sub-region, how important each word was in drawing that sub-region. The third feature generator works the same way as the second.
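A minimal sketch of that word-level attention step, with illustrative shapes; this is not our exact implementation:

```python
# Word-level attention: every image sub-region attends over the words and
# receives a word-context vector. Shapes follow the D / T notation above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAttention(nn.Module):
    def __init__(self, word_dim=256, hidden_dim=64):
        super().__init__()
        self.project = nn.Conv1d(word_dim, hidden_dim, 1)  # map words into image feature space

    def forward(self, hidden, word_features):
        # hidden: (batch, hidden_dim, H, W)   word_features: (batch, word_dim, T)
        b, c, h, w = hidden.shape
        regions = hidden.view(b, c, h * w)                 # (batch, hidden_dim, N regions)
        words = self.project(word_features)                # (batch, hidden_dim, T)
        attn = torch.bmm(regions.transpose(1, 2), words)   # (batch, N, T) region-word scores
        attn = F.softmax(attn, dim=2)                      # each region attends over the words
        context = torch.bmm(words, attn.transpose(1, 2))   # (batch, hidden_dim, N)
        return context.view(b, c, h, w), attn
```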

Training this model is quite interesting, as every feature generator has its own corresponding image generator and discriminator. Each feature generator passes its hidden state to an image generator, which is simply a set of convolutional layers that convert the hidden state into an RGB image. The generated image is then passed through a discriminator that tries to distinguish whether the image is fake, and through this competition both the generator and the discriminator improve.
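A minimal sketch of such an image head (hidden feature map to RGB); the layer choice is illustrative:

```python
# Per-stage image generator: a small convolutional head that turns the
# hidden feature map into an RGB image.
import torch.nn as nn

class ImageHead(nn.Module):
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.to_rgb = nn.Sequential(
            nn.Conv2d(hidden_dim, 3, kernel_size=3, padding=1),
            nn.Tanh(),   # pixel values in [-1, 1]
        )

    def forward(self, hidden):  # hidden: (batch, hidden_dim, H, W)
        return self.to_rgb(hidden)
```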

The discriminator loss is broken up into two parts: a conditional loss and an unconditional loss. The unconditional loss measures whether the discriminator thinks the image is real at all, while the conditional loss measures how well the image matches the sentence, i.e., whether the image makes sense given the description.
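A minimal sketch of that two-part discriminator loss, assuming a discriminator that returns an unconditional logit and a sentence-conditioned logit (the interface is an assumption for illustration):

```python
# Two-part discriminator objective: unconditional real/fake term plus a
# conditional term that checks image-sentence consistency.
import torch
import torch.nn.functional as F

def discriminator_loss(disc, real_imgs, fake_imgs, sent_features):
    real_uncond, real_cond = disc(real_imgs, sent_features)
    fake_uncond, fake_cond = disc(fake_imgs.detach(), sent_features)
    ones, zeros = torch.ones_like(real_uncond), torch.zeros_like(fake_uncond)
    # Unconditional: is the image real at all?
    uncond = (F.binary_cross_entropy_with_logits(real_uncond, ones) +
              F.binary_cross_entropy_with_logits(fake_uncond, zeros))
    # Conditional: does the image match the sentence?
    cond = (F.binary_cross_entropy_with_logits(real_cond, ones) +
            F.binary_cross_entropy_with_logits(fake_cond, zeros))
    return uncond + cond
```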

The generator loss can also be divided into two parts: the sum of the losses of all generators, and λ times the DAMSM loss.

Consider the first part of the generator loss: the sum of the losses for image generators one, two, and three. Each generator’s loss is itself made up of a conditional and an unconditional term, capturing whether the image looks real and how well the sentence-level vector matches the image.
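Putting the pieces together, here is a minimal sketch of the total generator objective, with the stage-wise adversarial terms summed and λ (here `lam`) weighting the DAMSM term; the DAMSM loss itself is sketched after the next paragraph. The discriminator interface matches the assumption above.

```python
# Total generator objective: per-stage conditional + unconditional adversarial
# terms, summed over stages, plus lambda * DAMSM loss.
import torch
import torch.nn.functional as F

def generator_loss(discs, fake_imgs, sent_features, damsm_loss, lam=5.0):
    total = 0.0
    for disc, fake in zip(discs, fake_imgs):            # one discriminator per stage
        uncond_logit, cond_logit = disc(fake, sent_features)
        ones = torch.ones_like(uncond_logit)
        total = total + F.binary_cross_entropy_with_logits(uncond_logit, ones) \
                      + F.binary_cross_entropy_with_logits(cond_logit, ones)
    return total + lam * damsm_loss
```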

To capture how well the word-level features are reflected in the image, the authors of the paper came up with the DAMSM loss, which stands for Deep Attentional Multimodal Similarity Model. The generated image is passed through an Inception-v3 model to obtain global and local image features, which are then passed through a perceptron layer so they have the same dimensionality as the word-level embeddings. A dot product between the image features and the word-level features gives a similarity matrix; a weighted sum is then taken over the sub-regions, and the cosine similarity is computed between each word and its corresponding region context. A high similarity means that word had a strong impact on the corresponding region of the image, which gives the model its attention behaviour.
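A simplified sketch of the word-level DAMSM matching score just described; the full DAMSM additionally turns these image-sentence scores into a batch-wise matching loss, and the gamma values here are illustrative:

```python
# Word-level DAMSM score: compare normalized region and word features via a
# region-word similarity matrix, build an attention-weighted region context
# per word, and pool the word/context cosine similarities into one score.
import torch
import torch.nn.functional as F

def damsm_word_score(region_features, word_features, gamma1=5.0, gamma2=5.0):
    # region_features: (batch, D, N regions)   word_features: (batch, D, T words)
    regions = F.normalize(region_features, dim=1)
    words = F.normalize(word_features, dim=1)
    sim = torch.bmm(words.transpose(1, 2), regions)           # (batch, T, N) word-region scores
    attn = F.softmax(gamma1 * sim, dim=2)                     # each word attends over regions
    context = torch.bmm(regions, attn.transpose(1, 2))        # (batch, D, T) region context per word
    word_score = F.cosine_similarity(word_features, context, dim=1)  # (batch, T)
    # Log-sum-exp pooling of per-word scores into one image-sentence score.
    return torch.log(torch.exp(gamma2 * word_score).sum(dim=1)) / gamma2
```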

The Product:

Technical Architecture of the project

This is the high-level view from development to deployment, where we are illustrating the interactions of various components:

  1. We used an IDE (such as VSCode) and the CLI to build the app, and all development is containerized.
  2. We used GitHub for source control and collaborative work across the team.
  3. We used Google Colaboratory for EDA and baseline modeling, and we also established experiment tracking, such as model checkpoints during training.
  4. We pushed our container images to the Google Container Registry (GCR), which hosts all the container images.
  5. We used a Compute Engine instance to host a single instance of all containers.

Workflow:

Users access the application through the deployed webpage.

Here, the user enters a description of the bird they want an image of into the text box.

After hitting the ‘Generate’ button, a POST request is submitted and our model comes into play. It generates an image matching the description, along with the attention feature maps, and the frontend then retrieves the output via a GET request.
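As an illustration of this round trip, a client could call the generation endpoint roughly like this; the URL and the request/response fields are hypothetical, not the deployed API’s actual contract:

```python
# Hypothetical client call to the generation endpoint; the host, path, and
# payload/response fields are placeholders for illustration only.
import requests

API_URL = "http://<app-host>/api/generate"   # placeholder host and path

response = requests.post(API_URL, json={"description": "a small bird with a red head and white belly"})
response.raise_for_status()
result = response.json()
print(result.keys())   # e.g. the generated image and attention maps, if the API returns them
```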

We can then view the generated images. When the user hits the ‘Get Perception’ button, the feature maps are displayed. Feature maps help the user understand which parts of the image the model focused on for a given set of words.

Results and Findings

The proposed AttnGAN significantly outperforms the previous state-of-the-art, boosting the best-reported inception score by 14.14% on the CUB dataset. It shows that the layered attentional GAN is able to automatically select the condition at the word level for generating different parts of the image. The experimental results show that, compared to previous state-of-the-art approaches, AttnGAN is more effective for generating complex scenes due to its novel attention mechanism, which captures fine-grained word-level and sub-region-level information in text-to-image generation.

Future Work

  • Due to limited computation power and time, we kept our project limited to a smaller dataset, CUB. Given more computation power, we could extend our model to a more general set of data. We plan to extend the idea to high-quality, diverse images with unforeseen combinations, such as camels surrounded by snow or zebras in a city.
  • With these new capabilities, our model could be used to create new visual examples to augment data sets to include diverse objects and scenes; help artists and creators with more expansive, creative AI-generated content; and advance research in high-quality image generation.
  • We would like to extend this idea from generating images to generating 3-dimensional images and videos using GANs. Such architectures could be useful in forecasting applications such as weather prediction, autonomous driving, etc.
  • Building generative models of the world around us is considered one way to measure our understanding of physical common sense and predictive intelligence.

Conclusion

We hope that, with Perception.AI, we achieve a well-generalized and diversified model that can be taken further into medical studies to combat class imbalance in medical datasets.

Written by

Anshika Gupta, Harsh Vardhan, Meghana Sarikonda & Vishnu M
