In this blog post I will go over my PyTorch implementation of HUSE: Hierarchical Universal Semantic Embeddings. The paper is authored by Pradyumna Narayana, Aniket Pednekar, Abishek Krishnamoorthy, Kazoo Sone, and Sugato Basu, and can be found here.
Introduction
This paper deals with cross-modal representation learning for images and text. We want to map images and text to a shared latent space such that not only do embeddings of the same class lie close to each other, but embeddings belonging to semantically similar concepts lie close to each other as well.
This example is taken from the paper itself. Say we have four classes of instances with us namely: cat, dog, bridge and tower. The embedding space corresponding to these classes can be seen as:
In this space, the embeddings corresponding to a given class (e.g., image and text corresponding to Golden Gate bridge) lie closest to each other. But we can also see that the embeddings corresponding to cat and dog classes are closer to each other as they are semantically similar. Similarly, the embeddings corresponding to bridge and tower are closer to each other.
Implementation details
Visual: Using the pretrained ResNet50 model to extract embeddings of size 64 from individual images.
Text: Using BERT to obtain the text embeddings. For the text associated with each image, we concatenate the embeddings from the last four layers for each token and then average all token embeddings for the text.
Semantic graph: Constructing an adjacency matrix based on the embeddings extracted from the class names. As class names often contain more than a single word (e.g. bags tote bags tote bags), we use Sentence Transformers that provide sentence level embeddings of class names.
To build the adjacency matrix, each class name is treated as a vertex and the cosine distance between sentence encoder embeddings of two class names is treated as edge weight.
This matrix is used to calculate the Graph Loss (which will be defined later.)
Prerequisites
For this implementation, images from a fashion retail store have been used. The images can be found in the ./images folder, and the corresponding .csv files for both training and validation can be found in the ./data/ folder of the repo here.
I have also used pre-trained PyTorch models for BERT and Sentence Transformers. Install them from here and here.
Workflow
Let us start by importing basic libraries and defining some variables.
Dealing with data
Let us read the training data and have a look at it.
The text associated with each image is the description of the apparel in the picture. If you look at the class for each instance, you will see that it follows a hierarchical pattern in which each level is separated by ‘<’.
I’ll show you the unique classes of the training data.
Before proceeding further, let us clean the data in the text column. The cleaned text will be stored in the processed_text column.
Also, for the classes column, I will replace each ‘<’ with whitespace and store the result in a new column, processed_classes.
You’ll notice that some classes have repeated words. I will remove the repeated words, but I will maintain the original order of words while doing so.
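One way to drop the repeated words while keeping the original order is the `dict.fromkeys` trick, since Python dictionaries preserve insertion order. This is a minimal sketch assuming the class strings are whitespace-separated (the function name is my own):

```python
def dedupe_words(text: str) -> str:
    """Remove repeated words while preserving the order of first occurrence."""
    # dict keys act as an ordered set (insertion order is guaranteed in Python 3.7+)
    return " ".join(dict.fromkeys(text.split()))

print(dedupe_words("bags tote bags tote bags"))  # -> bags tote
```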
Since PyTorch expects numeric labels rather than strings, I also create a feature mapped_class, which is just an integer mapping for every class. The classes are mapped to integers using a dictionary classes_dict.
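The mapping itself can be built in a couple of lines with pandas. The toy class names below are placeholders; the column and dictionary names follow the post:

```python
import pandas as pd

df = pd.DataFrame({"processed_classes": ["cat", "dog", "cat", "bridge"]})

# map each unique class name to an integer label
classes_dict = {c: i for i, c in enumerate(sorted(df["processed_classes"].unique()))}
df["mapped_class"] = df["processed_classes"].map(classes_dict)

print(classes_dict)  # {'bridge': 0, 'cat': 1, 'dog': 2}
```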
Sentence encoding of the classes
I will use the Sentence Transformer to generate encoding for the classes. From these encodings, I will create an adjacency matrix which will further be used in calculating Graph Loss down the line.
Let us import the required libraries first.
I’ll also import the Sentence Transformer model, which will give us BERT embeddings of each class.
I will find the embeddings for each class and store it in sentence_embeddings.
I will now create an adjacency matrix which will be used to store cosine distances between class embeddings.
The shape of the adjacency matrix is 3 x 3, corresponding to the number of classes. I will now store the cosine distances between class embeddings in this matrix.
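Filling the matrix amounts to a double loop over class pairs. The stand-in embeddings below are toy values; in the post they come from the Sentence Transformer encoder:

```python
import numpy as np

# toy stand-in for sentence_embeddings (3 classes, one row per class)
sentence_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
n = sentence_embeddings.shape[0]

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# adjacency[i, j] holds the cosine distance between class i and class j
adjacency = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        adjacency[i, j] = cosine_distance(sentence_embeddings[i], sentence_embeddings[j])
```

The matrix is symmetric with zeros on the diagonal, since every class is at distance zero from itself.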
Datasets and Dataloader
In this section we will define a custom dataset and build our dataloader on top of that.
I’ll import the necessary libraries first.
I will first store the image path and the corresponding text of all instances in array X_train and the targets in y_train.
Now let us create a custom dataset ImageTextDataset which returns the image data, text and target value.
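A minimal sketch of such a dataset could look like the following; the exact signature in the repo may differ, but the idea is to pair each image path with its processed text and integer target:

```python
import torch
from torch.utils.data import Dataset
from PIL import Image

class ImageTextDataset(Dataset):
    """Returns (image, text, target) for each instance."""

    def __init__(self, X, y, transform=None):
        # X: sequence of (image_path, text) pairs; y: integer class labels
        self.X = X
        self.y = y
        self.transform = transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        image_path, text = self.X[idx]
        image = Image.open(image_path).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        target = torch.tensor(self.y[idx], dtype=torch.long)
        return image, text, target
```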
Let us also define some transforms for augmenting the image data.
Creating a train dataloader based on our ImageTextDataset.
What does our train dataloader return?
images is nothing but a tensor of size [B, 3, 284, 284], where B = BATCH_SIZE and 284 is the size we had mentioned in the transforms.
What about texts? It is a tuple of length BATCH_SIZE, and each element is the processed text of the corresponding instance.
And targets? It is a tensor of size BATCH_SIZE holding the ground-truth labels of the instances.
Towers model
In this section I will define our model Towers. This model takes the image and text embeddings as input, and passes each embedding through a number of MLP layers. In the end, the image and text embeddings are passed through a shared layer.
I have decided to make a single model for both image and text towers.
Note that the Towers model returns three outputs: the weighted combination of the image and text features after the shared layer, as well as the image and text embeddings individually after passing through the shared layer.
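A minimal sketch of what such a model could look like; the hidden and output dimensions, and the fixed weighting between the two towers, are illustrative assumptions rather than the repo's exact values:

```python
import torch
import torch.nn as nn

class Towers(nn.Module):
    """Image and text towers sharing a final layer (illustrative dimensions)."""

    def __init__(self, image_dim=64, text_dim=3072, hidden_dim=512,
                 shared_dim=256, num_classes=3, image_weight=0.5):
        super().__init__()
        self.image_weight = image_weight
        self.image_tower = nn.Sequential(
            nn.Linear(image_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, shared_dim), nn.ReLU(),
        )
        self.text_tower = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, shared_dim), nn.ReLU(),
        )
        # shared layer maps both towers into the universal embedding space
        self.shared = nn.Linear(shared_dim, num_classes)

    def forward(self, image_emb, text_emb):
        image_out = self.shared(self.image_tower(image_emb))
        text_out = self.shared(self.text_tower(text_emb))
        combined = self.image_weight * image_out + (1 - self.image_weight) * text_out
        return combined, image_out, text_out
```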
Utility functions
Before defining the custom loss functions, I will define some utility functions that help this implementation run smoothly.
First, I will define a function get_encoding that will generate the Bert embeddings of the text data that we get from our training dataloader. I have left the comments alongside the code for better understanding.
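A sketch of how get_encoding could look with the HuggingFace transformers library; the pooling helper name and the function signature are my own, and the post's version may differ in details:

```python
import torch

def pool_hidden_states(hidden_states):
    """Concatenate the last four BERT layers per token, then average over tokens."""
    cat = torch.cat(hidden_states[-4:], dim=-1)   # (1, seq_len, 4 * hidden_size)
    return cat.mean(dim=1).squeeze(0)             # (4 * hidden_size,)

def get_encoding(texts, tokenizer, bert):
    """BERT embeddings for a batch of raw texts (one vector per text)."""
    embeddings = []
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True)
            out = bert(**inputs, output_hidden_states=True)
            embeddings.append(pool_hidden_states(out.hidden_states))
    return torch.stack(embeddings)

# usage, assuming the transformers package is installed (downloads weights on first run):
# from transformers import BertModel, BertTokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# bert = BertModel.from_pretrained("bert-base-uncased").eval()
# encodings = get_encoding(list_of_texts, tokenizer, bert)
```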
I now define a function graph_loss_utility whose output will be used by our graph loss function to compute the loss. It takes the predicted outputs, the ground-truth targets, and the adjacency matrix of class embeddings as input.
What does this graph_loss_utility function do?
First, for a given batch, I find all 2-tuples (pairs) of indices from 0 to B - 1. Note that I do not include tuples like (0,0), (1,1), (2,2), etc., because they do not contribute to the loss. I store these combinations in a list x_tuples.
For each tuple in x_tuples, I find the real target value of each x in that tuple and store them as a tuple. Hence creating the list of tuples target_tuples. Since we only have 3 classes in this data, the targets will be 0, 1 or 2.
For each tuple in target_tuples, we find the cosine distance between the two targets in the tuple using the adjacency matrix that we had created earlier. Hence creating a list A_ij.
We also have an output tensor of size (B, 3), where 3 is the number of classes and B = BATCH_SIZE. For each tuple in x_tuples, I find the cosine distance between the two examples in that tuple.
e.g. if (0,1) is a tuple in x_tuples, then I will find the distance between output[0] and output[1].
This function finally returns A_ij and cosine_x, both filtered by the margin parameter. See the original paper for more details on this topic.
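Putting the steps above together, a sketch of graph_loss_utility could look like this. The filtering rule (keeping only pairs whose class distance is below the margin) is my reading of the post; consult the paper for the exact condition:

```python
import torch
import torch.nn.functional as F
from itertools import combinations

def graph_loss_utility(outputs, targets, adjacency, margin=1.0):
    """Pairwise quantities consumed by the graph loss.

    outputs:   (B, C) tensor of model outputs
    targets:   (B,) tensor of integer class labels
    adjacency: (C, C) tensor of cosine distances between class embeddings
    """
    x_tuples = list(combinations(range(outputs.size(0)), 2))  # all (i, j) with i < j
    A_ij, cosine_x = [], []
    for i, j in x_tuples:
        # cosine distance between the class embeddings of the two targets
        a = adjacency[targets[i], targets[j]]
        # cosine distance between the two model outputs
        d = 1 - F.cosine_similarity(outputs[i], outputs[j], dim=0)
        # keep only pairs within the margin (assumed filtering rule)
        if a < margin:
            A_ij.append(a)
            cosine_x.append(d)
    return torch.stack(A_ij), torch.stack(cosine_x)
```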
Loss functions
The authors have employed three losses in the paper namely:
Classification Loss - It is essentially a Cross Entropy loss between the outputs of our model and the real targets
GAP Loss - This loss enforces that both image and text embeddings of a single instance should be as similar as possible.
Graph Loss - It is essentially a slightly modified MSE loss, and it makes our embeddings semantically meaningful.
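The first two losses can be sketched directly. Classification is plain cross entropy; GAP loss, as I read the post, is the cosine distance between the image and text outputs of the same instance, averaged over the batch:

```python
import torch
import torch.nn.functional as F

def classification_loss(combined_out, targets):
    # standard cross entropy between predictions and ground-truth labels
    return F.cross_entropy(combined_out, targets)

def gap_loss(image_out, text_out):
    # pushes the image and text embeddings of each instance together
    return (1 - F.cosine_similarity(image_out, text_out, dim=1)).mean()
```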
Explaining the graph loss function
How will our graph_loss_fn make embeddings semantically meaningful?
For any x-tuple pair, say (0,1) having real target tuple, say (1, 2), we find the squared difference between
Cosine distance of embeddings of outputs[0] and outputs[1].
Cosine distance of class embeddings of targets 1 and 2 (can be found easily from the adjacency matrix).
This is done for all the tuples, and an MSE-like loss is computed from these squared errors. Each term is minimised when the cosine distance between the output embeddings is as close as possible to the value from the adjacency matrix.
If say the value from the matrix is low (meaning that class similarity is high), this forces the cosine distance of the input embeddings to be low too. In contrast, if the value from the matrix is high (meaning that class similarity is low), this forces the cosine distance of the input embeddings to be high as well.
This type of regularisation enforces semantically similar classes to be closer to each other, and dis-similar classes to be farther away.
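Put into code, a minimal graph loss could be the following, assuming A_ij and cosine_x tensors of pairwise distances as described above:

```python
import torch

def graph_loss_fn(A_ij, cosine_x):
    """MSE between pairwise output distances and class-embedding distances."""
    return torch.mean((cosine_x - A_ij) ** 2)

# zero loss when the output distances exactly match the adjacency values
print(graph_loss_fn(torch.tensor([0.2, 0.4]), torch.tensor([0.2, 0.4])))  # tensor(0.)
```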
Training our model!
Finally we have arrived at the stage when we can train our model!
Let us begin by importing the necessary modules.
I am defining the train function for training a single epoch.
Let us also define the train_setup function, which handles the complete training of our model. I have also added the ability to checkpoint the model after every epoch (if a checkpoints directory has been given). After training is complete, you can also save the model (if a model save directory is given).
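To show how the three losses combine in one epoch, here is a self-contained sketch. For brevity I assume the image and text embeddings are precomputed per batch, and I inline a vectorised version of the graph loss; the weighting factors and the margin filter are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, batches, optimizer, adjacency,
                    margin=1.0, w_gap=1.0, w_graph=1.0):
    """One epoch over a list of (image_emb, text_emb, targets) batches (sketch)."""
    model.train()
    total = 0.0
    for image_emb, text_emb, targets in batches:
        combined, image_out, text_out = model(image_emb, text_emb)

        # classification loss on the combined output
        cls_loss = F.cross_entropy(combined, targets)
        # GAP loss: pull image and text embeddings of each instance together
        gap = (1 - F.cosine_similarity(image_out, text_out, dim=1)).mean()

        # graph loss: match pairwise output distances to class-embedding distances
        idx = torch.combinations(torch.arange(targets.size(0)), r=2)
        a = adjacency[targets[idx[:, 0]], targets[idx[:, 1]]]
        d = 1 - F.cosine_similarity(combined[idx[:, 0]], combined[idx[:, 1]], dim=1)
        mask = a < margin  # assumed margin filter
        graph = ((d - a)[mask] ** 2).mean() if mask.any() else combined.sum() * 0

        loss = cls_loss + w_gap * gap + w_graph * graph
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / max(len(batches), 1)
```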
Let us train our model for one epoch:
The data that I have provided in the repo consists of only 31 examples belonging to 3 classes. However, I trained the model on 122,164 instances spanning 48 classes for one epoch.
The average loss over the epoch fell from 7.440107 in the first batch iteration to 3.941868 in the last.
The same decreasing trend in average loss can be seen for the individual loss components:
1) Average Classification Loss : 3.469530 to 1.426620
2) Average Graph Loss : 0.100957 to 0.027340
3) Average GAP Loss: 0.740253 to 0.560462
Evaluating our model
Once the training is complete, we can test our model on the validation data. The validation data can be found at ./data/eval_data.csv in the repo.
Let us read the validation data first.
I’ll also clean the text and do other processing as I did during the training phase.
I will first store the image path and the corresponding text of all validation instances in array X_val and the targets in y_val.
Now I will create a validation dataloader based on ImageTextDataset.
Let us also define the validation function val, which will evaluate the model for us.
This function will return the validation loss, validation accuracy, and the confusion matrix for our prediction.
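A sketch of what such a validation function could look like; as with the training sketch, I assume precomputed embeddings per batch, and track accuracy and a confusion matrix alongside the loss:

```python
import torch
import torch.nn.functional as F

def val(model, batches, num_classes):
    """Validation loss, accuracy, and confusion matrix (sketch)."""
    model.eval()
    total_loss, correct, total = 0.0, 0, 0
    confusion = torch.zeros(num_classes, num_classes, dtype=torch.long)
    with torch.no_grad():
        for image_emb, text_emb, targets in batches:
            combined, _, _ = model(image_emb, text_emb)
            total_loss += F.cross_entropy(combined, targets).item()
            preds = combined.argmax(dim=1)
            correct += (preds == targets).sum().item()
            total += targets.size(0)
            for t, p in zip(targets, preds):
                confusion[t, p] += 1  # rows: ground truth, columns: predictions
    return total_loss / max(len(batches), 1), correct / max(total, 1), confusion
```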
Conclusion
In this post, we walked through a PyTorch implementation of HUSE to learn a universal embedding space that incorporates semantic information. The model learns a universal embedding space that preserves the semantic distances of the class label embedding space.