

Graph Communities in High-End Fashion Outfit Complementary Retrieval


Introduction


Figure 1: Illustration of the outfit complementary product retrieval problem

FARFETCH is the world's global platform for luxury fashion, connecting millions of customers in 190 countries to the world's best brands through its e-commerce platform. 
 
The challenge of keeping the customer continuously interested has motivated the development of several techniques that can determine the compatibility between fashion products, through pairwise compatibility [1, 2, 3] or outfit compatibility [4, 5, 6]. The former considers a fashion product as a query and then tries to retrieve compatible items, typically from different categories (e.g., finding a t-shirt that goes well with a given pair of shoes). The latter aims at finding compatible fashion product(s) to form or complete an entire outfit. The outfit complementary product retrieval problem is illustrated in Figure 1.
 
FARFETCH co-organized a challenge within the scope of the VISion Understanding and Machine intelligence (VISUM) summer school. The challenge aimed to investigate and develop AI techniques for fashion outfit complementary product retrieval by leveraging the vast visual and textual data, together with the fashion domain knowledge, found in our catalog. The underlying idea of the challenge (Figure 1) is to solve the fill-in-the-blank (FITB) problem. More specifically, given a subset of product items in an outfit and a set of candidate products from the missing category (i.e., one positive and three negatives), the task is to retrieve the most compatible candidate.

The Challenge

This challenge will act as a starting point for exploring the recommendation of complementary products, allowing the integration of this baseline in the iFetch project. Some of the use cases are illustrated below (Figure 2).

Figure 2: fashion outfit complementary product retrieval.

For this challenge, FARFETCH implemented a baseline that served as a starting point for the students' development. We based the implementation on distance metric learning, to learn sub-spaces where complementary products are close and non-complementary products are distant. In practice, learning distance or similarity metrics between complementary products usually resorts to Siamese neural networks or triplet strategies [5, 6, 7]. A major issue across existing approaches that use a triplet loss to learn feature embeddings for complementary product retrieval is the triplet generation process. To form a triplet, a random pair of products is first selected from a given outfit (i.e., anchor and positive products) and, then, a negative product is randomly sampled from a different outfit, with the only restriction being that it belongs to the same category as the positive product. However, this negative sampling process may lead to a large number of false-negative products (i.e., negatives that can go well with the anchor), especially when the outfits in the training dataset share a large number of products. To mitigate this problem, we further constrained the negative sampling process using the product communities obtained by applying the Louvain method to the products' graph. That is, a negative product should be not only from a different outfit but also from a different graph community than the anchor and the positive. Experimental results suggest that this additional constraint yields an overall improvement in FITB accuracy.

Dataset

The dataset of the VISUM-2021 challenge comprises a total of 128,398 outfits. Each outfit is composed of an arbitrary number of products, ranging from 2 up to 14 products per outfit, each containing rich multimodal information. An example is shown in Figure 3, coupled with a summary of the dataset (Table 1).


Figure 3: example of an outfit and the information related to it.

The dataset is arranged in two tabular files as follows (a minimal loading sketch is shown after the list):
  • Outfits: relates every outfit to the corresponding set of products that belong to it, and is organized by:
    • "outfit_id": the outfit id;
    • "main_product_id": the main product id, representing the anchor product in the outfit;
    • "outfit_products": the set of product ids that belong to the outfit.
  • Products: holds the metadata of every product referenced in the outfits table, and is organized as follows:
    • "productid": the product id;
    • "productname": the product name;
    • "category": the product category;
    • "description": the product description.
  • Note that product images are available in a dedicated folder and named according to the corresponding product id.
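
To make the layout concrete, the snippet below sketches how the two tabular files might be loaded and inspected with pandas. The file names, the CSV format, and the image extension are assumptions, since the challenge description does not fix them here.

```python
import ast
import pandas as pd

# Hypothetical file names; the challenge provides two tabular files
# ("Outfits" and "Products") plus a folder of product images.
outfits = pd.read_csv("outfits.csv")    # outfit_id, main_product_id, outfit_products
products = pd.read_csv("products.csv")  # productid, productname, category, description

# "outfit_products" holds a set of product ids; parse it if stored as a string.
outfits["outfit_products"] = outfits["outfit_products"].apply(ast.literal_eval)

# Metadata of every product in the first outfit.
ids = outfits.iloc[0]["outfit_products"]
example = products[products["productid"].isin(ids)]
print(example[["productid", "category", "productname"]])

# Product images live in a dedicated folder, named by product id (extension assumed).
image_path = f"images/{example.iloc[0]['productid']}.jpg"
```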


Table 1: summary of the dataset

Baseline Model

The baseline model is trained to map the product image and its description into a common multimodal "complementary” embedding space in which compatible products are close to each other and non-complementary products are far apart. For inference, the distances between the outfit products and the candidates are used to select the most compatible candidate.

The works of Lin et al. [5] and Vasileva et al. [7] are the most closely related to our proposed baseline. Nevertheless, in terms of triplet strategies, our proposed negative sampling process mitigates the problem of sampling false-negative products, which is a common issue in these state-of-the-art approaches.

Architecture

To induce the model to learn a complementary embedding space, the proposed baseline model contains three main modules (Figure 4), namely an image, a text, and a multimodal encoder.


Figure 4: Model architecture

Image Encoder

The image encoder aims at learning an encoding function that maps from an input product image, X, to a latent feature representation, h.
 
The architecture of the image encoder comprises a pre-trained ResNet-50 as its base block followed by a projection block with additional trainable layers to increase the overall representational capability of the encoder for our task. The projection block consists of two fully connected layers, with the first one having a Gaussian Error Linear Unit (GELU) non-linearity, a dropout layer, and a residual connection between the first and last layers. In order to maintain feature comparability, the output image representation is normalized onto the unit hypersphere.
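
As a minimal sketch, the projection block and image encoder could look like the PyTorch modules below. The exact layer ordering (where the dropout sits and how the residual connection is wired) is an assumption based on the description above, not the exact baseline code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models


class ProjectionBlock(nn.Module):
    """Two fully connected layers with GELU, dropout and a residual connection."""
    def __init__(self, in_dim, out_dim, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, out_dim)
        self.fc2 = nn.Linear(out_dim, out_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        h = self.fc1(x)                        # first fully connected layer
        r = self.dropout(self.fc2(F.gelu(h)))  # GELU, second layer, dropout
        return h + r                           # residual connection first -> last layer


class ImageEncoder(nn.Module):
    """Pre-trained ResNet-50 backbone followed by a trainable projection block."""
    def __init__(self, embed_dim=2048):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier
        self.projection = ProjectionBlock(2048, embed_dim)

    def forward(self, images):
        feats = self.backbone(images).flatten(1)  # (B, 2048) pooled ResNet features
        h = self.projection(feats)
        return F.normalize(h, p=2, dim=-1)        # project onto the unit hypersphere
```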
 

Text Encoder

Analogously, the text encoder aims to learn an encoding function that maps from a given product description, T, to a latent text representation, f. It consists of a pre-trained DistilBERT model followed by a projection block with the same topology as the image encoder. Following the original BERT and DistilBERT papers, the hidden representation of the [CLS] token is used to summarize the whole product description. This works under the assumption that this representation is able to capture the overall meaning of the product description. For feature comparability, l2-normalization is also applied to the output text embedding, f. 
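
A sketch of the text encoder's [CLS] pooling with the Hugging Face transformers library is shown below; the checkpoint name and the reuse of the ProjectionBlock from the previous sketch are assumptions.

```python
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed checkpoint
bert = AutoModel.from_pretrained("distilbert-base-uncased")
projection = ProjectionBlock(768, 768)  # same topology as the image encoder's block


def encode_descriptions(texts):
    """Summarize each product description by the hidden state of its [CLS] token."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state  # (B, seq_len, 768)
    cls = hidden[:, 0]                        # [CLS] token representation
    f = projection(cls)
    return F.normalize(f, p=2, dim=-1)        # l2-normalized text embedding f
```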

Multimodal Encoder

The multimodal encoder learns a mapping from both visual and text representations to a multimodal "complementary” feature space. The multimodal encoder comprises a merged layer that first concatenates both text and image representations, followed by a projection block (with the same topology as the other two encoders) to properly fuse both modalities into a shared embedding space. The final multimodal latent representation m is also normalized onto the unit hypersphere.
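
The fusion step can be sketched as follows, reusing the ProjectionBlock from the image encoder sketch; the embedding sizes are the ones reported in the experimental setup below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalEncoder(nn.Module):
    """Concatenate the image and text embeddings and fuse them with a projection block."""
    def __init__(self, image_dim=2048, text_dim=768, embed_dim=1024, dropout=0.1):
        super().__init__()
        self.projection = ProjectionBlock(image_dim + text_dim, embed_dim, dropout)

    def forward(self, h_image, f_text):
        merged = torch.cat([h_image, f_text], dim=-1)  # merged layer
        m = self.projection(merged)
        return F.normalize(m, p=2, dim=-1)             # unit-norm multimodal embedding m
```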

Training

The ultimate goal is to learn a "complementary" latent space, where the embeddings of products that go well together are close to each other, while non-complementary product embeddings are far apart. The network parameters are then optimized via a triplet loss that forces the distance between non-complementary product embeddings (anchor and negative samples) to be larger than the distance between complementary product embeddings (anchor and positive samples) by a margin.

The triplet loss used to train the implemented baseline is defined as follows:

$$\mathcal{L} = \sum_{(a,\, p,\, n)} \max\left(0,\ d(m_a, m_p) - d(m_a, m_n) + \alpha\right),$$

where $m_a$, $m_p$, and $m_n$ denote the embeddings of the anchor, positive, and negative products, $d(\cdot, \cdot)$ is the distance in the learned embedding space, $\alpha$ is the margin, and the anchor-positive pairs satisfy

$$\mathrm{outfit}(a) = \mathrm{outfit}(p) \quad \text{and} \quad \mathrm{category}(a) \neq \mathrm{category}(p),$$
which means that anchor and positive samples are from different categories and appear together within the same outfit. Additionally, the mining process of negative samples is constrained not only by the product category and outfit id (as in [5]) but also by the product communities. That is, a negative is randomly sampled from a different outfit but with the same semantic category as the positive, and is further constrained to belong to a different product community. The underlying idea of applying the community constraint is to further reduce the likelihood of selecting negatives that can go well with the anchor (i.e., false negatives). This is especially important when the outfits in the training dataset share a large number of products.
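
In PyTorch, this objective corresponds to the built-in triplet margin loss; the snippet below is a minimal sketch with dummy embeddings standing in for the encoder outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# max(0, d(anchor, positive) - d(anchor, negative) + margin), with Euclidean distance.
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)

# Dummy l2-normalized embeddings standing in for the multimodal encoder outputs.
batch, embed_dim = 128, 1024
m_anchor = F.normalize(torch.randn(batch, embed_dim), dim=-1)
m_positive = F.normalize(torch.randn(batch, embed_dim), dim=-1)
m_negative = F.normalize(torch.randn(batch, embed_dim), dim=-1)

loss = triplet_loss(m_anchor, m_positive, m_negative)
```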
 
Summing up, at each training iteration, we sample a mini-batch of N triplets according to the following constraints:
  • positive and anchor pairs have to belong to the same outfit but to different categories;
  • positive and negative pairs have to belong to the same semantic category;
  • positive and negative pairs have to belong to different outfits;
  • positive and negative pairs have to belong to different communities;
  • anchor and negative pairs have to belong to different communities.
To apply the above-mentioned community constraints, we resort to the Louvain method, which detects communities by optimizing the modularity of the products' graph, a measure of the relative density of edges inside communities with respect to edges between them. The nodes in the products' graph denote the products, while the edges are weighted by the number of outfits in which two products co-occur. Figures 6 and 7 depict the product community detection process. In this particular example, two communities were found from a total of five outfits, which reduces the likelihood of sampling false negatives during training. A minimal sketch of the graph construction and community detection is shown below.
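
The sketch assumes the outfits dataframe from the dataset section and networkx 2.8 or later (which provides louvain_communities); the helper valid_negative is a hypothetical name for the community check applied during triplet mining.

```python
from itertools import combinations

from networkx import Graph
from networkx.algorithms.community import louvain_communities  # networkx >= 2.8

# Build the products' graph: nodes are products, edge weights count how many
# outfits two products appear in together.
G = Graph()
for outfit in outfits["outfit_products"]:
    for u, v in combinations(outfit, 2):
        weight = G[u][v]["weight"] + 1 if G.has_edge(u, v) else 1
        G.add_edge(u, v, weight=weight)

# Louvain detects communities by maximizing the modularity of the graph.
communities = louvain_communities(G, weight="weight", seed=0)
community_of = {p: c for c, members in enumerate(communities) for p in members}

# A negative is only valid if it lies in a different community than both the
# anchor and the positive (on top of the outfit and category constraints).
def valid_negative(anchor, positive, negative):
    return (community_of[negative] != community_of[anchor]
            and community_of[negative] != community_of[positive])
```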


Figure 6: product’s graph built from 5 outfits.


Figure 7: Louvain communities generated from the product graph in Figure 6, where each color represents a community.




Below are some examples of the triplets generated for a given outfit.



Figure 8: example outfit used to generate the triplets shown in Figure 9.


Figure 9: triplets generated from the outfit in Figure 8. In the three items on the left, the dress is the anchor, the red sneakers are the positive, and the white sneakers are the negative; the three items on the right are built analogously.

Inference

After training, model inference is simply performed by querying the learned "complementary" embedding space to return the most compatible product. The model receives a query representing the outfit and a set of candidates composed of 4 products (1 positive and 3 negatives). Then, the predicted candidate product is returned according to one of the following aggregation rules (a minimal sketch in code follows the list):

  • sum, where we return the candidate product with the lowest sum of the distances to all the query products;
  • min, where we return the candidate with the lowest distance to one product in the outfit query.
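
Both rules reduce to aggregating a candidate-by-query distance matrix; the sketch below assumes the function name and tensor shapes, not the actual baseline code.

```python
import torch


def rank_candidates(query_embeds, candidate_embeds, rule="sum"):
    """Pick the most compatible candidate for an outfit query.

    query_embeds:     (Q, D) embeddings of the outfit (query) products
    candidate_embeds: (C, D) embeddings of the candidate products (here C = 4)
    """
    dists = torch.cdist(candidate_embeds, query_embeds)  # (C, Q) pairwise distances

    if rule == "sum":    # lowest sum of distances to all the query products
        scores = dists.sum(dim=1)
    elif rule == "min":  # lowest distance to a single query product
        scores = dists.min(dim=1).values
    else:
        raise ValueError(f"unknown aggregation rule: {rule}")

    return scores.argmin().item()  # index of the predicted candidate
```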

Experimental Evaluation 

For training and evaluation purposes, the VISUM-2021 challenge defines an evaluation protocol with training, validation, and test splits of 40,000, 2,000, and 3,000 outfits, respectively. These splits are disjoint at the outfit level but not at the product level: although no entire outfit appears in more than one set, some product items occur in both the training and test splits. The FITB test queries are composed of a set of products from a given test outfit (query products) and a set of product candidates (one positive and three negatives). The positive candidate is a randomly selected product from the outfit, whereas negative candidates are randomly sampled products of the same category as the positive taken from other outfits.

All versions of the baseline (i.e., ImageModel, TextModel, and MultimodalModel) were implemented in PyTorch and trained for 100 epochs using the Adam optimization algorithm with a learning rate of 1e-03, a batch size of 128 triplets, and a margin of 1.0. Regarding regularization, the l2 coefficient was set to 1e-04 and the dropout rate was empirically set to 0.1. Regarding the model architecture, we used an embedding size of 2048 for the image model, 768 for the text model, and 1024 for the multimodal model, and the text was tokenized with the DistilBERT tokenizer.
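
For reference, the reported hyperparameters translate into roughly the following PyTorch setup; this is a sketch with a stand-in module in place of the actual baseline, and in PyTorch the l2 coefficient is passed to Adam as weight decay.

```python
import torch

config = {
    "epochs": 100,
    "batch_size": 128,     # triplets per mini-batch
    "lr": 1e-3,
    "weight_decay": 1e-4,  # l2 coefficient
    "margin": 1.0,
    "dropout": 0.1,
    "embed_dim": {"image": 2048, "text": 768, "multimodal": 1024},
}

model = torch.nn.Linear(2048 + 768, 1024)  # stand-in for the multimodal baseline
optimizer = torch.optim.Adam(model.parameters(),
                             lr=config["lr"],
                             weight_decay=config["weight_decay"])
```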

Results 

We start by analyzing the impact of the different data modalities and Louvain communities - see  Table 2.


Table 2: Quantitative results

We expected the model to perform best when using the community constraint, and the multimodal model to be the strongest since it considers both modalities (text and image).
 
In Table 2, we highlight the best (green) and worst (red) results. As expected, the lowest loss and the highest accuracy are obtained by the multimodal model with the community constraint. These results support the importance of both text and visual cues for retrieving complementary products.
 
Regarding the impact of the Louvain community constraint, we can see that, for all modalities, the model performs better when the constraint is applied, which supports our hypothesis that it reduces the probability of choosing false negatives.
 
We evaluated our models' performance by applying the aggregation rules mentioned above.


Table 3: aggregation function results for the models trained with the community constraint.

Analogously to what was observed previously, regardless of the chosen function, the multimodal model remains the best, with an accuracy of 0.483 for sum and 0.443 for min (Table 3). Another interesting observation is that the sum function yields the best results across all modalities, as it takes all the query products into account.

Qualitative

To evaluate the model's performance, we also visually inspected its inference results. Table 4 shows two examples where our model predicted the correct complementary product, marked with a violet rectangle. Besides selecting a product, the model also ranks the candidates according to their distance to the outfit query.


Table 4: Visual results, right predictions.

Naturally, our model failed some predictions (Table 5). However, the selected product hardly differs from the correct one: in the masks example, besides sharing the same color and shape, the two products are also from the same brand. This suggests that, even when the selection is not exactly right, the model has learned a meaningful "complementary" latent space. The results shown also indicate that the Louvain communities help to control the selection of false negatives.


Table 5: Visual results, wrong predictions.

In a nutshell, both analyses show that the Louvain constraint plays an important role in this problem, as it helps with disambiguation in large volumes of data. Besides the Louvain communities, it is also worth noting that the choice of aggregation function influences the selected product, as seen in the results above.

Final Remarks

The development and implementation of this baseline served as the starting point for the investigation of our multimodal conversational agent.
 
For the iFetch project, we aim to learn multiple notions of complementarity, and it will be interesting to explore context-aware visual compatibility prediction, which captures the notion of complementarity and the importance of context using graph neural networks.

Acknowledgements

This work was partially funded by the iFetch project, Ref. 45920, co-financed by ERDF, COMPETE 2020, NORTE 2020 and FCT under CMU Portugal.