Li 2021 - Embedding-Based Product Retrieval in TaoBao Search
This paper explains TaoBao's embedding-based search retrieval system.
Context
TaoBao's product search has many stages. This paper focuses on the initial retrieval stage, where around 10k products are retrieved per query. The retrievers used are:
- A traditional lexical term-match retriever
- An item-based collaborative filtering retriever (presumably based on recently interacted items)
- An embedding-based retriever
The retrieval results from each retriever are merged and de-duplicated, then passed on to the next stage (pre-ranking). This paper is only focused on the embedding-based retriever.
Problem
Taobao tackles several problems in this paper:
- Poor control of relevance for embedding-based retrieval (EBR). Taobao reported that over time, EBR contributed to an increase in complaints as non-relevant items started to get surfaced. For example, if a query is made for `Nike Shoes`, shoes from similar brands like `Adidas` may get surfaced.
- Balancing relevance and personalization. Taobao notes that personalization is an important aspect of product search, and this paper proposes a way of merging search relevance and personalization.
Setup
Let $\mathcal{U}$ denote the set of users, $\mathcal{Q}$ their corresponding queries, and $\mathcal{I}$ the collection of items. Let us divide the user $u$'s historical sequence of item interactions into 3 buckets:
- Real time, i.e. the most recent item interactions, denoted $\mathcal{R}^u$
- Short term, i.e. within the last 10 days but older than real time, denoted $\mathcal{S}^u$
- Long term, i.e. within the last 1 month but older than short term, denoted $\mathcal{L}^u$
The search task is as follows. We are given the historical behaviours of a given user $u$ and the submitted query $q_u$ at time $t$. We also get access to a sequence of historical queries for the user. Our task is to return a set of items that satisfy the search request. Typically, we score each (query, item) pair according to a score $z$ and return the top-$K$ items:

$$ z = \mathcal{F}\big(\phi(q_u, \mathcal{R}^u, \mathcal{S}^u, \mathcal{L}^u),\ \psi(i)\big) $$
Where:
- $\mathcal{F}$ denotes the scoring function, typically the inner product
- $\phi$ denotes the query + behaviour encoder
- $\psi$ denotes the item encoder
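To make the setup concrete, here is a minimal sketch of two-tower scoring and top-$K$ retrieval. The encoders $\phi$ and $\psi$ are stubbed out as random unit vectors; all names here are my own, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128           # embedding dimension
num_items = 10_000

# Hypothetical stand-ins for phi (query + behaviour encoder) and
# psi (item encoder): random unit vectors.
user_vec = rng.normal(size=d)
user_vec /= np.linalg.norm(user_vec)

item_matrix = rng.normal(size=(num_items, d))
item_matrix /= np.linalg.norm(item_matrix, axis=1, keepdims=True)

# Score z = F(phi(...), psi(i)) with F = inner product, then return the
# top-K items. A production system would use ANN search instead of this
# exhaustive dot product.
K = 10
scores = item_matrix @ user_vec
top_k = np.argsort(-scores)[:K]
print(top_k, scores[top_k])
```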
Model Architecture
The model architecture in the paper is quite complicated, although understandable after some time. I don't think it is profitable to describe the architecture in full specificity, but there are useful ideas within.
At a high level, the model architecture is a standard two-tower setup, where the user query and behaviours are encoded into a user embedding, and the item is encoded into an item embedding. The dot product is used to score user-item pairs and ANN search is used for retrieval. The interesting details lie mostly within the user tower, in particular the way the query and behaviours are integrated.
Query Representation
We first tackle the query representation. Given a query string, the query encoder encodes it into a matrix $Q_{mgs} \in \mathbb{R}^{6 \times d}$, where $d$ is an arbitrary embedding dimension. The first dimension is 6 because the paper uses 6 different ways to encode the query (they call this multi-grained representation). The 6 representations are briefly explained as follows:
- 1-gram and 2-gram embeddings. The query string is tokenized into 1-grams or 2-grams, and embeddings are looked up for each token. Mean pooling over tokens gives a single embedding.
- Phrase embeddings. Similarly, the query string is tokenized into phrases using their query segmentation engine. For example, 红色 ('red') is a phrase and 连衣裙 ('dress') is a phrase. The same embedding lookup and mean pooling is done. Call this embedding $q_{seg}$.
- Phrase transformer embeddings. The phrase embeddings are also passed through a transformer before mean-pooling. I suppose the sequence of phrases matters enough to do this.
- Historical query embeddings. This is the interesting part where the query interacts with historical queries.
  - Specifically, the phrase embedding $q_{seg}$ and the historical query matrix $H_{qry}$ are used to form attention weights of the form $\mathrm{softmax}(q_{seg} \cdot H_{qry}^\top)$, which provide the relevance of each historical query to the current submitted query.
  - The attention weights are then used to take a weighted average over the historical query embeddings $H_{qry}$.
  - Hence we get a weighted representation of historical queries in which queries more relevant to the current query are emphasized.
- Mix embeddings. This is simply a sum over the above 5 embeddings.
Note that in the query representation, both token-based and phrase-based embeddings are used to provide a fine-grained representation of the query. A rough sketch of how the pieces fit together is given below.
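The following is my own reconstruction of the multi-grained query encoding from the description above, focusing on the historical-query attention step. All shapes and variable names are assumptions, not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 128

# Assumed inputs: one pooled embedding per granularity.
q_1gram  = rng.normal(size=d)   # mean-pooled 1-gram embeddings
q_2gram  = rng.normal(size=d)   # mean-pooled 2-gram embeddings
q_seg    = rng.normal(size=d)   # mean-pooled phrase embeddings
q_seg_tf = rng.normal(size=d)   # phrase embeddings after a transformer

# Historical queries: one pooled embedding per past query.
H_qry = rng.normal(size=(20, d))

# Attention of the current query over historical queries:
# softmax(q_seg . H_qry^T) gives one weight per historical query,
# emphasizing past queries relevant to the current one.
attn = softmax(q_seg @ H_qry.T)
q_his = attn @ H_qry

# Mix embedding: a sum over the five representations above.
q_mix = q_1gram + q_2gram + q_seg + q_seg_tf + q_his

# Stack into the 6 x d multi-grained query representation.
Q_mgs = np.stack([q_1gram, q_2gram, q_seg, q_seg_tf, q_his, q_mix])
print(Q_mgs.shape)  # (6, 128)
```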
User Behaviour Representation
Now we tackle the representation of user behaviours. There are some minor deviations in how they treat $\mathcal{R}^u$, $\mathcal{S}^u$, and $\mathcal{L}^u$ respectively, but broadly they follow the same structure. We refer to $\mathcal{R}^u$ below, but a similar treatment applies to $\mathcal{S}^u$ or $\mathcal{L}^u$.
- Each item in the sequence is represented by an embedding.
- The item embedding comprises a concatenation of several embeddings, such as the item ID embedding, item category embedding, brand embedding, etc.
- The sequence of item embeddings is passed into a transformer with self-attention to get a hidden representation $H_{\mathcal{R}}$.
- Similar to the treatment of historical queries above, cross attention is performed with the query representation to get a query-weighted representation of $\mathcal{R}^u$, as follows:
  - The attention weights are formed by taking $\mathrm{softmax}(Q_{mgs} \cdot H_{\mathcal{R}}^\top)$, which provides the relevance of each historically interacted item to the current submitted query.
  - The attention weights are then used to take a weighted average over the historical item representations $H_{\mathcal{R}}$.
  - Hence we get a weighted representation of historical item interactions in which items more relevant to the current query are emphasized.
- The same operations are applied to $\mathcal{S}^u$ and $\mathcal{L}^u$ to get $H_{\mathcal{S}}$ and $H_{\mathcal{L}}$ respectively.
Finally, the query representation and the representations of historical item interactions are concatenated with a [CLS]-style token and passed into a self-attention layer to get the final user representation, which captures both query semantics and historical information about the user:

$$ H_{user} = \mathrm{SelfAtt}\big([\mathrm{[CLS]};\ Q_{mgs};\ H_{\mathcal{R}};\ H_{\mathcal{S}};\ H_{\mathcal{L}}]\big) $$

The output embedding at the [CLS] token is taken to represent the user. This has dimension $d$.
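A condensed sketch of the user tower under my reading of the paper, with attention simplified to a single unparameterized head. `encode_behaviour` and the bucket sizes are my own assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Unparameterized single-head self-attention, for illustration only.
    return softmax(X @ X.T / np.sqrt(X.shape[-1])) @ X

rng = np.random.default_rng(0)
d = 128
Q_mgs = rng.normal(size=(6, d))   # multi-grained query representation

def encode_behaviour(items):
    # items: (seq_len, d) item embeddings for one bucket (R^u, S^u or L^u).
    H = self_attention(items)                  # transformer-style self-attention
    attn = softmax(Q_mgs @ H.T / np.sqrt(d))   # relevance of each item to the query
    return attn @ H                            # query-weighted behaviour representation

H_real  = encode_behaviour(rng.normal(size=(10, d)))   # real-time bucket
H_short = encode_behaviour(rng.normal(size=(50, d)))   # short-term bucket
H_long  = encode_behaviour(rng.normal(size=(100, d)))  # long-term bucket

# Final fusion: self-attention over [CLS] + query + behaviour representations;
# the output at the [CLS] position is the d-dimensional user embedding.
cls = rng.normal(size=(1, d))
fused = self_attention(np.concatenate([cls, Q_mgs, H_real, H_short, H_long]))
user_vec = fused[0]
print(user_vec.shape)  # (128,)
```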
Item Representation
The item embedding is produced similarly to the item representations above. Mean pooling is done over the embeddings of the phrase segmentation of the item title, and the result is added to the ID embedding of the item.
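A minimal sketch of the item tower as described: mean-pool the title's phrase-segment embeddings and add the item ID embedding. The lookup tables here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
id_emb_table     = rng.normal(size=(1_000, d))  # item ID embeddings (stand-in)
phrase_emb_table = rng.normal(size=(5_000, d))  # title phrase embeddings (stand-in)

def encode_item(item_id, title_phrase_ids):
    # Mean-pool the embeddings of the title's phrase segments,
    # then add the item's ID embedding.
    title_vec = phrase_emb_table[title_phrase_ids].mean(axis=0)
    return id_emb_table[item_id] + title_vec

item_vec = encode_item(42, [7, 123, 4001])
print(item_vec.shape)  # (128,)
```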
Loss Function
Now that we have described the model architecture, we proceed to training details. The authors used a sampled softmax loss to train the model. The sampled softmax loss was compared against a pairwise hinge loss and found to be superior. The hinge loss also has the downside of needing a tuned margin parameter, which can significantly affect results.
Specifically, consider a user representation $q_u$ and the representation $i$ of an item. Let the predicted probability of a match between user and item be denoted:

$$ \hat{y}(i \mid u) = \frac{\exp\big(\mathcal{F}(q_u, i)/\tau\big)}{\sum_{i' \in \mathcal{I}} \exp\big(\mathcal{F}(q_u, i')/\tau\big)} $$

where $\tau$ is the temperature parameter (in practice the denominator is approximated over sampled negatives rather than the full item set).

We can then denote the loss as the negative log-likelihood of the positive interactions:

$$ \mathcal{L} = -\sum_{(u,\, i^+)} \log \hat{y}(i^+ \mid u) $$
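A sketch of the sampled softmax loss with temperature. Here the other positives in the batch serve as the sampled negatives, which is one common approximation; the paper's exact sampling scheme may differ.

```python
import numpy as np

def sampled_softmax_loss(user_vecs, item_vecs, tau=1.0):
    """user_vecs: (B, d) user embeddings; item_vecs: (B, d) embeddings of
    each user's positive item. Other rows act as sampled negatives."""
    logits = user_vecs @ item_vecs.T / tau        # (B, B) score matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the positives (the diagonal entries).
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
u = rng.normal(size=(32, 128))
i = rng.normal(size=(32, 128))
print(sampled_softmax_loss(u, i, tau=2.0))
```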
Adjustments
The authors propose two adjustments to the training procedure to improve performance.
Firstly, they propose to use the temperature $\tau$ to handle noisy relevance labels. They argue that the positive signals in e-commerce data from clicks and purchases are noisy, and that training will benefit from tuning the temperature. The argument is:
- As $\tau \to 0$, the softmax function approaches a one-hot function of the item with the highest score. Thus the model will be encouraged to fit the training signals exactly, which runs the risk of overfitting to noisy labels.
- Conversely, as $\tau \to \infty$, the softmax function approaches a uniform distribution over all items regardless of relevance. This encourages the model to underfit the training signals, since scoring positive items higher does not affect the loss by much.
Therefore, supposing the positive signals to be noisy, we should increase $\tau$ to accommodate the noise. The authors found that an increased $\tau$ performed the best for their data.
Note: This approach advocates for more smoothing of the softmax distribution, which is opposite to the findings from papers like SimCSE, which advocate for less smoothing (a very low $\tau$). I suppose the optimal temperature depends on how noisy the positive signals are relative to how noisy the negative signals are.
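A quick numerical illustration of the argument, using toy scores of my own rather than anything from the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

scores = np.array([3.0, 1.0, 0.5])
for tau in [0.05, 1.0, 10.0]:
    print(tau, softmax(scores / tau).round(3))
# tau -> 0  : ~[1.0, 0.0, 0.0]   (one-hot on the top-scoring item)
# tau -> inf: ~[0.33, 0.33, 0.33] (roughly uniform over items)
```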
Secondly, they propose a unique way of mining hard negative samples. First, they mine hard negatives from the random negatives as usual, by choosing the top-$N$ negative items with the highest inner product with the user representation. Let this matrix be called $I_{hard}$. They then linearly interpolate $I_{hard}$ with the positive item embedding $i^+$ as follows:

$$ I_{mix} = \alpha\, i^+ + (1 - \alpha)\, I_{hard} $$

$I_{mix}$ is then used in the loss function as the negative item embeddings. In their experiments, they found that intermediate values of $\alpha$ worked well.
Note: This approach seems to exacerbate the problem of false negatives, since we are using the positive item as part of the negative signal.
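A sketch of the hard-negative generation, assuming randomly sampled negatives as the candidate pool; the function name and default values of `N` and `alpha` are my own choices for illustration.

```python
import numpy as np

def mix_hard_negatives(user_vec, pos_vec, neg_matrix, N=10, alpha=0.5):
    """Select the N negatives scoring highest against the user embedding,
    then interpolate them with the positive item embedding."""
    scores = neg_matrix @ user_vec
    hard = neg_matrix[np.argsort(-scores)[:N]]   # I_hard: (N, d)
    return alpha * pos_vec + (1 - alpha) * hard  # I_mix: (N, d)

rng = np.random.default_rng(0)
d = 128
I_mix = mix_hard_negatives(rng.normal(size=d), rng.normal(size=d),
                           rng.normal(size=(1000, d)))
print(I_mix.shape)  # (10, 128)
```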
Finally, the authors propose a simple solution to the problem they raised of poor relevance matches being surfaced by the EBR approach. The proposed relevance control measure is to simply add a boolean term-based filter to the EBR results. For example, if the query is `Red Nike Shoes`, the boolean query will be something like `Color:Red AND Brand:Nike AND Product:Shoes`. The boolean filter will thus filter out any irrelevant results.
The paper does not elaborate much on how the query filter is implemented. Taobao decides on the key query terms to filter on based on their understanding of what constitutes the essential components of the query (i.e. brand name, color, style, etc.).
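A toy illustration of the relevance control step: EBR candidates are post-filtered by a boolean query built from the essential query terms. The attribute schema here is entirely made up for illustration.

```python
# Hypothetical item attributes keyed by item ID.
items = {
    1: {"Color": "Red",  "Brand": "Nike",   "Product": "Shoes"},
    2: {"Color": "Blue", "Brand": "Nike",   "Product": "Shoes"},
    3: {"Color": "Red",  "Brand": "Adidas", "Product": "Shoes"},
}

# Boolean filter for the query "Red Nike Shoes":
# Color:Red AND Brand:Nike AND Product:Shoes
required = {"Color": "Red", "Brand": "Nike", "Product": "Shoes"}

ebr_candidates = [1, 2, 3]  # item IDs surfaced by the embedding retriever
filtered = [i for i in ebr_candidates
            if all(items[i].get(k) == v for k, v in required.items())]
print(filtered)  # [1]
```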
Note: One may wonder what is the point of Embedding-Based Retrieval (EBR) if we are going to slap a lexical boolean filter on it at the end. The answer is that Taobao is not using EBR as a way to achieve fuzzy semantic matching, as one might do when lexical retrieval is too strict. Rather, it seems that due to the large number of items in Taobao's catalogue, the number of items passing the boolean filter already far exceeds the 10k limit. Thus they are using EBR as a way of surfacing the most relevant items within the subset of items that pass the boolean filter.
Results
Interestingly, Taobao evaluates the quality of the retrieval based on two metrics:
- Recall@1k: A typical measure of how many of the positive items were retrieved by the algorithm within the top 1,000.
- Relevant %: Taobao has an in-house relevance model that scores 0.915 AUC on human-annotated data on whether a query and product are a good relevance match. They use this model to judge whether each (query, retrieved product) pair is relevant, and report the percentage of pairs judged relevant.
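For reference, a minimal implementation of Recall@K as described; the Relevant % metric depends on Taobao's in-house relevance model, so only recall is sketched here.

```python
def recall_at_k(retrieved, positives, k=1000):
    """Fraction of the positive items that appear in the top-k retrieved."""
    top_k = set(retrieved[:k])
    return len(top_k & set(positives)) / len(positives)

print(recall_at_k(retrieved=[3, 7, 42, 9], positives=[42, 100], k=1000))  # 0.5
```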
The findings of the paper are:
- The multi-grained representation of the query adds a small amount of recall
- Tuning the temperature and using hard negatives add a significant amount of relevance
- Adding all the improvements raises relevance but decreases recall slightly, showing that there is some inherent trade-off between relevance and recall (i.e. some users do indeed engage with items that are not relevant to their query)
It seems that Taobao prioritizes relevance over recall, which makes sense as surfacing irrelevant results undermines the integrity of the search system. Furthermore, since they use a boolean filter for relevance control, irrelevant results surfaced by the EBR system would get filtered out anyway.
Takeaways
This paper tackles an important problem for EBR systems, where irrelevant results sometimes get surfaced. While a boolean filter might make sense for Taobao's use case, other systems might need a different approach to avoid being overly strict in relevance control. It is also insightful to understand the inherent trade-off between recall and relevance when training the model.
Finally, the paper also demonstrates a way to incorporate personalization into search via cross attention mechanisms between the query representation and the historical item interactions of the user.