Zhang 2025 - Qwen3 Embedding
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
This short paper introduces the Qwen3 Embedding and reranker series, which are currently the strongest open-source models for such tasks. The Qwen3 foundation decoder LLM serves both as the backbone for fine-tuning and as the generator of high-quality synthetic training data. Note that the models are multilingual and publicly available under the Apache 2.0 license, which means they can be used commercially.
Characteristics
The embedding and reranking models come in 3 sizes: 0.6B, 4B and 8B.
- 0.6B: 28 layers, embedding dimension of 1024 for the embedder
- 4B: 36 layers, embedding dimension of 2560 for the embedder
- 8B: 36 layers, embedding dimension of 4096 for the embedder
Since the 4B and 8B models have the same number of layers, presumably the 8B model has a larger hidden size.
All models have a 32K sequence length limit and are instruction aware, meaning that the instruction at the start of the prompt can be adjusted to change the behaviour of the embedder or reranker. The embedding models also support MRL (Matryoshka Representation Learning), meaning that custom dimensions can be used for the embeddings.
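As a minimal sketch of how MRL-style truncation is typically applied on the consumer side (the target dimension of 256 is an arbitrary choice for illustration, not a value from the paper):

```python
import torch
import torch.nn.functional as F

def truncate_embedding(emb: torch.Tensor, dim: int) -> torch.Tensor:
    """Keep the first `dim` components of an MRL-trained embedding
    and re-normalize so cosine similarity remains well defined."""
    return F.normalize(emb[..., :dim], p=2, dim=-1)

# Example: shrink a full 1024-dim embedding (0.6B embedder) to 256 dims.
full = F.normalize(torch.randn(2, 1024), dim=-1)   # stand-in for real embeddings
small = truncate_embedding(full, 256)              # shape (2, 256), unit norm
```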
Embedder
The text embeddings are obtained by appending an `[EOS]` token to the end of every input sequence. The final embedding is taken from the last layer's hidden state at this `[EOS]` position.
The input format for queries and documents is as follows:
{Instruction} {Query or Document}<|endoftext|>
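A minimal sketch of this last-token extraction with Hugging Face transformers (the checkpoint name and the exact tokenization details are assumptions based on the public release, not code from the paper):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL = "Qwen/Qwen3-Embedding-0.6B"   # assumed public checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def embed(texts, instruction=""):
    # Follow the format above: "{Instruction} {Query or Document}<|endoftext|>".
    texts = [f"{instruction} {t}<|endoftext|>" for t in texts]
    inputs = tokenizer(texts, padding=True, truncation=True,
                       add_special_tokens=False, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (B, T, H)
    # Last-token pooling: hidden state at the final non-pad position (the EOS token).
    last = inputs["attention_mask"].sum(dim=1) - 1
    emb = hidden[torch.arange(hidden.size(0)), last]
    return F.normalize(emb, p=2, dim=-1)
```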
A contrastive loss based on InfoNCE is used for training the embedder. Specifically, given a batch of $N$ training instances, the loss is defined as:

$$
\mathcal{L}_{\text{embedding}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s(q_i, d_i^{+})/\tau}}{Z_i}
$$

where $s(\cdot,\cdot)$ is the cosine similarity function, $\tau$ is the temperature, and $Z_i$ is the normalization factor, which includes the positive pair plus various negative pairs:

$$
Z_i = e^{s(q_i, d_i^{+})/\tau} + \sum_{k} m_{ik}\, e^{s(q_i, d_{i,k}^{-})/\tau} + \sum_{j \neq i} m_{ij}\, e^{s(q_i, q_j)/\tau} + \sum_{j \neq i} m_{ij}\, e^{s(d_i^{+}, d_j)/\tau}
$$
Comments on the above normalization factor $Z_i$:
- The second term is the similarity between each anchor query $q_i$ and its hard negatives $d_{i,k}^{-}$. Note that as written, only the hard negatives in the same row are used as negatives for each anchor query, although in theory all negatives in the mini-batch could be used.
- The third term is the similarity between pairs of queries $(q_i, q_j)$. The assumption is that randomly selected queries should be unrelated to each other.
- The last term is the similarity between the positive document in each row (i.e. $d_i^{+}$) and all other documents in the batch (including hard negatives).
The $m_{ik}$ and $m_{ij}$ are mask factors designed to reduce the impact of false negatives in the normalization factor $Z_i$. Specifically, given an anchor query or document and a potential negative query or document with similarity $s_{ij}$:

$$
m_{ij} =
\begin{cases}
0 & \text{if } s_{ij} > s(q_i, d_i^{+}) + 0.1 \\
1 & \text{otherwise}
\end{cases}
$$

This means that for each row, the similarity between the query $q_i$ and its positive document $d_i^{+}$ serves as a dynamic threshold for filtering out false negatives. Any term in $Z_i$ whose similarity exceeds this threshold (plus a small margin of 0.1) is treated as a false negative and masked out. This approach is reminiscent of semi-hard masking in the triplet loss or the GISTEmbed loss.
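A minimal PyTorch sketch of this masked contrastive loss under the definitions above (the batch layout, variable names, and temperature value are my assumptions for illustration, not the paper's training code; the fourth term here covers only the other in-batch positives, while the paper also includes other rows' hard negatives):

```python
import torch

def masked_infonce(q, d_pos, d_neg, tau=0.05, margin=0.1):
    """q: (B, D) queries, d_pos: (B, D) positive docs, d_neg: (B, K, D) hard negatives.
    All inputs are L2-normalized, so dot products are cosine similarities."""
    B = q.size(0)
    s_pos = (q * d_pos).sum(-1)                      # (B,)  s(q_i, d_i^+)
    thresh = (s_pos + margin).unsqueeze(1)           # dynamic false-negative threshold

    # Term 2: query vs its own hard negatives.
    s_qn = torch.einsum("bd,bkd->bk", q, d_neg)      # (B, K)
    m_qn = (s_qn <= thresh).float()

    # Term 3: query vs other queries in the batch.
    s_qq = q @ q.T                                   # (B, B)
    m_qq = (s_qq <= thresh).float()

    # Term 4: positive doc vs other in-batch positives.
    s_dd = d_pos @ d_pos.T                           # (B, B)
    m_dd = (s_dd <= thresh).float()

    off_diag = 1.0 - torch.eye(B, device=q.device)   # exclude the j == i terms
    Z = (
        torch.exp(s_pos / tau)
        + (m_qn * torch.exp(s_qn / tau)).sum(-1)
        + (off_diag * m_qq * torch.exp(s_qq / tau)).sum(-1)
        + (off_diag * m_dd * torch.exp(s_dd / tau)).sum(-1)
    )
    return -(torch.log(torch.exp(s_pos / tau) / Z)).mean()
```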
Reranker
The reranker is simpler, and training remains in the text paradigm. Specifically, the authors use the LLM chat template to incorporate the instruction and frame the reranking task as a `yes` or `no` question:
<|im_start|>system
Judge whether the Document meets the requirements based on the Query and
the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>
<|im_start|>user
<Instruct>: {Instruction}
<Query>: {Query}
<Document>: {Document}<|im_end|>
<|im_start|>assistant
<think>\n\n</think>\n\n
Instead of fitting a classifier head, no change is made to the architecture. The reranking score is computed from the relative likelihood of the next token being `yes` versus `no`:

$$
\text{score}(q, d) = \frac{P(\texttt{yes} \mid I, q, d)}{P(\texttt{yes} \mid I, q, d) + P(\texttt{no} \mid I, q, d)}
$$
The task then reduces to supervised fine-tuning, where the label is `yes` for positives and `no` for negatives. The loss is simply the negative log-probability of the correct label (`yes` or `no`) for each row.
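A minimal sketch of this scoring step with transformers (the checkpoint name, prompt construction, and token-id lookup are assumptions for illustration; the official model card may differ in details):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-Reranker-0.6B"   # assumed public checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

SYSTEM = ('Judge whether the Document meets the requirements based on the Query and '
          'the Instruct provided. Note that the answer can only be "yes" or "no".')

def rerank_score(instruction: str, query: str, document: str) -> float:
    # Build the chat-template prompt shown above, ending just before the yes/no token.
    prompt = (
        f"<|im_start|>system\n{SYSTEM}<|im_end|>\n"
        f"<|im_start|>user\n<Instruct>: {instruction}\n<Query>: {query}\n"
        f"<Document>: {document}<|im_end|>\n"
        f"<|im_start|>assistant\n<think>\n\n</think>\n\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]        # next-token logits
    yes_id = tokenizer.convert_tokens_to_ids("yes")
    no_id = tokenizer.convert_tokens_to_ids("no")
    # Normalize the probability mass between the two candidate answers.
    probs = torch.softmax(torch.stack([logits[yes_id], logits[no_id]]), dim=0)
    return probs[0].item()
```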
Multi-stage Training
Multi-stage training has emerged as a common practice for training text embedding models. The 3 stages used here are as follows:
- Stage 1: Large-scale synthetic data. The Qwen3-32B model is used to synthesize training pairs across many tasks, such as retrieval, classification, and semantic textual similarity.
  - To create diversity, a document is taken from the Qwen3 training corpus and the top-5 similar documents are retrieved
  - Qwen3 is presented with these documents and a user persona to generate a potential query
  - Qwen3 is also instructed to vary the query type, length, difficulty, and language of each query
  - 150 million query-document pairs are generated this way
- Stage 2: High-quality synthetic data
  - The 150 million pairs from Stage 1 are filtered down to 12 million high-quality pairs
  - Specifically, only query-document pairs with cosine similarity greater than 0.7 are kept
- Stage 3: Model merging
  - Model merging based on spherical linear interpolation (slerp) is used, which merges multiple model checkpoints saved during the fine-tuning process (a minimal slerp sketch appears at the end of this section)
Note that all 3 stages are used for the embedder, but Stage 1 is omitted for the reranker as it did not help. The ablation studies show that all 3 stages are crucial to the final performance of the 0.6B embedding model.
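As a minimal sketch of what slerp-based checkpoint merging looks like (the per-tensor interpolation below is the common recipe used by merging tools; the paper does not spell out its exact procedure, so treat this purely as an illustration):

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two parameter tensors, treated as flat vectors."""
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    a_n = a_flat / (a_flat.norm() + eps)
    b_n = b_flat / (b_flat.norm() + eps)
    omega = torch.arccos((a_n * b_n).sum().clamp(-1 + 1e-7, 1 - 1e-7))  # angle between checkpoints
    so = torch.sin(omega)
    if so.abs() < eps:                      # near-parallel weights: fall back to linear interpolation
        merged = (1 - t) * a_flat + t * b_flat
    else:
        merged = (torch.sin((1 - t) * omega) / so) * a_flat + (torch.sin(t * omega) / so) * b_flat
    return merged.view_as(a).to(a.dtype)

def merge_checkpoints(state_a: dict, state_b: dict, t: float = 0.5) -> dict:
    """Merge two model state dicts parameter-by-parameter with slerp."""
    return {k: slerp(state_a[k], state_b[k], t) for k in state_a}
```

Merging more than two checkpoints can be done by folding them in pairwise, which is one common convention rather than something specified in the paper.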