Hertel 2026 - LinkedIn Feed Seq Recommender
This documents LinkedIn's journey moving their feed ranker from a pointwise DCNv2 ranking system to a sequential recommender that attends over up to 1,000 past impressions.
Challenges at LinkedIn's scale:
- Sheer scale and skew: power users have many interactions while rare users have few
- Posts get most of their interactions shortly after publication, but persist in the feed for weeks
- Real-time serving
Background
The existing model is a well-tuned DCNv2 model
- Two DCNv2 towers:
- One for passive actions (click, skip, long dwell)
- One for active actions (like, comment, share)
- Each tower has a multi-task output, which I suppose is trained against multiple binary signals using binary cross-entropy loss
- The models run on CPU in a Java stack
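The speculated multi-task setup can be sketched as a per-action binary cross-entropy loss. A minimal sketch, assuming one logit per binary action signal; shapes and the averaging scheme are illustrative, not LinkedIn's:

```python
import torch
import torch.nn.functional as F

def multitask_bce(logits, labels):
    """Multi-task BCE: logits and labels are [batch, n_tasks], with one
    binary signal per task (e.g. click, like, comment). The per-task
    losses are averaged into a single scalar."""
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```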
The problem with this approach is the onerous feature engineering work needed to compute:
- numeric interaction counts
- content embeddings
- Learned ID embeddings
- History transforms
The bet is that a sequential causal transformer attending over raw interaction history can learn these patterns. The challenge is latency.
Sequential Model
The pipeline:
- Encode each historical interaction as a pair of tokens (post + action), giving an interleaved sequence of 2T positions for T post impressions. Actions include click, like, skip, etc.
- After the transformer processes the 2T sequence, all outputs at action positions are discarded
- The T outputs at post positions are combined with other candidate context features and fed through a prediction head to produce multi-task predictions
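The interleaving and post-position selection above can be sketched in plain Python. Function names are illustrative, and the real pipeline operates on embedded tokens rather than raw ids:

```python
def interleave_history(post_ids, action_ids):
    """Interleave T (post, action) pairs into a 2T token sequence:
    [post_1, action_1, post_2, action_2, ...]."""
    assert len(post_ids) == len(action_ids)
    seq = []
    for post, action in zip(post_ids, action_ids):
        seq.extend([post, action])
    return seq

def post_position_outputs(transformer_outputs):
    """Keep only the T outputs at post positions (even indices); the
    action-position outputs are discarded before the prediction head."""
    return transformer_outputs[0::2]
```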
The forward pass of the causal transformer decoder is a standard transformer block, with three design decisions that made a difference:
- Pre-layer norm. Applying layer norm before each sublayer, instead of after the RescaleAndAdd (their form of residual connection), made a big difference to training stability
- RoPE instead of absolute position embeddings. RoPE was much more stable, because absolute position embeddings see a skewed distribution of data across positions (large positions are under-trained)
- The position embeddings at large positions are unstable and cause degradation in prediction quality
- RescaleAndAdd. The normal residual connection is y = x + f(x). They did y = α ⊙ x + f(x), where α is learnable.
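The three decisions above can be sketched as a single decoder block. A hedged PyTorch reading: RoPE inside the attention is omitted for brevity, and the RescaleAndAdd formulation and its initialization are my assumptions:

```python
import torch
import torch.nn as nn

class RescaleAndAdd(nn.Module):
    """Residual connection y = alpha * x + f(x) with a learnable
    per-channel scale alpha (initialized to 1 here; an assumption)."""
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))

    def forward(self, x, fx):
        return self.alpha * x + fx

class PreLNBlock(nn.Module):
    """One decoder block with pre-layer norm: normalize before each
    sublayer, then combine via RescaleAndAdd. RoPE would be applied to
    queries/keys inside the attention; it is omitted in this sketch."""
    def __init__(self, dim, heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.res1 = RescaleAndAdd(dim)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.res2 = RescaleAndAdd(dim)

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)                      # pre-LN before attention
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = self.res1(x, a)                  # rescale-and-add residual
        x = self.res2(x, self.mlp(self.ln2(x)))  # pre-LN before MLP
        return x
```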
Late Fusion
The late fusion design keeps candidate context features out of the causal transformer, and only appends them to the transformer outputs. This means that they did not add historical item features like popularity into the 2T sequence.
Head Architecture
The prediction head was ablated across four designs:
- Linear
- MLP
- DCNv2
- Mixture of Experts
Mixture of Experts was the best design.
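A minimal sketch of a mixture-of-experts multi-task head of the kind ablated here, assuming a softmax gate over small MLP experts; the expert count, sizes, and gating scheme are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class MoEHead(nn.Module):
    """Mixture-of-experts multi-task head: a softmax gate mixes several
    small MLP experts, each emitting one logit per task. The input would
    be a post-position transformer output concatenated with candidate
    context features (the late-fusion design)."""
    def __init__(self, in_dim, hidden, n_experts, n_tasks):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_tasks))
            for _ in range(n_experts))
        self.gate = nn.Linear(in_dim, n_experts)

    def forward(self, x):
        w = torch.softmax(self.gate(x), dim=-1)              # [B, E]
        out = torch.stack([e(x) for e in self.experts], -1)  # [B, T, E]
        return (out * w.unsqueeze(1)).sum(-1)                # [B, T] logits
```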
In-Session Data Leakage
One subtle issue is that sequential models can overfit to correlated in-session behaviours: within a browsing session, if the user is in an engaged mode, click behaviours are highly correlated, and the model learns this pattern.
At serving time, in-session engagement labels are not yet available (probably due to data freshness), creating a mismatch between training and serving.
The solution is simple - randomize the order of items within each session when constructing training sequences.
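The fix can be sketched as a within-session shuffle at training-sequence construction time. The data layout (a flat list of `(session_id, item)` pairs, already grouped by session) is an assumption:

```python
import random

def shuffle_within_sessions(events, rng=random):
    """Randomize item order within each session while keeping the order
    of sessions themselves. `events` is a list of (session_id, item)
    pairs with each session's events contiguous."""
    out, bucket, cur = [], [], None
    for sid, item in events:
        if sid != cur and bucket:   # session boundary: flush shuffled bucket
            rng.shuffle(bucket)
            out.extend(bucket)
            bucket = []
        cur = sid
        bucket.append((sid, item))
    rng.shuffle(bucket)             # flush the final session
    out.extend(bucket)
    return out
```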
LLM Ranker
LinkedIn experimented with an LLM ranker where the content of each post was represented as text. The LLM ranker then predicts for each new post whether the engagement is Yes or No. This had some appealing qualities, but efficiency was terrible because the interaction history rendered as text ran to tens of thousands of tokens.
Serving System
Some design choices:
- CPU for feature fetching, tracking, and request-specific transformations; GPU for the PyTorch inference server
- "High performance gRPC interface that wraps Apache Arrow buffers in protobuf messages for zero-copy conversion to PyTorch tensors"
- Shared context batching. Since this is a ranker, all 512 candidates for a given member request share the same interaction history. Instead of sending a batch of 512, they append all candidates along with the historical context into one long sequence, then use custom attention masks so that the historical context attends to itself but candidate items do not attend to each other.
- PyTorch's SDPA falls back to naive attention with custom masks, so they built a custom CUDA kernel that extends Flash Attention
- Fused data loading. Consolidating padding, batching, and packing into a C++ data loader made the training step 50% faster by avoiding Python multiprocessing overhead.
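The shared-context masking can be sketched as a boolean mask builder. This is my reading of the description, not LinkedIn's kernel: history tokens attend causally among themselves, and each candidate attends to the full history and itself only:

```python
def shared_context_mask(t_hist, n_cand):
    """Boolean attention mask (True = may attend) over a sequence of
    t_hist history tokens followed by n_cand candidate tokens.
    History positions are causal; candidates see the whole history and
    themselves, but never other candidates."""
    n = t_hist + n_cand
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < t_hist:
                mask[i][j] = j <= i                    # causal over history
            else:
                mask[i][j] = (j < t_hist) or (j == i)  # history + self only
    return mask
```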