Hertel 2026 - LinkedIn Feed Seq Recommender
This documents LinkedIn's journey moving their feed ranker from a pointwise DCNv2 ranking system to a sequential recommender that attends over up to 1,000 past impressions.
Challenges at LinkedIn's scale:
- Sheer scale and skew: power users have many interactions while rare users have few
- Posts get most of their interactions shortly after publication, but persist in the feed for weeks
- Real-time serving
Background
The existing model is a well-tuned DCNv2 model
- Two DCNv2 towers:
- One for passive actions (click, skip, long dwell)
- One for active actions (like, comment, share)
- Each tower has a multi-task output, which I suppose is trained against multiple binary signals using binary cross-entropy loss
- The models run on CPU in a Java stack
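The speculated multi-task setup can be sketched as a per-action binary cross-entropy loss. A minimal sketch, assuming one logit per binary action signal; shapes and the averaging scheme are illustrative, not LinkedIn's:

```python
import torch
import torch.nn.functional as F

def multitask_bce(logits, labels):
    """Multi-task BCE: logits and labels are [batch, n_tasks], with one
    binary signal per task (e.g. click, like, comment). The per-task
    losses are averaged into a single scalar."""
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```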
The problem with this approach is the onerous feature engineering work needed to compute:
- numeric interaction counts
- content embeddings
- Learned ID embeddings
- History transforms
The bet is that a sequential causal transformer attending over raw interaction history can learn these patterns. The challenge is latency.
Sequential Model
The pipeline:
- Encode each historical interaction as a pair of tokens (post + action), giving an interleaved sequence of 2T positions for T post impressions. Actions include click, like, skip, etc.
- After the transformer processes the 2T sequence, all outputs at action positions are discarded
- The T outputs at post positions are combined with other candidate context features and fed through a prediction head to produce multi-task predictions
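The interleaving and post-position selection above can be sketched in plain Python. Function names are illustrative, and the real pipeline operates on embedded tokens rather than raw ids:

```python
def interleave_history(post_ids, action_ids):
    """Interleave T (post, action) pairs into a 2T token sequence:
    [post_1, action_1, post_2, action_2, ...]."""
    assert len(post_ids) == len(action_ids)
    seq = []
    for post, action in zip(post_ids, action_ids):
        seq.extend([post, action])
    return seq

def post_position_outputs(transformer_outputs):
    """Keep only the T outputs at post positions (even indices); the
    action-position outputs are discarded before the prediction head."""
    return transformer_outputs[0::2]
```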
The forward pass of the causal transformer decoder is a standard transformer block, with three design decisions that made a difference:
- Pre-layer norm. Applying layer norm before each sublayer, instead of after the RescaleAndAdd (their form of residual connection), made a big difference to training stability
- RoPE instead of absolute position embeddings. RoPE was much more stable, because absolute position embeddings see a skewed distribution of data across positions (large positions are under-trained)
- The position embeddings at large positions are unstable and cause degradation in prediction quality
- RescaleAndAdd. The normal residual connection is y = x + f(x). They did y = α ⊙ x + f(x), where α is learnable.
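The three decisions above can be sketched as a single decoder block. A hedged PyTorch reading: RoPE inside the attention is omitted for brevity, and the RescaleAndAdd formulation and its initialization are my assumptions:

```python
import torch
import torch.nn as nn

class RescaleAndAdd(nn.Module):
    """Residual connection y = alpha * x + f(x) with a learnable
    per-channel scale alpha (initialized to 1 here; an assumption)."""
    def __init__(self, dim):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))

    def forward(self, x, fx):
        return self.alpha * x + fx

class PreLNBlock(nn.Module):
    """One decoder block with pre-layer norm: normalize before each
    sublayer, then combine via RescaleAndAdd. RoPE would be applied to
    queries/keys inside the attention; it is omitted in this sketch."""
    def __init__(self, dim, heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.res1 = RescaleAndAdd(dim)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.res2 = RescaleAndAdd(dim)

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)                      # pre-LN before attention
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = self.res1(x, a)                  # rescale-and-add residual
        x = self.res2(x, self.mlp(self.ln2(x)))  # pre-LN before MLP
        return x
```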
Late Fusion
The late fusion design keeps candidate context features out of the causal transformer, and only appends them to the transformer outputs. This means that they did not add historical item features like popularity into the 2T sequence.
Head Architecture
The prediction head was ablated across four designs:
- Linear
- MLP
- DCNv2
- Mixture of Experts
Mixture of Experts was the best design.
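A minimal sketch of a mixture-of-experts multi-task head of the kind ablated here, assuming a softmax gate over small MLP experts; the expert count, sizes, and gating scheme are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class MoEHead(nn.Module):
    """Mixture-of-experts multi-task head: a softmax gate mixes several
    small MLP experts, each emitting one logit per task. The input would
    be a post-position transformer output concatenated with candidate
    context features (the late-fusion design)."""
    def __init__(self, in_dim, hidden, n_experts, n_tasks):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_tasks))
            for _ in range(n_experts))
        self.gate = nn.Linear(in_dim, n_experts)

    def forward(self, x):
        w = torch.softmax(self.gate(x), dim=-1)              # [B, E]
        out = torch.stack([e(x) for e in self.experts], -1)  # [B, T, E]
        return (out * w.unsqueeze(1)).sum(-1)                # [B, T] logits
```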
In-Session Data Leakage
One subtle issue is that sequential models can overfit to correlated in-session behaviours: within a browsing session, if the user is in an engaged mode, click behaviours are highly correlated, and the model learns this pattern.
At serving time, in-session engagement labels are not yet available (probably due to data freshness), creating a mismatch between training and serving.
The solution is simple - randomize the order of items within each session when constructing training sequences.
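The fix can be sketched as a within-session shuffle at training-sequence construction time. The data layout (a flat list of `(session_id, item)` pairs, already grouped by session) is an assumption:

```python
import random

def shuffle_within_sessions(events, rng=random):
    """Randomize item order within each session while keeping the order
    of sessions themselves. `events` is a list of (session_id, item)
    pairs with each session's events contiguous."""
    out, bucket, cur = [], [], None
    for sid, item in events:
        if sid != cur and bucket:   # session boundary: flush shuffled bucket
            rng.shuffle(bucket)
            out.extend(bucket)
            bucket = []
        cur = sid
        bucket.append((sid, item))
    rng.shuffle(bucket)             # flush the final session
    out.extend(bucket)
    return out
```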
LLM Ranker
LinkedIn experimented with an LLM ranker where the content of each post was represented as text. The LLM ranker then predicts for each new post whether the engagement is Yes or No. This had some appealing qualities, but efficiency was terrible because the interaction history rendered as text ran to tens of thousands of tokens.
Serving System
Some design choices:
- CPU for feature fetching, tracking, and request-specific transformations; GPU for the PyTorch inference server
- "High performance gRPC interface that wraps Apache Arrow buffers in protobuf messages for zero-copy conversion to PyTorch tensors"
- Shared context batching. Since this is a ranker, all 512 candidates for a given member request share the same interaction history. Instead of sending a batch of 512, they append all candidates along with the historical context into one long sequence, then use custom attention masks so that the historical context attends to itself but candidate items do not attend to each other.
- PyTorch's SDPA falls back to naive attention with custom masks, so they built a custom CUDA kernel that extends Flash Attention
- Fused data loading. Consolidating padding, batching, and packing into a C++ data loader made the training step 50% faster by avoiding Python multiprocessing overhead.
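The shared-context masking can be sketched as a boolean mask builder. This is my reading of the description, not LinkedIn's kernel: history tokens attend causally among themselves, and each candidate attends to the full history and itself only:

```python
def shared_context_mask(t_hist, n_cand):
    """Boolean attention mask (True = may attend) over a sequence of
    t_hist history tokens followed by n_cand candidate tokens.
    History positions are causal; candidates see the whole history and
    themselves, but never other candidates."""
    n = t_hist + n_cand
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < t_hist:
                mask[i][j] = j <= i                    # causal over history
            else:
                mask[i][j] = (j < t_hist) or (j == i)  # history + self only
    return mask
```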