Hamed 2025 - 360Brew
360Brew: LLM-based Personalized Ranking and Recommendation - Hamed and Maziar, LinkedIn AI
Pain points in current LinkedIn ML:
- Operational: costly, low-agility development lifecycle
- Quality: disjoint optimization across many task-specific models
- Developer experience: slow to roll out changes to models one by one
Goal: build a foundation model that captures lifetime member activity data and solves all LinkedIn matching problems
- Zero-shot capability: works well out of the box on new prediction tasks
- Measure how well the model does on new products
- In-context learning: learning from a few examples without needing to retrain
- How well does the model do on new users / items? [cold start]
- Follows instructions from developers / users
- User control via prompts
Development
Building the LLM:
- Need to convert user history into a prompt by verbalizing user information and activities
- Provide an instruction describing the problem we are solving
- At training time, use different verbalization styles
Prompt looks something like:
## Instruction
You are provided a member's profile and a set of jobs, their description, and interactions that the member had with the jobs. For each past job, the member has taken one of the following actions: applied, viewed, dismissed, or did not interact. Your task is to analyze the job interaction data along with the member's profile to predict whether the member will apply, view, or dismiss a new job referred to as the "Question" job.
Note: Focus on skills, location, and years of experience more than other criteria.
## Member Profile
Current position: software engineer, current company: LinkedIn, Location: Sunnyvale, California.
## Past job interaction data
Member has applied to the following jobs: [Age: 2 days, Title: Software Engineer, Location: New York, Country: USA, Company: Meta, Description: . . . ]
Member has viewed the following jobs: [Age: 1 week, Title: Software Engineer, Location: Texas, Country: USA, Company: AMD, Description: . . . ]
## Question 1
Will the member apply to the following job: [Age: 1 day, Title: Software Engineer, Location: Seattle, Country: USA, Company: Apple, Description: . . . ]
## Question 2
Will the member apply to the following job: [Age: 5 days, Title: RF Engineer, Location: Bay Area, Country: USA, Company: Google, Description: . . . ]
So in contrast to YouTube's semantic IDs, LinkedIn encodes past interactions in textual form.
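A minimal sketch of this verbalization step, assuming hypothetical field names and a `build_prompt` helper (the exact fields and template are LinkedIn's own):

```python
from dataclasses import dataclass

@dataclass
class Job:
    age: str          # e.g. "2 days"
    title: str
    location: str
    country: str
    company: str
    description: str

# Phrasing per action, matching the prompt example above.
ACTION_PHRASES = {"applied": "applied to", "viewed": "viewed", "dismissed": "dismissed"}

def verbalize_job(job: Job) -> str:
    return (f"[Age: {job.age}, Title: {job.title}, Location: {job.location}, "
            f"Country: {job.country}, Company: {job.company}, "
            f"Description: {job.description}]")

def build_prompt(instruction: str, profile: str,
                 history: dict[str, list[Job]], candidates: list[Job]) -> str:
    """Verbalize a member's profile, past interactions, and candidate jobs
    into one textual prompt of the shape shown above."""
    parts = [f"## Instruction\n{instruction}",
             f"## Member Profile\n{profile}",
             "## Past job interaction data"]
    for action, jobs in history.items():
        if jobs:
            parts.append(f"Member has {ACTION_PHRASES[action]} the following jobs: "
                         + " ".join(verbalize_job(j) for j in jobs))
    for i, job in enumerate(candidates, 1):
        parts.append(f"## Question {i}\nWill the member apply to the following job: "
                     + verbalize_job(job))
    return "\n".join(parts)
```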
Development pipeline:
- Start with OSS model
- Continued pre-training
- Supervised Finetuning
- Alignment
- Generate Brew-XL, a 150B-parameter model
- Distill to Brew-mini
- Prune and quantize to Brew-mini-turbo at 3B parameters
- Ablation studies show that it is critical to first go BIG, then go small
To make the development cycle smooth, build a lot of automation into the pipelines, especially the evaluation loop.
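As a concrete picture, a sketch of what such an automated eval loop could look like, assuming each task ships a labeled eval set and a stored production baseline (all names here are hypothetical):

```python
from sklearn.metrics import roc_auc_score

def run_eval(model, tasks, prod_baselines):
    """Hypothetical eval harness run after each checkpoint.
    tasks: {task_name: [(prompt, binary_label), ...]}
    prod_baselines: {task_name: production model AUC on the same set}."""
    report = {}
    for name, examples in tasks.items():
        labels = [y for _, y in examples]
        scores = [model.score(p) for p, _ in examples]  # model.score is assumed
        auc = roc_auc_score(labels, scores)
        report[name] = {"auc": auc, "lift_vs_prod": auc - prod_baselines[name]}
    return report
```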
Three levers to improve model quality:
- More (and better) data
- Prepare the data to maximize accuracy, including the distribution of different data types
- Bigger model size
- Context length
- Longer context length means deeper user activity
- Increasing context length initially improves performance, up to a point (around 20-30k tokens)
- Beyond that, the model doesn't generalize well and performance degrades (a truncation sketch follows this list)
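Given that degradation point, the user history presumably has to be truncated to a token budget. A minimal sketch, assuming a newest-first list of verbalized activities and some tokenizer (the budget value just mirrors the 20-30k figure above):

```python
def truncate_history(activities: list[str], tokenizer, budget_tokens: int = 30_000) -> list[str]:
    """Keep the most recent activities that fit within the token budget.
    `activities` is assumed to be ordered newest-first."""
    kept, used = [], 0
    for text in activities:
        n = len(tokenizer.encode(text))
        if used + n > budget_tokens:
            break
        kept.append(text)
        used += n
    return kept
```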
Tasting
Performance of the model is best for cold-start users. Measured as relative gain over the production model:
- At most 5 activities: +6%
- At most 10 activities: +4%
- At most 100 activities: +2%
Generalization to new domains: the 360Brew model can generalize to out-of-domain tasks and surfaces and beat production models on those tasks.
- Increases team agility to roll out new features without training a new model
Serving
Three levers to improve efficiency:
- Smaller model
- Distillation from the big model to a small model, done using SFT + a KD loss (a sketch follows)
- Gradual distillation is more effective than direct distillation, i.e. go from an 8B model to a 6B model to a 3B model, etc.
- Pruning is done layerwise and gradually
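A minimal sketch of that distillation objective: the SFT cross-entropy combined with a KL term against the teacher's logits. The mixing weight and temperature are illustrative assumptions, not LinkedIn's published values:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                 labels: torch.Tensor, alpha: float = 0.5,
                 temperature: float = 2.0) -> torch.Tensor:
    """student_logits/teacher_logits: (batch, seq, vocab); labels: (batch, seq)."""
    # Hard-label SFT term: standard next-token cross-entropy.
    ce = F.cross_entropy(student_logits.flatten(0, 1), labels.flatten(),
                         ignore_index=-100)
    # Soft-label KD term: KL to the teacher's softened distribution,
    # scaled by T^2 per the usual distillation convention.
    t = temperature
    s = F.log_softmax(student_logits / t, dim=-1).flatten(0, 1)
    te = F.log_softmax(teacher_logits / t, dim=-1).flatten(0, 1)
    kd = F.kl_div(s, te, log_target=True, reduction="batchmean") * (t * t)
    return alpha * ce + (1.0 - alpha) * kd
```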
- Quantization: mixed precision
- FP8 for all weights
- FP32 for the language model head and logit processor; important for recommendations, otherwise predictions collapse (a sketch follows)
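A sketch of the mixed-precision idea, using simulated per-tensor FP8 (E4M3) quantization in PyTorch; the per-tensor scheme and module names are assumptions for illustration:

```python
import torch

FP8_MAX = 448.0  # max magnitude representable in float8_e4m3fn

def quantize_dequantize_fp8(w: torch.Tensor) -> torch.Tensor:
    """Simulated per-tensor FP8 quantization (quantize then dequantize);
    real serving would keep FP8 storage and use FP8 GEMM kernels."""
    scale = w.abs().max().clamp(min=1e-12) / FP8_MAX
    q = (w / scale).to(torch.float8_e4m3fn)
    return q.to(w.dtype) * scale

def quantize_model(model: torch.nn.Module) -> torch.nn.Module:
    for name, module in model.named_modules():
        # Keep the LM head (and the logit path after it) in full precision;
        # per the talk, quantizing it makes recommendation scores collapse.
        if isinstance(module, torch.nn.Linear) and "lm_head" not in name:
            module.weight.data = quantize_dequantize_fp8(module.weight.data)
    return model
```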
- Sparsification
- Star attention (reduces the quadratic attention cost)
- Not every item needs to attend to every other item
- When scoring, multiple items can be scored at the same time (sounds like ~500)
- Need to make sure these items do not attend to each other
Q&A:
- They use 50-60 out-of-domain tasks to measure the effectiveness of the model in the eval loop.
- Designed custom vLLM kernels to allow multi-item scoring by modifying the attention mask
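A sketch of the masking idea behind that multi-item scoring: every candidate item attends (causally) to the shared member-history prefix but never to the other items. This is an illustrative mask construction, not LinkedIn's vLLM kernel:

```python
import torch

def multi_item_mask(prefix_len: int, item_lens: list[int]) -> torch.Tensor:
    """Boolean attention mask (True = may attend) for one sequence laid out as
    [shared prefix | item 1 | item 2 | ...]. Tokens attend causally within the
    prefix and within their own item, but cross-item attention is blocked."""
    total = prefix_len + sum(item_lens)
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))  # causal base
    spans, start = [], prefix_len
    for n in item_lens:
        spans.append((start, start + n))
        start += n
    for i, (s1, e1) in enumerate(spans):
        for j, (s2, e2) in enumerate(spans):
            if i != j:
                mask[s1:e1, s2:e2] = False  # block attention across items
    return mask
```

With on the order of 500 candidates per member, this layout amortizes the cost of encoding the long shared history across all items in a single forward pass.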