Huang 2026 - Semantic Tube Predictions
This paper is a follow-up to LLM-JEPA, exploring another way to use representation regularization to improve LLM training. It appears to be considerably more successful.
The main idea of the paper is to add a loss term to the standard next-token prediction (NTP) cross-entropy loss. The intuition behind the term is that the trajectory of an LLM's final-layer hidden representation across time steps (i.e., as it moves through the token sequence) should be locally linear. They find that adding this constraint in the form of a loss term makes LLM training 16x more data-efficient.
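The paper's exact loss is not reproduced here, but one natural way to penalize deviation from local linearity is a second-difference penalty on the hidden-state trajectory: a sequence of points lies on a straight line exactly when its discrete second differences vanish. This is a minimal sketch under that assumption, with `local_linearity_loss` and its arguments as hypothetical names (not the paper's API), written in numpy for clarity rather than as a differentiable training loss:

```python
import numpy as np

def local_linearity_loss(hidden):
    """Penalize curvature of a hidden-state trajectory across time steps.

    hidden: array of shape (T, d), e.g. final-layer hidden states for
    one sequence of T tokens. Returns the mean squared second
    difference, which is zero iff the trajectory is exactly linear in t.
    """
    # h[t+1] - 2*h[t] + h[t-1] vanishes when three consecutive
    # points are collinear and evenly spaced.
    second_diff = hidden[2:] - 2.0 * hidden[1:-1] + hidden[:-2]
    return float(np.mean(np.sum(second_diff ** 2, axis=-1)))

# A perfectly linear trajectory incurs zero penalty.
t = np.arange(8, dtype=np.float64)[:, None]   # (8, 1) time steps
direction = np.array([[1.0, -2.0, 0.5]])      # (1, 3) fixed direction
linear_traj = t * direction                   # straight line in R^3
print(local_linearity_loss(linear_traj))      # → 0.0
```

In training, a term like this would be added to the NTP objective as `total_loss = ce_loss + lam * local_linearity_loss(hidden)`, with the weight `lam` a hyperparameter; how the paper actually weights or formulates its regularizer is not specified in these notes.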