Chlon 2025 - LLMs are Bayesian in Expectation

LLMs are Bayesian, in Expectation, not in Realization

This paper uses an information-theoretic lens to analyze how positional encodings affect the permutation (exchangeability) properties of LLMs, and proposes an algorithm for deriving the optimal chain-of-thought length.

Summary

In-context learning (ICL) allows LLMs to adapt to new tasks from only a few examples at inference time. One theoretical framework for interpreting ICL is Bayesian inference (see Xie 2022 - ICL as Implicit Bayesian Inference): transformers implicitly perform posterior updates over a latent concept variable, with the pretraining distribution encoding a prior over possible tasks.
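As a concrete reference for what "Bayesian" means here, the sketch below computes the Beta-Bernoulli posterior predictive, the standard exchangeable baseline that the martingale discussion below compares against. The Beta(1, 1) prior and the helper name are illustrative choices, not taken from the paper.

```python
import numpy as np

def posterior_predictive(observations, alpha=1.0, beta=1.0):
    """P(x_{n+1} = 1 | x_1..x_n) under a Beta(alpha, beta) prior on the Bernoulli rate.

    The answer depends only on the count of ones (the sufficient statistic),
    so it is invariant to the order in which the observations arrive.
    """
    n = len(observations)
    k = int(np.sum(observations))  # number of ones observed so far
    return (alpha + k) / (alpha + beta + n)

x = np.array([1, 0, 1, 1, 0])
print(posterior_predictive(x))                          # 0.571...
print(posterior_predictive(np.random.permutation(x)))   # identical: order carries no information
```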

This perspective has been challenged by Falck 2024 - Is ICL in LLMs Bayesian?, which demonstrates empirically that transformer-based language models systematically violate the martingale property. Specifically, for exchangeable data, where the order of observations carries no information, Bayesian posterior predictive distributions must satisfy

$$\mathbb{E}\left[f(x_{n+1}) \mid x_1, \dots, x_n\right] = \mathbb{E}\left[f(x_{n+1}) \mid x_{\sigma(1)}, \dots, x_{\sigma(n)}\right]$$

for any permutation $\sigma \in S_n$ and bounded function $f$. Their experiments show that LLMs such as GPT-3.5 consistently violate this invariance under permutation of the in-context examples.
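A minimal sketch of the kind of invariance check this implies, with a toy recency-weighted predictor standing in for a position-sensitive transformer; the real experiments query models such as GPT-3.5, so both functions here are purely illustrative.

```python
import itertools
import numpy as np

def bayes_predict(seq):
    """Beta(1, 1) posterior predictive: depends only on the count of ones, hence exchangeable."""
    return (1 + sum(seq)) / (2 + len(seq))

def positional_predict(seq, decay=0.8):
    """Toy position-sensitive predictor (stand-in for a transformer with positional
    encodings): a recency-weighted average, so permuting the inputs changes the output."""
    weights = np.array([decay ** (len(seq) - 1 - i) for i in range(len(seq))])
    return float(weights @ np.array(seq) / weights.sum())

seq = [1, 0, 1, 1, 0]
for predict in (bayes_predict, positional_predict):
    preds = [predict(list(p)) for p in itertools.permutations(seq)]
    print(predict.__name__, "spread over orderings:", max(preds) - min(preds))
# bayes_predict has zero spread; positional_predict does not, i.e. it violates
# the permutation invariance that an exact Bayesian predictor must satisfy.
```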

This paper observes that while Bayesian inference assumes exchangeable data, positional encodings fundamentally break this symmetry. The observation is formalized using two complexity measures (a compression-based illustration follows the list below):

  • $K(x_{1:n})$: the Kolmogorov complexity of a sequence, which is permutation invariant for exchangeable data
  • $K(x_{1:n} \mid \sigma)$: the conditional complexity of the sequence given a specific ordering
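Kolmogorov complexity is uncomputable, so any hands-on illustration needs a proxy. The sketch below (my illustration, not the paper's methodology) uses zlib-compressed length: for an exchangeable Bernoulli sequence, a canonical order-free description of the content compresses far better than one particular ordering, and the gap is a crude stand-in for the information carried by the ordering.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=20_000)  # exchangeable Bernoulli(0.5) draws

def compressed_len(arr):
    """Compressed byte length as a rough, computable stand-in for Kolmogorov complexity."""
    return len(zlib.compress(arr.astype(np.uint8).tobytes(), level=9))

ordered = compressed_len(bits)             # cost of one specific ordering of the data
canonical = compressed_len(np.sort(bits))  # cost of the multiset only: all zeros, then all ones

print("specific ordering :", ordered, "bytes")
print("canonical multiset:", canonical, "bytes")
# The gap approximates the extra description length attributable to the ordering itself.
```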

It then shows that transformers with positional encodings minimize an expected code length over orderings, which decomposes (up to logarithmic terms) as:

$$\mathbb{E}_{\sigma \sim \mathrm{Unif}(S_n)}\left[-\log p_\theta\left(x_{\sigma(1)}, \dots, x_{\sigma(n)}\right)\right] = K(x_{1:n}) + I(x_{1:n}; \sigma) + O(\log n)$$

Where:

  • $\mathrm{Unif}(S_n)$ denotes the uniform distribution over permutations consistent with the sufficient statistics of the data
    • For i.i.d. data, this is simply the uniform distribution over all permutations
  • $I(x_{1:n}; \sigma)$ is the mutual information between sequences and their orderings

Note that this is a well-known result (the Kolmogorov-complexity analogue of Shannon's information identity) applied to this setting.
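The Kolmogorov quantities are uncomputable, but the Shannon analogue of this decomposition is easy to verify numerically: for i.i.d. Bernoulli bits, the entropy of the ordered sequence splits exactly into the entropy of the sufficient statistic (the content) plus the expected number of bits needed to pin down the ordering given that content. A minimal numerical check (mine, not code from the paper):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n, p = 12, 0.3
# Distribution of the sufficient statistic T = number of ones ~ Binomial(n, p).
pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

H_sequence  = n * h2(p)                                                       # H(X_1..X_n), i.i.d. bits
H_statistic = -sum(q * math.log2(q) for q in pmf if q > 0)                    # H(T)
H_ordering  = sum(q * math.log2(math.comb(n, k)) for k, q in enumerate(pmf))  # E[log2 C(n, T)]

print(H_sequence, H_statistic + H_ordering)  # equal: H(X) = H(T) + H(X | T)
```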

Notations

  • $x_{1:n} = (x_1, \dots, x_n)$ denotes a sequence of observations
  • $T(x_{1:n}) = \sum_{i=1}^{n} x_i$ is the sufficient statistic for Bernoulli sequences
  • $\sigma \in S_n$ represents a permutation of $n$ elements
  • $p_\theta$ denotes a transformer with parameters $\theta$
  • $K(x)$ is the Kolmogorov complexity of sequence $x$
  • $h(p)$ is the binary entropy function
  • $k$ denotes the number of chain-of-thought tokens
  • $\varepsilon$ denotes the target error tolerance