Weinberger 2009 - Feature Hashing for Large Scale Multitask Learning

This paper proposes a method to represent a large feature set in a smaller space by using hashing. It shows analytically that with a sufficiently large hash dimension $m$:

  • The inner product between instances is preserved, i.e. taking a dot product between instances in the hashed space approximates the true dot product in the original space
  • The same applies to learning a weight vector to generate predictions in the hashed space: the approximation error goes to zero as $m$ increases

Setup

Consider having data points $x \in \mathbb{R}^d$, where $d$ can be very large (e.g. millions). This setting is easily realized when we use, for example, word bi-grams and tri-grams as term-frequency features to perform some kind of text classification task. Such a large feature vector is unwieldy, and also inefficient since the feature vector is very sparse for a given text.
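As a rough illustration (mine, not the paper's), here is how uni-gram and bi-gram term-frequency features might be extracted with a naive whitespace tokenizer; every distinct n-gram becomes its own dimension, so the dimension explodes across a corpus while each individual vector stays sparse:

```python
from collections import Counter

def bigram_counts(text):
    """Naive uni-gram + bi-gram term-frequency features."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

# Across a large corpus the number of distinct uni-/bi-grams easily reaches
# millions, yet any single text only touches a handful of dimensions.
print(bigram_counts("free money free money now"))
# Counter({'free': 2, 'money': 2, 'free money': 2, 'now': 1, 'money free': 1, 'money now': 1})
```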

The hashing trick maps the high dimensional input vector $x \in \mathbb{R}^d$ to a lower dimensional feature space $\mathbb{R}^m$ with the notation $\phi: \mathbb{R}^d \to \mathbb{R}^m$, such that $m \ll d$.

We start with the following definitions:

  • Let $h : \mathbb{N} \to \{1, \dots, m\}$ be a hash function
  • Let $\xi : \mathbb{N} \to \{-1, +1\}$ be a hash function

Note that while the definitions map from an input integer, we may apply them to texts as well, since any finite-length text may be assigned to a unique integer. This is typically done in practice by applying some hash algorithm to a given string, and then using the modulo function to restrict it to the desired range.
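A minimal sketch of this (my own, not from the paper): derive both $h$ and $\xi$ from a single string hash by salting the input differently, then reduce with modulo. Positions are 0-indexed here for convenience:

```python
import hashlib

def _stable_hash(s: str) -> int:
    """Deterministic hash of a string to a large integer."""
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

def h(term: str, m: int) -> int:
    """Position hash: maps a term to a bucket in {0, ..., m-1}."""
    return _stable_hash("pos_" + term) % m

def xi(term: str) -> int:
    """Sign hash: maps a term to {-1, +1}."""
    return 1 if _stable_hash("sign_" + term) % 2 == 0 else -1
```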

With this, and given two vectors $x, x' \in \mathbb{R}^d$, we define the hash feature map:

$$\phi_i^{(h, \xi)}(x) = \sum_{j \,:\, h(j) = i} \xi(j)\, x_j$$

Where $i \in \{1, \dots, m\}$ is an index in the hashed dimension space, and $j \in \{1, \dots, d\}$ is an index in the input dimension space. We get a hash collision if more than one term $j$ is hashed into a given position $i$. For brevity, we may just write $\phi(x)$ instead of $\phi^{(h, \xi)}(x)$.
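Putting this together, a sketch of the feature map $\phi$ and the hashed inner product, reusing the hypothetical h and xi from the previous sketch and representing a text as a dict of term frequencies:

```python
import numpy as np

def phi(term_counts: dict, m: int) -> np.ndarray:
    """Hashed feature map: phi_i(x) = sum of xi(j) * x_j over terms j with h(j) = i."""
    v = np.zeros(m)
    for term, count in term_counts.items():
        v[h(term, m)] += xi(term) * count  # collisions simply accumulate into the same bucket
    return v

def hashed_inner_product(x_counts: dict, xp_counts: dict, m: int) -> float:
    """<x, x'>_phi = <phi(x), phi(x')>, an approximation of the true inner product."""
    return float(np.dot(phi(x_counts, m), phi(xp_counts, m)))
```

The sign hash $\xi$ is what makes colliding terms as likely to cancel as to reinforce, which is the key to the unbiasedness result below.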

Analysis

With this setup, the paper aims to prove analytically that hashing in this manner preserves the characteristics of the original space. In other words, we can significantly reduce the dimension of our features but achieve the same predictive effect as the original space by doing the hashing trick. This also means that the detrimental effect of hash collisions is minimal with a sufficiently large $m$.

We won't trace through all the results, just the important and simple ones.

Lemma 2 The hash kernel $\langle x, x' \rangle_\phi := \langle \phi(x), \phi(x') \rangle$ is unbiased, i.e. $E_\phi\big[\langle x, x' \rangle_\phi\big] = \langle x, x' \rangle$.

Proof. The proof simply starts by expanding the inner product in the hashed space as follows:

$$\langle x, x' \rangle_\phi = \sum_{k=1}^{m} \phi_k(x)\, \phi_k(x') = \sum_{i=1}^{d} \sum_{j=1}^{d} \xi(i)\, \xi(j)\, x_i\, x'_j\, \delta_{ij}$$

Where $\delta_{ij}$ is an indicator variable which takes value $1$ if $h(i) = h(j)$ (i.e. they are hashed to the same position) and $0$ otherwise.

To see that this expansion is true, consider a position in the hashed space, e.g. position $k$. The value contributed at position $k$ looks something like the following:

$$\phi_k(x)\, \phi_k(x') = \Big( \sum_{i \,:\, h(i) = k} \xi(i)\, x_i \Big) \Big( \sum_{j \,:\, h(j) = k} \xi(j)\, x'_j \Big)$$

We just need to sum over all positions $k$, move the summations over $i$ and $j$ to the outside, and use the variable $\delta_{ij}$ to flag the pairs that share a hash position and hence interact (if $i$ and $j$ are hashed to different positions, they clearly do not interact in an inner product).

Now note that we can decompose the expectation over $\phi$ into its independent constituents, i.e. $h$ and $\xi$ respectively (since the two hashes are independent):

$$E_\phi\big[ \langle x, x' \rangle_\phi \big] = E_h E_\xi \Big[ \sum_{i,j} \xi(i)\, \xi(j)\, x_i\, x'_j\, \delta_{ij} \Big]$$

Now we just need to observe that the hashed values $\xi(i)$ and $\xi(j)$ are independent from all other terms in general, but also independent from each other whenever $i \neq j$ (provided our hash function $\xi$ is pairwise independent). Thus when $i \neq j$, the summand is:

$$E_h\big[ x_i\, x'_j\, \delta_{ij} \big]\; E_\xi\big[ \xi(i) \big]\; E_\xi\big[ \xi(j) \big]$$

These terms are clearly $0$ because $E_\xi[\xi(i)] = E_\xi[\xi(j)] = 0$. So the original summation reduces to the $i = j$ terms, where $\xi(i)^2 = 1$ and $\delta_{ii} = 1$:

$$E_\phi\big[ \langle x, x' \rangle_\phi \big] = \sum_{i=1}^{d} x_i\, x'_i = \langle x, x' \rangle \qquad \blacksquare$$

Not only is the hashed inner product unbiased, it also has a variance that scales down as $O(1/m)$. The proof does a similar but more tedious expansion than the one above, and assumes that $x, x'$ have $\ell_2$-norm of $1$. This suggests that the hashed inner product will be concentrated within $O(1/\sqrt{m})$ of the true value.
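A quick Monte Carlo check of both claims (illustration only, not the paper's experiment): it simulates fresh draws of $(h, \xi)$ as fully random functions via different seeds, and shows that the mean error stays near zero while the standard deviation tracks $1/\sqrt{m}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
x = rng.normal(size=d); x /= np.linalg.norm(x)     # unit-norm vectors, matching
xp = rng.normal(size=d); xp /= np.linalg.norm(xp)  # the assumption in the variance bound
true_ip = x @ xp

def hashed_ip(x, xp, m, seed):
    """Hashed inner product under one random draw of (h, xi)."""
    r = np.random.default_rng(seed)
    pos = r.integers(0, m, size=len(x))        # position hash h(j)
    sign = r.choice([-1.0, 1.0], size=len(x))  # sign hash xi(j)
    phi_x = np.zeros(m); np.add.at(phi_x, pos, sign * x)
    phi_xp = np.zeros(m); np.add.at(phi_xp, pos, sign * xp)
    return phi_x @ phi_xp

for m in [64, 256, 1024, 4096]:
    est = np.array([hashed_ip(x, xp, m, seed) for seed in range(500)])
    print(f"m={m:5d}  mean err={est.mean() - true_ip:+.4f}  "
          f"std={est.std():.4f}  1/sqrt(m)={1 / np.sqrt(m):.4f}")
```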

These results are sufficient to justify use of the hashed inner product space in practice. That is, we can perform recommendations in the hashed space with a sufficiently large $m$ (which we can tune using validation error) to make the large feature space tractable. The paper goes on to prove more detailed bounds on the approximation error and the norm, which are of less practical significance.
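As a practical aside (not from the paper): scikit-learn's HashingVectorizer implements the same trick (position hash plus alternating sign), with $m$ exposed as its n_features parameter, so the tuning loop can look roughly like the sketch below, assuming texts, labels, val_texts, and val_labels already exist:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Sweep the hashed dimension m and pick the smallest value where validation
# accuracy saturates; texts/labels and val_texts/val_labels are placeholders.
for m in [2**14, 2**16, 2**18, 2**20]:
    vec = HashingVectorizer(n_features=m, ngram_range=(1, 2), alternate_sign=True)
    clf = LogisticRegression(max_iter=1000).fit(vec.transform(texts), labels)
    print(m, clf.score(vec.transform(val_texts), val_labels))
```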

Multi-task Learning

The authors argue that this method is especially useful in the multi-task learning setting. Consider an email spam classification task where the vocab space is $V$ and the user space is $U$. The parameter space is thus of size $|V| \times |U|$, i.e. we wish to learn a user-specific weight vector $w_u$ for each user $u \in U$, which allows us to personalize the spam filter for each user (different users have slightly differing definitions of what is spam).

The authors suggest the following approach:

  • Use the hashing trick to hash each term into the hashed space, e.g. the term data is passed into a global hash function $h_0$ and assigned to position $h_0(\text{data})$
  • Each user gets his/her own hash function $h_u$. This may be implemented by using the same hash function but prefixing the user_id to the term like so: user1_data, which hashes the same term into a new position.
  • We may thus represent each instance $x$ by $\phi_0(x) + \phi_u(x)$, where $\phi_0$ and $\phi_u$ are the feature maps built from $h_0$ and $h_u$ respectively, capturing both a global element (some terms are universally spam-indicative) and a personalized element (some terms are specifically indicative for a user)
  • Finally, we learn a weight parameter $w \in \mathbb{R}^m$ by training it in the hashed space (a sketch of this setup follows below)
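A sketch of this multi-task representation (my own naming), reusing the hypothetical phi from the earlier sketches and implementing the per-user hash by prefixing the user id to each term:

```python
def phi_user(term_counts: dict, user_id: str, m: int) -> np.ndarray:
    """Personalized feature map phi_u: same hashing trick, but each term is
    prefixed with the user id (e.g. 'user1_data'), so it lands in a
    (generally) different bucket for each user."""
    prefixed = {f"{user_id}_{term}": count for term, count in term_counts.items()}
    return phi(prefixed, m)

def multitask_features(term_counts: dict, user_id: str, m: int) -> np.ndarray:
    """Combined representation phi_0(x) + phi_u(x): the global and personalized
    components share one m-dimensional hashed space, and a single weight
    vector w in R^m is trained on top of it."""
    return phi(term_counts, m) + phi_user(term_counts, user_id, m)
```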

Empirically, for their spam filtering task, they found that performance starts to saturate at around $m \approx 4$ million hashed dimensions. This is a very small fraction of the total space $|V| \times |U|$, showing the effectiveness of their method. Nevertheless, we should note that 4 million is still a rather large space.