Wang 2020 - DCNv2
DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems
This paper proposes a neural network architecture, the Deep & Cross Network (DCN), that learns feature crosses more effectively than a standard feedforward MLP.
Background
Effective feature crossing is essential in many ML tasks, especially for search and recommendation. For example, a country feature crossed with a language feature is more informative than either alone. Manually searching for good feature crosses is a labour-intensive combinatorial exercise.
Deep neural networks in the form of MLPs are generally viewed as universal function approximators in the limiting case. At finite width and depth, however, they are often unable to model feature crosses effectively (the paper demonstrates this on a simulated dataset).
Traditionally, Factorization Machines (FMs) address the feature-combination issue by embedding each high-cardinality sparse feature into a small dense vector; a cross between two features is then computed as the dot product of their embeddings. However, this limits expressiveness to order-2 crosses, and the number of crosses can still be large when there are many features. The other limitation of FMs is that every feature must share the same embedding dimension, which is restrictive when features have very different cardinality needs.
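For reference, the standard order-2 FM prediction (not shown in the paper itself, but the usual formulation) has the form:

$$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j$$

where each feature $i$ has an embedding $\mathbf{v}_i$ of a shared dimension $k$; the dot product $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ is the order-2 cross described above.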
The traditional remedy is to mix implicit crossing (i.e. fitting a DNN to the features) with explicit crossing (e.g. the FM approach, or multiplying raw features together). The implicit and explicit components usually run in parallel, and their outputs are combined to form the final prediction.
DCNv1
The same authors proposed DCNv1 in 2017, and it is useful to see how it evolved. The method is as follows:
- Let $\mathbf{x}_0$ represent the concatenated feature vector at layer 0. That is, we lay out each dense feature and each sparse-feature embedding side by side and concatenate them into one long vector. Let $\mathbf{x}_0 \in \mathbb{R}^d$.
- We have a cross network and a deep network running in parallel
- The deep side is simply a standard MLP, i.e. $\mathbf{h}_{l+1} = f(W_l^{\text{deep}} \mathbf{h}_l + \mathbf{b}_l^{\text{deep}})$ with $\mathbf{h}_0 = \mathbf{x}_0$ and $f$ a non-linearity such as ReLU
- The cross side is where the magic happens:
$$\mathbf{x}_{l+1} = \mathbf{x}_0 \mathbf{x}_l^\top \mathbf{w}_l + \mathbf{b}_l + \mathbf{x}_l$$
- Note that $\mathbf{w}_l, \mathbf{b}_l \in \mathbb{R}^d$ are learnable parameters
- $\mathbf{x}_0 \mathbf{x}_l^\top$ is a matrix of rank 1. At layer 1, it comprises all the pairwise crosses, e.g. $x_{0,i}\,x_{0,j}$. Hence the transformation at each cross layer is rank 1.
- As we increase the number of cross layers, we get feature crosses of increasing polynomial degree. After $l$ cross layers we end up with a $d$-dimensional feature vector comprising complex weighted polynomial feature crosses of degrees up to $l+1$.
- At the final layer, we have $\mathbf{x}_{L_1}$ from the cross network and $\mathbf{h}_{L_2}$ from the deep network. The two are concatenated into a vector $[\mathbf{x}_{L_1}; \mathbf{h}_{L_2}]$, which can then be fed to a classifier head for the final prediction (see the sketch below).
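Below is a minimal NumPy sketch of this DCNv1-style forward pass. All names, dimensions, and random weights are illustrative stand-ins, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_cross, deep_dims = 8, 3, [16, 16]

# x0: the concatenated dense features and sparse-feature embeddings.
x0 = rng.normal(size=d)

# Cross network: x_{l+1} = x0 (x_l^T w_l) + b_l + x_l (a rank-1 update per layer).
x = x0.copy()
for _ in range(n_cross):
    w, b = rng.normal(size=d), rng.normal(size=d)
    x = x0 * (x @ w) + b + x  # x0 x_l^T w_l is just x0 scaled by the scalar x_l^T w_l

# Deep network: a plain MLP on the same input.
h = x0.copy()
for out_dim in deep_dims:
    W, b = rng.normal(size=(out_dim, h.shape[0])), rng.normal(size=out_dim)
    h = np.maximum(W @ h + b, 0.0)  # ReLU

# Parallel combination: concatenate both outputs and apply a logistic head.
z = np.concatenate([x, h])
w_out = rng.normal(size=z.shape[0])
p = 1.0 / (1.0 + np.exp(-z @ w_out))
print(p)
```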
DCNv2
The criticism of DCNv1 is that the transformation at each cross layer is rank 1 and hence not expressive enough. DCNv2 tries to make the cross layer more expressive while still keeping it parameter-efficient.
The cross layer formulation of DCNv2 is:
$$\mathbf{x}_{l+1} = \mathbf{x}_0 \odot (W_l \mathbf{x}_l + \mathbf{b}_l) + \mathbf{x}_l$$
Where $W_l \in \mathbb{R}^{d \times d}$ and $\mathbf{b}_l \in \mathbb{R}^d$ are learnable parameters and $\odot$ denotes the element-wise (Hadamard) product.
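A one-layer NumPy sketch of this update (the weights here are random stand-ins for the learned parameters, and the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
x0 = rng.normal(size=d)   # layer-0 features
x_l = rng.normal(size=d)  # output of the previous cross layer

W_l = rng.normal(size=(d, d))  # full d x d learnable matrix (vs. rank-1 in DCNv1)
b_l = rng.normal(size=d)

# Element-wise product with x0, plus the residual connection.
x_next = x0 * (W_l @ x_l + b_l) + x_l
print(x_next.shape)  # (8,)
```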
To see how it compares to DCNv1, we can let $W_l$ be a rank-1 matrix and write it as $W_l = \mathbf{u}_l \mathbf{v}_l^\top$ with $\mathbf{u}_l, \mathbf{v}_l \in \mathbb{R}^d$. Furthermore, we set $\mathbf{u}_l = \mathbf{1}$ (the all-ones vector) and $\mathbf{b}_l = \mathbf{0}$. Then we have:
$$\begin{aligned}
\mathbf{x}_{l+1} &= \mathbf{x}_0 \odot (\mathbf{u}_l \mathbf{v}_l^\top \mathbf{x}_l) + \mathbf{x}_l \\
&= (\mathbf{v}_l^\top \mathbf{x}_l)\,(\mathbf{x}_0 \odot \mathbf{u}_l) + \mathbf{x}_l \\
&= (\mathbf{v}_l^\top \mathbf{x}_l)\,\mathbf{x}_0 + \mathbf{x}_l
\end{aligned}$$
Note that in line 2, we use the fact that $\mathbf{v}_l^\top \mathbf{x}_l$ is a scalar and move it out to the left. In line 3, since $\mathbf{u}_l = \mathbf{1}$, the element-wise product $\mathbf{x}_0 \odot \mathbf{u}_l$ is just $\mathbf{x}_0$, so we can remove it.
Similarly for DCNv1, we can pull out the scalar:
$$\mathbf{x}_{l+1} = \mathbf{x}_0 \mathbf{x}_l^\top \mathbf{w}_l + \mathbf{b}_l + \mathbf{x}_l = (\mathbf{x}_l^\top \mathbf{w}_l)\,\mathbf{x}_0 + \mathbf{b}_l + \mathbf{x}_l$$
We thus see that DCNv2 ends up in exactly the same form as DCNv1 (with just the bias term missing, since we set it to zero).
This reformulation helps us see that DCNv1 is essentially DCNv2 when $W_l$ is rank 1. Hence when we allow $W_l$ to be of higher rank, we get more expressiveness than DCNv1.
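A quick numerical check of this reduction (illustrative values, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
x0, x_l, v = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

u = np.ones(d)
W = np.outer(u, v)  # rank-1 weight matrix W_l = u v^T with u = 1

dcnv2 = x0 * (W @ x_l) + x_l   # DCNv2 cross layer with b_l = 0
dcnv1 = x0 * (x_l @ v) + x_l   # DCNv1 cross layer with w_l = v and b_l = 0

print(np.allclose(dcnv2, dcnv1))  # True
```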
Stack vs Parallel
In addition to the parallel structure proposed in DCNv1, where a deep MLP runs in parallel to the cross network and the final vectors are concatenated, DCNv2 proposes an alternative stacked architecture. In this formulation, we run through the cross layers first to get $\mathbf{x}_{L_1}$, which is then fed into the deep MLP. The paper reports that which architecture performs better depends on the task.
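The difference between the two layouts can be summarised in a couple of lines. Here `cross_net` and `deep_net` are hypothetical functions implementing the components sketched earlier:

```python
import numpy as np

def combine_parallel(x0, cross_net, deep_net):
    # Parallel (DCNv1-style): run both towers on x0 and concatenate the outputs.
    return np.concatenate([cross_net(x0), deep_net(x0)])

def combine_stacked(x0, cross_net, deep_net):
    # Stacked (DCNv2 alternative): feed the cross-network output into the deep MLP.
    return deep_net(cross_net(x0))
```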
Loss
Finally, the loss is computed as standard binary cross entropy with respect to the binary labels $y_i \in \{0, 1\}$ and predicted probabilities $\hat{y}_i$:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$
Modifications
Ranking models in production settings usually face strict latency requirements, so it is important to reduce serving cost while maintaining accuracy. The paper thus proposes three modifications to make the model more efficient.
Modification 1: Low Rank Approximation
In practice, the learned weight matrix $W_l$ is usually effectively low rank, so it is well motivated to approximate it with two smaller matrices $U_l, V_l \in \mathbb{R}^{d \times r}$, where $r \ll d$. So we have:
$$\mathbf{x}_{l+1} = \mathbf{x}_0 \odot \left( U_l (V_l^\top \mathbf{x}_l) + \mathbf{b}_l \right) + \mathbf{x}_l$$
For the experimental setting in the paper, they report a low-rank threshold after which increasing the rank gave diminishing returns.
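A NumPy sketch of the low-rank cross layer (sizes are illustrative); the final line compares parameter counts against the full-rank version:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 16  # feature dimension and (much smaller) rank
x0, x_l = rng.normal(size=d), rng.normal(size=d)

U, V = rng.normal(size=(d, r)), rng.normal(size=(d, r))
b = rng.normal(size=d)

# Project x_l down to r dimensions, project back up, then cross with x0 as before.
x_next = x0 * (U @ (V.T @ x_l) + b) + x_l

# Parameters per layer: 2*d*r + d (low rank) vs d*d + d (full rank).
print(2 * d * r + d, d * d + d)  # 2112 vs 4160
```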
Modification 2: Mixture of Experts
Instead of a single low-rank weight (a single expert) per cross layer, they propose having multiple experts and combining their outputs using a gating mechanism, analogous to the multiple heads in multi-head attention. The idea is that each expert can learn effective feature crosses in its own subspace, and the input-dependent gating mechanism can then select the appropriate experts for a given input.
We have:
$$\mathbf{x}_{l+1} = \sum_{i=1}^{K} G_i(\mathbf{x}_l)\, E_i(\mathbf{x}_l) + \mathbf{x}_l$$
Where:
- $G_i(\mathbf{x}_l)$ is the gating function, which represents the input-dependent weight of expert $i$. It can be a learned softmax over the $K$ experts.
- $E_i(\mathbf{x}_l) = \mathbf{x}_0 \odot \left( U_l^i (V_l^{i\top} \mathbf{x}_l) + \mathbf{b}_l \right)$ is expert $i$. It is simply the earlier low-rank equation but with separate weights for each expert $i$.
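A NumPy sketch of the mixture-of-experts cross layer (the number of experts, the rank, and the linear-softmax gate parameterisation are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, K = 32, 8, 4  # feature dim, low rank, number of experts
x0, x_l = rng.normal(size=d), rng.normal(size=d)

U = rng.normal(size=(K, d, r))
V = rng.normal(size=(K, d, r))
b = rng.normal(size=(K, d))
W_gate = rng.normal(size=(K, d))  # gating weights: one logit per expert

# Input-dependent gate: softmax over the K expert logits.
logits = W_gate @ x_l
gate = np.exp(logits - logits.max())
gate /= gate.sum()

# Each expert is the low-rank cross from before, with its own weights.
experts = np.stack([x0 * (U[i] @ (V[i].T @ x_l) + b[i]) for i in range(K)])
x_next = (gate[:, None] * experts).sum(axis=0) + x_l
```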
Modification 3: Pre Activation Functions
With the low-rank approximation, we effectively project the features to a low dimension and then project them back up. Instead of immediately projecting back from dimension $r$ to $d$, we can apply non-linear transformations in the low-dimensional space, which allows the layer to learn a richer set of representations:
$$E_i(\mathbf{x}_l) = \mathbf{x}_0 \odot \left( U_l^i \, g\!\left( C_l^i \, g\!\left( V_l^{i\top} \mathbf{x}_l \right) \right) + \mathbf{b}_l \right)$$
Here, $g$ represents any non-linear activation function (like ReLU) and $C_l^i \in \mathbb{R}^{r \times r}$ is a learned weight. In the paper, the sigmoid function was chosen.
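A sketch of this expert with the extra non-linearities. Here I use the sigmoid mentioned above as $g$, and an illustrative $r \times r$ matrix `C` inside the bottleneck:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 8
x0, x_l = rng.normal(size=d), rng.normal(size=d)
U, V = rng.normal(size=(d, r)), rng.normal(size=(d, r))
C = rng.normal(size=(r, r))  # learned weight applied in the low-dimensional space
b = rng.normal(size=d)

def g(z):
    # Non-linear activation; sigmoid here, but any non-linearity works.
    return 1.0 / (1.0 + np.exp(-z))

# Apply the non-linearities (and C) in the r-dim space before projecting back to d.
x_next = x0 * (U @ g(C @ g(V.T @ x_l)) + b) + x_l
```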
In practice, the TensorFlow implementation seems to incorporate (i) the low-rank approximation and (ii) the pre-activation function, but does not implement the mixture of experts.
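For reference, a hedged usage sketch with TensorFlow Recommenders is below. The layer name `tfrs.layers.dcn.Cross` and the `projection_dim` / `preactivation` arguments are from memory of that library's API and should be checked against the current documentation before use.

```python
import tensorflow as tf
import tensorflow_recommenders as tfrs

d = 32  # concatenated feature dimension (illustrative)

x0 = tf.keras.Input(shape=(d,))
# Low-rank cross layers; a preactivation argument (if available) would add the
# non-linearity inside the bottleneck.
x = tfrs.layers.dcn.Cross(projection_dim=d // 4)(x0, x0)
x = tfrs.layers.dcn.Cross(projection_dim=d // 4)(x0, x)
# Stacked variant: feed the cross output into a small MLP head.
h = tf.keras.layers.Dense(64, activation="relu")(x)
out = tf.keras.layers.Dense(1, activation="sigmoid")(h)
model = tf.keras.Model(x0, out)
```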