What is Neural Thompson Sampling (NeuralTS)?

Neural Thompson Sampling (NeuralTS) was introduced by Zhang et al. (2021) (Reference 1) as a way to apply Thompson sampling (TS) to neural network prediction models. In this post, we introduce Thompson sampling, outline the problem setting for NeuralTS, describe the NeuralTS algorithm, and give a bit more detail on how it can be derived.

Introduction to Thompson sampling (TS)

Thompson sampling is an algorithm for online decision problems. Imagine that we have to make a sequence of decisions across T time periods. In each time period t, we take an action x_t. The system receives the action and generates an outcome y_t according to the conditional probability distribution q_\theta ( \cdot | x_t), and we get a reward r_t = r(y_t). Our goal is to maximize our expected reward across the T time periods. The value of \theta is assumed to be unknown to us.

While we don’t know the value of \theta, we can use information from previous rounds to get better and better estimates of \theta. In particular, we can use the Bayesian paradigm to do so. We start with an initial prior distribution for \theta: \theta \sim p. After each round, we get an updated posterior distribution for \theta, which we can use to make decisions for future rounds.

There are many ways to use our estimate of \theta to choose an action in each round. In the greedy approach, we compute a “most likely” value of \theta (e.g. the mean of the posterior distribution of \theta), then take the action that maximizes the reward under this value of \theta. In Thompson sampling, we take a sample of \theta from the posterior distribution, then take the action that maximizes the reward under the sampled \theta. The difference between the two approaches is shown in the figure below taken from Russo et al. (2018) (Reference 2).

Greedy algorithm vs. Thompson sampling. From Russo et al. (2018). Only line 3 is different.
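To make the distinction concrete, here is a minimal Python sketch of a Bernoulli bandit with a Beta prior on each arm's success probability (the arm probabilities are made up). The only difference between the greedy approach and Thompson sampling is how theta_hat is computed inside choose_arm.

import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.3, 0.5, 0.7]    # unknown arm success probabilities (made up)
T = 1000

# Beta(1, 1) priors; alpha and beta hold the posterior parameters for each arm.
alpha = np.ones(len(true_probs))
beta = np.ones(len(true_probs))

def choose_arm(method):
    if method == "greedy":
        theta_hat = alpha / (alpha + beta)    # plug in the posterior mean
    else:                                     # Thompson sampling
        theta_hat = rng.beta(alpha, beta)     # draw one sample from the posterior
    return int(np.argmax(theta_hat))

for t in range(T):
    arm = choose_arm("thompson")
    reward = rng.binomial(1, true_probs[arm])  # system generates the outcome
    alpha[arm] += reward                       # conjugate posterior update
    beta[arm] += 1 - reward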

Contextual bandits

In the setting above (sometimes called the “context-free bandit setting”), in each round we choose an action based solely on our understanding of \theta. In the contextual bandit setting, in each round t, we are additionally given a feature vector x_t representing specific information for round t. Our job is to choose an action a_t, and the system will generate a reward or outcome according to q_\theta(\cdot | x_t, a_t). (Note the different notation in this section; it is chosen to be consistent with the NeuralTS paper.)

An example of the contextual bandit setting is selecting the top story to headline a news website for each user. Each request for a top story corresponds to a round, each story that could be shown represents an action, and the context vector could be features about the user (e.g. age, interests, past browsing history).

Instead of limiting the context vector to just user features, the context vector can also include arm features or features from the (user, arm) interaction.

NeuralTS

NeuralTS operates in the contextual bandit setting. Assume that we have K different arms/actions and we have to make decisions across T rounds. In each round t, we get K contextual vectors x_{t,1}, \dots, x_{t,K} \in \mathbb{R}^d, one for each arm. NeuralTS assumes that the reward for a given context vector x is the output of a neural network f(x; \theta) of depth L \geq 2:

\begin{aligned} f_1 &= W_1 x, &(W_1 \in \mathbb{R}^{m \times d}), \\  f_\ell &= W_\ell \text{ReLU}(f_{\ell - 1}), &(2 \leq \ell \leq L), \\  f(x; \theta) &= \sqrt{m} f_L, \end{aligned}

where W_\ell \in \mathbb{R}^{m \times m} for 2 \leq \ell \leq L-1 and W_L \in \mathbb{R}^{1 \times m}, so that f(x; \theta) is a scalar.

Let \theta = (\text{vec}(W_1), \dots, \text{vec}(W_L)) denote the vector of parameters for the neural network. If we apply Thompson sampling directly to this setting, in each round we would sample the network parameters \theta from its posterior distribution, then choose the action that maximizes the reward: \text{argmax}_k \, f(x_{t,k}; \theta). While workable, this becomes computationally difficult when the network is large.
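In code, the network is straightforward to write down. Here is a minimal numpy sketch (the width m, depth L and the Gaussian initialization here are illustrative choices, not the paper's exact initialization scheme):

import numpy as np

d, m, L = 5, 32, 3    # input dimension, network width, depth (illustrative values)
rng = np.random.default_rng(0)

# theta = (W_1, ..., W_L): W_1 is m x d, hidden layers are m x m, the last layer is 1 x m.
theta = [rng.normal(scale=1 / np.sqrt(m), size=(m, d))]
theta += [rng.normal(scale=1 / np.sqrt(m), size=(m, m)) for _ in range(L - 2)]
theta += [rng.normal(scale=1 / np.sqrt(m), size=(1, m))]

def f(x, theta):
    """Predicted reward f(x; theta) for a single context vector x."""
    h = theta[0] @ x                   # f_1 = W_1 x
    for W in theta[1:]:
        h = W @ np.maximum(h, 0.0)     # f_l = W_l ReLU(f_{l-1})
    return np.sqrt(m) * h.item()       # f(x; theta) = sqrt(m) f_L, a scalar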

Instead of sampling from the posterior distribution of \theta, NeuralTS samples from the posterior distribution of the scalar rewards f(x_{t, k}; \theta). Here is the algorithm as written in the paper:

NeuralTS algorithm (Algorithm 1 in Zhang et al.).

For completeness, g(x;\theta) = \nabla_\theta f(x; \theta), and here is equation (2.3) referenced in the Algorithm:
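Paraphrasing from the paper and the notes below (up to the paper's exact scaling constants), equation (2.3) is the \ell_2-regularized least-squares loss minimized in Line 9 of the algorithm:

\begin{aligned} \min_\theta \; \sum_{i=1}^{t} \left[ f(x_{i, a_i}; \theta) - r_i \right]^2 + m \lambda \| \theta - \theta^{(0)} \|_2^2, \end{aligned}

where x_{i, a_i} is the context of the arm pulled in round i, r_i is the observed reward, and \theta^{(0)} denotes the initial network parameters from Line 2.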

While it might look intimidating, NeuralTS is basically using the Laplace method to approximate the output of the neural network as a Gaussian distribution with suitably chosen parameters. It maintains a Gaussian distribution for each arm’s reward. To select an arm, we sample the reward for each arm from the reward’s posterior distribution, then pull the greedy arm. Details on the Laplace method can be found in this previous post. Here are some additional notes (a code sketch of one full round follows the list):

  • Line 1: U_0 represents the estimated covariance matrix of \theta at time 0 (i.e. the prior distribution).
  • Line 2: This is the initialization for the neural network parameters.
  • Line 5: This looks like the formula for posterior variance of the target variable in the previous post but with some minor differences (it removes the additive \beta^{-1} term but has a multiplicative \lambda / m term).
  • Line 6: NeuralTS adds an additional hyperparameter, \nu, used to control how much exploration the algorithm will do. Larger values of \nu correspond to more exploration. The original Laplace method has \nu = 1.
  • Line 9: In the original Laplace method, we set \theta_t to be the minimizer of a least-squares regression problem: equation (2.3) above with \lambda = 0. In contrast, NeuralTS takes \theta_t to be the minimizer of an \ell_2-regularized least-squares problem.
  • Line 10: This is the posterior variance update for the neural network parameters.
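To make the algorithm concrete, here is a minimal sketch of one round, continuing from the network sketch above. A finite-difference approximation stands in for g(x; \theta), the retraining step of Line 9 is left as a stub, and the hyperparameter values are illustrative:

def flatten(theta):
    return np.concatenate([W.ravel() for W in theta])

def grad_f(x, theta, eps=1e-4):
    """Finite-difference approximation of g(x; theta) = grad_theta f(x; theta)."""
    base, f0 = flatten(theta), f(x, theta)
    g = np.zeros_like(base)
    for j in range(base.size):
        pert = base.copy()
        pert[j] += eps
        theta_pert, pos = [], 0        # unflatten back into weight matrices
        for W in theta:
            theta_pert.append(pert[pos:pos + W.size].reshape(W.shape))
            pos += W.size
        g[j] = (f(x, theta_pert) - f0) / eps
    return g

lam, nu = 1.0, 0.5                        # lambda and nu (illustrative values)
U = lam * np.eye(flatten(theta).size)     # Line 1: U_0 = lambda * I

def neural_ts_round(contexts, theta, U):
    """One round of NeuralTS over the K context vectors (Lines 4-10)."""
    sampled = []
    for x in contexts:
        g = grad_f(x, theta)
        sigma2 = lam * g @ np.linalg.solve(U, g) / m                   # Line 5: posterior variance
        sampled.append(rng.normal(f(x, theta), nu * np.sqrt(sigma2)))  # Line 6: sample a reward
    k = int(np.argmax(sampled))                                        # Lines 7-8: pull the greedy arm
    # ... observe the reward, then retrain theta by minimizing eq. (2.3) (Line 9) ...
    g_k = grad_f(contexts[k], theta)      # in the paper this uses the retrained theta
    U = U + np.outer(g_k, g_k) / m        # Line 10: update U
    return k, U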

References:

  1. Zhang, W. et al. (2021). Neural Thompson Sampling.
  2. Russo, D. et al. (2018). A Tutorial on Thompson Sampling.

Using the Laplace method to approximate the posterior distribution of neural network parameters and outputs

In this previous post, we described how the Laplace method (or Laplace approximation) allows us to approximate a probability distribution with a Gaussian distribution with suitably chosen parameters. In this post, we show how it can be used to estimate the posterior distribution of the parameters and output of a neural network (NN), or really any prediction function. This post generally follows the exposition in Section 5.7.1 of Bishop (2006) (Reference 1).

Set-up

Assume that we are in the supervised learning setting with i.i.d. data \mathcal{D} = \{ x_i, y_i \}_{i=1}^n, where x_i \in \mathbb{R}^p and y_i \in \mathbb{R}. We have a neural network model f(\cdot ; w) which takes in an input vector x and outputs a prediction for y. We would like to use the Bayesian framework to learn w. To do that, we need to specify a prior distribution for the weights w and a likelihood function for the data, and then we can turn the proverbial Bayesian crank to get posterior distributions.

Assume the prior distribution of the network weights is normal with some mean \mu_w and precision matrix \Lambda_0 (inverse of covariance matrix):

\begin{aligned} p(w) = \mathcal{N}(w | \mu_w, \Lambda_0^{-1}) \end{aligned}

Assume that the conditional distribution of the target variable y given x is normal as well, with mean at the neural network output and some precision parameter \beta \in \mathbb{R}:

\begin{aligned} p(y \mid x, w) = \mathcal{N}(y | f(x; w), \beta^{-1}). \end{aligned}

Since our data is i.i.d., if we assume the x_i's are fixed, our likelihood function is

\begin{aligned} p(\mathcal{D} | w) = \prod_{i=1}^n \mathcal{N}(y_i | f(x_i ; w), \beta^{-1}). \end{aligned}

Posterior distribution of NN parameters

We now have all the ingredients we need for Bayesian inference. The posterior distribution is proportional to the product of the prior and the likelihood:

\begin{aligned} p(w \mid \mathcal{D}) \propto p(\mathcal{D}|w) p(w). \end{aligned}

Taking logarithms on both sides:

\begin{aligned} \log p(w \mid \mathcal{D}) &= -\frac{1}{2}(w-\mu_w)^\top \Lambda_0 (w - \mu_w) - \frac{\beta}{2}\sum_{i=1}^n \left[ f(x_i; w) - y_i \right]^2 + C \end{aligned}

for some constant C. If f(x_i; w) were linear in w, the RHS would be a quadratic expression in w, which implies that p(w \mid \mathcal{D}) has a Gaussian distribution with parameters which we can derive. Unfortunately, for NNs and many other prediction models, f(x_i; w) is not linear in w.

Instead, we approximate the posterior with Laplace’s approximation. The details of Laplace’s approximation are in this previous post, so we omit them here. If we let w_{MAP} be the mode of the posterior p(w \mid \mathcal{D}), then we can approximate the posterior as

\begin{aligned} p(w \mid \mathcal{D}) \approx \mathcal{N}(w | w_{MAP}, \Lambda^{-1}), \end{aligned}

where

\begin{aligned} \Lambda &= \left[ - \nabla^2 [\log p(w \mid \mathcal{D})] \right]_{w = w_{MAP}} \\  &= \Lambda_0 + \frac{\beta}{2} \left[ \sum_{i=1}^n \nabla^2 \left[ f(x_i; w) - y_i \right]^2 \right]_{w = w_{MAP}}. \end{aligned}
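As a concrete illustration, here is a minimal numpy/scipy sketch of this approximation for a tiny one-hidden-layer network, small enough that the Hessian can be formed by finite differences. The data, the tanh activation, the zero prior mean and all constants are made up for illustration:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Made-up regression data: y = sin(x) + Gaussian noise with precision beta.
n, beta = 50, 25.0
X = rng.uniform(-3, 3, size=n)
y = np.sin(X) + rng.normal(scale=1 / np.sqrt(beta), size=n)

# Tiny network f(x; w) with one hidden layer of h tanh units; w = (w1, b1, w2, b2) flattened.
h = 5
dim = 3 * h + 1

def f(x, w):
    w1, b1, w2, b2 = w[:h], w[h:2*h], w[2*h:3*h], w[-1]
    return np.tanh(np.outer(x, w1) + b1) @ w2 + b2

# Prior: w ~ N(0, Lambda_0^{-1}) with Lambda_0 = alpha * I (zero prior mean for simplicity).
alpha = 1.0
Lambda_0 = alpha * np.eye(dim)

def neg_log_posterior(w):
    return 0.5 * w @ Lambda_0 @ w + 0.5 * beta * np.sum((f(X, w) - y) ** 2)

w_map = minimize(neg_log_posterior, rng.normal(size=dim)).x    # mode of the posterior

def hessian(fun, w, eps=1e-4):
    """Finite-difference Hessian of a scalar function at w."""
    d = w.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            w_ij = w.copy(); w_ij[i] += eps; w_ij[j] += eps
            w_i = w.copy(); w_i[i] += eps
            w_j = w.copy(); w_j[j] += eps
            H[i, j] = (fun(w_ij) - fun(w_i) - fun(w_j) + fun(w)) / eps ** 2
    return H

# Posterior precision: Lambda = -(Hessian of the log posterior) at w_MAP.
Lambda = hessian(neg_log_posterior, w_map)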

Posterior distribution of target variable

Once we have the posterior distribution of the parameters, we can get the posterior predictive distribution of the target by marginalizing out the parameters:

\begin{aligned} p(y \mid x, \mathcal{D}) &= \int p(y \mid x, w) p(w \mid \mathcal{D}) dw. \end{aligned}

If the mean of y \mid x were linear in w, then this would be a linear Gaussian model, and we could use standard results for that model to obtain the posterior predictive distribution (see this previous post). Unfortunately, for NNs this is not the case.

Again we turn to a Taylor approximation to move forward. Assuming we can ignore quadratic and higher order terms,

\begin{aligned} f(x; w) &\approx f(x; w_{MAP}) + g^\top (w -w_{MAP}), \\  g &= \left[ \nabla_w f(x; w) \right]_{w = w_{MAP}}. \end{aligned}

With this approximation, we have a linear Gaussian model:

\begin{aligned} p(w \mid \mathcal{D}) &= \mathcal{N}(w | w_{MAP}, \Lambda^{-1}), \\  p(y \mid x, w) &= \mathcal{N} \left( y \mid  f(x; w_{MAP}) + g^\top (w -w_{MAP}), \beta^{-1} \right). \end{aligned}

Applying the result for the linear Gaussian model,

\begin{aligned} p(y \mid x, \mathcal{D}) &= \mathcal{N}\left( y \mid f(x; w_{MAP}) + g^\top (w_{MAP} -w_{MAP}), \beta^{-1} + g^\top \Lambda^{-1} g \right) \\  &= \mathcal{N}\left( y \mid f(x; w_{MAP}), \beta^{-1} + g^\top \Lambda^{-1} g \right). \end{aligned}
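Continuing the sketch above, the posterior predictive mean and variance at a new input x_* follow directly from this last display (with the gradient g again formed by finite differences):

def grad_w(x_star, w, eps=1e-5):
    """Finite-difference gradient of f(x_star; w) with respect to w."""
    g = np.zeros_like(w)
    f0 = f(np.array([x_star]), w)[0]
    for i in range(w.size):
        w_i = w.copy(); w_i[i] += eps
        g[i] = (f(np.array([x_star]), w_i)[0] - f0) / eps
    return g

x_star = 1.5
g = grad_w(x_star, w_map)
pred_mean = f(np.array([x_star]), w_map)[0]              # f(x; w_MAP)
pred_var = 1 / beta + g @ np.linalg.solve(Lambda, g)     # beta^{-1} + g^T Lambda^{-1} g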

References:

  1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Section 4.4).

What is the swish activation function?

The swish function is a commonly used activation function in deep neural networks. As a quick recap, at each neuron (node) of a neural network, we take a weighted sum of the inputs (plus a bias term) to get a real number, and then we pass that number through an activation function to get the neuron’s output. In the diagram below, x_1, \dots, x_n \in \mathbb{R} are the inputs, w_1, \dots, w_n \in \mathbb{R} are the weights, and b \in \mathbb{R} is the bias term for the neuron. The neuron’s output is y = f \left( b + \displaystyle\sum_{i=1}^n w_i x_i \right).

Illustration of a node from a neural network. (Taken from https://www.superannotate.com/blog/activation-functions-in-neural-networks)
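In code, the computation at a single neuron is just the following (a toy example with made-up numbers, using the sigmoid as the activation function f):

import numpy as np

x = np.array([0.5, -1.2, 3.0])      # inputs x_1, ..., x_n (made-up values)
w = np.array([0.1, 0.4, -0.3])      # weights w_1, ..., w_n
b = 0.2                             # bias term

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

y = sigmoid(b + w @ x)              # neuron output y = f(b + sum_i w_i x_i)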

For most of the early 2010s, the most commonly used activation function was the Rectified Linear Unit (ReLU), defined as ReLU(x) = \max (0,x).

The swish function was introduced in Ramachandran et al. (2017) (Reference 1) as a replacement for the ReLU. It is defined as

\begin{aligned} \text{swish}(x) &= x \cdot \text{sigmoid}(\beta x) = \dfrac{x}{1 +\exp (-\beta x)}, \end{aligned}

where \beta is a parameter that is either fixed (usually at \beta = 1) or learned from the data.

Here is a plot of the swish function for a few values of \beta, along with a plot of the ReLU function as the dotted line:
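A minimal matplotlib sketch for generating such a plot (the values of \beta shown are my own choices):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 500)

def swish(x, beta):
    return x / (1 + np.exp(-beta * x))

for beta in [0.1, 1.0, 10.0]:
    plt.plot(x, swish(x, beta), label=f"swish, beta = {beta}")
plt.plot(x, np.maximum(x, 0), "k--", label="ReLU")   # ReLU as the dotted line
plt.legend()
plt.show()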

Reference 1 notes that \text{swish}(x) = x/2 when \beta = 0, and as \beta \rightarrow \infty, the swish function approaches the ReLU function (it is in fact the pointwise limit). Hence, the swish function “can be loosely viewed as a smooth function which nonlinearly interpolates between the linear function and the ReLU function.”

The way the swish function was discovered is quite interesting. Ramachandran et al. (Reference 1) performed an automated search over a particular search space to find “good” activation functions, where “good” is defined as neural networks using the activation function performing well on a pre-defined prediction task. Candidate activation functions were constructed by repeatedly composing the “core unit” (Figure 1 of Reference 1):

The unary and binary functions that the search algorithm was allowed to use to form the activation functions are listed in the paper:

As an example, the swish function is constructed with just one core unit, the two unary functions being \beta x and x and the binary function being \sigma (x_1) \cdot x_2.

In their experiments, the swish function consistently outperformed ReLU and other candidate activation functions over a range of tasks. The table below (Table 1 of Reference 1) shows validation accuracy numbers for 3 different architectures for the CIFAR-10 dataset for the top-performing activation functions encountered in the search:

References:

  1. Ramachandran, P. et al. (2017). Searching for Activation Functions.