- The estimator $\hat{\theta}_n$ is **consistent** if it converges in probability to $\theta$.
- It is **unbiased** if it is, on average, equal to the true value of the parameter, i.e. if $\mathbb{E}[\hat{\theta}_n] = \theta$.
- It is **asymptotically unbiased** if $\mathbb{E}[\hat{\theta}_n] \to \theta$ as $n \to \infty$.

**Unbiasedness is not the same as consistency**

*It’s important to note that unbiasedness and consistency do not imply each other.* The examples below, from Reference 1, show this.

*Unbiasedness does not imply consistency:* Let $X_1, X_2, \dots$ be i.i.d. $\mathcal{N}(\mu, \sigma^2)$. Consider the estimator $\hat{\theta}_n = X_n$ for the mean $\mu$. We always have $\mathbb{E}[\hat{\theta}_n] = \mu$, so it is unbiased. However, $\hat{\theta}_n$ converges in distribution to $\mathcal{N}(\mu, \sigma^2)$ rather than in probability to $\mu$, and so is not consistent.

*Consistency does not imply unbiasedness:* Let $X_1, \dots, X_n$ be i.i.d. $\mathcal{N}(\mu, \sigma^2)$. The maximum likelihood estimator (MLE) for $\sigma^2$ is $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2$, where $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$. It is consistent (the MLE is consistent under mild regularity conditions), but it is not hard to show that $\mathbb{E}[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2 \neq \sigma^2$, i.e. it is biased.
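As a quick numerical illustration (my own Monte Carlo sketch in Python, not from Reference 1), the downward bias of the MLE variance estimator, with mean $(n-1)\sigma^2/n$, shows up clearly in simulation:

```python
import random

# Monte Carlo check: the MLE variance estimator (1/n) * sum((x - xbar)^2)
# has expectation ((n - 1) / n) * sigma^2, so it is biased downward.
random.seed(0)

n = 5            # a small sample size makes the bias obvious
reps = 200_000   # number of simulated samples

total = 0.0
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]   # true sigma^2 = 1
    xbar = sum(xs) / n
    total += sum((x - xbar) ** 2 for x in xs) / n     # MLE of sigma^2

avg = total / reps
print(avg)   # close to (n - 1) / n = 0.8, not 1.0
```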

**Asymptotic unbiasedness and consistency**

**Asymptotic unbiasedness and consistency also do not imply each other.**

*Asymptotic unbiasedness does not imply consistency:* This is a variation of the example for “unbiasedness does not imply consistency”. Instead of taking , take .

*Consistency does not imply asymptotic unbiasedness:* From Reference 2: consider a silly example where $\theta = 0$ and we want to estimate $\theta$ using random variables $X_n$ with

$$\mathbb{P}(X_n = 0) = 1 - \frac{1}{n}, \qquad \mathbb{P}(X_n = n) = \frac{1}{n}.$$

$X_n$ is consistent since it converges in probability to 0, but it is not asymptotically unbiased: $\mathbb{E}[X_n] = n \cdot \frac{1}{n} = 1$ for every $n$.
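Both claims can be checked exactly with a few lines of code. This is my own sketch, assuming the distribution $\mathbb{P}(X_n = 0) = 1 - 1/n$ and $\mathbb{P}(X_n = n) = 1/n$ described by Reference 2:

```python
# Exact check of the "silly" estimator: X_n equals 0 with probability
# 1 - 1/n and equals n with probability 1/n (function names are my own).
def prob_nonzero(n):
    return 1.0 / n   # P(|X_n - 0| > eps) for any 0 < eps < n

def expectation(n):
    return 0.0 * (1 - 1.0 / n) + n * (1.0 / n)   # E[X_n]

# Consistency: the probability of missing theta = 0 vanishes...
print([prob_nonzero(n) for n in (10, 100, 1000)])   # [0.1, 0.01, 0.001]
# ...but the bias never does:
print([expectation(n) for n in (10, 100, 1000)])    # [1.0, 1.0, 1.0]
```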

References:

*Why am I not simply posting the git commands?* That’s because the correct git command depends a lot on the context. By linking to the full reference, you can check if the situation the reference is addressing matches yours before applying its solution.

(*Note:* This post will be updated periodically as I find other useful idioms that I keep forgetting.)

- **How to remove a single file from the staging area.**
- **How to amend a commit message.** This reference also deals with the case where you want to amend messages for older commits, or if you want to amend messages for commits that have been pushed to a remote repository.
- **How to “squash” multiple commits into a single commit.** This is especially useful for removing small, intermediate checkpoints that help during development but clutter up the commit history.
- **How to merge a branch into master.** This deals with the simple case when there are no merge conflicts.
- **How to push changes on a local branch to the remote version of the branch.** This includes how to delete local and remote branches.

I recently created a sparse matrix using `Matrix::Matrix` (with the option `sparse=TRUE`) and found it difficult to track down documentation about what the slots in the matrix object are. This post describes the slots in a `dgCMatrix` object.

(Click here for full documentation of the `Matrix` package (and it is a lot–like, 215 pages a lot).)

**Background**

It turns out that there is some documentation on `dgCMatrix` objects within the `Matrix` package. One can access it using the following code:

    library(Matrix)
    ?`dgCMatrix-class`

According to the documentation, the `dgCMatrix` class

…is a class of sparse numeric matrices in the compressed, sparse, column-oriented format. In this implementation the non-zero elements in the columns are sorted into increasing row order.

`dgCMatrix` is the “standard” class for sparse numeric matrices in the `Matrix` package.

**An example**

We’ll use a small matrix as a running example in this post:

    library(Matrix)
    M <- Matrix(c(0, 0, 0, 2,
                  6, 0, -1, 5,
                  0, 4, 3, 0,
                  0, 0, 5, 0),
                byrow = TRUE, nrow = 4, sparse = TRUE)
    rownames(M) <- paste0("r", 1:4)
    colnames(M) <- paste0("c", 1:4)
    M
    # 4 x 4 sparse Matrix of class "dgCMatrix"
    #    c1 c2 c3 c4
    # r1  .  .  .  2
    # r2  6  . -1  5
    # r3  .  4  3  .
    # r4  .  .  5  .

Running `str` on `M` tells us that the `dgCMatrix` object has 6 slots. (To learn more about slots and S4 objects, see this section from Hadley Wickham’s *Advanced R*.)

    str(M)
    # Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
    #   ..@ i       : int [1:7] 1 2 1 2 3 0 1
    #   ..@ p       : int [1:5] 0 1 2 5 7
    #   ..@ Dim     : int [1:2] 4 4
    #   ..@ Dimnames:List of 2
    #   .. ..$ : chr [1:4] "r1" "r2" "r3" "r4"
    #   .. ..$ : chr [1:4] "c1" "c2" "c3" "c4"
    #   ..@ x       : num [1:7] 6 4 -1 3 5 2 5
    #   ..@ factors : list()

**`x`, `i` and `p`**

If a matrix `M` has `nn` non-zero entries, then its `x` slot is a vector of length `nn` containing all the non-zero values in the matrix. The non-zero elements in column 1 are listed first (starting from the top and ending at the bottom), followed by columns 2, 3 and so on.

    M
    # 4 x 4 sparse Matrix of class "dgCMatrix"
    #    c1 c2 c3 c4
    # r1  .  .  .  2
    # r2  6  . -1  5
    # r3  .  4  3  .
    # r4  .  .  5  .
    M@x
    # [1]  6  4 -1  3  5  2  5
    as.numeric(M)[as.numeric(M) != 0]
    # [1]  6  4 -1  3  5  2  5

The `i` slot is a vector of length `nn`. The `k`th element of `M@i` is the row index of the `k`th non-zero element (as listed in `M@x`). *One big thing to note here is that the first row has index ZERO, unlike R’s usual indexing convention.* In our example, the first non-zero entry, 6, is in the second row, i.e. row index 1, so the first entry of `M@i` is 1.

    M
    # 4 x 4 sparse Matrix of class "dgCMatrix"
    #    c1 c2 c3 c4
    # r1  .  .  .  2
    # r2  6  . -1  5
    # r3  .  4  3  .
    # r4  .  .  5  .
    M@i
    # [1] 1 2 1 2 3 0 1

If the matrix has `nvars` columns, then the `p` slot is a vector of length `nvars + 1`. *If we index the columns such that the first column has index ZERO,* then `M@p[1] = 0`, and `M@p[j+2] - M@p[j+1]` gives us the number of non-zero elements in column `j`.

In our example, when `j = 2`, `M@p[2+2] - M@p[2+1] = 5 - 2 = 3`, so there are 3 non-zero elements in column index 2 (i.e. the third column).

    M
    # 4 x 4 sparse Matrix of class "dgCMatrix"
    #    c1 c2 c3 c4
    # r1  .  .  .  2
    # r2  6  . -1  5
    # r3  .  4  3  .
    # r4  .  .  5  .
    M@p
    # [1] 0 1 2 5 7

With the `x`, `i` and `p` slots, one can reconstruct the entries of the matrix.
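For readers more comfortable outside R, here is a from-scratch Python sketch (my own illustration, not part of the `Matrix` package) of the same compressed sparse column layout, producing the `x`, `i` and `p` slots for the running example:

```python
# Build the CSC slots by hand: `x` (values), `i` (0-based row indices),
# `p` (column pointers), exactly as dgCMatrix stores them.
M = [
    [0, 0,  0, 2],
    [6, 0, -1, 5],
    [0, 4,  3, 0],
    [0, 0,  5, 0],
]
nrow, ncol = len(M), len(M[0])

x, i, p = [], [], [0]
for col in range(ncol):          # walk column by column...
    for row in range(nrow):      # ...top to bottom within a column
        if M[row][col] != 0:
            x.append(M[row][col])
            i.append(row)        # 0-based, as in the dgCMatrix slots
    p.append(len(x))             # p[col + 1] = non-zeros seen so far

print(x)   # [6, 4, -1, 3, 5, 2, 5]  -- matches M@x
print(i)   # [1, 2, 1, 2, 3, 0, 1]   -- matches M@i
print(p)   # [0, 1, 2, 5, 7]         -- matches M@p
```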

**`Dim` and `Dimnames`**

These two slots are fairly obvious. `Dim` is a vector of length 2, with the first and second entries denoting the number of rows and columns of the matrix respectively. `Dimnames` is a list of length 2: the first element is a vector of row names (if present) and the second a vector of column names (if present).

**`factors`**

This slot is probably the most unusual of the lot, and its documentation was a bit difficult to track down. From the CRAN documentation, it looks like `factors` is

… [an] Object of class “list” – a list of factorizations of the matrix. Note that this is typically empty, i.e., `list()`, initially and is updated whenever a matrix factorization is automagically computed.

My understanding is that if we perform any matrix factorizations or decompositions on a `dgCMatrix` object, it stores the factorization under `factors` so that if asked for the factorization again, it can return the cached value instead of recomputing it. Here is an example:

    M@factors
    # list()
    Mlu <- lu(M)  # perform triangular decomposition
    str(M@factors)
    # List of 1
    #  $ LU:Formal class 'sparseLU' [package "Matrix"] with 5 slots
    #   .. ..@ L :Formal class 'dtCMatrix' [package "Matrix"] with 7 slots
    #   .. .. .. ..@ i       : int [1:4] 0 1 2 3
    #   .. .. .. ..@ p       : int [1:5] 0 1 2 3 4
    #   .. .. .. ..@ Dim     : int [1:2] 4 4
    #   .. .. .. ..@ Dimnames:List of 2
    #   .. .. .. .. ..$ : chr [1:4] "r2" "r3" "r4" "r1"
    #   .. .. .. .. ..$ : NULL
    #   .. .. .. ..@ x       : num [1:4] 1 1 1 1
    #   .. .. .. ..@ uplo    : chr "U"
    #   .. .. .. ..@ diag    : chr "N"
    #   .. ..@ U :Formal class 'dtCMatrix' [package "Matrix"] with 7 slots
    #   .. .. .. ..@ i       : int [1:7] 0 1 0 1 2 0 3
    #   .. .. .. ..@ p       : int [1:5] 0 1 2 5 7
    #   .. .. .. ..@ Dim     : int [1:2] 4 4
    #   .. .. .. ..@ Dimnames:List of 2
    #   .. .. .. .. ..$ : NULL
    #   .. .. .. .. ..$ : chr [1:4] "c1" "c2" "c3" "c4"
    #   .. .. .. ..@ x       : num [1:7] 6 4 -1 3 5 5 2
    #   .. .. .. ..@ uplo    : chr "U"
    #   .. .. .. ..@ diag    : chr "N"
    #   .. ..@ p  : int [1:4] 1 2 3 0
    #   .. ..@ q  : int [1:4] 0 1 2 3
    #   .. ..@ Dim: int [1:2] 4 4

Here is an example which shows that the decomposition is only performed once:

    set.seed(1)
    M <- runif(9e6)
    M[sample.int(9e6, size = 8e6)] <- 0
    M <- Matrix(M, nrow = 3e3, sparse = TRUE)
    system.time(lu(M))
    #    user  system elapsed
    #  13.527   0.161  13.701
    system.time(lu(M))
    #    user  system elapsed
    #       0       0       0

This post is part of a series on the `glmnet` function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation.

In this post, instead of looking at one of the function options of `glmnet`, we’ll look at the `predict` method for a `glmnet` object. The object returned by `glmnet` (call it `fit`) has class `"glmnet"`; when we run `predict(fit)`, it runs the `predict` method for class `"glmnet"` objects, i.e. `predict.glmnet(fit)`.

For reference, here is the full signature of the `predict.glmnet` function/method (v3.0-2):

    predict(object, newx, s = NULL,
            type = c("link", "response", "coefficients", "nonzero", "class"),
            exact = FALSE, newoffset, ...)

In the above, `object` is a fitted `"glmnet"` object (call it `fit`). Recall that every glmnet `fit` has a lambda sequence associated with it: this will be important in understanding what follows. (This sequence can be accessed via `fit$lambda`.)

For the rest of this post, we will use the following data example:

    set.seed(1)
    n <- 100; p <- 20
    x <- matrix(rnorm(n * p), nrow = n)
    beta <- matrix(c(rep(1, 5), rep(0, 15)), ncol = 1)
    y <- x %*% beta + rnorm(n)
    fit <- glmnet(x, y)

**Function option: newx**

`newx` is simply the new `x` matrix at which we want predictions. So, for example, if we want predictions for the training `x` matrix, we would do

    predict(fit, x)

If no other arguments are passed, we will get a matrix of predictions, with each column corresponding to predictions for one value of $\lambda$ in `fit$lambda`. For our example, `fit$lambda` has length 68 and `x` consists of 100 rows/observations, so `predict(fit, x)` returns a 100 × 68 matrix.

    length(fit$lambda)
    # [1] 68
    dim(predict(fit, x))
    # [1] 100  68

`newx` must be provided except when `type="coefficients"` or `type="nonzero"` (more on these types later).

**Function option: newoffset**

If the original `glmnet` call was fit with an offset, then an offset must be included in the `predict` call under the `newoffset` option. If it is not included, an error will be thrown.

    set.seed(2)
    offset <- rnorm(n)
    fit2 <- glmnet(x, y, offset = offset)
    predict(fit2, x)
    # Error: No newoffset provided for prediction, yet offset used in fit of glmnet

The reverse is true, in that if the original `glmnet` call was NOT fit with an offset, then `predict` will not allow you to include an offset in the prediction, EVEN if you pass it the `newoffset` option. It does not throw a warning or error, but simply ignores the `newoffset` option. You have been warned! This is demonstrated in the code snippet below.

    pred_no_offset <- predict(fit, x)
    pred_w_offset <- predict(fit, x, offset = offset)
    max(abs(pred_no_offset - pred_w_offset))
    # [1] 0

**Function option: s and exact**

`s` indicates the $\lambda$ values at which we want predictions. If the user does not specify `s`, `predict` will give predictions at each of the $\lambda$ values in `fit$lambda`.

(**Why is this option named s and not the more intuitive lambda?** On page 5 of this vignette, the authors say they made this choice “in case later we want to allow one to specify the model size in other ways”. `lambda` controls the model size in the sense that the larger it is, the more coefficients will be forced to zero. There are other ways to specify model size. For example, one could imagine a function option where we specify the number of non-zero coefficients we want in the model, or where we specify the maximum norm the coefficient vector can have. None of these other options have been implemented at the moment.)

*If the user-specified `s` values all belong to `fit$lambda`,* `predict` pulls out the coefficients corresponding to those values and returns predictions. In this case, the `exact` option has no effect.

*If a user-specified `s` value does NOT belong to `fit$lambda`,* things get interesting. If `exact=FALSE` (the default), `predict` uses linear interpolation to make predictions. (More accurately, it does linear interpolation of the coefficients, which translates to linear interpolation of the predictions.) As stated in the documentation: “while this is often a good approximation, it can sometimes be a bit coarse”.

As a demonstration: in the snippet below, we look at the predictions at a $\lambda$ value that lies between the two largest values in `fit$lambda`. If the function does as the documentation says, the last line should give a value of 0 (to machine precision).

    b1 <- as.numeric(predict(fit, x, s = fit$lambda[1]))
    b2 <- as.numeric(predict(fit, x, s = fit$lambda[2]))
    b3 <- as.numeric(predict(fit, x, s = 0.3 * fit$lambda[1] + 0.7 * fit$lambda[2]))
    max(abs(b3 - (0.3 * b1 + 0.7 * b2)))
    # [1] 3.885781e-16
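The reason interpolating coefficients translates to interpolating predictions is that predictions are linear in the coefficients. A small self-contained Python sketch (with made-up numbers, not glmnet output):

```python
# Predictions are x @ beta, a linear function of beta, so a convex
# combination of coefficient vectors gives the same convex combination
# of prediction vectors (up to floating-point rounding).
x = [[1.0, 2.0], [3.0, -1.0]]   # two observations, two features
b1 = [0.5, 0.0]                  # coefficients at lambda_1
b2 = [0.3, 0.2]                  # coefficients at lambda_2
w = 0.7                          # interpolation weight

def predict(X, beta):
    return [sum(xij * bj for xij, bj in zip(row, beta)) for row in X]

b3 = [w * u + (1 - w) * v for u, v in zip(b1, b2)]   # interpolated coefs
p3 = predict(x, b3)
p_mix = [w * u + (1 - w) * v for u, v in zip(predict(x, b1), predict(x, b2))]
print(max(abs(a - b) for a, b in zip(p3, p_mix)))    # ~0, up to rounding
```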

**What happens if we have values in `s` that are not within the range of `fit$lambda`?** First, I would recommend using `exact=TRUE`, because extrapolation beyond the range of `fit$lambda` is dangerous in general. In my little experiments, it looks like with `exact=FALSE`, `predict` simply returns the predictions for the value in `fit$lambda` that is closest to `s`.

If `exact=TRUE`, `predict` merges `s` with `fit$lambda` to get a single (decreasing) sequence, refits the glmnet model, then returns predictions at the values in `s`. If your training data is very large, this refitting could take a long time.

One note when using `exact=TRUE` is that you have to pass in additional arguments in order for the refitting to happen. That’s because the fitted `glmnet` object does not contain all the ingredients needed for refitting. For our example, to predict for `fit` we need to supply `x` and `y` as well. For more complicated glmnet calls, more options have to be provided.

    predict(fit, x, s = fit$lambda[68] / 2, exact = TRUE)
    # Error: used coef.glmnet() or predict.glmnet() with `exact=TRUE`
    # so must in addition supply original argument(s) x and y in order to
    # safely rerun glmnet
    predict(fit, x, s = fit$lambda[68] / 2, exact = TRUE, x = x, y = y)
    # glmnet correctly returns predictions...

**Function option: type**

The `type` option determines the type of prediction returned. `type="coefficients"` returns the model coefficients for the $\lambda$ values in `s` as a sparse matrix. `type="nonzero"` returns a list, with each element being a vector of the features which have non-zero coefficients. For example, the code snippet below shows that for the second and third values in `fit$lambda`, the features that have non-zero coefficients are feature 5, and features 3 and 5, respectively.

    predict(fit, type = "nonzero", s = fit$lambda[2:3])
    # $`1`
    # [1] 5
    #
    # $`2`
    # [1] 3 5

For `type="coefficients"` and `type="nonzero"`, the user does not have to provide a `newx` argument, since the return value does not depend on where we want the predictions. For the rest of the possible values of `type`, `newx` is required.

For `type="link"` (the default) and `type="response"`, it helps to know a little GLM theory. For an observation with feature values $x$, `type="link"` returns the linear predictor $x^T \hat{\beta}$, where $\hat{\beta}$ is the coefficient vector corresponding to a $\lambda$ value in `s`.

For `type="response"`, $x^T \hat{\beta}$ is passed through the GLM’s inverse link function to return predictions on the `y` scale. For the “gaussian” family it is still $x^T \hat{\beta}$. For the “binomial” and “poisson” families it is $1 / (1 + e^{-x^T \hat{\beta}})$ and $e^{x^T \hat{\beta}}$ respectively. For “multinomial” it returns fitted probabilities, and for “cox” it returns the fitted relative risk.

The final possibility, `type="class"`, applies only to the “binomial” and “multinomial” families. For each observation, it simply returns the class with the highest predicted probability.
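A small Python sketch of the link-vs-response relationship (the helper below is my own illustration, not glmnet code):

```python
import math

# type="response" is type="link" (the linear predictor eta = x^T beta)
# passed through the family's inverse link function.
def response(eta, family):
    if family == "gaussian":
        return eta                            # identity link
    if family == "binomial":
        return 1.0 / (1.0 + math.exp(-eta))   # inverse logit -> probability
    if family == "poisson":
        return math.exp(eta)                  # inverse log link -> mean count
    raise ValueError(family)

print(response(0.0, "binomial"))   # 0.5
print(response(1.0, "poisson"))    # e^1 ~ 2.718...
```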

**Bonus: The coef method**

The `coef` method for glmnet is actually just a special case of the `predict` method. This can be seen from the source code:

    coef.glmnet
    # function (object, s = NULL, exact = FALSE, ...)
    # predict(object, s = s, type = "coefficients", exact = exact,
    #     ...)
    # <bytecode: 0x7ff3ae934f20>
    # <environment: namespace:glmnet>

**Bonus: predict.elnet, predict.lognet, …**

If you inspect the class of the object returned by a `glmnet` call, you will realize that it has more than one class. In the code below, we see that the “gaussian” family results in an “elnet” class object. (The “binomial” family returns a “lognet” object, the “poisson” family returns a “fishnet” object, etc.)

    class(fit)
    # [1] "elnet"  "glmnet"

These classes have their own `predict` methods as well, but they draw on this base `predict.glmnet` call. As an example, here is the code for `predict.fishnet`:

    glmnet:::predict.fishnet
    # function (object, newx, s = NULL, type = c("link", "response",
    #     "coefficients", "nonzero"), exact = FALSE, newoffset, ...)
    # {
    #     type = match.arg(type)
    #     nfit = NextMethod("predict")
    #     switch(type, response = exp(nfit), nfit)
    # }
    # <bytecode: 0x7ff3ab622040>
    # <environment: namespace:glmnet>

What happens here is that `predict.glmnet` is first called. If `type` is not `"response"`, then we simply return whatever `predict.glmnet` would have returned. However, if `type="response"`, then (i) we call `predict.glmnet`, and (ii) the predictions are passed through the exponential function (the inverse link for the “poisson” family) before being returned.

This is how `predict` is able to give the correct return output across the different `family` and `type` options.

**Recap: What is a GLM?**

Assume we have data points $(x_{i1}, \dots, x_{ip}, y_i)$ for $i = 1, \dots, n$. We want to build a *generalized linear model (GLM)* of the response $y$ using the other $p$ features. To that end, assume that the $x_{ij}$ values are all fixed. Assume that $y_1, \dots, y_n$ are samples of independent random variables $Y_1, \dots, Y_n$ which have the probability density (or mass) function of the form

$$f(y_i; \theta_i, \phi) = \exp\left[ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right].$$

In the above, the form of $f$ is known (i.e. the functions $a$, $b$ and $c$), but not the values of the $\theta_i$’s and $\phi$.

Let $\mu_i = \mathbb{E}[Y_i]$. We assume that

$$g(\mu_i) = \eta_i,$$

where $\eta_i = \sum_{j=1}^p \beta_j x_{ij}$ for some *link function* $g$, assumed to be known.

**Recap: The likelihood/score equations**

The goal of fitting a GLM is to find estimates for $\beta_1, \dots, \beta_p$. More specifically, we want to find the values of $\beta_1, \dots, \beta_p$ which maximize the (log-)likelihood of the data:

$$L = \sum_{i=1}^n L_i = \sum_{i=1}^n \log f(y_i; \theta_i, \phi).$$

To do that, we differentiate $L$ w.r.t. each $\beta_j$ and set the derivatives to 0. Using the chain rule as well as the form of $f$, after some algebra we have

$$\frac{\partial L_i}{\partial \beta_j} = \frac{\partial L_i}{\partial \theta_i} \frac{\partial \theta_i}{\partial \mu_i} \frac{\partial \mu_i}{\partial \eta_i} \frac{\partial \eta_i}{\partial \beta_j} = \frac{(y_i - \mu_i) x_{ij}}{\mathrm{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i}.$$

Hence, the *likelihood equations* (or *score equations*) are

$$\sum_{i=1}^n \frac{(y_i - \mu_i) x_{ij}}{\mathrm{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i} = 0, \qquad j = 1, \dots, p.$$

$\beta$ appears implicitly in the equations above through $\mu_i$: $\mu_i = g^{-1}\big(\sum_{j=1}^p \beta_j x_{ij}\big)$. (See the original post for a matrix form of these equations.)

**Likelihood equations: A derivation**

All this involves is evaluating the 4 partial derivatives in the chain-rule expansion of $\partial L_i / \partial \beta_j$ and multiplying them together.

Using the form of the probability density function for $Y_i$, we have $L_i = \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi)$, so

$$\frac{\partial L_i}{\partial \theta_i} = \frac{y_i - b'(\theta_i)}{a(\phi)}.$$

The second partial derivative, $\partial \theta_i / \partial \mu_i$, is probably the trickiest of the lot: simplifying it requires some properties of exponential families. Under general regularity conditions, we have

$$\mathbb{E}\left[\frac{\partial \log f(Y_i; \theta_i, \phi)}{\partial \theta_i}\right] = 0 \quad \text{and} \quad -\mathbb{E}\left[\frac{\partial^2 \log f(Y_i; \theta_i, \phi)}{\partial \theta_i^2}\right] = \mathbb{E}\left[\left(\frac{\partial \log f(Y_i; \theta_i, \phi)}{\partial \theta_i}\right)^2\right].$$

Applying the first identity to our setting: $\mu_i = \mathbb{E}[Y_i] = b'(\theta_i)$.

Applying the second identity to our setting: $\mathrm{Var}(Y_i) = b''(\theta_i)\, a(\phi)$.

Thus,

$$\frac{\partial \mu_i}{\partial \theta_i} = b''(\theta_i) = \frac{\mathrm{Var}(Y_i)}{a(\phi)}, \quad \text{so} \quad \frac{\partial \theta_i}{\partial \mu_i} = \frac{a(\phi)}{\mathrm{Var}(Y_i)}.$$

The third partial derivative, $\partial \mu_i / \partial \eta_i$, actually appears in the likelihood equations, so we don’t have to do any algebraic manipulation here. Finally, using the systematic component of the GLM, $\eta_i = \sum_{j=1}^p \beta_j x_{ij}$, we have

$$\frac{\partial \eta_i}{\partial \beta_j} = x_{ij}.$$

Putting these 4 parts together:

$$\frac{\partial L_i}{\partial \beta_j} = \frac{y_i - b'(\theta_i)}{a(\phi)} \cdot \frac{a(\phi)}{\mathrm{Var}(Y_i)} \cdot \frac{\partial \mu_i}{\partial \eta_i} \cdot x_{ij} = \frac{(y_i - \mu_i) x_{ij}}{\mathrm{Var}(Y_i)} \frac{\partial \mu_i}{\partial \eta_i},$$

as required.
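As a quick sanity check (my own addition, not in the original post), we can specialize the likelihood equations to the Poisson family with canonical log link, where $b(\theta) = e^\theta$, $a(\phi) = 1$, $\mu_i = e^{\theta_i}$, $\mathrm{Var}(Y_i) = \mu_i$ and $\partial \mu_i / \partial \eta_i = \mu_i$. The variance and inverse-link terms cancel, recovering the familiar Poisson regression score equations:

```latex
% Poisson with canonical log link: the general likelihood equations collapse
% to sum_i (y_i - mu_i) x_ij = 0, the usual Poisson regression score equations.
\sum_{i=1}^n \frac{(y_i - \mu_i) x_{ij}}{\mathrm{Var}(Y_i)}
  \,\frac{\partial \mu_i}{\partial \eta_i}
= \sum_{i=1}^n \frac{(y_i - \mu_i) x_{ij}}{\mu_i} \cdot \mu_i
= \sum_{i=1}^n (y_i - \mu_i) x_{ij} = 0,
\qquad j = 1, \dots, p.
```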

References:

- Agresti, A. Categorical Data Analysis (3rd ed), Chapter 4.

I heard about the most recent community call (“Maintaining an R package”) via an announcement on R-bloggers. The topic was of personal interest to me and the panelists were experienced/interesting enough that I felt I could learn a lot by participating. For reference, here were the speakers/panelists for this call:

- Julia Silge, Data scientist & software engineer @ RStudio
- Elin Waring, Professor of Sociology and Interim Dean of health sciences, human services and nursing @ Lehman College, CUNY
- Erin Grand, Data scientist @ Uncommon Schools
- Leonardo Collado-Torres, Research scientist @ Lieber institute for brain development
- Scott Chamberlain, Co-founder and technical lead @ rOpenSci

Here are some of my quick observations of the event as a participant:

- The calls are publicly hosted on Zoom, which made it really easy to join. Overall the video and sound quality was good and clear enough that I wasn’t straining to hear the speakers.
- At the beginning of the call, Stefanie, the community manager hosting this call, suggested that those who were comfortable turn on their video so that we could put faces to names. That was a small, simple touch that made the call more personal!
- As the call is happening, attendees can collaboratively update a shared document capturing the key points of the discussion. It is then made publicly available soon after the call is over. (As an example, this is the collaborative document of the call I attended.)
- Through the collaborative document, not only could participants ask the speakers questions, but other participants could answer and comment on those questions as well!
- rOpenSci does a really good job of recording different aspects of the call and archiving them for future reference. Each call has its own webpage with all the resources associated with it. For the call I attended, all the resources are here. There are a list of resource links (including one for the collaborative notes), as well as a video recording of the call itself!

I enjoyed listening in on the call and am very much looking forward to the next one! I hope that you will consider joining in as well.

For the full list of rOpenSci community calls, click here.

*As a Bayesian method, all we have to do is specify the prior distribution for the parameters and the likelihood.* From there, we can turn the proverbial Bayesian crank to get the posterior distribution of the parameters and, with it, posterior inference for any quantity we are interested in (e.g. point and interval estimates of $f$).

**Likelihood**

The likelihood is easy to specify once we get definitions out of the way. Let $T$ denote a binary decision tree. Assuming the tree has $b$ terminal nodes, let $M = \{\mu_1, \dots, \mu_b\}$ be the set of parameter values associated with each of the terminal nodes, such that if the value $x$ ends up in the $j$th terminal node, the tree would return the value $\mu_j$. We can think of $T$ as representing the structure of the tree, and of $M$ as specifying the value we return to the user once the input hits one of the terminal nodes.

For a given $T$ and $M$, let $g(x; T, M)$ denote the function which assigns $\mu_j$ to $x$ if $x$ ends up in the $j$th terminal node.

The likelihood for BART is

$$Y = \sum_{j=1}^m g(x; T_j, M_j) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2).$$

The response is the sum of $m$ binary decision trees with additive Gaussian noise.
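To make $g(x; T, M)$ concrete, here is a toy Python sketch (my own data structures, not the paper's code) that evaluates a sum of two tiny trees:

```python
# Each internal node splits on one feature; terminal nodes hold the mu values.
def g(x, tree):
    """Walk the tree until a terminal node, then return its mu."""
    node = tree
    while "mu" not in node:
        branch = "left" if x[node["var"]] <= node["cut"] else "right"
        node = node[branch]
    return node["mu"]

# Two tiny trees; the (noiseless) BART prediction is the sum of their outputs.
t1 = {"var": 0, "cut": 0.5,
      "left": {"mu": -1.0}, "right": {"mu": 2.0}}
t2 = {"var": 1, "cut": 0.0,
      "left": {"mu": 0.5}, "right": {"mu": 1.5}}

x = [0.3, 0.7]                 # goes left in t1 (-1.0), right in t2 (1.5)
print(g(x, t1) + g(x, t2))     # -1.0 + 1.5 = 0.5 (plus Gaussian noise in the model)
```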

**Prior specification: General comments**

We need to choose priors for $(T_1, M_1), \dots, (T_m, M_m)$, $\sigma$ and $m$.

Chipman et al. actually decided not to put a prior on $m$ for computational reasons. Instead, they suggested beginning with a default of $m = 200$, then checking if one or two other choices of $m$ make any difference. According to them, as $m$ increases, predictive performance improves dramatically at first, then levels off, and finally degrades slowly. (I believe this is the observed behavior with boosting as well.) As such, for prediction it appears only important to avoid having too few trees.

As for the other parameters, we introduce independence assumptions to simplify the prior specification. If $\mu_{ij}$ represents the terminal value of the $i$th node in $T_j$, then we assume that

$$p\big((T_1, M_1), \dots, (T_m, M_m), \sigma\big) = \Big[\prod_j p(M_j \mid T_j)\, p(T_j)\Big]\, p(\sigma),$$

and that

$$p(M_j \mid T_j) = \prod_i p(\mu_{ij} \mid T_j).$$

To complete the prior specification, we just need to specify the priors for $T_j$, $\mu_{ij} \mid T_j$ and $\sigma$. We do so in the following subsections. The paper notes that these priors were also used in an earlier paper by the same authors (Reference 2), where they considered a Bayesian model for a single decision tree.

*The $\sigma$ prior*

For $\sigma$, BART uses a standard conjugate prior, the inverse chi-square distribution

$$\sigma^2 \sim \frac{\nu \lambda}{\chi^2_\nu},$$

where $\nu$ and $\lambda$ are the degrees of freedom and scale hyperparameters. Chipman et al. recommend choosing these two hyperparameters in a data-driven way:

- Get some estimate $\hat{\sigma}$ of $\sigma$ (e.g. the sample standard deviation of $y$, or fit ordinary least squares (OLS) of $y$ on $x$ and take the residual standard deviation).
- Pick a value of $\nu$ between 3 and 10 to get an appropriate shape.
- Pick a value of $\lambda$ so that the $q$th quantile of the prior on $\sigma$ is $\hat{\sigma}$, i.e. $\mathbb{P}(\sigma < \hat{\sigma}) = q$. Chipman et al. recommend considering $q = 0.75$, $0.9$ or $0.99$.

For users who don’t want to choose $\nu$ and $q$, the authors recommend defaults of $\nu = 3$ and $q = 0.9$.

*The $T$ prior*

The prior for a tree $T$ can be specified with 3 ingredients:

- The probability that a node at depth $d$ (the root node having depth 0) is non-terminal.
- If a node is non-terminal, the probability that the $j$th covariate (out of $p$ covariates) is the splitting variable for this node.
- Once the splitting variable for a node has been chosen, a probability over the possible cut-points for this variable.

Chipman et al. suggest the following:

- The probability that a node at depth $d$ is non-terminal is $\alpha (1 + d)^{-\beta}$, where $\alpha \in (0, 1)$ and $\beta \geq 0$ are hyperparameters. The authors suggest $\alpha = 0.95$ and $\beta = 2$ to favor small trees. With this choice, trees with 1, 2, 3, 4 and $\geq 5$ terminal nodes have prior probability of 0.05, 0.55, 0.28, 0.09 and 0.03 respectively.
- For the splitting variable, Chipman et al. suggest the uniform distribution over the covariates.
- Given the splitting variable, Chipman et al. suggest the uniform prior on the discrete set of available splitting values for this variable.

*The $\mu$ prior*

For computational efficiency, Chipman et al. suggest using the conjugate normal distribution

$$\mu_{ij} \mid T_j \sim \mathcal{N}(\mu_\mu, \sigma_\mu^2),$$

where $\mu_\mu$ and $\sigma_\mu$ are hyperparameters. As with the hyperparameters associated with the $\sigma$ prior, the authors suggest setting them in a data-driven way. The way we do this is by ensuring that the prior on $\mathbb{E}[Y \mid x]$ is in the right ballpark. With the prior above, $\mathbb{E}[Y \mid x]$, being a sum of $m$ terminal node values, has the prior $\mathcal{N}(m \mu_\mu, m \sigma_\mu^2)$. Letting $y_{\min}$ and $y_{\max}$ be the min and max observed values of $y$, choose $\mu_\mu$ and $\sigma_\mu$ such that

$$m \mu_\mu - k \sqrt{m}\, \sigma_\mu = y_{\min} \quad \text{and} \quad m \mu_\mu + k \sqrt{m}\, \sigma_\mu = y_{\max}.$$

If we choose $k = 2$, then this choice of hyperparameters means that the prior probability that $\mathbb{E}[Y \mid x] \in (y_{\min}, y_{\max})$ is 0.95.
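Under my reading of the two constraints (the prior on $\mathbb{E}[Y \mid x]$ should put $m\mu_\mu \pm k\sqrt{m}\,\sigma_\mu$ at the observed min and max of $y$), the hyperparameters can be solved for in closed form. A small sketch, with made-up values for $y_{\min}$ and $y_{\max}$:

```python
import math

# Solve m*mu_mu - k*sqrt(m)*sigma_mu = y_min and
#       m*mu_mu + k*sqrt(m)*sigma_mu = y_max  for (mu_mu, sigma_mu).
def mu_prior_hyperparams(y_min, y_max, m, k=2.0):
    mu_mu = (y_min + y_max) / (2.0 * m)
    sigma_mu = (y_max - y_min) / (2.0 * k * math.sqrt(m))
    return mu_mu, sigma_mu

mu_mu, sigma_mu = mu_prior_hyperparams(y_min=-3.0, y_max=5.0, m=200)
# Check that the two constraints are met:
print(200 * mu_mu - 2 * math.sqrt(200) * sigma_mu)   # ~ -3.0
print(200 * mu_mu + 2 * math.sqrt(200) * sigma_mu)   # ~ 5.0
```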

**Other notes**

In theory, once we define a prior and likelihood we have a posterior. The practical question is whether we can derive this posterior or sample from it efficiently. Section 3 of the paper outlines a Bayesian backfitting MCMC algorithm that allows us to sample from the posterior distribution.

The set-up above applies for quantitative $y$. For binary $y$, the paper develops a probit model along similar lines in Section 4.

References:

- Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). BART: Bayesian additive regression trees.
- Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART model search.

This post is about the `Rmpfr` package. `Rmpfr` is R’s wrapper around the C library MPFR, which stands for “Multiple Precision Floating-Point Reliable”. The main function that users will interact with is the `mpfr` function: it converts numeric values into (typically) high-precision numbers, which can then be used for computation. The function’s first argument is the numeric value(s) to be converted, and the second argument, `precBits`, represents the maximal precision to be used, in number of bits. For example, `precBits = 53` corresponds to double precision.

In his blog post, Cook gives an example of computing $\pi$ to 100 decimal places by multiplying the arctangent of 1 by 4 (recall that $\tan(\pi/4) = 1$, so $4 \arctan(1) = \pi$):

    4 * atan(mpfr(1, 333))
    # 1 'mpfr' number of precision 333 bits
    # [1] 3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706807

*Why does he set the precision to 333 bits?* This link suggests that with $n$ bits, we get about $n \log_{10} 2 \approx 0.3n$ decimal digits of precision. (Reality for floating point numbers is not quite as straightforward as that: see this for a discussion. But for our purposes, this approximation will do.) Hence, to get 100 decimal places, we need around $100 / \log_{10} 2 \approx 332.2$ bits, so he rounds it up to 333 bits.
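The back-of-envelope conversion can be checked directly:

```python
import math

# n bits give about n * log10(2) decimal digits,
# so d decimal digits need about d / log10(2) bits.
digits = 100
bits_needed = digits / math.log10(2)
print(bits_needed)              # ~332.19
print(math.ceil(bits_needed))   # 333, the precision used above
```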

The first argument to `mpfr` can be a vector as well:

    mpfr(1:10, 5)
    # 10 'mpfr' numbers of precision  5   bits
    # [1]  1  2  3  4  5  6  7  8  9 10

As the next code snippet shows, R does NOT consider the output of a call to `mpfr` a numeric variable.

    x <- sin(mpfr(1, 100))
    x
    # 1 'mpfr' number of precision 100 bits
    # [1] 0.84147098480789650665250232163005
    is.numeric(x)
    # [1] FALSE

We can use the `asNumeric` function to convert it to a numeric:

    y <- asNumeric(x)
    y
    # [1] 0.841471
    is.numeric(y)
    # [1] TRUE

**Can we use the more familiar as.numeric instead?** According to the function’s documentation, `as.numeric` coerces to both “numeric” and to a vector, whereas `asNumeric()` should keep dim (and other) attributes. We can see this through a small example:

    x <- mpfr(matrix(1:4, nrow = 2), 10)
    x
    # 'mpfrMatrix' of dim(.) = (2, 2) of precision 10 bits
    #      [,1]   [,2]
    # [1,] 1.0000 3.0000
    # [2,] 2.0000 4.0000
    asNumeric(x)
    #      [,1] [,2]
    # [1,]    1    3
    # [2,]    2    4
    as.numeric(x)
    # [1] 1 2 3 4

This post is part of a series on the `glmnet` function (from the package of the same name), hoping to give more detail and insight beyond R’s documentation. In this post, we will look at the **type.gaussian** option.

For reference, here is the full signature of the `glmnet` function (v3.0-2):

    glmnet(x, y, family = c("gaussian", "binomial", "poisson", "multinomial",
        "cox", "mgaussian"), weights, offset = NULL, alpha = 1, nlambda = 100,
        lambda.min.ratio = ifelse(nobs < nvars, 0.01, 1e-04), lambda = NULL,
        standardize = TRUE, intercept = TRUE, thresh = 1e-07, dfmax = nvars + 1,
        pmax = min(dfmax * 2 + 20, nvars), exclude, penalty.factor = rep(1, nvars),
        lower.limits = -Inf, upper.limits = Inf, maxit = 1e+05,
        type.gaussian = ifelse(nvars < 500, "covariance", "naive"),
        type.logistic = c("Newton", "modified.Newton"),
        standardize.response = FALSE, type.multinomial = c("ungrouped", "grouped"),
        relax = FALSE, trace.it = 0, ...)

**type.gaussian**

According to the official R documentation,

Two algorithm types are supported for (only) `family="gaussian"`. The default when `nvar<500` is `type.gaussian="covariance"`, and saves all inner-products ever computed. This can be much faster than `type.gaussian="naive"`, which loops through nobs every time an inner-product is computed. The latter can be far more efficient for `nvar >> nobs` situations, or when `nvar > 500`.

Generally speaking there is no need for you as the user to change this option.

**How do the fitting times compare?**

I ran a timing simulation to compare the function run times of `type.gaussian="naive"` vs. `type.gaussian="covariance"` for a range of values for the number of observations (`nobs` or $n$) and the number of features (`nvar` or $p$). The results are shown below. (For the R code that generated these plots, see here.)

The first panel of boxplots shows the time taken for `type.gaussian="naive"` to complete as a fraction (or multiple) of that for `type.gaussian="covariance"` (each boxplot represents 5 simulation runs). As advertised, naive runs more slowly for small values of $p$ but more quickly for large values of $p$. The difference seems to be more stark when $n$ is larger.

This next plot shows the absolute fitting times: note the log scale on both the x and y axes.

*So, what algorithms do these two options represent?* What follows is based on the paper behind the glmnet package (Friedman, Hastie & Tibshirani 2010, “Regularization Paths for Generalized Linear Models via Coordinate Descent”).

Let $y_i$ denote the response for observation $i$, and let $x_{ij}$ denote the value of feature $j$ for observation $i$. Assume that the response and the features are standardized to have mean zero, so that we don’t have to worry about fitting an intercept term. For each value of $\lambda$ in `lambda`, `glmnet` is minimizing the following objective function (written here for the lasso case $\alpha = 1$):

$$J(\beta) = \frac{1}{2n} \sum_{i=1}^n \Big( y_i - \sum_{j=1}^p x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^p |\beta_j|.$$

We minimize the expression above by *cyclic coordinate descent*. That is, we cycle through the features $j = 1, \dots, p$. For each $j$, treat $J$ as a function of $\beta_j$ alone (holding all the other coefficients fixed) and update $\beta_j$ to the minimizing value:

$$\hat{\beta}_j \leftarrow S\Big( \frac{1}{n} \sum_{i=1}^n x_{ij} r_i^{(j)},\ \lambda \Big),$$

where $S(z, \lambda) = \mathrm{sign}(z)(|z| - \lambda)_+$ is the soft-thresholding operator and $r_i^{(j)} = y_i - \sum_{k \neq j} x_{ik} \hat{\beta}_k$ is the *partial residual*.

Both of the modes minimize the objective in this way. Where they differ is in how they keep track of the quantities needed to do the update above. From here on, assume that the data has been standardized. (What follows works for unstandardized data as well, just with more complicated expressions.)
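To make the coordinate descent update concrete, here is a bare-bones pure-Python lasso sketch in the spirit of the naive mode (an illustration only, not glmnet's actual implementation; it assumes mean-zero columns scaled so that $\frac{1}{n}\sum_i x_{ij}^2 = 1$):

```python
# Cyclic coordinate descent for the lasso, tracking full residuals
# r_i = y_i - x_i^T beta ("naive"-style updating).
def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_cycles=100):
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    r = list(y)                              # full residuals
    for _ in range(n_cycles):
        for j in range(p):
            # argument of the soft-thresholding operator:
            # (1/n) <x_j, r> + beta_j  (valid under the scaling assumption)
            z = sum(X[i][j] * r[i] for i in range(n)) / n + beta[j]
            new_bj = soft_threshold(z, lam)
            if new_bj != beta[j]:            # O(n) residual update if changed
                delta = new_bj - beta[j]
                for i in range(n):
                    r[i] -= delta * X[i][j]
                beta[j] = new_bj
    return beta

# Tiny check: two orthogonal, standardized, mean-zero columns, so the
# solution is soft-thresholding of (1/n) <x_j, y> coordinate-wise.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.0, 0.0, 1.0, -3.0]
print(lasso_cd(X, y, lam=0.5))   # [1.0, 0.5]
```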

`type.gaussian = "naive"`

As the features are standardized, we can write the argument of the soft-thresholding operator as

$$\frac{1}{n} \sum_{i=1}^n x_{ij} r_i^{(j)} = \frac{1}{n} \sum_{i=1}^n x_{ij} r_i + \hat{\beta}_j,$$

where $r_i = y_i - \sum_k x_{ik} \hat{\beta}_k$ is the full residual for observation $i$. In this mode, we keep track of the full residuals $r_i$, $i = 1, \dots, n$.

- At a coordinate descent step for feature $j$, if the coefficient $\hat{\beta}_j$ doesn’t change its value, no updating of the residuals is needed. However, to get the soft-thresholding argument for the next feature ($j + 1$), we need $O(n)$ operations to compute the sum on the RHS above.
- If $\hat{\beta}_j$ changes value, then we have to update the $r_i$’s, then recompute the argument for the next feature using the expression on the RHS. This also takes $O(n)$ time.

All in all, a full cycle through all $p$ variables costs $O(np)$ operations.

`type.gaussian = "covariance"`

Ignoring the factor of $1/N$, note that the first term on the right-hand side of the naive update, $\sum_i x_{ij} r_i$, can be written as

$$\sum_{i=1}^N x_{ij} r_i = \langle x_j, y \rangle - \sum_{k: |\beta_k| > 0} \langle x_j, x_k \rangle \beta_k.$$

In this mode, we compute all the inner products $\langle x_j, y \rangle$ ($p$ of them), which takes $O(Np)$ operations. For each $k$ such that $|\beta_k| > 0$, we store the current values of $\langle x_j, x_k \rangle$ (there are $p$ of them for each such $k$).

- At a coordinate descent step for feature $j$, if $\beta_j \neq 0$ at the beginning of the step and its value changes, we need to update the sum above, which takes $O(q)$ operations. Then, to calculate the soft-thresholding argument for the next coordinate descent step, we again only need $O(q)$ operations, where $q$ is the number of non-zero coefficients at the moment.
- As such, if no new variables become non-zero in a full cycle through the $p$ features, one full cycle takes only $O(pq)$ operations.
- If a new feature $k$ enters the model for the first time (i.e. $\beta_k$ becomes non-zero), then we need to compute and store $\langle x_j, x_k \rangle$ for $j = 1, \dots, p$, which takes $O(Np)$ operations.

This form of updating avoids the $O(N)$ residual updating needed at every step for each feature in naive mode. While we sometimes incur $O(Np)$ operations when a new variable enters the model, such events don't happen often. Also, we have $q \leq \min(N, p)$, so if $q$ is small or if $q \ll N$, the $O(pq)$ operations for one full cycle pale in comparison with the $O(Np)$ operations for naive updating.
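A matching sketch of covariance updating (again illustrative, not glmnet's actual code): instead of residuals, we cache $\langle x_j, y \rangle$ up front and $\langle x_j, x_k \rangle$ whenever feature $k$ first enters the model:

```python
def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def covariance_coordinate_descent(X, y, lam, n_cycles=100):
    """Cyclic coordinate descent for the lasso using covariance updates.
    X: list of N rows of p standardized features (sum_i x_ij^2 = N)."""
    N, p = len(X), len(X[0])
    beta = [0.0] * p
    # <x_j, y> for all j: O(Np), done once
    xy = [sum(X[i][j] * y[i] for i in range(N)) for j in range(p)]
    xx = {}  # xx[k][j] = <x_j, x_k>, filled in when feature k first goes non-zero
    for _ in range(n_cycles):
        for j in range(p):
            # <x_j, r^(j)> = <x_j, y> - sum over active k != j of <x_j, x_k> beta_k
            s = xy[j]
            for k, prods in xx.items():
                if k != j and beta[k] != 0.0:
                    s -= prods[j] * beta[k]  # O(q) per feature
            b_new = soft_threshold(s / N, lam)
            if b_new != 0.0 and j not in xx:
                # feature j enters the model: store <x_j, x_k> for all k, O(Np)
                xx[j] = [sum(X[i][j] * X[i][k] for i in range(N)) for k in range(p)]
            beta[j] = b_new
    return beta
```

On the same data, this should return the same coefficients as the naive sketch; only the bookkeeping differs.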

In many applications we might face massive *class imbalance*, that is, many more examples of one class than the other. (Without loss of generality, assume that class 0 happens a lot more often than class 1.) For example, in fraud detection, we will have many more examples of non-fraudulent transactions than fraudulent ones. (Read more about the class imbalance problem here.)

One way to address class imbalance is to make the classes more balanced by

- **Undersampling** from the majority class (i.e. just choose a fraction of the class 0 instances and feed only those to the model), or
- **Oversampling** from the minority class (i.e. randomly duplicate class 1 instances until we reach balance), or
- doing both undersampling and oversampling.

**What could go wrong with naive oversampling?**

By naive oversampling, I mean replicating the minority class instance exactly as it is. So for example, if I am oversampling by 200%, that means that each minority class instance appears in my dataset 3 times.

Chawla et al. (2002) give an example of what might go wrong with naive oversampling. They had a mammography dataset (from Reference 2), where the goal was to build a binary classifier that would look at a mammogram image and determine if there was microcalcification (minority class) or not (majority class). A decision tree was fit to the original data. The picture below shows the majority class instances in green circles and minority class instances in red circles. The solid black rectangle is one of the terminal decision regions of the decision tree, and it predicts majority (no microcalcification) for this box. This might be problematic because we see 3 minority class samples inside this box: the decision tree will misclassify them.

*What happens if we oversample by replicating the minority class instances and refit the decision tree?* What Chawla et al. ended up with was a decision tree that overfit to the data. Because we replicated the minority class instances, each of the 3 minority instances became important enough that the decision tree could not afford to misclassify them. As a result, it created small terminal decision regions just around the minority examples, as we can see in the figure below.

**SMOTE (Synthetic Minority Over-sampling TEchnique)**

SMOTE (Chawla et al.) is a resampling scheme that creates synthetic minority class examples based on the original ones. There are two parameters for SMOTE: the amount of oversampling $N$ (as a percentage) and the number of nearest neighbors $k$. If the amount of oversampling is $N\%$ and there are $n$ original minority class instances, then SMOTE will generate $(N/100) \cdot n$ synthetic minority class samples. For simplicity, we will only describe SMOTE for $N$ being a multiple of 100. The role of $k$ will become clear shortly.

Here is the SMOTE algorithm. Assume that the original minority class instances are $x_1, \dots, x_n$.

- For each $i = 1, \dots, n$, find the $k$ nearest neighbors of $x_i$ in feature space and denote them by $z_{i1}, \dots, z_{ik}$. (We ignore the majority class instances when looking for the nearest neighbors, i.e. we only consider other minority class instances.)
- For $i = 1, \dots, n$:
  - For $j = 1, \dots, N/100$:
    - Choose one of the $k$ elements of the set $\{z_{i1}, \dots, z_{ik}\}$ uniformly at random and denote it by $z$.
    - Return the new sample $x_i + u(z - x_i)$, where $u \sim \mathrm{Unif}[0, 1]$. That is, the new sample lies (in feature space) at a random point along the line segment joining the original minority class instance and one of its $k$ nearest neighbors.

The intuition here is that in feature space, the regions lying between minority class instances probably belong to the minority class as well, so we “fill up” that space with our synthetic minority class examples. The computational bottleneck in this algorithm is in computing the nearest neighbors for each minority class instance (Step 1), although it’s probably not too bad because (i) there aren’t many instances to begin with, and (ii) when computing nearest neighbors, we only care about minority class instances.
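The algorithm translates almost line for line into code. The sketch below uses brute-force nearest neighbors and plain Python lists; the function name and interface are my own for illustration, not from the SMOTE paper or any package:

```python
import math
import random

def smote(minority, N_pct, k, seed=0):
    """Generate (N_pct / 100) * len(minority) synthetic minority samples.
    minority: list of feature vectors; N_pct must be a multiple of 100."""
    rng = random.Random(seed)
    n = len(minority)
    synthetic = []
    for i, x in enumerate(minority):
        # Step 1: k nearest minority-class neighbors of x (brute force)
        others = [minority[j] for j in range(n) if j != i]
        neighbors = sorted(others, key=lambda z: math.dist(x, z))[:k]
        # Step 2: generate N_pct / 100 synthetic samples from x
        for _ in range(N_pct // 100):
            z = rng.choice(neighbors)  # one of the k neighbors, uniformly
            u = rng.random()           # u ~ Unif[0, 1]
            synthetic.append([xi + u * (zi - xi) for xi, zi in zip(x, z)])
    return synthetic
```

For example, `smote(corners, 200, k=2)` on the four corners of the unit square returns 8 synthetic points, each lying on an edge of the square (since every synthetic point is a convex combination of an instance and one of its neighbors).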

The authors recommend that SMOTE be used in combination with undersampling of the majority class for better results.

**SMOTE in R**

There are multiple implementations of the SMOTE algorithm in R. Two that I found were `DMwR::SMOTE()` and `smotefamily::SMOTE()`. Both of these functions have $k = 5$ as a default, which is what Chawla et al. suggested. It appears that `DMwR::SMOTE()` does both the oversampling and undersampling, while `smotefamily::SMOTE()` only does the oversampling.

References:

- Chawla et al. (2002). SMOTE: synthetic minority oversampling technique.
- Woods et al. (1993). Comparative evaluation of pattern recognition techniques for detection of microcalcifications.