Medians in high dimensions

The median is a common statistical measure for central tendency, i.e. a value that lies at the “center” of a sample of observations. In one dimension it is easy to describe. If the observations are x_1, \dots, x_n \in \mathbb{R}, we simply line them up along the real number line and report the one right in the middle (or the mean of the two right in the middle if n is even).

It turns out that in higher dimensions, there are several “natural” ways to define the median. These medians are defined for points in \mathbb{R}^d; when d = 1, they all coincide with the definition of the one-dimensional (univariate) median.

One way to define the median in higher dimensions is to take the univariate median along each dimension. More concretely, if x_{ij} is the jth coordinate of the ith observation, then the median is a d-dimensional vector such that its jth coordinate is the median of x_{1j}, x_{2j}, \dots, x_{nj}. This is known as the marginal median.

Another way to define the median in higher dimensions is to note that in one dimension, the median minimizes the sum of distances to the points, i.e.

\text{med}(x_1, \dots, x_n) \in \underset{x \in \mathbb{R}}{\text{argmin}} \; \sum_{i=1}^n |x - x_i|.

We can take this to be the definition of the median and extend it easily to higher dimensions:

\text{med}(x_1, \dots, x_n) \in \underset{x \in \mathbb{R}^d}{\text{argmin}} \; \sum_{i=1}^n \|x - x_i\|,

where \| \cdot \| is Euclidean distance in \mathbb{R}^d. This is also known as the geometric median or the L_1 median.

One is not limited to using Eucliean distance in the definition above; for any distance metric d over \mathbb{R}^d we could define a median as

\text{med}(x_1, \dots, x_n) \in \underset{x \in \mathbb{R}^d}{\text{argmin}} \; \sum_{i=1}^n d(x, x_i).

The medoid takes this approach, except that it limits the possible values that the median can take on to the set of observations itself:

\text{med}(x_1, \dots, x_n) \in \underset{x \in \{x_1, \dots, x_n\}}{\text{argmin}} \; \sum_{i=1}^n d(x, x_i).

Yet another way to generalize the univariate median is to realize that the median divides the one-dimensional space such that half the points lie on each side of it. The centerpoint generalizes this to d dimensions: it is a point such that any hyperplane passing through it divides the observations into two sets, each containing at least \dfrac{1}{d+1} of the observations.

The centerpoint is closely related to the notion of Tukey depth, introduced by John Tukey in 1974. Given a set of observations x_1, \dots, x_n, the Tukey depth of a point x is the size of the smallest subset of points on either side of any hyperplane passing through x. Mathematically,

\text{depth}(x; x_1, \dots, x_n) = \underset{\|u\|_2 = 1}{\min} \; \# \{i: u^T x_i \geq u^T x \}.

Donoho & Gasko (1992) noted that this could be the basis of a definition for a multivariate median which is now known as the Tukey median: a point that maximizes Tukey depth, i.e.

\begin{aligned} \text{med}(x_1, \dots, x_n) &\in \underset{x \in \mathbb{R}^d}{\text{argmax}} \; \text{depth}(x; x_1, \dots, x_n) \\ &= \underset{x \in \mathbb{R}^d}{\text{argmax}} \left[ \underset{\|u\|_2 = 1}{\min} \; \# \{i: u^T x_i \geq u^T x \}\right]. \end{aligned}

References:

  1. Median: Multivariate median. Wikipedia.
  2. Breakdown properties of location estimates based on halfspace depth and projected outlyingness. David L. Donoho & Miriam Gasko, The Annals of Statistics, 1992.

2 thoughts on “Medians in high dimensions

Leave a reply to ya wei Cancel reply