Harrell’s C-index (also known as the concordance index), introduced in Harrell et al. (1982), is a goodness-of-fit measure for models that produce risk scores. It is commonly used to evaluate risk models in survival analysis, where data may be censored.
For concreteness, let’s imagine that we are in the clinical setting. For a given patient, we are interested in predicting his/her “time-to-disease”, i.e. the length of time until he/she develops the disease of interest, given covariates (e.g. demographic information) that we have on hand today. To train a model to make these predictions, we have $n$ patients with their covariate information $x_1, \dots, x_n$ and “time-to-event” responses $T_1, \dots, T_n$.
Survival analysis is different from typical regression/classification because of the nature of the response $T_i$. For a given disease, not all patients are going to develop the disease. In clinical settings, it’s also possible that the patient never comes for a follow-up visit, so we won’t know if the patient developed the disease or not. Hence, for the $i$th patient, the “time-to-event” response $T_i$ is either
- The actual time-to-disease if we get to observe it, or
- The last time at which we know that the patient did not have the disease.
To know which of the two cases happened, we have an auxiliary variable $\delta_i$ such that $\delta_i = 1$ if we got to see the disease, and $\delta_i = 0$ otherwise. If $\delta_i = 0$, patient $i$’s time-to-event is said to be “censored”.
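To make the two cases concrete, here is a minimal sketch of how the observed time and the censoring indicator could be built for simulated patients. The function name `observe`, the variable names and the numbers are all made up for illustration. It assumes we know both the true time-to-disease and the last follow-up time, which is only possible in a simulation: for a real censored patient the true time-to-disease is never seen.

```python
def observe(event_time, censor_time):
    """Return (T, delta) for one patient under right-censoring.

    T is the observed time-to-event and delta is 1 if the disease
    was observed, 0 if the patient was censored.
    """
    if event_time <= censor_time:
        return event_time, 1   # we got to see the disease
    return censor_time, 0      # last time we know the patient was disease-free


# Hypothetical patients: (true time-to-disease, last follow-up time), in years.
# float("inf") stands for a patient who never develops the disease.
patients = [(2.0, 5.0), (7.5, 4.0), (3.0, 3.5), (float("inf"), 6.0)]

observed = [observe(e, c) for e, c in patients]
times = [t for t, _ in observed]    # observed times-to-event
deltas = [d for _, d in observed]   # 1 = disease observed, 0 = censored
```

Here the second and fourth patients end up censored: all we know is that they were disease-free at their last follow-up.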
The intuition behind Harrell’s C-index is as follows. For patient $i$, our risk model assigns a risk score $\eta_i$. If our risk model is any good, patients who had shorter times-to-disease should have higher risk scores. Boiling this intuition down to two patients: the patient with the higher risk score should have a shorter time-to-disease.
We can compute the C-index in the following way: for every pair of patients $i$ and $j$ (with $i \neq j$), look at their risk scores $\eta_i, \eta_j$ and times-to-event $T_i, T_j$.
- If both $T_i$ and $T_j$ are not censored, then we can observe when both patients got the disease. We say that the pair $(i, j)$ is a concordant pair if $\eta_i > \eta_j$ and $T_i < T_j$, and it is a discordant pair if $\eta_i > \eta_j$ and $T_i > T_j$.
- If both $T_i$ and $T_j$ are censored, then we don’t know who got the disease first (if at all), so we don’t consider this pair in the computation.
- If exactly one of $T_i$ and $T_j$ is censored, we only observe one disease. Let’s say we observe patient $i$ getting the disease at time $T_i$, and that $T_j$ is censored. (The same logic holds for the reverse situation.)
  - If $T_j < T_i$, then we don’t know for sure who got the disease first, so we don’t consider this pair in the computation.
  - If $T_j > T_i$, then we know for sure that patient $i$ got the disease first. Hence, $(i, j)$ is a concordant pair if $\eta_i > \eta_j$, and is a discordant pair if $\eta_i < \eta_j$.
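The case analysis above can be sketched as a small Python function. The name `classify_pair` and its argument names are hypothetical. The handling of ties (equal times or equal risk scores are skipped) is my own assumption, since the cases above don’t say what to do with them.

```python
def classify_pair(eta_i, t_i, d_i, eta_j, t_j, d_j):
    """Classify a pair of patients as 'concordant', 'discordant' or 'skip'.

    eta: risk score, t: observed time-to-event,
    d: 1 if the disease was observed, 0 if censored.
    """
    if t_i == t_j:
        return "skip"  # assumption: tied times are left out

    # Reorder so that "first" is the patient with the shorter observed time.
    if t_i < t_j:
        first_eta, first_d, second_eta = eta_i, d_i, eta_j
    else:
        first_eta, first_d, second_eta = eta_j, d_j, eta_i

    # If the shorter time is censored, we can't tell who got the disease first.
    # This covers "both censored" and "censored before the observed disease".
    if first_d == 0:
        return "skip"

    # The patient who got the disease first should have the higher risk score.
    if first_eta > second_eta:
        return "concordant"
    if first_eta < second_eta:
        return "discordant"
    return "skip"  # assumption: tied risk scores count as neither
```

Note that the only thing that matters is whether the shorter of the two times is an observed disease; whether the longer time is censored or not makes no difference.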
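Harrell’s C-index is simply
$$c = \frac{\#\text{ concordant pairs}}{\#\text{ concordant pairs} + \#\text{ discordant pairs}}.$$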
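The logic above can be expressed succinctly in a formula (taken from Reference 2):
$$c = \frac{\sum_{i \neq j} 1\{\eta_i < \eta_j\} \, 1\{T_i > T_j\} \, \delta_j}{\sum_{i \neq j} 1\{T_i > T_j\} \, \delta_j},$$
where $\eta_i$ is patient $i$’s risk score, $T_i$ is the observed time-to-event, and $\delta_j = 1$ if patient $j$’s disease was observed ($\delta_j = 0$ if censored).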
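Here is a rough pure-Python sketch of this computation. It loops over ordered pairs and keeps a pair whenever patient j’s disease is observed and happens before patient i’s observed time. The function name `harrell_c` and the toy data are made up for illustration. One thing to be aware of: in this form, a pair with tied risk scores still counts in the denominator but adds nothing to the numerator.

```python
def harrell_c(eta, times, deltas):
    """Harrell's C-index from risk scores, observed times and event indicators."""
    n = len(times)
    numerator = 0    # usable pairs that are concordant
    denominator = 0  # all usable pairs
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Usable pair: j's disease is observed (delta_j = 1)
            # and it happens before i's observed time.
            if times[i] > times[j] and deltas[j] == 1:
                denominator += 1
                # Concordant: j got the disease first and has the higher score.
                if eta[i] < eta[j]:
                    numerator += 1
    if denominator == 0:
        raise ValueError("no usable pairs, C-index is undefined")
    return numerator / denominator


# Toy data (hypothetical): patients 2 and 4 are censored.
times = [2.0, 4.0, 3.0, 6.0]
deltas = [1, 0, 1, 0]
eta = [0.8, 0.3, 0.5, 0.6]

c = harrell_c(eta, times, deltas)  # 4 concordant pairs out of 5 usable pairs
```

The double loop is quadratic in the number of patients, which is fine for a sanity check but not something I’d use on a large dataset.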
Values of $c$ near 0.5 indicate that the risk score predictions are no better than a coin flip in determining which patient will remain disease-free longer. Values near 1 indicate that the risk scores are good at determining which of two patients will have the disease first. Values near 0 mean that the risk scores are worse than a coin flip: you might be better off concluding the opposite of what the risk scores tell you.
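To see the “conclude the opposite” point in action, here is a small sketch on made-up data: negating every risk score swaps the concordant and discordant pairs, so (when no two risk scores are tied) the C-index goes from $c$ to $1 - c$. The function name `c_index` and the data are hypothetical, and tied risk scores are assumed to count as neither concordant nor discordant.

```python
from itertools import permutations


def c_index(eta, times, deltas):
    """C-index as concordant / (concordant + discordant)."""
    concordant = discordant = 0
    for i, j in permutations(range(len(times)), 2):
        # Usable ordered pair: j's disease is observed and comes before T_i.
        if deltas[j] == 1 and times[j] < times[i]:
            if eta[j] > eta[i]:
                concordant += 1
            elif eta[j] < eta[i]:
                discordant += 1
    return concordant / (concordant + discordant)


# Toy data (hypothetical): patients 2 and 4 are censored.
times = [2.0, 4.0, 3.0, 6.0]
deltas = [1, 0, 1, 0]
eta = [0.8, 0.3, 0.5, 0.6]

c = c_index(eta, times, deltas)
c_flipped = c_index([-s for s in eta], times, deltas)
print(c, c_flipped)  # prints: 0.8 0.2
```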
Harrell’s C-index for continuous data
Of course, one can also compute the C-index if none of the data is censored. In that case, all pairs $(i, j)$ such that $i \neq j$ will be included in the computation.
Harrell’s C-index for binary data
As Reference 3 suggests, the concept of the C-index can be easily ported over to binary data. In this setting, a high risk score prediction means the response is more likely to be a 1 than a 0. We only consider pairs $(i, j)$ where subject $i$’s response is a 1 and subject $j$’s response is a 0. The pair is concordant if $\eta_i > \eta_j$, and discordant if $\eta_i < \eta_j$.
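The binary version can be sketched in a few lines of Python. The function name `c_index_binary` and the data are made up for illustration, and pairs with tied scores are assumed to count as neither concordant nor discordant. When no scores are tied, this quantity coincides with the area under the ROC curve (AUC).

```python
def c_index_binary(eta, y):
    """C-index for binary responses: compare every (1, 0) pair of subjects."""
    concordant = discordant = 0
    for eta_i, y_i in zip(eta, y):
        if y_i != 1:
            continue  # subject i must have response 1
        for eta_j, y_j in zip(eta, y):
            if y_j != 0:
                continue  # subject j must have response 0
            if eta_i > eta_j:
                concordant += 1   # the 1 got the higher score
            elif eta_i < eta_j:
                discordant += 1   # the 0 got the higher score
    return concordant / (concordant + discordant)


# Toy data (hypothetical): two subjects with response 1, three with response 0.
y = [1, 0, 1, 0, 0]
eta = [0.9, 0.2, 0.4, 0.6, 0.1]

c = c_index_binary(eta, y)  # 5 concordant pairs out of 6
```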
References
1. Harrell Jr, F. E. et al. (1982). Evaluating the yield of medical tests.
2. Schmid, M. et al. (2016). On the use of Harrell’s C for clinical risk prediction via random survival forests.
3. Statistics How To. What is a C-Statistic?