tl;dr: Raking (or iterative proportional fitting) is “a post-stratification procedure for adjusting sample weights in a survey so that the adjusted weights add up to known population totals for the post-stratified classifications when only the marginal population totals are known.” (Reference 1)
While the idea behind raking is pretty simple, I had trouble finding an example to demonstrate the procedure. In what follows, I walk through a toy example of how to apply raking to an estimation problem. (It shouldn’t be too hard to transfer the ideas into mathematical notation.)
Imagine that we want to estimate the mean number of hours of TV people in Fantasyland watch each week. Fantasyland is pretty diverse: we can imagine that it is made up of a number of distinct subpopulations, each having very different TV watching habits. For this toy example, assume that there are 2 age groups (child, adult) and 3 household income groups (< $40k/yr, $40-200k/yr, >$200k/yr). Assume that people who fall within the same (age, income) group watch exactly the same amount of TV. The following two tables show the number of people falling in each (age, income) group and how much TV each group watches per week:
If we knew the values in these two tables, then we would know the true value of the mean we are looking for:
We go and get some data…
Of course, we don’t know the values in the tables above, and so we get a random sample of Fantasyland inhabitants and ask them how much TV they watch. We can then tabulate our data in two tables much like the earlier ones:
(In this example, because we assume that people who fall within the same (age, income) group watch exactly the same amount of TV, the data in the right-hand table will always be the same. We will omit it in future figures.)
Problem with simply taking the mean
One might estimate the true mean TV hours watched per week by taking a simple mean of the data we collected in our sample:
This estimate is WAY off from the truth! What happened? Let’s look at the population and sample size tables again:
The relative proportions of the 6 subgroups in our sample are completely different from those in the population! For example, people with household income >$200k make up less than 1% of our population, but more than 50% of our sample. As a result, the simple mean of our sample grossly overweights the values of this group, giving us an estimate that is severely biased upwards.
A possible workaround, but what if…?
If we knew the values in the population table, we could just use those values as sample weights instead of the relative frequencies in the sample we took. In this example, that would give us the true value of the mean exactly.
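To make the contrast concrete, here is a minimal sketch in Python (NumPy). All numbers are made up for illustration and are not the values from the tables above; the array names `pop_counts`, `sample_counts` and `tv_hours` are hypothetical.

```python
import numpy as np

# Hypothetical 2 x 3 tables (rows: child, adult; columns: low, middle, high income).
# These numbers are invented for illustration only.
pop_counts = np.array([[300.0, 500.0, 10.0],
                       [400.0, 700.0, 20.0]])
sample_counts = np.array([[5.0, 10.0, 30.0],
                          [5.0, 10.0, 40.0]])
tv_hours = np.array([[20.0, 15.0, 60.0],
                     [30.0, 25.0, 70.0]])  # everyone in a cell watches the same amount

# Simple mean of the sample: each cell is weighted by its sample count.
simple_mean = np.average(tv_hours, weights=sample_counts)

# Population-weighted mean: each cell is weighted by its population count.
true_mean = np.average(tv_hours, weights=pop_counts)

print(f"simple mean: {simple_mean:.1f}, population-weighted mean: {true_mean:.1f}")
```

In this made-up setup the high-income column dominates the sample, so the simple mean lands far above the population-weighted mean, mirroring the problem described above.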
But what if we didn’t know the values in the cells, just the row and column totals? Can we do better than just the simple mean?
While this scenario might look contrived, it does happen in practice. For example, maybe Fantasyland did a census recently and disclosed the row and column totals, but for privacy purposes they did not release the cell values. (This scenario happens more often when we have many demographic variables. For example, in the US we know how many females there are, how many Native Americans there are, how many people live in Texas, and how many are between 30 and 39 years old, but we don’t know the number of female Native American Texans who are 30-39 years old.)
Raking to the rescue!
Raking, also known as iterative proportional fitting (IPF), is a method for adjusting sample weights so that they more accurately reflect the true population weights. The goal of raking is to adjust the sample weights so that the row and column totals (also known as the marginals) mimic those of the population.
It achieves its goal in the following way. Let’s say we are raking a k-way table for variables 1 to k, and that we only know the population marginals.
1. Multiply all the cells in the sample size table by the same constant so that the cells sum up to the true population total.
2. Look at variable 1. For each slice of the table that has the same value for variable 1, multiply the cells in this slice by the same constant so that these cells sum up to the true population marginal. Do the same for variables 2, 3, …, k.
3. Repeat Step 2 until convergence.
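The three steps above can be sketched in Python with NumPy. This is a minimal illustration rather than a reference implementation: the function name `rake`, its parameters, the stopping tolerance, and the example numbers are all my own assumptions, not taken from the references.

```python
import numpy as np

def rake(sample_counts, pop_marginals, max_iter=100, tol=1e-8):
    """Rake a multi-way table of sample counts to known population marginals.

    pop_marginals[i] holds the population totals for variable i (axis i).
    """
    weights = np.asarray(sample_counts, dtype=float).copy()
    targets = [np.asarray(m, dtype=float) for m in pop_marginals]
    axes = range(weights.ndim)

    def marginal(axis):
        # Sum over every axis except the given one.
        return weights.sum(axis=tuple(a for a in axes if a != axis))

    # Step 1: scale all cells so the grand total matches the population total.
    weights *= targets[0].sum() / weights.sum()

    for _ in range(max_iter):
        # Step 2: for each variable, rescale each slice to its target marginal.
        for axis, target in enumerate(targets):
            shape = [1] * weights.ndim
            shape[axis] = -1
            weights *= (target / marginal(axis)).reshape(shape)

        # Step 3: stop once every marginal is within tol of its target.
        gap = max(np.abs(marginal(ax) - t).max() for ax, t in enumerate(targets))
        if gap < tol:
            break
    return weights

# Invented example: 2 age groups x 3 income groups.
sample = [[5, 10, 30], [5, 10, 40]]
row_totals = [810, 1120]      # hypothetical population totals by age
col_totals = [700, 1200, 30]  # hypothetical population totals by income
raked = rake(sample, [row_totals, col_totals])
print(raked.round(1))
```

Note that this sketch divides by the current marginals, so it assumes none of them is zero.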
Raking is best demonstrated by example. In our TV watching example, in the first step we inflate all cells in our sample size table so that the cells sum up to the true population total, 51,650. (Note: Tables from here on out round the numbers to the nearest integer, so you may see some inconsistencies due to rounding.)
Since everything was inflated by the same factor, computing our estimate with these weights will just give us the same value: 48.1.
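To see why, write the weighted estimate explicitly. The notation here is my own, not from the references: w_g is the weight of (age, income) group g, y_g is the weekly TV hours for that group, and c > 0 is the common inflation factor.

```latex
\hat{\mu}(w) = \frac{\sum_g w_g \, y_g}{\sum_g w_g},
\qquad
\hat{\mu}(cw) = \frac{\sum_g (c\,w_g)\, y_g}{\sum_g c\,w_g}
             = \frac{c \sum_g w_g \, y_g}{c \sum_g w_g}
             = \hat{\mu}(w).
```

The estimate only changes once different cells are multiplied by different constants, which is what the row and column adjustments do.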
Let’s look at the variable Age. We will multiply the cells in row 1 by one constant and the cells in row 2 by another constant so that the row totals match the true population row totals:
Next, let’s look at the variable Household Income. We will multiply the cells in each column by a constant such that the column totals match those of the true population:
This completes one round of Step 2 of the raking algorithm. Notice that the sample weights more closely resemble those of the population already. If we use these weights for our estimate, we get something much more reasonable than the simple mean:
After this second operation, the column marginals match the population ones, but the row marginals no longer match their population counterparts. We can do another round of adjustment: first matching the row totals:
and then matching the column totals:
We now have the estimate
After the last operation, both the row and column totals match their population counterparts, so the raking procedure has converged.
Further notes on raking
- If we only have one variable that we are adjusting for (say Age or Household Income in our example), then raking is the same as post-stratification. (I talk a little bit about post-stratification here.)
- While the marginals of a raked table will be very close to those of the population, there are no guarantees for any of the totals based on more than one variable. In our example, we cannot guarantee that the cell values of the raked table will be close to the cell values in the population table.
- Raking cannot work if any of the sample marginals are equal to zero, since it would involve dividing by zero. Reference 2 suggests adding a small amount to these cells (e.g. 0.001 if all other marginals are whole numbers) to circumvent this problem.
- Raking will not adjust cells in the table that are equal to zero, since multiplying zero by any constant still gives zero. The workaround is the same as for the previous point: adding a small amount to these cells.
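The zero-cell behavior in the last point can be seen in a small Python sketch. The helper `rake_2d`, the fixed number of rounds, and all of the numbers are hypothetical and chosen only for illustration; the 0.001 patch follows the suggestion attributed to Reference 2 above.

```python
import numpy as np

def rake_2d(table, row_totals, col_totals, n_rounds=50):
    """Alternately rescale rows and columns of a 2-D table for a fixed number of rounds."""
    w = np.asarray(table, dtype=float).copy()
    rows = np.asarray(row_totals, dtype=float)
    cols = np.asarray(col_totals, dtype=float)
    for _ in range(n_rounds):
        w *= (rows / w.sum(axis=1))[:, None]  # match row totals
        w *= (cols / w.sum(axis=0))[None, :]  # match column totals
    return w

# Invented sample with an empty (child, high income) cell.
sample = np.array([[5.0, 10.0, 0.0],
                   [5.0, 10.0, 40.0]])
row_totals = [810.0, 1120.0]
col_totals = [700.0, 1200.0, 30.0]

raked = rake_2d(sample, row_totals, col_totals)
print(raked[0, 2])  # the empty cell is still zero after raking

# Workaround: replace zero cells with a small positive value before raking.
patched = sample.copy()
patched[patched == 0] = 0.001
raked_patched = rake_2d(patched, row_totals, col_totals)
print(raked_patched[0, 2] > 0)  # the cell can now receive some weight
```

Whether a patched cell should receive weight at all is a judgment call: if the cell is truly empty in the population, leaving it at zero may be the right answer.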
Hunsinger has Excel spreadsheets that walk through raking for 2, 3 and 4 variables, which I think are very helpful. They are available as links within the first 3 links of Reference 2 (e.g. the link on slide 6 of “Iterative Proportional Fitting for a Two-Dimensional Table”).
Reference 3 contains a good list of tips and things to think about when using raking.
1. Lavrakas, P. J. (ed.). (2008). Entry on Raking in Encyclopedia of Survey Research Methods.
2. Hunsinger, E. Iterative Proportional Fitting Information and Code.
3. Battaglia, M. P., Hoaglin, D. C., and Frankel, M. R. (2009). Practical Considerations in Raking Survey Data.