Friday, November 12, 2010

Adventures in Demographic Modelling

I recently developed a whim to try and create a demographic electoral model. That is, a model that would look at the distribution of eight demographic variables across the 51 US federal voting regions (50 states + DC), and use nationwide tendencies of those demographics to come up with a set of topline numbers. In other words, assume that, say, Catholics, or those with income between $75,000 and $150,000 per year, had uniform behavior across the nation (which is untrue), and see what results the elections would provide. To test my model, I used data from the 2008 Presidential election. The results were rather interesting.
The variables I used were age (18-29, 30-44, 45-64, 65+), education (no HS, HS grad, some college, college grad, postgrad), income (<$30k, $30-$75k, $75-$150k, >$150k), party (D, R, I), ideology (liberal, moderate, conservative), race (white, black, Hispanic, Asian, other), religion (Protestant, Catholic, Jewish, other, and none), and urban status (urban, suburban, rural). All of these categorizations were available from the 2008 exit polls, except that about half the states didn't give me religion data, so I got that data from the 2004 version.

The first thing to notice is which of the variables performed well in predicting the actual results. I put it that way because one of the interesting things about this kind of demographic modeling is that everything is interconnected. Every individual person has an identity in all eight of my demographic groups: that's eight different demographic influences "pulling" on them, or, to put it another way, eight different groups in which they are being counted. So if an additional 10% of Catholics decide to vote for Obama, all of those people shifted to Obama in their other demographic groups as well. So that's tricky. I don't have any idea how one would try to tackle that in a nuanced way, other than a hyper-nuanced way, so I didn't: I just calculated, for each state/DC, the combined numbers between Obama and McCain using each of the eight variables separately, and then took the simple average of the eight different results for an official "topline" number.
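A minimal sketch of the calculation described above, assuming each single-variable projection is just the state's demographic mix weighted by the nationwide support rate for each group. The function names and all the numbers are placeholders for illustration, not the actual exit-poll figures.

```python
from statistics import mean

def project_share(state_mix, national_support):
    """Project a candidate's share in one state from one demographic variable.

    state_mix: fraction of the state's voters in each group.
    national_support: nationwide share of each group voting for the candidate.
    """
    total = sum(state_mix.values())  # guards against mixes that don't sum to 1
    return sum(frac * national_support[group]
               for group, frac in state_mix.items()) / total

def topline(per_variable_projections):
    """Simple unweighted average of the single-variable projections."""
    return mean(per_variable_projections)

# Placeholder numbers, NOT real exit-poll data.
national_support_by_party = {"D": 0.9, "R": 0.1, "I": 0.5}
state_party_mix = {"D": 0.40, "R": 0.30, "I": 0.30}

party_projection = project_share(state_party_mix, national_support_by_party)
# With all eight variables this list would have eight entries.
state_topline = topline([party_projection, 0.50])
```

The same `project_share` call would be repeated for each of the eight variables in each of the 51 regions before averaging.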

So, here are the correlations between the results given by each of the eight variables across the 51 state-like things with the actual results (I'm using the numbers for Obama, which are very close to those for McCain in any event and I set it up so that it would've been a hassle to calculate using the margin):

Age: -0.078
Education: 0.470
Income: -0.414
Party: 0.883
Ideology: 0.915
Race: 0.388
Religion: 0.754
Urban: 0.391
Average: 0.881
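For reference, here is a sketch of how a number like the ones above could be computed, assuming they are ordinary Pearson correlations between the projected and actual Obama percentages across the 51 regions (the post doesn't say which coefficient was used). The two short lists are made-up placeholders.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Placeholder values for four hypothetical states, not real results.
projected = [47.0, 50.0, 52.0, 55.0]
actual = [32.0, 45.0, 57.0, 71.0]
r = pearson(projected, actual)
```

Because the coefficient divides by the spread of both lists, it only measures whether the states line up in the same relative pattern, not whether the projected percentages are close to the real ones.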

Note, first of all, that the 88% correlation for my overall average is pretty damn good. I was certainly very pleased with it. My overall impressions at this level are as follows. (Also, be sure to remember that these numbers measure how well each variable predicts how a state votes, not how well it predicts how individuals vote.) It does not surprise me that the two explicitly political variables, party and political ideology, are the best predictors of state voting patterns. When 90% of partisans vote for their party's candidate, this ought to be the case. Race, urbanization, and education all have medium-level positive impact; none of this surprises me, either. I find it interesting that religion is stronger than any of those categories, though: perhaps this is because religions are opinions themselves, like political viewpoints, while most of the others are simply "identity" factors? That theory makes sense, actually, because while education level is an identity, it has an impact on opinions, and education does a slightly better job than race or urbanization at predictions.

I had predicted that age would have essentially 0 contribution before I ran these numbers, for one simple reason: all states have roughly the same age profile. The differences are just tiny, and often sort of arbitrary. So this one didn't surprise me at all. The one that appears to be shocking is the fairly strong predictive value of income level, but in reverse. What on earth does this mean? After all, poor people voted strongly for Obama, so why didn't he do better in states with more of them? Well, the "why" component of that question is somewhat mysterious, but it has been observed before that while poor people vote Democratic and rich people vote Republican across all states, poorer states as a whole also voted more Republican, at least in 2008. Thus, even though, say, giving each poor person an extra vote in any given state would make it tilt Democratic, the states with the most poor people lean Republican (also, can we do that?).

Here are my guesses as to why this might be. Income correlates strongly with education; indeed, education and income are largely proxies for each other. And while we expect education to correlate positively with voting Democratic, we expect poverty to do so as well, but poverty and education are inversely correlated. Also, you have what one might call the "Andrew Jackson" effect, the poor Southern whites who vote, at this point, Republican en masse for what are largely non-economic reasons. Indeed, I think it's obvious that this has to be something about non-economic factors in voting, probably social factors. One theory would be that for some random reason, it is simply true that richer states have more left-wing social values, and vice-versa; I'm skeptical. I think it probably has to do largely with education. But if people just vote on social values, then why don't poor people vote Republican? I think I have a theory about that, too.

My theory is that states develop a state-wide social-values culture which is based, largely, on the aggregate education level of the state, possibly in combination with other factors. This values culture then has an impact on the voting preferences of everyone in the state. But it is influencing poor people in their capacity as poor people, that is, Democratic voters, and on rich people in their capacity as rich people, that is, Republican voters. So in a poor, socially conservative state, all voters will be somewhat more conservative, but the poor within the state will be more liberal still; and vice-versa. Alternately, since education funding is managed at a state-wide level, everyone in richer states will be more educated than people in poor states controlling for income.

Now for some of the interesting things about the model. I noted above that the 88% correlation between the overall average of the predicted results and the actual election results was pretty damn good. And it is. But if I showed you a table of the actual predicted state-by-state topline results, you'd be fairly surprised. That's because the overall range of the projected Obama vote in the 50 states (excluding DC) runs from about 47% of the vote to about 55% of the vote. The actual range of Obama vote in the 50 states ran from about 32% to 71%. That's a pretty big difference. The reason this doesn't show up in the correlation is that correlation corrects for the variance of the two distributions. It is true that by and large the states in which Obama did better in the model are the states where he did better in real life; the margins are just shrunk in general. Actually, the Obama margin itself isn't shrunk: what shrank is each state's difference from the average Obama margin of around 7 points. States were all moved closer to a 52-45 margin between Obama and McCain. I can correct for this by turning everything into z-scores and then rescaling them to the mean and standard deviation of the actual election. Below are the maps showing how each of the eight demographic variables would've predicted the election to turn out, making this adjustment.
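The z-score adjustment can be sketched roughly as follows. This is an illustration under assumptions: it uses the unweighted population standard deviation across states, which the post doesn't specify, and the input lists are placeholders rather than the real state numbers.

```python
from statistics import mean, pstdev

def rescale_to_actual(projected, actual):
    """Stretch projected shares to match the mean and spread of actual ones.

    Each projection is turned into a z-score within the projected list, then
    mapped onto the actual distribution's mean and standard deviation.
    """
    proj_mean, proj_sd = mean(projected), pstdev(projected)
    act_mean, act_sd = mean(actual), pstdev(actual)
    return [act_mean + act_sd * (p - proj_mean) / proj_sd for p in projected]

# Placeholder values for four hypothetical states, not real results.
projected = [47.0, 50.0, 52.0, 55.0]
actual = [32.0, 45.0, 57.0, 71.0]
adjusted = rescale_to_actual(projected, actual)
```

The ordering of the states is untouched by this step; only the scale changes, which is why the correlation with the real results stays the same while the margins widen.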

First, here's age. I'd say that this is pure randomness, more or less. Don't try too hard finding patterns here. 


Note that dark blue means a projected Obama win of more than 15 points, pure blue is a win of 5-15 points, light blue is a 0-5 point Obama projected win, pink is a 0-5 point projected McCain win, pure red is a 5-15 point McCain win, and dark red is a 15 point or more projected McCain win. If age had been the only factor in the 2008 elections, Barack Obama would probably have won 306 EVs, compared to 365 in real life.
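As a rough sketch, the color key and the EV totals could be produced like this. The handling of margins that land exactly on a boundary (0, 5 or 15 points) is an assumption, since the key above doesn't say, and the state names, margins and EV counts are placeholders.

```python
def margin_bucket(margin):
    """Map a projected Obama-minus-McCain margin (in points) to a map color.

    Assumption: a margin exactly on a boundary falls into the bucket nearer
    McCain (so exactly 5 is light blue, exactly 0 is pink).
    """
    if margin > 15:
        return "dark blue"
    if margin > 5:
        return "pure blue"
    if margin > 0:
        return "light blue"
    if margin > -5:
        return "pink"
    if margin > -15:
        return "pure red"
    return "dark red"

def obama_evs(margins, electoral_votes):
    """Sum the electoral votes of every state with a positive projected margin."""
    return sum(electoral_votes[state]
               for state, margin in margins.items() if margin > 0)

# Placeholder states and numbers, not real projections.
margins = {"State A": 18.2, "State B": 2.6, "State C": -7.1}
electoral_votes = {"State A": 10, "State B": 11, "State C": 15}
colors = {state: margin_bucket(m) for state, m in margins.items()}
total = obama_evs(margins, electoral_votes)
```

This treats every state as winner-take-all, which ignores the district-based allocation that Maine and Nebraska use.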

Next up, education. I'd say education was a strong component in Obama's win in Virginia and Colorado, 'cause a lot of his other indicators were lousy there. The South looks too Democratic; the Pacific and Midwestern states too Republican. If education had been the only factor in the 2008 elections, Obama would probably have won 280 EVs.


Now here's income. This one looks basically backwards. Very, very backwards. It only got 13 out of 50 states correct. That's pretty bad. Obama would have won only 242 EVs under this model, and would have lost.


The next two are kind of cheating, since partisanship and ideology are explicitly political terms. Using the partisan map we can see that Obama would've considerably outperformed on this model. He massively overperforms in the inland South, and doesn't underperform anywhere much to speak of. He would've won 400 EVs using this model, 54 of which are from Southern states he didn't actually win.


The ideological map is the best one we've seen so far. That's not shocking, since it had the highest correlation. This model only gets 7 states wrong. Basically, it's like the partisanship model, except it corrects for the oddity of lingering Dixiecrat voters. 305 EVs for Obama under this model.


The model using race has some issues. Specifically, the solid Democratic South. That's because, of course, there are lots of black people in the South. Only one problem, though: insanely pro-Republican whites. Like, 88% of whites for McCain in some states. On the other hand, we have 95% white Vermont, where 68% of the whites voted for Obama. Gotta be the most liberal white people in the nation. Obama would get 403 EVs here, 109 of them from Southern states that didn't actually vote for him.



Religion is the best of the six variables that aren't directly political. This map pretty closely resembles the actual map. This map, unlike some of the previous, gives Obama somewhat less credit in the South than he really got. Only 7 incorrect calls here. Obama 334 EVs.


Finally, urbanization. I was a little surprised at how poorly the urbanization metric did. If you take the correlation of population density of Congressional districts against their PVIs, it's pretty damn strong. I have my doubts about the exit polls' characterization of urban settings; for instance, Idaho, really? Texas I know has some of the most conservative cities in the country, while northern New England and the Wisconsin-Minnesota area have a lot of liberal countryside. This model gives Obama 415 EVs, his best of the eight demographic variables.


And now, finally, we have the overall map, computed using the straight average of the eight variables. Note that I took the averages before I normalized each individual variable to fit the 2008 mean/standard deviation scheme. That's important; otherwise I would've massively overweighted age, which had a tiny variance.
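Here is a small illustration of why the order of operations matters, using two made-up variables: one with almost no spread (standing in for age) and one with a real spread (standing in for ideology). The numbers are invented for the example, and the z-scores use the population standard deviation as an assumption.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize a list to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def average_columns(*columns):
    """Average several per-state lists element by element."""
    return [mean(vals) for vals in zip(*columns)]

# Invented projections for four hypothetical states.
age = [50.0, 50.1, 49.9, 50.0]        # nearly flat: differences are noise
ideology = [45.0, 48.0, 52.0, 55.0]   # meaningful spread

# Average the raw projections: the flat variable barely moves anything.
raw_average = average_columns(age, ideology)

# Standardize each variable first: the noise in the flat variable is blown up
# to the same scale as the real signal and can reorder the states.
z_average = average_columns(zscores(age), zscores(ideology))
```

In this toy case the raw average keeps the third state ahead of the second, as ideology says it should be, while the standardize-first version flips them on the strength of a 0.2-point wiggle in the flat variable.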


I am very proud of this map. It only calls Louisiana, Arizona, Colorado, Iowa, Indiana, Georgia and Missouri incorrectly. Iowa and Indiana, along with the somewhat weak Obama performances in MN/WI, are due to a home region boost; likewise Arizona. Louisiana's a funky state, as is Colorado. Missouri was a nailbiter, both in the election (-0.3%) and in my model (+2.6%). And Obama had half a shot at Georgia, losing by 5 in real life and winning by 3% in my model. Most of the shading corresponds pretty well with how the actual election turned out. In this map, Obama would have won 382 EVs, slightly more than he won in real life.

Finally, here's the chart of Obama real-life vote percentage against Obama's model-generated adjusted vote percentage:


That crazy point on the right is DC, which the adjusted model projected Obama to win 105% of the vote in. Obviously, that's impossible. I had been musing to myself earlier that I couldn't imagine how demographics could capture the 90% thing; indeed they couldn't, but they couldn't even properly describe a 60%-40% state, and it turns out that they did a better job coming close to DC than most of the states. Probably that's because there's just an outer limit to how far you can go, and DC comes pretty damn close to that limit, so while its demographics might sustain the 105% thing, it hits a wall. Anyway, the three most notable outlier points are Vermont and Hawaii on the top (Hawaii's the better one for Obama, being his best state), and Louisiana on the bottom, for reasons I don't quite understand.

Anyway, here's the bottom line: I think this is a pretty good modeling system. I think that using these eight variables, with the possible exception of age which I think is just never going to give a satisfying variance, produces a reasonably accurate view of the 2008 elections. Especially satisfying is the fact that I made absolutely zero effort to be nuanced; I just took the straight average.

Now I will use this model to make projections about reasonable percentages in elections that have not ever happened, possibly including the 2012 Presidential election.
