Monday, April 12, 2010

Proxies

I just sat down to try and do a statistical analysis of how certain, um, stats contribute to overall scoring average on the PGA Tour. I did a fair bit of analysis, and then had to go eat (because I was hungry). While I was walking, eating, and walking again, I had the following musings:

The difficulty in conducting such an analysis is that the statistics available are not perfect proxies for the skills one wishes to represent. As a result they are not always independent of one another, and that confounds one's data. I might, for instance, want to represent the following fundamental golf skills: power, accuracy, precision, chipping, and putting. Accuracy and precision are differentiated here because precision involves hitting at a point target, i.e. a golf hole, while accuracy involves hitting at a ribbon target, i.e. a fairway. There are, amongst conventional golf stats, reasonable proxies for all of these fundamental skills. Power is driving distance. Accuracy is, well, driving accuracy. Precision is greens in regulation. Chipping is scrambling. Putting is putts per green in regulation.

But there are problems with most of these proxies, and they relate to a lack of independence. Driving distance for power is basically fine. But driving accuracy is negatively correlated with driving distance: a longer hitter tends to hit fewer fairways. So accuracy as a skill is not sufficiently represented by driving accuracy. Driving accuracy of 65% might be great for a 300-yard hitter but dreadful for a 275-yard player. Then there's a bigger problem with using greens in regulation to represent what is essentially iron play, namely that hitting greens is dependent on driving the ball well. So to hit 65% of greens might be very good indeed if you hit the ball 280 yards and only hit 50% of fairways, but if you hit it 300 and have 65% driving accuracy it's probably pretty bad. Putts per green in regulation is almost inherently independent of GIRs themselves, though not independent of proximity to the hole, which might be expected to correlate with GIR percentages, messing things up slightly. Finally, the problem with scrambling is that each successful up-and-down comes in two parts: a pitch and a putt. A great chipper, who hits most of their chip shots inside of three feet, is a good scrambler, but a mediocre chipper who leaves 10 feet on average but is a good putter and makes most of their 10-footers might also be a good scrambler.
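One rough way to picture the "65% is great at 300 yards, dreadful at 275" point is to fit a straight line of driving accuracy against driving distance and look at how far each player sits above or below it. The sketch below does that with plain least squares. The five players and their numbers are invented purely for illustration (they are not Tour data), and the function names are my own; a straight-line relationship is itself an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(xs, ys):
    """How far each y sits above (+) or below (-) the fitted line."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Invented illustration data, NOT real PGA Tour numbers.
distance = [275.0, 280.0, 290.0, 300.0, 305.0]  # average drive, yards
accuracy = [70.0, 68.0, 63.0, 65.0, 55.0]       # fairways hit, percent

# "Accuracy given distance": positive means more fairways than a
# player of that length would be expected to hit.
adjusted_accuracy = residuals(distance, accuracy)

for d, acc, adj in zip(distance, accuracy, adjusted_accuracy):
    print(f"{d:.0f} yds, {acc:.0f}% fairways, adjusted {adj:+.1f}")
```

In this made-up sample the 300-yard player at 65% comes out well above the line while the 275-yard player at 70% sits almost exactly on it, which is the sort of reversal the raw percentages hide. The same trick could in principle be applied to greens in regulation given distance and driving accuracy, though that needs more than one predictor.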

Recently there's been a bit of an effort to account for this by breaking various stats down by the situation they come from. GIR% from the rough. Scrambling from >30 yards out. Proximity to the hole from 100-125 yards out. Proximity to the hole from the sand. Putts made from 10'-15'. All of which is very well and good, and probably does say something about the skills more directly, but I tend to feel that the sample sizes behind those numbers are just too damn small to be worth much. As you try to parse the numbers down to within an inch of their lives, you end up losing the purpose of statistics in the first place, which is to effectively summarize the great mass of information produced in, for instance, a single round of golf.
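To put a rough number on the sample-size worry: the standard error of an observed percentage shrinks only with the square root of the number of attempts. The sketch below uses the textbook normal approximation for a proportion and assumes attempts are independent, which is generous for golf. The 30% make rate and the attempt counts are hypothetical, chosen only to show the scale of the effect.

```python
import math


def proportion_standard_error(p, n):
    """Standard error of an observed proportion p over n attempts."""
    return math.sqrt(p * (1.0 - p) / n)


def approx_95_interval(p, n):
    """Rough 95% interval via the normal approximation, clipped to [0, 1]."""
    half_width = 1.96 * proportion_standard_error(p, n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical: a player makes 30% of putts in some narrow distance band.
make_rate = 0.30
few_attempts = 40      # a thin slice of a season
many_attempts = 1000   # a broad, all-purpose stat

se_small = proportion_standard_error(make_rate, few_attempts)
se_large = proportion_standard_error(make_rate, many_attempts)

for n in (few_attempts, many_attempts):
    low, high = approx_95_interval(make_rate, n)
    print(f"n={n}: roughly {low:.0%} to {high:.0%}")
```

With only a few dozen attempts the interval is wide enough to cover both a poor and an excellent putter, so the finely sliced stat may not separate players nearly as well as its specificity suggests.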

That being said, the fact that the proxies are imperfect makes trying to figure out which skills are most important a tricky task.
