## The Computer: A Phantom Figure Skating Judge?

### Lyon, France

John Emerson, Assistant Professor, Department of Statistics, Yale University.

New results from the Olympic Pairs competition are available.

"... a scoring system designed to increase fairness."
-- NBC Commentator Bob Costas, February 11, 2006, 8:50 PM.

"A close figure skating competition will be decided by a computer
choosing an anonymous panel of nine judges. In fairness to the
skaters, all twelve scores should be used in awarding medals. Let's
leave the computer out of it."

-- John Emerson, Assistant Professor of Statistics, Yale University.

Torino, Italy, February 11, 2006. During NBC's Prime Time broadcast of the 2006 Olympic competition, commentator Bob Costas discusses the new figure skating "scoring system, designed to increase fairness" - fallout from the judging scandal in Salt Lake City. Two-time gold medal winner Dick Button offers his support of the new system. The viewer is comforted; the integrity of the Olympic Games is intact.

Does the new scoring system increase fairness? On some level it does, but the system has introduced the unsettling possibility of dumb luck influencing the medal standings. In a close competition with skaters separated by only a few points, the outcome will likely be determined by the random choices of panels of nine judges. This is neither desirable nor fair, and the system can be improved easily. The outcome should be determined solely by the skaters and the judges, using the scores of all twelve judges.

For over 100 years, panels of judges have used the 6.0 standard of scores. Judging was not anonymous, and accusations of favoritism were common. The starting order often influenced the scores, with earlier skaters receiving lower scores to "leave room" for the possibility of superior performances later in the session.

In place since the 2004 World Championships and in use at the 2006 Olympic Games, the new system awards points for technical elements as well as five program components: skating skills, transition/linking footwork, performance/execution, choreography/composition, and interpretation. The scores for the technical elements depend on a base value for the level of difficulty of the elements. The twelve judges add or deduct points from this base value, acknowledging the "grade of execution" of the performance of the elements. Program component scores range from 0 to 10, with increments of 0.25, reflecting the overall presentation of the program and quality of the figure skating. Judging is now anonymous.
Nine of twelve judges are selected at random for the Short Program and again for the Free Skate. Scores for each executed element or program component are calculated using a trimmed mean, as in the old system, dropping the maximum and minimum of the nine scores. Random elimination of three judges results in 220 possible combinations of nine-judge panels. However, only one panel actually determines the outcome.

An examination of the Ladies' 2006 European Figure Skating Championships illustrates the problem. The Short Program was a close competition between four of the top five skaters: Irina Slutskaya (66.43), Elena Sokolova (60.88), Sarah Meier (60.87), Elena Gedevanishvili (60.19), and Carolina Kostner (60.04). The scores were calculated after a computer randomly excluded judges 4, 6, and 11, whose identities and nationalities are not disclosed. Only 50 of the 220 possible panels would have resulted in the same ranking of the skaters following the Short Program. Scores calculated using all of the twelve judges would have resulted in the same ranking, but with slightly different numerical scores. Random elimination of a different set of judges could have radically changed these standings. Only Slutskaya's standing was secure; each of the other skaters could have placed as high as 2nd or as low as 5th in the Short Program. If the scores had been similarly close following the Free Skate (they were not, fortunately), the medal standings would have been determined by the random selection of the panels of judges.

The following graphs show the distribution of Short Program rankings for each of the top 5 finishers, based on 220 possible panels of judges. The bars indicate the proportion of the 220 panels that would have resulted in a particular ranking of the skaters. The red bar indicates the actual outcome of the competition. Each of these panels awarded the highest score to Slutskaya.
Meier was particularly lucky: while she placed 3rd, more than half of the possible panels would have placed her in 4th or 5th position. Conversely, Gedevanishvili, who placed 4th, was particularly unlucky - more than half of the possible panels would have scored her in 2nd or 3rd position. Even Kostner, in 5th place, would have been ranked 2nd or 3rd by about one-third of the panels.

Imagine a similarly close competition for the Olympic medals in Torino, Italy. I hope I never have to hear a 4th or 5th place finisher give the following interview: "I did my best, and I would have won Bronze if all twelve judges' scores had been included. And if a different panel of 9 judges had been selected, I might have won Gold." We can only hope that the podium in Torino on February 23 will be determined by the judging of the skaters on the ice, not by a computer.

There was considerable uncertainty in the placement of Sokolova, Meier, Gedevanishvili,
and Kostner in the Short Program.
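
The panel-by-panel tally behind graphs like these can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions, not the code used for the analysis: the function names are hypothetical, the marks are invented rather than taken from any competition, and each judge is assumed to give a single mark per skater, whereas the real system trims each element and program component separately before adding them up.

```python
from collections import Counter
from itertools import combinations


def trimmed_mean(marks):
    """Average the marks after dropping one highest and one lowest."""
    ordered = sorted(marks)
    kept = ordered[1:-1]
    return sum(kept) / len(kept)


def rank_distribution(marks_by_skater, panel_size=9):
    """Tally each skater's placement over every possible panel of judges."""
    n_judges = len(next(iter(marks_by_skater.values())))
    tallies = {name: Counter() for name in marks_by_skater}
    n_panels = 0
    for panel in combinations(range(n_judges), panel_size):
        scores = {
            name: trimmed_mean([marks[j] for j in panel])
            for name, marks in marks_by_skater.items()
        }
        # Highest score ranks first; exact ties keep dictionary order (an assumption).
        ranked = sorted(scores, key=scores.get, reverse=True)
        for place, name in enumerate(ranked, start=1):
            tallies[name][place] += 1
        n_panels += 1
    return n_panels, tallies


# Invented marks from 12 judges for three hypothetical skaters (NOT real data).
example_marks = {
    "Skater A": [9.00] * 12,
    "Skater B": [7.25, 7.50, 7.00, 7.25, 7.75, 7.00, 7.50, 7.25, 7.00, 7.50, 7.25, 7.75],
    "Skater C": [7.50, 7.00, 7.25, 7.50, 7.00, 7.75, 7.25, 7.00, 7.50, 7.25, 7.50, 7.00],
}

n_panels, tallies = rank_distribution(example_marks)
print("panels:", n_panels)
for name, counts in tallies.items():
    # Share of panels that would have put this skater in each place.
    shares = {place: counts[place] / n_panels for place in sorted(counts)}
    print(name, shares)
```

In this toy example Skater A is first under every panel, much as Slutskaya was, while the placements of the two closely matched skaters can change with the panel drawn.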

"The ISU has claimed that random selection of judges will make it impossible to create blocs, because the deal-makers would not know whom to approach. However, picking a random number of judges from a panel won't eliminate the ability to set up blocs. They may have to be bigger than before to ensure results, but they'll have the advantage of being undetectable once they're formed. ... Selecting random judges from a completely honest panel creates a fairness issue that does not exist when you use all the judges' marks."

-- Katherine Godfrey, Ph.D., available here. March 2003, in a discussion of the ISU proposals for a new system, prior to the new June 2004 rules, writing about the proposed interim scoring system.

"Let me be clear: I'm not calling past or future results illegitimate. Rules are rules, and the rules have been and will be applied fairly to determine the winners. If the random selection of skaters reduced the impact of possible nationalism or block voting (and I haven't studied this question), there is still a clear cost associated with the judging system: uncertainty in close outcomes. There is no perfect system, but we should work to find a good one."

-- John Emerson, Assistant Professor of Statistics, Yale University.

Sources

The ISU web site: the complete results.
ISU: Special Regulations Single & Pair Skating 2004: the most recent version I could find at the ISU.
Scale of values: more recent than the 2004 ISU tables.
How stuff works: scoring basics, easier reading for the beginner.
Detailed Olympic Figure Skating Results

The data

Unlike the ISU, I advocate disclosure of data in a user-friendly form suitable for statistical analysis. The original sources are documented above. The scraped data and complete set of possible rankings are provided here, with brief documentation. I was pleased to find data available on all 12 judges for the European Championships, although processing the data was extremely difficult; in the interest of openness, why couldn't they present the data in an Excel file?

Interestingly, the U.S. Championships appear to use 9 judges, counting all scores. So no analysis is necessary! Thanks to Stephen Kawalko for pointing this out. I'm obviously not a figure skating expert -- I'm trying to objectively present the results of a data analysis and will try not to get involved in the politics.

I provide my processed data (in an Excel CSV file) for the top 8 finishers, so you can study the problem yourself. This file contains the raw GOE scores of the judges for the executed elements; there is an explanation (not finished) of the variables available in this data set that you may find helpful. I also provide a different version of the same data, using the SOV scores of the executed elements. Finally, I provide the rankings that would have resulted from each of the 48,400 panels of judges in the Short Program and Free Skate. This is a large file, so beware, and its format is not yet documented.
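
The count of 48,400 is consistent with one nine-judge panel being drawn for the Short Program and another, independently, for the Free Skate: 220 × 220 = 48,400. A minimal Python sketch of that counting follows; the variable names are hypothetical, and the numbering of judges from 1 to 12 is an assumption based on the excluded judges 4, 6, and 11 mentioned above.

```python
from itertools import combinations, product
from math import comb

N_JUDGES = 12
PANEL_SIZE = 9

# Every way to keep 9 of the 12 judges, with judges numbered 1 through 12.
panels = list(combinations(range(1, N_JUDGES + 1), PANEL_SIZE))

# The Short Program panel reported above excluded judges 4, 6, and 11.
actual_short_panel = tuple(
    j for j in range(1, N_JUDGES + 1) if j not in {4, 6, 11}
)

# One independent draw per segment: (Short Program panel, Free Skate panel).
panel_pairs = list(product(panels, repeat=2))

print(len(panels), comb(N_JUDGES, PANEL_SIZE))  # both counts are 220
print(actual_short_panel in panels)             # the actual panel is one of them
print(len(panel_pairs))                         # 220 * 220 = 48400
```

Each pair in `panel_pairs` corresponds to one possible combined result, which is why the file of rankings is so large.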

Thanks

Thanks to: Robert Lehman for clearing up some of my confusion about the old system; Stephen Kawalko, for pointing out that the US system does not seem to exclude judges at random (which is great!); and my students, for inspiring me to come up with neat examples for class.