Olympic Diving EDA Part 2: Judging Bias

Building on what was hinted at during phase 1 of the EDA, it makes sense to put more thought into possible judging bias during the 2000 Summer Olympic diving competition.

Specifically, we need a way to measure whether a particular judge is consistently giving higher scores to divers from a particular country. There are a few things we need to account for during this measurement; simply measuring the JScore will not work.

We can easily think of two situations in which a judge could seem to assign higher-than-average scores. First, divers from a particular country might simply be good divers. If we consider the scores a specific judge assigns to that country's divers without accounting for their general skill level, our conclusions will be skewed.

It is for this reason that we measure each judge's “Judging Difference” (JDiff), meaning how much higher or lower they score a dive than the other judges. To accomplish this, I average the panel of scores for each dive and subtract this average from each judge's assigned score. A positive JDiff means that a judge scored a dive higher than the panel average.
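Assuming a long-format table with one row per (dive, judge) pair, the JDiff calculation described above might look like the following sketch (the column names and toy values here are illustrative, not the actual dataset):

```python
import pandas as pd

# Toy data: one row per (dive, judge) pair with the raw score assigned.
scores = pd.DataFrame({
    "dive_id": [1, 1, 1, 2, 2, 2],
    "judge":   ["A", "B", "C", "A", "B", "C"],
    "jscore":  [8.0, 7.5, 7.0, 5.0, 5.5, 6.0],
})

# Panel average for each dive, broadcast back to every judge's row.
scores["panel_mean"] = scores.groupby("dive_id")["jscore"].transform("mean")

# JDiff: positive means the judge scored the dive above the panel average.
scores["jdiff"] = scores["jscore"] - scores["panel_mean"]
```

By construction, each dive's JDiff values sum to zero across its panel, so a judge's average JDiff isolates how they rank dives relative to their peers.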

The second contingency we need to account for is each judge's particular tendencies. A judge might consistently score dives higher or lower than other judges regardless of the diver's country. While we accounted for the “Judging Difference” numerically, we can control for “Judging Tendency” through our choice of analytical tools.

Recall the heatmap of differences created during the first phase:

[Figure: heatmap of judging differences, judges by rows and diver countries by columns]

If we examine the rows, we can clearly see where a judge has a high difference index (dark red). If we then investigate the column, we can see whether this dark red pattern is consistent.

Our 'biased' judges are those whose rows contain only a few (ideally one) dark cells, each standing nearly alone in its row (although the magnitude of the difference is already conveyed by the color of the cell itself).

Based on this heatmap, it actually appears that biased judging was fairly common during the competition. That said, a few judges stand out, most notably Steve McFarland, Jesus Mena, and Oleg Zatsev. Notice that all three have one or two dark cells that stand completely alone in their rows. Additionally, these singular cells coincide with divers from their home countries. Let's investigate further, starting with McFarland.

First, I plot the raw scores he gives to dives against the dive difficulty, while calling out divers from the USA (that is, divers whose country matches the judge's, McFarland in this case).

[Figure: McFarland's raw scores vs. dive difficulty, USA divers highlighted]

Note that in cases where the difficulty is 2.5 or greater, all divers from the USA are given a high score by McFarland. Even when we consider the easier dives, Americans are, in general, given high scores by McFarland.

With the groundwork established, let's consider the difference scores:

[Figure: McFarland's score differences (JDiff) vs. dive difficulty, USA divers highlighted]

This is even more telling than the first plot. Essentially all the American dives have a positive score difference. On average, McFarland has a score difference of +0.20 for Americans versus +0.10 for other divers.

The evidence is pretty strong in favor of McFarland having a nationalistic bias. Just to be sure, it makes sense to finalize our conclusions with a hypothesis test. First, I use a QQ plot to check whether the data is normally distributed.

[Figure: normal QQ plot of McFarland's score differences]

The shape oscillates and peels away from the QQ line at the bottom and top. To be safe, we will conduct a permutation test instead of a t-test.
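For reference, the quantile pairs that a normal QQ plot compares can be computed with the standard library alone; a minimal sketch, with an illustrative function name and toy data in place of the actual JDiff values:

```python
import statistics

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal QQ plot.

    Each sorted observation is paired with the standard-normal quantile at
    plotting position (i + 0.5) / n; points far off a straight line at the
    tails suggest the sample is not normally distributed.
    """
    n = len(sample)
    observed = sorted(sample)
    nd = statistics.NormalDist()
    theoretical = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, observed

# Toy usage (not the real data): a small, symmetric sample.
theo, obs = qq_points([0.3, -0.1, 0.0, 0.1, -0.3])
```

In practice one would plot `theo` against `obs` and overlay the reference line, which is what the QQ plot above shows.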

To conduct the permutation test, we first record the observed difference in mean JDiff between our two groups. We also note the proportion of dives (judged by the judge in question) performed by divers from the same country as the judge versus those that were not. Next, drawing from the pooled observations over a number of permutations, we randomly split the data into two disjoint groups matching the proportions we observed, and record the simulated difference in mean Judging Difference for each permutation. The two-tailed p-value for our test is the proportion of simulations where the absolute simulated difference was greater than or equal to the absolute value of our observed difference.
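The procedure above can be sketched with the standard library alone (the function name, group names, and toy data are illustrative, not the analysis code actually used):

```python
import random
import statistics

def perm_test_two_tailed(same_country, other, n_perm=30_000, seed=1):
    """Two-tailed permutation test for a difference in mean JDiff.

    same_country / other: JDiff values for dives by divers who share the
    judge's country vs. everyone else. Group sizes are preserved in every
    permutation, matching the observed proportions.
    """
    observed = statistics.mean(same_country) - statistics.mean(other)
    pooled = list(same_country) + list(other)
    n = len(same_country)
    rng = random.Random(seed)

    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the pooled observations
        sim = statistics.mean(pooled[:n]) - statistics.mean(pooled[n:])
        if abs(sim) >= abs(observed):
            count += 1
    return count / n_perm
```

Replacing the absolute-value comparison with `sim >= observed` gives the one-tailed version used later in the section.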

Ideally, the number of permutations would cover every possible split of the data into the two groups. We observe that McFarland judged 657 dives, 42 of which were from the USA, so there are C(657, 42) distinct splits. As this is a very large number, I simply settled on 30,000 random permutations.
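To get a sense of just how large that sample space is, the count of distinct splits can be computed directly (a quick illustrative check, not part of the original analysis):

```python
import math

# Number of distinct ways to choose which 42 of the 657 dives form the
# "same-country" group -- the full sample space of the permutation test.
n_splits = math.comb(657, 42)

# Gauge the magnitude by its digit count rather than printing the full number.
n_digits = len(str(n_splits))
```

Exhaustive enumeration is clearly infeasible, which is why a Monte Carlo sample of 30,000 random splits is used instead.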

For the two-tailed test, our null hypothesis is that there is no difference in “Judging Difference” between the US divers and divers from other countries.

The p-value of this two-tailed test is 0.0023, so we can be fairly confident that there is indeed a difference between the means of these two groups.

In the one-tailed test, we build on our earlier conclusions and test the hypothesis that US divers are given higher scores (and thus have a higher Judging Difference). The null hypothesis is, again, that there is no difference.

For the one-tailed test, the p-value is 0.001, so we can conclude that McFarland has a positive bias toward American divers.