Joe Cauteruccio,
Yale University
Initial Investigation
The data were read in from the web at http://www.stat.yale.edu/~jay/625/diving/Diving2000.csv. There are 10787 observations and 10 features. The feature names are shown below.
colnames(d)
## [1] "Event" "Round" "Diver" "Country" "Rank"
## [6] "DiveNo" "Difficulty" "JScore" "Judge" "JCountry"
After reading in the data, it makes sense to explore the variables a little. We find that there were three rounds (Preliminary, Semi-Final, and Final) with 6636, 2303, and 1848 observations respectively. This progression makes sense intuitively, as divers are eliminated at every stage.
We also note that there are 156 divers from 42 countries, and 25 judges from 21 countries.
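Counts like these can be obtained with `table()` and `unique()`. The sketch below is only an illustration: it uses a small made-up data frame with the same column names as `d`, and the round labels and rows are assumptions, not values from the real file.

```r
# Toy stand-in for the diving data: same column names, made-up rows.
d <- data.frame(
  Round    = c("Prelim", "Prelim", "Prelim", "Semi", "Semi", "Final"),
  Diver    = c("A", "B", "C", "A", "B", "A"),
  Country  = c("USA", "CHN", "MEX", "USA", "CHN", "USA"),
  Judge    = c("J1", "J2", "J1", "J2", "J1", "J2"),
  JCountry = c("NZL", "MEX", "NZL", "MEX", "NZL", "MEX"),
  stringsAsFactors = FALSE
)

# Number of rows (judge scores) recorded in each round
round_counts <- table(d$Round)

# Number of distinct divers, diver countries, judges, and judge countries
n_unique <- sapply(d[c("Diver", "Country", "Judge", "JCountry")],
                   function(x) length(unique(x)))

print(round_counts)
print(n_unique)
```

On the real data the same two calls should reproduce the round breakdown and the diver, country, and judge counts quoted above.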
Looking at the judges a little more closely, we see that judges from New Zealand scored the most dives as a country, but at the individual level the Mexican judge Jesus Mena judged the most. We can print a sample dive that he judged, and it looks like he scored this dive a full point lower than the other judges did. This raises the idea of checking for judging bias (which is probably a good idea regardless of this example).
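A minimal sketch of how the judge tallies and a sample dive might be pulled out is given below. The data frame is made up for illustration (judge names, countries, and scores are all hypothetical); only the column names follow `d`.

```r
# Toy stand-in for the diving data: made-up judges and scores.
d <- data.frame(
  Diver    = c("A", "A", "A", "B", "B", "B", "C"),
  DiveNo   = c(1, 1, 1, 1, 1, 1, 1),
  Judge    = c("J1", "J2", "J3", "J1", "J2", "J3", "J1"),
  JCountry = c("NZL", "NZL", "MEX", "NZL", "NZL", "MEX", "NZL"),
  JScore   = c(7.5, 7.5, 6.5, 8.0, 8.0, 7.0, 6.0),
  stringsAsFactors = FALSE
)

# Dives scored per individual judge and per judge country, largest first
by_judge    <- sort(table(d$Judge), decreasing = TRUE)
by_jcountry <- sort(table(d$JCountry), decreasing = TRUE)

top_judge   <- names(by_judge)[1]
top_country <- names(by_jcountry)[1]

# All judges' scores for a single dive, to compare one judge with the rest
sample_dive <- d[d$Diver == "A" & d$DiveNo == 1, c("Judge", "JScore")]
print(sample_dive)
```

Printing every judge's score for the same dive side by side is what makes a one-point gap like the one described above visible.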
Further Investigation
Difficulty
Going a little deeper into the data, it makes sense to look at both JScore and dive difficulty. The difficulty distribution appears to be bimodal.
# binwidth is a fixed parameter, so it goes outside aes(); columns are referenced by bare name
ggplot(d, aes(x = Difficulty)) + geom_histogram(aes(fill = Round), binwidth = 1) +
geom_density() + xlab("Difficulty of Dive") + ylab("Freq")
We see that the bimodality seems to be due to the dive round. As for why this is the case, the data isn't clear.
Judging Bias
The earlier printout of a dive hinted that we should look at potential judging bias. To do this, I calculated the average JScore (across the 7 judges) for each dive. Then I created a JScore difference for each judge by subtracting that mean from the score they gave for that dive.
These differences could be cut in a few different ways. Initially I chose to average the JScore difference for each judge by diver country and visualize it in a heat map.
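The per-dive mean and per-judge difference could be computed along the following lines. This is a sketch on made-up scores with three judges per dive rather than seven, and it assumes that a dive is uniquely identified by Event, Round, Diver, and DiveNo; the names `MeanScore` and `JDiff` are hypothetical.

```r
# Toy data: two dives, three judges each (the real data has 7 per dive).
d <- data.frame(
  Event  = "E1",
  Round  = "Final",
  Diver  = rep(c("A", "B"), each = 3),
  DiveNo = 1,
  Judge  = rep(c("J1", "J2", "J3"), times = 2),
  JScore = c(7.0, 7.5, 8.0, 6.0, 6.0, 7.5),
  stringsAsFactors = FALSE
)

# Assumption: these four columns together identify a single dive.
dive_id <- interaction(d$Event, d$Round, d$Diver, d$DiveNo, drop = TRUE)

# ave() returns the group mean repeated on every row of the group
d$MeanScore <- ave(d$JScore, dive_id)

# Positive values mean the judge scored above the panel average
d$JDiff <- d$JScore - d$MeanScore

print(d[c("Diver", "Judge", "JScore", "MeanScore", "JDiff")])
```

By construction the differences within each dive sum to zero, so a judge's average difference measures how they deviate from the rest of the panel rather than how hard the dives were.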
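One way to build a judge-by-country matrix of average differences is `tapply()` with two grouping factors. The sketch below uses made-up values and a hypothetical `JDiff` column holding each judge's score minus the dive mean; it is not necessarily how the matrix passed to `heatmap()` in this report was constructed.

```r
# Toy data: JDiff is a hypothetical column of judge score minus dive mean.
d <- data.frame(
  Judge   = c("J1", "J1", "J1", "J2", "J2", "J2"),
  Country = c("MEX", "MEX", "CHN", "MEX", "CHN", "CHN"),
  JDiff   = c(0.5, 0.3, -0.2, -0.1, 0.1, 0.3),
  stringsAsFactors = FALSE
)

# Rows are judges, columns are diver countries, cells are mean differences.
# Judge/country pairs with no dives come out as NA.
bias <- tapply(d$JDiff, list(d$Judge, d$Country), mean)

print(bias)
```

A matrix of this shape can be passed to `heatmap()`; cells that are `NA` because a judge never scored a diver from that country may need handling before plotting.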
Arguably, the heat map is not sufficient to prove judge bias, but it can certainly point us in the right direction and guide further tests. If we look at Jesus Mena, we see that he has a very dark box for Mexico (meaning that on average he gives Mexican divers higher scores) and a somewhat light box for China.
## [1] "Warning, RStudio may not be able to display the heatmap"
JBHeat <- heatmap(hmd_nona, Rowv = NA, Colv = NA, col = colors, scale = "row",
margins = c(12, 5))
Conclusions
Some steps I highlighted for myself as I continue the EDA:
Questions coming out of summary/ Things to look into or do next: