Overview of the diving data:
Number of cases: 10,787 scores from 1,541 dives (7 judges score each dive) performed in four events at the 2000 Olympic Games in Sydney, Australia.
Number of variables: 10.
Variables:
Event: Four events, men's and women's 3m and 10m.
Round: Preliminary, semifinal, and final rounds.
Diver: The name of the diver.
Country: The country of the diver.
Rank: The final rank of the diver in the event.
DiveNo: The number of the dive in sequence within round.
Difficulty: The degree of difficulty of the dive.
JScore: The score given by the judge for this dive.
Judge: The name of the judge.
JCountry: The country of the judge.
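The case count above is consistent with 1,541 dives × 7 judges = 10,787 scores. Below is a minimal sketch of how one might verify that every dive has exactly seven scores. It uses a small made-up data frame (`toy` and all its values are hypothetical, not taken from Diving2000.csv) and assumes that a dive is identified by Event, Round, Diver and DiveNo.

```r
# Hypothetical toy data: 2 dives, each scored by 7 judges.
toy <- data.frame(
  Event  = rep("M3mSB", 14),
  Round  = rep("Final", 14),
  Diver  = rep(c("Diver A", "Diver B"), each = 7),
  DiveNo = rep(1, 14),
  JScore = c(8, 9, 8.5, 8.5, 8.5, 8.5, 8, 7, 7.5, 7, 6.5, 7, 7.5, 7)
)

# One row per dive, holding the number of judge scores recorded for it.
# Assumption: Event + Round + Diver + DiveNo identifies a dive.
per_dive <- aggregate(JScore ~ Event + Round + Diver + DiveNo,
                      data = toy, FUN = length)
names(per_dive)[names(per_dive) == "JScore"] <- "n_scores"

n_dives   <- nrow(per_dive)               # number of distinct dives
all_seven <- all(per_dive$n_scores == 7)  # TRUE if every dive has 7 scores
print(per_dive)
```

On the real data, the same call with `data = data` should give 1,541 rows if the identifying assumption holds.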
First, I set the working directory and read in the data:
setwd("D:/Study/Yale/STAT625_Case Studies/Homework#2")
data <- read.csv("D:/Study/Yale/STAT625_Case Studies/Homework#2/Diving2000.csv",
header = TRUE, as.is = TRUE)
head(data)
## Event Round Diver Country Rank DiveNo Difficulty JScore
## 1 M3mSB Final XIONG Ni CHN 1 1 3.1 8.0
## 2 M3mSB Final XIONG Ni CHN 1 1 3.1 9.0
## 3 M3mSB Final XIONG Ni CHN 1 1 3.1 8.5
## 4 M3mSB Final XIONG Ni CHN 1 1 3.1 8.5
## 5 M3mSB Final XIONG Ni CHN 1 1 3.1 8.5
## 6 M3mSB Final XIONG Ni CHN 1 1 3.1 8.5
## Judge JCountry
## 1 RUIZ-PEDREGUERA Rolando CUB
## 2 GEAR Dennis NZL
## 3 BOYS Beverley CAN
## 4 JOHNSON Bente NOR
## 5 BOUSSARD Michel FRA
## 6 CALDERON Felix PUR
attach(data)
Load “YaleToolkit” and check for missing values:
library(YaleToolkit)
## Loading required package: grid Loading required package: lattice Loading
## required package: vcd Loading required package: MASS Loading required
## package: colorspace Loading required package: barcode Loading required
## package: gpairs
whatis(data)
## variable.name type missing distinct.values precision
## 1 Event character 0 4 NA
## 2 Round character 0 3 NA
## 3 Diver character 0 156 NA
## 4 Country character 0 42 NA
## 5 Rank numeric 0 49 1.0
## 6 DiveNo numeric 0 6 1.0
## 7 Difficulty numeric 0 20 0.1
## 8 JScore numeric 0 21 0.1
## 9 Judge character 0 25 NA
## 10 JCountry character 0 21 NA
## min max
## 1 M10mPF W3mSB
## 2 Final Semi
## 3 ABALLI Jesus-Iory ZHUPINA Olena
## 4 ARG ZIM
## 5 1 49
## 6 1 6
## 7 1.5 3.8
## 8 0 10
## 9 ALT Walter ZAITSEV Oleg
## 10 AUS ZIM
data[is.na(data) == TRUE]
## character(0)
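The `whatis` output and the empty result above both indicate that no values are missing. A per-column count is another way to see this. Here is a small sketch on a made-up data frame (`toy` is hypothetical and deliberately contains one NA so that the check has something to find):

```r
# Hypothetical toy data with one missing score (not from the diving file).
toy <- data.frame(
  Diver  = c("Diver A", "Diver B", "Diver C"),
  JScore = c(8.5, NA, 7.0),
  stringsAsFactors = FALSE
)

na_per_column <- colSums(is.na(toy))  # named vector: number of NAs per column
has_missing   <- any(is.na(toy))      # single TRUE/FALSE for the whole frame
print(na_per_column)
```

For the diving data, `colSums(is.na(data))` should return zero for all ten variables, matching the `missing` column reported by `whatis`.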
[1] Check some categorical variables:
table(Event)
## Event
## M10mPF M3mSB W10mPF W3mSB
## 2709 3192 2317 2569
table(Round)
## Round
## Final Prelim Semi
## 1848 6636 2303
table(Country)
## Country
## ARG ARM AUS AUT AZE BLR BRA CAN CHN COL CUB CZE ESP FIN FRA GBR GEO GER
## 35 42 728 175 42 112 189 560 868 119 301 42 259 84 224 448 35 672
## GRE HKG HUN INA ITA JPN KAZ KOR MAS MEX PER PHI PRK PUR ROM RUS SUI SWE
## 189 42 231 112 294 231 399 154 196 420 84 77 427 98 154 791 77 105
## THA TPE UKR USA VEN ZIM
## 84 175 476 833 161 42
[2] Overall summary of scores:
summary(JScore)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 6.00 7.00 6.83 8.00 10.00
[3] Explore JScore within each Event group. The distributions of JScore are similar across the four events, and all are left-skewed.
barplot(table(Event, JScore), beside = TRUE, main = "Scores in Each Group", xlab = "Scores",
legend = TRUE)
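Since the four events have different numbers of scores (2,317 to 3,192), raw counts are a little hard to compare between groups. One option, sketched here on made-up numbers (`toy` is hypothetical), is to convert each event's counts to shares with `prop.table` before plotting:

```r
# Hypothetical toy scores for two events of different sizes.
toy <- data.frame(
  Event  = c(rep("M3mSB", 6), rep("W3mSB", 4)),
  JScore = c(6, 7, 7, 8, 8, 8, 6, 7, 8, 8)
)

counts <- table(toy$Event, toy$JScore)
props  <- prop.table(counts, margin = 1)  # each row (event) now sums to 1
print(round(props, 2))
# barplot(props, beside = TRUE, legend = TRUE) would then plot shares, not counts.
```

This is only an alternative presentation of the same table, so it should not change the conclusion that the shapes are similar.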
[4] Based on Country, explore the distribution of JScore, Rank and Difficulty.
sort(tapply(JScore, Country, mean))
## INA ARG HKG THA TPE ARM SUI CZE GRE ZIM PHI ROM
## 4.473 4.614 4.667 5.107 5.186 5.238 5.240 5.488 5.545 5.583 5.604 5.662
## PUR KOR COL VEN GEO MAS PER FRA AZE ESP GBR BRA
## 5.832 5.844 5.903 5.935 6.000 6.010 6.018 6.109 6.226 6.243 6.364 6.392
## AUT FIN CUB HUN KAZ BLR PRK ITA UKR MEX GER AUS
## 6.446 6.458 6.487 6.511 6.607 6.652 6.672 6.811 6.825 6.913 7.213 7.303
## CAN USA JPN RUS SWE CHN
## 7.440 7.477 7.591 7.624 7.648 8.159
sort(tapply(Difficulty, Country, mean))
## SWE ITA PUR TPE BLR KAZ AUS USA RUS ARG MEX UKR
## 2.447 2.581 2.629 2.640 2.669 2.670 2.683 2.697 2.697 2.700 2.707 2.725
## GER CAN HKG ESP BRA CHN PRK GEO AUT FRA CUB GBR
## 2.726 2.726 2.733 2.759 2.767 2.771 2.775 2.780 2.784 2.784 2.791 2.795
## PHI HUN INA JPN COL SUI MAS AZE GRE KOR ROM ARM
## 2.818 2.833 2.844 2.852 2.865 2.873 2.932 2.967 2.974 2.995 3.009 3.017
## PER VEN FIN THA CZE ZIM
## 3.017 3.017 3.025 3.125 3.133 3.150
barplot(table(Country, JScore), beside = T, main = "Score in Each Country",
xlab = "Scores")
[5] Compare JScore by JCountry and Country to check for bias. A bias does appear to exist: scores are higher when the judge and the diver come from the same country (mean 7.462 versus 6.833 overall).
mean(JScore[JCountry == Country])
## [1] 7.462
mean(JScore)
## [1] 6.833
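The scores are much higher when both the judge and the diver come from China: Chinese judges give 6.886 on average overall but 8.475 to Chinese divers. Part of this gap reflects the strength of the Chinese divers, who average 8.159 across all judges (see [4]), so the comparison that isolates the judge effect is 8.475 versus 8.159.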
mean(JScore[JCountry == "CHN"])
## [1] 6.886
mean(JScore[JCountry == "CHN" & Country == "CHN"])
## [1] 8.475
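Raw means like these mix two things: how generous a judge is and how good the diver is. A possible refinement, which is not what I did above, is to compare each score with the mean of the other judges' scores on the same dive. The sketch below uses made-up data. `toy`, `DiveID` and the scores are hypothetical, and the real data would first need a dive identifier built from Event, Round, Diver and DiveNo.

```r
# Hypothetical toy data: 2 dives, 3 judges each (the real data has 7 per dive).
toy <- data.frame(
  DiveID   = c(1, 1, 1, 2, 2, 2),
  Country  = c("CHN", "CHN", "CHN", "USA", "USA", "USA"),
  JCountry = c("CHN", "NZL", "CAN", "CHN", "USA", "FRA"),
  JScore   = c(9.0, 8.5, 8.0, 7.0, 8.0, 7.0),
  stringsAsFactors = FALSE
)

# Mean of the OTHER judges' scores on the same dive (leave-one-out mean).
dive_sum <- ave(toy$JScore, toy$DiveID, FUN = sum)
dive_n   <- ave(toy$JScore, toy$DiveID, FUN = length)
toy$others_mean <- (dive_sum - toy$JScore) / (dive_n - 1)

# Positive deviation: this judge scored the dive above the rest of the panel.
toy$deviation <- toy$JScore - toy$others_mean
same <- toy$JCountry == toy$Country

bias_same  <- mean(toy$deviation[same])   # judges scoring compatriots
bias_other <- mean(toy$deviation[!same])  # all other judge-diver pairs
print(c(same = bias_same, other = bias_other))
```

Because every score is compared with scores on the same dive, the quality of the diver cancels out and only the judge's relative generosity remains.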
[6] Explore bimodality in the degree of Difficulty. The correlation between JScore and Difficulty is negative.
hist(Difficulty)
cor(JScore, Difficulty) # Negative
## [1] -0.2724
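The overall correlation pools all three rounds, and mean difficulty differs a lot between rounds (about 1.9 in the semifinal versus about 3 in the other two, per the `tapply` output in this report). It may therefore be worth looking at the correlation within each round as well. Here is a hedged sketch on made-up numbers (`toy` is hypothetical):

```r
# Hypothetical toy data: two rounds with different difficulty ranges.
toy <- data.frame(
  Round      = rep(c("Prelim", "Semi"), each = 4),
  Difficulty = c(2.8, 3.0, 3.2, 3.4, 1.6, 1.8, 1.9, 2.0),
  JScore     = c(7.5, 7.0, 6.5, 6.0, 8.5, 8.0, 8.0, 7.5)
)

# Correlation of score and difficulty computed separately for each round.
cor_by_round <- sapply(split(toy, toy$Round),
                       function(d) cor(d$JScore, d$Difficulty))
cor_overall <- cor(toy$JScore, toy$Difficulty)
print(cor_by_round)
print(cor_overall)
```

On the real data, the within-round values could be weaker or stronger than -0.2724. I have not computed them here.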
And then I plot JScore against Difficulty, colored by gender. First define “male”:
male <- data$Event %in% c("M3mSB", "M10mPF")
Now back to the plot:
plot(jitter(data$Difficulty), jitter(data$JScore), xlab = "Degree of Difficulty",
ylab = "Judges' Scores", col = 1 + male)
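The color argument in the plot call adds 1 to a logical vector, which works because R coerces FALSE to 0 and TRUE to 1. A tiny sketch of that mapping, using made-up event codes in the style of the data (`event` and `point_col` are hypothetical names):

```r
# Hypothetical event codes in the style of the data (M = men, W = women).
event <- c("M3mSB", "W10mPF", "M10mPF", "W3mSB")

male <- event %in% c("M3mSB", "M10mPF")  # logical vector, TRUE for men's events
point_col <- 1 + male  # FALSE -> 1 (first palette color), TRUE -> 2 (second)
print(point_col)
```

With the default palette, color 1 is black and color 2 is a shade of red, so women's scores would be drawn in black and men's in red.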
Then I create a bar graph of “Difficulty” by “Round” to see whether the round affects the difficulty. Divers usually choose more difficult dives in the preliminary and final rounds (mean difficulty 2.981 and 3.061) and considerably easier dives in the semifinal round (mean 1.896). I think this is why “Difficulty” is bimodal. I wonder whether the rules constrain the dives in some rounds, for example by requiring specific hard dives in the preliminary and final rounds, or whether the divers simply play it safe in the semifinal to reach the final.
levels(as.factor(Round))
## [1] "Final" "Prelim" "Semi"
levels(as.factor(Difficulty))
## [1] "1.5" "1.6" "1.8" "1.9" "2" "2.1" "2.4" "2.5" "2.6" "2.7" "2.8"
## [12] "2.9" "3" "3.1" "3.2" "3.3" "3.4" "3.5" "3.6" "3.8"
tapply(Difficulty, Round, mean)
## Final Prelim Semi
## 3.061 2.981 1.896
barplot(table(Round, Difficulty), beside = TRUE, legend = TRUE, main = "Difficulty Distribution for Every Round")
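To put a number on how much each round contributes to the lower mode, one could compute the share of easy dives per round. The sketch below uses made-up values, and the cut-off of 2.25 is my own assumption (chosen because the observed difficulty levels jump from 2.1 to 2.4). `toy` and `easy` are hypothetical names.

```r
# Hypothetical toy data: a few dives per round.
toy <- data.frame(
  Round      = c("Prelim", "Prelim", "Semi", "Semi", "Final", "Final"),
  Difficulty = c(3.0, 2.0, 1.9, 2.0, 3.1, 3.4)
)

easy <- toy$Difficulty < 2.25                 # assumed cut-off between the modes
share_easy <- tapply(easy, toy$Round, mean)   # fraction of easy dives per round
print(share_easy)
```

Note that in the real data each dive appears seven times (once per judge), which does not change these shares as long as every dive has the same number of scores.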
Then I check whether “Difficulty” is the same in each “Event” group. Men's mean difficulty is a little higher than women's (2.926 and 2.849 versus 2.695 and 2.546). Interestingly, each group shows a distribution similar to the overall one.
tapply(Difficulty, Event, mean)
## M10mPF M3mSB W10mPF W3mSB
## 2.926 2.849 2.695 2.546
barplot(table(Event, Difficulty), beside = TRUE, legend = TRUE, main = "Difficulty Distribution for Every Group")
Also, the distributions of scores are similar in each round, and all are left-skewed.
barplot(table(Round, JScore), beside = TRUE, legend = TRUE, main = "Scores Distribution for Every Round")
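The left skew is read off the bar graphs, and it is consistent with the overall mean (6.83) being below the median (7.00). A numeric check is also possible. Below is a minimal sketch of a moment-based sample skewness, written by hand rather than taken from a package. `skewness` and `toy_scores` are hypothetical names and the scores are made up.

```r
# Sample skewness: third central moment divided by the variance to the
# power 1.5 (population form, no small-sample correction).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Hypothetical scores with a long lower tail, as in a left-skewed distribution.
toy_scores <- c(0, 4, 6, 7, 7, 7.5, 8, 8, 8.5, 9)
s <- skewness(toy_scores)
print(s)  # negative for a left-skewed sample
```

Applied per round, something like `tapply(JScore, Round, skewness)` should give negative values if the visual impression is right.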