Homework#2: Olympic Diving Data Analysis

Overview of the diving data:

Number of cases: 10,787 scores from 1,541 dives (7 judges score each dive) performed in four events at the 2000 Olympic Games in Sydney, Australia.

Number of variables: 10.

Variables:

Event: Four events, men's and women's 3m and 10m.

Round: Preliminary, semifinal, and final rounds.

Diver: The name of the diver.

Country: The country of the diver.

Rank: The final rank of the diver in the event.

DiveNo: The number of the dive in sequence within round.

Difficulty: The degree of difficulty of the dive.

JScore: The score given by the judge for this dive.

Judge: The name of the judge.

JCountry: The country of the judge.

First, I set the working directory and read in the data:

setwd("D:/Study/Yale/STAT625_Case Studies/Homework#2")
data <- read.csv("D:/Study/Yale/STAT625_Case Studies/Homework#2/Diving2000.csv", 
    header = TRUE, as.is = TRUE)
head(data)
##   Event Round    Diver Country Rank DiveNo Difficulty JScore
## 1 M3mSB Final XIONG Ni     CHN    1      1        3.1    8.0
## 2 M3mSB Final XIONG Ni     CHN    1      1        3.1    9.0
## 3 M3mSB Final XIONG Ni     CHN    1      1        3.1    8.5
## 4 M3mSB Final XIONG Ni     CHN    1      1        3.1    8.5
## 5 M3mSB Final XIONG Ni     CHN    1      1        3.1    8.5
## 6 M3mSB Final XIONG Ni     CHN    1      1        3.1    8.5
##                     Judge JCountry
## 1 RUIZ-PEDREGUERA Rolando      CUB
## 2             GEAR Dennis      NZL
## 3           BOYS Beverley      CAN
## 4           JOHNSON Bente      NOR
## 5         BOUSSARD Michel      FRA
## 6          CALDERON Felix      PUR
attach(data)

Load the “YaleToolkit” package and check for missing values:

library(YaleToolkit)
## Loading required package: grid Loading required package: lattice Loading
## required package: vcd Loading required package: MASS Loading required
## package: colorspace Loading required package: barcode Loading required
## package: gpairs
whatis(data)
##    variable.name      type missing distinct.values precision
## 1          Event character       0               4        NA
## 2          Round character       0               3        NA
## 3          Diver character       0             156        NA
## 4        Country character       0              42        NA
## 5           Rank   numeric       0              49       1.0
## 6         DiveNo   numeric       0               6       1.0
## 7     Difficulty   numeric       0              20       0.1
## 8         JScore   numeric       0              21       0.1
## 9          Judge character       0              25        NA
## 10      JCountry character       0              21        NA
##                  min           max
## 1             M10mPF         W3mSB
## 2              Final          Semi
## 3  ABALLI Jesus-Iory ZHUPINA Olena
## 4                ARG           ZIM
## 5                  1            49
## 6                  1             6
## 7                1.5           3.8
## 8                  0            10
## 9         ALT Walter  ZAITSEV Oleg
## 10               AUS           ZIM
data[is.na(data) == TRUE]
## character(0)
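
Indexing with `is.na(data)` returns the missing entries themselves, so an empty result means nothing is missing. A per-column count can be a more readable version of the same check. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the diving data) so that it runs on its own:

```r
# Hypothetical toy data frame (not the diving data), with two missing values
toy <- data.frame(
  Diver  = c("A", "B", NA, "D"),
  JScore = c(8.0, NA, 7.5, 9.0),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# TRUE if at least one value anywhere in the data frame is missing
print(any(is.na(toy)))
```

On the real data, `colSums(is.na(data))` should return zero for all ten variables, in agreement with the `missing` column reported by `whatis(data)`.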

[1] Check some categorical variables:

table(Event)
## Event
## M10mPF  M3mSB W10mPF  W3mSB 
##   2709   3192   2317   2569
table(Round)
## Round
##  Final Prelim   Semi 
##   1848   6636   2303
table(Country)
## Country
## ARG ARM AUS AUT AZE BLR BRA CAN CHN COL CUB CZE ESP FIN FRA GBR GEO GER 
##  35  42 728 175  42 112 189 560 868 119 301  42 259  84 224 448  35 672 
## GRE HKG HUN INA ITA JPN KAZ KOR MAS MEX PER PHI PRK PUR ROM RUS SUI SWE 
## 189  42 231 112 294 231 399 154 196 420  84  77 427  98 154 791  77 105 
## THA TPE UKR USA VEN ZIM 
##  84 175 476 833 161  42

[2] Overall summary of scores:

summary(JScore)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    6.00    7.00    6.83    8.00   10.00

[3] Explore JScore within each Event group. The distributions of JScore are similar across the groups, and all are left-skewed.

barplot(table(Event, JScore), beside = T, main = "Scores in Each Group", xlab = "Scores", 
    legend = T)

[Figure: bar plot “Scores in Each Group” (JScore counts by Event)]

[4] Based on Country, explore the distribution of JScore, Rank and Difficulty.

sort(tapply(JScore, Country, mean))
##   INA   ARG   HKG   THA   TPE   ARM   SUI   CZE   GRE   ZIM   PHI   ROM 
## 4.473 4.614 4.667 5.107 5.186 5.238 5.240 5.488 5.545 5.583 5.604 5.662 
##   PUR   KOR   COL   VEN   GEO   MAS   PER   FRA   AZE   ESP   GBR   BRA 
## 5.832 5.844 5.903 5.935 6.000 6.010 6.018 6.109 6.226 6.243 6.364 6.392 
##   AUT   FIN   CUB   HUN   KAZ   BLR   PRK   ITA   UKR   MEX   GER   AUS 
## 6.446 6.458 6.487 6.511 6.607 6.652 6.672 6.811 6.825 6.913 7.213 7.303 
##   CAN   USA   JPN   RUS   SWE   CHN 
## 7.440 7.477 7.591 7.624 7.648 8.159
sort(tapply(Difficulty, Country, mean))
##   SWE   ITA   PUR   TPE   BLR   KAZ   AUS   USA   RUS   ARG   MEX   UKR 
## 2.447 2.581 2.629 2.640 2.669 2.670 2.683 2.697 2.697 2.700 2.707 2.725 
##   GER   CAN   HKG   ESP   BRA   CHN   PRK   GEO   AUT   FRA   CUB   GBR 
## 2.726 2.726 2.733 2.759 2.767 2.771 2.775 2.780 2.784 2.784 2.791 2.795 
##   PHI   HUN   INA   JPN   COL   SUI   MAS   AZE   GRE   KOR   ROM   ARM 
## 2.818 2.833 2.844 2.852 2.865 2.873 2.932 2.967 2.974 2.995 3.009 3.017 
##   PER   VEN   FIN   THA   CZE   ZIM 
## 3.017 3.017 3.025 3.125 3.133 3.150
barplot(table(Country, JScore), beside = T, main = "Score in Each Country", 
    xlab = "Scores")

[Figure: bar plot “Score in Each Country” (JScore counts by Country)]

[5] Compare JScore by JCountry and Country to check for bias. Scores are higher on average when the judge and the diver come from the same country (7.462, against 6.833 overall), which suggests that some nationality bias exists.

mean(JScore[JCountry == Country])
## [1] 7.462
mean(JScore)
## [1] 6.833

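The gap looks even larger for China: Chinese judges give an average of 6.886 overall but 8.475 to Chinese divers. Part of this gap reflects the strength of the Chinese divers, who average 8.159 across all judges (see [4]), so the more relevant comparison is 8.475 against 8.159. There may still be some association.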

mean(JScore[JCountry == "CHN"])
## [1] 6.886
mean(JScore[JCountry == "CHN" & Country == "CHN"])
## [1] 8.475
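
Raw means mix two things: how good the diver is and how generous the judge is. One way to separate them is to compare each score with the mean score of the same dive and then ask whether same-country judges sit above the panel. The sketch below does this on a small made-up data frame. The names `toy`, `dive_id`, and `deviation` are illustrative only, and in the real data a dive would have to be identified by a combination such as Event, Round, Diver, and DiveNo.

```r
# Hypothetical toy data: two dives, three judges each (not the real data)
toy <- data.frame(
  dive_id  = c(1, 1, 1, 2, 2, 2),
  Country  = c("CHN", "CHN", "CHN", "USA", "USA", "USA"),
  JCountry = c("CHN", "USA", "FRA", "CHN", "USA", "FRA"),
  JScore   = c(9.0, 8.5, 8.5, 7.0, 8.0, 7.5),
  stringsAsFactors = FALSE
)

# Mean score of each dive, repeated for every row of that dive
# (ave() applies mean() within groups by default)
panel_mean <- ave(toy$JScore, toy$dive_id)

# How far each judge sits above or below the panel for that dive
toy$deviation <- toy$JScore - panel_mean

# Average deviation when judge and diver share a country, and when not
same_country <- toy$JCountry == toy$Country
print(tapply(toy$deviation, same_country, mean))
```

A positive average deviation for the same-country group would point to bias that is not explained by diver quality. Note that the panel mean here includes the judge's own score, which slightly understates any deviation.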

[6] Explore bimodality in the degree of Difficulty. The correlation between JScore and Difficulty is negative.

hist(Difficulty)

[Figure: histogram of Difficulty]

cor(JScore, Difficulty)  # Negative
## [1] -0.2724

I then plot JScore against Difficulty, colored by gender. First define “male”:

male <- Event %in% c("M3mSB", "M10mPF")  # TRUE for the two men's events

Now back to the plot:

# col = 1 + male: black points for women's events, red for men's
plot(jitter(Difficulty), jitter(JScore), xlab = "Degree of Difficulty", 
    ylab = "Judges' Scores", col = 1 + male)

Next, I create a bar graph of “Difficulty” by “Round” to see whether the round affects the difficulty. Divers usually choose more difficult dives in the preliminary and final rounds (mean difficulty 2.981 and 3.061) and relatively easier dives in the semifinal round (mean 1.896). I think this explains the bimodality in “Difficulty”. I wonder whether the rules require harder dives in the preliminary and final rounds, for example through specific required dives, or whether divers simply play it safe in the semifinal to reach the final.

levels(as.factor(Round))
## [1] "Final"  "Prelim" "Semi"
levels(as.factor(Difficulty))
##  [1] "1.5" "1.6" "1.8" "1.9" "2"   "2.1" "2.4" "2.5" "2.6" "2.7" "2.8"
## [12] "2.9" "3"   "3.1" "3.2" "3.3" "3.4" "3.5" "3.6" "3.8"
tapply(Difficulty, Round, mean)
##  Final Prelim   Semi 
##  3.061  2.981  1.896
barplot(table(Round, Difficulty), beside = T, legend = T, main = "Difficulty Distribution for Every Round")

[Figure: bar plot “Difficulty Distribution for Every Round”]
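
Because the rounds contain very different numbers of scores (6,636 preliminary against 1,848 final), raw counts are hard to compare across rounds. A sketch of the same kind of table converted to within-round proportions, on made-up values (`toy` is hypothetical, not the diving data):

```r
# Hypothetical toy data (not the real data)
toy <- data.frame(
  Round      = c("Prelim", "Prelim", "Prelim", "Prelim", "Semi", "Semi"),
  Difficulty = c(3.0, 3.0, 3.2, 1.9, 1.9, 2.0)
)

# Counts of each difficulty within each round
counts <- table(toy$Round, toy$Difficulty)

# Convert each row (round) to proportions that sum to 1
props <- prop.table(counts, margin = 1)
print(round(props, 2))

# The proportions can be plotted the same way as the counts:
# barplot(props, beside = TRUE, legend = TRUE)
```

With the attached diving data, the equivalent call would be `prop.table(table(Round, Difficulty), margin = 1)`.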

Then I check whether “Difficulty” is the same across the “Event” groups. The men's dives are slightly more difficult than the women's on average. Interestingly, each group shows a distribution similar to the overall one.

tapply(Difficulty, Event, mean)
## M10mPF  M3mSB W10mPF  W3mSB 
##  2.926  2.849  2.695  2.546
barplot(table(Event, Difficulty), beside = T, legend = T, main = "Difficulty Distribution for Every Group")

[Figure: bar plot “Difficulty Distribution for Every Group”]

Also, the distribution of scores is similar in each round and is left-skewed.

barplot(table(Round, JScore), beside = T, legend = T, main = "Scores Distribution for Every Round")

[Figure: bar plot “Scores Distribution for Every Round”]
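
Left skew can be checked numerically as well as visually. The overall mean (6.83) lies below the median (7.00), which is typical of a left-skewed distribution. A minimal sketch of a moment-based skewness coefficient follows. The function name `skewness` and the toy scores are assumptions for illustration, and negative values indicate a longer left tail.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 3/2
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Hypothetical toy scores with a long left tail (not the real data)
toy_scores <- c(2.0, 6.5, 7.0, 7.5, 7.5, 8.0, 8.0, 8.5)

print(skewness(toy_scores))                   # negative for a left-skewed sample
print(mean(toy_scores) < median(toy_scores))  # mean pulled below the median
```

With the attached diving data, `tapply(JScore, Round, skewness)` would give one coefficient per round, which makes the comparison across rounds explicit.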