Tutorials, Workshops, and Short Courses in R and Statistics
2015
All lengths and levels, customizable, on a wide range of topics!

john.emerson AT yale.edu

John W. Emerson (Jay)
Associate Professor Adjunct and Director of Graduate Studies
Department of Statistics, Yale University

An Introduction to R for Non-Programmers (1- or 2-day)

This workshop introduces the R language for statistical computing and graphics in a manner accessible to professionals without prior programming experience. No prior experience in programming or statistics is required. The workshop focuses on the core of the language: working with data, data structures, base graphics, loops, and functions. Examples will demonstrate statistical methods for data analysis including tables, t-tests, linear regression, and analysis of variance. Participants will work through examples both individually and as a group.
The day will be organized loosely around three modules although the pace will adapt to the level of the participants:

  • The core language syntax and data structures for working with and exploring data. Accessing and organizing data; arithmetic and logical operators; conditional statements; loops; subsetting; common functions; getting help and using extension packages.

  • Graphics. An emphasis on base graphics; graphical output formats; customization; and time permitting, an introduction to lattice and ggplot2.

  • From data exploration to statistical inference. Primary case study: the 2000 Olympic diving competition. Other case studies help reinforce the earlier material.
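As a rough taste of the first module, the sketch below touches subsetting, a conditional inside a function, a loop, and a common summary function. The data frame and the helper name label_score are made up for illustration; they are not the workshop's data or code.

```r
# A small, made-up data frame (hypothetical values, for illustration only)
scores <- data.frame(
  diver   = c("A", "B", "C", "D"),
  country = c("USA", "CHN", "USA", "AUS"),
  score   = c(7.5, 9.0, 8.0, 6.5),
  stringsAsFactors = FALSE
)

# Subsetting rows with a logical condition
high <- scores[scores$score >= 8, ]

# A user-defined function containing a conditional
label_score <- function(x) {
  if (x >= 8) "high" else "low"
}

# A for loop that fills a preallocated character vector
labels <- character(nrow(scores))
for (i in seq_len(nrow(scores))) {
  labels[i] <- label_score(scores$score[i])
}

# A common function applied by group: mean score per country
mean_by_country <- tapply(scores$score, scores$country, mean)
```

Typing ?tapply or help(tapply) at the R prompt brings up the documentation for any function used here.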

Different people approach statistical computing with R in different ways. It can be helpful to start with a real-data problem and learn something about R “on the fly” while trying to solve a problem. But it is also useful to have a more organized, formal introduction to the core of the language without the distraction of a complicated applied problem. This course offers overlapping modules which help reinforce the key concepts.

Using data from the 2000 Olympic diving competition, you will learn or review a small subset of the R language and syntax that supports an impressively large portion of everyday statistical visualization and analysis. Particular methods for review in this example include a comparison of t- and permutation tests. We'll start with displays from R's base graphics and will conclude with an introduction to grid graphics programming. Other smaller data examples will be used throughout the workshop.
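The comparison of t- and permutation tests can be sketched in a few lines of base R. The example below uses simulated scores for two hypothetical groups, not the Olympic diving data, and the number of shuffles (5000) is an arbitrary choice.

```r
set.seed(1)  # for reproducibility

# Simulated scores for two hypothetical groups (NOT the Olympic diving data)
group_a <- rnorm(20, mean = 7.0, sd = 1)
group_b <- rnorm(20, mean = 7.5, sd = 1)

# Welch two-sample t-test from base R
t_result <- t.test(group_a, group_b)

# Permutation test: shuffle the pooled values and recompute the mean difference
observed <- mean(group_a) - mean(group_b)
pooled   <- c(group_a, group_b)
n_a      <- length(group_a)

perm_diffs <- replicate(5000, {
  shuffled <- sample(pooled)
  mean(shuffled[1:n_a]) - mean(shuffled[-(1:n_a)])
})

# Two-sided p-value: share of shuffles at least as extreme as the observed difference
perm_p <- mean(abs(perm_diffs) >= abs(observed))

# Show the two p-values side by side
c(t_test = t_result$p.value, permutation = perm_p)
```

The t-test leans on a distributional approximation, while the permutation test builds its reference distribution directly from the data; with well-behaved samples like these the two p-values are usually close.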

All participants will receive electronic copies of all slides, data sets, exercises, and R scripts used in the course.


An Intensive Introduction to R (1- or 2-day)

This daylong workshop introduces the R language for statistical computing and graphics in a manner that assumes some prior programming experience in another language (such as Python, Perl, Matlab, or C/C++). No prior experience in statistics is required. The workshop focuses on the core of the language: working with data, data structures, base graphics, loops, and functions, and will emphasize the use of R as a programming language to address challenges beyond standard tools for data analysis and exploration. Advanced topics include regular expressions, data cleaning/munging, and an array of statistical methods for data analysis. Participants will work through examples both individually and as a group. The day will be organized around four modules:

  • The core language syntax and data structures for working with and exploring data. Accessing and organizing data; arithmetic and logical operators; conditional statements; loops; subsetting; common functions; getting help and using extension packages.

  • Graphics and writing customized functions. An emphasis on base graphics; graphical output formats; customization; an introduction to lattice and ggplot2.

  • From data exploration to statistical inference, including cleaning/munging data from unusual sources. Case studies: the 2000 Olympic diving competition; studying bookie point spreads on college basketball.

  • Open topics to be determined, with data examples provided by the participants.

Different people approach statistical computing with R in different ways. It can be helpful to start with a real-data problem and learn something about R “on the fly” while trying to solve a problem. But it is also useful to have a more organized, formal introduction to the core of the language without the distraction of a complicated applied problem. This course offers four distinct modules which offer some overlap, reinforcing the key concepts. This is a hands-on class where attendees will benefit from working along with the instructor.

Using data from the 2000 Olympic diving competition, you will learn or review a small subset of the R language and syntax that supports an impressively large portion of everyday statistical visualization and analysis. Particular methods for review in this example include a comparison of t- and permutation tests. We'll also study gambling point spreads for sports, which are available online and can be processed automatically in R. We'll start with displays from R's base graphics and will conclude with an introduction to grid graphics programming. Other smaller data examples will be used throughout the workshop.
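Cleaning semi-structured text with regular expressions might look something like the sketch below. The raw lines, their layout, and the column names are hypothetical; real online sources are formatted differently and usually need more care.

```r
# Hypothetical raw lines; real online sources will be formatted differently
raw <- c("Duke -7.5 North Carolina",
         "Yale +3 Harvard",
         "  Kansas -12 Baylor ")

# Trim leading and trailing whitespace
clean <- trimws(raw)

# Capture groups: first team, signed spread, second team
pattern <- "^(.*) ([+-][0-9.]+) (.*)$"
parts <- regmatches(clean, regexec(pattern, clean))

# Each element of parts holds the full match followed by the three groups
spreads <- data.frame(
  team1  = sapply(parts, `[`, 2),
  spread = as.numeric(sapply(parts, `[`, 3)),
  team2  = sapply(parts, `[`, 4),
  stringsAsFactors = FALSE
)
```

Once the text is in a data frame, the usual tools for subsetting, summarizing, and plotting apply.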

The final module will be shaped around participant interests and data contributions. All participants will receive electronic copies of all slides, data sets, exercises, and R scripts used in the course.


High-Performance Computing in R (1-day)

Overview

This intermediate-level masterclass will introduce you to topics in high-performance computing with R. We will begin by examining a range of related topics including memory management and algorithmic efficiency. Next, we will quickly explore the new parallel package (which incorporates much of the functionality of snow and multicore). We will then concentrate on the elegant framework for parallel programming offered by package foreach and its associated parallel backends. The R package management system, including the C/C++ interface and use of package Rcpp, will be covered. We will conclude with basic examples of handling larger-than-RAM numeric matrices and use of shared memory. Hands-on exercises will be used throughout.

What will I learn?

Different people approach statistical computing with R in different ways. It can be helpful to work on real data problems and learn something about R “on the fly” while trying to solve a problem. But it is also useful to have a more organized, formal presentation without the distraction of a complicated applied problem. This course offers four distinct modules which adopt both approaches and offer some overlap across the modules, helping to reinforce the key concepts. This is an active-learning class where attendees will benefit from working along with the instructor. Roughly, the modules include:

  • An intensive review of the core language syntax and data structures for working with and exploring data. Functions; conditional statements; loops; subsetting; manipulating and cleaning data; efficiency considerations and best practices, including loops and vector operations, memory overhead and optimizing performance.

  • Motivating parallel programming with an eye on programming efficiency: a case study. Processing, manipulating, and conducting a basic analysis of 100-200 MB of raw microarray data provides an excellent challenge on standard laptops. It is large enough to be mildly annoying, yet small enough that we can make progress and see the benefits of programming efficiency and parallel programming.

  • Topics in high-performance computing with R, including packages parallel and foreach. Hands-on examples will help reinforce key concepts and techniques.

  • Authoring R packages, including an introduction to the C/C++ interface and the use of Rcpp for high-performance computing. Participants will build a toy package including calls to C/C++ functions.
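The efficiency considerations in the first module (loops versus vector operations, memory overhead) can be illustrated with a minimal sketch. The function names grow, prealloc, and vectorized are invented for this example, and timings will vary by machine.

```r
n <- 1e4
x <- runif(n)

# Growing a vector inside a loop forces repeated copying (slow)
grow <- function(x) {
  out <- c()
  for (xi in x) out <- c(out, xi^2)
  out
}

# Preallocating the result avoids the repeated copying
prealloc <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- x[i]^2
  out
}

# A vectorized operation does the same work in compiled code
vectorized <- function(x) x^2

# Compare elapsed times; exact numbers depend on the machine
system.time(r_grow <- grow(x))
system.time(r_pre  <- prealloc(x))
system.time(r_vec  <- vectorized(x))
```

All three functions return the same values; only the time and memory cost differ, which is often reason enough to revisit the code before reaching for C/C++ or parallel programming.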

Is this class right for me?

This class will be a good fit for you if you are comfortable working in R and are familiar with R's core data structures (vectors, matrices, lists, and data frames). You are comfortable with for loops and preferably aware of R's apply-family of functions. Ideally you will have written a few functions on your own. You have some experience working with R, but are ready to take it to the next level. Or, you may have considerable experience with other programming languages but are interested in quickly getting up to speed in the areas covered by this masterclass.

After this workshop, what will I be able to do?

You will be in a better position to code efficiently with R, perhaps avoiding the need, in some cases, to resort to C/C++ or parallel programming. But you will be able to implement so-called embarrassingly parallel algorithms in R when the need arises, and you'll be ready to exploit R's C/C++ interface in several ways. You'll be in a position to author your own R package that can include C/C++ code.
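For a flavor of what an embarrassingly parallel computation looks like, here is a minimal sketch using only the parallel package that ships with R. The bootstrap task, the helper name one_boot, and the choice of two workers are assumptions made for illustration; the same pattern can also be written with foreach and a registered backend.

```r
library(parallel)

# A hypothetical independent task: one bootstrap replicate of the sample mean
one_boot <- function(i, data) mean(sample(data, replace = TRUE))

set.seed(42)
data <- rnorm(1000)

# Start a small socket cluster (works on Windows, macOS, and Linux)
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 123)  # reproducible random numbers on the workers

# Each of the 200 replicates is independent, so they can run on any worker
boot_means <- unlist(parLapply(cl, 1:200, one_boot, data = data))

stopCluster(cl)

# Bootstrap estimate of the standard error of the mean
sd(boot_means)
```

Because no replicate depends on another, the work divides cleanly across workers; the main costs are starting the cluster and shipping the data to each worker.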

Other information

All participants will receive electronic copies of all slides, data sets, exercises, and R scripts used in the course.

You will need your laptop with the latest version of R. I recommend use of the RStudio IDE, but it is not necessary. A few add-on packages will be used in the workshop, including Rcpp and foreach. As a complement to foreach you should also install doMC (Linux or MacOS only) and doSNOW (all platforms). If you want to work along with the C/C++ interface segment, some extra preparation will be required. Rcpp and use of the C/C++ interface require compilers and extra tools; the folks at RStudio have a nice page that summarizes the requirements. Please note that these requirements may not be trivial (particularly in Windows) and need to be completed prior to the workshop if you intend to compile C/C++ code and use Rcpp during the workshop.