July 23, 2015

It's great to be here!

Backstories

Early Years: Basic Legos

A friend's specialized Legos

Teenage Years: The Bigger the Better?

The National Security Agency

NSA: Super-Secret Big Data

Investing: Picking single stocks is like… gambling!

I'll return to these stories later

New Pursuits: Leveling the playing field in HPC

What is leveling the playing field?

Example: "Big Data" problems – everyone should have a chance to participate!


Source: washingtonpost.com

An Unusual Opening

An Unusual Opening

  • Please forgive me for starting this talk with too much text.
  • And forgive me for reading some of it to you.
  • Together, I'd say these are usually warning signs of a potentially "bad" talk.
  • But in this case I have a "good" reason… so thank you for your patience, understanding, and trust!

Abstract

The evolution of computing is currently in a period of rapid change. These changes are certain to provide major opportunities and challenges in statistics. We look at history and current trends, both in general computing and in statistical computing, with the goal of identifying key features and requirements for the near future.


Source: Wikipedia

Landscape 1

Computation has always been a major requirement for the statistical analysis of data, for it limits the quantity of data to be analyzed and influences which analytical methods are practical. Thus the development of computer systems and programs is central to statistics, and has never been more so than now.


Source: dilbert.com

Landscape 2

We present some ideas on where we have been and where we may be going. Past efforts in statistical computing have been valuable to users and have greatly increased the quantity of statistical analysis.

Aside: what about the quality of statistical analysis?

Landscape 3

We will emphasize human needs, both for the users of an interactive statistical system and for the developers of new or modified statistical software.

Source: Ekonometrics.blogspot.com

Landscape 4

A good algorithm should be portable, reliable, and should adapt well to a variety of problems. It should not limit the size of problems handled or make restrictive assumptions about available features in the programming environment.

Landscape 5

The writing and testing of programs have become an increasingly large part of the cost of using computers.


Source: overcomingbias.com

Landscape 6

There have been important developments in designing and writing computer software. The general thrust of these advances has been to produce programs that are well defined, correct, and understandable, and that also reduce the effort (particularly the drudgery) of programming, by providing the programmer with better tools and a richer software environment.

Landscape 7

The important measurement here is against the value of the scarcest resource in most environments: skilled human labor. The most important single implication is the need to consider the wise use of human effort more carefully than has often been the case.

Source: brookings.edu

End of Opening Remarks

  • The previous slides contained approximately 300 words.
  • I wrote (or mildly edited) about 5 of them. I stole the rest!
  • I deleted a few phrases for impact but without changing the substance.
  • The original text: by John Chambers (creator of the S programming language and R core member) in The American Statistician
  • published in 1980!

A Tip of the Hat to John Chambers: Creator of the S Language and R Core Member

So what is the "changing landscape of statistical computing"?

  • Big Data, High-Performance Computing, Data Science, Data Analytics… ?
  • New jargon?
  • New applied problems (choose your domain or area of work/study)? Yes!
  • Perhaps little is really new, though much has improved (bigger, better, faster, easier)!
  • One exception: distributed computing (for example, Hadoop/MapReduce), but here the playing field is not level.
  • Another exception: GPU programming. Extremely specialized; not sure about the level playing field!

This talk will not dwell on:

Source: Michael Kane (my 2nd PhD student and co-author of bigmemory)

So…

  • While computing resources become faster/bigger/cheaper, these changes aren't really "game changers".
  • Statistical analysis supported by computers is becoming easier and easier, accessible to virtually everyone. This is amazing, and amazingly dangerous. Quantity? Yes. Quality? There are no guarantees.
  • Thus, I want to:
    • give a few minimalist computational examples
    • advocate for a change in culture
    • provide a few other observations for you to debate among yourselves over cerveja (beer)

A Few Toy Examples in R (with general lessons that extend beyond the choice of language)

The following benchmarks are approximate and minimalist, to make the code easier to read.

Toy Examples: General Lessons on MEMORY, SPEED, and PARALLEL PROGRAMMING

The following benchmarks are approximate and minimalist, to make the code easier to read.

Toy Examples: Setup

# A function to cleanly report peak memory consumption:
mygc <- function(reset=FALSE) {
  paste(gc(reset=reset)[2,6], "MB")
}
mygc(reset=TRUE) # Current session memory consumption
## [1] "3.7 MB"
# And to size objects for these toy examples:
N <- 4000000     # Four million

Toy Example 1: MEMORY

Toy Example 1: Memory

object.size( x <- rep(as.integer(0), N) )
## 16000040 bytes
x <- x + 1       # Serious (toy) computation here...
object.size(x)   # ... resulting in type coercion!
## 32000040 bytes
                 # And resulting memory overhead, up from the
mygc()           # baseline?  Yikes!
## [1] "49.6 MB"

Toy Example 1: Memory, continued

  • But what happens when x is "big" (occupying non-trivial amount of available RAM) and you are doing "real statistical analysis"?
  • And what if your data set is so large it can't fit into RAM? Use SAS? That's an expensive option that doesn't help level the playing field.
  • These sorts of problems were (and are) the subject of research with Michael Kane. The work is too specific to R and beyond the scope of today's talk, but a tiny flavor appears in the sketch below.
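
A minimal sketch of that flavor, assuming the bigmemory package is installed; the file names here are hypothetical.

library(bigmemory)
# A file-backed big.matrix lives on disk rather than entirely in RAM
# (hypothetical backing/descriptor file names):
X <- filebacked.big.matrix(nrow=4, ncol=1000000, type="integer",
                           backingfile="X.bin", descriptorfile="X.desc")
X[, 1] <- 1:4      # standard matrix-style indexing
sum(X[, 1])        # only the columns you touch are pulled into R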

Toy Example 2: SPEED

Toy Example 2a: Speed

x <- matrix(1:N, nrow=4)    # 4 rows, lots of columns

# Let's call this the "baseline" statistical analysis.
# We'll consider and compare several different approaches.
system.time({ 
  
    ans <- apply(x, 2, sum)
  
})["elapsed"]
## elapsed 
##    2.68

About 2.5 seconds.

Toy Example 2b: Speed

# Let's agree this is really really bad.  But why?
system.time({ 
  
    ans <- NULL
    for (i in 1:ncol(x)) {
      ans <- c(ans, sum(x[,i]))
    }
  
})["elapsed"]
## elapsed 
## 1243.36

About 20 minutes!
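
One way to see why, not on the original slide: growing ans with c() copies everything accumulated so far on every iteration, so the cost grows quadratically with the number of columns.

grow <- function(m) {   # illustrative helper, not part of the talk
  ans <- NULL
  for (i in 1:ncol(m)) ans <- c(ans, sum(m[,i]))
  ans
}
system.time(grow(x[, 1:50000]))["elapsed"]    # a modest baseline
system.time(grow(x[, 1:100000]))["elapsed"]   # roughly 4x longer, not 2x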

Toy Example 2c: Speed

# ... and this is much better.  But why?
system.time({
  
    ans <- rep(NA, ncol(x))
    for (i in 1:ncol(x)) {
      ans[i] <- sum(x[,i])
    }
  
})["elapsed"]
## elapsed 
##   1.773

Slightly less overhead than apply().

Toy Example 2d: Speed

library(Rcpp)
sourceCpp(code='
  #include <Rcpp.h>
  using namespace Rcpp;
  // [[Rcpp::export]]
  NumericVector mycolsum(NumericMatrix x) {
    NumericVector ans(x.ncol());
    int i, j;
    for (j=0; j<x.ncol(); j++) {
      ans[j] = 0;
      for (i=0; i<x.nrow(); i++) {
        ans[j] += x(i,j);
      }
    }
    return ans;
  }')

Toy Example 2d, continued

# Blazingly fast via compiled C++ code:
system.time({
  
    ans <- mycolsum(x)   # mycolsum() created by Rcpp, above
  
})
##    user  system elapsed 
##   0.031   0.001   0.032

Wow. But C/C++ coding isn't "free" – it takes more human effort.
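
For comparison, and not on the original slides: base R already provides a compiled building block for exactly this computation, so sometimes the skilled human effort is simply knowing that it exists.

# colSums() is implemented in compiled code; no C++ required on our part:
system.time({

    ans <- colSums(x)

})["elapsed"]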

Some good questions:

  • Do I care?
  • Is it worth knowing about this stuff?
  • Is this really about "high-performance computing"?
  • Can't we do better with parallel programming?
  • The answers? All the same:
  • It depends!

Toy Example 3: PARALLEL PROGRAMMING

Toy Example 3a: Parallel Programming

library(parallel)
# One of the easiest parallel code snippets:
system.time({

    ans <- mclapply(1:ncol(x),
                    function(i) sum(x[,i]), 
                    mc.cores=2)

})["elapsed"]
## elapsed 
##   1.815

Nice. Easy. But not portable and not for clusters.
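
A minimal sketch, not on the original slide, of the cluster-capable cousin in the same parallel package; socket clusters also work on Windows.

cl <- makeCluster(2)          # 2 local worker processes (socket cluster)
clusterExport(cl, "x")        # socket workers don't share memory
ans <- unlist(parLapply(cl, 1:ncol(x),
                        function(i) sum(x[,i])))
stopCluster(cl)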

Toy Example 3b: Parallel Programming

library(foreach)
library(itertools)
## Loading required package: iterators
library(doMC)
## Loading required package: parallel
registerDoMC(2) # Using 2 processor cores

Toy Example 3b, continued

# This is terrible!  Why?
system.time({
  
  ans <- foreach(i=1:ncol(x),
                 .combine=c) %dopar%
         {
           return(sum(x[,i]))
         }
  
})["elapsed"]
## elapsed 
## 595.448

Another example of what not to do!

Toy Example 3c: Parallel Programming

# This is much better and more elegant.  But iterators is beyond
# the scope of what should be discussed today.  Maybe tomorrow!
system.time({

  iter <- isplitIndices(ncol(x), chunks=2)
  ans <- foreach(i=iter,
                 .combine=c) %dopar%
         {
           return(apply(x[,i], 2, sum))
         }

})["elapsed"]
## elapsed 
##   1.423

Sometimes a little better than mclapply().
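
A tiny peek, not on the original slide, at what the iterator actually produces:

it <- isplitIndices(10, chunks=2)   # split 1:10 into 2 chunks
nextElem(it)    # 1:5
nextElem(it)    # 6:10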

Or with SNOW and the parallel backend doSNOW?

  • Not just a coding example.
  • A topic for my workshop tomorrow: SNOW is less memory-efficient than multicore in most interesting cases.
  • But unlike multicore, SNOW supports sophisticated parallel environments including clusters.
  • One of the best reasons to consider foreach: code portability
  • It's efficient, and doesn't force others to adopt your parallel transport mechanism (e.g., SNOW, multicore, Rmpi, etc.) – they can choose their own without substantive code modification!
library(doSNOW)

Toy Example 3d: Parallel Programming

# Only superficial cluster registration changes required:
machines <- rep("localhost", each=2)      # changed
cl <- makeCluster(machines, type="SOCK")  # changed
registerDoSNOW(cl)                        # changed
system.time({                              # NOT CHANGED
  iter <- isplitIndices(ncol(x), chunks=2) # NOT CHANGED
  ans <- foreach(i=iter,                   # NOT CHANGED
                 .combine=c) %dopar%       # NOT CHANGED
         {                                 # NOT CHANGED
           return(apply(x[,i], 2, sum))    # NOT CHANGED
         }                                 # NOT CHANGED
})["elapsed"]                              # NOT CHANGED
## elapsed 
##   1.853
stopCluster(cl)                   # strongly recommended

What Have We Learned From These Examples?

What Have We Learned?

  • A little basic "computer science" can go a long way.
  • Simple benchmarks can help us discover speed and memory efficiencies, even if the underlying reasons aren't always well understood.
  • There can be huge differences between "good code" and "bad code".
  • Parallel programming isn't automatically better.
  • John Chambers was right in advocating for the importance of "skilled human effort": don't rely upon luck!
  • Skilled human effort? But how?

But How?

But How?

  • Educate: Not just our students – we need to get our hands dirty and set a good example!
  • Pedagogy: You don't learn to speak "português do Brasil" (Brazilian Portuguese) by reading a book; you learn "em visita ao Brasil" (on a visit to Brazil, according to Google Translate). Apply the same lesson to statistical computing and computer programming.
  • Reward and recognize the importance of applied and computational work/research alongside more traditional theory and methodology.
  • Realize that breadth of expertise may, in some cases, be more valuable than depth of expertise, as long as that breadth isn't superficial.

Returning to the Backstory and Wrapping Up

Proficiency in (real) programming lets you do… anything!

Super-specialized tools have a role, but are limiting.

Throwing around money to address Big Data limitations? Not me!

Don't be like the NSA!

Don't invest in single stocks… diversify! I recommend Python/Perl, R/Matlab, and some basic C/C++.

RBras/SEAGRO 2015

The general theme of the 2015 RBras/SEAGRO:

  • "New Challenges in Statistics: Handling and Modeling Information"
  • The 60th Meeting of RBras and the 16th SEAGRO aims to bring together researchers from all areas, employing statistics in decision-making, to discuss teaching and learning processes of statistical techniques and data analysis and ways to expand access to statistical expertise, promoting multidisciplinary exchanges.

Commentary

  • What are the "New Challenges in Statistics"? (I'm not sure, and I'm pretty sure we're getting better at "handling and modeling information"!)
  • What is the greatest contribution of our field to "the world" or "humankind" over the last 100 years? (I would argue "experimental design" – with roots in agricultural science!)
  • I would also argue that the "Big Data" explosion has lost sight of the importance of experimental design.

More Commentary

  • There have been visible calls in the last few weeks to abolish the use of p-values in scientific publications.
  • This is our own fault.
    • We need to do a better job of educating.
    • We need to recognize the dangers of "easy technology".
    • Understanding is more important than automation.
  • We can't afford to over-specialize, thinking only of statistics. Or only of Computer Science. We must diversify, add breadth of expertise and become fluent in a range of areas.
  • We need to engage and become better communicators and collaborators, because the world (industry and business in particular) won't wait for us. Otherwise, we'll be left behind.

Conclusion?

  • What is the "Changing Landscape of Statistical Computing"?
    • Trick question!
      • Of course things have changed since 1980. Or 1990. Or 2000. Or 2010.
      • But the most important thing has not changed.
        • What is it?
  • Answer: the importance of human capital (or skilled human labor, in Chambers' words)

Conclusion

The entire field of computing is in a period of rapid expansion and change. Many exciting challenges face those who want to extend our capabilities in the statistical use of computers. The advances in hardware and software make possible more effective and widespread computing for data analysis. The most valuable resource is skilled human effort; computing should be organized to improve the ease and the quality of our work, whether we are users of statistical systems or designers and implementers of software.

Conclusion from Chambers (1980)

The entire field of computing is in a period of rapid expansion and change. Many exciting challenges face those who want to extend our capabilities in the statistical use of computers. The advances in hardware and software make possible more effective and widespread computing for data analysis. The most valuable resource is skilled human effort; computing should be organized to improve the ease and the quality of our work, whether we are users of statistical systems or designers and implementers of software.

Thank you, RBras!

  • Thanks to Helio Migon and Aparecida Souza for arranging my visit and keeping an eye on my travel adventure.

  • No thanks to United or Gol for my being a full 2 days late.

  • This talk and materials from this morning's workshop will be available at: http://www.stat.yale.edu/~jay/RBras/. Give me a few days – things changed a lot in the last 24 hours.

  • John W. Emerson (Jay)