July 23, 2015
Early Years: Basic Legos
A friend's specialized Legos
Teenage Years: The Bigger the Better?
The National Security Agency
NSA: Super-Secret Big Data
Investing: Picking single stocks is like… gambling!
Example: "Big Data" problems – everyone should have a chance to participate!
The evolution of computing is currently in a period of rapid change. These changes are certain to provide major opportunities and challenges in statistics. We look at history and current trends, both in general computing and in statistical computing, with the goal of identifying key features and requirements for the near future.
Computation has always been a major requirement for the statistical analysis of data, for it limits the quantity of data to be analyzed and influences which analytical methods are practical. Thus the development of computer systems and programs is central to statistics, and has never been more so than now.
Source: dilbert.com
We present some ideas on where we have been and where we may be going. Past efforts in statistical computing have been valuable to users and have greatly increased the quantity of statistical analysis.
Aside: what about the quality of statistical analysis?
We will emphasize human needs, both for the users of an interactive statistical system and for the developers of new or modified statistical software.
A good algorithm should be portable, reliable, and should adapt well to a variety of problems. It should not limit the size of problems handled or make restrictive assumptions about available features in the programming environment.
The writing and testing of programs have become an increasingly large part of the cost of using computers.
There have been important developments in designing and writing computer software. The general thrust of these advances has been to produce programs that are well defined, correct, and understandable, and that also reduce the effort (particularly the drudgery) of programming, by providing the programmer with better tools and a richer software environment.
The important measurement here is against the value of the scarcest resource in most environments: skilled human labor. The most important single implication is the need to consider the wise use of human effort more carefully than has often been the case.
Source: brookings.edu
Source: Michael Kane (my 2nd PhD student and co-author of bigmemory)
The following benchmarks are approximate and minimalist, to make the code easier to read.
# A function to cleanly report peak memory consumption:
mygc <- function(reset=FALSE) {
  paste(gc(reset=reset)[2,6], "MB")
}
mygc(reset=TRUE)   # Current session memory consumption
## [1] "3.7 MB"
# And to size objects for these toy examples:
N <- 4000000   # Four million
object.size( x <- rep(as.integer(0), N) )
## 16000040 bytes
x <- x + 1       # Serious (toy) computation here...
object.size(x)   # ... resulting in type coercion!
## 32000040 bytes
# And resulting memory overhead, up from the mygc()
# baseline? Yikes! (The old integer x and the new double x
# coexist in memory during the computation.)
mygc()
## [1] "49.6 MB"
What if x is "big" (occupying a non-trivial amount of available RAM) and you are doing "real statistical analysis"?

x <- matrix(1:N, nrow=4)   # 4 rows, lots of columns
# Let's call this the "baseline" statistical analysis.
# We'll consider and compare several different approaches.
system.time({
  ans <- apply(x, 2, sum)
})["elapsed"]
## elapsed ## 2.68
About 2.7 seconds.
# Let's agree this is really really bad. But why?
system.time({
  ans <- NULL
  for (i in 1:ncol(x)) {
    ans <- c(ans, sum(x[,i]))
  }
})["elapsed"]
## elapsed ## 1243.36
About 20 minutes! Each call to c() copies the entire ans vector, so the total work grows quadratically with the number of columns.
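To see that quadratic growth directly, here is a minimal sketch (my addition, with arbitrary toy sizes): doubling the length should roughly quadruple the time.

# Hypothetical mini-benchmark: growing a vector with c() is O(n^2)
grow <- function(n) { a <- NULL; for (i in 1:n) a <- c(a, i); a }
system.time(grow(20000))["elapsed"]
system.time(grow(40000))["elapsed"]   # expect roughly 4x, not 2x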
# ... and this is much better. But why?
system.time({
  ans <- rep(NA, ncol(x))
  for (i in 1:ncol(x)) {
    ans[i] <- sum(x[,i])
  }
})["elapsed"]
## elapsed ## 1.773
Slightly less overhead than apply(): here ans is preallocated, so nothing is copied as the loop runs.
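For reference (my addition, not benchmarked in the original slides), base R also ships a compiled column-sum function:

# Base R's compiled colSums(); typically the fastest pure-R option:
system.time({
  ans <- colSums(x)
})["elapsed"]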
library(Rcpp)
sourceCpp(code='
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector mycolsum(NumericMatrix x) {
  NumericVector ans(x.ncol());
  int i, j;
  for (j=0; j<x.ncol(); j++) {
    ans[j] = 0;
    for (i=0; i<x.nrow(); i++) {
      ans[j] += x(i,j);
    }
  }
  return ans;
}')
# Blazingly fast via compiled C++ code:
system.time({
  ans <- mycolsum(x)   # mycolsum() created by Rcpp, above
})
## user system elapsed ## 0.031 0.001 0.032
Wow. But C/C++ coding isn't "free" – it takes more human effort.
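A quick correctness check (my addition) that the compiled version agrees with base R:

# Sanity check: the Rcpp result should match base R's colSums()
stopifnot(all.equal(as.numeric(colSums(x)), mycolsum(x)))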
library(parallel)
# One of the easiest parallel code snippets:
system.time({
  ans <- mclapply(1:ncol(x), function(i) sum(x[,i]), mc.cores=2)
})["elapsed"]
## elapsed ## 1.815
Nice. Easy. But not portable (mclapply() relies on forking, which is unavailable on Windows) and not for clusters.
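One subtlety worth flagging (my addition): mclapply() returns a list, so to match the numeric vectors produced by the earlier approaches you would flatten it:

# mclapply() returns a list; unlist() to match the other approaches:
ans <- unlist(mclapply(1:ncol(x), function(i) sum(x[,i]), mc.cores=2))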
library(foreach)
library(itertools)
## Loading required package: iterators
library(doMC)
## Loading required package: parallel
registerDoMC(2) # Using 2 processor cores
# This is terrible! Why?
system.time({
  ans <- foreach(i=1:ncol(x), .combine=c) %dopar% {
    return(sum(x[,i]))
  }
})["elapsed"]
## elapsed ## 595.448
Another example of what not to do! One foreach task per column means a million tiny tasks, and the per-task scheduling overhead swamps the actual work.
# This is much better and elegant. But iterators is beyond
# the scope of what should be discussed today. Maybe tomorrow!
system.time({
  iter <- isplitIndices(ncol(x), chunks=2)
  ans <- foreach(i=iter, .combine=c) %dopar% {
    return(apply(x[,i], 2, sum))
  }
})["elapsed"]
## elapsed ## 1.423
Sometimes a little better than mclapply().
library(doSNOW)
# Only superficial cluster registration changes required:
machines <- rep("localhost", each=2)         # changed
cl <- makeCluster(machines, type="SOCK")     # changed
registerDoSNOW(cl)                           # changed
system.time({                                # NOT CHANGED
  iter <- isplitIndices(ncol(x), chunks=2)   # NOT CHANGED
  ans <- foreach(i=iter,                     # NOT CHANGED
                 .combine=c) %dopar%         # NOT CHANGED
  {                                          # NOT CHANGED
    return(apply(x[,i], 2, sum))             # NOT CHANGED
  }                                          # NOT CHANGED
})["elapsed"]                                # NOT CHANGED
## elapsed ## 1.853
stopCluster(cl) # strongly recommended
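To scale beyond a single machine, only the machines vector changes. A minimal sketch (my addition; the hostnames node1 and node2 are hypothetical placeholders):

# Hypothetical two-node cluster; hostnames are placeholders:
machines <- c(rep("node1", 2), rep("node2", 2))
cl <- makeCluster(machines, type="SOCK")
registerDoSNOW(cl)
# ... run the same foreach() code as above ...
stopCluster(cl)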
Proficiency in (real) programming lets you do… anything!
Super-specialized tools have a role, but are limiting.
Throwing around money to address Big Data limitations? Not me!
Don't be like the NSA!
Don't invest in single stocks… diversify! I recommend Python/Perl, R/Matlab, and some basic C/C++.
The general theme of the 2015 RBras/SEAGRO:
The entire field of computing is in a period of rapid expansion and change. Many exciting challenges face those who want to extend our capabilities in the statistical use of computers. The advances in hardware and software make possible more effective and widespread computing for data analysis. The most valuable resource is skilled human effort; computing should be organized to improve the ease and the quality of our work, whether we are users of statistical systems or designers and implementers of software.
Thanks to Helio Migon and Aparecida Souza for arranging my visit and keeping an eye on my travel adventure.
No thanks to United or Gol for making me a full 2 days late.
This talk and materials from this morning's workshop will be available at: http://www.stat.yale.edu/~jay/RBras/. Give me a few days – things changed a lot in the last 24 hours.