1. Provide a navigable view of the Big Data landscape
2. Show off some of the work for the Round 11 Grand Challenge
Michael Kane with Peter Rabinowitz
Research Faculty in Biostatistics at Yale University
Interested in scalable machine learning and applied probability
Background in computing, machine learning, and statistics
Track record
AT&T Labs-Research
Genentech
Paradigm4
Sybase
Oracle
In 2013, I was asked to be an editor of The Handbook of Big Data (Chapman & Hall/CRC)
We need to understand how methodologies and theories of Big Data fit together.
Indexing
Learning
We are at a point in human history where we can now collect far more data than we can analyze in a reasonable amount of time.
The goal of big data analytics is to provide new ways of making sense of enormous amounts of digital information and to make meaningful (and positive) impacts in areas such as science and business
A big data set in genetics is on the order of tens of gigabytes.
A big data set in advertising can be on the order of petabytes or even more.
A data set is "big" if the computational, methodological, and theoretical approaches required to understand its structure have to be reimagined because of its vastness and new approaches need to be developed to extract information from it. -Casey King
"They extend their procedures without examining their principles" -Edgar Allen Poe, The Purloined Letter
1. The rapid accumulation of data
2. Many samples
3. Many measurements per sample
4. Many samples and many measurements per sample
5. Unstructured and complex samples
As of August 2012, Facebook stored more than 100 petabytes of data.
That is 900,720,000,000,000,000 (9e17) bits.
If each bit were an inch wide, they would stretch 14,215,900,000,000 (1.4e13) miles.
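These figures can be checked with a couple of lines of R (a sketch assuming "100 petabytes" means 100 x 1024^5 bytes and a mile is 63,360 inches):

# 100 petabytes expressed in bits, then converted to miles at one inch per bit
bits  <- 100 * 1024^5 * 8        # ~9e17 bits
miles <- bits / (12 * 5280)      # 63,360 inches per mile; ~1.4e13 miles
c(bits = bits, miles = miles)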
The daily breakdown
"For increasingly large sets of data, access to individual samples decreases exponentially over time." -David Cohen
Videos and photographs shared over Facebook are almost never accessed after one day of being posted.
Amazon is betting on this phenomenon with AWS Glacier.
Many big data sets are "tall and skinny".
Tend to be very large in total volume
Example: The Airline On-time Performance data set
Learning (model fitting) algorithms do better with more data
We essentially have the population
Approach is generally to do things on individual blocks and aggregate the results
\[ \widehat{\beta} = \left( X^T X \right)^{-1} X^T Y \]
Algorithm for calculating the OLS slope estimate:
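As an illustrative sketch (the blockwise_ols name and the chunks argument are hypothetical, not from the talk), the cross-products X'X and X'y can be accumulated block by block and the normal equations solved once at the end:

# Block-wise OLS: X'X (p x p) and X'y (p x 1) are accumulated over row blocks,
# so the full tall-and-skinny X never has to fit in memory at once.
blockwise_ols <- function(chunks, formula) {
  xtx <- 0
  xty <- 0
  for (chunk in chunks) {
    X <- model.matrix(formula, data = chunk)
    y <- model.response(model.frame(formula, data = chunk))
    xtx <- xtx + crossprod(X)      # add this block's X'X
    xty <- xty + crossprod(X, y)   # add this block's X'y
  }
  solve(xtx, xty)                  # (X'X)^{-1} X'y
}

# Check against simulated data split into 10 row blocks.
d <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
d$y <- 1 + 2 * d$x1 - 3 * d$x2 + rnorm(1000)
blockwise_ols(split(d, rep(1:10, each = 100)), y ~ x1 + x2)

This gives the same estimate as the closed-form expression above; only the order of the arithmetic changes.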
Number of features per sample is much larger than the number of samples
Example: Genome Studies
Tend to be smaller in total volume
require(foreach)

# 1,000 pure-noise features, each regressed one at a time against the same
# response of 100 samples.
y <- rnorm(100)
p_values <- foreach(i = 1:1000, .combine = c) %do% {
  x <- rnorm(100)
  s <- summary(lm(y ~ x))
  s$coefficients[2, 4]  # p-value for the slope of feature i
}
# By chance alone, about 5% of the noise features appear "significant" at 0.05.
sum(p_values < 0.05)

## [1] 56
Look for highly-significant features
False discovery rate (see the sketch below)
Dimension reduction
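Building on the simulation above, a minimal sketch of the false discovery rate idea, using the Benjamini-Hochberg adjustment in base R's p.adjust (the 0.05 threshold is just for illustration):

# With pure-noise features, essentially nothing should survive a 5% FDR cut.
sum(p.adjust(p_values, method = "BH") < 0.05)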
Data sets that are large both in the number of rows and number of columns
Generally show connections between things
Example: The Netflix Data Set
Each person rated only about 200 movies on average
This corresponds to 100,480,507 non-zero entries
The matrix is about 99% sparse
We only need about 375 MB to store this.
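As an illustration of why so little storage is needed (this code is not from the original slides; the example entries are made up), a sparse triplet representation such as the one in the Matrix package keeps only the non-zero ratings:

library(Matrix)
# Triplet form: user index i, movie index j, rating x; zeros are never stored,
# so memory grows with the ~100 million ratings, not with users x movies.
ratings <- sparseMatrix(
  i    = c(1, 1, 2, 3),            # user indices (illustrative)
  j    = c(10, 42, 10, 7),         # movie indices (illustrative)
  x    = c(5, 3, 4, 1),            # ratings (illustrative)
  dims = c(480189, 17770)          # roughly the Netflix users x movies
)
nnzero(ratings)                    # number of stored (non-zero) entries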
<Article PubModel="Print-Electronic">
  <Journal>
    <ISSN IssnType="Electronic">1551-4005</ISSN>
    <JournalIssue CitedMedium="Internet">
      <Volume>12</Volume>
      <Issue>24</Issue>
      <PubDate>
        <Year>2013</Year>
        <Month>Oct</Month>
        <Day>21</Day>
      </PubDate>
    </JournalIssue>
    <Title>Cell cycle (Georgetown, Tex.)</Title>
    <ISOAbbreviation>Cell Cycle</ISOAbbreviation>
  </Journal>
The whole of our intellectual understanding about human and animal medicine is contained in the literature.
Reading papers is hard.
PubMed query results are data.
Data can be clustered according to their content (as sketched below).
How is this different from a keyword search?
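A hedged sketch of what clustering by content could look like, here with term-frequency weights from the tm package and k-means over a few made-up abstracts (the actual challenge pipeline is not shown in these slides):

library(tm)
# Made-up abstracts standing in for PubMed query results.
abstracts <- c(
  "Rift Valley fever virus transmission in livestock",
  "Mosquito vectors and arbovirus outbreaks in East Africa",
  "Cell cycle regulation in tumor cells",
  "Checkpoint kinases and the cell cycle"
)
corpus <- VCorpus(VectorSource(abstracts))
dtm    <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
# Group documents by the similarity of their term-weight profiles, rather than
# by whether they match a single keyword.
kmeans(as.matrix(dtm), centers = 2)$cluster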
Making the RVF Challenge Bigger
Provide custom document search, organization, and exploration