The Big Data Analytics Landscape with Applications in Augmented Knowledge Discovery

Michael Kane with Peter Rabinowitz

Goals for this talk

1. Provide a navigable view of the Big Data landscape

2. Show off some of the work for the Round 11 Grand Challenge

Who am I?

Research Faculty in Biostatistics at Yale University

Interested in scalable machine learning and applied probability

Background in computing, machine learning, and statistics

Track record

  • 2010 ASA Chambers Statistical Software Award
  • 2012 DARPA XDATA Initiative
  • 2013 Grand Challenge Exploration Round 11

Business Collaborators

AT&T Labs-Research

Genentech

Paradigm4

Sybase

Oracle

Inspiration for this talk

In 2013 I was asked to be an editor for the Handbook of Big Data (Chapman & Hall/CRC)

We need to understand how methodologies and theories of Big Data fit together.

Indexing vs. Learning (Model Fitting)

Indexing vs. Learning

Indexing

  • Goal is efficient retrieval of data
  • Can be used to make big data into small data

Learning

  • Goal is understanding the structure of data
  • Model relationships within data

The Social Impact of Big Data

We are at a point in human history where we can now collect far more data than we can analyze in a reasonable amount of time.

The goal of big data analytics is to provide new ways of making sense of enormous amounts of digital information and to make meaningful (and positive) impacts in areas such as science and business.

Should we characterize big data with the 4 V's?

Should we characterize by volume?

A big data set in genetics is on the order of tens of gigabytes.

A big data set in advertising can be on the order of petabytes or even more.

An alternative characterization

A data set is "big" if the computational, methodological, and theoretical approaches required to understand its structure have to be reimagined because of its vastness and new approaches need to be developed to extract information from it. -Casey King

"They extend their procedures without examining their principles" -Edgar Allen Poe, The Purloined Letter

What are the analytics challenges that have inspired "reimagining"?

1. The rapid accumulation of data

2. Many samples

3. Many measurements per sample

4. Many samples and many measurements per sample

5. Unstructured and complex samples

Rapid Accumulation of Data: Facebook

As of August 2012, Facebook stored more than 100 petabytes of data on disk.

This is 900,720,000,000,000,000 (9e17) bits.

If each bit were an inch wide and laid end to end: 14,215,900,000,000 (1.4e13) miles

  • The sun is only about 92,000,000 (9.2e7) miles away
  • About 2.42 light years (the arithmetic is sketched below)
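
A minimal R sketch of the arithmetic, assuming 1 petabyte = 2^50 bytes (the convention the figures above imply):

# Back-of-the-envelope scale of 100 petabytes, assuming 1 PB = 2^50 bytes
bits <- 100 * 2^50 * 8            # about 9e17 bits
miles <- bits / (12 * 5280)       # one-inch-wide bits laid end to end, in miles
light_years <- miles / 5.8786e12  # roughly 5.8786e12 miles per light year
c(bits = bits, miles = miles, light_years = light_years)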

The daily breakdown

  • 2.7 billion likes made daily on and off of the Facebook site
  • 300 million photos uploaded
  • 70,000 queries executed by people and automated systems
  • 500+ terabytes of new data "ingested"

The Cohen Conjecture

"For increasingly large sets of data, access to individual samples decreases exponentially over time." -David Cohen

Videos and photographs shared over Facebook are almost never accessed after one day of being posted.

Amazon is betting on this phenomenon with AWS Glacier.

Many Samples

Many big data sets are "tall and skinny".

  • Large number of samples (rows)
  • Each sample has relatively few features

Tend to be very large in total volume

Example: The Airline On-Time Performance data set

  • All commercial domestic flights from October 1987 to April 2008
  • About 120 million commercial domestic flights
  • 29 features (date, flight time, arrival delay, etc.)

The Tall-Skinny Representation

Tall-Skinny Challenges are Computational

Learning (model fitting) algorithms do better with more data

We essentially have the population

Approach is generally to do things on individual blocks and aggregate the results

"Chunk and Add"

\[ \widehat{\beta} = \left( X^T X \right)^{-1} X^T Y \]

Algorithm for calculating the OLS slope estimate:

  • Let \(X_i\) and \(Y_i\) be the \(i\)th blocks of \(X\) and \(Y\).
  • Let \(r\) be the number of blocks.
  • Compute \(X^T X\) as \(\sum_{i=1}^{r} X_i^T X_i\).
  • Invert \(X^T X\).
  • Compute \(X^T Y\) similarly.
  • Multiply the results to get the slope coefficients.
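
A minimal R sketch of the chunk-and-add computation on simulated data (the dimensions and block size are illustrative, not the airline data):

# Chunk-and-add OLS: accumulate X^T X and X^T Y one block at a time
set.seed(1)
n <- 1e5; p <- 5; block_size <- 1e4
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # intercept plus 4 predictors
Y <- X %*% rnorm(p) + rnorm(n)

xtx <- matrix(0, p, p)
xty <- matrix(0, p, 1)
for (start in seq(1, n, by = block_size)) {
    rows <- start:(start + block_size - 1)
    Xi <- X[rows, , drop = FALSE]
    Yi <- Y[rows, , drop = FALSE]
    xtx <- xtx + crossprod(Xi)      # running sum of X_i^T X_i
    xty <- xty + crossprod(Xi, Yi)  # running sum of X_i^T Y_i
}
beta_hat <- solve(xtx, xty)         # agrees with lm.fit(X, Y)$coefficients

Each block only has to fit in memory, and with foreach the per-block terms can be computed in parallel and combined by addition.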

Many Features

Number of features per sample is much larger than the number of samples

Example: Genome Studies

  • Human genome is 3.2 billion base pairs
  • A "large" genome study has thousands of people

Tend to be smaller in total volume

Many-Feature Challenges are Statistical

require(foreach)
# Fit 1,000 pure-noise predictors, one at a time, against the same response
y <- rnorm(100)
p_values <- foreach(i = 1:1000, .combine = c) %do% {
    x <- rnorm(100)                  # predictor unrelated to y
    s <- summary(lm(y ~ x))
    s$coefficients[2, 4]             # p-value for the slope of x
}
sum(p_values < 0.05)                 # "significant" by chance alone
## [1] 56

Approaches to Many-Feature Challenges

  1. Look for highly-significant features

    • Bonferroni/Šidák correction
    • Use an accepted (less stringent) p-value
  2. False discovery rate (see the sketch after this list)

    • Estimate the rate at which your "significant" features are actually false discoveries
    • Doesn't by itself tell you which features are of interest
  3. Dimension reduction

    • Project the data into a new feature space and perform analysis
    • Transformation can identify sets of features that are significant
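
Continuing the noise-only simulation above, a minimal sketch of approaches 1 and 2 using base R's p.adjust (the 0.05 threshold is illustrative):

# Approach 1: family-wise error control (Bonferroni)
sum(p.adjust(p_values, method = "bonferroni") < 0.05)  # typically 0 for pure noise

# Approach 2: false discovery rate (Benjamini-Hochberg)
sum(p.adjust(p_values, method = "BH") < 0.05)          # also typically 0 here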

Many-Sample-Many-Feature ("Squarish") Challenges

Data sets that are large in both the number of rows and the number of columns

These data sets generally encode connections between entities (e.g., users and items)

Example: The Netflix Data Set

  • 480,189 users
  • 17,770 movies
  • An incidence matrix of ratings has 8,532,958,530 entries

Squarish Data Sets are Sparse

Each person only rated about 200 movies

This corresponds to 100,480,507 non-zero entries

The matrix is about 99% sparse

We only need about 375 MB to store this.
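
A minimal sketch of sparse storage with the Matrix package; the simulated ratings below are a small stand-in for the Netflix data, not the data itself:

library(Matrix)
set.seed(1)
n_users <- 10000; n_movies <- 1000
n_ratings <- round(0.01 * n_users * n_movies)   # roughly Netflix-like sparsity
cells <- sample(n_users * n_movies, n_ratings)  # distinct user/movie cells
ratings <- sparseMatrix(
    i = ((cells - 1) %% n_users) + 1,
    j = ((cells - 1) %/% n_users) + 1,
    x = sample(1:5, n_ratings, replace = TRUE),
    dims = c(n_users, n_movies)
)
nnzero(ratings) / prod(dim(ratings))        # about 0.01: ~99% of cells are empty
format(object.size(ratings), units = "MB")  # only the non-zero entries are stored

Only the ratings and their indices are stored, rather than all 8.5 billion cells of the full users-by-movies matrix.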

Complex data is usually text

<Article PubModel="Print-Electronic">
    <Journal>
        <ISSN IssnType="Electronic">1551-4005</ISSN>
        <JournalIssue CitedMedium="Internet">
            <Volume>12</Volume>
            <Issue>24</Issue>
            <PubDate>
                <Year>2013</Year>
                <Month>Oct</Month>
                <Day>21</Day>
            </PubDate>
        </JournalIssue>
        <Title>Cell cycle (Georgetown, Tex.)</Title>
        <ISOAbbreviation>Cell Cycle</ISOAbbreviation>
    </Journal>

Peter's and My Gates Proposal

The whole of our intellectual understanding about human and animal medicine is contained in the literature.

  • Well-documented
  • Citation information is available

Reading papers is hard.

  • Time consuming
  • Hard to know what to read

Provide a Panoramic View of the Literature

PubMed query results are data

The documents can be clustered according to their content (a sketch follows the list below)

How is this different from a keyword search?

  • We can visualize the "closeness" of any two documents
  • Query provides a context for understanding interrelationships
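
A minimal sketch of the clustering idea, with toy abstracts standing in for PubMed query results (illustrative only, not the actual pipeline):

# Toy "abstracts" standing in for PubMed query results
docs <- c("rift valley fever virus in livestock",
          "mosquito vectors of rift valley fever",
          "influenza surveillance in poultry",
          "avian influenza outbreak in poultry farms")

# Bag-of-words term-frequency matrix (documents in rows, terms in columns)
words <- strsplit(docs, " ")
terms <- sort(unique(unlist(words)))
tf <- t(sapply(words, function(w) table(factor(w, levels = terms))))

# Cosine similarity measures the "closeness" of any two documents
norms <- sqrt(rowSums(tf^2))
cosine <- (tf %*% t(tf)) / (norms %o% norms)

# Hierarchical clustering on cosine distance groups related documents
hc <- hclust(as.dist(1 - cosine))
cutree(hc, k = 2)  # the RVF abstracts separate from the influenza abstracts

With real query results, the same kind of term matrix could be built from the PubMed XML shown earlier, and the pairwise distances would drive the visualization of document closeness.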

RVF Demo

Is 1,200 RVF Documents "Big"?

Project Vision

Making the RVF Challenge Bigger

  • Google Scholar indexes about 26,000 related documents
  • Integrate genomics databases
  • Provide an interactive web application

Provide custom document search, organization, and exploration

  • Increase research awareness
  • Jumpstart interdisciplinary research

Project Relevance

Conclusions