Subspace ensembles for scalable statistical analyses

Michael Kane
Associate Research Scientist, Yale University

Research areas I'm interested in

Use of the term "Big Data" according to Google

[Figure: Google Trends chart]

Are people still going "to the cloud?"

[Figure: Google Trends chart]

Data's value comes from its information

[Figure: Google Trends chart]

Overview for the rest of this talk

What is so challenging about "Big Data"?

What are the current approaches to large-scale statistical analyses?

Subspace ensembles - a unifying framework for a large class of scalable analyses

Aggregation of "weak" estimators

Applications: Medical record linkage and network intrusion detection

My current work and open questions

Why is analyzing/processing "big data" a challenge?

| Medium | Capacity | Transfer Rate |
|---|---|---|
| L1 Cache | 64 KB | 950 GB/s |
| L2 Cache | 256 KB | 240 GB/s |
| L3 Cache | 8 MB | 120 GB/s |
| RAM | 16 GB | 6.4 GB/s |
| SSD | 500 GB | 0.55 GB/s |
| Disk | 3 TB | 0.15 GB/s |
| Gigabit Ethernet | | 0.01 GB/s |
| Internet | | 0.006 GB/s |

Define "big" in terms of technology

A data set is "large" if it exceeds 20% of the available RAM for a single machine but can be analyzed on a single machine

A data set is "massive" if it requires more than one machine to process

  • Need an approach that will "scale-out"
  • Applications using "massive"-data platforms are a subset of those that actually need them

Technological alternative approaches

| Platform | Emb. Parallel | Non-emb. Parallel | Pipeline Parallel |
|---|---|---|---|
| GPGPU | s | s | s |
| MapReduce | c | c* | |
| Complex Event Processing | s/c | s/c | |
| ScaLAPACK | c | c | |
| SciDB | s/c | s/c | |
| bigmemory | s/c | s | |

s - single machine solution
c - cluster solution

Approaches to managing complexity

  1. Updatable/Downdatable

    • Allows new data to be incorporated incrementally
    • Does not reduce total complexity
  2. Iterative approximation

    • Successively approximates the analysis of interest
    • Does not always provide the full analysis, only its essential components
  3. Subspace ensembles

    • Perform analyses on subspaces of the data
    • Predict by using an average of the individual estimators
    • Aggregate (ensemble) is often equivalent to the reference estimator

The complexity of an analysis is a function of the data size.

Consider a computational procedure with complexity \(O(n^c)\), \(c \geq 1\), applied to a data matrix of size \(n \times p\).

Matloff (2012) showed that the computational complexity of running the same procedure on \(q\) "chunks" of \(n/q\) rows each, where \(q \geq 1\), is

\[ O(n^c q^{-(c-1)}). \]
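To make the savings concrete (a worked example of my own, not from the slides): an \(O(n^3)\) procedure run on \(q = 10\) chunks costs about

\[ 10 \cdot O\left( \left( n/10 \right)^3 \right) = O\left( n^3 / 100 \right), \]

a hundredfold reduction in work before the (usually inexpensive) aggregation step.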

Create many of these "weak" estimators and aggregate them into an estimator that is competitive with the reference estimator.

The same approach can be used for columns instead of rows.

Row partitioning via subspace projection

Let \(\{P_r\}\) be a set of \(b\) orthogonal projection matrices in \(\mathcal{R}^{n \times n}\) such that:

  1. Each element of each matrix in \(\{P_r\}\) is either zero or one.
  2. The sum of the matrices is identity up to a permutation of the rows of \(I\). \[ \sum_{i=1}^b P_{r_i} = I\]
  3. The sum of the projected \(X\) matrices is \(X\) up to a permutation of the rows of \(X\). \[ \sum_{i=1}^b P_{r_i} X = X\]

Then \(\{P_r\}\) partitions \(X\) by rows.

I.e., \(P_{r_i} X\) selects a subset of the rows of \(X\).
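As a quick illustration (my own sketch, not from the talk), the following R snippet builds such a set of row-selecting projections as diagonal 0/1 matrices and checks the conditions above:

set.seed(1)
n <- 9; p <- 2; b <- 3
X <- matrix(rnorm(n * p), n, p)

# Assign each row index to one of the b chunks.
chunk <- sort(rep_len(1:b, n))

# P[[i]] is a diagonal 0/1 matrix that zeroes out the rows not in chunk i.
P <- lapply(1:b, function(i) diag(as.numeric(chunk == i)))

# The projections sum to the identity, and the projected X's sum to X.
all.equal(Reduce(`+`, P), diag(n))              # TRUE
all.equal(Reduce(`+`, lapply(P, `%*%`, X)), X)  # TRUE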

Bootstrapping via random subspace projection

Let \(B\) be an oblique projection matrix in \(\mathcal{R}^{n \times n}\) such that:

  1. Each element of \(B\) is either zero or one.
  2. There is at most a single value of 1 in each row of \(B\).

This can easily be generalized to \(m\) out of \(n\) bootstrapping.
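A small illustration of \(B\) in R (my own construction for the \(n\)-out-of-\(n\) case; leaving some rows of \(B\) at zero gives the \(m\)-out-of-\(n\) version):

set.seed(2)
n <- 6; p <- 3
X <- matrix(rnorm(n * p), n, p)

# Each row of B has a single 1 in a column chosen with replacement.
idx <- sample(n, n, replace = TRUE)
B <- matrix(0, n, n)
B[cbind(1:n, idx)] <- 1

# B %*% X is the same as X[idx, ]: a bootstrap resample of the rows of X.
all.equal(B %*% X, X[idx, ])  # TRUE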

Is there a valid sampling scheme where \(P\) is not a projection?

Chunk and Average (Matloff 2012)

Basic approach

  • Partition the rows of a data set into (approximately) evenly sized "chunks"
  • Train estimators on individual chunks
  • Prediction is performed by aggregating estimators

Assumptions

  • IID samples
  • Reference estimator has asymptotic normal distribution with variance \(\sigma\)
  • Ensemble estimates are normal and have average variance \(\sigma / c\) where \(c\) is the number of chunks.

Result: The ensemble estimator has the same asymptotic variance and mean as the reference estimator.
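A minimal chunk-and-average sketch in R (my own toy example, with OLS coefficients as the estimator):

set.seed(3)
n <- 10000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
d <- data.frame(x = x, y = y)

# Reference estimator: OLS on the full data set.
ref <- coef(lm(y ~ x, data = d))

# Chunk and average: split the rows into q chunks, fit each, average the coefficients.
q <- 10
chunks <- split(d, rep_len(1:q, n))
ca <- Reduce(`+`, lapply(chunks, function(di) coef(lm(y ~ x, data = di)))) / q

rbind(reference = ref, chunk_and_average = ca)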

Bag of Little Bootstraps (Kleiner et al. 2012)

Basic approach

  • Create a random partition of the rows with all subsets having roughly the same size
  • Train bootstrap estimators on partitions
  • Prediction is performed by aggregating estimators

Assumptions

  • IID samples
  • Distribution of estimator is in the set of Hadamard differentiable functions
  • Subset size is asymptotically large

Result: The empirical CDF of the BLB estimator converges to the CDF of the bootstrap reference estimator. Both estimators converge to the CDF of the function at the same rate \(O_p(1/n)\).
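A heavily simplified BLB sketch in R (my own illustration with the sample mean as the statistic and arbitrary choices of subset size and replicate count): each little subset is resampled by drawing multinomial weights that sum to the full sample size \(n\), and the per-subset confidence-interval widths are averaged.

set.seed(4)
n <- 20000
x <- rgamma(n, shape = 2)   # toy data; the statistic is the sample mean
s <- 10                     # number of little subsets
b <- floor(n^0.6)           # subset size b = n^0.6
r <- 50                     # resamples per subset

# For one subset, approximate the width of a 95% bootstrap CI for the mean
# by reweighting the b points with multinomial counts that sum to n.
ciWidth <- function(xb) {
  stats <- replicate(r, {
    w <- as.numeric(rmultinom(1, n, rep(1 / b, b)))
    weighted.mean(xb, w)
  })
  diff(quantile(stats, c(0.025, 0.975)))
}

# Draw s disjoint subsets of size b and average their CI widths.
idx <- sample(n, s * b)
widths <- sapply(split(x[idx], rep(1:s, each = b)), ciWidth)
mean(widths)   # BLB estimate of the 95% CI width for the mean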

Application: Record Linkage Comparison Patterns Data Set

Available from the UCI Machine Learning Repository

Element-wise comparison of medical records with personal data

5,749,132 record pairs

  1. Phonetic equality of first name and family name, equality of date of birth.
  2. Phonetic equality of first name, equality of day of birth.
  3. Phonetic equality of first name, equality of month of birth.
  4. Phonetic equality of first name, equality of year of birth.
  5. Equality of complete date of birth.
  6. Phonetic equality of family name, equality of sex.

Creating the ensemble

require(e1071)
require(foreach)
require(doMC)
source("eutils.r")

registerDoMC(cores = 6)

# Given a data set z, make a training and testing subset.
train <- sample(nrow(z), floor(nrow(z)/2))
test <- setdiff(1:nrow(z), train)

# Create the ensemble: fit an SVM on each of 100 row partitions in parallel.
e <- foreach(it = irpart(z[train, 1], z[train, -1], 100)) %dopar% {
    svm(x = it$x, y = as.factor(it$y))
}

Predict on the test data

# Get the individual predictions.
preds <- foreach(fit = e, .combine = cbind) %dopar% {
    as.logical(predict(fit, z[test, -1]))
}

# Aggregate the predictions by majority vote across the ensemble.
agPred <- foreach(i = 1:nrow(preds), .combine = c) %dopar% {
    as.logical(classVoteWinner(preds[i, ]))
}

Accuracy and timing

| #partitions | Train time (sec) | True Pos | False Pos | False Neg | True Neg |
|---|---|---|---|---|---|
| 1 | 68.25 | 0.999 | 0 | 1 | 0.001 |
| 20 | 45.35 | 0.994 | 0 | 1 | 0.006 |
| 40 | 48.56 | 0.994 | 0 | 1 | 0.006 |
| 60 | 53.05 | 0.994 | 0 | 1 | 0.006 |
| 80 | 61.525 | 0.992 | 0 | 1 | 0.008 |
| 100 | 70.466 | 0.992 | 0 | 1 | 0.008 |

Computational complexity for training estimators for an \(n \times p\) matrix

| Estimator | Complexity | Source |
|---|---|---|
| OLS | \(O(n p^2)\) | ALGLIB, 2013 |
| Pen. Lin. Reg. | \(O(n p^2 + p^3)\) | ALGLIB, 2013 |
| SVM | \(O(\max(n, p) \min^2(n, p))\) | Chapelle, 2006 |
| Kernel Methods | \(O(n p^2 + p^3)\) | Bach and Jordan, 2005 |
| Radial Basis Fitting | \(O(n \log n)\) | ALGLIB, 2013 |

Sampling columns

From a linear algebra viewpoint, the row-sampling techniques can just as easily be performed on columns.

However, from a statistical standpoint this is very different.

  • Subsampling rows means each sample in a subset is complete
  • Subsampling columns means a subset keeps every sample, but not all of the data associated with each sample

Subspace ensembles

Sampling/partitioning rows

  • Chunk and average (Matloff, 2012)
  • Bag of little bootstraps (Kleiner et al., 2012)

Partitioning columns

  • Weighted random subspaces (Li and Zhao, 2009)

For a fixed estimator, the procedure and its characteristics are defined by the subspaces the data are projected into.

The combination of estimator and subspaces determines the computational savings and convergence properties.

Weighted random subspace ensembles (Li and Zhao, 2009)

A "column-chunking" approach to creating ensembles.

Basic approach

  • Create a random partition of the columns with all subsets having roughly the same size.
  • Train estimators on subspaces.
  • Use cross validation to get an estimate of the error
  • Create weights based on error
  • Prediction is performed by the weighted aggregation of estimators

Error-based weighting has been found empirically to outperform equal weighting (see the sketch below).
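A compact sketch of the column-chunking idea in R (my own illustration, using e1071::svm with a single hold-out set standing in for full cross-validation; the data and clipping of the error rates are mine):

library(e1071)

set.seed(5)
n <- 500; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n) > 0, "bad", "good"))

# Randomly partition the columns into b subspaces of roughly equal size.
b <- 4
cols <- split(sample(p), rep_len(1:b, p))

# Hold out part of the data to estimate each subspace estimator's error.
hold <- sample(n, n / 5)
fits <- lapply(cols, function(ci) svm(x[-hold, ci, drop = FALSE], y[-hold]))
errs <- mapply(function(fit, ci) mean(predict(fit, x[hold, ci, drop = FALSE]) != y[hold]),
               fits, cols)
errs <- pmin(pmax(errs, 0.01), 0.49)  # clip so the weights stay finite and positive

# Weight each estimator by log((1 - e) / e), normalized to sum to one.
w <- log((1 - errs) / errs)
w <- w / sum(w)

# Weighted vote: sum the weights of the estimators predicting "bad".
vote <- Reduce(`+`, mapply(function(fit, ci, wi) {
  wi * (predict(fit, x[hold, ci, drop = FALSE]) == "bad")
}, fits, cols, w, SIMPLIFY = FALSE))
pred <- ifelse(vote > 0.5, "bad", "good")
mean(pred == y[hold])  # hold-out accuracy of the weighted ensemble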

Convex estimator aggregation

During Training:

  1. Estimate the out-of-sample error
  2. Define what you want to minimize to create a set of weights on the simplex for each estimator in the ensemble
    • Misclassification rate
    • Squared-error loss

During Estimation:

  1. Create a prediction for each estimator in the ensemble
  2. Weight the individual estimates
    • Weighted voting for classification
    • Multiply estimate by weight for regression
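For concreteness, the aggregation step might look like this in R (a toy snippet of my own; the prediction matrices and weights are made up):

# Toy example: three estimators, simplex weights from training.
w <- c(0.5, 0.3, 0.2)

# Classification: weighted vote for the "bad" class across estimators.
class_preds <- matrix(c("bad", "good", "bad",
                        "good", "good", "bad"), nrow = 2, byrow = TRUE)
weighted_vote <- (class_preds == "bad") %*% w
ifelse(weighted_vote > 0.5, "bad", "good")   # "bad", "good"

# Regression: weighted average of the individual estimates.
reg_preds <- matrix(c(1.0, 1.2, 0.9,
                      2.1, 1.9, 2.0), nrow = 2, byrow = TRUE)
as.vector(reg_preds %*% w)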

Minimizing error rate (Li and Zhao, 2009)

Let \(\{\widehat{\theta}\}\) be a set of \(b\) independent estimators with classification error rates \(e_1, e_2, \ldots, e_b\). The weight assignment \[ w_i = \frac{1}{N} \log \left(\frac{1-e_i}{e_i}\right) \] with the normalizing constant \[ N = \sum_{i=1}^b \log \left(\frac{1-e_i}{e_i}\right) \] achieves the minimal error rate.
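In R, the weight computation is a short helper (my own code, not taken from the referenced paper):

# Error-rate-based weights of Li and Zhao (2009), normalized to sum to one.
weightsFromErrors <- function(e) {
  w <- log((1 - e) / e)
  w / sum(w)
}

weightsFromErrors(c(0.05, 0.10, 0.20))  # better estimators get larger weights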

Minimizing MSE

Let \(\{\widehat{\theta}\}\) be a set of \(b\) estimators. The goal is to find \(\{w\}\) that minimize \[ \sum_{i=1}^b w_i \mathbb{P} \widehat{\theta}_i^2 + \sum_{1 \leq i < j \leq b} w_i w_j \mathbb{P} \widehat{\theta}_i \widehat{\theta}_j \] subject to \[ \sum_{i=1}^b w_i = 1 \] and \(w_i \geq 0\) \(\forall\) \(i \in \{1, ..., b\}\).

This can easily be reformulated as a quadratic programming problem that minimizes \[ -a^{T} x + \frac{1}{2} x^T A x \] subject to \(C^T x \geq c\).
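This is the form solved by quadprog::solve.QP, so a sketch of the weight fit could look like the following (my own toy example; Dhat stands in for the matrix of empirical second moments, which would come from the training data in practice):

library(quadprog)

# Toy empirical second-moment matrix for b = 3 estimators.
Dhat <- matrix(c(1.0, 0.8, 0.7,
                 0.8, 1.1, 0.6,
                 0.7, 0.6, 1.2), nrow = 3)
a <- rep(0, 3)                  # linear term (zero in this toy example)

# Constraints: sum(w) = 1 (equality) and w_i >= 0.
Amat <- cbind(rep(1, 3), diag(3))
bvec <- c(1, rep(0, 3))

sol <- solve.QP(Dmat = Dhat, dvec = a, Amat = Amat, bvec = bvec, meq = 1)
sol$solution                    # weights on the simplex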

Application: Network intrusion detection

From a former Knowledge Discovery and Data Mining competition

42 variables

Traffic was categorized as "good" or "bad" with respect to intrusion

Task is to categorize traffic

Approach: apply random subspace ensembles

Variables

duration protocol_type service flag src_bytes dst_bytes land wrong_fragment urgent hot num_failed_logins logged_in num_compromised root_shell su_attempted num_root num_file_creations num_shells num_access_files num_outbound_cmds is_host_login is_guest_login count srv_count serror_rate srv_serror_rate rerror_rate srv_rerror_rate same_srv_rate diff_srv_rate srv_diff_host_rate dst_host_count dst_host_srv_count dst_host_same_srv_rate dst_host_diff_srv_rate dst_host_same_src_port_rate dst_host_srv_diff_host_rate dst_host_serror_rate dst_host_srv_serror_rate dst_host_rerror_rate dst_host_srv_rerror_rate type

Timing for random subspace ensembles

| #partitions | Train time (sec) |
|---|---|
| 2 | 262.618 |
| 10 | 333.39 |
| 20 | 607.278 |
| 30 | 858.823 |
| 40 | 1105.608 |
| 42 | 1107.833 |

Accuracy for subspace ensembles with equal weighting

| #partitions | True Pos | False Pos | False Neg | True Neg |
|---|---|---|---|---|
| 1 | 0.9931 | 0.8125 | 0.1875 | 0.0069 |
| 2 | 1 | 0 | 1 | 0 |
| 10 | 0.995 | 0 | 1 | 0.005 |
| 20 | 0.9913 | 0 | 1 | 0.0087 |
| 30 | 0.9863 | 0 | 1 | 0.0137 |
| 40 | 0.9831 | 0 | 1 | 0.0169 |
| 42 | 0.9813 | 0 | 1 | 0.0187 |

Accuracy for subspace ensembles with independence

| #partitions | True Pos | False Pos | False Neg | True Neg |
|---|---|---|---|---|
| 1 | 0.9931 | 0.8125 | 0.1875 | 0.0069 |
| 2 | 1 | 0.9125 | 0.0875 | 0 |
| 10 | 0.9950 | 0 | 1 | 0.005 |
| 20 | 0.9931 | 0 | 1 | 0.0069 |
| 30 | 0.9881 | 0 | 1 | 0.0119 |
| 40 | 0.9831 | 0 | 1 | 0.0169 |
| 42 | 0.9831 | 0 | 1 | 0.0169 |

Accuracy for subspace ensembles with covariance

| #partitions | True Pos | False Pos | False Neg | True Neg |
|---|---|---|---|---|
| 1 | 0.9931 | 0.8125 | 0.1875 | 0.0069 |
| 2 | 0.9938 | 0.0025 | 0.997 | 0.0062 |
| 10 | 0.9975 | 0 | 1 | 0.0025 |
| 20 | 0.9969 | 0 | 1 | 0.0031 |
| 30 | 0.9988 | 0 | 1 | 0.0012 |
| 40 | 0.9831 | 0 | 1 | 0.0169 |
| 42 | 0.9831 | 0 | 1 | 0.0169 |

Properties of subspace ensembles

Theoretical

  • General approach for reducing computational complexity
  • Makes use of existing results
  • Work goes into proving properties of ensemble

Technological

  • Turns a large class of problems from non-embarrassingly parallel into embarrassingly parallel
  • Makes use of existing implementations
  • Work goes into integrating with parallel computing platform

Current work

  1. Is there a general approach for assessing column sampling?
    • What about model stability?
    • Can we get general convergence results?
  2. What about unsupervised learning?
  3. Can we expect ensemble estimator consistency if we use both row and column subspaces?
  4. Are there better bases we could project onto?
  5. How do we handle samples that are not IID?

The ensemble R package

Forthcoming

Same model abstraction as the caret package

Will use foreach for parallelism

Support is planned for data frames, matrices, bigmemory, and SciDB

Supervised and unsupervised learning will be supported

Email me for an alpha version

Thanks!

References

ALGLIB (2013) The ALGLIB User's Guide. http://www.alglib.net/#book

Algorithm AS274 (1992) Applied Statistics. 41(2).

F. R. Bach and M. I. Jordan (2005) Predictive low-rank decomposition for kernel methods. Proceedings of the \(22^{nd}\) International Conference on Machine Learning.

J. Baglama and L. Reichel (2005) Augmented Implicitly Restarted Lanczos Bidiagonalization Methods. SIAM Journal on Scientific Computing.

O. Chapelle (2006) Training a Support Vector Machine in the Primal. Neural Computation. 19(5) 1155-1178.

X. Li and H. Zhao (2009) Weighted random subspace methods for high dimensional data classification. Statistics and its Interface. 2 153-159.

N. Matloff (2012) Efficient R Parallel Loops on Long-Latency Platforms. Talk given at the \(43^{rd}\) Symposium on the Interface of Computing Science and Statistics.

N. Matloff (2013) Software Alchemy: Turning Complex Statistical Computations into Embarrassingly Parallel Ones. Preprint.