Michael Kane
Associate Research Scientist, Yale University
What is so challenging about "Big Data"?
What are the current approaches to large-scale statistical analyses?
Subspace ensembles - a unifying framework for large class of scalable analyses
Aggregation of "weak" estimators
Applications: Medical record linkage and network intrusion detection
My current work and open questions
| | Capacity | Transfer Rate |
|---|---|---|
| L1 Cache | 64 KB | 950 GB/s |
| L2 Cache | 256 KB | 240 GB/s |
| L3 Cache | 8 MB | 120 GB/s |
| RAM | 16 GB | 6.4 GB/s |
| SSD | 500 GB | 0.55 GB/s |
| Disk | 3 TB | 0.15 GB/s |
| Gigabit Ethernet | | 0.01 GB/s |
| Internet | | 0.006 GB/s |
A data set is "large" if it exceeds 20% of the available RAM for a single machine but can be analyzed on a single machine
A data set is "massive" if it requires more than one machine to process
| Platform | Emb. Parallel | Non-emb. Parallel | Pipeline Parallel |
|---|---|---|---|
| GPGPU | s | s | s |
| MapReduce | c | c* | |
| Complex Event Processing | s/c | s/c | |
| ScaLAPACK | c | c | |
| SciDB | s/c | s/c | |
| bigmemory | s/c | s | |

s - single machine solution
c - cluster solution
Updatable/Downdatable
Iterative approximation
Subspace ensembles
Consider a computational procedure with a complexity of \(O(n^c)\) for \(c \geq 1\) for a data matrix of size \(n \times p\).
Matloff (2012) showed that the computational complexity of running the same procedure on \(q\) "chunks" of \(n/q\) rows each, where \(q \geq 1\), is
\[ O(n^c q^{-(c-1)}). \]
Create many of these "weak" estimators and aggregate them into an estimator that is competitive with the reference estimator computed on the full data.
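For intuition, here is a minimal R sketch of the chunk-and-average idea; `chunk_average` and its `lm`-based default are illustrative, not code from the talk.

```r
# A minimal sketch of chunk-and-average: fit the same estimator on q
# row chunks and average the q results. Each chunk costs O((n/q)^c).
chunk_average <- function(x, y, q, fit = function(x, y) coef(lm(y ~ x))) {
  n <- nrow(x)
  chunks <- split(sample(n), rep(1:q, length.out = n))
  ests <- lapply(chunks, function(ix) fit(x[ix, , drop = FALSE], y[ix]))
  Reduce(`+`, ests) / q
}

# Example: averaged OLS coefficients over 10 chunks.
chunk_average(matrix(rnorm(2000), 1000, 2), rnorm(1000), q = 10)
```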
Same approach can be used for columns instead of rows.
Let \(\{P_r\}\) be a set of \(b\) orthogonal projection matrices in \(\mathcal{R}^{n \times n}\) such that \(P_r P_s = 0\) for \(r \neq s\) and \(\sum_{r=1}^b P_r = I\).
Then \(\{P_r\}\) partitions \(X \in \mathcal{R}^{n \times p}\) by rows.
I.e., each \(P_r X\) selects a disjoint subset of the rows of \(X\).
Let \(B\) be an oblique projection matrix in \(\mathcal{R}^{n \times n}\) such that \(BX\) resamples the rows of \(X\) with replacement, as in the bootstrap.
This can easily be generalized to \(m\) out of \(n\) bootstrapping.
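To make these constructions concrete, a small illustrative snippet (in practice one indexes rows directly rather than materializing these matrices):

```r
# Row partitioning and bootstrap resampling as matrix multiplication.
n <- 6; p <- 2
X <- matrix(rnorm(n * p), n, p)

# Two orthogonal projections that partition the rows into two blocks.
P1 <- diag(c(1, 1, 1, 0, 0, 0))
P2 <- diag(n) - P1
stopifnot(all(P1 %*% P2 == 0))       # mutually orthogonal; P1 + P2 = I
# P1 %*% X keeps the first three rows of X (and zeroes the rest).

# An n-out-of-n bootstrap resample written as a row-selection matrix B.
idx <- sample(n, n, replace = TRUE)
B <- diag(n)[idx, ]
stopifnot(all(B %*% X == X[idx, ]))  # B X samples rows with replacement
```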
Is there a valid sampling scheme where \(P\) is not a projection?
Basic approach
Assumptions
Result: The ensemble estimator has the same asymptotic mean and variance as the reference estimator computed on the full data.
Basic approach
Assumptions
Result: The empirical CDF of the BLB estimator converges to the CDF of the bootstrap reference estimator. Both estimators converge to the true CDF at the same rate, \(O_p(1/n)\).
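A rough R sketch of the BLB procedure for a simple statistic (the mean); the defaults for the subset count `s`, resample count `r`, and exponent `gamma` are illustrative choices, not values from the talk.

```r
# Bag of Little Bootstraps sketch: confidence interval for the mean.
blb_ci <- function(x, s = 10, r = 50, gamma = 0.7, alpha = 0.05) {
  n <- length(x)
  m <- floor(n^gamma)                 # small subset size m = n^gamma
  ci <- replicate(s, {
    sub <- sample(x, m)               # subsample without replacement
    theta <- replicate(r, {
      # Resample to effective size n via multinomial weights on m points.
      w <- rmultinom(1, n, rep(1 / m, m))
      sum(w * sub) / n                # weighted mean of the resample
    })
    quantile(theta, c(alpha / 2, 1 - alpha / 2))
  })
  rowMeans(ci)                        # average the s interval endpoints
}

blb_ci(rnorm(10000))
```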
Available from the UCI Machine Learning Repository
Element-wise comparison of medical records with personal data
5,749,132 record pairs
```r
require(e1071)
require(foreach)
source("eutils.r")

# Register a parallel backend.
library(doMC)
registerDoMC(cores = 6)

# Given a data set z, make a training and testing subset.
train <- sample(nrow(z), floor(nrow(z) / 2))
test <- setdiff(1:nrow(z), train)

# Create the ensemble: fit an SVM on each of 100 row partitions.
e <- foreach(it = irpart(z[train, 1], z[train, -1], 100)) %dopar% {
  svm(x = it$x, y = as.factor(it$y))
}

# Get the individual predictions on the test set.
preds <- foreach(fit = e, .combine = cbind) %dopar% {
  as.logical(predict(fit, z[test, -1]))
}

# Aggregate the predictions by majority vote.
agPred <- foreach(i = 1:nrow(preds), .combine = c) %dopar% {
  as.logical(classVoteWinner(preds[i, ]))
}
```
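The helpers `irpart()` and `classVoteWinner()` come from `eutils.r`, which is not shown; one plausible (hypothetical) implementation of the latter, assuming simple majority voting:

```r
# Hypothetical stand-in for the eutils.r helper: return the most common
# value in a vector of votes (ties broken by first occurrence).
classVoteWinner <- function(votes) {
  names(which.max(table(votes)))
}
```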
| # partitions | Train time (sec) | True Pos | False Pos | False Neg | True Neg |
|---|---|---|---|---|---|
| 1 | 68.25 | 0.999 | 0 | 1 | 0.001 |
| 20 | 45.35 | 0.994 | 0 | 1 | 0.006 |
| 40 | 48.56 | 0.994 | 0 | 1 | 0.006 |
| 60 | 53.05 | 0.994 | 0 | 1 | 0.006 |
| 80 | 61.525 | 0.992 | 0 | 1 | 0.008 |
| 100 | 70.466 | 0.992 | 0 | 1 | 0.008 |
| Estimator | Complexity | Source |
|---|---|---|
| OLS | \(O(np^2)\) | Alglib, 2013 |
| Pen. Lin. Reg. | \(O(np^2 + p^3)\) | Alglib, 2013 |
| SVM | \(O(\max(n, p) \min(n, p)^2)\) | Chapelle, 2006 |
| Kernel Methods | \(O(np^2 + p^3)\) | Bach and Jordan, 2005 |
| Radial Basis Fitting | \(O(n \log n)\) | Alglib, 2013 |
From a linear algebra viewpoint, the row-sampling techniques can just as easily be performed on columns.
However, from a statistical standpoint this is very different.
Sampling/partitioning rows
Partitioning columns
For a fixed estimator, the procedure and its characteristics are defined by the subspaces into which the data are projected.
The combination of estimator and subspaces determines the computational savings and convergence properties.
A "column-chunking" approach to creating ensembles.
Basic approach
Weighted aggregation has been found empirically to outperform equal weighting (Li and Zhao, 2009).
During Training:
During Estimation:
Let \(\{\widehat{\theta}_i\}\) be a set of \(b\) independent estimators with classification error rates \(e_1, e_2, \ldots, e_b\). The weight assignment \[ w_i = \frac{1}{N} \log \left(\frac{1-e_i}{e_i}\right) \] with the normalizing constant \[ N = \sum_{i=1}^b \log \left(\frac{1-e_i}{e_i}\right) \] achieves the minimal error rate.
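These weights are easy to compute from the estimated error rates; `error_rate_weights` below is an illustrative name, and the rates are assumed to lie in \((0, 0.5)\).

```r
# Compute the error-rate-based ensemble weights described above.
error_rate_weights <- function(errs) {
  raw <- log((1 - errs) / errs)  # more accurate classifiers get larger weights
  raw / sum(raw)                 # normalize so the weights sum to one
}

error_rate_weights(c(0.05, 0.10, 0.20))
```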
Let \(\{\widehat{\theta}_i\}\) be a set of \(b\) estimators. The goal is to find weights \(\{w_i\}\) that minimize \[ \sum_{i=1}^b w_i^2 \, \mathbb{P} \widehat{\theta}_i^2 + 2 \sum_{1 \leq i < j \leq b} w_i w_j \, \mathbb{P} \widehat{\theta}_i \widehat{\theta}_j \] subject to \[ \sum_{i=1}^b w_i = 1 \] and \(w_i \geq 0\) for all \(i \in \{1, \ldots, b\}\).
This can easily be reformulated as a quadratic programming problem that minimizes \[ -a^{T} x + \frac{1}{2} x^T A x \] subject to \(C^T x \geq c\).
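This form maps directly onto the quadprog package's `solve.QP()`. A sketch, assuming `ests` is a \(b \times k\) matrix holding \(k\) sample values of each of the \(b\) estimators:

```r
library(quadprog)

# Find nonnegative weights summing to one that minimize the ensemble's
# empirical second moment; the small ridge keeps D positive definite.
min_var_weights <- function(ests) {
  b <- nrow(ests)
  D <- 2 * tcrossprod(ests) / ncol(ests) + diag(1e-8, b)
  A <- cbind(rep(1, b), diag(b))  # first column: sum-to-one; rest: w_i >= 0
  solve.QP(Dmat = D, dvec = rep(0, b), Amat = A,
           bvec = c(1, rep(0, b)), meq = 1)$solution
}
```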
From a past Knowledge Discovery and Data Mining (KDD Cup) competition
42 variables
Traffic was categorized as "good" or "bad" with respect to intrusion
Task is to categorize traffic
Approach: apply random subspace ensembles
duration, protocol_type, service, flag, src_bytes, dst_bytes, land, wrong_fragment, urgent, hot, num_failed_logins, logged_in, num_compromised, root_shell, su_attempted, num_root, num_file_creations, num_shells, num_access_files, num_outbound_cmds, is_host_login, is_guest_login, count, srv_count, serror_rate, srv_serror_rate, rerror_rate, srv_rerror_rate, same_srv_rate, diff_srv_rate, srv_diff_host_rate, dst_host_count, dst_host_srv_count, dst_host_same_srv_rate, dst_host_diff_srv_rate, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate, dst_host_serror_rate, dst_host_srv_serror_rate, dst_host_rerror_rate, dst_host_srv_rerror_rate, type
| # partitions | Train time (sec) |
|---|---|
| 2 | 262.618 |
| 10 | 333.39 |
| 20 | 607.278 |
| 30 | 858.823 |
| 40 | 1105.608 |
| 42 | 1107.833 |
| # partitions | True Pos | False Pos | False Neg | True Neg |
|---|---|---|---|---|
| 1 | 0.9931 | 0.8125 | 0.1875 | 0.0069 |
| 2 | 1 | 0 | 1 | 0 |
| 10 | 0.995 | 0 | 1 | 0.005 |
| 20 | 0.9913 | 0 | 1 | 0.0087 |
| 30 | 0.9863 | 0 | 1 | 0.0137 |
| 40 | 0.9831 | 0 | 1 | 0.0169 |
| 42 | 0.9813 | 0 | 1 | 0.0187 |
| # partitions | True Pos | False Pos | False Neg | True Neg |
|---|---|---|---|---|
| 1 | 0.9931 | 0.8125 | 0.1875 | 0.0069 |
| 2 | 1 | 0.9125 | 0.0875 | 0 |
| 10 | 0.9950 | 0 | 1 | 0.005 |
| 20 | 0.9931 | 0 | 1 | 0.0069 |
| 30 | 0.9881 | 0 | 1 | 0.0119 |
| 40 | 0.9831 | 0 | 1 | 0.0169 |
| 42 | 0.9831 | 0 | 1 | 0.0169 |
| # partitions | True Pos | False Pos | False Neg | True Neg |
|---|---|---|---|---|
| 1 | 0.9931 | 0.8125 | 0.1875 | 0.0069 |
| 2 | 0.9938 | 0.0025 | 0.997 | 0.0062 |
| 10 | 0.9975 | 0 | 1 | 0.0025 |
| 20 | 0.9969 | 0 | 1 | 0.0031 |
| 30 | 0.9988 | 0 | 1 | 0.0012 |
| 40 | 0.9831 | 0 | 1 | 0.0169 |
| 42 | 0.9831 | 0 | 1 | 0.0169 |
Theoretical
Technological
Forthcoming
Same model abstraction as the `caret` package
Will use `foreach` for parallelism
Support is planned for data frames, matrices, bigmemory, and SciDB
Supervised and unsupervised learning will be supported
Email me for an alpha version
Email: michael.kane@yale.edu
Github: http://github.com/kaneplusplus
This talk: http://euler.stat.yale.edu/~mjk56/SubspaceEnsembles
Download the `ensemble` package: email me
Email me for the `slidify` code for this talk
Alglib (2013) The ALGLIB User's Guide. http://www.alglib.net/#book
Algorithm AS 274 (1992) Applied Statistics. 41(2).
F. R. Bach and M. I. Jordan (2005) Predictive low-rank decomposition for kernel methods. Proceedings of the \(22^{nd}\) International Conference on Machine Learning.
J. Baglama and L. Reichel (2005) Augmented Implicitly Restarted Lanczos Bidiagonalization Methods. SIAM Journal of Scientific Computing.
O. Chapelle (2006) Training a Support Vector Machine in the Primal. Neural Computation. 19(5), 1155-1178.
X. Li and H. Zhao (2009) Weighted random subspace methods for high dimensional data classification. Statistics and Its Interface. 2, 153-159.
N. Matloff (2012) Efficient R Parallel Loops on Long-Latency Platforms. Talk given at the \(43^{rd}\) Symposium on the Interface of Computing Science and Statistics.
N. Matloff (2013) Software Alchemy: Turning Complex Statistical Computations into Embarrassingly Parallel Ones. Preprint.