Scatter Matrix Sharing for Distributed, Supervised Learning

Michael Kane, Yale Center for Analytical Sciences

Goals for this talk

1. Provide a quick background of Large Complex Data: Divide and Recombine (D&R) with RHIPE (Guha, Hafen, et al.).

2. Propose a method for assessing the "stability" of this approach based on the scatter matrix

3. Show the statistical properties of the scatter matrix under various model assumptions

4. Show how "Scatter Matrix Sharing" can be used to get exact results in supervised learning challenges

Who am I?

Research Faculty in Biostatistics at Yale University

Background in computing, machine learning, and statistics

Track record

  • 2010 ASA Chambers Award
  • 2012 DARPA XDATA Initiative
  • 2013 Grand Challenges Explorations Round 11
  • Editor for Chapman & Hall's upcoming Handbook of Big Data

Interested in scalable machine learning and applied probability

D&R approach

[Figure slides illustrating the divide and recombine workflow]

Significance

  1. You don't have to train a learner (fit a model) on an entire data set
    • Many "small" learners are better than a single "big" one
    • Applies to a large class of supervised learning problems

  2. This can be done in a single MapReduce step (see the sketch below)
    • Fits on modern "embarrassingly parallel" computing frameworks
    • Minimum number of MR "units"
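To make the recipe concrete, here is a minimal serial sketch of D&R least squares in plain R (a stand-in for the MapReduce step; all names and sizes are hypothetical): the "map" fits one small learner per block, and the "reduce" averages the block coefficients.

```r
## Minimal D&R sketch for least squares (serial stand-in for MapReduce).
set.seed(1)
n <- 10000; p <- 5; r <- 10                 # rows, predictors, blocks
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% runif(p)) + rnorm(n)

blocks <- split(seq_len(n), rep(seq_len(r), each = n / r))

## "Map": train a small learner on each block.
fits <- lapply(blocks, function(idx) coef(lm(y[idx] ~ X[idx, ] - 1)))

## "Reduce": recombine by averaging the block coefficients.
beta_dr  <- Reduce(`+`, fits) / r
beta_ols <- coef(lm(y ~ X - 1))             # all-data fit, for comparison
round(cbind(beta_dr, beta_ols), 3)
```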

What makes this work?

If we represent the problem as: \[ \left[ \begin{array}{c} Y_1 \\ Y_2 \\ \vdots \\ Y_r \end{array} \right] = \left[ \begin{array}{c} \mathbf{X}_1 \\ \mathbf{X}_2 \\ \vdots \\ \mathbf{X}_r \end{array} \right] \beta. \]

then D&R works when

\[ (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T Y = \frac{1}{r} \sum_{i=1}^r (\mathbf{X}_i^T \mathbf{X}_i)^{-1} \mathbf{X}_i^T Y_i \]

Which conditions do we need for this to be true?

If \(\mathbf{S}^{-1} = (\mathbf{X}^T \mathbf{X})^{-1}\) exists, each \(\mathbf{S}_i^{-1} = (\mathbf{X}_i^T \mathbf{X}_i)^{-1}\) exists, and \(\mathbf{S}_i = \frac{c}{n} \mathbf{S}\) (each block of \(c\) rows carries a proportional share of the full \(n\)-row scatter matrix, so \(r = n/c\)), then

\[ \beta = \frac{1}{r} \sum_{i=1}^r \mathbf{S}_i^{-1} \mathbf{X}_i^T Y_i \]

Proof: WLOG assume \(r = 2\), so that \(c = n/2\), \(\mathbf{S}_1 = \mathbf{S}_2 = \frac{1}{2} \mathbf{S}\), and \(\mathbf{S}_i^{-1} = 2 \mathbf{S}^{-1}\).

\[ \begin{align*} \beta &= \left( \left[ \begin{array}{cc} \mathbf{X}_1^T & \mathbf{X}_2^T \end{array} \right] \left[ \begin{array}{c} \mathbf{X}_1 \\ \mathbf{X}_2 \end{array} \right] \right)^{-1} \left[ \begin{array}{cc} \mathbf{X}_1^T & \mathbf{X}_2^T \end{array} \right] \left[ \begin{array}{c} Y_1 \\ Y_2 \end{array} \right] \\ &= \left( \mathbf{S}_1 + \mathbf{S}_2 \right)^{-1} (\mathbf{X}_1^T Y_1 + \mathbf{X}_2^T Y_2) \\ &= \mathbf{S}^{-1} (\mathbf{X}_1^T Y_1 + \mathbf{X}_2^T Y_2) \\ &= \frac{1}{2} \left( \mathbf{S}_1^{-1} \mathbf{X}_1^T Y_1 + \mathbf{S}_2^{-1} \mathbf{X}_2^T Y_2 \right) \end{align*} \]
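A quick numerical check of this identity (a hypothetical construction): rotate one block by a random orthogonal matrix so that the two blocks have identical scatter matrices, making the condition hold exactly.

```r
## Verify the r = 2 identity when S_1 = S_2 holds exactly.
set.seed(2)
m <- 200; p <- 3                            # rows per block, predictors
X1 <- matrix(rnorm(m * p), m, p)
Q  <- qr.Q(qr(matrix(rnorm(m * m), m, m)))  # random orthogonal matrix
X2 <- Q %*% X1                              # t(X2) %*% X2 equals t(X1) %*% X1
X  <- rbind(X1, X2)
y  <- drop(X %*% c(1, -2, 3)) + rnorm(2 * m)
y1 <- y[1:m]; y2 <- y[-(1:m)]

b_full <- solve(crossprod(X), crossprod(X, y))
b_dr   <- (solve(crossprod(X1), crossprod(X1, y1)) +
           solve(crossprod(X2), crossprod(X2, y2))) / 2
max(abs(b_full - b_dr))                     # ~ 1e-14: agreement to rounding
```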

How do we assess convergence?

In OLS we fix (condition on) the data.

Linear algebra alone doesn't give us a way to assess this short of computing the full \(\mathbf{S}\).

Instead, assume a probability model on the data before learning.

Recall: The Frobenius Norm

For an \(n \times p\) matrix \(\mathbf{A}\) with entries \(a_{ij}\), the Frobenius norm is defined as:

\[ ||\mathbf{A}||_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^p |a_{ij}|^2} = \sqrt{\textrm{trace}(\mathbf{A}^T \mathbf{A} )} \]
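For instance, both forms agree in R (a throwaway example):

```r
## Two equivalent ways to compute the Frobenius norm of a matrix.
A <- matrix(rnorm(12), 4, 3)
sqrt(sum(A^2))                 # elementwise definition
sqrt(sum(diag(crossprod(A))))  # trace form
norm(A, "F")                   # base R built-in
```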

Scatter matrix stability

Let \(\Upsilon = \textrm{diag}(\mathbf{S})\)

\[ R(\mathbf{X}_i, \mathbf{X}) = \frac{n}{c} \frac{||\mathbf{X}_i \Upsilon^{-1}||^2_F}{ ||\mathbf{X} \Upsilon^{-1}||^2_F} \] Independent scatter matrix stability \[ R(\mathbf{X}_i, \mathbf{X}_{n \setminus i}) = \frac{n}{c} \frac{||\mathbf{X}_i \Upsilon^{-1}||^2_F}{ ||\mathbf{X}_{n \setminus i} \Upsilon^{-1}||^2_F} \]
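A sketch of the first statistic in R, assuming \(\Upsilon = \textrm{diag}(\mathbf{S})\) as above (the function name is hypothetical, not from the package):

```r
## Stability statistic R(X_i, X) for a block Xi of a full matrix X.
scatter_stability <- function(Xi, X) {
  n  <- nrow(X)
  ci <- nrow(Xi)
  Ups_inv <- diag(1 / diag(crossprod(X)), ncol(X))  # Upsilon^{-1}
  (n / ci) * norm(Xi %*% Ups_inv, "F")^2 / norm(X %*% Ups_inv, "F")^2
}
```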

Probabilistic Behavior: Normal Data

If the rows of \(\mathbf{X}\) are distributed i.i.d. \(\mathcal{N}(0, \Sigma)\), then \(R(\mathbf{X}_i, \mathbf{X})\) is distributed as:

\[ \frac{n}{c} \textrm{Beta} \left(\frac{c}{2}, \frac{n-c}{2} \right) \]

Scatter Matrix Stability: Normal Data

Expected value is 1

Variance is

\[ \left(\frac{n}{c}\right)^2 \frac{\frac{c}{2} \cdot \frac{n-c}{2}}{\left(\frac{n}{2}\right)^2 \left(\frac{n}{2}+1\right)} = \frac{2(n-c)}{c(n+2)} \]
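A quick simulation check of these moments, using a single standard normal column (\(p = 1\)), where \(\Upsilon\) cancels from the ratio:

```r
## With p = 1 the statistic reduces to (n / c) * sum(x_block^2) / sum(x^2).
set.seed(3)
n <- 1000; c_rows <- 100
sims <- replicate(5000, {
  x <- rnorm(n)
  (n / c_rows) * sum(x[1:c_rows]^2) / sum(x^2)
})
mean(sims)                                          # ~ 1
c(var(sims), 2 * (n - c_rows) / (c_rows * (n + 2))) # both ~ 0.018
```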

Independent Scatter Matrix Stability: Normal Data

\(R(\mathbf{X}_i, \mathbf{X}_{n \setminus i})\) follows an \(F(c, n-c)\) distribution

Expected value is

\[ \frac{n-c}{n-c-2} \]

Variance is

\[ \frac{2(n-c)^2(n-2)}{c(n-c-2)^2(n-c-4)} \]
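These are the standard \(F(c, n-c)\) moments; a simulation sanity check (with hypothetical \(n\) and \(c\)):

```r
## Compare simulated F(c, n - c) moments with the formulas above.
n <- 500; c_df <- 50
z <- rf(1e6, df1 = c_df, df2 = n - c_df)
c(mean(z), (n - c_df) / (n - c_df - 2))   # both ~ 1.004
c(var(z),
  2 * (n - c_df)^2 * (n - 2) /
    (c_df * (n - c_df - 2)^2 * (n - c_df - 4)))  # both ~ 0.045
```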

Independent Scatter Matrix Stability

Drop the normality assumption: by the CLT the two norms are approximately normal, so their ratio approaches a Cauchy distribution

\[ f(z) = \frac{1}{\pi} \frac{\sqrt{n/c}}{(z-1)^2 + n/c} \]
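This density is a Cauchy with location 1 and scale \(\sqrt{n/c}\), which is easy to confirm in R:

```r
## The stated density matches dcauchy with location 1, scale sqrt(n / c).
n <- 500; c_rows <- 50
z <- seq(-3, 5, by = 0.01)
f <- (1 / pi) * sqrt(n / c_rows) / ((z - 1)^2 + n / c_rows)
max(abs(f - dcauchy(z, location = 1, scale = sqrt(n / c_rows))))  # ~ 0
```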

What does this mean for D&R?

If you believe the Cauchy model, then there may be a problem: a Cauchy distribution has no finite mean, so naively averaging unstable block estimates need not converge

What can we do to fix this?

Combine more carefully:

  • Return the scatter matrices along with the slope coefficients
  • Weight based on some measure of consensus

Scatter Matrix Sharing

Spend 1 MR step calculating the scatter matrix, \(\mathbf{X}^T \mathbf{X}\).

In subsequent steps:

  • "Share" the scatter matrix among all distributed workers
  • Aggregate the results to create a single learner

Procedure    MR Steps
OLS          2
RLS          2
GLM          k
LARS         ?
SVM          ?
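A serial sketch of the two-step OLS case (hypothetical names; each lapply stands in for one MR step): step one sums the local scatter matrices, step two shares \(\mathbf{S}\) with every worker, so the recombined answer is the exact all-data solution.

```r
## Scatter Matrix Sharing for OLS in two (simulated) MR steps.
set.seed(4)
n <- 10000; p <- 5; r <- 10
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% runif(p)) + rnorm(n)
blocks <- split(seq_len(n), rep(seq_len(r), each = n / r))

## Step 1: each worker emits its local scatter matrix; the reduce sums them.
S <- Reduce(`+`, lapply(blocks, function(idx) crossprod(X[idx, ])))

## Step 2: "share" S with every worker; summing the per-block solves
## reproduces the exact all-data OLS estimate.
beta_sms <- Reduce(`+`, lapply(blocks, function(idx) {
  solve(S, crossprod(X[idx, ], y[idx]))
}))
max(abs(beta_sms - solve(S, crossprod(X, y))))  # 0 up to rounding
```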

Where are the empirical results?

Currently have an R package on GitHub

Benchmarks are slightly boring

The challenge is generalizing the code across different data stores