Abstracts for 2007-08 seminar talks

April 28, 2008:  On the Critical Density for Percolation in Random Geometric Graphs
Edmund Yeh
Department of Electrical Engineering, Yale University
Percolation theory has become a useful tool for the analysis of large-scale wireless networks. We investigate the fundamental problem of characterizing the critical density for d-dimensional Poisson random geometric graphs in continuum percolation theory. By using a probabilistic analysis which incorporates the clustering effect in random geometric graphs, we develop a new class of analytical lower bounds for the critical density. These analytical lower bounds are the tightest known to date, and reveal a deep underlying relationship between clustering effects and percolation phenomena.
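A minimal Monte Carlo sketch of the object studied here (not the analytical bounds from the talk): sample a Poisson random geometric graph in a finite box and check for a left-to-right crossing cluster, a finite-volume proxy for percolation. The box size, connection radius, densities, and the function name percolates are illustrative choices of ours.

    import numpy as np

    def percolates(density, box=20.0, radius=1.0, seed=0):
        """Sample a Poisson random geometric graph in a box x box square and
        check for a left-to-right crossing cluster (a finite-box proxy for
        percolation at the given density)."""
        rng = np.random.default_rng(seed)
        n = rng.poisson(density * box * box)
        pts = rng.uniform(0.0, box, size=(n, 2))

        parent = list(range(n))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        def union(i, j):
            parent[find(i)] = find(j)

        # connect points within the connection radius
        for i in range(n):
            d2 = np.sum((pts[i + 1:] - pts[i]) ** 2, axis=1)
            for j in np.nonzero(d2 <= radius ** 2)[0]:
                union(i, i + 1 + j)

        left = {find(i) for i in range(n) if pts[i, 0] < radius}
        right = {find(i) for i in range(n) if pts[i, 0] > box - radius}
        return bool(left & right)

    # crude estimate of the crossing probability at a few densities
    for lam in (0.8, 1.2, 1.6):
        hits = sum(percolates(lam, seed=s) for s in range(20))
        print(lam, hits / 20)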

Joint work with Zhenning Kong, Yale University.

April 24, 2008:  Locally Linear Embedding is a Laplacian Eigenmap
Michael Trosset
Department of Statistics, Indiana University
An approach to nonlinear dimension reduction called manifold learning posits that high-dimensional data lie (approximately) on low-dimensional manifolds. In a seminal paper, Roweis and Saul (2000) proposed a widely used manifold learning technique that attempts to characterize local geometry by linear coefficients that reconstruct each point from its neighbors. We present a simple example that demonstrates that Locally Linear Embedding (LLE) can profoundly distort data. We relate LLE to another manifold learning technique, Laplacian eigenmaps, demonstrating that the former can be understood as a somewhat problematic special case of the latter.
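A bare-bones sketch of the LLE construction referred to above, in its usual formulation (reconstruction weights from neighbors, then bottom eigenvectors of (I - W)^T (I - W)); the neighborhood size, regularization constant, and the toy curve are illustrative, not choices from the paper.

    import numpy as np

    def lle(X, n_neighbors=8, n_components=2, reg=1e-3):
        """Bare-bones Locally Linear Embedding (Roweis & Saul 2000):
        reconstruct each point from its neighbors, then embed with the
        bottom eigenvectors of (I - W)^T (I - W)."""
        n = X.shape[0]
        W = np.zeros((n, n))
        D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        for i in range(n):
            nbrs = np.argsort(D[i])[1:n_neighbors + 1]
            Z = X[nbrs] - X[i]                          # centered neighbors
            C = Z @ Z.T
            C += reg * np.trace(C) * np.eye(len(nbrs))  # regularize the local Gram matrix
            w = np.linalg.solve(C, np.ones(len(nbrs)))
            W[i, nbrs] = w / w.sum()                    # reconstruction weights sum to one
        M = (np.eye(n) - W).T @ (np.eye(n) - W)
        vals, vecs = np.linalg.eigh(M)
        return vecs[:, 1:n_components + 1]              # drop the constant eigenvector

    # toy example: a noisy curve in 3-d mapped to 2 coordinates
    t = np.linspace(0, 3 * np.pi, 200)
    X = np.column_stack([np.cos(t), np.sin(t), t]) + 0.01 * np.random.randn(200, 3)
    print(lle(X).shape)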

This research is supported by the Office of Naval Research. It is joint work with Brent Castle.

April 14, 2008:  Testing hypotheses about the number of factors in large factor models.
Alexei Onatski
Economics Department, Columbia University
In this paper we study high-dimensional time series that have the generalized dynamic factor structure. We develop a test of the null of k0 factors against the alternative that the number of factors is larger than k0 but no larger than k1 > k0. Our test statistic equals \max_{k0 < k \le k1} (\gamma_k - \gamma_{k+1}) / (\gamma_{k+1} - \gamma_{k+2}), where \gamma_i is the i-th largest eigenvalue of the smoothed periodogram estimate of the spectral density matrix of data at a pre-specified frequency. We describe the asymptotic distribution of the statistic, as the dimensionality and the number of observations rise, as a function of the Tracy-Widom distribution and tabulate the critical values of the test. As an application, we test different hypotheses about the number of dynamic factors in macroeconomic time series and about the number of dynamic factors driving excess stock returns.
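A small sketch of the statistic itself, assuming the ordered eigenvalues of the smoothed-periodogram estimate are already available; the function name and the stand-in eigenvalues (taken from a plain sample covariance) are ours, purely for illustration.

    import numpy as np

    def eigenvalue_ratio_statistic(eigvals, k0, k1):
        """max_{k0 < k <= k1} (g_k - g_{k+1}) / (g_{k+1} - g_{k+2}),
        where g_1 >= g_2 >= ... are the ordered eigenvalues of the
        smoothed-periodogram estimate at the chosen frequency."""
        g = np.sort(np.asarray(eigvals))[::-1]
        ratios = [(g[k - 1] - g[k]) / (g[k] - g[k + 1]) for k in range(k0 + 1, k1 + 1)]
        return max(ratios)

    # toy use: eigenvalues of a sample covariance standing in for the
    # smoothed periodogram (purely illustrative)
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 40))
    g = np.linalg.eigvalsh(X.T @ X / 500)
    print(eigenvalue_ratio_statistic(g, k0=0, k1=3))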

Keywords: Generalized dynamic factor model, approximate factor model, number of factors, hypothesis test, Tracy-Widom distribution.

April 7, 2008:  Compressed Counting and Stable Random Projections
Ping Li
Cornell University
The method of (symmetric) random projections has become a popular tool for dimension reduction, in particular for efficiently computing pairwise distances in massive high-dimensional data (including streaming data), with many applications in data mining and machine learning such as clustering, nearest neighbors, and kernel methods. This talk will start with recent work on (symmetric) stable random projections; I will then focus on the most recent work, called Compressed Counting (CC), whose underlying technique is maximally-skewed stable random projections. Computation over massive dynamic data streams was recently identified as one of the "10 challenges" in data mining research. Compressed Counting is a tool for counting the p-th frequency moment of data streams. Because the massive data are subject to frequent updating, counting the first (p = 1) moment (the sum) is trivial using a simple counter, but the task is challenging when p is not 1. Compressed Counting works almost like a simple counter when p is close to 1, and its complexity (number of counters) is a continuous function of how close p is to 1. More precisely, the number of required counters is O(1/epsilon) when p is close to 1, a significant improvement over O(1/epsilon^2), the previous best result. Counting the p-th moment has direct applications in parameter estimation using the method of moments. CC also provides a very efficient solution for counting the entropy moment, which has been extensively studied in theoretical computer science.
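A minimal sketch of the symmetric stable-projection idea for the p = 1 case (Cauchy projections and the median estimator of the l1 norm under streaming updates); this is not the maximally-skewed projection scheme of Compressed Counting, and the sizes and update stream below are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)

    # A vector being maintained under streaming updates (index, increment).
    n, k = 10_000, 200
    R = rng.standard_cauchy((k, n))      # symmetric 1-stable projection matrix
    sketch = np.zeros(k)                 # k counters, the only state we keep
    x = np.zeros(n)                      # kept here only to check the answer

    for _ in range(50_000):
        i = rng.integers(n)
        delta = rng.normal()
        x[i] += delta
        sketch += delta * R[:, i]        # linear update: works for dynamic streams

    # For 1-stable (Cauchy) projections, each sketch entry is ||x||_1 times a
    # standard Cauchy variable, so the median of the absolute values of the
    # sketch estimates ||x||_1.
    print("true l1:", np.abs(x).sum())
    print("estimate:", np.median(np.abs(sketch)))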

Bio: Ping Li is an assistant professor at Cornell University in the Department of Statistical Science and Faculty of Computing and Information Science. He graduated from Stanford in 2007, with a Ph.D. in Statistics, an MS in Computer Science, and an MS in Electrical Engineering. Ping Li's research interests include applied and computational statistics, ranking and web search, machine learning and data mining, randomized algorithms, etc.

March 31, 2008:  The Geometry of Phylogenetics
Erick A. Matsen

Phylogenetics is the inference of the evolutionary history of a set of species using present-day sequence information. This talk will describe two recent applications of geometry to phylogenetics: first, we describe in terms of convex geometry why phylogenetic inference using concatenation of several gene sequences can produce anomalous results. Second, we present a complete implicit characterization of the possible data sets corresponding to phylogenetic trees under several popular models of sequence evolution. This characterization shows that the relevant model spaces live on the boundary of a non-convex, higher-dimensional contractible space. (Although these sound like lots of big words, I'll be showing more diagrams than formulas, and the talk is certainly for non-experts!)

March 24, 2008:  Statistical Inverse Problems in Network Tomography
Vijay Nair
Department of Statistics, University of Michigan
The term network tomography characterizes two classes of large-scale inverse problems that arise in the modeling and analysis of computer and communications networks. In passive tomography, network traffic data are collected at the nodes, and the goal is to reconstruct origin-destination traffic patterns. The second class is active network tomography, where the goal is to recover link-level quality-of-service parameters, such as packet loss rates and delay distributions, from end-to-end path-level measurements. Internet service providers use this to characterize network performance and to monitor service quality. This talk will provide an overview of the network application and the statistical inverse problems, with an emphasis on active tomography. Issues in the design of probing experiments, inference, and applications to network monitoring will be discussed.
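A toy sketch of the active-tomography idea on the smallest multicast tree: only end-to-end probe outcomes are observed, yet the pass rate of the shared internal link is identified from the product/intersection relation used in classical multicast tomography. The link pass rates below are made-up values.

    import numpy as np

    rng = np.random.default_rng(0)

    # Smallest multicast tree: root --shared link a--> branch node --b--> receiver B
    #                                                               \--c--> receiver C
    # Only end-to-end outcomes (did the probe reach B? reach C?) are observed.
    a_true, b_true, c_true = 0.95, 0.90, 0.85
    n_probes = 100_000
    a = rng.random(n_probes) < a_true
    B = a & (rng.random(n_probes) < b_true)
    C = a & (rng.random(n_probes) < c_true)

    pB, pC, pBC = B.mean(), C.mean(), (B & C).mean()
    # Since P(B) = ab, P(C) = ac, and P(B, C) = abc, the unobserved internal
    # link pass rate is identified as a = P(B) P(C) / P(B, C).
    a_hat = pB * pC / pBC
    print("true a:", a_true, "estimated a:", a_hat)
    print("estimated b, c:", pB / a_hat, pC / a_hat)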

This is joint work with George Michailidis, Earl Lawrence, Bowei Xi, and Xiaodong Yang.

March 3, 2008:  Pseudo-Maximization and self-normalized processes
Victor de la Pena
Department of Statistics, Columbia University
In this talk we provide a survey of results for self-normalized processes of dependent variables. An important example of a self-normalized process is Student's t-statistic (Student (1908)). We will provide examples of the law of the iterated logarithm (LIL) for processes of this type, as well as extensions of Student-type statistics to the case of martingales.
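A tiny numerical illustration of the connection mentioned above: with S_n = sum of the X_i and V_n^2 = sum of the X_i^2, Student's t-statistic is a monotone function of the self-normalized sum S_n / V_n. The sample below is illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_exponential(200) - 1.0     # mean-zero, non-Gaussian sample

    S = x.sum()
    V = np.sqrt((x ** 2).sum())
    r = S / V                                   # the self-normalized sum S_n / V_n
    n = len(x)

    # Student's t-statistic as a monotone function of S_n / V_n:
    t_from_r = np.sqrt(n - 1) * r / np.sqrt(n - r ** 2)
    t_direct = np.sqrt(n) * x.mean() / x.std(ddof=1)
    print(t_from_r, t_direct)                   # the two agree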

February 25, 2008:  Functional CLT for spatial birth and death processes and application in spatial statistics
Xin Qi
Department of Biostatistics, Yale University
We give a functional central limit theorem for spatial birth and death processes, based on the representation of such processes as solutions of stochastic equations. For any bounded and integrable function on Euclidean space, we define a family of processes obtained by integrating this function with respect to the centered and scaled spatial birth and death process with constant death rate. We prove that this family converges weakly to a Gaussian process. We do not require the birth rate to have a finite range of interaction; instead we require that its range of interaction decay polynomially. In order to show the convergence of the finite-dimensional distributions of the above processes, we extend Penrose's multivariate spatial central limit theorem. As an application, we apply the idea to prove the asymptotic normality of time-invariance estimators for the nearest neighbor interaction model in spatial point processes.

February 18, 2008:  Free Energy Estimation, Rare Events and Efficient Importance Sampling.
Jose Blanchet
Department of Statistics, Harvard University
Estimating the free energy of a Gibbs distribution can sometimes be related to estimating an associated large deviations probability. In this talk, I'll discuss some examples in which one can develop efficient importance sampling estimators for this task. Our algorithms can be rigorously proved to achieve asymptotic optimality and, in some cases related to counting combinatorial objects, lead to the fastest known randomized algorithms for free energy estimation. As an example, we shall discuss an asymptotically optimal algorithm for estimating the free energy of a Gibbs distribution, on the space of random walks, that induces a self-attracting path interaction; a problem that is related to estimating the probability that a random walker avoids a set of Poissonian obstacles.
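A minimal sketch of efficient importance sampling for a large deviations probability, using the textbook exponentially tilted scheme for P(S_n >= n a) with standard normal increments; this is not the path-interaction algorithm from the talk, and the parameter values are illustrative.

    import numpy as np
    from math import erfc, sqrt

    rng = np.random.default_rng(0)
    n, a, reps = 50, 0.5, 100_000        # estimate P(S_n >= n*a), S_n a sum of n N(0,1)

    # Exponential tilting: simulate the increments from N(a, 1) (the tilted law
    # under which the rare event is typical) and reweight each path by the
    # likelihood ratio exp(-a*S_n + n*a^2/2).
    X = rng.normal(loc=a, scale=1.0, size=(reps, n))
    S = X.sum(axis=1)
    weights = np.exp(-a * S + n * a ** 2 / 2)
    est = np.mean((S >= n * a) * weights)

    # exact value for comparison: P(N(0,1) >= a*sqrt(n))
    print("IS estimate:", est, " exact:", 0.5 * erfc(a * sqrt(n) / sqrt(2)))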

February 11, 2008:  Concentration of distributions of sums and randomized CLT's for dependent observations.
Sergey Bobkov
Department of Mathematics, University of Minnesota
We review some results on randomized forms of the central limit theorem for dependent observations and discuss concentration methods in the study of randomized sums.

February 4, 2008:  On Correlation Decay in Matchings and Multi-spin Systems
Prasad Tetali
School of Mathematics & School of Computer Science, Georgia Institute of Technology
In the first part of the talk, we will describe a deterministic fully polynomial time approximation scheme (FPTAS) for computing the total number of matchings in a bounded-degree graph, a #P-complete problem. The approach is based on the so-called correlation decay technique originating in statistical physics. The second part of the talk deals with the construction, for interacting systems modeled on graphs, of a computation tree that preserves the marginal probability at any vertex of interest. In particular, we construct, for any finite graph and a fixed vertex, a finite computation tree with appropriately defined boundary conditions so that the marginal probability on the tree at the vertex matches that on the graph. For several interacting systems, we show using our approach that if there is strong spatial mixing on an infinite regular tree, then one has strong spatial mixing for any graph with maximum degree bounded by that of the regular tree.
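For reference, a brute-force sketch of the quantity being approximated: the total number of matchings of a small graph, computed by the exact delete-a-vertex recursion (feasible only for small graphs; the correlation-decay FPTAS is what makes large bounded-degree graphs tractable). The graph representation and function name are ours.

    from functools import lru_cache

    def num_matchings(adj):
        """Exact count of all matchings (including the empty one) of a small
        graph given as {vertex: set of neighbors}.  This is the #P-hard
        quantity that the correlation-decay FPTAS approximates."""
        @lru_cache(maxsize=None)
        def count(vertices):
            if not vertices:
                return 1
            v = min(vertices)
            rest = vertices - {v}
            total = count(rest)                   # v left unmatched
            for u in adj[v] & rest:               # or v matched to neighbor u
                total += count(rest - {u})
            return total
        return count(frozenset(adj))

    # 4-cycle: the empty matching, 4 single edges, 2 perfect matchings -> 7
    c4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
    print(num_matchings(c4))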

The talk is based on joint works with M. Bayati (Microsoft), C. Nair (Hong Kong), D. Gamarnik (MIT) and D. Katz (MIT).

January 28, 2008:  Geometry of random minimum spanning trees
Nick Read
Department of Physics, Yale University
Given a connected finite graph with costs (real numbers) assigned to the edges, the minimum spanning tree (MST) is defined to be an acyclic subgraph (a tree) that is spanning (meets all the vertices) such that the sum of the costs of the edges occupied by the tree is minimized. We consider the random version in which the costs on the edges are independent random variables, uniformly distributed between 0 and 1. The model arises physically as the strong-disorder limit of a spin glass. Using Kruskal's greedy algorithm to solve a random instance leads to a relation with bond (Bernoulli) percolation on the same graph, which we exploit throughout. We address the geometry of the random trees when the graph is a portion of the hypercubic lattice in d dimensions. As a first step, we consider instead the case in which the graph is the "Bethe lattice" (infinite Cayley tree, of fixed degree at all vertices). This requires the use of "wired" boundary conditions to obtain a non-trivial minimum spanning forest. We show that the connected component of the forest containing any given vertex has of order M^3 vertices within M steps of the given vertex. This is interpreted (heuristically) as implying that in high dimensions (d > 6) on the lattice in Euclidean space the connected components contain R^6 vertices within Euclidean distance R. Second, we address the fractal dimension of the path connecting any two vertices of the MST in Euclidean space. (This has been addressed numerically in d = 2 dimensions by several other authors.) We set up an expansion for the related problem in which the Kruskal process is halted at the percolation threshold, producing a minimum spanning forest. (It has been argued that this model produces the same result as the MST itself.) Taking a heuristic continuum limit and using renormalization group methods, we calculate this dimension to first order in 6 - d for d < 6 (for d > 6, we argue that it is 2, as for a Brownian random walk).
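A small sketch of the setup: Kruskal's algorithm on a grid with i.i.d. Uniform(0,1) edge costs, where scanning the edges in increasing cost order makes the connection to bond percolation explicit; the grid size is an arbitrary illustrative choice.

    import random

    def random_mst_grid(L, seed=0):
        """Kruskal's algorithm on an L x L grid with i.i.d. Uniform(0,1) edge
        costs.  At the moment the scan passes cost level p, the accepted edges
        form a spanning forest of the bond-percolation configuration at
        parameter p -- the link to percolation exploited in the talk."""
        rng = random.Random(seed)
        nodes = [(i, j) for i in range(L) for j in range(L)]
        edges = []
        for i, j in nodes:
            if i + 1 < L:
                edges.append((rng.random(), (i, j), (i + 1, j)))
            if j + 1 < L:
                edges.append((rng.random(), (i, j), (i, j + 1)))
        parent = {v: v for v in nodes}
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        tree = []
        for cost, u, v in sorted(edges):
            ru, rv = find(u), find(v)
            if ru != rv:           # greedy step: accept the cheapest non-cycle edge
                parent[ru] = rv
                tree.append((cost, u, v))
        return tree

    tree = random_mst_grid(20)
    print(len(tree), "edges; largest accepted cost:", max(c for c, _, _ in tree))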

Work done in collaboration with Tom Jackson (Yale).

January 14, 2008:  An upper bound on the time for quadratic sieving
Robin Pemantle
Department of Mathematics, University of Pennsylvania
Central to many factoring algorithms in use today is the following random process: generate random numbers in the interval [1,N] until some subset has a product which is a square (and there is a computable witness to this). Naive probabilistic models for the distribution of prime factors suggest that this stopping time has a very sharp threshold. Based on more sophisticated probabilistic models, we find a rigorous upper bound that is within a factor of 4/3 of a proven lower bound, and conjecture that our upper bound is in fact asymptotically sharp.
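A minimal simulation of the random process described above, assuming the drawn numbers may simply be factored outright (reasonable for the small N used here): track exponent vectors mod 2 and stop when they become linearly dependent over GF(2); the value N = 1000 and the function names are illustrative.

    import random

    def parity_vector(m, primes):
        """Exponent vector of m modulo 2, packed into an integer bitmask
        (bit i corresponds to primes[i])."""
        mask = 0
        for i, p in enumerate(primes):
            while m % p == 0:
                m //= p
                mask ^= 1 << i
        return mask

    def stopping_time(N, seed=0):
        """Draw uniform integers from [1, N] until some nonempty subset has a
        square product, i.e. until the parity vectors become linearly
        dependent over GF(2)."""
        rng = random.Random(seed)
        primes = [p for p in range(2, N + 1)
                  if all(p % q for q in range(2, int(p ** 0.5) + 1))]
        basis = {}                          # pivot bit -> reduced parity vector
        t = 0
        while True:
            t += 1
            v = parity_vector(rng.randint(1, N), primes)
            while v:
                pivot = v.bit_length() - 1
                if pivot not in basis:      # new pivot: still independent
                    basis[pivot] = v
                    break
                v ^= basis[pivot]
            if not v:                       # reduced to zero: dependence found
                return t

    print([stopping_time(1000, seed=s) for s in range(5)])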

This is joint work with Andrew Granville, Ernie Croot and Prasad Tetali.

December 3, 2007:  The Chernoff lower bound for symmetric quantum hypothesis testing
Michael Nussbaum
Department of Mathematics, Cornell University
Symmetric hypothesis testing in quantum statistics is considered, where the hypotheses are density operators on a finite dimensional complex Hilbert space, representing states of a finite quantum system. A lower bound on the asymptotic rate exponents of Bayesian error probabilities is proved. The bound represents a quantum extension of the Chernoff bound, which gives the best asymptotically achievable error exponent in classical discrimination between two probability measures on a finite set. In the present framework the classical result is reproduced if the two hypothetic density operators commute. Recently it has been shown that the lower bound is achievable also in the generic quantum (noncommutative) case. This implies that the result is one part of the definitive quantum Chernoff bound.
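A small numerical sketch of the bound's defining quantity, -log min over 0 <= s <= 1 of Tr(rho^s sigma^(1-s)), evaluated by a grid search for a pair of non-commuting qubit density matrices; the example states are made up for illustration, and the expression reduces to the classical Chernoff exponent when the two states commute.

    import numpy as np

    def mat_power(rho, s):
        """rho**s for a positive semidefinite matrix, via its eigendecomposition."""
        w, U = np.linalg.eigh(rho)
        w = np.clip(w, 0.0, None)
        return (U * w ** s) @ U.conj().T

    def quantum_chernoff_exponent(rho, sigma, grid=1001):
        """-log min_{0<=s<=1} Tr(rho^s sigma^(1-s)), computed by grid search."""
        ss = np.linspace(0.0, 1.0, grid)
        vals = [np.trace(mat_power(rho, s) @ mat_power(sigma, 1.0 - s)).real for s in ss]
        return -np.log(min(vals))

    # two non-commuting qubit states (purely illustrative)
    rho = np.array([[0.8, 0.1], [0.1, 0.2]])
    sigma = np.array([[0.4, -0.2], [-0.2, 0.6]])
    print(quantum_chernoff_exponent(rho, sigma))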

November 26, 2007:  Group Variable Selection via Hierarchical Lasso and Its Oracle Property
Ji Zhu
Department of Statistics, University of Michigan
In many engineering and scientific applications, predictor variables are grouped, for example, in biological applications where assayed genes or proteins can be grouped by biological role. Common statistical analysis methods such as ANOVA, factor analysis, and functional modeling with partially ordered basis sets also exhibit natural variable groupings. We develop a new variable selection method that respects the group constraint. Our new method enjoys benefits that existing successful methods do not have, while offering the potential for achieving asymptotic "oracle" properties.
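For orientation, a sketch of the standard group-lasso criterion and its block soft-thresholding proximal step, the basic mechanism by which group-structured penalties select or drop whole groups at once; this is not the hierarchical lasso proposed in the talk, and all data and tuning values below are illustrative.

    import numpy as np

    def group_lasso(X, y, groups, lam, step=None, n_iter=500):
        """Proximal gradient descent for
            (1/2n) ||y - X b||^2 + lam * sum_g ||b_g||_2 ,
        the standard group-lasso criterion: the block soft-thresholding step
        zeroes out whole groups of coefficients at once."""
        n, p = X.shape
        if step is None:
            step = n / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant
        b = np.zeros(p)
        for _ in range(n_iter):
            grad = X.T @ (X @ b - y) / n
            z = b - step * grad
            for g in groups:                              # block soft-thresholding
                norm = np.linalg.norm(z[g])
                b[g] = 0.0 if norm == 0 else max(0.0, 1 - step * lam / norm) * z[g]
        return b

    # toy data: 3 groups of 3 predictors, only the first group is active
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 9))
    beta = np.array([2.0, -1.5, 1.0, 0, 0, 0, 0, 0, 0])
    y = X @ beta + 0.1 * rng.standard_normal(200)
    groups = [np.arange(0, 3), np.arange(3, 6), np.arange(6, 9)]
    print(np.round(group_lasso(X, y, groups, lam=0.2), 2))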

November 12, 2007:  Some Statistical Methods for Analyzing Neural Spike Trains
Robert E. Kass
Department of Statistics and Center for the Neural Basis of Cognition, Carnegie Mellon University
One of the most important techniques in learning about the functioning of the brain has involved examining neuronal activity in laboratory animals under varying experimental conditions. Neural information is represented and communicated through series of action potentials, or spike trains, and the central scientific issue in many studies concerns the physiological significance that should be attached to a particular neuron firing pattern in a particular part of the brain. In addition, a major relatively new effort in neurophysiology involves the use of multielectrode recording, in which responses from dozens of neurons are recorded simultaneously. Among other things, this has made possible the construction of brain-controlled robotic devices, which could benefit people whose movement has been severely impaired.

My colleagues and I have formalized specific scientific questions in terms of point process intensity functions, and have used Bayesian methods to fit the point process models to neural spike-train data. In my talk I will very briefly outline some of the substantive problems we have examined, and the progress we've made. I will also describe our work on approximate methods for state-space models and its potential applicability to neural prostheses.
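A minimal sketch of the point-process ingredient: simulate a spike train from an inhomogeneous Poisson intensity lambda(t) = exp(b0 + b1*t) by thinning, then recover (b0, b1) by Newton steps on the point-process log-likelihood. This is plain maximum likelihood, not the Bayesian state-space machinery of the talk, and the intensity model and constants are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulate one spike train from an inhomogeneous Poisson process with
    # intensity lambda(t) = exp(b0 + b1*t) on [0, T], by thinning.
    b_true = np.array([1.0, 0.15])
    T = 10.0
    lam_max = np.exp(b_true[0] + b_true[1] * T)
    cand = np.sort(rng.uniform(0, T, rng.poisson(lam_max * T)))
    keep = rng.random(cand.size) < np.exp(b_true[0] + b_true[1] * cand) / lam_max
    spikes = cand[keep]

    # Maximum likelihood for (b0, b1): the point-process log-likelihood is
    #   sum_i (b0 + b1*t_i) - integral_0^T exp(b0 + b1*t) dt,
    # which is concave, so a few Newton steps suffice.
    grid = np.linspace(0, T, 2001)
    dt = grid[1] - grid[0]
    b = np.array([np.log(len(spikes) / T), 0.0])      # start from a constant rate
    for _ in range(25):
        lam = np.exp(b[0] + b[1] * grid)
        grad = np.array([len(spikes) - lam.sum() * dt,
                         spikes.sum() - (grid * lam).sum() * dt])
        hess = -np.array([[lam.sum(), (grid * lam).sum()],
                          [(grid * lam).sum(), (grid ** 2 * lam).sum()]]) * dt
        b = b - np.linalg.solve(hess, grad)
    print("true:", b_true, "fitted:", b)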

October 29, 2007:  Five Great Errors by Great Statisticians.
Steve Stigler
Department of Statistics, University of Chicago
What sort of role models should aspiring statistical researchers follow? A new approach is suggested by a study of great errors in statistical theory by Joseph-Louis Lagrange, Karl Pearson, Ronald Fisher, and Harold Hotelling. It is arguable that they succeeded because of their errors, not in spite of them. The lessons from this are obvious.

October 15, 2007:  On Margin Based Semisupervised Learning
Junhui Wang
Department of Statistics, Columbia University
In classification, semi-supervised learning occurs when a large amount of unlabeled data is available along with only a small amount of labeled data. This poses a great challenge in that it is difficult to achieve good classification performance through the labeled data alone. To leverage unlabeled data for enhancing classification, we introduce a margin-based semi-supervised learning method within the framework of regularization, built on a margin loss for unlabeled data that seeks efficient extraction of the information in the unlabeled data for estimating the Bayes rule for classification. In particular, I will discuss three aspects: (1) the idea and methodology development; (2) computational tools; (3) a statistical learning theory. Numerical examples will be provided to demonstrate the advantage of the proposed methodology over existing competitors. An application to gene function prediction will be discussed.

September 24, 2007:  Good Estimators for Gaussian Models Having a Block-wise Structure
Linda H. Zhao
Statistics Department, University of Pennsylvania
Many multivariate Gaussian models can conveniently be split into disjoint, block-wise problems. Common settings where this situation arises are balanced ANOVA models, balanced longitudinal models and certain block-wise shrinkage estimators in nonparametric regression estimation involving orthogonal bases such as Fourier or wavelet bases.

It is well known that the standard, least squares estimate in multidimensional Gaussian models can often be improved through the use of minimax shrinkage estimators or related Bayes estimators that lead to well-motivated shrinkage estimation. In a situation involving statistically independent blocks it is natural to apply this shrinkage separately within each block. From the Bayesian perspective this results from placing independent priors on the statistically independent blocks.

In the talk we show that standard estimators constructed via independent shrinkage are inadmissible in the original problem in terms of their squared-error risk; and we provide improved minimax estimators. An alternate class of block-wise shrinkage estimators is also considered, and fairly precise conditions are given that characterize when these estimators are admissible or quasi-admissible.
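A small simulation illustrating independent blockwise shrinkage with the standard positive-part James-Stein estimator applied separately within each block (the improved minimax estimators of the talk are not reproduced here); the block means and sizes are illustrative.

    import numpy as np

    def js_positive_part(y, sigma2=1.0):
        """Positive-part James-Stein estimator of a Gaussian mean vector."""
        p = y.size
        shrink = max(0.0, 1.0 - (p - 2) * sigma2 / np.sum(y ** 2))
        return shrink * y

    rng = np.random.default_rng(0)
    # three statistically independent blocks with different signal strengths
    blocks = [rng.normal(0.0, 2.0, 10), rng.normal(0.0, 0.2, 10), rng.normal(0.0, 1.0, 10)]
    reps = 10_000
    risk_mle = risk_js = 0.0
    for _ in range(reps):
        for theta in blocks:
            y = theta + rng.standard_normal(theta.size)      # unit-variance noise
            risk_mle += np.sum((y - theta) ** 2)
            risk_js += np.sum((js_positive_part(y) - theta) ** 2)
    print("MLE risk:", risk_mle / reps, " blockwise James-Stein risk:", risk_js / reps)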

This is joint work with L. Brown.