Yale Statistics Department Seminars: 2012-13


Revised: 23 Oct 2012


Date Speaker Seminar Title
Sept. 10, 2012 Yi Jiang
Georgia State University
Can Cell Morphology tell a story?
[abstract]
The question can be rephrased as: What can we learn from analyzing and modeling the morphology of cells? I will discuss our recent work on statistical analysis and mathematical modeling of cell morphology in the retinal pigment epithelium. The story begins as we age... Age-related macular degeneration (AMD) is the main cause of vision loss in the elderly and is a looming epidemic in our aging society. Presently there is no way to determine how a patient's eye will progress, and no effective treatment for AMD. To tackle this problem, our eyes rested on the retinal pigment epithelium, because it is a crucial site of AMD pathology and it undergoes morphological changes as the eye ages and AMD progresses. We collected retinal pigment epithelium morphological data from mouse eyes. Statistical analysis of the morphometric data established that we can discriminate the genotypes of the eyes despite aging as a confounding factor. This work is the first step toward developing the relationship between the cell morphology in the epithelium and the age and disease status of the eye. We also developed a mathematical model of two-dimensional epithelium morphological dynamics. Simulations suggested that clustered cell death could cause normal retinal pigment epithelium to develop the abnormal morphologies seen in AMD patients. This work provides a foundation for a potential diagnostic and prognostic tool for AMD.
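A minimal sketch of the kind of analysis the abstract describes, not the authors' method: classify eyes by genotype from cell-shape summaries while including age, the stated confounder, as a covariate. The feature names and data below are hypothetical.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n = 200
    age = rng.uniform(1, 24, n)            # age in months (hypothetical)
    genotype = rng.integers(0, 2, n)       # 0 = wild type, 1 = mutant
    # Hypothetical morphometrics: mean cell area and shape regularity,
    # drifting with age and shifted by genotype.
    cell_area = 250 + 3 * age + 20 * genotype + rng.normal(0, 15, n)
    regularity = 0.9 - 0.01 * age - 0.05 * genotype + rng.normal(0, 0.03, n)

    X = np.column_stack([cell_area, regularity, age])
    clf = LogisticRegression(max_iter=1000)
    acc = cross_val_score(clf, X, genotype, cv=5).mean()
    print(f"cross-validated genotype accuracy: {acc:.2f}")
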
Sept. 17, 2012 Nicolai Meinshausen
University of Oxford
Regularization for large-scale regression
[abstract]
Many recent applications in the physical sciences generate large-scale datasets, and modern regression techniques are routinely applied to these data for a wide range of purposes. Many of these approaches require a careful choice of a tuning parameter. Often, though, simple qualitative and sign constraints can be imposed on grounds of physical considerations. These constraints simplify the estimation problem, as the tuning parameter becomes superfluous. We show the perhaps unexpected effectiveness of this approach for examples in climate science and beyond. Predictive accuracy is not compromised in general, and we examine under which assumptions optimal convergence rates can be achieved.
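A minimal sketch of tuning-parameter-free regression under sign constraints: non-negative least squares (NNLS), one instance of the constrained estimators the talk considers. The data are simulated.

    import numpy as np
    from scipy.optimize import nnls

    rng = np.random.default_rng(1)
    n, p = 100, 20
    X = rng.normal(size=(n, p))
    beta_true = np.zeros(p)
    beta_true[:3] = [2.0, 1.0, 0.5]        # nonnegative signal
    y = X @ beta_true + rng.normal(0, 0.5, n)

    # NNLS: minimize ||X b - y||_2 subject to b >= 0; no tuning parameter.
    beta_hat, residual_norm = nnls(X, y)
    print("estimated leading coefficients:", np.round(beta_hat[:5], 2))
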
Sept. 24, 2012 Regina Liu
Rutgers University
Combining nonparametric inferences using data depth, bootstrap and confidence distribution
[abstract]
We apply the concepts of confidence distribution and data depth, together with the bootstrap, to develop a new methodology for combined inference from several nonparametric studies of a common hypothesis. A confidence distribution (CD) is a sample-dependent distribution function that can be used to estimate parameters of interest. It can be viewed as a "distribution estimator" of the parameter of interest. Examples of CDs include Efron's bootstrap distribution and Fraser's significance function (also referred to as the p-value function). Although the concept of CD has natural links to concepts of Bayesian inference and the fiducial arguments of R. A. Fisher, it is a purely frequentist concept and has attracted renewed interest in recent years. CDs have shown high potential to be effective tools in statistical inference. We discuss a new approach to combining the test results from several independent studies of a common multivariate nonparametric hypothesis. Specifically, in each study we apply data depth and the bootstrap to obtain a p-value function for the common hypothesis. The p-value functions are then combined under the framework of combining confidence distributions. This approach has several advantages. First, it allows us to resample directly from the empirical distribution, rather than from the estimated population distribution satisfying the null constraints. Second, it enables us to obtain test results directly without having to construct an explicit test statistic and then establish or approximate its sampling distribution. The proposed method provides a valid inference approach for a broad class of testing problems involving multiple studies where the parameters of interest can be either finite- or infinite-dimensional. The method will be illustrated using simulations and flight data from the Federal Aviation Administration (FAA).

This is joint work with Dungang Liu (School of Public Health, Yale University) and Minge Xie (Department of Statistics, Rutgers University).
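A minimal sketch of the general combining step: each study supplies a p-value function p_i(theta), and the functions are merged with a normal (Stouffer-type) recipe, one standard choice in the CD-combination literature. The one-sample t setup here is purely illustrative, not the depth-based procedure of the talk.

    import numpy as np
    from scipy import stats

    def pvalue_function(sample, theta_grid):
        """One-sided t-based p-value function for a univariate mean."""
        n, m, s = len(sample), sample.mean(), sample.std(ddof=1)
        t = (theta_grid - m) * np.sqrt(n) / s
        return stats.t.cdf(t, df=n - 1)    # increasing in theta, a CD

    rng = np.random.default_rng(2)
    theta = np.linspace(-1, 2, 501)
    studies = [rng.normal(0.5, 1.0, n) for n in (30, 50, 80)]
    H = np.array([pvalue_function(s, theta) for s in studies])

    # Stouffer-type combination: H_c = Phi( sum_i Phi^{-1}(H_i) / sqrt(k) )
    k = len(studies)
    z = stats.norm.ppf(np.clip(H, 1e-12, 1 - 1e-12))
    Hc = stats.norm.cdf(z.sum(axis=0) / np.sqrt(k))

    # Read off a combined 95% confidence interval from the combined CD.
    lo, hi = theta[np.searchsorted(Hc, [0.025, 0.975])]
    print(f"combined 95% CI for the mean: ({lo:.2f}, {hi:.2f})")
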
Oct. 1, 2012 David Madigan
Columbia University
Observational studies in healthcare: are they any good?
[abstract]
Observational healthcare data, such as administrative claims and electronic health records, play an increasingly prominent role in healthcare. Pharmacoepidemiologic studies in particular routinely estimate temporal associations between medical product exposure and subsequent health outcomes of interest, and such studies influence prescribing patterns and healthcare policy more generally. Some authors have questioned the reliability and accuracy of such studies, but few previous efforts have attempted to measure their performance. The Observational Medical Outcomes Partnership (OMOP, http://omop.fnih.org) has conducted a series of experiments to empirically measure the performance of various observational study designs with regard to predictive accuracy for discriminating between true drug effects and negative controls. In this talk, I describe the past work of the Partnership, explore opportunities to expand the use of observational data to further our understanding of medical products, and highlight areas for future research and development.

(on behalf of the OMOP investigators)
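A minimal sketch (hypothetical data, not OMOP's pipeline) of the performance metric the abstract describes: how well a study design's effect estimates discriminate true drug effects from negative controls, summarized by the area under the ROC curve.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(3)
    # Hypothetical log relative-risk estimates from one study design:
    neg_controls = rng.normal(0.0, 0.3, 50)   # truth: no effect
    pos_controls = rng.normal(0.7, 0.3, 25)   # truth: real effect
    estimates = np.concatenate([neg_controls, pos_controls])
    truth = np.concatenate([np.zeros(50), np.ones(25)])

    auc = roc_auc_score(truth, estimates)
    print(f"AUC for separating true effects from negative controls: {auc:.2f}")
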
Oct. 8, 2012 Yixin Fang
New York University School of Medicine
Stability Selection in Cluster Analysis
[abstract]
Recently, the concept of clustering stability has become popular for selecting the number of clusters in cluster analysis. We develop a method for estimating the clustering instability based on the bootstrap, and propose to choose the number of clusters as the one minimizing the clustering instability. The idea can also be applied to select tuning parameters in some regularized cluster analysis procedures.
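A minimal sketch of the idea, with illustrative details (clusterer, disagreement measure) that are not necessarily the talk's: fit a clusterer on two bootstrap samples, assign the full data under both fits, and record how often pairs of points are co-clustered under one fit but not the other; choose the k with the smallest instability.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    def instability(X, k, B=20, seed=0):
        rng = np.random.default_rng(seed)
        n = len(X)
        vals = []
        for _ in range(B):
            fits = []
            for _ in range(2):  # two independent bootstrap clusterings
                idx = rng.integers(0, n, n)
                km = KMeans(n_clusters=k, n_init=10).fit(X[idx])
                fits.append(km.predict(X))
            same1 = fits[0][:, None] == fits[0][None, :]
            same2 = fits[1][:, None] == fits[1][None, :]
            vals.append(np.mean(same1 != same2))  # pairwise disagreement
        return np.mean(vals)

    X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
    for k in range(2, 6):
        print(k, round(instability(X, k), 3))
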
Oct. 15, 2012 Tiefeng Jiang
University of Minnesota
Distributions of Angles in Random Packing on Spheres
[abstract]
We study the asymptotic behavior of the pairwise angles among n randomly and uniformly distributed unit vectors in p-dimensional space as the number of points n goes to infinity, while the dimension p is either fixed or growing with n. For both settings, we derive the limiting empirical distribution of the random angles and the limiting distributions of the extreme angles. The results reveal interesting differences between the two settings and provide a precise characterization of the folklore that "all high-dimensional random vectors are almost always nearly orthogonal to each other". Applications to statistics and connections with some open problems in physics and mathematics are also discussed. This is joint work with Tony Cai and Jianqing Fan.
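A minimal simulation sketch of the phenomenon (not the talk's asymptotic theory): pairwise angles among n uniform unit vectors in p dimensions concentrate around 90 degrees as p grows.

    import numpy as np

    rng = np.random.default_rng(4)

    def pairwise_angles(n, p):
        X = rng.normal(size=(n, p))
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # uniform on the sphere
        G = np.clip(X @ X.T, -1, 1)
        iu = np.triu_indices(n, k=1)
        return np.degrees(np.arccos(G[iu]))

    for p in (3, 30, 300):
        ang = pairwise_angles(200, p)
        print(f"p={p:4d}: mean angle {ang.mean():6.2f} deg, sd {ang.std():5.2f}")
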
Oct. 22, 2012 Hongzhe Li
University of Pennsylvania
Robust Detection and Identification of Sparse Segments in Ultra-High Dimensional Data Analysis
[abstract]
Copy number variants (CNVs) are alterations of the DNA of a genome that result in a cell having fewer or more than two copies of certain segments of the DNA. CNVs correspond to relatively large regions of the genome, ranging from about one kilobase to several megabases, that are deleted or duplicated. Motivated by CNV analysis based on next-generation sequencing data, we consider the problem of detecting and identifying sparse short segments hidden in a long linear sequence of data with an unspecified noise distribution. We propose a computationally efficient method that provides a robust and near-optimal solution for segment identification over a wide range of noise distributions. We theoretically quantify the conditions for detecting the segment signals and show that the method near-optimally estimates the signal segments whenever it is possible to detect their existence. Simulation studies are carried out to demonstrate the efficiency of the method under different noise distributions. We present results from a CNV analysis of a HapMap Yoruban sample to further illustrate the theory and the methods.
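A minimal sketch of the problem setting, not the authors' procedure: robustly standardize the sequence with the median and MAD (so heavy-tailed noise does not inflate the scale), then scan windowed sums against a conservative threshold.

    import numpy as np

    rng = np.random.default_rng(5)
    n = 5000
    x = rng.standard_t(df=3, size=n)           # heavy-tailed noise
    x[2000:2040] += 1.5                        # a short hidden segment

    med = np.median(x)
    mad = np.median(np.abs(x - med)) / 0.6745  # robust scale estimate
    z = (x - med) / mad

    L = 40                                     # scan window length
    csum = np.concatenate([[0.0], np.cumsum(z)])
    scores = (csum[L:] - csum[:-L]) / np.sqrt(L)
    thresh = np.sqrt(2 * np.log(n))            # conservative scan threshold
    hits = np.flatnonzero(scores > thresh)
    print("windows starting near:", hits[:5], "threshold:", round(thresh, 2))
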
Oct. 29, 2012 Dan Nettleton
Iowa State University
Testing Union-of-Cones Hypotheses for the Identification of Traits that Exhibit Heterosis
[abstract]
Heterosis, also known as hybrid vigor, occurs when the mean trait value of offspring is more extreme than that of either parent. Well before heterosis was first scientifically described by Darwin in 1876, humans had been using heterosis for various practical purposes. Within the last century, heterosis has been used to improve many crop species for food, feed, and fuel industries. Despite intensive study and successful utilization of heterosis, the basic molecular genetic mechanisms responsible for heterosis remain unclear. In an effort to better understand the underlying mechanisms, researchers have begun to measure the expression levels of thousands of genes in parental maize lines and their hybrid offspring. The expression level of each gene can be viewed as a trait alongside more traditional traits like plant height, grain yield, or drought tolerance. This talk will describe statistical methods that can be used to identify traits that exhibit heterosis. The testing problem is nonstandard because the null hypothesis of no heterosis constrains a parameter vector to a union of two essentially disjoint closed convex cones that is neither a cone nor convex. We will present the likelihood ratio test for heterosis and discuss challenges that arise when attempting to apply it simultaneously to data from thousands of traits. We will also propose an alternative strategy that involves hierarchical modeling and empirical Bayesian inference for simultaneous estimation and identification of heterosis for multiple traits. This talk covers joint work with Tieming Ji, Peng Liu, and Heng Wang.
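A minimal sketch, not the union-of-cones likelihood ratio test the talk develops: an intersection-union check for high-parent heterosis, which declares heterosis for a trait only if the hybrid mean significantly exceeds each parent. The expression values are hypothetical.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(6)
    parent1 = rng.normal(10.0, 1.0, 8)
    parent2 = rng.normal(11.0, 1.0, 8)
    hybrid  = rng.normal(12.5, 1.0, 8)

    # One-sided two-sample t-tests: hybrid > parent, for each parent.
    p1 = stats.ttest_ind(hybrid, parent1, alternative="greater").pvalue
    p2 = stats.ttest_ind(hybrid, parent2, alternative="greater").pvalue
    p_iut = max(p1, p2)   # intersection-union: need evidence against both
    print(f"high-parent heterosis p-value: {p_iut:.4f}")
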
Nov. 5, 2012 Maya Bar-Hillel
Hebrew University of Jerusalem
The Bible Code -- Riddle and Solution
[abstract]
In 1995, Statistical Science published a paper purporting to prove the existence of a code in the book of Genesis that predicts future events. Some of the world's leading statisticians and mathematicians had not managed to find the flaw in this work. In 1999, Statistical Science published a refutation of the so-called Bible Code proof, by a team that included the present speaker. This lecture will relate the story of the rise and fall of the Bible Code -- a statistical riddle and its solution.
Nov. 12, 2012 Wenbo Li
University of Delaware
Gaussian inequalities and conjectures
[abstract]
Gaussian inequalities play a fundamental role in the study of high dimensional probability. We first provide an overview of various Gaussian inequalities and then present several recent results and conjectures for Gaussian measure/vectors, together with various applications.
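A minimal Monte Carlo sketch of one classical inequality of the kind surveyed, Sidak's inequality: for a centered Gaussian vector, P(|X_1| <= t_1, ..., |X_d| <= t_d) is at least the product of the marginal probabilities. Checked here for an equicorrelated covariance.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    d, rho, t = 5, 0.5, 1.0
    cov = (1 - rho) * np.eye(d) + rho * np.ones((d, d))
    X = rng.multivariate_normal(np.zeros(d), cov, size=200_000)

    joint = np.mean(np.all(np.abs(X) <= t, axis=1))
    product = (2 * stats.norm.cdf(t) - 1) ** d
    print(f"joint {joint:.4f} >= product {product:.4f}: {joint >= product}")
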
Nov. 19, 2012 No Speaker
Thanksgiving Break
Nov. 26, 2012 Joel Zinn
Texas A&M University
Functional Depth
[abstract]
In the last several years there has been interest in extending the various notions of statistical depth and quantiles to the functional and infinite dimensional setting. We will present some of these notions and indicate both positive and negative aspects. We will also discuss one approach which bypasses depth and goes directly to quantile functions. This is joint work with J. Kuelbs.
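A minimal sketch of one such depth notion, the modified band depth with bands formed by pairs of curves: a curve is deep if, over many time points, it lies inside the envelope of pairs of other curves. This illustrates the general idea, not the specific notions or the quantile approach of the talk.

    import numpy as np
    from itertools import combinations

    def modified_band_depth(curves):
        """curves: (n_curves, n_timepoints) array; returns a depth per curve."""
        n = len(curves)
        depth = np.zeros(n)
        for j, k in combinations(range(n), 2):
            lo = np.minimum(curves[j], curves[k])
            hi = np.maximum(curves[j], curves[k])
            # fraction of time each curve spends inside the band [lo, hi]
            depth += np.mean((curves >= lo) & (curves <= hi), axis=1)
        return depth / (n * (n - 1) / 2)

    rng = np.random.default_rng(8)
    t = np.linspace(0, 1, 50)
    curves = np.sin(2 * np.pi * t) + rng.normal(0, 0.2, (20, 50))
    curves[0] += 2.0                       # an outlying curve
    d = modified_band_depth(curves)
    print("outlier depth:", round(d[0], 3),
          "median depth:", round(np.median(d), 3))
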
Dec. 3, 2012 Wei Biao Wu
University of Chicago
Covariance and Precision Matrix Estimation for High-Dimensional Time Series
[abstract]
I will consider estimation of covariance matrices and their inverses (a.k.a. precision matrices) for high-dimensional stationary and locally stationary time series. In the latter case the covariance matrices evolve smoothly in time, thus forming a covariance matrix function. Using the functional dependence measure of Wu (2005), we obtain the rate of convergence for the thresholded estimate and illustrate how dependence affects the rate of convergence. Asymptotic properties are also obtained for the precision matrix estimate, which is based on the graphical Lasso principle. Our theory substantially generalizes earlier results by allowing dependence, by allowing non-stationarity, and by relaxing the associated moment conditions.
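A minimal sketch of the thresholded covariance estimate in its simplest form: entries of the sample covariance smaller than a threshold are set to zero. The data here are i.i.d. for simplicity; the talk's theory covers dependent and locally stationary series, and the threshold level below is a typical choice, not the talk's.

    import numpy as np

    rng = np.random.default_rng(9)
    n, p = 100, 200
    X = rng.normal(size=(n, p))            # placeholder data matrix

    S = np.cov(X, rowvar=False)
    lam = 2 * np.sqrt(np.log(p) / n)       # typical threshold level
    S_thr = np.where(np.abs(S) >= lam, S, 0.0)
    np.fill_diagonal(S_thr, np.diag(S))    # keep diagonal entries
    print(f"nonzero off-diagonal fraction: "
          f"{(np.count_nonzero(S_thr) - p) / (p * (p - 1)):.3f}")
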