### Abstracts for 2010-11 seminar talks

May 2, 2011:  Bayesian selective inference
Daniel Yekutieli
Tel Aviv University, Department of Statistics and Operations Research
The term selective inference refers to marginal statistical inferences provided for parameters that are selected after viewing the data, where the selected parameters are typically the "significant" findings of a multiple testing procedure. I will discuss selective inference from a Bayesian perspective. I will show that if the parameter is assigned a non-informative prior, or if it is a "fixed" unknown constant, then it is necessary to adjust the Bayesian inference for selection. I will present a Bayesian framework for providing inference for selected parameters, together with Bayesian False Discovery Rate controlling methodology that generalizes existing Bayesian FDR methods, which are defined only in the two-group mixture model. I will illustrate the results by applying them to simulated data and data from a microarray experiment.
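To see numerically why a non-informative prior forces a selection adjustment, consider a toy sketch (mine, not the talk's framework): theta has a flat prior on a grid, X | theta ~ N(theta, 1), and X is reported only when |X| > c. One common adjustment divides the likelihood by the selection probability P(|X| > c | theta); the grid name, cutoff, and range below are all illustrative assumptions.

```python
import math

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def posterior_means(x, c, lo=-6.0, hi=8.0, n=2801):
    """Grid posteriors for theta given X = x ~ N(theta, 1), flat prior.

    'naive' ignores that x was reported only because |x| > c;
    'adjusted' divides the likelihood by P(|X| > c | theta)."""
    grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    w_naive = [phi(x - t) for t in grid]
    w_adj = [phi(x - t) / (Phi(t - c) + Phi(-t - c)) for t in grid]
    mean = lambda w: sum(t * wi for t, wi in zip(grid, w)) / sum(w)
    return mean(w_naive), mean(w_adj)
```

For a barely significant observation (x = 2.2 with cutoff c = 2), the selection-adjusted posterior mean is pulled noticeably toward zero, while the naive flat-prior posterior mean stays at x.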

April 25, 2011: Nonparametric Methods for Multivariate Data and Repeated Measures Designs
Arne Bathke
University of Kentucky, Statistics
Data obtained through observational or experimental studies, for example in the life sciences or social sciences, are often intrinsically multivariate because several response variables are measured on the same experimental unit (multiple endpoints). We present new nonparametric methods for statistical inference based on such data. The nonparametric approach does not need the assumption of normality, and it has the advantage that it can handle quantitative, as well as ordinal response variables, or a mixture of both. Furthermore, the proposed tests are invariant under monotone transformations of the original variables. We will present asymptotic results for different situations, supplemented by results from simulation studies, as well as the analysis of a data example.

April 11, 2011: Correcting for population stratification in case-control studies of rare genetic variation
Michael Epstein
Emory University, Departments of Human Genetics and Biostatistics
Recent advances in next-generation sequencing technology have enabled investigators to assess the role of rare genetic variation in the origins of complex human diseases. Within case-control resequencing studies, investigators typically test for association between rare variants and disease using burden tests that collapse sets of rare variants within a gene or region into a composite variable prior to association testing with disease. An open issue with resequencing studies, and burden association tests in particular, is their validity in the presence of confounding due to population stratification. Such confounding will arise when genetic variation is correlated with variation in disease risk across latent subpopulations or geographic gradients. In this talk, I describe the use of a measure called the stratification score (defined as the odds of disease given confounders) to resolve confounding due to population stratification in case-control resequencing studies. I first show how one can use the stratification score to choose a subset of subjects for resequencing from a larger GWAS sample who are well matched on genetic ancestry. Next, I describe how one can use the stratification score to adjust existing burden tests (many of which rely on statistical frameworks that do not allow for covariates) for population stratification. We illustrate our approaches using both simulated and real data from an existing study of schizophrenia. This is joint work with Drs. Glen Satten and Andrew Allen.

March 28, 2011:  Likelihood Approach for Detecting Imprinting and Maternal Effects
Shili Lin
The Ohio State University
Genomic imprinting and maternal effects are two epigenetic factors that have been increasingly explored for their roles in the etiology of complex diseases. This is part of a concerted effort to find the "missing heritability". Accordingly, statistical methods have been proposed to detect imprinting and maternal effects simultaneously based on either a case-parents triads design or a case-mother/control-mother pairs design. However, these methods are not amenable to extended families, which are commonly recruited in family-based studies. Further, existing methods are full-likelihood based and have to make strong assumptions concerning mating type probabilities (nuisance parameters) to avoid overparametrization. In this talk, I will focus on Likelihood approaches for detecting Imprinting and Maternal Effects (LIME) using family data. In particular, I will discuss LIME-ped, which uses extended pedigrees from prospective family-based association studies without the Hardy-Weinberg equilibrium assumption by introducing a novel concept called the "conditional mating type" between marry-in founders and their non-founder spouses. I will also discuss LIME-mix, which augments the two popular study designs noted above by combining them and including control-parents triads, so that the sample may contain a mixture of case-parents/control-parents triads and case-mother/control-mother pairs. By matching the case families with control families of the same structure and stratifying according to the familial genotypes, we are able to derive a partial likelihood that is free of the nuisance parameters. This renders strong assumptions unnecessary and leads to a robust procedure without sacrificing power. I will show simulation results to illustrate the power gain of LIME-ped from using extended pedigrees and to demonstrate the robustness of LIME-mix under a variety of settings.

February 28, 2011:  Fast and Accurate False Positive Control in Genome-wide Association Studies
Yu Zhang
Penn State University
Genome-wide association studies routinely test hundreds of thousands or millions of genetic markers simultaneously. Adjustment of the p-values of individual tests is necessary to reduce false positive findings; this is the multiple-comparison problem. Current practices rely on either Bonferroni corrections or permutations to evaluate the genome-wide significance of associations. The Bonferroni method is overly conservative due to the strong dependence between genetic markers, which is particularly problematic for testing high-density markers and markers in overlapping windows. Bonferroni correction also has a significant impact on false discovery rate (FDR) procedures. Permutation tests, on the other hand, are computationally too expensive for large studies involving millions of comparisons or many thousands of individuals. We propose a new method for adjusting multiple correlated comparisons that is accurate and extremely fast. The method produces accurate p-value adjustments in almost constant time, irrespective of the number of tests, the sample size, and the scale of the p-values. The method can also be easily incorporated into FDR control procedures. We introduce a new FDR control method that produces much more reasonable results than conventional methods in GWAS. We further generalize the method to conditional tests, such that biological prior knowledge of the distribution of disease genes can be incorporated to improve the sensitivity and the specificity of disease association mapping.
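For orientation, the two standard baselines the talk improves on can be sketched in a few lines (these are the textbook Bonferroni and Benjamini-Hochberg procedures, not the speaker's method):

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m; conservative under dependence."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up Benjamini-Hochberg procedure controlling FDR at level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            reject[i] = True
    return reject
```

On correlated markers, Bonferroni's threshold alpha/m is far too strict, which is exactly the conservativeness the abstract refers to; BH is less conservative but still treats the p-values generically.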

February 21, 2011:  Does aneuploidy cause cancer: can genomics data modeling help explain?
Cheng Li
Harvard University
A familiar example of aneuploidy is Down's syndrome, in which a one-copy gain of chromosome 21 can lead to many symptoms and a higher risk of cancer. Cancer cells frequently harbor an aneuploid genome with gains or losses of large chromosome regions or entire chromosomes that affect the expression of hundreds of genes. The aneuploid patterns are recurrent within a cancer type and correlate with patient response and prognosis. I will introduce various hypotheses about the relationship between aneuploidy and cancer, recent biological experiments generating new hypotheses, and how genomics data such as expression and copy number profiling, combined with statistical and bioinformatic models, may help shed light on the debate.

Cheng Li's biography: Dr. Cheng Li received his B.S. degree in computer science in 1995 from Beijing Normal University and his Ph.D. degree in statistics in 2001 from the University of California, Los Angeles. He joined the Department of Biostatistics of the Harvard School of Public Health and the Dana-Farber Cancer Institute as an assistant professor in 2002 and became an associate professor in 2008. He has developed many novel gene expression and SNP microarray analysis and visualization methods, and has implemented and maintained the widely used genomics analysis software dChip, which has been cited 1800 times. His current interests are how genomic changes in the cell promote the initiation and progression of cancer and neurological disorders, and how such changes can be used to classify these diseases for prognosis. See www.ChengLiLab.org for more information.

February 14, 2011:  Transforming Public Gene Expression Repositories into Disease Diagnosis Databases
Haiyan Huang
University of California, Berkeley
The rapid accumulation of gene expression data has offered unprecedented opportunities to study human diseases. The NCBI Gene Expression Omnibus (GEO) is currently the largest database that systematically documents the genome-wide molecular basis of diseases. In this talk, I will introduce our study on transforming a public gene expression repository, particularly NCBI GEO, into an automated disease diagnosis database. I will discuss the relevant computational and statistical issues and challenges, e.g., standardizing cross-platform gene expression data and heterogeneous disease annotations, and developing a two-stage Bayesian learning approach to achieve automated disease diagnosis under the formulation of hierarchical multiple-label classification.

February 7, 2011:  A unified framework for testing multiple phenotypes for association with genetic variants
Matthew Stephens
University of Chicago
In many ongoing genome-wide association studies, multiple related phenotypes are available for testing for association with genetic variants. In most cases, however, these related phenotypes are analysed independently from one another. For example, several studies have measured multiple lipid-related phenotypes, such as LDL-cholesterol, HDL-cholesterol, and triglycerides, but in most cases the primary analysis has been a simple univariate scan for each phenotype. This type of univariate analysis fails to make full use of potentially rich phenotypic data. While this observation is in some sense obvious, much less obvious is the right way to go about examining associations with multiple phenotypes. Common existing approaches include the use of methods such as MANOVA, canonical correlations, or Principal Components Analysis to identify linear combinations of outcomes that are associated with genetic variants. However, if such methods give a significant result, these associations are not always easy to interpret. Indeed, the usual approach to explaining observed multivariate associations is to revert to univariate tests, which seems far from ideal. In this work we outline an approach to dealing with multiple phenotypes based on Bayesian multivariate regression. The method attempts to identify which subset of phenotypes is associated with a given genotype. In this way it incorporates the null model (no phenotypes associated with genotype), the simple univariate alternative (only one phenotype associated with genotype), and the general alternative (all phenotypes associated with genotype) into a single unified framework. In particular, our approach both tests for and explains multivariate associations within a single model, avoiding the need to resort to univariate tests when explaining and interpreting significant multivariate findings.
We illustrate the approach on examples, and show how, when combined with multiple phenotype data, the method can improve both power and interpretation of association analyses.

January 31, 2011:  Computational Genomics of Gene Regulation
Xiaole Liu
Harvard University
High-throughput genomics technologies such as gene expression microarrays, tiling microarrays, and massively parallel sequencing have drastically accelerated the pace of biomedical research and discovery. However, they have also created challenges for bioinformatic data analysis. I will introduce our work in trying to understand gene regulation through transcription factor motif discovery, ChIP-chip and ChIP-seq data analysis, and epigenomic studies. I will also discuss our recent work where we use nucleosome-resolution histone mark ChIP-seq data to infer the transcription factors driving a biological process and their in vivo binding sites, and show how the method is applied to understand prostate cancer and gut development.

January 24, 2011:  Why ignore correlations: applications to genetic association analysis
Wei Pan
University of Minnesota
An important problem in genetic analysis is to test disease association with multiple genetic markers in a candidate region, for which the statistical formulation is familiar: we test multiple regression coefficients in a logistic regression model. However, the most popular Wald (or score or likelihood ratio) test may not be powerful, even for relatively "low-dimensional, high-sample sized" SNP data. In contrast to the Wald (or score) test, if we ignore correlations among the parameter estimates (or score components) and do not use their covariance matrix, the resulting test (called the SSB or SSU test) may have higher power. Interestingly, the SSB and SSU tests are closely related to two other non-parametric methods recently proposed for genomic data: genomic distance-based regression and kernel machine regression. Numerical examples will be provided to illustrate their applications to genetic association analysis of common variants and rare variants.
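The contrast can be made concrete with a minimal sketch (my toy example with p = 2 markers and a hand-coded 2x2 inverse, not the speaker's code): under the null beta = 0 in logistic regression, the score test uses the covariance of the score vector, while the SSU test simply sums the squared score components.

```python
def score_components(y, X):
    """Score vector U_j = sum_i x_ij (y_i - ybar) for logistic regression
    under the null beta = 0, plus its null covariance matrix."""
    n = len(y)
    ybar = sum(y) / n
    p = len(X[0])
    U = [sum(X[i][j] * (y[i] - ybar) for i in range(n)) for j in range(p)]
    # covariance of U under the null: ybar(1-ybar) times the centered Gram matrix
    xbar = [sum(X[i][j] for i in range(n)) / n for j in range(p)]
    V = [[ybar * (1 - ybar) * sum((X[i][j] - xbar[j]) * (X[i][k] - xbar[k])
          for i in range(n)) for k in range(p)] for j in range(p)]
    return U, V

def ssu_stat(U):
    """SSU test: sum of squared score components; covariance ignored."""
    return sum(u * u for u in U)

def score_stat(U, V):
    """Classical score test U' V^{-1} U, here for p = 2 only."""
    det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
    inv = [[V[1][1] / det, -V[0][1] / det],
           [-V[1][0] / det, V[0][0] / det]]
    return sum(U[j] * inv[j][k] * U[k] for j in range(2) for k in range(2))
```

The two statistics differ only in whether V enters; the abstract's point is that dropping V can, perhaps surprisingly, increase power in many genetic settings.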

January 10, 2011:  Geometry and Topology in Statistical Inference
Sayan Mukherjee
Duke University
We use two problems to illustrate the utility of geometry and topology in statistical inference: supervised dimension reduction (SDR), and inference of (hyper)graph models. I will also show two slides, containing only pictures, illustrating the problem of inference of stratified spaces. We start with a "tale of two manifolds." The focus is on the problem of supervised dimension reduction (SDR). We first formulate the problem with respect to the inference of a geometric property of the data, the gradient of the regression function with respect to the manifold that supports the marginal distribution. We provide an estimation algorithm, prove consistency, and explain why the gradient is salient for dimension reduction. We then reformulate SDR in a probabilistic framework and propose a Bayesian model, a mixture of inverse regressions. In this modeling framework the Grassmann manifold plays a prominent role. The second part of the talk develops a parameterization of hypergraphs based on the geometry of points in d dimensions. Informative prior distributions on hypergraphs are induced through this parameterization by priors on point configurations via spatial processes. The approach combines tools from computational geometry and topology with spatial processes and offers greater control on the distribution of graph features than Erdos-Renyi random graphs. I will close with two slides that pictorially describe the problem of inferring Whitney stratified spaces. Consider two intersecting planes in 3 dimensions and draw n points i.i.d. from this object. Can we infer which points belong to which plane and which points belong to the line defined by the intersection?

December 6, 2010:  Phase Transitions for the Multi-State Hard Core Model on a Tree
Kavita Ramanan
Brown University
The hard core model is a well studied stochastic model with "hard constraints" that arises in statistical physics, combinatorics and stochastic networks. We consider generalizations of the hard core model on a tree, in which each vertex lies in any of C+1 states, subject to the constraint that the sum of the states of any two neighboring vertices does not exceed C. We characterize the phase transition region for this model, and identify an interesting dependence on the parity of C. We also discuss extensions of this model and implications of this analysis for certain loss network models arising in telecommunications.
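On a small tree the model can be explored exactly. The sketch below (mine, not from the talk) counts the valid configurations, assignments of a state in {0, ..., C} to each vertex with every edge's state sum at most C, by dynamic programming from the root:

```python
def count_configs(adj, root, C):
    """Count assignments x_v in {0..C} with x_u + x_v <= C on every edge
    of a tree, by bottom-up dynamic programming from the root.

    adj: dict mapping each vertex to a list of its neighbors."""
    def ways(v, parent):
        # ways(v)[s] = number of valid configurations of v's subtree with x_v = s
        table = [1] * (C + 1)
        for u in adj[v]:
            if u == parent:
                continue
            child = ways(u, v)
            for s in range(C + 1):
                # child's state t must satisfy s + t <= C
                table[s] *= sum(child[t] for t in range(C + 1 - s))
        return table
    return sum(ways(root, None))
```

For C = 1 this reduces to counting independent sets; the phase-transition question in the talk concerns how the root's conditional distribution depends on far-away boundary conditions as the tree grows.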

November 29, 2010:  Efficiently Learning Mixtures of Gaussians
Ankur Moitra
MIT
Given data drawn from a mixture of multivariate Gaussians, a basic problem is to accurately estimate the mixture parameters. We provide a polynomial-time algorithm for this problem for any fixed number ($k$) of Gaussians in $n$ dimensions (even if they overlap), with provably minimal assumptions on the Gaussians and polynomial data requirements. In statistical terms, our estimator converges at an inverse polynomial rate, and no such estimator (even exponential time) was known for this problem (even in one dimension, restricted to two Gaussians). Our algorithm reduces the $n$-dimensional problem to the one-dimensional problem, where the method of moments is applied. As a corollary, we are able to give the first polynomial-time algorithm for density estimation for mixtures of $k$ Gaussians without any assumptions.

This talk will be based on two papers (Kalai, Moitra, Valiant, STOC 2010) and (Moitra, Valiant, FOCS 2010), the first of which handles the case of mixtures of two Gaussians, and the latter of which generalizes the approach to mixtures of any fixed number of Gaussians. A major technical hurdle in the first paper is proving that noisy estimates of the first $4k-2$ moments of a univariate mixture of $k$ Gaussians suffice to recover accurate estimates of the mixture parameters, as conjectured by Pearson (1894), and in fact these estimates converge at an inverse polynomial rate. For mixtures of more than two Gaussians, pathological scenarios can arise when projecting down to a single dimension. Consequently, the major challenge in the second paper concerns how to leverage a univariate algorithm with weaker guarantees to still yield an efficient learning algorithm in higher dimensions.
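The univariate moment-matching step can be caricatured as follows (a toy sketch assuming equal weights and unit variances, nothing like the papers' actual algorithm): compute the first $4k-2 = 6$ raw moments of a two-component mixture via the Gaussian moment recursion $m_n = \mu m_{n-1} + (n-1)\sigma^2 m_{n-2}$, and recover the means by brute-force moment matching over a grid.

```python
def gaussian_moments(mu, sigma2, nmax):
    """Raw moments of N(mu, sigma2) via m_n = mu*m_{n-1} + (n-1)*sigma2*m_{n-2}."""
    m = [1.0, mu]
    for n in range(2, nmax + 1):
        m.append(mu * m[n - 1] + (n - 1) * sigma2 * m[n - 2])
    return m

def mixture_moments(components, nmax):
    """Raw moments of a mixture; components is a list of (weight, mu, sigma2)."""
    out = [0.0] * (nmax + 1)
    for w, mu, s2 in components:
        g = gaussian_moments(mu, s2, nmax)
        for n in range(nmax + 1):
            out[n] += w * g[n]
    return out

def recover_means(target, grid):
    """Brute-force moment matching for 0.5*N(a,1) + 0.5*N(b,1), a <= b."""
    best, best_err = None, float("inf")
    for i, a in enumerate(grid):
        for b in grid[i:]:
            mom = mixture_moments([(0.5, a, 1.0), (0.5, b, 1.0)],
                                  len(target) - 1)
            err = sum((x - y) ** 2 for x, y in zip(mom, target))
            if err < best_err:
                best, best_err = (a, b), err
    return best
```

The hard part of the actual result, which this sketch ignores entirely, is showing that *noisy* moment estimates still pin down the parameters at an inverse polynomial rate.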

Lastly, while the running time and data requirements of our algorithm depend exponentially on the number of Gaussians in the mixture, we prove that such a dependence is necessary.

This is joint work with Adam Tauman Kalai and Gregory Valiant. This work appears as "Efficiently Learning Mixtures of Two Gaussians" (STOC 2010) and "Settling The Polynomial Learnability of Mixtures of Gaussians" (FOCS 2010).

November 15, 2010:  Imaging the Earth's Deep Interior: a statistical perspective
Ping Ma
University of Illinois at Urbana-Champaign
At a depth of 2890 km, the core-mantle boundary (CMB) separates turbulent flow of liquid metals in the outer core from slowly convecting, highly viscous mantle silicates. The CMB marks the most dramatic change in dynamic processes and material properties in our planet, and accurate images of the structure at or near the CMB--over large areas--are crucially important for our understanding of present day geodynamical processes and the thermo-chemical structure and history of the mantle and mantle-core system. In addition to mapping the CMB we need to know if other structures exist directly above or below it, what they look like, and what they mean in terms of physical and chemical material properties and geodynamical processes. Detection, imaging, characterization, and understanding of structure in this remote region have been--and are likely to remain--a frontier in cross-disciplinary geophysics research. I will discuss the statistical problems, challenges and methods in imaging the CMB.

November 8, 2010:  Adversarial Risk Analysis: Bayesian Methods in Game Theory
David Banks
http://www.stat.duke.edu/~banks/
Classical game theory is an unrealistic description of human behavior, and traditional analyses make strong assumptions about common knowledge and fixed payoffs. Classical risk analysis, meanwhile, assumes that the opponent is non-adversarial (i.e., "Nature") and is thus inapplicable to many situations. This work explores Bayesian approaches to adversarial risk analysis, in which each opponent must model the decision process of the other, but there is the opportunity to use human judgment and subjective distributions. The approach is illustrated in the analysis of two important applications: sealed-bid auctions and simple poker; some related work on counter-bioterrorism is also covered. The results in these three applications are interestingly different from those found from a minimax perspective.
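A caricature of the adversarial-risk-analysis viewpoint for a first-price sealed-bid auction (my toy example, not the talk's analysis): rather than solving for a minimax equilibrium, the bidder maximizes subjective expected utility against an elicited distribution on the opponent's bid.

```python
def ara_bid(value, opponent_cdf, grid):
    """Choose the bid in `grid` maximizing subjective expected utility
    (value - bid) * P(opponent bids less), given a subjective CDF
    for the opponent's bid."""
    return max(grid, key=lambda b: (value - b) * opponent_cdf(b))
```

With an opponent whose bid is judged Uniform(0, 1) and an item worth 0.8, the expected utility (0.8 - b) * b is maximized at b = 0.4, i.e. the bidder shades to half the value, and changing the subjective CDF changes the bid, which is exactly the flexibility the Bayesian approach buys.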

November 1, 2010:  A polarization approach to compressed sensing
Emmanuel Abbe
Federal Polytechnic School of Lausanne, Switzerland
In 2008, a technique called 'polarization' solved a problem that had been open since Shannon posed it in 1948: the construction of low-complexity codes that are provably capacity-achieving. The polarization idea can be explained in terms of a rather general probabilistic phenomenon: using the so-called polar transform, one can separate an ergodic process into two sub-processes of maximal and minimal entropy (fair coins and constants), and this procedure can be carried out at low computational cost. In this talk, we will use the idea behind polarization not for channel coding, but to propose a new approach to compressed sensing. In this approach, the measurement matrix is deterministic, whereas the signal is assumed to be statistically sparse. The overall scheme is shown to have low complexity, and the reconstruction algorithm is based on algebraic arguments rather than l1-minimization.
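The polar transform itself fits in a few lines, even though the polarization analysis and the compressed-sensing scheme obviously do not. A sketch over GF(2), applying G_N = F^{(tensor n)} with kernel F = [[1, 0], [1, 1]] via the block recursion G_2N = [[G_N, 0], [G_N, G_N]] (bit-reversal permutation omitted):

```python
def polar_transform(bits):
    """Apply y = G_N x over GF(2), where G_N = F^{(tensor n)} and
    F = [[1, 0], [1, 1]], using the recursion G_2N = [[G_N, 0], [G_N, G_N]]:
    y_top = G_N x_top, y_bottom = G_N x_top XOR G_N x_bottom."""
    n = len(bits)
    if n == 1:
        return list(bits)
    half = n // 2
    top = polar_transform(bits[:half])
    bot = polar_transform(bits[half:])
    return top + [a ^ b for a, b in zip(top, bot)]
```

Since F is its own inverse mod 2, so is G_N: applying the transform twice returns the input, which is what makes low-complexity encoding and decoding possible.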

October 25, 2010:  Robust High-dimensional Principal Component Analysis
Constantine Caramanis
The University of Texas at Austin
The analysis of very high dimensional data - data sets where the dimensionality of each observation is comparable to or even larger than the number of observations - has drawn increasing attention in the last few decades due to a broad array of applications, from DNA microarrays to video processing, to consumer preference modeling and collaborative filtering, and beyond. As we discuss, many of our tried-and-true statistical techniques fail in this regime. We revisit perhaps the most widely used statistical technique for dimensionality reduction: Principal Component Analysis (PCA). In the standard setting, PCA is computationally efficient and statistically consistent, i.e., as the number of samples goes to infinity, we are guaranteed to recover the optimal low-dimensional subspace. On the other hand, PCA is well-known to be exceptionally brittle -- even a single corrupted point can lead to arbitrarily bad PCA output. We consider PCA in the high-dimensional regime, where a constant fraction of the observations in the data set are arbitrarily corrupted. We show that standard techniques fail in this setting, and discuss some of the unique challenges (and also opportunities) that the high-dimensional regime poses. For example, one of the (many) confounding features of the high-dimensional regime is that the noise magnitude dwarfs the signal magnitude. While in the classical regime, dimensionality recovery would fail under these conditions, sharp concentration-of-measure phenomena in high dimensions provide a way forward. Then, for the main part of the talk, we propose a High-dimensional Robust Principal Component Analysis (HR-PCA) algorithm that is computationally tractable, robust to contaminated points, and easily kernelizable. The resulting subspace has a bounded deviation from the desired one, for up to 50% corrupted points.
No algorithm can possibly do better than that, and there is currently no known polynomial-time algorithm that can handle anything above 0%. Finally, unlike ordinary PCA algorithms, HR-PCA has perfect recovery in the limiting case where the proportion of corrupted points goes to zero.

October 18, 2010:  Random graphs with a given degree sequence
Sourav Chatterjee
University of California, Berkeley (on leave 2010-2011, visiting New York University)
Large graphs are sometimes studied through their degree sequences. We study graphs that are uniformly chosen with a given degree sequence. Under mild conditions, it is shown that sequences of such graphs have graph limits in the sense of Lovasz and Szegedy with identifiable limits. This allows simple determination of other features such as the number of triangles. The argument proceeds by studying a natural exponential model having the degree sequence as a sufficient statistic. The maximum likelihood estimate (MLE) of the parameters is shown to be unique and consistent with high probability. Thus n parameters can be consistently estimated based on a sample of size one. A fast, provably convergent, algorithm for the MLE is derived. These ingredients combine to prove the graph limit theorem. Along the way, a continuous version of the Erdos-Gallai characterization of degree sequences is derived.
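The MLE computation for this exponential model (often called the beta model, with the degree sequence as sufficient statistic) is short enough to sketch. A minimal implementation of the fixed-point iteration, assuming the MLE exists for the given degree sequence and that all degrees are positive:

```python
import math

def beta_model_mle(degrees, iters=300):
    """Fixed-point iteration for the MLE of the model with edge probabilities
    p_ij = e^(theta_i + theta_j) / (1 + e^(theta_i + theta_j)):
    theta_i <- log d_i - log sum_{j != i} 1/(exp(-theta_j) + exp(theta_i))."""
    n = len(degrees)
    theta = [0.0] * n
    for _ in range(iters):
        theta = [math.log(degrees[i]) -
                 math.log(sum(1.0 / (math.exp(-theta[j]) + math.exp(theta[i]))
                              for j in range(n) if j != i))
                 for i in range(n)]
    return theta

def expected_degrees(theta):
    """E[d_i] = sum_{j != i} p_ij; at the MLE this matches the observed d_i."""
    n = len(theta)
    p = lambda i, j: 1.0 / (1.0 + math.exp(-(theta[i] + theta[j])))
    return [sum(p(i, j) for j in range(n) if j != i) for i in range(n)]
```

At convergence the model's expected degrees reproduce the observed degree sequence, which is the moment-matching characterization of the MLE that underlies the consistency result in the abstract.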

October 11, 2010:  On proving Consistency of Non-Standard Kernel Estimators
David Mason
University of Delaware
I shall discuss general methods based on empirical process techniques to prove uniform in bandwidth consistency of a class of non-standard kernel-type function estimators. Examples include bias-corrected kernel density and Nadaraya-Watson function estimators, projection pursuit regression, conditional distribution estimation, and kernel estimation of the density of linear regression residuals. Our results are useful for establishing uniform consistency of data-driven bandwidth kernel-type function estimators. My talk will be based upon joint work completed and in progress with Julia Dony, Uwe Einmahl and Jan Swanepoel.
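For concreteness, here is a minimal Nadaraya-Watson estimator, the basic kernel-type function estimator whose uniform-in-bandwidth behavior is at issue (an illustrative sketch with a Gaussian kernel, not the speaker's formulation): the estimate is a kernel-weighted average of the responses, and "uniform in bandwidth" consistency concerns the estimator's behavior simultaneously over a whole range of bandwidths h.

```python
import math

def nadaraya_watson(x0, xs, ys, h):
    """Nadaraya-Watson regression estimate at x0: a kernel-weighted
    average of the responses, with Gaussian kernel and bandwidth h."""
    w = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
```

Data-driven bandwidth selectors pick h after seeing the data, which is why consistency that holds uniformly over h, rather than for one fixed h, is the property one actually needs.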

October 4, 2010:  Maximum likelihood estimation of a multidimensional log-concave density
Richard Samworth
University of Cambridge
If $X_1,...,X_n$ are a random sample from a density $f$ in $\mathbb{R}^d$, then with probability one there exists a unique log-concave maximum likelihood estimator $\hat{f}_n$ of $f$. The use of this estimator is attractive because, unlike kernel density estimation, the estimator is fully automatic, with no smoothing parameters to choose. We exhibit an iterative algorithm for computing the estimator and show how the method can be combined with the EM algorithm to fit finite mixtures of log-concave densities. Applications to classification, clustering and functional estimation problems will be discussed, as well as recent theoretical results on the performance of the estimator. The talk will be illustrated with pictures from the R package LogConcDEAD.

Co-authors: Yining Chen, Madeleine Cule, Robert Gramacy (University of Cambridge) and Michael Stewart (University of Sydney).

September 20, 2010:  Clinical Trials for Personalized Medicine: Some Statistical Challenges
Feifang Hu
University of Virginia
In recent decades, scientists have identified genes (biomarkers) that appear to be linked with diseases. Clinical trials play an essential role in translating these scientific findings into real-world products for those who need them (personalized medicine). New approaches to the drug-development paradigm are needed, especially new designs for clinical trials, so that genetics and other biomarkers can be incorporated to assist in patient and treatment selection. Moreover, the data from these studies are usually very complex and sequentially dependent. In this talk, I will focus on the following statistical issues: (i) the complexity of the data structure; (ii) clinical trial designs that use genetics or other biomarkers; and (iii) statistical inference. Some further research problems will also be discussed.