Date  Speaker  Seminar Title 

September 6, 2010 
No Seminar


September 13, 2010 
No Seminar


September 20, 2010 
Feifang Hu
University of Virginia 
Clinical Trials for Personalized Medicine: Some Statistical Challenges
[abstract] In recent decades, scientists have identified genes (biomarkers) that appear to be linked with diseases. To translate these scientific findings into real-world products for those who need them (personalized medicine), clinical trials play an essential role. New approaches to the drug-development paradigm are needed, especially new designs for clinical trials, so that genetics and other biomarkers can be incorporated to assist in patient and treatment selection. The data from these studies are also usually very complex and sequentially dependent. In this talk, I will focus on the following statistical issues: (i) the complexity of the data structure; (ii) clinical trial designs that use genetics or other biomarkers; and (iii) statistical inference. Some further research problems will also be discussed.

September 27, 2010 
Vladimir Koltchinskii
Georgia Tech, School of Mathematics 
Matrix Estimation Problems and von Neumann Entropy Penalization 

October 4, 2010 
Richard Samworth
University of Cambridge 
Maximum likelihood estimation of a multi-dimensional log-concave density
[abstract] If $X_1,...,X_n$ are a random sample from a density $f$ in $\mathbb{R}^d$, then with probability one there exists a unique log-concave maximum likelihood estimator $\hat{f}_n$ of $f$. The use of this estimator is attractive because, unlike kernel density estimation, the estimator is fully automatic, with no smoothing parameters to choose. We exhibit an iterative algorithm for computing the estimator and show how the method can be combined with the EM algorithm to fit finite mixtures of log-concave densities. Applications to classification, clustering and functional estimation problems will be discussed, as well as recent theoretical results on the performance of the estimator. The talk will be illustrated with pictures from the R package LogConcDEAD. Co-authors: Yining Chen, Madeleine Cule, Robert Gramacy (University of Cambridge) and Michael Stewart (University of Sydney). 

October 11, 2010 
David Mason
University of Delaware 
On Proving Consistency of Non-Standard Kernel Estimators
[abstract] I shall discuss general methods, based on empirical process techniques, to prove uniform-in-bandwidth consistency of a class of non-standard kernel-type function estimators. Examples include bias-corrected kernel density and Nadaraya-Watson function estimators, projection pursuit regression, conditional distribution estimation, and kernel estimation of the density of linear regression residuals. Our results are useful for establishing uniform consistency of data-driven bandwidth kernel-type function estimators. My talk will be based upon joint work, completed and in progress, with Julia Dony, Uwe Einmahl and Jan Swanepoel.

October 18, 2010 
Sourav Chatterjee
University of California at Berkeley (on leave 2010-2011, visiting New York University) 
Random graphs with a given degree sequence
[abstract] Large graphs are sometimes studied through their degree sequences. We study graphs that are uniformly chosen with a given degree sequence. Under mild conditions, it is shown that sequences of such graphs have graph limits in the sense of Lovasz and Szegedy with identifiable limits. This allows simple determination of other features such as the number of triangles. The argument proceeds by studying a natural exponential model having the degree sequence as a sufficient statistic. The maximum likelihood estimate (MLE) of the parameters is shown to be unique and consistent with high probability. Thus n parameters can be consistently estimated based on a sample of size one. A fast, provably convergent algorithm for the MLE is derived. These ingredients combine to prove the graph limit theorem. Along the way, a continuous version of the Erdos-Gallai characterization of degree sequences is derived.
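
The exponential model with the degree sequence as sufficient statistic is often called the beta model, in which an edge between vertices i and j appears with probability e^{theta_i+theta_j}/(1+e^{theta_i+theta_j}). As a rough illustration (our sketch, not the talk's analysis; the toy parameters are ours), the natural fixed-point iteration for the MLE can be written in a few lines:

```python
import numpy as np

def expected_degrees(theta):
    """Expected degree of each vertex in the beta model:
    P(edge ij) = exp(theta_i + theta_j) / (1 + exp(theta_i + theta_j))."""
    s = theta[:, None] + theta[None, :]
    p = np.exp(s) / (1.0 + np.exp(s))
    np.fill_diagonal(p, 0.0)
    return p.sum(axis=1)

def beta_model_mle(d, iters=500):
    """Fixed-point iteration for the MLE given a degree sequence d:
    theta_i <- log d_i - log sum_{j != i} exp(theta_j)/(1 + exp(theta_i + theta_j))."""
    theta = np.zeros(len(d))
    for _ in range(iters):
        denom = 1.0 + np.exp(theta[:, None] + theta[None, :])
        mat = np.exp(theta)[None, :] / denom   # (i, j) entry: e^{theta_j}/(1+e^{theta_i+theta_j})
        np.fill_diagonal(mat, 0.0)
        theta = np.log(d) - np.log(mat.sum(axis=1))
    return theta

# Toy check: recover theta from its own expected degree sequence.
theta_true = np.linspace(-0.5, 0.5, 6)
d = expected_degrees(theta_true)
theta_hat = beta_model_mle(d)
```

At the fixed point, the expected degrees under theta_hat match the observed degree sequence, which is exactly the MLE condition for an exponential family with the degrees as sufficient statistic.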

October 25, 2010 
Constantine Caramanis
The University of Texas at Austin 
Robust High-dimensional Principal Component Analysis
[abstract] The analysis of very high dimensional data (data sets where the dimensionality of each observation is comparable to, or even larger than, the number of observations) has drawn increasing attention in the last few decades, due to a broad array of applications, from DNA microarrays to video processing, to consumer preference modeling and collaborative filtering, and beyond. As we discuss, many of our tried-and-true statistical techniques fail in this regime. We revisit perhaps the most widely used statistical technique for dimensionality reduction: Principal Component Analysis (PCA). In the standard setting, PCA is computationally efficient and statistically consistent, i.e., as the number of samples goes to infinity, we are guaranteed to recover the optimal low-dimensional subspace. On the other hand, PCA is well known to be exceptionally brittle: even a single corrupted point can lead to arbitrarily bad PCA output. We consider PCA in the high-dimensional regime, where a constant fraction of the observations in the data set are arbitrarily corrupted. We show that standard techniques fail in this setting, and discuss some of the unique challenges (and also opportunities) that the high-dimensional regime poses. For example, one of the (many) confounding features of the high-dimensional regime is that the noise magnitude dwarfs the signal magnitude. While in the classical regime dimensionality recovery would fail under these conditions, sharp concentration-of-measure phenomena in high dimensions provide a way forward. Then, for the main part of the talk, we propose a High-dimensional Robust Principal Component Analysis (HR-PCA) algorithm that is computationally tractable, robust to contaminated points, and easily kernelizable. The resulting subspace has a bounded deviation from the desired one, for up to 50% corrupted points. 
No algorithm can possibly do better than that, and there is currently no known polynomial-time algorithm that can handle anything above 0%. Finally, unlike ordinary PCA algorithms, HR-PCA has perfect recovery in the limiting case where the proportion of corrupted points goes to zero.
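
As a toy caricature of the brittleness issue (our sketch, not the HR-PCA algorithm from the talk), the following shows standard PCA being destroyed by a few corrupted points, while a naive remove-and-refit loop recovers the true direction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inliers: 200 points near a one-dimensional subspace of R^5.
v_true = np.zeros(5); v_true[0] = 1.0
inliers = rng.normal(size=(200, 1)) * 3.0 @ v_true[None, :] \
          + rng.normal(scale=0.1, size=(200, 5))
# Outliers: 20 large corrupted points along a different axis.
outliers = np.zeros((20, 5)); outliers[:, 4] = 50.0
X = np.vstack([inliers, outliers])

def top_pc(X):
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

def trimmed_top_pc(X, frac=0.2, rounds=5):
    """Caricature of robust PCA: repeatedly drop the points with the largest
    squared projection onto the current top component, then refit."""
    Xw = X.copy()
    for _ in range(rounds):
        v = top_pc(Xw)
        proj = (Xw - Xw.mean(axis=0)) @ v
        keep = np.argsort(proj ** 2)[: int(len(Xw) * (1 - frac))]
        Xw = Xw[keep]
    return top_pc(Xw)

v_plain = top_pc(X)          # pulled toward the outlier direction
v_robust = trimmed_top_pc(X)
```

Here less than 10% corruption already flips the plain top component to the outlier axis; HR-PCA is designed to tolerate far heavier contamination with provable guarantees, which this naive trimming does not have.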

November 1, 2010 
Emmanuel Abbe
Federal Polytechnic School of Lausanne, Switzerland 
A polarization approach to compressed sensing
[abstract] In 2008, a technique called 'polarization' made it possible to solve a problem that had been open since Shannon's 1948 work: the construction of low-complexity codes that are provably capacity-achieving. The polarization idea can be explained on the basis of a rather general probabilistic phenomenon: using the so-called polar transform, one can separate an ergodic process into two sub-processes of maximal and minimal entropy (fair coins and constants), and this procedure can be done at low computational cost. In this talk, we will use the idea behind polarization not for channel coding, but to propose a new approach to compressed sensing. With this approach, the measurement matrix has the attribute of being deterministic, whereas the signal is assumed to be statistically sparse. The overall scheme is shown to have low complexity, and the reconstruction algorithm is based on algebraic arguments rather than l1-minimization.
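
For intuition about polarization (our illustration; the talk's construction is for compressed sensing, not this toy), the binary erasure channel admits a one-line recursion: one polarization step sends an erasure parameter z to the pair (2z - z^2, z^2), preserving the mean while pushing values toward 0 and 1:

```python
# Polarization of erasure probabilities for the binary erasure channel (BEC).
# One step maps erasure parameter z to the pair (2z - z^2, z^2); the mean is
# preserved exactly, while the values drift toward the extremes 0 and 1.
def polarize(zs):
    out = []
    for z in zs:
        out.append(2 * z - z * z)  # "minus" (worse) synthetic channel
        out.append(z * z)          # "plus" (better) synthetic channel
    return out

zs = [0.5]                 # start from BEC(0.5)
for _ in range(10):        # 10 levels -> 1024 synthetic channels
    zs = polarize(zs)

mean_z = sum(zs) / len(zs)
frac_polarized = sum(1 for z in zs if z < 0.05 or z > 0.95) / len(zs)
```

The conserved mean reflects conservation of capacity; the growing polarized fraction is the "fair coins and constants" separation that the abstract describes for general ergodic processes.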

November 8, 2010 
David Banks
http://www.stat.duke.edu/~banks/ 
Adversarial Risk Analysis: Bayesian Methods in Game Theory
[abstract] Classical game theory has been an unreasonable description of human behavior, and traditional analyses make strong assumptions about common knowledge and fixed payoffs. Classical risk analysis has assumed that the opponent is non-adversarial (i.e., "Nature") and is thus inapplicable to many situations. This work explores Bayesian approaches to adversarial risk analysis, in which each opponent must model the decision process of the other, but there is the opportunity to use human judgment and subjective distributions. The approach is illustrated in the analysis of two important applications: sealed-bid auctions and simple poker; some related work on counter-bioterrorism is also covered. The results in these three applications are interestingly different from those found from a minimax perspective.

November 15, 2010 
Ping Ma
University of Illinois at Urbana-Champaign 
Imaging the Earth's Deep Interior: a statistical perspective
[abstract] At a depth of 2890 km, the core-mantle boundary (CMB) separates turbulent flow of liquid metals in the outer core from slowly convecting, highly viscous mantle silicates. The CMB marks the most dramatic change in dynamic processes and material properties in our planet, and accurate images of the structure at or near the CMB, over large areas, are crucially important for our understanding of present-day geodynamical processes and the thermochemical structure and history of the mantle and mantle-core system. In addition to mapping the CMB, we need to know if other structures exist directly above or below it, what they look like, and what they mean in terms of physical and chemical material properties and geodynamical processes. Detection, imaging, characterization, and understanding of structure in this remote region have been, and are likely to remain, a frontier in cross-disciplinary geophysics research. I will discuss the statistical problems, challenges, and methods in imaging the CMB.

November 22, 2010 
No Seminar (Fall Recess)


November 29, 2010 
Ankur Moitra
MIT 
Efficiently Learning Mixtures of Gaussians
[abstract] Given data drawn from a mixture of multivariate Gaussians, a basic problem is to accurately estimate the mixture parameters. We provide a polynomial-time algorithm for this problem for any fixed number ($k$) of Gaussians in $n$ dimensions (even if they overlap), with provably minimal assumptions on the Gaussians and polynomial data requirements. In statistical terms, our estimator converges at an inverse polynomial rate, and no such estimator (even exponential time) was known for this problem (even in one dimension, restricted to two Gaussians). Our algorithm reduces the $n$-dimensional problem to the one-dimensional problem, where the method of moments is applied. As a corollary, we are able to give the first polynomial-time algorithm for density estimation for mixtures of $k$ Gaussians without any assumptions. This talk will be based on two papers (Kalai, Moitra, Valiant, STOC 2010) and (Moitra, Valiant, FOCS 2010), the first of which handles the case of mixtures of two Gaussians, and the latter of which generalizes the approach to mixtures of any fixed number of Gaussians. A major technical hurdle in the first paper is proving that noisy estimates of the first $4k-2$ moments of a univariate mixture of $k$ Gaussians suffice to recover accurate estimates of the mixture parameters, as conjectured by Pearson (1894), and in fact these estimates converge at an inverse polynomial rate. For mixtures of more than two Gaussians, pathological scenarios can arise when projecting down to a single dimension. Consequently, the major challenge in the second paper concerns how to leverage a univariate algorithm with weaker guarantees to still yield an efficient learning algorithm in higher dimensions. Lastly, while the running time and data requirements of our algorithm depend exponentially on the number of Gaussians in the mixture, we prove that such a dependence is necessary. This is joint work with Adam Tauman Kalai and Gregory Valiant. 
This work appears as "Efficiently Learning Mixtures of Two Gaussians" (STOC 2010) and "Settling The Polynomial Learnability of Mixtures of Gaussians" (FOCS 2010). 
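
The univariate method of moments underlying these results rests on being able to compute mixture moments from parameters. A minimal sketch (the Gaussian raw-moment recursion M_k = mu*M_{k-1} + (k-1)*sigma^2*M_{k-2} is standard; the toy parameters and the Monte Carlo check are ours):

```python
import numpy as np

def gaussian_raw_moments(mu, sigma, kmax):
    """Raw moments E[X^k], k = 0..kmax, of N(mu, sigma^2) via the recursion
    M_k = mu*M_{k-1} + (k-1)*sigma^2*M_{k-2}."""
    m = [1.0, mu]
    for k in range(2, kmax + 1):
        m.append(mu * m[k - 1] + (k - 1) * sigma ** 2 * m[k - 2])
    return np.array(m[: kmax + 1])

def mixture_raw_moments(w, mus, sigmas, kmax):
    """Moments of a mixture are the weighted sums of component moments."""
    return sum(wi * gaussian_raw_moments(mi, si, kmax)
               for wi, mi, si in zip(w, mus, sigmas))

# Two-component mixture: for k = 2, the method needs the first 4k - 2 = 6 moments.
w, mus, sigmas = [0.4, 0.6], [-1.0, 2.0], [1.0, 0.5]
analytic = mixture_raw_moments(w, mus, sigmas, 6)

# Monte Carlo check of the analytic moments.
rng = np.random.default_rng(1)
comp = rng.choice(2, size=500_000, p=w)
x = rng.normal(np.array(mus)[comp], np.array(sigmas)[comp])
empirical = np.array([np.mean(x ** k) for k in range(7)])
```

Inverting this parameter-to-moment map from noisy empirical moments (Pearson's program) is the hard part that the first paper analyzes; the sketch above only covers the easy forward direction.
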
December 6, 2010 
Kavita Ramanan
Brown University 
Phase Transitions for the Multi-State Hard Core Model on a Tree
[abstract] The hard core model is a well-studied stochastic model with "hard constraints" that arises in statistical physics, combinatorics and stochastic networks. We consider generalizations of the hard core model on a tree, in which each vertex lies in any of C+1 states, subject to the constraint that the sum of the states of any two neighboring vertices does not exceed C. We characterize the phase transition region for this model, and identify an interesting dependence on the parity of C. We also discuss extensions of this model and implications of this analysis for certain loss network models arising in telecommunications.
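
To make the hard constraint concrete (a toy of ours, not part of the talk), here is a dynamic-programming count of admissible configurations on a path, i.e., a tree without branching, where each vertex takes a state in 0..C and neighboring states must sum to at most C:

```python
from itertools import product

def count_path_dp(n, C):
    """Count assignments of states 0..C to the vertices of an n-vertex path
    such that every pair of neighboring states sums to at most C."""
    counts = [1] * (C + 1)           # configurations ending in each state
    for _ in range(n - 1):
        # a new vertex in state s may follow any state t with t + s <= C
        counts = [sum(counts[: C - s + 1]) for s in range(C + 1)]
    return sum(counts)

def count_path_brute(n, C):
    """Brute-force check by enumerating all (C+1)^n configurations."""
    return sum(1 for cfg in product(range(C + 1), repeat=n)
               if all(a + b <= C for a, b in zip(cfg, cfg[1:])))
```

For C = 1 this reduces to the ordinary hard core (independent set) model, and the path counts follow the Fibonacci recursion; the phase-transition questions in the talk concern the analogous recursions on trees with branching.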

December 13, 2010 
No Seminar (Winter Recess)


December 20, 2010 
No Seminar (Winter Recess)


December 27, 2010 
No Seminar (Winter Recess)


January 3, 2011 


January 10, 2011 
Sayan Mukherjee
Duke University 
Geometry and Topology in Statistical Inference
[abstract] We use two problems to illustrate the utility of geometry and topology in statistical inference: supervised dimension reduction (SDR), and inference of (hyper)graph models. I will also show two slides, containing only pictures, illustrating the problem of inference of stratified spaces. We start with a "tale of two manifolds." The focus is on the problem of supervised dimension reduction (SDR). We first formulate the problem with respect to the inference of a geometric property of the data: the gradient of the regression function with respect to the manifold that supports the marginal distribution. We provide an estimation algorithm, prove consistency, and explain why the gradient is salient for dimension reduction. We then reformulate SDR in a probabilistic framework and propose a Bayesian model, a mixture of inverse regressions. In this modeling framework the Grassmann manifold plays a prominent role. The second part of the talk develops a parameterization of hypergraphs based on the geometry of points in d dimensions. Informative prior distributions on hypergraphs are induced through this parameterization by priors on point configurations via spatial processes. The approach combines tools from computational geometry and topology with spatial processes, and offers greater control over the distribution of graph features than Erdos-Renyi random graphs. I will close with two slides that pictorially describe the problem of inferring Whitney stratified spaces. Consider two intersecting planes in 3 dimensions and draw n points i.i.d. from this object. Can we infer which points belong to which plane and which points belong to the line defined by the intersection?
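
The gradient-based view of SDR can be caricatured with a gradient outer product estimator (a simplified sketch of ours, not the talk's estimator): estimate local gradients by local linear regression, then take the top eigenvectors of their averaged outer products as the dimension-reduction directions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Single-index model: y depends on x only through the direction b.
n, d = 500, 4
b = np.array([1.0, -1.0, 0.0, 0.0]) / np.sqrt(2)
X = rng.normal(size=(n, d))
y = np.tanh(X @ b) + 0.1 * rng.normal(size=n)

def gradient_outer_product(X, y, k=50):
    """Estimate E[grad f grad f^T]: fit a local linear regression over the k
    nearest neighbors of each point and average the gradient outer products."""
    n, d = X.shape
    G = np.zeros((d, d))
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        idx = np.argsort(dist)[:k]
        Z = np.hstack([np.ones((k, 1)), X[idx] - X[i]])   # local design matrix
        beta, *_ = np.linalg.lstsq(Z, y[idx], rcond=None)
        g = beta[1:]                                      # local gradient estimate
        G += np.outer(g, g) / n
    return G

G = gradient_outer_product(X, y)
evals, evecs = np.linalg.eigh(G)
b_hat = evecs[:, -1]     # top eigenvector spans the estimated SDR direction
```

Since the regression function varies only along b, every local gradient points (up to noise) along b, so the averaged outer product is approximately rank one with leading eigenvector near b.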

January 17, 2011 
No Seminar


January 24, 2011 
Wei Pan
University of Minnesota 
Why ignore correlations: applications to genetic association analysis
[abstract] An important problem in genetic analysis is to test disease association with multiple genetic markers in a candidate region, for which the corresponding statistical formulation is familiar: we test multiple regression coefficients in a logistic regression model. However, the most popular Wald (or score or likelihood ratio) test may not be powerful, even for relatively "low-dimensional, high-sample-size" SNP data. In contrast to the Wald (or score) test, if we ignore correlations among the parameter estimates (or score components) and do not use their covariance matrix, the resulting test (called the SSB or SSU test) may have higher power. Interestingly, the SSB or SSU test is closely related to two other nonparametric methods recently proposed for genomic data: genomic distance-based regression and kernel machine regression. Numerical examples will be provided to illustrate their applications to genetic association analysis of common variants and rare variants.
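
A minimal sketch of the contrast between the two statistics (our toy, with permutation p-values standing in for the asymptotic null distributions): the score test uses U'V^{-1}U with V the null covariance of the score vector U, while the SSU test simply uses U'U, ignoring that covariance:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: 10 correlated "SNPs" with many small effects on a binary trait.
n, p = 300, 10
shared = rng.normal(size=(n, 1))
X = (shared + rng.normal(size=(n, p))) / np.sqrt(2)   # pairwise correlation ~ 0.5
logit = 0.15 * X.sum(axis=1)
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(float)

def score_components(X, y):
    """Score vector for the marker coefficients at the null (intercept only)."""
    return X.T @ (y - y.mean())

def ssu_stat(X, y):
    U = score_components(X, y)
    return U @ U                     # sum of squared scores: covariance ignored

def score_stat(X, y):
    U = score_components(X, y)
    Xc = X - X.mean(axis=0)
    V = y.var() * (Xc.T @ Xc)        # null covariance of U (up to small-sample factors)
    return U @ np.linalg.solve(V, U)

def perm_pvalue(stat_fn, X, y, n_perm=500):
    obs = stat_fn(X, y)
    perms = [stat_fn(X, rng.permutation(y)) for _ in range(n_perm)]
    return (1 + sum(s >= obs for s in perms)) / (1 + n_perm)

ssu_p = perm_pvalue(ssu_stat, X, y)
score_p = perm_pvalue(score_stat, X, y)
```

When the effects line up with the dominant correlation direction of the markers, as here, the SSU statistic concentrates the signal rather than whitening it away, which is one intuition for the power results in the talk.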

January 31, 2011 
Xiaole Liu
Harvard University 
Computational Genomics of Gene Regulation
[abstract] High-throughput genomics technologies, such as gene expression microarrays, tiling microarrays, and massively parallel sequencing, have drastically accelerated the pace of biomedical research and discovery. However, they have also created challenges for bioinformatic data analysis. I will introduce our work on trying to understand gene regulation through transcription factor motif discovery, ChIP-chip and ChIP-seq data analysis, and epigenomic studies. I will also discuss our recent work where we use nucleosome-resolution histone mark ChIP-seq data to infer the transcription factors driving a biological process and their in vivo binding sites, and show how the method is applied to understand prostate cancer and gut development.

February 7, 2011 
Matthew Stephens
University of Chicago 
A unified framework for testing multiple phenotypes for association with genetic variants
[abstract] In many ongoing genome-wide association studies, multiple related phenotypes are available for testing for association with genetic variants. In most cases, however, these related phenotypes are analysed independently from one another. For example, several studies have measured multiple lipid-related phenotypes, such as LDL-cholesterol, HDL-cholesterol, and triglycerides, but in most cases the primary analysis has been a simple univariate scan for each phenotype. This type of univariate analysis fails to make full use of potentially rich phenotypic data. While this observation is in some sense obvious, much less obvious is the right way to go about examining associations with multiple phenotypes. Common existing approaches include the use of methods such as MANOVA, canonical correlations, or Principal Components Analysis to identify linear combinations of outcomes that are associated with genetic variants. However, if such methods give a significant result, these associations are not always easy to interpret. Indeed, the usual approach to explaining observed multivariate associations is to revert to univariate tests, which seems far from ideal. In this work we outline an approach to dealing with multiple phenotypes based on Bayesian multivariate regression. The method attempts to identify which subset of phenotypes is associated with a given genotype. In this way it incorporates the null model (no phenotypes associated with genotype), the simple univariate alternative (only one phenotype associated with genotype), and the general alternative (all phenotypes associated with genotype) into a single unified framework. In particular, our approach both tests for and explains multivariate associations within a single model, avoiding the need to resort to univariate tests when explaining and interpreting significant multivariate findings. 
We illustrate the approach on examples, and show how, when combined with multiple phenotype data, the method can improve both power and interpretation of association analyses.
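
The subset-selection idea can be caricatured with a BIC-based approximation to the posterior over phenotype subsets (a crude stand-in of ours for the actual Bayesian multivariate regression; independent Gaussian errors and a uniform prior over subsets are simplifying assumptions):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)

# Toy data: 3 phenotypes, genotype affects only phenotype 0.
n = 400
g = rng.choice([0.0, 1.0, 2.0], size=n, p=[0.25, 0.5, 0.25])
Y = rng.normal(size=(n, 3))
Y[:, 0] += 0.5 * g

def bic_for_subset(Y, g, subset):
    """BIC of the model in which only the phenotypes in `subset` depend on g
    (independent Gaussian errors, fitted by least squares)."""
    n, q = Y.shape
    ll, n_params = 0.0, 0
    for j in range(q):
        if j in subset:
            Z = np.column_stack([np.ones(n), g]); n_params += 3   # intercept, slope, variance
        else:
            Z = np.ones((n, 1)); n_params += 2                    # intercept, variance
        resid = Y[:, j] - Z @ np.linalg.lstsq(Z, Y[:, j], rcond=None)[0]
        s2 = resid @ resid / n
        ll += -0.5 * n * (np.log(2 * np.pi * s2) + 1)
    return -2 * ll + n_params * np.log(n)

subsets = [s for r in range(4) for s in combinations(range(3), r)]
bics = np.array([bic_for_subset(Y, g, set(s)) for s in subsets])
post = np.exp(-(bics - bics.min()) / 2)
post /= post.sum()                       # approximate posterior over subsets
best = subsets[int(np.argmax(post))]
```

The empty subset plays the role of the null model, the singletons the univariate alternatives, and the full set the general alternative, so testing and explanation happen within a single model comparison, as the abstract describes.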

February 14, 2011 
Haiyan Huang
University of California, Berkeley 
Transforming Public Gene Expression Repositories into Disease Diagnosis Databases
[abstract] The rapid accumulation of gene expression data has offered unprecedented opportunities to study human diseases. The NCBI Gene Expression Omnibus is currently the largest database that systematically documents the genome-wide molecular basis of diseases. In this talk, I will introduce our study on transforming a public gene expression repository, particularly NCBI GEO, into an automated disease-diagnosis database. Relevant computational and statistical issues and challenges will be discussed, e.g., standardizing cross-platform gene expression data and heterogeneous disease annotations, and developing a two-stage Bayesian learning approach to achieve automated disease diagnosis under the formulation of hierarchical multiple-label classification.

February 21, 2011 
Cheng Li
Harvard University 
Does aneuploidy cause cancer? Can genomics data modeling help explain?
[abstract] A common type of aneuploidy is Down's syndrome, where a one-copy gain of chromosome 21 can lead to many symptoms and a higher risk of cancer. Cancer cells frequently harbor an aneuploid genome with gains or losses of large chromosome regions or entire chromosomes that affect the expression of hundreds of genes. The aneuploid patterns are recurrent in a cancer type and correlate with patient response and prognosis. I will introduce various hypotheses about the relationship between aneuploidy and cancer, recent biological experiments generating new hypotheses, and how genomics data such as expression and copy number profiling, combined with statistical and bioinformatic models, may help shed light on the debate.

Cheng Li's biography: Dr. Cheng Li received his B.S. degree in computer science in 1995 at Beijing Normal University, and his Ph.D. degree in statistics in 2001 at the University of California at Los Angeles. He joined the Department of Biostatistics of Harvard School of Public Health and Dana-Farber Cancer Institute as an assistant professor in 2002 and became an associate professor in 2008. He has developed many novel gene expression and SNP microarray analysis and visualization methods, and has implemented and maintained the widely used genomics analysis software dChip, which has been cited 1800 times. His current interests are how genomic changes in the cell promote the initiation and progression of cancer and neurological disorders, and classifying diseases for prognosis. See www.ChengLiLab.org for more information. 
February 28, 2011 
Yu Zhang
Penn State University 
Fast and Accurate False Positive Control in Genome-wide Association Studies
[abstract] Genome-wide association studies routinely test hundreds of thousands or millions of genetic markers simultaneously. Adjustment of the p-values of individual tests is necessary to reduce false positive findings; this is known as the multiple-comparison problem. Current practices rely on either Bonferroni corrections or permutations to evaluate the genome-wide significance of associations. The Bonferroni method is overly conservative due to the strong dependence between genetic markers, which is particularly problematic for testing high-density markers and markers in overlapping windows. Bonferroni correction also has a significant impact on false discovery rate (FDR) procedures. The permutation test, on the other hand, is computationally too expensive for large studies involving millions of comparisons or many thousands of individuals. We propose a new method for adjusting multiple correlated comparisons that is accurate and extremely fast. The method produces accurate p-value adjustments in almost constant time, irrespective of the number of tests, the sample size, and the scale of the p-values. The method can also be easily incorporated into FDR control procedures. We introduce a new FDR control method that produces much more reasonable results than conventional methods in GWAS. We further generalize the method to conditional tests, such that biological prior knowledge of the distribution of disease genes can be incorporated to improve the sensitivity and the specificity of disease association mapping.
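
A small illustration of why Bonferroni is conservative under dependence (a generic Monte Carlo minP adjustment, our toy, not the speaker's fast method): with strongly correlated markers, the resampling-adjusted p-value of the minimum p sits well below the Bonferroni bound:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(5)

m, rho = 100, 0.8                        # strongly correlated markers

def draw_null_z(n_draws):
    """Equicorrelated null z-scores (a crude stand-in for correlated SNP tests)."""
    shared = rng.normal(size=(n_draws, 1))
    return sqrt(rho) * shared + sqrt(1 - rho) * rng.normal(size=(n_draws, m))

def two_sided_p(z):
    return np.array([2 * (1 - 0.5 * (1 + erf(abs(v) / sqrt(2)))) for v in z])

z_obs = draw_null_z(1)[0]
z_obs[0] = 3.0                           # plant one modest signal
p_min = two_sided_p(z_obs).min()

bonferroni = min(1.0, m * p_min)

# Monte Carlo minP adjustment: how often does the null minimum p beat p_min?
null_min_p = np.array([two_sided_p(z).min() for z in draw_null_z(2000)])
minp_adjusted = (1 + np.sum(null_min_p <= p_min)) / (1 + 2000)
```

Because the dependence shrinks the effective number of independent tests, the minP-adjusted p-value is much smaller than m times the raw minimum; the speaker's method aims to recover this kind of accuracy without the resampling cost.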

March 7, 2011 
Spring Recess


March 14, 2011 
Spring Recess


March 21, 2011 
Adalbert Wilhelm
Jacobs University, School of Humanities and Social Sciences 
A dual-layered linkage system for combining visual and text-based queries for image analysis 
March 28, 2011 
Shili Lin
The Ohio State University 
Likelihood Approach for Detecting Imprinting and Maternal Effects
[abstract] Genomic imprinting and maternal effects are two epigenetic factors that have been increasingly explored for their roles in the etiology of complex diseases. This is part of a concerted effort to find the "missing heritability". Accordingly, statistical methods have been proposed to detect imprinting and maternal effects simultaneously, based on either a case-parents triads design or a case-mother/control-mother pairs design. However, these methods are not amenable to extended families, which are commonly recruited in family-based studies. Further, existing methods are full-likelihood based and have to make strong assumptions concerning mating type probabilities (nuisance parameters) to avoid overparametrization. In this talk, I will focus on Likelihood approaches for detecting Imprinting and Maternal Effects (LIME) using family data. In particular, I will discuss LIMEped, which uses extended pedigrees from prospective family-based association studies without the Hardy-Weinberg equilibrium assumption by introducing a novel concept called the "conditional mating type" between marry-in founders and their non-founder spouses. I will also discuss LIMEmix, which augments the two popular study designs noted above by combining them and including control-parents triads, so that our sample may contain a mixture of case-parents/control-parents triads and case-mother/control-mother pairs. By matching the case families with control families of the same structure and stratifying according to the familial genotypes, we are able to derive a partial likelihood that is free of the nuisance parameters. This renders strong assumptions unnecessary and leads to a robust procedure without sacrificing power. I will show simulation results to illustrate the power gain of LIMEped from using extended pedigrees and demonstrate the robustness of LIMEmix under a variety of settings.

April 4, 2011 
CANCELLED


April 11, 2011 
Michael Epstein
Emory University, Department of Human Genetics and Biostatistics 
Correcting for population stratification in case-control studies of rare genetic variation
[abstract] Recent advances in next-generation sequencing technology have enabled investigators to assess the role of rare genetic variation in the origins of complex human diseases. Within case-control resequencing studies, investigators typically test for association between rare variants and disease using burden tests that collapse sets of rare variants within a gene or region into a composite variable prior to association testing with disease. An open issue with resequencing studies, and burden association tests in particular, is their validity in the presence of confounding due to population stratification. Such confounding will arise when genetic variation is correlated with variation in disease risk across latent subpopulations or geographic gradients. In this talk, I describe the use of a measure called the stratification score (defined as the odds of disease given confounders) to resolve confounding due to population stratification in case-control resequencing studies. I first show how one can use the stratification score to choose a subset of subjects for resequencing from a larger GWAS sample who are well matched on genetic ancestry. Next, I describe how one can use the stratification score to adjust existing burden tests (many of which rely on statistical frameworks that do not allow for covariates) for population stratification. We illustrate our approaches using both simulated and real data from an existing study of schizophrenia. This is joint work with Drs. Glen Satten and Andrew Allen.

April 18, 2011 
CANCELLED


April 25, 2011 
Arne Bathke
University of Kentucky, Statistics 
Nonparametric Methods for Multivariate Data and Repeated Measures Designs
[abstract] Data obtained through observational or experimental studies, for example in the life sciences or social sciences, are often intrinsically multivariate because several response variables are measured on the same experimental unit (multiple endpoints). We present new nonparametric methods for statistical inference based on such data. The nonparametric approach does not need the assumption of normality, and it has the advantage that it can handle quantitative, as well as ordinal response variables, or a mixture of both. Furthermore, the proposed tests are invariant under monotone transformations of the original variables. We will present asymptotic results for different situations, supplemented by results from simulation studies, as well as the analysis of a data example.

May 2, 2011 
Daniel Yekutieli
Tel Aviv University, Department of Statistics and Operations Research 
Bayesian selective inference
[abstract] The term selective inference refers to marginal statistical inferences that are provided for parameters selected after viewing the data, where the selected parameters are typically the "significant" findings of a multiple testing procedure. I will discuss selective inference from a Bayesian perspective. I will show that if the parameter is assigned a noninformative prior, or if it is a "fixed" unknown constant, then it is necessary to adjust the Bayesian inference for selection. I will present a Bayesian framework for providing inference for selected parameters, together with Bayesian False Discovery Rate controlling methodology that generalizes existing Bayesian FDR methods, which are only defined in the two-group mixture model. I will illustrate the results by applying them to simulated data and to data from a microarray experiment.
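
In the two-group mixture model that existing Bayesian FDR methods are built on, the key quantities are simple to compute when the mixture is known. A toy sketch of ours (local fdr as the posterior null probability of a z-score, and Bayesian FDR of a selection rule as its average over the selected set):

```python
import numpy as np

rng = np.random.default_rng(6)

# Two-group model: z ~ p0*N(0,1) + (1-p0)*N(3,1), with the mixture known.
p0, mu1 = 0.9, 3.0
n = 200_000
is_null = rng.uniform(size=n) < p0
z = np.where(is_null, rng.normal(size=n), rng.normal(loc=mu1, size=n))

def phi(x, mu=0.0):
    """Standard-width Gaussian density."""
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

# Local fdr(z) = P(null | z), computed from the known mixture.
lfdr = p0 * phi(z) / (p0 * phi(z) + (1 - p0) * phi(z, mu1))

# Bayesian FDR of the selection rule z > 2.5: average local fdr over selected.
selected = z > 2.5
bfdr = lfdr[selected].mean()
realized_fdp = is_null[selected].mean()   # actual false discovery proportion
```

The Bayesian FDR of the selected set tracks the realized false discovery proportion; the talk's contribution concerns how such inferences must be adjusted when the prior is noninformative or the parameter is a fixed unknown constant, which this known-mixture toy sidesteps.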
