Date  Speaker  Seminar Title 

Aug. 19, 2011 
Mark Hansen
UCLA 
Repetition and surprise, rehearsal and reinvention
[abstract]
For the last decade, I have had the privilege of collaborating on a number of public artworks that draw on dynamic data sources. In my talk I will describe two of these pieces. The first is called Moveable Type and is a permanent installation in the lobby of the New York Times building in midtown Manhattan. This work is designed to represent the activity taking place around the Times' content and draws on a feed of the Times' news stories, an hourly dump of their web access and search logs (a sample, suitably anonymized), and the complete archive back to 1851. The second piece I will present is Shuffle, a performance by the Elevator Repair Service that was created for the New York Public Library's centennial celebration in June of this year. This work was designed to be a mixing or reinterpretation of the material from the last three ERS tours, classic works by Faulkner, Fitzgerald and Hemingway. If there is time, I will also briefly present designs for new artwork for the 9/11 Memorial Museum and the NYU Law School.
In terms of statistical practice, I see these collaborations as waypoints in an expanded field of data analysis. They present complex data to the public, but in nonstandard venues and with novel presentation techniques. These artworks have, in turn, shaped my views on the role of data, its collection and analysis, by the general public. If there is time, I will also present some of the curricular work I have helped develop for the Los Angeles Unified School District. Specifically, I will discuss a new, NSF-funded program for high school students that introduces data analysis in the context of a yearlong course in computer science.
Related links:
Moveable Type: http://www.nytimes.com/2007/10/25/arts/design/25vide.html
Shuffle: http://www.nytimes.com/2011/05/24/theater/elevator-repair-service-performs-at-new-york-public-library.html
Sept. 12, 2011 
Ramon van Handel
Princeton University 
The universal Glivenko-Cantelli property
[abstract]
Uniform laws of large numbers (ULLN) are basic tools in probability and statistics. Classes of functions for which the ULLN holds for a given probability measure (Glivenko-Cantelli classes) or uniformly with respect to all probability measures (uniform Glivenko-Cantelli classes) were characterized by Vapnik and Chervonenkis and by Talagrand. However, classes for which the ULLN holds for every probability measure (the universal Glivenko-Cantelli classes) are much more poorly understood. In this talk I will show how, under some regularity assumptions, universal Glivenko-Cantelli classes can be characterized in terms of certain geometric and combinatorial properties. A surprising consequence is that the ULLN holds universally in this setting if and only if the same is true for the uniform ergodic theorem or for uniform reverse martingale convergence, extending their applicability substantially beyond the i.i.d. setting inherent in the definition. I will also discuss several unusual counterexamples that highlight the limitations and difficulties of trying to characterize the universal Glivenko-Cantelli property.
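As a quick aside for readers, the classical Glivenko-Cantelli phenomenon that the talk generalizes can be seen in a few lines: the sup-norm distance between an empirical CDF and the true CDF shrinks as the sample grows. This toy sketch (the function name and sample sizes are illustrative, not from the talk) uses Uniform(0,1) data, for which the true CDF is F(x) = x.

```python
import random

def empirical_sup_deviation(n, seed=0):
    """Sup-norm distance between the empirical CDF of n Uniform(0,1)
    draws and the true CDF F(x) = x, evaluated at the jump points."""
    rng = random.Random(seed)
    xs = sorted(rng.random() for _ in range(n))
    # At the i-th order statistic the empirical CDF jumps from (i-1)/n
    # to i/n, so the supremum is attained at one of these two values.
    return max(max(abs(i / n - x), abs((i - 1) / n - x))
               for i, x in enumerate(xs, start=1))

for n in (100, 10_000):
    print(n, round(empirical_sup_deviation(n), 4))
```

The deviation decays at the familiar O(1/sqrt(n)) rate; the universal question in the talk is when such uniform convergence holds simultaneously over a whole class of functions and every underlying measure.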

Sept. 19, 2011 
Guantao Chen
Georgia State University 
Finding Long Cycles in 3-connected Graphs 
Sept. 26, 2011 
Liming Cai
University of Georgia 
Shannon's Entropy Measuring of RNA Secondary Structure Over Stochastic Grammar Ensembles
[abstract]
Shannon's entropy measures the fold certainty (i.e., structural variation) of any given RNA sequence over a defined secondary structure ensemble. However, since the thermodynamic scoring scheme built into the Boltzmann ensemble is not normalized, derivations for the structural entropy have not been available. In this presentation, we derive Shannon's entropy of RNA secondary structure over stochastic context-free grammar (SCFG) ensembles that have well-defined probability distributions. Being reconfigurable, SCFGs can incorporate constraints preferred by tertiary folding and make it possible to effectively distinguish non-coding RNA sequences from random sequences by Shannon's entropy. In addition, we derive Shannon's entropy of SCFG ensembles without the presence of RNA sequences and show that this entropy actually measures the average length of RNA sequences within such an ensemble. Potential applications of this research include non-coding RNA gene finding and annotation on genome sequences.
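For readers unfamiliar with the entropy measure involved, here is a minimal sketch of Shannon's entropy over a normalized ensemble, with made-up structure probabilities (the numbers are purely hypothetical, not from the talk): low entropy means the fold is nearly certain, high entropy means large structural variation.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a normalized distribution over
    candidate secondary structures for one sequence."""
    assert abs(sum(probs) - 1.0) < 1e-9, "distribution must be normalized"
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical ensembles: one dominant fold vs. five near-equal alternatives.
confident = [0.9, 0.025, 0.025, 0.025, 0.025]
uncertain = [0.2] * 5
print(round(shannon_entropy(confident), 3))  # low: the fold is nearly certain
print(round(shannon_entropy(uncertain), 3))  # high: log2(5), maximal for 5 folds
```

The abstract's point is that this quantity is only well-defined once the ensemble has a proper probability distribution, which SCFG ensembles provide and the unnormalized Boltzmann scoring does not.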

Oct. 3, 2011 
Qing Zhou
UCLA 
The Multi-Domain Sampler and Its Applications
[abstract]
When a posterior distribution has multiple modes, unconditional expectations, such as the posterior mean, may not offer informative summaries of the distribution. Motivated by this problem, I propose to decompose the sample space of a multimodal distribution into domains of attraction of local modes. Domain-based representations are defined to summarize the probability masses of and conditional expectations on domains of attraction, which are much more informative than the mean and other unconditional expectations. A computational method, the multi-domain sampler, is developed to construct domain-based representations for an arbitrary multimodal distribution. The effectiveness of the multi-domain sampler is demonstrated by applications to structural learning of protein-signaling networks from single-cell data and construction of energy landscapes of the Sherrington-Kirkpatrick spin glasses.
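The decomposition into domains of attraction can be illustrated with a one-dimensional toy sketch (this is not the actual multi-domain sampler; the mixture, step size, and function names are all illustrative): assign each sample to the local mode reached by gradient ascent on the log density, then estimate each domain's probability mass by the fraction of samples attracted to it.

```python
import math
import random

def log_density_grad(x):
    """d/dx log f(x) for an equal-weight mixture of N(-2,1) and N(2,1)."""
    a = math.exp(-0.5 * (x + 2) ** 2)
    b = math.exp(-0.5 * (x - 2) ** 2)
    return (-(x + 2) * a - (x - 2) * b) / (a + b)

def attracting_mode(x, step=0.1, iters=200):
    """Follow the gradient of log f from x up to its local mode."""
    for _ in range(iters):
        x += step * log_density_grad(x)
    return x

rng = random.Random(1)
samples = [rng.gauss(-2 if rng.random() < 0.5 else 2, 1) for _ in range(2000)]
# Domain mass = fraction of samples attracted to the left mode (near -2).
mass_left = sum(attracting_mode(x) < 0 for x in samples) / len(samples)
print(round(mass_left, 2))  # ≈ 0.5 by symmetry
```

The conditional expectations within each domain (e.g., the mean of the samples attracted to each mode) would then summarize the distribution far better than the overall mean, which here sits near 0 where the density is low.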

Oct. 10, 2011 
Ming Yuan Cancelled (substitute speaker Pengsheng Ji)

Sharp Adaptive Nonparametric Hypothesis Testing for Sobolev Ellipsoids 
Oct. 17, 2011 
Zhaohui Qin
Emory University 
Model-based methods for analyzing NGS data
[abstract]
Next-generation sequencing (NGS) technologies have been rapidly adopted in an array of diverse applications. Although extremely promising, the massive amount of data generated by NGS, together with substantial biases and correlation, poses daunting challenges for data analysis. By treating observed data as random samples from probability distributions, model-based methods can accommodate uncertainties explicitly and also lead automatically to rigorous statistical inference. Inspired by the success of model-based methods in the analysis of other high-throughput genomics data such as microarrays, we attempted to develop novel model-based methods to analyze data generated from new NGS-based experiments. RNA sequencing (RNA-seq) is a powerful new technology for mapping and quantifying the transcriptome. We propose a spatial model-based method named POME to characterize base-level read coverage within each exon. The underlying expression level is included as a key parameter in this model, and large base-specific variations and between-base correlations are also taken into account. Analyses of simulated and real data demonstrated significant improvements of POME over existing approaches. I will also discuss how model-based methods can help other applications of NGS. This is joint work with Ming Hu, Michael Zhu and Jun Liu.

Oct. 24, 2011 
Ying Xu
University of Georgia 
Genomic location is information: computational elucidation of bacterial genomic structures
[abstract]
We have recently discovered that the genomic locations of genes in bacteria are highly constrained by the cellular processes they are involved in. So, for the first time, we understand that the locations of genes follow both global and local rules. This realization has led to a new paradigm for tackling and solving some very challenging genomic analysis problems. I will discuss this new discovery and a number of applications we are currently pursuing, including assigning genes to pathway holes and complete genome assembly.

Oct. 31, 2011 
GuoCheng Yuan
Harvard 
Prediction of epigenetic patterns from DNA sequences
[abstract]
In a multicellular organism, a single genome is shared by nearly all cell types; yet each cell type expresses a different set of genes. A partial explanation is the fact that only a small portion of the genomic DNA is accessible in any cell type; this accessibility is highly controlled by epigenetic mechanisms. Recently, large amounts of epigenomic data have been generated, providing strong evidence that tissue-specific epigenetic patterns are responsible for controlling the global gene expression required for maintenance of cell identity. However, a fundamental yet unresolved question is how epigenetic patterns are established and maintained. Previous studies have identified a large number of molecular interactions that play a role in regulating epigenetic patterns, and fully dissecting this complex interaction network is a daunting task. As a starting point, we have developed two computational methods to systematically investigate the role of DNA sequences in guiding genome-wide epigenetic patterns. The first method, which we call the N-score model, extracts periodic sequence features by using a wavelet approach. The second method combines multiple sequence features by using Bayesian regression trees. We applied these methods to analyze the genome-wide patterns of various epigenetic marks. We found that a significant proportion of the epigenetic landscape can be explained by DNA sequence information alone. We suggest that the DNA sequence plays at least two distinct roles in mediating epigenetic patterns: 1) a small number of simple features may be recognized by general factors to orchestrate the overall epigenetic variability; and 2) a large number of highly specific features may be recognized by tissue-specific factors to refine the default epigenetic patterns at specific loci.

Nov. 7, 2011 
J. S. Marron
University of North Carolina 
OODA of Tree-Structured Data Objects
[abstract]
The field of Object Oriented Data Analysis has made substantial progress on the statistical analysis of the variation in populations of complex objects. A particularly challenging example of this type is populations of tree-structured objects. Deep challenges arise, which involve a marriage of ideas from statistics, geometry, and numerical analysis, because the space of trees is strongly non-Euclidean in nature. These challenges, together with three completely different approaches to addressing them, are illustrated using a real data example, where each data point is the tree of blood arteries in one person's brain.

Nov. 14, 2011 
David Blei
Princeton 
Online variational inference for scalable approximate posterior inference (with applications to probabilistic topic models)
[abstract]
Probabilistic topic modeling provides a suite of tools for analyzing large collections of documents. Topic modeling algorithms can uncover the underlying themes of a collection and decompose its documents according to those themes. We can use topic models to explore the thematic structure of a corpus and to solve a variety of prediction problems about documents. At the center of a topic model is a hierarchical mixed-membership model, where each document exhibits a shared set of mixture components with individual (per-document) proportions. Our goal is to condition on the observed words of a collection and estimate the posterior distribution of the shared components and per-document proportions. When analyzing modern corpora, this amounts to posterior inference with billions of latent variables. How can we cope with such data? In this talk, I will describe online variational inference for approximating posterior distributions in hierarchical models. In traditional variational inference, we posit a simple family of distributions over the latent variables and try to find the member of that family that is close to the posterior of interest. In online variational inference, we use stochastic optimization to find the closest member of the family, where we obtain noisy estimates of the appropriate gradient by repeatedly subsampling from the data. This approach (along with some information geometric considerations) leads to a scalable variational inference algorithm for massive data sets. I will demonstrate the algorithm with probabilistic topic models fitted to millions of articles. I will further describe two variants, one for mixed-membership community detection in massive social networks and one for Bayesian nonparametric mixed-membership models. I will show how online variational inference can be generalized to many kinds of hierarchical models. Finally, I will highlight several open questions and outstanding issues.
(This is joint work with Francis Bach, Matt Hoffman, John Paisley, and Chong Wang.)
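The core idea of the abstract, obtaining noisy but unbiased gradients by repeatedly subsampling the data, can be illustrated with a deliberately simple sketch. This is ordinary stochastic gradient descent on a toy objective (minimizing the average squared error, whose optimum is the sample mean), not the variational algorithm itself; the function name, batch size, and step-size schedule are illustrative assumptions.

```python
import random

def sgd_mean(data, epochs=50, batch=10, lr0=0.5):
    """Minimize the average of (theta - x_i)^2 using noisy gradients
    computed from random subsamples (minibatches) of the data."""
    rng = random.Random(0)
    theta, t = 0.0, 0
    for _ in range(epochs):
        for _ in range(len(data) // batch):
            t += 1
            sub = rng.sample(data, batch)
            grad = sum(2 * (theta - x) for x in sub) / batch  # unbiased estimate
            theta -= (lr0 / t) * grad  # Robbins-Monro decreasing step sizes
    return theta

rng = random.Random(42)
data = [rng.gauss(3.0, 1.0) for _ in range(1000)]
print(round(sgd_mean(data), 2))  # ≈ sample mean ≈ 3.0
```

Online variational inference applies the same principle to the variational objective of a hierarchical model, with natural-gradient steps ("information geometric considerations") replacing the plain gradient used here.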

Nov. 21, 2011 
FALL RECESS

NO SEMINAR 
Nov. 28, 2011 
Haipeng Shen
University of North Carolina 
Color Independent Component Analysis with an Application to Functional Magnetic Resonance Imaging
[abstract]
Independent component analysis (ICA) is an effective data-driven method for blind source separation. It has been successfully applied to separate source signals of interest from their mixtures. Most existing ICA procedures are carried out by relying solely on the estimation of the marginal density functions. However, in many applications, correlation structures within each source also play an important role besides the marginal distributions. One important such example is functional magnetic resonance imaging (fMRI) analysis, where the brain-function-related signals are temporally correlated. We develop a novel color ICA approach that fully exploits the correlation structures within the sources. Specifically, we propose to estimate the spectral density functions of the source signals instead of their marginal density functions. Our methodology is described and implemented using spectral density functions from common time series models. The time series model parameters and the mixing matrix are estimated via maximizing the Whittle likelihood function. The proposed method is shown to outperform several popular existing methods through simulation studies and a real fMRI application.

Jan. 9, 2012 
Mark Tygert
New York University 
Chi-square and classical exact tests often wildly misreport significance;
the remedy lies in computers
[abstract]
If a discrete probability distribution in a model being tested for goodness-of-fit is not close to uniform, then forming the Pearson chi-square statistic can involve division by nearly zero. This often leads to serious trouble in practice, even in the absence of roundoff errors, as the talk will illustrate via numerous examples. Fortunately, with the now widespread availability of computers, avoiding all the trouble is simple and easy: without the problematic division by nearly zero, the actual values taken by goodness-of-fit statistics are not humanly interpretable, but black-box computer programs can rapidly calculate their precise significance.
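A minimal numeric sketch of the division-by-nearly-zero problem (the cell probabilities below are hypothetical, chosen only to make the effect visible; they are not one of the talk's examples):

```python
def pearson_chisq(observed, expected):
    """Pearson's goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical model: 100 draws from three cells with probabilities
# 0.4999, 0.4999 and 0.0002 -- the last expected count is nearly zero.
expected = [49.99, 49.99, 0.02]
observed = [50, 49, 1]  # a single draw happened to land in the rare cell
stat = pearson_chisq(observed, expected)
print(round(stat, 1))  # dominated by (1 - 0.02)^2 / 0.02 = 48.02
# Yet the chance of even one draw landing in the rare cell is only
# 1 - 0.9998**100, about 2%, so the enormous statistic mainly reflects
# the division by a near-zero expected count, not strong evidence.
print(round(1 - 0.9998 ** 100, 3))
```

Referring the statistic to its nominal chi-square distribution would report overwhelming significance here; simulating the exact distribution of the statistic on a computer, as the talk advocates, gives the honest answer.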

Jan. 23, 2012 
No Speaker


Jan. 30, 2012 
Tianming Liu
University of Georgia 
Connectomics Signatures for Characterization of Brain Conditions
[abstract]
Human connectomes constructed via neuroimaging data offer a complete description of macroscale structural/functional connectivity within the brain. Assessing connectome-wide structural and functional connectivities not only can fundamentally advance our understanding of brain organization and function, but is also of fundamental importance for systematically and comprehensively characterizing many devastating brain conditions. Here, we constructed structural connectomes of 240 brains and assessed the connectome-wide functional connectivity alterations in mild cognitive impairment, schizophrenia and post-traumatic stress disorder, in comparison with their healthy controls. By applying genomic signature discovery approaches, we discovered informative and robust functional connectomics signatures that can distinguish these brain conditions from their healthy controls. Our results suggest that connectomics signatures could be a general, powerful platform for the characterization of many brain conditions in the future.

Feb. 6, 2012 
Erik Sudderth
Brown University 
Uncertainty in Natural Image Segmentation
[abstract]
We explore nonparametric Bayesian statistical models for image partitions which coherently model uncertainty in the size, shape, and structure of human image interpretations. Examining a large set of manually segmented scenes, we show that object frequencies and segment sizes both follow power law distributions, which are well modeled by the Pitman-Yor (PY) process. This generalization of the Dirichlet process leads to segmentation algorithms which automatically adapt their resolution to each image. Generalizing previous applications of PY priors, we use non-Markov Gaussian processes (GPs) to infer spatially contiguous segments which respect image boundaries. We show how GP covariance functions can be calibrated to accurately match the statistics of human segmentations, and that robust posterior inference is possible via a variational method, expectation propagation. The resulting method produces highly accurate segmentations of complex scenes, and hypothesizes multiple image partitions to capture the variability inherent in human scene interpretations.
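A toy sketch of why the Pitman-Yor process suits power-law segment statistics, using its urn (Chinese restaurant) representation; the function name, seed, and parameter values are illustrative, and this is not the paper's segmentation model. With a positive discount parameter d, the number of clusters grows polynomially in n rather than logarithmically as under the Dirichlet process (d = 0).

```python
import random

def pitman_yor_sizes(n, alpha, d, seed=3):
    """Cluster sizes from the Pitman-Yor urn: an existing cluster of
    size s is joined with probability (s - d)/(i + alpha); a new
    cluster opens with probability (alpha + d*K)/(i + alpha)."""
    rng = random.Random(seed)
    sizes = []
    for i in range(n):
        r = rng.random() * (i + alpha)
        acc = 0.0
        for k, s in enumerate(sizes):
            acc += s - d
            if r < acc:
                sizes[k] += 1
                break
        else:
            sizes.append(1)  # remaining probability mass opens a new cluster
    return sorted(sizes, reverse=True)

k_dp = len(pitman_yor_sizes(2000, 1.0, 0.0))  # Dirichlet process: ~log n clusters
k_py = len(pitman_yor_sizes(2000, 1.0, 0.5))  # Pitman-Yor, d=0.5: ~sqrt(n) growth
print(k_dp, k_py)
```

The heavier tail of cluster sizes under d > 0 is what lets PY-based segmentation adapt its resolution: a few large segments coexist with many small ones, matching human-segmented scenes.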

Feb. 13, 2012 
James Robins
Harvard University 
A Bold Vision (Delusion) of Artificial Intelligence and Philosophy: Finding Causal Effects Without Background Knowledge or Statistical Independencies 
Feb. 20, 2012 
Subhashis Ghoshal
North Carolina State University 
ADAPTIVE BAYESIAN MULTIVARIATE DENSITY ESTIMATION WITH DIRICHLET MIXTURES
[abstract]
The kernel method has been an extremely important component of nonparametric estimation and has undergone tremendous development since its introduction over fifty years ago. Bayesian methods for density estimation using kernel-smoothed priors were first introduced in the mid-eighties, where a random probability measure, typically following a Dirichlet process, is convoluted with a kernel to induce a prior on smooth densities. The resulting prior distribution is commonly known as a Dirichlet mixture process. Such priors became extremely popular in the Bayesian nonparametric literature after the development of Markov chain Monte Carlo methods for posterior computation in the nineties. Posterior consistency of a Dirichlet mixture prior with a normal kernel was established in Ghosal et al. (1999). Subsequent papers relaxed conditions for consistency, generalized to other kernels and studied rates of convergence, especially in the univariate case. More recently, it has been found that Bayesian kernel mixtures of finitely supported random distributions have an automatic rate adaptation property, something a classical kernel estimator lacks. We consider Bayesian multivariate density estimation using a Dirichlet mixture of normal kernels as the prior distribution. By representing a Dirichlet process as a stick-breaking process, we are able to extend convergence results beyond finitely supported mixture priors to Dirichlet mixtures. Thus our results have new implications in the univariate situation as well. Assuming that the true density satisfies Hölder smoothness and exponential tail conditions, we show that the rates of posterior convergence are minimax-optimal up to a logarithmic factor. This procedure is fully adaptive since the priors are constructed without using knowledge of the smoothness level.
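The stick-breaking representation mentioned in the abstract can be sketched in a few lines (a generic illustration of the construction, not the paper's proof device; the function name and truncation tolerance are assumptions):

```python
import random

def stick_breaking_weights(alpha, tol=1e-8, seed=0):
    """Dirichlet-process mixture weights by breaking a unit stick:
    V_k ~ Beta(1, alpha) and w_k = V_k * prod_{j<k} (1 - V_j),
    truncated once the leftover stick is below tol."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    while remaining > tol:
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights

w = stick_breaking_weights(alpha=2.0)
print(len(w), round(sum(w), 6))  # finitely many weights summing to ~1
```

Pairing each weight w_k with a random kernel location gives a draw from the Dirichlet mixture prior; the representation makes the countably-infinite mixture concrete and is what allows the convergence results to pass beyond finitely supported mixtures.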

Feb. 27, 2012 
Philippe Rigollet
Princeton University 
Sparsity pattern aggregation
[abstract]
Sparse estimation has received an incredible amount of attention from the statistical community over the past decade. The celebrated Lasso estimator and its extensions have attracted most of the attention, both from a theoretical and a computational perspective. The aim of this presentation is to develop an entirely new approach to sparse estimation using the principle of 'sparsity pattern aggregation' (SPA). This principle builds upon refined results for the problem of model selection using entropy penalization, which results in exponential weights. Consider a general, not necessarily linear, regression problem with Gaussian noise as an example. The main idea is to aggregate least squares estimators by carefully balancing a fitting term and a term that accounts for the sparsity of a given estimator. This principle yields surprisingly sharp finite sample performance guarantees known as 'Sparsity Oracle Inequalities' (SOI) that hold in expectation with respect to the sample at hand. In particular, it can be shown that it produces estimators that are optimal in a minimax sense over several interesting classes of problems arising in sparse estimation. A natural question is to extend these results to SOI that hold with high probability. It can be proved that SPA estimators based on exponential weights cannot satisfy such inequalities. Nevertheless, a nontrivial modification of the original SPA estimator does achieve the desired result. Moreover, the SPA principle turns out to be quite general and can be extended to other types of sparsity, such as fused sparsity or group sparsity. On the computational side, an approximation of the resulting estimators can be implemented using state-of-the-art Markov chain Monte Carlo simulations that result in a fairly intuitive stochastic greedy algorithm and that compares favorably to some of the most competitive estimators in this setting.
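A deliberately small sketch of the aggregation idea in the simplest Gaussian sequence setting: each sparsity pattern defines a restricted least-squares fit, and the fits are mixed with exponential weights that trade off residual fit against a sparsity penalty. The penalty constants, temperature, and function name here are illustrative assumptions (not the paper's exact weights), and the brute-force loop over all 2^d patterns is feasible only for tiny d.

```python
import itertools
import math

def spa_estimate(y, sigma=1.0):
    """Toy sparsity pattern aggregation for y_i = theta_i + noise:
    each support pattern p keeps y on p and is 0 elsewhere; patterns
    are mixed with weights exp(-(residual + penalty) / (4 sigma^2)),
    with penalty 2 * sigma^2 * |p| * log(d)."""
    d = len(y)
    num, den = [0.0] * d, 0.0
    for pattern in itertools.product([0, 1], repeat=d):
        fit = [yi * pi for yi, pi in zip(y, pattern)]
        resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fit))
        penalty = 2 * sigma ** 2 * sum(pattern) * math.log(d)
        w = math.exp(-(resid + penalty) / (4 * sigma ** 2))
        den += w
        num = [ni + w * fi for ni, fi in zip(num, fit)]
    return [ni / den for ni in num]

# One strong coordinate among small noise-like observations.
y = [5.0, 0.3, -0.2, 0.1, 0.4]
est = spa_estimate(y)
print([round(e, 2) for e in est])  # strong signal kept, the rest shrunk toward 0
```

The MCMC implementation mentioned at the end of the abstract replaces this exhaustive enumeration with a random walk over sparsity patterns, which is what makes the principle usable in high dimensions.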

March 19, 2012 
Chunming Zhang
University of Wisconsin 
Multiple Testing Via FDR_L For Large-Scale Imaging Data
[abstract]
The multiple testing procedure plays an important role in detecting the presence of spatial signals for large-scale imaging data. Typically, the spatial signals are sparse but clustered. This paper provides empirical evidence that for a range of commonly used control levels, the conventional FDR procedure can lack the ability to detect statistical significance, even if the p-values under the true null hypotheses are independent and uniformly distributed; more generally, ignoring the neighboring information of spatially structured data will tend to diminish the detection effectiveness of the FDR procedure. This paper first introduces a scalar quantity to characterize the extent to which the lack of identification phenomenon (LIP) of the FDR procedure occurs. Second, we propose a new multiple comparison procedure, called FDR_L, to accommodate the spatial information of neighboring p-values, via a local aggregation of p-values. Theoretical properties of the FDR_L procedure are investigated under weak dependence of p-values. It is shown that the FDR_L procedure alleviates the LIP of the FDR procedure, thus substantially facilitating the selection of more stringent control levels. Simulation evaluations indicate that the FDR_L procedure improves the detection sensitivity of the FDR procedure with little loss in detection specificity. The computational simplicity and detection effectiveness of the FDR_L procedure are illustrated through a real brain fMRI dataset.
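A toy sketch of why locally aggregating p-values helps with clustered signals. This simplified version (not the actual FDR_L procedure; the p-values, window width, and function names are illustrative) takes a 3-wide median of neighboring p-values and converts it back to a null p-value via the Beta(2,2) CDF, the null law of a median of three independent uniforms, handling the window edges loosely.

```python
import statistics

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns the rejected indices."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    return set(order[:k])

def locally_aggregated(pvals):
    """Median over a 3-wide spatial neighborhood, mapped through the
    Beta(2,2) CDF (3p^2 - 2p^3), its null distribution for interior windows."""
    m = len(pvals)
    meds = [statistics.median(pvals[max(0, i - 1):min(m, i + 2)])
            for i in range(m)]
    return [3 * p ** 2 - 2 * p ** 3 for p in meds]

# Clustered weak signals (indices 3-6) among uniform-like nulls.
p = [0.9, 0.4, 0.7, 0.02, 0.03, 0.01, 0.04, 0.6, 0.8, 0.5]
print(sorted(bh_reject(p)))                      # BH alone rejects nothing
print(sorted(bh_reject(locally_aggregated(p))))  # the cluster is recovered
```

No individual p-value in the cluster survives the BH thresholds, but a run of moderately small neighboring p-values is very unlikely under the null, which is exactly the spatial information a locally aggregated statistic exploits.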

March 26, 2012 
Alexander Rakhlin
University of Pennsylvania 
From Statistical to Game-Theoretic Learning
[abstract]
The study of prediction within the realm of Statistical Learning Theory is intertwined with the study of the supremum of an empirical process. The supremum can be analyzed with classical tools: Vapnik-Chervonenkis and scale-sensitive combinatorial dimensions, covering and packing numbers, and Rademacher averages. Consistency of empirical risk minimization is known to be closely related to the uniform Law of Large Numbers for function classes. In contrast to the i.i.d. scenario, in the sequential prediction framework we are faced with an individual sequence of data on which we place no probabilistic assumptions. The problem of universal prediction of such deterministic sequences has been studied within Statistics, Information Theory, Game Theory, and Computer Science. However, general tools for analysis have been lacking, and most results have been obtained on a case-by-case basis. In this talk, we show that the study of sequential prediction is closely related to the study of the supremum of a certain dyadic martingale process on trees. We develop analogues of the Rademacher complexity, covering numbers and scale-sensitive dimensions, which can be seen as temporal generalizations of the classical results. The complexities we define also ensure uniform convergence for non-i.i.d. data, extending the Glivenko-Cantelli type results. Analogues of local Rademacher complexities can be employed for obtaining fast rates and developing adaptive procedures. Our understanding of the inherent complexity of sequential prediction is complemented by a recipe that can be used for developing new algorithms. * Joint work with Karthik Sridharan and Ambuj Tewari.

April 2, 2012 
Sheng Zhong
University of Illinois 
Network-based comparison of temporal gene expression patterns
[abstract]
In the pursuit of a mechanistic understanding of cell differentiation, it is often necessary to compare multiple differentiation processes triggered by different external stimuli and internal perturbations. Available methods for comparing temporal gene expression patterns are limited to a gene-by-gene approach, which ignores coexpression information and thus is sensitive to measurement noise. We present a method for coexpression-network-based comparison of temporal expression patterns (NACEP). NACEP compares the temporal patterns of a gene between two experimental conditions, taking into consideration all of the possible coexpression modules that this gene may participate in. NACEP first uses a Dirichlet process to cluster genes, and then it uses probabilistically averaged spline curves to compare temporal patterns. We applied NACEP to analyze RA-induced differentiation of embryonic stem (ES) cells. The analysis suggests RA may facilitate neural differentiation by inducing the shh and insulin receptor pathways. NACEP was also applied to compare the temporal responses of seven RNA inhibition (RNAi) experiments. As proof of concept, we demonstrate that the difference in the temporal responses to RNAi treatments can be used to derive interaction relationships of transcription factors (TFs), and therefore infer regulatory modules within a transcription network.

April 9, 2012 
Nancy R. Zhang
Stanford University 
[CANCELLED]
Cross-sample Profiling of Genomic Copy Number Changes
[abstract]
DNA copy number analysis involves the detection of chromosomal gains and losses using high-density microarray or next-generation sequencing platforms. Change-point methods have been applied successfully to detecting signals in single data sequences derived from one biological sample. It is now common to have data sets involving hundreds to thousands of biological samples, each assayed at millions of positions. How should data be pooled across samples to detect changes that are shared by an unknown subset of the samples? How can we obtain a sparse signature of variation across the cohort? I will discuss the statistical issues underlying these problems and formulate a class of simultaneous change-point models for cross-sample and cross-platform data integration. These models lead to interpretable scan statistics whose false positive rates can be approximated analytically. I hope to also discuss model selection approaches for these problems, where conventional methods fail due to their high dimension and unknown sparsity. The insights gained from this study can be applied to other types of simultaneous scanning procedures that arise frequently in genomics.
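A single-sequence toy sketch of the change-point scanning at issue (a CUSUM-type statistic on one simulated sequence; the paper's simultaneous cross-sample models pool many such statistics, and the shift size, position, and function name below are illustrative assumptions):

```python
import random

def cusum_changepoint(x):
    """Return the split k maximizing the standardized mean difference
    |mean(x[:k]) - mean(x[k:])| * sqrt(k * (n - k) / n), a CUSUM-type scan."""
    n = len(x)
    total = sum(x)
    best_k, best_stat, left = 1, 0.0, 0.0
    for k in range(1, n):
        left += x[k - 1]
        stat = (abs(left / k - (total - left) / (n - k))
                * (k * (n - k) / n) ** 0.5)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat

rng = random.Random(7)
# Mean shifts from 0 to 1.5 at position 120 (a hypothetical copy-number gain).
x = [rng.gauss(0, 1) for _ in range(120)] + [rng.gauss(1.5, 1) for _ in range(80)]
k, stat = cusum_changepoint(x)
print(k, round(stat, 2))  # estimated change point near 120
```

When the same aberration is shared by an unknown subset of samples, no single sequence may carry enough signal on its own, which is the motivation for the simultaneous, cross-sample scan statistics of the talk.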

April 16, 2012 
Hongkai Ji
Johns Hopkins Bloomberg School of Public Health 
Differential Principal Component Analysis for ChIP-seq
[abstract]
We propose Differential Principal Component Analysis (dPCA) for characterizing differences between two biological conditions with respect to multiple ChIP-seq data sets. dPCA describes major differential patterns between two conditions using a small number of principal components. Each component corresponds to a multi-dataset covariation pattern shared by many genomic loci. The analysis prioritizes genomic loci based on each pattern, and for each pattern, it identifies loci with significant between-condition changes after considering variability among replicate samples. This approach provides an integrated solution to dimension reduction, unsupervised pattern discovery, and statistical inference. We demonstrate dPCA through analyses of differential chromatin patterns at transcription factor binding sites and human promoters using ENCODE data.
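A toy sketch of the dimension-reduction step only: plain PCA via power iteration on a hypothetical loci-by-dataset matrix of between-condition differences. The simulated data, mark count, and function name are illustrative assumptions, and none of dPCA's replicate-based inference is represented here.

```python
import random

def leading_pattern(diffs, iters=200):
    """Power iteration for the first principal component of a
    loci x dataset matrix of between-condition differences."""
    d = len(diffs[0])
    v = [1.0] * d
    for _ in range(iters):
        w = [0.0] * d  # accumulate (D^T D) v
        for row in diffs:
            proj = sum(r * vi for r, vi in zip(row, v))
            for j in range(d):
                w[j] += proj * row[j]
        norm = sum(wi ** 2 for wi in w) ** 0.5
        v = [wi / norm for wi in w]
    scores = [sum(r * vi for r, vi in zip(row, v)) for row in diffs]
    return v, scores

rng = random.Random(0)
# Hypothetical differences for 3 marks at 200 loci: half the loci share a
# (+, +, -) covariation pattern of varying strength, the rest are noise.
pattern = [1.0, 1.0, -1.0]
diffs = ([[a * p + rng.gauss(0, 0.3) for p in pattern]
          for a in [rng.gauss(0, 2) for _ in range(100)]]
         + [[rng.gauss(0, 0.3) for _ in range(3)] for _ in range(100)])
v, scores = leading_pattern(diffs)
print([round(c, 2) for c in v])  # recovers the shared covariation direction
```

The recovered component loads with the same sign on the first two marks and the opposite sign on the third, mirroring the planted pattern; in dPCA, each locus's score on such a component is what gets tested against replicate variability.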

April 23, 2012 
Christina Kendziorski
University of Wisconsin 
Statistical methods for genomic-based studies of disease
[abstract]
My research concerns the development and application of statistical methods for genomic-based studies of disease. In this talk, I will give an overview of two projects: one focused on methods for identifying differential expression (DE) in an RNA-seq experiment, and the other on methods for analysis and integration in genomic-based studies of disease. In short, a number of methods have been developed for identifying DE genes in an RNA-seq experiment, but they are sensitive to outliers and deficient for identifying DE isoforms. I will present EBSeq, an empirical Bayesian modeling approach for identifying differential expression in an RNA-seq experiment. Evaluation via simulation and case studies suggests that EBSeq is a powerful and robust approach that outperforms existing methods, and an application of EBSeq to a study of human embryonic and induced pluripotent stem cells provides novel insights into the genomic differences underlying these cell types.
I will also review work which extends latent Dirichlet allocation (LDA) models to address problems in personalized genomic medicine. LDA models have proven extremely effective at identifying themes common across large collections of text, but applications to genomics have been limited. Our framework extends LDA to the genome by considering each patient as a "document" with "text" constructed from clinical and high-dimensional genomic measurements, and also by allowing for supervision by a time-to-event response. So-called survival-supervised LDA (survLDA) enables the efficient identification of collections of clinical and genomic features that co-occur within patient subgroups, and then characterizes each patient by those features. An application of survLDA to The Cancer Genome Atlas (TCGA) ovarian project identifies informative patient subgroups and illustrates the potential for patient-specific inference. 
April 30, 2012 
Christopher Genovese
Carnegie Mellon University 
Estimating Filaments and Manifolds: Methods and Surrogates
[abstract]
Spatial data and high-dimensional data, such as collections of images, often contain high-density regions that concentrate around some lower-dimensional structure. In many cases, these structures are well-modeled by smooth manifolds, or collections of such manifolds. For example, the distribution of matter in the universe at large scales forms a web of intersecting clusters (0-dimensional manifolds), filaments (1-dimensional manifolds), and walls (2-dimensional manifolds), and the shape and distribution of these structures have cosmological implications.
I will discuss theory and methods for the problem of estimating manifolds (and collections of manifolds) from noisy data in the embedding space. The noise distribution has a dramatic effect on the performance (e.g., minimax rates) of estimators that is related to but distinct from what happens in measurement-error problems. Some variants of the problem are "hard" in the sense that no estimator can achieve a practically useful level of performance. I will show that in the "hard" case, it is possible to construct accurate estimators for a suitable surrogate of the unknown manifold that captures many of the key features of the object, and I will describe a method for doing this efficiently. 