Abstracts for 2011-12 seminar talks

Aug. 19, 2011: Repetition and surprise, rehearsal and reinvention
Mark Hansen
UCLA

For the last decade, I have had the privilege of collaborating on a number of public artworks that draw on dynamic data sources. In my talk I will describe two of these pieces. The first is called Moveable Type and is a permanent installation in the lobby of the New York Times building in midtown Manhattan. This work is designed to represent the activity taking place around the Times' content and draws on a feed of the Times' news stories, an hourly dump of their web access and search logs (a sample, suitably anonymized), and the complete archive back to 1851. The second piece I will present is Shuffle, a performance by the Elevator Repair Service that was created for the New York Public Library's centennial celebration in June of this year. This work was designed to be a mixing or reinterpretation of the material from the last three ERS tours, classic works by Faulkner, Fitzgerald and Hemingway. If there is time, I will also briefly present designs for new artwork for the 9/11 Memorial Museum and the NYU Law School.

In terms of a statistical practice, I see these collaborations as waypoints in an expanded field of data analysis. They present complex data for the public but in nonstandard venues and employing novel presentation techniques. These artworks have, in turn, shaped my views on the role of data, its collection and analysis, by the general public. If there is time, I will also present some of the curricular work I have helped develop for the Los Angeles Unified School District. Specifically, I will discuss a new, NSF-funded program for high school students that introduces data analysis in the context of a year-long course in computer science.

Related links:

Moveable Type: http://www.nytimes.com/2007/10/25/arts/design/25vide.html

Shuffle: http://www.nytimes.com/2011/05/24/theater/elevator-repair-service-performs-at-new-york-public-library.html

Sept. 12, 2011: The universal Glivenko-Cantelli property
Ramon van Handel
Princeton University

Uniform laws of large numbers (ULLN) are basic tools in probability and statistics. Classes of functions for which the ULLN holds for a given probability measure (Glivenko-Cantelli classes) or uniformly with respect to all probability measures (uniform Glivenko-Cantelli classes) were characterized by Vapnik and Chervonenkis and by Talagrand. However, classes for which the ULLN holds for every probability measure---the universal Glivenko-Cantelli classes---are much more poorly understood. In this talk I will show how, under some regularity assumptions, universal Glivenko-Cantelli classes can be characterized in terms of certain geometric and combinatorial properties. A surprising consequence is that the ULLN holds universally in this setting if and only if the same is true for the uniform ergodic theorem or for uniform reverse martingale convergence, extending their applicability substantially beyond the i.i.d. setting inherent in the definition. I will also discuss several unusual counterexamples that highlight the limitations and difficulties of trying to characterize the universal Glivenko-Cantelli property.
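
As a point of reference for the abstract above, the simplest uniform law of large numbers is the classical Glivenko-Cantelli theorem: the empirical distribution function converges uniformly to the true one. The Python sketch below illustrates only that classical case, not the universal property studied in the talk. The function name and the choice of the Uniform(0, 1) distribution are assumptions made for illustration.

```python
import random


def sup_ecdf_gap(sample):
    """Return sup_x |F_n(x) - F(x)| for Uniform(0, 1) data, where F(x) = x."""
    xs = sorted(sample)
    n = len(xs)
    # The empirical CDF is a step function, so the supremum is attained at a
    # jump; it is enough to compare F with the step just after and just
    # before each sorted observation.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


rng = random.Random(0)
for n in (100, 1_000, 10_000):
    gap = sup_ecdf_gap([rng.random() for _ in range(n)])
    print(n, gap)  # the gap should tend to shrink as n grows
```

The class of functions here is the set of half-line indicators; the talk concerns much richer classes and every probability measure at once.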

Sept. 26, 2011: Shannon's Entropy Measuring of RNA Secondary Structure Over Stochastic Grammar Ensembles
Liming Cai
University of Georgia

Shannon's entropy measures the fold certainty (i.e., structural variation) of any given RNA sequence over a defined secondary structure ensemble. However, since the thermodynamic scoring scheme built into the Boltzmann ensemble is not normalized, derivations for the structural entropy have not been available. In this presentation, we derive Shannon's entropy of RNA secondary structure over stochastic context-free grammar (SCFG) ensembles that have well-defined probability distributions. Being reconfigurable, SCFGs can incorporate constraints preferred by tertiary folding and make it possible to effectively distinguish non-coding RNA sequences from random sequences by their Shannon's entropy. In addition, we derive Shannon's entropy of SCFG ensembles without the presence of RNA sequences and show that the entropy actually measures the average length of RNA sequences within such an ensemble. Potential applications of this research include non-coding RNA gene finding and annotation on genome sequences.
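
To make the quantity in this abstract concrete, here is a minimal Python sketch of Shannon's entropy over a small, explicitly normalized ensemble. The three dot-bracket structures and their probabilities are hypothetical values chosen for illustration; the sketch does not implement the SCFG derivation from the talk, which avoids enumerating structures.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy, in bits, of a normalized discrete distribution."""
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must be non-negative and sum to 1")
    # Terms with p == 0 contribute nothing (the limit of p * log p is 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical ensemble: candidate folds of one short sequence, each with an
# assumed probability. A well-defined distribution is what makes the entropy
# computable at all.
toy_ensemble = {"((..))": 0.7, "(....)": 0.2, "......": 0.1}
print(shannon_entropy(list(toy_ensemble.values())))
```

A low value means the probability mass sits on few structures (high fold certainty); a high value means the ensemble is spread over many alternatives.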

Oct. 3, 2011: The Multi-Domain Sampler and Its Applications
Qing Zhou
UCLA

When a posterior distribution has multiple modes, unconditional expectations, such as the posterior mean, may not offer informative summaries of the distribution. Motivated by this problem, I propose to decompose the sample space of a multimodal distribution into domains of attraction of local modes. Domain-based representations are defined to summarize the probability masses of and conditional expectations on domains of attraction, which are much more informative than the mean and other unconditional expectations. A computational method, the multi-domain sampler, is developed to construct domain-based representations for an arbitrary multimodal distribution. The effectiveness of the multi-domain sampler is demonstrated by applications to structural learning of protein-signaling networks from single-cell data and construction of energy landscapes of the Sherrington-Kirkpatrick spin glasses.

Oct. 17, 2011: Model-based methods for analyzing NGS data
Zhaohui Qin
Emory University

The next generation sequencing (NGS) technologies have been rapidly adopted in an array of diverse applications. Although NGS is extremely promising, the massive amount of data it generates, together with substantial biases and correlation, poses daunting challenges for data analysis. By treating observed data as random samples from probability distributions, model-based methods can accommodate uncertainties explicitly and also automatically lead to rigorous statistical inference. Inspired by the success of model-based methods in the analysis of other high-throughput genomics data such as microarrays, we attempted to develop novel model-based methods to analyze data generated from the new NGS-based experiments. RNA sequencing (RNA-seq) is a powerful new technology for mapping and quantifying the transcriptome. We propose a spatial model-based method named POME to characterize base-level read coverage within each exon. The underlying expression level is included as a key parameter in this model, and large base-specific variations and between-base correlations are also taken into account. Simulated and real data analyses demonstrated significant improvement when comparing POME to existing approaches. I will also discuss how model-based methods can help other applications of NGS. This is joint work with Ming Hu, Michael Zhu and Jun Liu.

Oct. 24, 2011: Genomic location is information: computational elucidation of bacterial genomic structures
Ying Xu
University of Georgia

We have recently discovered that genomic locations of genes in bacteria are highly constrained by the cellular processes that they are involved in. So for the first time, we understand that the locations of genes follow both global and local rules. This realization has led to a new paradigm for tackling and solving some very challenging genomic analysis problems. I will discuss this new discovery and a number of applications that we are currently working on, including gene assignments of pathway holes and complete genome assembly.

Oct. 31, 2011: Prediction of epigenetic patterns from DNA sequences
Guo-Cheng Yuan
Harvard

In a multi-cellular organism, a single genome is shared by nearly all cell-types; yet each cell-type expresses a different set of genes. A partial explanation is the fact that only a small portion of the genomic DNA is accessible in any cell type; this accessibility is highly controlled by epigenetic mechanisms. Recently, a large amount of epigenomic data has been generated, providing strong evidence that tissue-specific epigenetic patterns are responsible for controlling global gene expression required for maintenance of cell identity. However, a fundamental yet unresolved question is how epigenetic patterns are established and maintained. Previous studies have identified a large number of molecular interactions that play a role in regulating the epigenetic patterns. It is a daunting task to fully dissect this complex interaction network. As a starting point, we have developed two computational methods to systematically investigate the role of DNA sequences in guiding genome-wide epigenetic patterns. The first method, which we call the N-score model, extracts periodic sequence features by using a wavelet approach. The second method combines multiple sequence features by using Bayesian regression trees. We applied these methods to analyze the genome-wide patterns of various epigenetic marks. We found that a significant proportion of the epigenetic landscape can be explained by the DNA sequence information alone. We suggest that the DNA sequence plays at least two distinct roles in mediating epigenetic patterns: 1) a small number of simple features may be recognized by general factors to orchestrate the overall epigenetic variability; and 2) a large number of highly specific features may be recognized by tissue-specific factors to refine the default epigenetic patterns at specific loci.

Nov. 7, 2011: OODA of Tree-Structured Data Objects
J. S. Marron
University of North Carolina

The field of Object Oriented Data Analysis has made a lot of progress on the statistical analysis of the variation in populations of complex objects. A particularly challenging example of this type is populations of tree-structured objects. Deep challenges arise, which involve a marriage of ideas from statistics, geometry, and numerical analysis, because the space of trees is strongly non-Euclidean in nature. These challenges, together with three completely different approaches to addressing them, are illustrated using a real data example, where each data point is the tree of blood arteries in one person's brain.

Nov. 14, 2011: Online variational inference for scalable approximate posterior inference (with applications to probabilistic topic models)
David Blei
Princeton

Probabilistic topic modeling provides a suite of tools for analyzing large collections of documents. Topic modeling algorithms can uncover the underlying themes of a collection and decompose its documents according to those themes. We can use topic models to explore the thematic structure of a corpus and to solve a variety of prediction problems about documents. At the center of a topic model is a hierarchical mixed-membership model, where each document exhibits a shared set of mixture components with individual (per-document) proportions. Our goal is to condition on the observed words of a collection and estimate the posterior distribution of the shared components and per-document proportions. When analyzing modern corpora, this amounts to posterior inference with billions of latent variables. How can we cope with such data? In this talk, I will describe online variational inference for approximating posterior distributions in hierarchical models. In traditional variational inference, we posit a simple family of distributions over the latent variables and try to find the member of that family that is close to the posterior of interest. In online variational inference, we use stochastic optimization to find the closest member of the family, where we obtain noisy estimates of the appropriate gradient by repeatedly subsampling from the data. This approach (along with some information geometric considerations) leads to a scalable variational inference algorithm for massive data sets. I will demonstrate the algorithm with probabilistic topic models fitted to millions of articles. I will further describe two variants, one for mixed-membership community detection in massive social networks and one for Bayesian nonparametric mixed-membership models. I will show how online variational inference can be generalized to many kinds of hierarchical models. Finally, I will highlight several open questions and outstanding issues. 
(This is joint work with Francis Bach, Matt Hoffman, John Paisley, and Chong Wang.)
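
The core computational idea in the abstract above, following noisy gradients obtained by repeatedly subsampling the data, can be sketched on a toy problem. The Python example below is plain stochastic gradient descent for estimating a mean, not variational inference; the objective, step-size schedule, and function name are assumptions chosen to keep the illustration short.

```python
import random


def stochastic_mean(data, steps=2000, batch_size=10, seed=0):
    """Minimize the average of 0.5 * (theta - x)^2 over the data using
    gradients computed from small random subsamples."""
    rng = random.Random(seed)
    theta = 0.0
    for t in range(1, steps + 1):
        batch = rng.sample(data, batch_size)
        # Unbiased but noisy estimate of the full-data gradient.
        grad = sum(theta - x for x in batch) / batch_size
        # Decreasing step sizes (sum diverges, sum of squares converges),
        # the usual condition for stochastic approximation.
        theta -= grad / t
    return theta


rng = random.Random(1)
data = [rng.gauss(3.0, 1.0) for _ in range(1000)]
print(stochastic_mean(data), sum(data) / len(data))
```

Each step touches only `batch_size` observations, which is what allows this style of optimization to scale to collections far too large to process in full at every iteration.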

Nov. 28, 2011: Color Independent Component Analysis with an Application to Functional Magnetic Resonance Imaging
Haipeng Shen
University of North Carolina

Independent component analysis (ICA) is an effective data-driven method for blind source separation. It has been successfully applied to separate source signals of interest from their mixtures. Most existing ICA procedures are carried out by relying solely on the estimation of the marginal density functions. However, in many applications, correlation structures within each source also play an important role besides the marginal distributions. One important such example is functional magnetic resonance imaging (fMRI) analysis where the brain-function-related signals are temporally correlated. We develop a novel color ICA approach that fully exploits the correlation structures within the sources. Specifically, we propose to estimate the spectral density functions of the source signals instead of their marginal density functions. Our methodology is described and implemented using spectral density functions from common time series models. The time series model parameters and the mixing matrix are estimated via maximizing the Whittle likelihood function. The proposed method is shown to outperform several popular existing methods through simulation studies and a real fMRI application.

Jan. 9, 2012: Chi-square and classical exact tests often wildly misreport significance; the remedy lies in computers
Mark Tygert
New York University

If a discrete probability distribution in a model being tested for goodness-of-fit is not close to uniform, then forming the Pearson chi-square statistic can involve division by nearly zero. This often leads to serious trouble in practice -- even in the absence of round-off errors -- as the talk will illustrate via numerous examples. Fortunately, with the now widespread availability of computers, avoiding all the trouble is simple and easy: without the problematic division by nearly zero, the actual values taken by goodness-of-fit statistics are not humanly interpretable, but black-box computer programs can rapidly calculate their precise significance.
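
The contrast drawn in this abstract can be sketched in Python as follows. The undivided statistic used here (a sum of squared differences between observed proportions and model probabilities) and the Monte Carlo significance calculation are assumptions chosen to illustrate the idea; the talk's exact statistic and algorithm may differ, and all function names are hypothetical.

```python
import random


def pearson_chi2(counts, probs):
    """Classical Pearson statistic; each term is divided by n * p."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))


def sq_distance(counts, probs):
    """Squared distance between observed proportions and the model,
    with no division by the model probabilities."""
    n = sum(counts)
    return sum((c / n - p) ** 2 for c, p in zip(counts, probs))


def monte_carlo_pvalue(counts, probs, statistic, draws=2000, seed=0):
    """Estimate the significance of `statistic` by simulating from the model."""
    rng = random.Random(seed)
    n = sum(counts)
    observed = statistic(counts, probs)
    hits = 0
    for _ in range(draws):
        sim = [0] * len(probs)
        for b in rng.choices(range(len(probs)), weights=probs, k=n):
            sim[b] += 1
        if statistic(sim, probs) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (draws + 1)


# One observation in a bin with tiny model probability dominates Pearson's
# statistic because of the division by n * p.
print(pearson_chi2([1, 99], [1e-6, 1 - 1e-6]))
print(monte_carlo_pvalue([45, 35, 20], [0.5, 0.3, 0.2], sq_distance))
```

The raw value of `sq_distance` has no standard reference distribution, which is why the simulation step is needed to turn it into a significance level.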

Jan. 30, 2012: Connectomics Signatures for Characterization of Brain Conditions
Tianming Liu
University of Georgia

Human connectomes constructed via neuroimaging data offer a complete description of macro-scale structural/functional connectivity within the brain. Assessing connectome-wide structural and functional connectivities not only can fundamentally advance our understanding of brain organization and function, but is also of ultimate importance for systematically and comprehensively characterizing many devastating brain conditions. Here, we constructed structural connectomes of 240 brains and assessed the connectome-wide functional connectivity alterations in mild cognitive impairment, schizophrenia and post-traumatic stress disorder, in comparison with their healthy controls. By applying genomic signature discovery approaches, we discovered informative and robust functional connectomics signatures that can distinctively characterize these brain conditions from their healthy controls. Our results suggest that connectomics signatures could be a general, powerful platform for characterization of many brain conditions in the future.

Feb. 6, 2012: Uncertainty in Natural Image Segmentation
Erik Sudderth
Brown University

We explore nonparametric Bayesian statistical models for image partitions which coherently model uncertainty in the size, shape, and structure of human image interpretations. Examining a large set of manually segmented scenes, we show that object frequencies and segment sizes both follow power law distributions, which are well modeled by the Pitman-Yor (PY) process. This generalization of the Dirichlet process leads to segmentation algorithms which automatically adapt their resolution to each image. Generalizing previous applications of PY priors, we use non-Markov Gaussian processes (GPs) to infer spatially contiguous segments which respect image boundaries. We show how GP covariance functions can be calibrated to accurately match the statistics of human segmentations, and that robust posterior inference is possible via a variational method, expectation propagation. The resulting method produces highly accurate segmentations of complex scenes, and hypothesizes multiple image partitions to capture the variability inherent in human scene interpretations.
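
For readers unfamiliar with the Pitman-Yor process mentioned above, a random partition from it can be simulated with the standard sequential seating scheme (the two-parameter generalization of the Chinese restaurant process). The Python sketch below shows only that prior on partitions; it includes none of the spatial Gaussian process machinery from the talk, and the function name and parameter values are assumptions for illustration.

```python
import random


def pitman_yor_partition(n, discount, concentration, rng):
    """Sample cluster sizes for n items from a Pitman-Yor partition.

    Requires 0 <= discount < 1 and concentration > -discount.
    A discount of 0 recovers the Dirichlet process.
    """
    sizes = []
    for _ in range(n):
        # Existing cluster k is chosen with weight (size_k - discount); a new
        # cluster with weight (concentration + discount * number_of_clusters).
        weights = [s - discount for s in sizes]
        weights.append(concentration + discount * len(sizes))
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
    return sizes


rng = random.Random(0)
sizes = pitman_yor_partition(2000, discount=0.5, concentration=1.0, rng=rng)
print(len(sizes), sorted(sizes, reverse=True)[:5])
```

With a positive discount the number of clusters grows polynomially in `n` and many clusters stay small, which is the heavy-tailed behavior the abstract refers to.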

Feb. 20, 2012: Adaptive Bayesian Multivariate Density Estimation with Dirichlet Mixtures
Subhashis Ghosal
North Carolina State University

The kernel method has been an extremely important component of nonparametric estimation methods and has undergone tremendous development since its introduction over fifty years ago. Bayesian methods for density estimation using kernel-smoothed priors were first introduced in the mid-eighties, where a random probability measure, typically following a Dirichlet process, is convolved with a kernel to induce a prior on smooth densities. The resulting prior distribution is commonly known as a Dirichlet mixture process. Such priors became extremely popular in the Bayesian nonparametric literature after the development of Markov chain Monte Carlo methods for posterior computation in the nineties. Posterior consistency of a Dirichlet mixture prior with a normal kernel was established in Ghosal et al. (1999). Subsequent papers relaxed conditions for consistency, generalized to other kernels and studied rates of convergence, especially in the univariate case. More recently, it has been found that Bayesian kernel mixtures of finitely supported random distributions have an automatic rate adaptation property, something a classical kernel estimator lacks. We consider Bayesian multivariate density estimation using a Dirichlet mixture of normal kernels as the prior distribution. By representing a Dirichlet process as a stick-breaking process, we are able to extend convergence results beyond finitely supported mixture priors to Dirichlet mixtures. Thus our results have new implications in the univariate situation as well. Assuming that the true density satisfies Hölder smoothness and exponential tail conditions, we show that the rates of posterior convergence are minimax-optimal up to a logarithmic factor. This procedure is fully adaptive since the priors are constructed without using knowledge of the smoothness level.
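
The stick-breaking representation mentioned above can be sketched in a few lines of Python. The example draws a truncated set of Dirichlet process weights and uses them to build one random univariate density from a Dirichlet mixture of normal kernels. The truncation level, the N(0, 3^2) base measure, and the fixed bandwidth are assumptions made for illustration and are not taken from the talk.

```python
import math
import random


def stick_breaking_weights(alpha, num_atoms, rng):
    """First num_atoms stick-breaking weights of a Dirichlet process with
    concentration alpha, plus the mass left on the unbroken stick."""
    weights = []
    remaining = 1.0
    for _ in range(num_atoms):
        v = rng.betavariate(1.0, alpha)  # fraction of the remaining stick
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights, remaining


def dirichlet_mixture_density(alpha, sigma, num_atoms, rng):
    """One random density drawn from a (truncated) Dirichlet mixture of
    normal kernels with bandwidth sigma."""
    weights, _ = stick_breaking_weights(alpha, num_atoms, rng)
    atoms = [rng.gauss(0.0, 3.0) for _ in weights]  # assumed base measure

    def density(x):
        norm = sigma * math.sqrt(2.0 * math.pi)
        return sum(
            w * math.exp(-0.5 * ((x - m) / sigma) ** 2) / norm
            for w, m in zip(weights, atoms)
        )

    return density


rng = random.Random(0)
f = dirichlet_mixture_density(alpha=2.0, sigma=0.5, num_atoms=50, rng=rng)
print(f(0.0), f(1.0))
```

Because of the truncation, the returned function integrates to one minus the leftover stick mass rather than exactly one; the leftover shrinks geometrically in expectation as `num_atoms` grows.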

Feb. 27, 2012: Sparsity pattern aggregation
Philippe Rigollet
Princeton University

Sparse estimation has received an incredible amount of attention from the statistical community over the past decade. The celebrated Lasso estimator and its extensions have attracted most of the attention both from a theoretical and a computational perspective. The aim of this presentation is to develop an entirely new approach to sparse estimation using the principle of 'sparsity pattern aggregation' (SPA). This principle builds upon refined results for the problem of model selection using entropy penalization, which results in exponential weights. Consider a general, not necessarily linear, regression problem with Gaussian noise as an example. The main idea is to aggregate least squares estimators by carefully balancing a fitting term and a term that accounts for the sparsity of a given estimator. This principle yields surprisingly sharp finite sample performance guarantees known as 'Sparsity Oracle Inequalities' (SOI) that hold in expectation with respect to the sample at hand. In particular, it can be shown that it produces estimators that are optimal in a minimax sense over several interesting classes of problems arising in sparse estimation. A natural question is whether these results extend to SOI that hold with high probability. It can be proved that SPA estimators based on exponential weights cannot satisfy such inequalities. Nevertheless, a non-trivial modification of the original SPA estimator does achieve the desired result. Moreover, the SPA principle turns out to be quite general and can be extended to other types of sparsity such as fused-sparsity or group-sparsity. On the computational side, an approximation of the resulting estimators can be implemented using state-of-the-art Markov chain Monte Carlo simulations, which results in a fairly intuitive stochastic greedy algorithm that compares favorably with some of the most competitive estimators in this setting.

March 19, 2012: Multiple Testing Via FDR_L For Large-Scale Imaging Data
Chunming Zhang
University of Wisconsin

The multiple testing procedure plays an important role in detecting the presence of spatial signals for large-scale imaging data. Typically, the spatial signals are sparse but clustered. This paper provides empirical evidence that for a range of commonly used control levels, the conventional FDR procedure can lack the ability to detect statistical significance, even if the p-values under the true null hypotheses are independent and uniformly distributed; more generally, ignoring the neighboring information of spatially structured data will tend to diminish the detection effectiveness of the FDR procedure. This paper first introduces a scalar quantity to characterize the extent to which the lack of identification phenomenon (LIP) of the FDR procedure occurs. Second, we propose a new multiple comparison procedure, called FDR_L, to accommodate the spatial information of neighboring p-values, via a local aggregation of p-values. Theoretical properties of the FDR_L procedure are investigated under weak dependence of p-values. It is shown that the FDR_L procedure alleviates the LIP of the FDR procedure, thus substantially facilitating the selection of more stringent control levels. Simulation evaluations indicate that the FDR_L procedure improves the detection sensitivity of the FDR procedure with little loss in detection specificity. The computational simplicity and detection effectiveness of the FDR_L procedure are illustrated through a real brain fMRI dataset.
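
Two ingredients named in this abstract can be sketched in Python: the conventional FDR procedure (taken here to be the Benjamini-Hochberg step-up rule) and a local aggregation of neighboring p-values. The use of a median over a one-dimensional window is an assumption for illustration. The sketch deliberately stops short of the FDR_L procedure itself: aggregated p-values are no longer uniform under the null, so they need their own calibration, which is not reproduced here.

```python
from statistics import median


def benjamini_hochberg(pvalues, level):
    """Indices rejected by the Benjamini-Hochberg step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its threshold.
        if pvalues[i] <= level * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


def local_median(pvalues, radius=1):
    """Replace each p-value on a 1-D grid by the median of its neighborhood
    (windows are truncated at the two ends of the grid)."""
    m = len(pvalues)
    return [median(pvalues[max(0, i - radius): i + radius + 1]) for i in range(m)]


p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(p, level=0.05))
print(local_median(p, radius=1))
```

Aggregation smooths isolated small p-values toward their neighbors while keeping runs of small p-values small, which is how neighborhood information can help when signals are sparse but clustered.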

March 26, 2012: From Statistical to Game-Theoretic Learning
Alexander Rakhlin
University of Pennsylvania

The study of prediction within the realm of Statistical Learning Theory is intertwined with the study of the supremum of an empirical process. The supremum can be analyzed with classical tools: Vapnik-Chervonenkis and scale-sensitive combinatorial dimensions, covering and packing numbers, and Rademacher averages. Consistency of empirical risk minimization is known to be closely related to the uniform Law of Large Numbers for function classes. In contrast to the i.i.d. scenario, in the sequential prediction framework we are faced with an individual sequence of data on which we place no probabilistic assumptions. The problem of universal prediction of such deterministic sequences has been studied within Statistics, Information Theory, Game Theory, and Computer Science. However, general tools for analysis have been lacking, and most results have been obtained on a case-by-case basis. In this talk, we show that the study of sequential prediction is closely related to the study of the supremum of a certain dyadic martingale process on trees. We develop analogues of the Rademacher complexity, covering numbers and scale-sensitive dimensions, which can be seen as temporal generalizations of the classical results. The complexities we define also ensure uniform convergence for non-i.i.d. data, extending the Glivenko-Cantelli type results. Analogues of local Rademacher complexities can be employed for obtaining fast rates and developing adaptive procedures. Our understanding of the inherent complexity of sequential prediction is complemented by a recipe that can be used for developing new algorithms. * Joint work with Karthik Sridharan and Ambuj Tewari.

April 2, 2012: Network based comparison of temporal gene expression patterns
Sheng Zhong
University of Illinois

In the pursuit of a mechanistic understanding of cell differentiation, it is often necessary to compare multiple differentiation processes triggered by different external stimuli and internal perturbations. Available methods for comparing temporal gene expression patterns are limited to a gene-by-gene approach, which ignores co-expression information and thus is sensitive to measurement noise. We present a method for co-expression network based comparison of temporal expression patterns (NACEP). NACEP compares the temporal patterns of a gene between two experimental conditions, taking into consideration all of the possible co-expression modules that this gene may participate in. NACEP first uses a Dirichlet Process to cluster genes, and then it uses probabilistically averaged spline curves to compare temporal patterns. We applied NACEP to analyze RA-induced differentiation of embryonic stem (ES) cells. The analysis suggests RA may facilitate neural differentiation by inducing the shh and insulin receptor pathways. NACEP was also applied to compare the temporal responses of seven RNA inhibition (RNAi) experiments. As proof of concept, we demonstrate that the difference in the temporal responses to RNAi treatments can be used to derive interaction relationships of transcription factors (TFs), and therefore infer regulatory modules within a transcription network.

April 9, 2012: [CANCELLED] Cross-sample Profiling of Genomic Copy Number Changes
Nancy R. Zhang
Stanford University

DNA copy number analysis involves the detection of chromosomal gains and losses using high-density microarray or next-generation sequencing platforms. Change-point methods have been applied successfully to detecting signals in single data sequences derived from one biological sample. It is now common to have data sets involving hundreds to thousands of biological samples, each assayed at millions of positions. How should data be pooled across samples to detect changes that are shared by an unknown subset of the samples? How can we obtain a sparse signature of variation across the cohort? I will discuss the statistical issues underlying these problems and formulate a class of simultaneous change-point models for cross-sample and cross-platform data integration. These models lead to interpretable scan statistics whose false positive rates can be approximated analytically. I hope to also discuss model selection approaches for these problems, where conventional methods fail due to their high dimension and unknown sparsity. The insights gained from this study can be applied to other types of simultaneous scanning procedures that arise frequently in genomics.

April 16, 2012: Differential Principal Component Analysis for ChIP-seq
Hongkai Ji
Johns Hopkins Bloomberg School of Public Health

We propose Differential Principal Component Analysis (dPCA) for characterizing differences between two biological conditions with respect to multiple ChIP-seq data sets. dPCA describes major differential patterns between two conditions using a small number of principal components. Each component corresponds to a multi-dataset covariation pattern shared by many genomic loci. The analysis prioritizes genomic loci based on each pattern, and for each pattern, it identifies loci with significant between-condition changes after considering variability among replicate samples. This approach provides an integrated solution to dimension reduction, unsupervised pattern discovery, and statistical inference. We demonstrate dPCA through analyses of differential chromatin patterns at transcription factor binding sites and human promoters using ENCODE data.

April 23, 2012: Statistical methods for genomic based studies of disease
Christina Kendziorski
University of Wisconsin

My research concerns the development and application of statistical methods for genomic based studies of disease. In this talk, I will give an overview of two projects: one focused on methods for identifying differential expression (DE) in an RNA-seq experiment and the other on methods for analysis and integration in genomic based studies of disease. In short, a number of methods have been developed for identifying DE genes in an RNA-seq experiment, but they are sensitive to outliers and deficient for identifying DE isoforms. I will present EBSeq, an empirical Bayesian modeling approach for identifying differential expression in an RNA-seq experiment. Evaluation via simulation and case studies suggests that EBSeq is a powerful and robust approach that outperforms existing methods, and an application of EBSeq to a study of human embryonic and induced pluripotent stem cells provides novel insights into genomic differences underlying these cell types.

I will also review work that extends latent Dirichlet allocation (LDA) models to address problems in personalized genomic medicine. LDA models have proven extremely effective at identifying themes common across large collections of text, but applications to genomics have been limited. Our framework extends LDA to the genome by considering each patient as a "document" with "text" constructed from clinical and high-dimensional genomic measurements, and also by allowing for supervision by a time-to-event response. So-called survival-supervised LDA (survLDA) enables the efficient identification of collections of clinical and genomic features that co-occur within patient subgroups, and then characterizes each patient by those features. An application of survLDA to The Cancer Genome Atlas (TCGA) ovarian project identifies informative patient subgroups and illustrates the potential for patient-specific inference.

April 30, 2012: Estimating Filaments and Manifolds: Methods and Surrogates
Christopher Genovese
Carnegie Mellon University

Spatial data and high-dimensional data, such as collections of images, often contain high-density regions that concentrate around some lower dimensional structure. In many cases, these structures are well-modeled by smooth manifolds, or collections of such manifolds. For example, the distribution of matter in the universe at large scales forms a web of intersecting clusters (0-dimensional manifolds), filaments (1-dimensional manifolds), and walls (2-dimensional manifolds), and the shape and distribution of these structures have cosmological implications.

I will discuss theory and methods for the problem of estimating manifolds (and collections of manifolds) from noisy data in the embedding space. The noise distribution has a dramatic effect on the performance (e.g., minimax rates) of estimators that is related to but distinct from what happens in measurement-error problems. Some variants of the problem are "hard" in the sense that no estimator can achieve a practically useful level of performance. I will show that in the "hard" case, it is possible to achieve accurate estimators for a suitable surrogate of the unknown manifold that captures many of the key features of the object and will describe a method for doing this efficiently.