Abstracts for 2008-09 seminar talks

Apr. 27, 2009:  Incorporating Prior Biological Knowledge in Genome-Wide Association Studies
Hongyu Zhao
Epidemiology and Public Health, Yale University
The last four years have seen great successes in many Genome-Wide Association Studies (GWAS), which have identified numerous genetic variants underlying complex traits. The analysis and interpretation of data from GWAS present great statistical and computational challenges, especially after the initial discoveries of variants carrying relatively large effects. Although various statistical approaches have been or are being developed to better analyze GWAS data, it has become apparent that the incorporation of information from prior studies and other sources is indispensable. In this presentation, we discuss our recently developed statistical methods and bioinformatics tools that are designed to more effectively integrate diverse types of prior biological information in analyzing GWAS data. The usefulness of these methods will be illustrated through their applications to some recent large-scale GWAS data.
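
The abstract does not name the specific procedures, so as a generic, hypothetical illustration of how prior knowledge can enter a GWAS analysis, the sketch below implements the classical weighted Bonferroni rule, in which variants with stronger prior support get larger weights and hence less stringent thresholds. This is a stand-in for the idea, not the speaker's method; the function and variable names are ours.

    import numpy as np

    def weighted_bonferroni(p_values, prior_weights, alpha=0.05):
        # Rescale weights to mean 1 so the per-SNP thresholds still sum to
        # alpha, preserving familywise error control by the union bound.
        w = np.asarray(prior_weights, dtype=float)
        w = w / w.mean()
        return np.asarray(p_values) <= alpha * w / len(w)

    # e.g. SNPs in pathways flagged by prior studies get weight 2, others 0.5:
    # hits = weighted_bonferroni(p, np.where(in_prior_pathway, 2.0, 0.5))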

Apr. 13, 2009:  Control Variates for Reversible MCMC Samplers
Ioannis (Yiannis) Kontoyiannis
Department of Informatics, Athens University of Economics and Business
Control variates are a well-known and effective tool for variance reduction in Monte Carlo sampling. We will present a general methodology for the construction and effective use of control variates for reversible Markov chain Monte Carlo (MCMC) samplers. We will show that the values of the coefficients of the optimal linear combination of the control variates can be computed explicitly, and we will derive adaptive, consistent MCMC estimators for these optimal coefficients. Numerous MCMC simulation examples from Bayesian inference applications will be presented, demonstrating that the resulting variance reduction can be quite dramatic, often by a factor of the order of hundreds or thousands.

Joint work with Petros Dellaportas.
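
The talk's construction builds control variates from the dynamics of the reversible chain itself; as background, here is a minimal sketch of the classical control-variate estimator with a plug-in estimate of the optimal linear coefficient, using i.i.d. draws as a stand-in for MCMC output.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)      # stand-in for output of a reversible sampler
    f = np.exp(x)                     # target: E[f(X)], which equals exp(1/2) here
    g = x                             # control variate with known mean E[g] = 0

    C = np.cov(f, g)
    b_hat = C[0, 1] / C[1, 1]                  # plug-in optimal coefficient
    plain = f.mean()                           # ordinary ergodic average
    controlled = f.mean() - b_hat * g.mean()   # variance-reduced estimator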

Apr. 6, 2009:  Logistic Regression with Brownian-like Predictors
Ian McKeague
Department of Biostatistics, Columbia University
This talk introduces a new type of logistic regression model involving functional predictors of binary responses, along with an extension of the approach to generalized linear models. The predictors are trajectories that have certain sample-path properties in common with Brownian motion. Time points are treated as parameters of interest, and confidence intervals are developed under prospective and retrospective (case-control) sampling designs. In an application to fMRI data, signals from individual subjects are used to find the portion of the time course that is most predictive of the response. This allows the identification of sensitive time points, specific to a brain region and associated with a certain task, that can be used to distinguish between responses. A second application concerns gene expression data in a case-control study involving breast cancer, where the aim is to identify genetic loci along a chromosome that best discriminate between cases and controls.

The talk is based on joint work with Martin Lindquist.
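
A minimal sketch of the profiling idea described above, under our assumption (for illustration only) that the sensitive time point is found by maximizing the likelihood of a one-dimensional logistic fit over a grid of candidate times; the function name and arguments are hypothetical.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def most_predictive_time(trajectories, y, times):
        """trajectories: (n, T) array, one subject's signal per row, sampled
        at `times`; y: 0/1 responses. Returns the grid time whose value best
        predicts y."""
        y = np.asarray(y)
        best_loglik, best_t = -np.inf, None
        for j, t in enumerate(times):
            xj = trajectories[:, j:j + 1]                 # predictor X(t_j) only
            fit = LogisticRegression(C=1e6).fit(xj, y)    # essentially unpenalized MLE
            loglik = np.log(fit.predict_proba(xj)[np.arange(len(y)), y]).sum()
            if loglik > best_loglik:
                best_loglik, best_t = loglik, t
        return best_t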

Mar. 30, 2009:  A New Look at the Compound Poisson Distribution and Compound Poisson Approximation using Entropy
Mokshay Madiman
Department of Statistics, Yale University
We develop an information-theoretic foundation for compound Poisson approximation and limit theorems (analogous to the corresponding developments for the central limit theorem and for simple Poisson approximation). First, sufficient conditions are given under which the compound Poisson distribution has maximal entropy within a natural class of probability measures on the nonnegative integers. In particular, it is shown that a maximum entropy property is valid if the measures under consideration are log-concave, but that it fails in general. Second, approximation bounds in the (strong) relative entropy sense are given for distributional approximation of sums of independent nonnegative integer-valued random variables by compound Poisson distributions. The proof techniques involve the use of a notion of local information quantities that generalize the classical Fisher information used for normal approximation, as well as the use of ingredients from Stein's method for compound Poisson approximation.

This work is joint with Andrew Barbour (Zurich), Oliver Johnson (Bristol) and Ioannis Kontoyiannis (AUEB).
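
For concreteness, the object under study is the law of Z = X_1 + ... + X_N with N ~ Poisson(lambda) and the X_i i.i.d. from a distribution Q on the positive integers. The small sampler below (a hypothetical helper, not taken from the paper) just makes that definition concrete.

    import numpy as np

    def compound_poisson(lam, support, probs, size, seed=None):
        """Draw Z = X_1 + ... + X_N, N ~ Poisson(lam), X_i i.i.d. from Q
        with atoms `support` and probabilities `probs`."""
        rng = np.random.default_rng(seed)
        counts = rng.poisson(lam, size=size)
        return np.array([rng.choice(support, size=k, p=probs).sum() for k in counts])

    # e.g. compound_poisson(2.0, [1, 2, 3], [0.5, 0.3, 0.2], size=10)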

Dec. 1, 2008:  Conditional inference for assessing the statistical significance of neural spiking patterns
Matt Harrison
Department of Statistics, Carnegie Mellon University
Conditional inference has proven useful for exploratory analysis of neurophysiological point process data. I will illustrate this approach and then focus on two sub-problems: (1) uniform generation of binary matrices with marginal constraints and (2) multiple hypothesis testing for random measures. (1) Sequential importance sampling (SIS) is an effective technique for approximate uniform sampling of binary matrices with specified marginals. I will describe how to simplify and improve existing SIS procedures using improved asymptotic enumeration and dynamic programming (DP). The DP approach is interesting because it facilitates generalizations. (2) For point process data or functional data collected in different experimental conditions, it is often important to determine if the data are distributed differently in different conditions and to further localize where (in time or space, for example) the differences occur. This can be framed as a multiple testing problem for random measures. When differences might exist at multiple and/or unknown (spatio-temporal) scales, I call this multi-scale multiple testing because for each location there are many hypothesis tests corresponding to many potential scales. I will describe a non-parametric permutation test approach to this problem.

This is joint work with Stuart Geman and Asohan Amarasingham.
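
The SIS/DP scheme itself is the subject of the talk; as a point of reference for problem (1), here is the standard alternative it is usually compared against: the classical "checkerboard swap" Markov chain, whose symmetric moves preserve all row and column sums and whose stationary distribution is uniform over the constrained binary matrices.

    import numpy as np

    def checkerboard_step(A, rng):
        """One step of the swap chain on 0/1 integer matrices with fixed margins."""
        i, j = rng.choice(A.shape[0], size=2, replace=False)
        k, l = rng.choice(A.shape[1], size=2, replace=False)
        # A swap is legal iff the selected 2x2 submatrix is a checkerboard.
        if A[i, k] == A[j, l] and A[i, l] == A[j, k] and A[i, k] != A[i, l]:
            A[np.ix_((i, j), (k, l))] ^= 1   # flip the checkerboard; margins unchanged
        return A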

Nov. 17, 2008:  The Large Margin Unified Machine: A Bridge between Hard and Soft Classification
Yufeng Liu
Department of Statistics and Operations Research, University of North Carolina
Margin-based classifiers have been popular in both machine learning and statistics for classification problems. Among numerous classifiers, some are hard classifiers and some are soft ones. Soft classifiers explicitly estimate the class conditional probabilities and then perform classification based on the estimated probabilities. In contrast, hard classifiers directly target the classification decision boundary without producing probability estimates. These two types of classifiers are based on different philosophies and each has its own merits. In this talk, instead of making a choice between hard and soft classification, we propose a novel family of large-margin classifiers, namely large-margin unified machines (LUMs), which cover a broad range of margin-based classifiers including both hard and soft ones. The LUM family has close connections with some well-known large margin classifiers such as the Support Vector Machine and Boosting. By offering a natural bridge from soft to hard classification, the LUM provides a unified algorithm to fit various classifiers and hence a convenient platform to compare hard and soft classification.
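
For reference, the LUM family is indexed by two parameters; the loss below follows the parametrization in the authors' subsequently published work as best we can reconstruct it, so treat the exact formula as an assumption. Small c yields a smooth, soft-classification loss, while c growing large approaches the SVM's hinge-like, hard-classification behavior.

    import numpy as np

    def lum_loss(u, a=1.0, c=1.0):
        """LUM loss as a function of the functional margin u = y * f(x); a > 0, c >= 0."""
        u = np.asarray(u, dtype=float)
        thresh = c / (1.0 + c)
        z = np.maximum((1.0 + c) * u - c + a, 1e-12)   # guard the discarded branch
        # Linear (hinge-like) to the left of the threshold, smooth decay to the right;
        # the two pieces meet continuously at u = c / (1 + c).
        return np.where(u < thresh, 1.0 - u, (1.0 / (1.0 + c)) * (a / z) ** a)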

Nov. 10, 2008:  Dual Modality Network Tomography
George Michailidis
Department of Statistics and EECS, University of Michigan
In this talk, we discuss joint modeling mechanisms for packet volumes and byte volumes to perform computer network tomography, whose goal is to estimate characteristics of source-destination flows based on link measurements. Network tomography is a prototypical example of a linear inverse problem on graphs. We examine two generative models for the relation between packet and byte volumes, establish identifiability of their parameters and discuss different estimating procedures. The proposed estimators of the flow characteristics are evaluated using both simulated and emulated data. Finally, the proposed models allow us to estimate parameters of the packet size distribution, thus providing additional insights into the composition of network traffic.
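
The talk works with richer generative models that couple packet and byte counts; the toy below shows only the underlying linear inverse problem, y = A x, with a hypothetical three-link routing matrix and a nonnegative least-squares recovery of the unobserved flows.

    import numpy as np
    from scipy.optimize import nnls

    A = np.array([[1., 1., 0.],          # which origin-destination flows
                  [0., 1., 1.],          # traverse each of the three links
                  [1., 0., 1.]])
    x_true = np.array([5.0, 2.0, 7.0])   # unobserved flow volumes
    y = A @ x_true                       # observed link loads
    x_hat, _ = nnls(A, y)                # recovers x_true here, since A is invertible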

Nov. 3, 2008:  On Nonparametric Variable Selection
Kjell Doksum
University of Wisconsin
We consider regression experiments involving a response variable Y and a large number of predictor variables (X's), many of which may be of no value for the prediction of Y and thus need to be removed before predicting Y from the X's. This talk considers procedures that select variables by using importance scores that measure the strength of the relationship between predictor variables and a response. In the first of these procedures, scores are obtained by randomly drawing subregions (tubes) of the covariate space that constrain all but one predictor and computing, in each subregion, a signal-to-noise ratio (efficacy) based on a nonparametric univariate regression of Y on the unconstrained variable. The regions are adapted to boost weak variables iteratively by searching (hunting) for the regions where the efficacy is maximized. The efficacy can be viewed as an approximation to a one-to-one function of the probability of identifying features. By using importance scores based on averages of maximized efficacies, we develop a variable selection algorithm called EARTH (Efficacy Adaptive Regression Tube Hunting). The second importance-score method (RFVS) is based on using Random Forest importance values to select variables. Computer simulations show that EARTH and RFVS are successful variable selection methods when compared to other procedures in nonparametric situations with a large number of irrelevant predictor variables. Moreover, when each is combined with the model selection and prediction procedure MARS, the tree-based prediction procedure GUIDE, or the Random Forest prediction method, the combinations lead to improved prediction accuracy for certain models with many irrelevant variables. We give conditions under which a version of the EARTH algorithm selects the correct model with probability tending to one as the sample size n tends to infinity, even if the number of predictors d tends to infinity with n. We end with the analysis of a real data set.

(This is joint work with Shijie Tang and Kam Tsui.)
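
EARTH's tube-hunting machinery is beyond a few lines, but the second method mentioned, RFVS, is easy to sketch: rank predictors by Random Forest importance and keep the top-scoring ones. The toy data and the cutoff of two variables below are ours, purely for illustration.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 20))     # 20 predictors, only the first two relevant
    y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=200)

    forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    ranked = np.argsort(forest.feature_importances_)[::-1]
    selected = ranked[:2]              # should recover columns 0 and 1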

Oct. 27, 2008:  Inference of Patterns and Associations Using Dictionary Models
Jun Liu
Department of Statistics, Harvard University
Pattern discovery is a ubiquitous problem in many disciplines. It has become especially prominent in recent years due to our greatly improved data-generation capabilities in science and technology. The method I present here is motivated by the "motif-finding" and "module-finding" problems in biology, i.e., to find sequence patterns (i.e., "words") that appear more frequently than usual in a given set of text sequences (i.e., sentences), and to find which of these "words" tend to co-occur in a sentence. A challenge in the motif-finding problem is that there are no spaces or punctuation between the words, and the dictionary of "words" is unknown to us. Existing methods are mostly "bottom-up" approaches, i.e., they build up the dictionary starting with single-letter words and then concatenate existing words that appear next to each other in sentences more frequently than chance would predict. Our new approach is a top-down strategy, which uses a tree structure to represent the relationships among all possible existing words and uses the EM algorithm to estimate the usage frequency of each word. It automatically trims away most of the incorrect "words" by letting their usage frequencies converge to zero.

The module-finding problem is closely related to the well-known "market basket" problem, in which one attempts to mine association rules among the items in a supermarket based on customers' transaction records. It is also related to the two-way clustering problem. In this problem, we assume that the words are given, and our goal is to find subsets of words that tend to co-occur in a sentence. We call such a set of co-occurring words (not necessarily ordered) a "theme" or a "module". We can generalize the dictionary model to the "theme" model and use a similar EM strategy to infer these themes. I will demonstrate its applications in a few examples, including an analysis of Chinese medicine prescriptions and an analysis of a Chinese novel.

This is based on joint work with Ke Deng and Zhi Geng.
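
A toy rendition of the dictionary model's EM update for the "words in unpunctuated sentences" setting: expected word counts are computed by dynamic programming over all segmentations of each sentence, then renormalized. In the top-down strategy one would start from a large tree of candidate words and let the frequencies of spurious candidates shrink toward zero across iterations. The helper names here are ours, and every sentence is assumed segmentable with the current candidate set.

    from collections import defaultdict

    def forward(s, probs):
        """alpha[i] = total probability of all segmentations of the prefix s[:i]."""
        alpha = [0.0] * (len(s) + 1)
        alpha[0] = 1.0
        for i in range(1, len(s) + 1):
            for w, p in probs.items():
                if len(w) <= i and s[i - len(w):i] == w:
                    alpha[i] += alpha[i - len(w)] * p
        return alpha

    def backward(s, probs):
        """beta[i] = total probability of all segmentations of the suffix s[i:]."""
        beta = [0.0] * (len(s) + 1)
        beta[len(s)] = 1.0
        for i in range(len(s) - 1, -1, -1):
            for w, p in probs.items():
                j = i + len(w)
                if j <= len(s) and s[i:j] == w:
                    beta[i] += p * beta[j]
        return beta

    def em_step(sentences, probs):
        """One EM update of the word usage frequencies in `probs`."""
        counts = defaultdict(float)
        for s in sentences:
            alpha, beta = forward(s, probs), backward(s, probs)
            for i in range(len(s)):
                for w, p in probs.items():
                    j = i + len(w)
                    if j <= len(s) and s[i:j] == w:
                        # Posterior expected count of word w starting at position i.
                        counts[w] += alpha[i] * p * beta[j] / alpha[len(s)]
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}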

Sept. 29, 2008:  The Vine-Copula and Bayesian Belief Net Representation of High Dimensional Distributions
Roger Cooke
Resources for the Future and Department of Mathematics, Delft University of Technology
Regular vines are a graphical tool for representing complex high-dimensional distributions as bivariate and conditional bivariate distributions. Assigning marginal distributions to each variable and (conditional) copulae to each edge of the vine uniquely specifies the joint distribution, and every joint density can be represented (non-uniquely) in this way. From a vine-copula representation, an expression for the density and a sampling routine can be immediately derived. Moreover, the mutual information (which is the appropriate generalization of the determinant for non-linear dependencies) can be given an additive decomposition in terms of the conditional bivariate mutual informations at each edge of the vine. This means that minimal information completions of partially specified vine-copulae can be trivially constructed. The basic results on vines have recently been applied to derive similar representations for continuous, non-parametric Bayesian Belief Nets (BBNs). These are directed acyclic graphs in which influences (directed arcs) are interpreted in terms of conditional copulae. Interpreted in this way, BBNs inherit all the desirable properties of regular vines, and in addition have a more transparent graphical structure. New results concern 'optimal' vine-copula representations; that is, loosely, representations which capture the most dependence in the smallest number of edges. This development uses the mutual information decomposition theorem, the theory of majorization, and Schur convex functions.

Keywords: correlation, graphs, positive definite matrix, Bayesian Belief Nets, majorization, determinant, mutual information, Schur convex functions, model inference.
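
As a small illustration of the pair-copula construction and the sampling routine it yields, the sketch below draws from a three-dimensional C-vine whose pair-copulas are all Gaussian (parameters chosen arbitrarily by us); a general vine would swap in other conditional copulas edge by edge.

    import numpy as np
    from scipy.stats import norm

    def h(u, v, rho):
        """Conditional Gaussian copula h(u | v) = C_{u|v}(u, v)."""
        return norm.cdf((norm.ppf(u) - rho * norm.ppf(v)) / np.sqrt(1 - rho ** 2))

    def h_inv(t, v, rho):
        """Inverse of h in its first argument."""
        return norm.cdf(norm.ppf(t) * np.sqrt(1 - rho ** 2) + rho * norm.ppf(v))

    rng = np.random.default_rng(0)
    w = rng.uniform(size=(10_000, 3))       # independent uniform innovations
    r12, r13, r23_1 = 0.6, 0.4, 0.3         # pair-copula parameters (edges 12, 13, 23|1)

    u1 = w[:, 0]
    u2 = h_inv(w[:, 1], u1, r12)
    u3 = h_inv(h_inv(w[:, 2], h(u2, u1, r12), r23_1), u1, r13)
    sample = np.column_stack([u1, u2, u3])  # uniform margins, vine dependence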