Date  Speaker  Seminar Title 

Apr. 27, 2009 
Hongyu Zhao
Epidemiology and Public Health, Yale University 
Incorporating Prior Biological Knowledge in Genome-Wide Association Studies
[abstract] The last four years have seen great successes in many Genome-Wide Association Studies (GWAS), which have identified numerous genetic variants underlying complex traits. The analysis and interpretation of data from GWAS present great statistical and computational challenges, especially after the initial discoveries of variants carrying relatively large effects. Although various statistical approaches have been or are being developed to better analyze GWAS data, it has become apparent that the incorporation of information from prior studies and other sources is indispensable. In this presentation, we discuss our recently developed statistical methods and bioinformatics tools that are designed to more effectively integrate diverse types of prior biological information in analyzing GWAS data. The usefulness of these methods will be illustrated through their applications to some recent large-scale GWAS data.

Apr. 20, 2009 

no seminar 
Apr. 13, 2009 
Ioannis (or Yiannis) Kontoyiannis
Department of Informatics, Athens University of Economics & Business 
Control Variates for Reversible MCMC Samplers
[abstract] Control variates are a well-known and effective tool for variance reduction in Monte Carlo sampling. We will present a general methodology for the construction and effective use of control variates for reversible Markov chain Monte Carlo (MCMC) samplers. We will show that the values of the coefficients of the optimal linear combination of the control variates can be computed explicitly, and we will derive adaptive, consistent MCMC estimators for these optimal coefficients. Numerous MCMC simulation examples from Bayesian inference applications will be presented, demonstrating that the resulting variance reduction can be quite dramatic, often by a factor of the order of hundreds or thousands. Joint work with Petros Dellaportas. 
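The classical control-variate idea behind the talk can be sketched in a few lines. The sketch below uses plain Monte Carlo rather than the talk's reversible-MCMC construction, and the integrand, control variate, and sample size are hypothetical choices for illustration; the optimal coefficient Cov(f, g)/Var(g) is estimated from the same sample.

```python
import math
import random
import statistics

random.seed(0)
n = 100_000

# Plain Monte Carlo for E[exp(U)], U ~ Uniform(0,1); the true value is e - 1.
u = [random.random() for _ in range(n)]
f = [math.exp(x) for x in u]

# Control variate g(U) = U, whose mean E[g] = 1/2 is known exactly.
fbar = statistics.fmean(f)
gbar = statistics.fmean(u)
cov_fg = sum((fi - fbar) * (gi - gbar) for fi, gi in zip(f, u)) / (n - 1)
b = cov_fg / statistics.variance(u)  # optimal linear coefficient Cov(f,g)/Var(g)

# Adjusted estimator f - b*(g - E[g]): same mean, much smaller variance.
adj = [fi - b * (gi - 0.5) for fi, gi in zip(f, u)]
plain_est, cv_est = fbar, statistics.fmean(adj)
reduction = statistics.variance(f) / statistics.variance(adj)
print(plain_est, cv_est, reduction)
```

For this toy integrand the variance reduction is already substantial; the talk's point is that for reversible MCMC samplers suitable control variates, with explicitly computable optimal coefficients, can push such factors into the hundreds or thousands.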
Apr. 6, 2009 
Ian McKeague
Department of Biostatistics, Columbia University 
Logistic Regression with Brownian-like Predictors
[abstract] This talk introduces a new type of logistic regression model involving functional predictors of binary responses, along with an extension of the approach to generalized linear models. The predictors are trajectories that have certain sample-path properties in common with Brownian motion. Time points are treated as parameters of interest, and confidence intervals are developed under prospective and retrospective (case-control) sampling designs. In an application to fMRI data, signals from individual subjects are used to find the portion of the time course that is most predictive of the response. This allows the identification of sensitive time points, specific to a brain region and associated with a certain task, that can be used to distinguish between responses. A second application concerns gene expression data in a case-control study involving breast cancer, where the aim is to identify genetic loci along a chromosome that best discriminate between cases and controls. The talk is based on joint work with Martin Lindquist. 
Mar. 30, 2009 
Mokshay Madiman
Department of Statistics, Yale University 
A New Look at the Compound Poisson Distribution and Compound Poisson Approximation Using Entropy
[abstract] We develop an information-theoretic foundation for compound Poisson approximation and limit theorems (analogous to the corresponding developments for the central limit theorem and for simple Poisson approximation). First, sufficient conditions are given under which the compound Poisson distribution has maximal entropy within a natural class of probability measures on the nonnegative integers. In particular, it is shown that a maximum entropy property is valid if the measures under consideration are log-concave, but that it fails in general. Second, approximation bounds in the (strong) relative entropy sense are given for distributional approximation of sums of independent nonnegative integer-valued random variables by compound Poisson distributions. The proof techniques involve the use of a notion of local information quantities that generalize the classical Fisher information used for normal approximation, as well as the use of ingredients from Stein's method for compound Poisson approximation. This work is joint with Andrew Barbour (Zurich), Oliver Johnson (Bristol) and Ioannis Kontoyiannis (AUEB). 
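For reference, the compound Poisson law in question has the following standard form (a textbook definition, not material from the talk):

```latex
% Compound Poisson law CP(lambda, mu): the distribution of
%   S = X_1 + ... + X_N,   N ~ Poisson(lambda),   X_i i.i.d. ~ mu,
% with N independent of the X_i.  Its probability mass function on the
% nonnegative integers is a Poisson mixture of convolution powers of mu:
\[
  \mathrm{CP}(\lambda,\mu)\{k\}
  \;=\; \sum_{n=0}^{\infty} e^{-\lambda}\,\frac{\lambda^{n}}{n!}\;\mu^{*n}\{k\},
  \qquad k = 0, 1, 2, \dots
\]
```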
Mar. 23, 2009 
Jack Silverstein
Department of Mathematics, North Carolina State University 
Eigenvalues of Large Dimensional Random Matrices 
Mar. 9, 2009 

Spring Break: no seminar until March 23 
Mar. 2, 2009 
Andrew Barron
Yale University, Department of Statistics 
Superposition Codes with Polynomial Size Dictionary are Reliable at Rates Up to Channel Capacity 
Feb. 23, 2009 
Charles R. Johnson
Department of Mathematics, The College of William & Mary 
Determinantal Inequalities: Ancient History and Recent Advances 
Feb. 16, 2009 
Daniel Spielman
Computer Science Department, Yale University 
Graph Approximation and Local Clustering, with Applications to the Solution of Diagonally-Dominant Systems of Linear Equations 
Feb. 9, 2009 
Gideon Weiss
University of Haifa 
FCFS Infinite Bipartite Matching of Servers and Customers 
Feb. 2, 2009 
Hui Zou
School of Statistics, University of Minnesota 
Local CQR Smoothing 
Jan. 26, 2009 
Joerg Stoye
Department of Economics, New York University 
More on Confidence Intervals for Partially Identified Parameters 
Jan. 19, 2009 

Martin Luther King Jr. Day : no seminar 
Jan. 12, 2009 
Patrick Wolfe
Statistics and Information Sciences Laboratory, Harvard University 
Perspectives on Large-Scale Network Data: Blending Inference and Algorithms for Analysis 
Dec. 8, 2008 

Winter Break: no seminar until January 12 
Dec. 1, 2008 
Matt Harrison
Department of Statistics, Carnegie Mellon University 
Conditional inference for assessing the statistical significance of neural spiking patterns
[abstract] Conditional inference has proven useful for exploratory analysis of neurophysiological point process data. I will illustrate this approach and then focus on two subproblems: (1) uniform generation of binary matrices with marginal constraints and (2) multiple hypothesis testing for random measures. (1) Sequential importance sampling (SIS) is an effective technique for approximate uniform sampling of binary matrices with specified marginals. I will describe how to simplify and improve existing SIS procedures using improved asymptotic enumeration and dynamic programming (DP). The DP approach is interesting because it facilitates generalizations. (2) For point process data or functional data collected in different experimental conditions, it is often important to determine if the data are distributed differently in different conditions and to further localize where (in time or space, for example) the differences occur. This can be framed as a multiple testing problem for random measures. When differences might exist at multiple and/or unknown (spatiotemporal) scales, I call this multiscale multiple testing because for each location there are many hypothesis tests corresponding to many potential scales. I will describe a nonparametric permutation test approach to this problem. This is joint work with Stuart Geman and Asohan Amarasingham. 
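For the testing subproblem, the generic permutation-test idea can be sketched as follows. This is a plain two-sample difference-of-means test on simulated data, not the multiscale random-measure procedure of the talk; the sample sizes, shift, and permutation count are hypothetical choices for illustration.

```python
import random
import statistics

def permutation_test(x, y, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means,
    obtained by repeatedly shuffling the pooled sample."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(x) - statistics.fmean(y))
    pooled = list(x) + list(y)
    n, exceed = len(x), 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = abs(statistics.fmean(pooled[:n]) - statistics.fmean(pooled[n:]))
        if stat >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)  # add-one correction keeps p > 0

data_rng = random.Random(1)
a = [data_rng.gauss(0, 1) for _ in range(50)]
b = [data_rng.gauss(0, 1) for _ in range(50)]
p_null = permutation_test(a, b)   # same distribution: large p expected
c = [v + 1.5 for v in b]
p_alt = permutation_test(a, c)    # mean shifted by 1.5: tiny p expected
print(p_null, p_alt)
```

The multiscale setting of the talk runs many such tests, one per candidate location and scale, which is what creates the multiple-testing problem.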
Nov. 24, 2008 

Thanksgiving Break: no seminar 
Nov. 17, 2008 
Yufeng Liu
Department of Statistics and Operations Research, University of North Carolina 
The Large Margin Unified Machine: A Bridge between Hard and Soft Classification
[abstract] Margin-based classifiers have been popular in both machine learning and statistics for classification problems. Among numerous classifiers, some are hard classifiers and some are soft ones. Soft classifiers explicitly estimate the class conditional probabilities and then perform classification based on the estimated probabilities. In contrast, hard classifiers directly target the classification decision boundary without producing probability estimates. These two types of classifiers are based on different philosophies, and each has its own merits. In this talk, instead of making a choice between hard and soft classification, we propose a novel family of large-margin classifiers, namely large-margin unified machines (LUMs), which cover a broad range of margin-based classifiers including both hard and soft ones. The LUM family has close connections with some well-known large-margin classifiers such as the Support Vector Machine and Boosting. By offering a natural bridge from soft to hard classification, the LUM provides a unified algorithm to fit various classifiers and hence a convenient platform to compare hard and soft classification.
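The endpoints of the hard/soft spectrum can be made concrete with their canonical loss functions, written as functions of the margin u = y·f(x). This is a generic illustration of hinge versus logistic loss; the LUM loss family itself is not reproduced here.

```python
import math

def hinge_loss(u):
    """SVM (hard classification): zero loss past margin 1,
    no probability estimate."""
    return max(0.0, 1.0 - u)

def logistic_loss(u):
    """Logistic regression (soft classification): strictly positive
    everywhere, smooth."""
    return math.log(1.0 + math.exp(-u))

def prob_from_margin(f):
    """A soft classifier recovers P(Y = +1 | x) from the fitted margin."""
    return 1.0 / (1.0 + math.exp(-f))

for u in (-1.0, 0.0, 0.5, 2.0):
    print(f"margin {u:+.1f}: hinge {hinge_loss(u):.3f}, "
          f"logistic {logistic_loss(u):.3f}, P(Y=+1) {prob_from_margin(u):.3f}")
```

The hinge loss vanishes for well-classified points, so no probability information survives the fit; the logistic loss keeps penalizing, which is what makes probability recovery possible. The LUM family interpolates between these two behaviors.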

Nov. 10, 2008 
George Michailidis
Department of Statistics and EECS, University of Michigan 
Dual Modality Network Tomography
[abstract] In this talk, we discuss joint modeling mechanisms for packet volumes and byte volumes to perform computer network tomography, whose goal is to estimate characteristics of source-destination flows based on link measurements. Network tomography is a prototypical example of a linear inverse problem on graphs. We examine two generative models for the relation between packet and byte volumes, establish identifiability of their parameters and discuss different estimating procedures. The proposed estimators of the flow characteristics are evaluated using both simulated and emulated data. Finally, the proposed models allow us to estimate parameters of the packet size distribution, thus providing additional insights into the composition of network traffic.
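The linear-inverse structure of network tomography can be seen in a toy example: observed link loads y relate to unobserved flow volumes x through a known routing matrix, y = Ax. The three-link, two-flow network below is hypothetical, the measurements are noiseless, and recovery is by ordinary least squares rather than the talk's estimators.

```python
# Toy routing matrix A: rows = links, columns = source-destination flows.
# Link 1 carries flow 1, link 2 carries flow 2, link 3 carries both.
A = [[1, 0],
     [0, 1],
     [1, 1]]
x_true = [3.0, 5.0]  # unobserved flow volumes
y = [sum(a * x for a, x in zip(row, x_true)) for row in A]  # link loads

# Least-squares recovery via the 2x2 normal equations (A^T A) x = A^T y.
ata = [[sum(A[k][i] * A[k][j] for k in range(3)) for j in range(2)]
       for i in range(2)]
aty = [sum(A[k][i] * y[k] for k in range(3)) for i in range(2)]
det = ata[0][0] * ata[1][1] - ata[0][1] * ata[1][0]
x_hat = [(ata[1][1] * aty[0] - ata[0][1] * aty[1]) / det,
         (ata[0][0] * aty[1] - ata[1][0] * aty[0]) / det]
print(x_hat)  # recovers [3.0, 5.0]
```

In realistic networks there are far more flows than links, so A is not of full column rank and plain least squares fails; this is why generative models for the traffic, such as those in the talk, are needed to make the problem identifiable.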

Nov. 3, 2008 
Kjell Doksum
University of Wisconsin 
On Nonparametric Variable Selection
[abstract] We consider regression experiments involving a response variable Y and a large number of predictor variables (X's), many of which may be of no value for the prediction of Y and thus need to be removed before predicting Y from the X's. This talk considers procedures that select variables by using importance scores that measure the strength of the relationship between predictor variables and a response. In the first of these procedures, scores are obtained by randomly drawing subregions (tubes) of the covariate space that constrain all but one predictor and, in each subregion, computing a signal-to-noise ratio (efficacy) based on a nonparametric univariate regression of Y on the unconstrained variable. The regions are adapted to boost weak variables iteratively by searching (hunting) for the regions where the efficacy is maximized. The efficacy can be viewed as an approximation to a one-to-one function of the probability of identifying features. By using importance scores based on averages of maximized efficacies, we develop a variable selection algorithm called EARTH (Efficacy Adaptive Regression Tube Hunting). The second importance score method (RFVS) is based on using Random Forest importance values to select variables. Computer simulations show that EARTH and RFVS are successful variable selection methods when compared to other procedures in nonparametric situations with a large number of irrelevant predictor variables. Moreover, when each is combined with the model selection and prediction procedure MARS, the tree-based prediction procedure GUIDE, or the Random Forest prediction method, the combinations lead to improved prediction accuracy for certain models with many irrelevant variables. We give conditions under which a version of the EARTH algorithm selects the correct model with probability tending to one as the sample size tends to infinity, even if d tends to infinity as n tends to infinity. We end with the analysis of a real data set. 
(This is joint work with Shijie Tang and Kam Tsui.) 
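The importance-score idea can be illustrated with a deliberately crude stand-in: squared correlation in place of the nonparametric efficacy, computed globally rather than over adaptive tubes. The simulated data, the two relevant predictors, and all parameter choices below are hypothetical; this is not the EARTH algorithm itself.

```python
import random
import statistics

def score(x, y):
    """Squared correlation of x with y: a crude global stand-in for the
    signal-to-noise (efficacy) importance score."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov * cov / (vx * vy)

rng = random.Random(0)
n, d = 500, 10
X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
# Only predictors 0 and 1 matter; the other eight are irrelevant noise.
y = [row[0] + 0.5 * row[1] + rng.gauss(0, 1) for row in X]

scores = [score([row[j] for row in X], y) for j in range(d)]
ranked = sorted(range(d), key=lambda j: -scores[j])
print(ranked[:2])  # the two relevant predictors should rank first
```

A global correlation score would miss predictors whose effect is confined to a subregion of the covariate space; restricting the score to adaptively chosen tubes, as EARTH does, is what recovers such weak or local variables.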
Oct. 27, 2008 
Jun Liu
Department of Statistics, Harvard University 
Inference of Patterns and Associations Using Dictionary Models
[abstract] Pattern discovery is a ubiquitous problem in many disciplines. It is especially prominent in recent years due to our greatly improved data-generation capabilities in science and technology. The method I present here is motivated by the "motif-finding" and "module-finding" problems in biology, i.e., to find sequence patterns (i.e., "words") that seem to appear more frequently than usual in a given set of text sequences (i.e., sentences) and to find which of these "words" tend to co-occur in a sentence. A challenge in the motif-finding problem is that there are no spaces or punctuation between the words, and the dictionary of "words" is unknown to us. Existing methods are mostly "bottom-up" approaches, i.e., they build up the dictionary starting with single-letter words and then concatenate existing words that appear to occur next to each other in sentences more frequently than chance. Our new approach is a top-down strategy, which uses a tree structure to represent the relationship among all possible existing words and uses the EM algorithm to estimate the usage frequency of each word. It automatically trims down most of the incorrect "words" by letting their usage frequencies converge to zero. The module-finding problem is closely related to the well-known "market basket" problem, in which one attempts to mine association rules among the items in a supermarket based on customers' transaction records. It is also related to the two-way clustering problem. In this problem, we assume that the words are given, and our goal is to find subsets of words that tend to co-occur in a sentence. We call a set of co-occurring words (not necessarily ordered) a "theme" or a "module". We can generalize the dictionary model to the "theme" model and use a similar EM strategy to infer these themes. I will demonstrate its applications in a few examples, including an analysis of Chinese medicine prescriptions and an analysis of a Chinese novel. 
This is based on joint work with Ke Deng and Zhi Geng. 
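The EM step for estimating word usage frequencies can be sketched on a toy example. The candidate dictionary, the spurious word "ba", and the tiny corpus below are all hypothetical, and segmentations are enumerated by brute force rather than via the tree structure of the talk.

```python
def segmentations(text, words):
    """All ways of writing `text` as a concatenation of dictionary words."""
    if text == "":
        return [[]]
    segs = []
    for w in words:
        if w and text.startswith(w):
            segs.extend([w] + rest
                        for rest in segmentations(text[len(w):], words))
    return segs

def em_step(texts, theta):
    """One EM iteration: expected word counts under the current usage
    frequencies (E-step), then renormalization (M-step)."""
    counts = {w: 0.0 for w in theta}
    for t in texts:
        segs = segmentations(t, list(theta))
        weights = []
        for s in segs:
            p = 1.0
            for w in s:
                p *= theta[w]
            weights.append(p)
        z = sum(weights)
        for s, p in zip(segs, weights):
            for w in s:
                counts[w] += p / z
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Candidate dictionary: "ab" competes with "a"+"b"; "ba" never occurs.
texts = ["ab", "ab", "ab", "a", "b"]
theta = {w: 0.25 for w in ("a", "b", "ab", "ba")}
for _ in range(100):
    theta = em_step(texts, theta)
print(theta)  # the spurious word "ba" is driven to frequency 0
```

This shows the trimming mechanism described in the abstract: words that never help explain the corpus get zero expected count, so their usage frequency collapses, while the genuinely competing segmentations ("ab" versus "a" + "b") settle at intermediate frequencies.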
Oct. 20, 2008 
Alexander Barvinok
Department of Mathematics, University of Michigan 
What does a random contingency table look like? 
Oct. 13, 2008 
Sid Resnick
Cornell University 
Detection of the Conditional Extreme Value Model 
Oct. 8, 2008 
Hannes Leeb
Department of Statistics, Yale University 
(informal seminar) 
Oct. 6, 2008 
Bing Li
Department of Statistics, Penn State University 
Dimension Reduction for Non-Elliptically Distributed Predictors: Second-Order Methods 
Sept. 29, 2008 
Roger Cooke
Resources for the Future and Department of Mathematics, Delft University of Technology 
The Vine-Copula and Bayesian Belief Net Representation of High-Dimensional Distributions
[abstract] Regular vines are a graphical tool for representing complex high-dimensional distributions via bivariate and conditional bivariate distributions. Assigning marginal distributions to each variable and (conditional) copulae to each edge of the vine uniquely specifies the joint distribution, and every joint density can be represented (non-uniquely) in this way. From a vine-copula representation, an expression for the density and a sampling routine can be immediately derived. Moreover, the mutual information (which is the appropriate generalization of the determinant for nonlinear dependencies) can be given an additive decomposition in terms of the conditional bivariate mutual informations at each edge of the vine. This means that minimal information completions of partially specified vine-copulae can be trivially constructed. The basic results on vines have recently been applied to derive similar representations for continuous, nonparametric Bayesian Belief Nets (BBNs). These are directed acyclic graphs in which influences (directed arcs) are interpreted in terms of conditional copulae. Interpreted in this way, BBNs inherit all the desirable properties of regular vines, and in addition have a more transparent graphical structure. New results concern 'optimal' vine-copula representations; that is, loosely, representations which capture the most dependence in the smallest number of edges. This development uses the mutual information decomposition theorem, the theory of majorization, and Schur convex functions. Keywords: correlation, graphs, positive definite matrix, Bayesian Belief Nets, majorization, determinant, mutual information, Schur convex functions, model inference.
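The basic building block of any vine, one copula edge with margins attached by inverse-CDF transforms, can be sketched as follows. The Gaussian copula and Exp(1) margins are arbitrary choices for illustration, not examples from the talk.

```python
import math
import random
from statistics import NormalDist

def gaussian_copula_pair(rho, rng):
    """One draw (u, v) from the bivariate Gaussian copula: correlated
    standard normals pushed through the normal CDF to uniform margins."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    nd = NormalDist()
    return nd.cdf(z1), nd.cdf(z2)

# Attach arbitrary margins via inverse CDFs: here Exp(1) on both coordinates.
rng = random.Random(0)
xs, ys = [], []
for _ in range(20_000):
    u, v = gaussian_copula_pair(0.8, rng)
    xs.append(-math.log(1.0 - u))  # inverse CDF of Exp(1)
    ys.append(-math.log(1.0 - v))

mean_x = sum(xs) / len(xs)  # should be near 1, the Exp(1) mean
mx, my = mean_x, sum(ys) / len(ys)
corr = (sum((a - mx) * (b - my) for a, b in zip(xs, ys)) /
        math.sqrt(sum((a - mx) ** 2 for a in xs) *
                  sum((b - my) ** 2 for b in ys)))
print(mean_x, corr)
```

A regular vine composes many such edges, with the copulae on deeper trees conditioned on the variables of earlier trees, which is how the construction scales to high dimensions.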

Sept. 22, 2008 
Paul Kabaila
La Trobe University, Department of Mathematics and Statistics 
Confidence Intervals in Regression Utilizing Prior Information 
Sept. 15, 2008 
Muni S. Srivastava
University of Toronto, Department of Statistics 
Analyzing High Dimensional Data with Fewer Observations 
Sept. 12, 2008 
Larry Shepp
Rutgers University, Department of Statistics 
A Mathematical Approach to Managing Diabetes 