Yale Statistics Department Seminars: 2008-09

Date Speaker Seminar Title
Apr. 27, 2009 Hongyu Zhao
Epid/Public Health, Yale University
Incorporating Prior Biological Knowledge in Genome-Wide Association Studies
The last four years have seen great successes in many Genome-Wide Association Studies (GWAS), which have identified numerous genetic variants underlying complex traits. The analysis and interpretation of data from GWAS present great statistical and computational challenges, especially after the initial discoveries of variants carrying relatively large effects. Although various statistical approaches have been or are being developed to better analyze GWAS data, it has become apparent that the incorporation of information from prior studies and other sources is indispensable. In this presentation, we discuss our recently developed statistical methods and bioinformatics tools that are designed to more effectively integrate diverse types of prior biological information in analyzing GWAS data. The usefulness of these methods will be illustrated through their applications to some recent large-scale GWAS data.
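As a generic illustration of how prior knowledge can enter an association scan, here is the standard weighted-Bonferroni device (not necessarily any of the speaker's methods; the p-values and weights below are invented):

```python
def weighted_bonferroni(pvals, weights, alpha=0.05):
    # Reject hypothesis i when p_i <= alpha * w_i / sum(w): variants with
    # higher prior weight face a more lenient threshold. Equal weights
    # recover the ordinary Bonferroni correction.
    total = sum(weights)
    return [p <= alpha * w / total for p, w in zip(pvals, weights)]

pvals = [1e-6, 0.02, 0.03]    # hypothetical per-SNP association p-values
weights = [1.0, 10.0, 1.0]    # prior biology up-weights the second SNP
hits = weighted_bonferroni(pvals, weights)   # [True, True, False]
```

Under plain Bonferroni (threshold 0.05/3 ≈ 0.017), the second SNP (p = 0.02) would be missed; the prior weight rescues it without inflating the overall error budget.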
Apr. 20, 2009
no seminar
Apr. 13, 2009 Ioannis (or Yiannis) Kontoyiannis
Department of Informatics, Athens University of Economics and Business
Control Variates for Reversible MCMC Samplers
Control variates are a well-known and effective tool for variance reduction in Monte Carlo sampling. We will present a general methodology for the construction and effective use of control variates for reversible Markov chain Monte Carlo (MCMC) samplers. We will show that the values of the coefficients of the optimal linear combination of the control variates can be computed explicitly, and we will derive adaptive, consistent MCMC estimators for these optimal coefficients. Numerous MCMC simulation examples from Bayesian inference applications will be presented, demonstrating that the resulting variance reduction can be quite dramatic, often by a factor of the order of hundreds or thousands.

Joint work with Petros Dellaportas.
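To fix ideas, here is the textbook control-variate mechanism in plain (non-MCMC) Monte Carlo, with the optimal linear coefficient estimated from the sample; the reversible-MCMC construction in the talk is considerably more general. The integrand e^U and the control U are illustrative choices:

```python
import math
import random

random.seed(0)

def estimate(n=100_000):
    # Estimate E[exp(U)], U ~ Uniform(0,1), using U itself as a control
    # variate (its mean, 1/2, is known exactly).
    xs = [random.random() for _ in range(n)]
    fs = [math.exp(x) for x in xs]
    mf = sum(fs) / n
    mx = sum(xs) / n
    # Optimal coefficient b* = Cov(f(U), U) / Var(U), estimated here
    # from the same sample.
    cov = sum((f - mf) * (x - mx) for f, x in zip(fs, xs)) / n
    var = sum((x - mx) ** 2 for x in xs) / n
    b = cov / var
    plain = mf                    # ordinary Monte Carlo estimate
    cv = mf - b * (mx - 0.5)      # control-variate estimate
    return plain, cv

plain, cv = estimate()            # both near e - 1 = 1.71828...
```

The control-variate estimate has the same expectation as the plain one but a much smaller variance, since the fitted linear part of f(U) is integrated exactly.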
Apr. 6, 2009 Ian McKeague
Department of Biostatistics Columbia University
Logistic Regression with Brownian-like Predictors
This talk introduces a new type of logistic regression model involving functional predictors of binary responses, along with an extension of the approach to generalized linear models. The predictors are trajectories that have certain sample-path properties in common with Brownian motion. Time points are treated as parameters of interest, and confidence intervals developed under prospective and retrospective (case-control) sampling designs. In an application to fMRI data, signals from individual subjects are used to find the portion of the time course that is most predictive of the response. This allows the identification of sensitive time points, specific to a brain region and associated with a certain task, that can be used to distinguish between responses. A second application concerns gene expression data in a case-control study involving breast cancer, where the aim is to identify genetic loci along a chromosome that best discriminate between cases and controls.

The talk is based on joint work with Martin Lindquist.
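A small simulation sketch of the time-point-as-parameter idea: generate Brownian-motion trajectories, let the binary response depend on the path value at one time point, and scan candidate time points with a one-predictor logistic fit. Everything here (the synthetic data and the ridge-stabilized Newton fit) is illustrative, not the talk's estimator or its inference theory:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_loglik(x, y, lam=1e-2, iters=30):
    # One-predictor logistic regression fitted by Newton-Raphson, with a
    # small ridge penalty lam for numerical stability; returns the
    # maximized log-likelihood.
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        eta = np.clip(X @ beta, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-eta))
        grad = X.T @ (y - p) - lam * beta
        hess = X.T @ (X * (p * (1.0 - p))[:, None]) + lam * np.eye(2)
        beta += np.linalg.solve(hess, grad)
    eta = np.clip(X @ beta, -30.0, 30.0)
    p = 1.0 / (1.0 + np.exp(-eta))
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Brownian-like trajectories; the response depends on the path at t* = 30.
n, T, t_star = 200, 50, 30
paths = np.cumsum(rng.normal(size=(n, T)), axis=1)
y = (paths[:, t_star] + rng.normal(scale=0.5, size=n) > 0).astype(float)

loglik = np.array([fit_loglik(paths[:, t], y) for t in range(T)])
t_hat = int(np.argmax(loglik))    # estimated sensitive time point
```

Because neighboring path values are highly correlated, the profile of log-likelihoods is smooth and peaks near the true time point.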
Mar. 30, 2009 Mokshay Madiman
Department of Statistics Yale University
A New Look at the Compound Poisson Distribution and Compound Poisson Approximation using Entropy
We develop an information-theoretic foundation for compound Poisson approximation and limit theorems (analogous to the corresponding developments for the central limit theorem and for simple Poisson approximation). First, sufficient conditions are given under which the compound Poisson distribution has maximal entropy within a natural class of probability measures on the nonnegative integers. In particular, it is shown that a maximum entropy property is valid if the measures under consideration are log-concave, but that it fails in general. Second, approximation bounds in the (strong) relative entropy sense are given for distributional approximation of sums of independent nonnegative integer-valued random variables by compound Poisson distributions. The proof techniques involve the use of a notion of local information quantities that generalize the classical Fisher information used for normal approximation, as well as the use of ingredients from Stein's method for compound Poisson approximation.

This work is joint with Andrew Barbour (Zurich), Oliver Johnson (Bristol) and Ioannis Kontoyiannis (AUEB).
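For reference, the compound Poisson distribution CP(λ, Q) is the law of S = X₁ + … + X_N, where N ~ Poisson(λ) is independent of the i.i.d. X_i ~ Q on the positive integers. A quick simulation sketch (the sampler and the toy choice of Q are illustrative only):

```python
import math
import random

random.seed(0)

def poisson(lam):
    # Knuth's multiplicative method; adequate for small lam.
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def compound_poisson(lam, values, probs):
    # Draw S = X_1 + ... + X_N, N ~ Poisson(lam), X_i i.i.d. on `values`.
    n = poisson(lam)
    return sum(random.choices(values, weights=probs, k=n)) if n else 0

samples = [compound_poisson(2.0, [1, 2], [0.5, 0.5]) for _ in range(20_000)]
mean = sum(samples) / len(samples)   # E[S] = lam * E[X] = 2 * 1.5 = 3
```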
Mar. 23, 2009 Jack Silverstein
Department of Mathematics North Carolina State University
Eigenvalues of Large Dimensional Random Matrices
Mar. 9, 2009
Spring Break: no seminar until March 23
Mar. 2, 2009 Andrew Barron
Yale University Department of Statistics
Superposition Codes with Polynomial Size Dictionary are Reliable at Rates Up to Channel Capacity
Feb. 23, 2009 Charles R. Johnson
Department of Mathematics The College of William & Mary
Determinantal Inequalities: Ancient History and Recent Advances
Feb. 16, 2009 Daniel Spielman
Computer Science Department Yale University
Graph Approximation and Local Clustering, with Applications to the Solution of Diagonally Dominant Systems of Linear Equations
Feb. 9, 2009 Gideon Weiss
University of Haifa
FCFS Infinite Bipartite Matching of Servers and Customers
Feb. 2, 2009 Hui Zou
School of Statistics University of Minnesota
Local CQR Smoothing
Jan. 26, 2009 Joerg Stoye
Department of Economics New York University
More on Confidence Intervals for Partially Identified Parameters
Jan. 19, 2009
Martin Luther King Jr. Day : no seminar
Jan. 12, 2009 Patrick Wolfe
Statistics and Information Sciences Laboratory Harvard University
Perspectives on Large-Scale Network Data: Blending Inference and Algorithms for Analysis
Dec. 8, 2008
Winter Break: no seminar until January 12
Dec. 1, 2008 Matt Harrison
Department of Statistics, Carnegie Mellon University
Conditional inference for assessing the statistical significance of neural spiking patterns
Conditional inference has proven useful for exploratory analysis of neurophysiological point process data. I will illustrate this approach and then focus on two sub-problems: (1) uniform generation of binary matrices with marginal constraints and (2) multiple hypothesis testing for random measures. (1) Sequential importance sampling (SIS) is an effective technique for approximate uniform sampling of binary matrices with specified marginals. I will describe how to simplify and improve existing SIS procedures using improved asymptotic enumeration and dynamic programming (DP). The DP approach is interesting because it facilitates generalizations. (2) For point process data or functional data collected in different experimental conditions, it is often important to determine if the data are distributed differently in different conditions and to further localize where (in time or space, for example) the differences occur. This can be framed as a multiple testing problem for random measures. When differences might exist at multiple and/or unknown (spatio-temporal) scales, I call this multi-scale multiple testing because for each location there are many hypothesis tests corresponding to many potential scales. I will describe a non-parametric permutation test approach to this problem.

This is joint work with Stuart Geman and Asohan Amarasingham.
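The basic two-sample permutation recipe behind such tests looks like this (an ordinary single-scale difference in means on made-up numbers, not the multi-scale random-measure machinery of the talk):

```python
import random

random.seed(0)

def perm_test(a, b, n_perm=10_000):
    # Two-sample permutation test for a difference in means: repeatedly
    # shuffle the condition labels and recompute the statistic, then
    # compare the observed statistic to the shuffled ones.
    obs = sum(a) / len(a) - sum(b) / len(b)
    pooled = a + b
    k = len(a)
    count = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        stat = sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k)
        if abs(stat) >= abs(obs):
            count += 1
    return (count + 1) / (n_perm + 1)   # add-one correction keeps p > 0

a = [3.1, 2.9, 3.4, 3.2, 3.0, 3.3]   # condition 1 (e.g. firing rates)
b = [2.1, 2.4, 2.0, 2.2, 2.5, 2.3]   # condition 2
p = perm_test(a, b)                  # small p: distributions clearly differ
```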
Nov. 24, 2008
Thanksgiving Break: no seminar
Nov. 17, 2008 Yufeng Liu
Department of Statistics and Operations Research, University of North Carolina
The Large Margin Unified Machine: A Bridge between Hard and Soft Classification
Margin-based classifiers have been popular in both machine learning and statistics for classification problems. Among numerous classifiers, some are hard classifiers and some are soft ones. Soft classifiers explicitly estimate the class conditional probabilities and then perform classification based on the estimated probabilities. In contrast, hard classifiers directly target the classification decision boundary without producing probability estimates. These two types of classifiers are based on different philosophies and each has its own merits. In this talk, instead of making a choice between hard and soft classification, we propose a novel family of large-margin classifiers, namely large-margin unified machines (LUMs), which cover a broad range of margin-based classifiers including both hard and soft ones. The LUM family has close connections with some well-known large margin classifiers such as the Support Vector Machine and Boosting. By offering a natural bridge from soft to hard classification, the LUM provides a unified algorithm to fit various classifiers and hence a convenient platform to compare hard and soft classification.
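For orientation, the two extremes that such a family bridges can be written down directly as functions of the margin u = y·f(x); the functions below are the familiar hard and soft endpoint losses, not the LUM loss itself:

```python
import math

def hinge(u):
    # SVM hinge loss: a prototypical "hard" classifier loss. Once u >= 1
    # the loss is zero, so it does not reward extra confidence and does
    # not recover class probabilities.
    return max(0.0, 1.0 - u)

def deviance(u):
    # Logistic deviance: a prototypical "soft" classifier loss. Its
    # minimizer estimates probabilities via p(x) = 1 / (1 + exp(-f(x))).
    return math.log(1.0 + math.exp(-u))

for u in (-1.0, 0.0, 1.0, 2.0):
    print(f"u={u:+.1f}  hinge={hinge(u):.3f}  deviance={deviance(u):.3f}")
```

Both losses decrease in the margin; they differ in how they treat confidently correct points, which is exactly the hard-versus-soft distinction the talk's family interpolates.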
Nov. 10, 2008 George Michailidis
Dept of Statistics and EECS, University of Michigan
Dual Modality Network Tomography
In this talk, we discuss joint modeling mechanisms for packet volumes and byte volumes to perform computer network tomography, whose goal is to estimate characteristics of source-destination flows based on link measurements. Network tomography is a prototypical example of a linear inverse problem on graphs. We examine two generative models for the relation between packet and byte volumes, establish identifiability of their parameters and discuss different estimating procedures. The proposed estimators of the flow characteristics are evaluated using both simulated and emulated data. Finally, the proposed models allow us to estimate parameters of the packet size distribution, thus providing additional insights into the composition of network traffic.
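The linear inverse problem can be made concrete with a toy routing matrix (the network and volumes below are made up; real tomography problems are typically ill-posed, which is why the talk's generative models matter for identifiability):

```python
import numpy as np

# Toy network: 3 origin-destination flows observed only through 4 links.
# A[i, j] = 1 when flow j traverses link i (the routing matrix).
A = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 1]], dtype=float)    # a shared link carrying every flow
x_true = np.array([5.0, 2.0, 7.0])        # unobserved per-flow volumes
y = A @ x_true                            # observed per-link volumes
x_hat, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares flow estimate
```

Here the first three links pin down each flow, so recovery is exact; drop them and only the shared link remains, leaving the individual flows unidentifiable from link totals alone.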
Nov. 3, 2008 Kjell Doksum
University of Wisconsin
On Nonparametric Variable Selection
We consider regression experiments involving a response variable Y and a large number of predictor variables (X's), many of which may be of no value for the prediction of Y and thus need to be removed before predicting Y from the X's. This talk considers procedures that select variables by using importance scores that measure the strength of the relationship between predictor variables and a response. In the first of these procedures, scores are obtained by randomly drawing subregions (tubes) of the covariate space that constrain all but one predictor and, in each subregion, computing a signal-to-noise ratio (efficacy) based on a nonparametric univariate regression of Y on the unconstrained variable. The regions are adapted to boost weak variables iteratively by searching (hunting) for the regions where the efficacy is maximized. The efficacy can be viewed as an approximation to a one-to-one function of the probability of identifying features. By using importance scores based on averages of maximized efficacies, we develop a variable selection algorithm called EARTH (Efficacy Adaptive Regression Tube Hunting). The second importance score method (RFVS) is based on using Random Forest importance values to select variables. Computer simulations show that EARTH and RFVS are successful variable selection methods when compared to other procedures in nonparametric situations with a large number of irrelevant predictor variables. Moreover, when each is combined with the model selection and prediction procedure MARS, the tree-based prediction procedure GUIDE, or the Random Forest prediction method, the combinations lead to improved prediction accuracy for certain models with many irrelevant variables. We give conditions under which a version of the EARTH algorithm selects the correct model with probability tending to one as the sample size tends to infinity, even if the number of predictors d tends to infinity with the sample size n. We end with the analysis of a real data set.

(This is joint work with Shijie Tang and Kam Tsui.)
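The importance-score idea can be caricatured in a few lines: score each predictor by how much the response's local means vary along it, then keep the top scorers. The score below is only a crude one-dimensional stand-in (bin means over quantile bins), not the EARTH tube-hunting algorithm or Random Forest importances, and the data are simulated:

```python
import numpy as np

rng = np.random.default_rng(1)

def efficacy_score(x, y, bins=10):
    # Crude signal-to-noise score: variance of the bin-means of y across
    # quantile bins of x. A flat relationship gives a score near zero.
    edges = np.quantile(x, np.linspace(0.0, 1.0, bins + 1))
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
    means = np.array([y[idx == b].mean() for b in range(bins)])
    return float(np.var(means))

# 2 relevant predictors among 20; y depends nonlinearly on x0 and x1 only.
n, d = 1000, 20
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.3 * rng.normal(size=n)

scores = np.array([efficacy_score(X[:, j], y) for j in range(d)])
selected = np.argsort(scores)[-2:]    # keep the top-scoring predictors
```

With 1000 observations the two relevant predictors score an order of magnitude above the noise predictors, whose bin means fluctuate only by sampling error.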
Oct. 27, 2008 Jun Liu
Department of Statistics, Harvard University
Inference of Patterns and Associations Using Dictionary Models
Pattern discovery is a ubiquitous problem in many disciplines. It is especially prominent in recent years due to our greatly improved data-generation capabilities in science and technology. The method I present here is motivated by the "motif-finding" and "module-finding" problems in biology, i.e., to find sequence patterns (i.e., "words") that appear more frequently than expected in a given set of text sequences (i.e., sentences) and to find which of these "words" tend to co-occur in a sentence. A challenge in the motif-finding problem is that there are no spaces or punctuation between the words, and the dictionary of "words" is unknown to us. Existing methods are mostly "bottom-up" approaches: they build up the dictionary starting with single-letter words and then concatenate existing words that appear next to each other in sentences more frequently than chance would suggest. Our new approach is a top-down strategy, which uses a tree structure to represent the relationship among all possible existing words and uses the EM algorithm to estimate the usage frequency of each word. It automatically trims down most of the incorrect "words" by letting their usage frequencies converge to zero.

The module-finding problem is closely related to the well-known "market basket" problem, in which one attempts to mine association rules among the items in a supermarket based on customers' transaction records. It is also related to the two-way clustering problem. In this problem, we assume that the words are given, and our goal is to find subsets of words that tend to co-occur in a sentence. We call the set of co-occurring words (not necessarily orderly) a "theme" or a "module". We can generalize the dictionary model to the "theme"-model and use a similar EM-strategy to infer these themes. I will demonstrate its applications in a few examples including an analysis of Chinese medicine prescriptions and an analysis of a Chinese novel.

This is joint work with Ke Deng and Zhi Geng.
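A tiny runnable version of the unigram dictionary model conveys the EM mechanics: each sentence is an unknown concatenation of dictionary words, a forward-backward pass computes expected word usage, and spurious candidate words see their frequencies shrink toward zero. The toy corpus and flat candidate list are invented; the talk's top-down method organizes candidates in a tree:

```python
from collections import defaultdict

def em_segment(sentences, words, iters=20):
    # EM for word usage frequencies theta under a unigram dictionary model.
    theta = {w: 1.0 / len(words) for w in words}
    for _ in range(iters):
        counts = defaultdict(float)
        for s in sentences:
            n = len(s)
            # alpha[i]: total probability of all segmentations of s[:i]
            alpha = [0.0] * (n + 1); alpha[0] = 1.0
            for i in range(1, n + 1):
                for w in words:
                    if i >= len(w) and s[i - len(w):i] == w:
                        alpha[i] += alpha[i - len(w)] * theta[w]
            # beta[i]: total probability of all segmentations of s[i:]
            beta = [0.0] * (n + 1); beta[n] = 1.0
            for i in range(n - 1, -1, -1):
                for w in words:
                    if i + len(w) <= n and s[i:i + len(w)] == w:
                        beta[i] += theta[w] * beta[i + len(w)]
            if alpha[n] == 0.0:
                continue   # sentence not segmentable with this dictionary
            # E-step: expected number of uses of each word in this sentence
            for i in range(n):
                for w in words:
                    j = i + len(w)
                    if j <= n and s[i:j] == w:
                        counts[w] += alpha[i] * theta[w] * beta[j] / alpha[n]
        total = sum(counts.values())
        # M-step: renormalize expected counts into usage frequencies
        theta = {w: counts[w] / total for w in words}
    return theta

sentences = ["thecatsat", "thecat", "cat", "catsat", "sat"]
words = ["the", "cat", "sat", "theca", "t"]   # "theca" and "t" are spurious
theta = em_segment(sentences, words)
```

Because "cat" and "sat" also occur as whole sentences, segmentations using the true words are reinforced at each iteration, and the frequencies of "theca" and "t" decay toward zero.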
Oct. 20, 2008 Alexander Barvinok
Department of Mathematics University of Michigan
What does a random contingency table look like?
Oct. 13, 2008 Sid Resnick
Cornell University
Detection of the Conditional Extreme Value Model
Oct. 8, 2008 Hannes Leeb
Department of Statistics Yale University
(informal seminar)
Oct. 6, 2008 Bing Li
Department of Statistics Penn State University
Dimension Reduction for Non-Elliptically Distributed Predictors: Second-Order Methods
Sept. 29, 2008 Roger Cooke
Resources for the Future and Department of Mathematics, Delft University of Technology
The Vine-Copula and Bayesian Belief Net Representation of High Dimensional Distributions
Regular vines are a graphical tool for representing complex high dimensional distributions as bivariate and conditional bivariate distributions. Assigning marginal distributions to each variable and (conditional) copulae to each edge of the vine uniquely specifies the joint distribution, and every joint density can be represented (non-uniquely) in this way. From a vine-copulae representation an expression for the density and a sampling routine can be immediately derived. Moreover the mutual information (which is the appropriate generalization of the determinant for non-linear dependencies) can be given an additive decomposition in terms of the conditional bivariate mutual informations at each edge of the vine. This means that minimal information completions of partially specified vine-copulae can be trivially constructed. The basic results on vines have recently been applied to derive similar representations for continuous, non-parametric Bayesian Belief Nets (BBNs). These are directed acyclic graphs in which influences (directed arcs) are interpreted in terms of conditional copulae. Interpreted in this way, BBNs inherit all the desirable properties of regular vines, and in addition have a more transparent graphical structure. New results concern 'optimal' vine-copulae representations; that is, loosely, representations which capture the most dependence in the smallest number of edges. This development uses the mutual information decomposition theorem, the theory of majorization and Schur convex functions.

Keywords: correlation, graphs, positive definite matrix, Bayesian Belief Nets, majorization, determinant, mutual information, Schur convex functions, model inference.
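The elementary building block, a single (unconditional) pair-copula with assigned marginals, can be sketched as follows; the full vine construction chains conditional versions of this step along the tree. The Gaussian copula and the exponential marginal here are illustrative choices:

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

n, rho = 20_000, 0.8
# Sample the pair-copula: correlated normals pushed through their CDF
# give uniform margins with Gaussian dependence.
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
u = np.vectorize(norm_cdf)(z)

x = -np.log(1.0 - u[:, 0])   # assign an Exponential(1) marginal to variable 1
y = u[:, 1]                  # leave variable 2 with a uniform marginal

# Pearson correlation of the uniforms estimates Spearman's rho,
# (6/pi) * arcsin(rho/2) ~ 0.786 here; it is unchanged by whichever
# marginals are assigned afterwards, which is the point of the separation.
rank_corr = float(np.corrcoef(u[:, 0], u[:, 1])[0, 1])
```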
Sept. 22, 2008 Paul Kabaila
La Trobe University Department of Mathematics and Statistics
Confidence Intervals in Regression Utilizing Prior Information
Sept. 15, 2008 Muni S. Srivastava
University of Toronto Department of Statistics
Analyzing High Dimensional Data with Fewer Observations
Sept. 12, 2008 Larry Shepp
Rutgers University Department of Statistics
A Mathematical Approach to Managing Diabetes

Revised: 3 July 2009