Publications of Joe Chang

Reconstruction of evolutionary trees from pairwise distributions on current species.  Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, E. M. Keramidas, Ed., 254-257, 1991. (With John Hartigan.) Postscript [4 pages, 287 Kb].

Established identifiability and consistency of a distance method of evolutionary tree topology reconstruction. Other popular methods had been shown earlier to be inconsistent. As far as I know, this is the first paper showing any method to be consistent for the general class of Markov models on evolutionary trees. This was done by introducing a log-determinant distance function on the tree.
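
For readers who want the flavor of the distance, here is a minimal sketch of a log-determinant distance computed from the empirical joint distribution of character states at two current species. The function name and the exact normalization are illustrative assumptions in the style of logdet/paralinear distances; the paper's precise definition may differ in its details.

    import numpy as np

    def logdet_distance(F):
        """Logdet distance from an r x r joint state-frequency matrix F.

        F[i, j] estimates the probability that species x shows state i
        and species y shows state j at a randomly chosen character.
        """
        fx = F.sum(axis=1)  # marginal state frequencies at species x
        fy = F.sum(axis=0)  # marginal state frequencies at species y
        # -log det F, recentered by the marginals so that two identical
        # species get distance 0; a distance of this form is additive
        # along paths in the tree under general Markov models, which is
        # what makes consistent topology reconstruction possible.
        return -np.log(np.linalg.det(F)) + 0.5 * (np.log(fx).sum() + np.log(fy).sum())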

Learning rate schedules for faster stochastic gradient search.  Neural Networks for Signal Processing 2 -- Proceedings of the 1992 IEEE Workshop, IEEE, Piscataway, NJ, 3-12, 1992. (With Chris Darken and John Moody.)

Stochastic approximation methods are the basis of a number of important algorithms, for example, for clustering data and training neural networks. Here we propose a method of selecting the learning rate in stochastic approximation procedures, designed to achieve near-optimal rates of convergence.
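
As a concrete illustration, the following sketch implements a "search-then-converge" style schedule of the kind studied in this line of work: roughly constant early on, then decaying like $c/t$, the decay rate needed for near-optimal asymptotic convergence. The constants eta0 and tau are illustrative tuning parameters, not values recommended in the paper.

    def learning_rate(t, eta0=0.1, tau=100.0):
        # Approximately eta0 ("search" phase) while t << tau, then
        # approximately eta0 * tau / t ("converge" phase) for t >> tau.
        return eta0 / (1.0 + t / tau)

    # Used in a plain stochastic gradient step:
    #     theta = theta - learning_rate(t) * gradient_estimate(theta)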

On moments of the first ladder height of random walks with small drift.  The Annals of Applied Probability, 2: 714-738, 1992.

A random walk is a discrete-time process $\{S_n:n=0,1,\ldots\}$ with independent and identically distributed increments, starting from $S_0=0$. The first ladder height is the first positive value taken by the random walk. For random walks whose increments are distributed according to an exponential family of distributions, this paper refines an asymptotic expansion of Siegmund for moments of the first ladder height as the mean of the increments tends to 0. This expansion is an important tool in corrected diffusion and other approximations for boundary crossing probabilities. An application to the CUSUM procedure of quality control is described.
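
In symbols (a standard formulation, not quoted from the paper): writing $\tau_+ = \inf\{n \ge 1 : S_n > 0\}$ for the first ladder epoch, the first ladder height is
$$H = S_{\tau_+},$$
and the paper concerns expansions of the moments $E[H^p]$ as the mean of the increments decreases to 0 through an exponential family of distributions.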

Inequalities for the overshoot.  The Annals of Applied Probability, 4: 1223-1233, 1994.

The overshoot, also known as the excess or residual lifetime, is one of the fundamental objects of study in random walk and renewal theory. For example, when a random walk first crosses a horizontal boundary at height $b$, the amount by which it exceeds that boundary is the overshoot of $b$. The limiting distribution and moments of the overshoot as $b\to\infty$ are classical results. Gary Lorden found some striking inequalities that bound moments and tail probabilities of the overshoot uniformly over all finite $b$; in many applications this uniform behavior is of more genuine interest than the limit as $b\to\infty$. In this paper I give simple new coupling proofs of Lorden's inequalities and also establish some new inequalities that are sharper than Lorden's. The paper ends with a conjecture.
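
For orientation, here is the setup in symbols, together with the form of Lorden's moment bound in the renewal case (a standard statement, included for context rather than quoted from the paper). With $\tau(b) = \inf\{n : S_n > b\}$, the overshoot is
$$R_b = S_{\tau(b)} - b,$$
and for iid nonnegative increments $X_1, X_2, \ldots$ with mean $\mu > 0$, Lorden's inequality states that $E[R_b] \le E[X_1^2]/\mu$ uniformly in $b \ge 0$.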

A Bayesian improvement on a test-based estimator.  American Statistical Association Proceedings of the Section on Bayesian Statistical Science, 40-44, 1995. (With Tom Kelleher and John Hartigan.)

Studies the adequacy of the common statistical practice of first selecting a model (e.g., by criteria such as AIC or BIC) and then estimating parameters within the selected model. Such two-stage estimators are generally inadmissible. In a very simple case, finds a Bayesian estimator that dominates such two-stage model selection estimators.
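
The phenomenon is easy to see in a toy simulation. The sketch below (an illustration at a single true mean, not the paper's construction or a dominance proof) compares a pretest estimator of a normal mean, which estimates 0 whenever a level-0.05 test accepts $\theta = 0$, with the posterior mean under a $N(0,1)$ prior.

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n_rep = 1.0, 100_000               # illustrative true mean
    x = theta + rng.standard_normal(n_rep)    # X ~ N(theta, 1)

    # Two-stage estimator: test theta = 0 first, then estimate.
    pretest = np.where(np.abs(x) > 1.96, x, 0.0)
    # Posterior mean of theta under a N(0, 1) prior with N(theta, 1) data.
    bayes = x / 2.0

    print("pretest MSE:", np.mean((pretest - theta) ** 2))
    print("Bayes   MSE:", np.mean((bayes - theta) ** 2))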

Inconsistency of evolutionary tree topology reconstruction methods when substitution rates vary across characters.  Mathematical Biosciences, 134: 189-216, 1996.  [PDF]

Standard models in phylogeny reconstruction have assumed that different positions in the DNA evolve independently and identically according to a Markov model on the tree. This assumption is unrealistic, as different positions are known to evolve at different rates. This paper shows that methods that ignore this rate heterogeneity can fail dramatically, converging to the wrong tree topology even under very simple forms of rate heterogeneity. For example, just mixing in some invariable characters (characters whose state cannot change) with the usual iid characters is enough to cause maximum likelihood and distance methods to be misled. Since such rate heterogeneity is well established scientifically, this is a serious concern for some methods that had previously been considered reliable, and it provides theoretical support for some more elaborate methods that have been developed recently.

Full reconstruction of Markov models on evolutionary trees: identifiability and consistency.  Mathematical Biosciences, 137: 51-73, 1996.  [PDF]

Under weak conditions, identifiability is proved in the problem of full reconstruction of Markov evolutionary models -- this includes both the reconstruction of the tree topology and the recovery of the Markov transition matrices on the edges of the tree.  As a byproduct of the identifiability result, the consistency of maximum likelihood is established. This is the first proof of identifiability and consistency for such Markov models, and it was an early contributor to the development of spectral and tensor approaches in more general models.

The measurement of homoplasy: a stochastic view.  In Homoplasy: The Recurrence of Similarity in Evolution, M. J. Sanderson and L. Hufford, eds., Academic Press, 1996, pp. 189-203. (With Junhyong Kim.)

The term homoplasy refers to a similarity between characters in different species that arises "by chance," in the sense that the character evolved independently more than once, rather than having evolved once before the species diverged from each other. This paper proposes a new stochastic method of quantifying the homoplasy concept, identifies pitfalls of a number of approaches, and reconsiders the scientific utility of the enterprise.

Ladder heights, Gaussian random walks, and the Riemann zeta function.  The Annals of Probability, 25: 787-802, 1997. (With Yuval Peres.)  [PDF]

Here we study Gaussian random walks whose increments have a normal distribution $N(\theta,1)$, where $\theta\ge 0$. The object of study is the behavior of the expected value of the first ladder height, as a function of $\theta$. We show that this function is analytic at 0 and give the full Taylor series, the coefficients of which are values of the Riemann zeta function at $1/2, -1/2, -3/2, \ldots$. Previously only the first term of this expansion was known. We also investigate the extent to which the function may be analytically continued throughout the complex plane. The paper also applies the Taylor series to obtain full asymptotic expansions for boundary crossing problems for Gaussian random walks.
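
A quick Monte Carlo check of the quantity being expanded (an illustration, not from the paper): estimate the expected first ladder height of a Gaussian random walk with $N(\theta,1)$ increments. At $\theta = 0$ the classical value is $1/\sqrt{2} \approx 0.7071$; the Taylor series in the paper describes how the expectation moves away from this value as $\theta$ grows.

    import numpy as np

    def mean_first_ladder_height(theta, n_walks=20_000, max_steps=100_000, seed=0):
        rng = np.random.default_rng(seed)
        total, count = 0.0, 0
        for _ in range(n_walks):
            s = 0.0
            for _ in range(max_steps):
                s += rng.normal(theta, 1.0)
                if s > 0.0:          # first time the walk is positive
                    total += s       # s is the first ladder height
                    count += 1
                    break
        # Truncation at max_steps slightly biases the estimate for theta
        # near 0, where the first ladder epoch has heavy tails.
        return total / count

    print(mean_first_ladder_height(0.0))   # should be near 1/sqrt(2)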

Conditioning as disintegration.  Statistica Neerlandica, 51: 287-317, 1997. (With David Pollard.)

Kolmogorov's mathematical formalization of the concept of conditional probabilities and expectations is often awkward to work with in concrete problems, such as those arising in statistics. The elementary definitions of conditioning, while intuitive, are restrictive, applying easily only to densities with respect to counting measure and Lebesgue measure. This paper discusses the notion of disintegration (closely related to conditional probability distributions) and presents many examples and applications to statistics, such as rigorous treatments of sufficiency, the EM algorithm, and marginalization paradoxes.
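
In one standard formulation (paraphrased for context, not quoted from the paper): given a measurable map $T$ from $\Omega$ to $\mathcal{T}$, with $\lambda$ the image of $P$ under $T$, a disintegration of $P$ with respect to $T$ is a family of probability measures $\{P_t : t \in \mathcal{T}\}$ such that $P_t\{T = t\} = 1$ for $\lambda$-almost all $t$ and
$$P(A) = \int_{\mathcal{T}} P_t(A)\,\lambda(dt) \quad \text{for all measurable } A.$$
The measure $P_t$ then serves as the conditional distribution given $T = t$, concentrated, as intuition demands, on the event $\{T = t\}$.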

An extension of affected pedigree member analyses to triads of relatives.  Genetic Epidemiology, 14: 1005-1010, 1997. (With Elena Grigorenko.)

This paper concerns genetic linkage analysis, which is one of the techniques used in finding genes responsible for traits. A well-known "robust" (or "nonparametric") method uses pairs of affected relatives in the pedigree. Here we extend the method to use certain triads of relatives. The hope is that this kind of method may sometimes permit more powerful linkage analysis, allowing linkage to be established with fewer individuals typed.

Review of Probability and Statistical Inference, by R. Bartoszynski and M. Niewiadomska-Bugaj (Wiley, 1996).  Technometrics 40: 354-356, 1998.

Detecting when a monotonically increasing mean has crossed a threshold.  Journal of Quality Technology 31: 217-234, 1999. (With Ronald D. Fricker, Jr.).

Suppose we are monitoring a sequence of random variables, and we want to raise an alarm as soon as possible after the mean becomes too large. The CUSUM sequential change-point detection scheme has optimality properties if the mean experiences a single, one-time jump increase from one known level to another. However, many monitoring situations are not realistically described by this stylized change-point model; for example, in modeling tool wear, it is natural to allow the possibility of gradual monotonic changes. Here we introduce a model that assumes only that the mean is nondecreasing over time. The problem is to let the process continue as long as its mean is under some specified threshold value and to stop it as soon as possible after the mean exceeds the threshold. We show how to apply the CUSUM and exponentially weighted moving average (EWMA) to this problem, and compare them to a repeated generalized likelihood ratio (GLR) test we designed specifically for the monotone setting. Much of the literature on sequential detection problems advocates the GLR principle as a reliable method for designing good procedures. The surprising result here is that the CUSUM and EWMA, appropriately applied, usually outperform the GLR test. A case is made for the CUSUM as the best overall choice.
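
For concreteness, here are minimal sketches of the two standard recursions compared in the paper; the reference value k, smoothing weight lam, and alarm limits are illustrative tuning parameters rather than the paper's recommended choices.

    def cusum_alarm(xs, k=0.5, h=5.0):
        """One-sided CUSUM: alarm at the first n with W_n > h."""
        w = 0.0
        for n, x in enumerate(xs, start=1):
            w = max(0.0, w + x - k)   # accumulate evidence that the mean is high
            if w > h:
                return n
        return None   # no alarm raised

    def ewma_alarm(xs, lam=0.1, limit=1.0):
        """EWMA: alarm at the first n with Z_n > limit."""
        z = 0.0
        for n, x in enumerate(xs, start=1):
            z = (1 - lam) * z + lam * x   # exponentially weighted average
            if z > limit:
                return n
        return None   # no alarm raised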

Stochastic Processes.

Book in preparation.  You can see most of this as class notes here.

Recent common ancestors of all present-day individuals. [With discussion by Peter Donnelly, Carsten Wiuf & Jotun Hein, Montgomery Slatkin, W. J. Ewens, and J. F. C. Kingman.] Paper: Advances in Applied Probability, 31: 1002-1026, 1999. Invited discussion and my reply: Advances in Applied Probability, 31: 1027-1038, 1999.

Previous study of the time to a common ancestor of all present-day individuals has focused on models in which each individual has just one parent in the previous generation. For example, "mitochondrial Eve" is the most recent common ancestor (MRCA) when ancestry is defined only through maternal lines. In the standard Wright-Fisher model with population size $n$, the expected number of generations to the MRCA is about $2n$, and the standard deviation of this time is also of order $n$. This paper studies a two-parent analog of the Wright-Fisher model that defines ancestry using both parents. In this model, if the population size $n$ is large, the number of generations, $T_n$, back to a MRCA has a distribution that is concentrated around $\log_2 n$, in the sense that the ratio $T_n/\log_2 n$ converges in probability to 1 as $n$ tends to infinity. Also, continuing to trace back further into the past, at about $1.77\log_2 n$ generations before the present, all partial ancestry of the current population ends, in the following sense: with high probability for large $n$, in each generation at least $1.77\log_2 n$ generations before the present, all individuals who have any descendants among the present-day individuals are actually ancestors of all present-day individuals.
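
The model is simple enough to simulate directly. The sketch below (an illustration, not the paper's code) runs the two-parent model backwards in time, tracking which present-day individuals each ancestor is an ancestor of, and reports the number of generations back to the first MRCA; for moderate $n$ the answer is indeed close to $\log_2 n$.

    import numpy as np

    def generations_to_mrca(n, seed=0):
        rng = np.random.default_rng(seed)
        everyone = frozenset(range(n))
        # descendants[i]: the present-day individuals descended from
        # individual i in the generation currently under examination.
        descendants = [frozenset([i]) for i in range(n)]
        g = 0
        while True:
            g += 1
            # Each individual picks two uniformly random parents from
            # the previous generation.
            parents = rng.integers(0, n, size=(n, 2))
            new = [set() for _ in range(n)]
            for child in range(n):
                for p in parents[child]:
                    new[p].update(descendants[child])
            descendants = [frozenset(s) for s in new]
            if any(d == everyone for d in descendants):
                return g   # this generation contains an MRCA

    print(generations_to_mrca(1000))   # typically near log2(1000), about 10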

N. Luscombe, T. Royce, P. Bertone, N. Echols, C. Horak, J. Chang, M. Snyder, and M. Gerstein.  ExpressYourself: a modular platform for processing and visualizing microarray data.  Nucleic Acids Res. 31:3477-3482, 2003.

Y. Kluger, R. Basri, J. Chang, and M. Gerstein.   Spectral biclustering of microarray cancer data: co-clustering genes and conditions.  Genome Research 13: 703-716, 2003.

H.M. Kluger, Y. Kluger, M. Gilmore-Hebert, K. DiVito, J. Chang, S. Rodov, O. Mironenko, B.M. Kacinski, A.S. Perkins, and E. Sapi.  cDNA microarray analysis of invasive and tumorigenic phenotypes in a breast cancer model.  Laboratory Investigation 84: 320-331, 2004.   

Y. Kluger, D. Tuck, J. Chang, Y. Nakayama, R. Poddar, N. Kohya, Z. Lian, A. Ben Nasr, H.R. Halaban, D. Krause, X. Zhang, P. Newburger, and S. Weissman.  Lineage specificity of gene expression patterns.  Proceedings of the National Academy of Sciences, 101:6508-6513, 2004.  [PDF]

D. Rohde, S. Olson, and J. Chang.  Modelling the recent common ancestry of all living humans.  Nature, 431:562-566, 2004.  [Press release]  [Article]  [Supplement-A] [Supplement-B]

If a common ancestor of all living humans is defined as an individual who is a genealogical ancestor of all present-day people, the most recent common ancestor (MRCA) for a randomly mating population would have lived in the very recent past (see the 1999 paper above). However, the random mating model ignores essential aspects of population substructure, such as the tendency of individuals to choose mates from the same social group, and the relative isolation of geographically separated groups. Here we show that recent common ancestors also emerge from two models incorporating substantial population substructure. One model, designed for simplicity and theoretical insight, yields explicit mathematical results through a probabilistic analysis. A more elaborate second model, designed to capture historical population dynamics in a more realistic way, is analysed computationally through Monte Carlo simulations. These analyses suggest that the genealogies of all living humans overlap in remarkable ways in the recent past. In particular, the MRCA of all present-day humans lived just a few thousand years ago in these models. Moreover, among all individuals living more than just a few thousand years earlier than the MRCA, each present-day human has exactly the same set of genealogical ancestors.

N. Carriero, M.V. Osier, K.H. Cheung, P.L. Miller, M. Gerstein, H. Zhao, B. Wu, S. Rifkin, J. Chang, H. Zhang, K. White, K. Williams, and M. Schultz.  A high productivity/low maintenance approach to high-performance computation for biomedicine: four case studies.  J Am Med Inform Assoc 12:90-98, 2005.

M. Lacey and J. Chang.  A signal-to-noise analysis of phylogeny estimation by neighbor-joining: Insufficiency of polynomial length sequences.  Mathematical Biosciences 199: 188-215, 2006.  [PDF]

Phylogeny reconstruction is the process of inferring evolutionary relationships from molecular sequences, and methods that can be expected to accurately reconstruct trees from sequences of reasonable length are highly desirable. To formalize this concept, the property of fast convergence has been introduced to describe phylogeny reconstruction methods that, with high probability, recover the true tree from sequences whose length grows only polynomially in the number of taxa $n$. While provably fast-converging methods have been developed, the neighbor-joining (NJ) algorithm of Saitou and Nei remains one of the most popular methods used in practice. This algorithm is known to converge for sequences whose length grows exponentially in $n$, but no lower bound on the sequence length it requires had been established. To address this theoretical question, we analyze the performance of the NJ algorithm on a type of phylogeny known as a 'caterpillar tree'. We find that, for sequences of polynomial length in the number of taxa $n$, the variability of the NJ criterion is sufficiently high that the algorithm is likely to fail even in the first step of the phylogeny reconstruction process, regardless of the degree of the polynomial. This result demonstrates that, for general $n$-taxa trees, the exponential bound cannot be improved to a polynomial one.
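
For reference, the selection criterion whose signal-to-noise behavior is analyzed is the standard neighbor-joining criterion of Saitou and Nei: at each step the algorithm joins the pair of taxa minimizing $Q$. The sketch below computes the first join from an estimated distance matrix; noise in the estimated distances propagates into $Q$, and that propagation is the effect the paper quantifies.

    import numpy as np

    def nj_q_matrix(d):
        """Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k)."""
        n = d.shape[0]
        r = d.sum(axis=1)                     # row sums of the distance matrix
        q = (n - 2) * d - r[:, None] - r[None, :]
        np.fill_diagonal(q, np.inf)           # never join a taxon with itself
        return q

    def first_join(d):
        """Indices of the pair of taxa neighbor-joining would join first."""
        q = nj_q_matrix(d)
        return np.unravel_index(np.argmin(q), q.shape)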

E. Grigorenko, A. Naples, J. Chang, C. Romano, D. Ngorosho, S. Kungulilo, M. Jukes, D. Bundy.  Back to Africa: Tracing dyslexia genes in east Africa.  Reading and Writing, 20:27-49, February 2007.  [PDF]

Z. Zhang, A. Paccanaro, Y. Fu, S. Weissman, Z. Weng, J. Chang, M. Snyder, M. Gerstein.  Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions.  Genome Research 17:787-797, 2007.

Y. Zhou, J. Ferguson, J. Chang, Y. Kluger. Inter- and intra-combinatorial regulation by transcription factors and microRNAs. BMC Genomics 8:396, 2007.

