Tianhao Wang

Department of S&DS
219 Prospect Avenue
New Haven, CT 06511
tianhao.wang@yale.edu

I am a final-year Ph.D. student in the Department of Statistics and Data Science at Yale University. I am very fortunate to be advised by Prof. Zhou Fan. I am broadly interested in various aspects of statistics and machine learning theory.

Prior to Yale, I obtained my Bachelor’s degree in mathematics with a dual degree in computer science at the University of Science and Technology of China.

Recent papers (*: equal contribution)

Foundations of Transformers

Implicit regularization of gradient flow on one-layer softmax attention

Heejune Sheen, Siyu Chen, Tianhao Wang, and Harrison H. Zhou

arXiv:2403.08699, 2024
Accepted to ICLR 2024 Workshop on Bridging the Gap Between Practice and Theory in Deep Learning

How well can Transformers emulate in-context Newton's method?

Angeliki Giannou, Liu Yang, Tianhao Wang, Dimitris Papailiopoulos, and Jason D. Lee

arXiv:2403.03183, 2024
Accepted to ICLR 2024 Workshop on Bridging the Gap Between Practice and Theory in Deep Learning

Training dynamics of multi-head softmax attention for in-context learning: emergence, convergence, and optimality

Siyu Chen, Heejune Sheen, Tianhao Wang, and Zhuoran Yang

arXiv:2402.19442, 2024
Accepted to ICLR 2024 Workshop on Bridging the Gap Between Practice and Theory in Deep Learning

Approximate Message Passing algorithms

Approximate Message Passing for orthogonally invariant ensembles: Multivariate non-linearities and spectral initialization

Xinyi Zhong*, Tianhao Wang*, and Zhou Fan

Submitted to Information and Inference, minor revision. arXiv:2110.02318, 2021

We study a class of Approximate Message Passing (AMP) algorithms for symmetric and rectangular spiked random matrix models with orthogonally invariant noise. The AMP iterates have fixed dimension \(K>1\), a multivariate non-linearity is applied in each AMP iteration, and the algorithm is spectrally initialized with \(K\) super-critical sample eigenvectors. We derive the forms of the Onsager debiasing coefficients and corresponding AMP state evolution, which depend on the free cumulants of the noise spectral distribution. This extends previous results for such models with \(K=1\) and an independent initialization. Applying this approach to Bayesian principal components analysis, we introduce a Bayes-OAMP algorithm that uses as its non-linearity the posterior mean conditional on all preceding AMP iterates. We describe a practical implementation of this algorithm, where all debiasing and state evolution parameters are estimated from the observed data, and we illustrate the accuracy and stability of this approach in simulations.

Universality of Approximate Message Passing algorithms and tensor networks

Tianhao Wang, Xinyi Zhong, and Zhou Fan

The Annals of Applied Probability, to appear

Approximate Message Passing (AMP) algorithms provide a valuable tool for studying mean-field approximations and dynamics in a variety of applications. Although usually derived for matrices having independent Gaussian entries or satisfying rotational invariance in law, their state evolution characterizations are expected to hold over larger universality classes of random matrix ensembles. We develop several new results on AMP universality. For AMP algorithms tailored to independent Gaussian entries, we show that their state evolutions hold over broadly defined generalized Wigner and white noise ensembles, including matrices with heavy-tailed entries and heterogeneous entrywise variances that may arise in data applications. For AMP algorithms tailored to rotational invariance in law, we show that their state evolutions hold over matrix ensembles whose eigenvector bases satisfy only sign and permutation invariances, including sensing matrices composed of subsampled Hadamard or Fourier transforms and diagonal operators. We establish these results via a simplified moment-method proof, reducing AMP universality to the study of products of random matrices and diagonal tensors along a tensor network. As a by-product of our analyses, we show that the aforementioned matrix ensembles satisfy a notion of asymptotic freeness with respect to such tensor networks, which parallels usual definitions of freeness for traces of matrix products.

Implicit bias of optimization algorithms

The Marginal Value of Momentum for Small Learning Rate SGD

Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, and Zhiyuan Li

In International Conference on Learning Representations (ICLR), 2024

Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise. In stochastic optimization, such as training neural networks, folklore suggests that momentum may help deep learning optimization by reducing the variance of the stochastic gradient update, but previous theoretical analyses do not find momentum to offer any provable acceleration. Theoretical results in this paper clarify the role of momentum in stochastic settings where the learning rate is small and gradient noise is the dominant source of instability, suggesting that SGD with and without momentum behave similarly in the short and long time horizons. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training regimes where the optimal learning rate is not very large, including small- to medium-batch training from scratch on ImageNet and fine-tuning language models on downstream tasks.

Fast mixing of stochastic gradient descent with normalization and weight decay

Zhiyuan Li, Tianhao Wang, and Dingli Yu

In Advances in Neural Information Processing Systems (NeurIPS), 2022

Implicit bias of gradient descent on reparametrized models: On equivalence to mirror descent

Zhiyuan Li*, Tianhao Wang*, Jason D. Lee, and Sanjeev Arora

In Advances in Neural Information Processing Systems (NeurIPS), 2022
Abridged version accepted for a contributed talk to ICML 2022 Workshop on Continuous time methods for machine learning

As part of the effort to understand implicit bias of gradient descent in overparametrized models, several results have shown how the training trajectory on the overparametrized model can be understood as mirror descent on a different objective. The main result here is a characterization of this phenomenon under a notion termed commuting parametrization, which encompasses all the previous results in this setting. It is shown that gradient flow with any commuting parametrization is equivalent to continuous mirror descent with a related Legendre function. Conversely, continuous mirror descent with any Legendre function can be viewed as gradient flow with a related commuting parametrization. The latter result relies upon Nash’s embedding theorem.

What happens after SGD reaches zero loss?--A mathematical framework

Zhiyuan Li, Tianhao Wang, and Sanjeev Arora

In International Conference on Learning Representations (ICLR), 2022 (Spotlight)

Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning, especially for overparametrized models, where the local minimizers of the loss function \(L\) can form a manifold. Intuitively, with a sufficiently small learning rate \(\eta\), SGD tracks Gradient Descent (GD) until it gets close to such a manifold, where the gradient noise prevents further convergence. In such a regime, Blanc et al. (2020) proved that SGD with label noise locally decreases a regularizer-like term, the sharpness of loss, \(\text{tr}[\nabla^2 L]\). The current paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991). It allows in principle a complete characterization of the regularization effect of SGD around such a manifold (i.e., the "implicit bias") using a stochastic differential equation (SDE) describing the limiting dynamics of the parameters, which is determined jointly by the loss function and the noise covariance. This yields some new results: (1) a global analysis of the implicit bias valid for \(\eta^{-2}\) steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for \(\eta^{-1.6}\) steps and (2) allowing arbitrary noise covariance. As an application, we show that with arbitrarily large initialization, label noise SGD can always escape the kernel regime and only requires \(O(\kappa\ln d)\) samples for learning a \(\kappa\)-sparse overparametrized linear model in \(\mathbb{R}^d\) (Woodworth et al., 2020), while GD initialized in the kernel regime requires \(\Omega(d)\) samples. This upper bound is minimax optimal and improves the previous \(\widetilde{O}(\kappa^2)\) upper bound (HaoChen et al., 2020).

Data-driven decision-making problems

Noise-adaptive Thompson sampling for linear contextual bandits

Ruitu Xu, Yifei Min, and Tianhao Wang

In Advances in Neural Information Processing Systems (NeurIPS), 2023

Learn to match with no regret: Reinforcement learning in Markov matching markets

Yifei Min, Tianhao Wang, Ruitu Xu, Zhaoran Wang, Michael I. Jordan, and Zhuoran Yang

In Advances in Neural Information Processing Systems (NeurIPS), 2022 (Oral)

We study a Markov matching market involving a planner and a set of strategic agents on the two sides of the market. At each step, the agents are presented with a dynamical context, where the contexts determine the utilities. The planner controls the transition of the contexts to maximize the cumulative social welfare, while the agents aim to find a myopic stable matching at each step. Such a setting captures a range of applications including ridesharing platforms. We formalize the problem by proposing a reinforcement learning framework that integrates optimistic value iteration with maximum weight matching. The proposed algorithm addresses the coupled challenges of sequential exploration, matching stability, and function approximation. We prove that the algorithm achieves sublinear regret.

A simple and provably efficient algorithm for asynchronous federated contextual linear bandits

Jiafan He*, Tianhao Wang*, Yifei Min*, and Quanquan Gu

In Advances in Neural Information Processing Systems (NeurIPS), 2022

We study federated contextual linear bandits, where \(M\) agents cooperate with each other to solve a global contextual linear bandit problem with the help of a central server. We consider the asynchronous setting, where all agents work independently and the communication between one agent and the server will not trigger other agents’ communication. We propose a simple algorithm named FedLinUCB based on the principle of optimism. We prove that the regret of FedLinUCB is bounded by \(\tilde O(d\sqrt{\sum_{m=1}^M T_m})\) and the communication complexity is \(\tilde{O}(dM^2)\), where \(d\) is the dimension of the contextual vector and \(T_m\) is the total number of interactions with the environment by the \(m\)-th agent. To the best of our knowledge, this is the first provably efficient algorithm that allows fully asynchronous communication for federated contextual linear bandits, while achieving the same regret guarantee as in the single-agent setting.

Variance-aware off-policy evaluation with linear function approximation

Yifei Min*, Tianhao Wang*, Dongruo Zhou, and Quanquan Gu

In Advances in Neural Information Processing Systems (NeurIPS), 2021

We study the off-policy evaluation (OPE) problem in reinforcement learning with linear function approximation, which aims to estimate the value function of a target policy based on the offline data collected by a behavior policy. We propose to incorporate the variance information of the value function to improve the sample efficiency of OPE. More specifically, for time-inhomogeneous episodic linear Markov decision processes (MDPs), we propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a tighter error bound than the best-known result. We also provide a fine-grained characterization of the distribution shift between the behavior policy and the target policy. Extensive numerical experiments corroborate our theory.

Provably efficient reinforcement learning with linear function approximation under adaptivity constraints

Tianhao Wang*, Dongruo Zhou*, and Quanquan Gu

In Advances in Neural Information Processing Systems (NeurIPS), 2021

We study reinforcement learning (RL) with linear function approximation under the adaptivity constraint. We consider two popular limited adaptivity models: the batch learning model and the rare policy switch model, and propose two efficient online RL algorithms for episodic linear Markov decision processes, where the transition probability and the reward function can be represented as a linear function of some known feature mapping. Specifically, for the batch learning model, our proposed LSVI-UCB-Batch algorithm achieves an \(\widetilde{O}(\sqrt{d^3 H^3 T} + dHT/B)\) regret, where \(d\) is the dimension of the feature mapping, \(H\) is the episode length, \(T\) is the number of interactions and \(B\) is the number of batches. Our result suggests that it suffices to use only \(\sqrt{T/dH}\) batches to obtain \(\widetilde{O}(\sqrt{d^3 H^3 T})\) regret. For the rare policy switch model, our proposed LSVI-UCB-RareSwitch algorithm enjoys an \(\widetilde{O}(\sqrt{d^3 H^3 T [1 + T/(dH)]^{dH/B}})\) regret, which implies that \(d H \log T\) policy switches suffice to obtain the \(\widetilde{O}(\sqrt{d^3 H^3 T})\) regret. Our algorithms achieve the same regret as the LSVI-UCB algorithm (Jin et al., 2020), yet with a substantially smaller amount of adaptivity. We also establish a lower bound for the batch learning model, which suggests that the dependency on \(B\) in our regret bound is tight.

Orbit recovery model

Maximum likelihood for high-noise group orbit estimation and single-particle cryo-EM

Zhou Fan, Roy R. Lederman, Yi Sun, Tianhao Wang, and Sheng Xu

The Annals of Statistics, 2024

Motivated by applications to single-particle cryo-electron microscopy (cryo-EM), we study several problems of function estimation in a low SNR regime, where samples are observed under random rotations of the function domain. In a general framework of group orbit estimation with linear projection, we describe a stratification of the Fisher information eigenvalues according to a sequence of transcendence degrees in the invariant algebra, and relate critical points of the log-likelihood landscape to a sequence of method-of-moments optimization problems. This extends previous results for a discrete rotation group without projection. We then compute these transcendence degrees and the forms of these moment optimization problems for several examples of function estimation under \(SO(2)\) and \(SO(3)\) rotations, including a simplified model of cryo-EM as introduced by Bandeira, Blum-Smith, Kileel, Perry, Weed, and Wein. For several of these examples, we affirmatively resolve numerical conjectures that 3rd-order moments are sufficient to locally identify a generic signal up to its rotational orbit. For low-dimensional approximations of the electric potential maps of two small protein molecules, we empirically verify that the noise-scalings of the Fisher information eigenvalues conform with these theoretical predictions over a range of SNR, in a model of \(SO(3)\) rotations without projection.

Likelihood landscape and maximum likelihood estimation for the discrete orbit recovery model

Zhou Fan, Yi Sun, Tianhao Wang, and Yihong Wu

Communications on Pure and Applied Mathematics, 2022

We study the non-convex optimization landscape for maximum likelihood estimation in the discrete orbit recovery model with Gaussian noise. This is a statistical model motivated by applications in molecular microscopy and image processing, where each measurement of an unknown object is subject to an independent random rotation from a known rotational group. Equivalently, it is a Gaussian mixture model where the mixture centers belong to a group orbit.
We show that fundamental properties of the likelihood landscape depend on the signal-to-noise ratio and the group structure. At low noise, this landscape is “benign” for any discrete group, possessing no spurious local optima and only strict saddle points. At high noise, this landscape may develop spurious local optima, depending on the specific group. We discuss several positive and negative examples, and provide a general condition that ensures a globally benign landscape at high noise. For cyclic permutations of coordinates on \(\mathbb{R}^d\) (multi-reference alignment), there may be spurious local optima when \(d \geq 6\), and we establish a correspondence between these local optima and those of a surrogate function of the phase variables in the Fourier domain.
We show that the Fisher information matrix transitions from resembling that of a single Gaussian distribution in low noise to having a graded eigenvalue structure in high noise, which is determined by the graded algebra of invariant polynomials under the group action. In a local neighborhood of the true object, where the neighborhood size is independent of the signal-to-noise ratio, the landscape is strongly convex in a reparametrized system of variables given by a transcendence basis of this polynomial algebra. We discuss implications for optimization algorithms, including slow convergence of expectation-maximization, and possible advantages of momentum-based acceleration and variable reparametrization for first- and second-order descent methods.