Andrew R. Barron
Yale University
Professor of Statistics (100% Appointment)
Professor of Electrical Engineering (Courtesy Appointment)
Co-Director of Graduate Studies in Statistics and Data Science
Senior Coordinator of Undergraduate Studies in Applied Mathematics

Address: Department of Statistics and Data Science, 24 Hillhouse Avenue, New Haven, CT 06511
Phone: +1-203-997-5229, 203-432-0666, 203-432-0634
Fax: 203-432-0633
E-Mail: Andrew.Barron@yale.edu

Education:

  • Ph.D., Electrical Engineering, Stanford University, 1985.
  • M.S., Electrical Engineering, Stanford University, 1982.
  • B.S. (Magna Cum Laude), Electrical Engineering and Mathematical Sciences, Rice University, 1981.
  • W. T. Woodson H.S., Fairfax, Virginia, 1977.

Experience:

  • 2016 - present   Professor, Department of Statistics and Data Science, Yale University
  • 1992 - 2016       Professor, Department of Statistics, Yale University
  • 1990 - 1992       Associate Professor of Statistics and Electrical & Computer Engineering, University of Illinois
  • 1984 - 1998       Consultant, Barron Associates, Inc., Stanardsville, Virginia
  • 1992 Spring      Visiting Research Scholar, Barron Associates, Inc., Stanardsville, Virginia
  • 1991 Fall           Visiting Scholar, Mathematical Sciences Research Institute, Berkeley, California
  • 1985 - 1990       Assistant Professor of Statistics and Electrical & Computer Engineering, University of Illinois
  • 1982 - 1985       Research Assistant, Stanford University
  • 1981 - 1983       Consultant, Adaptronics, Inc., McLean, Virginia
  • 1977 - 1980       Engineer, Adaptronics, Inc., McLean, Virginia (Summers)

Honors:

  • Fellow of the IEEE, for contributions to information theory and statistics.
  • Browder J. Thompson Memorial Prize, the best paper prize across all IEEE journals in 1990-1991,
    for authors of age 30 or under at time of submission.
  • IMS Medallion Award, presented at the 2005 Joint ASA-IMS Annual Meetings, Minneapolis, MN.
  • Best paper prize, National Aerospace Electronics Conference, 1990.
  • Finalist for the best paper prize, Information Theory Society, IEEE, 1987.
  • Nominated for the Marconi Young Scientist Award, 1990.
  • Board of Governors, IEEE Information Theory Society, four terms: 1995-1997, 1998-2000, 2013-2016, 2017-2020.
  • Appointed Secretary, Board of Governors, IEEE Information Theory Society, 1989-1990.
  • Keynote speaker at several conferences.
  • Chairman, AMS Summer Research Conference, Adaptive Selection of Models and Procedures, 1996.
  • Program Committee, IMS-ASA Joint Statistical Meetings, 1991.
  • Program Committees, IEEE International Symposium on Information Theory, Numerous times.
  • Program Committees, IEEE Workshop on Information Theory, 1989, 2008.
  • Program Committee, World Congress on Neural Networks, 1995.
  • Program Committee, Neural Information Processing Systems: Natural and Synthetic, 1995.
  • Program Committee, ACM Workshop on Computational Learning Theory, 1991, 1997.
  • James Waters Creativity Award for best undergraduate research at Rice, 1981.
  • Houston Telephone Engineers scholarship, top student in communication theory, 1981.
  • Top award for leadership, scholarship and service, W. T. Woodson High School, Fairfax, VA, 1977.

Research Interests:

  • Entropy Power Inequalities and Central Limit Theorems.
  • Capacity-achieving Sparse Superposition Codes for the Gaussian Channel. Communication by Regression.
  • Entropy Rates, Likelihood Stabilization, and Inference for Dependent Processes.
  • Foundations of Minimum Description Length Principle of Inference and Universal Data Compression.
  • Statistical Risk Analysis for Penalized Criteria for Model Selection.
  • Statistical Risk Analysis for Bayes Procedures.
  • Statistical Perspectives and Analysis of Artificial Neural Networks.
  • Nonlinear Approximation and Estimation for High-dimensional Libraries of Functions.
  • Greedy Algorithms for Subset Selection, Mixture Density Estimation, and L1 Penalty Optimization.
  • Maximum Wealth Stock Indices and Growth Rate Optimal Portfolio Estimation.

Ph.D. Dissertation:

  1. A. R. Barron (1985). Logically smooth density estimation. Stanford Univ., Stanford, CA.

Journal Publications:

  1. D. Cleveland, A. R. Barron, A. N. Mucciardi (1980). Methods for determining the depth of near-surface defects. Journal of Nondestructive Evaluation, Vol.1, pp.21-36.
  2. A. R. Barron (1985). The strong ergodic theorem for densities: generalized Shannon-McMillan-Breiman theorem. Annals of Probability, Vol.13, pp.1292-1303. (Finalist for the best paper prize by the IEEE Information Theory Society.)
  3. A. R. Barron (1986). Entropy and the central limit theorem. Annals of Probability, Vol.14, pp.336-342. (Finalist for the best paper prize by the IEEE Information Theory Society.)
  4. A. R. Barron (1986). Discussion on Diaconis and Freedman: the consistency of Bayes estimates. Annals of Statistics, Vol.14, pp.26-30.
  5. A. R. Barron and T. M. Cover (1988). A bound on the financial value of information. IEEE Transactions on Information Theory, Vol.34, pp.1097-1100.
  6. A. R. Barron (1989). Uniformly powerful goodness of fit tests. Annals of Statistics, Vol.17, pp.107-124.
  7. B. Clarke and A. R. Barron (1990). Information-theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory, Vol.IT-36, pp.453-471. (Winner of the 1992 Browder J. Thompson Memorial Prize for the best paper in all IEEE journals by authors of age 30 or under at time of submission.)
  8. A. R. Barron and X. Xiao (1991). Discussion on Friedman's multivariate adaptive regression. Annals of Statistics, Vol.19, pp.67-82.
  9. A. R. Barron and C. Sheu (1991). Approximation of density functions by sequences of exponential families. Annals of Statistics, Vol.19, pp.1347-1369.
  10. A. R. Barron and T. M. Cover (1991). Minimum complexity density estimation. IEEE Transactions on Information Theory, Vol.IT-37, pp.1034-1054.
  11. A. R. Barron, L. Gyorfi, and E. C. van der Meulen (1992). Distribution estimation consistent in total variation and in two types of information divergence. IEEE Transactions on Information Theory, Vol.IT-38, pp.1437-1454.
  12. A. R. Barron (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, Vol.IT-39, pp.930-944.
  13. A. R. Barron (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning, Vol.14, pp.113-143.
  14. A. R. Barron (1994). Comment on Cheng and Titterington: Neural Networks, A Review from a Statistical Perspective. Statistical Science, Vol.9, no.1, pp.33-35.
  15. B. Clarke and A. R. Barron (1994). Jeffreys' prior is asymptotically least favorable under entropy risk. Journal of Statistical Planning and Inference, Vol.41, pp.37-60.
  16. Q. Xie and A. R. Barron (1997). Minimax redundancy for the class of memoryless sources. IEEE Transactions on Information Theory, Vol.43, pp.646-657.
  17. Y. Yang and A. R. Barron (1998). An asymptotic property of model selection criteria. IEEE Transactions on Information Theory, Vol.44, pp.117-133.
  18. A. R. Barron, J. Rissanen and B. Yu (1998). The minimum description length principle in coding and modeling. (Invited paper; special issue in honor of 50 years since Claude Shannon's seminal work.) IEEE Transactions on Information Theory, Vol.44, pp.2734-2760.
  19. A. R. Barron and N. Hengartner (1998). Information theory and superefficiency. Annals of Statistics, Vol.26, pp.1800-1825.
  20. A. R. Barron, L. Birge and P. Massart (1999). Risk bounds for model selection by penalization. Probability Theory and Related Fields, Vol.113, pp.301-413.
  21. A. R. Barron, M. Schervish and L. Wasserman (1999). The consistency of posterior distributions in nonparametric problems. Annals of Statistics, Vol.27, pp.536-561.
  22. Y. Yang and A. R. Barron (1999). Information-theoretic determination of minimax rates of convergence. Annals of Statistics, Vol.27, pp.1564-1599.
  23. Q. Xie and A. R. Barron (2000). Asymptotic minimax regret for data compression, gambling, and prediction. IEEE Transactions on Information Theory, Vol.46, pp.431-445.
  24. G. Cheang and A. R. Barron (2000). A Better Approximation for Balls. Journal of Approximation Theory, Vol.104, pp.183-203.
  25. J. E. Cross and A. R. Barron (2003). Efficient Universal Portfolios for Past Dependent Target Classes. Mathematical Finance, Vol.13, no.2, pp.245-276.
  26. O. Johnson and A. R. Barron (2004). Fisher Information Inequalities and the Central Limit Theorem. Probability Theory and Related Fields, Vol.129, pp.391-409.
  27. F. Liang and A. R. Barron (2004). Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection. IEEE Transactions on Information Theory, Vol.50, pp.2708-2726.
  28. G. Leung and A. R. Barron (2006). Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory, Vol.52, no.8, pp.3396-3410.
  29. M. Madiman and A. R. Barron (2007). Generalized entropy power inequalities and monotonicity properties of information. IEEE Transactions on Information Theory, Vol.53, no.7, pp.2317-2329.
  30. A. R. Barron, A. Cohen, W. Dahmen and R. DeVore (2008). Approximation and learning by greedy algorithms. Annals of Statistics, Vol.36, pp.64-94.
  31. A. Joseph and A. R. Barron (2012). Least Squares Superposition Codes of Moderate Dictionary Size Are Reliable at Rates up to Capacity. IEEE Transactions on Information Theory, Vol.58, no.5, May, pp.2541-2557.
  32. J. Takeuchi, T. Kawabata and A. R. Barron (2013). Properties of Jeffreys Mixture for Markov Sources. IEEE Transactions on Information Theory, Vol.59, no.1, January, pp.438-457.
  33. A. Joseph and A. R. Barron (2014). Fast Sparse Superposition Codes have Exponentially Small Error Probability for R < C. IEEE Transactions on Information Theory, Vol.60, no.2, February, pp.919-942.
  34. A. M. Kagan, Tinghui Yu, A. R. Barron and M. Madiman (2014). Contributions to the theory of Pitman estimators. Journal of Mathematical Sciences, Vol.199, no.2, pp.202-214.
  35. L. Jakobek, M. Boc and A. R. Barron (2015). Optimization of Ultrasonic-Assisted Extraction of Phenolic Compounds from Apples. Food Analytical Methods, Vol.8, pp.2612-2625.
  36. L. Jakobek and A. R. Barron (2016). Ancient Apple Varieties from Croatia as a Source of Bioactive Polyphenolic Compounds. Journal of Food Composition and Analysis, Vol.45, pp.9-15.
  37. X. Yang and A. R. Barron (2017). Minimax Compression and Large Alphabet Approximation through Poissonization and Tilting. IEEE Transactions on Information Theory, Vol.63.
  38. J. M. Klusowski and A. R. Barron (2017). Minimax Lower Bounds for Ridge Combinations Including Neural Nets. Submitted.

Book Chapters:

  1. A. R. Barron (1984). Predicted squared error: a criterion for automatic model selection. Chapter 4 in Self-Organizing Methods in Modeling, S. J. Farlow (Editor), Marcel Dekker, New York, pp.87-103.
  2. R. L. Barron, A. N. Mucciardi, F. J. Cook, J. N. Craig, and A. R. Barron (1984). Adaptive learning networks. Chapter 2 in Self-Organizing Methods in Modeling, S. J. Farlow (Editor), Marcel Dekker, New York, pp.25-65.
  3. A. R. Barron (1987). Are Bayes rules consistent in information? In Open Problems in Communication and Computation, T. M. Cover and B. Gopinath (Editors), Springer-Verlag, New York, pp.85-91.
  4. A. R. Barron (1991). Complexity regularization with application to artificial neural networks. In Nonparametric Functional Estimation and Related Topics, G. Roussas (Editor), Kluwer Academic Publishers, Boston, MA and Dordrecht, The Netherlands, pp.561-576.
  5. A. R. Barron (1998). Information-theoretic Characterization of Bayes Performance and the Choice of Priors in Parametric and Nonparametric Problems. In Bayesian Statistics 6, J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith (Editors). Oxford University Press, pp.27-52.
  6. J. Q. Li and A. R. Barron (2000). Mixture Density Estimation. In Advances in Neural Information Processing Systems, Vol.12, S. A. Solla, T. K. Leen and K.-R. Mueller (Editors). MIT Press, Cambridge, Massachusetts, pp.279-285.
  7. F. Liang and A. R. Barron (2005). Exact minimax predictive density estimation and MDL. [Plus the table with MDL parameter estimates.] In Advances in Minimum Description Length: Theory and Applications, P. D. Grunwald, I. J. Myung and M. A. Pitt (Editors). MIT Press, Cambridge, Massachusetts, pp.177-193.
  8. A. R. Barron, C. Huang, J. Q. Li and Xi Luo (2008). MDL Principle, Penalized Likelihood, and Statistical Risk. In Festschrift for Jorma Rissanen, Peter Grunwald, Petri Myllymaki, Ioan Tabus, Marcelo Weinberger and Bin Yu (Editors). Tampere International Center for Signal Processing, TICSP series #38, Tampere University of Technology, Tampere, Finland.

Publications in Conference Proceedings: (papers of 4 to 27 pages)

  1. A. R. Barron, F. W. van Straten, and R. L. Barron (1977). Adaptive learning network approach to weather forecasting: a summary. Proceedings of the IEEE International Conference on Cybernetics and Society, Washington, DC, September 19-21. Published by IEEE, New York, pp.724-727.
  2. A. R. Barron and R. L. Barron (1988). Statistical learning networks: a unifying view. In Computing Science and Statistics: Proceedings of the 20th Symposium on the Interface, Reston, Virginia, April 20-23. E. Wegman (Editor), Published by the American Statistical Association, Alexandria, Virginia, pp.192-203. (Invited presentation).
  3. A. R. Barron (1989). Statistical properties of artificial neural networks. Proceedings of the IEEE International Conference on Decision and Control, Tampa, Florida, Dec.13-15. pp.280-285, vol.1. (Invited presentation).
  4. R. L. Barron, R. L. Cellucci, P. R. Jordan, N. E. Beam, P. Hess, and A. R. Barron (1990). Applications of polynomial neural networks to fault detection, isolation, and estimation (FDIE) and reconfigurable flight control. Proceedings of the National Aerospace Electronics Conference, Dayton, Ohio, May 23-25, pp.507-519, vol.2 (Winner of the best paper prize, 1990 NAECON). Republished in Proceedings 1998 NAECON, pp. 348-360.
  5. A. R. Barron (1991). Approximation and estimation bounds for artificial neural networks. In Computational Learning Theory: Proceedings of the Fourth Annual ACM Workshop, Santa Cruz, CA, August 5-7. L. Valiant, Ed., Morgan Kaufmann Publishers, Inc., San Mateo, California, pp.243-249. (Honored as one of the four papers that appeared by invitation in expanded form in the special issue of Machine Learning, representing the top presentations at the workshop.)
  6. A. R. Barron (1992). Neural Net Approximation. Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems, May 20-22, K. S. Narendra (Editor), Center for Systems Science, Yale University, pp. 69-72.
  7. D. Haussler and A. R. Barron (1993). How well do Bayes methods work for on-line prediction of ±1 values? Computational Learning and Cognition: Proc. Third NEC Research Symposium, SIAM, Philadelphia, pp.74-101.
  8. J. Takeuchi and A. R. Barron (1997). Asymptotically minimax regret for exponential families. 20th Symposium on Information Theory and Its Applications. pp.665-668.
  9. J. Takeuchi and A. R. Barron (1998). Robustly Minimax Codes for Universal Data Compression. 21st Symposium on Information Theory and Its Applications. Gifu, Japan, December 2-5.
  10. G.H.L. Cheang and A. R. Barron (1999). Estimation with Two Hidden Layer Neural Nets. Proceedings of the 1999 International Joint Conference on Neural Networks (IJCNN), pp.375-378, vol.1.
  11. A. R. Barron (2000). Limits of Information, Markov Chains and Projection. Proc. IEEE International Symposium on Information Theory. Sorrento, Italy, June 25-30.
  12. G.H.L. Cheang and A. R. Barron (2001). Penalized Least Squares, Model Selection, Convex Hull Classes, and Neural Nets. Proceedings of the 9th European Symposium on Artificial Neural Networks. Bruges, Belgium, April 25-27, pp.371-376.
  13. J. Takeuchi and A. R. Barron (2001). Properties of Jeffreys mixture for Markov sources. Proc. Workshop on Information Based Induction Sciences (IBIS), pp. 327-333.
  14. G. Leung and A. R. Barron (2005). Combining Least-squares Regressions: An Upper Bound on Mean Squared Error. Proc. International Symposium on Information Theory, September 4-9, pp. 1711-1715.
  15. M. Madiman and A. R. Barron (2006). The Monotonicity of Information in the Central Limit Theorem and Entropy Power Inequalities. Proc. IEEE International Symposium on Information Theory. Seattle, Washington, July 2006. pp. 1021-1025.
  16. J. Takeuchi, A. R. Barron, T. Kawabata (2006). Statistical curvature and stochastic complexity. Proceedings of the 2nd Symposium on Information Geometry and its Applications, Tokyo, Japan, December 12-16, pp. 29-36.
  17. A. R. Barron and Xi Luo (2007). Adaptive Annealing. Proceedings 45th Annual Allerton Conference on Communication, Control, and Computing. Allerton House, UIUC, Illinois. September 26-28. pp. 665-673.
  18. A. R. Barron, C. Huang, J. Q. Li, and Xi Luo (2008). MDL, Penalized Likelihood and Statistical Risk. IEEE Information Theory Workshop. Porto, Portugal, May 4-9. pp. 247-257.
  19. A. R. Barron and Xi Luo (2008). MDL procedures with l_1 penalty and their statistical risk. First Workshop on Information Theoretic Methods in Science and Engineering. Tampere, Finland, August 18-20, 2008.
  20. M. Madiman, A. R. Barron, A. Kagan, T. Yu (2009). A Model for Pricing Data Bundles by Minimax Risks for Estimation of a Location Parameter. Proceedings of the IEEE Workshop on Information Theory. Volos, Greece, June 10-12. pp. 106-109.
  21. A. R. Barron, A. Joseph (2010). Least Squares Superposition Codes of Moderate Dictionary Size, Reliable at Rates up to Capacity. Proc. IEEE International Symposium on Information Theory. Austin, Texas, June 13-18. pp. 275-279.
  22. A. R. Barron, A. Joseph (2010). Towards fast reliable communication at rates near capacity with Gaussian noise. Proc. IEEE International Symposium on Information Theory. Austin, Texas, June 13-18. pp. 315-319.
  23. A. R. Barron, A. Joseph (2011). Analysis of fast sparse superposition codes. Proc. IEEE International Symposium on Information Theory. St Petersburg, Russia, August 1-6. pp. 1772-1776.
  24. E. Abbe, A. R. Barron (2011). Polar coding schemes for the AWGN channel. Proc. IEEE International Symposium on Information Theory. St Petersburg, Russia, August 1-6, 2011. pp. 194-198.
  25. A. R. Barron and Sanghee Cho (2012). High-rate sparse superposition codes with iteratively optimal estimates. Proc. IEEE International Symposium on Information Theory. Cambridge, MA, July 2012, pp. 120-124.
  26. Cynthia Rush and A. R. Barron (2013). Using the method of nearby measures in superposition coding with a Bernoulli dictionary. 6th Workshop on Information Theory Methods in Science and Engineering (WITMSE). Tokyo, Japan, August 2013.
  27. Xiao Yang and A. R. Barron (2013). Large alphabet coding and prediction through Poissonization and tilting. Proc. 6th Workshop on Information Theoretic Methods in Science and Engineering (WITMSE). Tokyo, Japan, August 2013. pp. 68-74.
  28. Jun'ichi Takeuchi and A. R. Barron (2013). Asymptotically minimax regret by Bayes mixtures for non-exponential families. Proc. IEEE Information Theory Workshop. pp. 1-5.
  29. Xiao Yang and A. R. Barron (2014). Compression and prediction for large alphabet iid and Markov models. Proc. 7th Workshop on Information Theoretic Methods in Science and Engineering (WITMSE). Honolulu, July 5-8. pp. 31-34.
  30. Jun'ichi Takeuchi and A. R. Barron (2014). Asymptotically Minimax Regret for Models with Hidden Variables. Proc. IEEE International Symposium on Information Theory. Honolulu, HI, June 2014, pp.3037-3041.
  31. Jun'ichi Takeuchi and A. R. Barron (2014). Stochastic Complexity for Tree Models. Proc. IEEE Information Theory Workshop. November 2-4, 2014, pp.222-226.
  32. Sabyasachi Chatterjee and A. R. Barron (2014). Information Theoretic Validity of Penalized Likelihood. Proc. IEEE International Symposium on Information Theory. Honolulu, HI, June 2014, pp. 3027-3031.
  33. Grace Xiao Yang and A. R. Barron (2014). Compression and Predictive Distributions for Large Alphabet i.i.d. and Markov Models. Proc. IEEE International Symposium on Information Theory. Honolulu, HI, June 2014, pp. 2504-2508.
  34. A. Barron, Teemu Roos, Kazuho Watanabe (2014). Bayesian Properties of Normalized Maximum Likelihood and its Fast Computation. Proc. IEEE International Symposium on Information Theory. Honolulu, HI, June 2014. pp.1667-1671.

Newsletter Article:

  1. R. Venkataramanan, S. Tatikonda, A. Barron (2016). Sparse Regression Codes. IEEE Information Theory Society Newsletter, December 2016, pp. 7-15. [Based on ISIT Tutorial by R. Venkataramanan and A. Barron, Barcelona, July 2016.]

Patents:

Technical Reports: (with details not in subsequent publications)

A Selection of Seminar Presentation Files (PDF format), for on-screen viewing or projection:

  • MDL, Penalized Likelihood and Statistical Risk. Presented at the Information Theory Workshop, Porto, Portugal, May 8, 2008. Festschrift on the occasion of the 75th birthday of Jorma Rissanen. Similar presentations, with updates for the regression case, at the Workshop on Information Theory Methods in Science and Engineering, Tampere, Finland, August 19, 2008, and the Information and Communication Conference, Renyi Institute, Budapest, August 25-28, 2008, on the occasion of the 70th birthday of Imre Csiszar: "MDL Procedures with L_1 Penalty and their Statistical Risk."
  • Adaptive Annealing. Presentation at the Allerton Conference on Communication, Control, and Computing. September 27, 2007.
  • Information Theory and Flexible High-Dimensional Non-Linear Function Estimation. Presented at the Info-Metrics Institute Workshop, American University, Washington, DC, November 12, 2011. Similar presentation at Harvard University, Department of Statistics, October 2011. Overview of several useful results for high-dimensional function estimation. Disclaimer: The proposed solution on page 19 to the differential equation for Adaptive Annealing is problematic due to discontinuity of the gradient at the origin.
  • Analysis of Fast Sparse Superposition Codes. Presentation at the IEEE International Symposium on Information Theory, St. Petersburg, Russia, August 5, 2011. Adds details of distribution analysis not in the earlier presentation "Towards Fast Reliable Communication at Rates Near Capacity with Gaussian Noise," IEEE International Symposium on Information Theory, Austin, TX, June 18, 2010. Similar presentations: "Communication by Regression: Practical Achievement of Shannon Capacity," at the Workshop on Infusing Statistics and Engineering, Harvard University, June 5-6, 2011, and "Sparse Superposition Codes: low complexity and exponentially small error probability at all rates below capacity," Workshop on Information Theory Methods in Science and Engineering, Helsinki, Finland, August 8, 2011.

Other Conference Presentations (1983-2008): (proceedings containing abstracts of not more than 1 page). [A number of subsequent presentations are described above.]

  • A. R. Barron (1983). Convergence of logically simple estimates of unknown probability densities. IEEE International Symposium on Information Theory, Saint Jovite, Canada, September 26-30.
  • A. R. Barron (1985). Entropy and the central limit theorem. IEEE International Symposium on Information Theory, Brighton, England, June 23-28.
  • A. R. Barron (1985). Ergodic theorem for densities: generalized Shannon-McMillan-Breiman theorem. IEEE International Symposium on Information Theory, Brighton, England, June 23-28.
  • A. R. Barron (1985). Logically smooth density estimation. Joint IMS, ASA Annual Meeting, Las Vegas, Nevada, August 5-8.
  • A. R. Barron (1987). Applications of large deviations in statistics. Conference on Asymptotic Methods for Stochastic Systems: Large Deviations Theory and Practice, University of Maryland, October 25-27. (Invited Presentation).
  • A. R. Barron (1988). The convergence in information of probability density estimators. IEEE International Symposium on Information Theory, Kobe, Japan, June 19-24.
  • T. M. Cover and A. R. Barron (1988). A bound on the financial value of information. IEEE International Symposium on Information Theory, Kobe, Japan, June 19-24.
  • A. R. Barron (1989). Minimum complexity density estimation. IMS Regional Meeting, Lexington, Kentucky, March 19-22. (Invited Presentation).
  • A. R. Barron (1989). Portfolio selection based on nonparametric density estimates for the stock market. 21st Symposium on the Interface: Computing Science and Statistics, Orlando, Florida, April 9-12. (Invited Presentation).
  • A. R. Barron (1989). Minimum complexity estimation. IEEE Workshop on Information Theory, Center for Applied Math, Cornell University, June 26-30. (Session Organizer).
  • A. R. Barron (1989). Some statistical properties of polynomial networks and other artificial neural networks. Conference on Neural Information Processing Systems, Denver, Colorado, November 27-30. (Invited Plenary Presentation).
  • A. R. Barron (1990). An index of resolvability of probability density estimators. IEEE International Symposium on Information Theory, San Diego, California, January 14-19.
  • A. R. Barron (1990). Some statistical convergence properties of artificial neural networks. IEEE International Symposium on Information Theory, San Diego, California, January 14-19.
  • A. R. Barron (1990). The index of resolvability: statistical convergence of minimum complexity estimation. AAAI Symposium on Minimal-Length Encoding, Stanford California, March 27-29 (Invited Presentation).
  • A. R. Barron (1990). Some approximation and estimation theorems for artificial neural networks. IEEE Information Theory Workshop, Veldhoven, The Netherlands, June 10-15 (Invited Presentation).
  • A. R. Barron (1990). Statistical properties of artificial neural networks. SIAM Annual Meeting, Chicago, Illinois, July 20 (Invited Presentation).
  • A. R. Barron (1990). Complexity Regularization. NATO Advanced Study Institute on Nonparametric Functional Estimation and Related Topics, Spetses, Greece, July 29 - August 11 (Invited Presentation).
  • A. R. Barron (1991). Information theory and the stock market: the effect of side information. Workshop on Coordination of Distributed Information and Decisions, Cornell University, April 11-13. (Invited Presentation).
  • A. R. Barron (1991). Approximation and estimation results for adaptively synthesized neural network architectures. Workshop on Theoretical Issues in Neural Nets, Center for Discrete Mathematics and Theoretical Computer Science, Rutgers University, May 20-23. (Invited Presentation).
  • A. R. Barron and R. L. Barron (1991). Artificial neural networks in industry: some developments using statistical techniques. IMS Special Topics Meeting on Statistics in Industry, Philadelphia, Pennsylvania, June 9-12 (Invited Presentation).
  • A. R. Barron (1991). Universal approximation bounds for superpositions of a sigmoidal function. IEEE International Symposium on Information Theory, Budapest, Hungary, June 23-29.
  • A. R. Barron (1991). Approximation and estimation bounds for sigmoidal and polynomial networks. Joint Statistical Meetings, Atlanta, Georgia, August 19-22. (Session Organizer).
  • A. R. Barron (1991). Approximation and estimation bounds for neural networks and projection pursuit. Workshop on Neural Information Processing Systems, Vail, Colorado, December 10-11. (Invited Presentation).
  • A. R. Barron (1991). Risk estimation, risk optimality, regularization, and neural nets. Workshop on Neural Information Processing Systems, Vail, Colorado, December 10-11.
  • A. R. Barron (1992). Approximation, estimation, and computation results for artificial neural networks. 24th Symposium on the Interface: Computing Science and Statistics, College Station, Texas, March 19-21. (Invited Presentation).
  • A. R. Barron (1992). Artificial neural networks: stochastic analysis and engineering applications? Symposium on Stochastic Processes in Engineering Applications, Otaniemi, Finland, April 7-9. (Invited Plenary Presentation).
  • A. R. Barron, D. Olive, and Y. Yang (1992). Asymptotically optimal complexity-based model selection. IMS Regional Meeting, Corvallis, Oregon, June 14-15. (Invited Presentation).
  • A. R. Barron (1992). Statistical accuracy of neural nets. Conference on Neural Information Processing Systems: Natural and Synthetic. Denver, Colorado, November 30 - December 2. (Invited Tutorial).
  • A. R. Barron (1993). Neural net approximation. Annual Meeting of the American Math Society, San Antonio, Texas, January 13-16. (Invited Presentation).
  • A. R. Barron, L. Gyorfi, and E. C. van der Meulen (1993). Universal coding of non-discrete sources based on distribution estimation consistent in expected information divergence. IEEE International Symposium on Information Theory, San Antonio, Texas, January 17-22. p.51.
  • A. R. Barron, B. Clarke, and D. Haussler (1993). Information bounds for the risk of Bayesian predictions and the redundancy of universal codes. IEEE International Symposium on Information Theory, San Antonio, Texas, January 17-22.
  • A. R. Barron (1993). Statistical accuracy of neural nets. Workshop on Information and Geometry, Hakone, Japan. March. (Invited Presentation).
  • A. R. Barron (1993). Optimal rate properties of minimum complexity estimation. Workshop on Descriptional Complexity, Schloss Dagstuhl, Wadern, Germany. May. (Invited Presentation).
  • A. R. Barron (1993). Do neural nets avoid the curse of dimensionality? Congress on Statistics, Vannes, France, May. (Invited Presentation).
  • A. R. Barron (1993). Performance bounds for neural net estimation and classification. Annual Meeting of the Classification Society of North America, Pittsburgh, June. (Invited Presentation).
  • A. R. Barron (1993). Do neural nets avoid the curse of dimensionality? NATO ASI on Statistics and Neural Nets, Les Arcs, France, June. (Invited Presentation).
  • A. R. Barron (1994). The accuracy of Bayes estimates of neural nets. 26th Symposium on the Interface: Computing Science and Statistics, Research Triangle Park, NC, June 15-18.
  • A. R. Barron, Y. Yang and B. Yu (1994). Asymptotically optimal function estimation by minimum complexity criteria. IEEE International Symposium on Information Theory, Trondheim, Norway, June 27 - July 1.
  • A. R. Barron (1994). Neural net approximation and estimation. IEEE Workshop on Information Theory, Moscow, Russia, July 3-6. (Invited Presentation).
  • A. R. Barron (1994) Minimum complexity estimation. ML/COLT: Workshop on Applications of Descriptional Complexity to Induction, Statistical and Visual Inference, New Brunswick, New Jersey, July 8-9. (Invited Presentation.)
  • A. R. Barron (1995). Statistics and neural nets. New England Statistics Conference, Storrs, Connecticut, April 22. (Invited Plenary Presentation).
  • A. R. Barron, Y. Yang (1995). Information-theoretic development of minimax rates of convergence. IEEE International Symposium on Information Theory, Whistler, British Columbia, September 17-22. (Added to session at recommendation of program chair; not in proceedings).
  • N. Hengartner and A. R. Barron (1995). Information theory and superefficiency. IMS regional meeting Stanford, California. (Invited Presentation).
  • B. S. Clarke and A. R. Barron (1995). Jeffreys' prior yields the asymptotic minimax redundancy. IEEE-IMS Workshop on Information Theory and Statistics, Alexandria, Virginia, October 27-29. (Invited Presentation).
  • A. R. Barron (1995). Asymptotically optimal model selection and neural nets. IEEE-IMS Workshop on Information Theory and Statistics, Alexandria, Virginia, October 27-29. (Invited Presentation).
  • Q. Xie and A. R. Barron (1996). Asymptotic minimax regret for data compression, gambling, and prediction. Workshop on sequence prediction, Santa Cruz, California, May 3-5. (Invited, jointly presented).
  • A. R. Barron (1996). The fundamental role of Kullback Information in large sample statistics. Solomon Kullback Memorial Conference, Washington, DC, May 23-24. (Invited Presentation).
  • A. R. Barron (1996). Adaptation and model selection. AMS-IMS-SIAM Summer Research Conference on adaptive selection of models and statistical procedures, Mount Holyoke, Massachusetts, June 22-28. (Conference Chairmen, A. Barron, P. Bickel, I. Johnstone, D. Donoho).
  • A. R. Barron and Y. Yang (1996). Information theory in nonparametric estimation. Nonparametric Estimation: The Road Ahead, Canberra, Australia, July 2-4. (Invited Presentation).
  • A. R. Barron (1996). Adaptive model selection and neural networks. Sydney International Statistics Congress, IMS regional meeting, Sydney, Australia, July 8-12. (Invited Presentation).
  • A. R. Barron (1996). Asymptotics of Bayes estimators. International Society of Bayesian Analysis, Regional Meeting, Chicago, Illinois, August 2-3. (Invited Presentation).
  • A. R. Barron and Y. Yang (1996). Adaptive model selection and the index of resolvability. Joint Statistical Meetings of the ASA and IMS, Chicago, Illinois, August 4-8. (Invited Presentation).
  • A. R. Barron and Q. Xie (1997). Asymptotic minimax regret for data compression, gambling and prediction. IEEE International Symposium on Information Theory, Ulm, Germany, June 29 - July 4.
  • A. R. Barron (1997). Information theory in probability, statistics, learning, and neural nets. Computational Learning Theory: Tenth Annual ACM Workshop, Nashville, Tennessee, July 6-9. (Invited Plenary Presentation).
  • A. R. Barron (1997). Information theory in probability, statistics, learning, and neural nets. International Conference on Combinatorics, Information Theory, and Statistics, University of Southern Maine, Portland, Maine, July 18-20. (Invited Presentation).
  • A. R. Barron and Y. Yang (1997). Information-theoretic determination of minimax rates of convergence. Symposium on Nonparametric Functional Estimation, Centre de Recherches Mathematiques, University of Montreal, October 16-18. (Invited Presentation).
  • A. R. Barron (1998). Nonlinear approximation, greedy algorithms, and neural networks. Ninth International Conference on Approximation Theory, January 4-6. (Invited Presentation).
  • A. R. Barron (1998). How information theory illuminates the behavior of risk functions of Bayes procedures. Purdue Workshop on the Interface between Paradigms of Statistics. June 17-19. (Invited Presentation).
  • A. R. Barron and J.-I. Takeuchi (1998). Mixture models achieving optimal coding regret. IEEE Information Theory Workshop. Killarney, Ireland, June 22-26.
  • J. Takeuchi and A. R. Barron (1998). Asymptotic Minimax Regret by Bayes Mixtures. International Symposium on Information Theory. Cambridge, MA, August 16-21.
  • A. R. Barron (1998). Information theory in probability and statistics; Approximation and estimation bounds for Gaussian mixtures. CIRM Workshop on Information Theory, Statistics, and Image Analysis, Marseille, France, December 7-11. (Two Invited Presentations).
  • A. R. Barron (1999). Information theory in probability and statistics. IEEE Information Theory Workshop on Detection, Estimation, Classification, and Imaging, Santa Fe, New Mexico, February 24-26. (Invited Presentation).
  • A. R. Barron (1999). Decision theory of regret for universal coding, gambling, and prediction. DIMACS Workshop: Online Decision Making, Rutgers University, New Brunswick, NJ, July 12-15. (Invited Presentation).
  • A. R. Barron (2000). Limits of information, Markov chains and projection. IEEE International Symposium on Information Theory, Sorrento, Italy, June 26-30.
  • A. R. Barron, Laszlo Gyorfi, Michael Nussbaum (2000). Nonparametric Estimation, Neural Nets and Risk Asymptotics. Short Course. Mathematical Research Institute, Oberwolfach, Germany, June 10-17.
  • A. R. Barron (2000). Information theory in probability; Information theory in statistics. J. Bolyai Society Conference on Information Theory in Mathematics, honoring the 50th anniversary of the formation of what is now known as the Renyi Institute of Mathematics of the Hungarian Academy of Sciences. Balatonlelle, Hungary, July 3-7. (Two Invited Presentations).
  • A. R. Barron (2000). Prediction, data compression, gambling, and model selection: Do Bayes procedures nearly minimize the maximum of regret over all possible data sequences? AMS-IMS-SIAM Summer Research Conference on Bayes, Frequentist, and Likelihood Inference: a Synthesis. Mount Holyoke College, South Hadley, Massachusetts, July 9-13. (Invited Presentation.)
  • A. R. Barron (2001). Information-theoretic bounds for mixture modeling, model selection, and data compression. Workshop on Information Theory and Statistics, DIMACS, Rutgers University, March 2001. (Invited Presentation).
  • A. R. Barron (2001). Information theory in probability and statistics. 23rd European Meeting of Statistics, Funchal, Madeira, Portugal, August 13-18. (Invited Keynote Plenary Presentation).
  • F. Liang and A. R. Barron (2001). Minimax optimal predictive density estimation, data compression, and model selection. Workshop on MDL at Conference on Neural Information Processing Systems, Whistler, British Columbia, December 10. (Invited Presentation).
  • F. Liang and A. R. Barron (2002). Exact minimax strategies for predictive density estimation, data compression and model selection. IEEE International Symposium on Information Theory, Lausanne, Switzerland, July 1-5.
  • A. R. Barron, Jiangfeng Yu, and Wei Qiu (2003). Maximum compounded wealth: portfolio estimation, option pricing, and stock selection. Workshop on Complexity and Inference, DIMACS, Rutgers University, June 2-5. (Invited Presentation).
  • A. R. Barron (2003). The role of information in the central limit problem. Symposium on Information Theory and Some Friendly Neighbors -- in Wunschkonzert, ZIF, Center for Interdisciplinary Research, Bielefeld, August 11-13. (Invited Presentation).
  • A. R. Barron (2003). Interplay of statistics and information theory in formulation and selection of models. Workshop on Model Building, Dortmund, Germany, November 13-14. (Invited Presentation).
  • G. Leung and A. R. Barron (2004). Information theory, model selection and model mixing for regression, Conference on Information Sciences and Systems, Princeton, NJ, March 17-19. (Invited Presentation in session on Information Theory, Computer Science, and Statistics.)
  • A. R. Barron and G. Leung (2004). Risk assessment for Bayes procedures and model mixing in regression. IVth Workshop on Bayesian Nonparametrics, Universita di Roma La Sapienza, Rome, Italy, June 12-16. (Invited Presentation).
  • A. R. Barron and G. Leung (2004). Risk assessment for model mixing. Workshop on Mathematical Foundations of Learning Theory, Barcelona, Spain, June 18-23. (Invited Presentation).
  • A. R. Barron (2004). Relative entropy in probability theory and mathematical statistics. Workshop on Entropy in the Mathematical, Physical, and Engineering Sciences, Padova, Italy, June 24-27. (Two Invited Presentations).
  • A. R. Barron (2004). Fitting functions of many variables: neural networks and beyond. 16th Conference on Computational Statistics (COMPSTAT 2004), Prague, Czech Republic, August 23-28. (Invited Keynote Plenary Presentation).
  • A. R. Barron (2005). Neural nets, mixture models, and adaptive kernel machines. Yale Workshop on Adaptive and Learning Systems, May 29-31, Center for Systems Science, Yale University. (Invited presentation).
  • A. R. Barron (2005). Challenges in high-dimensional function estimation and attempted solutions, Congress on Statistics, Pau, France, June 6-10. (Invited Presentation).
  • A. R. Barron (2005). Information theory and risk analysis. Medallion Lecture. Joint Statistical Meetings of the IMS and ASA, Minneapolis, Minnesota, August 7-11. (Presented with IMS Medallion Award; one-hour special invited presentation on Aug. 7).
  • A. R. Barron (2005). Information theory and statistics for machine learning, IEEE Workshop on Machine Learning for Signal Processing XV, Mystic, CT, October 28-30. (Invited Keynote Plenary Presentation).
  • M. Madiman and A. R. Barron (2006). Monotonicity of information in the central limit theorem, Workshop on Information Theory and its Applications, University of California, San Diego, February 6-9. (Invited Presentation).
  • A. R. Barron (2006). Simple risk bounds for mixing least squares regressions. Journees: Model Selection in Statistics: Different Approaches, Universite de Nice, Sophia-Antipolis, Nice, France, March 14-19. (Two one-hour invited presentations).
  • A. R. Barron (2006). Simple risk bounds for mixing least squares regressions. International Workshop on Applied Probability, University of Connecticut, Storrs, CT, May 18. (Invited Presentation).
  • A. R. Barron and Wei Qiu (2007). Maximum wealth portfolios, Workshop on Information Theory and its Applications, University of California, San Diego, January 29 - February 2. (Invited Presentation).
  • A. R. Barron, Cong Huang, and Xi Luo (2008). Penalized squared error and likelihood: risk bounds and fast algorithms, Workshop on Sparsity in High Dimensional Statistics and Learning Theory, Georgia Institute of Technology, Atlanta, Georgia, March 22-24. (Invited Three-Part Presentation).
  • A. R. Barron (2008). Information theory principles in probability and statistics, Elements of Information Theory Workshop, on the Occasion of Tom Cover's 70th Birthday, Stanford University, Stanford, CA, May 16. (Invited Presentation).

Invited Departmental Seminar Presentations (1985-2007): (These had short abstract announcements).
[A number of subsequent presentations are indicated with the presentation files above.]

  • Purdue University, Joint Statistics Colloquium, October 3, 1985. Topic: Entropy and the central limit theorem.
  • Michigan State University, Department of Statistics and Probability, January 28, 1986. Topic: Generalized Shannon-McMillan-Breiman theorem.
  • University of Chicago, Department of Statistics, October 20, 1986. Topic: Uniformly powerful tests.
  • University of Virginia, Department of Mathematics, March 5, 1987. Topic: Convergence of Bayes estimators of probability density functions.
  • Stanford University, Department of Statistics, October 20, 1987. Topic: Convergence of Bayes estimators of probability density functions.
  • University of Chicago, Department of Statistics, March 7, 1988. Topic: Convergence of Bayes estimators of probability density functions.
  • McGill University, Joint Statistics Seminar for Montreal universities, March 31, 1988. Topic: Approximation of densities by sequences of exponential families.
  • Dupont Research Center, Dover, Delaware, April 26, 1988. Topic: Statistical learning networks.
  • IBM T. J. Watson Research Center, Yorktown Heights, New York, August 10, 1988. Topic: Statistical learning networks.
  • Stanford University, Information Systems Laboratory, November 3, 1988. Topic: Minimum complexity density estimation.
  • IBM Technical Education Center, Thornwood, New York, January 11-12, 1989. Statistical learning networks. In the short course on Knowledge Acquisition from Data.
  • Cornell University, Department of Economics and Program of Statistics (Co-hosts), February 1, 1989. Topic: Convergence of Bayes estimators of probability density functions.
  • Purdue University, Department of Statistics, September 7, 1989. Topic: Minimum complexity density estimation.
  • University of Lowell, Massachusetts, Joint Seminar, Department of Mathematics and Department of Electrical Engineering, March 14, 1990. Topic: Statistical properties of polynomial networks and other artificial neural networks.
  • Carnegie Mellon University, Department of Statistics, April 4, 1990. Topic: Statistical properties of polynomial networks and other artificial neural networks.
  • University of Chicago, Department of Statistics, October 15, 1990. Topic: Statistical properties of artificial neural networks.
  • University of California, San Diego, Department of Mathematics, January 7, 1991. Topic: Approximation bounds for artificial neural networks.
  • University of California, San Diego, Department of Mathematics, January 8, 1991. Topic: Complexity regularization for nonlinear model selection.
  • Siemens Corporation, Princeton, New Jersey, February 28, 1991. Topic: Universal approximation bounds for superpositions of a sigmoidal function.
  • University of Wisconsin, Department of Statistics, April 3, 1991. Topic: Complexity regularization for nonlinear model selection.
  • University of Wisconsin, Department of Mathematics, April 4, 1991. Topic: Approximation bounds for artificial neural networks.
  • Technical University of Budapest, Department of Electrical Engineering, July 2, 1991. Topic: Universal approximation bounds for superpositions of a sigmoidal function.
  • Mathematical Sciences Research Institute, Berkeley, California, September 25, 1991. Topic: Empirical process bounds for artificial neural networks.
  • Stanford University, Department of Statistics, October 15, 1991. Topic: Approximation and estimation bounds for artificial neural networks.
  • University of California, Santa Cruz, Department of Computer and Information Sciences, October 17, 1991. Topic: Computationally efficient approximation and estimation of functions using artificial neural networks.
  • Yale University, Department of Statistics, January 13, 1992. Topic: Neural network estimation.
  • University of Virginia, Department of Electrical Engineering, Eminent Speaker Series, February 21, 1992. Topic: Estimation of functions of several variables -- neural networks, Fourier decomposition, and Bayes methods.
  • North Carolina State University, Department of Statistics, February 29, 1992. Topic: Estimation of functions of several variables -- neural networks, Fourier decomposition, and Bayes methods.
  • Cornell University, Center for Applied Mathematics, March 6, 1992. Topic: Estimation of functions of several variables -- neural networks, Fourier decomposition, and Bayes methods.
  • University of North Carolina, Department of Statistics, March 30, 1992. Topic: Estimation of functions of several variables -- neural networks, Fourier decomposition, and Bayes methods.
  • University of Joensuu, Finland, Department of Statistics, April 9, 1992. Topic: Introduction to artificial neural networks.
  • University of Paris VI, Department of Statistics, April 15, 1992, and University of Paris, Orsay, Department of Statistics, April 16, 1992. Topic: Estimation of functions of several variables -- neural networks, Fourier decomposition, and Bayes methods.
  • University of Paris VI, Department of Statistics, April 22, 1992, and University of Paris, Orsay, Department of Statistics, April 23, 1992. Topic: Performance bounds for complexity-based model selection.
  • Princeton University, Department of Electrical Engineering, May 14, 1992. Topic: Overview of approximation results for sigmoidal networks.
  • University of Massachusetts at Lowell, Joint Seminar, Department of Mathematics and Department of Electrical Engineering, October 21, 1992. Topic: Statistical accuracy of neural nets.
  • University of Tokyo, Japan, Department of Information, Physics and Engineering, March 1993. Topic: Information theory and model selection.
  • University of Paris VI, Department of Statistics, May 1993. Topic: Optimal rate properties of minimum complexity estimation.
  • University of Paris VI, Department of Statistics, May 1993. Topic: Information-theoretic proof of martingale convergence.
  • University of Pennsylvania, Wharton School, October 21, 1993. Topic: Neural networks and statistics.
  • Massachusetts Institute of Technology, Center for Biological and Computational Learning, October 27, 1993. Topic: Neural networks and statistics.
  • Rutgers University, Department of Statistics, October 5, 1994. Topic: Statistical accuracy of neural nets.
  • University of South Carolina, Department of Mathematics, Spring 1995. Topic: Neural net approximation.
  • Carnegie Mellon University, Department of Statistics, Fall 1995. Topic: Information risk and superefficiency.
  • Massachusetts Institute of Technology, Department of Applied Mathematics, March 1996. Topic: Consistent and uniformly consistent classification.
  • Columbia University, Department of Statistics, Fall 1996. Topic: Consistency of posterior distributions in nonparametric problems.
  • Northeastern University, Joint Mathematics Colloquium with MIT, Harvard, and Brandeis, February 27, 1997. Topic: Information theory in probability and statistics.
  • Iowa State University, Department of Statistics, March 28, 1997. Topic: Information theory in probability and statistics: The fundamental role of Kullback divergence.
  • Washington University, St. Louis, Department of Electrical Engineering, Center for Imaging Systems, April 16, 1997. Topic: Universal data compression, prediction, and gambling.
  • Massachusetts Institute of Technology, LIDS Colloquium, May 5, 1998. Topic: Simple universal portfolio selection.
  • University of California, Santa Cruz, Baskin Center for Computer Engineering, October 1998. Topic: Approximation bounds for Gaussian mixtures.
  • Lucent, Bell Laboratories, Murray Hill, New Jersey, March 1999. Topic: Approximation and estimation bounds for mixture density estimation.
  • Stanford University, Department of Statistics, Probability Seminar, May 24, 1999. Topic: Information, martingales, Markov chains, convex projections, and the CLT.
  • Stanford University, Department of Statistics, Statistics Seminar, May 25, 1999. Topic: Mixture density estimation.
  • Rice University, Departments of Statistics and Electrical Engineering, November 4, 2000. Topic: Information theory and statistics -- best invariant predictive density estimators.
  • University of Chicago, Department of Statistics, November 21, 2000. Topic: Information theory and statistics -- best invariant predictive density estimators.
  • Yale University, Department of Computer Science, Alan J. Perlis Seminar, April 26, 2001. Topic: Neural nets, Gaussian mixtures, and statistical information theory.
  • Brown University, Department of Applied Mathematics, May 9, 2001. Topic: I do not recall.
  • University of Massachusetts at Lowell, Department of Mathematics, September 19, 2001. Topic: Mixture density estimation.
  • University of California at Los Angeles, Department of Statistics, May 21, 2002. Topic: Nonlinear approximation, estimation, and neural nets (I do not recall the specific title).
  • Columbia University, Department of Statistics, October 28, 2002. Topic: Information inequalities in probability and statistics.
  • University of Georgia, Department of Statistics, November 26, 2002. Topic: Information inequalities in probability and statistics.
  • University of Pennsylvania, Wharton School, October 29, 2003. Topic: Portfolio estimation for compounding wealth.
  • University of North Carolina (in conjunction with Duke University), Departments of Statistics, November 3, 2004. Topic: Risk assessment and advantages of model mixing for regression.
  • University of South Carolina, Department of Mathematics, April 7, 2005. IMI Distinguished Lecture. Topic: Statistical theory for nonlinear function approximation: neural nets, mixture models, and adaptive kernel machines.
  • Helsinki University and Helsinki Institute of Information Technology, Helsinki, Finland. August 22-25, 2005. Two talks: (1) Statistical foundations and analysis of the minimum description length principle. (2) Consequences of MDL for neural nets and Gaussian mixtures.
  • Princeton University, Department of Operations Research and Financial Engineering, October 4, 2005. Topic: Statistical perspectives on growth rate optimal portfolio estimation.
  • IBM Research Laboratories, Yorktown Heights, September 22, 2006. Topic: Generalized entropy power inequalities and the central limit theorem.
  • Purdue University, Department of Computer Science, February 26, 2007. Prestige Lecture Series on the Science of Information. Topic: The interplay of information theory, probability, and statistics.
  • University of Illinois, Joint Seminar, Department of Statistics and Department of Electrical and Computer Engineering, February 27, 2007. Prestige Conference Series. Topic: The interplay of information theory and probability.
  • Boston University, Department of Statistics, March 1, 2007. Prestige Conference Series. Topic: Information inequalities and the central limit theorem.
  • University of California at Berkeley, Department of Computer Science, October 4, 2007. Topic: Fast and accurate greedy algorithm for L1 penalized least squares. Primarily presented by Cong Huang.
  • Rutgers University, Department of Statistics, December 12, 2007. Topic: Fast and accurate L1 penalized least squares. Co-presented with Cong Huang.

Ph.D. Dissertations Supervised:

  1. Bertrand S. Clarke (1989). Asymptotic Cumulative Risk and Bayes Risk under Entropy Loss, with Applications. University of Illinois at Urbana-Champaign. [Was Assistant Professor at Purdue University; then Professor at University of British Columbia and University of Miami; now Chairman, Department of Statistics, University of Nebraska.]
  2. Chyong Hwa Sheu (1989). Density Estimation with Kullback-Leibler Loss. University of Illinois at Urbana-Champaign.
  3. Yuhong Yang (1996). Minimax Optimal Density Estimation. Yale University. [Was Assistant Professor at Iowa State University; now Professor at University of Minnesota, Department of Statistics.]
  4. Qun (Trent) Xie (1997). Minimax Coding and Prediction. Yale University. [Was at GE Capital, Inc., Fairfield, CT; then an Assistant Professor at Tsinghua University.]
  5. Gerald Cheang (1998). Neural Net Approximation and Estimation of Functions. Yale University. [Now at Singapore Technical and Education University.]
  6. Jason Cross (1999). Universal Portfolios for Target Classes having a Continuous Form of Dependence on Side Information. Yale University. [Was at an investment start-up firm with Myron Scholes in New York; now runs an investment firm in Minneapolis, Minnesota.]
  7. Qiang (Jonathan) Li (1999). Estimation of Mixture Models. Yale University. [Was at KPMG Financial Services, New York; then at Stanford Research Institute, Palo Alto; now at Radar Networks, Inc., San Francisco.]
  8. Feng Liang (2002). Exact Minimax Predictive Density Estimation. Yale University. [Was Assistant Professor, Department of Statistics, Duke University; now Associate Professor, Department of Statistics, University of Illinois at Urbana-Champaign.]
  9. Gilbert Leung (2004). Information Theory and Mixing Least Squares Regression. Yale University. [At Qualcomm, first in San Diego, now near San Jose.]
  10. Wei (David) Qiu (2007). Maximum Wealth Portfolios. Yale University. [At J.P. Morgan Chase, Columbus, Ohio.]
  11. Cong Huang (2008). Risk of Penalized Least Squares, Greedy Selection and L1-Penalization for Flexible Function Libraries. Yale University. [Was at Columbia University, Department of Statistics; now in China.]
  12. Xi (Rossi) Luo (2009). Penalized Likelihoods: Fast Algorithms and Risk Bounds. Yale University. [Was a postdoc with Tony Cai, Department of Statistics, University of Pennsylvania; now Assistant Professor, Department of Biostatistics, Brown University.]
  13. Antony Joseph (2012). Achieving Information Theoretic Limits with High Dimensional Regression. Yale University. Co-developer at Yale of capacity-achieving sparse superposition codes. [Was a postdoc with Bin Yu, Department of Statistics, University of California, Berkeley; now at Walmart Research.]
  14. Sanghee Cho (2014). High-Dimensional Regression with Random Design, including Sparse Superposition Codes. Yale University. [At GE Research, Schenectady, New York.]
  15. Sabyasachi Chatterjee (2014). Adaptation in Estimation and Annealing. Yale University. [Postdoc with John Lafferty, Department of Statistics, University of Chicago.]
  16. Xiao (Grace) Yang (2015). Compression and Predictive Distributions for Large Alphabets. Yale University. [At Apple, Inc., Cupertino, California.]
  17. Cynthia Rush (May 2016). Iterative Algorithms for Inference and Optimization, with Applications in Communications and Compressed Sensing. Yale University. [At Columbia University, Department of Statistics.]

Research Interests: discussion of results and open problems

  • Entropy Power Inequalities and Central Limit Theorems:
    [Publications 3,26,29,34,64,69; At least seven conference presentations and at least seven seminars.]

Information theory tools involving entropy and Fisher information provide a proof of a strengthened central limit theorem. One compares the distribution of the standardized sum of i.i.d. random variables to the standard normal distribution. Publication [3], building on earlier work by Larry Brown, shows that the Kullback divergence (relative entropy distance) converges to zero if and only if it is eventually finite.
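In symbols, the convergence result just described can be stated as follows (the notation is introduced here for illustration): for i.i.d. mean-zero, unit-variance random variables, let f_n denote the density of the standardized sum and let phi denote the standard normal density.

```latex
% Kullback divergence (relative entropy) from the standard normal:
D(f_n \,\|\, \phi) \;=\; \int f_n(x)\, \log\frac{f_n(x)}{\phi(x)}\, dx,
% and the entropic central limit theorem of Publication [3]:
D(f_n \,\|\, \phi) \;\longrightarrow\; 0
\quad\text{if and only if}\quad
D(f_n \,\|\, \phi) < \infty \ \text{for some } n.
```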

Publication [26], with Oliver Johnson, shows that the divergence converges to zero at rate 1/n if the random variables have a finite Poincare constant; this was also shown, in the same issue, by Artstein, Ball, Barthe, and Naor. Implications are given in Publication [26] for the risk efficiency of the best unbiased estimator of natural parameters in exponential families.

Publications [29] and [64], with Mokshay Madiman, simplify proofs of monotonicity of information in the central limit theorem and extend the Shannon entropy power inequality to arbitrary collections of subset sums. Implications for distributed estimation of location parameters are explored with Madiman, Abram Kagan, and T. Yu in Publications [34] and [69].
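For reference, the classical Shannon entropy power inequality and the monotonicity property discussed above can be written schematically as follows (notation introduced here for illustration):

```latex
% Entropy power inequality: for independent random vectors X, Y in R^d
% with differential entropies h(X), h(Y),
e^{2h(X+Y)/d} \;\ge\; e^{2h(X)/d} + e^{2h(Y)/d}.
% Monotonicity of entropy in the central limit theorem: for standardized
% sums S_n of i.i.d. summands,
h(S_n) \;\le\; h(S_{n+1}),
% equivalently, the relative entropy distance from the normal
% is non-increasing in n.
```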


More Recent Developments: Bobkov, Chistyakov, and Gotze (2013, Annals of Probability) show that the relative entropy distance converges to zero at rate 1/n assuming only finite initial relative entropy and finite fourth moments, rather than the finiteness of moments of all orders implied by a finite Poincare constant. The proof draws on classical local limit theorem results of Petrov from 1964.


Open Problems: Give an information theoretic proof of the 1/n convergence assuming only finite initial information distance and finite third or fourth moment. Seek a possibly more direct understanding of the entropy power inequality without appealing to corresponding inequalities for Fisher information or mean squared prediction error. Characterize cleanly the role of dimension in bounds on the relative entropy distance from the normal.

  • Entropy Rates, Asymptotic Equipartition Property, Likelihood Stabilization, and Inference for Dependent Processes:
    [Publications 0, 2; At least two conference presentations and at least one seminar.]

Asymptotically there is agreement between sample information quantities and their expectations, and likewise between log likelihoods and their expectations. In information theory and in statistical physics this is called the asymptotic equipartition property (AEP).

Earlier results include the Shannon-McMillan-Breiman theory from the 1950s, which was for discrete ergodic processes, and the theory of Moy, Perez, and Kieffer, which established L1 convergence for (1/n) log-densities of absolutely continuous joint distributions of size-n samples from stationary processes. Publication [2], in 1985, established the long-conjectured convergence with probability 1; it was shown independently by Steven Orey in the same year. The conclusion also holds for log likelihood ratios if the second measure is Markov with stationary transitions, while the governing first measure is not required to be Markov.
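In symbols, the statement established in Publication [2] reads as follows (notation introduced here for illustration): for a stationary ergodic process with joint density p(X_1, ..., X_n) and entropy rate H,

```latex
-\frac{1}{n}\, \log p(X_1, \ldots, X_n) \;\longrightarrow\; H
\quad\text{with probability 1 (and in } L^1\text{)}.
```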

This AEP or stability for log-likelihood ratios provides best error exponents for hypothesis tests as discussed in [2]. It also provides a step in Publication [0], enabling a demonstration of universal consistency properties of logical smoothing (a general Kolmogorov-complexity-based minimum description length criterion) for all computable stationary ergodic processes.


Open Problem: Provide an extension showing log likelihood ratio stabilization in the case that both measures are stationary and ergodic, without a Markov assumption. What weaker assumption on the second measure is appropriate? A word of caution: John Kieffer in the early 1970s gave a counterexample in which the (1/n) log-likelihood ratio does not converge for a pair of stationary (but not ergodic) measures.

  • Foundations of Minimum Description Length Principle of Inference and Universal Data Compression:
    [Publications 0,7,10,15,16,18,23,27,31,37,46,48,49,57,58,62,65,67,68,77,78,79,80,81,82,83; At least twenty-four conference presentations and at least two seminars; Student Theses of Clarke, Xie, Li, Liang and Chatterjee.]

The minimum description length principle of statistical inference initiated by Jorma Rissanen, with related work by Wallace, has realizations via two-part codes, mixture codes, and predictive codes. Early statistical analysis is in Barron's Thesis (Publication [0]).

Two-part codes correspond to complexity penalized log-likelihood, for which risk analysis is in Publications [0,10,47,49,67,68] and the theses of Li and Chatterjee. Mixture codes (average case optimal and minimax optimal) correspond to universal data compression in information theory and to reference priors in statistics, for which analysis is in Publications [0,7,15,16,18,23,27,44,46,49,55,72]. Further details on minimax regret for compression and prediction of arbitrary sequences are in work with Jun-ichi Takeuchi in Publications [31,57,58,62,65,78,79,80] and a patent. Optimal predictive codes correspond to prequential inference and to posterior model averaging, for which analysis is in Publications [7,18,19,22,23,27,28,44,48,49].
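Schematically, the two-part code realization mentioned above selects the density minimizing total description length (notation introduced here for illustration):

```latex
% Two-part minimum description length criterion over a list \Gamma
% of candidate densities q, with codelengths L(q) satisfying
% Kraft's inequality \sum_{q} 2^{-L(q)} \le 1:
\hat{q} \;=\; \operatorname*{arg\,min}_{q \in \Gamma}
\Bigl\{ L(q) \;+\; \log_2 \frac{1}{q(X_1, \ldots, X_n)} \Bigr\}.
```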

A recent result [83] with Roos and Watanabe provides an exact Bayes interpretation for Shtarkov's normalized maximum likelihood distribution, which is exactly minimax over all sequences. What is striking about this result is that achieving exact minimaxity may require the prior to use negative weights. Collectively, these investigations show relationships among penalized likelihoods, Bayes mixture procedures, and predictive procedures, revealed through their minimum description length interpretations. Moreover, data compression performance and statistical risk properties are linked.

For further literature on the minimum description length principle and where our work fits in this context, one may see the review by Barron, Jorma Rissanen, and Bin Yu (Publication [18]), and the book by Peter Grunwald (2007, MIT Press).


Open Problems: It is the cumulative Kullback risk of predictive estimators, for sample sizes up to n, that is most directly related to the redundancy of data compression, for which simple and clean bounds are available. Dividing the cumulative Kullback risk by n expresses an average risk across sample sizes. (1) Show that the Kullback risk at sample size n (the last term in the sum) has corresponding performance (order d/2n in the parametric case), or give a counterexample. Can the individual risk be worse than the average risk across sample sizes? (2) For two-stage codes (complexity penalized log likelihoods), the existing risk bounds in broad generality are for Hellinger divergence and related measures of distance. Show that such penalized procedures have corresponding performance for the stronger Kullback risk, comparable to the predictive Bayes procedures, or give a counterexample.
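The link between compression redundancy and cumulative Kullback risk invoked above is the chain rule for relative entropy; schematically (notation introduced here for illustration):

```latex
% For data X_1,...,X_n from p_\theta, coded with a coding distribution q
% whose predictive densities are \hat{p}_i(\cdot) = q(\cdot \mid X_1,...,X_{i-1}),
% the redundancy equals the cumulative Kullback risk:
D\bigl(p_\theta^{(n)} \,\big\|\, q^{(n)}\bigr)
\;=\; \sum_{i=1}^{n} \mathbb{E}\, D\bigl(p_\theta \,\big\|\, \hat{p}_i\bigr),
% so dividing by n gives the average Kullback risk across sample sizes,
% of order d/(2n) in smooth d-parameter families.
```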

  • Analysis of Penalized Criteria for Model Selection:
    [Publications 0,10,13,17,20,30,40,42,45,47,49,54,60,67,68,81; Technical report with Huang and Cheang; At least nine conference presentations and at least four seminars; Student Theses of Yang, Cheang, Huang, Luo, Chatterjee and current student Klusowski.]

Desirable penalties for empirical criteria produce estimators whose statistical risk is shown to be related to the corresponding population tradeoff between accuracy and penalty. Conditions on the penalty under which such conclusions hold are in the cited publications, some of which are joint with Yuhong Yang, Lucien Birge, Pascal Massart, Gerald Cheang, Cong Huang, or Jason Klusowski.

The penalty conditions emphasize variable-complexity cover properties. Accordingly, one can adapt the structure of models as well as the number and size of parameters. For models built from linear combinations of candidate terms from a library, implications for subset size penalties are in Publications [0,10,17,20,30,40,42,45,51,54,60,67], likewise for L1 penalties on coefficients in Publications [30,40,45,49,67,68,81], and for combinations of such penalties in Publication 13 and the technical report with Huang and Cheang.


Open Problems: Work by Sara van de Geer (Annals of Statistics, Feb 2008) and by Wainwright and Negahban gives an alternative theory for penalized criteria, involving functional-analysis conditions on a penalty under which desirable risk bounds hold. Those approaches are also effective for L1 penalties, though they do not take advantage of cover properties of the library of candidate terms. Determine the relationship between their conditions and our complexity-based conditions.

  • Analysis of Bayes Procedures and other Model Averaging Procedures:
    [Publications 0,4,6,7,15,16,21,22,27,28,44,46,48,49,56,63; Two Technical Reports; At least ten conference presentations and at least five seminars; Student Theses of Clarke and Leung.]

Building on theory of Lorraine Schwartz from the early 1960s, necessary and sufficient conditions for consistency of posterior distributions are established in the 1988 technical report, formulated in terms of the existence of uniformly consistent tests; parts of this work are presented in Publications [6], [46] and [21], the latter with Mark Schervish and Larry Wasserman.

Posterior predictive densities can be accurate even when posterior distributions on parameters do not necessarily concentrate well on good neighborhoods of the generative parameter. Quantification of prior probability of Kullback neighborhoods is sufficient to provide rates of convergence of predictive densities as shown in [44] and [46], using cumulative Kullback risk.

In a fixed-design Gaussian-error regression setting, simple information-theoretic bounds on the risk of posterior model averaging are given with Gilbert Leung in Publication [28]. These bounds are obtained there in the case of known error variance, with subsequent extension to the case of unknown error variance in work by Christophe Giraud and Yannick Baraud. A tool in these developments is the unbiased estimator of risk due to Stein. The cleanest risk bounds have weights for model averaging that are proportional to square roots of posterior weights. The risk bounds involve the optimal approximation error and subset size tradeoff, with a multiplicative constant of 1. Accordingly, the conclusions for model mixing are stronger in this setting than all known results for model selection.

A fascinating result following from Publication [28] is that an unbiased estimator of the mean squared error of prediction of Bayes regression procedures is equal to RSS + 2V, the residual sum of squared errors of the fit plus twice the sum of posterior variances. This risk assessment is an attractive combination of accuracy of fit to training data plus posterior variability.
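
A hedged sanity check of this RSS + 2V risk identity, in a deliberately simple conjugate setting rather than the model-averaging setting of Publication [28] (scalar Gaussian means with a Gaussian prior, so the posterior mean is linear shrinkage and the posterior variances are explicit): the average of RSS + 2V over replications should match the Monte Carlo mean squared error of prediction at fresh observations.

```python
import random, statistics

random.seed(0)

# Toy model: theta_i ~ N(0, tau^2), observations y_i = theta_i + N(0, 1),
# fresh observations y'_i = theta_i + new N(0, 1) noise.
tau2 = 4.0
shrink = tau2 / (1.0 + tau2)     # posterior mean multiplier
post_var = tau2 / (1.0 + tau2)   # posterior variance of each theta_i

n, reps = 200, 1000
est_vals, pred_vals = [], []
for _ in range(reps):
    theta = [random.gauss(0, tau2 ** 0.5) for _ in range(n)]
    y = [t + random.gauss(0, 1) for t in theta]
    fit = [shrink * yi for yi in y]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fit))
    est_vals.append(rss + 2 * n * post_var)   # RSS + 2V risk assessment
    y_new = [t + random.gauss(0, 1) for t in theta]
    pred_vals.append(sum((yn - fi) ** 2 for yn, fi in zip(y_new, fit)))

mean_est = statistics.mean(est_vals)    # average of RSS + 2V
mean_pred = statistics.mean(pred_vals)  # Monte Carlo prediction MSE
```

In this conjugate sketch both quantities have expectation n(1 + 2 post_var - post_var^2/...) computed directly, and the two Monte Carlo averages agree to within simulation error.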


Open Problem: Starting perhaps with the case of a Gaussian distribution on the inputs to a regression, find extensions of these conclusions to the case of random design, with estimation of the mean squared error of prediction at new points independently drawn from the same distribution as the sample.

  • Optimal Rates of Function Estimation for Large Function Classes:
    [Publications 19, 22, 40,41; At least four conference presentations; Thesis of Yang.]

For regression and density estimation, three quantities of interest for a function class are (1) the Wald minimax risk R_n of estimation from finite samples of size n, with mean squared error, Hellinger, or Kullback loss; (2) the Kolmogorov metric entropy rate H_n = H(epsilon_n)/n at the critical covering or packing radius epsilon_n solving H(epsilon)/n = epsilon^2; and (3) the Shannon information capacity rate C_n of the channel that takes one from functions in the class to data of the indicated size (known in information theory to also equal the minimax redundancy rate of data compression). It turns out, per [22], that these three seemingly rather different quantities are the same to first order. This equivalence is the characterization that provides simple determination of minimax rates of function estimation.
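
The determination of rates from metric entropy can be illustrated numerically (a sketch under the hypothetical entropy H(epsilon) = epsilon^{-d/s} of a smoothness-s class in dimension d, with constants suppressed): solving H(epsilon)/n = epsilon^2 recovers the familiar rate n^{-2s/(2s+d)}.

```python
import math

def critical_radius(H, n):
    # Solve H(eps)/n = eps^2 for the critical radius eps_n.
    # H is decreasing and eps^2 is increasing, so bisect geometrically in eps.
    lo, hi = 1e-9, 1e9
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if H(mid) / n > mid ** 2:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Holder/Sobolev-type class: smoothness s in dimension d has H(eps) ~ eps^{-d/s}.
s, d, n = 2.0, 1.0, 10_000
H = lambda eps: eps ** (-d / s)
eps_n = critical_radius(H, n)

minimax_rate = eps_n ** 2                      # ~ n^{-2s/(2s+d)}, here n^{-4/5}
closed_form = n ** (-2 * s / (2 * s + d))
```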

Beginning in the 1940s, an inequality of Fano was used to bound communication rates (log-cardinality of message sets divided by block length n) by information capacity, an essential step in establishing the capacity of communication channels. Versions of this inequality were borrowed into statistical investigations of lower bounds on minimax risk of function estimation by Pinsker, Hasminskii, and Birge in the late 70s and early 80s. The implication at that time was that if an optimal cover of a small diameter portion of the function space has an appropriately large order size, then that log-cardinality divided by n determines the minimax rate.

In publication [22] with Yuhong Yang, we appealed to the original Fano inequality to derive that, for any infinite-dimensional function class, the minimax risk is of the same order as the metric entropy rate at the critical epsilon_n. Thus to verify the minimax rate for a function class, or to upper- and lower-bound that rate, one only needs upper and lower bounds on the metric entropy (one does not need to exhibit the more specialized cover property of small-diameter subsets). Equally interesting was the approach of Haussler and Opper, also from the mid-90s, which established similar conclusions with new bounds on the information capacity.

It may be of interest to discuss the implications of optimal rate theory in the model selection setting. For any complete sequence of basis functions and any decreasing sequence r_k tending to zero polynomially fast, a corresponding function class is the approximation class of functions for which the projection onto the first k basis functions has squared L2 error not more than r_k for all k. Lorentz in the 1950s determined the metric entropy for such classes. The implication from Publication [22] is that the minimax risk of these classes is of order min {r_k + k/n}, where the minimum is over choices of k. This conclusion is in concert with what is achieved by Bayes procedures as in Publication [22] or with model selection rules that adapt k, as in Publications [20],[30] or the Tech Report with Huang and Cheang. These estimators are adaptive in that the minimax rate is achieved simultaneously for all such approximation classes.
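
A quick numeric check of the min_k {r_k + k/n} characterization (a sketch with the hypothetical polynomial decay r_k = k^{-2s}; for such a Lorentz-type class the optimized tradeoff should be of order n^{-2s/(2s+1)}):

```python
n, s = 100_000, 1.0

def r(k):
    # assumed squared approximation error of projection onto the first k basis functions
    return float(k) ** (-2 * s)

# the estimation-approximation tradeoff, optimized over the model size k
risk = min(r(k) + k / n for k in range(1, n + 1))
benchmark = n ** (-2 * s / (2 * s + 1))   # the minimax rate n^{-2s/(2s+1)}
```

The optimized tradeoff matches the benchmark rate up to a modest constant (analytically, the continuous minimum is (2^{-2/3} + 2^{1/3}) n^{-2/3} for s = 1).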

There are implications of optimal rate theory for subset selection. A formulation of sparse approximation classes dictated by the accuracy of subset selection is also given in Publication [22]. The formulation there is idiosyncratic, to facilitate appeal to the conclusion of Lorentz for a subsequence of allowed subsets, and does not completely match what would be desired for characterizing sparse approximation classes. Lower bounds on metric entropy for these or other formulations of sparse approximation classes can provide implications for optimal statistical rates for the interpolation classes discussed in Publication [30], Section 6. Positive results are given there, providing estimators that achieve rates simultaneously over a range of interpolation classes for linear combinations from libraries of finite metric dimension. These rates are faster than previously obtained, and it would be of interest to know whether they are minimax optimal.


Open Problem: For a given library of candidate terms, generalize the formulation of sparse approximation classes to better match the setting of approximation by subsets of size k out of M and use it to bound the metric entropy of these classes and associated interpolation classes, and hence their minimax rates of estimation.

  • Greedy Algorithms for Subset Selection in Regression and Density Estimation, and L1 Penalty Optimization:
    [Publications 12,14,30,33,40,41,47,49,51,59,67,71,72 and the Technical Report with Huang and Cheang; At least seven conference Presentations and six seminars; Theses of Cheang, Li, Huang, Luo, Joseph, and Cho, and current student Klusowski.]

Greedy algorithms are ubiquitous in a range of methods for function estimation and approximation, including forward stepwise regression, projection pursuit regression, multivariate adaptive regression splines, and basis pursuit for wavelets. In 1989 Lee Jones established a result (published in the Annals of Statistics in 1992) for a broad class of greedy procedures that include a relaxation factor in the update formulation, with a slight improvement in Publication [12]. This result shows that a target function representable as a linear combination from a library of bounded functions, with L1 norm of coefficients not more than V, is approximated by k steps of such greedy algorithms (that is, by k terms of the library) with accuracy C/k, where C is the square of V.

To deal with noise and with targets that may be outside of specified convex hulls, an extension of the greedy algorithm result was obtained in a lemma jointly with Wee-Sun Lee, Peter Bartlett, and Bob Williamson (in 1996 IEEE IT, see their acknowledgements), with further refinement in Publication [30] with Albert Cohen, Wolfgang Dahmen, and Ron DeVore. These results show that in fitting a function f, the squared L2 error of greedy algorithms is bounded by the minimum over all possible linear combinations g of the sum of approximation error ||f-g||^2 and C(g)/k, where C(g) = 4 V^2(g), with V(g) the L1 norm of coefficients in g.
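
The C/k behavior can be seen in a toy computation (a minimal sketch, not the cited implementations: a random library of unit vectors in R^m stands in for the bounded function library, and the target is a convex combination of atoms, so V = 1 and the squared error after k relaxed greedy steps should fall well below 4/k):

```python
import math, random

random.seed(1)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def relaxed_greedy(f, library, steps):
    # Relaxed greedy: f_k = (1 - a) f_{k-1} + a g, with the atom g and the
    # relaxation a in [0, 1] chosen by exact line search at each step.
    fk = [0.0] * len(f)
    errs = []
    for _ in range(steps):
        best = None
        for g in library:
            d = [gi - fi for gi, fi in zip(g, fk)]
            dd = dot(d, d)
            if dd == 0:
                continue
            a = max(0.0, min(1.0, dot([t - fi for t, fi in zip(f, fk)], d) / dd))
            cand = [(1 - a) * fi + a * gi for fi, gi in zip(fk, g)]
            e = sum((t - c) ** 2 for t, c in zip(f, cand))
            if best is None or e < best[0]:
                best = (e, cand)
        fk = best[1]
        errs.append(best[0])
    return fk, errs

# A toy library of M unit vectors in R^m; the target is in its convex hull (V = 1).
m, M = 40, 60
library = []
for _ in range(M):
    v = [random.gauss(0, 1) for _ in range(m)]
    nrm = math.sqrt(dot(v, v))
    library.append([vi / nrm for vi in v])
f = [sum(g[i] for g in library[:10]) / 10 for i in range(m)]  # convex combination

_, errs = relaxed_greedy(f, library, 25)
```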

Moreover, with data-based adaptive selection of the number of terms k, a statistical risk is obtained that corresponds to min ||f-g||^2 + lambda V(g), the best tradeoff between approximation error and the L1 norm of coefficients times a factor lambda equal to the square root of (log M)/n, where M is the library size and n is the sample size. Such conclusions are in [30],[40] and the technical report with Huang and Cheang.

My student Cong Huang showed in his Thesis (and the indicated tech report) that a slight variant of the relaxed greedy algorithm (which we call L1 penalized greedy pursuit -- LPGP), actually solves the L1 penalized least squares problem, with explicit control of the computational accuracy after k steps (through the C(g)/k term in the bound). Refinements of this result are in the Thesis of Xi Luo and the paper with Klusowski [40].

Corresponding results for log density expansions, with Li, Huang, and Luo, are in Publications [49,67,68] and for mixture density estimation, with Jonathan Li, are in his Thesis and in publication [47].


Open Problem: A variant of the Least Angle Regression (LARS) algorithm of Brad Efron, Trevor Hastie, Iain Johnstone, and Rob Tibshirani also builds up the solution one term at a time, for a range of values of lambda, with the final solution after a number of steps k said to be not more than the sample size. Relate LARS to the greedy algorithms, with inclusion of a relaxation factor if need be, to quantify its approximation accuracy in fewer steps.

  • Nonlinear Approximation and Estimation for High-dimensional Libraries of Functions including Neural Nets:
    [Publications 8,12,13,14,20,22,24,30,39,40,41,43,45,50,51,52,53,54,55,59,60,66 and Technical Report with Huang and Cheang; At least twenty-seven conference presentations and at least twenty-six seminars; Theses of Cheang, Huang, Yang and current student Klusowski.]

In dealing with adaptation and high-dimensionality, the early 1990s were a turning point in the mathematical understanding of function estimation and approximation, as will be explained. Long before that time, statistical and engineering practice, both in academics and in industry, revealed a clear advantage to data-based adaptation of the structure as well as the number of parameters of statistical models. The choice of subsets for linear combination in regression is a simple example of such adaptation, and the choices of structure of classification and regression trees, multilayer neural nets, and polynomial nets provide more elaborate examples (see Publication [51] for a review).

However, such statistics and engineering practice was very much at odds with the prevailing mathematical understanding of nonparametric function estimation and approximation theory. The tradition was to capture function regularity primarily by local smoothness, through presumption of Sobolev or Holder properties of derivatives. For these function classes, the statistical risk results showed no advantage for procedures beyond those based on linear strategies (fixed kernel methods or linear projection onto fixed bases). The most relevant approximation theory for these classes was the theory of Kolmogorov k-widths, expressing which fixed k-dimensional linear space minimizes the maximal error of L2 projection onto its span. In this linear theory, the choice of k terms is fixed, depending on the class but not on the target function within the class.

In contrast, for nonlinear approximation based on a given library of candidate terms, the choice of the best k terms from the library is allowed to depend on the target. Though such nonlinear k-width theory was being developed, when applied to the traditional Sobolev and Holder function classes, it did not reveal substantial advantage in performance.

Belief that such local derivative regularity is what matters distanced the theory from statistical practice, discouraging the taking of general subsets of terms in regression, when the practitioner knows that such adaptation is advantageous. These disparities between theory and practice were accentuated by the results that for functions of d variables the minimax optimal rates for the traditional classes severely degrade with dimension, leading to a conclusion in that theory that one needs exponentially large sample sizes as a function of the dimension.

Two activities in the study of approximation and estimation of functions dramatically changed perspectives. One was the development of theory and practice of wavelets. Retaining the idea that regularity can be captured locally through behaviors of moduli of smoothness, it was recognized that such regularity need not be of the same type throughout the input space of the function. Certain Besov spaces capture such freedom and were shown to be equivalent to certain norm properties on coefficients in newly developed orthogonal wavelet expansions. Moreover, optimal approximation and estimation in these classes were shown to require target-based or data-based choices of subsets of terms to include, which is a type of nonlinear approximation, rather than fixed linear projection. See books and papers by Ron DeVore and Yuly Makovoz and others from approximation theory, and the writings of this time of David Donoho and Iain Johnstone and others from mathematical statistics. Taking advantage of orthogonality, practical algorithms became available for the low-dimensional settings of two- and three-dimensional images and their time evolutions, in which one can have an exhaustive filling of data points, bridging theory and practice in these settings.

The other activity concerns approximation and estimation of responses for high-dimensional input data. In this setting the input data is sparse, typically modeled as a random sample from a distribution from which subsequent data will likewise be drawn, and popular models entertain linear combinations of nonlinearly parameterized terms, such as sinusoids, sigmoids, or multivariate splines with variable knots. Barron found himself in a whirlwind of discussion, as evidenced by the many conference and seminar requests. He argued that for high-dimensional function estimation, since local derivative regularity fails to capture what can be inferred from manageable data sizes, we should abandon those local smoothness notions as the primary source of regularity. Instead, more structural notions of regularity matter, related to information-theoretic complexity or to the suitability of the presumed forms for accurate representation of the target.

A theory he put forward, beginning in 1991, is for functions f which, when divided by a suitable V, are in the closure of the convex hull of a given library of candidate terms (the smallest such V being called the variation of f with respect to the library, in concert with the variation interpretation associated with the library of indicators of half-spaces, Publication [55]). Including terms of either sign in the library, V(f) is equivalent to an L1 norm on coefficients of representation. A Fourier norm is shown to be one device for controlling the variation for ridge-type libraries such as sigmoids and sinusoids (Publication [12]). For functions of finite variation, the squared error of the best k-term approximation and the statistical risk are bounded, respectively, by V^2/k and by V sqrt{d/n}, to within logarithmic factors. These bounds give rates as exponents of (1/k) or of (d/n), where the exponents do not depend on the input dimension d.

To achieve these rates under such structural assumptions, unlike the previously mentioned traditional function class settings, adaptation of the subset of k-terms (depending on the target) is essential. So we have a better match of such theory to what we know to be important in practice. For functions that do not have finite variation, but reside in interpolation classes between this L1-coefficient norm class and all of L2, the results of publications [30],[40] (and the tech report with Cheang and Huang) give bounds on rate, which continue to be respectable even for libraries of large metric dimension. These function classes give settings in which, with suitable nonlinear approximation and estimation techniques, we are not subject to a curse of dimensionality.


Recent Developments: So under the finite variation condition the exponent is independent of dimension. However, the base of the exponent was d/n which necessitated that n be large compared to the input dimension d. This was already reasonably good because the effective size of the dictionary of nonlinear terms is exponentially large in d. Nevertheless, the recent trend in high-dimensional data analysis has been to try to allow the input dimension to be much larger than n. That is a challenge when the model is nonlinear in these inputs.

Thus it is a pleasure to report the recent 2016 results with current student Jason Klusowski (Publications 40,41). We show that with L1 control (V1) on the internal parameters of the nonlinearly parameterized ridge terms (as well as L1 control V0 on the external parameters of the linear combination), the risk of estimation of finite-variation combinations of such functions is of order [V (log d)/n]^{1/3}.

These new results for ridge combinations (including neural nets) are superior to the classical bounds for d > n^{1/3}. Moreover, they can produce small risk even in the very high-dimensional situation of d = exp{o(n)}. One needs n large compared only to log d, though one pays the price of the smaller exponent of 1/3 for the mean squared error, compared to the classical exponent of 1/2, for estimation of functions of finite variation with respect to these nonlinear dictionaries. This puts this neural net theory squarely in the big data situation.


Open Problem: A tantalizing prospect of this theory, with infinite libraries of high but finite metric-dimensionality, is that accurate computation of the estimates is possible by use of suitable greedy algorithms discussed above. After all, each step of an algorithm involves optimization of only one term from the library. Empirically, investigators have been pleased with local optimization strategies for each new term in various settings, including the Friedman and Stuetzle projection pursuit regression algorithm and various gradient-based neural net training algorithms. The hope of such local strategies is that the residual surface is cleaned by each term included so that ultimately the globally important terms are revealed, though that hope has not yet been confirmed on general mathematical grounds.

Or one may study adaptive annealing methods for guided random search of suitable terms from the library (some steps in this direction are in reference 60 with Xi Luo and in the Thesis of Chatterjee). The question is to resolve for what forms of nonlinearly parameterized terms such adaptive annealing is rapidly convergent to a term of suitably high inner product with the residuals from the previous greedy algorithm step. Resolution of suitably formed computational questions is critical for both understanding and practice of high-dimensional function estimation.


Recent Computational Developments: With current student Jason Klusowski we are making progress in the analysis of adaptive annealing for Lipschitz-bounded activation functions, using what we call variable replication, which yields a bounded-coordinate-effect property sufficient for stability of adaptive annealing. Also with Klusowski and another current student, William David Brinda, we are making progress with what we call the nonlinear power method of optimization. Some aspects of both of these developments are in the seminar slides from the recent 2015 NIPS workshop as well as recent 2016 Princeton and Osijek presentations. These slides can be accessed from the vita above.

  • Maximum Wealth Stock Indices and Growth Rate Optimal Portfolio Estimation:
    [Publications 5,23,25; Two technical reports; At least seven conference Presentations and at least four seminars; Theses of Jason Cross and Wei (David) Qiu.]

Consider compounding over n periods of investment, rebalancing each period to a portfolio vector W which maintains fractions W1, W2, ..., Wm of wealth in each of m assets (stocks, bonds, cash, etc.). Let (X1i, X2i, ..., Xmi) be the vector of returns (ratios of price plus dividend at the end of period i to the price at the start of period i). Then the multi-period factor by which a unit of wealth is multiplied, denoted Sn(W), is the product for i = 1, 2, ..., n of the portfolio returns Zi = W1 X1i + W2 X2i + ... + Wm Xmi. This multi-period wealth Sn(W) is a log-concave function of the portfolio vector W; that is, Sn(W) is the exponential of a smooth concave function Yn(W), the sum of the log portfolio returns. As such, its shape as a function of W is governed by Laplace approximation, peaked at the choice of rebalancing portfolio W which would have made the most wealth over that time period. Accordingly, in the report with Wei Qiu (as well as in his thesis), in which we take the periods of investment to be months to avoid over-exposure to commissions, we maintain that an index of the wealth in a market over a suitable period of months (e.g. a decade), appropriate for historical consideration, is the maximum of Sn(W) over choices of W. Examples are given where the market is taken to be the Nasdaq 100, the S&P 500, or all stocks (about 3000 in number) actively traded over an eleven-year period, accessed in the U. Penn Wharton School data set (Wharton Research Data Services).

We provide a relaxed greedy algorithm for fast computation of this maximum. The algorithm starts with the historically single best stock, and then at each algorithm step introduces a new stock which, in convex combination with the previously computed portfolio, yields the greatest increase in compounded wealth. It is proven that in k steps of the algorithm (that is with k stocks included) the error in computation of the maximum of Sn(W) is bounded by V/k in the exponent, where V depends on the volatility of the historically given returns. Thus the algorithm is guaranteed to converge to the portfolio vector W which historically had the highest returns. Our studies showed that only a handful of stocks are included in the hindsight optimal portfolio vector and that the associated maximal portfolio wealth factor (typically several hundred) is much higher than the best single stock over the indicated time period.
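
The algorithm can be sketched as follows (a toy instance with hypothetical simulated monthly returns, not the Wharton data; a simple grid line search stands in for the exact one-dimensional maximization at each step):

```python
import math, random

random.seed(2)

def log_wealth(w, returns):
    # Y_n(W) = sum_i log(W . X_i);  S_n(W) = exp(Y_n(W))
    return sum(math.log(sum(wj * x[j] for j, wj in enumerate(w))) for x in returns)

def greedy_max_wealth(returns, steps=10, grid=50):
    m = len(returns[0])
    # start from the historically best single stock
    start = max(range(m),
                key=lambda j: log_wealth([1.0 if i == j else 0.0 for i in range(m)], returns))
    w = [1.0 if i == start else 0.0 for i in range(m)]
    for _ in range(steps):
        best = (log_wealth(w, returns), w)
        for j in range(m):          # candidate stock to mix in
            for t in range(1, grid):
                a = t / grid        # convex-combination weight on stock j
                cand = [(1 - a) * wi + (a if i == j else 0.0) for i, wi in enumerate(w)]
                y = log_wealth(cand, returns)
                if y > best[0]:
                    best = (y, cand)
        w = best[1]
    return w

# Hypothetical monthly gross returns for m = 5 assets over n = 60 months.
m, n = 5, 60
returns = [[math.exp(random.gauss(0.01 * (j + 1) - 0.02, 0.05 * (j + 1))) for j in range(m)]
           for _ in range(n)]
w_star = greedy_max_wealth(returns)
```

Each step keeps the weights a convex combination (nonnegative, summing to one), and by construction the final log wealth is at least that of the best single stock.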

Beginning in the first issue of Mathematical Finance, Tom Cover in 1990 introduced the notion of universal portfolios. These are updating portfolio strategies in which it is guaranteed that the wealth achieved is within a specified factor of the maximum of Sn(W). With Erik Ordentlich in the mid 90s, he refined the universal portfolio theory to show that the optimal factor, valid uniformly over all possible return sequences, exhibits a drop from the maximum by not more than a factor of order (1/n)^d where the power d = (m-1)/2. For markets in which the maximal exponent Yn(W) grows linearly with n with a non-vanishing rate, this shows that the hindsight optimal rate can be achieved universally without advance knowledge of what portfolio W will be best. In publication 23 with Trent Xie, we identified the precise asymptotics of the form of the wealth drop factor (C/n)^d, including identification of the constant C, and showed that this theory reduces to the same mathematics as the identification of asymptotics of minimax pointwise regret for data compression and prediction, also developed there.
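
In the simplest two-asset case the universal mixture wealth and its guarantee can be checked directly (a sketch with hypothetical returns; a fine grid approximates the uniform-prior average over the simplex, and the bound asserted below is the conservative uniform-prior guarantee of at least max_W S_n(W)/(n+1)^{m-1}):

```python
import math, random

random.seed(3)

def wealth(b, returns):
    # S_n for the two-asset constant rebalanced portfolio (b, 1 - b)
    s = 1.0
    for x1, x2 in returns:
        s *= b * x1 + (1 - b) * x2
    return s

# Hypothetical two-asset market: a volatile stock versus cash over n periods.
n = 100
returns = [(math.exp(random.gauss(0.0, 0.2)), 1.0) for _ in range(n)]

grid = [i / 1000 for i in range(1001)]
ws = [wealth(b, returns) for b in grid]
best = max(ws)                  # hindsight-optimal wealth, max_W S_n(W)
universal = sum(ws) / len(ws)   # wealth of the uniform-mixture universal portfolio
```

The Laplace-approximation picture from the previous paragraphs is visible here: the mixture wealth is below the peak but only by a polynomial-in-n factor, not exponentially.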

A continuous-time version of universal portfolios is developed in publication 25 as well as a version that allows target portfolios that use past returns in parameterized ways. An advantage of continuous-time is that for stock price processes of bounded empirical variation, the Sn(W) becomes an explicit Gaussian shaped function of W so that certain universal portfolios can be evaluated exactly by known Gaussian integral computations.

An obstacle to realizing high gains by portfolio estimation strategies is that it is difficult to know in advance what subset of stocks is best. If one tries to be universal for hundreds of stocks, the worst case factor (1/n)^d will kill off typical gains from the maximum of Sn(W) over the time horizons in which the desirable portfolios remain stable. In the technical report with Wei Qiu, we develop a strategy in which we do not try to be universal for all of the stocks, but rather we mix across all possible subsets of three, four, or more of the stocks. By such mixture strategies we show growth rate optimality for portfolios of stocks of bounded volatility.

Winner-take-all gambling markets (e.g. horse races) provide a setting in which the theory and practice of growth rate optimality is considerably simplified, as in the text by Cover and Thomas 2006. Briefly, the wealth exponent of a gambler is expressible as the difference of Kullback divergences from the empirical distribution of outcomes: the divergence to the distribution associated with the odds-makers minus the divergence to the distribution specified by the gambling proportions. The technical report with Jengfeng Yu shows that for single stocks, the presence of put or call options with strike prices at each possible level of the stock makes it possible for linear algebra transformations to convert the questions of portfolio choice (portfolios of cash, stock, and options on the stock) and option pricing to the simpler problems of gambling and odds on such gambles, where the events on which one gambles are the different possible outcomes of the stock price.
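
The wealth-exponent identity can be verified in a small numeric example (a sketch; the probabilities and odds below are hypothetical): with fair odds o(x) = 1/r(x), the growth rate of betting proportions b satisfies W(b) = D(p||r) - D(p||b), maximized by proportional (Kelly) betting b = p.

```python
import math

def kl(p, q):
    # Kullback divergence D(p||q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def growth_rate(b, p, odds):
    # expected wealth exponent: sum_x p(x) log(b(x) * o(x))
    return sum(pi * math.log(bi * oi) for pi, bi, oi in zip(p, b, odds))

p = [0.5, 0.3, 0.2]          # true outcome probabilities
r = [0.4, 0.4, 0.2]          # odds-maker's implied distribution
odds = [1 / ri for ri in r]  # fair odds o(x) = 1/r(x)

# Identity: growth_rate(b) = D(p||r) - D(p||b), so Kelly betting b = p earns D(p||r).
w_kelly = growth_rate(p, p, odds)
```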


Open Problem: For multiple stocks, basket options (that is, options on portfolios of stocks) are sufficient to convert the option portfolio problem and the option pricing problem to the associated gambling and odds-making problems, even in a discrete time setting. Use approximation results for linear combinations of indicators of half-spaces to resolve how many basket options are required to come appropriately close to the optimal growth exponent of the pure gambling version of the problem.

  • Practical Codes for Capacity-Achieving Communication with the Gaussian Noise Channel:
    [US Patent 8913686; Publications 32,33,70,71,72,73,74,75; One tutorial and associated newsletter article [84] with Ramji Venkataramanan; One technical report; Several conference papers and seminars; Theses of Antony Joseph, Sanghee Cho, and Cynthia Rush.]

The current cell-phone revolution in society is a consequence in part of high-rate, low-error-probability codes based on LDPC and turbo decoders using iterative probability propagation (message passing) in graphical networks (also called Bayesian belief propagation in graphical statistical models). These codes were empirically demonstrated, beginning in the mid-1990s, to have reasonably low error probability at communication rates that are a reasonable fraction of capacity. These decoding algorithms apply methodology that would determine the posterior most probable codeword if the graphical model representing the decoding problem were a tree (a loop-free network). The graphical network representing the decoding task for these codes does not have such structure, and accordingly scholars find the problem of mathematical proof of favorable error probability and communication rate properties for such practical codes to be elusive, except for certain simple channels (the erasure channel and some channels related to it).

Communication technology is left in a rather nebulous state concerning whether performance will scale suitably as computation and memory capability improve. What is theoretically desirable, with decoding remaining fast enough to be performed on the fly, is that the error probability be exponentially small in the increasing size n of the codes, with communication rate R that remains at a specified, preferably high, fraction of the Shannon capacity C. Moreover, this should be so for the real additive Gaussian noise channel common to electromagnetic propagation through the air or electrical propagation on wires or cables, or for sufficiently finely quantized versions of this real channel.

It is unknown whether existing LDPC or turbo codes will have these desirable scaling properties. Indeed it is even conceivable that there is what is called a noise floor for the error, that is, a positive value below which the error probability will not be made any smaller even if the existing style codes are made larger. This uncertainty about communication capability should appropriately give pause to companies whose performance hinges on communications technology.

My current and recent students and I are engaged in an effort to bring a positive resolution to the problem of practical communication for the real channel, with low computational complexity, and with error probability that rigorously scales to be exponentially small with the code size n, for any specified rate R < C.

We tackle practical capacity-achieving communication, as described in publications [32,33,70,71,72,74,75,84], using an approach that is motivated by understanding of problems of subset selection in a linear model representation of the communications task. Capacity-achieving versions of these codes are called Sparse Superposition Codes.

An ensemble of n by N design matrices or dictionaries X is specified. In practice each entry would be independent, +/-1-valued, and equiprobable, though our first theorems for this setting take the entries of the design to be independent standard normal. The number of columns N dictates the memory size for chips that would implement the code.

The stream of message bits and the set of design matrix columns are both partitioned into L sections, of size log M bits for each piece of the message, used to select (i.e. address) one out of M columns in each section of the design matrix. The codeword consists of the sum (the superposition) of the selected columns (one from each section), with a final scaling to satisfy requirements of power allocation.

Natural choices of L and M are comparable to n, and subject to the relationship n R = L log M, specifying two equivalent ways to count the number of information bits, where R is the communication rate.

Decoding, as described in [33,71,72,74,84], consists of parallel pipelined computation of inner products, first of the received vector Y with each of the columns of X, followed by adding to the codeword fit those columns whose inner products are above a specified threshold. In parallel, a stack of such chips likewise computes the inner products of the columns with the residuals of the fits for a succession of received sequences. Each such round of inner products resolves an additional portion of the true columns superimposed in the codeword. A logarithmic number of such inner-product chips is found to be sufficient to resolve the codeword, with exponentially small error probability.
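The iterative loop just described might be sketched as follows. This is a simplified, hypothetical hard-decision version running on a single machine rather than a stack of pipelined chips: each pass takes inner products of the columns with the current residual, accepts at most one column per section when its normalized statistic clears a threshold, and subtracts the accepted columns before the next pass. The threshold and the final fallback step are illustrative choices, not the tuned ones of the papers.

```python
import numpy as np

def decode(Y, X, L, M, P=1.0, passes=None):
    """Sketch of iterative hard-decision decoding with a flat power
    allocation: logarithmically many passes of thresholded inner
    products against the residual of the current fit."""
    n, N = X.shape
    a = np.sqrt(P / L)                        # per-column coefficient
    if passes is None:
        passes = int(np.ceil(np.log2(n)))     # logarithmically many passes
    tau = np.sqrt(2.0 * np.log(N))            # illustrative threshold choice
    residual = Y.astype(float).copy()
    decided = {}                              # section -> accepted column
    for _ in range(passes):
        stats = X.T @ residual / np.linalg.norm(residual)
        for j in np.argsort(stats)[::-1]:
            if stats[j] <= tau:
                break                         # remaining statistics are smaller
            sec = j // M
            if sec not in decided:
                decided[sec] = j
                residual = residual - a * X[:, j]
        if len(decided) == L:
            break
    # Undecided sections fall back to the largest inner product in the section.
    out = []
    for sec in range(L):
        if sec in decided:
            out.append(decided[sec])
        else:
            block = X[:, sec * M:(sec + 1) * M].T @ residual
            out.append(sec * M + int(np.argmax(block)))
    return out

# Demo on a small noiseless instance with hypothetical parameters.
rng = np.random.default_rng(1)
L, M, n = 8, 16, 128
X = rng.choice([-1.0, 1.0], size=(n, L * M))
sent = [sec * M + int(rng.integers(0, M)) for sec in range(L)]
Y = np.sqrt(1.0 / L) * X[:, sent].sum(axis=1)
out = decode(Y, X, L, M)
```

Within a pass the statistics are deliberately not recomputed after each acceptance, loosely mirroring the parallel chip computation in which all inner products for a round are formed from the same residual.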


Additional Specifics and an Open Problem: There is interest not only in the case of fixed R < C, but also in the case that R approaches C as the code size n increases. The rate at which R approaches C cannot be too fast if we want small error probability. Ideally, the error probability is exponentially small in n(C-R)^2, so, as long as n(C-R)^2 grows without bound, the probability of error becomes small while allowing ever higher rates near C. This exponential relationship was first shown by Shannon (for codes that did not attempt to be practical).

The paper [32] shows that this desirable exponential scaling continues to hold for optimally decoded sparse superposition codes of manageable dictionary size. That allows R to approach C at a rate arbitrarily close to (1/n)^{0.5} below capacity. Similar expressions are obtained for the practical decoder, differing from the ideal in two main respects. One is that the error probability is exponential not in n(C-R)^2 but rather in L(C-R)^2 (which is within a log factor of optimal). The other is that, so far, the bound for the practical decoder is demonstrated, in references [33,71,72,74], only for R below C by a gap of at least order 1/log n.
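A small numeric illustration of this scaling: if the gap C - R shrinks like n^{-0.45}, just slower than (1/n)^{0.5}, then the exponent n(C-R)^2 = n^{0.1} still grows without bound, so the error probability can still be driven toward zero while R approaches C. The capacity value below is an arbitrary placeholder.

```python
C = 0.5                          # illustrative capacity (placeholder value)
rows = []
for n in (10**3, 10**4, 10**5, 10**6):
    gap = n ** -0.45             # C - R shrinks just slower than n^{-0.5}
    exponent = n * gap ** 2      # error probability ~ exp(-const * n*(C-R)^2)
    rows.append((n, C - gap, exponent))
    print(f"n={n:>7}  R={C - gap:.4f}  n*(C-R)^2={exponent:.2f}")
# The exponent equals n**0.1, which grows without bound as n increases.
```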

The open problem concerns the restriction that, as R approaches C, it maintain a gap of order 1/log n. The question is whether such a restriction is necessary for a practical decoder (one, such as ours, that computes on the fly, or more generally any decoder whose computational resources are a low-order polynomial in n) to have error probability exponential in n(C-R)^2.

In the work in reference [74] and the thesis of Sanghee Cho, the decoding algorithm computes posterior weights for each section based on current summary statistics (a soft-decision rather than hard-decision decoder, for all but the final step of the algorithm). Evidence is given that this procedure allows R to be a larger fraction of C for sensibly sized codes.

 

Associate Editorship:

  • 1993 - 1995 IEEE Transactions on Information Theory. A.E. for nonparametric estimation, classification, and neural nets.
  • 1995 - 1997 Annals of Statistics.
  • 1994 - 1997 Neural Networks.

Yale Departmental and Divisional Responsibilities:

  • Department of Statistics (newly named Statistics and Data Science)

   Director of Graduate Studies (Fall 1993 through Spring 1999, Spring 2017).
   Director of Undergraduate Studies (Fall 2010 through Spring 2016).
   Acting Chair (Spring 1998).
   Chair (Spring 2001 - Fall 2006).

  • Division of Social Sciences

Senior Appointments Committee (Fall 1994 - Spring 1995, Fall 1999 - Spring 2000)

  • Program of Applied Mathematics

   Director of Undergraduate Studies (Fall 1999 - Spring 2001; Fall 2004 - Spring 2006)
   Senior Coordinator of Undergraduate Studies (Fall 2006 - Spring 2015).

Other Activities and Honors:

  • FAI free flight model glider designer and competitor, F1A class (Andrew Barron and family):


   Andrew is a five-time U.S. National Champion: 1984, 1987, 1992, 2007, and 2009.
   Three-time U.S. National Team Member at the World Championships:
       1995 in Hungary, 2001 in California, and 2013 in France.
   Member of the nine-man U.S. Team winning the Challenge France World Championships 2013.
   Alternate to the U.S. National Team in 1977 and 2015.
   America's Cup Champion in 1998 and 2009.

   Son Peter was a U.S. National Team Member at the World Championships 2015 in Mongolia.

   Son Peter was a U.S. National Team Member at the Junior World Championships 1998 in Romania.

   Son John was a U.S. National Team Member at the Junior World Championships 2000 in the Czech Republic.

   Son Timothy was the U.S. National Champion in 2010 and 2012.
   Timothy was a U.S. National Team Member at the Junior World Championships 2008 in Ukraine.
   Timothy finished as the third-place individual, part of the second-place F1A team (USA), and part of the
   first-place U.S. team (F1A, F1B, F1P combined).

   Daughters Michelle and Gina earned places on the U.S. National Team for the Junior WC 2012 in Slovenia.
   Gina finished as part of the second-place F1A team and the second-place team overall (F1A, F1B, F1P combined).

Co-owner and manager of Barron Field, LLC, which owns a 287-acre sod farm in Orange County, NY. It is the primary flying site used in the northeastern U.S. region to host free-flight meets and competitions sanctioned by the Academy of Model Aeronautics (AMA) and the Fédération Aéronautique Internationale (FAI).