Andrew
R. Barron
Yale University
Professor of Statistics (100%
Appointment)
Professor of Electrical Engineering (Curtesy Appointment)
CoDirector of Graduate Studies in Statistics and Data Science
Senior Coordinator of Undergraduate Studies in Applied Mathematics
Address: Department of
Statistics and Data Science, 24 Hillhouse Avenue, New Haven, CT 06511
Phone:
+12039975229, 2034320666, 2034320634
Fax:
2034320633
EMail: Andrew.Barron@yale.edu
 Ph.D.,
Electrical Engineering, Stanford University, 1985.
 M.S.,
Electrical Engineering, Stanford University, 1982.
 B.S. (Magna Cum Laude), E.E. and Math Science, Rice University,
1981.
 W. T. Woodson
H.S., Fairfax, Virginia, 1977.
Experience:
 2016  present
Professor, Department of Statistics and Data Science, Yale University
 1992  2016
Professor, Department of Statistics, Yale University
 1990  1992
Associate Professor of Statistics and Electrical & Computer
Engineering, University of Illinois
 1984  1998
Consultant, Barron Associates, Inc., Stanardsville,
Virginia
 1992 Spring Visiting Reseach Scholar, Barron Associates, Inc., Stanardsville, Virginia
 1991 Fall
Visiting Scholar, Mathematical Sciences Research Institute,
Berkeley, California
 1985  1990
Assistant Professor of Statistics and Electrical & Computer
Engineering, University of Illinois
 1982  1985
Research Assistant, Stanford University
 1981  1983
Consultant, Adaptronics, Inc., McLean,
Virginia
 1977  1980
Engineer, Adaptronics, Inc., McLean,
Virginia (Summers)
Honors:
 Fellow of the
IEEE. For Contributions to Information Theory and Statistics.
 Best paper
prize for all IEEE journals in 19901991, Browder J. Thompson Memorial
Prize,
for authors of age 30 or under at time of submission.
 IMS Medallion
Award Winner. Presented at 2005 Joint ASAIMS Annual Meetings,
Minneapolis, MN.
 Best paper
prize, National Aerospace Electronics Conference, 1990.
 Finalist for
the best paper prize, Information Theory Society, IEEE, 1987.
 Nominated for
the Marconi Young Scientist Award, 1990.
 Board of
Governors, IEEE Information Theory Society, Four terms,19951997,
19982000, 20132016, 20172020.
 Appointed
Secretary, Board of Governors, IEEE Information Theory Society,
19891990.
 Keynote
speaker at several conferences.
 Chairman, AMS
Summer Research Conference, Adaptive Selection of Models and Procedures,
1996.
 Program
Committee, IMSASA Joint Statistical Meetings, 1991.
 Program
Committees, IEEE International Symposium on Information Theory, Numerous
times.
 Program
Committees, IEEE Workshop on Information Theory, 1989, 2008.
 Program
Committee, World Congress on Neural Networks, 1995.
 Program
Committee, Neural Information Processing Systems: Natural and Synthetic,
1995.
 Program
Committee, ACM Workshop on Computational Learning Theory, 1991, 1997.
 James Waters
Creativity Award for best undergraduate research at Rice, 1981.
 Houston
Telephone Engineers scholarship, top student in communication theory,
1981.
 Top Award for
leadership, scholarship and service, Woodson, High School, Fairfax, VA,
1977.
Research
Interests:
 Entropy Power
Inequalities and Central Limit Theorems.
 Capacityachieving
Sparse Superposition Codes for the Gaussian Channel. Communication by
Regression.
 Entropy Rates,
Likelihood Stabilization, and Inference for Dependent Processes.
 Foundations of
Minimum Description Length Principle of Inference and Universal Data
Compression.
 Statistical
Risk Analysis for Penalized Criteria for Model Selection.
 Statistical
Risk Analysis for Bayes Procedures.
 Statistical
Perspectives and Analysis of Artificial Neural Networks.
 Nonlinear
Approximation and Estimation for Highdimensional Libraries of
Functions.
 Greedy
Algorithms for Subset Selection, Mixture Density Estimation, and L1
Penalty Optimization.
 Maximum Wealth
Stock Indices and Growth Rate Optimal Portfolio Estimation.
Ph.D. Dissertation:
Journal Publications:
 D. Cleveland,
A. R. Barron, A. N. Mucciardi (1980). Methods
for determining the depth of nearsurface defects. Journal of Nondestructive
Evaluation, Vol.1, pp.2136.
 B. Clarke and
A. R. Barron (1990). Informationtheoretic
asymptotics of Bayes methods. IEEE Transactions on
Information Theory, Vol.IT38, pp.453471. (Winner 1992
Browder J. Thompson Memorial Prize award for the best paper in all IEEE
journals for authors of age 30 or under at time of submission).
Book
Chapters:
 R. L. Barron,
A. N. Mucciardi, F. J. Cook, J. N. Craig, and
A. R. Barron (1984). Adaptive
learning networks. Chapter 2 in SelfOrganizing
Methods in Modeling, S. J. Farlow (Editor), Marcel Dekker,
New York, pp.2565.
 J.Q. Li and
A.R. Barron (2000). Mixture
Density Estimation. In Advances
in Neural Information Processing Systems, Vol.12, S.A. Solla, T.K. Leen and KR.
Mueller (Editors). MIT Press, Cambridge, Massachusetts, pp. 279285.
 A.R. Barron,
C. Huang, J. Q. Li and Xi Luo (2008). MDL
Principle, Penalized Likelihood, and Statistical Risk. In Festschrift for Jorma Rissanen.
Peter Grunwald, Petri Myllymaki, Ioan Tabus, Marcelo Weinberger & Bin Yu (Editors).
Tampere International Center for Signal Processing. TICSP series, #38. Tampere
University of Technology, Tampere, Finland.
Publications
in Conference Proceedings: (4 to 27 pages)
 A. R. Barron,
F. W. van Straten, and R. L. Barron (1977). Adaptive
learning network approach to weather forcasting:
a summary.
Proceedings of the IEEE International Conference on Cybernetics and
Society, Washington, DC, September 1921. Published by IEEE,
New York, pp.724727.
 A. R. Barron
and R. L. Barron (1988). Statistical
learning networks: a unifying view. In Computing Science and Statistics: Proceedings of
the 20th Symposium on the Interface, Reston, Virginia, April
2023. E. Wegman (Editor), Published by the
American Statistical Association, Alexandria, Virginia, pp.192203.
(Invited presentation).
 R. L. Barron,
R. L. Cellucci, P. R. Jordan, N. E. Beam, P.
Hess, and A. R. Barron (1990). Applications
of polynomial neural networks to fault detection, isolation, and
estimation (FDIE) and reconfigurable flight control. Proceedings of the National
Aerospace Electronics Conference, Dayton, Ohio, May 2325,
pp.507519, vol.2 (Winner of the best paper prize, 1990 NAECON).
Republished in Proceedings
1998 NAECON, pp. 348360.
 A. R. Barron
(1991). Approximation
and estimation bounds for artificial neural networks. In Computational Learning Theory:
Proceedings of the Fourth Annual ACM Workshop, Santa Cruz,
CA, August 57. L. Valiant, Ed., Morgan Kaufmann Publishers, Inc., San
Mateo, California, pp.243249. (Honored as one of the four papers
that appeared by invitation in expanded form in the special issue of Machine Learning,
representing the top presentations at the workshop.)
 A. R. Barron
(1992). Neural
Net Approximation. Proceedings
of the 7th Yale Workshop on Adaptive and Learning Systems, May
2022, K. S. Narendra (Editor), Center for Systems Science, Yale University,
pp. 6972.
 J. Takeuchi
and A. R. Barron (2001). Properties of Jeffreys mixture for Markov sources.
Proc. Workshop on Information Based Induction Sciences (IBIS),
pp. 327333.
 A. R. Barron
and Xi Luo (2007). Adaptive
Annealing. Proceedings 45th Annual Allerton Conference on Communication,
Control, and Computing. Allerton House, UIUC, Illinois. September 2628.
pp. 665673.
 A. Barron, Bayesian
Properties of Normalized Maximum Likelihood and its Fast Computation.
Proc. IEEE
International Symposium on Information Theory. Honolulu, HI,
June 2014. pp.16671671.
Newsletter Article:
 R. Venkataramanan,
S. Tatikonda, A. Barron (2016). Sparse Regression Codes.
IEEE Information Theory Society Newsletter.December 2016, pp. 715. [Based on ISIT Tutorial by
R. Venkataramanan and A. Barron, Barcelona, July 2016.]
Patents:
Technical Reports: (with
details not in subsequent publications)
 A. R. Barron
(1984). Monotonic
central limit theorem for densities. Department of Statistics
Technical Report #50, Stanford University, Stanford, California.
 A. R. Barron
(1988). The
exponential convergence of posterior probabilities with implications for
Bayes estimators of density functions. Department of Statistics
Technical Report #7, University of Illinois, Champaign, Illinois.
 B. Clarke and
A. R. Barron (1990). Entropy
risk and the Bayesian central limit theorem. Department of
Statistics Technical Report, Purdue University, West Lafayette, Indiana.
 A. R. Barron
(1991). Information
theory and martingales. Presented at the 1991 IEEE International
Symposium on Information Theory (recent results session), Budapest,
Hungary, June 2329.
 A. R. Barron,
Y. Yang and B. Yu (1994). Asymptotically
optimal function estimation by minimum complexity criteria. Seven page
submission. Presented at the IEEE
International Symposium on Information Theory, Trondheim,
Norway, June 27  July 1.
 A. R. Barron
(1997). Information theory in
probability, statistics, learning, and neural nets. Department of
Statistics. Yale University. Working paper distributed at plenary
presentation of the Tenth Annual ACM Workshop on
Computational Learning Theory.
 J.I. Takeuchi
and A. R. Barron (1997). Asymptotically
minimax regret for exponential and curved exponential families.
Fourteen page original. Presented at the 1998 International Symposium
on Information Theory, Cambridge, Massachusetts.
 A. R. Barron
(1999). Limits of information,
Markov chains, and projection. Eight page original submission.
Presented at the 2000 IEEE International Symposium on
Information Theory, Sorrento, Italy.
 J. Yu and A.
R. Barron (2003). Maximal
compounded wealth for portfolios of stocks and options. Working paper,
some of which was presented at the Workshop on Complexity and Inference,
DIMACS, Rutgers University, June 25.
 W. Qiu and
A. R. Barron (2007). A maximum
wealth asset index and mixture strategies for universal portfolios on
subsets of stocks. See also the Yale Dissertation of
Wei (David) Qiu.
 C. Huang,
G.L.H. Cheang and A. R. Barron (2008). Risk
of Penalized Least Squares, Greedy Selection and L1 Penalization for
Flexible Function Libraries. Yale Department of Statistics Technical
Report. [Too long for journal publication, yet still has some of our best results.]
 A. R. Barron
and A. Joseph (2011). Sparse
Superposition Codes are Fast and Reliable at
Rates Approaching Capacity with Gaussian Noise. June 10, 2011. [This
is an expanded version. A shorter version was completed in 2012, appearing
2014 in the IEEE Transactions on Information Theory, per the
publication list above.]
 S. Chatterjee
and A. R. Barron (2014). Information Theory of
Penalized Likelihoods and its Statistical Implications. arXiv:1401.6714v2,
April 27. [A shorter version is in ISIT 2014.]
A Selection of Seminar
Presentation Files (pdf format); to view on a computer or project on a
screen:
 MDL,
Penalized Likelihood and Statistical Risk. Presented at the
Information Theory Workshop, Porto, Portugal, May 8, 2008. Festschrift
on the occasion of the 75th birthday of Jorma Rissanen. Similar
presentations with updates for the regression case at the Workshop on
Information Theory Methods in Science and Engineering, Tampere Finland,
August 19, 2008 and the Information and Communication Conference,
Renyi Institute, Budapest, August 2528, 2008, on
the occasion of the 70th birthday of Imre Csiszar: MDL Procedures with
L_1 Penalty and their Statistical Risk
 Adaptive Annealing.
Presentation at the Allerton Conference on Communication, Control, and
Computing. September 27, 2007.
 Information
Theory and Flexible HighDimensional NonLinear Function Estimation.
Presented at the InfoMetrics Institute Workshop, American University,
Wash, DC, November 12, 2011. Similar presentation at Harvard University,
Department of Statistics, Oct.2011. Overview of several useful results
for highdimensional function estimation. Disclaimer: The proposed
solution on page 19 to the differential equation for Adaptive Annealing
is problematic due to discontinuity of the gradient at the origin.
 Analysis of
Fast Sparse Superposition Codes. Presentation at the IEEE
International Symposium on Information Theory, St. Petersburg, Russia,
August 5, 2011. Adds details of distribution analysis not in the earlier
presentation "Toward Fast Reliable Communication, at Rates Near
Capacity with Gaussian Noise," IEEE International Symposium on
Information Theory, Austin, TX, June 18, 2010. Similar Presentations:
"Communication by Regression: Practical Achievement of Shannon
Capacity," at Workshop Infusing Statistics and Engineering, Harvard
University, June 56, 2011. "Sparse Superposition Codes: low
complexity and exponentially small error probability at all rates below
capacity," Workshop on Information Theory Methods in Science and
Engineering, Helsinki, Finland, August 8, 2011.
Other Conference Presentations
(19832008): (proceedings containing not more than 1 page abstracts).
[A number of subsequent presentations are described in the links above.]
 A. R. Barron
(1983). Convergence of logically simple estimates of unknown probability
densities. IEEE
International Symposium on Information Theory, Saint Jovite, Canada, September 2630.
 A. R. Barron
(1985). Entropy and the central limit theorem. IEEE International Symposium on
Information Theory, Brighton, England, June 2328.
 A. R. Barron
(1985). Ergodic theorem for densities: generalized ShannonMcMillanBreiman theorem. IEEE
International Symposium on Information Theory, Brighton,
England, June 2328.
 A. R. Barron
(1985). Logically smooth density estimation. Joint IMS, ASA Annual Meeting, Las
Vegas, Nevada, August 58.
 A. R. Barron
(1987). Applications of large deviations in statistics. Conference on Asymptotic
Methods for Stochastic Systems: Large Deviations Theory and Practice,
University of Maryland, October 2527. (Invited Presentation).
 A. R. Barron
(1988). The convergence in information of probability density
estimators. IEEE
International Symposium on Information Theory, Kobe, Japan,
June 1924.
 T. M. Cover
and A. R. Barron (1988). A bound on the financial value of information. IEEE International Symposium on
Information Theory, Kobe, Japan, June 1924.
 A. R. Barron
(1989). Minimum complexity density estimation. IMS Regional Meeting,
Lexington, Kentucky, March 1922. (Invited Presentation).
 A. R. Barron
(1989). Portfolio selection based on nonparametric density estimates for
the stock market. 21st
Symposium on the Interface: Computing Science and Statistics,
Orlando, Florida, April 912. (Invited Presentation).
 A. R. Barron
(1989). Minimum complexity estimation.
IEEE Workshop on Information Theory, Center for Applied
Math, Cornell University, June 2630. (Session Organizer).
 A. R. Barron
(1989). Some statistical properties of polynomial networks and other
artificial neural networks.
Conference on Neural Information Processing Systems, Denver,
Colorado, November 2730. (Invited Plenary Presentation).
 A. R. Barron
(1990). An index of resolvability of probability density estimators. IEEE International Symposium on
Information Theory, San Diego, California, January 1419.
 A. R. Barron
(1990). Some statistical convergence properties of artificial neural
networks. IEEE
International Symposium on Information Theory, San Diego,
California, January 1419.
 A. R. Barron
(1990). The index of resolvability: statistical convergence of minimum
complexity estimation. AAAI
Symposium on MinimalLength Encoding, Stanford California,
March 2729 (Invited Presentation).
 A. R. Barron
(1990). Some approximation and estimation theorems for artificial neural
networks. IEEE
Information Theory Workshop, Veldhoven,
The Netherlands, June 1015 (Invited Presentation).
 A. R. Barron
(1990). Statistical properties of artificial neural networks. SIAM Annual Meeting,
Chicago, Illinois, July 20 (Invited Presentation).
 A. R. Barron (1990).
Complexity Regularization. NATO Advanced Study Institute on Nonparametric
Functional Estimation and Related Topics, Spetses, Greece,
July 29  August 11 (Invited Presentation).
 A. R. Barron (1991).
Information theory and the stock market: the effect of side information.
Workshop on Coordination of Distributed
Information and Decisions, Cornell University, April 1113.
(Invited Presentation).
 A. R. Barron
(1991). Approximation and estimation results for adaptively synthesized
neural network architectures.
Workshop on Theoretical Issues in Neural Nets, Center for
Discrete Mathematics and Theoretical Computer Science, Rutgers,
University, May 2023. (Invited Presentation).
 A. R. Barron
and R. L. Barron (1991). Artificial neural networks in industry: some
developments using statistical techniques. IMS Special Topics Meeting on Statistics in
Industry, Philadelphia, Pennsylvania, June 912 (Invited
Presentation).
 A. R. Barron
(1991). Universal approximation bounds for superpositions
of a sigmoidal function. IEEE
International Symposium on Information Theory, Budapest,
Hungary, June 2329.
 A. R. Barron
(1991). Approximation and estimation bounds for sigmoidal and polynomial
networks. Joint
Statistical Meetings, Atlanta, Georgia, August 1922.
(Session Organizer).
 A. R. Barron
(1991). Approximation and estimation bounds for neural networks and
projection pursuit. Workshop
on Neural Information Processing Systems, Vail, Colorado,
December 1011. (Invited Presentation).
 A. R. Barron
(1991). Risk estimation, risk optimality, regularization, and neural
nets. Workshop on
Neural Information Processing Systems, Vail, Colorado,
December 1011.
 A. R. Barron
(1992). Approximation, estimation, and computation results for
artificial neural networks. 24th
Symposium on the Interface: Computing Science and Statistics,
College Station, Texas, March 1921. (Invited Presentation).
 A. R. Barron
(1992). Artificial neural networks: stochastic analysis and engineering
applications? Symposium
on Stochastic Processes in Engineering Applications, Otaniemi, Finland, April 79. (Invited Plenary
Presentation).
 A. R. Barron,
D. Olive, and Y. Yang (1992). Asymptotically optimal complexitybased
model selection. IMS
Regional Meeting, Corvallis, Oregon, June 1415. (Invited
Presentation).
 A. R. Barron
(1992). Statistical accuracy of neural nets. Conference on Neural Information Processing
Systems: Natural and Synthetic. Denver, Colorado, November
30  December 2. (Invited Tutorial).
 A. R. Barron
(1993). Neural net approximation. Annual
Meeting of the American Math Society, San Antonio, Texas,
January 1316. (Invited Presentation).
 A. R. Barron,
L. Gyorfi, and E. C. van der Meulen (1993). Universal coding of nondiscrete
sources based on distribution estimation consistent in expected
information divergence. IEEE
International Symposium on Information Theory, San Antonio,
Texas, January 1722. p.51.
 A. R. Barron,
B. Clarke, and D. Haussler (1993). Information bounds for the risk of
Bayesian predictions and the redundancy of universal codes. IEEE International Symposium on
Information Theory, San Antonio, Texas, January 1722.
 A. R. Barron
(1993). Statistical accuracy of neural nets. Workshop on Information and Geometry,
Hakone, Japan. March. (Invited Presentation).
 A. R. Barron
(1993). Optimal rate properties of minimum complexity estimation. Workshop on Descriptional
Complexity, Schloss Dagstahl, Wadern, Germany.
May. (Invited Presentation).
 A. R. Barron
(1993). Do neural nets avoid the curse of dimensionality? Congress on Statistics,
Vannes, France, May. (Invited Presentation).
 A. R. Barron
(1993). Performance bounds for neural net estimation and classification.
Annual Meeting of the
Classification Society of North America, Pittsburgh, June.
(Invited Presentation).
 A. R. Barron
(1993). Do neural nets avoid the curse of dimensionality? NATO ASI on Statistics and
Neural Nets, Les Arc, France, June. (Invited Presentation).
 A. R. Barron
(1994). The accuracy of Bayes estimates of neural nets. 26th Symposium on the Interface:
Computing Science and Statistics, Research Triangle Park,
NC, June 1518.
 A. R. Barron,
Y. Yang and B. Yu (1994). Asymptotically optimal function estimation by
minimum complexity criteria. IEEE
International Symposium on Information Theory, Trondheim,
Norway, June 27  July 1. (One page proceeding and seven page original submission
available).
 A. R. Barron
(1994). Neural net approximation and estimation. IEEE Workshop on Information
Theory, Moscow, Russia, July 36. (Invited Presentation).
 A. R. Barron
(1994) Minimum complexity estimation. ML/COLT:
Workshop on Applications of Descriptional
Complexity to Induction, Statistical and Visual Inference,
New Brunswick, New Jersey, July 89. (Invited Presentation.)
 A. R. Barron
(1995). Statistics and neural nets. New
England Statistics Conference, Storrs, Connecticut, April
22. (Invited Plenary Presentation).
 A. R. Barron,
Y. Yang (1995). Informationtheoretic development of minimax rates of
convergence. IEEE
International Symposium on Information Theory, Whistler, British
Columbia, September 1722. (Added to session at recommendation of
program chair; not in proceedings).
 N. Hengartner and A. R. Barron (1995). Information
theory and superefficiency. IMS regional meeting
Stanford, California. (Invited Presentation).
 B. S. Clarke
and A. R. Barron (1995). Jeffreys' prior
yields the asymptotic minimax redundancy. IEEEIMS Workshop on Information Theory and
Statistics, Alexandria, Virginia, October 2729. (Invited
Presentation).
 A. R. Barron
(1995). Asymptotically optimal model selection and neural nets. IEEEIMS Workshop on
Information Theory and Statistics, Alexandria, Virginia,
October 2729. (Invited Presentation).
 Q. Xie and A. R. Barron (1996). Asymptotic minimax
regret for data compression, gambling, and prediction. Workshop on sequence prediction,
Santa Cruz, California, May 35. (Invited, jointly presented).
 A. R. Barron
(1996). The fundamental role of Kullback
Information in large sample statistics. Solomon Kullback
Memorial Conference, Washington, DC, May 2324. (Invited
Presentation).
 A. R. Barron
(1996). Adaptation and model selection. AMSIMSSIAM Summer Research Conference on
adaptive selection of models and statistical procedures,
Mount Holyoke, Massachusetts, June 2228. (Conference Chairmen, A.
Barron, P. Bickel, I. Johnstone, D. Donoho).
 A. R. Barron
and Y. Yang (1996). Information theory in nonparametric estimation. Nonparametric Estimation: The
Road Ahead, Canberra, Australia, July 24. (Invited
Presentation).
 A. R. Barron
(1996). Adaptive model selection and neural networks. Sydney Interational
Statistics Congress, IMS regional meeting, Sydney,
Australia, July 812. (Invited Presentation).
 A. R. Barron
(1996). Asymptotics of Bayes estimators. International Society of
Bayesian Analysis, Regional Meeting, Chicago, Illinois,
August 23. (Invited Presentation).
 A. R. Barron
and Y. Yang (1996). Adaptive model selection and the index of
resolvability. Joint
Statistical Meetings of the ASA and IMS, Chicago, Illinois,
August 48. (Invited Presentation).
 A. R. Barron
and Q. Xie (1997). Asymptotic minimax regret
for data compression, gambling and prediction. IEEE International Symposium on
Information Theory, Ulm, Germany, June 29  July 4.
 A. R. Barron
(1997). Information theory in probability, statistics, learning, and
neural nets. Computational
Learning Theory: Tenth Annual ACM Workshop, Nashville,
Tennessee, July 69. (Invited Plenary Presentation).
 A. R. Barron
(1997). Information theory in probability, statistics, learning, and
neural nets. International
Conference on Combinatorics, Information Theory, and Statistics,
University of Southern Maine, Portland, Maine, July 1820. (Invited
Presentation).
 A. R. Barron
and Y. Yang (1997). Informationtheoretic determination of minimax rates
of convergence. Symposium
on Nonparametric Functional Estimation, Centre de Recherches Mathematiques,
University of Montreal, October 1618. (Invited Presentation).
 A. R. Barron
(1998). Nonlinear approximation, greedy algorithms, and neural networks.
Ninth International
Conference on Approximation Theory, January 46. (Invited
Presentation).
 A. R. Barron
(1998). How information theory illuminates the behavior of risk
functions of Bayes proceedures. Purdue Workshop on the
Interface between Paradigms of Statistics. June 1719.
(Invited Presentation).
 A. R. Barron
and J.I. Takauchi (1998). Mixture models
achieving optimal coding regret. IEEE
Information Theory Workshop. Killarney, Ireland, June 2226.
 J. Takeuchi
and A. R. Barron (1998). Asymptotic Minimax Regret by Bayes Mixtures. International
Symposium on Information Theory. Cambridge, Ma, August
1621.
 A. R. Barron
(1998). Information theory in probability and statistics; Approximation
and estimation bounds for Gaussian mixtures. CIRM Workshop on Information Theory, Statistics,
and Image Analysis, Marsielle,
France, December 711. (Two Invited Presentations).
 A. R. Barron
(1999). Information theory in probability and statistics. IEEE Information Theory
Workshop on Detection, Estimation, Classification, and Imaging,
Sante Fe, New Mexico, February 2426. (Invited
Presentation).
 A. R. Barron
(1999). Decision theory of regret for universal coding, gambling, and
prediction. DIMACS
Workshop: Online Decision Making, Rutgers University, New
Brunswick, NJ, July 1215. (Invited Presentation).
 A. R. Barron
(2000). Limits
of information, Markov chains and projection. IEEE International Symposium on
Information Theory, Sorrento, Italy, June 2630.
 A. R. Barron,
Laszlo Gyorfi, Micheal
Nussbaum (2000). Nonparametric
Estimation, Neural Nets and Risk Asymptotics.
Short Course. Mathematical Research Institute, Oberwolfach,
Germany, June 1017.
 A. R. Barron
(2000). Information theory in probability; Information theory in
statistics. J. Bolyia Society Conference on Information Theory in
Mathematics, honoring 50th anniversary of formation of what is
now known as the Renyi Institute of
Mathematics of the Hungarian Academy of Sciences. Balatonelle,
Hungary, July 37. (Two Invited Presentations).
 A. R. Barron
(2000). Prediction, data compression, gambling, and model selection: Do
Bayes procedures nearly minimize the maximum of regret over all possible
data sequences? AMSIMSSIAM
Summer Research Conference on Bayes, Frequentist, and Likelihood
Inference: a Synthesis. Mount Holyoke College, South Hadley,
Massachusetts, July 913. (Invited Presentation.)
 A. R. Barron
(2001). Informationtheoretic bounds for mixture modeling, model
selection, and data compression. Workshop
on Information Theory and Statistics, DIMACS, Rutgers
University, March 2001. (Invited Presentation).
 A. R. Barron
(2001). Information theory in probability and statistics. 23rd European Meeting of
Statistics, Funchal, Madeira, Portugal, August 1318.
(Invited Keynote Plenary Presentation).
 F. Liang and
A. R. Barron (2001). Minimax optimal predictive density estimation, data
compression, and model selection. Workshop
on MDL at Conference on Neural Information Processing Systems,
Whistler, British Columbia, December 10. (Invited Presentation).
 F. Liang and
A. R. Barron (2002). Exact minimax strategies for predictive density
estimation, data compression and model selection. IEEE International Symposium on
Information Theory, Lausanne, Switzerland, July 15.
 A. R. Barron, Jiangfeng Yu, and Wei Qui (2003). Maximum compounded
wealth: portfolio estimation, option pricing, and stock selection. Workshop on Complexity and
Inference, DIMACS, Rutgers University, June 25. (Invited
Presentation).
 A. R. Barron
(2003). The role of information in the central limit problem. Symposium on Information Theory
and Some Friendly Neighbors  Ein Wunschkonzert, ZIF, Center for
Interdisciplinary Research, Bielefeld, August 1113. (Invited
Presentation).
 A. R. Barron
(2003). Interplay of statistics and information theory in formulation
and selection of models. Workshop
on Model Building, Dortmund, Germany, November 1314.
(Invited Presentation).
 G. Leung and
A. R. Barron (2004). Information theory, model selection and model
mixing for regression, Conference
on Information Sciences and Systems, Princeton, NJ, March
1719. (Invited Presentation in session on Information Theory, Computer
Science, and Statistics.)
 A. R. Barron
and G. Leung (2004). Risk assessment for Bayes procedures and model
mixing in regression. IVth Workshop on Bayesian Nonparametrics,
Universita di Roma La Sapienza, Rome, Italy,
June 1216. (Invited Presentation).
 A. R. Barron
and G. Leung (2004). Risk assessment for model mixing. Workshop on Mathematical
Foundations of Learning Theory, Barcelona, Spain, June
1823. (Invited Presentation).
 A. R. Barron
(2004). Relative entropy in probability theory and mathematical
statistics. Workshop
on Entropy in the Mathematical, Physical, and Engineering Sciences,
Padova, Italy, June 2427. (Two Invited Presentations).
 A. R. Barron
(2004). Fitting functions of many variables: neural networks and beyond.
16th Conference on
Computational Statistics (COMPSTAT 2004), Prague, Czech
Republic, August 2328. (Invited Keynote Plenary Presentation).
 A. R. Barron
(2005). Neural nets, mixture models, and adaptive kernel machines. Yale Workshop on Adaptive and
Learning Systems, May 2931, Center for Systems Science,
Yale University. (Invited presentation).
 A. R. Barron
(2005). Challenges in highdimensional function estimation and attempted
solutions, Congress
on Statistics, Pau, France, June 610. (Invited Presentation).
 A. R. Barron
(2005). Information theory and risk analysis. Medallion Lecture. Joint Statistical Meetings of
the IMS and ASA, Minneapolis, Minnesota, August 711.
(Presented with IMS Medallion Award; onehour special invited
presentation on Aug. 7).
 A. R. Barron
(2005). Information theory and statistics for machine learning, IEEE Workshop on Machine
Learning for Signal Processing XV, Mystic, CT, October
2830. (Invited Keynote Plenary Presentation).
 M. Madiman and A. R. Barron (2006). Monotonicity of
information in the central limit theorem, Workshop on Information Theory and its
Applications, University of California, San Diego, February
69. (Invited Presentation).
 A. R. Barron
(2006). Simple risk bounds for mixing least squares regressions. Journees: Model Selection in
Statistics: Different approaches, University de Nice,
SophiaAntipolis, Nice, France, March 1419 (Two onehour invited
presentations).
 A. R. Barron
(2006). Simple risk bounds for mixing least squares regressions. International Workshop on
Applied Probability, University of Connecticut, Storrs, CT,
May 18. (Invited Presentation).
 A. R. Barron
and Wei Qiu (2007). Maximum wealth portfolios,
Workshop on
Information Theory and its Applications, University of
California, San Diego, January 29  February 2. (Invited Presentation).
 A. R. Barron,
Cong Huang, and Xi Luo (2008). Penalized squared error and likelihood:
risk bounds and fast algorithms, Workshop
on Sparsity in High Dimensional Statistics and Learning Theory,
Georgia Institute of Technology, Atlanta,
Georgia, March 2224. (Invited ThreePart Presentation).
 A. R. Barron
(2008). Information theory principles in probability and statistics, Elements of Information Theory
Workshop, on the Occasion of Tom Cover 70th birthday,
Stanford University, Stanford, CA, May 16. (Invited Presentation).
Invited Departmental
Seminar Presentations (19852007): (These had short abstract announcements).
[A number of subsequent presentations are indicated with the presentation files above.]
 Purdue
University, Joint Statistics Colloquium, October 3, 1985. Topic: Entropy
and the central limit theorem.
 Michigan State
University, Department of Statistics and Probability, January 28, 1986.
Topic: Generalized ShannonMcMillanBreiman
theorem.
 University of
Chicago, Department of Statistics, October 20, 1986. Topic: Uniformly
powerful tests.
 University of
Virginia, Department of Mathematics, March 5, 1987. Topic: Convergence
of Bayes estimators of probability density functions.
 Stanford
University, Department of Statistics, October 20, 1987. Topic:
Convergence of Bayes estimators of probability density functions.
 University of
Chicago, Department of Statistics, March 7, 1988. Topic: Convergence of
Bayes estimators of probability density functions.
 McGill
University, Joint Statistics Seminar for Montreal universities, March
31, 1988. Topic: Approximation of densities by sequences of exponential
families.
 Dupont Research
Center, Dover, Delaware, April 26, 1988. Topic: Statistical learning
networks.
 IBM T. J.
Watson Research Center, Yorktown Heights, New York, August 10, 1988.
Topic: Statistical learning networks.
 Stanford
University, Information Systems Laboratory, November 3, 1988. Topic:
Minimum complexity density estimation.
 IBM Technical
Education Center, Thornwood, New York, January 1112, 1989. Statistical
learning networks. In the short course on Knowledge Acquisition from
Data.
 Cornell
University, Department of Economics and Program of Statistics
(Cohosts), February 1, 1989. Topic: Convergence of Bayes estimators of
probability density functions.
 Purdue University,
Department of Statistics, September 7, 1989. Topic: Minimum complexity
density estimation.
 University of
Lowell, Massachusetts, Joint Seminar, Department of Mathematics and
Department of Electrical Engineering, March14, 1990. Topic: Statistical
properties of polynomial networks and other artificial neural networks.
 Carnegie
Mellon University, Department of Statistics, April 4, 1990. Topic:
Statistical properties of polynomial networks and other artificial
neural networks.
 University of
Chicago, Department of Statistics, October 15, 1990. Topic: Statistical
properties of artificial neural networks.
 University of
California, San Diego, Department of Mathematics, January 7, 1991.
Topic: Approximation bounds for artificial neural networks.
 University of
California, San Diego, Department of Mathematics, January 8, 1991.
Topic: Complexity regularization for nonlinear model selection.
 Siemens
Corporation, Princeton, New Jersey, February 28, 1991. Topic: Universal
approximation bounds for superpositions of a
sigmoidal function.
 University of
Wisconsin, Department of Statistics, April 3, 1991. Topic: Complexity
regularization for nonlinear model selection.
 University of
Wisconsin, Department of Mathematics, April 4, 1991. Topic:
Approximation bounds for artificial neural networks.
 Technical
University of Budapest, Department of Electrical Engineering, July 2,
1991. Topic: Universal approximation bounds for superpositions
of a sigmoidal function.
 Mathematical
Sciences Research Institute, Berkeley, California, September 25, 1991.
Topic: Empirical process bounds for artificial neural networks.
 Stanford
University, Department of Statistics, October 15, 1991. Topic:
Approximation and estimation bounds for artificial neural networks.
 University of
California, Santa Cruz, Department of Computer and Information Sciences,
October 17, 1991. Topic: Computationally efficient approximation and
estimation of functions using artificial neural networks.
 Yale
University, Department of Statistics, January 13, 1992. Topic: Neural
network estimation.
 University of
Virginia, Department of Electrical Engineering, Eminent Speaker Series,
February 21, 1992. Topic: Estimation of functions of several variables
 neural networks, Fourier decomposition, and Bayes methods.
 North Carolina
State University, Department of Statistics, February 29, 1992. Topic:
Estimation of functions of several variables  neural networks, Fourier
decomposition, and Bayes methods.
 Cornell
University, Center for Applied Mathematics, March 6, 1992. Topic:
Estimation of functions of several variables  neural networks, Fourier
decomposition, and Bayes methods.
 University of
North Carolina, Department of Statistics, March 30, 1992. Topic:
Estimation of functions of several variables  neural networks, Fourier
decomposition, and Bayes methods.
 University of Joenesu, Finland, Department of Statistics, April 9,
1992. Topic: Introduction to artificial neural networks.
 University of
Paris VI, Department of Statistics, April 15, 1992, and University of
Paris, Orsay, Department of Statistics,
April16, 1992. Topic: Estimation of functions of several variables 
neural networks, Fourier decomposition, and Bayes methods.
 University of
Paris VI, Department of Statistics, April 22, 1992, and University of
Paris, Orsay, Department of Statistics,
April23, 1992. Topic: Performance bounds for complexitybased model
selection.
 Princeton
University, Department of Electrical Engineering, May 14, 1992. Topic:
Overview of approximation results for sigmoidal networks.
 University of
Massachusetts at Lowell, Joint Seminar, Department of Mathematics and
Department of Electrical Engineering, October 21, 1992. Topic:
Statistical accuracy of neural nets.
 University of
Tokyo, Japan, Department of Information, Physics and Engineering, March
1993. Topic: Information theory and model selection.
 University of
Paris VI, Department of Statistics, May 1993. Topic: Optimal rate
properties of minimum complexity estimation.
 University of
Paris VI, Department of Statistics, May 1993. Topic:
Informationtheoretic proof of martingale convergence.
 University of
Pennsylvania, Wharton School, October 21, 1993. Topic: Neural networks
and statistics.
 Massachusetts
Institute of Technology, Center for Biological and Computational
Learning, October 27, 1993. Topic: Neural networks and statistics.
 Rutgers
University, Department of Statistics, October 5, 1994, Topic:
Statistical accuracy of neural nets.
 University of
South Carolina, Department of Mathematics, Spring 1995. Topic: Neural
net approximation.
 Carnegie
Mellon University, Department of Statistics, Fall 1995. Topic:
Information risk and superefficiency.
 Massachusetts
Institute of Technology, Department of Applied Mathematics, March 1996.
Topic: Consistent and uniformly consistent classification.
 Columbia
University, Department of Statistics, Fall 1996. Topic: Consistency of
posterior distributions in nonparametric problems.
 Northeastern
University, Joint Mathematics Colloquium with MIT, Harvard, and Brandiess, February 27, 1997. Topic: Information
theory in probability and statistics.
 Iowa State
University, Department of Statistics, March 28, 1997. Topic: Information
theory in probability and statistics: The fundamental role of Kullback divergence.
 Washington
University, St. Louis, Department of Electrical Engineering, Center for
Imaging Systems, April 16, 1997. Topic: Universal data compression,
prediction, and gambling.
 Massachusetts
Institute of Technology, LIDS Colloquium, May 5, 1998. Topic: Simple
universal portfolio selection.
 University of
California, Santa Cruz, Baskin Center for Computer Engineering, October
1998. Topic: Approximation bounds for Gaussian mixtures.
 Lucent, Bell
Laboratories, Murray Hill, New Jersey, March 1999. Topic: Approximation
and estimation bounds for mixture density estimation.
 Stanford
University, Department of Statistics, Probability Seminar, May 24, 1999.
Topic: Information, martingales, Markov chains, convex projections, and
the CLT.
 Stanford
University, Department of Statistics, Statistics Seminar, May 25, 1999.
Topic: Mixture density estimation.
 Rice
University, Departments of Statistics and Electrical Engineering,
November 4, 2000. Topics: Information theory and statistics  best
invariant predictive density estimators.
 University of
Chicago, Department of Statistics, November 21, 2000. Topics:
Information theory and statistics  best invariant predictive density
estimators.
 Yale University,
Department of Computer Science, Alan J. Perlis Seminar, April 26, 2001.
Topic: Neural nets, Gaussian mixtures, and statistical information
theory.
 Brown
University, Department of Applied Mathematics, May 9, 2001. Topic: I do
not recall.
 University of
Massachusetts at Lowell, Department of Mathematics, September 19, 2001.
Topic: Mixture density estimation.
 University of
California at Los Angeles, Department of Statistics, May 21, 2002.
Topic: Nonlinear approximation, estimation, and neural nets (I do not
recall the specific title).
 Columbia
University, Department of Statistics, October 28, 2002. Topic:
Information inequalities in probability and statistics.
 University of
Georgia, Department of Statistics, November 26, 2002. Topic: Information
inequalities in probability and statistics.
 University of
Pennsylvania, Wharton School, October 29, 2003. Topic: Portfolio
estimation for compounding wealth.
 University of
North Carolina (in conjunction with Duke University), Departments of
Statistics, November 3, 2004. Topic: Risk assessment and advantages of
model mixing for regression.
 South
Carolina, Department of Mathematics, April 7, 2005. IMI Distinquished Lecture. Topic: Statistical theory for
nonlinear function approximation: neural nets, mixture models, and
adaptive kernel machines.
 Helsinki
University and Helsinki Institute of Information Technology, Helsinki,
Finland. August 2225, 2005. Two talks: (1) Statistical foundations and
analysis of the minimum description length principle. (2) Consequences
of MDL for neural nets and Gaussian mixtures.
 Princeton
University, Department of Operations Research and Financial Engineering,
October 4, 2005. Topic: Statistical perspectives on growth rate optimal
portfolio estimation.
 IBM Research
Laboratories, Yorktown Heights, September 22, 2006. Topic: Generalized
entropy power inequalities and the central limit theorem.
 Purdue
University, Department of Computer Science, February 26, 2007. Prestige
Lecture Series on the Science of Information. Topic: The interplay of
information theory, probability, and statistics.
 University of
Illinois, Joint Seminar, Department of Statistics and Department of
Electrical and Computer Engineering, February 27, 2007. Prestige
Conference Series. Topic: The interplay of information theory and
probability.
 Boston
University, Department of Statistics, March 1, 2007. Prestige Conference
Series. Topic: Information inequalities and the central limit theorem.
 University of
California at Berkeley, Department of Computer Science, October 4, 2007.
Topic: Fast and accurate greedy algorithm for L1 penalized least
squares. Primarily presented by Cong Huang.
 Rutgers
University, Department of Statistics, December 12, 2007. Topic: Fast and
accurate L1 penalized least squares. Copresented with Cong Huang.
Ph.D. Dissertations
Supervised:
 Yuhong Yang (1996). Minimax Optimal Density
Estimation. Yale University. [Was Assistant Professor at Iowa State
University; Now Professor at University of Minnesota, Department of
Statistics.]
 Qun (Trent) Xie (1997). Minimax Coding and Prediction.
Yale University. [Was at GE Capital, Inc., Fairfield, CT. Then an
Assistant Professor at Tsinghua Univ.]
 Qiang (Jonathan) Li
(1999). Estimation of
Mixture Models. Yale University. [Was at KPMG Financial Services,
New York. Then at Stanford Research Institute, Palo Alto. Now at Radar
Networks, Inc., San Franscisco.]
 Feng Liang
(2002). Exact Minimax
Predictive Density Estimation. Yale University. [Was Assistant
Professor, Department of Statistics, Duke University. Now Associate
Professor, Department of Statistics, University of Illinois at UrbanaChampaign]
Research
Interests: discussion of results and open problems
 Entropy Power
Inequalities and Central Limit Theorems:
[Publications 3,26,29,34,64,69; At least seven conference presentations
and at least seven seminars.]
Information
theory tools involving entropy and Fisher information provide demonstration
of a strengthened central limit theorem. One compares the distribution of the
standardized sum of i.i.d. random variables to the
standard normal distribution. Publication [3], building on earlier work by
Larry Brown, shows the Kullback divergence
(relative entropy distance) converges to zero if and only if it is eventually
finite.
Publication [26], with Oliver Johnson, shows it converges to zero at
rate 1/n if the random variables have finite Poincare constant; also shown,
in the same issue, by Artstein, Ball, Barthe, and Naor. Implications
are given in Publication [26] for risk efficiency of the best unbiased
estimator of natural parameters in exponential families.
Publications [29] and [64], with Mokshay Madiman, simplify proofs of monotonicity of information
in the central limit theorem and extend the Shannon entropy power inequality
to arbitrary collections of subset sums. Implications for distributed
estimation of location parameters are explored with Madiman,
Abram Kagan, T. Yu in Publications [34] and [69].
More
Recent Developments: Bobkov, Chirstyakov and Gotze (2013, Annals of Probability) show that
the relative entropy distance converges to zero at rate 1/n assuming only
finite initial relate entropy and finite fourth moments, rather than the
infinite moments implied by a finite Poincare constant. The proof integrates
classical local limit theorem results of Petrov from 1964.
Open
Problems:
Give an information theoretic proof of the 1/n convergence assuming only
finite initial information distance and finite third or fourth moment. Seek a
possibly more direct understanding of the entropy power inequality without
appealing to corresponding inequalities for Fisher information or mean
squared prediction error. Characterize cleanly the role of dimension in
bounds on the relative entropy distance from the normal.
 Entropy Rates, Asymptotic Equipartition Property,
Likelihood Stabilization, and Inference for Dependent Processes:
[Publications 0, 2; At least two Conference Presentations and at least one seminar
2.]
Asymptotically
there is agreement between sample information quantities and their
expectations; likewise agreement between log likelihoods and their
expectations. In information theory and in statistical physics it is called
the asymptotic equipartition property (AEP).
Previously, the Shannon, McMillan, Brieman
theory from the 1950s, which was for discrete ergodic processes, and the
theory of Moy, Perez, and Kieffer, which
established L1 covergence for (1/n) logdensities
of absolutely continuous joint distributions of size n samples from
stationary processes. Publication [2], in 1985, established the
longconjectured convergence with probability 1; independently shown by
Steven Orey in the same year. The conclusion also holds
for log likelihood ratios, if the second measure is Markov with stationary
transitions, while the governing first measure is not required to be Markov.
This AEP or stability for loglikelihood ratios provides best error exponents for
hypothesis tests as discussed in [2]. It also provides a step in Publication [0],
allowing demonstration of universal consistency properties of logical smoothing
(a general Kolmogorov complexity based minimum descriptionlength criterion) for all
computable stationary ergodic processes.
Open
Problem:
Provide an extension showing log likelihood ratio stabilization in the case
that both measures are stationary and ergodic without Markov assumption. What
weaker assumption on the second measure is appropriate? A word of caution:
John Kieffer in the early 1970s gave a counterexample in which the (1/n) loglikelihood
ratio is not convergent for a pair of stationary (but not ergodic) measures.
 Foundations of
Minimum Description Length Principle of Inference and Universal Data
Compression:
[Publications 0,7,10,15,16,18,23,27,31,37,46,48,49,57,58,62,65,67,68,77,78,79,80,81,82,83;
At least twentyfour conference presentations and at least two seminars; Student Theses of
Clarke, Xie, Li, Liang and Chatterjee.]
The
minimum description length principle of statistical inference initiated by Jorma Rissanen,
with related work by Wallace, has realizations via twopart codes, mixture codes, and
predictive codes. Early statistical analysis is in Barron's Thesis (Publication [0]).
Twopart
codes correspond to complexity penalized loglikelihood, for which risk
analysis is in Publications [0,10,47,49,67,68] and the theses of Li and Chatterjee. Mixture
codes (average case optimal and minimax optimal) correspond to universal data
compression in information theory and to reference priors in statistics, for
which analysis is in Publications [0,7,15,16,18,23,27,44,46,49,55,72].
Further details on minimax regret for compression and prediction of arbitrary sequences
is in work with Junichi Takeuchi in Publications [31,57,58,62,65,78,79,80] and a patent.
Optimal predictive codes correspond to prequential
inference and to posterior model averaging, for which analysis is in
Publications [7,18,19,22,23,27,28,44,48,49].
A recent result [83] with Roos and Watanabe provides exact Bayes interpretation for
Starkov's normalized maximum likelihood distribution which is exactly minimax over all
sequences. What is stricking about this
result is that to provide exact minimaxity there is the potential for the prior to use
negative weights. Collectively, these investigations show
relationships in properties of penalized likelihoods, Bayes mixture
procedures, and predictive procedures revealed through their minimum
description length interpretations. Moreover, data compression performance
and statistical risk properties are linked.
For further literature on the minimum description length principle and where
our work fits in this context, one may see the review by Barron, Jorma Rissanen,
and Bin Yu (Publication [18]), and the book by Peter Grunwald
(2007, MIT Press).
Open
Problems:
It is the cumulative Kullback risk of predictive
estimators, for sample sizes up to n, that is most
directly related to redundancy of data compression, for which simple and
clean bounds are available. 1/n times the cumulative Kullback
risk expresses an average risk across sample sizes. (1) Show that the
Kullback risk with sample size n (the last term in the
sum) has corresponding performance (order d/2n in the parametric case) or
give a counterexample. Can the individual risk be worse than the average risk
across sample sizes? (2) For twostage codes (complexity penalized log
likelihoods), the existing risk bounds in broad
generality are for Hellinger divergence and related measures of distance.
Show that such penalized procedures have corresponding performance for the
stronger Kullback risk, comparable to the predictive
Bayes procedures, or give a counterexample.
 Analysis of
Penalized Criteria for Model Selection:
[Publications 0,10,13,17,20,30,40,42,45,47,49,54,60,67,68,81; Technical
report with Huang and Cheang; At Least nine conference presentations and
at least four seminars; Student Theses of Yang, Cheang, Huang, Luo,
Chatterjee and current student Klusowski.]
Desirable
penalties for empirical criteria produce estimators with statistical risk
shown to be related to the corresponding population tradeoff between accuracy
and penalty. Conditions on the penalty for which such conclusions hold are in
the cited publications, some of which are joint with Yuhong Yang, Lucien Birge,
Pascal Massart, Gerald Cheang, Cong Huang or Jason Klusowski.
The penalty conditions emphasize variablecomplexity cover
properties. Accordingly, one can adapt the structure of models as well the
number and size of parameters. For models built from linear combinations of
candidate terms from a library, implications for subset size penalties are in
Publications [0,10,17,20,30,40,42,45,51,54,60,67], likewise for L1
penalties on coefficients in Publications [30,40,45,49,67,68,81] and for combinations
of such penalties in Publication 13 and the technical report with Huang and Cheang.
Open
Problems:Works
by Sara van der Geer (Annals of Statistics, Feb 2008)
and by Wainwright and Negaban give alternative theory for penalized criterion
involving functional analysis conditions for a penalty such that desirable risk
bounds hold. Those approaches are also effective for L1 penalties, though they
do not take advantage of cover properties of the library of candidate terms.
Determine the relationship between their conditions and our complexitybased conditions.
 Analysis of
Bayes Procedures and other Model Averaging Procedures:
[Publications 0,4,6,7,15,16,21,22,27,28,44,46,48,49,56,63; Two
Technical Reports; At least ten conference presentations and at least
five seminars; Student Theses of Clarke and Leung.]
Building
on theory of Lorraine Schwarz from the early 1960s, necessary and sufficient
conditions for consistency of posterior distributions is established in the
1988 technical report, formulated in terms of existence of uniformly
consistent tests, parts of which are presented in Publications [6], [46] and
[21], the latter with Mark Schervish and Larry Wasserman.
Posterior predictive densities can be accurate even when posterior
distributions on parameters do not necessarily concentrate well on good
neighborhoods of the generative parameter. Quantification of prior
probability of Kullback neighborhoods is sufficient to provide rates of
convergence of predictive densities as shown in [44] and [46], using cumulative
Kullback risk.
In a fixeddesign Gaussianerror regression setting, simple
informationtheoretic bounds on risk of posterior model averaging are
given with Gibert Leung in Publication [28]. These
bounds are obtained there in the case of known error variance, with
subsequent extension to the case of unknown error variance in work by
Christophe Giraud and Yannick Baraud. A tool in
these developments is the unbiased estimator of risk due to Stein. The
cleanest risk bounds have weights for model averaging that are proportional
to square roots of posterior weights. The risk bounds involve optimal
approximation error and subset size tradeoff, with a multiplicative constant
of 1. Accordingly, the conclusion for model mixing are stronger in this setting
than all known results for model selection.
A fascinating result following from Publication [28] is that an
unbiased estimator of the mean squared error of prediction of Bayes
regression procedures is equal to RSS + 2V, the residual sum of squared
errors of the fit plus twice the sum of posterior variances. This risk
assessment is an attractive combination of accuracy of fit to training data
plus posterior variability.
Open
Problem:
Starting perhaps with the case of Gaussian distribution on the inputs to a
regression, find extension of these conclusions to the case of random design,
with estimation of the mean squared error of prediction at new points
independently drawn the same distribution as the sample.
 Optimal Rates
of Function Estimation for Large Function Classes:
[Publications 19, 22, 40,41; At least four conference presentations;
Thesis of Yang.]
For
regression and density estimation, three quantities of interest for a
function class are (1) the Wald minimax risk R_n of
estimation from finite samples of size n with mean square error, Hellinger,
or Kullback loss, (2) the Kolmogorov metric entropy
rate H_n= H(epsilon)/n at a critical covering or
packing radius epsilon_n, solving H(epsilon)/n =
epsilon^2 and (3) the Shannon information capacity rate C_n
of the channel that takes one from functions in the class to data of the
indicated size (known in information theory to also equal the minimax
redundancy rate of data compression). It turns out, per [22], that these
three seemingly rather different quantities are the same to first order.
This equivalence is the characterization that provides simple determination
of minimax rates of function estimation.
Beginning in the 1940s, an inequality
of Fano was used to bound communication rates
(logcardinality of message sets divided by block length n) by information
capacity, an essential step in establishing the capacity of communication
channels. Versions of this inequality were borrowed into statistical
investigations of lower bounds on minimax risk of function estimation by
Pinsker, Hasminskii, and Birge in the late 70s and early 80s. The implication at
that time was that if an optimal cover of a small diameter portion of the
function space has an appropriately large order size, then that
logcardinality divided by n determines the minimax rate.
In publication [22] with Yuhong Yang, we appealed to
the original Fano inequality to derive that, for any
infinitedimensional function class, the minimax risk is of the same order as
the metric entropy rate at the critical epsilon_n.
Thus to verify the minimax rate for a function class, or to upper and lower
bound that rate one only needs to know upper and lower bounds on the metric
entropy (one does not need to exhibit the more specialized cover property of
small diameter subsets). Equally interesting was the approach of Haussler and
Opper, also from the mid90s, which established similar conclusions with new
bounds on the information capacity.
It may be of interest to discuss the implications of optimal rate theory in
the model selection setting. For any complete sequence of basis functions and
any decreasing sequence r_k tending zero polynomially fast, a corresponding
function class is the approximation class of functions for which the projection
onto the first k basis functions has squared L2 error not more than r_k for
all k. Lorentz in the 1950s determined the metric entropy for such classes.
The implication from Publication [22] is that the minimax risk of these classes
is of order min {r_k + k/n}, where the minimum is over choices of k. This
conclusion is in concert with what is achieved by Bayes procedures as in
Publication [22] or with model selection
rules that adapt k, as in Publications [20],[30] or the
Tech Report with Huang and Cheang. These estimators
are adaptive in that the minimax rate is achieved simultaneously for all such
approximation classes.
There are implications of optimal rate theory for subset selection. A
formulation of sparse approximation classes dictated by the accuracy of
subset selection is also given in Publication [22]. The formulation there is
idiosyncratic to facilitate appeal to the conclusion of Lorentz for a
subsequence of allowed subsets, and does not completely match what would be
desired for characterizing sparse approximation classes. Lower bounds on
metric entropy for these or other formulations of sparse approximation
classes can provide implications for optimal statistical rates for
interpolation classes discussed in publication [30], section 6. Positive
results are given there, providing estimators achieving rates
achieved simultaneously for a range of interpolation classes for linear
combinations of libraries of finite metric dimension. These rates are faster
than previously obtained and it would be of interest to know whether they are
minimax optimal.
Open
Problem:
For a given library of candidate terms, generalize the formulation of sparse
approximation classes to better match the setting of approximation by subsets
of size k out of M and use it to bound the metric entropy of these classes
and associated interpolation classes, and hence their minimax rates of
estimation.
 Greedy
Algorithms for Subset Selection in Regression and Density Estimation,
and L1 Penalty Optimization:
[Publications 12,14,30,33,40,41,47,49,51,59,67,71,72 and the
Technical Report with Huang and Cheang; At
least seven conference Presentations and six seminars; Theses of Cheang, Li,
Huang, Luo, Joseph, and Cho, and current student Klusowski.]
Greedy
algorithms are ubiquitous in a range of methods for function estimation and
approximation including forward stepwise regression, projection pursuit
regression, multiple additive splines, and basis pursuit for wavelets. In
1989 Lee Jones established a result (published in the Annals of Statistics in
1992) for a broad class of greedy procedures that include a relaxation factor
in the update formulation, with a slight improvement in Publication [12]. This
result shows that a target function representable as a linear combination of
a library of bounded functions with L1 norm of coefficients not more than V
is approximated by k steps of such greedy algorithms (that is by k terms of
the library) with accuracy C/k where C is the square of V.
To deal with noise and with targets that may be outside of specified convex
hulls, extension of the greedy algorithm result was obtained in a lemma
jointly with WeeSun Lee, Peter Bartlett, and Bob Williamson (in 1996 IEEE
IT, see their acknowledgements), with further refinement in Publication [30]
with Albert Cohen, Wolfgang Dahmen, and Ron Devore.
These results show that in fitting a function f, the squared L2 error of
greedy algorithms is bounded by the minimum over all possible linear
combinations g of the sum of approximation error fg^2 and C(g)/k where
C(g)=4 V^2(g), with V(g) the L1 norm of coefficients in g.
Moreover, with
databased adaptive selection of the number of terms k, a statistical risk is
obtained that corresponds to min fg^2 + lambda V(g),
the best tradeoff between approximation error and L1 norm of coefficients
times a factor lambda equal to the square root of (log M)/n where M is the
library size and n is the sample size. Such conclusions are in [30],[40] and
the technical report with Huang and Cheang
My student Cong Huang showed in his Thesis (and the indicated tech report)
that a slight variant of the relaxed greedy algorithm (which we call
L1 penalized greedy pursuit  LPGP), actually solves the L1 penalized least
squares problem, with explicit control of the computational accuracy after k
steps (through the C(g)/k term in the bound). Refinements of this result are
in the Thesis of Xi Luo and the paper with Klusowski [40].
Corresponding results for log density expansions, with Li, Huang, and
Luo, are in Publications [49,67,68] and for mixture density estimation, with
Jonathan Li, are in his Thesis and in publication [47].
Open
Problem:
A variant of Least Angle Regression (LARS) algorithm of Brad Efron, Trevor Hastie, Iain Johnstone, and Rob Tibshirani, also builds up the solution one term at a
time, for a range of values of lambda, with final solution after a number of
steps k said to be not more than the sample size. Relate LARS to the greedy
algorithms, with inclusion of a relaxation factor if need be, to quantify its
approximate accuracy with fewer numbers of steps.
 Nonlinear
Approximation and Estimation for Highdimensional Libraries of Functions
including Neural Nets:
[Publications 8,12,13,14,20,22,24,30,39,40,41,43,45,50,51,52,53,54,55,59,60,66
and Technical Report with Huang and Cheang; At
least twentyseven conference presentations and at least twentysix seminars;
Theses of Cheang, Huang, Yang and current student Klusowski.]
In
dealing with adaptation and highdimensionality, the early 1990s were a
turning point in mathematical understanding of function estimation and
approximation, as will be explained. Long before that time, statistical and
engineering practice, both in academics and in industry, revealed clear advantage to
databased adaptation of the structure as well choice of number of parameters
of statistical models. The choice of subsets of linear combination in
regression is a simple example of such adaptation, and the choice of structure
of classification and regression trees, multiple layer neural nets and
polynomial nets provide more elaborate examples (see Publication [51] for a
review).
However, such statistics and engineering practice was very much at odds with
what was the prevailing mathematical understanding of nonparametric function
estimation and approximation theory. The tradition was to capture function
regularity primarily by local smoothness, through presumption of Sobolev
or Holder properties of derivatives. For these function classes, the
statistical risk result showed that no advantage for procedures other than
those based on linear strategies (fixed Kernel methods or linear projection
methods onto fixed bases). The most relevant approximation theory results for
these classes was the theory of Kolmogorov kwidths, expressing what fixed k
terms provides the best linear space minimizing of the maximal error of L2
projection onto the linear span. In this linear theory, the choice of k terms
is fixed, depending on the class, but not on the target function within the
class.
In contrast, for nonlinear approximation based on a given library of
candidate terms, the choice of the best k terms from the library is allowed
to depend on the target. Though such nonlinear kwidth theory was being
developed, when applied to the traditional Sobolev
and Holder function classes, it did not reveal substantial advantage in
performance.
Belief that such local derivative regularity is what matters distanced the
theory from statistical practice, discouraging the taking of general subsets
of terms in regression, when the practitioner knows that such adaptation is
advantageous. These disparities between theory and practice were
extenuated by the results that for functions of d variables the minimax optimal
rates for the traditional classes severely degrade with dimension, leading to
a conclusion in that theory that one needs exponentially large sample sizes
as a function of the dimension.
Two activities in the study of approximation and estimation of functions
dramatically changed perspectives. One was the development of theory and
practice of wavelets. Retaining the idea that regularity can be captured
locally through behaviors of moduli of smoothness, it was recognized that
such regularity need not be of the same type throughout the input space of
the function. Certain Besov spaces capture such
freedom and were shown to be equivalent to certain norm properties on
coefficients in newly developed orthogonal wavelet expansions. Moreover,
optimal approximation and estimation in these classes were shown to require
targetbased or databased choices subsets of terms to include, which is a
type of nonlinear approximation, rather than fixed linear projection. See
books and papers by Ron DeVore and Yuly Makavoz and others from
approximation theory, and the writings of this time of David Donoho
and Iain Johnstone and others from mathematical statistics. Taking
advantage of the orthogonality, for the lowdimensional spaces of two and
threedimensional images and time evolutions of such, for which one can have
an exhaustive filling of datapoints, practical algorithms become available
which bridged theory and practice in these settings.
The other activity concerns approximation and estimation of responses for
highdimensional input data. In this setting the input data is sparse,
typically modeled as a random sample from a distribution for which subsequent
data will likewise be drawn, and popular models entertain linear combinations
of nonlinearly parameterized terms, such as sinusoids, sigmoids, or multivariate
splines with variable knots. Barron found himself in a whirlwind of discussion,
as evidenced by the many conference and seminar requests. He argued that for
highdimensional function estimation, since local derivative regularity fails
to capture what can be inferred from manageable data sizes, for such settings
we should abandon those local smoothness notions as the primary source of
regularity. Instead, more structural notions of regularity matter, related to
informationtheoretic complexity or to the suitability of the presumed forms for
accurate representation of the target.
A theory he put forward, beginning in 1991, is for functions f which when
divided by a suitable V, are in the closure of the convex hull of a given
library of candidate terms (the smallest such V being called the variation of f
with respect to the library, in concert with the variation interpretation
associated with the library of indicators of halfspaces, publication [55]).
Including terms of either sign in the library, V(f) is equivalent to an L1
norm on coefficients of representation. A Fourier norm is shown to be one
device for controlling the variation for ridge type libraries such as sigmoids
and sinusoids (Publication [12]). For functions of finite variation the squared
norm of best kterm approximation and of statistical risk are bounded,
respectively, by V^2 /k and by V sqrt{d/n} to within logarithmic factors.
These bounds give rates as exponents of (1/k) or of (d/n), where the exponents
do not depend on the input dimension d.
To achieve these rates under such structural assumptions, unlike the
previously mentioned traditional function class settings, adaptation of the
subset of kterms (depending on the target) is essential. So we have a better
match of such theory to what we know to be important in practice. For
functions that do not have finite variation, but reside in interpolation
classes between this L1coefficient norm class and all of L2, the results of
publications [30],[40] (and the tech report with Cheang and Huang) give bounds
on rate, which continue to be respectable even for libraries of large metric
dimension. These function classes give settings in which, with suitable nonlinear
approximation and estimation techniques, we are not subject to a curse of
dimensionality.
Recent
Developments:
So under the finite variation condition the exponent is independent of
dimension. However, the base of the exponent was d/n which necessitated that
n be large compared to the input dimension d. This was already reasonably
good because the effective size of the dictionary of nonlinear terms is
exponentially large in d. Nevertheless, the recent trend in highdimensional
data analysis has been to try to allow the input dimension to be much larger
than n. That is a challenge when the model is nonlinear in these inputs.
Thus
it is a pleasure to report the recent 2016 results with current Jason Klusowski
(Publications 40,41). We show that with L_1 control (V1)
on the internal parameters of nonlinear ridge parameterized terms, (as well as
L1 control V0 on the external parameters of linear combination),the risk
of estimation of finite variation combinations of such functions is of order
[V (log d)/n]^{1/3}.
These new results for ridge combinations (including neural nets) are superior to
the classical bounds for d > n^{1/3}. Moreover, they can
produce small risk even for very large dimension situation of d = exp{o(n)}.
One needs n large compared to only (log d)
though one does pay the price of the smaller exponent of {1/3} for
the mean squared error compared to the classical exponent of {1/2} for
estimation of functions with finite variation with respect to these nonlinear
dictionaries. This puts this neural net theory squarely in the big data
situation.
Open
Problem:
A tantalizing prospect of this theory, with infinite libraries of high but
finite metricdimensionality, is that accurate computation of the estimates
is possible by use of suitable greedy algorithms discussed above. After all,
each step of an algorithm involves optimization of only one term from the
library. Empirically, investigators have been pleased with local optimization
strategies for each new term in various settings, including the Friedman and Stuetze projection pursuit regression algorithm and
various gradientbased neural net training algorithms. The hope of such local
strategies is that the residual surface is cleaned by each term included so
that ultimately the globally important terms are revealed, though that hope
has not yet been confirmed on general mathematical grounds.
Or one may study adaptive annealing methods for guided random search of
suitable terms from the library (some steps in this direction are in
reference 60 with Xi Luo and in the Thesis of Chatterjee). The question is to
resolve for what forms of nonlinearly parameterized terms is such adaptive
annealing rapidly convergent to a term of suitably high inner product with
the residuals from the previous greedy algorithm step. Resolution of suitably
formed computational questions is critical for both understanding and
practice of highdimensional function estimation. <p
Recent
Computational Developments With current student Jason Klusowski we are making progress in the analysis of
adaptive annealing for Lipshitz bounded activation
functions, using what we call variable replication, yielding a bounded
coordinate effect property, sufficient for stability of adaptive annealing.
Also with Klusowski and another current student
William David Brinda we are getting some progress with what we call the
nonlinear power method of optimization. Some aspects of both of these
developments are in the seminar slides from the recent 2015 NIPS workshop as
well as recent 2016 Princeton and Osijek presentations. These slides can be
accessed from the vita above.
 Maximum Wealth
Stock Indices and Growth Rate Optimal Portfolio Estimation:
[Publications 5,23,25; Two technical reports; At least seven conference
Presentations and at least four seminars; Theses of Jason Cross
and Wei (David) Qiu.]
Compounding
over n periods of investment, rebalancing each period to a portfolio vector W
which maintains fractions W1,W2,...,Wm of wealth in each of m assets (stocks,
bonds, cash, etc), let (X1i,X2i,...,Xmi) be the vector of returns (ratios of
price plus dividend at end of period i to the price at the start of period i),
then the multiperiod factor by which a unit of wealth is multiplied, denoted
Sn(W) is the product for i=1,2,...,n of the portfolio returns Zi=
W1 X1i + W2 X2i + ... + Wm Xmi. This multiperiod
wealth function Sn(W) is a logconcave function of
the portfolio vector W. That is, the wealth Sn(W) is
the exponential of a smooth concave function Yn(W)
which is the sum of the log portfolio returns. As such it exhibits a shape as
a function of W governed by Laplace approximation, peaked at a choice of
rebalancing portfolio W which would have made the most wealth over that time
period. Accordingly, in the report with Wei Qiu (as
well as in his thesis), in which we take the periods of investment to be
months to avoid overexposure to commissions, we maintain that an index of
the wealth in a market over a suitable period of months (e.g. a decade),
appropriate for historical consideration, is the maximum of Sn(W) over
choices of W. Examples are given where the market is taken to be the Nasdaq
100, the S&P 500, or all stocks (about 3000 in number) actively traded
over an eleven year period accessed in the U. Penn, Wharton School data set
(Wharton Research Data Services).
We provide a relaxed greedy algorithm for fast computation of this maximum.
The algorithm starts with the historically single best stock, and then at
each algorithm step introduces a new stock which, in convex combination with
the previously computed portfolio, yields the greatest increase in compounded
wealth. It is proven that in k steps of the
algorithm (that is with k stocks included) the error in computation of the
maximum of Sn(W) is bounded by V/k in the exponent,
where V depends on the volatility of the historically given returns. Thus the
algorithm is guaranteed to converge to the portfolio vector W which
historically had the highest returns. Our studies showed that only a handful
of stocks are included in the hindsight optimal portfolio vector and that the
associated maximal portfolio wealth factor (typically several hundred) is
much higher than the best single stock over the indicated time period.
Beginning in the first issue of Mathematical Finance, Tom Cover in 1990
introduced the notion of universal portfolios. These are updating portfolio
strategies in which it is guaranteed that the wealth achieved is within a
specified factor of the maximum of Sn(W). With Erik Ordentlick in the mid 90s, he
refined the universal portfolio theory to show that the optimal factor, valid
uniformly over all possible return sequences, exhibits a drop from the
maximum by not more than a factor of order (1/n)^d where the power d=(m1)/2.
For markets in which the maximal exponent Yn(W)
grows linearly with n with a nonvanishing rate, this shows that the
hindsight optimal rate can be achieved universally without advance knowledge
of what portfolio W will be best. In publication 23 with Trent Xie, we
identified the precise asymptotics
of the form of the wealth drop factor (C/n)^d including identification of the
constant C, and showed that this theory reduces to the same mathematics as
the identification of asymptotics of minimax
pointwise regret for data compression and prediction also developed there.
A continuoustime version of universal portfolios is developed in publication
25 as well as a version that allows target portfolios that use past returns
in parameterized ways. An advantage of continuoustime is that for stock
price processes of bounded empirical variation, the Sn(W)
becomes an explicit Gaussian shaped function of W so that certain universal
portfolios can be evaluated exactly by known Gaussian integral computations.
An obstacle to realizing high gains by portfolio estimation strategies is
that it is difficult to know in advance what subset of stocks is best. If one
tries to be universal for hundreds of stocks, the worst case factor (1/n)^d will kill off typical gains from the maximum of Sn(W)
over the time horizons in which the desirable portfolios remain stable. In
the technical report with Wei Qiu, we develop a
strategy in which we do not try to be universal for all of the stocks, but
rather we mix across all possible subsets of three, four, or more of the
stocks. By such mixture strategies we show growth rate optimality for
portfolios of stocks of bounded volatility.
Winnertakeall gambling markets (e.g. horse races) provide a setting in
which the theory and practice of growth rate optimality is considerably
simplified, as in the text by Cover and Thomas 2006. Briefly the wealth
exponent of a gambler is expressible as the differences of
Kullback divergences from empirical distributions of
distributions associated with the oddsmakers and the distribution specified
by gambling proportions. The technical report with Jengfeng
Yu shows that for single stocks, the presence of put or call options with
strike prices at each possible level of the stock makes it possible for
linear algebra transformations to convert the questions of portfolio choice
(portfolios of cash, stock, and options on the stock) and option pricing to
the simpler problems of gambling and odds on such gambles, where the events
on which one gambles are
the different possible outcomes of the stock price.
Open
Problem:
For multiple stocks, basket options (that is, options on portfolios of stocks)
are sufficient to convert the option portfolio problem and the option pricing
problem to the associated gambling and oddsmaking problems, even in a
discrete time setting. Use approximation results for linear combinations of
indicators of halfspaces to resolve how many basket options are required to
come appropriately close to the optimal growth exponent of the pure gambling
version of the problem.
 Practical
Codes for CapacityAchieving Communication with the Gaussian Noise
Channel:
[US Patent 8913686; Publications 32,33,70,71,72,73,74,75; One tutorial and
associated newsletter article [84] with Ramji Venkataramanan; One
technical report; Several conference papers and seminars; Theses of
Antony Joseph; Sanghee Cho and Cynthia Rush.]
The
current cellphone revolution in society is a consequence in part of the
highrate lowerrorprobability codes based on LDPC and turbo decoders using
iterative probability propagation (message passing) in graphical networks
(also called Bayesian belief propagation in graphical statistical models).
These codes were empirically demonstrated beginning in the mid1990s, to have
the property of reasonably low error probability at communication rates that
are a reasonably fraction of capacity. These decoding algorithms apply
methodology that would determine the posterior most probable codeword if the
graphical model representing the decoding
problem were a tree (a loopfree network). The graphical network representing
the decoding task for these codes does not have such structure and
accordingly scholars find the problem of mathematical proof of favorable
error probability and communication rate properties for such practical codes
to be elusive except for certain simple channels (the erasure channel and
some channels related to it).
Communication technology is left in a rather nebulous state concerning
whether the performance will scale suitably as computation and memory
capability improves. What is theoretically desirable with decoding remaining
fast enough to be performed onthefly, is that the
error probability be exponentially small with increasing size n of the codes,
with communication rate R that remains at a specified preferably high
fraction of the Shannon Capacity C. Moreover, this should be so for the real
additive Gaussian noise channel common to electromagnetic propagation through
the air or electrical propagation on wires or cables, or to sufficiently
finely quantized versions of this real channel.
It is unknown whether existing LDPC or turbo codes will have these desirable
scaling properties. Indeed it is even conceivable that there is what is
called a noise floor for the error, that is, a positive value below which the
error probability will not be made any smaller even if the existing style
codes are made larger. This uncertainty about communication capability should
appropriately give pause to companies whose performance hinges on
communications technology.
My current and recent students and I are engaged in an effort to bring a
positive resolution to the problem of practical communication for the real
channel, with low computational complexity, and with error probability that
rigorously scales to be exponentially small with the code size n, for any
specified rate R < C.
We tackle practical capacityachieving communication, as described in
publications [32,33,70,71,72,74,75,84], using an approach that is
motivated by understanding of problems of subset selection in a linear model
representation of the communications task. Capacityachieving versions of
these codes are called Sparse Superposition Codes.
An ensemble of n by N design matrixes or dictionaries X are specified. In
practice each entry would be independent and +1 valued and
equiprobable, though our first theorems for this setting
take these entries of the design to be independent standard normal. The
number of columns N dictates the memory size for chips that would implement
the code.
The stream of message bits and the set of design matrix columns are both
partitioned into L sections, of size log M bits for each piece of the
message, used to select (i.e. address) one out of M
columns in each section of the design matrix. The codeword
consists of the sum (the superposition) of the selected columns (one from
each section), with a final scaling to satisfy requirements of power
allocation.
Natural choices of L and M are comparable to n, and subject to the relationship
n R = L log M, specifying two equivalent ways to count the number of
information bits, where R is the communication rate.
Decoding as described in [33,71,72,74,84], consists of parallel pipelined
computation of inner products, first with the received vector Y, with each of
the columns of X, followed by adding to the codeword
fit those columns for which the inner product is above a specified threshold.
In parallel, a stack of such chips likewise compute the inner products of
such columns with the residuals of the fits for a succession of such received
sequences. Each such inner product resolution determines an additional
portion of the true columns superimposed in the codeword.
A logarithmic number of such inner product chips are
found to be sufficient to sufficiently resolve the codeword,
with exponentially small error probability.
Additional
Specifics and an Open Problem: For some scholars, there is interest not
only in the case of fixed R < C, but also interest in the case that R
approaches C as the code size n increases. The rate at which R approaches C
cannot be too fast if we want small error probability. Ideally, the error
probability is exponentially small in n(CR)^2, so, as long as n(CR)^2 grows
without bound, the probability of error will become small while allowing ever
higher rate near C. This exponential relationship was first shown by Shannon
(for his codes that did not attempt to be practical).
The paper [32] shows that this desirable exponential scaling continues to hold
for optimallydecoded sparse superposition codes of manageable dictionary
size. That allows R to approach C at a rate arbitrarily close to (1/n)^{0.5}
below capacity. Similar expressions for the
practical decoder are obtained, different from the ideal in two main respects.
One is that the error probability is exponential not in n(CR)^2 but rather
exponential in L(CR)^2 (which is within a log factor of optimal). The other
difference is that, sofar, the bound for
the practical decoder is demonstrated, in references 33,71,72,74] only for R
below C by at least an order 1/ log n gap.
The open problem concerns the restriction that when R approaches C it maintains
a gap of 1/ log n. The problem is whether such a
restriction is necessary for a practical decoder (one such as ours that
computes onthefly or more generally for any decoder with computational
resource a low order polynomial in n) in order to have error probability
exponential in n(CR)^2.
In the work in reference [74] and the Thesis of Sanghee
Cho, we have the decoding algorithm compute posterior weights for each
section based on current summary statistics (a kind of softdecision rather
than harddecision decoder, for all but the final step of the algorithm).
Evidence is given that this procedure allows for R a larger fraction of C for
sensible size codes.
Associate
Editorship:
 1993  1995 IEEE Transactions on
Information Theory. A.E. for nonparametric estimation,
classification and neural nets
 1995  1997 Annals of Statistics.
 1994  1997 Neural Networks.
Yale
Departmental and Divisional Responsibilities:
Director of Graduate
Studies (Fall 1993 through Spring 1999).
Director of Undergraduate Studies (Fall 2010 through Spring
2016, Spring 2017).
Acting Chair (Spring 1998).
Chair (Spring 2001  Fall 2006).
 Division of
Social Sciences
Senior Appointments Committee (Fall 1994 
Spring 1995, Fall 1999  Spring 2000)
 Program of
Applied Mathematics
Director of Undergraduate
Studies (Fall 1999  Spring 2001; Fall 2004  Spring 2006)
Senior Coordinator of Undergraduate Studies (Fall 2006 
Spring 2015)
Other
Activities and Honors:
 FAI free
flight model glider designer and competitor, F1A class (Andrew Barron
and family):
Andrew is a five time U.S. National Champion: 1984, 1987,
1992, 2007 and 2009.
Three time U.S. National Team Member at the World
Championships:
1995 in Hungary, 2001 in California, and 2013
in France.
Member of nine man U.S. Team
winning the Challenge France World Championships 2013.
Alternate to the U.S. National Team in 1977 and 2015.
America's Cup Champion in 1998 and 2009.
Son Peter was a U.S. National Team Member at the World
Championships 2015 in Mongolia.
Son Peter was a U.S. National Team Member at the Junior
World Championships 1998 in Romania.
Son John was a U.S. National Team Member at the Junior
World Championships 2000 in the Chech Republic.
Son Timothy was the U.S. National Champion in 2010 and
2012.
Timothy was a US National Team Member at the Junior World
Championships 2008 in the Ukraine.
Timothy finished as third place individual, part of second
place F1A team (USA), and part of
first place U.S. team (F1A,F1B,F1P combined).
Daughters Michelle and Gina earned place on U.S. National
Team for Junior WC 2012 in Slovania.
Gina finished as part of second place F1A team and Second
place team overall (F1A,F1B,F1P combined).
Coowner and manager of Barron Field, LLC, which owns a 287 acre sodfarm in
Orange County, NY. It is the primary flying site used in the northeastern US
region to host freeflight meets and competitions, sanctioned by the Academy
of Model Aeronautics (AMA) and the Federation Aeronautique
International (FAI).

