Just in time for Christmas, I received the first hard copies of my book two days ago! It is a mix of relief and pride after three years of work. As most book writers will probably acknowledge, it took much longer than I expected when I started, but overall it was an enriching experience for several reasons: (1) I learned a lot of new math to write it; (2) trying to squeeze out the simplest arguments for the analysis of algorithms led to many new research questions; (3) I was able to get the help of many colleagues and students (see the long list in the preface below); I cannot thank them enough.
The book is published by MIT Press in the series I am editing (as expected, they were very supportive throughout the process). It is freely available online from my webpage, but of course, nothing prevents you from buying a hard copy from your favorite online store (see a list here), thus contributing to the purchase of my next road bike (for French fans, amazon.fr and fnac.com also list it, as probably do major online stores in many countries).
Beyond the pdf / hard copy, all figures can be reproduced with code in Python and Matlab, available here (I am planning to add Julia at some point). I have started to collect solutions for the many exercises that cover all chapters (available here; warning: work in progress). If you want to contribute solutions, highlight typos, or suggest improvements, it would be much appreciated (just send me an email). I may at some point add historical sections, which I decided to leave out because they would have taken more time than I had (feel free to suggest pointers).
Why yet another book on learning theory?
There are many good books on learning theory. Why write a new one? Read the preface (reproduced at the end of the post) for a series of arguments, but the main reason is that I felt that the current trend in the mathematical analysis of machine learning was leading to overly complicated arguments and results that are often not relevant to practitioners. Therefore, my aim was to propose the simplest formulations that can be derived from first principles, trying to remain rigorous without overwhelming readers with more powerful results that require too much mathematical sophistication. I have done my best, but I am sure there are simpler arguments in various places; let me know if you spot one. Also, I try to link theoretical results with practical performance through a series of toy experiments.
A key distinctive feature of the book is the focus on real-valued prediction functions: they have become the de facto standard for modern machine learning techniques, even when predicting discrete-valued outputs. Thus, although its historical importance and influence are crucial, I chose not to present the Vapnik-Chervonenkis dimension, and instead directly based my generic bounds on Rademacher complexities. This focus on real-valued prediction functions makes least-squares regression a central part of the theory, which is well appreciated by students, as many of the important concepts in machine learning (regularization, stochastic algorithms) are already present in a simplified manner.
Among the various topics I cover, some follow a standard treatment, but some may deserve attention, even from experienced readers (I may write future blog posts about some of them). See below for some impressions on each chapter:
- Chapter 1 (mathematical preliminaries): nothing really fancy here; I present a bag of useful computational tricks and the main concentration inequalities. Some have already led to blog posts (e.g., Jensen’s inequality, concentration inequalities for matrices), or will (matrix inversion lemma).
- Chapter 2 (introduction to supervised learning): the chapter focuses on the traditional decision-theoretic formulation of supervised learning, with losses, risks, etc. I added a no-free-lunch theorem from Luc Devroye (1982) because I think it nicely shows that learning is impossible without assumptions.
- Chapter 3 (linear least-squares regression): whether this old method originated with Legendre or Gauss is irrelevant; I believe it is important because it already encapsulates many of the classical concepts in machine learning, in particular the need for regularization to avoid convergence rates in \(d/n\), where \(n\) is the number of observations and \(d\) the number of parameters. It can thus already hammer home the message: the number of parameters is usually not the best way to measure the generalization capabilities of a learning method.
- Chapter 4 (empirical risk minimization): The chapter starts with an extensive (but classical) study of convex surrogates for binary classification (which will be extended in chapter 13 with structured prediction). For convex loss functions, regularized estimation is first done using constrained optimization (as this is simpler), but to avoid the annoying discrepancy between theory and practice, there is a dedicated section on penalized estimation, with simple (some of them new) generic bounds.
- Chapter 5 (optimization): It was hard to fit everything in one chapter (but I cheated as I left out some nuggets in chapter 11 on online learning)! Starting with quadratic problems, where the convergence of gradient descent can be established with linear algebra, the standard tools from convex optimization and then stochastic approximation are presented in a unified way, with an emphasis on the natural testing error performance of stochastic gradient descent. Variance reduction is also presented, with the simplest proof I could find.
- Chapter 6 (local averaging methods): k-nearest-neighbor prediction is a bit old school, but this is the simplest method that can adapt to any prediction function (and the simplest to explain to your grandparents). I am reusing here the nice exchangeability arguments from the 2015 book of Gérard Biau and Luc Devroye to get simple bounds with few assumptions. For Nadaraya-Watson estimators (i.e., kernel regression), I am presenting a simple proof using Bernstein’s inequality.
- Chapter 7 (kernel methods): this is also a dense chapter, where I focus primarily on Sobolev spaces to be able to characterize adaptivity to smoothness. For Lipschitz-continuous losses, only approximation errors need to be characterized, and I was able to avoid integral operators. For square loss, I am reusing the very nice proof technique from Jaouad Mourtada and Lorenzo Rosasco, which leads to particularly simple bounds in expectation.
- Chapter 8 (sparse methods): I am focusing a lot on the square loss in this chapter, reusing a proof technique from Philippe Rigollet and Sacha Tsybakov, which is adapted to constrained or penalized estimation, leading to the celebrated result in \(( \sigma^2 k \log d)/n\). After briefly presenting the fixed design treatment often seen in the statistical literature (with a zoo of conditions on the design matrix), I am focusing on random design where fast rates can be reasonably simply obtained with only strong convexity.
- Chapter 9 (neural networks): I made the choice to present only single hidden layer neural networks, where the estimation error and optimization error properties can be finely characterized, with a particular focus on the adaptivity to linear latent variables. Again, the number of hidden neurons is not the key driver for potential generalization to unseen data. The link with kernel methods and random features (which corresponds to only optimizing the last layer of weights) is made explicit. This rather “classical” treatment is complemented with a later chapter on overparameterized models.
- Chapter 10 (ensemble learning): this chapter has two mostly independent parts. The first, on bagging / random projections, presents classical Gaussian random projections as well as extensions to non-linear predictions. The second part, on boosting, tries to unify algorithms originating from different communities, such as matching pursuit and Adaboost, with an explicit new proof of the performance of boosting without imposing extra regularization. The rate is not optimal, but this is closer to what is used in practice (regularization by early stopping). There are probably sharper results here.
- Chapter 11 (from online learning to bandits): I am only scratching the surface of a wider topic here, but for online learning, the differences with classical stochastic optimization are made explicit with unified notations (this is also where I squeeze in mirror descent). I describe multi-armed bandits in ten self-contained pages; this is short but enough to capture the main ideas and the (dis)similarities with more classical supervised learning.
- Chapter 12 (overparameterized models): this is a chapter closer to research, where I tried to describe the recent important results on overparameterized models in the simplest possible setting, such as global convergence of gradient descent, implicit bias for convex and nonconvex problems (done for diagonal linear networks), double descent (with both a simple argument for Gaussian data and a less simple one for random projections), and lazy learning.
- Chapter 13 (structured prediction): another chapter with recent results, starting with a treatment of multi-category classification with a focus on multivariate prediction functions and the associated generalization properties (where stochastic gradient descent leads to strictly better bounds than the ones obtained from Rademacher averages for empirical risk minimization). I am then presenting in a unified way some of the recent literature on predicting complex outputs using convex surrogates, with a sequential treatment of quadratic, smooth, and nonsmooth surrogates.
- Chapter 14 (probabilistic methods): I first review probabilistic modeling interpretations of several learning methods, focusing primarily on identifying losses and priors with log densities but drawing clear distinctions between what this analogy brings and what it does not (in particular, sparse methods such as \(\ell_1\)-minimization are not adapted to data coming from the distribution whose negative log density is the \(\ell_1\)-norm). We then show how Bayesian inference naturally leads to model selection criteria and end the chapter with a description of PAC-Bayesian analysis, with references to the nice recent monograph of Pierre Alquier.
- Chapter 15 (lower bounds on generalization and optimization errors): research-wise, I prefer contributing to upper bounds on performance by designing and analyzing fast algorithms, but I have to admit that lower bounds are important as well (in particular when they match the upper bounds). This chapter deals both with optimization lower bounds (where hard-to-optimize functions are exhibited, mostly obtained from Yurii Nesterov’s work) and with statistical lower bounds (where I used information-theoretic arguments). For stochastic gradient descent, I am reusing the nice proof technique from Agarwal et al. (2012).
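The \(d/n\) message of chapter 3 is easy to see in a toy simulation. The sketch below is my own illustration (not taken from the book's companion code), with a well-specified Gaussian-design linear model: the excess risk of unregularized ordinary least-squares grows linearly in the number of parameters, roughly at rate \(\sigma^2 d/n\) when \(d \ll n\).

```python
import numpy as np

def ols_excess_risk(n, d, sigma=1.0, trials=50, seed=0):
    """Average excess risk of ordinary least-squares with isotropic
    Gaussian design; theory predicts roughly sigma^2 * d / n."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        w_star = rng.standard_normal(d) / np.sqrt(d)
        y = X @ w_star + sigma * rng.standard_normal(n)
        # unregularized least-squares estimator
        w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        # for isotropic inputs, the excess risk is ||w_hat - w_star||^2
        errs.append(np.sum((w_hat - w_star) ** 2))
    return float(np.mean(errs))

for d in (5, 20, 80):
    print(f"d={d:3d}  excess risk ~ {ols_excess_risk(200, d):.3f}  "
          f"d/n = {d/200:.3f}")
```

As \(d\) grows toward \(n\), the error deteriorates even faster than \(d/n\), which is exactly where regularization becomes necessary.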
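Chapter 5's starting point, that the convergence of gradient descent on a quadratic follows from plain linear algebra, fits in a few lines. This is my own sketch (step size \(1/L\), contraction factor \(1-\mu/L\) per iteration), not code from the book:

```python
import numpy as np

# Quadratic objective f(w) = (1/2) w^T H w - b^T w, minimized at w* = H^{-1} b.
# Gradient descent w_{t+1} = w_t - (1/L)(H w_t - b) gives
# w_{t+1} - w* = (I - H/L)(w_t - w*), whose eigenvalues lie in [0, 1 - mu/L],
# hence ||w_t - w*|| <= (1 - mu/L)^t ||w_0 - w*||: no convexity machinery needed.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
H = A @ A.T + np.eye(10)          # positive-definite Hessian
b = rng.standard_normal(10)
w_star = np.linalg.solve(H, b)

eigs = np.linalg.eigvalsh(H)
mu, L = eigs[0], eigs[-1]         # smallest and largest eigenvalues

w = np.zeros(10)
for t in range(200):
    w -= (H @ w - b) / L          # constant step size 1/L

dist = np.linalg.norm(w - w_star)
bound = (1 - mu / L) ** 200 * np.linalg.norm(w_star)
print(dist, bound)                # observed error vs. linear-algebra bound
```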
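And the local averaging idea of chapter 6 really is grandparent-simple: predict at a point by averaging the labels of its nearest neighbors. A minimal k-nearest-neighbor regressor (again my own toy sketch, not the book's code) takes a handful of lines:

```python
import numpy as np

def knn_regress(X_train, y_train, x, k=5):
    """Predict at x by averaging the labels of the k nearest
    training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

# toy 1-d example: learn f(x) = sin(2*pi*x) from noisy samples
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(500)

x0 = np.array([0.25])
print(knn_regress(X, y, x0, k=10))   # close to sin(pi/2) = 1
```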
Preface
Why study learning theory? Data have become ubiquitous in science, engineering, industry, and personal life, leading to the need for automated processing. Machine learning is concerned with making predictions from training examples and is used in all of these areas, in small and large problems, with a variety of learning models, ranging from simple linear models to deep neural networks. It has now become an important part of the algorithmic toolbox.
How can we make sense of these practical successes? Can we extract a few principles to understand current learning methods and guide the design of new techniques for new applications or to adapt to new computational environments? This is precisely the goal of learning theory. Beyond being already mathematically rich and interesting (as it imports from many mathematical fields), most behaviors seen in practice can, in principle, be understood with sufficient effort and idealizations. In return, once understood, appropriate modifications can be made to obtain even greater success.
Why read this book? The goal of this textbook is to present old and recent results in learning theory for the most widely used learning architectures. In doing so, it lays out a few principles to understand the overfitting and underfitting phenomena, as well as a systematic exposition of the three types of components in their analysis: estimation, approximation, and optimization errors. Moreover, the goal is not only to show that learning methods can learn given sufficient amounts of data but also to understand how quickly (or slowly) they learn, with a particular eye toward adaptivity to specific structures that make learning faster (such as smoothness of the prediction functions or dependence on low-dimensional subspaces).
This book is geared toward theory-oriented students, as well as students who want to acquire a basic mathematical understanding of algorithms used throughout machine learning and associated fields that are significant users of learning methods (such as computer vision and natural language processing). Moreover, it is well suited to students and researchers coming from other areas of applied mathematics or computer science who want to learn about the theory behind machine learning. Finally, since many simple proofs have been put together, it can serve as a reference for researchers in theoretical machine learning.
A particular effort will be made to prove many results from first principles while keeping the exposition as simple as possible. This will naturally lead to a choice of key results showcasing the essential concepts in learning theory in simple but relevant instances. A few general results will also be presented without proof. Of course, the concept of first principles is subjective, and I will assume the readers have a good knowledge of linear algebra, probability theory, and differential calculus.
Moreover, I will focus on the part of learning theory that deals with algorithms that can be run in practice, and thus, all algorithmic frameworks described in this book are routinely used. Since many modern learning methods are based on optimization, chapter 5 is dedicated to that topic. For most learning methods, some simple illustrative experiments are presented, with accompanying code (MATLAB and Python for the moment, and Julia in the future) so students can see for themselves that the algorithms are simple and effective in synthetic experiments. Exercises currently come with no solutions and are meant to help students understand the related material.
Finally, the third part of the book provides an in-depth discussion of modern special topics such as online learning, ensemble learning, structured prediction, and overparameterized models.
Note that this is not an introductory textbook on machine learning. There are already several good ones in several languages (see, e.g., Alpaydin, 2020; Lindholm et al., 2022; Azencott, 2019; Alpaydin, 2022). This textbook focuses on learning theory, that is, deriving mathematical guarantees for the most widely used learning algorithms and characterizing what makes a particular algorithmic framework successful. In particular, given that many modern methods are based on optimization algorithms, we put a significant emphasis on gradient-based methods and their relation with machine learning.
A key goal of the book is to look at the simplest results to make them easier to understand, rather than focusing on material that is more advanced but potentially too hard at first and that provides only marginally better understanding. Throughout the book, we propose references to more modern work that goes deeper.
Book organization. The book comprises three main parts: an introduction, a core part, and special topics. Readers are encouraged to read the first two parts to understand the main concepts fully and can pick and choose among the special topic chapters in a second reading or if used in a two-semester class.
All chapters start with a summary of the main concepts and results that will be covered. All the simulation experiments are available at https://www.di.ens.fr/~fbach/ltfp/ as MATLAB and Python code. Many exercises are proposed and are embedded in the text with dedicated paragraphs, with a few mentioned within the text (e.g., as “proof left as an exercise”). These exercises are meant to deepen the understanding of the nearby material, by proposing extensions or applications.
Many topics are not covered at all, and many others are not covered in depth. There are many good textbooks on learning theory that go deeper or wider (e.g., Christmann and Steinwart, 2008; Koltchinskii, 2011; Mohri et al., 2018; Shalev-Shwartz and Ben-David, 2014). See also the nice notes from Alexander Rakhlin and Karthik Sridharan, as well as from Michael Wolf.
In particular, the book focuses primarily on real-valued prediction functions, as they have become the de facto standard for modern machine learning techniques, even when predicting discrete-valued outputs. Thus, although its historical importance and influence are crucial, I chose not to present the Vapnik-Chervonenkis dimension (see, e.g., Vapnik and Chervonenkis, 2015), and instead based my generic bounds on Rademacher complexities. This focus on real-valued prediction functions makes least-squares regression a central part of the theory, which is well appreciated by students. Moreover, this allows for drawing links with the related statistical literature.
Some areas, such as online learning or probabilistic methods, are described in a single chapter to draw links with the classical theory and encourage readers to learn more about them through dedicated books. I have also included chapter 12 on overparameterized models and chapter 13 on structured prediction, which present modern topics in machine learning. More generally, the goal in the third part of the book (special topics) was, for each chapter, to introduce new concepts, while remaining a few steps away from the core material and using unified notations.
A book is always a work in progress. In particular, there are still typos and almost surely places where more details are needed; readers are most welcome to report them to me (and then get credit for it). I am convinced that more straightforward mathematical arguments are possible in many places in the book. Please let me know if you have any elegant and simple ideas I have overlooked.
How to use this book? The first nine chapters (in sequence, without the diamond parts) are adapted for a one-semester upper-undergraduate or graduate class, if possible after an introductory course on machine learning. The following six chapters can be read mostly in any order and are here to deepen the understanding of some special topics; they can be read as homework assignments (using the exercises) or taught within a longer (e.g., two-semester) class. The book is also well suited to self-study, with the first nine chapters read in sequence and the last six in any order. In all situations, chapter 1, on mathematical preliminaries, can be read quickly and studied in more detail when relevant notions are needed in subsequent chapters.
Acknowledgments. This textbook is extracted from lecture notes from a class that I have taught (unfortunately online, but this gave me an opportunity to write more detailed notes) during the Fall 2020 semester, with extra passes during the classes I taught in the Spring 2021, Fall 2021, Fall 2022, and Fall 2023 semesters.
These class notes have been adapted from the notes of many colleagues I had the pleasure to work with, in particular Lénaïc Chizat, Pierre Gaillard, Alessandro Rudi, and Simon Lacoste-Julien. Special thanks to Lénaïc Chizat for his help with chapter 9 on neural networks and for proofreading many of the chapters, to Jaouad Mourtada for his help on lower bounds and random design analysis for least-squares regression, to Alex Nowak-Vila for his help on calibration functions, to Vivien Cabannes for his help on consistency proofs for local averaging techniques, to Alessandro Rudi for his help on kernel methods, to Adrien Taylor for his help on chapter 5 on optimization, to Marc Lelarge for his help on overparameterized models, to Olivier Cappé for his help on multiarmed bandits, and to Lawrence Stewart for his help on neural network architectures. The notes from Philippe Rigollet have been a very precious help for chapter 8 on model selection. The careful readings of large portions of the text by Bertille Follain and Gabriel Stoltz have been very helpful. Feedback from the anonymous reviewers was also useful.
Former and current collaborators also helped in the final stages by reading carefully, annotating, and commenting a chapter: Eloïse Berthier, Raphaël Berthier, Vivien Cabannes, Aymeric Dieuleveut, Nicolas Flammarion, Pierre Gaillard, Hadrien Hendrickx, David Holzmüller, Dmitrii Ostrovskii, Loucas Pillaud-Vivien, Alessandro Rudi, Kevin Scaman, and Adrien Taylor. This was greatly appreciated.
Typos and suggestions have been highlighted by Ritobrata Ghosh, Thanh NguyenTang, Ishaan Gulrajani, Johannes Oswald, Seijin Kobayashi, Mathieu Dagreou, Dimitri Meunier, Antoine Moulin, Laurent Condat, Quentin Duchemin, Quentin Berthet, Mathieu Bloch, Fabien Pesquerel, Guillaume Bied, Uladzimir Yahorau, Pierre Dognin, Vihari Piratla, Tim Tsz-Kit Lau, Samy Clementz, Mohammad Alkousa, Eloïse Berthier, Pierre Marion, Vincent Liu, Atsushi Nitanda, Cheik Traoré, Ruiyuan Huang, Naoyuki Terashita, Jiangrui Kang, Moritz Haas, Mastane Achab, Beré Nortier, Cassidy Laidlaw, Jing Wang, Motonobu Kanagawa, Shane Hoeberichts, Dishank Jain, Aymeric Dieuleveut, Steffen Grünewälder, Claire Boyer, Bernhard Schölkopf, Piyushi Manupriya, Qingyue Zhao, Thomas Pock, Eliot Beyler, Yves Leconte, Jean Pichon, Brieuc Antoine Dit Urban, Théo Voldoire, Guénolé Joubioux, Adéchola Kouande, Zhu Wang, Leon Rofagha, John Zarka, Liviu Aolaritei, Gaétan Marceau Caron, Ivan Barrientos, Thomas Boudou, Sebastian Gruber, Julien Stoehr, Jingxin Zhang, Sacha Braun (Add your name to the list by sending me typos and comments!).
I am grateful to Elizabeth Swayze, Matthew Valades, Susan McClung, Roger Wood, Emma Donovan, Jitendra Kumar, and everyone at MIT Press for their assistance in preparing and publishing this book.