Quadratic functions are the workhorse of the analysis of iterative algorithms such as gradient-based optimization. In both discrete and continuous time, they lead to closed-form dynamics that treat all eigensubspaces of the Hessian matrix independently, which makes the mathematics of convergence behaviors simple to work out (maximal step-size, condition number, acceleration, scaling laws for gradient descent or its…
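As a worked sketch of these closed-form dynamics (notation mine, not taken from the post): for gradient descent with step-size \(\gamma\) on the quadratic \(f(x) = \frac{1}{2} x^\top H x\), with eigenpairs \((\lambda_i, u_i)\) of \(H\),
\[
x_{k+1} = x_k - \gamma H x_k = (I - \gamma H)\, x_k
\quad\Longrightarrow\quad
u_i^\top x_k = (1 - \gamma \lambda_i)^k \, u_i^\top x_0,
\]
so each eigensubspace evolves independently and contracts as soon as \(0 < \gamma < 2 / \lambda_{\max}(H)\).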
Beyond Power Laws: Scaling Laws for Next-Token Prediction
The past posts on optimization scaling laws [1, 2] focused on problems that do not become significantly harder as the problem size increases. We showed that for some problems, as the dimension \(d\) goes to infinity, the optimality gap converges at a sublinear rate \(\Theta(k^{-p})\) for some power \(p\) depending on the problem, but independent…
Revisiting scaling laws via the z-transform
In the last few years, we have seen a surge of empirical and theoretical works about “scaling laws”, whose goals are to characterize the performance of learning methods based on various problem parameters (e.g., number of observations and parameters, or amount of compute). From a theoretical point of view, this marks a renewed interest in…
Unraveling spectral properties of kernel matrices – II
This month, we continue our exploration of the spectral properties of kernel matrices. As mentioned in a previous post, understanding how eigenvalues decay is not only fun but also key to understanding algorithmic and statistical properties of many learning methods (see, e.g., chapter 7 of my book “Learning Theory from First Principles”). This month, we look…
My book is (at last) out!
Just in time for Christmas, I received the first hard copies of my book two days ago! It comes with a mix of relief and pride after 3 years of work. As most book writers will probably acknowledge, it took much longer than I expected when I started, but overall it was an enriching…
Scaling laws of optimization
Scaling laws have been one of the key achievements of theoretical analysis in various fields of applied mathematics and computer science, answering the following key question: how fast does my method or algorithm converge as a function of (potentially partially) observable problem parameters? For supervised machine learning and statistics, probably the simplest and oldest…
Unraveling spectral properties of kernel matrices – I
Since my early PhD years, I have plotted and studied eigenvalues of kernel matrices. In the simplest setting, take independent and identically distributed (i.i.d.) data, such as in the cube below in 2 dimensions, take your favorite kernels, such as the Gaussian or Abel kernels, plot eigenvalues in decreasing order, and see what happens. The…
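A minimal sketch of that experiment, under assumptions of my own (uniform data on the unit square, a Gaussian kernel, and an arbitrary bandwidth sigma; none of these choices come from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 500, 0.5
X = rng.uniform(size=(n, 2))                 # i.i.d. points in the unit square

# Pairwise squared distances and the Gaussian kernel matrix
# K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / (2 * sigma ** 2))

# Eigenvalues of the symmetric PSD matrix, sorted in decreasing order
eigs = np.linalg.eigvalsh(K)[::-1]
print(eigs[:10])                             # the decay of these values is what the posts study
```

Swapping in the Abel kernel \(\exp(-\|x_i - x_j\| / \sigma)\) gives a different decay profile, which is the kind of behavior the post examines.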
Revisiting the classics: Jensen’s inequality
There are a few mathematical results that any researcher in applied mathematics uses on a daily basis. One of them is Jensen’s inequality, which allows bounding expectations of functions of random variables. It comes up constantly in probabilistic arguments, but also as a tool to generate inequalities and optimization algorithms. In this blog…
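For reference (the statement itself is not spelled out in the excerpt above), Jensen’s inequality in its most common form reads
\[
f\big(\mathbb{E}[X]\big) \;\leq\; \mathbb{E}\big[f(X)\big]
\]
for any convex function \(f\) and any integrable random variable \(X\), with the inequality reversed for concave \(f\).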
Rethinking SGD’s noise – II: Implicit Bias
In the previous post, we showed (or at least tried to!) how the inherent noise of the stochastic gradient descent algorithm (SGD), in the context of modern overparametrised architectures, is structured and carries two important features: (i) it vanishes for interpolating solutions and (ii) it belongs to a low-dimensional manifold spanned by the gradients. Building…
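A small numerical sketch of those two features, under assumptions of my own (an overparametrized least-squares problem, with the SGD noise taken as the gap between a single-sample gradient and the full gradient; an illustration only, not the post’s setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                                   # fewer samples than parameters (interpolation regime)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w_star = np.linalg.pinv(X) @ y                   # minimum-norm interpolating solution

# (i) at interpolation, every single-sample gradient of 0.5*(x_i.w - y_i)^2 vanishes,
# so the SGD noise vanishes there as well
per_sample_grads = (X @ w_star - y)[:, None] * X
print(np.abs(per_sample_grads).max())            # ~0, up to numerical precision

# (ii) away from interpolation, each stochastic gradient (and hence the noise) is a
# combination of the x_i's, so it lies in the low-dimensional span of the gradients
w = rng.standard_normal(d)
full_grad = X.T @ (X @ w - y) / n
noise = (X[0] @ w - y[0]) * X[0] - full_grad
print(np.linalg.matrix_rank(np.vstack([X, noise])))   # stays at n: the noise adds no new direction
```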
Rethinking SGD’s noise
It seemed a bit unfair to devote a blog to machine learning (ML) without talking about its current core algorithm: stochastic gradient descent (SGD). Indeed, SGD has become, year after year, the basic foundation of many algorithms used for large-scale ML problems. However, the history of stochastic approximation is much older than that of ML:…