Last month, we saw that self-concordance is a key property in optimization that allows local quadratic approximations to be used in the sharpest possible way. In particular, it is an affine-invariant quantity leading to a simple and elegant analysis of Newton's method. The key assumption was a link between third- and second-order derivatives, which took the following form…
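For reference, the standard self-concordance assumption from the convex optimization literature (recalled here from standard sources; the truncated excerpt does not display it) bounds the third derivative by the second one:
\[
\forall x, \ \forall h, \quad \big| f'''(x)[h,h,h] \big| \ \leq \ 2 \, \big( f''(x)[h,h] \big)^{3/2}.
\]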
Going beyond least-squares – I : self-concordant analysis of Newton method
Least-squares is a workhorse of optimization, machine learning, statistics, signal processing, and many other scientific fields. I find it particularly appealing (too much, according to some of my students and colleagues…), because all algorithms, such as stochastic gradient [1], and analyses, such as for kernel ridge regression [2], are much simpler and rely on reasonably…
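As a baseline for what "going beyond" refers to, here is the ordinary least-squares problem and its classical closed-form solution (a standard fact, stated for context rather than quoted from the post), for a design matrix \(X \in \mathbb{R}^{n \times d}\) with full column rank and responses \(y \in \mathbb{R}^n\):
\[
\min_{\theta \in \mathbb{R}^d} \ \frac{1}{2n} \, \| y - X \theta \|_2^2, \qquad \theta^\ast = (X^\top X)^{-1} X^\top y.
\]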
Finding global minima with kernel approximations
Last month, I showed how global optimization based only on function evaluations can be hard without any convexity assumption. In a nutshell, with limited smoothness, the number of function evaluations has to grow exponentially with the dimension, which is a rather negative statement. On the positive side, this number does not grow as fast…
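A classical instance of this scaling, recalled here as a standard zeroth-order lower bound rather than quoted from the post: to guarantee accuracy \(\varepsilon\) for an arbitrary Lipschitz-continuous function on the unit cube, the number of function evaluations must be, up to constant factors, of the order of
\[
\Big( \frac{1}{\varepsilon} \Big)^{d},
\]
which is exponential in the dimension \(d\) for any fixed target accuracy \(\varepsilon < 1\).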
Optimization is as hard as approximation
Optimization is a key tool in machine learning, where the goal is to achieve the best possible objective function value in a minimum amount of time. Global guarantees can usually be obtained for convex objective functions, or in special cases such as risk minimization with over-parameterized one-hidden-layer neural networks (see…
The Cauchy residue trick: spectral analysis made “easy”
In many areas of machine learning, statistics, and signal processing, eigenvalue decompositions are commonly used, e.g., in principal component analysis, spectral clustering, convergence analysis of Markov chains, convergence analysis of optimization algorithms, low-rank inducing regularizers, community detection, seriation, etc. Understanding how the spectral decomposition of a matrix changes as a function of the matrix itself is…
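For context, the identity at the heart of this "trick" (the standard Cauchy residue formula for spectral projectors, recalled here from complex analysis rather than from the truncated excerpt) expresses the projector onto the eigenspace of an isolated eigenvalue \(\lambda_k\) of a symmetric matrix \(A\) as a contour integral of the resolvent:
\[
\Pi_k \ = \ \frac{1}{2 \pi i} \oint_{\mathcal{C}_k} (z I - A)^{-1} \, dz,
\]
where \(\mathcal{C}_k\) is a contour in the complex plane enclosing \(\lambda_k\) and no other eigenvalue.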
Polynomial magic III : Hermite polynomials
After two blog posts earlier this year on Chebyshev and Jacobi polynomials, I am coming back to orthogonal polynomials, with Hermite polynomials. This time, in terms of applications to machine learning, no acceleration, but some interesting closed-form expansions in positive-definite kernel methods. Definition and first properties: there are many equivalent ways to define Hermite polynomials…
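As a reminder of the object in question (one common convention, the "probabilists'" Hermite polynomials, given here for reference; the post may use a different normalization), the polynomials can be defined through a Rodrigues-type formula:
\[
H_n(x) \ = \ (-1)^n \, e^{x^2/2} \, \frac{d^n}{dx^n} \, e^{-x^2/2},
\]
which makes them orthogonal with respect to the standard Gaussian density.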
The many faces of integration by parts – II : Randomized smoothing and score functions
This month, I will follow up on last month's blog post and look at another application of integration by parts, which is central to many interesting algorithms in machine learning, optimization, and statistics. In this post, I will consider extensions to higher dimensions, where we take integrals on a subset of \(\mathbb{R}^d\), and focus primarily on…
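A representative identity of this kind, stated here as a standard fact (Stein's lemma for the standard Gaussian) rather than quoted from the truncated excerpt: for \(Z \sim \mathcal{N}(0, I)\) in \(\mathbb{R}^d\) and a smooth function \(g\) with sufficiently slow growth, integration by parts gives
\[
\mathbb{E} \big[ Z \, g(Z) \big] \ = \ \mathbb{E} \big[ \nabla g(Z) \big],
\]
which underlies both randomized-smoothing gradient estimators and score-function identities.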
The many faces of integration by parts – I : Abel transformation
Integration by parts is a highlight of any calculus class. It leads to many classical applications, such as integrating logarithms and exponentials, and it is the source of countless exercises and applications to special functions. In this post, I will look at a classical discrete extension that is useful in machine learning…
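For reference, the discrete analogue in question (the Abel transformation, also known as summation by parts, recalled here as a standard identity) reads, with partial sums \(A_k = \sum_{i=1}^k a_i\) and \(A_0 = 0\):
\[
\sum_{k=1}^{n} a_k b_k \ = \ A_n b_n \ + \ \sum_{k=1}^{n-1} A_k \, ( b_k - b_{k+1} ).
\]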
Gradient descent for wide two-layer neural networks – I : Global convergence
Supervised learning methods come in a variety of flavors. While local averaging techniques such as nearest neighbors or decision trees are often used with low-dimensional inputs, where they can adapt to potentially non-linear relationships between inputs and outputs, methods based on empirical risk minimization are the most commonly used in high-dimensional settings. Their principle is…
Effortless optimization through gradient flows
Optimization algorithms often rely on simple, intuitive principles, but their analysis quickly leads to a lot of algebra in which the original idea is no longer transparent. In last month's post, Adrien Taylor explained how convergence proofs could be automated. This month, I will show how proof sketches can be obtained easily for algorithms based on gradient…
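As an illustration of the kind of effortless argument alluded to (a standard gradient-flow computation, not taken verbatim from the post), consider the continuous-time dynamics \(\dot{x}(t) = -\nabla f(x(t))\); a one-line computation shows that the objective value can only decrease along the flow:
\[
\frac{d}{dt} f(x(t)) \ = \ \nabla f(x(t))^\top \dot{x}(t) \ = \ - \| \nabla f(x(t)) \|^2 \ \leq \ 0.
\]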