Supervised learning methods come in a variety of flavors. While local averaging techniques such as nearest-neighbors or decision trees are often used with low-dimensional inputs where they can adapt to any potentially non-linear relationship between inputs and outputs, methods based on empirical risk minimization are the most commonly used in high-dimensional settings. Their principle is…
Category: Machine learning
Machine learning concepts or tools
Effortless optimization through gradient flows
Optimization algorithms often rely on simple intuitive principles, but their analysis quickly leads to a lot of algebra, where the original idea is not transparent. In last month post, Adrien Taylor explained how convergence proofs could be automated. This month, I will show how proof sketches can be obtained easily for algorithms based on gradient…
Computer-aided analyses in optimization
In this blog post, I want to illustrate how computers can be great allies in designing (and verifying) convergence proofs for first-order optimization methods. This task can be daunting, and highly non-trivial, but nevertheless usually unavoidable when performing complexity analyses. A notable example is probably the convergence analysis of the stochastic average gradient (SAG) [1],…
On the unreasonable effectiveness of Richardson extrapolation
This month, I will follow up on last month’s blog post, and describe classical techniques from numerical analysis that aim at accelerating the convergence of a vector sequence to its limit, by only combining elements of the sequence, and without the detailed knowledge of the iterative process that has led to this sequence. Last month,…
Are all kernels cursed?
The word “kernel” appears in many areas of science (it is even worse in French with “noyau”); it can have different meanings depending on context (see here for a nice short historical review for mathematics). Within machine learning and statistics, kernels are used in two related but different contexts, with different definitions and some kernels…
The “η-trick” reloaded: multiple kernel learning
In my previous post, I described various (potentially non-smooth) functions that have quadratic (and thus smooth) variational formulations, a possibility that I referred to as the η-trick. For example, in its simplest formulation, we have \( \displaystyle |w| = \min_{ \eta \geq 0} \frac{1}{2} \frac{w^2}{\eta} + \frac{1}{2} \eta\). While it seems most often used for…