Online Optimization Specialization (4/4): Review of 'Projection-free Online Learning'

Specialization Introduction

This specialization covers five selected foundational papers in online optimization. In each blog post, I discuss one paper, aiming to include
  1. A brief introduction to and summary of the paper
  2. Key takeaways from the paper
Note that all of the discussion and summaries in this specialization are based on the reviewed papers; none of the algorithms or theorems are my own.

Summary

Paper Detail

Title: Projection-free Online Learning
Author(s): Elad Hazan, Satyen Kale

Abstract

The computational bottleneck in applying online learning to massive data sets is usually the projection step. We present efficient online learning algorithms that eschew projections in favor of much more efficient linear optimization steps using the Frank-Wolfe technique. We obtain a range of regret bounds for online convex optimization, with better bounds for specific cases such as stochastic online smooth convex optimization.
Besides the computational advantage, other desirable features of our algorithms are that they are parameter-free in the stochastic case and produce sparse decisions. We apply our algorithms to computationally intensive applications of collaborative filtering, and show the theoretical improvements to be clearly visible on standard datasets.

Advantages of the Projection-free Algorithm

  • Parameter-Free: The algorithm is parameter-free in the basic setting and extends naturally to the stochastic setting while remaining parameter-free.
  • Efficient Representation and Sparsity: The algorithm maintains a distribution over the vertices of the decision set, eliminating the need for any further decomposition; this yields a natural form of sparsity.

Preliminary Knowledge

Here, I list the preliminary knowledge that was not covered in class.

Smoothed Functions

Let $\mathbb{B}$ and $\mathbb{S}$ denote the unit ball and unit sphere in $\mathbb{R}^n$ respectively. Given $\delta > 0$, let the $\delta$-smoothing of a function $f$ be:
$$
\hat{f_\delta}(x) = E_{u \in \mathbb{B}}[f(x+\delta u)],
$$
where $u$ is chosen uniformly at random from $\mathbb{B}$. The following lemma shows that $\hat{f_\delta}$ is a good smooth approximation of $f$.
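
In practice the expectation defining $\hat{f_\delta}$ is rarely available in closed form, but it is easy to approximate by sampling. The following minimal Python sketch (my own illustration; the helper names and sample count are arbitrary choices, not anything from the paper) estimates $\hat{f_\delta}(x)$ by Monte Carlo:

import numpy as np

def sample_unit_ball(n, rng):
    # Uniform sample from the unit ball B in R^n:
    # uniform direction on the sphere, radius with density proportional to r^{n-1}.
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    r = rng.uniform() ** (1.0 / n)
    return r * u

def smoothed_value(f, x, delta, num_samples=10000, seed=0):
    # Monte Carlo estimate of f_delta(x) = E_{u in B}[f(x + delta * u)].
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    samples = [f(x + delta * sample_unit_ball(len(x), rng)) for _ in range(num_samples)]
    return float(np.mean(samples))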

Lemma 2.1

If $f$ is convex and L-Lipschitz, then the function $\hat{f_\delta}$ has the following properties:
  1. $\hat{f_\delta}$ is convex and L-Lipschitz.
  2. For any $x \in \mathcal{K}, \nabla\hat{f_\delta}(x) = \frac{n}{\delta}E_{u \in \mathbb{S}}[f(x+\delta u)u]$.
  3. For any $x \in \mathcal{K}, \| \nabla \hat{f_\delta}(x) \| \leq nL$.
  4. $\hat{f_\delta}$ is $\frac{nL}{\delta}$-smooth.
  5. For any $x \in \mathcal{K}, |f(x)-\hat{f_\delta}(x)| \leq \delta L$.
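
Property 2 is particularly useful: the gradient of the smoothed function can be estimated from function values alone, by sampling directions uniformly on the unit sphere. A minimal sketch of that estimator (again my own illustration, not code from the paper):

import numpy as np

def smoothed_gradient(f, x, delta, num_samples=10000, seed=0):
    # Monte Carlo estimate of grad f_delta(x) = (n / delta) * E_{u in S}[f(x + delta * u) u].
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    total = np.zeros(n)
    for _ in range(num_samples):
        u = rng.standard_normal(n)
        u /= np.linalg.norm(u)          # u is uniform on the unit sphere S
        total += f(x + delta * u) * u
    return (n / delta) * total / num_samples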

Sparsity

Let $\mathcal{K} \subseteq \mathbb{R}^n$ be a convex, compact set and let $x \in \mathcal{K}$. We say that $x$ is $t$-sparse w.r.t. $\mathcal{K}$ if it can be written as a convex combination of $t$ boundary points of $\mathcal{K}$.
All the algorithms in this paper produce a $t$-sparse prediction at iteration $t$ w.r.t. the underlying decision set $\mathcal{K}$.
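
The reason the iterates stay sparse is that each Frank-Wolfe-style update mixes the current point with a single boundary point (the vertex returned by the linear optimization step), so after $t$ updates the iterate is a convex combination of at most $t$ such points. Here is a small sketch of how this representation can be tracked explicitly (my own illustration; the dictionary-of-vertices representation is an assumption of the sketch, not something prescribed by the paper):

def update_sparse_iterate(weights, new_vertex, gamma):
    # Apply x <- (1 - gamma) * x + gamma * v while storing x explicitly as a
    # convex combination {vertex: weight}. Each call adds at most one new vertex,
    # so after t calls the iterate is t-sparse w.r.t. the decision set.
    for v in weights:
        weights[v] *= (1.0 - gamma)
    key = tuple(new_vertex)
    weights[key] = weights.get(key, 0.0) + gamma
    return weights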

Algorithm and Analysis

Algorithm 1 Online Frank-Wolfe
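
To make the template concrete, here is a minimal Python sketch of an online Frank-Wolfe update loop, reconstructed from the quantities that appear in Theorem 3.1 below (the aggregate objective $F_t$, its minimizer, and the step exponent $a$). It is a sketch under my own assumptions rather than a verbatim transcription of Algorithm 1: in particular, I take $F_t$ to be the running average of the observed losses, while the paper may add smoothing or regularization depending on the setting, and the oracle names are mine.

import numpy as np

def online_frank_wolfe(linear_opt, grad_oracles, x1, a):
    # linear_opt(g):   returns argmin_{x in K} <g, x>  -- the only access to K
    # grad_oracles:    iterable of functions; the t-th function returns grad f_t(x),
    #                  treated as revealed only after x_t has been played
    # x1:              initial point in K
    # a:               step-size exponent (Theorem 3.1 sets a = d - b)
    x = np.asarray(x1, dtype=float)
    seen = []                                   # gradient oracles of the losses observed so far
    iterates = [x.copy()]
    for t, grad_f in enumerate(grad_oracles, start=1):
        seen.append(grad_f)                     # play x_t, then observe f_t
        grad_Ft = sum(g(x) for g in seen) / t   # grad F_t(x_t), with F_t = (1/t) sum_{tau<=t} f_tau (my assumption)
        v = linear_opt(grad_Ft)                 # single linear optimization step instead of a projection
        gamma = t ** (-a)                       # step size t^{-a}
        x = (1.0 - gamma) * x + gamma * v       # convex combination, so x stays in K
        iterates.append(x.copy())
    return iterates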

Theorem 3.1

Assume that for $t = 1,2,...,T$, the function $f_t$ is L-Lipschitz, $Bt^{-b}$-smooth for some constants $b \in [-1,1/2]$ and $B \geq 0$, and $St^{-s}$-strongly convex for some constants $s \in [0,1)$ and $S \geq 0$. Then in Algorithm 1, for all $t > 1$, we have 
$$
\Delta_t \leq Ct^{-d},
$$
for both the following values of C and d:
$$
(C,d) = (\max\{9D^2B, 3LD\}, \frac{1+b}{2})
$$
and
$$
(C,d) = (\max\{9D^2B, 36L^2/S, 3LD\}, \frac{2+2b-s}{3}).
$$

Here, $F_t$ is the aggregate objective that Algorithm 1 builds from $f_1,\dots,f_t$, $x_t^* \in \arg\min_{x \in \mathcal{K}} F_t(x)$ is its minimizer, $\Delta_t = F_t(x_t) - F_t(x_t^*)$ is the optimization error of the iterate $x_t$, and $D$ denotes the diameter of $\mathcal{K}$. In either case, the bound is obtained by setting $a = d - b$ in Algorithm 1.
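
As a quick instantiation of the first pair $(C,d)$: for losses that are $B$-smooth with a fixed smoothness constant (i.e. $b = 0$) and with no strong convexity assumed ($S = 0$), taking $a = d - b = 1/2$ yields
$$
\Delta_t \leq \max\{9D^2B,\, 3LD\}\, t^{-1/2},
$$
so the error on the aggregate objective decays at the rate $t^{-1/2}$.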

Regret Bound

Similar to what we covered in class, the authors discuss and prove regret bounds for Algorithm 1. The discussion is divided into the smooth stochastic case and the non-smooth stochastic case; in both cases the algorithm attains a sublinear regret bound while replacing each projection with a single linear optimization step.
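
As a usage illustration of the online_frank_wolfe sketch above (entirely my own toy setup, not an experiment from the paper), one can feed it stochastic least-squares losses over the unit $\ell_1$ ball, where the linear optimization step simply returns a signed coordinate vertex; the step exponent $a = 0.5$ is chosen only for this illustration.

import numpy as np

rng = np.random.default_rng(0)
n, T = 20, 500
x_target = rng.standard_normal(n)
x_target /= np.abs(x_target).sum()              # ground-truth point inside the l1 ball

def linear_opt_l1(g):
    # argmin_{||x||_1 <= 1} <g, x> is attained at a signed standard basis vector
    i = int(np.argmax(np.abs(g)))
    v = np.zeros(n)
    v[i] = -np.sign(g[i])
    return v

def make_grad_oracle():
    a_t = rng.standard_normal(n)
    y_t = a_t @ x_target + 0.1 * rng.standard_normal()
    return lambda x: (a_t @ x - y_t) * a_t      # gradient of f_t(x) = 0.5 * (a_t @ x - y_t)^2

oracles = [make_grad_oracle() for _ in range(T)]
iterates = online_frank_wolfe(linear_opt_l1, oracles, np.zeros(n), a=0.5)
print(np.linalg.norm(iterates[-1] - x_target))  # distance of the final iterate to the target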




