Online Optimization Specialization (2/4): Review of 'Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization'

Specialization Introduction

This specialization covers four selected foundational papers in online optimization. In each blog post, I discuss one paper, aiming to include
  1. A brief introduction to and summary of the paper
  2. Key takeaways from the paper
Note that all of the discussion and summaries in this specialization are based on the reviewed papers. None of the algorithms or theorems are my own.

Summary

Paper Details

Title: Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization
Author(s): Alexander Rakhlin, Ohad Shamir, Karthik Sridharan

Abstract

Stochastic gradient descent (SGD) is a simple and popular method to solve stochastic optimization problems which arise in machine learning. For strongly convex problems, its convergence rate was known to be $\mathcal{O}(\log(T)/T)$, by running SGD for $T$ iterations and returning the average point. However, recent results showed that using a different algorithm, one can get an optimal $\mathcal{O}(1/T)$ rate. This might lead one to believe that standard SGD is suboptimal, and maybe should even be replaced as a method of choice. In this paper, we investigate the optimality of SGD in a stochastic setting. We show that for smooth problems, the algorithm attains the optimal $\mathcal{O}(1/T)$ rate. However, for non-smooth problems, the convergence rate with averaging might really be $\Omega(\log(T)/T)$, and this is not just an artifact of the analysis. On the flip side, we show that a simple modification of the averaging step suffices to recover the $\mathcal{O}(1/T)$ rate, and no other change of the algorithm is necessary. We also present experimental results which support our findings, and point out open problems.

Contributions and Organization of the Paper

This paper studies the convergence rate of SGD for stochastic strongly convex problems. Their contributions, following the organization of the paper, can be summarized as follows:
  • First, in section 3, they extend known results to show that if $F$ is not only strongly convex but also smooth (with respect to the optimum), then SGD, with and without averaging, achieves the optimal $\mathcal{O}(1/T)$ convergence rate.
  • In section 4, they then show that for non-smooth $F$, there are cases where the convergence rate of SGD with averaging is $\Omega(\log(T)/T)$. In other words, the $\Omega(\log(T)/T)$ bound for general strongly convex problems is real, and not just an artifact of the currently-known analysis.
  • However, in sections 5 & 6, they show that one can recover the optimal $\mathcal{O}(1/T)$ convergence rate (in expectation and in high probability) by a simple modification of the averaging step: instead of averaging all $T$ points, they average only the last $\alpha T$ points, where $\alpha \in (0,1)$ is arbitrary. Thus, to obtain an optimal rate, one does not need to use an algorithm significantly different from SGD, such as those discussed earlier.
  • In section 7, they perform an empirical study on both artificial and real-world data, which supports their findings.

Definitions

Stochastic Gradient Descent Algorithm

The SGD algorithm is parameterized by step sizes $\eta_1, ..., \eta_T$, and is defined as follows:
  1. Initialize $w_1 \in \mathcal{W}$ arbitrarily (or randomly)
  2. For $t = 1,...,T$: 
    • Query the stochastic gradient oracle at $w_t$ to get a random $\hat{g}_t$ such that $\mathbb{E} [\hat{g}_t] = g_t$ is a subgradient of $F$ at $w_t$.
    • Let $w_{t+1} = \Pi_\mathcal{W}(w_t - \eta_t \hat{g}_t)$, where $\Pi_\mathcal{W}$ is the projection operator on $\mathcal{W}$.
This algorithm produces a sequence of points $w_1,...,w_T$. To obtain a single point, SGD with averaging returns the average point
$$
\bar{w}_T = \frac{1}{T}(w_1+...+w_T).
$$
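
To make the procedure concrete, below is a minimal Python sketch of projected SGD with averaging. The Euclidean-ball constraint set, the noise-free interface, and all names here are my own illustrative choices, not from the paper.

```python
import numpy as np

def project_onto_ball(w, radius=1.0):
    """Euclidean projection onto the convex set {w : ||w|| <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w

def sgd_with_averaging(grad_oracle, dim, T, eta):
    """Projected SGD for T iterations, returning the average iterate.

    grad_oracle(w, t): unbiased estimate of a subgradient of F at w.
    eta(t): step size for iteration t (1-indexed).
    """
    w = np.zeros(dim)              # w_1, an arbitrary feasible point
    w_sum = np.zeros(dim)
    for t in range(1, T + 1):
        w_sum += w                 # accumulate w_t before the update
        g_hat = grad_oracle(w, t)
        w = project_onto_ball(w - eta(t) * g_hat)
    return w_sum / T               # \bar{w}_T = (w_1 + ... + w_T) / T
```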

Comparing SGD and Gradient Descent Algorithm

This part is adapted from an answer by @Sociopath on StackExchange.

In both gradient descent (GD) and stochastic gradient descent (SGD), you iteratively update a set of parameters to minimize an error function.

In GD, you have to run through ALL the samples in your training set to perform a single parameter update in a given iteration; in SGD, on the other hand, you use only ONE training sample, or a small SUBSET of samples, to perform that update. If you use a subset, the method is called minibatch stochastic gradient descent.

Thus, if the number of training samples is large, gradient descent may take too long, because every parameter update requires a pass over the complete training set. SGD, by contrast, is faster per iteration because it uses only one training sample, and it starts improving from the very first sample.

SGD often converges much faster than GD, but the error function is not as well minimized as in GD. In most cases, the close approximation that SGD produces for the parameter values is enough, because the iterates reach a neighborhood of the optimal values and keep oscillating there.
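
The contrast is easy to see in code. Here is a sketch for a least-squares loss, where `X`, `y`, and the learning rate `lr` are placeholders for your own data and tuning:

```python
import numpy as np

def gd_step(w, X, y, lr):
    """One GD update: uses the gradient over ALL n training samples."""
    grad = X.T @ (X @ w - y) / len(y)   # gradient of 0.5*mean((Xw - y)^2)
    return w - lr * grad

def sgd_step(w, X, y, lr, rng):
    """One SGD update: uses the gradient of ONE randomly chosen sample."""
    i = rng.integers(len(y))
    grad = (X[i] @ w - y[i]) * X[i]     # gradient of 0.5*(x_i . w - y_i)^2
    return w - lr * grad
```

Replacing the single index `i` with a random batch of indices gives the minibatch variant.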

$\lambda$-Strongly Convex

A function $F$ is $\lambda$-strongly convex if, for all $w, w' \in \mathcal{W}$ and any subgradient $g$ of $F$ at $w$,

$$
F(w') \geq F(w) + \langle g, w'-w \rangle + \frac{\lambda}{2} \| w' - w \|^2.
$$

$\mu$-Smooth

A function $F$ is $\mu$-smooth with respect to $w^*$ if for all $w \in \mathcal{W}$,

$$
F(w) - F(w^*) \leq \frac{\mu}{2} \| w - w^* \|^2.
$$
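
As a quick example of both definitions (my own, not from the paper), the quadratic $F(w) = \frac{\lambda}{2} \| w - w^* \|^2$ satisfies the strong convexity inequality with equality, so it is $\lambda$-strongly convex, and since
$$
F(w) - F(w^*) = \frac{\lambda}{2} \| w - w^* \|^2,
$$
it is also $\mu$-smooth with respect to $w^*$ with $\mu = \lambda$. Note that this notion of smoothness is only a condition at $w^*$: for instance, any differentiable $F$ with a $\mu$-Lipschitz gradient and $\nabla F(w^*) = 0$ satisfies it.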

Stochastic Gradient Oracle

A stochastic gradient oracle for a differentiable function $f: \mathbb{R}^n \to \mathbb{R}$ takes as input a point $x \in \mathbb{R}^n$ and outputs a random vector $a \in \mathbb{R}^n$ such that $\mathbb{E}[a] = \nabla f(x)$, where the expectation is taken with respect to the randomization of the oracle. Sometimes, this expectation condition is made explicit by saying that the oracle is an unbiased estimator of the gradient. [Source]
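
A minimal sketch of such an oracle in Python, assuming the toy objective $f(x) = \frac{1}{2}\|x\|^2$ (so $\nabla f(x) = x$) with additive zero-mean Gaussian noise; both choices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_gradient_oracle(x, sigma=0.1):
    """Unbiased oracle for f(x) = 0.5*||x||^2: E[output] = x = grad f(x)."""
    return x + sigma * rng.standard_normal(x.shape)

# Sanity check: averaging many calls should approach the true gradient.
x = np.ones(3)
print(np.mean([stochastic_gradient_oracle(x) for _ in range(10000)], axis=0))
```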

Theorems

Theorem 2

Suppose $F$ is $\lambda$-strongly convex and $\mu$-smooth with respect to $w^*$ over a convex set $\mathcal{W}$, and that $\mathbb{E}[\| \hat{g}_t \|^2] \leq G^2$. Then if we pick $\eta_t = c/\lambda t$ for some constant $c > 1/2$,
$$
\mathbb{E} [F(\bar{w}_T) - F(w^*)] \leq 2 \max \left\{ \frac{\mu G^2}{\lambda^2}, \frac{4\mu G}{\lambda}, \frac{\mu G}{\lambda} \sqrt{\frac{4c}{2-1/c}} \right\} \frac{1}{T}.
$$

Understanding Theorem 2

Theorem 2 shows that if the function $F(\cdot)$ is both strongly convex and smooth with respect to $w^*$, then SGD with averaging attains the optimal $\mathcal{O}(1/T)$ convergence rate, with an even better dependence on the constant $c$.
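
A rough way to see this rate empirically is to run SGD with $\eta_t = c/\lambda t$ on the quadratic example above (where $\mu = \lambda$) and watch the suboptimality of the average iterate shrink. This is my own toy setup, unconstrained for simplicity, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, c, dim = 1.0, 1.0, 5
w_star = rng.standard_normal(dim)

def F(w):
    return 0.5 * lam * np.linalg.norm(w - w_star) ** 2

for T in [100, 1000, 10000]:
    w, w_sum = np.zeros(dim), np.zeros(dim)
    for t in range(1, T + 1):
        w_sum += w
        g_hat = lam * (w - w_star) + 0.1 * rng.standard_normal(dim)  # noisy gradient
        w -= (c / (lam * t)) * g_hat
    print(T, F(w_sum / T))  # suboptimality should decay roughly like 1/T
```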

Theorem 4

Consider the strongly convex stochastic optimization problem constructed in the paper. If SGD is initialized at any point $w_1$ with $w_{1,1} \geq 0$ and run with $\eta_t = c/t$, then for any $T \geq T_0 + 2$, where $T_0 = \max \{ 2, 6c+1 \}$, we have
$$
\mathbb{E} [F(\bar{w}_T) - F(w^*)] \geq \frac{3c}{16T} \sum_{t=T_0 + 2}^T \frac{1}{t} - \frac{T_0}{T}.
$$

Understanding Theorem 4

When $c$ is considered a constant, the lower bound is $\Omega(\log(T)/T)$. The $\log$ term in the bound follows from the harmonic sum approximation
$$
\sum_{r=1}^n \frac{1}{r} \approx \int_1^n \frac{dx}{x} = \log n.
$$
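
A quick numerical check of this approximation (the values of $n$ are arbitrary):

```python
import math

for n in [10, 100, 1000]:
    harmonic = sum(1 / r for r in range(1, n + 1))
    print(n, harmonic, math.log(n))  # gap approaches Euler's constant ~0.577
```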

Theorem 4 shows that there are strongly convex stochastic optimization problems in Euclidean space in which the convergence rate of SGD with averaging is lower bounded by $\Omega(\log(T)/T)$. This situation can arise in learning problems with a non-smooth loss function, such as the hinge loss.

Theorem 5

Consider SGD with $\alpha$-suffix averaging as described below, and with step sizes $\eta_t = c/\lambda t$, where $c > 1/2$ is a constant. Suppose $F$ is $\lambda$-strongly convex, and that $\mathbb{E}[\| \hat{g}_t \|^2] \leq G^2$ for all $t$. Then for any $T$, it holds that
$$
\mathbb{E} [ F(\bar{w}_T^\alpha) - F(w^*) ] \leq \frac{c' + (\frac{c}{2}+c') \log (\frac{1}{1-\alpha})}{\alpha} \cdot \frac{G^2}{\lambda T},
$$
where $c' = \max \{\frac{2}{c}, \frac{1}{4-2/c}\}$, and the $\alpha$-suffix average is
$$
\bar{w}_T^\alpha = \frac{w_{(1-\alpha)T+1}+...+w_T}{\alpha T}
$$
for some constant $\alpha \in (0,1)$ (assuming $\alpha T$ and $(1-\alpha)T$ are integers).

Understanding Theorem 5

Theorem 5 shows how to get an $\mathcal{O}(1/T)$ rate using a simple modification of the algorithm: instead of averaging all $T$ points, one averages only the last $\alpha T$ points, where $\alpha \in (0,1)$ is arbitrary. Thus, to obtain an optimal rate, one does not need to use an algorithm significantly different from SGD, such as those discussed earlier.
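
Here is a minimal sketch of that modification, reusing the loop structure from the earlier SGD snippet; the projection step is omitted for brevity, and the default $\alpha = 1/2$ is just one valid choice:

```python
import numpy as np

def sgd_suffix_averaging(grad_oracle, dim, T, eta, alpha=0.5):
    """SGD returning the alpha-suffix average: the mean of the last
    alpha*T iterates w_{(1-alpha)T+1}, ..., w_T (alpha*T assumed integral).
    """
    start = int((1 - alpha) * T) + 1   # index of the first averaged iterate
    w = np.zeros(dim)
    suffix_sum = np.zeros(dim)
    for t in range(1, T + 1):
        if t >= start:
            suffix_sum += w            # w currently holds w_t
        g_hat = grad_oracle(w, t)
        w = w - eta(t) * g_hat
    return suffix_sum / (T - start + 1)
```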
