Online Optimization Specialization (3/4): Review of 'Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization'
Specialization Introduction
This specialization covers five selected grounding papers in online optimization. In each blog post, I discuss one paper, aiming to include
- A brief introduction to and summary of the paper
- Key takeaways from the paper
Note that all of the discussion and summaries in this specialization are based on the reviewed papers; none of the algorithms or theorems are my own.
Summary
Paper Detail
Title: Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization
Author(s): Jacob Abernethy, Elad Hazan, Alexander Rakhlin
Abstract
We introduce an efficient algorithm for the problem of online linear optimization in the bandit setting which achieves the optimal $O^*(\sqrt{T})$ regret. The setting is a natural generalization of the stochastic multi-armed bandit problem, and the existence of an efficient optimal algorithm has been posed as an open problem in a number of recent papers. We show how the difficulties encountered by previous approaches are overcome by the use of a self-concordant potential function. Our approach presents a novel connection between online learning and interior point methods.
Key Contribution
- The main contribution of this paper is an efficient algorithm that achieves $O^*(\sqrt{T})$ regret for the online linear optimization problem in the bandit setting.
- It also provides a link between Bregman divergences and self-concordant barriers.
Definitions
Self-concordant function
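The formal definition is not reproduced above, so here is the standard one from the interior-point literature, paraphrased from memory rather than quoted from the paper: a convex, three-times continuously differentiable function $R : \operatorname{int}(\mathcal{K}) \to \mathbb{R}$ is self-concordant if it tends to infinity along any sequence of points approaching the boundary of $\mathcal{K}$ and, for every $x \in \operatorname{int}(\mathcal{K})$ and every direction $h$,

$$\left| \nabla^3 R(x)[h, h, h] \right| \;\le\; 2 \left( \nabla^2 R(x)[h, h] \right)^{3/2}.$$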
Self-concordant Barrier
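Again paraphrasing the standard definition rather than quoting the paper: a self-concordant function $R$ is a $\vartheta$-self-concordant barrier if, in addition, for every $x \in \operatorname{int}(\mathcal{K})$ and every direction $h$,

$$\left| \nabla R(x)^\top h \right| \;\le\; \sqrt{\vartheta}\, \left( \nabla^2 R(x)[h, h] \right)^{1/2}.$$

For example, $R(x) = -\log(1 - \|x\|^2)$ is a $1$-self-concordant barrier for the unit Euclidean ball, and the log-barrier $-\sum_{i=1}^m \log(b_i - a_i^\top x)$ is an $m$-self-concordant barrier for a polytope with $m$ constraints. The parameter $\vartheta$ enters the regret bound of Theorem 1 directly.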
Algorithms
Bandit Setting vs Online Setting
In both settings, the learner chooses a point $x_t$ at round $t$ and then suffers a loss. The main difference is the feedback: in the full-information setting the learner observes the entire loss function $f_t(\cdot)$, while in the bandit setting the learner only observes the single value $f_t(x_t)$ of the point that was played.
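The difference is easiest to see in the linear case. Below is a minimal sketch of the protocol in each feedback model; the learner's strategy and the loss vectors here are placeholders of my own, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 5

for t in range(T):
    x_t = rng.normal(size=n)           # learner's decision (placeholder strategy)
    x_t /= np.linalg.norm(x_t)         # keep the decision inside the unit ball K
    f_t = rng.uniform(-1, 1, size=n)   # adversary's loss vector for this round

    loss = float(f_t @ x_t)            # loss suffered in both settings

    # Full-information feedback: the learner observes the entire loss vector f_t,
    # so it can evaluate f_t @ x for any alternative decision x.
    full_information_feedback = f_t

    # Bandit feedback: the learner observes only the scalar loss of the point it played.
    bandit_feedback = loss
```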
Algorithm 1 Bandit Online Linear Optimization
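The algorithm box itself is not reproduced here, so below is a sketch of Algorithm 1 as I read it, specialized to the unit Euclidean ball with the barrier $R(x) = -\log(1 - \|x\|^2)$. The damped-Newton solver for the FTRL step, the choice of $\mathcal{K}$, and the default $\eta$ are my own illustration choices; the paper sets $\eta$ from $\vartheta$, $n$, and $T$.

```python
import numpy as np

def barrier_grad_hess(x):
    """Gradient and Hessian of R(x) = -log(1 - ||x||^2), a self-concordant
    barrier for the open unit ball (my choice of K, for illustration only)."""
    s = 1.0 - x @ x
    grad = 2.0 * x / s
    hess = 2.0 * np.eye(len(x)) / s + 4.0 * np.outer(x, x) / s**2
    return grad, hess

def ftrl_argmin(c, eta, n, iters=50):
    """Approximately minimize eta * (c @ x) + R(x) with damped Newton steps;
    the paper only requires the argmin, the solver here is a standard choice."""
    x = np.zeros(n)                              # minimizer of R, used as a start
    for _ in range(iters):
        grad_R, hess_R = barrier_grad_hess(x)
        g = eta * c + grad_R
        step = np.linalg.solve(hess_R, g)
        lam = np.sqrt(g @ step)                  # Newton decrement
        x = x - step / (1.0 + lam)               # damped step stays inside the ball
    return x

def bandit_linear_opt(loss_vectors, eta=0.01, seed=0):
    """Sketch of Algorithm 1: perturb x_t along a random eigendirection of the
    barrier's Hessian (the Dikin ellipsoid), play the perturbed point, form a
    one-point unbiased estimate of f_t, and feed the estimates to FTRL."""
    rng = np.random.default_rng(seed)
    n = len(loss_vectors[0])
    cumulative_estimate = np.zeros(n)            # running sum of estimated loss vectors
    x = np.zeros(n)                              # x_1 = argmin_K R(x)
    total_loss = 0.0
    for f_t in loss_vectors:
        _, hess_R = barrier_grad_hess(x)
        eigvals, eigvecs = np.linalg.eigh(hess_R)
        i = rng.integers(n)                      # random eigendirection
        eps = rng.choice([-1.0, 1.0])            # random sign
        y = x + eps * eigvecs[:, i] / np.sqrt(eigvals[i])            # point actually played
        loss = float(f_t @ y)                    # the only quantity the learner observes
        total_loss += loss
        f_hat = n * loss * eps * np.sqrt(eigvals[i]) * eigvecs[:, i]  # unbiased estimate of f_t
        cumulative_estimate += f_hat
        x = ftrl_argmin(cumulative_estimate, eta, n)                  # x_{t+1}
    return total_loss
```

For example, `bandit_linear_opt([np.array([0.5, -0.3, 0.1])] * 1000)` plays against a fixed loss vector; with a small enough $\eta$ its average loss should drift toward that of the best point in the ball.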

Understand Theorem 1
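The theorem statement itself is not reproduced above. As I recall it (constants and the exact condition on $T$ omitted), the guarantee has the following shape, where $\vartheta$ is the barrier parameter and the comparator ranges over a slightly scaled-down copy $\mathcal{K}'$ of $\mathcal{K}$:

$$\mathbb{E}\left[\sum_{t=1}^{T} f_t^\top y_t\right] - \min_{x \in \mathcal{K}'} \sum_{t=1}^{T} f_t^\top x \;\le\; O\!\left(n \sqrt{\vartheta\, T \log T}\right).$$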
The expected regret over the original set $\mathcal{K}$ is within an additive $O(\sqrt{nT})$ term of the guarantee above, so competing against the scaled-down set $\mathcal{K}'$ costs essentially nothing.
Takeaway
- Easy-to-hard approach: A common way for researchers to tackle a hard problem is to start from an easier, well-studied one. In this paper, the authors designed their algorithm by building on the well-studied full-information setting. They relied on a reduction to the full-information setting in one way or another, which is natural because any algorithm that aims for low regret in the bandit setting would necessarily have to achieve low regret given full information.
- The exploration-exploitation trade-off: This trade-off is the primary source of difficulty in obtaining $O(\sqrt{T})$ guarantees on the regret. Roughly, there are two categories of approaches to performing both exploration and exploitation:
- Alternating Explore/Exploit: Flip an $\epsilon$-biased coin each round to determine whether to explore or exploit (a toy sketch of this strategy follows the list).
- Simultaneous Explore/Exploit: Randomize the played point around the algorithm's current prediction so that each round both keeps the instantaneous regret small and produces an unbiased estimate of the loss vector; this is the route taken by Algorithm 1 above.
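As a concrete toy illustration of the alternating strategy (my own, not from the paper): the learner occasionally plays coordinate directions to estimate the loss vector, and otherwise plays the point in the unit ball that looks best under the current estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, eps = 3, 2000, 0.1
f_hidden = np.array([0.3, -0.5, 0.2])    # fixed loss vector, unknown to the learner
coord_sums = np.zeros(n)                 # losses observed on exploration rounds
coord_counts = np.zeros(n)

for t in range(T):
    explore = rng.random() < eps         # flip the eps-biased coin
    if explore:
        i = rng.integers(n)
        x_t = np.eye(n)[i]               # probe one coordinate of f
    else:
        estimate = coord_sums / np.maximum(coord_counts, 1)
        x_t = -estimate / (np.linalg.norm(estimate) + 1e-12)  # best point in the ball for the estimate
    loss = float(f_hidden @ x_t)         # bandit feedback: only this scalar is revealed
    if explore:
        coord_sums[i] += loss
        coord_counts[i] += 1
```

As I understand the paper's discussion, this style of alternation had only led to regret rates worse than $\sqrt{T}$, which is part of why the simultaneous approach of Algorithm 1 is needed.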