Online Optimization Specialization (3/4): Review of 'Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization'
Specialization Introduction
This specialization covers five selected grounding papers in online optimization. In each blog post, I discuss one paper, aiming to include
- A brief introduction to and summary of the paper
- Key takeaways from the paper
Note that all of the discussion and summaries in this specialization are based on the reviewed papers; none of the algorithms or theorems are my own.
Summary
Paper Detail
Title: Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization
Author(s): Jacob Abernethy, Elad Hazan, Alexander Rakhlin
Abstract
We introduce an efficient algorithm for the problem of online linear optimization in the bandit setting which achieves the optimal $O^*(\sqrt{T})$ regret. The setting is a natural generalization of the stochastic multi-armed bandit problem, and the existence of an efficient optimal algorithm has been posed as an open problem in a number of recent papers. We show how the difficulties encountered by previous approaches are overcome by the use of a self-concordant potential function. Our approach presents a novel connection between online learning and interior point methods.
Key Contribution
- The main contribution of this paper is an efficient algorithm that achieves $O^*(\sqrt{T})$ regret for the online linear optimization problem in the bandit setting.
- It also provides a link between Bregman divergences and self-concordant barriers.
Definitions
Self-concordant function
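The formal definition is not reproduced above, so here is the standard one from the interior-point literature, paraphrased from memory rather than quoted from the paper: a convex, three-times continuously differentiable function $R : \operatorname{int}(\mathcal{K}) \to \mathbb{R}$ is self-concordant if it tends to infinity along any sequence of points approaching the boundary of $\mathcal{K}$ and, for every $x \in \operatorname{int}(\mathcal{K})$ and every direction $h$,

$$\left| \nabla^3 R(x)[h, h, h] \right| \;\le\; 2 \left( \nabla^2 R(x)[h, h] \right)^{3/2}.$$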
Self-concordant Barrier
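Again paraphrasing the standard definition rather than quoting the paper: a self-concordant function $R$ is a $\vartheta$-self-concordant barrier if, in addition, for every $x \in \operatorname{int}(\mathcal{K})$ and every direction $h$,

$$\left| \nabla R(x)^\top h \right| \;\le\; \sqrt{\vartheta}\, \left( \nabla^2 R(x)[h, h] \right)^{1/2}.$$

For example, $R(x) = -\log(1 - \|x\|^2)$ is a $1$-self-concordant barrier for the unit Euclidean ball, and the log-barrier $-\sum_{i=1}^m \log(b_i - a_i^\top x)$ is an $m$-self-concordant barrier for a polytope with $m$ constraints. The parameter $\vartheta$ enters the regret bound of Theorem 1 directly.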
Algorithms
Bandit Setting vs Online Setting
In both settings, the learner chooses a point $x_t$ at round $t$ and then suffers a loss. The main difference is the feedback: in the full-information setting the learner observes the entire loss function $f_t(\cdot)$, while in the bandit setting the learner only observes the single value $f_t(x_t)$ of the point that was played.
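The difference is easiest to see in the linear case. Below is a minimal sketch of the protocol in each feedback model; the learner's strategy and the loss vectors here are placeholders of my own, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 5

for t in range(T):
    x_t = rng.normal(size=n)           # learner's decision (placeholder strategy)
    x_t /= np.linalg.norm(x_t)         # keep the decision inside the unit ball K
    f_t = rng.uniform(-1, 1, size=n)   # adversary's loss vector for this round

    loss = float(f_t @ x_t)            # loss suffered in both settings

    # Full-information feedback: the learner observes the entire loss vector f_t,
    # so it can evaluate f_t @ x for any alternative decision x.
    full_information_feedback = f_t

    # Bandit feedback: the learner observes only the scalar loss of the point it played.
    bandit_feedback = loss
```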
Algorithm 1 Bandit Online Linear Optimization
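The algorithm box itself is not reproduced here, so below is a sketch of Algorithm 1 as I read it, specialized to the unit Euclidean ball with the barrier $R(x) = -\log(1 - \|x\|^2)$. The damped-Newton solver for the FTRL step, the choice of $\mathcal{K}$, and the default $\eta$ are my own illustration choices; the paper sets $\eta$ from $\vartheta$, $n$, and $T$.

```python
import numpy as np

def barrier_grad_hess(x):
    """Gradient and Hessian of R(x) = -log(1 - ||x||^2), a self-concordant
    barrier for the open unit ball (my choice of K, for illustration only)."""
    s = 1.0 - x @ x
    grad = 2.0 * x / s
    hess = 2.0 * np.eye(len(x)) / s + 4.0 * np.outer(x, x) / s**2
    return grad, hess

def ftrl_argmin(c, eta, n, iters=50):
    """Approximately minimize eta * (c @ x) + R(x) with damped Newton steps;
    the paper only requires the argmin, the solver here is a standard choice."""
    x = np.zeros(n)                              # minimizer of R, used as a start
    for _ in range(iters):
        grad_R, hess_R = barrier_grad_hess(x)
        g = eta * c + grad_R
        step = np.linalg.solve(hess_R, g)
        lam = np.sqrt(g @ step)                  # Newton decrement
        x = x - step / (1.0 + lam)               # damped step stays inside the ball
    return x

def bandit_linear_opt(loss_vectors, eta=0.01, seed=0):
    """Sketch of Algorithm 1: perturb x_t along a random eigendirection of the
    barrier's Hessian (the Dikin ellipsoid), play the perturbed point, form a
    one-point unbiased estimate of f_t, and feed the estimates to FTRL."""
    rng = np.random.default_rng(seed)
    n = len(loss_vectors[0])
    cumulative_estimate = np.zeros(n)            # running sum of estimated loss vectors
    x = np.zeros(n)                              # x_1 = argmin_K R(x)
    total_loss = 0.0
    for f_t in loss_vectors:
        _, hess_R = barrier_grad_hess(x)
        eigvals, eigvecs = np.linalg.eigh(hess_R)
        i = rng.integers(n)                      # random eigendirection
        eps = rng.choice([-1.0, 1.0])            # random sign
        y = x + eps * eigvecs[:, i] / np.sqrt(eigvals[i])            # point actually played
        loss = float(f_t @ y)                    # the only quantity the learner observes
        total_loss += loss
        f_hat = n * loss * eps * np.sqrt(eigvals[i]) * eigvecs[:, i]  # unbiased estimate of f_t
        cumulative_estimate += f_hat
        x = ftrl_argmin(cumulative_estimate, eta, n)                  # x_{t+1}
    return total_loss
```

For example, `bandit_linear_opt([np.array([0.5, -0.3, 0.1])] * 1000)` plays against a fixed loss vector; with a small enough $\eta$ its average loss should drift toward that of the best point in the ball.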

Understand Theorem 1
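The theorem statement itself is not reproduced above. As I recall it (constants and the exact condition on $T$ omitted), the guarantee has the following shape, where $\vartheta$ is the barrier parameter and the comparator ranges over a slightly scaled-down copy $\mathcal{K}'$ of $\mathcal{K}$:

$$\mathbb{E}\left[\sum_{t=1}^{T} f_t^\top y_t\right] - \min_{x \in \mathcal{K}'} \sum_{t=1}^{T} f_t^\top x \;\le\; O\!\left(n \sqrt{\vartheta\, T \log T}\right).$$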
The expected regret over the original set $\mathcal{K}$ is within an additive $O(\sqrt{nT})$ term of the guarantee above, so competing against the scaled-down set $\mathcal{K}'$ costs essentially nothing.
Takeaway
- Easy-to-hard approach: A common way for researchers to tackle a hard problem is to start from an easier, well-studied one. In this paper, the authors designed their algorithm by building on the well-studied full-information setting. They relied on a reduction to the full-information setting in one way or another, which is natural because any algorithm that aims for low regret in the bandit setting would necessarily have to achieve low regret given full information.
- The exploration-exploitation trade-off: This trade-off is the primary source of difficulty in obtaining $O(\sqrt{T})$ guarantees on the regret. Roughly, there are two categories of approaches to performing both exploration and exploitation:
- Alternating Explore/Exploit: Flip an $\epsilon$-biased coin each round to determine whether to explore or exploit (a toy sketch of this strategy follows the list).
- Simultaneous Explore/Exploit: Randomize the played point around the algorithm's current prediction so that each round both keeps the instantaneous regret small and produces an unbiased estimate of the loss vector; this is the route taken by Algorithm 1 above.
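As a concrete toy illustration of the alternating strategy (my own, not from the paper): the learner occasionally plays coordinate directions to estimate the loss vector, and otherwise plays the point in the unit ball that looks best under the current estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, eps = 3, 2000, 0.1
f_hidden = np.array([0.3, -0.5, 0.2])    # fixed loss vector, unknown to the learner
coord_sums = np.zeros(n)                 # losses observed on exploration rounds
coord_counts = np.zeros(n)

for t in range(T):
    explore = rng.random() < eps         # flip the eps-biased coin
    if explore:
        i = rng.integers(n)
        x_t = np.eye(n)[i]               # probe one coordinate of f
    else:
        estimate = coord_sums / np.maximum(coord_counts, 1)
        x_t = -estimate / (np.linalg.norm(estimate) + 1e-12)  # best point in the ball for the estimate
    loss = float(f_hidden @ x_t)         # bandit feedback: only this scalar is revealed
    if explore:
        coord_sums[i] += loss
        coord_counts[i] += 1
```

As I understand the paper's discussion, this style of alternation had only led to regret rates worse than $\sqrt{T}$, which is part of why the simultaneous approach of Algorithm 1 is needed.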