Tag
This paper studies regret minimization in Markovian bandits with non-observable states and constrained decision epochs, introducing a generalization called self-degrading Markovian bandits. The authors propose the UCB-NOM algorithm that achieves nearly logarithmic regret and provide bounds that do not depend on the number of states.