Comprehensive exploration, i.e., collectively exploring all arms while still exploiting the optimal arm, is necessary in $K$-armed stochastic bandits, especially in large arm spaces such as rescue and surveillance operations, where the agent is expected to explore the search space more extensively. In Multi-Armed Bandit (MAB) settings where the number of arms $K$ is relatively large, many exploration algorithms, $\epsilon$-greedy for example, rely on random, undirected exploration, so sub-optimal actions may be chosen frequently and regret grows linearly. In this paper, we study the theoretical aspects of the MaxExp-UCB algorithm, which promotes comprehensive exploration while still achieving sub-linear regret growth. We introduce a normalized exploration term across the bandit arms, $\sqrt{\frac{2\ln(t)}{N_i(t)}} \cdot \frac{\delta(K-1)}{\sum_{i \neq i^\star} N_i(t)}$, and show that, for larger values of $K$, the term $\frac{\delta(K-1)}{\sum_{i \neq i^\star} N_i(t)}$ becomes smaller, promoting further exploration and ensuring a comprehensive search across all available arms in our MAB setting. We also conduct a theoretical study of the worst-case upper bound of the term $\frac{\delta(K-1)}{\sum_{i \neq i^\star} N_i(t)}$ and prove that it is at most $\sqrt{\frac{2\,\delta(K-1)\cdot t}{\Delta \cdot T} \cdot \frac{\ln(t)}{N_i(t)}}$. Finally, using this worst-case bound, we derive and analyze the pseudo-regret incurred by adopting comprehensive exploration and show that it grows sub-linearly. From this we conclude that our regret bound is close to the standard UCB regret bound, indicating the effectiveness of MaxExp-UCB in making near-optimal decisions while promoting comprehensive exploration across the $K$ possible arms in the MAB setting.
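To make the index concrete, the following is a minimal Python sketch of arm selection under the normalized exploration term above, assuming the index reads as $\bar{\mu}_i + \sqrt{\frac{2\ln(t)}{N_i(t)}} \cdot \frac{\delta(K-1)}{\sum_{j \neq i^\star} N_j(t)}$. The function name `maxexp_ucb_pick`, the use of the current empirical best arm as a stand-in for the unknown optimal arm $i^\star$, and the pull-each-arm-once initialization are illustrative assumptions, not the paper's specification.

```python
import math

def maxexp_ucb_pick(means, counts, t, delta):
    """Sketch of MaxExp-UCB-style arm selection (an illustrative reading of the abstract).

    Index of arm i:
        means[i] + sqrt(2 ln t / N_i(t)) * delta * (K - 1) / sum_{j != i_star} N_j(t)
    Assumes every arm has been pulled at least once (counts[i] >= 1), so the
    confidence bonus and the normalizer are well defined.
    """
    K = len(means)
    # Stand-in for the unknown optimal arm i*: the current empirical best (assumption).
    i_star = max(range(K), key=lambda i: means[i])
    # Normalizer: total pulls over all arms other than i*.
    norm = sum(counts[j] for j in range(K) if j != i_star)

    def index(i):
        bonus = math.sqrt(2.0 * math.log(t) / counts[i]) * delta * (K - 1) / norm
        return means[i] + bonus

    return max(range(K), key=index)

# Example: 4 arms after an initialization round, at step t = 5.
arm = maxexp_ucb_pick(means=[0.2, 0.5, 0.4, 0.1], counts=[1, 1, 1, 2], t=5, delta=0.1)
```

Note that, consistent with the claim above, the factor $\frac{\delta(K-1)}{\sum_{j \neq i^\star} N_j(t)}$ shrinks as pulls accumulate on the non-optimal arms, which normalizes the exploration bonus across the arm set.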