Déclinaisons de bandits et leurs applications
|Advisor:||Gagné, Christian; Pineau, Joelle|
|Abstract:||This thesis deals with various variants of the bandits problem, wihch corresponds to a simplified instance of a RL problem with emphasis on the exploration-exploitation trade-off. More specifically, the focus is on three variants: contextual, structured, and multi-objective bandits. In the first, an agent searches for the optimal action depending on a given context. In the second, an agent searches for the optimal action in a potentially large space characterized by a similarity metric. In the latter, an agent searches for the optimal trade-off on a Pareto front according to a non-observable preference function. The thesis introduces algorithms adapted to each of these variants, whose performances are supported by theoretical guarantees and/or empirical experiments. These bandit variants provide a framework for two real-world applications with high potential impact: 1) adaptive treatment allocation for the discovery of personalized cancer treatment strategies; and 2) online optimization of microscopic imaging parameters for the efficient acquisition of useful images. The thesis therefore offers both algorithmic, theoretical, and applicative contributions. An adaptation of the BESA algorithm, GP BESA, is proposed for the problem of contextual bandits. Its potential is highlighted by simulation experiments, which motivated the deployment of the strategy in a wet lab experiment on real animals. Promising results show that GP BESA is able to extend the longevity of mice with cancer and thus significantly increase the amount of data collected on subjects. An adaptation of the TS algorithm, Kernel TS, is proposed for the problem of structured bandits in RKHS. A theoretical analysis allows to obtain convergence guarantees on the cumulative pseudo-regret. Concentration results for the regression with variable regularization as well as a procedure for adaptive tuning of the regularization based on the empirical estimation of the noise variance are also introduced. These contributions make it possible to lift the typical assumption on the a priori knowledge of the noise variance in streaming kernel regression. Numerical results illustrate the potential of these tools. Empirical experiments also illustrate the performance of Kernel TS and raise interesting questions about the optimality of theoretical intuitions. A new variant of multi-objective bandits, generalizing the literature, is also proposed. More specifically, the new framework considers that the preference articulation between the objectives comes from a nonobservable function, typically a user (expert), and suggests integrating this expert into the learning loop. The concept of preference radius is then introduced to evaluate the robustness of the expert’s preference function to errors in the estimation of the environment. A variant of the TS algorithm, TS-MVN, is introduced and analyzed. Empirical experiments support the theoreitcal results and provide a preliminary investigation of questions about the presence of an expert in the learning loop. Put together, structured and multi-objective bandits approaches are then used to tackle the online STED imaging parameters optimization problem. Experimental results on a real microscopy setting and with real neural samples show that the proposed technique makes it possible to significantly accelerate the process of parameters characterization and facilitate the acquisition of images relevant to experts in neuroscience.|
|Document Type:||Thèse de doctorat|
|Open Access Date:||24 April 2018|
|Collection:||Thèses et mémoires|
All documents in CorpusUL are protected by Copyright Act of Canada.