IIIT Hyderabad Publications
Title: Action Selection for Composable Modular Deep Reinforcement Learning
Author: Vaibhav Gupta
Date: 2021-02-13
Report no: IIIT/TH/2021/13
Advisor: Praveen Paruchuri

Abstract:
Reinforcement Learning (RL) has proved its effectiveness by providing a way to model an agent's behavior based on environmental feedback. Conventional RL employs a single monolithic agent that learns to solve the entire complex task on its own. In spite of its evident success, this approach has some limitations: (i) the joint state space for the complete task grows exponentially and often becomes intractable for standard RL techniques, and (ii) a single monolithic agent is unable to exploit the problem structure. Fortunately, complex domains are often composed of simpler subproblems that might be easier to learn. Modular reinforcement learning (MRL) is primarily useful for leveraging such inherent structure in the problem domain. In MRL, a complex decision-making problem is decomposed into multiple modules that solve different components of the original problem. After such decomposition, recomposing the modules' individual knowledge into a global policy remains one of the main challenges in MRL. Often, the subproblems addressed by different modules have conflicting goals and incompatible reward scales. A composable decision-making architecture requires that even modules authored separately, with possibly misaligned reward scales, can be combined coherently. An arbitrator that coordinates the individual modules should take each module's action preferences into account to learn effective global action selection across all modules. We present a novel framework called GRACIAS that assigns fine-grained importance to the different modules based on their relevance in a given state, and enables composable decision making based on modern deep RL methods such as deep deterministic policy gradient (DDPG) and deep Q-learning. We provide insights into the convergence properties of GRACIAS by exploiting connections to two-timescale stochastic approximation. We show that our approach overcomes the major limitations of existing algorithms and that previous MRL algorithms are special cases of our framework. We experimentally demonstrate on several standard MRL domains that our approach performs significantly better than previous MRL methods and is highly robust to incompatible reward scales. Our framework extends MRL to complex Atari games such as Qbert, and exhibits a better learning curve than conventional RL algorithms.

Full thesis: pdf
Centre for Others
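As a rough illustration of the arbitration idea described in the abstract (not the thesis's actual GRACIAS implementation, whose details are in the full text), the sketch below combines per-module Q-values using state-dependent weights produced by an arbitrator network. The class name ArbitratedQ, the network sizes, and the softmax weighting are all illustrative assumptions.

    # Illustrative sketch of state-dependent arbitration over modules;
    # this is an assumption-laden toy, not the GRACIAS algorithm itself.
    import torch
    import torch.nn as nn

    class ArbitratedQ(nn.Module):
        """Combines per-module Q-values with learned, state-dependent weights."""

        def __init__(self, state_dim, num_actions, num_modules, hidden=64):
            super().__init__()
            # One small Q-network per module; in MRL each module would be
            # trained against its own sub-task reward.
            self.module_qs = nn.ModuleList(
                nn.Sequential(
                    nn.Linear(state_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, num_actions),
                )
                for _ in range(num_modules)
            )
            # Arbitrator: maps the state to one importance weight per module.
            self.arbitrator = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, num_modules),
            )

        def forward(self, state):
            # Per-module action values, shape (batch, num_modules, num_actions).
            q = torch.stack([m(state) for m in self.module_qs], dim=1)
            # State-dependent module weights; softmax keeps them positive and
            # normalized, which damps the effect of mismatched reward scales.
            w = torch.softmax(self.arbitrator(state), dim=-1).unsqueeze(-1)
            # Global action values as a weighted combination of module values.
            return (w * q).sum(dim=1)

    # Greedy global action selection from the combined values:
    net = ArbitratedQ(state_dim=8, num_actions=4, num_modules=3)
    state = torch.randn(1, 8)
    action = net(state).argmax(dim=-1)

The softmax normalization here is just one simple way to make module importances comparable across reward scales; in a two-timescale scheme of the kind the abstract alludes to, the arbitrator would typically be updated on a slower timescale than the module Q-networks.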