IIIT Hyderabad Publications
Richer Knowledge Transfer in Teacher Student Framework using State Categorization and Advice Replay

Author: Daksh Anand
Date: 2021-04-21
Report no: IIIT/TH/2021/39
Advisor: Praveen Paruchuri

Abstract

Reinforcement learning (RL) is an area of machine learning in which an agent learns to solve sequential decision-making problems by performing actions, receiving feedback, and updating its knowledge of the environment. Although there have been many successful RL applications, scaling up to large state spaces remains a challenge. The teacher-student framework aims to improve the sample efficiency of conventional RL algorithms by deploying an advising mechanism in which a teacher helps a student by guiding its exploration. Prior work in this field has considered an advising mechanism where the teacher advises the student on the optimal action to take in a given state. Although such guided exploration helps the student converge faster, real-world teachers typically do not limit their guidance to the best possible action in a particular situation: they can leverage their domain expertise further to provide the student with richer, more informative signals about the environment and its state space. Using this insight, we propose to extend the current advising framework so that the teacher provides not only the optimal action but also a qualitative assessment of the state. We introduce a novel architecture, the Advice Replay Memory (ARM), to effectively reuse the advice provided by the teacher. We demonstrate the robustness of our approach through experiments on multiple Atari 2600 games using a fixed set of hyper-parameters. Additionally, we show that a student taking help even from a sub-optimal teacher can achieve significant performance boosts and eventually outperform the teacher.
Our approach outperforms the baselines even when provided with comparatively suboptimal teachers and an advising budget that is smaller by orders of magnitude. We also present an interesting "non-interactive learning" setting in which the teacher composes a batch of advice tuples and provides it to the student at the beginning of its learning period. From then on, even if the teacher goes offline or pays no further attention, the student can still gain reasonable performance boosts. The main highlights of this thesis are the following: (a) we supplement the student's knowledge by providing the state category as advice; (b) we introduce an Advice Replay Memory so that the student can effectively reuse the teacher's advice throughout its learning process; (c) we enable the student to achieve a significant performance boost even with a coarse state categorization; (d) we enable the student to eventually outperform the teacher advising it.

Full thesis: pdf
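To make the advising mechanism concrete, the following is a minimal sketch of an Advice Replay Memory as the abstract describes it: a buffer of advice tuples, each pairing a state with the teacher's suggested action and a qualitative state category, which the student can sample from repeatedly during learning. All class, field, and category names here are illustrative assumptions, not the thesis's actual implementation.

```python
import random
from collections import namedtuple

# Hypothetical advice tuple: the teacher's suggested action plus a
# qualitative state category (field names are assumptions).
Advice = namedtuple("Advice", ["state", "action", "category"])

class AdviceReplayMemory:
    """Fixed-capacity buffer of teacher advice, sampled repeatedly
    so each piece of advice is reused throughout the student's learning."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []

    def add(self, state, action, category):
        # Drop the oldest advice once the budgeted capacity is reached.
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
        self.buffer.append(Advice(state, action, category))

    def sample(self, batch_size):
        # Draw a random batch of stored advice, capped at the buffer size.
        k = min(batch_size, len(self.buffer))
        return random.sample(self.buffer, k)

# In the "non-interactive" setting, the teacher fills the memory once at
# the start; the student then draws advice batches with no further contact.
arm = AdviceReplayMemory(capacity=100)
arm.add(state=(0, 1), action=2, category="good")
arm.add(state=(3, 4), action=0, category="bad")
batch = arm.sample(2)
```

Because the advice lives in a replay buffer rather than being consumed once, a small advising budget can influence many training updates, which is consistent with the abstract's claim of performance gains from budgets smaller by orders of magnitude.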