ETRI Knowledge Sharing Platform


An Efficient Policy Improvement in Human Interactive Learning Using Entropy
Authors
Sung-Yun Park, Dae-Wook Kim, Sang-Kwang Lee, Seong-Il Yang
Issue Date
2021-10
Citation
International Conference on Information and Communication Technology Convergence (ICTC) 2021, pp.283-286
Publisher
IEEE
Language
English
Type
Conference Paper
DOI
https://dx.doi.org/10.1109/ICTC52510.2021.9620856
Abstract
Human knowledge can be used in reinforcement learning (RL) to reduce the time a learning agent needs to achieve its goal. The TAMER (Training an Agent Manually via Evaluative Reinforcements) algorithm allows a human to provide a reward to an autonomous agent through a manual interface while watching the agent perform actions. Because the agent's policy is updated from human rewards, it approximates how a human trainer rewards the agent. For the policy update, events that occurred during learning are selected; in selecting them, the temporal distance from each event to the human reward is considered, so only events that occurred within a certain time interval before the human trainer gives a reward are selected. However, this approach, which considers only the time factor, demands a large number of human rewards, and such a costly policy update exhausts the human trainer. Therefore, we propose a new event-selection method that considers the entropy value over the distribution of Q-values in addition to the time factor. In our proposed method, events are reused for the policy update despite a long temporal distance from the human reward, provided that the corresponding human reward is negative and the entropy value over the distribution of Q-values is low. To compare the effectiveness of the proposed method with classic TAMER, we ran an experiment with the policy initialized to incorrect weights. The results show that the TAMER algorithm with our proposed event selection improves the policy efficiently.
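The event-selection rule described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the softmax mapping from Q-values to an action distribution, and the window and entropy thresholds are all assumptions for the sake of the example.

```python
import math

def entropy_of_q(q_values, temperature=1.0):
    """Shannon entropy of a softmax distribution over Q-values.

    A low entropy means the agent strongly prefers one action, which the
    paper's method combines with a negative human reward to reuse old events.
    """
    m = max(q / temperature for q in q_values)          # for numerical stability
    exps = [math.exp(q / temperature - m) for q in q_values]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_events(events, reward_time, reward_value,
                  window=2.0, entropy_threshold=0.5):
    """Select events to credit with a human reward.

    events: list of (timestamp, q_values) pairs observed during learning.
    An event is selected if it falls within `window` seconds before the
    reward (the classic time-only rule), or -- per the proposed extension --
    if the reward is negative and the entropy over that event's Q-value
    distribution is low, even at a long temporal distance.
    """
    selected = []
    for t, q_values in events:
        within_window = 0.0 <= reward_time - t <= window
        low_entropy = (reward_value < 0
                       and entropy_of_q(q_values) < entropy_threshold)
        if within_window or low_entropy:
            selected.append((t, q_values))
    return selected
```

With a negative reward, a confident (low-entropy) event far outside the time window is still selected; with a positive reward, only events inside the window are kept.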
KSP Keywords
Entropy value, Event selection, Human knowledge, Interactive Learning, Policy update, Reinforcement Learning (RL), Selection method, Time factor, Time interval, autonomous agent, learning agent