Please use this identifier to cite or link to this item:
http://bura.brunel.ac.uk/handle/2438/26848
Title: | Decomposing Synthesized Strategies for Reactive Multi-agent Reinforcement Learning |
Authors: | Zhu, C; Zhu, J; Cai, Y; Wang, F |
Keywords: | reinforcement learning;linear temporal logic;GR(1);reward shaping |
Issue Date: | 27-Jun-2023 |
Publisher: | Springer |
Citation: | Zhu, C. et al. (2023) ‘Decomposing Synthesized Strategies for Reactive Multi-agent Reinforcement Learning’ in David, C. and Sun, M. (eds) Theoretical Aspects of Software Engineering. 17th International Symposium, TASE 2023. (Lecture Notes in Computer Science, Vol. 13931). Cham, Switzerland: Springer, pp. 59–76. doi: 10.1007/978-3-031-35257-7_4. |
Abstract: | Multi-Agent Reinforcement Learning (MARL) has been used to solve sequential decision problems by a collection of intelligent agents interacting in a shared environment. However, the design complexity of MARL strategies increases with the complexity of the task specifications. In addition, current MARL approaches suffer from slow convergence and reward sparsity when dealing with multi-task specifications. Linear temporal logic works as one of the software engineering practices to describe non-Markovian task specifications, whose synthesized strategies can be used as a priori knowledge to train the multi-agents to interact with the environment more efficiently. In this paper, we consider multi-agents that react to each other with a high-level reactive temporal logic specification called Generalized Reactivity of rank 1 (GR(1)). We first decompose the synthesized strategy of GR(1) into a set of potential-based reward machines for individual agents. We prove that the parallel composition of the reward machines forward simulates the original reward machine, which satisfies the GR(1) specification. We then extend the Markov Decision Process (MDP) with the synchronized reward machines. A value-iteration-based approach is developed to compute the potential values of the reward machine based on the strategy structure. We also propose a decentralized Q-learning algorithm to train the multi-agents with the extended MDP. Experiments on multi-agent learning under different reactive temporal logic specifications demonstrate the effectiveness of the proposed method, showing a superior learning curve and optimal rewards. |
URI: | https://bura.brunel.ac.uk/handle/2438/26848 |
DOI: | https://doi.org/10.1007/978-3-031-35257-7_4 |
ISBN: | 978-3-031-35256-0 (hbk); 978-3-031-35257-7 (ebk) |
ISSN: | 0302-9743 |
Other Identifiers: | ORCID iDs: Chenyang Zhu https://orcid.org/0000-0002-2145-0559; Fang Wang https://orcid.org/0000-0003-1987-9150. |
Appears in Collections: | Dept of Computer Science Embargoed Research Papers |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
FullText.pdf | Embargoed until 27 June 2024 | 1.34 MB | Adobe PDF |
Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.