Please use this identifier to cite or link to this item:
http://bura.brunel.ac.uk/handle/2438/26848
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Zhu, C | - |
dc.contributor.author | Zhu, J | - |
dc.contributor.author | Cai, Y | - |
dc.contributor.author | Wang, F | - |
dc.date.accessioned | 2023-07-25T16:28:18Z | - |
dc.date.available | 2023-07-25T16:28:18Z | - |
dc.date.issued | 2023-06-27 | - |
dc.identifier | ORCID iDs: Chenyang Zhu https://orcid.org/0000-0002-2145-0559; Fang Wang https://orcid.org/0000-0003-1987-9150. | - |
dc.identifier.citation | Zhu, C. et al. (2023) ‘Decomposing Synthesized Strategies for Reactive Multi-agent Reinforcement Learning’, in David, C. and Sun, M. (eds) Theoretical Aspects of Software Engineering: 17th International Symposium, TASE 2023 (Lecture Notes in Computer Science, Vol. 13931). Cham, Switzerland: Springer, pp. 59–76. doi: 10.1007/978-3-031-35257-7_4. | en_US |
dc.identifier.isbn | 978-3-031-35256-0 (hbk) | - |
dc.identifier.isbn | 978-3-031-35257-7 (ebk) | - |
dc.identifier.issn | 0302-9743 | - |
dc.identifier.uri | https://bura.brunel.ac.uk/handle/2438/26848 | - |
dc.description.abstract | Multi-Agent Reinforcement Learning (MARL) has been used to solve sequential decision problems with a collection of intelligent agents interacting in a shared environment. However, the design complexity of MARL strategies increases with the complexity of the task specifications, and current MARL approaches suffer from slow convergence and reward sparsity when dealing with multi-task specifications. Linear temporal logic serves as a software engineering practice for describing non-Markovian task specifications, whose synthesized strategies can be used as a priori knowledge to train the agents to interact with the environment more efficiently. In this paper, we consider multiple agents that react to each other under a high-level reactive temporal logic specification called Generalized Reactivity of rank 1 (GR(1)). We first decompose the synthesized strategy of the GR(1) specification into a set of potential-based reward machines for the individual agents. We prove that the parallel composition of these reward machines forward simulates the original reward machine, which satisfies the GR(1) specification. We then extend the Markov Decision Process (MDP) with the synchronized reward machines. A value-iteration-based approach is developed to compute the potential values of the reward machine from the strategy structure. We also propose a decentralized Q-learning algorithm to train the agents on the extended MDP. Experiments on multi-agent learning under different reactive temporal logic specifications demonstrate the effectiveness of the proposed method, showing a superior learning curve and optimal rewards. | en_US |
dc.description.sponsorship | National Natural Science Foundation of China (No.62202067); Natural Science Foundation of the Higher Education Institutions of Jiangsu Province (No. 22KJB520012) | en_US |
dc.format.extent | 59 - 76 | - |
dc.format.medium | Print-Electronic | - |
dc.publisher | Springer | en_US |
dc.rights | Copyright © 2023 The Author(s), under exclusive licence to Springer Nature Switzerland AG. This version of the article has been accepted for publication, after peer review (when applicable) and is subject to Springer Nature’s AM terms of use, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: https://doi.org/10.1007/978-3-031-35257-7_4. Rights and permissions: Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. (see: https://www.springernature.com/gp/open-research/policies/journal-policies). | - |
dc.rights.uri | https://www.springernature.com/gp/open-research/policies/journal-policies | - |
dc.subject | reinforcement learning | en_US |
dc.subject | linear temporal logic | en_US |
dc.subject | GR(1) | en_US |
dc.subject | reward shaping | en_US |
dc.title | Decomposing Synthesized Strategies for Reactive Multi-agent Reinforcement Learning | en_US |
dc.type | Conference Paper | en_US |
dc.identifier.doi | https://doi.org/10.1007/978-3-031-35257-7_4 | - |
dc.relation.isPartOf | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | - |
pubs.publication-status | Published | - |
pubs.volume | 13931 LNCS | - |
dc.identifier.eissn | 1611-3349 | - |
dc.rights.holder | The Author(s) | - |
Appears in Collections: | Dept of Computer Science Embargoed Research Papers |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
FullText.pdf | Embargoed until 27 June 2024 | 1.34 MB | Adobe PDF |
Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.