AlberDICE: Addressing Out-Of-Distribution Joint Actions in Offline Multi-Agent RL via Alternating Stationary Distribution Correction Estimation

Daiki Matsunaga*, Jongmin Lee*, Jaeseok Yoon, Stefanos Leonardos, Pieter Abbeel, Kee-Eung Kim

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

Abstract

One of the main challenges in offline Reinforcement Learning (RL) is the distribution shift that arises from the learned policy deviating from the data collection policy. This is often addressed by avoiding out-of-distribution (OOD) actions during policy improvement, as their presence can lead to substantial performance degradation. This challenge is amplified in the offline Multi-Agent RL (MARL) setting, since the joint action space grows exponentially with the number of agents. To remedy the exponential complexity, existing MARL methods adopt either value decomposition methods or fully decentralized training of individual agents. However, we observe that these methods, even combined with the conservatism principles used in offline RL, can result in the selection of OOD joint actions in offline MARL. To this end, we introduce AlberDICE, an offline MARL algorithm that alternately performs centralized training of individual agents based on stationary distribution optimization. AlberDICE circumvents the exponential complexity of MARL by computing the best response of one agent at a time while effectively avoiding OOD joint action selection. Theoretically, we show that the alternating optimization procedure converges to Nash policies. In the experiments, we demonstrate that AlberDICE significantly outperforms baseline algorithms on a standard suite of MARL benchmarks.
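The core idea of the abstract, optimizing one agent at a time while holding the others fixed, can be illustrated with a minimal sketch. The example below shows only the alternating best-response structure on a toy two-agent common-payoff matrix game (the payoff matrix is hypothetical); it does not reproduce AlberDICE's actual stationary distribution correction (DICE-style) objectives or its offline-data handling.

```python
import numpy as np

# Hedged sketch: alternating best responses in a 2-agent common-payoff
# matrix game. Each step fixes one agent's action and lets the other
# best-respond, mirroring the "one agent at a time" structure described
# in the abstract. The payoff matrix is an illustrative assumption.
payoff = np.array([
    [1.0, 0.0],   # shared payoff R[a1, a2]
    [0.0, 2.0],
])

a1, a2 = 0, 0                            # initial pure-action policies
for _ in range(10):                      # alternate best-response updates
    a1 = int(np.argmax(payoff[:, a2]))   # agent 1 best-responds to a2
    a2 = int(np.argmax(payoff[a1, :]))   # agent 2 best-responds to a1

# The fixed point is a pure Nash equilibrium: neither agent can
# improve its payoff by deviating unilaterally.
assert payoff[a1, a2] >= payoff[:, a2].max()
assert payoff[a1, a2] >= payoff[a1, :].max()
```

Note that from the initial point (0, 0) the dynamics settle on the equilibrium with payoff 1.0 rather than the better equilibrium with payoff 2.0: best-response dynamics converge to *a* Nash equilibrium, not necessarily the optimal one, which matches the convergence guarantee stated in the abstract.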
Original language: English
Title of host publication: Advances in Neural Information Processing Systems 36 (NeurIPS 2023)
Publication status: Accepted/In press - 21 Sept 2023