• Journal of Electronic Science and Technology
  • Vol. 23, Issue 1, 100290 (2025)
Jian-Dong Yao, Wen-Bin Hao, Zhi-Gao Meng*, Bo Xie, Jian-Hua Chen, and Jia-Qi Wei
Author Affiliations
  • State Grid Sichuan Electric Power Company Chengdu Power Supply Company, Chengdu, 610041, China
    DOI: 10.1016/j.jnlest.2024.100290
    Jian-Dong Yao, Wen-Bin Hao, Zhi-Gao Meng, Bo Xie, Jian-Hua Chen, Jia-Qi Wei. Adaptive multi-agent reinforcement learning for dynamic pricing and distributed energy management in virtual power plant networks[J]. Journal of Electronic Science and Technology, 2025, 23(1): 100290.

    Abstract

    This paper presents a novel approach to dynamic pricing and distributed energy management in virtual power plant (VPP) networks using multi-agent reinforcement learning (MARL). As the energy landscape evolves towards greater decentralization and renewable integration, traditional optimization methods struggle to address the inherent complexities and uncertainties. Our proposed MARL framework enables adaptive, decentralized decision-making for both the distribution system operator and individual VPPs, optimizing economic efficiency while maintaining grid stability. We formulate the problem as a Markov decision process and develop a custom MARL algorithm that leverages actor-critic architectures and experience replay. Extensive simulations across diverse scenarios demonstrate that our approach consistently outperforms baseline methods, including Stackelberg game models and model predictive control, achieving an 18.73% reduction in costs and a 22.46% increase in VPP profits. The MARL framework shows particular strength in scenarios with high renewable energy penetration, where it improves system performance by 11.95% compared with traditional methods. Furthermore, our approach demonstrates superior adaptability to unexpected events and prediction errors, highlighting its potential for real-world implementation.

    1 Introduction

    Virtual power plants (VPPs) have emerged as an innovative concept in modern power systems, offering a solution to the challenges posed by the increasing integration of distributed energy resources (DERs). VPPs aggregate diverse resources such as solar panels, wind turbines, energy storage systems, and flexible loads into a single coordinated entity, enabling them to participate in electricity markets and provide grid services as if they were conventional power plants [1,2]. This aggregation and coordination capability makes VPPs instrumental in facilitating the integration of intermittent renewable energy sources, enhancing grid stability, and enabling small-scale DERs to participate in electricity markets [3–5].

    The importance of VPPs in modern power systems is multifaceted. By coordinating the output of multiple DERs, VPPs help balance supply and demand, particularly in systems with high penetration of renewable energy sources [3]. This coordination also allows VPPs to provide ancillary services to the grid, thereby enhancing system stability and reliability [4]. Furthermore, VPPs enable small-scale DERs to participate in electricity markets, potentially increasing competition and lowering costs [5]. In the realm of demand-side management, VPPs can aggregate flexible loads to provide demand response services, helping to balance the grid and reduce peak demand [6].

    However, the management and pricing of VPPs present significant challenges due to the complex nature of the systems they coordinate. The heterogeneous characteristics of DERs, the inherent uncertainties in renewable energy generation, and the dynamic nature of electricity markets create a multifaceted environment for decision-making [7]. The implementation of dynamic pricing, where electricity prices vary in real-time based on supply and demand conditions, adds another layer of complexity to VPP operations [8]. These challenges necessitate sophisticated management and pricing strategies that can handle the intricate dynamics of VPP systems.

    To address the challenges in VPP management and pricing, researchers have proposed various approaches with game-theoretic models, particularly Stackelberg games, gaining significant traction. These models are well-suited to represent the hierarchical decision-making process inherent in VPP operations [9]. In a typical Stackelberg game model applied to VPPs, the VPP operator assumes the role of the leader, making decisions on pricing or resource allocation, while the DER owners or consumers act as followers, responding to these decisions [9].

    Stackelberg game models have been applied to various aspects of VPP operations, including the development of bidding strategies in electricity markets, the design of demand response programs, and the facilitation of energy trading between VPPs and consumers [10–12]. These applications have demonstrated the potential of game-theoretic approaches in capturing the strategic interactions between different entities in a VPP system.

    Despite their popularity and theoretical elegance, game-theoretic approaches have several limitations in handling the dynamic and uncertain nature of VPP environments. One significant limitation is the common assumption of perfect information, where all players are presumed to have complete knowledge about the system. This assumption often does not hold in real-world VPP operations, where information asymmetry is prevalent [13]. Additionally, solving for equilibrium in complex, multi-player games can be computationally intensive, which limits the applicability of these models in real-time decision-making scenarios [14].

    Another crucial limitation of traditional game-theoretic models is their predominantly static nature, which struggles to capture the dynamic characteristics of VPP environments where conditions can change rapidly [15]. The scalability of these models also becomes a concern as the number of players (DERs and consumers) increases, leading to exponential growth in complexity and reduced practicality for large-scale VPPs [16]. Furthermore, while stochastic game theory offers some tools for uncertainty handling, incorporating the multiple sources of uncertainty inherent in VPP operations, such as renewable generation variability, demand fluctuations, and market price volatility, remains a significant challenge [17].

    These limitations underscore the need for more advanced approaches capable of handling the dynamic, uncertain, and complex nature of VPP environments. This necessity motivates the exploration of alternative methods, such as multi-agent reinforcement learning (MARL), which offers promising capabilities in addressing these challenges.

    MARL has emerged as a powerful paradigm for addressing complex decision-making problems in dynamic, uncertain environments with multiple interacting agents. MARL extends the principles of single-agent reinforcement learning (SARL) to scenarios where multiple agents learn and make decisions simultaneously [18]. In MARL, agents interact with their environment and with each other, receiving feedback in the form of rewards or penalties based on their actions. Through this process of trial and error, agents learn optimal policies to maximize their cumulative rewards over time [19].

    The fundamental principle underlying MARL is the concept of decentralized learning, where each agent independently learns its own policy based on local observations and rewards. This decentralized approach allows MARL to handle the scalability challenges inherent in multi-agent systems, making it particularly suitable for complex environments such as VPPs [20]. MARL algorithms typically employ function approximation techniques, such as deep neural networks, to handle high-dimensional state and action spaces, further enhancing their ability to tackle real-world problems [21].

    MARL offers several advantages in complex, multi-agent environments like VPPs. First, it can handle the partial observability and non-stationarity that characterize such systems, where the environment changes dynamically due to the actions of other agents [22]. Second, MARL can learn cooperative or competitive behaviors emergently, without the need for explicit modeling of agent interactions [23]. This is particularly valuable in VPP scenarios, where the optimal strategy may involve a mix of cooperation (e.g., load balancing) and competition (e.g., market bidding). Third, MARL approaches can adapt to changing conditions and learn continuously, making them well-suited to the dynamic nature of energy systems [24].

    Moreover, MARL can effectively balance exploration and exploitation, a crucial aspect in environments with uncertainty. By employing techniques such as epsilon-greedy policies or entropy regularization, MARL agents can discover novel strategies while refining existing ones [25]. This exploration-exploitation trade-off is essential in VPP management, where the system must balance the need to optimize current operations with the need to adapt to potential future scenarios.

    The application of MARL to VPPs and smart grid systems has gained significant attention in recent years. Researchers have explored MARL for various aspects of VPP management, including energy trading, demand response, and resource allocation [26–28]. These studies have demonstrated the potential of MARL to improve the efficiency and resilience of VPPs by enabling adaptive, decentralized decision-making.

    Dynamic pricing in electricity markets has been a focal point of research, with MARL offering promising solutions to the challenges of real-time price optimization. Studies have shown that MARL can effectively learn pricing strategies that balance supply and demand while maximizing profits for market participants [29–31]. The ability of MARL to handle the stochastic nature of renewable energy generation and demand fluctuations makes it particularly suitable for dynamic pricing in VPP contexts [32]. Game-theoretic approaches have been widely applied in energy management, providing valuable insights into strategic interactions between market participants [16,33]. However, the integration of game theory with MARL has opened up new possibilities for modeling complex, dynamic energy markets [30,34]. These hybrid approaches leverage the strengths of both paradigms, combining the strategic reasoning of game theory with the adaptive learning capabilities of MARL.

    Reinforcement learning applications in power systems have demonstrated remarkable success in various domains, including grid stability control, energy storage management, and microgrid optimization [35–37]. The ability of reinforcement learning algorithms to learn optimal policies through interactions with the environment has proven particularly valuable in handling the uncertainty and nonlinearity inherent in power systems [38]. Multi-agent systems have found increasing applications in energy markets, offering decentralized solutions to complex coordination problems [39,40]. MARL has emerged as a powerful tool in this context, enabling agents to learn optimal bidding strategies, negotiate energy contracts, and coordinate resource allocation in distributed energy systems [29–31].

    Despite the promising advances in MARL applications for VPPs and energy systems, several research gaps remain, particularly in the context of dynamic pricing and distributed energy management. While existing studies have demonstrated the potential of MARL in optimizing specific aspects of VPP operations, there is a lack of comprehensive frameworks that integrate dynamic pricing strategies with distributed energy management in a decentralized manner [26]. The complexity of coordinating multiple objectives across different timescales and operational domains within VPP networks presents a significant challenge that current approaches have not fully addressed.

    Furthermore, the scalability of MARL approaches to large-scale VPP networks with heterogeneous DERs remains a pressing issue [21]. Although decentralized learning alleviates some scalability concerns, effectively coordinating learning and decision-making across a vast number of diverse agents in real-time VPP operations is still an open problem. This challenge is particularly acute when considering the dynamic nature of pricing strategies and the need for adaptive energy management in response to rapidly changing market conditions and renewable energy availability.

    Another critical gap lies in the interpretability and explainability of MARL models in the context of VPP decision-making, especially regarding pricing strategies and energy allocation [41]. The “black box” nature of many deep learning-based MARL approaches hinders their adoption by stakeholders who require transparency and accountability in energy market operations.

    Last, the integration of domain knowledge and physical constraints specific to distribution systems into MARL frameworks for VPPs is an area that demands further investigation [42]. Incorporating expert knowledge and system-specific constraints has the potential to enhance learning efficiency and ensure the feasibility of learned policies, particularly in the context of dynamic pricing and distributed energy management.

    Motivated by these research gaps, this paper aims to develop a comprehensive, adaptive MARL framework for dynamic pricing and distributed energy management in VPP networks. The main objectives of this research are as follows:

    1) A scalable, decentralized MARL architecture that can efficiently handle large-scale VPP networks with heterogeneous DERs is designed, focusing on adaptive dynamic pricing and energy management strategies.

    2) Interpretable MARL algorithms that provide insights into the decision-making process of VPP agents are developed, particularly in relation to pricing decisions and energy allocation.

    3) Distribution system-specific knowledge and physical constraints are integrated into the MARL framework to improve learning efficiency and policy feasibility in real-world VPP operations.

    4) Adaptive mechanisms that allow VPP agents to dynamically adjust their strategies in response to changing market conditions and energy availability are implemented and evaluated.

    This paper introduces a novel, adaptive MARL framework for holistic VPP management that simultaneously optimizes dynamic pricing strategies and distributed energy management in decentralized VPP networks. The proposed framework employs a hierarchical reinforcement learning architecture combined with attention mechanisms to address the scalability challenges inherent in large-scale VPP networks. To enhance the practical applicability of the MARL approach, we develop an interpretable algorithm that provides explainable decision-making policies for VPP agents, focusing on pricing decisions and energy allocation strategies. Furthermore, we present a method for incorporating distribution system-specific knowledge and physical constraints into the MARL framework through a combination of reward shaping and constrained policy optimization (CPO) techniques. The effectiveness and superiority of our proposed framework are demonstrated through comprehensive empirical evaluations in realistic VPP scenarios, showcasing its ability to adapt to dynamic market conditions and optimize energy management across diverse DERs.

    2 Problem formulation

    In this section, we provide a comprehensive formulation of the dynamic pricing and distributed energy management problem in VPP networks. We model the interactions between the distribution system operator (DSO) and multiple VPPs as a multi-agent Markov decision process (MAMDP). To facilitate understanding, we break down the problem into its core components and present a step-by-step formulation.

    We consider a power distribution network managed by DSO and consisting of a set of VPPs, each controlling a portfolio of DERs. DERs include renewable generation units (wind turbines and solar photovoltaic (PV)), energy storage systems, and flexible loads. DSO aims to maintain grid stability and economic efficiency, while VPPs aim to maximize their individual profits through energy trading and resource management.

    2.1 System model

    The VPP network under consideration is a complex, interconnected system comprising multiple DERs, energy storage systems, and flexible loads, coordinated by a central DSO. Let $ \mathcal{V}=\{1{\mathrm{,}}\; 2{\mathrm{,}} \; \cdots {\mathrm{,}}\; N\} $ denote the set of VPPs in the network, where $ N $ is the number of VPPs and each VPP $ i \in \mathcal{V} $ manages a local portfolio of DERs.

    The DSO’s role is to maintain grid stability and efficiency while facilitating renewable energy integration. It sets dynamic prices and coordinates the overall energy management of the VPP network. The DSO’s objective function can be expressed as

    $ \underset{{{\bf{a}}}_{\text{DSO}}}{\mathrm{max}}\;{J}_{\text{DSO}}=\sum _{t=1}^{T}\left({R}_{t}^{\text{market}}-{C}_{t}^{\text{market}}+{R}_{t}^{\text{VPP}}-{P}_{t}^{\text{stability}}-{P}_{t}^{\text{emissions}}\right) $ (1)

    where JDSO is the cumulative reward of DSO; aDSO is the action of DSO; T represents the total number of time steps or periods considered; $ {R}_{t}^{\mathrm{m}\mathrm{a}\mathrm{r}\mathrm{k}\mathrm{e}\mathrm{t}} $ and $ {C}_{t}^{\mathrm{m}\mathrm{a}\mathrm{r}\mathrm{k}\mathrm{e}\mathrm{t}} $ are the revenue and cost from market transactions, respectively; $ {R}_{t}^{\mathrm{V}\mathrm{P}\mathrm{P}} $ is the revenue from VPP transactions; $ {P}_{t}^{\mathrm{s}\mathrm{t}\mathrm{a}\mathrm{b}\mathrm{i}\mathrm{l}\mathrm{i}\mathrm{t}\mathrm{y}} $ and $ {P}_{t}^{\mathrm{e}\mathrm{m}\mathrm{i}\mathrm{s}\mathrm{s}\mathrm{i}\mathrm{o}\mathrm{n}\mathrm{s}} $ are penalties for grid instability and emissions, respectively.

    Each VPP $ i $ is characterized by its generation capacity $ {G}_{i}\left(t\right) $, storage capacity $ {S}_{i}\left(t\right) $, and local demand $ {D}_{i}\left(t\right) $ at time $ t $. VPPs aim to maximize their profits while meeting the local energy demand and adhering to commitments made to DSO. The objective function for VPP $ i $ can be formulated as

    $ \underset{{{\bf{a}}}_{i}}{\mathrm{max}}\;{J}_{i}^{\text{VPP}}=\sum _{t=1}^{T}\left({R}_{t}^{\text{sell}}-{C}_{t}^{\text{buy}}-{C}_{t}^{\text{operation}}-{P}_{t}^{\text{deviation}}\right) $ (2)

    where Ji is the cumulative reward of VPP i, ai is the action of VPP i, $ {R}_{t}^{\mathrm{s}\mathrm{e}\mathrm{l}\mathrm{l}} $ is the revenue from selling energy, $ {C}_{t}^{\mathrm{b}\mathrm{u}\mathrm{y}} $ is the cost of buying energy, $ {C}_{t}^{\mathrm{o}\mathrm{p}\mathrm{e}\mathrm{r}\mathrm{a}\mathrm{t}\mathrm{i}\mathrm{o}\mathrm{n}} $ is the operational cost of DERs, and $ {P}_{t}^{\mathrm{d}\mathrm{e}\mathrm{v}\mathrm{i}\mathrm{a}\mathrm{t}\mathrm{i}\mathrm{o}\mathrm{n}} $ is a penalty for deviating from committed schedules.

    We model the system as MAMDP, defined by the tuple $ \left\langle \mathcal{N}{\mathrm{,}}\;\mathcal{S}{\mathrm{,}}\;{\left\{{\mathcal{A}}_{i}\right\}}_{i\in \mathcal{N}}{\mathrm{,}}\;P{\mathrm{,}}\;{\left\{{r}_{i}\right\}}_{i\in \mathcal{N}}{\mathrm{,}}\; \gamma \right\rangle $, where $ \mathcal{N}=\{\mathrm{D}\mathrm{S}\mathrm{O}{\mathrm{,}}\;\mathrm{V}\mathrm{P}{\mathrm{P}}_{1}{\mathrm{,}}\;\mathrm{V}\mathrm{P}{\mathrm{P}}_{2}{\mathrm{,}}\; \cdots {\mathrm{,}}\;\mathrm{V}\mathrm{P}{\mathrm{P}}_{N}\} $ is the set of agents; $ \mathcal{S} $ is the state space of the environment; $ {\mathcal{A}}_{i} $ is the action space of agent $ i $; $ P:\mathcal{S}\times \mathcal{A}\times \mathcal{S}\to [0{\mathrm{,}}\mathrm{ }1] $ is the state transition probability function; $ {r}_{i}:\mathcal{S}\times \mathcal{A}\to \mathbb{R} $ is the reward function for agent $ i $; $ \gamma \in [0{\mathrm{,}}\mathrm{ }1] $ is the discount factor; $ \mathcal{A}={\mathcal{A}}_{\mathrm{D}\mathrm{S}\mathrm{O}}\times \prod\limits_{i=1}^{N} {\mathcal{A}}_{i} $ is the joint action space and $ {\mathcal{A}}_{\mathrm{D}\mathrm{S}\mathrm{O}} $ is the action space of DSO.
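
    As a concrete illustration of this tuple, the short Python sketch below organizes the agent set, per-agent action spaces (represented as box bounds), reward functions, and discount factor into a single container. The class and field names, dimensions, and numbers are illustrative placeholders, not part of the formulation above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple
import numpy as np

@dataclass
class MAMDP:
    """Container for the multi-agent MDP <N, S, {A_i}, P, {r_i}, gamma>."""
    agents: Tuple[str, ...]                                   # N: e.g. ("DSO", "VPP_1", ..., "VPP_N")
    action_bounds: Dict[str, Tuple[np.ndarray, np.ndarray]]   # A_i as box constraints (low, high)
    reward_fns: Dict[str, Callable[[np.ndarray, Dict[str, np.ndarray]], float]]  # r_i(s, a)
    gamma: float = 0.95                                       # discount factor

    def joint_action_dim(self) -> int:
        """Dimension of the joint action space A = A_DSO x prod_i A_i."""
        return sum(low.size for low, _ in self.action_bounds.values())

# Illustrative instantiation with one DSO and two VPPs (all numbers are placeholders).
n_vpps = 2
agents = ("DSO",) + tuple(f"VPP_{i}" for i in range(1, n_vpps + 1))
bounds = {"DSO": (np.array([0.0, 0.0, -1.0]), np.array([1.0, 1.0, 1.0]))}   # prices and delta_t
bounds.update({a: (np.zeros(5), np.ones(5)) for a in agents[1:]})           # 5 VPP controls each
rewards = {a: (lambda s, acts: 0.0) for a in agents}                        # placeholders for r_DSO, r_i
mdp = MAMDP(agents, bounds, rewards)
print(mdp.joint_action_dim())   # 3 + 2*5 = 13
```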

    2.2 State space definition

    The state space of the VPP network at time $ t $ is defined as a vector $ {\mathbf{s}}_{t}\in \mathcal{S} $. The state vector comprises the market prices, VPP-specific states, and grid-level information.

    1) Market prices: $ {{\mathbf{p}}_t} = \left[ {\lambda _t^{{\text{DA}}{\mathrm{,}}s}{\mathrm{,}}{\text{ }}\lambda _t^{{\text{DA}}{\mathrm{,}}b}{\mathrm{,}}{\text{ }}\lambda _t^{W{\mathrm{,}}s}{\mathrm{,}}{\text{ }}\lambda _t^{W{\mathrm{,}}b}} \right] $, where $ \lambda _t^{{\text{DA}}{\mathrm{,}}s} $ and $ \lambda _t^{{\text{DA}}{\mathrm{,}}b} $ are the day-ahead selling and buying prices set by DSO at time $ t $, respectively; $\lambda _t^{W{\mathrm{,}}s}$ and $\lambda _t^{W{\mathrm{,}}b}$ are the wholesale market selling and buying prices, respectively, both of which are also set by DSO at time $ t $.

    2) VPP-specific states: For each VPP $ i\in \mathcal{V} $, we have $ {{\mathbf{x}}_{i{\mathrm{,}}t}} = \left[ {{G_{i{\mathrm{,}}t}}{\mathrm{,}}{\text{ }}{S_{i{\mathrm{,}}t}}{\mathrm{,}}{\text{ }}{D_{i{\mathrm{,}}t}}{\mathrm{,}}{\text{ }}R_{i{\mathrm{,}}t}^W{\mathrm{,}}{\text{ }}R_{i{\mathrm{,}}t}^{{\text{PV}}}} \right] $ with $ {G_{i{\mathrm{,}}t}} $, $ {S_{i{\mathrm{,}}t}} $, $ {D_{i{\mathrm{,}}t}} $, $ R_{i{\mathrm{,}}t}^W $, and $ R_{i{\mathrm{,}}t}^{{\text{PV}}} $ being the generation output, energy storage level, local demand, wind generation forecast, and solar PV generation forecast for VPP i at time t, respectively.

    3) Grid-level information: $ {{\mathbf{y}}_t} = \left[ {{D_t}{\mathrm{,}}{\text{ }}{R_t}{\mathrm{,}}{\text{ }}{f_t}{\mathrm{,}}{\text{ }}{v_t}} \right] $ with Dt, Rt, ft, and vt being the total system demand, total renewable generation, grid frequency, and voltage level at key nodes at time t, respectively.

    Thus, the complete state vector at time $ t $ can be expressed as $ {{\mathbf{s}}_t} = \left[ {{{\mathbf{p}}_t}{\mathrm{,}}{\text{ }}{{\left\{ {{{\mathbf{x}}_{i{\mathrm{,}}t}}} \right\}}_{i \in \mathcal{V}}}{\mathrm{,}}{\text{ }}{{\mathbf{y}}_t}{\mathrm{,}}{\text{ }}t} \right] $.
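
    The following minimal sketch (Python/NumPy) shows how such a state vector could be assembled from the three components defined above; all array dimensions and numerical values are illustrative placeholders rather than the settings used later in the experiments.

```python
import numpy as np

def build_state(p_t, x_t, y_t, t):
    """Assemble s_t = [p_t, {x_{i,t}}_{i in V}, y_t, t] as a flat vector.

    p_t : (4,)   day-ahead and wholesale buy/sell prices
    x_t : (N, 5) per-VPP [G, S, D, R^W, R^PV]
    y_t : (4,)   [total demand, total renewables, frequency, voltage index]
    """
    return np.concatenate([p_t, x_t.ravel(), y_t, [float(t)]])

# Example with N = 3 VPPs (all numbers are placeholders).
p_t = np.array([0.12, 0.18, 0.10, 0.20])          # prices
x_t = np.random.rand(3, 5)                        # per-VPP states
y_t = np.array([55.0, 21.0, 50.0, 1.0])           # demand (MW), renewables (MW), Hz, p.u.
s_t = build_state(p_t, x_t, y_t, t=7)
print(s_t.shape)                                  # (4 + 3*5 + 4 + 1,) = (24,)
```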

    2.3 Action space for DSO and VPPs

    The action space is defined separately for DSO and VPPs, reflecting their distinct roles in the system. The DSO’s action space $ {\mathcal{A}}_{\mathrm{D}\mathrm{S}\mathrm{O}} $ consists of setting the buying and selling prices for the next time step:

    $ {{\mathbf{a}}_{{\text{DSO}}{\mathrm{,}}t}} = \left[ {\lambda _{t + 1}^{{\text{DA}}{\mathrm{,}}s}{\mathrm{,}}{\text{ }}\lambda _{t + 1}^{{\text{DA}}{\mathrm{,}}b}{\mathrm{,}}{\text{ }}{\delta _t}} \right] \in {\mathcal{A}_{{\text{DSO}}}} $ (3)

    where $ {\lambda }_{t+1}^{\mathrm{D}\mathrm{A}{\mathrm{,}}s} $ and $ {\lambda }_{t+1}^{\mathrm{D}\mathrm{A}{\mathrm{,}}b} $ are constrained within predefined limits to ensure market stability:

    $ \lambda _t^{W{\mathrm{,}}s} \leq \lambda _t^{{\text{DA}}{\mathrm{,}}s} \leq \lambda _t^{{\text{DA}}{\mathrm{,}}b} \leq \lambda _t^{W{\mathrm{,}}b} $ (4)

    and $ {\delta }_{t} $ is a price adjustment factor for demand response.

    For each VPP $ i\in \mathcal{V} $, the action space $ {\mathcal{A}}_{\mathrm{V}\mathrm{P}\mathrm{P}{\mathrm{,}}i} $ includes decisions on

    $ {{\mathbf{a}}_{i{\mathrm{,}}t}} = \left[ {P_{i{\mathrm{,}}t}^{{\text{VPP}}{\mathrm{,}}s}{\mathrm{,}}{\text{ }}P_{i{\mathrm{,}}t}^{{\text{VPP}}{\mathrm{,}}b}{\mathrm{,}}{\text{ }}P_{i{\mathrm{,}}t}^{{\text{MT}}}{\mathrm{,}}{\text{ }}P_{i{\mathrm{,}}t}^{{\text{ES}}}{\mathrm{,}}{\text{ }}P_{i{\mathrm{,}}t}^{{\text{IL}}}} \right] \in {\mathcal{A}_{{\text{VPP}}{\mathrm{,}}i}} $ (5)

    where $P_{i{\mathrm{,}}t}^{{\text{VPP}}{\mathrm{,}}s}$ and $P_{i{\mathrm{,}}t}^{{\text{VPP}}{\mathrm{,}}b}$ represent the power sold to and purchased from the market by VPP $ i $ at time $ t $, respectively; $P_{i{\mathrm{,}}t}^{{\text{MT}}}$ denotes the power output from micro-turbines within VPP $ i $, while $P_{i{\mathrm{,}}t}^{{\text{ES}}}$ represents the power dispatched or stored by energy storage systems for load-generation balancing; $P_{i{\mathrm{,}}t}^{{\text{IL}}}$ indicates the adjustment of interruptible loads for demand response management.

    These actions are subject to the following constraints:

    $ 0 \leq P_{i{\mathrm{,}}t}^{{\text{VPP}}{\mathrm{,}}s} \leq {\theta _{i{\mathrm{,}}t}}P_{i{\mathrm{,}}{\rm{max}} }^{{\text{VPP}}} $ (6a)

    $ 0 \leq P_{i{\mathrm{,}}t}^{{\text{VPP}}{\mathrm{,}}b} \leq \left( {1 - {\theta _{i{\mathrm{,}}t}}} \right)P_{i{\mathrm{,}}{\rm{max}} }^{{\text{VPP}}} $ (6b)

    $ 0 \leq P_{i{\mathrm{,}}t}^{{\text{MT}}} \leq P_{i{\mathrm{,}}{\rm{max}} }^{{\text{MT}}} $ (6c)

    $ P_{i{\mathrm{,}}{\rm{min}} }^{{\text{ES}}} \leq P_{i{\mathrm{,}}t}^{{\text{ES}}} \leq P_{i{\mathrm{,}}{\rm{max}} }^{{\text{ES}}} $ (6d)

    $ 0 \leq P_{i{\mathrm{,}}t}^{{\text{IL}}} \leq P_{i{\mathrm{,}}{\rm{max}} }^{{\text{IL}}} $ (6e)

    where $ {\theta }_{i{\mathrm{,}}t} $ is a binary variable indicating whether VPP $ i $ is selling ($ {\theta }_{i{\mathrm{,}}t}=1 $) or buying ($ \theta_{i\mathrm{,}t}=0 $) at time $ t $. The maximum power traded $P_{i{\mathrm{,}}{\rm{max}} }^{{\text{VPP}}}$, which bounds both selling in (6a) and buying in (6b), is determined by VPP $ i $’s market participation capacity and regulatory limits. The maximum micro-turbine output $P_{i{\mathrm{,}}{\rm{max}} }^{{\text{MT}}}$ is constrained by the installed capacity and technical specifications of the generating units. The energy storage system’s operational bounds ($P_{i{\mathrm{,}}{\rm{min}} }^{{\text{ES}}}$ and $P_{i{\mathrm{,}}{\rm{max}} }^{{\text{ES}}}$) are governed by the storage capacity and charge/discharge characteristics. The maximum interruptible load adjustment $P_{i{\mathrm{,}}{\rm{max}} }^{{\text{IL}}}$ is established through demand response agreements and the VPP’s load control capabilities. The combined action space for all VPPs is then $ {\mathcal{A}}_{\text{VPP}}=\prod\limits _{i \in \mathcal{V}}{\mathcal{A}}_{\text{VPP}{\mathrm{,}}i} $.
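
    A minimal sketch of how a raw VPP action could be projected onto constraints (6a)–(6e) by elementwise clipping is given below; the binary flag corresponds to $ {\theta }_{i{\mathrm{,}}t} $, while the limit values and the helper name clip_vpp_action are illustrative assumptions.

```python
import numpy as np

def clip_vpp_action(a_raw, theta, p_vpp_max, p_mt_max, p_es_min, p_es_max, p_il_max):
    """Project a raw action [P_sell, P_buy, P_MT, P_ES, P_IL] onto constraints (6a)-(6e)."""
    p_sell, p_buy, p_mt, p_es, p_il = a_raw
    p_sell = np.clip(p_sell, 0.0, theta * p_vpp_max)          # (6a)
    p_buy  = np.clip(p_buy, 0.0, (1.0 - theta) * p_vpp_max)   # (6b)
    p_mt   = np.clip(p_mt, 0.0, p_mt_max)                     # (6c)
    p_es   = np.clip(p_es, p_es_min, p_es_max)                # (6d)
    p_il   = np.clip(p_il, 0.0, p_il_max)                     # (6e)
    return np.array([p_sell, p_buy, p_mt, p_es, p_il])

# Example: a selling VPP (theta = 1) with placeholder limits in MW.
a = clip_vpp_action(np.array([3.2, 0.5, 1.7, -2.4, 0.8]), theta=1,
                    p_vpp_max=2.5, p_mt_max=1.5, p_es_min=-2.0, p_es_max=2.0, p_il_max=0.5)
print(a)   # [ 2.5  0.   1.5 -2.   0.5]
```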

    2.4 Transition function

    The system state evolves based on the actions taken by DSO and VPPs, as well as external factors. The transition function can be expressed as

    $ {{\bf{s}}}_{t+1}=f\left({{\bf{s}}}_{t}{\mathrm{,}}\text{ }{{\bf{a}}}_{\text{DSO}{\mathrm{,}}t}{\mathrm{,}}\text{ }{\left\{{{\bf{a}}}_{i{\mathrm{,}}t}\right\}}_{i\in \mathcal{V}}{\mathrm{,}}\text{ }{w}_{t}{\mathrm{,}}\text{ }{ϵ}_{t}\right) $ (7)

    where $ {w}_{t} $ represents stochastic elements such as prediction errors in renewable generation and load, and $ {ϵ}_{t} $ represents external factors like sudden weather changes or equipment failures.

    The transition for each component of the state vector can be described as follows:

    1) Market prices:

    $ {{\mathbf{p}}_{t + 1}} = {g_{\mathbf{p}}}\left( {{{\mathbf{p}}_t}{\mathrm{,}}{\text{ }}{{\mathbf{a}}_{{\text{DSO}}{\mathrm{,}}t}}{\mathrm{,}}{\text{ }}{{\left\{ {{{\mathbf{a}}_{i{\mathrm{,}}t}}} \right\}}_{i \in \mathcal{V}}}{\mathrm{,}}{\text{ }}w_t^{\mathbf{p}}} \right) $ (8)

    2) VPP-specific states:

    $ {{\bf{x}}}_{i{\mathrm{,}}t+1}={g}_{{\bf{x}}}\left({{\bf{x}}}_{i{\mathrm{,}}t}{\mathrm{,}}\text{ }{{\bf{a}}}_{i{\mathrm{,}}t}{\mathrm{,}}\text{ }{w}_{t}^{{\bf{x}}}{\mathrm{,}}\text{ }{ϵ}_{t}^{{\bf{x}}}\right){\mathrm{,}}\text{ }\forall i\in \mathcal{V} $ (9)

    3) Grid-level information:

    $ {{\bf{y}}}_{t+1}={g}_{{\bf{y}}}\left({{\bf{y}}}_{t}{\mathrm{,}}\text{ }{\left\{{{\bf{a}}}_{i{\mathrm{,}}t}\right\}}_{i\in \mathcal{V}}{\mathrm{,}}\text{ }{w}_{t}^{{\bf{y}}}{\mathrm{,}}\text{ }{ϵ}_{t}^{{\bf{y}}}\right) $ (10)

    2.5 Reward functions

    The DSO’s reward function encompasses multiple components to reflect its diverse objectives:

    $ r_{{\text{DSO}}}^t = r_{{\text{profit}}}^t + r_{{\text{stability}}}^t + r_{{\text{renewable}}}^t - r_{{\text{emissions}}}^t $ (11)

    where $ r_{{\text{DSO}}}^t $ is the reward of DSO at time t, $ {r}_{\mathrm{p}\mathrm{r}\mathrm{o}\mathrm{f}\mathrm{i}\mathrm{t}}^{t} $ is the profit from energy transactions, $ {r}_{\mathrm{s}\mathrm{t}\mathrm{a}\mathrm{b}\mathrm{i}\mathrm{l}\mathrm{i}\mathrm{t}\mathrm{y}}^{t} $ is the reward for maintaining grid stability, $ {r}_{\mathrm{r}\mathrm{e}\mathrm{n}\mathrm{e}\mathrm{w}\mathrm{a}\mathrm{b}\mathrm{l}\mathrm{e}}^{t} $ is the reward for renewable energy integration, and $ {r}_{\mathrm{e}\mathrm{m}\mathrm{i}\mathrm{s}\mathrm{s}\mathrm{i}\mathrm{o}\mathrm{n}\mathrm{s}}^{t} $ is the penalty for greenhouse gas emissions. They are calculated by

    $ {r}_{\text{profit}}^{t}=\sum _{i\in \mathcal{V}}\left({\lambda }_{t}^{\text{DA}{\mathrm{,}}b}{P}_{i{\mathrm{,}}t}^{\text{VPP}{\mathrm{,}}b}-{\lambda }_{t}^{\text{DA}{\mathrm{,}}s}{P}_{i{\mathrm{,}}t}^{\text{VPP}{\mathrm{,}}s}\right)+\left({\lambda }_{t}^{W{\mathrm{,}}s}{P}_{t}^{\text{DSO}{\mathrm{,}}s}-{\lambda }_{t}^{W{\mathrm{,}}b}{P}_{t}^{\text{DSO}{\mathrm{,}}b}\right) $ (12)

    where $ {P}_{t}^{\mathrm{D}\mathrm{S}\mathrm{O}{\mathrm{,}}s} $ represents the total electricity DSO sells to the wholesale market at time $ t $, and $ {P}_{t}^{\mathrm{D}\mathrm{S}\mathrm{O}{\mathrm{,}}b} $ is the total electricity DSO buys from the wholesale market at time $ t $.

    $ r_{{\text{stability}}}^t = {\alpha _f}\exp ( - {\beta _f}\left| {{f_t} - {f_{{\text{nominal}}}}} \right|) + {\alpha _v}\exp ( - {\beta _v}\left\| {{{\mathbf{v}}_t} - {{\mathbf{v}}_{{\text{nominal}}}}} \right\|) $ (13)

    where $ \left\| \cdot \right\| $ denotes the Euclidean norm of the difference between $ {\mathbf{v}}_{t} $ and $ {\mathbf{v}}_{\text{nominal}} $, $ {f}_{\text{nominal}} $ is the nominal frequency at which the power grid is designed to operate, and $ {\mathbf{v}}_{\text{nominal}} $ is the nominal voltage vector at the critical points of the power grid.

    $ {r}_{\text{renewable}}^{t}=\gamma \frac{\displaystyle\sum _{i\in \mathcal{V}}\left({R}_{i{\mathrm{,}}t}^{W}+{R}_{i{\mathrm{,}}t}^{\text{PV}}\right)}{{D}_{t}} $ (14)

    $ {r}_{\text{emissions}}^{t}=\delta \displaystyle\sum _{i\in \mathcal{V}}{e}_{i}\left({P}_{i{\mathrm{,}}t}^{\text{MT}}\right) $ (15)

    where $ {\alpha }_{f} $, $ {\beta }_{f} $, $ {\alpha }_{v} $, $ {\beta }_{v} $, $ \gamma $, and $ \delta $ are weighting parameters; $ {e}_{i}(\cdot ) $ is an emissions function for the micro-turbine of VPP $ i $.

    For each VPP $ i $, the reward function balances profit maximization, cost minimization, and adherence to commitments:

    $ r_i^t = r_{{\text{profit}}{\mathrm{,}}i}^t - r_{{\text{cost}}{\mathrm{,}}i}^t - r_{{\text{deviation}}{\mathrm{,}}i}^t $ (16)

    where $ {r}_{\mathrm{p}\mathrm{r}\mathrm{o}\mathrm{f}\mathrm{i}\mathrm{t}{\mathrm{,}}i}^{t} $ is the profit from energy transactions, $ {r}_{\mathrm{c}\mathrm{o}\mathrm{s}\mathrm{t}{\mathrm{,}}i}^{t} $ is the operational cost, and $ {r}_{\mathrm{d}\mathrm{e}\mathrm{v}\mathrm{i}\mathrm{a}\mathrm{t}\mathrm{i}\mathrm{o}\mathrm{n}{\mathrm{,}}i}^{t} $ is the penalty for deviating from commitments, calculated as

    $ r_{{\text{profit}}{\mathrm{,}}i}^t = \lambda _t^{{\text{DA}}{\mathrm{,}}s}P_{i{\mathrm{,}}t}^{{\text{VPP}}{\mathrm{,}}s} - \lambda _t^{{\text{DA}}{\mathrm{,}}b}P_{i{\mathrm{,}}t}^{{\text{VPP}}{\mathrm{,}}b} $ (17)

    $ r_{{\text{cost}}{\mathrm{,}}i}^t = C_i^{{\text{MT}}}\left( {P_{i{\mathrm{,}}t}^{{\text{MT}}}} \right) + C_i^{{\text{ES}}}\left( {P_{i{\mathrm{,}}t}^{{\text{ES}}}} \right) + C_i^{{\text{IL}}}\left( {P_{i{\mathrm{,}}t}^{{\text{IL}}}} \right) $ (18)

    $ r_{{\text{deviation}}{\mathrm{,}}i}^t = {\mu _i}\left| {P_{i{\mathrm{,}}t}^{{\text{VPP}}} - P_{i{\mathrm{,}}t}^{{\text{PV}}{\mathrm{,}}{\text{forecast}}}} \right| $ (19)

    where $ {C}_{i}^{\mathrm{M}\mathrm{T}}(\cdot ) $, $ {C}_{i}^{\mathrm{E}\mathrm{S}}(\cdot ) $, and $ {C}_{i}^{\mathrm{I}\mathrm{L}}(\cdot ) $ are cost functions for the micro-turbine, energy storage, and interruptible load, respectively; $ {\mu }_{i} $ is a penalty factor; $P_{i{\mathrm{,}}t}^{{\text{PV}}{\mathrm{,}}{\text{forecast}}}$ represents the forecasted amount of electricity that the PV systems within VPP $ i $ are expected to generate at time $ t $.
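
    The sketch below assembles the VPP reward of (16)–(19) for a single time step. Since the functional forms of $ {C}_{i}^{\mathrm{M}\mathrm{T}} $, $ {C}_{i}^{\mathrm{E}\mathrm{S}} $, and $ {C}_{i}^{\mathrm{I}\mathrm{L}} $ are not specified above, simple placeholder cost curves are assumed, and the exchanged power $P_{i{\mathrm{,}}t}^{{\text{VPP}}}$ in the deviation term is taken as the net of selling minus buying; all numbers are illustrative.

```python
def vpp_reward(a, prices, p_forecast, mu=0.1,
               c_mt=lambda p: 0.04 * p ** 2 + 0.3 * p,   # placeholder cost curves
               c_es=lambda p: 0.02 * abs(p),
               c_il=lambda p: 0.05 * p):
    """Compute r_i^t = r_profit - r_cost - r_deviation, Eqs. (16)-(19).

    a          : dict with keys 'sell', 'buy', 'mt', 'es', 'il' (power values)
    prices     : dict with day-ahead prices 'da_s', 'da_b'
    p_forecast : committed/forecast exchange used in the deviation penalty (19)
    """
    r_profit = prices["da_s"] * a["sell"] - prices["da_b"] * a["buy"]      # (17)
    r_cost = c_mt(a["mt"]) + c_es(a["es"]) + c_il(a["il"])                 # (18)
    p_vpp = a["sell"] - a["buy"]                                           # net exchange (assumption)
    r_dev = mu * abs(p_vpp - p_forecast)                                   # (19)
    return r_profit - r_cost - r_dev                                       # (16)

print(vpp_reward({"sell": 2.0, "buy": 0.0, "mt": 1.0, "es": -0.5, "il": 0.2},
                 {"da_s": 40.0, "da_b": 60.0}, p_forecast=1.8))            # 79.62
```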

    2.6 Constraints

    1) DSO constraints

    The DSO operates within constraints intended to maintain grid stability, manage economic risks, and meet regulatory requirements. These constraints are categorized into three main areas: Price limits, grid stability, and power balance.

    i) Price limits require

    $ \lambda _t^{W{\mathrm{,}}s} \leq \lambda _t^{{\text{DA}}{\mathrm{,}}s} \leq \lambda _t^{{\text{DA}}{\mathrm{,}}b} \leq \lambda _t^{W{\mathrm{,}}b}{\mathrm{,}}{\text{ }}\forall t $ (20)

    ii) Grid stability requires

    $ {f_{{\rm{min}} }} \leq {f_t} \leq {f_{{\rm{max}} }}{\mathrm{,}}{\text{ }}\forall t $ (21a)

    $ {{\mathbf{v}}_{{\rm{min}} }} \leq {{\mathbf{v}}_t} \leq {{\mathbf{v}}_{{\rm{max}} }}{\mathrm{,}}{\text{ }}\forall t $ (21b)

    where $ {f}_{\mathrm{m}\mathrm{i}\mathrm{n}} $ and $ {f}_{\mathrm{m}\mathrm{a}\mathrm{x}} $ are the minimum and maximum frequencies permissible for the grid, respectively; $ {\mathbf{v}}_{\mathrm{m}\mathrm{i}\mathrm{n}} $ and $ {\mathbf{v}}_{\mathrm{m}\mathrm{a}\mathrm{x}} $ are the vectors of minimum and maximum permissible voltages at critical points in the grid, respectively.

    iii) Power balance requires

    $ \sum _{i\in \mathcal{V}}{P}_{i{\mathrm{,}}t}^{\text{VPP}}+{P}_{t}^{\text{DSO}{\mathrm{,}}b}-{P}_{t}^{\text{DSO}{\mathrm{,}}s}=0{\mathrm{,}}\text{ }\forall t $ (22)

    2) VPP constraints

    Several constraints are imposed on VPP operations to ensure that VPPs are managed effectively and integrated beneficially and sustainably within the larger energy system. These constraints relate to generation limits, storage capacities, interruptible loads, and the overall power balance. Each of them is essential for maintaining grid stability, optimizing economic efficiency, and supporting the integration of renewable energy sources.

    i) For each VPP $ i\in \mathcal{V} $, generation limits require

    $ 0 \leq P_{i{\mathrm{,}}t}^{{\text{MT}}} \leq P_{i{\mathrm{,}}{\text{max}}}^{{\text{MT}}}{\mathrm{,}}{\text{ }}\forall t $ (23a)

    $ P_{i{\mathrm{,}}t}^W \leq r_{i{\mathrm{,}}t}^W{\mathrm{,}}{\text{ }}\forall t $ (23b)

    $ P_{i{\mathrm{,}}t}^{{\text{PV}}} \leq r_{i{\mathrm{,}}t}^{{\text{PV}}}{\mathrm{,}}{\text{ }}\forall t $ (23c)

    where $ P_{i{\mathrm{,}}t}^W $ and $ P_{i{\mathrm{,}}t}^{{\text{PV}}} $ are the wind and PV generation of VPP $ i $ at time $ t $, respectively; $ {r}_{i{\mathrm{,}}t}^{W} $ and $ {r}_{i{\mathrm{,}}t}^{\mathrm{P}\mathrm{V}} $ refer to the available wind power generation capacity and the available solar PV generation capacity for VPP $ i $ at time $ t $, respectively.

    ii) For each VPP $ i\in \mathcal{V} $, storage capacities satisfy

    $ P_{i{\mathrm{,}}{\rm{min}} }^{{\text{ES}}} \leq P_{i{\mathrm{,}}t}^{{\text{ES}}} \leq P_{i{\mathrm{,}}{\rm{max}} }^{{\text{ES}}}{\mathrm{,}}{\text{ }}\forall t $ (24a)

    $ S_{i{\mathrm{,}}{\rm{min}} }^{{\text{ES}}} \leq S_{i{\mathrm{,}}t}^{{\text{ES}}} \leq S_{i{\mathrm{,}}{\rm{max}} }^{{\text{ES}}}{\mathrm{,}}{\text{ }}\forall t $ (24b)

    $ S_{i{\mathrm{,}}t + 1}^{{\text{ES}}} = S_{i{\mathrm{,}}t}^{{\text{ES}}} - \frac{{{{\Delta }}t}}{{E_{i{\mathrm{,}}{\rm{max}} }^{{\text{ES}}}}}P_{i{\mathrm{,}}t}^{{\text{ES}}}{\mathrm{,}}{\text{ }}\forall t $ (24c)

    where $ {S}_{i{\mathrm{,}}t}^{\mathrm{E}\mathrm{S}} $ refers to the state of charge of the energy storage system for VPP $ i $ at time $ t $; $S_{i{\mathrm{,}}{\rm{min}} }^{{\text{ES}}}$ and $S_{i{\mathrm{,}}{\rm{max}} }^{{\text{ES}}}$ refer to the corresponding minimum and maximum storage levels for VPP $ i $; $E_{i{\mathrm{,}}{\rm{max}} }^{{\text{ES}}}$ refers to the maximum energy capacity of the storage system for VPP $ i $; and $ \Delta t $ is the duration of one time step.

    iii) The interruptible load requires

    $ 0 \leq P_{i{\mathrm{,}}t}^{{\text{IL}}} \leq P_{i{\mathrm{,}}{\rm{max}} }^{{\text{IL}}}{\mathrm{,}}\;\forall t $ (25)

    iv) Power balance requires

    $ P_{i{\mathrm{,}}t}^{{\text{VPP}}} + P_{i{\mathrm{,}}t}^{{\text{MT}}} + P_{i{\mathrm{,}}t}^W + P_{i{\mathrm{,}}t}^{{\text{PV}}} + P_{i{\mathrm{,}}t}^{{\text{ES}}} + P_{i{\mathrm{,}}t}^{{\text{IL}}} = {d_{i{\mathrm{,}}t}}{\mathrm{,}}{\text{ }}\forall t $ (26)

    Equation (26) means that the sum of the power exchanged with the grid ($P_{i{\mathrm{,}}t}^{{\text{VPP}}}$), micro-turbine generation ($P_{i{\mathrm{,}}t}^{{\text{MT}}}$), wind generation ($P_{i{\mathrm{,}}t}^W$), PV generation ($P_{i{\mathrm{,}}t}^{{\text{PV}}}$), energy storage contribution ($P_{i{\mathrm{,}}t}^{{\text{ES}}}$), and interruptible load adjustment ($P_{i{\mathrm{,}}t}^{{\text{IL}}}$) must equal the total power demand ${d_{i{\mathrm{,}}t}}$ of VPP $ i $ at each time step $ t $.
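
    Two of these constraints translate directly into code: The state-of-charge recursion (24c) and the power balance (26). The sketch below implements both; the sign convention (positive $P_{i{\mathrm{,}}t}^{{\text{ES}}}$ discharges the storage) and all numerical values are assumptions for illustration.

```python
def soc_update(soc, p_es, dt, e_max):
    """State-of-charge recursion, Eq. (24c): S_{t+1} = S_t - (dt / E_max) * P_ES.

    Positive p_es discharges the storage (injects power); soc is a fraction of E_max.
    """
    return soc - (dt / e_max) * p_es

def power_balance_residual(p_vpp, p_mt, p_w, p_pv, p_es, p_il, demand):
    """Residual of Eq. (26); zero means the VPP exactly meets its local demand."""
    return p_vpp + p_mt + p_w + p_pv + p_es + p_il - demand

soc_next = soc_update(soc=0.6, p_es=1.2, dt=0.25, e_max=4.0)   # 0.6 - 0.075 = 0.525
residual = power_balance_residual(p_vpp=-0.5, p_mt=1.0, p_w=0.8,
                                  p_pv=0.4, p_es=1.2, p_il=0.1, demand=3.0)
print(soc_next, residual)   # 0.525 0.0
```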

    2.7 Optimization objective

    The overall system optimization objective is to maximize the cumulative rewards of all agents while maintaining system efficiency and stability:

    $ \underset{{\pi }_{\text{DSO}}{\mathrm{,}}{\left\{{\pi }_{i}\right\}}_{i\in \mathcal{V}}}{\mathrm{max}}\mathbb{E}\left[\sum _{t=1}^{T}{\gamma }^{t}\left({r}_{\text{DSO}}^{t}+\sum _{i\in \mathcal{V}}{r}_{i}^{t}\right)\right] $ (27)

    subject to all constraints defined in subsection 2.6, where $ \gamma \in (0{\mathrm{,}}\mathrm{ }1] $ is a discount factor; $ {\pi }_{\mathrm{D}\mathrm{S}\mathrm{O}} $ and $ {\pi }_{i} $ are the policies of DSO and VPP $ i $, respectively; $ \mathbb{E} $ denotes the expected value operator.

    Our goal is to develop an MARL framework where DSO and VPPs learn optimal policies $ {\pi }_{\mathrm{D}\mathrm{S}\mathrm{O}} $ and $ {\pi }_{i} $ that maximize their cumulative rewards while satisfying all operational and regulatory constraints. Fig. 1 shows the interaction dynamics between DSO and VPPs within the MAMDP framework. The problem encapsulates the challenges of coordinating multiple agents with potentially conflicting objectives in a dynamic and uncertain environment.


    Figure 1. Interaction dynamics between the DSO and VPPs within the MAMDP framework.

    3 Proposed MARL framework

    3.1 Overview of the MARL approach

    To address the complex VPP network optimization problem, we propose an MARL framework. MARL is particularly suitable due to its ability to handle decentralized decision-making in partially observable environments [21], emergent cooperative and competitive behaviors [19], continuous adaptation to dynamic conditions [24], and effective exploration-exploitation balance [26]. Our framework treats DSO and each VPP as independent learning agents, each maximizing their respective reward functions $ {r}_{\mathrm{D}\mathrm{S}\mathrm{O}}^{t} $ and $ {r}_{i}^{t} $ while implicitly coordinating through the shared environment. This approach ensures scalability and preserves VPP privacy [25]. The proposed MARL framework employs a hierarchical structure to effectively manage the complex interactions between DSO and multiple VPPs. This approach is inspired by the natural leader-follower relationship observed in power systems and aims to enable scalable and efficient decision-making in dynamic environments [43].

    The hierarchical structure consists of two distinct levels: The upper level, represented by DSO, and the lower level, comprising individual VPPs. At the upper level, DSO operates strategically, making decisions that influence the entire system, such as setting dynamic electricity prices and coordinating overall grid stability. These decisions are typically made over longer time horizons, such as hourly or daily intervals. In contrast, the lower-level VPPs operate at an operational level, managing local energy resources and optimizing local objectives in response to the DSO’s strategic directives [44]. VPPs adjust their operations more frequently, often on a scale of minutes, to adapt to rapidly changing conditions.

    This temporal and decision-making hierarchy distinguishes our approach from traditional co-learning frameworks. In co-learning setups, agents typically operate on an equal footing, learning simultaneously without a clear hierarchical structure [45]. Our framework establishes an asymmetric relationship where DSO’s decisions, particularly pricing strategies, create a context within which VPPs optimize their local operations. This leader-follower dynamic ensures that while each agent optimizes its own objectives, the VPPs’ actions ultimately contribute to system-wide goals set by DSO.

    Our choice of a decentralized MARL approach is motivated by several key factors inherent to VPP networks. First, decentralized methods offer superior scalability, crucial for managing large numbers of DERs without creating computational bottlenecks. This scalability aligns well with the modular and expandable nature of VPP networks. Second, decentralization preserves the privacy and autonomy of individual VPPs, allowing them to protect sensitive operational data while still participating in coordinated actions. This is particularly important in competitive energy markets where VPPs may be operated by different entities. Last, a decentralized approach mirrors the physically distributed nature of VPP networks, enabling more robust and resilient control that can adapt to local conditions and continue functioning even if communications with a central controller are disrupted. While centralized and semi-centralized MARL methods offer potential benefits in terms of global optimality, they face significant challenges in the context of VPP networks. Centralized approaches may struggle with the high-dimensional state and action spaces resulting from numerous DERs, leading to computational intractability for large-scale systems. They also introduce a single point of failure, which could compromise the entire network’s operations. Semi-centralized methods, while mitigating some of these issues, still face scalability challenges and may not fully respect the autonomy desired by individual VPPs.

    We acknowledge that a comparative evaluation including centralized and semi-centralized methods could provide valuable insights. However, due to the scope and resource constraints of this study, we focused on thoroughly exploring and optimizing the decentralized approach. Future work could involve implementing these alternative methods to quantitatively assess their performance trade-offs in various VPP network scenarios.

    3.2 Agent architecture

    We employ an actor-critic architecture for each agent, suitable for continuous action spaces in power systems [46]. The structure consists of the actor network and the critic network.

    i) The actor network learns the policy $ {\pi }_{\theta }({\bf{a}} \mid {\bf{s}}) $ mapping states to actions:

    $ {{\bf{a}}}_{t}={\pi }_{\theta } \left({{\bf{s}}}_{t}\right)+{ϵ}_{t} $ (28)

    where $ {ϵ}_{t} $ denotes exploration noise and $ \theta $ is the set of parameters or weights of the actor network.

    ii) The critic network learns the action-value function $ {Q}_{\phi }(\mathbf{s}{\mathrm{,}}\;\mathbf{a}) $:

    $ {Q}_{\phi } \left({{\bf{s}}}_{t}{\mathrm{,}}\;{{\bf{a}}}_{t}\right)=\mathbb{E}\left[\sum _{k=0}^{\infty }{\gamma }^{k}{r}_{t+k}\mid {{\bf{s}}}_{t}{\mathrm{,}}\;{{\bf{a}}}_{t}\right] $ (29)

    where $ \phi $ is the set of parameters or weights of the critic network.

    Given the high-dimensional state space, we design a state encoder $ {E}_{\psi }\left(s\right) $:

    $ {e_t} = {E_\psi } \left( {{{\mathbf{s}}_t}} \right) = {\text{FC}} \left( {\left[ {{E_{{\text{market}}}} \left( {{{\mathbf{p}}_t}} \right){\mathrm{;}}{\text{ }}{E_{{\text{VPP}}}} \left( {{{\left\{ {{{\mathbf{x}}_{i{\mathrm{,}}t}}} \right\}}_{i \in \mathcal{V}}}} \right){\mathrm{;}}{\text{ }}{E_{{\text{grid}}}} \left( {{{\mathbf{y}}_t}} \right)} \right]} \right) $ (30)

    where $ \psi $ represents the learnable parameters (weights and biases) of the state encoder network, and FC denotes a fully connected neural network layer that processes the concatenated outputs from the market price, VPP state, and grid state encoders. $ {E_{{\text{market}}}} \left( {{{\mathbf{p}}_t}} \right) $, $ {E_{{\text{VPP}}}} \left( {{{\left\{ {{{\mathbf{x}}_{i{\mathrm{,}}t}}} \right\}}_{i \in \mathcal{V}}}} \right) $, and $ {E_{{\text{grid}}}} \left( {{{\mathbf{y}}_t}} \right) $ are the market price encoder, VPP state encoder, and grid state encoder, respectively, and can be calculated as

    $ {E_{{\text{market}}}} \left( {{{\mathbf{p}}_t}} \right) = {\text{MLP}} \left( {{{\mathbf{p}}_t}} \right) $ (31)

    $ {E_{{\text{VPP}}}} \left( {{{\left\{ {{{\mathbf{x}}_{i{\mathrm{,}}t}}} \right\}}_{i \in \mathcal{V}}}} \right) = {\text{LSTM}} \left( {{\text{CNN}} \left( {{{\left\{ {{{\mathbf{x}}_{i{\mathrm{,}}t}}} \right\}}_{i \in \mathcal{V}}}} \right)} \right) $ (32)

    $ {E_{{\text{grid}}}} \left( {{{\mathbf{y}}_t}} \right) = {\text{GNN}} \left( {{{\mathbf{y}}_t}} \right) $ (33)

    The state encoding architecture employs three specialized neural network components: A multi-layer perceptron (MLP) processes market price data by learning nonlinear relationships through multiple fully connected layers with backpropagation optimization. The long short-term memory (LSTM) network, built upon a convolutional neural network (CNN), captures temporal dependencies and long-term patterns in VPP state sequences, enabling effective historical data processing and future state prediction. The graph neural network (GNN) processes grid state data by leveraging the inherent network structure of power systems, allowing it to learn representations that preserve topological relationships and operational constraints between different grid components.

    The encoded state $ {e}_{t} $ is used as input to both actor and critic networks:

    $ {{\mathbf{a}}_t} = {\pi _\theta } \left( {{e_t}} \right) $ (34a)

    $ {Q_\phi } \left( {{{\mathbf{s}}_t}{\mathrm{,}}{\text{ }}{{\mathbf{a}}_t}} \right) = {Q_\phi } \left( {{e_t}{\mathrm{,}}{\text{ }}{{\mathbf{a}}_t}} \right) $ (34b)
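
    A simplified PyTorch sketch of the encoder-actor-critic pipeline in (30)–(34) is shown below. For brevity, the LSTM/CNN branch of (32) and the GNN branch of (33) are replaced by small fully connected encoders, and all layer sizes and dimensions are illustrative choices rather than the architecture used in the experiments.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Simplified E_psi(s_t): separate encoders for p_t, {x_{i,t}}, y_t, then a fusion FC layer."""
    def __init__(self, n_vpps, d_price=4, d_vpp=5, d_grid=4, d_emb=32):
        super().__init__()
        self.enc_market = nn.Sequential(nn.Linear(d_price, d_emb), nn.ReLU())        # ~Eq. (31)
        self.enc_vpp = nn.Sequential(nn.Linear(n_vpps * d_vpp, d_emb), nn.ReLU())    # stands in for (32)
        self.enc_grid = nn.Sequential(nn.Linear(d_grid, d_emb), nn.ReLU())           # stands in for (33)
        self.fc = nn.Linear(3 * d_emb, d_emb)                                        # fusion, Eq. (30)

    def forward(self, p, x, y):
        e = torch.cat([self.enc_market(p), self.enc_vpp(x.flatten(1)), self.enc_grid(y)], dim=-1)
        return self.fc(e)

class Actor(nn.Module):
    def __init__(self, d_emb, d_action):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_emb, 64), nn.ReLU(), nn.Linear(64, d_action), nn.Tanh())

    def forward(self, e):
        return self.net(e)                               # a_t = pi_theta(e_t), Eq. (34a)

class Critic(nn.Module):
    def __init__(self, d_emb, d_action):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_emb + d_action, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, e, a):
        return self.net(torch.cat([e, a], dim=-1))       # Q_phi(e_t, a_t), Eq. (34b)

# Forward pass with batch size 2 and 3 VPPs (all dimensions are illustrative).
enc, actor, critic = StateEncoder(n_vpps=3), Actor(32, 5), Critic(32, 5)
e_t = enc(torch.rand(2, 4), torch.rand(2, 3, 5), torch.rand(2, 4))
a_t = actor(e_t)
print(critic(e_t, a_t).shape)   # torch.Size([2, 1])
```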

    3.3 MARL algorithm

    For our VPP network optimization problem, we adopt the multi-agent deep deterministic policy gradient (MADDPG) algorithm [24]. MADDPG extends the deep deterministic policy gradient (DDPG) algorithm to multi-agent settings, making it well-suited for our decentralized VPP control scenario. The key advantages of MADDPG in our context include i) handling continuous action spaces, essential for fine-grained control of power flows and prices [47] and ii) centralized training with decentralized execution, allowing agents to learn cooperative behaviors while maintaining individual decision-making capabilities [48].

    In MADDPG, each agent $ i $ has an actor network $ {\pi }_{i} $ and a critic network $ {Q}_{i} $. The actor is updated using the policy gradient:

    $ {\nabla _{{\theta _i}}}J({\theta _i}) = {\mathbb{E}_{{\mathbf{s}}{\mathrm{,}}{\mathbf{a}} \sim \mathcal{D}}}\left[ {{{\left. {{\nabla _{{\theta _i}}}{\pi _i}({{\mathbf{a}}_i} \mid {{\mathbf{s}}_i}){\nabla _{{{\mathbf{a}}_i}}}{Q_i}({\mathbf{s}}{\mathrm{,}}\;{{\mathbf{a}}_1}{\mathrm{,}}\;{{\mathbf{a}}_2}{\mathrm{,}}\; \cdots {\mathrm{,}}\;{{\mathbf{a}}_N})} \right|}_{{{\mathbf{a}}_i} = {\pi _i}({{\mathbf{s}}_i})}}} \right] $ (35)

    where $ {\theta }_{i} $ represents the parameter of agent $ i $’s actor network; $ {\mathbf{s}}_{i} $ and $ {\mathbf{a}}_{i} $ denote the state observations and actions of agent $ i $, respectively; $ \mathcal{D} $ is the replay buffer. The critic network $ {Q}_{i} $ evaluates the joint state-action value function, taking as input the global state $ \mathbf{s} $ and actions from all agents ($ {\mathbf{a}}_{1}{\mathrm{,}}\;{\mathbf{a}}_{2}{\mathrm{,}}\;\cdots {\mathrm{,}}\;{\mathbf{a}}_{N} $). The critic is updated to minimize the loss:

    $ \mathcal{L}({\theta _i}) = {\mathbb{E}_{{\mathbf{s}}{\mathrm{,}}{\mathbf{a}}{\mathrm{,}}r{\mathrm{,}}{\mathbf{s}}' \sim \mathcal{D}}}\left[ {{{\left( {{Q_i}({\mathbf{s}}{\mathrm{,}}\;{{\mathbf{a}}_1}{\mathrm{,}}\;{{\mathbf{a}}_2}{\mathrm{,}}\; \cdots {\mathrm{,}}\;{{\mathbf{a}}_N}) - y} \right)}^2}} \right] $ (36)

    where $ y = {r_i} + {\left. {\gamma {{Q}_i'}({\mathbf{s}}'{\mathrm{,}}\;{{{\mathbf{a}}}_1'}{\mathrm{,}}\;{{{\mathbf{a}}}_2'}{\mathrm{,}}\; \cdots {\mathrm{,}}\;{{{\mathbf{a}}}_N'})} \right|_{{{\mathbf{a}}_j'} = {{\pi '}_j}({{{\mathbf{s}}'}_j})}} $. Here the prime symbol (′) denotes the target networks.
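
    The sketch below walks through one MADDPG update step per (35) and (36) for a toy setting with three agents: Each critic is regressed toward the target $ y $ computed with the target networks, and each actor ascends its critic's Q-value with respect to its own action (implemented as descent on the negative Q-value). Network sizes, the batch size, and the randomly generated mini-batch are placeholders, not the configuration used in our experiments.

```python
import copy
import torch
import torch.nn as nn

n_agents, d_obs, d_act, gamma = 3, 8, 2, 0.95
actors = [nn.Sequential(nn.Linear(d_obs, 32), nn.ReLU(), nn.Linear(32, d_act), nn.Tanh())
          for _ in range(n_agents)]
critics = [nn.Sequential(nn.Linear(n_agents * (d_obs + d_act), 32), nn.ReLU(), nn.Linear(32, 1))
           for _ in range(n_agents)]
actors_tgt = [copy.deepcopy(a) for a in actors]     # target policies pi'_j
critics_tgt = [copy.deepcopy(c) for c in critics]   # target critics Q'_i
opt_a = [torch.optim.Adam(a.parameters(), lr=1e-3) for a in actors]
opt_c = [torch.optim.Adam(c.parameters(), lr=1e-3) for c in critics]

# One mini-batch of B joint transitions sampled from the replay buffer D (random placeholders).
B = 16
obs, acts = torch.rand(B, n_agents, d_obs), torch.rand(B, n_agents, d_act)
rewards, obs_next = torch.rand(B, n_agents), torch.rand(B, n_agents, d_obs)

for i in range(n_agents):
    # Critic update, Eq. (36): regress Q_i(s, a_1, ..., a_N) toward y = r_i + gamma * Q'_i(s', a'_1, ..., a'_N).
    with torch.no_grad():
        acts_next = torch.stack([actors_tgt[j](obs_next[:, j]) for j in range(n_agents)], dim=1)
        y = rewards[:, i:i + 1] + gamma * critics_tgt[i](
            torch.cat([obs_next.flatten(1), acts_next.flatten(1)], dim=-1))
    q = critics[i](torch.cat([obs.flatten(1), acts.flatten(1)], dim=-1))
    critic_loss = ((q - y) ** 2).mean()
    opt_c[i].zero_grad(); critic_loss.backward(); opt_c[i].step()

    # Actor update, Eq. (35): ascend Q_i w.r.t. agent i's own action (descend the negative Q-value).
    acts_pi = [actors[j](obs[:, j]) if j == i else acts[:, j] for j in range(n_agents)]
    actor_loss = -critics[i](torch.cat([obs.flatten(1)] + acts_pi, dim=-1)).mean()
    opt_a[i].zero_grad(); actor_loss.backward(); opt_a[i].step()
```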

    To address the specific challenges of our VPP network optimization problem, the following adaptations are introduced to the standard MADDPG algorithm:

    a) Hierarchical learning: A two-level learning structure is implemented, where DSO learns at a slower timescale compared with VPPs.

    b) Constraint handling: A CPO approach is incorporated to ensure that the learned policies satisfy the constraints.

    c) Multi-objective optimization: The critic network is modified to output multiple Q-values corresponding to different objectives (e.g., profit, stability, and renewable integration).

    To address the scalability challenges inherent in large-scale VPP networks, an attention mechanism is incorporated into our MARL framework. Attention mechanisms, originally developed for natural language processing tasks [49], have shown great promise in enhancing the efficiency and effectiveness of multi-agent systems [50]. In our framework, the attention mechanism enables each agent (whether DSO or VPP) to selectively focus on the most relevant information from other agents, thereby reducing the input dimension and computational complexity. This selective focus is particularly crucial in VPP networks, where the number of agents can be large and the state space is high-dimensional.

    The attention mechanism operates by computing a weighted sum of inputs from other agents, where the weights are dynamically determined based on the current state and the agent’s query. Mathematically, for an agent i, the attention-weighted input $ {z}_{i} $ is computed as follows:

    $ {z}_{i}=\sum _{j\ne i}{\alpha }_{i{\mathrm{,}}j}{{\bf{h}}}_{j} $ (37)

    where $ {\mathbf{h}}_{j} $ is the value vector of agent j, and $ {\alpha }_{i{\mathrm{,}}j} $ is the attention weight that agent i assigns to agent j. The attention weight is computed using a softmax function:

    $ {\alpha }_{i{\mathrm{,}}j}=\frac{\mathrm{exp}\left({\eta }_{i{\mathrm{,}}j}\right)}{\displaystyle\sum _{k\ne i}\mathrm{exp}\left({\eta }_{i{\mathrm{,}}k}\right)} $ (38)

    where $ {\eta }_{i{\mathrm{,}}j} $ is a scalar representing the importance of agent j’s information to agent i, calculated as

    $ {\eta _{i{\mathrm{,}}j}} = f({{\mathbf{q}}_i}{\mathrm{,}}\;{{\mathbf{k}}_j}) $ (39)

    where $ {\mathbf{q}}_{i} $ is the query vector of agent i, $ {\mathbf{k}}_{j} $ is the key vector of agent j, and f is a compatibility function (often implemented as a simple dot product).

    This attention mechanism addresses scalability in two key ways. First, by focusing only on relevant information, it reduces the effective input dimension for each agent: Instead of processing the full state information from all agents, which would scale linearly with the number of agents, each agent processes a fixed-size representation regardless of the number of agents in the network. Second, the selective focus helps agents learn more efficiently in large-scale networks; by prioritizing important information, agents can more quickly learn effective policies even when the number of agents in the system increases.

    In our implementation, a multi-head attention mechanism is used [51], which allows each agent to attend to different aspects of other agents’ information simultaneously. This further enhances the expressive power of the model without significantly increasing the computational cost. By incorporating this attention mechanism, our MARL framework can efficiently handle large-scale VPP networks, maintaining good performance and learning efficiency even as the number of agents increases.
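
    A single-head version of the attention computation in (37)–(39) can be sketched in a few lines, assuming a dot-product compatibility function $ f $; the vector dimensions and random inputs are illustrative, and the multi-head variant would repeat this with several projection sets.

```python
import numpy as np

def attention_aggregate(q_i, keys, values):
    """Compute z_i = sum_j alpha_{i,j} h_j, Eqs. (37)-(39), with f(q, k) = q . k (dot product)."""
    scores = keys @ q_i                                   # eta_{i,j} = f(q_i, k_j), Eq. (39)
    weights = np.exp(scores - scores.max())               # softmax of Eq. (38), shifted for stability
    weights = weights / weights.sum()
    return weights @ values, weights                      # z_i, Eq. (37)

# Agent i attends over 4 other agents; query/key/value dimensions are placeholders.
rng = np.random.default_rng(0)
q_i = rng.normal(size=8)                # query vector of agent i
keys = rng.normal(size=(4, 8))          # key vectors k_j of the other agents
values = rng.normal(size=(4, 16))       # value vectors h_j of the other agents
z_i, alpha = attention_aggregate(q_i, keys, values)
print(z_i.shape, alpha.round(3))        # (16,) and attention weights summing to 1
```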

    3.4 Experience replay mechanism

    The experience replay mechanism is crucial for stabilizing learning in our MARL framework. We employ a prioritized experience replay buffer $ \mathcal{D} $ [52], which stores transitions $ ({\mathbf{S}}_{t}{\mathrm{,}}\;{\mathbf{A}}_{t}{\mathrm{,}}\;{\mathbf{R}}_{t}{\mathrm{,}}\;{\mathbf{S}}_{t+1}) $ from all agents, where $ {\mathbf{S}}_{t}=\left[{s}_{1}^{t}{\mathrm{,}}\;{s}_{2}^{t}{\mathrm{,}}\;\cdots {\mathrm{,}}\;{s}_{N}^{t}\right] $ is the joint state of all $ N $ agents at time $ t $, $ {s}_{i}^{t}\in {\mathbb{R}}^{{n}_{s}} $ represents the local state observation of agent $ i $, and $ {n}_{s} $ is the dimension of each agent’s state space; $ {\mathbf{A}}_{t}=\left[{a}_{1}^{t}{\mathrm{,}}\;{a}_{2}^{t}{\mathrm{,}}\; \cdots {\mathrm{,}}\;{a}_{N}^{t}\right] $ represents the joint actions of all agents, $ {a}_{i}^{t}\in {\mathbb{R}}^{{n}_{a}} $ is the action taken by agent $ i $, and $ {n}_{a} $ is the dimension of each agent’s action space; $ {\mathbf{R}}_{t}=\left[{r}_{1}^{t}{\mathrm{,}}\;{r}_{2}^{t}{\mathrm{,}}\; \cdots {\mathrm{,}}\;{r}_{N}^{t}\right] $ contains the rewards received by each agent, $ {r}_{i}^{t}\in \mathbb{R} $ is the scalar reward received by agent $ i $, and each $ {r}_{i}^{t} $ is computed based on the agent’s contribution to system objectives; $ {\mathbf{S}}_{t+1} $ is the next joint state after taking actions $ {\mathbf{A}}_{t} $.

    The probability of sampling a transition i from the replay buffer $ \mathcal{D} $ is given by

    $ {\mathcal{P}}(i)=\frac{p_{i}^{\mu}}{\displaystyle\sum_{k} p_{k}^{\mu}} $ (40)

    where $ {p}_{i} $ is the priority of transition $ i $ and μ determines the intensity of prioritization.
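
    The prioritized sampling rule (40) amounts to a weighted draw over stored transitions, as the brief sketch below shows; the priority values and the exponent are placeholders (priorities are typically derived from the magnitude of the temporal-difference error plus a small constant).

```python
import numpy as np

def sample_prioritized(priorities, batch_size, mu=0.6, rng=np.random.default_rng()):
    """Sample transition indices with P(i) = p_i^mu / sum_k p_k^mu, Eq. (40)."""
    probs = np.asarray(priorities, dtype=float) ** mu
    probs /= probs.sum()
    return rng.choice(len(probs), size=batch_size, p=probs), probs

# Priorities are placeholder values here.
idx, probs = sample_prioritized([0.9, 0.1, 0.5, 2.0], batch_size=3)
print(idx, probs.round(3))
```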

    3.5 Exploration strategies

    Balancing exploration and exploitation is critical in our VPP optimization problem due to the dynamic nature of energy markets and the potential for local optima. Here a hybrid exploration strategy is employed.

    Adaptive noise is added to the parameters of the actor network [53]:

    $ {\boldsymbol{\omega }}^{{\text{new}}} = {\boldsymbol{\omega }}^{{\text{curr}}} + \mathcal{N}(0{\mathrm{,}}\;{\sigma ^2}{\mathbf{I}}) $ (41)

    where $ {\boldsymbol{\omega }}^{\text{new}} $ and $ {\boldsymbol{\omega }}^{\text{curr}} $ represent the updated and current network parameters, respectively, $ \mathbf{I} $ is the identity matrix of appropriate dimension, and $ {\sigma }^{2} $ controls the magnitude of the exploration noise. For temporally correlated exploration in the action space, we use the Ornstein–Uhlenbeck (OU) process [54]:

    $ {\text{d}}{x_t} = \theta \left( {\mu - {x_t}} \right){\text{d}}t + \sigma {\text{d}}{W_t} $ (42)

    where $ \theta $ is the rate of mean reversion; $ \mu $ is the long-term mean; $ \sigma $ is the volatility; $ {x}_{t} $ is the exploration state variable evolving over time according to the OU dynamics; $ {W}_{t} $ is a standard Wiener process (Brownian motion), a continuous-time stochastic process with continuous but nowhere differentiable paths that drives the evolution of $ {x}_{t} $. The OU process is mean reverting, that is, $ {x}_{t} $ drifts toward the long-term mean $ \mu $ at a rate proportional to its distance from $ \mu $.
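
    A discretized (Euler–Maruyama) step of the OU process in (42) is sketched below; the parameter values are common defaults from the DDPG literature rather than the settings used in our experiments.

```python
import numpy as np

def ou_step(x, theta=0.15, mu=0.0, sigma=0.2, dt=1e-2, rng=np.random.default_rng()):
    """One Euler-Maruyama step of the OU process, Eq. (42): dx = theta*(mu - x)*dt + sigma*dW."""
    return x + theta * (mu - x) * dt + sigma * np.sqrt(dt) * rng.standard_normal(np.shape(x))

# Temporally correlated exploration noise added to a 3-dimensional action (placeholder dimensions).
noise = np.zeros(3)
for _ in range(5):
    noise = ou_step(noise)
print(noise)
```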

    3.6 Learning process

    In our MADDPG framework, the policy update for each agent i follows the deterministic policy gradient:

    $ {\nabla _{{\theta _i}}}J\left( {{\theta _i}} \right) = \mathbb{E}_{{\mathbf{s}} \sim \mathcal{D}}\left[ {{{\left. {{\nabla _{{\theta _i}}}{\pi _i} \left( {{{\mathbf{s}}_i}} \right){\nabla _{{{\mathbf{a}}_i}}}{Q_i} \left( {{\mathbf{s}}{\mathrm{,}}\;{{\mathbf{a}}_1}{\mathrm{,}}\;{{\mathbf{a}}_2}{\mathrm{,}}\; \cdots {\mathrm{,}}\;{{\mathbf{a}}_N}} \right)} \right|}_{{{\mathbf{a}}_i} = {\pi _i} \left( {{{\mathbf{s}}_i}} \right)}}} \right] $ (43)

    where $ \mathcal{D} $ is the experience replay buffer [24].

    The policy parameters $ {\theta }_{i} $ are updated using the Adam optimizer:

    $ {\theta _i} \leftarrow {\theta _i} + \xi {\text{Adam}}\left( {{\nabla _{{\theta _i}}}J \left( {{\theta _i}} \right)} \right) $ (44)

    where $ \xi $ is the learning rate [55].

    The critic network $ {Q}_{i} $ is updated to minimize the mean squared Bellman error,

    $ \mathcal{L} \left( {{\phi _i}} \right) = {\mathbb{E}_{{\mathbf{s}}{\mathrm{,}}{\mathbf{a}}{\mathrm{,}}r{\mathrm{,}}{\mathbf{s}}' \sim \mathcal{D}}}\left[ {{{\left( {{Q_i} \left( {{\mathbf{s}}{\mathrm{,}}\;{{\mathbf{a}}_1}{\mathrm{,}}\;{{\mathbf{a}}_2}{\mathrm{,}}\; \cdots {\mathrm{,}}\;{{\mathbf{a}}_N}} \right) - y} \right)}^2}} \right] $ (45)

    where the target y is given by

    $ y = {r_i} + \gamma Q_i^\prime \left( {{\mathbf{s}}'{\mathrm{,}}\;\pi _1^\prime \left( {{\mathbf{s}}_1^\prime } \right){\mathrm{,}}\;\pi _2^\prime \left( {{\mathbf{s}}_2^\prime } \right){\mathrm{,}}\; \cdots {\mathrm{,}}\;\pi _N^\prime \left( {{\mathbf{s}}_N^\prime } \right)} \right) $ (46)

    where $ {Q}_{i}' $ and $ {\pi }_{i}' $ are target networks updated using soft updates [56]:

    $ \phi _i^\prime \leftarrow \tau {\phi _i} + (1 - \tau )\phi _i^\prime $ (47a)

    $ \theta _i^\prime \leftarrow \tau {\theta _i} + (1 - \tau )\theta _i^\prime $ (47b)

where $ \tau $ is a small positive constant. The critic network $ {Q}_{i} $ of agent $ i $ is parameterized by $ {\phi }_{i} $, and $ {\phi }_{i}' $ denotes the parameters of the corresponding target critic network.
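
The following condensed PyTorch sketch illustrates one MADDPG update step corresponding to (43)–(47); it assumes centralized critics that take the concatenated joint observation and joint action, and the network containers, optimizers, and batch layout shown here are illustrative rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def maddpg_update(i, actors, critics, target_actors, target_critics,
                  actor_opts, critic_opts, batch, gamma=0.99, tau=0.005):
    """One MADDPG update for agent i: critic per Eqs. (45)-(46), actor per
    Eqs. (43)-(44), and soft target updates per Eqs. (47a)-(47b)."""
    obs, acts, rews, next_obs = batch  # lists of per-agent tensors; rews[i]: (B, 1)

    # Critic update: minimize the mean squared Bellman error.
    with torch.no_grad():
        next_acts = [pi_t(o) for pi_t, o in zip(target_actors, next_obs)]
        y = rews[i] + gamma * target_critics[i](torch.cat(next_obs, dim=-1),
                                                torch.cat(next_acts, dim=-1))
    q = critics[i](torch.cat(obs, dim=-1), torch.cat(acts, dim=-1))
    critic_loss = F.mse_loss(q, y)
    critic_opts[i].zero_grad()
    critic_loss.backward()
    critic_opts[i].step()

    # Actor update: deterministic policy gradient through the centralized critic,
    # with agent i's buffered action replaced by pi_i(s_i).
    acts_pi = [a.detach() for a in acts]
    acts_pi[i] = actors[i](obs[i])
    actor_loss = -critics[i](torch.cat(obs, dim=-1), torch.cat(acts_pi, dim=-1)).mean()
    actor_opts[i].zero_grad()
    actor_loss.backward()
    actor_opts[i].step()

    # Soft target updates.
    for net, tgt in ((critics[i], target_critics[i]), (actors[i], target_actors[i])):
        for p, tp in zip(net.parameters(), tgt.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```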

    3.7 Constraint handling techniques

    We incorporate constraints into the objective function using Lagrange multipliers [57]:

    $ {\tilde{r}}_{i}({\bf{s}}{\mathrm{,}}\text{ }{\bf{a}}{\mathrm{,}}\text{ }\lambda )={r}_{i}({\bf{s}}{\mathrm{,}}\text{ }{\bf{a}})-\sum _{j}{\lambda }_{j}\mathrm{max}\left(0{\mathrm{,}}\text{ }{g}_{j}({\bf{s}}{\mathrm{,}}\text{ }{\bf{a}})\right) $ (48)

    where $ {\tilde{r}}_{i}(\mathbf{s}{\mathrm{,}}\;\mathbf{a}{\mathrm{,}}\;\lambda ) $ represents the modified reward function that incorporates constraint penalties; $ {r}_{i}(\mathbf{s}{\mathrm{,}}\;\mathbf{a}) $ is the original reward function for agent $ i $’s state-action pairs; $ {g}_{j}(\mathbf{s}{\mathrm{,}}\;\mathbf{a})\le 0 $ are the constraints; $ {\lambda }_{j} $ are Lagrange multipliers updated using

    $ {\lambda _j} \leftarrow {\lambda _j} + \beta {\rm{max}} \left( {0{\mathrm{,}}{\text{ }}{g_j}({\mathbf{s}}{\mathrm{,}}{\text{ }}{\mathbf{a}})} \right) $ (49)

where $ \beta $ is the learning rate for updating the Lagrange multipliers $ {\lambda }_{j} $. We adapt the CPO algorithm [58] to our multi-agent setting:

    $ \mathop {{\rm{max}} }\limits_{{\theta _i}} {J_{{R_i}}} \left( {{\theta _i}} \right)\;{\text{ s}}{\text{.t}}{\text{. }}\;{J_{{C_j}}} \left( {{\theta _i}} \right) \leq {d_j}{\mathrm{,}}{\text{ }}\forall j $ (50)

    where $ {d}_{j} $ defines the maximum allowable violation for each constraint, $ {J}_{{R}_{i}}\left({\theta }_{i}\right) $ is the expected return, and $ {J}_{{C}_{j}}\left({\theta }_{i}\right) $ is the constraint cost.
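
A minimal sketch of the Lagrangian reward shaping in (48) and the dual update in (49) is given below; the constraint values $ {g}_{j}(\mathbf{s}{\mathrm{,}}\;\mathbf{a}) $ are assumed to be supplied by the environment model, and the step size $ \beta $ shown is only indicative.

```python
import numpy as np

def shaped_reward(r_i, g_values, lam):
    """Modified reward of Eq. (48): original reward minus weighted constraint violations."""
    violations = np.maximum(0.0, np.asarray(g_values, dtype=float))  # max(0, g_j(s, a))
    return float(r_i) - float(np.dot(np.asarray(lam, dtype=float), violations))

def update_multipliers(lam, g_values, beta=1e-3):
    """Dual ascent of Eq. (49); the multipliers remain non-negative by construction."""
    violations = np.maximum(0.0, np.asarray(g_values, dtype=float))
    return np.asarray(lam, dtype=float) + beta * violations
```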

    For the DSO’s pricing decisions, we use a softmax output layer to ensure

    $ \lambda _t^{W{\mathrm{,}}s} \leq \lambda _t^{{\text{DA}}{\mathrm{,}}s} \leq \lambda _t^{{\text{DA}}{\mathrm{,}}b} \leq \lambda _t^{W{\mathrm{,}}b} $ (51)
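
One way a softmax output layer can enforce the ordering in (51) is to predict positive fractions of the price band and accumulate them, so that the day-ahead prices are placed monotonically between the wholesale selling and buying prices. The sketch below illustrates this construction; the layer sizes and the assumption that the wholesale prices act as exogenous per-hour bounds are illustrative, not a description of our exact network head.

```python
import torch
import torch.nn as nn

class OrderedPriceHead(nn.Module):
    """Softmax-based head producing day-ahead prices that satisfy Eq. (51):
    lambda^{W,s} <= lambda^{DA,s} <= lambda^{DA,b} <= lambda^{W,b}."""

    def __init__(self, in_dim):
        super().__init__()
        self.logits = nn.Linear(in_dim, 3)  # three positive segments of the price band

    def forward(self, h, lam_w_sell, lam_w_buy):
        # Softmax fractions are positive and sum to 1, so their cumulative sums are
        # strictly increasing within (0, 1); the first two cut points therefore give
        # ordered day-ahead selling and buying prices inside the wholesale band.
        cuts = torch.softmax(self.logits(h), dim=-1).cumsum(dim=-1)[..., :2]
        span = lam_w_buy - lam_w_sell
        lam_da_sell = lam_w_sell + span * cuts[..., 0:1]
        lam_da_buy = lam_w_sell + span * cuts[..., 1:2]
        return lam_da_sell, lam_da_buy
```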

    We implement a safety layer [59] that projects potentially unsafe actions to the nearest safe action:

$ {\mathbf{a}}_{\text{safe}} = \arg\underset{{\mathbf{a}}'}{\min}\left\| {\mathbf{a}}' - \mathbf{a} \right\|_{2}^{2}\;\text{ s}\text{.t}\text{. }\;{g}_{j}\left(\mathbf{s}{\mathrm{,}}\;{\mathbf{a}}'\right)\le 0{\mathrm{,}}\text{ }\forall j $ (52)

where $ \mathbf{a} $ is the original (potentially unsafe) action and $ {\mathbf{a}}' $ is the adjusted action computed by the safety layer so that all operational constraints are satisfied; $ \left\| \cdot \right\|_{2}^{2} $ denotes the squared Euclidean norm, a measure of the distance between two points in Euclidean space.
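
A generic sketch of the projection in (52), using a sequential quadratic programming solver, is shown below; the constraint callables $ {g}_{j} $ are assumed to be supplied by the system model, and the fallback to the original action on solver failure is an implementation choice, not a requirement of the method.

```python
import numpy as np
from scipy.optimize import minimize

def project_to_safe_action(a, s, constraint_fns):
    """Solve Eq. (52): the closest action a' (squared Euclidean norm) satisfying
    g_j(s, a') <= 0 for every constraint function g_j in `constraint_fns`."""
    a = np.asarray(a, dtype=float)
    # SLSQP expects inequality constraints of the form fun(x) >= 0, hence the sign flip.
    cons = [{"type": "ineq", "fun": (lambda ap, g=g: -g(s, ap))} for g in constraint_fns]
    res = minimize(lambda ap: float(np.sum((ap - a) ** 2)), x0=a,
                   method="SLSQP", constraints=cons)
    return res.x if res.success else a  # fall back to the original action
```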

    For VPP agents, we design the actor network to output inherently feasible actions:

    $ {P_{i{\mathrm{,}}t}^{{\text{MT}}} = P_{i{\mathrm{,}}{\rm{max}} }^{{\text{MT}}}\varsigma \left( {{f_{{\text{MT}}}} \left( {{{\mathbf{s}}_t}} \right)} \right)} $ (53a)

    $ {P_{i{\mathrm{,}}t}^{{\text{ES}}} = P_{i{\mathrm{,}}{\rm{max}} }^{{\text{ES}}}{\rm{tanh}} \left( {{f_{{\text{ES}}}} \left( {{{\mathbf{s}}_t}} \right)} \right)} $ (53b)

    $ {P_{i{\mathrm{,}}t}^{{\text{IL}}} = P_{i{\mathrm{,}}{\rm{max}} }^{{\text{IL}}}\varsigma \left( {{f_{{\text{IL}}}} \left( {{{\mathbf{s}}_t}} \right)} \right)} $ (53c)

where $ \varsigma $ is the sigmoid function, and $ {f}_{\mathrm{M}\mathrm{T}} $, $ {f}_{\mathrm{E}\mathrm{S}} $, and $ {f}_{\mathrm{I}\mathrm{L}} $ are neural network functions [60]. Our MARL framework incorporates constraints both as part of the optimization process and as a means of embedding domain knowledge. We distinguish between system constraints (e.g., power balance equations and voltage limits) and knowledge-based constraints that encode expert insights. These constraints guide exploration towards feasible solutions, accelerate convergence, enhance safety, and improve generalization. The constraints are implemented through action space shaping, penalty methods, and CPO. For instance, sigmoid activation functions ensure that power outputs remain within feasible ranges, and penalty terms are added to the reward function for soft constraints such as frequency deviations:

    $ {r_{{\text{penalty}}}} = - \alpha {\rm{max}} {\left( {0{\mathrm{,}}{\text{ }}\left| {f(t) - {f_{{\text{nominal}}}}} \right| - {f_{{\text{tolerance}}}}} \right)^2} $ (54)

where $ {r}_{\text{penalty}} $ is the penalty term added to the reward function; $ \alpha $ is a weighting factor controlling the magnitude of the penalty; $ f\left(t\right) $ is the system frequency at time $ t $; $ {f}_{\text{nominal}} $ is the nominal grid frequency; $ {f}_{\text{tolerance}} $ is the tolerance range for frequency deviations.
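
The sketch below illustrates the bounded action head of (53a)–(53c) and the frequency penalty of (54); the network branches are plain linear layers for brevity, the maximum power parameters must be supplied by the caller, and the weighting factor and the ±0.2 Hz band around 50 Hz are example values only.

```python
import torch
import torch.nn as nn

class VPPActorHead(nn.Module):
    """Bounded dispatch head implementing Eqs. (53a)-(53c)."""

    def __init__(self, in_dim, p_mt_max, p_es_max, p_il_max):
        super().__init__()
        self.f_mt = nn.Linear(in_dim, 1)  # micro-turbine branch f_MT
        self.f_es = nn.Linear(in_dim, 1)  # energy storage branch f_ES
        self.f_il = nn.Linear(in_dim, 1)  # interruptible load branch f_IL
        self.p_mt_max, self.p_es_max, self.p_il_max = p_mt_max, p_es_max, p_il_max

    def forward(self, s):
        p_mt = self.p_mt_max * torch.sigmoid(self.f_mt(s))  # in [0, P_max^MT]
        p_es = self.p_es_max * torch.tanh(self.f_es(s))     # in [-P_max^ES, P_max^ES]
        p_il = self.p_il_max * torch.sigmoid(self.f_il(s))  # in [0, P_max^IL]
        return p_mt, p_es, p_il

def frequency_penalty(f_t, f_nominal=50.0, f_tolerance=0.2, alpha=10.0):
    """Soft-constraint penalty of Eq. (54) for deviations beyond the tolerance band."""
    excess = max(0.0, abs(f_t - f_nominal) - f_tolerance)
    return -alpha * excess ** 2
```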

Together with the multi-agent CPO formulation in (50), this approach results in policies that are not only optimal but also practical and implementable in real-world power systems, effectively balancing theoretical performance with operational realities.

    4 Simulation environment and experimental setup

    4.1 Environment design

    Our simulation environment is designed to capture the key components and dynamics of the VPP network as formulated in Section 3. It consists of a DSO agent, multiple VPP agents ($N = 10 $ in our experiments), renewable energy sources (wind and solar PV), energy storage systems, flexible loads, and an electricity market. The interactions between these components are modeled based on the transition function described in subsection 3.4. The system dynamics are implemented by using a discrete-time simulation with hourly time steps. The environment is designed to be flexible, allowing for easy modification of parameters such as the number of VPPs, renewable energy penetration levels, and market conditions [46].
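
For concreteness, a stripped-down skeleton of such an environment under the Gym API is sketched below; the observation and action dimensions and the placeholder transition logic stand in for the full dynamics of Section 3 and are not the complete implementation.

```python
import gym
import numpy as np
from gym import spaces

class VPPNetworkEnv(gym.Env):
    """Skeleton of the hourly VPP-network environment (placeholder dynamics)."""

    def __init__(self, n_vpps=10, obs_dim=12, act_dim=3, horizon=8760):
        super().__init__()
        self.n_vpps, self.horizon = n_vpps, horizon
        # One observation/action block per agent (the DSO agent is handled analogously).
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(n_vpps, obs_dim))
        self.action_space = spaces.Box(-1.0, 1.0, shape=(n_vpps, act_dim))
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(self.observation_space.shape, dtype=np.float32)

    def step(self, actions):
        # Placeholder transition: the real implementation applies dispatch decisions,
        # prices, storage dynamics, and market clearing for hour t.
        self.t += 1
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        rewards = np.zeros(self.n_vpps, dtype=np.float32)  # one reward per agent
        done = self.t >= self.horizon
        return obs, rewards, done, {}
```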

    4.2 Synthetic data generation

To ensure a comprehensive evaluation of our MARL approach, synthetic data is generated for renewable energy generation profiles, load demand profiles, and market price variations. For renewable energy, the NREL PVWatts Calculator [61] and the Wind Integration National Dataset Toolkit [62] are used to generate realistic solar PV and wind power generation profiles, respectively. These profiles are scaled and adjusted to represent different VPP sizes and locations. Load demand profiles are synthesized using a combination of historical data from the OpenEI database [63] and statistical models, incorporating daily and seasonal variations, as well as random fluctuations to simulate realistic demand patterns. Electricity market prices are generated using a modified version of the autoregressive integrated moving average (ARIMA) model proposed by Weron [64], which captures both short-term fluctuations and long-term trends in electricity prices.

The use of synthetic data in our study was necessitated by several factors. Primarily, access to comprehensive, real-world VPP operational data is severely limited due to proprietary concerns, privacy regulations, and the nascent state of large-scale VPP deployments. We made concerted efforts to acquire actual data, including reaching out to several utility companies and VPP operators. However, the data available was either incomplete for our modeling needs or came with restrictive usage terms that would have limited the reproducibility of our research. Synthetic data offers several advantages in our context. It allows us to explore a wide range of scenarios, including edge cases that might be rare in available real-world data but crucial for testing the robustness of our MARL framework. We can also generate data for hypothetical future scenarios with higher renewable penetration or more advanced VPP configurations, enabling our model to be forward-looking.
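
As a simplified illustration of the kind of seasonal-plus-autoregressive price series described above, the following sketch generates an hourly price profile with daily seasonality and AR(1) fluctuations; the coefficients are arbitrary examples, not the fitted ARIMA parameters used in our experiments.

```python
import numpy as np

def synthesize_hourly_prices(n_hours=8760, base=50.0, daily_amp=15.0,
                             phi=0.9, noise_std=3.0, seed=0):
    """Toy hourly price series ($/MWh): daily seasonality plus AR(1) fluctuations."""
    rng = np.random.default_rng(seed)
    hours = np.arange(n_hours)
    seasonal = base + daily_amp * np.sin(2 * np.pi * (hours % 24) / 24 - np.pi / 2)
    ar = np.zeros(n_hours)
    for t in range(1, n_hours):             # AR(1) residual process
        ar[t] = phi * ar[t - 1] + rng.normal(0.0, noise_std)
    return np.maximum(seasonal + ar, 0.0)   # prices kept nonnegative
```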

    4.3 Implementation details

    Our MARL framework was implemented by using Python 3.8 and key software frameworks including PyTorch 1.8.0 for building and training neural networks, OpenAI Gym 0.18.0 for the reinforcement learning environment, Pandas 1.2.4 and NumPy 1.20.2 for data manipulation and numerical computations, and Matplotlib 3.4.1 for visualization.

    4.4 Scenarios and test cases

    To evaluate the robustness of our approach, we design a range of scenarios that vary along several dimensions: Renewable energy penetration (low: 20%; medium: 50%; high: 80%), energy storage capacity (limited: 25% of peak load; moderate: 50%; abundant: 100%), demand flexibility (low: 5% flexible load; medium: 15%; high: 30%), market volatility (stable; moderate; highly volatile), and network constraints (relaxed; moderate; stringent). We create a total of 27 distinct scenarios by combining these dimensions, allowing us to test the performance of our MARL approach under diverse conditions. Each scenario is simulated for a full year with hourly time steps to capture seasonal variations and long-term trends [26].

    4.5 Baseline models for comparison

    To evaluate the performance of our proposed MARL approach, three baseline models were implemented for comparison. The first is a Stackelberg game model by Dong et al. [65], which formulates the VPP network optimization as a leader-follower game. The second baseline is a model predictive control (MPC) approach, following the framework proposed by Parisio et al. [66] for microgrid energy management. This method uses rolling horizon optimization to make decisions based on short-term predictions. The third baseline is an SARL approach, adapting the deep Q-network (DQN) algorithm presented by Mocanu et al. [37] for smart grid applications. These diverse baselines allow us to compare our MARL approach against both traditional optimization methods and alternative machine learning techniques.

    4.6 Performance metrics

    The performance of our MARL approach and the baseline models was evaluated by using three key metrics. Economic efficiency is measured by the reduction in total system costs and the increase in profits for individual VPPs. Computational performance is assessed through the convergence speed and the scalability with respect to the number of VPPs. Adaptability is evaluated by the models’ performance under different scenarios and their ability to handle unexpected events such as sudden changes in renewable generation or demand. We also introduce a composite metric that combines these aspects to provide an overall performance score [67].
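
Purely as an illustration, a composite score of the kind mentioned above can be formed as a weighted combination of normalized sub-scores; the weights and the normalization assumed below are hypothetical and not those used to produce the reported results.

```python
def composite_score(economic, computational, adaptability, weights=(0.4, 0.2, 0.4)):
    """Weighted overall performance score from three sub-scores normalized to [0, 1].
    The weights are illustrative placeholders, not the values used in the paper."""
    w_e, w_c, w_a = weights
    return 100.0 * (w_e * economic + w_c * computational + w_a * adaptability)
```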

    4.7 Hyperparameter tuning process

The hyperparameters of our MARL model are tuned by using a combination of grid search and Bayesian optimization. We focus on key parameters such as the learning rates for the actor and critic networks, the discount factor, the exploration rate, and the network architectures. The Bayesian optimization process uses the expected improvement acquisition function and a Gaussian process surrogate model, as described by Snoek et al. [68]. This approach allows us to efficiently explore the hyperparameter space and find a near-optimal configuration for our MARL model. To ensure reproducibility and provide a comprehensive overview of our experimental setup, a detailed summary of the parameters used in our MARL framework and system model, including learning rates, network architectures, and key system specifications, is given in Table 1.

Parameter category | Parameter name | Value | Description
MARL hyperparameters | Actor learning rate | 1×10⁻⁴ | Learning rate for actor network updates
MARL hyperparameters | Critic learning rate | 5×10⁻⁴ | Learning rate for critic network updates
MARL hyperparameters | Discount factor (γ) | 0.99 | Discount factor for future rewards
MARL hyperparameters | Exploration rate (ε) | 0.1 | Initial exploration rate for epsilon-greedy policy
MARL hyperparameters | Replay buffer size | 1×10⁶ | Capacity of experience replay buffer
MARL hyperparameters | Batch size | 256 | Number of samples per training iteration
Network architecture | Actor network | [64, 32] | Hidden layer sizes for the actor network
Network architecture | Critic network | [128, 64] | Hidden layer sizes for the critic network
System parameters | Number of VPPs | 10 | Total number of VPPs in the network
System parameters | Simulation time steps | 8760 | Number of hourly time steps (1 year)
System parameters | Battery capacity | 1000 kWh | Energy storage capacity per VPP
System parameters | Renewable generation limit | 500 kW | Maximum renewable generation capacity per VPP
System parameters | Grid frequency limits | [49.8, 50.2] Hz | Allowable range for grid frequency

    Table 1. MARL and system parameters.

    To ensure the robustness and reproducibility of our results, our experiments were conducted by using multiple random seeds. Specifically, 15 different random seeds were used for each experimental configuration, as recommended in Ref. [69]. This approach allows us to account for the inherent stochasticity in MARL algorithms and provides a more reliable estimate of the true performance of our method. For each set of hyperparameters and system configurations, 15 independent trials were run with different random seeds. The results presented in this paper represent the mean performance across these trials.
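
A compact sketch of the Gaussian-process search with the expected improvement acquisition function described at the beginning of this subsection is given below, using scikit-optimize; the search ranges and the dummy objective are illustrative stand-ins for a full training-and-evaluation run.

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

# Illustrative search space; the ranges are assumptions, not our exact bounds.
search_space = [
    Real(1e-5, 1e-3, prior="log-uniform", name="actor_lr"),
    Real(1e-5, 1e-3, prior="log-uniform", name="critic_lr"),
    Real(0.95, 0.999, name="gamma"),
    Integer(64, 512, name="batch_size"),
]

def objective(params):
    actor_lr, critic_lr, gamma, batch_size = params
    # In the full pipeline this would train the MARL agents with these settings and
    # return the negative mean evaluation reward; a dummy score is used here instead.
    return (actor_lr - 1e-4) ** 2 + (critic_lr - 5e-4) ** 2

# Gaussian-process surrogate with the expected-improvement ("EI") acquisition function.
result = gp_minimize(objective, search_space, acq_func="EI", n_calls=30, random_state=0)
print("best configuration found:", result.x)
```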

    4.8 Hardware and software specifications

    Our experiments were conducted on a high-performance computing cluster with 10 nodes, each equipped with an Intel Xeon Gold 6248R CPU, 384 GB of RAM, and four NVIDIA A100 GPUs. The software environment included Ubuntu 20.04 LTS as the operating system, CUDA 11.2 for GPU acceleration, and Docker 20.10 for containerization and reproducibility. Ray 1.2.0 was used for distributed computing, which allows us to parallelize the training process across multiple nodes and GPUs [70]. This setup enables us to efficiently train and evaluate our MARL model and the baseline approaches across the diverse range of scenarios described earlier.

    5 Results and discussion

    5.1 Convergence analysis

    The convergence of our MARL approach was evaluated by examining the learning curves of both the DSO and VPP agents over 20000 training episodes. Fig. 2 illustrates the average cumulative reward for all agents across training episodes. The learning curve demonstrates a consistent upward trend, indicating that the agents successfully learn to improve their policies over time.

Figure 2. Average cumulative reward for all agents across training episodes: MARL learning curve (above) and zoomed-in view of final 5000 episodes (below).

The initial rapid learning phase (episodes 1–2000) shows a steep improvement as agents quickly grasp basic strategies. This is followed by a steady improvement phase (episodes 2000–6000) where agents refine their policies and enhance coordination. The third phase (episodes 6000–15000) exhibits gradual fine-tuning, with agents approaching near-optimal policies. The final convergence phase (episodes 15000–20000) demonstrates policy stability, as evidenced by the plateauing curve and the zoomed-in view of the last 5000 episodes. Notably, the variance in performance, represented by the shaded region around the mean reward, decreases significantly over time, indicating convergence towards stable and consistent policies. This extended learning curve, with its clear convergence pattern and reduced variability in later stages, underscores the robustness and effectiveness of our MARL approach in managing VPP networks.

    5.2 Performance comparison with baseline models

Our MARL approach was compared with the three baseline models across the key performance metrics of economic efficiency, computational performance, and adaptability. The results are summarized in Tables 2–4.

Model | Reduction in costs (%) | Increase in VPP profits (%)
MARL (ours) | 18.73 | 22.46
Stackelberg game | 12.58 | 15.29
MPC | 14.92 | 17.81
SARL | 16.05 | 19.37

    Table 2. Economic efficiency comparison.

Model | Convergence time (h) | Scalability (max VPPs)
MARL (ours) | 8.64 | 127
Stackelberg game | 2.31 | 43
MPC | 5.17 | 76
SARL | 11.89 | 92

    Table 3. Computational performance comparison.

Model | Scenario changes | Unexpected events | Overall score
MARL (ours) | 89.27 | 83.15 | 86.21
Stackelberg game | 62.43 | 58.79 | 60.61
MPC | 75.68 | 71.92 | 73.80
SARL | 81.36 | 76.54 | 78.95

    Table 4. Adaptability score (0–100).

In terms of economic efficiency, our MARL approach outperforms all baseline models, achieving an 18.73% reduction in costs and a 22.46% increase in VPP profits. This superior performance can be attributed to the ability of MARL to capture complex interactions between agents and adapt to changing conditions. Regarding computational performance, the Stackelberg game model converges the fastest but struggles with scalability. Our MARL approach, despite taking longer to converge, demonstrates superior scalability, handling up to 127 VPPs effectively. This scalability is crucial for real-world applications where the number of VPPs may be large and variable. The adaptability scores reveal that our MARL approach is the most robust to both scenario changes and unexpected events, with an overall adaptability score of 86.21. This high adaptability is a key advantage of our approach, as it allows the system to maintain good performance even in highly dynamic and uncertain environments typical of modern power systems.

    5.3 Sensitivity analysis

    A comprehensive sensitivity analysis was performed to evaluate the robustness of our MARL approach under varying conditions. Table 5 presents the results of this analysis, showing the percentage change in system performance for different parameters.

Parameter | –50% | –25% | Base | +25% | +50%
Number of VPPs | –8.73 | –3.42 | 0 | 2.91 | 4.68
Renewable energy penetration | –12.56 | –5.87 | 0 | 7.23 | 11.95
Price volatility | 5.32 | 2.14 | 0 | –2.89 | –6.71

    Table 5. Sensitivity analysis results (percentage change in system performance).

    The number of VPPs shows a nonlinear relationship with system performance. Increasing the number of VPPs by 50% leads to a 4.68% improvement in performance, likely due to increased opportunities for coordination and resource sharing. Conversely, a 50% reduction results in an 8.73% decrease in performance. Renewable energy penetration has a significant impact on system performance. A 50% increase in penetration leads to an 11.95% improvement in performance, highlighting the MARL approach’s ability to effectively manage higher levels of variable renewable generation. Price volatility demonstrates an inverse relationship with system performance. A 50% increase in volatility results in a 6.71% decrease in performance, indicating that while our approach is robust to price fluctuations, extreme volatility can still pose challenges.

    5.4 Analysis of learned policies

The learned policies of both the DSO and VPPs were analyzed to gain insights into their strategies. Fig. 3 illustrates the DSO’s pricing strategy over a typical week. The DSO learned to implement dynamic pricing effectively, with prices closely following the pattern of net demand (demand minus renewable generation). On average, the DSO’s prices were 18.3% lower during periods of high renewable generation and 22.7% higher during peak demand periods.

Figure 3. DSO’s pricing and net demand over a week.

    For VPPs, we observed diverse energy management strategies emerging. Table 6 summarizes the average utilization of different resources by VPPs under various conditions.

Condition | Battery | Flexible load | Renewable curtailment
High demand | 78.4% | 89.2% | 2.3%
Low demand | 34.6% | 12.7% | 15.8%
High renewable | 82.1% | 8.9% | 7.5%
Low renewable | 45.3% | 67.8% | 0.1%

    Table 6. VPP resource utilization (percentage of capacity).

    VPPs learned to effectively utilize their batteries, with high utilization (82.1%) during periods of high renewable generation for energy storage, and significant discharge (78.4% utilization) during high demand periods. Flexible loads were primarily activated during high demand (89.2% utilization) and low renewable (67.8% utilization) periods, demonstrating intelligent demand response. Renewable curtailment was minimal in most scenarios, only reaching notable levels (15.8%) during periods of very low demand. Fig. 4 illustrates the temporal dynamics of key state variables in the VPP network over a representative week. It showcases the intricate relationships between load demand, renewable generation, electricity prices, storage levels, and grid frequency.

Figure 4. Temporal dynamics of key state variables in the VPP network over a representative week.

    5.5 Scalability assessment

    To assess the scalability of our MARL approach, we progressively increased the system size and measured performance metrics. Fig. 5 shows the computational time and solution quality as the number of VPPs increases from 10 to 200.

Figure 5. Computational time and solution quality as the number of VPPs increases from 10 to 200.

    The computational time scales approximately linearly with the number of VPPs, increasing from 8.64 h for 10 VPPs to 103.27 h for 200 VPPs. Despite this increase, the solution quality, measured as the percentage of optimal performance achieved, remains high. It decreases only slightly from 98.7% for 10 VPPs to 94.3% for 200 VPPs. The communications overhead as the system scales was also examined. The average number of messages exchanged per time step increases from 342 for 10 VPPs to 12876 for 200 VPPs. However, the message size remains constant at approximately 2.4 KB, ensuring that bandwidth requirements remain manageable even for large systems.

    5.6 Robustness to uncertainty and disturbances

    To evaluate the robustness of our MARL approach, the system was subjected to a series of unexpected events and mis-predictions, analyzing its behavior and performance under these challenging conditions. We considered three types of disturbances: Sudden renewable generation drops, unexpected demand spikes, and erroneous price forecasts. The system’s response was measured in terms of cost increase, grid stability maintenance, and recovery time. Table 7 summarizes the performance of our MARL approach compared with the baseline models under these unexpected events.

Event type | Metric | MARL | Stackelberg | MPC | SARL
Renewable drop (30%) | Cost increase | 8.37% | 14.62% | 11.28% | 10.05%
Renewable drop (30%) | Stability index | –3.21% | –7.89% | –5.43% | –4.76%
Renewable drop (30%) | Recovery time (h) | 2.34 | 4.81 | 3.67 | 3.12
Demand spike (25%) | Cost increase | 6.93% | 12.37% | 9.84% | 8.51%
Demand spike (25%) | Stability index | –2.78% | –6.42% | –4.95% | –3.89%
Demand spike (25%) | Recovery time (h) | 1.87 | 3.95 | 2.83 | 2.41
Price forecast error (20%) | Cost increase | 4.52% | 9.76% | 7.31% | 6.18%
Price forecast error (20%) | Stability index | –1.43% | –4.28% | –3.12% | –2.35%
Price forecast error (20%) | Recovery time (h) | 1.26 | 2.73 | 2.05 | 1.68

    Table 7. System performance under unexpected events (percentage deviation from normal operations).

    Our MARL approach demonstrated superior robustness across all event types and metrics. For instance, during a 30% unexpected drop in renewable generation, the MARL approach limited the cost increase to 8.37%, compared with 14.62% for the Stackelberg model. The stability index, representing the system’s ability to maintain normal operations, decreased by only 3.21% for MARL, while other approaches showed more significant degradation. To further illustrate the system’s behavior under disturbances, the response to mis-predictions in renewable generation forecasts was analyzed. Table 8 presents the average performance metrics for different levels of forecast errors.

Forecast error | MARL cost increase | MARL stability index | MARL recovery time (h)
5% | 1.28% | –0.54% | 0.37
10% | 2.76% | –1.19% | 0.83
15% | 4.65% | –2.03% | 1.42
20% | 6.93% | –3.11% | 2.18
25% | 9.87% | –4.46% | 3.09

    Table 8. System response to renewable generation forecast errors.

    The results demonstrate that our MARL approach gracefully handles increasing levels of the forecast error. Even with a 25% error in renewable generation predictions, the system maintains a relatively low cost increase (9.87%) and stability degradation (–4.46%). The system’s adaptability to prolonged changes in operating conditions was also examined. Fig. 6 illustrates the system’s performance over a 30-day period following a permanent 15% reduction in average renewable generation capacity. Our MARL approach achieved a new stable operating point within 5.73 days, with an average cost increase of 3.82% over this period. In comparison, the Stackelberg, MPC, and SARL approaches required 9.28 days, 7.65 days, and 6.91 days, respectively, with average cost increases of 7.54%, 6.17%, and 5.39%.

Figure 6. System’s performance over a 30-day period following a permanent 15% reduction in average renewable generation capacity.

    6 Discussion and conclusion

    The implementation of MARL for dynamic pricing and distributed energy management in VPP networks demonstrates significant potential for enhancing the efficiency and sustainability of modern power systems. Our study reveals that the MARL approach consistently outperforms traditional methods, such as Stackelberg game models and MPC, in terms of economic efficiency, computational performance, and adaptability to varying scenarios.

    The superior performance of MARL in managing VPP networks can be attributed to its ability to learn and adapt to complex, dynamic environments. As shown in our results, the MARL approach achieved an 18.73% reduction in costs and a 22.46% increase in VPP profits, surpassing the performance of baseline models. This improvement stems from the agents’ capacity to capture intricate interactions between multiple VPPs and adapt their strategies in real-time, a crucial advantage in the rapidly evolving energy landscape [71].

However, the transition from simulation to real-world implementation presents several challenges. The computational requirements for training and deploying MARL models in large-scale VPP networks are substantial, necessitating high-performance computing infrastructure. Additionally, the data-intensive nature of MARL algorithms raises concerns about data privacy and security, particularly when dealing with sensitive information from multiple stakeholders in the energy sector [72].

    Regulatory considerations play a crucial role in the practical adoption of MARL-based VPP management systems. Current market structures and regulations may not be fully aligned with the decentralized decision-making paradigm inherent in MARL approaches. For instance, the dynamic pricing strategies learned by MARL agents may conflict with existing price cap regulations or consumer protection policies. Future work should focus on developing MARL frameworks that can operate within regulatory constraints while still leveraging the benefits of adaptive learning [34].

    One of the most promising aspects of our MARL approach is its potential to facilitate increased renewable energy integration. By effectively coordinating the actions of multiple VPPs, including the management of energy storage systems and flexible loads, the MARL framework demonstrated the ability to accommodate higher levels of variable renewable generation. Our sensitivity analysis showed that a 50% increase in renewable energy penetration led to an 11.95% improvement in overall system performance, highlighting the synergy between MARL-based control and renewable energy sources [28].

    The economic and environmental benefits of implementing MARL for VPP management are substantial. Based on our simulations, it can be estimated that large-scale adoption of this approach could lead to annual cost savings of $500 million to $1 billion for a typical regional power system, depending on its size and composition. Moreover, the improved integration of renewable energy sources could potentially reduce carbon emissions by 5–10 million metric tons per year, contributing significantly to climate change mitigation efforts [73].

    Despite these promising results, our study has limitations that warrant further investigation. The simulations, while comprehensive, may not capture all the complexities of real-world power systems. Factors such as transmission constraints, market power dynamics, and extreme weather events were not fully incorporated into our model. Future research should aim to address these limitations by incorporating more detailed physical models of the power system and considering a broader range of scenarios and uncertainties [74].

    Looking ahead, there are several exciting directions for extending and improving the MARL approach to VPP management. One promising avenue is the integration of MARL with other artificial intelligence techniques, such as federated learning, to address privacy concerns and enable collaborative learning across multiple VPPs without sharing sensitive data [75]. Another potential direction is the development of hierarchical MARL frameworks that can efficiently manage VPP networks at different spatial and temporal scales, from individual households to regional power systems [76].

    Disclosures

    The authors declare no conflicts of interest.

    References

[1] Asmus P.. Microgrids, virtual power plants and our distributed energy future. Electr. J., 23, 72-82(2010).

    [2] Pudjianto D., Ramsay C., Strbac G.. Virtual power plant and system integration of distributed energy resources. IET Renew. Power Gen., 1, 10-16(2007).

    [3] Yavuz L., Önen A., Muyeen S.M., Kamwa I.. Transformation of microgrid to virtual power plant―a comprehensive review. IET Gener. Transm. Dis., 13, 1994-2005(2019).

    [4] P. Lombardi, M. Powalko, K. Rudion, Optimal operation of a virtual power plant, in: Proc. of IEEE Power & Energy Society General Meeting, Calgary, Canada, 2009, pp. 1–6.

    [5] Zamani A.G., Zakariazadeh A., Jadid S.. Day-ahead resource scheduling of a renewable energy based virtual power plant. Appl. Energ., 169, 324-340(2016).

    [6] Khaksari A., Steriotis K., Makris P., Tsaousoglou G., Efthymiopoulos N., Varvarigos E.. Electricity market equilibria analysis on the value of demand-side flexibility portfolios’ mix and the strategic demand aggregators’ market power. Sustain. Energy Grids, 38, 101399(2024).

    [7] Yu S.-Y., Fang F., Liu Y.-J., Liu J.-Z.. Uncertainties of virtual power plant: problems and countermeasures. Appl. Energ., 239, 454-470(2019).

    [8] Y.H. Peng, Big Data, Machine Learning Challenges of High Dimensionality in Financial Administration, Ph.D. dissertation, University of Brasilia, Brasilia, 2019.

    [9] Tushar W., Chai B., Yuen C., Smith D.B., Wood K.L., Yang Z.-Y.. Three-party energy management with distributed energy resources in smart grid. IEEE T. Ind. Electron., 62, 2487-2498(2015).

    [10] Kazemi M., Zareipour H., Amjady N., Rosehart W.D., Ehsan M.. Operation scheduling of battery storage systems in joint energy and ancillary services markets. IEEE T. Sustain. Energ., 8, 1726-1735(2017).

    [11] Wang H., Huang J.-W.. Incentivizing energy trading for interconnected microgrids. IEEE T. Smart Grid, 9, 2647-2657(2018).

    [12] Tushar W., Yuen C., Mohsenian-Rad H., Saha T., Poor H.V., Wood K.L.. Transforming energy networks via peer-to-peer energy trading: the potential of game-theoretic approaches. IEEE Signal Proc. Mag., 35, 90-111(2018).

    [13] Wang Z.-Y., Chen B.-K., Wang J.-H., Begovic M.M., Chen C.. Coordinated energy management of networked microgrids in distribution systems. IEEE T. Smart Grid, 6, 45-53(2015).

    [14] Wang Y.-P., Saad W., Han Z., Poor H.V., Başar T.. A game-theoretic approach to energy trading in the smart grid. IEEE T. Smart Grid, 5, 1439-1450(2014).

    [15] Mohsenian-Rad A.H., Wong V.W.S., Jatskevich J., Schober R., Leon-Garcia A.. Autonomous demand-side management based on game-theoretic energy consumption scheduling for the future smart grid. IEEE T. Smart Grid, 1, 320-331(2010).

[16] Saad W., Han Z., Poor H.V., Başar T.. Game-theoretic methods for the smart grid: an overview of microgrid systems, demand-side management, and smart grid communications. IEEE Signal Proc. Mag., 29, 86-105(2012).

    [17] Papavasiliou A., Mou Y.-T., Cambier L., Scieur D.. Application of stochastic dual dynamic programming to the real-time dispatch of storage under renewable supply uncertainty. IEEE T. Sustain. Energ., 9, 547-558(2019).

[18] L. Buşoniu, R. Babuška, B. De Schutter, Multiagent reinforcement learning: an overview, in: D. Srinivasan, L.C. Jain (Eds.), Innovations in Multi-Agent Systems and Applications - 1, Springer, Berlin, Germany, 2010, pp. 183–221.

[19] J.N. Foerster, I.M. Assael, N. de Freitas, S. Whiteson, Learning to communicate with deep multiagent reinforcement learning, in: Proc. of the 30th Intl. Conf. on Neural Information Processing Systems, Red Hook, USA, 2016, pp. 2145–2153.

[20] K.Q. Zhang, Z.R. Yang, T. Başar, Multiagent reinforcement learning: a selective overview of theories and algorithms, in: K.G. Vamvoudakis, Y. Wan, F.L. Lewis, D. Cansever (Eds.), Handbook of Reinforcement Learning and Control, Springer, Cham, Germany, 2021, pp. 321–384.

[21] Nguyen T.T., Nguyen N.D., Nahavandi S.. Deep reinforcement learning for multiagent systems: a review of challenges, solutions, and applications. IEEE T. Cybernetics, 50, 3826-3839(2020).

    [22] Hu J.-L., Wellman M.P.. Nash Q-learning for general-sum stochastic games. J. Mach. Learn. Res., 4, 1039-1069(2003).

[23] M. Lauer, M.A. Riedmiller, An algorithm for distributed reinforcement learning in cooperative multiagent systems, in: Proc. of the 17th Intl. Conf. on Machine Learning, Stanford, USA, 2000, pp. 535–542.

[24] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, I. Mordatch, Multiagent actor-critic for mixed cooperative-competitive environments, in: Proc. of the 31st Intl. Conf. on Neural Information Processing Systems, Long Beach, USA, 2017, pp. 6382–6393.

[25] M. Tokic, G. Palm, Value-difference based exploration: adaptive control between epsilon-greedy and softmax, in: Proc. of the 34th Annual German Conf. on AI, Berlin, Germany, 2011, pp. 335–346.

    [26] Wu H.-C., Qiu D.-W., Zhang L.-Y., Sun M.-Y.. Adaptive multi-agent reinforcement learning for flexible resource management in a virtual power plant with dynamic participating multi-energy buildings. Appl. Energ., 374, 123998(2024).

    [27] Liu W.-R., Zhuang P., Liang H., Peng J., Huang Z.-W.. Distributed economic dispatch in microgrids based on cooperative reinforcement learning. IEEE T. Neur. Net. Lear., 29, 2192-2203(2018).

    [28] Xu H.-C., Sun H.-B., Nikovski D., Kitamura S., Mori K., Hashimoto H.. Deep reinforcement learning for joint bidding and pricing of load serving entity. IEEE T. Smart Grid, 10, 6366-6375(2019).

    [29] Chen T., Su W.-C.. Indirect customer-to-customer energy trading with reinforcement learning. IEEE T. Smart Grid, 10, 4338-4348(2019).

    [30] Wang H.-W., Huang T.-W., Liao X.-F., Abu-Rub H., Chen G.. Reinforcement learning in energy trading game among smart microgrids. IEEE T. Ind. Electron., 63, 5109-5119(2016).

[31] Y.D. Yang, J.Y. Hao, M.Y. Sun, Z. Wang, C.J. Fan, G. Strbac, Recurrent deep multiagent Q-learning for autonomous brokers in smart grid, in: Proc. of the 27th Intl. Joint Conf. on Artificial Intelligence, Stockholm, Sweden, 2018, pp. 569–575.

    [32] Vázquez-Canteli J.R., Nagy Z.. Reinforcement learning for demand response: a review of algorithms and modeling techniques. Appl. Energ., 235, 1072-1089(2019).

    [33] Tushar W., Saha T.K., Yuen C. et al. A motivational game-theoretic approach for peer-to-peer energy trading in the smart grid. Appl. Energ., 243, 10-20(2019).

    [34] Ye Y.-J., Qiu D.-W., Sun M.-Y., Papadaskalopoulos D., Strbac G.. Deep reinforcement learning for strategic bidding in electricity markets. IEEE T. Smart Grid, 11, 1343-1355(2020).

    [35] Yang Q.-L., Wang G., Sadeghi A., Giannakis G.B., Sun J.. Two-timescale voltage control in distribution grids using deep reinforcement learning. IEEE T. Smart Grid, 11, 2313-2323(2019).

    [36] Mbuwir B.V., Ruelens F., Spiessens F., Deconinck G.. Battery energy management in a microgrid using batch reinforcement learning. Energies, 10, 1846(2017).

    [37] Mocanu E., Mocanu D.C., Nguyen P.H., Liotta A., Webber M.E., Gibescu M.. On-line building energy optimization using deep reinforcement learning. IEEE T. Smart Grid, 10, 3698-3708(2019).

    [38] Da Silva F.L., Nishida C.E.H., Roijers D.M., Costa A.H.R.. Coordination of electric vehicle charging through multiagent reinforcement learning. IEEE T. Smart Grid, 11, 2347-2356(2020).

[39] P. Vytelingum, S.D. Ramchurn, T.D. Voice, A. Rogers, N.R. Jennings, Trading agents for the smart electricity grid, in: Proc. of the 9th Intl. Conf. on Autonomous Agents and Multiagent Systems, Toronto, Canada, 2010, pp. 897–904.

    [40] Chakraborty S., Okabe T.. Robust energy storage scheduling for imbalance reduction of strategically formed energy balancing groups. Energy, 114, 405-417(2016).

    [41] Glanois C., Weng P., Zimmer M. et al. A survey on interpretable reinforcement learning. Mach. Learn., 113, 5847-5890(2024).

[42] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, Y. Tassa, Safe exploration in continuous action spaces [Online]. Available: https://arxiv.org/abs/1801.08757, January 2018.

[43] Busoniu L., Babuska R., De Schutter B.. A comprehensive survey of multiagent reinforcement learning. IEEE T. Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38, 156-172(2008).

    [44] Du W., Ding S.. A survey on multi-agent deep reinforcement learning: from the perspective of challenges and applications. Artif. Intell. Rev., 54, 3215-3238(2021).

    [45] Gronauer S., Diepold K.. Multi-agent deep reinforcement learning: a survey. Artif. Intell. Rev., 55, 895-943(2022).

    [46] Lezama F., Soares J., Hernandez-Leal P., Kaisers M., Pinto T., Vale Z.. Local energy markets: paving the path toward fully transactive energy systems. IEEE T. Power Syst., 34, 4081-4088(2019).

[47] T.P. Lillicrap, J.J. Hunt, A. Pritzel, et al., Continuous control with deep reinforcement learning [Online]. Available: https://arxiv.org/abs/1509.02971, July 2019.

[48] J.N. Foerster, G. Farquhar, T. Afouras, N. Nardelli, S. Whiteson, Counterfactual multiagent policy gradients, in: Proc. of the 32nd AAAI Conf. on Artificial Intelligence, New Orleans, USA, 2018, pp. 2974–2982.

[49] A. Vaswani, N. Shazeer, N. Parmar, et al., Attention is all you need, in: Proc. of the 31st Intl. Conf. on Neural Information Processing Systems, Long Beach, USA, 2017, pp. 6000–6010.

[50] J.C. Jiang, C. Dun, T.J. Huang, Z.Q. Lu, Graph convolutional reinforcement learning, in: Proc. of the 8th Intl. Conf. on Learning Representations, Addis Ababa, Ethiopia, 2020, pp. 1–13.

[51] S. Iqbal, F. Sha, Actor-attention-critic for multiagent reinforcement learning, in: Proc. of the 36th Intl. Conf. on Machine Learning, Long Beach, USA, 2019, pp. 2961–2970.

[52] T. Schaul, J. Quan, I. Antonoglou, D. Silver, Prioritized experience replay [Online]. Available: https://arxiv.org/abs/1511.05952, February 2016.

[53] M. Plappert, R. Houthooft, P. Dhariwal, et al., Parameter space noise for exploration, in: Proc. of the 6th Intl. Conf. on Learning Representations, Vancouver, Canada, 2018, pp. 1–18.

    [54] Uhlenbeck G.E., Ornstein L.S.. On the theory of the Brownian motion. Phys. Rev., 36, 823-841(1930).

[55] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, in: Proc. of the 3rd Intl. Conf. on Learning Representations, San Diego, USA, 2015, p. 6.

[56] T.P. Lillicrap, J.J. Hunt, A. Pritzel, et al., Continuous control with deep reinforcement learning, in: Proc. of the 4th Intl. Conf. on Learning Representations, San Juan, Puerto Rico, 2016, pp. 1–14.

[57] J. Nocedal, S.J. Wright, Numerical Optimization, 2nd ed., Springer, New York, USA, 2006.

    [58] J. Achiam, D. Held, A. Tamar, P. Abbeel, Constrained policy optimization, in: Proc. of the 34th Intl. Conf. on Machine Learning, Sydney, Australia, 2017, pp. 22–31.

[59] F. Berkenkamp, M. Turchetta, A.P. Schoellig, A. Krause, Safe model-based reinforcement learning with stability guarantees, in: Proc. of the 31st Intl. Conf. on Neural Information Processing Systems, Long Beach, USA, 2017, pp. 908–919.

[60] Z. Sheebaelhamd, K. Zisis, A. Nisioti, D. Gkouletsos, D. Pavllo, J. Kohler, Safe deep reinforcement learning for multiagent systems with continuous action spaces, arXiv preprint, arXiv:2108.03952 (2021).

[61] A.P. Dobos, PVWatts Version 5 Manual, National Renewable Energy Laboratory, Washington, USA, 2014.

    [62] Draxl C., Clifton A., Hodge B.-M., McCaa J.. The Wind Integration National Dataset (WIND) toolkit. Appl. Energ., 151, 355-366(2015).

[63] S. Ong, N. Clark, Commercial and Residential Hourly Load Profiles for All TMY3 Locations in the United States, DOE Open Energy Data Initiative (OEDI), National Renewable Energy Lab. (NREL), Golden, USA, 2014.

    [64] Weron R.. Electricity price forecasting: a review of the state-of-the-art with a look into the future. Int. J. Forecasting, 30, 1030-1081(2014).

    [65] Dong L., Tu S.-Q., Li Y., Pu T.-J.. A stackelberg game model for dynamic pricing and energy management of multiple virtual power plants using metamodel-based optimization method. Power Syst. Technol., 44, 973-981(2020).

    [66] Parisio A., Rikos E., Glielmo L.. A model predictive control approach to microgrid operation optimization. IEEE T. Contr. Syst. T., 22, 1813-1827(2014).

    [67] Stock S., Babazadeh D., Becker C.. Applications of artificial intelligence in distribution power system operation. IEEE Access, 9, 150098-150119(2021).

[68] J. Snoek, H. Larochelle, R.P. Adams, Practical Bayesian optimization of machine learning algorithms, in: Proc. of the 25th Intl. Conf. on Neural Information Processing Systems, Lake Tahoe, USA, 2012, pp. 2951–2959.

[69] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, D. Meger, Deep reinforcement learning that matters, in: Proc. of the 32nd AAAI Conf. on Artificial Intelligence, New Orleans, USA, 2018, pp. 3207–3214.

[70] P. Moritz, R. Nishihara, S. Wang, et al., Ray: a distributed framework for emerging AI applications, in: Proc. of the 13th USENIX Conf. on Operating Systems Design and Implementation, Carlsbad, USA, 2018, pp. 561–577.

    [71] Roesch M., Linder C., Zimmermann R., Rudolf A., Hohmann A., Reinhart G.. Smart grid for industry using multi-agent reinforcement learning. Appl. Sci., 10, 6900(2020).

    [72] Duan J.-J., Shi D., Diao R.-S. et al. Deep-reinforcement-learning-based autonomous voltage control for power grid operations. IEEE T. Power Syst., 35, 814-817(2020).

    [73] Xu B., Luan W.-P., Yang J. et al. Integrated three-stage decentralized scheduling for virtual power plants: a model-assisted multi-agent reinforcement learning method. Appl. Energ., 376, 123985(2024).

    [74] Zhang Z.-D., Zhang D.-X., Qiu R.C.. Deep reinforcement learning for power system applications: an overview. CSEE J. Power Energy, 6, 213-225(2019).

[75] B. McMahan, E. Moore, D. Ramage, S. Hampson, B.A.Y. Arcas, Communication-efficient learning of deep networks from decentralized data, in: Proc. of the 20th Intl. Conf. on Artificial Intelligence and Statistics, Fort Lauderdale, USA, 2017, pp. 1273–1282.

[76] A. Gupta, C. Devin, Y.X. Liu, P. Abbeel, S. Levine, Learning invariant feature spaces to transfer skills with reinforcement learning, in: Proc. of the 5th Intl. Conf. on Learning Representations, Toulon, France, 2017, pp. 1–14.
