Study on Master Slave Interaction Model Based on Stackelberg Game in Distributed Environment

Abstract: In view of problems such as low efficiency, difficulty in resolving local conflicts, and a lack of practical application scenarios in existing interaction models of multi-agent systems in a distributed environment, a multi-master multi-slave interaction model was designed based on the Stackelberg game and applied to the interaction game between the controller and the participant in the command and control process. By optimizing the Stackelberg game model and multi-attribute decision-making, a multi-master, multi-slave, multi-agent system based on the Stackelberg game was designed, and the closed-loop problem under the Stackelberg game was solved for dimension reduction and the optimal function value. Finally, through numerical derivation and simulation and the training results of related system data, the high efficiency and strong robustness of the model were verified from multiple perspectives, and this model algorithm was proved to be true and highly e


Introduction
With the rapid development of network information technology, intelligent command and control is widely applied in military, network, biological and other fields to realize intelligent control, and it offers a high degree of sensitivity and accuracy in information task management and command control [1]. In this paper, a multi-master and multi-slave Stackelberg game model is proposed. Command and control behavior modeling is an important part of military analysis simulation [2]. In the current dynamically changing battlefield environment, whether command and control can meet the requirements of rapid information collection and real-time decision-making is crucial [3]. A command and control software system enables commanders, staff officers, and other participants to exchange mission information (commands) and situational awareness status (reports) [4]. In order to effectively manage emergencies, crises and disasters, different organizations and their command and control and sensing systems must continuously cooperate, exchange and share data [5]. Because technologies including cyber-physical systems, pervasive computing, embedded systems, mobile ad hoc networks, wireless sensor networks, cellular networks, wearable computing, cloud computing, big data analysis and intelligent agents have created an environment with various heterogeneous functions and protocols, adaptive control is required to realize useful links and modification of users. In military applications, completely different IoT devices need to be integrated into a common platform that must interoperate with dedicated military protocols, data structures and systems [6]. The effectiveness of military organizations mainly depends on the command and control structure. Sun Y et al. [7] studied the C2 structure design of distributed military organizations, established a multi-objective optimization model to balance and minimize the relative load of decision-makers, and proposed a multi-objective genetic algorithm. Evans J et al. [8] mainly studied how to meet the requirements of the rapid decision-making cycle of tactical command and control; they discussed the security requirements of security packages and how named data networks meet these requirements.
The Stackelberg game is a classic example of the bilevel optimization problem, which is often encountered in game theory and economics. Such problems are complex and hierarchical, with one optimization task nested in another. In recent years, because of its advantages, the multi-master multi-slave Stackelberg game has been studied in depth in many fields. Emerging mobile cloud computing (MCC) technologies provide great potential for mobile terminals to support highly complex applications. Wang Y et al. [9], based on a two-stage multi-master multi-slave Stackelberg game model, adopted an acceptable QoE to solve the resource management problem of MCC networks, maximizing the utility function of the network. Modern communication networks are becoming highly virtualized, with a two-layer hierarchical structure. Zheng Z et al. [10] combined optimization methods with game theory and proposed an alternating direction method of multipliers for multi-master multi-slave games. This method incentivizes multiple agents to execute the tasks of the controller, so as to meet the corresponding goals of both the controller and the agents. Sinha A et al. [11] studied a special case of a multi-cycle, multi-master and multi-slave Stackelberg competition model with nonlinear cost and demand functions and discrete production variables. In [12], an asymmetric explicit model was proposed; the Stackelberg game framework was used to capture the self-interested and hierarchical competitive properties of nodes, and the existence and uniqueness of the Stackelberg equilibrium were proved. Akbarid et al. proposed a new coordinated structure for a power generation system whereby power generation capacity was deregulated by transmission departments and decentralized capacity expansion planning of the transmission network; in that work, a two-level Stackelberg game problem with multiple leaders and followers was solved by the diagonalization method [13]. Chen F et al. [14] studied the multi-master and multi-slave coordination problem based on neighborhood network topology. Reference [15] mainly studied the problem of anti-interference transmission in an unmanned aerial vehicle (UAV) communication network and proposed a Bayesian Stackelberg game method to describe the competitive relationship between a UAV (user) and a jammer: the jammer acts as the leader and the user as the follower of the game, and each chooses its optimal power control strategy according to its own utility function. Yuan Y et al. [16] studied resilient control under denial-of-service attacks initiated by intelligent attackers. The control system was modeled as a multi-stage hierarchical game with decision levels at the network and physical levels, respectively. Specifically, the interaction between different security agents in the network layer was modeled as a static infinite Stackelberg game.
The current multi-agent model with a leader-follower architecture still lacks an explicit closed-loop solution. Many studies have carried out preliminary investigations of this issue, which provided some insight for this paper. Thai C N et al. [17] studied the performance of closed-loop identification techniques and chose an optimal closed-loop identification solution; two recursive closed-loop identification methods were applied to a flexible transmission system, and an optimal closed-loop control scheme for that system was proposed. In reference [18], the dynamic mean-variance combinatorial optimization problem with deterministic coefficients was studied, and an inherent property of the closed-loop equilibrium solution was obtained for the first time, which proved that this optimization problem has a unique equilibrium solution. Given the limitations of the above studies, this paper demonstrates the existence of the closed-loop solution by mathematical derivation, solves for it, supports its reliability and rationality with numerical simulation, and verifies the model algorithm through the results of data training.
Based on the command and control model of a multi-agent system under distributed cooperative engagement, the game theories within the multi-agent system were studied, and a multi-master and multi-slave Stackelberg game model was proposed [19]. The model takes the two main types of decision-makers in the operation command process as its research objects: one is the accuser, namely the leader, who is responsible for the command and control of the combat process; the other is the participant, namely the follower, who is responsible for executing the decision scheme generated by the accuser. For these two game roles, the interaction model was constructed based on the game relationship, and the optimization model based on the Stackelberg game together with multi-attribute decision-making describes how the accuser and the participant interact in the distributed collaborative environment.
Considering the problem presented by the limited individual capability of a single agent, the ability to deal with complex tasks can be enhanced through cooperation among individuals [20], so the multi-agent system has more advantages in dealing with complex tasks. First of all, the multi-agent system has a better battlefield situational awareness than the single-agent system. It can obtain global situational information and has the ability of parallel perception and parallel charge decision-making [21]. Secondly, the ability of a cooperative agent system is stronger than that of a single agent, so a multi-agent system has excellent expansibility and robustness. Finally, based on reasonable battlefield resource deployment and cooperative control between combat units, the multi-agent system with low resource loss was used to replace a single complex system with high resource loss, so as to achieve higher benefits.

Multi-Agent Interaction Model for Multi-Master and Multi-Slave Stackelberg Game
Due to the low global coupling degree of distributed environments, strategies are diversified, behavior planning is more complex, and the solution space grows exponentially [22]. As the number of external access agents increases, the behavior of each agent has a large impact on the maximization of global benefits [23]. From the perspective of economics, both the accuser and the participant of a task tend to pursue the maximization of their own benefits. Based on this principle, a balanced interaction model of multiple agents must be studied so that each agent meets its own goals on the premise of satisfying the maximum global benefits.
In this section, an agent interaction model based on multi-master and multi-slave Stackelberg game was proposed. In the process that the agent system dealt with complex tasks, the agent was divided into two roles: leader and follower. Based on these two roles, the game interaction model in the command and control system was studied in depth, and the validity and robustness of the game interaction model were demonstrated through numerical calculation and simulation [24]. In addition, the existence and uniqueness of the Stackelberg equilibrium solution in an interactive game were also deduced, and the closed-loop expression of the equilibrium solution was provided.

Optimization Model of Stackelberg Game
Game theory, as a commonly used modeling method, deals with competition among multiple participants [25]. It mainly studies how decision-makers influence each other's behavior under competition and how their decisions reach equilibrium, and it provides the mathematical theories and methods by which decision-makers maximize their own profits.
Generally, according to whether there is a binding agreement between game participants, games can be divided into cooperative and non-cooperative games. In a cooperative game [26], the two parties have a consistent income direction and reach a binding agreement; otherwise, the game is non-cooperative. According to whether the sum of the cost functions of the two sides has a loss, games can be divided into zero-sum and non-zero-sum games. According to the players' mastery of global situation information, a game can be classified as either a complete information game or an incomplete information game. According to the different status of each participating member in the game process, games are divided into the Nash game, the master-slave game (Stackelberg game), etc. [27].
The Stackelberg game is a dynamic game used where there is a hierarchy of decision behavior between two types of players [28], one being the leader and the other the followers. After the leader acts, the followers make their own decision plans according to the leader's action plan, so as to ensure the maximization of the decision benefits of the leader. In this paper, however, the original model is improved: based on the concept of distributed collaboration, a multi-master and multi-slave game model is established in which the leader is redefined as the accuser and the followers are defined as the participants. The accuser adjusts its follow-up action plan according to the participants' decision-making scheme and preserves the participants' benefits to the greatest degree, thus maximizing the global income. The accuser's optimal decision and the participants' corresponding response decision therefore constitute the equilibrium of the game [29].
Two types of game participants were defined. The accuser was $L_k \in U$, and its performance index function was $J_L(\mu, \omega)$. The participant was $F_k \in U$, and its performance index function was $J_F(\mu, \omega)$. The purpose of both parties was to minimize $J(\mu, \omega)$, so as to maximize the benefits of both parties and the global benefits. For any decision $\mu \in U$ made by the accuser, the participants had a unique decision $\omega \in W$ that minimized their performance index function $J_F(\mu, \omega)$. The mapping between the two parties' decision choices can be expressed as $\omega = T\mu$, where $T: U \to W$ satisfies $J_F(\mu, T\mu) \le J_F(\mu, \omega)$ for all $\omega \in W$. Considering the accuser's performance index $J_L(\mu, T\mu)$, it was assumed that there was a unique solution $\mu^{*} \in U$, i.e., $J_L(\mu^{*}, T\mu^{*}) \le J_L(\mu, T\mu)\ \forall \mu \in U$. Then $(\mu^{*}, \omega^{*}) = (\mu^{*}, T\mu^{*}) \in U \times W$ was the unique Stackelberg solution.
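The definition of the Stackelberg solution can be illustrated with a minimal numerical sketch. The two quadratic index functions below are hypothetical and are not taken from this paper; they are chosen only so that the follower's response map and the leader's optimum can be checked by hand (here the response is half of the leader's decision, and the solution is approximately (1.2, 0.6)). The decision sets are discretized on grids, which is an illustrative simplification.

```python
import numpy as np

def follower_cost(mu, omega):
    # Hypothetical follower index J_F: the follower wants omega close to mu / 2.
    return (omega - 0.5 * mu) ** 2

def leader_cost(mu, omega):
    # Hypothetical leader index J_L.
    return (mu - 1.0) ** 2 + (omega - 1.0) ** 2

def best_response(mu, omega_grid):
    # T(mu): the grid point omega that minimizes J_F(mu, .).
    return omega_grid[np.argmin(follower_cost(mu, omega_grid))]

def stackelberg_solution(mu_grid, omega_grid):
    # The leader anticipates the follower's response T(mu) and minimizes J_L(mu, T(mu)).
    responses = np.array([best_response(mu, omega_grid) for mu in mu_grid])
    i = int(np.argmin(leader_cost(mu_grid, responses)))
    return mu_grid[i], responses[i]

mu_grid = np.linspace(0.0, 2.0, 201)      # leader decisions, step 0.01
omega_grid = np.linspace(0.0, 2.0, 401)   # follower decisions, step 0.005
mu_star, omega_star = stackelberg_solution(mu_grid, omega_grid)
print(mu_star, omega_star)
```

The nested structure mirrors the definition: the inner minimization yields the response map, and only then does the leader optimize over its own decision.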

Multi-Attribute Decision-Making
The multi-attribute decision-making (MADM) problem is a multi-criterion decision problem. Another multi-criterion decision problem is the multi-objective optimization, or multi-objective decision-making (MODM), problem [30]. The criteria of MADM are often implicit and indirect, and in many cases cannot be accurately described in a quantitative way, whereas the constraints of MODM must be direct and accurate. MADM's constraints on the target are often hidden in the attributes, while MODM does not have this feature because its constraints are explicit. In addition, MADM alternatives are usually finite, while MODM can have an infinite number of alternatives [31].
In a multi-attribute decision problem, $S$ represents the set of alternatives, $s \in S$ represents one of the alternatives in $S$, and $u_1, u_2, \ldots, u_m$ represent the attributes of the problem. Let $S = \{s_1, s_2, \ldots, s_n\}$ $(n \in N^{+})$ be the set of alternative schemes and $U = \{u_1, u_2, \ldots, u_m\}$ $(m \in N^{+})$ the attribute set, and set $c_{ij} = u_j(s_i)$ $(i \in [1, n];\ j \in [1, m];\ i, j \in N^{+})$, where $c_{ij}$ is the attribute value of $s_i$ for $u_j$. Then a multi-attribute decision matrix can be constructed, i.e., $C = (c_{ij})_{n \times m}$. $S$, $U$, and $C$ are shown in Table 1. Here the attribute value $c_{ij}$ is an estimate. The solving process of the multi-attribute decision-making problem is as follows. First, the decision matrix is listed. Second, a weight determination method is selected according to whether the weights are known. Third, based on the attribute values of the decision matrix, the aggregation operator is determined, and an appropriate multi-attribute decision-making method is chosen according to the solving target and the form of the decision matrix. The calculation results are then weighted and aggregated to obtain the score of each scheme, and decisions are made according to the scores [32].
In this paper, every member of the multi-master and multi-slave Stackelberg game model used multi-attribute decision making as the basic method to generate the decision-making scheme, optimized the weights of the method according to the roles of the decision-makers, and generated the optimal decision scheme according to the final score.
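The scoring procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the aggregation operator is taken to be a simple weighted sum over min-max normalized attributes, which the paper does not specify, and the decision matrix, weights, and benefit/cost labels are hypothetical values.

```python
import numpy as np

def score_alternatives(C, weights, benefit):
    """Score n alternatives from an n x m decision matrix C.

    Assumed method (not from the paper): min-max normalization per attribute,
    cost attributes flipped so that larger is always better, then a weighted sum.
    """
    C = np.asarray(C, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                           # normalize the weights
    lo, hi = C.min(axis=0), C.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)    # avoid division by zero
    norm = (C - lo) / span                    # each column scaled to [0, 1]
    norm = np.where(benefit, norm, 1.0 - norm)
    return norm @ w                           # one score per alternative

# Hypothetical decision matrix: rows are schemes s_1..s_3, columns are attributes u_1..u_3.
C = [[0.9, 30.0, 4.0],
     [0.7, 10.0, 2.0],
     [0.8, 20.0, 5.0]]
weights = [0.5, 0.3, 0.2]
benefit = [True, False, True]                 # u_2 is treated as a cost attribute

scores = score_alternatives(C, weights, benefit)
best = int(np.argmax(scores))
print(scores, best)
```

Role-specific weights, as used for the accuser and the participants, would simply correspond to different `weights` vectors applied to the same decision matrix.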

Multi-Master and Multi-Slave Stackelberg Game Model in a Distributed Environment
Game theory is widely used in the field of interactive multi-agent systems. The Stackelberg game [33] is a classic game model, and the strategy of various participants in the game is usually made on the basis of the maximization of self-interest, and there is cooperation or non-cooperation between each other. When all participants achieve maximum interests and global interests tend to be steady, this state is called equilibrium state.
The Stackelberg game is a game between upper and lower levels, in which the participants stand in a hierarchical relationship as leader and followers [34]. In the game model, the leader generates a decision scheme according to its prediction of the state of the task objective and of the followers, and the followers respond according to the leader's decision scheme. During the game, information is not fully shared between leader and followers: the leader's information is only partially shared with the followers, whereas the followers' information is fully shared with the leader. Under such a design, the leader's decision-making scheme is constrained by the maximization of the followers' revenue, so in the Stackelberg game [35] the leader does not need to design a response function.
The main difficulty of current multi-master and multi-slave Stackelberg game models lies in solving for the closed-loop solution, so the open-loop solution is usually the mainstream research object [36]. This section aims to obtain the Stackelberg game equilibrium, namely the closed-loop solution of the Stackelberg game, through derivation; the retrieval process of the open-loop solution is therefore not demonstrated in detail. The closed-loop solution of the multi-master and multi-slave Stackelberg game model between accuser and participants is nonlinear and has no explicit form in general. Under certain conditions, this paper therefore uses the regularity of a positive semi-definite matrix to give a closed-loop solution: the accuser's optimal decision is obtained first, the participants' decision-making is then optimized to obtain their optimal decision scheme, and the overall revenue is eventually maximized.

A Closed-Loop Solution of Multi-Master and Multi-Slave Stackelberg Game Model
Under a distributed environment of a certain complexity, the regularity of the positive semi-definite quadratic performance index was used to give a closed-loop solution of the multi-master and multi-slave Stackelberg game. In this section, a regular Riccati equation is introduced to solve the positive semi-definite multi-master and multi-slave Stackelberg game problem; the main contribution is a closed-loop solution to the Stackelberg game. The general process is as follows. First, the optimal decision $(\mu, \omega)$ was obtained in the decision optimization of the accuser, with $\omega$ maximizing and $\mu$ minimizing the accuser's performance index. Because the weighting matrix of the control performance index was only positive semi-definite, arbitrary terms appeared in the resulting decision. Second, the participants' decisions were optimized by further exploiting the arbitrary terms in $\mu$. Finally, by combining the solution of the regular Riccati equation with the decision-making optimization of the accuser and the participants, the closed-loop solution of the multi-master and multi-slave Stackelberg game problem was obtained.

Problem Description
The linear discrete-time system is defined as follows: where $x_k \in \mathbb{R}^{n}$ is the state variable of the game participants, with a known initial value $x_0$; $\omega_k \in \mathbb{R}^{l}$ is the quantifiable interference; $z_k$ is the output; $\mu_k$ is the input; and $\gamma \ge 0$ is the interference attenuation factor. The following performance index function can then be obtained: where $\gamma$ is the interference attenuation factor defined above; −1 0 =γ 2 π 0 −1 [111], where $\pi_0$ is the variance matrix of the initial state $x(0)$. In this section, the accuser's optimization and the participants' decision-making optimization were carried out simultaneously. During the decision-making optimization of the accuser, under any initial environmental condition $(k_0, x_0)$, the pair $(\omega^{*}, \mu^{*})$ satisfies $J_L^{*}(k_0, x_0; \omega^{*}, \mu^{*}) = \max_{\omega}\min_{\mu} J_L(k_0, x_0; \omega, \mu) \le 0$. During the decision-making optimization of the participants, for any initial environmental condition $(k_0, x_0)$, the optimal value $\mu^{*}$ was sought to minimize $J$, that is, $J_F^{*}(k_0, x_0; \mu^{*}) = \min_{\mu} J_F(k_0, x_0; \mu)$.
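To make the roles of the state, the input, the interference, and the attenuation factor concrete, the following sketch simulates a linear discrete-time system and evaluates a quadratic index. The specific forms used here, $x_{k+1} = A x_k + B \mu_k + C \omega_k$ and a finite-horizon index with the interference penalized by $\gamma^{2}$, are assumptions based on a standard disturbance-attenuation formulation and may differ from the system and index actually used in this paper; the matrices `A`, `B`, `C`, `Q`, `R`, `H` and all numerical values are hypothetical.

```python
import numpy as np

def simulate(A, B, C, x0, mus, omegas):
    # Assumed dynamics: x_{k+1} = A x_k + B mu_k + C omega_k.
    xs = [np.asarray(x0, dtype=float)]
    for mu, om in zip(mus, omegas):
        xs.append(A @ xs[-1] + B @ mu + C @ om)
    return xs

def leader_index(xs, mus, omegas, Q, R, H, gamma):
    # Assumed index: terminal cost plus stage costs, with the interference
    # energy subtracted and weighted by gamma^2.
    J = xs[-1] @ H @ xs[-1]
    for x, mu, om in zip(xs[:-1], mus, omegas):
        J += x @ Q @ x + mu @ R @ mu - gamma ** 2 * (om @ om)
    return float(J)

# Hypothetical scalar example with a single time step.
A = B = C = np.array([[1.0]])
Q = R = H = np.array([[1.0]])
mus = [np.array([-1.0])]
omegas = [np.array([0.5])]
xs = simulate(A, B, C, [1.0], mus, omegas)
J = leader_index(xs, mus, omegas, Q, R, H, gamma=2.0)
print(xs[-1], J)
```

In this form the accuser would choose the inputs to lower the index while the interference tries to raise it, which is the max-min structure stated above.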

Closed-Loop Solution of Semidefinite Control in Stackelberg Game
This section describes the closed-loop solution process in the optimization of the multi-master multi-slave Stackelberg game, which is divided into two parts. The first part is the accuser's decision-making optimization. By adjusting the value of $\mu$, the performance index function $J_L$ is minimized, and at the same time, by adjusting $\omega$, $J_L$ is maximized. Because the weighting matrix is only positive semi-definite, the value of $\mu$ contains arbitrary terms. In order to ensure a unique decision for the accuser, the arbitrary terms are treated as a new decision to be solved, which is obtained from the participants' decision-making optimization, so as to obtain the accuser's optimal decision.

• Decision optimization of accuser
As described above, the accuser's decision-making optimization is carried out, that is, $\max_{\omega}\min_{\mu} J_L \le 0$ subject to the system constraints, where $K^{\top}K = R \ge 0$ and $I^{\top}I = Q \ge 0$. First, the existence of the accuser's optimal decision-making scheme is proved: If the above equation is satisfied, it can be proven that the accuser has an optimal decision, that is, $\max_{\omega}\min_{\mu} J_L$ has a definite value, and the optimal solution satisfies Equations (10) and (11). The proof can be found in [37].
According to Equations (3) and (9), there is a non-homogeneous linear relationship between the variables $x_k$ and $p_{k-1}$, so it is defined as follows: $p_{k-1} = P_k x_k + \eta_{k-1}$ (12), where $\eta_{k-1}$ and $P_k$ will be defined below. Substituting Equation (12) into Equation (11), Equation (11) can be written as follows: Substituting Equation (12) into Equation (10), Equation (10) can be written as follows: where $Z_{k+1} = -\gamma^{2} I + C^{\top} P_{k+1} C$. Since there is an optimal solution for the accuser's decision making, the weighting matrix $Z_{k+1}$ is negative semi-definite, that is, $Z_{k+1} \le 0$, and Equation (14) can be written as follows: Substituting Equation (15) into Equation (13), the following formula can be obtained: When the weighting matrix is only positive semi-definite, $\max_{\omega}\min_{\mu} J_L$ may have more than one optimal solution. In this case, it is necessary to introduce the regular Riccati equation and let $P_k$ satisfy it, which can be expressed as follows: Here $\eta_k = 0$, $p_N = P_{N+1} x_{N+1}$ with $P_{N+1} = H$, and $\psi_{k+1}^{*}$ is the pseudo-inverse matrix of $\psi_{k+1}$. It follows that $\eta_N = 0$, so Equation (12) is true at time $N$. Assuming Equation (12) is true at time $k$, then $\eta_k = p_k - P_{k+1} x_{k+1}$, and when the range condition $\mathrm{Range}[\varsigma_{k+1}] \subseteq \mathrm{Range}[\psi_{k+1}]$ holds, the following formula can be obtained from Equation (16): where $\phi$ is an arbitrary term. By substituting Equation (18) into Equation (15), the following formula can be obtained: By substituting Equations (18), (19), and (12) into (9), the following equation can be obtained: As $P_k$ satisfies Equation (17) and $\varsigma_{k+1}\left[I - \psi_{k+1}^{*}\psi_{k+1}\right] = 0$, the following formula can be obtained: It can be concluded from the above that $\eta_k = 0$, $k = 0, 1, \ldots, N$. As can be seen from the above, (3) $\psi_{k+1} \ge 0$, the following formulas can be obtained: where $\varphi_k$ is an arbitrary vector of appropriate dimension. Then the optimal decision indicator function is as follows: The detailed proof is given in Appendix A.
• Accuser's strategy optimization
It was found that there is an arbitrary term in the optimal decision scheme. In order to further optimize this arbitrary term, a matrix transformation is first carried out to turn the arbitrary term into the participants' decision variable to be solved. An elementary row transformation matrix $T_0(k)$ is introduced such that: and the following formula can be obtained: Based on the above analysis, $\mu(k)$ and $\omega(k)$ are rewritten as follows: By substituting the above formula into the definition of $x_{k+1}$, the following equation can be obtained: By substituting $\mu_k$ and $\omega_k$ into $J_F$, the following equation can be obtained: At this point, the optimization problem of the participants' decision-making scheme is to be solved, namely: min The participants' decision scheme optimization is a standard linear quadratic optimization problem. First, the necessary conditions for a solution are given: If the above equation is true, there is an optimal solution to the participants' decision-making scheme optimization problem. The proof proceeds as follows. First, the inhomogeneous relation between $x_k$ and $\theta_{k-1}$ is defined as follows: $\theta_{k-1} = P_k x_k + \varsigma_{k-1}$, where $\varsigma_{k-1}$ and $P_k$ are defined in the following lemma. By combining the definition of $J_F$ and $\theta_{k-1}$, the following equilibrium equation can be obtained: In order to obtain the explicit solution of the optimal controller $\mu_k^{1}$, the following hypothesis is considered: When the control matrix in the performance index of the participants' decision-making scheme is a positive semi-definite matrix, $\min_{u^{1}} J_2$ has no unique solution. Here, the regular Riccati equation is introduced. Under the above regularity hypothesis, this paper solves the regular Riccati equation and verifies the homogeneous relationship between the state variable $x(k)$ and the adjoint state variable $\theta(k-1)$.
Under the above hypothesis, $\theta(k-1) = P(k)x(k)$, where: Here $\varsigma(k) = 0$, and $\theta_N = P_{N+1} x_{N+1}$ with $P_{N+1} = H$. If the relation holds at time $k$, then $\varsigma_k = \theta_k - P_{k+1} x_{k+1}$, and the following formula is obtained: where $\phi(k)$ is an arbitrary term of appropriate dimension. By substituting this formula and $\varsigma_k = \theta_k - P_{k+1} x_{k+1}$ into $\theta_{k-1}$, the following formula can be obtained: where $\Gamma_{k+1}\left[I - (\psi^{0}_{k+1})^{*}\psi^{0}_{k+1}\right] = 0$ is applied in the derivation of the above formula, and the following formula can be obtained: As $\varsigma_N = 0$, it can be concluded that $\varsigma_k = 0$, $k = 0, 1, \cdots, N$. Next, the state-space model and the indicator function of the decision scheme are considered. When $H_F$ meets the above assumptions, there is an optimal solution $\mu_k^{1}$ to the optimization problem when the following conditions are satisfied: (1) $P_0 \ge 0$; (2) $\psi_k \ge 0$. The problem is solvable when these conditions are met, and the optimal solution is as follows: where $\phi(k)$ is an arbitrary vector of appropriate dimension. Then the performance index of the corresponding optimal decision scheme is as follows: The detailed proof is given in Appendix B. Here $P_0 \ge 0$ and $\psi^{0}_{k+1} \ge 0$. Therefore, the optimal decision-making scheme of the participants exists, and the optimal accuser-participant decision-making scheme, i.e., the global optimal decision-making scheme, is as follows: Thus, a closed-loop solution to the Stackelberg game is obtained, the optimal decision schemes of the accuser and the participants are obtained, and global profit maximization is finally achieved.
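The role of the pseudo-inverse in a regular Riccati recursion can be illustrated with a short sketch. It is a sketch under stated assumptions rather than the recursion of Equations (17) and (27): a standard linear quadratic problem with dynamics $x_{k+1} = A x_k + B\mu_k$, stage cost $x^{\top}Qx + \mu^{\top}R\mu$, and terminal weight $H$ is assumed, and the backward recursion replaces the usual matrix inverse with the Moore-Penrose pseudo-inverse so that a singular $\psi_{k+1} = R + B^{\top}P_{k+1}B$ is tolerated. The arbitrary term of the general solution is set to zero here; all matrices are hypothetical.

```python
import numpy as np

def riccati_backward(A, B, Q, R, H, N):
    """Backward generalized Riccati recursion over k = N, ..., 0.

    Assumed form (standard LQ, not the paper's equations):
      psi_{k+1} = R + B' P_{k+1} B
      P_k = Q + A' P_{k+1} A - A' P_{k+1} B pinv(psi_{k+1}) B' P_{k+1} A
      K_k = -pinv(psi_{k+1}) B' P_{k+1} A      (feedback mu_k = K_k x_k)
    """
    P = [None] * (N + 2)
    K = [None] * (N + 1)
    P[N + 1] = H                                  # terminal condition P_{N+1} = H
    for k in range(N, -1, -1):
        Pn = P[k + 1]
        psi = R + B.T @ Pn @ B                    # may be only positive semi-definite
        psi_pinv = np.linalg.pinv(psi)            # Moore-Penrose pseudo-inverse
        K[k] = -psi_pinv @ B.T @ Pn @ A
        P[k] = Q + A.T @ Pn @ A - A.T @ Pn @ B @ psi_pinv @ B.T @ Pn @ A
    return P, K

# Regular case: scalar system, two steps.
one = np.array([[1.0]])
P1, K1 = riccati_backward(one, one, one, one, np.array([[0.0]]), N=1)

# Singular case: the second input channel has no effect and no cost, so psi is singular.
B2 = np.array([[1.0, 0.0]])
P2, K2 = riccati_backward(one, B2, one, np.zeros((2, 2)), one, N=0)
print(P1[0], K1[0], P2[0], K2[0].ravel())
```

In the singular case, any multiple of the second input direction could be added to the control without changing the cost; this is the kind of arbitrary term that the derivation above passes on to the next optimization stage.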

Model Performance Comparison
In this paper, a multi-master, multi-slave Stackelberg game model is designed. Under the same resource deployment and the same time limit, the optimization effect and model complexity of the proposed model and of the unimproved model are compared, as shown in Tables 2 and 3. According to the experimental data, after the learning rate is improved adaptively, the function value and the time and space complexity of the model are analyzed for the same number of iterations, and it is found that the proposed model achieves lower objective function values with better convergence and higher efficiency [38].
To compare model performance, the same initial function value and number of iterations are set for the literature model and the proposed model. Table 4 shows that, for the same number of iterations, the proposed model has better convergence.

Numerical Simulation
To verify the validity of the above results, a numerical example is given in this section. Some parameters of the multi-agent system are assumed as follows: First, the optimization process of $H_L$ is analyzed. Through algebraic iteration, the solution of the regular Riccati Equation (17) is obtained, where $P(0) - \pi_0^{-1} < 0$. Combining the results of the above operation, (20) and (21) are calculated, and the optimal controller described in Theorem 1 can be obtained as follows: In the $H_F$ optimization, the solution of the regular Riccati Equation (27) is obtained. The arbitrary term $\phi_k$ in the decision-making scheme obtained in the previous optimization process can be transformed into $\mu_k^{1}$ through matrix transformation. The value of $\mu_k^{1}$ obtained by calculating Equation (32) is as follows: The arbitrary term $\phi(k)$ $(k = 2, 1, 0)$ in Equation (32) is as follows: Substituting $\phi_k$ into $\mu_k^{1}$, and then according to Equation (3), the optimal controller is obtained (Equations (52) and (53), respectively): After calculation, the optimal performance index of the $H_L$ norm form is $J_L = -6.35$, and the optimal performance index of the $H_F$ norm form is $J_F = 79.65$.

The Experimental Simulation
In this section, the air defense of an important place, in the field of air defense and anti-missile defense, is taken as the experimental scenario. Constraint conditions and objective functions are designed as the simulation environment, and a multi-agent system based on the edge Laplacian matrix is constructed to analyze the system complexity and verify the reliability and robustness of the system.

The Description of Simulation Scenario
This section takes the air defense of an important place as the background. Considering the influence of mountains, the algorithm in this section is used to model the multi-agent system, and the system is used to deploy the battlefield fire units to verify whether the key points can be defended and the target function value achieved under the specified incoming fire.

Parameters Definition
(1) Battlefield environment:
Terrain: mainly flat, with some areas of undulating mountains
Area: 300 km × 300 km
Object of protection: important place
(2) Strength formation:
Security object: important place
Security forces: 8 launching vehicles with a running speed of 40 km/h, which can make short halts to fire. Each vehicle is equipped with a number of medium-range missiles with a range of 50 km and an average speed of 800 m/s, and 20 short-range missiles with a range of 20 km and an average speed of 600 m/s. There are five radar vehicles with a detection distance of 50 km for cruise missiles, 120 km for fighters, 100 km for helicopters, and 80 km for UAVs.
(3) Tactical purpose: One-layer coverage of the detection area shall be more than 85%, two-layer coverage shall be more than 55%, and key areas shall be covered by at least two layers; one-layer coverage of the fire area shall be more than 80%, two-layer coverage shall be more than 70%, and key areas shall be covered by at least three layers.
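The layered coverage requirements in the tactical purpose can be checked numerically. The sketch below is an illustration only: it assumes circular detection (or fire) regions on flat terrain, ignores mountain occlusion, and estimates coverage by sampling the 300 km × 300 km area on a regular grid. The function name, the grid step, and the example unit positions are hypothetical.

```python
import numpy as np

def layered_coverage(centers, radius, side=300.0, step=2.0):
    """Fraction of a side x side area (km) covered by at least 1 and at least 2 units.

    Assumes circular coverage of the given radius around each (x, y) center
    and flat terrain; coverage is estimated on a grid of cell centers.
    """
    axis = np.arange(step / 2.0, side, step)
    gx, gy = np.meshgrid(axis, axis)
    layers = np.zeros(gx.shape, dtype=int)
    for cx, cy in centers:
        layers += (gx - cx) ** 2 + (gy - cy) ** 2 <= radius ** 2
    return {k: float((layers >= k).mean()) for k in (1, 2)}

# Hypothetical deployment of five radar vehicles, 120 km detection range (fighters).
radars = [(75.0, 75.0), (225.0, 75.0), (75.0, 225.0), (225.0, 225.0), (150.0, 150.0)]
cov = layered_coverage(radars, radius=120.0)
meets_detection_goal = cov[1] > 0.85 and cov[2] > 0.55
print(cov, meets_detection_goal)
```

The same function could be applied to the launching vehicles with the missile ranges to check the fire-area thresholds; a requirement of three layers over key areas would need the dictionary extended to k = 3.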
Resources: resources are the objects operated on by the command and control deployment behavior. They are physical entities and include fire units and detection units. The set of resources is $P = \{P_1, P_2, \cdots, P_J\}$, where $J$ is the number of resources. In this section, the platform resources are mainly radar vehicles, launch vehicles, medium-range missiles and short-range missiles.
Command entity: the command entity is the entity that receives situation information, makes decisions, and completes the deployment of fire units, task allocation, situation information interaction, and command and decision-making. The set of command entities is $D = \{D_1, D_2, \cdots, D_n\}$, where $n$ is the number of command entities. In this model, the command entities are the radar vehicles, which act as scheme release agents. All radar vehicles are command entities of the same level and are deployed in a distributed manner, although a logical center exists at any given time.
The command entity has four capability parameters: (1) The number of launching vehicles that each radar vehicle $D_n$ can control at the same time is called the control and management ability C

Element Definition and Load Design
The following elements are defined to describe the deployment of command and control resources:
(1) Task-to-resource assignment variable $n_{TP}^{(i,j)}$ $(i = 1, 2, \cdots, I;\ j = 1, 2, \cdots, J)$: if task $T_i$ is assigned to resource $P_j$, then $n_{TP}^{(i,j)} = 1$; otherwise $n_{TP}^{(i,j)} = 0$. In this model, the task-to-resource assignment variable is a known quantity.
(2) Control variable of decision-making entity over resources $n_{TP}^{(m,j)}$ $(m = 1, 2, \cdots, M;\ j = 1, 2, \cdots, J)$: if resource $P_j$ belongs to decision-making entity $D_m$, then $n_{TP}^{(m,j)} = 1$; otherwise $n_{TP}^{(m,j)} = 0$.
(3) Task-to-decision-entity assignment variable $n_{TP}^{(i,m)}$ $(i = 1, 2, \cdots, I;\ m = 1, 2, \cdots, M)$: if task $T_i$ is assigned to decision entity $D_m$, then $n_{TP}^{(i,m)} = 1$; otherwise $n_{TP}^{(i,m)} = 0$.
Internal workload of decision entity. The workload that decision entity $D_n$ incurs in controlling its platforms to perform tasks is called the internal workload of $D_n$, recorded as $W_{Task}^{(n)}$. In its calculation formula, $t_{Task}$ is the execution time of task $T_i$; $t_{Complete}$ is the final completion time of all tasks; $d_{Task}$ is the difficulty of the task; $Z_t$ is the weighting coefficient of the execution time factor; and $Z_d$ is the weighting coefficient of the task difficulty.
Collaborative load between decision entities. The collaboration load of any two decision entities $D_n$ and $D_{n'}$ is the workload that $D_n$ and $D_{n'}$ must carry out jointly to complete a task, recorded as $W_{Cooperation}^{(n,n')}$.
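To make the role of the weighting coefficients concrete, the sketch below computes an internal workload under an assumed form. The exact calculation formula is not reproduced in this section, so the weighted sum used here (execution time normalized by the final completion time, plus weighted difficulty) is only one plausible reading of the definitions of $t_{Task}$, $t_{Complete}$, $d_{Task}$, $Z_t$ and $Z_d$; the function name and all numeric values are hypothetical.

```python
def internal_workload(tasks, t_complete, z_t, z_d):
    """Hypothetical internal workload of one decision entity.

    tasks: list of (t_task, d_task) pairs for the tasks the entity handles.
    Assumed form (not the paper's formula):
        sum over tasks of z_t * t_task / t_complete + z_d * d_task
    """
    if t_complete <= 0:
        raise ValueError("t_complete must be positive")
    return sum(z_t * t_task / t_complete + z_d * d_task for t_task, d_task in tasks)


# Illustrative values only: two tasks with execution times 10 and 20
# and difficulty scores 0.5 and 0.8.
w = internal_workload([(10.0, 0.5), (20.0, 0.8)], t_complete=40.0, z_t=0.6, z_d=0.4)
print(f"internal workload = {w:.3f}")
```

Raising `z_t` relative to `z_d` makes long-running tasks dominate the load, which is the trade-off the two coefficients are meant to express.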

Constraint Analysis
The constraints of the command and control resource deployment model are analyzed as follows:
The decision entity performs tasks through the platforms it controls. Therefore, when at least one platform $P_i$ in the platform $D_m$ controls the variable n
There is an upper limit on the control management ability of each decision-making entity. In the corresponding constraint, $C_{Control}$ is the upper limit of control management capacity, and the constraint value is set to be the same for all decision entities in this paper.
There is an upper limit on the task processing capacity of each decision entity. In the corresponding constraint, $C_{Task}$ is the upper limit of task processing capacity, and the constraint value is set to be the same for all decision entities in this paper.
There is an upper limit on the cooperation capacity of each decision entity. In the corresponding constraint, $C_{Cooperation}$ is the upper limit of cooperation ability, and the constraint value is set to be the same for all decision entities in this paper.
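The capacity constraints can be illustrated with a small feasibility check. This is a sketch under two stated assumptions: a decision entity is taken to participate in a task whenever it controls at least one resource assigned to that task, and the task processing limit is simplified to a count of tasks rather than a workload value. The matrices, limits and function names are hypothetical.

```python
def task_to_entity(n_tp, n_dp):
    """Derive the task-to-entity assignment from the two known matrices.

    n_tp[i][j] = 1 if task i is assigned to resource j.
    n_dp[m][j] = 1 if decision entity m controls resource j.
    Assumption: entity m takes part in task i when it controls at least
    one resource assigned to that task.
    """
    return [
        [1 if any(t and d for t, d in zip(task_row, entity_row)) else 0
         for entity_row in n_dp]
        for task_row in n_tp
    ]


def within_capacity(n_dp, n_td, c_control, c_task):
    """Check the control and task limits for every decision entity."""
    for m, entity_row in enumerate(n_dp):
        controlled = sum(entity_row)                     # resources under entity m
        handled = sum(task_row[m] for task_row in n_td)  # tasks involving entity m
        if controlled > c_control or handled > c_task:
            return False
    return True


# Hypothetical instance: 3 tasks, 4 resources, 2 decision entities.
n_tp = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
n_dp = [[1, 1, 0, 0], [0, 0, 1, 1]]
n_td = task_to_entity(n_tp, n_dp)
print(n_td)
print(within_capacity(n_dp, n_td, c_control=2, c_task=2))
```

Using the same limit for every entity mirrors the statement that all decision entities share one constraint value.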

Objective Function Design
As can be seen from the above, the total amount of tasks across all agents remains the same during task assignment, while the resulting workload varies with the assignment. The optimal assignment is therefore the one that minimizes the total load. Let the amounts of tasks assigned to the agents be $w_1, w_2, \ldots, w_n$, and let the corresponding loads incurred by the agents be $f_1(w_1), f_2(w_2), \ldots, f_n(w_n)$. The total workload of all agents is $g(w_1, w_2, \ldots, w_n) = f_1(w_1) + f_2(w_2) + \cdots + f_n(w_n)$. The smaller $g(w_1, w_2, \ldots, w_n)$ is, the smaller the total cost and the more reasonable the assignment. The objective function is therefore $\min g(w_1, w_2, \ldots, w_n) = \min \sum_{i=1}^{n} f_i(w_i)$. Within the boundary defined by the constraints, the group $w_1, w_2, \ldots, w_n$ that makes $g(w_1, w_2, \ldots, w_n)$ the smallest is selected.
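Minimizing a separable total load $g = \sum_i f_i(w_i)$ for a fixed total amount of tasks can be sketched with a greedy marginal-cost allocation. This is not the paper's solver; it assumes integer task units and convex load functions $f_i$, in which case assigning each unit to the agent with the smallest marginal load is optimal. The load functions and the name `allocate_units` are illustrative.

```python
import heapq


def allocate_units(cost_fns, total_units):
    """Split integer task units among agents to minimise the summed load.

    cost_fns: list of functions f_i(w) giving agent i's load for w units.
    Greedy marginal-cost allocation; optimal when every f_i is convex.
    """
    alloc = [0] * len(cost_fns)
    # Heap of (marginal load of the next unit, agent index).
    heap = [(f(1) - f(0), i) for i, f in enumerate(cost_fns)]
    heapq.heapify(heap)
    for _ in range(total_units):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        f = cost_fns[i]
        heapq.heappush(heap, (f(alloc[i] + 1) - f(alloc[i]), i))
    total_cost = sum(f(w) for f, w in zip(cost_fns, alloc))
    return alloc, total_cost


# Hypothetical convex load functions for two agents.
cost_fns = [lambda w: 1.0 * w * w, lambda w: 2.0 * w * w]
alloc, g = allocate_units(cost_fns, total_units=3)
print(alloc, g)
```

Capacity limits such as those in the constraint analysis could be added by not pushing an agent back onto the heap once it reaches its limit.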

System Training Data Demonstration
This section will demonstrate and analyze the system from the perspective of a reinforcement learning network, global gain function and global loss function. When the number of iterations is between 2500 k and 3000 k, the maximum global gain is basically reached and becomes stable.
The experimental background is the deployment of battlefield resources. Figure 1 shows the iterative stability curve of the multi-agent system, where the horizontal axis is the number of iterations and the vertical axis is the revenue function of the multi-agent system. As the number of iterations increases to about 2500 k, the revenue function becomes stable, which indicates that the multi-agent system tends to be stable and robust. The details are shown in Figure 1.

It can be concluded from Figure 1 that, as training proceeds, the global return starts from a low level and, after a period of time, gradually increases with small fluctuations, finally reaching the optimal global return and converging.
Figure 2 is a graphical display of the reinforcement learning network. A 3D coordinate system is established for the network; every point in the figure represents a performance index of an intelligent agent, and its location represents the specific parameter value of that index and the state function of the agent at that time. As the number of iterations increases, the agents undergo state transfer and the coordinate information changes accordingly. The multi-agent system in the command and control system of this paper has more than 2000 parameters; the details are shown in Figure 2.
As the number of iterations increases, the global gain increases and the corresponding global loss gradually decreases. When the algorithm approaches the global optimum, the loss function becomes stable, with a small range of fluctuations. The trajectory of the loss function corresponds to that in Figure 1, so as to ensure that the total battlefield situation remains unchanged.
Figure 3 is the consistency analysis diagram of the multi-agent system. It can be seen from the figure that when the number of iterations reaches about 1000 k, the edge states of the intelligent agents tend to balance. Meanwhile, after a certain amount of iterative training, the multi-agent system is locally Lipschitz continuous and shows excellent consistency. The trend of the loss value is shown in Figure 3.
It can be concluded from Figure 3 that the loss value declines rapidly at the beginning of the iteration. When the number of iterations reaches 500-1000 k, the loss value declines only slightly and its average decreases slowly. When the number of iterations reaches about 1500 k, the loss value becomes stable and reaches the minimum value of the global loss function.
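Statements such as "the loss value starts to become stable at about 1500 k" can be made reproducible with an explicit stability criterion. The sketch below is one simple choice, assuming that a curve counts as stable once a sliding window of consecutive values stays within a tolerance; the window length, the tolerance, the function name and the synthetic curve are all illustrative and are not the paper's training data.

```python
def first_stable_index(values, window, tol):
    """Return the first index from which `window` consecutive values stay within `tol`.

    Stability here means max - min over the window is at most `tol`.
    Returns None when no such window exists.
    """
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        if max(chunk) - min(chunk) <= tol:
            return start
    return None


# Synthetic loss curve (not the paper's data): fast decay followed by a flat tail.
loss = [1.0 / (1 + k) for k in range(20)] + [0.05] * 10
print(first_stable_index(loss, window=5, tol=0.01))
```

Applied to the logged gain or loss values, the returned index can be multiplied by the logging interval to report the iteration count at which the curve stabilizes.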

Conclusions
This paper analyzes and solves the game interaction model of the multi-agent system in a distributed combat environment, and provides a solution to the controller-participant game problem in a multi-agent system with a positive semi-definite weighting matrix in the performance index. To solve this problem, the optimal decision $(\mu, \omega)$ is first obtained in the controller's optimization, in which the size of $\mu$ and $\omega$ is controlled to obtain the optimal value of $J_L$. Because the weighting matrix is positive semi-definite, $\mu$ contains an arbitrary term; after matrix transformation, this arbitrary term becomes the quantity to be solved in $J_F$, which allows $J_F$ to be further optimized. Based on this optimization process, the Riccati equation is solved and the closed-loop solution of the multi-master, multi-slave Stackelberg game model is given, which addresses the problem of data interaction and local conflict resolution between the intelligent agents in the command and control system. This research deepens the theory of distributed command and control and provides an interactive game mechanism for distributed multi-agent systems, which lays a theoretical foundation for intelligent warfare.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Based on the equation $p_{k-1} = A p_k + Q x_k$ and the constrained problem $\max \min J_L \le 0$, an equation for each step $k$ can be obtained. Summing this equation over $k$ from 0 to $N$ and combining the result with the previous content, the performance index function of the $H_J$ control can be rewritten. At this time, $P_0 - \pi_0^{-1} < 0$, $\psi_{k+1} \ge 0$, and $Z_{k+1} < 0$. Combining with the previous content, the optimal decision scheme $(u, w)$ of the controller can be obtained, and the optimal performance index can be calculated from the rewritten performance index function.

Appendix B
Proof of necessity: Suppose the constrained problem $\min_{\mu^1} J_F$ has a solution. The optimal decision scheme is derived and verified by induction, starting from Equation (30).
When $k = N$, $J_F$ can be written as an expression in $x(N)$ and $\mu_N^1$. This expression contains a quadratic term in $\mu_N^1$, whose matrix is positive semi-definite for any non-zero $\mu_N^1$. Setting $x(N) = 0$ gives $\psi_{N+1}^0 \ge 0$. Combining with the previous formula and $\theta_N = P_{N+1} x_{N+1}$, the optimal decision-making scheme parameters of the participants can be defined. This verifies $P_N$ at time $k = N$.
For the induction step, take any $k$ with $0 \le k \le N$ and assume that $\psi_k^0 \ge 0$ and that the optimal decision scheme parameter $\mu_k^1$ is consistent with the derivation above for $k \ge 1$, so that $\mu_k$ has an optimal decision for all $k \ge 1$. It remains to show that the conditions still hold when $k = 0$, that is, that $\psi^0$ is a positive semi-definite matrix. Supposing $x_0 = 0$, the quadratic form of $\mu_0$ in $J_F^0$ is considered. Summing both sides of the resulting equation over $k$ from 1 to $N$ gives an expression for $J_F$. Since the optimal decision scheme of the participants is assumed to exist, $J_F$ attains a minimum for any $\mu_0 \neq 0$; therefore $\psi_1^0 \ge 0$. Combining with $\theta_0 = P_1 x_1$, $\mu_0^1$ can be written as
$$\mu_0^1 = -(\psi_1^0)^{*}\, \Gamma_1 x_0 + \left(I - (\psi_1^0)^{*}\, \psi_1^0\right) \phi_0 .$$
Combined with the previous text, this verifies $P_0$ at time $k = 0$.
Proof of sufficiency: Expanding the quadratic form along the system dynamics $x_{k+1} = A_{k+1} x_k + B_{k+1} \mu_k^1$ gives
$$\begin{aligned} x_{k+1}^{\top} P_{k+1} x_{k+1} - x_k^{\top} P_k x_k &= \left(A_{k+1} x_k + B_{k+1} \mu_k^1\right)^{\top} P_{k+1} \left(A_{k+1} x_k + B_{k+1} \mu_k^1\right) - x_k^{\top} P_k x_k \\ &= \begin{bmatrix} x_k \\ \mu_k^1 \end{bmatrix}^{\top} \begin{bmatrix} -P_k + A_{k+1}^{\top} P_{k+1} A_{k+1} & A_{k+1}^{\top} P_{k+1} B_{k+1} \\ B_{k+1}^{\top} P_{k+1} A_{k+1} & B_{k+1}^{\top} P_{k+1} B_{k+1} \end{bmatrix} \begin{bmatrix} x_k \\ \mu_k^1 \end{bmatrix} . \end{aligned}$$
Summing the above equation over $k = 0, 1, \cdots, N$, the left-hand side telescopes:
$$\sum_{k=0}^{N} \left( x_{k+1}^{\top} P_{k+1} x_{k+1} - x_k^{\top} P_k x_k \right) = x_{N+1}^{\top} P_{N+1} x_{N+1} - x_0^{\top} P_0 x_0 .$$
By substituting $x_{N+1}^{\top} P_{N+1} x_{N+1}$ from the above equation and combining with the Riccati equation, $J_F$ can be expressed in the required form, which completes the proof.
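The structure of the expression for $\mu_0^1$, a particular solution plus an arbitrary term, can be motivated by a short generic derivation. The block below is a sketch with time indices dropped. It assumes that $J_F$ is quadratic in $\mu$ with a symmetric positive semi-definite weight $\psi$, that the asterisk denotes the Moore-Penrose pseudo-inverse, and that the stationarity equation is solvable; the constant $c$ collects all terms independent of $\mu$.

```latex
\begin{aligned}
J(\mu) &= \mu^{\top}\psi\,\mu + 2\,\mu^{\top}\Gamma x + c,
  \qquad \psi = \psi^{\top} \ge 0,\\
\frac{\partial J}{\partial \mu} &= 2\,\psi\,\mu + 2\,\Gamma x = 0
  \;\Longleftrightarrow\; \psi\,\mu = -\Gamma x,\\
\mu &= -\psi^{*}\,\Gamma x + \left(I - \psi^{*}\psi\right)\phi,
  \qquad \phi \text{ arbitrary},\\
\psi\,\mu &= -\psi\psi^{*}\,\Gamma x + \left(\psi - \psi\psi^{*}\psi\right)\phi
  = -\Gamma x
  \quad \text{when } \psi\psi^{*}\,\Gamma x = \Gamma x .
\end{aligned}
```

The last line uses the identity $\psi\psi^{*}\psi = \psi$. The term $(I - \psi^{*}\psi)\phi$ lies in the null space of $\psi$ and, under the solvability condition, does not change the value of $J$, which is why an arbitrary term appears when $\psi$ is only semi-definite.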