Multiagent Cooperative Reinforcement Learning by Expert Agents (MCRLEA)

Abstract: This paper proposes a novel approach, Multiagent Cooperative Reinforcement Learning by Expert Agents (MCRLEA), for dynamic decision making in a retail application. It also puts forward several cooperation schemes for multiagent cooperative reinforcement learning: EQ learning, EGroup, EDynamic, EGoal-driven, and the Expert agents scheme. The implementation demonstrates that the proposed cooperation schemes can speed up a group of agents in reaching good action policies. The approach is applied to three retail stores in a retail marketplace. Retailers can help each other and profit from cooperative knowledge while learning their own strategies, which reflect only their own aims and benefit. The vendors are modeled as the learning agents, which use cooperative learning to train in this environment. Under suitable assumptions on the vendors' stock policy, restocking period, and the consumer arrival process, the problem is formulated as a Markov decision process, which makes it possible to design learning algorithms. The proposed algorithms learn dynamic consumer behavior. The paper first reports results of cooperative reinforcement learning algorithms for three shop agents over a one-year sales period, and then reports results of the proposed approach for the same three shop agents over the same period. The results obtained with the proposed expert agent based cooperation approach show that such methods can lead to quick convergence of agents in a dynamic environment.


Introduction
A retail store earns its profit by selling household items, and retailers are chiefly interested in their sales and profit. By taking certain steps, the factors that cause a break or decrease in revenue can be avoided. The aim of sales prediction is to collect data from various shops and analyze it with machine learning algorithms. Extracting useful information by ordinary means is not practical because the data are extremely large [1]. Retail shops are the example considered here; Walmart is an example of such huge shops, big bazaars, and so on. Retailers often fail to meet consumers' requests because they are unable to estimate the market outlook. On some particular occasions the rate of sales is higher, which can cause a shortage of items. The relationship between consumers and shops is evaluated, and the changes required to gain extra yield are made. The purchase history of each item in each shop and department is maintained. By examining it, sales are predicted, which helps in understanding the profit and loss incurred throughout the year [1] [2]. Consider, for example, the Christmas season in some branch. During the Christmas celebration, sales are higher in clothing, footwear, and jewelry shops. In summer more cotton clothing is purchased; in winter more sweaters are purchased. Purchases thus vary with the season. By examining this past record of purchases, future sales can be forecast [2], which makes it possible to predict the highest revenue in the retail market. Retailers monitor consumer behavior and attract consumers by offering appealing schemes, so that they return to the shop and spend more time and money.
The main goal of retail market planning is to obtain the highest revenue by knowing what to supply profitably, where, and in which shops [2] [3].
There are many challenges in retail shop forecasting. Some of them are the following: retailers fail to estimate the potential of the market; retailers disregard seasonal changes; human resources are insufficient and workers are not available when required; retailers experience difficulty with storage management; and retailers sometimes pay no attention to competition or cooperation in the market. Retailers build strategies that encourage success and an ambitious target plan. The strategies should help to achieve the highest revenue [3].
Generally, the expected income from the sale of a specific product is the result of forecasting the maximum potential sales quantity in a given period of time under an uncertain environment. Market sales are determined by customer behavior, cooperation, supporting facilities, and so on, all of which affect the future sales of a particular shop. Shop and inventory scheduling is important and is an organized policy method at the individual shop level [4]. Buying and selling goods, store management, and space management are the major tasks in planning a shop. Monitoring the past history of the shop helps to draw up a sales plan for the shop and to adjust it so that it is as cost-effective as possible. The basic information available from the existing shop is extremely useful in forecasting sales [5].
The paper is organized as follows. Section 2 presents the proposed approach to dynamic decision making in the retail shop application, Multiagent Cooperative Reinforcement Learning by Expert Agents (MCRLEA). Section 3 describes expert agent based multiagent cooperative learning schemes. Section 4 presents the mathematical model of cooperative learning for the system of retail shops. Section 5 reports the implementation results of MCRLEA. Section 6 analyzes the results of the proposed approach, and Section 7 concludes.

Multiagent Cooperative Reinforcement Learning by Expert Agents (MCRLEA)
Three shop agents cannot obtain the maximum profit without cooperation. Cooperative reinforcement learning algorithms for these shop agents increase the sale of items through cooperation between them, which gives a significant rise in profit. Convergence of reinforcement learning becomes more important as the number of states increases.
Adding expert agents to cooperative reinforcement learning is expected to enhance performance in terms of profit and can also lead to quick convergence of agents in the dynamic environment. Hence the proposed approach, Multiagent Cooperative Reinforcement Learning by Expert Agents (MCRLEA), is presented.
Communication in multiagent reinforcement learning can build a rich set of behaviors from the agents' actions. A part of this set is allocated to each agent through an Incomplete Action Plan (Q_i) [5] [6].
Such incomplete plans normally hold incomplete information about the state. They can be combined, using a suitable cooperation method, to improve the sum of the partial rewards received. The action plans are produced by multiagent cooperative reinforcement learning, which accumulates these rewards and drives the agents toward the optimal policy Q*. Once the plans Q_1, ..., Q_x are combined, a new plan can be built, the Complete Action Plan (CAP = {CAP_1, ..., CAP_x}), in which CAP_i denotes the best reinforcement received by agent i over the course of the learning algorithm [7].
The Splan algorithm (Algorithm 1) gives the details of the agents' training. Policies are computed by the Q learning method in every algorithm. The best reinforcements are passed to CAP in order to assemble the best rewards collected by every agent. These rewards are then shared again with the other agents [8]. Cooperation is achieved through this exchange of partial rewards, because CAP is built from the best reinforcements. A val function is used to find the best policy between the earlier states and the final state for a given plan, and CAP is computed from the best reinforcements. The val function is defined as the sum of the number of steps the agent needs to reach the final state and the total value accumulated in the plan between each initial state and the final state [8].
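The exchange of partial plans described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's Splan algorithm: the update is the standard tabular Q learning rule, while the merge rule (CAP keeps, for each state-action pair, the largest value any agent has learned), the function names `q_update` and `merge_into_cap`, and the state and action labels are all hypothetical.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update for one agent's partial plan Q_i."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

def merge_into_cap(partial_plans):
    """Assumed merge rule: CAP keeps the best value learned by any agent."""
    cap = {}
    for q in partial_plans:
        for key, value in q.items():
            if key not in cap or value > cap[key]:
                cap[key] = value
    return cap

# Two hypothetical shop agents observe the same transition with different rewards.
actions = ["hold", "restock"]
q1 = defaultdict(float)
q2 = defaultdict(float)
q_update(q1, "low_stock", "restock", 1.0, "full_stock", actions)
q_update(q2, "low_stock", "restock", 2.0, "full_stock", actions)

# The merged plan carries the higher of the two learned values.
cap = merge_into_cap([q1, q2])
```

After the merge, each agent could copy the CAP entries back into its own table before continuing to learn, which is one simple way to realize the sharing step.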

Expertise Rewards
More skilled agents account for a larger share of the group's reinforcements and penalties. As a result, if the group earns a reinforcement, expert agents obtain more reward than the other agents. Conversely, when the group is punished, the other agents receive more punishment than the expert agents. Expert agents normally perform better than other agents [9] [10], and they have more opportunities to take the correct action than less expert agents. Agents acquire reinforcements (rewards and penalties) as follows, where r is a reward, R is the accumulated reward, N is the number of agents, e_i is the expertise of agent i, and e_j is the expertise of agent j.
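The allocation rule itself is not reproduced in the text above, so the following sketch is only one plausible reading of the prose, not the paper's formula. It assumes that a positive group signal R is split in proportion to expertise, e_i / sum of e_j, and that a negative signal is split in the complementary proportion so that less expert agents absorb more of the penalty. The function name `split_group_signal` is hypothetical.

```python
def split_group_signal(R, expertise):
    """Split a group signal R among agents (assumed rule, not from the paper).

    Rewards (R >= 0) are shared in proportion to expertise.
    Penalties (R < 0) are shared in the complementary proportion,
    so that less expert agents receive the larger part.
    """
    n = len(expertise)
    total = sum(expertise)
    if n == 1 or total == 0:
        # Nothing to weight by: fall back to an equal split.
        return [R / n] * n
    shares = [e / total for e in expertise]
    if R < 0:
        # Complementary weights (1 - s_i) / (n - 1) also sum to 1.
        shares = [(1 - s) / (n - 1) for s in shares]
    return [R * s for s in shares]

# Agent 0 is three times as expert as agent 1.
reward_split = split_group_signal(8, [3, 1])
penalty_split = split_group_signal(-8, [3, 1])
```

With these inputs the expert agent receives the larger share of the reward and the smaller share of the penalty, which matches the qualitative behavior described above.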

Expertise Criteria
Expertise criteria treat both reinforcements and penalties as a sign of being knowledgeable. Negative and positive outcomes, computed from the magnitude of the reinforcement and penalty signals, are both important for the agent. The measure is the sum of the absolute values of the reinforcement signals [11] [12].
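Reading the criterion above as a sum of absolute values of the reinforcement signals (an interpretation of the prose, since no formula is given), the measure can be sketched in a few lines. The function name `absolute_expertness` and the sample history are hypothetical.

```python
def absolute_expertness(signals):
    """Expertness as the sum of |r| over all signals an agent has received."""
    return sum(abs(r) for r in signals)

# Hypothetical history of rewards (positive) and penalties (negative).
history = [1.0, -0.5, 2.0, -1.0]
expertness = absolute_expertness(history)
net_reward = sum(history)
```

The contrast with `net_reward` shows the point of the criterion: penalties add to the measure instead of cancelling rewards, so an agent that has been punished often still counts as experienced.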

Expert Agent Based Multiagent Cooperative Reinforcement Learning Schemes
Various expertise based multiagent cooperative schemes for cooperative reinforcement learning are given below [12].
i) EGroup scheme: reinforcements are issued in a sequence of steps.
ii) EDynamic scheme: reinforcements are issued at each action.
iii) EGoal-driven scheme: the sum of reinforcements is issued when the agent reaches the goal state (S_goal) [13] [14].
iv) ExptAgent scheme: learning is shared only among expert agents [14].
Algorithm 2 (Cooperation schemes) considers state s, action a, agent i, reward r, number of agents N, learning rate α, discount factor γ, expertise e_i of agent i, expertise e_j of agent j, and Q.
The EGroup scheme appears to be very robust, converging very quickly to the optimal action policy Q*. The reinforcements obtained by the agents are issued after a predefined number of steps. The agents accumulate reasonable reward values, which leads to good convergence. In the EGroup scheme the global policy converges to the optimal action policy because there is an interval of steps in which good reinforcements can be accumulated [14] [15] [16].
The global action policy of the EDynamic scheme is able to accumulate good reward values in few learning episodes. After some episodes, however, the performance of the global policy degrades. This happens because the states neighboring the goal state begin to accumulate much higher reward values, producing a local maximum. This penalizes the agent, since it no longer visits the other states. In the EDynamic scheme, as the reinforcement learning algorithm updates the learned values, actions with higher accumulated reinforcements are chosen with higher probability than actions with lower accumulated reinforcements. Such a policy is known as greedy search [15] [16].
In the EGoal-driven scheme, the agent shares its learning after a variable number of steps, and cooperation takes place when the agent reaches the goal state. The global action policy of the EGoal-driven scheme is able to accumulate good reward values, provided that there are enough iterations to accumulate acceptable reward values. The performance of the cooperative learning algorithms is generally low in the early episodes of learning with the EGoal-driven scheme [16] [17]. In the ExptAgent scheme, learning is shared only among expert agents [17].
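The schemes differ mainly in when reinforcements are issued and in who takes part. The sketch below illustrates that timing only; it is not Algorithm 2. The function names `should_share` and `expert_peers`, the fixed `interval` of 10 steps for EGroup, and the expertise `threshold` for ExptAgent are assumptions made for the example.

```python
def should_share(scheme, step, at_goal, interval=10):
    """Return True when an agent following `scheme` issues its reinforcements."""
    if scheme == "EGroup":
        # Issued after a fixed sequence of steps (interval is an assumed value).
        return step % interval == 0
    if scheme == "EDynamic":
        # Issued at every action.
        return True
    if scheme == "EGoal-driven":
        # Issued only when the goal state is reached.
        return at_goal
    raise ValueError(f"unknown scheme: {scheme}")

def expert_peers(expertise, threshold):
    """ExptAgent: indices of agents expert enough to take part in sharing."""
    return [i for i, e in enumerate(expertise) if e >= threshold]

# One 30-step episode in which the goal is reached on the last step.
episode = [(step, step == 30) for step in range(1, 31)]
share_counts = {
    scheme: sum(should_share(scheme, step, at_goal) for step, at_goal in episode)
    for scheme in ("EGroup", "EDynamic", "EGoal-driven")
}
sharers = expert_peers([4.5, 1.0, 3.2], threshold=3.0)
```

In this episode EDynamic shares on every step, EGroup on every tenth step, and EGoal-driven once, which reflects the trade-off described above between fast early accumulation and the risk of reinforcing only the states near the goal.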

Model of Cooperative Learning
A wedding season scenario is considered for the model. From choosing a venue, invitation cards, decoration, and booking caterers to purchasing clothing, gifts, jewelry, and other items for the bride and groom, many activities are involved. Such seasonal situations can be realized in practice as follows: a consumer shopping in a clothing shop will typically go on to purchase jewelry, footwear, and other related items. Retailers of different items can come together and cooperatively fulfill consumer demands, gaining profit from the increase in item sales [17]. Figure 1 provides a diagrammatic representation of these dynamics. Below are mathematical notations for the model.

Results of Multiagent Cooperative Reinforcement Learning by Expert Agents
The experiments were carried out in environments ranging from 120 to 350 states.

Shop Agent 1
The results for shop agent 1 over the one-year sales period using the proposed cooperative expertness methods are given below.
The graph in Figure 1 for Shop agent 1 compares two of the proposed expertness based methods: expertness based Q learning (EQ learning) and the expert agent method. The expert agent method gives better results in terms of profit vs. states than the expertness based Q learning method.

Figure 1. Graph for Shop agent 1 using EQ-Learning and Expert Agent Learning methods.
The graph in Figure 2 for Shop agent 1 compares expertness based group learning (EGroup) with the expert agent method. The expert agent method gives better results in terms of profit vs. states than the expertness based group learning method. The graph in Figure 3 for Shop agent 1 compares the expertness based dynamic method (EDynamic) with the expert agent method. The expert agent method gives better results in terms of profit vs. states than the expertness based dynamic learning method. The graph in Figure 4 for Shop agent 1 compares the expertness based goal-driven method (EGoal) with the expert agent method. The expert agent method gives better results in terms of profit vs. states than the expertness based goal-driven learning method.

Shop Agent 2
The results for shop agent 2 over the one-year sales period using the proposed cooperative expertness methods are given below. The graph in Figure 5 for Shop agent 2 compares two of the proposed expertness based methods: expertness based Q learning (EQ learning) and the expert agent method. The expert agent method gives better results in terms of profit vs. states than the expertness based Q learning method.

Figure 5. Graph for Shop agent 2 using EQ-Learning and Expert Agent learning methods
The graph in Figure 6 for Shop agent 2 compares expertness based group learning (EGroup) with the expert agent method. The expert agent method gives better results in terms of profit vs. states than the expertness based group learning method. The graph in Figure 7 for Shop agent 2 compares the expertness based dynamic method (EDynamic) with the expert agent method. The expert agent method gives better results in terms of profit vs. states than the expertness based dynamic learning method. The graph in Figure 8 for Shop agent 2 compares the expertness based goal-driven method (EGoal) with the expert agent method. The expert agent method gives better results in terms of profit vs. states than the expertness based goal-driven learning method.

Shop Agent 3
The graph in Figure 9 for Shop agent 3 compares two of the proposed expertness based methods: expertness based Q learning (EQ learning) and the expert agent method. The expert agent method gives better results in terms of profit vs. states than the expertness based Q learning method. The graph in Figure 10 for Shop agent 3 compares expertness based group learning (EGroup) with the expert agent method. The expert agent method gives better results in terms of profit vs. states than the expertness based group learning method. The graph in Figure 11 for Shop agent 3 compares the expertness based dynamic method (EDynamic) with the expert agent method. The expert agent method gives better results in terms of profit vs. states than the expertness based dynamic learning method. The graph in Figure 12 for Shop agent 3 compares the expertness based goal-driven method (EGoal) with the expert agent method. The expert agent method gives better results in terms of profit vs. states than the expertness based goal-driven learning method.

Result Analysis of Multiagent Cooperative Reinforcement Learning by Expert Agent (MCRLEA)
Over the one-year period, for agent 1, the expertness based dynamic method, the expertness based group method, and the Q learning method give good profit, in decreasing order. The newly proposed expert agent method gives satisfactory results, as listed in Table 1, Table 2, and Table 3 for Shop Agent 1, Shop Agent 2, and Shop Agent 3 respectively. For Shop Agent 2, Table 2 shows that over the one-year period the profit obtained by the method without cooperation (EQL) is reasonable compared with the profit obtained with cooperation by the expert methods, i.e. EGroup, EDynamic, EGoal Driven, and Expert Agent. In quarters 1, 2, 3, and 4 the EDynamic method gives more profit than the other three methods. The profit ranges (lowest and highest) for the four expertness based cooperative methods are: EGroup, 6.32 to 9.96; EDynamic, 5.38 to 11.38; EGoal Driven, 5.01 to 9.63; and Expert Agent, 6.36 to 9.37. The profit range obtained by the method without cooperation, EQL, is 8.45 to 11.38. The tables compare the expertise based cooperative methods with the expert agent based cooperative method over the one-year period. In more than 70% of the months, the expert agent method gives better results than the other expertness based methods. Shop agent 1 gets more profit than agent 2 and agent 3 using the expert Q learning method (EQ) and the expert agent method in the 2nd quarter. Shop agent 3 gets more profit than agent 1 and agent 2 using the expert dynamic method in the 1st quarter. Shop agent 2 gets more profit than agent 1 and agent 3 using the expert dynamic method in the 1st quarter.
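The profit ranges quoted above for Shop Agent 2 can be compared with a short calculation. The sketch below uses only the five ranges stated in the text; the variable names and the choice of spread (highest minus lowest) and lowest profit as summary figures are assumptions made for illustration.

```python
# (lowest, highest) profit per method for Shop Agent 2, as quoted in the text.
profit_ranges = {
    "EGroup": (6.32, 9.96),
    "EDynamic": (5.38, 11.38),
    "EGoal Driven": (5.01, 9.63),
    "Expert Agent": (6.36, 9.37),
    "EQL": (8.45, 11.38),
}

# Spread of each range, rounded to two decimals to hide float noise.
spreads = {m: round(hi - lo, 2) for m, (lo, hi) in profit_ranges.items()}
widest = max(spreads, key=spreads.get)

# Among the four cooperative methods, which has the best worst-case profit?
cooperative = [m for m in profit_ranges if m != "EQL"]
best_floor = max(cooperative, key=lambda m: profit_ranges[m][0])
```

On these figures EDynamic has the widest range, reaching the highest peak but also dropping low, while Expert Agent has the highest minimum among the cooperative methods, although its lowest value (6.36) is only marginally above that of EGroup (6.32).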

Conclusion
This paper claims that expert agent cooperative reinforcement learning methods outperform the other schemes and enhance the performance of cooperative learning. The paper presents results of cooperative reinforcement learning algorithms for three shop agents over a one-year sales period. The profit obtained with the expertness based cooperation methods (EQ-learning, EDynamic, EGoal-oriented) and with the expert agent cooperative schemes is calculated. By following only expertness based cooperation methods, shop agents cannot obtain the maximum profit. The profit received with the expertness based cooperation methods is less than the profit received with the expert agent cooperation method. Graphical results show profit margin vs. number of states for four methods.
The paper demonstrated results using the proposed approach, Expert agent based Multiagent Cooperative Reinforcement Learning (MCRLEA), for three shop agents over a one-year sales period. The expert agent method gives improved results in profit vs. states compared with the expertness based EQ-learning, EGroup, EDynamic, and EGoal-driven methods. The expert agent based cooperative method and the expertness based cooperative methods are compared over the one-year period. In more than 70% of the months, the proposed method, i.e. cooperation with an expert agent, gives better results than cooperation with the expertness methods. The results obtained with the proposed expert agent cooperation methods show that such methods can lead to quick convergence of agents in a dynamic environment. They also show that cooperative methods perform well in dense, incomplete, and complex environments.