Reinforcement Learning

Reinforcement learning addresses the question of how an autonomous agent that perceives and acts in its environment can learn to choose the most appropriate actions to achieve its goal [1]. It is widely used in areas such as robotics, game programming, disease diagnosis, and automation.


In reinforcement learning, in response to an action taken by the agent, the trainer or the software reinforces the agent with a reward or a punishment that indicates the quality of the resulting state. In this way, the system tries to select the best action that can be taken to achieve the goal [2].

There are two main approaches to solving problems with reinforcement learning: the first is to search the space of behaviors for one that performs well in the environment, and the second is to use statistical and dynamic programming methods to estimate the utility of taking actions [3].

The purpose of reinforcement learning is to find the optimal policy. The optimal policy enables the agent to solve the problem optimally and reach the desired result; in other words, the agent reaches its target with the highest reward value. The optimal policy can be expressed as the one that maximizes the discounted total reward starting from an arbitrary state s_t. The goals and rewards determined by the instructor or the software are of great importance in achieving the aim of reinforcement learning. For this reason, the choice of goal and reward strongly affects the success of systems designed with the reinforcement learning method [4].
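In the usual notation (assumed here; γ is the discount factor with 0 ≤ γ < 1 and r_t is the reward received at time t, following standard formulations such as [3]), this can be written as:

    V^{\pi}(s_t) = r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i},
    \qquad \pi^{*} = \arg\max_{\pi} V^{\pi}(s) \quad \text{for all } s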

The most obvious difference between reinforcement learning and supervised learning is that the agent receives only partial feedback on its predictions. In addition, the predictions made here can have a long-term effect on the future state of the controlled system, so time becomes an important factor [5].

In a reinforcement learning system, besides the agent and the environment, there are four elements, one of which is optional [6]:

  • policy
  • reward signal
  • value function
  • model

The policy determines the action the agent can take in the state it is in. The reward is the score received from the environment for an action performed by the agent. The value of a state is the sum of the rewards the agent can expect starting from that state and the states that follow it. The model is an element that is optionally included in the system [6].

Temporal difference learning (TD learning): Before explaining temporal difference learning, it is necessary to start with a basic understanding of value functions. Value functions are functions of states or state-action pairs that estimate how good it is to take a particular action in a given state, or what the return of that action will be. Value functions are represented as follows [7]:

Vπ(s) — the value of state s under policy π.
Qπ(s, a) — the value (Q value) of taking action a in state s under policy π.

The problem here is to estimate these value functions for a particular policy. The reason for estimating them is that they can then be used to select, in any given state, the action that yields the best possible total reward [7].
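Under standard definitions (assumed here rather than taken from the source; s' denotes the next state, γ the discount factor, and π(a|s) the probability of choosing action a in state s), the two functions are related by:

    Q^{\pi}(s, a) = \mathbb{E}\left[\, r + \gamma\, V^{\pi}(s') \mid s, a \,\right],
    \qquad V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a)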

The TD learning method is a reinforcement learning algorithm that can learn directly from raw experience without a model of the environment's dynamics. In this method, estimates are partially updated (bootstrapped) on the basis of other learned estimates, without waiting for a final outcome [3]. In the TD learning method, predictions are used as targets during the learning process. As a model-free learning algorithm, TD learning has two important features:

  • It does not require prior knowledge of model dynamics.
  • It can also be applied for non-episodic tasks.

TD can be used to learn value functions by prediction. If the value functions were computed without prediction, it would be necessary to wait for the final reward before any state-action pair values could be updated. Once the final reward had been received, the path taken to reach the final state would have to be traced back and each value updated accordingly. Instead, with TD learning methods, an estimate of the final reward is computed in each state, and the state-action value is updated at every step.
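A minimal sketch of this idea is the tabular TD(0) update for state values shown below; the env.reset / env.step interface and the policy function are assumptions made only for illustration.

    # Tabular TD(0) prediction: move V(s) toward the target r + gamma * V(s') after each step.
    from collections import defaultdict

    def td0_prediction(env, policy, episodes=500, alpha=0.1, gamma=0.99):
        V = defaultdict(float)                    # state-value estimates, initially 0
        for _ in range(episodes):
            s = env.reset()                       # assumed environment interface
            done = False
            while not done:
                a = policy(s)                     # action chosen by the policy being evaluated
                s_next, r, done = env.step(a)     # assumed to return (next state, reward, done)
                target = r + gamma * V[s_next] * (not done)
                V[s] += alpha * (target - V[s])   # bootstrapped update using the current estimate
                s = s_next
        return V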

The TD learning method is called a "bootstrapping" method because values are partially updated using an existing estimate rather than a final reward. TD learning can be organized in two ways, as on-policy and off-policy learning. On-policy methods learn the value of the policy that is used to make decisions; the value functions are updated using the results of performing the actions specified by that policy. These policies are usually soft (there is always an element of exploration in the policy) rather than deterministic, so the policy does not always choose the action that gives the highest reward. Three frequently used types of policy are ε-soft, ε-greedy, and softmax.

Off-policy methods can learn different policies for behavior and for estimation. The behavior policy is usually soft. Off-policy algorithms can update the estimated value functions using hypothetical actions that have not actually been tried; they can therefore separate exploration from control, which on-policy algorithms cannot do. An agent trained with an off-policy method may end up learning tactics that it did not necessarily exhibit during the learning phase.
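For example, an ε-greedy behavior policy of the kind mentioned above could be sketched roughly as follows (the Q table and the action list are assumptions for illustration; Q is taken to be a defaultdict-like mapping from (state, action) pairs to values):

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        # With probability epsilon explore a random action; otherwise exploit the best-known action.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])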

Q learning: Q learning is one of the most commonly used reinforcement learning algorithms. It can learn online in environments about which no prior information is available. In a stochastic environment, it shows the agent how to learn the optimal policy. It is difficult for the agent to learn the optimal policy directly, because there is no training data available to it; the only training data the agent has are the instantaneous rewards. It is easier to learn a numerical evaluation function from these data and then derive the optimal policy with the help of this function [4].

Q learning is an off-policy TD learning algorithm. Its main purpose is to examine the possible next moves, see the reward that would be gained by each of them, and act in a way that maximizes this reward. In this algorithm, the agent is expected to form a plan for the future. The algorithm is often applied to problems such as mazes and search.

The algorithm is basically based on two matrices. The first is the reward matrix R; the other is the state matrix S, which holds the learned values. During a given number of repetitions the agent moves through the environment and fills in the S matrix using the values in the R matrix as it moves. The final values of the S matrix obtained after all the repetitions give the optimal result.

Q learning is a model-free reinforcement learning algorithm that can easily be applied to domains that can be modeled as a finite-state Markov decision process. Reinforcement learning problems can be modeled mathematically as Markov decision processes. A Markov decision process is defined by the following parameters [8]:

  • A finite set of states, S
  • A finite set of actions, A
  • A reward function, R: S x A → ℝ
  • A state transition function, T: S x A → π(S)

Here π(S) denotes a probability distribution over the set S. The state transition function specifies, probabilistically, the next state of the environment as a function of the agent's current state and action, and the reward function gives the instantaneous rewards. This is known as the Markov property: the next state and reward depend only on the current state and action, not on the earlier states, actions, and reinforcements. Therefore, the Markov decision process describes the dynamics of the environment one step at a time [8].
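Purely as an illustration, these four parameters could be written down in code roughly as follows (the states, actions, rewards, and probabilities are made up for the example):

    # Hypothetical finite MDP written as plain Python data structures.
    states  = ["s0", "s1", "s2"]                          # finite set of states S
    actions = ["left", "right"]                           # finite set of actions A
    # R: S x A -> real-valued instantaneous rewards
    reward = {("s0", "right"): 1.0, ("s0", "left"): 0.0}
    # T: S x A -> probability distribution over S (each distribution sums to 1)
    transition = {("s0", "right"): {"s1": 0.8, "s2": 0.2},
                  ("s0", "left"):  {"s0": 1.0}}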

The Q learning procedure can be listed as follows (a short code sketch is given after the list):

  • Initialize the table of Q values, Q(s, a).
  • Observe the current state, s.
  • Choose an action based on one of the action selection policies (ε-soft, ε-greedy, or softmax).
  • Take the action and observe the reward (r) and the new state (s').
  • Update the Q value for the state using the observed reward and the maximum possible reward for the next state.
  • Set the state to the new state and repeat the process until the terminal state is reached.
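The sketch below follows these steps in tabular form with an ε-greedy selection policy; the env.reset / env.step interface and the action list are assumptions made for illustration, and the update applied is Q(s, a) ← Q(s, a) + α [r + γ · max over a' of Q(s', a') − Q(s, a)].

    import random
    from collections import defaultdict

    def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
        # Tabular Q learning: off-policy TD control with an epsilon-greedy behavior policy.
        Q = defaultdict(float)                        # Q(s, a) table, initially 0
        for _ in range(episodes):
            s = env.reset()                           # assumed environment interface
            done = False
            while not done:
                # Epsilon-greedy action selection (explore with probability epsilon).
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda x: Q[(s, x)])
                s_next, r, done = env.step(a)         # assumed to return (next state, reward, done)
                # Off-policy target: the maximum Q value over actions in the next state.
                best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s_next
        return Q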

SARSA: The SARSA algorithm is an on-policy TD learning algorithm. The biggest difference between SARSA and Q learning is that the maximum reward available in the next state is not necessarily used to update the Q values. Instead, a new action, and therefore its reward, is selected using the same policy that determined the original action. In the current state S the agent takes an action A, receives a reward R, ends up in the next state S1, and takes action A1 in S1. The algorithm is named after this tuple (S, A, R, S1, A1).

The SARSA procedure can be listed as follows (a short code sketch is given after the list):

  • First, initialize the Q values to some arbitrary values,
  • Choose an action according to an ε-greedy policy and move from one state to the next,
  • Update the Q value of the previous state by following the SARSA update rule.
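A minimal sketch of these steps, under the same assumed env.reset / env.step interface as above; the update applied is Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)], where a' is the action actually chosen in s'.

    import random
    from collections import defaultdict

    def sarsa(env, actions, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
        # Tabular SARSA: on-policy TD control; the target uses the action actually taken next.
        Q = defaultdict(float)

        def choose(s):
            # Epsilon-greedy policy used both for behavior and for the update target.
            if random.random() < epsilon:
                return random.choice(actions)
            return max(actions, key=lambda x: Q[(s, x)])

        for _ in range(episodes):
            s = env.reset()                           # assumed environment interface
            a = choose(s)
            done = False
            while not done:
                s_next, r, done = env.step(a)         # assumed to return (next state, reward, done)
                a_next = choose(s_next)
                target = r + gamma * Q[(s_next, a_next)] * (not done)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s, a = s_next, a_next
        return Q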




REFERENCES

[1] Kaelbling, L. P., Littman, M. L., and Moore, A. P. (1996). Reinforcement learning: A Survey. Journal of Artificial Intelligence Research, 237–285.

[2] Hoshino, Y., and Kamei, K. (2003). A proposal of reinforcement learning system to use knowledge effectively. SICE 2003 Annual Conference, 1582–1585. IEEE.

[3] Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge: MIT Press.

[4] Hacıbeyoğlu, M. (2006). Çoklu Etmen Mimarisi ve Takviyeli Öğrenme. Bilgisayar Mühendisliği Anabilim Dalı, Yüksek Lisans Tezi, Selçuk Üniversitesi Fen Bilimleri Enstitüsü, Konya.

[5] Szepesvari, C. (2010). Algorithms for Reinforcement Learning. Morgan & Claypool. doi: 10.2200/S00268ED1V01Y201005AIM009

[6] Kayaoğlu, T., Bilgiç, T., Özkaynak, S., Çalışır, S., ve Güçkiran, K. (2018). Pekiştirmeli Öğrenmeye Giriş Serisi-1. Web: https://medium.com/deep-learningturkiye/peki%CC%87%C5%9Fti%CC%87rmeli%CC%87-%C3%B6%C4%9Frenmeye-gi%CC%87ri%CC%87%C5%9Fseri%CC%87si%CC%87-1-8f5c35b6044

[7] Eden, T., Knittel, A., and Uffelen, R. v. (2019). Reinforcement Learning. Web: http://www.cse.unsw.edu.au/~cs9417ml/RL1/tdlearning.html

[8] Kaya, M. (2003). Çoklu Etmen Takviyeli Öğrenmeye Veri Madenciliği Tabanlı Yeni Yaklaşımlar, Doktora Tezi, Fırat Üniversitesi Fen Bilimleri Enstitüsü. Elazığ.

[9] Savaş, S. (2019), Karotis Arter Intima Media Kalınlığının Derin Öğrenme ile Sınıflandırılması, Gazi Üniversitesi Fen Bilimleri Enstitüsü Bilgisayar Mühendisliği Ana Bilim Dalı, Doktora Tezi, Ankara.
