207.38.151.173 at 06:15, 12 February 2014

2014-02-12T06:15:11Z

en>Mark viking: Added wl

2013-07-12T09:25:06Z

Added wl

			{{About|the machine learning algorithm|the village in the Anand District of Gujarat in India|Sarsa}}
			{{Wikiversity|SARSA}}

			'''SARSA''' ('''State-Action-Reward-State-Action''') is an [[algorithm]] for learning a [[Markov decision process]] policy, used in the [[reinforcement learning]] area of [[machine learning]]. It was introduced in a technical note<ref>[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.2539&rep=rep1&type=pdf "On-Line Q-Learning Using Connectionist Systems" by Rummery & Niranjan (1994)]</ref> in which the alternative name SARSA was mentioned only in a footnote.

			This name simply reflects the fact that the main function for updating the Q-value depends on the current state of the agent "'''S'''<sub>1</sub>", the action the agent chooses "'''A'''<sub>1</sub>", the reward "'''R'''" the agent gets for choosing this action, the state "'''S'''<sub>2</sub>" that the agent will now be in after taking that action, and finally the next action "'''A'''<sub>2</sub>" the agent will choose in its new state. Taking every letter in the quintuple (s<sub>t</sub>, a<sub>t</sub>, r<sub>t</sub>, s<sub>t+1</sub>, a<sub>t+1</sub>) yields the word ''SARSA''.<ref>[http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node64.html Reinforcement Learning: An Introduction Richard S. Sutton and Andrew G. Barto (chapter 6.4)]</ref>

			== Algorithm ==
			:<math>Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [r_{t} + \gamma Q(s_{t+1}, a_{t+1})-Q(s_t,a_t)]</math>

			A SARSA agent interacts with the environment and updates its policy based on the actions it actually takes, which makes SARSA an on-policy learning algorithm. As expressed above, the Q value for a state-action pair is updated by an error term, scaled by the learning rate alpha. Q values represent the possible reward received in the next time step for taking action ''a'' in state ''s'', plus the discounted future reward received from the next state-action observation. Watkins's [[Q-learning]] was created as an alternative to the existing [[temporal difference learning|temporal difference technique]]; it updates the policy based on the maximum reward of the available actions. The difference is that SARSA learns the Q values of the policy it follows itself, while Watkins's Q-learning learns the Q values of the exploitation policy while following an exploration/exploitation policy. For further information on the exploration/exploitation trade-off, see [[reinforcement learning]].
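The update rule and the on-policy/off-policy distinction above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the four-state chain environment, the action names, the tie-breaking rule and the parameter values are all assumptions chosen for the example, and <code>q_learning_target</code> is included only to contrast the two targets.

```python
import random
from collections import defaultdict

ACTIONS = ["left", "right"]  # hypothetical action set for the toy chain below

def epsilon_greedy(Q, state, epsilon, rng):
    """With probability epsilon explore; otherwise pick a best-valued action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    # Break ties at random so an all-zero table does not favour one action.
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy update: the target uses the action a_next actually chosen."""
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

def q_learning_target(Q, r, s_next, gamma):
    """For contrast only: Q-learning's target uses the best next action."""
    return r + gamma * max(Q[(s_next, a)] for a in ACTIONS)

def chain_step(s, a):
    """Toy environment (an assumption): states 0..3, reward 1 on reaching 3."""
    s_next = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    done = s_next == 3
    return s_next, (1.0 if done else 0.0), done

def run_episode(Q, alpha, gamma, epsilon, rng, max_steps=200):
    """One episode: choose a, observe (r, s'), choose a', then update Q(s, a)."""
    s = 0
    a = epsilon_greedy(Q, s, epsilon, rng)
    for _ in range(max_steps):
        s_next, r, done = chain_step(s, a)
        a_next = epsilon_greedy(Q, s_next, epsilon, rng)
        sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma)
        if done:
            break
        s, a = s_next, a_next

rng = random.Random(0)
Q = defaultdict(float)  # unseen state-action pairs start at 0
for _ in range(200):
    run_episode(Q, alpha=0.5, gamma=0.9, epsilon=0.1, rng=rng)
```

The next action is selected before the update so that the same action used in the target is the one the agent then executes; that is what makes the sketch on-policy.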

			Some optimizations of Watkins's Q-learning may also be applied to SARSA; for example, the paper "Fast Online Q(λ)" (Wiering and Schmidhuber, 1998) describes the small differences needed for SARSA(λ) implementations as they arise.

			== Influence of variables on the algorithm ==

			=== Learning rate (alpha) ===

			The learning rate determines to what extent the newly acquired information will override the old information. A factor of 0 will make the agent not learn anything, while a factor of 1 would make the agent consider only the most recent information.
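The two extremes follow from rewriting the update as a weighted average, Q ← (1 − α)·Q + α·target, where the target is r + γ·Q(s', a'). The short Python sketch below illustrates this; the function name <code>blend</code> and the numbers are assumptions made up for the example.

```python
def blend(old_value, target, alpha):
    """Move old_value a fraction alpha of the way toward target."""
    return old_value + alpha * (target - old_value)

old_q, target = 2.0, 10.0  # illustrative values only

frozen = blend(old_q, target, alpha=0.0)       # nothing is learned
halfway = blend(old_q, target, alpha=0.5)      # old and new weighted equally
overwritten = blend(old_q, target, alpha=1.0)  # only the newest information
```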

			=== Discount factor (gamma) ===

			The discount factor determines the importance of future rewards. A factor of 0 will make the agent "opportunistic" by only considering current rewards, while a factor approaching 1 will make it strive for a long-term high reward. If the discount factor meets or exceeds 1, the <math>Q</math> values may diverge.
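As a rough illustration of these regimes, the following Python sketch computes the discounted sum of a reward sequence for several values of gamma. The helper name <code>discounted_return</code> and the constant reward of 1 per step are assumptions chosen for the example.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards[k] * gamma**k, accumulated from the last reward back."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

rewards = [1.0, 1.0, 1.0, 1.0]  # illustrative: reward 1 at every step

myopic = discounted_return(rewards, 0.0)      # only the first reward counts
farsighted = discounted_return(rewards, 0.9)  # later rewards still matter

# With gamma < 1 the sum of a constant reward stays below 1 / (1 - gamma);
# with gamma = 1 it grows with the horizon, which is why values can diverge.
bounded = discounted_return([1.0] * 1000, 0.9)
unbounded = discounted_return([1.0] * 1000, 1.0)
```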

			=== Initial conditions (<math>Q(s_0,a_0)</math>) ===

			Since SARSA is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. A high initial value, also known as "optimistic initial conditions",<ref>http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node21.html</ref> can encourage exploration: whichever action is taken, the update rule lowers its value relative to the untried alternatives, which increases the probability that they are chosen. Recently, it was suggested that the first reward <math>r</math> could be used to reset the initial conditions. According to this idea, the first time an action is taken the reward is used to set the value of <math>Q</math>. This allows immediate learning in the case of fixed deterministic rewards. Surprisingly, this resetting-of-initial-conditions (RIC) approach seems to be consistent with human behaviour in repeated binary choice experiments.<ref>[http://www.ncbi.nlm.nih.gov/pubmed/22924882 The Role of First Impression in Operant Learning. Shteingart H, Neiman T, Loewenstein Y. J Exp Psychol Gen. 2013 May; 142(2):476-88. doi: 10.1037/a0029550. Epub 2012 Aug 27.]</ref>
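Both ideas can be sketched in Python. The code below is one possible reading of the description above, not the cited authors' implementation: the function names, the <code>visited</code> set used to detect a first visit, and the single-state example with gamma set to 0 are all assumptions.

```python
from collections import defaultdict

def make_optimistic_q(initial_value):
    """Q-table whose unseen entries start at a high, optimistic value."""
    return defaultdict(lambda: initial_value)

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Ordinary SARSA update of Q[(s, a)]."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def ric_sarsa_update(Q, visited, s, a, r, s_next, a_next, alpha, gamma):
    """Assumed RIC variant: the first reward for (s, a) sets Q[(s, a)] directly;
    every later visit falls back to the ordinary SARSA update."""
    if (s, a) not in visited:
        visited.add((s, a))
        Q[(s, a)] = r
    else:
        sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma)

# Optimistic start: after trying "a" once, its value drops below untried "b",
# so a greedy choice switches to "b".
Q_opt = make_optimistic_q(10.0)
sarsa_update(Q_opt, 0, "a", 1.0, 0, "a", alpha=0.5, gamma=0.0)
greedy_after_one_try = max(["a", "b"], key=lambda act: Q_opt[(0, act)])

# RIC with a fixed deterministic reward: the value is correct after one visit
# and the second update leaves it unchanged.
Q_ric, visited = defaultdict(float), set()
ric_sarsa_update(Q_ric, visited, 0, "a", 3.0, 0, "a", alpha=0.5, gamma=0.0)
ric_sarsa_update(Q_ric, visited, 0, "a", 3.0, 0, "a", alpha=0.5, gamma=0.0)
```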

			== See also ==
			* [[Reinforcement learning]]
			* [[Temporal difference learning]]
			* [[Q-learning]]

			== References ==
			{{reflist}}

			[[Category:Machine learning algorithms]]

en>Michael Hardy at 19:17, 13 August 2012

2012-08-13T19:17:57Z

New page


Layer cake representation - Revision history
