## What is expected SARSA?

Expected SARSA, as the name suggests, takes the expectation of the Q-values over every possible action in the next state, weighted by how likely each action is under the current policy. The target update rule makes this clear (Sutton and Barto, *Reinforcement Learning: An Introduction*, Eq. 6.9):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$
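A minimal sketch of this update in Python, assuming a tabular `Q` array and an $\epsilon$-greedy policy (the function name, step size, and table layout are illustrative, not from the text):

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, eps=0.1):
    """One Expected SARSA update on a tabular Q of shape [n_states, n_actions]."""
    n_actions = Q.shape[1]
    # Probability of each action in s_next under the current eps-greedy policy.
    probs = np.full(n_actions, eps / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - eps
    # Take the expectation over next actions instead of sampling a single a'.
    expected_q = np.dot(probs, Q[s_next])
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
    return Q
```

Because the target averages over all next actions rather than using one sampled $a'$, Expected SARSA removes the variance that sampling introduces in plain SARSA.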

## What is SARSA in reinforcement learning?

State–action–reward–state–action (SARSA) is an algorithm for learning a Markov decision process policy, used in the reinforcement learning area of machine learning. It was proposed by Rummery and Niranjan in a technical note titled "Modified Connectionist Q-Learning" (MCQ-L).

**What is the difference between TD 0 and SARSA?**

Policy control with TD(0) has two common implementations: SARSA and Q-learning. SARSA is an on-policy method: it computes the Q-value of the same policy the agent follows to act. Q-learning is an off-policy method: it learns the Q-value of the greedy policy while the agent behaves according to a different (typically $\epsilon$-greedy) policy.
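The difference shows up in the one-step targets each method bootstraps from; a sketch under the assumption of a tabular `Q` stored as a list of per-state action-value lists (function names are illustrative):

```python
def sarsa_target(Q, r, s_next, a_next, gamma=0.99):
    # On-policy: a_next was actually sampled from the behaviour policy.
    return r + gamma * Q[s_next][a_next]

def q_learning_target(Q, r, s_next, gamma=0.99):
    # Off-policy: bootstrap from the greedy action, whatever the agent did.
    return r + gamma * max(Q[s_next])
```

When the behaviour policy happens to pick the greedy action, the two targets coincide; they differ exactly when exploration picks a non-greedy $a'$.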

### Is expected SARSA on-policy or off-policy?

We know that SARSA is an on-policy technique and Q-learning is an off-policy technique, but Expected SARSA can be used either on-policy or off-policy. This makes Expected SARSA more flexible than either of those algorithms.

### Is SARSA model free?

Algorithms that purely sample from experience such as Monte Carlo Control, SARSA, Q-learning, Actor-Critic are “model free” RL algorithms.

**Why TD is better than Monte Carlo?**

The next most obvious advantage of TD methods over Monte Carlo methods is that they are naturally implemented in an on-line, fully incremental fashion. With Monte Carlo methods one must wait until the end of an episode, because only then is the return known, whereas with TD methods one need wait only one time step.
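The contrast can be sketched for state-value prediction; in this assumed setup (`V` is a dict of state values, and an episode is a list of `(state, reward)` pairs), TD(0) updates after a single step while MC must first finish the episode:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    # TD(0): update immediately after one time step, bootstrapping from V[s_next].
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def mc_update(V, episode, alpha=0.1, gamma=1.0):
    # Monte Carlo: must wait until the episode ends so the return G is known.
    G = 0.0
    for s, r in reversed(episode):  # episode = [(state, reward), ...]
        G = r + gamma * G
        V[s] += alpha * (G - V[s])
```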

## Is Q-learning on policy?

For example, Q-learning is an off-policy learner. On-policy methods attempt to evaluate or improve the policy that is used to make decisions. In contrast, off-policy methods evaluate or improve a policy different from that used to generate the data.

## How is a new action chosen in Sarsa?

In SARSA, this is done by choosing another action $a'$ following the same current policy described above and using $r + \gamma Q(s',a')$ as the target. SARSA is called on-policy learning because the new action $a'$ is chosen using the same $\epsilon$-greedy policy as the action $a$, the one that generated $s'$.

**How are Sarsa and Q-learning different from MC?**

SARSA and Q-learning are two reinforcement learning methods that do not require model knowledge, only observed rewards from many experiment runs. Unlike MC, where we must wait until the end of an episode to update the state-action value function $Q(s,a)$, SARSA and Q-learning make the update after each step.

### How does Sarsa learn the optimal Q-value function?

SARSA will learn the optimal $\epsilon$-greedy policy, i.e., the Q-value function will converge to an optimal Q-value function, but only within the space of $\epsilon$-greedy policies (provided every state-action pair is visited infinitely often). We expect that in the limit of $\epsilon$ decaying to $0$, SARSA will converge to the overall optimal policy.
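In practice this decay is implemented with a schedule; one common (assumed, not from the text) exponential form:

```python
def eps_schedule(episode, eps_start=1.0, eps_min=0.01, decay=0.995):
    # Exponential decay toward eps_min; as eps approaches 0, the
    # eps-greedy policy approaches the purely greedy policy.
    return max(eps_min, eps_start * decay ** episode)
```

Keeping a floor `eps_min > 0` preserves the "every state-action pair visited infinitely often" condition during training.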

### Why is SARSA called on-policy learning?

SARSA is called on-policy learning because the new action $a'$ is chosen using the same $\epsilon$-greedy policy as the action $a$, the one that generated $s'$. In Q-learning, this is done instead by choosing the greedy action $a^g$, i.e. the action that maximizes the Q-value function at the new state, and using $r + \gamma Q(s', a^g)$, or equivalently $r + \gamma \max_a Q(s', a)$, as the target.