## Does approximate Q-learning converge?

Value-based methods such as TD-learning [3], Q-learning [4], and SARSA [5] have been studied extensively in the literature and, under mild assumptions, have been proven to converge to the desired solution [6]–[8]. Here we focus on Q-learning with linear function approximation.
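For concreteness, here is a minimal sketch of that update with a linear approximator. The `features` function and the environment interface are assumptions for illustration, not part of any particular library:

```python
import numpy as np

def q_value(w, features, state, action):
    """Approximate Q(s, a) as a linear function of state-action features."""
    return np.dot(w, features(state, action))

def q_learning_update(w, features, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One approximate Q-learning step: move the weights toward the TD target."""
    # The TD target bootstraps from the greedy (max) action-value at the next state.
    target = r + gamma * max(q_value(w, features, s_next, a2) for a2 in actions)
    td_error = target - q_value(w, features, s, a)
    # Semi-gradient update: for a linear Q, the gradient is just the feature vector.
    return w + alpha * td_error * features(s, a)
```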

## What is meant by premature convergence?

In genetic algorithms, premature convergence means that the population for an optimization problem converges too early, settling on a suboptimal solution. An allele is considered lost when all individuals in the population share the same value for a particular gene, so the diversity needed to escape that suboptimum is gone.
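A small sketch of how one might detect lost alleles, assuming a bit-string representation where each individual is a list of gene values (the representation and population layout are illustrative assumptions):

```python
def lost_alleles(population):
    """Return the gene positions at which an allele has been lost,
    i.e., every individual carries the same value for that gene."""
    num_genes = len(population[0])
    return [g for g in range(num_genes)
            if len({individual[g] for individual in population}) == 1]

# Example: gene 1 is fixed at 0 across the whole population.
population = [[1, 0, 1], [0, 0, 1], [1, 0, 0]]
print(lost_alleles(population))  # -> [1]
```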

**Is Approximate Q-learning optimal?**

If the Q-value estimates are correct, the greedy policy with respect to them is optimal; with function approximation the estimates are generally inexact, so the greedy policy need not be. A common on-policy alternative (SARSA) updates based on the action the current policy actually takes from the next state, instead of the best action from the next state.
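Writing the two targets side by side in standard notation (learning rate α, discount γ) makes the distinction concrete:

```latex
% Q-learning (off-policy): bootstraps from the best next action
Q(s,a) \leftarrow Q(s,a) + \alpha\bigl[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\bigr]

% SARSA (on-policy): bootstraps from the action a' the policy actually takes
Q(s,a) \leftarrow Q(s,a) + \alpha\bigl[r + \gamma\, Q(s',a') - Q(s,a)\bigr]
```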

**Is there a proof that Q-learning converges when using function approximation?**

A complete proof that Q-learning finds the optimal Q-function can be found in the paper "Convergence of Q-learning: A Simple Proof" by Francisco S. Melo.

### Why is convergence of reinforcement learning algorithms important?

The convergence of Value and Policy Iteration gives a good indication of how reinforcement learning algorithms will converge, because reinforcement learning algorithms are essentially sampling-based versions of Value and Policy Iteration with a few more moving parts.
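For reference, a minimal value-iteration sketch that these sampling-based methods approximate. The transition model `P[s][a]`, assumed here to be a list of `(prob, next_state, reward)` triples, is an illustrative interface:

```python
def value_iteration(P, num_states, num_actions, gamma=0.99, eps=1e-6):
    """Apply the Bellman optimality backup until the values stop changing.

    P[s][a] is assumed to be a list of (prob, next_state, reward) triples.
    """
    V = [0.0] * num_states
    while True:
        delta = 0.0
        for s in range(num_states):
            # Back up each state by the best expected one-step return.
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(num_actions)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            return V
```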

### How is Q-learning similar to Q-value iteration?

Recall: Q-learning uses the same update rule as Q-value iteration, but the expectation over the transition function is replaced by sampling, and the reward function is replaced by the actual sample reward, r, received from the environment.
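Written out in standard notation (transition function T, learning rate α), the exact backup and its sampled counterpart are:

```latex
% Q-value iteration: exact expectation over the transition model
Q_{k+1}(s,a) = \sum_{s'} T(s,a,s')\bigl[R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\bigr]

% Q-learning: the expectation is replaced by a single sample (s', r)
Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\bigl[r + \gamma \max_{a'} Q(s',a')\bigr]
```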

**What to look for in a convergence proof?**

Any convergence proof looks for a relationship between the error bound, ε, and the number of steps (iterations), N. This relationship lets us bound the performance with an analytical equation: we want the bound on our utility error at step N, b(N), to be less than ε.
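As a concrete instance, consider the standard bound for value iteration, assuming rewards bounded by R_max and discount γ < 1: the Bellman backup is a γ-contraction in the max norm, so the utility error shrinks geometrically, and requiring b(N) < ε yields the number of iterations needed:

```latex
\|U_N - U^*\|_\infty \le \gamma^N \|U_0 - U^*\|_\infty
  \le \frac{2\gamma^N R_{\max}}{1-\gamma} = b(N)

b(N) < \varepsilon
  \quad\Longrightarrow\quad
  N \ge \left\lceil \frac{\log\bigl(2R_{\max}/(\varepsilon(1-\gamma))\bigr)}{\log(1/\gamma)} \right\rceil
```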