Discounted cumulative reward

In the previous section, we said that the goal of reinforcement learning is to learn a policy that, for each state s the agent finds itself in, indicates an action that maximizes the total reward received over the entire sequence of actions. How, then, can we maximize the total reinforcement received over the whole sequence?

The total reinforcement derived from the policy is calculated as follows:

R_t = r_{t+1} + r_{t+2} + ... + r_T

Here, rT represents the reward of the action that drives the environment into the terminal state sT.
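
As a minimal sketch (the reward list and function name are illustrative, not part of any particular library), the total return of an episode is simply the sum of the rewards collected up to the terminal step:

```python
# Minimal sketch: total (undiscounted) return of an episode.
# Assumes the rewards r_{t+1}, ..., r_T have already been collected in a list.
def total_return(rewards):
    return sum(rewards)

episode_rewards = [0.0, 1.0, -0.5, 2.0]  # hypothetical reward sequence
print(total_return(episode_rewards))     # 2.5
```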

A possible solution to the problem is to associate with each individual state the action that provides the highest reward; that is, we must determine an optimal policy such that the preceding quantity is maximized.

For problems that do not reach the goal or terminal state in a finite number of steps (continuing tasks), Rt tends to infinity.

In these cases, the sum of the rewards that we want to maximize diverges to infinity, so this approach is not applicable. It is therefore necessary to develop an alternative reinforcement technique.

The technique that best suits the reinforcement learning paradigm turns out to be the discounted cumulative reward, which tries to maximize the following quantity:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}

Here, γ is called the discount factor and represents the importance assigned to future rewards. This parameter can take values in the range 0 ≤ γ ≤ 1, with the following behavior (a short numerical sketch follows this list):

  • If γ < 1, the sum Rt will converge to a finite value (for bounded rewards)
  • If γ = 0, the agent will have no interest in future rewards, but will try to maximize the reward only for the current state
  • If γ = 1, the agent will try to increase future rewards even at the expense of the immediate ones
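
To make the effect of γ concrete, the following sketch (the reward sequence and function are purely illustrative) computes the discounted return of the same reward sequence for several discount factors:

```python
# Sketch: discounted cumulative return Rt = sum over k of gamma^k * r_{t+k+1}.
# The reward sequence below is purely illustrative.
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 1.0, 10.0]

for gamma in (0.0, 0.5, 0.9, 1.0):
    print(gamma, discounted_return(rewards, gamma))
# gamma = 0.0 -> 1.0   (only the immediate reward counts)
# gamma = 0.5 -> 2.5   (future rewards are strongly discounted)
# gamma = 0.9 -> 10.0  (future rewards still carry significant weight)
# gamma = 1.0 -> 14.0  (all rewards count equally)
```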

The discount factor can be modified during the learning process to highlight particular actions or states. An optimal policy may cause the reinforcement obtained from a single action to be low (or even negative), provided that this leads to greater reinforcement overall.
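
As a hypothetical numerical illustration of this last point (the two reward sequences and γ = 0.9 are assumptions, not taken from the text), an action with a negative immediate reward can still belong to the better sequence when it unlocks a larger reward later:

```python
# Hypothetical illustration: a negative immediate reward can still be optimal
# if it leads to greater reinforcement overall (gamma = 0.9 assumed).
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

greedy_sequence     = [2.0, 0.0, 0.0]    # takes the best immediate reward
farsighted_sequence = [-1.0, 0.0, 10.0]  # pays a cost now, collects more later

print(discounted_return(greedy_sequence))      # 2.0
print(discounted_return(farsighted_sequence))  # -1.0 + 0.0 + 8.1 = 7.1
```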