Monte Carlo methods

Monte Carlo (MC) methods estimate value functions and discover good policies without requiring a model of the environment. They learn from the agent's experience alone, that is, from sampled sequences of states, actions, and rewards obtained through interaction between the agent and the environment. This experience can be gathered by the agent while it learns, or drawn from a previously collected dataset. The ability to gain experience during learning (online learning) is attractive because it allows good behavior to be obtained even without prior knowledge of the environment's dynamics. Learning from an already collected dataset of experience is also useful because, when combined with online learning, it makes it possible to improve a policy automatically from the experience of others.

In general, MC methods rely on repeated random sampling to obtain numerical results; they use randomness to solve problems that are deterministic in principle. In our case, we will randomly sample states and state-action pairs, observe the rewards that follow, and then revise the policy iteratively. Repeating this process converges to the optimal policy as every possible state-action pair is explored.

For example, we could adopt the following procedure:

  • We will assign a reward of +1 for a correct action, -1 for a wrong action, and 0 for a draw.
  • We will build a table in which each key corresponds to a particular state-action pair and each value is the estimated value of that pair, that is, the average reward received for taking that action in that state (see the sketch after this list).
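A minimal Python sketch of such a table follows. It simply keeps a running average of the rewards observed for each state-action pair; the names RunningAverageTable, update, and value, and the example states and moves, are illustrative assumptions, not taken from the source.

from collections import defaultdict

# Running-average table of state-action values. Keys are (state, action)
# pairs; the stored value is the average reward received so far for that pair.
class RunningAverageTable:
    def __init__(self):
        self.mean = defaultdict(float)   # average reward per (state, action)
        self.count = defaultdict(int)    # number of times each pair was seen

    def update(self, state, action, reward):
        key = (state, action)
        self.count[key] += 1
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
        self.mean[key] += (reward - self.mean[key]) / self.count[key]

    def value(self, state, action):
        return self.mean[(state, action)]

# Hypothetical usage with the reward scheme above (+1 correct, -1 wrong, 0 draw):
table = RunningAverageTable()
table.update("some_state", "good_move", +1)    # correct action -> +1
table.update("some_state", "bad_move", -1)     # wrong action   -> -1
table.update("some_state", "good_move", 0)     # draw           ->  0
print(table.value("some_state", "good_move"))  # prints 0.5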

To solve reinforcement learning problems, MC methods estimate the value function from the average of the total rewards observed in past episodes. This assumes that experience is divided into episodes and that every episode consists of a finite number of transitions, because in MC methods new value estimates and policy updates are made only at the end of each episode. MC methods iteratively estimate the policy and the value function; here, however, each iteration corresponds to completing an episode, so the new estimates of the policy and value function are produced episode by episode.

The following is pseudocode for MC policy evaluation:

Initialize:
    an arbitrary policy π
    an arbitrary state-value function V(s)
    Returns(s) ← an empty list, for each state s
Repeat:
    generate an episode using π
    for each state s in the episode:
        append the return G that follows s to Returns(s)
        V(s) ← average of Returns(s)
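The sketch below turns this pseudocode into a runnable every-visit MC policy evaluation in Python. The environment is a hypothetical five-state random walk used only for illustration (states 0 to 4, episodes start in state 2, reaching state 4 pays +1, reaching state 0 pays 0), and all names are assumptions of this sketch rather than the book's reference implementation.

import random
from collections import defaultdict

def random_policy(state):
    # Fixed policy π: move left (-1) or right (+1) with equal probability.
    return random.choice((-1, 1))

def generate_episode(policy):
    # Generate one episode as a list of (state, action, reward) tuples.
    episode = []
    state = 2
    while state not in (0, 4):
        action = policy(state)
        next_state = state + action
        reward = 1.0 if next_state == 4 else 0.0
        episode.append((state, action, reward))
        state = next_state
    return episode

def mc_policy_evaluation(policy, num_episodes=10000, gamma=1.0):
    returns_sum = defaultdict(float)   # total return accumulated per state
    returns_count = defaultdict(int)   # number of returns observed per state
    V = defaultdict(float)             # value estimate per state
    for _ in range(num_episodes):
        episode = generate_episode(policy)
        G = 0.0
        # Walk the episode backwards so G is the return that follows each visit.
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            returns_sum[state] += G
            returns_count[state] += 1
            V[state] = returns_sum[state] / returns_count[state]
    return V

if __name__ == "__main__":
    V = mc_policy_evaluation(random_policy)
    for state in sorted(V):
        print(state, round(V[state], 3))
    # The non-terminal states 1, 2, 3 converge to roughly 0.25, 0.5, 0.75.

Note that the value estimates are updated only after each episode has finished, exactly as described above: the return G is known only once the episode terminates.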

Usually, the term MC is used for estimation methods whose operations involve a random component. Here, the term MC refers to reinforcement learning methods based on averaging total rewards. Unlike DP methods, which compute values for each state, MC methods compute values for each state-action pair, because, in the absence of a model, state values alone are not sufficient to decide which action is best to perform in a given state.
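The same averaging can be applied to state-action pairs. The sketch below is a small assumed extension of the previous listing (it reuses generate_episode and random_policy from that sketch) and estimates Q(s, a) instead of V(s).

from collections import defaultdict

def mc_q_evaluation(policy, num_episodes=10000, gamma=1.0):
    # Every-visit MC estimation of action values Q(s, a): average the return
    # that follows each occurrence of a (state, action) pair in an episode.
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    Q = defaultdict(float)
    for _ in range(num_episodes):
        episode = generate_episode(policy)   # from the previous sketch
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            returns_sum[(state, action)] += G
            returns_count[(state, action)] += 1
            Q[(state, action)] = returns_sum[(state, action)] / returns_count[(state, action)]
    return Q

Q = mc_q_evaluation(random_policy)
# Value of moving right vs. left from state 2 (roughly 0.75 vs. 0.25):
print(round(Q[(2, 1)], 3), round(Q[(2, -1)], 3))

With these action-value estimates, the agent can choose the best action in a state by comparing Q(s, a) across actions, without ever consulting a model of the environment's transitions.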