Reinforcement Learning
The RL problem presented in MLDemos is a Food Gathering problem, in
which the goal is to provide a policy for navigating a continuous
two-dimensional space and picking up food. The states, actions and
rewards are defined next.
States
States are defined as two-dimensional positions (x, y) ∈ R² in the
canvas space. The space is continuous and, for practical purposes,
bounded to [0, 1] in each dimension.
Actions
Actions are defined as movements from one state to another, following
a set of possible directions (defined by the user). The sets of
possible actions from each state are:
- 4-way: movement along either the horizontal or the vertical axis
- 8-way: movement along the horizontal or vertical axes, or diagonally at a 45° angle
- Full: movement along an arbitrary direction θ (θ ∈ [0, 2π])
In all cases, an additional "wait" action allows the agent to stay in place.
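
As an illustration, the sketch below (plain Python, not MLDemos code) enumerates the three action sets and applies an action to a state. The step length, the discretization of the Full case (θ is continuous in MLDemos) and all names are assumptions made for the example.

    import math

    STEP = 0.02  # assumed step length in canvas units

    def action_directions(mode, n_full=16):
        """Return candidate movement angles (radians) plus a 'wait' action (None)."""
        if mode == "4-way":
            angles = [i * math.pi / 2 for i in range(4)]          # horizontal or vertical
        elif mode == "8-way":
            angles = [i * math.pi / 4 for i in range(8)]          # adds the 45-degree diagonals
        elif mode == "full":
            angles = [i * 2 * math.pi / n_full for i in range(n_full)]  # theta in [0, 2*pi), discretized here
        else:
            raise ValueError(mode)
        return angles + [None]                                    # None encodes the 'wait' action

    def apply_action(state, angle):
        """Move from (x, y) along 'angle', clamped to the [0, 1] canvas; None means wait."""
        x, y = state
        if angle is None:
            return (x, y)
        x = min(1.0, max(0.0, x + STEP * math.cos(angle)))
        y = min(1.0, max(0.0, y + STEP * math.sin(angle)))
        return (x, y)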
Rewards
The state-value function is computed cumulatively by considering how
much food is collected along a trajectory from a given initial state,
over a number of Evaluation Steps (defined by the user).
- Sum of Rewards: Sum of all the food present in the current state at each step of the trajectory. Food can be collected multiple times from the same state.
- Sum (Non Repeatable): Same as Sum of Rewards, but once food is collected at a given step, the food at that location is erased and cannot be collected again.
- Sum - Harsh Turns: Same as above, but a penalty is applied when the latest action deviates by more than 90° from the previous one (harsh turn), in which case no food is collected for that step.
- Average of Rewards: The amount of food collected is divided by the number of steps taken over the whole trajectory.
At each policy-optimization iteration, the state-value function is
evaluated for as many states as there are basis functions, each
evaluation being initialized at the state corresponding to the center
of a basis function (on the grid).
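
The following sketch shows how a single trajectory could be scored under these four modes. It is an illustration rather than MLDemos code: it reuses apply_action and STEP from the action sketch above, and the food_at function, the policy callable and the grid-cell bookkeeping for non-repeatable food are all assumptions.

    import math

    def evaluate_state_value(start, policy, food_at, steps, mode, cell=0.05):
        """Roll out 'policy' for 'steps' steps from 'start' and accumulate food."""
        state, total = start, 0.0
        visited = set()          # grid cells already harvested (non-repeatable mode)
        prev_angle = None
        for _ in range(steps):
            angle = policy(state)                 # angle in [0, 2*pi) or None for 'wait'
            state = apply_action(state, angle)    # from the action sketch above
            reward = food_at(state)               # hypothetical: food amount at this state

            if mode == "sum_non_repeatable":
                key = (int(state[0] / cell), int(state[1] / cell))
                reward = 0.0 if key in visited else reward
                visited.add(key)
            elif mode == "sum_harsh_turns":
                if prev_angle is not None and angle is not None:
                    turn = abs((angle - prev_angle + math.pi) % (2 * math.pi) - math.pi)
                    if turn > math.pi / 2:        # more than 90 degrees: no food this step
                        reward = 0.0
            prev_angle = angle if angle is not None else prev_angle
            total += reward
        return total / steps if mode == "average" else total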
Policies
Three policies have been implemented in MLDemos. In all cases, the
policy determines what action will be taken from each state using a
grid-like distribution of basis functions. The action taken from a
specific state is "influenced" by the policy using three different
paradigms:
- Nearest Neighbors: the action taken from each state depends entirely on the direction suggested by the nearest basis function.
- Linear combination: the action taken from each state is computed as a linear combination of the closest basis functions, each weighted as a function of its proximity (inverse Euclidean distance).
- Gaussians: the action taken from each state is computed as a linear combination of the closest basis functions, each weighted as a function of its proximity (Gaussian function, with sigma equal to the distance between basis functions).
The first case is peculiar in that, while the state space is
continuous, the policy yields exactly the same action for whole sets
of states, which makes the problem somewhat discretized. The other two
policies provide a continuous set of actions for a continuous state
space and therefore raise no such issue.
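
The sketch below illustrates the three paradigms on a hypothetical grid of basis functions, each carrying a single preferred angle. For simplicity it weights all basis functions rather than only the closest ones, and every name in it is an assumption, not MLDemos API.

    import math

    def make_grid(n):
        """n x n basis-function centers on the unit canvas; spacing doubles as sigma."""
        step = 1.0 / n
        centers = [((i + 0.5) * step, (j + 0.5) * step) for i in range(n) for j in range(n)]
        return centers, step

    def policy_angle(state, centers, thetas, sigma, paradigm):
        """Blend the basis directions into one action angle for 'state'."""
        d = [math.dist(state, c) for c in centers]
        if paradigm == "nearest":
            return thetas[d.index(min(d))]                 # copy the closest basis function
        if paradigm == "linear":
            w = [1.0 / (di + 1e-9) for di in d]            # inverse Euclidean distance
        elif paradigm == "gaussian":
            w = [math.exp(-(di ** 2) / (2 * sigma ** 2)) for di in d]  # sigma = grid spacing
        else:
            raise ValueError(paradigm)
        # combine angles as unit vectors so the weighted average wraps correctly around 2*pi
        vx = sum(wi * math.cos(t) for wi, t in zip(w, thetas))
        vy = sum(wi * math.sin(t) for wi, t in zip(w, thetas))
        return math.atan2(vy, vx) % (2 * math.pi)

    # usage example (hypothetical values)
    centers, spacing = make_grid(8)
    thetas = [0.0] * len(centers)                          # all basis functions pointing right
    a = policy_angle((0.3, 0.7), centers, thetas, sigma=spacing, paradigm="gaussian")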
In Practice
The easiest way to test the reinforcement learning process is to:
- Use the Reward Painter button in the drawing tools to paint
food (red) onto the canvas
- Click the Initialize button to start the learning process
This starts the RL process, displays the policy basis functions, and
updates them every Display Steps iterations.
Options and Commands
The interface for Reinforcement Learning (the right-hand side of the
Algorithm Options dialog) provides the following commands:
- Initialize: initialize the RL problem and start the learning process
- Pause / Continue: pause or resume the learning process (pausing stops the animation as well)
- Clear: clear the current model (does NOT clear the data)
- Drag Me: (for display purposes only) display the evaluation
steps for an agent at a specific position (drag and drop onto
the canvas)
- X: erase all displayed agents
The options regarding the policy type, reward and evaluation have
been described above.
Generate Rewards
It is possible to generate a set of pre-constructed rewards by
dragging and dropping either a Gaussian of fixed size (Var option) or
a gradient running from the center of the canvas to the dropped
position. Alternatively, a number of standard benchmark functions are
provided. Use the Set button to draw the selected benchmark function
onto the canvas.
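
As a rough illustration of what such pre-constructed rewards look like, the sketch below (not MLDemos code; the grid resolution and all names are assumptions) builds a reward grid from either a Gaussian blob of fixed variance around the drop point or a linear gradient oriented from the canvas center toward the drop point.

    import math

    def gaussian_reward(drop, var, res=64):
        """res x res reward grid with a Gaussian of variance 'var' centered at 'drop'."""
        return [[math.exp(-((((i + 0.5) / res) - drop[0]) ** 2 +
                            (((j + 0.5) / res) - drop[1]) ** 2) / (2 * var))
                 for j in range(res)] for i in range(res)]

    def gradient_reward(drop, res=64):
        """Reward increasing linearly along the direction from the canvas center to 'drop'."""
        cx, cy = 0.5, 0.5
        dx, dy = drop[0] - cx, drop[1] - cy
        norm = math.hypot(dx, dy) or 1.0
        dx, dy = dx / norm, dy / norm
        grid = [[((i + 0.5) / res - cx) * dx + ((j + 0.5) / res - cy) * dy
                 for j in range(res)] for i in range(res)]
        lo = min(min(row) for row in grid)
        hi = max(max(row) for row in grid)
        span = (hi - lo) or 1.0
        return [[(v - lo) / span for v in row] for row in grid]   # normalized to [0, 1]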