Stochastic Methods for Reward Maximization


This page presents several stochastic methods for exploring a policy space and maximizing an arbitrary reward function.

Random Search
Randomly sample new policies (points) at each iteration and keep the maximum value found. The display shows in black all visited policies, and in green the best policies found at each iteration.
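A minimal Python sketch of this loop follows; the uniform sampling bounds and the toy reward function are illustrative placeholders, not the demo's actual settings.

    import numpy as np

    def random_search(reward_fn, dim, n_iters=1000, low=0.0, high=1.0, rng=None):
        # Sample a fresh policy uniformly at each iteration and keep the best one found.
        rng = rng or np.random.default_rng()
        best_policy, best_reward = None, -np.inf
        for _ in range(n_iters):
            candidate = rng.uniform(low, high, size=dim)   # new random policy
            r = reward_fn(candidate)
            if r > best_reward:                            # keep the running maximum
                best_policy, best_reward = candidate, r
        return best_policy, best_reward

    # Example with a toy reward peaked at (0.3, 0.3)
    best, value = random_search(lambda p: -np.sum((p - 0.3) ** 2), dim=2)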

Random Walk
Generate samples around the current maximum by drawing from a normal distribution.
The display shows in green the history of the best policies found at each iteration, in black all policies visited so far. Ellipses display the search distribution (bold: 1x variance, thin: 2x variance).
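A minimal Python sketch of this idea, assuming an isotropic search variance; the parameter values are illustrative, not the demo's defaults.

    import numpy as np

    def random_walk(reward_fn, dim, n_iters=1000, search_variance=0.01, rng=None):
        rng = rng or np.random.default_rng()
        current = rng.uniform(0.0, 1.0, size=dim)          # starting policy
        best_reward = reward_fn(current)
        for _ in range(n_iters):
            # Draw the next sample around the current maximum from N(current, search_variance)
            candidate = rng.normal(current, np.sqrt(search_variance), size=dim)
            r = reward_fn(candidate)
            if r > best_reward:                            # recentre only when the sample improves
                current, best_reward = candidate, r
        return current, best_reward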

Policy learning by Weighting Exploration with the Returns (PoWER)
PoWER uses a set of the k best policies to estimate the maximum, and explores new samples by drawing from a normal distribution whose variance can be kept constant or adapted to the current best policies. Detailed information on PoWER can be found in this paper. The display shows in green the set of k best policies, in white the history of best policies throughout the iterations, and in black all visited policies. Ellipses display the search distribution (bold: 1x variance, thin: 2x variance).
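The sketch below captures only the search-loop idea described above (keep the k best policies, take their reward-weighted mean as the new search centre, and optionally adapt the variance); it omits the episodic, per-basis-function formulation of the original PoWER algorithm, and the weighting scheme shown is an assumption.

    import numpy as np

    def power_search(reward_fn, dim, n_iters=500, k=10, init_variance=0.05,
                     adapt_variance=True, rng=None):
        rng = rng or np.random.default_rng()
        mean = rng.uniform(0.0, 1.0, size=dim)
        var = np.full(dim, init_variance)
        elites = []                                          # list of (reward, policy) pairs
        for _ in range(n_iters):
            sample = rng.normal(mean, np.sqrt(var), size=dim)
            elites.append((reward_fn(sample), sample))
            elites = sorted(elites, key=lambda e: e[0], reverse=True)[:k]   # keep the k best
            rewards = np.array([r for r, _ in elites])
            policies = np.array([p for _, p in elites])
            weights = rewards - rewards.min() + 1e-12        # shift so weights are positive
            weights /= weights.sum()
            mean = weights @ policies                        # reward-weighted estimate of the maximum
            if adapt_variance:                               # adapt exploration to the elite spread
                var = weights @ (policies - mean) ** 2 + 1e-6
        return mean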

Particle Filters
Particle Filters use a set of particles (samples) to model the reward function. New particles are generated from the best older particles at each iteration, quickly discarding particles in regions that have no reward. More information on Wikipedia.
In the implementation here, each iteration follows five steps:
  1. Displace the previous particles by adding noise drawn from N(0, Search Variance)
  2. Recompute the weight of each particle
  3. Find the best particle
  4. Resample particles for the next iteration
  5. Check for degenerate weights and replace the corresponding particles
The display shows in green the current particles, with larger radii indicating larger weights. The best particles from previous iterations are shown in white.
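A minimal Python sketch of these five steps, assuming non-negative rewards can be used directly as particle weights; the degeneracy threshold and parameter names are illustrative, not the demo's.

    import numpy as np

    def particle_filter_search(reward_fn, dim, n_particles=50, n_iters=100,
                               search_variance=0.01, rng=None):
        rng = rng or np.random.default_rng()
        particles = rng.uniform(0.0, 1.0, size=(n_particles, dim))
        best_policy, best_reward = None, -np.inf
        for _ in range(n_iters):
            # 1. Displace the previous particles with N(0, search_variance) noise
            particles += rng.normal(0.0, np.sqrt(search_variance), size=particles.shape)
            # 2. Recompute the weight of each particle from its reward (assumed non-negative)
            weights = np.clip(np.array([reward_fn(p) for p in particles]), 0.0, None)
            # 3. Find the best particle seen so far
            i_best = int(np.argmax(weights))
            if weights[i_best] > best_reward:
                best_policy, best_reward = particles[i_best].copy(), weights[i_best]
            # 4. Resample particles for the next iteration, proportionally to their weights
            if weights.sum() > 0:
                probs = weights / weights.sum()
            else:
                probs = np.full(n_particles, 1.0 / n_particles)
            idx = rng.choice(n_particles, size=n_particles, p=probs)
            particles = particles[idx]
            # 5. Replace particles with degenerate (negligible) weights by random restarts
            degenerate = probs[idx] < 1e-6
            particles[degenerate] = rng.uniform(0.0, 1.0, size=(degenerate.sum(), dim))
        return best_policy, best_reward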

Gradient Descent
Gradient descent methods compute the gradient of the reward in the neighborhood of the current minimum (or maximum, in the case of gradient ascent) and move the search along the direction of steepest descent (or ascent). More information on Wikipedia.
The display shows in white the history of best policies found at each iteration, and in green the current best policy.
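Since the reward here is an arbitrary black-box function, one way to sketch this is finite-difference gradient ascent in Python; the step size and epsilon are illustrative assumptions, not the demo's settings.

    import numpy as np

    def gradient_ascent(reward_fn, start, n_iters=200, step_size=0.1, eps=1e-4):
        policy = np.asarray(start, dtype=float).copy()
        for _ in range(n_iters):
            # Estimate the gradient by central finite differences around the current policy
            grad = np.zeros_like(policy)
            for i in range(policy.size):
                delta = np.zeros_like(policy)
                delta[i] = eps
                grad[i] = (reward_fn(policy + delta) - reward_fn(policy - delta)) / (2 * eps)
            policy += step_size * grad       # move uphill to increase the reward
        return policy, reward_fn(policy)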