class PolicyGradientInitializer(*, discrete_action_space_size, reward_discount_factor=0.95, learning_rate=0.01, learning_rate_decay_factor=0.999, min_learning_rate=None, rng_seed=None)

Configuration for the policy-gradient-based agent. The agent trains a policy, represented by a neural network, to maximize the long-term discounted rewards it receives when interacting with the user-specified environment. The policy is updated at the conclusion of each complete episode.
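To illustrate what "long-term discounted rewards" means here, the following is a minimal sketch of how per-step discounted returns could be computed for one finished episode. The helper name `discounted_returns` is hypothetical and not part of this API; the agent's internal computation may differ (for example, it may normalize returns or subtract a baseline).

```python
def discounted_returns(rewards, reward_discount_factor=0.95):
    """Return the discounted return G_t for every step t of one episode.

    Hypothetical helper for illustration only; not part of this library.
    G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the episode backwards so each step reuses the return of the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + reward_discount_factor * running
        returns[t] = running
    return returns


# A discount factor of 0 keeps only the immediate reward at each step,
# while a discount factor of 1 sums all remaining rewards with equal weight.
print(discounted_returns([1.0, 1.0, 1.0], reward_discount_factor=0.0))
print(discounted_returns([1.0, 1.0, 1.0], reward_discount_factor=1.0))
```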

  • discrete_action_space_size (int) – The number of possible actions the agent can choose from. No further information about the action space is needed for this agent. Must be positive. See Also: Action.discreteIndex

  • reward_discount_factor (float, optional) – Indicates how strongly far-off rewards are discounted. 0 <= reward_discount_factor <= 1. A reward_discount_factor of zero means the agent will be maximally near-sighted and learn to maximize one-step returns. A reward_discount_factor of one means the agent will weight all rewards equally, however far in the future they occur, and learn to maximize the undiscounted return over an entire trajectory.

  • learning_rate (float, optional) – Indicates the learning rate used to optimize the agent’s policy neural network. 0 < learning_rate, typically <= 1.

  • learning_rate_decay_factor (float, optional) – Indicates the geometric decay rate for the policy network’s gradient update step. 0 < learning_rate_decay_factor <= 1. The current learning rate is multiplied by this factor after each episode. Decay continues until the learning rate reaches min_learning_rate, where it plateaus.

  • min_learning_rate (float, optional) – Indicates the minimum learning rate for the policy neural network; once this value is reached, decay stops. 0 < min_learning_rate <= learning_rate. Defaults to 75% of the learning rate.

  • rng_seed (int, optional) – Seed for the random number generator. Use this option to generate deterministic results from the agent.
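
The interaction between learning_rate, learning_rate_decay_factor, and min_learning_rate can be sketched as follows. This is an illustrative reconstruction from the parameter descriptions above, not the library's actual implementation; the function name `learning_rate_schedule` is hypothetical, and the sketch assumes the first episode uses the undecayed learning_rate.

```python
def learning_rate_schedule(num_episodes, learning_rate=0.01,
                           learning_rate_decay_factor=0.999,
                           min_learning_rate=None):
    """Return the learning rate used in each of the first num_episodes episodes.

    Hypothetical helper for illustration only; not part of this library.
    """
    if min_learning_rate is None:
        # Documented default: 75% of the initial learning rate.
        min_learning_rate = 0.75 * learning_rate
    rates = []
    current = learning_rate
    for _ in range(num_episodes):
        rates.append(current)
        # Geometric decay after each episode, clamped at the minimum.
        current = max(current * learning_rate_decay_factor, min_learning_rate)
    return rates


rates = learning_rate_schedule(1000)
print(rates[0], rates[-1])
```

Under these assumptions and the default values, the rate falls from 0.01 to the 0.0075 floor after roughly 288 episodes and stays there.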