The problem setting is to solve the continuous MountainCar problem (MountainCarContinuous-v0) in OpenAI Gym. The environment has a continuous state space, described as follows (copied from the wiki):

- States: the position of the car along the horizontal axis and its velocity, each bounded to a fixed range.
- Actions: the acceleration of the car is controlled by applying a force within a bounded range; a negative value pushes the car to the left and a positive value pushes it to the right.
- Goal: get the car to accelerate up the hill and reach the flag. Note that velocity has been constrained to facilitate exploration, but this constraint might be relaxed in a more challenging version.
- Reward: 100 for reaching the target on the hill on the right-hand side, minus the squared sum of actions from start to goal. This reward function raises an exploration challenge: if the agent does not reach the target soon enough, it will figure out that it is better not to move, and will never find the target. Note that this reward is unusual with respect to most published work, where the goal is to reach the target as fast as possible, hence favouring a bang-bang strategy.
- Starting state: position between -0.6 and -0.4, with zero velocity.
- Episode termination: the episode ends either when the car reaches the goal or when the total number of time steps reaches 1000, regardless of whether the goal was reached. A constraint on velocity might be added in a more challenging version.

The approach uses the policy gradient algorithm with a baseline to reduce variance. Even though the state space is continuous, in this attempt we will use a discrete softmax policy. In other words, the continuous state space is discretized into buckets of states that are fed to the agent, which outputs a discrete action: a push to the left or a push to the right. Since the state space has two dimensions, position and velocity, we discretize them separately into 150 buckets for position and 120 buckets for velocity.

To overcome the exploration-exploitation dilemma, we use an epsilon-greedy approach that slowly decreases the randomization factor over time. This ensures that the agent collects a wide variety of state-action training samples early on, while in the later part of training it follows its own "trained strategy" rather than random actions. Technically, in the code we use a temperature term to smooth the action probabilities, and epsilon to decide whether to take a random action or the action predicted by the policy, as sketched in the code below.

The training process follows a Monte Carlo method: training only takes place after an entire episode is completed, replaying the accumulated state/action/reward/next-state samples. This sits at one end of a spectrum whose other end is 1-step Temporal Difference learning, so Monte Carlo is essentially \(\infty\)-step Temporal Difference learning. There is a balance in when you want to train the agent: training too early can give very messy results and make convergence harder, while training too late can prolong the time until convergence. The criteria for choosing the method depend heavily on the problem itself (see *Reinforcement Learning: An Introduction* for the full spectrum of methods).
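To make the approach concrete, here is a minimal sketch of the state discretization and the temperature/epsilon action selection, assuming a tabular softmax policy over two discrete push actions (left and right). The bucket counts match the ones above, but the action set, variable names, and overall structure are illustrative assumptions rather than the actual implementation.

```python
import numpy as np
import gym

# Illustrative sketch only: the names, the two-action set, and the table
# layout are assumptions, not the post's actual code.

N_POS_BUCKETS = 150                      # position discretization (from the post)
N_VEL_BUCKETS = 120                      # velocity discretization (from the post)
ACTIONS = np.array([-1.0, 1.0])          # assumed discrete forces: push left / push right

env = gym.make("MountainCarContinuous-v0")
low, high = env.observation_space.low, env.observation_space.high

# Interior bucket edges for each state dimension.
pos_bins = np.linspace(low[0], high[0], N_POS_BUCKETS + 1)[1:-1]
vel_bins = np.linspace(low[1], high[1], N_VEL_BUCKETS + 1)[1:-1]

def discretize(obs):
    """Map a continuous (position, velocity) observation to bucket indices."""
    return np.digitize(obs[0], pos_bins), np.digitize(obs[1], vel_bins)

# Tabular action preferences for the discrete softmax policy.
theta = np.zeros((N_POS_BUCKETS, N_VEL_BUCKETS, len(ACTIONS)))

def action_probs(state, tau):
    """Softmax over the action preferences, smoothed by the temperature tau."""
    logits = theta[state] / tau
    logits = logits - logits.max()       # subtract max for numerical stability
    e = np.exp(logits)
    return e / e.sum()

def select_action(state, tau, epsilon):
    """Epsilon decides between a random action and sampling from the policy."""
    if np.random.random() < epsilon:
        return np.random.randint(len(ACTIONS))                        # explore
    return np.random.choice(len(ACTIONS), p=action_probs(state, tau)) # follow policy
```

Both \(\tau\) and \(\epsilon\) would then be annealed linearly over the training episodes, as described above.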
Introducing baseline to reduce variance

The comparison experiment below shows the effects of having a baseline on the agent's performance, using the following training settings:

- Discretization for position state: 150 buckets
- Discretization for velocity state: 120 buckets
- Temperature (\(\tau\)): 1, decreasing to 0.5 linearly over all training episodes
- Epsilon-greedy (\(\epsilon\)): 1, decreasing to 0.1 linearly over all training episodes

(Rewards graph: Without Baseline)

We can easily see from the rewards graphs that the agent with a baseline has smaller variance in its rewards, which translates to a faster convergence rate.

This problem (MountainCarContinuous-v0) was intended to be solved with a continuous action policy. However, I didn't use continuous actions because I wanted to see how well a discrete-action agent could perform on this simple task. In conclusion, the number of episodes until convergence is quite good: the overall max reward reaches the goal of 90 at around 1000 episodes. Looking at the leaderboard for this problem and for MountainCar-v0, the discrete version of the problem, this shows that our discrete policy is a good baseline to start with before we dive into continuous actions.
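For reference, below is a minimal sketch of the episode-level (Monte Carlo) policy-gradient update with a state-value baseline described above. The tables have the same shapes as in the earlier sketch; the discount factor, learning rates, and all names are illustrative assumptions, not values from the post.

```python
import numpy as np

# Illustrative sketch only: the discount factor, learning rates, and names are
# assumptions; the "without baseline" run simply keeps b(s) fixed at zero.

GAMMA = 0.99             # assumed discount factor
ALPHA_POLICY = 0.01      # assumed learning rate for the policy preferences
ALPHA_BASELINE = 0.05    # assumed learning rate for the baseline b(s)

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def update_from_episode(theta, baseline, states, actions, rewards, tau):
    """Apply one REINFORCE-with-baseline update after a finished episode.

    theta    : (pos_buckets, vel_buckets, n_actions) action-preference table
    baseline : (pos_buckets, vel_buckets) state-value estimates b(s)
    states   : (pos_idx, vel_idx) bucket tuples visited during the episode
    actions  : chosen action indices
    rewards  : rewards received at each step
    """
    G = 0.0
    # Walk the episode backwards, accumulating the discounted return G_t.
    for state, action, reward in zip(reversed(states),
                                     reversed(actions),
                                     reversed(rewards)):
        G = reward + GAMMA * G
        advantage = G - baseline[state]                # baseline subtraction reduces variance
        baseline[state] += ALPHA_BASELINE * advantage  # move b(s) toward the observed return

        probs = softmax(theta[state] / tau)
        grad_log_pi = -probs                           # gradient of the log softmax policy
        grad_log_pi[action] += 1.0                     # (the 1/tau factor is folded into
        theta[state] += ALPHA_POLICY * advantage * grad_log_pi  # the learning rate)
```

Calling this once per finished episode matches the Monte Carlo training loop described earlier; keeping the baseline table at zero and skipping its update gives the "Without Baseline" run used in the comparison.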