Contents
  • PPO algorithm
  • Actor-Critic algorithm
  • Gym
  • LunarLander-v2
  • Implementing the lunar lander with PPO
    • ppo

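Overview

Today we start a new chapter and learn (or get pulled into) reinforcement learning together. In reinforcement learning, an agent observes its environment, analyzes what it sees, and takes actions so as to maximize its future return.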

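Types of reinforcement learning algorithms

On-policy vs. off-policy:

  • On-policy: the training data comes from the agent currently being trained, as it keeps interacting with the environment.
  • Off-policy: the agent being trained and the agent interacting with the environment are not the same agent. In other words, someone else interacts with the environment and supplies my training data.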
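
One way to picture the difference is through how collected experience is stored. The sketch below is only an illustration under that assumption: `RolloutBuffer` and `ReplayBuffer` are hypothetical class names, not part of Gym or any other library. An on-policy learner throws its data away after each update, while an off-policy learner keeps sampling from data that older policies produced.

```python
import random
from collections import deque


class RolloutBuffer:
    """On-policy storage: holds only data gathered by the current policy."""

    def __init__(self):
        self.transitions = []

    def add(self, transition):
        self.transitions.append(transition)

    def drain(self):
        """Return everything collected so far and forget it."""
        batch, self.transitions = self.transitions, []
        return batch


class ReplayBuffer:
    """Off-policy storage: keeps transitions produced by earlier policies."""

    def __init__(self, capacity):
        # The oldest transitions are dropped once capacity is reached
        self.transitions = deque(maxlen=capacity)

    def add(self, transition):
        self.transitions.append(transition)

    def sample(self, batch_size):
        """Pick a random batch, regardless of which policy generated it."""
        return random.sample(list(self.transitions), batch_size)


on_policy = RolloutBuffer()
on_policy.add(("state", "action", 1.0))
print(len(on_policy.drain()), len(on_policy.transitions))
```

The PPO code later in this article follows the first pattern: its memory is cleared after every policy update.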

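PPO algorithm

PPO (Proximal Policy Optimization) is an on-policy algorithm. It limits each update to a small step by clipping the ratio between the new and the old policy, which addresses the problem that training becomes hard to learn from when the new policy differs too much from the old one.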
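
To make the idea of a limited update concrete, here is a minimal sketch of PPO's clipped surrogate objective for a single sample. The function name `clipped_surrogate` and the default `eps_clip=0.2` are illustrative choices, not a library API.

```python
import math


def clipped_surrogate(new_logprob, old_logprob, advantage, eps_clip=0.2):
    """Clipped surrogate objective for one (state, action) sample."""
    # Probability ratio between the new and the old policy for the same action
    ratio = math.exp(new_logprob - old_logprob)
    # Restrict the ratio to [1 - eps_clip, 1 + eps_clip]
    clipped_ratio = max(1.0 - eps_clip, min(ratio, 1.0 + eps_clip))
    # Keep the more pessimistic of the two estimates
    return min(ratio * advantage, clipped_ratio * advantage)


# The new policy makes the action twice as likely, but the gain is capped by the clip
print(clipped_surrogate(math.log(0.5), math.log(0.25), advantage=1.0))
# With a negative advantage the unclipped, more pessimistic term is kept
print(clipped_surrogate(math.log(0.5), math.log(0.25), advantage=-1.0))
```

Because the objective stops improving once the ratio leaves the clipping range, the optimizer has no incentive to push the new policy far away from the old one in a single update.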

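Actor-Critic algorithm

An actor-critic algorithm has two parts. The first is the policy function, the actor, which generates actions and interacts with the environment. The second is the value function, the critic, which evaluates how well the actor is doing.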
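
The division of labor can be sketched in a few lines of plain Python. This is a toy illustration with made-up numbers: `softmax` stands in for the actor's output layer, and `advantage` shows one common way the critic's estimate is used to judge an action. Both function names are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw actor scores into action probabilities."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def advantage(observed_return, critic_value):
    """Positive when the outcome was better than the critic expected."""
    return observed_return - critic_value


# Hypothetical actor scores for 4 actions and a hypothetical critic estimate
action_probs = softmax([0.1, 0.5, -0.3, 0.0])
print("action probabilities:", action_probs)
print("advantage:", advantage(observed_return=12.0, critic_value=10.0))
```

A positive advantage suggests making the chosen action more likely, a negative one suggests making it less likely.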

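Gym

Gym is a package used all the time in reinforcement learning. It bundles a large collection of game environments. Below we use LunarLander-v2 to build an automated “Apollo moon landing”.

Installation:

pip install gym

If you run into this error:

AttributeError: module 'gym.envs.box2d' has no attribute 'LunarLander'

Fix: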

pip install gym[box2d]

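LunarLander-v2

LunarLander-v2 is a lunar lander environment. The landing pad is located at coordinates (0, 0), and the lander's coordinates are the first two numbers of the state vector. Moving from the top of the screen to the landing pad and coming to zero speed is worth roughly 100 to 140 points. The episode ends when the lander crashes or comes to rest, which earns an additional -100 or +100 points respectively. Each leg touching the ground is worth +10, firing the main engine costs 0.3 points per frame, and the environment counts as solved at 200 points.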
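
To make the 8-number state vector easier to read, the following sketch gives each component a name. Only the first two entries (the coordinates) are described in the text above. The remaining names follow the Gym documentation for LunarLander as I understand it, and `LanderState` and `decode_state` are hypothetical helper names.

```python
from typing import NamedTuple, Sequence


class LanderState(NamedTuple):
    x: float                  # horizontal position (landing pad at 0)
    y: float                  # vertical position (landing pad at 0)
    vx: float                 # horizontal velocity
    vy: float                 # vertical velocity
    angle: float              # lander angle
    angular_velocity: float   # rotation speed
    left_leg_contact: bool    # assumed order: left leg first
    right_leg_contact: bool   # assumed order: right leg second


def decode_state(observation: Sequence[float]) -> LanderState:
    """Attach names to the 8 entries of a LunarLander observation."""
    x, y, vx, vy, angle, angular_velocity, left, right = observation
    return LanderState(x, y, vx, vy, angle, angular_velocity, bool(left), bool(right))


# A sample observation in which one leg flag is set
sample = [0.57930076, -0.09440845, 1.4345247, -0.693939, -2.0783656, -5.4039164, 1.0, 0.0]
print(decode_state(sample))
```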

Starting the lander

Code:

import gym

# Create the environment
# (classic Gym API, gym < 0.26: step() returns 4 values)
env = gym.make("LunarLander-v2")

# Reset the environment
env.reset()

# Run 180 steps
for i in range(180):
    # Render the environment
    env.render()

    # Take a random action
    observation, reward, done, info = env.step(env.action_space.sample())

    if i % 10 == 0:
        # Debug output
        print("Observation:", observation)
        print("Reward:", reward)

# Release the rendering window
env.close()

Output:

Observation: [ 0.00861025 1.4061487 0.42930993 -0.11858992 -0.00789343 -0.05729095
0. 0. ]
Reward: 0.4097546298543773
Observation: [ 0.04917412 1.3876126 0.41002613 -0.13066985 -0.06578191 -0.12604967
0. 0. ]
Reward: -1.0858669952763478
Observation: [ 0.08917055 1.3429415 0.43598312 -0.2890789 -0.17471936 -0.23913136
0. 0. ]
Reward: -2.9339827504803666
Observation: [ 0.1326253 1.2450166 0.44708318 -0.5567949 -0.32039645 -0.28250334
0. 0. ]
Reward: -2.2779730990326357
Observation: [ 0.18323365 1.1110108 0.615291 -0.61922276 -0.43743232 -0.2921057
0. 0. ]
Reward: -3.107298313736037
Observation: [ 0.24544087 0.94960684 0.66677517 -0.7835077 -0.5929364 -0.2968613
0. 0. ]
Reward: -0.5472611013563438
Observation: [ 0.3148238 0.75122666 0.7238519 -0.98458177 -0.72915816 -0.26130882
0. 0. ]
Reward: -2.5665300894414416
Observation: [ 0.38628978 0.49828076 0.74157137 -1.2624744 -0.85754734 -0.37227553
0. 0. ]
Reward: -3.2562193227533087
Observation: [ 0.46820658 0.18855602 0.92624503 -1.4677961 -1.08614 -0.4508995
0. 0. ]
Reward: -4.017106927961208
Observation: [ 0.57930076 -0.09440845 1.4345247 -0.693939 -2.0783656 -5.4039164
1. 0. ]
Reward: -100
Observation: [ 0.7383894 -0.08930686 1.4662493 -0.13461255 -3.653495 -3.109081
0. 0. ]
Reward: -100
Observation: [ 0.859124 -0.08471288 0.9377837 0.21408719 -3.8998525 0.10151418
0. 0. ]
Reward: -100
Observation: [ 9.3801367e-01 -4.6761338e-02 6.5999150e-01 1.4583524e-01
-3.9281998e+00 -4.7179851e-06 0.0000000e+00 1.0000000e+00]
Reward: -100
Observation: [ 0.9879366 -0.04012476 0.33624884 0.08859511 -4.253908 -1.0233303
0. 0. ]
Reward: -100
Observation: [ 1.0056045 -0.03840658 0.0733737 0.01812508 -4.6796274 -0.6103991
0. 0. ]
Reward: -100
Observation: [ 1.0112988 -0.03921754 0.07890484 -0.00624387 -4.845023 -0.17111658
0. 0. ]
Reward: -100
Observation: [ 1.0234139 -0.04488504 0.15701209 -0.0331554 -4.829875 0.07602684
0. 0. ]
Reward: -100
Observation: [ 1.0306002e+00 -4.8987642e-02 -1.1189224e-02 8.7506004e-04
-4.8712435e+00 -1.5446089e-01 0.0000000e+00 0.0000000e+00]
Reward: -100

Implementing the lunar lander with PPO

ppo

import tensorflow as tf
from tensorflow_probability import distributions as tfd


class Memory:
    def __init__(self):
        """Initialize the rollout storage."""
        self.actions = []  # actions (4 possible)
        self.states = []  # states, 8 numbers each
        self.logprobs = []  # log-probabilities of the chosen actions
        self.rewards = []  # rewards
        self.is_terminals = []  # whether the episode ended

    def clear_memory(self):
        """Clear the memory."""
        del self.actions[:]
        del self.states[:]
        del self.logprobs[:]
        del self.rewards[:]
        del self.is_terminals[:]


class ActorCritic(tf.keras.Model):
    def __init__(self, state_dim, action_dim, n_latent_var):
        super(ActorCritic, self).__init__()

        # Actor
        self.action_layer = tf.keras.Sequential([
            # [b, 8] => [b, 64]
            tf.keras.layers.Dense(n_latent_var, activation="tanh"),

            # [b, 64] => [b, 64]
            tf.keras.layers.Dense(n_latent_var, activation="tanh"),

            # [b, 64] => [b, 4]
            tf.keras.layers.Dense(action_dim, activation="softmax")
        ])

        # Critic
        self.value_layer = tf.keras.Sequential([
            # [b, 8] => [b, 64]
            tf.keras.layers.Dense(n_latent_var, activation="tanh"),

            # [b, 64] => [b, 64]
            tf.keras.layers.Dense(n_latent_var, activation="tanh"),

            # [b, 64] => [b, 1]
            tf.keras.layers.Dense(1)
        ])

    def forward(self):
        """Forward pass; unused, act() and evaluate() take its place."""

        raise NotImplementedError

    def build(self, input_shape):

        # no weight to train.
        super(ActorCritic, self).build(input_shape)  # be sure to call this at the end

    def act(self, state, memory):
        """Choose an action for the given state and record it in memory."""

        # Probabilities of the 4 actions
        action_probs = self.action_layer(state)

        # Sample an action from the categorical distribution
        # (the first positional argument of Categorical is logits, so pass probs by name)
        dist = tfd.Categorical(probs=action_probs)
        action = dist.sample()

        # Store in memory
        memory.states.append(state)
        memory.actions.append(action)
        memory.logprobs.append(dist.log_prob(action))

        # Return the action
        return action.numpy()[0]

    def evaluate(self, state, action):
        """
        Evaluate stored states and actions under the current policy.
        :param state: states, in batches of 2000, shape [2000, 8]
        :param action: actions, in batches of 2000, shape [2000]
        :return: action log-probabilities, state values, distribution entropy
        """

        # Action probabilities
        action_probs = self.action_layer(state)
        dist = tfd.Categorical(probs=action_probs)  # categorical distribution

        # Log-probability of each stored action
        action_logprobs = dist.log_prob(action)

        # Entropy of the action distribution
        dist_entropy = dist.entropy()

        # Critic
        state_value = self.value_layer(state)
        state_value = tf.squeeze(state_value, axis=-1)  # [2000, 1] => [2000]

        # Return log-probabilities, state values, entropy
        return action_logprobs, state_value, dist_entropy


class PPO:
    def __init__(self, state_dim, action_dim, n_latent_var, lr, betas, gamma, k_epochs, eps_clip):
        self.lr = lr  # learning rate
        self.betas = betas  # Adam betas
        self.gamma = gamma  # discount factor
        self.eps_clip = eps_clip  # clipping range for the probability ratio
        self.k_epochs = k_epochs  # optimization epochs per update

        # Initialize the policies
        self.policy = ActorCritic(state_dim, action_dim, n_latent_var)
        self.policy_old = ActorCritic(state_dim, action_dim, n_latent_var)

        # Create the weights, then start policy_old as an exact copy of policy
        for net in (self.policy, self.policy_old):
            net.action_layer.build((None, state_dim))
            net.value_layer.build((None, state_dim))
        self.sync_old_policy()

        self.optimizer = tf.keras.optimizers.Adam(learning_rate=lr, beta_1=betas[0], beta_2=betas[1])  # optimizer
        self.mseloss = tf.keras.losses.MeanSquaredError()  # value loss

    def sync_old_policy(self):
        """Copy the current policy's weights into policy_old."""
        self.policy_old.action_layer.set_weights(self.policy.action_layer.get_weights())
        self.policy_old.value_layer.set_weights(self.policy.value_layer.get_weights())

    def update(self, memory):
        """Update the policy from the collected rollout."""

        # Monte Carlo estimate of the return of each state
        rewards = []
        discounted_reward = 0
        for reward, is_terminal in zip(reversed(memory.rewards), reversed(memory.is_terminals)):
            # An episode ended here, so restart the running return
            if is_terminal:
                discounted_reward = 0

            # Discounted return: current reward + gamma * return of the following step
            discounted_reward = reward + (self.gamma * discounted_reward)

            # Insert at the front
            rewards.insert(0, discounted_reward)

        # Normalize the returns
        rewards = tf.convert_to_tensor(rewards, dtype=tf.float32)
        rewards = (rewards - tf.reduce_mean(rewards)) / (tf.math.reduce_std(rewards) + 1e-5)

        # Convert to tensors of shape [2000, 8], [2000], [2000]
        old_states = tf.concat(memory.states, axis=0)
        old_actions = tf.concat(memory.actions, axis=0)
        old_logprobs = tf.concat(memory.logprobs, axis=0)

        variables = self.policy.action_layer.trainable_variables + self.policy.value_layer.trainable_variables

        # Optimize for k epochs
        for _ in range(self.k_epochs):
            with tf.GradientTape() as tape:

                # Evaluate the stored transitions under the current policy
                logprobs, state_values, dist_entropy = self.policy.evaluate(old_states, old_actions)

                # Probability ratio between the new and the old policy
                ratios = tf.exp(logprobs - old_logprobs)

                # Clipped surrogate loss + value loss - entropy bonus
                # (the advantage is treated as a constant for the actor)
                advantages = rewards - tf.stop_gradient(state_values)
                surr1 = ratios * advantages
                surr2 = tf.clip_by_value(ratios, 1 - self.eps_clip, 1 + self.eps_clip) * advantages
                loss = -tf.minimum(surr1, surr2) + 0.5 * self.mseloss(state_values, rewards) - 0.01 * dist_entropy
                loss = tf.reduce_mean(loss)

            # Apply the gradients
            grads = tape.gradient(loss, variables)
            self.optimizer.apply_gradients(zip(grads, variables))

        # Copy the new weights into the old policy
        self.sync_old_policy()
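
The Monte Carlo return loop at the top of `update` can be tried on its own. The sketch below is a plain-Python copy of that loop. The function name `discounted_returns` and the toy inputs are made up for illustration.

```python
def discounted_returns(rewards, is_terminals, gamma=0.99):
    """Discounted return of every step, restarting at episode boundaries."""
    returns = []
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it
    for reward, done in zip(reversed(rewards), reversed(is_terminals)):
        if done:
            # This step ended an episode, so nothing after it counts
            running = 0.0
        running = reward + gamma * running
        returns.insert(0, running)
    return returns


# Two episodes stored back to back; the first one ends at index 1
print(discounted_returns([1.0, 2.0, 3.0], [False, True, False], gamma=0.5))
# prints [2.0, 2.0, 3.0]
```

The reward of 3.0 belongs to the second episode, so it does not leak into the returns of the first two steps.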

main

import gym
import tensorflow as tf
from ppo import Memory, PPO

############## Hyperparameters ##############
env_name = "LunarLander-v2"  # environment name
env = gym.make(env_name)
state_dim = 8  # state dimension
action_dim = 4  # number of actions
render = False  # visualize
solved_reward = 230  # stop condition (average reward >= 230)
log_interval = 20  # print avg reward in the interval
max_episodes = 50000  # maximum number of training episodes
max_timesteps = 300  # maximum steps per episode
n_latent_var = 64  # hidden layer width
update_timestep = 2000  # update the policy every 2000 steps
lr = 0.002  # learning rate
betas = (0.9, 0.999)  # Adam betas
gamma = 0.99  # discount factor
k_epochs = 4  # optimization epochs per policy update
eps_clip = 0.2  # PPO clipping range


#############################################

def main():
    # Instantiate (lowercase names would shadow the imported classes)
    memory = Memory()
    ppo = PPO(state_dim, action_dim, n_latent_var, lr, betas, gamma, k_epochs, eps_clip)

    # Running totals
    total_reward = 0
    total_length = 0
    timestep = 0

    # Training loop
    for i_episode in range(1, max_episodes + 1):

        # Reset the environment (start a new game)
        state = env.reset()

        # Convert to a tensor
        state = tf.convert_to_tensor(state)
        state = tf.reshape(state, [1, state_dim])

        # Play one episode
        for t in range(max_timesteps):
            timestep += 1

            # Choose an action with the old policy
            action = ppo.policy_old.act(state, memory)

            # Step the environment: (new state, reward, done flag, extra debug info)
            state, reward, done, _ = env.step(action)

            # Convert to a tensor
            state = tf.convert_to_tensor(state)
            state = tf.reshape(state, [1, state_dim])

            # Update memory (reward / whether the episode ended)
            memory.rewards.append(reward)
            memory.is_terminals.append(done)

            # Update the policy
            if timestep % update_timestep == 0:
                ppo.update(memory)

                # Clear the memory
                memory.clear_memory()

                # Reset the step counter
                timestep = 0

            # Accumulate the reward
            total_reward += reward

            # Visualize
            if render:
                env.render()

            # Stop when the episode is over
            if done:
                break

        # Episode length
        total_length += t

        # Stop once the reward accumulated in the current logging window
        # reaches log_interval * solved_reward (an average of 230 per episode)
        if total_reward >= (log_interval * solved_reward):
            print("########## solved! ##########")

            # Save the models
            tf.keras.models.save_model(ppo.policy.action_layer, r"\model\action")
            tf.keras.models.save_model(ppo.policy.value_layer, r"\model\value")

            # Leave the training loop
            break

        # Log every 20 episodes
        if i_episode % log_interval == 0:

            # Average length / reward over the last 20 episodes
            avg_length = int(total_length / log_interval)
            running_reward = int(total_reward / log_interval)

            # Debug output
            print('episode {} \t avg length: {} \t reward: {}'.format(i_episode, avg_length, running_reward))

            # Reset the totals
            total_reward = 0
            total_length = 0

if __name__ == '__main__':
    main()

Output

episode 20 avg length: 93 reward: -243
episode 40 avg length: 92 reward: -172
episode 60 avg length: 79 reward: -192
episode 80 avg length: 85 reward: -164
episode 100 avg length: 90 reward: -179
episode 120 avg length: 100 reward: -201
episode 140 avg length: 91 reward: -175
episode 160 avg length: 101 reward: -141
episode 180 avg length: 86 reward: -153
episode 200 avg length: 93 reward: -189
episode 220 avg length: 96 reward: -221
episode 240 avg length: 105 reward: -140
episode 260 avg length: 94 reward: -121
episode 280 avg length: 91 reward: -131
episode 300 avg length: 91 reward: -122
episode 320 avg length: 90 reward: -113
episode 340 avg length: 100 reward: -110
episode 360 avg length: 110 reward: -92
episode 380 avg length: 110 reward: -75
episode 400 avg length: 119 reward: -76
episode 420 avg length: 162 reward: -77
episode 440 avg length: 194 reward: -91
episode 460 avg length: 144 reward: -28
episode 480 avg length: 192 reward: -8
episode 500 avg length: 244 reward: -25
episode 520 avg length: 239 reward: -1
episode 540 avg length: 269 reward: 21
episode 560 avg length: 289 reward: 27
episode 580 avg length: 270 reward: 65
episode 600 avg length: 264 reward: 86
episode 620 avg length: 256 reward: 66
episode 640 avg length: 278 reward: 75
episode 660 avg length: 235 reward: 11
episode 680 avg length: 244 reward: 84
episode 700 avg length: 253 reward: 73
episode 720 avg length: 292 reward: 63
episode 740 avg length: 293 reward: 104
episode 760 avg length: 279 reward: 109
episode 780 avg length: 246 reward: 86
episode 800 avg length: 260 reward: 124
episode 820 avg length: 276 reward: 131
episode 840 avg length: 269 reward: 121
episode 860 avg length: 194 reward: 67
episode 880 avg length: 241 reward: 94
episode 900 avg length: 259 reward: 98
episode 920 avg length: 211 reward: 83
episode 940 avg length: 260 reward: 105
episode 960 avg length: 194 reward: 65
episode 980 avg length: 202 reward: 68
episode 1000 avg length: 243 reward: 79
episode 1020 avg length: 260 reward: 66
episode 1040 avg length: 289 reward: 117
episode 1060 avg length: 252 reward: 94
episode 1080 avg length: 262 reward: 114
episode 1100 avg length: 272 reward: 112
episode 1120 avg length: 263 reward: 97
episode 1140 avg length: 256 reward: 93
episode 1160 avg length: 274 reward: 120
episode 1180 avg length: 256 reward: 117
episode 1200 avg length: 241 reward: 105
episode 1220 avg length: 238 reward: 103
episode 1240 avg length: 267 reward: 121
episode 1260 avg length: 283 reward: 124
episode 1280 avg length: 299 reward: 149
episode 1300 avg length: 281 reward: 126
episode 1320 avg length: 266 reward: 102
episode 1340 avg length: 282 reward: 128
episode 1360 avg length: 275 reward: 114
episode 1380 avg length: 285 reward: 105
episode 1400 avg length: 294 reward: 123
episode 1420 avg length: 293 reward: 132
episode 1440 avg length: 248 reward: 85
episode 1460 avg length: 281 reward: 115
episode 1480 avg length: 291 reward: 152
episode 1500 avg length: 279 reward: 130
episode 1520 avg length: 267 reward: 103
episode 1540 avg length: 270 reward: 137
episode 1560 avg length: 269 reward: 120
episode 1580 avg length: 260 reward: 113
episode 1600 avg length: 282 reward: 147
episode 1620 avg length: 259 reward: 125
episode 1640 avg length: 240 reward: 90
episode 1660 avg length: 284 reward: 125
episode 1680 avg length: 282 reward: 123
episode 1700 avg length: 274 reward: 123
episode 1720 avg length: 273 reward: 130
episode 1740 avg length: 260 reward: 117
episode 1760 avg length: 243 reward: 106
episode 1780 avg length: 241 reward: 90
episode 1800 avg length: 290 reward: 144
episode 1820 avg length: 258 reward: 131
episode 1840 avg length: 283 reward: 142
episode 1860 avg length: 262 reward: 100
episode 1880 avg length: 273 reward: 132
episode 1900 avg length: 255 reward: 92
episode 1920 avg length: 251 reward: 117
episode 1940 avg length: 220 reward: 103
episode 1960 avg length: 221 reward: 111
episode 1980 avg length: 205 reward: 83
episode 2000 avg length: 227 reward: 102
episode 2020 avg length: 251 reward: 123
episode 2040 avg length: 227 reward: 100
episode 2060 avg length: 255 reward: 135
episode 2080 avg length: 273 reward: 136
episode 2100 avg length: 256 reward: 126
episode 2120 avg length: 273 reward: 141
episode 2140 avg length: 280 reward: 109
episode 2160 avg length: 266 reward: 112
episode 2180 avg length: 249 reward: 88
episode 2200 avg length: 247 reward: 119
episode 2220 avg length: 270 reward: 143
episode 2240 avg length: 257 reward: 65
episode 2260 avg length: 250 reward: 30
episode 2280 avg length: 261 reward: 112
episode 2300 avg length: 270 reward: 139
episode 2320 avg length: 275 reward: 128
episode 2340 avg length: 290 reward: 149
episode 2360 avg length: 269 reward: 139
episode 2380 avg length: 272 reward: 137
episode 2400 avg length: 232 reward: 105
episode 2420 avg length: 242 reward: 127
episode 2440 avg length: 241 reward: 134
episode 2460 avg length: 249 reward: 113
episode 2480 avg length: 287 reward: 154
episode 2500 avg length: 289 reward: 149
episode 2520 avg length: 258 reward: 129
episode 2540 avg length: 250 reward: 101
episode 2560 avg length: 287 reward: 158
episode 2580 avg length: 271 reward: 145
episode 2600 avg length: 253 reward: 120
episode 2620 avg length: 255 reward: 127
episode 2640 avg length: 254 reward: 122
episode 2660 avg length: 238 reward: 123
episode 2680 avg length: 243 reward: 115
episode 2700 avg length: 241 reward: 93
episode 2720 avg length: 232 reward: 90
episode 2740 avg length: 215 reward: 83
episode 2760 avg length: 241 reward: 112
episode 2780 avg length: 273 reward: 129
episode 2800 avg length: 269 reward: 133
episode 2820 avg length: 246 reward: 91
episode 2840 avg length: 261 reward: 130
episode 2860 avg length: 261 reward: 136
episode 2880 avg length: 289 reward: 128
episode 2900 avg length: 271 reward: 131
episode 2920 avg length: 277 reward: 145
episode 2940 avg length: 251 reward: 117
episode 2960 avg length: 253 reward: 120
episode 2980 avg length: 270 reward: 133
episode 3000 avg length: 240 reward: 85
episode 3020 avg length: 284 reward: 141
episode 3040 avg length: 255 reward: 117
episode 3060 avg length: 299 reward: 134
episode 3080 avg length: 263 reward: 122
episode 3100 avg length: 259 reward: 126
episode 3120 avg length: 270 reward: 125
episode 3140 avg length: 299 reward: 150
episode 3160 avg length: 256 reward: 116
episode 3180 avg length: 264 reward: 124
episode 3200 avg length: 271 reward: 128
episode 3220 avg length: 259 reward: 122
episode 3240 avg length: 261 reward: 125
episode 3260 avg length: 271 reward: 129
episode 3280 avg length: 242 reward: 126
episode 3300 avg length: 218 reward: 93
episode 3320 avg length: 230 reward: 116
episode 3340 avg length: 223 reward: 109
episode 3360 avg length: 249 reward: 122
episode 3380 avg length: 224 reward: 104
episode 3400 avg length: 261 reward: 131
episode 3420 avg length: 280 reward: 140
episode 3440 avg length: 264 reward: 125
episode 3460 avg length: 247 reward: 105
episode 3480 avg length: 276 reward: 141
episode 3500 avg length: 282 reward: 149
episode 3520 avg length: 282 reward: 141
episode 3540 avg length: 290 reward: 152
episode 3560 avg length: 282 reward: 141
episode 3580 avg length: 291 reward: 151
episode 3600 avg length: 289 reward: 166
episode 3620 avg length: 266 reward: 142
episode 3640 avg length: 277 reward: 91
episode 3660 avg length: 272 reward: 114
episode 3680 avg length: 281 reward: 159
episode 3700 avg length: 287 reward: 160
episode 3720 avg length: 254 reward: 78
episode 3740 avg length: 296 reward: 174
episode 3760 avg length: 267 reward: 124
episode 3780 avg length: 273 reward: 148
episode 3800 avg length: 275 reward: 147
episode 3820 avg length: 276 reward: 145
episode 3840 avg length: 283 reward: 151
episode 3860 avg length: 275 reward: 142
episode 3880 avg length: 290 reward: 142
episode 3900 avg length: 290 reward: 154
episode 3920 avg length: 283 reward: 141
episode 3940 avg length: 273 reward: 145
episode 3960 avg length: 290 reward: 161
episode 3980 avg length: 268 reward: 145
episode 4000 avg length: 270 reward: 142
episode 4020 avg length: 283 reward: 156
episode 4040 avg length: 283 reward: 149
episode 4060 avg length: 299 reward: 172
episode 4080 avg length: 292 reward: 158
episode 4100 avg length: 274 reward: 143
episode 4120 avg length: 299 reward: 163
episode 4140 avg length: 290 reward: 153
episode 4160 avg length: 299 reward: 165
episode 4180 avg length: 290 reward: 160
episode 4200 avg length: 299 reward: 157
episode 4220 avg length: 299 reward: 171
episode 4240 avg length: 271 reward: 148
episode 4260 avg length: 265 reward: 139
episode 4280 avg length: 258 reward: 137
episode 4300 avg length: 280 reward: 137
episode 4320 avg length: 262 reward: 133
episode 4340 avg length: 255 reward: 110
episode 4360 avg length: 275 reward: 134
episode 4380 avg length: 282 reward: 154
episode 4400 avg length: 264 reward: 128
episode 4420 avg length: 299 reward: 150
episode 4440 avg length: 275 reward: 151
episode 4460 avg length: 257 reward: 116
episode 4480 avg length: 256 reward: 104
episode 4500 avg length: 263 reward: 134
episode 4520 avg length: 299 reward: 164
episode 4540 avg length: 265 reward: 137
episode 4560 avg length: 265 reward: 147
episode 4580 avg length: 283 reward: 138
episode 4600 avg length: 299 reward: 152
episode 4620 avg length: 281 reward: 154
episode 4640 avg length: 289 reward: 161
episode 4660 avg length: 264 reward: 143
episode 4680 avg length: 285 reward: 138
episode 4700 avg length: 291 reward: 143
episode 4720 avg length: 280 reward: 154
episode 4740 avg length: 284 reward: 125
episode 4760 avg length: 296 reward: 136
episode 4780 avg length: 254 reward: 127
episode 4800 avg length: 281 reward: 147
episode 4820 avg length: 282 reward: 143
episode 4840 avg length: 243 reward: 119
episode 4860 avg length: 280 reward: 139
episode 4880 avg length: 270 reward: 137
episode 4900 avg length: 278 reward: 150
episode 4920 avg length: 203 reward: 83
episode 4940 avg length: 272 reward: 153
episode 4960 avg length: 289 reward: 151
episode 4980 avg length: 289 reward: 157
episode 5000 avg length: 299 reward: 168
episode 5020 avg length: 292 reward: 136
episode 5040 avg length: 290 reward: 158
episode 5060 avg length: 286 reward: 157
episode 5080 avg length: 282 reward: 154
episode 5100 avg length: 278 reward: 121
episode 5120 avg length: 291 reward: 138
episode 5140 avg length: 297 reward: 143
episode 5160 avg length: 290 reward: 165
episode 5180 avg length: 290 reward: 157
episode 5200 avg length: 276 reward: 150
episode 5220 avg length: 278 reward: 149
episode 5240 avg length: 287 reward: 153
episode 5260 avg length: 274 reward: 145
episode 5280 avg length: 299 reward: 176
episode 5300 avg length: 299 reward: 173
episode 5320 avg length: 299 reward: 164
episode 5340 avg length: 271 reward: 157
episode 5360 avg length: 299 reward: 180
episode 5380 avg length: 279 reward: 156
episode 5400 avg length: 268 reward: 133
episode 5420 avg length: 279 reward: 136
episode 5440 avg length: 278 reward: 130
episode 5460 avg length: 268 reward: 137
episode 5480 avg length: 273 reward: 152
episode 5500 avg length: 299 reward: 168
episode 5520 avg length: 266 reward: 95
episode 5540 avg length: 294 reward: 146
episode 5560 avg length: 289 reward: 165
episode 5580 avg length: 288 reward: 139
episode 5600 avg length: 299 reward: 174
episode 5620 avg length: 291 reward: 168
episode 5640 avg length: 281 reward: 147
episode 5660 avg length: 270 reward: 126
episode 5680 avg length: 263 reward: 153
episode 5700 avg length: 283 reward: 161
episode 5720 avg length: 271 reward: 154
episode 5740 avg length: 281 reward: 154
episode 5760 avg length: 281 reward: 144
episode 5780 avg length: 272 reward: 145
episode 5800 avg length: 275 reward: 128
episode 5820 avg length: 290 reward: 159
episode 5840 avg length: 274 reward: 142
episode 5860 avg length: 243 reward: 122
episode 5880 avg length: 236 reward: 124
episode 5900 avg length: 255 reward: 139
episode 5920 avg length: 288 reward: 140
episode 5940 avg length: 271 reward: 140
episode 5960 avg length: 254 reward: 108
episode 5980 avg length: 299 reward: 149
episode 6000 avg length: 289 reward: 149
episode 6020 avg length: 258 reward: 109
episode 6040 avg length: 289 reward: 129
episode 6060 avg length: 238 reward: 94
episode 6080 avg length: 270 reward: 87
episode 6100 avg length: 268 reward: 96
episode 6120 avg length: 279 reward: 142
episode 6140 avg length: 233 reward: 112
episode 6160 avg length: 268 reward: 142
episode 6180 avg length: 260 reward: 133
episode 6200 avg length: 210 reward: 109
episode 6220 avg length: 248 reward: 111
episode 6240 avg length: 229 reward: 92
episode 6260 avg length: 210 reward: 98
episode 6280 avg length: 218 reward: 102
episode 6300 avg length: 225 reward: 117
episode 6320 avg length: 235 reward: 112
episode 6340 avg length: 259 reward: 124
episode 6360 avg length: 252 reward: 113
episode 6380 avg length: 239 reward: 119
episode 6400 avg length: 242 reward: 95
episode 6420 avg length: 249 reward: 111
episode 6440 avg length: 257 reward: 136
episode 6460 avg length: 259 reward: 123
episode 6480 avg length: 259 reward: 112
episode 6500 avg length: 259 reward: 129
episode 6520 avg length: 215 reward: 101
episode 6540 avg length: 249 reward: 137
episode 6560 avg length: 245 reward: 121
episode 6580 avg length: 259 reward: 127
episode 6600 avg length: 267 reward: 142
episode 6620 avg length: 257 reward: 86
episode 6640 avg length: 278 reward: 141
episode 6660 avg length: 255 reward: 92
episode 6680 avg length: 289 reward: 145
episode 6700 avg length: 259 reward: 133
episode 6720 avg length: 247 reward: 116
episode 6740 avg length: 243 reward: 56
episode 6760 avg length: 274 reward: 114
episode 6780 avg length: 279 reward: 133
episode 6800 avg length: 269 reward: 152
episode 6820 avg length: 252 reward: 105
episode 6840 avg length: 254 reward: 123
episode 6860 avg length: 253 reward: 98
episode 6880 avg length: 273 reward: 132
episode 6900 avg length: 249 reward: 108
episode 6920 avg length: 248 reward: 84
episode 6940 avg length: 250 reward: 107
episode 6960 avg length: 279 reward: 99
episode 6980 avg length: 279 reward: 140
episode 7000 avg length: 270 reward: 105
episode 7020 avg length: 250 reward: 109
episode 7040 avg length: 202 reward: 87
episode 7060 avg length: 188 reward: 56
episode 7080 avg length: 229 reward: 93
episode 7100 avg length: 248 reward: 105
episode 7120 avg length: 218 reward: 105
episode 7140 avg length: 213 reward: 77
episode 7160 avg length: 279 reward: 128
episode 7180 avg length: 247 reward: 110
episode 7200 avg length: 269 reward: 124
episode 7220 avg length: 217 reward: 64
episode 7240 avg length: 258 reward: 140
episode 7260 avg length: 279 reward: 116
episode 7280 avg length: 244 reward: 97
episode 7300 avg length: 245 reward: 104
episode 7320 avg length: 213 reward: 81
episode 7340 avg length: 268 reward: 126
episode 7360 avg length: 277 reward: 124
episode 7380 avg length: 251 reward: 122
episode 7400 avg length: 234 reward: 108
episode 7420 avg length: 267 reward: 127
episode 7440 avg length: 218 reward: 89
episode 7460 avg length: 199 reward: 80
episode 7480 avg length: 154 reward: 55
episode 7500 avg length: 228 reward: 114
episode 7520 avg length: 197 reward: 49
episode 7540 avg length: 147 reward: 59
episode 7560 avg length: 139 reward: 49
episode 7580 avg length: 181 reward: 74
episode 7600 avg length: 191 reward: 61
episode 7620 avg length: 176 reward: 78
episode 7640 avg length: 160 reward: 35
episode 7660 avg length: 159 reward: 50
episode 7680 avg length: 143 reward: 68
episode 7700 avg length: 227 reward: 103
episode 7720 avg length: 192 reward: 59
episode 7740 avg length: 248 reward: 118
episode 7760 avg length: 250 reward: 128
episode 7780 avg length: 261 reward: 110
episode 7800 avg length: 279 reward: 157
episode 7820 avg length: 249 reward: 153
episode 7840 avg length: 212 reward: 78
episode 7860 avg length: 249 reward: 144
episode 7880 avg length: 257 reward: 107
episode 7900 avg length: 271 reward: 136
episode 7920 avg length: 244 reward: 129
episode 7940 avg length: 262 reward: 145
episode 7960 avg length: 224 reward: 94
episode 7980 avg length: 247 reward: 110
episode 8000 avg length: 190 reward: 81
episode 8020 avg length: 157 reward: 67
episode 8040 avg length: 171 reward: 67
episode 8060 avg length: 203 reward: 96
episode 8080 avg length: 225 reward: 87
episode 8100 avg length: 166 reward: 84
episode 8120 avg length: 196 reward: 82
episode 8140 avg length: 249 reward: 120
episode 8160 avg length: 216 reward: 112
episode 8180 avg length: 178 reward: 97
episode 8200 avg length: 221 reward: 120
episode 8220 avg length: 265 reward: 122
episode 8240 avg length: 240 reward: 125
episode 8260 avg length: 266 reward: 146
episode 8280 avg length: 253 reward: 116
episode 8300 avg length: 233 reward: 129
episode 8320 avg length: 260 reward: 126
episode 8340 avg length: 264 reward: 138
episode 8360 avg length: 196 reward: 88
episode 8380 avg length: 189 reward: 60
episode 8400 avg length: 227 reward: 66
episode 8420 avg length: 257 reward: 114
episode 8440 avg length: 254 reward: 99
episode 8460 avg length: 268 reward: 127
episode 8480 avg length: 263 reward: 131
episode 8500 avg length: 246 reward: 107
episode 8520 avg length: 281 reward: 127
episode 8540 avg length: 273 reward: 146
episode 8560 avg length: 290 reward: 124
episode 8580 avg length: 261 reward: 103
episode 8600 avg length: 294 reward: 140
episode 8620 avg length: 236 reward: 110
episode 8640 avg length: 261 reward: 125
episode 8660 avg length: 284 reward: 108
episode 8680 avg length: 278 reward: 141
episode 8700 avg length: 256 reward: 124
episode 8720 avg length: 245 reward: 95
episode 8740 avg length: 258 reward: 136
episode 8760 avg length: 289 reward: 147
episode 8780 avg length: 229 reward: 98
episode 8800 avg length: 277 reward: 138
episode 8820 avg length: 237 reward: 129
episode 8840 avg length: 276 reward: 141
episode 8860 avg length: 224 reward: 102
episode 8880 avg length: 220 reward: 108
episode 8900 avg length: 277 reward: 137
episode 8920 avg length: 259 reward: 120
episode 8940 avg length: 242 reward: 124
episode 8960 avg length: 275 reward: 119
episode 8980 avg length: 256 reward: 140
episode 9000 avg length: 263 reward: 110
episode 9020 avg length: 247 reward: 101
episode 9040 avg length: 251 reward: 99
episode 9060 avg length: 266 reward: 128
episode 9080 avg length: 247 reward: 119
episode 9100 avg length: 227 reward: 95
episode 9120 avg length: 242 reward: 95
episode 9140 avg length: 234 reward: 120
episode 9160 avg length: 271 reward: 145
episode 9180 avg length: 234 reward: 106
episode 9200 avg length: 230 reward: 102
episode 9220 avg length: 217 reward: 111
episode 9240 avg length: 182 reward: 68
episode 9260 avg length: 225 reward: 111
episode 9280 avg length: 224 reward: 110
episode 9300 avg length: 195 reward: 97
episode 9320 avg length: 245 reward: 110
episode 9340 avg length: 249 reward: 87
episode 9360 avg length: 238 reward: 105
episode 9380 avg length: 231 reward: 83
episode 9400 avg length: 245 reward: 60
episode 9420 avg length: 251 reward: 81
episode 9440 avg length: 218 reward: 86
episode 9460 avg length: 177 reward: 62
episode 9480 avg length: 212 reward: 64
episode 9500 avg length: 213 reward: 96
episode 9520 avg length: 267 reward: 121
episode 9540 avg length: 195 reward: 89
episode 9560 avg length: 259 reward: 140
episode 9580 avg length: 246 reward: 116
episode 9600 avg length: 266 reward: 122
episode 9620 avg length: 255 reward: 104
episode 9640 avg length: 203 reward: 116
episode 9660 avg length: 239 reward: 117
episode 9680 avg length: 239 reward: 118
episode 9700 avg length: 254 reward: 137
episode 9720 avg length: 269 reward: 144
episode 9740 avg length: 274 reward: 136
episode 9760 avg length: 259 reward: 123
episode 9780 avg length: 230 reward: 102
episode 9800 avg length: 268 reward: 139
episode 9820 avg length: 258 reward: 120
episode 9840 avg length: 271 reward: 111
episode 9860 avg length: 260 reward: 130
episode 9880 avg length: 280 reward: 135
episode 9900 avg length: 269 reward: 126
episode 9920 avg length: 290 reward: 159
episode 9940 avg length: 286 reward: 129
episode 9960 avg length: 259 reward: 117
episode 9980 avg length: 299 reward: 139
episode 10000 avg length: 298 reward: 141
episode 10020 avg length: 294 reward: 115
episode 10040 avg length: 284 reward: 117
episode 10060 avg length: 299 reward: 156
episode 10080 avg length: 290 reward: 145
episode 10100 avg length: 280 reward: 151
episode 10120 avg length: 299 reward: 163
episode 10140 avg length: 290 reward: 151
episode 10160 avg length: 269 reward: 133
episode 10180 avg length: 259 reward: 134
episode 10200 avg length: 272 reward: 137
episode 10220 avg length: 260 reward: 121
episode 10240 avg length: 259 reward: 103
episode 10260 avg length: 260 reward: 126
episode 10280 avg length: 279 reward: 150
episode 10300 avg length: 268 reward: 128
episode 10320 avg length: 261 reward: 140
episode 10340 avg length: 243 reward: 111
episode 10360 avg length: 236 reward: 113
episode 10380 avg length: 219 reward: 112
episode 10400 avg length: 267 reward: 140
episode 10420 avg length: 279 reward: 146
episode 10440 avg length: 285 reward: 137
episode 10460 avg length: 255 reward: 107
episode 10480 avg length: 249 reward: 115
episode 10500 avg length: 241 reward: 106
episode 10520 avg length: 219 reward: 102
episode 10540 avg length: 200 reward: 52
episode 10560 avg length: 267 reward: 124
episode 10580 avg length: 235 reward: 111
episode 10600 avg length: 223 reward: 86
episode 10620 avg length: 220 reward: 90
episode 10640 avg length: 269 reward: 145
episode 10660 avg length: 255 reward: 133
episode 10680 avg length: 277 reward: 130
episode 10700 avg length: 280 reward: 142
episode 10720 avg length: 278 reward: 128
episode 10740 avg length: 260 reward: 90
episode 10760 avg length: 288 reward: 145
episode 10780 avg length: 238 reward: 94
episode 10800 avg length: 278 reward: 136
episode 10820 avg length: 288 reward: 150
episode 10840 avg length: 280 reward: 148
episode 10860 avg length: 240 reward: 117
episode 10880 avg length: 257 reward: 124
episode 10900 avg length: 261 reward: 130
episode 10920 avg length: 229 reward: 115
episode 10940 avg length: 259 reward: 144
episode 10960 avg length: 238 reward: 138
episode 10980 avg length: 230 reward: 112
episode 11000 avg length: 254 reward: 126
episode 11020 avg length: 281 reward: 141
episode 11040 avg length: 270 reward: 120
episode 11060 avg length: 297 reward: 174
episode 11080 avg length: 261 reward: 138
episode 11100 avg length: 259 reward: 125
episode 11120 avg length: 292 reward: 173
episode 11140 avg length: 275 reward: 146
episode 11160 avg length: 299 reward: 165
episode 11180 avg length: 299 reward: 175
episode 11200 avg length: 289 reward: 161
episode 11220 avg length: 299 reward: 166
episode 11240 avg length: 278 reward: 160
episode 11260 avg length: 290 reward: 142
episode 11280 avg length: 299 reward: 164
episode 11300 avg length: 279 reward: 155
episode 11320 avg length: 299 reward: 178
episode 11340 avg length: 299 reward: 150
episode 11360 avg length: 265 reward: 110
episode 11380 avg length: 288 reward: 156
episode 11400 avg length: 278 reward: 146
episode 11420 avg length: 268 reward: 141
episode 11440 avg length: 291 reward: 130
episode 11460 avg length: 299 reward: 161
episode 11480 avg length: 284 reward: 142
episode 11500 avg length: 262 reward: 132
episode 11520 avg length: 287 reward: 149
episode 11540 avg length: 288 reward: 150
episode 11560 avg length: 288 reward: 157
episode 11580 avg length: 288 reward: 156
episode 11600 avg length: 284 reward: 133
episode 11620 avg length: 287 reward: 152
episode 11640 avg length: 249 reward: 130
episode 11660 avg length: 240 reward: 106
episode 11680 avg length: 271 reward: 131
episode 11700 avg length: 271 reward: 117
episode 11720 avg length: 286 reward: 143
episode 11740 avg length: 293 reward: 150
episode 11760 avg length: 289 reward: 155
episode 11780 avg length: 290 reward: 137
episode 11800 avg length: 289 reward: 133
episode 11820 avg length: 273 reward: 121
episode 11840 avg length: 274 reward: 109
episode 11860 avg length: 261 reward: 147
episode 11880 avg length: 210 reward: 114
episode 11900 avg length: 245 reward: 143
episode 11920 avg length: 210 reward: 115
episode 11940 avg length: 218 reward: 102
episode 11960 avg length: 214 reward: 102
episode 11980 avg length: 269 reward: 133
episode 12000 avg length: 262 reward: 144
episode 12020 avg length: 235 reward: 131
episode 12040 avg length: 253 reward: 149
episode 12060 avg length: 227 reward: 120
episode 12080 avg length: 202 reward: 98
episode 12100 avg length: 240 reward: 117
episode 12120 avg length: 231 reward: 108
episode 12140 avg length: 230 reward: 122
episode 12160 avg length: 228 reward: 108
episode 12180 avg length: 233 reward: 96
episode 12200 avg length: 252 reward: 123
episode 12220 avg length: 272 reward: 154
episode 12240 avg length: 251 reward: 122
episode 12260 avg length: 273 reward: 147
episode 12280 avg length: 239 reward: 111
episode 12300 avg length: 287 reward: 126
episode 12320 avg length: 278 reward: 121
episode 12340 avg length: 258 reward: 120
episode 12360 avg length: 265 reward: 104
episode 12380 avg length: 279 reward: 118
episode 12400 avg length: 254 reward: 72
episode 12420 avg length: 187 reward: 74
episode 12440 avg length: 244 reward: 90
episode 12460 avg length: 228 reward: 116
episode 12480 avg length: 258 reward: 125
episode 12500 avg length: 247 reward: 118
episode 12520 avg length: 244 reward: 101
episode 12540 avg length: 267 reward: 135
episode 12560 avg length: 253 reward: 99
episode 12580 avg length: 285 reward: 135
episode 12600 avg length: 259 reward: 113
episode 12620 avg length: 256 reward: 108
episode 12640 avg length: 238 reward: 114
episode 12660 avg length: 265 reward: 128
episode 12680 avg length: 289 reward: 145
episode 12700 avg length: 287 reward: 147
episode 12720 avg length: 283 reward: 139
episode 12740 avg length: 255 reward: 108
episode 12760 avg length: 299 reward: 150
episode 12780 avg length: 277 reward: 138
episode 12800 avg length: 290 reward: 151
episode 12820 avg length: 284 reward: 159
episode 12840 avg length: 299 reward: 150
episode 12860 avg length: 289 reward: 146
episode 12880 avg length: 299 reward: 158
episode 12900 avg length: 299 reward: 144
episode 12920 avg length: 279 reward: 129
episode 12940 avg length: 282 reward: 132
episode 12960 avg length: 280 reward: 132
episode 12980 avg length: 278 reward: 108
episode 13000 avg length: 284 reward: 136
episode 13020 avg length: 289 reward: 128
episode 13040 avg length: 291 reward: 149
episode 13060 avg length: 299 reward: 140
episode 13080 avg length: 292 reward: 141
episode 13100 avg length: 290 reward: 139
episode 13120 avg length: 299 reward: 139
episode 13140 avg length: 291 reward: 151
episode 13160 avg length: 291 reward: 141
episode 13180 avg length: 299 reward: 169
episode 13200 avg length: 299 reward: 162
episode 13220 avg length: 299 reward: 170
episode 13240 avg length: 299 reward: 170
episode 13260 avg length: 299 reward: 155
episode 13280 avg length: 299 reward: 153
episode 13300 avg length: 299 reward: 163
episode 13320 avg length: 281 reward: 131
episode 13340 avg length: 289 reward: 153
episode 13360 avg length: 285 reward: 133
episode 13380 avg length: 280 reward: 134
episode 13400 avg length: 282 reward: 134
episode 13420 avg length: 268 reward: 114
episode 13440 avg length: 290 reward: 142
episode 13460 avg length: 270 reward: 145
episode 13480 avg length: 257 reward: 127
episode 13500 avg length: 272 reward: 139
episode 13520 avg length: 270 reward: 129
episode 13540 avg length: 279 reward: 149
episode 13560 avg length: 269 reward: 95
episode 13580 avg length: 270 reward: 113
episode 13600 avg length: 258 reward: 125
episode 13620 avg length: 217 reward: 88
episode 13640 avg length: 157 reward: 59
episode 13660 avg length: 132 reward: 41
episode 13680 avg length: 220 reward: 92
episode 13700 avg length: 241 reward: 109
episode 13720 avg length: 252 reward: 127
episode 13740 avg length: 253 reward: 104
episode 13760 avg length: 269 reward: 128
episode 13780 avg length: 230 reward: 96
episode 13800 avg length: 258 reward: 127
episode 13820 avg length: 290 reward: 151
episode 13840 avg length: 299 reward: 135
episode 13860 avg length: 280 reward: 111
episode 13880 avg length: 268 reward: 124
episode 13900 avg length: 255 reward: 93
episode 13920 avg length: 258 reward: 128
episode 13940 avg length: 244 reward: 127
episode 13960 avg length: 238 reward: 117
episode 13980 avg length: 237 reward: 104
episode 14000 avg length: 251 reward: 123
episode 14020 avg length: 267 reward: 114
episode 14040 avg length: 271 reward: 109
episode 14060 avg length: 247 reward: 117
episode 14080 avg length: 282 reward: 129
episode 14100 avg length: 266 reward: 144
episode 14120 avg length: 256 reward: 132
episode 14140 avg length: 267 reward: 140
episode 14160 avg length: 289 reward: 149
episode 14180 avg length: 262 reward: 95
episode 14200 avg length: 278 reward: 128
episode 14220 avg length: 279 reward: 136
episode 14240 avg length: 249 reward: 105
episode 14260 avg length: 235 reward: 112
episode 14280 avg length: 273 reward: 131
episode 14300 avg length: 278 reward: 130
episode 14320 avg length: 259 reward: 123
episode 14340 avg length: 234 reward: 78
episode 14360 avg length: 268 reward: 125
episode 14380 avg length: 294 reward: 153
episode 14400 avg length: 299 reward: 150
episode 14420 avg length: 278 reward: 129
episode 14440 avg length: 297 reward: 155
episode 14460 avg length: 247 reward: 106
episode 14480 avg length: 289 reward: 154
episode 14500 avg length: 270 reward: 133
episode 14520 avg length: 259 reward: 133
episode 14540 avg length: 280 reward: 151
episode 14560 avg length: 268 reward: 129
episode 14580 avg length: 299 reward: 159
episode 14600 avg length: 279 reward: 131
episode 14620 avg length: 242 reward: 100
episode 14640 avg length: 236 reward: 114
episode 14660 avg length: 253 reward: 132
episode 14680 avg length: 272 reward: 134
episode 14700 avg length: 297 reward: 175
episode 14720 avg length: 278 reward: 148
episode 14740 avg length: 289 reward: 154
episode 14760 avg length: 288 reward: 148
episode 14780 avg length: 278 reward: 140
episode 14800 avg length: 266 reward: 128
episode 14820 avg length: 288 reward: 161
episode 14840 avg length: 278 reward: 145
episode 14860 avg length: 290 reward: 161
episode 14880 avg length: 279 reward: 139
episode 14900 avg length: 284 reward: 155
episode 14920 avg length: 245 reward: 136
episode 14940 avg length: 269 reward: 137
episode 14960 avg length: 262 reward: 146
episode 14980 avg length: 299 reward: 154
episode 15000 avg length: 273 reward: 172
episode 15020 avg length: 278 reward: 142
episode 15040 avg length: 277 reward: 150
episode 15060 avg length: 232 reward: 119
episode 15080 avg length: 280 reward: 141
episode 15100 avg length: 260 reward: 137
episode 15120 avg length: 285 reward: 167
episode 15140 avg length: 280 reward: 149
episode 15160 avg length: 237 reward: 118
episode 15180 avg length: 223 reward: 111
episode 15200 avg length: 243 reward: 134
episode 15220 avg length: 269 reward: 138
episode 15240 avg length: 251 reward: 127
episode 15260 avg length: 289 reward: 157
episode 15280 avg length: 229 reward: 107
episode 15300 avg length: 277 reward: 143
episode 15320 avg length: 288 reward: 154
episode 15340 avg length: 289 reward: 149
episode 15360 avg length: 288 reward: 145
episode 15380 avg length: 260 reward: 134
episode 15400 avg length: 246 reward: 126
episode 15420 avg length: 244 reward: 132
episode 15440 avg length: 272 reward: 129
episode 15460 avg length: 267 reward: 134
episode 15480 avg length: 263 reward: 135
episode 15500 avg length: 280 reward: 141
episode 15520 avg length: 254 reward: 126
episode 15540 avg length: 275 reward: 133
episode 15560 avg length: 271 reward: 120
episode 15580 avg length: 270 reward: 130
episode 15600 avg length: 299 reward: 144
episode 15620 avg length: 254 reward: 88
episode 15640 avg length: 271 reward: 126
episode 15660 avg length: 289 reward: 153
episode 15680 avg length: 231 reward: 104
episode 15700 avg length: 227 reward: 127
episode 15720 avg length: 174 reward: 82
episode 15740 avg length: 214 reward: 92
episode 15760 avg length: 190 reward: 89
episode 15780 avg length: 159 reward: 49
episode 15800 avg length: 222 reward: 100
episode 15820 avg length: 269 reward: 133
episode 15840 avg length: 243 reward: 100
episode 15860 avg length: 191 reward: 68
episode 15880 avg length: 221 reward: 86
episode 15900 avg length: 206 reward: 109
episode 15920 avg length: 228 reward: 89
episode 15940 avg length: 250 reward: 108
episode 15960 avg length: 229 reward: 110
episode 15980 avg length: 263 reward: 139
episode 16000 avg length: 250 reward: 125
episode 16020 avg length: 270 reward: 140
episode 16040 avg length: 251 reward: 131
episode 16060 avg length: 258 reward: 124
episode 16080 avg length: 268 reward: 130
episode 16100 avg length: 263 reward: 125
episode 16120 avg length: 280 reward: 150
episode 16140 avg length: 267 reward: 132
episode 16160 avg length: 284 reward: 137
episode 16180 avg length: 275 reward: 128
episode 16200 avg length: 269 reward: 132
episode 16220 avg length: 280 reward: 132
episode 16240 avg length: 279 reward: 145
episode 16260 avg length: 299 reward: 152
episode 16280 avg length: 238 reward: 112
episode 16300 avg length: 284 reward: 159
episode 16320 avg length: 280 reward: 136
episode 16340 avg length: 271 reward: 120
episode 16360 avg length: 281 reward: 139
episode 16380 avg length: 267 reward: 141
episode 16400 avg length: 299 reward: 164
episode 16420 avg length: 239 reward: 113
episode 16440 avg length: 276 reward: 143
episode 16460 avg length: 268 reward: 144
episode 16480 avg length: 269 reward: 134
episode 16500 avg length: 273 reward: 148
episode 16520 avg length: 247 reward: 97
episode 16540 avg length: 266 reward: 129
episode 16560 avg length: 267 reward: 119
episode 16580 avg length: 270 reward: 124
episode 16600 avg length: 262 reward: 101
episode 16620 avg length: 257 reward: 121
episode 16640 avg length: 233 reward: 99
episode 16660 avg length: 268 reward: 114
episode 16680 avg length: 261 reward: 126
episode 16700 avg length: 278 reward: 143
episode 16720 avg length: 278 reward: 117
episode 16740 avg length: 266 reward: 135
episode 16760 avg length: 282 reward: 140
episode 16780 avg length: 299 reward: 154
episode 16800 avg length: 279 reward: 144
episode 16820 avg length: 281 reward: 124
episode 16840 avg length: 280 reward: 132
episode 16860 avg length: 278 reward: 148
episode 16880 avg length: 280 reward: 113
episode 16900 avg length: 268 reward: 133
episode 16920 avg length: 291 reward: 147
episode 16940 avg length: 274 reward: 150
episode 16960 avg length: 281 reward: 137
episode 16980 avg length: 251 reward: 126
episode 17000 avg length: 261 reward: 135
episode 17020 avg length: 267 reward: 105
episode 17040 avg length: 274 reward: 176
episode 17060 avg length: 262 reward: 131
episode 17080 avg length: 186 reward: 184
episode 17100 avg length: 225 reward: 150
episode 17120 avg length: 201 reward: 218
episode 17140 avg length: 211 reward: 220
episode 17160 avg length: 221 reward: 218
episode 17180 avg length: 232 reward: 210
episode 17200 avg length: 216 reward: 220
episode 17220 avg length: 226 reward: 203
episode 17240 avg length: 198 reward: 170
episode 17260 avg length: 196 reward: 222
episode 17280 avg length: 214 reward: 196
episode 17300 avg length: 229 reward: 205
episode 17320 avg length: 183 reward: 192
episode 17340 avg length: 212 reward: 186
episode 17360 avg length: 192 reward: 164
########## solved! ##########
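
A log this long is easier to read once it is parsed. The following is a minimal sketch, not part of the original training script: it assumes each line has the form `episode N avg length: L reward: R` as printed above, and the names `parse_log` and `best_entry` are hypothetical helpers introduced only for illustration.

```python
import re

# Matches lines such as "episode 17360 avg length: 192 reward: 164".
# The pattern is an assumption based on the log format shown above.
LOG_LINE = re.compile(r"episode (\d+) avg length: (\d+) reward: (-?\d+)")


def parse_log(text):
    """Return a list of (episode, avg_length, reward) integer tuples."""
    return [tuple(int(g) for g in m.groups()) for m in LOG_LINE.finditer(text)]


def best_entry(entries):
    """Return the entry with the highest logged average reward."""
    return max(entries, key=lambda e: e[2])


# The last three lines of the log above, used as sample input.
sample = """\
episode 17320 avg length: 183 reward: 192
episode 17340 avg length: 212 reward: 186
episode 17360 avg length: 192 reward: 164
"""

entries = parse_log(sample)
print("last entry:", entries[-1])
print("best entry:", best_entry(entries))
```

The parsed tuples can be passed to any plotting library to view the reward curve. The stopping rule that prints the solved banner is not shown in this section, so the sketch does not try to reproduce it.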
