Deep Reinforcement Learning – 2. Implementing DDPG with TensorFlow


Building on the algorithm introduction in my previous post, this article walks through a TensorFlow implementation; the simulation environment is gym torcs.
To get results quickly, I did not use the driver-view image as input. Instead I use the low-dimensional sensor data as input,
29 values in total (see the sketch after this list), including:
– car speed: speedX, speedY, speedZ.
– the car's position within the track
– 19 range-finder readings: the distance from the car body to the track edge
– engine RPM
– wheel speeds
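As a reference point, here is a minimal sketch of how such a 29-dimensional state vector could be assembled from a gym_torcs observation. The field names (speedX, angle, trackPos, track, rpm, wheelSpinVel) follow the usual gym_torcs conventions; the exact preprocessing used in the repository may differ:

import numpy as np

def make_low_dim_state(obs):
  # illustrative only: pack the low-dimensional sensors into one float32 vector
  return np.hstack([obs.speedX, obs.speedY, obs.speedZ,   # 3: car speed
                    obs.angle, obs.trackPos,              # 2: pose relative to the track axis
                    obs.track,                            # 19: range-finder distances
                    obs.rpm,                              # 1: engine RPM
                    obs.wheelSpinVel                      # 4: wheel speeds
                    ]).astype(np.float32)                 # 3 + 2 + 19 + 1 + 4 = 29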

The action output has three dimensions:
– steer: steering, range [-1, 1]
– accel: throttle, range [0, 1]
– brake: brake, range [0, 1]

Video of the agent after 1M training steps: ddpg video

Full code on GitHub: https://github.com/kennethyu2017/ddpg


The code is explained module by module below:

Code structure

First, recall the flow chart of the DDPG algorithm:
[image: DDPG algorithm flow chart]


actor

The actor holds two neural networks, the online policy and the target policy, with identical structure. Since the input is low-dimensional data, I did not use convolutional layers.
The policy network architecture is shown below, taken from the computation graph generated by TensorBoard:
[image: policy network computation graph from TensorBoard]

It consists of:
– BatchNorm: input batch-norm layer
– fully_connected, fully_connected_1: two hidden layers
– fully_connected_2: output layer

To bound the range of the policy network's output actions, tanh is applied to steer and sigmoid to accel and brake as bound functions:
[image: tanh/sigmoid output bound functions]
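For concreteness, a minimal sketch of what such per-dimension bound functions could look like. The real ones are passed in via DDPG_CFG.actor_output_bound_fns (not shown in this post), so this exact list is an assumption:

import tensorflow as tf

# assumed form of the output bound functions, one per action dimension
output_bound_fns = [tf.tanh,      # steer -> [-1, 1]
                    tf.sigmoid,   # accel -> [0, 1]
                    tf.sigmoid]   # brake -> [0, 1]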

The main code of the Actor class follows.
To allow flexible changes to the model during debugging, the policy network's architecture, configuration and parameters are all specified when instantiating Actor, including the batch-norm layer that normalizes the input data, the fully-connected layers, the output layer,
and the output bound functions.

import tensorflow as tf
from tensorflow.contrib.layers import fully_connected,batch_norm
from common.common import soft_update_online_to_target, copy_online_to_target
DDPG_CFG = tf.app.flags.FLAGS  # alias

class Actor(object):
  def __init__(self, action_dim, online_state_inputs, target_state_inputs,
               input_normalizer, input_norm_params,
               n_fc_units, fc_activations, fc_initializers,
               fc_normalizers, fc_norm_params, fc_regularizers,
               output_layer_initializer, output_layer_regularizer,
               output_normalizers, output_norm_params,
               output_bound_fns, learning_rate, is_training):
    self.a_dim = action_dim
    self.learning_rate = learning_rate
    ...

Adam is used as the gradient-descent optimizer:

    self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)  # use beta1, beta2 default.

Create the online policy network:

    self._online_action_outputs = self.create_policy_net(scope=DDPG_CFG.online_policy_net_var_scope,
                                                        state_inputs=self.online_state_inputs,
                                                        trainable=True)

    self.online_policy_net_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                                    scope=DDPG_CFG.online_policy_net_var_scope)
    self.online_policy_net_vars_by_name = {var.name.strip(DDPG_CFG.online_policy_net_var_scope):var
                                           for var in self.online_policy_net_vars}

Create the target policy network. Since the target network parameters are updated via soft updates, trainable is set to False when it is created.

    self._target_action_outputs = self.create_policy_net(scope=DDPG_CFG.target_policy_net_var_scope,
                                                        state_inputs=self.target_state_inputs,
                                                        trainable=False)
    self.target_policy_net_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                                                    scope=DDPG_CFG.target_policy_net_var_scope)

    self.target_policy_net_vars_by_name = {var.name.strip(DDPG_CFG.target_policy_net_var_scope):var
                                           for var in self.target_policy_net_vars}

The policy network construction function create_policy_net builds each layer according to the architecture parameters:

  def create_policy_net(self, state_inputs, scope, trainable):
    with tf.variable_scope(scope):
      #input norm layer
      prev_layer = self.input_normalizer(state_inputs, **self.input_norm_params)

      ##fc layers
      for n_unit, activation, initializer, normalizer, norm_param, regularizer in zip(
          self.n_fc_units, self.fc_activations, self.fc_initializers,
        self.fc_normalizers,self.fc_norm_params, self.fc_regularizers):
        prev_layer = fully_connected(prev_layer, num_outputs=n_unit, activation_fn=activation,
                                     weights_initializer=initializer,
                                     weights_regularizer=regularizer,
                                     normalizer_fn=normalizer,
                                     normalizer_params=norm_param,
                                     biases_initializer=None, #skip bias when use norm.
                                     trainable=trainable)

      ##output layer
      output_layer = fully_connected(prev_layer, num_outputs=self.a_dim, activation_fn=None,
                                     weights_initializer=self.output_layer_initializer,
                                     weights_regularizer=self.output_layer_regularizer,
                                     normalizer_fn=self.output_normalizers,
                                     normalizer_params=self.output_norm_params,
                                     biases_initializer=None, # to skip bias
                                     trainable=trainable)
      ## bound and scale each action dim
      action_unpacked = tf.unstack(output_layer, axis=1)
      action_bounded = []
      for i in range(self.a_dim):
        action_bounded.append(self.output_bound_fns[i](action_unpacked[i]))
      action_outputs = tf.stack(action_bounded, axis=1)

    return action_outputs

The output tensors of the policy networks, i.e. the actions:

  # of online net
  @property
  def online_action_outputs_tensor(self):
    return self._online_action_outputs

  # of target net
  @property
  def target_action_outputs_tensor(self):
    return self._target_action_outputs

Define the function that computes the online policy gradients. Since only the online network is trained, there is no need to compute gradients for the target network:

  def compute_online_policy_net_gradients(self, policy_loss):
    grads_and_vars = self.optimizer.compute_gradients(
      policy_loss,var_list=self.online_policy_net_vars)
    grads = [g for (g, _) in grads_and_vars if g is not None]
    compute_op = tf.group(*grads)

    return (grads_and_vars, compute_op)

Define the function that applies the computed gradients to the online policy network parameters through the Adam optimizer:

  def apply_online_policy_net_gradients(self, grads_and_vars):
    vars_with_grad = [v for g, v in grads_and_vars if g is not None]
    if not vars_with_grad:
      raise ValueError(
        "$$ ddpg $$ policy net $$ No gradients provided for any variable, check your graph for ops"
        " that do not support gradients,variables %s." %
        ([str(v) for _, v in grads_and_vars]))
    return self.optimizer.apply_gradients(grads_and_vars)

Functionally, these two steps could be merged into a single optimizer.minimize() call. I split them so that during debugging I can fetch the gradient tensors
and check whether they converge.
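As an example, here is a hedged sketch of how the returned gradient tensors could be inspected during debugging; sess, the feed placeholders and the sampled batches are assumed to exist as in the train() function later in this post, and numpy is assumed to be imported as np:

# grads_and_vars comes from actor.compute_online_policy_net_gradients(policy_loss)
nonzero_gv = [(g, v) for g, v in grads_and_vars if g is not None]
grad_values = sess.run([g for g, _ in nonzero_gv],
                       feed_dict={online_state_inputs: state_batch,
                                  cond_training_q: False,
                                  online_action_inputs_training_q: action_batch,
                                  is_training: True})
for (_, var), val in zip(nonzero_gv, grad_values):
  print(var.name, 'gradient l2 norm:', np.linalg.norm(val))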

Define the functions that copy the online policy network parameters to the target policy network at initialization, and soft-update them into the target policy network during training.
Both return the corresponding op:

  def soft_update_online_to_target(self):
    return soft_update_online_to_target(self.online_policy_net_vars_by_name,
                                        self.target_policy_net_vars_by_name)

  def copy_online_to_target(self):
    return copy_online_to_target(self.online_policy_net_vars_by_name,
                                 self.target_policy_net_vars_by_name)
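The helpers soft_update_online_to_target and copy_online_to_target live in common/common.py and are not shown in this post. A minimal sketch of what they could look like, assuming the tau hyper-parameter DDPG_CFG.tau defined later:

import tensorflow as tf
DDPG_CFG = tf.app.flags.FLAGS  # alias

def copy_online_to_target(online_vars_by_name, target_vars_by_name):
  # hard copy target <- online, run once after variable initialization
  copy_ops = [tf.assign(target_vars_by_name[name], online_var)
              for name, online_var in online_vars_by_name.items()]
  return tf.group(*copy_ops)

def soft_update_online_to_target(online_vars_by_name, target_vars_by_name):
  # soft update target <- tau * online + (1 - tau) * target
  update_ops = [tf.assign(target_vars_by_name[name],
                          DDPG_CFG.tau * online_var
                          + (1.0 - DDPG_CFG.tau) * target_vars_by_name[name])
                for name, online_var in online_vars_by_name.items()]
  return tf.group(*update_ops)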

critic

The critic also holds two neural networks with identical structure, the online q and target q networks. The q network structure, taken from the TensorBoard computation graph, is:
[image: q network computation graph from TensorBoard]

It consists of:
– BatchNorm: input batch-norm layer
– fully_connected, fully_connected_1: two hidden layers
– fully_connected_2: output layer

Two points deserve attention:
1. Since the q network takes the action as an input, the action is concatenated into the second fully-connected layer, fully_connected_1 (selected in the code by the configuration parameter DDPG_CFG.include_action_fc_layer);
2. Comparing with the DDPG flow chart above: when training the q network, its action input is the action batch sampled from the replay buffer, whereas when training the policy network,
its action input is μ(si), i.e. the action output of the policy network. We therefore add switch-merge control logic to the q network
(the lower-right part of the figure above) that wires in both action inputs and selects between them with the cond_training_q tensor:
feed cond_training_q as True when training the q network, and as False when training the policy network (a standalone illustration of this pattern follows).
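As an illustration only (not code from the repository), the sketch below selects between two tensors with a boolean placeholder; switch and merge come from tensorflow.python.ops.control_flow_ops in TF 1.x:

import tensorflow as tf
from tensorflow.python.ops.control_flow_ops import switch, merge

action_from_buffer = tf.constant([[1.0]])   # stands in for the sampled action batch
action_from_policy = tf.constant([[2.0]])   # stands in for the policy network output
cond = tf.placeholder(tf.bool, shape=[], name='cond_training_q')

# switch() routes its input to the (output_false, output_true) pair depending on cond
(_, buf_if_true) = switch(data=action_from_buffer, pred=cond)
(policy_if_false, _) = switch(data=action_from_policy, pred=cond)
# merge() forwards whichever branch was actually taken
actions, _ = merge([buf_if_true, policy_if_false])

with tf.Session() as sess:
  print(sess.run(actions, feed_dict={cond: True}))    # [[1.]]
  print(sess.run(actions, feed_dict={cond: False}))   # [[2.]]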

The main code of the Critic class follows. As with the Actor, to allow flexible changes to the model during debugging, the q network's architecture, configuration and parameters are all specified when instantiating Critic:

class Critic(object):

  def __init__(self, online_state_inputs, target_state_inputs,
               input_normalizer, input_norm_params,
               online_action_inputs_training_q, online_action_inputs_training_policy,
               cond_training_q, target_action_inputs,
               n_fc_units, fc_activations, fc_initializers,
               fc_normalizers, fc_norm_params, fc_regularizers,
               output_layer_initializer, output_layer_regularizer, learning_rate):

    self.online_state_inputs = online_state_inputs
    self.target_state_inputs = target_state_inputs
    ...

As with the Actor, Adam is used as the gradient-descent optimizer. Next the online q and target q networks are created;
their construction function is:

  def create_q_net(self, state_inputs,  # NHWC format.
                   action_inputs_training_q,
                   scope, trainable,
                   action_inputs_training_policy=None,  # None for target net.
                   cond_training_q=None):  # bool to control switch. can be None for target net.
    with tf.variable_scope(scope):
      #input norm layer
      prev_layer=self.input_normalizer(state_inputs,**self.input_norm_params)

      ##fc layers
      l = 1  # start from fc-1 as 1
      for n_unit, activation, initializer, normalizer, norm_param,regularizer in zip(
              self.n_fc_units, self.fc_activations, self.fc_initializers,
              self.fc_normalizers, self.fc_norm_params, self.fc_regularizers):
        # include action_inputs
        if l == DDPG_CFG.include_action_fc_layer:
          if action_inputs_training_policy is None:  # target net
            actions = action_inputs_training_q
          else:  # add logic for selecting online net action inputs
            # switch return :(output_false, output_true)
            (_, sw_action_training_q) = switch(data=action_inputs_training_q,
                                                                   pred=cond_training_q,
                                                                   name='switch_actions_training_q')
            (sw_action_training_policy, _) = switch(data=action_inputs_training_policy,
                                                                        pred=cond_training_q,
                                                                        name='switch_actions_training_policy')
            (actions, _) = merge([sw_action_training_q, sw_action_training_policy])

          prev_layer = tf.concat([prev_layer, actions], axis=1)
        l += 1
        prev_layer = fully_connected(prev_layer, num_outputs=n_unit, activation_fn=activation,
                                     weights_initializer=initializer,
                                     weights_regularizer=regularizer,
                                     normalizer_fn=normalizer, #when specify norm , bias will be ignored.
                                     normalizer_params=norm_param,
                                     trainable=trainable)

      # end fc layers

      ##output layer. fully_connected will create bias which is not wanted in output layer.
      output_layer = fully_connected(inputs=prev_layer,num_outputs=1,
                                     activation_fn=None,
                                     weights_initializer=self.output_layer_initializer,
                                     weights_regularizer=self.output_layer_regularizer,
                                     biases_initializer=None, # to skip bias in output layer
                                    trainable=trainable)

    # == == end with variable_scope() ==
    return output_layer

Next, the gradient computation and application functions and the target q soft-update function are defined; they are analogous to the Actor's, so they are not repeated here. A hedged sketch is given below.
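These Critic methods are not shown in the post; by analogy with the Actor methods above, a sketch of what they could look like (assuming the q network variable collections are gathered the same way as in the Actor):

  def compute_online_q_net_gradients(self, q_loss):
    grads_and_vars = self.optimizer.compute_gradients(q_loss, var_list=self.online_q_net_vars)
    grads = [g for (g, _) in grads_and_vars if g is not None]
    return (grads_and_vars, tf.group(*grads))

  def apply_online_q_net_gradients(self, grads_and_vars):
    return self.optimizer.apply_gradients(grads_and_vars)

  def soft_update_online_to_target(self):
    return soft_update_online_to_target(self.online_q_net_vars_by_name,
                                        self.target_q_net_vars_by_name)

  def copy_online_to_target(self):
    return copy_online_to_target(self.online_q_net_vars_by_name,
                                 self.target_q_net_vars_by_name)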


Building the overall computation graph

TensorFlow computes over a computation graph, so once we have the policy and q networks we can build one large graph that connects them
into a complete DDPG graph for training, as shown below (from the TensorBoard graph view):
[image: overall DDPG computation graph from TensorBoard]

You can see the online policy and online q networks, the policy_loss and q_loss nodes, the gradients/gradients_1 nodes added for back-propagation,
and the target policy and target q networks.

The build function takes the actor and critic instances plus two placeholders, reward_inputs and terminated_inputs, whose values are fed at session-run time
with the reward and terminated batches sampled from the replay buffer. It returns the ops that have to be run to train DDPG:
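In equation form, the graph below implements the standard DDPG targets and losses (d_i is the terminated flag, N the batch size):

y_i = r_i + \gamma \, (1 - d_i) \, Q'\big(s_{i+1}, \mu'(s_{i+1})\big)

L_Q = \frac{1}{N} \sum_i \big(y_i - Q(s_i, a_i)\big)^2 + \text{l2 regularization}

L_\mu = -\frac{1}{N} \sum_i Q\big(s_i, \mu(s_i)\big)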

def build_ddpg_graph(actor, critic, reward_inputs, terminated_inputs,global_step_tensor):
  # define the target label y_i. For transitions that end an episode we do not bootstrap the q value, so the q term is multiplied by (1.0 - terminated_inputs)
  y_i = reward_inputs + (1.0 - terminated_inputs) * DDPG_CFG.gamma * critic.target_q_outputs_tensor

  # q network loss. We usually apply l2 regularization to the q network.
  q_reg_loss = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES,scope=DDPG_CFG.online_q_net_var_scope)
  q_loss = tf.add_n([tf.losses.mean_squared_error(labels=y_i, predictions=critic.online_q_outputs_tensor)] + q_reg_loss,
                    name='q_loss')

  # policy network loss. The policy network can do without regularization.
  policy_reg_loss = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES, scope=DDPG_CFG.online_policy_net_var_scope)
  policy_loss =tf.add_n([ -1.0 * tf.reduce_mean(critic.online_q_outputs_tensor)] + policy_reg_loss,
                        name='policy_loss')

  # ops that compute and apply the online policy network gradients; running this op performs one training step of the policy network.
  actor_g_and_v, actor_compute_grads_op = actor.compute_online_policy_net_gradients(policy_loss=policy_loss)
  actor_apply_grads_op = actor.apply_online_policy_net_gradients(grads_and_vars=actor_g_and_v)
  train_online_policy_op = actor_apply_grads_op

  # ops that compute and apply the online q network gradients; running this op performs one training step of the q network.
  critic_g_and_v, critic_compute_grads_op = critic.compute_online_q_net_gradients(q_loss=q_loss)
  critic_apply_grads_op = critic.apply_online_q_net_gradients(grads_and_vars=critic_g_and_v)
  train_online_q_op = critic_apply_grads_op

  # soft-update ops for the target networks
  actor_update_target_op = actor.soft_update_online_to_target()
  critic_update_target_op = critic.soft_update_online_to_target()

  # chain the control-flow dependencies
  with tf.control_dependencies([actor_update_target_op, critic_update_target_op]):
    update_target_op = tf.assign_add(global_step_tensor, 1).op  # increment global step

  # copy online -> target, run once after initialization.
  actor_init_target_op = actor.copy_online_to_target()
  critic_init_target_op = critic.copy_online_to_target()
  copy_online_to_target_op = tf.group(actor_init_target_op, critic_init_target_op)

  # create the model saver
  saver = tf.train.Saver(keep_checkpoint_every_n_hours=0.5, max_to_keep=5)

  return (copy_online_to_target_op, train_online_policy_op, train_online_q_op, update_target_op, saver)

replay buffer

The replay buffer stores the transition samples we collect and can be implemented with a deque from the collections package. To support repeated training runs, I added the ability to save the buffer to disk,
so that the buffer files can be loaded directly in the next run instead of stepping the torcs environment again, which speeds up training.

import copy, glob, os, time
import pickle as pkl
from collections import deque
import numpy as np
import tensorflow as tf

DDPG_CFG = tf.app.flags.FLAGS  # alias

class ReplayBuffer(object):
  def __init__(self, buffer_size, seed, save_segment_size=None, save_path=None):
    #The right side of the deque contains the most recent experiences
    self.buffer_size = buffer_size
    self.buffer = deque([], maxlen=buffer_size)
    if seed is not None:
      np.random.seed(seed)
    self.save=False
    if save_segment_size is not None:
      assert save_path is not None
      self.save = True
      self.save_segment_size = save_segment_size
      self.save_path = save_path
      self.save_data_cnt=0
      self.save_segment_cnt=0

store() appends a transition; once the number of newly added transitions reaches save_segment_size, the latest segment is written to disk:

  def store(self, transition):
    #deque can take care of max len.
    T = copy.deepcopy(transition)
    self.buffer.append(T)
    if self.save:
      self.save_data_cnt+=1
      if self.save_data_cnt >= self.save_segment_size:
        self.save_segment()
        self.save_data_cnt=0
    del transition

sample_batch() draws a mini-batch uniformly at random; note the handling of a few boundary conditions:


  def sample_batch(self, batch_size):
    indices = np.random.permutation(self.length - 1)[:batch_size]
    state_batch, action_batch, reward_batch, next_state_batch, terminated_batch = [], [], [], [], []
    for idx in indices:
      trans_1 = self.buffer[idx]
      if trans_1.terminated is not True:
        # the trans_2 : (a_2, r_2, term_2, s_3)
        trans_2 = self.buffer[idx + 1]
        # we use the data (s_2, a_2, r_2, term_2, s_3)
        state_batch.append(trans_1.next_state)
        action_batch.append(trans_2.action)
        reward_batch.append(trans_2.reward)
        next_state_batch.append(trans_2.next_state)
        terminated_batch.append(trans_2.terminated)
      else:
        ##term_1 is true, so buffer[idx+1] is beginning of new episode,
        # we traverse back.
        if idx != 0:
          trans_0 = self.buffer[idx - 1]  # a_0, r_0, s_1, term_0 = self.buffer[idx - 1]
          if trans_0.terminated is True:  # give up
            continue
          # we use the data (s_1, a_1, r_1, term_1, s_2)
          # s_2 is not used to calc Q cause its terminated. but we still use
          # it to FF through mu_prime/Q_prime then Q*0. guarantee s_2 is accurate formatted.
          state_batch.append(trans_0.next_state)
          action_batch.append(trans_1.action)
          reward_batch.append(trans_1.reward)
          next_state_batch.append(trans_1.next_state)
          terminated_batch.append(trans_1.terminated)
        else:
          # head of buffer, we dont know the previous state , so give up.
          continue

    return (np.array(state_batch), np.array(action_batch), np.array(reward_batch), np.array(next_state_batch),np.array(terminated_batch))

Next come the save and load functions for the buffer files, which use the pickle package to serialize the deque contents:

  def save_segment(self):
    self.save_segment_cnt+=1

    data = []
    start = self.length - self.save_segment_size  #always save latest data of segment_size
    end = self.length

    for idx in range(start, end):
      data.append(self.buffer[idx])

    if not os.path.exists(self.save_path):
      os.makedirs(self.save_path)

    abs_file_name = os.path.abspath(os.path.join(self.save_path,
                            '_'.join([DDPG_CFG.replay_buffer_file_name,str(self.save_segment_cnt),time.ctime()])))

    with open(abs_file_name,'wb') as f:
      pkl.dump(data, f)


  def load(self, path):
    #load from file to buffer
    abs_file_pattern = os.path.abspath(os.path.join(path,
                            '_'.join([DDPG_CFG.replay_buffer_file_name,'*'])))
    buffer_files = glob.glob(abs_file_pattern)
    for f_name in buffer_files:
      with open(f_name,'rb') as f:
        data = pkl.load(f)
        self.buffer.extend(data)
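A short usage sketch; the Transition objects stored here are assumed to carry action, reward, next_state and terminated fields, as implied by sample_batch above:

buffer = ReplayBuffer(buffer_size=10**6, seed=17,
                      save_segment_size=30*3000, save_path='./replay_buffer')
# while interacting with the environment:
#   buffer.store(transition)
# while training:
#   states, actions, rewards, next_states, terminateds = buffer.sample_batch(64)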

train and evaluate

Next we define the actor and critic network architecture parameters and the training hyper-parameters, including the learning rates, the l2 regularization ratio,
the buffer size, and so on:

Policy network architecture definition; the online and target networks share the same architecture:

import tensorflow as tf
from tensorflow.contrib.layers import variance_scaling_initializer,batch_norm,l2_regularizer

DDPG_CFG = tf.app.flags.FLAGS  # alias
is_training=tf.placeholder(tf.bool, shape=(), name='is_training')

DDPG_CFG.online_policy_net_var_scope = 'online_policy'
DDPG_CFG.target_policy_net_var_scope = 'target_policy'

# Because the input state dimensions have very different scales, we use a batch-norm
# layer to normalize every input dimension to zero mean and unit variance:
# -- 1 input norm layer --
DDPG_CFG.actor_input_normalizer = batch_norm
DDPG_CFG.actor_input_norm_params = {'is_training': is_training,
                                    'data_format': 'NHWC',
                                    'updates_collections': None,
                                    'scale': False,  # no gamma. let the next fc layer do the scaling.
                                    'center': True}  # beta.
# -- 2 fully_connected layers --
DDPG_CFG.actor_n_fc_units = [400, 300]
DDPG_CFG.actor_fc_activations = [tf.nn.elu] * 2
DDPG_CFG.actor_fc_initializers = [variance_scaling_initializer()] * 2
DDPG_CFG.actor_fc_regularizers = [None] * 2
# adding a batch-norm layer here makes training more stable
DDPG_CFG.actor_fc_normalizers = [batch_norm] * 2
DDPG_CFG.actor_fc_norm_params = [{'is_training': is_training,
                                  'data_format': 'NHWC',
                                  'updates_collections': None,
                                  'scale': False,
                                  'center': True}] * 2

# -- 1 output layer. Since the output actions are bounded with tanh and sigmoid,
# we add a batch-norm layer before them to avoid saturating tanh/sigmoid.
DDPG_CFG.actor_output_layer_normalizers = batch_norm
DDPG_CFG.actor_output_layer_norm_params = {'is_training': is_training,
                                           'data_format': 'NHWC',
                                           'updates_collections': None,
                                           'scale': False,
                                           'center': False}
DDPG_CFG.actor_output_layer_initializer=tf.random_uniform_initializer(-3e-3,3e-3)

Q network architecture definition; the online and target networks share the same architecture:

DDPG_CFG.online_q_net_var_scope = 'online_q'
DDPG_CFG.target_q_net_var_scope = 'target_q'

# -- 1 input norm layer --
DDPG_CFG.critic_input_normalizer = batch_norm
DDPG_CFG.critic_input_norm_params = {'is_training': is_training,
                                     'data_format': 'NHWC',
                                     'updates_collections': None,
                                     'scale': False,
                                     'center': True}

# -- 2 fully-connected layers, with l2 regularization --
DDPG_CFG.include_action_fc_layer = 2  # in this layer we include action inputs.
DDPG_CFG.critic_n_fc_units = [400, 300]
DDPG_CFG.critic_fc_activations = [tf.nn.elu] * 2
DDPG_CFG.critic_fc_initializers = [variance_scaling_initializer()] * 2
DDPG_CFG.critic_fc_regularizers = [l2_regularizer(scale=DDPG_CFG.critic_reg_ratio,scope=DDPG_CFG.online_q_net_var_scope)] * 2
DDPG_CFG.critic_fc_normalizers = [batch_norm, None]  # 2nd fc including action input and no BN but has bias.
DDPG_CFG.critic_fc_norm_params = [{'is_training': is_training,
                                   'data_format': 'NHWC',
                                   'updates_collections': None,
                                   'scale': False,
                                   'center': True}, None]

# -- 1 output layer --
DDPG_CFG.critic_output_layer_initializer = tf.random_uniform_initializer(-3e-3, 3e-3)

Hyper-parameter definitions:

## hyper-P
DDPG_CFG.actor_learning_rate = 1e-3
DDPG_CFG.critic_learning_rate = 1e-4
DDPG_CFG.critic_reg_ratio = 1e-2
DDPG_CFG.tau = 0.001
DDPG_CFG.gamma = 0.99
DDPG_CFG.num_training_steps = 25*(10**5)  # 2.5M steps total
DDPG_CFG.eval_freq = 3*10000
DDPG_CFG.num_eval_steps = 1000  # eval steps during training
DDPG_CFG.eval_steps_after_training=2000
DDPG_CFG.batch_size = 64
DDPG_CFG.replay_buff_size = 10**6  # 1M
DDPG_CFG.replay_buff_save_segment_size = 30*3000  # every 180,000 Transition data.
DDPG_CFG.greedy_accel_noise_steps = 3*(10**5)

Now define the train function. Controlled by the eval_mode argument, it can also be used after training to evaluate a saved model:
def train(train_env, agent_action_fn, eval_mode=False):
  action_space = train_env.action_space
  obs_space = train_env.observation_space

  ######### instantiate actor, critic, replay buffer #########
  # state inputs for training the online networks
  online_state_inputs = tf.placeholder(tf.float32,
                                       shape=(None, obs_space.shape[0]),
                                       name="online_state_inputs")
  # state inputs for the target networks
  target_state_inputs = tf.placeholder(tf.float32,
                                       shape=online_state_inputs.shape,
                                       name="target_state_inputs")
  # action inputs used when training the q network
  online_action_inputs_training_q = tf.placeholder(tf.float32,
                                                   shape=(None, action_space.shape[0]),
                                                   name='online_action_batch_inputs')
  # condition controlling the q network action input:
  #   True: training q.
  #   False: training policy.
  cond_training_q = tf.placeholder(tf.bool, shape=[], name='cond_training_q')
  terminated_inputs = tf.placeholder(tf.float32, shape=(None), name='terminated_inputs')
  reward_inputs = tf.placeholder(tf.float32, shape=(None), name='rewards_inputs')

  # for summary text
  summary_text_tensor = tf.convert_to_tensor(str('summary_text'), preferred_dtype=tf.string)
  tf.summary.text(name='summary_text', tensor=summary_text_tensor, collections=[DDPG_CFG.log_summary_keys])

  ## instantiate actor, critic.
  actor = Actor(action_dim=action_space.shape[0],
                online_state_inputs=online_state_inputs,
                target_state_inputs=target_state_inputs,
                input_normalizer=DDPG_CFG.actor_input_normalizer,
                input_norm_params=DDPG_CFG.actor_input_norm_params,
                n_fc_units=DDPG_CFG.actor_n_fc_units,
                fc_activations=DDPG_CFG.actor_fc_activations,
                fc_initializers=DDPG_CFG.actor_fc_initializers,
                fc_normalizers=DDPG_CFG.actor_fc_normalizers,
                fc_norm_params=DDPG_CFG.actor_fc_norm_params,
                fc_regularizers=DDPG_CFG.actor_fc_regularizers,
                output_layer_initializer=DDPG_CFG.actor_output_layer_initializer,
                output_layer_regularizer=None,
                output_normalizers=DDPG_CFG.actor_output_layer_normalizers,
                output_norm_params=DDPG_CFG.actor_output_layer_norm_params,
                output_bound_fns=DDPG_CFG.actor_output_bound_fns,
                learning_rate=DDPG_CFG.actor_learning_rate,
                is_training=is_training)

  critic = Critic(online_state_inputs=online_state_inputs,
                  target_state_inputs=target_state_inputs,
                  input_normalizer=DDPG_CFG.critic_input_normalizer,
                  input_norm_params=DDPG_CFG.critic_input_norm_params,
                  online_action_inputs_training_q=online_action_inputs_training_q,
                  online_action_inputs_training_policy=actor.online_action_outputs_tensor,
                  cond_training_q=cond_training_q,
                  target_action_inputs=actor.target_action_outputs_tensor,
                  n_fc_units=DDPG_CFG.critic_n_fc_units,
                  fc_activations=DDPG_CFG.critic_fc_activations,
                  fc_initializers=DDPG_CFG.critic_fc_initializers,
                  fc_normalizers=DDPG_CFG.critic_fc_normalizers,
                  fc_norm_params=DDPG_CFG.critic_fc_norm_params,
                  fc_regularizers=DDPG_CFG.critic_fc_regularizers,
                  output_layer_initializer=DDPG_CFG.critic_output_layer_initializer,
                  output_layer_regularizer=None,
                  learning_rate=DDPG_CFG.critic_learning_rate)

  ## track updates.
  global_step_tensor = tf.train.create_global_step()

  ## build the whole ddpg computation graph
  copy_online_to_target_op, train_online_policy_op, train_online_q_op, update_target_op, saver \
    = build_ddpg_graph(actor, critic, reward_inputs, terminated_inputs, global_step_tensor)

  # instantiate the replay buffer and specify whether buffer data is saved to files
  replay_buffer = ReplayBuffer(buffer_size=DDPG_CFG.replay_buff_size,
                               save_segment_size=DDPG_CFG.replay_buff_save_segment_size,
                               save_path=DDPG_CFG.replay_buffer_file_path,
                               seed=DDPG_CFG.random_seed)
  # load buffer data from files
  if DDPG_CFG.load_replay_buffer_set:
    replay_buffer.load(DDPG_CFG.replay_buffer_file_path)

  # use summaries to monitor the training data and parameters; the resulting charts
  # can be viewed very conveniently in tensorboard
  sess = tf.Session(graph=tf.get_default_graph())
  summary_writer = tf.summary.FileWriter(logdir=os.path.join(DDPG_CFG.log_dir, "train"),
                                         graph=sess.graph)
  log_summary_op = tf.summary.merge_all(key=DDPG_CFG.log_summary_keys)

  sess.run(fetches=[tf.global_variables_initializer()])

  # copy init params from online to target
  sess.run(fetches=[copy_online_to_target_op])

  # load the previously saved model checkpoint, if any:
  latest_checkpoint = tf.train.latest_checkpoint(DDPG_CFG.checkpoint_dir)
  if latest_checkpoint:
    tf.logging.info("==== Loading model checkpoint: {}".format(latest_checkpoint))
    saver.restore(sess, latest_checkpoint)

  ####### start training #########
  obs = train_env.reset()
  transition = preprocess_low_dim(obs)
  n_episodes = 1

  if not eval_mode:
    # training mode
    for step in range(1, DDPG_CFG.num_training_steps):
      # get the action for the current state from the online policy network
      policy_out = sess.run(fetches=[actor.online_action_outputs_tensor],
                            feed_dict={online_state_inputs: transition.next_state[np.newaxis, :],
                                       is_training: False})[0]
      # execute the action in the simulator and store the Transition in the replay buffer.
      # action noise is added inside this step, see the <Noise design> section below.
      transition = agent_action_fn(policy_out, replay_buffer, train_env)

      ## sample a mini-batch from the replay buffer
      state_batch, action_batch, reward_batch, next_state_batch, terminated_batch = \
        replay_buffer.sample_batch(DDPG_CFG.batch_size)

      # ---- 1. train the policy network; note the feeds -----------
      sess.run(fetches=[train_online_policy_op],
               feed_dict={online_state_inputs: state_batch,
                          cond_training_q: False,
                          online_action_inputs_training_q: action_batch,  # feed but not used.
                          is_training: True})

      # ---- 2. train the q network --------------
      sess.run(fetches=[train_online_q_op],
               feed_dict={online_state_inputs: state_batch,
                          cond_training_q: True,
                          online_action_inputs_training_q: action_batch,
                          target_state_inputs: next_state_batch,
                          reward_inputs: reward_batch,
                          terminated_inputs: terminated_batch,
                          is_training: True})

      # ----- 3. soft update the target networks ---------
      sess.run(fetches=[update_target_op], feed_dict=None)

      # every eval_freq steps, run an evaluation so that a good model can be picked after training:
      if step % DDPG_CFG.eval_freq == 0:
        evaluate(env=train_env,
                 num_eval_steps=DDPG_CFG.num_eval_steps,
                 preprocess_fn=preprocess_low_dim,
                 estimate_fn=lambda state: sess.run(fetches=[actor.online_action_outputs_tensor],
                                                    feed_dict={online_state_inputs: state,
                                                               is_training: False}),
                 summary_writer=summary_writer,
                 saver=saver, sess=sess, global_step=step,
                 log_summary_op=log_summary_op, summary_text_tensor=summary_text_tensor)

      if transition.terminated:
        transition = preprocess_low_dim(train_env.reset())
        n_episodes += 1
        continue  # begin new episode
  else:
    # evaluate mode
    evaluate(env=train_env,
             num_eval_steps=DDPG_CFG.eval_steps_after_training,
             preprocess_fn=preprocess_low_dim,
             estimate_fn=lambda state: sess.run(fetches=[actor.online_action_outputs_tensor],
                                                feed_dict={online_state_inputs: state,
                                                           is_training: False}),
             summary_writer=summary_writer,
             saver=None, sess=sess, global_step=0,
             log_summary_op=log_summary_op, summary_text_tensor=summary_text_tensor)

  sess.close()
  train_env.close()

The evaluate function is defined as follows:


def evaluate(env, num_eval_steps, preprocess_fn, estimate_fn, summary_writer, saver, sess,
             global_step, log_summary_op, summary_text_tensor):
  total_reward = 0
  episode_reward = 0
  max_episode_reward = 0
  n_episodes = 0
  n_rewards = 0
  terminated = False
  transition = preprocess_fn(state=env.reset())
  tf.logging.info(' ####### start evaluate @ global step:{}## '.format(global_step))

  for estep in range(1, num_eval_steps):
    policy_out = estimate_fn(transition.next_state[np.newaxis, :])
    action = policy_output_to_deterministic_action(policy_out, env.action_space)
    (state, reward, terminated) = env_step(env, action)
    transition = preprocess_fn(state)
    # record every reward
    total_reward += reward
    episode_reward += reward
    if reward != 0:
      n_rewards += 1  # can represent effective steps in episode
    if terminated:
      n_episodes += 1
      if episode_reward > max_episode_reward:
        max_episode_reward = episode_reward
      episode_reward = 0
      transition = preprocess_fn(env.reset())
  # -- end for estep ---

  avg_episode_reward = total_reward / max(1, n_episodes)
  avg_episode_steps = n_rewards / max(1, n_episodes)

  # save the model parameters only during training
  saved_name = 'eval_only_not_save_model'
  if saver is not None:
    saved_name = save_model(saver, sess, global_step)
  write_summary(summary_writer, global_step, avg_episode_reward, max_episode_reward,
                avg_episode_steps, saved_name, sess, log_summary_op,
                summary_text_tensor)

The evaluation results are saved via summaries and can be viewed in TensorBoard:
[image: evaluation summaries in TensorBoard]

def write_summary(writer, global_step, avg_episode_reward, max_episode_reward, avg_episode_steps,
                  saved_name, sess, log_summary_op, summary_text_tensor):
  eval_summary = tf.Summary()  # protocol buffer
  eval_summary.value.add(node_name='avg_episode_reward', simple_value=avg_episode_reward,
                         tag="train_eval/avg_episode_reward")
  eval_summary.value.add(node_name='max_episode_reward', simple_value=max_episode_reward,
                         tag="train_eval/max_episode_reward")
  eval_summary.value.add(node_name='avg_episode_steps', simple_value=avg_episode_steps,
                         tag="train_eval/avg_episode_steps")
  writer.add_summary(summary=eval_summary, global_step=global_step)

  log_info = 'eval result : global_step:{} avg_episode_reward:{} ' \
             'max_episode_reward:{} avg_episode_steps:{} \n saved_file: {} '.format(
               global_step, avg_episode_reward, max_episode_reward, avg_episode_steps, saved_name)
  log_summary = sess.run(fetches=[log_summary_op],
                         feed_dict={summary_text_tensor: log_info})
  writer.add_summary(summary=log_summary[0], global_step=global_step)
  writer.flush()

The model parameters are saved through the saver object:


def save_model(saver, sess, global_step):
  # save model. will save both online and target networks.
  return saver.save(sess=sess, save_path=DDPG_CFG.checkpoint_dir, global_step=global_step)

Now we can create the torcs environment and start training.
Note: to speed up training, torcs runs in text mode (the -T option), i.e. without rendering; you can change the vision option below to turn rendering on.
If you want to choose the track, run sudo torcs in a shell, configure it under practice -> configure race, quit, and then start training.


if __name__ == "__main__":
  tf.logging.info("@@@ start ddpg training gym_torcs @@@ start time:{}".format(time.ctime()))
  # Generate a Torcs environment
  env_train = torcs_env_wrapper(vision=False, throttle=True, gear_change=False, port=3101)
  train(env_train, agent_action, eval_mode=False)

Reward design and episode termination conditions for gym torcs

The reward and the episode termination conditions are defined as follows:

- At each action step, the reward is the car's current speed projected onto the forward direction of the track, following [yanpanlau's description](https://github.com/yanpanlau/DDPG-Keras-Torcs):

[image: reward as the projection of the car's velocity onto the track axis]

- When the car collides with its surroundings, reward = -1 as a penalty, but the current episode is not terminated
- When the car makes little progress for 200 consecutive steps (less than 0.05 m between successive steps), reward = -1 is given as a penalty and the current episode is terminated
- If the car turns around and drives backwards, the current episode is terminated

This design is also aimed at exploration: it pushes the car to drive forward as far as possible so that it covers every part of the track, while avoiding large amounts of low-speed driving data.

See the gym_torcs.py file in this project for the concrete implementation; an illustrative sketch follows.
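As an illustration only (the authoritative code is gym_torcs.py in the repository), a minimal sketch of a reward of the kind described above, assuming the observation exposes speedX and the angle between the car and the track axis:

import numpy as np

def progress_reward(obs, collided, stuck_too_long, driving_backwards):
  # forward speed projected onto the track axis: the per-step progress reward
  reward = obs.speedX * np.cos(obs.angle)
  terminated = False
  if collided:
    reward = -1.0            # penalty, but the episode continues
  if stuck_too_long:
    reward = -1.0            # penalty for ~200 low-progress steps ...
    terminated = True        # ... and the episode ends
  if driving_backwards:
    terminated = True        # turning around ends the episode
  return reward, terminated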

Noise design

Noise is introduced for exploration. The DDPG paper uses OU noise, but in my experience standard OU noise does not work particularly well in an environment like torcs.
I followed yanpanlau's approach: inject a large amount of straight-line acceleration noise at the start of training, and decay it as training proceeds.
The effect is clear: the model quickly learns q network and policy network parameters for high-speed driving states (speed is part of the state input),
and with that speed the car drives as far forward as possible and explores more of the track. You can experiment with other noise schemes that help DDPG find a better optimum.
I call this noise greedy accelerate:

def greedy_function(x, mu, theta, sigma):
  return theta * (mu - x) + sigma * np.random.randn()

epsilon = 1

def policy_output_to_stochastic_action(output, action_space):
  global epsilon
  output = np.squeeze(output, axis=0)
  epsilon -= 1.0 / DDPG_CFG.greedy_accel_noise_steps
  greedy_noise = np.array([max(epsilon, 0) * greedy_function(output[0], 0.0, 0.60, 0.30),   # steer
                           max(epsilon, 0) * greedy_function(output[1], 0.5, 1.00, 0.10),   # accel
                           max(epsilon, 0) * greedy_function(output[2], -0.1, 1.00, 0.05)]) # brake
  stochastic_action = greedy_noise + output
  bounded = np.clip(stochastic_action, action_space.low, action_space.high)
  return bounded

Training results

Machine configuration: CPU i5-3450, GPU Nvidia 1050 Ti. Training 1M steps took about 13 hours
and basically yields a model scoring above 10k per episode (on the evo-3-l and Alpine tracks).
The model in the DDPG paper, trained for 2.5M steps, reaches a best torcs score of 1840 per episode; of course, the paper's model is
general purpose (it is also used for Atari, MuJoCo and other environments) and has no torcs-specific treatment.
My model most likely overfits the particular track it was trained on.
The q_loss curve converges downward:
[image: q_loss curve from TensorBoard]

The policy loss curve does not converge downward yet, but it is fairly stable; sustained training would be needed for it to converge clearly. We can still pick a good intermediate model, and the car can already take corners smoothly at around 30 km/h.
[image: policy loss curve from TensorBoard]

Finally, look at the evaluation results during training: some intermediate steps already score up to 10k, and since the model parameters are saved at every evaluation, we can retrieve the corresponding model:
[image: evaluation scores during training]

The model obtained after the 1M training steps above is not the best possible one:
– it cannot reliably finish the whole track, only the first part of it
– it drives at around 40 km/h and does not yet accelerate as hard as it could
– it performs poorly on other tracks, because I trained on only one track, so there is overfitting.

Still, this verifies that the DDPG algorithm works.
DDPG training is quite unstable: later training can easily overwrite what was learned earlier, so many repeated training runs are needed,
picking a relatively good model along the way.
If you want to train it yourself, I suggest two things:

– apply basic deep-learning techniques during debugging, such as cross validation to search for good hyper-parameters and
early stopping;
– on the algorithm side, upgrade to Rainbow, published by DeepMind in 2017.