On computing action distributions
Created by: Termset
Both PPO and SAC need to compute action probabilities, but they handle this differently. PPO keeps a separate trainable parameter as the (log-)variance and hand-writes the probability and KL computations. SAC instead has the network output both the mean and log-std, builds the action distribution with layers.Normal, samples with Normal.sample(), and computes action log-probabilities and KL with Normal.log_prob(action) and Normal.kl_divergence(other). What is the difference between these two approaches?

PPO:
def _calc_kl(self, means, logvars, old_means, old_logvars):
    # Closed-form KL(old || new) between two diagonal Gaussians,
    # parameterized by means and log-variances
    log_det_cov_old = layers.reduce_sum(old_logvars)
    log_det_cov_new = layers.reduce_sum(logvars)
    tr_old_new = layers.reduce_sum(layers.exp(old_logvars - logvars))
    kl = 0.5 * (layers.reduce_sum(
        layers.square(means - old_means) / layers.exp(logvars), dim=1) + (
            log_det_cov_new - log_det_cov_old) + tr_old_new - self.act_dim)
    return kl
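For reference, here is a minimal NumPy sketch of the same closed-form diagonal-Gaussian KL that the PPO snippet computes with `layers` ops (the function name and batching convention are my own; only the formula is taken from the snippet). A quick sanity check is that the KL of a distribution with itself is zero:

```python
import numpy as np

def diag_gaussian_kl(means, logvars, old_means, old_logvars):
    """KL(old || new) for diagonal Gaussians given means and log-variances,
    mirroring the closed form in the PPO _calc_kl snippet."""
    log_det_cov_old = np.sum(old_logvars, axis=-1)
    log_det_cov_new = np.sum(logvars, axis=-1)
    tr_old_new = np.sum(np.exp(old_logvars - logvars), axis=-1)
    act_dim = means.shape[-1]
    return 0.5 * (
        np.sum(np.square(means - old_means) / np.exp(logvars), axis=-1)
        + (log_det_cov_new - log_det_cov_old) + tr_old_new - act_dim)

m = np.array([[0.3, -0.7]])
lv = np.array([[0.1, -0.2]])
print(diag_gaussian_kl(m, lv, m, lv))   # KL(p || p) = 0
print(diag_gaussian_kl(m + 1.0, lv, m, lv))  # positive for shifted means
```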
SAC:
def sample(self, obs):
    mean, log_std = self.actor.policy(obs)
    std = layers.exp(log_std)
    normal = Normal(mean, std)
    x_t = normal.sample([1])[0]     # sample the pre-squash Gaussian action
    y_t = layers.tanh(x_t)          # squash into (-1, 1)
    action = y_t * self.max_action  # scale to the action bounds
    log_prob = normal.log_prob(x_t)
    # tanh change-of-variables correction: subtract log|d action / d x_t|
    log_prob -= layers.log(self.max_action * (1 - layers.pow(y_t, 2)) +
                           epsilon)
    log_prob = layers.reduce_sum(log_prob, dim=1, keep_dim=True)
    log_prob = layers.squeeze(log_prob, axes=[1])
    return action, log_prob
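The subtraction in the SAC snippet is the change-of-variables correction for the tanh squashing: since action = max_action * tanh(x_t), the density of the action is the Gaussian density of x_t divided by |d action / d x_t| = max_action * (1 - tanh(x_t)^2). A minimal NumPy sketch (function names and epsilon value are my own; the formula follows the snippet), which can be checked by numerically integrating the resulting density to 1:

```python
import numpy as np

def gaussian_log_prob(x, mean, log_std):
    # Elementwise log N(x; mean, exp(log_std))
    return (-0.5 * np.log(2 * np.pi) - log_std
            - 0.5 * ((x - mean) / np.exp(log_std)) ** 2)

def squashed_log_prob(x_t, mean, log_std, max_action, eps=1e-6):
    """log-density of action = max_action * tanh(x_t):
    log pi(a) = log N(x_t) - sum log(max_action * (1 - tanh(x_t)^2) + eps)."""
    y_t = np.tanh(x_t)
    log_prob = gaussian_log_prob(x_t, mean, log_std)
    log_prob -= np.log(max_action * (1 - y_t ** 2) + eps)
    return np.sum(log_prob, axis=-1)

# The corrected density should integrate to 1 over the action space;
# integrate in x_t using da = max_action * (1 - tanh(x_t)^2) dx_t.
xs = np.linspace(-8.0, 8.0, 200001)
lp = squashed_log_prob(xs[:, None], np.array([0.2]), np.array([-0.3]), 2.0)
da_dx = 2.0 * (1 - np.tanh(xs) ** 2)
print(np.sum(np.exp(lp) * da_dx * (xs[1] - xs[0])))  # close to 1.0
```

Without the correction term, SAC would be using the density of the pre-squash variable x_t for a bounded action, which overstates the entropy of the squashed policy near the action limits.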