Contents
1. Preface
2. The cross-entropy loss function
3. Differentiating the cross-entropy loss function
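Preface
Note: this article only discusses the cross-entropy of logistic regression; the cross-entropy of softmax regression is analogous (logistic regression and softmax regression are essentially the same thing; I plan to write a separate article on how the two relate, so consider this a placeholder for now).
First, without further ado, here is the cross-entropy formula: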
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right],$$
and the partial derivative of $J(\theta)$ with respect to the parameter $\theta$ (used for parameter updates in optimization algorithms such as gradient descent):
$$\frac{\partial}{\partial\theta_{j}}J(\theta)=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$$
However, most papers and tutorials simply state these two formulas without a derivation, and the derivation takes more than one or two steps, which tends to confuse beginners. So I have written it out in detail here to share with everyone. My knowledge is limited, so corrections are welcome.
The cross-entropy loss function (the logistic regression cost function)
We have $m$ known samples in total (batch size $= m$), where $(x^{(i)},y^{(i)})$ denotes the $i$-th sample and its class label. Here $x^{(i)}=(1,x^{(i)}_1,x^{(i)}_2,\dots,x^{(i)}_p)^T$ is a $(p+1)$-dimensional vector (the leading 1 accounts for the bias term), and $y^{(i)}$ is a number denoting the class:
- in logistic regression (a yes/no problem), $y^{(i)}$ is 0 or 1;
- in softmax regression (a multi-class problem), $y^{(i)}$ is one of the class labels $1,2,\dots,k$ (assuming there are $k$ classes).
Here we only discuss logistic regression. The input sample is $x^{(i)}=(1,x^{(i)}_1,x^{(i)}_2,\dots,x^{(i)}_p)^T$ and the model parameters are $\theta=(\theta_0,\theta_1,\theta_2,\dots,\theta_p)^T$, so that
$$\theta^T x^{(i)}:=\theta_0+\theta_1 x^{(i)}_1+\dots+\theta_p x^{(i)}_p.$$
The hypothesis function is defined as:
$$h_\theta(x^{(i)})=\frac{1}{1+e^{-\theta^T x^{(i)}}}.$$
Because logistic regression is a 0/1 binary classification problem, we have
$$\begin{aligned}
P(\hat{y}^{(i)}=1|x^{(i)};\theta)&=h_\theta(x^{(i)})\\
P(\hat{y}^{(i)}=0|x^{(i)};\theta)&=1-h_\theta(x^{(i)})
\end{aligned}$$
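To make the hypothesis function concrete, here is a minimal Python sketch that evaluates $h_\theta(x)$ for a single sample. The function names `sigmoid` and `hypothesis` and the sample values are my own illustration, not part of the derivation; the two-branch form of `sigmoid` is just one common way to avoid overflow for large negative inputs.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe here because z < 0
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) for one sample; x[0] is assumed to be the constant 1 (bias)."""
    z = sum(t * x_k for t, x_k in zip(theta, x))  # theta^T x
    return sigmoid(z)

# Hypothetical values chosen so that theta^T x = 0.5 - 2.0 + 1.5 = 0
theta = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.75]
p1 = hypothesis(theta, x)  # P(y_hat = 1 | x; theta)
p0 = 1.0 - p1              # P(y_hat = 0 | x; theta)
print(p1, p0)              # 0.5 0.5, since sigmoid(0) = 0.5
```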
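For now, setting the notion of "entropy" aside, the loss function we want can be obtained from the following simple, intuitive argument. Taking the logarithm of a probability preserves its monotonicity, so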
$$\begin{aligned}
\log P(\hat{y}^{(i)}=1|x^{(i)};\theta)&=\log h_\theta(x^{(i)})=\log\frac{1}{1+e^{-\theta^T x^{(i)}}}\\
\log P(\hat{y}^{(i)}=0|x^{(i)};\theta)&=\log(1-h_\theta(x^{(i)}))=\log\frac{e^{-\theta^T x^{(i)}}}{1+e^{-\theta^T x^{(i)}}}
\end{aligned}$$
Then, for the $i$-th sample, the combined log-probability that the hypothesis function predicts the correct label is:
$$\begin{aligned}
&I\{y^{(i)}=1\}\log P(\hat{y}^{(i)}=1|x^{(i)};\theta)+I\{y^{(i)}=0\}\log P(\hat{y}^{(i)}=0|x^{(i)};\theta)\\
&=y^{(i)}\log P(\hat{y}^{(i)}=1|x^{(i)};\theta)+(1-y^{(i)})\log P(\hat{y}^{(i)}=0|x^{(i)};\theta)\\
&=y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))
\end{aligned}$$
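where $I\{y^{(i)}=1\}$ and $I\{y^{(i)}=0\}$ are indicator functions: each takes the value 1 when the condition inside the braces holds and 0 otherwise. Since $y^{(i)}\in\{0,1\}$, they equal $y^{(i)}$ and $1-y^{(i)}$ respectively.
Then, over all $m$ samples, we get a measure of how well the model fits the whole training set: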
$$\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$$
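Given its meaning as the log-probability of being correct, the larger this value, the better the model represents the data. But for updating parameters or judging model quality we need a loss function (or cost function) that reflects the model's error, and we want the loss to be as small as possible. To reconcile these two requirements, we define the cost function as the negative of the combined log-probability above, averaged over the $m$ samples: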
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$$
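This is the well-known cross-entropy loss function. (Note: familiarity with information entropy, $E[-\log p_i]=-\sum_{i=1}^m p_i\log p_i$, helps in understanding the cross-entropy loss. I will leave that as a placeholder for now and write a separate plain-language article on information entropy later.)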
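The cost function can be evaluated directly from its definition. The following is a small sketch in plain Python; the function name `cross_entropy` and the toy dataset are hypothetical, and no clipping is applied, so it assumes $h_\theta(x^{(i)})$ never reaches exactly 0 or 1.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cross_entropy(theta, X, y):
    """J(theta) = -(1/m) * sum_i [ y_i log h_i + (1 - y_i) log(1 - h_i) ]."""
    m = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * x_k for t, x_k in zip(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / m

# Toy data (made up for illustration); each row starts with 1 for the bias term.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 1]

# With theta = 0 every h is 0.5, so the loss is log(2), about 0.6931.
print(cross_entropy([0.0, 0.0], X, y))
# A theta that separates the two classes gives a smaller loss.
print(cross_entropy([-4.0, 2.0], X, y))
```

A loss of $\log 2$ at $\theta = 0$ is a handy baseline: any useful parameter vector should do better than that.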
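Differentiating the cross-entropy loss function
This step uses a few simple logarithm identities. They are numbered here first; whenever one is used in the derivation below, its number appears as a subscript on that step, to keep the derivation easy to follow.
① $\log \frac{a}{b}=\log a-\log b$
② $\log a+\log b=\log (ab)$
③ $a=\log e^a$ (for convenience, $\log$ here means $\log_e$, i.e. $\ln$; another base such as 2 or 10 only changes a constant factor and does not affect the conclusion)
It is also worth mentioning that the derivatives here are derivatives with respect to matrices and vectors (matrix calculus). There is a tutorial that summarizes the topic concisely and thoroughly, which I recommend to anyone who needs it.
Now for the derivation.
The cross-entropy loss function is: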
$$J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]\tag{1}$$
where
$$\begin{aligned}
\log h_\theta(x^{(i)})&=\log\frac{1}{1+e^{-\theta^T x^{(i)}}}=-\log(1+e^{-\theta^T x^{(i)}}),\\
\log(1-h_\theta(x^{(i)}))&=\log\left(1-\frac{1}{1+e^{-\theta^T x^{(i)}}}\right)=\log\left(\frac{e^{-\theta^T x^{(i)}}}{1+e^{-\theta^T x^{(i)}}}\right)\\
&=\log(e^{-\theta^T x^{(i)}})-\log(1+e^{-\theta^T x^{(i)}})=-\theta^T x^{(i)}-\log(1+e^{-\theta^T x^{(i)}})_{①③}.
\end{aligned}$$
From this we obtain
$$\begin{aligned}
J(\theta)&=-\frac{1}{m}\sum_{i=1}^m \left[-y^{(i)}\log(1+e^{-\theta^T x^{(i)}})+(1-y^{(i)})\left(-\theta^T x^{(i)}-\log(1+e^{-\theta^T x^{(i)}})\right)\right]\\
&=-\frac{1}{m}\sum_{i=1}^m \left[y^{(i)}\theta^T x^{(i)}-\theta^T x^{(i)}-\log(1+e^{-\theta^T x^{(i)}})\right]\\
&=-\frac{1}{m}\sum_{i=1}^m \left[y^{(i)}\theta^T x^{(i)}-\log e^{\theta^T x^{(i)}}-\log(1+e^{-\theta^T x^{(i)}})\right]_{③}\\
&=-\frac{1}{m}\sum_{i=1}^m \left[y^{(i)}\theta^T x^{(i)}-\left(\log e^{\theta^T x^{(i)}}+\log(1+e^{-\theta^T x^{(i)}})\right)\right]_{②}\\
&=-\frac{1}{m}\sum_{i=1}^m \left[y^{(i)}\theta^T x^{(i)}-\log(1+e^{\theta^T x^{(i)}})\right]
\end{aligned}$$
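The simplification above is easy to check numerically. This sketch compares, for a single sample, the original per-sample loss with the simplified form $\log(1+e^{z})-y z$, where $z=\theta^T x^{(i)}$. The function names and the test values of $z$ are my own choices for illustration.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def loss_original(z, y):
    """Per-sample loss from the definition: -[y log h + (1 - y) log(1 - h)]."""
    h = sigmoid(z)
    return -(y * math.log(h) + (1 - y) * math.log(1.0 - h))

def loss_simplified(z, y):
    """Per-sample loss after simplification: log(1 + e^z) - y z."""
    return math.log1p(math.exp(z)) - y * z

# Compare both forms on a few hypothetical values of z = theta^T x.
max_gap = 0.0
for z in (-3.0, -0.5, 0.0, 1.2, 4.0):
    for y in (0, 1):
        gap = abs(loss_original(z, y) - loss_simplified(z, y))
        max_gap = max(max_gap, gap)
print(max_gap)  # should be at the level of floating-point rounding error
```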
Next, we compute the partial derivative of $J(\theta)$ with respect to the $j$-th parameter component $\theta_j$:
$$\begin{aligned}
\frac{\partial}{\partial\theta_{j}}J(\theta)&=\frac{\partial}{\partial\theta_{j}}\left(\frac{1}{m}\sum_{i=1}^m \left[\log(1+e^{\theta^T x^{(i)}})-y^{(i)}\theta^T x^{(i)}\right]\right)\\
&=\frac{1}{m}\sum_{i=1}^m \left[\frac{\partial}{\partial\theta_{j}}\log(1+e^{\theta^T x^{(i)}})-\frac{\partial}{\partial\theta_{j}}\left(y^{(i)}\theta^T x^{(i)}\right)\right]\\
&=\frac{1}{m}\sum_{i=1}^m \left(\frac{x^{(i)}_j e^{\theta^T x^{(i)}}}{1+e^{\theta^T x^{(i)}}}-y^{(i)}x^{(i)}_j\right)\\
&=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}
\end{aligned}$$
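The last step uses $\frac{e^{\theta^T x^{(i)}}}{1+e^{\theta^T x^{(i)}}}=\frac{1}{1+e^{-\theta^T x^{(i)}}}=h_\theta(x^{(i)})$. This gives the derivative of the cross-entropy with respect to the parameters: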
$$\frac{\partial}{\partial\theta_{j}}J(\theta)=\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$$
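One way to gain confidence in this result is to compare it against a finite-difference approximation of $J(\theta)$. The sketch below does that in plain Python. The data, the parameter values, the step size `eps`, and all function names are assumptions made for illustration; the cost is computed with the simplified form $\log(1+e^{z})-yz$ derived earlier.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(a, b):
    return sum(a_k * b_k for a_k, b_k in zip(a, b))

def cost(theta, X, y):
    """J(theta) in the simplified form (1/m) * sum_i [log(1 + e^z_i) - y_i z_i]."""
    m = len(X)
    return sum(math.log1p(math.exp(dot(theta, x_i))) - y_i * dot(theta, x_i)
               for x_i, y_i in zip(X, y)) / m

def gradient(theta, X, y):
    """Analytic gradient: dJ/dtheta_j = (1/m) * sum_i (h_i - y_i) * x_ij."""
    m = len(X)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        err = sigmoid(dot(theta, x_i)) - y_i
        for j, x_ij in enumerate(x_i):
            grad[j] += err * x_ij
    return [g / m for g in grad]

def numeric_gradient(theta, X, y, eps=1e-6):
    """Central finite differences, one parameter at a time."""
    grad = []
    for j in range(len(theta)):
        up, down = list(theta), list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((cost(up, X, y) - cost(down, X, y)) / (2 * eps))
    return grad

# Hypothetical data; the first column is the constant 1 for the bias term.
X = [[1.0, 0.5, 1.5], [1.0, -1.0, 0.3], [1.0, 2.0, -0.7], [1.0, 0.1, 0.9]]
y = [1, 0, 1, 0]
theta = [0.1, -0.2, 0.3]

analytic = gradient(theta, X, y)
numeric = numeric_gradient(theta, X, y)
for a, n in zip(analytic, numeric):
    print(a, n)  # the two columns should agree to several decimal places
```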
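Vector form
Everything above was written element-wise. The vector form differs only in notation and the procedure is essentially the same, but it is cleaner because the index $i$ and the summation sign $\sum$ drop out. Ignoring the constant factor $1/m$ in front, the cross-entropy loss function (1) can be written as: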
$$J(\theta)=-\left[y^T\log h_\theta(x)+(1-y^T)\log(1-h_\theta(x))\right]\tag{2}$$
Substituting $h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$ gives:
$$\begin{aligned}
J(\theta)&=-\left[y^T\log\frac{1}{1+e^{-\theta^T x}}+(1-y^T)\log\frac{e^{-\theta^T x}}{1+e^{-\theta^T x}}\right]\\
&=-\left[-y^T\log(1+e^{-\theta^T x})+(1-y^T)\log e^{-\theta^T x}-(1-y^T)\log(1+e^{-\theta^T x})\right]\\
&=-\left[(1-y^T)\log e^{-\theta^T x}-\log(1+e^{-\theta^T x})\right]\\
&=-\left[(1-y^T)(-\theta^T x)-\log(1+e^{-\theta^T x})\right]
\end{aligned}$$
Differentiating with respect to $\theta$, the leading minus sign cancels:
$$\begin{aligned}
\frac{\partial}{\partial\theta}J(\theta)&=-\frac{\partial}{\partial\theta}\left[(1-y^T)(-\theta^T x)-\log(1+e^{-\theta^T x})\right]\\
&=(1-y^T)x-\frac{e^{-\theta^T x}}{1+e^{-\theta^T x}}x\\
&=\left(\frac{1}{1+e^{-\theta^T x}}-y^T\right)x\\
&=\left(h_\theta(x)-y^T\right)x
\end{aligned}$$
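Since the introduction mentions gradient descent as the main use of this derivative, here is a minimal sketch of that update loop. It computes the gradient as $\frac{1}{m}X^T(h-y)$, which is the element-wise formula stacked over all samples and with the factor $1/m$ restored. The dataset, the learning rate `lr`, the step count, and the function names are all assumptions made for illustration, not values from the derivation.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(a, b):
    return sum(a_k * b_k for a_k, b_k in zip(a, b))

def cost(theta, X, y):
    """J(theta) = (1/m) * sum_i [log(1 + e^z_i) - y_i z_i], with z_i = theta^T x_i."""
    m = len(X)
    return sum(math.log1p(math.exp(dot(theta, x_i))) - y_i * dot(theta, x_i)
               for x_i, y_i in zip(X, y)) / m

def gradient(theta, X, y):
    """(1/m) * X^T (h - y): residuals first, then one weighted sum per column."""
    m = len(X)
    residual = [sigmoid(dot(theta, x_i)) - y_i for x_i, y_i in zip(X, y)]
    return [sum(r * x_i[j] for r, x_i in zip(residual, X)) / m
            for j in range(len(theta))]

def gradient_descent(X, y, lr=0.5, steps=200):
    """Plain batch gradient descent starting from theta = 0."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(theta, X, y)
        theta = [t - lr * g_j for t, g_j in zip(theta, g)]
    return theta

# Hypothetical, deliberately non-separable data; first column is the bias term.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]

theta_fit = gradient_descent(X, y)
print(cost([0.0, 0.0], X, y))  # log(2) at the starting point
print(cost(theta_fit, X, y))   # lower after the updates
```

The learning rate here is small enough for this particular toy dataset; on other data it would need tuning.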
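Please credit the source when reposting: 赵子健的博客 » Machine Learning Series » Differentiating the Cross-Entropy Loss Function [zijian-zhao.com/2020/04/crossEntropyLossGrident/]
Since I do not log in to CSDN often, it is hard for me to reply to comments promptly, so this article is also published on my Zhihu column. If you have questions, you are welcome to discuss them on Zhihu: 人工+智能
Published by: 全栈程序员-用户IM. Please credit the source when reposting: https://javaforall.cn/130300.html Original link: https://javaforall.cn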