softmax损失函数

大家好，又见面了，我是你们的朋友全栈君。

一：神经网络中的损失函数

cnn进行前向传播阶段，依次调用每个Layer的Forward函数，得到逐层的输出，最后一层与目标函数比较得到损失函数，计算误差更新值，通过反向传播逐层到达第一层，所有权值在反向传播结束时一起更新。

loss layer 是CNN的终点，接受两个Blob作为输入，其中一个是CNN的预测值，另一个是真实标签。损失层则将这两个输入进行一系列运算，得到当前网络的损失函数（Loss Function），一般记做L(θ）其中θ是当前网络权值构成的向量空间。机器学习的目的是在权值空间中找到让损失函数L(θ）最小的权值θ（opt），可以采用一系列最优化方法（如SGD方法）逼近权值θ（opt）。

损失函数是在前向传播计算中得到的，同时也是反向传播的起点。

二：Softmax函数

假设有K个类别，Softmax计算过程为：

softmax损失函数其中，j=0,1,2,3,4,5,…,K-1

下面图更直观：

softmax损失函数

softMax的结果相当于输入图像被分到每个标签的概率分布，该函数是单调增函数，即输入值越大，输出也就越大，输入图像属于该标签的概率就越大。

神评论：SVM只选自己喜欢的男神，Softmax把所有备胎全部拉出来评分，最后还归一化一下

对softmax的结果计算交叉熵分类损失函数为：

softmax损失函数

取log里面的值就是这组数据正确分类的Softmax值，它占的比重越大，这个样本的Loss也就越小，这种定义符合我们的要求.

三：wiki百科对softmax函数的定义：

In mathematics, the softmax function, or normalized exponential function,^[1]^:198 is a generalization of the logistic function that “squashes” a $K$ -dimensional vector $\mathbf {z}$ of arbitrary real values to a $K$ -dimensional vector $\sigma (\mathbf {z} )$ of real values in the range [0, 1] that add up to 1. The function is given by

\sigma (\mathbf {z} )_{j}={\frac {e^{z_{j}}}{\sum _{k=1}^{K}e^{z_{k}}}}

for j = 1, …, K.

The softmax function is used in various multiclass classification methods, such as multinomial logistic regression,^[1]^:206–209 multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks.^[2] Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of $K$ distinct linear functions, and the predicted probability for the $j$ ‘th class given a sample vector $x$ and a weighting vector $w$ ^{[further explanation needed]} is:

P(y=j\mid \mathbf {x} )={\frac {e^{\mathbf {x} ^{\mathsf {T}}\mathbf {w} _{j}}}{\sum _{k=1}^{K}e^{\mathbf {x} ^{\mathsf {T}}\mathbf {w} _{k}}}}

This can be seen as the composition of $K$ linear functions $\mathbf {x} \mapsto \mathbf {x} ^{\mathsf {T}}\mathbf {w} _{1},\ldots ,\mathbf {x} \mapsto \mathbf {x} ^{\mathsf {T}}\mathbf {w} _{K}$ and the softmax function (where $\mathbf {x} ^{\mathsf {T}}\mathbf {w}$ denotes the inner product of $\mathbf {x}$ and $\mathbf {w}$ ). The operation is equivalent to applying a linear operator defined by $\mathbf {w}$ to vectors $\mathbf {x}$ , thus transforming the original, probably highly-dimensional, input to vectors in a $K$ -dimensional space $R^{K}$ .