计算两个矩阵之间的欧式距离「建议收藏」

大家好，又见面了，我是你们的朋友全栈君。

在我们使用k-NN模型时，需要计算测试集中每一点到训练集中每一点的欧氏距离，即需要求得两矩阵之间的欧氏距离。在实现k-NN算法时通常有三种方案，分别是使用两层循环，使用一层循环和不使用循环。

使用两层循环

分别对训练集和测试集中的数据进行循环遍历，计算每两个点之间的欧式距离，然后赋值给dist矩阵。此算法没有经过任何优化。

num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train)) 
    for i in xrange(num_test):
      for j in xrange(num_train):
        ##################################################################### # TODO: # # Compute the l2 distance between the ith test point and the jth # # training point, and store the result in dists[i, j]. You should # # not use a loop over dimension. # #####################################################################
        # pass
        dists[i][j] = np.sqrt(np.sum(np.square(X[i] - self.X_train[j])))
        ##################################################################### # END OF YOUR CODE # #####################################################################
    return dists

使用一层循环

使用矩阵表示训练集的数据，计算测试集中每一点到训练集矩阵的距离，可以对算法优化为只使用一层循环。

def compute_distances_one_loop(self, X):
    """ Compute the distance between each test point in X and each training point in self.X_train using a single loop over the test data. Input / Output: Same as compute_distances_two_loops """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    for i in xrange(num_test):
      ####################################################################### # TODO: # # Compute the l2 distance between the ith test point and all training # # points, and store the result in dists[i, :]. # #######################################################################
      # pass
      dists[i] = np.sqrt(np.sum(np.square(self.X_train - X[i]), axis = 1))
      ####################################################################### # END OF YOUR CODE # #######################################################################
    return dists

不使用循环

运算效率最高的算法是将训练集和测试集都使用矩阵表示，然后使用矩阵运算的方法替代之前的循环操作。但此操作需要我们对矩阵的运算规则非常熟悉。接下来着重记录如何计算两个矩阵之间的欧式距离。

记录测试集矩阵P的大小为M*D，训练集矩阵C的大小为N*D（测试集中共有M个点，每个点为D维特征向量。训练集中共有N个点，每个点为D维特征向量）
记 $P_i$ 是P的第i行，记 $C_j$ 是C的第j行
$P_i = [ P_{i1}\quad P_{i2} \cdots P_{iD}]$ $C_j = [ C_{j1} C_{j2} \cdots \quad C_{jD}]$

首先计算 $P_{i}$ 和 $C_{j}$ 之间的距离dist(i,j)
$d(P_i,C_j) = \sqrt{(P_{i1}-C_{j1})^2+(P_{i2}-C_{j2})^2+\cdots+(P_{iD}-C_{jD})^2}\\ =\sqrt{(P_{i1}^2+P_{i2}^2+\cdots+P_{iD}^2)+(C_{j1}^2+C_{j2}^2+\cdots+C_{jD}^2)-2\times(P_{i1}C_{j1}+P_{i2}C_{j2}+\cdots+P_{iD}C_{iD})}\\=\sqrt{\left \| P_i\right \|^2+\left\|C_j\right\|^2-2\times P_iC_j^T}$

我们可以推广到距离矩阵的第i行的计算公式
$dist[i]=\sqrt{(\left\|P_i\right\|^2\quad \left\|P_i\right\|^2 \cdots \left\|P_i\right\|^2)+(\left\|C_1\right\|^2 \quad \left\|C_2\right\|^2 \cdots \left\|C_N\right\|^2)-2\times P_i(C_1^T \quad C_2^T \cdots C_N^T)}\\=\sqrt{(\left\|P_i\right\|^2\quad \left\|P_i\right\|^2 \cdots \left\|P_i\right\|^2)+(\left\|C_1\right\|^2 \quad \left\|C_2\right\|^2 \cdots \left\|C_N\right\|^2)-2\times P_iC^T}$

继续将公式推广为整个距离矩阵
$dist=\sqrt{\begin{pmatrix}\left\|P_1\right\|^2 & \left\|P_1\right\|^2 & \cdots & \left\|P_1\right\|^2\\\left\|P_2\right\|^2 & \left\|P_2\right\|^2 & \cdots & \left\|P_2\right\|^2\\\vdots & \vdots & \ddots & \vdots \\\left\|P_M\right\|^2 & \left\|P_M\right\|^2 & \cdots & \left\|P_M\right\|^2 \end{pmatrix}+\begin{pmatrix}\left\|C_1\right\|^2 & \left\|C_2\right\|^2 & \cdots & \left\|C_N\right\|^2\\\left\|C_1\right\|^2 & \left\|C_2\right\|^2 & \cdots & \left\|C_N\right\|^2\\\vdots & \vdots & \ddots & \vdots \\\left\|C_1\right\|^2 & \left\|C_2\right\|^2 & \cdots & \left\|C_N\right\|^2 \end{pmatrix}-2\times PC^T}$

表示为python代码：

def compute_distances_no_loops(self, X):
    """ Compute the distance between each test point in X and each training point in self.X_train using no explicit loops. Input / Output: Same as compute_distances_two_loops """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train)) 
    #########################################################################
    # TODO: #
    # Compute the l2 distance between all test points and all training #
    # points without using any explicit loops, and store the result in #
    # dists. #
    # #
    # You should implement this function using only basic array operations; #
    # in particular you should not use functions from scipy. #
    # #
    # HINT: Try to formulate the l2 distance using matrix multiplication #
    # and two broadcast sums. #
    #########################################################################
    # pass
    dists = np.sqrt(-2*np.dot(X, self.X_train.T) + np.sum(np.square(self.X_train), axis = 1) + np.transpose([np.sum(np.square(X), axis = 1)]))
    #########################################################################
    # END OF YOUR CODE #
    #########################################################################
    return dists