RPN Network Code Walkthrough



1. Before We Start

In the object detection world, Faster RCNN needs no introduction. Inside it sits a network called the RPN (Region Proposal Network), which generates candidate prediction regions on top of a feature map. But how exactly does this network work? There are plenty of explanations online, most of them rather hand-wavy, so the most direct route is to read the code itself. Let's dig in. -_-||
First, let's see what the RPN looks like inside Faster RCNN. The RPN attaches a convolution layer, rpn_conv/3x3, directly on top of the classification backbone's feature map output, followed by two convolution layers, rpn_cls_score and rpn_bbox_pred, which produce the foreground/background classification and the predicted boxes respectively. A Python layer, AnchorTargetLayer, then produces the anchor-based classification and box-regression targets. After that, the ROI Proposal stage generates candidate ROI regions, which ROI Pooling normalizes to a fixed size for downstream processing. The overall structure is shown below:
[Figure: overall structure of the RPN inside Faster RCNN]
The figure gives a fairly intuitive but coarse picture of the RPN; what it actually does internally is still unclear. So let's walk through the code. First, an overview of the files in the RPN module:
(1) generate_anchors.py
Generates a regular grid of multi-scale, multi-aspect anchor boxes, starting from the base anchor [0, 0, 15, 15].
(2) proposal_layer.py
Converts RPN outputs (per-anchor scores and bbox regression estimates) into object proposals.
(3) anchor_target_layer.py
Generates training targets/labels for each anchor. Classification labels are 1 (object), 0 (not object) or -1 (ignore); bbox regression targets are only specified when the classification label is > 0.
(4) proposal_target_layer.py
Generates training targets/labels for each object proposal: classification labels range over $0 \ldots K$ (0 for background, $1, \dots, K$ for the object classes), and bbox regression targets are likewise only specified when the label is > 0.
(5) generate.py
Generates object detection proposals from an imdb using an RPN.
Now that we have a rough picture of the RPN's structure and of the files in the RPN module, let's read the implementation and see what it actually does.

2. The RPN Network

This part uses two files: anchor_target_layer.py and generate_anchors.py. generate_anchors.py produces the anchors the model needs (it also contains a few helper functions); it is not the focus of this walkthrough, so it is not covered in detail, though a quick sketch of its output is shown below. The main file to study is anchor_target_layer.py.
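Still, a quick look at what generate_anchors produces helps ground everything that follows. A minimal usage sketch, assuming py-faster-rcnn's lib/ directory is on the Python path (the assertions describe the well-known default behavior, not exact repository output):

import numpy as np
from rpn.generate_anchors import generate_anchors  # assumes lib/ is on the path

# 9 base anchors around the [0, 0, 15, 15] cell: 3 aspect ratios x 3 scales,
# each row in (x1, y1, x2, y2) form
anchors = generate_anchors(scales=np.array((8, 16, 32)))
assert anchors.shape == (9, 4)
widths  = anchors[:, 2] - anchors[:, 0] + 1
heights = anchors[:, 3] - anchors[:, 1] + 1
# the anchors cover roughly (16*8)^2, (16*16)^2 and (16*32)^2 pixels,
# at aspect ratios of about 1:2, 1:1 and 2:1
print np.round(widths * heights)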
First, look at the layer's setup function:

def setup(self, bottom, top):
    layer_params = yaml.load(self.param_str_)
    anchor_scales = layer_params.get('scales', (8, 16, 32))  # anchor scale factors
    self._anchors = generate_anchors(scales=np.array(anchor_scales))  # the 9 default anchors
    self._num_anchors = self._anchors.shape[0]
    self._feat_stride = layer_params['feat_stride']

    # allow boxes to sit over the edge by a small amount
    # with 0, any proposal that crosses the image boundary, even slightly, is discarded
    self._allowed_border = layer_params.get('allowed_border', 0)

    height, width = bottom[0].data.shape[-2:]
    if DEBUG:
        print 'AnchorTargetLayer: height', height, 'width', width

    A = self._num_anchors
    # labels: the object/non-object classification
    top[0].reshape(1, 1, A * height, width)
    # bbox_targets
    top[1].reshape(1, A * 4, height, width)
    # bbox_inside_weights
    top[2].reshape(1, A * 4, height, width)
    # bbox_outside_weights
    top[3].reshape(1, A * 4, height, width)

Next comes the heart of the layer, the forward function. First, it generates the complete set of anchors over the feature map:

# 1. Generate proposals from bbox deltas and shifted anchors
# x-direction offsets, one per column of the feature map
shift_x = np.arange(0, width) * self._feat_stride
# y-direction offsets, one per row of the feature map
shift_y = np.arange(0, height) * self._feat_stride
# shift_x and shift_y become width x height 2-D arrays (via meshgrid); pairing up
# corresponding elements gives the offsets on the image (offsets relative to the
# 9 anchors at the top-left corner of the image). In total there are
# width*height offset pairs; adding them to the base anchors yields all
# width*height*9 anchors, stored in all_anchors.
shift_x, shift_y = np.meshgrid(shift_x, shift_y)
shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                    shift_x.ravel(), shift_y.ravel())).transpose()  # shape (width*height, 4)
# add A anchors (1, A, 4) to
# cell K shifts (K, 1, 4) to get
# shift anchors (K, A, 4)
# reshape to (K*A, 4) shifted anchors
A = self._num_anchors
K = shifts.shape[0]  # K = width*height
# produce K*A anchors from the 9 base anchors, i.e. the full anchor set
all_anchors = (self._anchors.reshape((1, A, 4)) +
               shifts.reshape((1, K, 4)).transpose((1, 0, 2)))
all_anchors = all_anchors.reshape((K * A, 4))
total_anchors = int(K * A)  # total number of anchors
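To make the broadcasting trick above concrete, here is a self-contained toy example (numbers invented: 2 base anchors on a 2x2 feature map) showing how the (1, A, 4) + (K, 1, 4) addition produces all K*A shifted anchors:

import numpy as np

feat_stride = 16
height, width = 2, 2                       # toy feature map
base_anchors = np.array([[-8, -8, 8, 8],   # 2 toy base anchors (x1, y1, x2, y2)
                         [-16, -16, 16, 16]])

shift_x = np.arange(0, width) * feat_stride
shift_y = np.arange(0, height) * feat_stride
shift_x, shift_y = np.meshgrid(shift_x, shift_y)
shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                    shift_x.ravel(), shift_y.ravel())).transpose()

A = base_anchors.shape[0]                  # 2 anchors per cell
K = shifts.shape[0]                        # 4 cells
all_anchors = (base_anchors.reshape((1, A, 4)) +
               shifts.reshape((1, K, 4)).transpose((1, 0, 2)))
all_anchors = all_anchors.reshape((K * A, 4))
assert all_anchors.shape == (8, 4)         # K*A = 4*2 = 8 anchors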

With this many anchors, some inevitably extend beyond the image boundary, and those must be removed:

# only keep anchors inside the image; anchors within the boundary are the
# valid ones, those crossing it are dropped
inds_inside = np.where(
    (all_anchors[:, 0] >= -self._allowed_border) &
    (all_anchors[:, 1] >= -self._allowed_border) &
    (all_anchors[:, 2] < im_info[1] + self._allowed_border) &  # width
    (all_anchors[:, 3] < im_info[0] + self._allowed_border)    # height
    )[0]

Next, initialize the labels of the valid anchors; the meaning of each label value is spelled out in the comment:

# label: 1 is positive, 0 is negative, -1 is dont care
# object/non-object labels for the anchors inside the image,
# one entry per valid anchor
labels = np.empty((len(inds_inside), ), dtype=np.float32)
labels.fill(-1)

With the anchors in hand, the next step is to relate each anchor to the ground-truth boxes. The relation is measured by overlap area (IoU), and each anchor's object/non-object label is assigned based on this measure.

# overlaps between the anchors and the gt boxes
# overlaps (ex, gt) is a 2-D array of shape [num_anchors, num_gt_boxes]
overlaps = bbox_overlaps(
    np.ascontiguousarray(anchors, dtype=np.float),
    np.ascontiguousarray(gt_boxes, dtype=np.float))
argmax_overlaps = overlaps.argmax(axis=1)  # for each anchor, the gt with the largest overlap
max_overlaps = overlaps[np.arange(len(inds_inside)), argmax_overlaps]  # that largest overlap value
gt_argmax_overlaps = overlaps.argmax(axis=0)  # for each gt, the anchor with the largest overlap
gt_max_overlaps = overlaps[gt_argmax_overlaps,
                           np.arange(overlaps.shape[1])]  # that largest overlap value
gt_argmax_overlaps = np.where(overlaps == gt_max_overlaps)[0]

# anchors whose max overlap falls below the 0.3 threshold are labeled 0
if not cfg.TRAIN.RPN_CLOBBER_POSITIVES:
    # assign bg labels first so that positive labels can clobber them
    labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0

# fg label: for each gt, the anchor with the highest overlap gets label 1
labels[gt_argmax_overlaps] = 1

# fg label: above threshold IOU; anchors whose overlap with a gt is >= 0.7 also get label 1
labels[max_overlaps >= cfg.TRAIN.RPN_POSITIVE_OVERLAP] = 1

if cfg.TRAIN.RPN_CLOBBER_POSITIVES:
    # assign bg labels last so that negative labels can clobber positives
    labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0
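bbox_overlaps used above is a compiled Cython routine, but semantically it is just pairwise IoU. A pure-NumPy equivalent sketch (my own illustration, not the repository's code; note the +1 terms, which follow this codebase's integer box convention):

import numpy as np

def bbox_overlaps_np(boxes, query_boxes):
    """IoU between boxes (N, 4) and query_boxes (K, 4); returns an (N, K) array."""
    ix1 = np.maximum(boxes[:, None, 0], query_boxes[None, :, 0])
    iy1 = np.maximum(boxes[:, None, 1], query_boxes[None, :, 1])
    ix2 = np.minimum(boxes[:, None, 2], query_boxes[None, :, 2])
    iy2 = np.minimum(boxes[:, None, 3], query_boxes[None, :, 3])
    iw = np.maximum(ix2 - ix1 + 1, 0)          # intersection width (0 if disjoint)
    ih = np.maximum(iy2 - iy1 + 1, 0)
    inter = (iw * ih).astype(np.float64)
    area_b = (boxes[:, 2] - boxes[:, 0] + 1) * (boxes[:, 3] - boxes[:, 1] + 1)
    area_q = (query_boxes[:, 2] - query_boxes[:, 0] + 1) * \
             (query_boxes[:, 3] - query_boxes[:, 1] + 1)
    return inter / (area_b[:, None] + area_q[None, :] - inter)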

The paper says 256 anchors are randomly sampled from all anchors, 128 foreground and 128 background. Note that anchors labeled -1 are used neither as foreground nor as background.
The next two snippets first pick 128 from all the foreground anchors, then 128 from all the background anchors. If there are fewer than 128 foreground anchors, all of them are kept and the shortfall is made up with background anchors, just like the ROI sampling in Fast RCNN.

# subsample positive labels if we have too many
# choose 128 from all anchors labeled 1; the rest have their label reset to -1
num_fg = int(cfg.TRAIN.RPN_FG_FRACTION * cfg.TRAIN.RPN_BATCHSIZE)  # the sampling quota
fg_inds = np.where(labels == 1)[0]
if len(fg_inds) > num_fg:
    disable_inds = npr.choice(
        fg_inds, size=(len(fg_inds) - num_fg), replace=False)
    labels[disable_inds] = -1

# subsample negative labels if we have too many
# note num_bg is not simply 128 but 256 minus the number of label-1 anchors, so if
# there are not enough positives the batch is padded with label-0 anchors; a neat detail
num_bg = cfg.TRAIN.RPN_BATCHSIZE - np.sum(labels == 1)
bg_inds = np.where(labels == 0)[0]
if len(bg_inds) > num_bg:
    disable_inds = npr.choice(
        bg_inds, size=(len(bg_inds) - num_bg), replace=False)
    labels[disable_inds] = -1
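The subsampling idiom itself is worth a standalone look: sample indices without replacement and flip the extras to -1. A tiny runnable demo with made-up labels:

import numpy as np
import numpy.random as npr

labels = np.array([1, 1, 1, 1, -1, 0, 0, 1, 1])   # toy labels
num_fg = 3                                        # pretend the fg quota is 3
fg_inds = np.where(labels == 1)[0]                # 6 foreground anchors here
if len(fg_inds) > num_fg:
    disable_inds = npr.choice(fg_inds, size=(len(fg_inds) - num_fg),
                              replace=False)      # pick 3 of them at random
    labels[disable_inds] = -1                     # -1 = excluded from training
assert (labels == 1).sum() == num_fg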

The paper defines the RPN loss function as:
$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
This loss is very similar to the one in Fast RCNN. Because smooth-L1 is computed per coordinate, the factors $p_i^*$ and $\frac{1}{N_{reg}}$ must be expanded into 4-dimensional vectors rather than the single scalars of the paper.
bbox_inside_weights corresponds to $p_i^*$, and bbox_outside_weights to the normalization factor $\frac{1}{N_{reg}}$. Per the paper, $p_i^*$ is 1 when the anchor is foreground and 0 when it is background. In this code, anchors labeled -1 also get 0; since they never participate in the loss, the value does not actually matter.
$N_{reg}$ normalizes the regression term, i.e. takes the mean over all anchors labeled 0 or 1. Since 256 anchors are used for training, the weight is effectively $\frac{1}{256}$.

bbox_targets = np.zeros((len(inds_inside), 4), dtype=np.float32)  # regression targets for the filtered anchors
bbox_targets = _compute_targets(anchors, gt_boxes[argmax_overlaps, :])  # anchor-to-gt residuals used for regression

bbox_inside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
bbox_inside_weights[labels == 1, :] = np.array(cfg.TRAIN.RPN_BBOX_INSIDE_WEIGHTS)

bbox_outside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
# normalize the per-example weights
if cfg.TRAIN.RPN_POSITIVE_WEIGHT < 0:
    # uniform weighting of examples (given non-uniform sampling)
    num_examples = np.sum(labels >= 0)
    positive_weights = np.ones((1, 4)) * 1.0 / num_examples
    negative_weights = np.ones((1, 4)) * 1.0 / num_examples
else:
    assert ((cfg.TRAIN.RPN_POSITIVE_WEIGHT > 0) &
            (cfg.TRAIN.RPN_POSITIVE_WEIGHT < 1))
    positive_weights = (cfg.TRAIN.RPN_POSITIVE_WEIGHT /
                        np.sum(labels == 1))
    negative_weights = ((1.0 - cfg.TRAIN.RPN_POSITIVE_WEIGHT) /
                        np.sum(labels == 0))
bbox_outside_weights[labels == 1, :] = positive_weights
bbox_outside_weights[labels == 0, :] = negative_weights

Afterwards, the computed values are mapped back onto the full anchor set:

# map up to original set of anchors
# i.e. map the arrays of length len(inds_inside) back to length total_anchors,
# where total_anchors = (width*height)*9
labels = _unmap(labels, total_anchors, inds_inside, fill=-1)
bbox_targets = _unmap(bbox_targets, total_anchors, inds_inside, fill=0)
bbox_inside_weights = _unmap(bbox_inside_weights, total_anchors, inds_inside, fill=0)
bbox_outside_weights = _unmap(bbox_outside_weights, total_anchors, inds_inside, fill=0)

Note that although the RPN is trained on 256 anchors (128 positive, 128 negative), anchor_target_layer does not output only those 256 labels and box transforms; it outputs all anchors, which is exactly what the _unmap function accomplishes. So how does training end up using only those 256? Of this layer's four outputs, rpn_labels feeds the rpn_loss_cls layer and the other three feed rpn_loss_bbox. The labels are the $p_i^*$ in the first half of the loss function (the classification loss), a log loss: labels of -1 cannot enter a log computation and are ignored, while 0 and 1 are computed directly, which is where the 256 comes in for classification. The second half of the loss covers the box coordinates, and its $p_i^*$ is bbox_inside_weights. As the paper puts it, the regression loss is "activated only for positive anchors": $p_i^*$ is 1 for positives and 0 in every other case. So only those 256 anchors actually change the loss value; everything else contributes 0.

bbox_inside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
bbox_inside_weights[labels == 1, :] = np.array(cfg.TRAIN.RPN_BBOX_INSIDE_WEIGHTS)

This snippet (repeated from above) embodies exactly that idea, and it is the other place where the effective batch of 256 is realized.
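To see concretely how label -1 drops out of the classification loss, here is a hedged NumPy sketch of a log loss with an ignore label (my own illustration; in the actual prototxt the same effect comes from Caffe's SoftmaxWithLoss with ignore_label set to -1):

import numpy as np

def masked_log_loss(probs, labels):
    """probs: (N, 2) softmax outputs; labels: (N,) in {-1, 0, 1}.
    Anchors labeled -1 contribute nothing to the loss (or its gradient)."""
    rows = np.where(labels >= 0)[0]        # the ~256 sampled anchors
    cls = labels[rows].astype(int)
    picked = probs[rows, cls]              # probability assigned to the true class
    return -np.log(picked + 1e-12).sum() / max(len(rows), 1)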

Finally, the arrays are transposed into the right layout and the layer's four outputs are set:

# labels
labels = labels.reshape((1, height, width, A)).transpose(0, 3, 1, 2)
labels = labels.reshape((1, 1, A * height, width))
top[0].reshape(*labels.shape)
top[0].data[...] = labels
# bbox_targets
bbox_targets = bbox_targets \
    .reshape((1, height, width, A * 4)).transpose(0, 3, 1, 2)
top[1].reshape(*bbox_targets.shape)
top[1].data[...] = bbox_targets
# bbox_inside_weights
bbox_inside_weights = bbox_inside_weights \
    .reshape((1, height, width, A * 4)).transpose(0, 3, 1, 2)
assert bbox_inside_weights.shape[2] == height
assert bbox_inside_weights.shape[3] == width
top[2].reshape(*bbox_inside_weights.shape)
top[2].data[...] = bbox_inside_weights
# bbox_outside_weights
bbox_outside_weights = bbox_outside_weights \
    .reshape((1, height, width, A * 4)).transpose(0, 3, 1, 2)
assert bbox_outside_weights.shape[2] == height
assert bbox_outside_weights.shape[3] == width
top[3].reshape(*bbox_outside_weights.shape)
top[3].data[...] = bbox_outside_weights

That completes the pipeline from feature map and anchors to anchor classifications and regression targets; the next step is computing the RPN losses from this layer's outputs.

PS: Notice that this layer implements no backward pass, so it supplies no gradients to the network. Why? Because its input rpn_cls_score is only used to read off the feature map's height and width; nothing here needs a gradient.

3. The ROI Proposal Stage

3.1 ProposalLayer

This layer has three inputs: the fg/bg anchor classification result rpn_cls_prob_reshape, the corresponding bbox regression deltas $[d_x(A), d_y(A), d_w(A), d_h(A)]$ in rpn_bbox_pred, and im_info; it also takes the parameter feat_stride=16.
First, im_info. Any input image is reshaped to a fixed $M \times N$ before entering Faster RCNN, and $im\_info = [M, N, scale\_factor]$ records everything about that rescaling. The conv layers then apply four poolings, shrinking the map to $(M/16) \times (N/16)$, which is what $feat\_stride = 16$ records. All of these numbers exist so that proposals can be mapped back onto the original image.
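A concrete toy example of this bookkeeping (numbers invented for illustration):

# Toy numbers, invented for illustration.
im_info = [600., 800., 1.6]      # a 375x500 image rescaled by 1.6 to 600x800
feat_stride = 16

feat_h, feat_w = int(600 / 16), int(800 / 16)    # roughly a 37x50 feature map
y, x = 10, 20                                    # some cell of that feature map
img_y, img_x = y * feat_stride, x * feat_stride  # (160, 320) in the rescaled image
orig_y, orig_x = img_y / im_info[2], img_x / im_info[2]  # (100.0, 200.0) in the original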
First, look at this layer's setup function:

def setup(self, bottom, top):
    # parse the layer parameter string, which must be valid YAML
    layer_params = yaml.load(self.param_str_)
    self._feat_stride = layer_params['feat_stride']
    anchor_scales = layer_params.get('scales', (8, 16, 32))
    self._anchors = generate_anchors(scales=np.array(anchor_scales))  # the 9 default anchors
    self._num_anchors = self._anchors.shape[0]
    if DEBUG:
        print 'feat_stride: {}'.format(self._feat_stride)
        print 'anchors:'
        print self._anchors
    # rois blob: holds R regions of interest, each is a 5-tuple
    # (n, x1, y1, x2, y2) specifying an image batch index n and a
    # rectangle (x1, y1, x2, y2)
    top[0].reshape(1, 5)
    # scores blob: holds scores for R regions of interest
    if len(top) > 1:
        top[1].reshape(1, 1, 1, 1)

Before the forward computation, a few configuration values are loaded:

cfg_key = str(self.phase)  # either 'TRAIN' or 'TEST'; NMS keeps different counts in each phase
# Number of top scoring boxes to keep before applying NMS to RPN proposals
pre_nms_topN  = cfg[cfg_key].RPN_PRE_NMS_TOP_N   # 12000 (train) / 6000 (test)
# Number of top scoring boxes to keep after applying NMS to RPN proposals
post_nms_topN = cfg[cfg_key].RPN_POST_NMS_TOP_N  # 2000 (train) / 300 (test)
# NMS threshold used on RPN proposals
nms_thresh    = cfg[cfg_key].RPN_NMS_THRESH      # 0.7
# Proposal height and width both need to be greater than RPN_MIN_SIZE (at orig image scale)
min_size      = cfg[cfg_key].RPN_MIN_SIZE        # 16
# the first set of _num_anchors channels are bg probs
# the second set are the fg probs, which we want
# i.e. the first 9 channels are the background class, the last 9 the foreground class
scores = bottom[0].data[:, self._num_anchors:, :, :]  # predicted scores (the conv outputs 18 channels)
bbox_deltas = bottom[1].data  # predicted box deltas
im_info = bottom[2].data[0, :]  # image info

Now the proposal generation begins.
Step 1: generate the anchors again and apply bbox_deltas to obtain the predicted boxes.

# 1. Generate proposals from bbox deltas and shifted anchors
height, width = scores.shape[-2:]
if DEBUG:
    print 'score map size: {}'.format(scores.shape)
# Enumerate all shifts; this part is identical to anchor_target_layer
shift_x = np.arange(0, width) * self._feat_stride
shift_y = np.arange(0, height) * self._feat_stride
shift_x, shift_y = np.meshgrid(shift_x, shift_y)
shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                    shift_x.ravel(), shift_y.ravel())).transpose()
# Enumerate all shifted anchors:
#
# add A anchors (1, A, 4) to
# cell K shifts (K, 1, 4) to get
# shift anchors (K, A, 4)
# reshape to (K*A, 4) shifted anchors
A = self._num_anchors
K = shifts.shape[0]
anchors = self._anchors.reshape((1, A, 4)) + \
          shifts.reshape((1, K, 4)).transpose((1, 0, 2))
anchors = anchors.reshape((K * A, 4))
# Transpose and reshape predicted bbox transformations to get them
# into the same order as the anchors:
#
# bbox deltas will be (1, 4 * A, H, W) format
# transpose to (1, H, W, 4 * A)
# reshape to (1 * H * W * A, 4) where rows are ordered by (h, w, a)
# in slowest to fastest order
bbox_deltas = bbox_deltas.transpose((0, 2, 3, 1)).reshape((-1, 4))
# Same story for the scores:
#
# scores are (1, A, H, W) format
# transpose to (1, H, W, A)
# reshape to (1 * H * W * A, 1) where rows are ordered by (h, w, a)
scores = scores.transpose((0, 2, 3, 1)).reshape((-1, 1))
# Convert anchors into proposals via bbox transformations
# bbox_deltas corrects the anchors into predicted proposal positions (see the
# formulas in the paper): a linear shift for x, y and an exp scaling for w, h
proposals = bbox_transform_inv(anchors, bbox_deltas)
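bbox_transform_inv applies the paper's parameterization: the center moves linearly ($x = d_x w_a + x_a$), while width and height scale exponentially ($w = w_a e^{d_w}$). A condensed NumPy sketch of that logic, written from the idea rather than copied from the repository:

import numpy as np

def bbox_transform_inv_np(boxes, deltas):
    """Decode (dx, dy, dw, dh) deltas against anchors given as (x1, y1, x2, y2)."""
    widths  = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx, dy, dw, dh = deltas[:, 0], deltas[:, 1], deltas[:, 2], deltas[:, 3]
    pred_ctr_x = dx * widths + ctr_x         # linear shift of the center
    pred_ctr_y = dy * heights + ctr_y
    pred_w = np.exp(dw) * widths             # exponential scaling of the size
    pred_h = np.exp(dh) * heights

    pred = np.zeros(deltas.shape, dtype=np.float64)
    pred[:, 0] = pred_ctr_x - 0.5 * pred_w   # back to (x1, y1, x2, y2)
    pred[:, 1] = pred_ctr_y - 0.5 * pred_h
    pred[:, 2] = pred_ctr_x + 0.5 * pred_w
    pred[:, 3] = pred_ctr_y + 0.5 * pred_h
    return pred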

Step 2: clip the predicted boxes so they lie within the image.

# 2. clip predicted boxes to image
# clip the predicted boxes to the image boundary
proposals = clip_boxes(proposals, im_info[:2])

Step 3: remove small predicted boxes; the threshold is 16 pixels at the original image scale.

# 3. remove predicted boxes with either height or width < threshold
# (NOTE: convert min_size to input image scale stored in im_info[2])
# drop boxes with width or height below 16, matching the four pooling stages
keep = _filter_boxes(proposals, min_size * im_info[2])
proposals = proposals[keep, :]
scores = scores[keep]
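_filter_boxes simply thresholds widths and heights; its logic is essentially:

import numpy as np

def _filter_boxes_np(boxes, min_size):
    """Keep boxes whose width and height are both >= min_size."""
    ws = boxes[:, 2] - boxes[:, 0] + 1
    hs = boxes[:, 3] - boxes[:, 1] + 1
    return np.where((ws >= min_size) & (hs >= min_size))[0]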

Step 4: sort the predicted boxes by score and send the top N to NMS.

# 4. sort all (proposal, score) pairs by score from highest to lowest
# 5. take top pre_nms_topN (e.g. 6000); NMS follows, see the settings above
order = scores.ravel().argsort()[::-1]
if pre_nms_topN > 0:
    order = order[:pre_nms_topN]
proposals = proposals[order, :]  # coordinates of the top pre_nms_topN boxes
scores = scores[order]           # scores of the top pre_nms_topN boxes

Step 5: run NMS and keep the top N of what survives.

# 6. apply nms (e.g. threshold = 0.7)
# 7. take after_nms_topN (e.g. 300)
# 8. return the top proposals (-> RoIs top); run NMS on the predicted boxes
keep = nms(np.hstack((proposals, scores)), nms_thresh)
if post_nms_topN > 0:
    keep = keep[:post_nms_topN]
proposals = proposals[keep, :]  # keep the top post_nms_topN boxes after NMS
scores = scores[keep]
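The nms call dispatches to a CPU or GPU implementation, but greedy NMS is short enough to sketch in pure NumPy (my own condensed version): repeatedly keep the highest-scoring box and suppress everything that overlaps it beyond the threshold.

import numpy as np

def nms_np(dets, thresh):
    """dets: (N, 5) array of (x1, y1, x2, y2, score); returns kept indices."""
    x1, y1, x2, y2, scores = dets[:, 0], dets[:, 1], dets[:, 2], dets[:, 3], dets[:, 4]
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]           # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)                       # highest remaining score survives
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        iou = w * h / (areas[i] + areas[order[1:]] - w * h)
        order = order[np.where(iou <= thresh)[0] + 1]  # drop overlapping boxes
    return keep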

Step 6: write out the results.

# Output rois blob
# Our RPN implementation only supports a single input image, so all
# batch inds are 0
batch_inds = np.zeros((proposals.shape[0], 1), dtype=np.float32)
blob = np.hstack((batch_inds, proposals.astype(np.float32, copy=False)))
top[0].reshape(*(blob.shape))
top[0].data[...] = blob
# [Optional] output scores blob
if len(top) > 1:
    top[1].reshape(*(scores.shape))
    top[1].data[...] = scores

3.2 ProposalTargetLayer

This layer matches the RPN's predicted boxes to their object classes, and it limits how many boxes each training pass handles (only 32 foreground boxes per pass, a quarter of the 128 total); see the _sample_rois function for details. First, it reads the number of classes and initializes the shapes of the output blobs:

def setup(self, bottom, top):
    layer_params = yaml.load(self.param_str_)
    self._num_classes = layer_params['num_classes']
    # sampled rois (0, x1, y1, x2, y2)
    top[0].reshape(1, 5)
    # labels
    top[1].reshape(1, 1)
    # bbox_targets
    top[2].reshape(1, self._num_classes * 4)
    # bbox_inside_weights
    top[3].reshape(1, self._num_classes * 4)
    # bbox_outside_weights
    top[4].reshape(1, self._num_classes * 4)

The forward function:

def forward(self, bottom, top):
    # Proposal ROIs (0, x1, y1, x2, y2) coming from RPN
    # (i.e., rpn.proposal_layer.ProposalLayer), or any other source
    all_rois = bottom[0].data  # RPN proposals, shape [N, 5]
    # GT boxes (x1, y1, x2, y2, label)
    # TODO(rbg): it's annoying that sometimes I have extra info before
    # and other times after box coordinates -- normalize to one format
    gt_boxes = bottom[1].data  # ground truth, shape [M, 5]
    # Include ground-truth boxes in the set of candidate rois
    # (effectively adding extra positive samples)
    # all_rois becomes [N+M, 5]: the boxes selected from the RPN output and
    # the ground truth boxes merged together
    zeros = np.zeros((gt_boxes.shape[0], 1), dtype=gt_boxes.dtype)
    all_rois = np.vstack(
        (all_rois, np.hstack((zeros, gt_boxes[:, :-1])))
    )  # prepend a 0 to each gt box (to match the layout of the N RPN boxes), then append them at the end
    # Sanity check: single batch only
    assert np.all(all_rois[:, 0] == 0), \
        'Only single item batches are supported'
    num_images = 1
    rois_per_image = cfg.TRAIN.BATCH_SIZE / num_images  # cfg.TRAIN.BATCH_SIZE is 128
    # cfg.TRAIN.FG_FRACTION is 0.25, i.e. at most 32 foreground boxes per training pass
    fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image)
    # Sample rois with classification labels and bounding box regression
    # targets
    # _sample_rois picks the boxes used for training, fetches their class and
    # coordinate ground truth, and computes the bbox_inside_weights needed by the box loss
    labels, rois, bbox_targets, bbox_inside_weights = _sample_rois(
        all_rois, gt_boxes, fg_rois_per_image,
        rois_per_image, self._num_classes)
    if DEBUG:
        print 'num fg: {}'.format((labels > 0).sum())
        print 'num bg: {}'.format((labels == 0).sum())
        self._count += 1
        self._fg_num += (labels > 0).sum()
        self._bg_num += (labels == 0).sum()
        print 'num fg avg: {}'.format(self._fg_num / self._count)
        print 'num bg avg: {}'.format(self._bg_num / self._count)
        print 'ratio: {:.3f}'.format(float(self._fg_num) / float(self._bg_num))
    # sampled rois: all boxes kept after sampling
    top[0].reshape(*rois.shape)
    top[0].data[...] = rois
    # classification labels of the kept boxes
    top[1].reshape(*labels.shape)
    top[1].data[...] = labels
    # bbox_targets: residuals between the boxes and their gt
    top[2].reshape(*bbox_targets.shape)
    top[2].data[...] = bbox_targets
    # bbox_inside_weights
    top[3].reshape(*bbox_inside_weights.shape)
    top[3].data[...] = bbox_inside_weights
    # bbox_outside_weights
    top[4].reshape(*bbox_inside_weights.shape)
    top[4].data[...] = np.array(bbox_inside_weights > 0).astype(np.float32)

Sampling the predicted boxes, computing their residuals, and looking up each box's class from the ground truth:

def _sample_rois(all_rois, gt_boxes, fg_rois_per_image, rois_per_image, num_classes):
    """Generate a random sample of RoIs comprising foreground and background examples."""
    # overlaps: (rois x gt_boxes)
    # compute the overlap between every roi and every ground truth box,
    # using coordinates only: columns 2-5 of the rois (column 1 is the padded 0)
    # and columns 1-4 of the gt boxes
    overlaps = bbox_overlaps(
        np.ascontiguousarray(all_rois[:, 1:5], dtype=np.float),
        np.ascontiguousarray(gt_boxes[:, :4], dtype=np.float))
    gt_assignment = overlaps.argmax(axis=1)  # for each roi, the index of its gt box; shape [len(all_rois),]
    max_overlaps = overlaps.max(axis=1)      # for each roi, its largest overlap with any gt; shape [len(all_rois),]
    labels = gt_boxes[gt_assignment, 4]      # for each roi, the class it belongs to; shape [len(all_rois),]
    # Select foreground RoIs as those with >= FG_THRESH overlap
    # i.e. the rois that actually count as foreground
    fg_inds = np.where(max_overlaps >= cfg.TRAIN.FG_THRESH)[0]
    # Guard against the case when an image has fewer than fg_rois_per_image
    # foreground RoIs; the number of foreground boxes used for regression
    fg_rois_per_this_image = min(fg_rois_per_image, fg_inds.size)
    # Sample foreground regions without replacement
    # randomly drop some foreground boxes if there are too many
    if fg_inds.size > 0:
        fg_inds = npr.choice(fg_inds, size=fg_rois_per_this_image, replace=False)
    # Select background RoIs as those within [BG_THRESH_LO, BG_THRESH_HI)
    # find the background rois (overlap with gt between 0 and 0.5)
    bg_inds = np.where((max_overlaps < cfg.TRAIN.BG_THRESH_HI) &
                       (max_overlaps >= cfg.TRAIN.BG_THRESH_LO))[0]
    # Compute number of background RoIs to take from this image (guarding
    # against there being fewer than desired)
    bg_rois_per_this_image = rois_per_image - fg_rois_per_this_image  # 128 - 32
    bg_rois_per_this_image = min(bg_rois_per_this_image, bg_inds.size)  # same treatment as fg
    # Sample background regions without replacement
    if bg_inds.size > 0:
        bg_inds = npr.choice(bg_inds, size=bg_rois_per_this_image, replace=False)
    # The indices that we're selecting (both fg and bg)
    keep_inds = np.append(fg_inds, bg_inds)  # the boxes that survive sampling
    # Select sampled values from various arrays:
    labels = labels[keep_inds]  # the labels of the surviving boxes
    # Clamp labels for the background RoIs to 0
    labels[fg_rois_per_this_image:] = 0  # set the background boxes' class to 0
    rois = all_rois[keep_inds]  # the surviving rois
    # fetch the class and coordinate ground truth of the kept boxes, i.e. the box residuals
    bbox_target_data = _compute_targets(
        rois[:, 1:5], gt_boxes[gt_assignment[keep_inds], :4], labels)
    # produce the regression ground truth and the bbox_inside_weights used in the loss
    bbox_targets, bbox_inside_weights = \
        _get_bbox_regression_labels(bbox_target_data, num_classes)
    return labels, rois, bbox_targets, bbox_inside_weights

Computing the box residuals:

def _compute_targets(ex_rois, gt_rois, labels):
    """Compute bounding-box regression targets for an image."""
    assert ex_rois.shape[0] == gt_rois.shape[0]
    assert ex_rois.shape[1] == 4
    assert gt_rois.shape[1] == 4
    targets = bbox_transform(ex_rois, gt_rois)  # residuals between the boxes and their gt
    if cfg.TRAIN.BBOX_NORMALIZE_TARGETS_PRECOMPUTED:  # normalize if requested
        # Optionally normalize targets by a precomputed mean and stdev
        targets = ((targets - np.array(cfg.TRAIN.BBOX_NORMALIZE_MEANS))
                   / np.array(cfg.TRAIN.BBOX_NORMALIZE_STDS))
    # stack the residuals horizontally after the labels
    return np.hstack(
        (labels[:, np.newaxis], targets)).astype(np.float32, copy=False)
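bbox_transform is the encoding counterpart of the bbox_transform_inv decoding shown earlier. A condensed sketch of the computation:

import numpy as np

def bbox_transform_np(ex_rois, gt_rois):
    """Encode gt boxes against example boxes as (dx, dy, dw, dh) targets."""
    ex_w = ex_rois[:, 2] - ex_rois[:, 0] + 1.0
    ex_h = ex_rois[:, 3] - ex_rois[:, 1] + 1.0
    ex_cx = ex_rois[:, 0] + 0.5 * ex_w
    ex_cy = ex_rois[:, 1] + 0.5 * ex_h

    gt_w = gt_rois[:, 2] - gt_rois[:, 0] + 1.0
    gt_h = gt_rois[:, 3] - gt_rois[:, 1] + 1.0
    gt_cx = gt_rois[:, 0] + 0.5 * gt_w
    gt_cy = gt_rois[:, 1] + 0.5 * gt_h

    dx = (gt_cx - ex_cx) / ex_w            # normalized center offsets
    dy = (gt_cy - ex_cy) / ex_h
    dw = np.log(gt_w / ex_w)               # log size ratios
    dh = np.log(gt_h / ex_h)
    return np.vstack((dx, dy, dw, dh)).transpose()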

Arranging the data into the required format:

def _get_bbox_regression_labels(bbox_target_data, num_classes):
    """Bounding-box regression targets (bbox_target_data) are stored in a
    compact form N x (class, tx, ty, tw, th).

    This function expands those targets into the 4-of-4*K representation used
    by the network (i.e. only one class has non-zero targets).

    Returns:
        bbox_target (ndarray): N x 4K blob of regression targets
        bbox_inside_weights (ndarray): N x 4K blob of loss weights
    """
    clss = bbox_target_data[:, 0]  # each box's class, assigned via its overlap with the gt
    # per-class box residuals
    bbox_targets = np.zeros((clss.size, 4 * num_classes), dtype=np.float32)
    # initialize bbox_inside_weights with zeros
    bbox_inside_weights = np.zeros(bbox_targets.shape, dtype=np.float32)
    inds = np.where(clss > 0)[0]  # non-background boxes
    for ind in inds:
        cls = clss[ind]
        start = 4 * cls  # start of the regression slot for this box's class
        end = start + 4  # end of that slot
        bbox_targets[ind, start:end] = bbox_target_data[ind, 1:]  # write the residuals into the class's slot
        # switch on the corresponding regression weights
        bbox_inside_weights[ind, start:end] = cfg.TRAIN.BBOX_INSIDE_WEIGHTS  # (1.0, 1.0, 1.0, 1.0)
    return bbox_targets, bbox_inside_weights
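A tiny worked example of this expansion (toy numbers): with num_classes = 3, a RoI of class 2 gets its four regression values written into columns 8-11, and only those four weights are switched on:

import numpy as np

num_classes = 3
bbox_target_data = np.array([[2, 0.1, -0.2, 0.3, 0.05]])  # (class, tx, ty, tw, th)

bbox_targets = np.zeros((1, 4 * num_classes), dtype=np.float32)
bbox_inside_weights = np.zeros_like(bbox_targets)
cls = int(bbox_target_data[0, 0])
start, end = 4 * cls, 4 * cls + 4            # columns 8..11 for class 2
bbox_targets[0, start:end] = bbox_target_data[0, 1:]
bbox_inside_weights[0, start:end] = 1.0      # only the true class's slot is active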

4. ROI Pooling

This part is covered in detail in a separate write-up: 关于ROI Pooling Layer的解读. Still, the core idea is compact enough to sketch here.
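A rough NumPy sketch of ROI max-pooling (my own illustration, ignoring Caffe's exact rounding conventions): project the RoI onto the feature map, split it into a fixed pooled_h x pooled_w grid, and take the max within each bin.

import numpy as np

def roi_max_pool_np(feat, roi, pooled_h=7, pooled_w=7, spatial_scale=1.0 / 16):
    """feat: (C, H, W) feature map; roi: (x1, y1, x2, y2) in image coordinates."""
    H, W = feat.shape[1:]
    x1, y1, x2, y2 = [int(round(v * spatial_scale)) for v in roi]
    x1, y1 = max(x1, 0), max(y1, 0)                 # clamp to the feature map
    x2, y2 = min(x2, W - 1), min(y2, H - 1)
    roi_h, roi_w = max(y2 - y1 + 1, 1), max(x2 - x1 + 1, 1)
    out = np.zeros((feat.shape[0], pooled_h, pooled_w), dtype=feat.dtype)
    for i in range(pooled_h):      # each output bin max-pools one sub-window
        for j in range(pooled_w):
            ys = y1 + i * roi_h // pooled_h
            ye = y1 + max((i + 1) * roi_h // pooled_h, i * roi_h // pooled_h + 1)
            xs = x1 + j * roi_w // pooled_w
            xe = x1 + max((j + 1) * roi_w // pooled_w, j * roi_w // pooled_w + 1)
            out[:, i, j] = feat[:, ys:ye, xs:xe].max(axis=(1, 2))
    return out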
