Linux 3.10.0 Block I/O Subsystem Flow (7): Request Completion

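In contrast to request submission, request completion starts at the low-level driver. Completion is split into two parts: a top half and a bottom half. Completion always begins in interrupt context, where the main task is to put the finished request on a queue and then raise a softirq so that the interrupt "bottom half" can handle it; this is the usual approach. The bottom half then processes each completed request on the queue in turn.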

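When discussing how SCSI commands are dispatched, we mentioned scsi_done. When the low-level driver initializes its hardware, it registers an interrupt handler, and that handler is called whenever the hardware raises an interrupt. If the interrupt is the response to a SCSI command, the handler locates the corresponding scsi_cmnd descriptor. Once the low-level driver has finished with the request, it calls the scsi_done function stored in that descriptor, handing the command back to the SCSI core.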

/**
 * scsi_done - Enqueue the finished SCSI command into the done queue.
 * @cmd: The SCSI Command for which a low-level device driver (LLDD) gives
 * ownership back to SCSI Core -- i.e. the LLDD has finished with it.
 *
 * Description: This function is the mid-level's (SCSI Core) interrupt routine,
 * which regains ownership of the SCSI command (de facto) from a LLDD, and
 * enqueues the command to the done queue for further processing.
 *
 * This is the producer of the done queue who enqueues at the tail.
 *
 * This function is interrupt context safe.
 */
static void scsi_done(struct scsi_cmnd *cmd)
{
    trace_scsi_dispatch_cmd_done(cmd);
    blk_complete_request(cmd->request);
}


/**
 * blk_complete_request - end I/O on a request
 * @req:      the request being processed
 *
 * Description:
 *     Ends all I/O on a request. It does not handle partial completions,
 *     unless the driver actually implements this in its completion callback
 *     through requeueing. The actual completion happens out-of-order,
 *     through a softirq handler. The user must have registered a completion
 *     callback through blk_queue_softirq_done().
 *
 *     Author's notes:
 *     If the FAIL_IO_TIMEOUT option was selected when the kernel was built,
 *     errors can be injected at request completion time. The Linux kernel
 *     contains a good deal of code for "injecting" errors; the idea is to
 *     simulate failures so that the handling of those failures can be checked.
 *     The completion path calls blk_mark_rq_complete to atomically set the
 *     REQ_ATOM_COMPLETE flag on the block-layer request, which prevents the
 *     error-recovery timer from trying to "grab" the same request concurrently.
 **/
void blk_complete_request(struct request *req)
{
    if (unlikely(blk_should_fake_timeout(req->q)))
        return;
    if (!blk_mark_rq_complete(req))
        __blk_complete_request(req);
}
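
The race that the REQ_ATOM_COMPLETE flag guards against can be illustrated in user space. The following is only a minimal sketch using C11 atomics, not kernel code: `struct fake_request`, `mark_complete`, `complete_path` and `timeout_path` are hypothetical names invented for the illustration. Whichever path sets the flag first owns the request, and the other path backs off.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct request: only what the sketch needs. */
struct fake_request {
    atomic_bool complete;   /* plays the role of REQ_ATOM_COMPLETE */
    int completions;        /* times the normal completion path won */
    int timeouts;           /* times the timeout path won */
};

static void fake_request_init(struct fake_request *rq)
{
    assert(rq != NULL);
    atomic_init(&rq->complete, false);
    rq->completions = 0;
    rq->timeouts = 0;
}

/*
 * Atomically set the flag and return its previous value, so a true
 * result means "somebody else already claimed this request".
 */
static bool mark_complete(struct fake_request *rq)
{
    return atomic_exchange(&rq->complete, true);
}

/* Normal completion: proceeds only if it is the first to claim the request. */
static void complete_path(struct fake_request *rq)
{
    if (!mark_complete(rq))
        rq->completions++;
}

/* Timeout handling: uses the same claim, so it can never run after completion. */
static void timeout_path(struct fake_request *rq)
{
    if (!mark_complete(rq))
        rq->timeouts++;
}
```

Calling `complete_path` and then `timeout_path` on the same request leaves `completions` at 1 and `timeouts` at 0, because the second caller sees the flag already set.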

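Generally, Linux softirqs follow the rule that whoever raises a softirq also runs it. There is one case to consider, though. On an SMP (symmetric multiprocessing) system, suppose a process running on one CPU issues a file read. The operation works its way down layer by layer until it reaches the block I/O layer and then the disk driver. Once the request is in the hardware, the CPU has no further say, so the reading process has to wait on a wait queue and goes to sleep. The disk hardware then carries out the operation and reports its progress through interrupts. Completion is signaled by an interrupt as well, but is the interrupted CPU the same one on which the reading process was running? There is no guarantee of that.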

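I/O completion runs in a softirq, and completing the operation amounts to waking the original process. If the CPU interrupted by the disk raises the completion softirq, then by the rule above that same CPU runs it. In effect this CPU wakes a process sleeping on a different CPU, and a cross-CPU wakeup is expensive: it involves migration, accounting, load balancing and other details.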

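If we simply remember the CPU on which the sleeping process originally ran, then when the softirq is raised after the hardware interrupt it can be routed to that remembered CPU. The net result is a softirq waking a process that sleeps on the current CPU, which is cheap.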

With this in mind, look at the following code:

void __blk_complete_request(struct request *req)
{
    int ccpu, cpu;
    struct request_queue *q = req->q;
    unsigned long flags;
    bool shared = false;

    BUG_ON(!q->softirq_done_fn);

    local_irq_save(flags);
    cpu = smp_processor_id();

    /*
     * Select completion CPU
     */
    if (req->cpu != -1) {
        ccpu = req->cpu;
        if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
            shared = cpus_share_cache(cpu, ccpu);
    } else
        ccpu = cpu;

    /*
     * If current CPU and requested CPU share a cache, run the softirq on
     * the current CPU. One might concern this is just like
     * QUEUE_FLAG_SAME_FORCE, but actually not. blk_complete_request() is
     * running in interrupt handler, and currently I/O controller doesn't
     * support multiple interrupts, so current CPU is unique actually. This
     * avoids IPI sending from current CPU to the first CPU of a group.
     */
    if (ccpu == cpu || shared) {
        struct list_head *list;
do_local:
        list = &__get_cpu_var(blk_cpu_done);
        list_add_tail(&req->csd.list, list);

        /*
         * if the list only contains our just added request,
         * signal a raise of the softirq. If there are already
         * entries there, someone already raised the irq but it
         * hasn't run yet.
         */
        if (list->next == &req->csd.list)
            raise_softirq_irqoff(BLOCK_SOFTIRQ);    /* raise the softirq; its handler is blk_done_softirq */
    } else if (raise_blk_irq(ccpu, req))
        goto do_local;

    local_irq_restore(flags);
}
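
The CPU selection above can be condensed into a pure function, which makes the individual cases easier to check. This is a sketch under stated assumptions, not the kernel's code: `pick_completion_cpu`, `struct completion_choice` and its fields are hypothetical names, cache sharing is passed in as a plain boolean instead of being queried, and the fallback to the local CPU when the cross-CPU notification fails is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the decision made in __blk_complete_request(). */
enum completion_target { RUN_LOCAL, RUN_REMOTE };

struct completion_choice {
    enum completion_target target;
    int cpu;    /* CPU whose done list receives the request */
};

/*
 * cur_cpu:      CPU that took the hardware interrupt
 * req_cpu:      CPU recorded in the request at submission, or -1 if none
 * same_force:   models QUEUE_FLAG_SAME_FORCE being set on the queue
 * shares_cache: whether cur_cpu and req_cpu share a cache
 */
static struct completion_choice
pick_completion_cpu(int cur_cpu, int req_cpu, bool same_force, bool shares_cache)
{
    struct completion_choice c = { RUN_LOCAL, cur_cpu };
    bool shared = false;
    int ccpu = cur_cpu;

    assert(cur_cpu >= 0);

    if (req_cpu != -1) {
        ccpu = req_cpu;
        /* Cache sharing only counts when strict affinity is not forced. */
        if (!same_force)
            shared = shares_cache;
    }

    /* Go remote only if the target differs and no shortcut applies. */
    if (ccpu != cur_cpu && !shared) {
        c.target = RUN_REMOTE;
        c.cpu = ccpu;
    }
    return c;
}
```

For example, `pick_completion_cpu(0, 2, false, true)` stays on CPU 0 because the two CPUs share a cache, while the same call with `same_force` set to true routes the completion to CPU 2.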

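The BLOCK_SOFTIRQ softirq is initialized in blk_softirq_init, which does the following:

1. Initializes one list per CPU to record completed requests

2. Registers the softirq

3. Registers a notifier whose main purpose is, when a CPU goes offline, to move the entries on its completed-request list to the current CPU's completed-request list and raise the softirq to process them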

static __init int blk_softirq_init(void)
{
    int i;

    for_each_possible_cpu(i)
        INIT_LIST_HEAD(&per_cpu(blk_cpu_done, i));

    open_softirq(BLOCK_SOFTIRQ, blk_done_softirq);
    register_hotcpu_notifier(&blk_cpu_notifier);
    return 0;
}
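
The third step, handing an offline CPU's completed requests to the current CPU, can be modeled in user space roughly as follows. This is a sketch with hypothetical names (`done_list`, `softirq_pending`, `cpu_dead`, a fixed `NR_CPUS` of 4) and a hand-written circular list; it appends the migrated entries at the tail for simplicity and does not try to reproduce the ordering or locking the kernel uses.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4   /* hypothetical CPU count for the sketch */

struct node { struct node *prev, *next; };

static struct node done_list[NR_CPUS];   /* stands in for per-CPU blk_cpu_done */
static int softirq_pending[NR_CPUS];     /* 1 if a softirq was "raised" on that CPU */

static void list_init(struct node *h) { h->prev = h->next = h; }
static int list_is_empty(const struct node *h) { return h->next == h; }

static void list_push_tail(struct node *n, struct node *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/* Move every entry of src to the tail of dst and leave src empty. */
static void list_move_all(struct node *src, struct node *dst)
{
    if (list_is_empty(src))
        return;
    src->next->prev = dst->prev;
    dst->prev->next = src->next;
    src->prev->next = dst;
    dst->prev = src->prev;
    list_init(src);
}

static size_t list_count(const struct node *h)
{
    size_t n = 0;
    for (const struct node *p = h->next; p != h; p = p->next)
        n++;
    return n;
}

/* Step 1 of the initialization: one empty list per CPU. */
static void done_lists_init(void)
{
    for (int i = 0; i < NR_CPUS; i++) {
        list_init(&done_list[i]);
        softirq_pending[i] = 0;
    }
}

/* Step 3: dead_cpu went offline, cur_cpu inherits its completed requests. */
static void cpu_dead(int dead_cpu, int cur_cpu)
{
    assert(dead_cpu != cur_cpu);
    list_move_all(&done_list[dead_cpu], &done_list[cur_cpu]);
    softirq_pending[cur_cpu] = 1;   /* ask cur_cpu to process the list */
}
```

After `cpu_dead(1, 0)`, the list of CPU 1 is empty and every request that was queued there is reachable from the list of CPU 0.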

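The softirq handler is shown below. It first moves every entry on the CPU's completed-request list to a local list. The point is to disturb the per-CPU list as little as possible during processing, so that newly completed requests can still be added to it. It then loops over the local list, removing each entry and passing it to the request queue's softirq completion callback.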

/*
 * Softirq action handler - move entries to local list and loop over them
 * while passing them to the queue registered handler.
 */
static void blk_done_softirq(struct softirq_action *h)
{
    struct list_head *cpu_list, local_list;

    local_irq_disable();
    cpu_list = &__get_cpu_var(blk_cpu_done);
    list_replace_init(cpu_list, &local_list);
    local_irq_enable();

    while (!list_empty(&local_list)) {
        struct request *rq;

        rq = list_entry(local_list.next, struct request, csd.list);
        list_del_init(&rq->csd.list);
        rq->q->softirq_done_fn(rq);
    }
}
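
The "detach everything, then process privately" pattern used by the handler is independent of the kernel and can be sketched with a plain singly linked queue. All names here (`struct done_queue`, `dq_add_tail`, `dq_drain`, `mark_handled`) are hypothetical, and the sketch is single-threaded: the comment marks where the kernel disables local interrupts, but no real synchronization is modeled.

```c
#include <assert.h>
#include <stddef.h>

struct item {
    struct item *next;
    int handled;
};

/* Hypothetical stand-in for one CPU's completed-request list. */
struct done_queue {
    struct item *head, *tail;
};

static void dq_init(struct done_queue *q)
{
    q->head = q->tail = NULL;
}

/* Producer side: append a finished item at the tail. */
static void dq_add_tail(struct done_queue *q, struct item *it)
{
    it->next = NULL;
    if (q->tail)
        q->tail->next = it;
    else
        q->head = it;
    q->tail = it;
}

/*
 * Consumer side: detach the whole chain in O(1), then walk the private
 * copy. Returns the number of items passed to done_fn.
 */
static int dq_drain(struct done_queue *q, void (*done_fn)(struct item *))
{
    struct item *local;
    int n = 0;

    assert(done_fn != NULL);

    /* Short critical section (interrupts would be off here in the kernel). */
    local = q->head;
    q->head = q->tail = NULL;

    /* The shared queue is already empty, so producers are not blocked. */
    while (local) {
        struct item *it = local;

        local = it->next;
        it->next = NULL;
        done_fn(it);
        n++;
    }
    return n;
}

static void mark_handled(struct item *it)
{
    it->handled = 1;
}
```

Because the shared queue is emptied before any callback runs, an item added while `dq_drain` is still walking its local chain simply waits for the next drain instead of interfering with the current one.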

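The softirq completion callback depends on the request queue. For SCSI devices it is set to scsi_softirq_done, and this happens when the request queue is allocated for the SCSI device; see scsi_alloc_queue.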

static void scsi_softirq_done(struct request *rq)
{
    struct scsi_cmnd *cmd = rq->special;
    unsigned long wait_for = (cmd->allowed + 1) * rq->timeout;
    int disposition;

    INIT_LIST_HEAD(&cmd->eh_entry);

    /*
     * First update the statistics counters of the owning SCSI device:
     * increment the completed-command counter iodone_cnt and, if an error
     * result was returned, the failed-command counter ioerr_cnt.
     */
    atomic_inc(&cmd->device->iodone_cnt);
    if (cmd->result)
        atomic_inc(&cmd->device->ioerr_cnt);

    /*
     * scsi_decide_disposition determines how to handle this command:
     * SUCCESS: finish via scsi_finish_command (analyzed below)
     * NEEDS_RETRY:
     * ADD_TO_MLQUEUE: both of these put the command back on the request
     *     queue; the former retries immediately, the latter after a delay
     * Any other value: call scsi_eh_scmd_add to enter error recovery. It
     *     returns 1 if the command entered error recovery, in which case
     *     nothing more needs to be done with it; if it returns 0, the only
     *     option left is to finish via scsi_finish_command
     */
    disposition = scsi_decide_disposition(cmd);
    if (disposition != SUCCESS &&
        time_before(cmd->jiffies_at_alloc + wait_for, jiffies)) {
        sdev_printk(KERN_ERR, cmd->device,
                "timing out command, waited %lus\n",
                wait_for/HZ);
        disposition = SUCCESS;
    }

    scsi_log_completion(cmd, disposition);

    switch (disposition) {
        case SUCCESS:
            scsi_finish_command(cmd);
            break;
        case NEEDS_RETRY:
            scsi_queue_insert(cmd, SCSI_MLQUEUE_EH_RETRY);
            break;
        case ADD_TO_MLQUEUE:
            scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
            break;
        default:
            if (!scsi_eh_scmd_add(cmd, 0))
                scsi_finish_command(cmd);
    }
}
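
The timeout check in the middle of this function compares tick counters that may wrap around. A wraparound-safe "is a before b" test is commonly written as a signed difference, and the sketch below shows that idea together with the `(allowed + 1) * timeout` budget from the listing. The names `tick_before` and `retry_budget_exhausted` are hypothetical, and the sketch assumes the usual two's-complement conversion from `unsigned long` to `long`.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* The signed-difference trick needs both types to have the same width. */
static_assert(sizeof(long) == sizeof(unsigned long),
              "long and unsigned long must have the same size");

/*
 * True if tick value a lies before b, even when the counter has wrapped,
 * as long as the two values are less than half the counter range apart.
 */
static bool tick_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/*
 * Models the check in scsi_softirq_done(): a command that was allocated at
 * alloc_tick may be retried for (allowed + 1) * timeout ticks in total.
 * Returns true once that budget lies in the past.
 */
static bool retry_budget_exhausted(unsigned long alloc_tick,
                                   unsigned long allowed,
                                   unsigned long timeout,
                                   unsigned long now)
{
    unsigned long wait_for = (allowed + 1) * timeout;

    return tick_before(alloc_tick + wait_for, now);
}
```

With `alloc_tick = 1000`, `allowed = 5` and `timeout = 30`, the budget ends at tick 1180, so the function returns false at tick 1100 and true at tick 1181. When it returns true, the listing above forces the disposition to SUCCESS so the command is finished instead of being retried again.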

scsi_finish_command

/**
 * scsi_finish_command - cleanup and pass command back to upper layer
 * @cmd: the command
 *
 * Description: Pass command off to upper layer for finishing of I/O
 *              request, waking processes that are waiting on results,
 *              etc.
 */
void scsi_finish_command(struct scsi_cmnd *cmd)
{
    struct scsi_device *sdev = cmd->device;
    struct scsi_target *starget = scsi_target(sdev);
    struct Scsi_Host *shost = sdev->host;
    struct scsi_driver *drv;
    unsigned int good_bytes;

    scsi_device_unbusy(sdev);

    /*
     * Clear the flags which say that the device/host is no longer
     * capable of accepting new commands.  These are set in scsi_queue.c
     * for both the queue full condition on a device, and for a
     * host full condition on the host.
     *
     * XXX(hch): What about locking?
     */
    shost->host_blocked = 0;
    starget->target_blocked = 0;
    sdev->device_blocked = 0;

    /*
     * If we have valid sense information, then some kind of recovery
     * must have taken place.  Make a note of this.
     */
    if (SCSI_SENSE_VALID(cmd))
        cmd->result |= (DRIVER_SENSE << 24);

    SCSI_LOG_MLCOMPLETE(4, sdev_printk(KERN_INFO, sdev,
                "Notifying upper driver of completion "
                "(result %x)\n", cmd->result));

    /*
     * Completion processing first needs the number of bytes SCSI has
     * completed successfully; scsi_bufflen reads this from the SCSI data
     * buffer. If the request did not come from the SCSI common service
     * layer (it is not REQ_TYPE_BLOCK_PC), it must have come from an upper
     * layer, which means the device handling it is bound to an upper-level
     * driver. If that driver defines a done callback, call it; for the SCSI
     * disk driver this is sd_done, which returns the adjusted number of
     * completed bytes. With the completed byte count in hand,
     * scsi_io_completion can be called.
     */
    good_bytes = scsi_bufflen(cmd);
    if (cmd->request->cmd_type != REQ_TYPE_BLOCK_PC) {
        int old_good_bytes = good_bytes;
        drv = scsi_cmd_to_driver(cmd);
        if (drv->done)
            good_bytes = drv->done(cmd);
        /*
         * USB may not give sense identifying bad sector and
         * simply return a residue instead, so subtract off the
         * residue if drv->done() error processing indicates no
         * change to the completion length.
         */
        if (good_bytes == old_good_bytes)
            good_bytes -= scsi_get_resid(cmd);
    }
    scsi_io_completion(cmd, good_bytes);
}
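
The arithmetic that produces good_bytes at the end of this function can be isolated into a small pure function. The following is a sketch under stated assumptions rather than kernel code: `completed_bytes` and its parameters are hypothetical, the upper-level driver's done callback is represented only by the value it would return, and the residue is assumed never to exceed the buffer length.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * bufflen:     length of the command's data buffer
 * passthrough: models a REQ_TYPE_BLOCK_PC request (no upper-level driver)
 * has_done:    whether the upper-level driver provides a done callback
 * done_result: byte count that callback would return (ignored if !has_done)
 * resid:       residue reported by the low-level driver
 */
static unsigned int completed_bytes(unsigned int bufflen, bool passthrough,
                                    bool has_done, unsigned int done_result,
                                    unsigned int resid)
{
    unsigned int good_bytes = bufflen;

    assert(resid <= bufflen);

    if (!passthrough) {
        unsigned int old_good_bytes = good_bytes;

        if (has_done)
            good_bytes = done_result;
        /*
         * If the callback left the length unchanged, fall back to the
         * residue to find out how much was really transferred.
         */
        if (good_bytes == old_good_bytes)
            good_bytes -= resid;
    }
    return good_bytes;
}
```

For a 4096-byte command with a 512-byte residue, the result is 3584 when the callback reports the full 4096 bytes, but stays at 2048 when the callback itself has already shortened the length to 2048.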