  1. 07 Apr 2014, 1 commit
    • blk-mq: fix potential stall during CPU unplug with IO pending · bccb5f7c
      Jens Axboe authored
      When a CPU is unplugged, we move the blk_mq_ctx request entries
      to the current queue. The current code forgets to remap the
      blk_mq_hw_ctx before marking the software context pending,
      which breaks if old-cpu and new-cpu don't map to the same
      hardware queue.
      
      Additionally, if we mark entries as pending in the new
      hardware queue, make sure we schedule that queue for running.
      Otherwise a request could sit there until someone else
      queues IO for that hardware queue.
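      
      A minimal sketch of the shape of the fix, based on the description
      above rather than the literal diff (helper names as in blk-mq of that
      era; treat them as illustrative):
      
        /* remap the software ctx to the hctx its surviving CPU belongs to ... */
        hctx = q->mq_ops->map_queue(q, ctx->cpu);
        /* ... then mark the software context pending in that hctx ... */
        blk_mq_hctx_mark_pending(hctx, ctx);
        /* ... and kick the queue so the moved requests actually run */
        blk_mq_run_hw_queue(hctx, true);
      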
      Signed-off-by: Jens Axboe <axboe@fb.com>
      bccb5f7c
  2. 21 Mar 2014, 7 commits
    • blk-mq: add REQ_SYNC early · 27fbf4e8
      Shaohua Li authored
      Add REQ_SYNC early, so rq_dispatched[] in blk_mq_rq_ctx_init
      is set correctly.
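      
      A hedged sketch of what "early" means here (call site illustrative):
      the sync bit is derived from the bio before the request context is
      initialized, so the rq_dispatched[] accounting sees it:
      
        int rw = bio_data_dir(bio);
      
        if (rw_is_sync(bio->bi_rw))
            rw |= REQ_SYNC;                    /* set before ctx init ... */
      
        blk_mq_rq_ctx_init(q, ctx, rq, rw);    /* ... so the stats are right */
      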
      
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      27fbf4e8
    • rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock · 55c816e3
      Mike Galbraith authored
      [  365.164040] BUG: sleeping function called from invalid context at kernel/rtmutex.c:674
      [  365.164041] in_atomic(): 1, irqs_disabled(): 1, pid: 26, name: migration/1
      [  365.164043] no locks held by migration/1/26.
      [  365.164044] irq event stamp: 6648
      [  365.164056] hardirqs last  enabled at (6647): [<ffffffff8153d377>] restore_args+0x0/0x30
      [  365.164062] hardirqs last disabled at (6648): [<ffffffff810ed98d>] multi_cpu_stop+0x9d/0x120
      [  365.164070] softirqs last  enabled at (0): [<ffffffff810543bc>] copy_process.part.28+0x6fc/0x1920
      [  365.164072] softirqs last disabled at (0): [<          (null)>]           (null)
      [  365.164076] CPU: 1 PID: 26 Comm: migration/1 Tainted: GF           N  3.12.12-rt19-0.gcb6c4a2-rt #3
      [  365.164078] Hardware name: QCI QSSC-S4R/QSSC-S4R, BIOS QSSC-S4R.QCI.01.00.S013.032920111005 03/29/2011
      [  365.164091]  0000000000000001 ffff880a42ea7c30 ffffffff815367e6 ffffffff81a086c0
      [  365.164099]  ffff880a42ea7c40 ffffffff8108919c ffff880a42ea7c60 ffffffff8153c24f
      [  365.164107]  ffff880a42ea91f0 00000000ffffffe1 ffff880a42ea7c88 ffffffff81297ec0
      [  365.164108] Call Trace:
      [  365.164119]  [<ffffffff810060b1>] try_stack_unwind+0x191/0x1a0
      [  365.164127]  [<ffffffff81004872>] dump_trace+0x92/0x360
      [  365.164133]  [<ffffffff81006108>] show_trace_log_lvl+0x48/0x60
      [  365.164138]  [<ffffffff81004c18>] show_stack_log_lvl+0xd8/0x1d0
      [  365.164143]  [<ffffffff81006160>] show_stack+0x20/0x50
      [  365.164153]  [<ffffffff815367e6>] dump_stack+0x54/0x9a
      [  365.164163]  [<ffffffff8108919c>] __might_sleep+0xfc/0x140
      [  365.164173]  [<ffffffff8153c24f>] rt_spin_lock+0x1f/0x70
      [  365.164182]  [<ffffffff81297ec0>] blk_mq_main_cpu_notify+0x20/0x70
      [  365.164191]  [<ffffffff81540a1c>] notifier_call_chain+0x4c/0x70
      [  365.164201]  [<ffffffff81083499>] __raw_notifier_call_chain+0x9/0x10
      [  365.164207]  [<ffffffff810567be>] cpu_notify+0x1e/0x40
      [  365.164217]  [<ffffffff81525da2>] take_cpu_down+0x22/0x40
      [  365.164223]  [<ffffffff810ed9c6>] multi_cpu_stop+0xd6/0x120
      [  365.164229]  [<ffffffff810edd97>] cpu_stopper_thread+0xd7/0x1e0
      [  365.164235]  [<ffffffff810863a3>] smpboot_thread_fn+0x203/0x380
      [  365.164241]  [<ffffffff8107cbf8>] kthread+0xc8/0xd0
      [  365.164250]  [<ffffffff8154440c>] ret_from_fork+0x7c/0xb0
      [  365.164429] smpboot: CPU 1 is now offline
      Signed-off-by: Mike Galbraith <bitbucket@online.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      55c816e3
    • blk-mq: support partial I/O completions · 7237c740
      Christoph Hellwig authored
      Add a new blk_mq_end_io_partial function to partially complete requests
      as needed by the SCSI layer.  We do this by reusing blk_update_request
      to advance the bio instead of having a simplified version of it in
      the blk-mq code.
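      
      A hedged sketch of the new helper (the finish-side call is a
      hypothetical stand-in for the accounting/free tail of blk_mq_end_io):
      
        void blk_mq_end_io_partial(struct request *rq, int error,
                                   unsigned int nr_bytes)
        {
            /* blk_update_request() advances the bio; true means bytes remain */
            if (blk_update_request(rq, error, nr_bytes))
                return;                          /* partial: rq stays alive */
            blk_mq_finish_request(rq, error);    /* hypothetical: account + free */
        }
      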
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      7237c740
    • blk-mq: merge blk_mq_insert_request and blk_mq_run_request · eeabc850
      Christoph Hellwig authored
      It's almost identical to blk_mq_insert_request, so fold the two into one
      slightly more generic function by making the flush special case a bit
      smarter.
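      
      Roughly what the merged function ends up looking like (a hedged
      sketch; parameter names are illustrative):
      
        void blk_mq_insert_request(struct request *rq, bool at_head,
                                   bool run_queue, bool async)
        {
            struct request_queue *q = rq->q;
            struct blk_mq_ctx *ctx = rq->mq_ctx;
            struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q, ctx->cpu);
      
            if (rq->cmd_flags & (REQ_FLUSH | REQ_FUA))
                blk_insert_flush(rq);            /* the flush special case */
            else
                __blk_mq_insert_request(hctx, rq, at_head);
      
            if (run_queue)
                blk_mq_run_hw_queue(hctx, async);
        }
      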
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      eeabc850
    • blk-mq: remove blk_mq_alloc_rq · 081241e5
      Christoph Hellwig authored
      There's only one caller, which is a straight wrapper and fits the naming
      scheme of the related functions a lot better.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      081241e5
    • block: free q->flush_rq in blk_init_allocated_queue error paths · 708f04d2
      Dave Jones authored
      Commit 7982e90c ("block: fix q->flush_rq NULL pointer crash on
      dm-mpath flush") moved an allocation to blk_init_allocated_queue(), but
      neglected to free that allocation on the error paths that follow.
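      
      The shape of the fix, sketched from the description (hedged; not the
      exact diff):
      
        q->flush_rq = kzalloc(sizeof(struct request), GFP_KERNEL);
        if (!q->flush_rq)
            return NULL;
      
        if (blk_init_rl(&q->root_rl, q, GFP_KERNEL))
            goto fail;          /* was: return NULL, leaking flush_rq */
        /* ... remaining init, taking the same goto on failure ... */
      
      fail:
        kfree(q->flush_rq);
        return NULL;
      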
      Signed-off-by: Dave Jones <davej@fedoraproject.org>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      708f04d2
    • blk-mq: don't dump CPU -> hw queue map on driver load · 676141e4
      Jens Axboe authored
      Now that we are out of initial debug/bringup mode, remove
      the verbose dump of the mapping table.
      
      Provide the mapping table in sysfs, under the hardware queue
      directory, in the cpu_list file.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      676141e4
  3. 20 Mar 2014, 1 commit
    • blk-mq: fix wrong usage of hctx->state vs hctx->flags · 5d12f905
      Jens Axboe authored
      BLK_MQ_F_* flags are for hctx->flags, and are non-atomic and
      set at registration time. BLK_MQ_S_* flags are dynamic and
      atomic, and are accessed through hctx->state.
      
      Some of the BLK_MQ_S_STOPPED uses were wrong. Additionally,
      the header file should not use a bit shift for the _S_ flags,
      as they are manipulated through the set/test_bit functions.
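      
      The distinction, illustrated (hedged; the values are examples, not the
      exact header contents):
      
        enum {
            BLK_MQ_F_SHOULD_MERGE = 1 << 0,  /* hctx->flags: plain bitmask,
                                                set once at registration */
        };
        enum {
            BLK_MQ_S_STOPPED = 0,            /* hctx->state: a bit *number*,
                                                used with set/test_bit() */
        };
      
        if (hctx->flags & BLK_MQ_F_SHOULD_MERGE)       /* non-atomic read */
            try_merge = true;
        if (test_bit(BLK_MQ_S_STOPPED, &hctx->state))  /* atomic test */
            return;
      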
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5d12f905
  4. 19 Mar 2014, 1 commit
    • cgroup: drop const from @buffer of cftype->write_string() · 4d3bb511
      Tejun Heo authored
      cftype->write_string() just passes on the writeable buffer from kernfs,
      and there's no reason to add a const restriction on the buffer.  The
      only thing const achieves is unnecessarily complicating parsing of the
      buffer.  Drop const from @buffer.
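      
      The resulting callback signature (hedged before/after sketch):
      
        /* before */
        int (*write_string)(struct cgroup_subsys_state *css, struct cftype *cft,
                            const char *buffer);
        /* after: the kernfs buffer is writeable, so parsers may modify it */
        int (*write_string)(struct cgroup_subsys_state *css, struct cftype *cft,
                            char *buffer);
      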
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      4d3bb511
  5. 15 Mar 2014, 2 commits
  6. 13 Mar 2014, 1 commit
    • block: remove old blk_iopoll_enabled variable · 89f8b33c
      Jens Axboe authored
      This was a debugging measure to toggle enabled/disabled
      when testing. But for real production setups, it's not
      safe to toggle this setting without either reloading
      drivers or quiescing IO first, neither of which the toggle
      enforces.
      
      Additionally, it makes drivers deal with the conditional
      state.
      
      Remove it completely. It's up to the driver whether iopoll
      is enabled or not.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      89f8b33c
  7. 09 Mar 2014, 2 commits
    • block: change flush sequence list addition back to front add · 10beafc1
      Mike Snitzer authored
      Commit 18741986 inadvertently changed the rq flush insertion
      from a head to a tail insertion. Fix that back up.
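      
      Sketched from the description (hedged), the fix restores the front
      add in the flush sequencing path:
      
        /* flush requests must be dispatched next: insert at the head */
        list_add(&rq->queuelist, &q->queue_head);   /* was: list_add_tail() */
      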
      Signed-off-by: Mike Snitzer <msnitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      10beafc1
    • block: fix q->flush_rq NULL pointer crash on dm-mpath flush · 7982e90c
      Mike Snitzer authored
      Commit 18741986 ("blk-mq: rework flush sequencing logic") switched
      ->flush_rq from being an embedded member of the request_queue structure
      to being dynamically allocated in blk_init_queue_node().
      
      Request-based DM multipath doesn't use blk_init_queue_node(); instead it
      uses blk_alloc_queue_node() + blk_init_allocated_queue().  Because
      commit 18741986 placed the dynamic allocation of ->flush_rq in
      blk_init_queue_node(), any flush issued to a dm-mpath device would crash
      with a NULL pointer, e.g.:
      
      BUG: unable to handle kernel NULL pointer dereference at           (null)
      IP: [<ffffffff8125037e>] blk_rq_init+0x1e/0xb0
      PGD bb3c7067 PUD bb01d067 PMD 0
      Oops: 0002 [#1] SMP
      ...
      CPU: 5 PID: 5028 Comm: dt Tainted: G        W  O 3.14.0-rc3.snitm+ #10
      ...
      task: ffff88032fb270e0 ti: ffff880079564000 task.ti: ffff880079564000
      RIP: 0010:[<ffffffff8125037e>]  [<ffffffff8125037e>] blk_rq_init+0x1e/0xb0
      RSP: 0018:ffff880079565c98  EFLAGS: 00010046
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000030
      RDX: ffff880260c74048 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff880079565ca8 R08: ffff880260aa1e98 R09: 0000000000000001
      R10: ffff88032fa78500 R11: 0000000000000246 R12: 0000000000000000
      R13: ffff880260aa1de8 R14: 0000000000000650 R15: 0000000000000000
      FS:  00007f8d36a2a700(0000) GS:ffff88033fca0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 0000000079b36000 CR4: 00000000000007e0
      Stack:
       0000000000000000 ffff880260c74048 ffff880079565cd8 ffffffff81257a47
       ffff880260aa1de8 ffff880260c74048 0000000000000001 0000000000000000
       ffff880079565d08 ffffffff81257c2d 0000000000000000 ffff880260aa1de8
      Call Trace:
       [<ffffffff81257a47>] blk_flush_complete_seq+0x2d7/0x2e0
       [<ffffffff81257c2d>] blk_insert_flush+0x1dd/0x210
       [<ffffffff8124ec59>] __elv_add_request+0x1f9/0x320
       [<ffffffff81250681>] ? blk_account_io_start+0x111/0x190
       [<ffffffff81253a4b>] blk_queue_bio+0x25b/0x330
       [<ffffffffa0020bf5>] dm_request+0x35/0x40 [dm_mod]
       [<ffffffff812530c0>] generic_make_request+0xc0/0x100
       [<ffffffff81253173>] submit_bio+0x73/0x140
       [<ffffffff811becdd>] submit_bio_wait+0x5d/0x80
       [<ffffffff81257528>] blkdev_issue_flush+0x78/0xa0
       [<ffffffff811c1f6f>] blkdev_fsync+0x3f/0x60
       [<ffffffff811b7fde>] vfs_fsync_range+0x1e/0x20
       [<ffffffff811b7ffc>] vfs_fsync+0x1c/0x20
       [<ffffffff811b81f1>] do_fsync+0x41/0x80
       [<ffffffff8118874e>] ? SyS_lseek+0x7e/0x80
       [<ffffffff811b8260>] SyS_fsync+0x10/0x20
       [<ffffffff8154c2d2>] system_call_fastpath+0x16/0x1b
      
      Fix this by moving the ->flush_rq allocation from blk_init_queue_node()
      to blk_init_allocated_queue().  blk_init_queue_node() also calls
      blk_init_allocated_queue(), so this change is functionally equivalent
      for all blk_init_queue_node() callers.
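      
      Sketched (hedged), the allocation now lives in the function both
      paths share:
      
        struct request_queue *
        blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
                                 spinlock_t *lock)
        {
            if (!q)
                return NULL;
      
            q->flush_rq = kzalloc(sizeof(struct request), GFP_KERNEL);
            if (!q->flush_rq)
                return NULL;
            /* ... rest of the shared initialization ... */
        }
      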
      Reported-by: Hannes Reinecke <hare@suse.de>
      Reported-by: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      7982e90c
  8. 07 Mar 2014, 1 commit
  9. 06 Mar 2014, 1 commit
    • blktrace: fix accounting of partially completed requests · af5040da
      Roman Pen authored
      trace_block_rq_complete does not take into account that a request can
      be partially completed, so we can get the following incorrect output
      from blkparse:
      
        C   R 232 + 240 [0]
        C   R 240 + 232 [0]
        C   R 248 + 224 [0]
        C   R 256 + 216 [0]
      
      but should be:
      
        C   R 232 + 8 [0]
        C   R 240 + 8 [0]
        C   R 248 + 8 [0]
        C   R 256 + 8 [0]
      
      Also, the overall summary statistics for completed requests and the
      final throughput will be incorrect.
      
      This patch takes the real completion size of the request into account
      and fixes the wrong completion accounting.
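      
      A hedged sketch of the accounting change: the completion tracepoint is
      handed the number of bytes actually completed, instead of assuming the
      whole request finished:
      
        static void blk_add_trace_rq_complete(void *ignore,
                                              struct request_queue *q,
                                              struct request *rq,
                                              unsigned int nr_bytes)
        {
            /* trace nr_bytes, not blk_rq_bytes(rq) */
            blk_add_trace_rq(q, rq, nr_bytes, BLK_TA_COMPLETE);
        }
      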
      Signed-off-by: Roman Pen <r.peniaev@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
      af5040da
  10. 04 Mar 2014, 1 commit
    • rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock · 2a26ebef
      Mike Galbraith authored
      [  365.164040] BUG: sleeping function called from invalid context at kernel/rtmutex.c:674
      [  365.164041] in_atomic(): 1, irqs_disabled(): 1, pid: 26, name: migration/1
      [  365.164043] no locks held by migration/1/26.
      [  365.164044] irq event stamp: 6648
      [  365.164056] hardirqs last  enabled at (6647): [<ffffffff8153d377>] restore_args+0x0/0x30
      [  365.164062] hardirqs last disabled at (6648): [<ffffffff810ed98d>] multi_cpu_stop+0x9d/0x120
      [  365.164070] softirqs last  enabled at (0): [<ffffffff810543bc>] copy_process.part.28+0x6fc/0x1920
      [  365.164072] softirqs last disabled at (0): [<          (null)>]           (null)
      [  365.164076] CPU: 1 PID: 26 Comm: migration/1 Tainted: GF           N  3.12.12-rt19-0.gcb6c4a2-rt #3
      [  365.164078] Hardware name: QCI QSSC-S4R/QSSC-S4R, BIOS QSSC-S4R.QCI.01.00.S013.032920111005 03/29/2011
      [  365.164091]  0000000000000001 ffff880a42ea7c30 ffffffff815367e6 ffffffff81a086c0
      [  365.164099]  ffff880a42ea7c40 ffffffff8108919c ffff880a42ea7c60 ffffffff8153c24f
      [  365.164107]  ffff880a42ea91f0 00000000ffffffe1 ffff880a42ea7c88 ffffffff81297ec0
      [  365.164108] Call Trace:
      [  365.164119]  [<ffffffff810060b1>] try_stack_unwind+0x191/0x1a0
      [  365.164127]  [<ffffffff81004872>] dump_trace+0x92/0x360
      [  365.164133]  [<ffffffff81006108>] show_trace_log_lvl+0x48/0x60
      [  365.164138]  [<ffffffff81004c18>] show_stack_log_lvl+0xd8/0x1d0
      [  365.164143]  [<ffffffff81006160>] show_stack+0x20/0x50
      [  365.164153]  [<ffffffff815367e6>] dump_stack+0x54/0x9a
      [  365.164163]  [<ffffffff8108919c>] __might_sleep+0xfc/0x140
      [  365.164173]  [<ffffffff8153c24f>] rt_spin_lock+0x1f/0x70
      [  365.164182]  [<ffffffff81297ec0>] blk_mq_main_cpu_notify+0x20/0x70
      [  365.164191]  [<ffffffff81540a1c>] notifier_call_chain+0x4c/0x70
      [  365.164201]  [<ffffffff81083499>] __raw_notifier_call_chain+0x9/0x10
      [  365.164207]  [<ffffffff810567be>] cpu_notify+0x1e/0x40
      [  365.164217]  [<ffffffff81525da2>] take_cpu_down+0x22/0x40
      [  365.164223]  [<ffffffff810ed9c6>] multi_cpu_stop+0xd6/0x120
      [  365.164229]  [<ffffffff810edd97>] cpu_stopper_thread+0xd7/0x1e0
      [  365.164235]  [<ffffffff810863a3>] smpboot_thread_fn+0x203/0x380
      [  365.164241]  [<ffffffff8107cbf8>] kthread+0xc8/0xd0
      [  365.164250]  [<ffffffff8154440c>] ret_from_fork+0x7c/0xb0
      [  365.164429] smpboot: CPU 1 is now offline
      Signed-off-by: Mike Galbraith <bitbucket@online.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2a26ebef
  11. 25 Feb 2014, 4 commits
    • smp: Rename __smp_call_function_single() to smp_call_function_single_async() · c46fff2a
      Frederic Weisbecker authored
      The name __smp_call_function_single() doesn't tell much about the
      properties of this function, especially when compared to
      smp_call_function_single().
      
      The comments above the implementation are also misleading. The main
      point of this function is actually not to be able to embed the csd
      in an object. This is actually a requirement that results from the
      purpose of this function, which is to raise an IPI asynchronously.
      
      As such it can be called with interrupts disabled. And this feature
      comes at the cost of the caller, who then needs to serialize the
      IPIs on this csd.
      
      Let's rename the function and enhance the comments so that they reflect
      these properties.
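      
      Usage, per the properties described above (a hedged sketch; the
      handler and object names are hypothetical):
      
        static void my_ipi_handler(void *info)  /* runs on the target CPU */
        {
            struct my_obj *obj = info;
            /* ... */
        }
      
        obj->csd.func = my_ipi_handler;
        obj->csd.info = obj;
        /* safe with irqs disabled; returns without waiting, and the caller
         * must not reuse obj->csd until this IPI has completed */
        smp_call_function_single_async(cpu, &obj->csd);
      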
      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c46fff2a
    • smp: Remove wait argument from __smp_call_function_single() · fce8ad15
      Frederic Weisbecker authored
      The main point of calling __smp_call_function_single() is to send
      an IPI in a purely asynchronous way. By embedding a csd in an object,
      a caller can send the IPI without waiting for a previous one to complete,
      as is required by smp_call_function_single() for example. As such,
      sending this kind of IPI can be safe even when irqs are disabled.
      
      This flexibility comes at the expense of the caller, who then needs to
      synchronize the csd lifecycle itself and make sure that IPIs on a
      single csd are serialized.
      
      This is how __smp_call_function_single() works when wait = 0, and this
      usecase is relevant.
      
      Now there doesn't seem to be any usecase with wait = 1 that can't be
      covered by smp_call_function_single() instead, which is safer. Let's
      look at the two possible scenarios:
      
      1) The user calls __smp_call_function_single(wait = 1) on a csd embedded
         in an object. It looks like a nice and convenient pattern at first
         sight because we can then retrieve the object from the IPI handler easily.
      
         But actually it is a waste of memory space in the object since the csd
         can be allocated from the stack by smp_call_function_single(wait = 1)
         and the object can be passed as the IPI argument.
      
         Besides that, embedding the csd in an object is more error prone
         because the caller must take care of the serialization of the IPIs
         for this csd.
      
      2) The user calls __smp_call_function_single(wait = 1) on a csd that
         is allocated on the stack. It's ok but smp_call_function_single()
         can do it as well and it already takes care of the allocation on the
         stack. Again it's simpler and less error prone.
      
      Therefore, using the underscore-prefixed API version with wait = 1
      is a bad pattern and a sign that the caller can do something safer
      and simpler.
      
      There was a single user of it, which has just been converted.
      So let's remove this option to discourage further users.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      fce8ad15
    • block: Stop abusing rq->csd.list in blk-softirq · 6d113398
      Jan Kara authored
      Abusing rq->csd.list for a list of requests to complete is rather ugly.
      Use rq->queuelist instead, which is much cleaner. It is safe because
      queuelist is used by the block layer only for requests waiting to be
      submitted to a device; thus it is unused once the IRQ reports that the
      request's IO is finished.
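      
      A hedged sketch of the result: completed requests now ride on
      rq->queuelist between the hardirq handler and the softirq:
      
        /* hardirq side: queue the request for completion on this CPU */
        list_add_tail(&rq->queuelist, this_cpu_ptr(&blk_cpu_done));
        raise_softirq_irqoff(BLOCK_SOFTIRQ);
      
        /* softirq side: drain the spliced local list */
        while (!list_empty(&local_list)) {
            struct request *rq = list_entry(local_list.next,
                                            struct request, queuelist);
            list_del_init(&rq->queuelist);
            rq->q->softirq_done_fn(rq);
        }
      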
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      6d113398
    • block: Stop abusing csd.list for fifo_time · 8b4922d3
      Jan Kara authored
      The block layer currently abuses rq->csd.list.next for storing fifo_time.
      That is a terrible hack, and completely unnecessary as well: a union
      achieves the same space saving in a cleaner way.
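      
      The union in question (hedged sketch of the struct request change):
      
        struct request {
            struct list_head queuelist;
            /* csd is only needed while a completion IPI is in flight;
             * fifo_time only while the request sits on the IO scheduler's
             * fifo. The lifetimes don't overlap, so a union is safe. */
            union {
                struct call_single_data csd;
                unsigned long fifo_time;
            };
            /* ... */
        };
      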
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jens Axboe <axboe@fb.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      8b4922d3
  12. 22 Feb 2014, 3 commits
  13. 19 Feb 2014, 3 commits
  14. 13 Feb 2014, 2 commits
    • cgroup: drop @skip_css from cgroup_taskset_for_each() · 924f0d9a
      Tejun Heo authored
      If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching
      css.  The intention of the interface is to make it easy to skip css's
      (cgroup_subsys_states) which already match the migration target;
      however, this is entirely unnecessary as migration taskset doesn't
      include tasks which are already in the target cgroup.  Drop @skip_css
      from cgroup_taskset_for_each().
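      
      Call sites shrink accordingly (hedged before/after sketch):
      
        /* before */
        cgroup_taskset_for_each(task, skip_css, tset) { /* ... */ }
        /* after: the taskset never contains already-migrated tasks */
        cgroup_taskset_for_each(task, tset) { /* ... */ }
      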
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      924f0d9a
    • block: add cond_resched() to potentially long running ioctl discard loop · c8123f8c
      Jens Axboe authored
      When mkfs issues a full device discard and the device only
      supports discards of a smallish size, we can loop in
      blkdev_issue_discard() for a long time. If preempt isn't enabled,
      this can turn into a softlockup situation and the kernel will
      start complaining.
      
      Add an explicit cond_resched() at the end of the loop to avoid
      that.
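      
      The shape of the loop after the fix (hedged sketch of
      blkdev_issue_discard()):
      
        while (nr_sects) {
            /* ... build and submit one max-granularity discard bio ... */
      
            /* a large device with a small discard size can keep us here
             * for a long time; yield on non-preempt kernels */
            cond_resched();
        }
      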
      
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c8123f8c
  15. 12 Feb 2014, 4 commits
    • cgroup: remove cgroup->name · e61734c5
      Tejun Heo authored
      cgroup->name handling became quite complicated over time, involving a
      dedicated struct cgroup_name for RCU protection.  Now that cgroup is
      on kernfs, we can drop all of it and simply use kernfs_name/path() and
      friends.  Replace cgroup->name and all related code with kernfs
      name/path constructs.
      
      * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top
        of kernfs counterparts, which involves semantic changes.
        pr_cont_cgroup_name() and pr_cont_cgroup_path() added.
      
      * cgroup->name handling dropped from cgroup_rename().
      
      * All users of cgroup_name/path() updated to the new semantics.  Users
        which were formatting the string just to printk them are converted
        to use pr_cont_cgroup_name/path() instead, which simplifies things
        quite a bit.  As cgroup_name() no longer requires RCU read lock
        around it, RCU lockings which were protecting only cgroup_name() are
        removed.
      
      v2: Comment above oom_info_lock updated as suggested by Michal.
      
      v3: dummy_top doesn't have a kn associated and
          pr_cont_cgroup_name/path() ended up calling the matching kernfs
          functions with NULL kn leading to oops.  Test for NULL kn and
          print "/" if so.  This issue was reported by Fengguang Wu.
      
      v4: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      e61734c5
    • cgroup: update the meaning of cftype->max_write_len · 5f469907
      Tejun Heo authored
      cftype->max_write_len is used to extend the maximum size of writes.
      It's interpreted in such a way that the actual maximum size is one
      less than the specified value.  The default size is defined by
      CGROUP_LOCAL_BUFFER_SIZE.  Its interpretation is quite confusing - its
      value is decremented by 1 and then compared for equality with max
      size, which means that the actual default size is
      CGROUP_LOCAL_BUFFER_SIZE - 2, which is 62 chars.
      
      There's no point in having a limit that low.  Update its definition so
      that it means the actual string length sans termination and anything
      below PAGE_SIZE-1 is treated as PAGE_SIZE-1.
      
      .max_write_len for "release_agent" is updated to PATH_MAX-1 and
      cgroup_release_agent_write() is updated so that the redundant strlen()
      check is removed and it uses strlcpy() instead of strcpy().
      .max_write_len initializations in blk-throttle.c and cfq-iosched.c are
      no longer necessary and removed.  The one in cpuset is kept unchanged
      as it's an approximated value to begin with.
      
      This will also make transition to kernfs smoother.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      5f469907
    • blk-mq: pair blk_mq_start_request / blk_mq_requeue_request · 49f5baa5
      Christoph Hellwig authored
      Make sure we have a proper pairing between starting and requeueing
      requests.  Move the dma drain and REQ_END setup into blk_mq_start_request,
      and make sure blk_mq_requeue_request properly undoes them, giving us
      a pair of functions to prepare and unprepare a request without leaving
      side effects.
      
      Together this ensures we always clean up properly after
      BLK_MQ_RQ_QUEUE_BUSY returns from ->queue_rq.
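      
      A hedged sketch of the pairing (close to the description, not the
      literal diff):
      
        static void blk_mq_start_request(struct request *rq, bool last)
        {
            struct request_queue *q = rq->q;
      
            /* prepare: account the DMA drain segment, mark the queue end */
            if (q->dma_drain_size && blk_rq_bytes(rq))
                rq->nr_phys_segments++;
            if (last)
                rq->cmd_flags |= REQ_END;
        }
      
        static void blk_mq_requeue_request(struct request *rq)
        {
            struct request_queue *q = rq->q;
      
            /* unprepare: undo exactly what blk_mq_start_request() did */
            rq->cmd_flags &= ~REQ_END;
            if (q->dma_drain_size && blk_rq_bytes(rq))
                rq->nr_phys_segments--;
        }
      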
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      49f5baa5
    • blk-mq: dont assume rq->errors is set when returning an error from ->queue_rq · 1e93b8c2
      Christoph Hellwig authored
      rq->errors never has been part of the communication protocol between drivers
      and the block stack and most drivers will not have initialized it.
      
      Return -EIO to upper layers when the driver returns BLK_MQ_RQ_QUEUE_ERROR
      unconditionally.  If a driver wants to return a different error it can
      easily do so by returning success after calling blk_mq_end_io itself.
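      
      In the dispatch loop this looks roughly like (hedged sketch):
      
        switch (ret) {
        case BLK_MQ_RQ_QUEUE_OK:
            break;
        case BLK_MQ_RQ_QUEUE_BUSY:
            blk_mq_requeue_request(rq);      /* try again later */
            break;
        default:                             /* BLK_MQ_RQ_QUEUE_ERROR */
            rq->errors = -EIO;               /* ignore driver-set rq->errors */
            blk_mq_end_io(rq, rq->errors);
            break;
        }
      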
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      1e93b8c2
  16. 11 Feb 2014, 3 commits
    • block: Fix type mismatch in ssize_t_blk_mq_tag_sysfs_show · 11c94444
      Masanari Iida authored
      cppcheck detected the following format string mismatch.
      [blk-mq-tag.c:201]: (warning) %u in format string (no. 1) requires
      'unsigned int' but the argument type is 'int'.
      
      Change "cpu" from int to unsigned int, because the cpu
      number can never be negative.
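      
      The change itself is essentially a one-line declaration fix (hedged
      sketch):
      
        unsigned int cpu;   /* was: int cpu; now matches the %u specifier */
      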
      Signed-off-by: Masanari Iida <standby24x7@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      11c94444
    • blk-mq: rework flush sequencing logic · 18741986
      Christoph Hellwig authored
      Switch to using a preallocated flush_rq for blk-mq, similar to what's done
      with the old request path.  This allows us to set up the request properly
      with a tag from the actually allowed range and ->rq_disk as needed by
      some drivers.  To make life easier we also switch to dynamic allocation
      of ->flush_rq for the old path.
      
      This effectively reverts most of
      
          "blk-mq: fix for flush deadlock"
      
      and
      
          "blk-mq: Don't reserve a tag for flush request"
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      18741986
    • blk-mq: rework I/O completions · 30a91cb4
      Christoph Hellwig authored
      Rework I/O completions to work more like the old code path.  blk_mq_end_io
      now stays out of the business of deferring completions to other CPUs
      and calling blk_mark_rq_complete.  The latter is very important to allow
      completing requests that have timed out and thus are already marked completed;
      the former allows using the IPI callout even for driver-specific completions
      instead of having to reimplement them.
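      
      A hedged sketch of the resulting entry point:
      
        void blk_mq_complete_request(struct request *rq)
        {
            /* lose the race with timeout handling and this is a no-op */
            if (!blk_mark_rq_complete(rq))
                __blk_mq_complete_request(rq);  /* may IPI the submitting CPU */
        }
      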
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      30a91cb4
  17. 08 Feb 2014, 3 commits
    • cgroup: clean up cgroup_subsys names and initialization · 073219e9
      Tejun Heo authored
      cgroup_subsys is a bit messier than it needs to be.
      
      * The name of a subsys can be different from its internal identifier
        defined in cgroup_subsys.h.  Most subsystems use the matching name
        but three - cpu, memory and perf_event - use different ones.
      
      * cgroup_subsys_id enums are postfixed with _subsys_id and each
        cgroup_subsys is postfixed with _subsys.  cgroup.h is widely
        included throughout various subsystems; it doesn't and shouldn't
        have claim on such generic names which don't have any qualifier
        indicating that they belong to cgroup.
      
      * cgroup_subsys->subsys_id should always equal the matching
        cgroup_subsys_id enum; however, we require each controller to
        initialize it and then BUG if they don't match, which is a bit
        silly.
      
      This patch cleans up cgroup_subsys names and initialization by doing
      the followings.
      
      * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
        cgroup_subsys with _cgrp_subsys.
      
      * With the above, renaming subsys identifiers to match the userland
        visible names doesn't cause any naming conflicts.  All non-matching
        identifiers are renamed to match the official names.
      
        cpu_cgroup -> cpu
        mem_cgroup -> memory
        perf -> perf_event
      
      * controllers no longer need to initialize ->subsys_id and ->name.
        They're generated in cgroup core and set automatically during boot.
      
      * Redundant cgroup_subsys declarations removed.
      
      * While updating BUG_ON()s in cgroup_init_early(), convert them to
        WARN()s.  BUGging that early during boot is stupid - the kernel
        can't print anything, even through serial console and the trap
        handler doesn't even link stack frame properly for back-tracing.
      
      This patch doesn't introduce any behavior changes.
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Aristeu Rozanski <aris@redhat.com>
      Acked-by: Ingo Molnar <mingo@redhat.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      073219e9
    • cgroup: drop module support · 3ed80a62
      Tejun Heo authored
      With module support dropped from net_prio, no controller is using
      cgroup module support.  None of the actual resource controllers can be
      built as a module, and we aren't going to add new controllers which
      don't control resources.  This patch drops module support from cgroup.
      
      * cgroup_[un]load_subsys() and cgroup_subsys->module removed.
      
      * As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(),
        cgroup_subsys.h now uses IS_ENABLED() directly.
      
      * enum cgroup_subsys_id now exactly matches the list of enabled
        controllers as ordered in cgroup_subsys.h.
      
      * cgroup_subsys[] is now a contiguously occupied array.  Size
        specification is no longer necessary and dropped.
      
      * for_each_builtin_subsys() is removed and for_each_subsys() is
        updated to not require any locking.
      
      * module ref handling is removed from rebind_subsystems().
      
      * Module related comments dropped.
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      
      v3: Added {} around the if (need_forkexit_callback) block in
          cgroup_post_fork() for readability as suggested by Li.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      3ed80a62
    • block: Explicitly handle discard/write same segments · 5cb8850c
      Kent Overstreet authored
      Immutable biovecs changed the way biovecs are interpreted: drivers no
      longer use bi_vcnt; they have to go by bi_iter.bi_size (to allow for
      using part of an existing segment without modifying it).
      
      This breaks with discards and write_same bios, since for those bi_size
      has nothing to do with segments in the biovec. So for now, we need a
      fairly gross hack - we fortunately know that there will never be more
      than one segment for the entire request, so we can special case
      discard/write_same.
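      
      A hedged sketch of the special case in the segment-counting path:
      
        /* discard/write-same bios carry at most one segment, regardless of
         * what bi_iter.bi_size suggests */
        if (bio->bi_rw & REQ_DISCARD)
            return 1;
        if (bio->bi_rw & REQ_WRITE_SAME)
            return 1;
        /* otherwise fall through to normal bvec-based segment counting */
      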
      Signed-off-by: Kent Overstreet <kmo@daterainc.com>
      Tested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5cb8850c