1. 29 5月, 2014 1 次提交
  2. 28 5月, 2014 1 次提交
  3. 27 5月, 2014 1 次提交
  4. 21 5月, 2014 2 次提交
  5. 10 5月, 2014 1 次提交
  6. 17 4月, 2014 2 次提交
  7. 16 4月, 2014 1 次提交
    • J
      block: remove struct request buffer member · b4f42e28
      Jens Axboe 提交于
      This was used in the olden days, back when onions were proper
      yellow. Basically it mapped to the current buffer to be
      transferred. With highmem being added more than a decade ago,
      most drivers map pages out of a bio, and rq->buffer isn't
      pointing at anything valid.
      
      Convert old style drivers to just use bio_data().
      
      For the discard payload use case, just reference the page
      in the bio.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b4f42e28
  8. 11 4月, 2014 1 次提交
  9. 10 4月, 2014 3 次提交
    • J
      block: fix regression with block enabled tagging · 360f92c2
      Jens Axboe 提交于
      Martin reported that his test system would not boot with
      current git, it oopsed with this:
      
      BUG: unable to handle kernel paging request at ffff88046c6c9e80
      IP: [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
      PGD 1ddf067 PUD 1de2067 PMD 47fc7d067 PTE 800000046c6c9060
      Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
      Modules linked in: sd_mod lpfc(+) scsi_transport_fc scsi_tgt oracleasm
      rpcsec_gss_krb5 ipv6 igb dca i2c_algo_bit i2c_core hwmon
      CPU: 3 PID: 87 Comm: kworker/u17:1 Not tainted 3.14.0+ #246
      Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013
      Workqueue: events_unbound async_run_entry_fn
      task: ffff8802743c2150 ti: ffff880273d02000 task.ti: ffff880273d02000
      RIP: 0010:[<ffffffff812971e0>]  [<ffffffff812971e0>]
      blk_queue_start_tag+0x90/0x150
      RSP: 0018:ffff880273d03a58  EFLAGS: 00010092
      RAX: ffff88046c6c9e78 RBX: ffff880077208e78 RCX: 00000000fffc8da6
      RDX: 00000000fffc186d RSI: 0000000000000009 RDI: 00000000fffc8d9d
      RBP: ffff880273d03a88 R08: 0000000000000001 R09: ffff8800021c2410
      R10: 0000000000000005 R11: 0000000000015b30 R12: ffff88046c5bb8a0
      R13: ffff88046c5c0890 R14: 000000000000001e R15: 000000000000001e
      FS:  0000000000000000(0000) GS:ffff880277b00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffff88046c6c9e80 CR3: 00000000018f6000 CR4: 00000000000407e0
      Stack:
       ffff880273d03a98 ffff880474b18800 0000000000000000 ffff880474157000
       ffff88046c5c0890 ffff880077208e78 ffff880273d03ae8 ffffffff813b9e62
       ffff880200000010 ffff880474b18968 ffff880474b18848 ffff88046c5c0cd8
      Call Trace:
       [<ffffffff813b9e62>] scsi_request_fn+0xf2/0x510
       [<ffffffff81293167>] __blk_run_queue+0x37/0x50
       [<ffffffff8129ac43>] blk_execute_rq_nowait+0xb3/0x130
       [<ffffffff8129ad24>] blk_execute_rq+0x64/0xf0
       [<ffffffff8108d2b0>] ? bit_waitqueue+0xd0/0xd0
       [<ffffffff813bba35>] scsi_execute+0xe5/0x180
       [<ffffffff813bbe4a>] scsi_execute_req_flags+0x9a/0x110
       [<ffffffffa01b1304>] sd_spinup_disk+0x94/0x460 [sd_mod]
       [<ffffffff81160000>] ? __unmap_hugepage_range+0x200/0x2f0
       [<ffffffffa01b2b9a>] sd_revalidate_disk+0xaa/0x3f0 [sd_mod]
       [<ffffffffa01b2fb8>] sd_probe_async+0xd8/0x200 [sd_mod]
       [<ffffffff8107703f>] async_run_entry_fn+0x3f/0x140
       [<ffffffff8106a1c5>] process_one_work+0x175/0x410
       [<ffffffff8106b373>] worker_thread+0x123/0x400
       [<ffffffff8106b250>] ? manage_workers+0x160/0x160
       [<ffffffff8107104e>] kthread+0xce/0xf0
       [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
       [<ffffffff815f0bac>] ret_from_fork+0x7c/0xb0
       [<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
      Code: 48 0f ab 11 72 db 48 81 4b 40 00 00 10 00 89 83 08 01 00 00 48 89
      df 49 8b 04 24 48 89 1c d0 e8 f7 a8 ff ff 49 8b 85 28 05 00 00 <48> 89
      58 08 48 89 03 49 8d 85 28 05 00 00 48 89 43 08 49 89 9d
      RIP  [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
       RSP <ffff880273d03a58>
      CR2: ffff88046c6c9e80
      
      Martin bisected and found this to be the problem patch;
      
      	commit 6d113398
      	Author: Jan Kara <jack@suse.cz>
      	Date:   Mon Feb 24 16:39:54 2014 +0100
      
      	    block: Stop abusing rq->csd.list in blk-softirq
      
      and the problem was immediately apparent. The patch states that
      it is safe to reuse queuelist at completion time, since it is
      no longer used. However, that is not true if a device is using
      block enabled tagging. If that is the case, then the queuelist
      is reused to keep track of busy tags. If a device also ended
      up using softirq completions, we'd reuse ->queuelist for the
      IPI handling while block tagging was still using it. Boom.
      
      Fix this by adding a new ipi_list list head, and share the
      memory used with the request hash table. The hash table is
      never used after the request is moved to the dispatch list,
      which happens long before any potential completion of the
      request. Add a new request bit for this, so we don't have
      cases that check rq->hash while it could potentially have
      been reused for the IPI completion.
      Reported-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Tested-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      360f92c2
    • J
      block: add kblockd_schedule_delayed_work_on() · 8ab14595
      Jens Axboe 提交于
      Same function as kblockd_schedule_delayed_work(), but allow the
      caller to pass in a CPU that the work should be executed on. This
      just directly extends and maps into the workqueue API, and will
      be used to make the blk-mq mappings more strict.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      8ab14595
    • J
      block: remove 'q' parameter from kblockd_schedule_*_work() · 59c3d45e
      Jens Axboe 提交于
      The queue parameter is never used, just get rid of it.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      59c3d45e
  10. 21 3月, 2014 1 次提交
  11. 09 3月, 2014 1 次提交
    • M
      block: fix q->flush_rq NULL pointer crash on dm-mpath flush · 7982e90c
      Mike Snitzer 提交于
      Commit 18741986 ("blk-mq: rework flush sequencing logic") switched
      ->flush_rq from being an embedded member of the request_queue structure
      to being dynamically allocated in blk_init_queue_node().
      
      Request-based DM multipath doesn't use blk_init_queue_node(), instead it
      uses blk_alloc_queue_node() + blk_init_allocated_queue().  Because
      commit 18741986 placed the dynamic allocation of ->flush_rq in
      blk_init_queue_node() any flush issued to a dm-mpath device would crash
      with a NULL pointer, e.g.:
      
      BUG: unable to handle kernel NULL pointer dereference at           (null)
      IP: [<ffffffff8125037e>] blk_rq_init+0x1e/0xb0
      PGD bb3c7067 PUD bb01d067 PMD 0
      Oops: 0002 [#1] SMP
      ...
      CPU: 5 PID: 5028 Comm: dt Tainted: G        W  O 3.14.0-rc3.snitm+ #10
      ...
      task: ffff88032fb270e0 ti: ffff880079564000 task.ti: ffff880079564000
      RIP: 0010:[<ffffffff8125037e>]  [<ffffffff8125037e>] blk_rq_init+0x1e/0xb0
      RSP: 0018:ffff880079565c98  EFLAGS: 00010046
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000030
      RDX: ffff880260c74048 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff880079565ca8 R08: ffff880260aa1e98 R09: 0000000000000001
      R10: ffff88032fa78500 R11: 0000000000000246 R12: 0000000000000000
      R13: ffff880260aa1de8 R14: 0000000000000650 R15: 0000000000000000
      FS:  00007f8d36a2a700(0000) GS:ffff88033fca0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000000 CR3: 0000000079b36000 CR4: 00000000000007e0
      Stack:
       0000000000000000 ffff880260c74048 ffff880079565cd8 ffffffff81257a47
       ffff880260aa1de8 ffff880260c74048 0000000000000001 0000000000000000
       ffff880079565d08 ffffffff81257c2d 0000000000000000 ffff880260aa1de8
      Call Trace:
       [<ffffffff81257a47>] blk_flush_complete_seq+0x2d7/0x2e0
       [<ffffffff81257c2d>] blk_insert_flush+0x1dd/0x210
       [<ffffffff8124ec59>] __elv_add_request+0x1f9/0x320
       [<ffffffff81250681>] ? blk_account_io_start+0x111/0x190
       [<ffffffff81253a4b>] blk_queue_bio+0x25b/0x330
       [<ffffffffa0020bf5>] dm_request+0x35/0x40 [dm_mod]
       [<ffffffff812530c0>] generic_make_request+0xc0/0x100
       [<ffffffff81253173>] submit_bio+0x73/0x140
       [<ffffffff811becdd>] submit_bio_wait+0x5d/0x80
       [<ffffffff81257528>] blkdev_issue_flush+0x78/0xa0
       [<ffffffff811c1f6f>] blkdev_fsync+0x3f/0x60
       [<ffffffff811b7fde>] vfs_fsync_range+0x1e/0x20
       [<ffffffff811b7ffc>] vfs_fsync+0x1c/0x20
       [<ffffffff811b81f1>] do_fsync+0x41/0x80
       [<ffffffff8118874e>] ? SyS_lseek+0x7e/0x80
       [<ffffffff811b8260>] SyS_fsync+0x10/0x20
       [<ffffffff8154c2d2>] system_call_fastpath+0x16/0x1b
      
      Fix this by moving the ->flush_rq allocation from blk_init_queue_node()
      to blk_init_allocated_queue().  blk_init_queue_node() also calls
      blk_init_allocated_queue() so this change is functionality equivalent
      for all blk_init_queue_node() callers.
      Reported-by: NHannes Reinecke <hare@suse.de>
      Reported-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7982e90c
  12. 06 3月, 2014 1 次提交
    • R
      blktrace: fix accounting of partially completed requests · af5040da
      Roman Pen 提交于
      trace_block_rq_complete does not take into account that request can
      be partially completed, so we can get the following incorrect output
      of blkparser:
      
        C   R 232 + 240 [0]
        C   R 240 + 232 [0]
        C   R 248 + 224 [0]
        C   R 256 + 216 [0]
      
      but should be:
      
        C   R 232 + 8 [0]
        C   R 240 + 8 [0]
        C   R 248 + 8 [0]
        C   R 256 + 8 [0]
      
      Also, the whole output summary statistics of completed requests and
      final throughput will be incorrect.
      
      This patch takes into account real completion size of the request and
      fixes wrong completion accounting.
      Signed-off-by: NRoman Pen <r.peniaev@gmail.com>
      CC: Steven Rostedt <rostedt@goodmis.org>
      CC: Frederic Weisbecker <fweisbec@gmail.com>
      CC: Ingo Molnar <mingo@redhat.com>
      CC: linux-kernel@vger.kernel.org
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@fb.com>
      af5040da
  13. 19 2月, 2014 1 次提交
  14. 11 2月, 2014 1 次提交
    • C
      blk-mq: rework flush sequencing logic · 18741986
      Christoph Hellwig 提交于
      Witch to using a preallocated flush_rq for blk-mq similar to what's done
      with the old request path.  This allows us to set up the request properly
      with a tag from the actually allowed range and ->rq_disk as needed by
      some drivers.  To make life easier we also switch to dynamic allocation
      of ->flush_rq for the old path.
      
      This effectively reverts most of
      
          "blk-mq: fix for flush deadlock"
      
      and
      
          "blk-mq: Don't reserve a tag for flush request"
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      18741986
  15. 08 2月, 2014 1 次提交
  16. 01 1月, 2014 2 次提交
  17. 24 11月, 2013 2 次提交
    • K
      block: Convert bio_for_each_segment() to bvec_iter · 7988613b
      Kent Overstreet 提交于
      More prep work for immutable biovecs - with immutable bvecs drivers
      won't be able to use the biovec directly, they'll need to use helpers
      that take into account bio->bi_iter.bi_bvec_done.
      
      This updates callers for the new usage without changing the
      implementation yet.
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Paul Clements <Paul.Clements@steeleye.com>
      Cc: Jim Paris <jim@jtan.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com>
      Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com>
      Cc: support@lsi.com
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Quoc-Son Anh <quoc-sonx.anh@intel.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: linux-m68k@lists.linux-m68k.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: drbd-user@lists.linbit.com
      Cc: nbd-general@lists.sourceforge.net
      Cc: cbe-oss-dev@lists.ozlabs.org
      Cc: xen-devel@lists.xensource.com
      Cc: virtualization@lists.linux-foundation.org
      Cc: linux-raid@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Cc: DL-MPTFusionLinux@lsi.com
      Cc: linux-scsi@vger.kernel.org
      Cc: devel@driverdev.osuosl.org
      Cc: linux-fsdevel@vger.kernel.org
      Cc: cluster-devel@redhat.com
      Cc: linux-mm@kvack.org
      Acked-by: NGeoff Levand <geoff@infradead.org>
      7988613b
    • K
      block: Abstract out bvec iterator · 4f024f37
      Kent Overstreet 提交于
      Immutable biovecs are going to require an explicit iterator. To
      implement immutable bvecs, a later patch is going to add a bi_bvec_done
      member to this struct; for now, this patch effectively just renames
      things.
      Signed-off-by: NKent Overstreet <kmo@daterainc.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Ed L. Cashin" <ecashin@coraid.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Yehuda Sadeh <yehuda@inktank.com>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Alex Elder <elder@inktank.com>
      Cc: ceph-devel@vger.kernel.org
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: linux390@de.ibm.com
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Cc: Benny Halevy <bhalevy@tonian.com>
      Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@kernel.org>
      Cc: Joern Engel <joern@logfs.org>
      Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: xfs@oss.sgi.com
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Guo Chao <yan@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Asai Thambi S P <asamymuthupa@micron.com>
      Cc: Selvan Mani <smani@micron.com>
      Cc: Sam Bradshaw <sbradshaw@micron.com>
      Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Cc: "Roger Pau Monné" <roger.pau@citrix.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Cc: Ian Campbell <Ian.Campbell@citrix.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Jerome Marchand <jmarchand@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peng Tao <tao.peng@emc.com>
      Cc: Andy Adamson <andros@netapp.com>
      Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Namjae Jeon <namjae.jeon@samsung.com>
      Cc: Pankaj Kumar <pankaj.km@samsung.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>6
      4f024f37
  18. 09 11月, 2013 2 次提交
    • A
      block: Enable sysfs nomerge control for I/O requests in the plug list · 23779fbc
      Alireza Haghdoost 提交于
      This patch enables the sysfs to control I/O request merge
      functionality in the plug list. While this control has been
      implemented for the request queue, it was dismissed in the plug list.
      Therefore, block layer merges requests together (or attempt to merge)
      even if the merge capability was disable using sysfs nomerge parameter
      value 2.
      
      This limitation is directly affects functionality of io_submit()
      system call. The system call enables user to submit a bunch of IO
      requests from user space using struct iocb **ios input argument.
      However, the unconditioned merging functionality in the plug list
      potentially merges these requests together down the road. Therefore,
      there is no way to distinguish between an application sending bunch of
      sequential IOs and an application sending one big IO. Ultimately, all
      requests generated by the former app merge within the plug list
      together and looks similar to the second app.
      
      While the merging functionality is a desirable feature to improve the
      performance of IO subsystem for some applications, it is not useful
      for other application like ours at all.
      Signed-off-by: NAlireza Haghdoost <alireza@cs.umn.edu>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      
      Coding style modified.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      23779fbc
    • T
      elevator: Fix a race in elevator switching and md device initialization · eb1c160b
      Tomoki Sekiyama 提交于
      The soft lockup below happens at the boot time of the system using dm
      multipath and the udev rules to switch scheduler.
      
      [  356.127001] BUG: soft lockup - CPU#3 stuck for 22s! [sh:483]
      [  356.127001] RIP: 0010:[<ffffffff81072a7d>]  [<ffffffff81072a7d>] lock_timer_base.isra.35+0x1d/0x50
      ...
      [  356.127001] Call Trace:
      [  356.127001]  [<ffffffff81073810>] try_to_del_timer_sync+0x20/0x70
      [  356.127001]  [<ffffffff8118b08a>] ? kmem_cache_alloc_node_trace+0x20a/0x230
      [  356.127001]  [<ffffffff810738b2>] del_timer_sync+0x52/0x60
      [  356.127001]  [<ffffffff812ece22>] cfq_exit_queue+0x32/0xf0
      [  356.127001]  [<ffffffff812c98df>] elevator_exit+0x2f/0x50
      [  356.127001]  [<ffffffff812c9f21>] elevator_change+0xf1/0x1c0
      [  356.127001]  [<ffffffff812caa50>] elv_iosched_store+0x20/0x50
      [  356.127001]  [<ffffffff812d1d09>] queue_attr_store+0x59/0xb0
      [  356.127001]  [<ffffffff812143f6>] sysfs_write_file+0xc6/0x140
      [  356.127001]  [<ffffffff811a326d>] vfs_write+0xbd/0x1e0
      [  356.127001]  [<ffffffff811a3ca9>] SyS_write+0x49/0xa0
      [  356.127001]  [<ffffffff8164e899>] system_call_fastpath+0x16/0x1b
      
      This is caused by a race between md device initialization by multipathd and
      shell script to switch the scheduler using sysfs.
      
       - multipathd:
         SyS_ioctl -> do_vfs_ioctl -> dm_ctl_ioctl -> ctl_ioctl -> table_load
         -> dm_setup_md_queue -> blk_init_allocated_queue -> elevator_init
          q->elevator = elevator_alloc(q, e); // not yet initialized
      
       - sh -c 'echo deadline > /sys/$DEVPATH/queue/scheduler':
         elevator_switch (in the call trace above)
          struct elevator_queue *old = q->elevator;
          q->elevator = elevator_alloc(q, new_e);
          elevator_exit(old);                 // lockup! (*)
      
       - multipathd: (cont.)
          err = e->ops.elevator_init_fn(q);   // init fails; q->elevator is modified
      
      (*) When del_timer_sync() is called, lock_timer_base() will loop infinitely
      while timer->base == NULL. In this case, as timer will never initialized,
      it results in lockup.
      
      This patch introduces acquisition of q->sysfs_lock around elevator_init()
      into blk_init_allocated_queue(), to provide mutual exclusion between
      initialization of the q->scheduler and switching of the scheduler.
      
      This should fix this bugzilla:
      https://bugzilla.redhat.com/show_bug.cgi?id=902012Signed-off-by: NTomoki Sekiyama <tomoki.sekiyama@hds.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      eb1c160b
  19. 08 11月, 2013 2 次提交
    • M
      blk-core: Fix memory corruption if blkcg_init_queue fails · fff4996b
      Mikulas Patocka 提交于
      If blkcg_init_queue fails, blk_alloc_queue_node doesn't call bdi_destroy
      to clean up structures allocated by the backing dev.
      
      ------------[ cut here ]------------
      WARNING: at lib/debugobjects.c:260 debug_print_object+0x85/0xa0()
      ODEBUG: free active (active state 0) object type: percpu_counter hint:           (null)
      Modules linked in: dm_loop dm_mod ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev ipt_MASQUERADE iptable_nat nf_nat_ipv4 msr nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative spadfs fuse hid_generic usbhid hid raid0 md_mod dmi_sysfs nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack lm85 hwmon_vid snd_usb_audio snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib snd_rawmidi snd soundcore acpi_cpufreq freq_table mperf sata_svw serverworks kvm_amd ide_core ehci_pci ohci_hcd libata ehci_hcd kvm usbcore tg3 usb_common libphy k10temp pcspkr ptp i2c_piix4 i2c_core evdev microcode hwmon rtc_cmos pps_core e100 skge floppy mii processor button unix
      CPU: 0 PID: 2739 Comm: lvchange Tainted: G        W
      3.10.15-devel #14
      Hardware name: empty empty/S3992-E, BIOS 'V1.06   ' 06/09/2009
       0000000000000009 ffff88023c3c1ae8 ffffffff813c8fd4 ffff88023c3c1b20
       ffffffff810399eb ffff88043d35cd58 ffffffff81651940 ffff88023c3c1bf8
       ffffffff82479d90 0000000000000005 ffff88023c3c1b80 ffffffff81039a67
      Call Trace:
       [<ffffffff813c8fd4>] dump_stack+0x19/0x1b
       [<ffffffff810399eb>] warn_slowpath_common+0x6b/0xa0
       [<ffffffff81039a67>] warn_slowpath_fmt+0x47/0x50
       [<ffffffff8122aaaf>] ? debug_check_no_obj_freed+0xcf/0x250
       [<ffffffff81229a15>] debug_print_object+0x85/0xa0
       [<ffffffff8122abe3>] debug_check_no_obj_freed+0x203/0x250
       [<ffffffff8113c4ac>] kmem_cache_free+0x20c/0x3a0
       [<ffffffff811f6709>] blk_alloc_queue_node+0x2a9/0x2c0
       [<ffffffff811f672e>] blk_alloc_queue+0xe/0x10
       [<ffffffffa04c0093>] dm_create+0x1a3/0x530 [dm_mod]
       [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
       [<ffffffffa04c6c07>] dev_create+0x57/0x2b0 [dm_mod]
       [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
       [<ffffffffa04c6bb0>] ? list_version_get_info+0xe0/0xe0 [dm_mod]
       [<ffffffffa04c6528>] ctl_ioctl+0x268/0x500 [dm_mod]
       [<ffffffff81097662>] ? get_lock_stats+0x22/0x70
       [<ffffffffa04c67ce>] dm_ctl_ioctl+0xe/0x20 [dm_mod]
       [<ffffffff81161aad>] do_vfs_ioctl+0x2ed/0x520
       [<ffffffff8116cfc7>] ? fget_light+0x377/0x4e0
       [<ffffffff81161d2b>] SyS_ioctl+0x4b/0x90
       [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
      ---[ end trace 4b5ff0d55673d986 ]---
      ------------[ cut here ]------------
      
      This fix should be backported to stable kernels starting with 2.6.37. Note
      that in the kernels prior to 3.5 the affected code is different, but the
      bug is still there - bdi_init is called and bdi_destroy isn't.
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: stable@kernel.org	# 2.6.37+
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fff4996b
    • J
      block: fix race between request completion and timeout handling · 4912aa6c
      Jeff Moyer 提交于
      crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      
      Pid: 491, comm: scsi_eh_0 Tainted: G        W  ----------------   2.6.32-220.13.1.el6.x86_64 #1 IBM  -[8722PAX]-/00D1461
      RIP: 0010:[<ffffffff8124e424>]  [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
      RSP: 0018:ffff881057eefd60  EFLAGS: 00010012
      RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8
      RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780
      RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338
      R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000
      FS:  0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540)
      Stack:
       0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000
      <0> ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393
      <0> ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90
      Call Trace:
       [<ffffffff81362323>] __scsi_queue_insert+0xa3/0x150
       [<ffffffff8135f393>] ? scsi_eh_ready_devs+0x5e3/0x850
       [<ffffffff81362a23>] scsi_queue_insert+0x13/0x20
       [<ffffffff8135e4d4>] scsi_eh_flush_done_q+0x104/0x160
       [<ffffffff8135fb6b>] scsi_error_handler+0x35b/0x660
       [<ffffffff8135f810>] ? scsi_error_handler+0x0/0x660
       [<ffffffff810908c6>] kthread+0x96/0xa0
       [<ffffffff8100c14a>] child_rip+0xa/0x20
       [<ffffffff81090830>] ? kthread+0x0/0xa0
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
      RIP  [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
       RSP <ffff881057eefd60>
      
      The RIP is this line:
              BUG_ON(blk_queued_rq(rq));
      
      After digging through the code, I think there may be a race between the
      request completion and the timer handler running.
      
      A timer is started for each request put on the device's queue (see
      blk_start_request->blk_add_timer).  If the request does not complete
      before the timer expires, the timer handler (blk_rq_timed_out_timer)
      will mark the request complete atomically:
      
      static inline int blk_mark_rq_complete(struct request *rq)
      {
              return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
      }
      
      and then call blk_rq_timed_out.  The latter function will call
      scsi_times_out, which will return one of BLK_EH_HANDLED,
      BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED.  If BLK_EH_RESET_TIMER is
      returned, blk_clear_rq_complete is called, and blk_add_timer is again
      called to simply wait longer for the request to complete.
      
      Now, if the request happens to complete while this is going on, what
      happens?  Given that we know the completion handler will bail if it
      finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion
      handler running after that bit is cleared.  So, from the above
      paragraph, after the call to blk_clear_rq_complete.  If the completion
      sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom
      there (I haven't seen this in the cores).  Next, if we get the
      completion before the call to list_add_tail, then the timer will
      eventually fire for an old req, which may either be freed or reallocated
      (there is evidence that this might be the case).  Finally, if the
      completion comes in *after* the addition to the timeout list, I think
      it's harmless.  The request will be removed from the timeout list,
      req_atom_complete will be set, and all will be well.
      
      This will only actually explain the coredumps *IF* the request
      structure was freed, reallocated *and* queued before the error handler
      thread had a chance to process it.  That is possible, but it may make
      sense to keep digging for another race.  I think that if this is what
      was happening, we would see other instances of this problem showing up
      as null pointer or garbage pointer dereferences, for example when the
      request structure was not re-used.  It looks like we actually do run
      into that situation in other reports.
      
      This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
      &req->atomic_flags)); from blk_add_timer to the only caller that could
      trip over it (blk_start_request).  It then inverts the calls to
      blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
      the race.  I've boot tested this patch, but nothing more.
      Signed-off-by: NJeff Moyer <jmoyer@redhat.com>
      Acked-by: NHannes Reinecke <hare@suse.de>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4912aa6c
  20. 30 10月, 2013 1 次提交
  21. 29 10月, 2013 1 次提交
    • C
      blk-mq: fix for flush deadlock · 3228f48b
      Christoph Hellwig 提交于
      The flush state machine takes in a struct request, which then is
      submitted multiple times to the underling driver.  The old block code
      requeses the same request for each of those, so it does not have an
      issue with tapping into the request pool.  The new one on the other hand
      allocates a new request for each of the actualy steps of the flush
      sequence. If have already allocated all of the tags for IO, we will
      fail allocating the flush request.
      
      Set aside a reserved request just for flushes.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3228f48b
  22. 25 10月, 2013 3 次提交
    • J
      blk-mq: new multi-queue block IO queueing mechanism · 320ae51f
      Jens Axboe 提交于
      Linux currently has two models for block devices:
      
      - The classic request_fn based approach, where drivers use struct
        request units for IO. The block layer provides various helper
        functionalities to let drivers share code, things like tag
        management, timeout handling, queueing, etc.
      
      - The "stacked" approach, where a driver squeezes in between the
        block layer and IO submitter. Since this bypasses the IO stack,
        driver generally have to manage everything themselves.
      
      With drivers being written for new high IOPS devices, the classic
      request_fn based driver doesn't work well enough. The design dates
      back to when both SMP and high IOPS was rare. It has problems with
      scaling to bigger machines, and runs into scaling issues even on
      smaller machines when you have IOPS in the hundreds of thousands
      per device.
      
      The stacked approach is then most often selected as the model
      for the driver. But this means that everybody has to re-invent
      everything, and along with that we get all the problems again
      that the shared approach solved.
      
      This commit introduces blk-mq, block multi queue support. The
      design is centered around per-cpu queues for queueing IO, which
      then funnel down into x number of hardware submission queues.
      We might have a 1:1 mapping between the two, or it might be
      an N:M mapping. That all depends on what the hardware supports.
      
      blk-mq provides various helper functions, which include:
      
      - Scalable support for request tagging. Most devices need to
        be able to uniquely identify a request both in the driver and
        to the hardware. The tagging uses per-cpu caches for freed
        tags, to enable cache hot reuse.
      
      - Timeout handling without tracking request on a per-device
        basis. Basically the driver should be able to get a notification,
        if a request happens to fail.
      
      - Optional support for non 1:1 mappings between issue and
        submission queues. blk-mq can redirect IO completions to the
        desired location.
      
      - Support for per-request payloads. Drivers almost always need
        to associate a request structure with some driver private
        command structure. Drivers can tell blk-mq this at init time,
        and then any request handed to the driver will have the
        required size of memory associated with it.
      
      - Support for merging of IO, and plugging. The stacked model
        gets neither of these. Even for high IOPS devices, merging
        sequential IO reduces per-command overhead and thus
        increases bandwidth.
      
      For now, this is provided as a potential 3rd queueing model, with
      the hope being that, as it matures, it can replace both the classic
      and stacked model. That would get us back to having just 1 real
      model for block devices, leaving the stacked approach to dm/md
      devices (as it was originally intended).
      
      Contributions in this patch from the following people:
      
      Shaohua Li <shli@fusionio.com>
      Alexander Gordeev <agordeev@redhat.com>
      Christoph Hellwig <hch@infradead.org>
      Mike Christie <michaelc@cs.wisc.edu>
      Matias Bjorling <m@bjorling.me>
      Jeff Moyer <jmoyer@redhat.com>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      320ae51f
    • C
      block: remove request ref_count · 71fe07d0
      Christoph Hellwig 提交于
      This reference count has been around since before git history, but the only
      place where it's used is in blk_execute_rq, and ther it is entirely useless
      as it is incremented before submitting the request and decremented in the
      end_io handler before waking up the submitter thread.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      71fe07d0
    • J
      block: make rq->cmd_flags be 64-bit · 5953316d
      Jens Axboe 提交于
      We have officially run out of flags in a 32-bit space. Extend it
      to 64-bit even on 32-bit archs.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5953316d
  23. 12 9月, 2013 1 次提交
  24. 24 8月, 2013 2 次提交
  25. 01 7月, 2013 1 次提交
  26. 17 5月, 2013 1 次提交
  27. 15 5月, 2013 1 次提交
    • V
      block: queue work on power efficient wq · 695588f9
      Viresh Kumar 提交于
      Block layer uses workqueues for multiple purposes. There is no real dependency
      of scheduling these on the cpu which scheduled them.
      
      On a idle system, it is observed that and idle cpu wakes up many times just to
      service this work. It would be better if we can schedule it on a cpu which the
      scheduler believes to be the most appropriate one.
      
      This patch replaces normal workqueues with power efficient versions.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      695588f9
  28. 19 4月, 2013 1 次提交
  29. 24 3月, 2013 1 次提交
    • K
      block: Add bio_end_sector() · f73a1c7d
      Kent Overstreet 提交于
      Just a little convenience macro - main reason to add it now is preparing
      for immutable bio vecs, it'll reduce the size of the patch that puts
      bi_sector/bi_size/bi_idx into a struct bvec_iter.
      Signed-off-by: NKent Overstreet <koverstreet@google.com>
      CC: Jens Axboe <axboe@kernel.dk>
      CC: Lars Ellenberg <drbd-dev@lists.linbit.com>
      CC: Jiri Kosina <jkosina@suse.cz>
      CC: Alasdair Kergon <agk@redhat.com>
      CC: dm-devel@redhat.com
      CC: Neil Brown <neilb@suse.de>
      CC: Martin Schwidefsky <schwidefsky@de.ibm.com>
      CC: Heiko Carstens <heiko.carstens@de.ibm.com>
      CC: linux-s390@vger.kernel.org
      CC: Chris Mason <chris.mason@fusionio.com>
      CC: Steven Whitehouse <swhiteho@redhat.com>
      Acked-by: NSteven Whitehouse <swhiteho@redhat.com>
      f73a1c7d