1. May 24, 2007 (1 commit)
  2. May 16, 2007 (1 commit)
  3. May 11, 2007 (1 commit)
    • When stacked block devices are in-use (e.g. md or dm), the recursive calls · d89d8796
      Committed by Neil Brown
      to generic_make_request can use up a lot of space, and we would rather they
      didn't.
      
      As generic_make_request is a void function, and as it is generally not
      expected that it will have any effect immediately, it is safe to delay any
      call to generic_make_request until there is sufficient stack space
      available.
      
      As ->bi_next is reserved for the driver to use, it can have no valid
      value when generic_make_request is called, and as __make_request
      implicitly assumes it will be NULL (in the ELEVATOR_BACK_MERGE branch
      of the switch) we can be certain that all callers set it to NULL.  We
      can therefore safely use
      bi_next to link pending requests together, providing we clear it before
      making the real call.
      
      So, we choose to allow each thread to only be active in one
      generic_make_request at a time.  If a subsequent (recursive) call is made,
      the bio is linked into a per-thread list, and is handled when the active
      call completes.
      
      As the list of pending bios is per-thread, there are no locking issues to
      worry about.
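
      A minimal sketch of that mechanism, in the spirit of the description
      above (treat the task_struct fields bio_list/bio_tail and the
      __generic_make_request helper as illustrative names, not necessarily
      the exact patch):

      	void generic_make_request(struct bio *bio)
      	{
      		if (current->bio_tail) {
      			/* A make_request_fn is already active on this
      			 * thread: park the bio on the per-task list. */
      			*(current->bio_tail) = bio;
      			bio->bi_next = NULL;
      			current->bio_tail = &bio->bi_next;
      			return;
      		}
      		/* Top-level call: submit this bio, then drain whatever
      		 * the recursive calls queued up, one at a time. */
      		do {
      			current->bio_list = bio->bi_next;
      			if (bio->bi_next == NULL)
      				current->bio_tail = &current->bio_list;
      			else
      				bio->bi_next = NULL;	/* ours; clear it */
      			__generic_make_request(bio);	/* real submission */
      			bio = current->bio_list;
      		} while (bio);
      		current->bio_tail = NULL;	/* no longer active */
      	}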
      
      I say above that it is "safe to delay any call...".  There are, however,
      some behaviours of a make_request_fn which would make it unsafe.  These
      include any behaviour that assumes anything will have changed after a
      recursive call to generic_make_request.
      
      These could include:
       - waiting for that call to finish and call its bi_end_io function.
         md used to sometimes do this (marking the superblock dirty before
         completing a write) but doesn't any more
       - inspecting the bio for fields that generic_make_request might
         change, such as bi_sector or bi_bdev.  It is hard to see a good
         reason for this, and I don't think anyone actually does it.
       - inspecting the queue to see if, e.g. it is 'full' yet.  Again, I
         think this is very unlikely to be useful, or to be done.
      Signed-off-by: Neil Brown <neilb@suse.de>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <dm-devel@redhat.com>
      
      Alasdair G Kergon <agk@redhat.com> said:
      
       I can see nothing wrong with this in principle.
      
       For device-mapper at the moment though it's essential that, while the bio
       mappings may now get delayed, they still get processed in exactly
       the same order as they were passed to generic_make_request().
      
       My main concern is whether the timing changes implicit in this patch
       will make the rare data-corrupting races in the existing snapshot code
       more likely. (I'm working on a fix for these races, but the unfinished
       patch is already several hundred lines long.)
      
       It would be helpful if some people on this mailing list would test
       this patch in various scenarios and report back.
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  4. May 10, 2007 (4 commits)
  5. May 09, 2007 (3 commits)
  6. May 08, 2007 (2 commits)
  7. May 03, 2007 (1 commit)
  8. April 30, 2007 (19 commits)
  9. April 25, 2007 (1 commit)
    • cfq-iosched: fix alias + front merge bug · 5044eed4
      Committed by Jens Axboe
      There's a really rare and obscure bug in CFQ that causes a crash in
      cfq_dispatch_insert() due to rq == NULL.  One example of the resulting
      oops is seen here:
      
      	http://lkml.org/lkml/2007/4/15/41
      
      Neil correctly diagnosed how this can happen: if two concurrent
      requests arrive with the exact same sector number (due to direct I/O,
      or aliasing between MD and raw device access), the alias handling
      will add the request to the sortlist, but next_rq remains NULL.
      
      Read the more complete analysis at:
      
      	http://lkml.org/lkml/2007/4/25/57
      
      This looks like it requires md to trigger, even though it should
      potentially be possible to do with O_DIRECT (at least if you edit the
      kernel and doctor some of the unplug calls).
      
      The fix is to move the ->next_rq update to where we add a request to
      the rbtree.  That removes the possibility of a request existing in the
      rbtree without ->next_rq being correctly updated.
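
      An illustrative sketch of the fix's shape (helper names follow
      cfq-iosched.c from memory, not the exact diff): recompute ->next_rq
      at the single point where a request enters the sort tree, so an alias
      insertion can never leave it NULL.

      	static void cfq_add_rq_rb(struct cfq_queue *cfqq, struct request *rq)
      	{
      		/* insert into the per-queue sort tree... */
      		elv_rb_add(&cfqq->sort_list, rq);
      		/* ...and keep the dispatch hint coherent right here */
      		cfqq->next_rq = cfq_choose_req(cfqq->cfqd, cfqq->next_rq, rq);
      	}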
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. April 21, 2007 (1 commit)
    • cfq-iosched: fix sequential write regression · a9938006
      Committed by Jens Axboe
      We have a 10-15% performance regression for sequential writes on TCQ/NCQ
      enabled drives in 2.6.21-rcX after the CFQ update went in.  It has been
      reported by Valerie Clement <valerie.clement@bull.net> and the Intel
      testing folks.  The regression is due to CFQ's now more aggressive
      queue control, which limits the depth available to the device.
      
      This patch fixes that regression by allowing a greater depth when only
      one queue is busy.  It has been tested not to impact sync-vs-async
      workloads too much - we still do a lot better than 2.6.20.
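
      A sketch of the idea only (the helper and the relaxation factor are
      assumptions for illustration, not the actual patch): with a single
      busy queue there is nobody to protect from starvation, so the
      dispatch depth may grow toward the drive's TCQ/NCQ capability.

      	static int cfq_max_dispatch(struct cfq_data *cfqd)
      	{
      		if (cfqd->busy_queues == 1)
      			/* assumed relaxation: let the device queue fill */
      			return 4 * cfqd->cfq_quantum;
      		return cfqd->cfq_quantum;	/* usual per-round limit */
      	}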
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. April 18, 2007 (1 commit)
    • [SCSI] sg: cap reserved_size values at max_sectors · 44ec9542
      Committed by Alan Stern
      This patch (as857) modifies the SG_GET_RESERVED_SIZE and
      SG_SET_RESERVED_SIZE ioctls in the sg driver, capping the values at
      the device's request_queue's max_sectors value.  This will permit
      cdrecord to obtain a legal value for the maximum transfer length,
      fixing Bugzilla #7026.
      
      The patch also caps the initial reserved_size value.  There's no
      reason to have a reserved buffer larger than max_sectors, since it
      would be impossible to use the extra space.
      
      The corresponding ioctls in the block layer are modified similarly,
      and the initial value for the reserved_size is set as large as
      possible.  This will effectively make it default to max_sectors.
      Note that the actual value is meaningless anyway, since block devices
      don't have a reserved buffer.
      
      Finally, the BLKSECTGET ioctl is added to sg, so that there will be a
      uniform way for users to determine the actual max_sectors value for
      any raw SCSI transport.
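
      A minimal userspace sketch of that uniform query, shown against a
      block device node (assuming, as the block layer's BLKSECTGET does,
      that the limit is reported in 512-byte sectors through an unsigned
      short; the device path is just an example):

      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <sys/ioctl.h>
      	#include <linux/fs.h>		/* BLKSECTGET */

      	int main(int argc, char **argv)
      	{
      		unsigned short max_sectors = 0;
      		int fd = open(argc > 1 ? argv[1] : "/dev/sda", O_RDONLY);

      		if (fd < 0 || ioctl(fd, BLKSECTGET, &max_sectors) != 0) {
      			perror("BLKSECTGET");
      			return 1;
      		}
      		printf("max transfer: %u sectors (%u bytes)\n",
      		       max_sectors, max_sectors * 512u);
      		return 0;
      	}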
      Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
      Acked-by: Jens Axboe <jens.axboe@oracle.com>
      Acked-by: Douglas Gilbert <dougg@torque.net>
      Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
  12. April 05, 2007 (1 commit)
  13. March 27, 2007 (2 commits)
    • make elv_register() output atomic · 1ffb96c5
      Committed by Thibaut VARENE
      Booting 2.6.21-rc3-g45592145, I noticed the following in the boot log
      on one of my machines:
      
      io scheduler noop registered<6>Time: jiffies clocksource has been installed.
      
      io scheduler deadline registered (default)
      
      Looking at block/elevator.c, it appears that elv_register() uses two
      consecutive printks in a non-atomic way, leading to the above glitch.
      The attached trivial patch fixes this issue by using a single printk.
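
      The shape of the fix, as a sketch (is_default stands in for the
      existing default-scheduler check; not the exact patch): build the
      whole line and emit it with one printk() call, so another CPU's
      message cannot land between the scheduler name and the "(default)"
      tag.

      	printk(KERN_INFO "io scheduler %s registered%s\n",
      	       e->elevator_name, is_default ? " (default)" : "");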
      Signed-off-by: Thibaut VARENE <varenet@parisc-linux.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • block: blk_max_pfn is sometimes wrong · f772b3d9
      Committed by Vasily Tarasov
      There is a small problem in the handling of page bouncing.

      At the moment blk_max_pfn equals max_pfn, which is in fact not the
      maximum possible _number_ of a page frame, but the _count_ of page
      frames.  For example, on a 32-bit x86 node with 4GB RAM,
      max_pfn = 0x100000, while the highest valid frame number is 0xFFFFF.
      
      The request_queue structure has a member q->bounce_pfn, and the queue
      needs bounce pages for pages _above_ this limit.  This is handled by
      blk_queue_bounce(), which performs the following check:
      
      	if (q->bounce_pfn >= blk_max_pfn)
      		return;
      
      Assume a driver has set q->bounce_pfn to 0xFFFFF, but blk_max_pfn
      equals 0x100000.  In that situation the check above fails, and for
      every bio we always fall through to iterating over the pages tied to
      the bio.
      
      Note that for quite a wide range of device drivers (ide, md, ...)
      this problem doesn't occur, because they use BLK_BOUNCE_ANY for
      bounce_pfn.  BLK_BOUNCE_ANY is defined as blk_max_pfn << PAGE_SHIFT,
      so the check above doesn't fail.  But for drivers that obtain the
      required value from the device, it does fail.  For example, sata_nv
      uses ATA_DMA_MASK or dev->dma_mask.
      
      I propose using (max_pfn - 1) for blk_max_pfn, and the same for
      blk_max_low_pfn.  The patch also cleans up some checks related to
      bounce_pfn.
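
      The essence of the proposal, as a sketch (simplified; the real patch
      also adjusts the related bounce_pfn checks): a "max pfn" compared
      with >= against q->bounce_pfn must be the highest valid frame number,
      i.e. the frame count minus one.

      	blk_max_low_pfn = max_low_pfn - 1;
      	blk_max_pfn = max_pfn - 1;

      With that, a driver covering all of a 4GB machine sets
      q->bounce_pfn = 0xFFFFF, the comparison 0xFFFFF >= 0xFFFFF holds, and
      blk_queue_bounce() returns early as intended.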
      Signed-off-by: Vasily Tarasov <vtaras@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  14. February 21, 2007 (2 commits)
    • [PATCH] lockdep: annotate BLKPG_DEL_PARTITION · 6d740cd5
      Committed by Peter Zijlstra
      >=============================================
      >[ INFO: possible recursive locking detected ]
      >2.6.19-1.2909.fc7 #1
      >---------------------------------------------
      >anaconda/587 is trying to acquire lock:
      > (&bdev->bd_mutex){--..}, at: [<c05fb380>] mutex_lock+0x21/0x24
      >
      >but task is already holding lock:
      > (&bdev->bd_mutex){--..}, at: [<c05fb380>] mutex_lock+0x21/0x24
      >
      >other info that might help us debug this:
      >1 lock held by anaconda/587:
      > #0:  (&bdev->bd_mutex){--..}, at: [<c05fb380>] mutex_lock+0x21/0x24
      >
      >stack backtrace:
      > [<c0405812>] show_trace_log_lvl+0x1a/0x2f
      > [<c0405db2>] show_trace+0x12/0x14
      > [<c0405e36>] dump_stack+0x16/0x18
      > [<c043bd84>] __lock_acquire+0x116/0xa09
      > [<c043c960>] lock_acquire+0x56/0x6f
      > [<c05fb1fa>] __mutex_lock_slowpath+0xe5/0x24a
      > [<c05fb380>] mutex_lock+0x21/0x24
      > [<c04d82fb>] blkdev_ioctl+0x600/0x76d
      > [<c04946b1>] block_ioctl+0x1b/0x1f
      > [<c047ed5a>] do_ioctl+0x22/0x68
      > [<c047eff2>] vfs_ioctl+0x252/0x265
      > [<c047f04e>] sys_ioctl+0x49/0x63
      > [<c0404070>] syscall_call+0x7/0xb
      
      Annotate BLKPG_DEL_PARTITION's bd_mutex locking and add a little
      comment clarifying the bd_mutex lock ordering, because I confused
      myself and initially thought the lock order was wrong too.
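
      Illustratively, the annotation amounts to taking the partition's
      bd_mutex with a distinct lockdep subclass (the variable name here is
      assumed), since it legitimately nests inside the whole disk's
      bd_mutex:

      	/* partition lock nests inside the whole-disk lock */
      	mutex_lock_nested(&bdevp->bd_mutex, 1);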
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • [PATCH] rework reserved major handling · b446b60e
      Committed by Andrew Morton
      Several people have reported failures in dynamic major device number
      handling due to the recent changes there to avoid handing out the
      local/experimental majors.
      
      Rolf reports that this is due to a gcc-4.1.0 bug.
      
      The patch refactors that code a lot in an attempt to provoke the compiler into
      behaving.
      
      Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>