1. 01 Jul, 2008 1 commit
    • D
      block: Fix the starving writes bug in the anticipatory IO scheduler · d585d0b9
      Divyesh Shah committed
      AS scheduler alternates between issuing read and write batches. It does
      the batch switch only after all requests from the previous batch are
      completed.
      
      When switching to a write batch, if there is an on-going read request,
      it waits for its completion and indicates its intention of switching by
      setting ad->changed_batch and the new direction but does not update the
      batch_expire_time for the new write batch which it does in the case of
      no previous pending requests.
      On completion of the read request, it sees that we were waiting for the
      switch and schedules work for kblockd right away and resets the
      ad->changed_batch flag.
      Now when kblockd enters dispatch_request where it is expected to pick
      up a write request, it in turn ends the write batch because the
      batch_expire_time was not updated and still shows the expire timestamp
      for the previous batch.
      
      This results in the write starvation for all the cases where there is
      the intention for switching to a write batch, but there is a previous
      in-flight read request and the batch gets reverted to a read_batch
      right away.
      
      This also holds true in the reverse case (switching from a write batch
      to a read batch with an in-flight write request).
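
      The sequence described above can be replayed with a small userspace
      model. This is only a sketch: struct as_model, BATCH_LEN and all the
      function names are hypothetical and are not the kernel's data structures,
      and refreshing the expiry on request completion is an assumption about
      where a fix could go, not a statement of what the patch does. Timestamp
      wraparound is ignored.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of the scheduler state (not struct as_data). */
enum batch_dir { BATCH_READ, BATCH_WRITE };

struct as_model {
    enum batch_dir dir;       /* direction of the current batch */
    bool changed_batch;       /* a switch is pending on in-flight I/O */
    unsigned long expires;    /* time at which the current batch ends */
    unsigned int in_flight;   /* requests dispatched but not completed */
};

#define BATCH_LEN 10UL        /* assumed batch duration, in arbitrary ticks */

/* Request a switch to a new direction at time `now`. */
static void switch_batch(struct as_model *ad, enum batch_dir dir,
                         unsigned long now)
{
    ad->dir = dir;
    if (ad->in_flight == 0)
        ad->expires = now + BATCH_LEN;   /* new batch starts immediately */
    else
        ad->changed_batch = true;        /* expiry is NOT refreshed here */
}

/* Complete one request; `fixed` selects the corrected behaviour. */
static void complete_request(struct as_model *ad, unsigned long now, bool fixed)
{
    ad->in_flight--;
    if (ad->changed_batch && ad->in_flight == 0) {
        ad->changed_batch = false;
        if (fixed)
            ad->expires = now + BATCH_LEN;   /* restart the clock */
    }
}

/* True if the current batch has run out of time at `now`. */
static bool batch_expired(const struct as_model *ad, unsigned long now)
{
    return now >= ad->expires;
}

/* Replay: the read batch expired at t=10 and one read is still in flight. */
static bool write_batch_survives(bool fixed)
{
    struct as_model ad = { BATCH_READ, false, 10, 1 };

    switch_batch(&ad, BATCH_WRITE, 20);   /* intent to switch is recorded */
    complete_request(&ad, 25, fixed);     /* the in-flight read completes */
    return ad.dir == BATCH_WRITE && !batch_expired(&ad, 25);
}
```

      In this model, write_batch_survives(false) reports that the write batch
      is already expired when the dispatcher first looks at it, because it
      inherited the stale read-batch timestamp; with fixed set, the batch gets
      its full duration.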
      
      I've checked that this bug exists on 2.6.11, 2.6.18, 2.6.24 and
      linux-2.6-block git HEAD. I've tested the fix on x86 platforms with
      SCSI drives where the driver asks for the next request while a current
      request is in-flight.
      
      This patch is based off linux-2.6-block git HEAD.
      
      Bug reproduction:
      A simple scenario which reproduces this bug is:
      - dd if=/dev/hda3 of=/dev/null &
      - lilo
         The lilo takes forever to complete.
      
      This can also be reproduced fairly easily with the earlier dd and
      another test program doing msync().
      
      The example test program below should print out a message after every
      iteration, but it simply hangs forever. With this bugfix it makes
      forward progress.
      
      ====
      Example test program using msync() (thanks to suleiman AT google DOT com)
      
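      #define _GNU_SOURCE             /* for O_NOATIME */
      #include <err.h>
      #include <fcntl.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      
      /* Read the x86 time-stamp counter (works on 32-bit and 64-bit x86). */
      static inline uint64_t
      rdtsc(void)
      {
              uint32_t lo, hi;
      
              __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
              return (((uint64_t)hi << 32) | lo);
      }
      
      int
      main(int argc, char **argv)
      {
              struct stat st;
              uint64_t e, s, t;
              char *p;
              long i;
              int fd;
      
              if (argc < 2) {
                      printf("Usage: %s <file>\n", argv[0]);
                      return (1);
              }
      
              if ((fd = open(argv[1], O_RDWR | O_NOATIME)) < 0)
                      err(1, "open");
      
              if (fstat(fd, &st) < 0)
                      err(1, "fstat");
      
              /* Map the whole file shared so that stores dirty the page cache. */
              p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                  MAP_SHARED, fd, 0);
              if (p == MAP_FAILED)
                      err(1, "mmap");
      
              t = 0;
              for (i = 0; i < 1000; i++) {
                      /* Dirty the first page and force it out synchronously. */
                      *p = 0;
                      msync(p, 4096, MS_SYNC);
                      /* Time a second store to the same page. */
                      s = rdtsc();
                      *p = 0;
                      __asm__ __volatile__("" ::: "memory");
                      e = rdtsc();
                      /* Any extra argument enables per-iteration output. */
                      if (argc > 2)
                              printf("%ld: %ju cycles %ju %ju\n",
                                     i, (uintmax_t)(e - s), (uintmax_t)s,
                                     (uintmax_t)e);
                      t += e - s;
              }
              printf("average time: %ju cycles\n", (uintmax_t)(t / 1000));
              return (0);
      }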
      
      Cc: <stable@kernel.org>
      Acked-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  2. 13 Jun, 2008 1 commit
  3. 10 Jun, 2008 1 commit
    • L
      Fix invalid access errors in blk_lookup_devt · d5791d13
      Linus Torvalds committed
      Commit 30f2f0eb ("block: do_mounts - accept root=<non-existant
      partition>") extended blk_lookup_devt() to be able to look up partitions
      that had not yet been registered, but in the process made the assumption
      that the '&block_class.devices' list only contains disk devices and that
      you can do 'dev_to_disk(dev)' on them.
      
      That isn't actually true.  The block_class device list also contains the
      partitions we've discovered so far, and you can't just do a
      'dev_to_disk()' on those.
      
      So make sure to only work on devices that block/genhd.c has registered
      itself, something we can test by checking the 'dev->type' member.  This
      makes the loop in blk_lookup_devt() match the other such loops in this
      file.
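
      The type check described above can be sketched in userspace as follows.
      All names here (gendisk_model, partition_model, dev_to_disk_model,
      find_disk) are hypothetical stand-ins rather than the kernel's
      definitions; the only point is that the container cast is performed
      solely on entries whose type pointer identifies them as disks.

```c
#include <stddef.h>

struct device_type { const char *name; };
struct device { const struct device_type *type; };

/* Two different containers embed a struct device at different offsets. */
struct gendisk_model { int id; struct device dev; };
struct partition_model { struct device dev; int partno; };

static const struct device_type disk_type = { "disk" };
static const struct device_type part_type = { "partition" };

/* Only valid when `d` is really embedded in a struct gendisk_model. */
#define dev_to_disk_model(d) \
    ((struct gendisk_model *)((char *)(d) - offsetof(struct gendisk_model, dev)))

/* Walk a mixed list and return the disk with the given id, or NULL. */
static struct gendisk_model *find_disk(struct device **devs, size_t n, int id)
{
    for (size_t i = 0; i < n; i++) {
        if (devs[i]->type != &disk_type)
            continue;                     /* skip partitions: cast is invalid */
        struct gendisk_model *disk = dev_to_disk_model(devs[i]);
        if (disk->id == id)
            return disk;
    }
    return NULL;
}
```

      Without the type test, the macro would compute an address in front of a
      partition_model object and read unrelated memory, which is the kind of
      invalid access the message describes.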
      
      [ We may want to do an alternate version that knows to handle _either_
        whole-disk devices or partitions, but for now this is the minimal fix
        for a series of crashes reported by Mariusz Kozlowski in
      
      	http://lkml.org/lkml/2008/5/25/25
      
        and Ingo in
      
      	http://lkml.org/lkml/2008/6/9/39 ]
      Reported-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl>
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Joao Luis Meloni Assirati <assirati@nonada.if.usp.br>
      Acked-by: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 28 May, 2008 6 commits
  5. 15 May, 2008 2 commits
    • N
      Remove blkdev warning triggered by using md · e7e72bf6
      Neil Brown committed
      As setting and clearing queue flags now requires that we hold a spinlock
      on the queue, and as blk_queue_stack_limits is called without that lock,
      get the lock inside blk_queue_stack_limits.
      
      For blk_queue_stack_limits to be able to find the right lock, each md
      personality needs to set q->queue_lock to point to the appropriate lock.
      Those personalities which didn't previously use a spin_lock, use
      q->__queue_lock.  So always initialise that lock when allocated.
      
      With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
      longer cause warnings as it will be clear that the proper lock is held.
      
      Thanks to Dan Williams for review and fixing the silly bugs.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Alistair John Strachan <alistair@devzero.co.uk>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Jacek Luczak <difrost.kernel@gmail.com>
      Cc: Prakash Punnoor <prakash@punnoor.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      block: do_mounts - accept root=<non-existant partition> · 30f2f0eb
      Kay Sievers committed
      Some devices, like md, may create partitions only at first access,
      so allow root= to be set to a valid non-existent partition of an
      existing disk. This applies only to non-initramfs root mounting.
      
      This fixes a regression from 2.6.24 which did allow this to happen and
      broke some users' machines :(
      Acked-by: Neil Brown <neilb@suse.de>
      Tested-by: Joao Luis Meloni Assirati <assirati@nonada.if.usp.br>
      Cc: stable <stable@kernel.org>
      Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  6. 13 May, 2008 1 commit
  7. 07 May, 2008 7 commits
  8. 03 May, 2008 1 commit
  9. 02 May, 2008 1 commit
  10. 01 May, 2008 1 commit
  11. 30 Apr, 2008 1 commit
  12. 29 Apr, 2008 11 commits
  13. 23 Apr, 2008 1 commit
    • F
      [SCSI] bsg: add release callback support · 97f46ae4
      FUJITA Tomonori committed
      This patch adds release callback support, which is called when a bsg
      device goes away. bsg_register_queue() takes a pointer to a callback
      function. This feature is useful for stuff like sas_host that can't
      use the release callback in struct device.
      
      If a caller doesn't need bsg's release callback, it can call
      bsg_register_queue() with NULL pointer (e.g. scsi devices can use
      release callback in struct device so they don't need bsg's callback).
      
      With this patch, bsg uses kref for refcounts on bsg devices instead of
      get/put_device in fops->open/release. bsg calls put_device and the
      caller's release callback (if it was registered) in kref_put's
      release.
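
      The lifetime rule in the last paragraph can be illustrated with a
      minimal userspace sketch. The names (bsg_model and its helpers) are
      hypothetical, the counter is a plain integer rather than the kernel's
      kref, and nothing here is thread-safe; it only shows the ordering: the
      final put performs the device put and then the optional caller callback.

```c
#include <stddef.h>

/* Hypothetical stand-in for a bsg device with an optional release callback. */
struct bsg_model {
    unsigned int refcount;          /* plain counter; kref would be atomic */
    void (*release)(void *owner);   /* caller's callback, may be NULL */
    void *owner;                    /* argument passed to the callback */
    int device_put;                 /* set once the "put_device" step ran */
};

/* Register with one initial reference; `release` may be NULL. */
static void bsg_model_init(struct bsg_model *b,
                           void (*release)(void *), void *owner)
{
    b->refcount = 1;
    b->release = release;
    b->owner = owner;
    b->device_put = 0;
}

/* Take an extra reference, e.g. on open. */
static void bsg_model_get(struct bsg_model *b)
{
    b->refcount++;
}

/* Drop a reference; returns 1 if this call released the object. */
static int bsg_model_put(struct bsg_model *b)
{
    if (--b->refcount != 0)
        return 0;
    b->device_put = 1;              /* stands in for put_device() */
    if (b->release != NULL)
        b->release(b->owner);       /* caller's release callback, if any */
    return 1;
}

/* Example callback: count how many times release ran. */
static void count_release(void *owner)
{
    ++*(int *)owner;
}
```

      A caller that has no use for the callback passes NULL to
      bsg_model_init(), mirroring the NULL case described for
      bsg_register_queue().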
      Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
  14. 21 Apr, 2008 4 commits
    • A
      block: fix blk_register_queue() return value · fb199746
      Akinobu Mita committed
      blk_register_queue() returns -ENXIO when queue->request_fn is NULL.  But there
      are some block drivers that call blk_register_queue() via add_disk() with
      queue->request_fn == NULL.  (For example, brd, loop)
      
      Although no one checks the return value of blk_register_queue(), this patch
      makes it return 0 instead of -ENXIO when queue->request_fn is NULL.
      
      This patch also adds a warning when blk_register_queue() and
      blk_unregister_queue() are called with queue == NULL, rather than silently
      ignoring the invalid usage.
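
      As a rough illustration of the new return-value behaviour, here is a
      userspace sketch. queue_model and register_queue_model() are
      hypothetical names, and the error code returned for a NULL queue is an
      assumption of this sketch, since the message only says that a warning is
      added for that case.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a request queue. */
struct queue_model {
    void (*request_fn)(void);   /* NULL for drivers without a request_fn */
};

static int register_queue_model(const struct queue_model *q)
{
    if (q == NULL) {
        /* Invalid usage is reported instead of being silently ignored. */
        fprintf(stderr, "warning: register_queue_model(NULL)\n");
        return -ENXIO;          /* assumed error code for this sketch */
    }

    if (q->request_fn == NULL)
        return 0;               /* nothing to register, but not an error */

    /* ... registration work for request-based queues would go here ... */
    return 0;
}
```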
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • N
      Kconfig: clean up block/Kconfig help descriptions · ee86418d
      Nick Andrew committed
      Modify the help descriptions of block/Kconfig for clarity, accuracy and
      consistency.
      
      Refactor the BLOCK description a bit.  The wording "This permits ...  to be
      removed" isn't quite right; the block layer is removed when the option is
      disabled, whereas most descriptions talk about what happens when the option is
      enabled.  Reformat the list of what is affected by disabling the block layer.
      
      Add more examples of large block devices to LBD and strive for technical
      accuracy; block devices of size _exactly_ 2TB require CONFIG_LBD, not only
      "bigger than 2TB".  Also try to say (perhaps not very clearly) that the config
      option is only needed when you want to have individual block devices of size
      >= 2TB, for example if you had 3 x 1TB disks in your computer you'd have a
      total storage size of 3TB but you wouldn't need the option unless you want to
      aggregate those disks into a RAID or LVM.
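
      The "exactly 2TB" boundary follows from simple arithmetic, sketched
      below. The sketch assumes 512-byte sectors, a 32-bit sector count when
      the option is disabled, and binary units (2TB taken as 2^41 bytes);
      needs_64bit_sectors() is a hypothetical helper, not a kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE 512ULL   /* assumed sector size in bytes */

/* True if the device's sector count does not fit in 32 bits. */
static bool needs_64bit_sectors(uint64_t device_bytes)
{
    uint64_t sectors = device_bytes / SECTOR_SIZE;

    /* 2^41 bytes / 512 = 2^32 sectors, one more than UINT32_MAX. */
    return sectors > UINT32_MAX;
}
```

      Under these assumptions a single 1TB disk gives 2^31 sectors and fits,
      while a 3TB volume built from three such disks does not, which matches
      the RAID/LVM example in the text.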
      
      Improve terminology and grammar on BLK_DEV_IO_TRACE.
      
      I also added the boilerplate "If unsure, say N" to most options.
      
      Precisely say "2TB and larger" for LSF.
      
      Indent the help text for BLK_DEV_BSG by 2 spaces in accordance with the
      standard.
      Signed-off-by: Nick Andrew <nick@nick-andrew.net>
      Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • F
      block: move the padding adjustment to blk_rq_map_sg · f18573ab
      FUJITA Tomonori committed
      blk_rq_map_user adjusts bi_size of the last bio. It breaks the rule
      that req->data_len (the true data length) is equal to sum(bio). It
      broke the scsi command completion code.
      
      Commit e97a294e was introduced to fix the above issue. However, the
      partial completion code doesn't work with it. The commit is also a layer
      violation (scsi mid-layer should not know about the block layer's
      padding).
      
      This patch moves the padding adjustment to blk_rq_map_sg (suggested by
      James). The padding works like the drain buffer. This patch breaks the
      rule that req->data_len is equal to sum(sg); however, the drain buffer
      already broke it. So this patch just restores the rule that
      req->data_len is equal to sum(bio) without breaking anything new.
      
      Now when a low level driver needs padding, blk_rq_map_user and
      blk_rq_map_user_iov guarantee there's enough room for padding.
      blk_rq_map_sg can safely extend the last entry of a scatter list.
      
      blk_rq_map_sg must extend the last entry of a scatter list only for a
      request that got through bio_copy_user_iov. This patch introduces a
      new REQ_COPY_USER flag to identify such requests.
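
      A minimal userspace sketch of the padding step follows. The names
      (sg_model, pad_last_entry, pad_mask, copy_user) are hypothetical, and
      the rounding rule, which pads the length up to the next multiple of
      pad_mask + 1, is an assumption about how a power-of-two padding mask
      would be applied, not a copy of the kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one scatter list entry. */
struct sg_model {
    unsigned int length;
};

/*
 * Extend the last entry so the transfer length becomes a multiple of
 * (pad_mask + 1). Only requests flagged as copied (copy_user) are padded,
 * because only those are known to have spare room behind the data.
 * Returns the number of padding bytes added.
 */
static unsigned int pad_last_entry(struct sg_model *sg, size_t nents,
                                   unsigned int data_len,
                                   unsigned int pad_mask, bool copy_user)
{
    unsigned int rem = data_len & pad_mask;

    if (!copy_user || nents == 0 || rem == 0)
        return 0;                        /* nothing to do, or not safe */

    unsigned int pad = (pad_mask + 1) - rem;
    sg[nents - 1].length += pad;         /* the data length stays untouched */
    return pad;
}
```

      Note that only the scatter list grows; the caller's data length is left
      alone, which is how the sum of the bios can stay equal to the true data
      length while the sum of the scatter list entries is larger.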
      Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
    • F
      block: add bio_copy_user_iov support to blk_rq_map_user_iov · afdc1a78
      FUJITA Tomonori committed
      With this patch, blk_rq_map_user_iov uses bio_copy_user_iov when a low
      level driver needs padding or a buffer in sg_iovec isn't aligned. That
      is, it uses temporary kernel buffers instead of mapping user pages
      directly.
      
      When a LLD needs padding, later blk_rq_map_sg needs to extend the last
      entry of a scatter list. bio_copy_user_iov guarantees that there is
      enough space for padding by using temporary kernel buffers instead of
      user pages.
      
      blk_rq_map_user_iov needs buffers in sg_iovec to be aligned. The
      comment in blk_rq_map_user_iov indicates that drivers/scsi/sg.c also
      needs buffers in sg_iovec to be aligned. Actually, drivers/scsi/sg.c
      works with unaligned buffers in sg_iovec (it always uses temporary
      kernel buffers).
      Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  15. 20 Apr, 2008 1 commit