1. 25 10月, 2010 1 次提交
  2. 19 10月, 2010 1 次提交
    • Y
      block: fix accounting bug on cross partition merges · 7681bfee
      Yasuaki Ishimatsu 提交于
      /proc/diskstats would display a strange output as follows.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
      merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
      
      The detailed root cause is as follows.
      
      Assuming that there are two partition, sda1 and sda2.
      
      1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
         is 0 and sda2's one is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belongs to sda1 is issued and is merged into the request mentioned on
         step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
         from sda2 region to sda1 region. However the two partition's
         hd_struct->in_flight are not changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called. In this case,
         sda2's hd_struct->in_flight, not a sda1's one, is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
      
      When reloading partition tables, quiesce IO to ensure that no
      request references to the partition struct exists. When it is safe
      to free the partition table, the IO for that device is restarted
      again.
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      7681bfee
  3. 07 10月, 2010 1 次提交
    • J
      elevator: fix oops on early call to elevator_change() · 430c62fb
      Jens Axboe 提交于
      2.6.36 introduces an API for drivers to switch the IO scheduler
      instead of manually calling the elevator exit and init functions.
      This API was added since q->elevator must be cleared in between
      those two calls. And since we already have this functionality
      directly from use by the sysfs interface to switch schedulers
      online, it was prudent to reuse it internally too.
      
      But this API needs the queue to be in a fully initialized state
      before it is called, or it will attempt to unregister elevator
      kobjects before they have been added. This results in an oops
      like this:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000051
      IP: [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
      PGD 47ddfc067 PUD 47c6a1067 PMD 0
      Oops: 0000 [#1] PREEMPT SMP
      last sysfs file: /sys/devices/pci0000:00/0000:00:02.0/0000:04:00.1/irq
      CPU 2
      Modules linked in: t(+) loop hid_apple usbhid ahci ehci_hcd uhci_hcd libahci usbcore nls_base igb
      
      Pid: 7319, comm: modprobe Not tainted 2.6.36-rc6+ #132 QSSC-S4R/QSSC-S4R
      RIP: 0010:[<ffffffff8116f15e>]  [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
      RSP: 0018:ffff88027da25d08  EFLAGS: 00010246
      RAX: ffff88047c68c528 RBX: 00000000fffffffe RCX: 0000000000000000
      RDX: 000000000000002f RSI: 000000000000002f RDI: ffff88047e196c88
      RBP: ffff88027da25d38 R08: 0000000000000000 R09: d84156c5635688c0
      R10: d84156c5635688c0 R11: 0000000000000000 R12: ffff88047e196c88
      R13: 0000000000000000 R14: 0000000000000000 R15: ffff88047c68c528
      FS:  00007fcb0b26f6e0(0000) GS:ffff880287400000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000051 CR3: 000000047e76e000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process modprobe (pid: 7319, threadinfo ffff88027da24000, task ffff88027d377090)
      Stack:
       ffff88027da25d58 ffff88047c68c528 00000000fffffffe ffff88047e196c88
      <0> ffff88047c68c528 ffff88047e05bd90 ffff88027da25d78 ffffffff8123fb77
      <0> ffff88047e05bd90 0000000000000000 ffff88047e196c88 ffff88047c68c528
      Call Trace:
       [<ffffffff8123fb77>] kobject_add_internal+0xe7/0x1f0
       [<ffffffff8123fd98>] kobject_add_varg+0x38/0x60
       [<ffffffff8123feb9>] kobject_add+0x69/0x90
       [<ffffffff8116efe0>] ? sysfs_remove_dir+0x20/0xa0
       [<ffffffff8103d48d>] ? sub_preempt_count+0x9d/0xe0
       [<ffffffff8143de20>] ? _raw_spin_unlock+0x30/0x50
       [<ffffffff8116efe0>] ? sysfs_remove_dir+0x20/0xa0
       [<ffffffff8116eff4>] ? sysfs_remove_dir+0x34/0xa0
       [<ffffffff81224204>] elv_register_queue+0x34/0xa0
       [<ffffffff81224aad>] elevator_change+0xfd/0x250
       [<ffffffffa007e000>] ? t_init+0x0/0x361 [t]
       [<ffffffffa007e000>] ? t_init+0x0/0x361 [t]
       [<ffffffffa007e0a8>] t_init+0xa8/0x361 [t]
       [<ffffffff810001de>] do_one_initcall+0x3e/0x170
       [<ffffffff8108c3fd>] sys_init_module+0xbd/0x220
       [<ffffffff81002f2b>] system_call_fastpath+0x16/0x1b
      Code: e5 41 56 41 55 41 54 49 89 fc 53 48 83 ec 10 48 85 ff 74 52 48 8b 47 18 49 c7 c5 00 46 61 81 48 85 c0 74 04 4c 8b 68 30 45 31 f6 <41> 80 7d 51 00 74 0e 49 8b 44 24 28 4c 89 e7 ff 50 20 49 89 c6
      RIP  [<ffffffff8116f15e>] sysfs_create_dir+0x2e/0xc0
       RSP <ffff88027da25d08>
      CR2: 0000000000000051
      ---[ end trace a6541d3bf07945df ]---
      
      Fix this by adding a registered bit to the elevator queue, which is
      set when the sysfs kobjects have been registered.
      Signed-off-by: NJens Axboe <jaxboe@fusionio.com>
      430c62fb
  4. 23 8月, 2010 1 次提交
  5. 09 4月, 2010 1 次提交
    • D
      blkio: Add io_merged stat · 812d4026
      Divyesh Shah 提交于
      This includes both the number of bios merged into requests belonging to this
      cgroup as well as the number of requests merged together.
      In the past, we've observed different merging behavior across upstream kernels,
      some by design some actual bugs. This stat helps a lot in debugging such
      problems when applications report decreased throughput with a new kernel
      version.
      
      This needed adding an extra elevator function to capture bios being merged as I
      did not want to pollute elevator code with blkiocg knowledge and hence needed
      the accounting invocation to come from CFQ.
      
      Signed-off-by: Divyesh Shah<dpshah@google.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      812d4026
  6. 11 5月, 2009 2 次提交
    • T
      block: implement and enforce request peek/start/fetch · 9934c8c0
      Tejun Heo 提交于
      Till now block layer allowed two separate modes of request execution.
      A request is always acquired from the request queue via
      elv_next_request().  After that, drivers are free to either dequeue it
      or process it without dequeueing.  Dequeue allows elv_next_request()
      to return the next request so that multiple requests can be in flight.
      
      Executing requests without dequeueing has its merits mostly in
      allowing drivers for simpler devices which can't do sg to deal with
      segments only without considering request boundary.  However, the
      benefit this brings is dubious and declining while the cost of the API
      ambiguity is increasing.  Segment based drivers are usually for very
      old or limited devices and as converting to dequeueing model isn't
      difficult, it doesn't justify the API overhead it puts on block layer
      and its more modern users.
      
      Previous patches converted all block low level drivers to dequeueing
      model.  This patch completes the API transition by...
      
      * renaming elv_next_request() to blk_peek_request()
      
      * renaming blkdev_dequeue_request() to blk_start_request()
      
      * adding blk_fetch_request() which is combination of peek and start
      
      * disallowing completion of queued (not started) requests
      
      * applying new API to all LLDs
      
      Renamings are for consistency and to break out of tree code so that
      it's apparent that out of tree drivers need updating.
      
      [ Impact: block request issue API cleanup, no functional change ]
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Mike Miller <mike.miller@hp.com>
      Cc: unsik Kim <donari75@gmail.com>
      Cc: Paul Clements <paul.clements@steeleye.com>
      Cc: Tim Waugh <tim@cyberelk.net>
      Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Laurent Vivier <Laurent@lvivier.info>
      Cc: Jeff Garzik <jgarzik@pobox.com>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Grant Likely <grant.likely@secretlab.ca>
      Cc: Adrian McMenamin <adrian@mcmen.demon.co.uk>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
      Cc: Borislav Petkov <petkovbb@googlemail.com>
      Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
      Cc: Alex Dubov <oakad@yahoo.com>
      Cc: Pierre Ossman <drzeus@drzeus.cx>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Markus Lidel <Markus.Lidel@shadowconnect.com>
      Cc: Stefan Weinhuber <wein@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Pete Zaitcev <zaitcev@redhat.com>
      Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      9934c8c0
    • T
      block: drop request->hard_* and *nr_sectors · 2e46e8b2
      Tejun Heo 提交于
      struct request has had a few different ways to represent some
      properties of a request.  ->hard_* represent block layer's view of the
      request progress (completion cursor) and the ones without the prefix
      are supposed to represent the issue cursor and allowed to be updated
      as necessary by the low level drivers.  The thing is that as block
      layer supports partial completion, the two cursors really aren't
      necessary and only cause confusion.  In addition, manual management of
      request detail from low level drivers is cumbersome and error-prone at
      the very least.
      
      Another interesting duplicate fields are rq->[hard_]nr_sectors and
      rq->{hard_cur|current}_nr_sectors against rq->data_len and
      rq->bio->bi_size.  This is more convoluted than the hard_ case.
      
      rq->[hard_]nr_sectors are initialized for requests with bio but
      blk_rq_bytes() uses it only for !pc requests.  rq->data_len is
      initialized for all request but blk_rq_bytes() uses it only for pc
      requests.  This causes good amount of confusion throughout block layer
      and its drivers and determining the request length has been a bit of
      black magic which may or may not work depending on circumstances and
      what the specific LLD is actually doing.
      
      rq->{hard_cur|current}_nr_sectors represent the number of sectors in
      the contiguous data area at the front.  This is mainly used by drivers
      which transfers data by walking request segment-by-segment.  This
      value always equals rq->bio->bi_size >> 9.  However, data length for
      pc requests may not be multiple of 512 bytes and using this field
      becomes a bit confusing.
      
      In general, having multiple fields to represent the same property
      leads only to confusion and subtle bugs.  With recent block low level
      driver cleanups, no driver is accessing or manipulating these
      duplicate fields directly.  Drop all the duplicates.  Now rq->sector
      means the current sector, rq->data_len the current total length and
      rq->bio->bi_size the current segment length.  Everything else is
      defined in terms of these three and available only through accessors.
      
      * blk_recalc_rq_sectors() is collapsed into blk_update_request() and
        now handles pc and fs requests equally other than rq->sector update.
        This means that now pc requests can use partial completion too (no
        in-kernel user yet tho).
      
      * bio_cur_sectors() is replaced with bio_cur_bytes() as block layer
        now uses byte count as the primary data length.
      
      * blk_rq_pos() is now guranteed to be always correct.  In-block users
        converted.
      
      * blk_rq_bytes() is now guaranteed to be always valid as is
        blk_rq_sectors().  In-block users converted.
      
      * blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9.
        More convenient one is used.
      
      * blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const
        pointer to request.
      
      [ Impact: API cleanup, single way to represent one property of a request ]
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Boaz Harrosh <bharrosh@panasas.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      2e46e8b2
  7. 07 4月, 2009 1 次提交
  8. 29 12月, 2008 1 次提交
  9. 09 10月, 2008 2 次提交
  10. 18 12月, 2007 1 次提交
  11. 24 7月, 2007 1 次提交
  12. 20 12月, 2006 1 次提交
    • J
      [PATCH] cfq-iosched: don't allow sync merges across queues · da775265
      Jens Axboe 提交于
      Currently we allow any merge, even if the io originates from different
      processes. This can cause really bad starvation and unfairness, if those
      ios happen to be synchronous (reads or direct writes).
      
      So add a allow_merge hook to the io scheduler ops, so an io scheduler can
      help decide whether a bio/process combination may be merged with an
      existing request.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      da775265
  13. 01 12月, 2006 1 次提交
  14. 12 10月, 2006 1 次提交
  15. 01 10月, 2006 6 次提交
    • D
      [PATCH] BLOCK: Make it possible to disable the block layer [try #6] · 9361401e
      David Howells 提交于
      Make it possible to disable the block layer.  Not all embedded devices require
      it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
      the block layer to be present.
      
      This patch does the following:
      
       (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
           support.
      
       (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
           an item that uses the block layer.  This includes:
      
           (*) Block I/O tracing.
      
           (*) Disk partition code.
      
           (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
      
           (*) The SCSI layer.  As far as I can tell, even SCSI chardevs use the
           	 block layer to do scheduling.  Some drivers that use SCSI facilities -
           	 such as USB storage - end up disabled indirectly from this.
      
           (*) Various block-based device drivers, such as IDE and the old CDROM
           	 drivers.
      
           (*) MTD blockdev handling and FTL.
      
           (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
           	 taking a leaf out of JFFS2's book.
      
       (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
           linux/elevator.h contingent on CONFIG_BLOCK being set.  sector_div() is,
           however, still used in places, and so is still available.
      
       (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
           parts of linux/fs.h.
      
       (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
      
       (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
      
       (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
           is not enabled.
      
       (*) fs/no-block.c is created to hold out-of-line stubs and things that are
           required when CONFIG_BLOCK is not set:
      
           (*) Default blockdev file operations (to give error ENODEV on opening).
      
       (*) Makes some /proc changes:
      
           (*) /proc/devices does not list any blockdevs.
      
           (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
      
       (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
      
       (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
           given command other than Q_SYNC or if a special device is specified.
      
       (*) In init/do_mounts.c, no reference is made to the blockdev routines if
           CONFIG_BLOCK is not defined.  This does not prohibit NFS roots or JFFS2.
      
       (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
           error ENOSYS by way of cond_syscall if so).
      
       (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
           CONFIG_BLOCK is not set, since they can't then happen.
      Signed-Off-By: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9361401e
    • J
      [PATCH] elevator: define ioc counting mechanism · 4a893e83
      Jens Axboe 提交于
      None of the in-kernel primitives for handling "atomic" counting seem
      to be a good fit. We need something that is essentially free for
      incrementing/decrementing, while the read side may be more expensive
      as we only ever need to do that when a device is removed from the
      kernel.
      
      Use a per-cpu variable for maintaining a per-cpu ioc count and define
      a reading mechanism that just sums up the values.
      Signed-off-by: NJens Axboe <axboe@suse.de>
      4a893e83
    • J
      [PATCH] Drop useless bio passing in may_queue/set_request API · cb78b285
      Jens Axboe 提交于
      It's not needed for anything, so kill the bio passing.
      Signed-off-by: NJens Axboe <axboe@suse.de>
      cb78b285
    • J
      [PATCH] elevator: introduce a way to reuse rq for internal FIFO handling · 1fbfdfcd
      Jens Axboe 提交于
      The io schedulers can use this instead of having to allocate space for
      it themselves.
      Signed-off-by: NJens Axboe <axboe@suse.de>
      1fbfdfcd
    • J
      [PATCH] elevator: abstract out the rbtree sort handling · 2e662b65
      Jens Axboe 提交于
      The rbtree sort/lookup/reposition logic is mostly duplicated in
      cfq/deadline/as, so move it to the elevator core. The io schedulers
      still provide the actual rb root, as we don't want to impose any sort
      of specific handling on the schedulers.
      
      Introduce the helpers and rb_node in struct request to help migrate the
      IO schedulers.
      Signed-off-by: NJens Axboe <axboe@suse.de>
      2e662b65
    • J
      [PATCH] elevator: move the backmerging logic into the elevator core · 9817064b
      Jens Axboe 提交于
      Right now, every IO scheduler implements its own backmerging (except for
      noop, which does no merging). That results in duplicated code for
      essentially the same operation, which is never a good thing. This patch
      moves the backmerging out of the io schedulers and into the elevator
      core. We save 1.6kb of text and as a bonus get backmerging for noop as
      well. Win-win!
      Signed-off-by: NJens Axboe <axboe@suse.de>
      9817064b
  16. 09 6月, 2006 1 次提交
    • J
      [PATCH] elevator switching race · bc1c1169
      Jens Axboe 提交于
      There's a race between shutting down one io scheduler and firing up the
      next, in which a new io could enter and cause the io scheduler to be
      invoked with bad or NULL data.
      
      To fix this, we need to maintain the queue lock for a bit longer.
      Unfortunately we cannot do that, since the elevator init requires to be
      run without the lock held.  This isn't easily fixable, without also
      changing the mempool API.  So split the initialization into two parts,
      and alloc-init operation and an attach operation.  Then we can
      preallocate the io scheduler and related structures, and run the attach
      inside the lock after we detach the old one.
      
      This patch has survived 30 minutes of 1 second io scheduler switching
      with a very busy io load.
      Signed-off-by: NJens Axboe <axboe@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      bc1c1169
  17. 19 3月, 2006 3 次提交
  18. 08 2月, 2006 1 次提交
  19. 10 1月, 2006 1 次提交
  20. 09 1月, 2006 1 次提交
  21. 06 1月, 2006 1 次提交
  22. 28 10月, 2005 3 次提交
  23. 28 6月, 2005 1 次提交
    • J
      [PATCH] Update cfq io scheduler to time sliced design · 22e2c507
      Jens Axboe 提交于
      This updates the CFQ io scheduler to the new time sliced design (cfq
      v3).  It provides full process fairness, while giving excellent
      aggregate system throughput even for many competing processes.  It
      supports io priorities, either inherited from the cpu nice value or set
      directly with the ioprio_get/set syscalls.  The latter closely mimic
      set/getpriority.
      
      This import is based on my latest from -mm.
      Signed-off-by: NJens Axboe <axboe@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      22e2c507
  24. 17 4月, 2005 1 次提交
    • L
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds 提交于
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4