1. March 23, 2013 (2 commits)
  2. February 28, 2013 (8 commits)
    • hlist: drop the node parameter from iterators · b67bfe0d
      Committed by Sasha Levin
      I'm not sure why, but the hlist for each entry iterators were conceived
      differently from the list ones.  While the list ones are nice and
      elegant:
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter?  I'm not quite sure.  Not only
      do they not really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
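
      To make the change concrete, here is a minimal before/after sketch of a
      call site (struct item, n and do_something are hypothetical names):

      Before, with the extra node parameter:

              struct hlist_node *n;
              struct item *pos;

              hlist_for_each_entry(pos, n, head, list)
                      do_something(pos);

      After, matching list_for_each_entry():

              struct item *pos;

              hlist_for_each_entry(pos, head, list)
                      do_something(pos);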
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small number of places were using the 'node' parameter; these
       were modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foundation.org: redo intrusive kvm changes]
      Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
    • block/partitions: optimize memory allocation in check_partition() · ac2e5327
      Committed by Ming Lei
      Currently, sizeof(struct parsed_partitions) may be 64KB on a 32-bit
      arch, so it is easy for check_partition() to trigger a page allocation
      failure, especially in block device hotplug situations (such as USB
      mass storage or MMC cards), and Felipe Balbi has observed the failure.
      
      This patch makes the following optimizations to the allocation of
      struct parsed_partitions to address the issue (sketched in code below):
      
       - make parsed_partitions.parts a pointer so that the partition array
         can be allocated separately, saving approximately 32KB in the
         structure itself

       - vmalloc the buffer pointed to by parsed_partitions.parts, because
         32KB is still a bit large for kmalloc

       - given that many devices have a partition count limit, allocate only
         disk_max_parts() partitions instead of always allocating 256
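
      A rough sketch of the resulting allocation path (variable names are
      illustrative; exact bounds are per the actual patch):

              struct parsed_partitions *state;
              int nr;

              state = kzalloc(sizeof(*state), GFP_KERNEL);
              if (!state)
                      return NULL;

              /* allocate only as many entries as the disk can have, and use
               * vmalloc because tens of KB is uncomfortable for kmalloc */
              nr = disk_max_parts(hd);
              state->parts = vzalloc(nr * sizeof(state->parts[0]));
              if (!state->parts) {
                      kfree(state);
                      return NULL;
              }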
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Reported-by: Felipe Balbi <balbi@ti.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac2e5327
    • block/partitions/mac.c: obey the state->limit constraint · 06004e6e
      Committed by Ming Lei
      It isn't necessary to read information for partitions whose number is
      greater than or equal to state->limit, since at most state->limit
      partitions will be added inside rescan_partitions().

      That is also what the other partition parsers do.
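
      Roughly, the map-reading loop gains state->limit as an additional bound
      (a sketch, not the verbatim patch):

              for (slot = 1; slot <= blocks_in_map && slot < state->limit; ++slot) {
                      /* read and record partition map entry 'slot' */
              }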
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06004e6e
    • block/partitions/efi.c: ensure that the GPT header is at least the size of the structure. · 8b8a6e18
      Committed by Peter Jones
      UEFI 2.3.1D will include a change to the spec language mandating that a
      GPT header must be greater than *or equal to* the size of the defined
      structure.  While verifying that this would work on Linux, I discovered
      that we're not actually checking the minimum bound at all.
      
      The result of this is that when we verify the checksum, it's possible that
      on a malformed header (with header_size of 0), we won't actually verify
      any data.
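
      The added lower-bound check looks roughly like this (a sketch; the real
      patch also prints a debug message with the offending size):

              if (le32_to_cpu(gpt->header_size) < sizeof(gpt_header))
                      goto fail;      /* header too small to be valid */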
      
      [akpm@linux-foundation.org: fix printk warning]
      Signed-off-by: Peter Jones <pjones@redhat.com>
      Acked-by: Matt Fleming <matt.fleming@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Stephen Warren <swarren@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b8a6e18
    • block/partition/msdos: detect AIX formatted disks even without 55aa · 86ee8ba6
      Committed by Philippe De Muyter
      AIX formatted disks do not always have the MSDOS 55aa signature.
      This happens e.g. for unbootable AIX disks.
      
      Up to now, such disks were not recognized as AIX disks, because of the
      missing 55aa.  Fix that by inverting the two tests.  Let's first
      check for the AIX magic strings, and only if that fails check for
      the MSDOS magic word.
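
      A sketch of the reordered tests in msdos_partition()
      (aix_magic_present() and msdos_magic_present() are the existing
      helpers):

              /* AIX disks may lack the 0x55aa signature, so test AIX first */
              if (aix_magic_present(state, data)) {
                      put_dev_sector(sect);
                      strlcat(state->pp_buf, " [AIX]", PAGE_SIZE);
                      return 0;
              }

              if (!msdos_magic_present(data + 510)) {
                      put_dev_sector(sect);
                      return 0;
              }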
      Signed-off-by: Philippe De Muyter <phdm@macqel.be>
      Cc: Andreas Mohr <andi@lisas.de>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Olaf Hering <olh@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86ee8ba6
    • block: convert to idr_alloc() · bab998d6
      Committed by Tejun Heo
      Convert to the much saner new idr interface.  Both bsg and genhd
      protect the idr with a mutex, making preloading unnecessary.
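
      For example, the bsg side becomes roughly (a sketch; bsg_mutex already
      serializes allocation, lookup and removal, so no idr_preload() is
      needed):

              mutex_lock(&bsg_mutex);
              ret = idr_alloc(&bsg_minor_idr, bcd, 0, BSG_MAX_DEVS, GFP_KERNEL);
              mutex_unlock(&bsg_mutex);
              if (ret < 0)
                      return ret;     /* -ENOSPC when the minor space is full */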
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bab998d6
    • block: fix synchronization and limit check in blk_alloc_devt() · ce23bba8
      Committed by Tejun Heo
      idr allocation in blk_alloc_devt() wasn't synchronized against lookup
      and removal, and its limit check was off by one - 1 << MINORBITS is
      the number of minors allowed, not the maximum allowed minor.
      
      Add locking and rename MAX_EXT_DEVT to NR_EXT_DEVT and fix limit
      checking.
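
      After the fix (and the idr_alloc() conversion above), the allocation
      path looks roughly like this sketch, with NR_EXT_DEVT as an exclusive
      bound:

              #define NR_EXT_DEVT     (1 << MINORBITS)  /* count, not max value */

              mutex_lock(&ext_devt_mutex);
              idx = idr_alloc(&ext_devt_idr, part, 0, NR_EXT_DEVT, GFP_KERNEL);
              mutex_unlock(&ext_devt_mutex);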
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce23bba8
    • block: fix ext_devt_idr handling · 7b74e912
      Committed by Tomas Henzl
      While adding and removing a lot of disks and partitions this
      sometimes shows up:
      
        WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted)
        Hardware name:
        sysfs: cannot create duplicate filename '/dev/block/259:751'
        Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan]
        Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1
        Call Trace:
          warn_slowpath_common+0x87/0xc0
          warn_slowpath_fmt+0x46/0x50
          sysfs_add_one+0xc9/0x130
          sysfs_do_create_link+0x12b/0x170
          sysfs_create_link+0x13/0x20
          device_add+0x317/0x650
          idr_get_new+0x13/0x50
          add_partition+0x21c/0x390
          rescan_partitions+0x32b/0x470
          sd_open+0x81/0x1f0 [sd_mod]
          __blkdev_get+0x1b6/0x3c0
          blkdev_get+0x10/0x20
          register_disk+0x155/0x170
          add_disk+0xa6/0x160
          sd_probe_async+0x13b/0x210 [sd_mod]
          add_wait_queue+0x46/0x60
          async_thread+0x102/0x250
          default_wake_function+0x0/0x20
          async_thread+0x0/0x250
          kthread+0x96/0xa0
          child_rip+0xa/0x20
          kthread+0x0/0xa0
          child_rip+0x0/0x20
      
      This most likely happens because the dev_t is freed while the number is
      still in use and idr_get_new() is not protected on every use.  The fix
      adds a mutex where it wasn't before and moves the dev_t free function
      so that it is called after device_del().
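
      A sketch of the freeing side after the change (blk_mangle_minor() and
      ext_devt_idr are the existing genhd.c names; the mutex is the new
      addition):

              /* called only after device_del(), so the number cannot be
               * reused while sysfs entries for it still exist */
              mutex_lock(&ext_devt_mutex);
              idr_remove(&ext_devt_idr, blk_mangle_minor(MINOR(devt)));
              mutex_unlock(&ext_devt_mutex);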
      Signed-off-by: Tomas Henzl <thenzl@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b74e912
  3. February 24, 2013 (1 commit)
    • block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices · 25e823c8
      Committed by Ming Lei
      Apply the newly introduced pm_runtime_set_memalloc_noio() to block
      devices so that the PM core will teach mm not to allocate memory with
      GFP_IOFS when calling the runtime_resume and runtime_suspend callbacks
      for block devices and their ancestors.
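
      The call itself is a one-liner in the disk registration path (a
      sketch):

              /* all runtime PM callbacks for this device and its ancestors
               * will now run with the memalloc_noio flag set */
              pm_runtime_set_memalloc_noio(disk_to_dev(disk), true);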
      Signed-off-by: Ming Lei <ming.lei@canonical.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Oliver Neukum <oneukum@suse.de>
      Cc: Jiri Kosina <jiri.kosina@suse.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Greg KH <greg@kroah.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Decotigny <david.decotigny@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25e823c8
  4. February 22, 2013 (4 commits)
    • cfq: fix lock imbalance with failed allocations · a3cc86c2
      Committed by Glauber Costa
      While stress-running very-small container scenarios with the Kernel Memory
      Controller, I've run into a lockdep-detected lock imbalance in
      cfq-iosched.c.
      
      I'll apologize beforehand for not posting a backlog: I didn't
      anticipate it would be so hard to reproduce, so I didn't save my serial
      output and went straight to debugging.  Turns out that it did not
      happen again in more than 20 runs, making it a quite rare pattern.
      
      But here is my analysis:
      
      When we are in very low-memory situations, we will arrive at
      cfq_find_alloc_queue and may not find a queue, having to resort to the oom
      queue, in an rcu-locked condition:
      
        if (!cfqq || cfqq == &cfqd->oom_cfqq)
            [ ... ]
      
      Next, we will release the rcu lock, and try to allocate a queue, retrying
      if we succeed:
      
        rcu_read_unlock();
        spin_unlock_irq(cfqd->queue->queue_lock);
        new_cfqq = kmem_cache_alloc_node(cfq_pool,
                        gfp_mask | __GFP_ZERO,
                        cfqd->queue->node);
        spin_lock_irq(cfqd->queue->queue_lock);
        if (new_cfqq)
            goto retry;
      
      We are unlocked at this point, but it should be fine, since we will
      reacquire the rcu_read_lock when we retry.
      
      Except of course, that we may not retry: the allocation may very well fail
      and we'll keep on going through the flow:
      
      The next branch is:
      
          if (cfqq) {
              [ ... ]
          } else
              cfqq = &cfqd->oom_cfqq;
      
      And right before exiting, we'll issue rcu_read_unlock().
      
      Being already unlocked, this is the likely source of our imbalance.
      Since cfqq is either already NULL or made NULL in the first statement
      of the outer branch, the only viable alternative here seems to be to
      return the oom queue right away in case of allocation failure.
      
      Please review the following patch and apply if you agree with my analysis.
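
      The essence of the fix is to return the oom queue immediately when the
      allocation fails, instead of falling through to the final
      rcu_read_unlock(); a sketch:

              new_cfqq = kmem_cache_alloc_node(cfq_pool,
                              gfp_mask | __GFP_ZERO,
                              cfqd->queue->node);
              spin_lock_irq(cfqd->queue->queue_lock);
              if (new_cfqq)
                      goto retry;
              else
                      return &cfqd->oom_cfqq;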
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a3cc86c2
    • block: don't select PERCPU_RWSEM · 79d0b7f0
      Committed by Mikulas Patocka
      The block device code doesn't use the percpu rw-semaphore anymore, so
      don't select it for compilation.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      79d0b7f0
    • block: optionally snapshot page contents to provide stable pages during write · ffecfd1a
      Committed by Darrick J. Wong
      This provides a band-aid to provide stable page writes on jbd without
      needing to backport the fixed locking and page writeback bit handling
      schemes of jbd2.  The band-aid works by using bounce buffers to snapshot
      page contents instead of waiting.
      
      For those wondering about the ext3 bandage -- fixing the jbd locking
      (which was done as part of ext4dev years ago) is a lot of surgery, and
      setting PG_writeback on data pages when we actually hold the page lock
      dropped ext3 performance by nearly an order of magnitude.  If we're
      going to migrate iscsi and raid to use stable page writes, the
      complaints about high latency will likely return.  We might as well
      centralize their page snapshotting thing to one place.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Tested-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ffecfd1a
    • bdi: allow block devices to say that they require stable page writes · 7d311cda
      Committed by Darrick J. Wong
      This patchset ("stable page writes, part 2") makes some key
      modifications to the original 'stable page writes' patchset.  First, it
      provides creators (devices and filesystems) of a backing_dev_info a flag
      that declares whether or not it is necessary to ensure that page
      contents cannot change during writeout.  It is no longer assumed that
      this is true of all devices (which was never true anyway).  Second, the
      flag is used to relax the wait_on_page_writeback calls so that waiting
      only occurs if the device needs it.  Third, it fixes up the remaining
      disk-backed filesystems to use this improved conditional-wait logic to
      provide stable page writes on those filesystems.
      
      It is hoped that (for people not using checksumming devices, anyway)
      this patchset will give back unnecessary performance decreases since the
      original stable page write patchset went into 3.0.  Sorry about not
      fixing it sooner.
      
      Complaints were registered by several people about the long write
      latencies introduced by the original stable page write patchset.
      Generally speaking, the kernel ought to allocate as little extra memory
      as possible to facilitate writeout, but for people who simply cannot
      wait, a second page stability strategy is (re)introduced: snapshotting
      page contents.  The waiting behavior is still the default strategy; to
      enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be
      set.  This flag is used to bandaid^Henable stable page writeback on
      ext3[1], and is not used anywhere else.
      
      Given that there are already a few storage devices and network FSes that
      have rolled their own page stability wait/page snapshot code, it would
      be nice to move towards consolidating all of these.  It seems possible
      that iscsi and raid5 may wish to use the new stable page write support
      to enable zero-copy writeout.
      
      Thank you to Jan Kara for helping fix a couple more filesystems.
      
      Per Andrew Morton's request, here are the results of using dbench to
      measure latencies on ext2:
      
      3.8.0-rc3:
         Operation      Count    AvgLat    MaxLat
         ----------------------------------------
         WriteX        109347     0.028    59.817
         ReadX         347180     0.004     3.391
         Flush          15514    29.828   287.283
      
        Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms
      
      3.8.0-rc3 + patches:
         WriteX        105556     0.029     4.273
         ReadX         335004     0.005     4.112
         Flush          14982    30.540   298.634
      
        Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms
      
      As you can see, for ext2 the maximum write latency decreases from ~60ms
      on a laptop hard disk to ~4ms.  I'm not sure why the flush latencies
      increase, though I suspect that being able to dirty pages faster gives
      the flusher more work to do.
      
      On ext4, the average write latency decreases as well as all the maximum
      latencies:
      
      3.8.0-rc3:
         WriteX         85624     0.152    33.078
         ReadX         272090     0.010    61.210
         Flush          12129    36.219   168.260
      
        Throughput 44.8618 MB/sec  4 clients  4 procs  max_latency=168.276 ms
      
      3.8.0-rc3 + patches:
         WriteX         86082     0.141    30.928
         ReadX         273358     0.010    36.124
         Flush          12214    34.800   165.689
      
        Throughput 44.9941 MB/sec  4 clients  4 procs  max_latency=165.722 ms
      
      XFS seems to exhibit similar latency improvements as ext2:
      
      3.8.0-rc3:
         WriteX        125739     0.028   104.343
         ReadX         399070     0.005     4.115
         Flush          17851    25.004   131.390
      
        Throughput 66.0024 MB/sec  4 clients  4 procs  max_latency=131.406 ms
      
      3.8.0-rc3 + patches:
         WriteX        123529     0.028     6.299
         ReadX         392434     0.005     4.287
         Flush          17549    25.120   188.687
      
        Throughput 64.9113 MB/sec  4 clients  4 procs  max_latency=188.704 ms
      
      ...and btrfs, just to round things out, also shows some latency
      decreases:
      
      3.8.0-rc3:
         WriteX         67122     0.083    82.355
         ReadX         212719     0.005     2.828
         Flush           9547    47.561   147.418
      
        Throughput 35.3391 MB/sec  4 clients  4 procs  max_latency=147.433 ms
      
      3.8.0-rc3 + patches:
         WriteX         64898     0.101    71.631
         ReadX         206673     0.005     7.123
         Flush           9190    47.963   219.034
      
        Throughput 34.0795 MB/sec  4 clients  4 procs  max_latency=219.044 ms
      
      Before this patchset, all filesystems would block, regardless of whether
      or not it was necessary.  ext3 would wait, but still generate occasional
      checksum errors.  The network filesystems were left to do their own
      thing, so they'd wait too.
      
      After this patchset, all the disk filesystems except ext3 and btrfs will
      wait only if the hardware requires it.  ext3 (if necessary) snapshots
      pages instead of blocking, and btrfs provides its own bdi so the mm will
      never wait.  Network filesystems haven't been touched, so either they
      provide their own wait code, or they don't block at all.  The blocking
      behavior is back to what it was before 3.0 if you don't have a disk
      requiring stable page writes.
      
      This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and
      xfs.  I've spot-checked 3.8.0-rc4 and seem to be getting the same
      results as -rc3.
      
      [1] The alternative fixes to ext3 include fixing the locking order and
      page bit handling like we did for ext4 (but then why not just use
      ext4?), or setting PG_writeback so early that ext3 becomes extremely
      slow.  I tried that, but the number of write()s I could initiate dropped
      by nearly an order of magnitude.  That was a bit much even for the
      author of the stable page series! :)
      
      This patch:
      
      Creates a per-backing-device flag that tracks whether or not pages must
      be held immutable during writeout.  Eventually it will be used to waive
      wait_for_page_writeback() if nothing requires stable pages.
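
      The flag is queried through a helper along these lines (a sketch;
      BDI_CAP_STABLE_WRITES is the new capability bit):

              static inline bool
              bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
              {
                      return bdi->capabilities & BDI_CAP_STABLE_WRITES;
              }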
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d311cda
  5. February 15, 2013 (1 commit)
    • block: account iowait time when waiting for completion of IO request · 5577022f
      Committed by Vladimir Davydov
      Using wait_for_completion() to wait for an IO request to be executed
      results in wrong iowait time accounting.  For example, a system whose
      only task is doing write() and fdatasync() on a block device can be
      reported as idle instead of iowaiting, as it should be, because
      blkdev_issue_flush() calls wait_for_completion(), which in turn calls
      schedule(), which does not increment the iowait proc counter and thus
      does not turn on iowait time accounting.
      
      The patch makes the block layer use wait_for_completion_io() instead
      of wait_for_completion() where appropriate, to account iowait time
      correctly.
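
      For example, in blkdev_issue_flush() the change is essentially
      (a sketch):

              submit_bio(WRITE_FLUSH, bio);

              /* wait_for_completion_io() marks the sleeping task as iowait,
               * so the time shows up as iowait rather than idle */
              wait_for_completion_io(&wait);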
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5577022f
  6. February 8, 2013 (1 commit)
  7. January 23, 2013 (1 commit)
    • block: don't request module during elevator init · 21c3c5d2
      Committed by Tejun Heo
      The block layer allows an elevator which is built as a module to be
      selected as the system default via the kernel parameter "elevator=".
      This is achieved by automatically invoking request_module() whenever a
      new block device is initialized and the elevator is not available.
      
      This led to an interesting deadlock problem involving async and module
      init.  Block device probing running off an async job invokes
      request_module().  While the module is being loaded, it performs
      async_synchronize_full() which ends up waiting for the async job which
      is already waiting for request_module() to finish, leading to
      deadlock.
      
      Invoking request_module() from deep in the block device init path is
      already nasty in itself.  It seems best to avoid these situations from
      the beginning by moving on-demand module loading out of the block init
      path.
      
      The previous patch made sure that the default elevator module is
      loaded early during boot if available.  This patch removes on-demand
      loading of the default elevator from elevator init path.  As the
      module would have been loaded during boot, userland-visible behavior
      difference should be minimal.
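
      After the change, elevator_get() only loads modules when the caller
      explicitly asks for it, roughly (a sketch):

              static struct elevator_type *elevator_get(const char *name,
                                                        bool try_loading)
              {
                      struct elevator_type *e;

                      spin_lock(&elv_list_lock);
                      e = elevator_find(name);
                      if (!e && try_loading) {
                              spin_unlock(&elv_list_lock);
                              request_module("%s-iosched", name);
                              spin_lock(&elv_list_lock);
                              e = elevator_find(name);
                      }
                      if (e && !try_module_get(e->elevator_owner))
                              e = NULL;
                      spin_unlock(&elv_list_lock);
                      return e;
              }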
      
      For more details, please refer to the following thread.
      
        http://thread.gmane.org/gmane.linux.kernel/1420814
      
      v2: The bool parameter was named @request_module which conflicted with
          request_module().  This built okay w/ CONFIG_MODULES because
          request_module() was defined as a macro.  W/o CONFIG_MODULES, it
          causes build breakage.  Rename the parameter to @try_loading.
          Reported by Fengguang.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alex Riesen <raa.lkml@gmail.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      21c3c5d2
  8. January 19, 2013 (1 commit)
    • init, block: try to load default elevator module early during boot · bb813f4c
      Committed by Tejun Heo
      This patch adds default module loading and uses it to load the default
      block elevator.  During boot, it's called right after initramfs or
      initrd is made available and right before control is passed to
      userland.  This ensures that as long as the modules are available in
      the usual places in initramfs, initrd or the root filesystem, the
      default modules are loaded as soon as possible.
      
      This will replace the on-demand elevator module loading from elevator
      init path.
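
      The elevator half of this looks roughly as follows (a sketch;
      chosen_elevator holds the "elevator=" boot parameter):

              void __init load_default_elevator_module(void)
              {
                      struct elevator_type *e;

                      if (!chosen_elevator[0])
                              return;

                      spin_lock(&elv_list_lock);
                      e = elevator_find(chosen_elevator);
                      spin_unlock(&elv_list_lock);

                      if (!e)
                              request_module("%s-iosched", chosen_elevator);
              }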
      
      v2: Fixed build breakage when !CONFIG_BLOCK.  Reported by kbuild test
          robot.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alex Riesen <raa.lkml@gmail.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      bb813f4c
  9. January 14, 2013 (2 commits)
    • block: add @req to bio_{front|back}_merge tracepoints · 8c1cf6bb
      Committed by Tejun Heo
      The bio_{front|back}_merge tracepoints report a bio merging into an
      existing request but didn't specify which request the bio is being
      merged into.  Add @req to them.  This makes it impossible to share the
      event template with block_bio_queue - split it out.
      
      @req isn't used or exported to userland at this point and there is no
      userland visible behavior change.  Later changes will make use of the
      extra parameter.
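
      Call sites now pass the request being merged into, e.g. (a sketch):

              /* in bio_attempt_back_merge() / bio_attempt_front_merge() */
              trace_block_bio_backmerge(q, req, bio);
              trace_block_bio_frontmerge(q, req, bio);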
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8c1cf6bb
    • block: add missing block_bio_complete() tracepoint · 3a366e61
      Committed by Tejun Heo
      bio completion didn't kick the block_bio_complete TP.  Only dm was
      explicitly triggering the TP on IO completion.  This made the
      block_bio_complete TP useless for tracers which want to know about
      bios, and all other bio based drivers skipped generating blktrace
      completion events.
      
      This patch makes all bio completions via bio_endio() generate
      block_bio_complete TP.
      
      * Explicit trace_block_bio_complete() invocation removed from dm and
        the trace point is unexported.
      
      * @rq dropped from trace_block_bio_complete().  bios may fly around
        w/o queue associated.  Verifying and accessing the associated queue
        belongs to TP probes.
      
      * blktrace now gets both request and bio completions.  Make it ignore
        bio completions if request completion path is happening.
      
      This makes all bio based drivers generate blktrace completion events
      properly and makes the block_bio_complete TP actually useful.
      
      v2: With this change, block_bio_complete TP could be invoked on sg
          commands which have bio's with %NULL bi_bdev.  Update TP
          assignment code to check whether bio->bi_bdev is %NULL before
          dereferencing.
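
      After the change, the tracepoint fires from the single completion
      path, roughly (a sketch of bio_endio()):

              void bio_endio(struct bio *bio, int error)
              {
                      if (error)
                              clear_bit(BIO_UPTODATE, &bio->bi_flags);
                      else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
                              error = -EIO;

                      trace_block_bio_complete(bio, error); /* every bio */

                      if (bio->bi_end_io)
                              bio->bi_end_io(bio, error);
              }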
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Original-patch-by: Namhyung Kim <namhyung@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3a366e61
  10. January 11, 2013 (2 commits)
  11. January 10, 2013 (17 commits)
    • cfq-iosched: add hierarchical cfq_group statistics · 43114018
      Committed by Tejun Heo
      Unfortunately, at this point, there's no way to make the existing
      statistics hierarchical without creating nasty surprises for the
      existing users.  Just create a recursive counterpart of the existing
      stats.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      43114018
    • cfq-iosched: collect stats from dead cfqgs · 0b39920b
      Committed by Tejun Heo
      To support hierarchical stats, it's necessary to remember stats from
      dead children.  Add cfqg->dead_stats and make a dying cfqg transfer
      its stats to the parent's dead-stats.
      
      The transfer happens from ->pd_offline_fn() and it is possible that
      there are some residual IOs completing afterwards.  Currently, we lose
      these stats.  Given that cgroup removal isn't a very high frequency
      operation and the amount of residual IOs on offline are likely to be
      nil or small, this shouldn't be a big deal and the complexity needed
      to handle residual IOs - another callback and rather elaborate
      synchronization to reach and lock the matching q - doesn't seem
      justified.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      0b39920b
    • cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats() · 689665af
      Committed by Tejun Heo
      Separate out cfqg_stats_reset() which takes struct cfqg_stats * from
      cfq_pd_reset_stats() and move the latter to where other pd methods are
      defined.  cfqg_stats_reset() will be used to implement hierarchical
      stats.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      689665af
    • blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock · 810ecfa7
      Committed by Tejun Heo
      Instead of holding blkcg->lock while walking ->blkg_list and executing
      prfill(), RCU walk ->blkg_list and hold the blkg's queue lock while
      executing prfill().  This makes prfill() implementations easier as
      stats are mostly protected by queue lock.
      
      This will be used to implement hierarchical stats.
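
      The walk becomes roughly (a sketch, shown with the 3-argument hlist
      iterator form introduced elsewhere in this log):

              rcu_read_lock();
              hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
                      spin_lock_irq(blkg->q->queue_lock);
                      if (blkcg_policy_enabled(blkg->q, pol))
                              total += prfill(sf, blkg->pd[pol->plid], data);
                      spin_unlock_irq(blkg->q->queue_lock);
              }
              rcu_read_unlock();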
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      810ecfa7
    • block: RCU free request_queue · 548bc8e1
      Committed by Tejun Heo
      RCU free request_queue so that blkcg_gq->q can be dereferenced under
      RCU lock.  This will be used to implement hierarchical stats.
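
      A sketch of the mechanism (names follow the existing queue allocation
      code):

              static void blk_free_queue_rcu(struct rcu_head *rcu_head)
              {
                      struct request_queue *q = container_of(rcu_head,
                                      struct request_queue, rcu_head);
                      kmem_cache_free(blk_requestq_cachep, q);
              }

              /* in blk_release_queue(): defer the final free by one RCU
               * grace period */
              call_rcu(&q->rcu_head, blk_free_queue_rcu);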
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      548bc8e1
    • blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() · 16b3de66
      Committed by Tejun Heo
      Implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge().
      The former two collect the [rw]stats designated by the target policy
      data and offset from the pd's subtree.  The latter two add one
      [rw]stat to another.
      
      Note that the recursive sum functions require the queue lock to be
      held on entry to make blkg online test reliable.  This is necessary to
      properly handle stats of a dying blkg.
      
      These will be used to implement hierarchical stats.
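
      The merge half is simple; for the plain stat it is essentially
      (a sketch):

              static void blkg_stat_merge(struct blkg_stat *to,
                                          struct blkg_stat *from)
              {
                      blkg_stat_add(to, blkg_stat_read(from));
              }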
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      16b3de66
    • blkcg: export __blkg_prfill_rwstat() · b50da39f
      Committed by Tejun Heo
      Hierarchical stats for cfq-iosched will need __blkg_prfill_rwstat().
      Export it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      b50da39f
    • blkcg: s/blkg_rwstat_sum()/blkg_rwstat_total()/ · 4d5e80a7
      Committed by Tejun Heo
      Rename blkg_rwstat_sum() to blkg_rwstat_total().  sum will be used for
      summing up stats from multiple blkgs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      4d5e80a7
    • blkcg: implement blkcg_policy->on/offline_pd_fn() and blkcg_gq->online · f427d909
      Committed by Tejun Heo
      Add two blkcg_policy methods, ->online_pd_fn() and ->offline_pd_fn(),
      which are invoked as the policy_data gets activated and deactivated
      while holding both blkcg and q locks.
      
      Also, add blkcg_gq->online bool, which is set and cleared as the
      blkcg_gq gets activated and deactivated.  This flag also is toggled
      while holding both blkcg and q locks.
      
      These will be used to implement hierarchical stats.
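
      A sketch of the activation side in the blkg creation path (method
      names per the blkcg_policy naming convention):

              /* invoked with both blkcg->lock and q->queue_lock held */
              if (pol->pd_online_fn)
                      pol->pd_online_fn(blkg);
              blkg->online = true;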
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      f427d909
    • blkcg: add blkg_policy_data->plid · b276a876
      Committed by Tejun Heo
      Add pd->plid so that the policy a pd belongs to can be identified
      easily.  This will be used to implement hierarchical blkg_[rw]stats.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      b276a876
    • cfq-iosched: enable full blkcg hierarchy support · d02f7aa8
      Committed by Tejun Heo
      With the previous two patches, all cfqg scheduling decisions are based
      on vfraction and ready for hierarchy support.  The only thing which
      keeps the behavior flat is cfqg_flat_parent() which makes vfraction
      calculation consider all non-root cfqgs children of the root cfqg.
      
      Replace it with cfqg_parent() which returns the real parent.  This
      enables full blkcg hierarchy support for cfq-iosched.  For example,
      consider the following hierarchy.
      
              root
            /      \
         A:500      B:250
        /     \
       AA:500  AB:1000
      
      For simplicity, let's say all the leaf nodes have active tasks and are
      on service tree.  For each leaf node, vfraction would be
      
       AA: (500  / 1500) * (500 / 750) =~ 0.2222
       AB: (1000 / 1500) * (500 / 750) =~ 0.4444
        B:                 (250 / 750) =~ 0.3333
      
      and vdisktime will be distributed accordingly.  For more detail,
      please refer to Documentation/block/cfq-iosched.txt.
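
      The replacement parent lookup is essentially (a sketch):

              static struct cfq_group *cfqg_parent(struct cfq_group *cfqg)
              {
                      struct blkcg_gq *pblkg = cfqg_to_blkg(cfqg)->parent;

                      return pblkg ? blkg_to_cfqg(pblkg) : NULL;
              }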
      
      v2: cfq-iosched.txt updated to describe group scheduling as suggested
          by Vivek.
      
      v3: blkio-controller.txt updated.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      d02f7aa8
    • cfq-iosched: convert cfq_group_slice() to use cfqg->vfraction · 41cad6ab
      Committed by Tejun Heo
      cfq_group_slice() calculates slice by taking a fraction of
      cfq_target_latency according to the ratio of cfqg->weight against
      service_tree->total_weight.  This currently works only because all
      cfqgs are treated to be at the same level.
      
      To prepare for proper hierarchy support, convert cfq_group_slice() to
      base the calculation on cfqg->vfraction.  As cfqg->vfraction is always
      a fraction of 1 and represents the fraction allocated to the cfqg with
      hierarchy considered, the slice can be simply calculated by
      multiplying cfqg->vfraction to cfq_target_latency (with fixed point
      shift factored in).
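
      The converted function is essentially a one-liner (a sketch):

              static inline u64 cfq_group_slice(struct cfq_data *cfqd,
                                                struct cfq_group *cfqg)
              {
                      /* vfraction is a fixed-point fraction of 1
                       * (CFQ_SERVICE_SHIFT) */
                      return cfqd->cfq_target_latency * cfqg->vfraction
                              >> CFQ_SERVICE_SHIFT;
              }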
      
      As vfraction calculation currently treats all non-root cfqgs as
      children of the root cfqg, this patch doesn't introduce noticeable
      behavior difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      41cad6ab
    • cfq-iosched: implement hierarchy-ready cfq_group charge scaling · 1d3650f7
      Committed by Tejun Heo
      Currently, cfqg charges are scaled directly according to cfqg->weight.
      Regardless of the number of active cfqgs or the amount of active
      weights, a given weight value always scales charge the same way.  This
      works fine as long as all cfqgs are treated equally regardless of
      their positions in the hierarchy, which is what cfq currently
      implements.  It can't work in hierarchical settings because the
      interpretation of a given weight value depends on where the weight is
      located in the hierarchy.
      
      This patch reimplements cfqg charge scaling so that it can be used to
      support hierarchy properly.  The scheme is fairly simple and
      light-weight.
      
      * When a cfqg is added to the service tree, v(disktime)weight is
        calculated.  It walks up the tree to root calculating the fraction
        it has in the hierarchy.  At each level, the fraction can be
        calculated as
      
          cfqg->weight / parent->level_weight
      
        By compounding these, the global fraction of vdisktime the cfqg has
        claim to - vfraction - can be determined.
      
      * When the cfqg needs to be charged, the charge is scaled inversely
        proportionally to the vfraction.
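
      A simplified sketch of the compounding walk described in the first
      bullet (cfqg_parent() and the field names are per this patch series;
      the service-tree bookkeeping is omitted):

              unsigned int vfr = 1 << CFQ_SERVICE_SHIFT; /* fixed-point 1.0 */
              struct cfq_group *pos = cfqg, *parent;

              while ((parent = cfqg_parent(pos))) {
                      /* fraction this level contributes globally */
                      vfr = vfr * pos->weight / parent->children_weight;
                      pos = parent;
              }
              cfqg->vfraction = max_t(unsigned, vfr, 1); /* min factor is 1 */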
      
      The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point
      representation as before; however, the smallest scaling factor is now
      1 (ie. 1 << CFQ_SERVICE_SHIFT).  This is different from before where 1
      was for CFQ_WEIGHT_DEFAULT and higher weight would result in smaller
      scaling factor.
      
      While this shifts the global scale of vdisktime a bit, it doesn't
      change the relative relationships among cfqgs and the scheduling
      result isn't different.
      
      cfq_group_notify_queue_add uses fixed CFQ_IDLE_DELAY when appending
      new cfqg to the service tree.  The specific value of CFQ_IDLE_DELAY
      didn't have any relevance to vdisktime before and is unlikely to cause
      any visible behavior difference now especially as the scale shift
      isn't that large.
      
      As the new scheme now makes proper distinction between cfqg->weight
      and ->leaf_weight, reverse the weight aliasing for root cfqgs.  For
      root, both weights are now mapped to ->leaf_weight instead of the
      other way around.
      
      Because we're still using cfqg_flat_parent(), this patch shouldn't
      change the scheduling behavior in any noticeable way.
      
      v2: Beefed up comments on vfraction as requested by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      1d3650f7
    • cfq-iosched: implement cfq_group->nr_active and ->children_weight · 7918ffb5
      Committed by Tejun Heo
      To prepare for blkcg hierarchy support, add cfqg->nr_active and
      ->children_weight.  cfqg->nr_active counts the number of active cfqgs
      at the cfqg's level and ->children_weight is sum of weights of those
      cfqgs.  The level covers itself (cfqg->leaf_weight) and immediate
      children.
      
      The two values are updated when a cfqg enters and leaves the group
      service tree.  Unless the hierarchy is very deep, the added overhead
      should be negligible.
      
      Currently, the parent is determined using cfqg_flat_parent() which
      makes the root cfqg the parent of all other cfqgs.  This is to make
      the transition to hierarchy-aware scheduling gradual.  Scheduling
      logic will be converted to use cfqg->children_weight without actually
      changing the behavior.  When everything is ready,
      blkcg_weight_parent() will be replaced with a proper parent function.
      
      This patch doesn't introduce any behavior change.
      
      v2: s/cfqg->level_weight/cfqg->children_weight/ as per Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      7918ffb5
    • cfq-iosched: add leaf_weight · e71357e1
      Committed by Tejun Heo
      cfq blkcg is about to grow proper hierarchy handling, where a child
      blkg's weight would nest inside the parent's.  This makes tasks in a
      blkg compete against both tasks in the sibling blkgs and the tasks
      of child blkgs.
      
      We're gonna use the existing weight as the group weight which decides
      the blkg's weight against its siblings.  This patch introduces a new
      weight - leaf_weight - which decides the weight of a blkg against the
      child blkgs.
      
      It's named leaf_weight because another way to look at it is that each
      internal blkg node has a hidden child leaf node which contains all of
      its tasks; leaf_weight is the weight of that leaf node and is handled
      the same as the weight of the child blkgs.
      
      This patch only adds the leaf_weight fields and exposes them to
      userland.  The new weight isn't actually used anywhere yet.  Note that
      cfq-iosched currently officially supports only a single level of
      hierarchy and root blkgs compete with the first level blkgs - ie. root
      weight is basically being used as leaf_weight.  For root blkgs, the
      two weights are kept in sync for backward compatibility.
      
      v2: cfqd->root_group->leaf_weight initialization was missing from
          cfq_init_queue() causing divide by zero when
          !CONFIG_CFQ_GROUP_SCHED.  Fix it.  Reported by Fengguang.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      e71357e1
    • blkcg: make blkcg_gq's hierarchical · 3c547865
      Committed by Tejun Heo
      Currently a child blkg (blkcg_gq) can be created even if its parent
      doesn't exist.  ie. Given a blkg, it's not guaranteed that its
      ancestors will exist.  This makes it difficult to implement proper
      hierarchy support for blkcg policies.
      
      Always create blkgs recursively and make a child blkg hold a reference
      to its parent.  blkg->parent is added so that finding the parent is
      easy.  blkcg_parent() is also added in the process.
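
      The new parent helper is essentially (a sketch):

              static inline struct blkcg *blkcg_parent(struct blkcg *blkcg)
              {
                      struct cgroup *pcg = blkcg->css.cgroup->parent;

                      return pcg ? cgroup_to_blkcg(pcg) : NULL;
              }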
      
      This change can be visible to userland.  e.g. while issuing IO in a
      nested cgroup previously didn't affect the ancestors at all, it will
      now initialize all ancestor blkgs, and zero stats for the
      request_queue will always appear on them.  While this is userland
      visible, this shouldn't cause any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      3c547865
    • blkcg: cosmetic updates to blkg_create() · 93e6d5d8
      Committed by Tejun Heo
      * Rename out_* labels to err_*.
      
      * Do ERR_PTR() conversion once in the error return path.
      
      This patch is cosmetic and to prepare for the hierarchy support.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      93e6d5d8