1. 26 Sep, 2012 1 commit
    • S
      s390/partitions: make partition detection independent from DASD ioctls · 46e88947
      Stefan Weinhuber committed
      In some usage scenarios it is desirable to work with disk images or
      virtualized DASD devices. One problem that prevents such applications
      is the partition detection in ibm.c. Currently it works only for
      devices that support the BIODASDINFO2 ioctl, in other words, it only
      works for devices that belong to the DASD device driver.
      
      The information gained from the BIODASDINFO2 ioctl is absolutely
      necessary only for a small set of legacy cases. All current VOL1, LNX1 and
      CMS1 type of disk labels can be interpreted correctly without this
      information, as long as the generic HDIO_GETGEO ioctl works and
      provides a correct disk geometry.
      
      This patch makes the ibm.c partition detection as independent as
      possible from the BIODASDINFO2 ioctl. Only the following two cases are
      still restricted to real DASDs:
      - An FBA DASD, or LDL formatted ECKD DASD without any disk label.
      - An old style LNX1 label (without large volume support) on a disk
        with inconsistent device geometry.
      Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      46e88947
  2. 21 Sep, 2012 2 commits
    • T
      block: fix request_queue->flags initialization · 60ea8226
      Tejun Heo committed
      A queue newly allocated with blk_alloc_queue_node() has only
      QUEUE_FLAG_BYPASS set.  For request-based drivers,
      blk_init_allocated_queue() is called and q->queue_flags is overwritten
      with QUEUE_FLAG_DEFAULT which doesn't include BYPASS even though the
      initial bypass is still in effect.
      
      In blk_init_allocated_queue(), OR QUEUE_FLAG_DEFAULT into q->queue_flags
      instead of overwriting it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      60ea8226
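The difference between overwriting and OR-ing a flag word, as described in 60ea8226 above, can be sketched in plain userspace C. This is only an illustration: the `DEMO_*` flag values, `struct demo_queue`, and both helper functions are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>

/* Hypothetical flag bits for illustration; not the kernel's real values. */
#define DEMO_FLAG_BYPASS    (1UL << 0)
#define DEMO_FLAG_IO_STAT   (1UL << 1)
#define DEMO_FLAG_SAME_COMP (1UL << 2)
#define DEMO_FLAG_DEFAULT   (DEMO_FLAG_IO_STAT | DEMO_FLAG_SAME_COMP)

/* Minimal stand-in for a request queue that only carries a flag word. */
struct demo_queue {
    unsigned long flags;
};

/* Pre-fix behavior: plain assignment drops any flag set at allocation. */
static void demo_init_overwrite(struct demo_queue *q)
{
    assert(q != 0);
    q->flags = DEMO_FLAG_DEFAULT;
}

/* Post-fix behavior: OR the defaults in, so BYPASS survives. */
static void demo_init_or(struct demo_queue *q)
{
    assert(q != 0);
    q->flags |= DEMO_FLAG_DEFAULT;
}
```

Starting from a queue whose flags hold only `DEMO_FLAG_BYPASS`, the first helper loses the bit while the second keeps it alongside the defaults.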
    • T
      block: lift the initial queue bypass mode on blk_register_queue() instead of... · 749fefe6
      Tejun Heo committed
      block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
      
      b82d4b19 ("blkcg: make request_queue bypassing on allocation") made
      request_queues bypassed on allocation to avoid switching on and off
      bypass mode on a queue being initialized.  Some drivers allocate and
      then destroy a lot of queues without fully initializing them and
      incurring bypass latency overhead on each of them could add up to
      significant overhead.
      
      Unfortunately, blk_init_allocated_queue() is never used by queues of
      bio-based drivers, which means that all bio-based driver queues are in
      bypass mode even after initialization and registration complete
      successfully.
      
      Due to the limited way request_queues are used by bio drivers, this
      problem is hidden pretty well but it shows up when blk-throttle is
      used in combination with a bio-based driver.  Trying to configure
      (echoing to cgroupfs file) blk-throttle for a bio-based driver hangs
      indefinitely in blkg_conf_prep() waiting for bypass mode to end.
      
      This patch moves the initial blk_queue_bypass_end() call from
      blk_init_allocated_queue() to blk_register_queue() which is called for
      any userland-visible queues regardless of its type.
      
      I believe this is correct because I don't think there is any block
      driver which needs or wants working elevator and blk-cgroup on a queue
      which isn't visible to userland.  If there are such users, we need a
      different solution.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Joseph Glanville <joseph.glanville@orionvm.com.au>
      Cc: stable@vger.kernel.org
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      749fefe6
  3. 20 Sep, 2012 5 commits
  4. 18 Sep, 2012 1 commit
  5. 15 Sep, 2012 1 commit
    • T
      cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them · 8c7f6edb
      Tejun Heo committed
      Currently, cgroup hierarchy support is a mess.  cpu related subsystems
      behave correctly - configuration, accounting and control on a parent
      properly cover its children.  blkio and freezer completely ignore
      hierarchy and treat all cgroups as if they're directly under the root
      cgroup.  Others show yet different behaviors.
      
      These differing interpretations of cgroup hierarchy make using cgroup
      confusing and make it impossible to co-mount controllers into the same
      hierarchy and obtain sane behavior.
      
      Eventually, we want full hierarchy support from all subsystems and
      probably a unified hierarchy.  Users using separate hierarchies
      expecting completely different behaviors depending on the mounted
      subsystem is detrimental to making any progress on this front.
      
      This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
      for controllers which are lacking in hierarchy support.  The goal of
      this patch is two-fold.
      
      * Move users away from using hierarchy on currently non-hierarchical
        subsystems, so that implementing proper hierarchy support on those
        doesn't surprise them.
      
      * Keep track of which controllers are broken how and nudge the
        subsystems to implement proper hierarchy support.
      
      For now, start with a single warning message.  We can whine louder
      later on.
      
      v2: Fixed a typo spotted by Michal. Warning message updated.
      
      v3: Updated memcg part so that it doesn't generate warning in the
          cases where .use_hierarchy=false doesn't make the behavior
          different from root.use_hierarchy=true.  Fixed a typo spotted by
          Glauber.
      
      v4: Check ->broken_hierarchy after cgroup creation is complete so that
          ->create() can affect the result per Michal.  Dropped unnecessary
          memcg root handling per Michal.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      8c7f6edb
  6. 13 Sep, 2012 1 commit
    • P
      block/blk-tag.c: Remove useless kfree · d41570b7
      Peter Senna Tschudin committed
      Remove useless kfree() and clean up code related to the removal.
      
      The semantic patch that finds this problem is as follows:
      (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @r exists@
      position p1,p2;
      expression x;
      @@
      
      if (x@p1 == NULL) { ... kfree@p2(x); ... return ...; }
      
      @unchanged exists@
      position r.p1,r.p2;
      expression e <= r.x,x,e1;
      iterator I;
      statement S;
      @@
      
      if (x@p1 == NULL) { ... when != I(x,...) S
                              when != e = e1
                              when != e += e1
                              when != e -= e1
                              when != ++e
                              when != --e
                              when != e++
                              when != e--
                              when != &e
         kfree@p2(x); ... return ...; }
      
      @ok depends on unchanged exists@
      position any r.p1;
      position r.p2;
      expression x;
      @@
      
      ... when != true x@p1 == NULL
      kfree@p2(x);
      
      @depends on !ok && unchanged@
      position r.p2;
      expression x;
      @@
      
      *kfree@p2(x);
      // </smpl>
      Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d41570b7
  7. 09 Sep, 2012 5 commits
  8. 31 Aug, 2012 1 commit
    • Y
      block: rate-limit the error message from failing commands · 37d7b34f
      Yi Zou committed
      When performing a cable pull test w/ active stress I/O using fio over
      a dual port Intel 82599 FCoE CNA, w/ 256LUNs on one port and about 32LUNs
      on the other, it is observed that the system becomes unusable due to
      scsi-ml being busy printing the error messages for all the failing commands.
      I don't believe this problem is specific to FCoE and these commands are
      anyway failing due to link being down (DID_NO_CONNECT), just rate-limit
      the messages here to solve this issue.
      
      v2->v1: use __ratelimit() as Tomas Henzl mentioned as the proper way for
      rate-limit per function. However, in this case, the failed i/o gets to
      blk_end_request_err() and then blk_update_request(), which also has to
      be rate-limited, as added in the v2 of this patch.
      
      v3-v2: resolved conflict to apply on current 3.6-rc3 upstream tip.
      Signed-off-by: Yi Zou <yi.zou@intel.com>
      Cc: www.Open-FCoE.org <devel@open-fcoe.org>
      Cc: Tomas Henzl <thenzl@redhat.com>
      Cc: <linux-scsi@vger.kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      37d7b34f
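The interval-and-burst idea behind the rate limiting in 37d7b34f above can be sketched in userspace C. This is an assumption-laden illustration, not the kernel's `__ratelimit()` implementation: `struct demo_ratelimit` and `demo_ratelimit_ok()` are hypothetical names, and the caller supplies the current time so the behavior stays deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical rate-limit state: at most `burst` messages per `interval`. */
struct demo_ratelimit {
    long interval; /* window length, in caller-chosen time units */
    int  burst;    /* messages allowed per window */
    long begin;    /* start time of the current window */
    int  printed;  /* messages allowed so far in this window */
    int  missed;   /* messages suppressed so far */
};

/* Returns true if the caller may print a message at time `now`. */
static bool demo_ratelimit_ok(struct demo_ratelimit *rs, long now)
{
    assert(rs->interval > 0 && rs->burst > 0);

    /* Open a fresh window once the current one has expired. */
    if (now - rs->begin >= rs->interval) {
        rs->begin = now;
        rs->printed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return true;
    }
    rs->missed++;
    return false;
}
```

With `interval = 5` and `burst = 2`, calls at times 0 and 1 are allowed, a call at time 2 is suppressed, and a call at time 5 is allowed again because a new window has started.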
  9. 22 Aug, 2012 2 commits
    • T
      workqueue: deprecate __cancel_delayed_work() · 136b5721
      Tejun Heo committed
      Now that cancel_delayed_work() can be safely called from IRQ handlers,
      there's no reason to use __cancel_delayed_work().  Use
      cancel_delayed_work() instead of __cancel_delayed_work() and mark the
      latter deprecated.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      136b5721
    • T
      workqueue: use mod_delayed_work() instead of __cancel + queue · e7c2f967
      Tejun Heo committed
      Now that mod_delayed_work() is safe to call from IRQ handlers,
      __cancel_delayed_work() followed by queue_delayed_work() can be
      replaced with mod_delayed_work().
      
      Most conversions are straight-forward except for the following.
      
      * net/core/link_watch.c: linkwatch_schedule_work() was doing quite an
        elaborate dance around its delayed_work.  Collapse it such that
        linkwatch_work is queued for immediate execution if LW_URGENT and
        existing timer is kept otherwise.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      e7c2f967
  10. 21 Aug, 2012 1 commit
    • T
      workqueue: deprecate system_nrt[_freezable]_wq · 3b07e9ca
      Tejun Heo committed
      system_nrt[_freezable]_wq are now spurious.  Mark them deprecated and
      convert all users to system[_freezable]_wq.
      
      If you're cc'd and wondering what's going on: Now all workqueues are
      non-reentrant, so there's no reason to use system_nrt[_freezable]_wq.
      Please use system[_freezable]_wq instead.
      
      This patch doesn't make any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: David Howells <dhowells@redhat.com>
      3b07e9ca
  11. 14 Aug, 2012 1 commit
    • T
      workqueue: use mod_delayed_work() instead of cancel + queue · 41f63c53
      Tejun Heo committed
      Convert delayed_work users doing cancel_delayed_work() followed by
      queue_delayed_work() to mod_delayed_work().
      
      Most conversions are straight-forward.  Ones worth mentioning are,
      
      * drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
        use mod_delayed_work() and cancel loop in
        edac_mc_reset_delay_period() is dropped.
      
      * drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
        watchdog is active or not.  @fan_watchdog_active and related code
        dropped.
      
      * drivers/power/charger-manager.c: Seemingly a lot of
        delayed_work_pending() abuse going on here.
        [delayed_]work_pending() are unsynchronized and racy when used like
        this.  I converted one instance in fullbatt_handler().  Please
        convert the rest so that it invokes workqueue APIs for the intended
        target state rather than trying to game work item pending state
        transitions.  e.g. if timer should be modified - call
        mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().
      
      * drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
        simplified.  Note that round_jiffies() calls in this function are
        meaningless.  round_jiffies() work on absolute jiffies not delta
        delay used by delayed_work.
      
      v2: Tomi pointed out that __cancel_delayed_work() users can't be
          safely converted to mod_delayed_work().  They could be calling it
          from irq context and if that happens while delayed_work_timer_fn()
          is running, it could deadlock.  __cancel_delayed_work() users are
          dropped.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
      Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Acked-by: Anton Vorontsov <cbouatmailru@gmail.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Doug Thompson <dougthompson@xmission.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: "John W. Linville" <linville@tuxdriver.com>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      41f63c53
  12. 03 Aug, 2012 3 commits
    • J
      block: Don't use static to define "void *p" in show_partition_start() · 06768067
      Jianpeng Ma committed
      I ran into an odd problem: reading /proc/partitions may return zero.
      
      I wrote a file test.c:
      #include <errno.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      int main(void)
      {
      	char buff[4096];
      	int ret;
      	int fd;
      	printf("pid=%d\n", (int)getpid());
      	while (1) {
      		fd = open("/proc/partitions", O_RDONLY);
      		if (fd < 0) {
      			printf("open error %s\n", strerror(errno));
      			return 0;
      		}
      		/* A read of zero bytes from a fresh fd is the symptom. */
      		ret = read(fd, buff, 4096);
      		if (ret <= 0)
      			printf("ret=%d, %s, %ld\n", ret,
      				strerror(errno), (long)lseek(fd, 0, SEEK_CUR));
      		close(fd);
      	}
      	return 0;
      }
      
      You can reproduce by:
      1:while true;do cat /proc/partitions > /dev/null ;done
      2:./test
      
      I reviewed the code and found:
      
      >> static void *show_partition_start(struct seq_file *seqf, loff_t *pos)
      >> {
      >> 	static void *p;
      >>
      >> 	p = disk_seqf_start(seqf, pos);
      >> 	if (!IS_ERR_OR_NULL(p) && !*pos)
      >> 		seq_puts(seqf, "major minor  #blocks  name\n\n");
      >> 	return p;
      >> }
      test                                  cat /proc/partitions
      p = disk_seqf_start() (not NULL)
                                            p = disk_seqf_start() (NULL because of pos)
      if (!IS_ERR_OR_NULL(p) && !*pos)
      Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      06768067
    • A
      block: Add blk_bio_map_sg() helper · 85b9f66a
      Asias He committed
      Add a helper to map a bio to a scatterlist, modelled after
      blk_rq_map_sg.
      
      This helper is useful for any driver that wants to create
      a scatterlist from its ->make_request_fn method.
      
      Changes in v2:
       - Use __blk_segment_map_sg to avoid duplicated code
       - Add docbook style function comment
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: virtualization@lists.linux-foundation.org
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Asias He <asias@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      85b9f66a
    • A
      block: Introduce __blk_segment_map_sg() helper · 963ab9e5
      Asias He committed
      Split the mapping code in blk_rq_map_sg() to a helper
      __blk_segment_map_sg(), so that other mapping functions, e.g.
      blk_bio_map_sg(), can share the code.
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: virtualization@lists.linux-foundation.org
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Asias He <asias@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      963ab9e5
  13. 02 Aug, 2012 2 commits
    • P
      block: split discard into aligned requests · c6e66634
      Paolo Bonzini committed
      When a disk has large discard_granularity and small max_discard_sectors,
      discards are not split with optimal alignment.  In the limit case of
      discard_granularity == max_discard_sectors, no request could be aligned
      correctly, so in fact you might end up with no discarded logical blocks
      at all.
      
      Another example that helps show the condition in the patch is with
      discard_granularity == 64, max_discard_sectors == 128.  A request that is
      submitted for 256 sectors 2..257 will be split in two: 2..129, 130..257.
      However, only 2 aligned blocks out of 3 are included in the request;
      128..191 may be left intact and not discarded.  With this patch, the
      first request will be truncated to ensure good alignment of what's left,
      and the split will be 2..127, 128..255, 256..257.  The patch will also
      take into account the discard_alignment.
      
      At most one extra request will be introduced, because the first request
      will be reduced by at most granularity-1 sectors, and granularity
      must be less than max_discard_sectors.  Subsequent requests will run
      on round_down(max_discard_sectors, granularity) sectors, as in the
      current code.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Tested-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c6e66634
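The splitting rule described in c6e66634 above can be sketched as a small userspace C function. It is a minimal sketch under stated assumptions rather than the kernel code: `struct demo_range` and `demo_split_discard()` are hypothetical names, ranges are reported as inclusive sector numbers, and `alignment` is assumed to be smaller than `granularity`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive sector range of one split request (hypothetical helper type). */
struct demo_range {
    uint64_t first;
    uint64_t last;
};

/*
 * Split a discard of nr_sects sectors starting at `start` so that every
 * request except the last ends on a granularity boundary (shifted by
 * `alignment`). Returns the number of ranges written to `out`.
 */
static size_t demo_split_discard(uint64_t start, uint64_t nr_sects,
                                 uint64_t granularity, uint64_t alignment,
                                 uint64_t max_sectors,
                                 struct demo_range *out, size_t cap)
{
    size_t n = 0;

    assert(granularity > 0 && alignment < granularity);
    /* Round the per-request limit down to a multiple of the granularity. */
    max_sectors -= max_sectors % granularity;
    assert(max_sectors > 0);

    while (nr_sects > 0 && n < cap) {
        uint64_t req = nr_sects < max_sectors ? nr_sects : max_sectors;
        uint64_t end = start + req;

        /* If more follows, pull the end back to the previous boundary. */
        if (req < nr_sects) {
            end -= (end + granularity - alignment) % granularity;
            req = end - start;
        }
        out[n].first = start;
        out[n].last = end - 1;
        n++;
        start = end;
        nr_sects -= req;
    }
    return n;
}
```

Calling `demo_split_discard(2, 256, 64, 0, 128, out, 8)` reproduces the split given in the commit message: 2..127, 128..255, 256..257.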
    • P
      block: reorganize rounding of max_discard_sectors · f6ff53d3
      Paolo Bonzini committed
      Mostly a preparation for the next patch.
      
      In principle this fixes an infinite loop if max_discard_sectors < granularity,
      but that really shouldn't happen.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Tested-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f6ff53d3
  14. 01 Aug, 2012 4 commits
    • Y
      block: remove dead func declaration · 80799fbb
      Yuanhan Liu committed
      The __generic_unplug_device() function was removed by commit
      7eaceacc, which forgot to remove the declaration at the same
      time. Remove it here.
      Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      80799fbb
    • V
      block: add partition resize function to blkpg ioctl · c83f6bf9
      Vivek Goyal committed
      Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
      allows altering the size of an existing partition, even if it is currently
      in use.
      
      This patch converts hd_struct->nr_sects into a sequence counter because
      one might extend a partition while IO is happening to it, and the update of
      nr_sects can be non-atomic on 32bit machines with a 64bit sector_t. This
      can lead to issues like reading an inconsistent partition size. A sequence
      counter is used so that readers don't have to take the bdev mutex lock,
      as we call sector_in_part() very frequently.
      
      Now all accesses to hd_struct->nr_sects should happen using the sequence
      counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
      There is one exception though, set_capacity()/get_capacity(). I think
      theoretically a race should exist there too but this patch does not
      modify set_capacity()/get_capacity() due to sheer number of call sites
      and I am afraid that change might break something. I have left that as a
      TODO item. We can handle it later if need be. This patch does not introduce
      any new races as such w.r.t set_capacity()/get_capacity().
      
      v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Phillip Susi <psusi@ubuntu.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c83f6bf9
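The sequence-counter idea in c83f6bf9 above (lock-free readers that retry if a writer was active) can be sketched in userspace with C11 atomics. This is an illustrative sketch only: `struct demo_part` and the `demo_nr_sects_*` helpers are hypothetical analogues of part_nr_sects_read/part_nr_sects_write, the 64-bit size is deliberately stored as two 32-bit halves to mimic a non-atomic update, and a single writer is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical partition record: a 64-bit size kept as two 32-bit halves. */
struct demo_part {
    atomic_uint seq;              /* odd while a write is in progress */
    atomic_uint_least32_t lo, hi; /* halves of the sector count */
};

/* Writer side; assumes writers are already serialized by the caller. */
static void demo_nr_sects_write(struct demo_part *p, uint64_t v)
{
    unsigned s = atomic_load_explicit(&p->seq, memory_order_relaxed);

    assert((s & 1u) == 0); /* no write may already be in flight */
    atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&p->lo, (uint32_t)v, memory_order_relaxed);
    atomic_store_explicit(&p->hi, (uint32_t)(v >> 32), memory_order_relaxed);
    atomic_store_explicit(&p->seq, s + 2, memory_order_release);
}

/* Reader side: retry until both halves come from the same generation. */
static uint64_t demo_nr_sects_read(struct demo_part *p)
{
    unsigned s1, s2;
    uint32_t lo, hi;

    do {
        s1 = atomic_load_explicit(&p->seq, memory_order_acquire);
        lo = atomic_load_explicit(&p->lo, memory_order_relaxed);
        hi = atomic_load_explicit(&p->hi, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        s2 = atomic_load_explicit(&p->seq, memory_order_relaxed);
    } while ((s1 & 1u) || s1 != s2);

    return ((uint64_t)hi << 32) | lo;
}
```

Readers never block the writer; they only pay for a retry when a write overlaps their read, which is the property the commit message relies on for the frequently called sector_in_part().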
    • O
      block: uninitialized ioc->nr_tasks triggers WARN_ON · 4638a83e
      Olof Johansson committed
      Hi,
      
      I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
      below warning as of 3.5-rc1 and later (3.4 is fine):
      
      [   10.886893] ------------[ cut here ]------------
      [   10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
      [   10.886905] Hardware name: Bochs
      [   10.886906] Modules linked in:
      [   10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
      [   10.886908] Call Trace:
      [   10.886911]  [<ffffffff8107ce8a>] warn_slowpath_common+0x7a/0xb0
      [   10.886912]  [<ffffffff8107ced5>] warn_slowpath_null+0x15/0x20
      [   10.886913]  [<ffffffff8107c088>] copy_process+0x1488/0x1560
      [   10.886914]  [<ffffffff8107c244>] do_fork+0xb4/0x340
      [   10.886918]  [<ffffffff8108effa>] ? recalc_sigpending+0x1a/0x50
      [   10.886919]  [<ffffffff8108f6b2>] ? __set_task_blocked+0x32/0x80
      [   10.886920]  [<ffffffff81091afa>] ? __set_current_blocked+0x3a/0x60
      [   10.886923]  [<ffffffff81051db3>] sys_clone+0x23/0x30
      [   10.886925]  [<ffffffff8179bd73>] stub_clone+0x13/0x20
      [   10.886927]  [<ffffffff8179baa2>] ? system_call_fastpath+0x16/0x1b
      [   10.886928] ---[ end trace 32a14af7ee6a590b ]---
      
      Reproducing is easy, I can hit it on a KVM system with a very basic
      config (x86_64 make defconfig + enable the drivers needed). To hit it,
      just install dump (on debian/ubuntu, not sure what the package might be
      called on Fedora), and:
      
      dump -o -f /tmp/foo /
      
      You'll see the warning in dmesg once it forks off the I/O process and
      starts dumping filesystem contents.
      
      I bisected it down to the following commit:
      
      commit f6e8d01b
      Author: Tejun Heo <tj@kernel.org>
      Date:   Mon Mar 5 13:15:26 2012 -0800
      
          block: add io_context->active_ref
      
          Currently ioc->nr_tasks is used to decide two things - whether an ioc
          is done issuing IOs and whether it's shared by multiple tasks.  This
          patch separate out the first into ioc->active_ref, which is acquired
          and released using {get|put}_io_context_active() respectively.
      
          This will be used to associate bio's with a given task.  This patch
          doesn't introduce any visible behavior change.
          Signed-off-by: Tejun Heo <tj@kernel.org>
          Cc: Vivek Goyal <vgoyal@redhat.com>
          Signed-off-by: Jens Axboe <axboe@kernel.dk>
      
      It seems like the init of ioc->nr_tasks was removed in that patch,
      so it starts out at 0 instead of 1.
      
      Tejun, is the right thing here to add back the init, or should something else
      be done?
      
      The below patch removes the warning, but I haven't done any more extensive
      testing on it.
      Signed-off-by: Olof Johansson <olof@lixom.net>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4638a83e
    • M
      block: do not artificially constrain max_sectors for stacking drivers · fe86cdce
      Mike Snitzer committed
      blk_set_stacking_limits is intended to allow stacking drivers to build
      up the limits of the stacked device based on the underlying devices'
      limits.  But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024)
      doesn't allow the stacking driver to inherit a max_sectors larger than
      1024 -- due to blk_stack_limits' use of min_not_zero.
      
      It is now clear that this artificial limit is getting in the way so
      change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows
      stacking drivers like dm-multipath to inherit 'max_sectors' from the
      underlying paths).
      Reported-by: Vijay Chauhan <vijay.chauhan@netapp.com>
      Tested-by: Vijay Chauhan <vijay.chauhan@netapp.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fe86cdce
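Why the initial value matters in fe86cdce above can be shown with a small userspace sketch of min_not_zero-style stacking. The function names `demo_min_not_zero()` and `demo_stack_max_sectors()` are hypothetical; only the "smallest non-zero value wins" rule is taken from the commit message.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Smaller of two limits, treating zero as "no limit set". */
static unsigned int demo_min_not_zero(unsigned int a, unsigned int b)
{
    if (a == 0)
        return b;
    if (b == 0)
        return a;
    return a < b ? a : b;
}

/*
 * Fold the max_sectors of each underlying device into the stacked
 * device's limit, starting from `initial` (e.g. 1024 or UINT_MAX).
 */
static unsigned int demo_stack_max_sectors(unsigned int initial,
                                           const unsigned int *bottom,
                                           size_t n)
{
    unsigned int top = initial;
    size_t i;

    assert(bottom != NULL || n == 0);
    for (i = 0; i < n; i++)
        top = demo_min_not_zero(top, bottom[i]);
    return top;
}
```

With underlying limits of 2048 and 4096, an initial value of 1024 caps the stacked device at 1024, while an initial value of `UINT_MAX` lets it inherit 2048.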
  15. 31 Jul, 2012 3 commits
  16. 20 Jul, 2012 1 commit
  17. 27 Jun, 2012 1 commit
    • T
      blkcg: implement per-blkg request allocation · a051661c
      Tejun Heo committed
      Currently, request_queue has one request_list to allocate requests
      from regardless of blkcg of the IO being issued.  When the unified
      request pool is used up, cfq proportional IO limits become meaningless
      - whoever grabs the next request being freed wins the race regardless
      of the configured weights.
      
      This can be easily demonstrated by creating a blkio cgroup w/ very low
      weight, putting a program which can issue a lot of random direct IOs there
      and running a sequential IO from a different cgroup.  As soon as the
      request pool is used up, the sequential IO bandwidth crashes.
      
      This patch implements per-blkg request_list.  Each blkg has its own
      request_list and any IO allocates its request from the matching blkg
      making blkcgs completely isolated in terms of request allocation.
      
      * Root blkcg uses the request_list embedded in each request_queue,
        which was renamed to @q->root_rl from @q->rq.  While making blkcg rl
        handling a bit hairier, this enables avoiding most overhead for root
        blkcg.
      
      * Queue fullness is properly per request_list but bdi isn't blkcg
        aware yet, so congestion state currently just follows the root
        blkcg.  As writeback isn't aware of blkcg yet, this works okay for
        async congestion but readahead may get the wrong signals.  It's
        better than blkcg completely collapsing with shared request_list but
        needs to be improved with future changes.
      
      * After this change, each block cgroup gets a full request pool making
        resource consumption of each cgroup higher.  This makes allowing
        non-root users to create cgroups less desirable; however, note that
        allowing non-root users to directly manage cgroups is already
        severely broken regardless of this patch - each block cgroup
        consumes kernel memory and skews IO weight (IO weights are not
        hierarchical).
      
      v2: queue-sysfs.txt updated and patch description updated as suggested
          by Vivek.
      
      v3: blk_get_rl() wasn't checking error return from
          blkg_lookup_create() and may cause oops on lookup failure.  Fix it
          by falling back to root_rl on blkg lookup failures.  This problem
          was spotted by Rakesh Iyer <rni@google.com>.
      
      v4: Updated to accommodate 458f27a9 "block: Avoid missed wakeup in
          request waitqueue".  blk_drain_queue() now wakes up waiters on all
          blkg->rl on the target queue.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a051661c
  18. 25 Jun, 2012 5 commits
    • T
      block: prepare for multiple request_lists · 5b788ce3
      Tejun Heo committed
      Request allocation is about to be made per-blkg meaning that there'll
      be multiple request lists.
      
      * Make queue full state per request_list.  blk_*queue_full() functions
        are renamed to blk_*rl_full() and take @rl instead of @q.
      
      * Rename blk_init_free_list() to blk_init_rl() and make it take @rl
        instead of @q.  Also add @gfp_mask parameter.
      
      * Add blk_exit_rl() instead of destroying rl directly from
        blk_release_queue().
      
      * Add request_list->q and make request alloc/free functions -
        blk_free_request(), [__]freed_request(), __get_request() - take @rl
        instead of @q.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5b788ce3
    • T
      block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv · 8a5ecdd4
      Tejun Heo committed
      Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and
      move q->rq.elvpriv to q->nr_rqs_elvpriv.  blk_drain_queue() is updated
      to use q->nr_rqs[] instead of q->rq.count[].
      
      These counters separate queue-wide request statistics from the
      request list and allow implementation of per-queue request allocation.
      
      While at it, properly indent fields of struct request_list.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8a5ecdd4
    • T
      blkcg: inline bio_blkcg() and friends · b1208b56
      Tejun Heo committed
      Make bio_blkcg() and friends inline.  They all are very simple and
      used only in a few places.
      
      This patch is to prepare for further updates to request allocation
      path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b1208b56
    • T
      block: allocate io_context upfront · 7f4b35d1
      Tejun Heo committed
      The block layer allocates ioc very lazily.  It waits until the moment
      ioc is absolutely necessary; unfortunately, that time could be inside
      queue lock and __get_request() performs unlock - try alloc - retry
      dancing.
      
      Just allocate it up-front on entry to block layer.  We're not saving
      the rain forest by deferring it to the last possible moment and
      complicating things unnecessarily.
      
      This patch is to prepare for further updates to request allocation
      path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7f4b35d1
    • T
      block: refactor get_request[_wait]() · a06e05e6
      Tejun Heo committed
      Currently, there are two request allocation functions - get_request()
      and get_request_wait().  The former tries to allocate a request once
      and the latter keeps retrying until it succeeds.  The latter wraps the
      former and keeps retrying until allocation succeeds.
      
      The combination of the two functions delivers fallible non-wait allocation,
      fallible wait allocation and unfailing wait allocation.  However,
      given that forward progress is guaranteed, fallible wait allocation
      isn't all that useful and in fact nobody uses it.
      
      This patch simplifies the interface as follows.
      
      * get_request() is renamed to __get_request() and is only used by the
        wrapper function.
      
      * get_request_wait() is renamed to get_request().  It now takes
        @gfp_mask and retries iff it contains %__GFP_WAIT.
      
      This patch doesn't introduce any functional change and is to prepare
      for further updates to request allocation path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a06e05e6
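The interface shape described in a06e05e6 above, a single-attempt inner function wrapped by a function that retries only when the mask allows waiting, can be sketched in userspace C. Everything here is a hypothetical stand-in: `DEMO_GFP_WAIT` plays the role of %__GFP_WAIT, `struct demo_pool` simulates an allocator that fails a fixed number of times, and the real code would sleep between attempts rather than spin.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_GFP_WAIT 0x1u /* hypothetical stand-in for __GFP_WAIT */

/* Simulated allocator: fails `failures_left` times, then succeeds. */
struct demo_pool {
    int failures_left;
    int attempts;
};

struct demo_request {
    int id;
};

static struct demo_request demo_rq = { 1 };

/* Inner function: exactly one allocation attempt, may return NULL. */
static struct demo_request *demo_get_request_once(struct demo_pool *pool)
{
    assert(pool != NULL);
    pool->attempts++;
    if (pool->failures_left > 0) {
        pool->failures_left--;
        return NULL;
    }
    return &demo_rq;
}

/* Wrapper: retries if and only if the mask contains DEMO_GFP_WAIT. */
static struct demo_request *demo_get_request(struct demo_pool *pool,
                                             unsigned int gfp_mask)
{
    for (;;) {
        struct demo_request *rq = demo_get_request_once(pool);

        if (rq != NULL || !(gfp_mask & DEMO_GFP_WAIT))
            return rq;
        /* A real implementation would sleep here until a request frees up. */
    }
}
```

With a pool that fails twice, a call without `DEMO_GFP_WAIT` returns NULL after one attempt, while a call with it succeeds on the third attempt.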