1. 02 4月, 2012 5 次提交
    • T
      blkcg: cfq doesn't need per-cpu dispatch stats · 41b38b6d
      Tejun Heo 提交于
      blkio_group_stats_cpu is used to count dispatch stats using per-cpu
      counters.  This is used by both blk-throtl and cfq-iosched but the
      sharing is rather silly.
      
      * cfq-iosched doesn't need per-cpu dispatch stats.  cfq always updates
        those stats while holding queue_lock.
      
      * blk-throtl needs per-cpu dispatch stats but only service_bytes and
        serviced.  It doesn't make use of sectors.
      
      This patch makes cfq add and use global stats for service_bytes,
      serviced and sectors, removes per-cpu sectors counter and moves
      per-cpu stat printing code to blk-throttle.c.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      41b38b6d
    • T
      blkcg: move statistics update code to policies · 629ed0b1
      Tejun Heo 提交于
      As with conf/stats file handling code, there's no reason for stat
      update code to live in blkcg core with policies calling into update
      them.  The current organization is both inflexible and complex.
      
      This patch moves stat update code to specific policies.  All
      blkiocg_update_*_stats() functions which deal with BLKIO_POLICY_PROP
      stats are collapsed into their cfq_blkiocg_update_*_stats()
      counterparts.  blkiocg_update_dispatch_stats() is used by both
      policies and duplicated as throtl_update_dispatch_stats() and
      cfq_blkiocg_update_dispatch_stats().  This will be cleaned up later.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      629ed0b1
    • T
      cfq: collapse cfq.h into cfq-iosched.c · 2ce4d50f
      Tejun Heo 提交于
      block/cfq.h contains some functions which interact with blkcg;
      however, this is only part of it and cfq-iosched.c already has quite
      some #ifdef CONFIG_CFQ_GROUP_IOSCHED.  With conf/stat handling being
      moved to specific policies, having these relay functions isolated in
      cfq.h doesn't make much sense.  Collapse cfq.h into cfq-iosched.c for
      now.  Let's split blkcg support properly later if necessary.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      2ce4d50f
    • T
      blkcg: move conf/stat file handling code to policies · 60c2bc2d
      Tejun Heo 提交于
      blkcg conf/stat handling is convoluted in that details which belong to
      specific policy implementations are all out in blkcg core and then
      policies hook into core layer to access and manipulate confs and
      stats.  This sadly achieves both inflexibility (confs/stats can't be
      modified without messing with blkcg core) and complexity (all the
      call-ins and call-backs).
      
      The previous patches restructured conf and stat handling code such
      that they can be separated out.  This patch relocates the file
      handling part.  All conf/stat file handling code which belongs to
      BLKIO_POLICY_PROP is moved to cfq-iosched.c and all
      BKLIO_POLICY_THROTL code to blk-throtl.c.
      
      The move is verbatim except for blkio_update_group_{weight|bps|iops}()
      callbacks which relays conf changes to policies.  The configuration
      settings are handled in policies themselves so the relaying isn't
      necessary.  Conf setting functions are modified to directly call
      per-policy update functions and the relaying mechanism is dropped.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      60c2bc2d
    • T
      blkcg: remove unused @pol and @plid parameters · aaec55a0
      Tejun Heo 提交于
      @pol to blkg_to_pdata() and @plid to blkg_lookup_create() are no
      longer necessary.  Drop them.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      aaec55a0
  2. 23 3月, 2012 1 次提交
    • T
      cfq: fix cfqg ref handling when BLK_CGROUP && !CFQ_GROUP_IOSCHED · eb7d8c07
      Tejun Heo 提交于
      When BLK_CGROUP is enabled but CFQ_GROUP_IOSCHED is, cfq ends up
      calling blkg_get/put() on dummy cfqg leading to the following crash.
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
        IP: [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430
        PGD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        CPU 0
        Modules linked in:
      
        Pid: 1, comm: swapper/0 Not tainted 3.3.0-rc6-work+ #125 Bochs Bochs
        RIP: 0010:[<ffffffff813d44d8>]  [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430
        RSP: 0018:ffff88001f9dfd80  EFLAGS: 00010046
        RAX: ffff88001aefbbf0 RBX: ffff88001aeedbf0 RCX: 0000000000000100
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff820ffd40
        RBP: ffff88001f9dfdd0 R08: 0000000000000000 R09: 0000000000000001
        R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
        R13: 0000000000000009 R14: ffff88001aefbc30 R15: 0000000000000003
        FS:  0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00000000000000b0 CR3: 000000000206f000 CR4: 00000000000006f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process swapper/0 (pid: 1, threadinfo ffff88001f9de000, task ffff88001f9dc040)
        Stack:
         ffff88001aeedbf0 ffff88001aefbdb0 ffff88001aef1548 ffff88001aefbbf0
         ffff88001f9dfdd0 ffff88001aef1548 ffffffff820d6320 ffffffff8165ce30
         ffffffff82c555e0 ffff88001aeebbf0 ffff88001f9dfe00 ffffffff813b0507
        Call Trace:
         [<ffffffff813b0507>] elevator_init+0xd7/0x140
         [<ffffffff813b83d5>] blk_init_allocated_queue+0x125/0x150
         [<ffffffff813b94d3>] blk_init_queue_node+0x43/0x80
         [<ffffffff813b9523>] blk_init_queue+0x13/0x20
         [<ffffffff821aec00>] floppy_init+0x82/0xec7
         [<ffffffff810001d2>] do_one_initcall+0x42/0x170
         [<ffffffff821835fc>] kernel_init+0xcb/0x14f
         [<ffffffff81b40b24>] kernel_thread_helper+0x4/0x10
        Code: 00 e8 1d 9e 76 00 48 8b 43 48 48 85 c0 48 89 83 28 03 00 00 74 07 4c 8b a0 10 ff ff ff 8b 15 b0 2e d0 00 85 d2 0f 85 49 01 00 00 <41> 8b 84 24 b0 00 00 00 85 c0 0f 8e 8c 01 00 00 83 e8 01 85 c0
        RIP  [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430
      
      Because cfq's blkcg support has a on/off switch, CFQ_GROUP_IOSCHED,
      separate from BLK_CGROUP, blkg access through cfqg needs to be
      conditioned on it.
      
      * Make blkg_to_cfqg() and cfqg_to_blkg() conditioned on
        CFQ_GROUP_IOSCHED.  If disabled, they always return %NULL.
      
      * Introduce cfqg_get() and cfqg_put() conditioned on
        CFQ_GROUP_IOSCHED.  If disabled, they are noops.
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      eb7d8c07
  3. 20 3月, 2012 2 次提交
    • T
      cfq: don't use icq_get_changed() · 598971bf
      Tejun Heo 提交于
      cfq caches the associated cfqq's for a given cic.  The cache needs to
      be flushed if the cic's ioprio or blkcg has changed.  It is currently
      done by requiring the changing action to set the respective
      ICQ_*_CHANGED bit in the icq and testing it from cfq_set_request(),
      which involves iterating through all the affected icqs.
      
      All cfq wants to know is whether ioprio and/or blkcg have changed
      since the last flush and can be easily achieved by just remembering
      the current ioprio and blkcg ID in cic.
      
      This patch adds cic->{ioprio|blkcg_id}, updates all ioprio users to
      use the remembered value instead, and updates cfq_set_request() path
      such that, instead of using icq_get_changed(), the current values are
      compared against the remembered ones and trigger appropriate flush
      action if not.  Condition tests are moved inside both _changed
      functions which are now named check_ioprio_changed() and
      check_blkcg_changed().
      
      ioprio.h::task_ioprio*() can't be used anymore and replaced with
      open-coded IOPRIO_CLASS_NONE case in cfq_async_queue_prio().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      598971bf
    • T
      cfq: pass around cfq_io_cq instead of io_context · abede6da
      Tejun Heo 提交于
      Now that io_cq is managed by block core and guaranteed to exist for
      any in-flight request, it is easier and carries more information to
      pass around cfq_io_cq than io_context.
      
      This patch updates cfq_init_prio_data(), cfq_find_alloc_queue() and
      cfq_get_queue() to take @cic instead of @ioc.  This change removes a
      duplicate cfq_cic_lookup() from cfq_find_alloc_queue().
      
      This change enables the use of cic-cached ioprio in the next patch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      abede6da
  4. 07 3月, 2012 21 次提交
    • T
      block: make block cgroup policies follow bio task association · 4f85cb96
      Tejun Heo 提交于
      Implement bio_blkio_cgroup() which returns the blkcg associated with
      the bio if exists or %current's blkcg, and use it in blk-throttle and
      cfq-iosched propio.  This makes both cgroup policies honor task
      association for the bio instead of always assuming %current.
      
      As nobody is using bio_set_task() yet, this doesn't introduce any
      behavior change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4f85cb96
    • T
      block: implement bio_associate_current() · 852c788f
      Tejun Heo 提交于
      IO scheduling and cgroup are tied to the issuing task via io_context
      and cgroup of %current.  Unfortunately, there are cases where IOs need
      to be routed via a different task which makes scheduling and cgroup
      limit enforcement applied completely incorrectly.
      
      For example, all bios delayed by blk-throttle end up being issued by a
      delayed work item and get assigned the io_context of the worker task
      which happens to serve the work item and dumped to the default block
      cgroup.  This is double confusing as bios which aren't delayed end up
      in the correct cgroup and makes using blk-throttle and cfq propio
      together impossible.
      
      Any code which punts IO issuing to another task is affected which is
      getting more and more common (e.g. btrfs).  As both io_context and
      cgroup are firmly tied to task including userland visible APIs to
      manipulate them, it makes a lot of sense to match up tasks to bios.
      
      This patch implements bio_associate_current() which associates the
      specified bio with %current.  The bio will record the associated ioc
      and blkcg at that point and block layer will use the recorded ones
      regardless of which task actually ends up issuing the bio.  bio
      release puts the associated ioc and blkcg.
      
      It grabs and remembers ioc and blkcg instead of the task itself
      because task may already be dead by the time the bio is issued making
      ioc and blkcg inaccessible and those are all block layer cares about.
      
      elevator_set_req_fn() is updated such that the bio elvdata is being
      allocated for is available to the elevator.
      
      This doesn't update block cgroup policies yet.  Further patches will
      implement the support.
      
      -v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in
           rq_ioc() to fix build breakage.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      852c788f
    • T
      block: add io_context->active_ref · f6e8d01b
      Tejun Heo 提交于
      Currently ioc->nr_tasks is used to decide two things - whether an ioc
      is done issuing IOs and whether it's shared by multiple tasks.  This
      patch separate out the first into ioc->active_ref, which is acquired
      and released using {get|put}_io_context_active() respectively.
      
      This will be used to associate bio's with a given task.  This patch
      doesn't introduce any visible behavior change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f6e8d01b
    • T
      blkcg: drop unnecessary RCU locking · c875f4d0
      Tejun Heo 提交于
      Now that blkg additions / removals are always done under both q and
      blkcg locks, the only places RCU locking is necessary are
      blkg_lookup[_create]() for lookup w/o blkcg lock.  This patch drops
      unncessary RCU locking replacing it with plain blkcg locking as
      necessary.
      
      * blkiocg_pre_destroy() already perform proper locking and don't need
        RCU.  Dropped.
      
      * blkio_read_blkg_stats() now uses blkcg->lock instead of RCU read
        lock.  This isn't a hot path.
      
      * Now unnecessary synchronize_rcu() from queue exit paths removed.
        This makes q->nr_blkgs unnecessary.  Dropped.
      
      * RCU annotation on blkg->q removed.
      
      -v2: Vivek pointed out that blkg_lookup_create() still needs to be
           called under rcu_read_lock().  Updated.
      
      -v3: After the update, stats_lock locking in blkio_read_blkg_stats()
           shouldn't be using _irq variant as it otherwise ends up enabling
           irq while blkcg->lock is locked.  Fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c875f4d0
    • T
      blkcg: unify blkg's for blkcg policies · e8989fae
      Tejun Heo 提交于
      Currently, blkg is per cgroup-queue-policy combination.  This is
      unnatural and leads to various convolutions in partially used
      duplicate fields in blkg, config / stat access, and general management
      of blkgs.
      
      This patch make blkg's per cgroup-queue and let them serve all
      policies.  blkgs are now created and destroyed by blkcg core proper.
      This will allow further consolidation of common management logic into
      blkcg core and API with better defined semantics and layering.
      
      As a transitional step to untangle blkg management, elvswitch and
      policy [de]registration, all blkgs except the root blkg are being shot
      down during elvswitch and bypass.  This patch adds blkg_root_update()
      to update root blkg in place on policy change.  This is hacky and racy
      but should be good enough as interim step until we get locking
      simplified and switch over to proper in-place update for all blkgs.
      
      -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc()
           comment wasn't updated according to the function change.  Fixed.
           Both pointed out by Vivek.
      
      -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for
           all policies.  This freed root pd during elvswitch before the
           last queue finished exiting and led to oops.  Directly invoke
           update_root_blkg_pd() only on BLKIO_POLICY_PROP from
           cfq_exit_queue().  This also is closer to what will be done with
           proper in-place blkg update.  Reported by Vivek.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e8989fae
    • T
      blkcg: let blkcg core manage per-queue blkg list and counter · 03aa264a
      Tejun Heo 提交于
      With the previous patch to move blkg list heads and counters to
      request_queue and blkg, logic to manage them in both policies are
      almost identical and can be moved to blkcg core.
      
      This patch moves blkg link logic into blkg_lookup_create(), implements
      common blkg unlink code in blkg_destroy(), and updates
      blkg_destory_all() so that it's policy specific and can skip root
      group.  The updated blkg_destroy_all() is now used to both clear queue
      for bypassing and elv switching, and release all blkgs on q exit.
      
      This patch introduces a race window where policy [de]registration may
      race against queue blkg clearing.  This can only be a problem on cfq
      unload and shouldn't be a real problem in practice (and we have many
      other places where this race already exists).  Future patches will
      remove these unlikely races.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      03aa264a
    • T
      blkcg: move per-queue blkg list heads and counters to queue and blkg · 4eef3049
      Tejun Heo 提交于
      Currently, specific policy implementations are responsible for
      maintaining list and number of blkgs.  This duplicates code
      unnecessarily, and hinders factoring common code and providing blkcg
      API with better defined semantics.
      
      After this patch, request_queue hosts list heads and counters and blkg
      has list nodes for both policies.  This patch only relocates the
      necessary fields and the next patch will actually move management code
      into blkcg core.
      
      Note that request_queue->blkg_list[] and ->nr_blkgs[] are hardcoded to
      have 2 elements.  This is to avoid include dependency and will be
      removed by the next patch.
      
      This patch doesn't introduce any behavior change.
      
      -v2: Now unnecessary conditional on CONFIG_BLK_CGROUP_MODULE removed
           as pointed out by Vivek.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      4eef3049
    • T
      blkcg: don't use blkg->plid in stat related functions · c1768268
      Tejun Heo 提交于
      blkg is scheduled to be unified for all policies and thus there won't
      be one-to-one mapping from blkg to policy.  Update stat related
      functions to take explicit @pol or @plid arguments and not use
      blkg->plid.
      
      This is painful for now but most of specific stat interface functions
      will be replaced with a handful of generic helpers.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c1768268
    • T
      blkcg: move refcnt to blkcg core · 1adaf3dd
      Tejun Heo 提交于
      Currently, blkcg policy implementations manage blkg refcnt duplicating
      mostly identical code in both policies.  This patch moves refcnt to
      blkg and let blkcg core handle refcnt and freeing of blkgs.
      
      * cfq blkgs now also get freed via RCU.
      
      * cfq blkgs lose RB_EMPTY_ROOT() sanity check on blkg free.  If
        necessary, we can add blkio_exit_group_fn() to resurrect this.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1adaf3dd
    • T
      blkcg: let blkcg core handle policy private data allocation · 0381411e
      Tejun Heo 提交于
      Currently, blkg's are embedded in private data blkcg policy private
      data structure and thus allocated and freed by policies.  This leads
      to duplicate codes in policies, hinders implementing common part in
      blkcg core with strong semantics, and forces duplicate blkg's for the
      same cgroup-q association.
      
      This patch introduces struct blkg_policy_data which is a separate data
      structure chained from blkg.  Policies specifies the amount of private
      data it needs in its blkio_policy_type->pdata_size and blkcg core
      takes care of allocating them along with blkg which can be accessed
      using blkg_to_pdata().  blkg can be determined from pdata using
      pdata_to_blkg().  blkio_alloc_group_fn() method is accordingly updated
      to blkio_init_group_fn().
      
      For consistency, tg_of_blkg() and cfqg_of_blkg() are replaced with
      blkg_to_tg() and blkg_to_cfqg() respectively, and functions to map in
      the reverse direction are added.
      
      Except that policy specific data now lives in a separate data
      structure from blkg, this patch doesn't introduce any functional
      difference.
      
      This will be used to unify blkg's for different policies.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0381411e
    • T
      blkcg: let blkio_group point to blkio_cgroup directly · 7ee9c562
      Tejun Heo 提交于
      Currently, blkg points to the associated blkcg via its css_id.  This
      unnecessarily complicates dereferencing blkcg.  Let blkg hold a
      reference to the associated blkcg and point directly to it and disable
      css_id on blkio_subsys.
      
      This change requires splitting blkiocg_destroy() into
      blkiocg_pre_destroy() and blkiocg_destroy() so that all blkg's can be
      destroyed and all the blkcg references held by them dropped during
      cgroup removal.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7ee9c562
    • T
      blkcg: kill the mind-bending blkg->dev · 7a4dd281
      Tejun Heo 提交于
      blkg->dev is dev_t recording the device number of the block device for
      the associated request_queue.  It is used to identify the associated
      block device when printing out configuration or stats.
      
      This is redundant to begin with.  A blkg is an association between a
      cgroup and a request_queue and it of course is possible to reach
      request_queue from blkg and synchronization conventions are in place
      for safe q dereferencing, so this shouldn't be necessary from the
      beginning.  Furthermore, it's initialized by sscanf()ing the device
      name of backing_dev_info.  The mind boggles.
      
      Anyways, if blkg is visible under rcu lock, we *know* that the
      associated request_queue hasn't gone away yet and its bdi is
      registered and alive - blkg can't be created for request_queue which
      hasn't been fully initialized and it can't go away before blkg is
      removed.
      
      Let stat and conf read functions get device name from
      blkg->q->backing_dev_info.dev and pass it down to printing functions
      and remove blkg->dev.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7a4dd281
    • T
      blkcg: don't allow or retain configuration of missing devices · e56da7e2
      Tejun Heo 提交于
      blkcg is very peculiar in that it allows setting and remembering
      configurations for non-existent devices by maintaining separate data
      structures for configuration.
      
      This behavior is completely out of the usual norms and outright
      confusing; furthermore, it uses dev_t number to match the
      configuration to devices, which is unpredictable to begin with and
      becomes completely unuseable if EXT_DEVT is fully used.
      
      It is wholely unnecessary - we already have fully functional userland
      mechanism to program devices being hotplugged which has full access to
      device identification, connection topology and filesystem information.
      
      Add a new struct blkio_group_conf which contains all blkcg
      configurations to blkio_group and let blkio_group, which can be
      created iff the associated device exists and is removed when the
      associated device goes away, carry all configurations.
      
      Note that, after this patch, all newly created blkg's will always have
      the default configuration (unlimited for throttling and blkcg's weight
      for propio).
      
      This patch makes blkio_policy_node meaningless but doesn't remove it.
      The next patch will.
      
      -v2: Updated to retry after short sleep if blkg lookup/creation failed
           due to the queue being temporarily bypassed as indicated by
           -EBUSY return.  Pointed out by Vivek.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e56da7e2
    • T
      blkcg: factor out blkio_group creation · cd1604fa
      Tejun Heo 提交于
      Currently both blk-throttle and cfq-iosched implement their own
      blkio_group creation code in throtl_get_tg() and cfq_get_cfqg().  This
      patch factors out the common code into blkg_lookup_create(), which
      returns ERR_PTR value so that transitional failures due to queue
      bypass can be distinguished from other failures.
      
      * New plkio_policy_ops methods blkio_alloc_group_fn() and
        blkio_link_group_fn added.  Both are transitional and will be
        removed once the blkg management code is fully moved into
        blk-cgroup.c.
      
      * blkio_alloc_group_fn() allocates policy-specific blkg which is
        usually a larger data structure with blkg as the first entry and
        intiailizes it.  Note that initialization of blkg proper, including
        percpu stats, is responsibility of blk-cgroup proper.
      
        Note that default config (weight, bps...) initialization is done
        from this method; otherwise, we end up violating locking order
        between blkcg and q locks via blkcg_get_CONF() functions.
      
      * blkio_link_group_fn() is called under queue_lock and responsible for
        linking the blkg to the queue.  blkcg side is handled by blk-cgroup
        proper.
      
      * The common blkg creation function is named blkg_lookup_create() and
        blkiocg_lookup_group() is renamed to blkg_lookup() for consistency.
        Also, throtl / cfq related functions are similarly [re]named for
        consistency.
      
      This simplifies blkcg policy implementations and enables further
      cleanup.
      
      -v2: Vivek noticed that blkg_lookup_create() incorrectly tested
           blk_queue_dead() instead of blk_queue_bypass() leading a user of
           the function ending up creating a new blkg on bypassing queue.
           This is a bug introduced while relocating bypass patches before
           this one.  Fixed.
      
      -v3: ERR_PTR patch folded into this one.  @for_root added to
           blkg_lookup_create() to allow creating root group on a bypassed
           queue during elevator switch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      cd1604fa
    • T
      blkcg: use the usual get blkg path for root blkio_group · f51b802c
      Tejun Heo 提交于
      For root blkg, blk_throtl_init() was using throtl_alloc_tg()
      explicitly and cfq_init_queue() was manually initializing embedded
      cfqd->root_group, adding unnecessarily different code paths to blkg
      handling.
      
      Make both use the usual blkio_group get functions - throtl_get_tg()
      and cfq_get_cfqg() - for the root blkio_group too.  Note that
      blk_throtl_init() callsite is pushed downwards in
      blk_alloc_queue_node() so that @q is sufficiently initialized for
      throtl_get_tg().
      
      This simplifies root blkg handling noticeably for cfq and will allow
      further modularization of blkcg API.
      
      -v2: Vivek pointed out that using cfq_get_cfqg() won't work if
           CONFIG_CFQ_GROUP_IOSCHED is disabled.  Fix it by factoring out
           initialization of base part of cfqg into cfq_init_cfqg_base() and
           alloc/init/free explicitly if !CONFIG_CFQ_GROUP_IOSCHED.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f51b802c
    • T
      blkcg: use q and plid instead of opaque void * for blkio_group association · ca32aefc
      Tejun Heo 提交于
      blkgio_group is association between a block cgroup and a queue for a
      given policy.  Using opaque void * for association makes things
      confusing and hinders factoring of common code.  Use request_queue *
      and, if necessary, policy id instead.
      
      This will help block cgroup API cleanup.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ca32aefc
    • T
      blkcg: update blkg get functions take blkio_cgroup as parameter · 0a5a7d0e
      Tejun Heo 提交于
      In both blkg get functions - throtl_get_tg() and cfq_get_cfqg(),
      instead of obtaining blkcg of %current explicitly, let the caller
      specify the blkcg to use as parameter and make both functions hold on
      to the blkcg.
      
      This is part of block cgroup interface cleanup and will help making
      blkcg API more modular.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0a5a7d0e
    • T
      blkcg: move rcu_read_lock() outside of blkio_group get functions · 2a7f1244
      Tejun Heo 提交于
      rcu_read_lock() in throtl_get_tb() and cfq_get_cfqg() holds onto
      @blkcg while looking up blkg.  For API cleanup, the next patch will
      make the caller responsible for determining @blkcg to look blkg from
      and let them specify it as a parameter.  Move rcu read locking out to
      the callers to prepare for the change.
      
      -v2: Originally this patch was described as a fix for RCU read locking
           bug around @blkg, which Vivek pointed out to be incorrect.  It
           was from misunderstanding the role of rcu locking as protecting
           @blkg not @blkcg.  Patch description updated.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2a7f1244
    • T
      blkcg: shoot down blkio_groups on elevator switch · 72e06c25
      Tejun Heo 提交于
      Elevator switch may involve changes to blkcg policies.  Implement
      shoot down of blkio_groups.
      
      Combined with the previous bypass updates, the end goal is updating
      blkcg core such that it can ensure that blkcg's being affected become
      quiescent and don't have any per-blkg data hanging around before
      commencing any policy updates.  Until queues are made aware of the
      policies that applies to them, as an interim step, all per-policy blkg
      data will be shot down.
      
      * blk-throtl doesn't need this change as it can't be disabled for a
        live queue; however, update it anyway as the scheduled blkg
        unification requires this behavior change.  This means that
        blk-throtl configuration will be unnecessarily lost over elevator
        switch.  This oddity will be removed after blkcg learns to associate
        individual policies with request_queues.
      
      * blk-throtl dosen't shoot down root_tg.  This is to ease transition.
        Unified blkg will always have persistent root group and not shooting
        down root_tg for now eases transition to that point by avoiding
        having to update td->root_tg and is safe as blk-throtl can never be
        disabled
      
      -v2: Vivek pointed out that group list is not guaranteed to be empty
           on return from clear function if it raced cgroup removal and
           lost.  Fix it by waiting a bit and retrying.  This kludge will
           soon be removed once locking is updated such that blkg is never
           in limbo state between blkcg and request_queue locks.
      
           blk-throtl no longer shoots down root_tg to avoid breaking
           td->root_tg.
      
           Also, Nest queue_lock inside blkio_list_lock not the other way
           around to avoid introduce possible deadlock via blkcg lock.
      
      -v3: blkcg_clear_queue() repositioned and renamed to
           blkg_destroy_all() to increase consistency with later changes.
           cfq_clear_queue() updated to check q->elevator before
           dereferencing it to avoid NULL dereference on not fully
           initialized queues (used by later change).
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      72e06c25
    • T
      elevator: make elevator_init_fn() return 0/-errno · b2fab5ac
      Tejun Heo 提交于
      elevator_ops->elevator_init_fn() has a weird return value.  It returns
      a void * which the caller should assign to q->elevator->elevator_data
      and %NULL return denotes init failure.
      
      Update such that it returns integer 0/-errno and sets elevator_data
      directly as necessary.
      
      This makes the interface more conventional and eases further cleanup.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b2fab5ac
    • T
      cfq: don't register propio policy if !CONFIG_CFQ_GROUP_IOSCHED · b95ada55
      Tejun Heo 提交于
      cfq has been registering zeroed blkio_poilcy_cfq if CFQ_GROUP_IOSCHED
      is disabled.  This fortunately doesn't collide with blk-throtl as
      BLKIO_POLICY_PROP is zero but is unnecessary and risky.  Just don't
      register it if not enabled.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b95ada55
  5. 15 2月, 2012 1 次提交
  6. 08 2月, 2012 1 次提交
    • T
      block: don't call elevator callbacks for plug merges · 07c2bd37
      Tejun Heo 提交于
      Plug merge calls two elevator callbacks outside queue lock -
      elevator_allow_merge_fn() and elevator_bio_merged_fn().  Although
      attempt_plug_merge() suggests that elevator is guaranteed to be there
      through the existing request on the plug list, nothing prevents plug
      merge from calling into dying or initializing elevator.
      
      For regular merges, bypass ensures elvpriv count to reach zero, which
      in turn prevents merges as all !ELVPRIV requests get REQ_SOFTBARRIER
      from forced back insertion.  Plug merge doesn't check ELVPRIV, and, as
      the requests haven't gone through elevator insertion yet, it doesn't
      have SOFTBARRIER set allowing merges on a bypassed queue.
      
      This, for example, leads to the following crash during elevator
      switch.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
       IP: [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
       PGD 112cbc067 PUD 115d5c067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP
       CPU 1
       Modules linked in: deadline_iosched
      
       Pid: 819, comm: dd Not tainted 3.3.0-rc2-work+ #76 Bochs Bochs
       RIP: 0010:[<ffffffff813b34e9>]  [<ffffffff813b34e9>] cfq_allow_merge+0x49/0xa0
       RSP: 0018:ffff8801143a38f8  EFLAGS: 00010297
       RAX: 0000000000000000 RBX: ffff88011817ce28 RCX: ffff880116eb6cc0
       RDX: 0000000000000000 RSI: ffff880118056e20 RDI: ffff8801199512f8
       RBP: ffff8801143a3908 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000001 R11: 0000000000000000 R12: ffff880118195708
       R13: ffff880118052aa0 R14: ffff8801143a3d50 R15: ffff880118195708
       FS:  00007f19f82cb700(0000) GS:ffff88011fc80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
       CR2: 0000000000000008 CR3: 0000000112c6a000 CR4: 00000000000006e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
       Process dd (pid: 819, threadinfo ffff8801143a2000, task ffff880116eb6cc0)
       Stack:
        ffff88011817ce28 ffff880118195708 ffff8801143a3928 ffffffff81391bba
        ffff88011817ce28 ffff880118195708 ffff8801143a3948 ffffffff81391bf1
        ffff88011817ce28 0000000000000000 ffff8801143a39a8 ffffffff81398e3e
       Call Trace:
        [<ffffffff81391bba>] elv_rq_merge_ok+0x4a/0x60
        [<ffffffff81391bf1>] elv_try_merge+0x21/0x40
        [<ffffffff81398e3e>] blk_queue_bio+0x8e/0x390
        [<ffffffff81396a5a>] generic_make_request+0xca/0x100
        [<ffffffff81396b04>] submit_bio+0x74/0x100
        [<ffffffff811d45c2>] __blockdev_direct_IO+0x1ce2/0x3450
        [<ffffffff811d0dc7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff811460b5>] generic_file_aio_read+0x6d5/0x760
        [<ffffffff811986b2>] do_sync_read+0xe2/0x120
        [<ffffffff81199345>] vfs_read+0xc5/0x180
        [<ffffffff81199501>] sys_read+0x51/0x90
        [<ffffffff81aeac12>] system_call_fastpath+0x16/0x1b
      
      There are multiple ways to fix this including making plug merge check
      ELVPRIV; however,
      
      * Calling into elevator outside queue lock is confusing and
        error-prone.
      
      * Requests on plug list aren't known to the elevator.  They aren't on
        the elevator yet, so there's no elevator specific state to update.
      
      * Given the nature of plug merges - collecting bio's for the same
        purpose from the same issuer - elevator specific restrictions aren't
        applicable.
      
      So, simply don't call into elevator methods from plug merge by moving
      elv_bio_merged() from bio_attempt_*_merge() to blk_queue_bio(), and
      using blk_try_merge() in attempt_plug_merge().
      
      This is based on Jens' patch to skip elevator_allow_merge_fn() from
      plug merge.
      
      Note that this makes per-cgroup merged stats skip plug merging.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <4F16F3CA.90904@kernel.dk>
      Original-patch-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      07c2bd37
  7. 07 2月, 2012 1 次提交
    • T
      block: strip out locking optimization in put_io_context() · 11a3122f
      Tejun Heo 提交于
      put_io_context() performed a complex trylock dancing to avoid
      deferring ioc release to workqueue.  It was also broken on UP because
      trylock was always assumed to succeed which resulted in unbalanced
      preemption count.
      
      While there are ways to fix the UP breakage, even the most
      pathological microbench (forced ioc allocation and tight fork/exit
      loop) fails to show any appreciable performance benefit of the
      optimization.  Strip it out.  If there turns out to be workloads which
      are affected by this change, simpler optimization from the discussion
      thread can be applied later.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <1328514611.21268.66.camel@sli10-conroe>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      11a3122f
  8. 19 1月, 2012 1 次提交
  9. 18 1月, 2012 1 次提交
    • J
      cfq-iosched: fix use-after-free of cfqq · 54b466e4
      Jens Axboe 提交于
      With the changes in life time management between the cfq IO contexts
      and the cfq queues, we now risk having cfqd->active_queue being
      freed when cfq_slice_expired() is being called. cfq_preempt_queue()
      caches this queue and uses it after calling said function, causing
      a use-after-free condition. This triggers the following oops,
      when cfqq_type() attempts to dereference it:
      
      BUG: unable to handle kernel paging request at ffff8800746c4f0c
      IP: [<ffffffff81266d59>] cfqq_type+0xb/0x20
      PGD 18d4063 PUD 1fe15067 PMD 1ffb9067 PTE 80000000746c4160
      Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      CPU 3
      Modules linked in:
      
      Pid: 1, comm: init Not tainted 3.2.0-josef+ #367 Bochs Bochs
      RIP: 0010:[<ffffffff81266d59>]  [<ffffffff81266d59>] cfqq_type+0xb/0x20
      RSP: 0018:ffff880079c11778  EFLAGS: 00010046
      RAX: 0000000000000000 RBX: ffff880076f3df08 RCX: 0000000000000000
      RDX: 0000000000000006 RSI: ffff880074271888 RDI: ffff8800746c4f08
      RBP: ffff880079c11778 R08: 0000000000000078 R09: 0000000000000001
      R10: 09f911029d74e35b R11: 09f911029d74e35b R12: ffff880076f337f0
      R13: ffff8800746c4f08 R14: ffff8800746c4f08 R15: 0000000000000002
      FS:  00007f62fd44f700(0000) GS:ffff88007cd80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffff8800746c4f0c CR3: 0000000076c21000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process init (pid: 1, threadinfo ffff880079c10000, task ffff880079c0a040)
      Stack:
       ffff880079c117c8 ffffffff812683d8 ffff880079c117a8 ffffffff8125de43
       ffff8800744fcf48 ffff880074b43e98 ffff8800770c8828 ffff880074b43e98
       0000000000000003 0000000000000000 ffff880079c117f8 ffffffff81254149
      Call Trace:
       [<ffffffff812683d8>] cfq_insert_request+0x3f5/0x47c
       [<ffffffff8125de43>] ? blk_recount_segments+0x20/0x31
       [<ffffffff81254149>] __elv_add_request+0x1ca/0x200
       [<ffffffff8125aa99>] blk_queue_bio+0x2ef/0x312
       [<ffffffff81258f7b>] generic_make_request+0x9f/0xe0
       [<ffffffff8125907b>] submit_bio+0xbf/0xca
       [<ffffffff81136ec7>] submit_bh+0xdf/0xfe
       [<ffffffff81176d04>] ext3_bread+0x50/0x99
       [<ffffffff811785b3>] dx_probe+0x38/0x291
       [<ffffffff81178864>] ext3_dx_find_entry+0x58/0x219
       [<ffffffff81178ad5>] ext3_find_entry+0xb0/0x406
       [<ffffffff8110c4d5>] ? cache_alloc_debugcheck_after.isra.46+0x14d/0x1a0
       [<ffffffff8110cfbd>] ? kmem_cache_alloc+0xef/0x191
       [<ffffffff8117a330>] ext3_lookup+0x39/0xe1
       [<ffffffff81119461>] d_alloc_and_lookup+0x45/0x6c
       [<ffffffff8111ac41>] do_lookup+0x1e4/0x2f5
       [<ffffffff8111aef6>] link_path_walk+0x1a4/0x6ef
       [<ffffffff8111b557>] path_lookupat+0x59/0x5ea
       [<ffffffff8127406c>] ? __strncpy_from_user+0x30/0x5a
       [<ffffffff8111bce0>] do_path_lookup+0x23/0x59
       [<ffffffff8111cfd6>] user_path_at_empty+0x53/0x99
       [<ffffffff8107b37b>] ? remove_wait_queue+0x51/0x56
       [<ffffffff8111d02d>] user_path_at+0x11/0x13
       [<ffffffff811141f5>] vfs_fstatat+0x3a/0x64
       [<ffffffff8111425a>] vfs_stat+0x1b/0x1d
       [<ffffffff81114359>] sys_newstat+0x1a/0x33
       [<ffffffff81060e12>] ? task_stopped_code+0x42/0x42
       [<ffffffff815d6712>] system_call_fastpath+0x16/0x1b
      Code: 89 e6 48 89 c7 e8 fa ca fe ff 85 c0 74 06 4c 89 2b 41 b6 01 5b 44 89 f0 41 5c 41 5d 41 5e 5d c3 55 48 89 e5 66 66 66 66 90 31 c0 <8b> 57 04 f6 c6 01 74 0b 83 e2 20 83 fa 01 19 c0 83 c0 02 5d c3
      RIP  [<ffffffff81266d59>] cfqq_type+0xb/0x20
       RSP <ffff880079c11778>
      CR2: ffff8800746c4f0c
      
      Get rid of the caching of cfqd->active_queue, and reorder the
      check so that it happens before we expire the active queue.
      
      Thanks to Tejun for pin pointing the error location.
      Reported-by: NChris Mason <chris.mason@oracle.com>
      Tested-by: NChris Mason <chris.mason@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      54b466e4
  10. 16 12月, 2011 2 次提交
  11. 14 12月, 2011 4 次提交
    • T
      block, cfq: move icq creation and rq->elv.icq association to block core · f1f8cc94
      Tejun Heo 提交于
      Now block layer knows everything necessary to create and associate
      icq's with requests.  Move ioc_create_icq() to blk-ioc.c and update
      get_request() such that, if elevator_type->icq_size is set, requests
      are automatically associated with their matching icq's before
      elv_set_request().  io_context reference is also managed by block core
      on request alloc/free.
      
      * Only ioprio/cgroup changed handling remains from cfq_get_cic().
        Collapsed into cfq_set_request().
      
      * This removes queue kicking on icq allocation failure (for now).  As
        icq allocation failure is rare and the only effect of queue kicking
        achieved was possibily accelerating queue processing, this change
        shouldn't be noticeable.
      
        There is a larger underlying problem.  Unlike request allocation,
        icq allocation is not guaranteed to succeed eventually after
        retries.  The number of icq is unbound and thus mempool can't be the
        solution either.  This effectively adds allocation dependency on
        memory free path and thus possibility of deadlock.
      
        This usually wouldn't happen because icq allocation is not a hot
        path and, even when the condition triggers, it's highly unlikely
        that none of the writeback workers already has icq.
      
        However, this is still possible especially if elevator is being
        switched under high memory pressure, so we better get it fixed.
        Probably the only solution is just bypassing elevator and appending
        to dispatch queue on any elevator allocation failure.
      
      * Comment added to explain how icq's are managed and synchronized.
      
      This completes cleanup of io_context interface.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f1f8cc94
    • T
      block, cfq: restructure io_cq creation path for io_context interface cleanup · 9b84cacd
      Tejun Heo 提交于
      Add elevator_ops->elevator_init_icq_fn() and restructure
      cfq_create_cic() and rename it to ioc_create_icq().
      
      The new function expects its caller to pass in io_context, uses
      elevator_type->icq_cache, handles generic init, calls the new elevator
      operation for elevator specific initialization, and returns pointer to
      created or looked up icq.  This leaves cfq_icq_pool variable without
      any user.  Removed.
      
      This prepares for io_context interface cleanup and doesn't introduce
      any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9b84cacd
    • T
      block, cfq: move io_cq exit/release to blk-ioc.c · 7e5a8794
      Tejun Heo 提交于
      With kmem_cache managed by blk-ioc, io_cq exit/release can be moved to
      blk-ioc too.  The odd ->io_cq->exit/release() callbacks are replaced
      with elevator_ops->elevator_exit_icq_fn() with unlinking from both ioc
      and q, and freeing automatically handled by blk-ioc.  The elevator
      operation only need to perform exit operation specific to the elevator
      - in cfq's case, exiting the cfqq's.
      
      Also, clearing of io_cq's on q detach is moved to block core and
      automatically performed on elevator switch and q release.
      
      Because the q io_cq points to might be freed before RCU callback for
      the io_cq runs, blk-ioc code should remember to which cache the io_cq
      needs to be freed when the io_cq is released.  New field
      io_cq->__rcu_icq_cache is added for this purpose.  As both the new
      field and rcu_head are used only after io_cq is released and the
      q/ioc_node fields aren't, they are put into unions.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7e5a8794
    • T
      block, cfq: move icq cache management to block core · 3d3c2379
      Tejun Heo 提交于
      Let elevators set ->icq_size and ->icq_align in elevator_type and
      elv_register() and elv_unregister() respectively create and destroy
      kmem_cache for icq.
      
      * elv_register() now can return failure.  All callers updated.
      
      * icq caches are automatically named "ELVNAME_io_cq".
      
      * cfq_slab_setup/kill() are collapsed into cfq_init/exit().
      
      * While at it, minor indentation change for iosched_cfq.elevator_name
        for consistency.
      
      This will help moving icq management to block core.  This doesn't
      introduce any functional change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3d3c2379