1. April 2, 2012 (18 commits)
    • blkcg: move blkio_group_stats to cfq-iosched.c · 155fead9
      Tejun Heo authored
      blkio_group_stats contains only fields used by cfq and has no reason
      to be defined in blkcg core.
      
      * Move blkio_group_stats to cfq-iosched.c and rename it to cfqg_stats.
      
      * blkg_policy_data->stats is replaced with cfq_group->stats.
        blkg_prfill_[rw]stat() are updated to use offset against pd->pdata
        instead.
      
      * All related macros / functions are renamed so that they have cfqg_
        prefix and the unnecessary @pol arguments are dropped.
      
      * All stat functions now take cfq_group * instead of blkio_group *.
      
      * lockdep assertion on queue lock dropped.  Elevator runs under queue
        lock by default.  There isn't much to be gained by adding lockdep
        assertions at stat function level.
      
      * cfqg_stats_reset() implemented for blkio_reset_group_stats_fn method
        so that cfqg->stats can be reset.
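
      A rough sketch of the resulting shape (illustrative only - the field
      set is abbreviated and based on the description above, using the
      blkg_stat/blkg_rwstat types introduced earlier in the series):

        /* cfq-iosched.c -- sketch, not a verbatim copy of the patch */
        struct cfqg_stats {
                struct blkg_rwstat      service_bytes;  /* total bytes transferred */
                struct blkg_rwstat      serviced;       /* total IOs serviced */
                struct blkg_stat        time;           /* time spent on device */
                /* ... */
        };

        struct cfq_group {
                /* ... */
                struct cfqg_stats       stats;          /* replaces blkg_policy_data->stats */
        };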
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: add blkio_policy_ops operations for exit and stat reset · 9ade5ea4
      Tejun Heo authored
      Add blkio_policy_ops->blkio_exit_group_fn() and
      ->blkio_reset_group_stats_fn().  These will be used to further
      modularize blkcg policy implementation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: cfq doesn't need per-cpu dispatch stats · 41b38b6d
      Tejun Heo authored
      blkio_group_stats_cpu is used to count dispatch stats using per-cpu
      counters.  This is used by both blk-throtl and cfq-iosched but the
      sharing is rather silly.
      
      * cfq-iosched doesn't need per-cpu dispatch stats.  cfq always updates
        those stats while holding queue_lock.
      
      * blk-throtl needs per-cpu dispatch stats but only service_bytes and
        serviced.  It doesn't make use of sectors.
      
      This patch makes cfq add and use global stats for service_bytes,
      serviced and sectors, removes per-cpu sectors counter and moves
      per-cpu stat printing code to blk-throttle.c.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: move statistics update code to policies · 629ed0b1
      Tejun Heo authored
      As with the conf/stat file handling code, there's no reason for stat
      update code to live in blkcg core with policies calling in to update
      them.  The current organization is both inflexible and complex.
      
      This patch moves stat update code to specific policies.  All
      blkiocg_update_*_stats() functions which deal with BLKIO_POLICY_PROP
      stats are collapsed into their cfq_blkiocg_update_*_stats()
      counterparts.  blkiocg_update_dispatch_stats() is used by both
      policies and duplicated as throtl_update_dispatch_stats() and
      cfq_blkiocg_update_dispatch_stats().  This will be cleaned up later.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cfq: collapse cfq.h into cfq-iosched.c · 2ce4d50f
      Tejun Heo authored
      block/cfq.h contains some functions which interact with blkcg;
      however, this is only part of it and cfq-iosched.c already has quite
      a few #ifdef CONFIG_CFQ_GROUP_IOSCHED blocks.  With conf/stat handling being
      moved to specific policies, having these relay functions isolated in
      cfq.h doesn't make much sense.  Collapse cfq.h into cfq-iosched.c for
      now.  Let's split blkcg support properly later if necessary.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: move conf/stat file handling code to policies · 60c2bc2d
      Tejun Heo authored
      blkcg conf/stat handling is convoluted in that details which belong to
      specific policy implementations are all out in blkcg core and then
      policies hook into core layer to access and manipulate confs and
      stats.  This sadly achieves both inflexibility (confs/stats can't be
      modified without messing with blkcg core) and complexity (all the
      call-ins and call-backs).
      
      The previous patches restructured conf and stat handling code such
      that they can be separated out.  This patch relocates the file
      handling part.  All conf/stat file handling code which belongs to
      BLKIO_POLICY_PROP is moved to cfq-iosched.c and all
      BLKIO_POLICY_THROTL code to blk-throttle.c.
      
      The move is verbatim except for the blkio_update_group_{weight|bps|iops}()
      callbacks, which relay conf changes to policies.  The configuration
      settings are handled in policies themselves so the relaying isn't
      necessary.  Conf setting functions are modified to directly call
      per-policy update functions and the relaying mechanism is dropped.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: implement blkio_policy_type->cftypes · 44ea53de
      Tejun Heo authored
      Add blkiop->cftypes which is added and removed together with the
      policy.  This will be used to move conf/stat handling to the policies.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: export conf/stat helpers to prepare for reorganization · 829fdb50
      Tejun Heo authored
      conf/stat handling is about to be moved to policy implementation from
      blkcg core.  Export conf/stat helpers from blkcg core so that
      blk-throttle and cfq-iosched can use them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: simplify blkg_conf_prep() · 726fa694
      Tejun Heo authored
      blkg_conf_prep() implements "MAJ:MIN VAL" parsing manually, which is
      unnecessary.  Just use sscanf("%u:%u %llu").  This might not reject
      some malformed input (extra input at the end) but we don't care.
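
      A minimal sketch of the simplified parse (variable names illustrative):

        unsigned int major, minor;
        unsigned long long v;

        if (sscanf(buf, "%u:%u %llu", &major, &minor, &v) != 3)
                return -EINVAL;
        /* resolve major:minor to a disk and hand v to the policy ... */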
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: restructure blkio_group configuration setting · 3a8b31d3
      Tejun Heo authored
      As part of userland interface restructuring, this patch updates
      per-blkio_group configuration setting.  Instead of funneling
      everything through a master function which has hard-coded cases for
      each config file it may handle, the common part is factored into
      blkg_conf_prep() and blkg_conf_finish() and different configuration
      setters are implemented using the helpers.
      
      While this doesn't result in immediate LOC reduction, this enables
      further cleanups and more modular implementation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: restructure configuration printing · c4682aec
      Tejun Heo authored
      Similarly to the previous stat restructuring, this patch restructures
      conf printing code such that,
      
      * Conf printing uses the same helpers as stat.
      
      * Printing function doesn't require hardcoded switching on the config
        being printed.  Note that this isn't complete yet for throttle
        confs.  The next patch will convert setting for these confs and will
        complete the transition.
      
      * Printing uses read_seq_string callback (other methods will be phased
        out).
      
      Note that blkio_group_conf.iops[2] is changed to u64 so that the
      entries can be manipulated with the same functions.  This is transitional and will
      go away later.
      
      After this patch, per-device configurations - weight, bps and iops -
      use __blkg_prfill_u64() for printing which uses white space as
      delimiter instead of tab.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: drop blkiocg_file_write_u64() · 627f29f4
      Tejun Heo authored
      blkiocg_file_write_u64() has a single switch case.  Drop
      blkiocg_file_write_u64(), rename blkio_weight_write() to
      blkcg_set_weight() and use it directly for .write_u64 callback.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: restructure statistics printing · d3d32e69
      Tejun Heo authored
      blkcg stats handling is a mess.  None of the stats has much to do with
      blkcg core but they are all implemented in blkcg core.  Code sharing
      is achieved by mixing common code with hard-coded cases for each stat
      counter.
      
      This patch restructures statistics printing such that
      
      * Common logic exists as helper functions and specific print functions
        use the helpers to implement specific cases.
      
      * Printing functions serving multiple counters don't require hardcoded
        switching on specific counters.
      
      * Printing uses read_seq_string callback (other methods will be phased
        out).
      
      This change enables further cleanups and relocating stats code to the
      policy implementation it belongs to.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: introduce blkg_stat and blkg_rwstat · edcb0722
      Tejun Heo authored
      blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values
      on 32bit archs and some stat counters have subtypes to distinguish
      read/writes and sync/async IOs.  The stat code paths are confusing and
      involve a lot of going back and forth between blkcg core and specific
      policy implementations, and synchronization and subtype handling are
      open coded in blkcg core.
      
      This patch introduces struct blkg_stat and blkg_rwstat which, with
      accompanying operations, encapsulate stat updating and accessing with
      proper synchronization.
      
      blkg_stat is a simple u64 counter with 64bit read-access protection.
      blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw
      flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and
      replaces stat_sub_type indexed arrays.
      
      All counters in blkio_group_stats and blkio_group_stats_cpu are
      replaced with either blkg_stat or blkg_rwstat along with all users.
      
      This does add one u64_stats_sync per counter and increase stats_sync
      operations but they're empty/noops on 64bit archs and blkcg doesn't
      have too many counters, especially with DEBUG_BLK_CGROUP off.
      
      While the currently resulting code isn't necessarily simpler at the
      moment, this will enable further clean up of blkcg stats code.
      
      - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to
        BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}.
      
      - blkg_stat_add() replaces blkio_add_stat() and
        blkio_check_and_dec_stat().  Note that BUG_ON() on underflow in the
        latter function no longer exists.  It's *way* better to have
        underflowed stat counters than oopsing.
      
      - blkio_group_stats->dequeue is now a proper u64 stat counter instead
        of ulong.
      
      - reset_stats() updated to clear each stat counters individually and
        BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed.
      
      - Some functions reconstruct rw flags from direction and sync
        booleans.  This will be removed by future patches.
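
      In rough shape (a sketch from the description above, not a verbatim
      copy of the patch):

        struct blkg_stat {
                struct u64_stats_sync   syncp;
                uint64_t                cnt;
        };

        struct blkg_rwstat {
                struct u64_stats_sync   syncp;
                uint64_t                cnt[BLKG_RWSTAT_NR];    /* rw x [a]sync */
        };

        static inline void blkg_stat_add(struct blkg_stat *stat, uint64_t val)
        {
                u64_stats_update_begin(&stat->syncp);
                stat->cnt += val;
                u64_stats_update_end(&stat->syncp);
        }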
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: BLKIO_STAT_CPU_SECTORS doesn't have subcounters · 2aa4a152
      Tejun Heo authored
      BLKIO_STAT_CPU_SECTORS doesn't need read/write/sync/async subcounters
      and is counted by blkio_group_stats_cpu->sectors; however, it still
      holds a member in blkio_group_stats_cpu->stat_arr_cpu.
      
      Rearrange stat_type_cpu and define BLKIO_STAT_CPU_ARR_NR and use it
      for stat_arr_cpu[] size so that only SERVICE_BYTES and SERVICED have
      subcounters.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • blkcg: remove unused @pol and @plid parameters · aaec55a0
      Tejun Heo authored
      @pol to blkg_to_pdata() and @plid to blkg_lookup_create() are no
      longer necessary.  Drop them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: convert all non-memcg controllers to the new cftype interface · 4baf6e33
      Tejun Heo authored
      Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
      net_cls and device controllers to use the new cftype based interface.
      Termination entry is added to cftype arrays and populate callbacks are
      replaced with cgroup_subsys->base_cftypes initializations.
      
      This is a functionally identical transformation.  There shouldn't be any
      visible behavior change.
      
      memcg is rather special and will be converted separately.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <paul@paulmenage.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Vivek Goyal <vgoyal@redhat.com>
    • cgroup: relocate cftype and cgroup_subsys definitions in controllers · 676f7c8f
      Tejun Heo authored
      blk-cgroup, netprio_cgroup, cls_cgroup and tcp_memcontrol
      unnecessarily define cftype array and cgroup_subsys structures at the
      top of the file, which is unconventional and necessitates forward
      declaration of methods.
      
      This patch relocates those below the definitions of the methods and
      removes the forward declarations.  Note that forward declaration of
      tcp_files[] is added in tcp_memcontrol.c for tcp_init_cgroup().  This
      will be removed soon by another patch.
      
      This patch doesn't introduce any functional change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
  2. March 30, 2012 (1 commit)
  3. March 23, 2012 (1 commit)
    • cfq: fix cfqg ref handling when BLK_CGROUP && !CFQ_GROUP_IOSCHED · eb7d8c07
      Tejun Heo authored
      When BLK_CGROUP is enabled but CFQ_GROUP_IOSCHED is not, cfq ends up
      calling blkg_get/put() on a dummy cfqg, leading to the following crash.
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
        IP: [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430
        PGD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        CPU 0
        Modules linked in:
      
        Pid: 1, comm: swapper/0 Not tainted 3.3.0-rc6-work+ #125 Bochs Bochs
        RIP: 0010:[<ffffffff813d44d8>]  [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430
        RSP: 0018:ffff88001f9dfd80  EFLAGS: 00010046
        RAX: ffff88001aefbbf0 RBX: ffff88001aeedbf0 RCX: 0000000000000100
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff820ffd40
        RBP: ffff88001f9dfdd0 R08: 0000000000000000 R09: 0000000000000001
        R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
        R13: 0000000000000009 R14: ffff88001aefbc30 R15: 0000000000000003
        FS:  0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 00000000000000b0 CR3: 000000000206f000 CR4: 00000000000006f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process swapper/0 (pid: 1, threadinfo ffff88001f9de000, task ffff88001f9dc040)
        Stack:
         ffff88001aeedbf0 ffff88001aefbdb0 ffff88001aef1548 ffff88001aefbbf0
         ffff88001f9dfdd0 ffff88001aef1548 ffffffff820d6320 ffffffff8165ce30
         ffffffff82c555e0 ffff88001aeebbf0 ffff88001f9dfe00 ffffffff813b0507
        Call Trace:
         [<ffffffff813b0507>] elevator_init+0xd7/0x140
         [<ffffffff813b83d5>] blk_init_allocated_queue+0x125/0x150
         [<ffffffff813b94d3>] blk_init_queue_node+0x43/0x80
         [<ffffffff813b9523>] blk_init_queue+0x13/0x20
         [<ffffffff821aec00>] floppy_init+0x82/0xec7
         [<ffffffff810001d2>] do_one_initcall+0x42/0x170
         [<ffffffff821835fc>] kernel_init+0xcb/0x14f
         [<ffffffff81b40b24>] kernel_thread_helper+0x4/0x10
        Code: 00 e8 1d 9e 76 00 48 8b 43 48 48 85 c0 48 89 83 28 03 00 00 74 07 4c 8b a0 10 ff ff ff 8b 15 b0 2e d0 00 85 d2 0f 85 49 01 00 00 <41> 8b 84 24 b0 00 00 00 85 c0 0f 8e 8c 01 00 00 83 e8 01 85 c0
        RIP  [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430
      
      Because cfq's blkcg support has an on/off switch, CFQ_GROUP_IOSCHED,
      separate from BLK_CGROUP, blkg access through cfqg needs to be
      conditioned on it.
      
      * Make blkg_to_cfqg() and cfqg_to_blkg() conditioned on
        CFQ_GROUP_IOSCHED.  If disabled, they always return %NULL.
      
      * Introduce cfqg_get() and cfqg_put() conditioned on
        CFQ_GROUP_IOSCHED.  If disabled, they are noops.
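
      In outline, the fix looks like this (a sketch assuming the
      two-argument blkg_to_pdata() of this point in the series and the
      blkio_policy_cfq policy struct):

        #ifdef CONFIG_CFQ_GROUP_IOSCHED

        static inline struct cfq_group *blkg_to_cfqg(struct blkio_group *blkg)
        {
                return blkg_to_pdata(blkg, &blkio_policy_cfq);
        }

        static inline void cfqg_get(struct cfq_group *cfqg)
        {
                blkg_get(cfqg_to_blkg(cfqg));
        }

        static inline void cfqg_put(struct cfq_group *cfqg)
        {
                blkg_put(cfqg_to_blkg(cfqg));
        }

        #else   /* CONFIG_CFQ_GROUP_IOSCHED */

        static inline struct cfq_group *blkg_to_cfqg(struct blkio_group *blkg)
        {
                return NULL;
        }
        static inline void cfqg_get(struct cfq_group *cfqg) { }
        static inline void cfqg_put(struct cfq_group *cfqg) { }

        #endif  /* CONFIG_CFQ_GROUP_IOSCHED */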
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. March 20, 2012 (9 commits)
    • block: remove ioc_*_changed() · 2b566fa5
      Tejun Heo authored
      After the previous patch to cfq, there's no ioc_get_changed() user
      left.  This patch yanks out ioc_{ioprio|cgroup|get}_changed() and all
      related stuff.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • cfq: don't use icq_get_changed() · 598971bf
      Tejun Heo authored
      cfq caches the associated cfqq's for a given cic.  The cache needs to
      be flushed if the cic's ioprio or blkcg has changed.  It is currently
      done by requiring the changing action to set the respective
      ICQ_*_CHANGED bit in the icq and testing it from cfq_set_request(),
      which involves iterating through all the affected icqs.
      
      All cfq wants to know is whether ioprio and/or blkcg have changed
      since the last flush; this can be achieved easily by just remembering
      the current ioprio and blkcg ID in the cic.
      
      This patch adds cic->{ioprio|blkcg_id}, updates all ioprio users to
      use the remembered value instead, and updates cfq_set_request() path
      such that, instead of using icq_get_changed(), the current values are
      compared against the remembered ones, triggering the appropriate flush
      action on mismatch.  Condition tests are moved inside both _changed
      functions which are now named check_ioprio_changed() and
      check_blkcg_changed().
      
      ioprio.h::task_ioprio*() can't be used anymore and replaced with
      open-coded IOPRIO_CLASS_NONE case in cfq_async_queue_prio().
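
      The idea in miniature (a sketch, not the exact patch;
      changed_ioprio() stands in for cfq's existing cache-flush helper):

        static void check_ioprio_changed(struct cfq_io_cq *cic)
        {
                int ioprio = cic->icq.ioc->ioprio;

                if (unlikely(cic->ioprio != ioprio)) {
                        changed_ioprio(cic);    /* flush cached cfqqs */
                        cic->ioprio = ioprio;
                }
        }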
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • cfq: pass around cfq_io_cq instead of io_context · abede6da
      Tejun Heo authored
      Now that io_cq is managed by block core and guaranteed to exist for
      any in-flight request, passing around cfq_io_cq instead of io_context
      is easier and carries more information.
      
      This patch updates cfq_init_prio_data(), cfq_find_alloc_queue() and
      cfq_get_queue() to take @cic instead of @ioc.  This change removes a
      duplicate cfq_cic_lookup() from cfq_find_alloc_queue().
      
      This change enables the use of cic-cached ioprio in the next patch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: add blkcg->id · 9a9e8a26
      Tejun Heo authored
      Add 64bit unique id to blkcg.  This will be used by policies which
      want blkcg identity test to tell whether the associated blkcg has
      changed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: remove blkio_group->stats_lock · edf1b879
      Tejun Heo authored
      With recent plug merge updates, all non-percpu stat updates happen
      under queue_lock making stats_lock unnecessary to synchronize stat
      updates.  The only synchronization necessary is stat reading, which
      can be done using u64_stats_sync instead.
      
      This patch removes blkio_group->stats_lock and adds
      blkio_group_stats->syncp for reader synchronization.
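
      For reference, the reader side then follows the usual u64_stats_sync
      pattern (generic sketch; writers update under queue_lock, bracketed
      by u64_stats_update_begin()/end()):

        static u64 read_stat_time(struct blkio_group_stats *stats)
        {
                unsigned int start;
                u64 v;

                /* retry until a consistent snapshot is observed */
                do {
                        start = u64_stats_fetch_begin(&stats->syncp);
                        v = stats->time;
                } while (u64_stats_fetch_retry(&stats->syncp, start));
                return v;
        }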
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: restructure blkio_get_stat() · c4c76a05
      Tejun Heo authored
      Restructure blkio_get_stat() to prepare for removal of stats_lock.
      
      * Define BLKIO_STAT_ARR_NR explicitly to denote which stats have
        subtypes instead of using BLKIO_STAT_QUEUED.
      
      * Separate out stat acquisition and printing.  After this, there are
        only two users of blkio_fill_stat().  Just open code it.
      
      * The code was mixing MAX_KEY_LEN and MAX_KEY_LEN - 1.  There's no
        need to subtract one.  Use MAX_KEY_LEN consistently.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: simplify stat reset · 997a026c
      Tejun Heo authored
      blkiocg_reset_stats() implements stat reset for blkio.reset_stats
      cgroupfs file.  This feature is very unconventional and something
      which shouldn't have been merged.  It's only useful when there's only
      one user or tool looking at the stats.  As soon as multiple users
      and/or tools are involved, it becomes useless as resetting disrupts
      other usages.  There are very good reasons why all other stats expect
      readers to read values at the start and end of a period and subtract
      to determine delta over the period.
      
      The implementation is rather complex - some fields shouldn't be
      cleared, and it saves some fields, resets the whole and then restores
      them for some reason.  Reset of percpu stats is also racy.  The
      comment cites 64bit store atomicity as the reason, but even without
      that, zeroing stores can simply race with other CPUs doing RMW and
      get clobbered.
      
      Simplify reset by
      
      * Clearing selectively instead of resetting and restoring.
      
      * Grouping debug stat fields to be reset and using memset() over them.
      
      * Not caring about stats_lock.
      
      * Using memset() to reset percpu stats.
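
      The percpu part then boils down to (sketch; struct and function
      names per the surrounding code, ownership of stats_cpu elided):

        static void blkio_reset_stats_cpu(struct blkio_group_stats_cpu __percpu *stats_cpu)
        {
                int cpu;

                for_each_possible_cpu(cpu) {
                        struct blkio_group_stats_cpu *sc =
                                per_cpu_ptr(stats_cpu, cpu);

                        memset(sc, 0, sizeof(*sc));
                }
        }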
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: don't use percpu for merged stats · 5fe224d2
      Tejun Heo authored
      With recent plug merge updates, merged stats are no longer updated from
      the plug merge path and are only updated while holding queue_lock.  As
      stats_lock is scheduled to be removed, there's no reason to use percpu
      for merged stats.  Don't use percpu for merged stats.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: alloc per cpu stats from worker thread in a delayed manner · 1cd9e039
      Vivek Goyal authored
      Current per cpu stat allocation assumes the GFP_KERNEL allocation flag.  But in
      the IO path there are times when we want GFP_NOIO semantics.  As there is no
      way to pass the allocation flags to alloc_percpu(), this patch delays the
      allocation of stats using a worker thread (sketched below).
      
      v2 -> Tejun suggested the following changes.  Changed the patch accordingly.
      	- move alloc_node location in structure
      	- reduce the size of names of some of the fields
      	- Reduce the scope of locking of alloc_list_lock
      	- Simplified stat_alloc_fn() by allocating stats for all
      	  policies in one go and then assigning these to a group.
      
      v3 -> Andrew suggested putting some comments in the code.  Also raised
            concerns about trying to allocate infinitely in case of allocation
            failure. I have changed the logic to sleep for 10ms before retrying.
            That should take care of non-preemptible UP kernels.
      
      v4 -> Tejun had more suggestions.
      	- drop list_for_each_entry_all()
      	- instead of msleep() use queue_delayed_work()
      	- Some cleanups related to more compact coding.
      
      v5 -> Tejun suggested more cleanups leading to more compact code.
      
      tj: - Relocated pcpu_stats into blkio_stat_alloc_fn().
          - Minor comment update.
          - This also fixes suspicious RCU usage warning caused by invoking
            cgroup_path() from blkg_alloc() without holding RCU read lock.
            Now that blkg_alloc() doesn't require sleepable context, RCU
            read lock from blkg_lookup_create() is maintained throughout
            blkg_alloc().
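
      Conceptually the deferred path looks like this (a sketch under
      assumed names; blkio_stat_alloc_work is hypothetical here):

        static void blkio_stat_alloc_fn(struct work_struct *work)
        {
                /* process context, so a plain GFP_KERNEL alloc_percpu() is fine */
                struct blkio_group_stats_cpu __percpu *stats =
                        alloc_percpu(struct blkio_group_stats_cpu);

                if (!stats) {
                        /* back off briefly instead of retrying in a tight loop */
                        queue_delayed_work(system_wq, &blkio_stat_alloc_work,
                                           msecs_to_jiffies(10));
                        return;
                }
                /* attach stats to a waiting blkio_group under queue_lock ... */
        }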
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. March 14, 2012 (1 commit)
    • block: fix ioc leak in put_io_context · ff8c1474
      Xiaotian Feng authored
      When put_io_context() is called, if ioc->icq_list is empty and the
      refcount is 1, the kernel will not free the ioc.
      
      This is caught by the following kmemleak report:
      
      unreferenced object 0xffff880036349fe0 (size 216):
        comm "sh", pid 2137, jiffies 4294931140 (age 290579.412s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          01 00 01 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
        backtrace:
          [<ffffffff8169f926>] kmemleak_alloc+0x26/0x50
          [<ffffffff81195a9c>] kmem_cache_alloc_node+0x1cc/0x2a0
          [<ffffffff81356b67>] create_io_context_slowpath+0x27/0x130
          [<ffffffff81356d2b>] get_task_io_context+0xbb/0xf0
          [<ffffffff81055f0e>] copy_process+0x188e/0x18b0
          [<ffffffff8105609b>] do_fork+0x11b/0x420
          [<ffffffff810247f8>] sys_clone+0x28/0x30
          [<ffffffff816d3373>] stub_clone+0x13/0x20
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      ioc should be freed if ioc->icq_list is empty.
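
      The fixed logic, in outline (simplified sketch; ioc->lock handling
      elided):

        if (atomic_long_dec_and_test(&ioc->refcount)) {
                if (!hlist_empty(&ioc->icq_list))
                        schedule_work(&ioc->release_work);      /* icqs still to exit */
                else
                        kmem_cache_free(iocontext_cachep, ioc); /* was leaked before */
        }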
      Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. March 7, 2012 (10 commits)
    • block: make blk-throttle preserve the issuing task on delayed bios · 671058fb
      Tejun Heo authored
      Make blk-throttle call bio_associate_current() on bios being delayed
      such that they get issued to the block layer with the original io_context.
      This allows stacking blk-throttle and cfq-iosched propio policies.
      bios will always be issued with the correct ioc and blkcg whether they
      get delayed by blk-throttle or not.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: make block cgroup policies follow bio task association · 4f85cb96
      Tejun Heo authored
      Implement bio_blkio_cgroup() which returns the blkcg associated with
      the bio if it exists or %current's blkcg otherwise, and use it in blk-throttle and
      cfq-iosched propio.  This makes both cgroup policies honor task
      association for the bio instead of always assuming %current.
      
      As nobody is using bio_associate_current() yet, this doesn't introduce any
      behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: implement bio_associate_current() · 852c788f
      Tejun Heo authored
      IO scheduling and cgroup are tied to the issuing task via io_context
      and cgroup of %current.  Unfortunately, there are cases where IOs need
      to be routed via a different task, which makes scheduling and cgroup
      limit enforcement completely incorrect.
      
      For example, all bios delayed by blk-throttle end up being issued by a
      delayed work item, get assigned the io_context of the worker task
      which happens to serve the work item, and are dumped into the default
      block cgroup.  This is doubly confusing as bios which aren't delayed
      end up in the correct cgroup, and it makes using blk-throttle and cfq
      propio together impossible.
      
      Any code which punts IO issuing to another task is affected, which is
      getting more and more common (e.g. btrfs).  As both io_context and
      cgroup are firmly tied to task including userland visible APIs to
      manipulate them, it makes a lot of sense to match up tasks to bios.
      
      This patch implements bio_associate_current() which associates the
      specified bio with %current.  The bio will record the associated ioc
      and blkcg at that point and block layer will use the recorded ones
      regardless of which task actually ends up issuing the bio.  bio
      release puts the associated ioc and blkcg.
      
      It grabs and remembers ioc and blkcg instead of the task itself
      because the task may already be dead by the time the bio is issued,
      making ioc and blkcg inaccessible, and those are all the block layer
      cares about.
      
      elevator_set_req_fn() is updated such that the bio for which elvdata
      is being allocated is available to the elevator.
      
      This doesn't update block cgroup policies yet.  Further patches will
      implement the support.
      
      -v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in
           rq_ioc() to fix build breakage.
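
      In outline (a sketch of the described behavior, not the verbatim
      patch):

        int bio_associate_current(struct bio *bio)
        {
                struct io_context *ioc = current->io_context;
                struct cgroup_subsys_state *css;

                if (bio->bi_ioc)
                        return -EBUSY;          /* already associated */
                if (!ioc)
                        return -ENOENT;

                get_io_context_active(ioc);     /* dropped when the bio is freed */
                bio->bi_ioc = ioc;

                /* record the blkcg css as well */
                rcu_read_lock();
                css = task_subsys_state(current, blkio_subsys_id);
                if (css && css_tryget(css))
                        bio->bi_css = css;
                rcu_read_unlock();
                return 0;
        }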
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add io_context->active_ref · f6e8d01b
      Tejun Heo authored
      Currently ioc->nr_tasks is used to decide two things - whether an ioc
      is done issuing IOs and whether it's shared by multiple tasks.  This
      patch separates out the first into ioc->active_ref, which is acquired
      and released using {get|put}_io_context_active() respectively.
      
      This will be used to associate bios with a given task.  This patch
      doesn't introduce any visible behavior change.
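
      Roughly (sketch of the split; the icq exit dance in
      put_io_context_active() is elided):

        static inline void get_io_context_active(struct io_context *ioc)
        {
                atomic_long_inc(&ioc->refcount);
                atomic_inc(&ioc->active_ref);
        }

        void put_io_context_active(struct io_context *ioc)
        {
                if (atomic_dec_and_test(&ioc->active_ref)) {
                        /* last issuer gone: exit icqs, then drop the base ref */
                        put_io_context(ioc);
                }
        }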
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: interface update for ioc/icq creation functions · 24acfc34
      Tejun Heo authored
      Make the following interface updates to prepare for future ioc related
      changes.
      
      * create_io_context() returning ioc only works for %current because it
        doesn't increment ref on the ioc.  Drop @task parameter from it and
        always assume %current.
      
      * Make create_io_context_slowpath() return 0 or -errno and rename it
        to create_task_io_context().
      
      * Make ioc_create_icq() take @ioc as parameter instead of assuming
        that of %current.  The caller, get_request(), is updated to create
        ioc explicitly and then pass it into ioc_create_icq().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: restructure get_request() · b679281a
      Tejun Heo authored
      get_request() is structured a bit unusually in that the failure path is
      inlined in the usual flow with goto labels atop and inside it.
      Relocate the error path to the end of the function.
      
      This is to prepare for icq handling changes in get_request() and
      doesn't introduce any behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: drop unnecessary RCU locking · c875f4d0
      Tejun Heo authored
      Now that blkg additions / removals are always done under both q and
      blkcg locks, the only places RCU locking is necessary are
      blkg_lookup[_create]() for lookup w/o blkcg lock.  This patch drops
      unnecessary RCU locking, replacing it with plain blkcg locking as
      necessary.
      
      * blkiocg_pre_destroy() already performs proper locking and doesn't
        need RCU.  Dropped.
      
      * blkio_read_blkg_stats() now uses blkcg->lock instead of RCU read
        lock.  This isn't a hot path.
      
      * The now-unnecessary synchronize_rcu() from queue exit paths is
        removed.  This makes q->nr_blkgs unnecessary.  Dropped.
      
      * RCU annotation on blkg->q removed.
      
      -v2: Vivek pointed out that blkg_lookup_create() still needs to be
           called under rcu_read_lock().  Updated.
      
      -v3: After the update, stats_lock locking in blkio_read_blkg_stats()
           shouldn't be using _irq variant as it otherwise ends up enabling
           irq while blkcg->lock is locked.  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: use double locking instead of RCU for blkg synchronization · 9f13ef67
      Tejun Heo authored
      blkgs are chained from both blkcgs and request_queues and thus
      subjected to two locks - blkcg->lock and q->queue_lock.  As both blkcg
      and q can go away anytime, locking during removal is tricky.  It's
      currently solved by wrapping removal inside RCU, which makes the
      synchronization complex.  There are three locks to worry about - the
      outer RCU, q lock and blkcg lock, and it leads to nasty subtle
      complications like conditional synchronize_rcu() on queue exit paths.
      
      For all other paths, blkcg lock is naturally nested inside q lock and
      the only exception is blkcg removal path, which is a very cold path
      and can be implemented as clumsy but conceptually-simple reverse
      double lock dancing.
      
      This patch updates blkg removal path such that blkgs are removed while
      holding both q and blkcg locks, which is trivial for request queue
      exit path - blkg_destroy_all().  The blkcg removal path,
      blkiocg_pre_destroy(), implements reverse double lock dancing
      essentially identical to ioc_release_fn().
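
      The "reverse double lock dance" in sketch form (conceptual, modeled
      on the ioc_release_fn() pattern mentioned above):

        spin_lock_irq(&blkcg->lock);
        while (!hlist_empty(&blkcg->blkg_list)) {
                struct blkio_group *blkg = hlist_entry(blkcg->blkg_list.first,
                                struct blkio_group, blkcg_node);
                struct request_queue *q = blkg->q;

                if (spin_trylock(q->queue_lock)) {
                        blkg_destroy(blkg);
                        spin_unlock(q->queue_lock);
                } else {
                        /* q lock failed; release blkcg lock and retry */
                        spin_unlock_irq(&blkcg->lock);
                        cpu_relax();
                        spin_lock_irq(&blkcg->lock);
                }
        }
        spin_unlock_irq(&blkcg->lock);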
      
      This simplifies blkg locking - no half-dead blkgs to worry about.  The
      now-unnecessary RCU annotations will be removed by the next patch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: unify blkg's for blkcg policies · e8989fae
      Tejun Heo authored
      Currently, blkg is per cgroup-queue-policy combination.  This is
      unnatural and leads to various convolutions in partially used
      duplicate fields in blkg, config / stat access, and general management
      of blkgs.
      
      This patch makes blkg's per cgroup-queue and lets them serve all
      policies.  blkgs are now created and destroyed by blkcg core proper.
      This will allow further consolidation of common management logic into
      blkcg core and API with better defined semantics and layering.
      
      As a transitional step to untangle blkg management, elvswitch and
      policy [de]registration, all blkgs except the root blkg are being shot
      down during elvswitch and bypass.  This patch adds blkg_root_update()
      to update root blkg in place on policy change.  This is hacky and racy
      but should be good enough as interim step until we get locking
      simplified and switch over to proper in-place update for all blkgs.
      
      -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc()
           comment wasn't updated according to the function change.  Fixed.
           Both pointed out by Vivek.
      
      -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for
           all policies.  This freed root pd during elvswitch before the
           last queue finished exiting and led to oops.  Directly invoke
           update_root_blkg_pd() only on BLKIO_POLICY_PROP from
           cfq_exit_queue().  This also is closer to what will be done with
           proper in-place blkg update.  Reported by Vivek.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: let blkcg core manage per-queue blkg list and counter · 03aa264a
      Tejun Heo authored
      With the previous patch to move blkg list heads and counters to
      request_queue and blkg, the logic to manage them in both policies is
      almost identical and can be moved to blkcg core.
      
      This patch moves blkg link logic into blkg_lookup_create(), implements
      common blkg unlink code in blkg_destroy(), and updates
      blkg_destroy_all() so that it's policy specific and can skip the root
      group.  The updated blkg_destroy_all() is now used to both clear queue
      for bypassing and elv switching, and release all blkgs on q exit.
      
      This patch introduces a race window where policy [de]registration may
      race against queue blkg clearing.  This can only be a problem on cfq
      unload and shouldn't be a real problem in practice (and we have many
      other places where this race already exists).  Future patches will
      remove these unlikely races.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>