1. 23 6月, 2014 2 次提交
    • J
      Revert "block: add __init to blkcg_policy_register" · d5bf0291
      Jens Axboe 提交于
      This reverts commit a2d445d4.
      
      The original commit is buggy, we do use the registration functions
      at runtime for modular builds.
      d5bf0291
    • T
      blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t · a5049a8a
      Tejun Heo 提交于
      Hello,
      
      So, this patch should do.  Joe, Vivek, can one of you guys please
      verify that the oops goes away with this patch?
      
      Jens, the original thread can be read at
      
        http://thread.gmane.org/gmane.linux.kernel/1720729
      
      The fix converts blkg->refcnt from int to atomic_t.  It does some
      overhead but it should be minute compared to everything else which is
      going on and the involved cacheline bouncing, so I think it's highly
      unlikely to cause any noticeable difference.  Also, the refcnt in
      question should be converted to a perpcu_ref for blk-mq anyway, so the
      atomic_t is likely to go away pretty soon anyway.
      
      Thanks.
      
      ------- 8< -------
      __blkg_release_rcu() may be invoked after the associated request_queue
      is released with a RCU grace period inbetween.  As such, the function
      and callbacks invoked from it must not dereference the associated
      request_queue.  This is clearly indicated in the comment above the
      function.
      
      Unfortunately, while trying to fix a different issue, 2a4fd070
      ("blkcg: move bulk of blkcg_gq release operations to the RCU
      callback") ignored this and added [un]locking of @blkg->q->queue_lock
      to __blkg_release_rcu().  This of course can cause oops as the
      request_queue may be long gone by the time this code gets executed.
      
        general protection fault: 0000 [#1] SMP
        CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1
        Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013
        task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000
        RIP: 0010:[<ffffffff8162e9e5>]  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
        RSP: 0018:ffff88085403fdf0  EFLAGS: 00010086
        RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000
        RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b
        RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39
        R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130
        R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0
        Stack:
         ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000
         ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0
         ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30
        Call Trace:
         [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150
         [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300
         [<ffffffff81091d81>] kthread+0xe1/0x100
         [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0
        Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5
        +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f
        +b7
        RIP  [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60
         RSP <ffff88085403fdf0>
      
      The request_queue locking was added because blkcg_gq->refcnt is an int
      protected with the queue lock and __blkg_release_rcu() needs to put
      the parent.  Let's fix it by making blkcg_gq->refcnt an atomic_t and
      dropping queue locking in the function.
      
      Given the general heavy weight of the current request_queue and blkcg
      operations, this is unlikely to cause any noticeable overhead.
      Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in
      the near future, so whatever (most likely negligible) overhead it may
      add is temporary.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJoe Lawrence <joe.lawrence@stratus.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com
      Cc: stable@vger.kernel.org
      Signed-off-by: NJens Axboe <axboe@fb.com>
      a5049a8a
  2. 11 6月, 2014 1 次提交
  3. 14 5月, 2014 1 次提交
    • T
      cgroup: rename css_tryget*() to css_tryget_online*() · ec903c0c
      Tejun Heo 提交于
      Unlike the more usual refcnting, what css_tryget() provides is the
      distinction between online and offline csses instead of protection
      against upping a refcnt which already reached zero.  cgroup is
      planning to provide actual tryget which fails if the refcnt already
      reached zero.  Let's rename the existing trygets so that they clearly
      indicate that they're onliness.
      
      I thought about keeping the existing names as-are and introducing new
      names for the planned actual tryget; however, given that each
      controller participates in the synchronization of the online state, it
      seems worthwhile to make it explicit that these functions are about
      on/offline state.
      
      Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
      to css_tryget_online_from_dir().  This is pure rename.
      
      v2: cgroup_freezer grew new usages of css_tryget().  Update
          accordingly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      ec903c0c
  4. 06 5月, 2014 1 次提交
    • T
      blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats() · 36c38fb7
      Tejun Heo 提交于
      During the recent conversion of cgroup to kernfs, cgroup_tree_mutex
      which nests above both the kernfs s_active protection and cgroup_mutex
      is added to synchronize cgroup file type operations as cgroup_mutex
      needed to be grabbed from some file operations and thus can't be put
      above s_active protection.
      
      While this arrangement mostly worked for cgroup, this triggered the
      following lockdep warning.
      
        ======================================================
        [ INFO: possible circular locking dependency detected ]
        3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429 Tainted: G        W
        -------------------------------------------------------
        trinity-c173/9024 is trying to acquire lock:
        (blkcg_pol_mutex){+.+.+.}, at: blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
      
        but task is already holding lock:
        (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (s_active#89){++++.+}:
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        __kernfs_remove (arch/x86/include/asm/atomic.h:27 fs/kernfs/dir.c:352 fs/kernfs/dir.c:1024)
        kernfs_remove_by_name_ns (fs/kernfs/dir.c:1219)
        cgroup_addrm_files (include/linux/kernfs.h:427 kernel/cgroup.c:1074 kernel/cgroup.c:2899)
        cgroup_clear_dir (kernel/cgroup.c:1092 (discriminator 2))
        rebind_subsystems (kernel/cgroup.c:1144)
        cgroup_setup_root (kernel/cgroup.c:1568)
        cgroup_mount (kernel/cgroup.c:1716)
        mount_fs (fs/super.c:1094)
        vfs_kern_mount (fs/namespace.c:899)
        do_mount (fs/namespace.c:2238 fs/namespace.c:2561)
        SyS_mount (fs/namespace.c:2758 fs/namespace.c:2729)
        tracesys (arch/x86/kernel/entry_64.S:746)
      
        -> #1 (cgroup_tree_mutex){+.+.+.}:
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        cgroup_add_cftypes (include/linux/list.h:76 kernel/cgroup.c:3040)
        blkcg_policy_register (block/blk-cgroup.c:1106)
        throtl_init (block/blk-throttle.c:1694)
        do_one_initcall (init/main.c:789)
        kernel_init_freeable (init/main.c:854 init/main.c:863 init/main.c:882 init/main.c:1003)
        kernel_init (init/main.c:935)
        ret_from_fork (arch/x86/kernel/entry_64.S:552)
      
        -> #0 (blkcg_pol_mutex){+.+.+.}:
        __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
        cgroup_file_write (kernel/cgroup.c:2714)
        kernfs_fop_write (fs/kernfs/file.c:295)
        vfs_write (fs/read_write.c:532)
        SyS_write (fs/read_write.c:584 fs/read_write.c:576)
        tracesys (arch/x86/kernel/entry_64.S:746)
      
        other info that might help us debug this:
      
        Chain exists of:
        blkcg_pol_mutex --> cgroup_tree_mutex --> s_active#89
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(s_active#89);
      				 lock(cgroup_tree_mutex);
      				 lock(s_active#89);
          lock(blkcg_pol_mutex);
      
         *** DEADLOCK ***
      
        4 locks held by trinity-c173/9024:
        #0: (&f->f_pos_lock){+.+.+.}, at: __fdget_pos (fs/file.c:714)
        #1: (sb_writers#18){.+.+.+}, at: vfs_write (include/linux/fs.h:2255 fs/read_write.c:530)
        #2: (&of->mutex){+.+.+.}, at: kernfs_fop_write (fs/kernfs/file.c:283)
        #3: (s_active#89){++++.+}, at: kernfs_fop_write (fs/kernfs/file.c:283)
      
        stack backtrace:
        CPU: 3 PID: 9024 Comm: trinity-c173 Tainted: G        W     3.15.0-rc3-next-20140430-sasha-00016-g4e281fa-dirty #429
         ffffffff919687b0 ffff8805f6373bb8 ffffffff8e52cdbb 0000000000000002
         ffffffff919d8400 ffff8805f6373c08 ffffffff8e51fb88 0000000000000004
         ffff8805f6373c98 ffff8805f6373c08 ffff88061be70d98 ffff88061be70dd0
        Call Trace:
        dump_stack (lib/dump_stack.c:52)
        print_circular_bug (kernel/locking/lockdep.c:1216)
        __lock_acquire (kernel/locking/lockdep.c:1840 kernel/locking/lockdep.c:1945 kernel/locking/lockdep.c:2131 kernel/locking/lockdep.c:3182)
        lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
        mutex_lock_nested (kernel/locking/mutex.c:486 kernel/locking/mutex.c:587)
        blkcg_reset_stats (include/linux/spinlock.h:328 block/blk-cgroup.c:455)
        cgroup_file_write (kernel/cgroup.c:2714)
        kernfs_fop_write (fs/kernfs/file.c:295)
        vfs_write (fs/read_write.c:532)
        SyS_write (fs/read_write.c:584 fs/read_write.c:576)
      
      This is a highly unlikely but valid circular dependency between "echo
      1 > blkcg.reset_stats" and cfq module [un]loading.  cgroup is going
      through further locking update which will remove this complication but
      for now let's use trylock on blkcg_pol_mutex and retry the file
      operation if the trylock fails.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      References: http://lkml.kernel.org/g/5363C04B.4010400@oracle.com
      36c38fb7
  5. 19 2月, 2014 1 次提交
  6. 13 2月, 2014 1 次提交
    • T
      cgroup: drop @skip_css from cgroup_taskset_for_each() · 924f0d9a
      Tejun Heo 提交于
      If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching
      css.  The intention of the interface is to make it easy to skip css's
      (cgroup_subsys_states) which already match the migration target;
      however, this is entirely unnecessary as migration taskset doesn't
      include tasks which are already in the target cgroup.  Drop @skip_css
      from cgroup_taskset_for_each().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      924f0d9a
  7. 08 2月, 2014 2 次提交
    • T
      cgroup: clean up cgroup_subsys names and initialization · 073219e9
      Tejun Heo 提交于
      cgroup_subsys is a bit messier than it needs to be.
      
      * The name of a subsys can be different from its internal identifier
        defined in cgroup_subsys.h.  Most subsystems use the matching name
        but three - cpu, memory and perf_event - use different ones.
      
      * cgroup_subsys_id enums are postfixed with _subsys_id and each
        cgroup_subsys is postfixed with _subsys.  cgroup.h is widely
        included throughout various subsystems, it doesn't and shouldn't
        have claim on such generic names which don't have any qualifier
        indicating that they belong to cgroup.
      
      * cgroup_subsys->subsys_id should always equal the matching
        cgroup_subsys_id enum; however, we require each controller to
        initialize it and then BUG if they don't match, which is a bit
        silly.
      
      This patch cleans up cgroup_subsys names and initialization by doing
      the followings.
      
      * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
        cgroup_subsys with _cgrp_subsys.
      
      * With the above, renaming subsys identifiers to match the userland
        visible names doesn't cause any naming conflicts.  All non-matching
        identifiers are renamed to match the official names.
      
        cpu_cgroup -> cpu
        mem_cgroup -> memory
        perf -> perf_event
      
      * controllers no longer need to initialize ->subsys_id and ->name.
        They're generated in cgroup core and set automatically during boot.
      
      * Redundant cgroup_subsys declarations removed.
      
      * While updating BUG_ON()s in cgroup_init_early(), convert them to
        WARN()s.  BUGging that early during boot is stupid - the kernel
        can't print anything, even through serial console and the trap
        handler doesn't even link stack frame properly for back-tracing.
      
      This patch doesn't introduce any behavior changes.
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: N"David S. Miller" <davem@davemloft.net>
      Acked-by: N"Rafael J. Wysocki" <rjw@rjwysocki.net>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Acked-by: NIngo Molnar <mingo@redhat.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      073219e9
    • T
      cgroup: drop module support · 3ed80a62
      Tejun Heo 提交于
      With module supported dropped from net_prio, no controller is using
      cgroup module support.  None of actual resource controllers can be
      built as a module and we aren't gonna add new controllers which don't
      control resources.  This patch drops module support from cgroup.
      
      * cgroup_[un]load_subsys() and cgroup_subsys->module removed.
      
      * As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(),
        cgroup_subsys.h now uses IS_ENABLED() directly.
      
      * enum cgroup_subsys_id now exactly matches the list of enabled
        controllers as ordered in cgroup_subsys.h.
      
      * cgroup_subsys[] is now a contiguously occupied array.  Size
        specification is no longer necessary and dropped.
      
      * for_each_builtin_subsys() is removed and for_each_subsys() is
        updated to not require any locking.
      
      * module ref handling is removed from rebind_subsystems().
      
      * Module related comments dropped.
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      
      v3: Added {} around the if (need_forkexit_callback) block in
          cgroup_post_fork() for readability as suggested by Li.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      3ed80a62
  8. 12 9月, 2013 1 次提交
    • T
      blkcg: relocate root_blkg setting and clearing · 577cee1e
      Tejun Heo 提交于
      Hello, Jens.
      
      The original thread can be read from
      
       http://thread.gmane.org/gmane.linux.kernel.cgroups/8937
      
      While it leads to oops, given that it only triggers under specific
      configurations which aren't common.  I don't think it's necessary to
      backport it through -stable and merging it during the coming merge
      window should be enough.
      
      Thanks!
      
      ----- 8< -----
      Currently, q->root_blkg and q->root_rl.blkg are set from
      blkcg_activate_policy() and cleared from blkg_destroy_all().  This
      doesn't necessarily coincide with the lifetime of the root blkcg_gq
      leading to the following oops when blkcg is enabled but no policy is
      activated because __blk_queue_next_rl() malfunctions expecting the
      root_blkg pointers to be set.
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<ffffffff810c58cb>] __wake_up_common+0x2b/0x90
        PGD 60f7a9067 PUD 60f4c9067 PMD 0
        Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
        gsmi: Log Shutdown Reason 0x03
        Modules linked in: act_mirred cls_tcindex cls_prioshift sch_dsmark xt_multiport iptable_mangle sata_mv elephant elephant_dev_num cdc_acm uhci_hcd ehci_hcd i2c_d
        CPU: 9 PID: 41382 Comm: iSCSI-write- Not tainted 3.11.0-dbg-DEV #19
        Hardware name: Intel XXX
        task: ffff88060d16eec0 ti: ffff88060d170000 task.ti: ffff88060d170000
        RIP: 0010:[<ffffffff810c58cb>] [<ffffffff810c58cb>] __wake_up_common+0x2b/0x90
        RSP: 0000:ffff88060d171818  EFLAGS: 00010096
        RAX: 0000000000000082 RBX: ffff880baa3dee60 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffff880baa3dee60
        RBP: ffff88060d171858 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000002 R12: ffff880baa3dee98
        R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000003
        FS:  00007f977cba6700(0000) GS:ffff880c79c60000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000000 CR3: 000000060f7a5000 CR4: 00000000000007e0
        Stack:
         0000000000000082 0000000000000000 ffff88060d171858 ffff880baa3dee60
         0000000000000082 0000000000000003 0000000000000000 0000000000000000
         ffff88060d171898 ffffffff810c7848 ffff88060d171888 ffff880bde4bc4b8
        Call Trace:
         [<ffffffff810c7848>] __wake_up+0x48/0x70
         [<ffffffff8131da53>] __blk_drain_queue+0x123/0x190
         [<ffffffff8131dbb5>] blk_cleanup_queue+0xf5/0x210
         [<ffffffff8141877a>] __scsi_remove_device+0x5a/0xd0
         [<ffffffff81418824>] scsi_remove_device+0x34/0x50
         [<ffffffff814189cb>] scsi_remove_target+0x16b/0x220
         [<ffffffff814210f1>] __iscsi_unbind_session+0xd1/0x1b0
         [<ffffffff814212b2>] iscsi_remove_session+0xe2/0x1c0
         [<ffffffff814213a6>] iscsi_destroy_session+0x16/0x60
         [<ffffffff81423a59>] iscsi_session_teardown+0xd9/0x100
         [<ffffffff8142b75a>] iscsi_sw_tcp_session_destroy+0x5a/0xb0
         [<ffffffff81420948>] iscsi_if_rx+0x10e8/0x1560
         [<ffffffff81573335>] netlink_unicast+0x145/0x200
         [<ffffffff815736f3>] netlink_sendmsg+0x303/0x410
         [<ffffffff81528196>] sock_sendmsg+0xa6/0xd0
         [<ffffffff815294bc>] ___sys_sendmsg+0x38c/0x3a0
         [<ffffffff811ea840>] ? fget_light+0x40/0x160
         [<ffffffff811ea899>] ? fget_light+0x99/0x160
         [<ffffffff811ea840>] ? fget_light+0x40/0x160
         [<ffffffff8152bc79>] __sys_sendmsg+0x49/0x90
         [<ffffffff8152bcd2>] SyS_sendmsg+0x12/0x20
         [<ffffffff815fb642>] system_call_fastpath+0x16/0x1b
        Code: 66 66 66 66 90 55 48 89 e5 41 57 41 89 f7 41 56 41 89 ce 41 55 41 54 4c 8d 67 38 53 48 83 ec 18 89 55 c4 48 8b 57 38 4c 89 45 c8 <4c> 8b 2a 48 8d 42 e8 49
      
      Fix it by moving r->root_blkg and q->root_rl.blkg setting to
      blkg_create() and clearing to blkg_destroy() so that they area
      initialized when a root blkg is created and cleared when destroyed.
      Reported-and-tested-by: NAnatol Pomozov <anatol.pomozov@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      577cee1e
  9. 09 8月, 2013 6 次提交
    • T
      cgroup: make css_for_each_descendant() and friends include the origin css in the iteration · bd8815a6
      Tejun Heo 提交于
      Previously, all css descendant iterators didn't include the origin
      (root of subtree) css in the iteration.  The reasons were maintaining
      consistency with css_for_each_child() and that at the time of
      introduction more use cases needed skipping the origin anyway;
      however, given that css_is_descendant() considers self to be a
      descendant, omitting the origin css has become more confusing and
      looking at the accumulated use cases rather clearly indicates that
      including origin would result in simpler code overall.
      
      While this is a change which can easily lead to subtle bugs, cgroup
      API including the iterators has recently gone through major
      restructuring and no out-of-tree changes will be applicable without
      adjustments making this a relatively acceptable opportunity for this
      type of change.
      
      The conversions are mostly straight-forward.  If the iteration block
      had explicit origin handling before or after, it's moved inside the
      iteration.  If not, if (pos == origin) continue; is added.  Some
      conversions add extra reference get/put around origin handling by
      consolidating origin handling and the rest.  While the extra ref
      operations aren't strictly necessary, this shouldn't cause any
      noticeable difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      bd8815a6
    • T
      cgroup: make cgroup_taskset deal with cgroup_subsys_state instead of cgroup · d99c8727
      Tejun Heo 提交于
      cgroup is in the process of converting to css (cgroup_subsys_state)
      from cgroup as the principal subsystem interface handle.  This is
      mostly to prepare for the unified hierarchy support where css's will
      be created and destroyed dynamically but also helps cleaning up
      subsystem implementations as css is usually what they are interested
      in anyway.
      
      cgroup_taskset which is used by the subsystem attach methods is the
      last cgroup subsystem API which isn't using css as the handle.  Update
      cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
      cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.
      
      The conversions are pretty mechanical.  One exception is
      cpuset::cgroup_cs(), which lost its last user and got removed.
      
      This patch shouldn't introduce any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      d99c8727
    • T
      cgroup: make hierarchy iterators deal with cgroup_subsys_state instead of cgroup · 492eb21b
      Tejun Heo 提交于
      cgroup is currently in the process of transitioning to using css
      (cgroup_subsys_state) as the primary handle instead of cgroup in
      subsystem API.  For hierarchy iterators, this is beneficial because
      
      * In most cases, css is the only thing subsystems care about anyway.
      
      * On the planned unified hierarchy, iterations for different
        subsystems will need to skip over different subtrees of the
        hierarchy depending on which subsystems are enabled on each cgroup.
        Passing around css makes it unnecessary to explicitly specify the
        subsystem in question as css is intersection between cgroup and
        subsystem
      
      * For the planned unified hierarchy, css's would need to be created
        and destroyed dynamically independent from cgroup hierarchy.  Having
        cgroup core manage css iteration makes enforcing deref rules a lot
        easier.
      
      Most subsystem conversions are straight-forward.  Noteworthy changes
      are
      
      * blkio: cgroup_to_blkcg() is no longer used.  Removed.
      
      * freezer: cgroup_freezer() is no longer used.  Removed.
      
      * devices: cgroup_to_devcgroup() is no longer used.  Removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      492eb21b
    • T
      cgroup: pass around cgroup_subsys_state instead of cgroup in file methods · 182446d0
      Tejun Heo 提交于
      cgroup is currently in the process of transitioning to using struct
      cgroup_subsys_state * as the primary handle instead of struct cgroup.
      Please see the previous commit which converts the subsystem methods
      for rationale.
      
      This patch converts all cftype file operations to take @css instead of
      @cgroup.  cftypes for the cgroup core files don't have their subsytem
      pointer set.  These will automatically use the dummy_css added by the
      previous patch and can be converted the same way.
      
      Most subsystem conversions are straight forwards but there are some
      interesting ones.
      
      * freezer: update_if_frozen() is also converted to take @css instead
        of @cgroup for consistency.  This will make the code look simpler
        too once iterators are converted to use css.
      
      * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
        vmpressure while mem_cgroup_from_cont() can be made static.
        Updated accordingly.
      
      * cpu: cgroup_tg() doesn't have any user left.  Removed.
      
      * cpuacct: cgroup_ca() doesn't have any user left.  Removed.
      
      * hugetlb: hugetlb_cgroup_form_cgroup() doesn't have any user left.
        Removed.
      
      * net_cls: cgrp_cls_state() doesn't have any user left.  Removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Acked-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      182446d0
    • T
      cgroup: add subsys backlink pointer to cftype · 2bb566cb
      Tejun Heo 提交于
      cgroup is transitioning to using css (cgroup_subsys_state) instead of
      cgroup as the primary subsystem handle.  The cgroupfs file interface
      will be converted to use css's which requires finding out the
      subsystem from cftype so that the matching css can be determined from
      the cgroup.
      
      This patch adds cftype->ss which points to the subsystem the file
      belongs to.  The field is initialized while a cftype is being
      registered.  This makes it unnecessary to explicitly specify the
      subsystem for other cftype handling functions.  @ss argument dropped
      from various cftype handling functions.
      
      This patch shouldn't introduce any behavior differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      2bb566cb
    • T
      cgroup: pass around cgroup_subsys_state instead of cgroup in subsystem methods · eb95419b
      Tejun Heo 提交于
      cgroup is currently in the process of transitioning to using struct
      cgroup_subsys_state * as the primary handle instead of struct cgroup *
      in subsystem implementations for the following reasons.
      
      * With unified hierarchy, subsystems will be dynamically bound and
        unbound from cgroups and thus css's (cgroup_subsys_state) may be
        created and destroyed dynamically over the lifetime of a cgroup,
        which is different from the current state where all css's are
        allocated and destroyed together with the associated cgroup.  This
        in turn means that cgroup_css() should be synchronized and may
        return NULL, making it more cumbersome to use.
      
      * Differing levels of per-subsystem granularity in the unified
        hierarchy means that the task and descendant iterators should behave
        differently depending on the specific subsystem the iteration is
        being performed for.
      
      * In majority of the cases, subsystems only care about its part in the
        cgroup hierarchy - ie. the hierarchy of css's.  Subsystem methods
        often obtain the matching css pointer from the cgroup and don't
        bother with the cgroup pointer itself.  Passing around css fits
        much better.
      
      This patch converts all cgroup_subsys methods to take @css instead of
      @cgroup.  The conversions are mostly straight-forward.  A few
      noteworthy changes are
      
      * ->css_alloc() now takes css of the parent cgroup rather than the
        pointer to the new cgroup as the css for the new cgroup doesn't
        exist yet.  Knowing the parent css is enough for all the existing
        subsystems.
      
      * In kernel/cgroup.c::offline_css(), unnecessary open coded css
        dereference is replaced with local variable access.
      
      This patch shouldn't cause any behavior differences.
      
      v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
          with local variable @css as suggested by Li Zefan.
      
          Rebased on top of new for-3.12 which includes for-3.11-fixes so
          that ->css_free() invocation added by da0a12ca ("cgroup: fix a
          leak when percpu_ref_init() fails") is converted too.  Suggested
          by Li Zefan.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Acked-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      eb95419b
  10. 15 5月, 2013 5 次提交
    • T
      blk-throttle: implement proper hierarchy support · 9138125b
      Tejun Heo 提交于
      With the recent updates, blk-throttle is finally ready for proper
      hierarchy support.  Dispatching now honors service_queue->parent_sq
      and propagates correctly.  The only thing missing is setting
      ->parent_sq correctly so that throtl_grp hierarchy matches the cgroup
      hierarchy.
      
      This patch updates throtl_pd_init() such that service_queues form the
      same hierarchy as the cgroup hierarchy if sane_behavior is enabled.
      As this concludes proper hierarchy support for blkcg, the shameful
      .broken_hierarchy tag is removed from blkio_subsys.
      
      v2: Updated blkio-controller.txt as suggested by Vivek.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      9138125b
    • T
      blkcg: move bulk of blkcg_gq release operations to the RCU callback · 2a4fd070
      Tejun Heo 提交于
      Currently, when the last reference of a blkcg_gq is put, all then
      release operations sans the actual freeing happen directly in
      blkg_put().  As blkg_put() may be called under queue_lock, all
      pd_exit_fn()s may be too.  This makes it impossible for pd_exit_fn()s
      to use del_timer_sync() on timers which grab the queue_lock which is
      an irq-safe lock due to the deadlock possibility described in the
      comment on top of del_timer_sync().
      
      This can be easily avoided by perfoming the release operations in the
      RCU callback instead of directly from blkg_put().  This patch moves
      the blkcg_gq release operations to the RCU callback.
      
      As this leaves __blkg_release() with only call_rcu() invocation,
      blkg_rcu_free() is renamed to __blkg_release_rcu(), exported and
      call_rcu() invocation is now done directly from blkg_put() instead of
      going through __blkg_release() which is removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      2a4fd070
    • T
      blkcg: invoke blkcg_policy->pd_init() after parent is linked · db613670
      Tejun Heo 提交于
      Currently, when creating a new blkcg_gq, each policy's pd_init_fn() is
      invoked in blkg_alloc() before the parent is linked.  This makes it
      difficult for policies to perform initializations which are dependent
      on the parent.
      
      This patch moves pd_init_fn() invocations to blkg_create() after the
      parent blkg is linked where the new blkg is fully initialized.  As
      this means that blkg_free() can't assume that pd's are initialized,
      pd_exit_fn() invocations are moved to __blkg_release().  This
      guarantees that pd_exit_fn() is also invoked with fully initialized
      blkgs with valid parent pointers.
      
      This will help implementing hierarchy support in blk-throttle.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      db613670
    • T
      blkcg: move blkg_for_each_descendant_pre() to block/blk-cgroup.h · dd4a4ffc
      Tejun Heo 提交于
      blk-throttle hierarchy support will make use of it.  Move
      blkg_for_each_descendant_pre() from block/blk-cgroup.c to
      block/blk-cgroup.h.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      dd4a4ffc
    • T
      blkcg: fix error return path in blkg_create() · 2423c9c3
      Tejun Heo 提交于
      In blkg_create(), after lookup of parent fails, the control jumps to
      error path with the error code encoded into @blkg.  The error path
      doesn't use @blkg for the return value.  It returns ERR_PTR(ret).
      Make lookup fail path set @ret instead of @blkg.
      
      Note that the parent lookup is guaranteed to succeed at that point and
      the condition check is purely for sanity and triggers WARN when fails.
      As such, I don't think it's necessary to mark it for -stable.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      2423c9c3
  11. 09 4月, 2013 1 次提交
    • J
      blkcg: fix "scheduling while atomic" in blk_queue_bypass_start · e5072664
      Jun'ichi Nomura 提交于
      Since 749fefe6 in v3.7 ("block: lift the initial queue bypass mode
      on blk_register_queue() instead of blk_init_allocated_queue()"),
      the following warning appears when multipath is used with CONFIG_PREEMPT=y.
      
      This patch moves blk_queue_bypass_start() before radix_tree_preload()
      to avoid the sleeping call while preemption is disabled.
      
        BUG: scheduling while atomic: multipath/2460/0x00000002
        1 lock held by multipath/2460:
         #0:  (&md->type_lock){......}, at: [<ffffffffa019fb05>] dm_lock_md_type+0x17/0x19 [dm_mod]
        Modules linked in: ...
        Pid: 2460, comm: multipath Tainted: G        W    3.7.0-rc2 #1
        Call Trace:
         [<ffffffff810723ae>] __schedule_bug+0x6a/0x78
         [<ffffffff81428ba2>] __schedule+0xb4/0x5e0
         [<ffffffff814291e6>] schedule+0x64/0x66
         [<ffffffff8142773a>] schedule_timeout+0x39/0xf8
         [<ffffffff8108ad5f>] ? put_lock_stats+0xe/0x29
         [<ffffffff8108ae30>] ? lock_release_holdtime+0xb6/0xbb
         [<ffffffff814289e3>] wait_for_common+0x9d/0xee
         [<ffffffff8107526c>] ? try_to_wake_up+0x206/0x206
         [<ffffffff810c0eb8>] ? kfree_call_rcu+0x1c/0x1c
         [<ffffffff81428aec>] wait_for_completion+0x1d/0x1f
         [<ffffffff810611f9>] wait_rcu_gp+0x5d/0x7a
         [<ffffffff81061216>] ? wait_rcu_gp+0x7a/0x7a
         [<ffffffff8106fb18>] ? complete+0x21/0x53
         [<ffffffff810c0556>] synchronize_rcu+0x1e/0x20
         [<ffffffff811dd903>] blk_queue_bypass_start+0x5d/0x62
         [<ffffffff811ee109>] blkcg_activate_policy+0x73/0x270
         [<ffffffff81130521>] ? kmem_cache_alloc_node_trace+0xc7/0x108
         [<ffffffff811f04b3>] cfq_init_queue+0x80/0x28e
         [<ffffffffa01a1600>] ? dm_blk_ioctl+0xa7/0xa7 [dm_mod]
         [<ffffffff811d8c41>] elevator_init+0xe1/0x115
         [<ffffffff811e229f>] ? blk_queue_make_request+0x54/0x59
         [<ffffffff811dd743>] blk_init_allocated_queue+0x8c/0x9e
         [<ffffffffa019ffcd>] dm_setup_md_queue+0x36/0xaa [dm_mod]
         [<ffffffffa01a60e6>] table_load+0x1bd/0x2c8 [dm_mod]
         [<ffffffffa01a7026>] ctl_ioctl+0x1d6/0x236 [dm_mod]
         [<ffffffffa01a5f29>] ? table_clear+0xaa/0xaa [dm_mod]
         [<ffffffffa01a7099>] dm_ctl_ioctl+0x13/0x17 [dm_mod]
         [<ffffffff811479fc>] do_vfs_ioctl+0x3fb/0x441
         [<ffffffff811b643c>] ? file_has_perm+0x8a/0x99
         [<ffffffff81147aa0>] sys_ioctl+0x5e/0x82
         [<ffffffff812010be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
         [<ffffffff814310d9>] system_call_fastpath+0x16/0x1b
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: Alasdair G Kergon <agk@redhat.com>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e5072664
  12. 28 2月, 2013 1 次提交
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin 提交于
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  13. 10 1月, 2013 10 次提交
    • T
      blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock · 810ecfa7
      Tejun Heo 提交于
      Instead of holding blkcg->lock while walking ->blkg_list and executing
      prfill(), RCU walk ->blkg_list and hold the blkg's queue lock while
      executing prfill().  This makes prfill() implementations easier as
      stats are mostly protected by queue lock.
      
      This will be used to implement hierarchical stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      810ecfa7
    • T
      blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge() · 16b3de66
      Tejun Heo 提交于
      Implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge().
      The former two collect the [rw]stats designated by the target policy
      data and offset from the pd's subtree.  The latter two add one
      [rw]stat to another.
      
      Note that the recursive sum functions require the queue lock to be
      held on entry to make blkg online test reliable.  This is necessary to
      properly handle stats of a dying blkg.
      
      These will be used to implement hierarchical stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      16b3de66
    • T
      blkcg: export __blkg_prfill_rwstat() · b50da39f
      Tejun Heo 提交于
      Hierarchical stats for cfq-iosched will need __blkg_prfill_rwstat().
      Export it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      b50da39f
    • T
      blkcg: implement blkcg_policy->on/offline_pd_fn() and blkcg_gq->online · f427d909
      Tejun Heo 提交于
      Add two blkcg_policy methods, ->online_pd_fn() and ->offline_pd_fn(),
      which are invoked as the policy_data gets activated and deactivated
      while holding both blkcg and q locks.
      
      Also, add blkcg_gq->online bool, which is set and cleared as the
      blkcg_gq gets activated and deactivated.  This flag also is toggled
      while holding both blkcg and q locks.
      
      These will be used to implement hierarchical stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      f427d909
    • T
      blkcg: add blkg_policy_data->plid · b276a876
      Tejun Heo 提交于
      Add pd->plid so that the policy a pd belongs to can be identified
      easily.  This will be used to implement hierarchical blkg_[rw]stats.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      b276a876
    • T
      cfq-iosched: add leaf_weight · e71357e1
      Tejun Heo 提交于
      cfq blkcg is about to grow proper hierarchy handling, where a child
      blkg's weight would nest inside the parent's.  This makes tasks in a
      blkg to compete against both tasks in the sibling blkgs and the tasks
      of child blkgs.
      
      We're gonna use the existing weight as the group weight which decides
      the blkg's weight against its siblings.  This patch introduces a new
      weight - leaf_weight - which decides the weight of a blkg against the
      child blkgs.
      
      It's named leaf_weight because another way to look at it is that each
      internal blkg nodes have a hidden child leaf node which contains all
      its tasks and leaf_weight is the weight of the leaf node and handled
      the same as the weight of the child blkgs.
      
      This patch only adds leaf_weight fields and exposes it to userland.
      The new weight isn't actually used anywhere yet.  Note that
      cfq-iosched currently offcially supports only single level hierarchy
      and root blkgs compete with the first level blkgs - ie. root weight is
      basically being used as leaf_weight.  For root blkgs, the two weights
      are kept in sync for backward compatibility.
      
      v2: cfqd->root_group->leaf_weight initialization was missing from
          cfq_init_queue() causing divide by zero when
          !CONFIG_CFQ_GROUP_SCHED.  Fix it.  Reported by Fengguang.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      e71357e1
    • T
      blkcg: make blkcg_gq's hierarchical · 3c547865
      Tejun Heo 提交于
      Currently a child blkg (blkcg_gq) can be created even if its parent
      doesn't exist.  ie. Given a blkg, it's not guaranteed that its
      ancestors will exist.  This makes it difficult to implement proper
      hierarchy support for blkcg policies.
      
      Always create blkgs recursively and make a child blkg hold a reference
      to its parent.  blkg->parent is added so that finding the parent is
      easy.  blkcg_parent() is also added in the process.
      
      This change can be visible to userland.  e.g. while issuing IO in a
      nested cgroup didn't affect the ancestors at all, now it will
      initialize all ancestor blkgs and zero stats for the request_queue
      will always appear on them.  While this is userland visible, this
      shouldn't cause any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      3c547865
    • T
      blkcg: cosmetic updates to blkg_create() · 93e6d5d8
      Tejun Heo 提交于
      * Rename out_* labels to err_*.
      
      * Do ERR_PTR() conversion once in the error return path.
      
      This patch is cosmetic and to prepare for the hierarchy support.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      93e6d5d8
    • T
      blkcg: reorganize blkg_lookup_create() and friends · 86cde6b6
      Tejun Heo 提交于
      Reorganize such that
      
      * __blkg_lookup() takes bool param @update_hint to determine whether
        to update hint.
      
      * __blkg_lookup_create() no longer performs lookup before trying to
        create.  Renamed to blkg_create().
      
      * blkg_lookup_create() now performs lookup and then invokes
        blkg_create() if lookup fails.
      
      * root_blkg creation in blkcg_activate_policy() updated accordingly.
        Note that blkcg_activate_policy() no longer updates lookup hint if
        root_blkg already exists.
      
      Except for the last lookup hint bit which is immaterial, this is pure
      reorganization and doesn't introduce any visible behavior change.
      This is to prepare for proper hierarchy support.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      86cde6b6
    • T
      blkcg: fix minor bug in blkg_alloc() · 356d2e58
      Tejun Heo 提交于
      blkg_alloc() was mistakenly checking blkcg_policy_enabled() twice.
      The latter test should have been on whether pol->pd_init_fn() exists.
      This doesn't cause actual problems because both blkcg policies
      implement pol->pd_init_fn().  Fix it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      356d2e58
  14. 06 12月, 2012 1 次提交
    • B
      block: Rename queue dead flag · 3f3299d5
      Bart Van Assche 提交于
      QUEUE_FLAG_DEAD is used to indicate that queuing new requests must
      stop. After this flag has been set queue draining starts. However,
      during the queue draining phase it is still safe to invoke the
      queue's request_fn, so QUEUE_FLAG_DYING is a better name for this
      flag.
      
      This patch has been generated by running the following command
      over the kernel source tree:
      
      git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' |
          xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g'      \
              -e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g';                \
      sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \
          include/linux/blkdev.h;                                       \
      sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \
          -e 's/Dead queue/A dying queue/' block/blk-core.c
      Signed-off-by: NBart Van Assche <bvanassche@acm.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: James Bottomley <JBottomley@Parallels.com>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Chanho Min <chanho.min@lge.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3f3299d5
  15. 20 11月, 2012 1 次提交
  16. 06 11月, 2012 1 次提交
  17. 23 10月, 2012 2 次提交
    • J
      blkcg: stop iteration early if root_rl is the only request list · 65c77fd9
      Jun'ichi Nomura 提交于
      __blk_queue_next_rl() finds next request list based on blkg_list
      while skipping root_blkg in the list.
      OTOH, root_rl is special as it may exist even without root_blkg.
      
      Though the later part of the function handles such a case correctly,
      exiting early is good for readability of the code.
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      65c77fd9
    • J
      blkcg: Fix use-after-free of q->root_blkg and q->root_rl.blkg · 65635cbc
      Jun'ichi Nomura 提交于
      blk_put_rl() does not call blkg_put() for q->root_rl because we
      don't take request list reference on q->root_blkg.
      However, if root_blkg is once attached then detached (freed),
      blk_put_rl() is confused by the bogus pointer in q->root_blkg.
      
      For example, with !CONFIG_BLK_DEV_THROTTLING &&
      CONFIG_CFQ_GROUP_IOSCHED,
      switching IO scheduler from cfq to deadline will cause system stall
      after the following warning with 3.6:
      
      > WARNING: at /work/build/linux/block/blk-cgroup.h:250
      > blk_put_rl+0x4d/0x95()
      > Modules linked in: bridge stp llc sunrpc acpi_cpufreq freq_table mperf
      > ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
      > Pid: 0, comm: swapper/0 Not tainted 3.6.0 #1
      > Call Trace:
      >  <IRQ>  [<ffffffff810453bd>] warn_slowpath_common+0x85/0x9d
      >  [<ffffffff810453ef>] warn_slowpath_null+0x1a/0x1c
      >  [<ffffffff811d5f8d>] blk_put_rl+0x4d/0x95
      >  [<ffffffff811d614a>] __blk_put_request+0xc3/0xcb
      >  [<ffffffff811d71a3>] blk_finish_request+0x232/0x23f
      >  [<ffffffff811d76c3>] ? blk_end_bidi_request+0x34/0x5d
      >  [<ffffffff811d76d1>] blk_end_bidi_request+0x42/0x5d
      >  [<ffffffff811d7728>] blk_end_request+0x10/0x12
      >  [<ffffffff812cdf16>] scsi_io_completion+0x207/0x4d5
      >  [<ffffffff812c6fcf>] scsi_finish_command+0xfa/0x103
      >  [<ffffffff812ce2f8>] scsi_softirq_done+0xff/0x108
      >  [<ffffffff811dcea5>] blk_done_softirq+0x8d/0xa1
      >  [<ffffffff810915d5>] ?
      >  generic_smp_call_function_single_interrupt+0x9f/0xd7
      >  [<ffffffff8104cf5b>] __do_softirq+0x102/0x213
      >  [<ffffffff8108a5ec>] ? lock_release_holdtime+0xb6/0xbb
      >  [<ffffffff8104d2b4>] ? raise_softirq_irqoff+0x9/0x3d
      >  [<ffffffff81424dfc>] call_softirq+0x1c/0x30
      >  [<ffffffff81011beb>] do_softirq+0x4b/0xa3
      >  [<ffffffff8104cdb0>] irq_exit+0x53/0xd5
      >  [<ffffffff8102d865>] smp_call_function_single_interrupt+0x34/0x36
      >  [<ffffffff8142486f>] call_function_single_interrupt+0x6f/0x80
      >  <EOI>  [<ffffffff8101800b>] ? mwait_idle+0x94/0xcd
      >  [<ffffffff81018002>] ? mwait_idle+0x8b/0xcd
      >  [<ffffffff81017811>] cpu_idle+0xbb/0x114
      >  [<ffffffff81401fbd>] rest_init+0xc1/0xc8
      >  [<ffffffff81401efc>] ? csum_partial_copy_generic+0x16c/0x16c
      >  [<ffffffff81cdbd3d>] start_kernel+0x3d4/0x3e1
      >  [<ffffffff81cdb79e>] ? kernel_init+0x1f7/0x1f7
      >  [<ffffffff81cdb2dd>] x86_64_start_reservations+0xb8/0xbd
      >  [<ffffffff81cdb3e3>] x86_64_start_kernel+0x101/0x110
      
      This patch clears q->root_blkg and q->root_rl.blkg when root blkg
      is destroyed.
      Signed-off-by: NJun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      65635cbc
  18. 15 9月, 2012 1 次提交
    • T
      cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them · 8c7f6edb
      Tejun Heo 提交于
      Currently, cgroup hierarchy support is a mess.  cpu related subsystems
      behave correctly - configuration, accounting and control on a parent
      properly cover its children.  blkio and freezer completely ignore
      hierarchy and treat all cgroups as if they're directly under the root
      cgroup.  Others show yet different behaviors.
      
      These differing interpretations of cgroup hierarchy make using cgroup
      confusing and it impossible to co-mount controllers into the same
      hierarchy and obtain sane behavior.
      
      Eventually, we want full hierarchy support from all subsystems and
      probably a unified hierarchy.  Users using separate hierarchies
      expecting completely different behaviors depending on the mounted
      subsystem is deterimental to making any progress on this front.
      
      This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
      for controllers which are lacking in hierarchy support.  The goal of
      this patch is two-fold.
      
      * Move users away from using hierarchy on currently non-hierarchical
        subsystems, so that implementing proper hierarchy support on those
        doesn't surprise them.
      
      * Keep track of which controllers are broken how and nudge the
        subsystems to implement proper hierarchy support.
      
      For now, start with a single warning message.  We can whine louder
      later on.
      
      v2: Fixed a typo spotted by Michal. Warning message updated.
      
      v3: Updated memcg part so that it doesn't generate warning in the
          cases where .use_hierarchy=false doesn't make the behavior
          different from root.use_hierarchy=true.  Fixed a typo spotted by
          Glauber.
      
      v4: Check ->broken_hierarchy after cgroup creation is complete so that
          ->create() can affect the result per Michal.  Dropped unnecessary
          memcg root handling per Michal.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NSerge E. Hallyn <serue@us.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      8c7f6edb
  19. 27 6月, 2012 1 次提交
    • T
      blkcg: implement per-blkg request allocation · a051661c
      Tejun Heo 提交于
      Currently, request_queue has one request_list to allocate requests
      from regardless of blkcg of the IO being issued.  When the unified
      request pool is used up, cfq proportional IO limits become meaningless
      - whoever grabs the next request being freed wins the race regardless
      of the configured weights.
      
      This can be easily demonstrated by creating a blkio cgroup w/ very low
      weight, put a program which can issue a lot of random direct IOs there
      and running a sequential IO from a different cgroup.  As soon as the
      request pool is used up, the sequential IO bandwidth crashes.
      
      This patch implements per-blkg request_list.  Each blkg has its own
      request_list and any IO allocates its request from the matching blkg
      making blkcgs completely isolated in terms of request allocation.
      
      * Root blkcg uses the request_list embedded in each request_queue,
        which was renamed to @q->root_rl from @q->rq.  While making blkcg rl
        handling a bit harier, this enables avoiding most overhead for root
        blkcg.
      
      * Queue fullness is properly per request_list but bdi isn't blkcg
        aware yet, so congestion state currently just follows the root
        blkcg.  As writeback isn't aware of blkcg yet, this works okay for
        async congestion but readahead may get the wrong signals.  It's
        better than blkcg completely collapsing with shared request_list but
        needs to be improved with future changes.
      
      * After this change, each block cgroup gets a full request pool making
        resource consumption of each cgroup higher.  This makes allowing
        non-root users to create cgroups less desirable; however, note that
        allowing non-root users to directly manage cgroups is already
        severely broken regardless of this patch - each block cgroup
        consumes kernel memory and skews IO weight (IO weights are not
        hierarchical).
      
      v2: queue-sysfs.txt updated and patch description udpated as suggested
          by Vivek.
      
      v3: blk_get_rl() wasn't checking error return from
          blkg_lookup_create() and may cause oops on lookup failure.  Fix it
          by falling back to root_rl on blkg lookup failures.  This problem
          was spotted by Rakesh Iyer <rni@google.com>.
      
      v4: Updated to accomodate 458f27a9 "block: Avoid missed wakeup in
          request waitqueue".  blk_drain_queue() now wakes up waiters on all
          blkg->rl on the target queue.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a051661c