1. 06 Nov 2012, 3 commits
    • cgroup: use cgroup_lock_live_group(parent) in cgroup_create() · 976c06bc
      Committed by Tejun Heo
      This patch makes cgroup_create() fail if @parent is marked removed.
      This is to prepare for further updates to cgroup_rmdir() path.
      
      Note that this change isn't strictly necessary.  cgroup can only be
      created via mkdir and the removed marking and dentry removal happen
      without releasing cgroup_mutex, so cgroup_create() can never race with
      cgroup_rmdir().  Even after the scheduled updates to cgroup_rmdir(),
      cgroup_mkdir() and cgroup_rmdir() are synchronized by i_mutex
      rendering the added liveness check unnecessary.
      
      Do it anyway such that locking is contained inside cgroup proper and
      we don't get nasty surprises if we ever grow another caller of
      cgroup_create().
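
      As a hedged sketch (the error label is illustrative), the added
      check at the top of cgroup_create() amounts to:

        /* only live parents can have children */
        if (!cgroup_lock_live_group(parent)) {
                err = -ENODEV;          /* parent already marked removed */
                goto err_free;
        }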
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: kill CSS_REMOVED · e9316080
      Committed by Tejun Heo
      CSS_REMOVED is one of the several contortions which were necessary to
      support css reference draining on cgroup removal.  All css->refcnts
      which need draining should be deactivated and verified to equal zero
      atomically w.r.t. css_tryget().  If any one isn't zero, all refcnts
      need to be re-activated and css_tryget() must not fail in the
      process.
      
      This was achieved by letting css_tryget() busy-loop until either the
      refcnt is reactivated (failed removal attempt) or CSS_REMOVED is set
      (committing to removal).
      
      Now that css refcnt draining is no longer used, there's no need for
      an atomic rollback mechanism.  css_tryget() can simply look at the
      reference count and fail if it has been deactivated - it is never
      re-activated.
      
      This patch removes CSS_REMOVED and updates __css_tryget() to fail if
      the refcnt is deactivated.  As deactivation and removal are a single
      step now, they no longer need to be protected against css_tryget()
      happening from irq context.  Remove local_irq_disable/enable() from
      cgroup_rmdir().
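
      A minimal sketch of the resulting fail-fast tryget, assuming
      deactivation biases the refcnt negative (CSS_DEACT_BIAS below is an
      assumption of this sketch, not necessarily the kernel's constant):

        #define CSS_DEACT_BIAS          INT_MIN

        static bool __css_tryget(struct cgroup_subsys_state *css)
        {
                int v;

                do {
                        v = atomic_read(&css->refcnt);
                        /* deactivated refcnts never come back: fail */
                        if (v <= 0)
                                return false;
                } while (atomic_cmpxchg(&css->refcnt, v, v + 1) != v);

                return true;
        }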
      
      Note that this removes css_is_removed() whose only user is VM_BUG_ON()
      in memcontrol.c.  We can replace it with a check on the refcnt but
      given that the only use case is a debug assert, I think it's better to
      simply unexport it.
      
      v2: Comment updated and explanation on local_irq_disable/enable()
          added per Michal Hocko.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
    • cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs · ed957793
      Committed by Tejun Heo
      2ef37d3f ("memcg: Simplify mem_cgroup_force_empty_list error
      handling") removed the last user of __DEPRECATED_clear_css_refs.  This
      patch removes __DEPRECATED_clear_css_refs and mechanisms to support
      it.
      
      * Conditionals dependent on __DEPRECATED_clear_css_refs removed.
      
      * cgroup_clear_css_refs() can no longer fail.  All that needs to be
        done is deactivating the refcnts, setting CSS_REMOVED and putting
        the base reference on each css.  Remove cgroup_clear_css_refs()
        and the failure path, and open-code the loops into cgroup_rmdir().
      
      This patch keeps the two for_each_subsys() loops separate while open
      coding them.  They can be merged now but there are scheduled changes
      which need them to be separate, so keep them separate to reduce the
      amount of churn.
      
      local_irq_save/restore() from cgroup_clear_css_refs() are replaced
      with local_irq_disable/enable() for simplicity.  This is safe as
      cgroup_rmdir() is always called with IRQs enabled.  Note that this
      IRQ switching is necessary to ensure that css_tryget() isn't called
      from IRQ context on the same CPU while the lower context is between
      CSS deactivation and setting CSS_REMOVED, as css_tryget() would
      otherwise hang forever waiting for the CSS to be re-activated or
      CSS_REMOVED to be set.  This will go away soon.
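
      A hedged sketch of the open-coded sequence in cgroup_rmdir()
      (CSS_DEACT_BIAS and the exact loop bodies are illustrative):

        local_irq_disable();

        /* loop 1: block new css_tryget() by deactivating each refcnt */
        for_each_subsys(cgrp->root, ss) {
                struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];

                WARN_ON(atomic_read(&css->refcnt) < 0);
                atomic_add(CSS_DEACT_BIAS, &css->refcnt);
        }

        /* loop 2: commit to removal so spinning css_tryget()s can fail */
        for_each_subsys(cgrp->root, ss)
                set_bit(CSS_REMOVED, &cgrp->subsys[ss->subsys_id]->flags);

        local_irq_enable();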
      
      v2: cgroup_call_pre_destroy() removal dropped per Michal.  Commit
          message updated to explain local_irq_disable/enable() conversion.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
  2. 20 Sep 2012, 1 commit
    • workqueue: reimplement work_on_cpu() using system_wq · ed48ece2
      Committed by Tejun Heo
      The existing work_on_cpu() implementation is hugely inefficient.  It
      creates a new kthread, executes that single function and then lets
      the kthread die, on every invocation.
      
      Now that system_wq can handle concurrent executions, there's no
      advantage to doing this.  Reimplement work_on_cpu() using system_wq,
      which makes it simpler and way more efficient.
      
      stable: While this isn't a fix in itself, it's needed to fix a
              workqueue-related bug in cpufreq/powernow-k8.  AFAICS, this
              shouldn't break other existing users.
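
      A sketch of the approach (struct and helper names follow the patch
      but are reproduced from memory, so treat details as illustrative):

        struct work_for_cpu {
                struct work_struct work;
                long (*fn)(void *);
                void *arg;
                long ret;
        };

        static void work_for_cpu_fn(struct work_struct *work)
        {
                struct work_for_cpu *wfc =
                        container_of(work, struct work_for_cpu, work);

                wfc->ret = wfc->fn(wfc->arg);
        }

        long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg)
        {
                struct work_for_cpu wfc = { .fn = fn, .arg = arg };

                INIT_WORK_ONSTACK(&wfc.work, work_for_cpu_fn);
                schedule_work_on(cpu, &wfc.work);  /* queues on system_wq */
                flush_work(&wfc.work);  /* no kthread create/destroy cost */
                return wfc.ret;
        }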
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jiri Kosina <jkosina@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: stable@vger.kernel.org
  3. 18 Sep 2012, 2 commits
    • workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn() · 960bd11b
      Committed by Lai Jiangshan
      busy_worker_rebind_fn() didn't clear WORKER_REBIND if rebinding failed
      (CPU is down again).  This used to be okay because the flag wasn't
      used for anything else.
      
      However, after 25511a47 "workqueue: reimplement CPU online rebinding
      to handle idle workers", WORKER_REBIND is also used to command idle
      workers to rebind.  If not cleared, the worker may confuse the next
      CPU_UP cycle by having REBIND spuriously set, or may oops / get stuck
      by prematurely calling idle_worker_rebind().
      
        WARNING: at /work/os/wq/kernel/workqueue.c:1323 worker_thread+0x4cd/0x500()
        Hardware name: Bochs
        Modules linked in: test_wq(O-)
        Pid: 33, comm: kworker/1:1 Tainted: G           O 3.6.0-rc1-work+ #3
        Call Trace:
         [<ffffffff8109039f>] warn_slowpath_common+0x7f/0xc0
         [<ffffffff810903fa>] warn_slowpath_null+0x1a/0x20
         [<ffffffff810b3f1d>] worker_thread+0x4cd/0x500
         [<ffffffff810bc16e>] kthread+0xbe/0xd0
         [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10
        ---[ end trace e977cf20f4661968 ]---
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<ffffffff810b3db0>] worker_thread+0x360/0x500
        PGD 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in: test_wq(O-)
        CPU 0
        Pid: 33, comm: kworker/1:1 Tainted: G        W  O 3.6.0-rc1-work+ #3 Bochs Bochs
        RIP: 0010:[<ffffffff810b3db0>]  [<ffffffff810b3db0>] worker_thread+0x360/0x500
        RSP: 0018:ffff88001e1c9de0  EFLAGS: 00010086
        RAX: 0000000000000000 RBX: ffff88001e633e00 RCX: 0000000000004140
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
        RBP: ffff88001e1c9ea0 R08: 0000000000000000 R09: 0000000000000001
        R10: 0000000000000002 R11: 0000000000000000 R12: ffff88001fc8d580
        R13: ffff88001fc8d590 R14: ffff88001e633e20 R15: ffff88001e1c6900
        FS:  0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
        CR2: 0000000000000000 CR3: 00000000130e8000 CR4: 00000000000006f0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process kworker/1:1 (pid: 33, threadinfo ffff88001e1c8000, task ffff88001e1c6900)
        Stack:
         ffff880000000000 ffff88001e1c9e40 0000000000000001 ffff88001e1c8010
         ffff88001e519c78 ffff88001e1c9e58 ffff88001e1c6900 ffff88001e1c6900
         ffff88001e1c6900 ffff88001e1c6900 ffff88001fc8d340 ffff88001fc8d340
        Call Trace:
         [<ffffffff810bc16e>] kthread+0xbe/0xd0
         [<ffffffff81bd2664>] kernel_thread_helper+0x4/0x10
        Code: b1 00 f6 43 48 02 0f 85 91 01 00 00 48 8b 43 38 48 89 df 48 8b 00 48 89 45 90 e8 ac f0 ff ff 3c 01 0f 85 60 01 00 00 48 8b 53 50 <8b> 02 83 e8 01 85 c0 89 02 0f 84 3b 01 00 00 48 8b 43 38 48 8b
        RIP  [<ffffffff810b3db0>] worker_thread+0x360/0x500
         RSP <ffff88001e1c9de0>
        CR2: 0000000000000000
      
      There was no reason to keep WORKER_REBIND on failure in the first
      place - WORKER_UNBOUND is guaranteed to be set in such cases,
      preventing concurrency management from being incorrectly activated.
      Always clear WORKER_REBIND.
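
      A sketch of the fixed function (simplified; locking follows the
      description above):

        static void busy_worker_rebind_fn(struct work_struct *work)
        {
                struct worker *worker =
                        container_of(work, struct worker, rebind_work);
                struct global_cwq *gcwq = worker->pool->gcwq;

                worker_maybe_bind_and_lock(worker);

                /* clear REBIND unconditionally - even if rebinding
                 * failed, WORKER_UNBOUND keeps concurrency management
                 * disabled, so clearing REBIND is safe */
                worker_clr_flags(worker, WORKER_REBIND);

                spin_unlock_irq(&gcwq->lock);
        }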
      
      tj: Updated comment and description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • pid-namespace: limit value of ns_last_pid to (0, max_pid) · 579035dc
      Committed by Andrew Vagin
      The kernel doesn't check the pid for negative values, so if you try to
      write -2 to /proc/sys/kernel/ns_last_pid, you will get a kernel panic.
      
      The crash happens because the next pid is -1, and alloc_pidmap()
      will try to access a nonexistent pidmap.
      
        map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
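
      A hedged sketch of the fix, bounding the sysctl write with
      proc_dointvec_minmax() (handler shape simplified):

        static int zero;
        extern int pid_max;

        static int pid_ns_ctl_handler(struct ctl_table *table, int write,
                        void __user *buffer, size_t *lenp, loff_t *ppos)
        {
                struct ctl_table tmp = *table;

                if (write && !capable(CAP_SYS_ADMIN))
                        return -EPERM;

                /* clamp to (0, pid_max) so alloc_pidmap() never
                 * indexes pidmap[] with a negative pid */
                tmp.data = &current->nsproxy->pid_ns->last_pid;
                tmp.extra1 = &zero;
                tmp.extra2 = &pid_max;
                return proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
        }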
      Signed-off-by: Andrew Vagin <avagin@openvz.org>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 17 Sep 2012, 1 commit
  5. 13 Sep 2012, 1 commit
  6. 11 Sep 2012, 2 commits
    • workqueue: fix possible idle worker depletion across CPU hotplug · ee378aa4
      Committed by Lai Jiangshan
      To simplify both normal and CPU hotplug paths, worker management is
      prevented while CPU hotplug is in progress.  This is achieved by CPU
      hotplug holding the same exclusion mechanism used by workers to ensure
      there's only one manager per pool.
      
      If someone else seems to be performing the manager role, workers
      proceed to execute work items.  CPU hotplug using the same mechanism
      can lead to idle worker depletion because all workers could proceed to
      execute work items while CPU hotplug is in progress and CPU hotplug
      itself wouldn't actually perform the worker management duty - it
      doesn't guarantee that there's an idle worker left when it releases
      management.
      
      This idle worker depletion, under extreme circumstances, can break
      forward-progress guarantee and thus lead to deadlock.
      
      This patch fixes the bug by using separate mechanisms for manager
      exclusion among workers and hotplug exclusion.  For manager exclusion,
      POOL_MANAGING_WORKERS which was restored by the previous patch is
      used.  pool->manager_mutex is now only used for exclusion between the
      elected manager and CPU hotplug.  The elected manager won't proceed
      without holding pool->manager_mutex.
      
      This ensures that the worker which won the manager position can't skip
      managing while CPU hotplug is in progress.  It will block on
      manager_mutex and perform management after CPU hotplug is complete.
      
      Note that hotplug may happen while waiting for manager_mutex.  A
      manager is on neither the idle nor the busy list, and thus the
      hotplug code can't unbind/rebind it.  Make the manager handle its
      own un/rebinding.
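
      A sketch of the split exclusion in manage_workers() (control flow
      condensed, the self un/rebinding elided):

        static bool manage_workers(struct worker *worker)
        {
                struct worker_pool *pool = worker->pool;

                /* manager exclusion among workers: the flag */
                if (pool->flags & POOL_MANAGING_WORKERS)
                        return false;
                pool->flags |= POOL_MANAGING_WORKERS;

                /* exclusion against CPU hotplug: the mutex; may sleep
                 * and may need to un/rebind itself afterwards */
                if (unlikely(!mutex_trylock(&pool->manager_mutex))) {
                        spin_unlock_irq(&pool->gcwq->lock);
                        mutex_lock(&pool->manager_mutex);
                        /* ... handle possible un/rebinding of self ... */
                        spin_lock_irq(&pool->gcwq->lock);
                }

                /* ... actual worker management ... */

                pool->flags &= ~POOL_MANAGING_WORKERS;
                mutex_unlock(&pool->manager_mutex);
                return true;
        }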
      
      tj: Updated comment and description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • workqueue: restore POOL_MANAGING_WORKERS · 552a37e9
      Committed by Lai Jiangshan
      This patch restores POOL_MANAGING_WORKERS which was replaced by
      pool->manager_mutex by 60373152 "workqueue: use mutex for global_cwq
      manager exclusion".
      
      There's a subtle idle worker depletion bug across CPU hotplug events
      and we need to distinguish an actual manager and CPU hotplug
      preventing management.  POOL_MANAGING_WORKERS will be used for the
      former and manager_mutex for the latter.
      
      This patch just lays POOL_MANAGING_WORKERS on top of the existing
      manager_mutex and doesn't introduce any synchronization changes.  The
      next patch will update it.
      
      Note that this patch fixes a non-critical anomaly where
      too_many_workers() may return %true spuriously while CPU hotplug is in
      progress.  While the issue could schedule the idle timer spuriously, it
      didn't trigger any actual misbehavior.
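
      In sketch form, too_many_workers() now tests the flag instead of the
      mutex (MAX_IDLE_WORKERS_RATIO as in workqueue.c):

        static bool too_many_workers(struct worker_pool *pool)
        {
                /* was mutex_is_locked(&pool->manager_mutex), which is
                 * also true while CPU hotplug holds the mutex - hence
                 * the spurious %true */
                bool managing = pool->flags & POOL_MANAGING_WORKERS;
                int nr_idle = pool->nr_idle + managing; /* manager is idle */
                int nr_busy = pool->nr_workers - nr_idle;

                return nr_idle > 2 &&
                       (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
        }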
      
      tj: Rewrote patch description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  7. 06 Sep 2012, 2 commits
    • workqueue: fix possible deadlock in idle worker rebinding · ec58815a
      Committed by Tejun Heo
      Currently, rebind_workers() and idle_worker_rebind() are two-way
      interlocked.  rebind_workers() waits for idle workers to finish
      rebinding and rebound idle workers wait for rebind_workers() to finish
      rebinding busy workers before proceeding.
      
      Unfortunately, this isn't enough.  The second wait from idle workers
      is implemented as follows.
      
      	wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND));
      
      rebind_workers() clears WORKER_REBIND, wakes up the idle workers and
      then returns.  If a CPU hotplug cycle happens again before one of the
      idle workers finishes the above wait_event(), rebind_workers() will
      repeat the first part of the handshake - set WORKER_REBIND again and
      wait for the idle worker to finish rebinding - and this leads to
      deadlock because the idle worker would be waiting for WORKER_REBIND to
      clear.
      
      This is fixed by adding another interlocking step at the end -
      rebind_workers() now waits for all the idle workers to finish the
      above WORKER_REBIND wait before returning.  This ensures that all
      rebinding steps are complete on all idle workers before the next
      hotplug cycle can happen.
      
      This problem was diagnosed by Lai Jiangshan who also posted a patch to
      fix the issue, upon which this patch is based.
      
      This is the minimal fix and further patches are scheduled for the next
      merge window to simplify the CPU hotplug path.
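
      In sketch form (the counter and completion names are illustrative,
      not the patch's exact identifiers), the third interlocking step is:

        /* idle worker side, leaving the WORKER_REBIND wait */
        wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND));
        if (atomic_dec_and_test(&idle_rebind->ref_cnt))
                complete(&idle_rebind->ref_done);  /* last one signals */

        /* rebind_workers() side, new final step before returning */
        wait_for_completion(&idle_rebind.ref_done);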
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Original-patch-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      LKML-Reference: <1346516916-1991-3-git-send-email-laijs@cn.fujitsu.com>
    • workqueue: move WORKER_REBIND clearing in rebind_workers() to the end of the function · 90beca5d
      Committed by Tejun Heo
      This doesn't make any functional difference and is purely to help the
      next patch to be simpler.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
  8. 05 Sep 2012, 1 commit
    • workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic · 96e65306
      Committed by Lai Jiangshan
      The compiler may compile the following code into TWO write/modify
      instructions.
      
      	worker->flags &= ~WORKER_UNBOUND;
      	worker->flags |= WORKER_REBIND;
      
      so the other CPU may temporarily see worker->flags which doesn't have
      either WORKER_UNBOUND or WORKER_REBIND set and perform local wakeup
      prematurely.
      
      Fix it by using a single explicit assignment via ACCESS_ONCE().
      
      Because idle workers have another WORKER_NOT_RUNNING flag, this bug
      doesn't exist for them; however, update it to use the same pattern for
      consistency.
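
      The fixed pattern composes the flags locally and publishes them with
      one store (ACCESS_ONCE() was the store-tearing guard of the era):

        unsigned long worker_flags = worker->flags;

        worker_flags |= WORKER_REBIND;
        worker_flags &= ~WORKER_UNBOUND;
        ACCESS_ONCE(worker->flags) = worker_flags;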
      
      tj: Applied the change to idle workers too and updated comments and
          patch description a bit.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
  9. 04 Sep 2012, 6 commits
  10. 02 Sep 2012, 1 commit
    • time: Move ktime_t overflow checking into timespec_valid_strict · cee58483
      Committed by John Stultz
      Andreas Bombe reported that the ktime_t overflow checking added to
      timespec_valid in commit 4e8b1452 ("time: Improve sanity checking of
      timekeeping inputs") was causing problems with X.org because it
      caused timeouts larger than KTIME_MAX to be invalid.
      
      Previously, these large timeouts would be clamped to KTIME_MAX and would
      never expire, which is valid.
      
      This patch splits the ktime_t overflow checking out into a new
      timespec_valid_strict function, and converts the timekeeping code's
      internal checking to use this stricter function.
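
      The strict variant is small; a sketch (KTIME_SEC_MAX per the ktime
      headers):

        static inline bool timespec_valid_strict(const struct timespec *ts)
        {
                if (!timespec_valid(ts))
                        return false;
                /* additionally reject values that overflow ktime_t */
                if ((unsigned long long)ts->tv_sec >= KTIME_SEC_MAX)
                        return false;
                return true;
        }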
      Reported-and-tested-by: Andreas Bombe <aeb@debian.org>
      Cc: Zhouping Liu <zliu@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 22 Aug 2012, 6 commits
  12. 21 Aug 2012, 1 commit
  13. 19 Aug 2012, 1 commit
  14. 18 Aug 2012, 1 commit
  15. 15 Aug 2012, 4 commits
  16. 14 Aug 2012, 5 commits
    • sched: Fix migration thread runtime bogosity · 8f618968
      Committed by Mike Galbraith
      Make the stop scheduler class do the same accounting as other classes.
      
      Migration threads can be caught in the act while doing exec balancing,
      leading to the below due to use of unmaintained ->se.exec_start.  The
      load that triggered this particular instance was an apparently out of
      control heavily threaded application that does system monitoring in
      what equated to an exec bomb, with one of the VERY frequently migrated
      tasks being ps.
      
      %CPU   PID USER     CMD
      99.3    45 root     [migration/10]
      97.7    53 root     [migration/12]
      97.0    57 root     [migration/13]
      90.1    49 root     [migration/11]
      89.6    65 root     [migration/15]
      88.7    17 root     [migration/3]
      80.4    37 root     [migration/8]
      78.1    41 root     [migration/9]
      44.2    13 root     [migration/2]
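
      The accounting itself is a few lines; a hedged sketch of the stop
      class's put_prev path (field handling mirrors the other classes,
      details illustrative):

        static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
        {
                struct task_struct *curr = rq->curr;
                u64 delta_exec;

                /* maintain se.exec_start/sum_exec_runtime like the other
                 * classes so migration thread runtime stays sane */
                delta_exec = rq->clock_task - curr->se.exec_start;
                if (unlikely((s64)delta_exec < 0))
                        delta_exec = 0;

                curr->se.sum_exec_runtime += delta_exec;
                curr->se.exec_start = rq->clock_task;
                cpuacct_charge(curr, delta_exec);
        }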
      Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1344051854.6739.19.camel@marge.simpson.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled · e221d028
      Committed by Mike Galbraith
      Root task group bandwidth replenishment must service all CPUs, regardless of
      where the timer was last started, and regardless of the isolation mechanism,
      lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.
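
      In sketch form, the period timer walks every online CPU for the root
      group rather than only the root domain span it happens to run in
      (placement within do_sched_rt_period_timer() is illustrative):

        const struct cpumask *span = sched_rt_period_mask();

        /* isolated CPUs are outside any root domain; the root group's
         * timer must service every online CPU explicitly */
        if (rt_b == &root_task_group.rt_bandwidth)
                span = cpu_online_mask;

        for_each_cpu(i, span) {
                /* ... replenish runtime, unthrottle rt_rq on CPU i ... */
        }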
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched,cgroup: Fix up task_groups list · 35cf4e50
      Committed by Mike Galbraith
      With multiple instances of task_groups, for_each_rt_rq() is a noop,
      no task groups having been added to the rt.c list instance.  This
      renders __enable/disable_runtime() and print_rt_stats() noops, the
      user-visible effect being that rt task groups are missing from
      /proc/sched_debug.
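
      The bug pattern and its fix, in sketch form (a static list head in a
      shared header gives every translation unit its own empty list):

        /* before - kernel/sched/sched.h: every includer gets a copy */
        static LIST_HEAD(task_groups);

        /* after: one shared list, declared once, defined once */
        extern struct list_head task_groups;  /* kernel/sched/sched.h */
        LIST_HEAD(task_groups);               /* kernel/sched/core.c  */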
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Cc: stable@kernel.org # v3.3+
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: fix divide by zero at {thread_group,task}_times · bea6832c
      Committed by Stanislaw Gruszka
      On architectures where cputime_t is a 64 bit type, it is possible to
      trigger a divide by zero on the do_div(temp, (__force u32) total)
      line if total is a non-zero number but has its lower 32 bits zeroed.
      Removing the cast is not a good solution since some do_div()
      implementations cast to u32 internally.
      
      This problem can be triggered in practice on very long lived processes:
      
        PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
         #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
         #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
         #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
         #3 [ffff880472a51cd0] die at ffffffff8100f26b
         #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
         #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
         #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
            [exception RIP: thread_group_times+0x56]
            RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
            RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
            RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
            RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
            R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
            R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
            ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
         #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
         #8 [ffff880472a51f40] sys_times at ffffffff81088524
         #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
            RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
            RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
            RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
            RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
            R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
            R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
            ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b
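
      The fix avoids the truncating division; a sketch of the scaling
      helper per the patch's approach (names approximate):

        static cputime_t scale_utime(cputime_t utime, cputime_t rtime,
                                     cputime_t total)
        {
                u64 temp = (__force u64) rtime;

                temp *= (__force u64) utime;

                if (sizeof(cputime_t) == 4)
                        /* 32 bit cputime_t: the u32 cast is lossless */
                        temp = div_u64(temp, (__force u32) total);
                else
                        /* full 64 bit divisor: no truncated-to-zero total */
                        temp = div64_u64(temp, (__force u64) total);

                return (__force cputime_t) temp;
        }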
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies · a35b6466
      Committed by Peter Zijlstra
      Peter Portante reported that for large cgroup hierarchies (and/or on
      large CPU counts) we get immense lock contention on rq->lock and stuff
      stops working properly.
      
      His workload was a ton of processes, each in their own cgroup,
      everybody idling except for a sporadic wakeup once every so often.
      
      It was found that:
      
        schedule()
          idle_balance()
            load_balance()
              local_irq_save()
              double_rq_lock()
              update_h_load()
                walk_tg_tree(tg_load_down)
                  tg_load_down()
      
      Results in an entire cgroup hierarchy walk under rq->lock for every
      new-idle balance, and since new-idle balance isn't throttled, this
      results in a lot of work while holding the rq->lock.
      
      This patch does two things: it removes the work from under rq->lock,
      based on the good principle of race-and-pray which is widely employed
      in the load-balancer as a whole; and it throttles the update_h_load()
      calculation to at most once per jiffy.
      
      I considered excluding update_h_load() for new-idle balance
      altogether, but purely relying on regular balance passes to update
      this data might not work out under some rare circumstances where the
      new-idle busiest isn't the regular busiest for a while (unlikely, but
      a nightmare to debug if someone hits it and suffers).
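
      A sketch of the throttle (the h_load_throttle field name follows the
      patch; treat details as illustrative):

        static void update_h_load(long cpu)
        {
                struct rq *rq = cpu_rq(cpu);
                unsigned long now = jiffies;

                /* at most one full hierarchy walk per jiffy per rq */
                if (rq->h_load_throttle == now)
                        return;
                rq->h_load_throttle = now;

                rcu_read_lock();
                walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
                rcu_read_unlock();
        }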
      
      Cc: pjt@google.com
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Reported-by: Peter Portante <pportant@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  17. 13 Aug 2012, 1 commit
    • printk: Fix calculation of length used to discard records · e3756477
      Committed by Jeff Mahoney
      While tracking down a weird buffer overflow issue in a program that
      looked to be sane, I started double checking the length returned by
      syslog(SYSLOG_ACTION_READ_ALL, ...) to make sure it wasn't overflowing
      the buffer.
      
      Sure enough, it was.  I saw this in strace:
      
        11339 syslog(SYSLOG_ACTION_READ_ALL, "<5>[244017.708129] REISERFS (dev"..., 8192) = 8279
      
      It turns out that the loops that calculate how much space the
      entries will take when they're copied don't include the newlines and
      prefixes that will be included in the final output, since the prev
      flags value is passed as zero.
      
      This patch properly accounts for it and fixes the overflow.
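
      A sketch of the corrected sizing loop (signatures follow that era's
      kernel/printk.c; the threading of prev is the point):

        enum log_flags prev = 0;
        int len = 0;

        while (seq < log_next_seq) {
                struct log *msg = log_from_idx(idx);

                /* pass the previous record's flags so the counted length
                 * includes the same prefixes and newlines the copy loop
                 * will actually emit */
                len += msg_print_text(msg, prev, true, NULL, 0);
                prev = msg->flags;
                idx = log_next(idx);
                seq++;
        }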
      
      CC: stable@kernel.org
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 09 Aug 2012, 1 commit