1. 22 Feb 2014, 3 commits
  2. 21 Feb 2014, 1 commit
  3. 19 Feb 2014, 2 commits
    • cgroup: update cgroup_enable_task_cg_lists() to grab siglock · 532de3fc
      Committed by Tejun Heo
      Currently, there's nothing preventing cgroup_enable_task_cg_lists()
      from missing an already-set PF_EXITING and racing against
      cgroup_exit().  Depending on the timing, cgroup_exit() may finish
      with the task still linked on a css_set, leading to list corruption.
      Fix it by grabbing siglock in cgroup_enable_task_cg_lists() so that
      PF_EXITING is guaranteed to be visible.
      
      This whole on-demand cg_list optimization is extremely fragile and
      leaves ample room for bugs which can cause things like a once-a-year
      oops during boot.  I'm wondering whether the better approach would be
      just adding "cgroup_disable=all" handling which disables cgroup
      entirely rather than tempting fate with this on-demand craziness.
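
      A minimal sketch of the locking pattern the fix describes (hedged: the
      task-walk loop and the css_set refcounting around it are simplified,
      and helper names follow the cgroup code of that era):

        /* in cgroup_enable_task_cg_lists(), for each task p */
        spin_lock_irq(&p->sighand->siglock);
        if (!(p->flags & PF_EXITING))
                /* with siglock held, an exiting task's PF_EXITING is
                 * guaranteed visible here, so we never link a task that
                 * cgroup_exit() is about to tear down */
                list_add(&p->cg_list, &task_css_set(p)->tasks);
        spin_unlock_irq(&p->sighand->siglock);
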
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
    • workqueue: ensure @task is valid across kthread_stop() · 5bdfff96
      Committed by Lai Jiangshan
      When a kworker should die, the kworker is notified through the
      WORKER_DIE flag instead of kthread_should_stop().  This, IIRC, is
      primarily to keep the test synchronized inside the worker_pool lock.
      WORKER_DIE is first set while holding pool->lock, the lock is dropped
      and kthread_stop() is called.
      
      Unfortunately, this means that there's a slight chance that the target
      kworker may see WORKER_DIE before kthread_stop() runs, exit on its
      own, and have the target task freed before or during kthread_stop().
      
      Fix it by pinning the target task before setting WORKER_DIE and
      putting it after kthread_stop() is done.
      
      tj: Improved patch description and comment.  Moved pinning above
          WORKER_DIE to better signify what it's protecting.
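
      A sketch of the pin/unpin ordering described above (assuming the usual
      get_task_struct()/put_task_struct() reference pinning; the surrounding
      worker-destruction details are elided):

        struct task_struct *task = worker->task;

        get_task_struct(task);          /* pin before WORKER_DIE can be seen */
        spin_lock_irq(&pool->lock);
        worker->flags |= WORKER_DIE;    /* tell the kworker to exit */
        wake_up_process(task);
        spin_unlock_irq(&pool->lock);

        kthread_stop(task);             /* @task cannot be freed under us */
        put_task_struct(task);          /* drop the pin only afterwards */
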
      
      CC: stable@vger.kernel.org
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  4. 18 Feb 2014, 1 commit
    • printk: fix syslog() overflowing user buffer · e4178d80
      Committed by Linus Torvalds
      This is not a buffer overflow in the traditional sense: we don't
      overflow any *kernel* buffers, but we do mis-count the amount of data we
      copy back to user space for the SYSLOG_ACTION_READ_ALL case.
      
      In particular, if the user buffer is too small to hold everything, and
      *if* there is a continuation line at just the right place, we can end up
      giving the user more data than he asked for.
      
      The reason is that we first count up the number of bytes all the log
      records contain, then we walk the records again until we've skipped
      the records at the beginning that won't fit, and then we walk the rest
      of the records and copy them to the user space buffer.
      
      And in between that "skip the initial records that won't fit" and the
      "copy the records that *will* fit to user space", we reset the 'prev'
      variable that contained the record information for the last record not
      copied.  That meant that when we started copying to user space, we now
      had a different character count than what we had originally calculated
      in the first record walk-through.
      
      The fix is to simply not clear the 'prev' flags value in the two
      places that share this logic: syslog_print_all() and
      kmsg_dump_get_buffer() (the latter is used for pstore-like dumping).
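
      A condensed sketch of the three-pass structure (simplified; names
      approximate the 3.14-era syslog_print_all(), and locking is omitted):

        /* pass 1: count how many bytes all records would take */
        prev = 0;
        seq = clear_seq;
        idx = clear_idx;
        while (seq < log_next_seq) {
                struct printk_log *msg = log_from_idx(idx);

                len += msg_print_text(msg, prev, true, NULL, 0);
                prev = msg->flags;
                idx = log_next(idx);
                seq++;
        }

        /* pass 2: drop leading records until the rest fits in 'size';
         * 'prev' is reset and rebuilt here, exactly as in pass 1 */

        /* pass 3: copy what remains to the user buffer.  The fix is to
         * keep 'prev' from pass 2 instead of resetting it to 0, so the
         * continuation-line state matches the byte count from above. */
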
      Reported-and-tested-by: Debabrata Banerjee <dbanerje@akamai.com>
      Acked-by: Kay Sievers <kay@vrfy.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 14 Feb 2014, 1 commit
    • tick: Clear broadcast pending bit when switching to oneshot · dd5fd9b9
      Committed by Thomas Gleixner
      AMD systems which use the C1E workaround in the amd_e400_idle routine
      trigger the WARN_ON_ONCE in the broadcast code when onlining a CPU.
      
      The reason is that the idle routine of those AMD systems switches the
      cpu into forced broadcast mode early on before the newly brought up
      CPU can switch over to high resolution / NOHZ mode. The timer related
      CPU1 bringup looks like this:
      
        clockevent_register_device(local_apic);
        tick_setup(local_apic);
        ...
        idle()
      	tick_broadcast_on_off(FORCE);
      	tick_broadcast_oneshot_control(ENTER)
      	  cpumask_set(cpu, broadcast_oneshot_mask);
      	halt();
      
      Now the broadcast interrupt on CPU0 sets CPU1 in the
      broadcast_pending_mask and wakes CPU1. So CPU1 continues:
      
      	local_apic_timer_interrupt()
      	   tick_handle_periodic();
      	   softirq()
      	     tick_init_highres();
      	       cpumask_clr(cpu, broadcast_oneshot_mask);
      	
      	tick_broadcast_oneshot_control(ENTER)
	   WARN_ON(cpumask_test(cpu, broadcast_pending_mask));
      
      So while we remove CPU1 from the broadcast_oneshot_mask when we switch
      over to highres mode, we do not clear the pending bit, which then
      triggers the warning when we go back to idle.
      
      The reason why this is only visible on C1E affected AMD systems is
      that the other machines enter the deep sleep states via
      acpi_idle/intel_idle and exit the broadcast mode before executing the
      remote triggered local_apic_timer_interrupt. So the pending bit is
      already cleared when the switch over to highres mode is clearing the
      oneshot mask.
      
      The solution is simple: Clear the pending bit together with the mask
      bit when we switch over to highres mode.
      
      Stanislaw came up independently with the same patch by enforcing the
      C1E workaround and debugging the fallout. I picked mine, because mine
      has a changelog :)
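
      A hedged sketch of the fix, using the mask names from
      kernel/time/tick-broadcast.c of that era (the surrounding function is
      elided):

        /* on the CPU that is switching itself over to oneshot/highres */
        cpumask_clear_cpu(cpu, tick_broadcast_oneshot_mask);
        /* the added clear: drop a stale pending bit left over from a
         * broadcast wakeup that was handled via the periodic path */
        cpumask_clear_cpu(cpu, tick_broadcast_pending_mask);
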
      Reported-by: poma <pomidorabelisima@gmail.com>
      Debugged-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Olaf Hering <olaf@aepfle.de>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Justin M. Forbes <jforbes@redhat.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1402111434180.21991@ionos.tec.linutronix.de
      Cc: stable@vger.kernel.org # 3.10+
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  6. 13 Feb 2014, 1 commit
    • Revert "cgroup: use an ordered workqueue for cgroup destruction" · 1a11533f
      Committed by Tejun Heo
      This reverts commit ab3f5faa.
      Explanation from Hugh:
      
        It's because more thorough testing, by others here, found that it
        wasn't always solving the problem: so I asked Tejun privately to
        hold off from sending it in, until we'd worked out why not.
      
        Most of our testing being on a v3.11-based kernel, it was perfectly
        possible that the problem was merely our own, e.g. missing Tejun's
        8a2b7538 ("workqueue: fix ordered workqueues in NUMA setups").
      
        But that turned out not to be enough to fix it either. Then Filipe
        pointed out how percpu_ref_kill_and_confirm() uses call_rcu_sched()
        before we ever get to put the offline on to the workqueue: by the
        time we get to the workqueue, the ordering has already been lost.
      
        So, thanks for the Acks, but I'm afraid that this ordered workqueue
        solution is just not good enough: we should simply forget that patch
        and provide a different answer.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
  7. 12 Feb 2014, 1 commit
    • ring-buffer: Fix first commit on sub-buffer having non-zero delta · d651aa1d
      Committed by Steven Rostedt (Red Hat)
      Each sub-buffer (buffer page) has a full 64-bit timestamp. The events
      on that page use a 27-bit delta against that timestamp in order to save
      on bits written to the ring buffer. If the time between events is
      larger than what the 27 bits can hold, a "time extend" event is added
      to hold the entire 64-bit timestamp again and the events after that
      hold a delta from that timestamp.
      
      As a "time extend" is always paired with an event, it is logical to just
      allocate the event with the time extend, to make things a bit more efficient.
      
      Unfortunately, when the pairing code was written, it removed the "delta = 0"
      from the first commit on a page, causing the events on the page to be
      slightly skewed.
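
      To put the 27-bit limit in perspective (assuming nanosecond-resolution
      timestamps): 2^27 ns = 134,217,728 ns, roughly 134 ms, so any gap
      between events larger than that needs a time extend.  And because the
      page's full 64-bit timestamp is what the deltas are measured against,
      the first commit on a sub-buffer must still be written with delta = 0.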
      
      Fixes: 69d1b839 ("ring-buffer: Bind time extend and data events together")
      Cc: stable@vger.kernel.org # 2.6.37+
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  8. 11 Feb 2014, 7 commits
  9. 10 Feb 2014, 8 commits
  10. 09 Feb 2014, 4 commits
  11. 08 Feb 2014, 3 commits
    • cgroup: fix locking in cgroup_cfts_commit() · 48573a89
      Committed by Tejun Heo
      cgroup_cfts_commit() walks the cgroup hierarchy that the target
      subsystem is attached to and tries to apply the file changes.  Due to
      the convolution with inode locking, it can't keep cgroup_mutex locked
      while iterating.  It currently holds only RCU read lock around the
      actual iteration and then pins the found cgroup using dget().
      
      Unfortunately, this is incorrect.  Although the iteration does check
      cgroup_is_dead() before invoking dget(), there's nothing which
      prevents the dentry from going away in between.  Note that this is
      different from the usual css iterations where css_tryget() is used to
      pin the css - css_tryget() tests whether the css can be pinned and
      fails if not.
      
      The problem can be solved by simply holding cgroup_mutex instead of
      RCU read lock around the iteration, which actually reduces LOC.
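
      A before/after sketch of the iteration (simplified; the per-cgroup file
      addition and removal calls inside the walk are elided):

        /* before: RCU only -- the dentry can be freed between the
         * cgroup_is_dead() check and dget() */
        rcu_read_lock();
        /* ... walk the hierarchy, dget(cgrp->dentry), apply cfts ... */
        rcu_read_unlock();

        /* after: keep cgroup_mutex held across the whole walk so no
         * cgroup (and hence no dentry) can be destroyed meanwhile */
        mutex_lock(&cgroup_mutex);
        /* ... walk the hierarchy and apply cfts directly ... */
        mutex_unlock(&cgroup_mutex);
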
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
    • cgroup: fix error return from cgroup_create() · b58c8998
      Committed by Tejun Heo
      cgroup_create() was returning 0 after allocation failures.  Fix it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
    • cgroup: fix error return value in cgroup_mount() · eb46bf89
      Committed by Tejun Heo
      When cgroup_mount() failed to allocate an id for the root, it didn't
      set ret before jumping to unlock_drop, ending up returning 0 after a
      failure.  Fix it.
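
      A sketch of the general shape of the fix (hypothetical variable layout;
      the exact code in cgroup_mount() differs):

        ret = idr_alloc(&root->cgroup_idr, root_cgrp, 0, 1, GFP_KERNEL);
        if (ret < 0)
                goto unlock_drop;       /* ret now carries the error, not 0 */
        root_cgrp->id = ret;
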
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
  12. 07 Feb 2014, 1 commit
    • cgroup: use an ordered workqueue for cgroup destruction · ab3f5faa
      Committed by Hugh Dickins
      Sometimes the cleanup after memcg hierarchy testing gets stuck in
      mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
      
      There may turn out to be several causes, but a major cause is this: the
      workitem to offline the parent can get run before the workitem to
      offline the child; the parent's mem_cgroup_reparent_charges() circles
      around waiting for the child's pages to be reparented to its lrus, but
      it's holding cgroup_mutex, which prevents the child from reaching its
      own mem_cgroup_reparent_charges().
      
      Just use an ordered workqueue for cgroup_destroy_wq.
      
      tj: Committing as the temporary fix until the reverse dependency can
          be removed from memcg.  Comment updated accordingly.
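
      The change amounts to swapping the workqueue allocation; a hedged
      sketch (the exact names and flags in the cgroup init code may differ):

        /* before: bound workqueue, max_active = 1 per CPU, so offline
         * work items queued from different CPUs can still be reordered */
        cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);

        /* after: strictly ordered, so a child's offline work can never
         * be overtaken by its parent's */
        cgroup_destroy_wq = alloc_ordered_workqueue("cgroup_destroy", 0);
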
      
      Fixes: e5fca243 ("cgroup: use a dedicated workqueue for cgroup destruction")
      Suggested-by: Filipe Brandenburger <filbranden@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org # 3.10+
      Signed-off-by: Tejun Heo <tj@kernel.org>
  13. 06 Feb 2014, 2 commits
    • time: Fix overflow when HZ is smaller than 60 · 80d767d7
      Committed by Mikulas Patocka
      When compiling for the IA-64 ski emulator, HZ is set to 32 because the
      emulation is slow and we don't want to waste too many cycles processing
      timers. Alpha also has an option to set HZ to 32.
      
      This causes an integer overflow in
      kernel/time/jiffies.c:
      kernel/time/jiffies.c:66:2: warning: large integer implicitly truncated to unsigned type [-Woverflow]
        .mult  = NSEC_PER_JIFFY << JIFFIES_SHIFT, /* details above */
        ^
      
      This patch reduces the JIFFIES_SHIFT value to avoid the overflow.
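
      To see the overflow (assuming the customary JIFFIES_SHIFT of 8 before
      the patch): with HZ=32, NSEC_PER_JIFFY is roughly 10^9 / 32 =
      31,250,000, and 31,250,000 << 8 = 8,000,000,000, which does not fit in
      the 32-bit .mult field (maximum about 4.29 * 10^9).  A smaller shift
      for low HZ, say 6, gives 31,250,000 << 6 = 2,000,000,000, which fits.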
      Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
      Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1401241639100.23871@file01.intranet.prod.int.rdu2.redhat.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • execve: use 'struct filename *' for executable name passing · c4ad8f98
      Committed by Linus Torvalds
      This changes 'do_execve()' to get the executable name as a 'struct
      filename', and to free it when it is done.  This is what the normal
      users want, and it simplifies and streamlines their error handling.
      
      The controlled lifetime of the executable name also fixes a
      use-after-free problem with the trace_sched_process_exec tracepoint: the
      lifetime of the passed-in string for kernel users was not at all
      obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize
      the pathname allocation lifetime with the execve() having finished,
      which in turn meant that the trace point that happened after
      mm_release() of the old process VM ended up using already-freed memory.
      
      To solve the kernel string lifetime issue, this simply introduces
      "getname_kernel()" that works like the normal user-space getname()
      function, except with the source coming from kernel memory.
      
      As Oleg points out, this also means that we could drop the tcomm[] array
      from 'struct linux_binprm', since the pathname lifetime now covers
      setup_new_exec().  That would be a separate cleanup.
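
      A hedged sketch of what a kernel-side caller looks like after this
      change (signatures approximate the patch; error handling abbreviated):

        struct filename *name;

        name = getname_kernel(path);    /* copies the path from kernel memory */
        if (IS_ERR(name))
                return PTR_ERR(name);

        /* do_execve() now owns 'name' and frees it when done, so it stays
         * valid for the whole exec, including the point where
         * trace_sched_process_exec fires */
        return do_execve(name, argv, envp);
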
      Reported-by: Igor Zhbanov <i.zhbanov@samsung.com>
      Tested-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 05 Feb 2014, 1 commit
  15. 04 Feb 2014, 1 commit
    • arm, pm, vmpressure: add missing slab.h includes · 1ff6bbfd
      Committed by Tejun Heo
      arch/arm/mach-tegra/pm.c, kernel/power/console.c and mm/vmpressure.c
      were somehow getting slab.h indirectly through cgroup.h which in turn
      was getting it indirectly through xattr.h.  A scheduled cgroup change
      drops xattr.h inclusion from cgroup.h and breaks compilation of these
      three files.  Add explicit slab.h includes to the three files.
      
      A pending cgroup patch depends on this change and it'd be great if
      this can be routed through cgroup/for-3.14-fixes branch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Stephen Warren <swarren@wwwdotorg.org>
      Cc: Thierry Reding <thierry.reding@gmail.com>
      Cc: linux-tegra@vger.kernel.org
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: linux-pm@vger.kernel.org
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: cgroups@vger.kernel.org
  16. 31 Jan 2014, 2 commits
  17. 28 Jan 2014, 1 commit