1. 11 September 2012 (2 commits)
    • workqueue: fix possible idle worker depletion across CPU hotplug · ee378aa4
      Committed by Lai Jiangshan
      To simplify both normal and CPU hotplug paths, worker management is
      prevented while CPU hotplug is in progress.  This is achieved by CPU
      hotplug holding the same exclusion mechanism used by workers to ensure
      there's only one manager per pool.
      
      If someone else seems to be performing the manager role, workers
      proceed to execute work items.  CPU hotplug using the same mechanism
      can lead to idle worker depletion because all workers could proceed to
      execute work items while CPU hotplug is in progress and CPU hotplug
      itself wouldn't actually perform the worker management duty - it
      doesn't guarantee that there's an idle worker left when it releases
      management.
      
      This idle worker depletion, under extreme circumstances, can break
      forward-progress guarantee and thus lead to deadlock.
      
      This patch fixes the bug by using separate mechanisms for manager
      exclusion among workers and hotplug exclusion.  For manager exclusion,
      POOL_MANAGING_WORKERS which was restored by the previous patch is
      used.  pool->manager_mutex is now only used for exclusion between the
      elected manager and CPU hotplug.  The elected manager won't proceed
      without holding pool->manager_mutex.
      
      This ensures that the worker which won the manager position can't skip
      managing while CPU hotplug is in progress.  It will block on
      manager_mutex and perform management after CPU hotplug is complete.
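
      A minimal sketch of the resulting manager path, assuming a
      worker_pool struct carrying the flag and mutex named above
      (simplified, not the literal kernel code):

      	static bool manage_workers(struct worker *worker)
      	{
      		struct worker_pool *pool = worker->pool;

      		/* manager exclusion among workers: the flag, not the mutex */
      		if (pool->flags & POOL_MANAGING_WORKERS)
      			return false;
      		pool->flags |= POOL_MANAGING_WORKERS;

      		/*
      		 * Exclusion against CPU hotplug: block here while hotplug
      		 * is in progress and manage afterwards instead of skipping.
      		 */
      		mutex_lock(&pool->manager_mutex);
      		/* ... actual worker management ... */
      		mutex_unlock(&pool->manager_mutex);

      		pool->flags &= ~POOL_MANAGING_WORKERS;
      		return true;
      	}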
      
      Note that hotplug may happen while waiting for manager_mutex.  A
      manager is on neither the idle nor the busy list, and thus the
      hotplug code can't unbind/rebind it.  Make the manager handle its
      own un/rebinding.
      
      tj: Updated comment and description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      ee378aa4
    • workqueue: restore POOL_MANAGING_WORKERS · 552a37e9
      Committed by Lai Jiangshan
      This patch restores POOL_MANAGING_WORKERS which was replaced by
      pool->manager_mutex by 60373152 "workqueue: use mutex for global_cwq
      manager exclusion".
      
      There's a subtle idle worker depletion bug across CPU hotplug events
      and we need to distinguish between an actual manager and CPU hotplug
      preventing management.  POOL_MANAGING_WORKERS will be used for the
      former and manager_mutex for the latter.
      
      This patch just lays POOL_MANAGING_WORKERS on top of the existing
      manager_mutex and doesn't introduce any synchronization changes.  The
      next patch will update it.
      
      Note that this patch fixes a non-critical anomaly where
      too_many_workers() may return %true spuriously while CPU hotplug is in
      progress.  While the issue could schedule the idle timer spuriously, it
      didn't trigger any actual misbehavior.
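
      For illustration, a sketch of the check the anomaly lives in
      (simplified shape; field names are assumptions based on the
      description):

      	static bool too_many_workers(struct worker_pool *pool)
      	{
      		/* an actual manager, not hotplug holding manager_mutex */
      		bool managing = pool->flags & POOL_MANAGING_WORKERS;
      		int nr_idle = pool->nr_idle + managing; /* manager counts as idle here */
      		int nr_busy = pool->nr_workers - nr_idle;

      		return nr_idle > 2 &&
      		       (nr_idle - 2) * MAX_IDLE_WORKERS_RATIO >= nr_busy;
      	}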
      
      tj: Rewrote patch description.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      552a37e9
  2. 06 September 2012 (2 commits)
    • workqueue: fix possible deadlock in idle worker rebinding · ec58815a
      Committed by Tejun Heo
      Currently, rebind_workers() and idle_worker_rebind() are two-way
      interlocked.  rebind_workers() waits for idle workers to finish
      rebinding and rebound idle workers wait for rebind_workers() to finish
      rebinding busy workers before proceeding.
      
      Unfortunately, this isn't enough.  The second wait from idle workers
      is implemented as follows.
      
      	wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND));
      
      rebind_workers() clears WORKER_REBIND, wakes up the idle workers and
      then returns.  If a CPU hotplug cycle happens again before one of the
      idle workers finishes the above wait_event(), rebind_workers() will
      repeat the first part of the handshake - set WORKER_REBIND again and
      wait for the idle worker to finish rebinding - and this leads to
      deadlock because the idle worker would be waiting for WORKER_REBIND to
      clear.
      
      This is fixed by adding another interlocking step at the end -
      rebind_workers() now waits for all the idle workers to finish the
      above WORKER_REBIND wait before returning.  This ensures that all
      rebinding steps are complete on all idle workers before the next
      hotplug cycle can happen.
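
      One way to picture the added third step (illustrative sketch; the
      idle_rebind bookkeeping names are assumptions, not the exact code):

      	/* idle worker side, after the existing wait */
      	wait_event(gcwq->rebind_hold, !(worker->flags & WORKER_REBIND));
      	if (atomic_dec_and_test(&idle_rebind->cnt))
      		complete(&idle_rebind->done);

      	/*
      	 * rebind_workers() side, after clearing WORKER_REBIND and waking
      	 * the idle workers: wait for all of them to get past the
      	 * wait_event() above before returning.
      	 */
      	wait_for_completion(&idle_rebind->done);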
      
      This problem was diagnosed by Lai Jiangshan who also posted a patch to
      fix the issue, upon which this patch is based.
      
      This is the minimal fix and further patches are scheduled for the next
      merge window to simplify the CPU hotplug path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Original-patch-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      LKML-Reference: <1346516916-1991-3-git-send-email-laijs@cn.fujitsu.com>
      ec58815a
    • workqueue: move WORKER_REBIND clearing in rebind_workers() to the end of the function · 90beca5d
      Committed by Tejun Heo
      This doesn't make any functional difference and is purely to help the
      next patch to be simpler.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      90beca5d
  3. 05 September 2012 (1 commit)
    • workqueue: UNBOUND -> REBIND morphing in rebind_workers() should be atomic · 96e65306
      Committed by Lai Jiangshan
      The compiler may compile the following code into TWO write/modify
      instructions.
      
      	worker->flags &= ~WORKER_UNBOUND;
      	worker->flags |= WORKER_REBIND;
      
      so another CPU may temporarily see a worker->flags value that has
      neither WORKER_UNBOUND nor WORKER_REBIND set and perform a local
      wakeup prematurely.

      Fix it by using a single explicit assignment via ACCESS_ONCE().
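
      The resulting pattern looks roughly like this (a sketch built from
      the description above, not the verbatim patch):

      	/* before: the compiler may emit two separate RMW instructions */
      	worker->flags &= ~WORKER_UNBOUND;
      	worker->flags |= WORKER_REBIND;

      	/* after: compute in a local, then publish with a single store */
      	unsigned long worker_flags = worker->flags;
      	worker_flags |= WORKER_REBIND;
      	worker_flags &= ~WORKER_UNBOUND;
      	ACCESS_ONCE(worker->flags) = worker_flags;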
      
      Because idle workers have another WORKER_NOT_RUNNING flag, this bug
      doesn't exist for them; however, update it to use the same pattern for
      consistency.
      
      tj: Applied the change to idle workers too and updated comments and
          patch description a bit.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      96e65306
  4. 01 August 2012 (5 commits)
    • mm: allow PF_MEMALLOC from softirq context · 907aed48
      Committed by Mel Gorman
      This is needed to allow network softirq packet processing to make use of
      PF_MEMALLOC.
      
      Currently, softirq context cannot use PF_MEMALLOC because it is not
      associated with a task and therefore has no task flags to fiddle with;
      thus the gfp-to-alloc-flags mapping ignores the task flags when in
      interrupt (hard or soft) context.
      
      Allowing softirqs to make use of PF_MEMALLOC therefore requires some
      trickery.  This patch borrows the task flags from whatever process
      happens to be preempted by the softirq.  It then modifies the
      gfp-to-alloc-flags mapping to not exclude task flags in softirq
      context, and modifies the softirq code to save, clear and restore the
      PF_MEMALLOC flag.
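
      A sketch of the save/clear/restore step (simplified;
      tsk_restore_flags() is the masked-restore helper this series
      introduces, other details are elided):

      	asmlinkage void __do_softirq(void)
      	{
      		/*
      		 * Borrow the preempted task's flags, masking out its
      		 * PF_MEMALLOC so it doesn't leak into the softirq.
      		 */
      		unsigned long old_flags = current->flags;

      		current->flags &= ~PF_MEMALLOC;

      		/* ... process pending softirqs ... */

      		/* a softirq-set PF_MEMALLOC must not leak back either */
      		tsk_restore_flags(current, old_flags, PF_MEMALLOC);
      	}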
      
      The save and clear ensure the preempted task's PF_MEMALLOC flag doesn't
      leak into the softirq.  The restore ensures a softirq's PF_MEMALLOC flag
      cannot leak back into the preempted process.  This should be safe for
      the following reasons:
      
      Softirqs can run on multiple CPUs, but the same task should not be
      	executing the same softirq code.  Neither should the softirq
      	handler be preempted by any other softirq handler, so the flags
      	should not leak to an unrelated softirq.
      
      Softirqs re-enable hardware interrupts in __do_softirq() and so can be
      	preempted by hardware interrupts; PF_MEMALLOC is then inherited
      	by the hard IRQ. However, this is similar to a process in
      	reclaim being preempted by a hardirq. While PF_MEMALLOC is
      	set, gfp_to_alloc_flags() distinguishes between hard and
      	soft irqs and avoids giving a hardirq the ALLOC_NO_WATERMARKS
      	flag.
      
      If the softirq is deferred to ksoftirqd then its flags may be used
              instead of a normal task's, but as the softirq cannot be preempted,
              the PF_MEMALLOC flag does not leak to other code by accident.
      
      [davem@davemloft.net: Document why PF_MEMALLOC is safe]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      907aed48
    • mm/hotplug: correctly setup fallback zonelists when creating new pgdat · 9adb62a5
      Committed by Jiang Liu
      When hotadd_new_pgdat() is called to create a new pgdat for a new node, a
      fallback zonelist should be created for the new node.  There's code to try
      to achieve that in hotadd_new_pgdat() as below:
      
      	/*
      	 * The node we allocated has no zone fallback lists. For avoiding
      	 * to access not-initialized zonelist, build here.
      	 */
      	mutex_lock(&zonelists_mutex);
      	build_all_zonelists(pgdat, NULL);
      	mutex_unlock(&zonelists_mutex);
      
      But it doesn't work as expected.  When hotadd_new_pgdat() is called, the
      new node is still in the offline state because node_set_online(nid) hasn't
      been called yet.  And build_all_zonelists() only builds zonelists for
      online nodes as:
      
              for_each_online_node(nid) {
                      pg_data_t *pgdat = NODE_DATA(nid);
      
                      build_zonelists(pgdat);
                      build_zonelist_cache(pgdat);
              }
      
      Though we hope to create the zonelist for the new pgdat, it doesn't
      get built.  So add a new parameter "pgdat" to build_all_zonelists()
      so that it builds the zonelist for the new pgdat too.
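
      Roughly, the intent (simplified sketch of the new parameter's use;
      the real code routes this through a helper):

      	void build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
      	{
      		/* ... */
      		/*
      		 * The new node isn't online yet, so build its zonelists
      		 * explicitly through the passed-in pgdat.
      		 */
      		if (pgdat)
      			build_zonelists(pgdat);

      		for_each_online_node(nid)
      			build_zonelists(NODE_DATA(nid));
      		/* ... */
      	}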
      Signed-off-by: Jiang Liu <liuj97@gmail.com>
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Keping Chen <chenkeping@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9adb62a5
    • memcg: rename config variables · c255a458
      Committed by Andrew Morton
      Sanity:
      
      CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
      CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM
      
      [mhocko@suse.cz: fix missed bits]
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c255a458
    • mm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threads · 3965c9ae
      Committed by Wanpeng Li
      Since per-BDI flusher threads were introduced in 2.6, the pdflush
      mechanism is not used any more.  But the old interface exported through
      /proc/sys/vm/nr_pdflush_threads still exists and is obviously useless.
      
      For backward compatibility, print a warning and return 2 to notify
      users that the interface has been removed.
      Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3965c9ae
    • mm: account the total_vm in the vm_stat_account() · 44de9d0c
      Committed by Huang Shijie
      vm_stat_account() currently accounts shared_vm, stack_vm and
      reserved_vm.  But we can also account total_vm in vm_stat_account(),
      which makes the code tidier.
      
      Even for mprotect_fixup(), we can get the right result in the end.
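
      Conceptually (hypothetical sketch; callers then drop their own
      total_vm updates):

      	void vm_stat_account(struct mm_struct *mm, unsigned long flags,
      			     struct file *file, long pages)
      	{
      		mm->total_vm += pages;	/* now accounted here, too */

      		if (file)
      			mm->shared_vm += pages;
      		else if (flags & VM_GROWSDOWN)
      			mm->stack_vm += pages;
      		/* ... reserved_vm etc. ... */
      	}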
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44de9d0c
  5. 31 July 2012 (17 commits)
    • resource: make sure requested range is included in the root range · 65fed8f6
      Committed by Octavian Purdila
      When the requested range is outside of the root range the logic in
      __reserve_region_with_split will cause an infinite recursion which will
      overflow the stack, as seen in the warning below.
      
      This particular stack overflow was caused by requesting the
      (100000000-107ffffff) range while the root range was (0-ffffffff).  In
      this case __request_resource would return the whole root range as
      conflict range (i.e.  0-ffffffff).  Then, the logic in
      __reserve_region_with_split would continue the recursion requesting the
      new range as (conflict->end+1, end) which incidentally in this case
      equals the originally requested range.
      
      This patch aborts looking for a usable range when the request does not
      intersect with the root range.  When the request partially overlaps with
      the root range, it adjusts the request to fall within the root range and
      then continues with the new request.
      
      When the request is modified or aborted, errors and a stack trace are
      logged to allow catching the errors in the upper layers.
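
      The guard amounts to something like this at the top of
      __reserve_region_with_split() (illustrative sketch):

      	/* request entirely outside the root range: log and abort */
      	if (start > root->end || end < root->start)
      		return;

      	/* partial overlap: clamp the request into the root range */
      	if (start < root->start)
      		start = root->start;
      	if (end > root->end)
      		end = root->end;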
      
      [    5.968374] WARNING: at kernel/sched.c:4129 sub_preempt_count+0x63/0x89()
      [    5.975150] Modules linked in:
      [    5.978184] Pid: 1, comm: swapper Not tainted 3.0.22-mid27-00004-gb72c817 #46
      [    5.985324] Call Trace:
      [    5.987759]  [<c1039dfc>] ? console_unlock+0x17b/0x18d
      [    5.992891]  [<c1039620>] warn_slowpath_common+0x48/0x5d
      [    5.998194]  [<c1031758>] ? sub_preempt_count+0x63/0x89
      [    6.003412]  [<c1039644>] warn_slowpath_null+0xf/0x13
      [    6.008453]  [<c1031758>] sub_preempt_count+0x63/0x89
      [    6.013499]  [<c14d60c4>] _raw_spin_unlock+0x27/0x3f
      [    6.018453]  [<c10c6349>] add_partial+0x36/0x3b
      [    6.022973]  [<c10c7c0a>] deactivate_slab+0x96/0xb4
      [    6.027842]  [<c14cf9d9>] __slab_alloc.isra.54.constprop.63+0x204/0x241
      [    6.034456]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
      [    6.039842]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
      [    6.045232]  [<c10c7dc9>] kmem_cache_alloc_trace+0x51/0xb0
      [    6.050710]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
      [    6.056100]  [<c103f78f>] kzalloc.constprop.5+0x29/0x38
      [    6.061320]  [<c17b45e9>] __reserve_region_with_split+0x1c/0xd1
      [    6.067230]  [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
      ...
      [    7.179057]  [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
      [    7.184970]  [<c17b4779>] reserve_region_with_split+0x30/0x42
      [    7.190709]  [<c17a8ebf>] e820_reserve_resources_late+0xd1/0xe9
      [    7.196623]  [<c17c9526>] pcibios_resource_survey+0x23/0x2a
      [    7.202184]  [<c17cad8a>] pcibios_init+0x23/0x35
      [    7.206789]  [<c17ca574>] pci_subsys_init+0x3f/0x44
      [    7.211659]  [<c1002088>] do_one_initcall+0x72/0x122
      [    7.216615]  [<c17ca535>] ? pci_legacy_init+0x3d/0x3d
      [    7.221659]  [<c17a27ff>] kernel_init+0xa6/0x118
      [    7.226265]  [<c17a2759>] ? start_kernel+0x334/0x334
      [    7.231223]  [<c14d7482>] kernel_thread_helper+0x6/0x10
      Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65fed8f6
    • taskstats: check nla_reserve() return · 25353b33
      Committed by Alan Cox
      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=44621
      
      Reported-by: <rucsoftsec@gmail.com>
      Signed-off-by: Alan Cox <alan@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25353b33
    • sysctl: suppress kmemleak messages · fd4b616b
      Committed by Steven Rostedt
      register_sysctl_table() is a strange function, as it makes internal
      allocations (a header) to register a sysctl_table.  This header is a
      handle to the table that is created, and can be used to unregister the
      table.  But if the table is permanent and never unregistered, the header
      acts the same as a static variable.
      
      Unfortunately, this allocation of memory that is never expected to be
      freed fools kmemleak in thinking that we have leaked memory.  For those
      sysctl tables that are never unregistered, and have no pointer referencing
      them, kmemleak will think that these are memory leaks:
      
      unreferenced object 0xffff880079fb9d40 (size 192):
        comm "swapper/0", pid 0, jiffies 4294667316 (age 12614.152s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff8146b590>] kmemleak_alloc+0x73/0x98
          [<ffffffff8110a935>] kmemleak_alloc_recursive.constprop.42+0x16/0x18
          [<ffffffff8110b852>] __kmalloc+0x107/0x153
          [<ffffffff8116fa72>] kzalloc.constprop.8+0xe/0x10
          [<ffffffff811703c9>] __register_sysctl_paths+0xe1/0x160
          [<ffffffff81170463>] register_sysctl_paths+0x1b/0x1d
          [<ffffffff8117047d>] register_sysctl_table+0x18/0x1a
          [<ffffffff81afb0a1>] sysctl_init+0x10/0x14
          [<ffffffff81b05a6f>] proc_sys_init+0x2f/0x31
          [<ffffffff81b0584c>] proc_root_init+0xa5/0xa7
          [<ffffffff81ae5b7e>] start_kernel+0x3d0/0x40a
          [<ffffffff81ae52a7>] x86_64_start_reservations+0xae/0xb2
          [<ffffffff81ae53ad>] x86_64_start_kernel+0x102/0x111
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      The sysctl_base_table used by sysctl itself is one such instance: it
      registers a table that is never unregistered.
      
      Use kmemleak_not_leak() to suppress the kmemleak false positive.
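
      In essence (sketch):

      	header = kzalloc(sizeof(struct ctl_table_header), GFP_KERNEL);
      	/* ... register the table ... */

      	/*
      	 * Permanently registered table: the header stays reachable only
      	 * through sysctl internals, so tell kmemleak it isn't a leak.
      	 */
      	kmemleak_not_leak(header);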
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fd4b616b
    • kdump: append newline to the last line of vmcoreinfo note · 63dca8d5
      Committed by Vivek Goyal
      The last line of the vmcoreinfo note does not end with \n.  Parsing all
      the lines in the note becomes easier if all lines end with \n instead of
      trying to special-case the last line.
      
      I know of at least one tool, vmcore-dmesg in the kexec-tools tree, which
      makes the assumption that all lines end with \n.  I think it is a good
      idea to fix it.
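
      For example, a line parser gets to stay this simple once every line,
      including the last, ends with \n (illustrative; handle_line() is a
      made-up consumer and buf/size describe the note body):

      	char *pos, *nl;

      	for (pos = buf; (nl = memchr(pos, '\n', buf + size - pos)); pos = nl + 1)
      		handle_line(pos, nl - pos);	/* no special case for the last line */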
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63dca8d5
    • fork: fix error handling in dup_task() · f19b9f74
      Committed by Akinobu Mita
      The function dup_task() may fail at the following function calls in the
      following order.
      
      0) alloc_task_struct_node()
      1) alloc_thread_info_node()
      2) arch_dup_task_struct()
      
      An error at 0) is not a problem; the function can just return.  But an
      error at 1) requires releasing the task_struct allocated at 0) before
      returning.  Likewise, an error at 2) requires releasing the task_struct
      and thread_info allocated at 0) and 1).
      
      The existing error handling calls free_task_struct() and
      free_thread_info(), which not only release task_struct and thread_info
      but also call the architecture-specific arch_release_task_struct() and
      arch_release_thread_info().
      
      The problem is that task_struct and thread_info are not fully initialized
      yet at this point, but arch_release_task_struct() and
      arch_release_thread_info() are called with them.
      
      For example, x86 defines its own arch_release_task_struct() that releases
      a task_xstate.  If alloc_thread_info_node() fails in dup_task(),
      arch_release_task_struct() is called with a task_struct that has just
      been allocated and is still filled with garbage in this error path.
      
      This actually happened with tools/testing/fault-injection/failcmd.sh
      
      	# env FAILCMD_TYPE=fail_page_alloc \
      		./tools/testing/fault-injection/failcmd.sh --times=100 \
      		--min-order=0 --ignore-gfp-wait=0 \
      		-- make -C tools/testing/selftests/ run_tests
      
      In order to fix this issue, make free_{task_struct,thread_info}() not
      call arch_release_{task_struct,thread_info}(), and call
      arch_release_{task_struct,thread_info}() explicitly where needed.
      
      The default arch_release_task_struct() and arch_release_thread_info()
      are defined as empty, so this change only affects the
      architectures which implement their own arch_release_task_struct() or
      arch_release_thread_info() as listed below.
      
      arch_release_task_struct(): x86, sh
      arch_release_thread_info(): mn10300, tile
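
      The resulting error path looks roughly like this (simplified sketch
      of the described fix):

      	static struct task_struct *dup_task_struct(struct task_struct *orig)
      	{
      		struct task_struct *tsk;
      		struct thread_info *ti;
      		int node = tsk_fork_get_node(orig);

      		tsk = alloc_task_struct_node(node);		/* 0) */
      		if (!tsk)
      			return NULL;
      		ti = alloc_thread_info_node(tsk, node);		/* 1) */
      		if (!ti)
      			goto free_tsk;
      		if (arch_dup_task_struct(tsk, orig))		/* 2) */
      			goto free_ti;
      		/* ... */

      	free_ti:
      		free_thread_info(ti);	/* no arch_release_thread_info() here */
      	free_tsk:
      		free_task_struct(tsk);	/* no arch_release_task_struct() here */
      		return NULL;
      	}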
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Salman Qazi <sqazi@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f19b9f74
    • revert "sched: Fix fork() error path to not crash" · 87bec58a
      Committed by Andrew Morton
      To make way for "fork: fix error handling in dup_task()", which fixes the
      errors more completely.
      
      Cc: Salman Qazi <sqazi@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      87bec58a
    • fork: use vma_pages() to simplify the code · b2412b7f
      Committed by Huang Shijie
      The current code can be replaced by vma_pages().  So use it to simplify
      the code.
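
      That is (sketch):

      	/* before */
      	unsigned long len = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;

      	/* after */
      	unsigned long len = vma_pages(mpnt);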
      
      [akpm@linux-foundation.org: initialise `len' at its definition site]
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2412b7f
    • kmod: avoid deadlock from recursive kmod call · 0f20784d
      Committed by Tetsuo Handa
      The system deadlocks (at least since 2.6.10) when a
      call_usermodehelper(UMH_WAIT_EXEC) request triggers a
      call_usermodehelper(UMH_WAIT_PROC) request.
      
      This is because "khelper thread is waiting for the worker thread at
      wait_for_completion() in do_fork() since the worker thread was created
      with CLONE_VFORK flag" and "the worker thread cannot call complete()
      because do_execve() is blocked at UMH_WAIT_PROC request" and "the khelper
      thread cannot start processing UMH_WAIT_PROC request because the khelper
      thread is waiting for the worker thread at wait_for_completion() in
      do_fork()".
      
      The easiest example to observe this deadlock is to use a corrupted
      /sbin/hotplug binary (as shown below).
      
        # : > /tmp/dummy
        # chmod 755 /tmp/dummy
        # echo /tmp/dummy > /proc/sys/kernel/hotplug
        # modprobe whatever
      
      call_usermodehelper("/tmp/dummy", UMH_WAIT_EXEC) is called from
      kobject_uevent_env() in lib/kobject_uevent.c upon loading/unloading a
      module.  do_execve("/tmp/dummy") triggers a call to
      request_module("binfmt-0000") from search_binary_handler() which in turn
      calls call_usermodehelper(UMH_WAIT_PROC).
      
      In order to avoid the deadlock, as an interim and easy-to-backport
      solution, do not call wait_for_completion() in
      call_usermodehelper_exec() if the worker thread was created by the
      khelper thread with the CLONE_VFORK flag.  A future, more fundamental
      solution might replace the singleton khelper thread with a workqueue
      so that recursive calls up to a max_active dependency loop can be
      handled without deadlock.
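
      The interim guard amounts to something like this (illustrative
      sketch; the locker-tracking detail is simplified):

      	/*
      	 * The CLONE_VFORK worker records itself while khelper is
      	 * blocked waiting for it in do_fork().
      	 */
      	kmod_thread_locker = current;

      	/*
      	 * In call_usermodehelper_exec(): such a worker must not wait
      	 * for khelper, which is still waiting for the worker itself.
      	 */
      	if (wait != UMH_NO_WAIT && current == kmod_thread_locker) {
      		retval = -EBUSY;
      		goto out;
      	}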
      
      [akpm@linux-foundation.org: add comment to kmod_thread_locker]
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f20784d
    • kernel/kmod.c: document call_usermodehelper_fns() a bit · 79c743dd
      Committed by Andrew Morton
      This function's interface is, uh, subtle.  Attempt to apologise for it.
      
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79c743dd
    • printk: only look for prefix levels in kernel messages · 088a52aa
      Committed by Joe Perches
      vprintk_emit() prefix parsing should only be done for internal kernel
      messages.  This allows existing behavior to be kept in all cases.
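
      Roughly, in vprintk_emit() (sketch):

      	/*
      	 * Only kernel-originated messages (facility 0) may carry a
      	 * KERN_<LEVEL> prefix worth parsing.
      	 */
      	if (facility == 0) {
      		int kern_level = printk_get_level(text);
      		/* ... strip the header and adopt kern_level ... */
      	}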
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Kay Sievers <kay@vrfy.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      088a52aa
    • printk: add generic functions to find KERN_<LEVEL> headers · acc8fa41
      Committed by Joe Perches
      The current form of a KERN_<LEVEL> header is "<.>", where "." is the
      level character.
      
      Add printk_get_level and printk_skip_level functions to handle these
      formats.
      
      These functions centralize tests of KERN_<LEVEL> so a future modification
      can change the KERN_<LEVEL> style and reduce the number of bytes consumed
      by these headers.
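
      Their shape is roughly as follows (sketch following the "<.>" form
      described above, not necessarily the verbatim patch):

      	static inline int printk_get_level(const char *buffer)
      	{
      		if (buffer[0] == '<' && buffer[1] && buffer[2] == '>') {
      			switch (buffer[1]) {
      			case '0' ... '7':
      			case 'd':	/* KERN_DEFAULT */
      				return buffer[1];
      			}
      		}
      		return 0;
      	}

      	static inline const char *printk_skip_level(const char *buffer)
      	{
      		return printk_get_level(buffer) ? buffer + 3 : buffer;
      	}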
      
      [akpm@linux-foundation.org: fix build error and warning]
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Wu Fengguang <wfg@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      acc8fa41
    • kernel/sys.c: avoid argv_free(NULL) · b57b44ae
      Committed by Andrew Morton
      If argv_split() fails, the code will end up calling argv_free(NULL).  Fix
      it up and clean things up a bit.
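
      That is (sketch):

      	argv = argv_split(GFP_KERNEL, cmd, NULL);
      	if (!argv)
      		return -ENOMEM;	/* don't fall through to argv_free(NULL) */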
      
      Addresses Coverity report 703573.
      
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b57b44ae
    • NMI watchdog: fix for lockup detector breakage on resume · 45226e94
      Committed by Sameer Nanda
      On the suspend/resume path the boot CPU does not go through an
      offline->online transition.  This breaks the NMI detector post-resume
      since it depends on PMU state that is lost when the system gets
      suspended.
      
      Fix this by forcing a CPU offline->online transition for the lockup
      detector on the boot CPU during resume.
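
      Roughly (sketch; lockup_detector_bootcpu_resume() is the entry point
      mentioned in the notes below, while the watchdog helpers here are
      hypothetical):

      	#ifdef CONFIG_SUSPEND
      	void lockup_detector_bootcpu_resume(void)
      	{
      		int cpu = smp_processor_id();	/* boot CPU */

      		/*
      		 * Fake an offline->online cycle so the PMU-backed
      		 * watchdog state is set up again after resume.
      		 */
      		watchdog_disable(cpu);		/* hypothetical helper */
      		watchdog_enable(cpu);		/* hypothetical helper */
      	}
      	#endif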
      
      To provide more context, we enable NMI watchdog on Chrome OS.  We have
      seen several reports of systems freezing up completely which indicated
      that the NMI watchdog was not firing for some reason.
      
      Debugging further, we found a simple way of reproducing system freezes:
      issuing the command 'taskset 1 sh -c "echo nmilockup > /proc/breakme"'
      after the system has been suspended/resumed one or more times.
      
      With this patch in place, the system freezes result in panics, as
      expected.
      
      These panics provide a nice stack trace for us to debug the actual issue
      causing the freeze.
      
      [akpm@linux-foundation.org: fiddle with code comment]
      [akpm@linux-foundation.org: make lockup_detector_bootcpu_resume() conditional on CONFIG_SUSPEND]
      [akpm@linux-foundation.org: fix section errors]
      Signed-off-by: Sameer Nanda <snanda@chromium.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Mandeep Singh Baines <msb@chromium.org>
      Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      45226e94
    • panic: fix a possible deadlock in panic() · 190320c3
      Committed by Vikram Mulukutla
      panic_lock is meant to ensure that panic processing takes place only on
      one cpu; if any of the other cpus encounter a panic, they will spin
      waiting to be shut down.
      
      However, this causes a regression in this scenario:
      
      1. Cpu 0 encounters a panic and acquires the panic_lock
         and proceeds with the panic processing.
      2. There is an interrupt on cpu 0 that also encounters
         an error condition and invokes panic.
      3. This second invocation fails to acquire the panic_lock
         and enters the infinite while loop in panic_smp_self_stop.
      
      Thus all panic processing is stopped, and the cpu is stuck for eternity
      in the while(1) inside panic_smp_self_stop.
      
      To address this, disable local interrupts with local_irq_disable() before
      acquiring the panic_lock.  This will prevent interrupt handlers from
      executing during the panic processing, thus avoiding this particular
      problem.
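
      That is (sketch of the change in panic()):

      	void panic(const char *fmt, ...)
      	{
      		/* ... */
      		/*
      		 * Keep interrupt handlers on this CPU from re-entering
      		 * panic() and self-stopping while we hold panic_lock.
      		 */
      		local_irq_disable();

      		if (!spin_trylock(&panic_lock))
      			panic_smp_self_stop();
      		/* ... */
      	}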
      Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
      Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      190320c3
    • coredump: warn about unsafe suid_dumpable / core_pattern combo · 54b50199
      Committed by Kees Cook
      When suid_dumpable=2, detect unsafe core_pattern settings and warn when
      they are seen.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Cox <alan@linux.intel.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      54b50199
    • prctl: remove redundant assignment of "error" to zero · f1fd75bf
      Committed by Sasikantha babu
      On failure, setting "error" to the error number is enough; there is no
      need to set "error" to zero in each switch case, since it was already
      initialized to zero.  Also, "return 0" in switch cases was replaced
      with break statements.
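
      For example (illustrative before/after of a single case):

      	long error = 0;			/* initialized once */

      	switch (option) {
      	case PR_SET_PDEATHSIG:
      		if (!valid_signal(arg2)) {
      			error = -EINVAL;
      			break;
      		}
      		current->pdeath_signal = arg2;
      		break;			/* was: error = 0; ... return 0; */
      	/* ... */
      	}
      	return error;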
      Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Serge E. Hallyn <serge@hallyn.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1fd75bf
  6. 30 July 2012 (13 commits)