1. 09 Sep, 2012 1 commit
  2. 22 Aug, 2012 2 commits
  3. 19 Aug, 2012 1 commit
  4. 15 Aug, 2012 5 commits
  5. 14 Aug, 2012 5 commits
    • M
      sched: Fix migration thread runtime bogosity · 8f618968
      Mike Galbraith authored
      Make the stop scheduler class do the same accounting as the other classes.
      
      Migration threads can be caught in the act while doing exec balancing,
      leading to the below due to use of unmaintained ->se.exec_start.  The
      load that triggered this particular instance was an apparently out of
      control heavily threaded application that does system monitoring in
      what equated to an exec bomb, with one of the VERY frequently migrated
      tasks being ps.
      
      %CPU   PID USER     CMD
      99.3    45 root     [migration/10]
      97.7    53 root     [migration/12]
      97.0    57 root     [migration/13]
      90.1    49 root     [migration/11]
      89.6    65 root     [migration/15]
      88.7    17 root     [migration/3]
      80.4    37 root     [migration/8]
      78.1    41 root     [migration/9]
      44.2    13 root     [migration/2]
      Signed-off-by: Mike Galbraith <mgalbraith@suse.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1344051854.6739.19.camel@marge.simpson.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      8f618968
    • M
      sched,rt: fix isolated CPUs leaving root_task_group indefinitely throttled · e221d028
      Mike Galbraith authored
      Root task group bandwidth replenishment must service all CPUs, regardless of
      where the timer was last started, and regardless of the isolation mechanism,
      lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      e221d028
    • M
      sched,cgroup: Fix up task_groups list · 35cf4e50
      Mike Galbraith authored
      With multiple instances of task_groups, for_each_rt_rq() is a noop
      because no task groups were ever added to the list instance in rt.c.
      This renders __enable/disable_runtime() and print_rt_stats() noops,
      the (barely) user-visible effect being that rt task groups are
      missing from /proc/sched_debug.
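      The cure, sketched from memory of the change (treat as illustrative,
      not the verbatim patch), is to keep a single shared list instance and
      have everyone reference it:
      
        /* kernel/sched/core.c -- the one and only list instance */
        LIST_HEAD(task_groups);
        
        /* kernel/sched/sched.h -- was a static LIST_HEAD(task_groups),
         * which gave each includer (core.c, rt.c, ...) a private copy */
        extern struct list_head task_groups;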
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Cc: stable@kernel.org # v3.3+
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      35cf4e50
    • S
      sched: fix divide by zero at {thread_group,task}_times · bea6832c
      Stanislaw Gruszka authored
      On architectures where cputime_t is a 64-bit type, it is possible to
      trigger a divide-by-zero on the do_div(temp, (__force u32) total)
      line if total is non-zero but has its lower 32 bits zeroed.  Removing
      the cast is not a good solution since some do_div() implementations
      do cast to u32 internally.  (A sketch of the hazard follows the
      backtrace below.)
      
      This problem can be triggered in practice on very long lived processes:
      
        PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
         #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
         #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
         #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
         #3 [ffff880472a51cd0] die at ffffffff8100f26b
         #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
         #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
         #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
            [exception RIP: thread_group_times+0x56]
            RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
            RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
            RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
            RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
            R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
            R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
            ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
         #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
         #8 [ffff880472a51f40] sys_times at ffffffff81088524
         #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
            RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
            RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
            RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
            RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
            R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
            R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
            ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b
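      
      A minimal userspace sketch of the truncation hazard (hypothetical
      values; the kernel-side cure is a genuine 64-bit division such as
      div64_u64() when cputime_t is 64 bit):
      
        #include <stdint.h>
        #include <stdio.h>
        
        int main(void)
        {
            /* total is non-zero, but its low 32 bits are all zero */
            uint64_t total = 0x0000000100000000ULL;
            uint64_t temp = 12345;
        
            /* do_div(temp, (u32)total) divides by the truncated value */
            uint32_t divisor = (uint32_t)total;    /* == 0 */
        
            if (divisor)
                temp /= divisor;
            else
                printf("would fault: divisor truncated to 0\n");
            return 0;
        }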
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      bea6832c
    • P
      sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies · a35b6466
      Peter Zijlstra authored
      Peter Portante reported that for large cgroup hierarchies (and/or
      large CPU counts) we get immense lock contention on rq->lock and
      stuff stops working properly.
      
      His workload was a ton of processes, each in their own cgroup,
      everybody idling except for a sporadic wakeup once every so often.
      
      It was found that:
      
        schedule()
          idle_balance()
            load_balance()
              local_irq_save()
              double_rq_lock()
              update_h_load()
                walk_tg_tree(tg_load_down)
                  tg_load_down()
      
      Results in an entire cgroup hierarchy walk under rq->lock for every
      new-idle balance and since new-idle balance isn't throttled this
      results in a lot of work while holding the rq->lock.
      
      This patch does two things: it removes the work from under rq->lock,
      based on the good principle of race-and-pray which is widely employed
      in the load balancer as a whole; and it throttles the update_h_load()
      calculation to at most once per jiffy.
      
      I considered excluding update_h_load() for new-idle balance
      altogether, but purely relying on regular balance passes to update
      this data might not work out under some rare circumstances where the
      new-idle busiest isn't the regular busiest for a while (unlikely, but
      a nightmare to debug if someone hits it and suffers).
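      
      A sketch of the once-per-jiffy throttle, close to the actual change
      but with details hedged:
      
        static void update_h_load(long cpu)
        {
            struct rq *rq = cpu_rq(cpu);
            unsigned long now = jiffies;
        
            /* recompute the hierarchy load at most once per jiffy */
            if (rq->h_load_throttle == now)
                return;
            rq->h_load_throttle = now;
        
            rcu_read_lock();
            walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
            rcu_read_unlock();
        }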
      
      Cc: pjt@google.com
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Reported-by: Peter Portante <pportant@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      a35b6466
  6. 13 Aug, 2012 1 commit
    • J
      printk: Fix calculation of length used to discard records · e3756477
      Jeff Mahoney authored
      While tracking down a weird buffer overflow issue in a program that
      looked to be sane, I started double checking the length returned by
      syslog(SYSLOG_ACTION_READ_ALL, ...) to make sure it wasn't overflowing
      the buffer.
      
      Sure enough, it was.  I saw this in strace:
      
        11339 syslog(SYSLOG_ACTION_READ_ALL, "<5>[244017.708129] REISERFS (dev"..., 8192) = 8279
      
      It turns out that the loops that calculate how much space the entries
      will take when they're copied don't include the newlines and prefixes
      that will be included in the final output, since the prev flags are
      passed as zero.
      
      This patch properly accounts for it and fixes the overflow.
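      
      The shape of the fix, roughly (the key point is threading the
      previous record's flags into msg_print_text() so continuation
      prefixes and newlines are counted):
      
        u64 seq = clear_seq;
        u32 idx = clear_idx;
        enum log_flags prev = 0;
        int len = 0;
        
        while (seq < log_next_seq) {
            struct log *msg = log_from_idx(idx);
        
            /* was msg_print_text(msg, 0, ...), which undercounts the
             * newline/prefix bytes present in the final output */
            len += msg_print_text(msg, prev, true, NULL, 0);
            prev = msg->flags;
            idx = log_next(idx);
            seq++;
        }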
      
      CC: stable@kernel.org
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3756477
  7. 09 Aug, 2012 1 commit
  8. 05 Aug, 2012 1 commit
    • I
      time: Fix adjustment cleanup bug in timekeeping_adjust() · 1d17d174
      Ingo Molnar authored
      Tetsuo Handa reported that sporadically the system clock starts
      counting up too quickly, which is enough to confuse the hangcheck
      timer into printing a bogus stall warning.
      
      Commit 2a8c0883 "time: Move xtime_nsec adjustment underflow handling
      timekeeping_adjust" overlooked this exit path:
      
              } else
                      return;
      
      which should really be a proper exit sequence; making it one fixes
      the bug as a side effect.
      
      Also make the flow more readable by properly balancing curly
      braces.
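      
      The shape of the fix, sketched (paraphrased, not the verbatim patch):
      the early return becomes a jump to a common exit label so the
      xtime_nsec underflow handling still runs:
      
            } else {
                goto out_adjust;    /* was: return; -- skipped the fixup */
            }
            ...
        out_adjust:
            /* the underflow handling now runs on every exit path */
            if (unlikely((s64)tk->xtime_nsec < 0)) {
                ...
            }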
      
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: john.stultz@linaro.org
      Cc: a.p.zijlstra@chello.nl
      Cc: richardcochran@gmail.com
      Cc: prarit@redhat.com
      Link: http://lkml.kernel.org/r/20120804192114.GA28347@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1d17d174
  9. 01 Aug, 2012 5 commits
    • M
      mm: allow PF_MEMALLOC from softirq context · 907aed48
      Mel Gorman authored
      This is needed to allow network softirq packet processing to make use of
      PF_MEMALLOC.
      
      Currently softirq context cannot use PF_MEMALLOC because it is not
      associated with a task and therefore has no task flags to fiddle with
      - thus the gfp-to-alloc-flags mapping ignores the task flags when in
      interrupt (hard or soft) context.
      
      Allowing softirqs to make use of PF_MEMALLOC therefore requires some
      trickery.  This patch borrows the task flags from whatever process happens
      to be preempted by the softirq.  It then modifies the gfp to alloc flags
      mapping to not exclude task flags in softirq context, and modify the
      softirq code to save, clear and restore the PF_MEMALLOC flag.
      
      The save and clear ensure the preempted task's PF_MEMALLOC flag
      doesn't leak into the softirq.  The restore ensures a softirq's
      PF_MEMALLOC flag cannot leak back into the preempted process.  This
      should be safe for the following reasons (the save/clear/restore is
      sketched after the list):
      
      1. Softirqs can run on multiple CPUs, sure, but the same task should
         not be executing the same softirq code.  Neither should the
         softirq handler be preempted by any other softirq handler, so the
         flags should not leak to an unrelated softirq.
      
      2. Softirqs re-enable hardware interrupts in __do_softirq(), so they
         can be preempted by hardware interrupts, and PF_MEMALLOC is then
         inherited by the hard IRQ.  However, this is similar to a process
         in reclaim being preempted by a hardirq.  While PF_MEMALLOC is
         set, gfp_to_alloc_flags() distinguishes between hard and soft irqs
         and avoids giving a hardirq the ALLOC_NO_WATERMARKS flag.
      
      3. If the softirq is deferred to ksoftirqd then its flags may be used
         instead of a normal task's, but as the softirq cannot be
         preempted, the PF_MEMALLOC flag does not leak to other code by
         accident.
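      
      A sketch of the save/clear/restore, close to the helper this patch
      adds (treat details as illustrative):
      
        /* include/linux/sched.h */
        static inline void tsk_restore_flags(struct task_struct *task,
                        unsigned long orig_flags, unsigned long flags)
        {
            task->flags &= ~flags;              /* drop the softirq's state */
            task->flags |= orig_flags & flags;  /* put back the task's bits */
        }
        
        /* kernel/softirq.c */
        asmlinkage void __do_softirq(void)
        {
            unsigned long old_flags = current->flags;
        
            /* borrow the preempted task's flags, minus PF_MEMALLOC */
            current->flags &= ~PF_MEMALLOC;
        
            /* ... process pending softirqs ... */
        
            tsk_restore_flags(current, old_flags, PF_MEMALLOC);
        }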
      
      [davem@davemloft.net: Document why PF_MEMALLOC is safe]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: David Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      907aed48
    • J
      mm/hotplug: correctly setup fallback zonelists when creating new pgdat · 9adb62a5
      Jiang Liu authored
      When hotadd_new_pgdat() is called to create new pgdat for a new node, a
      fallback zonelist should be created for the new node.  There's code to try
      to achieve that in hotadd_new_pgdat() as below:
      
      	/*
      	 * The node we allocated has no zone fallback lists. For avoiding
      	 * to access not-initialized zonelist, build here.
      	 */
      	mutex_lock(&zonelists_mutex);
      	build_all_zonelists(pgdat, NULL);
      	mutex_unlock(&zonelists_mutex);
      
      But it doesn't work as expected.  When hotadd_new_pgdat() is called, the
      new node is still in offline state because node_set_online(nid) hasn't
      been called yet.  And build_all_zonelists() only builds zonelists for
      online nodes as:
      
              for_each_online_node(nid) {
                      pg_data_t *pgdat = NODE_DATA(nid);
      
                      build_zonelists(pgdat);
                      build_zonelist_cache(pgdat);
              }
      
      So although we hope to create a zonelist for the new pgdat, it never
      gets built.  Fix this by adding a new parameter "pgdat" to
      build_all_zonelists() so that zonelists are built for the new pgdat
      too.
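      
      A hedged sketch of the reworked build: the new pgdat is passed down
      so its zonelists are built even though the node is not online yet:
      
        static int __build_all_zonelists(void *data)
        {
            pg_data_t *self = data;
            int nid;
        
            /* build for the new, still-offline pgdat explicitly */
            if (self && !node_online(self->node_id)) {
                build_zonelists(self);
                build_zonelist_cache(self);
            }
        
            for_each_online_node(nid) {
                pg_data_t *pgdat = NODE_DATA(nid);
        
                build_zonelists(pgdat);
                build_zonelist_cache(pgdat);
            }
            ...
        }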
      Signed-off-by: Jiang Liu <liuj97@gmail.com>
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Keping Chen <chenkeping@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9adb62a5
    • A
      memcg: rename config variables · c255a458
      Andrew Morton authored
      Sanity:
      
      CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
      CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
      CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM
      
      [mhocko@suse.cz: fix missed bits]
      Cc: Glauber Costa <glommer@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c255a458
    • W
      mm: prepare for removal of obsolete /proc/sys/vm/nr_pdflush_threads · 3965c9ae
      Wanpeng Li authored
      Since per-BDI flusher threads were introduced in 2.6, the pdflush
      mechanism is not used any more.  But the old interface exported
      through /proc/sys/vm/nr_pdflush_threads still exists and is obviously
      useless.
      
      For backwards compatibility, print a warning and return 2 to notify
      users that the interface is removed.
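      
      A hedged sketch of such a compatibility handler (modeled on what the
      bdi code ended up with; names and details illustrative):
      
        int pdflush_proc_obsolete(struct ctl_table *table, int write,
                    void __user *buffer, size_t *lenp, loff_t *ppos)
        {
            char kbuf[] = "0\n";
        
            if (*ppos) {
                *lenp = 0;
                return 0;
            }
        
            if (copy_to_user(buffer, kbuf, sizeof(kbuf)))
                return -EFAULT;
            printk_once(KERN_WARNING "%s exported in /proc is scheduled "
                        "for removal\n", table->procname);
        
            *lenp = 2;
            *ppos += *lenp;
            return 2;
        }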
      Signed-off-by: Wanpeng Li <liwp@linux.vnet.ibm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3965c9ae
    • H
      mm: account the total_vm in the vm_stat_account() · 44de9d0c
      Huang Shijie authored
      vm_stat_account() accounts for shared_vm, stack_vm and reserved_vm
      now.  But we can also account for total_vm in vm_stat_account(),
      which makes the code tidier.
      
      Even for mprotect_fixup(), we get the right result in the end.
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44de9d0c
  10. 31 Jul, 2012 18 commits
    • J
      time: Remove all direct references to timekeeper · 4e250fdd
      John Stultz authored
      Ingo noted that the numerous timekeeper.value references made
      the timekeeping code ugly and caused many long lines that
      had to be broken up. He recommended replacing timekeeper.value
      references with tk->value.
      
      This patch provides a local tk pointer in all top-level time
      functions and sets it to &timekeeper.  All timekeeper access is then
      done via the tk pointer.
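      
      The pattern, roughly (locking elided; a sketch, not the verbatim
      patch):
      
        void getnstimeofday(struct timespec *ts)
        {
            struct timekeeper *tk = &timekeeper;    /* one local handle */
        
            /* every former timekeeper.field access becomes tk->field */
            ts->tv_sec = tk->xtime_sec;
            ts->tv_nsec = timekeeping_get_ns(tk);
        }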
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Link: http://lkml.kernel.org/r/1343414893-45779-6-git-send-email-john.stultz@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4e250fdd
    • J
      time: Clean up offs_real/wall_to_mono and offs_boot/total_sleep_time updates · 6d0ef903
      John Stultz authored
      For performance reasons, we maintain ktime_t based duplicates of
      wall_to_monotonic (offs_real) and total_sleep_time (offs_boot).
      
      Since large problems could occur (such as the resume regression
      on 3.5-rc7, or the leapsecond hrtimer issue) if these value
      pairs were to be inconsistently updated, this patch cleans up
      how we modify these value pairs to ensure we are always
      consistent.
      
      As a side effect this is also more efficient, as we only
      calculate the duplicate values when they are changed,
      rather than on every update_wall_time() call.
      
      This also provides WARN_ONs to detect if future changes break
      the invariants.
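      
      One such pairing helper, sketched (hedged; the point is that
      wall_to_monotonic and offs_real only ever change together, with a
      WARN_ON guarding the invariant):
      
        static void tk_set_wall_to_mono(struct timekeeper *tk, struct timespec wtm)
        {
            struct timespec tmp;
        
            /* the ktime_t duplicate must already match its source */
            set_normalized_timespec(&tmp, -tk->wall_to_monotonic.tv_sec,
                        -tk->wall_to_monotonic.tv_nsec);
            WARN_ON_ONCE(tk->offs_real.tv64 != timespec_to_ktime(tmp).tv64);
        
            tk->wall_to_monotonic = wtm;
            set_normalized_timespec(&tmp, -wtm.tv_sec, -wtm.tv_nsec);
            tk->offs_real = timespec_to_ktime(tmp);    /* updated together */
        }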
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Link: http://lkml.kernel.org/r/1343414893-45779-5-git-send-email-john.stultz@linaro.org
      [ Cleaned up minor style issues. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6d0ef903
    • J
      time: Clean up stray newlines · d4e3ab38
      John Stultz authored
      Ingo noted inconsistent newline usage between functions.
      This patch cleans those up.
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Link: http://lkml.kernel.org/r/1343414893-45779-4-git-send-email-john.stultz@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d4e3ab38
    • J
      time/jiffies: Rename ACTHZ to SHIFTED_HZ · 02ab20ae
      John Stultz authored
      Ingo noted that ACTHZ is a confusing name, and requested it
      be renamed, so this patch renames ACTHZ to SHIFTED_HZ to
      better describe it.
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Link: http://lkml.kernel.org/r/1343414893-45779-3-git-send-email-john.stultz@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      02ab20ae
    • A
      perf/trace: Add ability to set a target task for events · e6dab5ff
      Andrew Vagin authored
      A few events are interesting not only for the current task.  For
      example, sched_stat_* events are interesting for the task which wakes
      up.  For this reason, it is useful if such events can be delivered to
      a target task too.
      
      Now a target task can be set by using __perf_task().
      
      The original idea and a draft patch belongs to Peter Zijlstra.
      
      I need these events for profiling sleep times.  sched_switch is used
      for getting callchains and sched_stat_* is used for getting time
      periods.  These events are combined in user space and can then be
      analyzed by the perf tools.
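      
      Usage is then a one-liner in a tracepoint's perf assignment, roughly
      (context hedged, a sketch only):
      
        TP_perf_assign(
            __perf_count(delay);
            __perf_task(tsk);    /* also deliver this event to tsk */
        ),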
      Inspired-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arun Sharma <asharma@fb.com>
      Signed-off-by: Andrew Vagin <avagin@openvz.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1342016098-213063-1-git-send-email-avagin@openvz.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e6dab5ff
    • M
      sched/cleanups: Add load balance cpumask pointer to 'struct lb_env' · b9403130
      Michael Wang authored
      With this patch, struct lb_env has a pointer to the load-balancing
      cpumask, so we don't need to pass a cpumask around anymore.
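      
      Roughly (abridged; treat as a sketch):
      
        struct lb_env {
            struct sched_domain    *sd;
            struct rq              *src_rq;
            int                    src_cpu;
            int                    dst_cpu;
            struct rq              *dst_rq;
            /* ... */
            struct cpumask         *cpus;    /* the load-balancing cpumask */
        };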
      Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/4FFE8665.3080705@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b9403130
    • A
      kernel/debug: Make use of KGDB_REASON_NMI · b10d22d6
      Anton Vorontsov authored
      Currently the kernel never sets KGDB_REASON_NMI.  We do now, when we
      enter KGDB/KDB from an NMI.
      
      This is not to be confused with kgdb_nmicallback(): the NMI callback
      is the entry for the slave CPUs during the CPU roundup, whereas
      REASON_NMI is the entry for the master CPU.
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
      b10d22d6
    • J
      kdb: Remove cpu from the more prompt · 07cd27bb
      Jason Wessel authored
      Having the CPU in the more prompt is completely redundant vs. the
      standard kdb prompt, and it also wastes 32 bytes on the stack.
      Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
      07cd27bb
    • J
      kdb: Remove unused KDB_FLAG_ONLY_DO_DUMP · 0f26d0e0
      Jason Wessel authored
      This code cleanup was missed in the original kdb merge, and this code
      is simply not used at all.  The code that was previously used to set
      the KDB_FLAG_ONLY_DO_DUMP was removed prior to the initial kdb merge.
      Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
      0f26d0e0
    • O
      resource: make sure requested range is included in the root range · 65fed8f6
      Octavian Purdila authored
      When the requested range is outside of the root range, the logic in
      __reserve_region_with_split() will cause an infinite recursion which
      will overflow the stack, as seen in the warning below.
      
      This particular stack overflow was caused by requesting the
      (100000000-107ffffff) range while the root range was (0-ffffffff).  In
      this case __request_resource() would return the whole root range as
      the conflict range (i.e. 0-ffffffff).  Then, the logic in
      __reserve_region_with_split() would continue the recursion, requesting
      the new range as (conflict->end+1, end), which incidentally in this
      case equals the originally requested range.
      
      This patch aborts looking for a usable range when the request does not
      intersect with the root range.  When the request partially overlaps
      with the root range, it adjusts the request to fall within the root
      range and then continues with the new request.
      
      When the request is modified or aborted, errors and a stack trace are
      logged to allow catching the errors in the upper layers.
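      
      The clamp/abort logic, roughly (a sketch, not the verbatim patch):
      
            if (end < root->start || start > root->end)
                return;        /* no intersection: log an error and abort */
        
            /* partial overlap: pull the request into the root range */
            if (start < root->start)
                start = root->start;
            if (end > root->end)
                end = root->end;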
      
      [    5.968374] WARNING: at kernel/sched.c:4129 sub_preempt_count+0x63/0x89()
      [    5.975150] Modules linked in:
      [    5.978184] Pid: 1, comm: swapper Not tainted 3.0.22-mid27-00004-gb72c817 #46
      [    5.985324] Call Trace:
      [    5.987759]  [<c1039dfc>] ? console_unlock+0x17b/0x18d
      [    5.992891]  [<c1039620>] warn_slowpath_common+0x48/0x5d
      [    5.998194]  [<c1031758>] ? sub_preempt_count+0x63/0x89
      [    6.003412]  [<c1039644>] warn_slowpath_null+0xf/0x13
      [    6.008453]  [<c1031758>] sub_preempt_count+0x63/0x89
      [    6.013499]  [<c14d60c4>] _raw_spin_unlock+0x27/0x3f
      [    6.018453]  [<c10c6349>] add_partial+0x36/0x3b
      [    6.022973]  [<c10c7c0a>] deactivate_slab+0x96/0xb4
      [    6.027842]  [<c14cf9d9>] __slab_alloc.isra.54.constprop.63+0x204/0x241
      [    6.034456]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
      [    6.039842]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
      [    6.045232]  [<c10c7dc9>] kmem_cache_alloc_trace+0x51/0xb0
      [    6.050710]  [<c103f78f>] ? kzalloc.constprop.5+0x29/0x38
      [    6.056100]  [<c103f78f>] kzalloc.constprop.5+0x29/0x38
      [    6.061320]  [<c17b45e9>] __reserve_region_with_split+0x1c/0xd1
      [    6.067230]  [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
      ...
      [    7.179057]  [<c17b4693>] __reserve_region_with_split+0xc6/0xd1
      [    7.184970]  [<c17b4779>] reserve_region_with_split+0x30/0x42
      [    7.190709]  [<c17a8ebf>] e820_reserve_resources_late+0xd1/0xe9
      [    7.196623]  [<c17c9526>] pcibios_resource_survey+0x23/0x2a
      [    7.202184]  [<c17cad8a>] pcibios_init+0x23/0x35
      [    7.206789]  [<c17ca574>] pci_subsys_init+0x3f/0x44
      [    7.211659]  [<c1002088>] do_one_initcall+0x72/0x122
      [    7.216615]  [<c17ca535>] ? pci_legacy_init+0x3d/0x3d
      [    7.221659]  [<c17a27ff>] kernel_init+0xa6/0x118
      [    7.226265]  [<c17a2759>] ? start_kernel+0x334/0x334
      [    7.231223]  [<c14d7482>] kernel_thread_helper+0x6/0x10
      Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65fed8f6
    • A
      taskstats: check nla_reserve() return · 25353b33
      Alan Cox authored
      Addresses https://bugzilla.kernel.org/show_bug.cgi?id=44621
      
      Reported-by: <rucsoftsec@gmail.com>
      Signed-off-by: Alan Cox <alan@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25353b33
    • S
      sysctl: suppress kmemleak messages · fd4b616b
      Steven Rostedt authored
      register_sysctl_table() is a strange function, as it makes internal
      allocations (a header) to register a sysctl_table.  This header is a
      handle to the table that is created, and can be used to unregister the
      table.  But if the table is permanent and never unregistered, the header
      acts the same as a static variable.
      
      Unfortunately, this allocation of memory that is never expected to be
      freed fools kmemleak into thinking that we have leaked memory.  For
      those sysctl tables that are never unregistered and have no pointer
      referencing them, kmemleak will think that these are memory leaks:
      
      unreferenced object 0xffff880079fb9d40 (size 192):
        comm "swapper/0", pid 0, jiffies 4294667316 (age 12614.152s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff8146b590>] kmemleak_alloc+0x73/0x98
          [<ffffffff8110a935>] kmemleak_alloc_recursive.constprop.42+0x16/0x18
          [<ffffffff8110b852>] __kmalloc+0x107/0x153
          [<ffffffff8116fa72>] kzalloc.constprop.8+0xe/0x10
          [<ffffffff811703c9>] __register_sysctl_paths+0xe1/0x160
          [<ffffffff81170463>] register_sysctl_paths+0x1b/0x1d
          [<ffffffff8117047d>] register_sysctl_table+0x18/0x1a
          [<ffffffff81afb0a1>] sysctl_init+0x10/0x14
          [<ffffffff81b05a6f>] proc_sys_init+0x2f/0x31
          [<ffffffff81b0584c>] proc_root_init+0xa5/0xa7
          [<ffffffff81ae5b7e>] start_kernel+0x3d0/0x40a
          [<ffffffff81ae52a7>] x86_64_start_reservations+0xae/0xb2
          [<ffffffff81ae53ad>] x86_64_start_kernel+0x102/0x111
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      The sysctl_base_table used by sysctl itself is one such instance that
      registers the table to never be unregistered.
      
      Use kmemleak_not_leak() to suppress the kmemleak false positive.
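      
      Sketched at the registration site (hedged):
      
        /* kernel/sysctl.c */
        int __init sysctl_init(void)
        {
            struct ctl_table_header *hdr;
        
            hdr = register_sysctl_table(sysctl_base_table);
            kmemleak_not_leak(hdr);    /* permanent table; header never freed */
            return 0;
        }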
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fd4b616b
    • V
      kdump: append newline to the last line of vmcoreinfo note · 63dca8d5
      Vivek Goyal authored
      The last line of the vmcoreinfo note does not end with \n.  Parsing
      all the lines in the note becomes easier if all lines end with \n
      instead of trying to special-case the last line.
      
      I know at least one tool, vmcore-dmesg in kexec-tools tree which made the
      assumption that all lines end with \n.  I think it is a good idea to fix
      it.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63dca8d5
    • A
      fork: fix error handling in dup_task() · f19b9f74
      Akinobu Mita authored
      The function dup_task() may fail at the following function calls in the
      following order.
      
      0) alloc_task_struct_node()
      1) alloc_thread_info_node()
      2) arch_dup_task_struct()
      
      An error at 0) doesn't matter; dup_task() can just return.  But an
      error at 1) requires releasing the task_struct allocated by 0) before
      returning.  Likewise, an error at 2) requires releasing the
      task_struct and thread_info allocated by 0) and 1).
      
      The existing error handling calls free_task_struct() and
      free_thread_info(), which not only release the task_struct and
      thread_info but also call the architecture-specific
      arch_release_task_struct() and arch_release_thread_info().
      
      The problem is that task_struct and thread_info are not fully initialized
      yet at this point, but arch_release_task_struct() and
      arch_release_thread_info() are called with them.
      
      For example, x86 defines its own arch_release_task_struct() that releases
      a task_xstate.  If alloc_thread_info_node() fails in dup_task(),
      arch_release_task_struct() is called with task_struct which is just
      allocated and filled with garbage in this error handling.
      
      This actually happened with tools/testing/fault-injection/failcmd.sh
      
      	# env FAILCMD_TYPE=fail_page_alloc \
      		./tools/testing/fault-injection/failcmd.sh --times=100 \
      		--min-order=0 --ignore-gfp-wait=0 \
      		-- make -C tools/testing/selftests/ run_tests
      
      In order to fix this issue, make free_{task_struct,thread_info}()
      not call arch_release_{task_struct,thread_info}(), and instead call
      arch_release_{task_struct,thread_info}() explicitly where needed
      (the resulting error path is sketched after the list below).
      
      arch_release_task_struct() and arch_release_thread_info() are
      defined as empty by default, so this change only affects the
      architectures which implement their own versions, as listed below.
      
      arch_release_task_struct(): x86, sh
      arch_release_thread_info(): mn10300, tile
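      
      The resulting error path, sketched (details hedged):
      
        static struct task_struct *dup_task_struct(struct task_struct *orig)
        {
            struct task_struct *tsk;
            struct thread_info *ti;
            int node = tsk_fork_get_node(orig);
            int err;
        
            tsk = alloc_task_struct_node(node);
            if (!tsk)
                return NULL;        /* 0) failed: nothing to undo */
        
            ti = alloc_thread_info_node(tsk, node);
            if (!ti)
                goto free_tsk;      /* 1) failed: undo 0) only */
        
            err = arch_dup_task_struct(tsk, orig);
            if (err)
                goto free_ti;       /* 2) failed: undo 1), then 0) */
            ...
        
        free_ti:
            free_thread_info(ti);   /* no arch_release_* called here now */
        free_tsk:
            free_task_struct(tsk);
            return NULL;
        }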
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Salman Qazi <sqazi@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f19b9f74
    • A
      revert "sched: Fix fork() error path to not crash" · 87bec58a
      Andrew Morton authored
      To make way for "fork: fix error handling in dup_task()", which fixes the
      errors more completely.
      
      Cc: Salman Qazi <sqazi@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      87bec58a
    • H
      fork: use vma_pages() to simplify the code · b2412b7f
      Huang Shijie authored
      The current code can be replaced by vma_pages().  So use it to simplify
      the code.
      
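      For reference, the helper is just:
      
        /* include/linux/mm.h */
        static inline unsigned long vma_pages(struct vm_area_struct *vma)
        {
            return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
        }
      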
      [akpm@linux-foundation.org: initialise `len' at its definition site]
      Signed-off-by: Huang Shijie <shijie8@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2412b7f
    • T
      kmod: avoid deadlock from recursive kmod call · 0f20784d
      Tetsuo Handa authored
      The system deadlocks (at least since 2.6.10) when a
      call_usermodehelper(UMH_WAIT_EXEC) request triggers a
      call_usermodehelper(UMH_WAIT_PROC) request.
      
      This is because "khelper thread is waiting for the worker thread at
      wait_for_completion() in do_fork() since the worker thread was created
      with CLONE_VFORK flag" and "the worker thread cannot call complete()
      because do_execve() is blocked at UMH_WAIT_PROC request" and "the khelper
      thread cannot start processing UMH_WAIT_PROC request because the khelper
      thread is waiting for the worker thread at wait_for_completion() in
      do_fork()".
      
      The easiest example to observe this deadlock is to use a corrupted
      /sbin/hotplug binary (like shown below).
      
        # : > /tmp/dummy
        # chmod 755 /tmp/dummy
        # echo /tmp/dummy > /proc/sys/kernel/hotplug
        # modprobe whatever
      
      call_usermodehelper("/tmp/dummy", UMH_WAIT_EXEC) is called from
      kobject_uevent_env() in lib/kobject_uevent.c upon loading/unloading a
      module.  do_execve("/tmp/dummy") triggers a call to
      request_module("binfmt-0000") from search_binary_handler() which in turn
      calls call_usermodehelper(UMH_WAIT_PROC).
      
      In order to avoid the deadlock, as a for-now and easy-to-backport
      solution, do not call wait_for_completion() in
      call_usermodehelper_exec() if the worker thread was created by the
      khelper thread with the CLONE_VFORK flag.  A future, fundamental
      solution might be to replace the singleton khelper thread with a
      workqueue, so that recursive calls up to a max_active-deep dependency
      chain can be handled without deadlock.
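      
      A sketch of the for-now guard, pieced together from the description
      (details hedged).  The worker vforked by the singleton khelper thread
      records itself, and then refuses a nested wait:
      
        /* kernel/kmod.c -- private to the singleton khelper thread */
        static const struct task_struct *kmod_thread_locker;
        
        static int call_helper(void *data)
        {
            /* this worker is currently blocking the khelper thread */
            kmod_thread_locker = current;
            return ____call_usermodehelper(data);
        }
        
        /* in __call_usermodehelper(), for the CLONE_VFORK case: */
            pid = kernel_thread(call_helper, sub_info, CLONE_VFORK | SIGCHLD);
            kmod_thread_locker = NULL;    /* vfork done, khelper unblocked */
        
        /* in call_usermodehelper_exec(): refuse the recursive wait */
            if (wait != UMH_NO_WAIT && current == kmod_thread_locker) {
                retval = -EBUSY;
                goto out;
            }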
      
      [akpm@linux-foundation.org: add comment to kmod_thread_locker]
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f20784d
    • A
      kernel/kmod.c: document call_usermodehelper_fns() a bit · 79c743dd
      Andrew Morton authored
      This function's interface is, uh, subtle.  Attempt to apologise for it.
      
      Cc: WANG Cong <xiyou.wangcong@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79c743dd