1. 24 Sep 2014, 5 commits
  2. 21 Sep 2014, 1 commit
  3. 19 Sep 2014, 18 commits
  4. 09 Sep 2014, 2 commits
  5. 08 Sep 2014, 3 commits
    • sched, time: Atomically increment stime & utime · eb1b4af0
      Authored by Rik van Riel
      The functions task_cputime_adjusted() and thread_group_cputime_adjusted()
      can be called locklessly, as well as concurrently on many different CPUs.
      
      This can occasionally lead to the utime and stime reported by times(), and
      other syscalls like it, going backward. The cause appears to be multiple
      threads racing in cputime_adjust(), each with a utime or stime value that
      is larger than the original, but different from the others'.
      
      Sometimes the larger value gets saved first, only to be immediately
      overwritten with a smaller value by another thread.
      
      Using atomic exchange prevents that problem, and ensures time
      progresses monotonically.
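      
      As a rough sketch of the idea (the helper name is illustrative and
      unsigned long is used for simplicity; the actual code operates on
      cputime_t and may use different primitives), the stored value is only
      ever advanced with a compare-and-exchange loop, never overwritten with
      a smaller one:
      
        #include <linux/compiler.h>     /* ACCESS_ONCE() */
        #include <linux/atomic.h>       /* cmpxchg() */
        
        /*
         * Sketch only: advance *counter to 'new' iff 'new' is larger.
         * Racing updaters retry until either their value is published or
         * a larger one already is, so reported time never goes backward.
         */
        static void cputime_advance(unsigned long *counter, unsigned long new)
        {
                unsigned long old;
        
                while (new > (old = ACCESS_ONCE(*counter)))
                        cmpxchg(counter, old, new);
        }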
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: umgwanakikbuti@gmail.com
      Cc: fweisbec@gmail.com
      Cc: akpm@linux-foundation.org
      Cc: srao@redhat.com
      Cc: lwoodman@redhat.com
      Cc: atheurer@redhat.com
      Link: http://lkml.kernel.org/r/1408133138-22048-4-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • time, signal: Protect resource use statistics with seqlock · e78c3496
      Authored by Rik van Riel
      Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
      issues on large systems, due to both functions being serialized with a
      lock.
      
      The lock protects against reporting a wrong value while a thread in the
      task group exits and its statistics are folded into the signal struct;
      without it, that exited task's statistics could be counted twice (or not
      at all).
      
      Protecting that with a lock results in times() and clock_gettime() being
      completely serialized on large systems.
      
      This can be fixed by using a seqlock around the events that gather and
      propagate statistics. As an additional benefit, the protection code can
      be moved into thread_group_cputime(), slightly simplifying the calling
      functions.
      
      In the case of posix_cpu_clock_get_task() things can be simplified a
      lot, because the calling function already ensures that the task sticks
      around, and the rest is now taken care of in thread_group_cputime().
      
      This way the statistics reporting code can run lockless.
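      
      As a rough illustration of the seqlock pattern (struct and function
      names here are made up; the patch embeds the seqlock in signal_struct),
      writers take the lock around the statistics update while readers retry
      a lockless read until they see a consistent snapshot:
      
        #include <linux/seqlock.h>
        #include <linux/types.h>
        
        /* Illustrative aggregate protected by a seqlock. */
        struct group_stats {
                seqlock_t lock;
                u64 utime;
                u64 stime;
        };
        
        /* Writer side, e.g. a thread folding its times into the group. */
        static void group_stats_add(struct group_stats *s, u64 ut, u64 st)
        {
                write_seqlock(&s->lock);
                s->utime += ut;
                s->stime += st;
                write_sequnlock(&s->lock);
        }
        
        /* Reader side: no lock taken; retry if a writer raced with us. */
        static void group_stats_read(struct group_stats *s, u64 *ut, u64 *st)
        {
                unsigned int seq;
        
                do {
                        seq = read_seqbegin(&s->lock);
                        *ut = s->utime;
                        *st = s->stime;
                } while (read_seqretry(&s->lock, seq));
        }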
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Daeseok Youn <daeseok.youn@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guillaume Morin <guillaume@morinfr.org>
      Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: umgwanakikbuti@gmail.com
      Cc: fweisbec@gmail.com
      Cc: srao@redhat.com
      Cc: lwoodman@redhat.com
      Cc: atheurer@redhat.com
      Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • exit: Always reap resource stats in __exit_signal() · 90ed9cbe
      Authored by Rik van Riel
      Oleg pointed out that wait_task_zombie() adds a task's usage statistics
      to the parent's signal struct, but the task's own signal struct should
      also propagate the statistics at exit time.
      
      This allows thread_group_cputime(reaped_zombie) to get the statistics
      after __unhash_process() has made the task invisible to for_each_thread,
      but before the thread has actually been RCU-freed, making sure no
      non-monotonic results are returned inside that window.
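      
      A minimal sketch of the shape of the change (the helper name is
      hypothetical and this is not the literal diff; the fields follow
      signal_struct): __exit_signal() always folds the exiting task's
      counters into its own signal struct, so later readers still see them:
      
        #include <linux/sched.h>
        
        /* Sketch only: accumulate the exiting task's counters into the
         * signal struct, under the locking __exit_signal() already holds. */
        static void fold_task_stats(struct signal_struct *sig,
                                    struct task_struct *tsk)
        {
                cputime_t utime, stime;
        
                task_cputime(tsk, &utime, &stime);
                sig->utime += utime;
                sig->stime += stime;
                sig->nvcsw += tsk->nvcsw;
                sig->nivcsw += tsk->nivcsw;
        }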
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Guillaume Morin <guillaume@morinfr.org>
      Cc: Ionut Alexa <ionut.m.alexa@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Michal Schmidt <mschmidt@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: umgwanakikbuti@gmail.com
      Cc: fweisbec@gmail.com
      Cc: srao@redhat.com
      Cc: lwoodman@redhat.com
      Cc: atheurer@redhat.com
      Link: http://lkml.kernel.org/r/1408133138-22048-2-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  6. 07 Sep 2014, 1 commit
    • sched/deadline: Fix a precision problem in the microseconds range · 177ef2a6
      Authored by xiaofeng.yan
      An overrun can happen in start_hrtick_dl() when a task with
      SCHED_DEADLINE runs in the microseconds range.
      
      For example, if a task with SCHED_DEADLINE has the following parameters:
      
        Task  runtime  deadline  period
         P1   200us     500us    500us
      
      The deadline and period from task P1 are less than 1ms.
      
      In order to achieve microsecond precision, we need to enable the HRTICK
      feature with the following commands:
      
        PC#echo "HRTICK" > /sys/kernel/debug/sched_features
        PC#trace-cmd record -e sched_switch &
        PC#./schedtool -E -t 200000:500000:500000 -e ./test
      
      The test binary here runs in an endless while(1) loop.
      Some excerpts from trace.dat are as follows:
      
        <idle>-0   157.603157: sched_switch: :R ==> 2481:4294967295: test
        test-2481  157.603203: sched_switch:  2481:R ==> 0:120: swapper/2
        <idle>-0   157.605657: sched_switch:  :R ==> 2481:4294967295: test
        test-2481  157.608183: sched_switch:  2481:R ==> 2483:120: trace-cmd
        trace-cmd-2483 157.609656: sched_switch:2483:R==>2481:4294967295: test
      
      We can get the runtime of P1 from the information above:
      
        runtime = 157.608183 - 157.605657
        runtime = 0.002526(2.526ms)
      
      The correct runtime should be less than or equal to 200us at some point.
      
      The problem is caused by the conditional check "delta > 10000"
      in start_hrtick_dl().
      
      No hrtimer is started to control the remaining runtime when that
      remaining runtime is less than 10us, so the process keeps running
      until the next tick period arrives.
      
      Move the code that enforces the minimum time slice from
      hrtick_start_fair() into hrtick_start(), because the EDF scheduling
      class also needs this function in start_hrtick_dl().
      
      To fix this problem, we call hrtimer_start() unconditionally in
      start_hrtick_dl(), and make sure the scheduling slice won't be smaller
      than 10us in hrtimer_start().
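      
      A sketch of the shape of the fix (simplified, not the literal patch; it
      assumes the kernel/sched context where struct rq is visible): the 10us
      floor lives in the common hrtick path, so start_hrtick_dl() can request
      the timer unconditionally:
      
        #include <linux/hrtimer.h>
        #include <linux/ktime.h>
        #include <linux/kernel.h>       /* max_t() */
        
        /* Sketch only: enforce a 10us minimum slice for every caller. */
        void hrtick_start(struct rq *rq, u64 delay)
        {
                struct hrtimer *timer = &rq->hrtick_timer;
        
                delay = max_t(u64, delay, 10000ULL);
                hrtimer_start(timer, ns_to_ktime(delay), HRTIMER_MODE_REL_PINNED);
        }
        
        /* Deadline class: no more "delta > 10000" check at the call site. */
        static void start_hrtick_dl(struct rq *rq, struct task_struct *p)
        {
                hrtick_start(rq, p->dl.runtime);
        }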
      Signed-off-by: Xiaofeng Yan <xiaofeng.yan@huawei.com>
      Reviewed-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Juri Lelli <juri.lelli@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1409022941-5880-1-git-send-email-xiaofeng.yan@huawei.com
      [ Massaged the changelog and the code. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 06 Sep 2014, 2 commits
  8. 05 Sep 2014, 2 commits
    • sched/fair: Replace rcu_assign_pointer() with RCU_INIT_POINTER() · 35b123e2
      Authored by Andreea-Cristina Bernat
      Here rcu_assign_pointer() is used only to NULL out the pointer.
      According to RCU_INIT_POINTER()'s block comment:
      
        "1.   This use of RCU_INIT_POINTER() is NULLing out the pointer"
      
      it is better to use RCU_INIT_POINTER() instead of rcu_assign_pointer(),
      because it has a smaller overhead.
      
      The following Coccinelle semantic patch was used:
       @@
       @@
      
       - rcu_assign_pointer
       + RCU_INIT_POINTER
         (..., NULL)
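      
      For illustration, the transformation on a hypothetical RCU-protected
      field looks like this (the struct and field are made up; only the two
      macros are real kernel APIs):
      
        #include <linux/rcupdate.h>
        
        struct item;                            /* opaque payload */
        
        struct holder {
                struct item __rcu *cur;         /* hypothetical RCU pointer */
        };
        
        /* Before: rcu_assign_pointer() emits a publish barrier that buys
         * nothing when the value being stored is NULL. */
        static void clear_before(struct holder *h)
        {
                rcu_assign_pointer(h->cur, NULL);
        }
        
        /* After: RCU_INIT_POINTER() is a plain store, which is sufficient
         * and cheaper for NULLing out the pointer. */
        static void clear_after(struct holder *h)
        {
                RCU_INIT_POINTER(h->cur, NULL);
        }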
      Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: paulmck@linux.vnet.ibm.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20140822145043.GA580@ada
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • nohz: Restore NMI safe local irq work for local nohz kick · 40bea039
      Authored by Frederic Weisbecker
      The local nohz kick is currently used by perf, which needs it to be
      NMI-safe. A recent commit (7d1311b9), though, changed its implementation
      to fire the local kick using the remote kick API. That was convenient
      for making the code more generic, but the remote kick isn't NMI-safe.
      
      As a result:
      
      	WARNING: CPU: 3 PID: 18062 at kernel/irq_work.c:72 irq_work_queue_on+0x11e/0x140()
      	CPU: 3 PID: 18062 Comm: trinity-subchil Not tainted 3.16.0+ #34
      	0000000000000009 00000000903774d1 ffff880244e06c00 ffffffff9a7f1e37
      	0000000000000000 ffff880244e06c38 ffffffff9a0791dd ffff880244fce180
      	0000000000000003 ffff880244e06d58 ffff880244e06ef8 0000000000000000
      	Call Trace:
      	<NMI>  [<ffffffff9a7f1e37>] dump_stack+0x4e/0x7a
      	[<ffffffff9a0791dd>] warn_slowpath_common+0x7d/0xa0
      	[<ffffffff9a07930a>] warn_slowpath_null+0x1a/0x20
      	[<ffffffff9a17ca1e>] irq_work_queue_on+0x11e/0x140
      	[<ffffffff9a10a2c7>] tick_nohz_full_kick_cpu+0x57/0x90
      	[<ffffffff9a186cd5>] __perf_event_overflow+0x275/0x350
      	[<ffffffff9a184f80>] ? perf_event_task_disable+0xa0/0xa0
      	[<ffffffff9a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150
      	[<ffffffff9a187934>] perf_event_overflow+0x14/0x20
      	[<ffffffff9a020386>] intel_pmu_handle_irq+0x206/0x410
      	[<ffffffff9a0b54d3>] ? arch_vtime_task_switch+0x63/0x130
      	[<ffffffff9a01937b>] perf_event_nmi_handler+0x2b/0x50
      	[<ffffffff9a007b72>] nmi_handle+0xd2/0x390
      	[<ffffffff9a007aa5>] ? nmi_handle+0x5/0x390
      	[<ffffffff9a0d131b>] ? lock_release+0xab/0x330
      	[<ffffffff9a008062>] default_do_nmi+0x72/0x1c0
      	[<ffffffff9a0c925f>] ? cpuacct_account_field+0xcf/0x200
      	[<ffffffff9a008268>] do_nmi+0xb8/0x100
      
      Let's fix this by restoring the use of local irq work for the nohz
      local kick.
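      
      A sketch of the restored approach (the per-CPU work, handler and
      function names are illustrative; only the irq_work and nohz APIs are
      real): queue the kick on a per-CPU irq_work via irq_work_queue(),
      which is NMI-safe, instead of going through irq_work_queue_on():
      
        #include <linux/irq_work.h>
        #include <linux/percpu.h>
        #include <linux/smp.h>
        #include <linux/tick.h>
        
        /* Runs from the irq_work interrupt and re-evaluates the local tick. */
        static void nohz_full_kick_work_func(struct irq_work *work)
        {
                /* re-arm / re-evaluate the tick for this CPU here */
        }
        
        static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) = {
                .func = nohz_full_kick_work_func,
        };
        
        /* NMI-safe local kick: irq_work_queue() only touches per-CPU state. */
        static void nohz_full_kick_local(void)
        {
                if (!tick_nohz_full_cpu(smp_processor_id()))
                        return;
        
                irq_work_queue(this_cpu_ptr(&nohz_full_kick_work));
        }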
      Reported-by: Catalin Iacob <iacobcatalin@gmail.com>
      Reported-and-tested-by: Dave Jones <davej@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
  9. 03 Sep 2014, 1 commit
  10. 30 Aug 2014, 2 commits
    • kexec: create a new config option CONFIG_KEXEC_FILE for new syscall · 74ca317c
      Authored by Vivek Goyal
      Currently the new system call kexec_file_load() and all the associated
      code compile if CONFIG_KEXEC=y.  But the new syscall also compiles the
      purgatory code, which currently uses the gcc option -mcmodel=large.
      This option seems to be available only from gcc 4.4 onwards.
      
      Hiding the new functionality behind a new config option will not break
      existing users of old gcc.  Those who wish to enable the new
      functionality will require a newer gcc.  Having said that, I am trying
      to figure out how I can move away from using -mcmodel=large, but that
      can take a while.
      
      I think there are other advantages to introducing this new config
      option.  As this option will be enabled only on x86_64, other arches
      don't have to compile generic kexec code that will never be used.  The
      new code selects CRYPTO=y and CRYPTO_SHA256=y, and all other arches had
      to do this for CONFIG_KEXEC.  With the introduction of the new config
      option, we can remove the crypto dependency from other arches.
      
      Now CONFIG_KEXEC_FILE is available only on x86_64, so wherever I had
      CONFIG_X86_64 defined, I got rid of that.
      
      For CONFIG_KEXEC_FILE, instead of doing "select CRYPTO=y", I changed it
      to "depends on CRYPTO=y".  This should be safer, as "select" is not
      recursive.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Tested-by: Shaun Ruffell <sruffell@digium.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • resource: fix the case of null pointer access · 800df627
      Authored by Vivek Goyal
      Richard and Daniel reported that UML is broken due to changes to the
      resource traversal functions.  The problem is that iomem_resource.child
      can be NULL and the new code does not consider that possibility.  The
      old code used a for loop, and that loop would not even execute if p was
      NULL.
      
      Revert to the for() loop logic and bail out if p is NULL.
      
      I also moved the sibling_only check out of resource_lock; there is no
      reason to keep it inside the lock.
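      
      A sketch of the shape of the fix (the function and match predicate are
      hypothetical; resource_lock, iomem_resource and next_resource() are the
      real symbols in kernel/resource.c, whose context this assumes): the
      for() loop condition handles a NULL iomem_resource.child, and the
      caller simply sees "not found":
      
        #include <linux/ioport.h>
        #include <linux/spinlock.h>
        #include <linux/types.h>
        
        /* Hypothetical traversal helper illustrating the null-safe loop. */
        static struct resource *find_matching_res(bool sibling_only,
                                                  bool (*match)(struct resource *))
        {
                struct resource *p;
        
                read_lock(&resource_lock);
                /* If iomem_resource.child is NULL the loop body never runs. */
                for (p = iomem_resource.child; p; p = next_resource(p, sibling_only)) {
                        if (match(p))
                                break;
                }
                read_unlock(&resource_lock);
        
                return p;       /* NULL when the tree is empty or nothing matched */
        }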
      
      The following is the backtrace of the UML crash.
      
      RIP: 0033:[<0000000060039b9f>]
      RSP: 0000000081459da0  EFLAGS: 00010202
      RAX: 0000000000000000 RBX: 00000000219b3fff RCX: 000000006010d1d9
      RDX: 0000000000000001 RSI: 00000000602dfb94 RDI: 0000000081459df8
      RBP: 0000000081459de0 R08: 00000000601b59f4 R09: ffffffff0000ff00
      R10: ffffffff0000ff00 R11: 0000000081459e88 R12: 0000000081459df8
      R13: 00000000219b3fff R14: 00000000602dfb94 R15: 0000000000000000
      Kernel panic - not syncing: Segfault with no mm
      CPU: 0 PID: 1 Comm: swapper Not tainted 3.16.0-10454-g58d08e3b #13
      Stack:
       00000000 000080d0 81459df0 219b3fff
       81459e70 6010d1d9 ffffffff 6033e010
       81459e50 6003a269 81459e30 00000000
      Call Trace:
       [<6010d1d9>] ? kclist_add_private+0x0/0xe7
       [<6003a269>] walk_system_ram_range+0x61/0xb7
       [<6000e859>] ? proc_kcore_init+0x0/0xf1
       [<6010d574>] kcore_update_ram+0x4c/0x168
       [<6010d72e>] ? kclist_add+0x0/0x2e
       [<6000e943>] proc_kcore_init+0xea/0xf1
       [<6000e859>] ? proc_kcore_init+0x0/0xf1
       [<6000e859>] ? proc_kcore_init+0x0/0xf1
       [<600189f0>] do_one_initcall+0x13c/0x204
       [<6004ca46>] ? parse_args+0x1df/0x2e0
       [<6004c82d>] ? parameq+0x0/0x3a
       [<601b5990>] ? strcpy+0x0/0x18
       [<60001e1a>] kernel_init_freeable+0x240/0x31e
       [<6026f1c0>] kernel_init+0x12/0x148
       [<60019fad>] new_thread_handler+0x81/0xa3
      
      Fixes: 8c86e70a ("resource: provide new functions to walk through resources")
      Reported-by: Daniel Walter <sahne@0x90.at>
      Tested-by: Richard Weinberger <richard@nod.at>
      Tested-by: Toralf Förster <toralf.foerster@gmx.de>
      Tested-by: Daniel Walter <sahne@0x90.at>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 28 Aug 2014, 1 commit
  12. 26 Aug 2014, 2 commits