1. 31 Oct 2017 (2 commits)
    • timekeeping: Consolidate timekeeping_inject_offset code · e0956dcc
      Arnd Bergmann authored
      The code to check the adjtimex() or clock_adjtime() arguments is spread
      out across multiple files for presumably only historic reasons. As a
      preparation for a rework to get rid of the use of 'struct timeval'
      and 'struct timespec' in there, this moves all the portions into
      kernel/time/timekeeping.c and marks them as 'static'.
      
      The warp_clock() function here is not as closely related as the others,
      but I feel it still makes sense to move it here in order to consolidate
      all callers of timekeeping_inject_offset().
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Stephen Boyd <stephen.boyd@linaro.org>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      [jstultz: Whitespace fixup]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      e0956dcc
    • rtc: Allow rtc drivers to specify the tv_nsec value for ntp · 0f295b06
      Jason Gunthorpe authored
      ntp is currently hardwired to try to set the RTC when the wall clock
      tv_nsec is 0.5 seconds. This historical behaviour works well with certain
      PC RTCs, but is not universal to all rtc hardware.
      
      Change how this works by introducing the driver-specific concept of
      set_offset_nsec, the delay between the current wall clock time and the
      target time to set (with a 0 tv_nsec).
      
      For an x86-style CMOS, set_offset_nsec should be -0.5 s, which causes the
      last second to be written 0.5 s after it has started.
      
      For compat with the old rtc_set_ntp_time, the value is defaulted to
      + 0.5 s, which causes the next second to be written 0.5s before it starts,
      as things were before this patch.
      
      Testing shows many non-x86 RTCs would like set_offset_nsec ~= 0, so
      ultimately each RTC driver should set set_offset_nsec according to its
      needs, and non-x86 architectures should stop using
      update_persistent_clock64 in order to access this feature.
      Future patches will revise the drivers as needed.
      
      Since CMOS and RTC now have very different handling, they are split into
      two dedicated code paths, sharing the support code, and ifdefs are
      replaced with IS_ENABLED (a small model of the set_offset_nsec idea
      follows this entry).
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Stephen Boyd <stephen.boyd@linaro.org>
      Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      0f295b06
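      A minimal userspace model of the set_offset_nsec idea described above
      (hypothetical helper names, not the kernel code): given the current wall
      time and the driver's offset, compute when the RTC should be written and
      which whole second to write.

        #include <stdio.h>
        #include <time.h>

        #define NSEC_PER_SEC 1000000000L

        /* +0.5 s mimics the legacy rtc_set_ntp_time behaviour; an x86 CMOS
         * style driver would use -0.5 s, and many other RTCs want roughly 0. */
        static long set_offset_nsec = NSEC_PER_SEC / 2;

        /* Return the whole second to program into the RTC and, via *wait_ns,
         * how long to wait so that the write happens exactly when
         * (now + set_offset_nsec) has tv_nsec == 0. */
        static time_t rtc_plan_write(struct timespec now, long *wait_ns)
        {
            long ns = now.tv_nsec + set_offset_nsec;
            time_t sec = now.tv_sec;

            while (ns >= NSEC_PER_SEC) { ns -= NSEC_PER_SEC; sec++; }
            while (ns < 0)             { ns += NSEC_PER_SEC; sec--; }

            *wait_ns = (NSEC_PER_SEC - ns) % NSEC_PER_SEC;
            return sec + (*wait_ns ? 1 : 0);
        }

        int main(void)
        {
            struct timespec now;
            long wait_ns;

            clock_gettime(CLOCK_REALTIME, &now);
            time_t target = rtc_plan_write(now, &wait_ns);
            printf("write second %ld to the RTC in %ld ns\n",
                   (long)target, wait_ns);
            return 0;
        }

      With the legacy +0.5 s default this reports the next second, to be
      written 0.5 s before it starts; with -0.5 s it reports the current
      second, to be written 0.5 s after it started.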
  2. 30 Sep 2017 (1 commit)
    • fix infoleak in waitid(2) · 6c85501f
      Al Viro authored
      kernel_waitid() can return a PID, an error, or 0.  rusage is filled in the
      first case, and the waitid(2) rusage should have been copied out exactly in
      that case, *not* whenever kernel_waitid() has not returned an error.  The
      compat variant shares that braino; none of the kernel_wait4() callers do,
      so the patch below ought to fix it (a sketch of the corrected flow follows
      this entry).
      Reported-and-tested-by: Alexander Potapenko <glider@google.com>
      Fixes: ce72a16f ("wait4(2)/waitid(2): separate copying rusage to userland")
      Cc: stable@vger.kernel.org # v4.13
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      6c85501f
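      A sketch of the corrected control flow (hypothetical helper name and a
      simplified kernel_waitid() argument list, not the exact kernel code):
      rusage is copied to userspace only when kernel_waitid() actually reaped a
      child, i.e. returned a positive PID.

        static long do_waitid_sketch(int which, pid_t upid, int options,
                                     struct rusage __user *rup)
        {
            struct rusage ru;
            long pid = kernel_waitid(which, upid, NULL, options, &ru);

            if (pid <= 0)
                return pid;     /* error or nothing reaped: ru holds junk */

            if (rup && copy_to_user(rup, &ru, sizeof(ru)))
                return -EFAULT; /* copy rusage out only in the PID case */

            return 0;           /* waitid(2) itself reports success as 0 */
        }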
  3. 29 Sep 2017 (6 commits)
    • sched/sysctl: Check user input value of sysctl_sched_time_avg · 5ccba44b
      Ethan Zhao authored
      The system will hang if the user sets sysctl_sched_time_avg to 0:
      
        [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0
      
        Stack traceback for pid 0
        0xffff883f6406c600 0 0 1 3 R 0xffff883f6406cf50 *swapper/3
        ffff883f7ccc3ae8 0000000000000018 ffffffff810c4dd0 0000000000000000
        0000000000017800 ffff883f7ccc3d78 0000000000000003 ffff883f7ccc3bf8
        ffffffff810c4fc9 ffff883f7ccc3c08 00000000810c5043 ffff883f7ccc3c08
        Call Trace:
        <IRQ> [<ffffffff810c4dd0>] ? update_group_capacity+0x110/0x200
        [<ffffffff810c4fc9>] ? update_sd_lb_stats+0x109/0x600
        [<ffffffff810c5507>] ? find_busiest_group+0x47/0x530
        [<ffffffff810c5b84>] ? load_balance+0x194/0x900
        [<ffffffff810ad5ca>] ? update_rq_clock.part.83+0x1a/0xe0
        [<ffffffff810c6d42>] ? rebalance_domains+0x152/0x290
        [<ffffffff810c6f5c>] ? run_rebalance_domains+0xdc/0x1d0
        [<ffffffff8108a75b>] ? __do_softirq+0xfb/0x320
        [<ffffffff8108ac85>] ? irq_exit+0x125/0x130
        [<ffffffff810b3a17>] ? scheduler_ipi+0x97/0x160
        [<ffffffff81052709>] ? smp_reschedule_interrupt+0x29/0x30
        [<ffffffff8173a1be>] ? reschedule_interrupt+0x6e/0x80
         <EOI> [<ffffffff815bc83c>] ? cpuidle_enter_state+0xcc/0x230
        [<ffffffff815bc80c>] ? cpuidle_enter_state+0x9c/0x230
        [<ffffffff815bc9d7>] ? cpuidle_enter+0x17/0x20
        [<ffffffff810cd6dc>] ? cpu_startup_entry+0x38c/0x420
        [<ffffffff81053373>] ? start_secondary+0x173/0x1e0
      
      This is because a divide-by-zero error happens in:
      
      update_group_capacity()
        update_cpu_capacity()
          scale_rt_capacity()
           {
                ...
                total = sched_avg_period() + delta;
                used = div_u64(avg, total);
                ...
           }
      
      To fix this issue, check the user input value of sysctl_sched_time_avg,
      keep it unchanged when hitting invalid input, and set the minimum limit of
      sysctl_sched_time_avg to 1 ms (a sketch of the resulting sysctl entry
      follows this entry).
      Reported-by: James Puthukattukaran <james.puthukattukaran@oracle.com>
      Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: efault@gmx.de
      Cc: ethan.kernel@gmail.com
      Cc: keescook@chromium.org
      Cc: mcgrof@kernel.org
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.zhao@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5ccba44b
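      A sketch of the resulting kernel/sysctl.c table entry, assuming the usual
      min/max handler pattern (the exact fields here are illustrative):

        static int one = 1;

        /* ... the sched_time_avg_ms entry in the sysctl table ... */
        {
            .procname     = "sched_time_avg_ms",
            .data         = &sysctl_sched_time_avg,
            .maxlen       = sizeof(unsigned int),
            .mode         = 0644,
            .proc_handler = proc_dointvec_minmax,  /* was proc_dointvec */
            .extra1       = &one,                  /* reject values below 1 ms */
        },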
    • sched/debug: Ignore TASK_IDLE for SysRq-W · 5d68cc95
      Peter Zijlstra authored
      Markus reported that tasks in TASK_IDLE state are reported by SysRq-W,
      which results in undesirable clutter.
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5d68cc95
    • sched/tracing: Use common task-state helpers · 5f6ad26e
      Peter Zijlstra authored
      Remove yet another task-state char instance.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5f6ad26e
    • locking/rwsem-xadd: Fix missed wakeup due to reordering of load · 9c29c318
      Prateek Sood authored
      If a spinner is present, there is a chance that the load of
      rwsem_has_spinner() in rwsem_wake() can be reordered with
      respect to decrement of rwsem count in __up_write() leading
      to wakeup being missed:
      
       spinning writer                  up_write caller
       ---------------                  -----------------------
       [S] osq_unlock()                 [L] osq
        spin_lock(wait_lock)
        sem->count=0xFFFFFFFF00000001
                  +0xFFFFFFFF00000000
        count=sem->count
        MB
                                         sem->count=0xFFFFFFFE00000001
                                                   -0xFFFFFFFF00000001
                                         spin_trylock(wait_lock)
                                         return
       rwsem_try_write_lock(count)
       spin_unlock(wait_lock)
       schedule()
      
      Reordering of atomic_long_sub_return_release() in __up_write() and
      rwsem_has_spinner() in rwsem_wake() can cause a wakeup to be missed in the
      up_write() context. In the spinning writer, sem->count and the local
      variable count are both 0xFFFFFFFE00000001, so rwsem_try_write_lock()
      fails to acquire the rwsem and the spinning writer goes to sleep in
      rwsem_down_write_failed().
      
      The smp_rmb() makes sure that the spinner state is consulted after
      sem->count has been updated in the up_write() context (a sketch of the
      fix's shape follows this entry).
      Signed-off-by: Prateek Sood <prsood@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Cc: longman@redhat.com
      Cc: parri.andrea@gmail.com
      Cc: sramana@codeaurora.org
      Link: http://lkml.kernel.org/r/1504794658-15397-1-git-send-email-prsood@codeaurora.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9c29c318
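      A sketch of the shape of the fix in rwsem_wake(), paraphrased from the
      description above (surrounding code simplified, not the exact upstream
      function):

        static struct rw_semaphore *rwsem_wake_sketch(struct rw_semaphore *sem)
        {
            unsigned long flags;

            if (rwsem_has_spinner(sem)) {
                /*
                 * Per the description above: make sure the spinner state is
                 * consulted only after the sem->count update done by
                 * __up_write() (atomic_long_sub_return_release()), so that
                 * the two sides cannot both decide to skip the wakeup.
                 */
                smp_rmb();
                if (!raw_spin_trylock_irqsave(&sem->wait_lock, flags))
                    return sem;
                goto locked;
            }
            raw_spin_lock_irqsave(&sem->wait_lock, flags);
        locked:
            /* ... wake the waiters queued on sem->wait_list ... */
            raw_spin_unlock_irqrestore(&sem->wait_lock, flags);
            return sem;
        }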
    • sched/debug: Remove unused variable · 65d5dc47
      Peter Zijlstra authored
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      65d5dc47
    • perf/aux: Only update ->aux_wakeup in non-overwrite mode · 441430eb
      Alexander Shishkin authored
      The following commit:
      
        d9a50b02 ("perf/aux: Ensure aux_wakeup represents most recent wakeup index")
      
      changed the AUX wakeup position calculation to rounddown(), which causes
      a division-by-zero in AUX overwrite mode (aka "snapshot mode").
      
      The zero denominator results from the fact that perf record doesn't set
      aux_watermark to anything, in which case the kernel sets it to half the
      AUX buffer size, but only in non-overwrite mode. In overwrite mode
      aux_watermark stays zero.
      
      The good news is that in AUX overwrite mode wakeups don't happen and the
      related bookkeeping is not relevant, so we can simply forgo the wakeup
      updates altogether (a small model follows this entry).
      Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/20170906160811.16510-1-alexander.shishkin@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      441430eb
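      A userspace model of the failure mode and the guard (the rounddown()
      macro below mirrors the kernel macro's shape; the struct and function
      names are illustrative):

        #include <stdio.h>

        /* same shape as the kernel macro: x - (x % y), so y == 0 traps */
        #define rounddown(x, y) ((x) - ((x) % (y)))

        struct aux_ring {                   /* illustrative, not the kernel struct */
            unsigned long aux_head;
            unsigned long aux_wakeup;
            unsigned long aux_watermark;    /* stays 0 in overwrite/snapshot mode */
            int aux_overwrite;
        };

        static void aux_update_wakeup(struct aux_ring *rb)
        {
            if (rb->aux_overwrite)
                return;   /* no wakeups in snapshot mode: skip the bookkeeping */

            rb->aux_wakeup = rounddown(rb->aux_head, rb->aux_watermark);
        }

        int main(void)
        {
            struct aux_ring rb = { .aux_head = 12345, .aux_watermark = 4096 };

            aux_update_wakeup(&rb);
            printf("aux_wakeup = %lu\n", rb.aux_wakeup);   /* 12288 */

            rb.aux_overwrite = 1;
            rb.aux_watermark = 0;
            aux_update_wakeup(&rb);   /* without the guard: division by zero */
            printf("snapshot mode: wakeup bookkeeping skipped\n");
            return 0;
        }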
  4. 28 Sep 2017 (2 commits)
  5. 26 Sep 2017 (7 commits)
  6. 25 Sep 2017 (3 commits)
  7. 24 Sep 2017 (4 commits)
  8. 21 Sep 2017 (1 commit)
    • bpf: one perf event close won't free bpf program attached by another perf event · ec9dd352
      Yonghong Song authored
      This patch fixes a bug exhibited by the following scenario:
        1. fd1 = perf_event_open with attr.config = ID1
        2. attach bpf program prog1 to fd1
        3. fd2 = perf_event_open with attr.config = ID1
           <this will be successful>
        4. the user program closes fd2 and prog1 is detached from the tracepoint.
        5. the user program using fd1 no longer works properly, because the
           tracepoint produces no output any more.
      
      The issue happens at step 4. Multiple perf_event_open() calls can succeed,
      but there is only one bpf prog pointer in the tp_event. In the current
      logic, releasing any fd for the same tp_event frees tp_event->prog.
      
      The fix is to free tp_event->prog only when the closing fd corresponds to
      the one which registered the program (a small model follows this entry).
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec9dd352
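      A userspace model of the ownership rule described above (all names are
      hypothetical, not the kernel data structures): several handles can point
      at the same tp_event, but only the handle that attached the program may
      detach it on close.

        #include <stdio.h>
        #include <stdlib.h>

        struct tp_event { void *prog; };
        struct handle   { struct tp_event *tp; void *attached_prog; };

        static void handle_close(struct handle *h)
        {
            /* The buggy behaviour was the equivalent of h->tp->prog = NULL;
             * done unconditionally for every closing handle. */
            if (h->attached_prog && h->tp->prog == h->attached_prog) {
                h->tp->prog = NULL;          /* only the attacher detaches */
                printf("prog detached by its owner\n");
            }
            free(h);
        }

        int main(void)
        {
            struct tp_event tp = { 0 };
            void *prog1 = malloc(1);

            struct handle *fd1 = malloc(sizeof(*fd1));
            struct handle *fd2 = malloc(sizeof(*fd2));
            *fd1 = (struct handle){ .tp = &tp, .attached_prog = prog1 };
            *fd2 = (struct handle){ .tp = &tp };    /* same event, no attach */
            tp.prog = prog1;

            handle_close(fd2);   /* must NOT clear tp.prog */
            printf("after closing fd2: prog %s\n",
                   tp.prog ? "still attached" : "gone");
            handle_close(fd1);   /* owner closes: prog detached */
            free(prog1);
            return 0;
        }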
  9. 20 Sep 2017 (5 commits)
    • bpf: fix ri->map_owner pointer on bpf_prog_realloc · 7c300131
      Daniel Borkmann authored
      Commit 109980b8 ("bpf: don't select potentially stale
      ri->map from buggy xdp progs") passed the pointer to the prog
      itself, to be loaded into r4 prior to the bpf_redirect_map() helper
      call, so that we can store the owner into ri->map_owner from within
      the helper.
      
      The issue with that is that the actual address of the prog is still
      subject to change when subsequent rewrites occur that require
      slow path in bpf_prog_realloc() to alloc more memory, e.g. from
      patching inlining helper functions or constant blinding. Thus,
      we really need to take prog->aux as the address we're holding,
      which also works with prog clones as they share the same aux
      object.
      
      Instead of then fetching aux->prog at runtime, which could
      potentially incur cache misses due to false sharing, we just use
      aux for the comparison on the map owner. This also keeps the
      patchlet the same size, and the later check in xdp_map_invalid()
      only accesses the read-only aux pointer from the prog, which is
      already in the same cacheline from the prior access when calling
      bpf_func (see the sketch after this entry).
      
      Fixes: 109980b8 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c300131
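      A sketch of the comparison described above, assuming the helper keeps the
      shape implied by the message (the owner cookie stored at redirect time is
      the prog's aux pointer, which stays stable across bpf_prog_realloc()):

        /* ri->map_owner is stored as (unsigned long)prog->aux at redirect time */
        static bool xdp_map_invalid(const struct bpf_prog *xdp_prog,
                                    unsigned long map_owner)
        {
            return (unsigned long)xdp_prog->aux != map_owner;
        }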
    • bpf: do not disable/enable BH in bpf_map_free_id() · 930651a7
      Eric Dumazet authored
      syzkaller reported the following splat [1].
      
      Since hard irqs are disabled by the caller, bpf_map_free_id()
      should not try to enable/disable BH (see the sketch after this entry).
      
      Another solution would be to change htab_map_delete_elem() to
      defer the free_htab_elem() call until after
      raw_spin_unlock_irqrestore(&b->lock, flags), but this might not be
      enough to cover other code paths.
      
      [1]
      WARNING: CPU: 1 PID: 8052 at kernel/softirq.c:161 __local_bh_enable_ip
      +0x1e/0x160 kernel/softirq.c:161
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 1 PID: 8052 Comm: syz-executor1 Not tainted 4.13.0-next-20170915+
      #23
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:52
       panic+0x1e4/0x417 kernel/panic.c:181
       __warn+0x1c4/0x1d9 kernel/panic.c:542
       report_bug+0x211/0x2d0 lib/bug.c:183
       fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
       do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
       do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
       do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
       invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
      RIP: 0010:__local_bh_enable_ip+0x1e/0x160 kernel/softirq.c:161
      RSP: 0018:ffff8801cdcd7748 EFLAGS: 00010046
      RAX: 0000000000000082 RBX: 0000000000000201 RCX: 0000000000000000
      RDX: 1ffffffff0b5933c RSI: 0000000000000201 RDI: ffffffff85ac99e0
      RBP: ffff8801cdcd7758 R08: ffffffff85b87158 R09: 1ffff10039b9aec6
      R10: ffff8801c99f24c0 R11: 0000000000000002 R12: ffffffff817b0b47
      R13: dffffc0000000000 R14: ffff8801cdcd77e8 R15: 0000000000000001
       __raw_spin_unlock_bh include/linux/spinlock_api_smp.h:176 [inline]
       _raw_spin_unlock_bh+0x30/0x40 kernel/locking/spinlock.c:207
       spin_unlock_bh include/linux/spinlock.h:361 [inline]
       bpf_map_free_id kernel/bpf/syscall.c:197 [inline]
       __bpf_map_put+0x267/0x320 kernel/bpf/syscall.c:227
       bpf_map_put+0x1a/0x20 kernel/bpf/syscall.c:235
       bpf_map_fd_put_ptr+0x15/0x20 kernel/bpf/map_in_map.c:96
       free_htab_elem+0xc3/0x1b0 kernel/bpf/hashtab.c:658
       htab_map_delete_elem+0x74d/0x970 kernel/bpf/hashtab.c:1063
       map_delete_elem kernel/bpf/syscall.c:633 [inline]
       SYSC_bpf kernel/bpf/syscall.c:1479 [inline]
       SyS_bpf+0x2188/0x46a0 kernel/bpf/syscall.c:1451
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Fixes: f3f1c054 ("bpf: Introduce bpf_map ID")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      930651a7
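      A sketch of the locking change described above (simplified; the real
      function carries more context): use the irqsave variants so the helper is
      safe to call with hard irqs already disabled, instead of toggling BH.

        static void bpf_map_free_id_sketch(struct bpf_map *map)
        {
            unsigned long flags;

            if (!map->id)
                return;

            /* was: spin_lock_bh(&map_idr_lock); */
            spin_lock_irqsave(&map_idr_lock, flags);
            idr_remove(&map_idr, map->id);
            map->id = 0;
            spin_unlock_irqrestore(&map_idr_lock, flags);
        }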
    • tracing: Fix trace_pipe behavior for instance traces · 75df6e68
      Tahsin Erdogan authored
      When reading data from trace_pipe, tracing_wait_pipe() performs a
      check to see if tracing has been turned off after some data was read.
      Currently, this check always looks at the global trace state, but it
      should check the trace instance that the trace_pipe file belongs to
      (see the sketch after this entry).
      
      Because of this bug, cat instances/i1/trace_pipe in the following
      script will immediately exit instead of waiting for data:
      
      cd /sys/kernel/debug/tracing
      echo 0 > tracing_on
      mkdir -p instances/i1
      echo 1 > instances/i1/tracing_on
      echo 1 > instances/i1/events/sched/sched_process_exec/enable
      cat instances/i1/trace_pipe
      
      Link: http://lkml.kernel.org/r/20170917102348.1615-1-tahsin@google.com
      
      Cc: stable@vger.kernel.org
      Fixes: 10246fa3 ("tracing: give easy way to clear trace buffer")
      Signed-off-by: Tahsin Erdogan <tahsin@google.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      75df6e68
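      A sketch of the one-line shape of the fix in tracing_wait_pipe(), per the
      description above (surrounding context omitted):

          /* was: if (!tracing_is_on() && iter->pos) */
          if (!tracer_tracing_is_on(iter->tr) && iter->pos)
              break;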
    • tracing: Ignore mmiotrace from kernel commandline · c7b3ae0b
      Ziqian SUN (Zamir) authored
      The mmiotrace tracer cannot be enabled with ftrace=mmiotrace on the kernel
      command line. With this patch, a noboot flag is added to the tracer struct;
      when the system boots with a tracer that has noboot set, it prints a
      warning message and continues booting (see the sketch after this entry).
      
      Link: http://lkml.kernel.org/r/1505111195-31942-1-git-send-email-zsun@redhat.com
      Signed-off-by: Ziqian SUN (Zamir) <zsun@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      c7b3ae0b
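      A sketch of the mechanism described above (fragments, simplified; the
      warning text and control flow are illustrative): the tracer marks itself
      as not boot-safe, and the boot-time "ftrace=" handling honours the flag.

        static struct tracer mmio_tracer __read_mostly = {
            .name   = "mmiotrace",
            /* ... callbacks ... */
            .noboot = true,     /* cannot be started from the command line */
        };

        /* in the boot-time "ftrace=" handling (illustrative): */
            if (type->noboot) {
                pr_warn("tracer '%s' cannot be enabled on the command line\n",
                        type->name);
                return;         /* warn and keep booting without a boot tracer */
            }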
    • tracing: Erase irqsoff trace with empty write · 8dd33bcb
      Bo Yan authored
      One convenient way to erase a trace is "echo > trace". However, this
      is currently broken if the current tracer is the irqsoff tracer,
      because the irqsoff tracer uses max_buffer as its default trace
      buffer.
      
      Clear max_buffer instead when it is the trace buffer currently in
      use (see the sketch after this entry).
      
      Link: http://lkml.kernel.org/r/1505754215-29411-1-git-send-email-byan@nvidia.com
      
      Cc: <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 4acd4d00 ("tracing: give easy way to clear trace buffer")
      Signed-off-by: Bo Yan <byan@nvidia.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      8dd33bcb
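      A sketch of the buffer selection described above (simplified; assuming
      the tracer's print_max flag marks users of max_buffer, and guarded by
      CONFIG_TRACER_MAX_TRACE in the real code): pick max_buffer for the reset
      when the current tracer is using it.

            struct trace_buffer *buf = &tr->trace_buffer;

        #ifdef CONFIG_TRACER_MAX_TRACE
            if (tr->current_trace->print_max)
                buf = &tr->max_buffer;      /* e.g. the irqsoff tracer */
        #endif
            /* "echo > trace" now clears the buffer actually in use */
            tracing_reset_online_cpus(buf);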
  10. 19 Sep 2017 (1 commit)
  11. 17 Sep 2017 (1 commit)
    • genirq: Fix cpumask check in __irq_startup_managed() · 9cb067ef
      Thomas Gleixner authored
      The result of cpumask_any_and() is invalid when it is greater than or
      equal to nr_cpu_ids. The current check only tests for greater than. Fix
      it (a small model of the off-by-one follows this entry).
      
      Fixes: 761ea388 ("genirq: Handle managed irqs gracefully in irq_startup()")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Chen Yu <yu.c.chen@intel.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Alok Kataria <akataria@vmware.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: stable@vger.kernel.org
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rui Zhang <rui.zhang@intel.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Link: http://lkml.kernel.org/r/20170913213152.272283444@linutronix.de
      9cb067ef
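      A userspace model of the off-by-one (illustrative names, nr_cpu_ids
      scaled down to 8): a "find any" helper that returns a sentinel value for
      "no CPU" must be tested with >=, not >.

        #include <stdio.h>

        #define NR_CPU_IDS 8

        /* behaves like cpumask_any_and(): first set bit of (a & b), else NR_CPU_IDS */
        static unsigned int any_and(unsigned long a, unsigned long b)
        {
            unsigned long both = a & b;

            for (unsigned int cpu = 0; cpu < NR_CPU_IDS; cpu++)
                if (both & (1UL << cpu))
                    return cpu;
            return NR_CPU_IDS;      /* "no CPU left" sentinel */
        }

        int main(void)
        {
            unsigned long affinity = 0xc0, online = 0x0f;   /* disjoint masks */
            unsigned int cpu = any_and(affinity, online);

            if (cpu > NR_CPU_IDS)       /* the old check */
                printf("old check fires\n");    /* never reached: the bug */
            else
                printf("old check missed the empty intersection (cpu=%u)\n", cpu);

            if (cpu >= NR_CPU_IDS)      /* the fixed check */
                printf("fixed check: no online CPU in the managed affinity mask\n");
            return 0;
        }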
  12. 16 Sep 2017 (1 commit)
  13. 15 Sep 2017 (2 commits)
    • sched/wait: Introduce wakeup boomark in wake_up_page_bit · 11a19c7b
      Tim Chen authored
      Now that we have added breaks in the wait queue scan and allow
      bookmarking of the scan position, we put this logic into the
      wake_up_page_bit function.
      
      We can have very long page wait lists in large systems where multiple
      pages share the same wait list. We break up the wakeup walk here to give
      other CPUs a chance to access the list, and to avoid disabling interrupts
      while traversing the list for too long.  This reduces the interrupt and
      rescheduling latency, and excessive page wait queue lock hold time.
      
      [ v2: Remove bookmark_wake_function ]
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11a19c7b
    • sched/wait: Break up long wake list walk · 2554db91
      Tim Chen authored
      We encountered workloads that have very long wake up lists on large
      systems. A waker takes a long time to traverse the entire wake list and
      execute all the wake functions.
      
      We saw page wait lists that are up to 3700+ entries long in tests of
      large 4- and 8-socket systems. It took 0.8 sec to traverse such a list
      during wake up. Any other CPU that contends for the list spin lock will
      spin for a long time. This is a result of the NUMA balancing migration of
      hot pages that are shared by many threads.
      
      Multiple CPUs waking are queued up behind the lock, and the last one
      queued has to wait until all CPUs did all the wakeups.
      
      The page wait list is traversed with interrupts disabled, which caused
      various problems. This was the original cause that triggered the NMI
      watchdog timer in: https://patchwork.kernel.org/patch/9800303/ . Only
      extending the NMI watchdog timer there helped.
      
      This patch bookmarks the waker's scan position in the wake list and
      breaks up the wakeup walk, allowing access to the list before the waker
      resumes its walk down the rest of the wait list. It lowers the interrupt
      and rescheduling latency (a simplified model follows this entry).
      
      This patch also provides a performance boost when combined with the next
      patch, which breaks up the page wakeup list walk. We saw a 22% improvement
      in the will-it-scale file pread2 test on a Xeon Phi system running 256
      threads.
      
      [ v2: Merged in Linus' changes to remove the bookmark_wake_function, and
        simplify access to flags. ]
      Reported-by: Kan Liang <kan.liang@intel.com>
      Tested-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2554db91
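      A single-threaded userspace model of the bookmark idea described above
      (names and the 64-entry chunk size are illustrative, not the kernel
      implementation): walk the list in bounded chunks, leave a sentinel at the
      break point, and resume from it on the next pass.

        #include <stdio.h>
        #include <stdlib.h>

        #define WAKE_CHUNK 64   /* assumed per-pass limit; illustrative only */

        struct waiter {
            struct waiter *next;
            int id;
        };

        /* The bookmark is a sentinel node that stays linked into the list;
         * its ->next is where the next pass resumes.  Wake (here: print) up
         * to WAKE_CHUNK waiters, then park the bookmark at the break point.
         * Returns 0 once the end of the list has been reached. */
        static int wake_chunk(struct waiter *bookmark)
        {
            struct waiter *cur = bookmark->next;
            int woken = 0;

            while (cur && woken < WAKE_CHUNK) {
                struct waiter *next = cur->next;

                printf("waking waiter %d\n", cur->id);
                free(cur);              /* a woken waiter leaves the list */
                woken++;
                cur = next;
            }
            bookmark->next = cur;       /* resume here on the next pass */
            return cur != NULL;
        }

        int main(void)
        {
            struct waiter bookmark = { .next = NULL };
            struct waiter *tail = &bookmark;
            int pass = 1;

            for (int i = 0; i < 200; i++) {     /* build a long wait list */
                struct waiter *w = calloc(1, sizeof(*w));
                w->id = i;
                tail->next = w;
                tail = w;
            }

            while (wake_chunk(&bookmark))
                printf("pass %d done: the lock could be dropped here\n", pass++);
            return 0;
        }

      In the kernel the list is the shared wait queue, and dropping the lock
      between chunks is what lets other CPUs make progress.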
  14. 14 Sep 2017 (1 commit)
    • mm: treewide: remove GFP_TEMPORARY allocation flag · 0ee931c4
      Michal Hocko authored
      GFP_TEMPORARY was introduced by commit e12ba74d ("Group short-lived
      and reclaimable kernel allocations") along with __GFP_RECLAIMABLE.  Its
      primary motivation was to allow users to tell that an allocation is
      short lived and so the allocator can try to place such allocations close
      together and prevent long term fragmentation.  As much as this sounds
      like a reasonable semantic it becomes much less clear when to use the
      highlevel GFP_TEMPORARY allocation flag.  How long is temporary? Can the
      context holding that memory sleep? Can it take locks? It seems there is
      no good answer for those questions.
      
      The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
      __GFP_RECLAIMABLE which in itself is tricky because basically none of
      the existing callers provides a way to reclaim the allocated memory.  So
      this is rather misleading and hard to evaluate for any benefits.
      
      I have checked some random users and none of them has added the flag
      with a specific justification.  I suspect most of them just copied from
      other existing users and others just thought it might be a good idea to
      use without any measuring.  This suggests that GFP_TEMPORARY just
      motivates for cargo cult usage without any reasoning.
      
      I believe that our gfp flags are quite complex already and especially
      those with highlevel semantic should be clearly defined to prevent from
      confusion and abuse.  Therefore I propose dropping GFP_TEMPORARY and
      replacing all existing users with plain GFP_KERNEL (a one-line example
      follows this entry).  Please note that
      SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
      so they will be placed properly for memory fragmentation prevention.
      
      I can see reasons we might want some gfp flag to reflect short-term
      allocations but I propose starting from a clear semantic definition and
      only then add users with proper justification.
      
      This was brought up before LSF this year by Matthew [1] and it
      turned out that GFP_TEMPORARY really doesn't have a clear semantic.  It
      seems to be a heuristic without any measured advantage for most (if not
      all) its current users.  The follow up discussion has revealed that
      opinions on what might be temporary allocation differ a lot between
      developers.  So rather than trying to tweak existing users into a
      semantic which they haven't expected I propose to simply remove the flag
      and start from scratch if we really need a semantic for short term
      allocations.
      
      [1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org
      
      [akpm@linux-foundation.org: fix typo]
      [akpm@linux-foundation.org: coding-style fixes]
      [sfr@canb.auug.org.au: drm/i915: fix up]
        Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Neil Brown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ee931c4
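      The treewide conversion itself is mechanical; the shape of each
      replacement is simply (variable names illustrative):

        -    buf = kmalloc(len, GFP_TEMPORARY);
        +    buf = kmalloc(len, GFP_KERNEL);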
  15. 12 Sep 2017 (3 commits)