1. 05 10月, 2017 3 次提交
  2. 04 10月, 2017 4 次提交
  3. 30 9月, 2017 1 次提交
    • A
      fix infoleak in waitid(2) · 6c85501f
      Al Viro 提交于
      kernel_waitid() can return a PID, an error or 0.  rusage is filled in the first
      case and waitid(2) rusage should've been copied out exactly in that case, *not*
      whenever kernel_waitid() has not returned an error.  Compat variant shares that
      braino; none of kernel_wait4() callers do, so the below ought to fix it.
      Reported-and-tested-by: NAlexander Potapenko <glider@google.com>
      Fixes: ce72a16f ("wait4(2)/waitid(2): separate copying rusage to userland")
      Cc: stable@vger.kernel.org # v4.13
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6c85501f
  4. 29 9月, 2017 6 次提交
    • E
      sched/sysctl: Check user input value of sysctl_sched_time_avg · 5ccba44b
      Ethan Zhao 提交于
      System will hang if user set sysctl_sched_time_avg to 0:
      
        [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0
      
        Stack traceback for pid 0
        0xffff883f6406c600 0 0 1 3 R 0xffff883f6406cf50 *swapper/3
        ffff883f7ccc3ae8 0000000000000018 ffffffff810c4dd0 0000000000000000
        0000000000017800 ffff883f7ccc3d78 0000000000000003 ffff883f7ccc3bf8
        ffffffff810c4fc9 ffff883f7ccc3c08 00000000810c5043 ffff883f7ccc3c08
        Call Trace:
        <IRQ> [<ffffffff810c4dd0>] ? update_group_capacity+0x110/0x200
        [<ffffffff810c4fc9>] ? update_sd_lb_stats+0x109/0x600
        [<ffffffff810c5507>] ? find_busiest_group+0x47/0x530
        [<ffffffff810c5b84>] ? load_balance+0x194/0x900
        [<ffffffff810ad5ca>] ? update_rq_clock.part.83+0x1a/0xe0
        [<ffffffff810c6d42>] ? rebalance_domains+0x152/0x290
        [<ffffffff810c6f5c>] ? run_rebalance_domains+0xdc/0x1d0
        [<ffffffff8108a75b>] ? __do_softirq+0xfb/0x320
        [<ffffffff8108ac85>] ? irq_exit+0x125/0x130
        [<ffffffff810b3a17>] ? scheduler_ipi+0x97/0x160
        [<ffffffff81052709>] ? smp_reschedule_interrupt+0x29/0x30
        [<ffffffff8173a1be>] ? reschedule_interrupt+0x6e/0x80
         <EOI> [<ffffffff815bc83c>] ? cpuidle_enter_state+0xcc/0x230
        [<ffffffff815bc80c>] ? cpuidle_enter_state+0x9c/0x230
        [<ffffffff815bc9d7>] ? cpuidle_enter+0x17/0x20
        [<ffffffff810cd6dc>] ? cpu_startup_entry+0x38c/0x420
        [<ffffffff81053373>] ? start_secondary+0x173/0x1e0
      
      Because divide-by-zero error happens in function:
      
      update_group_capacity()
        update_cpu_capacity()
          scale_rt_capacity()
           {
                ...
                total = sched_avg_period() + delta;
                used = div_u64(avg, total);
                ...
           }
      
      To fix this issue, check user input value of sysctl_sched_time_avg, keep
      it unchanged when hitting invalid input, and set the minimum limit of
      sysctl_sched_time_avg to 1 ms.
      Reported-by: NJames Puthukattukaran <james.puthukattukaran@oracle.com>
      Signed-off-by: NEthan Zhao <ethan.zhao@oracle.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: efault@gmx.de
      Cc: ethan.kernel@gmail.com
      Cc: keescook@chromium.org
      Cc: mcgrof@kernel.org
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.zhao@oracle.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5ccba44b
    • P
      sched/debug: Ignore TASK_IDLE for SysRq-W · 5d68cc95
      Peter Zijlstra 提交于
      Markus reported that tasks in TASK_IDLE state are reported by SysRq-W,
      which results in undesirable clutter.
      Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      5d68cc95
    • P
      sched/tracing: Use common task-state helpers · 5f6ad26e
      Peter Zijlstra 提交于
      Remove yet another task-state char instance.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      5f6ad26e
    • P
      locking/rwsem-xadd: Fix missed wakeup due to reordering of load · 9c29c318
      Prateek Sood 提交于
      If a spinner is present, there is a chance that the load of
      rwsem_has_spinner() in rwsem_wake() can be reordered with
      respect to decrement of rwsem count in __up_write() leading
      to wakeup being missed:
      
       spinning writer                  up_write caller
       ---------------                  -----------------------
       [S] osq_unlock()                 [L] osq
        spin_lock(wait_lock)
        sem->count=0xFFFFFFFF00000001
                  +0xFFFFFFFF00000000
        count=sem->count
        MB
                                         sem->count=0xFFFFFFFE00000001
                                                   -0xFFFFFFFF00000001
                                         spin_trylock(wait_lock)
                                         return
       rwsem_try_write_lock(count)
       spin_unlock(wait_lock)
       schedule()
      
      Reordering of atomic_long_sub_return_release() in __up_write()
      and rwsem_has_spinner() in rwsem_wake() can cause missing of
      wakeup in up_write() context. In spinning writer, sem->count
      and local variable count is 0XFFFFFFFE00000001. It would result
      in rwsem_try_write_lock() failing to acquire rwsem and spinning
      writer going to sleep in rwsem_down_write_failed().
      
      The smp_rmb() will make sure that the spinner state is
      consulted after sem->count is updated in up_write context.
      Signed-off-by: NPrateek Sood <prsood@codeaurora.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Cc: longman@redhat.com
      Cc: parri.andrea@gmail.com
      Cc: sramana@codeaurora.org
      Link: http://lkml.kernel.org/r/1504794658-15397-1-git-send-email-prsood@codeaurora.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9c29c318
    • P
      sched/debug: Remove unused variable · 65d5dc47
      Peter Zijlstra 提交于
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      65d5dc47
    • A
      perf/aux: Only update ->aux_wakeup in non-overwrite mode · 441430eb
      Alexander Shishkin 提交于
      The following commit:
      
        d9a50b02 ("perf/aux: Ensure aux_wakeup represents most recent wakeup index")
      
      changed the AUX wakeup position calculation to rounddown(), which causes
      a division-by-zero in AUX overwrite mode (aka "snapshot mode").
      
      The zero denominator results from the fact that perf record doesn't set
      aux_watermark to anything, in which case the kernel will set it to half
      the AUX buffer size, but only for non-overwrite mode. In the overwrite
      mode aux_watermark stays zero.
      
      The good news is that, AUX overwrite mode, wakeups don't happen and
      related bookkeeping is not relevant, so we can simply forego the whole
      wakeup updates.
      Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/20170906160811.16510-1-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      441430eb
  5. 28 9月, 2017 2 次提交
  6. 26 9月, 2017 7 次提交
  7. 25 9月, 2017 3 次提交
  8. 24 9月, 2017 4 次提交
  9. 21 9月, 2017 1 次提交
    • Y
      bpf: one perf event close won't free bpf program attached by another perf event · ec9dd352
      Yonghong Song 提交于
      This patch fixes a bug exhibited by the following scenario:
        1. fd1 = perf_event_open with attr.config = ID1
        2. attach bpf program prog1 to fd1
        3. fd2 = perf_event_open with attr.config = ID1
           <this will be successful>
        4. user program closes fd2 and prog1 is detached from the tracepoint.
        5. user program with fd1 does not work properly as tracepoint
           no output any more.
      
      The issue happens at step 4. Multiple perf_event_open can be called
      successfully, but only one bpf prog pointer in the tp_event. In the
      current logic, any fd release for the same tp_event will free
      the tp_event->prog.
      
      The fix is to free tp_event->prog only when the closing fd
      corresponds to the one which registered the program.
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ec9dd352
  10. 20 9月, 2017 5 次提交
    • D
      bpf: fix ri->map_owner pointer on bpf_prog_realloc · 7c300131
      Daniel Borkmann 提交于
      Commit 109980b8 ("bpf: don't select potentially stale
      ri->map from buggy xdp progs") passed the pointer to the prog
      itself to be loaded into r4 prior on bpf_redirect_map() helper
      call, so that we can store the owner into ri->map_owner out of
      the helper.
      
      Issue with that is that the actual address of the prog is still
      subject to change when subsequent rewrites occur that require
      slow path in bpf_prog_realloc() to alloc more memory, e.g. from
      patching inlining helper functions or constant blinding. Thus,
      we really need to take prog->aux as the address we're holding,
      which also works with prog clones as they share the same aux
      object.
      
      Instead of then fetching aux->prog during runtime, which could
      potentially incur cache misses due to false sharing, we are
      going to just use aux for comparison on the map owner. This
      will also keep the patchlet of the same size, and later check
      in xdp_map_invalid() only accesses read-only aux pointer from
      the prog, it's also in the same cacheline already from prior
      access when calling bpf_func.
      
      Fixes: 109980b8 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7c300131
    • E
      bpf: do not disable/enable BH in bpf_map_free_id() · 930651a7
      Eric Dumazet 提交于
      syzkaller reported following splat [1]
      
      Since hard irq are disabled by the caller, bpf_map_free_id()
      should not try to enable/disable BH.
      
      Another solution would be to change htab_map_delete_elem() to
      defer the free_htab_elem() call after
      raw_spin_unlock_irqrestore(&b->lock, flags), but this might be not
      enough to cover other code paths.
      
      [1]
      WARNING: CPU: 1 PID: 8052 at kernel/softirq.c:161 __local_bh_enable_ip
      +0x1e/0x160 kernel/softirq.c:161
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 1 PID: 8052 Comm: syz-executor1 Not tainted 4.13.0-next-20170915+
      #23
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:52
       panic+0x1e4/0x417 kernel/panic.c:181
       __warn+0x1c4/0x1d9 kernel/panic.c:542
       report_bug+0x211/0x2d0 lib/bug.c:183
       fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
       do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
       do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
       do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
       invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
      RIP: 0010:__local_bh_enable_ip+0x1e/0x160 kernel/softirq.c:161
      RSP: 0018:ffff8801cdcd7748 EFLAGS: 00010046
      RAX: 0000000000000082 RBX: 0000000000000201 RCX: 0000000000000000
      RDX: 1ffffffff0b5933c RSI: 0000000000000201 RDI: ffffffff85ac99e0
      RBP: ffff8801cdcd7758 R08: ffffffff85b87158 R09: 1ffff10039b9aec6
      R10: ffff8801c99f24c0 R11: 0000000000000002 R12: ffffffff817b0b47
      R13: dffffc0000000000 R14: ffff8801cdcd77e8 R15: 0000000000000001
       __raw_spin_unlock_bh include/linux/spinlock_api_smp.h:176 [inline]
       _raw_spin_unlock_bh+0x30/0x40 kernel/locking/spinlock.c:207
       spin_unlock_bh include/linux/spinlock.h:361 [inline]
       bpf_map_free_id kernel/bpf/syscall.c:197 [inline]
       __bpf_map_put+0x267/0x320 kernel/bpf/syscall.c:227
       bpf_map_put+0x1a/0x20 kernel/bpf/syscall.c:235
       bpf_map_fd_put_ptr+0x15/0x20 kernel/bpf/map_in_map.c:96
       free_htab_elem+0xc3/0x1b0 kernel/bpf/hashtab.c:658
       htab_map_delete_elem+0x74d/0x970 kernel/bpf/hashtab.c:1063
       map_delete_elem kernel/bpf/syscall.c:633 [inline]
       SYSC_bpf kernel/bpf/syscall.c:1479 [inline]
       SyS_bpf+0x2188/0x46a0 kernel/bpf/syscall.c:1451
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      
      Fixes: f3f1c054 ("bpf: Introduce bpf_map ID")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      930651a7
    • T
      tracing: Fix trace_pipe behavior for instance traces · 75df6e68
      Tahsin Erdogan 提交于
      When reading data from trace_pipe, tracing_wait_pipe() performs a
      check to see if tracing has been turned off after some data was read.
      Currently, this check always looks at global trace state, but it
      should be checking the trace instance where trace_pipe is located at.
      
      Because of this bug, cat instances/i1/trace_pipe in the following
      script will immediately exit instead of waiting for data:
      
      cd /sys/kernel/debug/tracing
      echo 0 > tracing_on
      mkdir -p instances/i1
      echo 1 > instances/i1/tracing_on
      echo 1 > instances/i1/events/sched/sched_process_exec/enable
      cat instances/i1/trace_pipe
      
      Link: http://lkml.kernel.org/r/20170917102348.1615-1-tahsin@google.com
      
      Cc: stable@vger.kernel.org
      Fixes: 10246fa3 ("tracing: give easy way to clear trace buffer")
      Signed-off-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      75df6e68
    • Z
      tracing: Ignore mmiotrace from kernel commandline · c7b3ae0b
      Ziqian SUN (Zamir) 提交于
      The mmiotrace tracer cannot be enabled with ftrace=mmiotrace in kernel
      commandline. With this patch, noboot is added to the tracer struct,
      and when system boot with a tracer that has noboot=true, it will print
      out a warning message and continue booting.
      
      Link: http://lkml.kernel.org/r/1505111195-31942-1-git-send-email-zsun@redhat.comSigned-off-by: NZiqian SUN (Zamir) <zsun@redhat.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      c7b3ae0b
    • B
      tracing: Erase irqsoff trace with empty write · 8dd33bcb
      Bo Yan 提交于
      One convenient way to erase trace is "echo > trace". However, this
      is currently broken if the current tracer is irqsoff tracer. This
      is because irqsoff tracer use max_buffer as the default trace
      buffer.
      
      Set the max_buffer as the one to be cleared when it's the trace
      buffer currently in use.
      
      Link: http://lkml.kernel.org/r/1505754215-29411-1-git-send-email-byan@nvidia.com
      
      Cc: <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: 4acd4d00 ("tracing: give easy way to clear trace buffer")
      Signed-off-by: NBo Yan <byan@nvidia.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      8dd33bcb
  11. 19 9月, 2017 1 次提交
  12. 17 9月, 2017 1 次提交
    • T
      genirq: Fix cpumask check in __irq_startup_managed() · 9cb067ef
      Thomas Gleixner 提交于
      The result of cpumask_any_and() is invalid when result greater or equal
      nr_cpu_ids. The current check is checking for greater only. Fix it.
      
      Fixes: 761ea388 ("genirq: Handle managed irqs gracefully in irq_startup()")
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Chen Yu <yu.c.chen@intel.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Alok Kataria <akataria@vmware.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: stable@vger.kernel.org
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rui Zhang <rui.zhang@intel.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Link: http://lkml.kernel.org/r/20170913213152.272283444@linutronix.de
      9cb067ef
  13. 16 9月, 2017 1 次提交
  14. 15 9月, 2017 1 次提交
    • T
      sched/wait: Introduce wakeup boomark in wake_up_page_bit · 11a19c7b
      Tim Chen 提交于
      Now that we have added breaks in the wait queue scan and allow bookmark
      on scan position, we put this logic in the wake_up_page_bit function.
      
      We can have very long page wait list in large system where multiple
      pages share the same wait list. We break the wake up walk here to allow
      other cpus a chance to access the list, and not to disable the interrupts
      when traversing the list for too long.  This reduces the interrupt and
      rescheduling latency, and excessive page wait queue lock hold time.
      
      [ v2: Remove bookmark_wake_function ]
      Signed-off-by: NTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      11a19c7b