1. 29 5月, 2018 3 次提交
    • S
      tracing: Do not reference event data in post call triggers · c94e45bc
      Steven Rostedt (VMware) 提交于
      Trace event triggers can be called before or after the event has been
      committed. If it has been called after the commit, there's a possibility
      that the event no longer exists. Currently, the two post callers is the
      trigger to disable tracing (traceoff) and the one that will record a stack
      dump (stacktrace). Neither of them reference the trace event entry record,
      as that would lead to a race condition that could pass in corrupted data.
      
      To prevent any other users of the post data triggers from using the trace
      event record, pass in NULL to the post call trigger functions for the event
      record, as they should never need to use them in the first place.
      
      This does not fix any bug, but prevents bugs from happening by new post call
      trigger users.
      Reviewed-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Reviewed-by: NNamhyung Kim <namhyung@kernel.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      c94e45bc
    • L
      tracepoints: Fix the descriptions of tracepoint_probe_register{_prio} · f39e2391
      Lee, Chun-Yi 提交于
      The description of tracepoint_probe_register duplicates
      with tracepoint_probe_register_prio. This patch cleans up
      the descriptions.
      
      Link: http://lkml.kernel.org/r/20170616082643.7311-1-jlee@suse.comSigned-off-by: N"Lee, Chun-Yi" <jlee@suse.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      f39e2391
    • S
      tracing: Make the snapshot trigger work with instances · 2824f503
      Steven Rostedt (VMware) 提交于
      The snapshot trigger currently only affects the main ring buffer, even when
      it is used by the instances. This can be confusing as the snapshot trigger
      is listed in the instance.
      
       > # cd /sys/kernel/tracing
       > # mkdir instances/foo
       > # echo snapshot > instances/foo/events/syscalls/sys_enter_fchownat/trigger
       > # echo top buffer > trace_marker
       > # echo foo buffer > instances/foo/trace_marker
       > # touch /tmp/bar
       > # chown rostedt /tmp/bar
       > # cat instances/foo/snapshot
       # tracer: nop
       #
       #
       # * Snapshot is freed *
       #
       # Snapshot commands:
       # echo 0 > snapshot : Clears and frees snapshot buffer
       # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.
       #                      Takes a snapshot of the main buffer.
       # echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free)
       #                      (Doesn't have to be '2' works with any number that
       #                       is not a '0' or '1')
      
       > # cat snapshot
       # tracer: nop
       #
       #                              _-----=> irqs-off
       #                             / _----=> need-resched
       #                            | / _---=> hardirq/softirq
       #                            || / _--=> preempt-depth
       #                            ||| /     delay
       #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
       #              | |       |   ||||       |         |
                   bash-1189  [000] ....   111.488323: tracing_mark_write: top buffer
      
      Not only did the snapshot occur in the top level buffer, but the instance
      snapshot buffer should have been allocated, and it is still free.
      
      Cc: stable@vger.kernel.org
      Fixes: 85f2b082 ("tracing: Add basic event trigger framework")
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      2824f503
  2. 28 5月, 2018 1 次提交
    • S
      tracing: Fix crash when freeing instances with event triggers · 86b389ff
      Steven Rostedt (VMware) 提交于
      If a instance has an event trigger enabled when it is freed, it could cause
      an access of free memory. Here's the case that crashes:
      
       # cd /sys/kernel/tracing
       # mkdir instances/foo
       # echo snapshot > instances/foo/events/initcall/initcall_start/trigger
       # rmdir instances/foo
      
      Would produce:
      
       general protection fault: 0000 [#1] PREEMPT SMP PTI
       Modules linked in: tun bridge ...
       CPU: 5 PID: 6203 Comm: rmdir Tainted: G        W         4.17.0-rc4-test+ #933
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
       RIP: 0010:clear_event_triggers+0x3b/0x70
       RSP: 0018:ffffc90003783de0 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b2b RCX: 0000000000000000
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800c7130ba0
       RBP: ffffc90003783e00 R08: ffff8801131993f8 R09: 0000000100230016
       R10: ffffc90003783d80 R11: 0000000000000000 R12: ffff8800c7130ba0
       R13: ffff8800c7130bd8 R14: ffff8800cc093768 R15: 00000000ffffff9c
       FS:  00007f6f4aa86700(0000) GS:ffff88011eb40000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f6f4a5aed60 CR3: 00000000cd552001 CR4: 00000000001606e0
       Call Trace:
        event_trace_del_tracer+0x2a/0xc5
        instance_rmdir+0x15c/0x200
        tracefs_syscall_rmdir+0x52/0x90
        vfs_rmdir+0xdb/0x160
        do_rmdir+0x16d/0x1c0
        __x64_sys_rmdir+0x17/0x20
        do_syscall_64+0x55/0x1a0
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      This was due to the call the clears out the triggers when an instance is
      being deleted not removing the trigger from the link list.
      
      Cc: stable@vger.kernel.org
      Fixes: 85f2b082 ("tracing: Add basic event trigger framework")
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      86b389ff
  3. 11 5月, 2018 1 次提交
    • S
      tracing: Fix regex_match_front() to not over compare the test string · dc432c3d
      Steven Rostedt (VMware) 提交于
      The regex match function regex_match_front() in the tracing filter logic,
      was fixed to test just the pattern length from testing the entire test
      string. That is, it went from strncmp(str, r->pattern, len) to
      strcmp(str, r->pattern, r->len).
      
      The issue is that str is not guaranteed to be nul terminated, and if r->len
      is greater than the length of str, it can access more memory than is
      allocated.
      
      The solution is to add a simple test if (len < r->len) return 0.
      
      Cc: stable@vger.kernel.org
      Fixes: 285caad4 ("tracing/filters: Fix MATCH_FRONT_ONLY filter matching")
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      dc432c3d
  4. 03 5月, 2018 5 次提交
    • Z
      tracing: Fix the file mode of stack tracer · 0c5a9acc
      Zhengyuan Liu 提交于
      It looks weird that the stack_trace_filter file can be written by root
      but shows that it does not have write permission by ll command.
      
      Link: http://lkml.kernel.org/r/1518054113-28096-1-git-send-email-liuzhengyuan@kylinos.cnSigned-off-by: NZhengyuan Liu <liuzhengyuan@kylinos.cn>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      0c5a9acc
    • C
      ftrace: Have set_graph_* files have normal file modes · 1ce0500d
      Chen LinX 提交于
      The set_graph_function and set_graph_notrace file mode should be 0644
      instead of 0444 as they are writeable. Note, the mode appears to be ignored
      regardless, but they should at least look sane.
      
      Link: http://lkml.kernel.org/r/1409725869-4501-1-git-send-email-linx.z.chen@intel.comAcked-by: NNamhyung Kim <namhyung@kernel.org>
      Signed-off-by: NChen LinX <linx.z.chen@intel.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      1ce0500d
    • J
      bpf: sockmap, fix error handling in redirect failures · abaeb096
      John Fastabend 提交于
      When a redirect failure happens we release the buffers in-flight
      without calling a sk_mem_uncharge(), the uncharge is called before
      dropping the sock lock for the redirecte, however we missed updating
      the ring start index. When no apply actions are in progress this
      is OK because we uncharge the entire buffer before the redirect.
      But, when we have apply logic running its possible that only a
      portion of the buffer is being redirected. In this case we only
      do memory accounting for the buffer slice being redirected and
      expect to be able to loop over the BPF program again and/or if
      a sock is closed uncharge the memory at sock destruct time.
      
      With an invalid start index however the program logic looks at
      the start pointer index, checks the length, and when seeing the
      length is zero (from the initial release and failure to update
      the pointer) aborts without uncharging/releasing the remaining
      memory.
      
      The fix for this is simply to update the start index. To avoid
      fixing this error in two locations we do a small refactor and
      remove one case where it is open-coded. Then fix it in the
      single function.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      abaeb096
    • J
      bpf: sockmap, zero sg_size on error when buffer is released · fec51d40
      John Fastabend 提交于
      When an error occurs during a redirect we have two cases that need
      to be handled (i) we have a cork'ed buffer (ii) we have a normal
      sendmsg buffer.
      
      In the cork'ed buffer case we don't currently support recovering from
      errors in a redirect action. So the buffer is released and the error
      should _not_ be pushed back to the caller of sendmsg/sendpage. The
      rationale here is the user will get an error that relates to old
      data that may have been sent by some arbitrary thread on that sock.
      Instead we simple consume the data and tell the user that the data
      has been consumed. We may add proper error recovery in the future.
      However, this patch fixes a bug where the bytes outstanding counter
      sg_size was not zeroed. This could result in a case where if the user
      has both a cork'ed action and apply action in progress we may
      incorrectly call into the BPF program when the user expected an
      old verdict to be applied via the apply action. I don't have a use
      case where using apply and cork at the same time is valid but we
      never explicitly reject it because it should work fine. This patch
      ensures the sg_size is zeroed so we don't have this case.
      
      In the normal sendmsg buffer case (no cork data) we also do not
      zero sg_size. Again this can confuse the apply logic when the logic
      calls into the BPF program when the BPF programmer expected the old
      verdict to remain. So ensure we set sg_size to zero here as well. And
      additionally to keep the psock state in-sync with the sk_msg_buff
      release all the memory as well. Previously we did this before
      returning to the user but this left a gap where psock and sk_msg_buff
      states were out of sync which seems fragile. No additional overhead
      is taken here except for a call to check the length and realize its
      already been freed. This is in the error path as well so in my
      opinion lets have robust code over optimized error paths.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      fec51d40
    • J
      bpf: sockmap, fix scatterlist update on error path in send with apply · 3cc9a472
      John Fastabend 提交于
      When the call to do_tcp_sendpage() fails to send the complete block
      requested we either retry if only a partial send was completed or
      abort if we receive a error less than or equal to zero. Before
      returning though we must update the scatterlist length/offset to
      account for any partial send completed.
      
      Before this patch we did this at the end of the retry loop, but
      this was buggy when used while applying a verdict to fewer bytes
      than in the scatterlist. When the scatterlist length was being set
      we forgot to account for the apply logic reducing the size variable.
      So the result was we chopped off some bytes in the scatterlist without
      doing proper cleanup on them. This results in a WARNING when the
      sock is tore down because the bytes have previously been charged to
      the socket but are never uncharged.
      
      The simple fix is to simply do the accounting inside the retry loop
      subtracting from the absolute scatterlist values rather than trying
      to accumulate the totals and subtract at the end.
      Reported-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      3cc9a472
  5. 02 5月, 2018 4 次提交
  6. 01 5月, 2018 1 次提交
  7. 27 4月, 2018 5 次提交
  8. 26 4月, 2018 2 次提交
    • T
      Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME · a3ed0e43
      Thomas Gleixner 提交于
      Revert commits
      
      92af4dcb ("tracing: Unify the "boot" and "mono" tracing clocks")
      127bfa5f ("hrtimer: Unify MONOTONIC and BOOTTIME clock behavior")
      7250a404 ("posix-timers: Unify MONOTONIC and BOOTTIME clock behavior")
      d6c7270e ("timekeeping: Remove boot time specific code")
      f2d6fdbf ("Input: Evdev - unify MONOTONIC and BOOTTIME clock behavior")
      d6ed449a ("timekeeping: Make the MONOTONIC clock behave like the BOOTTIME clock")
      72199320 ("timekeeping: Add the new CLOCK_MONOTONIC_ACTIVE clock")
      
      As stated in the pull request for the unification of CLOCK_MONOTONIC and
      CLOCK_BOOTTIME, it was clear that we might have to revert the change.
      
      As reported by several folks systemd and other applications rely on the
      documented behaviour of CLOCK_MONOTONIC on Linux and break with the above
      changes. After resume daemons time out and other timeout related issues are
      observed. Rafael compiled this list:
      
      * systemd kills daemons on resume, after >WatchdogSec seconds
        of suspending (Genki Sky).  [Verified that that's because systemd uses
        CLOCK_MONOTONIC and expects it to not include the suspend time.]
      
      * systemd-journald misbehaves after resume:
        systemd-journald[7266]: File /var/log/journal/016627c3c4784cd4812d4b7e96a34226/system.journal
      corrupted or uncleanly shut down, renaming and replacing.
        (Mike Galbraith).
      
      * NetworkManager reports "networking disabled" and networking is broken
        after resume 50% of the time (Pavel).  [May be because of systemd.]
      
      * MATE desktop dims the display and starts the screensaver right after
        system resume (Pavel).
      
      * Full system hang during resume (me).  [May be due to systemd or NM or both.]
      
      That happens on debian and open suse systems.
      
      It's sad, that these problems were neither catched in -next nor by those
      folks who expressed interest in this change.
      Reported-by: NRafael J. Wysocki <rjw@rjwysocki.net>
      Reported-by: Genki Sky <sky@genki.is>,
      Reported-by: NPavel Machek <pavel@ucw.cz>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kevin Easton <kevin@guarana.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      a3ed0e43
    • T
      tick/sched: Do not mess with an enqueued hrtimer · 1f71addd
      Thomas Gleixner 提交于
      Kaike reported that in tests rdma hrtimers occasionaly stopped working. He
      did great debugging, which provided enough context to decode the problem.
      
      CPU 3			     	      	     CPU 2
      
      idle
      start sched_timer expires = 712171000000
       queue->next = sched_timer
      					    start rdmavt timer. expires = 712172915662
      					    lock(baseof(CPU3))
      tick_nohz_stop_tick()
      tick = 716767000000			    timerqueue_add(tmr)
      
      hrtimer_set_expires(sched_timer, tick);
        sched_timer->expires = 716767000000  <---- FAIL
      					     if (tmr->expires < queue->next->expires)
      hrtimer_start(sched_timer)		          queue->next = tmr;
      lock(baseof(CPU3))
      					     unlock(baseof(CPU3))
      timerqueue_remove()
      timerqueue_add()
      
      ts->sched_timer is queued and queue->next is pointing to it, but then
      ts->sched_timer.expires is modified.
      
      This not only corrupts the ordering of the timerqueue RB tree, it also
      makes CPU2 see the new expiry time of timerqueue->next->expires when
      checking whether timerqueue->next needs to be updated. So CPU2 sees that
      the rdma timer is earlier than timerqueue->next and sets the rdma timer as
      new next.
      
      Depending on whether it had also seen the new time at RB tree enqueue, it
      might have queued the rdma timer at the wrong place and then after removing
      the sched_timer the RB tree is completely hosed.
      
      The problem was introduced with a commit which tried to solve inconsistency
      between the hrtimer in the tick_sched data and the underlying hardware
      clockevent. It split out hrtimer_set_expires() to store the new tick time
      in both the NOHZ and the NOHZ + HIGHRES case, but missed the fact that in
      the NOHZ + HIGHRES case the hrtimer might still be queued.
      
      Use hrtimer_start(timer, tick...) for the NOHZ + HIGHRES case which sets
      timer->expires after canceling the timer and move the hrtimer_set_expires()
      invocation into the NOHZ only code path which is not affected as it merily
      uses the hrtimer as next event storage so code pathes can be shared with
      the NOHZ + HIGHRES case.
      
      Fixes: d4af6d93 ("nohz: Fix spurious warning when hrtimer and clockevent get out of sync")
      Reported-by: N"Wan Kaike" <kaike.wan@intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NFrederic Weisbecker <frederic@kernel.org>
      Cc: "Marciniszyn Mike" <mike.marciniszyn@intel.com>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: linux-rdma@vger.kernel.org
      Cc: "Dalessandro Dennis" <dennis.dalessandro@intel.com>
      Cc: "Fleck John" <john.fleck@intel.com>
      Cc: stable@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: "Weiny Ira" <ira.weiny@intel.com>
      Cc: "linux-rdma@vger.kernel.org"
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1804241637390.1679@nanos.tec.linutronix.de
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1804242119210.1597@nanos.tec.linutronix.de
      
      1f71addd
  9. 25 4月, 2018 3 次提交
    • P
      tracing: Fix missing tab for hwlat_detector print format · 9a0fd675
      Peter Xu 提交于
      It's been missing for a while but no one is touching that up.  Fix it.
      
      Link: http://lkml.kernel.org/r/20180315060639.9578-1-peterx@redhat.com
      
      CC: Ingo Molnar <mingo@kernel.org>
      Cc:stable@vger.kernel.org
      Fixes: 7b2c8625 ("tracing: Add NMI tracing in hwlat detector")
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      9a0fd675
    • T
      kprobes: Fix random address output of blacklist file · bcbd385b
      Thomas Richter 提交于
      File /sys/kernel/debug/kprobes/blacklist displays random addresses:
      
      [root@s8360046 linux]# cat /sys/kernel/debug/kprobes/blacklist
      0x0000000047149a90-0x00000000bfcb099a	print_type_x8
      ....
      
      This breaks 'perf probe' which uses the blacklist file to prohibit
      probes on certain functions by checking the address range.
      
      Fix this by printing the correct (unhashed) address.
      
      The file mode is read all but this is not an issue as the file
      hierarchy points out:
       # ls -ld /sys/ /sys/kernel/ /sys/kernel/debug/ /sys/kernel/debug/kprobes/
      	/sys/kernel/debug/kprobes/blacklist
      dr-xr-xr-x 12 root root 0 Apr 19 07:56 /sys/
      drwxr-xr-x  8 root root 0 Apr 19 07:56 /sys/kernel/
      drwx------ 16 root root 0 Apr 19 06:56 /sys/kernel/debug/
      drwxr-xr-x  2 root root 0 Apr 19 06:56 /sys/kernel/debug/kprobes/
      -r--r--r--  1 root root 0 Apr 19 06:56 /sys/kernel/debug/kprobes/blacklist
      
      Everything in and below /sys/kernel/debug is rwx to root only,
      no group or others have access.
      
      Background:
      Directory /sys/kernel/debug/kprobes is created by debugfs_create_dir()
      which sets the mode bits to rwxr-xr-x. Maybe change that to use the
      parent's directory mode bits instead?
      
      Link: http://lkml.kernel.org/r/20180419105556.86664-1-tmricht@linux.ibm.com
      
      Fixes: ad67b74d ("printk: hash addresses printed with %p")
      Cc: stable@vger.kernel.org
      Cc: <stable@vger.kernel.org> # v4.15+
      Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: David S Miller <davem@davemloft.net>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: acme@kernel.org
      Signed-off-by: NThomas Richter <tmricht@linux.ibm.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      bcbd385b
    • R
      tracing: Fix kernel crash while using empty filter with perf · ba16293d
      Ravi Bangoria 提交于
      Kernel is crashing when user tries to record 'ftrace:function' event
      with empty filter:
      
        # perf record -e ftrace:function --filter="" ls
      
        # dmesg
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        Oops: 0000 [#1] SMP PTI
        ...
        RIP: 0010:ftrace_profile_set_filter+0x14b/0x2d0
        RSP: 0018:ffffa4a7c0da7d20 EFLAGS: 00010246
        RAX: ffffa4a7c0da7d64 RBX: 0000000000000000 RCX: 0000000000000006
        RDX: 0000000000000000 RSI: 0000000000000092 RDI: ffff8c48ffc968f0
        ...
        Call Trace:
         _perf_ioctl+0x54a/0x6b0
         ? rcu_all_qs+0x5/0x30
        ...
      
      After patch:
        # perf record -e ftrace:function --filter="" ls
        failed to set filter "" on event ftrace:function with 22 (Invalid argument)
      
      Also, if user tries to echo "" > filter, it used to throw an error.
      This behavior got changed by commit 80765597 ("tracing: Rewrite
      filter logic to be simpler and faster"). This patch restores the
      behavior as a side effect:
      
      Before patch:
        # echo "" > filter
        #
      
      After patch:
        # echo "" > filter
        bash: echo: write error: Invalid argument
        #
      
      Link: http://lkml.kernel.org/r/20180420150758.19787-1-ravi.bangoria@linux.ibm.com
      
      Fixes: 80765597 ("tracing: Rewrite filter logic to be simpler and faster")
      Signed-off-by: NRavi Bangoria <ravi.bangoria@linux.ibm.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      ba16293d
  10. 24 4月, 2018 3 次提交
    • J
      bpf: sockmap, fix double page_put on ENOMEM error in redirect path · 4fcfdfb8
      John Fastabend 提交于
      In the case where the socket memory boundary is hit the redirect
      path returns an ENOMEM error. However, before checking for this
      condition the redirect scatterlist buffer is setup with a valid
      page and length. This is never unwound so when the buffers are
      released latter in the error path we do a put_page() and clear
      the scatterlist fields. But, because the initial error happens
      before completing the scatterlist buffer we end up with both the
      original buffer and the redirect buffer pointing to the same page
      resulting in duplicate put_page() calls.
      
      To fix this simply move the initial configuration of the redirect
      scatterlist buffer below the sock memory check.
      
      Found this while running TCP_STREAM test with netperf using Cilium.
      
      Fixes: fa246693 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      4fcfdfb8
    • J
      bpf: sockmap, sk_wait_event needed to handle blocking cases · e20f7334
      John Fastabend 提交于
      In the recvmsg handler we need to add a wait event to support the
      blocking use cases. Without this we return zero and may confuse
      user applications. In the wait event any data received on the
      sk either via sk_receive_queue or the psock ingress list will
      wake up the sock.
      
      Fixes: fa246693 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      e20f7334
    • J
      bpf: sockmap, map_release does not hold refcnt for pinned maps · ba6b8de4
      John Fastabend 提交于
      Relying on map_release hook to decrement the reference counts when a
      map is removed only works if the map is not being pinned. In the
      pinned case the ref is decremented immediately and the BPF programs
      released. After this BPF programs may not be in-use which is not
      what the user would expect.
      
      This patch moves the release logic into bpf_map_put_uref() and brings
      sockmap in-line with how a similar case is handled in prog array maps.
      
      Fixes: 3d9e9526 ("bpf: sockmap, fix leaking maps with attached but not detached progs")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      ba6b8de4
  11. 21 4月, 2018 2 次提交
    • K
      fork: unconditionally clear stack on fork · e01e8063
      Kees Cook 提交于
      One of the classes of kernel stack content leaks[1] is exposing the
      contents of prior heap or stack contents when a new process stack is
      allocated.  Normally, those stacks are not zeroed, and the old contents
      remain in place.  In the face of stack content exposure flaws, those
      contents can leak to userspace.
      
      Fixing this will make the kernel no longer vulnerable to these flaws, as
      the stack will be wiped each time a stack is assigned to a new process.
      There's not a meaningful change in runtime performance; it almost looks
      like it provides a benefit.
      
      Performing back-to-back kernel builds before:
      	Run times: 157.86 157.09 158.90 160.94 160.80
      	Mean: 159.12
      	Std Dev: 1.54
      
      and after:
      	Run times: 159.31 157.34 156.71 158.15 160.81
      	Mean: 158.46
      	Std Dev: 1.46
      
      Instead of making this a build or runtime config, Andy Lutomirski
      recommended this just be enabled by default.
      
      [1] A noisy search for many kinds of stack content leaks can be seen here:
      https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux+kernel+stack+leak
      
      I did some more with perf and cycle counts on running 100,000 execs of
      /bin/true.
      
      before:
      Cycles: 218858861551 218853036130 214727610969 227656844122 224980542841
      Mean:  221015379122.60
      Std Dev: 4662486552.47
      
      after:
      Cycles: 213868945060 213119275204 211820169456 224426673259 225489986348
      Mean:  217745009865.40
      Std Dev: 5935559279.99
      
      It continues to look like it's faster, though the deviation is rather
      wide, but I'm not sure what I could do that would be less noisy.  I'm
      open to ideas!
      
      Link: http://lkml.kernel.org/r/20180221021659.GA37073@beastSigned-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e01e8063
    • J
      bpf: sockmap remove dead check · 6ab690aa
      Jann Horn 提交于
      Remove dead code that bails on `attr->value_size > KMALLOC_MAX_SIZE` - the
      previous check already bails on `attr->value_size != 4`.
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      6ab690aa
  12. 19 4月, 2018 2 次提交
  13. 17 4月, 2018 8 次提交