1. 03 5月, 2018 2 次提交
    • J
      bpf: sockmap, zero sg_size on error when buffer is released · fec51d40
      John Fastabend 提交于
      When an error occurs during a redirect we have two cases that need
      to be handled (i) we have a cork'ed buffer (ii) we have a normal
      sendmsg buffer.
      
      In the cork'ed buffer case we don't currently support recovering from
      errors in a redirect action. So the buffer is released and the error
      should _not_ be pushed back to the caller of sendmsg/sendpage. The
      rationale here is the user will get an error that relates to old
      data that may have been sent by some arbitrary thread on that sock.
      Instead we simple consume the data and tell the user that the data
      has been consumed. We may add proper error recovery in the future.
      However, this patch fixes a bug where the bytes outstanding counter
      sg_size was not zeroed. This could result in a case where if the user
      has both a cork'ed action and apply action in progress we may
      incorrectly call into the BPF program when the user expected an
      old verdict to be applied via the apply action. I don't have a use
      case where using apply and cork at the same time is valid but we
      never explicitly reject it because it should work fine. This patch
      ensures the sg_size is zeroed so we don't have this case.
      
      In the normal sendmsg buffer case (no cork data) we also do not
      zero sg_size. Again this can confuse the apply logic when the logic
      calls into the BPF program when the BPF programmer expected the old
      verdict to remain. So ensure we set sg_size to zero here as well. And
      additionally to keep the psock state in-sync with the sk_msg_buff
      release all the memory as well. Previously we did this before
      returning to the user but this left a gap where psock and sk_msg_buff
      states were out of sync which seems fragile. No additional overhead
      is taken here except for a call to check the length and realize its
      already been freed. This is in the error path as well so in my
      opinion lets have robust code over optimized error paths.
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      fec51d40
    • J
      bpf: sockmap, fix scatterlist update on error path in send with apply · 3cc9a472
      John Fastabend 提交于
      When the call to do_tcp_sendpage() fails to send the complete block
      requested we either retry if only a partial send was completed or
      abort if we receive a error less than or equal to zero. Before
      returning though we must update the scatterlist length/offset to
      account for any partial send completed.
      
      Before this patch we did this at the end of the retry loop, but
      this was buggy when used while applying a verdict to fewer bytes
      than in the scatterlist. When the scatterlist length was being set
      we forgot to account for the apply logic reducing the size variable.
      So the result was we chopped off some bytes in the scatterlist without
      doing proper cleanup on them. This results in a WARNING when the
      sock is tore down because the bytes have previously been charged to
      the socket but are never uncharged.
      
      The simple fix is to simply do the accounting inside the retry loop
      subtracting from the absolute scatterlist values rather than trying
      to accumulate the totals and subtract at the end.
      Reported-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      3cc9a472
  2. 24 4月, 2018 3 次提交
    • J
      bpf: sockmap, fix double page_put on ENOMEM error in redirect path · 4fcfdfb8
      John Fastabend 提交于
      In the case where the socket memory boundary is hit the redirect
      path returns an ENOMEM error. However, before checking for this
      condition the redirect scatterlist buffer is setup with a valid
      page and length. This is never unwound so when the buffers are
      released latter in the error path we do a put_page() and clear
      the scatterlist fields. But, because the initial error happens
      before completing the scatterlist buffer we end up with both the
      original buffer and the redirect buffer pointing to the same page
      resulting in duplicate put_page() calls.
      
      To fix this simply move the initial configuration of the redirect
      scatterlist buffer below the sock memory check.
      
      Found this while running TCP_STREAM test with netperf using Cilium.
      
      Fixes: fa246693 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      4fcfdfb8
    • J
      bpf: sockmap, sk_wait_event needed to handle blocking cases · e20f7334
      John Fastabend 提交于
      In the recvmsg handler we need to add a wait event to support the
      blocking use cases. Without this we return zero and may confuse
      user applications. In the wait event any data received on the
      sk either via sk_receive_queue or the psock ingress list will
      wake up the sock.
      
      Fixes: fa246693 ("bpf: sockmap, BPF_F_INGRESS flag for BPF_SK_SKB_STREAM_VERDICT")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      e20f7334
    • J
      bpf: sockmap, map_release does not hold refcnt for pinned maps · ba6b8de4
      John Fastabend 提交于
      Relying on map_release hook to decrement the reference counts when a
      map is removed only works if the map is not being pinned. In the
      pinned case the ref is decremented immediately and the BPF programs
      released. After this BPF programs may not be in-use which is not
      what the user would expect.
      
      This patch moves the release logic into bpf_map_put_uref() and brings
      sockmap in-line with how a similar case is handled in prog array maps.
      
      Fixes: 3d9e9526 ("bpf: sockmap, fix leaking maps with attached but not detached progs")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      ba6b8de4
  3. 21 4月, 2018 2 次提交
    • K
      fork: unconditionally clear stack on fork · e01e8063
      Kees Cook 提交于
      One of the classes of kernel stack content leaks[1] is exposing the
      contents of prior heap or stack contents when a new process stack is
      allocated.  Normally, those stacks are not zeroed, and the old contents
      remain in place.  In the face of stack content exposure flaws, those
      contents can leak to userspace.
      
      Fixing this will make the kernel no longer vulnerable to these flaws, as
      the stack will be wiped each time a stack is assigned to a new process.
      There's not a meaningful change in runtime performance; it almost looks
      like it provides a benefit.
      
      Performing back-to-back kernel builds before:
      	Run times: 157.86 157.09 158.90 160.94 160.80
      	Mean: 159.12
      	Std Dev: 1.54
      
      and after:
      	Run times: 159.31 157.34 156.71 158.15 160.81
      	Mean: 158.46
      	Std Dev: 1.46
      
      Instead of making this a build or runtime config, Andy Lutomirski
      recommended this just be enabled by default.
      
      [1] A noisy search for many kinds of stack content leaks can be seen here:
      https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux+kernel+stack+leak
      
      I did some more with perf and cycle counts on running 100,000 execs of
      /bin/true.
      
      before:
      Cycles: 218858861551 218853036130 214727610969 227656844122 224980542841
      Mean:  221015379122.60
      Std Dev: 4662486552.47
      
      after:
      Cycles: 213868945060 213119275204 211820169456 224426673259 225489986348
      Mean:  217745009865.40
      Std Dev: 5935559279.99
      
      It continues to look like it's faster, though the deviation is rather
      wide, but I'm not sure what I could do that would be less noisy.  I'm
      open to ideas!
      
      Link: http://lkml.kernel.org/r/20180221021659.GA37073@beastSigned-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e01e8063
    • J
      bpf: sockmap remove dead check · 6ab690aa
      Jann Horn 提交于
      Remove dead code that bails on `attr->value_size > KMALLOC_MAX_SIZE` - the
      previous check already bails on `attr->value_size != 4`.
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      6ab690aa
  4. 19 4月, 2018 1 次提交
  5. 17 4月, 2018 9 次提交
  6. 14 4月, 2018 14 次提交
  7. 12 4月, 2018 9 次提交