1. 06 Jul 2016 (2 commits)
  2. 05 Jul 2016 (5 commits)
  3. 03 Jul 2016 (2 commits)
  4. 02 Jul 2016 (3 commits)
    • cgroup: Add cgroup_get_from_fd · 1f3fe7eb
      Authored by Martin KaFai Lau
      Add a helper function to get a cgroup2 from an fd.  It will be
      stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY), which will
      be introduced in a later patch.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f3fe7eb
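      Sketched below is what such a helper plausibly looks like (a hedged
      sketch based on this description: the helpers used inside the body and
      the exact error codes are assumptions, not necessarily the code as
      merged).  The idea is to pin the file, resolve its cgroup css, and
      reject anything that is not on the cgroup2 (default) hierarchy:

      	/* Sketch only: resolve a cgroup2 from a user-supplied fd. */
      	struct cgroup *cgroup_get_from_fd(int fd)
      	{
      		struct cgroup_subsys_state *css;
      		struct cgroup *cgrp;
      		struct file *f;

      		f = fget_raw(fd);
      		if (!f)
      			return ERR_PTR(-EBADF);

      		/* takes a reference on the cgroup behind the fd's dentry */
      		css = css_tryget_online_from_dir(f->f_path.dentry, NULL);
      		fput(f);
      		if (IS_ERR(css))
      			return ERR_CAST(css);

      		cgrp = css->cgroup;
      		if (!cgroup_on_dfl(cgrp)) {	/* must be cgroup2 */
      			cgroup_put(cgrp);
      			return ERR_PTR(-EBADF);
      		}
      		return cgrp;
      	}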
    • bpf: refactor bpf_prog_get and type check into helper · 113214be
      Authored by Daniel Borkmann
      Since bpf_prog_get() and the program type check are used in a couple of
      places, refactor this into a small helper function that we can make use
      of.  Since the non-RO prog->aux part is not used in performance-critical
      paths and a program destruction via RCU is very unlikely when doing the
      put, we wouldn't have an issue just doing the bpf_prog_get() +
      prog->type != type check; but not taking the ref at all (since we are
      inside the fdget() / fdput() section of the bpf fd) is even cleaner and
      makes the diff smaller as well, so just go for that.  Call sites are
      changed to make use of the new helper where possible.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      113214be
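      A minimal sketch of the helper described above, under assumptions (the
      __bpf_prog_get() name and the private_data access are illustrative;
      only the fdget()/fdput() trick is taken from the text): the fd
      reference keeps the program alive for the duration of the type check,
      so no temporary prog reference is needed:

      	static struct bpf_prog *__bpf_prog_get(u32 ufd, enum bpf_prog_type *type)
      	{
      		struct fd f = fdget(ufd);
      		struct bpf_prog *prog;

      		if (!f.file)
      			return ERR_PTR(-EBADF);
      		/* a real implementation would also verify f.file->f_op
      		 * to make sure this really is a bpf prog fd */
      		prog = f.file->private_data;

      		if (type && prog->type != *type) {
      			prog = ERR_PTR(-EINVAL);
      			goto out;
      		}
      		prog = bpf_prog_inc(prog);	/* take the ref only on success */
      	out:
      		fdput(f);
      		return prog;
      	}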
    • bpf: generally move prog destruction to RCU deferral · 1aacde3d
      Authored by Daniel Borkmann
      Jann Horn reported the following analysis of a potential use-after-free
      race that would be very hard (if not impossible) to trigger; quoting
      his event timeline:
      
       - Set up a process with threads T1, T2 and T3
       - Let T1 set up a socket filter F1 that invokes another filter F2
         through a BPF map [tail call]
       - Let T1 trigger the socket filter via a unix domain socket write,
         don't wait for completion
       - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion
       - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put()
       - Let T3 close the file descriptor for F2, dropping the reference
         count of F2 to 2
       - At this point, T1 should have looked up F2 from the map, but not
         finished executing it
       - Let T3 remove F2 from the BPF map, dropping the reference count of
         F2 to 1
       - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping
         the reference count of F2 to 0 and scheduling bpf_prog_free_deferred()
         via schedule_work()
       - At this point, the BPF program could be freed
       - BPF execution is still running in a freed BPF program
      
      While at PERF_EVENT_IOC_SET_BPF time it is guaranteed that the perf
      event fd we're doing the syscall on doesn't disappear from underneath
      us for the whole syscall duration, the same may not hold for the bpf fd
      passed as an argument once we have done the put.  It only needs to be a
      valid fd pointing to a BPF program at the time of the call for the
      bpf_prog_get() to succeed, and while T2 is preempted, F2 must have
      dropped its reference count to 1 on the other CPU.  The fput() from the
      close() in T3 also adds additional delay to the reference drop via
      exit_task_work() when bpf_prog_release() gets called, as does the
      scheduling of bpf_prog_free_deferred().
      
      That said, it nevertheless makes sense to generally move BPF prog
      destruction after an RCU grace period, to guarantee that the scenario
      above, as well as others like the tail-call case recently fixed in
      ceb56070 ("bpf, perf: delay release of BPF prog after grace period"),
      cannot happen.  Integrating bpf_prog_free_deferred() directly into the
      RCU callback is not allowed since the invocation might happen from
      either softirq or process context, so we're not permitted to block.
      All bpf_prog_put() invocations from the eBPF side (note, cBPF -> eBPF
      progs don't use this for their destruction) were reviewed and look fine
      with call_rcu().
      
      Since we don't know at program attach time whether we're already part
      of a tail call map, we need to use the RCU variant.  This won't put
      significantly more stress on the RCU callback queue, however:
      situations with the above bpf_prog_get() and bpf_prog_put() combo in
      practice normally won't lead to releases, and even if they did, enough
      effort/cycles already have to be put into loading a BPF program into
      the kernel.
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1aacde3d
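      A hedged sketch of the destruction path this yields (the structure
      follows the description above; exact field and helper names are
      assumptions): the final put queues an RCU callback, and since that
      callback may run in softirq context, anything that can sleep is
      deferred once more via schedule_work() inside bpf_prog_free():

      	static void __bpf_prog_put_rcu(struct rcu_head *rcu)
      	{
      		struct bpf_prog_aux *aux = container_of(rcu, struct bpf_prog_aux, rcu);

      		/* must not block here (possibly softirq context);
      		 * bpf_prog_free() internally defers the blocking parts
      		 * to bpf_prog_free_deferred() via schedule_work() */
      		free_used_maps(aux);
      		bpf_prog_free(aux->prog);
      	}

      	void bpf_prog_put(struct bpf_prog *prog)
      	{
      		if (atomic_dec_and_test(&prog->aux->refcnt))
      			call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
      	}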
  5. 01 Jul 2016 (6 commits)
  6. 29 Jun 2016 (2 commits)
  7. 28 Jun 2016 (3 commits)
  8. 27 Jun 2016 (3 commits)
  9. 25 Jun 2016 (4 commits)
    • Revert "mm: make faultaround produce old ptes" · 315d09bf
      Authored by Kirill A. Shutemov
      This reverts commit 5c0a85fa.
      
      The commit causes a ~6% regression in unixbench.
      
      Let's revert it for now and consider another solution for the reclaim
      problem later.
      
      Link: http://lkml.kernel.org/r/1465893750-44080-2-git-send-email-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      315d09bf
    • mm: mempool: kasan: don't put mempool objects in quarantine · 9b75a867
      Authored by Andrey Ryabinin
      Currently we may put elements reserved by a mempool into quarantine via
      kasan_kfree().  This is totally wrong, since quarantine may really free
      these objects.  So when the mempool later tries to use such an element,
      a use-after-free will happen.  Or the mempool may decide that it no
      longer needs that element and double-free it.
      
      So don't put the object into quarantine in kasan_kfree(); just poison
      it.  Rename kasan_kfree() to kasan_poison_kfree() to reflect that.
      
      Also, we shouldn't use kasan_slab_alloc()/kasan_krealloc() in
      kasan_unpoison_element(), because those functions may update the
      allocation stacktrace.  This would be wrong for most of the
      remove_element() call sites.
      
      (The only call site where we may want to update the alloc stacktrace is
       in mempool_alloc().  Kmemleak solves this by calling
       kmemleak_update_trace(), so we could do something like that too.  But
       that is out of scope for this patch.)
      
      Fixes: 55834c59 ("mm: kasan: initial memory quarantine implementation")
      Link: http://lkml.kernel.org/r/575977C3.1010905@virtuozzo.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Dmitriy Vyukov <dvyukov@google.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b75a867
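      A sketch of the resulting behaviour (the internal poisoning helpers
      named in the body are assumptions about kasan internals): on "free" of
      a mempool-owned object, only mark its shadow memory as freed; never
      hand it to quarantine, which could genuinely release it:

      	void kasan_poison_kfree(void *ptr)
      	{
      		struct page *page = virt_to_head_page(ptr);

      		if (unlikely(!PageSlab(page)))
      			kasan_poison_shadow(ptr,
      					    PAGE_SIZE << compound_order(page),
      					    KASAN_FREE_PAGE);
      		else
      			kasan_poison_slab_free(page->slab_cache, ptr);
      	}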
    • fix up initial thread stack pointer vs thread_info confusion · 7f1a00b6
      Authored by Linus Torvalds
      The INIT_TASK() initializer had the same stack vs thread_info confusion
      that the allocators had, which was fixed in commit b235beea ("Clarify
      naming of thread info/stack allocators").
      
      The task ->stack pointer only incidentally ends up having the same value
      as the thread_info, and in fact that will change.
      
      So fix the initial task struct initializer to point to 'init_stack'
      instead of 'init_thread_info', and make sure the ia64 definition for
      that exists.
      
      This actually makes the ia64 tsk->stack pointer be sensible for the
      initial task, but not for any other task.  As mentioned in commit
      b235beea, that whole pointer isn't actually used on ia64, since
      task_stack_page() there just points to the (single) allocation.
      
      All the other architectures seem to have copied the 'init_stack'
      definition, even if it tended to be generally unused.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f1a00b6
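      The net effect on the initializer, sketched from the description above
      (surrounding INIT_TASK() fields omitted):

      	/* before: conflated the two identically-valued pointers */
      	.stack = &init_thread_info,
      	/* after: point at the stack allocation itself */
      	.stack = init_stack,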
    • Clarify naming of thread info/stack allocators · b235beea
      Authored by Linus Torvalds
      We've had the thread info allocated together with the thread stack for
      most architectures for a long time (since the thread_info was split off
      from the task struct), but that is about to change.
      
      But the patches that move the thread info to be off-stack (and a part of
      the task struct instead) made it clear how confused the allocator and
      freeing functions are.
      
      Because the common case was that we share an allocation with the thread
      stack and the thread_info, the two pointers were identical.  That
      identity then meant that we would have things like
      
      	ti = alloc_thread_info_node(tsk, node);
      	...
      	tsk->stack = ti;
      
      which certainly _worked_ (since stack and thread_info have the same
      value), but is rather confusing: why are we assigning a thread_info to
      the stack? And if we move the thread_info away, the "confusing" code
      just gets to be entirely bogus.
      
      So remove all this confusion, and make it clear that we are doing the
      stack allocation by renaming and clarifying the function names to be
      about the stack.  The fact that the thread_info then shares the
      allocation is an implementation detail, and not really about the
      allocation itself.
      
      This is a pure renaming and type fix: we pass in the same pointer, it's
      just that we clarify what the pointer means.
      
      The ia64 code that actually has only one single allocation (for all of
      task_struct, thread_info and the kernel thread stack) now looks a bit
      odd, but since "tsk->stack" is not actually even used there, that
      oddity doesn't matter.  Cleaning that up would be a separate change; I
      intentionally left the ia64 changes as a pure brute-force renaming and
      type change.
      Acked-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b235beea
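      A before/after sketch of the rename described above, assuming the new
      name chosen is alloc_thread_stack_node():

      	/* before: allocates the stack, but name and type say thread_info */
      	ti = alloc_thread_info_node(tsk, node);
      	tsk->stack = ti;

      	/* after: same allocation, now named for what it really returns */
      	stack = alloc_thread_stack_node(tsk, node);
      	tsk->stack = stack;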
  10. 24 Jun 2016 (4 commits)
  11. 23 Jun 2016 (2 commits)
  12. 22 Jun 2016 (1 commit)
    • rxrpc: Fix exclusive connection handling · cc8feb8e
      Authored by David Howells
      "Exclusive connections" are meant to be used for a single client call and
      then scrapped.  The idea is to limit the use of the negotiated security
      context.  The current code, however, isn't doing this: it is instead
      restricting the socket to a single virtual connection and doing all the
      calls over that.
      
      This is changed such that the socket no longer maintains a special virtual
      connection over which it will do all the calls, but rather gets a new one
      each time a new exclusive call is made.
      
      Further, using a socket option for this is a poor choice.  It should be
      done on sendmsg with a control message marker instead, so that calls
      can be marked exclusive individually.  To that end, add
      RXRPC_EXCLUSIVE_CALL which, if passed to sendmsg() as a control message
      element, will cause the call to be done on a single-use connection.
      
      The socket option (RXRPC_EXCLUSIVE_CONNECTION) still exists and, if set,
      will override any lack of RXRPC_EXCLUSIVE_CALL being specified so that
      programs using the setsockopt() will appear to work the same.
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc8feb8e
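      A hypothetical userspace sketch of marking a single call exclusive via
      the new control message (heavily simplified: a real AF_RXRPC call also
      needs a target address and an RXRPC_USER_CALL_ID control message, both
      omitted here):

      	#include <string.h>
      	#include <sys/socket.h>
      	#include <linux/rxrpc.h>

      	/* Attach RXRPC_EXCLUSIVE_CALL to this one sendmsg() so that
      	 * only this call runs on a single-use connection. */
      	static ssize_t send_exclusive(int fd, const void *buf, size_t len)
      	{
      		struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
      		char cbuf[CMSG_SPACE(0)];
      		struct msghdr msg = {
      			.msg_iov	= &iov,
      			.msg_iovlen	= 1,
      			.msg_control	= cbuf,
      			.msg_controllen	= sizeof(cbuf),
      		};
      		struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

      		memset(cbuf, 0, sizeof(cbuf));
      		cmsg->cmsg_level = SOL_RXRPC;
      		cmsg->cmsg_type  = RXRPC_EXCLUSIVE_CALL;
      		cmsg->cmsg_len   = CMSG_LEN(0);	/* marker only, no payload */

      		return sendmsg(fd, &msg, 0);
      	}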
  13. 20 Jun 2016 (1 commit)
  14. 19 Jun 2016 (2 commits)
    • ipv6: RFC 4884 partial support for SIT/GRE tunnels · 20e1954f
      Authored by Eric Dumazet
      When receiving an ICMPv4 message containing extensions as
      defined in RFC 4884, and translating it to ICMPv6 at a SIT
      or GRE tunnel, we need some extra manipulation in order
      to properly forward the extensions.
      
      This patch only takes care of Time Exceeded messages as they
      are the ones that typically carry information from various
      routers in a fabric during a traceroute session.
      
      It also avoids complex skb logic if the data_len is not
      a multiple of 8.
      
      The RFC states:
      
         The "original datagram" field MUST contain at least 128 octets.
         If the original datagram did not contain 128 octets, the
         "original datagram" field MUST be zero padded to 128 octets.
      
      In practice, routers use 128 bytes of the original datagram, not more.
      
      The initial translation was added in commit ca15a078
      ("sit: generate icmpv6 error when receiving icmpv4 error").
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Oussama Ghorbel <ghorbel@pivasoftware.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      20e1954f
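      An illustrative helper for the unit mismatch behind that restriction
      (purely a sketch under assumptions, not the merged code): ICMPv4
      expresses the RFC 4884 original-datagram length in 32-bit words while
      ICMPv6 uses 64-bit words, so a length that is not a multiple of 8 bytes
      cannot be carried over without repadding the skb:

      	/* Sketch: convert the RFC 4884 length field from ICMPv4 units
      	 * (32-bit words) to ICMPv6 units (64-bit words); 0 means "do
      	 * not carry the extensions over". */
      	static inline u8 icmp4_to_icmp6_ext_len(u8 v4_words)
      	{
      		unsigned int bytes = v4_words * 4;

      		if (bytes & 7)	/* avoid complex skb surgery */
      			return 0;
      		return bytes >> 3;
      	}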
    • ipv6: translate ICMP_TIME_EXCEEDED to ICMPV6_TIME_EXCEED · 2d7a3b27
      Authored by Eric Dumazet
      For better traceroute/mtr support for SIT and GRE tunnels, we translate
      the IPv4 ICMP_TIME_EXCEEDED message to ICMPV6_TIME_EXCEED.
      
      We also have to translate the IPv4 source IP address of the ICMP
      message to an IPv6 v4-mapped address.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2d7a3b27
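      A minimal sketch of the v4-mapped translation mentioned above
      (illustrative; ipv6_addr_set() is an existing kernel helper, but the
      wrapper name here is made up):

      	/* Embed the ICMP sender's IPv4 source as ::ffff:a.b.c.d so the
      	 * translated ICMPV6_TIME_EXCEED has a meaningful IPv6 source. */
      	static inline void icmp_src_to_v4mapped(__be32 saddr, struct in6_addr *v6)
      	{
      		ipv6_addr_set(v6, 0, 0, htonl(0x0000ffff), saddr);
      	}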