1. 30 Jan 2018 (5 commits)
    • dev: advertise the new ifindex when the netns iface changes · 38e01b30
      Nicolas Dichtel authored
      The goal is to let the user follow an interface that moves to another
      netns.
      
      CC: Jiri Benc <jbenc@redhat.com>
      CC: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dev: always advertise the new nsid when the netns iface changes · c36ac8e2
      Nicolas Dichtel authored
      The user should be able to follow any interface that moves to another
      netns.  There is no reason to hide physical interfaces.
      
      CC: Jiri Benc <jbenc@redhat.com>
      CC: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: Jiri Benc <jbenc@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rtnetlink: enable IFLA_IF_NETNSID for RTM_DELLINK · b61ad68a
      Christian Brauner authored
      - Backwards Compatibility:
        If userspace wants to determine whether RTM_DELLINK supports the
        IFLA_IF_NETNSID property they should first send an RTM_GETLINK request
  with IFLA_IF_NETNSID on lo. If either EACCES is returned or the reply
        does not include IFLA_IF_NETNSID userspace should assume that
        IFLA_IF_NETNSID is not supported on this kernel.
        If the reply does contain an IFLA_IF_NETNSID property userspace
  can send an RTM_DELLINK with an IFLA_IF_NETNSID property. If they receive
  EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property
  with RTM_DELLINK. Userspace should then fall back to other means.
      
      - Security:
        Callers must have CAP_NET_ADMIN in the owning user namespace of the
        target network namespace.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
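
      A minimal userspace sketch of the probe described above, under these
      assumptions: raw rtnetlink, 4.15+ headers defining IFLA_IF_NETNSID, an
      nsid of 0 already assigned in the caller's netns, and error handling
      trimmed for brevity:

        #include <errno.h>
        #include <string.h>
        #include <stdio.h>
        #include <unistd.h>
        #include <sys/socket.h>
        #include <linux/netlink.h>
        #include <linux/rtnetlink.h>
        #include <linux/if_link.h>

        int main(void)
        {
                struct {
                        struct nlmsghdr nh;
                        struct ifinfomsg ifm;
                        char attrs[64];
                } req = {0};
                struct sockaddr_nl sa = { .nl_family = AF_NETLINK };
                struct rtattr *rta;
                __s32 nsid = 0;  /* assumed: an nsid already assigned here */
                int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

                req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(req.ifm));
                req.nh.nlmsg_type = RTM_GETLINK;
                req.nh.nlmsg_flags = NLM_F_REQUEST;
                req.ifm.ifi_family = AF_UNSPEC;
                req.ifm.ifi_index = 1;  /* lo */

                /* append the IFLA_IF_NETNSID attribute */
                rta = (struct rtattr *)((char *)&req +
                                        NLMSG_ALIGN(req.nh.nlmsg_len));
                rta->rta_type = IFLA_IF_NETNSID;
                rta->rta_len = RTA_LENGTH(sizeof(nsid));
                memcpy(RTA_DATA(rta), &nsid, sizeof(nsid));
                req.nh.nlmsg_len = NLMSG_ALIGN(req.nh.nlmsg_len) + rta->rta_len;

                sendto(fd, &req, req.nh.nlmsg_len, 0,
                       (struct sockaddr *)&sa, sizeof(sa));

                char buf[8192];
                ssize_t n = recv(fd, buf, sizeof(buf), 0);
                struct nlmsghdr *nh = (struct nlmsghdr *)buf;

                if (n <= 0 || (nh->nlmsg_type == NLMSG_ERROR &&
                               ((struct nlmsgerr *)NLMSG_DATA(nh))->error == -EACCES)) {
                        puts("assume IFLA_IF_NETNSID is unsupported");
                } else {
                        /* walk the RTM_NEWLINK reply attributes here; if no
                         * IFLA_IF_NETNSID is present, assume unsupported too */
                }
                close(fd);
                return 0;
        }

      The same probe applies to the RTM_SETLINK case in the next entry;
      EOPNOTSUPP on the actual request remains the final authority.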
    • rtnetlink: enable IFLA_IF_NETNSID for RTM_SETLINK · c310bfcb
      Christian Brauner authored
      - Backwards Compatibility:
        If userspace wants to determine whether RTM_SETLINK supports the
        IFLA_IF_NETNSID property they should first send an RTM_GETLINK request
  with IFLA_IF_NETNSID on lo. If either EACCES is returned or the reply
        does not include IFLA_IF_NETNSID userspace should assume that
        IFLA_IF_NETNSID is not supported on this kernel.
        If the reply does contain an IFLA_IF_NETNSID property userspace
  can send an RTM_SETLINK with an IFLA_IF_NETNSID property. If they receive
  EOPNOTSUPP then the kernel does not support the IFLA_IF_NETNSID property
  with RTM_SETLINK. Userspace should then fall back to other means.
      
        To retain backwards compatibility the kernel will first check whether a
        IFLA_NET_NS_PID or IFLA_NET_NS_FD property has been passed. If either
        one is found it will be used to identify the target network namespace.
        This implies that users who do not care whether their running kernel
        supports IFLA_IF_NETNSID with RTM_SETLINK can pass both
        IFLA_NET_NS_{FD,PID} and IFLA_IF_NETNSID referring to the same network
        namespace.
      
      - Security:
        Callers must have CAP_NET_ADMIN in the owning user namespace of the
        target network namespace.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
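
      As a sketch of that compatibility strategy, such a caller attaches both
      attributes to one request, pointing at the same namespace; addattr32()
      here stands in for an iproute2-style append helper and is a hypothetical
      name in this sketch, not a kernel or libc API:

        /* Per the commit text: kernels with this patch check IFLA_NET_NS_FD /
         * IFLA_NET_NS_PID first, so IFLA_IF_NETNSID is effectively ignored
         * when both are present and name the same netns. */
        addattr32(&req.nh, sizeof(req), IFLA_NET_NS_FD, netns_fd);
        addattr32(&req.nh, sizeof(req), IFLA_IF_NETNSID, nsid);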
    • rtnetlink: enable IFLA_IF_NETNSID in do_setlink() · 7c4f63ba
      Christian Brauner authored
      RTM_{NEW,SET}LINK already allow operations on other network namespaces
      by identifying the target network namespace through IFLA_NET_NS_{FD,PID}
      properties. This is done by looking for the corresponding properties in
      do_setlink(). Extend do_setlink() to also look for the IFLA_IF_NETNSID
      property. This introduces no functional changes since all callers of
      do_setlink() currently block IFLA_IF_NETNSID by reporting an error before
      they reach do_setlink().
      
      This introduces the helpers:
      
  static struct net *rtnl_link_get_net_by_nlattr(struct net *src_net,
                                                 struct nlattr *tb[])

  static struct net *rtnl_link_get_net_capable(const struct sk_buff *skb,
                                               struct net *src_net,
                                               struct nlattr *tb[], int cap)
      
      to simplify permission checks and target network namespace retrieval for
      RTM_* requests that already support IFLA_NET_NS_{FD,PID} but get extended
  to IFLA_IF_NETNSID. To preserve backwards compatibility the helpers look
      for IFLA_NET_NS_{FD,PID} properties first before checking for
      IFLA_IF_NETNSID.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
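
      A sketch of how these helpers might fit together, based on the
      description above (the exact error values and the nla_get_s32()
      accessor are assumptions):

        static struct net *rtnl_link_get_net_by_nlattr(struct net *src_net,
                                                       struct nlattr *tb[])
        {
                struct net *net;

                /* backwards compat: the FD/PID properties win over the nsid */
                if (tb[IFLA_NET_NS_PID] || tb[IFLA_NET_NS_FD])
                        return rtnl_link_get_net(src_net, tb);

                if (!tb[IFLA_IF_NETNSID])
                        return get_net(src_net);

                net = get_net_ns_by_id(src_net,
                                       nla_get_s32(tb[IFLA_IF_NETNSID]));
                if (!net)
                        return ERR_PTR(-EINVAL);

                return net;
        }

        static struct net *rtnl_link_get_net_capable(const struct sk_buff *skb,
                                                     struct net *src_net,
                                                     struct nlattr *tb[], int cap)
        {
                struct net *net = rtnl_link_get_net_by_nlattr(src_net, tb);

                if (IS_ERR(net))
                        return net;

                /* caller needs cap (e.g. CAP_NET_ADMIN) over the target netns */
                if (!netlink_ns_capable(skb, net->user_ns, cap)) {
                        put_net(net);
                        return ERR_PTR(-EPERM);
                }

                return net;
        }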
  2. 27 Jan 2018 (2 commits)
    • bpf: fix subprog verifier bypass by div/mod by 0 exception · f6b1b3bf
      Daniel Borkmann authored
      One of the ugly leftovers from the early eBPF days is that div/mod
      operations based on registers have a hard-coded src_reg == 0 test
      in the interpreter as well as in JIT code generators that would
      return from the BPF program with exit code 0. This was basically
  adopted from the cBPF interpreter for historical reasons.
      
      There are multiple reasons why this is very suboptimal and prone
      to bugs. To name one: the return code mapping for such abnormal
      program exit of 0 does not always match with a suitable program
      type's exit code mapping. For example, '0' in tc means action 'ok'
      where the packet gets passed further up the stack, which is just
      undesirable for such cases (e.g. when implementing policy) and
      also does not match with other program types.
      
      While trying to work out an exception handling scheme, I also
      noticed that programs crafted like the following will currently
      pass the verifier:
      
        0: (bf) r6 = r1
        1: (85) call pc+8
        caller:
         R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
        callee:
         frame1: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_1
        10: (b4) (u32) r2 = (u32) 0
        11: (b4) (u32) r3 = (u32) 1
        12: (3c) (u32) r3 /= (u32) r2
        13: (61) r0 = *(u32 *)(r1 +76)
        14: (95) exit
        returning from callee:
         frame1: R0_w=pkt(id=0,off=0,r=0,imm=0)
                 R1=ctx(id=0,off=0,imm=0) R2_w=inv0
                 R3_w=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
                 R10=fp0,call_1
        to caller at 2:
         R0_w=pkt(id=0,off=0,r=0,imm=0) R6=ctx(id=0,off=0,imm=0)
         R10=fp0,call_-1
      
        from 14 to 2: R0=pkt(id=0,off=0,r=0,imm=0)
                      R6=ctx(id=0,off=0,imm=0) R10=fp0,call_-1
        2: (bf) r1 = r6
        3: (61) r1 = *(u32 *)(r1 +80)
        4: (bf) r2 = r0
        5: (07) r2 += 8
        6: (2d) if r2 > r1 goto pc+1
         R0=pkt(id=0,off=0,r=8,imm=0) R1=pkt_end(id=0,off=0,imm=0)
         R2=pkt(id=0,off=8,r=8,imm=0) R6=ctx(id=0,off=0,imm=0)
         R10=fp0,call_-1
        7: (71) r0 = *(u8 *)(r0 +0)
        8: (b7) r0 = 1
        9: (95) exit
      
        from 6 to 8: safe
        processed 16 insns (limit 131072), stack depth 0+0
      
      Basically what happens is that in the subprog we make use of a
      div/mod by 0 exception and in the 'normal' subprog's exit path
      we just return skb->data back to the main prog. This has the
      implication that the verifier thinks we always get a pkt pointer
      in R0 while we still have the implicit 'return 0' from the div
      as an alternative unconditional return path earlier. Thus, R0
      then contains 0, meaning back in the parent prog we get the
      address range of [0x0, skb->data_end] as read and writeable.
      Similar can be crafted with other pointer register types.
      
      Since i) BPF_ABS/IND is not allowed in programs that contain
  BPF to BPF calls (and generally its use is also discouraged in
  native eBPF context), ii) unknown opcodes don't return zero
      anymore, iii) we don't return an exception code in dead branches,
  the last remaining case to fix is the div/mod
  handling.
      
      What we would really need is some infrastructure to propagate
      exceptions all the way to the original prog unwinding the
      current stack and returning that code to the caller of the
      BPF program. In user space such exception handling for similar
      runtimes is typically implemented with setjmp(3) and longjmp(3)
      as one possibility which is not available in the kernel,
  though (kgdb used to implement it in the kernel a long time ago). I
      implemented a PoC exception handling mechanism into the BPF
      interpreter with porting setjmp()/longjmp() into x86_64 and
      adding a new internal BPF_ABRT opcode that can use a program
      specific exception code for all exception cases we have (e.g.
      div/mod by 0, unknown opcodes, etc). While this seems to work
      in the constrained BPF environment (meaning, here, we don't
      need to deal with state e.g. from memory allocations that we
      would need to undo before going into exception state), it still
      has various drawbacks: i) we would need to implement the
      setjmp()/longjmp() for every arch supported in the kernel and
      for x86_64, arm64, sparc64 JITs currently supporting calls,
      ii) it has unconditional additional cost on main program
      entry to store CPU register state in initial setjmp() call,
      and we would need some way to pass the jmp_buf down into
      ___bpf_prog_run() for main prog and all subprogs, but also
      storing on stack is not really nice (other option would be
      per-cpu storage for this, but it also has the drawback that
  we need to disable preemption for all BPF program types).
      All in all this approach would add a lot of complexity.
      
      Another poor-man's solution would be to have some sort of
      additional shared register or scratch buffer to hold state
      for exceptions, and test that after every call return to
      chain returns and pass R0 all the way down to BPF prog caller.
      This is also problematic in various ways: i) an additional
      register doesn't map well into JITs, and some other scratch
      space could only be on per-cpu storage, which, again has the
      side-effect that this only works when we disable preemption,
      or somewhere in the input context which is not available
      everywhere either, and ii) this adds significant runtime
      overhead by putting conditionals after each and every call,
      as well as implementation complexity.
      
  Yet another option is to teach the verifier that div/mod can
      return an integer, which however is also complex to implement
      as verifier would need to walk such fake 'mov r0,<code>; exit;'
  sequence and there would still be no guarantee of having
      propagation of this further down to the BPF caller as proper
  exception code. For the parent prog, it is also not distinguishable
      from a normal return of a constant scalar value.
      
      The approach taken here is a completely different one with
      little complexity and no additional overhead involved in
      that we make use of the fact that a div/mod by 0 is undefined
  behavior. Instead of bailing out, we adopt into eBPF the same
  behavior as on some major archs like ARMv8 [0]:
  X div 0 results in 0, and X mod 0 results in X. The aarch64 and
  aarch32 ISAs do not generate any traps or otherwise abort
  program execution for unsigned divides. I verified this
      also with a test program compiled by gcc and clang, and the
      behavior matches with the spec. Going forward we adapt the
  eBPF verifier to emit such rewrites once a div/mod by register
  is seen. cBPF is not touched and keeps its existing 'return 0'
  semantics. Given the options, this seems the most suitable of
  all of them, also since major archs have similar schemes in
      place. Given this is all in the realm of undefined behavior,
      we still have the option to adapt if deemed necessary and
      this way we would also have the option of more flexibility
  from the LLVM code generation side (which is then fully visible
  to the verifier). Thus, this patch i) fixes the panic seen in the
  above program and ii) doesn't bypass the verifier's observations.
      
        [0] ARM Architecture Reference Manual, ARMv8 [ARM DDI 0487B.b]
            http://infocenter.arm.com/help/topic/com.arm.doc.ddi0487b.b/DDI0487B_b_armv8_arm.pdf
            1) aarch64 instruction set: section C3.4.7 and C6.2.279 (UDIV)
               "A division by zero results in a zero being written to
                the destination register, without any indication that
                the division by zero occurred."
            2) aarch32 instruction set: section F1.4.8 and F5.1.263 (UDIV)
               "For the SDIV and UDIV instructions, division by zero
                always returns a zero result."
      
      Fixes: f4d7e40a ("bpf: introduce function calls (verification)")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
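
      As a plain C illustration of the semantics this patch adopts (not the
      verifier rewrite itself; the helper names below are made up for the
      example), division by zero yields 0 and modulo by zero leaves the
      dividend unchanged:

        #include <stdint.h>
        #include <stdio.h>

        /* eBPF-style unsigned div/mod, matching aarch64 UDIV behavior */
        static uint64_t bpf_udiv(uint64_t x, uint64_t y) { return y ? x / y : 0; }
        static uint64_t bpf_umod(uint64_t x, uint64_t y) { return y ? x % y : x; }

        int main(void)
        {
                printf("7 / 0 -> %llu\n", (unsigned long long)bpf_udiv(7, 0)); /* 0 */
                printf("7 %% 0 -> %llu\n", (unsigned long long)bpf_umod(7, 0)); /* 7 */
                return 0;
        }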
    • bpf: xor of a/x in cbpf can be done in 32 bit alu · 1d621674
      Daniel Borkmann authored
  Very minor optimization; saves 1 byte per program in the x86_64
  JIT's cBPF prologue.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  3. 26 Jan 2018 (8 commits)
    • bpf: Add sock_ops R/W access to tclass · 6f9bd3d7
      Lawrence Brakmo authored
  Adds direct write access to sk_txhash and access to tclass for IPv6
      flows through getsockopt and setsockopt. Sample usage for tclass:
      
        bpf_getsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v))
      
      where skops is a pointer to the ctx (struct bpf_sock_ops).
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
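
      A hedged sketch of the setter side as a standalone sock_ops program
      (the section name, bpf_helpers.h stubs, and the tclass value follow
      samples/bpf conventions of this era and are illustrative):

        #include <uapi/linux/bpf.h>
        #include "bpf_helpers.h"

        #define AF_INET6    10
        #define SOL_IPV6    41
        #define IPV6_TCLASS 67

        SEC("sockops")
        int bpf_set_tclass(struct bpf_sock_ops *skops)
        {
                int v = 0x8;  /* illustrative tclass value */

                if (skops->family == AF_INET6 &&
                    skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB)
                        bpf_setsockopt(skops, SOL_IPV6, IPV6_TCLASS,
                                       &v, sizeof(v));
                return 1;
        }

        char _license[] SEC("license") = "GPL";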
    • bpf: Add support for reading sk_state and more · 44f0e430
      Lawrence Brakmo authored
      Add support for reading many more tcp_sock fields
      
  state            same as sk->sk_state
  rtt_min          same as sk->rtt_min.s[0].v (current rtt_min)
        snd_ssthresh
        rcv_nxt
        snd_nxt
        snd_una
        mss_cache
        ecn_flags
        rate_delivered
        rate_interval_us
        packets_out
        retrans_out
        total_retrans
        segs_in
        data_segs_in
        segs_out
        data_segs_out
        lost_out
        sacked_out
        sk_txhash
        bytes_received (__u64)
        bytes_acked    (__u64)
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock · b13d8807
      Lawrence Brakmo authored
      Adds field bpf_sock_ops_cb_flags to tcp_sock and bpf_sock_ops. Its primary
  use is to determine if there should be calls to the sock_ops BPF program at
      various points in the TCP code. The field is initialized to zero,
      disabling the calls. A sock_ops BPF program can set it, per connection and
      as necessary, when the connection is established.
      
  It also adds support for reading and writing the field within a
      sock_ops BPF program. Reading is done by accessing the field directly.
      However, writing is done through the helper function
      bpf_sock_ops_cb_flags_set, in order to return an error if a BPF program
      is trying to set a callback that is not supported in the current kernel
      (i.e. running an older kernel). The helper function returns 0 if it was
      able to set all of the bits set in the argument, a positive number
      containing the bits that could not be set, or -EINVAL if the socket is
      not a full TCP socket.
      
      Examples of where one could call the bpf program:
      
      1) When RTO fires
      2) When a packet is retransmitted
      3) When the connection terminates
      4) When a packet is sent
      5) When a packet is received
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
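
      A hedged sketch of a program opting in to extra callbacks with the new
      helper (BPF_SOCK_OPS_RTO_CB_FLAG is a flag name from this patch series
      and assumed here; includes as in the previous sketch):

        SEC("sockops")
        int bpf_enable_cbs(struct bpf_sock_ops *skops)
        {
                int ret;

                if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB) {
                        ret = bpf_sock_ops_cb_flags_set(skops,
                                                        BPF_SOCK_OPS_RTO_CB_FLAG);
                        /* ret == 0: all requested bits were set
                         * ret > 0:  bits this kernel could not set
                         * -EINVAL:  not a full TCP socket */
                }
                return 1;
        }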
    • bpf: Add write access to tcp_sock and sock fields · b73042b8
      Lawrence Brakmo authored
      This patch adds a macro, SOCK_OPS_SET_FIELD, for writing to
      struct tcp_sock or struct sock fields. This required adding a new
      field "temp" to struct bpf_sock_ops_kern for temporary storage that
      is used by sock_ops_convert_ctx_access. It is used to store and recover
      the contents of a register, so the register can be used to store the
      address of the sk. Since we cannot overwrite the dst_reg because it
      contains the pointer to ctx, nor the src_reg since it contains the value
      we want to store, we need an extra register to contain the address
      of the sk.
      
      Also adds the macro SOCK_OPS_GET_OR_SET_FIELD that calls one of the
      GET or SET macros depending on the value of the TYPE field.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Make SOCK_OPS_GET_TCP struct independent · 34d367c5
      Lawrence Brakmo authored
  Changed SOCK_OPS_GET_TCP to SOCK_OPS_GET_FIELD and added two
  arguments so it can now also work with struct sock fields.
      The first argument is the name of the field in the bpf_sock_ops
      struct, the 2nd argument is the name of the field in the OBJ struct.
      
      Previous: SOCK_OPS_GET_TCP(FIELD_NAME)
      New:      SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)
      
      Where OBJ is either "struct tcp_sock" or "struct sock" (without
      quotation). BPF_FIELD is the name of the field in the bpf_sock_ops
      struct and OBJ_FIELD is the name of the field in the OBJ struct.
      
      Although the field names are currently the same, the kernel struct names
      could change in the future and this change makes it easier to support
      that.
      
      Note that adding access to tcp_sock fields in sock_ops programs does
      not preclude the tcp_sock fields from being removed as long as we are
      willing to do one of the following:
      
  1) Return a fixed value (e.g. 0 or 0xffffffff), or
        2) Make the verifier fail if that field is accessed (i.e. program
          fails to load) so the user will know that field is no longer
          supported.
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Make SOCK_OPS_GET_TCP size independent · a33de397
      Lawrence Brakmo authored
  Make the SOCK_OPS_GET_TCP helper macro size independent (before, it only
  worked with 4-byte fields).
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Only reply field should be writeable · 2585cd62
      Lawrence Brakmo authored
      Currently, a sock_ops BPF program can write the op field and all the
      reply fields (reply and replylong). This is a bug. The op field should
  not have been writeable, and there is currently no way to use the replylong
  field for indices >= 1. This patch enforces that only the reply field
      (which equals replylong[0]) is writeable.
      
      Fixes: 40304b2a ("bpf: BPF support for sock_ops")
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • net: Move net:netns_ids destruction out of rtnl_lock() and document locking scheme · fb07a820
      Kirill Tkhai authored
  Currently, we unhash a dying net from the netns_ids lists
  under rtnl_lock(). It's a leftover from the time when
  net::netns_ids was introduced: there was no net::nsid_lock,
  and rtnl_lock() was mostly needed to order modification
  of the nsid idrs of alive nets, i.e. for:
      	for_each_net(tmp) {
      		...
      		id = __peernet2id(tmp, net);
      		idr_remove(&tmp->netns_ids, id);
      		...
      	}
      
  Since we now have net::nsid_lock, the modifications are
  protected by this local lock, and we may introduce a
  better scheme of netns_ids destruction.
      
  Let's look at the functions peernet2id_alloc() and
  get_net_ns_by_id(). Previous commits taught these
  functions to work well with a dying net acquired from
  rtnl-unlocked lists, and they are the only functions
  that can hash a net into netns_ids or obtain one from
  there. As is easy to check, the other functions operating
  on netns_ids work with ids, not with net pointers, so we
  do not need rtnl_lock() to synchronize cleanup_net() with them.
      
  Another property, which the patch uses, is that a net
  is unhashed from net_namespace_list in only one place
  and by only one process, so we avoid excess
  rcu_read_lock() or rtnl_lock() when iterating over
  the list in unhash_nsid().
      
  All the above makes it possible to keep rtnl_lock() held
  only for net->list deletion, and to avoid it completely
  for netns_ids unhashing and destruction. As these two
  operations may take a long time (e.g., memory allocation
  to send an skb), the patch should improve scalability
  and significantly decrease the time rtnl_lock() is held
  in cleanup_net().
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
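
      A rough sketch of the resulting shape of cleanup_net() (simplified;
      the list handling and the unhash_nsid() signature are assumptions based
      on the description above):

        /* rtnl_lock() now covers only removal from net_namespace_list ... */
        rtnl_lock();
        list_for_each_entry(net, &net_kill_list, cleanup_list)
                list_del_rcu(&net->list);
        rtnl_unlock();

        /* ... while nsid unhashing, notification skbs and destruction run
         * without rtnl, serialized by each net's local nsid_lock */
        list_for_each_entry(net, &net_kill_list, cleanup_list)
                unhash_nsid(net, last);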
  4. 25 Jan 2018 (7 commits)
  5. 24 Jan 2018 (3 commits)
  6. 23 Jan 2018 (2 commits)
  7. 22 Jan 2018 (1 commit)
  8. 20 Jan 2018 (5 commits)
  9. 19 Jan 2018 (2 commits)
    • bpf: allow socket_filter programs to use bpf_prog_test_run · 61f3c964
      Alexei Starovoitov authored
  In order to improve test coverage, allow the socket_filter program
  type to be run via the bpf_prog_test_run command.
  Since such programs can be loaded by non-root, tighten
  permissions for bpf_prog_test_run to root only
  to avoid surprises.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
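
      A hedged usage sketch with the tools/lib/bpf API of this period
      (prog_fd is assumed to hold a loaded BPF_PROG_TYPE_SOCKET_FILTER
      program; note the command is now root-only):

        #include <stdio.h>
        #include <bpf/bpf.h>  /* tools/lib/bpf */

        static int test_filter(int prog_fd)
        {
                __u32 retval = 0, duration = 0;
                char pkt[128] = {0};  /* dummy frame, contents illustrative */
                int err = bpf_prog_test_run(prog_fd, 1, pkt, sizeof(pkt),
                                            NULL, NULL, &retval, &duration);

                if (!err)
                        printf("filter returned %u (%u ns)\n", retval, duration);
                return err;
        }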
    • flow_dissector: properly cap thoff field · d0c081b4
      Eric Dumazet authored
      syzbot reported yet another crash [1] that is caused by
      insufficient validation of DODGY packets.
      
  Three bugs are happening here to trigger the crash.
      
      1) Flow dissection leaves with incorrect thoff field.
      
  2) skb_probe_transport_header() sets the transport header to this invalid
  thoff, even if it points past the skb's valid data.
      
  3) qdisc_pkt_len_init() reads out-of-bounds data because it
  trusts tcp_hdrlen(skb).
      
  Possible fixes:
      
  - Full flow dissector validation before injecting bad DODGY packets into
    the stack.
    This approach was attempted here: https://patchwork.ozlabs.org/patch/861874/
      
      - Have more robust functions in the core.
        This might be needed anyway for stable versions.
      
      This patch fixes the flow dissection issue.
      
      [1]
      CPU: 1 PID: 3144 Comm: syzkaller271204 Not tainted 4.15.0-rc4-mm1+ #49
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:53
       print_address_description+0x73/0x250 mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:355 [inline]
       kasan_report+0x23b/0x360 mm/kasan/report.c:413
       __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:432
       __tcp_hdrlen include/linux/tcp.h:35 [inline]
       tcp_hdrlen include/linux/tcp.h:40 [inline]
       qdisc_pkt_len_init net/core/dev.c:3160 [inline]
       __dev_queue_xmit+0x20d3/0x2200 net/core/dev.c:3465
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3554
       packet_snd net/packet/af_packet.c:2943 [inline]
       packet_sendmsg+0x3ad5/0x60a0 net/packet/af_packet.c:2968
       sock_sendmsg_nosec net/socket.c:628 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:638
       sock_write_iter+0x31a/0x5d0 net/socket.c:907
       call_write_iter include/linux/fs.h:1776 [inline]
       new_sync_write fs/read_write.c:469 [inline]
       __vfs_write+0x684/0x970 fs/read_write.c:482
       vfs_write+0x189/0x510 fs/read_write.c:544
       SYSC_write fs/read_write.c:589 [inline]
       SyS_write+0xef/0x220 fs/read_write.c:581
       entry_SYSCALL_64_fastpath+0x1f/0x96
      
      Fixes: 34fad54c ("net: __skb_flow_dissect() must cap its return value")
      Fixes: a6e544b0 ("flow_dissector: Jump to exit code in __skb_flow_dissect")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
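
      The dissection fix boils down to a clamp on the way out of
      __skb_flow_dissect(); a sketch (the exact placement and the skb/hlen
      handling are assumptions):

        /* never report a transport offset beyond the data we actually have */
        key_control->thoff = min_t(u16, key_control->thoff, skb->len);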
  10. 18 Jan 2018 (2 commits)
    • net: Remove spinlock from get_net_ns_by_id() · 42157277
      Kirill Tkhai authored
      idr_find() is safe under rcu_read_lock() and
      maybe_get_net() guarantees that net is alive.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
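
      A sketch of the resulting lookup, following the two facts stated above:

        struct net *get_net_ns_by_id(struct net *net, int id)
        {
                struct net *peer;

                if (id < 0)
                        return NULL;

                rcu_read_lock();
                peer = idr_find(&net->netns_ids, id);
                if (peer)
                        peer = maybe_get_net(peer);  /* NULL once refcount is 0 */
                rcu_read_unlock();

                return peer;
        }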
    • net: Fix possible race in peernet2id_alloc() · 0c06bea9
      Kirill Tkhai authored
      peernet2id_alloc() is racy without rtnl_lock() as refcount_read(&peer->count)
  under net->nsid_lock does not guarantee that peer is alive:
      
      rcu_read_lock()
      peernet2id_alloc()                            ..
        spin_lock_bh(&net->nsid_lock)               ..
        refcount_read(&peer->count) (!= 0)          ..
        ..                                          put_net()
        ..                                            cleanup_net()
        ..                                              for_each_net(tmp)
        ..                                                spin_lock_bh(&tmp->nsid_lock)
        ..                                                __peernet2id(tmp, net) == -1
        ..                                                    ..
        ..                                                    ..
          __peernet2id_alloc(alloc == true)                   ..
        ..                                                    ..
      rcu_read_unlock()                                       ..
      ..                                                synchronize_rcu()
      ..                                                kmem_cache_free(net)
      
  After the above sequence, net::netns_ids contains an id pointing to freed
  memory, and any later dereference via that id will operate on this freed memory.
      
  Currently, peernet2id_alloc() is used under rtnl_lock() everywhere except
  ovs_vport_cmd_fill_info(), so this race can't occur. But peernet2id_alloc()
  is a generic interface, and it is better to fix it before someone starts
  using it in the wrong context.
      
      v2: Don't place refcount_read(&net->count) under net->nsid_lock
          as suggested by Eric W. Biederman <ebiederm@xmission.com>
      v3: Rebase on top of net-next
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
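
      A sketch of the fixed allocation path (simplified; the helpers around
      __peernet2id_alloc() and the notification call are assumptions based on
      the description above):

        int peernet2id_alloc(struct net *net, struct net *peer)
        {
                bool alloc = false, alive = false;
                int id;

                if (refcount_read(&net->count) == 0)
                        return NETNSA_NSID_NOT_ASSIGNED;

                spin_lock_bh(&net->nsid_lock);
                /* Take a real reference under nsid_lock: a peer that
                 * cleanup_net() is already tearing down can no longer be
                 * hashed back into netns_ids. */
                if (maybe_get_net(peer))
                        alive = alloc = true;
                id = __peernet2id_alloc(net, peer, &alloc);
                spin_unlock_bh(&net->nsid_lock);

                if (alloc && id >= 0)
                        rtnl_net_notifyid(net, RTM_NEWNSID, id);
                if (alive)
                        put_net(peer);
                return id;
        }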
  11. 17 Jan 2018 (3 commits)