1. 25 Aug, 2019 1 commit
    • ipv4/icmp: fix rt dst dev null pointer dereference · e2c69393
      Hangbin Liu committed
      In __icmp_send() there is a possibility that rt->dst.dev is NULL,
      e.g. with tunnel collect_md mode, which will cause a kernel crash.
      Here is what the code path looks like, for GRE:
      
      - ip6gre_tunnel_xmit
        - ip6gre_xmit_ipv4
          - __gre6_xmit
            - ip6_tnl_xmit
              - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE
          - icmp_send
            - net = dev_net(rt->dst.dev); <-- here
      
      The reason is that __metadata_dst_init() initializes dst->dev to NULL by
      default. We cannot fix it in __metadata_dst_init() as no dev is supplied
      there. On the other hand, the only reason we need rt->dst.dev is to get
      the net, so we can try to get it from skb->dev when rt->dst.dev is NULL.
      
      v4: Julian Anastasov pointed out that skb->dev could also be NULL. We'd
      better still use dst.dev and add a check to avoid the crash.
      
      v3: No changes.
      
      v2: fix the issue in __icmp_send() instead of updating shared dst dev
      in {ip_md, ip6}_tunnel_xmit.
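
The lookup order described above can be sketched roughly as follows. This is a minimal user-space model, not the kernel patch itself: the structures are stripped-down stand-ins and icmp_pick_net() is a hypothetical helper name. It assumes the namespace is taken from rt->dst.dev first, then from skb->dev, and that the caller bails out when both are NULL.

```c
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for the kernel structures. */
struct net { int id; };
struct net_device { struct net *nd_net; };
struct dst_entry { struct net_device *dev; };
struct rtable { struct dst_entry dst; };

/*
 * Pick the network namespace for an ICMP reply.
 * Returns NULL when neither device is available, in which case the
 * caller should give up instead of dereferencing a NULL device.
 */
static struct net *icmp_pick_net(const struct rtable *rt,
				 const struct net_device *skb_dev)
{
	if (!rt)
		return NULL;
	if (rt->dst.dev)		/* normal case */
		return rt->dst.dev->nd_net;
	if (skb_dev)			/* metadata dst: dst.dev is NULL */
		return skb_dev->nd_net;
	return NULL;			/* nothing to go on: bail out */
}
```

A caller would treat a NULL return as "do not send the ICMP message" rather than as an error to report.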
      
      Fixes: c8b34e68 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Reviewed-by: Julian Anastasov <ja@ssi.bg>
      Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 22 Aug, 2019 1 commit
  3. 21 Aug, 2019 1 commit
  4. 09 Aug, 2019 2 commits
    • net/tls: prevent skb_orphan() from leaking TLS plain text with offload · 41477662
      Jakub Kicinski committed
      sk_validate_xmit_skb() and drivers depend on the sk member of
      struct sk_buff to identify segments requiring encryption.
      Any operation which removes or does not preserve the original TLS
      socket such as skb_orphan() or skb_clone() will cause clear text
      leaks.
      
      Make the TCP socket underlying an offloaded TLS connection
      mark all skbs as decrypted, if TLS TX is in offload mode.
      Then in sk_validate_xmit_skb() catch skbs which have no socket
      (or a socket with no validation) and decrypted flag set.
      
      Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
      sk->sk_validate_xmit_skb are slightly interchangeable right now,
      they all imply TLS offload. The new checks are guarded by
      CONFIG_TLS_DEVICE because that's the option guarding the
      sk_buff->decrypted member.
      
      A second, smaller issue with orphaning is that it breaks
      the guarantee that packets will be delivered to device
      queues in-order. All TLS offload drivers depend on that
      scheduling property. This means skb_orphan_partial()'s
      trick of preserving partial socket references will cause
      issues in the drivers. We need a full orphan, and as a
      result netem delay/throttling will cause all TLS offload
      skbs to be dropped.
      
      Reusing the sk_buff->decrypted flag also protects from
      leaking clear text when an incoming, decrypted skb is redirected
      (e.g. by TC).
      
      See commit 0608c69c ("bpf: sk_msg, sock{map|hash} redirect
      through ULP") for justification why the internal flag is safe.
      The only location which could leak the flag in is tcp_bpf_sendmsg(),
      which is taken care of by clearing the previously unused bit.
      
      v2:
       - remove superfluous decrypted mark copy (Willem);
       - remove the stale doc entry (Boris);
       - rely entirely on EOR marking to prevent coalescing (Boris);
       - use an internal sendpages flag instead of marking the socket
         (Boris).
      v3 (Willem):
       - reorganize the can_skb_orphan_partial() condition;
       - fix the flag leak-in through tcp_bpf_sendmsg.
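
The check described for sk_validate_xmit_skb() can be illustrated with a small user-space model. It is only a sketch under stated assumptions: the structures are simplified stand-ins, validate_xmit() and tls_encrypt_stub() are hypothetical names, and "drop" is modelled by returning NULL instead of freeing the skb.

```c
#include <stdbool.h>
#include <stddef.h>

struct sk_buff;

struct sock {
	/* Validation hook installed by TLS offload; NULL otherwise. */
	struct sk_buff *(*validate)(struct sock *sk, struct sk_buff *skb);
};

struct sk_buff {
	struct sock *sk;	/* owning socket, NULL after a full orphan */
	bool decrypted;		/* payload is still plain text */
};

/* Hypothetical stand-in for the offload path that encrypts the segment. */
static struct sk_buff *tls_encrypt_stub(struct sock *sk, struct sk_buff *skb)
{
	(void)sk;
	return skb;
}

/* Returns the skb to transmit, or NULL if it has to be dropped. */
static struct sk_buff *validate_xmit(struct sk_buff *skb)
{
	struct sock *sk = skb->sk;

	if (sk && sk->validate)
		return sk->validate(sk, skb);
	if (skb->decrypted)
		return NULL;	/* no socket to validate it: would leak plain text */
	return skb;
}
```

An skb that lost its socket but still carries the decrypted mark is dropped, while ordinary traffic without the mark passes through unchanged.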
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Reviewed-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: re-introduce skb coalescing for local delivery · 891584f4
      Guillaume Nault committed
      Before commit d4289fcc ("net: IP6 defrag: use rbtrees for IPv6
      defrag"), a netperf UDP_STREAM test[0] using big IPv6 datagrams (thus
      generating many fragments) and running over an IPsec tunnel, reported
      more than 6Gbps throughput. After that patch, the same test gets only
      9Mbps when receiving on a be2net nic (driver can make a big difference
      here, for example, ixgbe doesn't seem to be affected).
      
      By reusing the IPv4 defragmentation code, IPv6 lost fragment coalescing
      (IPv4 fragment coalescing was dropped by commit 14fe22e3 ("Revert
      "ipv4: use skb coalescing in defragmentation"")).
      
      Without fragment coalescing, be2net runs out of Rx ring entries and
      starts to drop frames (ethtool reports rx_drops_no_frags errors). Since
      the netperf traffic is only composed of UDP fragments, any lost packet
      prevents reassembly of the full datagram. Therefore, fragments which
      have no possibility to ever get reassembled pile up in the reassembly
      queue, until the memory accounting exceeds the threshold. At that point
      no fragment is accepted anymore, which effectively discards all
      netperf traffic.
      
      When the reassembly timeout expires, some stale fragments are removed from
      the reassembly queue, so a few packets can be received, reassembled
      and delivered to the netperf receiver. But the nic still drops frames
      and soon the reassembly queue gets filled again with stale fragments.
      These long time frames where no datagram can be received explain why
      the performance drop is so significant.
      
      Re-introducing fragment coalescing is enough to restore the initial
      performance (6.6Gbps with be2net): the driver doesn't drop frames
      anymore (no more rx_drops_no_frags errors) and the reassembly engine
      works at full speed.
      
      This patch is quite conservative and only coalesces skbs for local
      IPv4 and IPv6 delivery (in order to avoid changing skb geometry when
      forwarding). Coalescing could be extended in the future if need be, as
      more scenarios would probably benefit from it.
      
      [0]: Test configuration
      Sender:
      ip xfrm policy flush
      ip xfrm state flush
      ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
      ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir in tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
      ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
      ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir out tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
      netserver -D -L fc00:2::1
      
      Receiver:
      ip xfrm policy flush
      ip xfrm state flush
      ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
      ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir in tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
      ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
      ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir out tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
      netperf -H fc00:2::1 -f k -P 0 -L fc00:1::1 -l 60 -t UDP_STREAM -I 99,5 -i 5,5 -T5,5 -6
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 26 Jul, 2019 1 commit
  6. 25 Jul, 2019 1 commit
  7. 22 Jul, 2019 2 commits
    • bpf: sockmap/tls, close can race with map free · 95fa1454
      John Fastabend committed
      When a map free is called and in parallel a socket is closed we
      have two paths that can potentially reset the socket prot ops, the
      bpf close() path and the map free path. This creates a problem
      with which prot ops should be used from the socket closed side.
      
      If the map_free side completes first then we want to call the
      original lowest level ops. However, if the tls path runs first
      we want to call the sockmap ops. Additionally there was no locking
      around prot updates in TLS code paths so the prot ops could
      be changed multiple times once from TLS path and again from sockmap
      side potentially leaving ops pointed at either TLS or sockmap
      when psock and/or tls context have already been destroyed.
      
      To fix this race, first only update ops inside the callback lock
      so that TLS, sockmap and lowest level all agree on prot state.
      Second, add a ULP callback update() so that lower layers can
      inform the upper layer when they are being removed, allowing the
      upper layer to reset prot ops.
      
      This gets us close to allowing sockmap and tls to be stacked
      in arbitrary order, but we will save that patch for the *next trees.
      
      v4:
       - make sure we don't free things for device;
       - remove the checks which swap the callbacks back
         only if TLS is at the top.
      
      Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
      Fixes: 02c558b2 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • tcp: be more careful in tcp_fragment() · b617158d
      Eric Dumazet committed
      Some applications set tiny SO_SNDBUF values and expect
      TCP to just work. Recent patches to address CVE-2019-11478
      broke them in case of losses, since retransmits might
      be prevented.
      
      We should allow these flows to make progress.
      
      This patch allows the first and last skb in retransmit queue
      to be split even if memory limits are hit.
      
      It also adds some room due to the fact that tcp_sendmsg()
      and tcp_sendpage() might overshoot sk_wmem_queued by about one full
      TSO skb (64KB size). Note this allowance was already present
      in stable backports for kernels < 4.15
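
The rule described above can be sketched as a small predicate. This is a simplified illustration, not the kernel condition: may_fragment() is a hypothetical name, and TSO_ALLOWANCE is an assumed constant that only reflects the "about one full TSO skb (64KB size)" wording of this message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: headroom of roughly one full TSO skb, as stated above. */
#define TSO_ALLOWANCE (64 * 1024)

/*
 * Decide whether an skb in the retransmit queue may be split.
 * The first and last skb may always be split so that flows with a
 * tiny SO_SNDBUF can still make progress; any other skb is subject
 * to the memory limit plus the allowance.
 */
static bool may_fragment(size_t wmem_queued, size_t sndbuf,
			 bool is_rtx_head, bool is_rtx_tail)
{
	if (is_rtx_head || is_rtx_tail)
		return true;
	return wmem_queued <= sndbuf + TSO_ALLOWANCE;
}
```

With a 4KB send buffer, an skb in the middle of the queue is refused once the queued memory exceeds the buffer plus the allowance, while the head and tail are still split.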
      
      Note for < 4.15 backports :
       tcp_rtx_queue_tail() will probably look like :
      
      static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
      {
      	struct sk_buff *skb = tcp_send_head(sk);
      
      	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
      }
      
      Fixes: f070ef2a ("tcp: tcp_fragment() should apply sane memory limits")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andrew Prout <aprout@ll.mit.edu>
      Tested-by: Andrew Prout <aprout@ll.mit.edu>
      Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Tested-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Christoph Paasch <cpaasch@apple.com>
      Cc: Jonathan Looney <jtl@netflix.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 19 Jul, 2019 3 commits
  9. 18 Jul, 2019 1 commit
    • fib: relax source validation check for loopback packets · 66f82095
      Cong Wang committed
      In a rare case where we redirect local packets from veth to lo,
      these packets fail to pass the source validation when rp_filter
      is turned on, as the tracing shows:
      
        <...>-311708 [040] ..s1 7951180.957825: fib_table_lookup: table 254 oif 0 iif 1 src 10.53.180.130 dst 10.53.180.130 tos 0 scope 0 flags 0
        <...>-311708 [040] ..s1 7951180.957826: fib_table_lookup_nh: nexthop dev eth0 oif 4 src 10.53.180.130
      
      So, the fib table lookup returns eth0 as the nexthop even though
      the packets are local and should be routed to loopback nonetheless,
      but they can't pass the dev match check in fib_info_nh_uses_dev()
      without this patch.
      
      It should be safe to relax this check for this special case, as
      normally packets coming out of loopback device still have skb_dst
      so they won't even hit this slow path.
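
The relaxed check can be pictured as follows. This is a sketch under assumptions, not the patch itself: the enum, the is_loopback field and source_dev_ok() are hypothetical names, and nh_uses_dev stands for the result of the existing fib_info_nh_uses_dev() match.

```c
#include <stdbool.h>

/* Hypothetical, simplified route types and device representation. */
enum route_type { ROUTE_UNICAST, ROUTE_LOCAL };

struct net_device {
	int ifindex;
	bool is_loopback;
};

/*
 * Source validation passes when the nexthop uses the input device,
 * or, as the special case discussed above, when the route is local
 * and the packet arrived on the loopback device.
 */
static bool source_dev_ok(bool nh_uses_dev, enum route_type type,
			  const struct net_device *in_dev)
{
	return nh_uses_dev ||
	       (type == ROUTE_LOCAL && in_dev->is_loopback);
}
```

In the traced case the lookup returns eth0 as the nexthop while the packet came in on lo, so only the second half of the condition lets it through.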
      
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: David Ahern <dsahern@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 16 Jul, 2019 4 commits
    • netfilter: synproxy: fix erroneous tcp mss option · b83329fb
      Fernando Fernandez Mancera committed
      Now synproxy sends the MSS value set by the user in the SYN-ACK packet to
      the client, instead of the MSS value that the client announced.
      
      Fixes: 48b1de4c ("netfilter: add SYNPROXY core/target")
      Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: Update obsolete comments referring to ip_conntrack · 05ba4c89
      Yonatan Goldschmidt committed
      In 9fb9cbb1 ("[NETFILTER]: Add nf_conntrack subsystem.") the new
      generic nf_conntrack was introduced, and it came to supersede the old
      ip_conntrack.
      
      This change updates some of the obsolete comments referring to old
      file/function names of the ip_conntrack mechanism, and removes a
      few self-referencing comments that we shouldn't maintain anymore.
      
      I did not update any comments referring to historical actions (e.g.
      comments like "this file was derived from ..." were left untouched, even
      if the referenced file is no longer here).
      Signed-off-by: Yonatan Goldschmidt <yon.goldschmidt@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_conntrack_sip: fix expectation clash · 3c00fb0b
      xiao ruizhu committed
      When conntracks change during a dialog, SDP messages may be sent from
      different conntracks to establish expects with identical tuples. In this
      case an expect conflict may be detected for the second SDP message, and
      processing of that message fails.
      
      The fix is to reuse an existing expect that has the same tuple for a
      different conntrack, if any.
      
      Here are two scenarios for the case.
      
      1)
               SERVER                   CPE
      
                 |      INVITE SDP       |
            5060 |<----------------------|5060
                 |      100 Trying       |
            5060 |---------------------->|5060
                 |      183 SDP          |
            5060 |---------------------->|5060    ===> Conntrack 1
                 |       PRACK           |
           50601 |<----------------------|5060
                 |    200 OK (PRACK)     |
           50601 |---------------------->|5060
                 |    200 OK (INVITE)    |
            5060 |---------------------->|5060
                 |        ACK            |
           50601 |<----------------------|5060
                 |                       |
                 |<--- RTP stream ------>|
                 |                       |
                 |    INVITE SDP (t38)   |
           50601 |---------------------->|5060    ===> Conntrack 2
      
      With a certain configuration in the CPE, SIP messages "183 with SDP" and
      "re-INVITE with SDP t38" will go through the sip helper to create
      expects for RTP and RTCP.
      
      It is okay to create RTP and RTCP expects for "183", whose master
      connection source port is 5060, and destination port is 5060.
      
      In the "183" message, the port in the Contact header changes to 50601 (from
      the original 5060). So the following requests, e.g. PRACK and ACK, are sent
      to port 50601. They belong to a different conntrack (call it Conntrack 2)
      from the original INVITE (call it Conntrack 1) due to the port difference.
      
      In this example, after the call is established, there is an RTP stream but
      no RTCP stream for Conntrack 1, so the RTP expect created upon "183" is
      cleared, and the RTCP expect created for Conntrack 1 remains.
      
      When "re-INVITE with SDP t38" arrives to create RTP&RTCP expects, the
      current ALG implementation calls nf_ct_expect_related() for RTP and RTCP.
      The expect tuples are identical to those for Conntrack 1. The RTP expect
      for Conntrack 2 is created successfully, as the one for Conntrack 1 has
      been removed. The RTCP expect for Conntrack 2 cannot be created because it
      has identical tuples and 'conflicts' with the one retained for Conntrack 1.
      This results in a failure in processing of the re-INVITE.
      
      2)
      
          SERVER A                 CPE
      
             |      REGISTER     |
        5060 |<------------------| 5060  ==> CT1
             |       200         |
        5060 |------------------>| 5060
             |                   |
             |   INVITE SDP(1)   |
        5060 |<------------------| 5060
             | 300(multi choice) |
        5060 |------------------>| 5060                    SERVER B
             |       ACK         |
        5060 |<------------------| 5060
                                        |    INVITE SDP(2)    |
                                   5060 |-------------------->| 5060  ==> CT2
                                        |       100           |
                                   5060 |<--------------------| 5060
                                        | 200(contact changes)|
                                   5060 |<--------------------| 5060
                                        |       ACK           |
                                   5060 |-------------------->| 50601 ==> CT3
                                        |                     |
                                        |<--- RTP stream ---->|
                                        |                     |
                                        |       BYE           |
                                   5060 |<--------------------| 50601
                                        |       200           |
                                   5060 |-------------------->| 50601
             |   INVITE SDP(3)   |
        5060 |<------------------| 5060  ==> CT1
      
      CPE sends an INVITE request(1) to Server A, and creates an RTP&RTCP expect
      pair for this Conntrack 1 (CT1). Server A responds 300 to redirect to
      Server B. The RTP&RTCP expect pairs created on CT1 are removed upon the 300
      response.
      
      CPE sends the INVITE request(2) to Server B, and creates an expect pair
      for the new conntrack (due to the destination address difference), call it
      CT2. Server B changes the port to 50601 in the 200 OK response, and the
      following requests ACK and BYE from CPE are sent to 50601. The call is
      established. There is an RTP stream and no RTCP stream, so the RTP expect
      is removed and the RTCP expect for CT2 remains.
      
      As the BYE request is sent from port 50601, it belongs to another conntrack,
      call it CT3, different from CT2 due to the port difference. So the BYE
      request will not remove the RTCP expect for CT2.
      
      Then another outgoing call is made, with the same RTP port being used (not
      necessarily, but possibly). CPE first sends the INVITE request(3) to Server
      A, and tries to create an RTP&RTCP expect pair for this CT1. In the current
      ALG implementation, the RTCP expect for CT1 cannot be created because it
      'conflicts' with the residual one for CT2. As a result the INVITE request
      fails to send.
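
The idea of reusing a same-tuple expect that belongs to another conntrack can be modelled with a toy table. This is only a sketch: the table, the tuple layout, the integer master ids and expect_related() are hypothetical and far simpler than the real expectation code, which is not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_EXPECT 8

/* Hypothetical, simplified expectation tuple and table entry. */
struct tuple { unsigned src_ip, dst_ip; unsigned short dst_port; };
struct expect { struct tuple t; int master; bool used; };

enum exp_result { EXP_ADDED, EXP_REUSED, EXP_CLASH, EXP_FULL };

static struct expect expect_table[MAX_EXPECT];

static bool tuple_eq(const struct tuple *a, const struct tuple *b)
{
	return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
	       a->dst_port == b->dst_port;
}

/*
 * Register an expect for conntrack `master`. With reuse_other set, an
 * existing expect with the same tuple owned by a different conntrack
 * is reused instead of being reported as a clash.
 */
static enum exp_result expect_related(const struct tuple *t, int master,
				      bool reuse_other)
{
	struct expect *free_slot = NULL;

	for (size_t i = 0; i < MAX_EXPECT; i++) {
		struct expect *e = &expect_table[i];

		if (!e->used) {
			if (!free_slot)
				free_slot = e;
			continue;
		}
		if (!tuple_eq(&e->t, t))
			continue;
		if (e->master == master || reuse_other)
			return EXP_REUSED;
		return EXP_CLASH;
	}
	if (!free_slot)
		return EXP_FULL;
	free_slot->t = *t;
	free_slot->master = master;
	free_slot->used = true;
	return EXP_ADDED;
}
```

In both scenarios above, the residual RTCP expect plays the role of the existing entry: without reuse the second registration clashes, with reuse it succeeds.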
      Signed-off-by: xiao ruizhu <katrina.xiaorz@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: Fix rpfilter dropping vrf packets by mistake · b575b24b
      Miaohe Lin committed
      When firewalld is enabled with ipv4/ipv6 rpfilter, vrf
      ipv4/ipv6 packets will be dropped. A vrf device passes
      through the netfilter hook twice: once with the enslaved
      device and once with the l3 master device. So the in device
      may mismatch the out device, because the out device is always
      the enslaved device. The rpfilter check then fails and the
      packets are dropped by mistake.
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  11. 09 Jul, 2019 2 commits
  12. 06 Jul, 2019 2 commits
    • ipv4: Fix NULL pointer dereference in ipv4_neigh_lookup() · 537de0c8
      Ido Schimmel committed
      Both ip_neigh_gw4() and ip_neigh_gw6() can return either a valid pointer
      or an error pointer, but the code currently checks that the pointer is
      not NULL.
      
      Fix this by checking that the pointer is not an error pointer, as this
      can result in a NULL pointer dereference [1]. Specifically, I believe
      that what happened is that ip_neigh_gw4() returned '-EINVAL'
      (0xffffffffffffffea) to which the offset of 'refcnt' (0x70) was added,
      which resulted in the address 0x000000000000005a.
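
The address arithmetic above can be checked with plain 64-bit integers. This is a user-space sketch: is_err_value() mirrors the idea of the kernel's error-pointer range check, err_ptr_value() and member_addr() are hypothetical helper names, and a 64-bit address space is assumed.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO   4095
#define EINVAL_CODE 22		/* value of EINVAL on Linux */

/* Encode a negative errno the way an error pointer carries it. */
static uint64_t err_ptr_value(int err)
{
	return (uint64_t)(int64_t)-err;
}

/* True if the value lies in the top MAX_ERRNO addresses. */
static bool is_err_value(uint64_t ptr)
{
	return ptr >= (uint64_t)-MAX_ERRNO;
}

/* Address of a member at `offset` from `base`, with 64-bit wraparound. */
static uint64_t member_addr(uint64_t base, uint64_t offset)
{
	return base + offset;
}
```

An error pointer is not NULL, so a NULL check lets it through; adding the 0x70 offset then wraps around to a small address, which KASAN reports as a null-ptr-deref.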
      
      [1]
       BUG: KASAN: null-ptr-deref in refcount_inc_not_zero_checked+0x6e/0x180
       Read of size 4 at addr 000000000000005a by task swapper/2/0
      
       CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.2.0-rc6-custom-reg-179657-gaa32d89 #396
       Hardware name: Mellanox Technologies Ltd. MSN2010/SA002610, BIOS 5.6.5 08/24/2017
       Call Trace:
       <IRQ>
       dump_stack+0x73/0xbb
       __kasan_report+0x188/0x1ea
       kasan_report+0xe/0x20
       refcount_inc_not_zero_checked+0x6e/0x180
       ipv4_neigh_lookup+0x365/0x12c0
       __neigh_update+0x1467/0x22f0
       arp_process.constprop.6+0x82e/0x1f00
       __netif_receive_skb_one_core+0xee/0x170
       process_backlog+0xe3/0x640
       net_rx_action+0x755/0xd90
       __do_softirq+0x29b/0xae7
       irq_exit+0x177/0x1c0
       smp_apic_timer_interrupt+0x164/0x5e0
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      
      Fixes: 5c9f7c1d ("ipv4: Add helpers for neigh lookup for nexthop")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reported-by: Shalom Toledo <shalomt@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: remove unused parameter from skb_checksum_try_convert · e4aa33ad
      Li RongQing committed
      The check parameter is never used.
      Signed-off-by: Li RongQing <lirongqing@baidu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 04 Jul, 2019 3 commits
  14. 03 Jul, 2019 2 commits
    • bpf: add BPF_CGROUP_SOCK_OPS callback that is executed on every RTT · 23729ff2
      Stanislav Fomichev committed
      Performance impact should be minimal because it's under a new
      BPF_SOCK_OPS_RTT_CB_FLAG flag that has to be explicitly enabled.
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Priyaranjan Jha <priyarjha@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • ipv4: Fix off-by-one in route dump counter without netlink strict checking · 885b8b4d
      Stefano Brivio committed
      In commit ee28906f ("ipv4: Dump route exceptions if requested") I
      added a counter of per-node dumped routes (including actual routes and
      exceptions), analogous to the existing counter for dumped nodes. Dumping
      exceptions means we need to also keep track of how many routes are dumped
      for each node: this would be just one route per node, without exceptions.
      
      When netlink strict checking is not enabled, we dump both routes and
      exceptions at the same time: the RTM_F_CLONED flag is not used as a
      filter. In this case, the per-node counter 'i_fa' is incremented by one
      to track the single dumped route, then also incremented by one for each
      exception dumped, and then stored as netlink callback argument as skip
      counter, 's_fa', to be used when a partial dump operation restarts.
      
      The per-node counter needs to be increased by one also when we skip a
      route (exception) due to a previous non-zero skip counter, because it
      needs to match the existing skip counter, if we are dumping both routes
      and exceptions. I missed this, and only incremented the counter, for
      regular routes, if the previous skip counter was zero. This means that,
      in case of a mixed dump, partial dump operations after the first one
      will start with a mismatching skip counter value, one less than expected.
      
      This means in turn that the first exception for a given node is skipped
      every time a partial dump operation restarts, if netlink strict checking
      is not enabled (iproute < 5.0).
      
      It turns out I didn't repeat the test in its final version, commit
      de755a85 ("selftests: pmtu: Introduce list_flush_ipv4_exception test
      case"), which also counts the number of route exceptions returned, with
      iproute2 versions < 5.0 -- I was instead using the equivalent of the IPv6
      test as it was before commit b964641e ("selftests: pmtu: Make
      list_flush_ipv6_exception test more demanding").
      
      Always increment the per-node counter by one if we previously dumped
      a regular route, so that it matches the current skip counter.
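
The counter mismatch can be reproduced with a toy model of a restartable dump. This is a sketch under assumptions, not the FIB code: it models a single node with one regular route followed by n_exc exceptions, `room` is the number of entries that fit in one pass (at least 1), and dump_pass() and dump_all() are hypothetical names. The `fixed` flag selects between the old and the corrected counting.

```c
#include <stdbool.h>

/* Skip counter carried between partial dump passes (like 's_fa'). */
struct dump_state {
	int s_fa;
	bool done;
};

/* One pass emitting at most `room` entries; returns exceptions emitted. */
static int dump_pass(struct dump_state *st, int n_exc, int room, bool fixed)
{
	int i_fa = 0, exc_emitted = 0;

	if (st->s_fa == 0) {
		if (room == 0)
			goto stop;
		room--;			/* regular route emitted */
		i_fa++;
	} else if (fixed) {
		i_fa++;			/* route skipped, but still counted */
	}

	for (int e = 0; e < n_exc; e++) {
		if (i_fa >= st->s_fa) {	/* not covered by an earlier pass */
			if (room == 0)
				goto stop;
			room--;
			exc_emitted++;
		}
		i_fa++;
	}
	st->done = true;
	return exc_emitted;
stop:
	st->s_fa = i_fa;		/* remember where to restart */
	return exc_emitted;
}

/* Run passes until the node is fully dumped; returns total exceptions. */
static int dump_all(int n_exc, int room, bool fixed)
{
	struct dump_state st = { 0, false };
	int total = 0;

	while (!st.done)
		total += dump_pass(&st, n_exc, room, fixed);
	return total;
}
```

With five exceptions and room for two entries per pass, the corrected counting returns all five, while the old counting loses one exception at the first restart.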
      
      Fixes: ee28906f ("ipv4: Dump route exceptions if requested")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 02 Jul, 2019 2 commits
  16. 01 Jul, 2019 1 commit
  17. 30 Jun, 2019 1 commit
    • igmp: fix memory leak in igmpv3_del_delrec() · e5b1c6c6
      Eric Dumazet committed
      im->tomb and/or im->sources might not be NULL, but we
      currently overwrite their values blindly.
      
      Using swap() will make sure the following call to kfree_pmc(pmc)
      will properly free the psf structures.
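
Why swapping avoids the leak can be shown with a counted allocator. This is a user-space sketch: rec_alloc(), rec_free(), struct mc_rec and restore_and_count_leaks() are hypothetical names, and a single `tomb` pointer stands in for the lists handled by the real function.

```c
#include <stdbool.h>
#include <stdlib.h>

static int live_allocs;		/* outstanding allocations */

static void *rec_alloc(void)
{
	live_allocs++;
	return malloc(16);
}

static void rec_free(void *p)
{
	if (p) {
		live_allocs--;
		free(p);
	}
}

struct mc_rec { void *tomb; };

/* Returns how many records are still allocated after the restore. */
static int restore_and_count_leaks(bool use_swap)
{
	int before = live_allocs;
	struct mc_rec im = { rec_alloc() };	/* im.tomb is not NULL */
	struct mc_rec pmc = { rec_alloc() };	/* saved state to restore */

	if (use_swap) {
		void *tmp = im.tomb;		/* exchange the two pointers */

		im.tomb = pmc.tomb;
		pmc.tomb = tmp;
	} else {
		im.tomb = pmc.tomb;		/* old im.tomb becomes unreachable */
		pmc.tomb = NULL;
	}
	rec_free(pmc.tomb);			/* stands in for kfree_pmc(pmc) */
	rec_free(im.tomb);			/* later teardown of im */
	return live_allocs - before;
}
```

With the blind overwrite one record is never freed; with the swap, the old value ends up in pmc and is released together with it.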
      
      Tested with the C repro provided by syzbot, which basically does :
      
       socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
       setsockopt(3, SOL_IP, IP_ADD_MEMBERSHIP, "\340\0\0\2\177\0\0\1\0\0\0\0", 12) = 0
       ioctl(3, SIOCSIFFLAGS, {ifr_name="lo", ifr_flags=0}) = 0
       setsockopt(3, SOL_IP, IP_MSFILTER, "\340\0\0\2\177\0\0\1\1\0\0\0\1\0\0\0\377\377\377\377", 20) = 0
       ioctl(3, SIOCSIFFLAGS, {ifr_name="lo", ifr_flags=IFF_UP}) = 0
       exit_group(0)                    = ?
      
      BUG: memory leak
      unreferenced object 0xffff88811450f140 (size 64):
        comm "softirq", pid 0, jiffies 4294942448 (age 32.070s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 00  ................
          00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000c7bad083>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
          [<00000000c7bad083>] slab_post_alloc_hook mm/slab.h:439 [inline]
          [<00000000c7bad083>] slab_alloc mm/slab.c:3326 [inline]
          [<00000000c7bad083>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
          [<000000009acc4151>] kmalloc include/linux/slab.h:547 [inline]
          [<000000009acc4151>] kzalloc include/linux/slab.h:742 [inline]
          [<000000009acc4151>] ip_mc_add1_src net/ipv4/igmp.c:1976 [inline]
          [<000000009acc4151>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2100
          [<000000004ac14566>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2484
          [<0000000052d8f995>] do_ip_setsockopt.isra.0+0x1795/0x1930 net/ipv4/ip_sockglue.c:959
          [<000000004ee1e21f>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1248
          [<0000000066cdfe74>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2618
          [<000000009383a786>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3126
          [<00000000d8ac0c94>] __sys_setsockopt+0x98/0x120 net/socket.c:2072
          [<000000001b1e9666>] __do_sys_setsockopt net/socket.c:2083 [inline]
          [<000000001b1e9666>] __se_sys_setsockopt net/socket.c:2080 [inline]
          [<000000001b1e9666>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2080
          [<00000000420d395e>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
          [<000000007fd83a4b>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 24803f38 ("igmp: do not remove igmp souce list info when set link down")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Hangbin Liu <liuhangbin@gmail.com>
      Reported-by: syzbot+6ca1abd0db68b5173a4f@syzkaller.appspotmail.com
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 29 Jun, 2019 1 commit
  19. 28 Jun, 2019 1 commit
  20. 27 Jun, 2019 2 commits
    • ipv4: reset rt_iif for recirculated mcast/bcast out pkts · 5b18f128
      Stephen Suryaputra committed
      Multicast or broadcast egress packets have rt_iif set to the oif. These
      packets might be recirculated back as input, and the raw socket lookup
      may fail because the sockets are bound to the incoming interface
      (skb_iif). If rt_iif is not zero, during the lookup, inet_iif() function
      returns rt_iif instead of skb_iif. Hence, the lookup fails.
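
The behaviour of inet_iif() described above can be sketched with simplified structures. This is a user-space model: the two structs keep only the fields named in this message, and the `rt` member is a hypothetical stand-in for the route attached to the skb.

```c
#include <stddef.h>

/* Simplified stand-ins holding only the fields discussed above. */
struct rtable { int rt_iif; };
struct sk_buff {
	int skb_iif;		/* interface the packet arrived on */
	struct rtable *rt;	/* attached route, may be NULL */
};

/* Prefer a non-zero rt_iif; otherwise fall back to skb_iif. */
static int inet_iif(const struct sk_buff *skb)
{
	if (skb->rt && skb->rt->rt_iif)
		return skb->rt->rt_iif;
	return skb->skb_iif;
}
```

Once rt_iif is reset to zero for the recirculated packet, the lookup uses skb_iif and matches a socket bound to the incoming interface.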
      
      v2: Make it non vrf specific (David Ahern). Reword the changelog to
          reflect it.
      Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: fix suspicious RCU usage in fib_dump_info_fnhe() · 93ed54b1
      Eric Dumazet committed
      syzbot reported that we lack appropriate rcu_read_lock()
      protection in fib_dump_info_fnhe().
      
      net/ipv4/route.c:2875 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by syz-executor609/8966:
       #0: 00000000b7dbe288 (rtnl_mutex){+.+.}, at: netlink_dump+0xe7/0xfb0 net/netlink/af_netlink.c:2199
      
      stack backtrace:
      CPU: 0 PID: 8966 Comm: syz-executor609 Not tainted 5.2.0-rc5+ #43
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5250
       fib_dump_info_fnhe+0x9d9/0x1080 net/ipv4/route.c:2875
       fn_trie_dump_leaf net/ipv4/fib_trie.c:2141 [inline]
       fib_table_dump+0x64a/0xd00 net/ipv4/fib_trie.c:2175
       inet_dump_fib+0x83c/0xa90 net/ipv4/fib_frontend.c:1004
       rtnl_dump_all+0x295/0x490 net/core/rtnetlink.c:3445
       netlink_dump+0x558/0xfb0 net/netlink/af_netlink.c:2244
       __netlink_dump_start+0x5b1/0x7d0 net/netlink/af_netlink.c:2352
       netlink_dump_start include/linux/netlink.h:226 [inline]
       rtnetlink_rcv_msg+0x73d/0xb00 net/core/rtnetlink.c:5182
       netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
       rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5237
       netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
       netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328
       netlink_sendmsg+0x8ae/0xd70 net/netlink/af_netlink.c:1917
       sock_sendmsg_nosec net/socket.c:646 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:665
       sock_write_iter+0x27c/0x3e0 net/socket.c:994
       call_write_iter include/linux/fs.h:1872 [inline]
       new_sync_write+0x4d3/0x770 fs/read_write.c:483
       __vfs_write+0xe1/0x110 fs/read_write.c:496
       vfs_write+0x20c/0x580 fs/read_write.c:558
       ksys_write+0x14f/0x290 fs/read_write.c:611
       __do_sys_write fs/read_write.c:623 [inline]
       __se_sys_write fs/read_write.c:620 [inline]
       __x64_sys_write+0x73/0xb0 fs/read_write.c:620
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4401b9
      Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007ffc8e134978 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401b9
      RDX: 000000000000001c RSI: 0000000020000000 RDI: 0000000000000003
      RBP: 00000000006ca018 R08: 00000000004002c8 R09: 00000000004002c8
      R10: 0000000000000010 R11: 0000000000000246 R12: 0000000000401a40
      R13: 0000000000401ad0 R14: 0000000000000000 R15: 0000000000000000
      
      Fixes: ee28906f ("ipv4: Dump route exceptions if requested")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Stefano Brivio <sbrivio@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      93ed54b1
  21. 26 Jun, 2019 1 commit
  22. 25 Jun, 2019 5 commits
    • S
      ipv4: Dump route exceptions if requested · ee28906f
      Committed by Stefano Brivio
      Since commit 4895c771 ("ipv4: Add FIB nexthop exceptions."), cached
      exception routes are stored as a separate entity, so they are not dumped
      on a FIB dump, even if the RTM_F_CLONED flag is passed.
      
      This implies that the command 'ip route list cache' doesn't return any
      result anymore.
      
      If the RTM_F_CLONED is passed, and strict checking requested, retrieve
      nexthop exception routes and dump them. If no strict checking is
      requested, filtering can't be performed consistently: dump everything in
      that case.
      
      With this, we need to add an argument to the netlink callback in order to
      track how many entries were already dumped for the last leaf included in
      a partial netlink dump.
      
      A single additional argument is sufficient, even if we traverse logically
      nested structures (nexthop objects, hash table buckets, bucket chains): it
      doesn't matter if we stop in the middle of any of those, because they are
      always traversed the same way. As an example, s_i values in [], s_fa
      values in ():
      
        node (fa) #1 [1]
          nexthop #1
          bucket #1 -> #0 in chain (1)
          bucket #2 -> #0 in chain (2) -> #1 in chain (3) -> #2 in chain (4)
          bucket #3 -> #0 in chain (5) -> #1 in chain (6)
      
          nexthop #2
          bucket #1 -> #0 in chain (7) -> #1 in chain (8)
          bucket #2 -> #0 in chain (9)
        --
        node (fa) #2 [2]
          nexthop #1
          bucket #1 -> #0 in chain (1) -> #1 in chain (2)
          bucket #2 -> #0 in chain (3)
      
      it doesn't matter if we stop at (3), (4), (7) for "node #1", or at (2)
      for "node #2": walking flattens all that.
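
      The flattening argument can be sketched in userspace C. This is a
      minimal model under stated assumptions, not the kernel code: the
      model_* types, the budget parameter (standing in for a full netlink
      buffer) and the 0-based index are all hypothetical, whereas the
      diagram above counts from 1.

```c
#include <stddef.h>

#define MODEL_MAX_BUCKETS 4

/* Hypothetical model: a nexthop owns a few hash buckets, each bucket
 * holds a chain of exceptions; only the chain length matters here. */
struct model_nexthop {
    size_t nbuckets;
    size_t chain_len[MODEL_MAX_BUCKETS];
};

struct model_node {
    size_t nnexthops;
    const struct model_nexthop *nexthops;
};

/*
 * Walk every exception of one node in a fixed order. Entries whose
 * flattened index is below *s_fa were dumped by an earlier call and are
 * skipped. At most `budget` entries are emitted per call. Returns 1 when
 * the node is complete and 0 when the dump has to be resumed; *s_fa is
 * updated in both cases.
 */
static int model_dump_node(const struct model_node *node, size_t *s_fa,
                           size_t budget, size_t *emitted)
{
    size_t idx = 0; /* position across nexthops, buckets and chains */

    *emitted = 0;
    for (size_t nh = 0; nh < node->nnexthops; nh++) {
        const struct model_nexthop *n = &node->nexthops[nh];

        for (size_t b = 0; b < n->nbuckets; b++) {
            for (size_t c = 0; c < n->chain_len[b]; c++, idx++) {
                if (idx < *s_fa)
                    continue;       /* already dumped */
                if (*emitted == budget) {
                    *s_fa = idx;    /* resume from here next time */
                    return 0;
                }
                (*emitted)++;
            }
        }
    }
    *s_fa = idx;
    return 1;
}
```

      Because the walk order never changes, the single counter is enough to
      resume regardless of whether the previous call stopped between
      nexthops, between buckets, or in the middle of a chain.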
      
      It would even be possible to drop the distinction between the in-tree
      (s_i) and in-node (s_fa) counters, but a possible further improvement
      argues against this. This is only as accurate as the existing tracking
      mechanism for leaves: if a partial dump is restarted after exceptions
      are removed or expired, we might skip some non-dumped entries.
      
      To improve this, we could attach a 'sernum' attribute (similar to the
      one used for IPv6) to nexthop entities, and bump this counter whenever
      exceptions change: having a distinction between the two counters would
      make this more convenient.
      
      Listing of exception routes (modified routes pre-3.5) was tested against
      these versions of kernel and iproute2:
      
                          iproute2
      kernel         4.14.0   4.15.0   4.19.0   5.0.0   5.1.0
       3.5-rc4         +        +        +        +       +
       4.4
       4.9
       4.14
       4.15
       4.19
       5.0
       5.1
       fixed           +        +        +        +       +
      
      v7:
         - Move loop over nexthop objects to route.c, and pass struct fib_info
           and table ID to it, not a struct fib_alias (suggested by David Ahern)
         - While at it, note that the NULL check on fa->fa_info is redundant,
           and the check on RTNH_F_DEAD is also not consistent with what's done
           with regular route listing: just keep it for nhc_flags
         - Rename entry point function for dumping exceptions to
           fib_dump_info_fnhe(), and rearrange arguments for consistency with
           fib_dump_info()
         - Rename fnhe_dump_buckets() to fnhe_dump_bucket() and make it handle
           one bucket at a time
         - Expand commit message to describe why we can have a single "skip"
           counter for all exceptions stored in bucket chains in nexthop objects
           (suggested by David Ahern)
      
      v6:
         - Rebased onto net-next
         - Loop over nexthop paths too. Move loop over fnhe buckets to route.c,
           avoids need to export rt_fill_info() and to touch exceptions from
           fib_trie.c. Pass NULL as flow to rt_fill_info(), it now allows that
           (suggested by David Ahern)
      
      Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions.")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ee28906f
    • S
      ipv4/route: Allow NULL flowinfo in rt_fill_info() · d948974c
      Committed by Stefano Brivio
      In the next patch, we're going to use rt_fill_info() to dump exception
      routes upon RTM_GETROUTE with NLM_F_ROOT, meaning userspace is requesting
      a dump and not a specific route selection, which in turn implies the input
      interface is not relevant. Update rt_fill_info() to handle a NULL
      flowinfo.
      
      v7: If fl4 is NULL, explicitly set r->rtm_tos to 0: it's not initialised
          otherwise (spotted by David Ahern)
      
      v6: New patch
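
      The NULL handling from the v7 note can be sketched as follows. This is
      a userspace illustration only: model_flowi4, model_rtmsg and
      model_fill_tos() are hypothetical, reduced stand-ins, and the real
      rt_fill_info() fills in many more fields.

```c
#include <stddef.h>

/* Reduced stand-ins carrying only the TOS field (hypothetical names). */
struct model_flowi4 {
    unsigned char flowi4_tos;
};

struct model_rtmsg {
    unsigned char rtm_tos;
};

/* With no flow information (a dump rather than a specific route
 * selection), set the TOS explicitly so it is never left uninitialised. */
static void model_fill_tos(struct model_rtmsg *r,
                           const struct model_flowi4 *fl4)
{
    r->rtm_tos = (fl4 != NULL) ? fl4->flowi4_tos : 0;
}
```
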
      Suggested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d948974c
    • S
      ipv4/fib_frontend: Allow RTM_F_CLONED flag to be used for filtering · b597ca6e
      Committed by Stefano Brivio
      This functionally reverts the check introduced by commit
      e8ba330a ("rtnetlink: Update fib dumps for strict data checking")
      as modified by commit e4e92fb1 ("net/ipv4: Bail early if user only
      wants prefix entries").
      
      As we are preparing to fix listing of IPv4 cached routes, we need to
      give userspace a way to request them.
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b597ca6e
    • S
      fib_frontend, ip6_fib: Select routes or exceptions dump from RTM_F_CLONED · 564c91f7
      Committed by Stefano Brivio
      The following patches add back the ability to dump IPv4 and IPv6 exception
      routes, and we need to allow selection of regular routes or exceptions.
      
      Use RTM_F_CLONED as filter to decide whether to dump routes or exceptions:
      iproute2 passes it in dump requests (except for IPv6 cache flush requests,
      this will be fixed in iproute2) and this used to work as long as
      exceptions were stored directly in the FIB, for both IPv4 and IPv6.
      
      Caveat: if strict checking is not requested (that is, if the dump request
      doesn't go through ip_valid_fib_dump_req()), we can't filter on protocol,
      tables or route types.
      
      In this case, filtering on RTM_F_CLONED would be inconsistent: we would
      fix 'ip route list cache' by returning exception routes and at the same
      time introduce another bug in case another selector is present, e.g. on
      'ip route list cache table main' we would return all exception routes,
      without filtering on tables.
      
      Keep this consistent by applying no filters at all, and dumping both
      routes and exceptions, if strict checking is not requested. iproute2
      currently filters results anyway, and no unwanted results will be
      presented to the user. The kernel will just dump more data than needed.
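
      The selection rule described above can be sketched as a small decision
      function. This is a userspace model under stated assumptions:
      model_dump_filter and model_select_dump() are hypothetical names, the
      two flags follow the dump_routes / dump_exceptions naming from the v5
      note, and MODEL_RTM_F_CLONED is a local copy of the RTM_F_CLONED value
      from <linux/rtnetlink.h>.

```c
#include <stdbool.h>

/* Local copy of the RTM_F_CLONED flag value, for illustration only. */
#define MODEL_RTM_F_CLONED 0x200

struct model_dump_filter {
    bool dump_routes;
    bool dump_exceptions;
};

/* Decide what to dump from the request flags and the strict-check state. */
static struct model_dump_filter model_select_dump(unsigned int rtm_flags,
                                                  bool strict_check)
{
    struct model_dump_filter f = {
        .dump_routes = true,
        .dump_exceptions = true,
    };

    /* Without strict checking no other selector can be honoured either,
     * so apply no filter at all and dump both kinds of entries. */
    if (!strict_check)
        return f;

    /* With strict checking, clear the unwanted kind. */
    if (rtm_flags & MODEL_RTM_F_CLONED)
        f.dump_routes = false;
    else
        f.dump_exceptions = false;

    return f;
}
```

      With strict checking, a request carrying the flag yields exceptions
      only and a request without it yields regular routes only; without
      strict checking both flags stay set, which matches the "dump
      everything and let iproute2 filter" behaviour described above.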
      
      v7: No changes
      
      v6: Rebase onto net-next, no changes
      
      v5: New patch: add dump_routes and dump_exceptions flags in filter and
          simply clear the unwanted one if strict checking is enabled, don't
          ignore NLM_F_MATCH and don't set filter_set if NLM_F_MATCH is set.
          Skip filtering altogether if no strict checking is requested:
          selecting routes or exceptions only would be inconsistent with the
          fact we can't filter on tables.
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      564c91f7
    • S
      ipv4: fix confirm_addr_indev() when enable route_localnet · 650638a7
      Committed by Shijie Luo
      When arp_ignore=3, the NIC won't reply for host-scope addresses, but
      if route_localnet is enabled, we need to reply for IP addresses in
      127.0.0.0/8 with scope RT_SCOPE_HOST.
      
      Fixes: d0daebc3 ("ipv4: Add interface option to enable routing of 127.0.0.0/8")
      Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
      Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      650638a7