1. 14 3月, 2016 3 次提交
  2. 10 3月, 2016 1 次提交
  3. 09 3月, 2016 1 次提交
    • D
      bpf, vxlan, geneve, gre: fix usage of dst_cache on xmit · db3c6139
      Daniel Borkmann 提交于
      The assumptions from commit 0c1d70af ("net: use dst_cache for vxlan
      device"), 468dfffc ("geneve: add dst caching support") and 3c1cb4d2
      ("net/ipv4: add dst cache support for gre lwtunnels") on dst_cache usage
      when ip_tunnel_info is used is unfortunately not always valid as assumed.
      
      While it seems correct for ip_tunnel_info front-ends such as OVS, eBPF
      however can fill in ip_tunnel_info for consumers like vxlan, geneve or gre
      with different remote dsts, tos, etc, therefore they cannot be assumed as
      packet independent.
      
      Right now vxlan, geneve, gre would cache the dst for eBPF and every packet
      would reuse the same entry that was first created on the initial route
      lookup. eBPF doesn't store/cache the ip_tunnel_info, so each skb may have
      a different one.
      
      Fix it by adding a flag that checks the ip_tunnel_info. Also the !tos test
      in vxlan needs to be handeled differently in this context as it is currently
      inferred from ip_tunnel_info as well if present. ip_tunnel_dst_cache_usable()
      helper is added for the three tunnel cases, which checks if we can use dst
      cache.
      
      Fixes: 0c1d70af ("net: use dst_cache for vxlan device")
      Fixes: 468dfffc ("geneve: add dst caching support")
      Fixes: 3c1cb4d2 ("net/ipv4: add dst cache support for gre lwtunnels")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      db3c6139
  4. 08 3月, 2016 2 次提交
    • E
      tcp: fix tcpi_segs_in after connection establishment · a9d99ce2
      Eric Dumazet 提交于
      If final packet (ACK) of 3WHS is lost, it appears we do not properly
      account the following incoming segment into tcpi_segs_in
      
      While we are at it, starts segs_in with one, to count the SYN packet.
      
      We do not yet count number of SYN we received for a request sock, we
      might add this someday.
      
      packetdrill script showing proper behavior after fix :
      
      // Tests tcpi_segs_in when 3rd packet (ACK) of 3WHS is lost
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop>
         +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
      +.020 < P. 1:1001(1000) ack 1 win 32792
      
         +0 accept(3, ..., ...) = 4
      
      +.000 %{ assert tcpi_segs_in == 2, 'tcpi_segs_in=%d' % tcpi_segs_in }%
      
      Fixes: 2efd055c ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9d99ce2
    • Z
      arp: correct return value of arp_rcv · 8dfd329f
      Zhang Shengju 提交于
      Currently, arp_rcv() always return zero on a packet delivery upcall.
      
      To make its behavior more compliant with the way this API should be
      used, this patch changes this to let it return NET_RX_SUCCESS when the
      packet is proper handled, and NET_RX_DROP otherwise.
      
      v1->v2:
      If sanity check is failed, call kfree_skb() instead of consume_skb(), then
      return the correct return value.
      Signed-off-by: NZhang Shengju <zhangshengju@cmss.chinamobile.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8dfd329f
  5. 04 3月, 2016 1 次提交
    • B
      mld, igmp: Fix reserved tailroom calculation · 1837b2e2
      Benjamin Poirier 提交于
      The current reserved_tailroom calculation fails to take hlen and tlen into
      account.
      
      skb:
      [__hlen__|__data____________|__tlen___|__extra__]
      ^                                               ^
      head                                            skb_end_offset
      
      In this representation, hlen + data + tlen is the size passed to alloc_skb.
      "extra" is the extra space made available in __alloc_skb because of
      rounding up by kmalloc. We can reorder the representation like so:
      
      [__hlen__|__data____________|__extra__|__tlen___]
      ^                                               ^
      head                                            skb_end_offset
      
      The maximum space available for ip headers and payload without
      fragmentation is min(mtu, data + extra). Therefore,
      reserved_tailroom
      = data + extra + tlen - min(mtu, data + extra)
      = skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
      = skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)
      
      Compare the second line to the current expression:
      reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
      and we can see that hlen and tlen are not taken into account.
      
      The min() in the third line can be expanded into:
      if mtu < skb_tailroom - tlen:
      	reserved_tailroom = skb_tailroom - mtu
      else:
      	reserved_tailroom = tlen
      
      Depending on hlen, tlen, mtu and the number of multicast address records,
      the current code may output skbs that have less tailroom than
      dev->needed_tailroom or it may output more skbs than needed because not all
      space available is used.
      
      Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
      Signed-off-by: NBenjamin Poirier <bpoirier@suse.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1837b2e2
  6. 03 3月, 2016 5 次提交
    • E
      net/ipv4: remove left over dead code · a9d56235
      Eric Engestrom 提交于
      8cc785f6 ("net: ipv4: make the ping
      /proc code AF-independent") removed the code using it, but renamed this
      variable instead of removing it.
      Signed-off-by: NEric Engestrom <eric.engestrom@imgtec.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9d56235
    • P
      netfilter: nft_masq: support port range · 8a6bf5da
      Pablo Neira Ayuso 提交于
      Complete masquerading support by allowing port range selection.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      8a6bf5da
    • F
      netfilter: xtables: don't hook tables by default · b9e69e12
      Florian Westphal 提交于
      delay hook registration until the table is being requested inside a
      namespace.
      
      Historically, a particular table (iptables mangle, ip6tables filter, etc)
      was registered on module load.
      
      When netns support was added to iptables only the ip/ip6tables ruleset was
      made namespace aware, not the actual hook points.
      
      This means f.e. that when ipt_filter table/module is loaded on a system,
      then each namespace on that system has an (empty) iptables filter ruleset.
      
      In other words, if a namespace sends a packet, such skb is 'caught' by
      netfilter machinery and fed to hooking points for that table (i.e. INPUT,
      FORWARD, etc).
      
      Thanks to Eric Biederman, hooks are no longer global, but per namespace.
      
      This means that we can avoid allocation of empty ruleset in a namespace and
      defer hook registration until we need the functionality.
      
      We register a tables hook entry points ONLY in the initial namespace.
      When an iptables get/setockopt is issued inside a given namespace, we check
      if the table is found in the per-namespace list.
      
      If not, we attempt to find it in the initial namespace, and, if found,
      create an empty default table in the requesting namespace and register the
      needed hooks.
      
      Hook points are destroyed only once namespace is deleted, there is no
      'usage count' (it makes no sense since there is no 'remove table' operation
      in xtables api).
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      b9e69e12
    • F
      netfilter: xtables: prepare for on-demand hook register · a67dd266
      Florian Westphal 提交于
      This change prepares for upcoming on-demand xtables hook registration.
      
      We change the protoypes of the register/unregister functions.
      A followup patch will then add nf_hook_register/unregister calls
      to the iptables one.
      
      Once a hook is registered packets will be picked up, so all assignments
      of the form
      
      net->ipv4.iptable_$table = new_table
      
      have to be moved to ip(6)t_register_table, else we can see NULL
      net->ipv4.iptable_$table later.
      
      This patch doesn't change functionality; without this the actual change
      simply gets too big.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      a67dd266
    • J
      netfilter: nf_defrag_ipv4: Drop redundant ip_send_check() · 5f547391
      Joe Stringer 提交于
      Since commit 0848f642 ("inet: frags: fix defragmented packet's IP
      header for af_packet"), ip_send_check() would be called twice for
      defragmentation that occurs from netfilter ipv4 defrag hooks. Remove the
      extra call.
      Signed-off-by: NJoe Stringer <joe@ovn.org>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      5f547391
  7. 02 3月, 2016 3 次提交
  8. 27 2月, 2016 3 次提交
    • A
      GSO: Provide software checksum of tunneled UDP fragmentation offload · 22463876
      Alexander Duyck 提交于
      On reviewing the code I realized that GRE and UDP tunnels could cause a
      kernel panic if we used GSO to segment a large UDP frame that was sent
      through the tunnel with an outer checksum and hardware offloads were not
      available.
      
      In order to correct this we need to update the feature flags that are
      passed to the skb_segment function so that in the event of UDP
      fragmentation being requested for the inner header the segmentation
      function will correctly generate the checksum for the payload if we cannot
      segment the outer header.
      Signed-off-by: NAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      22463876
    • D
      net: l3mdev: prefer VRF master for source address selection · 17b693cd
      David Lamparter 提交于
      When selecting an address in context of a VRF, the vrf master should be
      preferred for address selection.  If it isn't, the user has a hard time
      getting the system to select to their preference - the code will pick
      the address off the first in-VRF interface it can find, which on a
      router could well be a non-routable address.
      Signed-off-by: NDavid Lamparter <equinox@diac24.net>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      [dsa: Fixed comment style and removed extra blank link ]
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      17b693cd
    • D
      net: l3mdev: address selection should only consider devices in L3 domain · 3f2fb9a8
      David Ahern 提交于
      David Lamparter noted a use case where the source address selection fails
      to pick an address from a VRF interface - unnumbered interfaces.
      
      Relevant commands from his script:
          ip addr add 9.9.9.9/32 dev lo
          ip link set lo up
      
          ip link add name vrf0 type vrf table 101
          ip rule add oif vrf0 table 101
          ip rule add iif vrf0 table 101
          ip link set vrf0 up
          ip addr add 10.0.0.3/32 dev vrf0
      
          ip link add name dummy2 type dummy
          ip link set dummy2 master vrf0 up
      
          --> note dummy2 has no address - unnumbered device
      
          ip route add 10.2.2.2/32 dev dummy2 table 101
          ip neigh add 10.2.2.2 dev dummy2 lladdr 02:00:00:00:00:02
      
          tcpdump -ni dummy2 &
      
      And using ping instead of his socat example:
          $ ping -I vrf0 -c1 10.2.2.2
          ping: Warning: source address might be selected on device other than vrf0.
          PING 10.2.2.2 (10.2.2.2) from 9.9.9.9 vrf0: 56(84) bytes of data.
      
      >From tcpdump:
          12:57:29.449128 IP 9.9.9.9 > 10.2.2.2: ICMP echo request, id 2491, seq 1, length 64
      
      Note the source address is from lo and is not a VRF local address. With
      this patch:
      
          $ ping -I vrf0 -c1 10.2.2.2
          PING 10.2.2.2 (10.2.2.2) from 10.0.0.3 vrf0: 56(84) bytes of data.
      
      >From tcpdump:
          12:59:25.096426 IP 10.0.0.3 > 10.2.2.2: ICMP echo request, id 2113, seq 1, length 64
      
      Now the source address comes from vrf0.
      
      The ipv4 function for selecting source address takes a const argument.
      Removing the const requires touching a lot of places, so instead
      l3mdev_master_ifindex_rcu is changed to take a const argument and then
      do the typecast to non-const as required by netdev_master_upper_dev_get_rcu.
      This is similar to what l3mdev_fib_table_rcu does.
      
      IPv6 for unnumbered interfaces appears to be selecting the addresses
      properly.
      
      Cc: David Lamparter <david@opensourcerouting.org>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3f2fb9a8
  9. 25 2月, 2016 2 次提交
  10. 24 2月, 2016 2 次提交
  11. 20 2月, 2016 1 次提交
  12. 19 2月, 2016 4 次提交
  13. 18 2月, 2016 2 次提交
  14. 17 2月, 2016 10 次提交