1. 27 Apr, 2018 5 commits
    • udp: add gso segment cmsg · 2e8de857
      Committed by Willem de Bruijn
      Allow specifying segment size in the send call.
      
      The new control message performs the same function as socket option
      UDP_SEGMENT while avoiding the extra system call.
      
      [ Export udp_cmsg_send for ipv6. -DaveM ]
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
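For illustration, a userspace sender might attach the per-call segment size roughly as sketched below. This is a minimal sketch, not code from the patch: `attach_gso_cmsg` is a hypothetical helper name, and the fallback values for `SOL_UDP` (17) and `UDP_SEGMENT` (103) are assumed to match the kernel's uapi headers when libc does not provide them. The control message payload is assumed to be a 16-bit segment size.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_UDP
#define SOL_UDP 17        /* assumed: same value as IPPROTO_UDP */
#endif
#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103   /* assumed: value from linux/udp.h */
#endif

/*
 * Attach a UDP_SEGMENT control message to msg, using ctrl as backing
 * storage. ctrl must be suitably aligned for struct cmsghdr and at least
 * CMSG_SPACE(sizeof(uint16_t)) bytes long.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int attach_gso_cmsg(struct msghdr *msg, void *ctrl, size_t ctrl_len,
                           uint16_t gso_size)
{
    struct cmsghdr *cm;

    if (ctrl_len < CMSG_SPACE(sizeof(gso_size)))
        return -1;

    memset(ctrl, 0, ctrl_len);
    msg->msg_control = ctrl;
    msg->msg_controllen = CMSG_SPACE(sizeof(gso_size));

    cm = CMSG_FIRSTHDR(msg);
    cm->cmsg_level = SOL_UDP;
    cm->cmsg_type = UDP_SEGMENT;
    cm->cmsg_len = CMSG_LEN(sizeof(gso_size));
    /* memcpy avoids assumptions about the alignment of CMSG_DATA() */
    memcpy(CMSG_DATA(cm), &gso_size, sizeof(gso_size));
    return 0;
}
```

The caller would then point `msg_iov` at the concatenated payloads and pass the `msghdr` to `sendmsg()`, so no separate `setsockopt()` call is needed for that send.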
    • udp: paged allocation with gso · 15e36f5b
      Committed by Willem de Bruijn
      When sending large datagrams that are later segmented, store data in
      page frags to avoid copying from linear in skb_segment.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: generate gso with UDP_SEGMENT · bec1f6f6
      Committed by Willem de Bruijn
      Support generic segmentation offload for udp datagrams. Callers can
      concatenate and send at once the payload of multiple datagrams with
      the same destination.
      
      To set segment size, the caller sets socket option UDP_SEGMENT to the
      length of each discrete payload. This value must be smaller than or
      equal to the relevant MTU.
      
      A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
      per send call basis.
      
      Total byte length may then exceed MTU. If not an exact multiple of
      segment size, the last segment will be shorter.
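The segment arithmetic described here can be sketched in a few lines. This is an illustrative userspace calculation, not kernel code: `gso_compute_layout` and `struct gso_layout` are hypothetical names, and `max_payload` stands for the largest payload that fits in the relevant MTU after headers, which the caller is assumed to have worked out.

```c
#include <stddef.h>

struct gso_layout {
    size_t nsegs;     /* number of datagrams the payload is split into */
    size_t last_len;  /* payload length of the final datagram */
};

/*
 * Compute how a concatenated payload of total_len bytes is split into
 * datagrams of seg_size bytes each. Returns 0 and fills *out on success,
 * or -1 if the inputs are invalid (empty payload, zero segment size, or
 * a segment size that does not fit in max_payload).
 */
static int gso_compute_layout(size_t total_len, size_t seg_size,
                              size_t max_payload, struct gso_layout *out)
{
    if (total_len == 0 || seg_size == 0 || seg_size > max_payload)
        return -1;

    /* Round up: a trailing partial segment still needs its own datagram. */
    out->nsegs = (total_len + seg_size - 1) / seg_size;
    /* Everything not covered by the full-size segments goes in the last. */
    out->last_len = total_len - (out->nsegs - 1) * seg_size;
    return 0;
}
```

For example, 4000 bytes with a 1472-byte segment size gives three datagrams, the last carrying 1056 bytes.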
      
      The implementation adds a gso_size field to the udp socket, ip(v6)
      cmsg cookie and inet_cork structure to be able to set the value at
      setsockopt or cmsg time and to work with both lockless and corked
      paths.
      
      Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.
      
          tcp tso
           3197 MB/s 54232 msg/s 54232 calls/s
               6,457,754,262      cycles
      
          tcp gso
           1765 MB/s 29939 msg/s 29939 calls/s
              11,203,021,806      cycles
      
          tcp without tso/gso *
            739 MB/s 12548 msg/s 12548 calls/s
              11,205,483,630      cycles
      
          udp
            876 MB/s 14873 msg/s 624666 calls/s
              11,205,777,429      cycles
      
          udp gso
           2139 MB/s 36282 msg/s 36282 calls/s
              11,204,374,561      cycles
      
         [*] after reverting commit 0a6b2a1d
             ("tcp: switch to GSO being always on")
      
      Measured total system cycles ('-a') for one core while pinning both
      the network receive path and benchmark process to that core:
      
        perf stat -a -C 12 -e cycles \
          ./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4
      
      Note the reduction in calls/s with GSO. Bytes per syscall
      increases from 1470 to 61818.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
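The bytes-per-syscall figures follow from the throughput and calls/s columns for the udp and udp gso runs. The sketch below assumes the benchmark's "MB" means 2^20 bytes, which is the interpretation that reproduces the quoted numbers; `bytes_per_call` is a hypothetical helper name.

```c
#include <stdint.h>

/*
 * Average bytes sent per send call, given throughput in MB/s (MB taken
 * as 2^20 bytes, an assumption) and the number of send calls per second.
 * Uses integer division, so the result is truncated.
 */
static uint64_t bytes_per_call(uint64_t mb_per_sec, uint64_t calls_per_sec)
{
    if (calls_per_sec == 0)
        return 0;
    return (mb_per_sec << 20) / calls_per_sec;
}
```

With the numbers above, `bytes_per_call(876, 624666)` yields 1470 for plain udp and `bytes_per_call(2139, 36282)` yields 61818 for udp gso.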
    • udp: add udp gso · ee80d1eb
      Committed by Willem de Bruijn
      Implement generic segmentation offload support for udp datagrams. A
      follow-up patch adds support to the protocol stack to generate such
      packets.
      
      UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
      a large payload into a number of discrete UDP datagrams.
      
      The implementation adds a GSO type SKB_UDP_GSO_L4 to differentiate it
      from UFO (SKB_UDP_GSO).
      
      IPPROTO_UDPLITE is excluded, as that protocol has no gso handler
      registered.
      
      [ Export __udp_gso_segment for ipv6. -DaveM ]
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: expose inet cork to udp · 1cd7884d
      Committed by Willem de Bruijn
      UDP segmentation offload needs access to inet_cork in the udp layer.
      Pass the struct to ip(6)_make_skb instead of allocating it on the
      stack in that function itself.
      
      This patch is a noop otherwise.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 26 Apr, 2018 2 commits
    • ipv6: addrconf: don't evaluate keep_addr_on_down twice · 0aef78aa
      Committed by Ivan Vecera
      The addrconf_ifdown() evaluates keep_addr_on_down state twice. There
      is no need to do it.
      
      Cc: David Ahern <dsahern@gmail.com>
      Signed-off-by: Ivan Vecera <cera@cera.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: sr: Compute flowlabel for outer IPv6 header of seg6 encap mode · b5facfdb
      Committed by Ahmed Abdelsalam
      ECMP (equal-cost multipath) hashes are typically computed on the packets'
      5-tuple (src IP, dst IP, src port, dst port, L4 proto).
      
      For encapsulated packets, the L4 data is not readily available and ECMP
      hashing will often revert to (src IP, dst IP). This will lead to traffic
      polarization on a single ECMP path, causing congestion and waste of network
      capacity.
      
      In IPv6, the 20-bit flow label field is also used as part of the ECMP hash.
      In the absence of L4 data, the hashing will be on (src IP, dst IP, flow
      label). Having a non-zero flow label is thus important for proper traffic
      load balancing when L4 data is unavailable (i.e., when packets are
      encapsulated).
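The polarization effect can be illustrated with a toy path selector. This is only a sketch under simplifying assumptions: `mix32` is an arbitrary mixing function and not the kernel's flow hash, addresses are reduced to 32-bit stand-ins for 128-bit IPv6 addresses, and `ecmp_pick` is a hypothetical name.

```c
#include <stdint.h>

/* Toy mixing step; stands in for a real flow hash (assumption). */
static uint32_t mix32(uint32_t h, uint32_t v)
{
    h ^= v;
    h *= 0x9E3779B1u;
    h ^= h >> 15;
    return h;
}

/*
 * Pick one of npaths equal-cost paths from (src, dst, flow label).
 * Only the low 20 bits of flowlabel are used, matching the width of the
 * IPv6 flow label field.
 */
static unsigned int ecmp_pick(uint32_t src, uint32_t dst, uint32_t flowlabel,
                              unsigned int npaths)
{
    uint32_t h = 0;

    h = mix32(h, src);
    h = mix32(h, dst);
    h = mix32(h, flowlabel & 0xFFFFFu);
    return npaths ? h % npaths : 0;
}
```

When every encapsulated flow between the same two tunnel endpoints carries flow label 0, all three inputs are identical and every flow lands on the same path; a per-flow label gives the hash something to spread on.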
      
      Currently, the seg6_do_srh_encap() function extracts the original packet's
      flow label and sets it as the outer IPv6 flow label. There are two issues
      with this behaviour:
      
      a) There is no guarantee that the inner flow label is set by the source.
      b) If the original packet is not IPv6, the flow label will be set to
      zero (e.g., IPv4 or L2 encap).
      
      This patch adds a function, named seg6_make_flowlabel(), that computes a
      flow label from a given skb. It supports IPv6, IPv4 and L2 payloads, and
      leverages the per-namespace 'seg6_flowlabel' sysctl value.
      
      The currently supported behaviours are as follows:
      -1  set flowlabel to zero.
       0  copy flowlabel from the inner packet in case of inner IPv6
          (set flowlabel to 0 in case of IPv4/L2).
       1  compute the flowlabel using seg6_make_flowlabel().
      
      This patch has been tested for IPv6, IPv4, and L2 traffic.
      Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
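The three sysctl behaviours listed in this commit message can be modelled as a small decision function. This is a userspace sketch, not the kernel's seg6_make_flowlabel(): `pick_outer_flowlabel` is a hypothetical name, `pkt_hash` stands in for whatever per-flow hash the kernel derives from the skb, and values are kept in host byte order for simplicity.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLOWLABEL_MASK 0x000FFFFFu  /* the IPv6 flow label is 20 bits wide */

/*
 * mode follows the sysctl values described above:
 *   -1: always use a zero flow label
 *    0: copy the inner flow label if the inner packet is IPv6, else zero
 *    1: derive the flow label from a per-flow hash
 */
static uint32_t pick_outer_flowlabel(int mode, bool inner_is_ipv6,
                                     uint32_t inner_flowlabel,
                                     uint32_t pkt_hash)
{
    if (mode > 0)
        return pkt_hash & FLOWLABEL_MASK;
    if (mode == 0 && inner_is_ipv6)
        return inner_flowlabel & FLOWLABEL_MASK;
    return 0;
}
```

Only mode 1 yields a non-zero label for IPv4 and L2 payloads, which is the case the commit message identifies as otherwise falling back to zero.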
  3. 25 Apr, 2018 1 commit
  4. 24 Apr, 2018 3 commits
  5. 23 Apr, 2018 2 commits
    • net: fib_rules: add extack support · b16fb418
      Committed by Roopa Prabhu
      Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: sr: fix NULL pointer dereference in seg6_do_srh_encap()- v4 pkts · a957fa19
      Committed by Ahmed Abdelsalam
      In case of seg6 in encap mode, seg6_do_srh_encap() calls set_tun_src()
      in order to set the src addr of outer IPv6 header.
      
      The net_device is required for set_tun_src(). However, calling ip6_dst_idev()
      on the dst_entry in case of IPv4 traffic results in the following bug.
      
      Using just dst->dev should fix this BUG.
      
      [  196.242461] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
      [  196.242975] PGD 800000010f076067 P4D 800000010f076067 PUD 10f060067 PMD 0
      [  196.243329] Oops: 0000 [#1] SMP PTI
      [  196.243468] Modules linked in: nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd input_leds glue_helper led_class pcspkr serio_raw mac_hid video autofs4 hid_generic usbhid hid e1000 i2c_piix4 ahci pata_acpi libahci
      [  196.244362] CPU: 2 PID: 1089 Comm: ping Not tainted 4.16.0+ #1
      [  196.244606] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  196.244968] RIP: 0010:seg6_do_srh_encap+0x1ac/0x300
      [  196.245236] RSP: 0018:ffffb2ce00b23a60 EFLAGS: 00010202
      [  196.245464] RAX: 0000000000000000 RBX: ffff8c7f53eea300 RCX: 0000000000000000
      [  196.245742] RDX: 0000f10000000000 RSI: ffff8c7f52085a6c RDI: ffff8c7f41166850
      [  196.246018] RBP: ffffb2ce00b23aa8 R08: 00000000000261e0 R09: ffff8c7f41166800
      [  196.246294] R10: ffffdce5040ac780 R11: ffff8c7f41166828 R12: ffff8c7f41166808
      [  196.246570] R13: ffff8c7f52085a44 R14: ffffffffb73211c0 R15: ffff8c7e69e44200
      [  196.246846] FS:  00007fc448789700(0000) GS:ffff8c7f59d00000(0000) knlGS:0000000000000000
      [  196.247286] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  196.247526] CR2: 0000000000000000 CR3: 000000010f05a000 CR4: 00000000000406e0
      [  196.247804] Call Trace:
      [  196.247972]  seg6_do_srh+0x15b/0x1c0
      [  196.248156]  seg6_output+0x3c/0x220
      [  196.248341]  ? prandom_u32+0x14/0x20
      [  196.248526]  ? ip_idents_reserve+0x6c/0x80
      [  196.248723]  ? __ip_select_ident+0x90/0x100
      [  196.248923]  ? ip_append_data.part.50+0x6c/0xd0
      [  196.249133]  lwtunnel_output+0x44/0x70
      [  196.249328]  ip_send_skb+0x15/0x40
      [  196.249515]  raw_sendmsg+0x8c3/0xac0
      [  196.249701]  ? _copy_from_user+0x2e/0x60
      [  196.249897]  ? rw_copy_check_uvector+0x53/0x110
      [  196.250106]  ? _copy_from_user+0x2e/0x60
      [  196.250299]  ? copy_msghdr_from_user+0xce/0x140
      [  196.250508]  sock_sendmsg+0x36/0x40
      [  196.250690]  ___sys_sendmsg+0x292/0x2a0
      [  196.250881]  ? _cond_resched+0x15/0x30
      [  196.251074]  ? copy_termios+0x1e/0x70
      [  196.251261]  ? _copy_to_user+0x22/0x30
      [  196.251575]  ? tty_mode_ioctl+0x1c3/0x4e0
      [  196.251782]  ? _cond_resched+0x15/0x30
      [  196.251972]  ? mutex_lock+0xe/0x30
      [  196.252152]  ? vvar_fault+0xd2/0x110
      [  196.252337]  ? __do_fault+0x1f/0xc0
      [  196.252521]  ? __handle_mm_fault+0xc1f/0x12d0
      [  196.252727]  ? __sys_sendmsg+0x63/0xa0
      [  196.252919]  __sys_sendmsg+0x63/0xa0
      [  196.253107]  do_syscall_64+0x72/0x200
      [  196.253305]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      [  196.253530] RIP: 0033:0x7fc4480b0690
      [  196.253715] RSP: 002b:00007ffde9f252f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      [  196.254053] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007fc4480b0690
      [  196.254331] RDX: 0000000000000000 RSI: 000000000060a360 RDI: 0000000000000003
      [  196.254608] RBP: 00007ffde9f253f0 R08: 00000000002d1e81 R09: 0000000000000002
      [  196.254884] R10: 00007ffde9f250c0 R11: 0000000000000246 R12: 0000000000b22070
      [  196.255205] R13: 20c49ba5e353f7cf R14: 431bde82d7b634db R15: 00007ffde9f278fe
      [  196.255484] Code: a5 0f b6 45 c0 41 88 41 28 41 0f b6 41 2c 48 c1 e0 04 49 8b 54 01 38 49 8b 44 01 30 49 89 51 20 49 89 41 18 48 8b 83 b0 00 00 00 <48> 8b 30 49 8b 86 08 0b 00 00 48 8b 40 20 48 8b 50 08 48 0b 10
      [  196.256190] RIP: seg6_do_srh_encap+0x1ac/0x300 RSP: ffffb2ce00b23a60
      [  196.256445] CR2: 0000000000000000
      [  196.256676] ---[ end trace 71af7d093603885c ]---
      
      Fixes: 8936ef76 ("ipv6: sr: fix NULL pointer dereference when setting encap source address")
      Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 22 Apr, 2018 7 commits
  7. 20 Apr, 2018 7 commits
    • net/ipv6: Fix ip6_convert_metrics() bug · 263243d6
      Committed by Eric Dumazet
      If ip6_convert_metrics() fails to allocate memory, it should not
      overwrite rt->fib6_metrics or we risk a crash later as syzbot found.
      
      BUG: KASAN: null-ptr-deref in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
      BUG: KASAN: null-ptr-deref in refcount_sub_and_test+0x92/0x330 lib/refcount.c:179
      Read of size 4 at addr 0000000000000044 by task syzkaller832429/4487
      
      CPU: 1 PID: 4487 Comm: syzkaller832429 Not tainted 4.16.0+ #6
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1b9/0x294 lib/dump_stack.c:113
       kasan_report_error mm/kasan/report.c:352 [inline]
       kasan_report.cold.7+0x6d/0x2fe mm/kasan/report.c:412
       check_memory_region_inline mm/kasan/kasan.c:260 [inline]
       check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
       kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
       atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
       refcount_sub_and_test+0x92/0x330 lib/refcount.c:179
       refcount_dec_and_test+0x1a/0x20 lib/refcount.c:212
       fib6_info_destroy+0x2d0/0x3c0 net/ipv6/ip6_fib.c:206
       fib6_info_release include/net/ip6_fib.h:304 [inline]
       ip6_route_info_create+0x677/0x3240 net/ipv6/route.c:3020
       ip6_route_add+0x23/0xb0 net/ipv6/route.c:3030
       inet6_rtm_newroute+0x142/0x160 net/ipv6/route.c:4406
       rtnetlink_rcv_msg+0x466/0xc10 net/core/rtnetlink.c:4648
       netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
       rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4666
       netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
       netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
       netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
       sock_sendmsg_nosec net/socket.c:629 [inline]
       sock_sendmsg+0xd5/0x120 net/socket.c:639
       ___sys_sendmsg+0x805/0x940 net/socket.c:2117
       __sys_sendmsg+0x115/0x270 net/socket.c:2155
       SYSC_sendmsg net/socket.c:2164 [inline]
       SyS_sendmsg+0x29/0x30 net/socket.c:2162
       do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      Fixes: d4ead6b3 ("net/ipv6: move metrics from dst to rt6_info")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: David Ahern <dsa@cumulusnetworks.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
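The rule stated in this commit message (do not overwrite the existing metrics pointer when allocation fails) is an instance of the allocate-into-a-temporary pattern. The sketch below is a userspace analogue with hypothetical names (`struct route`, `convert_metrics`, `failing_alloc`, `NMETRICS`); it is not the kernel function, and the injectable allocator exists only to make the failure path easy to exercise.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define NMETRICS 4  /* arbitrary size for the illustration */

struct route {
    unsigned int *metrics;  /* initially points at shared defaults */
};

/* Allocator that always fails, used to simulate memory pressure. */
static void *failing_alloc(size_t n)
{
    (void)n;
    return NULL;
}

/*
 * Install a private copy of src as the route's metrics.
 * The new buffer is held in a temporary until allocation has succeeded,
 * so on failure rt->metrics still points at its previous, valid value.
 */
static int convert_metrics(struct route *rt, const unsigned int *src,
                           void *(*alloc)(size_t))
{
    unsigned int *mp = alloc(NMETRICS * sizeof(*mp));

    if (!mp)
        return -ENOMEM;  /* rt->metrics deliberately left untouched */

    memcpy(mp, src, NMETRICS * sizeof(*mp));
    rt->metrics = mp;
    return 0;
}
```

Assigning the allocator's result straight to `rt->metrics` would instead leave a NULL pointer behind on failure, which later cleanup code could dereference.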
    • net/ipv6: Fix gfp_flags arg to addrconf_prefix_route · 27b10608
      Committed by David Ahern
      Eric noticed that __ipv6_ifa_notify is called under rcu_read_lock, so
      the gfp argument to addrconf_prefix_route can not be GFP_KERNEL.
      
      While scrubbing other calls I noticed addrconf_addr_gen has one
      place with GFP_ATOMIC that can be GFP_KERNEL.
      
      Fixes: acb54e3c ("net/ipv6: Add gfp_flags to route add functions")
      Reported-by: syzbot+2add39b05179b31f912f@syzkaller.appspotmail.com
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv6: Remove fib6_idev · dcd1f572
      Committed by David Ahern
      fib6_idev can be obtained from __in6_dev_get on the nexthop device
      rather than caching it in the fib6_info. Remove it.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv6: Remove unnecessary checks on fib6_idev · eea68cd3
      Committed by David Ahern
      Prior to 4832c30d ("net: ipv6: put host and anycast routes on device
      with address") host routes and anycast routes were installed with the
      device set to loopback (or VRF device once that feature was added). In the
      older code dst.dev was set to loopback (needed for packet tx) and rt6i_idev
      was used to denote the actual interface.
      
      Commit 4832c30d changed the code to have dst.dev pointing to the real
      device with the switch to lo or vrf device done on dst clones. As a
      consequence of this change a couple of device checks during route lookups
      are no longer needed. Remove them.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv6: Remove aca_idev · 9ee8cbb2
      Committed by David Ahern
      aca_idev has only 1 user - inet6_fill_ifacaddr - and it only
      wants the device index which can be extracted from the fib6_info
      nexthop.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv6: Rename addrconf_dst_alloc · 360a9887
      Committed by David Ahern
      addrconf_dst_alloc now returns a fib6_info. Update the name
      and its users to reflect the change.
      
      Rename only; no functional change intended.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/ipv6: Rename fib6_info struct elements · 93c2fb25
      Committed by David Ahern
      Change the prefix for fib6_info struct elements from rt6i_ to fib6_.
      rt6i_pcpu and rt6i_exception_bucket are left as is given that they
      point to rt6_info entries.
      
      Rename only; no functional change intended.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 19 Apr, 2018 2 commits
    • netfilter: nf_tables: NAT chain and extensions require NF_TABLES · 39f2ff08
      Committed by Pablo Neira Ayuso
      Move these options inside the scope of the 'if' NF_TABLES and
      NF_TABLES_IPV6 dependencies. This patch fixes:
      
         net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_nat_do_chain':
      >> net/ipv6/netfilter/nft_chain_nat_ipv6.c:37: undefined reference to `nft_do_chain'
         net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_exit':
      >> net/ipv6/netfilter/nft_chain_nat_ipv6.c:94: undefined reference to `nft_unregister_chain_type'
         net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_init':
      >> net/ipv6/netfilter/nft_chain_nat_ipv6.c:87: undefined reference to `nft_register_chain_type'
      
      that happens with:
      
      CONFIG_NF_TABLES=m
      CONFIG_NFT_CHAIN_NAT_IPV6=y
      
      Fixes: 02c7b25e ("netfilter: nf_tables: build-in filter chain type")
      Reported-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • ipv6: frags: fix a lockdep false positive · 415787d7
      Committed by Eric Dumazet
      lockdep does not know that the locks used by IPv4 defrag
      and IPv6 reassembly units are of different classes.
      
      It complains because of the following chains:
      
      1) sch_direct_xmit()        (lock txq->_xmit_lock)
          dev_hard_start_xmit()
           xmit_one()
            dev_queue_xmit_nit()
             packet_rcv_fanout()
              ip_check_defrag()
               ip_defrag()
                spin_lock()     (lock frag queue spinlock)
      
      2) ip6_input_finish()
          ipv6_frag_rcv()       (lock frag queue spinlock)
           ip6_frag_queue()
            icmpv6_param_prob() (lock txq->_xmit_lock at some point)
      
      We could add lockdep annotations, but we also can make sure IPv6
      calls icmpv6_param_prob() only after the release of the frag queue spinlock,
      since this naturally makes frag queue spinlock a leaf in lock hierarchy.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
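The approach described in this commit message, deferring the error report until the frag queue lock has been released, can be sketched in userspace as follows. All names here (`toy_lock`, `toy_acquire`, `toy_release`, `send_param_prob`, `queue_fragment`) are hypothetical stand-ins, and the "lock" is a plain flag rather than a real spinlock, so this only illustrates the ordering.

```c
#include <stdbool.h>

struct toy_lock { bool held; };  /* flag standing in for a spinlock */

static void toy_acquire(struct toy_lock *l) { l->held = true; }
static void toy_release(struct toy_lock *l) { l->held = false; }

struct frag_queue {
    struct toy_lock lock;
    int nfrags;
};

static int reports_sent;
static bool reported_with_lock_held;

/* Stand-in for an error report that may take other locks internally. */
static void send_param_prob(const struct frag_queue *q)
{
    reports_sent++;
    if (q->lock.held)
        reported_with_lock_held = true;
}

/*
 * Queue one fragment. A malformed fragment is only noted while the lock
 * is held; the report is sent after the lock is dropped, so the queue
 * lock is never held while other locks are taken.
 */
static int queue_fragment(struct frag_queue *q, bool malformed)
{
    bool need_report = false;
    int ret = 0;

    toy_acquire(&q->lock);
    if (malformed) {
        need_report = true;
        ret = -1;
    } else {
        q->nfrags++;
    }
    toy_release(&q->lock);

    if (need_report)
        send_param_prob(q);
    return ret;
}
```

Because nothing else is acquired while the queue lock is held, that lock stays a leaf in the lock hierarchy, which is the property the commit message relies on.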
  9. 18 Apr, 2018 11 commits