1. 01 Jul, 2017 1 commit
  2. 30 Jun, 2017 1 commit
  3. 28 Jun, 2017 2 commits
  4. 27 Jun, 2017 3 commits
  5. 26 Jun, 2017 1 commit
  6. 25 Jun, 2017 1 commit
    • net: store port/representator id in metadata_dst · 3fcece12
      Committed by Jakub Kicinski
      Switches and modern SR-IOV enabled NICs may multiplex traffic from Port
      representators and control messages over single set of hardware queues.
      Control messages and muxed traffic may need ordered delivery.
      
      Those requirements make it hard to comfortably use TC infrastructure today
      unless we have a way of attaching metadata to skbs at the upper device.
      Because single set of queues is used for many netdevs stopping TC/sched
      queues of all of them reliably is impossible and lower device has to
      retreat to returning NETDEV_TX_BUSY and usually has to take extra locks on
      the fastpath.
      
      This patch attempts to enable port/representative devs to attach metadata
      to skbs which carry port id.  This way representatives can be queueless and
      all queuing can be performed at the lower netdev in the usual way.
      
      Traffic arriving on the port/representative interfaces will have
      metadata attached and will subsequently be queued to the lower device for
      transmission.  The lower device should recognize the metadata and translate
      it to HW specific format which is most likely either a special header
      inserted before the network headers or descriptor/metadata fields.
      
      Metadata is associated with the lower device by storing the netdev pointer
      along with port id so that if TC decides to redirect or mirror the new
      netdev will not try to interpret it.
      
      This is mostly for SR-IOV devices since switches don't have lower netdevs
      today.
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
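The ownership rule described in this commit message (metadata carries both the port id and a pointer to the lower device, so any other netdev ignores it) can be illustrated with a small userspace model. This is only a sketch: `struct netdev`, `struct port_meta`, `struct packet` and `tx_port_id()` are hypothetical names invented here, not the kernel's `metadata_dst` API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct net_device. */
struct netdev {
    int ifindex;
};

/* Hypothetical port metadata: the port id plus the device that owns it. */
struct port_meta {
    const struct netdev *lower_dev; /* only this device understands port_id */
    uint32_t port_id;
};

/* Hypothetical packet; meta is NULL when no metadata is attached. */
struct packet {
    const struct port_meta *meta;
};

/*
 * Called by a device at transmit time. The port id is returned only when
 * the metadata was attached for this very device; a device that received
 * the packet through a redirect or mirror sees no port id at all.
 */
static bool tx_port_id(const struct netdev *dev, const struct packet *pkt,
                       uint32_t *port_id)
{
    assert(dev && pkt && port_id);
    if (!pkt->meta || pkt->meta->lower_dev != dev)
        return false;
    *port_id = pkt->meta->port_id;
    return true;
}
```

In this model a packet queued to the recorded lower device yields its port id, while the same packet handed to any other device is treated as carrying no metadata.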
  7. 24 Jun, 2017 2 commits
    • tcp: fix out-of-bounds access in ULP sysctl · 926f38e9
      Committed by Jakub Kicinski
      KASAN reports out-of-bound access in proc_dostring() coming from
      proc_tcp_available_ulp() because in case TCP ULP list is empty
      the buffer allocated for the response will not have anything
      printed into it.  Set the first byte to zero to avoid strlen()
      going out-of-bounds.
      
      Fixes: 734942cc ("tcp: ULP infrastructure")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
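The class of bug fixed here can be shown in userspace: a buffer that is filled by a loop stays uninitialized when the list is empty, so a later strlen() runs past it. The sketch below is not the kernel function; `format_names()` is a hypothetical helper that applies the same remedy of terminating the buffer before the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Write a space-separated list of names into buf and return the number of
 * characters written (excluding the terminating NUL).
 */
static size_t format_names(char *buf, size_t maxlen,
                           const char *const *names, size_t count)
{
    size_t offs = 0;

    assert(buf && maxlen > 0);
    buf[0] = '\0'; /* the fix: an empty list still yields a valid string */

    for (size_t i = 0; i < count; i++) {
        int n = snprintf(buf + offs, maxlen - offs, "%s%s",
                         offs == 0 ? "" : " ", names[i]);
        if (n < 0 || (size_t)n >= maxlen - offs) {
            buf[offs] = '\0'; /* drop the partially written name */
            break;
        }
        offs += (size_t)n;
    }
    return offs;
}
```

With the first byte set up front, callers can safely run strlen() on the buffer whether or not any name was printed.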
    • net: account for current skb length when deciding about UFO · a5cb659b
      Committed by Michal Kubeček
      Our customer encountered stuck NFS writes for blocks starting at specific
      offsets w.r.t. page boundary caused by networking stack sending packets via
      UFO enabled device with wrong checksum. The problem can be reproduced by
      composing a long UDP datagram from multiple parts using MSG_MORE flag:
      
        sendto(sd, buff, 1000, MSG_MORE, ...);
        sendto(sd, buff, 1000, MSG_MORE, ...);
        sendto(sd, buff, 3000, 0, ...);
      
      Assume this packet is to be routed via a device with MTU 1500 and
      NETIF_F_UFO enabled. When second sendto() gets into __ip_append_data(),
      this condition is tested (among others) to decide whether to call
      ip_ufo_append_data():
      
        ((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))
      
      At the moment, we already have skb with 1028 bytes of data which is not
      marked for GSO so that the test is false (fragheaderlen is usually 20).
      Thus we append second 1000 bytes to this skb without invoking UFO. Third
      sendto(), however, has sufficient length to trigger the UFO path so that we
      end up with non-UFO skb followed by a UFO one. Later on, udp_send_skb()
      uses udp_csum() to calculate the checksum but that assumes all fragments
      have correct checksum in skb->csum which is not true for UFO fragments.
      
      When checking against MTU, we need to add skb->len to length of new segment
      if we already have a partially filled skb and fragheaderlen only if there
      isn't one.
      
      In the IPv6 case, skb can only be null if this is the first segment so that
      we have to use headersize (length of the first IPv6 header) rather than
      fragheaderlen (length of IPv6 header of further fragments) for skb == NULL.
      
      Fixes: e89e9cf5 ("[IPv4/IPv6]: UFO Scatter-gather approach")
      Fixes: e4c5e13a ("ipv6: Should use consistent conditional judgement for
      	ip6 fragment between __ip6_append_data and ip6_finish_output")
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Vlad Yasevich <vyasevic@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
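The arithmetic in this commit message can be checked with a small model of the MTU test. The two functions below are a sketch of the condition before and after the change as the message describes it (IPv4 case); `struct pending`, `ufo_test_old()` and `ufo_test_new()` are hypothetical names, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified pending buffer: bytes already queued and whether it is GSO. */
struct pending {
    size_t len;
    bool is_gso;
};

/* Condition as quoted in the message: ignores data already in the skb. */
static bool ufo_test_old(size_t length, size_t fragheaderlen, size_t mtu,
                         const struct pending *skb)
{
    return (length + fragheaderlen) > mtu || (skb && skb->is_gso);
}

/*
 * Condition as described by the fix: add skb->len when a partially filled
 * skb exists, and fragheaderlen only when there is none.
 */
static bool ufo_test_new(size_t length, size_t fragheaderlen, size_t mtu,
                         const struct pending *skb)
{
    assert(mtu >= fragheaderlen);
    size_t base = skb ? skb->len : fragheaderlen;
    return (length + base) > mtu || (skb && skb->is_gso);
}
```

For the second sendto() in the reproducer (1000 new bytes, 1028 bytes already queued, MTU 1500, fragheaderlen 20) the old test evaluates 1020 > 1500 and skips UFO, while the new test evaluates 2028 > 1500 and takes the UFO path.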
  8. 23 Jun, 2017 1 commit
    • udp: fix poll() · 9bd780f5
      Committed by Paolo Abeni
      Michael reported a UDP breakage caused by the commit b65ac446
      ("udp: try to avoid 2 cache miss on dequeue").
      The function __first_packet_length() can update the checksum bits
      of the pending skb, making the scratched area out-of-sync, and
      setting skb->csum, if the skb was previously in need of checksum
      validation.
      
      On later recvmsg() for such skb, checksum validation will be
      invoked again - due to the wrong udp_skb_csum_unnecessary()
      value - and will fail, causing the valid skb to be dropped.
      
      This change addresses the issue by refreshing the scratch area in
      __first_packet_length() after the possible checksum update.
      
      Fixes: b65ac446 ("udp: try to avoid 2 cache miss on dequeue")
      Reported-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 22 Jun, 2017 1 commit
  10. 21 Jun, 2017 4 commits
    • udp: prefetch rmem_alloc in udp_queue_rcv_skb() · dd99e425
      Committed by Paolo Abeni
      During UDP packet processing, if the BH is the bottleneck, it
      always sees a cache miss while updating rmem_alloc; try to
      avoid it by prefetching the value as soon as we have the socket
      available.

      Performance under flood with multiple NIC rx queues in use is
      unaffected, but when a single NIC rx queue is in use, this
      gives ~10% performance improvement.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipmr: add netlink notifications on igmpmsg cache reports · 5a645dd8
      Committed by Julien Gomes
      Add Netlink notifications on cache reports in ipmr, in addition to the
      existing igmpmsg sent to mroute_sk.
      Send RTM_NEWCACHEREPORT notifications to RTNLGRP_IPV4_MROUTE_R.
      
      MSGTYPE, VIF_ID, SRC_ADDR and DST_ADDR Netlink attributes contain the
      same data as their equivalent fields in the igmpmsg header.
      PKT attribute is the packet sent to mroute_sk, without the added igmpmsg
      header.
      Suggested-by: Ryan Halbrook <halbrook@arista.com>
      Signed-off-by: Julien Gomes <julien@arista.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: md5: hide unused variable · 083a0326
      Committed by Arnd Bergmann
      Changing from a memcpy to per-member comparison left the
      size variable unused:
      
      net/ipv4/tcp_ipv4.c: In function 'tcp_md5_do_lookup':
      net/ipv4/tcp_ipv4.c:910:15: error: unused variable 'size' [-Werror=unused-variable]
      
      This does not show up when CONFIG_IPV6 is enabled, but the
      variable can be removed either way, along with the now unused
      assignment.
      
      Fixes: 6797318e ("tcp: md5: add an address prefix for key lookup")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • igmp: add a missing spin_lock_init() · b4846fc3
      Committed by WANG Cong
      Andrey reported a lockdep warning on non-initialized
      spinlock:
      
       INFO: trying to register non-static key.
       the code is fine but needs lockdep annotation.
       turning off the locking correctness validator.
       CPU: 1 PID: 4099 Comm: a.out Not tainted 4.12.0-rc6+ #9
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
       Call Trace:
        __dump_stack lib/dump_stack.c:16
        dump_stack+0x292/0x395 lib/dump_stack.c:52
        register_lock_class+0x717/0x1aa0 kernel/locking/lockdep.c:755
        ? 0xffffffffa0000000
        __lock_acquire+0x269/0x3690 kernel/locking/lockdep.c:3255
        lock_acquire+0x22d/0x560 kernel/locking/lockdep.c:3855
        __raw_spin_lock_bh ./include/linux/spinlock_api_smp.h:135
        _raw_spin_lock_bh+0x36/0x50 kernel/locking/spinlock.c:175
        spin_lock_bh ./include/linux/spinlock.h:304
        ip_mc_clear_src+0x27/0x1e0 net/ipv4/igmp.c:2076
        igmpv3_clear_delrec+0xee/0x4f0 net/ipv4/igmp.c:1194
        ip_mc_destroy_dev+0x4e/0x190 net/ipv4/igmp.c:1736
      
      We miss a spin_lock_init() in igmpv3_add_delrec(), probably
      because previously we never used it on this code path. Since
      we already unlink it from the global mc_tomb list, it is
      probably safe not to acquire this spinlock here. It does no
      harm to have it though, to avoid conditional locking.
      
      Fixes: c38b7d32 ("igmp: acquire pmc lock for ip_mc_clear_src()")
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 20 Jun, 2017 3 commits
    • tcp: md5: add TCP_MD5SIG_EXT socket option to set a key address prefix · 8917a777
      Committed by Ivan Delalande
      Replace first padding in the tcp_md5sig structure with a new flag field
      and address prefix length so it can be specified when configuring a new
      key for TCP MD5 signature. The tcpm_flags field will only be used if the
      socket option is TCP_MD5SIG_EXT to avoid breaking existing programs, and
      tcpm_prefixlen only when the TCP_MD5SIG_FLAG_PREFIX flag is set.
      Signed-off-by: Bob Gilligan <gilligan@arista.com>
      Signed-off-by: Eric Mowat <mowat@arista.com>
      Signed-off-by: Ivan Delalande <colona@arista.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: md5: add an address prefix for key lookup · 6797318e
      Committed by Ivan Delalande
      This allows the keys used for TCP MD5 signature to be used for a whole
      range of addresses, specified with a prefix length, instead of only one
      address as is currently the case.
      Signed-off-by: Bob Gilligan <gilligan@arista.com>
      Signed-off-by: Eric Mowat <mowat@arista.com>
      Signed-off-by: Ivan Delalande <colona@arista.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
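Matching a key against a range of addresses comes down to comparing only the leading bits covered by the prefix length. The helper below is a minimal sketch for IPv4 addresses held in host byte order; `ipv4_prefix_match()` is a hypothetical name and this is not the kernel's lookup code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when the first prefixlen bits of a and b (host byte order) agree. */
static bool ipv4_prefix_match(uint32_t a, uint32_t b, unsigned prefixlen)
{
    assert(prefixlen <= 32);
    if (prefixlen == 0)
        return true; /* an empty prefix covers every address */

    /* prefixlen is 1..32 here, so the shift count is 0..31 and defined. */
    uint32_t mask = (uint32_t)(UINT32_MAX << (32 - prefixlen));
    return ((a ^ b) & mask) == 0;
}
```

A prefix length of 32 reproduces the earlier single-address behaviour, while shorter prefixes let one key cover a whole subnet.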
    • netfilter: ipt_CLUSTERIP: do not hold dev · 202f59af
      Committed by Xin Long
      Holding dev in an iptables target is a bad idea. When the dev is
      being removed, unregister_netdevice has to wait for the dev to become
      free, and dmesg keeps logging the error:
      
        kernel:unregister_netdevice: waiting for veth0_in to become free. \
        Usage count = 1
      
      until iptables rules with this target are removed manually.
      
      Worse, when a netns is deleted, a virtual nic is deleted rather than
      reset to init_net in default_device_ops exit/exit_batch. Because this
      happens before the iptables rules are flushed in iptable_filter_net_ops
      exit, unregister_netdevice blocks waiting for the nic to become free.

      So unregister_netdevice waits for the iptables rules to be flushed,
      while the rules can only be flushed after unregister_netdevice. This
      'deadlock' makes unregister_netdevice block forever. As the netns is
      no longer available at that moment, the iptables rules cannot be
      flushed manually either.
      
      The reproducer can be:
      
        # ip netns add test
        # ip link add veth0_in type veth peer name veth0_out
        # ip link set veth0_in netns test
        # ip netns exec test ip link set lo up
        # ip netns exec test ip link set veth0_in up
        # ip netns exec test iptables -I INPUT -d 1.2.3.4 -i veth0_in -j \
          CLUSTERIP --new --clustermac 89:d4:47:eb:9a:fa --total-nodes 3 \
          --local-node 1 --hashmode sourceip-sourceport
        # ip netns del test
      
      This issue can be triggered by all virtual nics with ipt_CLUSTERIP.
      
      This patch fixes it by not holding dev in ipt_CLUSTERIP and saving
      dev->ifindex instead of the dev.

      As Pablo Neira Ayuso suggested, it refreshes c->ifindex and the dev's
      mc by registering a netdevice notifier, just as xt_TEE does. This
      removes the old code that updated the dev's mc, and c->ifindex no
      longer needs to be initialized with dev->ifindex.

      However, one config can be shared by more than one target, and the
      netdev notifier is per config, not per target, so the notifier handler
      cannot get e->ip.iniface. Therefore e->ip.iniface has to be saved into
      the config.

      Note that for backwards compatibility, this patch doesn't remove the
      code that checks whether the dev exists before creating a config.

      v1->v2:
        - As Pablo Neira Ayuso suggested, register a netdevice notifier to
          manage c->ifindex and dev's mc.
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  12. 18 Jun, 2017 8 commits
    • net: remove DST_NOCACHE flag · a4c2fd7f
      Committed by Wei Wang
      DST_NOCACHE flag check has been removed from dst_release() and
      dst_hold_safe() in a previous patch because all the dst are now ref
      counted properly and can be released based on refcnt only.
      Looking at the rest of the DST_NOCACHE use, all of them can now be
      removed or replaced with other checks.
      So this patch gets rid of all the DST_NOCACHE usage and removes this flag
      completely.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: remove DST_NOGC flag · b2a9c0ed
      Committed by Wei Wang
      Now that all the components have been changed to release dst based on
      refcnt only and not depend on dst gc anymore, we can remove the
      temporary flag DST_NOGC.
      
      Note that we also need to remove the DST_NOCACHE check in dst_release()
      and dst_hold_safe() because now all the dst are released based on refcnt
      and behave as DST_NOCACHE.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: mark DST_NOGC and remove the operation of dst_free() · b838d5e1
      Committed by Wei Wang
      With the previous preparation patches, we are ready to get rid of the
      dst gc operation in ipv4 code and release dst based on refcnt only.
      So this patch adds DST_NOGC flag for all IPv4 dst and removes the calls
      to dst_free().
      At this point, all dst created in ipv4 code do not use the dst gc
      anymore and will be destroyed at the point when refcnt drops to 0.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: call dst_hold_safe() properly · 9df16efa
      Committed by Wei Wang
      This patch checks all the calls to
      dst_hold()/skb_dst_force()/dst_clone()/dst_use() to see if
      dst_hold_safe() is needed to avoid double free issue if dst
      gc is removed and dst_release() directly destroys dst when
      dst->__refcnt drops to 0.
      
      In tx path, TCP holds the sk->sk_rx_dst ref count and also holds sock_lock().
      UDP and other similar protocols always hold a refcount for
      skb->_skb_refdst. So both paths seem to be safe.
      
      In rx path, as it is lockless and skb_dst_set_noref() is likely to be
      used, dst_hold_safe() should always be used when trying to hold dst.
      
      In the routing code, if dst is held during an rcu protected session, it
      is necessary to call dst_hold_safe() as the current dst might be in its
      rcu grace period.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
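The point of a "safe" hold is that a reference may only be taken while the count is still non-zero; once it has dropped to 0 the object is on its way to destruction and must not be revived. A userspace sketch of that increment-unless-zero pattern with C11 atomics follows; `struct obj` and `hold_safe()` are hypothetical names, not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object. */
struct obj {
    atomic_int refcnt;
};

/*
 * Take a reference only if the count is still non-zero. Returns false when
 * the object has already been released and must not be used.
 */
static bool hold_safe(struct obj *o)
{
    assert(o);
    int old = atomic_load(&o->refcnt);

    while (old != 0) {
        /* On failure, old is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
            return true;
    }
    return false;
}
```

A lockless reader that found the object through a shared pointer would call `hold_safe()` and fall back to a fresh lookup when it returns false.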
    • ipv4: call dst_dev_put() properly · 95c47f9c
      Committed by Wei Wang
      As the intent of this patch series is to completely remove dst gc,
      we need to call dst_dev_put() to release the reference to dst->dev
      when removing routes from fib because we won't keep the gc list anymore
      and will lose the dst pointer right after removing the routes.
      Without the gc list, there is no way to find all the dst's that have
      dst->dev pointing to the going-down dev.
      Hence, we are doing dst_dev_put() immediately before we lose the last
      reference of the dst from the routing code. The next dst_check() will
      trigger a route re-lookup to find another route (if there is any).
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: take dst->__refcnt when caching dst in fib · 0830106c
      Committed by Wei Wang
      In IPv4 routing code, fib_nh and fib_nh_exception can hold pointers
      to struct rtable but they never increment dst->__refcnt.
      This creates the need for the dst garbage collector: when a user
      is done with this dst and calls dst_release(), it can only decrement
      dst->__refcnt and cannot free the dst even if it sees dst->__refcnt
      drop from 1 to 0 (unless the DST_NOCACHE flag is set), because the
      routing code might still hold a reference to it.
      And when the routing code tries to delete a route, it has to put the
      dst to the gc_list if dst->__refcnt is not yet 0 and have a gc thread
      running periodically to check on dst->__refcnt and finally to free dst
      when refcnt becomes 0.
      
      This patch increments dst->__refcnt when
      fib_nh/fib_nh_exception holds reference to this dst and properly release
      the dst when fib_nh/fib_nh_exception has been updated with a new dst.
      
      This patch is a preparation in order to fully get rid of dst gc later.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
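The idea of letting the cache itself own a reference can be modelled in a few lines. The sketch below is single-threaded and uses a plain int counter for clarity; `struct dst`, `struct nexthop`, `dst_new()`, `dst_hold()`, `dst_put()` and `nh_cache()` are hypothetical stand-ins, not the kernel's types or helpers.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted route entry; freed points at a test flag. */
struct dst {
    int refcnt;
    int *freed;
};

/* Hypothetical next hop holding one cached entry. */
struct nexthop {
    struct dst *cached;
};

/* Allocate an entry with one reference owned by the caller. */
static struct dst *dst_new(int *freed)
{
    struct dst *d = malloc(sizeof(*d));
    assert(d && freed);
    d->refcnt = 1;
    d->freed = freed;
    return d;
}

static void dst_hold(struct dst *d)
{
    d->refcnt++;
}

/* Drop a reference; the last one out frees the entry, no collector needed. */
static void dst_put(struct dst *d)
{
    if (--d->refcnt == 0) {
        *d->freed = 1;
        free(d);
    }
}

/* Store d in the cache, taking a reference and releasing the old entry. */
static void nh_cache(struct nexthop *nh, struct dst *d)
{
    struct dst *old = nh->cached;

    if (d)
        dst_hold(d);
    nh->cached = d;
    if (old)
        dst_put(old);
}
```

Because the cache counts as a holder, the entry survives the user's release and is destroyed exactly when the cache lets go, which is what makes a separate garbage collection pass unnecessary in this model.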
    • net: use loopback dev when generating blackhole route · 1dbe3252
      Committed by Wei Wang
      Existing ipv4/6_blackhole_route() code generates a blackhole route
      with dst->dev pointing to the passed in dst->dev.
      It is not necessary to hold reference to the passed in dst->dev
      because the packets going through this route are dropped anyway.
      A loopback interface is good enough so that we don't need to worry about
      releasing this dst->dev when this dev is going down.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: call dst_hold_safe() in udp_sk_rx_set_dst() · d24406c8
      Committed by Wei Wang
      In udp_v4/6_early_demux() code, we try to hold dst->__refcnt for
      dst with DST_NOCACHE flag. This is because later in udp_sk_rx_dst_set()
      function, we will try to cache this dst in sk for connected case.
      However, a better way to achieve this is to not try to hold dst in
      early_demux(), but in udp_sk_rx_dst_set(), call dst_hold_safe(). This
      approach is also more consistent with how tcp is handling it. And it
      will make later changes simpler.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 17 Jun, 2017 1 commit
  14. 16 Jun, 2017 6 commits
    • networking: make skb_push & __skb_push return void pointers · d58ff351
      Committed by Johannes Berg
      It seems like a historic accident that these return unsigned char *,
      and in many places that means casts are required, more often than not.
      
      Make these functions return void * and remove all the casts across
      the tree, adding a (u8 *) cast only where the unsigned char pointer
      was used directly, all done with the following spatch:
      
          @@
          expression SKB, LEN;
          typedef u8;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          @@
          - *(fn(SKB, LEN))
          + *(u8 *)fn(SKB, LEN)
      
          @@
          expression E, SKB, LEN;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          type T;
          @@
          - E = ((T *)(fn(SKB, LEN)))
          + E = fn(SKB, LEN)
      
          @@
          expression SKB, LEN;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          @@
          - fn(SKB, LEN)[0]
          + *(u8 *)fn(SKB, LEN)
      
      Note that the last part there converts from push(...)[0] to the
      more idiomatic *(u8 *)push(...).
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
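Why a `void *` return removes casts can be seen with a toy buffer. In C a `void *` converts implicitly to any object pointer type, so only direct byte access needs the `(u8 *)` style cast the spatch adds. The sketch below is a userspace model; `struct txbuf`, `txbuf_init()`, `txbuf_push()` and `struct hdr` are hypothetical names, not sk_buff code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer with headroom in front of the data. */
struct txbuf {
    uint8_t storage[64];
    uint8_t *data; /* start of valid data */
    size_t len;    /* number of valid bytes */
};

/* Hypothetical two-byte header used in the usage example. */
struct hdr {
    uint8_t type;
    uint8_t flags;
};

static void txbuf_init(struct txbuf *b, size_t headroom)
{
    assert(headroom <= sizeof(b->storage));
    b->data = b->storage + headroom;
    b->len = 0;
}

/* Prepend len bytes and return their start as void *, so callers need no cast. */
static void *txbuf_push(struct txbuf *b, size_t len)
{
    assert((size_t)(b->data - b->storage) >= len);
    b->data -= len;
    b->len += len;
    return b->data;
}
```

A caller can then write `struct hdr *h = txbuf_push(&b, sizeof(*h));` without a cast, and `*(uint8_t *)txbuf_push(&b, 1) = 0xff;` where a single byte is stored directly.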
    • networking: make skb_pull & friends return void pointers · af72868b
      Committed by Johannes Berg
      It seems like a historic accident that these return unsigned char *,
      and in many places that means casts are required, more often than not.
      
      Make these functions return void * and remove all the casts across
      the tree, adding a (u8 *) cast only where the unsigned char pointer
      was used directly, all done with the following spatch:
      
          @@
          expression SKB, LEN;
          typedef u8;
          identifier fn = {
                  skb_pull,
                  __skb_pull,
                  skb_pull_inline,
                  __pskb_pull_tail,
                  __pskb_pull,
                  pskb_pull
          };
          @@
          - *(fn(SKB, LEN))
          + *(u8 *)fn(SKB, LEN)
      
          @@
          expression E, SKB, LEN;
          identifier fn = {
                  skb_pull,
                  __skb_pull,
                  skb_pull_inline,
                  __pskb_pull_tail,
                  __pskb_pull,
                  pskb_pull
          };
          type T;
          @@
          - E = ((T *)(fn(SKB, LEN)))
          + E = fn(SKB, LEN)
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • networking: make skb_put & friends return void pointers · 4df864c1
      Committed by Johannes Berg
      It seems like a historic accident that these return unsigned char *,
      and in many places that means casts are required, more often than not.
      
      Make these functions (skb_put, __skb_put and pskb_put) return void *
      and remove all the casts across the tree, adding a (u8 *) cast only
      where the unsigned char pointer was used directly, all done with the
      following spatch:
      
          @@
          expression SKB, LEN;
          typedef u8;
          identifier fn = { skb_put, __skb_put };
          @@
          - *(fn(SKB, LEN))
          + *(u8 *)fn(SKB, LEN)
      
          @@
          expression E, SKB, LEN;
          identifier fn = { skb_put, __skb_put };
          type T;
          @@
          - E = ((T *)(fn(SKB, LEN)))
          + E = fn(SKB, LEN)
      
      which actually doesn't cover pskb_put since there are only three
      users overall.
      
      A handful of stragglers were converted manually, notably a macro in
      drivers/isdn/i4l/isdn_bsdcomp.c and, oddly enough, one of the many
      instances in net/bluetooth/hci_sock.c. In the former file, I also
      had to fix one whitespace problem spatch introduced.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • networking: convert many more places to skb_put_zero() · b080db58
      Committed by Johannes Berg
      There were many places that my previous spatch didn't find,
      as pointed out by yuan linyu in various patches.
      
      The following spatch found many more and also removes the
      now unnecessary casts:
      
          @@
          identifier p, p2;
          expression len;
          expression skb;
          type t, t2;
          @@
          (
          -p = skb_put(skb, len);
          +p = skb_put_zero(skb, len);
          |
          -p = (t)skb_put(skb, len);
          +p = skb_put_zero(skb, len);
          )
          ... when != p
          (
          p2 = (t2)p;
          -memset(p2, 0, len);
          |
          -memset(p, 0, len);
          )
      
          @@
          type t, t2;
          identifier p, p2;
          expression skb;
          @@
          t *p;
          ...
          (
          -p = skb_put(skb, sizeof(t));
          +p = skb_put_zero(skb, sizeof(t));
          |
          -p = (t *)skb_put(skb, sizeof(t));
          +p = skb_put_zero(skb, sizeof(t));
          )
          ... when != p
          (
          p2 = (t2)p;
          -memset(p2, 0, sizeof(*p));
          |
          -memset(p, 0, sizeof(*p));
          )
      
          @@
          expression skb, len;
          @@
          -memset(skb_put(skb, len), 0, len);
          +skb_put_zero(skb, len);
      
      Apply it to the tree (with one manual fixup to keep the
      comment in vxlan.c, which spatch removed.)
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
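The transformation performed by the spatch replaces a "reserve, then memset" pair with one call. The userspace model below shows the equivalence; `struct tailbuf`, `tailbuf_put()` and `tailbuf_put_zero()` are hypothetical stand-ins for the skb helpers, written only to illustrate the pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical buffer that grows at the tail. */
struct tailbuf {
    uint8_t storage[64];
    size_t len; /* number of bytes in use */
};

/* Reserve len bytes at the tail; the contents are left uninitialized. */
static void *tailbuf_put(struct tailbuf *b, size_t len)
{
    assert(len <= sizeof(b->storage) - b->len);
    void *p = b->storage + b->len;
    b->len += len;
    return p;
}

/* Reserve len bytes at the tail and clear them in one step. */
static void *tailbuf_put_zero(struct tailbuf *b, size_t len)
{
    void *p = tailbuf_put(b, len);
    memset(p, 0, len);
    return p;
}
```

Callers that previously wrote the reserve and the memset separately collapse to a single `tailbuf_put_zero()` call, which also removes the chance of the two lengths drifting apart.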
    • tcp: export do_tcp_sendpages and tcp_rate_check_app_limited functions · e3b5616a
      Committed by Dave Watson
      Export do_tcp_sendpages and tcp_rate_check_app_limited, since tls will need to
      sendpages while the socket is already locked.
      
      tcp_sendpage is exported, but requires the socket lock to not be held already.
      Signed-off-by: Aviad Yehezkel <aviadye@mellanox.com>
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Dave Watson <davejwatson@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: ULP infrastructure · 734942cc
      Committed by Dave Watson
      Add the infrastructure for attaching Upper Layer Protocols (ULPs) over TCP
      sockets. Based on a similar infrastructure in tcp_cong.  The idea is that any
      ULP can add its own logic by changing the TCP proto_ops structure to its own
      methods.
      
      Example usage:
      
      setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls"));
      
      modules will call:
      tcp_register_ulp(&tcp_tls_ulp_ops);
      
      to register/unregister their ulp, with an init function and name.
      
      A list of registered ulps will be returned by tcp_get_available_ulp, which is
      hooked up to /proc.  Example:
      
      $ cat /proc/sys/net/ipv4/tcp_available_ulp
      tls
      
      There is currently no functionality to remove or chain ULPs, but
      it should be possible to add these in the future if needed.
      Signed-off-by: Boris Pismenny <borisp@mellanox.com>
      Signed-off-by: Dave Watson <davejwatson@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
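The register-by-name, attach-by-name flow described in this message can be modelled in userspace with a small table. This is only an illustrative sketch: `struct ulp_ops`, `ulp_register()`, `ulp_unregister()`, `ulp_find()`, `ulp_attach()` and `demo_init()` are hypothetical names and signatures, not the kernel's `tcp_register_ulp()` interface, and the fixed-size table is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ULP_NAME_MAX 16
#define ULP_SLOTS 8

/* Hypothetical ops structure: a name and an init hook. */
struct ulp_ops {
    char name[ULP_NAME_MAX];
    int (*init)(void *sk);
};

static const struct ulp_ops *ulp_table[ULP_SLOTS];

/* Register ops; returns -1 when the name is taken or the table is full. */
static int ulp_register(const struct ulp_ops *ops)
{
    int free_slot = -1;

    assert(ops && ops->init);
    for (int i = 0; i < ULP_SLOTS; i++) {
        if (!ulp_table[i]) {
            if (free_slot < 0)
                free_slot = i;
        } else if (strcmp(ulp_table[i]->name, ops->name) == 0) {
            return -1;
        }
    }
    if (free_slot < 0)
        return -1;
    ulp_table[free_slot] = ops;
    return 0;
}

static void ulp_unregister(const struct ulp_ops *ops)
{
    for (int i = 0; i < ULP_SLOTS; i++)
        if (ulp_table[i] == ops)
            ulp_table[i] = NULL;
}

static const struct ulp_ops *ulp_find(const char *name)
{
    for (int i = 0; i < ULP_SLOTS; i++)
        if (ulp_table[i] && strcmp(ulp_table[i]->name, name) == 0)
            return ulp_table[i];
    return NULL;
}

/* Model of attaching by name: look up the ops and run its init hook. */
static int ulp_attach(void *sk, const char *name)
{
    const struct ulp_ops *ops = ulp_find(name);
    return ops ? ops->init(sk) : -1;
}

/* Demo init hook: marks the (fake) socket, an int, as initialized. */
static int demo_init(void *sk)
{
    *(int *)sk = 1;
    return 0;
}
```

In this model, attaching fails cleanly for an unknown name, and an empty table simply has nothing to list.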
  15. 15 Jun, 2017 1 commit
    • net: don't global ICMP rate limit packets originating from loopback · 849a44de
      Committed by Jesper Dangaard Brouer
      Florian Weimer seems to have a glibc test-case which requires that
      loopback interfaces do not get ICMP ratelimited.  This was broken by
      commit c0303efe ("net: reduce cycles spend on ICMP replies that
      gets rate limited").
      
      An ICMP response will usually be routed back-out the same incoming
      interface.  Thus, take advantage of this and skip global ICMP
      ratelimit when the incoming device is loopback.  In the unlikely event
      that the outgoing device is not loopback, due to strange routing policy
      rules, ICMP rate limiting still works via peer ratelimiting via
      icmpv4_xrlim_allow().  Thus, we should still comply with RFC1812
      (section 4.3.2.8 "Rate Limiting").
      
      This seems to fix the reproducer given by Florian, while still
      avoiding an expensive and unneeded outgoing route lookup for
      rate limited packets (in the non-loopback case).
      
      Fixes: c0303efe ("net: reduce cycles spend on ICMP replies that gets rate limited")
      Reported-by: Florian Weimer <fweimer@redhat.com>
      Reported-by: "H.J. Lu" <hjl.tools@gmail.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
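The decision described above (skip the global limiter when the packet arrived on loopback, otherwise consult it) can be sketched as follows. The token counter is a deliberately simplified stand-in for the real global limiter, and `struct dev`, `struct global_limit`, `global_allow()` and `icmp_allow()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical incoming device. */
struct dev {
    bool is_loopback;
};

/* Simplified global limiter: a fixed budget of tokens, no refill. */
struct global_limit {
    unsigned tokens;
};

/* Consume one token from the global budget, if any is left. */
static bool global_allow(struct global_limit *g)
{
    assert(g);
    if (g->tokens == 0)
        return false;
    g->tokens--;
    return true;
}

/*
 * Replies to packets that arrived on loopback bypass the global limit and
 * leave its budget untouched; everything else goes through it.
 */
static bool icmp_allow(const struct dev *in_dev, struct global_limit *g)
{
    if (in_dev && in_dev->is_loopback)
        return true;
    return global_allow(g);
}
```

In the real code path a per-peer limit still applies afterwards, which this model leaves out.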
  16. 14 Jun, 2017 1 commit
    • igmp: acquire pmc lock for ip_mc_clear_src() · c38b7d32
      Committed by WANG Cong
      Andrey reported a use-after-free in add_grec():
      
              for (psf = *psf_list; psf; psf = psf_next) {
      		...
                      psf_next = psf->sf_next;
      
      where the struct ip_sf_list's were already freed by:
      
       kfree+0xe8/0x2b0 mm/slub.c:3882
       ip_mc_clear_src+0x69/0x1c0 net/ipv4/igmp.c:2078
       ip_mc_dec_group+0x19a/0x470 net/ipv4/igmp.c:1618
       ip_mc_drop_socket+0x145/0x230 net/ipv4/igmp.c:2609
       inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:411
       sock_release+0x8d/0x1e0 net/socket.c:597
       sock_close+0x16/0x20 net/socket.c:1072
      
      This happens because we don't hold pmc->lock in ip_mc_clear_src()
      and a parallel mr_ifc_timer timer could jump in and access them.
      
      The RCU lock is there but it is merely for pmc itself, this
      spinlock could actually ensure we don't access them in parallel.
      
      Thanks to Eric and Long for discussion on this bug.
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Reviewed-by: Xin Long <lucien.xin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 12 Jun, 2017 3 commits
    • udp: try to avoid 2 cache miss on dequeue · b65ac446
      Committed by Paolo Abeni
      When udp_recvmsg() is executed, on x86_64 and other archs, most skb
      fields are on cold cachelines.
      If the skb is linear and the kernel doesn't need to compute the udp
      csum, only a handful of skb fields are required by udp_recvmsg().
      Since we already use skb->dev_scratch to cache hot data, and
      there are 32 bits unused on 64 bit archs, use that field to cache
      as much data as we can, and try to prefetch on dequeue the relevant
      fields that are left out.

      This can save up to 2 cache misses per packet.

      v1 -> v2:
        - changed udp_dev_scratch fields types to u{32,16} variant,
          replaced bitfield with bool
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
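The trick of caching several hot values in one 64-bit scratch word can be sketched in userspace. The layout below is loosely modelled on the description (a 32-bit value plus a 16-bit length and two bool flags filling the remaining 32 bits); the field names, `struct scratch`, `scratch_pack()` and `scratch_unpack()` are assumptions for illustration, not the kernel's `udp_dev_scratch` definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hot-data summary that must fit into one 64-bit word. */
struct scratch {
    uint32_t truesize;     /* the value cached before this change */
    uint16_t len;          /* packet length */
    bool is_linear;        /* no paged fragments */
    bool csum_unnecessary; /* checksum already validated */
};

/* Compile-time check that the summary really fits the scratch word. */
static_assert(sizeof(struct scratch) <= sizeof(uint64_t),
              "scratch data must fit in 64 bits");

/* Store the summary in a scratch word. */
static uint64_t scratch_pack(const struct scratch *s)
{
    uint64_t word = 0;
    memcpy(&word, s, sizeof(*s));
    return word;
}

/* Read the summary back without touching the original structure. */
static struct scratch scratch_unpack(uint64_t word)
{
    struct scratch s;
    memcpy(&s, &word, sizeof(s));
    return s;
}
```

A reader that only needs the length and the two flags can then work from the single word, which is the property that avoids touching colder cachelines on dequeue.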
    • udp: avoid a cache miss on dequeue · 0a463c78
      Committed by Paolo Abeni
      Since UDP no longer uses sk->destructor, we can completely clear
      the skb head state before enqueuing. Amend and use
      skb_release_head_state() for that.

      All head states share a single cacheline, which is not
      normally used/accessed on dequeue. We can avoid accessing that
      cacheline entirely by implementing, and using in the UDP code, a
      specialized skb free helper which ignores the skb head state.
      
      This saves a cacheline miss at skb deallocation time.
      
      v1 -> v2:
        replaced secpath_reset() with skb_release_head_state()
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipmr: Fix some mroute forwarding issues in vrf's · 4b1f0d33
      Committed by Donald Sharp
      This patch fixes two issues:
      
      1) When forwarding on *,G mroutes that are in a vrf, the
      kernel was dropping information about the actual incoming
      interface when calling ip_mr_forward from ip_mr_input.
      This caused ip_mr_forward to send the multicast packet
      back out the incoming interface.  Fix this by
      modifying ip_mr_forward to be handed the correctly
      resolved dev.
      
      2) When an unresolved cache entry is created we store
      the incoming skb on the unresolved cache entry and
      upon mroute resolution from the user space daemon,
      we attempt to forward the packet.  Again we were
      not resolving to the correct incoming device for
      a vrf scenario, before calling ip_mr_forward.
      Fix this by resolving to the correct interface
      and calling ip_mr_forward with the result.
      
      Fixes: e58e4159 ("net: Enable support for VRF with ipv4 multicast")
      Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: Yotam Gigi <yotamg@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>