1. 16 Jun, 2016 40 commits
    • net: ipv4: Add ability to have GRE ignore DF bit in IPv4 payloads · 22a59be8
      Philip Prindeville authored
          In the presence of firewalls which improperly block ICMP Unreachable
          (including Fragmentation Required) messages, Path MTU Discovery is
          prevented from working.
      
          A workaround is to handle IPv4 payloads opaquely, ignoring the DF bit--as
          is done for other payloads like AppleTalk--and doing transparent
          fragmentation and reassembly.
      
          Redux includes the enforcement of mutual exclusion between this feature
          and Path MTU Discovery as suggested by Alexander Duyck.
      
          Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Philip Prindeville <philipp@redfish-solutions.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
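
The mutual exclusion this commit message describes can be sketched as a small userspace model. This is not the kernel code: `TunnelConfig`, `validate`, and `may_fragment` are hypothetical names, and the rule that the outer DF bit is set whenever Path MTU Discovery is on or the payload's DF bit is set is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class TunnelConfig:
    """Hypothetical model of a GRE tunnel's fragmentation settings."""
    pmtudisc: bool = True    # tunnel performs Path MTU Discovery
    ignore_df: bool = False  # treat IPv4 payloads opaquely, ignoring DF


def validate(cfg: TunnelConfig) -> None:
    """Reject configurations that enable both features at once."""
    if cfg.pmtudisc and cfg.ignore_df:
        raise ValueError("ignore_df and Path MTU Discovery are mutually exclusive")


def may_fragment(cfg: TunnelConfig, payload_df: bool) -> bool:
    """Return True if an oversized encapsulated packet may be fragmented."""
    validate(cfg)
    if cfg.ignore_df:
        return True  # the payload's DF bit is not consulted at all
    # Assumption: without ignore_df, fragmentation is inhibited when
    # PMTUD is on or when the payload's own DF bit is set.
    return not (cfg.pmtudisc or payload_df)
```

With `ignore_df` set and Path MTU Discovery off, a payload carrying DF is still fragmented, which is the workaround for firewalls that drop ICMP Fragmentation Required messages.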
    • net: vrf: Switch dst dev to loopback on device delete · 810e530b
      David Ahern authored
      Attempting to delete a VRF device with a socket bound to it can stall:
      
        unregister_netdevice: waiting for red to become free. Usage count = 1
      
      The unregister is waiting for the dst to be released, and with it the
      references to the vrf device. Similar to dst_ifdown, switch the dst
      dev to loopback on delete for all of the dsts for the vrf device
      and release the references to the vrf device.
      
      Fixes: 193125db ("net: Introduce VRF device driver")
      Fixes: 35402e31 ("net: Add IPv6 support to VRF device")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'shared' of git://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma · 042ce722
      David S. Miller authored
      Mellanox shared code between RDMA and net-next trees
      
      This is Mellanox mlx5_core shared code for both net-next and RDMA
      trees for 4.8 kernel cycle.
    • mdio: mux: avoid 'maybe-uninitialized' warning · a78c16e1
      Arnd Bergmann authored
      The latest changes to the MDIO code introduced a false-positive
      warning with gcc-6 (possibly others):
      
      drivers/net/phy/mdio-mux.c: In function 'mdio_mux_init':
      drivers/net/phy/mdio-mux.c:188:3: error: 'parent_bus_node' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      It's easy to avoid the warning by making sure the parent_bus_node
      is initialized in both cases at the start of the function, since
      the later 'of_node_put()' call is also valid for a NULL pointer
      argument.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: f20e6657 ("mdio: mux: Enhanced MDIO mux framework for integrated multiplexers")
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '6lowpan-ndisc' · b06f9527
      David S. Miller authored
      Alexander Aring says:
      
      ====================
      6lowpan: introduce 6lowpan-nd
      
      David, can you please pick up this patch series for your net-next tree?
      Thanks in advance.
      
      This patch series introduces the ndisc ops callback structure to add different
      handling for IPv6 neighbour discovery cache functionality. It implements at first
      the two following use-cases:
      
       - 6CO handling as userspace option (For all 6LoWPAN layers, BTLE/802.15.4) [0]
       - short address handling for 802.15.4 6LoWPAN only [1]
      
      Since my last patch series, I completely changed the ndisc_ops callback
      structure so that it no longer replaces the whole ndisc functionality at the
      recv/send level of NS/NA/RS/RA, as I sent in my previous patch series
      "6lowpan: introduce basic 6lowpan-nd". It now adds different handling at a
      very low level of the ndisc functionality.

      The ndisc_ops no longer have to be registered to dev->ndisc_ops; if they are
      not set, no additional ipv6 ndisc handling will be done.
      
      This patch series now introduces complete handling of short addresses for
      802.15.4 6LoWPAN when sending/receiving NA/NS/RS and RA. In the case of RA
      (receive only) and PIO, we also need a second address based on the prefix
      and the short address.
      
      This callback structure can be used later (I hope) for RFC 6775 [0]. This RFC
      defines some new option fields and messages for 6LoWPAN-ND. This patch series
      does not implement RFC 6775 (except we decide now to handle 6CO in userspace).
      
      Additionally, we can use the current ops for parsing/filling ndisc options
      for kernel-handled ndisc messages to add 6CIO, see [2].

      I tested RA/NS/NA/RS messages with short addresses, which seems to work. What
      I didn't test is redirect messages, since I don't know how to generate them.
      The short address for redirect messages is also a special case here, because
      the short address for an L3 target address, looked up in the neighbour cache,
      needs to be added.
      
      btw:
      According to [3] sending redirect messages should be also disabled by default
      on 6lowpan interfaces, but can be activated afterwards. This is maybe
      something for the ipv6_devconf structure. There is a "accept_redirects" but
      no "disable_redirects".
      
      - Alex
      
      [0] https://tools.ietf.org/html/rfc6775
      [1] https://tools.ietf.org/html/rfc4944#section-8
      [2] https://tools.ietf.org/html/rfc7400
      
      changes since v3:
       - add acked-by and reviewed-by tags
       - fix url references in cover-letter
       - add cover-letter that this patch series is okay to go through net-next tree
      
      changes since RFC:
       - add lowlevel functions __ndisc_opt_addr_space,
         __ndisc_opt_addr_data and __ndisc_fill_addr_option for the corresponding
         functions, which don't require a net_device argument.
       - move ndisc_ops e.g. ndisc_ops_fill_addr_option function call into the
         corresponding device argument function ndisc_fill_addr_option.
         (Introduced a special static inline function for redirect handling).
       - fix error handling in addrconf_prefix_rcv_add_addr.
         (Please see, introduce new API handling that second address registration
          (in case of 802.15.4 6LoWPAN) will still be notified if failed, because
          dev->addr was successful.
       - add ieee802154 sub-directory in short address entry for 6lowpan UAPI.
       - add lowpan_802154_is_valid_src_short_addr, because 802.15.4 6lowpan
         defines the first bit as multicast (don't know how this can be working
         at the end, because some hardware addresses will handle such addresses
         in L2 as unicast. See:
         https://www.iana.org/assignments/_6lowpan-parameters/_6lowpan-parameters.xhtml#_6lowpan-parameters-2
      
      changes since v2:
       - Introduce ndisc_ops to have our own implementation for dealing with NS/NA
         which allows also to support RFC6775 (e.g. ARO).
       - add handling of 6CO as a userspace option for RA messages in
         case of 6LoWPAN interfaces.
       - change lowpan_is_ll to check on linklayer type only.
       - added some reviewed-by's.
       - move short addr slaac to net/6lowpan instead of ipv6 handling.
       - add handling for context based address compression in case for
         short address as link-layer address.
       - change strategy to use short address, a short address will always be used
         when it's available.
       - Handle override flag in NA messages to update short address information or
         not.
      ====================
      Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • 6lowpan: add support for 802.15.4 short addr handling · eab560e5
      Alexander Aring authored
      This patch adds the necessary handling to use the short address for
      802.15.4 6lowpan. It contains support for IPHC address compression
      and a new matching algorithm to decide which link layer address will be
      used for the 802.15.4 frame.
      Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
      Signed-off-by: Alexander Aring <aar@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • 6lowpan: add support for getting short address · cfce9465
      Alexander Aring authored
      In case of sending RA messages we need some way to get the short address
      from an 802.15.4 6LoWPAN interface. This patch will add a temporary
      debugfs entry for experimental userspace api.
      Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
      Signed-off-by: Alexander Aring <aar@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • 6lowpan: introduce 6lowpan-nd · bbe5f5ce
      Alexander Aring authored
      This patch introduces different 6lowpan handling for receiving and
      transmitting NS/NA messages for the ipv6 neighbour discovery. The first
      use-case is supporting 802.15.4 short addresses inside the option fields
      and handling the RFC6775 6CO option field as a userspace option.
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
      Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: Alexander Aring <aar@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: export several functions · cc84b3c6
      Alexander Aring authored
      This patch exports some neighbour discovery functions which can be used
      by 6lowpan neighbour discovery ops functionality then.
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
      Signed-off-by: Alexander Aring <aar@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: introduce neighbour discovery ops · f997c55c
      Alexander Aring authored
      This patch introduces neighbour discovery ops callback structure. The
      idea is to separate the handling for 6LoWPAN into the 6lowpan module.
      
      These callbacks offer 6lowpan different handling, such as 802.15.4 short
      address handling or RFC6775 (Neighbor Discovery Optimization for IPv6
      over 6LoWPANs).
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: Alexander Aring <aar@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
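
The optional-callback idea in this commit can be illustrated with a small userspace model. `NdiscOps`, `NetDevice`, and the dispatch function below are hypothetical stand-ins rather than the kernel structures; only the option type numbers (25 for RDNSS, 34 for the RFC 6775 6CO) are taken from the RFCs.

```python
from typing import Callable, Optional

ND_OPT_RDNSS = 25  # Recursive DNS Server option
ND_OPT_6CO = 34    # RFC 6775 6LoWPAN Context Option


class NdiscOps:
    """Hypothetical per-device callback table; every hook is optional."""

    def __init__(self, is_useropt: Optional[Callable[[int], bool]] = None):
        self.is_useropt = is_useropt


class NetDevice:
    """Hypothetical device; ndisc_ops is None when no extra handling is wanted."""

    def __init__(self, name: str, ndisc_ops: Optional[NdiscOps] = None):
        self.name = name
        self.ndisc_ops = ndisc_ops


def generic_is_useropt(opt_type: int) -> bool:
    """Generic rule, reduced here to a single option type for brevity."""
    return opt_type == ND_OPT_RDNSS


def is_useropt(dev: NetDevice, opt_type: int) -> bool:
    """Apply the generic rule first, then the device hook if one is set."""
    if generic_is_useropt(opt_type):
        return True
    ops = dev.ndisc_ops
    if ops is not None and ops.is_useropt is not None:
        return ops.is_useropt(opt_type)
    return False  # no ops registered: no additional handling
```

A 6LoWPAN-like device would register a hook that accepts `ND_OPT_6CO`, while a device without `ndisc_ops` keeps the generic behaviour unchanged.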
    • addrconf: put prefix address add in an own function · 4f672235
      Alexander Aring authored
      This patch moves the functionality to add an address generated from a
      RA PIO prefix into its own function. This move prepares for adding a hook
      that adds a second address for a second link-layer address, e.g. the short
      address for 802.15.4 6LoWPAN.
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
      Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: Alexander Aring <aar@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ndisc: add __ndisc_fill_addr_option function · 8ec5da41
      Alexander Aring authored
      This patch adds __ndisc_fill_addr_option as low-level function for
      ndisc_fill_addr_option which doesn't depend on net_device parameter.
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
      Signed-off-by: Alexander Aring <aar@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ndisc: add __ndisc_opt_addr_data function · 4f36ce84
      Alexander Aring authored
      This patch adds __ndisc_opt_addr_data as low-level function for
      ndisc_opt_addr_data which doesn't depend on net_device parameter.
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
      Signed-off-by: Alexander Aring <aar@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ndisc: add __ndisc_opt_addr_space function · 1e82f961
      Alexander Aring authored
      This patch adds __ndisc_opt_addr_space as low-level function for
      ndisc_opt_addr_space which doesn't depend on net_device parameter.
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
      Signed-off-by: Alexander Aring <aar@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
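
A helper that takes the address length directly, instead of a net_device, only needs arithmetic. The sketch below assumes the RFC 4861 layout of a link-layer address option (one type octet, one length octet, then the address, padded to a multiple of 8 octets); `opt_addr_space` and its `pad` parameter are hypothetical names, not the kernel signature.

```python
OPT_HEADER_LEN = 2  # type octet + length octet (RFC 4861 option header)


def opt_addr_space(addr_len: int, pad: int = 0) -> int:
    """Octets occupied by a link-layer address option.

    addr_len: length of the link-layer address in octets
    pad:      extra octets placed before the address (assumed to be
              link-type specific; 0 for most link types)
    """
    if addr_len < 0 or pad < 0:
        raise ValueError("lengths must be non-negative")
    raw = OPT_HEADER_LEN + pad + addr_len
    return (raw + 7) // 8 * 8  # round up to a multiple of 8 octets


def opt_length_field(addr_len: int, pad: int = 0) -> int:
    """Value of the option's length field, counted in units of 8 octets."""
    return opt_addr_space(addr_len, pad) // 8
```

Under these assumptions a 6-octet Ethernet address and a 2-octet 802.15.4 short address both fit in one 8-octet unit, while an 8-octet extended address needs two units.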
    • 6lowpan: remove ipv6 module request · 848484c9
      Alexander Aring authored
      Since we use exported function from ipv6 kernel module we don't need to
      request the module anymore to have ipv6 functionality.
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
      Signed-off-by: Alexander Aring <aar@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • 6lowpan: add 802.15.4 short addr slaac · 2ad3ed59
      Alexander Aring authored
      This patch adds the autoconfiguration if a valid 802.15.4 short address
      is available for 802.15.4 6LoWPAN interfaces.
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
      Signed-off-by: Alexander Aring <aar@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
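
Autoconfiguration from a short address can be sketched as follows. This is an illustration only: it assumes the interface identifier form `0000:00ff:fe00:XXXX` that RFC 6282 uses for 16-bit short addresses, and it treats any short address with the first bit set as invalid for use as a source, following the remark in the cover letter of this series. Both function names are hypothetical.

```python
import ipaddress


def short_addr_is_valid_src(short_addr: int) -> bool:
    """Assumed rule: a set first bit marks multicast/reserved addresses."""
    return 0 <= short_addr <= 0xFFFF and not (short_addr & 0x8000)


def short_addr_to_ipv6(prefix: str, short_addr: int) -> ipaddress.IPv6Address:
    """Combine a /64 prefix with an IID derived from a 16-bit short address."""
    if not short_addr_is_valid_src(short_addr):
        raise ValueError("short address not usable as a source address")
    net = ipaddress.IPv6Network(prefix)
    if net.prefixlen != 64:
        raise ValueError("SLAAC needs a /64 prefix")
    # IID layout: 0000:00ff:fe00:XXXX, with XXXX the short address
    iid = (0x00FF << 32) | (0xFE00 << 16) | short_addr
    return net[iid]


print(short_addr_to_ipv6("fe80::/64", 0x1234))  # fe80::ff:fe00:1234
```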
    • 6lowpan: add private neighbour data · 8626a0c8
      Alexander Aring authored
      This patch will introduce a 6lowpan neighbour private data. Like the
      interface private data we handle private data for generic 6lowpan and
      for link-layer specific 6lowpan.
      
      The current first use case is to save the short address for an 802.15.4
      6lowpan neighbour.
      
      Cc: David S. Miller <davem@davemloft.net>
      Reviewed-by: Stefan Schmidt <stefan@osg.samsung.com>
      Acked-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Signed-off-by: Alexander Aring <aar@pengutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'cxgb4-sriov-sysfs' · 60100978
      David S. Miller authored
      Hariprasad Shenai says:
      
      ====================
      Add SRIOV configuration via sysfs and few fixes
      
      This series adds support to configure SR-IOV via the PCI sysfs interface
      and reduces resource allocation in the kdump kernel by disabling offload.
      It also synchronizes unicast and multicast MAC addresses, even if the
      interface is in promiscuous mode.
      
      This patch series has been created against net-next tree and includes
      patches on cxgb4 and cxgb4vf driver.
      
      We have included all the maintainers of respective drivers. Kindly review
      the change and let us know in case of any review comments.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4/cxgb4vf: Synchronize all MAC addresses · d01f7abc
      Hariprasad Shenai authored
      Even if interface is in Promiscuous mode/Allmulti mode synchronize
      MAC addresses.
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4: Enable SR-IOV configuration via PCI sysfs interface · b6244201
      Hariprasad Shenai authored
      Implement callback in the driver for the new PCI bus driver
      interface that allows the user to enable/disable SR-IOV
      virtual functions in a device via the sysfs interface.
      
      Deprecate module parameter used to configure SRIOV
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cxgb4: Force cxgb4 driver as MASTER in kdump kernel · c5a8c0f3
      Hariprasad Shenai authored
      When is_kdump_kernel() is true, force the cxgb4 driver as Master so we can
      reinitialize the Firmware/Chip. Also reduce memory usage by disabling
      offload.
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'sched_skb_free_defer' · 88da48f4
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      net_sched: defer skb freeing while changing qdiscs
      
      qdiscs/classes are changed under RTNL protection and often
      while blocking BH and root qdisc spinlock.
      
      When lots of skbs need to be dropped, we free
      them under these locks causing TX/RX freezes,
      and more generally latency spikes.
      
      I saw spikes of 50+ ms on quite fast hardware...
      
      This patch series adds a simple queue protected by RTNL
      where skbs can be placed until RTNL is released.
      
      Note that this might also serve in the future for optional
      reinjection of packets when a qdisc is replaced.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_fq: defer skb freeing · fea02478
      Eric Dumazet authored
      sfq_reset() can use rtnl_kfree_skbs() instead of kfree_skb()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_pie: defer skb freeing · db4879d9
      Eric Dumazet authored
      pie_change() can use rtnl_qdisc_drop() to benefit from
      deferred freeing.
      
      pie_reset() is already using qdisc_reset_queue()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_netem: defer skb freeing · 2f08a9a1
      Eric Dumazet authored
      rtnl_kfree_skbs() can be used in tfifo_reset()
      
      It would be nice if we could iterate through rb tree instead
      of removing one skb at a time, and build a single skb chain.
      But this is left for a future patch.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_htb: defer skb freeing · a5a9f534
      Eric Dumazet authored
      Both htb_reset() and htb_destroy() can use __qdisc_reset_queue()
      instead of __skb_queue_purge() to defer skb freeing of internal
      queues.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_hhf: defer skb freeing · e7e424cd
      Eric Dumazet authored
      Both hhf_reset() and hhf_change() can use rtnl_kfree_skbs()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: fq_codel: defer skb freeing · ece5d4c7
      Eric Dumazet authored
      Both fq_codel_change() and fq_codel_reset() can use rtnl_kfree_skbs()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_fq: defer skb freeing · e14ffdfd
      Eric Dumazet authored
      Both fq_change() and fq_reset() can use rtnl_kfree_skbs()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_codel: defer skb freeing in codel_change() · b3d7e2b2
      Eric Dumazet authored
      codel_change() can use rtnl_qdisc_drop()
      to defer expensive skb freeing after locks are released.
      
      codel_reset() already has support for deferred skb freeing
      because it uses qdisc_reset_queue()
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: sch_choke: defer skb freeing · f9aed311
      Eric Dumazet authored
      choke_reset() and choke_change() can use rtnl_qdisc_drop()
      to defer expensive skb freeing after locks are released.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net_sched: add the ability to defer skb freeing · 1b5c5493
      Eric Dumazet authored
      qdisc are changed under RTNL protection and often
      while blocking BH and root qdisc spinlock.
      
      When lots of skbs need to be dropped, we free
      them under these locks causing TX/RX freezes,
      and more generally latency spikes.
      
      This commit adds rtnl_kfree_skbs(), used to queue
      skbs for deferred freeing.
      
      Actual freeing happens right after RTNL is released,
      with appropriate scheduling points.
      
      rtnl_qdisc_drop() can also be used in place
      of qdisc_drop() when RTNL is held.
      
      qdisc_reset_queue() and __qdisc_reset_queue() get
      the new behavior, so standard qdiscs like pfifo, pfifo_fast...
      have their ->reset() method automatically handled.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
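
The deferral pattern described in this commit can be modelled in userspace: items are only queued while the lock is held, and the expensive freeing runs after the lock is released. The class below is a sketch with hypothetical names; a `threading.Lock` stands in for RTNL and an arbitrary callback stands in for the actual skb free.

```python
import threading
from typing import Any, Callable, List


class DeferredFreeQueue:
    """Queue items under a lock and free them only after unlocking."""

    def __init__(self, free_fn: Callable[[Any], None]):
        self._lock = threading.Lock()  # stand-in for RTNL
        self._pending: List[Any] = []
        self._free_fn = free_fn

    def __enter__(self) -> "DeferredFreeQueue":
        self._lock.acquire()
        return self

    def defer_free(self, item: Any) -> None:
        """Cheap operation: just remember the item for later."""
        if not self._lock.locked():
            raise RuntimeError("defer_free() requires the lock to be held")
        self._pending.append(item)

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Detach the batch while still locked, then free it unlocked.
        batch, self._pending = self._pending, []
        self._lock.release()
        for item in batch:
            self._free_fn(item)
        return False  # do not swallow exceptions
```

Used as `with queue: queue.defer_free(pkt)`, nothing is freed inside the `with` body; the callback runs for each queued item once the block exits and the lock has been dropped.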
    • tipc: add neighbor monitoring framework · 35c55c98
      Jon Paul Maloy authored
      TIPC based clusters are by default set up with full-mesh link
      connectivity between all nodes. Those links are expected to provide
      a short failure detection time, by default set to 1500 ms. Because
      of this, the background load for neighbor monitoring in an N-node
      cluster increases with a factor N on each node, while the overall
      monitoring traffic through the network infrastructure increases at
      a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
      scale well beyond ~100 nodes unless we significantly increase failure
      discovery tolerance.
      
      This commit introduces a framework and an algorithm that drastically
      reduces this background load, while basically maintaining the original
      failure detection times across the whole cluster. Using this algorithm,
      background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
      at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
      now have to actively monitor 38 neighbors in a 400-node cluster, instead
      of as before 399.
      
      This "Overlapping Ring Supervision Algorithm" is completely distributed
      and employs no centralized or coordinated state. It goes as follows:
      
      - Each node makes up a linearly ascending, circular list of all its N
        known neighbors, based on their TIPC node identity. This algorithm
        must be the same on all nodes.
      
      - The node then selects the next M = sqrt(N) - 1 nodes downstream from
        itself in the list, and chooses to actively monitor those. This is
        called its "local monitoring domain".
      
      - It creates a domain record describing the monitoring domain, and
        piggy-backs this in the data area of all neighbor monitoring messages
        (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
        the cluster eventually (default within 400 ms) will learn about
        its monitoring domain.
      
      - Whenever a node discovers a change in its local domain, e.g., a node
        has been added or has gone down, it creates and sends out a new
        version of its node record to inform all neighbors about the change.
      
      - A node receiving a domain record from anybody outside its local domain
        matches this against its own list (which may not look the same), and
        chooses to not actively monitor those members of the received domain
        record that are also present in its own list. Instead, it relies on
        indications from the direct monitoring nodes if an indirectly
        monitored node has gone up or down. If a node is indicated lost, the
        receiving node temporarily activates its own direct monitoring towards
        that node in order to confirm, or not, that it is actually gone.
      
      - Since each node is actively monitoring sqrt(N) downstream neighbors,
        each node is also actively monitored by the same number of upstream
        neighbors. This means that all non-direct monitoring nodes normally
        will receive sqrt(N) indications that a node is gone.
      
      - A major drawback with ring monitoring is how it handles failures that
        cause massive network partitionings. If both a lost node and all its
        direct monitoring neighbors are inside the lost partition, the nodes in
        the remaining partition will never receive indications about the loss.
        To overcome this, each node also chooses to actively monitor some
        nodes outside its local domain. Those nodes are called remote domain
        "heads", and are selected in such a way that no node in the cluster
        will be more than two direct monitoring hops away. Because of this,
        each node, apart from monitoring the members of its local domain, will
        also typically monitor sqrt(N) remote head nodes.
      
      - As an optimization, local list status, domain status and domain
        records are marked with a generation number. This saves senders from
        unnecessarily conveying unaltered domain records, and receivers from
        performing unneeded re-adaptations of their node monitoring list, such
        as re-assigning domain heads.
      
      - As a measure of caution we have added the possibility to disable the
        new algorithm through configuration. We do this by keeping a threshold
        value for the cluster size; a cluster that grows beyond this value
        will switch from full-mesh to ring monitoring, and vice versa when
        it shrinks below the value. This means that if the threshold is set to
        a value larger than any anticipated cluster size (default size is 32)
        the new algorithm is effectively disabled. A patch set for altering the
        threshold value and for listing the table contents will follow shortly.
      
      - This change is fully backwards compatible.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
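
The selection of a local domain and remote domain heads can be sketched in a few lines. This is a simplified model, not the TIPC implementation: it takes N as the cluster size including the node itself, rounds sqrt(N) down, and assumes every node uses the same domain size, so heads sit at regular intervals on the ring. The function name is hypothetical.

```python
import math
from typing import Iterable, List, Tuple


def monitoring_plan(node_ids: Iterable[int], self_id: int) -> Tuple[List[int], List[int]]:
    """Return (local_domain, remote_heads) for one node.

    Assumptions: N counts all nodes including self_id, and every
    node in the cluster applies the same domain size M = sqrt(N) - 1.
    """
    ring = sorted(node_ids)            # same ascending order on all nodes
    n = len(ring)
    m = max(math.isqrt(n) - 1, 0)      # size of the local monitoring domain
    start = ring.index(self_id)
    # All other nodes, in downstream (circular ascending) order
    downstream = [ring[(start + k) % n] for k in range(1, n)]
    local_domain = downstream[:m]
    # Each head is assumed to cover the M nodes that follow it
    remote_heads = downstream[m::m + 1]
    return local_domain, remote_heads


local, heads = monitoring_plan(range(1, 401), self_id=1)
print(len(local) + len(heads))  # 38
```

For a 400-node cluster this model yields 19 local domain members plus 19 remote heads, matching the figure of 38 actively monitored neighbors quoted in the commit message, against 399 with full-mesh monitoring.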
    • net: vrf: Update flags and features settings · 7889681f
      David Ahern authored
      1. Default VRF devices to not having a qdisc (IFF_NO_QUEUE). Users
         can add one as desired.
      
      2. Disable adding a VLAN to a VRF device.
      
      3. Enable offloads and hardware features similar to other logical
         devices (e.g., dummy, veth)
      
      Change provides a significant boost in TCP stream Tx performance,
      from ~2,700 Mbps to ~18,100 Mbps and makes throughput close to the
      performance without a VRF (18,500 Mbps). netperf TCP_STREAM benchmark
      using qemu with virtio+vhost for the NICs
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tun: fix csum generation for tap devices · df10db98
      Paolo Abeni authored
      The commit 34166093 ("tuntap: use common code for virtio_net_hdr
      and skb GSO conversion") replaced the tun code for header manipulation
      with the generic helpers. While doing so, it implicitly moved the
      skb_partial_csum_set() invocation after eth_type_trans(), which
      invalidates the current gso start/offset values.
      Fix it by moving the helper invocation before the mac pulling.
      
      Fixes: 34166093 ("tuntap: use common code for virtio_net_hdr and skb GSO conversion")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'skb_array' · 829e64d1
      David S. Miller authored
      Michael S. Tsirkin says:
      
      ====================
      skb_array: array based FIFO for skbs
      
      This is in response to the proposal by Jason to make tun
      rx packet queue lockless using a circular buffer.
      My testing seems to show that at least for the common usecase
      in networking, which isn't lockless, circular buffer
      with indices does not perform that well, because
      each index access causes a cache line to bounce between
      CPUs, and index access causes stalls due to the dependency.
      
      By comparison, an array of pointers where NULL means invalid
      and !NULL means valid, can be updated without messing up barriers
      at all and does not have this issue.
      
      On the flip side, cache pressure may be caused by using large queues.
      tun has a queue of 1000 entries by default and that's 8K.
      At this point I'm not sure this can be solved efficiently.
      The correct solution might be sizing the queues appropriately.
      
      Here's an implementation of this idea: it can be used more
      or less whenever sk_buff_head can be used, except you need
      to know the queue size in advance.
      
      As this might be useful outside of networking, I implemented
      a generic array of void pointers, with a type-safe wrapper for skbs.
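
The idea of an array of pointers in which an empty slot marks "invalid" can be sketched as below. This is a single-threaded Python model with hypothetical names; it uses None in place of NULL and leaves out the locking, memory ordering, and resizing that the cover letter discusses.

```python
from typing import Any, List, Optional


class PtrRing:
    """Fixed-size FIFO: a None slot is free, anything else is queued."""

    def __init__(self, size: int):
        if size <= 0:
            raise ValueError("size must be positive")
        self._queue: List[Optional[Any]] = [None] * size
        self._producer = 0  # next slot to fill; used only by produce()
        self._consumer = 0  # next slot to drain; used only by consume()

    def produce(self, item: Any) -> bool:
        """Store item; return False if the ring is full."""
        if item is None:
            raise ValueError("None is reserved to mark empty slots")
        if self._queue[self._producer] is not None:
            return False  # slot still occupied, so the ring is full
        self._queue[self._producer] = item
        self._producer = (self._producer + 1) % len(self._queue)
        return True

    def consume(self) -> Optional[Any]:
        """Remove and return the oldest item, or None if empty."""
        item = self._queue[self._consumer]
        if item is None:
            return None
        self._queue[self._consumer] = None  # hand the slot back
        self._consumer = (self._consumer + 1) % len(self._queue)
        return item
```

The producer and the consumer each keep a private index and learn about full or empty state only from the slot contents, so neither side ever reads the other's index.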
      
      It remains to be seen whether resizing is required. In case it is,
      I included patches implementing resizing by holding both the
      consumer and the producer locks.
      
      I think this code works fine without any extra memory barriers since we
      always read and write the same location, so the accesses cannot be
      reordered.
      Multiple writes of the same value into memory would mess things up
      for us, but I don't think compilers would do that.
      But if people feel it's better to be safe wrt compiler optimizations,
      specifying queue as volatile would probably do it in a cleaner way
      than converting all accesses to READ_ONCE/WRITE_ONCE. Thoughts?
      
      The only issue is with calls within a loop using the __ptr_ring_XXX
      accessors - in theory the compiler could hoist accesses out of the loop.
      
      Following volatile-considered-harmful.txt I merely
      documented that callers that busy-poll should invoke cpu_relax().
      Most people will use the external skb_array_XXX APIs with a spinlock,
      so this should not be an issue for them.
      
      Eric Dumazet suggested adding an extra pointer to skb for when
      we have a single outstanding packet. I could not figure out
      a way to implement this without a shared consumer/producer lock
      though, which would cause cache line bounces by itself.
      
      Jesper, Jason, I know that both of you tested this,
      please post Tested-by tags for whatever was tested.
      
      changes since v7
      	fix typos noticed by Jesper Brouer
      
      changes since v6
      	resize implemented. peek/full calls are no longer lockless
      
      	replaced _FIELD macros with _CALL which invoke a function
      	on the pointer rather than just returning a value
      
      	destroy now scans the array and frees all queued skbs
      
      changes since v5
      	implemented a generic ptr_ring api, and
      		made skb_array a type-safe wrapper
      	apis for taking the spinlock in different contexts
      		following expected usecase in tun
      changes since v4 (v3 was never posted)
      	documentation
      	dropped SKB_ARRAY_MIN_SIZE heuristic
      	unit test (in userspace, included as patch 2)
      
      changes since v2:
              fixed integer overflow pointed out by Eric.
              added some comments.
      
      changes since v1:
              fixed bug pointed out by Eric.
      ====================
      Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      829e64d1
    • M
      skb_array: resize support · 7d7072e3
      Michael S. Tsirkin committed
      Update skb_array after ptr_ring API changes.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7d7072e3
    • M
      ptr_ring: resize support · 5d49de53
      Michael S. Tsirkin committed
      This adds ring resize support. It seems to be necessary, as
      users such as tun allow userspace control over queue size.
      
      If resize is used, this costs us the ability to peek at the queue without
      the consumer lock - should not be a big deal as peek and consumer are
      usually run on the same CPU.
      
      If the ring is made bigger, the ring contents are preserved.  If the ring
      is made smaller, extra pointers are passed to an optional destructor callback.
      
      Cleanup function also gains destructor callback such that
      all pointers in queue can be cleaned up.
      
      This changes some APIs but we don't have any users yet,
      so it won't break bisect.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5d49de53
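
The resize behaviour described in the commit above (contents preserved when growing, extra pointers handed to an optional destructor when shrinking) might look roughly like the following single-threaded userspace sketch. All names (`demo_ring`, `demo_ring_resize`, `demo_count_destroy`) are hypothetical and this is not the kernel implementation; in particular, the choice to keep the oldest entries and destroy the newest ones when shrinking is an assumption of this sketch, and the consumer and producer locks mentioned in the cover letter are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical illustration, not the kernel's struct ptr_ring. */
struct demo_ring {
	void **queue; /* NULL slot = empty, non-NULL slot = occupied */
	size_t size, producer, consumer;
};

static int demo_ring_init(struct demo_ring *r, size_t size)
{
	if (size == 0 || !(r->queue = calloc(size, sizeof(void *))))
		return -1;
	r->size = size;
	r->producer = r->consumer = 0;
	return 0;
}

static int demo_ring_produce(struct demo_ring *r, void *ptr)
{
	assert(ptr != NULL); /* NULL is reserved for empty slots */
	if (r->queue[r->producer])
		return -1; /* full */
	r->queue[r->producer] = ptr;
	r->producer = (r->producer + 1) % r->size;
	return 0;
}

static void *demo_ring_consume(struct demo_ring *r)
{
	void *ptr = r->queue[r->consumer];

	if (ptr) {
		r->queue[r->consumer] = NULL;
		r->consumer = (r->consumer + 1) % r->size;
	}
	return ptr;
}

/*
 * Move queued pointers into a new array of new_size slots, in FIFO order.
 * Entries that do not fit are passed to destroy() if it is non-NULL.
 * Returns 0 on success, -1 on error (the old ring is left untouched).
 */
static int demo_ring_resize(struct demo_ring *r, size_t new_size,
			    void (*destroy)(void *))
{
	void **new_queue;
	size_t kept = 0, i;

	if (new_size == 0 || !(new_queue = calloc(new_size, sizeof(void *))))
		return -1;
	/* Occupied slots are contiguous starting at the consumer index. */
	for (i = 0; i < r->size; i++) {
		void *ptr = r->queue[(r->consumer + i) % r->size];

		if (!ptr)
			break; /* reached the first empty slot */
		if (kept < new_size)
			new_queue[kept++] = ptr;
		else if (destroy)
			destroy(ptr); /* no room left in the smaller ring */
	}
	free(r->queue);
	r->queue = new_queue;
	r->size = new_size;
	r->consumer = 0;
	r->producer = kept % new_size; /* wraps to 0 when the ring is full */
	return 0;
}

/* Example destructor for the sketch: just counts the dropped pointers. */
static int demo_destroyed;

static void demo_count_destroy(void *ptr)
{
	(void)ptr;
	demo_destroyed++;
}
```

A caller would pass a destructor that releases whatever the pointers refer to (for skbs, something that frees the skb), or NULL when dropped entries need no cleanup.
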
    • M
      skb_array: array based FIFO for skbs · ad69f35d
      Michael S. Tsirkin committed
      A simple array based FIFO of pointers.  Intended for net stack so uses
      skbs for type safety. Implemented as a set of wrappers around ptr_ring.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ad69f35d
    • M
      ptr_ring: ring test · 9fb6bc5b
      Michael S. Tsirkin committed
      Add ringtest based unit test for ptr ring.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9fb6bc5b