1. 24 Jul, 2018 8 commits
    • net/smc: add function to get link group from link · 00e5fb26
      Stefan Raspl committed
      Replace a frequently used construct with a more readable variant,
      reducing the code. It might also come in handy when we start to support
      more than a single link per link group.
      Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: eliminate cursor read and write calls · bac6de7b
      Stefan Raspl committed
      The functions to read and write cursors are exclusively used to copy
      cursors. Therefore, switch to a dedicated copy function instead.
      Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: provide smc mode in smc_diag.c · c601171d
      Karsten Graul committed
      Rename the field diag_fallback to diag_mode and set the SMC mode of a
      connection explicitly.
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: gre_multipath: Drop IPv6 tests · 9a2ad362
      Petr Machata committed
      Support for device-only IPv6 multipath next hops was dropped in
      commit 33bd5ac5 ("net/ipv6: Revert attempt to simplify route replace
      and append") and as of commit b5d2d75e ("net/ipv6: Do not allow
      device only routes via the multipath API"), attempts to add a next hop
      like that yield an explicit diagnostic.

      Correspondingly, drop the IPv6 parts of the GRE multipath test that are
      supposed to test that code.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: sr: Use kmemdup instead of duplicating it in parse_nla_srh · 7fa41efa
      YueHaibing committed
      Replace calls to kmalloc followed by a memcpy with a direct call to
      kmemdup.
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
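The allocate-then-copy pattern that this commit folds into one call can be illustrated in userspace. The following is a minimal sketch, assuming malloc() as a stand-in for kmalloc(); the name memdup_sketch is hypothetical and this is not the kernel's kmemdup() implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userspace stand-in for the kmalloc() + memcpy() pair: allocate len bytes
 * and copy src into them in one helper. Returns NULL if the allocation
 * fails. The caller owns the returned buffer and must free() it.
 */
static void *memdup_sketch(const void *src, size_t len)
{
    void *dst;

    assert(src != NULL);

    dst = malloc(len);
    if (dst)
        memcpy(dst, src, len);
    return dst;
}
```

A caller replaces two statements (allocate, then copy) with a single call and keeps one NULL check.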
    • Merge branch 'net-bridge-add-support-for-backup-port' · f8b2990f
      David S. Miller committed
      Nikolay Aleksandrov says:
      
      ====================
      net: bridge: add support for backup port
      
      This set introduces a new bridge port option that allows any port to have
      any other port (in the same bridge of course) as its backup and traffic
      will be forwarded to the backup port when the primary goes down. This is
      mainly used in MLAG and EVPN setups where we have peerlink path which is
      a backup of many (or even all) ports and is a participating bridge port
      itself. There's more detailed information in patch 02. Patch 01 just
      prepares the port sysfs code for options that take raw value. The main
      issues that this set solves are scalability and fallback latency.
      
      We have used similar code for over 6 months now to bring the fallback
      latency of the backup peerlink down and avoid fdb notification storms.
      Also, due to the nature of master devices, such a setup is currently not
      possible, and last but not least, having tens of thousands of fdbs requires
      thousands of calls to switch.

      I've also CCed our MLAG experts who have been using a similar option.
      
      Roopa also adds:
      
      "Two switches acting in a MLAG pair are connected by the peerlink
      interface which is a bridge port.
      
      The config on one of the switches looks like the below. The other
      switch also has a similar config.
      eth0 is connected to one port on the server. And the server is
      connected to both switches.
      
      br0 -- team0---eth0
            |
            -- switch-peerlink
      
      switch-peerlink becomes the failover/backup port when, say, team0 to
      the server goes down.
      Today, when team0 goes down, the control plane has to withdraw all the fdb
      entries pointing to team0
      and re-install the fdb entries pointing to switch-peerlink, and then
      restore the fdb entries when team0 comes back up again.
      This is the problem we are trying to solve.
      
      This also becomes necessary when multihoming is implemented by a
      standard like E-VPN https://tools.ietf.org/html/rfc8365#section-8
      where the 'switch-peerlink' is an overlay vxlan port (like Nikolay
      mentions in his patch commit). In these implementations, the fdb scale
      can be much larger.
      
      On why bond failover cannot be used here: the point that Nikolay was
      alluding to is that switch-peerlink in the above example is a bridge port
      and is a failover/backup port for more than one or all ports in the
      bridge br0. And you cannot enslave switch-peerlink into a second-level
      team with other bridge ports. Hence a multi-layered team device is not an
      option (FWIW, switch-peerlink is also a teamed interface to the peer
      switch)."
      
      v3: Added Roopa's explanation and diagram
      v2: In patch 01 use kstrdup/kfree to avoid casting the const buf. In order
      to avoid using GFP_ATOMIC or always allocating I kept the spinlock inside
      each branch.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: add support for backup port · 2756f68c
      Nikolay Aleksandrov committed
      This patch adds a new port attribute, IFLA_BRPORT_BACKUP_PORT, which
      allows setting a backup port to be used for known unicast traffic if the
      port has gone carrier down. The backup pointer is RCU protected and set
      only under RTNL. A counter is maintained so that when deleting a port we
      know how many other ports reference it as a backup, and we remove it from
      all of them. Also, the pointer is in the first cache line, which is hot at
      the time of the check, and thus in the common case we only add one more test.

      The backup port will be used only for the non-flooding case since
      it's a part of the bridge and the flooded packets will be forwarded to it
      anyway. To remove the forwarding just send a 0/non-existing backup port.

      This is used to avoid numerous scalability problems when using MLAG, most
      notably if we have thousands of fdbs one would need to change all of them
      on port carrier going down, which takes too long and causes a storm of fdb
      notifications (and again when the port comes back up). In a Multi-chassis
      Link Aggregation setup, hosts are usually connected to two different
      switches which act as a single logical switch. Those switches usually have
      a control and backup link between them called peerlink, which might be used
      for communication in case a host loses connectivity to one of them.

      We need a fast way to fail over in case a host port goes down, and currently
      none of the solutions (like bond) can fulfill the requirements because
      the participating ports are actually the "master" devices and must have the
      same peerlink as their backup interface and at the same time all of them
      must participate in the bridge device. As Roopa noted, it's normal practice
      in routing, called fast re-route, where a precalculated backup path is used
      when the main one is down.

      Another use case of this is with EVPN, having a single vxlan device which
      is the backup of every port. Due to the nature of master devices it's not
      currently possible to use one device as a backup for many and still have
      all of them participate in the bridge (which is master itself).

      More detailed information about MLAG is available at the link below.
      https://docs.cumulusnetworks.com/display/DOCS/Multi-Chassis+Link+Aggregation+-+MLAG
      
      Further explanation and a diagram by Roopa:
      Two switches acting in a MLAG pair are connected by the peerlink
      interface which is a bridge port.
      
      The config on one of the switches looks like the below. The other
      switch also has a similar config.
      eth0 is connected to one port on the server. And the server is
      connected to both switches.
      
      br0 -- team0---eth0
            |
            -- switch-peerlink
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
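The forwarding decision described in this commit (use the backup port for known unicast traffic only when the destination port has lost carrier) can be sketched as follows. This is a simplified userspace model under stated assumptions: struct port_sketch and pick_egress() are hypothetical names, and RCU protection, RTNL locking, and the backup reference counter from the commit message are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified port model; not the kernel's bridge port struct. */
struct port_sketch {
    const char *name;
    bool carrier_ok;            /* true while the link is up */
    struct port_sketch *backup; /* NULL when no backup port is configured */
};

/*
 * Choose the egress port for a known unicast frame: keep the destination
 * port while its carrier is up, otherwise redirect to its backup port if
 * one is set. With no backup configured the original port is returned.
 */
static struct port_sketch *pick_egress(struct port_sketch *dst)
{
    assert(dst != NULL);

    if (!dst->carrier_ok && dst->backup)
        return dst->backup;
    return dst;
}
```

In the common case (carrier up) the function adds a single test, which mirrors the cost argument made in the commit message. Clearing the backup pointer corresponds to sending a 0/non-existing backup port.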
    • net: bridge: add support for raw sysfs port options · a5f3ea54
      Nikolay Aleksandrov committed
      This patch adds a new alternative store callback for port sysfs options
      which takes a raw value (buf) and can use it directly. It is needed for the
      backup port sysfs support since we have to pass the device by its name.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 23 Jul, 2018 16 commits
  3. 22 Jul, 2018 16 commits
    • multicast: remove useless parameter for group add · 0ae0d60a
      Hangbin Liu committed
      Remove the mode parameter for igmp/igmp6_group_added as we can get it
      from the first parameter.

      Fixes: 6e2059b5 (ipv4/igmp: init group mode as INCLUDE when join source group)
      Fixes: c7ea20c9 (ipv6/mcast: init as INCLUDE when join SSM INCLUDE group)
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: wimax: stack: fixed multi line comment issue · ef324779
      Mark Railton committed
      Moved the end of the comment to its own line, per the guide.
      Signed-off-by: Mark Railton <mark@markrailton.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: sfp: Do not use "imply HWMON" · b5293443
      Guenter Roeck committed
      "imply HWMON" was supposed to ensure that the SFP phy code can be built
      with HWMON enabled or disabled while at the same time ensuring that
      HWMON is not built as a module if SFP is built into the kernel.
      Unfortunately, that does not work as intended. With "allmodconfig", it
      results in several unrelated HWMON drivers being disabled instead of
      being built as modules as expected.

      Let's use the old "depends on HWMON || HWMON=n" instead. This is slightly
      different (it forces SFP to be built as a module if HWMON is built as a
      module), but it is better than the alternative of using "IS_REACHABLE()"
      in the driver, since that would disable sensor support if HWMON is built
      as a module and SFP is built into the kernel.

      Fixes: 1323061a ("net: phy: sfp: Add HWMON support for module sensors")
      Cc: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • libcxgb: replace vmalloc and memset with vzalloc · 4c303373
      YueHaibing committed
      Use vzalloc instead of the vmalloc/memset combination.
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hix5hd2_gmac: use dma_zalloc_coherent instead of allocator/memset · c1907e53
      YueHaibing committed
      Use dma_zalloc_coherent instead of dma_alloc_coherent
      followed by memset 0.
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: make some functions static · e064cce1
      YueHaibing committed
      Fixes the following sparse warnings:
      
      net/tipc/link.c:376:5: warning: symbol 'link_bc_rcv_gap' was not declared. Should it be static?
      net/tipc/link.c:823:6: warning: symbol 'link_prepare_wakeup' was not declared. Should it be static?
      net/tipc/link.c:959:6: warning: symbol 'tipc_link_advance_backlog' was not declared. Should it be static?
      net/tipc/link.c:1009:5: warning: symbol 'tipc_link_retrans' was not declared. Should it be static?
      net/tipc/monitor.c:687:5: warning: symbol '__tipc_nl_add_monitor_peer' was not declared. Should it be static?
      net/tipc/group.c:230:20: warning: symbol 'tipc_group_find_member' was not declared. Should it be static?
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: use PTR_ERR_OR_ZERO macro in tcf_block_cb_register · baa2d2b1
      Gustavo A. R. Silva committed
      This line open-codes what the PTR_ERR_OR_ZERO macro already does, so
      use PTR_ERR_OR_ZERO rather than the open-coded version.

      This code was detected with the help of Coccinelle.
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
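To show what "open-coded" means here, the sketch below models the kernel's error-pointer convention in userspace: small negative errno values are encoded in the top of the pointer range. All names ending in _sketch, and open_coded(), are hypothetical stand-ins written for this illustration; they are not the kernel's macros.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO_SKETCH 4095

/* Encode a negative errno value as a pointer. */
static inline void *err_ptr_sketch(long err)
{
    assert(err < 0 && err >= -MAX_ERRNO_SKETCH);
    return (void *)(intptr_t)err;
}

/* True if the pointer lies in the top MAX_ERRNO_SKETCH addresses. */
static inline int is_err_sketch(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

/* Decode the errno value carried by an error pointer. */
static inline long ptr_err_sketch(const void *p)
{
    return (long)(intptr_t)p;
}

/* The open-coded form: check for an error, return it, else return 0. */
static int open_coded(const void *p)
{
    if (is_err_sketch(p))
        return (int)ptr_err_sketch(p);
    return 0;
}

/* The same logic behind one helper, as a PTR_ERR_OR_ZERO-style call. */
static int ptr_err_or_zero_sketch(const void *p)
{
    return is_err_sketch(p) ? (int)ptr_err_sketch(p) : 0;
}
```

Both functions return the same value for every input, which is why replacing one with the other is a pure cleanup.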
    • Merge branch 'tcp-improve-setsockopt-TCP_USER_TIMEOUT-accuracy' · d1afdc51
      David S. Miller committed
      Jon Maxwell says:
      
      ====================
      tcp: improve setsockopt() TCP_USER_TIMEOUT accuracy
      
      The patch was becoming bigger based on feedback therefore I have
      implemented a series of 3 commits instead in V4.
      
      This series is a continuation based on V3 here and associated feedback:
      
      https://patchwork.kernel.org/patch/10516195/
      
      Suggestions by Neal Cardwell:
      
      1) Fix up units mismatch regarding msec/jiffies.
      2) Address possibility of time_remaining being negative.
      3) Add a helper routine tcp_clamp_rto_to_user_timeout() to do the rto
      calculation.
      4) Move start_ts logic into helper routine tcp_retransmit_stamp() to
      validate tcp_sk(sk)->retrans_stamp.
      5) Some u32 declaration and return refactoring.
      6) Return 0 instead of false in tcp_retransmit_stamp(), it's not a bool.
      
      Suggestions by David Laight:
      
      1) Don't cache rto in tcp_clamp_rto_to_user_timeout().
      
      Suggestions by Eric Dumazet:
      
      1) Make u32 declarations consistent.
      2) Use patch series for easier review.
      3) Convert icsk->icsk_user_timeout to milliseconds to avoid the jiffies to
      msec dance.
      4) Use separate titles for each commit in the series.
      5) Fix fuzzy indentation and line wrap issues.
      6) Make commit titles descriptive.
      
      Changes:
      
      1) Call tcp_clamp_rto_to_user_timeout(sk) as an argument to
      inet_csk_reset_xmit_timer() to save on rto declaration.
      
      Every time the TCP retransmission timer fires, it checks to see if
      there is a timeout before scheduling the next retransmit timer. The
      interval between retransmissions increases exponentially. The issue
      is that in order for the timeout to occur, the retransmit timer needs
      to fire again. If the user timeout check happens after the 9th
      retransmit, for example, it needs to wait for the 10th retransmit
      timer to fire in order to evaluate whether a timeout has occurred.
      If the interval is large enough, the timeout will be inaccurate.
      
      For example with a TCP_USER_TIMEOUT of 10 seconds without patch:
      
      1st retransmit:
      
      22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]
      
      Last retransmit:
      
      22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]
      
      Timeout:
      
      send: Connection timed out
      Sun Jul  1 22:25:34 EDT 2018
      
      We can see that the last retransmit took ~7 seconds, which pushed the
      total timeout to ~15 seconds instead of the expected 10 seconds. This
      gets more inaccurate the larger the TCP_USER_TIMEOUT value, as the
      interval increases.

      Add tcp_clamp_rto_to_user_timeout() to determine whether the user rto
      has expired or whether the rto interval needs to be recalculated. Use
      the original interval if the user rto is not set.
      
      Test results with the patch show the expected 10 second timeout:
      
      1st retransmit:
      
      01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]
      
      Last retransmit:
      
      01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]
      
      Timeout:
      
      send: Connection timed out
      Mon Jul  2 01:38:09 EDT 2018
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Add tcp_clamp_rto_to_user_timeout() helper to improve accuracy · b701a99e
      Jon Maxwell committed
      Create the tcp_clamp_rto_to_user_timeout() helper routine to calculate
      the correct rto, so that the TCP_USER_TIMEOUT socket option is more
      accurate. This takes into account suggestions and feedback from
      Eric Dumazet, Neal Cardwell and David Laight. Due to the 1st commit we
      can avoid the msecs_to_jiffies() and jiffies_to_msecs() dance.
      Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
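The clamping idea can be sketched in plain C. This is an illustration under stated assumptions, not the kernel function: clamp_rto_ms() is a hypothetical name, all values are plain milliseconds, and the elapsed time since the first retransmit is passed in by the caller instead of being derived from socket state.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the interval to arm the next retransmit timer with.
 *  - No user timeout set (0): keep the original rto.
 *  - User timeout already reached: return 1 so the timer fires as soon
 *    as possible and the timeout is detected.
 *  - Otherwise: never sleep past the remaining user timeout.
 */
static uint32_t clamp_rto_ms(uint32_t rto_ms, uint32_t user_timeout_ms,
                             uint32_t elapsed_ms)
{
    uint32_t remaining_ms;

    assert(rto_ms > 0);

    if (user_timeout_ms == 0)
        return rto_ms;
    if (elapsed_ms >= user_timeout_ms)
        return 1;

    remaining_ms = user_timeout_ms - elapsed_ms;
    return rto_ms < remaining_ms ? rto_ms : remaining_ms;
}
```

With a 10000 ms user timeout, 7200 ms elapsed and a 6400 ms rto, the timer is armed for the remaining 2800 ms instead of overshooting the limit, which is the behaviour the test results in the cover letter describe.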
    • tcp: Add tcp_retransmit_stamp() helper routine · a7fa3770
      Jon Maxwell committed
      Create a separate helper routine as per Neal Cardwell's suggestion, to
      be used by the final commit in this series and retransmits_timed_out().
      Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: convert icsk_user_timeout from jiffies to msecs · 9bcc66e1
      Jon Maxwell committed
      This is a preparatory commit, part of the series that improves the
      accuracy of the TCP_USER_TIMEOUT socket option. Implement Eric Dumazet's
      idea to convert icsk->icsk_user_timeout from jiffies to msecs, to
      eliminate the msecs_to_jiffies() and jiffies_to_msecs() dance in future.
      Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 's390-qeth-updates' · 975cd350
      David S. Miller committed
      Julian Wiedmann says:

      ====================
      s390/qeth: updates 2018-07-19

      Please apply one more round of qeth patches to net-next.
      This brings additional performance improvements for the transmit code,
      and some refactoring to pave the way for using netdev_priv.
      Also, two minor fixes for rare corner cases.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390/qeth: speed up L2 IQD xmit · 5f89eca5
      Julian Wiedmann committed
      Modify the L2 OSA xmit path so that it also supports L2 IQD devices
      (in particular, their HW header requirements). This allows IQD devices
      to advertise NETIF_F_SG support, and eliminates the allocation overhead
      for the HW header.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390/qeth: add support for constrained HW headers · a7c2f4a3
      Julian Wiedmann committed
      Some transmit modes require that the HW header is located in the same
      page as the initial protocol headers in skb->data. Let callers specify
      the size of this contiguous header range, and enforce it when building
      the HW header.

      While at it, apply some gentle renaming to the relevant L2 code so that
      it matches the L3 code.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390/qeth: merge linearize-check into HW header construction · ba86ceee
      Julian Wiedmann committed
      When checking whether an skb needs to be linearized to fit into an IO
      buffer, it's desirable to consider the skb's final size and layout
      (i.e. after the HW header was added). But a subsequent linearization can
      then cause the re-positioned HW header to violate its alignment
      restrictions.

      Dealing with this situation in two different code paths is quite tricky.
      This patch integrates a) linearize-check and b) HW header construction
      into one three-step sequence:
      1. evaluate how the HW header needs to be added (to identify if it takes
         up an additional buffer element), then
      2. check if the required buffer elements exceed the device's limit.
         Linearize when necessary and re-evaluate the HW header placement.
      3. Add the HW header in the best-possible way:
         a) push, without taking up an additional buffer element
         b) push, but consume another buffer element
         c) allocate a header object from the cache.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • s390/qeth: add statistics for consumed buffer elements · d2a274b2
      Julian Wiedmann committed
      Nowadays an skb fragment typically spans multiple pages. So replace
      the obsolete, SG-only 'fragments' counter with one that tracks the
      consumed buffer elements. This is what actually matters for performance.
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>