1. 24 10月, 2015 10 次提交
    • J
      tipc: simplify bearer level broadcast · b06b281e
      Jon Paul Maloy 提交于
      Until now, we have been keeping track of the exact set of broadcast
      destinations though the help structure tipc_node_map. This leads us to
      have to maintain a whole infrastructure for supporting this, including
      a pseudo-bearer and a number of functions to manipulate both the bearers
      and the node map correctly. Apart from the complexity, this approach is
      also limiting, as struct tipc_node_map only can support cluster local
      broadcast if we want to avoid it becoming excessively large. We want to
      eliminate this limitation, in order to enable introduction of scoped
      multicast in the future.
      
      A closer analysis reveals that it is unnecessary maintaining this "full
      set" overview; it is sufficient to keep a counter per bearer, indicating
      how many nodes can be reached via this bearer at the moment. The protocol
      is now robust enough to handle transitional discrepancies between the
      nominal number of reachable destinations, as expected by the broadcast
      protocol itself, and the number which is actually reachable at the
      moment. The initial broadcast synchronization, in conjunction with the
      retransmission mechanism, ensures that all packets will eventually be
      acknowledged by the correct set of destinations.
      
      This commit introduces these changes.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b06b281e
    • J
      tipc: let broadcast packet reception use new link receive function · 52666986
      Jon Paul Maloy 提交于
      The code path for receiving broadcast packets is currently distinct
      from the unicast path. This leads to unnecessary code and data
      duplication, something that can be avoided with some effort.
      
      We now introduce separate per-peer tipc_link instances for handling
      broadcast packet reception. Each receive link keeps a pointer to the
      common, single, broadcast link instance, and can hence handle release
      and retransmission of send buffers as if they belonged to the own
      instance.
      
      Furthermore, we let each unicast link instance keep a reference to both
      the pertaining broadcast receive link, and to the common send link.
      This makes it possible for the unicast links to easily access data for
      broadcast link synchronization, as well as for carrying acknowledges for
      received broadcast packets.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52666986
    • J
      tipc: introduce capability bit for broadcast synchronization · fd556f20
      Jon Paul Maloy 提交于
      Until now, we have tried to support both the newer, dedicated broadcast
      synchronization mechanism along with the older, less safe, RESET_MSG/
      ACTIVATE_MSG based one. The latter method has turned out to be a hazard
      in a highly dynamic cluster, so we find it safer to disable it completely
      when we find that the former mechanism is supported by the peer node.
      
      For this purpose, we now introduce a new capabability bit,
      TIPC_BCAST_SYNCH, to inform any peer nodes that dedicated broadcast
      syncronization is supported by the present node. The new bit is conveyed
      between peers in the 'capabilities' field of neighbor discovery messages.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd556f20
    • J
      tipc: let broadcast transmission use new link transmit function · 2f566124
      Jon Paul Maloy 提交于
      This commit simplifies the broadcast link transmission function, by
      leveraging previous changes to the link transmission function and the
      broadcast transmission link life cycle.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2f566124
    • J
      tipc: make struct tipc_link generic to support broadcast · c1ab3f1d
      Jon Paul Maloy 提交于
      Realizing that unicast is just a special case of broadcast, we also see
      that we can go in the other direction, i.e., that modest changes to the
      current unicast link can make it generic enough to support broadcast.
      
      The following changes are introduced here:
      
      - A new counter ("ackers") in struct tipc_link, to indicate how many
        peers need to ack a packet before it can be released.
      - A corresponding counter in the skb user area, to keep track of how
        many peers a are left to ack before a buffer can be released.
      - A new counter ("acked"), to keep persistent track of how far a peer
        has acked at the moment, i.e., where in the transmission queue to
        start updating buffers when the next ack arrives. This is to avoid
        double acknowledgements from a peer, with inadvertent relase of
        packets as a result.
      - A more generic tipc_link_retrans() function, where retransmit starts
        from a given sequence number, instead of the first packet in the
        transmision queue. This is to minimize the number of retransmitted
        packets on the broadcast media.
      
      When the new functionality is taken into use in the next commits,
      we expect it to have minimal effect on unicast mode performance.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c1ab3f1d
    • J
      tipc: use explicit allocation of broadcast send link · 32301906
      Jon Paul Maloy 提交于
      The broadcast link instance (struct tipc_link) used for sending is
      currently aggregated into struct tipc_bclink. This means that we cannot
      use the regular tipc_link_create() function for initiating the link, but
      do instead have to initiate numerous fields directly from the
      bcast_init() function.
      
      We want to reduce dependencies between the broadcast functionality
      and the inner workings of tipc_link. In this commit, we introduce
      a new function tipc_bclink_create() to link.c, and allocate the
      instance of the link separately using this function.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      32301906
    • J
      tipc: make link implementation independent from struct tipc_bearer · 0e05498e
      Jon Paul Maloy 提交于
      In reality, the link implementation is already independent from
      struct tipc_bearer, in that it doesn't store any reference to it.
      However, we still pass on a pointer to a bearer instance in the
      function tipc_link_create(), just to have it extract some
      initialization information from it.
      
      I later commits, we need to create instances of tipc_link without
      having any associated struct tipc_bearer. To facilitate this, we
      want to extract the initialization data already in the creator
      function in node.c, before calling tipc_link_create(), and pass
      this info on as individual parameters in the call.
      
      This commit introduces this change.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e05498e
    • J
      tipc: create broadcast transmission link at namespace init · 5fd9fd63
      Jon Paul Maloy 提交于
      The broadcast transmission link is currently instantiated when the
      network subsystem is started, i.e., on order from user space via netlink.
      
      This forces the broadcast transmission code to do unnecessary tests for
      the existence of the transmission link, as well in single mode node as
      in network mode.
      
      In this commit, we do instead create the link during initialization of
      the name space, and remove it when it is stopped. The fact that the
      transmission link now has a guaranteed longer life cycle than any of its
      potential clients paves the way for further code simplifcations
      and optimizations.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5fd9fd63
    • J
      tipc: move broadcast link lock to struct tipc_net · 0043550b
      Jon Paul Maloy 提交于
      The broadcast lock will need to be acquired outside bcast.c in a later
      commit. For this reason, we move the lock to struct tipc_net. Consistent
      with the changes in the previous commit, we also introducee two new
      functions tipc_bcast_lock() and tipc_bcast_unlock(). The code that is
      currently using tipc_bclink_lock()/unlock() will be phased out during
      the coming commits in this series.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0043550b
    • J
      tipc: move bcast definitions to bcast.c · 6beb19a6
      Jon Paul Maloy 提交于
      Currently, a number of structure and function definitions related
      to the broadcast functionality are unnecessarily exposed in the file
      bcast.h. This obscures the fact that the external interface towards
      the broadcast link in fact is very narrow, and causes unnecessary
      recompilations of other files when anything changes in those
      definitions.
      
      In this commit, we move as many of those definitions as is currently
      possible to the file bcast.c.
      
      We also rename the structure 'tipc_bclink' to 'tipc_bc_base', both
      since the name does not correctly describe the contents of this
      struct, and will do so even less in the future, and because we want
      to use the term 'link' more appropriately in the functionality
      introduced later in this series.
      
      Finally, we rename a couple of functions, such as tipc_bclink_xmit()
      and others that will be kept in the future, to include the term 'bcast'
      instead.
      
      There are no functional changes in this commit.
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Reviewed-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6beb19a6
  2. 23 10月, 2015 12 次提交
    • R
      mpls: flow-based multipath selection · 1c78efa8
      Robert Shearman 提交于
      Change the selection of a multipath route to use a flow-based
      hash. This more suitable for traffic sensitive to reordering within a
      flow (e.g. TCP, L2VPN) and whilst still allowing a good distribution
      of traffic given enough flows.
      
      Selection of the path for a multipath route is done using a hash of:
      1. Label stack up to MAX_MP_SELECT_LABELS labels or up to and
         including entropy label, whichever is first.
      2. 3-tuple of (L3 src, L3 dst, proto) from IPv4/IPv6 header in MPLS
         payload, if present.
      
      Naturally, a 5-tuple hash using L4 information in addition would be
      possible and be better in some scenarios, but there is a tradeoff
      between looking deeper into the packet to achieve good distribution,
      and packet forwarding performance, and I have erred on the side of the
      latter as the default.
      Signed-off-by: NRobert Shearman <rshearma@brocade.com>
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1c78efa8
    • R
      mpls: multipath route support · f8efb73c
      Roopa Prabhu 提交于
      This patch adds support for MPLS multipath routes.
      
      Includes following changes to support multipath:
      - splits struct mpls_route into 'struct mpls_route + struct mpls_nh'
      
      - 'struct mpls_nh' represents a mpls nexthop label forwarding entry
      
      - moves mpls route and nexthop structures into internal.h
      
      - A mpls_route can point to multiple mpls_nh structs
      
      - the nexthops are maintained as a array (similar to ipv4 fib)
      
      - In the process of restructuring, this patch also consistently changes
        all labels to u8
      
      - Adds support to parse/fill RTA_MULTIPATH netlink attribute for
      multipath routes similar to ipv4/v6 fib
      
      - In this patch, the multipath route nexthop selection algorithm
      simply returns the first nexthop. It is replaced by a
      hash based algorithm from Robert Shearman in the next patch
      
      - mpls_route_update cleanup: remove 'dev' handling in mpls_route_update.
      mpls_route_update though implemented to update based on dev, it was
      never used that way. And the dev handling gets tricky with multiple
      nexthops. Cannot match against any single nexthops dev. So, this patch
      removes the unused 'dev' handling in mpls_route_update.
      
      - dead route/path handling will be implemented in a subsequent patch
      
      Example:
      
      $ip -f mpls route add 100 nexthop as 200 via inet 10.1.1.2 dev swp1 \
                      nexthop as 700 via inet 10.1.1.6 dev swp2 \
                      nexthop as 800 via inet 40.1.1.2 dev swp3
      
      $ip  -f mpls route show
      100
              nexthop as to 200 via inet 10.1.1.2  dev swp1
              nexthop as to 700 via inet 10.1.1.6  dev swp2
              nexthop as to 800 via inet 40.1.1.2  dev swp3
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Acked-by: NRobert Shearman <rshearma@brocade.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8efb73c
    • L
      net: sysctl: fix a kmemleak warning · ce9d9b8e
      Li RongQing 提交于
      the returned buffer of register_sysctl() is stored into net_header
      variable, but net_header is not used after, and compiler maybe
      optimise the variable out, and lead kmemleak reported the below warning
      
      	comm "swapper/0", pid 1, jiffies 4294937448 (age 267.270s)
      	hex dump (first 32 bytes):
      	90 38 8b 01 c0 ff ff ff 00 00 00 00 01 00 00 00 .8..............
      	01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      	backtrace:
      	[<ffffffc00020f134>] create_object+0x10c/0x2a0
      	[<ffffffc00070ff44>] kmemleak_alloc+0x54/0xa0
      	[<ffffffc0001fe378>] __kmalloc+0x1f8/0x4f8
      	[<ffffffc00028e984>] __register_sysctl_table+0x64/0x5a0
      	[<ffffffc00028eef0>] register_sysctl+0x30/0x40
      	[<ffffffc00099c304>] net_sysctl_init+0x20/0x58
      	[<ffffffc000994dd8>] sock_init+0x10/0xb0
      	[<ffffffc0000842e0>] do_one_initcall+0x90/0x1b8
      	[<ffffffc000966bac>] kernel_init_freeable+0x218/0x2f0
      	[<ffffffc00070ed6c>] kernel_init+0x1c/0xe8
      	[<ffffffc000083bfc>] ret_from_fork+0xc/0x50
      	[<ffffffffffffffff>] 0xffffffffffffffff <<end check kmemleak>>
      
      Before fix, the objdump result on ARM64:
      0000000000000000 <net_sysctl_init>:
         0:   a9be7bfd        stp     x29, x30, [sp,#-32]!
         4:   90000001        adrp    x1, 0 <net_sysctl_init>
         8:   90000000        adrp    x0, 0 <net_sysctl_init>
         c:   910003fd        mov     x29, sp
        10:   91000021        add     x1, x1, #0x0
        14:   91000000        add     x0, x0, #0x0
        18:   a90153f3        stp     x19, x20, [sp,#16]
        1c:   12800174        mov     w20, #0xfffffff4                // #-12
        20:   94000000        bl      0 <register_sysctl>
        24:   b4000120        cbz     x0, 48 <net_sysctl_init+0x48>
        28:   90000013        adrp    x19, 0 <net_sysctl_init>
        2c:   91000273        add     x19, x19, #0x0
        30:   9101a260        add     x0, x19, #0x68
        34:   94000000        bl      0 <register_pernet_subsys>
        38:   2a0003f4        mov     w20, w0
        3c:   35000060        cbnz    w0, 48 <net_sysctl_init+0x48>
        40:   aa1303e0        mov     x0, x19
        44:   94000000        bl      0 <register_sysctl_root>
        48:   2a1403e0        mov     w0, w20
        4c:   a94153f3        ldp     x19, x20, [sp,#16]
        50:   a8c27bfd        ldp     x29, x30, [sp],#32
        54:   d65f03c0        ret
      After:
      0000000000000000 <net_sysctl_init>:
         0:   a9bd7bfd        stp     x29, x30, [sp,#-48]!
         4:   90000000        adrp    x0, 0 <net_sysctl_init>
         8:   910003fd        mov     x29, sp
         c:   a90153f3        stp     x19, x20, [sp,#16]
        10:   90000013        adrp    x19, 0 <net_sysctl_init>
        14:   91000000        add     x0, x0, #0x0
        18:   91000273        add     x19, x19, #0x0
        1c:   f90013f5        str     x21, [sp,#32]
        20:   aa1303e1        mov     x1, x19
        24:   12800175        mov     w21, #0xfffffff4                // #-12
        28:   94000000        bl      0 <register_sysctl>
        2c:   f9002260        str     x0, [x19,#64]
        30:   b40001a0        cbz     x0, 64 <net_sysctl_init+0x64>
        34:   90000014        adrp    x20, 0 <net_sysctl_init>
        38:   91000294        add     x20, x20, #0x0
        3c:   9101a280        add     x0, x20, #0x68
        40:   94000000        bl      0 <register_pernet_subsys>
        44:   2a0003f5        mov     w21, w0
        48:   35000080        cbnz    w0, 58 <net_sysctl_init+0x58>
        4c:   aa1403e0        mov     x0, x20
        50:   94000000        bl      0 <register_sysctl_root>
        54:   14000004        b       64 <net_sysctl_init+0x64>
        58:   f9402260        ldr     x0, [x19,#64]
        5c:   94000000        bl      0 <unregister_sysctl_table>
        60:   f900227f        str     xzr, [x19,#64]
        64:   2a1503e0        mov     w0, w21
        68:   f94013f5        ldr     x21, [sp,#32]
        6c:   a94153f3        ldp     x19, x20, [sp,#16]
        70:   a8c37bfd        ldp     x29, x30, [sp],#48
        74:   d65f03c0        ret
      
      Add the possible error handle to free the net_header to remove the
      kmemleak warning
      Signed-off-by: NLi RongQing <roy.qing.li@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ce9d9b8e
    • H
      if_link: Add control trust VF · dd461d6a
      Hiroshi Shimamoto 提交于
      Add netlink directives and ndo entry to trust VF user.
      
      This controls the special permission of VF user.
      The administrator will dedicatedly trust VF user to use some features
      which impacts security and/or performance.
      
      The administrator never turn it on unless VF user is fully trusted.
      
      CC: Sy Jong Choi <sy.jong.choi@intel.com>
      Signed-off-by: NHiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Acked-by: NGreg Rose <gregory.v.rose@intel.com>
      Tested-by: NKrishneil Singh <Krishneil.k.singh@intel.com>
      Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
      dd461d6a
    • E
      tcp/dccp: fix hashdance race for passive sessions · 5e0724d0
      Eric Dumazet 提交于
      Multiple cpus can process duplicates of incoming ACK messages
      matching a SYN_RECV request socket. This is a rare event under
      normal operations, but definitely can happen.
      
      Only one must win the race, otherwise corruption would occur.
      
      To fix this without adding new atomic ops, we use logic in
      inet_ehash_nolisten() to detect the request was present in the same
      ehash bucket where we try to insert the new child.
      
      If request socket was not found, we have to undo the child creation.
      
      This actually removes a spin_lock()/spin_unlock() pair in
      reqsk_queue_unlink() for the fast path.
      
      Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e0724d0
    • L
      af_key: fix two typos · f6b8dec9
      Li RongQing 提交于
      Signed-off-by: NLi RongQing <roy.qing.li@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6b8dec9
    • P
      ipv4: implement support for NOPREFIXROUTE ifa flag for ipv4 address · 7b131180
      Paolo Abeni 提交于
      Currently adding a new ipv4 address always cause the creation of the
      related network route, with default metric. When a host has multiple
      interfaces on the same network, multiple routes with the same metric
      are created.
      
      If the userspace wants to set specific metric on each routes, i.e.
      giving better metric to ethernet links in respect to Wi-Fi ones,
      the network routes must be deleted and recreated, which is error-prone.
      
      This patch implements the support for IFA_F_NOPREFIXROUTE for ipv4
      address. When an address is added with such flag set, no associated
      network route is created, no network route is deleted when
      said IP is gone and it's up to the user space manage such route.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7b131180
    • H
      ipv6: protect mtu calculation of wrap-around and infinite loop by rounding issues · b72a2b01
      Hannes Frederic Sowa 提交于
      Raw sockets with hdrincl enabled can insert ipv6 extension headers
      right into the data stream. In case we need to fragment those packets,
      we reparse the options header to find the place where we can insert
      the fragment header. If the extension headers exceed the link's MTU we
      actually cannot make progress in such a case.
      
      Instead of ending up in broken arithmetic or rounding towards 0 and
      entering an endless loop in ip6_fragment, just prevent those cases by
      aborting early and signal -EMSGSIZE to user space.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b72a2b01
    • A
      tcp: allow dctcp alpha to drop to zero · c80dbe04
      Andrew Shewmaker 提交于
      If alpha is strictly reduced by alpha >> dctcp_shift_g and if alpha is less
      than 1 << dctcp_shift_g, then alpha may never reach zero. For example,
      given shift_g=4 and alpha=15, alpha >> dctcp_shift_g yields 0 and alpha
      remains 15. The effect isn't noticeable in this case below cwnd=137, but
      could gradually drive uncongested flows with leftover alpha down to
      cwnd=137. A larger dctcp_shift_g would have a greater effect.
      
      This change causes alpha=15 to drop to 0 instead of being decrementing by 1
      as it would when alpha=16. However, it requires one less conditional to
      implement since it doesn't have to guard against subtracting 1 from 0U. A
      decay of 15 is not unreasonable since an equal or greater amount occurs at
      alpha >= 240.
      Signed-off-by: NAndrew G. Shewmaker <agshew@gmail.com>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c80dbe04
    • L
      ipv6: fix the incorrect return value of throw route · ab997ad4
      lucien 提交于
      The error condition -EAGAIN, which is signaled by throw routes, tells
      the rules framework to walk on searching for next matches. If the walk
      ends and we stop walking the rules with the result of a throw route we
      have to translate the error conditions to -ENETUNREACH.
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ab997ad4
    • P
      openvswitch: Fix egress tunnel info. · fc4099f1
      Pravin B Shelar 提交于
      While transitioning to netdev based vport we broke OVS
      feature which allows user to retrieve tunnel packet egress
      information for lwtunnel devices.  Following patch fixes it
      by introducing ndo operation to get the tunnel egress info.
      Same ndo operation can be used for lwtunnel devices and compat
      ovs-tnl-vport devices. So after adding such device operation
      we can remove similar operation from ovs-vport.
      
      Fixes: 614732ea ("openvswitch: Use regular VXLAN net_device device").
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc4099f1
    • J
      VSOCK: Fix lockdep issue. · 8566b86a
      Jorgen Hansen 提交于
      The recent fix for the vsock sock_put issue used the wrong
      initializer for the transport spin_lock causing an issue when
      running with lockdep checking.
      
      Testing: Verified fix on kernel with lockdep enabled.
      Reviewed-by: NThomas Hellstrom <thellstrom@vmware.com>
      Signed-off-by: NJorgen Hansen <jhansen@vmware.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8566b86a
  3. 22 10月, 2015 18 次提交