1. 30 3月, 2018 1 次提交
    • K
      net: Introduce net_rwsem to protect net_namespace_list · f0b07bb1
      Kirill Tkhai 提交于
      rtnl_lock() is used everywhere, and contention is very high.
      When someone wants to iterate over alive net namespaces,
      he/she has no a possibility to do that without exclusive lock.
      But the exclusive rtnl_lock() in such places is overkill,
      and it just increases the contention. Yes, there is already
      for_each_net_rcu() in kernel, but it requires rcu_read_lock(),
      and this can't be sleepable. Also, sometimes it may be need
      really prevent net_namespace_list growth, so for_each_net_rcu()
      is not fit there.
      
      This patch introduces new rw_semaphore, which will be used
      instead of rtnl_mutex to protect net_namespace_list. It is
      sleepable and allows not-exclusive iterations over net
      namespaces list. It allows to stop using rtnl_lock()
      in several places (what is made in next patches) and makes
      less the time, we keep rtnl_mutex. Here we just add new lock,
      while the explanation of we can remove rtnl_lock() there are
      in next patches.
      
      Fine grained locks generally are better, then one big lock,
      so let's do that with net_namespace_list, while the situation
      allows that.
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f0b07bb1
  2. 28 3月, 2018 7 次提交
  3. 27 3月, 2018 15 次提交
  4. 26 3月, 2018 4 次提交
    • K
      net: Drop NETDEV_UNREGISTER_FINAL · 070f2d7e
      Kirill Tkhai 提交于
      Last user is gone after bdf5bd7f "rds: tcp: remove
      register_netdevice_notifier infrastructure.", so we can
      remove this netdevice command. This allows to delete
      rtnl_lock() in netdev_run_todo(), which is hot path for
      net namespace unregistration.
      
      dev_change_net_namespace() and netdev_wait_allrefs()
      have rcu_barrier() before NETDEV_UNREGISTER_FINAL call,
      and the source commits say they were introduced to
      delemit the call with NETDEV_UNREGISTER, but this patch
      leaves them on the places, since they require additional
      analysis, whether we need in them for something else.
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      070f2d7e
    • K
      net: Make NETDEV_XXX commands enum { } · ede2762d
      Kirill Tkhai 提交于
      This patch is preparation to drop NETDEV_UNREGISTER_FINAL.
      Since the cmd is used in usnic_ib_netdev_event_to_string()
      to get cmd name, after plain removing NETDEV_UNREGISTER_FINAL
      from everywhere, we'd have holes in event2str[] in this
      function.
      
      Instead of that, let's make NETDEV_XXX commands names
      available for everyone, and to define netdev_cmd_to_name()
      in the way we won't have to shaffle names after their
      numbers are changed.
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ede2762d
    • K
      tipc: tipc_disc_addr_trial_msg() can be static · da18ab32
      kbuild test robot 提交于
      Fixes: 25b0b9c4 ("tipc: handle collisions of 32-bit node address hash values")
      Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
      Acked-by: Jon Maloy jon.maloy@ericsson.com
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      da18ab32
    • Y
      net: permit skb_segment on head_frag frag_list skb · 13acc94e
      Yonghong Song 提交于
      One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
      function skb_segment(), line 3667. The bpf program attaches to
      clsact ingress, calls bpf_skb_change_proto to change protocol
      from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
      to send the changed packet out.
      
      3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
      3473                             netdev_features_t features)
      3474 {
      3475         struct sk_buff *segs = NULL;
      3476         struct sk_buff *tail = NULL;
      ...
      3665                 while (pos < offset + len) {
      3666                         if (i >= nfrags) {
      3667                                 BUG_ON(skb_headlen(list_skb));
      3668
      3669                                 i = 0;
      3670                                 nfrags = skb_shinfo(list_skb)->nr_frags;
      3671                                 frag = skb_shinfo(list_skb)->frags;
      3672                                 frag_skb = list_skb;
      ...
      
      call stack:
      ...
       #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
       #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
       #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
       #4 [ffff883ffef03668] die at ffffffff8101deb2
       #5 [ffff883ffef03698] do_trap at ffffffff8101a700
       #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
       #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
       #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
          [exception RIP: skb_segment+3044]
          RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
          RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
          RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
          RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
          R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
          R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      13acc94e
  5. 24 3月, 2018 13 次提交
    • D
      net/sched: act_vlan: declare push_vid with host byte order · 94cb5492
      Davide Caratti 提交于
      use u16 in place of __be16 to suppress the following sparse warnings:
      
       net/sched/act_vlan.c:150:26: warning: incorrect type in assignment (different base types)
       net/sched/act_vlan.c:150:26: expected restricted __be16 [usertype] push_vid
       net/sched/act_vlan.c:150:26: got unsigned short
       net/sched/act_vlan.c:151:21: warning: restricted __be16 degrades to integer
       net/sched/act_vlan.c:208:26: warning: incorrect type in assignment (different base types)
       net/sched/act_vlan.c:208:26: expected unsigned short [unsigned] [usertype] tcfv_push_vid
       net/sched/act_vlan.c:208:26: got restricted __be16 [usertype] push_vid
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      94cb5492
    • D
      net/sched: remove tcf_idr_cleanup() · affaa0c7
      Davide Caratti 提交于
      tcf_idr_cleanup() is no more used, so remove it.
      Suggested-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      affaa0c7
    • J
      tipc: obtain node identity from interface by default · 52dfae5c
      Jon Maloy 提交于
      Selecting and explicitly configuring a TIPC node identity may be
      unwanted in some cases.
      
      In this commit we introduce a default setting if the identity has not
      been set at the moment the first bearer is enabled. We do this by
      using a raw copy of a unique identifier from the used interface: MAC
      address in the case of an L2 bearer, IPv4/IPv6 address in the case
      of a UDP bearer.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52dfae5c
    • J
      tipc: handle collisions of 32-bit node address hash values · 25b0b9c4
      Jon Maloy 提交于
      When a 32-bit node address is generated from a 128-bit identifier,
      there is a risk of collisions which must be discovered and handled.
      
      We do this as follows:
      - We don't apply the generated address immediately to the node, but do
        instead initiate a 1 sec trial period to allow other cluster members
        to discover and handle such collisions.
      
      - During the trial period the node periodically sends out a new type
        of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast,
        to all the other nodes in the cluster.
      
      - When a node is receiving such a message, it must check that the
        presented 32-bit identifier either is unused, or was used by the very
        same peer in a previous session. In both cases it accepts the request
        by not responding to it.
      
      - If it finds that the same node has been up before using a different
        address, it responds with a DSC_TRIAL_FAIL_MSG containing that
        address.
      
      - If it finds that the address has already been taken by some other
        node, it generates a new, unused address and returns it to the
        requester.
      
      - During the trial period the requesting node must always be prepared
        to accept a failure message, i.e., a message where a peer suggests a
        different (or equal)  address to the one tried. In those cases it
        must apply the suggested value as trial address and restart the trial
        period.
      
      This algorithm ensures that in the vast majority of cases a node will
      have the same address before and after a reboot. If a legacy user
      configures the address explicitly, there will be no trial period and
      messages, so this protocol addition is completely backwards compatible.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      25b0b9c4
    • J
      tipc: add 128-bit node identifier · d50ccc2d
      Jon Maloy 提交于
      We add a 128-bit node identity, as an alternative to the currently used
      32-bit node address.
      
      For the sake of compatibility and to minimize message header changes
      we retain the existing 32-bit address field. When not set explicitly by
      the user, this field will be filled with a hash value generated from the
      much longer node identity, and be used as a shorthand value for the
      latter.
      
      We permit either the address or the identity to be set by configuration,
      but not both, so when the address value is set by a legacy user the
      corresponding 128-bit node identity is generated based on the that value.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d50ccc2d
    • J
      tipc: remove direct accesses to own_addr field in struct tipc_net · 23fd3eac
      Jon Maloy 提交于
      As a preparation to changing the addressing structure of TIPC we replace
      all direct accesses to the tipc_net::own_addr field with the function
      dedicated for this, tipc_own_addr().
      
      There are no changes to program logics in this commit.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      23fd3eac
    • J
      tipc: allow closest-first lookup algorithm when legacy address is configured · b89afb11
      Jon Maloy 提交于
      The removal of an internal structure of the node address has an unwanted
      side effect.
      - Currently, if a user is sending an anycast message with destination
        domain 0, the tipc_namebl_translate() function will use the 'closest-
        first' algorithm to first look for a node local destination, and only
        when no such is found, will it resort to the cluster global 'round-
        robin' lookup algorithm.
      - Current users can get around this, and enforce unconditional use of
        global round-robin by indicating a destination as Z.0.0 or Z.C.0.
      - This option disappears when we make the node address flat, since the
        lookup algorithm has no way of recognizing this case. So, as long as
        there are node local destinations, the algorithm will always select
        one of those, and there is nothing the sender can do to change this.
      
      We solve this by eliminating the 'closest-first' option, which was never
      a good idea anyway, for non-legacy users, but only for those. To
      distinguish between legacy users and non-legacy users we introduce a new
      flag 'legacy_addr_format' in struct tipc_core, to be set when the user
      configures a legacy-style Z.C.N node address. Hence, when a legacy user
      indicates a zero lookup domain 'closest-first' is selected, and in all
      other cases we use 'round-robin'.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b89afb11
    • J
      tipc: remove restrictions on node address values · 20263641
      Jon Maloy 提交于
      Nominally, TIPC organizes network nodes into a three-level network
      hierarchy consisting of the levels 'zone', 'cluster' and 'node'. This
      hierarchy is reflected in the node address format, - it is sub-divided
      into an 8-bit zone id, and 12 bit cluster id, and a 12-bit node id.
      
      However, the 'zone' and 'cluster' levels have in reality never been
      fully implemented,and never will be. The result of this has been
      that the first 20 bits the node identity structure have been wasted,
      and the usable node identity range within a cluster has been limited
      to 12 bits. This is starting to become a problem.
      
      In the following commits, we will need to be able to connect between
      nodes which are using the whole 32-bit value space of the node address.
      We therefore remove the restrictions on which values can be assigned
      to node identity, -it is from now on only a 32-bit integer with no
      assumed internal structure.
      
      Isolation between clusters is now achieved only by setting different
      values for the 'network id' field used during neighbor discovery, in
      practice leading to the latter becoming the new cluster identity.
      
      The rules for accepting discovery requests/responses from neighboring
      nodes now become:
      
      - If the user is using legacy address format on both peers, reception
        of discovery messages is subject to the legacy lookup domain check
        in addition to the cluster id check.
      
      - Otherwise, the discovery request/response is always accepted, provided
        both peers have the same network id.
      
      This secures backwards compatibility for users who have been using zone
      or cluster identities as cluster separators, instead of the intended
      'network id'.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      20263641
    • J
      tipc: some cleanups in the file discover.c · b39e465e
      Jon Maloy 提交于
      To facilitate the coming changes in the neighbor discovery functionality
      we make some renaming and refactoring of that code. The functional changes
      in this commit are trivial, e.g., that we move the message sending call in
      tipc_disc_timeout() outside the spinlock protected region.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b39e465e
    • J
      tipc: refactor function tipc_enable_bearer() · cb30a633
      Jon Maloy 提交于
      As a preparation for the next commits we try to reduce the footprint of
      the function tipc_enable_bearer(), while hopefully making is simpler to
      follow.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb30a633
    • K
      net: Convert rxrpc_net_ops · b2864fbd
      Kirill Tkhai 提交于
      These pernet_operations modifies rxrpc_net_id-pointed
      per-net entities. There is external link to AF_RXRPC
      in fs/afs/Kconfig, but it seems there is no other
      pernet_operations interested in that per-net entities.
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2864fbd
    • K
      net: Convert udp_sysctl_ops · fc18999e
      Kirill Tkhai 提交于
      These pernet_operations just initialize udp4 defaults.
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc18999e
    • N
      net: bridge: fix direct access to bridge vlan_enabled and use helper · 82792a07
      Nikolay Aleksandrov 提交于
      We need to use br_vlan_enabled() helper otherwise we'll break builds
      without bridge vlans:
      net/bridge//br_if.c: In function ‘br_mtu’:
      net/bridge//br_if.c:458:8: error: ‘const struct net_bridge’ has no
      member named ‘vlan_enabled’
        if (br->vlan_enabled)
              ^
      net/bridge//br_if.c:462:1: warning: control reaches end of non-void
      function [-Wreturn-type]
       }
       ^
      scripts/Makefile.build:324: recipe for target 'net/bridge//br_if.o'
      failed
      
      Fixes: 419d14af ("bridge: Allow max MTU when multiple VLANs present")
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      82792a07