1. 28 Oct, 2016 9 commits
    • A
      net: skip generating uevents for network namespaces that are exiting · 002d8a1a
      Andrey Vagin committed
      No one can see these events, because a network namespace cannot be
      destroyed while it still has sockets.
      
      Unlike other devices, uevents for network devices are generated
      only inside their network namespaces. They are filtered in
      kobj_bcast_filter().
      
      My experiments show that net namespaces are destroyed more than 30%
      faster with this optimization.
      
      Here is a perf output for destroying network namespaces without this
      patch.
      
      -   94.76%     0.02%  kworker/u48:1  [kernel.kallsyms]     [k] cleanup_net
         - 94.74% cleanup_net
            - 94.64% ops_exit_list.isra.4
               - 41.61% default_device_exit_batch
                  - 41.47% unregister_netdevice_many
                     - rollback_registered_many
                        - 40.36% netdev_unregister_kobject
                           - 14.55% device_del
                              + 13.71% kobject_uevent
                           - 13.04% netdev_queue_update_kobjects
                              + 12.96% kobject_put
                           - 12.72% net_rx_queue_update_kobjects
                                kobject_put
                              - kobject_release
                                 + 12.69% kobject_uevent
                        + 0.80% call_netdevice_notifiers_info
               + 19.57% nfsd_exit_net
               + 11.15% tcp_net_metrics_exit
               + 8.25% rpcsec_gss_exit_net
      
      It is critical to optimize the exit path for network namespaces,
      because they are destroyed under net_mutex and many namespaces can be
      destroyed in one iteration.
      
      v2: use dev_set_uevent_suppress()
      
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Andrei Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      002d8a1a
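The suppression idea in this commit can be modeled in userspace. The sketch below is a toy Python model, not kernel code: `Namespace`, `Device`, `unregister_device`, and the `refcount == 0` test for "exiting" are illustrative assumptions, and the `uevent_suppress` field only stands in for the flag that dev_set_uevent_suppress() sets.

```python
from dataclasses import dataclass, field


@dataclass
class Namespace:
    """Toy network namespace; refcount == 0 models a namespace that is exiting."""
    refcount: int = 1
    delivered: list = field(default_factory=list)  # uevents seen by listeners


@dataclass
class Device:
    name: str
    ns: Namespace
    uevent_suppress: bool = False


def kobject_uevent(dev: Device, action: str) -> bool:
    """Emit a uevent unless suppressed; return True if one was emitted."""
    if dev.uevent_suppress:
        return False  # skip building and broadcasting the event
    dev.ns.delivered.append((action, dev.name))
    return True


def unregister_device(dev: Device) -> bool:
    """Suppress uevents first when the owning namespace is already exiting."""
    if dev.ns.refcount == 0:
        dev.uevent_suppress = True
    return kobject_uevent(dev, "remove")


live = Device("eth0", Namespace(refcount=1))
dying = Device("veth0", Namespace(refcount=0))
print(unregister_device(live))   # event is emitted for the live namespace
print(unregister_device(dying))  # event is skipped for the exiting namespace
```

The point of the model is only that the check happens before any event is built, so the work saved is the whole uevent path rather than just its delivery.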
    • S
      ethernet: fix min/max MTU typos · 110447f8
      Stefan Richter committed
      Fixes: d894be57 ("ethernet: use net core MTU range checking in more drivers")
      CC: Jarod Wilson <jarod@redhat.com>
      CC: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
      Acked-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      110447f8
    • D
      Merge branch 'genetlink-improvements' · 0fb6af70
      David S. Miller committed
      Johannes Berg says:
      
      ====================
      genetlink improvements
      
      This series contains some generic netlink improvements, making
      the API safer to use, and making the function pointers in the
      family struct safer by allowing it to be __ro_after_init.
      
      The first patch, introducing genl_family_attrbuf(), just ensures
      that the users of family->attrbuf aren't actually racy, by making
      them use the indirection function for obtaining a reference and
      checking that the context can actually do so.

      The second patch removes the more or less broken ability to have
      a static family ID, except for the three IDs that need to be static,
      either because it's simply needed (genl controller) or due to old
      API misuse. Everything else couldn't be static anyway, or could fail
      when the family is registered, if somebody else already got a static ID.
      
      The third patch statically initializes the families, mostly to save
      some code. I wrote this initially because I thought I could make
      them all const, but that ends up being very inefficient (it would
      require always doing some kind of family -> id lookup), so now it's
      just here because I had it already and it reduces the code size.
      
      The fourth patch then, finally, lays the groundwork for what I had
      really wanted - now with __ro_after_init instead of const; I remove
      code there to do the ID->family hash table mapping in genetlink and
      use IDR instead to both allocate and map the IDs, which again ends
      up saving some code size.
      
      Finally, the fifth patch updates all families; as it turns out, no
      families exist that really dynamically register/unregister. This
      last patch should perhaps be split up, I could submit it for each
      subsystem separately, but it'd depend on the second and third to
      go in first, so would take a while. I can do that though, if that
      seems better to you.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0fb6af70
    • J
      genetlink: mark families as __ro_after_init · 56989f6d
      Johannes Berg committed
      Now genl_register_family() is the only thing (other than the
      users themselves, perhaps, but I didn't find any doing that)
      writing to the family struct.
      
      In all families that I found, genl_register_family() is only
      called from __init functions (some indirectly, in which case
      I've added __init annotations to clarify things), so all can
      actually be marked __ro_after_init.
      
      This protects the data structure from accidental corruption.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      56989f6d
    • J
      genetlink: use idr to track families · 2ae0f17d
      Johannes Berg committed
      Since generic netlink family IDs are small integers, allocated
      densely, IDR is an ideal match for lookups. Replace the existing
      hand-written hash-table with IDR for allocation and lookup.
      
      This lets the families only be written to once, during register,
      since the list_head can be removed and removal of a family won't
      cause any writes.
      
      It also slightly reduces the code size (by about 1.3k on x86-64).
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2ae0f17d
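The allocate-and-map role that this commit gives to IDR can be sketched with a small Python model. This is an illustration only: `FamilyIdr` and its methods are hypothetical names, the linear search is a stand-in for the kernel's radix-tree based IDR, and the ID bounds used in the usage lines are arbitrary assumptions rather than genetlink's real limits.

```python
class FamilyIdr:
    """Toy ID allocator: lowest free ID in [min_id, max_id], plus ID -> object lookup."""

    def __init__(self, min_id: int, max_id: int) -> None:
        self.min_id = min_id
        self.max_id = max_id
        self._by_id = {}  # maps allocated ID -> registered family

    def alloc(self, family) -> int:
        """Register a family under the lowest free ID and return that ID."""
        for candidate in range(self.min_id, self.max_id + 1):
            if candidate not in self._by_id:
                self._by_id[candidate] = family
                return candidate
        raise RuntimeError("no free family ID")

    def find(self, family_id: int):
        """Return the family registered under family_id, or None."""
        return self._by_id.get(family_id)

    def remove(self, family_id: int) -> None:
        """Drop the mapping; the family object itself is not modified."""
        self._by_id.pop(family_id, None)


idr = FamilyIdr(min_id=16, max_id=1023)  # bounds are illustrative assumptions
first = idr.alloc("family-a")
second = idr.alloc("family-b")
idr.remove(first)
print(idr.find(second))
```

Because the mapping lives in the allocator rather than in a list_head embedded in each family, removing an entry does not write to the family object, which is the property the commit message relies on.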
    • J
      genetlink: statically initialize families · 489111e5
      Johannes Berg committed
      Instead of providing macros/inline functions to initialize
      the families, make all users initialize them statically and
      get rid of the macros.
      
      This reduces the kernel code size by about 1.6k on x86-64
      (with allyesconfig).
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      489111e5
    • J
      genetlink: no longer support using static family IDs · a07ea4d9
      Johannes Berg committed
      Static family IDs have never really been used; the only
      use case was the workaround I introduced for those users
      that assumed their family ID was also their multicast
      group ID.
      
      Additionally, because static family IDs would never be
      reserved by the generic netlink code, using a relatively
      low ID would only work for built-in families that can be
      registered immediately after generic netlink is started,
      which is basically only the control family (apart from
      the workaround code, which I also had to add code for so
      it would reserve those IDs).
      
      Thus, anything other than GENL_ID_GENERATE is flawed and
      luckily not used except in the cases I mentioned. Move
      those workarounds into a few lines of code, and then get
      rid of GENL_ID_GENERATE entirely, making it more robust.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a07ea4d9
    • J
      genetlink: introduce and use genl_family_attrbuf() · c90c39da
      Johannes Berg committed
      This helper function allows family implementations to access
      their family's attrbuf. This gets rid of the attrbuf usage
      in families, and also adds locking validation, since it's not
      valid to use the attrbuf with parallel_ops or outside of the
      dumpit callback.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c90c39da
    • A
      skbedit: allow the user to specify bitmask for mark · 4fe77d82
      Antonio Quartulli committed
      The user may want to use only some bits of the skb mark in
      his skbedit rules because the remaining part might be used by
      something else.
      
      Introduce the "mask" parameter to the skbedit actor in order
      to implement such functionality.
      
      When the mask is specified, only the bits it selects are
      changed by the actor, while the rest are left untouched.
      Signed-off-by: Antonio Quartulli <antonio@open-mesh.com>
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4fe77d82
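The masked update described in this commit is ordinary bit arithmetic. The following Python sketch shows one way to express it, assuming a 32-bit mark; `apply_mark` is a hypothetical helper name and not part of the skbedit code.

```python
U32 = 0xFFFFFFFF  # the skb mark is treated as a 32-bit value in this sketch


def apply_mark(old_mark: int, new_mark: int, mask: int = U32) -> int:
    """Return old_mark with only the bits selected by mask replaced from new_mark."""
    mask &= U32
    kept = old_mark & ~mask & U32   # bits outside the mask stay untouched
    changed = new_mark & mask       # bits inside the mask come from the new mark
    return kept | changed


# Only the low byte is owned by this rule; the upper 24 bits belong to someone else.
print(hex(apply_mark(0xAABBCCDD, 0x00000011, 0x000000FF)))  # 0xaabbcc11
```

With the default all-ones mask the whole mark is overwritten, which matches the behaviour of a rule that sets the mark without specifying a mask.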
  2. 27 Oct, 2016 15 commits
  3. 24 Oct, 2016 11 commits
  4. 23 Oct, 2016 5 commits
    • D
      Merge branch 'bpf-numa-id' · 67dc1596
      David S. Miller committed
      Daniel Borkmann says:
      
      ====================
      Add BPF numa id helper
      
      This patch set adds a helper for retrieving current numa node
      id and a test case for SO_REUSEPORT.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      67dc1596
    • D
      reuseport, bpf: add test case for bpf_get_numa_node_id · 3c2c3c16
      Daniel Borkmann committed
      The test case is very similar to reuseport_bpf_cpu, only that here
      we select socket members based on current numa node id.
      
        # numactl -H
        available: 2 nodes (0-1)
        node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
        node 0 size: 128867 MB
        node 0 free: 120080 MB
        node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
        node 1 size: 96765 MB
        node 1 free: 87504 MB
        node distances:
        node   0   1
          0:  10  20
          1:  20  10
      
        # ./reuseport_bpf_numa
        ---- IPv4 UDP ----
        send node 0, receive socket 0
        send node 1, receive socket 1
        send node 1, receive socket 1
        send node 0, receive socket 0
        ---- IPv6 UDP ----
        send node 0, receive socket 0
        send node 1, receive socket 1
        send node 1, receive socket 1
        send node 0, receive socket 0
        ---- IPv4 TCP ----
        send node 0, receive socket 0
        send node 1, receive socket 1
        send node 1, receive socket 1
        send node 0, receive socket 0
        ---- IPv6 TCP ----
        send node 0, receive socket 0
        send node 1, receive socket 1
        send node 1, receive socket 1
        send node 0, receive socket 0
        SUCCESS
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3c2c3c16
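The pairing in the test output (a sender on node N is received by socket N) can be mirrored with a small Python model built from the `numactl -H` listing above. This is only a sketch: in the real test the choice is made by a BPF program using the new helper, while `NODE_CPUS`, `node_of_cpu`, and `select_socket` below are hypothetical names introduced for illustration.

```python
# CPU lists copied from the numactl -H output in the commit message.
NODE_CPUS = {
    0: [0, 1, 2, 3, 4, 5, 12, 13, 14, 15, 16, 17],
    1: [6, 7, 8, 9, 10, 11, 18, 19, 20, 21, 22, 23],
}


def node_of_cpu(cpu: int) -> int:
    """Return the NUMA node that owns the given CPU on this example machine."""
    for node, cpus in NODE_CPUS.items():
        if cpu in cpus:
            return node
    raise ValueError(f"unknown cpu {cpu}")


def select_socket(sockets: list, cpu: int):
    """Pick the reuseport group member whose index equals the CPU's NUMA node."""
    return sockets[node_of_cpu(cpu)]


group = ["socket 0", "socket 1"]
for cpu in (3, 9, 20, 14):
    print(f"send node {node_of_cpu(cpu)}, receive {select_socket(group, cpu)}")
```

The model assumes one listening socket per node, as in the output above; a group with fewer sockets than nodes would need a fallback that this sketch does not cover.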
    • D
      bpf: add helper for retrieving current numa node id · 2d0e30c3
      Daniel Borkmann committed
      The use case is mainly for soreuseport to select sockets for the local
      numa node, but since it is generic, let's also add this for other
      networking and tracing program types.
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2d0e30c3
    • D
      Merge branch 'udpmem' · a10b91b8
      David S. Miller committed
      Paolo Abeni says:
      
      ====================
      udp: refactor memory accounting
      
      This patch series refactors the udp memory accounting, replacing the
      generic implementation with a custom one, in order to remove the need to
      lock the socket on the enqueue and dequeue operations. The socket backlog
      usage is dropped, as well.

      The first patch factors out pieces of some queue and memory management
      socket helpers, so that they can later be used by the udp memory accounting
      functions.
      The second patch adds the memory accounting helpers, without using them.
      The third patch replaces the old rx memory accounting path for udp over ipv4 and
      udp over ipv6. In-kernel UDP users are updated, as well.
      
      The memory accounting schema is described in detail in the individual patch
      commit message.
      
      The performance gain depends on the specific scenario; with few flows (and
      little contention in the original code) the differences are in the noise range,
      while with several flows contending for the same socket, the measured speed-up
      is significant (e.g. even over 100% in case of extreme contention).
      
      Many thanks to Eric Dumazet for the reiterated reviews and suggestions.
      
      v5 -> v6:
       - do not orphan the skb on enqueue, skb_steal_sock() already did
         the work for us
      
      v4 -> v5:
       - use the receive queue spin lock to protect the memory accounting
       - several minor clean-up
      
      v3 -> v4:
       - simplified the locking schema, always use a plain spinlock
      
      v2 -> v3:
       - do not set the now unused backlog_rcv callback
      
      v1 -> v2:
       - slightly changed the memory accounting schema, we now perform lazy reclaim
       - fixed forward_alloc updating issue
       - fixed memory counter integer overflows
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a10b91b8
    • P
      udp: use its own memory accounting schema · 850cbadd
      Paolo Abeni committed
      Completely avoid default sock memory accounting and replace it
      with udp-specific accounting.
      
      Since the new memory accounting model encapsulates completely
      the required locking, remove the socket lock on both enqueue and
      dequeue, and avoid using the backlog on enqueue.
      
      Be sure to clean up rx queue memory on socket destruction, using
      UDP's own sk_destruct.
      
      Tested using pktgen with random src port, 64 bytes packet,
      wire-speed on a 10G link as sender and udp_sink as the receiver,
      using an l4 tuple rxhash to stress the contention, and one or more
      udp_sink instances with reuseport.
      
      nr readers      Kpps (vanilla)  Kpps (patched)
      1               170             440
      3               1250            2150
      6               3000            3650
      9               4200            4450
      12              5700            6250
      
      v4 -> v5:
        - avoid unneeded test in first_packet_length
      
      v3 -> v4:
        - remove useless sk_rcvqueues_full() call
      
      v2 -> v3:
        - do not set the now unused backlog_rcv callback
      
      v1 -> v2:
        - add memory pressure support
        - fixed dropwatch accounting for ipv6
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      850cbadd
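The relative gain behind each row of the Kpps table above can be worked out with a few lines of Python. The figures are copied from the commit message; `gain_percent` is a hypothetical helper added for illustration.

```python
# (number of readers, Kpps vanilla, Kpps patched), as reported in the commit message.
ROWS = [
    (1, 170, 440),
    (3, 1250, 2150),
    (6, 3000, 3650),
    (9, 4200, 4450),
    (12, 5700, 6250),
]


def gain_percent(vanilla: int, patched: int) -> float:
    """Relative throughput gain of the patched kernel, in percent."""
    return (patched - vanilla) * 100 / vanilla


for readers, vanilla, patched in ROWS:
    print(f"{readers:>2} readers: {gain_percent(vanilla, patched):+.1f}%")
```

The single-reader row, where every packet contends for one socket, comes out at roughly +159%, and the gain shrinks as more readers spread the load.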