1. 03 May, 2016 25 commits
    • Merge branch 'bridge-per-vlan-stats' · e8194d4f
      David S. Miller committed
      Nikolay Aleksandrov says:
      
      ====================
      bridge: per-vlan stats
      
      This set adds support for bridge per-vlan statistics.
      In order to be able to dump statistics for many vlans we need a way to
      continue dumping after reaching maximum size, thus patches 01 and 02 extend
      the new stats API with a per-device extended link stats attribute and
      callback which can save its local state and continue where it left off
      afterwards. I considered using the already existing "fill_xstats" callback
      but it gets confusing since we need to separate the linkinfo dump from the
      new stats api dump and adding a flag/argument to do that just looks messy.
      I don't think the rtnl_link_ops size is an issue, so adding these seemed
      like the cleaner approach.
      
      Patches 03 and 04 add the stats support and netlink dump support
      respectively. The stats accounting is controlled via a bridge option which
      is default off, thus the performance impact is kept minimal.
      I've tested this set with both old and modified iproute2, kmemleak on and
      some traffic stress tests while adding/removing vlans and ports.
      
      v3:
       - drop the RCU pvid patch and remove one pointer fetch as requested
       - make stats accounting optional with default to off, the option is in the
         same cache line as vlan_proto and vlan_enabled, so it is already fetched
         before the fast path check thus the performance impact is minimal, this
         also allows us to avoid one vlan lookup and return early when using pvid
       - rebased and retested
      
      v2:
       - Improve the error checking, rename lidx to prividx and save the current
         idx user instead of restricting it to one in patch 01
       - squash patch 02 into 01 and remove the restriction
       - add callback descriptions, improve the size calculation and change the
         xstats message structure to have an embedding level per rtnl link type
         so we can avoid one call to get the link type (and thus filter on it)
         and also each link type can now have any number of private attributes
         inside
       - fix a problem where the vlan stats are not dumped if the bridge has 0
         vlans on it but has vlans on the ports, add bridge link type private
         attributes and also add paddings for future extensions to avoid at least
         a few netlink attributes and improve struct alignment
       - drop the is_skb_forwardable argument constifying patch as it's not
         needed anymore, but it's a nice cleanup which I'll send separately
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: netlink: export per-vlan stats · a60c0903
      Nikolay Aleksandrov committed
      Add a new LINK_XSTATS_TYPE_BRIDGE attribute and implement the
      RTM_GETSTATS callbacks for IFLA_STATS_LINK_XSTATS (fill_linkxstats and
      get_linkxstats_size) in order to export the per-vlan stats.
      The paddings were added because these fields will soon be needed for
      per-port per-vlan stats (or something else if someone beats me to it),
      which avoids at least a few more netlink attributes.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: vlan: learn to count · 6dada9b1
      Nikolay Aleksandrov committed
      Add support for per-VLAN Tx/Rx statistics. Every global vlan context gets
      allocated a per-cpu stats which is then set in each per-port vlan context
      for quick access. The br_allowed_ingress() common function is used to
      account for Rx packets and the br_handle_vlan() common function is used
      to account for Tx packets. Stats accounting is performed only if the
      bridge-wide vlan_stats_enabled option is set either via sysfs or netlink.
      A struct hole between vlan_enabled and vlan_proto is used for the new
      option so it is in the same cache line. Currently it is binary (on/off)
      and is intentionally restricted to exactly 0 and 1, since other values
      will be used in the future for different purposes (e.g. per-port stats).
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
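The accounting scheme described in this commit can be modelled in userspace. The following is a minimal Python sketch, not the kernel code: the class names, the `add_port_vlan` helper and the counter field names are hypothetical. It only illustrates three points from the message: one per-CPU counter set per global VLAN context, the same set shared by each per-port VLAN context, and accounting gated by an option that accepts only 0 and 1.

```python
class VlanStats:
    """Per-CPU Rx/Tx counters for one VLAN (hypothetical model)."""

    FIELDS = ("rx_packets", "rx_bytes", "tx_packets", "tx_bytes")

    def __init__(self, num_cpus):
        # One private counter slot per CPU, so updates need no shared lock.
        self.percpu = [dict.fromkeys(self.FIELDS, 0) for _ in range(num_cpus)]

    def account(self, cpu, direction, nbytes):
        slot = self.percpu[cpu]
        slot[direction + "_packets"] += 1
        slot[direction + "_bytes"] += nbytes

    def totals(self):
        # Readers sum the per-CPU slots when the stats are dumped.
        return {f: sum(slot[f] for slot in self.percpu) for f in self.FIELDS}


class Bridge:
    """Toy bridge holding global and per-port VLAN contexts."""

    def __init__(self, num_cpus):
        self.num_cpus = num_cpus
        self.vlan_stats_enabled = 0   # default off, as in the commit message
        self.vlans = {}               # vid -> VlanStats (global context)
        self.port_vlans = {}          # (port, vid) -> the same VlanStats

    def set_vlan_stats_enabled(self, val):
        # Only 0 and 1 are accepted; other values are reserved for later use.
        if val not in (0, 1):
            raise ValueError("vlan_stats_enabled must be 0 or 1")
        self.vlan_stats_enabled = val

    def add_vlan(self, vid):
        if vid not in self.vlans:
            self.vlans[vid] = VlanStats(self.num_cpus)
        return self.vlans[vid]

    def add_port_vlan(self, port, vid):
        # The per-port context points at the global context's counters.
        self.port_vlans[(port, vid)] = self.add_vlan(vid)

    def rx(self, cpu, port, vid, nbytes):
        if self.vlan_stats_enabled:
            self.port_vlans[(port, vid)].account(cpu, "rx", nbytes)

    def tx(self, cpu, port, vid, nbytes):
        if self.vlan_stats_enabled:
            self.port_vlans[(port, vid)].account(cpu, "tx", nbytes)
```

With the option off, `rx()` and `tx()` return immediately after a single flag check, which is the property the commit relies on to keep the fast-path cost low.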
    • net: rtnetlink: add linkxstats callbacks and attribute · 97a47fac
      Nikolay Aleksandrov committed
      Add callbacks to calculate the size and fill link extended statistics
      which can be split into multiple messages and are dumped via the new
      rtnl stats API (RTM_GETSTATS) with the IFLA_STATS_LINK_XSTATS attribute.
      Also add that attribute to the idx mask check since it is expected to
      be able to save state and resume dumping (e.g. future bridge per-vlan
      stats will be dumped via this attribute and callbacks).
      Each link type should nest its private attributes under the per-link type
      attribute. This allows any number of separate private attributes and
      avoids one call to get the dev link type.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
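The nesting described in this commit (private attributes inside a per-link-type attribute, inside the extended stats attribute) can be illustrated with the generic netlink attribute layout: a 16-bit length, a 16-bit type, then the payload padded to 4 bytes. The Python sketch below is an illustration only. The numeric attribute values and the `PRIV_*` names are placeholders chosen for the example and are not taken from the uapi headers, and the encoding ignores flag bits such as the nested-attribute flag.

```python
import struct

NLA_HDRLEN = 4  # u16 length + u16 type


def nla_align(n):
    """Round n up to the 4-byte netlink attribute alignment."""
    return (n + 3) & ~3


def nla_put(attr_type, payload):
    """Encode one attribute: header, payload, zero padding."""
    length = NLA_HDRLEN + len(payload)
    pad = nla_align(length) - length
    return struct.pack("=HH", length, attr_type) + payload + b"\x00" * pad


def nla_nest(attr_type, *children):
    """A nested attribute is an attribute whose payload is more attributes."""
    return nla_put(attr_type, b"".join(children))


def nla_parse(buf):
    """Split a buffer into (type, payload) pairs, one nesting level deep."""
    attrs, off = [], 0
    while off + NLA_HDRLEN <= len(buf):
        length, attr_type = struct.unpack_from("=HH", buf, off)
        if length < NLA_HDRLEN or off + length > len(buf):
            break  # malformed or truncated attribute
        attrs.append((attr_type, buf[off + NLA_HDRLEN:off + length]))
        off += nla_align(length)
    return attrs


# Placeholder attribute numbers, for illustration only.
XSTATS_ATTR = 2        # stands in for IFLA_STATS_LINK_XSTATS
LINK_TYPE_BRIDGE = 1   # stands in for LINK_XSTATS_TYPE_BRIDGE
PRIV_VLAN = 1          # hypothetical bridge-private attribute
PRIV_OTHER = 2         # hypothetical second private attribute

message = nla_nest(
    XSTATS_ATTR,
    nla_nest(
        LINK_TYPE_BRIDGE,
        nla_put(PRIV_VLAN, struct.pack("=Q", 5)),
        nla_put(PRIV_OTHER, b"abc"),
    ),
)
```

A reader of such a message can filter on the link type by looking at the second nesting level, without asking the kernel what type the device is.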
    • net: rtnetlink: allow rtnl_fill_statsinfo to save private state counter · e8872a25
      Nikolay Aleksandrov committed
      The new prividx argument allows the device currently being dumped to
      save a private state counter, which enables it to continue dumping from
      where it left off. The idxattr argument saves the current idx user, so
      multiple prividx-using attributes can be requested at the same time,
      as suggested by Roopa Prabhu.
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
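The resume mechanism can be sketched as a fill routine that stops when the message is full and reports where it stopped. The Python model below is an assumption-laden illustration, not the kernel implementation: `fill_message`, `dump_all` and the item-count `capacity` are hypothetical stand-ins for the real callbacks and the byte-sized netlink buffer. Only the roles of idxattr (which attribute was being filled) and prividx (how far into it the dump got) follow the commit message.

```python
def fill_message(attrs, capacity, idxattr=0, prividx=0):
    """Fill one message, resuming at attribute idxattr, entry prividx.

    attrs is a list of (name, entries) pairs. Returns (msg, idxattr, prividx);
    idxattr is None once everything has been dumped.
    """
    msg = []
    for a in range(idxattr, len(attrs)):
        name, entries = attrs[a]
        # Only the attribute that was interrupted resumes mid-way.
        start = prividx if a == idxattr else 0
        for i in range(start, len(entries)):
            if len(msg) >= capacity:
                return msg, a, i  # out of room: save state for the next call
            msg.append((name, entries[i]))
    return msg, None, 0


def dump_all(attrs, capacity):
    """Call fill_message repeatedly until the dump is complete."""
    if capacity < 1:
        raise ValueError("capacity must allow at least one entry per message")
    messages = []
    idxattr, prividx = 0, 0
    while idxattr is not None:
        msg, idxattr, prividx = fill_message(attrs, capacity, idxattr, prividx)
        messages.append(msg)
    return messages
```

For example, dumping five VLAN entries followed by two entries of another attribute with room for three entries per message produces three messages, and no entry is repeated or skipped across the boundaries.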
    • Merge branch 'ipv6-tunnel-cleanups' · d1ac3b16
      David S. Miller committed
      Tom Herbert says:
      
      ====================
      net: Cleanup IPv6 ip tunnels
      
      The IPv6 tunnel code is very different from IPv4 code. There is a lot
      of redundancy with the IPv4 code, particularly in the GRE tunneling.
      
      This patch set cleans up the tunnel code to make the IPv6 code look
      more like the IPv4 code and use common functions between the two
      stacks where possible.
      
      This work should make it easier to maintain and extend the IPv6 ip
      tunnels.
      
      Items in this patch set:
        - Cleanup IPv6 tunnel receive path (ip6_tnl_rcv). Includes using
          gro_cells and exporting ip6_tnl_rcv so the ip6_gre can call it
        - Move GRE functions to common header file (tx functions) or
          gre_demux.c (rx functions like gre_parse_header)
        - Call common GRE functions from IPv6 GRE
        - Create ip6_tnl_xmit (to be like ip_tunnel_xmit)
      
      Tested:
        Ran super_netperf tests for TCP_RR and TCP_STREAM for:
          - IPv4 over gre, gretap, gre6, gre6tap
          - IPv6 over gre, gretap, gre6, gre6tap
          - ipip
          - ip6ip6
          - ipip/gue
          - IPv6 over gre/gue
          - IPv4 over gre/gue
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gre6: Cleanup GREv6 transmit path, call common GRE functions · b05229f4
      Tom Herbert committed
      Changes in GREv6 transmit path:
        - Call gre_checksum, remove gre6_checksum
        - Rename ip6gre_xmit2 to __gre6_xmit
        - Call gre_build_header utility function
        - Call ip6_tnl_xmit common function
        - Call ip6_tnl_change_mtu, eliminate ip6gre_tunnel_change_mtu
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Generic tunnel cleanup · 79ecb90e
      Tom Herbert committed
      A few generic changes to generalize tunnels in IPv6:
        - Export ip6_tnl_change_mtu so that it can be called by ip6_gre
        - Add tun_hlen to ip6_tnl structure.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gre: Create common functions for transmit · 182a352d
      Tom Herbert committed
      Create common functions for both IPv4 and IPv6 GRE in transmit. These
      are put into gre.h.
      
      Common functions are for:
        - GRE checksum calculation. Move gre_checksum to gre.h.
        - Building a GRE header. Move GRE build_header and rename it to
          gre_build_header.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
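The two shared transmit helpers cover building the GRE header and computing its checksum. As a rough userspace illustration of what such helpers do, here is a Python sketch based on the GRE header layout from RFC 2784 and RFC 2890. It is not the kernel's gre_build_header or gre_checksum: the function names and signatures below are hypothetical, and the flag constants are written as host-order bit masks.

```python
import struct

# Flag bits in the first 16 bits of the GRE header (RFC 2784 / RFC 2890).
GRE_CSUM = 0x8000
GRE_KEY = 0x2000
GRE_SEQ = 0x1000


def internet_checksum(data):
    """16-bit one's-complement checksum over data (zero padded if odd)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def gre_header_len(flags):
    """Base header is 4 bytes; each optional field adds 4 more."""
    length = 4
    for flag in (GRE_CSUM, GRE_KEY, GRE_SEQ):
        if flags & flag:
            length += 4
    return length


def build_gre_packet(flags, proto, payload, key=0, seq=0):
    """Return GRE header + payload, with the checksum filled in if requested."""
    header = struct.pack("!HH", flags, proto)
    if flags & GRE_CSUM:
        header += b"\x00\x00\x00\x00"  # checksum + reserved, filled below
    if flags & GRE_KEY:
        header += struct.pack("!I", key)
    if flags & GRE_SEQ:
        header += struct.pack("!I", seq)
    packet = bytearray(header + payload)
    if flags & GRE_CSUM:
        # The checksum covers the GRE header and the payload.
        struct.pack_into("!H", packet, 4, internet_checksum(bytes(packet)))
    return bytes(packet)
```

Because the header logic depends only on the flags, protocol, key and sequence number, the same routine serves IPv4 and IPv6 tunnels, which is the point of moving it to a common header.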
    • ipv6: Create ip6_tnl_xmit · 8eb30be0
      Tom Herbert committed
      This patch renames ip6_tnl_xmit2 to ip6_tnl_xmit and exports it. Other
      users like GRE will be able to call this. The original ip6_tnl_xmit
      function is renamed to ip6_tnl_start_xmit (this is an ndo_start_xmit
      function).
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gre6: Cleanup GREv6 receive path, call common GRE functions · 308edfdf
      Tom Herbert committed
        - Create gre_rcv function. This calls gre_parse_header and ip6gre_rcv.
        - Call ip6_tnl_rcv. Doing this and using gre_parse_header eliminates
          most of the code in ip6gre_rcv.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gre: Move utility functions to common headers · 95f5c64c
      Tom Herbert committed
      Several of the GRE functions defined in net/ipv4/ip_gre.c are usable
      for the IPv6 GRE implementation (that is, they are protocol agnostic).
      
      These include:
        - GRE flag handling functions are moved to gre.h
        - GRE build_header is moved to gre.h and renamed gre_build_header
        - parse_gre_header is moved to gre_demux.c and renamed gre_parse_header
        - iptunnel_pull_header is taken out of gre_parse_header. This is now
          done by the caller. The header length is returned from
          gre_parse_header in an int* argument.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
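The split between parsing the header and pulling it off the packet can be shown with a small userspace model. This Python sketch is illustrative only: `parse_gre_header` here is a hypothetical function that follows the RFC 2784 / RFC 2890 layout, and it returns the header length as a second value where the kernel function uses an int* argument. The caller then strips the header itself.

```python
import struct

GRE_CSUM = 0x8000
GRE_KEY = 0x2000
GRE_SEQ = 0x1000


def parse_gre_header(packet):
    """Parse a GRE header without consuming it.

    Returns (info, hdr_len). The packet is left untouched, so the caller
    decides when and how to strip the header.
    """
    if len(packet) < 4:
        raise ValueError("truncated GRE base header")
    flags, proto = struct.unpack_from("!HH", packet, 0)
    info = {"flags": flags, "proto": proto, "key": None, "seq": None}
    offset = 4

    def read_u32(at):
        if at + 4 > len(packet):
            raise ValueError("truncated GRE optional field")
        return struct.unpack_from("!I", packet, at)[0]

    if flags & GRE_CSUM:
        read_u32(offset)  # checksum + reserved; length check only here
        offset += 4
    if flags & GRE_KEY:
        info["key"] = read_u32(offset)
        offset += 4
    if flags & GRE_SEQ:
        info["seq"] = read_u32(offset)
        offset += 4
    return info, offset


def receive(packet):
    """Caller side: parse first, then pull the header using hdr_len."""
    info, hdr_len = parse_gre_header(packet)
    inner = packet[hdr_len:]  # the pull step, now done by the caller
    return info, inner
```

Keeping the pull in the caller lets the IPv4 and IPv6 receive paths share the parser while each handles its own outer header bookkeeping.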
    • ipv6: Cleanup IPv6 tunnel receive path · 0d3c703a
      Tom Herbert committed
      Some basic changes to make the IPv6 tunnel receive path look more like
      the IPv4 path:
        - Make ip6_tnl_rcv non-static and export it so that GREv6 and others
          can call it
        - Make ip6_tnl_rcv look like ip_tunnel_rcv
        - Switch to gro_cells_receive
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tcp-preempt' · 570d6320
      David S. Miller committed
      Eric Dumazet says:
      
      ====================
      net: make TCP preemptible
      
      Most of the TCP stack assumed it was running from a BH handler.
      
      This is great for most things, as TCP behavior is very sensitive
      to scheduling artifacts.
      
      However, the prequeue and backlog processing are problematic,
      as they need to be flushed with BH blocked.
      
      To cope with modern needs, TCP sockets have big sk_rcvbuf values,
      on the order of 16 MB, and soon 32 MB.
      This means that the backlog can hold thousands of packets, and things
      like TCP coalescing or collapsing on this many packets can
      lead to insane latency spikes, since BH is blocked for too long.
      
      It is time to make UDP/TCP stacks preemptible.
      
      Note that fast path still runs from BH handler.
      
      v2: Added "tcp: make tcp_sendmsg() aware of socket backlog"
          to reduce latency problems of large sends.
      
      v3: Fixed a typo in tcp_cdg.c
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: make tcp_sendmsg() aware of socket backlog · d41a69f1
      Eric Dumazet committed
      Large sendmsg()/write() hold socket lock for the duration of the call,
      unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
      are parked into socket backlog for a long time.
      Critical decisions like fast retransmit might be delayed.
      Receivers have to maintain a big out-of-order queue with additional CPU
      overhead, and there are also possible stalls in TX once windows are full.
      
      Bidirectional flows are particularly hurt since the backlog can become
      quite big if the copy from user space triggers IO (page faults).
      
      Some applications learnt to use sendmsg() (or sendmmsg()) with small
      chunks to avoid this issue.
      
      Kernel should know better, right?
      
      Add a generic sk_flush_backlog() helper and use it right
      before a new skb is allocated. Typically we put 64KB of payload
      per skb (unless MSG_EOR is requested) and checking socket backlog
      every 64KB gives good results.
      
      As a matter of fact, tests with TSO/GSO disabled give very nice
      results, as we manage to keep a small write queue and smaller
      perceived rtt.
      
      Note that sk_flush_backlog() maintains socket ownership,
      so it is not equivalent to a {release_sock(sk); lock_sock(sk);} pair.
      This preserves the implicit atomicity rules that sendmsg() was
      giving to (possibly buggy) applications.
      
      In this simple implementation, I chose to not call tcp_release_cb(),
      but we might consider this later.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
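The idea of checking the backlog once per 64KB of payload can be modelled with a toy send loop. The Python sketch below is a simplified illustration under stated assumptions, not the kernel logic: the `Sock` class, its fields and the `incoming` iterator (packets that arrive while each chunk is being copied) are all hypothetical. It shows only that the backlog is drained right before each new buffer is started, while the sender keeps ownership of the socket.

```python
from collections import deque

SKB_PAYLOAD = 64 * 1024  # typical payload per skb, per the commit message


class Sock:
    """Toy socket: packets arriving while it is owned go to the backlog."""

    def __init__(self):
        self.owned = False
        self.backlog = deque()
        self.processed = []

    def flush_backlog(self):
        """Process queued packets without giving up ownership.

        Returns True if anything was processed.
        """
        if not self.backlog:
            return False
        while self.backlog:
            self.processed.append(self.backlog.popleft())
        return True


def sendmsg(sk, data, incoming):
    """Send data in 64KB chunks, flushing the backlog before each chunk."""
    sk.owned = True
    sent = 0
    flushes = 0
    for offset in range(0, len(data), SKB_PAYLOAD):
        # Right before starting a new buffer, handle what piled up.
        if sk.flush_backlog():
            flushes += 1
        chunk = data[offset:offset + SKB_PAYLOAD]
        sent += len(chunk)
        # Packets that arrived during the copy are parked in the backlog
        # because the socket is still owned by the sender.
        sk.backlog.extend(next(incoming, []))
    sk.owned = False
    sk.flush_backlog()  # releasing the socket processes whatever is left
    return sent, flushes
```

In this model the backlog never holds more than the packets that arrived during one chunk, rather than everything that arrived during the whole call.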
    • net: do not block BH while processing socket backlog · 5413d1ba
      Eric Dumazet committed
      Socket backlog processing is a major latency source.
      
      With current TCP socket sk_rcvbuf limits, I have sampled __release_sock()
      holding the CPU for more than 5 ms, and packets being dropped by the NIC
      once the ring buffer is filled.
      
      All users are now ready to be called from process context, so
      we can unblock BH and let interrupts be serviced faster.
      
      cond_resched_softirq() could be removed, as it has no more users.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • sctp: prepare for socket backlog behavior change · 860fbbc3
      Eric Dumazet committed
      sctp_inq_push() will soon be called without BH being blocked
      when generic socket code flushes the socket backlog.
      
      It is very possible SCTP can be converted to not rely on BH,
      but this needs to be done by SCTP experts.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: prepare for non BH masking at backlog processing · e61da9e2
      Eric Dumazet committed
      UDP uses the generic socket backlog code, and this will soon
      be changed to not disable BH when protocol is called back.
      
      We need to use appropriate SNMP accessors.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dccp: do not assume DCCP code is non preemptible · 7309f882
      Eric Dumazet committed
      DCCP uses the generic backlog code, and this will soon
      be changed to not disable BH when protocol is called back.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: do not block bh during prequeue processing · fb3477c0
      Eric Dumazet committed
      AFAIK, nothing in the current TCP stack absolutely requires BH
      to be disabled once the socket is owned by a thread running in
      process context.
      
      As mentioned in my prior patch ("tcp: give prequeue mode some care"),
      processing a batch of packets might take time, so it is better not to
      block BH at all.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: do not assume TCP code is non preemptible · c10d9310
      Eric Dumazet committed
      We want to make the TCP stack preemptible, as draining the prequeue
      and backlog queues can take a lot of time.
      
      Many SNMP updates were assuming that BH (and preemption) was disabled.
      
      We need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
      and some __TCP_INC_STATS() calls to TCP_INC_STATS().
      
      Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
      and tcp_v4_send_ack(), we add an explicit preempt-disabled section.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'xgene-channel-number' · 5e59c83f
      David S. Miller committed
      Iyappan Subramanian says:
      
      ====================
      drivers: net: xgene: fix: Get channel number from device binding
      
      This patch set adds a 'channel' property to obtain the Ethernet-to-CPU
      channel number, thus decoupling the Linux driver from static resource
      selection.
      
      v2: Address review comments from v1
      - removed irq reference from Linux driver
      - added 'channel' property to get ethernet to CPU channel number
      
      v1:
      - Initial version
      ====================
      Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • dtb: xgene: Add channel property · 6619ac5a
      Iyappan Subramanian committed
      Added a 'channel' property describing the Ethernet-to-CPU channel number.
      Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • I
    • drivers: net: xgene: Get channel number from device binding · 2a37daa6
      Iyappan Subramanian committed
      This patch gets the Ethernet-to-CPU channel (prefetch buffer number) from
      the newly added 'channel' property, thus decoupling the Linux driver from
      resource management.
      Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 02 May, 2016 15 commits