1. 30 Jul, 2014 1 commit
  2. 28 Jul, 2014 3 commits
    • inet: frag: set limits and make init_net's high_thresh limit global · 1bab4c75
      Committed by Nikolay Aleksandrov
      This patch makes init_net's high_thresh limit to be the maximum for all
      namespaces, thus introducing a global memory limit threshold equal to the
      sum of the individual high_thresh limits which are capped.
      It also introduces some sane minimums for low_thresh as it shouldn't be
      able to drop below 0 (or > high_thresh in the unsigned case), and
      overall low_thresh should not ever be above high_thresh, so we make the
      following relations for a namespace:
      init_net:
       high_thresh - max(not capped), min(init_net low_thresh)
       low_thresh - max(init_net high_thresh), min (0)
      
      all other namespaces:
       high_thresh = max(init_net high_thresh), min(namespace's low_thresh)
       low_thresh = max(namespace's high_thresh), min(0)
      
      The major issue with having low_thresh > high_thresh is that we'll
      schedule eviction but never evict anything and thus rely only on the
      timers.
      Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
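The threshold relations listed in commit 1bab4c75 can be illustrated with a small sketch. The struct and function names below are hypothetical and not taken from the kernel, and the sketch clamps out-of-range values for clarity, whereas a sysctl handler may instead reject them.

```c
#include <limits.h>

/* Hypothetical per-namespace fragment memory limits (illustrative names). */
struct frag_limits {
    long high_thresh;
    long low_thresh;
};

/* Clamp v into the closed range [lo, hi]. */
static long clamp_long(long v, long lo, long hi)
{
    if (v < lo)
        return lo;
    if (v > hi)
        return hi;
    return v;
}

/* init_net: high_thresh has no upper cap, but must not drop below low_thresh. */
static void set_init_net_high(struct frag_limits *init, long v)
{
    init->high_thresh = clamp_long(v, init->low_thresh, LONG_MAX);
}

/* Any namespace: low_thresh stays within [0, that namespace's high_thresh]. */
static void set_low(struct frag_limits *ns, long v)
{
    ns->low_thresh = clamp_long(v, 0, ns->high_thresh);
}

/* Other namespaces: high_thresh stays within [own low_thresh, init_net high_thresh]. */
static void set_ns_high(struct frag_limits *ns, const struct frag_limits *init, long v)
{
    ns->high_thresh = clamp_long(v, ns->low_thresh, init->high_thresh);
}
```

With these bounds a namespace can never end up with low_thresh above high_thresh, which is the situation the commit message warns about.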
    • inet: frag: remove periodic secret rebuild timer · e3a57d18
      Committed by Florian Westphal
      Merge its functionality into the eviction workqueue.
      
      Instead of rebuilding every n seconds, take advantage of the upper
      hash chain length limit.
      
      If we hit it, mark table for rebuild and schedule workqueue.
      To prevent frequent rebuilds when we're completely overloaded,
      don't rebuild more than once every 5 seconds.
      
      The ipfrag_secret_interval sysctl is now obsolete and has been marked as
      deprecated. It can still be changed, so scripts won't break, but it
      has no effect. A comment is left above each unused secret_timer
      variable to avoid confusion.
      
      Joint work with Nikolay Aleksandrov.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
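The rebuild policy described in commit e3a57d18 (mark the table when the chain length limit is hit, rebuild from the worker, but no more than once every 5 seconds) can be sketched as follows. All names are hypothetical, and time is modelled as plain seconds rather than kernel jiffies.

```c
#include <stdbool.h>

#define REBUILD_MIN_INTERVAL_SEC 5  /* interval taken from the commit text */

/* Hypothetical table state; field names are illustrative only. */
struct frag_table {
    long last_rebuild;     /* time of the last rebuild, in seconds */
    bool rebuild_pending;  /* set when a hash chain exceeded its length limit */
};

/* Called when a lookup walks a chain longer than the allowed maximum. */
static void chain_limit_hit(struct frag_table *t)
{
    t->rebuild_pending = true;
}

/* Called from the worker; returns true if a rebuild was performed. */
static bool maybe_rebuild(struct frag_table *t, long now)
{
    if (!t->rebuild_pending)
        return false;
    if (now - t->last_rebuild < REBUILD_MIN_INTERVAL_SEC)
        return false;  /* too soon: the request stays pending */
    t->last_rebuild = now;
    t->rebuild_pending = false;
    return true;
}
```

A request that arrives too early is not dropped in this sketch; it is simply served by a later worker run once the interval has passed.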
    • inet: frag: move eviction of queues to work queue · b13d3cbf
      Committed by Florian Westphal
      When the high_thresh limit is reached we try to toss the 'oldest'
      incomplete fragment queues until memory limits are below the low_thresh
      value.  This happens in softirq/packet processing context.
      
      This has two drawbacks:
      
      1) processors might evict a queue that was about to be completed
      by another cpu, because they will compete wrt. resource usage and
      resource reclaim.
      
      2) LRU list maintenance is expensive.
      
      But when constantly overloaded, even the 'least recently used' element is
      recent, so removing 'lru' queue first is not 'fairer' than removing any
      other fragment queue.
      
      This moves eviction out of the fast path:
      
      When the low threshold is reached, a work queue is scheduled
      which then iterates over the table and removes the queues that exceed
      the memory limits of the namespace. It sets a new flag called
      INET_FRAG_EVICTED on the evicted queues so the proper counters will get
      incremented when the queue is forcefully expired.
      
      When the high threshold is reached, no more fragment queues are
      created until we're below the limit again.
      
      The LRU list is now unused and will be removed in a followup patch.
      
      Joint work with Nikolay Aleksandrov.
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
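The two-threshold behaviour described in commit b13d3cbf can be summarised as a decision made when a new fragment queue is requested. This is a minimal sketch with hypothetical names; whether the comparisons are strict or inclusive at the boundary is an assumption here.

```c
/* Possible outcomes when a new fragment queue is requested (illustrative). */
enum frag_action {
    FRAG_ACCEPT,            /* below low_thresh: nothing else to do */
    FRAG_ACCEPT_AND_EVICT,  /* above low_thresh: also schedule the eviction worker */
    FRAG_REFUSE             /* above high_thresh: create no new queue */
};

/* mem is the memory currently used by fragment queues in the namespace. */
static enum frag_action on_new_queue(long mem, long low_thresh, long high_thresh)
{
    if (mem > high_thresh)
        return FRAG_REFUSE;
    if (mem > low_thresh)
        return FRAG_ACCEPT_AND_EVICT;
    return FRAG_ACCEPT;
}
```

The point of the design is that the packet path only makes this cheap comparison, while walking the table and expiring queues happens later in the worker.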
  3. 18 Jul, 2014 1 commit
  4. 16 Jul, 2014 1 commit
    • net-timestamp: document deprecated syststamp · 26c4fdb0
      Committed by Willem de Bruijn
      The SO_TIMESTAMPING API defines option SOF_TIMESTAMPING_SYS_HW.
      This feature is deprecated. It should not be implemented by new
      device drivers. Existing drivers do not implement it, either --
      with one exception.
      
      Driver developers are encouraged to expose the NIC hw clock as a
      PTP HW clock source, instead, and synchronize system time to the
      HW source.
      
      The control flag cannot be removed due to being part of the ABI, nor
      can the structure scm_timestamping that is returned. Due to the one
      legacy driver, the internal datapath and structure are not removed.
      
      This patch only clearly marks the interface as deprecated. Device
      drivers should always return a syststamp value of zero.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      
      ----
      
      We can consider adding a WARN_ON_ONCE in __sock_recv_timestamp
      if a non-zero syststamp is encountered.
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
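For readers of commit 26c4fdb0, the practical consequence on the receive side is that the middle timestamp slot should be ignored. The sketch below uses a local stand-in for the three-element timestamp structure returned with SO_TIMESTAMPING; the stand-in type and the helper name are hypothetical, and the slot meanings (0: software, 1: deprecated hardware-to-system conversion, 2: raw hardware) follow the kernel timestamping documentation.

```c
#include <stddef.h>
#include <time.h>

/* Local stand-in for the structure carried in the control message. */
struct scm_timestamping_sketch {
    struct timespec ts[3];
};

/* Returns non-zero if the timespec carries a value. */
static int ts_is_set(const struct timespec *ts)
{
    return ts->tv_sec != 0 || ts->tv_nsec != 0;
}

/*
 * Prefer the raw hardware timestamp (ts[2]), fall back to the software
 * timestamp (ts[0]), and never use the deprecated slot (ts[1]).
 */
static const struct timespec *pick_timestamp(const struct scm_timestamping_sketch *t)
{
    if (ts_is_set(&t->ts[2]))
        return &t->ts[2];
    if (ts_is_set(&t->ts[0]))
        return &t->ts[0];
    return NULL;
}
```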
  5. 08 Jul, 2014 1 commit
    • ipv6: Implement automatic flow label generation on transmit · cb1ce2ef
      Committed by Tom Herbert
      Automatically generate flow labels for IPv6 packets on transmit.
      The flow label is computed based on skb_get_hash. The flow label will
      only automatically be set when it is zero otherwise (i.e. flow label
      manager hasn't set one). This supports the transmit side functionality
      of RFC 6438.
      
      Added an IPv6 sysctl auto_flowlabels to enable/disable this behavior
      system wide, and added IPV6_AUTOFLOWLABEL socket option to enable this
      functionality per socket.
      
      By default, auto flowlabels are disabled to avoid possible conflicts
      with flow label manager, however if this feature proves useful we
      may want to enable it by default.
      
      It should also be noted that FreeBSD has already implemented automatic
      flow labels (including the sysctl and socket option). In FreeBSD,
      automatic flow labels default to enabled.
      
      Performance impact:
      
      Running super_netperf with 200 flows for TCP_RR and UDP_RR for
      IPv6. Note that in the UDP case, __skb_get_hash will be called for
      every packet, which explains the slight regression. In the TCP case
      the hash is saved in the socket, so there is no regression.
      
      Automatic flow labels disabled:
      
        TCP_RR:
          86.53% CPU utilization
          127/195/322 90/95/99% latencies
          1.40498e+06 tps
      
        UDP_RR:
          90.70% CPU utilization
          118/168/243 90/95/99% latencies
          1.50309e+06 tps
      
      Automatic flow labels enabled:
      
        TCP_RR:
          85.90% CPU utilization
          128/199/337 90/95/99% latencies
          1.40051e+06 tps
      
        UDP_RR:
          92.61% CPU utilization
          115/164/236 90/95/99% latencies
          1.4687e+06 tps
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
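The rule stated in commit cb1ce2ef (derive the label from the packet hash, but only when no label has been set already) can be sketched like this. The function name and the plain 20-bit mask in host byte order are assumptions made for illustration; the kernel works on network-order header fields and may mix the hash differently.

```c
#include <stdint.h>

#define FLOWLABEL_MASK 0x000FFFFFu  /* the IPv6 flow label is 20 bits wide */

/*
 * current_label:     label already chosen (0 means "none set")
 * flow_hash:         per-flow hash, standing in for skb_get_hash()
 * autolabel_enabled: combined effect of the sysctl and socket option
 */
static uint32_t make_flowlabel(uint32_t current_label, uint32_t flow_hash,
                               int autolabel_enabled)
{
    if (current_label != 0 || !autolabel_enabled)
        return current_label;  /* keep whatever the flow label manager set */
    return flow_hash & FLOWLABEL_MASK;
}
```

Because the hash is stable per flow, every packet of a flow carries the same label, which is what lets routers use it for load balancing as described in RFC 6438.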
  6. 02 Jul, 2014 2 commits
    • pktgen: document tuning for max NIC performance · 9ceb87fc
      Committed by Jesper Dangaard Brouer
      Using pktgen I'm seeing the ixgbe driver "push back" because the TX
      ring is running full.  Thus, the TX ring is artificially limiting pktgen.
      (Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
      counters.)
      
      With ixgbe, the real reason the TX ring runs full is that it is
      not being cleaned up fast enough. The ixgbe driver combines
      TX+RX ring cleanups, and the cleanup interval is affected by the
      ethtool --coalesce setting of parameter "rx-usecs".
      
      Do not increase the default NIC TX ring buffer or default cleanup
      interval.  Instead simply document that pktgen needs special NIC
      tuning for maximum packet per sec performance.
      
      Performance results with pktgen with clone_skb=100000.
      TX ring size 512 (default), adjusting "rx-usecs":
       (Single CPU performance, E5-2630, ixgbe)
       - 3935002 pps - rx-usecs:  1 (irqs:  9346)
       - 5132350 pps - rx-usecs: 10 (irqs: 99157)
       - 5375111 pps - rx-usecs: 20 (irqs: 50154)
       - 5454050 pps - rx-usecs: 30 (irqs: 33872)
       - 5496320 pps - rx-usecs: 40 (irqs: 26197)
       - 5502510 pps - rx-usecs: 50 (irqs: 21527)
      
      TX ring size adjusting (ethtool -G), "rx-usecs==1" (default):
       - 3935002 pps - tx-size:  512
       - 5354401 pps - tx-size:  768
       - 5356847 pps - tx-size: 1024
       - 5327595 pps - tx-size: 1536
       - 5356779 pps - tx-size: 2048
       - 5353438 pps - tx-size: 4096
      
      Note that after commit 6f25cd47 (pktgen: fix xmit test for BQL enabled
      devices) pktgen uses netif_xmit_frozen_or_drv_stopped() and ignores
      the BQL "stack" pause (QUEUE_STATE_STACK_XOFF) flag.  This allows us to put
      more pressure on the TX ring buffers.
      
      It is the ixgbe_maybe_stop_tx() call that stops the transmits, and
      pktgen respects this in its call to netif_xmit_frozen_or_drv_stopped(txq).
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: Allow accepting RA from local IP addresses. · d9333196
      Committed by Ben Greear
      This can be used in virtual networking applications, and
      may have other uses as well.  The option is disabled by
      default.
      
      A specific use case is setting up virtual routers, bridges, and
      hosts on a single OS without the use of network namespaces or
      virtual machines.  With proper use of ip rules, routing tables,
      veth interface pairs and/or other virtual interfaces,
      and applications that can bind to interfaces and/or IP addresses,
      it is possible to create one or more virtual routers with multiple
      hosts attached.  The host interfaces can act as IPv6 systems,
      with radvd running on the ports in the virtual routers.  With the
      option provided in this patch enabled, those hosts can now properly
      obtain IPv6 addresses from the radvd.
      Signed-off-by: Ben Greear <greearb@candelatech.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 12 Jun, 2014 2 commits
  8. 11 Jun, 2014 1 commit
    • net: filter: cleanup A/X name usage · e430f34e
      Committed by Alexei Starovoitov
      The macro 'A' used in the internal BPF interpreter:
       #define A regs[insn->a_reg]
      was easily confused with the name of classic BPF register 'A', since
      'A' would mean two different things depending on context.
      
      This patch is trying to clean up the naming and clarify its usage in the
      following way:
      
      - A and X are names of two classic BPF registers
      
      - BPF_REG_A denotes internal BPF register R0 used to map classic register A
        in internal BPF programs generated from classic
      
      - BPF_REG_X denotes internal BPF register R7 used to map classic register X
        in internal BPF programs generated from classic
      
      - internal BPF instruction format:
      struct sock_filter_int {
              __u8    code;           /* opcode */
              __u8    dst_reg:4;      /* dest register */
              __u8    src_reg:4;      /* source register */
              __s16   off;            /* signed offset */
              __s32   imm;            /* signed immediate constant */
      };
      
      - BPF_X/BPF_K is 1 bit used to encode source operand of instruction
      In classic:
        BPF_X - means use register X as source operand
        BPF_K - means use 32-bit immediate as source operand
      In internal:
        BPF_X - means use 'src_reg' register as source operand
        BPF_K - means use 32-bit immediate as source operand
      Suggested-by: Chema Gonzalez <chema@google.com>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Chema Gonzalez <chema@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
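The BPF_X/BPF_K distinction from commit e430f34e can be illustrated by a helper that fetches the source operand of an internal BPF instruction. The struct below is a simplified stand-in (plain bytes instead of 4-bit fields) and the helper name is hypothetical; the BPF_K, BPF_X and BPF_SRC values match the definitions in the kernel's BPF headers. Only the 32-bit view of the operand is modelled.

```c
#include <stdint.h>

#define BPF_K 0x00                        /* source is the 32-bit immediate */
#define BPF_X 0x08                        /* source is a register */
#define BPF_SRC(code) ((code) & 0x08)     /* the single source-select bit */

/* Simplified stand-in for the internal instruction format. */
struct insn_sketch {
    uint8_t code;     /* opcode */
    uint8_t dst_reg;  /* dest register (4 bits in the real layout) */
    uint8_t src_reg;  /* source register (4 bits in the real layout) */
    int16_t off;      /* signed offset */
    int32_t imm;      /* signed immediate constant */
};

/* Returns the 32-bit source operand selected by the BPF_X/BPF_K bit. */
static uint32_t source_operand(const struct insn_sketch *insn, const uint64_t *regs)
{
    if (BPF_SRC(insn->code) == BPF_X)
        return (uint32_t)regs[insn->src_reg];
    return (uint32_t)insn->imm;
}
```

In classic BPF the same bit would select the fixed register X instead of an arbitrary src_reg, which is exactly the difference the commit message spells out.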
  9. 24 May, 2014 1 commit
  10. 23 May, 2014 1 commit
  11. 19 May, 2014 1 commit
  12. 14 May, 2014 1 commit
  13. 05 May, 2014 2 commits
  14. 25 Apr, 2014 2 commits
    • bonding: Add tlb_dynamic_lb parameter for tlb mode · e9f0fb88
      Committed by Mahesh Bandewar
      The aggressive load balancing causes packet re-ordering as active
      flows are moved from one slave to another within the group. Sometimes
      this aggressive lb is not necessary if the preference is for less
      re-ordering. This parameter, if used with value "0", disables
      this dynamic flow shuffling, minimizing packet re-ordering. Of course
      the side effect is that it has to live with the static load balancing
      that the hashing distribution provides. This impact is less severe if
      the correct xmit-hashing-policy is used for the tlb setup.
      
      The default value of the parameter is set to "1", mimicking the earlier
      behavior.
      
      Ran the netperf test with 200 streams for 1 min between two hosts with
      4x1G trunk (xmit-lb mode with xmit-policy L3+4) before and after these
      changes. The following command was used for those 200 instances:
      
          netperf -t TCP_RR -l 60 -s 5 -H <host> -- -r81920,81920
      
      Transactions per second:
          Before change: 1,367.11
          After  change: 1,470.65
      
      Change-Id: Ie3f75c77282cf602e83a6e833c6eb164e72a0990
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
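The trade-off described in commit e9f0fb88 can be sketched as a slave selection function. This is a deliberately simplified model with hypothetical names: the dynamic case is reduced to "use the currently least loaded slave", which glosses over how the bonding driver actually rebalances flows.

```c
#include <stdint.h>

/*
 * flow_hash:      hash of the flow, per the configured xmit hash policy
 * slave_count:    number of usable slaves in the bond
 * tlb_dynamic_lb: 1 = dynamic shuffling (default), 0 = static hashing
 * least_loaded:   index of the slave the dynamic balancer would pick now
 */
static unsigned pick_slave(uint32_t flow_hash, unsigned slave_count,
                           int tlb_dynamic_lb, unsigned least_loaded)
{
    if (slave_count == 0)
        return 0;  /* nothing to choose from; the caller must handle this */
    if (tlb_dynamic_lb)
        return least_loaded;  /* may change over time, so a flow can move */
    return flow_hash % slave_count;  /* stable per flow: no re-ordering */
}
```

With tlb_dynamic_lb set to 0 a given flow always maps to the same slave as long as the slave count is unchanged, at the cost of depending on how evenly the hash spreads the flows.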
    • bonding: Added bond_tlb_xmit() for tlb mode. · f05b42ea
      Committed by Mahesh Bandewar
      Re-organized the xmit function for the lb mode separating tlb xmit
      from the alb mode. This will enable use of the hashing policies
      like 802.3ad mode. Also extended use of xmit-hash-policy to tlb mode.
      
      Now the tlb-mode defaults to BOND_XMIT_POLICY_LAYER2 if the xmit policy
      module parameter is not set (just like 802.3ad, or Xor mode).
      
      Change-Id: I140257403d272df75f477b380207338d0f04963e
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
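As a rough illustration of the layer2 policy that commit f05b42ea makes the tlb default, the bonding documentation describes it as an XOR of the source and destination MAC addresses modulo the slave count. The sketch below uses only the last byte of each address, which is a simplification assumed here, and the function name is hypothetical.

```c
#include <stdint.h>

/* Simplified layer2 transmit hash: XOR of the last MAC bytes, modulo slaves. */
static unsigned layer2_slave(const uint8_t src_mac[6], const uint8_t dst_mac[6],
                             unsigned slave_count)
{
    if (slave_count == 0)
        return 0;  /* no slaves: the caller must handle this case */
    return (unsigned)(src_mac[5] ^ dst_mac[5]) % slave_count;
}
```

All traffic between the same pair of MAC addresses lands on one slave under this policy, which is why a layer3+4 policy usually spreads load better when many flows share a peer.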
  15. 24 Apr, 2014 1 commit
  16. 23 Apr, 2014 1 commit
  17. 01 Apr, 2014 1 commit
  18. 31 Mar, 2014 1 commit
  19. 28 Mar, 2014 1 commit
  20. 21 Mar, 2014 2 commits
  21. 19 Mar, 2014 2 commits
  22. 18 Mar, 2014 1 commit
  23. 13 Mar, 2014 1 commit
  24. 07 Mar, 2014 1 commit
  25. 03 Mar, 2014 1 commit
    • can: remove CAN FD compatibility for CAN 2.0 sockets · 821047c4
      Committed by Oliver Hartkopp
      In commit e2d265d3 (canfd: add support for CAN FD in CAN_RAW sockets)
      CAN FD frames with a payload length up to 8 bytes are passed to legacy
      sockets where the CAN FD support was not enabled by the application.
      
      After some discussions with developers at a fair, this well-meant feature
      leads to confusion, as no clean switch for CAN / CAN FD is provided to the
      application programmer. Additionally, a compatibility like this for legacy
      CAN_RAW sockets requires some compatibility handling for the sending, e.g.
      make CAN2.0 frames a CAN FD frame with BRS at transmission time (?!?).
      
      This will become a mess when people start to develop applications with
      real CAN FD hardware. This patch reverts the bad compatibility code
      together with the documentation describing the removed feature.
      Acked-by: Stephane Grosjean <s.grosjean@peak-system.com>
      Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
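With the compatibility path removed by commit 821047c4, an application sees CAN FD frames only after it has explicitly enabled them on its CAN_RAW socket (the CAN_RAW_FD_FRAMES socket option). A common way to tell the two frame types apart is the size returned by read(). The sketch below assumes the usual sizes of 16 bytes for a classic frame and 72 bytes for an FD frame; the constant and function names are hypothetical stand-ins for the values in the kernel's CAN headers.

```c
#include <stddef.h>

#define CAN_MTU_SKETCH   16  /* assumed size of a classic CAN frame struct */
#define CANFD_MTU_SKETCH 72  /* assumed size of a CAN FD frame struct */

enum frame_kind { FRAME_INVALID, FRAME_CAN, FRAME_CANFD };

/*
 * nbytes:            number of bytes returned by read() on the raw socket
 * fd_frames_enabled: whether the application switched on CAN FD support
 */
static enum frame_kind classify_frame(size_t nbytes, int fd_frames_enabled)
{
    if (nbytes == CAN_MTU_SKETCH)
        return FRAME_CAN;
    if (nbytes == CANFD_MTU_SKETCH && fd_frames_enabled)
        return FRAME_CANFD;
    return FRAME_INVALID;  /* short read or a frame type we did not ask for */
}
```

A legacy application that never enables FD support therefore only ever has to deal with the classic frame size, which is the clean switch the commit message asks for.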
  26. 27 Feb, 2014 2 commits
  27. 25 Feb, 2014 1 commit
  28. 19 Feb, 2014 3 commits
  29. 18 Feb, 2014 1 commit