1. 28 1月, 2014 2 次提交
  2. 23 1月, 2014 1 次提交
  3. 20 1月, 2014 2 次提交
  4. 14 1月, 2014 3 次提交
    • N
      packet: doc: describe PACKET_MMAP with one packet socket for rx and tx · 7e11daa7
      Norbert van Bolhuis 提交于
      Document how to use one AF_PACKET mmap socket for RX and TX.
      Signed-off-by: NNorbert van Bolhuis <nvbolhuis@aimvalley.nl>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7e11daa7
    • H
      ipv4: introduce hardened ip_no_pmtu_disc mode · 8ed1dc44
      Hannes Frederic Sowa 提交于
      This new ip_no_pmtu_disc mode only allowes fragmentation-needed errors
      to be honored by protocols which do more stringent validation on the
      ICMP's packet payload. This knob is useful for people who e.g. want to
      run an unmodified DNS server in a namespace where they need to use pmtu
      for TCP connections (as they are used for zone transfers or fallback
      for requests) but don't want to use possibly spoofed UDP pmtu information.
      
      Currently the whitelisted protocols are TCP, SCTP and DCCP as they check
      if the returned packet is in the window or if the association is valid.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Suggested-by: NFlorian Weimer <fweimer@redhat.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8ed1dc44
    • H
      ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing · f87c10a8
      Hannes Frederic Sowa 提交于
      While forwarding we should not use the protocol path mtu to calculate
      the mtu for a forwarded packet but instead use the interface mtu.
      
      We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
      introduced for multicast forwarding. But as it does not conflict with
      our usage in unicast code path it is perfect for reuse.
      
      I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
      along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
      dependencies because of IPSKB_FORWARDED.
      
      Because someone might have written a software which does probe
      destinations manually and expects the kernel to honour those path mtus
      I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone
      can disable this new behaviour. We also still use mtus which are locked on a
      route for forwarding.
      
      The reason for this change is, that path mtus information can be injected
      into the kernel via e.g. icmp_err protocol handler without verification
      of local sockets. As such, this could cause the IPv4 forwarding path to
      wrongfully emit fragmentation needed notifications or start to fragment
      packets along a path.
      
      Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
      won't be set and further fragmentation logic will use the path mtu to
      determine the fragmentation size. They also recheck packet size with
      help of path mtu discovery and report appropriate errors.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f87c10a8
  5. 12 1月, 2014 1 次提交
  6. 08 1月, 2014 1 次提交
  7. 03 1月, 2014 1 次提交
  8. 01 1月, 2014 1 次提交
  9. 31 12月, 2013 1 次提交
  10. 21 12月, 2013 1 次提交
  11. 19 12月, 2013 1 次提交
  12. 18 12月, 2013 1 次提交
  13. 17 12月, 2013 1 次提交
  14. 16 12月, 2013 1 次提交
  15. 12 12月, 2013 1 次提交
  16. 10 12月, 2013 3 次提交
    • F
      net: phy: consolidate PHY reset in phy_init_hw() · 87aa9f9c
      Florian Fainelli 提交于
      There are quite a lot of drivers touching a PHY device MII_BMCR
      register to reset the PHY without taking care of:
      
      1) ensuring that BMCR_RESET is cleared after a given timeout
      2) the PHY state machine resuming to the proper state and re-applying
      potentially changed settings such as auto-negotiation
      
      Introduce phy_poll_reset() which will take care of polling the MII_BMCR
      for the BMCR_RESET bit to be cleared after a given timeout or return a
      timeout error code.
      
      In order to make sure the PHY is in a correct state, phy_init_hw() first
      issues a software reset through MII_BMCR and then applies any fixups.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      87aa9f9c
    • D
      packet: introduce PACKET_QDISC_BYPASS socket option · d346a3fa
      Daniel Borkmann 提交于
      This patch introduces a PACKET_QDISC_BYPASS socket option, that
      allows for using a similar xmit() function as in pktgen instead
      of taking the dev_queue_xmit() path. This can be very useful when
      PF_PACKET applications are required to be used in a similar
      scenario as pktgen, but with full, flexible packet payload that
      needs to be provided, for example.
      
      On default, nothing changes in behaviour for normal PF_PACKET
      TX users, so everything stays as is for applications. New users,
      however, can now set PACKET_QDISC_BYPASS if needed to prevent
      own packets from i) reentering packet_rcv() and ii) to directly
      push the frame to the driver.
      
      In doing so we can increase pps (here 64 byte packets) for
      PF_PACKET a bit:
      
        # CPUs -- QDISC_BYPASS   -- qdisc path -- qdisc path[**]
        1 CPU  ==  1,509,628 pps --  1,208,708 --  1,247,436
        2 CPUs ==  3,198,659 pps --  2,536,012 --  1,605,779
        3 CPUs ==  4,787,992 pps --  3,788,740 --  1,735,610
        4 CPUs ==  6,173,956 pps --  4,907,799 --  1,909,114
        5 CPUs ==  7,495,676 pps --  5,956,499 --  2,014,422
        6 CPUs ==  9,001,496 pps --  7,145,064 --  2,155,261
        7 CPUs == 10,229,776 pps --  8,190,596 --  2,220,619
        8 CPUs == 11,040,732 pps --  9,188,544 --  2,241,879
        9 CPUs == 12,009,076 pps -- 10,275,936 --  2,068,447
       10 CPUs == 11,380,052 pps -- 11,265,337 --  1,578,689
       11 CPUs == 11,672,676 pps -- 11,845,344 --  1,297,412
       [...]
       20 CPUs == 11,363,192 pps -- 11,014,933 --  1,245,081
      
       [**]: qdisc path with packet_rcv(), how probably most people
             seem to use it (hopefully not anymore if not needed)
      
      The test was done using a modified trafgen, sending a simple
      static 64 bytes packet, on all CPUs.  The trick in the fast
      "qdisc path" case, is to avoid reentering packet_rcv() by
      setting the RAW socket protocol to zero, like:
      socket(PF_PACKET, SOCK_RAW, 0);
      
      Tradeoffs are documented as well in this patch, clearly, if
      queues are busy, we will drop more packets, tc disciplines are
      ignored, and these packets are not visible to taps anymore. For
      a pktgen like scenario, we argue that this is acceptable.
      
      The pointer to the xmit function has been placed in packet
      socket structure hole between cached_dev and prot_hook that
      is hot anyway as we're working on cached_dev in each send path.
      
      Done in joint work together with Jesper Dangaard Brouer.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d346a3fa
    • D
      packet: fix send path when running with proto == 0 · 66e56cd4
      Daniel Borkmann 提交于
      Commit e40526cb introduced a cached dev pointer, that gets
      hooked into register_prot_hook(), __unregister_prot_hook() to
      update the device used for the send path.
      
      We need to fix this up, as otherwise this will not work with
      sockets created with protocol = 0, plus with sll_protocol = 0
      passed via sockaddr_ll when doing the bind.
      
      So instead, assign the pointer directly. The compiler can inline
      these helper functions automagically.
      
      While at it, also assume the cached dev fast-path as likely(),
      and document this variant of socket creation as it seems it is
      not widely used (seems not even the author of TX_RING was aware
      of that in his reference example [1]). Tested with reproducer
      from e40526cb.
      
       [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap#Example
      
      Fixes: e40526cb ("packet: fix use after free race in send path when dev is released")
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Tested-by: NSalam Noureddine <noureddine@aristanetworks.com>
      Tested-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      66e56cd4
  17. 07 12月, 2013 1 次提交
    • E
      tcp: auto corking · f54b3111
      Eric Dumazet 提交于
      With the introduction of TCP Small Queues, TSO auto sizing, and TCP
      pacing, we can implement Automatic Corking in the kernel, to help
      applications doing small write()/sendmsg() to TCP sockets.
      
      Idea is to change tcp_push() to check if the current skb payload is
      under skb optimal size (a multiple of MSS bytes)
      
      If under 'size_goal', and at least one packet is still in Qdisc or
      NIC TX queues, set the TCP Small Queue Throttled bit, so that the push
      will be delayed up to TX completion time.
      
      This delay might allow the application to coalesce more bytes
      in the skb in following write()/sendmsg()/sendfile() system calls.
      
      The exact duration of the delay is depending on the dynamics
      of the system, and might be zero if no packet for this flow
      is actually held in Qdisc or NIC TX ring.
      
      Using FQ/pacing is a way to increase the probability of
      autocorking being triggered.
      
      Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control
      this feature and default it to 1 (enabled)
      
      Add a new SNMP counter : nstat -a | grep TcpExtTCPAutoCorking
      This counter is incremented every time we detected skb was under used
      and its flush was deferred.
      
      Tested:
      
      Interesting effects when using line buffered commands under ssh.
      
      Excellent performance results in term of cpu usage and total throughput.
      
      lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking
      lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
      9410.39
      
       Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
      
            35209.439626 task-clock                #    2.901 CPUs utilized
                   2,294 context-switches          #    0.065 K/sec
                     101 CPU-migrations            #    0.003 K/sec
                   4,079 page-faults               #    0.116 K/sec
          97,923,241,298 cycles                    #    2.781 GHz                     [83.31%]
          51,832,908,236 stalled-cycles-frontend   #   52.93% frontend cycles idle    [83.30%]
          25,697,986,603 stalled-cycles-backend    #   26.24% backend  cycles idle    [66.70%]
         102,225,978,536 instructions              #    1.04  insns per cycle
                                                   #    0.51  stalled cycles per insn [83.38%]
          18,657,696,819 branches                  #  529.906 M/sec                   [83.29%]
              91,679,646 branch-misses             #    0.49% of all branches         [83.40%]
      
            12.136204899 seconds time elapsed
      
      lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking
      lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
      6624.89
      
       Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
            40045.864494 task-clock                #    3.301 CPUs utilized
                     171 context-switches          #    0.004 K/sec
                      53 CPU-migrations            #    0.001 K/sec
                   4,080 page-faults               #    0.102 K/sec
         111,340,458,645 cycles                    #    2.780 GHz                     [83.34%]
          61,778,039,277 stalled-cycles-frontend   #   55.49% frontend cycles idle    [83.31%]
          29,295,522,759 stalled-cycles-backend    #   26.31% backend  cycles idle    [66.67%]
         108,654,349,355 instructions              #    0.98  insns per cycle
                                                   #    0.57  stalled cycles per insn [83.34%]
          19,552,170,748 branches                  #  488.244 M/sec                   [83.34%]
             157,875,417 branch-misses             #    0.81% of all branches         [83.34%]
      
            12.130267788 seconds time elapsed
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f54b3111
  18. 26 11月, 2013 1 次提交
    • L
      cfg80211: consolidate passive-scan and no-ibss flags · 8fe02e16
      Luis R. Rodriguez 提交于
      These two flags are used for the same purpose, just
      combine them into a no-ir flag to annotate no initiating
      radiation is allowed.
      
      Old userspace sending either flag will have it treated as
      the no-ir flag. To be considerate to older userspace we
      also send both the no-ir flag and the old no-ibss flags.
      Newer userspace will have to be aware of older kernels.
      
      Update all places in the tree using these flags with the
      following semantic patch:
      
      @@
      @@
      -NL80211_RRF_PASSIVE_SCAN
      +NL80211_RRF_NO_IR
      @@
      @@
      -NL80211_RRF_NO_IBSS
      +NL80211_RRF_NO_IR
      @@
      @@
      -IEEE80211_CHAN_PASSIVE_SCAN
      +IEEE80211_CHAN_NO_IR
      @@
      @@
      -IEEE80211_CHAN_NO_IBSS
      +IEEE80211_CHAN_NO_IR
      @@
      @@
      -NL80211_RRF_NO_IR | NL80211_RRF_NO_IR
      +NL80211_RRF_NO_IR
      @@
      @@
      -IEEE80211_CHAN_NO_IR | IEEE80211_CHAN_NO_IR
      +IEEE80211_CHAN_NO_IR
      @@
      @@
      -(NL80211_RRF_NO_IR)
      +NL80211_RRF_NO_IR
      @@
      @@
      -(IEEE80211_CHAN_NO_IR)
      +IEEE80211_CHAN_NO_IR
      
      Along with some hand-optimisations in documentation, to
      remove duplicates and to fix some indentation.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@do-not-panic.com>
      [do all the driver updates in one go]
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      8fe02e16
  19. 23 11月, 2013 1 次提交
  20. 20 11月, 2013 1 次提交
  21. 15 11月, 2013 1 次提交
    • E
      tcp: tsq: restore minimal amount of queueing · 98e09386
      Eric Dumazet 提交于
      After commit c9eeec26 ("tcp: TSQ can use a dynamic limit"), several
      users reported throughput regressions, notably on mvneta and wifi
      adapters.
      
      802.11 AMPDU requires a fair amount of queueing to be effective.
      
      This patch partially reverts the change done in tcp_write_xmit()
      so that the minimal amount is sysctl_tcp_limit_output_bytes.
      
      It also remove the use of this sysctl while building skb stored
      in write queue, as TSO autosizing does the right thing anyway.
      
      Users with well behaving NICS and correct qdisc (like sch_fq),
      can then lower the default sysctl_tcp_limit_output_bytes value from
      128KB to 8KB.
      
      This new usage of sysctl_tcp_limit_output_bytes permits each driver
      authors to check how their driver performs when/if the value is set
      to a minimum of 4KB.
      
      Normally, line rate for a single TCP flow should be possible,
      but some drivers rely on timers to perform TX completion and
      too long TX completion delays prevent reaching full throughput.
      
      Fixes: c9eeec26 ("tcp: TSQ can use a dynamic limit")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NSujith Manoharan <sujith@msujith.org>
      Reported-by: NArnaud Ebalard <arno@natisbad.org>
      Tested-by: NSujith Manoharan <sujith@msujith.org>
      Cc: Felix Fietkau <nbd@openwrt.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      98e09386
  22. 08 11月, 2013 1 次提交
  23. 05 11月, 2013 2 次提交
    • Y
      tcp: properly handle stretch acks in slow start · 9f9843a7
      Yuchung Cheng 提交于
      Slow start now increases cwnd by 1 if an ACK acknowledges some packets,
      regardless the number of packets. Consequently slow start performance
      is highly dependent on the degree of the stretch ACKs caused by
      receiver or network ACK compression mechanisms (e.g., delayed-ACK,
      GRO, etc).  But slow start algorithm is to send twice the amount of
      packets of packets left so it should process a stretch ACK of degree
      N as if N ACKs of degree 1, then exits when cwnd exceeds ssthresh. A
      follow up patch will use the remainder of the N (if greater than 1)
      to adjust cwnd in the congestion avoidance phase.
      
      In addition this patch retires the experimental limited slow start
      (LSS) feature. LSS has multiple drawbacks but questionable benefit. The
      fractional cwnd increase in LSS requires a loop in slow start even
      though it's rarely used. Configuring such an increase step via a global
      sysctl on different BDPS seems hard. Finally and most importantly the
      slow start overshoot concern is now better covered by the Hybrid slow
      start (hystart) enabled by default.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9f9843a7
    • Y
      tcp: enable sockets to use MSG_FASTOPEN by default · 0d41cca4
      Yuchung Cheng 提交于
      Applications have started to use Fast Open (e.g., Chrome browser has
      such an optional flag) and the feature has gone through several
      generations of kernels since 3.7 with many real network tests. It's
      time to enable this flag by default for applications to test more
      conveniently and extensively.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0d41cca4
  24. 04 11月, 2013 1 次提交
  25. 01 11月, 2013 1 次提交
  26. 31 10月, 2013 1 次提交
  27. 28 10月, 2013 1 次提交
  28. 19 10月, 2013 2 次提交
  29. 10 10月, 2013 1 次提交
  30. 04 10月, 2013 1 次提交
  31. 16 9月, 2013 1 次提交
  32. 11 9月, 2013 1 次提交