1. 10 Jun, 2015 1 commit
  2. 09 Jun, 2015 2 commits
  3. 04 Jun, 2015 5 commits
  4. 31 May, 2015 3 commits
  5. 30 May, 2015 1 commit
  6. 28 May, 2015 1 commit
    • E
      tcp/dccp: try to not exhaust ip_local_port_range in connect() · 07f4c900
      Eric Dumazet committed
      A long standing problem on busy servers is the tiny available TCP port
      range (/proc/sys/net/ipv4/ip_local_port_range) and the default
      sequential allocation of source ports in connect() system call.
      
      If a host is having a lot of active TCP sessions, chances are
      very high that all ports are in use by at least one flow,
      and subsequent bind(0) attempts fail, or have to scan a big portion of
      space to find a slot.
      
      In this patch, I changed the starting point in __inet_hash_connect()
      so that we try to favor even [1] ports, leaving odd ports for bind()
      users.
      
      We still perform a sequential search, so there is no guarantee, but
      if connect() targets are very different, end result is we leave
      more ports available to bind(), and we spread them all over the range,
      lowering time for both connect() and bind() to find a slot.
      
      This strategy only works well if the /proc/sys/net/ipv4/ip_local_port_range
      span is even, i.e. if the start/end values have different parity.
      
      Therefore, default /proc/sys/net/ipv4/ip_local_port_range was changed to
      32768 - 60999 (instead of 32768 - 61000)
      
      There is no change on security aspects here, only some poor hashing
      schemes could be eventually impacted by this change.
      
      [1] : The odd/even property depends on ip_local_port_range values parity
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
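The parity idea in the commit above can be sketched in a few lines. This is a simplified model, not the kernel's `__inet_hash_connect()`: the function name `connect_port_candidates` and the `start_offset` parameter are hypothetical, and the only behavior taken from the commit message is "start on a port with the same parity as the low end of the range, then search sequentially".

```python
def connect_port_candidates(low, high, start_offset):
    """Yield candidate source ports for connect() in search order.

    Simplified model (assumption): the starting offset is rounded down to
    an even value so the first candidate shares the parity of `low`; the
    rest of the range is then scanned sequentially, wrapping around.
    """
    span = high - low + 1
    offset = (start_offset % span) & ~1  # force an even starting offset
    for i in range(span):
        yield low + (offset + i) % span


# With the new default range the span is even, so start/end parity differs.
candidates = list(connect_port_candidates(32768, 60999, 12345))
print(candidates[0], len(candidates))
```

With the old default of 32768 - 61000 the span is odd, so both ends of the range share a parity and the even/odd split between connect() and bind() is no longer balanced.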
  7. 27 May, 2015 1 commit
  8. 23 May, 2015 6 commits
  9. 20 May, 2015 1 commit
    • D
      tcp: add rfc3168, section 6.1.1.1. fallback · 49213555
      Daniel Borkmann committed
      This work is a follow-up of commit f7b3bec6 ("net: allow setting ecn
      via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
      ECN connections. In other words, this work adds a retry with a non-ECN
      setup SYN packet, as suggested by the RFC, on the first timeout:
      
        [...] A host that receives no reply to an ECN-setup SYN within the
        normal SYN retransmission timeout interval MAY resend the SYN and
        any subsequent SYN retransmissions with CWR and ECE cleared. [...]
      
      Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
      that is, Linux default since 2009 via commit 255cac91 ("tcp: extend
      ECN sysctl to allow server-side only ECN"):
      
       1) Normal ECN-capable path:
      
          SYN ECE CWR ----->
                      <----- SYN ACK ECE
                  ACK ----->
      
       2) Path with broken middlebox, when client has fallback:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
                  SYN ----->
                      <----- SYN ACK
                  ACK ----->
      
      In case we would not have the fallback implemented, the middlebox drop
      point would basically end up as:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
      
      In any case, only a small percentage of sites would see such additional
      setup latency: at the end of 2014, ~56% of IPv4 and 65% of IPv6 servers
      on the Alexa 1 million list were found to negotiate ECN (aka tcp_ecn=2
      default), while 0.42% of these webservers fail to connect when the client
      tries to negotiate ECN (tcp_ecn=1) due to timeouts, which the fallback
      would mitigate with a slight latency trade-off. Recent related paper on
      this topic:
      
        Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
        Gorry Fairhurst, and Richard Scheffenegger:
          "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
          Proc. PAM 2015, New York.
        http://ecn.ethz.ch/ecn-pam15.pdf
      
      Thus, when net.ipv4.tcp_ecn=1 is set, the patch performs the RFC3168,
      section 6.1.1.1. fallback on timeout. For users who explicitly do not want
      this, which can be the case in DC (data center) deployments, we add a
      net.ipv4.tcp_ecn_fallback knob that allows disabling the fallback.
      
      tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
      rather we let tcp_ecn_rcv_synack() take that over on input path in case a
      SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
      ECN being negotiated eventually in that case.
      
      Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
      Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
      Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Dave Taht <dave.taht@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
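The client-side behavior described in the commit above can be modeled as a small decision function. This is a sketch under stated assumptions, not kernel code: `syn_flags` and its `attempt` parameter are hypothetical names, and the fallback is assumed to be enabled unless `tcp_ecn_fallback` is set to 0.

```python
def syn_flags(attempt, tcp_ecn=1, tcp_ecn_fallback=1):
    """Return the TCP flags carried by the SYN sent on a given attempt.

    attempt 0 is the initial SYN; attempt >= 1 is a retransmission after a
    timeout. Only tcp_ecn=1 requests ECN on outgoing connections here.
    """
    flags = {"SYN"}
    requests_ecn = tcp_ecn == 1
    is_retransmit = attempt > 0
    # RFC 3168, section 6.1.1.1: retransmitted SYNs may clear ECE and CWR.
    if requests_ecn and not (is_retransmit and tcp_ecn_fallback):
        flags |= {"ECE", "CWR"}
    return flags


for attempt in range(3):
    print(attempt, sorted(syn_flags(attempt)))
```

With `tcp_ecn_fallback=0` every retransmission keeps ECE and CWR set, which corresponds to the third diagram in the commit message, where the middlebox drops every SYN.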
  10. 19 May, 2015 1 commit
  11. 18 May, 2015 1 commit
  12. 16 May, 2015 4 commits
  13. 15 May, 2015 1 commit
    • F
      of: mdio: Add a "broken-turn-around" property · ab6016e0
      Florian Fainelli committed
      Some Ethernet PHY devices/switches may not properly release the MDIO bus
      during turn-around time, and fail to drive it low, which can be seen by
      some controllers as a read failure, while the data clocked in is still
      correct.
      
      Add a boolean property "broken-turn-around" which is parsed by the
      generic MDIO bus probing code and will set the corresponding bit in the
      MDIO bus phy_ignore_ta_mask bitmask for MDIO bus drivers to utilize that
      information.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
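As a rough illustration of the bitmask described above, the sketch below builds a mask with one bit per MDIO address whose node carries the "broken-turn-around" property. The function name and the dictionary representation of device-tree children are assumptions made for this example; the real parsing lives in the generic MDIO bus probing code.

```python
def build_ignore_ta_mask(children):
    """Build a phy_ignore_ta_mask-style bitmask.

    `children` maps an MDIO address (0..31) to the set of boolean
    properties found on that child node (hypothetical representation).
    """
    mask = 0
    for addr, props in children.items():
        if not 0 <= addr < 32:
            raise ValueError(f"invalid MDIO address: {addr}")
        if "broken-turn-around" in props:
            mask |= 1 << addr  # bit N set: ignore turn-around errors on address N
    return mask


children = {1: set(), 5: {"broken-turn-around"}}
print(hex(build_ignore_ta_mask(children)))
```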
  14. 14 May, 2015 2 commits
  15. 13 May, 2015 1 commit
  16. 11 May, 2015 5 commits
    • X
      KVM: MMU: fix SMAP virtualization · 0be0226f
      Xiao Guangrong committed
      KVM may turn a user page into a kernel page when the kernel writes a
      read-only user page if CR0.WP = 0. This shadow page entry will be reused
      after SMAP is enabled, so that the kernel is allowed to access this user
      page.
      
      Fix it by setting SMAP && !CR0.WP into the shadow page's role and
      resetting the MMU once CR4.SMAP is updated.
      Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • M
      bonding: Implement user key part of port_key in an AD system. · d22a5fc0
      Mahesh Bandewar committed
      The port key has three components - user-key, speed-part, and duplex-part.
      The LSBit is for the duplex-part, next 5 bits are for the speed while the
      remaining 10 bits are the user defined key bits. Get these 10 bits
      from the user-space (through the SysFs interface) and use it to form the
      admin port-key. Allowed range for the user-key is 0 - 1023 (10 bits). If
      it is not provided then use zero for the user-key-bits (default).
      
      It can be set using the following example code -
      
         # modprobe bonding mode=4
         # usr_port_key=$(( RANDOM & 0x3FF ))
         # echo $usr_port_key > /sys/class/net/bond0/bonding/ad_user_port_key
         # echo +eth1 > /sys/class/net/bond0/bonding/slaves
         ...
         # ip link set bond0 up
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
      [jt: * fixed up style issues reported by checkpatch
           * fixed up context from change in ad_actor_sys_prio patch]
      Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
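The 16-bit port key layout described above (1 duplex bit, 5 speed bits, 10 user-key bits) can be illustrated with a small packing helper. The function name is hypothetical and the speed field is treated as an opaque 5-bit value, since the commit message does not spell out its encoding.

```python
def admin_port_key(user_key, speed_bits, full_duplex):
    """Pack the three port-key components into one 16-bit value.

    Layout from the commit message: bit 0 is the duplex part, bits 1-5 are
    the speed part, bits 6-15 are the user-defined key.
    """
    if not 0 <= user_key <= 1023:
        raise ValueError("user key must fit in 10 bits (0 - 1023)")
    if not 0 <= speed_bits <= 0x1F:
        raise ValueError("speed part must fit in 5 bits")
    return (user_key << 6) | (speed_bits << 1) | int(bool(full_duplex))


print(hex(admin_port_key(user_key=1, speed_bits=0, full_duplex=False)))
```

The `RANDOM & 0x3FF` mask in the shell example keeps the user key inside the same 10-bit range that this helper checks.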
    • M
      bonding: Allow userspace to set actors' macaddr in an AD-system. · 74514957
      Mahesh Bandewar committed
      In an AD system, the communication between actor and partner is the
      business of these two entities. In the current setup anyone on the
      same L2 can "guess" the LACPDU contents and then possibly send
      spoofed LACPDUs and trick the partner, causing connectivity issues for
      the AD system. This patch allows the use of a random mac-address,
      obscuring its identity and making it harder for someone on the same L2
      to do the same thing.
      
      This patch allows user-space to choose the mac-address for the AD-system.
      This mac-address can not be NULL or a multicast address. If the
      mac-address is set from user-space, the kernel will honor it and will not
      overwrite it. In the absence of a value from user-space, the logic will
      default to using the master's mac as the mac-address for the AD-system.
      
      It can be set using example code below -
      
         # modprobe bonding mode=4
         # sys_mac_addr=$(printf '%02x:%02x:%02x:%02x:%02x:%02x' \
                          $(( (RANDOM & 0xFE) | 0x02 )) \
                          $(( RANDOM & 0xFF )) \
                          $(( RANDOM & 0xFF )) \
                          $(( RANDOM & 0xFF )) \
                          $(( RANDOM & 0xFF )) \
                          $(( RANDOM & 0xFF )))
         # echo $sys_mac_addr > /sys/class/net/bond0/bonding/ad_actor_system
         # echo +eth1 > /sys/class/net/bond0/bonding/slaves
         ...
         # ip link set bond0 up
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
      [jt: fixed up style issues reported by checkpatch]
      Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
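The two constraints stated above (the address must not be all-zero and must not be multicast) can be checked as in the following sketch. Both function names are hypothetical; the generator mirrors the shell example by clearing the multicast bit and setting the locally administered bit of the first octet.

```python
import random


def random_actor_mac(rng=random):
    """Generate a random unicast, locally administered MAC address."""
    octets = [rng.randrange(256) for _ in range(6)]
    octets[0] = (octets[0] & 0xFE) | 0x02  # clear multicast bit, set local bit
    return ":".join(f"{o:02x}" for o in octets)


def is_valid_actor_mac(mac):
    """Return True if `mac` is neither all-zero nor a multicast address."""
    parts = mac.split(":")
    if len(parts) != 6:
        return False
    try:
        octets = [int(p, 16) for p in parts]
    except ValueError:
        return False
    if any(not 0 <= o <= 0xFF for o in octets):
        return False
    if not any(octets):          # all-zero ("NULL") address
        return False
    if octets[0] & 0x01:         # multicast bit set
        return False
    return True


mac = random_actor_mac()
print(mac, is_valid_actor_mac(mac))
```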
    • M
      bonding: Allow userspace to set actors' system_priority in AD system · 6791e466
      Mahesh Bandewar committed
      This patch allows user to randomize the system-priority in an ad-system.
      The allowed range is 1 - 0xFFFF while default value is 0xFFFF. If user
      does not specify this value, the system defaults to 0xFFFF, which is
      what it was before this patch.
      
      Following example code could set the value -
          # modprobe bonding mode=4
          # sys_prio=$(( 1 + RANDOM + RANDOM ))
          # echo $sys_prio > /sys/class/net/bond0/bonding/ad_actor_sys_prio
          # echo +eth1 > /sys/class/net/bond0/bonding/slaves
          ...
          # ip link set bond0 up
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@redhat.com>
      [jt: * fixed up style issues reported by checkpatch
           * changed how the default value is set in bond_check_params(), this
             makes the default consistent between what gets set for a new bond
             and what the default is claimed to be in the bonding options.]
      Signed-off-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
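A quick check of the range arithmetic in the example above: bash's `RANDOM` yields values from 0 to 32767, so `1 + RANDOM + RANDOM` spans exactly the allowed range 1 - 0xFFFF. The helper below is a hypothetical sketch of the range and default handling described in the commit message, not the bonding driver's option parser.

```python
DEFAULT_SYS_PRIO = 0xFFFF
BASH_RANDOM_MAX = 32767  # upper bound of bash's $RANDOM


def effective_sys_prio(value=None):
    """Return the system priority to use, applying the default and range."""
    if value is None:
        return DEFAULT_SYS_PRIO  # unchanged behavior when the user sets nothing
    if not 1 <= value <= 0xFFFF:
        raise ValueError("ad_actor_sys_prio must be in 1 - 0xFFFF")
    return value


lowest = 1 + 0 + 0
highest = 1 + BASH_RANDOM_MAX + BASH_RANDOM_MAX
print(lowest, hex(highest))
```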
    • P
      pty: Fix input race when closing · 1a48632f
      Peter Hurley committed
      A read() from a pty master may mistakenly indicate EOF (errno == -EIO)
      after the pty slave has closed, even though input data remains to be read.
      For example,
      
             pty slave       |        input worker        |    pty master
                             |                            |
                             |                            |   n_tty_read()
      pty_write()            |                            |     input avail? no
        add data             |                            |     sleep
        schedule worker  --->|                            |     .
                             |---> flush_to_ldisc()       |     .
      pty_close()            |       fill read buffer     |     .
        wait for worker      |       wakeup reader    --->|     .
                             |       read buffer full?    |---> input avail ? yes
                             |<---   yes - exit worker    |     copy 4096 bytes to user
        TTY_OTHER_CLOSED <---|                            |<--- kick worker
                             |                            |
      
      		                **** New read() before worker starts ****
      
                             |                            |   n_tty_read()
                             |                            |     input avail? no
                             |                            |     TTY_OTHER_CLOSED? yes
                             |                            |     return -EIO
      
      Several conditions are required to trigger this race:
      1. the ldisc read buffer must become full so the input worker exits
      2. the read() count parameter must be >= 4096 so the ldisc read buffer
         is empty
      3. the subsequent read() occurs before the kicked worker has processed
         more input
      
      However, the underlying cause of the race is that data is pipelined, while
      tty state is not; i.e., data already written by the pty slave end is not
      yet visible to the pty master end, but state changes by the pty slave end
      are visible to the pty master end immediately.
      
      Pipeline the TTY_OTHER_CLOSED state through input worker to the reader.
      1. Introduce TTY_OTHER_DONE which is set by the input worker when
         TTY_OTHER_CLOSED is set and either the input buffers are flushed or
         input processing has completed. Readers/polls are woken when
         TTY_OTHER_DONE is set.
      2. Reader/poll checks TTY_OTHER_DONE instead of TTY_OTHER_CLOSED.
      3. A new input worker is started from pty_close() after setting
         TTY_OTHER_CLOSED, which ensures the TTY_OTHER_DONE state will be
         set if the last input worker is already finished (or just about to
         exit).
      
      Remove tty_flush_to_ldisc(); no in-tree callers.
      
      Fixes: 52bce7f8 ("pty, n_tty: Simplify input processing on final close")
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=96311
      BugLink: http://bugs.launchpad.net/bugs/1429756
      Cc: <stable@vger.kernel.org> # 3.19+
      Reported-by: Andy Whitcroft <apw@canonical.com>
      Reported-by: H.J. Lu <hjl.tools@gmail.com>
      Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
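The race and its fix can be mimicked with a toy model in which data travels through a queue drained by a worker, while the "closed" state is either visible immediately (old behavior) or only set by the worker after it has drained the queue (new behavior). Everything here is a deliberately simplified assumption: the class, its attribute names, and the single-threaded scheduling are hypothetical and only echo the TTY_OTHER_CLOSED / TTY_OTHER_DONE idea.

```python
import errno
from collections import deque


class ToyPty:
    def __init__(self, pipelined_close):
        self.pending = deque()    # written by the slave, not yet processed
        self.read_buf = deque()   # processed input, visible to the master
        self.other_closed = False # set immediately on slave close
        self.other_done = False   # set by the worker once input is drained
        self.pipelined_close = pipelined_close

    def slave_write(self, data):
        self.pending.append(data)

    def slave_close(self):
        self.other_closed = True

    def run_worker(self):
        while self.pending:
            self.read_buf.append(self.pending.popleft())
        if self.other_closed:
            self.other_done = True

    def master_read(self):
        if self.read_buf:
            return self.read_buf.popleft()
        eof = self.other_done if self.pipelined_close else self.other_closed
        if eof:
            raise OSError(errno.EIO, "other end closed")
        return None  # no input yet: a real reader would sleep here


def race(pipelined_close):
    """Slave writes and closes before the worker runs; record what the master sees."""
    pty = ToyPty(pipelined_close)
    pty.slave_write(b"tail")
    pty.slave_close()
    events = []
    for _ in range(3):
        try:
            data = pty.master_read()
        except OSError:
            events.append("EIO")
            break
        if data is None:
            events.append("sleep")
            pty.run_worker()
        else:
            events.append(data)
    return events


print(race(pipelined_close=False))
print(race(pipelined_close=True))
```

In the non-pipelined run the master sees EIO while `b"tail"` is still queued; in the pipelined run it sleeps, receives the data, and only then sees EIO.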
  17. 10 May, 2015 3 commits
    • A
      pktgen: introduce xmit_mode '<start_xmit|netif_receive>' · 62f64aed
      Alexei Starovoitov committed
      Introduce xmit_mode 'netif_receive' for pktgen which generates the
      packets using familiar pktgen commands, but feeds them into
      netif_receive_skb() instead of ndo_start_xmit().
      
      Default mode is called 'start_xmit'.
      
      It is designed to test netif_receive_skb and ingress qdisc
      performance only. Make sure to understand how it works before
      using it for other rx benchmarking.
      
      Sample script 'pktgen.sh':
      #!/bin/bash
      function pgset() {
        local result
      
        echo $1 > $PGDEV
      
        result=`cat $PGDEV | fgrep "Result: OK:"`
        if [ "$result" = "" ]; then
          cat $PGDEV | fgrep Result:
        fi
      }
      
      [ -z "$1" ] && echo "Usage: $0 DEV" && exit 1
      ETH=$1
      
      PGDEV=/proc/net/pktgen/kpktgend_0
      pgset "rem_device_all"
      pgset "add_device $ETH"
      
      PGDEV=/proc/net/pktgen/$ETH
      pgset "xmit_mode netif_receive"
      pgset "pkt_size 60"
      pgset "dst 198.18.0.1"
      pgset "dst_mac 90:e2:ba:ff:ff:ff"
      pgset "count 10000000"
      pgset "burst 32"
      
      PGDEV=/proc/net/pktgen/pgctrl
      echo "Running... ctrl^C to stop"
      pgset "start"
      echo "Done"
      cat /proc/net/pktgen/$ETH
      
      Usage:
      $ sudo ./pktgen.sh eth2
      ...
      Result: OK: 232376(c232372+d3) usec, 10000000 (60byte,0frags)
        43033682pps 20656Mb/sec (20656167360bps) errors: 10000000
      
      Raw netif_receive_skb speed should be ~43 million packets
      per second on 3.7 GHz x86 and 'perf report' should look like:
        37.69%  kpktgend_0   [kernel.vmlinux]  [k] __netif_receive_skb_core
        25.81%  kpktgend_0   [kernel.vmlinux]  [k] kfree_skb
         7.22%  kpktgend_0   [kernel.vmlinux]  [k] ip_rcv
         5.68%  kpktgend_0   [pktgen]          [k] pktgen_thread_worker
      
      If fib_table_lookup is seen on top, it means the skb was processed
      by the stack. To benchmark netif_receive_skb only, make sure
      that the 'dst_mac' of your pktgen script is different from the
      receiving device's mac, so that the packet is dropped by ip_rcv.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
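The figures in the sample "Result:" line above are mutually consistent, which the following sketch checks. The relation bits/sec = packets/sec x 8 x packet size is an assumption about how the numbers are derived, and the function names are hypothetical. The packets/sec value computed from the microsecond-rounded elapsed time only approximates the reported one.

```python
def rates_from_pps(pps, pkt_size):
    """Derive bit rates from a packet rate and a packet size in bytes."""
    bps = pps * 8 * pkt_size
    return bps, bps // 1_000_000  # (bits/sec, Mb/sec rounded down)


def approx_pps(packets, elapsed_usec):
    """Approximate packets/sec from a microsecond-resolution duration."""
    return packets * 1_000_000 // elapsed_usec


bps, mbps = rates_from_pps(43033682, 60)
print(bps, mbps)
print(approx_pps(10_000_000, 232376))
```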
    • J
      pktgen: adjust flag NO_TIMESTAMP to be more pktgen compliant · f1f00d8f
      Jesper Dangaard Brouer committed
      Allow flag NO_TIMESTAMP to turn timestamping on again, like other flags,
      with a negation of the flag like !NO_TIMESTAMP.
      
      Also document the option flag NO_TIMESTAMP.
      
      Fixes: afb84b62 ("pktgen: add flag NO_TIMESTAMP to disable timestamping")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
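The negation convention mentioned above, where a leading `!` clears a flag that the bare name sets, can be sketched as follows. The bit value assigned to NO_TIMESTAMP and the `apply_flag` helper are hypothetical; only the `!NAME` syntax comes from the commit message.

```python
FLAG_BITS = {
    "NO_TIMESTAMP": 1 << 0,  # hypothetical bit position, for illustration only
}


def apply_flag(flags, token):
    """Set the named flag, or clear it when the token starts with '!'."""
    negate = token.startswith("!")
    name = token[1:] if negate else token
    bit = FLAG_BITS[name]  # raises KeyError for unknown flag names
    return flags & ~bit if negate else flags | bit


flags = apply_flag(0, "NO_TIMESTAMP")       # timestamping disabled
flags = apply_flag(flags, "!NO_TIMESTAMP")  # timestamping enabled again
print(flags)
```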
    • H
      devicetree: Add compatible string for Zynq Ultrascale+ MPSoC · 988d6f07
      Harini Katakam committed
      Add "cdns,zynqmp-gem" to be used for Zynq Ultrascale+ MPSoC.
      Signed-off-by: Harini Katakam <harinik@xilinx.com>
      Reviewed-by: Punnaiah Choudary Kalluri <punnaia@xilinx.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 09 May, 2015 1 commit