1. 25 Mar, 2008 1 commit
  2. 23 Mar, 2008 2 commits
    • [IPV4] fib_trie: fix warning from rcu_assign_pointer · 6440cc9e
      Stephen Hemminger committed
      This gets rid of a warning caused by the test in rcu_assign_pointer.
      I tried to fix rcu_assign_pointer, but that devolved into a long set
      of discussions about doing it right that came to no real solution.
      Since the test in rcu_assign_pointer for constant NULL would never
      succeed in fib_trie, just open code instead.
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [TCP]: Let skbs grow over a page on fast peers · 69d15067
      Herbert Xu committed
      While testing the virtio-net driver on KVM with TSO I noticed
      that TSO performance with a 1500 MTU is significantly worse
      compared to the performance of non-TSO with a 16436 MTU.  The
      packet dump shows that most of the packets sent are smaller
      than a page.
      
      Looking at the code this actually is quite obvious as it always
      stops extending the packet if it's the first packet yet to be
      sent and if it's larger than the MSS.  Since each extension is
      bound by the page size, this means that (given a 1500 MTU) we're
      very unlikely to construct packets greater than a page, provided
      that the receiver and the path are fast enough so that packets can
      always be sent immediately.
      
      The fix is also quite obvious.  The push calls inside the loop
      are just an optimisation so that we don't end up doing all the
      sending at the end of the loop.  Therefore there is no specific
      reason why it has to do so at MSS boundaries.  For TSO, the
      most natural extension of this optimisation is to do the pushing
      once the skb exceeds the TSO size goal.
      
      This is what the patch does and testing with KVM shows that the
      TSO performance with a 1500 MTU easily surpasses that of a 16436
      MTU and indeed the packet sizes sent are generally larger than
      16436.
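The effect of moving the push threshold from the MSS to the TSO size goal can be sketched with a toy model of the send loop. This is only an illustration under stated assumptions: `build_skbs` is a hypothetical helper, the 4096-byte page size and the size goal of 45 segments are made-up values, and the model assumes the peer is fast enough that every push is sent immediately.

```python
PAGE_SIZE = 4096          # assumed page size
MSS = 1448                # payload per segment with a 1500 MTU, as in the dumps
SIZE_GOAL = 45 * MSS      # assumed TSO size goal (hypothetical value)


def build_skbs(total_bytes, size_goal, push_threshold):
    """Return the lengths of the packets a simplified send loop would emit.

    Each loop iteration extends the current skb by at most one page.
    The skb is pushed (sent) as soon as its length reaches push_threshold.
    """
    if not 0 < push_threshold <= size_goal:
        raise ValueError("push_threshold must be in 1..size_goal")
    skbs = []
    skb_len = 0
    remaining = total_bytes
    while remaining > 0:
        copy = min(PAGE_SIZE, size_goal - skb_len, remaining)
        skb_len += copy
        remaining -= copy
        if skb_len >= push_threshold:
            skbs.append(skb_len)   # pushed immediately on a fast peer
            skb_len = 0
    if skb_len:
        skbs.append(skb_len)       # final flush at the end of the loop
    return skbs


old = build_skbs(200_000, SIZE_GOAL, push_threshold=MSS)
new = build_skbs(200_000, SIZE_GOAL, push_threshold=SIZE_GOAL)
print(max(old), len(old))
print(max(new), len(new))
```

In this model, pushing at the MSS boundary never lets a packet grow beyond one page, while pushing at the size goal lets it grow to the full goal.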
      
      I don't see any obvious downsides for slower peers or connections,
      but it would be prudent to test this extensively to ensure that
      those cases don't regress.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 22 Mar, 2008 1 commit
  4. 21 Mar, 2008 2 commits
    • [TCP]: Fix shrinking windows with window scaling · 607bfbf2
      Patrick McHardy committed
      When selecting a new window, tcp_select_window() tries not to shrink
      the offered window by using the maximum of the remaining offered window
      size and the newly calculated window size. The newly calculated window
      size is always a multiple of the window scaling factor; the remaining
      window size, however, might not be, since it depends on rcv_wup/rcv_nxt.
      This means we're effectively shrinking the window when scaling it down.
      
      The dump below shows the problem (scaling factor 2^7):
      
      - Window size of 557 (71296) is advertised, up to 3111907257:
      
      IP 172.2.2.3.33000 > 172.2.2.2.33000: . ack 3111835961 win 557 <...>
      
      - New window size of 514 (65792) is advertised, up to 3111907217, 40 bytes
        below the last end:
      
      IP 172.2.2.3.33000 > 172.2.2.2.33000: . 3113575668:3113577116(1448) ack 3111841425 win 514 <...>
      
      The number 40 results from downscaling the remaining window:
      
      3111907257 - 3111841425 = 65832
      65832 / 2^7 = 514
      65832 % 2^7 = 40
      
      If the sender uses up the entire window before it is shrunk, this can have
      chaotic effects on the connection. When sending ACKs, tcp_acceptable_seq()
      will notice that the window has been shrunk since tcp_wnd_end() is before
      tp->snd_nxt, which makes it choose tcp_wnd_end() as sequence number.
      This will fail the receiver's checks in tcp_sequence(), however, since it
      is before its tp->rcv_wup, making it respond with a dupack.
      
      If both sides are in this condition, this leads to a constant flood of
      ACKs until the connection times out.
      
      Make sure the window is never shrunk by aligning the remaining window to
      the window scaling factor.
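The arithmetic from the dump, and the effect of aligning the remaining window, can be reproduced in a few lines. This is a sketch of the numbers only, not the kernel code: `advertised` and `align_up` are hypothetical helper names, and rounding up to the next multiple of the scaling factor is an assumption about how the alignment is done.

```python
WSCALE = 7  # scaling factor 2^7, as in the dump


def advertised(window_bytes, wscale=WSCALE):
    """Value placed in the 16-bit window field (the low bits are lost)."""
    return window_bytes >> wscale


def align_up(window_bytes, wscale=WSCALE):
    """Round a byte count up to a multiple of the scaling factor."""
    unit = 1 << wscale
    return (window_bytes + unit - 1) & ~(unit - 1)


# First ACK: win 557 advertised at ack 3111835961.
old_end = 3111835961 + (557 << WSCALE)

# Second segment: ack 3111841425, remaining window not a multiple of 128.
ack = 3111841425
remaining = old_end - ack

truncated_end = ack + (advertised(remaining) << WSCALE)
aligned_end = ack + (advertised(align_up(remaining)) << WSCALE)

print(remaining, advertised(remaining), old_end - truncated_end)
print(aligned_end >= old_end)
```

Without alignment the right edge of the window moves back by the truncated remainder; with alignment it can only stay put or move forward.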
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [NETFILTER]: ipt_recent: sanity check hit count · d0ebf133
      Daniel Hokka Zakrisson committed
      If a rule using ipt_recent is created with a hit count greater than
      ip_pkt_list_tot, the rule will never match as it cannot keep track
      of enough timestamps. This patch makes ipt_recent refuse to create such
      rules.
      
      With ip_pkt_list_tot's default value of 20, the following can be used
      to reproduce the problem.
      
      nc -u -l 0.0.0.0 1234 &
      for i in `seq 1 100`; do echo $i | nc -w 1 -u 127.0.0.1 1234; done
      
      This limits it to 20 packets:
      iptables -A OUTPUT -p udp --dport 1234 -m recent --set --name test \
               --rsource
      iptables -A OUTPUT -p udp --dport 1234 -m recent --update --seconds \
               60 --hitcount 20 --name test --rsource -j DROP
      
      While this is unlimited:
      iptables -A OUTPUT -p udp --dport 1234 -m recent --set --name test \
               --rsource
      iptables -A OUTPUT -p udp --dport 1234 -m recent --update --seconds \
               60 --hitcount 21 --name test --rsource -j DROP
      
      With the patch the second rule-set will throw an EINVAL.
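Why a hit count above ip_pkt_list_tot can never match is easy to see in a small model. The sketch below is not the module's code: `RecentEntry`, `rule_matches` and `check_rule` are hypothetical names, and the only behaviour assumed is that each tracked address keeps at most ip_pkt_list_tot timestamps.

```python
import errno
from collections import deque

IP_PKT_LIST_TOT = 20  # default value mentioned in the commit message


class RecentEntry:
    """Per-address state: only the newest `limit` timestamps are kept."""

    def __init__(self, limit=IP_PKT_LIST_TOT):
        self.stamps = deque(maxlen=limit)

    def record(self, now):
        self.stamps.append(now)

    def hits_within(self, now, seconds):
        return sum(1 for stamp in self.stamps if now - stamp <= seconds)


def rule_matches(entry, now, seconds, hitcount):
    return entry.hits_within(now, seconds) >= hitcount


def check_rule(hitcount, limit=IP_PKT_LIST_TOT):
    """Sanity check in the spirit of the patch: refuse unmatchable rules."""
    return -errno.EINVAL if hitcount > limit else 0


entry = RecentEntry()
for i in range(100):          # 100 packets, 0.1 s apart
    entry.record(i * 0.1)

print(rule_matches(entry, now=10.0, seconds=60, hitcount=20))
print(rule_matches(entry, now=10.0, seconds=60, hitcount=21))
print(check_rule(21))
```

However many packets arrive, the count can never exceed the number of stored timestamps, so a rule asking for 21 hits is silently a no-op unless it is rejected up front.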
      Reported-by: Sean Kennedy <skennedy@vcn.com>
      Signed-off-by: Daniel Hokka Zakrisson <daniel@hozac.com>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 18 Mar, 2008 2 commits
  6. 12 Mar, 2008 1 commit
  7. 05 Mar, 2008 2 commits
    • [IPCONFIG]: The kernel gets no IP from some DHCP servers · dea75bdf
      Stephen Hemminger committed
      From: Stephen Hemminger <shemminger@linux-foundation.org>
      
      Based upon a patch by Marcel Wappler:
       
         This patch fixes a DHCP issue of the kernel: some DHCP servers
         (e.g. in the Linksys WRT54Gv5) are very strict about the contents
         of the DHCPDISCOVER packet they receive from clients.
       
         Table 5 in RFC2131 page 36 requires that the fields 'ciaddr' and
         'siaddr' MUST be set to '0'.  These DHCP servers ignore the Linux
         kernel's DHCP discovery packets with these two fields set to
         '255.255.255.255' (in contrast to popular DHCP clients, such as
         'dhclient' or 'udhcpc').  This leads to a system that does not boot.
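For reference, the fixed-size part of a DHCPDISCOVER message with 'ciaddr' and 'siaddr' zeroed can be sketched as follows. This is a user-space illustration, not the kernel's ipconfig code: `build_discover_header` is a hypothetical helper, the flags field is simply left at 0, and the DHCP options that follow the header are omitted.

```python
import socket
import struct

# BOOTP/DHCP fixed header: op, htype, hlen, hops, xid, secs, flags,
# ciaddr, yiaddr, siaddr, giaddr, chaddr[16], sname[64], file[128]
BOOTP_FORMAT = "!BBBBIHH4s4s4s4s16s64s128s"


def build_discover_header(xid, mac):
    """Pack the 236-byte header with all address fields set to 0.0.0.0."""
    zero = socket.inet_aton("0.0.0.0")
    return struct.pack(
        BOOTP_FORMAT,
        1,      # op: BOOTREQUEST
        1,      # htype: Ethernet
        6,      # hlen: length of a MAC address
        0,      # hops
        xid,    # transaction id chosen by the client
        0,      # secs
        0,      # flags (none set in this sketch)
        zero,   # ciaddr: must be 0 in DHCPDISCOVER
        zero,   # yiaddr
        zero,   # siaddr: must be 0 in DHCPDISCOVER
        zero,   # giaddr
        mac,    # chaddr: struct pads the 6-byte MAC to 16 bytes
        b"",    # sname
        b"",    # file
    )


header = build_discover_header(0x12345678, bytes.fromhex("020000000001"))
print(len(header), header[12:16], header[20:24])
```

Bytes 12 to 15 hold 'ciaddr' and bytes 20 to 23 hold 'siaddr'; a strict server would expect both ranges to be all zeroes.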
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [ESP]: Add select on AUTHENC · ed58dd41
      Herbert Xu committed
      Now that ESP uses the AEAD interface even for algorithms which are
      not combined mode, we need to select CONFIG_CRYPTO_AUTHENC, as
      otherwise only combined mode algorithms will work.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 04 Mar, 2008 1 commit
  9. 29 Feb, 2008 3 commits
  10. 27 Feb, 2008 2 commits
    • [INET]: Don't create tunnels with '%' in name. · b37d428b
      Pavel Emelyanov committed
      Four tunnel drivers (ip_gre, ipip, ip6_tunnel and sit) can receive a
      pre-defined name for a device from userspace.  Since these drivers
      call register_netdevice() (rtnl_lock is held), which does _not_
      generate the device's name, this name may contain a '%' character.
      
      I am not sure how bad it is to have a device with a '%' in its name, but
      all the other places either use register_netdev(), which calls
      dev_alloc_name(), or explicitly call dev_alloc_name() before
      registering, i.e. they do not allow such names.
      
      This should have come before commit 34cc7b, but I forgot to number the
      patches and this one got lost, sorry.
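The name handling described above can be modelled in user space. The sketch assumes the usual template behaviour (a single "%d" is replaced by the lowest unused number); `alloc_name` is a hypothetical stand-in for dev_alloc_name(), not a copy of it.

```python
def alloc_name(template, existing):
    """Expand a device name template against the set of names in use.

    Names without '%' are returned unchanged.  A single '%d' is replaced
    by the lowest number that yields an unused name.  Anything else with
    a '%' is rejected.
    """
    if "%" not in template:
        return template
    if template.count("%") != 1 or "%d" not in template:
        raise ValueError(f"invalid device name template: {template!r}")
    number = 0
    while template.replace("%d", str(number)) in existing:
        number += 1
    return template.replace("%d", str(number))


in_use = {"gre0", "gre1", "tunl0"}
print(alloc_name("gre%d", in_use))
print(alloc_name("mytunnel", in_use))
```

Running every user-supplied name through such a step means a literal '%' never ends up in a registered device name.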
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [IPV4]: Reset scope when changing address · 148f9729
      Bjorn Mork committed
      This bug did bite at least one user, who had to resort to rebooting
      the system after an "ifconfig eth0 127.0.0.1" typo.
      
      Deleting the address and adding a new one is a less intrusive workaround.
      But I still believe this is a bug that should be fixed, one way or
      another.
      
      Another possibility would be to remove the scope mangling based on
      address.  This will always be incomplete (is 127/8 the only address
      space with host scope requirements?)
      
      We set the scope to RT_SCOPE_HOST if an IPv4 interface is configured
      with a loopback address (127/8).  The scope is never reset, and will
      remain set to RT_SCOPE_HOST after changing the address. This patch
      resets the scope if the address is changed again, to restore normal
      functionality.
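The difference between the sticky scope and the reset behaviour can be shown with a small model. `Interface`, `set_address_buggy` and `set_address_fixed` are hypothetical names for illustration; only the RT_SCOPE_* values are taken from the kernel headers.

```python
import ipaddress

RT_SCOPE_UNIVERSE = 0
RT_SCOPE_HOST = 254


def scope_for(address):
    """Host scope for loopback (127/8) addresses, universe otherwise."""
    if ipaddress.IPv4Address(address).is_loopback:
        return RT_SCOPE_HOST
    return RT_SCOPE_UNIVERSE


class Interface:
    def __init__(self):
        self.address = None
        self.scope = RT_SCOPE_UNIVERSE

    def set_address_buggy(self, address):
        # Scope is raised for loopback addresses but never lowered again.
        self.address = address
        if ipaddress.IPv4Address(address).is_loopback:
            self.scope = RT_SCOPE_HOST

    def set_address_fixed(self, address):
        # Scope is recomputed on every address change.
        self.address = address
        self.scope = scope_for(address)


buggy, fixed = Interface(), Interface()
for iface_set in (buggy.set_address_buggy, fixed.set_address_fixed):
    iface_set("127.0.0.1")      # the typo
    iface_set("192.168.1.10")   # the correction
print(buggy.scope, fixed.scope)
```

After the correction the buggy model is still stuck at host scope, while the fixed model is back at universe scope.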
      Signed-off-by: Bjorn Mork <bjorn@mork.no>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 24 Feb, 2008 1 commit
  12. 20 Feb, 2008 3 commits
  13. 18 Feb, 2008 3 commits
  14. 14 Feb, 2008 2 commits
  15. 13 Feb, 2008 5 commits
    • [IPSEC]: Fix bogus usage of u64 on input sequence number · b318e0e4
      Herbert Xu committed
      Al Viro spotted a bogus use of u64 on the input sequence number which
      is big-endian.  This patch fixes it by giving the input sequence number
      its own member in the xfrm_skb_cb structure.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [NDISC]: Fix race in generic address resolution · 69cc64d8
      David S. Miller committed
      Frank Blaschka provided the bug report and the initial suggested fix
      for this bug.  He also validated this version of this fix.
      
      The problem is that the access to neigh->arp_queue is inconsistent.  We
      grab references when dropping the lock to call
      neigh->ops->solicit(), but this does not prevent other threads of
      control from trying to send out that packet at the same time, causing
      corruption because both code paths believe they have exclusive access
      to the skb.
      
      The best option seems to be to hold the write lock on neigh->lock
      during the ->solicit() call.  I looked at all of the ndisc_ops
      implementations and this seems workable.  The only case that needs
      special care is the IPV4 ARP implementation of arp_solicit().  It
      wants to take neigh->lock as a reader to protect the header entry in
      neigh->ha during the emission of the solicitation.  We can simply
      remove the read lock calls to take care of that since holding the lock
      as a writer at the caller provides a superset of the protection
      afforded by the existing read locking.
      
      The rest of the ->solicit() implementations don't care whether the
      neigh is locked or not.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: /proc/net/route performance improvement · 8315f5d8
      Stephen Hemminger committed
      Use key/offset caching to change /proc/net/route (used by iputils route)
      from O(n^2) to O(n). This improves performance from 30 seconds with
      160,000 routes to 1 second.
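The idea behind key/offset caching can be illustrated with a toy reader over a sorted key list. This is not the fib_trie code: `RestartingReader` and `CachingReader` are hypothetical classes, a sorted list of unique keys stands in for the trie, and a bisect lookup stands in for finding a leaf by key.

```python
from bisect import bisect_right


class RestartingReader:
    """Every read walks from the first entry to the requested position."""

    def __init__(self, keys):
        self.keys = keys
        self.steps = 0

    def read(self, pos):
        for index, key in enumerate(self.keys):
            self.steps += 1
            if index == pos:
                return key
        return None


class CachingReader:
    """Remembers the last position and key, and resumes from that key."""

    def __init__(self, keys):
        self.keys = keys
        self.steps = 0
        self.cached_pos = None
        self.cached_key = None

    def read(self, pos):
        if self.cached_pos is not None and pos == self.cached_pos + 1:
            # Sequential read: jump straight to the successor of the cached key.
            index = bisect_right(self.keys, self.cached_key)
            self.steps += 1
        else:
            # Cache miss: fall back to walking from the start.
            index = pos
            self.steps += pos + 1
        if index >= len(self.keys):
            return None
        self.cached_pos, self.cached_key = pos, self.keys[index]
        return self.cached_key


def dump(reader, count):
    return [reader.read(pos) for pos in range(count)]


keys = list(range(0, 2000, 2))
slow, fast = RestartingReader(keys), CachingReader(keys)
assert dump(slow, len(keys)) == dump(fast, len(keys)) == keys
print(slow.steps, fast.steps)
```

A full sequential dump costs a quadratic number of steps when every read restarts from the beginning, and a linear number when each read resumes from the cached key.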
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: handle empty tree · ec28cf73
      Stephen Hemminger committed
      This fixes possible problems when trie_firstleaf() returns NULL
      to trie_leafindex().
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • [IPV4]: Remove IP_TOS setting privilege checks. · e4f8b5d4
      David S. Miller committed
      Various RFCs have all sorts of things to say about the CS field of the
      DSCP value.  In particular they try to make the distinction between
      values that should be used by "user applications" and things like
      routing daemons.
      
      This seems to have influenced the CAP_NET_ADMIN check which exists for
      IP_TOS socket option settings, but in fact it has an off-by-one error
      so it wasn't allowing CS5 which is meant for "user applications" as
      well.
      
      Further adding to the inconsistency and brokenness here, IPV6 does not
      validate the DSCP values specified for the IPV6_TCLASS socket option.
      
      The actual uses of these TOS values are system specific in the
      final analysis, and these RFC recommendations are just that, "a
      recommendation".  In fact the standards very purposefully use
      "SHOULD" and "SHOULD NOT" when describing how these values can be
      used.
      
      In the final analysis the only clean way to provide consistency here
      is to remove the CAP_NET_ADMIN check.  The alternatives just don't
      work out:
      
      1) If we add the CAP_NET_ADMIN check to ipv6, this can break existing
         setups.
      
      2) If we just fix the off-by-one error in the class comparison in
         IPV4, certain DSCP values can be used in IPV6 but not IPV4 by
         default.  So people will just ask for a sysctl to
         override that.
      
      I checked several other freely available kernel trees and they
      do not make any privilege checks in this area like we do.  For
      the BSD stacks, this goes back all the way to Stevens Volume 2
      and beyond.
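How an off-by-one in a class comparison can lock out CS5 is sketched below. The exact form of the removed check is an assumption made for illustration (a `>=` where `>` was intended); `class_selector_to_tos` and both `needs_privilege_*` functions are hypothetical helpers. The constant 0xA0 is the value of IPTOS_PREC_CRITIC_ECP in <netinet/ip.h>, which corresponds to CS5.

```python
IPTOS_PREC_MASK = 0xE0
IPTOS_PREC_CRITIC_ECP = 0xA0  # precedence 5, i.e. class selector CS5


def class_selector_to_tos(cs):
    """TOS byte for class selector CSn: DSCP is n << 3, TOS is DSCP << 2."""
    if not 0 <= cs <= 7:
        raise ValueError("class selector must be in 0..7")
    return cs << 5


def needs_privilege_off_by_one(tos):
    # Assumed shape of the faulty comparison: CS5 itself is caught.
    return (tos & IPTOS_PREC_MASK) >= IPTOS_PREC_CRITIC_ECP


def needs_privilege_intended(tos):
    # What a corrected comparison would look like: only CS6 and CS7.
    return (tos & IPTOS_PREC_MASK) > IPTOS_PREC_CRITIC_ECP


cs5 = class_selector_to_tos(5)
print(hex(cs5), needs_privilege_off_by_one(cs5), needs_privilege_intended(cs5))
```

The commit removes the check altogether rather than correcting the comparison, for the consistency reasons given in the message.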
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 10 Feb, 2008 1 commit
  17. 08 Feb, 2008 2 commits
  18. 06 Feb, 2008 2 commits
  19. 05 Feb, 2008 4 commits