1. 29 Jan, 2011 3 commits
  2. 28 Jan, 2011 3 commits
  3. 27 Jan, 2011 1 commit
    • D
      net: Implement read-only protection and COW'ing of metrics. · 62fa8a84
      David S. Miller committed
      Routing metrics are now copy-on-write.
      
      Initially a route entry points its metrics at a read-only location.
      If a routing table entry exists, it will point there.  Else it will
      point at the all zero metric place-holder called 'dst_default_metrics'.
      
      The writeability state of the metrics is stored in the low bits of the
      metrics pointer; we have two bits left to spare if we want to store
      more states.
      
      For the initial implementation, COW is implemented simply via kmalloc.
      However future enhancements will change this to place the writable
      metrics somewhere else, in order to increase sharing.  Very likely
      this "somewhere else" will be the inetpeer cache.
      
      Note also that this means that metrics updates may transiently fail
      if we cannot COW the metrics successfully.
      
      But even by itself, this patch should decrease memory usage and
      increase cache locality especially for routing workloads.  In those
      cases the read-only metric copies stay in place and never get written
      to.
      
      TCP workloads where metrics get updated, and those rare cases where
      PMTU triggers occur, will take a very slight performance hit.  But
      that hit will be alleviated when the long-term writable metrics
      move to a more sharable location.
      
      Since the metrics storage went from a u32 array of RTAX_MAX entries to
      what is essentially a pointer, some retooling of the dst_entry layout
      was necessary.
      
      Most importantly, we need to preserve the alignment of the reference
      count so that it doesn't share cache lines with the read-mostly state,
      as per Eric Dumazet's alignment assertion checks.
      
      The only non-trivial bit here is the move of the 'flags' member into
      the writeable cacheline.  This is OK since we always access the
      flags around the same moment that we modify the reference count.
      Signed-off-by: David S. Miller <davem@davemloft.net>
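The pointer-tagging and copy-on-write scheme described in this commit can be sketched in user-space C as follows. This is only an illustration under stated assumptions, not the kernel code: `route_entry`, `route_init_metrics`, `route_cow_metrics`, `METRICS_MAX`, and the flag names are hypothetical, `malloc` stands in for `kmalloc`, and the sketch reserves the two low bits that 4-byte alignment of a u32 array guarantees.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define METRICS_MAX       16                /* illustrative stand-in for RTAX_MAX */
#define METRICS_READ_ONLY ((uintptr_t)0x1)  /* hypothetical flag kept in bit 0 */
#define METRICS_FLAG_MASK ((uintptr_t)0x3)  /* low bits reserved for state */

/* All-zero placeholder, playing the role of 'dst_default_metrics'. */
static const uint32_t default_metrics[METRICS_MAX];

struct route_entry {
    uintptr_t metrics;  /* address of a u32 array, with state in the low bits */
};

/* Strip the state bits to recover the array address. */
static const uint32_t *metrics_ptr(const struct route_entry *rt)
{
    return (const uint32_t *)(rt->metrics & ~METRICS_FLAG_MASK);
}

static int metrics_read_only(const struct route_entry *rt)
{
    return (rt->metrics & METRICS_READ_ONLY) != 0;
}

/* Point the entry at shared read-only metrics, or at the zero placeholder. */
static void route_init_metrics(struct route_entry *rt, const uint32_t *shared)
{
    if (shared == NULL)
        shared = default_metrics;
    /* The tagging trick only works if the low bits of the address are free. */
    assert(((uintptr_t)shared & METRICS_FLAG_MASK) == 0);
    rt->metrics = (uintptr_t)shared | METRICS_READ_ONLY;
}

/*
 * Return a writable metrics array, copying the shared one on first write.
 * Returns NULL if the copy cannot be allocated, which is the transient
 * update failure the commit message mentions.
 */
static uint32_t *route_cow_metrics(struct route_entry *rt)
{
    if (metrics_read_only(rt)) {
        uint32_t *copy = malloc(sizeof(uint32_t) * METRICS_MAX);
        if (copy == NULL)
            return NULL;
        memcpy(copy, metrics_ptr(rt), sizeof(uint32_t) * METRICS_MAX);
        rt->metrics = (uintptr_t)copy;  /* read-only bit is now clear */
    }
    return (uint32_t *)(rt->metrics & ~METRICS_FLAG_MASK);
}
```

A reader calls `metrics_ptr()` and never allocates; only a writer calls `route_cow_metrics()` and must handle a NULL return. In this sketch the caller owns the private copy and is responsible for freeing it.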
  4. 26 Jan, 2011 1 commit
    • J
      TCP: fix a bug that triggers large number of TCP RST by mistake · 44f5324b
      Jerry Chu committed
      This patch fixes a bug that causes TCP RST packets to be generated
      for otherwise well-behaved applications, e.g., those with no unread
      data on close. To trigger the bug, at least two conditions must
      be met:
      
      1. The FIN flag is set on the last data packet, i.e., it's not on a
      separate, FIN only packet.
      2. The size of the last data chunk on the receive side matches
      exactly with the size of buffer posted by the receiver, and the
      receiver closes the socket without any further read attempt.
      
      This bug was first noticed on our netperf based testbed for our IW10
      proposal to IETF where a large number of RST packets were observed.
      netperf's read-side code meets condition 2 above 100% of the time.
      
      Before the fix, tcp_data_queue() will queue the last skb that meets
      condition 1 to sk_receive_queue even though it has fully copied out
      (skb_copy_datagram_iovec()) the data. Then if condition 2 is also met,
      tcp_recvmsg() often returns all the copied out data successfully
      without actually consuming the skb, due to a check
      "if ((chunk = len - tp->ucopy.len) != 0) {"
      and
      "len -= chunk;"
      after tcp_prequeue_process() that causes "len" to become 0 and an
      early exit from the big while loop.
      
      I don't see any reason not to free the skb whose data have been fully
      consumed in tcp_data_queue(), regardless of the FIN flag.  We won't
      get there if MSG_PEEK is on. Am I missing some arcane cases related
      to urgent data?
      Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
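The length accounting behind the early loop exit can be modelled in isolation. The sketch below mirrors only the two statements quoted in the commit message; the `recv_progress` struct, the `account_prequeue` helper, and the `copied` counter are hypothetical names introduced for illustration, not the actual `tcp_recvmsg()` code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the receive loop's bookkeeping. */
struct recv_progress {
    size_t len;     /* bytes the caller still wants */
    size_t copied;  /* bytes already delivered to the caller */
};

/*
 * ucopy_len models tp->ucopy.len: the space still free in the user
 * buffer after the prequeue has been processed.
 */
static void account_prequeue(struct recv_progress *p, size_t ucopy_len)
{
    assert(ucopy_len <= p->len);

    size_t chunk = p->len - ucopy_len;  /* bytes copied directly to the user */
    if (chunk != 0) {
        p->len -= chunk;
        p->copied += chunk;
    }
}
```

When the last data chunk exactly fills the posted buffer (condition 2), `ucopy_len` is 0, so `len` drops to 0 and a loop of the form `while (len > 0)` ends without ever visiting the skb that is still sitting on the receive queue.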
  5. 25 Jan, 2011 4 commits
  6. 20 Jan, 2011 3 commits
  7. 19 Jan, 2011 1 commit
    • J
      netfilter: nf_conntrack: nf_conntrack snmp helper · 93557f53
      Jiri Olsa committed
      Add support for SNMP broadcast connection tracking. SNMP broadcast
      requests are now paired with the SNMP responses, which allows SNMP
      broadcasts to be used with a firewall enabled.
      
      Please refer to the following conversation:
      http://marc.info/?l=netfilter-devel&m=125992205006600&w=2
      
      Patrick McHardy wrote:
      > > The best solution would be to add generic broadcast tracking, the
      > > use of expectations for this is a bit of abuse.
      > > The second best choice I guess would be to move the help() function
      > > to a shared module and generalize it so it can be used for both.
      This patch implements the "second best choice".
      
      Since the netbios-ns conntrack module uses the same helper
      functionality as the snmp, only one helper function is added
      for both snmp and netbios-ns modules into the new object -
      nf_conntrack_broadcast.
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
  8. 18 Jan, 2011 2 commits
  9. 14 Jan, 2011 2 commits
  10. 13 Jan, 2011 1 commit
    • E
      netfilter: x_table: speedup compat operations · 255d0dc3
      Eric Dumazet committed
      One iptables invocation with 135000 rules takes 35 seconds of cpu time
      on a recent server, using a 32bit distro and a 64bit kernel.
      
      We eventually trigger NMI/RCU watchdog.
      
      INFO: rcu_sched_state detected stall on CPU 3 (t=6000 jiffies)
      
      COMPAT mode has quadratic behavior and consumes 16 bytes of memory per
      rule.
      
      Switch the xt_compat algos to use an array instead of list, and use a
      binary search to locate an offset in the sorted array.
      
      This halves the memory needed (8 bytes per rule), and removes quadratic
      behavior [ O(N*N) -> O(N*log2(N)) ]
      
      Time of iptables goes from 35 s to 150 ms.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
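The switch from a list walk to a binary search over a sorted array can be sketched as follows. This is a minimal illustration rather than the xt_compat code itself: `offset_delta` and `lookup_delta` are hypothetical names, and the sketch assumes each entry stores the cumulative size adjustment up to and including that rule, so one lookup answers "how much do all rules before this offset shift things?".

```c
#include <assert.h>
#include <stddef.h>

/* One entry per rule; two 32-bit fields give 8 bytes per rule. */
struct offset_delta {
    unsigned int offset;  /* rule offset; the array is sorted by this field */
    int delta;            /* assumed: cumulative adjustment through this rule */
};

/*
 * Return the cumulative delta of all entries whose offset is strictly
 * below 'offset', or 0 if there are none. Runs in O(log n).
 */
static int lookup_delta(const struct offset_delta *tab, size_t n,
                        unsigned int offset)
{
    assert(tab != NULL || n == 0);

    size_t lo = 0, hi = n;  /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (tab[mid].offset < offset)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* lo is now the number of entries with a smaller offset. */
    return lo ? tab[lo - 1].delta : 0;
}
```

With N rules each needing one lookup, the total cost is O(N*log2(N)) instead of the O(N*N) of walking a list per rule.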
  11. 12 Jan, 2011 2 commits
  12. 11 Jan, 2011 2 commits
  13. 10 Jan, 2011 1 commit
  14. 07 Jan, 2011 2 commits
    • P
      netfilter: fix export secctx error handling · cba85b53
      Pablo Neira Ayuso committed
      In 1ae4de0c, the secctx was exported
      via the /proc/net/netfilter/nf_conntrack and ctnetlink interfaces
      instead of the secmark.
      
      That patch introduced the use of security_secid_to_secctx() which may
      return a non-zero value on error.
      
      In one of my setups, I have NF_CONNTRACK_SECMARK enabled but no
      security modules. Thus, security_secid_to_secctx() returns a negative
      value that results in the breakage of the /proc and `conntrack -L'
      outputs. To fix this, we skip the inclusion of secctx if the
      aforementioned function fails.
      
      This patch also fixes the dynamic netlink message size calculation
      if security_secid_to_secctx() returns an error, since its logic is
      also wrong.
      
      This problem exists in Linux kernel >= 2.6.37.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      ipv4: IP defragmentation must be ECN aware · 6623e3b2
      Eric Dumazet committed
      RFC3168 (The Addition of Explicit Congestion Notification to IP)
      states:
      
      5.3.  Fragmentation
      
         ECN-capable packets MAY have the DF (Don't Fragment) bit set.
         Reassembly of a fragmented packet MUST NOT lose indications of
         congestion.  In other words, if any fragment of an IP packet to be
         reassembled has the CE codepoint set, then one of two actions MUST be
         taken:
      
            * Set the CE codepoint on the reassembled packet.  However, this
              MUST NOT occur if any of the other fragments contributing to
              this reassembly carries the Not-ECT codepoint.
      
            * The packet is dropped, instead of being reassembled, for any
              other reason.
      
      This patch implements this requirement for IPv4, choosing the first
      action:
      
      If one fragment had Not-ECT codepoint
              reassembled frame has Not-ECT
      Else if one fragment had CE codepoint
              reassembled frame has CE
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
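The reassembly rule stated in this commit can be written as a small function over the TOS bytes of the fragments. This is a sketch only: `reassembled_tos` is a hypothetical helper, the codepoint values are those defined by RFC 3168, and keeping the first fragment's ECN field when neither rule fires is an assumption the commit message does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ECN codepoints from RFC 3168: the two low bits of the IPv4 TOS byte. */
enum {
    ECN_NOT_ECT = 0x0,
    ECN_ECT_1   = 0x1,
    ECN_ECT_0   = 0x2,
    ECN_CE      = 0x3,
    ECN_MASK    = 0x3
};

/* Compute the TOS byte of the reassembled packet from its fragments. */
static uint8_t reassembled_tos(const uint8_t *frag_tos, size_t n)
{
    assert(n > 0);

    int saw_not_ect = 0, saw_ce = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t ecn = (uint8_t)(frag_tos[i] & ECN_MASK);
        if (ecn == ECN_NOT_ECT)
            saw_not_ect = 1;
        else if (ecn == ECN_CE)
            saw_ce = 1;
    }

    uint8_t tos = frag_tos[0];  /* assumption: start from the first fragment */
    if (saw_not_ect)
        tos = (uint8_t)(tos & ~ECN_MASK);  /* any Not-ECT fragment wins */
    else if (saw_ce)
        tos = (uint8_t)(tos | ECN_CE);     /* otherwise propagate CE */
    return tos;
}
```

Checking Not-ECT before CE is what enforces the RFC's restriction that CE must not be set if any contributing fragment carried Not-ECT.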
  15. 05 Jan, 2011 1 commit
    • J
      ipv4/route.c: respect prefsrc for local routes · 9fc3bbb4
      Joel Sing committed
      The preferred source address is currently ignored for local routes,
      which results in all local connections having a src address that is the
      same as the local dst address. Fix this by respecting the preferred source
      address when it is provided for local routes.
      
      This bug can be demonstrated as follows:
      
       # ifconfig dummy0 192.168.0.1
       # ip route show table local | grep local.*dummy0
       local 192.168.0.1 dev dummy0  proto kernel  scope host  src 192.168.0.1
       # ip route change table local local 192.168.0.1 dev dummy0 \
           proto kernel scope host src 127.0.0.1
       # ip route show table local | grep local.*dummy0
       local 192.168.0.1 dev dummy0  proto kernel  scope host  src 127.0.0.1
      
      We now establish a local connection and verify the source IP
      address selection:
      
       # nc -l 192.168.0.1 3128 &
       # nc 192.168.0.1 3128 &
       # netstat -ant | grep 192.168.0.1:3128.*EST
       tcp        0      0 192.168.0.1:3128        192.168.0.1:33228 ESTABLISHED
       tcp        0      0 192.168.0.1:33228       192.168.0.1:3128  ESTABLISHED
      Signed-off-by: Joel Sing <jsing@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
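The source-address choice this fix describes for local routes can be summarised in a few lines. This is a hedged sketch, not the code in ipv4/route.c: `select_local_src` is a hypothetical helper, addresses are plain 32-bit values with 0 meaning "unset", and honouring an address explicitly requested by the caller first is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the source address for a connection over a local route.
 * All arguments are IPv4 addresses; 0 means "not set".
 */
static uint32_t select_local_src(uint32_t requested_src, uint32_t prefsrc,
                                 uint32_t dst)
{
    assert(dst != 0);  /* a local route always has a destination */

    if (requested_src != 0)
        return requested_src;  /* assumption: an explicit choice wins */
    if (prefsrc != 0)
        return prefsrc;        /* the fix: respect the route's 'src' */
    return dst;                /* old behavior: source mirrors the destination */
}
```

In terms of the demonstration above, the changed route carries `src 127.0.0.1`, so with this logic a connection to 192.168.0.1 that does not bind a source address would use 127.0.0.1 rather than 192.168.0.1.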
  16. 26 Dec, 2010 1 commit
  17. 24 Dec, 2010 3 commits
  18. 21 Dec, 2010 2 commits
  19. 17 Dec, 2010 2 commits
  20. 15 Dec, 2010 1 commit
  21. 14 Dec, 2010 2 commits