1. 02 Feb, 2011 1 commit
    • D
      ipv4: Remove fib_hash. · 3630b7c0
      Committed by David S. Miller
      The time has finally come to remove the hash based routing table
      implementation in ipv4.
      
      FIB Trie is mature, well tested, and I've done an audit of its code
      to confirm that it implements insert, delete, and lookup with
      semantics identical to those of fib_hash.
      
      If there are any semantic differences found in fib_trie, we should
      simply fix them.
      
      I've placed the trie statistic config option under advanced router
      configuration.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-by: Stephen Hemminger <shemminger@vyatta.com>
  2. 01 Feb, 2011 2 commits
    • D
      ipv4: Consolidate all default route selection implementations. · 0c838ff1
      Committed by David S. Miller
      Both fib_trie and fib_hash have a local implementation of
      fib_table_select_default().  This is completely unnecessary
      code duplication.
      
      Since we now remember the fib_table and the head of the fib
      alias list of the default route, we can implement one single
      generic version of this routine.
      
      Looking at the fib_hash implementation you may get the impression
      that it's possible for there to be multiple top-level routes in
      the table for the default route.  The truth is that it isn't: the
      insert code will only allow one entry to exist in the zero
      prefix hash table, because all keys evaluate to zero and all
      keys in a hash table must be unique.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv4: Remember FIB alias list head and table in lookup results. · 5b470441
      Committed by David S. Miller
      This will be used later to implement fib_select_default() in a
      completely generic manner, instead of the current situation where the
      default route is re-looked up in the TRIE/HASH table and then the
      available aliases are analyzed.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 30 Jan, 2011 1 commit
    • E
      net: Add compat ioctl support for the ipv4 multicast ioctl SIOCGETSGCNT · 709b46e8
      Committed by Eric W. Biederman
      SIOCGETSGCNT is not a unique ioctl value, as it maps to SIOCPROTOPRIVATE + 1,
      which unfortunately means the existing infrastructure for compat networking
      ioctls is insufficient.  A trivial compat ioctl implementation would conflict
      with:
      
      SIOCAX25ADDUID
      SIOCAIPXPRISLT
      SIOCGETSGCNT_IN6
      SIOCGETSGCNT
      SIOCRSSCAUSE
      SIOCX25SSUBSCRIP
      SIOCX25SDTEFACILITIES
      
      To make this work I have updated the compat_ioctl decode path to mirror the
      normal ioctl decode path.  I have added an ipv4 inet_compat_ioctl function
      so that I can have ipv4-specific compat ioctls.  I have added a compat_ioctl
      function into struct proto so I can break out ioctls by which kind of ip socket
      I am using.  I have added a compat_raw_ioctl function because SIOCGETSGCNT only
      works on raw sockets.  I have added an ipmr_compat_ioctl that mirrors the normal
      ipmr_ioctl.
      
      This was necessary because unfortunately the struct layout for SIOCGETSGCNT
      has unsigned longs in it, so it changes between 32-bit and 64-bit kernels.
      
      This change was sufficient to run a 32bit ip multicast routing daemon on a
      64bit kernel.
      Reported-by: Bill Fenner <fenner@aristanetworks.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
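The layout problem described in the commit above can be illustrated with a small userspace sketch. It is not the kernel's code: the struct and function names are hypothetical, the field names are only illustrative, and fixed-width types stand in for `unsigned long` as seen by a 64-bit kernel and by 32-bit userspace.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of the request as a 64-bit kernel lays it out:
 * each "unsigned long" counter is 8 bytes wide. */
struct sg_req_native64 {
    uint32_t src;
    uint32_t grp;
    uint64_t pktcnt;
    uint64_t bytecnt;
    uint64_t wrong_if;
};

/* Hypothetical view of the same request from 32-bit userspace:
 * each "unsigned long" counter is 4 bytes wide. */
struct sg_req_compat32 {
    uint32_t src;
    uint32_t grp;
    uint32_t pktcnt;
    uint32_t bytecnt;
    uint32_t wrong_if;
};

/* Copy field by field; a raw memcpy would misplace every counter
 * because the two layouts differ in size and field offsets.
 * Counters wider than 32 bits are truncated in this sketch. */
void sg_req_to_compat(const struct sg_req_native64 *in,
                      struct sg_req_compat32 *out)
{
    assert(in != 0 && out != 0);
    out->src = in->src;
    out->grp = in->grp;
    out->pktcnt = (uint32_t)in->pktcnt;
    out->bytecnt = (uint32_t)in->bytecnt;
    out->wrong_if = (uint32_t)in->wrong_if;
}
```

Because the two structures differ in size, a compat handler has to translate between them explicitly rather than pass the user buffer straight through to the native handler.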
  4. 29 Jan, 2011 3 commits
  5. 28 Jan, 2011 3 commits
  6. 27 Jan, 2011 1 commit
    • D
      net: Implement read-only protection and COW'ing of metrics. · 62fa8a84
      Committed by David S. Miller
      Routing metrics are now copy-on-write.
      
      Initially a route entry points its metrics at a read-only location.
      If a routing table entry exists, it will point there.  Else it will
      point at the all zero metric place-holder called 'dst_default_metrics'.
      
      The writeability state of the metrics is stored in the low bits of the
      metrics pointer, we have two bits left to spare if we want to store
      more states.
      
      For the initial implementation, COW is implemented simply via kmalloc.
      However future enhancements will change this to place the writable
      metrics somewhere else, in order to increase sharing.  Very likely
      this "somewhere else" will be the inetpeer cache.
      
      Note also that this means that metrics updates may transiently fail
      if we cannot COW the metrics successfully.
      
      But even by itself, this patch should decrease memory usage and
      increase cache locality especially for routing workloads.  In those
      cases the read-only metric copies stay in place and never get written
      to.
      
      TCP workloads where metrics get updated, and those rare cases where
      PMTU triggers occur, will take a very slight performance hit.  But
      that hit will be alleviated when the long-term writable metrics
      move to a more sharable location.
      
      Since the metrics storage went from a u32 array of RTAX_MAX entries to
      what is essentially a pointer, some retooling of the dst_entry layout
      was necessary.
      
      Most importantly, we need to preserve the alignment of the reference
      count so that it doesn't share cache lines with the read-mostly state,
      as per Eric Dumazet's alignment assertion checks.
      
      The only non-trivial bit here is the move of the 'flags' member into
      the writeable cacheline.  This is OK since we are always accessing the
      flags around the same moment when we made a modification to the
      reference count.
      Signed-off-by: David S. Miller <davem@davemloft.net>
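The idea in the commit above, keeping the writeability state in the low bits of the metrics pointer and copying on first write, can be sketched in userspace C. This is a minimal illustration and not the kernel implementation: every name here (`struct route`, `route_metrics_write_ptr`, `METRICS_MAX` and so on) is hypothetical, `malloc` stands in for `kmalloc`, and the sketch assumes metric arrays are at least 4-byte aligned so the two low pointer bits are free.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define METRICS_MAX       16      /* stand-in for RTAX_MAX; the value is an assumption */
#define METRICS_READ_ONLY 0x1UL   /* low pointer bit: storage is shared and read-only */
#define METRICS_FLAGS     0x3UL   /* low bits available for state in this sketch */

/* Shared all-zero placeholder, playing the role of 'dst_default_metrics'. */
const uint32_t default_metrics[METRICS_MAX];

struct route {
    uintptr_t metrics;            /* tagged pointer: address | flags */
};

/* Point the route at shared read-only metrics. */
void route_init(struct route *rt, const uint32_t *shared)
{
    assert(((uintptr_t)shared & METRICS_FLAGS) == 0);
    rt->metrics = (uintptr_t)shared | METRICS_READ_ONLY;
}

/* Strip the flag bits to recover the real address. */
const uint32_t *route_metrics(const struct route *rt)
{
    return (const uint32_t *)(rt->metrics & ~METRICS_FLAGS);
}

int route_metrics_read_only(const struct route *rt)
{
    return (rt->metrics & METRICS_READ_ONLY) != 0;
}

/* Return a writable metrics array, copying the shared one on first use.
 * Returns NULL if the copy cannot be allocated, so an update may fail. */
uint32_t *route_metrics_write_ptr(struct route *rt)
{
    if (route_metrics_read_only(rt)) {
        uint32_t *copy = malloc(sizeof(uint32_t) * METRICS_MAX);
        if (copy == NULL)
            return NULL;
        memcpy(copy, route_metrics(rt), sizeof(uint32_t) * METRICS_MAX);
        assert(((uintptr_t)copy & METRICS_FLAGS) == 0);
        rt->metrics = (uintptr_t)copy;   /* read-only bit is now clear */
    }
    return (uint32_t *)(rt->metrics & ~METRICS_FLAGS);
}
```

Readers always go through `route_metrics()`, so routes that never update a metric keep sharing one read-only array, which is where the memory and cache-locality benefit comes from.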
  7. 26 Jan, 2011 1 commit
    • J
      TCP: fix a bug that triggers large number of TCP RST by mistake · 44f5324b
      Committed by Jerry Chu
      This patch fixes a bug that causes TCP RST packets to be generated
      for otherwise correctly behaved applications (e.g., no unread data
      on close). To trigger the bug, at least two conditions must
      be met:
      
      1. The FIN flag is set on the last data packet, i.e., it's not on a
      separate, FIN only packet.
      2. The size of the last data chunk on the receive side matches
      exactly with the size of buffer posted by the receiver, and the
      receiver closes the socket without any further read attempt.
      
      This bug was first noticed on our netperf based testbed for our IW10
      proposal to IETF where a large number of RST packets were observed.
      netperf's read side code meets condition 2 above 100% of the time.
      
      Before the fix, tcp_data_queue() will queue the last skb that meets
      condition 1 to sk_receive_queue even though it has fully copied out
      (skb_copy_datagram_iovec()) the data. Then if condition 2 is also met,
      tcp_recvmsg() often returns all the copied out data successfully
      without actually consuming the skb, due to a check
      "if ((chunk = len - tp->ucopy.len) != 0) {"
      and
      "len -= chunk;"
      after tcp_prequeue_process() that causes "len" to become 0 and an
      early exit from the big while loop.
      
      I don't see any reason not to free the skb whose data have been fully
      consumed in tcp_data_queue(), regardless of the FIN flag.  We won't
      get there if MSG_PEEK is on. Am I missing some arcane cases related
      to urgent data?
      Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 25 Jan, 2011 4 commits
  9. 20 Jan, 2011 3 commits
  10. 19 Jan, 2011 1 commit
    • J
      netfilter: nf_conntrack: nf_conntrack snmp helper · 93557f53
      Committed by Jiri Olsa
      Add support for SNMP broadcast connection tracking. SNMP
      broadcast requests are now paired with the SNMP responses,
      which allows SNMP broadcasts to be used with a firewall enabled.
      
      Please refer to the following conversation:
      http://marc.info/?l=netfilter-devel&m=125992205006600&w=2
      
      Patrick McHardy wrote:
      > > The best solution would be to add generic broadcast tracking, the
      > > use of expectations for this is a bit of abuse.
      > > The second best choice I guess would be to move the help() function
      > > to a shared module and generalize it so it can be used for both.
      This patch implements the "second best choice".
      
      Since the netbios-ns conntrack module uses the same helper
      functionality as the snmp, only one helper function is added
      for both snmp and netbios-ns modules into the new object -
      nf_conntrack_broadcast.
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
  11. 18 Jan, 2011 2 commits
  12. 14 Jan, 2011 2 commits
  13. 13 Jan, 2011 1 commit
    • E
      netfilter: x_table: speedup compat operations · 255d0dc3
      Committed by Eric Dumazet
      One iptables invocation with 135000 rules takes 35 seconds of cpu time
      on a recent server, using a 32bit distro and a 64bit kernel.
      
      We eventually trigger NMI/RCU watchdog.
      
      INFO: rcu_sched_state detected stall on CPU 3 (t=6000 jiffies)
      
      COMPAT mode has quadratic behavior and consumes 16 bytes of memory per
      rule.
      
      Switch the xt_compat algos to use an array instead of list, and use a
      binary search to locate an offset in the sorted array.
      
      This halves the memory needed (8 bytes per rule), and removes the quadratic
      behavior [ O(N*N) -> O(N*log2(N)) ]
      
      Time of iptables goes from 35 s to 150 ms.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
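The array-plus-binary-search approach described in the commit above might look roughly like the following. This is a sketch under stated assumptions, not the xt_compat code: `struct offset_delta` and `calc_jump` are hypothetical names, and the sketch assumes each entry stores the cumulative size delta up to and including its own rule, with the array sorted by offset.

```c
#include <assert.h>
#include <stddef.h>

/* One entry per rule: 8 bytes with 4-byte int and unsigned int. */
struct offset_delta {
    unsigned int offset;   /* rule offset; the array is sorted ascending */
    int delta;             /* cumulative delta up to and including this rule */
};

/* Return the cumulative delta contributed by all entries whose offset is
 * strictly below 'offset'. Each lookup is O(log N), so translating N jump
 * targets costs O(N log N) instead of O(N * N) with a linked list. */
int calc_jump(const struct offset_delta *tab, size_t n, unsigned int offset)
{
    size_t lo = 0, hi = n;

    assert(tab != NULL || n == 0);
    /* Lower-bound search: find the first entry with tab[i].offset >= offset. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].offset < offset)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo ? tab[lo - 1].delta : 0;
}
```

Storing the running total in each entry is what lets a single lookup answer the question; with per-rule deltas only, every lookup would still have to sum a prefix of the array.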
  14. 12 Jan, 2011 2 commits
  15. 11 Jan, 2011 2 commits
  16. 10 Jan, 2011 1 commit
  17. 07 Jan, 2011 2 commits
    • P
      netfilter: fix export secctx error handling · cba85b53
      Committed by Pablo Neira Ayuso
      In 1ae4de0c, the secctx was exported
      via the /proc/net/netfilter/nf_conntrack and ctnetlink interfaces
      instead of the secmark.
      
      That patch introduced the use of security_secid_to_secctx() which may
      return a non-zero value on error.
      
      In one of my setups, I have NF_CONNTRACK_SECMARK enabled but no
      security modules. Thus, security_secid_to_secctx() returns a negative
      value that results in the breakage of the /proc and `conntrack -L'
      outputs. To fix this, we skip the inclusion of secctx if the
      aforementioned function fails.
      
      This patch also fixes the dynamic netlink message size calculation
      if security_secid_to_secctx() returns an error, since its logic is
      also wrong.
      
      This problem exists in Linux kernel >= 2.6.37.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      ipv4: IP defragmentation must be ECN aware · 6623e3b2
      Committed by Eric Dumazet
      RFC3168 (The Addition of Explicit Congestion Notification to IP)
      states :
      
      5.3.  Fragmentation
      
         ECN-capable packets MAY have the DF (Don't Fragment) bit set.
         Reassembly of a fragmented packet MUST NOT lose indications of
         congestion.  In other words, if any fragment of an IP packet to be
         reassembled has the CE codepoint set, then one of two actions MUST be
         taken:
      
            * Set the CE codepoint on the reassembled packet.  However, this
              MUST NOT occur if any of the other fragments contributing to
              this reassembly carries the Not-ECT codepoint.
      
            * The packet is dropped, instead of being reassembled, for any
              other reason.
      
      This patch implements this requirement for IPv4, choosing the first
      action :
      
      If one fragment had the Not-ECT codepoint
              reassembled frame has Not-ECT
      Else if one fragment had the CE codepoint
              reassembled frame has CE
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
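The two-branch rule at the end of the commit above can be written out as a small function. This is an illustrative sketch, not the kernel's defragmentation code: `reassembled_ecn` is a hypothetical name, the codepoint values follow RFC 3168, and keeping the first fragment's codepoint when neither branch applies is an assumption, since the commit message only spells out the Not-ECT and CE cases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ECN codepoints: the low two bits of the IPv4 TOS byte (RFC 3168). */
enum {
    ECN_NOT_ECT = 0,
    ECN_ECT_1   = 1,
    ECN_ECT_0   = 2,
    ECN_CE      = 3,
    ECN_MASK    = 3
};

/* Given the TOS byte of every fragment, pick the ECN codepoint for the
 * reassembled packet. */
uint8_t reassembled_ecn(const uint8_t *tos, size_t n)
{
    int seen_not_ect = 0;
    int seen_ce = 0;

    assert(tos != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        uint8_t ecn = tos[i] & ECN_MASK;

        if (ecn == ECN_NOT_ECT)
            seen_not_ect = 1;
        else if (ecn == ECN_CE)
            seen_ce = 1;
    }
    if (seen_not_ect)
        return ECN_NOT_ECT;    /* Not-ECT on any fragment wins */
    if (seen_ce)
        return ECN_CE;         /* otherwise CE must not be lost */
    return tos[0] & ECN_MASK;  /* assumption: keep the first fragment's codepoint */
}
```

Checking Not-ECT before CE mirrors the order in the commit message, so CE is never set on a packet when any contributing fragment was Not-ECT.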
  18. 05 Jan, 2011 1 commit
    • J
      ipv4/route.c: respect prefsrc for local routes · 9fc3bbb4
      Committed by Joel Sing
      The preferred source address is currently ignored for local routes,
      which results in all local connections having a src address that is the
      same as the local dst address. Fix this by respecting the preferred source
      address when it is provided for local routes.
      
      This bug can be demonstrated as follows:
      
       # ifconfig dummy0 192.168.0.1
       # ip route show table local | grep local.*dummy0
       local 192.168.0.1 dev dummy0  proto kernel  scope host  src 192.168.0.1
       # ip route change table local local 192.168.0.1 dev dummy0 \
           proto kernel scope host src 127.0.0.1
       # ip route show table local | grep local.*dummy0
       local 192.168.0.1 dev dummy0  proto kernel  scope host  src 127.0.0.1
      
      We now establish a local connection and verify the source IP
      address selection:
      
       # nc -l 192.168.0.1 3128 &
       # nc 192.168.0.1 3128 &
       # netstat -ant | grep 192.168.0.1:3128.*EST
       tcp        0      0 192.168.0.1:3128        192.168.0.1:33228 ESTABLISHED
       tcp        0      0 192.168.0.1:33228       192.168.0.1:3128  ESTABLISHED
      Signed-off-by: Joel Sing <jsing@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 26 Dec, 2010 1 commit
  20. 24 Dec, 2010 3 commits
  21. 21 Dec, 2010 2 commits
  22. 17 Dec, 2010 1 commit