1. 31 Oct, 2008 1 commit
  2. 30 Oct, 2008 3 commits
  3. 29 Oct, 2008 6 commits
    • udp: calculate udp_mem based on low memory instead of all memory · 8203efb3
      Committed by Eric Dumazet
      This patch mimics commit 57413ebc
      (tcp: calculate tcp_mem based on low memory instead of all memory)
      
      The udp_mem array, which contains limits on the total amount of memory
      used by UDP sockets, is calculated based on nr_all_pages.  On a 32-bit
      x86 system, we should base this on the number of lowmem pages.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
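The idea in the commit above can be sketched in user space as follows. This is a minimal illustration, not the kernel's code: the type `udp_mem_limits`, the function `udp_mem_from_lowmem`, the divisor of 8, the floor of 128 pages and the 3/4 and 2x ratios are all assumptions chosen for the example. The only point taken from the commit is that the limits are derived from low memory (total pages minus highmem pages) rather than from all pages.

```c
#include <assert.h>

/* Hypothetical container for the three udp_mem thresholds, in pages. */
struct udp_mem_limits {
    unsigned long min;
    unsigned long pressure;
    unsigned long max;
};

/*
 * Derive the thresholds from low memory only.
 * The divisor, floor and ratios are illustrative assumptions,
 * not the kernel's actual formula.
 */
struct udp_mem_limits udp_mem_from_lowmem(unsigned long total_pages,
                                          unsigned long high_pages)
{
    struct udp_mem_limits l;
    unsigned long low_pages;
    unsigned long limit;

    assert(high_pages <= total_pages);
    low_pages = total_pages - high_pages;   /* ignore highmem entirely */

    limit = low_pages / 8;
    if (limit < 128)
        limit = 128;

    l.min = limit / 4 * 3;
    l.pressure = limit;
    l.max = l.min * 2;
    return l;
}
```

On a 32-bit machine with a large highmem zone, the thresholds then track the much smaller lowmem pool that socket buffers actually come from.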
    • udp: RCU handling for Unicast packets. · 271b72c7
      Committed by Eric Dumazet
      Goals are :
      
      1) Optimizing handling of incoming Unicast UDP frames, so that no memory
       writes should happen in the fast path.
      
       Note: Multicasts and broadcasts still will need to take a lock,
       because doing a full lockless lookup in this case is difficult.
      
      2) No expensive operations in the socket bind/unhash phases :
        - No expensive synchronize_rcu() calls.
      
        - No added rcu_head in socket structure, increasing memory needs,
        but more important, forcing us to use call_rcu() calls,
        that have the bad property of making sockets structure cold.
        (rcu grace period between socket freeing and its potential reuse
         make this socket being cold in CPU cache).
        David did a previous patch using call_rcu() and noticed a 20%
        impact on TCP connection rates.
        Quoting Christoph Lameter :
         "Right. That results in cacheline cooldown. You'd want to recycle
          the object as they are cache hot on a per cpu basis. That is screwed
          up by the delayed regular rcu processing. We have seen multiple
          regressions due to cacheline cooldown.
          The only choice in cacheline hot sensitive areas is to deal with the
          complexity that comes with SLAB_DESTROY_BY_RCU or give up on RCU."
      
        - Because udp sockets are allocated from dedicated kmem_cache,
        use of SLAB_DESTROY_BY_RCU can help here.
      
      Theory of operation :
      ---------------------
      
      As the lookup is lockfree (using rcu_read_lock()/rcu_read_unlock()),
      special attention must be taken by readers and writers.
      
      Use of SLAB_DESTROY_BY_RCU is tricky too, because a socket can be freed,
      reused, inserted in a different chain or in the worst case in the same chain
      while readers could be doing lookups at the same time.
      
      In order to avoid loops, a reader must check that each socket found in a chain
      really belongs to the chain the reader was traversing. If it finds a
      mismatch, the lookup must start again at the beginning. This *restart* loop
      is the reason we had to use rdlock for the multicast case, because
      we don't want to send the same message several times to the same socket.
      
      We use RCU only for fast path.
      Thus, /proc/net/udp still takes spinlocks.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
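The restart rule described in the commit above can be sketched in plain C. This is a single-threaded illustration under stated assumptions: `sock_sketch`, `slot_of`, `lookup`, `SLOTS` and `MAX_RESTARTS` are hypothetical names, and the retry bound exists only so the example terminates. Real code would run the traversal inside rcu_read_lock()/rcu_read_unlock() and would not need such a bound.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 128          /* hypothetical table size */
#define MAX_RESTARTS 8     /* illustration only: keeps the sketch finite */

struct sock_sketch {
    unsigned short port;
    struct sock_sketch *next;
};

static unsigned int slot_of(unsigned short port)
{
    return port & (SLOTS - 1);
}

/*
 * Walk one chain. If a node turns out to belong to another chain
 * (it was freed and reused while we were reading), start over.
 */
struct sock_sketch *lookup(struct sock_sketch *chains[SLOTS],
                           unsigned short port, int *restarts)
{
    unsigned int slot = slot_of(port);
    struct sock_sketch *sk;

    *restarts = 0;
begin:
    for (sk = chains[slot]; sk != NULL; sk = sk->next) {
        if (slot_of(sk->port) != slot) {
            /* Node no longer belongs to the chain we are walking. */
            if (*restarts >= MAX_RESTARTS)
                return NULL;
            (*restarts)++;
            goto begin;
        }
        if (sk->port == port)
            return sk;
    }
    return NULL;
}
```

Because a restart may revisit sockets already seen, the same approach would deliver a multicast message twice to one socket, which is the reason the commit gives for keeping a lock on that path.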
    • udp: introduce struct udp_table and multiple spinlocks · 645ca708
      Committed by Eric Dumazet
      UDP sockets are hashed in a 128 slots hash table.
      
      This hash table is protected by *one* rwlock.
      
      This rwlock is readlocked each time an incoming UDP message is handled.
      
      This rwlock is writelocked each time a socket must be inserted in
      hash table (bind time), or deleted from this table (close time)
      
      This is not scalable on SMP machines :
      
      1) Even in read mode, lock() and unlock() are atomic operations and
       must dirty a contended cache line, shared by all cpus.
      
      2) A writer might be starved if many readers are 'in flight'. This can
       happen on a machine with some NIC receiving many UDP messages. User
       process can be delayed a long time at socket creation/dismantle time.
      
      This patch prepares RCU migration, by introducing 'struct udp_table
      and struct udp_hslot', and using one spinlock per chain, to reduce
      contention on central rwlock.
      
      Introducing one spinlock per chain reduces latencies for port
      randomization on heavily loaded UDP servers. This also speeds up
      bindings to specific ports.
      
      udp_lib_unhash() was uninlined, as it had become too big.
      
      Some cleanups were done to ease review of following patch
      (RCUification of UDP Unicast lookups)
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
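A rough user-space picture of the layout this commit describes, with one lock per hash chain instead of one lock for the whole table. The names `udp_table` and `udp_hslot` come from the commit text; the field names, the `udp_sock_sketch` type, the helper functions and the use of a C11 `atomic_flag` as a stand-in spinlock are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define UDP_HTABLE_SIZE 128   /* the commit mentions a 128 slot table */

struct udp_sock_sketch {      /* hypothetical stand-in for a socket */
    unsigned short port;
    struct udp_sock_sketch *next;
};

struct udp_hslot {            /* one chain plus its own lock */
    struct udp_sock_sketch *head;
    atomic_flag lock;
};

struct udp_table {
    struct udp_hslot hash[UDP_HTABLE_SIZE];
};

void udp_table_init(struct udp_table *t)
{
    for (int i = 0; i < UDP_HTABLE_SIZE; i++) {
        t->hash[i].head = NULL;
        atomic_flag_clear(&t->hash[i].lock);
    }
}

static void hslot_lock(struct udp_hslot *s)
{
    while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
        ;   /* spin until the holder releases this chain */
}

static void hslot_unlock(struct udp_hslot *s)
{
    atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Insert at the head of the chain, taking only that chain's lock. */
void udp_table_insert(struct udp_table *t, struct udp_sock_sketch *sk)
{
    struct udp_hslot *s = &t->hash[sk->port & (UDP_HTABLE_SIZE - 1)];

    hslot_lock(s);
    sk->next = s->head;
    s->head = sk;
    hslot_unlock(s);
}
```

Two sockets binding to ports that hash to different slots never touch the same lock, which is where the reduced contention comes from.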
    • 0c6ce78a
    • net: don't use INIT_RCU_HEAD · 93adcc80
      Committed by Alexey Dobriyan
      call_rcu() will unconditionally rewrite RCU head anyway.
      Applies to 
      	struct neigh_parms
      	struct neigh_table
      	struct net
      	struct cipso_v4_doi
      	struct in_ifaddr
      	struct in_device
      	rt->u.dst
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: reduce structures when XFRM=n · def8b4fa
      Committed by Alexey Dobriyan
      ifdef out
      * struct sk_buff::sp		(pointer)
      * struct dst_entry::xfrm	(pointer)
      * struct sock::sk_policy	(2 pointers)
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 28 Oct, 2008 1 commit
    • net: implement emergency route cache rebulds when gc_elasticity is exceeded · 1080d709
      Committed by Neil Horman
      This is a patch to provide on-demand route cache rebuilding.  Currently, our
      route cache is rebuilt periodically regardless of need.  This introduces
      unneeded periodic latency.  This patch offers a better approach.  Using code
      provided by Eric Dumazet, we compute the standard deviation of the average hash
      bucket chain length while running rt_check_expire.  Should any given chain
      length grow larger than the average plus 4 standard deviations, we trigger an
      emergency hash table rebuild for that net namespace.  This allows the common
      case, in which chains are well behaved and do not grow unevenly, to incur no
      latency at all, while systems that may be under malicious attack rebuild
      only when the attack is detected.  This patch takes 2 other factors into
      account:
      1) chains with multiple entries that differ by attributes that do not affect the
      hash value are only counted once, so as not to unduly bias the system toward
      rebuilding if features like QOS are heavily used
      2) if rebuilding crosses a certain threshold (which is adjustable via the added
      sysctl in this patch), route caching is disabled entirely for that net
      namespace, since constant rebuilding is less efficient than no caching at all
      
      Tested successfully by me.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
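The trigger condition described above (a chain longer than the average plus 4 standard deviations) can be sketched as follows. This is an illustration only: the function name `chain_needs_rebuild` and the use of floating point are assumptions, and the kernel works with integer arithmetic inside rt_check_expire. The comparison is done on squared values so that no square root is needed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if a chain of length `len` exceeds mean + 4 * sd of the
 * sampled chain lengths, 0 otherwise. Hypothetical helper.
 */
int chain_needs_rebuild(const unsigned int *lengths, size_t n,
                        unsigned int len)
{
    double sum = 0.0, sq = 0.0, mean, var, d;
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        sum += lengths[i];
        sq += (double)lengths[i] * lengths[i];
    }
    mean = sum / n;
    var = sq / n - mean * mean;     /* population variance */
    if (var < 0.0)
        var = 0.0;                  /* guard against rounding */

    d = len - mean;
    /* d > 4 * sd  is equivalent to  d > 0 and d*d > 16 * var */
    return d > 0.0 && d * d > 16.0 * var;
}
```

With chain lengths alternating between 1 and 2, the mean is 1.5 and the standard deviation 0.5, so the threshold sits at 3.5: a chain of 3 is tolerated and a chain of 4 would trigger a rebuild.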
  5. 27 Oct, 2008 1 commit
  6. 24 Oct, 2008 1 commit
    • tcp: Restore ordering of TCP options for the sake of inter-operability · fd6149d3
      Committed by Ilpo Järvinen
      This is not our bug! Sadly some devices cannot cope with the change
      of TCP option ordering which was a result of the recent rewrite of
      the option code (not that there was some particular reason stemming
      from the rewrite for the reordering) though any ordering of TCP
      options is perfectly legal. Thus we restore the original ordering
      to allow interoperability with/through such broken devices and add
      some warning about this trap. Since the reordering just happened
      without any particular reason, this change shouldn't cost us
      anything.
      
      There are already a couple of known failure reports (within close
      proximity of the last release), so the problem might be more
      widespread than a single device. There are also other reports which may
      be due to the same problem though the symptoms were less obvious.
      Analysis of one of the cases revealed (with very high probability)
      that sack capability cannot be negotiated as the first option
      (SYN never got a response).
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Reported-by: Aldo Maggi <sentiniate@tiscali.it>
      Tested-by: Aldo Maggi <sentiniate@tiscali.it>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 22 Oct, 2008 1 commit
    • tcp: should use number of sack blocks instead of -1 · 75e3d8db
      Committed by Ilpo Järvinen
      While looking for the recent "sack issue" I also read all eff_sacks
      usage that was played around by some relevant commit. I found
      out that there's another thing that is asking for a fix (unrelated
      to the "sack issue" though).
      
      This feature has probably very little significance in practice.
      Opposite direction timeout with bidirectional tcp comes to me as
      the most likely scenario though there might be other cases as
      well related to non-data segments we send (e.g., response to the
      opposite direction segment). Also some ACK losses or option space
      wasted for other purposes is necessary to prevent the earlier
      SACK feedback getting to the sender.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 20 Oct, 2008 2 commits
  9. 17 Oct, 2008 3 commits
  10. 15 Oct, 2008 2 commits
    • netfilter: ctnetlink: remove bogus module dependency between ctnetlink and nf_nat · e6a7d3c0
      Committed by Pablo Neira Ayuso
      This patch removes the module dependency between ctnetlink and
      nf_nat by means of an indirect call that is initialized when
      nf_nat is loaded. Now, nf_conntrack_netlink only requires
      nf_conntrack and nfnetlink.
      
      This patch puts nfnetlink_parse_nat_setup_hook into the
      nf_conntrack_core to avoid dependencies between ctnetlink,
      nf_conntrack_ipv4 and nf_conntrack_ipv6.
      
      This patch also introduces the function ctnetlink_change_nat
      that is only invoked from the creation path. Actually, the
      nat handling cannot be invoked from the update path since
      this is not allowed. By introducing this function, we remove
      the useless nat handling in the update path and we avoid
      deadlock-prone code.
      
      This patch also adds the required EAGAIN logic for nfnetlink.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netfilter: restore lost #ifdef guarding defrag exception · 38f7ac3e
      Committed by Patrick McHardy
      Nir Tzachar <nir.tzachar@gmail.com> reported a warning when sending
      fragments over loopback with NAT:
      
      [ 6658.338121] WARNING: at net/ipv4/netfilter/nf_nat_standalone.c:89 nf_nat_fn+0x33/0x155()
      
      The reason is that defragmentation is skipped for already tracked connections.
      This is wrong in combination with NAT and ip_conntrack actually had some ifdefs
      to avoid this behaviour when NAT is compiled in.
      
      The entire "optimization" may seem a bit silly, for now simply restoring the
      lost #ifdef is the easiest solution until we can come up with something better.
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 14 Oct, 2008 1 commit
  12. 12 Oct, 2008 1 commit
  13. 11 Oct, 2008 2 commits
  14. 10 Oct, 2008 11 commits
    • cipso: Add support for native local labeling and fixup mapping names · 15c45f7b
      Committed by Paul Moore
      This patch accomplishes three minor tasks: add a new tag type for local
      labeling, rename the CIPSO_V4_MAP_STD define to CIPSO_V4_MAP_TRANS and
      replace some of the CIPSO "magic numbers" with constants from the header
      file.  The first change allows CIPSO to support full LSM labels/contexts,
      not just MLS attributes.  The second change brings the mapping names in line
      with what userspace is using; compatibility is preserved since we don't
      actually change the value.  The last change is to aid readability and help
      prevent mistakes.
      Signed-off-by: Paul Moore <paul.moore@hp.com>
    • selinux: Set socket NetLabel based on connection endpoint · 014ab19a
      Committed by Paul Moore
      Previous work enabled the use of address based NetLabel selectors, which while
      highly useful, brought the potential for additional per-packet overhead when
      used.  This patch attempts to solve that by applying NetLabel socket labels
      when sockets are connect()'d.  This should alleviate the per-packet NetLabel
      labeling for all connected sockets (yes, it even works for connected DGRAM
      sockets).
      Signed-off-by: Paul Moore <paul.moore@hp.com>
      Reviewed-by: James Morris <jmorris@namei.org>
    • netlabel: Add functionality to set the security attributes of a packet · 948bf85c
      Committed by Paul Moore
      This patch builds upon the new NetLabel address selector functionality by
      providing the NetLabel KAPI and CIPSO engine support needed to enable the
      new packet-based labeling.  The only new addition to the NetLabel KAPI at
      this point is shown below:
      
       * int netlbl_skbuff_setattr(skb, family, secattr)
      
      ... and is designed to be called from a Netfilter hook after the packet's
      IP header has been populated such as in the FORWARD or LOCAL_OUT hooks.
      
      This patch also provides the necessary SELinux hooks to support this new
      functionality.  Smack support is not currently included due to uncertainty
      regarding the permissions needed to expand the Smack network access controls.
      Signed-off-by: Paul Moore <paul.moore@hp.com>
      Reviewed-by: James Morris <jmorris@namei.org>
    • netlabel: Replace protocol/NetLabel linking with refrerence counts · b1edeb10
      Committed by Paul Moore
      NetLabel has always had a list of backpointers in the CIPSO DOI definition
      structure which pointed to the NetLabel LSM domain mapping structures which
      referenced the CIPSO DOI struct.  The rationale for this was that when an
      administrator removed a CIPSO DOI from the system all of the associated
      NetLabel LSM domain mappings should be removed as well; a list of
      backpointers made this a simple operation.
      
      Unfortunately, while the backpointers did make the removal easier they were
      a bit of a mess from an implementation point of view which was making
      further development difficult.  Since the removal of a CIPSO DOI is a
      relatively rare event it seems to make sense to remove this backpointer
      list as the optimization was hurting us more than it was helping.  However,
      we still need to be able to track when a CIPSO DOI definition is being used
      so replace the backpointer list with a reference count.  In order to
      preserve the current functionality of removing the associated LSM domain
      mappings when a CIPSO DOI is removed we walk the LSM domain mapping table,
      removing the relevant entries.
      Signed-off-by: Paul Moore <paul.moore@hp.com>
      Reviewed-by: James Morris <jmorris@namei.org>
    • udp: complete port availability checking · f24d43c0
      Committed by Eric Dumazet
      While looking at UDP port randomization, I noticed it
      was a little bit pessimistic, not looking at the type of sockets
      (IPV6/IPV4) and not looking at bound addresses, if any.
      
      We should perform the same tests as when binding to a
      specific port.
      
      This permits a cleanup of udp_lib_get_port()
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcpv[46]: fix md5 pseudoheader address field ordering · 78e645cb
      Committed by Ilpo Järvinen
      Maybe it's just me but I guess those md5 people made a mess
      out of it by having *_md5_hash_* to use daddr, saddr order
      instead of the one that is natural (and equal to what csum
      functions use). For the segment we're sending, the original
      addresses are reversed so buff's saddr == skb's daddr and
      vice-versa.
      
      Maybe I can finally proceed with unification of some code
      after fixing it first... :-)
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: Make tunnel RX/TX byte counters more consistent · 64194c31
      Committed by Herbert Xu
      This patch makes the RX/TX byte counters for IPIP, GRE and SIT more
      consistent.  Previously we included the external IP headers on the
      way out but not when the packet is inbound.
      
      The new scheme is to count payload only in both directions.  For
      IPIP and SIT this simply means the exclusion of the external IP
      header.  For GRE this means that we exclude the GRE header as
      well.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
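The payload-only accounting described above amounts to simple arithmetic, sketched here. The enum `tunnel_kind` and the function `tunnel_counted_bytes` are hypothetical names for illustration; the header lengths are passed in by the caller because they vary (IP options, optional GRE fields).

```c
#include <assert.h>
#include <stddef.h>

enum tunnel_kind { TUNNEL_IPIP, TUNNEL_SIT, TUNNEL_GRE };

/*
 * Bytes to add to the RX or TX counter for one packet:
 * payload only, in both directions.
 */
size_t tunnel_counted_bytes(enum tunnel_kind kind, size_t pkt_len,
                            size_t outer_ip_hlen, size_t gre_hlen)
{
    size_t overhead = outer_ip_hlen;    /* excluded for every tunnel type */

    if (kind == TUNNEL_GRE)
        overhead += gre_hlen;           /* GRE also excludes its own header */

    assert(pkt_len >= overhead);
    return pkt_len - overhead;
}
```

For example, a 1500-byte packet with a 20-byte outer IP header counts as 1480 bytes on IPIP or SIT, and as 1476 bytes on GRE with a 4-byte GRE header.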
    • gre: Add Transparent Ethernet Bridging · e1a80002
      Committed by Herbert Xu
      This patch adds support for Ethernet over GRE encapsulation.
      This is exposed to user-space with a new link type of "gretap"
      instead of "gre".  It will create an ARPHRD_ETHER device in
      lieu of the usual ARPHRD_IPGRE.
      
      Note that to preserve backwards compatibility all Transparent
      Ethernet Bridging packets are passed to an ARPHRD_IPGRE tunnel
      if its key matches and there is no ARPHRD_ETHER device whose
      key matches more closely.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gre: Add netlink interface · c19e654d
      Committed by Herbert Xu
      This patch adds a netlink interface that will eventually displace
      the existing ioctl interface.  It utilises the elegant rtnl_link_ops
      mechanism.
      
      This also means that user-space no longer needs to rely on the
      tunnel interface being of type GRE to identify GRE tunnels.  The
      identification can now occur using rtnl_link_ops.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gre: Move MTU setting out of ipgre_tunnel_bind_dev · 42aa9162
      Committed by Herbert Xu
      This patch moves the dev->mtu setting out of ipgre_tunnel_bind_dev.
      This is in preparation for using rtnl_link where we'll need to make
      the MTU setting conditional on whether the user has supplied an
      MTU.  This also requires the move of the ipgre_tunnel_bind_dev
      call out of the dev->init function so that we can access the user
      parameters later.
      
      This patch also adds a check to prevent setting the MTU below
      the minimum of 68.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
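The two rules in this commit (use the MTU the user supplied if there is one, and never go below 68) can be sketched as below. The function `gre_effective_mtu` and its parameters are hypothetical, and clamping to the minimum rather than rejecting the value is an assumption of this sketch; the commit text only says that values below 68 are prevented.

```c
#include <assert.h>

#define GRE_MIN_MTU 68   /* minimum stated in the commit message */

/*
 * Pick the MTU for a tunnel device.
 * user_mtu_set: non-zero if the user supplied an MTU.
 * computed_mtu: value derived from the underlying device.
 */
int gre_effective_mtu(int user_mtu_set, int user_mtu, int computed_mtu)
{
    int mtu = user_mtu_set ? user_mtu : computed_mtu;

    if (mtu < GRE_MIN_MTU)
        mtu = GRE_MIN_MTU;   /* assumption: clamp instead of failing */
    return mtu;
}
```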
    • gre: Use needed_headroom · c95b819a
      Committed by Herbert Xu
      Now that we have dev->needed_headroom, we can use it instead of
      having a bogus dev->hard_header_len.  This also allows us to
      include dev->hard_header_len in the MTU computation so that when
      we do have a meaningful hard_header_len in future it is included
      automatically in figuring out the MTU.
      
      Incidentally, this fixes a bug where we ignored the needed_headroom
      field of the underlying device in calculating our own hard_header_len.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 09 Oct, 2008 4 commits
    • ipvs: Remove stray file left over from ipvs move · 071d7ab6
      Committed by Sven Wegener
      Commit cb7f6a7b ("IPVS: Move IPVS to
      net/netfilter/ipvs") has left a stray file in the old location of ipvs.
      Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: cleanup of local_port_range · 3c689b73
      Committed by Eric Dumazet
      I noticed sysctl_local_port_range[] and its associated seqlock
      sysctl_local_port_range_lock were on separate cache lines.
      Moreover, sysctl_local_port_range[] was close to unrelated
      variables, highly modified, leading to cache misses.
      
      Moving these two variables in a structure can help data
      locality and moving this structure to read_mostly section
      helps sharing of this data among cpus.
      
      Cleanup of extern declarations (moved in include file where
      they belong), and use of inet_get_local_port_range()
      accessor instead of direct access to ports values.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • udp: Improve port randomization · 9088c560
      Committed by Eric Dumazet
      Current UDP port allocation is suboptimal.
      We select the shortest chain to choose a port (out of 512)
      that will hash in this shortest chain.
      
      First, it can yield ports that are not so random and
      give attackers more opportunities to break the system.
      
      Second, it can consume a lot of CPU to scan the whole table
      in order to find the shortest chain.
      
      Third, in some pathological cases we can fail to find
      a free port even if there are plenty of them.
      
      This patch zaps the search for a short chain and only
      uses one random seed. The problem of getting long chains
      should be addressed in another way, since we can
      obtain long chains with non-random ports.
      
      Based on a report and patch from Vitaly Mayatskikh
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
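The approach described above (one random starting point, then a linear scan with wrap-around, with no search for the shortest chain) might look roughly like this. The function `pick_port`, the `used` byte map and the way the random value is supplied are assumptions for the sketch; the kernel's availability test is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Pick a free port in [low, high].
 * rnd:  a random value supplied by the caller (the single "seed").
 * used: used[p - low] is non-zero if port p is taken (hypothetical map).
 * Returns the chosen port, or -1 if every port in the range is taken.
 */
int pick_port(int low, int high, unsigned int rnd,
              const unsigned char *used)
{
    int range, first, port;

    assert(low <= high);
    range = high - low + 1;
    first = low + (int)(rnd % (unsigned int)range);
    port = first;

    do {
        if (!used[port - low])
            return port;
        if (++port > high)
            port = low;          /* wrap around to the start of the range */
    } while (port != first);

    return -1;                   /* scanned the full range, nothing free */
}
```

Every port in the range is visited exactly once before giving up, so a free port is always found if one exists, which addresses the pathological failure case mentioned in the commit message.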
    • tcp: fix length used for checksum in a reset · 52cd5750
      Committed by Ilpo Järvinen
      While looking for some common code I came across difference
      in checksum calculation between tcp_v6_send_(reset|ack) I
      couldn't explain. I checked both v4 and v6 and found out that
      both seem to have the same "feature". I couldn't find anything
      in rfc nor anywhere else which would state that md5 option
      should be ignored like it was in case of reset so I came to
      a conclusion that this is probably a genuine bug. I suspect
      that addition of md5 just was fooled by the excessive
      copy-paste code in those functions and the reset part was
      never tested well enough to find out the problem.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>