1. 18 Jun, 2009 · 1 commit
    • ipv4: Fix fib_trie rebalancing, part 2 · 7b85576d
      Committed by Jarek Poplawski
      My previous patch, which explicitly delays freeing of tnodes by adding
      them to the list to flush them after the update is finished, isn't
      strict enough. It makes an exception for tnodes without a parent, assuming
      they are newly created and so still "invisible" to the read side.
      
      But the top tnode has no parent either, so we have to exclude
      all exceptions (at least until a better way is found). Additionally we
      need to move rcu assignment of this node before flushing, so the
      return type of the trie_rebalance() function is changed.
      Reported-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7b85576d
  2. 15 Jun, 2009 · 1 commit
    • ipv4: Fix fib_trie rebalancing · e0f7cb8c
      Committed by Jarek Poplawski
      During trie_rebalance(), resize(), inflate() and halve() RCU-free
      tnodes before updating their parents. This relies on RCU delaying the
      real destruction, but RCU readers that start after call_rcu() and
      before the parent update could access freed memory.
      
      This is currently prevented with preempt_disable() on the update side,
      but that is not safe, except maybe with classic RCU, and it conflicts
      with the GFP_KERNEL memory allocations made from these functions.
      
      This patch explicitly delays freeing of tnodes by adding them to the
      list, which is flushed after the update is finished.
      Reported-by: Yan Zheng <zheng.yan@oracle.com>
      Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e0f7cb8c
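The deferred-free idea described in this commit message can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct node`, `struct free_list`, `node_free_safe()` and `node_free_flush()` are hypothetical names, not the kernel's fib_trie code, and `free()` stands in for the RCU-deferred free the kernel would use.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a trie node; not the kernel's struct tnode. */
struct node {
    struct node *pending_next;  /* links nodes that are waiting to be freed */
    int key;
};

/* Nodes replaced during an update, kept until the update is finished. */
struct free_list {
    struct node *head;
    int count;
};

/* Instead of freeing right away, remember the node for later. */
static void node_free_safe(struct free_list *fl, struct node *n)
{
    n->pending_next = fl->head;
    fl->head = n;
    fl->count++;
}

/*
 * Meant to be called only after the parent pointers have been updated,
 * so that no new reader can reach the old nodes. Here the nodes go to
 * free(); that is an assumption of this sketch, the kernel would hand
 * them to an RCU-deferred free instead.
 * Returns the number of nodes released.
 */
static int node_free_flush(struct free_list *fl)
{
    int released = 0;

    while (fl->head) {
        struct node *n = fl->head;

        fl->head = n->pending_next;
        free(n);
        released++;
    }
    fl->count = 0;
    return released;
}
```

An updater would call `node_free_safe()` for every node it replaces while rebalancing, publish the new parent pointers, and only then call `node_free_flush()`.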
  3. 14 Jun, 2009 · 3 commits
  4. 12 Jun, 2009 · 1 commit
    • netfilter: ip_tables: fix build error · 24992eac
      Committed by Patrick McHardy
      Fix build error introduced by commit bb70dfa5 (netfilter: xtables:
      consolidate comefrom debug cast access):
      
      net/ipv4/netfilter/ip_tables.c: In function 'ipt_do_table':
      net/ipv4/netfilter/ip_tables.c:421: error: 'comefrom' undeclared (first use in this function)
      net/ipv4/netfilter/ip_tables.c:421: error: (Each undeclared identifier is reported only once
      net/ipv4/netfilter/ip_tables.c:421: error: for each function it appears in.)
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      24992eac
  5. 11 Jun, 2009 · 1 commit
    • net: No more expensive sock_hold()/sock_put() on each tx · 2b85a34e
      Committed by Eric Dumazet
      One of the problems with sock memory accounting is that it uses
      a pair of sock_hold()/sock_put() for each transmitted packet.
      
      This slows down bidirectional flows because the receive path
      also needs to take a refcount on socket and might use a different
      cpu than transmit path or transmit completion path. So these
      two atomic operations also trigger cache line bounces.
      
      We can see this in tx or tx/rx workloads (media gateways for example),
      where sock_wfree() can be in top five functions in profiles.
      
      We use this sock_hold()/sock_put() so that sock freeing
      is delayed until all tx packets are completed.
      
      As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
      by one unit at init time, until sk_free() is called.
      Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
      to decrement initial offset and atomically check if any packets
      are in flight.
      
      skb_set_owner_w() doesn't call sock_hold() anymore
      
      sock_wfree() doesn't call sock_put() anymore, but checks if sk_wmem_alloc
      reached 0 to perform the final freeing.
      
      Drawback is that a skb->truesize error could lead to unfreeable sockets, or
      even worse, prematurely calling __sk_free() on a live socket.
      
      Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s
      on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt
      contention point. 5 % speedup on a UDP transmit workload (depends
      on number of flows), lowering TX completion cpu usage.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2b85a34e
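The counter-offset scheme from this commit message can be illustrated with C11 atomics. This is a minimal userspace sketch, not the kernel code: `struct mini_sock` and every `mini_*` function are hypothetical names, and a `freed` flag stands in for the real final free.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical miniature socket; only the write-memory counter matters. */
struct mini_sock {
    atomic_int wmem_alloc;  /* bytes in flight, plus an offset of 1 while alive */
    bool freed;             /* set once the final free has run */
};

static void mini_sock_init(struct mini_sock *sk)
{
    atomic_init(&sk->wmem_alloc, 1);  /* the initial offset */
    sk->freed = false;
}

static void mini_final_free(struct mini_sock *sk)
{
    sk->freed = true;
}

/* Transmit path: charge the packet size; no extra reference is taken. */
static void mini_set_owner_w(struct mini_sock *sk, int truesize)
{
    atomic_fetch_add(&sk->wmem_alloc, truesize);
}

/* Tx completion: uncharge; whoever brings the counter to 0 frees. */
static void mini_wfree(struct mini_sock *sk, int truesize)
{
    /* atomic_fetch_sub() returns the value held before the subtraction. */
    if (atomic_fetch_sub(&sk->wmem_alloc, truesize) == truesize)
        mini_final_free(sk);
}

/* Owner drops the socket: remove the offset, free only if nothing is in flight. */
static void mini_sk_free(struct mini_sock *sk)
{
    if (atomic_fetch_sub(&sk->wmem_alloc, 1) == 1)
        mini_final_free(sk);
}
```

The sketch also makes the stated drawback visible: if a completion uncharges a different size than was charged, the counter either never reaches 0 or reaches it while the socket is still in use.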
  6. 10 Jun, 2009 · 1 commit
  7. 09 Jun, 2009 · 2 commits
  8. 08 Jun, 2009 · 1 commit
    • netfilter: nf_ct_icmp: keep the ICMP ct entries longer · f87fb666
      Committed by Jan Kasprzak
      Current conntrack code kills the ICMP conntrack entry as soon as
      the first reply is received. This is incorrect, as we then see only
      the first ICMP echo reply out of several possible duplicates as
      ESTABLISHED, while the rest will be INVALID. Also this unnecessarily
      increases the conntrackd traffic on H-A firewalls.
      
      Make all the ICMP conntrack entries (including the replied ones)
      last for the default of nf_conntrack_icmp{,v6}_timeout seconds.
      Signed-off-by: Jan "Yenya" Kasprzak <kas@fi.muni.cz>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      f87fb666
  9. 05 Jun, 2009 · 1 commit
  10. 04 Jun, 2009 · 2 commits
  11. 03 Jun, 2009 · 3 commits
    • net: skb->dst accessors · adf30907
      Committed by Eric Dumazet
      Define three accessors to get/set dst attached to a skb
      
      struct dst_entry *skb_dst(const struct sk_buff *skb)
      
      void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
      
      void skb_dst_drop(struct sk_buff *skb)
      This one should replace occurrences of :
      dst_release(skb->dst)
      skb->dst = NULL;
      
      Delete skb->dst field
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      adf30907
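The three accessors named in this commit message can be modelled on simplified types to show how `skb_dst_drop()` folds the release-then-clear pair into one call. This is a sketch under assumptions: the structures are reduced to one field each, the field name `dst_private` is hypothetical, and `dst_release()` is reduced to a plain counter decrement.

```c
#include <stddef.h>

/* Hypothetical simplified types; the kernel structures are far larger. */
struct dst_entry {
    int refcnt;
};

struct sk_buff {
    struct dst_entry *dst_private;  /* meant to be touched only via accessors */
};

/* Simplified release: the real one is atomic and may free the entry. */
static void dst_release(struct dst_entry *dst)
{
    if (dst)
        dst->refcnt--;
}

static struct dst_entry *skb_dst(const struct sk_buff *skb)
{
    return skb->dst_private;
}

static void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
{
    skb->dst_private = dst;
}

/* Replaces the open-coded "dst_release(skb->dst); skb->dst = NULL;" pair. */
static void skb_dst_drop(struct sk_buff *skb)
{
    dst_release(skb_dst(skb));
    skb_dst_set(skb, NULL);
}
```

Routing all reads and writes through accessors leaves the storage behind them free to change later without touching callers.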
    • net: skb->rtable accessor · 511c3f92
      Committed by Eric Dumazet
      Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb
      
      Delete skb->rtable field
      
      Setting rtable is not allowed, just set dst instead as rtable is an alias.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      511c3f92
    • netfilter: conntrack: simplify event caching system · 17e6e4ea
      Committed by Pablo Neira Ayuso
      This patch simplifies the conntrack event caching system by removing
      several events:
      
       * IPCT_[*]_VOLATILE, IPCT_HELPINFO and IPCT_NATINFO have been deleted
         since they have no clients.
       * IPCT_COUNTER_FILLING which is a leftover of the 32-bits counter
         days.
       * IPCT_REFRESH which is not of any use since we always include the
         timeout in the messages.
      
      After this patch, the existing events are:
      
       * IPCT_NEW, IPCT_RELATED and IPCT_DESTROY, that are used to identify
       addition and deletion of entries.
       * IPCT_STATUS, that notes that the status bits have changed,
       eg. IPS_SEEN_REPLY and IPS_ASSURED.
       * IPCT_PROTOINFO, that reports that internal protocol information has
       changed, eg. the TCP, DCCP and SCTP protocol state.
       * IPCT_HELPER, that notes that a helper has been assigned or unassigned
       to this entry.
       * IPCT_MARK and IPCT_SECMARK, that report that the mark has changed, this
       covers the case when a mark is set to zero.
       * IPCT_NATSEQADJ, to report that there are updates in the NAT sequence
       adjustment.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      17e6e4ea
  12. 02 Jun, 2009 · 2 commits
  13. 30 May, 2009 · 1 commit
    • tcp: fix loop in ofo handling code and reduce its complexity · 2df9001e
      Committed by Ilpo Järvinen
      Somewhat luckily, I was looking into these parts with very fine
      comb because I've made somewhat similar changes on the same
      area (conflicts that arose weren't that lucky though). The loop
      was very much overengineered recently in commit 91521944
      (tcp: Use SKB queue and list helpers instead of doing it
      by-hand), while it basically just wants to know if there are
      skbs after 'skb'.
      
      Also it got broken because skb1 = skb->next got translated into
      skb1 = skb1->next (though abstracted) improperly. Note that
      'skb1' is pointing to previous sk_buff than skb or NULL if at
      head. Two things went wrong:
      - We'll kfree 'skb' on the first iteration instead of the
        skbuff following 'skb' (it would require SACK reneging
        to recover I think).
      - The list head case where 'skb1' is NULL is checked too early
        and the loop won't execute whereas it previously did.
      
      Conclusion, mostly revert the recent changes which makes the
      cset very messy looking but using proper accessor in the
      previous-like version.
      
      The effective changes against the original can be viewed with:
        git-diff 91521944^ \
      		net/ipv4/tcp_input.c | sed -n -e '57,70 p'
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2df9001e
  14. 29 May, 2009 · 3 commits
  15. 27 May, 2009 · 6 commits
  16. 26 May, 2009 · 1 commit
  17. 22 May, 2009 · 1 commit
  18. 21 May, 2009 · 3 commits
    • net: Remove unused parameter from fill method in fib_rules_ops. · 04af8cf6
      Committed by Rami Rosen
      The netlink message header (struct nlmsghdr) is an unused parameter in
      fill method of fib_rules_ops struct.  This patch removes this
      parameter from this method and fixes the places where this method is
      called.
      
      (include/net/fib_rules.h)
      Signed-off-by: Rami Rosen <ramirose@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      04af8cf6
    • net: fix rtable leak in net/ipv4/route.c · 1ddbcb00
      Committed by Eric Dumazet
      Alexander V. Lukyanov found a regression in 2.6.29 and made a complete
      analysis found in http://bugzilla.kernel.org/show_bug.cgi?id=13339
      Quoted here because it's a perfect one:
      
      begin_of_quotation
       2.6.29 patch has introduced flexible route cache rebuilding. Unfortunately the
       patch has at least one critical flaw, and another problem.
      
       rt_intern_hash calculates rthi pointer, which is later used for new entry
       insertion. The same loop calculates cand pointer which is used to clean the
       list. If the pointers are the same, rtable leak occurs, as first the cand is
       removed then the new entry is appended to it.
      
       This leak leads to unregister_netdevice problem (usage count > 0).
      
       Another problem of the patch is that it tries to insert the entries in certain
       order, to facilitate counting of entries distinct by all but QoS parameters.
       Unfortunately, referencing an existing rtable entry moves it to list beginning,
       to speed up further lookups, so the carefully built order is destroyed.
      
       For the first problem the simplest patch it to set rthi=0 when rthi==cand, but
       it will also destroy the ordering.
      end_of_quotation
      
      Problematic commit is 1080d709
      (net: implement emergency route cache rebulds when gc_elasticity is exceeded)
      
      Trying to keep dst_entries ordered is too complex and breaks the fact that
      order should depend on the frequency of use for garbage collection.
      
      A possible fix is to make rt_intern_hash() simpler, and only make
      rt_check_expire() a little bit smarter, being able to cope with an arbitrary
      entry order. The added loop is running on cache hot data, while cpu
      is prefetching next object, so should be unnoticed.
      Reported-and-analyzed-by: Alexander V. Lukyanov <lav@yar.ru>
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1ddbcb00
    • net: fix length computation in rt_check_expire() · cf8da764
      Committed by Eric Dumazet
      rt_check_expire() computes average and standard deviation of chain lengths,
      but does not correctly reset length to 0 at beginning of each chain.
      This probably gives overflows for sum2 (and sum) on loaded machines instead
      of meaningful results.
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf8da764
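The bug class described in this commit message, a per-chain counter that is not reset between chains, is easy to show in isolation. The following is a minimal sketch with hypothetical types (`struct entry`, `struct chain_stats`, `scan_chains()`), not the route cache code; it only demonstrates where the reset has to sit so that `sum` and `sum2` stay meaningful.

```c
#include <stddef.h>

/* Hypothetical hash chain entry. */
struct entry {
    struct entry *next;
};

struct chain_stats {
    unsigned long sum;   /* sum of chain lengths */
    unsigned long sum2;  /* sum of squared chain lengths */
};

/* Walk every bucket and accumulate the inputs for mean and deviation. */
static struct chain_stats scan_chains(struct entry **buckets, size_t nbuckets)
{
    struct chain_stats st = { 0, 0 };

    for (size_t i = 0; i < nbuckets; i++) {
        unsigned long length = 0;  /* reset at the start of every chain */

        for (struct entry *e = buckets[i]; e; e = e->next)
            length++;

        st.sum += length;
        st.sum2 += length * length;
    }
    return st;
}
```

With chains of length 3, 0 and 1 this yields `sum == 4` and `sum2 == 10`. If `length` were declared outside the outer loop, the same input would be counted as 3, 3 and 4, and on long-running scans the squared term would grow without bound.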
  19. 20 May, 2009 · 1 commit
    • ipv4: teach ipconfig about the MTU option in DHCP · 9643f455
      Committed by Chris Friesen
      The DHCP spec allows the server to specify the MTU.  This can be useful
      for netbooting with UDP-based NFS-root on a network using jumbo frames.
      This patch allows the kernel IP autoconfiguration to handle this option
      correctly.
      
      It would be possible to use initramfs and add a script to set the MTU,
      but that seems like a complicated solution if no initramfs is otherwise
      necessary, and would bloat the kernel image more than this code would.
      
      This patch was originally submitted to LKML in 2003 by Hans-Peter Jansen.
      Signed-off-by: Chris Friesen <cfriesen@nortel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9643f455
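For readers unfamiliar with the option itself: in the DHCP options field, the interface MTU is carried as option code 26 with a 2-byte big-endian value, inside the usual code/length/data encoding (code 0 is padding, code 255 ends the list). The sketch below is a standalone parser written for illustration; `dhcp_find_mtu()` and the macro names are hypothetical and this is not the kernel's ipconfig code. Rejecting values below 68 follows the minimum stated in RFC 2132.

```c
#include <stddef.h>
#include <stdint.h>

#define DHCP_OPT_PAD  0    /* single byte, no length field */
#define DHCP_OPT_MTU  26   /* interface MTU, 2-byte big-endian value */
#define DHCP_OPT_END  255  /* terminates the option list */

/*
 * Scan a DHCP options buffer for the interface MTU option.
 * Returns the MTU, or 0 if the option is absent or malformed.
 */
static unsigned int dhcp_find_mtu(const uint8_t *opt, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint8_t code = opt[i++];
        uint8_t olen;

        if (code == DHCP_OPT_PAD)
            continue;
        if (code == DHCP_OPT_END)
            break;
        if (i >= len)              /* no room for the length byte */
            break;
        olen = opt[i++];
        if (olen > len - i)        /* option data runs past the buffer */
            break;

        if (code == DHCP_OPT_MTU && olen == 2) {
            unsigned int mtu = ((unsigned int)opt[i] << 8) | opt[i + 1];

            return mtu >= 68 ? mtu : 0;
        }
        i += olen;                 /* skip options we do not care about */
    }
    return 0;
}
```

A caller would pass the options area of a received DHCP reply and apply the result to the interface only when it is non-zero.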
  20. 19 May, 2009 · 5 commits