1. 15 Jul, 2009 1 commit
  2. 12 Jul, 2009 1 commit
  3. 10 Jul, 2009 1 commit
    • J
      net: adding memory barrier to the poll and receive callbacks · a57de0b4
      Committed by Jiri Olsa
      Add a memory barrier after the poll_wait function, paired with the
      receive callbacks. Add the functions sock_poll_wait and sk_has_sleeper
      to wrap the memory barrier.
      
      Without the memory barrier, the following race can happen.
      The race fires when the following code paths meet and the tp->rcv_nxt
      and __add_wait_queue updates stay in CPU caches.
      
      CPU1                         CPU2
      
      sys_select                   receive packet
        ...                        ...
        __add_wait_queue           update tp->rcv_nxt
        ...                        ...
        tp->rcv_nxt check          sock_def_readable
        ...                        {
        schedule                      ...
                                      if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
                                              wake_up_interruptible(sk->sk_sleep)
                                      ...
                                   }
      
      If there were no cache, the code would work fine, since the wait_queue and
      rcv_nxt accesses are made in opposite order on the two CPUs.
      
      Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
      passed the tp->rcv_nxt check and sleeps, or will get the new value for
      tp->rcv_nxt and will return with new data mask.
      In both cases the process (CPU1) is being added to the wait queue, so the
      waitqueue_active (CPU2) call cannot miss and will wake up CPU1.
      
      The bad case is when the __add_wait_queue changes done by CPU1 stay in its
      cache, and so does the tp->rcv_nxt update on the CPU2 side.  CPU1 will then
      end up calling schedule and sleep forever if no more data arrives on the
      socket.
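
The pairing described above can be sketched in userspace C11. This is only a model under stated assumptions: `struct model_sock` and every `model_*` function are hypothetical names, a counter stands in for the wait queue, and `atomic_thread_fence` stands in for the kernel barrier. It is not the code added by the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of a socket; not the kernel's struct sock. */
struct model_sock {
    atomic_int sleepers;   /* stands in for the wait queue contents */
    atomic_uint rcv_nxt;   /* stands in for tp->rcv_nxt */
};

/* Waiter side (CPU1): queue ourselves, then issue a full barrier so the
 * queueing is ordered before the later rcv_nxt check. */
static void model_sock_poll_wait(struct model_sock *sk)
{
    atomic_fetch_add_explicit(&sk->sleepers, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
}

/* Waker side (CPU2): full barrier first, so the earlier rcv_nxt update is
 * ordered before the wait queue check. */
static bool model_sk_has_sleeper(struct model_sock *sk)
{
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&sk->sleepers, memory_order_relaxed) > 0;
}

/* CPU1 path: returns true if new data is visible, false if it would sleep. */
static bool model_poll(struct model_sock *sk, unsigned int seen)
{
    model_sock_poll_wait(sk);
    return atomic_load_explicit(&sk->rcv_nxt, memory_order_relaxed) != seen;
}

/* CPU2 path: publish new data; returns true if a wakeup would be issued. */
static bool model_receive(struct model_sock *sk, unsigned int len)
{
    atomic_fetch_add_explicit(&sk->rcv_nxt, len, memory_order_relaxed);
    return model_sk_has_sleeper(sk);
}
```

With a full fence on both sides, the outcome where the waiter misses the new rcv_nxt and the waker also misses the sleeper is ruled out, which is the lost-wakeup case from the diagram.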
      
      Calls to poll_wait in the following modules were omitted:
      	net/bluetooth/af_bluetooth.c
      	net/irda/af_irda.c
      	net/irda/irnet/irnet_ppp.c
      	net/mac80211/rc80211_pid_debugfs.c
      	net/phonet/socket.c
      	net/rds/af_rds.c
      	net/rfkill/core.c
      	net/sunrpc/cache.c
      	net/sunrpc/rpc_pipe.c
      	net/tipc/socket.c
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 09 Jul, 2009 1 commit
    • J
      ipv4: Fix fib_trie rebalancing, part 4 (root thresholds) · 345aa031
      Committed by Jarek Poplawski
      Pawel Staszewski wrote:
      <blockquote>
      Some time ago i report this:
      http://bugzilla.kernel.org/show_bug.cgi?id=6648
      
      and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it back
      dmesg output:
      oprofile: using NMI interrupt.
      Fix inflate_threshold_root. Now=15 size=11 bits
      ...
      Fix inflate_threshold_root. Now=15 size=11 bits
      
      cat /proc/net/fib_triestat
      Basic info: size of leaf: 40 bytes, size of tnode: 56 bytes.
      Main:
              Aver depth:     2.28
              Max depth:      6
              Leaves:         276539
              Prefixes:       289922
              Internal nodes: 66762
                1: 35046  2: 13824  3: 9508  4: 4897  5: 2331  6: 1149  7: 5
      9: 1  18: 1
              Pointers: 691228
      Null ptrs: 347928
      Total size: 35709  kB
      </blockquote>
      
      It seems the current threshold for root resizing is too aggressive.
      It causes misleading warnings during big updates, and it might also be
      responsible for memory problems, especially with non-preempt
      configs, when RCU freeing is delayed long after call_rcu.
      
      It should also be mentioned that, because of non-atomic changes during
      resizing/rebalancing, the current lookup algorithm can miss valid leaves,
      so it's an additional argument to shorten these activities even at the cost
      of minimally longer searching.
      
      This patch restores values before the patch "[IPV4]: fib_trie root
      node settings", commit: 965ffea4 from
      v2.6.22.
      
      Pawel's report:
      <blockquote>
      I dont see any big change of (cpu load or faster/slower
      routing/propagating routes from bgpd or something else) - in avg there
      is from 2% to 3% more of CPU load i dont know why but it is - i change
      from "preempt" to "no preempt" 3 times and check this my "mpstat -P ALL
      1 30"
      always avg cpu load was from 2 to 3% more compared to "no preempt"
      [...]
      cat /proc/net/fib_triestat
      Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
      Main:
              Aver depth:     2.44
              Max depth:      6
              Leaves:         277814
              Prefixes:       291306
              Internal nodes: 66420
                1: 32737  2: 14850  3: 10332  4: 4871  5: 2313  6: 942  7: 371  8: 3  17: 1
              Pointers: 599098
      Null ptrs: 254865
      Total size: 18067  kB
      </blockquote>
      
      According to this and other similar reports, average depth is slightly
      increased (~0.2), and root nodes are shorter (log 17 vs. 18), but
      there is no visible performance decrease. So, until memory handling is
      improved or parameters are added for changing this individually, this
      patch resets to safer defaults.
      Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
      Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
      Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
      Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 04 Jul, 2009 1 commit
  6. 01 Jul, 2009 2 commits
  7. 30 Jun, 2009 2 commits
  8. 29 Jun, 2009 1 commit
  9. 27 Jun, 2009 1 commit
  10. 26 Jun, 2009 1 commit
    • W
      tcp: missing check ACK flag of received segment in FIN-WAIT-2 state · 1ac530b3
      Committed by Wei Yongjun
      RFC 793 specifies that in the FIN-WAIT-2 state, if the ACK bit is off, the
      segment is dropped and processing returns [Page 72]. But this check is
      missing in the function tcp_timewait_state_process(). This causes a segment
      with the FIN flag but no ACK to be handled in two different ways:
      
      Case 1:
          Node A                      Node B
                    <-------------    FIN,ACK
                                      (enter FIN-WAIT-1)
          ACK       ------------->
                                      (enter FIN-WAIT-2)
          FIN       ------------->    discard
                                      (move sk to tw list)
      
      Case 2:
          Node A                      Node B
                    <-------------    FIN,ACK
                                      (enter FIN-WAIT-1)
          ACK       ------------->
                                      (enter FIN-WAIT-2)
                                      (move sk to tw list)
          FIN       ------------->
      
                    <-------------    ACK
      
      This patch fixed the problem.
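
The missing check can be illustrated with a small sketch. The enum names and `model_fin_wait2_segment` are hypothetical; only the flag bit values follow the TCP header layout. The sketch ignores the earlier RST and SYN checks and is not the body of tcp_timewait_state_process().

```c
#include <assert.h>

/* TCP header flag bits (FIN is 0x01, ACK is 0x10); the names are illustrative. */
enum { MODEL_TCP_FIN = 0x01, MODEL_TCP_ACK = 0x10 };

enum model_tw_action { MODEL_TW_DROP, MODEL_TW_PROCESS };

/* Decide what to do with a segment received in the FIN-WAIT-2 substate.
 * Per RFC 793, a segment without the ACK bit is dropped. */
static enum model_tw_action model_fin_wait2_segment(unsigned int flags)
{
    if (!(flags & MODEL_TCP_ACK))
        return MODEL_TW_DROP;      /* bare FIN: discard, as in Case 1 */
    return MODEL_TW_PROCESS;       /* FIN,ACK or ACK: continue processing */
}
```

Applying the same test on the timewait path makes Case 2 behave like Case 1 for a FIN that carries no ACK.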
      Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 24 Jun, 2009 1 commit
    • N
      ipv4 routing: Ensure that route cache entries are usable and reclaimable with caching is off · b6280b47
      Committed by Neil Horman
      When route caching is disabled (rt_caching returns false), we still use route
      cache entries that are created and passed into rt_intern_hash once.  These
      routes need to be made usable for the one call path that holds a reference to
      them, and they need to be reclaimed when they're finished with their use.  To be
      made usable, they need to be associated with a neighbor table entry (which they
      currently are not), otherwise iproute_finish2 just discards the packet, since we
      don't know which L2 peer to send the packet to.  To do this binding, we need to
      follow the path a bit higher up in rt_intern_hash, which calls
      arp_bind_neighbour, but not assign the route entry to the hash table.
      Currently, if caching is off, we simply assign the route to the rp pointer and
      return success.  This patch associates us with a neighbor entry first.
      
      Secondly, we need to make sure that any single use routes like this are known to
      the garbage collector when caching is off.  If caching is off, and we try to
      hash in a route, it will leak when its refcount reaches zero.  To avoid this,
      this patch calls rt_free on the route cache entry passed into rt_intern_hash.
      This places us on the gc list for the route cache garbage collector, so that
      when its refcount reaches zero, it will be reclaimed (Thanks to Alexey for this
      suggestion).
      
      I've tested this on a local system here, and with these patches in place, I'm
      able to maintain routed connectivity to remote systems, even if I set
      /proc/sys/net/ipv4/rt_cache_rebuild_count to -1, which forces rt_caching to
      return false.
      Signed-off-by: Neil Horman <nhorman@redhat.com>
      Reported-by: Jarek Poplawski <jarkao2@gmail.com>
      Reported-by: Maxime Bizon <mbizon@freebox.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 20 Jun, 2009 1 commit
    • N
      ipv4: fix NULL pointer + success return in route lookup path · 73e42897
      Committed by Neil Horman
      Don't drop route if we're not caching
      
      I recently got a report of an oops on a route lookup.  Maxime was
      testing what would happen if route caching was turned off (by
      making rt_caching always return 0), and found that it triggered an oops.  I
      looked at it and found that the problem stemmed from the fact that the route
      lookup routines were returning success from their lookup paths (which is good),
      but never set the **rp pointer to anything (which is bad).  This happens because
      in rt_intern_hash, if rt_caching returns false, we call rt_drop and return 0.
      This almost emulates silent success.  What we should be doing is assigning *rp =
      rt and _not_ dropping the route.  This way, during slow path lookups, when we
      create a new route cache entry, we don't immediately discard it, rather we just
      don't add it into the cache hash table, but we let this one lookup use it for
      the purpose of this route request.  Maxime has tested and reports it prevents
      the oops.  There is still a subsequent routing issue that I'm looking into
      further, but I'm confident that, even if it's related to this same path, this
      patch makes sense to take.
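
The before/after behaviour can be sketched as follows. `struct model_rtable` and both `model_intern_hash_*` functions are hypothetical stand-ins, the hash-table insertion is left out, and the `dropped` flag merely models the effect of rt_drop. This is a reading of the description above, not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a route cache entry. */
struct model_rtable {
    int id;
    bool dropped;   /* set where the real code would call rt_drop() */
};

/* Old behaviour: with caching off, drop the route and report success
 * without ever setting *rp, so the caller later dereferences NULL. */
static int model_intern_hash_old(bool caching, struct model_rtable *rt,
                                 struct model_rtable **rp)
{
    if (!caching) {
        rt->dropped = true;
        return 0;
    }
    *rp = rt;       /* hash-table insertion omitted in this model */
    return 0;
}

/* New behaviour: with caching off, skip the hash table but still hand
 * the route to the caller for this one lookup. */
static int model_intern_hash_new(bool caching, struct model_rtable *rt,
                                 struct model_rtable **rp)
{
    (void)caching;  /* both branches return the route in this model */
    *rp = rt;
    return 0;
}
```

In the old variant a caller that trusts the zero return value ends up with a NULL route pointer, which matches the reported oops.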
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 18 Jun, 2009 2 commits
  14. 15 Jun, 2009 2 commits
  15. 14 Jun, 2009 3 commits
  16. 12 Jun, 2009 1 commit
    • P
      netfilter: ip_tables: fix build error · 24992eac
      Committed by Patrick McHardy
      Fix build error introduced by commit bb70dfa5 (netfilter: xtables:
      consolidate comefrom debug cast access):
      
      net/ipv4/netfilter/ip_tables.c: In function 'ipt_do_table':
      net/ipv4/netfilter/ip_tables.c:421: error: 'comefrom' undeclared (first use in this function)
      net/ipv4/netfilter/ip_tables.c:421: error: (Each undeclared identifier is reported only once
      net/ipv4/netfilter/ip_tables.c:421: error: for each function it appears in.)
      Signed-off-by: Patrick McHardy <kaber@trash.net>
  17. 11 Jun, 2009 1 commit
    • E
      net: No more expensive sock_hold()/sock_put() on each tx · 2b85a34e
      Committed by Eric Dumazet
      One of the problems with sock memory accounting is that it uses
      a pair of sock_hold()/sock_put() for each transmitted packet.
      
      This slows down bidirectional flows because the receive path
      also needs to take a refcount on socket and might use a different
      cpu than transmit path or transmit completion path. So these
      two atomic operations also trigger cache line bounces.
      
      We can see this in tx or tx/rx workloads (media gateways for example),
      where sock_wfree() can be in top five functions in profiles.
      
      We use this sock_hold()/sock_put() so that sock freeing
      is delayed until all tx packets are completed.
      
      As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
      by one unit at init time, until sk_free() is called.
      Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
      to decrement the initial offset and atomically check if any packets
      are in flight.
      
      skb_set_owner_w() doesn't call sock_hold() anymore.
      
      sock_wfree() doesn't call sock_put() anymore, but checks if sk_wmem_alloc
      reached 0 to perform the final freeing.
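
The biased counter scheme can be sketched with C11 atomics. `struct model_tx_sock` and the `model_*` functions are hypothetical, and the `freed` flag stands in for the real final free. This is a minimal model of the idea, not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the write-memory accounting on a socket. */
struct model_tx_sock {
    atomic_int wmem_alloc;  /* bytes in flight, plus a bias of 1 while alive */
    bool freed;             /* set where the real code would free the sock */
};

static void model_sk_init(struct model_tx_sock *sk)
{
    atomic_init(&sk->wmem_alloc, 1);   /* initial offset of one unit */
    sk->freed = false;
}

/* Transmit: charge the packet, no extra reference on the socket. */
static void model_set_owner_w(struct model_tx_sock *sk, int truesize)
{
    atomic_fetch_add(&sk->wmem_alloc, truesize);
}

/* TX completion: uncharge; whoever brings the counter to 0 frees. */
static void model_wfree(struct model_tx_sock *sk, int truesize)
{
    /* atomic_fetch_sub returns the value before the subtraction */
    if (atomic_fetch_sub(&sk->wmem_alloc, truesize) == truesize)
        sk->freed = true;
}

/* Owner release: remove the bias; free now only if nothing is in flight. */
static void model_sk_free(struct model_tx_sock *sk)
{
    if (atomic_fetch_sub(&sk->wmem_alloc, 1) == 1)
        sk->freed = true;
}
```

Whichever of model_sk_free() or the last model_wfree() brings the counter to zero performs the free, so a wrong truesize would either never reach zero or reach it too early.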
      
      Drawback is that a skb->truesize error could lead to unfreeable sockets, or
      even worse, prematurely calling __sk_free() on a live socket.
      
      Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s
      on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt
      contention point. 5 % speedup on a UDP transmit workload (depends
      on number of flows), lowering TX completion cpu usage.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 10 Jun, 2009 1 commit
  19. 09 Jun, 2009 2 commits
  20. 08 Jun, 2009 1 commit
    • J
      netfilter: nf_ct_icmp: keep the ICMP ct entries longer · f87fb666
      Committed by Jan Kasprzak
      Current conntrack code kills the ICMP conntrack entry as soon as
      the first reply is received. This is incorrect, as we then see only
      the first ICMP echo reply out of several possible duplicates as
      ESTABLISHED, while the rest will be INVALID. Also this unnecessarily
      increases the conntrackd traffic on H-A firewalls.
      
      Make all the ICMP conntrack entries (including the replied ones)
      last for the default of nf_conntrack_icmp{,v6}_timeout seconds.
      Signed-off-by: Jan "Yenya" Kasprzak <kas@fi.muni.cz>
      Signed-off-by: Patrick McHardy <kaber@trash.net>
  21. 05 Jun, 2009 1 commit
  22. 04 Jun, 2009 2 commits
  23. 03 Jun, 2009 3 commits
    • E
      net: skb->dst accessors · adf30907
      Committed by Eric Dumazet
      Define three accessors to get/set dst attached to a skb
      
      struct dst_entry *skb_dst(const struct sk_buff *skb)
      
      void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
      
      void skb_dst_drop(struct sk_buff *skb)
      This one should replace occurrences of:
      dst_release(skb->dst)
      skb->dst = NULL;
      
      Delete skb->dst field
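
A minimal sketch of the three accessors on a toy structure follows. The `model_*` types, the private field name and the plain integer refcount are assumptions made for illustration. The kernel's actual field layout and dst_release() semantics are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct dst_entry and struct sk_buff. */
struct model_dst_entry {
    int refcnt;
};

struct model_sk_buff {
    struct model_dst_entry *dst_private;  /* only touched via the accessors */
};

/* Read the dst attached to the buffer. */
static struct model_dst_entry *model_skb_dst(const struct model_sk_buff *skb)
{
    return skb->dst_private;
}

/* Attach a dst; the caller is assumed to already hold a reference. */
static void model_skb_dst_set(struct model_sk_buff *skb,
                              struct model_dst_entry *dst)
{
    skb->dst_private = dst;
}

/* Toy release: just decrement the count, tolerating NULL. */
static void model_dst_release(struct model_dst_entry *dst)
{
    if (dst)
        dst->refcnt--;
}

/* Replaces the open-coded "release, then set to NULL" pair. */
static void model_skb_dst_drop(struct model_sk_buff *skb)
{
    model_dst_release(skb->dst_private);
    skb->dst_private = NULL;
}
```

Routing every access through functions like these is what allows the raw field to be removed or changed later without touching callers.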
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      net: skb->rtable accessor · 511c3f92
      Committed by Eric Dumazet
      Define skb_rtable(const struct sk_buff *skb) accessor to get rtable from skb
      
      Delete skb->rtable field
      
      Setting rtable is not allowed, just set dst instead as rtable is an alias.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • P
      netfilter: conntrack: simplify event caching system · 17e6e4ea
      Committed by Pablo Neira Ayuso
      This patch simplifies the conntrack event caching system by removing
      several events:
      
       * IPCT_[*]_VOLATILE, IPCT_HELPINFO and IPCT_NATINFO have been deleted
         since they have no clients.
       * IPCT_COUNTER_FILLING which is a leftover of the 32-bits counter
         days.
       * IPCT_REFRESH which is not of any use since we always include the
         timeout in the messages.
      
      After this patch, the existing events are:
      
       * IPCT_NEW, IPCT_RELATED and IPCT_DESTROY, that are used to identify
       addition and deletion of entries.
       * IPCT_STATUS, that notes that the status bits have changed,
       eg. IPS_SEEN_REPLY and IPS_ASSURED.
       * IPCT_PROTOINFO, that reports that internal protocol information has
       changed, eg. the TCP, DCCP and SCTP protocol state.
       * IPCT_HELPER, that a helper has been assigned or unassigned to this
       entry.
       * IPCT_MARK and IPCT_SECMARK, that reports that the mark has changed, this
       covers the case when a mark is set to zero.
       * IPCT_NATSEQADJ, to report that there are updates in the NAT sequence
       adjustment.
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  24. 02 Jun, 2009 2 commits
  25. 30 May, 2009 1 commit
    • I
      tcp: fix loop in ofo handling code and reduce its complexity · 2df9001e
      Committed by Ilpo Järvinen
      Somewhat luckily, I was looking into these parts with very fine
      comb because I've made somewhat similar changes on the same
      area (conflicts that arose weren't that lucky though). The loop
      was very much overengineered recently in commit 91521944
      (tcp: Use SKB queue and list helpers instead of doing it
      by-hand), while it basically just wants to know if there are
      skbs after 'skb'.
      
      Also it got broken because skb1 = skb->next got translated into
      skb1 = skb1->next (though abstracted) improperly. Note that
      'skb1' is pointing to previous sk_buff than skb or NULL if at
      head. Two things went wrong:
      - We'll kfree 'skb' on the first iteration instead of the
        skbuff following 'skb' (it would require SACK reneging
        to recover I think).
      - The list head case where 'skb1' is NULL is checked too early
        and the loop won't execute whereas it previously did.
      
      Conclusion, mostly revert the recent changes which makes the
      cset very messy looking but using proper accessor in the
      previous-like version.
      
      The effective changes against the original can be viewed with:
        git-diff 91521944^ \
      		net/ipv4/tcp_input.c | sed -n -e '57,70 p'
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 29 May, 2009 3 commits
  27. 27 May, 2009 1 commit