1. 05 Nov, 2014 (2 commits)
  2. 04 Nov, 2014 (2 commits)
    • net: add rbnode to struct sk_buff · 56b17425
      Eric Dumazet authored
      Yaogong's patch replaces the TCP out-of-order receive queue with an RB tree.
      
      As netem already does a private skb->{next/prev/tstamp} union
      with a 'struct rb_node', let's do this in a cleaner way.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yaogong Wang <wygivan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      56b17425
    • net: less interrupt masking in NAPI · d75b1ade
      Eric Dumazet authored
      net_rx_action() can mask irqs a single time to transfer sd->poll_list
      into a private list, for a very short duration.
      
      Then, napi_complete() can avoid masking irqs again,
      and net_rx_action() only needs to mask irqs again in the slow path.
      
      This patch removes two pairs of irq mask/unmask per typical NAPI run,
      more if multiple NAPI instances were triggered.
      
      Note this also allows giving control back to the caller (do_softirq())
      more often, so that other softirq handlers can be called a bit earlier,
      or ksoftirqd can be woken up earlier under pressure.
      
      This was developed while testing an alternative to RX interrupt
      mitigation to reduce latencies while keeping or improving GRO
      aggregation on fast NIC.
      
      The idea is to test napi->gro_list at the end of a napi->poll() and
      reschedule one NAPI poll, but only after servicing a full round of
      softirqs (timers, TX, rcu, ...). This will be allowed only if the
      softirq is currently being serviced by the idle task or ksoftirqd,
      and a resched is not needed.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d75b1ade
  3. 01 Nov, 2014 (3 commits)
  4. 31 Oct, 2014 (24 commits)
  5. 30 Oct, 2014 (9 commits)
    • neigh: optimize neigh_parms_release() · 75fbfd33
      Nicolas Dichtel authored
      In neigh_parms_release() we loop over all entries to find the one given
      as an argument so that it can be removed from the list. By using a
      doubly linked list, we can avoid this loop.
      
      Here are some numbers with 30 000 dummy interfaces configured:
      
      Before the patch:
      $ time rmmod dummy
      real	2m0.118s
      user	0m0.000s
      sys	1m50.048s
      
      After the patch:
      $ time rmmod dummy
      real	1m9.970s
      user	0m0.000s
      sys	0m47.976s
      Suggested-by: Thierry Herbelot <thierry.herbelot@6wind.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      75fbfd33
    • net: introduce napi_schedule_irqoff() · bc9ad166
      Eric Dumazet authored
      napi_schedule() can be called from any context and has to mask hard
      irqs.
      
      Add a variant that can only be called from hard interrupt handlers
      or when irqs are already masked.
      
      Many NIC drivers can use it from their hard IRQ handlers instead of
      the generic variant.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc9ad166
    • inet: frags: remove the WARN_ON from inet_evict_bucket · d70127e8
      Nikolay Aleksandrov authored
      The WARN_ON in inet_evict_bucket can be triggered by a valid case:
      inet_frag_kill and inet_evict_bucket can be running in parallel on the
      same queue, which means at least one extra ref has been added by a
      previous inet_frag_find call. inet_frag_kill can delete the timer
      before inet_evict_bucket, which causes the WARN_ON() there to trigger
      since refcnt != 1. This case is valid because the queue is being
      "killed" for some reason (removed from the chain list and its timer
      deleted), so it will be destroyed in the end by whichever
      inet_frag_put() call reaches 0; i.e. the refcnt is still valid.
      
      CC: Florian Westphal <fw@strlen.de>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Patrick McLean <chutzpah@gentoo.org>
      
      Fixes: b13d3cbf ("inet: frag: move eviction of queues to work queue")
      Reported-by: Patrick McLean <chutzpah@gentoo.org>
      Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d70127e8
    • inet: frags: fix a race between inet_evict_bucket and inet_frag_kill · 65ba1f1e
      Nikolay Aleksandrov authored
      When the evictor is running it adds some chosen frags to a local list,
      to be evicted once the chain lock has been released. But at the same
      time the *frag_queue can be running for some of the same queues and
      may call inet_frag_kill, which will wait on the chain lock and then
      delete the queue from the wrong list, since it was added to the
      eviction one. The fix is simple: check whether the queue has the evict
      flag set, under the chain lock, before deleting it. This is safe
      because the evict flag is set only under that lock, and having the
      flag set also means the queue has been detached from the chain list,
      so there is no need to delete it again.
      An important note: we are safe w.r.t. the refcnt because
      inet_frag_kill and inet_evict_bucket sync on the del_timer operation,
      where only one of the two can succeed (or, if the timer is executing,
      neither of them). The cases are:
      1. inet_frag_kill succeeds in del_timer
       - the timer ref is removed, but inet_evict_bucket will not add
         this queue to its expire list and will restart eviction in that chain
      2. inet_evict_bucket succeeds in del_timer
       - the timer ref is kept until the evictor "expires" the queue, but
         inet_frag_kill will remove the initial ref and will set
         INET_FRAG_COMPLETE, which makes the frag_expire fn just remove
         its ref.
      In the end all of the queue's users will do an inet_frag_put, and the
      one that reaches 0 will free it. The refcount balance should be okay.
      
      CC: Florian Westphal <fw@strlen.de>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Patrick McLean <chutzpah@gentoo.org>
      
      Fixes: b13d3cbf ("inet: frag: move eviction of queues to work queue")
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Reported-by: Patrick McLean <chutzpah@gentoo.org>
      Tested-by: Patrick McLean <chutzpah@gentoo.org>
      Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
      Reviewed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      65ba1f1e
    • net: ipv6: Add a sysctl to make optimistic addresses useful candidates · 7fd2561e
      Erik Kline authored
      Add a sysctl that causes an interface's optimistic addresses
      to be considered equivalent to other non-deprecated addresses
      for source address selection purposes.  Preferred addresses
      will still take precedence over optimistic addresses, subject
      to other ranking in the source address selection algorithm.
      
      This is useful where different interfaces are connected to
      different networks from different ISPs (e.g., a cell network
      and a home wifi network).
      
      The current behaviour complies with RFC 3484/6724, and it
      makes sense if the host has only one interface, or has
      multiple interfaces on the same network (same or cooperating
      administrative domain(s)), but not in the multiple distinct
      networks case.
      
      For example, if a mobile device has an IPv6 address on an LTE
      network and then connects to IPv6-enabled wifi, while the wifi
      IPv6 address is undergoing DAD, IPv6 connections will try to use
      the wifi default route with the LTE IPv6 address, and will get
      stuck until they time out.
      
      Also, because optimistic nodes can receive frames, issue
      an RTM_NEWADDR as soon as DAD starts (with the IFA_F_OPTIMISTIC
      flag appropriately set).  A second RTM_NEWADDR is sent if DAD
      completes (the address flags have changed), otherwise an
      RTM_DELADDR is sent.
      
      Also: add an entry in ip-sysctl.txt for optimistic_dad.
      Signed-off-by: Erik Kline <ek@google.com>
      Acked-by: Lorenzo Colitti <lorenzo@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7fd2561e
    • tcp: allow for bigger reordering level · dca145ff
      Eric Dumazet authored
      While testing the upcoming Yaogong patch (converting the out-of-order
      queue into an RB tree), I hit the max reordering level of the Linux
      TCP stack.
      
      The reordering level was limited to 127 for no good reason, and some
      network setups [1] can easily reach this limit and get limited
      throughput.
      
      Allow a new max limit of 300, and add a sysctl so admins can set even
      bigger (or lower) values if needed.
      
      [1] Aggregation of links, per-packet load balancing, fabrics not doing
       deep packet inspection, alternative TCP congestion modules...
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yaogong Wang <wygivan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dca145ff
    • net: skb_segment() should preserve backpressure · 432c856f
      Toshiaki Makita authored
      This patch generalizes commit d6a4a104 ("tcp: GSO should be TSQ
      friendly") to protocols using skb_set_owner_w().
      
      TCP uses its own destructor (tcp_wfree) and needs a more complex
      scheme, as explained in commit 6ff50cd5 ("tcp: gso: do not generate
      out of order packets").
      
      This allows UDP sockets using UFO to get proper backpressure,
      thus avoiding qdisc drops and excessive cpu usage.
      
      Here are performance test results (macvlan on vlan):
      
      - Before
      # netperf -t UDP_STREAM ...
      Socket  Message  Elapsed      Messages
      Size    Size     Time         Okay Errors   Throughput
      bytes   bytes    secs            #      #   10^6bits/sec
      
      212992   65507   60.00      144096 1224195    1258.56
      212992           60.00          51              0.45
      
      Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
      Average:        all      0.23      0.00     25.26      0.08      0.00     74.43
      
      - After
      # netperf -t UDP_STREAM ...
      Socket  Message  Elapsed      Messages
      Size    Size     Time         Okay Errors   Throughput
      bytes   bytes    secs            #      #   10^6bits/sec
      
      212992   65507   60.00      109593      0     957.20
      212992           60.00      109593            957.20
      
      Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
      Average:        all      0.18      0.00      8.38      0.02      0.00     91.43
      
      [edumazet] Rewrote patch and changelog.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      432c856f
    • ipv6: notify userspace when we added or changed an ipv6 token · b2ed64a9
      Lubomir Rintel authored
      NetworkManager might want to know that the token changed when the
      router advertisement arrives.
      Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b2ed64a9
    • sch_pie: schedule the timer after all init succeeds · d5610902
      WANG Cong authored
      Cc: Vijay Subramanian <vijaynsu@cisco.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      d5610902