1. 31 Oct, 2014 6 commits
  2. 30 Oct, 2014 5 commits
    • neigh: optimize neigh_parms_release() · 75fbfd33
      Nicolas Dichtel committed
      In neigh_parms_release() we loop over all entries to find the entry given
      as the argument so that we can remove it from the list. By using a doubly
      linked list, we can avoid this loop.
      
      Here are some numbers with 30 000 dummy interfaces configured:
      
      Before the patch:
      $ time rmmod dummy
      real	2m0.118s
      user	0m0.000s
      sys	1m50.048s
      
      After the patch:
      $ time rmmod dummy
      real	1m9.970s
      user	0m0.000s
      sys	0m47.976s
      Suggested-by: Thierry Herbelot <thierry.herbelot@6wind.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
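The difference between the two removal strategies can be sketched in userspace C. This is only an illustration under stated assumptions: `struct parms`, `slow_remove()` and `fast_remove()` are hypothetical names, not the kernel's, and the kernel has its own `struct list_head` helpers for doubly linked lists.

```c
#include <stddef.h>

/* Hypothetical stand-in for a per-device parameter block. */
struct parms {
    int id;
    struct parms *prev, *next;
};

/* Circular list with a sentinel head: an empty list points at itself. */
static void list_init(struct parms *head)
{
    head->prev = head->next = head;
}

static void list_add_tail(struct parms *head, struct parms *p)
{
    p->prev = head->prev;
    p->next = head;
    head->prev->next = p;
    head->prev = p;
}

/* Removal by search: walk from the head until p is found, then unlink it.
 * Returns the number of entries visited, to make the O(n) cost visible. */
static size_t slow_remove(struct parms *head, struct parms *p)
{
    size_t visited = 0;
    struct parms *prev = head;
    struct parms *cur;

    for (cur = head->next; cur != head; prev = cur, cur = cur->next) {
        visited++;
        if (cur == p) {
            prev->next = cur->next;
            cur->next->prev = prev;
            break;
        }
    }
    return visited;
}

/* Removal through the back pointer: O(1), no walk needed. */
static void fast_remove(struct parms *p)
{
    p->prev->next = p->next;
    p->next->prev = p->prev;
    p->prev = p->next = p;
}
```

With five entries, `slow_remove()` on the last one visits all five, while `fast_remove()` only touches the entry's two neighbours regardless of list length. That is the kind of saving that would add up when 30 000 interfaces are torn down one after another.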
    • net: introduce napi_schedule_irqoff() · bc9ad166
      Eric Dumazet committed
      napi_schedule() can be called from any context and has to mask hard
      irqs.
      
      Add a variant that can only be called from hard interrupt handlers
      or when irqs are already masked.
      
      Many NIC drivers can use it from their hard IRQ handler instead of
      the generic variant.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv6: Add a sysctl to make optimistic addresses useful candidates · 7fd2561e
      Erik Kline committed
      Add a sysctl that causes an interface's optimistic addresses
      to be considered equivalent to other non-deprecated addresses
      for source address selection purposes.  Preferred addresses
      will still take precedence over optimistic addresses, subject
      to other ranking in the source address selection algorithm.
      
      This is useful where different interfaces are connected to
      different networks from different ISPs (e.g., a cell network
      and a home wifi network).
      
      The current behaviour complies with RFC 3484/6724, and it
      makes sense if the host has only one interface, or has
      multiple interfaces on the same network (same or cooperating
      administrative domain(s)), but not in the multiple distinct
      networks case.
      
      For example, if a mobile device has an IPv6 address on an LTE
      network and then connects to IPv6-enabled wifi, while the wifi
      IPv6 address is undergoing DAD, IPv6 connections will try to use
      the wifi default route with the LTE IPv6 address, and will get
      stuck until they time out.
      
      Also, because optimistic nodes can receive frames, issue
      an RTM_NEWADDR as soon as DAD starts (with the IFA_F_OPTIMISTIC
      flag appropriately set).  A second RTM_NEWADDR is sent if DAD
      completes (the address flags have changed), otherwise an
      RTM_DELADDR is sent.
      
      Also: add an entry in ip-sysctl.txt for optimistic_dad.
      Signed-off-by: Erik Kline <ek@google.com>
      Acked-by: Lorenzo Colitti <lorenzo@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: allow for bigger reordering level · dca145ff
      Eric Dumazet committed
      While testing Yaogong's upcoming patch (converting the out-of-order queue
      into an RB tree), I hit the max reordering level of the Linux TCP stack.
      
      Reordering level was limited to 127 for no good reason, and some
      network setups [1] can easily reach this limit and get limited
      throughput.
      
      Allow a new max limit of 300, and add a sysctl so that admins can set
      higher (or lower) values if needed.
      
      [1] Aggregation of links, per packet load balancing, fabrics not doing
       deep packet inspections, alternative TCP congestion modules...
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yaogong Wang <wygivan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
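The effect of the limit can be sketched as a simple clamp. This is a minimal illustration, not the kernel's code: `clamp_reordering()` and the two macro names are hypothetical, and the tunable is assumed here to correspond to the `net.ipv4.tcp_max_reordering` sysctl.

```c
/* Old hard limit and the new default named in the commit message. */
#define OLD_MAX_REORDERING      127
#define DEFAULT_MAX_REORDERING  300

/* Clamp an observed reordering distance (in packets) to the configured
 * maximum. max_reordering models the admin-tunable sysctl value. */
static unsigned int clamp_reordering(unsigned int observed,
                                     unsigned int max_reordering)
{
    return observed < max_reordering ? observed : max_reordering;
}
```

A path that reorders by 200 packets would have been capped at 127 under the old limit, but is tracked at its real value of 200 under the new default of 300.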
    • net: skb_segment() should preserve backpressure · 432c856f
      Toshiaki Makita committed
      This patch generalizes commit d6a4a104 ("tcp: GSO should be TSQ
      friendly") to protocols using skb_set_owner_w().
      
      TCP uses its own destructor (tcp_wfree) and needs a more complex scheme,
      as explained in commit 6ff50cd5 ("tcp: gso: do not generate out of
      order packets").
      
      This allows UDP sockets using UFO to get proper backpressure,
      thus avoiding qdisc drops and excessive cpu usage.
      
      Here are performance test results (macvlan on vlan):
      
      - Before
      # netperf -t UDP_STREAM ...
      Socket  Message  Elapsed      Messages
      Size    Size     Time         Okay Errors   Throughput
      bytes   bytes    secs            #      #   10^6bits/sec
      
      212992   65507   60.00      144096 1224195    1258.56
      212992           60.00          51              0.45
      
      Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
      Average:        all      0.23      0.00     25.26      0.08      0.00     74.43
      
      - After
      # netperf -t UDP_STREAM ...
      Socket  Message  Elapsed      Messages
      Size    Size     Time         Okay Errors   Throughput
      bytes   bytes    secs            #      #   10^6bits/sec
      
      212992   65507   60.00      109593      0     957.20
      212992           60.00      109593            957.20
      
      Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
      Average:        all      0.18      0.00      8.38      0.02      0.00     91.43
      
      [edumazet] Rewrote patch and changelog.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 29 Oct, 2014 3 commits
  4. 28 Oct, 2014 7 commits
  5. 25 Oct, 2014 2 commits
  6. 24 Oct, 2014 4 commits
  7. 23 Oct, 2014 3 commits
  8. 22 Oct, 2014 4 commits
    • net: sched: initialize bstats syncp · 7c1c97d5
      Sabrina Dubroca committed
      Use netdev_alloc_pcpu_stats to allocate percpu stats and initialize syncp.
      
      Fixes: 22e0f8b9 "net: sched: make bstats per cpu and estimator RCU safe"
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Acked-by: Cong Wang <cwang@twopensource.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netlink: Re-add locking to netlink_lookup() and seq walker · 78fd1d0a
      Thomas Graf committed
      The synchronize_rcu() in netlink_release() introduces unacceptable
      latency. Reintroduce minimal lookup so we can drop the
      synchronize_rcu() until socket destruction has been RCUfied.
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Reported-by: Steinar H. Gunderson <sgunderson@bigfoot.com>
      Reported-and-tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: fix lockdep warning when intra-node messages are delivered · 1a194c2d
      Ying Xue committed
      When running the tipcTC & tipcTS test suite, lockdep reports the
      following unsafe locking scenario:
      
      [ 1109.997854]
      [ 1109.997988] =================================
      [ 1109.998290] [ INFO: inconsistent lock state ]
      [ 1109.998575] 3.17.0-rc1+ #113 Not tainted
      [ 1109.998762] ---------------------------------
      [ 1109.998762] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      [ 1109.998762] swapper/7/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
      [ 1109.998762]  (slock-AF_TIPC){+.?...}, at: [<ffffffffa0011969>] tipc_sk_rcv+0x49/0x2b0 [tipc]
      [ 1109.998762] {SOFTIRQ-ON-W} state was registered at:
      [ 1109.998762]   [<ffffffff810a4770>] __lock_acquire+0x6a0/0x1d80
      [ 1109.998762]   [<ffffffff810a6555>] lock_acquire+0x95/0x1e0
      [ 1109.998762]   [<ffffffff81a2d1ce>] _raw_spin_lock+0x3e/0x80
      [ 1109.998762]   [<ffffffffa0011969>] tipc_sk_rcv+0x49/0x2b0 [tipc]
      [ 1109.998762]   [<ffffffffa0004fe8>] tipc_link_xmit+0xa8/0xc0 [tipc]
      [ 1109.998762]   [<ffffffffa000ec6f>] tipc_sendmsg+0x15f/0x550 [tipc]
      [ 1109.998762]   [<ffffffffa000f165>] tipc_connect+0x105/0x140 [tipc]
      [ 1109.998762]   [<ffffffff817676ee>] SYSC_connect+0xae/0xc0
      [ 1109.998762]   [<ffffffff81767b7e>] SyS_connect+0xe/0x10
      [ 1109.998762]   [<ffffffff817a9788>] compat_SyS_socketcall+0xb8/0x200
      [ 1109.998762]   [<ffffffff81a306e5>] sysenter_dispatch+0x7/0x1f
      [ 1109.998762] irq event stamp: 241060
      [ 1109.998762] hardirqs last  enabled at (241060): [<ffffffff8105a4ad>] __local_bh_enable_ip+0x6d/0xd0
      [ 1109.998762] hardirqs last disabled at (241059): [<ffffffff8105a46f>] __local_bh_enable_ip+0x2f/0xd0
      [ 1109.998762] softirqs last  enabled at (241020): [<ffffffff81059a52>] _local_bh_enable+0x22/0x50
      [ 1109.998762] softirqs last disabled at (241021): [<ffffffff8105a626>] irq_exit+0x96/0xc0
      [ 1109.998762]
      [ 1109.998762] other info that might help us debug this:
      [ 1109.998762]  Possible unsafe locking scenario:
      [ 1109.998762]
      [ 1109.998762]        CPU0
      [ 1109.998762]        ----
      [ 1109.998762]   lock(slock-AF_TIPC);
      [ 1109.998762]   <Interrupt>
      [ 1109.998762]     lock(slock-AF_TIPC);
      [ 1109.998762]
      [ 1109.998762]  *** DEADLOCK ***
      [ 1109.998762]
      [ 1109.998762] 2 locks held by swapper/7/0:
      [ 1109.998762]  #0:  (rcu_read_lock){......}, at: [<ffffffff81782dc9>] __netif_receive_skb_core+0x69/0xb70
      [ 1109.998762]  #1:  (rcu_read_lock){......}, at: [<ffffffffa0001c90>] tipc_l2_rcv_msg+0x40/0x260 [tipc]
      [ 1109.998762]
      [ 1109.998762] stack backtrace:
      [ 1109.998762] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 3.17.0-rc1+ #113
      [ 1109.998762] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
      [ 1109.998762]  ffffffff82745830 ffff880016c03828 ffffffff81a209eb 0000000000000007
      [ 1109.998762]  ffff880017b3cac0 ffff880016c03888 ffffffff81a1c5ef 0000000000000001
      [ 1109.998762]  ffff880000000001 ffff880000000000 ffffffff81012d4f 0000000000000000
      [ 1109.998762] Call Trace:
      [ 1109.998762]  <IRQ>  [<ffffffff81a209eb>] dump_stack+0x4e/0x68
      [ 1109.998762]  [<ffffffff81a1c5ef>] print_usage_bug+0x1f1/0x202
      [ 1109.998762]  [<ffffffff81012d4f>] ? save_stack_trace+0x2f/0x50
      [ 1109.998762]  [<ffffffff810a406c>] mark_lock+0x28c/0x2f0
      [ 1109.998762]  [<ffffffff810a3440>] ? print_irq_inversion_bug.part.46+0x1f0/0x1f0
      [ 1109.998762]  [<ffffffff810a467d>] __lock_acquire+0x5ad/0x1d80
      [ 1109.998762]  [<ffffffff810a70dd>] ? trace_hardirqs_on+0xd/0x10
      [ 1109.998762]  [<ffffffff8108ace8>] ? sched_clock_cpu+0x98/0xc0
      [ 1109.998762]  [<ffffffff8108ad2b>] ? local_clock+0x1b/0x30
      [ 1109.998762]  [<ffffffff810a10dc>] ? lock_release_holdtime.part.29+0x1c/0x1a0
      [ 1109.998762]  [<ffffffff8108aa05>] ? sched_clock_local+0x25/0x90
      [ 1109.998762]  [<ffffffffa000dec0>] ? tipc_sk_get+0x60/0x80 [tipc]
      [ 1109.998762]  [<ffffffff810a6555>] lock_acquire+0x95/0x1e0
      [ 1109.998762]  [<ffffffffa0011969>] ? tipc_sk_rcv+0x49/0x2b0 [tipc]
      [ 1109.998762]  [<ffffffff810a6fb6>] ? trace_hardirqs_on_caller+0xa6/0x1c0
      [ 1109.998762]  [<ffffffff81a2d1ce>] _raw_spin_lock+0x3e/0x80
      [ 1109.998762]  [<ffffffffa0011969>] ? tipc_sk_rcv+0x49/0x2b0 [tipc]
      [ 1109.998762]  [<ffffffffa000dec0>] ? tipc_sk_get+0x60/0x80 [tipc]
      [ 1109.998762]  [<ffffffffa0011969>] tipc_sk_rcv+0x49/0x2b0 [tipc]
      [ 1109.998762]  [<ffffffffa00076bd>] tipc_rcv+0x5ed/0x960 [tipc]
      [ 1109.998762]  [<ffffffffa0001d1c>] tipc_l2_rcv_msg+0xcc/0x260 [tipc]
      [ 1109.998762]  [<ffffffffa0001c90>] ? tipc_l2_rcv_msg+0x40/0x260 [tipc]
      [ 1109.998762]  [<ffffffff81783345>] __netif_receive_skb_core+0x5e5/0xb70
      [ 1109.998762]  [<ffffffff81782dc9>] ? __netif_receive_skb_core+0x69/0xb70
      [ 1109.998762]  [<ffffffff81784eb9>] ? dev_gro_receive+0x259/0x4e0
      [ 1109.998762]  [<ffffffff817838f6>] __netif_receive_skb+0x26/0x70
      [ 1109.998762]  [<ffffffff81783acd>] netif_receive_skb_internal+0x2d/0x1f0
      [ 1109.998762]  [<ffffffff81785518>] napi_gro_receive+0xd8/0x240
      [ 1109.998762]  [<ffffffff815bf854>] e1000_clean_rx_irq+0x2c4/0x530
      [ 1109.998762]  [<ffffffff815c1a46>] e1000_clean+0x266/0x9c0
      [ 1109.998762]  [<ffffffff8108ad2b>] ? local_clock+0x1b/0x30
      [ 1109.998762]  [<ffffffff8108aa05>] ? sched_clock_local+0x25/0x90
      [ 1109.998762]  [<ffffffff817842b1>] net_rx_action+0x141/0x310
      [ 1109.998762]  [<ffffffff810bd710>] ? handle_fasteoi_irq+0xe0/0x150
      [ 1109.998762]  [<ffffffff81059fa6>] __do_softirq+0x116/0x4d0
      [ 1109.998762]  [<ffffffff8105a626>] irq_exit+0x96/0xc0
      [ 1109.998762]  [<ffffffff81a30d07>] do_IRQ+0x67/0x110
      [ 1109.998762]  [<ffffffff81a2ee2f>] common_interrupt+0x6f/0x6f
      [ 1109.998762]  <EOI>  [<ffffffff8100d2b7>] ? default_idle+0x37/0x250
      [ 1109.998762]  [<ffffffff8100d2b5>] ? default_idle+0x35/0x250
      [ 1109.998762]  [<ffffffff8100dd1f>] arch_cpu_idle+0xf/0x20
      [ 1109.998762]  [<ffffffff810999fd>] cpu_startup_entry+0x27d/0x4d0
      [ 1109.998762]  [<ffffffff81034c78>] start_secondary+0x188/0x1f0
      
      When intra-node messages are delivered from one process to another,
      tipc_link_xmit() doesn't disable BH before it directly calls
      tipc_sk_rcv() in process context to forward messages to the destination
      socket. Meanwhile, if messages from a remote node arrive at the node and
      are destined for the same socket, tipc_sk_rcv() running in process
      context might be preempted by tipc_sk_rcv() running in BH context. The
      latter then cannot obtain the socket lock because it is held by the
      former, while the former has no chance to run because the latter owns
      the CPU, so a deadlock happens. To avoid it, BH should always be
      disabled in tipc_sk_rcv().
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: fix a potential deadlock · 7b8613e0
      Ying Xue committed
      The lock dependency checker detected the possible unsafe locking
      scenario below:
      
                 CPU0                          CPU1
      T0:  tipc_named_rcv()                tipc_rcv()
      T1:  [grab nametbl write lock]*      [grab node lock]*
      T2:  tipc_update_nametbl()           tipc_node_link_up()
      T3:  tipc_nodesub_subscribe()        tipc_nametbl_publish()
      T4:  [grab node lock]*               [grab nametbl write lock]*
      
      Taking the nametbl write lock and the node lock in opposite order on
      the two paths above may result in a deadlock. If we move the update of
      the name table that follows a link state change out of the node lock,
      the reversed lock order is eliminated, and with it the deadlock risk.
      Signed-off-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
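The lock-order inversion described in this commit can be modelled with a tiny order tracker, loosely in the spirit of a lock dependency checker. This is a sketch under stated assumptions: no real locks or threads are involved, and every name below (`order_tracker`, `path_named_rcv()`, `path_link_up_nested()`, `path_link_up_deferred()`) is hypothetical rather than taken from TIPC.

```c
#include <stdbool.h>

enum lock_id { NAMETBL_LOCK, NODE_LOCK, NUM_LOCKS };

struct order_tracker {
    bool held[NUM_LOCKS];
    /* before[a][b]: lock b was taken at least once while a was held. */
    bool before[NUM_LOCKS][NUM_LOCKS];
};

static void tracker_acquire(struct order_tracker *t, enum lock_id l)
{
    for (int h = 0; h < NUM_LOCKS; h++)
        if (t->held[h])
            t->before[h][l] = true;
    t->held[l] = true;
}

static void tracker_release(struct order_tracker *t, enum lock_id l)
{
    t->held[l] = false;
}

/* True if two locks were ever taken in both orders (deadlock risk). */
static bool tracker_has_inversion(const struct order_tracker *t)
{
    for (int a = 0; a < NUM_LOCKS; a++)
        for (int b = a + 1; b < NUM_LOCKS; b++)
            if (t->before[a][b] && t->before[b][a])
                return true;
    return false;
}

/* CPU0 column: name table lock first, then node lock. */
static void path_named_rcv(struct order_tracker *t)
{
    tracker_acquire(t, NAMETBL_LOCK);
    tracker_acquire(t, NODE_LOCK);
    tracker_release(t, NODE_LOCK);
    tracker_release(t, NAMETBL_LOCK);
}

/* CPU1 column before the fix: name table updated under the node lock. */
static void path_link_up_nested(struct order_tracker *t)
{
    tracker_acquire(t, NODE_LOCK);
    tracker_acquire(t, NAMETBL_LOCK);
    tracker_release(t, NAMETBL_LOCK);
    tracker_release(t, NODE_LOCK);
}

/* CPU1 column after the fix: name table updated once the node lock
 * has been dropped, so the two locks are never nested on this path. */
static void path_link_up_deferred(struct order_tracker *t)
{
    tracker_acquire(t, NODE_LOCK);
    tracker_release(t, NODE_LOCK);
    tracker_acquire(t, NAMETBL_LOCK);
    tracker_release(t, NAMETBL_LOCK);
}
```

Running `path_named_rcv()` and `path_link_up_nested()` against one tracker records both orders and reports an inversion. Replacing the second path with `path_link_up_deferred()` leaves only the nametbl-then-node order, which is the property the commit relies on.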
  9. 21 Oct, 2014 3 commits
    • net: core: handle encapsulation offloads when computing segment lengths · f993bc25
      Florian Westphal committed
      If ->encapsulation is set we have to use inner_tcp_hdrlen and add the
      size of the inner network headers too.
      
      This is 'mostly harmless'; tbf might send an skb that is slightly over
      quota, or drop an skb even though it would have fit.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
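The length computation can be sketched with plain offsets instead of a real skb. This is a simplified model under stated assumptions: `struct pkt_layout` and `transport_seglen()` are hypothetical, the payload is assumed to be TCP, and everything between the outer transport header and the inner transport header (tunnel header plus inner link and network headers) is counted as header overhead.

```c
#include <stdbool.h>

/* Hypothetical, flattened view of the fields the computation needs. */
struct pkt_layout {
    bool encapsulation;                /* packet is tunnelled */
    unsigned int transport_off;        /* offset of the outer transport header */
    unsigned int inner_transport_off;  /* offset of the inner transport header */
    unsigned int tcp_hdrlen;           /* TCP header length (inner one if tunnelled) */
    unsigned int gso_size;             /* payload bytes per segment */
};

/* Length of one segment measured from the (outer) transport header. */
static unsigned int transport_seglen(const struct pkt_layout *p)
{
    unsigned int hdrlen = p->tcp_hdrlen;

    /* For tunnels, add everything between the outer and inner
     * transport headers on top of the inner TCP header. */
    if (p->encapsulation)
        hdrlen += p->inner_transport_off - p->transport_off;

    return hdrlen + p->gso_size;
}
```

For a plain TCP packet with a 20-byte header and a gso_size of 1448 this gives 1468 bytes. If the same payload is tunnelled with 50 bytes between the two transport headers, it gives 1518 bytes, which is the difference a shaper such as tbf would otherwise miss.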
    • net: make skb_gso_segment error handling more robust · 330966e5
      Florian Westphal committed
      skb_gso_segment has three possible return values:
      1. a pointer to the first segmented skb
      2. an errno value (IS_ERR())
      3. NULL.  This can happen when GSO is used for header verification.
      
      However, several callers currently test IS_ERR instead of IS_ERR_OR_NULL
      and would oops when NULL is returned.
      
      Note that these call sites should never actually see such a NULL return
      value; all callers mask out the GSO bits in the feature argument.
      
      However, there have been issues with some protocol handlers erroneously
      not respecting the specified feature mask in some cases.
      
      It is preferable to get 'have to turn off hw offloading, else slow' reports
      rather than 'kernel crashes'.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
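The three return values and the two checks can be reproduced in userspace. The helpers below are simplified stand-ins written for this sketch, modelled on the kernel's error-pointer convention. `struct seg`, `fake_segment()` and `safe_to_use()` are hypothetical and only imitate a segmentation function's return behaviour.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace stand-ins: small negative errno values are encoded in the
 * top 4095 addresses of the pointer range. */
static inline void *ERR_PTR(intptr_t error) { return (void *)error; }
static inline intptr_t PTR_ERR(const void *ptr) { return (intptr_t)ptr; }

static inline bool IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline bool IS_ERR_OR_NULL(const void *ptr)
{
    return ptr == NULL || IS_ERR(ptr);
}

struct seg { int len; };

enum outcome { OUT_OK, OUT_ERR, OUT_NULL };

/* Imitates the three possible results: valid pointer, errno, NULL. */
static struct seg *fake_segment(enum outcome o, struct seg *storage)
{
    switch (o) {
    case OUT_OK:
        return storage;
    case OUT_ERR:
        return ERR_PTR(-EINVAL);
    default:
        return NULL;
    }
}

/* Returns 0 when segs may be dereferenced, -1 otherwise. */
static int safe_to_use(const struct seg *segs)
{
    return IS_ERR_OR_NULL(segs) ? -1 : 0;
}
```

A caller that only tests `IS_ERR()` lets the NULL case through and dereferences it next. Testing `IS_ERR_OR_NULL()` turns that case into an ordinary failure path instead.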
    • net: gso: use feature flag argument in all protocol gso handlers · 1e16aa3d
      Florian Westphal committed
      skb_gso_segment() has a 'features' argument representing offload features
      available to the output path.
      
      A few handlers, e.g. GRE, re-fetch the features of skb->dev and use
      those instead of the provided ones when handling encapsulation/tunnels.
      
      Depending on dev->hw_enc_features of the output device, skb_gso_segment()
      can then return NULL even when the caller has disabled all GSO feature
      bits, because the inner header segmentation assumes the device will take
      care of segmentation.
      
      This affects, for example, the tbf scheduler, which will silently drop
      GRE-encapsulated GSO skbs that did not fit the remaining token quota,
      because segmentation does not work when the device supports the
      corresponding hw offload capabilities.
      
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 20 Oct, 2014 1 commit
  11. 19 Oct, 2014 2 commits