1. 01 12月, 2015 1 次提交
  2. 25 11月, 2015 1 次提交
  3. 23 11月, 2015 1 次提交
    • N
      net: ipmr: fix static mfc/dev leaks on table destruction · 0e615e96
      Nikolay Aleksandrov 提交于
      When destroying an mrt table the static mfc entries and the static
      devices are kept, which leads to devices that can never be destroyed
      (because of refcnt taken) and leaked memory, for example:
      unreferenced object 0xffff880034c144c0 (size 192):
        comm "mfc-broken", pid 4777, jiffies 4320349055 (age 46001.964s)
        hex dump (first 32 bytes):
          98 53 f0 34 00 88 ff ff 98 53 f0 34 00 88 ff ff  .S.4.....S.4....
          ef 0a 0a 14 01 02 03 04 00 00 00 00 01 00 00 00  ................
        backtrace:
          [<ffffffff815c1b9e>] kmemleak_alloc+0x4e/0xb0
          [<ffffffff811ea6e0>] kmem_cache_alloc+0x190/0x300
          [<ffffffff815931cb>] ip_mroute_setsockopt+0x5cb/0x910
          [<ffffffff8153d575>] do_ip_setsockopt.isra.11+0x105/0xff0
          [<ffffffff8153e490>] ip_setsockopt+0x30/0xa0
          [<ffffffff81564e13>] raw_setsockopt+0x33/0x90
          [<ffffffff814d1e14>] sock_common_setsockopt+0x14/0x20
          [<ffffffff814d0b51>] SyS_setsockopt+0x71/0xc0
          [<ffffffff815cdbf6>] entry_SYSCALL_64_fastpath+0x16/0x7a
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Make sure that everything is cleaned on netns destruction.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Reviewed-by: NCong Wang <cwang@twopensource.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e615e96
  4. 20 11月, 2015 3 次提交
  5. 19 11月, 2015 2 次提交
    • E
      tcp: md5: fix lockdep annotation · 1b8e6a01
      Eric Dumazet 提交于
      When a passive TCP is created, we eventually call tcp_md5_do_add()
      with sk pointing to the child. It is not owner by the user yet (we
      will add this socket into listener accept queue a bit later anyway)
      
      But we do own the spinlock, so amend the lockdep annotation to avoid
      following splat :
      
      [ 8451.090932] net/ipv4/tcp_ipv4.c:923 suspicious rcu_dereference_protected() usage!
      [ 8451.090932]
      [ 8451.090932] other info that might help us debug this:
      [ 8451.090932]
      [ 8451.090934]
      [ 8451.090934] rcu_scheduler_active = 1, debug_locks = 1
      [ 8451.090936] 3 locks held by socket_sockopt_/214795:
      [ 8451.090936]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff855c6ac1>] __netif_receive_skb_core+0x151/0xe90
      [ 8451.090947]  #1:  (rcu_read_lock){.+.+..}, at: [<ffffffff85618143>] ip_local_deliver_finish+0x43/0x2b0
      [ 8451.090952]  #2:  (slock-AF_INET){+.-...}, at: [<ffffffff855acda5>] sk_clone_lock+0x1c5/0x500
      [ 8451.090958]
      [ 8451.090958] stack backtrace:
      [ 8451.090960] CPU: 7 PID: 214795 Comm: socket_sockopt_
      
      [ 8451.091215] Call Trace:
      [ 8451.091216]  <IRQ>  [<ffffffff856fb29c>] dump_stack+0x55/0x76
      [ 8451.091229]  [<ffffffff85123b5b>] lockdep_rcu_suspicious+0xeb/0x110
      [ 8451.091235]  [<ffffffff8564544f>] tcp_md5_do_add+0x1bf/0x1e0
      [ 8451.091239]  [<ffffffff85645751>] tcp_v4_syn_recv_sock+0x1f1/0x4c0
      [ 8451.091242]  [<ffffffff85642b27>] ? tcp_v4_md5_hash_skb+0x167/0x190
      [ 8451.091246]  [<ffffffff85647c78>] tcp_check_req+0x3c8/0x500
      [ 8451.091249]  [<ffffffff856451ae>] ? tcp_v4_inbound_md5_hash+0x11e/0x190
      [ 8451.091253]  [<ffffffff85647170>] tcp_v4_rcv+0x3c0/0x9f0
      [ 8451.091256]  [<ffffffff85618143>] ? ip_local_deliver_finish+0x43/0x2b0
      [ 8451.091260]  [<ffffffff856181b6>] ip_local_deliver_finish+0xb6/0x2b0
      [ 8451.091263]  [<ffffffff85618143>] ? ip_local_deliver_finish+0x43/0x2b0
      [ 8451.091267]  [<ffffffff85618d38>] ip_local_deliver+0x48/0x80
      [ 8451.091270]  [<ffffffff85618510>] ip_rcv_finish+0x160/0x700
      [ 8451.091273]  [<ffffffff8561900e>] ip_rcv+0x29e/0x3d0
      [ 8451.091277]  [<ffffffff855c74b7>] __netif_receive_skb_core+0xb47/0xe90
      
      Fixes: a8afca03 ("tcp: md5: protects md5sig_info with RCU")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1b8e6a01
    • S
      945fae44
  6. 17 11月, 2015 1 次提交
  7. 16 11月, 2015 1 次提交
  8. 09 11月, 2015 1 次提交
  9. 06 11月, 2015 3 次提交
  10. 05 11月, 2015 3 次提交
    • D
      net: Fix prefsrc lookups · e1b8d903
      David Ahern 提交于
      A bug report (https://bugzilla.kernel.org/show_bug.cgi?id=107071) noted
      that the follwoing ip command is failing with v4.3:
      
          $ ip route add 10.248.5.0/24 dev bond0.250 table vlan_250 src 10.248.5.154
          RTNETLINK answers: Invalid argument
      
      021dd3b8 changed the lookup of the given preferred source address to
      use the table id passed in, but this assumes the local entries are in the
      given table which is not necessarily true for non-VRF use cases. When
      validating the preferred source fallback to the local table on failure.
      
      Fixes: 021dd3b8 ("net: Add routes to the table associated with the device")
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e1b8d903
    • W
      ipv4: fix a potential deadlock in mcast getsockopt() path · 87e9f031
      WANG Cong 提交于
      Sasha reported the following lockdep warning:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(sk_lock-AF_INET);
                                      lock(rtnl_mutex);
                                      lock(sk_lock-AF_INET);
         lock(rtnl_mutex);
      
      This is due to that for IP_MSFILTER and MCAST_MSFILTER, we take
      rtnl lock before the socket lock in setsockopt() path, but take
      the socket lock before rtnl lock in getsockopt() path. All the
      rest optnames are setsockopt()-only.
      
      Fix this by aligning the getsockopt() path with the setsockopt()
      path, so that all mcast socket path would be locked in the same
      order.
      
      Note, IPv6 part is different where rtnl lock is not held.
      
      Fixes: 54ff9ef3 ("ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop}")
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Reviewed-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      87e9f031
    • W
      ipv4: disable BH when changing ip local port range · 4ee3bd4a
      WANG Cong 提交于
      This fixes the following lockdep warning:
      
       [ INFO: inconsistent lock state ]
       4.3.0-rc7+ #1197 Not tainted
       ---------------------------------
       inconsistent {IN-SOFTIRQ-R} -> {SOFTIRQ-ON-W} usage.
       sysctl/1019 [HC0[0]:SC0[0]:HE1:SE1] takes:
        (&(&net->ipv4.ip_local_ports.lock)->seqcount){+.+-..}, at: [<ffffffff81921de7>] ipv4_local_port_range+0xb4/0x12a
       {IN-SOFTIRQ-R} state was registered at:
         [<ffffffff810bd682>] __lock_acquire+0x2f6/0xdf0
         [<ffffffff810be6d5>] lock_acquire+0x11c/0x1a4
         [<ffffffff818e599c>] inet_get_local_port_range+0x4e/0xae
         [<ffffffff8166e8e3>] udp_flow_src_port.constprop.40+0x23/0x116
         [<ffffffff81671cb9>] vxlan_xmit_one+0x219/0xa6a
         [<ffffffff81672f75>] vxlan_xmit+0xa6b/0xaa5
         [<ffffffff817f2deb>] dev_hard_start_xmit+0x2ae/0x465
         [<ffffffff817f35ed>] __dev_queue_xmit+0x531/0x633
         [<ffffffff817f3702>] dev_queue_xmit_sk+0x13/0x15
         [<ffffffff818004a5>] neigh_resolve_output+0x12f/0x14d
         [<ffffffff81959cfa>] ip6_finish_output2+0x344/0x39f
         [<ffffffff8195bf58>] ip6_finish_output+0x88/0x8e
         [<ffffffff8195bfef>] ip6_output+0x91/0xe5
         [<ffffffff819792ae>] dst_output_sk+0x47/0x4c
         [<ffffffff81979392>] NF_HOOK_THRESH.constprop.30+0x38/0x82
         [<ffffffff8197981e>] mld_sendpack+0x189/0x266
         [<ffffffff8197b28b>] mld_ifc_timer_expire+0x1ef/0x223
         [<ffffffff810de581>] call_timer_fn+0xfb/0x28c
         [<ffffffff810ded1e>] run_timer_softirq+0x1c7/0x1f1
      
      Fixes: b8f1a556 ("udp: Add function to make source port for UDP tunnels")
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4ee3bd4a
  11. 03 11月, 2015 4 次提交
    • E
      net: fix percpu memory leaks · 1d6119ba
      Eric Dumazet 提交于
      This patch fixes following problems :
      
      1) percpu_counter_init() can return an error, therefore
        init_frag_mem_limit() must propagate this error so that
        inet_frags_init_net() can do the same up to its callers.
      
      2) If ip[46]_frags_ns_ctl_register() fail, we must unwind
         properly and free the percpu_counter.
      
      Without this fix, we leave freed object in percpu_counters
      global list (if CONFIG_HOTPLUG_CPU) leading to crashes.
      
      This bug was detected by KASAN and syzkaller tool
      (http://github.com/google/syzkaller)
      
      Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1d6119ba
    • E
      net: make skb_set_owner_w() more robust · 9e17f8a4
      Eric Dumazet 提交于
      skb_set_owner_w() is called from various places that assume
      skb->sk always point to a full blown socket (as it changes
      sk->sk_wmem_alloc)
      
      We'd like to attach skb to request sockets, and in the future
      to timewait sockets as well. For these kind of pseudo sockets,
      we need to take a traditional refcount and use sock_edemux()
      as the destructor.
      
      It is now time to un-inline skb_set_owner_w(), being too big.
      
      Fixes: ca6fb065 ("tcp: attach SYNACK messages to request sockets instead of listener")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Bisected-by: NHaiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9e17f8a4
    • A
      ipmr: fix possible race resulting from improper usage of IP_INC_STATS_BH() in preemptible context. · 44f49dd8
      Ani Sinha 提交于
      Fixes the following kernel BUG :
      
      BUG: using __this_cpu_add() in preemptible [00000000] code: bash/2758
      caller is __this_cpu_preempt_check+0x13/0x15
      CPU: 0 PID: 2758 Comm: bash Tainted: P           O   3.18.19 #2
       ffffffff8170eaca ffff880110d1b788 ffffffff81482b2a 0000000000000000
       0000000000000000 ffff880110d1b7b8 ffffffff812010ae ffff880007cab800
       ffff88001a060800 ffff88013a899108 ffff880108b84240 ffff880110d1b7c8
      Call Trace:
      [<ffffffff81482b2a>] dump_stack+0x52/0x80
      [<ffffffff812010ae>] check_preemption_disabled+0xce/0xe1
      [<ffffffff812010d4>] __this_cpu_preempt_check+0x13/0x15
      [<ffffffff81419d60>] ipmr_queue_xmit+0x647/0x70c
      [<ffffffff8141a154>] ip_mr_forward+0x32f/0x34e
      [<ffffffff8141af76>] ip_mroute_setsockopt+0xe03/0x108c
      [<ffffffff810553fc>] ? get_parent_ip+0x11/0x42
      [<ffffffff810e6974>] ? pollwake+0x4d/0x51
      [<ffffffff81058ac0>] ? default_wake_function+0x0/0xf
      [<ffffffff810553fc>] ? get_parent_ip+0x11/0x42
      [<ffffffff810613d9>] ? __wake_up_common+0x45/0x77
      [<ffffffff81486ea9>] ? _raw_spin_unlock_irqrestore+0x1d/0x32
      [<ffffffff810618bc>] ? __wake_up_sync_key+0x4a/0x53
      [<ffffffff8139a519>] ? sock_def_readable+0x71/0x75
      [<ffffffff813dd226>] do_ip_setsockopt+0x9d/0xb55
      [<ffffffff81429818>] ? unix_seqpacket_sendmsg+0x3f/0x41
      [<ffffffff813963fe>] ? sock_sendmsg+0x6d/0x86
      [<ffffffff813959d4>] ? sockfd_lookup_light+0x12/0x5d
      [<ffffffff8139650a>] ? SyS_sendto+0xf3/0x11b
      [<ffffffff810d5738>] ? new_sync_read+0x82/0xaa
      [<ffffffff813ddd19>] compat_ip_setsockopt+0x3b/0x99
      [<ffffffff813fb24a>] compat_raw_setsockopt+0x11/0x32
      [<ffffffff81399052>] compat_sock_common_setsockopt+0x18/0x1f
      [<ffffffff813c4d05>] compat_SyS_setsockopt+0x1a9/0x1cf
      [<ffffffff813c4149>] compat_SyS_socketcall+0x180/0x1e3
      [<ffffffff81488ea1>] cstar_dispatch+0x7/0x1e
      Signed-off-by: NAni Sinha <ani@arista.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      44f49dd8
    • P
      ipv4: use l4 hash for locally generated multipath flows · 9920e48b
      Paolo Abeni 提交于
      This patch changes how the multipath hash is computed for locally
      generated flows: now the hash comprises l4 information.
      
      This allows better utilization of the available paths when the existing
      flows have the same source IP and the same destination IP: with l3 hash,
      even when multiple connections are in place simultaneously, a single path
      will be used, while with l4 hash we can use all the available paths.
      
      v2 changes:
      - use get_hash_from_flowi4() instead of implementing just another l4 hash
        function
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9920e48b
  12. 02 11月, 2015 4 次提交
    • J
      ipv4: update RTNH_F_LINKDOWN flag on UP event · c9b3292e
      Julian Anastasov 提交于
      When nexthop is part of multipath route we should clear the
      LINKDOWN flag when link goes UP or when first address is added.
      This is needed because we always set LINKDOWN flag when DEAD flag
      was set but now on UP the nexthop is not dead anymore. Examples when
      LINKDOWN bit can be forgotten when no NETDEV_CHANGE is delivered:
      
      - link goes down (LINKDOWN is set), then link goes UP and device
      shows carrier OK but LINKDOWN remains set
      
      - last address is deleted (LINKDOWN is set), then address is
      added and device shows carrier OK but LINKDOWN remains set
      
      Steps to reproduce:
      modprobe dummy
      ifconfig dummy0 192.168.168.1 up
      
      here add a multipath route where one nexthop is for dummy0:
      
      ip route add 1.2.3.4 nexthop dummy0 nexthop SOME_OTHER_DEVICE
      ifconfig dummy0 down
      ifconfig dummy0 up
      
      now ip route shows nexthop that is not dead. Now set the sysctl var:
      
      echo 1 > /proc/sys/net/ipv4/conf/dummy0/ignore_routes_with_linkdown
      
      now ip route will show a dead nexthop because the forgotten
      RTNH_F_LINKDOWN is propagated as RTNH_F_DEAD.
      
      Fixes: 8a3d0316 ("net: track link-status of ipv4 nexthops")
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9b3292e
    • J
      ipv4: fix to not remove local route on link down · 4f823def
      Julian Anastasov 提交于
      When fib_netdev_event calls fib_disable_ip on NETDEV_DOWN event
      we should not delete the local routes if the local address
      is still present. The confusion comes from the fact that both
      fib_netdev_event and fib_inetaddr_event use the NETDEV_DOWN
      constant. Fix it by returning back the variable 'force'.
      
      Steps to reproduce:
      modprobe dummy
      ifconfig dummy0 192.168.168.1 up
      ifconfig dummy0 down
      ip route list table local | grep dummy | grep host
      local 192.168.168.1 dev dummy0  proto kernel  scope host  src 192.168.168.1
      
      Fixes: 8a3d0316 ("net: track link-status of ipv4 nexthops")
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f823def
    • H
      ipv4: add defensive check for CHECKSUM_PARTIAL skbs in ip_fragment · dbd3393c
      Hannes Frederic Sowa 提交于
      CHECKSUM_PARTIAL skbs should never arrive in ip_fragment. If we get one
      of those warn about them once and handle them gracefully by recalculating
      the checksum.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Cc: Benjamin Coddington <bcodding@redhat.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dbd3393c
    • H
      ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets · d749c9cb
      Hannes Frederic Sowa 提交于
      We cannot reliable calculate packet size on MSG_MORE corked sockets
      and thus cannot decide if they are going to be fragmented later on,
      so better not use CHECKSUM_PARTIAL in the first place.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Cc: Benjamin Coddington <bcodding@redhat.com>
      Cc: Tom Herbert <tom@herbertland.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d749c9cb
  13. 28 10月, 2015 1 次提交
  14. 27 10月, 2015 1 次提交
  15. 23 10月, 2015 6 次提交
    • E
      tcp/dccp: fix hashdance race for passive sessions · 5e0724d0
      Eric Dumazet 提交于
      Multiple cpus can process duplicates of incoming ACK messages
      matching a SYN_RECV request socket. This is a rare event under
      normal operations, but definitely can happen.
      
      Only one must win the race, otherwise corruption would occur.
      
      To fix this without adding new atomic ops, we use logic in
      inet_ehash_nolisten() to detect the request was present in the same
      ehash bucket where we try to insert the new child.
      
      If request socket was not found, we have to undo the child creation.
      
      This actually removes a spin_lock()/spin_unlock() pair in
      reqsk_queue_unlink() for the fast path.
      
      Fixes: e994b2f0 ("tcp: do not lock listener to process SYN packets")
      Fixes: 079096f1 ("tcp/dccp: install syn_recv requests into ehash table")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e0724d0
    • P
      ipv4: implement support for NOPREFIXROUTE ifa flag for ipv4 address · 7b131180
      Paolo Abeni 提交于
      Currently adding a new ipv4 address always cause the creation of the
      related network route, with default metric. When a host has multiple
      interfaces on the same network, multiple routes with the same metric
      are created.
      
      If the userspace wants to set specific metric on each routes, i.e.
      giving better metric to ethernet links in respect to Wi-Fi ones,
      the network routes must be deleted and recreated, which is error-prone.
      
      This patch implements the support for IFA_F_NOPREFIXROUTE for ipv4
      address. When an address is added with such flag set, no associated
      network route is created, no network route is deleted when
      said IP is gone and it's up to the user space manage such route.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7b131180
    • A
      tcp: allow dctcp alpha to drop to zero · c80dbe04
      Andrew Shewmaker 提交于
      If alpha is strictly reduced by alpha >> dctcp_shift_g and if alpha is less
      than 1 << dctcp_shift_g, then alpha may never reach zero. For example,
      given shift_g=4 and alpha=15, alpha >> dctcp_shift_g yields 0 and alpha
      remains 15. The effect isn't noticeable in this case below cwnd=137, but
      could gradually drive uncongested flows with leftover alpha down to
      cwnd=137. A larger dctcp_shift_g would have a greater effect.
      
      This change causes alpha=15 to drop to 0 instead of being decrementing by 1
      as it would when alpha=16. However, it requires one less conditional to
      implement since it doesn't have to guard against subtracting 1 from 0U. A
      decay of 15 is not unreasonable since an equal or greater amount occurs at
      alpha >= 240.
      Signed-off-by: NAndrew G. Shewmaker <agshew@gmail.com>
      Acked-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c80dbe04
    • S
      xfrm4: Reload skb header pointers after calling pskb_may_pull. · ea673a4d
      Steffen Klassert 提交于
      A call to pskb_may_pull may change the pointers into the packet,
      so reload the pointers after the call.
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      ea673a4d
    • S
      xfrm4: Fix header checks in _decode_session4. · 1a14f1e5
      Steffen Klassert 提交于
      We skip the header informations if the data pointer points
      already behind the header in question for some protocols.
      This is because we call pskb_may_pull with a negative value
      converted to unsigened int from pskb_may_pull in this case.
      Skipping the header informations can lead to incorrect policy
      lookups, so fix it by a check of the data pointer position
      before we call pskb_may_pull.
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      1a14f1e5
    • P
      openvswitch: Fix egress tunnel info. · fc4099f1
      Pravin B Shelar 提交于
      While transitioning to netdev based vport we broke OVS
      feature which allows user to retrieve tunnel packet egress
      information for lwtunnel devices.  Following patch fixes it
      by introducing ndo operation to get the tunnel egress info.
      Same ndo operation can be used for lwtunnel devices and compat
      ovs-tnl-vport devices. So after adding such device operation
      we can remove similar operation from ovs-vport.
      
      Fixes: 614732ea ("openvswitch: Use regular VXLAN net_device device").
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc4099f1
  16. 22 10月, 2015 2 次提交
    • R
      tcp: remove improper preemption check in tcp_xmit_probe_skb() · e2e8009f
      Renato Westphal 提交于
      Commit e520af48 introduced the following bug when setting the
      TCP_REPAIR sockoption:
      
      [ 2860.657036] BUG: using __this_cpu_add() in preemptible [00000000] code: daemon/12164
      [ 2860.657045] caller is __this_cpu_preempt_check+0x13/0x20
      [ 2860.657049] CPU: 1 PID: 12164 Comm: daemon Not tainted 4.2.3 #1
      [ 2860.657051] Hardware name: Dell Inc. PowerEdge R210 II/0JP7TR, BIOS 2.0.5 03/13/2012
      [ 2860.657054]  ffffffff81c7f071 ffff880231e9fdf8 ffffffff8185d765 0000000000000002
      [ 2860.657058]  0000000000000001 ffff880231e9fe28 ffffffff8146ed91 ffff880231e9fe18
      [ 2860.657062]  ffffffff81cd1a5d ffff88023534f200 ffff8800b9811000 ffff880231e9fe38
      [ 2860.657065] Call Trace:
      [ 2860.657072]  [<ffffffff8185d765>] dump_stack+0x4f/0x7b
      [ 2860.657075]  [<ffffffff8146ed91>] check_preemption_disabled+0xe1/0xf0
      [ 2860.657078]  [<ffffffff8146edd3>] __this_cpu_preempt_check+0x13/0x20
      [ 2860.657082]  [<ffffffff817e0bc7>] tcp_xmit_probe_skb+0xc7/0x100
      [ 2860.657085]  [<ffffffff817e1e2d>] tcp_send_window_probe+0x2d/0x30
      [ 2860.657089]  [<ffffffff817d1d8c>] do_tcp_setsockopt.isra.29+0x74c/0x830
      [ 2860.657093]  [<ffffffff817d1e9c>] tcp_setsockopt+0x2c/0x30
      [ 2860.657097]  [<ffffffff81767b74>] sock_common_setsockopt+0x14/0x20
      [ 2860.657100]  [<ffffffff817669e1>] SyS_setsockopt+0x71/0xc0
      [ 2860.657104]  [<ffffffff81865172>] entry_SYSCALL_64_fastpath+0x16/0x75
      
      Since tcp_xmit_probe_skb() can be called from process context, use
      NET_INC_STATS() instead of NET_INC_STATS_BH().
      
      Fixes: e520af48 ("tcp: add TCPWinProbe and TCPKeepAlive SNMP counters")
      Signed-off-by: NRenato Westphal <renatow@taghos.com.br>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2e8009f
    • A
      netlink: Rightsize IFLA_AF_SPEC size calculation · b1974ed0
      Arad, Ronen 提交于
      if_nlmsg_size() overestimates the minimum allocation size of netlink
      dump request (when called from rtnl_calcit()) or the size of the
      message (when called from rtnl_getlink()). This is because
      ext_filter_mask is not supported by rtnl_link_get_af_size() and
      rtnl_link_get_size().
      
      The over-estimation is significant when at least one netdev has many
      VLANs configured (8 bytes for each configured VLAN).
      
      This patch-set "rightsizes" the protocol specific attribute size
      calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
      and adding this a argument to get_link_af_size op in rtnl_af_ops.
      
      Bridge module already used filtering aware sizing for notifications.
      br_get_link_af_size_filtered() is consistent with the modified
      get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
      br_get_link_af_size() becomes unused and thus removed.
      Signed-off-by: NRonen Arad <ronen.arad@intel.com>
      Acked-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b1974ed0
  17. 21 10月, 2015 5 次提交
    • Y
      tcp: use RACK to detect losses · 4f41b1c5
      Yuchung Cheng 提交于
      This patch implements the second half of RACK that uses the the most
      recent transmit time among all delivered packets to detect losses.
      
      tcp_rack_mark_lost() is called upon receiving a dubious ACK.
      It then checks if an not-yet-sacked packet was sent at least
      "reo_wnd" prior to the sent time of the most recently delivered.
      If so the packet is deemed lost.
      
      The "reo_wnd" reordering window starts with 1msec for fast loss
      detection and changes to min-RTT/4 when reordering is observed.
      We found 1msec accommodates well on tiny degree of reordering
      (<3 pkts) on faster links. We use min-RTT instead of SRTT because
      reordering is more of a path property but SRTT can be inflated by
      self-inflicated congestion. The factor of 4 is borrowed from the
      delayed early retransmit and seems to work reasonably well.
      
      Since RACK is still experimental, it is now used as a supplemental
      loss detection on top of existing algorithms. It is only effective
      after the fast recovery starts or after the timeout occurs. The
      fast recovery is still triggered by FACK and/or dupack threshold
      instead of RACK.
      
      We introduce a new sysctl net.ipv4.tcp_recovery for future
      experiments of loss recoveries. For now RACK can be disabled by
      setting it to 0.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f41b1c5
    • Y
      tcp: track the packet timings in RACK · 659a8ad5
      Yuchung Cheng 提交于
      This patch is the first half of the RACK loss recovery.
      
      RACK loss recovery uses the notion of time instead
      of packet sequence (FACK) or counts (dupthresh). It's inspired by the
      previous FACK heuristic in tcp_mark_lost_retrans(): when a limited
      transmit (new data packet) is sacked, then current retransmitted
      sequence below the newly sacked sequence must been lost,
      since at least one round trip time has elapsed.
      
      But it has several limitations:
      1) can't detect tail drops since it depends on limited transmit
      2) is disabled upon reordering (assumes no reordering)
      3) only enabled in fast recovery ut not timeout recovery
      
      RACK (Recently ACK) addresses these limitations with the notion
      of time instead: a packet P1 is lost if a later packet P2 is s/acked,
      as at least one round trip has passed.
      
      Since RACK cares about the time sequence instead of the data sequence
      of packets, it can detect tail drops when later retransmission is
      s/acked while FACK or dupthresh can't. For reordering RACK uses a
      dynamically adjusted reordering window ("reo_wnd") to reduce false
      positives on ever (small) degree of reordering.
      
      This patch implements tcp_advanced_rack() which tracks the
      most recent transmission time among the packets that have been
      delivered (ACKed or SACKed) in tp->rack.mstamp. This timestamp
      is the key to determine which packet has been lost.
      
      Consider an example that the sender sends six packets:
      T1: P1 (lost)
      T2: P2
      T3: P3
      T4: P4
      T100: sack of P2. rack.mstamp = T2
      T101: retransmit P1
      T102: sack of P2,P3,P4. rack.mstamp = T4
      T205: ACK of P4 since the hole is repaired. rack.mstamp = T101
      
      We need to be careful about spurious retransmission because it may
      falsely advance tp->rack.mstamp by an RTT or an RTO, causing RACK
      to falsely mark all packets lost, just like a spurious timeout.
      
      We identify spurious retransmission by the ACK's TS echo value.
      If TS option is not applicable but the retransmission is acknowledged
      less than min-RTT ago, it is likely to be spurious. We refrain from
      using the transmission time of these spurious retransmissions.
      
      The second half is implemented in the next patch that marks packet
      lost using RACK timestamp.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      659a8ad5
    • Y
      tcp: add tcp_tsopt_ecr_before helper · 77c63127
      Yuchung Cheng 提交于
      a helper to prepare the main RACK patch
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77c63127
    • Y
      tcp: remove tcp_mark_lost_retrans() · af82f4e8
      Yuchung Cheng 提交于
      Remove the existing lost retransmit detection because RACK subsumes
      it completely. This also stops the overloading the ack_seq field of
      the skb control block.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af82f4e8
    • Y
      tcp: track min RTT using windowed min-filter · f6722583
      Yuchung Cheng 提交于
      Kathleen Nichols' algorithm for tracking the minimum RTT of a
      data stream over some measurement window. It uses constant space
      and constant time per update. Yet it almost always delivers
      the same minimum as an implementation that has to keep all
      the data in the window. The measurement window is tunable via
      sysctl.net.ipv4.tcp_min_rtt_wlen with a default value of 5 minutes.
      
      The algorithm keeps track of the best, 2nd best & 3rd best min
      values, maintaining an invariant that the measurement time of
      the n'th best >= n-1'th best. It also makes sure that the three
      values are widely separated in the time window since that bounds
      the worse case error when that data is monotonically increasing
      over the window.
      
      Upon getting a new min, we can forget everything earlier because
      it has no value - the new min is less than everything else in the
      window by definition and it's the most recent. So we restart fresh
      on every new min and overwrites the 2nd & 3rd choices. The same
      property holds for the 2nd & 3rd best.
      
      Therefore we have to maintain two invariants to maximize the
      information in the samples, one on values (1st.v <= 2nd.v <=
      3rd.v) and the other on times (now-win <=1st.t <= 2nd.t <= 3rd.t <=
      now). These invariants determine the structure of the code
      
      The RTT input to the windowed filter is the minimum RTT measured
      from ACK or SACK, or as the last resort from TCP timestamps.
      
      The accessor tcp_min_rtt() returns the minimum RTT seen in the
      window. ~0U indicates it is not available. The minimum is 1usec
      even if the true RTT is below that.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6722583