1. 13 6月, 2015 7 次提交
  2. 02 6月, 2015 3 次提交
  3. 01 6月, 2015 3 次提交
    • F
      net: dsa: Properly propagate errors from dsa_switch_setup_one · 24595346
      Florian Fainelli 提交于
      While shuffling some code around, dsa_switch_setup_one() was introduced,
      and it was modified to return either an error code using ERR_PTR() or a
      NULL pointer when running out of memory or failing to setup a switch.
      
      This is a problem for its caler: dsa_switch_setup() which uses IS_ERR()
      and expects to find an error code, not a NULL pointer, so we still try
      to proceed with dsa_switch_setup() and operate on invalid memory
      addresses. This can be easily reproduced by having e.g: the bcm_sf2
      driver built-in, but having no such switch, such that drv->setup will
      fail.
      
      Fix this by using PTR_ERR() consistently which is both more informative
      and avoids for the caller to use IS_ERR_OR_NULL().
      
      Fixes: df197195 ("net: dsa: split dsa_switch_setup into two functions")
      Reported-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Tested-by: NAndrew Lunn <andrew@lunn.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      24595346
    • N
      tcp: fix child sockets to use system default congestion control if not set · 9f950415
      Neal Cardwell 提交于
      Linux 3.17 and earlier are explicitly engineered so that if the app
      doesn't specifically request a CC module on a listener before the SYN
      arrives, then the child gets the system default CC when the connection
      is established. See tcp_init_congestion_control() in 3.17 or earlier,
      which says "if no choice made yet assign the current value set as
      default". The change ("net: tcp: assign tcp cong_ops when tcp sk is
      created") altered these semantics, so that children got their parent
      listener's congestion control even if the system default had changed
      after the listener was created.
      
      This commit returns to those original semantics from 3.17 and earlier,
      since they are the original semantics from 2007 in 4d4d3d1e ("[TCP]:
      Congestion control initialization."), and some Linux congestion
      control workflows depend on that.
      
      In summary, if a listener socket specifically sets TCP_CONGESTION to
      "x", or the route locks the CC module to "x", then the child gets
      "x". Otherwise the child gets current system default from
      net.ipv4.tcp_congestion_control. That's the behavior in 3.17 and
      earlier, and this commit restores that.
      
      Fixes: 55d8694f ("net: tcp: assign tcp cong_ops when tcp sk is created")
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      Cc: Glenn Judd <glenn.judd@morganstanley.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9f950415
    • E
      udp: fix behavior of wrong checksums · beb39db5
      Eric Dumazet 提交于
      We have two problems in UDP stack related to bogus checksums :
      
      1) We return -EAGAIN to application even if receive queue is not empty.
         This breaks applications using edge trigger epoll()
      
      2) Under UDP flood, we can loop forever without yielding to other
         processes, potentially hanging the host, especially on non SMP.
      
      This patch is an attempt to make things better.
      
      We might in the future add extra support for rt applications
      wanting to better control time spent doing a recv() in a hostile
      environment. For example we could validate checksums before queuing
      packets in socket receive queue.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      beb39db5
  4. 31 5月, 2015 1 次提交
  5. 28 5月, 2015 4 次提交
    • A
      ip_vti/ip6_vti: Preserve skb->mark after rcv_cb call · d55c670c
      Alexander Duyck 提交于
      The vti6_rcv_cb and vti_rcv_cb calls were leaving the skb->mark modified
      after completing the function.  This resulted in the original skb->mark
      value being lost.  Since we only need skb->mark to be set for
      xfrm_policy_check we can pull the assignment into the rcv_cb calls and then
      just restore the original mark after xfrm_policy_check has been completed.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      d55c670c
    • A
      xfrm: Override skb->mark with tunnel->parm.i_key in xfrm_input · 049f8e2e
      Alexander Duyck 提交于
      This change makes it so that if a tunnel is defined we just use the mark
      from the tunnel instead of the mark from the skb header.  By doing this we
      can avoid the need to set skb->mark inside of the tunnel receive functions.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      049f8e2e
    • A
      ip_vti/ip6_vti: Do not touch skb->mark on xmit · cd5279c1
      Alexander Duyck 提交于
      Instead of modifying skb->mark we can simply modify the flowi_mark that is
      generated as a result of the xfrm_decode_session.  By doing this we don't
      need to actually touch the skb->mark and it can be preserved as it passes
      out through the tunnel.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      cd5279c1
    • W
      net_sched: invoke ->attach() after setting dev->qdisc · 86e363dc
      WANG Cong 提交于
      For mq qdisc, we add per tx queue qdisc to root qdisc
      for display purpose, however, that happens too early,
      before the new dev->qdisc is finally set, this causes
      q->list points to an old root qdisc which is going to be
      freed right before assigning with a new one.
      
      Fix this by moving ->attach() after setting dev->qdisc.
      
      For the record, this fixes the following crash:
      
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 975 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()
       list_del corruption. prev->next should be ffff8800d1998ae8, but was 6b6b6b6b6b6b6b6b
       CPU: 1 PID: 975 Comm: tc Not tainted 4.1.0-rc4+ #1019
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        0000000000000009 ffff8800d73fb928 ffffffff81a44e7f 0000000047574756
        ffff8800d73fb978 ffff8800d73fb968 ffffffff810790da ffff8800cfc4cd20
        ffffffff814e725b ffff8800d1998ae8 ffffffff82381250 0000000000000000
       Call Trace:
        [<ffffffff81a44e7f>] dump_stack+0x4c/0x65
        [<ffffffff810790da>] warn_slowpath_common+0x9c/0xb6
        [<ffffffff814e725b>] ? __list_del_entry+0x5a/0x98
        [<ffffffff81079162>] warn_slowpath_fmt+0x46/0x48
        [<ffffffff81820eb0>] ? dev_graft_qdisc+0x5e/0x6a
        [<ffffffff814e725b>] __list_del_entry+0x5a/0x98
        [<ffffffff814e72a7>] list_del+0xe/0x2d
        [<ffffffff81822f05>] qdisc_list_del+0x1e/0x20
        [<ffffffff81820cd1>] qdisc_destroy+0x30/0xd6
        [<ffffffff81822676>] qdisc_graft+0x11d/0x243
        [<ffffffff818233c1>] tc_get_qdisc+0x1a6/0x1d4
        [<ffffffff810b5eaf>] ? mark_lock+0x2e/0x226
        [<ffffffff817ff8f5>] rtnetlink_rcv_msg+0x181/0x194
        [<ffffffff817ff72e>] ? rtnl_lock+0x17/0x19
        [<ffffffff817ff72e>] ? rtnl_lock+0x17/0x19
        [<ffffffff817ff774>] ? __rtnl_unlock+0x17/0x17
        [<ffffffff81855dc6>] netlink_rcv_skb+0x4d/0x93
        [<ffffffff817ff756>] rtnetlink_rcv+0x26/0x2d
        [<ffffffff818544b2>] netlink_unicast+0xcb/0x150
        [<ffffffff81161db9>] ? might_fault+0x59/0xa9
        [<ffffffff81854f78>] netlink_sendmsg+0x4fa/0x51c
        [<ffffffff817d6e09>] sock_sendmsg_nosec+0x12/0x1d
        [<ffffffff817d8967>] sock_sendmsg+0x29/0x2e
        [<ffffffff817d8cf3>] ___sys_sendmsg+0x1b4/0x23a
        [<ffffffff8100a1b8>] ? native_sched_clock+0x35/0x37
        [<ffffffff810a1d83>] ? sched_clock_local+0x12/0x72
        [<ffffffff810a1fd4>] ? sched_clock_cpu+0x9e/0xb7
        [<ffffffff810def2a>] ? current_kernel_time+0xe/0x32
        [<ffffffff810b4bc5>] ? lock_release_holdtime.part.29+0x71/0x7f
        [<ffffffff810ddebf>] ? read_seqcount_begin.constprop.27+0x5f/0x76
        [<ffffffff810b6292>] ? trace_hardirqs_on_caller+0x17d/0x199
        [<ffffffff811b14d5>] ? __fget_light+0x50/0x78
        [<ffffffff817d9808>] __sys_sendmsg+0x42/0x60
        [<ffffffff817d9838>] SyS_sendmsg+0x12/0x1c
        [<ffffffff81a50e97>] system_call_fastpath+0x12/0x6f
       ---[ end trace ef29d3fb28e97ae7 ]---
      
      For long term, we probably need to clean up the qdisc_graft() code
      in case it hides other bugs like this.
      
      Fixes: 95dc1929 ("pkt_sched: give visibility to mq slave qdiscs")
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      86e363dc
  6. 27 5月, 2015 1 次提交
  7. 23 5月, 2015 6 次提交
    • E
      bridge: fix lockdep splat · 93a33a58
      Eric Dumazet 提交于
      Following lockdep splat was reported :
      
      [   29.382286] ===============================
      [   29.382315] [ INFO: suspicious RCU usage. ]
      [   29.382344] 4.1.0-0.rc0.git11.1.fc23.x86_64 #1 Not tainted
      [   29.382380] -------------------------------
      [   29.382409] net/bridge/br_private.h:626 suspicious
      rcu_dereference_check() usage!
      [   29.382455]
                     other info that might help us debug this:
      
      [   29.382507]
                     rcu_scheduler_active = 1, debug_locks = 0
      [   29.382549] 2 locks held by swapper/0/0:
      [   29.382576]  #0:  (((&p->forward_delay_timer))){+.-...}, at:
      [<ffffffff81139f75>] call_timer_fn+0x5/0x4f0
      [   29.382660]  #1:  (&(&br->lock)->rlock){+.-...}, at:
      [<ffffffffa0450dc1>] br_forward_delay_timer_expired+0x31/0x140
      [bridge]
      [   29.382754]
                     stack backtrace:
      [   29.382787] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
      4.1.0-0.rc0.git11.1.fc23.x86_64 #1
      [   29.382838] Hardware name: LENOVO 422916G/LENOVO, BIOS A1KT53AUS 04/07/2015
      [   29.382882]  0000000000000000 3ebfc20364115825 ffff880666603c48
      ffffffff81892d4b
      [   29.382943]  0000000000000000 ffffffff81e124e0 ffff880666603c78
      ffffffff8110bcd7
      [   29.383004]  ffff8800785c9d00 ffff88065485ac58 ffff880c62002800
      ffff880c5fc88ac0
      [   29.383065] Call Trace:
      [   29.383084]  <IRQ>  [<ffffffff81892d4b>] dump_stack+0x4c/0x65
      [   29.383130]  [<ffffffff8110bcd7>] lockdep_rcu_suspicious+0xe7/0x120
      [   29.383178]  [<ffffffffa04520f9>] br_fill_ifinfo+0x4a9/0x6a0 [bridge]
      [   29.383225]  [<ffffffffa045266b>] br_ifinfo_notify+0x11b/0x4b0 [bridge]
      [   29.383271]  [<ffffffffa0450d90>] ? br_hold_timer_expired+0x70/0x70 [bridge]
      [   29.383320]  [<ffffffffa0450de8>]
      br_forward_delay_timer_expired+0x58/0x140 [bridge]
      [   29.383371]  [<ffffffffa0450d90>] ? br_hold_timer_expired+0x70/0x70 [bridge]
      [   29.383416]  [<ffffffff8113a033>] call_timer_fn+0xc3/0x4f0
      [   29.383454]  [<ffffffff81139f75>] ? call_timer_fn+0x5/0x4f0
      [   29.383493]  [<ffffffff8110a90f>] ? lock_release_holdtime.part.29+0xf/0x200
      [   29.383541]  [<ffffffffa0450d90>] ? br_hold_timer_expired+0x70/0x70 [bridge]
      [   29.383587]  [<ffffffff8113a6a4>] run_timer_softirq+0x244/0x490
      [   29.383629]  [<ffffffff810b68cc>] __do_softirq+0xec/0x670
      [   29.383666]  [<ffffffff810b70d5>] irq_exit+0x145/0x150
      [   29.383703]  [<ffffffff8189f506>] smp_apic_timer_interrupt+0x46/0x60
      [   29.383744]  [<ffffffff8189d523>] apic_timer_interrupt+0x73/0x80
      [   29.383782]  <EOI>  [<ffffffff816f131f>] ? cpuidle_enter_state+0x5f/0x2f0
      [   29.383832]  [<ffffffff816f131b>] ? cpuidle_enter_state+0x5b/0x2f0
      
      Problem here is that br_forward_delay_timer_expired() is a timer
      handler, calling br_ifinfo_notify() which assumes either rcu_read_lock()
      or RTNL are held.
      
      Simplest fix seems to add rcu read lock section.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJosh Boyer <jwboyer@fedoraproject.org>
      Reported-by: NDominick Grift <dac.override@gmail.com>
      Cc: Vlad Yasevich <vyasevich@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93a33a58
    • A
      net: core: 'ethtool' issue with querying phy settings · f96dee13
      Arun Parameswaran 提交于
      When trying to configure the settings for PHY1, using commands
      like 'ethtool -s eth0 phyad 1 speed 100', the 'ethtool' seems to
      modify other settings apart from the speed of the PHY1, in the
      above case.
      
      The ethtool seems to query the settings for PHY0, and use this
      as the base to apply the new settings to the PHY1. This is
      causing the other settings of the PHY 1 to be wrongly
      configured.
      
      The issue is caused by the '_ethtool_get_settings()' API, which
      gets called because of the 'ETHTOOL_GSET' command, is clearing
      the 'cmd' pointer (of type 'struct ethtool_cmd') by calling
      memset. This clears all the parameters (if any) passed for the
      'ETHTOOL_GSET' cmd. So the driver's callback is always invoked
      with 'cmd->phy_address' as '0'.
      
      The '_ethtool_get_settings()' is called from other files in the
      'net/core'. So the fix is applied to the 'ethtool_get_settings()'
      which is only called in the context of the 'ethtool'.
      Signed-off-by: NArun Parameswaran <aparames@broadcom.com>
      Reviewed-by: NRay Jui <rjui@broadcom.com>
      Reviewed-by: NScott Branden <sbranden@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f96dee13
    • T
      bridge: fix parsing of MLDv2 reports · 47cc84ce
      Thadeu Lima de Souza Cascardo 提交于
      When more than a multicast address is present in a MLDv2 report, all but
      the first address is ignored, because the code breaks out of the loop if
      there has not been an error adding that address.
      
      This has caused failures when two guests connected through the bridge
      tried to communicate using IPv6. Neighbor discoveries would not be
      transmitted to the other guest when both used a link-local address and a
      static address.
      
      This only happens when there is a MLDv2 querier in the network.
      
      The fix will only break out of the loop when there is a failure adding a
      multicast address.
      
      The mdb before the patch:
      
      dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
      dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
      dev ovirtmgmt port bond0.86 grp ff02::2 temp
      
      After the patch:
      
      dev ovirtmgmt port vnet0 grp ff02::1:ff7d:6603 temp
      dev ovirtmgmt port vnet1 grp ff02::1:ff7d:6604 temp
      dev ovirtmgmt port bond0.86 grp ff02::fb temp
      dev ovirtmgmt port bond0.86 grp ff02::2 temp
      dev ovirtmgmt port bond0.86 grp ff02::d temp
      dev ovirtmgmt port vnet0 grp ff02::1:ff00:76 temp
      dev ovirtmgmt port bond0.86 grp ff02::16 temp
      dev ovirtmgmt port vnet1 grp ff02::1:ff00:77 temp
      dev ovirtmgmt port bond0.86 grp ff02::1:ff00:def temp
      dev ovirtmgmt port bond0.86 grp ff02::1:ffa1:40bf temp
      
      Fixes: 08b202b6 ("bridge br_multicast: IPv6 MLD support.")
      Reported-by: NRik Theys <Rik.Theys@esat.kuleuven.be>
      Signed-off-by: NThadeu Lima de Souza Cascardo <cascardo@redhat.com>
      Tested-by: NRik Theys <Rik.Theys@esat.kuleuven.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      47cc84ce
    • M
      ipv4: fill in table id when replacing a route · d4e64c29
      Michal Kubeček 提交于
      When replacing an IPv4 route, tb_id member of the new fib_alias
      structure is not set in the replace code path so that the new route is
      ignored.
      
      Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse")
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d4e64c29
    • E
      ipv4: Avoid crashing in ip_error · 381c759d
      Eric W. Biederman 提交于
      ip_error does not check if in_dev is NULL before dereferencing it.
      
      IThe following sequence of calls is possible:
      CPU A                          CPU B
      ip_rcv_finish
          ip_route_input_noref()
              ip_route_input_slow()
                                     inetdev_destroy()
          dst_input()
      
      With the result that a network device can be destroyed while processing
      an input packet.
      
      A crash was triggered with only unicast packets in flight, and
      forwarding enabled on the only network device.   The error condition
      was created by the removal of the network device.
      
      As such it is likely the that error code was -EHOSTUNREACH, and the
      action taken by ip_error (if in_dev had been accessible) would have
      been to not increment any counters and to have tried and likely failed
      to send an icmp error as the network device is going away.
      
      Therefore handle this weird case by just dropping the packet if
      !in_dev.  It will result in dropping the packet sooner, and will not
      result in an actual change of behavior.
      
      Fixes: 251da413 ("ipv4: Cache ip_error() routes even when not forwarding.")
      Reported-by: NVittorio Gambaletta <linuxbugs@vittgam.net>
      Tested-by: NVittorio Gambaletta <linuxbugs@vittgam.net>
      Signed-off-by: NVittorio Gambaletta <linuxbugs@vittgam.net>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      381c759d
    • E
      tcp: fix a potential deadlock in tcp_get_info() · d654976c
      Eric Dumazet 提交于
      Taking socket spinlock in tcp_get_info() can deadlock, as
      inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
      while packet processing can use the reverse locking order.
      
      We could avoid this locking for TCP_LISTEN states, but lockdep would
      certainly get confused as all TCP sockets share same lockdep classes.
      
      [  523.722504] ======================================================
      [  523.728706] [ INFO: possible circular locking dependency detected ]
      [  523.734990] 4.1.0-dbg-DEV #1676 Not tainted
      [  523.739202] -------------------------------------------------------
      [  523.745474] ss/18032 is trying to acquire lock:
      [  523.750002]  (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360
      [  523.758129]
      [  523.758129] but task is already holding lock:
      [  523.763968]  (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0
      [  523.774661]
      [  523.774661] which lock already depends on the new lock.
      [  523.774661]
      [  523.782850]
      [  523.782850] the existing dependency chain (in reverse order) is:
      [  523.790326]
      -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
      [  523.796599]        [<ffffffff811126bb>] lock_acquire+0xbb/0x270
      [  523.802565]        [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50
      [  523.808628]        [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110
      [  523.815273]        [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350
      [  523.822067]        [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500
      [  523.828199]        [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0
      [  523.834331]        [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10
      [  523.840202]        [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0
      [  523.847214]        [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0
      [  523.853440]        [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0
      [  523.859624]        [<ffffffff81659db7>] ip_rcv+0x307/0x420
      
      Lets use u64_sync infrastructure instead. As a bonus, 64bit
      arches get optimized, as these are nop for them.
      
      Fixes: 0df48c26 ("tcp: add tcpi_bytes_acked to tcp_info")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d654976c
  8. 22 5月, 2015 1 次提交
    • D
      net: sched: fix call_rcu() race on classifier module unloads · c78e1746
      Daniel Borkmann 提交于
      Vijay reported that a loop as simple as ...
      
        while true; do
          tc qdisc add dev foo root handle 1: prio
          tc filter add dev foo parent 1: u32 match u32 0 0  flowid 1
          tc qdisc del dev foo root
          rmmod cls_u32
        done
      
      ... will panic the kernel. Moreover, he bisected the change
      apparently introducing it to 78fd1d0a ("netlink: Re-add
      locking to netlink_lookup() and seq walker").
      
      The removal of synchronize_net() from the netlink socket
      triggering the qdisc to be removed, seems to have uncovered
      an RCU resp. module reference count race from the tc API.
      Given that RCU conversion was done after e341694e ("netlink:
      Convert netlink_lookup() to use RCU protected hash table")
      which added the synchronize_net() originally, occasion of
      hitting the bug was less likely (not impossible though):
      
      When qdiscs that i) support attaching classifiers and,
      ii) have at least one of them attached, get deleted, they
      invoke tcf_destroy_chain(), and thus call into ->destroy()
      handler from a classifier module.
      
      After RCU conversion, all classifier that have an internal
      prio list, unlink them and initiate freeing via call_rcu()
      deferral.
      
      Meanhile, tcf_destroy() releases already reference to the
      tp->ops->owner module before the queued RCU callback handler
      has been invoked.
      
      Subsequent rmmod on the classifier module is then not prevented
      since all module references are already dropped.
      
      By the time, the kernel invokes the RCU callback handler from
      the module, that function address is then invalid.
      
      One way to fix it would be to add an rcu_barrier() to
      unregister_tcf_proto_ops() to wait for all pending call_rcu()s
      to complete.
      
      synchronize_rcu() is not appropriate as under heavy RCU
      callback load, registered call_rcu()s could be deferred
      longer than a grace period. In case we don't have any pending
      call_rcu()s, the barrier is allowed to return immediately.
      
      Since we came here via unregister_tcf_proto_ops(), there
      are no users of a given classifier anymore. Further nested
      call_rcu()s pointing into the module space are not being
      done anywhere.
      
      Only cls_bpf_delete_prog() may schedule a work item, to
      unlock pages eventually, but that is not in the range/context
      of cls_bpf anymore.
      
      Fixes: 25d8c0d5 ("net: rcu-ify tcf_proto")
      Fixes: 9888faef ("net: sched: cls_basic use RCU")
      Reported-by: NVijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Tested-by: NVijay Subramanian <subramanian.vijay@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c78e1746
  9. 21 5月, 2015 5 次提交
    • H
      xfrm: Always zero high-order sequence number bits · 407d34ef
      Herbert Xu 提交于
      As we're now always including the high bits of the sequence number
      in the IV generation process we need to ensure that they don't
      contain crap.
      
      This patch ensures that the high sequence bits are always zeroed
      so that we don't leak random data into the IV.
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      407d34ef
    • I
      Revert "libceph: clear r_req_lru_item in __unregister_linger_request()" · 521a04d0
      Ilya Dryomov 提交于
      This reverts commit ba9d114e.
      
      .. which introduced a regression that prevented all lingering requests
      requeued in kick_requests() from ever being sent to the OSDs, resulting
      in a lot of missed notifies.  In retrospect it's pretty obvious that
      r_req_lru_item item in the case of lingering requests can be used not
      only for notarget, but also for unsent linkage due to how tightly
      actual map and enqueue operations are coupled in __map_request().
      
      The assertion that was being silenced is taken care of in the previous
      ("libceph: request a new osdmap if lingering request maps to no osd")
      commit: by always kicking homeless lingering requests we ensure that
      none of them ends up on the notarget list outside of the critical
      section guarded by request_mutex.
      
      Cc: stable@vger.kernel.org # 3.18+, needs b0494532 "libceph: request a new osdmap if lingering request maps to no osd"
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NSage Weil <sage@redhat.com>
      521a04d0
    • I
      libceph: request a new osdmap if lingering request maps to no osd · b0494532
      Ilya Dryomov 提交于
      This commit does two things.  First, if there are any homeless
      lingering requests, we now request a new osdmap even if the osdmap that
      is being processed brought no changes, i.e. if a given lingering
      request turned homeless in one of the previous epochs and remained
      homeless in the current epoch.  Not doing so leaves us with a stale
      osdmap and as a result we may miss our window for reestablishing the
      watch and lose notifies.
      
      MON=1 OSD=1:
      
          # cat linger-needmap.sh
          #!/bin/bash
          rbd create --size 1 test
          DEV=$(rbd map test)
          ceph osd out 0
          rbd map dne/dne # obtain a new osdmap as a side effect (!)
          sleep 1
          ceph osd in 0
          rbd resize --size 2 test
          # rbd info test | grep size -> 2M
          # blockdev --getsize $DEV -> 1M
      
      N.B.: Not obtaining a new osdmap in between "osd out" and "osd in"
      above is enough to make it miss that resize notify, but that is a
      bug^Wlimitation of ceph watch/notify v1.
      
      Second, homeless lingering requests are now kicked just like those
      lingering requests whose mapping has changed.  This is mainly to
      recognize that a homeless lingering request makes no sense and to
      preserve the invariant that a registered lingering request is not
      sitting on any of r_req_lru_item lists.  This spares us a WARN_ON,
      which commit ba9d114e ("libceph: clear r_req_lru_item in
      __unregister_linger_request()") tried to fix the _wrong_ way.
      
      Cc: stable@vger.kernel.org # 3.10+
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: NSage Weil <sage@redhat.com>
      b0494532
    • M
      ipv6: fix ECMP route replacement · 27596472
      Michal Kubeček 提交于
      When replacing an IPv6 multipath route with "ip route replace", i.e.
      NLM_F_CREATE | NLM_F_REPLACE, fib6_add_rt2node() replaces only first
      matching route without fixing its siblings, resulting in corrupted
      siblings linked list; removing one of the siblings can then end in an
      infinite loop.
      
      IPv6 ECMP implementation is a bit different from IPv4 so that route
      replacement cannot work in exactly the same way. This should be a
      reasonable approximation:
      
      1. If the new route is ECMP-able and there is a matching ECMP-able one
      already, replace it and all its siblings (if any).
      
      2. If the new route is ECMP-able and no matching ECMP-able route exists,
      replace first matching non-ECMP-able (if any) or just add the new one.
      
      3. If the new route is not ECMP-able, replace first matching
      non-ECMP-able route (if any) or add the new route.
      
      We also need to remove the NLM_F_REPLACE flag after replacing old
      route(s) by first nexthop of an ECMP route so that each subsequent
      nexthop does not replace previous one.
      
      Fixes: 51ebd318 ("ipv6: add support of equal cost multipath (ECMP)")
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      27596472
    • M
      ipv6: do not delete previously existing ECMP routes if add fails · 35f1b4e9
      Michal Kubeček 提交于
      If adding a nexthop of an IPv6 multipath route fails, comment in
      ip6_route_multipath() says we are going to delete all nexthops already
      added. However, current implementation deletes even the routes it
      hasn't even tried to add yet. For example, running
      
        ip route add 1234:5678::/64 \
            nexthop via fe80::aa dev dummy1 \
            nexthop via fe80::bb dev dummy1 \
            nexthop via fe80::cc dev dummy1
      
      twice results in removing all routes first command added.
      
      Limit the second (delete) run to nexthops that succeeded in the first
      (add) run.
      
      Fixes: 51ebd318 ("ipv6: add support of equal cost multipath (ECMP)")
      Signed-off-by: NMichal Kubecek <mkubecek@suse.cz>
      Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35f1b4e9
  10. 20 5月, 2015 8 次提交
    • M
      mac80211: fix AP_VLAN crypto tailroom calculation · f9dca80b
      Michal Kazior 提交于
      Some splats I was seeing:
      
       (a) WARNING: CPU: 1 PID: 0 at /devel/src/linux/net/mac80211/wep.c:102 ieee80211_wep_add_iv
       (b) WARNING: CPU: 1 PID: 0 at /devel/src/linux/net/mac80211/wpa.c:73 ieee80211_tx_h_michael_mic_add
       (c) WARNING: CPU: 3 PID: 0 at /devel/src/linux/net/mac80211/wpa.c:433 ieee80211_crypto_ccmp_encrypt
      
      I've seen (a) and (b) with ath9k hw crypto and (c)
      with ath9k sw crypto. All of them were related to
      insufficient skb tailroom and I was able to
      trigger these with ping6 program.
      
      AP_VLANs may inherit crypto keys from parent AP.
      This wasn't considered and yielded problems in
      some setups resulting in inability to transmit
      data because mac80211 wouldn't resize skbs when
      necessary and subsequently drop some packets due
      to insufficient tailroom.
      
      For efficiency purposes don't inspect both AP_VLAN
      and AP sdata looking for tailroom counter. Instead
      update AP_VLAN tailroom counters whenever their
      master AP tailroom counter changes.
      Signed-off-by: NMichal Kazior <michal.kazior@tieto.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      f9dca80b
    • J
      mac80211: don't split remain-on-channel for coalescing · 252ec2b3
      Johannes Berg 提交于
      Due to remain-on-channel scheduling delays, when we split an ROC
      while coalescing, we'll usually get a picture like this:
      
      existing ROC:  |------------------|
      current time:              ^
      new ROC:                   |------|              |-------|
      
      If the expected response frames are then transmitted by the peer
      in the hole between the two fragments of the new ROC, we miss
      them and the process (e.g. ANQP query) fails.
      
      mac80211 expects that the window to miss something is small:
      
      existing ROC:  |------------------|
      new ROC:                   |------||-------|
      
      but that's normally not the case.
      
      To avoid this problem, coalesce only if the new ROC's duration
      is <= the remaining time on the existing one:
      
      existing ROC:  |------------------|
      new ROC:                   |-----|
      
      and never split a new one but schedule it afterwards instead:
      
      existing ROC:  |------------------|
      new ROC:                                       |-------------|
      
      type=bugfix
      bug=not-tracked
      fixes=unknown
      Reported-by: NMatti Gottlieb <matti.gottlieb@intel.com>
      Reviewed-by: NEliadX Peller <eliad@wizery.com>
      Reviewed-by: NMatti Gottlieb <matti.gottlieb@intel.com>
      Tested-by: NMatti Gottlieb <matti.gottlieb@intel.com>
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      252ec2b3
    • F
      Revert "netfilter: bridge: query conntrack about skb dnat" · faecbb45
      Florian Westphal 提交于
      This reverts commit c055d5b0.
      
      There are two issues:
      'dnat_took_place' made me think that this is related to
      -j DNAT/MASQUERADE.
      
      But thats only one part of the story.  This is also relevant for SNAT
      when we undo snat translation in reverse/reply direction.
      
      Furthermore, I originally wanted to do this mainly to avoid
      storing ipv6 addresses once we make DNAT/REDIRECT work
      for ipv6 on bridges.
      
      However, I forgot about SNPT/DNPT which is stateless.
      
      So we can't escape storing address for ipv6 anyway. Might as
      well do it for ipv4 too.
      Reported-and-tested-by: NBernhard Thaler <bernhard.thaler@wvnet.at>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      faecbb45
    • D
      netfilter: ensure number of counters is >0 in do_replace() · 1086bbe9
      Dave Jones 提交于
      After improving setsockopt() coverage in trinity, I started triggering
      vmalloc failures pretty reliably from this code path:
      
      warn_alloc_failed+0xe9/0x140
      __vmalloc_node_range+0x1be/0x270
      vzalloc+0x4b/0x50
      __do_replace+0x52/0x260 [ip_tables]
      do_ipt_set_ctl+0x15d/0x1d0 [ip_tables]
      nf_setsockopt+0x65/0x90
      ip_setsockopt+0x61/0xa0
      raw_setsockopt+0x16/0x60
      sock_common_setsockopt+0x14/0x20
      SyS_setsockopt+0x71/0xd0
      
      It turns out we don't validate that the num_counters field in the
      struct we pass in from userspace is initialized.
      
      The same problem also exists in ebtables, arptables, ipv6, and the
      compat variants.
      Signed-off-by: NDave Jones <davej@codemonkey.org.uk>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      1086bbe9
    • F
      netfilter: nfnetlink_{log,queue}: Register pernet in first place · 3bfe0498
      Francesco Ruggeri 提交于
      nfnetlink_{log,queue}_init() register the netlink callback nf*_rcv_nl_event
      before registering the pernet_subsys, but the callback relies on data
      structures allocated by pernet init functions.
      
      When nfnetlink_{log,queue} is loaded, if a netlink message is received after
      the netlink callback is registered but before the pernet_subsys is registered,
      the kernel will panic in the sequence
      
      nfulnl_rcv_nl_event
        nfnl_log_pernet
          net_generic
            BUG_ON(id == 0)  where id is nfnl_log_net_id.
      
      The panic can be easily reproduced in 4.0.3 by:
      
      while true ;do modprobe nfnetlink_log ; rmmod nfnetlink_log ; done &
      while true ;do ip netns add dummy ; ip netns del dummy ; done &
      
      This patch moves register_pernet_subsys to earlier in nfnetlink_log_init.
      
      Notice that the BUG_ON hit in 4.0.3 was recently removed in 2591ffd3
      ["netns: remove BUG_ONs from net_generic()"].
      Signed-off-by: NFrancesco Ruggeri <fruggeri@arista.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      3bfe0498
    • Y
      tcp: don't over-send F-RTO probes · b7b0ed91
      Yuchung Cheng 提交于
      After sending the new data packets to probe (step 2), F-RTO may
      incorrectly send more probes if the next ACK advances SND_UNA and
      does not sack new packet. However F-RTO RFC 5682 probes at most
      once. This bug may cause sender to always send new data instead of
      repairing holes, inducing longer HoL blocking on the receiver for
      the application.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7b0ed91
    • Y
      tcp: only undo on partial ACKs in CA_Loss · da34ac76
      Yuchung Cheng 提交于
      Undo based on TCP timestamps should only happen on ACKs that advance
      SND_UNA, according to the Eifel algorithm in RFC 3522:
      
      Section 3.2:
      
        (4) If the value of the Timestamp Echo Reply field of the
            acceptable ACK's Timestamps option is smaller than the
            value of RetransmitTS, then proceed to step (5),
      
      Section Terminology:
         We use the term 'acceptable ACK' as defined in [RFC793].  That is an
         ACK that acknowledges previously unacknowledged data.
      
      This is because upon receiving an out-of-order packet, the receiver
      returns the last timestamp that advances RCV_NXT, not the current
      timestamp of the packet in the DUPACK. Without checking the flag,
      the DUPACK will cause tcp_packet_delayed() to return true and
      tcp_try_undo_loss() will revert cwnd reduction.
      
      Note that we check the condition in CA_Recovery already by only
      calling tcp_try_undo_partial() if FLAG_SND_UNA_ADVANCED is set or
      tcp_try_undo_recovery() if snd_una crosses high_seq.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      da34ac76
    • H
      net/ipv6/udp: Fix ipv6 multicast socket filter regression · 33b4b015
      Henning Rogge 提交于
      Commit <5cf3d461> ("udp: Simplify__udp*_lib_mcast_deliver")
      simplified the filter for incoming IPv6 multicast but removed
      the check of the local socket address and the UDP destination
      address.
      
      This patch restores the filter to prevent sockets bound to a IPv6
      multicast IP to receive other UDP traffic link unicast.
      Signed-off-by: NHenning Rogge <hrogge@gmail.com>
      Fixes: 5cf3d461 ("udp: Simplify__udp*_lib_mcast_deliver")
      Cc: "David S. Miller" <davem@davemloft.net>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      33b4b015
  11. 19 5月, 2015 1 次提交