1. 12 10月, 2018 4 次提交
    • Y
      tipc: eliminate possible recursive locking detected by LOCKDEP · a1f8dd34
      Ying Xue 提交于
      When booting kernel with LOCKDEP option, below warning info was found:
      
      WARNING: possible recursive locking detected
      4.19.0-rc7+ #14 Not tainted
      --------------------------------------------
      swapper/0/1 is trying to acquire lock:
      00000000dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
      include/linux/spinlock.h:334 [inline]
      00000000dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at:
      tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
      
      but task is already holding lock:
      00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
      include/linux/spinlock.h:334 [inline]
      00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
      tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&(&list->lock)->rlock#4);
        lock(&(&list->lock)->rlock#4);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      2 locks held by swapper/0/1:
       #0: 00000000f7539d34 (pernet_ops_rwsem){+.+.}, at:
      register_pernet_subsys+0x19/0x40 net/core/net_namespace.c:1051
       #1: 00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
      spin_lock_bh include/linux/spinlock.h:334 [inline]
       #1: 00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
      tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849
      
      stack backtrace:
      CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0-rc7+ #14
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1af/0x295 lib/dump_stack.c:113
       print_deadlock_bug kernel/locking/lockdep.c:1759 [inline]
       check_deadlock kernel/locking/lockdep.c:1803 [inline]
       validate_chain kernel/locking/lockdep.c:2399 [inline]
       __lock_acquire+0xf1e/0x3c60 kernel/locking/lockdep.c:3411
       lock_acquire+0x1db/0x520 kernel/locking/lockdep.c:3900
       __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
       _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168
       spin_lock_bh include/linux/spinlock.h:334 [inline]
       tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
       tipc_link_bc_create+0xb5/0x1f0 net/tipc/link.c:526
       tipc_bcast_init+0x59b/0xab0 net/tipc/bcast.c:521
       tipc_init_net+0x472/0x610 net/tipc/core.c:82
       ops_init+0xf7/0x520 net/core/net_namespace.c:129
       __register_pernet_operations net/core/net_namespace.c:940 [inline]
       register_pernet_operations+0x453/0xac0 net/core/net_namespace.c:1011
       register_pernet_subsys+0x28/0x40 net/core/net_namespace.c:1052
       tipc_init+0x83/0x104 net/tipc/core.c:140
       do_one_initcall+0x109/0x70a init/main.c:885
       do_initcall_level init/main.c:953 [inline]
       do_initcalls init/main.c:961 [inline]
       do_basic_setup init/main.c:979 [inline]
       kernel_init_freeable+0x4bd/0x57f init/main.c:1144
       kernel_init+0x13/0x180 init/main.c:1063
       ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413
      
      The reason why the noise above was complained by LOCKDEP is because we
      nested to hold l->wakeupq.lock and l->inputq->lock in tipc_link_reset
      function. In fact it's unnecessary to move skb buffer from l->wakeupq
      queue to l->inputq queue while holding the two locks at the same time.
      Instead, we can move skb buffers in l->wakeupq queue to a temporary
      list first and then move the buffers of the temporary list to l->inputq
      queue, which is also safe for us.
      
      Fixes: 3f32d0be ("tipc: lock wakeup & inputq at tipc_link_reset()")
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Acked-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a1f8dd34
    • D
      Merge branch 'net-explicitly-requires-bash-when-needed' · 26b1f4cb
      David S. Miller 提交于
      Paolo Abeni says:
      
      ====================
      net: explicitly requires bash when needed.
      
      Some test scripts require bash-only features but use the default shell.
      This may cause random failures if the default shell is not bash.
      Instead of doing a potentially complex rewrite of such scripts, these patches
      require the bash interpreter, where needed.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      26b1f4cb
    • P
      selftests: udpgso_bench.sh explicitly requires bash · 12a2ea96
      Paolo Abeni 提交于
      The udpgso_bench.sh script requires several bash-only features. This
      may cause random failures if the default shell is not bash.
      Address the above explicitly requiring bash as the script interpreter
      
      Fixes: 3a687bef ("selftests: udp gso benchmark")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      12a2ea96
    • P
      selftests: rtnetlink.sh explicitly requires bash. · 3c718e67
      Paolo Abeni 提交于
      the script rtnetlink.sh requires a bash-only features (sleep with sub-second
      precision). This may cause random test failure if the default shell is not
      bash.
      Address the above explicitly requiring bash as the script interpreter.
      
      Fixes: 33b01b7b ("selftests: add rtnetlink test script")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3c718e67
  2. 11 10月, 2018 14 次提交
    • G
      qmi_wwan: Added support for Gemalto's Cinterion ALASxx WWAN interface · 4f761770
      Giacinto Cifelli 提交于
      Added support for Gemalto's Cinterion ALASxx WWAN interfaces
      by adding QMI_FIXED_INTF with Cinterion's VID and PID.
      Signed-off-by: NGiacinto Cifelli <gciofono@gmail.com>
      Acked-by: NBjørn Mork <bjorn@mork.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f761770
    • P
      tipc: queue socket protocol error messages into socket receive buffer · e7eb0582
      Parthasarathy Bhuvaragan 提交于
      In tipc_sk_filter_rcv(), when we detect protocol messages with error we
      call tipc_sk_conn_proto_rcv() and let it reset the connection and notify
      the socket by calling sk->sk_state_change().
      
      However, tipc_sk_filter_rcv() may have been called from the function
      tipc_backlog_rcv(), in which case the socket lock is held and the socket
      already awake. This means that the sk_state_change() call is ignored and
      the error notification lost. Now the receive queue will remain empty and
      the socket sleeps forever.
      
      In this commit, we convert the protocol message into a connection abort
      message and enqueue it into the socket's receive queue. By this addition
      to the above state change we cover all conditions.
      Acked-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NParthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e7eb0582
    • J
      tipc: set link tolerance correctly in broadcast link · 047491ea
      Jon Maloy 提交于
      In the patch referred to below we added link tolerance as an additional
      criteria for declaring broadcast transmission "stale" and resetting the
      affected links.
      
      However, the 'tolerance' field of the broadcast link is never set, and
      remains at zero. This renders the whole commit without the intended
      improving effect, but luckily also with no negative effect.
      
      In this commit we add the missing initialization.
      
      Fixes: a4dc70d4 ("tipc: extend link reset criteria for stale packet retransmission")
      Signed-off-by: NJon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      047491ea
    • D
      Merge branch 'net-ipv4-fixes-for-PMTU-when-link-MTU-changes' · 28b6bfeb
      David S. Miller 提交于
      Sabrina Dubroca says:
      
      ====================
      net: ipv4: fixes for PMTU when link MTU changes
      
      The first patch adapts the changes that commit e9fa1495 ("ipv6:
      Reflect MTU changes on PMTU of exceptions for MTU-less routes") did in
      IPv6 to IPv4: lower PMTU when the first hop's MTU drops below it, and
      raise PMTU when the first hop was limiting PMTU discovery and its MTU
      is increased.
      
      The second patch fixes bugs introduced in commit d52e5a7e ("ipv4:
      lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu") that
      only appear once the first patch is applied.
      
      Selftests for these cases were introduced in net-next commit
      e44e428f ("selftests: pmtu: add basic IPv4 and IPv6 PMTU tests")
      
      v2: add cover letter, and fix a few small things in patch 1
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      28b6bfeb
    • S
      net: ipv4: don't let PMTU updates increase route MTU · 28d35bcd
      Sabrina Dubroca 提交于
      When an MTU update with PMTU smaller than net.ipv4.route.min_pmtu is
      received, we must clamp its value. However, we can receive a PMTU
      exception with PMTU < old_mtu < ip_rt_min_pmtu, which would lead to an
      increase in PMTU.
      
      To fix this, take the smallest of the old MTU and ip_rt_min_pmtu.
      
      Before this patch, in case of an update, the exception's MTU would
      always change. Now, an exception can have only its lock flag updated,
      but not the MTU, so we need to add a check on locking to the following
      "is this exception getting updated, or close to expiring?" test.
      
      Fixes: d52e5a7e ("ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      28d35bcd
    • S
      net: ipv4: update fnhe_pmtu when first hop's MTU changes · af7d6cce
      Sabrina Dubroca 提交于
      Since commit 5aad1de5 ("ipv4: use separate genid for next hop
      exceptions"), exceptions get deprecated separately from cached
      routes. In particular, administrative changes don't clear PMTU anymore.
      
      As Stefano described in commit e9fa1495 ("ipv6: Reflect MTU changes
      on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
      the local MTU change can become stale:
       - if the local MTU is now lower than the PMTU, that PMTU is now
         incorrect
       - if the local MTU was the lowest value in the path, and is increased,
         we might discover a higher PMTU
      
      Similarly to what commit e9fa1495 did for IPv6, update PMTU in those
      cases.
      
      If the exception was locked, the discovered PMTU was smaller than the
      minimal accepted PMTU. In that case, if the new local MTU is smaller
      than the current PMTU, let PMTU discovery figure out if locking of the
      exception is still needed.
      
      To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
      notifier. By the time the notifier is called, dev->mtu has been
      changed. This patch adds the old MTU as additional information in the
      notifier structure, and a new call_netdevice_notifiers_u32() function.
      
      Fixes: 5aad1de5 ("ipv4: use separate genid for next hop exceptions")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af7d6cce
    • M
      net/ipv6: stop leaking percpu memory in fib6 info · 7abab7b9
      Mike Rapoport 提交于
      The fib6_info_alloc() function allocates percpu memory to hold per CPU
      pointers to rt6_info, but this memory is never freed. Fix it.
      
      Fixes: a64efe14 ("net/ipv6: introduce fib6_info struct and helpers")
      Signed-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7abab7b9
    • D
      Merge tag 'rxrpc-fixes-20181008' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · 49b538e7
      David S. Miller 提交于
      David Howells says:
      
      ====================
      rxrpc: Fix packet reception code
      
      Here are a set of patches that prepares for and fix problems in rxrpc's
      package reception code.  There serious problems are:
      
       (A) There's a window between binding the socket and setting the data_ready
           hook in which packets can find their way into the UDP socket's receive
           queues.
      
       (B) The skb_recv_udp() will return an error (and clear the error state) if
           there was an error on the Tx side.  rxrpc doesn't handle this.
      
       (C) The rxrpc data_ready handler doesn't fully drain the UDP receive
           queue.
      
       (D) The rxrpc data_ready handler assumes it is called in a non-reentrant
       state.
      
      The second patch fixes (A) - (C); the third patch renders (B) and (C)
      non-issues by using the recap_rcv hook instead of data_ready - and the
      final patch fixes (D).  That last is the most complex.
      
      The preparatory patches are:
      
       (1) Fix some places that are doing things in the wrong net namespace.
      
       (2) Stop taking the rcu read lock as it's held by the IP input routine in
           the call chain.
      
       (3) Only end the Tx phase if *we* rotated the final packet out of the Tx
           buffer.
      
       (4) Don't assume that the call state won't change after dropping the
           call_state lock.
      
       (5) Only take receive window and MTU suze parameters from an ACK packet if
           it's the latest ACK packet.
      
       (6) Record connection-level abort information correctly.
      
       (7) Fix a trace line.
      
      And then there are three main patches - note that these are mixed in with
      the preparatory patches somewhat:
      
       (1) Fix the setup window (A), skb_recv_udp() error check (B) and packet
           drainage (C).
      
       (2) Switch to using the encap_rcv instead of data_ready to cut out the
           effects of the UDP read queues and get the packets delivered directly.
      
       (3) Add more locking into the various packet input paths to defend against
           re-entrance (D).
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      49b538e7
    • K
      rds: RDS (tcp) hangs on sendto() to unresponding address · 9a4890bd
      Ka-Cheong Poon 提交于
      In rds_send_mprds_hash(), if the calculated hash value is non-zero and
      the MPRDS connections are not yet up, it will wait.  But it should not
      wait if the send is non-blocking.  In this case, it should just use the
      base c_path for sending the message.
      Signed-off-by: NKa-Cheong Poon <ka-cheong.poon@oracle.com>
      Acked-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9a4890bd
    • E
      net: make skb_partial_csum_set() more robust against overflows · 52b5d6f5
      Eric Dumazet 提交于
      syzbot managed to crash in skb_checksum_help() [1] :
      
              BUG_ON(offset + sizeof(__sum16) > skb_headlen(skb));
      
      Root cause is the following check in skb_partial_csum_set()
      
      	if (unlikely(start > skb_headlen(skb)) ||
      	    unlikely((int)start + off > skb_headlen(skb) - 2))
      		return false;
      
      If skb_headlen(skb) is 1, then (skb_headlen(skb) - 2) becomes 0xffffffff
      and the check fails to detect that ((int)start + off) is off the limit,
      since the compare is unsigned.
      
      When we fix that, then the first condition (start > skb_headlen(skb))
      becomes obsolete.
      
      Then we should also check that (skb_headroom(skb) + start) wont
      overflow 16bit field.
      
      [1]
      kernel BUG at net/core/dev.c:2880!
      invalid opcode: 0000 [#1] PREEMPT SMP KASAN
      CPU: 1 PID: 7330 Comm: syz-executor4 Not tainted 4.19.0-rc6+ #253
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:skb_checksum_help+0x9e3/0xbb0 net/core/dev.c:2880
      Code: 85 00 ff ff ff 48 c1 e8 03 42 80 3c 28 00 0f 84 09 fb ff ff 48 8b bd 00 ff ff ff e8 97 a8 b9 fb e9 f8 fa ff ff e8 2d 09 76 fb <0f> 0b 48 8b bd 28 ff ff ff e8 1f a8 b9 fb e9 b1 f6 ff ff 48 89 cf
      RSP: 0018:ffff8801d83a6f60 EFLAGS: 00010293
      RAX: ffff8801b9834380 RBX: ffff8801b9f8d8c0 RCX: ffffffff8608c6d7
      RDX: 0000000000000000 RSI: ffffffff8608cc63 RDI: 0000000000000006
      RBP: ffff8801d83a7068 R08: ffff8801b9834380 R09: 0000000000000000
      R10: ffff8801d83a76d8 R11: 0000000000000000 R12: 0000000000000001
      R13: 0000000000010001 R14: 000000000000ffff R15: 00000000000000a8
      FS:  00007f1a66db5700(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f7d77f091b0 CR3: 00000001ba252000 CR4: 00000000001406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       skb_csum_hwoffload_help+0x8f/0xe0 net/core/dev.c:3269
       validate_xmit_skb+0xa2a/0xf30 net/core/dev.c:3312
       __dev_queue_xmit+0xc2f/0x3950 net/core/dev.c:3797
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3838
       packet_snd net/packet/af_packet.c:2928 [inline]
       packet_sendmsg+0x422d/0x64c0 net/packet/af_packet.c:2953
      
      Fixes: 5ff8dda3 ("net: Ensure partial checksum offset is inside the skb head")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52b5d6f5
    • D
      Merge branch 'devlink-param-type-string-fixes' · 8b79f410
      David S. Miller 提交于
      Moshe Shemesh says:
      
      ====================
      devlink param type string fixes
      
      This patchset fixes devlink param infrastructure for string param type.
      
      The devlink param infrastructure doesn't handle copying the string data
      correctly.  The first two patches fix it and the third patch adds helper
      function to safely copy string value without exceeding
      DEVLINK_PARAM_MAX_STRING_VALUE.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8b79f410
    • M
      devlink: Add helper function for safely copy string param · bde74ad1
      Moshe Shemesh 提交于
      Devlink string param buffer is allocated at the size of
      DEVLINK_PARAM_MAX_STRING_VALUE. Add helper function which makes sure
      this size is not exceeded.
      Renamed DEVLINK_PARAM_MAX_STRING_VALUE to
      __DEVLINK_PARAM_MAX_STRING_VALUE to emphasize that it should be used by
      devlink only. The driver should use the helper function instead to
      verify it doesn't exceed the allowed length.
      Signed-off-by: NMoshe Shemesh <moshe@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bde74ad1
    • M
      devlink: Fix param cmode driverinit for string type · 1276534c
      Moshe Shemesh 提交于
      Driverinit configuration mode value is held by devlink to enable the
      driver fetch the value after reload command. In case the param type is
      string devlink should copy the value from driver string buffer to
      devlink string buffer on devlink_param_driverinit_value_set() and
      vice-versa on devlink_param_driverinit_value_get().
      
      Fixes: ec01aeb1 ("devlink: Add support for get/set driverinit value")
      Signed-off-by: NMoshe Shemesh <moshe@mellanox.com>
      Acked-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1276534c
    • M
      devlink: Fix param set handling for string type · f355cfcd
      Moshe Shemesh 提交于
      In case devlink param type is string, it needs to copy the string value
      it got from the input to devlink_param_value.
      
      Fixes: e3b7ca18 ("devlink: Add param set command")
      Signed-off-by: NMoshe Shemesh <moshe@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f355cfcd
  3. 10 10月, 2018 5 次提交
  4. 09 10月, 2018 4 次提交
    • D
      rxrpc: Fix the packet reception routine · c1e15b49
      David Howells 提交于
      The rxrpc_input_packet() function and its call tree was built around the
      assumption that data_ready() handler called from UDP to inform a kernel
      service that there is data to be had was non-reentrant.  This means that
      certain locking could be dispensed with.
      
      This, however, turns out not to be the case with a multi-queue network card
      that can deliver packets to multiple cpus simultaneously.  Each of those
      cpus can be in the rxrpc_input_packet() function at the same time.
      
      Fix by adding or changing some structure members:
      
       (1) Add peer->rtt_input_lock to serialise access to the RTT buffer.
      
       (2) Make conn->service_id into a 32-bit variable so that it can be
           cmpxchg'd on all arches.
      
       (3) Add call->input_lock to serialise access to the Rx/Tx state.  Note
           that although the Rx and Tx states are (almost) entirely separate,
           there's no point completing the separation and having separate locks
           since it's a bi-phasal RPC protocol rather than a bi-direction
           streaming protocol.  Data transmission and data reception do not take
           place simultaneously on any particular call.
      
      and making the following functional changes:
      
       (1) In rxrpc_input_data(), hold call->input_lock around the core to
           prevent simultaneous producing of packets into the Rx ring and
           updating of tracking state for a particular call.
      
       (2) In rxrpc_input_ping_response(), only read call->ping_serial once, and
           check it before checking RXRPC_CALL_PINGING as that's a cheaper test.
           The bit test and bit clear can then be combined.  No further locking
           is needed here.
      
       (3) In rxrpc_input_ack(), take call->input_lock after we've parsed much of
           the ACK packet.  The superseded ACK check is then done both before and
           after the lock is taken.
      
           The handing of ackinfo data is split, parsing before the lock is taken
           and processing with it held.  This is keyed on rxMTU being non-zero.
      
           Congestion management is also done within the locked section.
      
       (4) In rxrpc_input_ackall(), take call->input_lock around the Tx window
           rotation.  The ACKALL packet carries no information and is only really
           useful after all packets have been transmitted since it's imprecise.
      
       (5) In rxrpc_input_implicit_end_call(), we use rx->incoming_lock to
           prevent calls being simultaneously implicitly ended on two cpus and
           also to prevent any races with incoming call setup.
      
       (6) In rxrpc_input_packet(), use cmpxchg() to effect the service upgrade
           on a connection.  It is only permitted to happen once for a
           connection.
      
       (7) In rxrpc_new_incoming_call(), we have to recheck the routing inside
           rx->incoming_lock to see if someone else set up the call, connection
           or peer whilst we were getting there.  We can't trust the values from
           the earlier routing check unless we pin refs on them - which we want
           to avoid.
      
           Further, we need to allow for an incoming call to have its state
           changed on another CPU between us making it live and us adjusting it
           because the conn is now in the RXRPC_CONN_SERVICE state.
      
       (8) In rxrpc_peer_add_rtt(), take peer->rtt_input_lock around the access
           to the RTT buffer.  Don't need to lock around setting peer->rtt.
      
      For reference, the inventory of state-accessing or state-altering functions
      used by the packet input procedure is:
      
      > rxrpc_input_packet()
        * PACKET CHECKING
      
        * ROUTING
          > rxrpc_post_packet_to_local()
          > rxrpc_find_connection_rcu() - uses RCU
            > rxrpc_lookup_peer_rcu() - uses RCU
            > rxrpc_find_service_conn_rcu() - uses RCU
            > idr_find() - uses RCU
      
        * CONNECTION-LEVEL PROCESSING
          - Service upgrade
            - Can only happen once per conn
            ! Changed to use cmpxchg
          > rxrpc_post_packet_to_conn()
          - Setting conn->hi_serial
            - Probably safe not using locks
            - Maybe use cmpxchg
      
        * CALL-LEVEL PROCESSING
          > Old-call checking
            > rxrpc_input_implicit_end_call()
              > rxrpc_call_completed()
      	> rxrpc_queue_call()
      	! Need to take rx->incoming_lock
      	> __rxrpc_disconnect_call()
      	> rxrpc_notify_socket()
          > rxrpc_new_incoming_call()
            - Uses rx->incoming_lock for the entire process
              - Might be able to drop this earlier in favour of the call lock
            > rxrpc_incoming_call()
            	! Conflicts with rxrpc_input_implicit_end_call()
          > rxrpc_send_ping()
            - Don't need locks to check rtt state
            > rxrpc_propose_ACK
      
        * PACKET DISTRIBUTION
          > rxrpc_input_call_packet()
            > rxrpc_input_data()
      	* QUEUE DATA PACKET ON CALL
      	> rxrpc_reduce_call_timer()
      	  - Uses timer_reduce()
      	! Needs call->input_lock()
      	> rxrpc_receiving_reply()
      	  ! Needs locking around ack state
      	  > rxrpc_rotate_tx_window()
      	  > rxrpc_end_tx_phase()
      	> rxrpc_proto_abort()
      	> rxrpc_input_dup_data()
      	- Fills the Rx buffer
      	- rxrpc_propose_ACK()
      	- rxrpc_notify_socket()
      
            > rxrpc_input_ack()
      	* APPLY ACK PACKET TO CALL AND DISCARD PACKET
      	> rxrpc_input_ping_response()
      	  - Probably doesn't need any extra locking
      	  ! Need READ_ONCE() on call->ping_serial
      	  > rxrpc_input_check_for_lost_ack()
      	    - Takes call->lock to consult Tx buffer
      	  > rxrpc_peer_add_rtt()
      	    ! Needs to take a lock (peer->rtt_input_lock)
      	    ! Could perhaps manage with cmpxchg() and xadd() instead
      	> rxrpc_input_requested_ack
      	  - Consults Tx buffer
      	    ! Probably needs a lock
      	  > rxrpc_peer_add_rtt()
      	> rxrpc_propose_ack()
      	> rxrpc_input_ackinfo()
      	  - Changes call->tx_winsize
      	    ! Use cmpxchg to handle change
      	    ! Should perhaps track serial number
      	  - Uses peer->lock to record MTU specification changes
      	> rxrpc_proto_abort()
      	! Need to take call->input_lock
      	> rxrpc_rotate_tx_window()
      	> rxrpc_end_tx_phase()
      	> rxrpc_input_soft_acks()
      	- Consults the Tx buffer
      	> rxrpc_congestion_management()
      	  - Modifies the Tx annotations
      	  ! Needs call->input_lock()
      	  > rxrpc_queue_call()
      
            > rxrpc_input_abort()
      	* APPLY ABORT PACKET TO CALL AND DISCARD PACKET
      	> rxrpc_set_call_completion()
      	> rxrpc_notify_socket()
      
            > rxrpc_input_ackall()
      	* APPLY ACKALL PACKET TO CALL AND DISCARD PACKET
      	! Need to take call->input_lock
      	> rxrpc_rotate_tx_window()
      	> rxrpc_end_tx_phase()
      
          > rxrpc_reject_packet()
      
      There are some functions used by the above that queue the packet, after
      which the procedure is terminated:
      
       - rxrpc_post_packet_to_local()
         - local->event_queue is an sk_buff_head
         - local->processor is a work_struct
       - rxrpc_post_packet_to_conn()
         - conn->rx_queue is an sk_buff_head
         - conn->processor is a work_struct
       - rxrpc_reject_packet()
         - local->reject_queue is an sk_buff_head
         - local->processor is a work_struct
      
      And some that offload processing to process context:
      
       - rxrpc_notify_socket()
         - Uses RCU lock
         - Uses call->notify_lock to call call->notify_rx
         - Uses call->recvmsg_lock to queue recvmsg side
       - rxrpc_queue_call()
         - call->processor is a work_struct
       - rxrpc_propose_ACK()
         - Uses call->lock to wrap __rxrpc_propose_ACK()
      
      And a bunch that complete a call, all of which use call->state_lock to
      protect the call state:
      
       - rxrpc_call_completed()
       - rxrpc_set_call_completion()
       - rxrpc_abort_call()
       - rxrpc_proto_abort()
         - Also uses rxrpc_queue_call()
      
      Fixes: 17926a79 ("[AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      c1e15b49
    • D
      rxrpc: Fix the rxrpc_tx_packet trace line · 4e2abd3c
      David Howells 提交于
      Fix the rxrpc_tx_packet trace line by storing the where parameter.
      
      Fixes: 4764c0da ("rxrpc: Trace packet transmission")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      4e2abd3c
    • D
      rxrpc: Fix connection-level abort handling · 64753092
      David Howells 提交于
      Fix connection-level abort handling to cache the abort and error codes
      properly so that a new incoming call can be properly aborted if it races
      with the parent connection being aborted by another CPU.
      
      The abort_code and error parameters can then be dropped from
      rxrpc_abort_calls().
      
      Fixes: f5c17aae ("rxrpc: Calls should only have one terminal state")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      64753092
    • D
      rxrpc: Only take the rwind and mtu values from latest ACK · 298bc15b
      David Howells 提交于
      Move the out-of-order and duplicate ACK packet check to before the call to
      rxrpc_input_ackinfo() so that the receive window size and MTU size are only
      checked in the latest ACK packet and don't regress.
      
      Fixes: 248f219c ("rxrpc: Rewrite the data and ack handling code")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      298bc15b
  5. 08 10月, 2018 9 次提交
    • D
      rxrpc: Carry call state out of locked section in rxrpc_rotate_tx_window() · dfe99522
      David Howells 提交于
      Carry the call state out of the locked section in rxrpc_rotate_tx_window()
      rather than sampling it afterwards.  This is only used to select tracepoint
      data, but could have changed by the time we do the tracepoint.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      dfe99522
    • D
      rxrpc: Don't check RXRPC_CALL_TX_LAST after calling rxrpc_rotate_tx_window() · c479d5f2
      David Howells 提交于
      We should only call the function to end a call's Tx phase if we rotated the
      marked-last packet out of the transmission buffer.
      
      Make rxrpc_rotate_tx_window() return an indication of whether it just
      rotated the packet marked as the last out of the transmit buffer, carrying
      the information out of the locked section in that function.
      
      We can then check the return value instead of examining RXRPC_CALL_TX_LAST.
      
      Fixes: 70790dbe ("rxrpc: Pass the last Tx packet marker in the annotation buffer")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      c479d5f2
    • D
      rxrpc: Don't need to take the RCU read lock in the packet receiver · bfd28211
      David Howells 提交于
      We don't need to take the RCU read lock in the rxrpc packet receive
      function because it's held further up the stack in the IP input routine
      around the UDP receive routines.
      
      Fix this by dropping the RCU read lock calls from rxrpc_input_packet().
      This simplifies the code.
      
      Fixes: 70790dbe ("rxrpc: Pass the last Tx packet marker in the annotation buffer")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      bfd28211
    • D
      rxrpc: Use the UDP encap_rcv hook · 5271953c
      David Howells 提交于
      Use the UDP encap_rcv hook to cut the bit out of the rxrpc packet reception
      in which a packet is placed onto the UDP receive queue and then immediately
      removed again by rxrpc.  Going via the queue in this manner seems like it
      should be unnecessary.
      
      This does, however, require the invention of a value to place in encap_type
      as that's one of the conditions to switch packets out to the encap_rcv
      hook.  Possibly the value doesn't actually matter for anything other than
      sockopts on the UDP socket, which aren't accessible outside of rxrpc
      anyway.
      
      This seems to cut a bit of time out of the time elapsed between each
      sk_buff being timestamped and turning up in rxrpc (the final number in the
      following trace excerpts).  I measured this by making the rxrpc_rx_packet
      trace point print the time elapsed between the skb being timestamped and
      the current time (in ns), e.g.:
      
      	... 424.278721: rxrpc_rx_packet: ...  ACK 25026
      
      So doing a 512MiB DIO read from my test server, with an unmodified kernel:
      
      	N       min     max     sum		mean    stddev
      	27605   2626    7581    7.83992e+07     2840.04 181.029
      
      and with the patch applied:
      
      	N       min     max     sum		mean    stddev
      	27547   1895    12165   6.77461e+07     2459.29 255.02
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      5271953c
    • D
      Merge branch 'net-smc-userspace-breakage-fixes' · e2a322a0
      David S. Miller 提交于
      Eugene Syromiatnikov says:
      
      ====================
      net/smc: userspace breakage fixes
      
      These two patches correct some userspace-affecting issues introduced
      during 4.19 development cycle, specifically:
       * New structure "struct smcd_diag_dmbinfo" has been defined in a way
         that would lead to different layout of the structure on most 32-bit
         ABIs in comparison with layout on 64-bit ABIs;
       * One of the commits renamed an UAPI-exposed field name.
      
      Changes since v1:
       * Managed not to forget to add --cover-letter.
       * Commit ID format in commit message has been changed in accordance
         with Sergei Shtylyov's recommendations.
      ====================
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2a322a0
    • E
      net/smc: retain old name for diag_mode field · d4f0006a
      Eugene Syromiatnikov 提交于
      Commit c601171d ("net/smc: provide smc mode in smc_diag.c") changed
      the name of diag_fallback field of struct smc_diag_msg structure
      to diag_mode.  However, this structure is a part of UAPI, and this change
      breaks user space applications that use it ([1], for example).  Since
      the new name is more suitable, convert the field to a union that provides
      access to the data via both the new and the old name.
      
      [1] https://gitlab.com/strace/strace/blob/v4.24/netlink_smc_diag.c#L165
      
      Fixes: c601171d ("net/smc: provide smc mode in smc_diag.c")
      Signed-off-by: NEugene Syromiatnikov <esyr@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d4f0006a
    • E
      net/smc: use __aligned_u64 for 64-bit smc_diag fields · a21048c8
      Eugene Syromiatnikov 提交于
      Commit 4b1b7d3b ("net/smc: add SMC-D diag support") introduced
      new UAPI-exposed structure, struct smcd_diag_dmbinfo.  However,
      it's not usable by compat binaries, as it has different layout there.
      Probably, the most straightforward fix that will avoid similar issues
      in the future is to use __aligned_u64 for 64-bit fields.
      
      Fixes: 4b1b7d3b ("net/smc: add SMC-D diag support")
      Signed-off-by: NEugene Syromiatnikov <esyr@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a21048c8
    • A
      net: sched: cls_u32: fix hnode refcounting · 6d4c4077
      Al Viro 提交于
      cls_u32.c misuses refcounts for struct tc_u_hnode - it counts references
      via ->hlist and via ->tp_root together.  u32_destroy() drops the former
      and, in case when there had been links, leaves the sucker on the list.
      As the result, there's nothing to protect it from getting freed once links
      are dropped.
      That also makes the "is it busy" check incapable of catching the root
      hnode - it *is* busy (there's a reference from tp), but we don't see it as
      something separate.  "Is it our root?" check partially covers that, but
      the problem exists for others' roots as well.
      
      AFAICS, the minimal fix preserving the existing behaviour (where it doesn't
      include oopsen, that is) would be this:
              * count tp->root and tp_c->hlist as separate references.  I.e.
      have u32_init() set refcount to 2, not 1.
      	* in u32_destroy() we always drop the former;
      in u32_destroy_hnode() - the latter.
      
      	That way we have *all* references contributing to refcount.  List
      removal happens in u32_destroy_hnode() (called only when ->refcnt is 1)
      an in u32_destroy() in case of tc_u_common going away, along with
      everything reachable from it.  IOW, that way we know that
      u32_destroy_key() won't free something still on the list (or pointed to by
      someone's ->root).
      
      Reproducer:
      
      tc qdisc add dev eth0 ingress
      tc filter add dev eth0 parent ffff: protocol ip prio 100 handle 1: \
      u32 divisor 1
      tc filter add dev eth0 parent ffff: protocol ip prio 200 handle 2: \
      u32 divisor 1
      tc filter add dev eth0 parent ffff: protocol ip prio 100 \
      handle 1:0:11 u32 ht 1: link 801: offset at 0 mask 0f00 shift 6 \
      plus 0 eat match ip protocol 6 ff
      tc filter delete dev eth0 parent ffff: protocol ip prio 200
      tc filter change dev eth0 parent ffff: protocol ip prio 100 \
      handle 1:0:11 u32 ht 1: link 0: offset at 0 mask 0f00 shift 6 plus 0 \
      eat match ip protocol 6 ff
      tc filter delete dev eth0 parent ffff: protocol ip prio 100
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6d4c4077
    • J
      udp: Unbreak modules that rely on external __skb_recv_udp() availability · 7e823644
      Jiri Kosina 提交于
      Commit 2276f58a ("udp: use a separate rx queue for packet reception")
      turned static inline __skb_recv_udp() from being a trivial helper around
      __skb_recv_datagram() into a UDP specific implementaion, making it
      EXPORT_SYMBOL_GPL() at the same time.
      
      There are external modules that got broken by __skb_recv_udp() not being
      visible to them. Let's unbreak them by making __skb_recv_udp EXPORT_SYMBOL().
      
      Rationale (one of those) why this is actually "technically correct" thing
      to do: __skb_recv_udp() used to be an inline wrapper around
      __skb_recv_datagram(), which itself (still, and correctly so, I believe)
      is EXPORT_SYMBOL().
      
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Fixes: 2276f58a ("udp: use a separate rx queue for packet reception")
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7e823644
  6. 06 10月, 2018 4 次提交
    • G
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · c1d84a1b
      Greg Kroah-Hartman 提交于
      Dave writes:
        "Networking fixes:
      
        1) Fix truncation of 32-bit right shift in bpf, from Jann Horn.
      
        2) Fix memory leak in wireless wext compat, from Stefan Seyfried.
      
        3) Use after free in cfg80211's reg_process_hint(), from Yu Zhao.
      
        4) Need to cancel pending work when unbinding in smsc75xx otherwise
           we oops, also from Yu Zhao.
      
        5) Don't allow enslaving a team device to itself, from Ido Schimmel.
      
        6) Fix backwards compat with older userspace for rtnetlink FDB dumps.
           From Mauricio Faria.
      
        7) Add validation of tc policy netlink attributes, from David Ahern.
      
        8) Fix RCU locking in rawv6_send_hdrinc(), from Wei Wang."
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (26 commits)
        net: mvpp2: Extract the correct ethtype from the skb for tx csum offload
        ipv6: take rcu lock in rawv6_send_hdrinc()
        net: sched: Add policy validation for tc attributes
        rtnetlink: fix rtnl_fdb_dump() for ndmsg header
        yam: fix a missing-check bug
        net: bpfilter: Fix type cast and pointer warnings
        net: cxgb3_main: fix a missing-check bug
        bpf: 32-bit RSH verification must truncate input before the ALU op
        net: phy: phylink: fix SFP interface autodetection
        be2net: don't flip hw_features when VXLANs are added/deleted
        net/packet: fix packet drop as of virtio gso
        net: dsa: b53: Keep CPU port as tagged in all VLANs
        openvswitch: load NAT helper
        bnxt_en: get the reduced max_irqs by the ones used by RDMA
        bnxt_en: free hwrm resources, if driver probe fails.
        bnxt_en: Fix enables field in HWRM_QUEUE_COS2BW_CFG request
        bnxt_en: Fix VNIC reservations on the PF.
        team: Forbid enslaving team device to itself
        net/usb: cancel pending work when unbinding smsc75xx
        mlxsw: spectrum: Delete RIF when VLAN device is removed
        ...
      c1d84a1b
    • G
      Merge branch 'akpm' · 091a1eaa
      Greg Kroah-Hartman 提交于
      * akpm:
        mm: madvise(MADV_DODUMP): allow hugetlbfs pages
        ocfs2: fix locking for res->tracking and dlm->tracking_list
        mm/vmscan.c: fix int overflow in callers of do_shrink_slab()
        mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly
        mm/vmstat.c: fix outdated vmstat_text
        proc: restrict kernel stack dumps to root
        mm/hugetlb: add mmap() encodings for 32MB and 512MB page sizes
        mm/migrate.c: split only transparent huge pages when allocation fails
        ipc/shm.c: use ERR_CAST() for shm_lock() error return
        mm/gup_benchmark: fix unsigned comparison to zero in __gup_benchmark_ioctl
        mm, thp: fix mlocking THP page with migration enabled
        ocfs2: fix crash in ocfs2_duplicate_clusters_by_page()
        hugetlb: take PMD sharing into account when flushing tlb/caches
        mm: migration: fix migration of huge PMD shared pages
      091a1eaa
    • D
      mm: madvise(MADV_DODUMP): allow hugetlbfs pages · d41aa525
      Daniel Black 提交于
      Reproducer, assuming 2M of hugetlbfs available:
      
      Hugetlbfs mounted, size=2M and option user=testuser
      
        # mount | grep ^hugetlbfs
        hugetlbfs on /dev/hugepages type hugetlbfs (rw,pagesize=2M,user=dan)
        # sysctl vm.nr_hugepages=1
        vm.nr_hugepages = 1
        # grep Huge /proc/meminfo
        AnonHugePages:         0 kB
        ShmemHugePages:        0 kB
        HugePages_Total:       1
        HugePages_Free:        1
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:       2048 kB
        Hugetlb:            2048 kB
      
      Code:
      
        #include <sys/mman.h>
        #include <stddef.h>
        #define SIZE 2*1024*1024
        int main()
        {
          void *ptr;
          ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_HUGETLB | MAP_ANONYMOUS, -1, 0);
          madvise(ptr, SIZE, MADV_DONTDUMP);
          madvise(ptr, SIZE, MADV_DODUMP);
        }
      
      Compile and strace:
      
        mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7ff7c9200000
        madvise(0x7ff7c9200000, 2097152, MADV_DONTDUMP) = 0
        madvise(0x7ff7c9200000, 2097152, MADV_DODUMP) = -1 EINVAL (Invalid argument)
      
      hugetlbfs pages have VM_DONTEXPAND in the VmFlags driver pages based on
      author testing with analysis from Florian Weimer[1].
      
      The inclusion of VM_DONTEXPAND into the VM_SPECIAL defination was a
      consequence of the large useage of VM_DONTEXPAND in device drivers.
      
      A consequence of [2] is that VM_DONTEXPAND marked pages are unable to be
      marked DODUMP.
      
      A user could quite legitimately madvise(MADV_DONTDUMP) their hugetlbfs
      memory for a while and later request that madvise(MADV_DODUMP) on the same
      memory.  We correct this omission by allowing madvice(MADV_DODUMP) on
      hugetlbfs pages.
      
      [1] https://stackoverflow.com/questions/52548260/madvisedodump-on-the-same-ptr-size-as-a-successful-madvisedontdump-fails-wit
      [2] commit 0103bd16 ("mm: prepare VM_DONTDUMP for using in drivers")
      
      Link: http://lkml.kernel.org/r/20180930054629.29150-1-daniel@linux.ibm.com
      Link: https://lists.launchpad.net/maria-discuss/msg05245.html
      Fixes: 0103bd16 ("mm: prepare VM_DONTDUMP for using in drivers")
      Reported-by: NKenneth Penza <kpenza@gmail.com>
      Signed-off-by: NDaniel Black <daniel@linux.ibm.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d41aa525
    • A
      ocfs2: fix locking for res->tracking and dlm->tracking_list · cbe355f5
      Ashish Samant 提交于
      In dlm_init_lockres() we access and modify res->tracking and
      dlm->tracking_list without holding dlm->track_lock.  This can cause list
      corruptions and can end up in kernel panic.
      
      Fix this by locking res->tracking and dlm->tracking_list with
      dlm->track_lock instead of dlm->spinlock.
      
      Link: http://lkml.kernel.org/r/1529951192-4686-1-git-send-email-ashish.samant@oracle.comSigned-off-by: NAshish Samant <ashish.samant@oracle.com>
      Reviewed-by: NChangwei Ge <ge.changwei@h3c.com>
      Acked-by: NJoseph Qi <jiangqi903@gmail.com>
      Acked-by: NJun Piao <piaojun@huawei.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cbe355f5