1. 19 5月, 2020 3 次提交
  2. 10 5月, 2020 2 次提交
  3. 08 5月, 2020 1 次提交
  4. 07 5月, 2020 4 次提交
  5. 05 5月, 2020 7 次提交
  6. 02 5月, 2020 1 次提交
    • D
      ipv6: Use global sernum for dst validation with nexthop objects · 8f34e53b
      David Ahern 提交于
      Nik reported a bug with pcpu dst cache when nexthop objects are
      used illustrated by the following:
          $ ip netns add foo
          $ ip -netns foo li set lo up
          $ ip -netns foo addr add 2001:db8:11::1/128 dev lo
          $ ip netns exec foo sysctl net.ipv6.conf.all.forwarding=1
          $ ip li add veth1 type veth peer name veth2
          $ ip li set veth1 up
          $ ip addr add 2001:db8:10::1/64 dev veth1
          $ ip li set dev veth2 netns foo
          $ ip -netns foo li set veth2 up
          $ ip -netns foo addr add 2001:db8:10::2/64 dev veth2
          $ ip -6 nexthop add id 100 via 2001:db8:10::2 dev veth1
          $ ip -6 route add 2001:db8:11::1/128 nhid 100
      
          Create a pcpu entry on cpu 0:
          $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
      
          Re-add the route entry:
          $ ip -6 ro del 2001:db8:11::1
          $ ip -6 route add 2001:db8:11::1/128 nhid 100
      
          Route get on cpu 0 returns the stale pcpu:
          $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
          RTNETLINK answers: Network is unreachable
      
          While cpu 1 works:
          $ taskset -a -c 1 ip -6 route get 2001:db8:11::1
          2001:db8:11::1 from :: via 2001:db8:10::2 dev veth1 src 2001:db8:10::1 metric 1024 pref medium
      
      Conversion of FIB entries to work with external nexthop objects
      missed an important difference between IPv4 and IPv6 - how dst
      entries are invalidated when the FIB changes. IPv4 has a per-network
      namespace generation id (rt_genid) that is bumped on changes to the FIB.
      Checking if a dst_entry is still valid means comparing rt_genid in the
      rtable to the current value of rt_genid for the namespace.
      
      IPv6 also has a per network namespace counter, fib6_sernum, but the
      count is saved per fib6_node. With the per-node counter only dst_entries
      based on fib entries under the node are invalidated when changes are
      made to the routes - limiting the scope of invalidations. IPv6 uses a
      reference in the rt6_info, 'from', to track the corresponding fib entry
      used to create the dst_entry. When validating a dst_entry, the 'from'
      is used to backtrack to the fib6_node and check the sernum of it to the
      cookie passed to the dst_check operation.
      
      With the inline format (nexthop definition inline with the fib6_info),
      dst_entries cached in the fib6_nh have a 1:1 correlation between fib
      entries, nexthop data and dst_entries. With external nexthops, IPv6
      looks more like IPv4 which means multiple fib entries across disparate
      fib6_nodes can all reference the same fib6_nh. That means validation
      of dst_entries based on external nexthops needs to use the IPv4 format
      - the per-network namespace counter.
      
      Add sernum to rt6_info and set it when creating a pcpu dst entry. Update
      rt6_get_cookie to return sernum if it is set and update dst_check for
      IPv6 to look for sernum set and based the check on it if so. Finally,
      rt6_get_pcpu_route needs to validate the cached entry before returning
      a pcpu entry (similar to the rt_cache_valid calls in __mkroute_input and
      __mkroute_output for IPv4).
      
      This problem only affects routes using the new, external nexthops.
      
      Thanks to the kbuild test robot for catching the IS_ENABLED needed
      around rt_genid_ipv6 before I sent this out.
      
      Fixes: 5b98324e ("ipv6: Allow routes to use nexthop objects")
      Reported-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid Ahern <dsahern@kernel.org>
      Reviewed-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Tested-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8f34e53b
  7. 01 5月, 2020 4 次提交
    • T
      tunnel: Propagate ECT(1) when decapsulating as recommended by RFC6040 · b7237487
      Toke Høiland-Jørgensen 提交于
      RFC 6040 recommends propagating an ECT(1) mark from an outer tunnel header
      to the inner header if that inner header is already marked as ECT(0). When
      RFC 6040 decapsulation was implemented, this case of propagation was not
      added. This simply appears to be an oversight, so let's fix that.
      
      Fixes: eccc1bb8 ("tunnel: drop packet if ECN present with not-ECT")
      Reported-by: NBob Briscoe <ietf@bobbriscoe.net>
      Reported-by: NOlivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
      Cc: Dave Taht <dave.taht@gmail.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7237487
    • K
      security: Fix the default value of fs_context_parse_param hook · 54261af4
      KP Singh 提交于
      security_fs_context_parse_param is called by vfs_parse_fs_param and
      a succussful return value (i.e 0) implies that a parameter will be
      consumed by the LSM framework. This stops all further parsing of the
      parmeter by VFS. Furthermore, if an LSM hook returns a success, the
      remaining LSM hooks are not invoked for the parameter.
      
      The current default behavior of returning success means that all the
      parameters are expected to be parsed by the LSM hook and none of them
      end up being populated by vfs in fs_context
      
      This was noticed when lsm=bpf is supplied on the command line before any
      other LSM. As the bpf lsm uses this default value to implement a default
      hook, this resulted in a failure to parse any fs_context parameters and
      a failure to mount the root filesystem.
      
      Fixes: 98e828a0 ("security: Refactor declaration of LSM hooks")
      Reported-by: NMikko Ylinen <mikko.ylinen@linux.intel.com>
      Signed-off-by: NKP Singh <kpsingh@google.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      54261af4
    • P
      mptcp: move option parsing into mptcp_incoming_options() · cfde141e
      Paolo Abeni 提交于
      The mptcp_options_received structure carries several per
      packet flags (mp_capable, mp_join, etc.). Such fields must
      be cleared on each packet, even on dropped ones or packet
      not carrying any MPTCP options, but the current mptcp
      code clears them only on TCP option reset.
      
      On several races/corner cases we end-up with stray bits in
      incoming options, leading to WARN_ON splats. e.g.:
      
      [  171.164906] Bad mapping: ssn=32714 map_seq=1 map_data_len=32713
      [  171.165006] WARNING: CPU: 1 PID: 5026 at net/mptcp/subflow.c:533 warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
      [  171.167632] Modules linked in: ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel geneve ip6_udp_tunnel udp_tunnel macsec macvtap tap ipvlan macvlan 8021q garp mrp xfrm_interface veth netdevsim nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun binfmt_misc intel_rapl_msr intel_rapl_common rfkill kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ip_tables xfs libcrc32c crc32c_intel serio_raw virtio_console ata_generic virtio_blk virtio_net net_failover failover ata_piix libata
      [  171.199464] CPU: 1 PID: 5026 Comm: repro Not tainted 5.7.0-rc1.mptcp_f227fdf5d388+ #95
      [  171.200886] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
      [  171.202546] RIP: 0010:warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
      [  171.206537] Code: c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 1d 8b 55 3c 44 89 e6 48 c7 c7 20 51 13 95 e8 37 8b 22 fe <0f> 0b 48 83 c4 08 5b 5d 41 5c c3 89 4c 24 04 e8 db d6 94 fe 8b 4c
      [  171.220473] RSP: 0018:ffffc90000150560 EFLAGS: 00010282
      [  171.221639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      [  171.223108] RDX: 0000000000000000 RSI: 0000000000000008 RDI: fffff5200002a09e
      [  171.224388] RBP: ffff8880aa6e3c00 R08: 0000000000000001 R09: fffffbfff2ec9955
      [  171.225706] R10: ffffffff9764caa7 R11: fffffbfff2ec9954 R12: 0000000000007fca
      [  171.227211] R13: ffff8881066f4a7f R14: ffff8880aa6e3c00 R15: 0000000000000020
      [  171.228460] FS:  00007f8623719740(0000) GS:ffff88810be00000(0000) knlGS:0000000000000000
      [  171.230065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  171.231303] CR2: 00007ffdab190a50 CR3: 00000001038ea006 CR4: 0000000000160ee0
      [  171.232586] Call Trace:
      [  171.233109]  <IRQ>
      [  171.233531] get_mapping_status (linux-mptcp/net/mptcp/subflow.c:691)
      [  171.234371] mptcp_subflow_data_available (linux-mptcp/net/mptcp/subflow.c:736 linux-mptcp/net/mptcp/subflow.c:832)
      [  171.238181] subflow_state_change (linux-mptcp/net/mptcp/subflow.c:1085 (discriminator 1))
      [  171.239066] tcp_fin (linux-mptcp/net/ipv4/tcp_input.c:4217)
      [  171.240123] tcp_data_queue (linux-mptcp/./include/linux/compiler.h:199 linux-mptcp/net/ipv4/tcp_input.c:4822)
      [  171.245083] tcp_rcv_established (linux-mptcp/./include/linux/skbuff.h:1785 linux-mptcp/./include/net/tcp.h:1774 linux-mptcp/./include/net/tcp.h:1847 linux-mptcp/net/ipv4/tcp_input.c:5238 linux-mptcp/net/ipv4/tcp_input.c:5730)
      [  171.254089] tcp_v4_rcv (linux-mptcp/./include/linux/spinlock.h:393 linux-mptcp/net/ipv4/tcp_ipv4.c:2009)
      [  171.258969] ip_protocol_deliver_rcu (linux-mptcp/net/ipv4/ip_input.c:204 (discriminator 1))
      [  171.260214] ip_local_deliver_finish (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/ipv4/ip_input.c:232)
      [  171.261389] ip_local_deliver (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:252)
      [  171.265884] ip_rcv (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:539)
      [  171.273666] process_backlog (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/core/dev.c:6135)
      [  171.275328] net_rx_action (linux-mptcp/net/core/dev.c:6572 linux-mptcp/net/core/dev.c:6640)
      [  171.280472] __do_softirq (linux-mptcp/./arch/x86/include/asm/jump_label.h:25 linux-mptcp/./include/linux/jump_label.h:200 linux-mptcp/./include/trace/events/irq.h:142 linux-mptcp/kernel/softirq.c:293)
      [  171.281379] do_softirq_own_stack (linux-mptcp/arch/x86/entry/entry_64.S:1083)
      [  171.282358]  </IRQ>
      
      We could address the issue clearing explicitly the relevant fields
      in several places - tcp_parse_option, tcp_fast_parse_options,
      possibly others.
      
      Instead we move the MPTCP option parsing into the already existing
      mptcp ingress hook, so that we need to clear the fields in a single
      place.
      
      This allows us dropping an MPTCP hook from the TCP code and
      removing the quite large mptcp_options_received from the tcp_sock
      struct. On the flip side, the MPTCP sockets will traverse the
      option space twice (in tcp_parse_option() and in
      mptcp_incoming_options(). That looks acceptable: we already
      do that for syn and 3rd ack packets, plain TCP socket will
      benefit from it, and even MPTCP sockets will experience better
      code locality, reducing the jumps between TCP and MPTCP code.
      
      v1 -> v2:
       - rebased on current '-net' tree
      
      Fixes: 648ef4b8 ("mptcp: Implement MPTCP receive path")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cfde141e
    • P
      mptcp: consolidate synack processing. · 263e1201
      Paolo Abeni 提交于
      Currently the MPTCP code uses 2 hooks to process syn-ack
      packets, mptcp_rcv_synsent() and the sk_rx_dst_set()
      callback.
      
      We can drop the first, moving the relevant code into the
      latter, reducing the hooking into the TCP code. This is
      also needed by the next patch.
      
      v1 -> v2:
       - use local tcp sock ptr instead of casting the sk variable
         several times - DaveM
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      263e1201
  8. 30 4月, 2020 2 次提交
  9. 29 4月, 2020 3 次提交
    • O
      NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION · dff58530
      Olga Kornievskaia 提交于
      Currently, if the client sends BIND_CONN_TO_SESSION with
      NFS4_CDFC4_FORE_OR_BOTH but only gets NFS4_CDFS4_FORE back it ignores
      that it wasn't able to enable a backchannel.
      
      To make sure, the client sends BIND_CONN_TO_SESSION as the first
      operation on the connections (ie., no other session compounds haven't
      been sent before), and if the client's request to bind the backchannel
      is not satisfied, then reset the connection and retry.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NOlga Kornievskaia <kolga@netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      dff58530
    • N
      SUNRPC: defer slow parts of rpc_free_client() to a workqueue. · 7c4310ff
      NeilBrown 提交于
      The rpciod workqueue is on the write-out path for freeing dirty memory,
      so it is important that it never block waiting for memory to be
      allocated - this can lead to a deadlock.
      
      rpc_execute() - which is often called by an rpciod work item - calls
      rcp_task_release_client() which can lead to rpc_free_client().
      
      rpc_free_client() makes two calls which could potentially block wating
      for memory allocation.
      
      rpc_clnt_debugfs_unregister() calls into debugfs and will block while
      any of the debugfs files are being accessed.  In particular it can block
      while any of the 'open' methods are being called and all of these use
      malloc for one thing or another.  So this can deadlock if the memory
      allocation waits for NFS to complete some writes via rpciod.
      
      rpc_clnt_remove_pipedir() can take the inode_lock() and while it isn't
      obvious that memory allocations can happen while the lock it held, it is
      safer to assume they might and to not let rpciod call
      rpc_clnt_remove_pipedir().
      
      So this patch moves these two calls (together with the final kfree() and
      rpciod_down()) into a work-item to be run from the system work-queue.
      rpciod can continue its important work, and the final stages of the free
      can happen whenever they happen.
      
      I have seen this deadlock on a 4.12 based kernel where debugfs used
      synchronize_srcu() when removing objects.  synchronize_srcu() requires a
      workqueue and there were no free workther threads and none could be
      allocated.  While debugsfs no longer uses SRCU, I believe the deadlock
      is still possible.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      7c4310ff
    • M
      drm/amdgpu: add tiling flags from Mesa · c938628c
      Marek Olšák 提交于
      DCC_INDEPENDENT_128B is needed for displayble DCC on gfx10.
      SCANOUT is not needed by the kernel, but Mesa uses it.
      Signed-off-by: NMarek Olšák <marek.olsak@amd.com>
      Acked-by: NAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
      c938628c
  10. 28 4月, 2020 13 次提交
    • U
      amba: Initialize dma_parms for amba devices · f4584884
      Ulf Hansson 提交于
      It's currently the amba driver's responsibility to initialize the pointer,
      dma_parms, for its corresponding struct device. The benefit with this
      approach allows us to avoid the initialization and to not waste memory for
      the struct device_dma_parameters, as this can be decided on a case by case
      basis.
      
      However, it has turned out that this approach is not very practical. Not
      only does it lead to open coding, but also to real errors. In principle
      callers of dma_set_max_seg_size() doesn't check the error code, but just
      assumes it succeeds.
      
      For these reasons, let's do the initialization from the common amba bus at
      the device registration point. This also follows the way the PCI devices
      are being managed, see pci_device_add().
      Suggested-by: NChristoph Hellwig <hch@lst.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: <stable@vger.kernel.org>
      Tested-by: NHaibo Chen <haibo.chen@nxp.com>
      Reviewed-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NUlf Hansson <ulf.hansson@linaro.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20200422101013.31267-1-ulf.hansson@linaro.orgSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f4584884
    • U
      driver core: platform: Initialize dma_parms for platform devices · 9495b7e9
      Ulf Hansson 提交于
      It's currently the platform driver's responsibility to initialize the
      pointer, dma_parms, for its corresponding struct device. The benefit with
      this approach allows us to avoid the initialization and to not waste memory
      for the struct device_dma_parameters, as this can be decided on a case by
      case basis.
      
      However, it has turned out that this approach is not very practical.  Not
      only does it lead to open coding, but also to real errors. In principle
      callers of dma_set_max_seg_size() doesn't check the error code, but just
      assumes it succeeds.
      
      For these reasons, let's do the initialization from the common platform bus
      at the device registration point. This also follows the way the PCI devices
      are being managed, see pci_device_add().
      Suggested-by: NChristoph Hellwig <hch@lst.de>
      Cc: <stable@vger.kernel.org>
      Tested-by: NHaibo Chen <haibo.chen@nxp.com>
      Reviewed-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NUlf Hansson <ulf.hansson@linaro.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20200422100954.31211-1-ulf.hansson@linaro.orgSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9495b7e9
    • R
      locktorture.c: Fix if-statement empty body warnings · be44ae62
      Randy Dunlap 提交于
      When using -Wextra, gcc complains about torture_preempt_schedule()
      when its definition is empty (i.e., when CONFIG_PREEMPTION is not
      set/enabled).  Fix these warnings by adding an empty do-while block
      for that macro when CONFIG_PREEMPTION is not set.
      Fixes these build warnings:
      
      ../kernel/locking/locktorture.c:119:29: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      ../kernel/locking/locktorture.c:166:29: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      ../kernel/locking/locktorture.c:337:29: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      ../kernel/locking/locktorture.c:490:29: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      ../kernel/locking/locktorture.c:528:29: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      ../kernel/locking/locktorture.c:553:29: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      
      I have verified that there is no object code change (with gcc 7.5.0).
      Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Signed-off-by: N"Paul E. McKenney" <paulmck@kernel.org>
      be44ae62
    • P
      rcu-tasks: Add Kconfig option to mediate smp_mb() vs. IPI · 9ae58d7b
      Paul E. McKenney 提交于
      This commit provides a new TASKS_TRACE_RCU_READ_MB Kconfig option that
      enables use of read-side memory barriers by both rcu_read_lock_trace()
      and rcu_read_unlock_trace() when the are executed with the
      current->trc_reader_special.b.need_mb flag set.  This flag is currently
      never set.  Doing that is the subject of a later commit.
      Signed-off-by: NPaul E. McKenney <paulmck@kernel.org>
      9ae58d7b
    • P
      rcu-tasks: Split ->trc_reader_need_end · 276c4104
      Paul E. McKenney 提交于
      This commit splits ->trc_reader_need_end by using the rcu_special union.
      This change permits readers to check to see if a memory barrier is
      required without any added overhead in the common case where no such
      barrier is required.  This commit also adds the read-side checking.
      Later commits will add the machinery to properly set the new
      ->trc_reader_special.b.need_mb field.
      
      This commit also makes rcu_read_unlock_trace_special() tolerate nested
      read-side critical sections within interrupt and NMI handlers.
      Signed-off-by: NPaul E. McKenney <paulmck@kernel.org>
      276c4104
    • P
      rcu-tasks: Make RCU Tasks Trace make use of RCU scheduler hooks · 43766c3e
      Paul E. McKenney 提交于
      This commit makes the calls to rcu_tasks_qs() detect and report
      quiescent states for RCU tasks trace.  If the task is in a quiescent
      state and if ->trc_reader_checked is not yet set, the task sets its own
      ->trc_reader_checked.  This will cause the grace-period kthread to
      remove it from the holdout list if it still remains there.
      
      [ paulmck: Fix conditional compilation per kbuild test robot feedback. ]
      Signed-off-by: NPaul E. McKenney <paulmck@kernel.org>
      43766c3e
    • P
      rcu-tasks: Add an RCU Tasks Trace to simplify protection of tracing hooks · d5f177d3
      Paul E. McKenney 提交于
      Because RCU does not watch exception early-entry/late-exit, idle-loop,
      or CPU-hotplug execution, protection of tracing and BPF operations is
      needlessly complicated.  This commit therefore adds a variant of
      Tasks RCU that:
      
      o	Has explicit read-side markers to allow finite grace periods in
      	the face of in-kernel loops for PREEMPT=n builds.  These markers
      	are rcu_read_lock_trace() and rcu_read_unlock_trace().
      
      o	Protects code in the idle loop, exception entry/exit, and
      	CPU-hotplug code paths.  In this respect, RCU-tasks trace is
      	similar to SRCU, but with lighter-weight readers.
      
      o	Avoids expensive read-side instruction, having overhead similar
      	to that of Preemptible RCU.
      
      There are of course downsides:
      
      o	The grace-period code can send IPIs to CPUs, even when those
      	CPUs are in the idle loop or in nohz_full userspace.  This is
      	mitigated by later commits.
      
      o	It is necessary to scan the full tasklist, much as for Tasks RCU.
      
      o	There is a single callback queue guarded by a single lock,
      	again, much as for Tasks RCU.  However, those early use cases
      	that request multiple grace periods in quick succession are
      	expected to do so from a single task, which makes the single
      	lock almost irrelevant.  If needed, multiple callback queues
      	can be provided using any number of schemes.
      
      Perhaps most important, this variant of RCU does not affect the vanilla
      flavors, rcu_preempt and rcu_sched.  The fact that RCU Tasks Trace
      readers can operate from idle, offline, and exception entry/exit in no
      way enables rcu_preempt and rcu_sched readers to do so.
      
      The memory ordering was outlined here:
      https://lore.kernel.org/lkml/20200319034030.GX3199@paulmck-ThinkPad-P72/
      
      This effort benefited greatly from off-list discussions of BPF
      requirements with Alexei Starovoitov and Andrii Nakryiko.  At least
      some of the on-list discussions are captured in the Link: tags below.
      In addition, KCSAN was quite helpful in finding some early bugs.
      
      Link: https://lore.kernel.org/lkml/20200219150744.428764577@infradead.org/
      Link: https://lore.kernel.org/lkml/87mu8p797b.fsf@nanos.tec.linutronix.de/
      Link: https://lore.kernel.org/lkml/20200225221305.605144982@linutronix.de/
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Andrii Nakryiko <andriin@fb.com>
      [ paulmck: Apply feedback from Steve Rostedt and Joel Fernandes. ]
      [ paulmck: Decrement trc_n_readers_need_end upon IPI failure. ]
      [ paulmck: Fix locking issue reported by rcutorture. ]
      Signed-off-by: NPaul E. McKenney <paulmck@kernel.org>
      d5f177d3
    • P
      rcu-tasks: Add an RCU-tasks rude variant · c84aad76
      Paul E. McKenney 提交于
      This commit adds a "rude" variant of RCU-tasks that has as quiescent
      states schedule(), cond_resched_tasks_rcu_qs(), userspace execution,
      and (in theory, anyway) cond_resched().  In other words, RCU-tasks rude
      readers are regions of code with preemption disabled, but excluding code
      early in the CPU-online sequence and late in the CPU-offline sequence.
      Updates make use of IPIs and force an IPI and a context switch on each
      online CPU.  This variant is useful in some situations in tracing.
      Suggested-by: NSteven Rostedt <rostedt@goodmis.org>
      [ paulmck: Apply EXPORT_SYMBOL_GPL() feedback from Qiujun Huang. ]
      Signed-off-by: NPaul E. McKenney <paulmck@kernel.org>
      [ paulmck: Apply review feedback from Steve Rostedt. ]
      c84aad76
    • P
      rcu-tasks: Refactor RCU-tasks to allow variants to be added · 5873b8a9
      Paul E. McKenney 提交于
      This commit splits out generic processing from RCU-tasks-specific
      processing in order to allow additional flavors to be added.  It also
      adds a def_bool TASKS_RCU_GENERIC to enable the common RCU-tasks
      infrastructure code.
      
      This is primarily, but not entirely, a code-movement commit.
      Signed-off-by: NPaul E. McKenney <paulmck@kernel.org>
      5873b8a9
    • P
      rcu: Reinstate synchronize_rcu_mult() · b3d73156
      Paul E. McKenney 提交于
      With the advent and likely usage of synchronize_rcu_rude(), there is
      again a need to wait on multiple types of RCU grace periods, for
      example, call_rcu_tasks() and call_rcu_tasks_rude().  This commit
      therefore reinstates synchronize_rcu_mult() in order to allow these
      grace periods to be straightforwardly waited on concurrently.
      Signed-off-by: NPaul E. McKenney <paulmck@kernel.org>
      b3d73156
    • P
      sched/core: Add function to sample state of locked-down task · 2beaf328
      Paul E. McKenney 提交于
      A running task's state can be sampled in a consistent manner (for example,
      for diagnostic purposes) simply by invoking smp_call_function_single()
      on its CPU, which may be obtained using task_cpu(), then having the
      IPI handler verify that the desired task is in fact still running.
      However, if the task is not running, this sampling can in theory be done
      immediately and directly.  In practice, the task might start running at
      any time, including during the sampling period.  Gaining a consistent
      sample of a not-running task therefore requires that something be done
      to lock down the target task's state.
      
      This commit therefore adds a try_invoke_on_locked_down_task() function
      that invokes a specified function if the specified task can be locked
      down, returning true if successful and if the specified function returns
      true.  Otherwise this function simply returns false.  Given that the
      function passed to try_invoke_on_nonrunning_task() might be invoked with
      a runqueue lock held, that function had better be quite lightweight.
      
      The function is passed the target task's task_struct pointer and the
      argument passed to try_invoke_on_locked_down_task(), allowing easy access
      to task state and to a location for further variables to be passed in
      and out.
      
      Note that the specified function will be called even if the specified
      task is currently running.  The function can use ->on_rq and task_curr()
      to quickly and easily determine the task's state, and can return false
      if this state is not to the function's liking.  The caller of the
      try_invoke_on_locked_down_task() would then see the false return value,
      and could take appropriate action, for example, trying again later or
      sending an IPI if matters are more urgent.
      
      It is expected that use cases such as the RCU CPU stall warning code will
      simply return false if the task is currently running.  However, there are
      use cases involving nohz_full CPUs where the specified function might
      instead fall back to an alternative sampling scheme that relies on heavier
      synchronization (such as memory barriers) in the target task.
      
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      [ paulmck: Apply feedback from Peter Zijlstra and Steven Rostedt. ]
      [ paulmck: Invoke if running to handle feedback from Mathieu Desnoyers. ]
      Reviewed-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Reviewed-by: NJoel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: NPaul E. McKenney <paulmck@kernel.org>
      2beaf328
    • L
      rcu: Remove unused ->rcu_read_unlock_special.b.deferred_qs field · f0bdf6d4
      Lai Jiangshan 提交于
      The ->rcu_read_unlock_special.b.deferred_qs field is set to true in
      rcu_read_unlock_special() but never set to false.  This is not
      particularly useful, so this commit removes this field.
      
      The only possible justification for this field is to ease debugging
      of RCU deferred quiscent states, but the combination of the other
      ->rcu_read_unlock_special fields plus ->rcu_blocked_node and of course
      ->rcu_read_lock_nesting should cover debugging needs.  And if this last
      proves incorrect, this patch can always be reverted, along with the
      required setting of ->rcu_read_unlock_special.b.deferred_qs to false
      in rcu_preempt_deferred_qs_irqrestore().
      Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com>
      Signed-off-by: NPaul E. McKenney <paulmck@kernel.org>
      f0bdf6d4
    • P
      rcu: Add rcu_gp_might_be_stalled() · 6be7436d
      Paul E. McKenney 提交于
      This commit adds rcu_gp_might_be_stalled(), which returns true if there
      is some reason to believe that the RCU grace period is stalled.  The use
      case is where an RCU free-memory path needs to allocate memory in order
      to free it, a situation that should be avoided where possible.
      
      But where it is necessary, there is always the alternative of using
      synchronize_rcu() to wait for a grace period in order to avoid the
      allocation.  And if the grace period is stalled, allocating memory to
      asynchronously wait for it is a bad idea of epic proportions: Far better
      to let others use the memory, because these others might actually be
      able to free that memory before the grace period ends.
      
      Thus, rcu_gp_might_be_stalled() can be used to help decide whether
      allocating memory on an RCU free path is a semi-reasonable course
      of action.
      
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Signed-off-by: NPaul E. McKenney <paulmck@kernel.org>
      6be7436d