1. 17 Aug 2017, 16 commits
    • locking: Remove spin_unlock_wait() generic definitions · d3a024ab
      Paul E. McKenney authored
      There is no agreed-upon definition of spin_unlock_wait()'s semantics,
      and it appears that all callers could do just as well with a lock/unlock
      pair.  This commit therefore removes spin_unlock_wait() and related
      definitions from core code.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      d3a024ab
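      A minimal sketch of the conversion pattern this series applies (illustrative
      only, not the upstream diff; the helper name below is hypothetical):

          /* Before: spin_unlock_wait(&some_lock) waited for any holder to let go. */
          static void wait_for_current_holder(spinlock_t *some_lock)
          {
              spin_lock(some_lock);     /* waits for the current holder to finish */
              spin_unlock(some_lock);   /* ...and releases the lock immediately   */
          }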
    • exit: Replace spin_unlock_wait() with lock/unlock pair · 8083f293
      Paul E. McKenney authored
      There is no agreed-upon definition of spin_unlock_wait()'s semantics, and
      it appears that all callers could do just as well with a lock/unlock pair.
      This commit therefore replaces the spin_unlock_wait() call in do_exit()
      with spin_lock() followed immediately by spin_unlock().  This should be
      safe from a performance perspective because the lock is a per-task lock,
      and this is happening only at task-exit time.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      8083f293
    • completion: Replace spin_unlock_wait() with lock/unlock pair · dec13c42
      Paul E. McKenney authored
      There is no agreed-upon definition of spin_unlock_wait()'s semantics,
      and it appears that all callers could do just as well with a lock/unlock
      pair.  This commit therefore replaces the spin_unlock_wait() call in
      completion_done() with spin_lock() followed immediately by spin_unlock().
      This should be safe from a performance perspective because the lock
      will be held only while the wakeup happens, which is really quick.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      dec13c42
    • membarrier: Provide expedited private command · 22e4ebb9
      Mathieu Desnoyers authored
      Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs, using a cpumask
      built from all runqueues whose current thread's mm matches that of the
      thread calling sys_membarrier. It executes faster than the non-expedited
      variant (no blocking). It also works on NOHZ_FULL configurations.
      
      Scheduler-wise, it requires a memory barrier before and after context
      switching between processes (which have different mm). The memory
      barrier before context switch is already present. For the barrier after
      context switch:
      
      * Our TSO archs can do RELEASE without being a full barrier. Look at
        x86 spin_unlock() being a regular STORE for example.  But for those
        archs, all atomics imply smp_mb and all of them have atomic ops in
        switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full
        barrier.
      
      * Of all weakly ordered machines, only ARM64 and PPC can do RELEASE;
        the rest do indeed use smp_mb(), so there spin_unlock() is a full
        barrier and we're good.
      
      * ARM64 has a very heavy barrier in switch_to(), which suffices.
      
      * PPC just removed its barrier from switch_to(), but appears to be
        talking about adding something to switch_mm(). So add a
        smp_mb__after_unlock_lock() for now, until this is settled on the PPC
        side.
      
      Changes since v3:
      - Properly document the memory barriers provided by each architecture.
      
      Changes since v2:
      - Address comments from Peter Zijlstra,
      - Add smp_mb__after_unlock_lock() after finish_lock_switch() in
        finish_task_switch() to add the memory barrier we need after storing
        to rq->curr. This is much simpler than the previous approach relying
        on atomic_dec_and_test() in mmdrop(), which actually added a memory
        barrier in the common case of switching between userspace processes.
      - Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full
        kernel, rather than having the whole membarrier system call returning
        -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full.
        Adapt the CMD_QUERY mask accordingly.
      
      Changes since v1:
      - move membarrier code under kernel/sched/ because it uses the
        scheduler runqueue,
      - only add the barrier when we switch from a kernel thread. The case
        where we switch from a user-space thread is already handled by
        the atomic_dec_and_test() in mmdrop().
      - add a comment to mmdrop() documenting the requirement on the implicit
        memory barrier.
      
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      CC: Boqun Feng <boqun.feng@gmail.com>
      CC: Andrew Hunter <ahh@google.com>
      CC: Maged Michael <maged.michael@gmail.com>
      CC: gromer@google.com
      CC: Avi Kivity <avi@scylladb.com>
      CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: Paul Mackerras <paulus@samba.org>
      CC: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Dave Watson <davejwatson@fb.com>
      22e4ebb9
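      A hedged user-space sketch of invoking the new command (assuming the uapi
      header exports MEMBARRIER_CMD_PRIVATE_EXPEDITED; later kernels may
      additionally require a registration command first, and error handling is
      omitted here):

          #include <linux/membarrier.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          /* Issue an expedited memory barrier on all CPUs running this mm. */
          static int membarrier_private_expedited(void)
          {
              return syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
          }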
    • rcu: Remove exports from rcu_idle_exit() and rcu_idle_enter() · 16c0b106
      Paul E. McKenney authored
      The rcu_idle_exit() and rcu_idle_enter() functions are exported because
      they were originally used by RCU_NONIDLE(), which was intended to
      be usable from modules.  However, RCU_NONIDLE() now instead uses
      rcu_irq_enter_irqson() and rcu_irq_exit_irqson(), which are not
      exported, and there have been no complaints.
      
      This commit therefore removes the exports from rcu_idle_exit() and
      rcu_idle_enter().
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      16c0b106
    • rcu: Add warning to rcu_idle_enter() for irqs enabled · d4db30af
      Paul E. McKenney authored
      All current callers of rcu_idle_enter() have irqs disabled, and
      rcu_idle_enter() relies on this but does not check it.  This commit
      therefore adds an RCU_LOCKDEP_WARN() to verify that assumption.
      While in the area, pass "true" rather than "1" to rcu_eqs_enter().
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      d4db30af
    • rcu: Make rcu_idle_enter() rely on callers disabling irqs · 3a607992
      Peter Zijlstra (Intel) authored
      All callers of rcu_idle_enter() have irqs disabled, so there is no
      point in rcu_idle_enter() disabling them again.  This commit therefore
      replaces the irq disabling with an RCU_LOCKDEP_WARN().
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      3a607992
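      A sketch of the resulting rcu_idle_enter() shape after these two commits
      (not the exact upstream code):

          void rcu_idle_enter(void)
          {
              /* Callers must have irqs disabled; warn under lockdep if not. */
              RCU_LOCKDEP_WARN(!irqs_disabled(),
                               "rcu_idle_enter() invoked with irqs enabled!");
              rcu_eqs_enter(false);    /* idle entry, not userspace entry */
          }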
    • rcu: Add assertions verifying blocked-tasks list · 2dee9404
      Paul E. McKenney authored
      This commit adds assertions verifying the consistency of the rcu_node
      structure's ->blkd_tasks list and its ->gp_tasks, ->exp_tasks, and
      ->boost_tasks pointers.  In particular, the ->blkd_tasks lists must be
      empty except for leaf rcu_node structures.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      2dee9404
    • rcu/tracing: Set disable_rcu_irq_enter on rcu_eqs_exit() · 35fe723b
      Masami Hiramatsu authored
      Set disable_rcu_irq_enter not only in rcu_eqs_enter_common() but also
      in rcu_eqs_exit(), since rcu_eqs_exit() suffers from the same issue as was
      fixed for rcu_eqs_enter_common() by commit 03ecd3f4 ("rcu/tracing:
      Add rcu_disabled to denote when rcu_irq_enter() will not work").
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      35fe723b
    • rcu: Add TPS() protection for _rcu_barrier_trace strings · d8db2e86
      Paul E. McKenney authored
      The _rcu_barrier_trace() function is a wrapper for trace_rcu_barrier(),
      which needs TPS() protection for strings passed through the second
      argument.  However, it has escaped prior TPS()-ification efforts because
      _rcu_barrier_trace() does not start with "trace_".  This commit
      therefore adds the needed TPS() protection.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      d8db2e86
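      A representative call after the change (a sketch; the surrounding arguments
      are from memory, not the verbatim diff):

          /* The string literal now survives post-mortem trace extraction. */
          _rcu_barrier_trace(rsp, TPS("Begin"), -1, rsp->barrier_sequence);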
    • rcu: Use idle versions of swait to make idle-hack clear · d5374226
      Luis R. Rodriguez authored
      These RCU waits were set to use interruptible waits to avoid the kthreads
      contributing to system load average, even though they are not interruptible
      as they are spawned from a kthread. Use the new TASK_IDLE swaits which makes
      our goal clear, and removes confusion about these paths possibly being
      interruptible -- they are not.
      
      When the system is idle the RCU grace-period kthread will spend all its time
      blocked inside the swait_event_interruptible(). If the interruptible() was
      not used, then this kthread would contribute to the load average. This means
      that an idle system would have a load average of 2 (or 3 if PREEMPT=y),
      rather than the load average of 0 that almost fifty years of UNIX has
      conditioned sysadmins to expect.
      
      The same argument applies to swait_event_interruptible_timeout() use. The
      RCU grace-period kthread spends its time blocked inside this call while
      waiting for grace periods to complete. In particular, if there was only one
      busy CPU, but that CPU was frequently invoking call_rcu(), then the RCU
      grace-period kthread would spend almost all its time blocked inside the
      swait_event_interruptible_timeout(). This would mean that the load average
      would be 2 rather than the expected 1 for the single busy CPU.
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Tested-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NLuis R. Rodriguez <mcgrof@kernel.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      d5374226
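      An illustrative before/after, assuming the new swait_event_idle*() helpers
      introduced alongside this change (gp_condition and timeout_jiffies are
      placeholder names):

          /* Before: interruptible only to dodge load-average accounting. */
          swait_event_interruptible(rsp->gp_wq, gp_condition);

          /* After: the TASK_IDLE variants make the intent explicit.      */
          swait_event_idle(rsp->gp_wq, gp_condition);
          swait_event_idle_timeout(rsp->gp_wq, gp_condition, timeout_jiffies);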
    • rcu: Add event tracing to ->gp_tasks update at GP start · c5ebe66c
      Paul E. McKenney authored
      There is currently event tracing to track when a task is preempted
      within a preemptible RCU read-side critical section, and also when that
      task subsequently reaches its outermost rcu_read_unlock(), but none
      indicating when a newly started grace period must wait on pre-existing
      readers that have been preempted at least once since the beginning of
      their current RCU read-side critical sections.
      
      This commit therefore adds an event trace at grace-period start in
      the case where there are such readers.  Note that only the first
      reader in the list is traced.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      c5ebe66c
    • rcu: Move rcu.h to new trivial-function style · 7414fac0
      Paul E. McKenney authored
      This commit saves a few lines in kernel/rcu/rcu.h by moving to single-line
      definitions for trivial functions, instead of the old style where the
      two curly braces each get their own line.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      7414fac0
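      The style change in miniature (rcu_example_stub is a made-up name used only
      for illustration):

          /* Old style: each curly brace on its own line. */
          static inline void rcu_example_stub(void)
          {
          }

          /* New style: trivial functions on a single line. */
          static inline void rcu_example_stub(void) { }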
    • rcu: Add TPS() to event-traced strings · bedbb648
      Paul E. McKenney authored
      Strings used in event tracing need to be specially handled, for example,
      using the TPS() macro.  Without the TPS() macro, although output looks
      fine from within a running kernel, extracting traces from a crash dump
      produces garbage instead of strings.  This commit therefore adds the TPS()
      macro to some unadorned strings that were passed to event-tracing macros.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      bedbb648
    • rcu: Create reasonable API for do_exit() TASKS_RCU processing · ccdd29ff
      Paul E. McKenney authored
      Currently, the exit-time support for TASKS_RCU is open-coded in do_exit().
      This commit creates exit_tasks_rcu_start() and exit_tasks_rcu_finish()
      APIs for do_exit() use.  This has the benefit of confining the use of the
      tasks_rcu_exit_srcu variable to one file, allowing it to become static.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      ccdd29ff
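      A simplified sketch of the resulting do_exit() usage (the wrapper below is
      purely illustrative; the exact placement of the start/finish calls is up to
      the real do_exit()):

          static void exit_notify_with_tasks_rcu(struct task_struct *tsk, int group_dead)
          {
              exit_tasks_rcu_start();         /* begin TASKS_RCU exit handling     */
              exit_notify(tsk, group_dead);   /* work the old open-coding covered  */
              exit_tasks_rcu_finish();        /* report the task-exit to TASKS_RCU */
          }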
    • rcu: Drive TASKS_RCU directly off of PREEMPT · 7e42776d
      Paul E. McKenney authored
      The actual use of TASKS_RCU is only when PREEMPT, otherwise RCU-sched
      is used instead.  This commit therefore makes synchronize_rcu_tasks()
      and call_rcu_tasks() available always, but mapped to synchronize_sched()
      and call_rcu_sched(), respectively, when !PREEMPT.  This approach also
      allows some #ifdefs to be removed from rcutorture.
      Reported-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      7e42776d
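      Roughly, the resulting header mapping (a sketch of the idea, not the
      verbatim diff):

          #ifdef CONFIG_TASKS_RCU
          void call_rcu_tasks(struct rcu_head *head, rcu_callback_t func);
          void synchronize_rcu_tasks(void);
          #else /* !CONFIG_TASKS_RCU: PREEMPT=n, fall back to RCU-sched */
          #define call_rcu_tasks        call_rcu_sched
          #define synchronize_rcu_tasks synchronize_sched
          #endif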
  2. 12 Aug 2017, 1 commit
  3. 11 Aug 2017, 2 commits
    • mm: migrate: prevent racy access to tlb_flush_pending · 16af97dc
      Nadav Amit authored
      Patch series "fixes of TLB batching races", v6.
      
      It turns out that Linux TLB batching mechanism suffers from various
      races.  Races that are caused due to batching during reclamation were
      recently handled by Mel and this patch-set deals with others.  The more
      fundamental issue is that concurrent updates of the page-tables allow
      for TLB flushes to be batched on one core, while another core changes
      the page-tables.  This other core may assume a PTE change does not
      require a flush based on the updated PTE value, while it is unaware that
      TLB flushes are still pending.
      
      This behavior affects KSM (which may result in memory corruption) and
      MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
      proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
      Memory corruption in KSM is harder to produce in practice, but was
      observed by hacking the kernel and adding a delay before flushing and
      replacing the KSM page.
      
      Finally, there is also one memory barrier missing, which may affect
      architectures with a weak memory model.
      
      This patch (of 7):
      
      Setting and clearing mm->tlb_flush_pending can be performed by multiple
      threads, since mmap_sem may only be acquired for read in
      task_numa_work().  If this happens, tlb_flush_pending might be cleared
      while one of the threads still changes PTEs and batches TLB flushes.
      
      This can lead to the same race between migration and
      change_protection_range() that led to the introduction of
      tlb_flush_pending.  The result of this race was data corruption, which
      means that this patch also addresses a theoretically possible data
      corruption.
      
      An actual data corruption was not observed, yet the race was
      confirmed by adding an assertion to check that tlb_flush_pending is not
      set by two threads, adding artificial latency in
      change_protection_range(), and using sysctl to reduce
      kernel.numa_balancing_scan_delay_ms.
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
      Fixes: 20841405 ("mm: fix TLB flush race between migration, and change_protection_range")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16af97dc
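      The gist of the fix is to turn the flag into a counter so that concurrent
      users cannot clear each other's pending indication. A sketch of the
      counter-based helpers (names and details are assumptions, simplified):

          static inline void inc_tlb_flush_pending(struct mm_struct *mm)
          {
              atomic_inc(&mm->tlb_flush_pending);
          }

          static inline void dec_tlb_flush_pending(struct mm_struct *mm)
          {
              atomic_dec(&mm->tlb_flush_pending);
          }

          static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
          {
              return atomic_read(&mm->tlb_flush_pending) > 0;
          }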
    • mm: fix global NR_SLAB_.*CLAIMABLE counter reads · d507e2eb
      Johannes Weiner authored
      As Tetsuo points out:
       "Commit 385386cf ("mm: vmstat: move slab statistics from zone to
        node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
        0kB"
      
      In addition to /proc/meminfo, this problem also affects the slab
      counters OOM/allocation failure info dumps, can cause early -ENOMEM from
      overcommit protection, and miscalculate image size requirements during
      suspend-to-disk.
      
      This is because the patch in question switched the slab counters from
      the zone level to the node level, but forgot to update the global
      accessor functions to read the aggregate node data instead of the
      aggregate zone data.
      
      Use global_node_page_state() to access the global slab counters.
      
      Fixes: 385386cf ("mm: vmstat: move slab statistics from zone to node counters")
      Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Stefan Agner <stefan@agner.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d507e2eb
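      The shape of the fix, as a hedged sketch: global readers sum the node-level
      counters instead of the now-unmaintained zone-level ones.

          /* e.g. the "Slab:" total reported in /proc/meminfo, in kB */
          unsigned long slab_kb =
                  (global_node_page_state(NR_SLAB_RECLAIMABLE) +
                   global_node_page_state(NR_SLAB_UNRECLAIMABLE)) << (PAGE_SHIFT - 10);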
  4. 10 Aug 2017, 1 commit
    • futex: Remove unnecessary warning from get_futex_key · 48fb6f4d
      Mel Gorman authored
      Commit 65d8fc77 ("futex: Remove requirement for lock_page() in
      get_futex_key()") removed an unnecessary lock_page() with the
      side-effect that page->mapping needed to be treated very carefully.
      
      Two defensive warnings were added in case any assumption was missed and
      the first warning assumed a correct application would not alter a
      mapping backing a futex key.  Since merging, it has not triggered for
      any unexpected case but Mark Rutland reported the following bug
      triggering due to the first warning.
      
        kernel BUG at kernel/futex.c:679!
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted 4.13.0-rc3-00020-g307fec773ba3 #3
        Hardware name: linux,dummy-virt (DT)
        task: ffff80001e271780 task.stack: ffff000010908000
        PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
        LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
        pc : [<ffff00000821ac14>] lr : [<ffff00000821ac14>] pstate: 80000145
      
      The fact that it's a bug instead of a warning was due to an unrelated
      arm64 problem, but the warning itself triggered because the underlying
      mapping changed.
      
      This is an application issue but from a kernel perspective it's a
      recoverable situation and the warning is unnecessary so this patch
      removes the warning.  The warning may potentially be triggered with the
      following test program from Mark although it may be necessary to adjust
      NR_FUTEX_THREADS to be a value smaller than the number of CPUs in the
      system.
      
          #include <linux/futex.h>
          #include <pthread.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <sys/mman.h>
          #include <sys/syscall.h>
          #include <sys/time.h>
          #include <unistd.h>
      
          #define NR_FUTEX_THREADS 16
          pthread_t threads[NR_FUTEX_THREADS];
      
          void *mem;
      
          #define MEM_PROT  (PROT_READ | PROT_WRITE)
          #define MEM_SIZE  65536
      
          static int futex_wrapper(int *uaddr, int op, int val,
                                   const struct timespec *timeout,
                                   int *uaddr2, int val3)
          {
              return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
          }
      
          void *poll_futex(void *unused)
          {
              for (;;) {
                  futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1);
              }
          }
      
          int main(int argc, char *argv[])
          {
              int i;
      
              mem = mmap(NULL, MEM_SIZE, MEM_PROT,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      
              printf("Mapping @ %p\n", mem);
      
              printf("Creating futex threads...\n");
      
              for (i = 0; i < NR_FUTEX_THREADS; i++)
                  pthread_create(&threads[i], NULL, poll_futex, NULL);
      
              printf("Flipping mapping...\n");
              for (;;) {
                  mmap(mem, MEM_SIZE, MEM_PROT,
                       MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
              }
      
              return 0;
          }
      Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org # 4.7+
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48fb6f4d
  5. 07 Aug 2017, 1 commit
  6. 03 Aug 2017, 2 commits
    • cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() · 89affbf5
      Dima Zavin authored
      In codepaths that use the begin/retry interface for reading
      mems_allowed_seq with irqs disabled, there exists a race condition that
      stalls the patch process after only modifying a subset of the
      static_branch call sites.
      
      This problem manifested itself as a deadlock in the slub allocator,
      inside get_any_partial.  The loop reads mems_allowed_seq value (via
      read_mems_allowed_begin), performs the defrag operation, and then
      verifies the consistency of mem_allowed via the read_mems_allowed_retry
      and the cookie returned by xxx_begin.
      
      The issue here is that both begin and retry first check whether cpusets
      are enabled via the cpusets_enabled() static branch.  This branch can be
      rewritten dynamically (via cpuset_inc) if a new cpuset is created.  The
      x86 jump label code fully synchronizes across all CPUs for every entry
      it rewrites.  If it rewrites only one of the callsites (specifically the
      one in read_mems_allowed_retry) and then waits for the
      smp_call_function(do_sync_core) to complete while a CPU is inside the
      begin/retry section with IRQs off and the mems_allowed value is changed,
      we can hang.
      
      This is because begin() will always return 0 (since it wasn't patched
      yet) while retry() will test the 0 against the actual value of the seq
      counter.
      
      The fix is to use two different static keys: one for begin
      (pre_enable_key) and one for retry (enable_key).  In cpuset_inc(), we
      first bump the pre_enable key to ensure that cpuset_mems_allowed_begin()
      always returns a valid seqcount if we are enabling cpusets.  Similarly, when
      disabling cpusets via cpuset_dec(), we first ensure that callers of
      cpuset_mems_allowed_retry() will start ignoring the seqcount value
      before we let cpuset_mems_allowed_begin() return 0.
      
      The relevant stack traces of the two stuck threads:
      
        CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
        RIP: smp_call_function_many+0x1f9/0x260
        Call Trace:
          smp_call_function+0x3b/0x70
          on_each_cpu+0x2f/0x90
          text_poke_bp+0x87/0xd0
          arch_jump_label_transform+0x93/0x100
          __jump_label_update+0x77/0x90
          jump_label_update+0xaa/0xc0
          static_key_slow_inc+0x9e/0xb0
          cpuset_css_online+0x70/0x2e0
          online_css+0x2c/0xa0
          cgroup_apply_control_enable+0x27f/0x3d0
          cgroup_mkdir+0x2b7/0x420
          kernfs_iop_mkdir+0x5a/0x80
          vfs_mkdir+0xf6/0x1a0
          SyS_mkdir+0xb7/0xe0
          entry_SYSCALL_64_fastpath+0x18/0xad
      
        ...
      
        CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8818087c0000 task.stack: ffffc90000030000
        RIP: int3+0x39/0x70
        Call Trace:
          <#DB> ? ___slab_alloc+0x28b/0x5a0
          <EOE> ? copy_process.part.40+0xf7/0x1de0
          __slab_alloc.isra.80+0x54/0x90
          copy_process.part.40+0xf7/0x1de0
          copy_process.part.40+0xf7/0x1de0
          kmem_cache_alloc_node+0x8a/0x280
          copy_process.part.40+0xf7/0x1de0
          _do_fork+0xe7/0x6c0
          _raw_spin_unlock_irq+0x2d/0x60
          trace_hardirqs_on_caller+0x136/0x1d0
          entry_SYSCALL_64_fastpath+0x5/0xad
          do_syscall_64+0x27/0x350
          SyS_clone+0x19/0x20
          do_syscall_64+0x60/0x350
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
      Fixes: 46e700ab ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
      Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
      Reported-by: Cliff Spradlin <cspradlin@waymo.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89affbf5
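      A sketch of the two-key ordering described above (key names taken from the
      changelog; not the verbatim upstream code):

          static inline void cpuset_inc(void)
          {
              static_branch_inc(&cpusets_pre_enable_key); /* begin() returns real seqcounts first */
              static_branch_inc(&cpusets_enabled_key);    /* only then does retry() start checking */
          }

          static inline void cpuset_dec(void)
          {
              static_branch_dec(&cpusets_enabled_key);    /* retry() stops checking first          */
              static_branch_dec(&cpusets_pre_enable_key); /* ...then begin() may return 0 again    */
          }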
    • pid: kill pidhash_size in pidhash_init() · 27e37d84
      Kefeng Wang authored
      After commit 3d375d78 ("mm: update callers to use HASH_ZERO flag"),
      drop unused pidhash_size in pidhash_init().
      
      Link: http://lkml.kernel.org/r/1500389267-49222-1-git-send-email-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Pavel Tatashin <Pasha.Tatashin@Oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27e37d84
  7. 01 Aug 2017, 1 commit
  8. 30 Jul 2017, 2 commits
  9. 29 Jul 2017, 1 commit
    • sched: Allow migrating kthreads into online but inactive CPUs · 955dbdf4
      Tejun Heo authored
      Per-cpu workqueues have been tripping CPU affinity sanity checks while
      a CPU is being offlined.  A per-cpu kworker ends up running on a CPU
      which isn't its target CPU while the CPU is online but inactive.
      
      While the scheduler allows kthreads to wake up on an online but
      inactive CPU, it doesn't allow a running kthread to be migrated to
      such a CPU, which leads to an odd situation where setting affinity on
      a sleeping and running kthread leads to different results.
      
      Each mem-reclaim workqueue has one rescuer which guarantees forward
      progress and the rescuer needs to bind itself to the CPU which needs
      help in making forward progress; however, due to the above issue,
      while set_cpus_allowed_ptr() succeeds, the rescuer doesn't end up on
      the correct CPU if the CPU is in the process of going offline,
      tripping the sanity check and executing the work item on the wrong
      CPU.
      
      This patch updates __migrate_task() so that kthreads can be migrated
      into an inactive but online CPU.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      955dbdf4
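      The core of the relaxed check, sketched as a standalone helper
      (dest_cpu_allowed is an illustrative name, not the upstream function):

          static bool dest_cpu_allowed(struct task_struct *p, int dest_cpu)
          {
              /* Kthreads may target any online CPU; user tasks need an active one. */
              if (p->flags & PF_KTHREAD)
                  return cpu_online(dest_cpu);
              return cpu_active(dest_cpu);
          }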
  10. 28 Jul 2017, 2 commits
    • workqueue: Work around edge cases for calc of pool's cpumask · 1ad0f0a7
      Michael Bringmann authored
      There is an underlying assumption/trade-off in many layers of the Linux
      system that CPU <-> node mapping is static.  This is despite the presence
      of features like NUMA and 'hotplug' that support the dynamic addition/
      removal of fundamental system resources like CPUs and memory.  PowerPC
      systems, however, do provide extensive features for the dynamic change
      of resources available to a system.
      
      Currently, there is little or no synchronization protection around the
      updating of the CPU <-> node mapping, and the export/update of this
      information for other layers / modules.  In systems which can change
      this mapping during 'hotplug', like PowerPC, the information is changing
      underneath all layers that might reference it.
      
      This patch attempts to ensure that a valid, usable cpumask attribute
      is used by the workqueue infrastructure when setting up new resource
      pools.  It prevents a crash that has been observed when an 'empty'
      cpumask is passed along to the worker/task scheduling code.  It is
      intended as a temporary workaround until a more fundamental review and
      correction of the issue can be done.
      
      [With additions to the patch provided by Tejun Heo <tj@kernel.org>]
      Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      1ad0f0a7
    • srcu: Provide ordering for CPU not involved in grace period · 35732cf9
      Paul E. McKenney authored
      Tree RCU guarantees that every online CPU has a memory barrier between
      any given grace period and any of that CPU's RCU read-side sections that
      must be ordered against that grace period.  Since RCU doesn't always
      know where read-side critical sections are, the actual implementation
      guarantees order against prior and subsequent non-idle non-offline code,
      whether in an RCU read-side critical section or not.  As a result, there
      does not need to be a memory barrier at the end of synchronize_rcu()
      and friends because the ordering internal to the grace period has
      ordered every CPU's post-grace-period execution against each CPU's
      pre-grace-period execution, again for all non-idle online CPUs.
      
      In contrast, SRCU can have non-idle online CPUs that are completely
      uninvolved in a given SRCU grace period, for example, a CPU that
      never runs any SRCU read-side critical sections and took no part in
      the grace-period processing.  It is in theory possible for a given
      synchronize_srcu()'s wakeup to be delivered to a CPU that was completely
      uninvolved in the prior SRCU grace period, which could mean that the
      code following that synchronize_srcu() would end up being unordered with
      respect to both the grace period and any pre-existing SRCU read-side
      critical sections.
      
      This commit therefore adds an smp_mb() to the end of __synchronize_srcu(),
      which prevents this scenario from occurring.
      Reported-by: Lance Roy <ldr709@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Lance Roy <ldr709@gmail.com>
      Cc: <stable@vger.kernel.org> # 4.12.x
      35732cf9
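      Roughly where the new barrier lands, sketched from the tail of
      __synchronize_srcu() (not the verbatim diff):

          wait_for_completion(&rcu.completion);
          destroy_rcu_head_on_stack(&rcu.head);

          /*
           * New: ensure code after synchronize_srcu() is ordered after the
           * grace period, even on a CPU that took no part in it.
           */
          smp_mb();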
  11. 27 Jul 2017, 1 commit
    • genirq/cpuhotplug: Revert "Set force affinity flag on hotplug migration" · 83979133
      Thomas Gleixner authored
      That commit was part of the changes moving x86 to the generic CPU hotplug
      interrupt migration code. The force flag was required on x86 before the
      hierarchical irqdomain rework, but invoking set_affinity() with force=true
      stayed and had no side effects.
      
      At some point in the past, the force flag got repurposed to support the
      exynos timer interrupt affinity setting to a not yet online CPU, so the
      interrupt controller callback does not verify the supplied affinity mask
      against cpu_online_mask.
      
      Setting the flag in the CPU hotplug code causes the cpu online masking to
      be blocked on these irq controllers and results in potentially affining an
      interrupt to the CPU which is unplugged, i.e. instead of moving it away,
      it's just reassigned to it.
      
      As the force flag is no longer needed on x86, it's safe to revert that
      patch so the ARM irqchips which use the force flag work again.
      
      Add comments to that effect, so this won't happen again.
      
      Note: The online mask handling should be done in the generic code and the
      force flag and the masking in the irq chips removed all together, but
      that's not a change possible for 4.13. 
      
      Fixes: 77f85e66 ("genirq/cpuhotplug: Set force affinity flag on hotplug migration")
      Reported-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: LAK <linux-arm-kernel@lists.infradead.org>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1707271217590.3109@nanos
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      83979133
  12. 26 Jul 2017, 10 commits
    • rcu: Move callback-list warning to irq-disable region · 09efeeee
      Paul E. McKenney authored
      After adopting callbacks from a newly offlined CPU, the adopting CPU
      checks to make sure that its callback list's count is zero only if the
      list has no callbacks and vice versa.  Unfortunately, it does so after
      enabling interrupts, which means that false positives are possible due to
      interrupt handlers invoking call_rcu().  Although these false positives
      are improbable, rcutorture did make it happen once.
      
      This commit therefore moves this check to an irq-disabled region of code,
      thus suppressing the false positive.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      09efeeee
    • rcu: Remove unused RCU list functions · aed4e046
      Paul E. McKenney authored
      Given changes to callback migration, rcu_cblist_head(),
      rcu_cblist_tail(), rcu_cblist_count_cbs(), rcu_segcblist_segempty(),
      rcu_segcblist_dequeued_lazy(), and rcu_segcblist_new_cbs() are
      no longer used.  This commit therefore removes them.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      aed4e046
    • rcu: Localize rcu_state ->orphan_pend and ->orphan_done · f2dbe4a5
      Paul E. McKenney authored
      Given that the rcu_state structure's ->orphan_pend and ->orphan_done
      fields are used only during migration of callbacks from the recently
      offlined CPU to a surviving CPU, if rcu_send_cbs_to_orphanage() and
      rcu_adopt_orphan_cbs() are combined, these fields can become local
      variables in the combined function.  This commit therefore combines
      rcu_send_cbs_to_orphanage() and rcu_adopt_orphan_cbs() into a new
      rcu_segcblist_merge() function and removes the ->orphan_pend and
      ->orphan_done fields.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      f2dbe4a5
    • rcu: Advance callbacks after migration · 21cc2483
      Paul E. McKenney authored
      When migrating callbacks from a newly offlined CPU, we are already
      holding the root rcu_node structure's lock, so it costs almost nothing
      to advance and accelerate the newly migrated callbacks.  This patch
      therefore makes this advancing and acceleration happen.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      21cc2483
    • rcu: Eliminate rcu_state ->orphan_lock · 537b85c8
      Paul E. McKenney authored
      The ->orphan_lock is acquired and released only within the
      rcu_migrate_callbacks() function, which now acquires the root rcu_node
      structure's ->lock.  This commit therefore eliminates the ->orphan_lock
      in favor of the root rcu_node structure's ->lock.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      537b85c8
    • rcu: Advance outgoing CPU's callbacks before migrating them · 9fa46fb8
      Paul E. McKenney authored
      It is possible that the outgoing CPU is unaware of recent grace periods,
      and so it is also possible that some of its pending callbacks are actually
      ready to be invoked.  The current callback-migration code would needlessly
      force these callbacks to pass through another grace period.  This commit
      therefore invokes rcu_advance_cbs() on the outgoing CPU's callbacks in
      order to give them full credit for having passed through any recent
      grace periods.
      
      This also fixes an odd theoretical bug where there are no callbacks in
      the system except for those on the outgoing CPU, none of those callbacks
      have yet been associated with a grace-period number, there is never again
      another callback registered, and the surviving CPU never again takes a
      scheduling-clock interrupt, never goes idle, and never enters nohz_full
      userspace execution.  Yes, this is (just barely) possible.  It requires
      that the surviving CPU be a nohz_full CPU, that its scheduler-clock
      interrupt be shut off, and that it loop forever in the kernel.  You get
      bonus points if you can make this one happen!  ;-)
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      9fa46fb8
    • rcu: Make NOCB CPUs migrate CBs directly from outgoing CPU · b1a2d79f
      Paul E. McKenney authored
      RCU's CPU-hotplug callback-migration code first moves the outgoing
      CPU's callbacks to ->orphan_done and ->orphan_pend, and only then
      moves them to the NOCB callback list.  This commit avoids the
      extra step (and simplifies the code) by moving the callbacks directly
      from the outgoing CPU's callback list to the NOCB callback list.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      b1a2d79f
    • rcu: Check for NOCB CPUs and empty lists earlier in CB migration · 95335c03
      Paul E. McKenney authored
      The current CPU-hotplug RCU-callback-migration code checks
      for the source (newly offlined) CPU being a NOCBs CPU down in
      rcu_send_cbs_to_orphanage().  This commit simplifies callback migration a
      bit by moving this check up to rcu_migrate_callbacks().  This commit also
      adds a check for the source CPU having no callbacks, which eases analysis
      of the rcu_send_cbs_to_orphanage() and rcu_adopt_orphan_cbs() functions.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      95335c03
    • rcu: Remove orphan/adopt event-tracing fields · c47e067a
      Paul E. McKenney authored
      The rcu_node structure's ->n_cbs_orphaned and ->n_cbs_adopted fields
      are updated, but never read.  This commit therefore removes them.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      c47e067a
    • torture: Fix typo suppressing CPU-hotplug statistics · a2b2df20
      Paul E. McKenney authored
      The torture status line contains a series of values preceded by "onoff:".
      The last value in that line, the one preceding the "HZ=" string, is
      always zero.  The reason that it is always zero is that torture_offline()
      was incrementing the sum_offl pointer instead of the value that this
      pointer referenced.  This commit therefore makes this increment operate
      on the statistic rather than the pointer to the statistic.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      a2b2df20
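      The bug and the fix in miniature (record_one_offline is a made-up wrapper,
      not the upstream function):

          static void record_one_offline(long *sum_offl)
          {
              (*sum_offl)++;    /* fixed: bump the statistic itself              */
              /* sum_offl++;       the typo: advanced the pointer, not the count */
          }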