1. 18 9月, 2015 2 次提交
  2. 17 9月, 2015 2 次提交
    • T
      cgroup: simplify threadgroup locking · 3014dde7
      Tejun Heo 提交于
      Note: This commit was originally committed as b5ba75b5 but got
            reverted by f9f9e7b7 due to the performance regression from
            the percpu_rwsem write down/up operations added to cgroup task
            migration path.  percpu_rwsem changes which alleviate the
            performance issue are pending for v4.4-rc1 merge window.
            Re-apply.
      
      Now that threadgroup locking is made global, code paths around it can
      be simplified.
      
      * lock-verify-unlock-retry dancing removed from __cgroup_procs_write().
      
      * Race protection against de_thread() removed from
        cgroup_update_dfl_csses().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
      3014dde7
    • T
      sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem · 1ed13287
      Tejun Heo 提交于
      Note: This commit was originally committed as d59cfc09 but got
            reverted by 0c986253 due to the performance regression from
            the percpu_rwsem write down/up operations added to cgroup task
            migration path.  percpu_rwsem changes which alleviate the
            performance issue are pending for v4.4-rc1 merge window.
            Re-apply.
      
      The cgroup side of threadgroup locking uses signal_struct->group_rwsem
      to synchronize against threadgroup changes.  This per-process rwsem
      adds small overhead to thread creation, exit and exec paths, forces
      cgroup code paths to do lock-verify-unlock-retry dance in a couple
      places and makes it impossible to atomically perform operations across
      multiple processes.
      
      This patch replaces signal_struct->group_rwsem with a global
      percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
      side and contained in cgroups proper.  This patch converts one-to-one.
      
      This does make writer side heavier and lower the granularity; however,
      cgroup process migration is a fairly cold path, we do want to optimize
      thread operations over it and cgroup migration operations don't take
      enough time for the lower granularity to matter.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      1ed13287
  3. 16 9月, 2015 2 次提交
    • T
      Revert "sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem" · 0c986253
      Tejun Heo 提交于
      This reverts commit d59cfc09.
      
      d59cfc09 ("sched, cgroup: replace signal_struct->group_rwsem with
      a global percpu_rwsem") and b5ba75b5 ("cgroup: simplify
      threadgroup locking") changed how cgroup synchronizes against task
      fork and exits so that it uses global percpu_rwsem instead of
      per-process rwsem; unfortunately, the write [un]lock paths of
      percpu_rwsem always involve synchronize_rcu_expedited() which turned
      out to be too expensive.
      
      Improvements for percpu_rwsem are scheduled to be merged in the coming
      v4.4-rc1 merge window which alleviates this issue.  For now, revert
      the two commits to restore per-process rwsem.  They will be re-applied
      for the v4.4-rc1 merge window.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.comReported-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org # v4.2+
      0c986253
    • T
      Revert "cgroup: simplify threadgroup locking" · f9f9e7b7
      Tejun Heo 提交于
      This reverts commit b5ba75b5.
      
      d59cfc09 ("sched, cgroup: replace signal_struct->group_rwsem with
      a global percpu_rwsem") and b5ba75b5 ("cgroup: simplify
      threadgroup locking") changed how cgroup synchronizes against task
      fork and exits so that it uses global percpu_rwsem instead of
      per-process rwsem; unfortunately, the write [un]lock paths of
      percpu_rwsem always involve synchronize_rcu_expedited() which turned
      out to be too expensive.
      
      Improvements for percpu_rwsem are scheduled to be merged in the coming
      v4.4-rc1 merge window which alleviates this issue.  For now, revert
      the two commits to restore per-process rwsem.  They will be re-applied
      for the v4.4-rc1 merge window.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.comReported-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org # v4.2+
      f9f9e7b7
  4. 12 9月, 2015 1 次提交
    • M
      sys_membarrier(): system-wide memory barrier (generic, x86) · 5b25b13a
      Mathieu Desnoyers 提交于
      Here is an implementation of a new system call, sys_membarrier(), which
      executes a memory barrier on all threads running on the system.  It is
      implemented by calling synchronize_sched().  It can be used to
      distribute the cost of user-space memory barriers asymmetrically by
      transforming pairs of memory barriers into pairs consisting of
      sys_membarrier() and a compiler barrier.  For synchronization primitives
      that distinguish between read-side and write-side (e.g.  userspace RCU
      [1], rwlocks), the read-side can be accelerated significantly by moving
      the bulk of the memory barrier overhead to the write-side.
      
      The existing applications of which I am aware that would be improved by
      this system call are as follows:
      
      * Through Userspace RCU library (http://urcu.so)
        - DNS server (Knot DNS) https://www.knot-dns.cz/
        - Network sniffer (http://netsniff-ng.org/)
        - Distributed object storage (https://sheepdog.github.io/sheepdog/)
        - User-space tracing (http://lttng.org)
        - Network storage system (https://www.gluster.org/)
        - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
        - Financial software (https://lkml.org/lkml/2015/3/23/189)
      
      Those projects use RCU in userspace to increase read-side speed and
      scalability compared to locking.  Especially in the case of RCU used by
      libraries, sys_membarrier can speed up the read-side by moving the bulk of
      the memory barrier cost to synchronize_rcu().
      
      * Direct users of sys_membarrier
        - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
      
      Microsoft core dotnet GC developers are planning to use the mprotect()
      side-effect of issuing memory barriers through IPIs as a way to implement
      Windows FlushProcessWriteBuffers() on Linux.  They are referring to
      sys_membarrier in their github thread, specifically stating that
      sys_membarrier() is what they are looking for.
      
      To explain the benefit of this scheme, let's introduce two example threads:
      
      Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
      Thread B (frequent, e.g. executing liburcu
      rcu_read_lock()/rcu_read_unlock())
      
      In a scheme where all smp_mb() in thread A are ordering memory accesses
      with respect to smp_mb() present in Thread B, we can change each
      smp_mb() within Thread A into calls to sys_membarrier() and each
      smp_mb() within Thread B into compiler barriers "barrier()".
      
      Before the change, we had, for each smp_mb() pairs:
      
      Thread A                    Thread B
      previous mem accesses       previous mem accesses
      smp_mb()                    smp_mb()
      following mem accesses      following mem accesses
      
      After the change, these pairs become:
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      As we can see, there are two possible scenarios: either Thread B memory
      accesses do not happen concurrently with Thread A accesses (1), or they
      do (2).
      
      1) Non-concurrent Thread A vs Thread B accesses:
      
      Thread A                    Thread B
      prev mem accesses
      sys_membarrier()
      follow mem accesses
                                  prev mem accesses
                                  barrier()
                                  follow mem accesses
      
      In this case, thread B accesses will be weakly ordered. This is OK,
      because at that point, thread A is not particularly interested in
      ordering them with respect to its own accesses.
      
      2) Concurrent Thread A vs Thread B accesses
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      In this case, thread B accesses, which are ensured to be in program
      order thanks to the compiler barrier, will be "upgraded" to full
      smp_mb() by synchronize_sched().
      
      * Benchmarks
      
      On Intel Xeon E5405 (8 cores)
      (one thread is calling sys_membarrier, the other 7 threads are busy
      looping)
      
      1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.
      
      * User-space user of this system call: Userspace RCU library
      
      Both the signal-based and the sys_membarrier userspace RCU schemes
      permit us to remove the memory barrier from the userspace RCU
      rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
      accelerating them. These memory barriers are replaced by compiler
      barriers on the read-side, and all matching memory barriers on the
      write-side are turned into an invocation of a memory barrier on all
      active threads in the process. By letting the kernel perform this
      synchronization rather than dumbly sending a signal to every process
      threads (as we currently do), we diminish the number of unnecessary wake
      ups and only issue the memory barriers on active threads. Non-running
      threads do not need to execute such barrier anyway, because these are
      implied by the scheduler context switches.
      
      Results in liburcu:
      
      Operations in 10s, 6 readers, 2 writers:
      
      memory barriers in reader:    1701557485 reads, 2202847 writes
      signal-based scheme:          9830061167 reads,    6700 writes
      sys_membarrier:               9952759104 reads,     425 writes
      sys_membarrier (dyn. check):  7970328887 reads,     425 writes
      
      The dynamic sys_membarrier availability check adds some overhead to
      the read-side compared to the signal-based scheme, but besides that,
      sys_membarrier slightly outperforms the signal-based scheme. However,
      this non-expedited sys_membarrier implementation has a much slower grace
      period than signal and memory barrier schemes.
      
      Besides diminishing the number of wake-ups, one major advantage of the
      membarrier system call over the signal-based scheme is that it does not
      need to reserve a signal. This plays much more nicely with libraries,
      and with processes injected into for tracing purposes, for which we
      cannot expect that signals will be unused by the application.
      
      An expedited version of this system call can be added later on to speed
      up the grace period. Its implementation will likely depend on reading
      the cpu_curr()->mm without holding each CPU's rq lock.
      
      This patch adds the system call to x86 and to asm-generic.
      
      [1] http://urcu.so
      
      membarrier(2) man page:
      
      MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
      
      NAME
             membarrier - issue memory barriers on a set of threads
      
      SYNOPSIS
             #include <linux/membarrier.h>
      
             int membarrier(int cmd, int flags);
      
      DESCRIPTION
             The cmd argument is one of the following:
      
             MEMBARRIER_CMD_QUERY
                    Query  the  set  of  supported commands. It returns a bitmask of
                    supported commands.
      
             MEMBARRIER_CMD_SHARED
                    Execute a memory barrier on all threads running on  the  system.
                    Upon  return from system call, the caller thread is ensured that
                    all running threads have passed through a state where all memory
                    accesses  to  user-space  addresses  match program order between
                    entry to and return from the system  call  (non-running  threads
                    are de facto in such a state). This covers threads from all pro=E2=80=90
                    cesses running on the system.  This command returns 0.
      
             The flags argument needs to be 0. For future extensions.
      
             All memory accesses performed  in  program  order  from  each  targeted
             thread is guaranteed to be ordered with respect to sys_membarrier(). If
             we use the semantic "barrier()" to represent a compiler barrier forcing
             memory  accesses  to  be performed in program order across the barrier,
             and smp_mb() to represent explicit memory barriers forcing full  memory
             ordering  across  the barrier, we have the following ordering table for
             each pair of barrier(), sys_membarrier() and smp_mb():
      
             The pair ordering is detailed as (O: ordered, X: not ordered):
      
                                    barrier()   smp_mb() sys_membarrier()
                    barrier()          X           X            O
                    smp_mb()           X           O            O
                    sys_membarrier()   O           O            O
      
      RETURN VALUE
             On success, these system calls return zero.  On error, -1 is  returned,
             and errno is set appropriately. For a given command, with flags
             argument set to 0, this system call is guaranteed to always return the
             same value until reboot.
      
      ERRORS
             ENOSYS System call is not implemented.
      
             EINVAL Invalid arguments.
      
      Linux                             2015-04-15                     MEMBARRIER(2)
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Nicholas Miell <nmiell@comcast.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b25b13a
  5. 11 9月, 2015 14 次提交
    • I
      sysctl: fix int -> unsigned long assignments in INT_MIN case · 9a5bc726
      Ilya Dryomov 提交于
      The following
      
          if (val < 0)
              *lvalp = (unsigned long)-val;
      
      is incorrect because the compiler is free to assume -val to be positive
      and use a sign-extend instruction for extending the bit pattern.  This is
      a problem if val == INT_MIN:
      
          # echo -2147483648 >/proc/sys/dev/scsi/logging_level
          # cat /proc/sys/dev/scsi/logging_level
          -18446744071562067968
      
      Cast to unsigned long before negation - that way we first sign-extend and
      then negate an unsigned, which is well defined.  With this:
      
          # cat /proc/sys/dev/scsi/logging_level
          -2147483648
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      Cc: Mikulas Patocka <mikulas@twibright.com>
      Cc: Robert Xiao <nneonneo@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a5bc726
    • B
      kexec: export KERNEL_IMAGE_SIZE to vmcoreinfo · 1303a27c
      Baoquan He 提交于
      In x86_64, since v2.6.26 the KERNEL_IMAGE_SIZE is changed to 512M, and
      accordingly the MODULES_VADDR is changed to 0xffffffffa0000000.  However,
      in v3.12 Kees Cook introduced kaslr to randomise the location of kernel.
      And the kernel text mapping addr space is enlarged from 512M to 1G.  That
      means now KERNEL_IMAGE_SIZE is variable, its value is 512M when kaslr
      support is not compiled in and 1G when kaslr support is compiled in.
      Accordingly the MODULES_VADDR is changed too to be:
      
          #define MODULES_VADDR    (__START_KERNEL_map + KERNEL_IMAGE_SIZE)
      
      So when kaslr is compiled in and enabled, the kernel text mapping addr
      space and modules vaddr space need be adjusted.  Otherwise makedumpfile
      will collapse since the addr for some symbols is not correct.
      
      Hence KERNEL_IMAGE_SIZE need be exported to vmcoreinfo and got in
      makedumpfile to help calculate MODULES_VADDR.
      Signed-off-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1303a27c
    • B
      kexec: align crash_notes allocation to make it be inside one physical page · bbb78b8f
      Baoquan He 提交于
      People reported that crash_notes in /proc/vmcore were corrupted and this
      cause crash kdump failure.  With code debugging and log we got the root
      cause.  This is because percpu variable crash_notes are allocated in 2
      vmalloc pages.  Currently percpu is based on vmalloc by default.  Vmalloc
      can't guarantee 2 continuous vmalloc pages are also on 2 continuous
      physical pages.  So when 1st kernel exports the starting address and size
      of crash_notes through sysfs like below:
      
      /sys/devices/system/cpu/cpux/crash_notes
      /sys/devices/system/cpu/cpux/crash_notes_size
      
      kdump kernel use them to get the content of crash_notes.  However the 2nd
      part may not be in the next neighbouring physical page as we expected if
      crash_notes are allocated accross 2 vmalloc pages.  That's why
      nhdr_ptr->n_namesz or nhdr_ptr->n_descsz could be very huge in
      update_note_header_size_elf64() and cause note header merging failure or
      some warnings.
      
      In this patch change to call __alloc_percpu() to passed in the align value
      by rounding crash_notes_size up to the nearest power of two.  This makes
      sure the crash_notes is allocated inside one physical page since
      sizeof(note_buf_t) in all ARCHS is smaller than PAGE_SIZE.  Meanwhile add
      a BUILD_BUG_ON to break compile if size is bigger than PAGE_SIZE since
      crash_notes definitely will be in 2 pages.  That need be avoided, and need
      be reported if it's unavoidable.
      
      [akpm@linux-foundation.org: use correct comment layout]
      Signed-off-by: NBaoquan He <bhe@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Lisa Mitchell <lisa.mitchell@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bbb78b8f
    • M
      kexec: remove unnecessary test in kimage_alloc_crash_control_pages() · 04e9949b
      Minfei Huang 提交于
      Transforming PFN(Page Frame Number) to struct page is never failure, so we
      can simplify the code logic to do the image->control_page assignment
      directly in the loop, and remove the unnecessary conditional judgement.
      Signed-off-by: NMinfei Huang <mnfhuang@gmail.com>
      Acked-by: NDave Young <dyoung@redhat.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Simon Horman <horms@verge.net.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04e9949b
    • D
      kexec: split kexec_load syscall from kexec core code · 2965faa5
      Dave Young 提交于
      There are two kexec load syscalls, kexec_load another and kexec_file_load.
       kexec_file_load has been splited as kernel/kexec_file.c.  In this patch I
      split kexec_load syscall code to kernel/kexec.c.
      
      And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
      use kexec_file_load only, or vice verse.
      
      The original requirement is from Ted Ts'o, he want kexec kernel signature
      being checked with CONFIG_KEXEC_VERIFY_SIG enabled.  But kexec-tools use
      kexec_load syscall can bypass the checking.
      
      Vivek Goyal proposed to create a common kconfig option so user can compile
      in only one syscall for loading kexec kernel.  KEXEC/KEXEC_FILE selects
      KEXEC_CORE so that old config files still work.
      
      Because there's general code need CONFIG_KEXEC_CORE, so I updated all the
      architecture Kconfig with a new option KEXEC_CORE, and let KEXEC selects
      KEXEC_CORE in arch Kconfig.  Also updated general kernel code with to
      kexec_load syscall.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2965faa5
    • D
      kexec: split kexec_file syscall code to kexec_file.c · a43cac0d
      Dave Young 提交于
      Split kexec_file syscall related code to another file kernel/kexec_file.c
      so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can be dropped.
      
      Sharing variables and functions are moved to kernel/kexec_internal.h per
      suggestion from Vivek and Petr.
      
      [akpm@linux-foundation.org: fix bisectability]
      [akpm@linux-foundation.org: declare the various arch_kexec functions]
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a43cac0d
    • F
      kmod: handle UMH_WAIT_PROC from system unbound workqueue · bb304a5c
      Frederic Weisbecker 提交于
      The UMH_WAIT_PROC handler runs in its own thread in order to make sure
      that waiting for the exec kernel thread completion won't block other
      usermodehelper queued jobs.
      
      On older workqueue implementations, worklets couldn't sleep without
      blocking the rest of the queue.  But now the workqueue subsystem handles
      that.  Khelper still had the older limitation due to its singlethread
      properties but we replaced it to system unbound workqueues.
      
      Those are affine to the current node and can block up to some number of
      instances.
      
      They are a good candidate to handle UMH_WAIT_PROC assuming that we have
      enough system unbound workers to handle lots of parallel usermodehelper
      jobs.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb304a5c
    • F
      kmod: use system_unbound_wq instead of khelper · 90f02303
      Frederic Weisbecker 提交于
      We need to launch the usermodehelper kernel threads with the widest
      affinity and this is partly why we use khelper.  This workqueue has
      unbound properties and thus a wide affinity inherited by all its children.
      
      Now khelper also has special properties that we aren't much interested in:
      ordered and singlethread.  There is really no need about ordering as all
      we do is creating kernel threads.  This can be done concurrently.  And
      singlethread is a useless limitation as well.
      
      The workqueue engine already proposes generic unbound workqueues that
      don't share these useless properties and handle well parallel jobs.
      
      The only worrysome specific is their affinity to the node of the current
      CPU.  It's fine for creating the usermodehelper kernel threads but those
      inherit this affinity for longer jobs such as requesting modules.
      
      This patch proposes to use these node affine unbound workqueues assuming
      that a node is sufficient to handle several parallel usermodehelper
      requests.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90f02303
    • F
      kmod: add up-to-date explanations on the purpose of each asynchronous levels · b639e86b
      Frederic Weisbecker 提交于
      There seem to be quite some confusions on the comments, likely due to
      changes that came after them.
      
      Now since it's very non obvious why we have 3 levels of asynchronous code
      to implement usermodehelpers, it's important to comment in detail the
      reason of this layout.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b639e86b
    • F
      kmod: remove unecessary explicit wide CPU affinity setting · d097c024
      Frederic Weisbecker 提交于
      Khelper is affine to all CPUs.  Now since it creates the
      call_usermodehelper_exec_[a]sync() kernel threads, those inherit the wide
      affinity.
      
      As such explicitly forcing a wide affinity from those kernel threads
      is like a no-op.
      
      Just remove it. It's needless and it breaks CPU isolation users who
      rely on workqueue affinity tuning.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d097c024
    • F
      kmod: bunch of internal functions renames · b6b50a81
      Frederic Weisbecker 提交于
      This patchset does a bunch of cleanups and converts khelper to use system
      unbound workqueues.  The 3 first patches should be uncontroversial.  The
      last 2 patches are debatable.
      
      Kmod creates kernel threads that perform userspace jobs and we want those
      to have a large affinity in order not to contend busy CPUs.  This is
      (partly) why we use khelper which has a wide affinity that the kernel
      threads it create can inherit from.  Now khelper is a dedicated workqueue
      that has singlethread properties which we aren't interested in.
      
      Hence those two debatable changes:
      
      _ We would like to use generic workqueues. System unbound workqueues are
        a very good candidate but they are not wide affine, only node affine.
        Now probably a node is enough to perform many parallel kmod jobs.
      
      _ We would like to remove the wait_for_helper kernel thread (UMH_WAIT_PROC
        handler) to use the workqueue. It means that if the workqueue blocks,
        and no other worker can take pending kmod request, we can be screwed.
        Now if we have 512 threads, this should be enough.
      
      This patch (of 5):
      
      Underscores on function names aren't much verbose to explain the purpose
      of a function.  And kmod has interesting such flavours.
      
      Lets rename the following functions:
      
      * __call_usermodehelper -> call_usermodehelper_exec_work
      * ____call_usermodehelper -> call_usermodehelper_exec_async
      * wait_for_helper -> call_usermodehelper_exec_sync
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6b50a81
    • N
      kmod: correct documentation of return status of request_module · 60b61a6f
      NeilBrown 提交于
      If request_module() successfully runs modprobe, but modprobe exits with a
      non-zero status, then the return value from request_module() will be that
      (positive) error status.  So the return from request_module can be:
      
       negative errno
       zero for success
       positive exit code.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60b61a6f
    • J
      kernel/cred.c: remove unnecessary kdebug atomic reads · 52aa8536
      Joe Perches 提交于
      Commit e0e81739 ("CRED: Add some configurable debugging [try #6]")
      added the kdebug mechanism to this file back in 2009.
      
      The kdebug macro calls no_printk which always evaluates arguments.
      
      Most of the kdebug uses have an unnecessary call of
      	atomic_read(&cred->usage)
      
      Make the kdebug macro do nothing by defining it with
      	do { if (0) no_printk(...); } while (0)
      when not enabled.
      
      $ size kernel/cred.o* (defconfig x86-64)
         text	   data	    bss	    dec	    hex	filename
         2748	    336	      8	   3092	    c14	kernel/cred.o.new
         2788	    336	      8	   3132	    c3c	kernel/cred.o.old
      
      Miscellanea:
      o Neaten the #define kdebug macros while there
      Signed-off-by: NJoe Perches <joe@perches.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52aa8536
    • W
      kernel/extable.c: remove duplicated include · 2307e1a3
      Wei Yongjun 提交于
      Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2307e1a3
  6. 10 9月, 2015 2 次提交
  7. 09 9月, 2015 2 次提交
    • V
      mm: rename alloc_pages_exact_node() to __alloc_pages_node() · 96db800f
      Vlastimil Babka 提交于
      alloc_pages_exact_node() was introduced in commit 6484eb3e ("page
      allocator: do not check NUMA node ID when the caller knows the node is
      valid") as an optimized variant of alloc_pages_node(), that doesn't
      fallback to current node for nid == NUMA_NO_NODE.  Unfortunately the
      name of the function can easily suggest that the allocation is
      restricted to the given node and fails otherwise.  In truth, the node is
      only preferred, unless __GFP_THISNODE is passed among the gfp flags.
      
      The misleading name has lead to mistakes in the past, see for example
      commits 5265047a ("mm, thp: really limit transparent hugepage
      allocation to local node") and b360edb4 ("mm, mempolicy:
      migrate_to_node should only migrate to node").
      
      Another issue with the name is that there's a family of
      alloc_pages_exact*() functions where 'exact' means exact size (instead
      of page order), which leads to more confusion.
      
      To prevent further mistakes, this patch effectively renames
      alloc_pages_exact_node() to __alloc_pages_node() to better convey that
      it's an optimized variant of alloc_pages_node() not intended for general
      usage.  Both functions get described in comments.
      
      It has been also considered to really provide a convenience function for
      allocations restricted to a node, but the major opinion seems to be that
      __GFP_THISNODE already provides that functionality and we shouldn't
      duplicate the API needlessly.  The number of users would be small
      anyway.
      
      Existing callers of alloc_pages_exact_node() are simply converted to
      call __alloc_pages_node(), with the exception of sba_alloc_coherent()
      which open-codes the check for NUMA_NO_NODE, so it is converted to use
      alloc_pages_node() instead.  This means it no longer performs some
      VM_BUG_ON checks, and since the current check for nid in
      alloc_pages_node() uses a 'nid < 0' comparison (which includes
      NUMA_NO_NODE), it may hide wrong values which would be previously
      exposed.
      
      Both differences will be rectified by the next patch.
      
      To sum up, this patch makes no functional changes, except temporarily
      hiding potentially buggy callers.  Restricting the checks in
      alloc_pages_node() is left for the next patch which can in turn expose
      more existing buggy callers.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NRobin Holt <robinmholt@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cliff Whickman <cpw@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96db800f
    • K
      cgroup: fix seq_show_option merge with legacy_name · 61e57c0c
      Kees Cook 提交于
      When seq_show_option (commit a068acf2: "fs: create and use
      seq_show_option for escaping") was merged, it did not correctly collide
      with cgroup's addition of legacy_name (commit 3e1d2eed: "cgroup:
      introduce cgroup_subsys->legacy_name") changes.
      
      This fixes the reported name.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61e57c0c
  8. 06 9月, 2015 1 次提交
  9. 05 9月, 2015 14 次提交
    • A
      userfaultfd: activate syscall · 1380fca0
      Andrea Arcangeli 提交于
      This activates the userfaultfd syscall.
      
      [sfr@canb.auug.org.au: activate syscall fix]
      [akpm@linux-foundation.org: don't enable userfaultfd on powerpc]
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1380fca0
    • A
      userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP · 16ba6f81
      Andrea Arcangeli 提交于
      These two flags gets set in vma->vm_flags to tell the VM common code
      if the userfaultfd is armed and in which mode (only tracking missing
      faults, only tracking wrprotect faults or both). If neither flags is
      set it means the userfaultfd is not armed on the vma.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16ba6f81
    • A
      userfaultfd: add vm_userfaultfd_ctx to the vm_area_struct · 745f234b
      Andrea Arcangeli 提交于
      This adds the vm_userfaultfd_ctx to the vm_area_struct.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      745f234b
    • A
      userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key · 51360155
      Andrea Arcangeli 提交于
      userfaultfd needs to wake all waitqueues (pass 0 as nr parameter), instead
      of the current hardcoded 1 (that would wake just the first waitqueue in
      the head list).
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      51360155
    • U
      watchdog: rename watchdog_suspend() and watchdog_resume() · ec6a9066
      Ulrich Obergfell 提交于
      Rename watchdog_suspend() to lockup_detector_suspend() and
      watchdog_resume() to lockup_detector_resume() to avoid confusion with the
      watchdog subsystem and to be consistent with the existing name
      lockup_detector_init().
      
      Also provide comment blocks to explain the watchdog_running and
      watchdog_suspended variables and their relationship.
      Signed-off-by: NUlrich Obergfell <uobergfe@redhat.com>
      Reviewed-by: NAaron Tomlin <atomlin@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec6a9066
    • U
      watchdog: use suspend/resume interface in fixup_ht_bug() · 999bbe49
      Ulrich Obergfell 提交于
      Remove watchdog_nmi_disable_all() and watchdog_nmi_enable_all() since
      these functions are no longer needed.  If a subsystem has a need to
      deactivate the watchdog temporarily, it should utilize the
      watchdog_suspend() and watchdog_resume() functions.
      
      [akpm@linux-foundation.org: fix build with CONFIG_LOCKUP_DETECTOR=m]
      Signed-off-by: NUlrich Obergfell <uobergfe@redhat.com>
      Reviewed-by: NAaron Tomlin <atomlin@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      999bbe49
    • U
      watchdog: use park/unpark functions in update_watchdog_all_cpus() · d4bdd0b2
      Ulrich Obergfell 提交于
      Remove update_watchdog() and restart_watchdog_hrtimer() since these
      functions are no longer needed.  Changes of parameters such as the sample
      period are honored at the time when the watchdog threads are being
      unparked.
      Signed-off-by: NUlrich Obergfell <uobergfe@redhat.com>
      Reviewed-by: NAaron Tomlin <atomlin@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d4bdd0b2
    • U
      watchdog: introduce watchdog_suspend() and watchdog_resume() · 8c073d27
      Ulrich Obergfell 提交于
      This interface can be utilized to deactivate the hard and soft lockup
      detector temporarily.  Callers are expected to minimize the duration of
      deactivation.  Multiple deactivations are allowed to occur in parallel but
      should be rare in practice.
      
      [akpm@linux-foundation.org: remove unneeded static initialization]
      Signed-off-by: NUlrich Obergfell <uobergfe@redhat.com>
      Reviewed-by: NAaron Tomlin <atomlin@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8c073d27
    • U
      watchdog: introduce watchdog_park_threads() and watchdog_unpark_threads() · 81a4beef
      Ulrich Obergfell 提交于
      Originally watchdog_nmi_enable(cpu) and watchdog_nmi_disable(cpu) were
      only called in watchdog thread context.  However, the following commits
      utilize these functions outside of watchdog thread context too.
      
        commit 9809b18f
        Author: Michal Hocko <mhocko@suse.cz>
        Date:   Tue Sep 24 15:27:30 2013 -0700
      
            watchdog: update watchdog_thresh properly
      
        commit b3738d29
        Author: Stephane Eranian <eranian@google.com>
        Date:   Mon Nov 17 20:07:03 2014 +0100
      
            watchdog: Add watchdog enable/disable all functions
      
      Hence, it is now possible that these functions execute concurrently with
      the same 'cpu' argument.  This concurrency is problematic because per-cpu
      'watchdog_ev' can be accessed/modified without adequate synchronization.
      
      The patch series aims to address the above problem.  However, instead of
      introducing locks to protect per-cpu 'watchdog_ev' a different approach is
      taken: Invoke these functions by parking and unparking the watchdog
      threads (to ensure they are always called in watchdog thread context).
      
        static struct smp_hotplug_thread watchdog_threads = {
                 ...
                .park   = watchdog_disable, // calls watchdog_nmi_disable()
                .unpark = watchdog_enable,  // calls watchdog_nmi_enable()
        };
      
      Both previously mentioned commits call these functions in a similar way
      and thus in principle contain some duplicate code.  The patch series also
      avoids this duplication by providing a commonly usable mechanism.
      
      - Patch 1/4 introduces the watchdog_{park|unpark}_threads functions that
        park/unpark all watchdog threads specified in 'watchdog_cpumask'. They
        are intended to be called inside of kernel/watchdog.c only.
      
      - Patch 2/4 introduces the watchdog_{suspend|resume} functions which can
        be utilized by external callers to deactivate the hard and soft lockup
        detector temporarily.
      
      - Patch 3/4 utilizes watchdog_{park|unpark}_threads to replace some code
        that was introduced by commit 9809b18f.
      
      - Patch 4/4 utilizes watchdog_{suspend|resume} to replace some code that
        was introduced by commit b3738d29.
      
      A few corner cases should be mentioned here for completeness.
      
      - kthread_park() of watchdog/N could hang if cpu N is already locked up.
        However, if watchdog is enabled the lockup will be detected anyway.
      
      - kthread_unpark() of watchdog/N could hang if cpu N got locked up after
        kthread_park(). The occurrence of this scenario should be _very_ rare
        in practice, in particular because it is not expected that temporary
        deactivation will happen frequently, and if it happens at all it is
        expected that the duration of deactivation will be short.
      
      This patch (of 4): introduce watchdog_park_threads() and watchdog_unpark_threads()
      
      These functions are intended to be used only from inside kernel/watchdog.c
      to park/unpark all watchdog threads that are specified in
      watchdog_cpumask.
      Signed-off-by: NUlrich Obergfell <uobergfe@redhat.com>
      Reviewed-by: NAaron Tomlin <atomlin@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81a4beef
    • G
      kernel/watchdog: move NMI function header declarations from watchdog.h to nmi.h · aacfbe6a
      Guenter Roeck 提交于
      The kernel's NMI watchdog has nothing to do with the watchdog subsystem.
      Its header declarations should be in linux/nmi.h, not linux/watchdog.h.
      
      The code provided two sets of dummy functions if HARDLOCKUP_DETECTOR is
      not configured, one in the include file and one in kernel/watchdog.c.
      Remove the dummy functions from kernel/watchdog.c and use those from the
      include file.
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Don Zickus <dzickus@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aacfbe6a
    • F
      watchdog: simplify housekeeping affinity with the appropriate mask · 314b08ff
      Frederic Weisbecker 提交于
      housekeeping_mask gathers all the CPUs that aren't part of the nohz_full
      set.  This is exactly what we want the watchdog to be affine to without
      the need to use complicated cpumask operations.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: NChris Metcalf <cmetcalf@ezchip.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      314b08ff
    • F
      smpboot: allow passing the cpumask on per-cpu thread registration · 230ec939
      Frederic Weisbecker 提交于
      It makes the registration cheaper and simpler for the smpboot per-cpu
      kthread users that don't need to always update the cpumask after threads
      creation.
      
      [sfr@canb.auug.org.au: fix for allow passing the cpumask on per-cpu thread registration]
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: NChris Metcalf <cmetcalf@ezchip.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      230ec939
    • F
      smpboot: make cleanup to mirror setup · 3dd08c0c
      Frederic Weisbecker 提交于
      The per-cpu kthread cleanup() callback is the mirror of the setup()
      callback.  When the per-cpu kthread is started, it first calls setup()
      to initialize the resources which are then released by cleanup() when
      the kthread exits.
      
      Now since the introduction of a per-cpu kthread cpumask, the kthreads
      excluded by the cpumask on boot may happen to be parked immediately
      after their creation without taking the setup() stage, waiting to be
      asked to unpark to do so.  Then when smpboot_unregister_percpu_thread()
      is later called, the kthread is stopped without having ever called
      setup().
      
      But this triggers a bug as the kthread unconditionally calls cleanup()
      on exit but this doesn't mirror any setup().  Thus the kernel crashes
      because we try to free resources that haven't been initialized, as in
      the watchdog case:
      
          WATCHDOG disable 0
          WATCHDOG disable 1
          WATCHDOG disable 2
          BUG: unable to handle kernel NULL pointer dereference at           (null)
          IP: hrtimer_active+0x26/0x60
          [...]
          Call Trace:
            hrtimer_try_to_cancel+0x1c/0x280
            hrtimer_cancel+0x1d/0x30
            watchdog_disable+0x56/0x70
            watchdog_cleanup+0xe/0x10
            smpboot_thread_fn+0x23c/0x2c0
            kthread+0xf8/0x110
            ret_from_fork+0x3f/0x70
      
      This bug is currently masked with explicit kthread unparking before
      kthread_stop() on smpboot_destroy_threads(). This forces a call to
      setup() and then unpark().
      
      We could fix this by unconditionally calling setup() on kthread entry.
      But setup() isn't always cheap.  In the case of watchdog it launches
      hrtimer, perf events, etc...  So we may as well like to skip it if there
      are chances the kthread will never be used, as in a reduced cpumask value.
      
      So let's simply do a state machine check before calling cleanup() that
      makes sure setup() has been called before mirroring it.
      
      And remove the nasty hack workaround.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: NChris Metcalf <cmetcalf@ezchip.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3dd08c0c
    • F
      smpboot: fix memory leak on error handling · 5869b506
      Frederic Weisbecker 提交于
      The cpumask is allocated before threads get created. If the latter step
      fails, we need to free the cpumask.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: NChris Metcalf <cmetcalf@ezchip.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5869b506