1. 08 Oct 2015, 5 commits
2. 21 Sep 2015, 9 commits
• rcu: Make ->cpu_no_qs be a union for aggregate OR · 5b74c458
  Authored by Paul E. McKenney
      This commit converts the rcu_data structure's ->cpu_no_qs field
      to a union.  The bytewise side of this union allows individual access
      to indications as to whether this CPU needs to find a quiescent state
      for a normal (.norm) and/or expedited (.exp) grace period.  The setwise
      side of the union allows testing whether or not a quiescent state is
      needed at all, for either type of grace period.
      
      For now, only .norm is used.  A later commit will introduce the expedited
      usage.
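
Based on the description, the union presumably looks something like this
(a sketch; the exact field types are assumptions):

    union rcu_noqs {
            struct {
                    u8 norm;   /* QS still needed for a normal GP?      */
                    u8 exp;    /* QS still needed for an expedited GP?  */
            } b;               /* Bytewise access to each indication.   */
            u16 s;             /* Setwise access: any QS needed at all? */
    };

    /* e.g., "if (rdp->cpu_no_qs.s)" tests both grace-period types at once. */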
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
• rcu: Invert passed_quiesce and rename to cpu_no_qs · 0d43eb34
  Authored by Paul E. McKenney
      This commit inverts the sense of the rcu_data structure's ->passed_quiesce
      field and renames it to ->cpu_no_qs.  This will allow a later commit to
      use an "aggregate OR" operation to test expedited as well as normal grace
      periods without added overhead.
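
For callers, the inversion presumably reads as follows (illustrative
only; rcu_report_qs_rdp() stands in for whatever reporting path runs):

    /* Old sense: ->passed_quiesce was set once a QS had been passed.  */
    /* New sense: ->cpu_no_qs.b.norm is cleared once a QS has passed,  */
    /* so the equivalent condition becomes:                            */
    if (!rdp->cpu_no_qs.b.norm)
            rcu_report_qs_rdp(rdp);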
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
• rcu: Rename qs_pending to core_needs_qs · 97c668b8
  Authored by Paul E. McKenney
      An upcoming commit needs to invert the sense of the ->passed_quiesce
      rcu_data structure field, so this commit is taking this opportunity
      to clarify things a bit by renaming ->qs_pending to ->core_needs_qs.
      
      So if !rdp->core_needs_qs, then this CPU need not concern itself with
      quiescent states, in particular, it need not acquire its leaf rcu_node
      structure's ->lock to check.  Otherwise, it needs to report the next
      quiescent state.
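
A sketch of the fast path this enables (hedged; the real check sits in
RCU's per-CPU core processing):

    if (!READ_ONCE(rdp->core_needs_qs))
            return;         /* No QS owed: skip taking the leaf
                             * rcu_node structure's ->lock entirely. */
    /* Otherwise, fall through and report the next quiescent state. */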
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
• rcu: Move synchronize_sched_expedited() to combining tree · bce5fa12
  Authored by Paul E. McKenney
      Currently, synchronize_sched_expedited() uses a single global counter
      to track the number of remaining context switches that the current
      expedited grace period must wait on.  This is problematic on large
      systems, where the resulting memory contention can be pathological.
      This commit therefore makes synchronize_sched_expedited() instead use
      the combining tree in the same manner as synchronize_rcu_expedited(),
      keeping memory contention down to a dull roar.
      
      This commit creates a temporary function sync_sched_exp_select_cpus()
      that is very similar to sync_rcu_exp_select_cpus().  A later commit
      will consolidate these two functions, which becomes possible when
      synchronize_sched_expedited() switches from stop_one_cpu_nowait() to
      smp_call_function_single().
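
A hedged sketch of the combining-tree idea: each rcu_node tracks its
not-yet-quiescent CPUs in a per-node mask, and quiescence propagates
toward the root only once an entire subtree has reported, so CPUs
contend on their leaf node rather than on one global counter.  The
function and field names below follow rcu_node conventions but are
illustrative:

    static void sync_exp_report_qs(struct rcu_node *rnp, unsigned long mask)
    {
            for (;;) {
                    raw_spin_lock(&rnp->lock);
                    rnp->expmask &= ~mask;  /* This CPU/subtree is done. */
                    if (rnp->expmask) {
                            raw_spin_unlock(&rnp->lock);
                            return;         /* Subtree not yet quiet.    */
                    }
                    if (!rnp->parent) {
                            raw_spin_unlock(&rnp->lock);
                            return;         /* Root quiet: GP can end.   */
                    }
                    mask = rnp->grpmask;    /* Our bit in the parent.    */
                    raw_spin_unlock(&rnp->lock);
                    rnp = rnp->parent;
            }
    }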
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
• rcu: Use single-stage IPI algorithm for RCU expedited grace period · 8203d6d0
  Authored by Paul E. McKenney
      The current preemptible-RCU expedited grace-period algorithm invokes
      synchronize_sched_expedited() to enqueue all tasks currently running
      in a preemptible-RCU read-side critical section, then waits for all the
      ->blkd_tasks lists to drain.  This works, but results in both an IPI and
      a double context switch even on CPUs that do not happen to be running
      in a preemptible RCU read-side critical section.
      
      This commit implements a new algorithm that causes less OS jitter.
      This new algorithm IPIs all online CPUs that are not idle (from an
      RCU perspective), but refrains from self-IPIs.  If a CPU receiving
      this IPI is not in a preemptible RCU read-side critical section (or
      is just now exiting one), it pushes quiescence up the rcu_node tree,
      otherwise, it sets a flag that will be handled by the upcoming outermost
      rcu_read_unlock(), which will then push quiescence up the tree.
      
      The expedited grace period must of course wait on any pre-existing blocked
      readers, and newly blocked readers must be queued carefully based on
      the state of both the normal and the expedited grace periods.  This
      new queueing approach also avoids the need to update boost state,
      courtesy of the fact that blocked tasks are no longer ever migrated to
      the root rcu_node structure.
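
A hedged sketch of the IPI handler's logic (names modeled on RCU's
conventions; the actual implementation differs in detail):

    static void sync_rcu_exp_handler(void *info)
    {
            struct rcu_data *rdp = info;
            struct task_struct *t = current;

            if (!t->rcu_read_lock_nesting) {
                    /* Not in a reader (or just exiting one):
                     * push quiescence up the rcu_node tree now. */
                    rcu_report_exp_rdp(rdp);
            } else {
                    /* In a reader: flag it so the outermost
                     * rcu_read_unlock() reports the QS instead. */
                    t->rcu_read_unlock_special.b.exp_need_qs = true;
            }
    }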
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
• rcu: Consolidate tree setup for synchronize_rcu_expedited() · b9585e94
  Authored by Paul E. McKenney
This commit replaces sync_rcu_preempt_exp_init1() and
      sync_rcu_preempt_exp_init2() with sync_exp_reset_tree_hotplug()
      and sync_exp_reset_tree(), which will also be used by
      synchronize_sched_expedited(), and sync_rcu_exp_select_nodes(), which
      contains code specific to synchronize_rcu_expedited().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
• rcu: Move rcu_report_exp_rnp() to allow consolidation · 7922cd0e
  Authored by Paul E. McKenney
      This is a nearly pure code-movement commit, moving rcu_report_exp_rnp(),
      sync_rcu_preempt_exp_done(), and rcu_preempted_readers_exp() so
      that later commits can make synchronize_sched_expedited() use them.
      The non-code-movement portion of this commit tags rcu_report_exp_rnp()
      as __maybe_unused to avoid build errors when CONFIG_PREEMPT=n.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
• rcu: Use rsp->expedited_wq instead of sync_rcu_preempt_exp_wq · f4ecea30
  Authored by Paul E. McKenney
      Now that there is an ->expedited_wq waitqueue in each rcu_state structure,
      there is no need for the sync_rcu_preempt_exp_wq global variable.  This
      commit therefore substitutes ->expedited_wq for sync_rcu_preempt_exp_wq.
      It also initializes ->expedited_wq only once at boot instead of at the
      start of each expedited grace period.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
• rcu: Suppress lockdep false positive for rcp->exp_funnel_mutex · 19a5ecde
  Authored by Paul E. McKenney
      In kernels built with CONFIG_PREEMPT=y, synchronize_rcu_expedited()
      invokes synchronize_sched_expedited() while holding RCU-preempt's
      root rcu_node structure's ->exp_funnel_mutex, which is acquired after
      the rcu_data structure's ->exp_funnel_mutex.  The first thing that
      synchronize_sched_expedited() will do is acquire RCU-sched's rcu_data
      structure's ->exp_funnel_mutex.   There is no danger of an actual deadlock
      because the locking order is always from RCU-preempt's expedited mutexes
      to those of RCU-sched.  Unfortunately, lockdep considers both rcu_data
      structures' ->exp_funnel_mutex to be in the same lock class and therefore
      reports a deadlock cycle.
      
      This commit silences this false positive by placing RCU-sched's rcu_data
      structures' ->exp_funnel_mutex locks into their own lock class.
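
The usual way to do this is a dedicated lock_class_key; a sketch of the
approach (not the verbatim patch):

    static struct lock_class_key rcu_exp_sched_class;

    /* In the per-CPU init path for RCU-sched's rcu_data: */
    mutex_init(&rdp->exp_funnel_mutex);
    lockdep_set_class(&rdp->exp_funnel_mutex, &rcu_exp_sched_class);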
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
3. 18 Sep 2015, 1 commit
4. 16 Sep 2015, 5 commits
5. 14 Sep 2015, 1 commit
6. 13 Sep 2015, 1 commit
• time: Fix timekeeping_freqadjust()'s incorrect use of abs() instead of abs64() · 2619d7e9
  Authored by John Stultz
The internal clock steering done for fine-grained error
correction uses a logarithmic approximation, so whenever
adjtimex() adjusts the clock steering, timekeeping_freqadjust()
quickly approximates the correct clock frequency over a series
of ticks.
      
      Unfortunately, the logic in timekeeping_freqadjust(), introduced
      in commit:
      
        dc491596 ("timekeeping: Rework frequency adjustments to work better w/ nohz")
      
      used the abs() function with a s64 error value to calculate the
      size of the approximated adjustment to be made.
      
      Per include/linux/kernel.h:
      
        "abs() should not be used for 64-bit types (s64, u64, long long) - use abs64()".
      
Thus on 32-bit platforms, this resulted in the clock steering
taking a quite dampened random walk while trying to converge on
the proper frequency, which caused the adjustments to be made
much more slowly than intended (most easily observed when large
adjustments are made).
      
      This patch fixes the issue by using abs64() instead.
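
A hedged illustration of the 32-bit failure mode (the value is an
example):

    s64 error = -0x100000002LL;  /* large negative steering error */

    /* On 32-bit, abs() operates on 32-bit values: the low word of
     * error is 0xFFFFFFFE (-2 as an int), so abs(error) yields 2
     * instead of 0x100000002 -- the adjustment size is computed
     * from garbage.  abs64() keeps the full 64-bit magnitude:    */
    s64 adj = abs64(error);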
Reported-by: Nuno Gonçalves <nunojpg@gmail.com>
Tested-by: Nuno Goncalves <nunojpg@gmail.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
      Cc: <stable@vger.kernel.org> # v3.17+
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441840051-20244-1-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
7. 12 Sep 2015, 1 commit
• sys_membarrier(): system-wide memory barrier (generic, x86) · 5b25b13a
  Authored by Mathieu Desnoyers
      Here is an implementation of a new system call, sys_membarrier(), which
      executes a memory barrier on all threads running on the system.  It is
      implemented by calling synchronize_sched().  It can be used to
      distribute the cost of user-space memory barriers asymmetrically by
      transforming pairs of memory barriers into pairs consisting of
      sys_membarrier() and a compiler barrier.  For synchronization primitives
      that distinguish between read-side and write-side (e.g.  userspace RCU
      [1], rwlocks), the read-side can be accelerated significantly by moving
      the bulk of the memory barrier overhead to the write-side.
      
      The existing applications of which I am aware that would be improved by
      this system call are as follows:
      
      * Through Userspace RCU library (http://urcu.so)
        - DNS server (Knot DNS) https://www.knot-dns.cz/
        - Network sniffer (http://netsniff-ng.org/)
        - Distributed object storage (https://sheepdog.github.io/sheepdog/)
        - User-space tracing (http://lttng.org)
        - Network storage system (https://www.gluster.org/)
        - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
        - Financial software (https://lkml.org/lkml/2015/3/23/189)
      
      Those projects use RCU in userspace to increase read-side speed and
      scalability compared to locking.  Especially in the case of RCU used by
      libraries, sys_membarrier can speed up the read-side by moving the bulk of
      the memory barrier cost to synchronize_rcu().
      
      * Direct users of sys_membarrier
        - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
      
      Microsoft core dotnet GC developers are planning to use the mprotect()
      side-effect of issuing memory barriers through IPIs as a way to implement
      Windows FlushProcessWriteBuffers() on Linux.  They are referring to
      sys_membarrier in their github thread, specifically stating that
      sys_membarrier() is what they are looking for.
      
      To explain the benefit of this scheme, let's introduce two example threads:
      
      Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
      Thread B (frequent, e.g. executing liburcu
      rcu_read_lock()/rcu_read_unlock())
      
      In a scheme where all smp_mb() in thread A are ordering memory accesses
      with respect to smp_mb() present in Thread B, we can change each
      smp_mb() within Thread A into calls to sys_membarrier() and each
      smp_mb() within Thread B into compiler barriers "barrier()".
      
      Before the change, we had, for each smp_mb() pairs:
      
      Thread A                    Thread B
      previous mem accesses       previous mem accesses
      smp_mb()                    smp_mb()
      following mem accesses      following mem accesses
      
      After the change, these pairs become:
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      As we can see, there are two possible scenarios: either Thread B memory
      accesses do not happen concurrently with Thread A accesses (1), or they
      do (2).
      
      1) Non-concurrent Thread A vs Thread B accesses:
      
      Thread A                    Thread B
      prev mem accesses
      sys_membarrier()
      follow mem accesses
                                  prev mem accesses
                                  barrier()
                                  follow mem accesses
      
      In this case, thread B accesses will be weakly ordered. This is OK,
      because at that point, thread A is not particularly interested in
      ordering them with respect to its own accesses.
      
      2) Concurrent Thread A vs Thread B accesses
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      In this case, thread B accesses, which are ensured to be in program
      order thanks to the compiler barrier, will be "upgraded" to full
      smp_mb() by synchronize_sched().
      
      * Benchmarks
      
      On Intel Xeon E5405 (8 cores)
      (one thread is calling sys_membarrier, the other 7 threads are busy
      looping)
      
1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
      
      * User-space user of this system call: Userspace RCU library
      
      Both the signal-based and the sys_membarrier userspace RCU schemes
      permit us to remove the memory barrier from the userspace RCU
      rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
      accelerating them. These memory barriers are replaced by compiler
      barriers on the read-side, and all matching memory barriers on the
      write-side are turned into an invocation of a memory barrier on all
      active threads in the process. By letting the kernel perform this
synchronization rather than dumbly sending a signal to every thread of
the process (as we currently do), we diminish the number of unnecessary
wake-ups and only issue the memory barriers on active threads.
Non-running threads do not need to execute such a barrier anyway,
because it is implied by the scheduler's context switches.
      
      Results in liburcu:
      
      Operations in 10s, 6 readers, 2 writers:
      
      memory barriers in reader:    1701557485 reads, 2202847 writes
      signal-based scheme:          9830061167 reads,    6700 writes
      sys_membarrier:               9952759104 reads,     425 writes
      sys_membarrier (dyn. check):  7970328887 reads,     425 writes
      
      The dynamic sys_membarrier availability check adds some overhead to
      the read-side compared to the signal-based scheme, but besides that,
      sys_membarrier slightly outperforms the signal-based scheme. However,
      this non-expedited sys_membarrier implementation has a much slower grace
      period than signal and memory barrier schemes.
      
      Besides diminishing the number of wake-ups, one major advantage of the
      membarrier system call over the signal-based scheme is that it does not
      need to reserve a signal. This plays much more nicely with libraries,
      and with processes injected into for tracing purposes, for which we
      cannot expect that signals will be unused by the application.
      
      An expedited version of this system call can be added later on to speed
      up the grace period. Its implementation will likely depend on reading
      the cpu_curr()->mm without holding each CPU's rq lock.
      
      This patch adds the system call to x86 and to asm-generic.
      
      [1] http://urcu.so
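
A hedged userspace usage sketch (assuming a kernel with this syscall
wired up and the uapi header available):

    #include <linux/membarrier.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int membarrier(int cmd, int flags)
    {
            return syscall(__NR_membarrier, cmd, flags);
    }

    int main(void)
    {
            int mask = membarrier(MEMBARRIER_CMD_QUERY, 0);

            if (mask < 0 || !(mask & MEMBARRIER_CMD_SHARED)) {
                    fprintf(stderr, "membarrier not supported\n");
                    return 1;
            }
            /* Full memory barrier on all running threads system-wide. */
            membarrier(MEMBARRIER_CMD_SHARED, 0);
            return 0;
    }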
      
      membarrier(2) man page:
      
      MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
      
      NAME
             membarrier - issue memory barriers on a set of threads
      
      SYNOPSIS
             #include <linux/membarrier.h>
      
             int membarrier(int cmd, int flags);
      
      DESCRIPTION
             The cmd argument is one of the following:
      
             MEMBARRIER_CMD_QUERY
                    Query  the  set  of  supported commands. It returns a bitmask of
                    supported commands.
      
             MEMBARRIER_CMD_SHARED
                    Execute a memory barrier on all threads running on  the  system.
                    Upon  return from system call, the caller thread is ensured that
                    all running threads have passed through a state where all memory
                    accesses  to  user-space  addresses  match program order between
                    entry to and return from the system  call  (non-running  threads
are de facto in such a state). This covers threads from all processes
running on the system.  This command returns 0.
      
The flags argument must be 0; it is reserved for future extensions.
      
All memory accesses performed in program order from each targeted
thread are guaranteed to be ordered with respect to sys_membarrier().  If
             we use the semantic "barrier()" to represent a compiler barrier forcing
             memory  accesses  to  be performed in program order across the barrier,
             and smp_mb() to represent explicit memory barriers forcing full  memory
             ordering  across  the barrier, we have the following ordering table for
             each pair of barrier(), sys_membarrier() and smp_mb():
      
             The pair ordering is detailed as (O: ordered, X: not ordered):
      
                                    barrier()   smp_mb() sys_membarrier()
                    barrier()          X           X            O
                    smp_mb()           X           O            O
                    sys_membarrier()   O           O            O
      
      RETURN VALUE
             On success, these system calls return zero.  On error, -1 is  returned,
             and errno is set appropriately. For a given command, with flags
             argument set to 0, this system call is guaranteed to always return the
             same value until reboot.
      
      ERRORS
             ENOSYS System call is not implemented.
      
             EINVAL Invalid arguments.
      
      Linux                             2015-04-15                     MEMBARRIER(2)
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Nicholas Miell <nmiell@comcast.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8. 11 Sep 2015, 16 commits
• sched: 'Annotate' migrate_tasks() · 5473e0cc
  Authored by Wanpeng Li
      Kernel testing triggered this warning:
      
      | WARNING: CPU: 0 PID: 13 at kernel/sched/core.c:1156 do_set_cpus_allowed+0x7e/0x80()
      | Modules linked in:
      | CPU: 0 PID: 13 Comm: migration/0 Not tainted 4.2.0-rc1-00049-g25834c73 #2
      | Call Trace:
      |   dump_stack+0x4b/0x75
      |   warn_slowpath_common+0x8b/0xc0
      |   warn_slowpath_null+0x22/0x30
      |   do_set_cpus_allowed+0x7e/0x80
      |   cpuset_cpus_allowed_fallback+0x7c/0x170
      |   select_fallback_rq+0x221/0x280
      |   migration_call+0xe3/0x250
      |   notifier_call_chain+0x53/0x70
      |   __raw_notifier_call_chain+0x1e/0x30
      |   cpu_notify+0x28/0x50
      |   take_cpu_down+0x22/0x40
      |   multi_cpu_stop+0xd5/0x140
      |   cpu_stopper_thread+0xbc/0x170
      |   smpboot_thread_fn+0x174/0x2f0
      |   kthread+0xc4/0xe0
      |   ret_from_kernel_thread+0x21/0x30
      
      As Peterz pointed out:
      
      | So the normal rules for changing task_struct::cpus_allowed are holding
      | both pi_lock and rq->lock, such that holding either stabilizes the mask.
      |
      | This is so that wakeup can happen without rq->lock and load-balance
      | without pi_lock.
      |
      | From this we already get the relaxation that we can omit acquiring
      | rq->lock if the task is not on the rq, because in that case
      | load-balancing will not apply to it.
      |
      | ** these are the rules currently tested in do_set_cpus_allowed() **
      |
      | Now, since __set_cpus_allowed_ptr() uses task_rq_lock() which
      | unconditionally acquires both locks, we could get away with holding just
      | rq->lock when on_rq for modification because that'd still exclude
      | __set_cpus_allowed_ptr(), it would also work against
      | __kthread_bind_mask() because that assumes !on_rq.
      |
      | That said, this is all somewhat fragile.
      |
      | Now, I don't think dropping rq->lock is quite as disastrous as it
      | usually is because !cpu_active at this point, which means load-balance
      | will not interfere, but that too is somewhat fragile.
      |
      | So we end up with a choice of two fragile..
      
      This patch fixes it by following the rules for changing
      task_struct::cpus_allowed with both pi_lock and rq->lock held.
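
A hedged sketch of the rule the fix follows (illustrative; new_mask is
whatever fallback mask was selected):

    /* Changing task_struct::cpus_allowed requires both locks, so that
     * holding either one is enough to observe a stable mask. */
    raw_spin_lock(&p->pi_lock);
    raw_spin_lock(&rq->lock);
    do_set_cpus_allowed(p, new_mask);       /* lockdep check now passes */
    raw_spin_unlock(&rq->lock);
    raw_spin_unlock(&p->pi_lock);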
Reported-by: kernel test robot <ying.huang@intel.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[ Modified changelog and patch. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/BLU436-SMTP1660820490DE202E3934ED3806E0@phx.gbl
Signed-off-by: Ingo Molnar <mingo@kernel.org>
• locking/qspinlock/x86: Fix performance regression under unaccelerated VMs · 43b3f028
  Authored by Peter Zijlstra
      Dave ran into horrible performance on a VM without PARAVIRT_SPINLOCKS
      set and Linus noted that the test-and-set implementation was retarded.
      
      One should spin on the variable with a load, not a RMW.
      
      While there, remove 'queued' from the name, as the lock isn't queued
      at all, but a simple test-and-set.
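
The shape of the fix, roughly (a sketch; _Q_LOCKED_VAL is qspinlock's
"locked" value):

    /* Spin with plain loads; only attempt the atomic RMW once the lock
     * looks free, so waiters stop ping-ponging the cacheline. */
    do {
            while (atomic_read(&lock->val) != 0)
                    cpu_relax();
    } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);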
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: stable@vger.kernel.org # v4.2+
Link: http://lkml.kernel.org/r/20150904152523.GR18673@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
• sysctl: fix int -> unsigned long assignments in INT_MIN case · 9a5bc726
  Authored by Ilya Dryomov
      The following
      
          if (val < 0)
              *lvalp = (unsigned long)-val;
      
      is incorrect because the compiler is free to assume -val to be positive
      and use a sign-extend instruction for extending the bit pattern.  This is
      a problem if val == INT_MIN:
      
          # echo -2147483648 >/proc/sys/dev/scsi/logging_level
          # cat /proc/sys/dev/scsi/logging_level
          -18446744071562067968
      
      Cast to unsigned long before negation - that way we first sign-extend and
      then negate an unsigned, which is well defined.  With this:
      
          # cat /proc/sys/dev/scsi/logging_level
          -2147483648
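
In essence, the fix casts before negating, so the negation happens on
an unsigned value and is well defined even for INT_MIN:

    if (val < 0)
            *lvalp = -(unsigned long)val;   /* sign-extend, then negate */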
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Cc: Mikulas Patocka <mikulas@twibright.com>
      Cc: Robert Xiao <nneonneo@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• kexec: export KERNEL_IMAGE_SIZE to vmcoreinfo · 1303a27c
  Authored by Baoquan He
On x86_64, since v2.6.26 KERNEL_IMAGE_SIZE has been 512M, and
accordingly MODULES_VADDR was changed to 0xffffffffa0000000.  However,
in v3.12 Kees Cook introduced kaslr to randomise the location of the
kernel, and the kernel text mapping address space was enlarged from
512M to 1G.  That means KERNEL_IMAGE_SIZE is now variable: its value is
512M when kaslr support is not compiled in and 1G when it is.
Accordingly, MODULES_VADDR is changed too, to be:
      
          #define MODULES_VADDR    (__START_KERNEL_map + KERNEL_IMAGE_SIZE)
      
So when kaslr is compiled in and enabled, the kernel text mapping
address space and modules vaddr space need to be adjusted.  Otherwise
makedumpfile will fail, since the addresses of some symbols are no
longer correct.

Hence KERNEL_IMAGE_SIZE needs to be exported to vmcoreinfo and read by
makedumpfile to help calculate MODULES_VADDR.
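
Presumably a one-liner in the x86 arch_crash_save_vmcoreinfo() path,
using the existing helper macro (a sketch):

    /* Lets makedumpfile derive
     * MODULES_VADDR = __START_KERNEL_map + KERNEL_IMAGE_SIZE. */
    VMCOREINFO_NUMBER(KERNEL_IMAGE_SIZE);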
Signed-off-by: Baoquan He <bhe@redhat.com>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• kexec: align crash_notes allocation to make it be inside one physical page · bbb78b8f
  Authored by Baoquan He
People reported that crash_notes in /proc/vmcore were corrupted, and
this caused kdump failures.  With code debugging and logs we found the
root cause: the percpu variable crash_notes is allocated in 2 vmalloc
pages.  Percpu allocation is based on vmalloc by default, and vmalloc
can't guarantee that 2 contiguous vmalloc pages are also on 2
contiguous physical pages.  So when the 1st kernel exports the starting
address and size of crash_notes through sysfs like below:

/sys/devices/system/cpu/cpux/crash_notes
/sys/devices/system/cpu/cpux/crash_notes_size

the kdump kernel uses them to get the content of crash_notes.  However,
the 2nd part may not be in the next neighbouring physical page as
expected if crash_notes is allocated across 2 vmalloc pages.  That's
why nhdr_ptr->n_namesz or nhdr_ptr->n_descsz could be very huge in
update_note_header_size_elf64() and cause note header merging failures
or warnings.
      
This patch changes the allocation to __alloc_percpu(), passing in an
align value obtained by rounding crash_notes_size up to the nearest
power of two.  This makes sure crash_notes is allocated inside one
physical page, since sizeof(note_buf_t) on all arches is smaller than
PAGE_SIZE.  Meanwhile, add a BUILD_BUG_ON() to break the build if the
size is bigger than PAGE_SIZE, since crash_notes would then definitely
span 2 pages.  That needs to be avoided, and reported if it's
unavoidable.
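
A sketch of the allocation change described above (modeled on the text;
the exact code may differ):

    /* crash_notes must fit within one physical page. */
    BUILD_BUG_ON(sizeof(note_buf_t) > PAGE_SIZE);

    size  = sizeof(note_buf_t);
    align = min_t(size_t, roundup_pow_of_two(size), PAGE_SIZE);
    crash_notes = __alloc_percpu(size, align);
    if (!crash_notes)
            pr_warn("Memory allocation for saving cpu register states failed\n");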
      
      [akpm@linux-foundation.org: use correct comment layout]
Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• kexec: remove unnecessary test in kimage_alloc_crash_control_pages() · 04e9949b
  Authored by Minfei Huang
Converting a PFN (Page Frame Number) to a struct page never fails, so
we can simplify the logic to do the image->control_page assignment
directly in the loop, and remove the unnecessary conditional.
Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
Acked-by: Dave Young <dyoung@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Simon Horman <horms@verge.net.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• kexec: split kexec_load syscall from kexec core code · 2965faa5
  Authored by Dave Young
There are two kexec load syscalls: kexec_load and kexec_file_load.
kexec_file_load has already been split out into kernel/kexec_file.c.
In this patch I split the kexec_load syscall code out into
kernel/kexec.c.

And add a new kconfig option KEXEC_CORE, so we can disable kexec_load
and use kexec_file_load only, or vice versa.

The original requirement is from Ted Ts'o: he wants the kexec kernel
signature to be checked with CONFIG_KEXEC_VERIFY_SIG enabled, but
kexec-tools can bypass that check via the kexec_load syscall.

Vivek Goyal proposed to create a common kconfig option so users can
compile in only one syscall for loading the kexec kernel.
KEXEC/KEXEC_FILE selects KEXEC_CORE so that old config files still work.

Because generic code needs CONFIG_KEXEC_CORE, I updated all the
architecture Kconfig files with the new KEXEC_CORE option and made
KEXEC select KEXEC_CORE in the arch Kconfig.  General kernel code
referring to the kexec_load syscall was updated accordingly.
      
      [akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Dave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• kexec: split kexec_file syscall code to kexec_file.c · a43cac0d
  Authored by Dave Young
      Split kexec_file syscall related code to another file kernel/kexec_file.c
      so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can be dropped.
      
Shared variables and functions are moved to kernel/kexec_internal.h,
per suggestions from Vivek and Petr.
      
      [akpm@linux-foundation.org: fix bisectability]
      [akpm@linux-foundation.org: declare the various arch_kexec functions]
      [akpm@linux-foundation.org: fix build]
Signed-off-by: Dave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• kmod: handle UMH_WAIT_PROC from system unbound workqueue · bb304a5c
  Authored by Frederic Weisbecker
      The UMH_WAIT_PROC handler runs in its own thread in order to make sure
      that waiting for the exec kernel thread completion won't block other
      usermodehelper queued jobs.
      
On older workqueue implementations, worklets couldn't sleep without
blocking the rest of the queue.  The workqueue subsystem now handles
that, but khelper still had the older limitation due to its
singlethread properties, so we replaced it with the system unbound
workqueues.

Those are affine to the current node and allow a number of instances
to block concurrently.
      
      They are a good candidate to handle UMH_WAIT_PROC assuming that we have
      enough system unbound workers to handle lots of parallel usermodehelper
      jobs.
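
The core of the change is presumably a one-liner at queue time (a
sketch):

    /* Queue UMH work on the node-affine unbound workqueue instead of
     * the old singlethread khelper workqueue. */
    queue_work(system_unbound_wq, &sub_info->work);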
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• kmod: use system_unbound_wq instead of khelper · 90f02303
  Authored by Frederic Weisbecker
      We need to launch the usermodehelper kernel threads with the widest
      affinity and this is partly why we use khelper.  This workqueue has
      unbound properties and thus a wide affinity inherited by all its children.
      
Now khelper also has special properties that we aren't interested in:
ordered and singlethread.  There is really no need for ordering, as all
we do is create kernel threads, and this can be done concurrently.
Singlethread is a useless limitation as well.

The workqueue engine already provides generic unbound workqueues that
don't share these useless properties and handle parallel jobs well.

The only worrisome specific is their affinity to the node of the
current CPU.  It's fine for creating the usermodehelper kernel threads,
but those inherit this affinity for longer jobs such as requesting
modules.

This patch proposes to use these node-affine unbound workqueues,
assuming that a node is sufficient to handle several parallel
usermodehelper requests.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• kmod: add up-to-date explanations on the purpose of each asynchronous level · b639e86b
  Authored by Frederic Weisbecker
There seems to be quite some confusion in the comments, likely due to
changes that came after they were written.

Since it's very non-obvious why we have 3 levels of asynchronous code
to implement usermodehelpers, it's important to document the reasons
for this layout in detail.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• kmod: remove unnecessary explicit wide CPU affinity setting · d097c024
  Authored by Frederic Weisbecker
      Khelper is affine to all CPUs.  Now since it creates the
      call_usermodehelper_exec_[a]sync() kernel threads, those inherit the wide
      affinity.
      
As such, explicitly forcing a wide affinity from those kernel threads
is a no-op.

Just remove it.  It's needless, and it breaks CPU isolation for users
who rely on workqueue affinity tuning.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• kmod: bunch of internal function renames · b6b50a81
  Authored by Frederic Weisbecker
This patchset does a bunch of cleanups and converts khelper to use
system unbound workqueues.  The first 3 patches should be
uncontroversial.  The last 2 patches are debatable.

Kmod creates kernel threads that perform userspace jobs, and we want
those to have a wide affinity in order not to contend with busy CPUs.
This is (partly) why we use khelper, which has a wide affinity that the
kernel threads it creates can inherit.  Now khelper is a dedicated
workqueue with singlethread properties which we aren't interested in.

Hence those two debatable changes:

_ We would like to use generic workqueues. System unbound workqueues are
  a very good candidate but they are not wide affine, only node affine.
  Now probably a node is enough to perform many parallel kmod jobs.

_ We would like to remove the wait_for_helper kernel thread (UMH_WAIT_PROC
  handler) and use the workqueue instead. It means that if the workqueue
  blocks, and no other worker can take pending kmod requests, we can be
  screwed. Now if we have 512 threads, this should be enough.
      
      This patch (of 5):
      
Underscores in function names don't do much to explain the purpose of
a function, and kmod has some interesting flavours of those.

Let's rename the following functions:

* __call_usermodehelper -> call_usermodehelper_exec_work
* ____call_usermodehelper -> call_usermodehelper_exec_async
* wait_for_helper -> call_usermodehelper_exec_sync
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• kmod: correct documentation of return status of request_module · 60b61a6f
  Authored by NeilBrown
      If request_module() successfully runs modprobe, but modprobe exits with a
      non-zero status, then the return value from request_module() will be that
      (positive) error status.  So the return from request_module can be:
      
       negative errno
       zero for success
       positive exit code.
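
A hedged caller-side sketch covering all three cases ("foo" is a
placeholder module name):

    int ret = request_module("foo");

    if (ret < 0)
            return ret;        /* kernel-side failure: negative errno */
    if (ret > 0)
            return -ENODEV;    /* modprobe ran but exited non-zero    */
    /* ret == 0: modprobe succeeded; verify the module actually did
     * what was expected before relying on it. */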
Signed-off-by: NeilBrown <neilb@suse.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• kernel/cred.c: remove unnecessary kdebug atomic reads · 52aa8536
  Authored by Joe Perches
      Commit e0e81739 ("CRED: Add some configurable debugging [try #6]")
      added the kdebug mechanism to this file back in 2009.
      
      The kdebug macro calls no_printk which always evaluates arguments.
      
      Most of the kdebug uses have an unnecessary call of
      	atomic_read(&cred->usage)
      
      Make the kdebug macro do nothing by defining it with
      	do { if (0) no_printk(...); } while (0)
      when not enabled.
      
      $ size kernel/cred.o* (defconfig x86-64)
         text	   data	    bss	    dec	    hex	filename
         2748	    336	      8	   3092	    c14	kernel/cred.o.new
         2788	    336	      8	   3132	    c3c	kernel/cred.o.old
      
      Miscellanea:
      o Neaten the #define kdebug macros while there
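
A sketch of the definition described above (per the changelog; the
"if (0)" keeps printf-format checking while letting the compiler
discard the arguments, including the atomic_read() calls):

    #ifdef __KDEBUG
    #define kdebug(FMT, ...)                                   \
            printk("[%-5.5s%5u] " FMT "\n",                    \
                   current->comm, current->pid, ##__VA_ARGS__)
    #else
    #define kdebug(FMT, ...)                                   \
    do {                                                       \
            if (0)                                             \
                    no_printk("[%-5.5s%5u] " FMT "\n",         \
                              current->comm, current->pid,     \
                              ##__VA_ARGS__);                  \
    } while (0)
    #endif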
Signed-off-by: Joe Perches <joe@perches.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• kernel/extable.c: remove duplicated include · 2307e1a3
  Authored by Wei Yongjun
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9. 10 Sep 2015, 1 commit