1. 27 Oct 2015, 1 commit
    • memremap: fix highmem support · 182475b7
      Dan Williams authored
      Currently memremap checks if the range is "System RAM" and returns the
      kernel linear address.  This is broken for highmem platforms where a
      range may be "System RAM", but is not part of the kernel linear mapping.
      Fallback to ioremap_cache() in these cases, to let the arch code attempt
      to handle it.
      
      Note that ARM ioremap will WARN when attempting to remap ram, and in
      that case the caller needs to be fixed.  For this reason, existing
      ioremap_cache() usages for ARM are already trained to avoid attempts to
      remap ram.
      
      The impact of this bug is low for now since the pmem driver is the only
      user of memremap(), but this is important to fix before more conversions
      to memremap arrive in 4.4.
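      
      A sketch of the fixed flow, assuming the 4.3-era region_intersects()
      API and a try_ram_remap()-style helper (names illustrative, not
      necessarily the verbatim patch):
      
          static void *try_ram_remap(resource_size_t offset, size_t size)
          {
              struct page *page = pfn_to_page(offset >> PAGE_SHIFT);
      
              /* Highmem pages have no kernel linear address. */
              if (!PageHighMem(page))
                  return __va(offset);
              return NULL;
          }
      
          void *memremap(resource_size_t offset, size_t size, unsigned long flags)
          {
              int is_ram = region_intersects(offset, size, "System RAM");
              void *addr = NULL;
      
              if (flags & MEMREMAP_WB) {
                  if (is_ram == REGION_INTERSECTS)
                      addr = try_ram_remap(offset, size);
                  /* Highmem or non-RAM: let the arch code attempt it. */
                  if (!addr)
                      addr = ioremap_cache(offset, size);
              }
              return addr;
          }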
      
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reported-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  2. 23 Oct 2015, 2 commits
  3. 21 Oct 2015, 1 commit
  4. 20 Oct 2015, 5 commits
  5. 16 Oct 2015, 2 commits
  6. 12 Oct 2015, 1 commit
  7. 09 Oct 2015, 2 commits
  8. 06 Oct 2015, 1 commit
    • sched/core: Fix TASK_DEAD race in finish_task_switch() · 95913d97
      Peter Zijlstra authored
      So the problem this patch is trying to address is as follows:
      
              CPU0                            CPU1
      
              context_switch(A, B)
                                              ttwu(A)
                                                LOCK A->pi_lock
                                                A->on_cpu == 0
              finish_task_switch(A)
                prev_state = A->state  <-.
                WMB                      |
                A->on_cpu = 0;           |
                UNLOCK rq0->lock         |
                                         |    context_switch(C, A)
                                         `--  A->state = TASK_DEAD
                prev_state == TASK_DEAD
                  put_task_struct(A)
                                              context_switch(A, C)
                                              finish_task_switch(A)
                                                A->state == TASK_DEAD
                                                  put_task_struct(A)
      
      The argument being that the WMB will allow the load of A->state on CPU0
      to cross over and observe CPU1's store of A->state, which will then
      result in a double-drop and use-after-free.
      
      Now the comment states (and this was true once upon a long time ago)
      that we need to observe A->state while holding rq->lock because that
      will order us against the wakeup; however the wakeup will not in fact
      acquire (that) rq->lock; it takes A->pi_lock these days.
      
      We can obviously fix this by upgrading the WMB to an MB, but that is
      expensive, so we'd rather avoid that.
      
      The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
      which avoids the MB on some archs, but not important ones like ARM.
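      
      A sketch of the resulting helper, assuming the 4.3-era
      finish_lock_switch() and trimmed to the relevant lines:
      
          static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
          {
          #ifdef CONFIG_SMP
              /*
               * Replaces "smp_wmb(); prev->on_cpu = 0;".  The release
               * store keeps the prev->state load above it, so a
               * concurrent ttwu() that observes ->on_cpu == 0 and then
               * stores A->state = TASK_DEAD cannot have that store seen
               * by the earlier prev_state read.
               */
              smp_store_release(&prev->on_cpu, 0);
          #endif
              raw_spin_unlock_irq(&rq->lock);
          }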
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@vger.kernel.org> # v3.1+
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Cc: manfred@colorfullife.com
      Cc: will.deacon@arm.com
      Fixes: e4a52bcb ("sched: Remove rq->lock from the first half of ttwu()")
      Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. 03 Oct 2015, 1 commit
  10. 01 Oct 2015, 2 commits
    • genirq: Fix race in register_irq_proc() · 95c2b175
      Ben Hutchings authored
      Per-IRQ directories in procfs are created only when a handler is first
      added to the irqdesc, not when the irqdesc is created.  In the case of
      a shared IRQ, multiple tasks can race to create a directory.  This
      race condition seems to have been present forever, but is easier to
      hit with async probing.
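      
      A sketch of the fix, assuming a file-local mutex that serializes
      directory creation and a re-check of desc->dir under it:
      
          void register_irq_proc(unsigned int irq, struct irq_desc *desc)
          {
              static DEFINE_MUTEX(register_lock);
              char name[32];
      
              if (!root_irq_dir || (desc->irq_data.chip == &no_irq_chip))
                  return;
      
              /* Multiple tasks may race here for a shared IRQ. */
              mutex_lock(&register_lock);
      
              if (desc->dir)
                  goto out_unlock;
      
              sprintf(name, "%d", irq);
      
              /* create /proc/irq/1234 */
              desc->dir = proc_mkdir(name, root_irq_dir);
              if (!desc->dir)
                  goto out_unlock;
      
              /* ... populate smp_affinity, spurious, etc. ... */
      
          out_unlock:
              mutex_unlock(&register_lock);
          }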
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Link: http://lkml.kernel.org/r/1443266636.2004.2.camel@decadent.org.uk
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
    • workqueue: make sure delayed work run in local cpu · 874bbfe6
      Shaohua Li authored
      My system keeps crashing with the message below. vmstat_update() schedules
      delayed work on the current CPU and expects the work to run on that CPU;
      schedule_delayed_work() is expected to make delayed work run on the local
      CPU. The problem is that the timer can be migrated under NO_HZ.
      __queue_work() queues the work from the timer handler, which can run on a
      different CPU than the one the delayed work was scheduled on, so the
      delayed work ends up running on the wrong CPU. The patch makes
      __queue_delayed_work() record the local CPU earlier; with this change,
      where the timer runs no longer affects where the work runs.
      
      [   28.010131] ------------[ cut here ]------------
      [   28.010609] kernel BUG at ../mm/vmstat.c:1392!
      [   28.011099] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN
      [   28.011860] Modules linked in:
      [   28.013065] CPU: 0 PID: 289 Comm: kworker/0:3 Tainted: G        W 4.3.0-rc3+ #634
      [   28.013065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
      [   28.014160] Workqueue: events vmstat_update
      [   28.014571] task: ffff880117682580 ti: ffff8800ba428000 task.ti: ffff8800ba428000
      [   28.015445] RIP: 0010:[<ffffffff8115f921>]  [<ffffffff8115f921>] vmstat_update+0x31/0x80
      [   28.016282] RSP: 0018:ffff8800ba42fd80  EFLAGS: 00010297
      [   28.016812] RAX: 0000000000000000 RBX: ffff88011a858dc0 RCX: 0000000000000000
      [   28.017585] RDX: ffff880117682580 RSI: ffffffff81f14d8c RDI: ffffffff81f4df8d
      [   28.018366] RBP: ffff8800ba42fd90 R08: 0000000000000001 R09: 0000000000000000
      [   28.019169] R10: 0000000000000000 R11: 0000000000000121 R12: ffff8800baa9f640
      [   28.019947] R13: ffff88011a81e340 R14: ffff88011a823700 R15: 0000000000000000
      [   28.020071] FS:  0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
      [   28.020071] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   28.020071] CR2: 00007ff6144b01d0 CR3: 00000000b8e93000 CR4: 00000000000006f0
      [   28.020071] Stack:
      [   28.020071]  ffff88011a858dc0 ffff8800baa9f640 ffff8800ba42fe00 ffffffff8106bd88
      [   28.020071]  ffffffff8106bd0b 0000000000000096 0000000000000000 ffffffff82f9b1e8
      [   28.020071]  ffffffff829f0b10 0000000000000000 ffffffff81f18460 ffff88011a81e340
      [   28.020071] Call Trace:
      [   28.020071]  [<ffffffff8106bd88>] process_one_work+0x1c8/0x540
      [   28.020071]  [<ffffffff8106bd0b>] ? process_one_work+0x14b/0x540
      [   28.020071]  [<ffffffff8106c214>] worker_thread+0x114/0x460
      [   28.020071]  [<ffffffff8106c100>] ? process_one_work+0x540/0x540
      [   28.020071]  [<ffffffff81071bf8>] kthread+0xf8/0x110
      [   28.020071]  [<ffffffff81071b00>] ? kthread_create_on_node+0x200/0x200
      [   28.020071]  [<ffffffff81a6522f>] ret_from_fork+0x3f/0x70
      [   28.020071]  [<ffffffff81071b00>] ? kthread_create_on_node+0x200/0x200
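      
      A sketch of the change, assuming the 4.3-era __queue_delayed_work()
      (trimmed): the target CPU is resolved before the timer is armed, so
      timer migration no longer decides where the work runs.
      
          static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
                                           struct delayed_work *dwork,
                                           unsigned long delay)
          {
              struct timer_list *timer = &dwork->timer;
      
              dwork->wq = wq;
              /*
               * Record the intended CPU now; delayed_work_timer_fn()
               * queues on dwork->cpu, not on whichever CPU the (possibly
               * migrated) timer handler happens to run.
               */
              if (cpu == WORK_CPU_UNBOUND)
                  cpu = raw_smp_processor_id();
              dwork->cpu = cpu;
              timer->expires = jiffies + delay;
      
              add_timer_on(timer, cpu);
          }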
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v2.6.31+
  11. 23 Sep 2015, 2 commits
  12. 21 Sep 2015, 1 commit
    • rcu: Suppress lockdep false positive for rcp->exp_funnel_mutex · 19a5ecde
      Paul E. McKenney authored
      In kernels built with CONFIG_PREEMPT=y, synchronize_rcu_expedited()
      invokes synchronize_sched_expedited() while holding RCU-preempt's
      root rcu_node structure's ->exp_funnel_mutex, which is acquired after
      the rcu_data structure's ->exp_funnel_mutex.  The first thing that
      synchronize_sched_expedited() will do is acquire RCU-sched's rcu_data
      structure's ->exp_funnel_mutex.   There is no danger of an actual deadlock
      because the locking order is always from RCU-preempt's expedited mutexes
      to those of RCU-sched.  Unfortunately, lockdep considers both rcu_data
      structures' ->exp_funnel_mutex to be in the same lock class and therefore
      reports a deadlock cycle.
      
      This commit silences this false positive by placing RCU-sched's rcu_data
      structures' ->exp_funnel_mutex locks into their own lock class.
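      
      A sketch of the class split, assuming it lands where the per-CPU
      rcu_data structures are initialized:
      
          static struct lock_class_key rcu_exp_sched_rdp_class;
      
          mutex_init(&rdp->exp_funnel_mutex);
          /* Give RCU-sched's rcu_data mutexes their own lockdep class. */
          if (rsp == &rcu_sched_state)
              lockdep_set_class_and_name(&rdp->exp_funnel_mutex,
                                         &rcu_exp_sched_rdp_class,
                                         "rcu_data_exp_sched");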
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  13. 18 Sep 2015, 5 commits
    • sched: access local runqueue directly in single_task_running · 00cc1633
      Dominik Dingel authored
      Commit 2ee507c4 ("sched: Add function single_task_running to let a task
      check if it is the only task running on a cpu") referenced the current
      runqueue with smp_processor_id().  When CONFIG_DEBUG_PREEMPT is enabled,
      that is only allowed if preemption is disabled or the current task is
      bound to the local cpu (e.g. kernel worker).
      
      With commit f7819512 ("kvm: add halt_poll_ns module parameter") KVM
      calls single_task_running(). If CONFIG_DEBUG_PREEMPT is enabled, that
      generates a lot of kernel messages.
      
      To avoid adding preemption disabling in those cases, as it would limit
      the usefulness, we change single_task_running() to access the CPU-local
      runqueue directly.
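      
      A sketch of the resulting helper; raw_rq() reaches the CPU-local
      runqueue without the smp_processor_id() debug check, and the answer
      is inherently stale the moment it is returned anyway:
      
          bool single_task_running(void)
          {
              return raw_rq()->nr_running == 1;
          }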
      
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Fixes: 2ee507c4 ("sched: Add function single_task_running to let a task check if it is the only task running on a cpu")
      Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • perf: Fix races in computing the header sizes · f73e22ab
      Peter Zijlstra authored
      There are two races with the current code:
      
       - Another event can join the group and compute a larger header_size
         concurrently, if the smaller store wins we'll have an incorrect
         header_size set.
      
       - We compute the header_size after the event becomes active,
         therefore it's possible to use the size before it's computed.
      
      Remedy the first by moving the computation inside the ctx::mutex lock,
      and the second by placing it _before_ perf_install_in_context().
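      
      A sketch of the reordered syscall tail (call names per the 4.3-era
      perf core, trimmed to the relevant lines):
      
          mutex_lock(&ctx->mutex);
      
          /*
           * Compute the sizes under ctx->mutex, so a concurrent group
           * join cannot race the stores, and do it before the install
           * makes the event active and its sizes consumable.
           */
          perf_event__header_size(event);
          perf_event__id_header_size(event);
      
          perf_install_in_context(ctx, event, event->cpu);
          perf_unpin_context(ctx);
          mutex_unlock(&ctx->mutex);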
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Fix u16 overflows · a723968c
      Peter Zijlstra authored
      Vince reported that it's possible to overflow the various size fields
      and get weird stuff if you stick too many events in a group.
      
      Put a lid on this by requiring the fixed record size not to exceed 16k.
      This is still a fair number of events (a silly number, really) and leaves
      plenty of room for callchains and stack dwarves while also avoiding
      overflowing the u16 variables.
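      
      A sketch of the guard described above (a perf_event_validate_size()-
      style check; the 16k cap sits well below what the u16 fields can hold):
      
          static bool perf_event_validate_size(struct perf_event *event)
          {
              /*
               * Sum the fixed-size parts of a record; refuse groups whose
               * records could push the u16 size fields toward overflow.
               */
              if (event->read_size + event->header_size +
                  event->id_header_size +
                  sizeof(struct perf_event_header) >= 16*1024)
                  return false;
      
              return true;
          }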
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Restructure perf syscall point of no return · f55fc2a5
      Peter Zijlstra authored
      The exclusive_event_installable() stuff only works because it's
      exclusive with the grouping bits.
      
      Rework the code such that there is a sane place to error out before we
      go do things we cannot undo.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Fix crash trying to dequeue/enqueue the idle thread · de9b8f5d
      Peter Zijlstra authored
      Sasha reports that his virtual machine tries to schedule the idle
      thread since commit 6c37067e ("sched: Change the
      sched_class::set_cpus_allowed() calling context").
      
      His trace shows this happening from idle_thread_get()->init_idle(),
      which is the _second_ init_idle() invocation on that task_struct, the
      first being done through idle_init()->fork_idle(). (this code is
      insane...)
      
      Because we call init_idle() twice in a row, its ->sched_class ==
      &idle_sched_class and ->on_rq = TASK_ON_RQ_QUEUED. This means
      do_set_cpus_allowed() thinks we're queued and will call dequeue_task(),
      which is implemented with BUG() for the idle class, seeing how
      dequeueing the idle task is a daft thing.
      
      Aside from the whole insanity of calling init_idle() _twice_, change the
      code to call set_cpus_allowed_common() instead, as this all 'obviously'
      happens before the idle task ever gets to run.
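      
      A sketch of the init_idle() change:
      
          #ifdef CONFIG_SMP
              /*
               * init_idle() may run twice on the same task; going through
               * do_set_cpus_allowed() would then dequeue/enqueue an idle
               * task and hit the idle class's BUG().  The idle task has
               * not run yet, so update the mask directly.
               */
              set_cpus_allowed_common(idle, cpumask_of(cpu));
          #endif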
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 6c37067e ("sched: Change the sched_class::set_cpus_allowed() calling context")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  14. 16 Sep 2015, 7 commits
  15. 14 Sep 2015, 1 commit
  16. 13 Sep 2015, 1 commit
    • time: Fix timekeeping_freqadjust()'s incorrect use of abs() instead of abs64() · 2619d7e9
      John Stultz authored
      The internal clocksteering done for fine-grained error
      correction uses a logarithmic approximation, so any time
      adjtimex() adjusts the clock steering, timekeeping_freqadjust()
      quickly approximates the correct clock frequency over a series
      of ticks.
      
      Unfortunately, the logic in timekeeping_freqadjust(), introduced
      in commit:
      
        dc491596 ("timekeeping: Rework frequency adjustments to work better w/ nohz")
      
      used the abs() function with a s64 error value to calculate the
      size of the approximated adjustment to be made.
      
      Per include/linux/kernel.h:
      
        "abs() should not be used for 64-bit types (s64, u64, long long) - use abs64()".
      
      Thus on 32-bit platforms, this resulted in the clocksteering
      taking a quite dampened random walk trying to converge on the
      proper frequency, which caused the adjustments to be made much
      slower than intended (most easily observed when large
      adjustments are made).
      
      This patch fixes the issue by using abs64() instead.
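      
      The shape of the fix, as a one-line sketch:
      
          /* abs() truncates an s64 to int on 32-bit; abs64() does not. */
          tick_error = abs64(tick_error);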
      Reported-by: Nuno Gonçalves <nunojpg@gmail.com>
      Tested-by: Nuno Goncalves <nunojpg@gmail.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Cc: <stable@vger.kernel.org> # v3.17+
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1441840051-20244-1-git-send-email-john.stultz@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  17. 12 Sep 2015, 1 commit
    • sys_membarrier(): system-wide memory barrier (generic, x86) · 5b25b13a
      Mathieu Desnoyers authored
      Here is an implementation of a new system call, sys_membarrier(), which
      executes a memory barrier on all threads running on the system.  It is
      implemented by calling synchronize_sched().  It can be used to
      distribute the cost of user-space memory barriers asymmetrically by
      transforming pairs of memory barriers into pairs consisting of
      sys_membarrier() and a compiler barrier.  For synchronization primitives
      that distinguish between read-side and write-side (e.g.  userspace RCU
      [1], rwlocks), the read-side can be accelerated significantly by moving
      the bulk of the memory barrier overhead to the write-side.
      
      The existing applications of which I am aware that would be improved by
      this system call are as follows:
      
      * Through Userspace RCU library (http://urcu.so)
        - DNS server (Knot DNS) https://www.knot-dns.cz/
        - Network sniffer (http://netsniff-ng.org/)
        - Distributed object storage (https://sheepdog.github.io/sheepdog/)
        - User-space tracing (http://lttng.org)
        - Network storage system (https://www.gluster.org/)
        - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
        - Financial software (https://lkml.org/lkml/2015/3/23/189)
      
      Those projects use RCU in userspace to increase read-side speed and
      scalability compared to locking.  Especially in the case of RCU used by
      libraries, sys_membarrier can speed up the read-side by moving the bulk of
      the memory barrier cost to synchronize_rcu().
      
      * Direct users of sys_membarrier
        - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
      
      Microsoft core dotnet GC developers are planning to use the mprotect()
      side-effect of issuing memory barriers through IPIs as a way to implement
      Windows FlushProcessWriteBuffers() on Linux.  They are referring to
      sys_membarrier in their github thread, specifically stating that
      sys_membarrier() is what they are looking for.
      
      To explain the benefit of this scheme, let's introduce two example threads:
      
      Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
      Thread B (frequent, e.g. executing liburcu
      rcu_read_lock()/rcu_read_unlock())
      
      In a scheme where all smp_mb() in thread A are ordering memory accesses
      with respect to smp_mb() present in Thread B, we can change each
      smp_mb() within Thread A into calls to sys_membarrier() and each
      smp_mb() within Thread B into compiler barriers "barrier()".
      
      Before the change, we had, for each smp_mb() pairs:
      
      Thread A                    Thread B
      previous mem accesses       previous mem accesses
      smp_mb()                    smp_mb()
      following mem accesses      following mem accesses
      
      After the change, these pairs become:
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      As we can see, there are two possible scenarios: either Thread B memory
      accesses do not happen concurrently with Thread A accesses (1), or they
      do (2).
      
      1) Non-concurrent Thread A vs Thread B accesses:
      
      Thread A                    Thread B
      prev mem accesses
      sys_membarrier()
      follow mem accesses
                                  prev mem accesses
                                  barrier()
                                  follow mem accesses
      
      In this case, thread B accesses will be weakly ordered. This is OK,
      because at that point, thread A is not particularly interested in
      ordering them with respect to its own accesses.
      
      2) Concurrent Thread A vs Thread B accesses
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      In this case, thread B accesses, which are ensured to be in program
      order thanks to the compiler barrier, will be "upgraded" to full
      smp_mb() by synchronize_sched().
      
      * Benchmarks
      
      On Intel Xeon E5405 (8 cores)
      (one thread is calling sys_membarrier, the other 7 threads are busy
      looping)
      
      1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
      
      * User-space user of this system call: Userspace RCU library
      
      Both the signal-based and the sys_membarrier userspace RCU schemes
      permit us to remove the memory barrier from the userspace RCU
      rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
      accelerating them. These memory barriers are replaced by compiler
      barriers on the read-side, and all matching memory barriers on the
      write-side are turned into an invocation of a memory barrier on all
      active threads in the process. By letting the kernel perform this
      synchronization rather than dumbly sending a signal to every thread of
      the process (as we currently do), we diminish the number of unnecessary
      wake-ups and only issue the memory barriers on active threads. Non-running
      threads do not need to execute such a barrier anyway, because it is
      implied by the scheduler's context switches.
      
      Results in liburcu:
      
      Operations in 10s, 6 readers, 2 writers:
      
      memory barriers in reader:    1701557485 reads, 2202847 writes
      signal-based scheme:          9830061167 reads,    6700 writes
      sys_membarrier:               9952759104 reads,     425 writes
      sys_membarrier (dyn. check):  7970328887 reads,     425 writes
      
      The dynamic sys_membarrier availability check adds some overhead to
      the read-side compared to the signal-based scheme, but besides that,
      sys_membarrier slightly outperforms the signal-based scheme. However,
      this non-expedited sys_membarrier implementation has a much slower grace
      period than signal and memory barrier schemes.
      
      Besides diminishing the number of wake-ups, one major advantage of the
      membarrier system call over the signal-based scheme is that it does not
      need to reserve a signal. This plays much more nicely with libraries,
      and with processes injected into for tracing purposes, for which we
      cannot expect that signals will be unused by the application.
      
      An expedited version of this system call can be added later on to speed
      up the grace period. Its implementation will likely depend on reading
      the cpu_curr()->mm without holding each CPU's rq lock.
      
      This patch adds the system call to x86 and to asm-generic.
      
      [1] http://urcu.so
      
      membarrier(2) man page:
      
      MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
      
      NAME
             membarrier - issue memory barriers on a set of threads
      
      SYNOPSIS
             #include <linux/membarrier.h>
      
             int membarrier(int cmd, int flags);
      
      DESCRIPTION
             The cmd argument is one of the following:
      
             MEMBARRIER_CMD_QUERY
                    Query  the  set  of  supported commands. It returns a bitmask of
                    supported commands.
      
             MEMBARRIER_CMD_SHARED
                    Execute a memory barrier on all threads running on  the  system.
                    Upon  return from system call, the caller thread is ensured that
                    all running threads have passed through a state where all memory
                    accesses  to  user-space  addresses  match program order between
                    entry to and return from the system  call  (non-running  threads
                     are de facto in such a state). This covers threads from all
                     processes running on the system.  This command returns 0.
      
       The flags argument must be 0; it is reserved for future extensions.
      
       All memory accesses performed  in  program  order  from  each  targeted
       thread are guaranteed to be ordered with respect to sys_membarrier(). If
             we use the semantic "barrier()" to represent a compiler barrier forcing
             memory  accesses  to  be performed in program order across the barrier,
             and smp_mb() to represent explicit memory barriers forcing full  memory
             ordering  across  the barrier, we have the following ordering table for
             each pair of barrier(), sys_membarrier() and smp_mb():
      
             The pair ordering is detailed as (O: ordered, X: not ordered):
      
                                    barrier()   smp_mb() sys_membarrier()
                    barrier()          X           X            O
                    smp_mb()           X           O            O
                    sys_membarrier()   O           O            O
      
      RETURN VALUE
             On success, these system calls return zero.  On error, -1 is  returned,
             and errno is set appropriately. For a given command, with flags
             argument set to 0, this system call is guaranteed to always return the
             same value until reboot.
      
      ERRORS
             ENOSYS System call is not implemented.
      
             EINVAL Invalid arguments.
      
      Linux                             2015-04-15                     MEMBARRIER(2)
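      
      Since glibc provides no wrapper, a minimal userspace usage sketch
      goes through syscall(2):
      
          #define _GNU_SOURCE
          #include <unistd.h>
          #include <sys/syscall.h>
          #include <linux/membarrier.h>
      
          static int membarrier(int cmd, int flags)
          {
              return syscall(__NR_membarrier, cmd, flags);
          }
      
          int main(void)
          {
              int mask = membarrier(MEMBARRIER_CMD_QUERY, 0);
      
              /* Fall back to the signal-based scheme on older kernels. */
              if (mask < 0 || !(mask & MEMBARRIER_CMD_SHARED))
                  return 1;
      
              /* Acts as an smp_mb() on every running thread. */
              membarrier(MEMBARRIER_CMD_SHARED, 0);
              return 0;
          }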
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Nicholas Miell <nmiell@comcast.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 11 Sep 2015, 4 commits
    • sched: 'Annotate' migrate_tasks() · 5473e0cc
      Wanpeng Li authored
      Kernel testing triggered this warning:
      
      | WARNING: CPU: 0 PID: 13 at kernel/sched/core.c:1156 do_set_cpus_allowed+0x7e/0x80()
      | Modules linked in:
      | CPU: 0 PID: 13 Comm: migration/0 Not tainted 4.2.0-rc1-00049-g25834c73 #2
      | Call Trace:
      |   dump_stack+0x4b/0x75
      |   warn_slowpath_common+0x8b/0xc0
      |   warn_slowpath_null+0x22/0x30
      |   do_set_cpus_allowed+0x7e/0x80
      |   cpuset_cpus_allowed_fallback+0x7c/0x170
      |   select_fallback_rq+0x221/0x280
      |   migration_call+0xe3/0x250
      |   notifier_call_chain+0x53/0x70
      |   __raw_notifier_call_chain+0x1e/0x30
      |   cpu_notify+0x28/0x50
      |   take_cpu_down+0x22/0x40
      |   multi_cpu_stop+0xd5/0x140
      |   cpu_stopper_thread+0xbc/0x170
      |   smpboot_thread_fn+0x174/0x2f0
      |   kthread+0xc4/0xe0
      |   ret_from_kernel_thread+0x21/0x30
      
      As Peterz pointed out:
      
      | So the normal rules for changing task_struct::cpus_allowed are holding
      | both pi_lock and rq->lock, such that holding either stabilizes the mask.
      |
      | This is so that wakeup can happen without rq->lock and load-balance
      | without pi_lock.
      |
      | From this we already get the relaxation that we can omit acquiring
      | rq->lock if the task is not on the rq, because in that case
      | load-balancing will not apply to it.
      |
      | ** these are the rules currently tested in do_set_cpus_allowed() **
      |
      | Now, since __set_cpus_allowed_ptr() uses task_rq_lock() which
      | unconditionally acquires both locks, we could get away with holding just
      | rq->lock when on_rq for modification because that'd still exclude
      | __set_cpus_allowed_ptr(), it would also work against
      | __kthread_bind_mask() because that assumes !on_rq.
      |
      | That said, this is all somewhat fragile.
      |
      | Now, I don't think dropping rq->lock is quite as disastrous as it
      | usually is because !cpu_active at this point, which means load-balance
      | will not interfere, but that too is somewhat fragile.
      |
      | So we end up with a choice of two fragile..
      
      This patch fixes it by following the rules for changing
      task_struct::cpus_allowed with both pi_lock and rq->lock held.
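      
      A sketch of the locking dance inside the migrate_tasks() loop (names
      per the 4.3-era code, trimmed):
      
          /*
           * Changing ->cpus_allowed needs both pi_lock and rq->lock, and
           * pi_lock nests outside rq->lock, so drop and re-take rq->lock.
           * !cpu_active keeps load-balancing out in the meantime.
           */
          raw_spin_unlock(&rq->lock);
          raw_spin_lock(&next->pi_lock);
          raw_spin_lock(&rq->lock);
      
          /* Re-check: the task may have moved while rq->lock was dropped. */
          if (task_rq(next) != rq || !task_on_rq_queued(next)) {
              raw_spin_unlock(&next->pi_lock);
              continue;
          }
      
          /* ... select_fallback_rq() and migrate as before ... */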
      Reported-by: kernel test robot <ying.huang@intel.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      [ Modified changelog and patch. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/BLU436-SMTP1660820490DE202E3934ED3806E0@phx.gbl
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • locking/qspinlock/x86: Fix performance regression under unaccelerated VMs · 43b3f028
      Peter Zijlstra authored
      Dave ran into horrible performance on a VM without PARAVIRT_SPINLOCKS
      set, and Linus noted that the test-and-set implementation was needlessly
      heavy-handed.
      
      One should spin on the variable with a load, not a RMW.
      
      While there, remove 'queued' from the name, as the lock isn't queued
      at all, but a simple test-and-set.
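      
      A sketch of the resulting helper (the rename drops 'queued'): spin
      with plain loads and only attempt the atomic RMW once the lock looks
      free.
      
          static inline bool virt_spin_lock(struct qspinlock *lock)
          {
              if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
                  return false;
      
              /*
               * The read-only spin keeps the cacheline shared; the
               * cmpxchg is issued only when the lock word was seen clear.
               */
              do {
                  while (atomic_read(&lock->val) != 0)
                      cpu_relax();
              } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);
      
              return true;
          }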
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reported-by: Dave Chinner <david@fromorbit.com>
      Tested-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: stable@vger.kernel.org # v4.2+
      Link: http://lkml.kernel.org/r/20150904152523.GR18673@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sysctl: fix int -> unsigned long assignments in INT_MIN case · 9a5bc726
      Ilya Dryomov authored
      The following
      
          if (val < 0)
              *lvalp = (unsigned long)-val;
      
      is incorrect because the compiler is free to assume -val to be positive
      and use a sign-extend instruction for extending the bit pattern.  This is
      a problem if val == INT_MIN:
      
          # echo -2147483648 >/proc/sys/dev/scsi/logging_level
          # cat /proc/sys/dev/scsi/logging_level
          -18446744071562067968
      
      Cast to unsigned long before negation - that way we first sign-extend and
      then negate an unsigned, which is well defined.  With this:
      
          # cat /proc/sys/dev/scsi/logging_level
          -2147483648
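      
      The corrected assignment, sketched:
      
          if (val < 0)
              /* sign-extend to unsigned long first, then negate */
              *lvalp = -(unsigned long)val;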
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Cc: Mikulas Patocka <mikulas@twibright.com>
      Cc: Robert Xiao <nneonneo@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kexec: export KERNEL_IMAGE_SIZE to vmcoreinfo · 1303a27c
      Baoquan He authored
      On x86_64, KERNEL_IMAGE_SIZE was changed to 512M in v2.6.26, and
      accordingly MODULES_VADDR was changed to 0xffffffffa0000000.  However,
      in v3.12 Kees Cook introduced kaslr to randomize the location of the
      kernel, and the kernel text mapping address space was enlarged from
      512M to 1G.  That means KERNEL_IMAGE_SIZE is now variable: its value is
      512M when kaslr support is not compiled in and 1G when it is.
      Accordingly, MODULES_VADDR was changed too, to:
      
          #define MODULES_VADDR    (__START_KERNEL_map + KERNEL_IMAGE_SIZE)
      
      So when kaslr is compiled in and enabled, the kernel text mapping
      address space and the modules vaddr space need to be adjusted;
      otherwise makedumpfile will fail, since the addresses of some symbols
      are no longer correct.
      
      Hence KERNEL_IMAGE_SIZE needs to be exported to vmcoreinfo and picked
      up by makedumpfile to help calculate MODULES_VADDR.
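      
      The export itself is a one-liner in the x86 vmcoreinfo hook; a sketch:
      
          void arch_crash_save_vmcoreinfo(void)
          {
              /* ... existing exports (phys_base, etc.) ... */
          #ifdef CONFIG_X86_64
              /*
               * Lets makedumpfile compute
               * MODULES_VADDR = __START_KERNEL_map + KERNEL_IMAGE_SIZE
               * whether or not kaslr grew the text mapping to 1G.
               */
              VMCOREINFO_NUMBER(KERNEL_IMAGE_SIZE);
          #endif
          }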
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>