1. 12 1月, 2015 1 次提交
    • M
      powerpc: Work around gcc bug in current_thread_info() · a87e810f
      Michael Ellerman 提交于
      In commit a3e5b356 "powerpc: Don't use local named register variable
      in current_thread_info" Anton changed the way we did current_thread_info()
      to accommodate LLVM, and it was not meant to have any effect elsewhere.
      
      Unfortunately it has exposed a gcc bug, where r1 gets copied into
      another register and then gcc uses that register to restore the toc
      after a function call, even when that register is volatile and has been
      clobbered by the function call.
      
      We could revert Anton's patch, but it's not clear the original code is
      safe either, we may just have been lucky.
      
      The cleanest solution is to just use the existing CURRENT_THREAD_INFO()
      asm macro, and call it using inline asm.
      
      Segher points out we don't need volatile on the asm, if the result of
      the shift is unused it's fine for the compiler to elide it.
      
      Fixes: a3e5b356 ("powerpc: Don't use local named register variable in current_thread_info")
      Reported-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a87e810f
  2. 29 12月, 2014 2 次提交
    • H
      powerpc/kdump: Ignore failure in enabling big endian exception during crash · c1caae3d
      Hari Bathini 提交于
      In LE kernel, we currently have a hack for kexec that resets the exception
      endian before starting a new kernel as the kernel that is loaded could be a
      big endian or a little endian kernel. In kdump case, resetting exception
      endian fails when one or more cpus is disabled. But we can ignore the failure
      and still go ahead, as in most cases crashkernel will be of same endianess
      as primary kernel and reseting endianess is not even needed in those cases.
      This patch adds a new inline function to say if this is kdump path. This
      function is used at places where such a check is needed.
      Signed-off-by: NHari Bathini <hbathini@linux.vnet.ibm.com>
      [mpe: Rename to kdump_in_progress(), use bool, and edit comment]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c1caae3d
    • P
      powerpc: Wire up sys_execveat() syscall · 1e5d0fdb
      Pranith Kumar 提交于
      Wire up sys_execveat(). This passes the selftests for the system call.
      
      Check success of execveat(3, '../execveat', 0)... [OK]
      Check success of execveat(5, 'execveat', 0)... [OK]
      Check success of execveat(6, 'execveat', 0)... [OK]
      Check success of execveat(-100, '/home/pranith/linux/...ftests/exec/execveat', 0)... [OK]
      Check success of execveat(99, '/home/pranith/linux/...ftests/exec/execveat', 0)... [OK]
      Check success of execveat(8, '', 4096)... [OK]
      Check success of execveat(17, '', 4096)... [OK]
      Check success of execveat(9, '', 4096)... [OK]
      Check success of execveat(14, '', 4096)... [OK]
      Check success of execveat(14, '', 4096)... [OK]
      Check success of execveat(15, '', 4096)... [OK]
      Check failure of execveat(8, '', 0) with ENOENT... [OK]
      Check failure of execveat(8, '(null)', 4096) with EFAULT... [OK]
      Check success of execveat(5, 'execveat.symlink', 0)... [OK]
      Check success of execveat(6, 'execveat.symlink', 0)... [OK]
      Check success of execveat(-100, '/home/pranith/linux/...xec/execveat.symlink', 0)... [OK]
      Check success of execveat(10, '', 4096)... [OK]
      Check success of execveat(10, '', 4352)... [OK]
      Check failure of execveat(5, 'execveat.symlink', 256) with ELOOP... [OK]
      Check failure of execveat(6, 'execveat.symlink', 256) with ELOOP... [OK]
      Check failure of execveat(-100, '/home/pranith/linux/tools/testing/selftests/exec/execveat.symlink', 256) with ELOOP... [OK]
      Check success of execveat(3, '../script', 0)... [OK]
      Check success of execveat(5, 'script', 0)... [OK]
      Check success of execveat(6, 'script', 0)... [OK]
      Check success of execveat(-100, '/home/pranith/linux/...elftests/exec/script', 0)... [OK]
      Check success of execveat(13, '', 4096)... [OK]
      Check success of execveat(13, '', 4352)... [OK]
      Check failure of execveat(18, '', 4096) with ENOENT... [OK]
      Check failure of execveat(7, 'script', 0) with ENOENT... [OK]
      Check success of execveat(16, '', 4096)... [OK]
      Check success of execveat(16, '', 4096)... [OK]
      Check success of execveat(4, '../script', 0)... [OK]
      Check success of execveat(4, 'script', 0)... [OK]
      Check success of execveat(4, '../script', 0)... [OK]
      Check failure of execveat(4, 'script', 0) with ENOENT... [OK]
      Check failure of execveat(5, 'execveat', 65535) with EINVAL... [OK]
      Check failure of execveat(5, 'no-such-file', 0) with ENOENT... [OK]
      Check failure of execveat(6, 'no-such-file', 0) with ENOENT... [OK]
      Check failure of execveat(-100, 'no-such-file', 0) with ENOENT... [OK]
      Check failure of execveat(5, '', 4096) with EACCES... [OK]
      Check failure of execveat(5, 'Makefile', 0) with EACCES... [OK]
      Check failure of execveat(11, '', 4096) with EACCES... [OK]
      Check failure of execveat(12, '', 4096) with EACCES... [OK]
      Check failure of execveat(99, '', 4096) with EBADF... [OK]
      Check failure of execveat(99, 'execveat', 0) with EBADF... [OK]
      Check failure of execveat(8, 'execveat', 0) with ENOTDIR... [OK]
      Invoke copy of 'execveat' via filename of length 4093:
      Check success of execveat(19, '', 4096)... [OK]
      Check success of execveat(5, 'xxxxxxxxxxxxxxxxxxxx...yyyyyyyyyyyyyyyyyyyy', 0)... [OK]
      Invoke copy of 'script' via filename of length 4093:
      Check success of execveat(20, '', 4096)... [OK]
      /bin/sh: 0: Can't open /dev/fd/5/xxxxxxx(... a long line of x's and y's, 0)... [OK]
      Check success of execveat(5, 'xxxxxxxxxxxxxxxxxxxx...yyyyyyyyyyyyyyyyyyyy', 0)... [OK]
      
      Tested on a 32-bit powerpc system.
      Signed-off-by: NPranith Kumar <bobby.prani@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      1e5d0fdb
  3. 18 12月, 2014 1 次提交
    • M
      powerpc/uaccess: Allow get_user() with bitwise types · 505e4283
      Michael S. Tsirkin 提交于
      At the moment, if p and x are both of the same bitwise type
      (eg. __le32), get_user(x, p) produces a sparse warning.
      
      This is because *p is loaded into a long then cast back to typeof(*p).
      
      When typeof(*p) is a bitwise type (which is uncommon), such a cast needs
      __force, otherwise sparse produces a warning.
      
      For non-bitwise types __force should have no effect, and should not hide
      any legitimate errors.
      
      Note that we are casting to typeof(*p) not typeof(x). Even with the
      cast, if x and *p are of different types we should get the warning, so I
      think we are not loosing the ability to detect any actual errors.
      
      virtio would like to use bitwise types with get_user() so fix these
      spurious warnings by adding __force.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      [mpe: Fill in changelog with more details]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      505e4283
  4. 17 12月, 2014 4 次提交
    • S
      KVM: PPC: Book3S HV: Improve H_CONFER implementation · 90fd09f8
      Sam Bobroff 提交于
      Currently the H_CONFER hcall is implemented in kernel virtual mode,
      meaning that whenever a guest thread does an H_CONFER, all the threads
      in that virtual core have to exit the guest.  This is bad for
      performance because it interrupts the other threads even if they
      are doing useful work.
      
      The H_CONFER hcall is called by a guest VCPU when it is spinning on a
      spinlock and it detects that the spinlock is held by a guest VCPU that
      is currently not running on a physical CPU.  The idea is to give this
      VCPU's time slice to the holder VCPU so that it can make progress
      towards releasing the lock.
      
      To avoid having the other threads exit the guest unnecessarily,
      we add a real-mode implementation of H_CONFER that checks whether
      the other threads are doing anything.  If all the other threads
      are idle (i.e. in H_CEDE) or trying to confer (i.e. in H_CONFER),
      it returns H_TOO_HARD which causes a guest exit and allows the
      H_CONFER to be handled in virtual mode.
      
      Otherwise it spins for a short time (up to 10 microseconds) to give
      other threads the chance to observe that this thread is trying to
      confer.  The spin loop also terminates when any thread exits the guest
      or when all other threads are idle or trying to confer.  If the
      timeout is reached, the H_CONFER returns H_SUCCESS.  In this case the
      guest VCPU will recheck the spinlock word and most likely call
      H_CONFER again.
      
      This also improves the implementation of the H_CONFER virtual mode
      handler.  If the VCPU is part of a virtual core (vcore) which is
      runnable, there will be a 'runner' VCPU which has taken responsibility
      for running the vcore.  In this case we yield to the runner VCPU
      rather than the target VCPU.
      
      We also introduce a check on the target VCPU's yield count: if it
      differs from the yield count passed to H_CONFER, the target VCPU
      has run since H_CONFER was called and may have already released
      the lock.  This check is required by PAPR.
      Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      90fd09f8
    • P
      KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register · 4a157d61
      Paul Mackerras 提交于
      There are two ways in which a guest instruction can be obtained from
      the guest in the guest exit code in book3s_hv_rmhandlers.S.  If the
      exit was caused by a Hypervisor Emulation interrupt (i.e. an illegal
      instruction), the offending instruction is in the HEIR register
      (Hypervisor Emulation Instruction Register).  If the exit was caused
      by a load or store to an emulated MMIO device, we load the instruction
      from the guest by turning data relocation on and loading the instruction
      with an lwz instruction.
      
      Unfortunately, in the case where the guest has opposite endianness to
      the host, these two methods give results of different endianness, but
      both get put into vcpu->arch.last_inst.  The HEIR value has been loaded
      using guest endianness, whereas the lwz will load the instruction using
      host endianness.  The rest of the code that uses vcpu->arch.last_inst
      assumes it was loaded using host endianness.
      
      To fix this, we define a new vcpu field to store the HEIR value.  Then,
      in kvmppc_handle_exit_hv(), we transfer the value from this new field to
      vcpu->arch.last_inst, doing a byte-swap if the guest and host endianness
      differ.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      4a157d61
    • P
      KVM: PPC: Book3S HV: Remove code for PPC970 processors · c17b98cf
      Paul Mackerras 提交于
      This removes the code that was added to enable HV KVM to work
      on PPC970 processors.  The PPC970 is an old CPU that doesn't
      support virtualizing guest memory.  Removing PPC970 support also
      lets us remove the code for allocating and managing contiguous
      real-mode areas, the code for the !kvm->arch.using_mmu_notifiers
      case, the code for pinning pages of guest memory when first
      accessed and keeping track of which pages have been pinned, and
      the code for handling H_ENTER hypercalls in virtual mode.
      
      Book3S HV KVM is now supported only on POWER7 and POWER8 processors.
      The KVM_CAP_PPC_RMA capability now always returns 0.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      c17b98cf
    • P
      KVM: PPC: Book3S HV: Simplify locking around stolen time calculations · 2711e248
      Paul Mackerras 提交于
      Currently the calculations of stolen time for PPC Book3S HV guests
      uses fields in both the vcpu struct and the kvmppc_vcore struct.  The
      fields in the kvmppc_vcore struct are protected by the
      vcpu->arch.tbacct_lock of the vcpu that has taken responsibility for
      running the virtual core.  This works correctly but confuses lockdep,
      because it sees that the code takes the tbacct_lock for a vcpu in
      kvmppc_remove_runnable() and then takes another vcpu's tbacct_lock in
      vcore_stolen_time(), and it thinks there is a possibility of deadlock,
      causing it to print reports like this:
      
      =============================================
      [ INFO: possible recursive locking detected ]
      3.18.0-rc7-kvm-00016-g8db4bc6 #89 Not tainted
      ---------------------------------------------
      qemu-system-ppc/6188 is trying to acquire lock:
       (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb1fe8>] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
      
      but task is already holding lock:
       (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&(&vcpu->arch.tbacct_lock)->rlock);
        lock(&(&vcpu->arch.tbacct_lock)->rlock);
      
       *** DEADLOCK ***
      
       May be due to missing lock nesting notation
      
      3 locks held by qemu-system-ppc/6188:
       #0:  (&vcpu->mutex){+.+.+.}, at: [<d00000000eb93f98>] .vcpu_load+0x28/0xe0 [kvm]
       #1:  (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecb41b0>] .kvmppc_vcpu_run_hv+0x530/0x1530 [kvm_hv]
       #2:  (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]
      
      stack backtrace:
      CPU: 40 PID: 6188 Comm: qemu-system-ppc Not tainted 3.18.0-rc7-kvm-00016-g8db4bc6 #89
      Call Trace:
      [c000000b2754f3f0] [c000000000b31b6c] .dump_stack+0x88/0xb4 (unreliable)
      [c000000b2754f470] [c0000000000faeb8] .__lock_acquire+0x1878/0x2190
      [c000000b2754f600] [c0000000000fbf0c] .lock_acquire+0xcc/0x1a0
      [c000000b2754f6d0] [c000000000b2954c] ._raw_spin_lock_irq+0x4c/0x70
      [c000000b2754f760] [d00000000ecb1fe8] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
      [c000000b2754f7f0] [d00000000ecb25b4] .kvmppc_remove_runnable.part.3+0x44/0xd0 [kvm_hv]
      [c000000b2754f880] [d00000000ecb43ec] .kvmppc_vcpu_run_hv+0x76c/0x1530 [kvm_hv]
      [c000000b2754f9f0] [d00000000eb9f46c] .kvmppc_vcpu_run+0x2c/0x40 [kvm]
      [c000000b2754fa60] [d00000000eb9c9a4] .kvm_arch_vcpu_ioctl_run+0x54/0x160 [kvm]
      [c000000b2754faf0] [d00000000eb94538] .kvm_vcpu_ioctl+0x498/0x760 [kvm]
      [c000000b2754fcb0] [c000000000267eb4] .do_vfs_ioctl+0x444/0x770
      [c000000b2754fd90] [c0000000002682a4] .SyS_ioctl+0xc4/0xe0
      [c000000b2754fe30] [c0000000000092e4] syscall_exit+0x0/0x98
      
      In order to make the locking easier to analyse, we change the code to
      use a spinlock in the kvmppc_vcore struct to protect the stolen_tb and
      preempt_tb fields.  This lock needs to be an irq-safe lock since it is
      used in the kvmppc_core_vcpu_load_hv() and kvmppc_core_vcpu_put_hv()
      functions, which are called with the scheduler rq lock held, which is
      an irq-safe lock.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      2711e248
  5. 15 12月, 2014 5 次提交
    • P
      KVM: PPC: Book3S HV: Fix computation of tlbie operand · d506735b
      Paul Mackerras 提交于
      The B (segment size) field in the RB operand for the tlbie
      instruction is two bits, which we get from the top two bits of
      the first doubleword of the HPT entry to be invalidated.  These
      bits go in bits 8 and 9 of the RB operand (bits 54 and 55 in IBM
      bit numbering).
      
      The compute_tlbie_rb() function gets these bits as v >> (62 - 8),
      which is not correct as it will bring in the top 10 bits, not
      just the top two.  These extra bits could corrupt the AP, AVAL
      and L fields in the RB value.  To fix this we shift right 62 bits
      and then shift left 8 bits, so we only get the two bits of the
      B field.
      
      The first doubleword of the HPT entry is under the control of the
      guest kernel.  In fact, Linux guests will always put zeroes in bits
      54 -- 61 (IBM bits 2 -- 9), but we should not rely on guests doing
      this.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      d506735b
    • S
      powernv/powerpc: Add winkle support for offline cpus · 77b54e9f
      Shreyas B. Prabhu 提交于
      Winkle is a deep idle state supported in power8 chips. A core enters
      winkle when all the threads of the core enter winkle. In this state
      power supply to the entire chiplet i.e core, private L2 and private L3
      is turned off. As a result it gives higher powersavings compared to
      sleep.
      
      But entering winkle results in a total hypervisor state loss. Hence the
      hypervisor context has to be preserved before entering winkle and
      restored upon wake up.
      
      Power-on Reset Engine (PORE) is a dedicated engine which is responsible
      for powering on the chiplet during wake up. It can be programmed to
      restore the register contests of a few specific registers. This patch
      uses PORE to restore register state wherever possible and uses stack to
      save and restore rest of the necessary registers.
      
      With hypervisor state restore things fall under three categories-
      per-core state, per-subcore state and per-thread state. To manage this,
      extend the infrastructure introduced for sleep. Mainly we add a paca
      variable subcore_sibling_mask. Using this and the core_idle_state we can
      distingush first thread in core and subcore.
      Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      77b54e9f
    • S
      powernv/cpuidle: Redesign idle states management · 7cba160a
      Shreyas B. Prabhu 提交于
      Deep idle states like sleep and winkle are per core idle states. A core
      enters these states only when all the threads enter either the
      particular idle state or a deeper one. There are tasks like fastsleep
      hardware bug workaround and hypervisor core state save which have to be
      done only by the last thread of the core entering deep idle state and
      similarly tasks like timebase resync, hypervisor core register restore
      that have to be done only by the first thread waking up from these
      state.
      
      The current idle state management does not have a way to distinguish the
      first/last thread of the core waking/entering idle states. Tasks like
      timebase resync are done for all the threads. This is not only is
      suboptimal, but can cause functionality issues when subcores and kvm is
      involved.
      
      This patch adds the necessary infrastructure to track idle states of
      threads in a per-core structure. It uses this info to perform tasks like
      fastsleep workaround and timebase resync only once per core.
      Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Originally-by: NPreeti U. Murthy <preeti@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: linux-pm@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      7cba160a
    • S
      powerpc/powernv: Enable Offline CPUs to enter deep idle states · 8eb8ac89
      Shreyas B. Prabhu 提交于
      The secondary threads should enter deep idle states so as to gain maximum
      powersavings when the entire core is offline. To do so the offline path
      must be made aware of the available deepest idle state. Hence probe the
      device tree for the possible idle states in powernv core code and
      expose the deepest idle state through flags.
      
      Since the  device tree is probed by the cpuidle driver as well, move
      the parameters required to discover the idle states into an appropriate
      common place to both the driver and the powernv core code.
      
      Another point is that fastsleep idle state may require workarounds in
      the kernel to function properly. This workaround is introduced in the
      subsequent patches. However neither the cpuidle driver or the hotplug
      path need be bothered about this workaround.
      
      They will be taken care of by the core powernv code.
      Originally-by: NSrivatsa S. Bhat <srivatsa@mit.edu>
      Signed-off-by: NPreeti U. Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Reviewed-by: NPaul Mackerras <paulus@samba.org>
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: linux-pm@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8eb8ac89
    • P
      powerpc/powernv: Switch off MMU before entering nap/sleep/rvwinkle mode · 8117ac6a
      Paul Mackerras 提交于
      Currently, when going idle, we set the flag indicating that we are in
      nap mode (paca->kvm_hstate.hwthread_state) and then execute the nap
      (or sleep or rvwinkle) instruction, all with the MMU on.  This is bad
      for two reasons: (a) the architecture specifies that those instructions
      must be executed with the MMU off, and in fact with only the SF, HV, ME
      and possibly RI bits set, and (b) this introduces a race, because as
      soon as we set the flag, another thread can switch the MMU to a guest
      context.  If the race is lost, this thread will typically start looping
      on relocation-on ISIs at 0xc...4400.
      
      This fixes it by setting the MSR as required by the architecture before
      setting the flag or executing the nap/sleep/rvwinkle instruction.
      
      Cc: stable@vger.kernel.org
      [ shreyas@linux.vnet.ibm.com: Edited to handle LE ]
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NShreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      8117ac6a
  6. 14 12月, 2014 1 次提交
  7. 12 12月, 2014 3 次提交
    • R
      powerpc: add little endian flag to syscall_get_arch() · 63f13448
      Richard Guy Briggs 提交于
      Since both ppc and ppc64 have LE variants which are now reported by uname, add
      that flag (__AUDIT_ARCH_LE) to syscall_get_arch() and add AUDIT_ARCH_PPC64LE
      variant.
      
      Without this,  perf trace and auditctl fail.
      
      Mainline kernel reports ppc64le (per a0588015) but there is no matching
      AUDIT_ARCH_PPC64LE.
      
      Since 32-bit PPC LE is not supported by audit, don't advertise it in
      AUDIT_ARCH_PPC* variants.
      
      See:
      	https://www.redhat.com/archives/linux-audit/2014-August/msg00082.html
      	https://www.redhat.com/archives/linux-audit/2014-December/msg00004.htmlSigned-off-by: NRichard Guy Briggs <rgb@redhat.com>
      Acked-by: NPaul Moore <paul@paul-moore.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      63f13448
    • A
      arch: Add lightweight memory barriers dma_rmb() and dma_wmb() · 1077fa36
      Alexander Duyck 提交于
      There are a number of situations where the mandatory barriers rmb() and
      wmb() are used to order memory/memory operations in the device drivers
      and those barriers are much heavier than they actually need to be.  For
      example in the case of PowerPC wmb() calls the heavy-weight sync
      instruction when for coherent memory operations all that is really needed
      is an lsync or eieio instruction.
      
      This commit adds a coherent only version of the mandatory memory barriers
      rmb() and wmb().  In most cases this should result in the barrier being the
      same as the SMP barriers for the SMP case, however in some cases we use a
      barrier that is somewhere in between rmb() and smp_rmb().  For example on
      ARM the rmb barriers break down as follows:
      
        Barrier   Call     Explanation
        --------- -------- ----------------------------------
        rmb()     dsb()    Data synchronization barrier - system
        dma_rmb() dmb(osh) data memory barrier - outer sharable
        smp_rmb() dmb(ish) data memory barrier - inner sharable
      
      These new barriers are not as safe as the standard rmb() and wmb().
      Specifically they do not guarantee ordering between coherent and incoherent
      memories.  The primary use case for these would be to enforce ordering of
      reads and writes when accessing coherent memory that is shared between the
      CPU and a device.
      
      It may also be noted that there is no dma_mb().  Most architectures don't
      provide a good mechanism for performing a coherent only full barrier without
      resorting to the same mechanism used in mb().  As such there isn't much to
      be gained in trying to define such a function.
      
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Cc: Michael Ellerman <michael@ellerman.id.au>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1077fa36
    • A
      arch: Cleanup read_barrier_depends() and comments · 8a449718
      Alexander Duyck 提交于
      This patch is meant to cleanup the handling of read_barrier_depends and
      smp_read_barrier_depends.  In multiple spots in the kernel headers
      read_barrier_depends is defined as "do {} while (0)", however we then go
      into the SMP vs non-SMP sections and have the SMP version reference
      read_barrier_depends, and the non-SMP define it as yet another empty
      do/while.
      
      With this commit I went through and cleaned out the duplicate definitions
      and reduced the number of definitions down to 2 per header.  In addition I
      moved the 50 line comments for the macro from the x86 and mips headers that
      defined it as an empty do/while to those that were actually defining the
      macro, alpha and blackfin.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8a449718
  8. 11 12月, 2014 2 次提交
  9. 08 12月, 2014 1 次提交
    • P
      powerpc/powernv: Return to cpu offline loop when finished in KVM guest · 56548fc0
      Paul Mackerras 提交于
      When a secondary hardware thread has finished running a KVM guest, we
      currently put that thread into nap mode using a nap instruction in
      the KVM code.  This changes the code so that instead of doing a nap
      instruction directly, we instead cause the call to power7_nap() that
      put the thread into nap mode to return.  The reason for doing this is
      to avoid having the KVM code having to know what low-power mode to
      put the thread into.
      
      In the case of a secondary thread used to run a KVM guest, the thread
      will be offline from the point of view of the host kernel, and the
      relevant power7_nap() call is the one in pnv_smp_cpu_disable().
      In this case we don't want to clear pending IPIs in the offline loop
      in that function, since that might cause us to miss the wakeup for
      the next time the thread needs to run a guest.  To tell whether or
      not to clear the interrupt, we use the SRR1 value returned from
      power7_nap(), and check if it indicates an external interrupt.  We
      arrange that the return from power7_nap() when we have finished running
      a guest returns 0, so pending interrupts don't get flushed in that
      case.
      
      Note that it is important a secondary thread that has finished
      executing in the guest, or that didn't have a guest to run, should
      not return to power7_nap's caller while the kvm_hstate.hwthread_req
      flag in the PACA is non-zero, because the return from power7_nap
      will reenable the MMU, and the MMU might still be in guest context.
      In this situation we spin at low priority in real mode waiting for
      hwthread_req to become zero.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      56548fc0
  10. 05 12月, 2014 1 次提交
    • A
      powerpc/mm: don't do tlbie for updatepp request with NO HPTE fault · aefa5688
      Aneesh Kumar K.V 提交于
      upatepp can get called for a nohpte fault when we find from the linux
      page table that the translation was hashed before. In that case
      we are sure that there is no existing translation, hence we could
      avoid doing tlbie.
      
      We could possibly race with a parallel fault filling the TLB. But
      that should be ok because updatepp is only ever relaxing permissions.
      We also look at linux pte permission bits when filling hash pte
      permission bits. We also hold the linux pte busy bits while
      inserting/updating a hashpte entry, hence a paralle update of
      linux pte is not possible. On the other hand mprotect involves
      ptep_modify_prot_start which cause a hpte invalidate and not updatepp.
      
      Performance number:
      We use randbox_access_bench written by Anton.
      
      Kernel with THP disabled and smaller hash page table size.
      
          86.60%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_updatepp
           2.10%  random_access_b  random_access_bench              [.] doit
           1.99%  random_access_b  [kernel.kallsyms]                [k] .do_raw_spin_lock
           1.85%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           1.26%  random_access_b  [kernel.kallsyms]                [k] .native_flush_hash_range
           1.18%  random_access_b  [kernel.kallsyms]                [k] .__delay
           0.69%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           0.37%  random_access_b  [kernel.kallsyms]                [k] .clear_user_page
           0.34%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           0.32%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           0.30%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
      
      With Fix:
      
          27.54%  random_access_b  random_access_bench              [.] doit
          22.90%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           5.76%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           5.20%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           5.12%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           4.80%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
           3.31%  random_access_b  [kernel.kallsyms]                [k] data_access_common
           1.84%  random_access_b  [kernel.kallsyms]                [k] .trace_hardirqs_on_caller
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      aefa5688
  11. 02 12月, 2014 5 次提交
  12. 24 11月, 2014 1 次提交
  13. 18 11月, 2014 1 次提交
  14. 17 11月, 2014 3 次提交
    • W
      mmu_gather: move minimal range calculations into generic code · fb7332a9
      Will Deacon 提交于
      On architectures with hardware broadcasting of TLB invalidation messages
      , it makes sense to reduce the range of the mmu_gather structure when
      unmapping page ranges based on the dirty address information passed to
      tlb_remove_tlb_entry.
      
      arm64 already does this by directly manipulating the start/end fields
      of the gather structure, but this confuses the generic code which
      does not expect these fields to change and can end up calculating
      invalid, negative ranges when forcing a flush in zap_pte_range.
      
      This patch moves the minimal range calculation out of the arm64 code
      and into the generic implementation, simplifying zap_pte_range in the
      process (which no longer needs to care about start/end, since they will
      point to the appropriate ranges already). With the range being tracked
      by core code, the need_flush flag is dropped in favour of checking that
      the end of the range has actually been set.
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Cc: Michal Simek <monstr@monstr.eu>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      fb7332a9
    • N
      rtc/tpo: Driver to support rtc and wakeup on PowerNV platform · 16b1d26e
      Neelesh Gupta 提交于
      The patch implements the OPAL rtc driver that binds with the rtc
      driver subsystem. The driver uses the platform device infrastructure
      to probe the rtc device and register it to rtc class framework. The
      'wakeup' is supported depending upon the property 'has-tpo' present
      in the OF node. It provides a way to load the generic rtc driver in
      in the absence of an OPAL driver.
      
      The patch also moves the existing OPAL rtc get/set time interfaces to the
      new driver and exposes the necessary OPAL calls using EXPORT_SYMBOL_GPL.
      
      Test results:
      -------------
      Host:
      [root@tul169p1 ~]# ls -l /sys/class/rtc/
      total 0
      lrwxrwxrwx 1 root root 0 Oct 14 03:07 rtc0 -> ../../devices/opal-rtc/rtc/rtc0
      [root@tul169p1 ~]# cat /sys/devices/opal-rtc/rtc/rtc0/time
      08:10:07
      [root@tul169p1 ~]# echo `date '+%s' -d '+ 2 minutes'` > /sys/class/rtc/rtc0/wakealarm
      [root@tul169p1 ~]# cat /sys/class/rtc/rtc0/wakealarm
      1413274345
      [root@tul169p1 ~]#
      
      FSP:
      $ smgr mfgState
      standby
      $ rtim timeofday
      
      System time is valid: 2014/10/14 08:12:04.225115
      
      $ smgr mfgState
      ipling
      $
      
      CC: devicetree@vger.kernel.org
      CC: tglx@linutronix.de
      CC: rtc-linux@googlegroups.com
      CC: a.zummo@towertech.it
      Signed-off-by: NNeelesh Gupta <neelegup@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      16b1d26e
    • V
      powerpc: Use generic PIE randomization · 59994fb0
      Vineeth Vijayan 提交于
      Back in 2009 we merged 501cb16d "Randomise PIEs", which added support for
      randomizing PIE (Position Independent Executable) binaries.
      
      That commit added randomize_et_dyn(), which correctly randomized the addresses,
      but failed to honor PF_RANDOMIZE. That means it was not possible to disable PIE
      randomization via the personality flag, or /proc/sys/kernel/randomize_va_space.
      
      Since then there has been generic support for PIE randomization added to
      binfmt_elf.c, selectable via ARCH_BINFMT_ELF_RANDOMIZE_PIE.
      
      Enabling that allows us to drop randomize_et_dyn(), which means we start
      honoring PF_RANDOMIZE correctly.
      
      It also causes a fairly major change to how we layout PIE binaries.
      
      Currently we will place the binary at 512MB-520MB for 32 bit binaries, or
      512MB-1.5GB for 64 bit binaries, eg:
      
          $ cat /proc/$$/maps
          4e550000-4e580000 r-xp 00000000 08:02 129813       /bin/dash
          4e580000-4e590000 rw-p 00020000 08:02 129813       /bin/dash
          10014110000-10014140000 rw-p 00000000 00:00 0      [heap]
          3fffaa3f0000-3fffaa5a0000 r-xp 00000000 08:02 921  /lib/powerpc64le-linux-gnu/libc-2.19.so
          3fffaa5a0000-3fffaa5b0000 rw-p 001a0000 08:02 921  /lib/powerpc64le-linux-gnu/libc-2.19.so
          3fffaa5c0000-3fffaa5d0000 rw-p 00000000 00:00 0
          3fffaa5d0000-3fffaa5f0000 r-xp 00000000 00:00 0    [vdso]
          3fffaa5f0000-3fffaa620000 r-xp 00000000 08:02 1246 /lib/powerpc64le-linux-gnu/ld-2.19.so
          3fffaa620000-3fffaa630000 rw-p 00020000 08:02 1246 /lib/powerpc64le-linux-gnu/ld-2.19.so
          3ffffc340000-3ffffc370000 rw-p 00000000 00:00 0    [stack]
      
      With this commit applied we don't do any special randomisation for the binary,
      and instead rely on mmap randomisation. This means the binary ends up at high
      addresses, eg:
      
          $ cat /proc/$$/maps
          3fff99820000-3fff999d0000 r-xp 00000000 08:02 921    /lib/powerpc64le-linux-gnu/libc-2.19.so
          3fff999d0000-3fff999e0000 rw-p 001a0000 08:02 921    /lib/powerpc64le-linux-gnu/libc-2.19.so
          3fff999f0000-3fff99a00000 rw-p 00000000 00:00 0
          3fff99a00000-3fff99a20000 r-xp 00000000 00:00 0      [vdso]
          3fff99a20000-3fff99a50000 r-xp 00000000 08:02 1246   /lib/powerpc64le-linux-gnu/ld-2.19.so
          3fff99a50000-3fff99a60000 rw-p 00020000 08:02 1246   /lib/powerpc64le-linux-gnu/ld-2.19.so
          3fff99a60000-3fff99a90000 r-xp 00000000 08:02 129813 /bin/dash
          3fff99a90000-3fff99aa0000 rw-p 00020000 08:02 129813 /bin/dash
          3fffc3de0000-3fffc3e10000 rw-p 00000000 00:00 0      [stack]
          3fffc55e0000-3fffc5610000 rw-p 00000000 00:00 0      [heap]
      
      Although this should be OK, it's possible it might break badly written
      binaries that make assumptions about the address space layout.
      Signed-off-by: NVineeth Vijayan <vvijayan@mvista.com>
      [mpe: Rewrite changelog]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      59994fb0
  15. 14 11月, 2014 3 次提交
  16. 12 11月, 2014 2 次提交
  17. 10 11月, 2014 4 次提交