1. 04 Mar, 2015 1 commit
    • powerpc/smp: Wait until secondaries are active & online · 875ebe94
      Committed by Michael Ellerman
      Anton has a busy ppc64le KVM box where guests sometimes hit the infamous
      "kernel BUG at kernel/smpboot.c:134!" issue during boot:
      
        BUG_ON(td->cpu != smp_processor_id());
      
      Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
      output confirms it:
      
        CPU: 0
        Comm: watchdog/130
      
      The problem is that we aren't ensuring the CPU active bit is set for the
      secondary before allowing the master to continue on. The master unparks
      the secondary CPU's kthreads and the scheduler looks for a CPU to run
      on. It calls select_task_rq() and realises the suggested CPU is not in
      the cpus_allowed mask. It then ends up in select_fallback_rq(), and
      since the active bit isn't set we choose some other CPU to run on.
      
      This seems to have been introduced by 6acbfb96 "sched: Fix hotplug
      vs. set_cpus_allowed_ptr()", which changed from setting active before
      online to setting active after online. However that was in turn fixing a
      bug where other code assumed an active CPU was also online, so we can't
      just revert that fix.
      
      The simplest fix is just to spin waiting for both active & online to be
      set. We already have a barrier prior to set_cpu_online() (which also
      sets active), to ensure all other setup is completed before online &
      active are set.
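
As a rough illustration of the race and the fix, the following Python sketch models the online and active masks as sets. It is a toy model, not kernel code: `pick_cpu` and `wait_for_secondary` are hypothetical names, and the lowest-numbered fallback is an assumption chosen only to keep the example deterministic.

```python
def pick_cpu(preferred, online, active):
    """Toy stand-in for the scheduler's choice of CPU for a per-CPU thread."""
    if preferred in online and preferred in active:
        return preferred
    # Fallback path: the preferred CPU is not usable yet, so some other
    # CPU that is both online and active is chosen instead.
    return min(online & active)


def wait_for_secondary(cpu, online, active, step):
    """Spin until both bits are set for `cpu`.

    `step` is a hypothetical callback that lets the simulated secondary
    make progress; the return value is the number of spins taken.
    """
    spins = 0
    while not (cpu in online and cpu in active):
        step()
        spins += 1
    return spins


# The bug: CPU 130 is online but not yet active, so its watchdog thread
# would land on CPU 0.
online, active = {0, 130}, {0}
wrong_cpu = pick_cpu(130, online, active)

# Once active is also set, the thread runs where it belongs.
active.add(130)
right_cpu = pick_cpu(130, online, active)
```

In this model the master only avoids the fallback path if it calls something like `wait_for_secondary` before unparking the secondary's threads.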
      
      Fixes: 6acbfb96 ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  2. 18 Feb, 2015 1 commit
  3. 14 Feb, 2015 1 commit
  4. 13 Feb, 2015 2 commits
    • powerpc: add running_clock for powerpc to prevent spurious softlockup warnings · 4be1b297
      Committed by Cyril Bur
      On POWER8 virtualised kernels the VTB register can be read to have a view
      of time that only increases while the guest is running.  This will prevent
      guests from seeing time jump if a guest is paused for significant amounts
      of time.
      
      On POWER7 and below virtualised kernels stolen time is subtracted from
      local_clock as a best effort approximation.  This will not eliminate
      spurious warnings in the case of a suspended guest but may reduce the
      occurrence in the case of softlockups due to host over commit.
      
      Bare metal kernels should avoid reading the VTB, as KVM does not restore
      sane values when not executing. The approximation is fine there, as host
      kernels won't observe any stolen time.
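
The three cases described above can be sketched as follows. This is a hedged Python illustration only: `running_clock_ns` and its parameters are hypothetical names, and it assumes the VTB reading has already been converted to nanoseconds.

```python
def running_clock_ns(local_clock_ns, stolen_ns=0, vtb_ns=None):
    """Approximate 'time the guest was actually running', in nanoseconds.

    vtb_ns: a VTB-derived value (POWER8 guest), assumed already in ns.
    stolen_ns: accumulated stolen time (POWER7 and below guests).
    On bare metal neither applies, so stolen_ns stays 0 and vtb_ns is None.
    """
    if vtb_ns is not None:
        # POWER8 guest: the VTB only advances while the guest runs.
        return vtb_ns
    # Older guests: best-effort approximation by subtracting stolen time.
    # Bare metal: stolen_ns is 0, so this is just local_clock.
    return local_clock_ns - stolen_ns
```

For example, a guest that saw 1000 ns of wall time with 300 ns stolen would report 700 ns of running time under this approximation.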
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrew Jones <drjones@redhat.com>
      Acked-by: Don Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: chai wen <chaiw.fnst@cn.fujitsu.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Ben Zhang <benzh@chromium.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • all arches, signal: move restart_block to struct task_struct · f56141e3
      Committed by Andy Lutomirski
      If an attacker can cause a controlled kernel stack overflow, overwriting
      the restart block is a very juicy exploit target.  This is because the
      restart_block is held in the same memory allocation as the kernel stack.
      
      Moving the restart block to struct task_struct prevents this exploit by
      making the restart_block harder to locate.
      
      Note that there are other fields in thread_info that are also easy
      targets, at least on some architectures.
      
      It's also a decent simplification, since the restart code is more or less
      identical on all architectures.
      
      [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: Richard Weinberger <richard@nod.at>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 02 Feb, 2015 3 commits
    • powerpc/kernel: Avoid initializing device-tree pointer twice · fe12545e
      Committed by Gavin Shan
      As commit 50ba08f3 ("of/fdt: Don't clear initial_boot_params
      if fdt_check_header() fails") does, the device-tree pointer
      "initial_boot_params" is initialized by early_init_dt_verify(),
      which is called by early_init_devtree(). So we needn't explicitly
      initialize that again in early_init_devtree().
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc: Remove old compile time disabled syscall tracing code · a4bcbe6a
      Committed by Michael Ellerman
      We have code to do syscall tracing which is disabled at compile time by
      default. It's not been touched since the dawn of time (ie. v2.6.12).
      
      There are now better ways to do syscall tracing, ie. using the
      raw_syscall, or syscall tracepoints.
      
      For the specific case of tracing syscalls at boot on a system that
      doesn't get to userspace, you can boot with:
      
        trace_event=syscalls tp_printk=on
      
      Which will trace syscalls from boot, and echo all output to the console.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/kernel: Make syscall_exit a local label · 4c3b2168
      Committed by Michael Ellerman
      Currently when we back trace something that is in a syscall we see
      something like this:
      
      [c000000000000000] [c000000000000000] SyS_read+0x6c/0x110
      [c000000000000000] [c000000000000000] syscall_exit+0x0/0x98
      
      Although it's entirely correct, seeing syscall_exit at the bottom can be
      confusing - we were exiting from a syscall and then called SyS_read() ?
      
      If we instead change syscall_exit to be a local label we get something
      more intuitive:
      
      [c0000001fa46fde0] [c00000000026719c] SyS_read+0x6c/0x110
      [c0000001fa46fe30] [c000000000009264] system_call+0x38/0xd0
      
      ie. we were handling a system call, and it was SyS_read().
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  6. 30 Jan, 2015 7 commits
  7. 28 Jan, 2015 1 commit
    • powerpc: Remove some unused functions · 8aa989b8
      Committed by Michael Ellerman
      Remove slice_set_psize() which is not used.
      
      It was added in 3a8247cc "powerpc: Only demote individual slices
      rather than whole process" but was never used.
      
      Remove vsx_assist_exception() which is not used.
      
      It was added in ce48b210 "powerpc: Add VSX context save/restore,
      ptrace and signal support" but was never used.
      
      Remove generic_mach_cpu_die() which is not used.
      
      Its last caller was removed in 375f561a "powerpc/powernv: Always go
      into nap mode when CPU is offline".
      
      Remove mpc7448_hpc2_power_off() and mpc7448_hpc2_halt() which are
      unused.
      
      These were introduced in c5d56332 "[POWERPC] Add general support for
      mpc7448hpc2 (Taiga) platform" but were never used.
      
      This was partially found by using a static code analysis program called
      cppcheck.
      Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
      [mpe: Update changelog with details on when/why they are unused]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  8. 27 Jan, 2015 1 commit
    • powerpc/pseries: Fix endian problems with LE migration · 3df76a9d
      Committed by Cyril Bur
      RTAS events require arguments be passed in big endian while hypercalls
      have their arguments passed in registers and the values should therefore
      be in CPU endian.
      
      The "ibm,suspend_me" 'RTAS' call makes a sequence of hypercalls to setup
      one true RTAS call. This means that "ibm,suspend_me" is handled
      specially in the ppc_rtas() syscall.
      
      The ppc_rtas() syscall has its arguments in big endian and can therefore
      pass these arguments directly to the RTAS call. "ibm,suspend_me" is
      handled specially from within ppc_rtas() (by calling rtas_ibm_suspend_me())
      which has left an endian bug on little endian systems due to the
      requirement of hypercalls. The return value from rtas_ibm_suspend_me()
      gets returned in cpu endian, and is left unconverted, also a bug on
      little endian systems.
      
      rtas_ibm_suspend_me() does not actually make use of the rtas_args that
      it is passed. This patch removes the convoluted use of the rtas_args
      struct to pass params to rtas_ibm_suspend_me() in favour of passing what
      it needs as actual arguments. This patch also ensures the two callers of
      rtas_ibm_suspend_me() pass function parameters in cpu endian and in the
      case of ppc_rtas(), converts the return value.
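
To make the endian mismatch concrete, here is a small Python sketch. It is not the kernel code: `cpu_to_be32`, `be32_to_cpu`, and `read_as_le` are local helpers written for this illustration (the first two mirror the names of the kernel macros), and the value used is arbitrary.

```python
import struct


def cpu_to_be32(value):
    """Encode a 32-bit value as big-endian bytes, as RTAS arguments require."""
    return struct.pack(">I", value)


def be32_to_cpu(raw):
    """Decode big-endian bytes back into a native integer."""
    return struct.unpack(">I", raw)[0]


def read_as_le(raw):
    """What a little-endian CPU sees if it loads the bytes without converting."""
    return struct.unpack("<I", raw)[0]


arg = 0x12345678
rtas_bytes = cpu_to_be32(arg)

# Passing the big-endian buffer contents straight into a hypercall
# register on an LE system yields a byte-reversed value.
unconverted = read_as_le(rtas_bytes)

# Converting first gives the hypercall the CPU-endian value it expects.
converted = be32_to_cpu(rtas_bytes)
```

Here `unconverted` is the byte-reversed form of `arg`, while `converted` equals `arg`, which is the distinction the patch enforces for both the parameters and the return value.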
      
      migrate_store() (the other caller of rtas_ibm_suspend_me()) is from a
      sysfs file which deals with everything in cpu endian so this function
      only underwent cleanup.
      
      This patch has been tested with KVM both LE and BE and on PowerVM both
      LE and BE. Under QEMU/KVM the migration happens without touching these
      code paths.
      
      For PowerVM there is no obvious regression on BE and the LE code path
      now provides the correct parameters to the hypervisor.
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  9. 23 Jan, 2015 6 commits
  10. 21 Jan, 2015 1 commit
  11. 17 Jan, 2015 1 commit
    • powerpc/PCI: Clip bridge windows to fit in upstream windows · 3ebfe46a
      Committed by Yinghai Lu
      Every PCI-PCI bridge window should fit inside an upstream bridge window
      because orphaned address space is unreachable from the primary side of the
      upstream bridge.  If we inherit invalid bridge windows that overlap an
      upstream window from firmware, clip them to fit and update the bridge
      accordingly.
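
The clipping rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the kernel's resource code: `clip_window` is a hypothetical helper, windows are modelled as inclusive `(start, end)` address pairs, and the addresses are made up.

```python
def clip_window(child, upstream):
    """Clip a bridge window so it fits inside the upstream window.

    Both arguments are inclusive (start, end) address ranges.
    Returns the clipped range, or None if the two do not overlap at all.
    """
    start = max(child[0], upstream[0])
    end = min(child[1], upstream[1])
    if start > end:
        return None  # nothing of the child window is reachable
    return (start, end)


# A firmware-provided window that starts below the upstream window
# is trimmed to the part that is actually reachable.
clipped = clip_window((0x1000, 0x2FFF), (0x2000, 0x3FFF))
```

A window that already fits is returned unchanged, so applying the function is harmless for well-formed firmware setups.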
      
      [bhelgaas: changelog]
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=85491
      Reported-by: Marek Kordik <kordikmarek@gmail.com>
      Fixes: 5b285415 ("PCI: Restrict 64-bit prefetchable bridge windows to 64-bit resources")
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: Paul Mackerras <paulus@samba.org>
      CC: Michael Ellerman <mpe@ellerman.id.au>
      CC: Gavin Shan <gwshan@linux.vnet.ibm.com>
      CC: Anton Blanchard <anton@samba.org>
      CC: Sebastian Ott <sebott@linux.vnet.ibm.com>
      CC: Wei Yang <weiyang@linux.vnet.ibm.com>
      CC: Andrew Murray <amurray@embedded-bits.co.uk>
      CC: linuxppc-dev@lists.ozlabs.org
  12. 29 Dec, 2014 3 commits
  13. 17 Dec, 2014 2 commits
    • KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register · 4a157d61
      Committed by Paul Mackerras
      There are two ways in which a guest instruction can be obtained from
      the guest in the guest exit code in book3s_hv_rmhandlers.S.  If the
      exit was caused by a Hypervisor Emulation interrupt (i.e. an illegal
      instruction), the offending instruction is in the HEIR register
      (Hypervisor Emulation Instruction Register).  If the exit was caused
      by a load or store to an emulated MMIO device, we load the instruction
      from the guest by turning data relocation on and loading the instruction
      with an lwz instruction.
      
      Unfortunately, in the case where the guest has opposite endianness to
      the host, these two methods give results of different endianness, but
      both get put into vcpu->arch.last_inst.  The HEIR value has been loaded
      using guest endianness, whereas the lwz will load the instruction using
      host endianness.  The rest of the code that uses vcpu->arch.last_inst
      assumes it was loaded using host endianness.
      
      To fix this, we define a new vcpu field to store the HEIR value.  Then,
      in kvmppc_handle_exit_hv(), we transfer the value from this new field to
      vcpu->arch.last_inst, doing a byte-swap if the guest and host endianness
      differ.
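
The conditional byte-swap described above can be modelled as follows. This is a hedged Python sketch rather than the KVM code: `swab32` is a local helper named after the kernel macro, `last_inst_from_heir` and its boolean parameters are hypothetical, and the instruction word is just an example value.

```python
def swab32(x):
    """Reverse the byte order of a 32-bit value."""
    return int.from_bytes(x.to_bytes(4, "big"), "little")


def last_inst_from_heir(heir, guest_is_le, host_is_le):
    """Return the HEIR value in host endianness.

    The HEIR holds the instruction as loaded with guest endianness, so a
    swap is needed only when guest and host endianness differ.
    """
    if guest_is_le != host_is_le:
        return swab32(heir)
    return heir


# Example: an LE guest on a BE host needs the swap; matching
# endianness leaves the value untouched.
swapped = last_inst_from_heir(0x7C0802A6, guest_is_le=True, host_is_le=False)
same = last_inst_from_heir(0x7C0802A6, guest_is_le=True, host_is_le=True)
```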
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
    • KVM: PPC: Book3S HV: Remove code for PPC970 processors · c17b98cf
      Committed by Paul Mackerras
      This removes the code that was added to enable HV KVM to work
      on PPC970 processors.  The PPC970 is an old CPU that doesn't
      support virtualizing guest memory.  Removing PPC970 support also
      lets us remove the code for allocating and managing contiguous
      real-mode areas, the code for the !kvm->arch.using_mmu_notifiers
      case, the code for pinning pages of guest memory when first
      accessed and keeping track of which pages have been pinned, and
      the code for handling H_ENTER hypercalls in virtual mode.
      
      Book3S HV KVM is now supported only on POWER7 and POWER8 processors.
      The KVM_CAP_PPC_RMA capability now always returns 0.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Alexander Graf <agraf@suse.de>
  14. 15 Dec, 2014 3 commits
    • powernv/powerpc: Add winkle support for offline cpus · 77b54e9f
      Committed by Shreyas B. Prabhu
      Winkle is a deep idle state supported in power8 chips. A core enters
      winkle when all the threads of the core enter winkle. In this state
      power supply to the entire chiplet i.e core, private L2 and private L3
      is turned off. As a result it gives higher powersavings compared to
      sleep.
      
      But entering winkle results in a total hypervisor state loss. Hence the
      hypervisor context has to be preserved before entering winkle and
      restored upon wake up.
      
      Power-on Reset Engine (PORE) is a dedicated engine which is responsible
      for powering on the chiplet during wake up. It can be programmed to
      restore the register contents of a few specific registers. This patch
      uses PORE to restore register state wherever possible and uses stack to
      save and restore rest of the necessary registers.
      
      With hypervisor state restore, things fall under three categories:
      per-core state, per-subcore state and per-thread state. To manage this,
      extend the infrastructure introduced for sleep. Mainly we add a paca
      variable subcore_sibling_mask. Using this and the core_idle_state we can
      distinguish first thread in core and subcore.
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powernv/cpuidle: Redesign idle states management · 7cba160a
      Committed by Shreyas B. Prabhu
      Deep idle states like sleep and winkle are per core idle states. A core
      enters these states only when all the threads enter either the
      particular idle state or a deeper one. There are tasks like fastsleep
      hardware bug workaround and hypervisor core state save which have to be
      done only by the last thread of the core entering deep idle state and
      similarly tasks like timebase resync, hypervisor core register restore
      that have to be done only by the first thread waking up from these
      states.
      
      The current idle state management does not have a way to distinguish the
      first/last thread of the core waking/entering idle states. Tasks like
      timebase resync are done for all the threads. This is not only is
      suboptimal, but can cause functionality issues when subcores and kvm is
      involved.
      
      This patch adds the necessary infrastructure to track idle states of
      threads in a per-core structure. It uses this info to perform tasks like
      fastsleep workaround and timebase resync only once per core.
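
A minimal Python sketch of the per-core bookkeeping idea follows. It is only a model: `CoreIdleState` and its methods are hypothetical names, it keeps one bit per running thread, and it ignores the locking and atomicity that a real implementation on concurrent hardware threads would need.

```python
class CoreIdleState:
    """Track which threads of one core are running (bit set) or idle (bit clear)."""

    def __init__(self, nthreads):
        self.running = (1 << nthreads) - 1  # all threads start out running

    def enter_idle(self, thread):
        """Mark `thread` idle; return True if it is the last thread to go idle."""
        self.running &= ~(1 << thread)
        return self.running == 0

    def exit_idle(self, thread):
        """Mark `thread` running; return True if it is the first thread to wake."""
        first = self.running == 0
        self.running |= 1 << thread
        return first


core = CoreIdleState(nthreads=4)
# Only the final thread entering idle should apply per-core work such as
# the fastsleep workaround.
last_flags = [core.enter_idle(t) for t in range(4)]
# Only the first thread waking should redo per-core work such as
# timebase resync.
first_flags = [core.exit_idle(t) for t in (2, 0)]
```

In this run `last_flags` is true only for the fourth thread and `first_flags` is true only for the first waker, which is the once-per-core behaviour the patch is after.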
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Originally-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: linux-pm@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Switch off MMU before entering nap/sleep/rvwinkle mode · 8117ac6a
      Committed by Paul Mackerras
      Currently, when going idle, we set the flag indicating that we are in
      nap mode (paca->kvm_hstate.hwthread_state) and then execute the nap
      (or sleep or rvwinkle) instruction, all with the MMU on.  This is bad
      for two reasons: (a) the architecture specifies that those instructions
      must be executed with the MMU off, and in fact with only the SF, HV, ME
      and possibly RI bits set, and (b) this introduces a race, because as
      soon as we set the flag, another thread can switch the MMU to a guest
      context.  If the race is lost, this thread will typically start looping
      on relocation-on ISIs at 0xc...4400.
      
      This fixes it by setting the MSR as required by the architecture before
      setting the flag or executing the nap/sleep/rvwinkle instruction.
      
      Cc: stable@vger.kernel.org
      [ shreyas@linux.vnet.ibm.com: Edited to handle LE ]
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  15. 09 Dec, 2014 1 commit
    • powerpc: Secondary CPUs must set cpu_callin_map after setting active and online · 7c5c92ed
      Committed by Anton Blanchard
      I have a busy ppc64le KVM box where guests sometimes hit the infamous
      "kernel BUG at kernel/smpboot.c:134!" issue during boot:
      
        BUG_ON(td->cpu != smp_processor_id());
      
      Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
      output confirms it:
      
        CPU: 0
        Comm: watchdog/130
      
      The problem is that we aren't ensuring the CPU active and online bits are set
      before allowing the master to continue on. The master unparks the secondary
      CPU's kthreads and the scheduler looks for a CPU to run on. It calls
      select_task_rq and realises the suggested CPU is not in the cpus_allowed
      mask. It then ends up in select_fallback_rq, and since the active and
      online bits aren't set we choose some other CPU to run on.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  16. 08 Dec, 2014 1 commit
    • powerpc/powernv: Return to cpu offline loop when finished in KVM guest · 56548fc0
      Committed by Paul Mackerras
      When a secondary hardware thread has finished running a KVM guest, we
      currently put that thread into nap mode using a nap instruction in
      the KVM code.  This changes the code so that instead of doing a nap
      instruction directly, we instead cause the call to power7_nap() that
      put the thread into nap mode to return.  The reason for doing this is
      to avoid the KVM code having to know what low-power mode to
      put the thread into.
      
      In the case of a secondary thread used to run a KVM guest, the thread
      will be offline from the point of view of the host kernel, and the
      relevant power7_nap() call is the one in pnv_smp_cpu_disable().
      In this case we don't want to clear pending IPIs in the offline loop
      in that function, since that might cause us to miss the wakeup for
      the next time the thread needs to run a guest.  To tell whether or
      not to clear the interrupt, we use the SRR1 value returned from
      power7_nap(), and check if it indicates an external interrupt.  We
      arrange that the return from power7_nap() when we have finished running
      a guest returns 0, so pending interrupts don't get flushed in that
      case.
      
      Note that it is important a secondary thread that has finished
      executing in the guest, or that didn't have a guest to run, should
      not return to power7_nap's caller while the kvm_hstate.hwthread_req
      flag in the PACA is non-zero, because the return from power7_nap
      will reenable the MMU, and the MMU might still be in guest context.
      In this situation we spin at low priority in real mode waiting for
      hwthread_req to become zero.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  17. 05 Dec, 2014 2 commits
    • powerpc/book3s: Fix partial invalidation of TLBs in MCE code. · 682e77c8
      Committed by Mahesh Salgaonkar
      The existing MCE code calls flush_tlb hook with IS=0 (single page) resulting
      in partial invalidation of TLBs which is not right. This patch fixes
      that by passing IS=0xc00 to invalidate whole TLB for successful recovery
      from TLB and ERAT errors.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm: don't do tlbie for updatepp request with NO HPTE fault · aefa5688
      Committed by Aneesh Kumar K.V
      updatepp can get called for a nohpte fault when we find from the linux
      page table that the translation was hashed before. In that case
      we are sure that there is no existing translation, hence we could
      avoid doing tlbie.
      
      We could possibly race with a parallel fault filling the TLB. But
      that should be ok because updatepp is only ever relaxing permissions.
      We also look at linux pte permission bits when filling hash pte
      permission bits. We also hold the linux pte busy bits while
      inserting/updating a hashpte entry, hence a parallel update of
      linux pte is not possible. On the other hand mprotect involves
      ptep_modify_prot_start which causes a hpte invalidate and not updatepp.
      
      Performance number:
      We use random_access_bench written by Anton.
      
      Kernel with THP disabled and smaller hash page table size.
      
          86.60%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_updatepp
           2.10%  random_access_b  random_access_bench              [.] doit
           1.99%  random_access_b  [kernel.kallsyms]                [k] .do_raw_spin_lock
           1.85%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           1.26%  random_access_b  [kernel.kallsyms]                [k] .native_flush_hash_range
           1.18%  random_access_b  [kernel.kallsyms]                [k] .__delay
           0.69%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           0.37%  random_access_b  [kernel.kallsyms]                [k] .clear_user_page
           0.34%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           0.32%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           0.30%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
      
      With Fix:
      
          27.54%  random_access_b  random_access_bench              [.] doit
          22.90%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           5.76%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           5.20%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           5.12%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           4.80%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
           3.31%  random_access_b  [kernel.kallsyms]                [k] data_access_common
           1.84%  random_access_b  [kernel.kallsyms]                [k] .trace_hardirqs_on_caller
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  18. 02 Dec, 2014 3 commits