1. 17 4月, 2015 1 次提交
    • A
      powerpc/mm/thp: Make page table walk safe against thp split/collapse · 691e95fd
      Aneesh Kumar K.V 提交于
      We can disable a THP split or a hugepage collapse by disabling irq.
      We do send IPI to all the cpus in the early part of split/collapse,
      and disabling local irq ensure we don't make progress with
      split/collapse. If the THP is getting split we return NULL from
      find_linux_pte_or_hugepte(). For all the current callers it should be ok.
      We need to be careful if we want to use returned pte_t pointer outside
      the irq disabled region. W.r.t to THP split, the pfn remains the same,
      but then a hugepage collapse will result in a pfn change. There are
      few steps we can take to avoid a hugepage collapse.One way is to take page
      reference inside the irq disable region. Other option is to take
      mmap_sem so that a parallel collapse will not happen. We can also
      disable collapse by taking pmd_lock. Another method used by kvm
      subsystem is to check whether we had a mmu_notifer update in between
      using mmu_notifier_retry().
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      691e95fd
  2. 14 4月, 2015 2 次提交
  3. 11 4月, 2015 11 次提交
  4. 27 3月, 2015 1 次提交
    • J
      powerpc/perf: add missing put_cpu_var in power_pmu_event_init · 68de8867
      Jan Stancek 提交于
      One path in power_pmu_event_init() calls get_cpu_var(), but is
      missing matching call to put_cpu_var(), which causes preemption
      imbalance and crash in user-space:
      
        Page fault in user mode with in_atomic() = 1 mm = c000001fefa5a280
        NIP = 3fff9bf2cae0  MSR = 900000014280f032
        Oops: Weird page fault, sig: 11 [#23]
        SMP NR_CPUS=2048 NUMA PowerNV
        Modules linked in: <snip>
        CPU: 43 PID: 10285 Comm: a.out Tainted: G      D         4.0.0-rc5+ #1
        task: c000001fe82c9200 ti: c000001fe835c000 task.ti: c000001fe835c000
        NIP: 00003fff9bf2cae0 LR: 00003fff9bee4898 CTR: 00003fff9bf2cae0
        REGS: c000001fe835fea0 TRAP: 0401   Tainted: G      D          (4.0.0-rc5+)
        MSR: 900000014280f032 <SF,HV,VEC,VSX,EE,PR,FP,ME,IR,DR,RI>  CR: 22000028  XER: 00000000
        CFAR: 00003fff9bee4894 SOFTE: 1
         GPR00: 00003fff9bee494c 00003fffe01c2ee0 00003fff9c084410 0000000010020068
         GPR04: 0000000000000000 0000000000000002 0000000000000008 0000000000000001
         GPR08: 0000000000000001 00003fff9c074a30 00003fff9bf2cae0 00003fff9bf2cd70
         GPR12: 0000000052000022 00003fff9c10b700
        NIP [00003fff9bf2cae0] 0x3fff9bf2cae0
        LR [00003fff9bee4898] 0x3fff9bee4898
        Call Trace:
        ---[ end trace 5d3d952b5d4185d4 ]---
      
        BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:41
        in_atomic(): 1, irqs_disabled(): 0, pid: 10285, name: a.out
        INFO: lockdep is turned off.
        CPU: 43 PID: 10285 Comm: a.out Tainted: G      D         4.0.0-rc5+ #1
        Call Trace:
        [c000001fe835f990] [c00000000089c014] .dump_stack+0x98/0xd4 (unreliable)
        [c000001fe835fa10] [c0000000000e4138] .___might_sleep+0x1d8/0x2e0
        [c000001fe835faa0] [c000000000888da8] .down_read+0x38/0x110
        [c000001fe835fb30] [c0000000000bf2f4] .exit_signals+0x24/0x160
        [c000001fe835fbc0] [c0000000000abde0] .do_exit+0xd0/0xe70
        [c000001fe835fcb0] [c00000000001f4c4] .die+0x304/0x450
        [c000001fe835fd60] [c00000000088e1f4] .do_page_fault+0x2d4/0x900
        [c000001fe835fe30] [c000000000008664] handle_page_fault+0x10/0x30
        note: a.out[10285] exited with preempt_count 1
      
      Reproducer:
        #include <stdio.h>
        #include <unistd.h>
        #include <syscall.h>
        #include <sys/types.h>
        #include <sys/stat.h>
        #include <linux/perf_event.h>
        #include <linux/hw_breakpoint.h>
      
        static struct perf_event_attr event = {
                .type = PERF_TYPE_RAW,
                .size = sizeof(struct perf_event_attr),
                .sample_type = PERF_SAMPLE_BRANCH_STACK,
                .branch_sample_type = PERF_SAMPLE_BRANCH_ANY_RETURN,
        };
      
        int main()
        {
                syscall(__NR_perf_event_open, &event, 0, -1, -1, 0);
        }
      Signed-off-by: NJan Stancek <jstancek@redhat.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      68de8867
  5. 07 3月, 2015 1 次提交
  6. 19 2月, 2015 1 次提交
    • P
      perf, powerpc: Fix up flush_branch_stack() users · acba3c7e
      Peter Zijlstra 提交于
      The recent LBR rework for x86 left a stray flush_branch_stack() user in
      the PowerPC code, fix that up.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joel Stanley <joel@jms.id.au>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      acba3c7e
  7. 02 2月, 2015 4 次提交
  8. 30 1月, 2015 2 次提交
  9. 12 12月, 2014 2 次提交
  10. 03 11月, 2014 1 次提交
    • C
      powerpc: Replace __get_cpu_var uses · 69111bac
      Christoph Lameter 提交于
      This still has not been merged and now powerpc is the only arch that does
      not have this change. Sorry about missing linuxppc-dev before.
      
      V2->V2
        - Fix up to work against 3.18-rc1
      
      __get_cpu_var() is used for multiple purposes in the kernel source. One of
      them is address calculation via the form &__get_cpu_var(x).  This calculates
      the address for the instance of the percpu variable of the current processor
      based on an offset.
      
      Other use cases are for storing and retrieving data from the current
      processors percpu area.  __get_cpu_var() can be used as an lvalue when
      writing data or on the right side of an assignment.
      
      __get_cpu_var() is defined as :
      
      __get_cpu_var() always only does an address determination. However, store
      and retrieve operations could use a segment prefix (or global register on
      other platforms) to avoid the address calculation.
      
      this_cpu_write() and this_cpu_read() can directly take an offset into a
      percpu area and use optimized assembly code to read and write per cpu
      variables.
      
      This patch converts __get_cpu_var into either an explicit address
      calculation using this_cpu_ptr() or into a use of this_cpu operations that
      use the offset.  Thereby address calculations are avoided and less registers
      are used when code is generated.
      
      At the end of the patch set all uses of __get_cpu_var have been removed so
      the macro is removed too.
      
      The patch set includes passes over all arches as well. Once these operations
      are used throughout then specialized macros can be defined in non -x86
      arches as well in order to optimize per cpu access by f.e.  using a global
      register that may be set to the per cpu base.
      
      Transformations done to __get_cpu_var()
      
      1. Determine the address of the percpu instance of the current processor.
      
      	DEFINE_PER_CPU(int, y);
      	int *x = &__get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(&y);
      
      2. Same as #1 but this time an array structure is involved.
      
      	DEFINE_PER_CPU(int, y[20]);
      	int *x = __get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(y);
      
      3. Retrieve the content of the current processors instance of a per cpu
      variable.
      
      	DEFINE_PER_CPU(int, y);
      	int x = __get_cpu_var(y)
      
         Converts to
      
      	int x = __this_cpu_read(y);
      
      4. Retrieve the content of a percpu struct
      
      	DEFINE_PER_CPU(struct mystruct, y);
      	struct mystruct x = __get_cpu_var(y);
      
         Converts to
      
      	memcpy(&x, this_cpu_ptr(&y), sizeof(x));
      
      5. Assignment to a per cpu variable
      
      	DEFINE_PER_CPU(int, y)
      	__get_cpu_var(y) = x;
      
         Converts to
      
      	__this_cpu_write(y, x);
      
      6. Increment/Decrement etc of a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y)++
      
         Converts to
      
      	__this_cpu_inc(y)
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: Paul Mackerras <paulus@samba.org>
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      [mpe: Fix build errors caused by set/or_softirq_pending(), and rework
            assignment in __set_breakpoint() to use memcpy().]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      69111bac
  11. 28 10月, 2014 1 次提交
    • P
      perf: Fix and clean up initialization of pmu::event_idx · c719f560
      Peter Zijlstra 提交于
      Andy reported that the current state of event_idx is rather confused.
      So remove all but the x86_pmu implementation and change the default to
      return 0 (the safe option).
      Reported-by: NAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Cody P Schafer <dev@codyps.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
      Cc: Himangi Saraogi <himangi774@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: sukadev@linux.vnet.ibm.com <sukadev@linux.vnet.ibm.com>
      Cc: Thomas Huth <thuth@linux.vnet.ibm.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux390@de.ibm.com
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-s390@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      c719f560
  12. 07 10月, 2014 2 次提交
  13. 25 9月, 2014 1 次提交
  14. 09 9月, 2014 1 次提交
    • A
      powerpc/perf: Fix ABIv2 kernel backtraces · 85101af1
      Anton Blanchard 提交于
      ABIv2 kernels are failing to backtrace through the kernel. An example:
      
      39.30%  readseek2_proce  [kernel.kallsyms]    [k] find_get_entry
                  |
                  --- find_get_entry
                     __GI___libc_read
      
      The problem is in valid_next_sp() where we check that the new stack
      pointer is at least STACK_FRAME_OVERHEAD below the previous one.
      
      ABIv1 has a minimum stack frame size of 112 bytes consisting of 48 bytes
      and 64 bytes of parameter save area. ABIv2 changes that to 32 bytes
      with no paramter save area.
      
      STACK_FRAME_OVERHEAD is in theory the minimum stack frame size,
      but we over 240 uses of it, some of which assume that it includes
      space for the parameter area.
      
      We need to work through all our stack defines and rationalise them
      but let's fix perf now by creating STACK_FRAME_MIN_SIZE and using
      in valid_next_sp(). This fixes the issue:
      
      30.64%  readseek2_proce  [kernel.kallsyms]    [k] find_get_entry
                  |
                  --- find_get_entry
                     pagecache_get_page
                     generic_file_read_iter
                     new_sync_read
                     vfs_read
                     sys_read
                     syscall_exit
                     __GI___libc_read
      
      Cc: stable@vger.kernel.org # 3.16+
      Reported-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      85101af1
  15. 27 8月, 2014 2 次提交
    • T
      Revert "powerpc: Replace __get_cpu_var uses" · 23f66e2d
      Tejun Heo 提交于
      This reverts commit 5828f666 due to
      build failure after merging with pending powerpc changes.
      
      Link: http://lkml.kernel.org/g/20140827142243.6277eaff@canb.auug.org.auSigned-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      23f66e2d
    • C
      powerpc: Replace __get_cpu_var uses · 5828f666
      Christoph Lameter 提交于
      __get_cpu_var() is used for multiple purposes in the kernel source. One of
      them is address calculation via the form &__get_cpu_var(x).  This calculates
      the address for the instance of the percpu variable of the current processor
      based on an offset.
      
      Other use cases are for storing and retrieving data from the current
      processors percpu area.  __get_cpu_var() can be used as an lvalue when
      writing data or on the right side of an assignment.
      
      __get_cpu_var() is defined as :
      
      #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
      
      __get_cpu_var() always only does an address determination. However, store
      and retrieve operations could use a segment prefix (or global register on
      other platforms) to avoid the address calculation.
      
      this_cpu_write() and this_cpu_read() can directly take an offset into a
      percpu area and use optimized assembly code to read and write per cpu
      variables.
      
      This patch converts __get_cpu_var into either an explicit address
      calculation using this_cpu_ptr() or into a use of this_cpu operations that
      use the offset.  Thereby address calculations are avoided and less registers
      are used when code is generated.
      
      At the end of the patch set all uses of __get_cpu_var have been removed so
      the macro is removed too.
      
      The patch set includes passes over all arches as well. Once these operations
      are used throughout then specialized macros can be defined in non -x86
      arches as well in order to optimize per cpu access by f.e.  using a global
      register that may be set to the per cpu base.
      
      Transformations done to __get_cpu_var()
      
      1. Determine the address of the percpu instance of the current processor.
      
      	DEFINE_PER_CPU(int, y);
      	int *x = &__get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(&y);
      
      2. Same as #1 but this time an array structure is involved.
      
      	DEFINE_PER_CPU(int, y[20]);
      	int *x = __get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(y);
      
      3. Retrieve the content of the current processors instance of a per cpu
      variable.
      
      	DEFINE_PER_CPU(int, y);
      	int x = __get_cpu_var(y)
      
         Converts to
      
      	int x = __this_cpu_read(y);
      
      4. Retrieve the content of a percpu struct
      
      	DEFINE_PER_CPU(struct mystruct, y);
      	struct mystruct x = __get_cpu_var(y);
      
         Converts to
      
      	memcpy(&x, this_cpu_ptr(&y), sizeof(x));
      
      5. Assignment to a per cpu variable
      
      	DEFINE_PER_CPU(int, y)
      	__get_cpu_var(y) = x;
      
         Converts to
      
      	__this_cpu_write(y, x);
      
      6. Increment/Decrement etc of a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y)++
      
         Converts to
      
      	__this_cpu_inc(y)
      
      tj: Folded a fix patch.
          http://lkml.kernel.org/g/alpine.DEB.2.11.1408172143020.9652@gentwo.org
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: Paul Mackerras <paulus@samba.org>
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      5828f666
  16. 13 8月, 2014 1 次提交
  17. 28 7月, 2014 3 次提交
    • M
      powerpc/perf: Add per-event excludes on Power8 · 9de5cb0f
      Michael Ellerman 提交于
      Power8 has a new register (MMCR2), which contains individual freeze bits
      for each counter. This is an improvement on previous chips as it means
      we can have multiple events on the PMU at the same time with different
      exclude_{user,kernel,hv} settings. Previously we had to ensure all
      events on the PMU had the same exclude settings.
      
      The core of the patch is fairly simple. We use the 207S feature flag to
      indicate that the PMU backend supports per-event excludes, if it's set
      we skip the generic logic that enforces the equality of excludes between
      events. We also use that flag to skip setting the freeze bits in MMCR0,
      the PMU backend is expected to have handled setting them in MMCR2.
      
      The complication arises with EBB. The FCxP bits in MMCR2 are accessible
      R/W to a task using EBB. Which means a task using EBB will be able to
      see that we are using MMCR2 for freezing, whereas the old logic which
      used MMCR0 is not user visible.
      
      The task can not see or affect exclude_kernel & exclude_hv, so we only
      need to consider exclude_user.
      
      The table below summarises the behaviour both before and after this
      commit is applied:
      
       exclude_user           true  false
       ------------------------------------
              | User visible |  N    N
       Before | Can freeze   |  Y    Y
              | Can unfreeze |  N    Y
       ------------------------------------
              | User visible |  Y    Y
        After | Can freeze   |  Y    Y
              | Can unfreeze |  Y/N  Y
       ------------------------------------
      
      So firstly I assert that the simple visibility of the exclude_user
      setting in MMCR2 is a non-issue. The event belongs to the task, and
      was most likely created by the task. So the exclude_user setting is not
      privileged information in any way.
      
      Secondly, the behaviour in the exclude_user = false case is unchanged.
      This is important as it is the case that is actually useful, ie. the
      event is created with no exclude setting and the task uses MMCR2 to
      implement exclusion manually.
      
      For exclude_user = true there is no meaningful change to freezing the
      event. Previously the task could use MMCR2 to freeze the event, though
      it was already frozen with MMCR0. With the new code the task can use
      MMCR2 to freeze the event, though it was already frozen with MMCR2.
      
      The only real change is when exclude_user = true and the task tries to
      use MMCR2 to unfreeze the event. Previously this had no effect, because
      the event was already frozen in MMCR0. With the new code the task can
      unfreeze the event in MMCR2, but at some indeterminate time in the
      future the kernel will overwrite its setting and refreeze the event.
      
      Therefore my final assertion is that any task using exclude_user = true
      and also fiddling with MMCR2 was deeply confused before this change, and
      remains so after it.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      9de5cb0f
    • M
      powerpc/perf: Pass the struct perf_events down to compute_mmcr() · 8abd818f
      Michael Ellerman 提交于
      To support per-event exclude settings on Power8 we need access to the
      struct perf_events in compute_mmcr().
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      8abd818f
    • M
      powerpc/perf: Clear all MMCR settings before calling compute_mmcr() · 79a4cb28
      Michael Ellerman 提交于
      Because we reuse cpuhw->mmcr on each call to compute_mmcr() there's a
      risk that we could forget to set one of the values and use whatever
      value was in there previously.
      
      Currently all the implementations are careful to set all the values, but
      it's safer to clear them all before we call compute_mmcr().
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      79a4cb28
  18. 23 7月, 2014 1 次提交
  19. 11 7月, 2014 2 次提交
    • A
      powerpc/perf: Never program book3s PMCs with values >= 0x80000000 · f5602941
      Anton Blanchard 提交于
      We are seeing a lot of PMU warnings on POWER8:
      
          Can't find PMC that caused IRQ
      
      Looking closer, the active PMC is 0 at this point and we took a PMU
      exception on the transition from negative to 0. Some versions of POWER8
      have an issue where they edge detect and not level detect PMC overflows.
      
      A number of places program the PMC with (0x80000000 - period_left),
      where period_left can be negative. We can either fix all of these or
      just ensure that period_left is always >= 1.
      
      This patch takes the second option.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      f5602941
    • J
      powerpc/perf: Clear MMCR2 when enabling PMU · b50a6c58
      Joel Stanley 提交于
      On POWER8 when switching to a KVM guest we set bits in MMCR2 to freeze
      the PMU counters. Aside from on boot they are then never reset,
      resulting in stuck perf counters for any user in the guest or host.
      
      We now set MMCR2 to 0 whenever enabling the PMU, which provides a sane
      state for perf to use the PMU counters under either the guest or the
      host.
      
      This was manifesting as a bug with ppc64_cpu --frequency:
      
          $ sudo ppc64_cpu --frequency
          WARNING: couldn't run on cpu 0
          WARNING: couldn't run on cpu 8
            ...
          WARNING: couldn't run on cpu 144
          WARNING: couldn't run on cpu 152
          min:    18446744073.710 GHz (cpu -1)
          max:    0.000 GHz (cpu -1)
          avg:    0.000 GHz
      
      The command uses a perf counter to measure CPU cycles over a fixed
      amount of time, in order to approximate the frequency of the machine.
      The counters were returning zero once a guest was started, regardless of
      weather it was still running or had been shut down.
      
      By dumping the value of MMCR2, it was observed that once a guest is
      running MMCR2 is set to 1s - which stops counters from running:
      
          $ sudo sh -c 'echo p > /proc/sysrq-trigger'
          CPU: 0 PMU registers, ppmu = POWER8 n_counters = 6
          PMC1:  5b635e38 PMC2: 00000000 PMC3: 00000000 PMC4: 00000000
          PMC5:  1bf5a646 PMC6: 5793d378 PMC7: deadbeef PMC8: deadbeef
          MMCR0: 0000000080000000 MMCR1: 000000001e000000 MMCRA: 0000040000000000
          MMCR2: fffffffffffffc00 EBBHR: 0000000000000000
          EBBRR: 0000000000000000 BESCR: 0000000000000000
          SIAR:  00000000000a51cc SDAR:  c00000000fc40000 SIER:  0000000001000000
      
      This is done unconditionally in book3s_hv_interrupts.S upon entering the
      guest, and the original value is only save/restored if the host has
      indicated it was using the PMU. This is okay, however the user of the
      PMU needs to ensure that it is in a defined state when it starts using
      it.
      
      Fixes: e05b9b9e ("powerpc/perf: Power8 PMU support")
      Cc: stable@vger.kernel.org
      Signed-off-by: NJoel Stanley <joel@jms.id.au>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      b50a6c58