1. 23 Mar, 2015 1 commit
    • D
      powerpc: Move Power Macintosh drivers to generic byteswappers · f5718726
      David Gibson committed
      ppc has special instruction forms to efficiently load and store values
      in non-native endianness.  These can be accessed via the arch-specific
      {ld,st}_le{16,32}() inlines in arch/powerpc/include/asm/swab.h.
      
      However, gcc is perfectly capable of generating the byte-reversing
      load/store instructions when using the normal, generic cpu_to_le*() and
      le*_to_cpu() functions, meaning the arch-specific functions don't have much
      point.
      
      Worse, the "le" in the names of the arch-specific functions is now
      misleading, because they always generate byte-reversing forms, but some
      ppc machines can now run a little-endian kernel.
      
      To start getting rid of the arch-specific forms, this patch removes them
      from all the old Power Macintosh drivers, replacing them with the
      generic byteswappers.
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      f5718726
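A minimal sketch of the idea behind the commit above, written as standalone C rather than kernel code. The helper names `sketch_load_le32()` and `sketch_store_le32()` are hypothetical and only stand in for the generic le32_to_cpu()/cpu_to_le32() style of access: the value is described in an endian-independent way and the compiler is left to pick the instructions.

```c
#include <stdint.h>

/* Hypothetical helper: read a 32-bit little-endian value from memory.
 * The result is the same on little-endian and big-endian hosts. */
static uint32_t sketch_load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: write a 32-bit value to memory in little-endian order. */
static void sketch_store_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)((v >> 8) & 0xff);
    p[2] = (unsigned char)((v >> 16) & 0xff);
    p[3] = (unsigned char)((v >> 24) & 0xff);
}
```

Because the helpers are named after the data layout and not after the instruction used, they stay correct whether the kernel itself runs big-endian or little-endian.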
  2. 04 Mar, 2015 2 commits
    • N
      powerpc/iommu: Remove IOMMU device references via bus notifier · 4ad04e59
      Nishanth Aravamudan committed
      After d905c5df ("PPC: POWERNV: move iommu_add_device earlier"), the
      refcnt on the kobject backing the IOMMU group for a PCI device is
      elevated by each call to pci_dma_dev_setup_pSeriesLP() (via
      set_iommu_table_base_and_group). When we go to dlpar a multi-function
      PCI device out:
      
              iommu_reconfig_notifier ->
                      iommu_free_table ->
                              iommu_group_put
                              BUG_ON(tbl->it_group)
      
      We trip this BUG_ON, because there are still references on the table, so
      it is not freed. Fix this by moving the powernv bus notifier to common
      code and calling it for both powernv and pseries.
      
      Fixes: d905c5df ("PPC: POWERNV: move iommu_add_device earlier")
      Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4ad04e59
    • M
      powerpc/smp: Wait until secondaries are active & online · 875ebe94
      Michael Ellerman committed
      Anton has a busy ppc64le KVM box where guests sometimes hit the infamous
      "kernel BUG at kernel/smpboot.c:134!" issue during boot:
      
        BUG_ON(td->cpu != smp_processor_id());
      
      Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops
      output confirms it:
      
        CPU: 0
        Comm: watchdog/130
      
      The problem is that we aren't ensuring the CPU active bit is set for the
      secondary before allowing the master to continue on. The master unparks
      the secondary CPU's kthreads and the scheduler looks for a CPU to run
      on. It calls select_task_rq() and realises the suggested CPU is not in
      the cpus_allowed mask. It then ends up in select_fallback_rq(), and
      since the active bit isn't set we choose some other CPU to run on.
      
      This seems to have been introduced by 6acbfb96 "sched: Fix hotplug
      vs. set_cpus_allowed_ptr()", which changed from setting active before
      online to setting active after online. However that was in turn fixing a
      bug where other code assumed an active CPU was also online, so we can't
      just revert that fix.
      
      The simplest fix is just to spin waiting for both active & online to be
      set. We already have a barrier prior to set_cpu_online() (which also
      sets active), to ensure all other setup is completed before online &
      active are set.
      
      Fixes: 6acbfb96 ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      875ebe94
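The "spin until both active and online are set" fix described above can be sketched in userspace C with C11 atomics. This is only an illustration under stated assumptions: the `SKETCH_*` bits and `sketch_*` functions are hypothetical stand-ins for the kernel's online and active CPU masks, not the patch itself.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU state bits standing in for the online and active masks. */
enum {
    SKETCH_CPU_ONLINE = 1u << 0,
    SKETCH_CPU_ACTIVE = 1u << 1,
};

/* True only when BOTH bits are set; online alone is not enough. */
static bool sketch_cpu_ready(unsigned int state)
{
    const unsigned int both = SKETCH_CPU_ONLINE | SKETCH_CPU_ACTIVE;

    return (state & both) == both;
}

/* Secondary side: after all other setup is done, publish online and active. */
static void sketch_secondary_mark_ready(atomic_uint *state)
{
    atomic_fetch_or_explicit(state, SKETCH_CPU_ONLINE | SKETCH_CPU_ACTIVE,
                             memory_order_release);
}

/* Master side: spin until the secondary has published both bits. */
static void sketch_wait_for_secondary(atomic_uint *state)
{
    while (!sketch_cpu_ready(atomic_load_explicit(state, memory_order_acquire)))
        ; /* busy-wait; kernel code would typically relax the CPU here */
}
```

The point of the check is that the master must not continue (and unpark the secondary's kthreads) while only one of the two bits is visible.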
  3. 23 Feb, 2015 1 commit
    • P
      powerpc: Re-enable dynticks · fea559f3
      Paul Clarke committed
      Implement arch_irq_work_has_interrupt() for powerpc
      
      Commit 9b01f5bf introduced a dependency on "IRQ work self-IPIs" for
      full dynamic ticks to be enabled, by expecting architectures to
      implement a suitable arch_irq_work_has_interrupt() routine.
      
      Several arches have implemented this routine, including x86 (3010279f)
      and arm (09f6edd4), but powerpc was omitted.
      
      This patch implements this routine for powerpc.
      
      The symptom, at boot (on powerpc systems) with "nohz_full=<CPU list>",
      is that the following is displayed:
      
           NO_HZ: Can't run full dynticks because arch doesn't support irq work self-IPIs
      
      after this patch:
      
           NO_HZ: Full dynticks CPUs: <CPU list>.
      
      Tested against 3.19.
      
      powerpc implements "IRQ work self-IPIs" by setting the decrementer to 1 in
      arch_irq_work_raise(), which causes a decrementer exception on the next
      timebase tick. We then handle the work in __timer_interrupt().
      
      CC: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Paul A. Clarke <pc@us.ibm.com>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      [mpe: Flesh out change log, fix ws & include guards, remove include of processor.h]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      fea559f3
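A toy model of the dependency described in the commit above: full dynticks is only enabled when the architecture reports that it can raise IRQ-work self-interrupts. Everything here is a hypothetical sketch in standalone C; the `sketch_*` names are invented for illustration and the real check lives in the kernel's NO_HZ code.

```c
#include <stdbool.h>

/* Fallback in this sketch: the arch cannot raise an IRQ-work self-interrupt. */
static bool sketch_generic_has_interrupt(void)
{
    return false;
}

/* powerpc-like hook: self-interrupts are available (per the commit message,
 * by setting the decrementer so that an exception fires on the next tick). */
static bool sketch_powerpc_has_interrupt(void)
{
    return true;
}

enum sketch_nohz_result {
    SKETCH_NOHZ_REFUSED, /* "Can't run full dynticks ..." case */
    SKETCH_NOHZ_FULL     /* "Full dynticks CPUs: ..." case */
};

/* Full dynticks is gated on the arch hook, passed in as a function pointer. */
static enum sketch_nohz_result
sketch_try_full_dynticks(bool (*has_interrupt)(void))
{
    if (!has_interrupt())
        return SKETCH_NOHZ_REFUSED;
    return SKETCH_NOHZ_FULL;
}
```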
  4. 19 Feb, 2015 1 commit
  5. 18 Feb, 2015 1 commit
  6. 17 Feb, 2015 1 commit
  7. 14 Feb, 2015 1 commit
  8. 13 Feb, 2015 6 commits
    • C
      powerpc: add running_clock for powerpc to prevent spurious softlockup warnings · 4be1b297
      Cyril Bur committed
      On POWER8 virtualised kernels the VTB register can be read to have a view
      of time that only increases while the guest is running.  This will prevent
      guests from seeing time jump if a guest is paused for significant amounts
      of time.
      
      On POWER7 and below virtualised kernels stolen time is subtracted from
      local_clock as a best effort approximation.  This will not eliminate
      spurious warnings in the case of a suspended guest but may reduce the
      occurrence in the case of softlockups due to host overcommit.
      
      Bare metal kernels should avoid reading the VTB, as KVM does not restore
      sane values when not executing. The approximation is fine, as host kernels
      won't observe any stolen time.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrew Jones <drjones@redhat.com>
      Acked-by: Don Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: chai wen <chaiw.fnst@cn.fujitsu.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Ben Zhang <benzh@chromium.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4be1b297
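The selection logic described in the commit above can be sketched as plain C. This is an assumption-laden model, not the kernel implementation: the struct, its field names and the idea that all inputs are already in nanoseconds are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the inputs; all times are assumed to be in ns. */
struct sketch_clock_inputs {
    bool     virtualised;    /* running as a guest? */
    bool     has_vtb;        /* POWER8-style VTB register available? */
    uint64_t vtb_ns;         /* time that only advances while the guest runs */
    uint64_t local_clock_ns; /* ordinary local clock */
    uint64_t stolen_ns;      /* accumulated stolen time, 0 on bare metal */
};

static uint64_t sketch_running_clock(const struct sketch_clock_inputs *in)
{
    /* Guests with a VTB get an exact "time spent running" view. */
    if (in->virtualised && in->has_vtb)
        return in->vtb_ns;

    /* Everyone else (older guests, bare metal) uses the approximation:
     * local clock minus stolen time. On bare metal stolen time is zero. */
    return in->local_clock_ns - in->stolen_ns;
}
```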
    • A
      all arches, signal: move restart_block to struct task_struct · f56141e3
      Andy Lutomirski committed
      If an attacker can cause a controlled kernel stack overflow, overwriting
      the restart block is a very juicy exploit target.  This is because the
      restart_block is held in the same memory allocation as the kernel stack.
      
      Moving the restart block to struct task_struct prevents this exploit by
      making the restart_block harder to locate.
      
      Note that there are other fields in thread_info that are also easy
      targets, at least on some architectures.
      
      It's also a decent simplification, since the restart code is more or less
      identical on all architectures.
      
      [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: Richard Weinberger <richard@nod.at>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f56141e3
    • M
      mm: remove remaining references to NUMA hinting bits and helpers · 21d9ee3e
      Mel Gorman committed
      This patch removes the NUMA PTE bits and associated helpers.  As a
      side-effect it increases the maximum possible swap space on x86-64.
      
      One potential source of problems is races between the marking of PTEs
      PROT_NONE, NUMA hinting faults and migration.  It must be guaranteed that
      a PTE being protected is not faulted in parallel, seen as a pte_none and
      corrupting memory.  The base case is safe but transhuge has had problems in
      the past due to a different migration mechanism and a dependence on the page
      lock to serialise migrations, and warrants a closer look.
      
      task_work hinting update			parallel fault
      ------------------------			--------------
      change_pmd_range
        change_huge_pmd
          __pmd_trans_huge_lock
            pmdp_get_and_clear
      						__handle_mm_fault
      						pmd_none
      						  do_huge_pmd_anonymous_page
      						  read? pmd_lock blocks until hinting complete, fail !pmd_none test
      						  write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none
            pmd_modify
            set_pmd_at
      
      task_work hinting update			parallel migration
      ------------------------			------------------
      change_pmd_range
        change_huge_pmd
          __pmd_trans_huge_lock
            pmdp_get_and_clear
      						__handle_mm_fault
      						  do_huge_pmd_numa_page
      						    migrate_misplaced_transhuge_page
      						    pmd_lock waits for updates to complete, recheck pmd_same
            pmd_modify
            set_pmd_at
      
      Both of those are safe and the case where a transhuge page is inserted
      during a protection update is unchanged.  The case where two processes try
      migrating at the same time is unchanged by this series so should still be
      ok.  I could not find a case where we are accidentally depending on the
      PTE not being cleared and flushed.  If one is missed, it'll manifest as
      corruption problems that start triggering shortly after this series is
      merged and only happen when NUMA balancing is enabled.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21d9ee3e
    • M
      ppc64: add paranoid warnings for unexpected DSISR_PROTFAULT · 842915f5
      Mel Gorman committed
      ppc64 should not be depending on DSISR_PROTFAULT and it's unexpected if
      they are triggered.  This patch adds warnings just in case they are being
      accidentally depended upon.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      842915f5
    • M
      mm: convert p[te|md]_numa users to p[te|md]_protnone_numa · 8a0516ed
      Mel Gorman committed
      Convert existing users of pte_numa and friends to the new helper.  Note
      that the kernel is broken after this patch is applied until the other page
      table modifiers are also altered.  This patch layout is to make review
      easier.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a0516ed
    • M
      mm: add p[te|md] protnone helpers for use by NUMA balancing · e7bb4b6d
      Mel Gorman committed
      This is a preparatory patch that introduces protnone helpers for automatic
      NUMA balancing.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7bb4b6d
  9. 12 Feb, 2015 3 commits
    • N
      arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma() · 1757bbd9
      Naoya Horiguchi committed
      We don't have to use mm_walk->private to pass vma to the callback function
      because of mm_walk->vma.  And walk_page_vma() is useful if we walk over a
      single vma.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1757bbd9
    • K
      mm: make FIRST_USER_ADDRESS unsigned long on all archs · d016bf7e
      Kirill A. Shutemov committed
      LKP has triggered a compiler warning after my recent patch "mm: account
      pmd page tables to the process":
      
          mm/mmap.c: In function 'exit_mmap':
       >> mm/mmap.c:2857:2: warning: right shift count >= width of type [enabled by default]
      
      The code:
      
       > 2857                WARN_ON(mm_nr_pmds(mm) >
         2858                                round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
      
      In this, on tile, we have FIRST_USER_ADDRESS defined as 0, so the result of
      round_up() has the same type, int, and PUD_SHIFT is not smaller than the
      width of that type.
      
      I think the best way to fix it is to define FIRST_USER_ADDRESS as unsigned
      long.  On every arch for consistency.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d016bf7e
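The type problem in the commit above can be reproduced in miniature. The sketch below uses a simplified stand-in for round_up() whose result has the type of its first argument; the macro names and the PUD_SIZE value are hypothetical, and `__typeof__` is a GCC/Clang extension.

```c
#include <stddef.h>

/* Simplified stand-in for round_up(): the result has the type of x.
 * Relies on the GCC/Clang __typeof__ extension. */
#define SKETCH_ROUND_UP(x, y) ((((x) - 1) | ((__typeof__(x))((y) - 1))) + 1)

#define SKETCH_FIRST_USER_ADDRESS_INT 0         /* old style: plain int */
#define SKETCH_FIRST_USER_ADDRESS_UL  0UL       /* new style: unsigned long */
#define SKETCH_PUD_SIZE               (1UL << 30) /* hypothetical value */

/* sizeof does not evaluate its operand, so these only inspect the type. */
static size_t sketch_width_with_int(void)
{
    return sizeof(SKETCH_ROUND_UP(SKETCH_FIRST_USER_ADDRESS_INT, SKETCH_PUD_SIZE));
}

static size_t sketch_width_with_ulong(void)
{
    return sizeof(SKETCH_ROUND_UP(SKETCH_FIRST_USER_ADDRESS_UL, SKETCH_PUD_SIZE));
}
```

With the plain `0`, the rounded value is only as wide as `int`, so a later right shift by a large PUD_SHIFT can exceed the width of the type; with `0UL` the whole expression is computed as `unsigned long`.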
    • N
      mm/hugetlb: reduce arch dependent code around follow_huge_* · 61f77eda
      Naoya Horiguchi committed
      Currently we have many duplicates in definitions around
      follow_huge_addr(), follow_huge_pmd(), and follow_huge_pud(), so this
      patch tries to remove them.  The basic idea is to put the default
      implementation for these functions in mm/hugetlb.c as weak symbols
      (regardless of CONFIG_ARCH_WANT_GENERAL_HUGETLB), and to implement
      arch-specific code only when the arch needs it.
      
      For follow_huge_addr(), only powerpc and ia64 have their own
      implementation, and in all other architectures this function just returns
      ERR_PTR(-EINVAL).  So this patch sets returning ERR_PTR(-EINVAL) as
      default.
      
      As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to
      always return 0 in your architecture (like in ia64 or sparc), it's never
      called (the callsite is optimized away) no matter how it is implemented.
      So in such architectures, we don't need an arch-specific implementation.
      
      In some architectures (like mips, s390 and tile), their current
      arch-specific follow_huge_(pmd|pud)() are effectively identical to the
      common code, so this patch lets these architectures use the common code.
      
      One exception is metag, where pmd_huge() could return non-zero but it
      expects follow_huge_pmd() to always return NULL.  This means that we need
      an arch-specific implementation which returns NULL.  This behavior looks
      strange to me (because non-zero pmd_huge() implies that the architecture
      supports PMD-based hugepages, so follow_huge_pmd() can/should return some
      relevant value), but that's beyond this cleanup patch, so let's keep it.
      
      Justification of non-trivial changes:
      - in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this
        patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE
        is true when follow_huge_pmd() can be called (note that pmd_huge() has
        the same check and always returns 0 for !MACHINE_HAS_HPAGE).
      - in s390 and mips, we use HPAGE_MASK instead of PMD_MASK as done in common
        code. This patch forces these archs to use PMD_MASK, but it's OK because
        they are identical in both archs.
        In s390, both HPAGE_SHIFT and PMD_SHIFT are 20.
        In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
        PMD_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but
        PTE_ORDER is always 0, so these are identical.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      61f77eda
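The "default implementation as a weak symbol" pattern from the commit above can be illustrated with the GCC/Clang `weak` attribute. This is a userspace sketch with hypothetical names; `sketch_err_ptr()` and `sketch_ptr_err()` only imitate the kernel's habit of encoding an error code in a pointer.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical imitation of encoding an errno value in a pointer. */
static inline void *sketch_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

static inline long sketch_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* Generic default. Because it is weak, a non-weak definition of the same
 * function in another object file replaces it at link time, so only the
 * architectures that need special behaviour have to provide one. */
__attribute__((weak)) void *sketch_follow_huge_addr(unsigned long address)
{
    (void)address;
    return sketch_err_ptr(-EINVAL);
}
```

When no other definition is linked in, callers simply see the default and get the encoded -EINVAL back.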
  10. 06 Feb, 2015 2 commits
    • P
      kvm: add halt_poll_ns module parameter · f7819512
      Paolo Bonzini committed
      This patch introduces a new module parameter for the KVM module; when it
      is present, KVM attempts a bit of polling on every HLT before scheduling
      itself out via kvm_vcpu_block.
      
      This parameter helps a lot for latency-bound workloads---in particular
      I tested it with O_DSYNC writes with a battery-backed disk in the host.
      In this case, writes are fast (because the data doesn't have to go all
      the way to the platters) but they cannot be merged by either the host or
      the guest.  KVM's performance here is usually around 30% of bare metal,
      or 50% if you use cache=directsync or cache=writethrough (these
      parameters prevent the guest from sending pointless flush requests, and
      at the same time they are not slow because of the battery-backed cache).
      The bad performance happens because on every halt the host CPU decides
      to halt itself too.  When the interrupt comes, the vCPU thread is then
      migrated to a new physical CPU, and in general the latency is horrible
      because the vCPU thread has to be scheduled back in.
      
      With this patch performance reaches 60-65% of bare metal and, more
      important, 99% of what you get if you use idle=poll in the guest.  This
      means that the tunable gets rid of this particular bottleneck, and more
      work can be done to improve performance in the kernel or QEMU.
      
      Of course there is some price to pay; every time an otherwise idle vCPU
      is interrupted, it will poll unnecessarily and thus
      impose a little load on the host.  The above results were obtained with
      a mostly random value of the parameter (500000), and the load was around
      1.5-2.5% CPU usage on one of the host's cores for each idle guest vCPU.
      
      The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
      that can be used to tune the parameter.  It counts how many HLT
      instructions received an interrupt during the polling period; each
      successful poll saves Linux from scheduling the VCPU thread out and back
      in, and may also avoid a likely trip to C1 and back for the physical CPU.
      
      While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
      Of these halts, almost all are failed polls.  During the benchmark,
      instead, basically all halts end within the polling period, except a more
      or less constant stream of 50 per second coming from vCPUs that are not
      running the benchmark.  The wasted time is thus very low.  Things may
      be slightly different for Windows VMs, which have a ~10 ms timer tick.
      
      The effect is also visible on Marcelo's recently-introduced latency
      test for the TSC deadline timer.  Though of course a non-RT kernel has
      awful latency bounds, the latency of the timer is around 8000-10000 clock
      cycles compared to 20000-120000 without setting halt_poll_ns.  For the TSC
      deadline timer, thus, the effect is both a smaller average latency and
      a smaller variance.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f7819512
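The poll-before-blocking idea from the commit above can be modelled with a fake clock. This is a hypothetical userspace sketch, not KVM code: the callback types, the `sketch_fake_vcpu` struct and the way time advances are all assumptions made so the example is deterministic.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t (*sketch_clock_fn)(void *ctx);
typedef bool (*sketch_pending_fn)(void *ctx);

/* Poll for up to halt_poll_ns before giving up.
 * Returns true on a successful poll (wakeup arrived while polling),
 * false if the caller should go on to block and be scheduled out. */
static bool sketch_halt_poll(uint64_t halt_poll_ns, sketch_clock_fn now,
                             sketch_pending_fn pending, void *ctx)
{
    uint64_t start;

    if (halt_poll_ns == 0)
        return false; /* polling disabled */

    start = now(ctx);
    do {
        if (pending(ctx))
            return true;
    } while (now(ctx) - start < halt_poll_ns);

    return false;
}

/* Fake vCPU used to drive the sketch: time advances by step_ns per clock
 * read, and an interrupt becomes pending once now_ns reaches irq_at_ns. */
struct sketch_fake_vcpu {
    uint64_t now_ns;
    uint64_t step_ns;
    uint64_t irq_at_ns;
};

static uint64_t sketch_fake_now(void *ctx)
{
    struct sketch_fake_vcpu *v = ctx;

    v->now_ns += v->step_ns;
    return v->now_ns;
}

static bool sketch_fake_pending(void *ctx)
{
    struct sketch_fake_vcpu *v = ctx;

    return v->now_ns >= v->irq_at_ns;
}
```

A counter of successful polls, like the halt_successful_poll stat mentioned in the commit, would be incremented where the function returns true.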
    • J
      mm/debug_pagealloc: fix build failure on ppc and some other archs · 7b02190c
      Joonsoo Kim committed
      Kim Phillips reported the following build failure.
      
        LD      init/built-in.o
        mm/built-in.o: In function `free_pages_prepare':
        mm/page_alloc.c:770: undefined reference to `.kernel_map_pages'
        mm/built-in.o: In function `prep_new_page':
        mm/page_alloc.c:933: undefined reference to `.kernel_map_pages'
        mm/built-in.o: In function `map_pages':
        mm/compaction.c:61: undefined reference to `.kernel_map_pages'
        make: *** [vmlinux] Error 1
      
      The reason for this problem is that commit 031bc574
      ("mm/debug-pagealloc: make debug-pagealloc boottime configurable")
      forgot to remove the old declaration of kernel_map_pages() for some
      architectures.  This patch removes them to fix the build failure.
      Reported-by: Kim Phillips <kim.phillips@freescale.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: David Howells <dhowells@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b02190c
  11. 04 Feb, 2015 3 commits
  12. 02 Feb, 2015 8 commits
    • C
      powerpc/perf/hv-gpci: add the remaining gpci requests · 97bf2640
      Cody P Schafer committed
      Add the remaining gpci requests that contain counters suitable for use
      by perf. Omit those that don't contain any counters (but note their
      omission).
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      97bf2640
    • C
      powerpc/perf/{hv-gpci, hv-common}: generate requests with counters annotated · 9e9f6010
      Cody P Schafer committed
      This adds (in req-gen/) a framework for defining gpci counter requests.
      It uses macro magic similar to ftrace.
      
      Also convert the existing hv-gpci request structures and enum values to
      use the new framework (and adjust old users of the structs and enum
      values to cope with changes in naming).
      
      In exchange for this macro disaster, we get autogenerated event listing
      for GPCI in sysfs, build time field offset checking, and zero
      duplication of information about GPCI requests.
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      9e9f6010
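The kind of "macro magic" the commit above refers to can be sketched with the X-macro technique: one list of fields is expanded several times to produce a struct, build-time offset checks and a name table. This is not the req-gen framework itself; the field list and every `SKETCH_`/`sketch_` name are invented for illustration, and the code assumes a C11 compiler for `_Static_assert`.

```c
#include <stddef.h>
#include <stdint.h>

/* Single source of truth: X(expected_offset, c_type, field_name).
 * The fields are hypothetical, not real GPCI counters. */
#define SKETCH_FIELDS(X)        \
    X(0,  uint64_t, cycles)     \
    X(8,  uint32_t, dispatches) \
    X(12, uint32_t, interrupts)

/* Expansion 1: the structure definition. */
#define SKETCH_X_MEMBER(off, type, name) type name;
struct sketch_request {
    SKETCH_FIELDS(SKETCH_X_MEMBER)
};

/* Expansion 2: build-time checks that each field sits at its stated offset. */
#define SKETCH_X_CHECK(off, type, name)                          \
    _Static_assert(offsetof(struct sketch_request, name) == (off), \
                   "offset mismatch for " #name);
SKETCH_FIELDS(SKETCH_X_CHECK)

/* Expansion 3: a table of field names, e.g. for listing events. */
#define SKETCH_X_NAME(off, type, name) #name,
static const char *const sketch_field_names[] = {
    SKETCH_FIELDS(SKETCH_X_NAME)
};

#define SKETCH_FIELD_COUNT \
    (sizeof(sketch_field_names) / sizeof(sketch_field_names[0]))
```

Adding a field means touching only the `SKETCH_FIELDS` list; the struct, the checks and the name table follow automatically, which is the "zero duplication" trade-off the commit describes.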
    • C
      powerpc/perf/hv-24x7: parse catalog and populate sysfs with events · 5c5cd7b5
      Cody P Schafer committed
      Retrieves and parses the 24x7 catalog on POWER systems that supply it
      (right now, only POWER 8). Events are exposed via sysfs in the standard
      fashion, and are all parameterized.
      
      	$ cd /sys/bus/event_source/devices/hv_24x7/events
      
      	$ cat HPM_CS_FROM_L4_LDATA__PHYS_CORE
      	domain=0x2,offset=0xd58,core=?,lpar=0x0
      
      	$ cat HPM_TLBIE__VCPU_HOME_CHIP
      	domain=0x4,offset=0x358,vcpu=?,lpar=?
      
      where the user is required to specify values for the fields with '?' (like
      core, vcpu, lpar above) when specifying the event with the perf tool.
      
      The catalog is (at the moment) only parsed on boot. It needs re-parsing
      when some hypervisor events occur. At that point we'll also need to
      prevent old events from continuing to function (counter that is passed
      in via spare space in the config values?).
      Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      5c5cd7b5
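The parameterized event strings shown in the commit above mark user-supplied fields with '?'. A small standalone C sketch can count how many fields still need a value; `sketch_count_required_params()` is a hypothetical helper for illustration and is not part of perf or the kernel.

```c
#include <stddef.h>
#include <string.h>

/* Count the "key=?" fields in an event description such as
 * "domain=0x2,offset=0xd58,core=?,lpar=0x0". */
static size_t sketch_count_required_params(const char *event)
{
    size_t count = 0;
    const char *p = event;

    /* Jump from one '=' to the next and inspect the value that follows. */
    while ((p = strchr(p, '=')) != NULL) {
        p++; /* first character of the value */
        if (p[0] == '?' && (p[1] == ',' || p[1] == '\0' || p[1] == '\n'))
            count++;
    }
    return count;
}
```

For the two sysfs entries quoted in the commit, this gives one required field (core) for the first and two (vcpu, lpar) for the second.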
    • S
      perf: define EVENT_DEFINE_RANGE_FORMAT_LITE helper · e08e5282
      sukadev@linux.vnet.ibm.com committed
      Define a lite version of the EVENT_DEFINE_RANGE_FORMAT() that avoids
      defining helper functions for the bit-field ranges.
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e08e5282
    • G
      powerpc/kernel: Avoid initializing device-tree pointer twice · fe12545e
      Gavin Shan committed
      As commit 50ba08f3 ("of/fdt: Don't clear initial_boot_params
      if fdt_check_header() fails") does, the device-tree pointer
      "initial_boot_params" is initialized by early_init_dt_verify(),
      which is called by early_init_devtree(). So we needn't explicitly
      initialize that again in early_init_devtree().
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      fe12545e
    • M
      powerpc: Remove old compile time disabled syscall tracing code · a4bcbe6a
      Michael Ellerman committed
      We have code to do syscall tracing which is disabled at compile time by
      default. It's not been touched since the dawn of time (ie. v2.6.12).
      
      There are now better ways to do syscall tracing, ie. using the
      raw_syscall, or syscall tracepoints.
      
      For the specific case of tracing syscalls at boot on a system that
      doesn't get to userspace, you can boot with:
      
        trace_event=syscalls tp_printk=on
      
      Which will trace syscalls from boot, and echo all output to the console.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a4bcbe6a
    • M
      powerpc/kernel: Make syscall_exit a local label · 4c3b2168
      Michael Ellerman committed
      Currently when we back trace something that is in a syscall we see
      something like this:
      
      [c000000000000000] [c000000000000000] SyS_read+0x6c/0x110
      [c000000000000000] [c000000000000000] syscall_exit+0x0/0x98
      
      Although it's entirely correct, seeing syscall_exit at the bottom can be
      confusing - we were exiting from a syscall and then called SyS_read() ?
      
      If we instead change syscall_exit to be a local label we get something
      more intuitive:
      
      [c0000001fa46fde0] [c00000000026719c] SyS_read+0x6c/0x110
      [c0000001fa46fe30] [c000000000009264] system_call+0x38/0xd0
      
      ie. we were handling a system call, and it was SyS_read().
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4c3b2168
    • R
      cxl: Fix device_node reference counting · 6f963ec2
      Ryan Grimm committed
      When unbinding and rebinding the driver on a system with a card in PHB0, this
      error condition is reached after a few attempts:
      
      ERROR: Bad of_node_put() on /pciex@3fffe40000000
      CPU: 0 PID: 3040 Comm: bash Not tainted 3.18.0-rc3-12545-g3627ffe #152
      Call Trace:
      [c000000721acb5c0] [c00000000086ef94] .dump_stack+0x84/0xb0 (unreliable)
      [c000000721acb640] [c00000000073a0a8] .of_node_release+0xd8/0xe0
      [c000000721acb6d0] [c00000000044bc44] .kobject_release+0x74/0xe0
      [c000000721acb760] [c0000000007394fc] .of_node_put+0x1c/0x30
      [c000000721acb7d0] [c000000000545cd8] .cxl_probe+0x1a98/0x1d50
      [c000000721acb900] [c0000000004845a0] .local_pci_probe+0x40/0xc0
      [c000000721acb980] [c000000000484998] .pci_device_probe+0x128/0x170
      [c000000721acba30] [c00000000052400c] .driver_probe_device+0xac/0x2a0
      [c000000721acbad0] [c000000000522468] .bind_store+0x108/0x160
      [c000000721acbb70] [c000000000521448] .drv_attr_store+0x38/0x60
      [c000000721acbbe0] [c000000000293840] .sysfs_kf_write+0x60/0xa0
      [c000000721acbc50] [c000000000292500] .kernfs_fop_write+0x140/0x1d0
      [c000000721acbcf0] [c000000000208648] .vfs_write+0xd8/0x260
      [c000000721acbd90] [c000000000208b18] .SyS_write+0x58/0x100
      [c000000721acbe30] [c000000000009258] syscall_exit+0x0/0x98
      
      We are missing a call to of_node_get(). pnv_pci_to_phb_node() should
      call of_node_get(); otherwise np's reference count isn't incremented and
      it might go away. Rename pnv_pci_to_phb_node() to pnv_pci_get_phb_node()
      so it's clear it calls of_node_get().
      Signed-off-by: Ryan Grimm <grimm@linux.vnet.ibm.com>
      Acked-by: Ian Munsie <imunsie@au1.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      6f963ec2
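The reference-counting mistake fixed by the commit above, and the reason for the rename to a "get" name, can be shown with a toy node type. Everything here is a hypothetical sketch; `sketch_node` and its helpers only imitate the get/put convention and are not the device-tree API.

```c
#include <stddef.h>

/* Toy refcounted node standing in for a device-tree node. */
struct sketch_node {
    int refcount;
    struct sketch_node *parent;
};

static struct sketch_node *sketch_node_get(struct sketch_node *n)
{
    if (n)
        n->refcount++;
    return n;
}

static void sketch_node_put(struct sketch_node *n)
{
    if (n)
        n->refcount--;
}

/* "to" style lookup: returns a borrowed pointer with no reference taken.
 * A caller that later calls put on it unbalances the count. */
static struct sketch_node *sketch_to_parent(struct sketch_node *n)
{
    return n->parent;
}

/* "get" style lookup: takes a reference on behalf of the caller, who is
 * then expected to drop it with put when done. */
static struct sketch_node *sketch_get_parent(struct sketch_node *n)
{
    return sketch_node_get(n->parent);
}
```

Pairing the "get" lookup with one put leaves the count unchanged, while pairing the borrowed lookup with a put drops a reference the caller never owned, which is the kind of imbalance the error trace in the commit reports.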
  13. 31 Jan, 2015 3 commits
  14. 30 Jan, 2015 7 commits