1. 16 1月, 2016 2 次提交
    • M
      arch/powerpc/include/asm/pgtable-ppc64.h: add pmd_[dirty|mkclean] for THP · d5d6a443
      Minchan Kim 提交于
      MADV_FREE needs pmd_dirty and pmd_mkclean for detecting recent overwrite
      of the contents since MADV_FREE syscall is called for THP page.
      
      This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
      support.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Evans <je@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d5d6a443
    • K
      powerpc, thp: remove infrastructure for handling splitting PMDs · 7aa9a23c
      Kirill A. Shutemov 提交于
      With new refcounting we don't need to mark PMDs splitting.  Let's drop
      code to handle this.
      
      pmdp_splitting_flush() is not needed too: on splitting PMD we will do
      pmdp_clear_flush() + set_pte_at().  pmdp_clear_flush() will do IPI as
      needed for fast_gup.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7aa9a23c
  2. 13 1月, 2016 1 次提交
  3. 12 1月, 2016 2 次提交
    • H
      powerpc/mm: fix _PAGE_SWP_SOFT_DIRTY breaking swapoff · 2f10f1a7
      Hugh Dickins 提交于
      Swapoff after swapping hangs on the G5, when CONFIG_CHECKPOINT_RESTORE=y
      but CONFIG_MEM_SOFT_DIRTY is not set.  That's because the non-zero
      _PAGE_SWP_SOFT_DIRTY bit, added by CONFIG_HAVE_ARCH_SOFT_DIRTY=y, is not
      discounted when CONFIG_MEM_SOFT_DIRTY is not set: so swap ptes cannot be
      recognized.
      
      (I suspect that the peculiar dependence of HAVE_ARCH_SOFT_DIRTY on
      CHECKPOINT_RESTORE in arch/powerpc/Kconfig comes from an incomplete
      attempt to solve this problem.)
      
      It's true that the relationship between CONFIG_HAVE_ARCH_SOFT_DIRTY and
      and CONFIG_MEM_SOFT_DIRTY is too confusing, and it's true that swapoff
      should be made more robust; but nevertheless, fix up the powerpc ifdefs
      as x86_64 and s390 (which met the same problem) have them, defining the
      bits as 0 if CONFIG_MEM_SOFT_DIRTY is not set.
      
      Fixes: 7207f436 ("powerpc/mm: Add page soft dirty tracking")
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      2f10f1a7
    • A
      powerpc/mm: Fix _PAGE_PTE breaking swapoff · 44734f23
      Aneesh Kumar K.V 提交于
      Core kernel expects swp_entry_t to consist of only swap type and swap
      offset. We should not leak pte bits into swp_entry_t. This breaks
      swapoff which use the swap type and offset to build a swp_entry_t and
      later compare that to the swp_entry_t obtained from linux page table
      pte. Leaking pte bits into swp_entry_t breaks that comparison and
      results in us looping in try_to_unuse.
      
      The stack trace can be anywhere below try_to_unuse() in mm/swapfile.c,
      since swapoff is circling around and around that function, reading from
      each used swap block into a page, then trying to find where that page
      belongs, looking at every non-file pte of every mm that ever swapped.
      
      Fixes: 6a119eae ("powerpc/mm: Add a _PAGE_PTE bit")
      Reported-by: NHugh Dickins <hughd@google.com>
      Suggested-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      44734f23
  4. 09 1月, 2016 2 次提交
  5. 04 1月, 2016 1 次提交
  6. 28 12月, 2015 1 次提交
  7. 27 12月, 2015 2 次提交
    • R
      powerpc/powernv: Add a kmsg_dumper that flushes console output on panic · affddff6
      Russell Currey 提交于
      On BMC machines, console output is controlled by the OPAL firmware and is
      only flushed when its pollers are called.  When the kernel is in a panic
      state, it no longer calls these pollers and thus console output does not
      completely flush, causing some output from the panic to be lost.
      
      Output is only actually lost when the kernel is configured to not power off
      or reboot after panic (i.e. CONFIG_PANIC_TIMEOUT is set to 0) since OPAL
      flushes the console buffer as part of its power down routines.  Before this
      patch, however, only partial output would be printed during the timeout wait.
      
      This patch adds a new kmsg_dumper which gets called at panic time to ensure
      panic output is not lost.  It accomplishes this by calling OPAL_CONSOLE_FLUSH
      in the OPAL API, and if that is not available, the pollers are called enough
      times to (hopefully) completely flush the buffer.
      
      The flushing mechanism will only affect output printed at and before the
      kmsg_dump call in kernel/panic.c:panic().  As such, the "end Kernel panic"
      message may still be truncated as follows:
      
      >Call Trace:
      >[c000000f1f603b00] [c0000000008e9458] dump_stack+0x90/0xbc (unreliable)
      >[c000000f1f603b30] [c0000000008e7e78] panic+0xf8/0x2c4
      >[c000000f1f603bc0] [c000000000be4860] mount_block_root+0x288/0x33c
      >[c000000f1f603c80] [c000000000be4d14] prepare_namespace+0x1f4/0x254
      >[c000000f1f603d00] [c000000000be43e8] kernel_init_freeable+0x318/0x350
      >[c000000f1f603dc0] [c00000000000bd74] kernel_init+0x24/0x130
      >[c000000f1f603e30] [c0000000000095b0] ret_from_kernel_thread+0x5c/0xac
      >---[ end Kernel panic - not
      
      This functionality is implemented as a kmsg_dumper as it seems to be the
      most sensible way to introduce platform-specific functionality to the
      panic function.
      Signed-off-by: NRussell Currey <ruscur@russell.cc>
      Reviewed-by: NAndrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      affddff6
    • M
      powerpc: Copy only required pieces of the mm_context_t to the paca · 2fc251a8
      Michael Neuling 提交于
      Currently we copy the whole mm_context_t to the paca but only access a
      few bits of it.  This is wasteful of space paca and also takes quite
      some time in the hot path of context switching.
      
      This patch pulls in only the required bits from the mm_context_t to
      the paca and on context switch, copies only those.
      
      Benchmarking this (On top of Anton's recent MSR context switching
      changes [1]) using processes and yield shows an improvement of almost
      3% on POWER8:
      
        http://ozlabs.org/~anton/junkcode/context_switch2.c
        ./context_switch2 --test=yield --process 0 0
      
      1. https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-October/135700.htmlSigned-off-by: NMichael Neuling <mikey@neuling.org>
      [mpe: Rename paca fields to be mm_ctx_foo rather than context_foo]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      2fc251a8
  8. 23 12月, 2015 3 次提交
  9. 19 12月, 2015 1 次提交
  10. 17 12月, 2015 13 次提交
    • A
      powerpc/powernv: Add support for Nvlink NPUs · 5d2aa710
      Alistair Popple 提交于
      NVLink is a high speed interconnect that is used in conjunction with a
      PCI-E connection to create an interface between CPU and GPU that
      provides very high data bandwidth. A PCI-E connection to a GPU is used
      as the control path to initiate and report status of large data
      transfers sent via the NVLink.
      
      On IBM Power systems the NVLink processing unit (NPU) is similar to
      the existing PHB3. This patch adds support for a new NPU PHB type. DMA
      operations on the NPU are not supported as this patch sets the TCE
      translation tables to be the same as the related GPU PCIe device for
      each NVLink. Therefore all DMA operations are setup and controlled via
      the PCIe device.
      
      EEH is not presently supported for the NPU devices, although it may be
      added in future.
      Signed-off-by: NAlistair Popple <alistair@popple.id.au>
      Signed-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      5d2aa710
    • A
      powerpc: Add __raw_rm_writeq() function · a84bf321
      Alistair Popple 提交于
      Move __raw_rm_writeq() from platforms/powernv/pci-ioda.c to
      include/asm/io.h so that it can be used by other code.
      Signed-off-by: NAlistair Popple <alistair@popple.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a84bf321
    • A
      Revert "powerpc/pci: Remove unused struct pci_dn.pcidev field" · 94973b24
      Alistair Popple 提交于
      This commit removed the pcidev field from struct pci_dn as it was no
      longer in use by the kernel. However to support finding the
      association of Nvlink devices to GPU devices from the device-tree this
      field is required.
      
      This reverts commit 250c7b27 ("powerpc/pci: Remove unused struct
      pci_dn.pcidev field").
      Signed-off-by: NAlistair Popple <alistair@popple.id.au>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      94973b24
    • L
      powerpc/mm: Add page soft dirty tracking · 7207f436
      Laurent Dufour 提交于
      User space checkpoint and restart tool (CRIU) needs the page's change
      to be soft tracked. This allows to do a pre checkpoint and then dump
      only touched pages.
      
      This is done by using a newly assigned PTE bit (_PAGE_SOFT_DIRTY) when
      the page is backed in memory, and a new _PAGE_SWP_SOFT_DIRTY bit when
      the page is swapped out.
      
      To introduce a new PTE _PAGE_SOFT_DIRTY bit value common to hash 4k
      and hash 64k pte, the bits already defined in hash-*4k.h should be
      shifted left by one.
      
      The _PAGE_SWP_SOFT_DIRTY bit is dynamically put after the swap type in
      the swap pte. A check is added to ensure that the bit is not
      overwritten by _PAGE_HPTEFLAGS.
      Signed-off-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
      CC: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      7207f436
    • M
      powerpc/kernel: Combine vec/loc for STD_EXCEPTION_PSERIES · 2613265c
      Michael Ellerman 提交于
      The STD_EXCEPTION_PSERIES macro takes both a vector number, and a
      location (memory address). However both are always identical, so combine
      them to save repeating ourselves.
      
      This does mean an exception handler must always exist at the location in
      memory that matches its vector number. But that's OK because this is the
      "STD" macro (standard), which does exactly that. We have other macros
      for the other cases, eg. STD_EXCEPTION_PSERIES_OOL (out of line).
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      2613265c
    • M
      powerpc/kernel: Open code SET_DEFAULT_THREAD_PPR · d8725ce8
      Michael Ellerman 提交于
      This is only used in one location, open code it.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      d8725ce8
    • M
      powerpc/kernel: Open code HMT_MEDIUM_LOW_HAS_PPR · d030a4b5
      Michael Ellerman 提交于
      HMT_MEDIUM_LOW_HAS_PPR is only used in once place, open code it.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      d030a4b5
    • M
      powerpc/kernel: Drop HMT_MEDIUM_PPR_DISCARD · d6265aea
      Michael Ellerman 提交于
      HMT_MEDIUM_PPR_DISCARD is a macro which is present at the start of most
      of our first level exception handlers. It conditionally executes a
      HMT_MEDIUM instruction, which sets the processor priority to medium.
      
      On on modern systems, ie. Power7 and later, it is nop'ed out at boot.
      All it does is make the exception vectors more cramped, and consume 4
      bytes of icache.
      
      On old systems it has the effect of boosting the processor priority at
      the start of exception processing. If we were previously in the idle
      loop for example, we may be at low or very low priority. This is
      desirable as we want to process the exception as fast as possible.
      
      However looking closely at the generated code, we see that in all cases
      we execute another HMT_MEDIUM just four instructions later. With code
      patching applied, the final code on an old (Power6) system will look
      like, eg:
      
        c000000000000300 <data_access_pSeries>:
        c000000000000300:	7c 42 13 78	mr	r2,r2		<-
        c000000000000304:	7d b2 43 a6	mtsprg	2,r13
        c000000000000308:	7d b1 42 a6	mfsprg	r13,1
        c00000000000030c:	f9 2d 00 80	std	r9,128(r13)
        c000000000000310:	60 00 00 00	nop
        c000000000000314:	7c 42 13 78	mr	r2,r2		<-
      
      So I suggest that the added code complexity of HMT_MEDIUM_PPR_DISCARD is
      not justified by the benefit of boosting the processor priority for the
      duration of four instructions, and therefore we drop it.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      d6265aea
    • M
      powerpc/rtas: Make enter_rtas() private · cd5cdeb6
      Michael Ellerman 提交于
      There are no longer any users of enter_rtas() outside of rtas.c, so make
      it "private", by moving the declaration inside rtas.c. Hopefully this
      will encourage people to use one of the wrappers which takes the sharp
      edges off the RTAS calling sequence.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      cd5cdeb6
    • M
      powerpc/rtas: Add rtas_call_unlocked() · 209eb4e5
      Michael Ellerman 提交于
      Most users of RTAS (Run-Time Abstraction Services) use rtas_call(),
      which deals with locking as well as endian handling.
      
      However we have two users outside of rtas.c that can't use rtas_call()
      because they have different locking requirements.
      
      The hotplug CPU code can't take the RTAS lock because the CPU would go
      offline with the lock held and no other CPUs would be able to call RTAS
      until the CPU came back online.
      
      The xmon code doesn't want to take the lock because it would risk dead
      locking when we are trying to recover from a crash.
      
      Both sites required multiple patches when we added little endian
      support, proving that programmers can't do endian right.
      
      Although that ship has sailed, we can still clean the code up by
      providing an unlocked version of rtas_call() which avoids the need to
      open code the logic elsewhere.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      209eb4e5
    • S
      powerpc/powernv: remove FW_FEATURE_OPALv3 and just use FW_FEATURE_OPAL · e4d54f71
      Stewart Smith 提交于
      Long ago, only in the lab, there was OPALv1 and OPALv2. Now there is
      just OPALv3, with nobody ever expecting anything on pre-OPALv3 to
      be cared about or supported by mainline kernels.
      
      So, let's remove FW_FEATURE_OPALv3 and instead use FW_FEATURE_OPAL
      exclusively.
      Signed-off-by: NStewart Smith <stewart@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      e4d54f71
    • S
      powerpc/powernv: Remove OPALv2 firmware define and references · 7261aafc
      Stewart Smith 提交于
      OPALv2 only ever existed in the lab and didn't escape to the world.
      All OPAL systems in the wild are OPALv3.
      
      The probability of there being an OPALv2 system still powered on
      anywhere inside IBM is approximately zero, let alone anyone
      expecting to run mainline kernels.
      
      So, start to remove references to OPALv2.
      Signed-off-by: NStewart Smith <stewart@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      7261aafc
    • H
      crypto: nx-842 - Mask XERS0 bit in return value · 6333ed8f
      Haren Myneni 提交于
      NX842 coprocessor sets 3rd bit in CR register with XER[S0] which is
      nothing to do with NX request. Since this bit can be set with other
      valuable return status, mast this bit.
      
      One of other bits (INITIATED, BUSY or REJECTED) will be returned for
      any given NX request.
      Signed-off-by: NHaren Myneni <haren@us.ibm.com>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      6333ed8f
  11. 16 12月, 2015 2 次提交
    • M
      Partial revert of "powerpc: Individual System V IPC system calls" · 2475c362
      Michael Ellerman 提交于
      This partially reverts commit a3423615.
      
      While reviewing the glibc patch to exploit the individual IPC calls,
      Arnd & Andreas noticed that we were still requiring userspace to pass
      IPC_64 in order to get the new style IPC API.
      
      With a bit of cleanup in the kernel we can drop that requirement, and
      instead only provide the new style API, which will simplify things for
      userspace.
      
      Rather than try and sneak that patch into 4.4, instead we will drop the
      individual IPC calls for powerpc, and merge them again in 4.5 once the
      cleanup patch has gone in.
      
      Because we've already added sys_mlock2() as syscall #378, we don't do a
      full revert of the IPC calls. Instead we drop the __NR #defines, and
      send those now undefined syscall numbers to sys_ni_syscall(). This
      leaves a gap in the syscall numbers, but we'll reuse them when we merge
      the individual IPC calls.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Acked-by: NArnd Bergmann <arnd@arndb.de>
      2475c362
    • D
      powerpc: Remove broken GregorianDay() · 00b912b0
      Daniel Axtens 提交于
      GregorianDay() is supposed to calculate the day of the week
      (tm->tm_wday) for a given day/month/year. In that calcuation it
      indexed into an array called MonthOffset using tm->tm_mon-1. However
      tm_mon is zero-based, not one-based, so this is off-by-one. It also
      means that every January, GregoiranDay() will access element -1 of
      the MonthOffset array.
      
      It also doesn't appear to be a correct algorithm either: see in
      contrast kernel/time/timeconv.c's time_to_tm function.
      
      It's been broken forever, which suggests no-one in userland uses
      this. It looks like no-one in the kernel uses tm->tm_wday either
      (see e.g. drivers/rtc/rtc-ds1305.c:319).
      
      tm->tm_wday is conventionally set to -1 when not available in
      hardware so we can simply set it to -1 and drop the function.
      (There are over a dozen other drivers in drivers/rtc that do
      this.)
      
      Found using UBSAN.
      
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andrew Morton <akpm@linux-foundation.org> # as an example of what UBSan finds.
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com>
      Cc: rtc-linux@googlegroups.com
      Signed-off-by: NDaniel Axtens <dja@axtens.net>
      Acked-by: NAlexandre Belloni <alexandre.belloni@free-electrons.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      00b912b0
  12. 14 12月, 2015 10 次提交