- 10 Apr 2015, 1 commit
-
-
Committed by Michael Ellerman

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Fix the 32-bit code also]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 07 Apr 2015, 1 commit
-
-
Committed by Michael Ellerman

The celleb code has seen no actual development for ~7 years. We (the maintainers) have no access to test hardware, and it is highly likely the code has bit-rotted. As far as we're aware the hardware was never widely available, and is certainly no longer available, and no one on the list has shown any interest in it over the years. So remove it. If anyone has one and cares, please speak up.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Jeremy Kerr <jk@ozlabs.org>
-
- 01 Apr 2015, 2 commits
-
-
Committed by Kevin Hao

The cache line size of all current book3e 64-bit SoCs is 64 bytes, so we should use that size to align the members of paca_struct. This only changes paca_struct members that are private to book3e CPUs, and should have no effect on book3s ones. With this we save 192 bytes. Also switch to __aligned(size), since it is preferred over __attribute__((aligned(size))).

Before:
    /* size: 1920, cachelines: 30, members: 46 */
    /* sum members: 1667, holes: 6, sum holes: 141 */
    /* padding: 112 */

After:
    /* size: 1728, cachelines: 27, members: 46 */
    /* sum members: 1667, holes: 4, sum holes: 13 */
    /* padding: 48 */

Signed-off-by: Kevin Hao <haokexin@gmail.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
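For illustration, a minimal sketch of the change (member names here are hypothetical, not the real paca_struct layout):

    /* Align book3e-private members to the 64-byte cache line of book3e
     * 64-bit SoCs, using the preferred __aligned() spelling. */
    struct paca_sketch {
        long common_state;                  /* book3s-visible fields unaffected */
        long exgen_sketch[4] __aligned(64); /* was: __attribute__((aligned(size))) */
    };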
-
Committed by Madalin Bucur

Signed-off-by: Madalin Bucur <madalin.bucur@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
-
- 31 Mar 2015, 2 commits
-
-
Committed by Cédric Le Goater

OPAL has its own list of return codes. The patch provides a translation of those codes into errnos for the opal_sensor_read call, and possibly others if needed.

Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
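As a hedged sketch of such a translation (the exact set of OPAL_* codes handled, and the errno chosen for each, are assumptions):

    /* Map an OPAL return code onto a Linux errno; illustrative mapping only.
     * OPAL_* constants come from opal-api.h, errnos from linux/errno.h. */
    static int opal_errno_sketch(int rc)
    {
        switch (rc) {
        case OPAL_SUCCESS:          return 0;
        case OPAL_PARAMETER:        return -EINVAL;
        case OPAL_ASYNC_COMPLETION: return -EINPROGRESS;
        case OPAL_BUSY_EVENT:       return -EBUSY;
        case OPAL_NO_MEM:           return -ENOMEM;
        case OPAL_HARDWARE:         return -EIO;
        default:                    return -EIO;
        }
    }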
-
Committed by Joe Perches

Use the normal return values for bool functions.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 28 Mar 2015, 2 commits
-
-
Committed by Michael Ellerman

We currently have a "special" syscall for switching endianness: syscall number 0x1ebe, which is handled explicitly in the 64-bit syscall exception entry. That has a few problems. Firstly, the syscall number is outside the usual range, which confuses various tools; for example, strace doesn't recognise the syscall at all. Secondly, it's handled explicitly as a special case in the syscall exception entry, which is complicated enough without it. As a first step toward removing the special syscall, we need to add a regular syscall that implements the same functionality. The logic is simple: it just toggles the MSR_LE bit in the userspace MSR. This is the same as the special syscall, with the caveat that the special syscall clobbers fewer registers. This version clobbers r9-r12, XER, CTR, and CR0-1,5-7.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Tyrel Datwyler

During a suspend/migration operation we must wait for the VASI state reported by the hypervisor to become Suspending before making the ibm,suspend-me RTAS call. Callers of rtas_ibm_suspend_me() pass a vasi_state variable that exposes the VASI state to the caller. This is unnecessary, as the caller only really cares about three conditions: there is an error and we should bail out; success, indicating we have suspended and woken back up, so proceed to the device tree update; or we are not suspendable yet, so try calling rtas_ibm_suspend_me again shortly. This patch removes the extraneous vasi_state variable and simply uses the return code to communicate how to proceed: we either succeed, fail, or get -EAGAIN, in which case we sleep for a second before trying rtas_ibm_suspend_me again. The behaviour of ppc_rtas() remains the same, but migrate_store() now returns the propagated error code on failure. Previously -1 was returned from migrate_store() in the failure case, which equates to -EPERM and was clearly wrong.

Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
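The resulting caller-side pattern, sketched (the streamid parameter and exact signature are assumptions):

    /* Keep retrying while the hypervisor reports we are not suspendable
     * yet (-EAGAIN), sleeping a second between attempts. */
    static int suspend_retry_sketch(u64 streamid)
    {
        int rc;

        do {
            rc = rtas_ibm_suspend_me(streamid);
            if (rc == -EAGAIN)
                ssleep(1);
        } while (rc == -EAGAIN);

        return rc;   /* 0 on success, or a propagated error code */
    }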
-
- 25 Mar 2015, 1 commit
-
-
Committed by Neelesh Gupta

Provide an unregister interface for the OPAL message notifiers, to be called when they are no longer needed, such as during driver unload/remove.

Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
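A sketch of the unregister side, assuming the notifiers live on an atomic notifier chain indexed by message type (the names mirror the register API but are unverified):

    /* Detach a previously registered notifier for one OPAL message type. */
    static int opal_message_notifier_unregister_sketch(enum opal_msg_type t,
                                                       struct notifier_block *nb)
    {
        return atomic_notifier_chain_unregister(&opal_msg_notifier_head[t], nb);
    }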
-
- 24 Mar 2015, 9 commits
-
-
Committed by David Gibson

The powerpc-specific st_le*() and ld_le*() functions in arch/powerpc/include/asm/swab.h no longer have any users. They are also misleadingly named, since they always byteswap, even on a little-endian host. This patch removes them.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by David Gibson

Sometimes the KVM code on powerpc needs to emulate load or store instructions from the guest, which can include both normal and byte-reversed forms. We currently (AFAICT) handle this correctly, but some variable names are very misleading. In particular we use "is_bigendian" in several places to actually mean "is the IO the same endian as the host", but we now support little-endian powerpc hosts. This also ties into the misleadingly named ld_le*() and st_le*() functions, which in fact always byteswap, even on an LE host. This patch cleans this up by renaming to the more accurate "host_swabbed", and uses the generic swab*() functions instead of the powerpc-specific and misleadingly named ld_le*() and st_le*() functions.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
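Schematically, what the change amounts to (illustrative, not the exact KVM code):

    /* Load a 32-bit value for MMIO emulation. Swap only when the IO
     * endianness differs from the host's, hence "host_swabbed"; the
     * generic swab32() replaces the removed ld_le32(). */
    static u32 load_mmio_u32_sketch(const void *p, bool host_swabbed)
    {
        u32 v = *(const u32 *)p;

        return host_swabbed ? swab32(v) : v;
    }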
-
Committed by Gavin Shan

The patch removes struct eeh_dev::dn and the corresponding helper functions eeh_dev_to_of_node() and of_node_to_eeh_dev(). Instead, eeh_dev_to_pdn() and pdn_to_eeh_dev() should be used to get the pdn, which might contain the device_node on the PowerNV platform.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by Gavin Shan

There are three EEH operations whose arguments contain a device_node: read_config(), write_config() and restore_config(). The patch replaces device_node with pci_dn.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by Gavin Shan

Originally, the EEH core probed on device_node or pci_dev to populate EEH devices and PEs, which conflicts with the fact that SRIOV VFs are usually enabled and created by the PF's driver and don't have corresponding device_nodes. Instead, SRIOV VFs have dynamically created pci_dn, which can be used for the EEH probe. The patch reworks the EEH probe for the PowerNV and pSeries platforms to probe based on pci_dn, rather than pci_dev or device_node.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by Gavin Shan

The patch adds the function traverse_pci_dn(), which is similar to traverse_pci_devices() except that it takes pci_dn, not device_node, as its parameter. The pci_dev.c code has been reworked to create eeh_dev from pci_dn instead of device_node.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by Gavin Shan

Originally, EEH probed on device_node or pci_dev and populated the corresponding eeh_dev. In the subsequent patches, EEH will probe on pci_dn and populate the corresponding eeh_dev, so we have to cache some information in pci_dn, either from the device_node or from the SRIOV PF's enablement platform hook, to populate the eeh_dev properly. The motivation for probing pci_dn, instead of device_node or pci_dev, is that SRIOV VFs are dynamically created and we don't have corresponding device nodes for them.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by Gavin Shan

Currently, the PCI config accessors are implemented based on device_node. Unfortunately, SRIOV VFs won't have corresponding device nodes, so pci_dn will be used in place of device_node for SRIOV VFs, and we have to use pci_dn in the PCI config accessors. The patch refactors pci_dn in the following ways to make it ready to be used in the PCI config accessors, as done in a subsequent patch:

* pci_dn is organized as a hierarchical tree (sketched below). A PCI device's pci_dn is put on the child list of the pci_dn of its upstream bridge or PHB. A VF's pci_dn will be put on the child list of the pci_dn of the PF's bridge.
* For one particular PCI device (VF or not), its pci_dn can be found from pdev->dev.archdata.pci_data, PCI_DN(devnode), or the parent's list. The fast path (fetching pci_dn through the PCI device instance) is populated during early fixup time.

[bhelgaas: changelog]
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
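A minimal sketch of that hierarchy (field names are illustrative):

    /* Each pci_dn links into its parent's child list, so a VF's pci_dn
     * hangs off the pci_dn of the PF's bridge. */
    struct pci_dn_sketch {
        struct pci_dn_sketch *parent;
        struct list_head      child_list;  /* children of this node */
        struct list_head      list;        /* our link in parent->child_list */
    };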
-
Committed by Hongtao Jia

The MPIC version is useful information for both mpic_alloc() and mpic_init(), so the patch provides an API to get the MPIC version, allowing the code to be reused. Some other IP blocks may also need the MPIC version for their own use, so an API for external use is provided as well. This function had previously been added but was removed by commit 5e86bfde ("powerpc/mpic: remove unused functions") due to the lack of a user. It will be used by "powerpc/mpic: Add get_version API both for internal and external use".

Signed-off-by: Jia Hongtao <hongtao.jia@freescale.com>
Signed-off-by: Li Yang <leoli@freescale.com>
[scottwood@freescale.com: changelog update]
Signed-off-by: Scott Wood <scottwood@freescale.com>
-
- 23 Mar 2015, 4 commits
-
-
Committed by Fabian Frederick

Replace one-line asm-generic include files declared in arch/powerpc/include/asm/ with generic-y declarations, which create the arch/powerpc/include/generated/asm equivalents.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by David Gibson

ppc has special instruction forms to efficiently load and store values in non-native endianness. These can be accessed via the arch-specific {ld,st}_le{16,32}() inlines in arch/powerpc/include/asm/swab.h. However, gcc is perfectly capable of generating the byte-reversing load/store instructions when using the normal, generic cpu_to_le*() and le*_to_cpu() functions, meaning the arch-specific functions don't have much point. Worse, the "le" in the names of the arch-specific functions is now misleading, because they always generate byte-reversing forms, but some ppc machines can now run a little-endian kernel. To start getting rid of the arch-specific forms, this patch removes them from all the old Power Macintosh drivers, replacing them with the generic byteswappers.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by Hari Bathini

While we are here, let us make the timestamp-related code y2038-safe.

Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Hari Bathini

With minor checks, we can move most of the nvram code under pseries to a common place, to be re-used by other powerpc platforms like powernv. This patch moves such common code to arch/powerpc/kernel/nvram_64.c.

Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
[mpe: Move select of ZLIB_DEFLATE to PPC64 to fix the build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 17 Mar 2015, 3 commits
-
-
Committed by Kyle Moffett

These functions are only used from one place each. If the cacheable_* versions really are more efficient, then those changes should be migrated into the common code instead. NOTE: The old routines are just flat buggy on kernels that support hardware with different cacheline sizes.

Signed-off-by: Kyle Moffett <Kyle.D.Moffett@boeing.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by Gavin Shan

The patch fixes the comments about ppc_md.pcibios_fixup(), which should be called after allocating resources.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by Mahesh Salgaonkar

The flush_tlb hook in cpu_spec was introduced as a generic function hook to invalidate TLBs. But the current implementation of the flush_tlb hook takes IS (invalidation selector) as an argument, which is architecture dependent. Hence it is not right to have a generic routine where the caller has to pass a non-generic argument. This patch fixes this and makes the flush_tlb hook a high-level API.

Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 16 Mar 2015, 7 commits
-
-
Committed by Arseny Solokha

Drop the unused fsl_mpic_primary_get_version(), mpic_set_clk_ratio() and mpic_set_serial_int().

+ fsl_mpic_primary_get_version() is just a safe wrapper around fsl_mpic_get_version() for SMP configurations. While the latter is called explicitly for handling PIC initialization and setting up the error interrupt vector depending on the PIC hardware version, the former isn't used for anything.
+ As for mpic_set_clk_ratio() and mpic_set_serial_int(), they are both almost nine years old[1] but still have no chance of being called, even from out-of-tree modules, because they are both __init and of course aren't exported.

[1] https://lists.ozlabs.org/pipermail/linuxppc-dev/2006-June/023867.html

Signed-off-by: Arseny Solokha <asolokha@kb.kras.ru>
Cc: hongtao.jia@freescale.com
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Arseny Solokha

Drop ucc_slow_poll_transmitter_now(), which has had no users since its inception in 2007 in commit 98658538 ("[POWERPC] Add QUICC Engine (QE) infrastructure").

Signed-off-by: Arseny Solokha <asolokha@kb.kras.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Michael Ellerman

This removes definitions in opal-api.h that are completely unused in Linux. For each of these I see three possibilities: 1) we *should* be using them in Linux and patches will arrive to do that, 2) they are not used but should stay in the header to document the API for some important reason, 3) they are not used and needn't be part of the API.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Stewart Smith <stewart@linux.vnet.ibm.com>
-
Committed by Michael Ellerman

This commit gets opal-api.h to mostly match the version in Skiboot as of commit ea7d806ab0ba. The exceptions are things which are not (currently) used in Linux. Most of this is just whitespace and a few things moving around; I think the diff is readable. Also, OpalMessageType became opal_msg_type, requiring a change in the Linux code. Finally, Skiboot and Linux disagree on CAPI vs CXL, because CAPI means something else in Linux. To handle that we just point the Linux wrapper, which is named "cxl", at the OPAL token OPAL_PCI_SET_PHB_CAPI_MODE.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Stewart Smith <stewart@linux.vnet.ibm.com>
-
Committed by Michael Ellerman

We'd like to get to the stage where the OPAL API is defined in a header that is identical between Linux and Skiboot. As step one, split the bits that actually define the API into opal-api.h. The Linux-specific parts stay in opal.h.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Stewart Smith <stewart@linux.vnet.ibm.com>
-
Committed by Anton Blanchard

As our various loops (copy, string, crypto etc) get more complicated, we want to share implementations between userspace (eg glibc) and the kernel. We also want to write userspace test harnesses to put in tools/testing/selftest. One gratuitous difference between userspace and the kernel is the VSX register definitions: the kernel uses vsrX whereas gcc uses vsX. Change the kernel to match userspace.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Anton Blanchard

As our various loops (copy, string, crypto etc) get more complicated, we want to share implementations between userspace (eg glibc) and the kernel. We also want to write userspace test harnesses to put in tools/testing/selftest. One gratuitous difference between userspace and the kernel is the VMX register definitions: the kernel uses vrX whereas both gcc and glibc use vX. Change the kernel to match userspace.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
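As a rough illustration of the mechanical change in the kernel's register macros (the exact header and spelling are assumptions):

    /* VMX: vr0..vr31 become v0..v31 to match gcc and glibc; the VSX patch
     * above similarly turns vsr0..vsr63 into vs0..vs63. */
    #define v0  0   /* was: #define vr0 0 */
    #define v1  1   /* was: #define vr1 1 */
    /* ... continuing through v31 */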
-
- 04 Mar 2015, 1 commit
-
-
Committed by Nishanth Aravamudan

After d905c5df ("PPC: POWERNV: move iommu_add_device earlier"), the refcount on the kobject backing the IOMMU group for a PCI device is elevated by each call to pci_dma_dev_setup_pSeriesLP() (via set_iommu_table_base_and_group). When we go to dlpar a multi-function PCI device out:

    iommu_reconfig_notifier ->
      iommu_free_table ->
        iommu_group_put
          BUG_ON(tbl->it_group)

we trip this BUG_ON, because there are still references on the table, so it is not freed. Fix this by moving the powernv bus notifier to common code and calling it for both powernv and pseries.

Fixes: d905c5df ("PPC: POWERNV: move iommu_add_device earlier")
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 23 Feb 2015, 1 commit
-
-
Committed by Paul Clarke

Implement arch_irq_work_has_interrupt() for powerpc. Commit 9b01f5bf introduced a dependency on "IRQ work self-IPIs" for full dynamic ticks to be enabled, by expecting architectures to implement a suitable arch_irq_work_has_interrupt() routine. Several arches have implemented this routine, including x86 (3010279f) and arm (09f6edd4), but powerpc was omitted. This patch implements the routine for powerpc. The symptom, at boot (on powerpc systems) with "nohz_full=<CPU list>", is:

    NO_HZ: Can't run full dynticks because arch doesn't support irq work self-IPIs

After this patch:

    NO_HZ: Full dynticks CPUs: <CPU list>.

Tested against 3.19. powerpc implements "IRQ work self-IPIs" by setting the decrementer to 1 in arch_irq_work_raise(), which causes a decrementer exception on the next timebase tick. We then handle the work in __timer_interrupt().

CC: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Paul A. Clarke <pc@us.ibm.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[mpe: Flesh out change log, fix ws & include guards, remove include of processor.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
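The routine itself can be minimal; since powerpc already raises IRQ-work self-IPIs through the decrementer, it only has to advertise the capability. A sketch of the asm/irq_work.h addition:

    /* powerpc can deliver IRQ work via a self-interrupt (the decrementer
     * is set to 1 in arch_irq_work_raise()), so simply report true. */
    static inline bool arch_irq_work_has_interrupt(void)
    {
        return true;
    }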
-
- 17 Feb 2015, 1 commit
-
-
Committed by Kirill A. Shutemov

We've replaced the remap_file_pages(2) implementation with an emulation. Nobody creates non-linear mappings anymore.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 13 Feb 2015, 3 commits
-
-
Committed by Andy Lutomirski

If an attacker can cause a controlled kernel stack overflow, overwriting the restart block is a very juicy exploit target, because the restart_block is held in the same memory allocation as the kernel stack. Moving the restart block to struct task_struct prevents this exploit by making the restart_block harder to locate. Note that there are other fields in thread_info that are also easy targets, at least on some architectures. It's also a decent simplification, since the restart code is more or less identical on all architectures.

[james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: David Miller <davem@davemloft.net>
Acked-by: Richard Weinberger <richard@nod.at>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Mel Gorman

This patch removes the NUMA PTE bits and associated helpers. As a side effect it increases the maximum possible swap space on x86-64. One potential source of problems is races between the marking of PTEs PROT_NONE, NUMA hinting faults and migration. It must be guaranteed that a PTE being protected is not faulted in parallel, seen as a pte_none and corrupting memory. The base case is safe, but transhuge has had problems in the past due to a different migration mechanism and a dependence on the page lock to serialise migrations, and warrants a closer look.

    task_work hinting update            parallel fault
    ------------------------            --------------
    change_pmd_range
      change_huge_pmd
        __pmd_trans_huge_lock
          pmdp_get_and_clear
                                        __handle_mm_fault
                                        pmd_none
                                          do_huge_pmd_anonymous_page
                                          read? pmd_lock blocks until hinting
                                                complete, fail !pmd_none test
                                          write? __do_huge_pmd_anonymous_page
                                                 acquires pmd_lock, checks pmd_none
          pmd_modify
          set_pmd_at

    task_work hinting update            parallel migration
    ------------------------            ------------------
    change_pmd_range
      change_huge_pmd
        __pmd_trans_huge_lock
          pmdp_get_and_clear
                                        __handle_mm_fault
                                        do_huge_pmd_numa_page
                                          migrate_misplaced_transhuge_page
                                          pmd_lock waits for updates to complete,
                                          recheck pmd_same
          pmd_modify
          set_pmd_at

Both of those are safe, and the case where a transhuge page is inserted during a protection update is unchanged. The case where two processes try migrating at the same time is unchanged by this series, so it should still be ok. I could not find a case where we are accidentally depending on the PTE not being cleared and flushed. If one is missed, it'll manifest as corruption problems that start triggering shortly after this series is merged and only happen when NUMA balancing is enabled.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Mel Gorman

This is a preparatory patch that introduces protnone helpers for automatic NUMA balancing.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
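A sketch of what such a helper can look like, modelled on the x86 approach where a _PAGE_PROTNONE bit marks a present-but-inaccessible page (assumed here for illustration):

    /* True if the pte maps memory the kernel considers present but which
     * grants userspace no access permissions. */
    static inline int pte_protnone(pte_t pte)
    {
        return (pte_flags(pte) & (_PAGE_PROTNONE | _PAGE_PRESENT))
                == _PAGE_PROTNONE;
    }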
-
- 12 Feb 2015, 1 commit
-
-
Committed by Kirill A. Shutemov

LKP has triggered a compiler warning after my recent patch "mm: account pmd page tables to the process":

    mm/mmap.c: In function 'exit_mmap':
    >> mm/mmap.c:2857:2: warning: right shift count >= width of type [enabled by default]

The code:

    > 2857    WARN_ON(mm_nr_pmds(mm) >
    > 2858            round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);

On tile, FIRST_USER_ADDRESS is defined as 0, so round_up() yields a value of the same type, int, and the right shift by PUD_SHIFT can then exceed the width of int. I think the best way to fix it is to define FIRST_USER_ADDRESS as unsigned long, on every arch for consistency.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
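The fix, schematically, in each architecture's pgtable.h:

    /* Make FIRST_USER_ADDRESS unsigned long everywhere so expressions like
     * round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT are evaluated in
     * a type wide enough for the shift count. */
    #define FIRST_USER_ADDRESS  0UL   /* was: 0 */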
-
- 06 Feb 2015, 1 commit
-
-
Committed by Paolo Bonzini

This patch introduces a new module parameter for the KVM module; when it is present, KVM attempts a bit of polling on every HLT before scheduling itself out via kvm_vcpu_block. This parameter helps a lot for latency-bound workloads: in particular I tested it with O_DSYNC writes with a battery-backed disk in the host. In this case, writes are fast (because the data doesn't have to go all the way to the platters) but they cannot be merged by either the host or the guest. KVM's performance here is usually around 30% of bare metal, or 50% if you use cache=directsync or cache=writethrough (these parameters avoid the guest sending pointless flush requests, and at the same time they are not slow because of the battery-backed cache). The bad performance happens because on every halt the host CPU decides to halt itself too. When the interrupt comes, the vCPU thread is then migrated to a new physical CPU, and in general the latency is horrible because the vCPU thread has to be scheduled back in.

With this patch performance reaches 60-65% of bare metal and, more importantly, 99% of what you get if you use idle=poll in the guest. This means that the tunable gets rid of this particular bottleneck, and more work can be done to improve performance in the kernel or QEMU.

Of course there is some price to pay: every time an otherwise idle vCPU is interrupted by an interrupt, it will poll unnecessarily and thus impose a little load on the host. The above results were obtained with a mostly random value of the parameter (500000), and the load was around 1.5-2.5% CPU usage on one of the host's cores for each idle guest vCPU.

The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll, that can be used to tune the parameter. It counts how many HLT instructions received an interrupt during the polling period; each successful poll avoids Linux scheduling the VCPU thread out and back in, and may also avoid a likely trip to C1 and back for the physical CPU. While the VM is idle, a 4-VCPU Linux VM halts around 10 times per second; of these halts, almost all are failed polls. During the benchmark, instead, basically all halts end within the polling period, except a more or less constant stream of 50 per second coming from vCPUs that are not running the benchmark. The wasted time is thus very low. Things may be slightly different for Windows VMs, which have a ~10 ms timer tick.

The effect is also visible on Marcelo's recently introduced latency test for the TSC deadline timer. Though of course a non-RT kernel has awful latency bounds, the latency of the timer is around 8000-10000 clock cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC deadline timer, thus, the effect is both a smaller average latency and a smaller variance.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
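A conceptual sketch of the polling window (helper names here are schematic, not the exact KVM internals):

    /* Spin for up to halt_poll_ns before really blocking; if an interrupt
     * arrives inside the window, count a successful poll and skip the
     * expensive schedule-out/schedule-in round trip. */
    static void vcpu_halt_sketch(struct kvm_vcpu *vcpu)
    {
        ktime_t stop = ktime_add_ns(ktime_get(), halt_poll_ns);

        do {
            if (vcpu_has_pending_interrupt(vcpu)) {  /* hypothetical check */
                ++vcpu->stat.halt_successful_poll;
                return;
            }
            cpu_relax();
        } while (ktime_before(ktime_get(), stop));

        kvm_vcpu_block(vcpu);   /* poll window expired: really go idle */
    }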
-