1. 16 7月, 2014 3 次提交
  2. 02 7月, 2014 1 次提交
    • H
      perf/x86/intel: ignore CondChgd bit to avoid false NMI handling · b292d7a1
      HATAYAMA Daisuke 提交于
      Currently, any NMI is falsely handled by a NMI handler of NMI watchdog
      if CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR is set.
      
      For example, we use external NMI to make system panic to get crash
      dump, but in this case, the external NMI is falsely handled do to the
      issue.
      
      This commit deals with the issue simply by ignoring CondChgd bit.
      
      Here is explanation in detail.
      
      On x86 NMI watchdog uses performance monitoring feature to
      periodically signal NMI each time performance counter gets overflowed.
      
      intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
      handler of NMI watchdog, perf_event_nmi_handler(). It identifies an
      owner of a given NMI by looking at overflow status bits in
      MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
      handles the given NMI as its own NMI.
      
      The problem is that the intel_pmu_handle_irq() doesn't distinguish
      CondChgd bit from other bits. Unlike the other status bits, CondChgd
      bit doesn't represent overflow status for performance counters. Thus,
      CondChgd bit cannot be thought of as a mark indicating a given NMI is
      NMI watchdog's.
      
      As a result, if CondChgd bit is set, any NMI is falsely handled by the
      NMI handler of NMI watchdog. Also, if type of the falsely handled NMI
      is either NMI_UNKNOWN, NMI_SERR or NMI_IO_CHECK, the corresponding
      action is never performed until CondChgd bit is cleared.
      
      I noticed this behavior on systems with Ivy Bridge processors: Intel
      Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
      CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
      in the beginning at boot. Then the CondChgd bit is immediately cleared
      by next wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain
      0.
      
      On the other hand, on older processors such as Nehalem, Xeon E7540,
      CondChgd bit is not set in the beginning at boot.
      
      I'm not sure about exact behavior of CondChgd bit, in particular when
      this bit is set. Although I read Intel System Programmer's Manual to
      figure out that, the descriptions I found are:
      
        In 18.9.1:
      
        "The MSR_PERF_GLOBAL_STATUS MSR also provides a ¡sticky bit¢ to
         indicate changes to the state of performancmonitoring hardware"
      
        In Table 35-2 IA-32 Architectural MSRs
      
        63 CondChg: status bits of this register has changed.
      
      These are different from the bahviour I see on the actual system as I
      explained above.
      
      At least, I think ignoring CondChgd bit should be enough for NMI
      watchdog perspective.
      Signed-off-by: NHATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Acked-by: NDon Zickus <dzickus@redhat.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20140625.103503.409316067.d.hatayama@jp.fujitsu.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b292d7a1
  3. 18 6月, 2014 1 次提交
  4. 17 6月, 2014 1 次提交
  5. 14 6月, 2014 2 次提交
  6. 13 6月, 2014 2 次提交
  7. 11 6月, 2014 3 次提交
  8. 09 6月, 2014 1 次提交
  9. 08 6月, 2014 1 次提交
    • M
      x86/boot: EFI_MIXED should not prohibit loading above 4G · 745c5167
      Matt Fleming 提交于
      commit 7d453eee ("x86/efi: Wire up CONFIG_EFI_MIXED") introduced a
      regression for the functionality to load kernels above 4G. The relevant
      (incorrect) reasoning behind this change can be seen in the commit
      message,
      
        "The xloadflags field in the bzImage header is also updated to reflect
        that the kernel supports both entry points by setting both of
        XLF_EFI_HANDOVER_32 and XLF_EFI_HANDOVER_64 when CONFIG_EFI_MIXED=y.
        XLF_CAN_BE_LOADED_ABOVE_4G is disabled so that the kernel text is
        guaranteed to be addressable with 32-bits."
      
      This is obviously bogus since 32-bit EFI loaders will never place the
      kernel above the 4G mark. So this restriction is entirely unnecessary.
      
      But things are worse than that - since we want to encourage people to
      always compile with CONFIG_EFI_MIXED=y so that their kernels work out of
      the box for both 32-bit and 64-bit firmware, commit 7d453eee
      effectively disables XLF_CAN_BE_LOADED_ABOVE_4G completely.
      
      Remove the overzealous and superfluous restriction and restore the
      XLF_CAN_BE_LOADED_ABOVE_4G functionality.
      
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      Link: http://lkml.kernel.org/r/1402140380-15377-1-git-send-email-matt@console-pimps.orgSigned-off-by: NH. Peter Anvin <hpa@zytor.com>
      745c5167
  10. 07 6月, 2014 2 次提交
  11. 06 6月, 2014 1 次提交
  12. 05 6月, 2014 22 次提交
    • I
      x86/smpboot: Initialize secondary CPU only if master CPU will wait for it · 3e1a878b
      Igor Mammedov 提交于
      Hang is observed on virtual machines during CPU hotplug,
      especially in big guests with many CPUs. (It reproducible
      more often if host is over-committed).
      
      It happens because master CPU gives up waiting on
      secondary CPU and allows it to run wild. As result
      AP causes locking or crashing system. For example
      as described here:
      
         https://lkml.org/lkml/2014/3/6/257
      
      If master CPU have sent STARTUP IPI successfully,
      and AP signalled to master CPU that it's ready
      to start initialization, make master CPU wait
      indefinitely till AP is onlined.
      To ensure that AP won't ever run wild, make it
      wait at early startup till master CPU confirms its
      intention to wait for AP. If AP doesn't respond in 10
      seconds, the master CPU will timeout and cancel
      AP onlining.
      Signed-off-by: NIgor Mammedov <imammedo@redhat.com>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1401975765-22328-4-git-send-email-imammedo@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3e1a878b
    • I
      x86/smpboot: Log error on secondary CPU wakeup failure at ERR level · feef1e8e
      Igor Mammedov 提交于
      If system is running without debug level logging,
      it will not log error if do_boot_cpu() failed to
      wakeup AP. It may lead to silent AP bringup
      failures at boot time.
      Change message level to KERN_ERR to make error
      visible to user as it's done on other architectures.
      Signed-off-by: NIgor Mammedov <imammedo@redhat.com>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1401975765-22328-3-git-send-email-imammedo@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      feef1e8e
    • I
      x86: Fix list/memory corruption on CPU hotplug · 89f898c1
      Igor Mammedov 提交于
      currently if AP wake up is failed, master CPU marks AP as not
      present in do_boot_cpu() by calling set_cpu_present(cpu, false).
      That leads to following list corruption on the next physical CPU
      hotplug:
      
      [  418.107336] WARNING: CPU: 1 PID: 45 at lib/list_debug.c:33 __list_add+0xbe/0xd0()
      [  418.115268] list_add corruption. prev->next should be next (ffff88003dc57600), but was ffff88003e20c3a0. (prev=ffff88003e20c3a0).
      [  418.123693] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT ipt_REJECT cfg80211 xt_conntrack rfkill ee
      [  418.138979] CPU: 1 PID: 45 Comm: kworker/u10:1 Not tainted 3.14.0-rc6+ #387
      [  418.149989] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
      [  418.165750] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
      [  418.166433]  0000000000000021 ffff880038ca7988 ffffffff8159b22d 0000000000000021
      [  418.176460]  ffff880038ca79d8 ffff880038ca79c8 ffffffff8106942c ffff880038ca79e8
      [  418.177453]  ffff88003e20c3a0 ffff88003dc57600 ffff88003e20c3a0 00000000ffffffea
      [  418.178445] Call Trace:
      [  418.185811]  [<ffffffff8159b22d>] dump_stack+0x49/0x5c
      [  418.186440]  [<ffffffff8106942c>] warn_slowpath_common+0x8c/0xc0
      [  418.187192]  [<ffffffff81069516>] warn_slowpath_fmt+0x46/0x50
      [  418.191231]  [<ffffffff8136ef51>] ? acpi_ns_get_node+0xb7/0xc7
      [  418.193889]  [<ffffffff812f796e>] __list_add+0xbe/0xd0
      [  418.196649]  [<ffffffff812e2aa9>] kobject_add_internal+0x79/0x200
      [  418.208610]  [<ffffffff812e2e18>] kobject_add_varg+0x38/0x60
      [  418.213831]  [<ffffffff812e2ef4>] kobject_add+0x44/0x70
      [  418.229961]  [<ffffffff813e2c60>] device_add+0xd0/0x550
      [  418.234991]  [<ffffffff813f0e95>] ? pm_runtime_init+0xe5/0xf0
      [  418.250226]  [<ffffffff813e32be>] device_register+0x1e/0x30
      [  418.255296]  [<ffffffff813e82a3>] register_cpu+0xe3/0x130
      [  418.266539]  [<ffffffff81592be5>] arch_register_cpu+0x65/0x150
      [  418.285845]  [<ffffffff81355c0d>] acpi_processor_hotadd_init+0x5a/0x9b
      ...
      Which is caused by the fact that generic_processor_info() allocates
      logical CPU id by calling:
      
       cpu = cpumask_next_zero(-1, cpu_present_mask);
      
      which returns id of previously failed to wake up CPU, since its
      bit is cleared by do_boot_cpu() and as result register_cpu()
      tries to register another CPU with the same id as already
      present but failed to be onlined CPU.
      
      Taking in account that AP will not do anything if master CPU
      failed to wake it up, there is no reason to mark that AP as not
      present and break next cpu hotplug attempts. As a side effect of
      not marking AP as not present, user would be allowed to online
      it again later.
      
      Also fix memory corruption in acpi_unmap_lsapic()
      
      if during CPU hotplug master CPU failed to wake up AP
      it set percpu x86_cpu_to_apicid to BAD_APICID=0xFFFF for AP.
      
      However following attempt to unplug that CPU will lead to
      out of bound write access to __apicid_to_node[] which is
      32768 items long on x86_64 kernel.
      
      So with above fix of cpu_present_mask make sure that a present
      CPU has a valid APIC ID by not setting x86_cpu_to_apicid
      to BAD_APICID in do_boot_cpu() on failure and allow
      acpi_processor_remove()->acpi_unmap_lsapic() cleanly remove CPU.
      Signed-off-by: NIgor Mammedov <imammedo@redhat.com>
      Acked-by: NToshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1401975765-22328-2-git-send-email-imammedo@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      89f898c1
    • O
      uprobes/x86: Rename arch_uprobe->def to ->defparam, minor comment updates · 5cdb76d6
      Oleg Nesterov 提交于
      Purely cosmetic, no changes in .o,
      
      1. As Jim pointed out arch_uprobe->def looks ambiguous, rename it to
         ->defparam.
      
      2. Add the comment into default_post_xol_op() to explain "regs->sp +=".
      
      3. Remove the stale part of the comment in arch_uprobe_analyze_insn().
      Suggested-by: NJim Keniston <jkenisto@us.ibm.com>
      Reviewed-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Acked-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      5cdb76d6
    • D
      Revert "xen/pvh: Update E820 to work with PVH (v2)" · 562658f3
      David Vrabel 提交于
      This reverts commit 9103bb0f.
      
      Now than xen_memory_setup() is not called for auto-translated guests,
      we can remove this commit.
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      Acked-by: NRoger Pau Monné <roger.pau@citrix.com>
      Tested-by: NRoger Pau Monné <roger.pau@citrix.com>
      562658f3
    • D
      x86/xen: fix memory setup for PVH dom0 · abacaadc
      David Vrabel 提交于
      Since af06d66ee32b (x86: fix setup of PVH Dom0 memory map) in Xen, PVH
      dom0 need only use the memory memory provided by Xen which has already
      setup all the correct holes.
      
      xen_memory_setup() then ends up being trivial for a PVH guest so
      introduce a new function (xen_auto_xlated_memory_setup()).
      Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com>
      Acked-by: NRoger Pau Monné <roger.pau@citrix.com>
      Tested-by: NRoger Pau Monné <roger.pau@citrix.com>
      abacaadc
    • A
      perf/x86: Add conditional branch filtering support · 37548914
      Anshuman Khandual 提交于
      This patch adds conditional branch filtering support,
      enabling it for PERF_SAMPLE_BRANCH_COND in perf branch
      stack sampling framework by utilizing an available
      software filter X86_BR_JCC.
      Signed-off-by: NAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Reviewed-by: NStephane Eranian <eranian@google.com>
      Reviewed-by: NAndi Kleen <ak@linux.intel.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: mpe@ellerman.id.au
      Cc: benh@kernel.crashing.org
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1400743210-32289-3-git-send-email-khandual@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      37548914
    • V
      perf/x86: Use common PMU interrupt disabled code · c184c980
      Vince Weaver 提交于
      Make the x86 perf code use the new common PMU interrupt disabled code.
      
      Typically most x86 machines have working PMU interrupts, although
      some older p6-class machines had this problem.
      Signed-off-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1405161715560.11099@vincent-weaver-1.umelst.maine.eduSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c184c980
    • B
      kernel/printk: use symbolic defines for console loglevels · a8fe19eb
      Borislav Petkov 提交于
      ... instead of naked numbers.
      
      Stuff in sysrq.c used to set it to 8 which is supposed to mean above
      default level so set it to DEBUG instead as we're terminating/killing all
      tasks and we want to be verbose there.
      
      Also, correct the check in x86_64_start_kernel which should be >= as
      we're clearly issuing the string there for all debug levels, not only
      the magical 10.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: NRandy Dunlap <rdunlap@infradead.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a8fe19eb
    • F
      sys_sgetmask/sys_ssetmask: add CONFIG_SGETMASK_SYSCALL · f6187769
      Fabian Frederick 提交于
      sys_sgetmask and sys_ssetmask are obsolete system calls no longer
      supported in libc.
      
      This patch replaces architecture related __ARCH_WANT_SYS_SGETMAX by expert
      mode configuration.That option is enabled by default for those
      architectures.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f6187769
    • C
      hwpoison: remove unused global variable in do_machine_check() · 65eb7182
      Chen Yucong 提交于
      Remove an unused global variable mce_entry and relative operations in
      do_machine_check().
      Signed-off-by: NChen Yucong <slaoub@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65eb7182
    • E
      arch/x86/mm/numa.c: use for_each_memblock() · af4459d3
      Emil Medve 提交于
      Signed-off-by: NEmil Medve <Emilian.Medve@Freescale.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af4459d3
    • C
      mm: x86 pgtable: require X86_64 for soft-dirty tracker · 2bf01f9f
      Cyrill Gorcunov 提交于
      Tracking dirty status on 2 level pages requires very ugly macros and
      taking into account how old the machines who can operate without PAE
      mode only are, lets drop soft dirty tracker from them for code
      simplicity (note I can't drop all the macros from 2 level pages by now
      since _PAGE_BIT_PROTNONE and _PAGE_BIT_FILE are still used even without
      tracker).
      
      Linus proposed to completely rip off softdirty support on x86-32 (even
      with PAE) and since for CRIU we're not planning to support native x86-32
      mode, lets do that.
      
      (Softdirty tracker is relatively new feature which is mostly used by
      CRIU so I don't expect if such API change would cause problems for
      userspace).
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2bf01f9f
    • C
      mm: x86 pgtable: drop unneeded preprocessor ifdef · 2373eaec
      Cyrill Gorcunov 提交于
      _PAGE_BIT_FILE (bit 6) is always less than _PAGE_BIT_PROTNONE (bit 8), so
      drop redundant #ifdef.
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2373eaec
    • A
      arch/x86/kernel/pci-dma.c: fix dma_generic_alloc_coherent() when CONFIG_DMA_CMA is enabled · 38f7ea5a
      Akinobu Mita 提交于
      dma_generic_alloc_coherent() firstly attempts to allocate by
      dma_alloc_from_contiguous() if CONFIG_DMA_CMA is enabled.  But the
      memory region allocated by it may not fit within the device's DMA mask.
      This change makes it fall back to usual alloc_pages_node() allocation
      for such cases.
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38f7ea5a
    • A
      cma: add placement specifier for "cma=" kernel parameter · 5ea3b1b2
      Akinobu Mita 提交于
      Currently, "cma=" kernel parameter is used to specify the size of CMA,
      but we can't specify where it is located.  We want to locate CMA below
      4GB for devices only supporting 32-bit addressing on 64-bit systems
      without iommu.
      
      This enables to specify the placement of CMA by extending "cma=" kernel
      parameter.
      
      Examples:
       1. locate 64MB CMA below 4GB by "cma=64M@0-4G"
       2. locate 64MB CMA exact at 512MB by "cma=64M@512M"
      
      Note that the DMA contiguous memory allocator on x86 assumes that
      page_address() works for the pages to allocate.  So this change requires
      to limit end address of contiguous memory area upto max_pfn_mapped to
      prevent from locating it on highmem area by the argument of
      dma_contiguous_reserve().
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ea3b1b2
    • A
      x86: enable DMA CMA with swiotlb · 9c5a3621
      Akinobu Mita 提交于
      The DMA Contiguous Memory Allocator support on x86 is disabled when
      swiotlb config option is enabled.  So DMA CMA is always disabled on
      x86_64 because swiotlb is always enabled.  This attempts to support for
      DMA CMA with enabling swiotlb config option.
      
      The contiguous memory allocator on x86 is integrated in the function
      dma_generic_alloc_coherent() which is .alloc callback in nommu_dma_ops
      for dma_alloc_coherent().
      
      x86_swiotlb_alloc_coherent() which is .alloc callback in swiotlb_dma_ops
      tries to allocate with dma_generic_alloc_coherent() firstly and then
      swiotlb_alloc_coherent() is called as a fallback.
      
      The main part of supporting DMA CMA with swiotlb is that changing
      x86_swiotlb_free_coherent() which is .free callback in swiotlb_dma_ops
      for dma_free_coherent() so that it can distinguish memory allocated by
      dma_generic_alloc_coherent() from one allocated by
      swiotlb_alloc_coherent() and release it with dma_generic_free_coherent()
      which can handle contiguous memory.  This change requires making
      is_swiotlb_buffer() global function.
      
      This also needs to change .free callback in the dma_map_ops for amd_gart
      and sta2x11, because these dma_ops are also using
      dma_generic_alloc_coherent().
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Acked-by: NMarek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c5a3621
    • A
      x86: make dma_alloc_coherent() return zeroed memory if CMA is enabled · d92ef66c
      Akinobu Mita 提交于
      This patchset enhances the DMA Contiguous Memory Allocator on x86.
      
      Currently the DMA CMA is only supported with pci-nommu dma_map_ops and
      furthermore it can't be enabled on x86_64.  But I would like to allocate
      big contiguous memory with dma_alloc_coherent() and tell it to the device
      that requires it, regardless of which dma mapping implementation is
      actually used in the system.
      
      So this makes it work with swiotlb and intel-iommu dma_map_ops, too.  And
      this also extends "cma=" kernel parameter to specify placement constraint
      by the physical address range of memory allocations.  For example, CMA
      allocates memory below 4GB by "cma=64M@0-4G", it is required for the
      devices only supporting 32-bit addressing on 64-bit systems without iommu.
      
      This patch (of 5):
      
      Calling dma_alloc_coherent() with __GFP_ZERO must return zeroed memory.
      
      But when the contiguous memory allocator (CMA) is enabled on x86 and the
      memory region is allocated by dma_alloc_from_contiguous(), it doesn't
      return zeroed memory.  Because dma_generic_alloc_coherent() forgot to fill
      the memory region with zero if it was allocated by
      dma_alloc_from_contiguous()
      
      Most implementations of dma_alloc_coherent() return zeroed memory
      regardless of whether __GFP_ZERO is specified.  So this fixes it by
      unconditionally zeroing the allocated memory region.
      
      Alternatively, we could fix dma_alloc_from_contiguous() to return zeroed
      out memory and remove memset() from all caller of it.  But we can't simply
      remove the memset on arm because __dma_clear_buffer() is used there for
      ensuring cache flushing and it is used in many places.  Of course we can
      do redundant memset in dma_alloc_from_contiguous(), but I think this patch
      is less impact for fixing this problem.
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d92ef66c
    • Y
      x86, mm: probe memory block size for generic x86 64bit · 982792c7
      Yinghai Lu 提交于
      On system with 2TiB ram, current x86_64 have 128M as section size, and
      one memory_block only include one section.  So will have 16400 entries
      under /sys/devices/system/memory/.
      
      Current code try to use block id to find block pointer in /sys for any
      section, and reuse that block pointer.  that finding will take some time
      even after commit 7c243c71 ("mm: speedup in __early_pfn_to_nid")
      that will skip the search in that case during booting up.
      
      So solution could be increase block size just like SGI UV system did.
      (harded code to 2g).
      
      This patch is trying to probe the block size to make it match mmio remap
      size.  for example, Intel Nehalem later system will have memory range [0,
      TOML), [4g, TOMH].  If the memory hole is 2g and total is 128g, TOM will
      be 2g, and TOM2 will be 130g.
      
      We could use 2g as block size instead of default 128M.  That will reduce
      number of entries in /sys/devices/system/memory/
      
      On system 6TiB system will reduce boot time by 35 seconds.
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      982792c7
    • M
      x86: define _PAGE_NUMA by reusing software bits on the PMD and PTE levels · c46a7c81
      Mel Gorman 提交于
      _PAGE_NUMA is currently an alias of _PROT_PROTNONE to trap NUMA hinting
      faults on x86.  Care is taken such that _PAGE_NUMA is used only in
      situations where the VMA flags distinguish between NUMA hinting faults
      and prot_none faults.  This decision was x86-specific and conceptually
      it is difficult requiring special casing to distinguish between PROTNONE
      and NUMA ptes based on context.
      
      Fundamentally, we only need the _PAGE_NUMA bit to tell the difference
      between an entry that is really unmapped and a page that is protected
      for NUMA hinting faults as if the PTE is not present then a fault will
      be trapped.
      
      Swap PTEs on x86-64 use the bits after _PAGE_GLOBAL for the offset.
      This patch shrinks the maximum possible swap size and uses the bit to
      uniquely distinguish between NUMA hinting ptes and swap ptes.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c46a7c81
    • M
      x86: require x86-64 for automatic NUMA balancing · 4468dd76
      Mel Gorman 提交于
      32-bit support for NUMA is an oddity on its own but with automatic NUMA
      balancing on top there is a reasonable risk that the CPUPID information
      cannot be stored in the page flags.  This patch removes support for
      automatic NUMA support on 32-bit x86.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steven Noonan <steven@uplinklabs.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4468dd76
    • N
      hugetlb: restrict hugepage_migration_support() to x86_64 · c177c81e
      Naoya Horiguchi 提交于
      Currently hugepage migration is available for all archs which support
      pmd-level hugepage, but testing is done only for x86_64 and there're
      bugs for other archs.  So to avoid breaking such archs, this patch
      limits the availability strictly to x86_64 until developers of other
      archs get interested in enabling this feature.
      
      Simply disabling hugepage migration on non-x86_64 archs is not enough to
      fix the reported problem where sys_move_pages() hits the BUG_ON() in
      follow_page(FOLL_GET), so let's fix this by checking if hugepage
      migration is supported in vma_migratable().
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: NMichael Ellerman <mpe@ellerman.id.au>
      Tested-by: NMichael Ellerman <mpe@ellerman.id.au>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c177c81e