  1. 28 Aug 2015 (2 commits)
    • mm: ZONE_DEVICE for "device memory" · 033fbae9
      Committed by Dan Williams
      While pmem is usable as a block device or via DAX mappings to userspace,
      there are several usage scenarios that cannot target pmem due to its
      lack of struct page coverage. In preparation for "hot plugging" pmem
      into the vmemmap, add ZONE_DEVICE as a new zone to tag these pages
      separately from the ones that are subject to standard page allocations.
      Importantly, "device memory" can be removed at will when userspace
      unbinds the driver of the device.
      
      Having a separate zone prevents these pages from being handed out by
      the standard page allocator and otherwise marks them as distinct from
      typical uniform memory.  Device memory has different lifetime and
      performance characteristics than RAM.  However, since we have run out
      of ZONES_SHIFT bits, this functionality currently depends on
      sacrificing ZONE_DMA.
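
      As a rough sketch of the resulting layout (zone names as in the
      patch; the surrounding Kconfig wiring is simplified), the new zone
      slots in after ZONE_MOVABLE, while CONFIG_ZONE_DEVICE and
      CONFIG_ZONE_DMA exclude each other:

          /* include/linux/mmzone.h, simplified sketch */
          enum zone_type {
          #ifdef CONFIG_ZONE_DMA
              ZONE_DMA,           /* given up when ZONE_DEVICE is enabled */
          #endif
          #ifdef CONFIG_ZONE_DMA32
              ZONE_DMA32,
          #endif
              ZONE_NORMAL,
          #ifdef CONFIG_HIGHMEM
              ZONE_HIGHMEM,
          #endif
              ZONE_MOVABLE,
          #ifdef CONFIG_ZONE_DEVICE
              ZONE_DEVICE,        /* never handed out by the page allocator */
          #endif
              __MAX_NR_ZONES
          };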
      
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Jerome Glisse <j.glisse@gmail.com>
      [hch: various simplifications in the arch interface]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • nd_blk: change aperture mapping from WC to WB · 67a3e8fe
      Committed by Ross Zwisler
      This should result in a pretty sizeable performance gain for reads.
      For a rough comparison I did some simple read testing using PMEM,
      comparing reads through a write-combining (WC) mapping against a
      write-back (WB) one.  This was done on a random lab machine.
      
      PMEM reads from a write combining mapping:
      	# dd of=/dev/null if=/dev/pmem0 bs=4096 count=100000
      	100000+0 records in
      	100000+0 records out
      	409600000 bytes (410 MB) copied, 9.2855 s, 44.1 MB/s
      
      PMEM reads from a write-back mapping:
      	# dd of=/dev/null if=/dev/pmem0 bs=4096 count=1000000
      	1000000+0 records in
      	1000000+0 records out
      	4096000000 bytes (4.1 GB) copied, 3.44034 s, 1.2 GB/s
      
      To be able to safely support a write-back aperture I needed to add
      support for the "read flush" _DSM flag, as outlined in the DSM spec:
      
      http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
      
      This flag tells the ND BLK driver that it needs to flush the cache lines
      associated with the aperture after the aperture is moved but before any
      new data is read.  This ensures that any stale cache lines from the
      previous contents of the aperture will be discarded from the processor
      cache, and the new data will be read properly from the DIMM.  We know
      that the cache lines are clean and will be discarded without any
      writeback because either a) the previous aperture operation was a read,
      and we never modified the contents of the aperture, or b) the previous
      aperture operation was a write and we must have written back the dirtied
      contents of the aperture to the DIMM before the I/O was completed.
      
      In order to add support for the "read flush" flag I needed to add a
      generic routine to invalidate cache lines, mmio_flush_range().  This is
      protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently
      only supported on x86.
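
      As a sketch of the resulting read sequence (simplified; the real
      nd_blk code also handles error status and multiple block windows,
      and aperture_read() is a made-up name for illustration):

          static int aperture_read(void *aperture, void *buf, size_t len,
                                   bool read_flush)
          {
              /* ... move the aperture to the target DPA first ... */

              if (read_flush)
                  /* discard clean-but-stale cache lines so the loads
                   * below fetch fresh data from the DIMM */
                  mmio_flush_range(aperture, len);

              memcpy(buf, aperture, len); /* now a cached (WB) read */
              return 0;
          }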
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  2. 21 Aug 2015 (4 commits)
  3. 19 Aug 2015 (1 commit)
    • libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option · 7a67832c
      Committed by Dan Williams
      We currently register a platform device for e820 type-12 memory and
      register an nvdimm bus beneath it.  Registering the platform device
      triggers the device-core machinery to probe for a driver, but that
      search currently comes up empty.  Building the nvdimm-bus registration
      into the e820_pmem platform device registration in this way forces
      libnvdimm to be built-in.  Instead, convert the built-in portion of
      CONFIG_X86_PMEM_LEGACY to simply register a platform device and move the
      rest of the logic to the driver for e820_pmem, for the following
      reasons:
      
      1/ Letting e820_pmem support be a module allows building and testing
         libnvdimm.ko changes without rebooting
      
      2/ All the normal policy around modules can be applied to e820_pmem
         (unbind to disable and/or blacklisting the module from loading by
         default)
      
      3/ Moving the driver to a generic location and converting it to scan
         "iomem_resource" rather than "e820.map" means any other architecture can
         take advantage of this simple nvdimm resource discovery mechanism by
         registering a resource named "Persistent Memory (legacy)"
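
      As a sketch of the driver side of point 3/ (condensed from the shape
      of drivers/nvdimm/e820.c; error handling and the nvdimm bus and
      region registration are elided):

          static int e820_pmem_probe(struct platform_device *pdev)
          {
              struct resource *p;

              for (p = iomem_resource.child; p; p = p->sibling) {
                  if (strcmp(p->name, "Persistent Memory (legacy)"))
                      continue;
                  /* ... register a pmem region for *p ... */
              }
              return 0;
          }

          static struct platform_driver e820_pmem_driver = {
              .probe = e820_pmem_probe,
              .driver = {
                  .name = "e820_pmem",
              },
          };
          module_platform_driver(e820_pmem_driver);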
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  4. 15 Aug 2015 (1 commit)
  5. 26 Jul 2015 (2 commits)
    • x86/mm/pat: Revert 'Adjust default caching mode translation tables' · 1a4e8795
      Committed by Thomas Gleixner
      Toshi explains:
      
      "No, the default values need to be set to the fallback types,
       i.e. minimal supported mode.  For WC and WT, UC is the fallback type.
      
       When PAT is disabled, pat_init() does update the tables below to
       enable WT per the default BIOS setup.  However, when PAT is enabled
       but the CPU has the PAT errata, WT falls back to UC per the default
       values."
      
      Revert: ca1fec58 'x86/mm/pat: Adjust default caching mode translation tables'
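
      The rule being restored amounts to the following (an illustrative
      helper only, not the kernel's actual translation tables in
      arch/x86/mm/init.c):

          static enum page_cache_mode fallback_mode(enum page_cache_mode m)
          {
              switch (m) {
              case _PAGE_CACHE_MODE_WC:
              case _PAGE_CACHE_MODE_WT:
                  return _PAGE_CACHE_MODE_UC;   /* UC is the fallback type */
              default:
                  return m;
              }
          }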
      Requested-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Jan Beulich <jbeulich@suse.de>
      Link: http://lkml.kernel.org/r/1437577776.3214.252.camel@hp.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • perf/x86/intel/cqm: Return cached counter value from IRQ context · 2c534c0d
      Committed by Matt Fleming
      Peter reported the following potential crash, which I was able to
      reproduce with his test program:
      
      [  148.765788] ------------[ cut here ]------------
      [  148.765796] WARNING: CPU: 34 PID: 2840 at kernel/smp.c:417 smp_call_function_many+0xb6/0x260()
      [  148.765797] Modules linked in:
      [  148.765800] CPU: 34 PID: 2840 Comm: perf Not tainted 4.2.0-rc1+ #4
      [  148.765803]  ffffffff81cdc398 ffff88085f105950 ffffffff818bdfd5 0000000000000007
      [  148.765805]  0000000000000000 ffff88085f105990 ffffffff810e413a 0000000000000000
      [  148.765807]  ffffffff82301080 0000000000000022 ffffffff8107f640 ffffffff8107f640
      [  148.765809] Call Trace:
      [  148.765810]  <NMI>  [<ffffffff818bdfd5>] dump_stack+0x45/0x57
      [  148.765818]  [<ffffffff810e413a>] warn_slowpath_common+0x8a/0xc0
      [  148.765822]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
      [  148.765824]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
      [  148.765825]  [<ffffffff810e422a>] warn_slowpath_null+0x1a/0x20
      [  148.765827]  [<ffffffff811613f6>] smp_call_function_many+0xb6/0x260
      [  148.765829]  [<ffffffff8107f640>] ? intel_cqm_stable+0x60/0x60
      [  148.765831]  [<ffffffff81161748>] on_each_cpu_mask+0x28/0x60
      [  148.765832]  [<ffffffff8107f6ef>] intel_cqm_event_count+0x7f/0xe0
      [  148.765836]  [<ffffffff811cdd35>] perf_output_read+0x2a5/0x400
      [  148.765839]  [<ffffffff811d2e5a>] perf_output_sample+0x31a/0x590
      [  148.765840]  [<ffffffff811d333d>] ? perf_prepare_sample+0x26d/0x380
      [  148.765841]  [<ffffffff811d3497>] perf_event_output+0x47/0x60
      [  148.765843]  [<ffffffff811d36c5>] __perf_event_overflow+0x215/0x240
      [  148.765844]  [<ffffffff811d4124>] perf_event_overflow+0x14/0x20
      [  148.765847]  [<ffffffff8107e7f4>] intel_pmu_handle_irq+0x1d4/0x440
      [  148.765849]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
      [  148.765853]  [<ffffffff81219bad>] ? vunmap_page_range+0x19d/0x2f0
      [  148.765854]  [<ffffffff81219d11>] ? unmap_kernel_range_noflush+0x11/0x20
      [  148.765859]  [<ffffffff814ce6fe>] ? ghes_copy_tofrom_phys+0x11e/0x2a0
      [  148.765863]  [<ffffffff8109e5db>] ? native_apic_msr_write+0x2b/0x30
      [  148.765865]  [<ffffffff8109e44d>] ? x2apic_send_IPI_self+0x1d/0x20
      [  148.765869]  [<ffffffff81065135>] ? arch_irq_work_raise+0x35/0x40
      [  148.765872]  [<ffffffff811c8d86>] ? irq_work_queue+0x66/0x80
      [  148.765875]  [<ffffffff81075306>] perf_event_nmi_handler+0x26/0x40
      [  148.765877]  [<ffffffff81063ed9>] nmi_handle+0x79/0x100
      [  148.765879]  [<ffffffff81064422>] default_do_nmi+0x42/0x100
      [  148.765880]  [<ffffffff81064563>] do_nmi+0x83/0xb0
      [  148.765884]  [<ffffffff818c7c0f>] end_repeat_nmi+0x1e/0x2e
      [  148.765886]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
      [  148.765888]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
      [  148.765890]  [<ffffffff811d07a6>] ? __perf_event_task_sched_in+0x36/0xa0
      [  148.765891]  <<EOE>>  [<ffffffff8110ab66>] finish_task_switch+0x156/0x210
      [  148.765898]  [<ffffffff818c1671>] __schedule+0x341/0x920
      [  148.765899]  [<ffffffff818c1c87>] schedule+0x37/0x80
      [  148.765903]  [<ffffffff810ae1af>] ? do_page_fault+0x2f/0x80
      [  148.765905]  [<ffffffff818c1f4a>] schedule_user+0x1a/0x50
      [  148.765907]  [<ffffffff818c666c>] retint_careful+0x14/0x32
      [  148.765908] ---[ end trace e33ff2be78e14901 ]---
      
      The CQM task events are not safe to call from within interrupt
      context because they require an IPI to read the counter value on all
      sockets.  And performing IPIs from within IRQ context is a "no-no".
      
      Make do with the last-read counter value currently in event->count
      when we're invoked in this context.
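
      A sketch of the fix (simplified from intel_cqm_event_count(); the
      handling of reads already in flight is elided):

          static u64 intel_cqm_event_count(struct perf_event *event)
          {
              /*
               * An up-to-date value would need an IPI to every socket,
               * which is not allowed from IRQ/NMI context, so fall
               * back to the last value we read.
               */
              if (unlikely(in_interrupt()))
                  return local64_read(&event->count);

              /* ... normal path: IPI via on_each_cpu_mask() ... */
          }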
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vikas Shivappa <vikas.shivappa@intel.com>
      Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
      Cc: Will Auld <will.auld@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1437490509-15373-1-git-send-email-matt@codeblueprint.co.uk
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  6. 24 Jul 2015 (2 commits)
  7. 23 Jul 2015 (5 commits)
  8. 22 Jul 2015 (2 commits)
    • x86/mm: Remove region_is_ram() call from ioremap · 9a58eebe
      Committed by Toshi Kani
      __ioremap_caller() calls region_is_ram() to walk through the
      iomem_resource table to check if a target range is in RAM, which was
      added to improve the lookup performance over page_is_ram() (commit
      906e36c5 "x86: use optimized ioresource lookup in ioremap
      function"). page_is_ram() was no longer used when this change was
      added, though.
      
      __ioremap_caller() then calls walk_system_ram_range(), which had
      replaced page_is_ram() to improve the lookup performance (commit
      c81c8a1e "x86, ioremap: Speed up check for RAM pages").
      
      Since both checks walk through the same iomem_resource table for
      the same purpose, there is no need to call both functions.
      
      Aside from that, walk_system_ram_range() is the only useful check at
      the moment, because region_is_ram() always returns -1 due to an
      implementation bug.  That bug in region_is_ram() cannot be fixed
      without breaking existing ioremap callers, which rely on the subtle
      difference in how walk_system_ram_range() treats non-page-aligned
      ranges.
      
      Once these offending callers are fixed we can use region_is_ram() and
      remove walk_system_ram_range().
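
      The before/after shape in __ioremap_caller(), approximated:

          /* before: two walks over the same iomem_resource table */
          if (region_is_ram(phys_addr, size) > 0) /* never true: returns -1 */
              return NULL;
          if (walk_system_ram_range(pfn, last_pfn - pfn + 1, NULL,
                                    __ioremap_check_ram) == 1)
              return NULL;

          /* after: only the check that actually works remains */
          if (walk_system_ram_range(pfn, last_pfn - pfn + 1, NULL,
                                    __ioremap_check_ram) == 1)
              return NULL;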
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Roland Dreier <roland@purestorage.com>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/1437088996-28511-3-git-send-email-toshi.kani@hp.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86/mm: Move warning from __ioremap_check_ram() to the call site · 1c9cf9b2
      Committed by Toshi Kani
      __ioremap_check_ram() has a WARN_ONCE() which is emitted when the
      given pfn range is not RAM. The warning is bogus in two aspects:
      
      - it never triggers, since walk_system_ram_range() only calls
        __ioremap_check_ram() for RAM ranges.
      
      - the warning message is wrong, as it says "ioremap on RAM" after it
        has established that the pfn range is not RAM.
      
      Move the WARN_ONCE() to __ioremap_caller(), and update the message to
      include the address range so we get an actual warning when something
      tries to ioremap system RAM.
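
      A sketch of the relocated warning (the message shape is
      approximated):

          /* in __ioremap_caller(), where "is RAM" is actually decided */
          if (walk_system_ram_range(pfn, last_pfn - pfn + 1, NULL,
                                    __ioremap_check_ram) == 1) {
              WARN_ONCE(1, "ioremap on RAM at %pa - %pa\n",
                        &phys_addr, &last_addr);
              return NULL;
          }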
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Roland Dreier <roland@purestorage.com>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/1437088996-28511-2-git-send-email-toshi.kani@hp.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  9. 21 Jul 2015 (4 commits)
  10. 18 Jul 2015 (3 commits)
  11. 17 Jul 2015 (9 commits)
    • x86/entry/64, x86/nmi/64: Add CONFIG_DEBUG_ENTRY NMI testing code · a97439aa
      Committed by Andy Lutomirski
      It turns out to be rather tedious to test the NMI nesting code.
      Make it easier: add a new CONFIG_DEBUG_ENTRY option that causes
      the NMI handler to pre-emptively unmask NMIs.
      
      With this option set, errors in the repeat_nmi logic or failures
      to detect that we're in a nested NMI will result in quick panics
      under perf (especially if multiple counters are running at high
      frequency) instead of requiring an unusual workload that
      generates page faults or breakpoints inside NMIs.
      
      I called it CONFIG_DEBUG_ENTRY instead of CONFIG_DEBUG_NMI_ENTRY
      because I want to add new non-NMI checks elsewhere in the entry
      code in the future, and I'd rather not add too many new config
      options or add this option and then immediately rename it.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/nmi/64: Make the "NMI executing" variable more consistent · 36f1a77b
      Committed by Andy Lutomirski
      Currently, "NMI executing" is one the first time an outermost
      NMI hits repeat_nmi and zero thereafter.  Change it to be zero
      each time for consistency.
      
      This is intended to help NMI handling fail harder if it's buggy.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/nmi/64: Minor asm simplification · 23a781e9
      Committed by Andy Lutomirski
      Replace LEA; MOV with an equivalent SUB.  This saves one
      instruction.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detection · 810bc075
      Committed by Andy Lutomirski
      We have a tricky bug in the nested NMI code: if we see RSP
      pointing to the NMI stack on NMI entry from kernel mode, we
      assume that we are executing a nested NMI.
      
      This isn't quite true.  A malicious userspace program can point
      RSP at the NMI stack, issue SYSCALL, and arrange for an NMI to
      happen while RSP is still pointing at the NMI stack.
      
      Fix it with a sneaky trick.  Set DF in the region of code that
      the RSP check is intended to detect.  IRET will clear DF
      atomically.
      
      ( Note: if it weren't for paravirt, there would be little need for
        all this complexity; we could check RIP instead of RSP. )
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/nmi/64: Reorder nested NMI checks · a27507ca
      Committed by Andy Lutomirski
      Check the repeat_nmi .. end_repeat_nmi special case first.  The
      next patch will rework the RSP check and, as a side effect, the
      RSP check will no longer detect repeat_nmi .. end_repeat_nmi, so
      we'll need this ordering of the checks.
      
      Note: this is more subtle than it appears.  The check for
      repeat_nmi .. end_repeat_nmi jumps straight out of the NMI code
      instead of adjusting the "iret" frame to force a repeat.  This
      is necessary, because the code between repeat_nmi and
      end_repeat_nmi sets "NMI executing" and then writes to the
      "iret" frame itself.  If a nested NMI comes in and modifies the
      "iret" frame while repeat_nmi is also modifying it, we'll end up
      with garbage.  The old code got this right, as does the new
      code, but the new code is a bit more explicit.
      
      If we were to move the check right after the "NMI executing"
      check, then we'd get it wrong and have random crashes.
      
      ( Because the "NMI executing" check would jump to the code that would
        modify the "iret" frame without checking if the interrupted NMI was
        currently modifying it. )
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/nmi/64: Improve nested NMI comments · 0b22930e
      Committed by Andy Lutomirski
      I found the nested NMI documentation to be difficult to follow.
      Improve the comments.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/nmi/64: Switch stacks on userspace NMI entry · 9b6e6a83
      Committed by Andy Lutomirski
      Returning to userspace is tricky: IRET can fail, and ESPFIX can
      rearrange the stack prior to IRET.
      
      The NMI nesting fixup relies on a precise stack layout and
      atomic IRET.  Rather than trying to teach the NMI nesting fixup
      to handle ESPFIX and failed IRET, punt: run NMIs that came from
      user mode on the normal kernel stack.
      
      This will make some nested NMIs visible to C code, but the C
      code is okay with that.
      
      As a side effect, this should speed up perf: it eliminates an
      RDMSR when NMIs come from user mode.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/nmi/64: Remove asm code that saves CR2 · 0e181bb5
      Committed by Andy Lutomirski
      Now that do_nmi saves CR2, we don't need to save it in asm.
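
      The C side that makes the asm save unnecessary, sketched from the
      shape of do_nmi() after the nesting rework:

          this_cpu_write(nmi_cr2, read_cr2());
          /* ... run NMI handlers; a nested #PF may clobber CR2 ... */
          if (unlikely(this_cpu_read(nmi_cr2) != read_cr2()))
              write_cr2(this_cpu_read(nmi_cr2));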
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Acked-by: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/nmi: Enable nested do_nmi() handling for 64-bit kernels · 9d050416
      Committed by Andy Lutomirski
      32-bit kernels handle nested NMIs in C.  Enable the exact same
      handling on 64-bit kernels as well.  This isn't currently
      necessary, but it will become necessary once the asm code starts
      allowing limited nesting.
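
      The nesting logic being enabled, in simplified form (condensed from
      do_nmi(); the CR2 handling from the entry above is elided):

          enum nmi_states { NMI_NOT_RUNNING = 0, NMI_EXECUTING, NMI_LATCHED };
          static DEFINE_PER_CPU(enum nmi_states, nmi_state);

          dotraplinkage notrace void do_nmi(struct pt_regs *regs, long error_code)
          {
              if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) {
                  this_cpu_write(nmi_state, NMI_LATCHED);
                  return;             /* remember the nested NMI */
              }
              this_cpu_write(nmi_state, NMI_EXECUTING);
          nmi_restart:
              /* ... the actual NMI handling ... */

              if (this_cpu_dec_return(nmi_state))
                  goto nmi_restart;   /* replay the latched NMI */
          }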
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 15 Jul 2015 (1 commit)
    • genirq: Revert sparse irq locking around __cpu_up() and move it to x86 for now · ce0d3c0a
      Committed by Thomas Gleixner
      Boris reported that the sparse_irq protection around __cpu_up() in
      the generic code causes a regression on Xen. Xen allocates interrupts
      (and more) in its xen_cpu_up() function, so it deadlocks on the
      sparse_irq_lock.
      
      There is no simple fix for this, and we really should have the
      protection for all architectures, but for now the only solution is to
      move it to x86, where actual wreckage due to the lack of protection
      has been observed.
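
      A sketch of where the protection lives now (x86's native_cpu_up(),
      simplified):

          /* prevent alloc/free of irq descriptors while the new CPU
           * is being brought up */
          irq_lock_sparse();

          err = do_boot_cpu(apicid, cpu, tidle);
          /* ... wait for the AP to come online ... */

          irq_unlock_sparse();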
      Reported-and-tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Fixes: a8994181 'hotplug: Prevent alloc/free of irq descriptors during cpu up/down'
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: xiao jin <jin.xiao@intel.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
      Cc: xen-devel <xen-devel@lists.xenproject.org>
  13. 10 Jul 2015 (4 commits)
    • kvm: x86: fix load xsave feature warning · ee4100da
      Committed by Wanpeng Li
      [   68.196974] WARNING: CPU: 1 PID: 2140 at arch/x86/kvm/x86.c:3161 kvm_arch_vcpu_ioctl+0xe88/0x1340 [kvm]()
      [   68.196975] Modules linked in: snd_hda_codec_hdmi i915 rfcomm bnep bluetooth i2c_algo_bit rfkill nfsd drm_kms_helper nfs_acl nfs drm lockd grace sunrpc fscache snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_dummy snd_seq_oss x86_pkg_temp_thermal snd_seq_midi kvm_intel snd_seq_midi_event snd_rawmidi kvm snd_seq ghash_clmulni_intel fuse snd_timer aesni_intel parport_pc ablk_helper snd_seq_device cryptd ppdev snd lp parport lrw dcdbas gf128mul i2c_core glue_helper lpc_ich video shpchp mfd_core soundcore serio_raw acpi_cpufreq ext4 mbcache jbd2 sd_mod crc32c_intel ahci libahci libata e1000e ptp pps_core
      [   68.197005] CPU: 1 PID: 2140 Comm: qemu-system-x86 Not tainted 4.2.0-rc1+ #2
      [   68.197006] Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
      [   68.197007]  ffffffffa03b0657 ffff8800d984bca8 ffffffff815915a2 0000000000000000
      [   68.197009]  0000000000000000 ffff8800d984bce8 ffffffff81057c0a 00007ff6d0001000
      [   68.197010]  0000000000000002 ffff880211c1a000 0000000000000004 ffff8800ce0288c0
      [   68.197012] Call Trace:
      [   68.197017]  [<ffffffff815915a2>] dump_stack+0x45/0x57
      [   68.197020]  [<ffffffff81057c0a>] warn_slowpath_common+0x8a/0xc0
      [   68.197022]  [<ffffffff81057cfa>] warn_slowpath_null+0x1a/0x20
      [   68.197029]  [<ffffffffa037bed8>] kvm_arch_vcpu_ioctl+0xe88/0x1340 [kvm]
      [   68.197035]  [<ffffffffa037aede>] ? kvm_arch_vcpu_load+0x4e/0x1c0 [kvm]
      [   68.197040]  [<ffffffffa03696a6>] kvm_vcpu_ioctl+0xc6/0x5c0 [kvm]
      [   68.197043]  [<ffffffff811252d2>] ? perf_pmu_enable+0x22/0x30
      [   68.197044]  [<ffffffff8112663e>] ? perf_event_context_sched_in+0x7e/0xb0
      [   68.197048]  [<ffffffff811a6882>] do_vfs_ioctl+0x2c2/0x4a0
      [   68.197050]  [<ffffffff8107bf33>] ? finish_task_switch+0x173/0x220
      [   68.197053]  [<ffffffff8123307f>] ? selinux_file_ioctl+0x4f/0xd0
      [   68.197055]  [<ffffffff8122cac3>] ? security_file_ioctl+0x43/0x60
      [   68.197057]  [<ffffffff811a6ad9>] SyS_ioctl+0x79/0x90
      [   68.197060]  [<ffffffff81597e57>] entry_SYSCALL_64_fastpath+0x12/0x6a
      [   68.197061] ---[ end trace 558a5ebf9445fc80 ]---
      
      After commit 0c4109be ('x86/fpu/xstate: Fix up bad get_xsave_addr()
      assumptions') there is no longer an assumption that an xsave bit
      present in the hardware (pcntxt_mask) is always present in a given
      xsave buffer.  An enabled state can be present in 'pcntxt_mask' but
      *not* in 'xstate_bv' when the last 'xsave' did not request that this
      feature be saved (unlikely) or because the "init optimization" caused
      it to not be saved.  This patch kills the assumption.
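
      A sketch of the adjusted copy loop (simplified from the xsave
      load/fill paths in arch/x86/kvm/x86.c; src is the source buffer):

          u64 valid = xstate_bv & ~XSTATE_FPSSE;

          while (valid) {
              u64 feature = valid & -valid;   /* lowest set bit */
              void *dest = get_xsave_addr(xsave, feature);

              /* may be NULL for a state absent from this buffer */
              if (dest) {
                  u32 size, offset, ecx, edx;

                  cpuid_count(XSTATE_CPUID, fls64(feature) - 1,
                              &size, &offset, &ecx, &edx);
                  memcpy(dest, src + offset, size);
              }
              valid -= feature;
          }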
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: apply guest MTRR virtualization on host reserved pages · fd717f11
      Committed by Paolo Bonzini
      Currently, guest MTRR settings are ignored if kvm_is_reserved_pfn
      returns true.  However, the guest could prefer a page type other than
      UC for such pages. A good example is a passed-through VGA frame
      buffer, which is not always UC as the host expects.
      
      This patch enables full use of virtual guest MTRRs.
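
      Roughly, the effect on the EPT memtype lookup (shape of
      vmx_get_mt_mask() around this series; helper names approximated):

          static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn,
                                     bool is_mmio)
          {
              if (is_mmio)
                  return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

              if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
                  return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) |
                         VMX_EPT_IPAT_BIT;    /* coherent DMA: WB is safe */

              /* previously, kvm_is_reserved_pfn() forced UC here; now
               * the guest MTRR type applies to reserved pages too */
              return kvm_mtrr_get_guest_memory_type(vcpu, gfn) <<
                     VMX_EPT_MT_EPTE_SHIFT;
          }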
      Suggested-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Tested-by: Joerg Roedel <jroedel@suse.de> (on AMD)
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Sync g_pat with guest-written PAT value · e098223b
      Committed by Jan Kiszka
      When hardware supports the g_pat VMCB field, we can use it for emulating
      the PAT configuration that the guest configures by writing to the
      corresponding MSR.
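
      A sketch of the MSR write path (condensed from svm_set_msr()):

          case MSR_IA32_CR_PAT:
              if (npt_enabled) {
                  if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
                      return 1;       /* reject an invalid PAT value */
                  vcpu->arch.pat = data;
                  svm->vmcb->save.g_pat = data;
                  mark_dirty(svm->vmcb, VMCB_NPT);
                  break;
              }
              /* otherwise fall through to the common PAT handling */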
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Tested-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: use NPT page attributes · 3c2e7f7d
      Committed by Paolo Bonzini
      Right now, NPT page attributes are not used, and the final page
      attribute depends solely on gPAT (which however is not synced
      correctly), the guest MTRRs and the guest page attributes.
      
      However, we can do better by mimicking what is done for VMX.
      In the absence of PCI passthrough, the guest PAT can be ignored
      and the page attributes can be just WB.  If passthrough is being
      used, instead, keep respecting the guest PAT, and emulate the guest
      MTRRs through the PAT field of the nested page tables.
      
      The only snag is that WP memory cannot be emulated correctly,
      because Linux's default PAT setting only includes the other types.
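
      The resulting gPAT policy, roughly (condensed from the shape of the
      SVM code this series adds):

          /* AMD has no equivalent of VMX's IPAT bit, so ignoring the
           * guest PAT means pointing every gPAT entry at WB instead. */
          if (!kvm_arch_has_assigned_device(vcpu->kvm))
              *g_pat = 0x0606060606060606ULL; /* all-WB PAT */
          else
              *g_pat = vcpu->arch.pat;        /* honor the guest PAT */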
      Tested-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>