1. 15 7月, 2016 10 次提交
  2. 13 7月, 2016 4 次提交
    • D
      x86/mm: Use pte_none() to test for empty PTE · dcb32d99
      Dave Hansen 提交于
      The page table manipulation code seems to have grown a couple of
      sites that are looking for empty PTEs.  Just in case one of these
      entries got a stray bit set, use pte_none() instead of checking
      for a zero pte_val().
      
      The use pte_same() makes me a bit nervous.  If we were doing a
      pte_same() check against two cleared entries and one of them had
      a stray bit set, it might fail the pte_same() check.  But, I
      don't think we ever _do_ pte_same() for cleared entries.  It is
      almost entirely used for checking for races in fault-in paths.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: dave.hansen@intel.com
      Cc: linux-mm@kvack.org
      Cc: mhocko@suse.com
      Link: http://lkml.kernel.org/r/20160708001915.813703D9@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      dcb32d99
    • D
      x86/mm: Disallow running with 32-bit PTEs to work around erratum · e4a84be6
      Dave Hansen 提交于
      The Intel(R) Xeon Phi(TM) Processor x200 Family (codename: Knights
      Landing) has an erratum where a processor thread setting the Accessed
      or Dirty bits may not do so atomically against its checks for the
      Present bit.  This may cause a thread (which is about to page fault)
      to set A and/or D, even though the Present bit had already been
      atomically cleared.
      
      These bits are truly "stray".  In the case of the Dirty bit, the
      thread associated with the stray set was *not* allowed to write to
      the page.  This means that we do not have to launder the bit(s); we
      can simply ignore them.
      
      If the PTE is used for storing a swap index or a NUMA migration index,
      the A bit could be misinterpreted as part of the swap type.  The stray
      bits being set cause a software-cleared PTE to be interpreted as a
      swap entry.  In some cases (like when the swap index ends up being
      for a non-existent swapfile), the kernel detects the stray value
      and WARN()s about it, but there is no guarantee that the kernel can
      always detect it.
      
      When we have 64-bit PTEs (64-bit mode or 32-bit PAE), we were able
      to move the swap PTE format around to avoid these troublesome bits.
      But, 32-bit non-PAE is tight on bits.  So, disallow it from running
      on this hardware.  I can't imagine anyone wanting to run 32-bit
      non-highmem kernels on this hardware, but disallowing them from
      running entirely is surely the safe thing to do.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: dave.hansen@intel.com
      Cc: linux-mm@kvack.org
      Cc: mhocko@suse.com
      Link: http://lkml.kernel.org/r/20160708001914.D0B50110@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e4a84be6
    • D
      x86/mm: Ignore A/D bits in pte/pmd/pud_none() · 97e3c602
      Dave Hansen 提交于
      The erratum we are fixing here can lead to stray setting of the
      A and D bits.  That means that a pte that we cleared might
      suddenly have A/D set.  So, stop considering those bits when
      determining if a pte is pte_none().  The same goes for the
      other pmd_none() and pud_none().  pgd_none() can be skipped
      because it is not affected; we do not use PGD entries for
      anything other than pagetables on affected configurations.
      
      This adds a tiny amount of overhead to all pte_none() checks.
      I doubt we'll be able to measure it anywhere.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: dave.hansen@intel.com
      Cc: linux-mm@kvack.org
      Cc: mhocko@suse.com
      Link: http://lkml.kernel.org/r/20160708001912.5216F89C@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      97e3c602
    • D
      x86/mm: Move swap offset/type up in PTE to work around erratum · 00839ee3
      Dave Hansen 提交于
      This erratum can result in Accessed/Dirty getting set by the hardware
      when we do not expect them to be (on !Present PTEs).
      
      Instead of trying to fix them up after this happens, we just
      allow the bits to get set and try to ignore them.  We do this by
      shifting the layout of the bits we use for swap offset/type in
      our 64-bit PTEs.
      
      It looks like this:
      
       bitnrs: |     ...            | 11| 10|  9|8|7|6|5| 4| 3|2|1|0|
       names:  |     ...            |SW3|SW2|SW1|G|L|D|A|CD|WT|U|W|P|
       before: |         OFFSET (9-63)          |0|X|X| TYPE(1-5) |0|
        after: | OFFSET (14-63)  |  TYPE (9-13) |0|X|X|X| X| X|X|X|0|
      
      Note that D was already a don't care (X) even before.  We just
      move TYPE up and turn its old spot (which could be hit by the
      A bit) into all don't cares.
      
      We take 5 bits away from the offset, but that still leaves us
      with 50 bits which lets us index into a 62-bit swapfile (4 EiB).
      I think that's probably fine for the moment.  We could
      theoretically reclaim 5 of the bits (1, 2, 3, 4, 7) but it
      doesn't gain us anything.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: dave.hansen@intel.com
      Cc: linux-mm@kvack.org
      Cc: mhocko@suse.com
      Link: http://lkml.kernel.org/r/20160708001911.9A3FD2B6@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      00839ee3
  3. 10 7月, 2016 2 次提交
  4. 08 7月, 2016 4 次提交
    • B
      x86/asm/entry: Make thunk's restore a local label · 9a7e7b57
      Borislav Petkov 提交于
      No need to have it appear in objdump output.
      
      No functionality change.
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160708141016.GH3808@pd.tnicSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9a7e7b57
    • D
      x86/vdso: Add mremap hook to vm_special_mapping · b059a453
      Dmitry Safonov 提交于
      Add possibility for 32-bit user-space applications to move
      the vDSO mapping.
      
      Previously, when a user-space app called mremap() for the vDSO
      address, in the syscall return path it would land on the previous
      address of the vDSOpage, resulting in segmentation violation.
      
      Now it lands fine and returns to userspace with a remapped vDSO.
      
      This will also fix the context.vdso pointer for 64-bit, which does
      not affect the user of vDSO after mremap() currently, but this
      may change in the future.
      
      As suggested by Andy, return -EINVAL for mremap() that would
      split the vDSO image: that operation cannot possibly result in
      a working system so reject it.
      
      Renamed and moved the text_mapping structure declaration inside
      map_vdso(), as it used only there and now it complements the
      vvar_mapping variable.
      
      There is still a problem for remapping the vDSO in glibc
      applications: the linker relocates addresses for syscalls
      on the vDSO page, so you need to relink with the new
      addresses.
      
      Without that the next syscall through glibc may fail:
      
        Program received signal SIGSEGV, Segmentation fault.
        #0  0xf7fd9b80 in __kernel_vsyscall ()
        #1  0xf7ec8238 in _exit () from /usr/lib32/libc.so.6
      Signed-off-by: NDmitry Safonov <dsafonov@virtuozzo.com>
      Acked-by: NAndy Lutomirski <luto@kernel.org>
      Cc: 0x7f454c46@gmail.com
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160628113539.13606-2-dsafonov@virtuozzo.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b059a453
    • J
      x86/mm/pat, /dev/mem: Remove superfluous error message · 39380b80
      Jiri Kosina 提交于
      Currently it's possible for broken (or malicious) userspace to flood a
      kernel log indefinitely with messages a-la
      
      	Program dmidecode tried to access /dev/mem between f0000->100000
      
      because range_is_allowed() is case of CONFIG_STRICT_DEVMEM being turned on
      dumps this information each and every time devmem_is_allowed() fails.
      
      Reportedly userspace that is able to trigger contignuous flow of these
      messages exists.
      
      It would be possible to rate limit this message, but that'd have a
      questionable value; the administrator wouldn't get information about all
      the failing accessess, so then the information would be both superfluous
      and incomplete at the same time :)
      
      Returning EPERM (which is what is actually happening) is enough indication
      for userspace what has happened; no need to log this particular error as
      some sort of special condition.
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1607081137020.24757@cbobk.fhfr.pmSigned-off-by: NIngo Molnar <mingo@kernel.org>
      39380b80
    • G
      arm64: Enable workaround for Cavium erratum 27456 on thunderx-81xx · 47c459be
      Ganapatrao Kulkarni 提交于
      Cavium erratum 27456 commit 104a0c02
      ("arm64: Add workaround for Cavium erratum 27456")
      is applicable for thunderx-81xx pass1.0 SoC as well.
      Adding code to enable to 81xx.
      Signed-off-by: NGanapatrao Kulkarni <gkulkarni@cavium.com>
      Reviewed-by: NAndrew Pinski <apinski@cavium.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      47c459be
  5. 07 7月, 2016 1 次提交
  6. 06 7月, 2016 1 次提交
  7. 03 7月, 2016 2 次提交
    • J
      perf/x86: Fix 32-bit perf user callgraph collection · fc188225
      Josh Poimboeuf 提交于
      A basic perf callgraph record operation causes an immediate panic on a
      32-bit kernel compiled with CONFIG_CC_STACKPROTECTOR=y:
      
        $ perf record -g ls
        Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: c0404fbd
      
        CPU: 0 PID: 998 Comm: ls Not tainted 4.7.0-rc5+ #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
         c0dd5967 ff7afe1c 00000086 f41dbc2c c07445a0 464c457f f41dbca8 f41dbc44
         c05646f4 f41dbca8 464c457f f41dbca8 464c457f f41dbc54 c04625be c0ce56fc
         c0404fbd f41dbc88 c0404fbd b74668f0 f41dc000 00000000 c0000000 00000000
        Call Trace:
         [<c07445a0>] dump_stack+0x58/0x78
         [<c05646f4>] panic+0x8e/0x1c6
         [<c04625be>] __stack_chk_fail+0x1e/0x30
         [<c0404fbd>] ? perf_callchain_user+0x22d/0x230
         [<c0404fbd>] perf_callchain_user+0x22d/0x230
         [<c055f89f>] get_perf_callchain+0x1ff/0x270
         [<c055f988>] perf_callchain+0x78/0x90
         [<c055c7eb>] perf_prepare_sample+0x24b/0x370
         [<c055c934>] perf_event_output_forward+0x24/0x70
         [<c05531c0>] __perf_event_overflow+0xa0/0x210
         [<c0550a93>] ? cpu_clock_event_read+0x43/0x50
         [<c0553431>] perf_swevent_hrtimer+0x101/0x180
         [<c0456235>] ? kmap_atomic_prot+0x35/0x140
         [<c056dc69>] ? get_page_from_freelist+0x279/0x950
         [<c058fdd8>] ? vma_interval_tree_remove+0x158/0x230
         [<c05939f4>] ? wp_page_copy.isra.82+0x2f4/0x630
         [<c05a050d>] ? page_add_file_rmap+0x1d/0x50
         [<c0565611>] ? unlock_page+0x61/0x80
         [<c0566755>] ? filemap_map_pages+0x305/0x320
         [<c059769f>] ? handle_mm_fault+0xb7f/0x1560
         [<c074cbeb>] ? timerqueue_del+0x1b/0x70
         [<c04cfefe>] ? __remove_hrtimer+0x2e/0x60
         [<c04d017b>] __hrtimer_run_queues+0xcb/0x2a0
         [<c0553330>] ? __perf_event_overflow+0x210/0x210
         [<c04d0a2a>] hrtimer_interrupt+0x8a/0x180
         [<c043ecc2>] local_apic_timer_interrupt+0x32/0x60
         [<c043f643>] smp_apic_timer_interrupt+0x33/0x50
         [<c0b0cd38>] apic_timer_interrupt+0x34/0x3c
        Kernel Offset: disabled
        ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: c0404fbd
      
      The panic is caused by the fact that perf_callchain_user() mistakenly
      assumes it's 64-bit only and ends up corrupting the stack.
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: stable@vger.kernel.org # v4.5+
      Fixes: 75925e1a ("perf/x86: Optimize stack walk user accesses")
      Link: http://lkml.kernel.org/r/1a547f5077ec30f75f9b57074837c3c80df86e5e.1467432113.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fc188225
    • S
      perf/x86/intel: Update event constraints when HT is off · 9010ae4a
      Stephane Eranian 提交于
      This patch updates the event constraints for non-PEBS mode for
      Intel Broadwell and Skylake processors. When HT is off, each
      CPU gets 8 generic counters. However, not all events can be
      programmed on any of the 8 counters.  This patch adds the
      constraints for the MEM_* events which can only be measured on the
      bottom 4 counters. The constraints are also valid when HT is off
      because, then, there are only 4 generic counters and they are the
      bottom counters.
      Signed-off-by: NStephane Eranian <eranian@google.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: kan.liang@intel.com
      Link: http://lkml.kernel.org/r/1467411742-13245-1-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9010ae4a
  8. 02 7月, 2016 2 次提交
    • R
      MIPS: Fix possible corruption of cache mode by mprotect. · 6d037de9
      Ralf Baechle 提交于
      The following testcase may result in a page table entries with a invalid
      CCA field being generated:
      
      static void *bindstack;
      
      static int sysrqfd;
      
      static void protect_low(int protect)
      {
      	mprotect(bindstack, BINDSTACK_SIZE, protect);
      }
      
      static void sigbus_handler(int signal, siginfo_t * info, void *context)
      {
      	void *addr = info->si_addr;
      
      	write(sysrqfd, "x", 1);
      
      	printf("sigbus, fault address %p (should not happen, but might)\n",
      	       addr);
      	abort();
      }
      
      static void run_bind_test(void)
      {
      	unsigned int *p = bindstack;
      
      	p[0] = 0xf001f001;
      
      	write(sysrqfd, "x", 1);
      
      	/* Set trap on access to p[0] */
      	protect_low(PROT_NONE);
      
      	write(sysrqfd, "x", 1);
      
      	/* Clear trap on access to p[0] */
      	protect_low(PROT_READ | PROT_WRITE | PROT_EXEC);
      
      	write(sysrqfd, "x", 1);
      
      	/* Check the contents of p[0] */
      	if (p[0] != 0xf001f001) {
      		write(sysrqfd, "x", 1);
      
      		/* Reached, but shouldn't be */
      		printf("badness, shouldn't happen but does\n");
      		abort();
      	}
      }
      
      int main(void)
      {
      	struct sigaction sa;
      
      	sysrqfd = open("/proc/sysrq-trigger", O_WRONLY);
      
      	if (sigprocmask(SIG_BLOCK, NULL, &sa.sa_mask)) {
      		perror("sigprocmask");
      		return 0;
      	}
      
      	sa.sa_sigaction = sigbus_handler;
      	sa.sa_flags = SA_SIGINFO | SA_NODEFER | SA_RESTART;
      	if (sigaction(SIGBUS, &sa, NULL)) {
      		perror("sigaction");
      		return 0;
      	}
      
      	bindstack = mmap(NULL,
      			 BINDSTACK_SIZE,
      			 PROT_READ | PROT_WRITE | PROT_EXEC,
      			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      	if (bindstack == MAP_FAILED) {
      		perror("mmap bindstack");
      		return 0;
      	}
      
      	printf("bindstack: %p\n", bindstack);
      
      	run_bind_test();
      
      	printf("done\n");
      
      	return 0;
      }
      
      There are multiple ingredients for this:
      
       1) PAGE_NONE is defined to _CACHE_CACHABLE_NONCOHERENT, which is CCA 3
          on all platforms except SB1 where it's CCA 5.
       2) _page_cachable_default must have bits set which are not set
          _CACHE_CACHABLE_NONCOHERENT.
       3) Either the defective version of pte_modify for XPA or the standard
          version must be in used.  However pte_modify for the 36 bit address
          space support is no affected.
      
      In that case additional bits in the final CCA mode may generate an invalid
      value for the CCA field.  On the R10000 system where this was tracked
      down for example a CCA 7 has been observed, which is Uncached Accelerated.
      
      Fixed by:
      
       1) Using the proper CCA mode for PAGE_NONE just like for all the other
          PAGE_* pte/pmd bits.
       2) Fix the two affected variants of pte_modify.
      
      Further code inspection also shows the same issue to exist in pmd_modify
      which would affect huge page systems.
      
      Issue in pte_modify tracked down by Alastair Bridgewater, PAGE_NONE
      and pmd_modify issue found by me.
      
      The history of this goes back beyond Linus' git history.  Chris Dearman's
      commit 35133692 ("[MIPS] Allow setting of
      the cache attribute at run time.") missed the opportunity to fix this
      but it was originally introduced in lmo commit
      d523832cf12007b3242e50bb77d0c9e63e0b6518 ("Missing from last commit.")
      and 32cc38229ac7538f2346918a09e75413e8861f87 ("New configuration option
      CONFIG_MIPS_UNCACHED.")
      Signed-off-by: NRalf Baechle <ralf@linux-mips.org>
      Reported-by: NAlastair Bridgewater <alastair.bridgewater@gmail.com>
      6d037de9
    • S
      Revert "ACPI, PCI, IRQ: remove redundant code in acpi_irq_penalty_init()" · 487cf917
      Sinan Kaya 提交于
      Trying to make the ISA and PCI init functionality common turned out
      to be a bad idea, because the ISA path depends on external
      functionality.
      
      Restore the previous behavior and limit the refactoring to PCI
      interrupts only.
      
      Fixes: 1fcb6a81 "ACPI,PCI,IRQ: remove redundant code in acpi_irq_penalty_init()"
      Signed-off-by: NSinan Kaya <okaya@codeaurora.org>
      Tested-by: NWim Osterholt <wim@djo.tudelft.nl>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      487cf917
  9. 01 7月, 2016 2 次提交
    • B
      x86/amd_nb: Fix boot crash on non-AMD systems · 1ead852d
      Borislav Petkov 提交于
      Fix boot crash that triggers if this driver is built into a kernel and
      run on non-AMD systems.
      
      AMD northbridges users call amd_cache_northbridges() and it returns
      a negative value to signal that we weren't able to cache/detect any
      northbridges on the system.
      
      At least, it should do so as all its callers expect it to do so. But it
      does return a negative value only when kmalloc() fails.
      
      Fix it to return -ENODEV if there are no NBs cached as otherwise, amd_nb
      users like amd64_edac, for example, which relies on it to know whether
      it should load or not, gets loaded on systems like Intel Xeons where it
      shouldn't.
      Reported-and-tested-by: NTony Battersby <tonyb@cybernetics.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: <stable@vger.kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1466097230-5333-2-git-send-email-bp@alien8.de
      Link: https://lkml.kernel.org/r/5761BEB0.9000807@cybernetics.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1ead852d
    • R
      x86/power/64: Fix kernel text mapping corruption during image restoration · 65c0554b
      Rafael J. Wysocki 提交于
      Logan Gunthorpe reports that hibernation stopped working reliably for
      him after commit ab76f7b4 (x86/mm: Set NX on gap between __ex_table
      and rodata).
      
      That turns out to be a consequence of a long-standing issue with the
      64-bit image restoration code on x86, which is that the temporary
      page tables set up by it to avoid page tables corruption when the
      last bits of the image kernel's memory contents are copied into
      their original page frames re-use the boot kernel's text mapping,
      but that mapping may very well get corrupted just like any other
      part of the page tables.  Of course, if that happens, the final
      jump to the image kernel's entry point will go to nowhere.
      
      The exact reason why commit ab76f7b4 matters here is that it
      sometimes causes a PMD of a large page to be split into PTEs
      that are allocated dynamically and get corrupted during image
      restoration as described above.
      
      To fix that issue note that the code copying the last bits of the
      image kernel's memory contents to the page frames occupied by them
      previoulsy doesn't use the kernel text mapping, because it runs from
      a special page covered by the identity mapping set up for that code
      from scratch.  Hence, the kernel text mapping is only needed before
      that code starts to run and then it will only be used just for the
      final jump to the image kernel's entry point.
      
      Accordingly, the temporary page tables set up in swsusp_arch_resume()
      on x86-64 need to contain the kernel text mapping too.  That mapping
      is only going to be used for the final jump to the image kernel, so
      it only needs to cover the image kernel's entry point, because the
      first thing the image kernel does after getting control back is to
      switch over to its own original page tables.  Moreover, the virtual
      address of the image kernel's entry point in that mapping has to be
      the same as the one mapped by the image kernel's page tables.
      
      With that in mind, modify the x86-64's arch_hibernation_header_save()
      and arch_hibernation_header_restore() routines to pass the physical
      address of the image kernel's entry point (in addition to its virtual
      address) to the boot kernel (a small piece of assembly code involved
      in passing the entry point's virtual address to the image kernel is
      not necessary any more after that, so drop it).  Update RESTORE_MAGIC
      too to reflect the image header format change.
      
      Next, in set_up_temporary_mappings(), use the physical and virtual
      addresses of the image kernel's entry point passed in the image
      header to set up a minimum kernel text mapping (using memory pages
      that won't be overwritten by the image kernel's memory contents) that
      will map those addresses to each other as appropriate.
      
      This makes the concern about the possible corruption of the original
      boot kernel text mapping go away and if the the minimum kernel text
      mapping used for the final jump marks the image kernel's entry point
      memory as executable, the jump to it is guaraneed to succeed.
      
      Fixes: ab76f7b4 (x86/mm: Set NX on gap between __ex_table and rodata)
      Link: http://marc.info/?l=linux-pm&m=146372852823760&w=2Reported-by: NLogan Gunthorpe <logang@deltatee.com>
      Reported-and-tested-by: NBorislav Petkov <bp@suse.de>
      Tested-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      65c0554b
  10. 30 6月, 2016 1 次提交
  11. 29 6月, 2016 1 次提交
    • M
      powerpc/tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0 · 190ce869
      Michael Neuling 提交于
      Currently we have 2 segments that are bolted for the kernel linear
      mapping (ie 0xc000... addresses). This is 0 to 1TB and also the kernel
      stacks. Anything accessed outside of these regions may need to be
      faulted in. (In practice machines with TM always have 1T segments)
      
      If a machine has < 2TB of memory we never fault on the kernel linear
      mapping as these two segments cover all physical memory. If a machine
      has > 2TB of memory, there may be structures outside of these two
      segments that need to be faulted in. This faulting can occur when
      running as a guest as the hypervisor may remove any SLB that's not
      bolted.
      
      When we treclaim and trecheckpoint we have a window where we need to
      run with the userspace GPRs. This means that we no longer have a valid
      stack pointer in r1. For this window we therefore clear MSR RI to
      indicate that any exceptions taken at this point won't be able to be
      handled. This means that we can't take segment misses in this RI=0
      window.
      
      In this RI=0 region, we currently access the thread_struct for the
      process being context switched to or from. This thread_struct access
      may cause a segment fault since it's not guaranteed to be covered by
      the two bolted segment entries described above.
      
      We've seen this with a crash when running as a guest with > 2TB of
      memory on PowerVM:
      
        Unrecoverable exception 4100 at c00000000004f138
        Oops: Unrecoverable exception, sig: 6 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        CPU: 1280 PID: 7755 Comm: kworker/1280:1 Tainted: G                 X 4.4.13-46-default #1
        task: c000189001df4210 ti: c000189001d5c000 task.ti: c000189001d5c000
        NIP: c00000000004f138 LR: 0000000010003a24 CTR: 0000000010001b20
        REGS: c000189001d5f730 TRAP: 4100   Tainted: G                 X  (4.4.13-46-default)
        MSR: 8000000100001031 <SF,ME,IR,DR,LE>  CR: 24000048  XER: 00000000
        CFAR: c00000000004ed18 SOFTE: 0
        GPR00: ffffffffc58d7b60 c000189001d5f9b0 00000000100d7d00 000000003a738288
        GPR04: 0000000000002781 0000000000000006 0000000000000000 c0000d1f4d889620
        GPR08: 000000000000c350 00000000000008ab 00000000000008ab 00000000100d7af0
        GPR12: 00000000100d7ae8 00003ffe787e67a0 0000000000000000 0000000000000211
        GPR16: 0000000010001b20 0000000000000000 0000000000800000 00003ffe787df110
        GPR20: 0000000000000001 00000000100d1e10 0000000000000000 00003ffe787df050
        GPR24: 0000000000000003 0000000000010000 0000000000000000 00003fffe79e2e30
        GPR28: 00003fffe79e2e68 00000000003d0f00 00003ffe787e67a0 00003ffe787de680
        NIP [c00000000004f138] restore_gprs+0xd0/0x16c
        LR [0000000010003a24] 0x10003a24
        Call Trace:
        [c000189001d5f9b0] [c000189001d5f9f0] 0xc000189001d5f9f0 (unreliable)
        [c000189001d5fb90] [c00000000001583c] tm_recheckpoint+0x6c/0xa0
        [c000189001d5fbd0] [c000000000015c40] __switch_to+0x2c0/0x350
        [c000189001d5fc30] [c0000000007e647c] __schedule+0x32c/0x9c0
        [c000189001d5fcb0] [c0000000007e6b58] schedule+0x48/0xc0
        [c000189001d5fce0] [c0000000000deabc] worker_thread+0x22c/0x5b0
        [c000189001d5fd80] [c0000000000e7000] kthread+0x110/0x130
        [c000189001d5fe30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
        Instruction dump:
        7cb103a6 7cc0e3a6 7ca222a6 78a58402 38c00800 7cc62838 08860000 7cc000a6
        38a00006 78c60022 7cc62838 0b060000 <e8c701a0> 7ccff120 e8270078 e8a70098
        ---[ end trace 602126d0a1dedd54 ]---
      
      This fixes this by copying the required data from the thread_struct to
      the stack before we clear MSR RI. Then once we clear RI, we only access
      the stack, guaranteeing there's no segment miss.
      
      We also tighten the region over which we set RI=0 on the treclaim()
      path. This may have a slight performance impact since we're adding an
      mtmsr instruction.
      
      Fixes: 090b9284 ("powerpc/tm: Clear MSR RI in non-recoverable TM code")
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Reviewed-by: NCyril Bur <cyrilbur@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      190ce869
  12. 28 6月, 2016 5 次提交
  13. 27 6月, 2016 5 次提交
    • Q
      KVM: nVMX: VMX instructions: fix segment checks when L1 is in long mode. · ff30ef40
      Quentin Casasnovas 提交于
      I couldn't get Xen to boot a L2 HVM when it was nested under KVM - it was
      getting a GP(0) on a rather unspecial vmread from Xen:
      
           (XEN) ----[ Xen-4.7.0-rc  x86_64  debug=n  Not tainted ]----
           (XEN) CPU:    1
           (XEN) RIP:    e008:[<ffff82d0801e629e>] vmx_get_segment_register+0x14e/0x450
           (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor (d1v0)
           (XEN) rax: ffff82d0801e6288   rbx: ffff83003ffbfb7c   rcx: fffffffffffab928
           (XEN) rdx: 0000000000000000   rsi: 0000000000000000   rdi: ffff83000bdd0000
           (XEN) rbp: ffff83000bdd0000   rsp: ffff83003ffbfab0   r8:  ffff830038813910
           (XEN) r9:  ffff83003faf3958   r10: 0000000a3b9f7640   r11: ffff83003f82d418
           (XEN) r12: 0000000000000000   r13: ffff83003ffbffff   r14: 0000000000004802
           (XEN) r15: 0000000000000008   cr0: 0000000080050033   cr4: 00000000001526e0
           (XEN) cr3: 000000003fc79000   cr2: 0000000000000000
           (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
           (XEN) Xen code around <ffff82d0801e629e> (vmx_get_segment_register+0x14e/0x450):
           (XEN)  00 00 41 be 02 48 00 00 <44> 0f 78 74 24 08 0f 86 38 56 00 00 b8 08 68 00
           (XEN) Xen stack trace from rsp=ffff83003ffbfab0:
      
           ...
      
           (XEN) Xen call trace:
           (XEN)    [<ffff82d0801e629e>] vmx_get_segment_register+0x14e/0x450
           (XEN)    [<ffff82d0801f3695>] get_page_from_gfn_p2m+0x165/0x300
           (XEN)    [<ffff82d0801bfe32>] hvmemul_get_seg_reg+0x52/0x60
           (XEN)    [<ffff82d0801bfe93>] hvm_emulate_prepare+0x53/0x70
           (XEN)    [<ffff82d0801ccacb>] handle_mmio+0x2b/0xd0
           (XEN)    [<ffff82d0801be591>] emulate.c#_hvm_emulate_one+0x111/0x2c0
           (XEN)    [<ffff82d0801cd6a4>] handle_hvm_io_completion+0x274/0x2a0
           (XEN)    [<ffff82d0801f334a>] __get_gfn_type_access+0xfa/0x270
           (XEN)    [<ffff82d08012f3bb>] timer.c#add_entry+0x4b/0xb0
           (XEN)    [<ffff82d08012f80c>] timer.c#remove_entry+0x7c/0x90
           (XEN)    [<ffff82d0801c8433>] hvm_do_resume+0x23/0x140
           (XEN)    [<ffff82d0801e4fe7>] vmx_do_resume+0xa7/0x140
           (XEN)    [<ffff82d080164aeb>] context_switch+0x13b/0xe40
           (XEN)    [<ffff82d080128e6e>] schedule.c#schedule+0x22e/0x570
           (XEN)    [<ffff82d08012c0cc>] softirq.c#__do_softirq+0x5c/0x90
           (XEN)    [<ffff82d0801602c5>] domain.c#idle_loop+0x25/0x50
           (XEN)
           (XEN)
           (XEN) ****************************************
           (XEN) Panic on CPU 1:
           (XEN) GENERAL PROTECTION FAULT
           (XEN) [error_code=0000]
           (XEN) ****************************************
      
      Tracing my host KVM showed it was the one injecting the GP(0) when
      emulating the VMREAD and checking the destination segment permissions in
      get_vmx_mem_address():
      
           3)               |    vmx_handle_exit() {
           3)               |      handle_vmread() {
           3)               |        nested_vmx_check_permission() {
           3)               |          vmx_get_segment() {
           3)   0.074 us    |            vmx_read_guest_seg_base();
           3)   0.065 us    |            vmx_read_guest_seg_selector();
           3)   0.066 us    |            vmx_read_guest_seg_ar();
           3)   1.636 us    |          }
           3)   0.058 us    |          vmx_get_rflags();
           3)   0.062 us    |          vmx_read_guest_seg_ar();
           3)   3.469 us    |        }
           3)               |        vmx_get_cs_db_l_bits() {
           3)   0.058 us    |          vmx_read_guest_seg_ar();
           3)   0.662 us    |        }
           3)               |        get_vmx_mem_address() {
           3)   0.068 us    |          vmx_cache_reg();
           3)               |          vmx_get_segment() {
           3)   0.074 us    |            vmx_read_guest_seg_base();
           3)   0.068 us    |            vmx_read_guest_seg_selector();
           3)   0.071 us    |            vmx_read_guest_seg_ar();
           3)   1.756 us    |          }
           3)               |          kvm_queue_exception_e() {
           3)   0.066 us    |            kvm_multiple_exception();
           3)   0.684 us    |          }
           3)   4.085 us    |        }
           3)   9.833 us    |      }
           3) + 10.366 us   |    }
      
      Cross-checking the KVM/VMX VMREAD emulation code with the Intel Software
      Developper Manual Volume 3C - "VMREAD - Read Field from Virtual-Machine
      Control Structure", I found that we're enforcing that the destination
      operand is NOT located in a read-only data segment or any code segment when
      the L1 is in long mode - BUT that check should only happen when it is in
      protected mode.
      
      Shuffling the code a bit to make our emulation follow the specification
      allows me to boot a Xen dom0 in a nested KVM and start HVM L2 guests
      without problems.
      
      Fixes: f9eb4af6 ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
      Signed-off-by: NQuentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Eugene Korenevsky <ekorenevsky@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: linux-stable <stable@vger.kernel.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ff30ef40
    • M
      KVM: LAPIC: cap __delay at lapic_timer_advance_ns · b606f189
      Marcelo Tosatti 提交于
      The host timer which emulates the guest LAPIC TSC deadline
      timer has its expiration diminished by lapic_timer_advance_ns
      nanoseconds. Therefore if, at wait_lapic_expire, a difference
      larger than lapic_timer_advance_ns is encountered, delay at most
      lapic_timer_advance_ns.
      
      This fixes a problem where the guest can cause the host
      to delay for large amounts of time.
      Reported-by: NAlan Jenkins <alan.christopher.jenkins@gmail.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b606f189
    • M
      KVM: x86: move nsec_to_cycles from x86.c to x86.h · 8d93c874
      Marcelo Tosatti 提交于
      Move the inline function nsec_to_cycles from x86.c to x86.h, as
      the next patch uses it from lapic.c.
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      8d93c874
    • M
      pvclock: Get rid of __pvclock_read_cycles in function pvclock_read_flags · ed911b43
      Minfei Huang 提交于
      There is a generic function __pvclock_read_cycles to be used to get both
      flags and cycles. For function pvclock_read_flags, it's useless to get
      cycles value. To make this function be more effective, get this variable
      flags directly in function.
      Signed-off-by: NMinfei Huang <mnghuan@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ed911b43
    • M
      pvclock: Cleanup to remove function pvclock_get_nsec_offset · f7550d07
      Minfei Huang 提交于
      Function __pvclock_read_cycles is short enough, so there is no need to
      have another function pvclock_get_nsec_offset to calculate tsc delta.
      It's better to combine it into function __pvclock_read_cycles.
      
      Remove useless variables in function __pvclock_read_cycles.
      Signed-off-by: NMinfei Huang <mnghuan@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f7550d07