1. 16 3月, 2016 5 次提交
  2. 14 3月, 2016 3 次提交
  3. 13 3月, 2016 4 次提交
  4. 12 3月, 2016 8 次提交
    • M
      x86/efi: Fix boot crash by always mapping boot service regions into new EFI page tables · 452308de
      Matt Fleming 提交于
      Some machines have EFI regions in page zero (physical address
      0x00000000) and historically that region has been added to the e820
      map via trim_bios_range(), and ultimately mapped into the kernel page
      tables. It was not mapped via efi_map_regions() as one would expect.
      
      Alexis reports that with the new separate EFI page tables some boot
      services regions, such as page zero, are not mapped. This triggers an
      oops during the SetVirtualAddressMap() runtime call.
      
      For the EFI boot services quirk on x86 we need to memblock_reserve()
      boot services regions until after SetVirtualAddressMap(). Doing that
      while respecting the ownership of regions that may have already been
      reserved by the kernel was the motivation behind this commit:
      
        7d68dc3f ("x86, efi: Do not reserve boot services regions within reserved areas")
      
      That patch was merged at a time when the EFI runtime virtual mappings
      were inserted into the kernel page tables as described above, and the
      trick of setting ->numpages (and hence the region size) to zero to
      track regions that should not be freed in efi_free_boot_services()
      meant that we never mapped those regions in efi_map_regions(). Instead
      we were relying solely on the existing kernel mappings.
      
      Now that we have separate page tables we need to make sure the EFI
      boot services regions are mapped correctly, even if someone else has
      already called memblock_reserve(). Instead of stashing a tag in
      ->numpages, set the EFI_MEMORY_RUNTIME bit of ->attribute. Since it
      generally makes no sense to mark a boot services region as required at
      runtime, it's pretty much guaranteed the firmware will not have
      already set this bit.
      
      For the record, the specific circumstances under which Alexis
      triggered this bug was that an EFI runtime driver on his machine was
      responding to the EVT_SIGNAL_VIRTUAL_ADDRESS_CHANGE event during
      SetVirtualAddressMap().
      
      The event handler for this driver looks like this,
      
        sub rsp,0x28
        lea rdx,[rip+0x2445] # 0xaa948720
        mov ecx,0x4
        call func_aa9447c0  ; call to ConvertPointer(4, & 0xaa948720)
        mov r11,QWORD PTR [rip+0x2434] # 0xaa948720
        xor eax,eax
        mov BYTE PTR [r11+0x1],0x1
        add rsp,0x28
        ret
      
      Which is pretty typical code for an EVT_SIGNAL_VIRTUAL_ADDRESS_CHANGE
      handler. The "mov r11, QWORD PTR [rip+0x2424]" was the faulting
      instruction because ConvertPointer() was being called to convert the
      address 0x0000000000000000, which when converted is left unchanged and
      remains 0x0000000000000000.
      
      The output of the oops trace gave the impression of a standard NULL
      pointer dereference bug, but because we're accessing physical
      addresses during ConvertPointer(), it wasn't. EFI boot services code
      is stored at that address on Alexis' machine.
      Reported-by: NAlexis Murzeau <amurzeau@gmail.com>
      Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maarten Lankhorst <maarten.lankhorst@canonical.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Raphael Hertzog <hertzog@debian.org>
      Cc: Roger Shimizu <rogershimizu@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/1457695163-29632-2-git-send-email-matt@codeblueprint.co.uk
      Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=815125Signed-off-by: NIngo Molnar <mingo@kernel.org>
      452308de
    • B
      x86/fpu: Fix eager-FPU handling on legacy FPU machines · 6e686709
      Borislav Petkov 提交于
      i486 derived cores like Intel Quark support only the very old,
      legacy x87 FPU (FSAVE/FRSTOR, CPUID bit FXSR is not set), and
      our FPU code wasn't handling the saving and restoring there
      properly in the 'eagerfpu' case.
      
      So after we made eagerfpu the default for all CPU types:
      
        58122bf1 x86/fpu: Default eagerfpu=on on all CPUs
      
      these old FPU designs broke. First, Andy Shevchenko reported a splat:
      
        WARNING: CPU: 0 PID: 823 at arch/x86/include/asm/fpu/internal.h:163 fpu__clear+0x8c/0x160
      
      which was us trying to execute FXRSTOR on those machines even though
      they don't support it.
      
      After taking care of that, Bryan O'Donoghue reported that a simple FPU
      test still failed because we weren't initializing the FPU state properly
      on those machines.
      
      Take care of all that.
      Reported-and-tested-by: NBryan O'Donoghue <pure.logic@nexus-software.ie>
      Reported-by: NAndy Shevchenko <andy.shevchenko@gmail.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yu-cheng <yu-cheng.yu@intel.com>
      Link: http://lkml.kernel.org/r/20160311113206.GD4312@pd.tnicSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6e686709
    • B
      MIPS: Loongson 3: Keep CPU physical (not virtual) addresses in shadow ROM resource · 97f47e73
      Bjorn Helgaas 提交于
      Loongson 3 used the IORESOURCE_ROM_COPY flag for its ROM resource.  There
      are two problems with this:
      
        - When IORESOURCE_ROM_COPY is set, pci_map_rom() assumes the resource
          contains virtual addresses, so it doesn't ioremap the resource.  This
          implies loongson_sysconf.vgabios_addr is a virtual address.  That's a
          problem because resources should contain CPU *physical* addresses not
          virtual addresses.
      
        - When IORESOURCE_ROM_COPY is set, pci_cleanup_rom() calls kfree() on the
          resource.  We did not kmalloc() the loongson_sysconf.vgabios_addr area,
          so it is incorrect to kfree() it.
      
      If we're using a shadow copy in RAM for the Loongson 3 VGA BIOS area,
      disable the ROM BAR and release the address space it was consuming.
      
      Use IORESOURCE_ROM_SHADOW instead of IORESOURCE_ROM_COPY.  This means the
      struct resource contains CPU physical addresses, and pci_map_rom() will
      ioremap() it as needed.
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      97f47e73
    • B
      MIPS: Loongson 3: Use temporary struct resource * to avoid repetition · 53f0a509
      Bjorn Helgaas 提交于
      Use a temporary struct resource pointer to avoid needless repetition of
      "pdev->resource[PCI_ROM_RESOURCE]".  No functional change intended.
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      53f0a509
    • B
      ia64/PCI: Keep CPU physical (not virtual) addresses in shadow ROM resource · 240504ad
      Bjorn Helgaas 提交于
      A struct resource contains CPU physical addresses, not virtual addresses.
      But sn_acpi_slot_fixup() and sn_io_slot_fixup() stored the virtual address
      of a shadow ROM copy in the resource.  To compensate, pci_map_rom() had a
      special case that returned the resource address directly rather than
      calling ioremap() on it.
      
      When we're using a shadow copy in RAM or PROM, disable the ROM BAR and
      release the address space it was consuming.
      
      Store the CPU physical (not virtual) address in the shadow ROM resource,
      and mark the resource as IORESOURCE_ROM_SHADOW so we use the normal
      pci_map_rom() path that ioremaps the copy.
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      240504ad
    • B
      ia64/PCI: Use ioremap() instead of open-coded equivalent · f976721e
      Bjorn Helgaas 提交于
      Depositing __IA64_UNCACHED_OFFSET in the upper address bits is essentially
      equivalent to ioremap(): it converts a CPU physical address to a virtual
      address using the ia64 uncacheable identity map.
      
      Call ioremap() instead of doing the phys-to-virt conversion manually with
      __IA64_UNCACHED_OFFSET.
      
      Note that this makes it obvious that (a) we're putting a virtual address in
      a struct resource, and (b) we're passing a virtual address to ioremap()
      below in the PCI_ROM_RESOURCE case.  These are both pre-existing problems
      that I'll resolve next.
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      f976721e
    • B
      ia64/PCI: Use temporary struct resource * to avoid repetition · ab97b8cc
      Bjorn Helgaas 提交于
      Use a temporary struct resource pointer to avoid needless repetition of
      "dev->resource[idx]".  No functional change intended.
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      ab97b8cc
    • T
      ARM: mvebu: fix overlap of Crypto SRAM with PCIe memory window · d7d5a43c
      Thomas Petazzoni 提交于
      When the Crypto SRAM mappings were added to the Device Tree files
      describing the Armada XP boards in commit c466d997 ("ARM: mvebu:
      define crypto SRAM ranges for all armada-xp boards"), the fact that
      those mappings were overlaping with the PCIe memory aperture was
      overlooked. Due to this, we currently have for all Armada XP platforms
      a situation that looks like this:
      
      Memory mapping on Armada XP boards with internal registers at
      0xf1000000:
      
       - 0x00000000 -> 0xf0000000	3.75G 	RAM
       - 0xf0000000 -> 0xf1000000	16M	NOR flashes (AXP GP / AXP DB)
       - 0xf1000000 -> 0xf1100000	1M	internal registers
       - 0xf8000000 -> 0xffe0000	126M	PCIe memory aperture
       - 0xf8100000 -> 0xf8110000	64KB	Crypto SRAM #0	=> OVERLAPS WITH PCIE !
       - 0xf8110000 -> 0xf8120000	64KB	Crypto SRAM #1	=> OVERLAPS WITH PCIE !
       - 0xffe00000 -> 0xfff00000	1M	PCIe I/O aperture
       - 0xfff0000  -> 0xffffffff	1M	BootROM
      
      The overlap means that when PCIe devices are added, depending on their
      memory window needs, they might or might not be mapped into the
      physical address space. Indeed, they will not be mapped if the area
      allocated in the PCIe memory aperture by the PCI core overlaps with
      one of the Crypto SRAM. Typically, a Intel IGB PCIe NIC that needs 8MB
      of PCIe memory will see its PCIe memory window allocated from
      0xf80000000 for 8MB, which overlaps with the Crypto SRAM windows. Due
      to this, the PCIe window is not created, and any attempt to access the
      PCIe window makes the kernel explode:
      
      [    3.302213] igb: Copyright (c) 2007-2014 Intel Corporation.
      [    3.307841] pci 0000:00:09.0: enabling device (0140 -> 0143)
      [    3.313539] mvebu_mbus: cannot add window '4:f8', conflicts with another window
      [    3.320870] mvebu-pcie soc:pcie-controller: Could not create MBus window at [mem 0xf8000000-0xf87fffff]: -22
      [    3.330811] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf08c0018
      
      This problem does not occur on Armada 370 boards, because we use the
      following memory mapping (for boards that have internal registers at
      0xf1000000):
      
       - 0x00000000 -> 0xf0000000	3.75G 	RAM
       - 0xf0000000 -> 0xf1000000	16M	NOR flashes (AXP GP / AXP DB)
       - 0xf1000000 -> 0xf1100000	1M	internal registers
       - 0xf1100000 -> 0xf1110000	64KB	Crypto SRAM #0 => OK !
       - 0xf8000000 -> 0xffe0000	126M	PCIe memory
       - 0xffe00000 -> 0xfff00000	1M	PCIe I/O
       - 0xfff0000  -> 0xffffffff	1M	BootROM
      
      Obviously, the solution is to align the location of the Crypto SRAM
      mappings of Armada XP to be similar with the ones on Armada 370, i.e
      have them between the "internal registers" area and the beginning of
      the PCIe aperture.
      
      However, we have a special case with the OpenBlocks AX3-4 platform,
      which has a 128 MB NOR flash. Currently, this NOR flash is mapped from
      0xf0000000 to 0xf8000000. This is possible because on OpenBlocks
      AX3-4, the internal registers are not at 0xf1000000. And this explains
      why the Crypto SRAM mappings were not configured at the same place on
      Armada XP.
      
      Hence, the solution is two-fold:
      
       (1) Move the NOR flash mapping on Armada XP OpenBlocks AX3-4 from
           0xe8000000 to 0xf0000000. This frees the 0xf0000000 ->
           0xf80000000 space.
      
       (2) Move the Crypto SRAM mappings on Armada XP to be similar to
           Armada 370 (except of course that Armada XP has two Crypto SRAM
           and not one).
      
      After this patch, the memory mapping on Armada XP boards with
      registers at 0xf1 is:
      
       - 0x00000000 -> 0xf0000000	3.75G 	RAM
       - 0xf0000000 -> 0xf1000000	16M	NOR flashes (AXP GP / AXP DB)
       - 0xf1000000 -> 0xf1100000	1M	internal registers
       - 0xf1100000 -> 0xf1110000	64KB	Crypto SRAM #0
       - 0xf1110000 -> 0xf1120000	64KB	Crypto SRAM #1
       - 0xf8000000 -> 0xffe0000	126M	PCIe memory
       - 0xffe00000 -> 0xfff00000	1M	PCIe I/O
       - 0xfff0000  -> 0xffffffff	1M	BootROM
      
      And the memory mapping for the special case of the OpenBlocks AX3-4
      (internal registers at 0xd0000000, NOR of 128 MB):
      
       - 0x00000000 -> 0xc0000000	3G 	RAM
       - 0xd0000000 -> 0xd1000000	1M	internal registers
       - 0xe800000  -> 0xf0000000	128M	NOR flash
       - 0xf1100000 -> 0xf1110000	64KB	Crypto SRAM #0
       - 0xf1110000 -> 0xf1120000	64KB	Crypto SRAM #1
       - 0xf8000000 -> 0xffe0000	126M	PCIe memory
       - 0xffe00000 -> 0xfff00000	1M	PCIe I/O
       - 0xfff0000  -> 0xffffffff	1M	BootROM
      
      Fixes: c466d997 ("ARM: mvebu: define crypto SRAM ranges for all armada-xp boards")
      Reported-by: NPhil Sutter <phil@nwl.cc>
      Cc: Phil Sutter <phil@nwl.cc>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NThomas Petazzoni <thomas.petazzoni@free-electrons.com>
      Acked-by: NGregory CLEMENT <gregory.clement@free-electrons.com>
      Signed-off-by: NOlof Johansson <olof@lixom.net>
      d7d5a43c
  5. 11 3月, 2016 9 次提交
    • C
      arm64: kasan: Fix zero shadow mapping overriding kernel image shadow · 2776e0e8
      Catalin Marinas 提交于
      With the 16KB and 64KB page size configurations, SWAPPER_BLOCK_SIZE is
      PAGE_SIZE and ARM64_SWAPPER_USES_SECTION_MAPS is 0. Since
      kimg_shadow_end is not page aligned (_end shifted by
      KASAN_SHADOW_SCALE_SHIFT), the edges of previously mapped kernel image
      shadow via vmemmap_populate() may be overridden by subsequent calls to
      kasan_populate_zero_shadow(), leading to kernel panics like below:
      
      ------------------------------------------------------------------------------
      Unable to handle kernel paging request at virtual address fffffc100135068c
      pgd = fffffc8009ac0000
      [fffffc100135068c] *pgd=00000009ffee0003, *pud=00000009ffee0003, *pmd=00000009ffee0003, *pte=00e0000081a00793
      Internal error: Oops: 9600004f [#1] PREEMPT SMP
      Modules linked in:
      CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.5.0-rc4+ #1984
      Hardware name: Juno (DT)
      task: fffffe09001a0000 ti: fffffe0900200000 task.ti: fffffe0900200000
      PC is at __memset+0x4c/0x200
      LR is at kasan_unpoison_shadow+0x34/0x50
      pc : [<fffffc800846f1cc>] lr : [<fffffc800821ff54>] pstate: 00000245
      sp : fffffe0900203db0
      x29: fffffe0900203db0 x28: 0000000000000000
      x27: 0000000000000000 x26: 0000000000000000
      x25: fffffc80099b69d0 x24: 0000000000000001
      x23: 0000000000000000 x22: 0000000000002000
      x21: dffffc8000000000 x20: 1fffff9001350a8c
      x19: 0000000000002000 x18: 0000000000000008
      x17: 0000000000000147 x16: ffffffffffffffff
      x15: 79746972100e041d x14: ffffff0000000000
      x13: ffff000000000000 x12: 0000000000000000
      x11: 0101010101010101 x10: 1fffffc11c000000
      x9 : 0000000000000000 x8 : fffffc100135068c
      x7 : 0000000000000000 x6 : 000000000000003f
      x5 : 0000000000000040 x4 : 0000000000000004
      x3 : fffffc100134f651 x2 : 0000000000000400
      x1 : 0000000000000000 x0 : fffffc100135068c
      
      Process swapper/0 (pid: 1, stack limit = 0xfffffe0900200020)
      Call trace:
      [<fffffc800846f1cc>] __memset+0x4c/0x200
      [<fffffc8008220044>] __asan_register_globals+0x5c/0xb0
      [<fffffc8008a09d34>] _GLOBAL__sub_I_65535_1_sunrpc_cache_lookup+0x1c/0x28
      [<fffffc8008f20d28>] kernel_init_freeable+0x104/0x274
      [<fffffc80089e1948>] kernel_init+0x10/0xf8
      [<fffffc8008093a00>] ret_from_fork+0x10/0x50
      ------------------------------------------------------------------------------
      
      This patch aligns kimg_shadow_start and kimg_shadow_end to
      SWAPPER_BLOCK_SIZE in all configurations.
      
      Fixes: f9040773 ("arm64: move kernel image to base of vmalloc area")
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: NMark Rutland <mark.rutland@arm.com>
      Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      2776e0e8
    • C
      arm64: kasan: Use actual memory node when populating the kernel image shadow · 2f76969f
      Catalin Marinas 提交于
      With the 16KB or 64KB page configurations, the generic
      vmemmap_populate() implementation warns on potential offnode
      page_structs via vmemmap_verify() because the arm64 kasan_init() passes
      NUMA_NO_NODE instead of the actual node for the kernel image memory.
      
      Fixes: f9040773 ("arm64: move kernel image to base of vmalloc area")
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Reported-by: NJames Morse <james.morse@arm.com>
      Acked-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: NMark Rutland <mark.rutland@arm.com>
      2f76969f
    • C
      arm64: Update PTE_RDONLY in set_pte_at() for PROT_NONE permission · fdc69e7d
      Catalin Marinas 提交于
      The set_pte_at() function must update the hardware PTE_RDONLY bit
      depending on the state of the PTE_WRITE and PTE_DIRTY bits of the given
      entry value. However, it currently only performs this for pte_valid()
      entries, ignoring PTE_PROT_NONE. The side-effect is that PROT_NONE
      mappings would not have the PTE_RDONLY bit set. Without
      CONFIG_ARM64_HW_AFDBM, this is not an issue since such PROT_NONE pages
      are not accessible anyway.
      
      With commit 2f4b829c ("arm64: Add support for hardware updates of
      the access and dirty pte bits"), the ptep_set_wrprotect() function was
      re-written to cope with automatic hardware updates of the dirty state.
      As an optimisation, only PTE_RDONLY is checked to assess the "dirty"
      status. Since set_pte_at() does not set this bit for PROT_NONE mappings,
      such pages may be considered "dirty" as a result of
      ptep_set_wrprotect().
      
      This patch updates the pte_valid() check to pte_present() in
      set_pte_at(). It also adds PTE_PROT_NONE to the swap entry bits comment.
      
      Fixes: 2f4b829c ("arm64: Add support for hardware updates of the access and dirty pte bits")
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Reported-by: NGanapatrao Kulkarni <gkulkarni@caviumnetworks.com>
      Tested-by: NGanapatrao Kulkarni <gkulkarni@cavium.com>
      Cc: <stable@vger.kernel.org>
      fdc69e7d
    • H
      x86/mm/32: Enable full randomization on i386 and X86_32 · 8b8addf8
      Hector Marco-Gisbert 提交于
      Currently on i386 and on X86_64 when emulating X86_32 in legacy mode, only
      the stack and the executable are randomized but not other mmapped files
      (libraries, vDSO, etc.). This patch enables randomization for the
      libraries, vDSO and mmap requests on i386 and in X86_32 in legacy mode.
      
      By default on i386 there are 8 bits for the randomization of the libraries,
      vDSO and mmaps which only uses 1MB of VA.
      
      This patch preserves the original randomness, using 1MB of VA out of 3GB or
      4GB. We think that 1MB out of 3GB is not a big cost for having the ASLR.
      
      The first obvious security benefit is that all objects are randomized (not
      only the stack and the executable) in legacy mode which highly increases
      the ASLR effectiveness, otherwise the attackers may use these
      non-randomized areas. But also sensitive setuid/setgid applications are
      more secure because currently, attackers can disable the randomization of
      these applications by setting the ulimit stack to "unlimited". This is a
      very old and widely known trick to disable the ASLR in i386 which has been
      allowed for too long.
      
      Another trick used to disable the ASLR was to set the ADDR_NO_RANDOMIZE
      personality flag, but fortunately this doesn't work on setuid/setgid
      applications because there is security checks which clear Security-relevant
      flags.
      
      This patch always randomizes the mmap_legacy_base address, removing the
      possibility to disable the ASLR by setting the stack to "unlimited".
      Signed-off-by: NHector Marco-Gisbert <hecmargi@upv.es>
      Acked-by: NIsmael Ripoll Ripoll <iripoll@upv.es>
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: NArjan van de Ven <arjan@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: kees Cook <keescook@chromium.org>
      Link: http://lkml.kernel.org/r/1457639460-5242-1-git-send-email-hecmargi@upv.esSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8b8addf8
    • J
      x86/PCI: VMD: Attach VMD resources to parent domain's resource tree · 2c2c5c5c
      Jon Derrick 提交于
      Attach the new VMD domain's resources to the VMD device's resources.  This
      allows /proc/iomem to display a more complete picture.
      
      Before:
        c0000000-c1ffffff : 0000:5d:05.5
        c2000000-c3ffffff : 0000:5d:05.5
          c2010000-c2013fff : nvme
        c4000000-c40fffff : 0000:5d:05.5
      
      After:
        c0000000-c1ffffff : 0000:5d:05.5
        c2000000-c3ffffff : 0000:5d:05.5
          c2000000-c3ffffff : VMD MEMBAR1
            c2000000-c22fffff : PCI Bus 10000:01
              c2000000-c200ffff : 10000:01:00.0
              c2010000-c2013fff : 10000:01:00.0
                c2010000-c2013fff : nvme
            c2300000-c24fffff : PCI Bus 10000:01
        c4000000-c40fffff : 0000:5d:05.5
          c4002000-c40fffff : VMD MEMBAR2
      Signed-off-by: NJon Derrick <jonathan.derrick@intel.com>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Reviewed-by: NKeith Busch <keith.busch@intel.com>
      2c2c5c5c
    • K
      x86/PCI: VMD: Set bus resource start to 0 · d068c350
      Keith Busch 提交于
      The bus always starts at 0.  Due to alignment and down-casting, this
      happened to work before, but looked alarmingly incorrect in kernel logs.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      d068c350
    • K
      x86/PCI: VMD: Document code for maintainability · 83cc54a6
      Keith Busch 提交于
      Comment the less obvious portion of the code for setting up memory windows,
      and the platform dependency for initializing the h/w with appropriate
      resources.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      83cc54a6
    • J
      ARC: Add PCI support · c1678ffc
      Joao Pinto 提交于
      Add PCI support to ARC and update drivers/pci Makefile enabling the ARC
      arch to use the generic PCI setup functions.
      
      [bhelgaas: fold in Joao's pci-dma-compat.h & pci-bridge.h build fix (I
      should have caught this myself, sorry]
      Signed-off-by: NJoao Pinto <jpinto@synopsys.com>
      Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
      Acked-by: NVineet Gupta <vgupta@synopsys.com>
      c1678ffc
    • J
      x86/entry/traps: Show unhandled signal for i386 in do_trap() · 10ee7386
      Jianyu Zhan 提交于
      Commit abd4f750 ("x86: i386-show-unhandled-signals-v3") did turn on
      the showing-unhandled-signal behaviour for i386 for some exception handlers,
      but for no reason do_trap() is left out (my naive guess is because turning it on
      for do_trap() would be too noisy since do_trap() is shared by several exceptions).
      
      And since the same commit make "show_unhandled_signals" a debug tunable(in
      /proc/sys/debug/exception-trace), and x86 by default turning it on.
      
      So it would be strange for i386 users who turing it on manually and expect
      seeing the unhandled signal output in log, but nothing.
      
      This patch turns it on for i386 in do_trap() as well.
      Signed-off-by: NJianyu Zhan <nasa4836@gmail.com>
      Reviewed-by: NJan Beulich <jbeulich@suse.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bp@suse.de
      Cc: dave.hansen@linux.intel.com
      Cc: heukelum@fastmail.fm
      Cc: jbeulich@novell.com
      Cc: jdike@addtoit.com
      Cc: joe@perches.com
      Cc: luto@kernel.org
      Link: http://lkml.kernel.org/r/1457612398-4568-1-git-send-email-nasa4836@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      10ee7386
  6. 10 3月, 2016 11 次提交
    • M
      s390: fix floating pointer register corruption (again) · e370e476
      Martin Schwidefsky 提交于
      There is a tricky interaction between the machine check handler
      and the critical sections of load_fpu_regs and save_fpu_regs
      functions. If the machine check interrupts one of the two
      functions the critical section cleanup will complete the function
      before the machine check handler s390_do_machine_check is called.
      Trouble is that the machine check handler needs to validate the
      floating point registers *before* and not *after* the completion
      of load_fpu_regs/save_fpu_regs.
      
      The simplest solution is to rewind the PSW to the start of the
      load_fpu_regs/save_fpu_regs and retry the function after the
      return from the machine check handler.
      Tested-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: <stable@vger.kernel.org> # 4.3+
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      e370e476
    • H
      s390/cpumf: add missing lpp magic initialization · 8f100bb1
      Heiko Carstens 提交于
      Add the missing lpp magic initialization for cpu 0. Without this all
      samples on cpu 0 do not have the most significant bit set in the
      program parameter field, which we use to distinguish between guest and
      host samples if the pid is also 0.
      
      We did initialize the lpp magic in the absolute zero lowcore but
      forgot that when switching to the allocated lowcore on cpu 0 only.
      Reported-by: NShu Juan Zhang <zhshuj@cn.ibm.com>
      Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: stable@vger.kernel.org # v4.4+
      Fixes: e22cf8ca ("s390/cpumf: rework program parameter setting to detect guest samples")
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      8f100bb1
    • B
      x86/delay: Avoid preemptible context checks in delay_mwaitx() · 84477336
      Borislav Petkov 提交于
      We do use this_cpu_ptr(&cpu_tss) as a cacheline-aligned, seldomly
      accessed per-cpu var as the MONITORX target in delay_mwaitx(). However,
      when called in preemptible context, this_cpu_ptr -> smp_processor_id() ->
      debug_smp_processor_id() fires:
      
        BUG: using smp_processor_id() in preemptible [00000000] code: udevd/312
        caller is delay_mwaitx+0x40/0xa0
      
      But we don't care about that check - we only need cpu_tss as a MONITORX
      target and it doesn't really matter which CPU's var we're touching as
      we're going idle anyway. Fix that.
      Suggested-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: spg_linux_kernel@amd.com
      Link: http://lkml.kernel.org/r/20160309205622.GG6564@pd.tnicSigned-off-by: NIngo Molnar <mingo@kernel.org>
      84477336
    • P
      KVM: MMU: fix reserved bit check for ept=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 · 5f0b8199
      Paolo Bonzini 提交于
      KVM has special logic to handle pages with pte.u=1 and pte.w=0 when
      CR0.WP=1.  These pages' SPTEs flip continuously between two states:
      U=1/W=0 (user and supervisor reads allowed, supervisor writes not allowed)
      and U=0/W=1 (supervisor reads and writes allowed, user writes not allowed).
      
      When SMEP is in effect, however, U=0 will enable kernel execution of
      this page.  To avoid this, KVM also sets NX=1 in the shadow PTE together
      with U=0, making the two states U=1/W=0/NX=gpte.NX and U=0/W=1/NX=1.
      When guest EFER has the NX bit cleared, the reserved bit check thinks
      that the latter state is invalid; teach it that the smep_andnot_wp case
      will also use the NX bit of SPTEs.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: NXiao Guangrong <guangrong.xiao@linux.inel.com>
      Fixes: c258b62bSigned-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5f0b8199
    • P
      KVM: MMU: fix ept=0/pte.u=1/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo · 844a5fe2
      Paolo Bonzini 提交于
      Yes, all of these are needed. :) This is admittedly a bit odd, but
      kvm-unit-tests access.flat tests this if you run it with "-cpu host"
      and of course ept=0.
      
      KVM runs the guest with CR0.WP=1, so it must handle supervisor writes
      specially when pte.u=1/pte.w=0/CR0.WP=0.  Such writes cause a fault
      when U=1 and W=0 in the SPTE, but they must succeed because CR0.WP=0.
      When KVM gets the fault, it sets U=0 and W=1 in the shadow PTE and
      restarts execution.  This will still cause a user write to fault, while
      supervisor writes will succeed.  User reads will fault spuriously now,
      and KVM will then flip U and W again in the SPTE (U=1, W=0).  User reads
      will be enabled and supervisor writes disabled, going back to the
      originary situation where supervisor writes fault spuriously.
      
      When SMEP is in effect, however, U=0 will enable kernel execution of
      this page.  To avoid this, KVM also sets NX=1 in the shadow PTE together
      with U=0.  If the guest has not enabled NX, the result is a continuous
      stream of page faults due to the NX bit being reserved.
      
      The fix is to force EFER.NX=1 even if the CPU is taking care of the EFER
      switch.  (All machines with SMEP have the CPU_LOAD_IA32_EFER vm-entry
      control, so they do not use user-return notifiers for EFER---if they did,
      EFER.NX would be forced to the same value as the host).
      
      There is another bug in the reserved bit check, which I've split to a
      separate patch for easier application to stable kernels.
      
      Cc: stable@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Reviewed-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
      Fixes: f6577a5fSigned-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      844a5fe2
    • A
      x86/entry: Call enter_from_user_mode() with IRQs off · 9999c8c0
      Andy Lutomirski 提交于
      Now that slow-path syscalls always enter C before enabling
      interrupts, it's straightforward to call enter_from_user_mode() before
      enabling interrupts rather than doing it as part of entry tracing.
      
      With this change, we should finally be able to retire exception_enter().
      
      This will also enable optimizations based on knowing that we never
      change context tracking state with interrupts on.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/bc376ecf87921a495e874ff98139b1ca2f5c5dd7.1457558566.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9999c8c0
    • A
      x86/entry/32: Change INT80 to be an interrupt gate · a798f091
      Andy Lutomirski 提交于
      We want all of the syscall entries to run with interrupts off so that
      we can efficiently run context tracking before enabling interrupts.
      
      This will regress int $0x80 performance on 32-bit kernels by a
      couple of cycles.  This shouldn't matter much -- int $0x80 is not a
      fast path.
      
      This effectively reverts:
      
        657c1eea ("x86/entry/32: Fix entry_INT80_32() to expect interrupts to be on")
      
      ... and fixes the same issue differently.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/59b4f90c9ebfccd8c937305dbbbca680bc74b905.1457558566.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a798f091
    • Y
      x86/fpu: Revert ("x86/fpu: Disable AVX when eagerfpu is off") · a65050c6
      Yu-cheng Yu 提交于
      Leonid Shatz noticed that the SDM interpretation of the following
      recent commit:
      
        394db20c ("x86/fpu: Disable AVX when eagerfpu is off")
      
      ... is incorrect and that the original behavior of the FPU code was correct.
      
      Because AVX is not stated in CR0 TS bit description, it was mistakenly
      believed to be not supported for lazy context switch. This turns out
      to be false:
      
        Intel Software Developer's Manual Vol. 3A, Sec. 2.5 Control Registers:
      
         'TS Task Switched bit (bit 3 of CR0) -- Allows the saving of the x87 FPU/
          MMX/SSE/SSE2/SSE3/SSSE3/SSE4 context on a task switch to be delayed until
          an x87 FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instruction is actually executed
          by the new task.'
      
        Intel Software Developer's Manual Vol. 2A, Sec. 2.4 Instruction Exception
        Specification:
      
         'AVX instructions refer to exceptions by classes that include #NM
          "Device Not Available" exception for lazy context switch.'
      
      So revert the commit.
      Reported-by: NLeonid Shatz <leonid.shatz@ravellosystems.com>
      Signed-off-by: NYu-cheng Yu <yu-cheng.yu@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
      Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1457569734-3785-1-git-send-email-yu-cheng.yu@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a65050c6
    • A
      x86/entry: Improve system call entry comments · fda57b22
      Andy Lutomirski 提交于
      Ingo suggested that the comments should explain when the various
      entries are used.  This adds these explanations and improves other
      parts of the comments.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/9524ecef7a295347294300045d08354d6a57c6e7.1457578375.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fda57b22
    • A
      x86/entry: Remove TIF_SINGLESTEP entry work · 392a6254
      Andy Lutomirski 提交于
      Now that SYSENTER with TF set puts X86_EFLAGS_TF directly into
      regs->flags, we don't need a TIF_SINGLESTEP fixup in the syscall
      entry code.  Remove it.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/2d15f24da52dafc9d2f0b8d76f55544f4779c517.1457578375.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      392a6254
    • A
      x86/entry/32: Add and check a stack canary for the SYSENTER stack · 2a41aa4f
      Andy Lutomirski 提交于
      The first instruction of the SYSENTER entry runs on its own tiny
      stack.  That stack can be used if a #DB or NMI is delivered before
      the SYSENTER prologue switches to a real stack.
      
      We have code in place to prevent us from overflowing the tiny stack.
      For added paranoia, add a canary to the stack and check it in
      do_debug() -- that way, if something goes wrong with the #DB logic,
      we'll eventually notice.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/6ff9a806f39098b166dc2c41c1db744df5272f29.1457578375.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2a41aa4f