1. 14 Apr, 2016 1 commit
  2. 13 Mar, 2016 3 commits
  3. 12 Mar, 2016 3 commits
    • x86/efi: Fix boot crash by always mapping boot service regions into new EFI page tables · 452308de
      Matt Fleming authored
      Some machines have EFI regions in page zero (physical address
      0x00000000) and historically that region has been added to the e820
      map via trim_bios_range(), and ultimately mapped into the kernel page
      tables. It was not mapped via efi_map_regions() as one would expect.
      
      Alexis reports that with the new separate EFI page tables some boot
      services regions, such as page zero, are not mapped. This triggers an
      oops during the SetVirtualAddressMap() runtime call.
      
      For the EFI boot services quirk on x86 we need to memblock_reserve()
      boot services regions until after SetVirtualAddressMap(). Doing that
      while respecting the ownership of regions that may have already been
      reserved by the kernel was the motivation behind this commit:
      
        7d68dc3f ("x86, efi: Do not reserve boot services regions within reserved areas")
      
      That patch was merged at a time when the EFI runtime virtual mappings
      were inserted into the kernel page tables as described above, and the
      trick of setting ->numpages (and hence the region size) to zero to
      track regions that should not be freed in efi_free_boot_services()
      meant that we never mapped those regions in efi_map_regions(). Instead
      we were relying solely on the existing kernel mappings.
      
      Now that we have separate page tables we need to make sure the EFI
      boot services regions are mapped correctly, even if someone else has
      already called memblock_reserve(). Instead of stashing a tag in
      ->numpages, set the EFI_MEMORY_RUNTIME bit of ->attribute. Since it
      generally makes no sense to mark a boot services region as required at
      runtime, it's pretty much guaranteed the firmware will not have
      already set this bit.
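
      In code terms the idea is roughly this (a minimal sketch against the
      4.5-era global 'memmap'; not the literal patch):

        /* Mark boot services regions via ->attribute so efi_map_regions()
         * maps them into the new EFI page tables, instead of zeroing
         * ->num_pages to tag them. */
        void *p;

        for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
                efi_memory_desc_t *md = p;

                if (md->type == EFI_BOOT_SERVICES_CODE ||
                    md->type == EFI_BOOT_SERVICES_DATA)
                        md->attribute |= EFI_MEMORY_RUNTIME;
        }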
      
      For the record, the specific circumstances under which Alexis
      triggered this bug was that an EFI runtime driver on his machine was
      responding to the EVT_SIGNAL_VIRTUAL_ADDRESS_CHANGE event during
      SetVirtualAddressMap().
      
      The event handler for this driver looks like this,
      
        sub rsp,0x28
        lea rdx,[rip+0x2445] # 0xaa948720
        mov ecx,0x4
        call func_aa9447c0  ; call to ConvertPointer(4, & 0xaa948720)
        mov r11,QWORD PTR [rip+0x2434] # 0xaa948720
        xor eax,eax
        mov BYTE PTR [r11+0x1],0x1
        add rsp,0x28
        ret
      
      Which is pretty typical code for an EVT_SIGNAL_VIRTUAL_ADDRESS_CHANGE
      handler. The "mov r11, QWORD PTR [rip+0x2424]" was the faulting
      instruction because ConvertPointer() was being called to convert the
      address 0x0000000000000000, which when converted is left unchanged and
      remains 0x0000000000000000.
      
      The output of the oops trace gave the impression of a standard NULL
      pointer dereference bug, but because we're accessing physical
      addresses during ConvertPointer(), it wasn't. EFI boot services code
      is stored at that address on Alexis' machine.
Reported-by: Alexis Murzeau <amurzeau@gmail.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maarten Lankhorst <maarten.lankhorst@canonical.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Raphael Hertzog <hertzog@debian.org>
      Cc: Roger Shimizu <rogershimizu@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-efi@vger.kernel.org
      Link: http://lkml.kernel.org/r/1457695163-29632-2-git-send-email-matt@codeblueprint.co.uk
Link: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=815125
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      452308de
    • x86/fpu: Fix eager-FPU handling on legacy FPU machines · 6e686709
      Borislav Petkov authored
      i486 derived cores like Intel Quark support only the very old,
      legacy x87 FPU (FSAVE/FRSTOR, CPUID bit FXSR is not set), and
      our FPU code wasn't handling the saving and restoring there
      properly in the 'eagerfpu' case.
      
      So after we made eagerfpu the default for all CPU types:
      
  58122bf1 ("x86/fpu: Default eagerfpu=on on all CPUs")
      
      these old FPU designs broke. First, Andy Shevchenko reported a splat:
      
        WARNING: CPU: 0 PID: 823 at arch/x86/include/asm/fpu/internal.h:163 fpu__clear+0x8c/0x160
      
      which was us trying to execute FXRSTOR on those machines even though
      they don't support it.
      
      After taking care of that, Bryan O'Donoghue reported that a simple FPU
      test still failed because we weren't initializing the FPU state properly
      on those machines.
      
      Take care of all that.
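
      The FXSR distinction at the heart of it, as a hedged sketch (helper
      name illustrative):

        /* i486-class cores lack FXSR, so use the legacy FRSTOR there. */
        static inline void sketch_fpregs_restore(struct fpu *fpu)
        {
                if (boot_cpu_has(X86_FEATURE_FXSR))
                        asm volatile("fxrstor %0" : : "m" (fpu->state.fxsave));
                else
                        asm volatile("frstor %0" : : "m" (fpu->state.fsave));
        }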
Reported-and-tested-by: Bryan O'Donoghue <pure.logic@nexus-software.ie>
Reported-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yu-cheng Yu <yu-cheng.yu@intel.com>
Link: http://lkml.kernel.org/r/20160311113206.GD4312@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6e686709
    • ARM: mvebu: fix overlap of Crypto SRAM with PCIe memory window · d7d5a43c
      Thomas Petazzoni authored
      When the Crypto SRAM mappings were added to the Device Tree files
      describing the Armada XP boards in commit c466d997 ("ARM: mvebu:
      define crypto SRAM ranges for all armada-xp boards"), the fact that
      those mappings were overlaping with the PCIe memory aperture was
      overlooked. Due to this, we currently have for all Armada XP platforms
      a situation that looks like this:
      
      Memory mapping on Armada XP boards with internal registers at
      0xf1000000:
      
       - 0x00000000 -> 0xf0000000	3.75G 	RAM
       - 0xf0000000 -> 0xf1000000	16M	NOR flashes (AXP GP / AXP DB)
       - 0xf1000000 -> 0xf1100000	1M	internal registers
 - 0xf8000000 -> 0xffe00000	126M	PCIe memory aperture
 - 0xf8100000 -> 0xf8110000	64KB	Crypto SRAM #0	=> OVERLAPS WITH PCIE !
 - 0xf8110000 -> 0xf8120000	64KB	Crypto SRAM #1	=> OVERLAPS WITH PCIE !
 - 0xffe00000 -> 0xfff00000	1M	PCIe I/O aperture
 - 0xfff00000 -> 0xffffffff	1M	BootROM
      
      The overlap means that when PCIe devices are added, depending on their
      memory window needs, they might or might not be mapped into the
      physical address space. Indeed, they will not be mapped if the area
      allocated in the PCIe memory aperture by the PCI core overlaps with
one of the Crypto SRAMs. Typically, an Intel IGB PCIe NIC that needs 8MB
of PCIe memory will see its PCIe memory window allocated from
0xf8000000 for 8MB, which overlaps with the Crypto SRAM windows. Due
      to this, the PCIe window is not created, and any attempt to access the
      PCIe window makes the kernel explode:
      
      [    3.302213] igb: Copyright (c) 2007-2014 Intel Corporation.
      [    3.307841] pci 0000:00:09.0: enabling device (0140 -> 0143)
      [    3.313539] mvebu_mbus: cannot add window '4:f8', conflicts with another window
      [    3.320870] mvebu-pcie soc:pcie-controller: Could not create MBus window at [mem 0xf8000000-0xf87fffff]: -22
      [    3.330811] Unhandled fault: external abort on non-linefetch (0x1008) at 0xf08c0018
      
      This problem does not occur on Armada 370 boards, because we use the
      following memory mapping (for boards that have internal registers at
      0xf1000000):
      
       - 0x00000000 -> 0xf0000000	3.75G 	RAM
       - 0xf0000000 -> 0xf1000000	16M	NOR flashes (AXP GP / AXP DB)
       - 0xf1000000 -> 0xf1100000	1M	internal registers
       - 0xf1100000 -> 0xf1110000	64KB	Crypto SRAM #0 => OK !
 - 0xf8000000 -> 0xffe00000	126M	PCIe memory
 - 0xffe00000 -> 0xfff00000	1M	PCIe I/O
 - 0xfff00000 -> 0xffffffff	1M	BootROM
      
      Obviously, the solution is to align the location of the Crypto SRAM
      mappings of Armada XP to be similar with the ones on Armada 370, i.e
      have them between the "internal registers" area and the beginning of
      the PCIe aperture.
      
      However, we have a special case with the OpenBlocks AX3-4 platform,
      which has a 128 MB NOR flash. Currently, this NOR flash is mapped from
      0xf0000000 to 0xf8000000. This is possible because on OpenBlocks
      AX3-4, the internal registers are not at 0xf1000000. And this explains
      why the Crypto SRAM mappings were not configured at the same place on
      Armada XP.
      
      Hence, the solution is two-fold:
      
 (1) Move the NOR flash mapping on Armada XP OpenBlocks AX3-4 from
     0xf0000000 -> 0xf8000000 down to 0xe8000000 -> 0xf0000000. This
     frees the 0xf0000000 -> 0xf8000000 space.
      
       (2) Move the Crypto SRAM mappings on Armada XP to be similar to
           Armada 370 (except of course that Armada XP has two Crypto SRAM
           and not one).
      
After this patch, the memory mapping on Armada XP boards with internal
registers at 0xf1000000 is:
      
       - 0x00000000 -> 0xf0000000	3.75G 	RAM
       - 0xf0000000 -> 0xf1000000	16M	NOR flashes (AXP GP / AXP DB)
       - 0xf1000000 -> 0xf1100000	1M	internal registers
       - 0xf1100000 -> 0xf1110000	64KB	Crypto SRAM #0
       - 0xf1110000 -> 0xf1120000	64KB	Crypto SRAM #1
 - 0xf8000000 -> 0xffe00000	126M	PCIe memory
 - 0xffe00000 -> 0xfff00000	1M	PCIe I/O
 - 0xfff00000 -> 0xffffffff	1M	BootROM
      
      And the memory mapping for the special case of the OpenBlocks AX3-4
      (internal registers at 0xd0000000, NOR of 128 MB):
      
 - 0x00000000 -> 0xc0000000	3G 	RAM
 - 0xd0000000 -> 0xd0100000	1M	internal registers
 - 0xe8000000 -> 0xf0000000	128M	NOR flash
 - 0xf1100000 -> 0xf1110000	64KB	Crypto SRAM #0
 - 0xf1110000 -> 0xf1120000	64KB	Crypto SRAM #1
 - 0xf8000000 -> 0xffe00000	126M	PCIe memory
 - 0xffe00000 -> 0xfff00000	1M	PCIe I/O
 - 0xfff00000 -> 0xffffffff	1M	BootROM
      
      Fixes: c466d997 ("ARM: mvebu: define crypto SRAM ranges for all armada-xp boards")
Reported-by: Phil Sutter <phil@nwl.cc>
Cc: Phil Sutter <phil@nwl.cc>
Cc: <stable@vger.kernel.org>
Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
Acked-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
Signed-off-by: Olof Johansson <olof@lixom.net>
      d7d5a43c
  4. 10 Mar, 2016 6 commits
    • x86/delay: Avoid preemptible context checks in delay_mwaitx() · 84477336
      Borislav Petkov authored
We use this_cpu_ptr(&cpu_tss) as a cacheline-aligned, seldom-accessed
per-cpu variable as the MONITORX target in delay_mwaitx(). However,
      when called in preemptible context, this_cpu_ptr -> smp_processor_id() ->
      debug_smp_processor_id() fires:
      
        BUG: using smp_processor_id() in preemptible [00000000] code: udevd/312
        caller is delay_mwaitx+0x40/0xa0
      
      But we don't care about that check - we only need cpu_tss as a MONITORX
      target and it doesn't really matter which CPU's var we're touching as
      we're going idle anyway. Fix that.
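
      A sketch of the fix, assuming the raw per-cpu accessor is all that is
      needed:

        /* raw_cpu_ptr() skips debug_smp_processor_id(); any CPU's cpu_tss
         * works as a MONITORX target since we are only going idle. */
        __monitorx(raw_cpu_ptr(&cpu_tss), 0, 0);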
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Huang Rui <ray.huang@amd.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: spg_linux_kernel@amd.com
Link: http://lkml.kernel.org/r/20160309205622.GG6564@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      84477336
    • KVM: MMU: fix reserved bit check for ept=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 · 5f0b8199
      Paolo Bonzini authored
      KVM has special logic to handle pages with pte.u=1 and pte.w=0 when
      CR0.WP=1.  These pages' SPTEs flip continuously between two states:
      U=1/W=0 (user and supervisor reads allowed, supervisor writes not allowed)
      and U=0/W=1 (supervisor reads and writes allowed, user writes not allowed).
      
      When SMEP is in effect, however, U=0 will enable kernel execution of
      this page.  To avoid this, KVM also sets NX=1 in the shadow PTE together
      with U=0, making the two states U=1/W=0/NX=gpte.NX and U=0/W=1/NX=1.
      When guest EFER has the NX bit cleared, the reserved bit check thinks
      that the latter state is invalid; teach it that the smep_andnot_wp case
      will also use the NX bit of SPTEs.
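
      In code, the check amounts to something like this sketch:

        /* Shadow PTEs use NX when the guest enables it, and also in the
         * CR0.WP=0 / CR4.SMEP=1 special case described above. */
        bool uses_nx = context->nx || context->base_role.smep_andnot_wp;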
      
      Cc: stable@vger.kernel.org
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Fixes: c258b62b
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5f0b8199
    • KVM: MMU: fix ept=0/pte.u=1/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo · 844a5fe2
      Paolo Bonzini authored
      Yes, all of these are needed. :) This is admittedly a bit odd, but
      kvm-unit-tests access.flat tests this if you run it with "-cpu host"
      and of course ept=0.
      
      KVM runs the guest with CR0.WP=1, so it must handle supervisor writes
      specially when pte.u=1/pte.w=0/CR0.WP=0.  Such writes cause a fault
      when U=1 and W=0 in the SPTE, but they must succeed because CR0.WP=0.
      When KVM gets the fault, it sets U=0 and W=1 in the shadow PTE and
      restarts execution.  This will still cause a user write to fault, while
      supervisor writes will succeed.  User reads will fault spuriously now,
      and KVM will then flip U and W again in the SPTE (U=1, W=0).  User reads
      will be enabled and supervisor writes disabled, going back to the
original situation where supervisor writes fault spuriously.
      
      When SMEP is in effect, however, U=0 will enable kernel execution of
      this page.  To avoid this, KVM also sets NX=1 in the shadow PTE together
      with U=0.  If the guest has not enabled NX, the result is a continuous
      stream of page faults due to the NX bit being reserved.
      
      The fix is to force EFER.NX=1 even if the CPU is taking care of the EFER
      switch.  (All machines with SMEP have the CPU_LOAD_IA32_EFER vm-entry
      control, so they do not use user-return notifiers for EFER---if they did,
      EFER.NX would be forced to the same value as the host).
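
      A hedged sketch of the EFER side of the fix:

        /* With shadow paging (ept=0) the shadow PTEs may set NX, so the
         * EFER the guest runs with must have NX enabled too. */
        if (!enable_ept)
                guest_efer |= EFER_NX;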
      
      There is another bug in the reserved bit check, which I've split to a
      separate patch for easier application to stable kernels.
      
      Cc: stable@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Fixes: f6577a5f
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      844a5fe2
    • x86/fpu: Revert ("x86/fpu: Disable AVX when eagerfpu is off") · a65050c6
      Yu-cheng Yu authored
      Leonid Shatz noticed that the SDM interpretation of the following
      recent commit:
      
        394db20c ("x86/fpu: Disable AVX when eagerfpu is off")
      
      ... is incorrect and that the original behavior of the FPU code was correct.
      
Because AVX is not mentioned in the CR0.TS bit description, it was
mistakenly believed not to be supported by lazy context switching. This
turns out to be false:
      
        Intel Software Developer's Manual Vol. 3A, Sec. 2.5 Control Registers:
      
         'TS Task Switched bit (bit 3 of CR0) -- Allows the saving of the x87 FPU/
          MMX/SSE/SSE2/SSE3/SSSE3/SSE4 context on a task switch to be delayed until
          an x87 FPU/MMX/SSE/SSE2/SSE3/SSSE3/SSE4 instruction is actually executed
          by the new task.'
      
        Intel Software Developer's Manual Vol. 2A, Sec. 2.4 Instruction Exception
        Specification:
      
         'AVX instructions refer to exceptions by classes that include #NM
          "Device Not Available" exception for lazy context switch.'
      
      So revert the commit.
Reported-by: Leonid Shatz <leonid.shatz@ravellosystems.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
      Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1457569734-3785-1-git-send-email-yu-cheng.yu@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a65050c6
    • s390/mm: four page table levels vs. fork · 3446c13b
      Martin Schwidefsky authored
The fork of a process with four page table levels is broken since
commit 6252d702 ("[S390] dynamic page tables").
      
      All new mm contexts are created with three page table levels and
      an asce limit of 4TB. If the parent has four levels dup_mmap will
      add vmas to the new context which are outside of the asce limit.
      The subsequent call to copy_page_range will walk the three level
      page table structure of the new process with non-zero pgd and pud
      indexes. This leads to memory clobbers as the pgd_index *and* the
      pud_index is added to the mm->pgd pointer without a pgd_deref
      in between.
      
      The init_new_context() function is selecting the number of page
      table levels for a new context. The function is used by mm_init()
      which in turn is called by dup_mm() and mm_alloc(). These two are
      used by fork() and exec(). The init_new_context() function can
      distinguish the two cases by looking at mm->context.asce_limit,
      for fork() the mm struct has been copied and the number of page
table levels may not change. For exec() the mm_alloc() function sets
the new mm structure to zero; in this case a three-level page table
is created, as the temporary stack space is located at
STACK_TOP_MAX = 4TB.
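
      A simplified sketch of the resulting init_new_context() logic:

        /* fork(): asce_limit was copied from the parent, keep the page
         * table geometry.  exec(): mm_alloc() zeroed the context, so
         * start with three levels and a 4TB limit. */
        if (mm->context.asce_limit == 0)
                mm->context.asce_limit = STACK_TOP_MAX;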
      
      This fixes CVE-2016-2143.
Reported-by: Marcin Kościelnicki <koriakin@0x04.net>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      3446c13b
    • arm64: kasan: clear stale stack poison · 0d97e6d8
      Mark Rutland authored
      Functions which the compiler has instrumented for KASAN place poison on
      the stack shadow upon entry and remove this poison prior to returning.
      
      In the case of cpuidle, CPUs exit the kernel a number of levels deep in
      C code.  Any instrumented functions on this critical path will leave
      portions of the stack shadow poisoned.
      
If CPUs lose context and return to the kernel via a cold path, the
instrumented function calls made between saving a context in
__cpu_suspend_enter and the actual exit of the kernel are forgotten,
and we never remove the poison they placed in the stack shadow area.
      
      Thus, (depending on stackframe layout) subsequent calls to instrumented
      functions may hit this stale poison, resulting in (spurious) KASAN
      splats to the console.
      
      To avoid this, clear any stale poison from the idle thread for a CPU
      prior to bringing a CPU online.
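
      A simplified sketch of the unpoison step:

        /* Wipe stale KASAN poison covering the idle task's stack before
         * the CPU is brought online. */
        void kasan_unpoison_task_stack(struct task_struct *task)
        {
                kasan_unpoison_shadow(task_stack_page(task), THREAD_SIZE);
        }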
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d97e6d8
  5. 09 Mar, 2016 3 commits
    • arm64: hugetlb: partial revert of 66b3923a · ff792584
      Will Deacon authored
      Commit 66b3923a ("arm64: hugetlb: add support for PTE contiguous bit")
      introduced support for huge pages using the contiguous bit in the PTE
      as opposed to block mappings, which may be slightly unwieldy (512M) in
      64k page configurations.
      
      Unfortunately, this support has resulted in some late regressions when
      running the libhugetlbfs test suite with 64k pages and CONFIG_DEBUG_VM
      as a result of a BUG:
      
       | readback (2M: 64):	------------[ cut here ]------------
       | kernel BUG at fs/hugetlbfs/inode.c:446!
       | Internal error: Oops - BUG: 0 [#1] SMP
       | Modules linked in:
       | CPU: 7 PID: 1448 Comm: readback Not tainted 4.5.0-rc7 #148
       | Hardware name: linux,dummy-virt (DT)
       | task: fffffe0040964b00 ti: fffffe00c2668000 task.ti: fffffe00c2668000
       | PC is at remove_inode_hugepages+0x44c/0x480
       | LR is at remove_inode_hugepages+0x264/0x480
      
      Rather than revert the entire patch, simply avoid advertising the
      contiguous huge page sizes for now while people are actively working on
      a fix. This patch can then be reverted once things have been sorted out.
      
      Cc: David Woods <dwoods@ezchip.com>
Reported-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
      ff792584
    • arm64: account for sparsemem section alignment when choosing vmemmap offset · 36e5cd6b
      Ard Biesheuvel authored
      Commit dfd55ad8 ("arm64: vmemmap: use virtual projection of linear
      region") fixed an issue where the struct page array would overflow into the
      adjacent virtual memory region if system RAM was placed so high up in
      physical memory that its addresses were not representable in the build time
      configured virtual address size.
      
      However, the fix failed to take into account that the vmemmap region needs
      to be relatively aligned with respect to the sparsemem section size, so that
      a sequence of page structs corresponding with a sparsemem section in the
      linear region appears naturally aligned in the vmemmap region.
      
      So round up vmemmap to sparsemem section size. Since this essentially moves
      the projection of the linear region up in memory, also revert the reduction
      of the size of the vmemmap region.
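
      The rounding amounts to roughly this (sketch of the arm64 vmemmap
      definition):

        /* Offset the vmemmap base by a section-aligned pfn so the page
         * structs of a sparsemem section stay naturally aligned. */
        #define vmemmap ((struct page *)VMEMMAP_START - \
                         SECTION_ALIGN_DOWN(memstart_addr >> PAGE_SHIFT))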
      
      Cc: <stable@vger.kernel.org>
      Fixes: dfd55ad8 ("arm64: vmemmap: use virtual projection of linear region")
Tested-by: Mark Langsdorf <mlangsdo@redhat.com>
Tested-by: David Daney <david.daney@cavium.com>
Tested-by: Robert Richter <rrichter@cavium.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
      36e5cd6b
    • x86/fpu: Fix 'no387' regression · f363938c
      Andy Lutomirski authored
      After fixing FPU option parsing, we now parse the 'no387' boot option
      too early: no387 clears X86_FEATURE_FPU before it's even probed, so
      the boot CPU promptly re-enables it.
      
      I suspect it gets even more confused on SMP.
      
      Fix the probing code to leave X86_FEATURE_FPU off if it's been
      disabled by setup_clear_cpu_cap().
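
      One plausible shape of the probe-side check (a sketch, not the
      literal diff):

        /* Do not re-set X86_FEATURE_FPU if "no387" already cleared it
         * via setup_clear_cpu_cap(). */
        if (!test_bit(X86_FEATURE_FPU, (unsigned long *)cpu_caps_cleared))
                set_cpu_cap(c, X86_FEATURE_FPU);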
Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: yu-cheng yu <yu-cheng.yu@intel.com>
      Fixes: 4f81cbaf ("x86/fpu: Fix early FPU command-line parsing")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f363938c
  6. 08 Mar, 2016 4 commits
    • KVM: s390: correct fprs on SIGP (STOP AND) STORE STATUS · 9522b37f
      David Hildenbrand authored
      With MACHINE_HAS_VX, we convert the floating point registers from the
vector registers when storing the status. For other VCPUs, these are
      stored to vcpu->run->s.regs.vrs, but we are using current->thread.fpu.vxrs,
      which resolves to the currently loaded VCPU.
      
      So kvm_s390_store_status_unloaded() currently writes the wrong floating
      point registers (converted from the vector registers) when called from
      another VCPU on a z13.
      
      This is only the case for old user space not handling SIGP STORE STATUS and
      SIGP STOP AND STORE STATUS, but relying on the kernel implementation. All
      other calls come from the loaded VCPU via kvm_s390_store_status().
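
      The corrected conversion, per the description above (sketch):

        /* Convert from the target VCPU's saved vector registers, not from
         * the FPU of the currently loaded thread. */
        if (MACHINE_HAS_VX)
                convert_vx_to_fp(fprs, (__vector128 *) vcpu->run->s.regs.vrs);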
      
Fixes: 9abc2a08 ("KVM: s390: fix memory overwrites when vx is disabled")
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: stable@vger.kernel.org # v4.4+
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9522b37f
    • KVM: VMX: disable PEBS before a guest entry · 7099e2e1
      Radim Krčmář authored
      Linux guests on Haswell (and also SandyBridge and Broadwell, at least)
      would crash if you decided to run a host command that uses PEBS, like
        perf record -e 'cpu/mem-stores/pp' -a
      
      This happens because KVM is using VMX MSR switching to disable PEBS, but
      SDM [2015-12] 18.4.4.4 Re-configuring PEBS Facilities explains why it
      isn't safe:
        When software needs to reconfigure PEBS facilities, it should allow a
        quiescent period between stopping the prior event counting and setting
        up a new PEBS event. The quiescent period is to allow any latent
        residual PEBS records to complete its capture at their previously
        specified buffer address (provided by IA32_DS_AREA).
      
      There might not be a quiescent period after the MSR switch, so a CPU
      ends up using host's MSR_IA32_DS_AREA to access an area in guest's
      memory.  (Or MSR switching is just buggy on some models.)
      
      The guest can learn something about the host this way:
If the guest doesn't map the address pointed to by MSR_IA32_DS_AREA, it
results in a #PF where we leak the host's MSR_IA32_DS_AREA through CR2.
      
      After that, a malicious guest can map and configure memory where
      MSR_IA32_DS_AREA is pointing and can therefore get an output from
      host's tracing.
      
This is not a critical leak as the host must initiate the PEBS tracing
      and I have not been able to get a record from more than one instruction
      before vmentry in vmx_vcpu_run() (that place has most registers already
      overwritten with guest's).
      
We could disable PEBS just a few instructions before vmentry, but
disabling it earlier shouldn't affect host tracing too much.
      We also don't need to switch MSR_IA32_PEBS_ENABLE on VMENTRY, but that
      optimization isn't worth its code, IMO.
      
      (If you are implementing PEBS for guests, be sure to handle the case
       where both host and guest enable PEBS, because this patch doesn't.)
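
      Conceptually (a hedged sketch of the idea, not the literal patch):

        /* Make sure no PEBS event is armed while the MSR switch to the
         * guest values takes effect. */
        wrmsrl(MSR_IA32_PEBS_ENABLE, 0);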
      
      Fixes: 26a4f3c0 ("perf/x86: disable PEBS on a guest entry.")
      Cc: <stable@vger.kernel.org>
Reported-by: Jiří Olša <jolsa@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7099e2e1
    • s390/cpumf: Fix lpp detection · 7a76aa95
      Christian Borntraeger authored
We have to check bit 40 of the facility list before issuing LPP,
and not bit 48. Otherwise a guest running on a system with
the "decimal-floating-point zoned-conversion facility" and without
the "set-program-parameters facility" might crash on an LPP
instruction.
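
      In other words (sketch):

        /* Facility bit 40 is "set-program-parameters"; bit 48 belongs to
         * an unrelated facility. */
        int lpp_available = test_facility(40);  /* was: test_facility(48) */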
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: stable@vger.kernel.org # v4.4+
Fixes: e22cf8ca ("s390/cpumf: rework program parameter setting to detect guest samples")
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      7a76aa95
    • KVM: PPC: Book3S HV: Sanitize special-purpose register values on guest exit · ccec4456
      Paul Mackerras authored
      Thomas Huth discovered that a guest could cause a hard hang of a
      host CPU by setting the Instruction Authority Mask Register (IAMR)
      to a suitable value.  It turns out that this is because when the
      code was added to context-switch the new special-purpose registers
      (SPRs) that were added in POWER8, we forgot to add code to ensure
      that they were restored to a sane value on guest exit.
      
      This adds code to set those registers where a bad value could
      compromise the execution of the host kernel to a suitable neutral
      value on guest exit.
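
      For the IAMR, for instance, the exit path now does the equivalent of
      this C sketch (the real change is in guest-exit assembly):

        /* Reset a guest-controlled SPR to a neutral value on exit. */
        mtspr(SPRN_IAMR, 0);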
      
      Cc: stable@vger.kernel.org # v3.14+
Fixes: b005255e
Reported-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
      ccec4456
  7. 07 Mar, 2016 2 commits
    • ARM: dts: dra7: do not gate cpsw clock due to errata i877 · 0f514e69
      Mugunthan V N authored
      Errata id: i877
      
      Description:
      ------------
      The RGMII 1000 Mbps Transmit timing is based on the output clock
      (rgmiin_txc) being driven relative to the rising edge of an internal
      clock and the output control/data (rgmiin_txctl/txd) being driven relative
      to the falling edge of an internal clock source. If the internal clock
      source is allowed to be static low (i.e., disabled) for an extended period
      of time then when the clock is actually enabled the timing delta between
      the rising edge and falling edge can change over the lifetime of the
      device. This can result in the device switching characteristics degrading
      over time, and eventually failing to meet the Data Manual Delay Time/Skew
      specs.
      To maintain RGMII 1000 Mbps IO Timings, SW should minimize the
      duration that the Ethernet internal clock source is disabled. Note that
      the device reset state for the Ethernet clock is "disabled".
Other RGMII modes (10 Mbps, 100 Mbps) are not affected.
      
      Workaround:
      -----------
      If the SoC Ethernet interface(s) are used in RGMII mode at 1000 Mbps,
      SW should minimize the time the Ethernet internal clock source is disabled
      to a maximum of 200 hours in a device life cycle. This is done by enabling
      the clock as early as possible in IPL (QNX) or SPL/u-boot (Linux/Android)
      by setting the register CM_GMAC_CLKSTCTRL[1:0]CLKTRCTRL = 0x2:SW_WKUP.
      
So, prevent the cpsw clocks from being gated by using the ti,no-idle
property in the cpsw node, assuming 1000 Mbps is being used all the
time. If someone does not need 1000 Mbps and wants to gate the clocks
to cpsw, this property needs to be deleted in their respective board
files.
Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Lokesh Vutla <lokeshvutla@ti.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Paul Walmsley <paul@pwsan.com>
      0f514e69
    • ARM: OMAP2+: hwmod: Introduce ti,no-idle dt property · 2e18f5a1
      Lokesh Vutla authored
Introduce a dt property, ti,no-idle, that prevents an IP from idling at
any point. This is to handle Errata i877, which says that the GMAC
clocks cannot be disabled.
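
      A sketch of how the property could be consumed on the hwmod side
      (flag name illustrative):

        /* Never let this hwmod idle if its DT node carries "ti,no-idle". */
        if (of_property_read_bool(np, "ti,no-idle"))
                oh->flags |= HWMOD_NO_IDLE;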
Acked-by: Roger Quadros <rogerq@ti.com>
Tested-by: Mugunthan V N <mugunthanvnm@ti.com>
Signed-off-by: Lokesh Vutla <lokeshvutla@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: Dave Gerlach <d-gerlach@ti.com>
Acked-by: Rob Herring <robh@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Paul Walmsley <paul@pwsan.com>
      2e18f5a1
  8. 06 Mar, 2016 3 commits
  9. 05 Mar, 2016 2 commits
  10. 04 Mar, 2016 1 commit
  11. 03 Mar, 2016 2 commits
    • powerpc/hw_breakpoint: Fix oops when destroying hw_breakpoint event · fb822e60
      Ravi Bangoria authored
      When destroying a hw_breakpoint event, the kernel oopses as follows:
      
        Unable to handle kernel paging request for data at address 0x00000c07
        NIP [c0000000000291d0] arch_unregister_hw_breakpoint+0x40/0x60
        LR [c00000000020b6b4] release_bp_slot+0x44/0x80
      
      Call chain:
      
        hw_breakpoint_event_init()
          bp->destroy = bp_perf_event_destroy;
      
        do_exit()
          perf_event_exit_task()
            perf_event_exit_task_context()
              WRITE_ONCE(child_ctx->task, TASK_TOMBSTONE);
              perf_event_exit_event()
                free_event()
                  _free_event()
                    bp_perf_event_destroy() // event->destroy(event);
                      release_bp_slot()
                        arch_unregister_hw_breakpoint()
      
perf_event_exit_task_context() sets child_ctx->task to TASK_TOMBSTONE,
which is (void *)-1. arch_unregister_hw_breakpoint() then tries to fetch
the 'thread' attribute of that 'task', resulting in the oops.
      
      Peterz points out that the code shouldn't be using bp->ctx anyway, but
      fixing that will require a decent amount of rework. So for now to fix
      the oops, check if bp->ctx->task has been set to (void *)-1, before
      dereferencing it. We don't use TASK_TOMBSTONE, because that would
      require exporting it and it's supposed to be an internal detail.
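
      The guard amounts to this sketch (the thread field name is taken
      from the arch code, for illustration):

        /* Bail out if the context's task is gone; TASK_TOMBSTONE is
         * (void *)-1 and must not be dereferenced. */
        if (bp->ctx && bp->ctx->task && bp->ctx->task != ((void *)-1L))
                bp->ctx->task->thread.last_hit_ubp = NULL;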
      
      Fixes: 63b6da39 ("perf: Fix perf_event_exit_task() race")
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      fb822e60
    • PM / sleep / x86: Fix crash on graph trace through x86 suspend · 92f9e179
      Todd E Brandt authored
      Pause/unpause graph tracing around do_suspend_lowlevel as it has
      inconsistent call/return info after it jumps to the wakeup vector.
      The graph trace buffer will otherwise become misaligned and
      may eventually crash and hang on suspend.
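
      The fix brackets the low-level entry, as in this sketch:

        /* Graph-trace state is inconsistent across the jump to the wakeup
         * vector, so pause tracing around the asm suspend entry. */
        pause_graph_tracing();
        do_suspend_lowlevel();
        unpause_graph_tracing();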
      
      To reproduce the issue and test the fix:
      Run a function_graph trace over suspend/resume and set the graph
      function to suspend_devices_and_enter. This consistently hangs the
      system without this fix.
Signed-off-by: Todd Brandt <todd.e.brandt@linux.intel.com>
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      92f9e179
  12. 02 Mar, 2016 6 commits
  13. 01 Mar, 2016 2 commits
  14. 29 Feb, 2016 2 commits
    • MIPS: kvm: Fix ioctl error handling. · 887349f6
      Michael S. Tsirkin authored
      Calling return copy_to_user(...) or return copy_from_user in an ioctl
      will not do the right thing if there's a pagefault:
      copy_to_user/copy_from_user return the number of bytes not copied in
      this case.
      
Fix up KVM on MIPS to do
	return copy_to_user(...) ? -EFAULT : 0;
and
	return copy_from_user(...) ? -EFAULT : 0;
      
      everywhere.
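
      Expanded, the same pattern reads as follows (hypothetical handler,
      names illustrative):

        long sketch_get_regs(struct kvm_regs __user *argp,
                             const struct kvm_regs *regs)
        {
                /* copy_to_user() returns the number of bytes NOT copied,
                 * never an -errno, so map any shortfall to -EFAULT. */
                if (copy_to_user(argp, regs, sizeof(*regs)))
                        return -EFAULT;
                return 0;
        }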
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: stable@vger.kernel.org
Cc: kvm@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/12709/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      887349f6
    • MIPS: scache: Fix scache init with invalid line size. · 56fa81fc
      Govindraj Raja authored
In the current scache init, the cache line_size is determined from the
CPU config register; however, if there is no scache, the
mips_sc_probe_cm3() function populates an invalid line_size of 2.

The invalid line_size can cause a NULL pointer dereference
during r4k_dma_cache_inv(), as r4k_blast_scache is populated
based on line_size. A scache line_size of 2 is an invalid option in
r4k_blast_scache_setup().
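
      A sketch of the CM3 probe fix, per the description (simplified):

        /* A zero line-size field means no L2 is present; bail out instead
         * of reporting the bogus linesz = 2 << 0 = 2. */
        if (!line_sz)
                return 0;
        c->scache.linesz = 2 << line_sz;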
      
      This issue was faced during a MIPS I6400 based virtual platform bring up
      where scache was not available in virtual platform model.
Signed-off-by: Govindraj Raja <Govindraj.Raja@imgtec.com>
Fixes: 7d53e9c4 ("MIPS: CM3: Add support for CM3 L2 cache.")
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hartley <James.Hartley@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Cc: stable@vger.kernel.org # v4.2+
Patchwork: https://patchwork.linux-mips.org/patch/12710/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      56fa81fc