1. 02 Nov 2017, 2 commits
  2. 31 Oct 2017, 1 commit
  3. 30 Oct 2017, 2 commits
    • arm64: prevent regressions in compressed kernel image size when upgrading to binutils 2.27 · fd9dde6a
      Committed by Nick Desaulniers
      Upon upgrading to binutils 2.27, we found that our lz4 and gzip
      compressed kernel images were significantly larger, resulting in 10ms
      boot time regressions.
      
      As noted by Rahul:
      "aarch64 binaries uses RELA relocations, where each relocation entry
      includes an addend value. This is similar to x86_64.  On x86_64, the
      addend values are also stored at the relocation offset for relative
      relocations. This is an optimization: in the case where code does not
      need to be relocated, the loader can simply skip processing relative
      relocations.  In binutils-2.25, both bfd and gold linkers did this for
      x86_64, but only the gold linker did this for aarch64.  The kernel build
      here is using the bfd linker, which stored zeroes at the relocation
      offsets for relative relocations.  Since a set of zeroes compresses
      better than a set of non-zero addend values, this behavior was resulting
      in much better lz4 compression.
      
      The bfd linker in binutils-2.27 is now storing the actual addend values
      at the relocation offsets. The behavior is now consistent with what it
      does for x86_64 and what gold linker does for both architectures.  The
      change happened in this upstream commit:
      https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=1f56df9d0d5ad89806c24e71f296576d82344613
      Since a bunch of zeroes got replaced by non-zero addend values, we see
      the side effect of lz4 compressed image being a bit bigger.
      
      To get the old behavior from the bfd linker, "--no-apply-dynamic-relocs"
      flag can be used:
      $ LDFLAGS="--no-apply-dynamic-relocs" make
      With this flag, the compressed image size is back to what it was with
      binutils-2.25.
      
      If the kernel is using ASLR, there aren't additional runtime costs to
      --no-apply-dynamic-relocs, as the relocations will need to be applied
      again anyway after the kernel is relocated to a random address.
      
      If the kernel is not using ASLR, then presumably the current default
      behavior of the linker is better. Since the static linker performed the
      dynamic relocs, and the kernel is not moved to a different address at
      load time, it can skip applying the relocations all over again."
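      
      For reference, this is the AArch64 RELA entry layout that carries the
      explicit addend (as declared in <elf.h>); shown here only to illustrate
      where the addend lives:
      
        /* With RELA, the addend is part of the entry itself, so the image
         * bytes at r_offset can stay zero, unless the linker also mirrors
         * the addend there (bfd does so from binutils 2.27 onwards). */
        typedef struct {
                Elf64_Addr   r_offset;  /* where to apply the relocation */
                Elf64_Xword  r_info;    /* symbol index and relocation type */
                Elf64_Sxword r_addend;  /* explicit addend, e.g. R_AARCH64_RELATIVE */
        } Elf64_Rela;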
      
      Some measurements:
      
      $ ld -v
      GNU ld (binutils-2.25-f3d35cf6) 2.25.51.20141117
                          ^
      $ ls -l vmlinux
      -rwxr-x--- 1 ndesaulniers eng 300652760 Oct 26 11:57 vmlinux
      $ ls -l Image.lz4-dtb
      -rw-r----- 1 ndesaulniers eng 16932627 Oct 26 11:57 Image.lz4-dtb
      
      $ ld -v
      GNU ld (binutils-2.27-53dd00a1) 2.27.0.20170315
                          ^
      pre patch:
      $ ls -l vmlinux
      -rwxr-x--- 1 ndesaulniers eng 300376208 Oct 26 11:43 vmlinux
      $ ls -l Image.lz4-dtb
      -rw-r----- 1 ndesaulniers eng 18159474 Oct 26 11:43 Image.lz4-dtb
      
      post patch:
      $ ls -l vmlinux
      -rwxr-x--- 1 ndesaulniers eng 300376208 Oct 26 12:06 vmlinux
      $ ls -l Image.lz4-dtb
      -rw-r----- 1 ndesaulniers eng 16932466 Oct 26 12:06 Image.lz4-dtb
      
      By Siqi's measurement w/ gzip:
      binutils 2.27 with this patch (with --no-apply-dynamic-relocs):
      Image 41535488
      Image.gz 13404067
      
      binutils 2.27 without this patch (without --no-apply-dynamic-relocs):
      Image 41535488
      Image.gz 14125516
      
      Any compression scheme should be able to get better results from the
      longer runs of zeros, not just GZIP and LZ4.
      
      10ms boot time savings isn't anything to get excited about, but users of
      arm64+compression+bfd-2.27 should not have to pay a penalty for no
      runtime improvement.
      Reported-by: Gopinath Elanchezhian <gelanchezhian@google.com>
      Reported-by: Sindhuri Pentyala <spentyala@google.com>
      Reported-by: Wei Wang <wvw@google.com>
      Suggested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Suggested-by: Rahul Chaudhry <rahulchaudhry@google.com>
      Suggested-by: Siqi Lin <siqilin@google.com>
      Suggested-by: Stephen Hines <srhines@google.com>
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      [will: added comment to Makefile]
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      fd9dde6a
    • arm64: Implement arch-specific pte_access_permitted() · 6218f96c
      Committed by Catalin Marinas
      The generic pte_access_permitted() implementation only checks for
      pte_present() (together with the write permission where applicable).
      However, on arm64 pte_present() also returns true for kernel ptes and
      PROT_NONE mappings, even though such mappings are not user accessible.
      Additionally, arm64 now supports execute-only user permission
      (PROT_EXEC), which is implemented by clearing the PTE_USER bit.
      
      With this patch the arm64 implementation of pte_access_permitted()
      checks for the PTE_VALID and PTE_USER bits together with writable access
      if applicable.
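      
      A minimal sketch of the check described above (macro names follow
      arch/arm64/include/asm/pgtable.h; the exact definition in the commit
      may differ slightly):
      
        /* User access needs a valid PTE with the user bit set, plus write
         * permission when a write access is requested. */
        #define pte_valid_user(pte) \
                ((pte_val(pte) & (PTE_VALID | PTE_USER)) == (PTE_VALID | PTE_USER))
        #define pte_access_permitted(pte, write) \
                (pte_valid_user(pte) && (!(write) || pte_write(pte)))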
      
      Cc: <stable@vger.kernel.org>
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      6218f96c
  4. 27 Oct 2017, 4 commits
  5. 25 Oct 2017, 3 commits
  6. 24 Oct 2017, 1 commit
    • arm64: Avoid aligning normal memory pointers in __memcpy_{to,from}io · 9ca255bf
      Committed by Mark Salyzyn
      __memcpy_{to,from}io fall back to byte-at-a-time copying if both the
      source and destination pointers are not 8-byte aligned. Since one of the
      pointers always points at normal memory, this is unnecessary and
      detrimental to performance, so only do byte copying until we hit an 8-byte
      boundary for the device pointer.
      
      This change was motivated by performance issues in the pstore driver.
      On a test platform, the pstore probe time with a 1/4MB console buffer
      and a 1/2MB pmsg buffer was in the 90-107ms region. This change reduced
      it to 10-25ms, a direct improvement in boot time.
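      
      A sketch of the idea for the read direction (simplified; only the
      device-side __iomem pointer needs aligning):
      
        void __memcpy_fromio(void *to, const volatile void __iomem *from,
                             size_t count)
        {
                /* Byte copies only until the device pointer is 8-byte aligned. */
                while (count && !IS_ALIGNED((unsigned long)from, 8)) {
                        *(u8 *)to = __raw_readb(from);
                        from++; to++; count--;
                }
                /* Bulk of the copy as 64-bit reads from the device. */
                while (count >= 8) {
                        *(u64 *)to = __raw_readq(from);
                        from += 8; to += 8; count -= 8;
                }
                /* Remaining tail bytes. */
                while (count) {
                        *(u8 *)to = __raw_readb(from);
                        from++; to++; count--;
                }
        }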
      
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Mark Salyzyn <salyzyn@android.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      9ca255bf
  7. 20 Oct 2017, 1 commit
    • arm64: Fix the feature type for ID register fields · 5bdecb79
      Committed by Suzuki K Poulose
      Now that the ARM ARM clearly specifies the rules for inferring
      the values of the ID register fields, fix the types of the
      feature bits we have in the kernel.
      
      ARM ARM DDI0487B.b, section D10.1.4 "Principles of the ID scheme for
      fields in ID registers", lists the registers to which the scheme
      applies along with the exceptions.
      
      This patch changes the relevant feature bits from FTR_EXACT
      to FTR_LOWER_SAFE to select the safer value. This will enable
      an older kernel running on a new CPU to detect the safer option
      rather than completely disabling the feature.
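      
      For illustration, one such field description in
      arch/arm64/kernel/cpufeature.c then reads roughly as follows (the
      field chosen here is an assumption):
      
        /* SHA2 field of ID_AA64ISAR0_EL1: use the lowest value seen
         * across CPUs as the safe system-wide value. */
        ARM64_FTR_BITS(FTR_VISIBLE, FTR_STRICT, FTR_LOWER_SAFE,
                       ID_AA64ISAR0_SHA2_SHIFT, 4, 0),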
      
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Martin <dave.martin@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      5bdecb79
  8. 19 Oct 2017, 1 commit
    • arm64: Update fault_info table with new exception types · 3f7c86b2
      Committed by Julien Thierry
      Based on: ARM Architecture Reference Manual, ARMv8 (DDI 0487B.b).
      
      ARMv8.1 introduces the optional feature ARMv8.1-TTHM, which can trigger
      a new type of memory abort. This exception is triggered when a hardware
      update of page table flags is not atomic with regard to other memory
      accesses. Replace the corresponding unknown entry with a more accurate
      one.
      
      Cf: Section D10.2.28 ESR_ELx, Exception Syndrome Register (p D10-2381),
      section D4.4.11 Restriction on memory types for hardware updates on page
      tables (p D4-2116 - D4-2117).
      
      ARMv8.2 does not add new exception types; however, it is worth
      mentioning that when the RAS feature (obligatory for ARMv8.2, optional
      for ARMv8.{0,1}) is implemented, exceptions related to "Synchronous
      parity or ECC error on memory access, not on translation table walk"
      become reserved and should not occur.
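      
      The replacement fault_info entry has roughly this shape (the string
      and signal shown are illustrative):
      
        /* ESR_ELx DFSC 0b110001: a hardware update of the page tables was
         * not possible atomically; previously listed as "unknown 49". */
        { do_bad, SIGBUS, 0, "Unsupported atomic hardware update fault" },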
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      3f7c86b2
  9. 18 Oct 2017, 2 commits
  10. 14 Oct 2017, 2 commits
    • arm64: use WFE for long delays · 7b77452e
      Committed by Julien Thierry
      The current delay implementation uses the yield instruction, which is a
      hint that it is beneficial to schedule another thread. As this is a hint,
      it may be implemented as a NOP, causing all delays to be busy loops. This
      is the case for many existing CPUs.
      
      Taking advantage of the generic timer sending periodic events to all
      cores, we can use WFE during delays to reduce power consumption. This is
      beneficial only for delays longer than the period of the timer event
      stream.
      
      If the timer event stream is not enabled, delays behave as yield/busy
      loops.
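      
      A sketch of the resulting delay loop (modelled on
      arch/arm64/lib/delay.c; treat the helper names as assumptions):
      
        void __delay(unsigned long cycles)
        {
                cycles_t start = get_cycles();
      
                if (arch_timer_evtstrm_available()) {
                        const cycles_t evt_period =
                                USECS_TO_CYCLES(ARCH_TIMER_EVT_STREAM_PERIOD_US);
      
                        /* Sleep in WFE while more than one event-stream
                         * period of the delay remains. */
                        while ((get_cycles() - start + evt_period) < cycles)
                                wfe();
                }
      
                /* Short tail, or no event stream: plain busy wait. */
                while ((get_cycles() - start) < cycles)
                        cpu_relax();
        }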
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      7b77452e
    • arm_arch_timer: Expose event stream status · ec5c8e42
      Committed by Julien Thierry
      The arch timer configuration for a CPU might get reset after suspending
      said CPU.
      
      In order to reliably use the event stream in the kernel (e.g. for
      delays), we keep track of whether it can safely be considered properly
      configured. After writing to cntkctl, we issue an ISB to ensure that
      subsequent delay loops can rely on the event stream being enabled.
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      ec5c8e42
  11. 11 Oct 2017, 1 commit
    • arm64: Expose support for optional ARMv8-A features · f5e035f8
      Committed by Suzuki K Poulose
      ARMv8-A adds a few optional features for ARMv8.2 and ARMv8.3.
      Expose them to the userspace via HWCAPs and mrs emulation.
      
      SHA2-512 - Instruction support for the SHA512 hash algorithm (e.g.
                 SHA512H, SHA512H2, SHA512SU0, SHA512SU1)
      SHA3     - SHA3 crypto instructions (EOR3, RAX1, XAR, BCAX)
      SM3      - Instruction support for the Chinese cryptography algorithm SM3
      SM4      - Instruction support for the Chinese cryptography algorithm SM4
      DP       - Dot Product instructions (UDOT, SDOT)
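      
      The corresponding HWCAP bits in the uapi header look roughly like
      this (bit positions shown are assumptions, for illustration only):
      
        #define HWCAP_SHA3      (1 << 17)
        #define HWCAP_SM3       (1 << 18)
        #define HWCAP_SM4       (1 << 19)
        #define HWCAP_ASIMDDP   (1 << 20)
        #define HWCAP_SHA512    (1 << 21)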
      
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Dave Martin <dave.martin@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      f5e035f8
  12. 09 Oct 2017, 1 commit
  13. 04 Oct 2017, 3 commits
    • dma mapping : export caller to vmallocinfo · 359be678
      Committed by Matthieu CASTET
      For example, on an arm64 board this adds the caller to "user" entries
      in vmallocinfo.
      
      Before:
      [...]
      0xffffff8008997000 0xffffff80089d8000 266240 user
      [...]
      
      After:
      [...]
      0xffffff8008997000 0xffffff80089d8000 266240 atomic_pool_init+0x0/0x1d8 user
      [...]
      
      This helps debug mapping issues, and is consistent with other entries
      (ioremap, vmalloc, ...) that already provide the caller.
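      
      A minimal sketch of how a mapping records its caller for vmallocinfo
      (the call shape is an assumption; the actual patch threads the caller
      through the DMA remap helpers):
      
        /* The _caller variant lets /proc/vmallocinfo attribute the
         * mapping to its creator instead of printing just "user". */
        struct vm_struct *area;
      
        area = __get_vm_area_caller(size, VM_USERMAP, VMALLOC_START,
                                    VMALLOC_END, __builtin_return_address(0));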
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Matthieu CASTET <matthieu.castet@parrot.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      359be678
    • arm64: Unconditionally support {ARCH_}HAVE_NMI{_SAFE_CMPXCHG} · 396a5d4a
      Committed by Stephen Boyd
      From what I can see there isn't anything about ACPI_APEI_SEA that
      means the arm64 architecture can or cannot support NMI safe
      cmpxchg or NMIs, so the 'if' condition here is not important.
      Let's remove it. Doing that allows us to support ftrace
      histograms via CONFIG_HIST_TRIGGERS, which depends on the arch
      having ARCH_HAVE_NMI_SAFE_CMPXCHG selected.
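      
      In arch/arm64/Kconfig terms, the change boils down to dropping the
      condition (sketch):
      
        select ARCH_HAVE_NMI_SAFE_CMPXCHG if ACPI_APEI_SEA    # before
        select ARCH_HAVE_NMI_SAFE_CMPXCHG                     # after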
      
      Cc: Tyler Baicar <tbaicar@codeaurora.org>
      Cc: Jonathan (Zhixiong) Zhang <zjzhang@codeaurora.org>
      Cc: Dongjiu Geng <gengdongjiu@huawei.com>
      Acked-by: James Morse <james.morse@arm.com>
      Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      396a5d4a
    • arm64: consistently log boot/secondary CPU IDs · ccaac162
      Committed by Mark Rutland
      Currently we inconsistently log identifying information for the boot CPU
      and secondary CPUs. For the boot CPU, we log the MIDR and MPIDR across
      separate messages, whereas for the secondary CPUs we only log the MIDR.
      
      In some cases, it would be useful to know the MPIDR of secondary CPUs,
      and it would be nice for these messages to be consistent.
      
      This patch ensures that in the primary and secondary boot paths, we log
      both the MPIDR and MIDR in a single message, with a consistent format.
      The MPIDR is consistently padded to 10 hex characters to cover Aff3 in
      bits 39:32, so that IDs can be compared easily.
      
      The newly redundant message in setup_arch() is removed.
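      
      The unified secondary-boot message then has roughly this shape (the
      exact format string is an assumption):
      
        pr_info("CPU%u: Booted secondary processor 0x%010lx [0x%08x]\n",
                cpu, mpidr & MPIDR_HWID_BITMASK, read_cpuid_id());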
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Al Stone <ahs3@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      [will: added '0x' prefixes consistently]
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      ccaac162
  14. 02 Oct 2017, 5 commits
  15. 30 Sep 2017, 2 commits
  16. 29 Sep 2017, 6 commits
    • arm64: fault: Route pte translation faults via do_translation_fault · 760bfb47
      Committed by Will Deacon
      We currently route pte translation faults via do_page_fault, which elides
      the address check against TASK_SIZE before invoking the mm fault handling
      code. However, this can cause issues with the path walking code in
      conjunction with our word-at-a-time implementation because
      load_unaligned_zeropad can end up faulting in kernel space if it reads
      across a page boundary and runs into a page fault (e.g. by attempting to
      read from a guard region).
      
      In the case of such a fault, load_unaligned_zeropad has registered a
      fixup to shift the valid data and pad with zeroes, however the abort is
      reported as a level 3 translation fault and we dispatch it straight to
      do_page_fault, despite it being a kernel address. This results in calling
      a sleeping function from atomic context:
      
        BUG: sleeping function called from invalid context at arch/arm64/mm/fault.c:313
        in_atomic(): 0, irqs_disabled(): 0, pid: 10290
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        [...]
        [<ffffff8e016cd0cc>] ___might_sleep+0x134/0x144
        [<ffffff8e016cd158>] __might_sleep+0x7c/0x8c
        [<ffffff8e016977f0>] do_page_fault+0x140/0x330
        [<ffffff8e01681328>] do_mem_abort+0x54/0xb0
        Exception stack(0xfffffffb20247a70 to 0xfffffffb20247ba0)
        [...]
        [<ffffff8e016844fc>] el1_da+0x18/0x78
        [<ffffff8e017f399c>] path_parentat+0x44/0x88
        [<ffffff8e017f4c9c>] filename_parentat+0x5c/0xd8
        [<ffffff8e017f5044>] filename_create+0x4c/0x128
        [<ffffff8e017f59e4>] SyS_mkdirat+0x50/0xc8
        [<ffffff8e01684e30>] el0_svc_naked+0x24/0x28
        Code: 36380080 d5384100 f9400800 9402566d (d4210000)
        ---[ end trace 2d01889f2bca9b9f ]---
      
      Fix this by dispatching all translation faults to do_translation_fault,
      which avoids invoking the page fault logic for faults on kernel
      addresses.
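      
      The routing fix is essentially this check (a sketch of the shape of
      arch/arm64/mm/fault.c):
      
        static int do_translation_fault(unsigned long addr, unsigned int esr,
                                        struct pt_regs *regs)
        {
                if (addr < TASK_SIZE)
                        return do_page_fault(addr, esr, regs);
      
                /* Kernel addresses never reach the mm fault path; the
                 * exception fixup (e.g. for load_unaligned_zeropad) is
                 * handled here instead. */
                do_bad_area(addr, esr, regs);
                return 0;
        }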
      
      Cc: <stable@vger.kernel.org>
      Reported-by: Ankit Jain <ankijain@codeaurora.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      760bfb47
    • arm64: mm: Use READ_ONCE when dereferencing pointer to pte table · f069faba
      Committed by Will Deacon
      On kernels built with support for transparent huge pages, different CPUs
      can access the PMD concurrently due to e.g. fast GUP or page_vma_mapped_walk
      and they must take care to use READ_ONCE to avoid value tearing or caching
      of stale values by the compiler. Unfortunately, these functions call into
      our pgtable macros, which don't use READ_ONCE, and compiler caching has
      been observed to cause the following crash during ext4 writeback:
      
      PC is at check_pte+0x20/0x170
      LR is at page_vma_mapped_walk+0x2e0/0x540
      [...]
      Process doio (pid: 2463, stack limit = 0xffff00000f2e8000)
      Call trace:
      [<ffff000008233328>] check_pte+0x20/0x170
      [<ffff000008233758>] page_vma_mapped_walk+0x2e0/0x540
      [<ffff000008234adc>] page_mkclean_one+0xac/0x278
      [<ffff000008234d98>] rmap_walk_file+0xf0/0x238
      [<ffff000008236e74>] rmap_walk+0x64/0xa0
      [<ffff0000082370c8>] page_mkclean+0x90/0xa8
      [<ffff0000081f3c64>] clear_page_dirty_for_io+0x84/0x2a8
      [<ffff00000832f984>] mpage_submit_page+0x34/0x98
      [<ffff00000832fb4c>] mpage_process_page_bufs+0x164/0x170
      [<ffff00000832fc8c>] mpage_prepare_extent_to_map+0x134/0x2b8
      [<ffff00000833530c>] ext4_writepages+0x484/0xe30
      [<ffff0000081f6ab4>] do_writepages+0x44/0xe8
      [<ffff0000081e5bd4>] __filemap_fdatawrite_range+0xbc/0x110
      [<ffff0000081e5e68>] file_write_and_wait_range+0x48/0xd8
      [<ffff000008324310>] ext4_sync_file+0x80/0x4b8
      [<ffff0000082bd434>] vfs_fsync_range+0x64/0xc0
      [<ffff0000082332b4>] SyS_msync+0x194/0x1e8
      
      This is because page_vma_mapped_walk loads the PMD twice before calling
      pte_offset_map: the first time without READ_ONCE (where it gets all zeroes
      due to a concurrent pmdp_invalidate) and the second time with READ_ONCE
      (where it sees a valid table pointer due to a concurrent pmd_populate).
      However, the compiler inlines everything and caches the first value in
      a register, which is subsequently used in pte_offset_phys which returns
      a junk pointer that is later dereferenced when attempting to access the
      relevant pte.
      
      This patch fixes the issue by using READ_ONCE in pte_offset_phys to ensure
      that a stale value is not used. Whilst this is a point fix for a known
      failure (and simple to backport), a full fix moving all of our page table
      accessors over to {READ,WRITE}_ONCE and consistently using READ_ONCE in
      page_vma_mapped_walk is in the works for a future kernel release.
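      
      The point fix reads the pmd exactly once (a sketch matching the
      description above):
      
        /* READ_ONCE stops the compiler from reusing an earlier, stale
         * load of *dir when computing the pte table address. */
        #define pte_offset_phys(dir, addr) \
                (pmd_page_paddr(READ_ONCE(*(dir))) + pte_index(addr) * sizeof(pte_t))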
      
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Timur Tabi <timur@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Fixes: f27176cf ("mm: convert page_mkclean_one() to use page_vma_mapped_walk()")
      Tested-by: Richard Ruigrok <rruigrok@codeaurora.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      f069faba
    • kvm/x86: Handle async PF in RCU read-side critical sections · b862789a
      Committed by Boqun Feng
      Sasha Levin reported a WARNING:
      
      | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
      | rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
      | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
      | rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
      ...
      | CPU: 0 PID: 6974 Comm: syz-fuzzer Not tainted 4.13.0-next-20170908+ #246
      | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      | 1.10.1-1ubuntu1 04/01/2014
      | Call Trace:
      ...
      | RIP: 0010:rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
      | RIP: 0010:rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
      | RSP: 0018:ffff88003b2debc8 EFLAGS: 00010002
      | RAX: 0000000000000001 RBX: 1ffff1000765bd85 RCX: 0000000000000000
      | RDX: 1ffff100075d7882 RSI: ffffffffb5c7da20 RDI: ffff88003aebc410
      | RBP: ffff88003b2def30 R08: dffffc0000000000 R09: 0000000000000001
      | R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b2def08
      | R13: 0000000000000000 R14: ffff88003aebc040 R15: ffff88003aebc040
      | __schedule+0x201/0x2240 kernel/sched/core.c:3292
      | schedule+0x113/0x460 kernel/sched/core.c:3421
      | kvm_async_pf_task_wait+0x43f/0x940 arch/x86/kernel/kvm.c:158
      | do_async_page_fault+0x72/0x90 arch/x86/kernel/kvm.c:271
      | async_page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1069
      | RIP: 0010:format_decode+0x240/0x830 lib/vsprintf.c:1996
      | RSP: 0018:ffff88003b2df520 EFLAGS: 00010283
      | RAX: 000000000000003f RBX: ffffffffb5d1e141 RCX: ffff88003b2df670
      | RDX: 0000000000000001 RSI: dffffc0000000000 RDI: ffffffffb5d1e140
      | RBP: ffff88003b2df560 R08: dffffc0000000000 R09: 0000000000000000
      | R10: ffff88003b2df718 R11: 0000000000000000 R12: ffff88003b2df5d8
      | R13: 0000000000000064 R14: ffffffffb5d1e140 R15: 0000000000000000
      | vsnprintf+0x173/0x1700 lib/vsprintf.c:2136
      | sprintf+0xbe/0xf0 lib/vsprintf.c:2386
      | proc_self_get_link+0xfb/0x1c0 fs/proc/self.c:23
      | get_link fs/namei.c:1047 [inline]
      | link_path_walk+0x1041/0x1490 fs/namei.c:2127
      ...
      
      This happened when the host hit a page fault and delivered it as an
      async page fault, while the guest was in an RCU read-side critical
      section.  The guest then tries to reschedule in kvm_async_pf_task_wait(),
      but rcu_preempt_note_context_switch() would treat the reschedule as a
      sleep in an RCU read-side critical section, which is not allowed (even
      in preemptible RCU).  Thus the WARN.
      
      To cure this, make kvm_async_pf_task_wait() go to the halt path if the
      #PF happens in an RCU read-side critical section.
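      
      The check in kvm_async_pf_task_wait() then takes roughly this shape
      (a sketch; interrupt_kernel is the new argument saying whether the
      #PF interrupted kernel mode):
      
        /* Halt instead of scheduling whenever sleeping is not allowed:
         * idle task, nested preemption, or an RCU read-side section. */
        n.halted = is_idle_task(current) ||
                   (IS_ENABLED(CONFIG_PREEMPT_COUNT)
                    ? preempt_count() > 1 || rcu_preempt_depth()
                    : interrupt_kernel);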
      Reported-by: Sasha Levin <levinsasha928@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b862789a
    • KVM: nVMX: Fix nested #PF intends to break L1's vmlauch/vmresume · 305d0ab4
      Committed by Wanpeng Li
      ------------[ cut here ]------------
       WARNING: CPU: 4 PID: 5280 at /home/kernel/linux/arch/x86/kvm//vmx.c:11394 nested_vmx_vmexit+0xc2b/0xd70 [kvm_intel]
       CPU: 4 PID: 5280 Comm: qemu-system-x86 Tainted: G        W  OE   4.13.0+ #17
       RIP: 0010:nested_vmx_vmexit+0xc2b/0xd70 [kvm_intel]
       Call Trace:
        ? emulator_read_emulated+0x15/0x20 [kvm]
        ? segmented_read+0xae/0xf0 [kvm]
        vmx_inject_page_fault_nested+0x60/0x70 [kvm_intel]
        ? vmx_inject_page_fault_nested+0x60/0x70 [kvm_intel]
        x86_emulate_instruction+0x733/0x810 [kvm]
        vmx_handle_exit+0x2f4/0xda0 [kvm_intel]
        ? kvm_arch_vcpu_ioctl_run+0xd2f/0x1c60 [kvm]
        kvm_arch_vcpu_ioctl_run+0xdab/0x1c60 [kvm]
        ? kvm_arch_vcpu_load+0x62/0x230 [kvm]
        kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? __fget+0xfc/0x210
        do_vfs_ioctl+0xa4/0x6a0
        ? __fget+0x11d/0x210
        SyS_ioctl+0x79/0x90
        entry_SYSCALL_64_fastpath+0x23/0xc2
      
      A nested #PF is triggered during L0 emulating an instruction for L2.
      However, the handling doesn't consider that L1's vmlaunch/vmresume must
      not be broken. This patch fixes it by queuing the #PF exception instead,
      requesting an immediate VM exit from L2 and keeping the exception
      pending for L1 for a subsequent nested VM exit.
      
      This should actually work all the time, making vmx_inject_page_fault_nested
      totally unnecessary.  However, that's not working yet, so this patch works
      around the issue in the meantime.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      305d0ab4
    • x86/asm: Fix inline asm call constraints for GCC 4.4 · 520a13c5
      Committed by Josh Poimboeuf
      The kernel test bot (run by Xiaolong Ye) reported that the following commit:
      
        f5caf621 ("x86/asm: Fix inline asm call constraints for Clang")
      
      is causing double faults in a kernel compiled with GCC 4.4.
      
      Linus subsequently diagnosed the crash pattern and the buggy commit and found that
      the issue is with this code:
      
        register unsigned int __asm_call_sp asm("esp");
        #define ASM_CALL_CONSTRAINT "+r" (__asm_call_sp)
      
      Even on a 64-bit kernel, it's using ESP instead of RSP.  That causes GCC
      to produce the following bogus code:
      
        ffffffff8147461d:       89 e0                   mov    %esp,%eax
        ffffffff8147461f:       4c 89 f7                mov    %r14,%rdi
        ffffffff81474622:       4c 89 fe                mov    %r15,%rsi
        ffffffff81474625:       ba 20 00 00 00          mov    $0x20,%edx
        ffffffff8147462a:       89 c4                   mov    %eax,%esp
        ffffffff8147462c:       e8 bf 52 05 00          callq  ffffffff814c98f0 <copy_user_generic_unrolled>
      
      Despite the absurdity of it backing up and restoring the stack pointer
      for no reason, the bug is actually the fact that it's only backing up
      and restoring the lower 32 bits of the stack pointer.  The upper 32 bits
      are getting cleared out, corrupting the stack pointer.
      
      So change the '__asm_call_sp' register variable to be associated with
      the actual full-size stack pointer.
      
      This also requires changing the __ASM_SEL() macro to be based on the
      actual compiled arch size, rather than the CONFIG value, because
      CONFIG_X86_64 compiles some files with '-m32' (e.g., realmode and vdso).
      Otherwise Clang fails to build the kernel because it complains about the
      use of a 64-bit register (RSP) in a 32-bit file.
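      
      The fixed declaration ties the register variable to the full-size
      stack pointer via _ASM_SP (a sketch of the resulting code):
      
        /* _ASM_SP expands to "rsp" on 64-bit and "esp" on 32-bit builds,
         * now selected by __x86_64__ rather than CONFIG_X86_64. */
        register unsigned long current_stack_pointer asm(_ASM_SP);
        #define ASM_CALL_CONSTRAINT "+r" (current_stack_pointer)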
      Reported-and-Bisected-and-Tested-by: kernel test robot <xiaolong.ye@intel.com>
      Diagnosed-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dmitriy Vyukov <dvyukov@google.com>
      Cc: LKP <lkp@01.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: f5caf621 ("x86/asm: Fix inline asm call constraints for Clang")
      Link: http://lkml.kernel.org/r/20170928215826.6sdpmwtkiydiytim@treble
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      520a13c5
    • um/time: Fixup namespace collision · 69b73e95
      Committed by Thomas Gleixner
      The new timer_setup() function for struct timer_list collides with a
      private um function. Rename it.
      
      Fixes: 686fef92 ("timer: Prepare to change timer callback argument type")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: user-mode-linux-devel@lists.sourceforge.net
      Cc: Kees Cook  <keescook@chromium.org>
      69b73e95
  17. 28 Sep 2017, 2 commits
    • KVM: VMX: use cmpxchg64 · c0a1666b
      Committed by Paolo Bonzini
      This fixes a compilation failure on 32-bit systems.
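      
      The fix is a 64-bit-safe compare-and-exchange on the posted-interrupt
      descriptor (a sketch; the surrounding field updates are assumptions):
      
        /* pi_desc->control is a u64; a plain cmpxchg() on it does not
         * build on 32-bit hosts, so use cmpxchg64() explicitly. */
        do {
                old.control = new.control = pi_desc->control;
                new.ndst = dest;
                new.sn = 0;
        } while (cmpxchg64(&pi_desc->control, old.control,
                           new.control) != old.control);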
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c0a1666b
    • xen/mmu: Call xen_cleanhighmap() with 4MB aligned for page tables mapping · 0d805ee7
      Committed by Zhenzhong Duan
      When booting a PV guest with large memory (e.g. 240GB), the
      Xen-provided initial mapping overlaps with the kernel module virtual
      space. When the mapping in this space is cleared by xen_cleanhighmap(),
      in certain cases a 2MB mapping can be left behind. This is because Xen
      sets up a 4MB-aligned mapping but xen_cleanhighmap() finishes at a 2MB
      boundary.
      
      When a module is loaded right on top of that 2MB space, the following
      warning is triggered:
      
      WARNING: at mm/vmalloc.c:106 vmap_pte_range+0x14e/0x190()
      Call Trace:
       [<ffffffff81117083>] warn_alloc_failed+0xf3/0x160
       [<ffffffff81146022>] __vmalloc_area_node+0x182/0x1c0
       [<ffffffff810ac91e>] ? module_alloc_update_bounds+0x1e/0x80
       [<ffffffff81145df7>] __vmalloc_node_range+0xa7/0x110
       [<ffffffff810ac91e>] ? module_alloc_update_bounds+0x1e/0x80
       [<ffffffff8103ca54>] module_alloc+0x64/0x70
       [<ffffffff810ac91e>] ? module_alloc_update_bounds+0x1e/0x80
       [<ffffffff810ac91e>] module_alloc_update_bounds+0x1e/0x80
       [<ffffffff810ac9a7>] move_module+0x27/0x150
       [<ffffffff810aefa0>] layout_and_allocate+0x120/0x1b0
       [<ffffffff810af0a8>] load_module+0x78/0x640
       [<ffffffff811ff90b>] ? security_file_permission+0x8b/0x90
       [<ffffffff810af6d2>] sys_init_module+0x62/0x1e0
       [<ffffffff815154c2>] system_call_fastpath+0x16/0x1b
      
      The 2MB mapping is then cleared, and we finally oops when a page in
      that space is accessed:
      
      BUG: unable to handle kernel paging request at ffff880022600000
      IP: [<ffffffff81260877>] clear_page_c_e+0x7/0x10
      PGD 1788067 PUD 178c067 PMD 22434067 PTE 0
      Oops: 0002 [#1] SMP
      Call Trace:
       [<ffffffff81116ef7>] ? prep_new_page+0x127/0x1c0
       [<ffffffff81117d42>] get_page_from_freelist+0x1e2/0x550
       [<ffffffff81133010>] ? ii_iovec_copy_to_user+0x90/0x140
       [<ffffffff81119c9d>] __alloc_pages_nodemask+0x12d/0x230
       [<ffffffff81155516>] alloc_pages_vma+0xc6/0x1a0
       [<ffffffff81006ffd>] ? pte_mfn_to_pfn+0x7d/0x100
       [<ffffffff81134cfb>] do_anonymous_page+0x16b/0x350
       [<ffffffff81139c34>] handle_pte_fault+0x1e4/0x200
       [<ffffffff8100712e>] ? xen_pmd_val+0xe/0x10
       [<ffffffff810052c9>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
       [<ffffffff81139dab>] handle_mm_fault+0x15b/0x270
       [<ffffffff81510c10>] do_page_fault+0x140/0x470
       [<ffffffff8150d7d5>] page_fault+0x25/0x30
      
      Fix this by calling xen_cleanhighmap() with a 4MB-aligned range for the
      page table mapping. The unnecessary call of xen_cleanhighmap() in DEBUG
      mode is also removed.
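      
      The fix essentially rounds the end of the cleaned range up to Xen's
      4MB alignment (a sketch; variable names are assumptions):
      
        /* Xen's initial mapping is 4MB-aligned (2 * PMD_SIZE on x86-64),
         * so clean up to that boundary instead of stopping at 2MB. */
        xen_cleanhighmap(addr, roundup(addr + size, PMD_SIZE * 2));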
      
      -v2: add comment about XEN alignment from Juergen.
      
      References: https://lists.xen.org/archives/html/xen-devel/2012-07/msg01562.html
      Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      
      [boris: added 'xen/mmu' tag to commit subject]
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      0d805ee7
  18. 27 Sep 2017, 1 commit
    • KVM: VMX: simplify and fix vmx_vcpu_pi_load · 31afb2ea
      Committed by Paolo Bonzini
      The simplify part: do not touch pi_desc.nv; we can set it when the
      VCPU is first created.  Likewise, pi_desc.sn is only handled by
      vmx_vcpu_pi_load, so do not touch it in __pi_post_block.
      
      The fix part: do not check kvm_arch_has_assigned_device, instead
      check the SN bit to figure out whether vmx_vcpu_pi_put ran before.
      This matches what the previous patch did in pi_post_block.
      
      Cc: Huangweidong <weidong.huang@huawei.com>
      Cc: Gonglei <arei.gonglei@huawei.com>
      Cc: wangxin <wangxinxin.wang@huawei.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Tested-by: Longpeng (Mike) <longpeng2@huawei.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      31afb2ea