1. 08 July 2016 (11 commits)
    • x86/mm: Add memory hotplug support for KASLR memory randomization · 90397a41
      Committed by Thomas Garnier
      Add a new option (CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING) to define
      the padding used for the physical memory mapping section when KASLR
      memory randomization is enabled. It ensures there is enough virtual
      address space when CONFIG_MEMORY_HOTPLUG is used. The default value is
      10 terabytes. If CONFIG_MEMORY_HOTPLUG is not used, no space is
      reserved, which increases the available entropy.
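      A minimal sketch of how the padding feeds into sizing the randomized
      physical mapping region; the variable and macro names here are
      assumptions:

        /* Size the region on actual memory, in TB, plus the hotplug padding. */
        memory_tb = DIV_ROUND_UP(max_pfn << PAGE_SHIFT, 1UL << TB_SHIFT) +
                    CONFIG_RANDOMIZE_MEMORY_PHYSICAL_PADDING;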
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Alexander Popov <alpopov@ptsecurity.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lv Zheng <lv.zheng@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-doc@vger.kernel.org
      Link: http://lkml.kernel.org/r/1466556426-32664-10-git-send-email-keescook@chromium.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Enable KASLR for vmalloc memory regions · a95ae27c
      Committed by Thomas Garnier
      Add vmalloc to the list of randomized memory regions.
      
      The vmalloc memory region contains the allocations made through the vmalloc()
      API. The allocations are done sequentially to prevent fragmentation, so each
      allocation's address can easily be deduced, especially at boot.
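      A minimal sketch of what registering the region amounts to, assuming a
      table of randomized regions like the one in the x86 KASLR code (all
      names assumed):

        /* Each entry records where the region's base lives and its size in TB. */
        static struct kaslr_memory_region {
                unsigned long *base;
                unsigned long size_tb;
        } kaslr_regions[] = {
                { &page_offset_base, 0 /* sized from physical memory at boot */ },
                { &vmalloc_base, VMALLOC_SIZE_TB },
        };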
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Alexander Popov <alpopov@ptsecurity.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lv Zheng <lv.zheng@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-doc@vger.kernel.org
      Link: http://lkml.kernel.org/r/1466556426-32664-8-git-send-email-keescook@chromium.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Enable KASLR for physical mapping memory regions · 021182e5
      Committed by Thomas Garnier
      Add the physical mapping to the list of randomized memory regions.
      
      The physical memory mapping holds most allocations from boot and heap
      allocators. Knowing the base address and physical memory size, an attacker
      can deduce the PDE virtual address for the vDSO memory page. This attack
      was demonstrated at CanSecWest 2016, in the following presentation:
      
        "Getting Physical: Extreme Abuse of Intel Based Paged Systems":
        https://github.com/n3k/CansecWest2016_Getting_Physical_Extreme_Abuse_of_Intel_Based_Paging_Systems/blob/master/Presentation/CanSec2016_Presentation.pdf
      
      (See second part of the presentation).
      
      The exploits used against Linux worked successfully against 4.6+ but
      fail with KASLR memory enabled:
      
        https://github.com/n3k/CansecWest2016_Getting_Physical_Extreme_Abuse_of_Intel_Based_Paging_Systems/tree/master/Demos/Linux/exploits
      
      Similar research was done at Google leading to this patch proposal.
      
      Variants exist that overwrite the ACLs of /proc or /sys objects, leading
      to elevation of privileges. These variants were tested against 4.6+.
      
      The page offset used by the compressed kernel retains the static value
      since it is not yet randomized during this boot stage.
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Alexander Popov <alpopov@ptsecurity.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lv Zheng <lv.zheng@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-doc@vger.kernel.org
      Link: http://lkml.kernel.org/r/1466556426-32664-7-git-send-email-keescook@chromium.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Implement ASLR for kernel memory regions · 0483e1fa
      Committed by Thomas Garnier
      Randomizes the virtual address space of kernel memory regions for
      x86_64. This first patch adds the infrastructure and does not randomize
      any region. The following patches will randomize the physical memory
      mapping, vmalloc and vmemmap regions.
      
      This security feature mitigates exploits relying on predictable kernel
      addresses. These addresses can be used to disclose the kernel modules
      base addresses or corrupt specific structures to elevate privileges
      bypassing the current implementation of KASLR. This feature can be
      enabled with the CONFIG_RANDOMIZE_MEMORY option.
      
      The order of the memory regions is not changed. The feature looks at the
      available space for the regions based on different configuration options
      and randomizes the base address of each region and the space between
      them. The size of the physical memory mapping is the available physical
      memory. No performance impact was detected while testing the feature.
      
      Entropy is generated using the KASLR early boot functions now shared in
      the lib directory (originally written by Kees Cook). Randomization is
      done at the PGD and PUD page table levels to increase the number of
      possible addresses. The physical memory mapping code was adapted to
      support PUD-level virtual addresses. In the best configuration, this
      implementation provides on average 30,000 possible virtual addresses
      for each memory region. An additional low memory page is used to ensure
      each CPU can start with a PGD-aligned virtual address (for realmode).
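      A minimal sketch of the per-region randomization walk, assuming the
      shared early-boot entropy helper and a region table as in the previous
      patches (all names assumed):

        /* Spread the leftover virtual space as random, PUD-aligned gaps
         * between the regions, consuming entropy as we go. */
        vaddr = vaddr_start;
        remain_entropy = vaddr_end - vaddr_start - total_region_size;
        for (i = 0; i < ARRAY_SIZE(kaslr_regions); i++) {
                unsigned long entropy = remain_entropy / (ARRAY_SIZE(kaslr_regions) - i);

                entropy = (kaslr_get_random_long("Memory") % (entropy + 1)) & PUD_MASK;
                vaddr += entropy;                       /* random gap before the region */
                *kaslr_regions[i].base = vaddr;         /* new randomized base */
                vaddr += kaslr_regions[i].size_tb << TB_SHIFT;
                vaddr = round_up(vaddr + 1, PUD_SIZE);  /* keep PUD alignment */
                remain_entropy -= entropy;
        }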
      
      x86/dump_pagetable was updated to correctly display each region.
      
      Updated documentation on x86_64 memory layout accordingly.
      
      Performance data, after all patches in the series:
      
      Kernbench shows almost no difference (-+ less than 1%):
      
      Before:
      
      Average Optimal load -j 12 Run (std deviation):
      Elapsed Time 102.63 (1.2695)
      User Time 1034.89 (1.18115)
      System Time 87.056 (0.456416)
      Percent CPU 1092.9 (13.892)
      Context Switches 199805 (3455.33)
      Sleeps 97907.8 (900.636)
      
      After:
      
      Average Optimal load -j 12 Run (std deviation):
      Elapsed Time 102.489 (1.10636)
      User Time 1034.86 (1.36053)
      System Time 87.764 (0.49345)
      Percent CPU 1095 (12.7715)
      Context Switches 199036 (4298.1)
      Sleeps 97681.6 (1031.11)
      
      Hackbench shows 0% difference on average (hackbench 90 repeated 10 times):
      
      attempt,before,after
      1,0.076,0.069
      2,0.072,0.069
      3,0.066,0.066
      4,0.066,0.068
      5,0.066,0.067
      6,0.066,0.069
      7,0.067,0.066
      8,0.063,0.067
      9,0.067,0.065
      10,0.068,0.071
      average,0.0677,0.0677
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Alexander Popov <alpopov@ptsecurity.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lv Zheng <lv.zheng@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-doc@vger.kernel.org
      Link: http://lkml.kernel.org/r/1466556426-32664-6-git-send-email-keescook@chromium.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Separate variable for trampoline PGD · b234e8a0
      Committed by Thomas Garnier
      Use a separate global variable to define the trampoline PGD used to
      start other processors. This change will allow KASLR memory
      randomization to change the trampoline PGD so that it is correctly
      aligned with physical memory.
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Alexander Popov <alpopov@ptsecurity.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lv Zheng <lv.zheng@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-doc@vger.kernel.org
      Link: http://lkml.kernel.org/r/1466556426-32664-5-git-send-email-keescook@chromium.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Add PUD VA support for physical mapping · faa37933
      Committed by Thomas Garnier
      Minor change that allows early boot physical mapping of PUD level virtual
      addresses. The current implementation expects the virtual address to be
      PUD aligned. For KASLR memory randomization, we need to be able to
      randomize the offset used on the PUD table.
      
      It has no impact on current usage.
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Alexander Popov <alpopov@ptsecurity.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lv Zheng <lv.zheng@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-doc@vger.kernel.org
      Link: http://lkml.kernel.org/r/1466556426-32664-4-git-send-email-keescook@chromium.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Update physical mapping variable names · 59b3d020
      Committed by Thomas Garnier
      Change the variable names in kernel_physical_mapping_init() and related
      functions to correctly reflect physical and virtual memory addresses.
      Also add comments on each function to describe usage and alignment
      constraints.
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Alexander Popov <alpopov@ptsecurity.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lv Zheng <lv.zheng@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-doc@vger.kernel.org
      Link: http://lkml.kernel.org/r/1466556426-32664-3-git-send-email-keescook@chromium.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Refactor KASLR entropy functions · d899a7d1
      Committed by Thomas Garnier
      Move the KASLR entropy functions into arch/x86/lib to be used in early
      kernel boot for KASLR memory randomization.
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Alexander Popov <alpopov@ptsecurity.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Lv Zheng <lv.zheng@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: kernel-hardening@lists.openwall.com
      Cc: linux-doc@vger.kernel.org
      Link: http://lkml.kernel.org/r/1466556426-32664-2-git-send-email-keescook@chromium.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/KASLR: Fix boot crash with certain memory configurations · 6daa2ec0
      Committed by Baoquan He
      Ye Xiaolong reported this boot crash:
      
      |
      |  XZ-compressed data is corrupt
      |
      |   -- System halted
      |
      
      Fix the bug in mem_avoid_overlap() when searching for the earliest overlap.
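      A minimal sketch of the corrected search, assuming the mem_avoid[]
      bookkeeping from the KASLR boot code (names assumed):

        /* Report the overlap with the lowest start address, not just the
         * first one encountered while scanning mem_avoid[]. */
        static bool mem_avoid_overlap(struct mem_vector *img, struct mem_vector *overlap)
        {
                u64 earliest = img->start + img->size;
                bool is_overlapping = false;
                int i;

                for (i = 0; i < MEM_AVOID_MAX; i++) {
                        if (mem_overlaps(img, &mem_avoid[i]) &&
                            mem_avoid[i].start < earliest) {
                                *overlap = mem_avoid[i];
                                earliest = overlap->start;
                                is_overlapping = true;
                        }
                }
                return is_overlapping;
        }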
      Reported-and-tested-by: Ye Xiaolong <xiaolong.ye@intel.com>
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/vdso: Add mremap hook to vm_special_mapping · b059a453
      Committed by Dmitry Safonov
      Add possibility for 32-bit user-space applications to move
      the vDSO mapping.
      
      Previously, when a user-space app called mremap() for the vDSO
      address, in the syscall return path it would land on the previous
      address of the vDSO page, resulting in a segmentation violation.
      
      Now it lands fine and returns to userspace with a remapped vDSO.
      
      This will also fix the context.vdso pointer for 64-bit, which does
      not affect the user of vDSO after mremap() currently, but this
      may change in the future.
      
      As suggested by Andy, return -EINVAL for mremap() that would
      split the vDSO image: that operation cannot possibly result in
      a working system so reject it.
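      A minimal sketch of what such an mremap hook can look like (struct and
      field names assumed):

        /* Refuse resizes that would split the vDSO; otherwise point the
         * mm's vdso pointer at the new location. */
        static int vdso_mremap(const struct vm_special_mapping *sm,
                               struct vm_area_struct *new_vma)
        {
                unsigned long new_size = new_vma->vm_end - new_vma->vm_start;

                if (vdso_image_size != new_size)        /* assumed size variable */
                        return -EINVAL;

                current->mm->context.vdso = (void __user *)new_vma->vm_start;
                return 0;
        }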
      
      Renamed and moved the text_mapping structure declaration inside
      map_vdso(), as it is used only there; it now complements the
      vvar_mapping variable.
      
      There is still a problem for remapping the vDSO in glibc
      applications: the linker relocates addresses for syscalls
      on the vDSO page, so you need to relink with the new
      addresses.
      
      Without that the next syscall through glibc may fail:
      
        Program received signal SIGSEGV, Segmentation fault.
        #0  0xf7fd9b80 in __kernel_vsyscall ()
        #1  0xf7ec8238 in _exit () from /usr/lib32/libc.so.6
      Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: 0x7f454c46@gmail.com
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160628113539.13606-2-dsafonov@virtuozzo.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm/pat, /dev/mem: Remove superfluous error message · 39380b80
      Committed by Jiri Kosina
      Currently it's possible for broken (or malicious) userspace to flood a
      kernel log indefinitely with messages a-la
      
      	Program dmidecode tried to access /dev/mem between f0000->100000
      
      because range_is_allowed(), in the case of CONFIG_STRICT_DEVMEM being
      turned on, dumps this information each and every time devmem_is_allowed() fails.
      
      Reportedly, userspace that is able to trigger a continuous flow of these
      messages exists.
      
      It would be possible to rate limit this message, but that'd have a
      questionable value; the administrator wouldn't get information about all
      the failing accesses, so the information would be both superfluous
      and incomplete at the same time :)
      
      Returning EPERM (which is what is actually happening) is enough of an
      indication for userspace of what has happened; there is no need to log this
      particular error as some sort of special condition.
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1607081137020.24757@cbobk.fhfr.pm
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 02 July 2016 (1 commit)
    • MIPS: Fix possible corruption of cache mode by mprotect. · 6d037de9
      Committed by Ralf Baechle
      The following testcase may result in page table entries with an invalid
      CCA field being generated:
      
      #include <fcntl.h>
      #include <signal.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/mman.h>
      #include <unistd.h>
      
      /* assumed size; the original testcase omits this define */
      #define BINDSTACK_SIZE 4096
      
      static void *bindstack;
      
      static int sysrqfd;
      
      static void protect_low(int protect)
      {
      	mprotect(bindstack, BINDSTACK_SIZE, protect);
      }
      
      static void sigbus_handler(int signal, siginfo_t * info, void *context)
      {
      	void *addr = info->si_addr;
      
      	write(sysrqfd, "x", 1);
      
      	printf("sigbus, fault address %p (should not happen, but might)\n",
      	       addr);
      	abort();
      }
      
      static void run_bind_test(void)
      {
      	unsigned int *p = bindstack;
      
      	p[0] = 0xf001f001;
      
      	write(sysrqfd, "x", 1);
      
      	/* Set trap on access to p[0] */
      	protect_low(PROT_NONE);
      
      	write(sysrqfd, "x", 1);
      
      	/* Clear trap on access to p[0] */
      	protect_low(PROT_READ | PROT_WRITE | PROT_EXEC);
      
      	write(sysrqfd, "x", 1);
      
      	/* Check the contents of p[0] */
      	if (p[0] != 0xf001f001) {
      		write(sysrqfd, "x", 1);
      
      		/* Reached, but shouldn't be */
      		printf("badness, shouldn't happen but does\n");
      		abort();
      	}
      }
      
      int main(void)
      {
      	struct sigaction sa;
      
      	sysrqfd = open("/proc/sysrq-trigger", O_WRONLY);
      
      	if (sigprocmask(SIG_BLOCK, NULL, &sa.sa_mask)) {
      		perror("sigprocmask");
      		return 0;
      	}
      
      	sa.sa_sigaction = sigbus_handler;
      	sa.sa_flags = SA_SIGINFO | SA_NODEFER | SA_RESTART;
      	if (sigaction(SIGBUS, &sa, NULL)) {
      		perror("sigaction");
      		return 0;
      	}
      
      	bindstack = mmap(NULL,
      			 BINDSTACK_SIZE,
      			 PROT_READ | PROT_WRITE | PROT_EXEC,
      			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      	if (bindstack == MAP_FAILED) {
      		perror("mmap bindstack");
      		return 0;
      	}
      
      	printf("bindstack: %p\n", bindstack);
      
      	run_bind_test();
      
      	printf("done\n");
      
      	return 0;
      }
      
      There are multiple ingredients for this:
      
       1) PAGE_NONE is defined to _CACHE_CACHABLE_NONCOHERENT, which is CCA 3
          on all platforms except SB1 where it's CCA 5.
       2) _page_cachable_default must have bits set which are not set in
          _CACHE_CACHABLE_NONCOHERENT.
       3) Either the defective version of pte_modify for XPA or the standard
          version must be in use.  However, pte_modify for the 36-bit address
          space support is not affected.
      
      In that case, additional bits in the final CCA mode may generate an invalid
      value for the CCA field.  On the R10000 system where this was tracked
      down, for example, a CCA of 7 has been observed, which is Uncached Accelerated.
      
      Fixed by:
      
       1) Using the proper CCA mode for PAGE_NONE just like for all the other
          PAGE_* pte/pmd bits.
       2) Fix the two affected variants of pte_modify (sketched below).
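      A minimal sketch of the shape of the pte_modify() fix, assuming a
      _CACHE_MASK covering the CCA bits (macro names and mask relationships
      are assumptions):

        /* When merging new protection bits, the old CCA bits must be masked
         * out as well; otherwise stale cache-mode bits combine with the new
         * ones into an invalid CCA value (e.g. 7, Uncached Accelerated). */
        static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
        {
                pte_val(pte) &= _PAGE_CHG_MASK & ~_CACHE_MASK;
                pte_val(pte) |= pgprot_val(newprot) & ~_PAGE_CHG_MASK;
                pte_val(pte) |= pgprot_val(newprot) & _CACHE_MASK;
                return pte;
        }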
      
      Further code inspection also shows the same issue exists in pmd_modify(),
      which would affect huge page systems.
      
      Issue in pte_modify tracked down by Alastair Bridgewater, PAGE_NONE
      and pmd_modify issue found by me.
      
      The history of this goes back beyond Linus' git history.  Chris Dearman's
      commit 35133692 ("[MIPS] Allow setting of
      the cache attribute at run time.") missed the opportunity to fix this
      but it was originally introduced in lmo commit
      d523832cf12007b3242e50bb77d0c9e63e0b6518 ("Missing from last commit.")
      and 32cc38229ac7538f2346918a09e75413e8861f87 ("New configuration option
      CONFIG_MIPS_UNCACHED.")
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Reported-by: Alastair Bridgewater <alastair.bridgewater@gmail.com>
  3. 30 June 2016 (1 commit)
  4. 29 June 2016 (1 commit)
    • powerpc/tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0 · 190ce869
      Committed by Michael Neuling
      Currently we have two segments that are bolted for the kernel linear
      mapping (i.e. 0xc000... addresses): the 0 to 1TB region and the segment
      holding the kernel stacks. Anything accessed outside of these regions may
      need to be faulted in. (In practice, machines with TM always have 1T segments.)
      
      If a machine has < 2TB of memory we never fault on the kernel linear
      mapping as these two segments cover all physical memory. If a machine
      has > 2TB of memory, there may be structures outside of these two
      segments that need to be faulted in. This faulting can occur when
      running as a guest as the hypervisor may remove any SLB that's not
      bolted.
      
      When we treclaim and trecheckpoint we have a window where we need to
      run with the userspace GPRs. This means that we no longer have a valid
      stack pointer in r1. For this window we therefore clear MSR RI to
      indicate that any exceptions taken at this point won't be able to be
      handled. This means that we can't take segment misses in this RI=0
      window.
      
      In this RI=0 region, we currently access the thread_struct for the
      process being context switched to or from. This thread_struct access
      may cause a segment fault since it's not guaranteed to be covered by
      the two bolted segment entries described above.
      
      We've seen this with a crash when running as a guest with > 2TB of
      memory on PowerVM:
      
        Unrecoverable exception 4100 at c00000000004f138
        Oops: Unrecoverable exception, sig: 6 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        CPU: 1280 PID: 7755 Comm: kworker/1280:1 Tainted: G                 X 4.4.13-46-default #1
        task: c000189001df4210 ti: c000189001d5c000 task.ti: c000189001d5c000
        NIP: c00000000004f138 LR: 0000000010003a24 CTR: 0000000010001b20
        REGS: c000189001d5f730 TRAP: 4100   Tainted: G                 X  (4.4.13-46-default)
        MSR: 8000000100001031 <SF,ME,IR,DR,LE>  CR: 24000048  XER: 00000000
        CFAR: c00000000004ed18 SOFTE: 0
        GPR00: ffffffffc58d7b60 c000189001d5f9b0 00000000100d7d00 000000003a738288
        GPR04: 0000000000002781 0000000000000006 0000000000000000 c0000d1f4d889620
        GPR08: 000000000000c350 00000000000008ab 00000000000008ab 00000000100d7af0
        GPR12: 00000000100d7ae8 00003ffe787e67a0 0000000000000000 0000000000000211
        GPR16: 0000000010001b20 0000000000000000 0000000000800000 00003ffe787df110
        GPR20: 0000000000000001 00000000100d1e10 0000000000000000 00003ffe787df050
        GPR24: 0000000000000003 0000000000010000 0000000000000000 00003fffe79e2e30
        GPR28: 00003fffe79e2e68 00000000003d0f00 00003ffe787e67a0 00003ffe787de680
        NIP [c00000000004f138] restore_gprs+0xd0/0x16c
        LR [0000000010003a24] 0x10003a24
        Call Trace:
        [c000189001d5f9b0] [c000189001d5f9f0] 0xc000189001d5f9f0 (unreliable)
        [c000189001d5fb90] [c00000000001583c] tm_recheckpoint+0x6c/0xa0
        [c000189001d5fbd0] [c000000000015c40] __switch_to+0x2c0/0x350
        [c000189001d5fc30] [c0000000007e647c] __schedule+0x32c/0x9c0
        [c000189001d5fcb0] [c0000000007e6b58] schedule+0x48/0xc0
        [c000189001d5fce0] [c0000000000deabc] worker_thread+0x22c/0x5b0
        [c000189001d5fd80] [c0000000000e7000] kthread+0x110/0x130
        [c000189001d5fe30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
        Instruction dump:
        7cb103a6 7cc0e3a6 7ca222a6 78a58402 38c00800 7cc62838 08860000 7cc000a6
        38a00006 78c60022 7cc62838 0b060000 <e8c701a0> 7ccff120 e8270078 e8a70098
        ---[ end trace 602126d0a1dedd54 ]---
      
      Fix this by copying the required data from the thread_struct to the
      stack before we clear MSR RI. Then once we clear RI, we only access
      the stack, guaranteeing there's no segment miss.
      
      We also tighten the region over which we set RI=0 on the treclaim()
      path. This may have a slight performance impact since we're adding an
      mtmsr instruction.
      
      Fixes: 090b9284 ("powerpc/tm: Clear MSR RI in non-recoverable TM code")
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Reviewed-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  5. 28 June 2016 (5 commits)
  6. 27 June 2016 (9 commits)
    • KVM: nVMX: VMX instructions: fix segment checks when L1 is in long mode. · ff30ef40
      Committed by Quentin Casasnovas
      I couldn't get Xen to boot an L2 HVM when it was nested under KVM - it was
      getting a GP(0) on a rather unspecial vmread from Xen:
      
           (XEN) ----[ Xen-4.7.0-rc  x86_64  debug=n  Not tainted ]----
           (XEN) CPU:    1
           (XEN) RIP:    e008:[<ffff82d0801e629e>] vmx_get_segment_register+0x14e/0x450
           (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor (d1v0)
           (XEN) rax: ffff82d0801e6288   rbx: ffff83003ffbfb7c   rcx: fffffffffffab928
           (XEN) rdx: 0000000000000000   rsi: 0000000000000000   rdi: ffff83000bdd0000
           (XEN) rbp: ffff83000bdd0000   rsp: ffff83003ffbfab0   r8:  ffff830038813910
           (XEN) r9:  ffff83003faf3958   r10: 0000000a3b9f7640   r11: ffff83003f82d418
           (XEN) r12: 0000000000000000   r13: ffff83003ffbffff   r14: 0000000000004802
           (XEN) r15: 0000000000000008   cr0: 0000000080050033   cr4: 00000000001526e0
           (XEN) cr3: 000000003fc79000   cr2: 0000000000000000
           (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
           (XEN) Xen code around <ffff82d0801e629e> (vmx_get_segment_register+0x14e/0x450):
           (XEN)  00 00 41 be 02 48 00 00 <44> 0f 78 74 24 08 0f 86 38 56 00 00 b8 08 68 00
           (XEN) Xen stack trace from rsp=ffff83003ffbfab0:
      
           ...
      
           (XEN) Xen call trace:
           (XEN)    [<ffff82d0801e629e>] vmx_get_segment_register+0x14e/0x450
           (XEN)    [<ffff82d0801f3695>] get_page_from_gfn_p2m+0x165/0x300
           (XEN)    [<ffff82d0801bfe32>] hvmemul_get_seg_reg+0x52/0x60
           (XEN)    [<ffff82d0801bfe93>] hvm_emulate_prepare+0x53/0x70
           (XEN)    [<ffff82d0801ccacb>] handle_mmio+0x2b/0xd0
           (XEN)    [<ffff82d0801be591>] emulate.c#_hvm_emulate_one+0x111/0x2c0
           (XEN)    [<ffff82d0801cd6a4>] handle_hvm_io_completion+0x274/0x2a0
           (XEN)    [<ffff82d0801f334a>] __get_gfn_type_access+0xfa/0x270
           (XEN)    [<ffff82d08012f3bb>] timer.c#add_entry+0x4b/0xb0
           (XEN)    [<ffff82d08012f80c>] timer.c#remove_entry+0x7c/0x90
           (XEN)    [<ffff82d0801c8433>] hvm_do_resume+0x23/0x140
           (XEN)    [<ffff82d0801e4fe7>] vmx_do_resume+0xa7/0x140
           (XEN)    [<ffff82d080164aeb>] context_switch+0x13b/0xe40
           (XEN)    [<ffff82d080128e6e>] schedule.c#schedule+0x22e/0x570
           (XEN)    [<ffff82d08012c0cc>] softirq.c#__do_softirq+0x5c/0x90
           (XEN)    [<ffff82d0801602c5>] domain.c#idle_loop+0x25/0x50
           (XEN)
           (XEN)
           (XEN) ****************************************
           (XEN) Panic on CPU 1:
           (XEN) GENERAL PROTECTION FAULT
           (XEN) [error_code=0000]
           (XEN) ****************************************
      
      Tracing my host KVM showed it was the one injecting the GP(0) when
      emulating the VMREAD and checking the destination segment permissions in
      get_vmx_mem_address():
      
           3)               |    vmx_handle_exit() {
           3)               |      handle_vmread() {
           3)               |        nested_vmx_check_permission() {
           3)               |          vmx_get_segment() {
           3)   0.074 us    |            vmx_read_guest_seg_base();
           3)   0.065 us    |            vmx_read_guest_seg_selector();
           3)   0.066 us    |            vmx_read_guest_seg_ar();
           3)   1.636 us    |          }
           3)   0.058 us    |          vmx_get_rflags();
           3)   0.062 us    |          vmx_read_guest_seg_ar();
           3)   3.469 us    |        }
           3)               |        vmx_get_cs_db_l_bits() {
           3)   0.058 us    |          vmx_read_guest_seg_ar();
           3)   0.662 us    |        }
           3)               |        get_vmx_mem_address() {
           3)   0.068 us    |          vmx_cache_reg();
           3)               |          vmx_get_segment() {
           3)   0.074 us    |            vmx_read_guest_seg_base();
           3)   0.068 us    |            vmx_read_guest_seg_selector();
           3)   0.071 us    |            vmx_read_guest_seg_ar();
           3)   1.756 us    |          }
           3)               |          kvm_queue_exception_e() {
           3)   0.066 us    |            kvm_multiple_exception();
           3)   0.684 us    |          }
           3)   4.085 us    |        }
           3)   9.833 us    |      }
           3) + 10.366 us   |    }
      
      Cross-checking the KVM/VMX VMREAD emulation code with the Intel Software
      Developer Manual Volume 3C - "VMREAD - Read Field from Virtual-Machine
      Control Structure", I found that we're enforcing that the destination
      operand is NOT located in a read-only data segment or any code segment when
      L1 is in long mode - BUT that check should only happen when it is in
      protected mode.
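      A minimal sketch of the corrected ordering of the checks (helper and
      field names assumed):

        /* Only a non-canonical address faults in long mode; the segment
         * usability/writability checks belong to protected mode. */
        if (is_long_mode(vcpu)) {
                exn = is_noncanonical_address(*ret);
        } else if (is_protmode(vcpu)) {
                exn = (s.unusable != 0);
                if (wr)                         /* destination must be writable */
                        exn = exn || !(s.type & 2);
        }
        if (exn)
                kvm_queue_exception_e(vcpu, GP_VECTOR, 0);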
      
      Shuffling the code a bit to make our emulation follow the specification
      allows me to boot a Xen dom0 in a nested KVM and start HVM L2 guests
      without problems.
      
      Fixes: f9eb4af6 ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
      Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Eugene Korenevsky <ekorenevsky@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: linux-stable <stable@vger.kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: LAPIC: cap __delay at lapic_timer_advance_ns · b606f189
      Committed by Marcelo Tosatti
      The host timer which emulates the guest LAPIC TSC deadline
      timer has its expiration diminished by lapic_timer_advance_ns
      nanoseconds. Therefore if, at wait_lapic_expire, a difference
      larger than lapic_timer_advance_ns is encountered, delay at most
      lapic_timer_advance_ns.
      
      This fixes a problem where the guest can cause the host
      to delay for large amounts of time.
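      A minimal sketch of the capped wait (the surrounding wait_lapic_expire()
      context is assumed):

        /* Never busy-wait longer than the configured advance, even if the
         * guest arranges for a huge deadline/TSC gap. */
        if (guest_tsc < tsc_deadline)
                __delay(min(tsc_deadline - guest_tsc,
                            nsec_to_cycles(vcpu, lapic_timer_advance_ns)));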
      Reported-by: Alan Jenkins <alan.christopher.jenkins@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: move nsec_to_cycles from x86.c to x86.h · 8d93c874
      Committed by Marcelo Tosatti
      Move the inline function nsec_to_cycles from x86.c to x86.h, as
      the next patch uses it from lapic.c.
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • pvclock: Get rid of __pvclock_read_cycles in function pvclock_read_flags · ed911b43
      Committed by Minfei Huang
      There is a generic function, __pvclock_read_cycles(), used to get both the
      flags and the cycles value. For pvclock_read_flags() there is no need to
      compute the cycles value at all. To make the function more efficient, read
      the flags field directly.
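      A minimal sketch of reading only the flags under the version protocol
      (the barrier helper and return type are assumptions):

        u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src)
        {
                unsigned version;
                u8 flags;

                do {
                        version = src->version;
                        virt_rmb();     /* read version before flags */
                        flags = src->flags;
                        virt_rmb();     /* read flags before re-checking version */
                } while ((src->version & 1) || version != src->version);

                return flags;
        }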
      Signed-off-by: Minfei Huang <mnghuan@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • pvclock: Cleanup to remove function pvclock_get_nsec_offset · f7550d07
      Committed by Minfei Huang
      The function __pvclock_read_cycles() is short enough that there is no need
      for a separate pvclock_get_nsec_offset() function to calculate the TSC
      delta; it is better to fold that calculation into __pvclock_read_cycles().
      
      Also remove useless variables from __pvclock_read_cycles().
      Signed-off-by: Minfei Huang <mnghuan@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • pvclock: Add CPU barriers to get correct version value · 749d088b
      Committed by Minfei Huang
      Protocol for the "version" fields is: hypervisor raises it (making it
      uneven) before it starts updating the fields and raises it again (making
      it even) when it is done.  Thus the guest can make sure the time values
      it got are consistent by checking the version before and after reading
      them.
      
      Add CPU barriers after reading the version value, just like
      vread_pvclock() does, because all of the callees in this function are inline.
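      A minimal sketch of the guarded read with the added barriers (assuming
      virt_rmb() as the guest-side read barrier):

        /* Without the barriers, the compiler or CPU may move the time-field
         * reads across the version reads, since everything is inlined. */
        do {
                version = src->version;
                virt_rmb();     /* version must be read before the time fields */
                ret = __pvclock_read_cycles(src);
                virt_rmb();     /* time fields must be read before the re-check */
        } while ((src->version & 1) || version != src->version);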
      
      Fixes: 502dfeff
      Cc: stable@vger.kernel.org
      Signed-off-by: Minfei Huang <mnghuan@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: arm/arm64: Stop leaking vcpu pid references · 591d215a
      Committed by James Morse
      kvm provides kvm_vcpu_uninit(), which, amongst other things, releases the
      last reference to the struct pid of the task that was last running the vcpu.
      
      On arm64 built with CONFIG_DEBUG_KMEMLEAK, starting a guest with kvmtool,
      then killing it with SIGKILL results (after some considerable time) in:
      > cat /sys/kernel/debug/kmemleak
      > unreferenced object 0xffff80007d5ea080 (size 128):
      >  comm "lkvm", pid 2025, jiffies 4294942645 (age 1107.776s)
      >  hex dump (first 32 bytes):
      >    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      >    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      >  backtrace:
      >    [<ffff8000001b30ec>] create_object+0xfc/0x278
      >    [<ffff80000071da34>] kmemleak_alloc+0x34/0x70
      >    [<ffff80000019fa2c>] kmem_cache_alloc+0x16c/0x1d8
      >    [<ffff8000000d0474>] alloc_pid+0x34/0x4d0
      >    [<ffff8000000b5674>] copy_process.isra.6+0x79c/0x1338
      >    [<ffff8000000b633c>] _do_fork+0x74/0x320
      >    [<ffff8000000b66b0>] SyS_clone+0x18/0x20
      >    [<ffff800000085cb0>] el0_svc_naked+0x24/0x28
      >    [<ffffffffffffffff>] 0xffffffffffffffff
      
      On x86, kvm_vcpu_uninit() is called on the path from kvm_arch_destroy_vm();
      on arm no equivalent call is made. Add the call to kvm_arch_vcpu_free().
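      A minimal sketch of the arm-side fix (the surrounding teardown calls are
      assumed):

        void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
        {
                kvm_mmu_free_memory_caches(vcpu);
                kvm_timer_vcpu_terminate(vcpu);
                kvm_vgic_vcpu_destroy(vcpu);
                kvm_vcpu_uninit(vcpu);          /* drops the last struct pid reference */
                kmem_cache_free(kvm_vcpu_cache, vcpu);
        }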
      Signed-off-by: James Morse <james.morse@arm.com>
      Fixes: 749cf76c ("KVM: ARM: Initial skeleton to compile KVM support")
      Cc: <stable@vger.kernel.org> # 3.10+
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
    • powerpc/tm: Always reclaim in start_thread() for exec() class syscalls · 8e96a87c
      Committed by Cyril Bur
      Userspace can quite legitimately perform an exec() syscall with a
      suspended transaction. exec() does not return to the old process; rather,
      it loads a new one and starts that, so the expectation is that the new
      process does not start in a transaction. Currently exec() is not treated
      any differently from any other syscall, which creates problems.
      
      Firstly, it could allow a new process to start with a suspended
      transaction for a binary that no longer exists. This means that the
      checkpointed state won't be valid, and if the suspended transaction were
      ever to be resumed and subsequently aborted (a possibility which is
      exceedingly likely as exec()ing will likely doom the transaction) the
      new process will jump to an invalid state.
      
      Secondly the incorrect attempt to keep the transactional state while
      still zeroing state for the new process creates at least two TM Bad
      Things. The first triggers on the rfid to return to userspace as
      start_thread() has given the new process a 'clean' MSR but the suspend
      will still be set in the hardware MSR. The second TM Bad Thing triggers
      in __switch_to() as the processor is still transactionally suspended but
      __switch_to() wants to zero the TM sprs for the new process.
      
      This is an example of the outcome of calling exec() with a suspended
      transaction. Note the first 700 is likely the first TM Bad Thing
      described earlier, only the kernel can't report it as we've loaded
      userspace registers. c000000000009980 is the rfid in
      fast_exception_return().
      
        Bad kernel stack pointer 3fffcfa1a370 at c000000000009980
        Oops: Bad kernel stack pointer, sig: 6 [#1]
        CPU: 0 PID: 2006 Comm: tm-execed Not tainted
        NIP: c000000000009980 LR: 0000000000000000 CTR: 0000000000000000
        REGS: c00000003ffefd40 TRAP: 0700   Not tainted
        MSR: 8000000300201031 <SF,ME,IR,DR,LE,TM[SE]>  CR: 00000000  XER: 00000000
        CFAR: c0000000000098b4 SOFTE: 0
        PACATMSCRATCH: b00000010000d033
        GPR00: 0000000000000000 00003fffcfa1a370 0000000000000000 0000000000000000
        GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        GPR12: 00003fff966611c0 0000000000000000 0000000000000000 0000000000000000
        NIP [c000000000009980] fast_exception_return+0xb0/0xb8
        LR [0000000000000000]           (null)
        Call Trace:
        Instruction dump:
        f84d0278 e9a100d8 7c7b03a6 e84101a0 7c4ff120 e8410170 7c5a03a6 e8010070
        e8410080 e8610088 e8810090 e8210078 <4c000024> 48000000 e8610178 88ed023b
      
        Kernel BUG at c000000000043e80 [verbose debug info unavailable]
        Unexpected TM Bad Thing exception at c000000000043e80 (msr 0x201033)
        Oops: Unrecoverable exception, sig: 6 [#2]
        CPU: 0 PID: 2006 Comm: tm-execed Tainted: G      D
        task: c0000000fbea6d80 ti: c00000003ffec000 task.ti: c0000000fb7ec000
        NIP: c000000000043e80 LR: c000000000015a24 CTR: 0000000000000000
        REGS: c00000003ffef7e0 TRAP: 0700   Tainted: G      D
        MSR: 8000000300201033 <SF,ME,IR,DR,RI,LE,TM[SE]>  CR: 28002828  XER: 00000000
        CFAR: c000000000015a20 SOFTE: 0
        PACATMSCRATCH: b00000010000d033
        GPR00: 0000000000000000 c00000003ffefa60 c000000000db5500 c0000000fbead000
        GPR04: 8000000300001033 2222222222222222 2222222222222222 00000000ff160000
        GPR08: 0000000000000000 800000010000d033 c0000000fb7e3ea0 c00000000fe00004
        GPR12: 0000000000002200 c00000000fe00000 0000000000000000 0000000000000000
        GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        GPR20: 0000000000000000 0000000000000000 c0000000fbea7410 00000000ff160000
        GPR24: c0000000ffe1f600 c0000000fbea8700 c0000000fbea8700 c0000000fbead000
        GPR28: c000000000e20198 c0000000fbea6d80 c0000000fbeab680 c0000000fbea6d80
        NIP [c000000000043e80] tm_restore_sprs+0xc/0x1c
        LR [c000000000015a24] __switch_to+0x1f4/0x420
        Call Trace:
        Instruction dump:
        7c800164 4e800020 7c0022a6 f80304a8 7c0222a6 f80304b0 7c0122a6 f80304b8
        4e800020 e80304a8 7c0023a6 e80304b0 <7c0223a6> e80304b8 7c0123a6 4e800020
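      A minimal sketch of the shape of the fix named in the title: reclaim any
      suspended transaction in start_thread() before the new image runs
      (helper names assumed):

        void start_thread(struct pt_regs *regs, unsigned long start, unsigned long sp)
        {
        #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
                /* exec() never returns, so the new process must not inherit
                 * a suspended transaction; reclaim it now. */
                if (MSR_TM_SUSPENDED(mfmsr()))
                        tm_reclaim_current(0);
        #endif
                /* ... existing register and state initialization ... */
        }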
      
      This fixes CVE-2016-5828.
      
      Fixes: bc2a9408 ("powerpc: Hook in new transactional memory code")
      Cc: stable@vger.kernel.org # v3.9+
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • x86/boot/64: Add forgotten end of function marker · dbf984d8
      Committed by Borislav Petkov
      Add secondary_startup_64()'s ENDPROC() marker.
      
      No functionality change.
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20160625112457.16930-1-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 26 June 2016 (6 commits)
    • x86/KASLR: Allow randomization below the load address · e066cc47
      Committed by Yinghai Lu
      Currently the kernel image physical address randomization's lower
      boundary is the original kernel load address.
      
      For bootloaders that load kernels into very high memory (e.g. kexec),
      this means randomization takes place in a very small window at the
      top of memory, ignoring the large region of physical memory below
      the load address.
      
      Since mem_avoid[] is already correctly tracking the regions that must be
      avoided, this patch changes the minimum address to whichever is less:
      512MB (to conservatively avoid unknown things in lower memory) or the
      load address. Now, for example, if the kernel is loaded at 8GB, [512MB,
      8GB) will be added to the list of possible physical memory positions.
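      A minimal sketch of the new lower bound (variable names assumed from the
      KASLR boot code):

        /* Randomize down to 512MB, or down to the original load address if
         * the bootloader placed us even lower. */
        min_addr = min(*output, 512UL << 20);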
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      [ Rewrote the changelog, refactored the code to use min(). ]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: H.J. Lu <hjl.tools@gmail.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1464216334-17200-6-git-send-email-keescook@chromium.org
      [ Edited the changelog some more, plus the code comment as well. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/KASLR: Extend kernel image physical address randomization to addresses larger than 4G · ed9f007e
      Committed by Kees Cook
      We want the physical address to be randomized anywhere between
      16MB and the top of physical memory (up to 64TB).
      
      This patch exchanges the prior slots[] array for the new slot_areas[]
      array, and lifts the limitation of KERNEL_IMAGE_SIZE on the physical
      address offset for 64-bit. As before, process_e820_entry() walks
      memory and populates slot_areas[], splitting on any detected mem_avoid
      collisions.
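      A minimal sketch of the slot_areas[] bookkeeping that replaces the old
      slots[] array (struct layout and cap assumed):

        /* Each area holds a start address and the number of
         * CONFIG_PHYSICAL_ALIGN-sized slots that fit inside it. */
        struct slot_area {
                unsigned long addr;
                int num;
        };

        #define MAX_SLOT_AREA 100
        static struct slot_area slot_areas[MAX_SLOT_AREA];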
      
      Finally, since the slots[] array and its associated functions are not
      needed any more, they are removed.
      
      Based on earlier patches by Baoquan He.
      
      Originally-from: Baoquan He <bhe@redhat.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: H.J. Lu <hjl.tools@gmail.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1464216334-17200-5-git-send-email-keescook@chromium.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/KASLR: Randomize virtual address separately · 8391c73c
      Committed by Baoquan He
      The current KASLR implementation randomizes the physical and virtual
      addresses of the kernel together (both are offset by the same amount). It
      calculates the delta of the physical address where vmlinux was linked
      to load and where it is finally loaded. If the delta is not equal to 0
      (i.e. the kernel was relocated), relocation handling needs to be done.
      
      On 64-bit, this patch randomizes both the physical address where kernel
      is decompressed and the virtual address where kernel text is mapped and
      will execute from. We now have two values being chosen, so the function
      arguments are reorganized to pass by pointer so they can be directly
      updated. Since relocation handling only depends on the virtual address,
      we must check the virtual delta, not the physical delta for processing
      kernel relocations. This also populates the page table for the new
      virtual address range. 32-bit does not support a separate virtual address,
      so it continues to use the physical offset for its virtual offset.
      
      Additionally updates the sanity checks done on the resulting kernel
      addresses since they are potentially separate now.
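      A minimal sketch of the reworked interface, with both addresses returned
      through pointers so they can be chosen independently (parameter names
      assumed):

        void choose_random_location(unsigned long input, unsigned long input_size,
                                    unsigned long *output, unsigned long output_size,
                                    unsigned long *virt_addr);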
      
      [kees: rewrote changelog, limited virtual split to 64-bit only, update checks]
      [kees: fix CONFIG_RANDOMIZE_BASE=n boot failure]
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: H.J. Lu <hjl.tools@gmail.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1464216334-17200-4-git-send-email-keescook@chromium.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/KASLR: Clarify identity map interface · 11fdf97a
      Committed by Kees Cook
      This extracts the call to prepare_level4() into a top-level function
      that the user of the pagetable.c interface must call to initialize
      the new page tables. For clarity and to match the "finalize" function,
      it has been renamed to initialize_identity_maps(). This function also
      gains the initialization of mapping_info so we don't have to do it each
      time in add_identity_map().
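      A minimal sketch of the new initializer (field names assumed):

        /* Set up mapping_info once, instead of redoing it in every
         * add_identity_map() call, then build the top-level page table. */
        void initialize_identity_maps(void)
        {
                mapping_info.alloc_pgt_page = alloc_pgt_page;
                mapping_info.context = &pgt_data;
                mapping_info.page_flag = __PAGE_KERNEL_LARGE_EXEC;
                /* ... the former prepare_level4() top-level setup follows ... */
        }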
      
      Additionally add copyright notice to the top, to make it clear that the
      bulk of the pagetable.c code was written by Yinghai, and that I just
      added bugs later. :)
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: H.J. Lu <hjl.tools@gmail.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1464216334-17200-3-git-send-email-keescook@chromium.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/boot: Refuse to build with data relocations · 98f78525
      Committed by Kees Cook
      The compressed kernel is built with -fPIC/-fPIE so that it can run in any
      location a bootloader happens to put it. However, since ELF relocation
      processing is not happening (and all the relocation information has
      already been stripped at link time), none of the code can use data
      relocations (e.g. static assignments of pointers). This is already noted
      in a warning comment at the top of misc.c, but this adds an explicit
      check for the condition during the linking stage to block any such bugs
      from appearing.
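      An illustration of the kind of bug the check catches; a statically
      initialized pointer is the classic case (hypothetical names):

        static int value;
        static int *ptr = &value;       /* BAD: emits a data relocation that the
                                           stripped compressed kernel never processes */

        static int *ptr_ok;             /* OK: assign at runtime instead */
        static void setup(void) { ptr_ok = &value; }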
      
      If this was in place with the earlier bug in pagetable.c, the build
      would fail like this:
      
        ...
          CC      arch/x86/boot/compressed/pagetable.o
          DATAREL arch/x86/boot/compressed/vmlinux
        error: arch/x86/boot/compressed/pagetable.o has data relocations!
        make[2]: *** [arch/x86/boot/compressed/vmlinux] Error 1
        ...
      
      A clean build shows:
      
        ...
          CC      arch/x86/boot/compressed/pagetable.o
          DATAREL arch/x86/boot/compressed/vmlinux
          LD      arch/x86/boot/compressed/vmlinux
        ...
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: H.J. Lu <hjl.tools@gmail.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1464216334-17200-2-git-send-email-keescook@chromium.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/KASLR, x86/power: Remove x86 hibernation restrictions · 65fe935d
      Committed by Kees Cook
      With the following fix:
      
        70595b479ce1 ("x86/power/64: Fix crash whan the hibernation code passes control to the image kernel")
      
      ... there is no longer a problem with hibernation resuming a
      KASLR-booted kernel image, so remove the restriction.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linux PM list <linux-pm@vger.kernel.org>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: linux-doc@vger.kernel.org
      Link: http://lkml.kernel.org/r/20160613221002.GA29719@www.outflux.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 25 June 2016 (6 commits)