1. 05 December 2017, 12 commits
    • KVM: SVM: VMRUN should use associated ASID when SEV is enabled · 70cd94e6
      Authored by Brijesh Singh
      SEV hardware uses ASIDs to associate a memory encryption key with a
      guest VM. During guest creation, a SEV VM uses the SEV_CMD_ACTIVATE
      command to bind a particular ASID to the guest. Let's make sure that the
      VMCB is programmed with the bound ASID before a VMRUN.
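
      A minimal sketch of the idea (function and field names are assumptions
      drawn from the SVM code, not verbatim from the patch):

      	/* Hedged sketch: program the guest's bound SEV ASID into the VMCB
      	 * control area before entering the guest with VMRUN. */
      	static void pre_sev_run(struct vcpu_svm *svm)
      	{
      		int asid = sev_get_asid(svm->vcpu.kvm);  /* bound at SEV_CMD_ACTIVATE time */

      		svm->vmcb->control.asid = asid;
      		mark_dirty(svm->vmcb, VMCB_ASID);        /* make sure the ASID field is reloaded */
      	}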
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
    • KVM: SVM: Add KVM_SEV_INIT command · 1654efcb
      Authored by Brijesh Singh
      The command initializes the SEV platform context and allocates a new ASID
      for this guest from the SEV ASID pool. The firmware must be initialized
      before we issue any guest launch commands to create a new memory encryption
      context.
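
      A rough sketch of the ASID-pool side of this (the bitmap name and helper
      are illustrative assumptions, not necessarily the patch's exact code):

      	/* Hedged sketch: hand out a free ASID from a bitmap-backed SEV pool. */
      	static int sev_asid_new(void)
      	{
      		int pos = find_first_zero_bit(sev_asid_bitmap, max_sev_asid);

      		if (pos >= max_sev_asid)
      			return -EBUSY;

      		set_bit(pos, sev_asid_bitmap);
      		return pos + 1;		/* SEV ASIDs start at 1 */
      	}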
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
    • KVM: SVM: Add sev module_param · e9df0942
      Authored by Brijesh Singh
      The module parameter can be used to control the SEV feature support.
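
      For illustration, such a knob is usually declared along these lines (the
      exact type and default used by the patch may differ):

      	/* Hedged sketch: "sev" module parameter for kvm_amd, read-only via sysfs. */
      	static int sev = 1;
      	module_param(sev, int, 0444);
      	MODULE_PARM_DESC(sev, "Secure Encrypted Virtualization (SEV) support");

      The knob would then typically be flipped at module load time, e.g. with
      kvm_amd.sev=0 on the kernel command line.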
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
    • KVM: SVM: Reserve ASID range for SEV guest · ed3cd233
      Authored by Brijesh Singh
      A SEV-enabled guest must use ASIDs from the defined subset, while non-SEV
      guests can use the remaining ASID range. The range of allowed SEV guest
      ASIDs is [1 - CPUID_8000_001F[ECX][31:0]].
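
      A hedged sketch of how the boundary could be read at hardware-setup time
      (the variable and function names are assumptions):

      	/* Hedged sketch: CPUID Fn8000_001F[ECX] reports the number of SEV ASIDs;
      	 * SEV guests then use [1, max_sev_asid], non-SEV guests the rest. */
      	static unsigned int max_sev_asid;

      	static void sev_hardware_setup(void)
      	{
      		max_sev_asid = cpuid_ecx(0x8000001f);
      	}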
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Improvements-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
    • KVM: X86: Add CONFIG_KVM_AMD_SEV · 5dd0a57c
      Authored by Brijesh Singh
      The config option can be used to enable SEV support on AMD Processors.
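
      As a rough illustration of how such an option is typically consumed
      (purely a sketch, not the patch's exact wiring):

      	/* Hedged sketch: only attempt SEV setup when the option is compiled in
      	 * and the module parameter is set. */
      	if (IS_ENABLED(CONFIG_KVM_AMD_SEV) && sev)
      		sev_hardware_setup();
      	else
      		sev = 0;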
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
    • KVM: Introduce KVM_MEMORY_ENCRYPT_{UN,}REG_REGION ioctl · 69eaedee
      Authored by Brijesh Singh
      If hardware supports memory encryption, then the KVM_MEMORY_ENCRYPT_REG_REGION
      and KVM_MEMORY_ENCRYPT_UNREG_REGION ioctls can be used by userspace to
      register/unregister the guest memory regions which may contain encrypted
      data (e.g. guest RAM, PCI BARs, SMRAM, etc.).
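
      For illustration, a VMM might register a guest RAM range roughly as
      follows (sketch only; struct kvm_enc_region and the ioctl names come
      from the uapi header this series touches):

      	/* Hedged sketch: tell KVM that this host range backs (possibly
      	 * encrypted) guest memory so it can be pinned/tracked. */
      	#include <sys/ioctl.h>
      	#include <linux/kvm.h>

      	static int register_enc_region(int vm_fd, void *hva, unsigned long len)
      	{
      		struct kvm_enc_region region = {
      			.addr = (unsigned long)hva,
      			.size = len,
      		};

      		/* KVM_MEMORY_ENCRYPT_UNREG_REGION undoes this at teardown. */
      		return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_REG_REGION, &region);
      	}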
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Improvements-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
    • KVM: Introduce KVM_MEMORY_ENCRYPT_OP ioctl · 5acc5c06
      Authored by Brijesh Singh
      If the hardware supports memory encryption, then the
      KVM_MEMORY_ENCRYPT_OP ioctl can be used by qemu to issue platform-specific
      memory encryption commands.
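
      As a hedged usage sketch (struct kvm_sev_cmd is the AMD-specific argument
      defined later in this series; treat the names as assumptions):

      	/* Hedged sketch: ask KVM to run one platform-specific SEV command. */
      	#include <sys/ioctl.h>
      	#include <linux/kvm.h>

      	static int sev_vm_ioctl(int vm_fd, int sev_fd, unsigned int cmd_id, void *data)
      	{
      		struct kvm_sev_cmd cmd = {
      			.id     = cmd_id,		/* e.g. KVM_SEV_INIT */
      			.data   = (unsigned long)data,
      			.sev_fd = sev_fd,		/* handle to /dev/sev */
      		};
      		int ret = ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);

      		/* cmd.error carries the SEV firmware status on failure. */
      		return ret ? (int)cmd.error : 0;
      	}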
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
    • KVM: X86: Extend CPUID range to include new leaf · 8765d753
      Authored by Brijesh Singh
      This CPUID leaf provides the memory encryption support information on
      AMD platforms. Its complete description is available in APM Volume 2,
      Section 15.34.
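
      For reference, the leaf can be inspected directly from userspace; a
      minimal sketch (field decoding per the APM, treat it as illustrative):

      	/* Hedged sketch: dump the AMD memory-encryption CPUID leaf. */
      	#include <stdio.h>
      	#include <cpuid.h>

      	int main(void)
      	{
      		unsigned int eax, ebx, ecx, edx;

      		if (!__get_cpuid(0x8000001f, &eax, &ebx, &ecx, &edx))
      			return 1;		/* leaf not available */

      		printf("SME/SEV feature bits: %#x, C-bit position: %u, SEV ASIDs: %u\n",
      		       eax, ebx & 0x3f, ecx);
      		return 0;
      	}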
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
    • KVM: SVM: Prepare to reserve asid for SEV guest · 4faefff3
      Authored by Brijesh Singh
      Currently, ASID allocation starts at 1. Add a svm_vcpu_data.min_asid
      field which allows supplying a dynamic start ASID.
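
      A hedged sketch of how the new lower bound would be consumed by the
      existing ASID roll-over (struct and field names assumed):

      	/* Hedged sketch: wrap ASID allocation at a configurable minimum
      	 * instead of the previously hard-coded 1. */
      	if (sd->next_asid > sd->max_asid) {
      		++sd->asid_generation;
      		sd->next_asid = sd->min_asid;	/* was: sd->next_asid = 1 */
      	}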
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
    • kvm: svm: Add SEV feature definitions to KVM · ba7c3398
      Authored by Tom Lendacky
      Define the SEV enable bit for the VMCB control structure. The hypervisor
      will use this bit to enable SEV in the guest.
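
      For reference, the nested_ctl bit layout described by the APM looks
      roughly like this (names are assumptions matching the in-tree style):

      	/* Hedged sketch: VMCB nested_ctl feature bits. */
      	#define SVM_NESTED_CTL_NP_ENABLE	BIT(0)	/* nested paging */
      	#define SVM_NESTED_CTL_SEV_ENABLE	BIT(1)	/* SEV guest */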
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
    • kvm: svm: prepare for new bit definition in nested_ctl · cea3a19b
      Authored by Tom Lendacky
      Currently the nested_ctl variable in the vmcb_control_area structure is
      used to indicate nested paging support. The nested paging support field
      is actually defined as bit 0 of the field. In order to support a new
      feature flag, the nested_ctl usage for nested paging support must
      be converted to operate on a single bit.
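
      A minimal sketch of the conversion (the helper name on the last line is
      purely illustrative):

      	/* Hedged sketch: treat nested_ctl as a bit field instead of a boolean. */
      	control->nested_ctl |= SVM_NESTED_CTL_NP_ENABLE;	/* was: control->nested_ctl = 1 */

      	if (control->nested_ctl & SVM_NESTED_CTL_NP_ENABLE)	/* was: if (control->nested_ctl) */
      		setup_nested_paging(svm);			/* illustrative helper */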
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
    • x86/CPU/AMD: Add the Secure Encrypted Virtualization CPU feature · 18c71ce9
      Authored by Tom Lendacky
      Update the CPU features to include identifying and reporting on the
      Secure Encrypted Virtualization (SEV) feature.  SEV is identified by
      CPUID 0x8000001f, but requires BIOS support to enable it (set bit 23 of
      MSR_K8_SYSCFG and set bit 0 of MSR_K7_HWCR).  Only show the SEV feature
      as available if reported by CPUID and enabled by BIOS.
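
      A hedged sketch of that detection logic (MSR bit meanings per the
      changelog; the exact flow and names in the patch may differ):

      	/* Hedged sketch: clear X86_FEATURE_SEV unless BIOS set the
      	 * prerequisites described above. */
      	static void early_detect_sev(struct cpuinfo_x86 *c)
      	{
      		u64 syscfg, hwcr;

      		rdmsrl(MSR_K8_SYSCFG, syscfg);
      		rdmsrl(MSR_K7_HWCR, hwcr);

      		/* BIOS must set SYSCFG[23] (memory encryption enable) and HWCR[0]. */
      		if (!(syscfg & BIT_ULL(23)) || !(hwcr & BIT_ULL(0)))
      			clear_cpu_cap(c, X86_FEATURE_SEV);
      	}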
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: kvm@vger.kernel.org
      Cc: x86@kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
  2. 24 November 2017, 4 commits
  3. 23 November 2017, 1 commit
    • x86/entry/64: Add missing irqflags tracing to native_load_gs_index() · ca37e57b
      Authored by Andy Lutomirski
      Running this code with IRQs enabled (where dummy_lock is a spinlock):
      
      static void check_load_gs_index(void)
      {
      	/* This will fail. */
      	load_gs_index(0xffff);
      
      	spin_lock(&dummy_lock);
      	spin_unlock(&dummy_lock);
      }
      
      Will generate a lockdep warning.  The issue is that the actual write
      to %gs would cause an exception with IRQs disabled, and the exception
      handler would, as an inadvertent side effect, update irqflag tracing
      to reflect the IRQs-off status.  native_load_gs_index() would then
      turn IRQs back on and return with irqflag tracing still thinking that
      IRQs were off.  The dummy lock-and-unlock causes lockdep to notice the
      error and warn.
      
      Fix it by adding the missing tracing.
      
      Apparently nothing did this in a context where it mattered.  I haven't
      tried to find a code path that would actually exhibit the warning if
      appropriately nasty user code were running.
      
      I suspect that the security impact of this bug is very, very low --
      production systems don't run with lockdep enabled, and the warning is
      mostly harmless anyway.
      
      Found during a quick audit of the entry code to try to track down an
      unrelated bug that Ingo found in some still-in-development code.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/e1aeb0e6ba8dd430ec36c8a35e63b429698b4132.1511411918.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 22 November 2017, 2 commits
    • x86/mm/kasan: Don't use vmemmap_populate() to initialize shadow · f68d62a5
      Authored by Andrey Ryabinin
      [ Note, this commit is a cherry-picked version of:
      
          d17a1d97: ("x86/mm/kasan: don't use vmemmap_populate() to initialize shadow")
      
        ... for easier x86 entry code testing and back-porting. ]
      
      The KASAN shadow is currently mapped using vmemmap_populate() since that
      provides a semi-convenient way to map pages into init_top_pgt.  However,
      since that no longer zeroes the mapped pages, it is not suitable for
      KASAN, which requires zeroed shadow memory.
      
      Add kasan_populate_shadow() interface and use it instead of
      vmemmap_populate().  Besides, this allows us to take advantage of
      gigantic pages and use them to populate the shadow, which should save us
      some memory wasted on page tables and reduce TLB pressure.
      
      Link: http://lkml.kernel.org/r/20171103185147.2688-2-pasha.tatashin@oracle.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/entry/64: Fix entry_SYSCALL_64_after_hwframe() IRQ tracing · 548c3050
      Authored by Andy Lutomirski
      When I added entry_SYSCALL_64_after_hwframe(), I left TRACE_IRQS_OFF
      before it.  This means that users of entry_SYSCALL_64_after_hwframe()
      were responsible for invoking TRACE_IRQS_OFF, and the one and only
      user (Xen, added in the same commit) got it wrong.
      
      I think this would manifest as a warning if a Xen PV guest with
      CONFIG_DEBUG_LOCKDEP=y were used with context tracking.  (The
      context tracking bit is to cause lockdep to get invoked before we
      turn IRQs back on.)  I haven't tested that for real yet because I
      can't get a kernel configured like that to boot at all on Xen PV.
      
      Move TRACE_IRQS_OFF below the label.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bpetkov@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: 8a9949bc ("x86/xen/64: Rearrange the SYSCALL entries")
      Link: http://lkml.kernel.org/r/9150aac013b7b95d62c2336751d5b6e91d2722aa.1511325444.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 21 November 2017, 1 commit
    • x86/umip: Print a warning into the syslog if UMIP-protected instructions are used · fd11a649
      Authored by Ricardo Neri
      Print a rate-limited warning when a user-space program attempts to execute
      any of the instructions that UMIP protects (i.e., SGDT, SIDT, SLDT, STR
      and SMSW).
      
      This is useful, because when CONFIG_X86_INTEL_UMIP=y is selected and
      supported by the hardware, user space programs that try to execute such
      instructions will receive a SIGSEGV signal that they might not expect.
      
      In the specific cases for which emulation is provided (instructions SGDT,
      SIDT and SMSW in protected and virtual-8086 modes), no signal is
      generated. However, a warning is helpful to encourage updates in such
      programs to avoid the use of such instructions.
      
      Warnings are printed via a customized printk() function that also provides
      information about the program that attempted to use the affected
      instructions.
      
      Utility macros are defined to wrap umip_printk() for the error and warning
      kernel log levels.
      
      While here, replace an existing call to the generic rate-limited pr_err()
      with the new umip_pr_err().
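
      The wrappers described above would look roughly like this (sketch; the
      macro names follow the changelog and may differ slightly in the patch):

      	/* Hedged sketch: level-specific wrappers around a common, rate-limited
      	 * umip_printk() helper that also prints task/IP information. */
      	#define umip_pr_err(regs, fmt, ...) \
      		umip_printk(regs, KERN_ERR, fmt, ##__VA_ARGS__)
      	#define umip_pr_warning(regs, fmt, ...) \
      		umip_printk(regs, KERN_WARNING, fmt, ##__VA_ARGS__)
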
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: ricardo.neri@intel.com
      Link: http://lkml.kernel.org/r/1511233476-17088-1-git-send-email-ricardo.neri-calderon@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  6. 17 November 2017, 9 commits
  7. 16 November 2017, 11 commits
    • x86/mm: Limit mmap() of /dev/mem to valid physical addresses · be62a320
      Authored by Craig Bergstrom
      One thing /dev/mem access APIs should verify is that there's no way
      that excessively large PFNs can leak into the high bits of the
      page table entry.
      
      In particular, if people can use "very large physical page addresses"
      through /dev/mem to set the bits past bit 58 - SOFTW4 and permission
      key bits and NX bit, that could *really* confuse the kernel.
      
      We had an earlier attempt:
      
        ce56a86e ("x86/mm: Limit mmap() of /dev/mem to valid physical addresses")
      
      ... which turned out to be too restrictive (breaking mem=... bootups for example) and
      had to be reverted in:
      
        90edaac6 ("Revert "x86/mm: Limit mmap() of /dev/mem to valid physical addresses"")
      
      This v2 attempt modifies the original patch and makes sure that mmap(/dev/mem)
      limits the PFNs so that they at least fit in the actual pteval_t architecturally:
      
       - Make sure mmap_mem() actually validates that the offset fits in phys_addr_t
      
          ( This may be indirectly true due to some other check, but it's not
            entirely obvious. )
      
       - Change valid_mmap_phys_addr_range() to just use phys_addr_valid()
         on the top byte
      
          ( Top byte is sufficient, because mmap_mem() has already checked that
            it cannot wrap. )
      
       - Add a few comments about what the valid_phys_addr_range() vs.
         valid_mmap_phys_addr_range() difference is.
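
      A hedged sketch of the x86 side of that check (simplified; the actual
      patch also adjusts mmap_mem() as described above):

      	/* Hedged sketch: the mapped range must start inside the architecturally
      	 * valid physical address space; wrap-around was already excluded by the
      	 * phys_addr_t check in mmap_mem(). */
      	static inline int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
      	{
      		return phys_addr_valid((phys_addr_t)pfn << PAGE_SHIFT);
      	}
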
      Signed-off-by: Craig Bergstrom <craigb@google.com>
      [ Fixed the checks and added comments. ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [ Collected the discussion and patches into a commit. ]
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hans Verkuil <hans.verkuil@cisco.com>
      Cc: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sander Eikelenboom <linux@eikelenboom.it>
      Cc: Sean Young <sean@mess.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/CA+55aFyEcOMb657vWSmrM13OxmHxC-XxeBmNis=DwVvpJUOogQ@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Prevent non-MAP_FIXED mapping across DEFAULT_MAP_WINDOW border · 1e0f25db
      Authored by Kirill A. Shutemov
      In case of 5-level paging, the kernel does not place any mapping above
      47-bit, unless userspace explicitly asks for it.
      
      Userspace can request an allocation from the full address space by
      specifying the mmap address hint above 47-bit.
      
      Nicholas noticed that the current implementation violates this interface:
      
        If user space requests a mapping at the end of the 47-bit address space
        with a length which causes the mapping to cross the 47-bit border
        (DEFAULT_MAP_WINDOW), then the vma is partially in the address space
        below and above.
      
      Sanity check the mmap address hint so that start and end of the resulting
      vma are on the same side of the 47-bit border. If that's not the case fall
      back to the code path which ignores the address hint and allocate from the
      regular address space below 47-bit.
      
      To make the checks consistent, mask out the address hint's lower bits
      (either PAGE_MASK or huge_page_mask()) instead of using ALIGN(), which can
      push them up to the next boundary.
      
      [ tglx: Moved the address check to a function and massaged comment and
        	changelog ]
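
      A hedged sketch of that address-hint check (simplified):

      	/* Hedged sketch: honour the hint only if the whole mapping stays on
      	 * one side of the 47-bit DEFAULT_MAP_WINDOW border. */
      	static bool mmap_address_hint_valid(unsigned long addr, unsigned long len)
      	{
      		if (TASK_SIZE - len < addr)
      			return false;

      		return (addr > DEFAULT_MAP_WINDOW) ==
      		       (addr + len > DEFAULT_MAP_WINDOW);
      	}
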
      Reported-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20171115143607.81541-1-kirill.shutemov@linux.intel.com
    • mm, sparse: do not swamp log with huge vmemmap allocation failures · fcdaf842
      Authored by Michal Hocko
      While doing memory hotplug tests under heavy memory pressure, we have
      noticed too many page allocation failures when allocating the vmemmap
      memmap backed by huge pages:
      
        kworker/u3072:1: page allocation failure: order:9, mode:0x24084c0(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO)
        [...]
        Call Trace:
          dump_trace+0x59/0x310
          show_stack_log_lvl+0xea/0x170
          show_stack+0x21/0x40
          dump_stack+0x5c/0x7c
          warn_alloc_failed+0xe2/0x150
          __alloc_pages_nodemask+0x3ed/0xb20
          alloc_pages_current+0x7f/0x100
          vmemmap_alloc_block+0x79/0xb6
          __vmemmap_alloc_block_buf+0x136/0x145
          vmemmap_populate+0xd2/0x2b9
          sparse_mem_map_populate+0x23/0x30
          sparse_add_one_section+0x68/0x18e
          __add_pages+0x10a/0x1d0
          arch_add_memory+0x4a/0xc0
          add_memory_resource+0x89/0x160
          add_memory+0x6d/0xd0
          acpi_memory_device_add+0x181/0x251
          acpi_bus_attach+0xfd/0x19b
          acpi_bus_scan+0x59/0x69
          acpi_device_hotplug+0xd2/0x41f
          acpi_hotplug_work_fn+0x1a/0x23
          process_one_work+0x14e/0x410
          worker_thread+0x116/0x490
          kthread+0xbd/0xe0
          ret_from_fork+0x3f/0x70
      
      and we do see many of those because essentially every allocation fails
      for each memory section.  This is an excessive way to tell the user that
      there is nothing to really worry about because we do have a fallback
      mechanism to use base pages.  The only downside might be a performance
      degradation due to TLB pressure.
      
      This patch changes vmemmap_alloc_block() to use __GFP_NOWARN and warn
      explicitly once on the first allocation failure.  This will reduce the
      noise in the kernel log considerably, while we still have an indication
      that performance might be impacted.
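
      A hedged sketch of the resulting allocation path (simplified; the
      early-boot fallback branch is omitted):

      	/* Hedged sketch: allocate quietly and warn only once on failure. */
      	void * __meminit vmemmap_alloc_block(unsigned long size, int node)
      	{
      		gfp_t gfp_mask = GFP_KERNEL | __GFP_NOWARN;
      		static bool warned;
      		struct page *page;

      		page = alloc_pages_node(node, gfp_mask, get_order(size));
      		if (page)
      			return page_address(page);

      		if (!warned) {
      			warn_alloc(gfp_mask & ~__GFP_NOWARN, NULL,
      				   "vmemmap alloc failure: order:%u", get_order(size));
      			warned = true;
      		}
      		return NULL;	/* caller falls back to base pages */
      	}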
      
      [mhocko@kernel.org: forgot to git add the follow up fix]
        Link: http://lkml.kernel.org/r/20171107090635.c27thtse2lchjgvb@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20171106092228.31098-1-mhocko@kernel.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86/mm/kasan: don't use vmemmap_populate() to initialize shadow · d17a1d97
      Authored by Andrey Ryabinin
      The kasan shadow is currently mapped using vmemmap_populate() since that
      provides a semi-convenient way to map pages into init_top_pgt.  However,
      since that no longer zeroes the mapped pages, it is not suitable for
      kasan, which requires zeroed shadow memory.
      
      Add kasan_populate_shadow() interface and use it instead of
      vmemmap_populate().  Besides, this allows us to take advantage of
      gigantic pages and use them to populate the shadow, which should save us
      some memory wasted on page tables and reduce TLB pressure.
      
      Link: http://lkml.kernel.org/r/20171103185147.2688-2-pasha.tatashin@oracle.com
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • x86/mm: set fields in deferred pages · 353b1e7b
      Authored by Pavel Tatashin
      Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
      flags and other struct page fields are never changed prior to first
      initializing the struct pages by going through __init_single_page().
      
      With deferred struct page feature enabled, however, we set fields in
      register_page_bootmem_info that are subsequently clobbered right after
      in free_all_bootmem:
      
              mem_init() {
                      register_page_bootmem_info();
                      free_all_bootmem();
                      ...
              }
      
      When register_page_bootmem_info() is called, only non-deferred struct
      pages are initialized.  But this function goes through some reserved
      pages which might be part of the deferred set, and thus are not yet
      initialized.
      
        mem_init
         register_page_bootmem_info
          register_page_bootmem_info_node
           get_page_bootmem
            .. setting fields here ..
            such as: page->freelist = (void *)type;
      
        free_all_bootmem()
         free_low_memory_core_early()
          for_each_reserved_mem_region()
           reserve_bootmem_region()
            init_reserved_page() <- Only if this is deferred reserved page
             __init_single_pfn()
              __init_single_page()
                   memset(0) <-- Lose the set fields here
      
      We end up with an issue where we currently do not observe a problem, as
      memory is explicitly zeroed.  But if flag asserts are changed, we can
      start hitting issues.
      
      Also, because in this patch series we will stop zeroing struct page
      memory during allocation, we must make sure that struct pages are
      properly initialized prior to using them.
      
      The deferred-reserved pages are initialized in free_all_bootmem().
      Therefore, the fix is to switch the above calls.
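
      In other words, roughly (sketch only):

      	/* Hedged sketch: initialize the deferred reserved pages first, then
      	 * let register_page_bootmem_info() set fields on them. */
      	void __init mem_init(void)
      	{
      		free_all_bootmem();		/* runs __init_single_page() on deferred reserved pages */
      		register_page_bootmem_info();	/* now safe to set page fields */
      		/* ... remaining mem_init() work ... */
      	}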
      
      Link: http://lkml.kernel.org/r/20171013173214.27300-3-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Bob Picco <bob.picco@oracle.com>
      Tested-by: Bob Picco <bob.picco@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmemcheck: rip it out · 4675ff05
      Authored by Levin, Alexander (Sasha Levin)
      Fix up makefiles, remove references, and git rm kmemcheck.
      
      Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmemcheck: remove whats left of NOTRACK flags · d8be7566
      Authored by Levin, Alexander (Sasha Levin)
      Now that kmemcheck is gone, we don't need the NOTRACK flags.
      
      Link: http://lkml.kernel.org/r/20171007030159.22241-5-alexander.levin@verizon.com
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmemcheck: stop using GFP_NOTRACK and SLAB_NOTRACK · 75f296d9
      Authored by Levin, Alexander (Sasha Levin)
      Convert all allocations that used a NOTRACK flag to stop using it.
      
      Link: http://lkml.kernel.org/r/20171007030159.22241-3-alexander.levin@verizon.com
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmemcheck: remove annotations · 49502766
      Authored by Levin, Alexander (Sasha Levin)
      Patch series "kmemcheck: kill kmemcheck", v2.
      
      As discussed at LSF/MM, kill kmemcheck.
      
      KASan is a replacement that is able to work without the limitation of
      kmemcheck (single CPU, slow).  KASan is already upstream.
      
      We are also not aware of any users of kmemcheck (or users who don't
      consider KASan as a suitable replacement).
      
      The only objection was that since KASAN wasn't supported by all GCC
      versions provided by distros at that time we should hold off for 2
      years, and try again.
      
      Now that 2 years have passed, and all distros provide gcc that supports
      KASAN, kill kmemcheck again for the very same reasons.
      
      This patch (of 4):
      
      Remove kmemcheck annotations, and calls to kmemcheck from the kernel.
      
      [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
        Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
      Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
      Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tim Hansen <devtimhansen@gmail.com>
      Cc: Vegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kbuild: create object directories simpler and faster · 8a78756e
      Authored by Masahiro Yamada
      For the out-of-tree build, scripts/Makefile.build creates output
      directories, but this operation is not efficient.
      
      scripts/Makefile.lib calculates obj-dirs as follows:
      
        obj-dirs := $(dir $(multi-objs) $(obj-y))
      
      Please notice that $(sort ...) is not used here.  Usually the result is
      as many "./" entries as there are objects.
      
      For a lot of duplicated paths, the following command is invoked.
      
        _dummy := $(foreach d,$(obj-dirs), $(shell [ -d $(d) ] || mkdir -p $(d)))
      
      Then, the costly shell command is run over and over again.
      
      I see many points for optimization:
      
      [1] Use $(sort ...) to cut down duplicated paths before passing them
          to system call
      [2] Use single $(shell ...) instead of repeating it with $(foreach ...)
          This will reduce forking.
      [3] We can calculate obj-dirs more simply.  Most objects are already
          accumulated in $(targets).  So, $(dir $(targets)) is fine and more
          comprehensive.
      
      I also removed ugly code in arch/x86/entry/vdso/Makefile.  This is now
      really unnecessary.
      Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Tested-by: Douglas Anderson <dianders@chromium.org>
    • x86 / CPU: Always show current CPU frequency in /proc/cpuinfo · 7d5905dc
      Authored by Rafael J. Wysocki
      After commit 890da9cf (Revert "x86: do not use cpufreq_quick_get()
      for /proc/cpuinfo "cpu MHz"") the "cpu MHz" number in /proc/cpuinfo
      on x86 can be either the nominal CPU frequency (which is constant)
      or the frequency most recently requested by a scaling governor in
      cpufreq, depending on the cpufreq configuration.  That is somewhat
      inconsistent and is different from what it was before 4.13, so in
      order to restore the previous behavior, make it report the current
      CPU frequency like the scaling_cur_freq sysfs file in cpufreq.
      
      To that end, modify the /proc/cpuinfo implementation on x86 to use
      aperfmperf_snapshot_khz() to snapshot the APERF and MPERF feedback
      registers, if available, and use their values to compute the CPU
      frequency to be reported as "cpu MHz".
      
      However, do that carefully enough to avoid accumulating delays that
      lead to unacceptable access times for /proc/cpuinfo on systems with
      many CPUs.  Run aperfmperf_snapshot_khz() once on all CPUs
      asynchronously at the /proc/cpuinfo open time, add a single delay
      upfront (if necessary) at that point and simply compute the current
      frequency while running show_cpuinfo() for each individual CPU.
      
      Also, to avoid slowing down /proc/cpuinfo accesses too much, reduce
      the default delay between consecutive APERF and MPERF reads to 10 ms,
      which should be sufficient to get large enough numbers for the
      frequency computation in all cases.
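
      Conceptually, the reported value is the nominal frequency scaled by the
      APERF/MPERF ratio accumulated since the snapshot; roughly (sketch, the
      helper name is an assumption):

      	/* Hedged sketch: derive an effective "cpu MHz" value from the deltas of
      	 * the APERF/MPERF feedback registers taken at snapshot time. */
      	static unsigned int aperfmperf_khz(u64 aperf_delta, u64 mperf_delta)
      	{
      		if (!mperf_delta)
      			return cpu_khz;		/* no data yet, report the nominal frequency */

      		return div64_u64(cpu_khz * aperf_delta, mperf_delta);
      	}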
      
      Fixes: 890da9cf (Revert "x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"")
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Ingo Molnar <mingo@kernel.org>