1. 01 Feb 2021, 4 commits
    • perf/x86/intel: Add perf core PMU support for Sapphire Rapids · 61b985e3
      Committed by Kan Liang
      Add perf core PMU support for the Intel Sapphire Rapids server, which is
      the successor of the Intel Ice Lake server. The enabling code is based
      on Ice Lake, but there are several new features introduced.
      
      The event encoding is changed and simplified; e.g., event codes below
      0x90 are restricted to counters 0-3, while event codes above 0x90 are
      likely to have no restrictions. The event constraints, extra_regs(),
      and the hardware cache events table are changed accordingly.
      
      A new Precise Distribution (PDist) facility is introduced, which
      further minimizes the skid when a precise event is programmed on GP
      counter 0. Enable the Precise Distribution (PDist) facility with the
      :ppp event qualifier. For this facility to work, the period must be
      initialized with a value larger than 127. Add spr_limit_period() to
      apply the limit for :ppp events, as sketched below.
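      
      A minimal sketch of the idea (the limit_period callback signature has
      varied across kernel versions, so treat the prototype as illustrative):
      
        static u64 spr_limit_period(struct perf_event *event, u64 left)
        {
                /* PDist (:ppp) requires a sample period of at least 128 */
                if (event->attr.precise_ip == 3)
                        return max(left, 128ULL);
                return left;
        }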
      
      Two new data source fields, data block & address block, are added to
      the PEBS Memory Info Record for the load latency event. To enable the
      feature,
      - An auxiliary event has to be enabled together with the load latency
        event on Sapphire Rapids. A new flag PMU_FL_MEM_LOADS_AUX is
        introduced to indicate the case. A new event, mem-loads-aux, is
        exposed to sysfs for the user tool.
        Add a check in hw_config(). If the auxiliary event is not detected,
        return a unique error, -ENODATA (see the sketch after this list).
      - The union perf_mem_data_src is extended to support the new fields.
      - Ice Lake and earlier models do not support block information, but the
        fields may be set by HW on some machines. Add pebs_no_block to
        explicitly mark the earlier platforms that don't support the new
        block fields. Accesses to the new block fields are ignored on those
        platforms.
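      
      A condensed sketch of that hw_config() check (helper names are
      illustrative; the real code walks the event group looking for the
      auxiliary event):
      
        if ((x86_pmu.flags & PMU_FL_MEM_LOADS_AUX) && is_sampling_event(event)) {
                if (is_mem_loads_event(event) &&
                    !group_has_aux_event(event))    /* hypothetical helper */
                        return -ENODATA;
        }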
      
      A new Store Latency facility is introduced, which leverages the PEBS
      facility to provide additional information about sampled stores. The
      additional information includes the data address, memory auxiliary
      info (e.g. data source, STLB miss) and the latency of the store
      access. To enable the facility, the new event (0x02cd) has to be
      programmed on GP counter 0. A new flag PERF_X86_EVENT_PEBS_STLAT is
      introduced to indicate the event. store_latency_data() is introduced
      to parse the memory auxiliary info.
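      
      A simplified sketch of what store_latency_data() might look like (bit
      positions are illustrative; the real decoder handles the full
      data-source encoding):
      
        static u64 store_latency_data(u64 aux)
        {
                union perf_mem_data_src dse = { .val = 0 };
      
                dse.mem_op = PERF_MEM_OP_STORE;     /* store sample */
                if (aux & BIT_ULL(4))               /* illustrative STLB-miss bit */
                        dse.mem_dtlb = PERF_MEM_TLB_MISS;
                else
                        dse.mem_dtlb = PERF_MEM_TLB_HIT;
                return dse.val;
        }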
      
      The layout of the access latency field of the PEBS Memory Info Record
      has been changed. Two latencies, instruction latency (bits 15:0) and
      cache access latency (bits 47:32), are recorded.
      - The cache access latency is similar to the previous memory access
        latency. For loads, the latency starts at the actual cache access
        and lasts until the data is returned by the memory subsystem.
        For stores, the latency starts when the demand write accesses the L1
        data cache and lasts until the cacheline write is completed in the
        memory subsystem.
        The cache access latency is stored in the low 32 bits of the sample
        type PERF_SAMPLE_WEIGHT_STRUCT.
      - The instruction latency starts at the dispatch of the load operation
        for execution and lasts until completion of the instruction it
        belongs to.
        Add a new flag PMU_FL_INSTR_LATENCY to indicate instruction latency
        support. The instruction latency is stored in bits 47:32 of the
        sample type PERF_SAMPLE_WEIGHT_STRUCT.
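      
      Splitting the reworked PEBS latency field is then a pair of shifts
      (sketch; note the sample weight struct carries the two values in the
      opposite arrangement: cache latency in the low 32 bits, instruction
      latency in bits 47:32):
      
        static void pebs_split_latency(u64 lat, u16 *instr_lat, u16 *cache_lat)
        {
                *instr_lat = lat & 0xffff;          /* bits 15:0  */
                *cache_lat = (lat >> 32) & 0xffff;  /* bits 47:32 */
        }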
      
      Extend the PERF_METRICS MSR to feature TMA method level 2 metrics. The
      lower half of the register holds the TMA level 1 metrics (legacy); the
      upper half is likewise divided into four 8-bit fields for the new
      level 2 metrics. Expose all eight Topdown metrics events to user space.
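      
      Reading the metrics then amounts to slicing the MSR into eight 8-bit
      fractions (sketch; 'fraction' is an illustrative destination array):
      
        u64 metrics;
        u8 fraction[8];
        int i;
      
        rdmsrl(MSR_PERF_METRICS, metrics);
        for (i = 0; i < 8; i++)     /* 4 level 1 + 4 level 2 fields */
                fraction[i] = (metrics >> (i * 8)) & 0xff;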
      
      The full description for the SPR features can be found at Intel
      Architecture Instruction Set Extensions and Future Features
      Programming Reference, 319433-041.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1611873611-156687-5-git-send-email-kan.liang@linux.intel.com
      61b985e3
    • perf/x86/intel: Filter unsupported Topdown metrics event · 1ab5f235
      Committed by Kan Liang
      The Intel Sapphire Rapids server will introduce 8 metrics events,
      while Intel Ice Lake only supports 4. A perf tool user may mistakenly
      use the unsupported events via the RAW format on Ice Lake, and would
      still get a value from an unsupported Topdown metrics event once the
      following Sapphire Rapids enabling patch is applied.
      
      To enable the 8 metrics events on Intel Sapphire Rapids,
      INTEL_TD_METRIC_MAX has to be updated, which impacts
      is_metric_event(). Since is_metric_event() is a generic function, on
      Ice Lake the newly added SPR metrics events would be mistakenly
      accepted as metric events on creation, and the unsupported Topdown
      metrics events would then be updated at runtime.
      
      Add a variable num_topdown_events to x86_pmu to indicate the number of
      Topdown metrics events available on the platform, and apply that
      number in is_metric_event() so that only the supported Topdown metrics
      events are created as metric events.
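      
      A sketch of the bounded check (not the exact kernel code; raw metric
      event configs are spaced 0x100 apart starting at
      INTEL_TD_METRIC_RETIRING):
      
        static inline bool is_metric_event(struct perf_event *event)
        {
                u64 config = event->attr.config & INTEL_ARCH_EVENT_MASK;
                u64 max = INTEL_TD_METRIC_RETIRING +
                          (x86_pmu.num_topdown_events - 1) * 0x100;
      
                return config >= INTEL_TD_METRIC_RETIRING && config <= max;
        }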
      
      Apply num_topdown_events in icl_update_topdown_event() as well, so the
      function can be reused by the following patch.
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1611873611-156687-4-git-send-email-kan.liang@linux.intel.com
      1ab5f235
    • perf/x86/intel: Factor out intel_update_topdown_event() · 628d923a
      Committed by Kan Liang
      Similar to Ice Lake, the Intel Sapphire Rapids server also supports
      the topdown performance metrics feature. The difference is that the
      Intel Sapphire Rapids server extends the PERF_METRICS MSR to feature
      TMA method level 2 metrics, which will introduce 8 metrics events. The
      current icl_update_topdown_event() only checks the 4 level 1 metrics
      events.
      
      Factor out intel_update_topdown_event() to facilitate code sharing
      between Ice Lake and Sapphire Rapids.
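      
      The shape of the refactoring, per the description above (a sketch: the
      common helper takes the index of the last supported metric and each
      platform wraps it):
      
        static u64 intel_update_topdown_event(struct perf_event *event,
                                              int metric_end);
      
        static u64 icl_update_topdown_event(struct perf_event *event)
        {
                /* Ice Lake: only the four level 1 metrics */
                return intel_update_topdown_event(event,
                                                  INTEL_PMC_IDX_TD_BE_BOUND);
        }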
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1611873611-156687-3-git-send-email-kan.liang@linux.intel.com
      628d923a
    • perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT · 2a6c6b7d
      Committed by Kan Liang
      The current PERF_SAMPLE_WEIGHT sample type is very useful for
      expressing the cost of an action represented by the sample. This
      allows the profiler to scale the samples to be more informative to the
      programmer. It can also help to locate a hotspot; e.g., when profiling
      by memory latencies, the expensive loads appear higher up in the
      histograms. But the current PERF_SAMPLE_WEIGHT sample type is
      determined by a single factor, which is a problem if users want two or
      more factors to contribute to the weight. For example, the Golden Cove
      core PMU can provide both instruction latency and cache latency
      information as factors for memory profiling.
      
      For current x86 platforms, although meminfo::latency is defined as a
      u64, only the lower 32 bits contain valid data in practice (no memory
      access could last longer than 4G cycles). The upper 32 bits can be
      used to store new factors.
      
      Add a new sample type, PERF_SAMPLE_WEIGHT_STRUCT, to indicate the new
      sample weight structure. It shares the same space as the
      PERF_SAMPLE_WEIGHT sample type.
      
      Users can apply either the PERF_SAMPLE_WEIGHT sample type or the
      PERF_SAMPLE_WEIGHT_STRUCT sample type to retrieve the sample weight, but
      they cannot apply both sample types simultaneously.
      
      Currently, only X86 and PowerPC use the PERF_SAMPLE_WEIGHT sample type.
      - For PowerPC, nothing changes for the PERF_SAMPLE_WEIGHT sample type,
        and the new PERF_SAMPLE_WEIGHT_STRUCT sample type has no effect.
        PowerPC can restructure the weight field similarly later.
      - For X86, the same value will be dumped for the PERF_SAMPLE_WEIGHT
        sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample type for now.
        The following patches will apply the new factors for the
        PERF_SAMPLE_WEIGHT_STRUCT sample type.
      
      The fields in the union perf_sample_weight should be shared among
      different architectures. A generic name is required, but it's hard to
      abstract a name that applies to all architectures. For example, on x86
      the fields store various kinds of latency, while on PowerPC the field
      stores MMCRA[TECX/TECM], which is not a latency. So a general name
      prefix 'var$NUM' is used here.
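      
      The resulting uapi layout is essentially the following (the big-endian
      branch mirrors the field order):
      
        union perf_sample_weight {
                __u64   full;
        #if defined(__LITTLE_ENDIAN_BITFIELD)
                struct {
                        __u32   var1_dw;
                        __u16   var2_w;
                        __u16   var3_w;
                };
        #elif defined(__BIG_ENDIAN_BITFIELD)
                struct {
                        __u16   var3_w;
                        __u16   var2_w;
                        __u32   var1_dw;
                };
        #endif
        };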
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1611873611-156687-2-git-send-email-kan.liang@linux.intel.com
      2a6c6b7d
  2. 28 Jan 2021, 2 commits
  3. 14 Jan 2021, 2 commits
  4. 30 Dec 2020, 1 commit
  5. 23 Dec 2020, 3 commits
  6. 20 Dec 2020, 2 commits
  7. 19 Dec 2020, 1 commit
  8. 17 Dec 2020, 1 commit
  9. 16 Dec 2020, 11 commits
    • xen: remove trailing semicolon in macro definition · eef02412
      Committed by Tom Rix
      The macro use will already have a semicolon.
      Signed-off-by: Tom Rix <trix@redhat.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Link: https://lore.kernel.org/r/20201127160707.2622061-1-trix@redhat.com
      Signed-off-by: Juergen Gross <jgross@suse.com>
      eef02412
    • xen: Kconfig: nest Xen guest options · bfda93ae
      Committed by Jason Andryuk
      Moving XEN_512GB allows it to nest under XEN_PV.  That also allows
      XEN_PVH to nest under XEN as a sibling to XEN_PV and XEN_PVHVM giving:
      
      [*]   Xen guest support
      [*]     Xen PV guest support
      [*]       Limit Xen pv-domain memory to 512GB
      [*]       Xen PV Dom0 support
      [*]     Xen PVHVM guest support
      [*]     Xen PVH guest support
      Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Link: https://lore.kernel.org/r/20201014175342.152712-3-jandryuk@gmail.com
      Signed-off-by: Juergen Gross <jgross@suse.com>
      bfda93ae
    • xen: Remove Xen PVH/PVHVM dependency on PCI · 34aff145
      Committed by Jason Andryuk
      A Xen PVH domain doesn't have a PCI bus or devices, so it doesn't need
      PCI support built in.  Currently, XEN_PVH depends on XEN_PVHVM which
      depends on PCI.
      
      Introduce XEN_PVHVM_GUEST as a top-level item and change XEN_PVHVM to
      a hidden variable.  This allows XEN_PVH to depend on XEN_PVHVM without
      PCI while XEN_PVHVM_GUEST depends on PCI.
      
      In drivers/xen, compile platform-pci depending on XEN_PVHVM_GUEST since
      that pulls in the PCI dependency for linking.
      Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Link: https://lore.kernel.org/r/20201014175342.152712-2-jandryuk@gmail.com
      Signed-off-by: Juergen Gross <jgross@suse.com>
      34aff145
    • x86/xen: Convert to DEFINE_SHOW_ATTRIBUTE · e34ff4cd
      Committed by Qinglang Miao
      Use the DEFINE_SHOW_ATTRIBUTE macro to simplify the code.
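      
      The conversion pattern, with an illustrative name; the macro generates
      the open callback and the file_operations boilerplate:
      
        static int p2m_dump_show(struct seq_file *m, void *v)
        {
                /* print the debug state into m with seq_printf() */
                return 0;
        }
        DEFINE_SHOW_ATTRIBUTE(p2m_dump);
        /* then register &p2m_dump_fops with debugfs_create_file() */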
      Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Link: https://lore.kernel.org/r/20200917125547.104472-1-miaoqinglang@huawei.com
      Signed-off-by: Juergen Gross <jgross@suse.com>
      e34ff4cd
    • arch, mm: make kernel_page_present() always available · 32a0de88
      Committed by Mike Rapoport
      For architectures that enable ARCH_HAS_SET_MEMORY, having the ability
      to verify that a page is mapped in the kernel direct map can be useful
      regardless of hibernation.
      
      Add a RISC-V implementation of kernel_page_present(), update its
      forward declarations and stubs to be part of the set_memory API, and
      remove the ugly ifdefery in include/linux/mm.h around the current
      declarations of kernel_page_present().
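      
      The declaration-plus-stub then lives with the set_memory API, roughly
      as follows (paraphrased from the description; the stub conservatively
      reports pages as present when the architecture cannot unmap them):
      
        #ifdef CONFIG_ARCH_HAS_SET_MEMORY   /* guard per the commit text */
        bool kernel_page_present(struct page *page);
        #else
        static inline bool kernel_page_present(struct page *page)
        {
                return true;
        }
        #endif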
      
      Link: https://lkml.kernel.org/r/20201109192128.960-5-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32a0de88
    • arch, mm: restore dependency of __kernel_map_pages() on DEBUG_PAGEALLOC · 5d6ad668
      Committed by Mike Rapoport
      The design of DEBUG_PAGEALLOC presumes that __kernel_map_pages() must
      never fail.  With this assumption it wouldn't be safe to allow general
      usage of this function.
      
      Moreover, some architectures that implement __kernel_map_pages() have this
      function guarded by #ifdef DEBUG_PAGEALLOC and some refuse to map/unmap
      pages when page allocation debugging is disabled at runtime.
      
      As all the users of __kernel_map_pages() were converted to use
      debug_pagealloc_map_pages() it is safe to make it available only when
      DEBUG_PAGEALLOC is set.
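      
      For reference, the wrapper those users were converted to looks
      essentially like this:
      
        static inline void debug_pagealloc_map_pages(struct page *page,
                                                     int numpages)
        {
                if (debug_pagealloc_enabled_static())
                        __kernel_map_pages(page, numpages, 1);
        }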
      
      Link: https://lkml.kernel.org/r/20201109192128.960-4-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d6ad668
    • mm: forbid splitting special mappings · 871402e0
      Committed by Dmitry Safonov
      Don't allow splitting of vm_special_mappings.  This affects the
      vdso/vvar areas.  Uprobes have only one page in xol_area, so they
      aren't affected.
      
      Those restrictions were previously enforced by checks in .mremap()
      callbacks.  Restrict resizing with a generic .split() callback
      instead.
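      
      The callback amounts to an unconditional refusal, wired up via .split
      in the special-mapping vm_operations (close to the actual patch):
      
        static int special_mapping_split(struct vm_area_struct *vma,
                                         unsigned long addr)
        {
                /*
                 * Forbid splitting special mappings - the kernel has
                 * expectations about the number of pages in the mapping.
                 */
                return -EINVAL;
        }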
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-7-dima@arista.com
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      871402e0
    • mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio · cd544fd1
      Committed by Dmitry Safonov
      As the kernel expects to see only one of such mappings, any further
      operations on the VMA copy may be unexpected by the kernel.  Maybe
      it's erring on the safe side, but there doesn't seem to be any
      expected use-case for this, so restrict it now.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-4-dima@arista.com
      Fixes: commit e346b381 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd544fd1
    • x86: mremap speedup - Enable HAVE_MOVE_PUD · be37c98d
      Committed by Kalesh Singh
      HAVE_MOVE_PUD enables remapping pages at the PUD level if both the
      source and destination addresses are PUD-aligned.
      
      With HAVE_MOVE_PUD enabled it can be inferred that there is
      approximately a 13x improvement in performance on x86.  (See data
      below).
      
      ------- Test Results ---------
      
      The following results were obtained using a 5.4 kernel, by remapping
      a PUD-aligned, 1GB sized region to a PUD-aligned destination.
      The results from 10 iterations of the test are given below:
      
      Total mremap times for 1GB data on x86. All times are in nanoseconds.
      
        Control        HAVE_MOVE_PUD
      
        180394         15089
        235728         14056
        238931         25741
        187330         13838
        241742         14187
        177925         14778
        182758         14728
        160872         14418
        205813         15107
        245722         13998
      
        205721.5       15594    <-- Mean time in nanoseconds
      
      A 1GB mremap completion time drops from ~205 microseconds
      to ~15 microseconds on x86. (~13x speed up).
      
      Link: https://lkml.kernel.org/r/20201014005320.2233162-6-kaleshsingh@google.com
      Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Hassan Naveed <hnaveed@wavecomp.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be37c98d
    • mm/gup: prevent gup_fast from racing with COW during fork · 57efa1fe
      Committed by Jason Gunthorpe
      Since commit 70e806e4 ("mm: Do early cow for pinned pages during
      fork() for ptes") pages under a FOLL_PIN will not be write protected
      during COW for fork.  This means that pages returned from
      pin_user_pages(FOLL_WRITE) should not become write protected while the pin
      is active.
      
      However, there is a small race where get_user_pages_fast(FOLL_PIN) can
      establish a FOLL_PIN at the same time copy_present_page() is write
      protecting it:
      
              CPU 0                             CPU 1
         get_user_pages_fast()
          internal_get_user_pages_fast()
                                             copy_page_range()
                                               pte_alloc_map_lock()
                                                 copy_present_page()
                                                   atomic_read(has_pinned) == 0
                                                   page_maybe_dma_pinned() == false
           atomic_set(has_pinned, 1);
           gup_pgd_range()
            gup_pte_range()
             pte_t pte = gup_get_pte(ptep)
             pte_access_permitted(pte)
             try_grab_compound_head()
                                                   pte = pte_wrprotect(pte)
                                                   set_pte_at();
                                               pte_unmap_unlock()
            // GUP now returns with a write protected page
      
      The first attempt to resolve this by using write protection caused
      problems (and was missing a barrier); see commit f3c64eda ("mm: avoid
      early COW write protect games during fork()").
      
      Instead wrap copy_p4d_range() with the write side of a seqcount and check
      the read side around gup_pgd_range().  If there is a collision then
      get_user_pages_fast() fails and falls back to slow GUP.
      
      Slow GUP is safe against this race because copy_page_range() is only
      called while holding the exclusive side of the mmap_lock on the src
      mm_struct.
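      
      A condensed sketch of the two sides (simplified from the patch; error
      handling elided):
      
        /* fork() side, in copy_page_range(), mmap_lock held for write: */
        mmap_assert_write_locked(src_mm);
        raw_write_seqcount_begin(&src_mm->write_protect_seq);
        ret = copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd, addr, next);
        raw_write_seqcount_end(&src_mm->write_protect_seq);
      
        /* gup_fast side, FOLL_PIN only: */
        seq = raw_read_seqcount(&current->mm->write_protect_seq);
        if (seq & 1)        /* a writer is active: fall back to slow GUP */
                return -EAGAIN;
        gup_pgd_range(addr, end, gup_flags, pages, &nr_pinned);
        if (read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
                unpin_user_pages(pages, nr_pinned);  /* raced with fork() */
                return 0;                            /* slow GUP will retry */
        }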
      
      [akpm@linux-foundation.org: coding style fixes]
        Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com
      
      Link: https://lkml.kernel.org/r/2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
      Fixes: f3c64eda ("mm: avoid early COW write protect games during fork()")
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Acked-by: "Ahmed S. Darwish" <a.darwish@linutronix.de>	[seqcount_t parts]
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Leon Romanovsky <leonro@nvidia.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57efa1fe
    • KVM: SVM: Add AP_JUMP_TABLE support in prep for AP booting · 8640ca58
      Committed by Tom Lendacky
      The GHCB specification requires the hypervisor to save the address of an
      AP Jump Table so that, for example, vCPUs that have been parked by UEFI
      can be started by the OS. Provide support for the AP Jump Table set/get
      exit code.
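      
      The handling in the VMGEXIT dispatcher is roughly the following
      (sub-code 0 sets the address, 1 reads it back; a sketch based on the
      GHCB exit codes):
      
        case SVM_VMGEXIT_AP_JUMP_TABLE: {
                struct kvm_sev_info *sev =
                        &to_kvm_svm(svm->vcpu.kvm)->sev_info;
      
                switch (control->exit_info_1) {
                case 0:     /* set AP jump table address */
                        sev->ap_jump_table = control->exit_info_2;
                        break;
                case 1:     /* get AP jump table address */
                        ghcb_set_sw_exit_info_2(ghcb, sev->ap_jump_table);
                        break;
                }
                break;
        }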
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8640ca58
  10. 15 Dec 2020, 13 commits
    • genirq: Move irq_has_action() into core code · a313357e
      Committed by Thomas Gleixner
      This function uses irq_to_desc() and is going to be used by modules to
      replace open-coded irq_to_desc() (ab)usage. The final goal is to
      remove the export of irq_to_desc() so drivers cannot fiddle with it
      anymore.
      
      Move it into the core code and fix up the usage sites to include the
      proper header.
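      
      The moved helper is small; roughly:
      
        bool irq_has_action(unsigned int irq)
        {
                bool res;
      
                rcu_read_lock();
                res = irq_desc_has_action(irq_to_desc(irq));
                rcu_read_unlock();
                return res;
        }
        EXPORT_SYMBOL_GPL(irq_has_action);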
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20201210194042.548936472@linutronix.de
      
      a313357e
    • KVM: SVM: Provide support to launch and run an SEV-ES guest · ad73109a
      Committed by Tom Lendacky
      An SEV-ES guest is started by invoking a new SEV initialization ioctl,
      KVM_SEV_ES_INIT. This identifies the guest as an SEV-ES guest, which is
      used to drive the appropriate ASID allocation, VMSA encryption, etc.
      
      Before being able to run an SEV-ES vCPU, the vCPU VMSA must be encrypted
      and measured. This is done using the LAUNCH_UPDATE_VMSA command after all
      calls to LAUNCH_UPDATE_DATA have been performed, but before LAUNCH_MEASURE
      has been performed. In order to establish the encrypted VMSA, the current
      (traditional) VMSA and the GPRs are synced to the page that will hold the
      encrypted VMSA and then LAUNCH_UPDATE_VMSA is invoked. The vCPU is then
      marked as having protected guest state.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <e9643245adb809caf3a87c09997926d2f3d6ff41.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ad73109a
    • KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests · 16809ecd
      Committed by Tom Lendacky
      The run sequence is different for an SEV-ES guest compared to a legacy or
      even an SEV guest. The guest vCPU register state of an SEV-ES guest will
      be restored on VMRUN and saved on VMEXIT. There is no need to restore the
      guest registers directly and through VMLOAD before VMRUN and no need to
      save the guest registers directly and through VMSAVE on VMEXIT.
      
      Update the svm_vcpu_run() function to skip register state saving and
      restoring, and provide an alternative function for running an SEV-ES
      guest in vmenter.S.
      
      Additionally, certain host state is restored across an SEV-ES VMRUN. As
      a result certain register states are not required to be restored upon
      VMEXIT (e.g. FS, GS, etc.), so only do that if the guest is not an SEV-ES
      guest.
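      
      The dispatch in svm_vcpu_run() then looks roughly like:
      
        if (sev_es_guest(svm->vcpu.kvm))
                /* no VMLOAD/VMSAVE, no direct GPR save/restore */
                __svm_sev_es_vcpu_run(svm->vmcb_pa);
        else
                __svm_vcpu_run(svm->vmcb_pa,
                               (unsigned long *)&svm->vcpu.arch.regs);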
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <fb1c66d32f2194e171b95fc1a8affd6d326e10c1.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      16809ecd
    • KVM: SVM: Provide support for SEV-ES vCPU loading · 86137773
      Committed by Tom Lendacky
      An SEV-ES vCPU requires additional VMCB vCPU load/put requirements. SEV-ES
      hardware will restore certain registers on VMEXIT, but not save them on
      VMRUN (see Table B-3 and Table B-4 of the AMD64 APM Volume 2), so make the
      following changes:
      
      General vCPU load changes:
        - During vCPU loading, perform a VMSAVE to the per-CPU SVM save area and
          save the current values of XCR0, XSS and PKRU to the per-CPU SVM save
          area as these registers will be restored on VMEXIT.
      
      General vCPU put changes:
        - Do not attempt to restore registers that SEV-ES hardware has already
          restored on VMEXIT.
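      
      A sketch of the load side (offsets and helper names follow the SVM
      code of that era; treat the details as illustrative):
      
        /* Hardware does not VMSAVE on VMRUN for SEV-ES, so refresh the
         * per-CPU host save area and stash what VMEXIT will restore. */
        asm volatile("vmsave" : : "a" (page_to_phys(sd->save_area)) : "memory");
      
        hostsa = (struct vmcb_save_area *)
                 (page_address(sd->save_area) + 0x400);
        hostsa->xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
        hostsa->xss  = host_xss;
        hostsa->pkru = read_pkru();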
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <019390e9cb5e93cd73014fa5a040c17d42588733.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      86137773
    • KVM: SVM: Provide support for SEV-ES vCPU creation/loading · 376c6d28
      Committed by Tom Lendacky
      An SEV-ES vCPU has additional VMCB initialization requirements for
      vCPU creation, as well as vCPU load/put requirements. This includes:
      
      General VMCB initialization changes:
        - Set a VMCB control bit to enable SEV-ES support on the vCPU.
        - Set the VMCB encrypted VM save area address.
        - CRx registers are part of the encrypted register state and cannot be
          updated. Remove the CRx register read and write intercepts and replace
          them with CRx register write traps to track the CRx register values.
        - Certain MSR values are part of the encrypted register state and cannot
          be updated. Remove certain MSR intercepts (EFER, CR_PAT, etc.).
        - Remove the #GP intercept (no support for "enable_vmware_backdoor").
        - Remove the XSETBV intercept since the hypervisor cannot modify XCR0.
      
      General vCPU creation changes:
        - Set the initial GHCB gpa value as per the GHCB specification.
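      
      Condensed from the list above (field and flag names as in the SVM code
      of the time; a sketch, not the full patch):
      
        svm->vmcb->control.nested_ctl |= SVM_NESTED_CTL_SEV_ES_ENABLE;
        svm->vmcb->control.vmsa_pa = __pa(svm->vmsa);  /* encrypted save area */
        svm_clr_intercept(svm, INTERCEPT_XSETBV);  /* HV cannot modify XCR0 */
        clr_exception_intercept(svm, GP_VECTOR);   /* no VMware backdoor */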
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <3a8aef366416eddd5556dfa3fdc212aafa1ad0a2.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      376c6d28
    • KVM: SVM: Update ASID allocation to support SEV-ES guests · 80675b3a
      Committed by Tom Lendacky
      SEV and SEV-ES guests each have dedicated ASID ranges. Update the ASID
      allocation routine to return an ASID in the respective range.
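      
      The selection reduces to picking the bounds by guest type (sketch;
      SEV-ES ASIDs run from 1 to min_sev_asid - 1, plain SEV ASIDs from
      min_sev_asid to max_sev_asid):
      
        min_asid = sev->es_active ? 1 : min_sev_asid;
        max_asid = sev->es_active ? min_sev_asid - 1 : max_sev_asid;
        asid = find_next_zero_bit(sev_asid_bitmap, max_asid + 1, min_asid);
        if (asid > max_asid)
                return -EBUSY;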
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <d7aed505e31e3954268b2015bb60a1486269c780.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      80675b3a
    • KVM: SVM: Set the encryption mask for the SVM host save area · 85ca8be9
      Committed by Tom Lendacky
      The SVM host save area is used to restore some host state on VMEXIT of an
      SEV-ES guest. After allocating the save area, clear it and add the
      encryption mask to the SVM host save area physical address that is
      programmed into the VM_HSAVE_PA MSR.
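      
      Which comes down to a one-liner (sketch; __sme_page_pa() folds the
      encryption mask into the page's physical address):
      
        wrmsrl(MSR_VM_HSAVE_PA, __sme_page_pa(sd->save_area));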
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <b77aa28af6d7f1a0cb545959e08d6dc75e0c3cba.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      85ca8be9
    • KVM: SVM: Add NMI support for an SEV-ES guest · 4444dfe4
      Committed by Tom Lendacky
      The GHCB specification defines how NMIs are to be handled for an SEV-ES
      guest. To detect the completion of an NMI the hypervisor must not
      intercept the IRET instruction (because a #VC while running the NMI will
      issue an IRET) and, instead, must receive an NMI Complete exit event from
      the guest.
      
      Update the KVM support for detecting the completion of NMIs in the guest
      to follow the GHCB specification. When an SEV-ES guest is active, the
      IRET instruction will no longer be intercepted. Now, when the NMI Complete
      exit event is received, the iret_interception() function will be called
      to simulate the completion of the NMI.
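      
      In sketch form:
      
        /* at intercept setup: */
        if (!sev_es_guest(svm->vcpu.kvm))
                svm_set_intercept(svm, INTERCEPT_IRET);
      
        /* in the VMGEXIT dispatcher: */
        case SVM_VMGEXIT_NMI_COMPLETE:
                ret = svm_invoke_exit_handler(svm, SVM_EXIT_IRET);
                break;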
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <5ea3dd69b8d4396cefdc9048ebc1ab7caa70a847.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4444dfe4
    • KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest · ed02b213
      Committed by Tom Lendacky
      The guest FPU state is automatically restored on VMRUN and saved on VMEXIT
      by the hardware, so there is no reason to do this in KVM. Eliminate the
      allocation of the guest_fpu save area and key off that to skip operations
      related to the guest FPU state.
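      
      The FPU swap paths then key off the missing allocation (sketch):
      
        void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
        {
                /* SEV-ES: no guest_fpu area; hardware swaps the state */
                if (!vcpu->arch.guest_fpu)
                        return;
      
                /* ... normal XSAVE/XRSTOR handling follows ... */
        }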
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <173e429b4d0d962c6a443c4553ffdaf31b7665a4.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ed02b213
    • KVM: SVM: Do not report support for SMM for an SEV-ES guest · 5719455f
      Committed by Tom Lendacky
      SEV-ES guests do not currently support SMM. Update the has_emulated_msr()
      kvm_x86_ops function to take a struct kvm parameter so that the capability
      can be reported at a VM level.
      
      Since this op is also called during KVM initialization and before a struct
      kvm instance is available, comments will be added to each implementation
      of has_emulated_msr() to indicate the kvm parameter can be null.
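      
      Sketch of the SVM implementation after the change (kvm may be NULL
      when called during init):
      
        static bool svm_has_emulated_msr(struct kvm *kvm, u32 index)
        {
                switch (index) {
                case MSR_IA32_SMBASE:
                        /* SEV-ES guests do not support SMM */
                        if (kvm && sev_es_guest(kvm))
                                return false;
                        break;
                default:
                        break;
                }
                return true;
        }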
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <75de5138e33b945d2fb17f81ae507bda381808e3.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5719455f
    • KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES · 5265713a
      Committed by Tom Lendacky
      Since many of the registers used by SEV-ES guests are encrypted and
      cannot be read or written, adjust __get_sregs() / __set_sregs() to
      take into account whether the VMSA/guest state is encrypted.
      
      For __get_sregs(), return the actual value that is in use by the guest
      for all registers being tracked using the write trap support.
      
      For __set_sregs(), skip setting of all guest register values.
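      
      Sketch of the guarded write path (guest_state_protected is the flag
      this series introduces; the real code still handles the interrupt
      bitmap after the skip):
      
        static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
        {
                if (vcpu->arch.guest_state_protected)
                        return 0;   /* VMSA is encrypted: nothing to write */
      
                /* ... normal segment/control register updates ... */
                return 0;
        }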
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <23051868db76400a9b07a2020525483a1e62dbcf.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5265713a
    • KVM: SVM: Add support for CR8 write traps for an SEV-ES guest · d1949b93
      Committed by Tom Lendacky
      For SEV-ES guests, the interception of control register write access
      is not recommended. Control register interception occurs prior to the
      control register being modified and the hypervisor is unable to modify
      the control register itself because the register is located in the
      encrypted register state.
      
      SEV-ES guests introduce new control register write traps. These traps
      provide intercept support of a control register write after the control
      register has been modified. The new control register value is provided in
      the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
      of the guest control registers.
      
      Add support to track the value of the guest CR8 register using the control
      register write trap so that the hypervisor understands the guest operating
      mode.
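      
      Sketch of the trap handling (the new value arrives in EXITINFO1):
      
        static int cr8_write_trap(struct vcpu_svm *svm)
        {
                unsigned long new_value = svm->vmcb->control.exit_info_1;
      
                kvm_set_cr8(&svm->vcpu, new_value);
                return kvm_complete_insn_gp(&svm->vcpu, 0);
        }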
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <5a01033f4c8b3106ca9374b7cadf8e33da852df1.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d1949b93
    • KVM: SVM: Add support for CR4 write traps for an SEV-ES guest · 5b51cb13
      Committed by Tom Lendacky
      For SEV-ES guests, the interception of control register write access
      is not recommended. Control register interception occurs prior to the
      control register being modified and the hypervisor is unable to modify
      the control register itself because the register is located in the
      encrypted register state.
      
      SEV-ES guests introduce new control register write traps. These traps
      provide intercept support of a control register write after the control
      register has been modified. The new control register value is provided in
      the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
      of the guest control registers.
      
      Add support to track the value of the guest CR4 register using the control
      register write trap so that the hypervisor understands the guest operating
      mode.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <c3880bf2db8693aa26f648528fbc6e967ab46e25.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5b51cb13