1. 02 Apr, 2022 6 commits
  2. 21 Mar, 2022 1 commit
    • KVM: x86: Introduce KVM_CAP_DISABLE_QUIRKS2 · 6d849191
      Oliver Upton committed
      KVM_CAP_DISABLE_QUIRKS is irrevocably broken. The capability does not
      advertise the set of quirks which may be disabled to userspace, so it is
      impossible to predict the behavior of KVM. Worse yet,
      KVM_CAP_DISABLE_QUIRKS will tolerate any value for cap->args[0], meaning
      it fails to reject attempts to set invalid quirk bits.
      
      The only valid workaround for the quirky quirks API is to add a new CAP.
      Actually advertise the set of quirks that can be disabled to userspace
      so it can predict KVM's behavior. Reject values for cap->args[0] that
      contain invalid bits.
      
      Finally, add documentation for the new capability and describe the
      existing quirks.
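
The difference between the two capabilities can be modeled in a few lines. This is only a sketch of the validation idea, not the kernel's code: the function names and the `SUPPORTED_QUIRKS` value are hypothetical, and it assumes the advertised set reaches userspace as a bitmask.

```python
import errno

# Hypothetical mask for illustration only; a real VMM would learn the
# advertised set from the kernel rather than hardcode it.
SUPPORTED_QUIRKS = 0b0011_1111


def enable_disable_quirks_legacy(requested: int) -> int:
    """Model of the old behavior: any value for args[0] is tolerated."""
    return 0


def enable_disable_quirks2(requested: int, supported: int = SUPPORTED_QUIRKS) -> int:
    """Model of the new behavior: reject bits outside the advertised set."""
    if requested & ~supported:
        return -errno.EINVAL
    return 0
```

With the new-style check, a VMM that asks for an unknown bit gets an error instead of silently undefined behavior.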
      Signed-off-by: Oliver Upton <oupton@google.com>
      Message-Id: <20220301060351.442881-5-oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6d849191
  3. 25 Feb, 2022 1 commit
  4. 22 Feb, 2022 2 commits
  5. 14 Feb, 2022 4 commits
  6. 03 Feb, 2022 1 commit
  7. 02 Feb, 2022 2 commits
  8. 31 Jan, 2022 1 commit
  9. 28 Jan, 2022 2 commits
  10. 26 Jan, 2022 1 commit
  11. 20 Jan, 2022 3 commits
  12. 16 Jan, 2022 1 commit
  13. 15 Jan, 2022 5 commits
    • mm/mempolicy: wire up syscall set_mempolicy_home_node · 21b084fd
      Aneesh Kumar K.V committed
      Link: https://lkml.kernel.org/r/20211202123810.267175-4-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Ben Widawsky <ben.widawsky@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21b084fd
    • mm: add a field to store names for private anonymous memory · 9a10064f
      Colin Cross committed
      In many userspace applications, and especially in the VM-based
      applications that Android uses heavily, there are multiple different
      allocators in use.  At a minimum there is libc malloc and the stack, and in many cases
      there are libc malloc, the stack, direct syscalls to mmap anonymous
      memory, and multiple VM heaps (one for small objects, one for big
      objects, etc.).  Each of these layers usually has its own tools to
      inspect its usage; malloc by compiling a debug version, the VM through
      heap inspection tools, and for direct syscalls there is usually no way
      to track them.
      
      On Android we heavily use a set of tools that use an extended version of
      the logic covered in Documentation/vm/pagemap.txt to walk all pages
      mapped in userspace and slice their usage by process, shared (COW) vs.
      unique mappings, backing, etc.  This can account for real physical
      memory usage even in cases like fork without exec (which Android uses
      heavily to share as many private COW pages as possible between
      processes), Kernel SamePage Merging, and clean zero pages.  It produces
      a measurement of the pages that only exist in that process (USS, for
      unique), and a measurement of the physical memory usage of that process
      with the cost of shared pages being evenly split between processes that
      share them (PSS).
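
The two measurements can be illustrated with a toy model. This is a sketch under stated assumptions, not the Android tooling: `page_owners` is a hypothetical map from a physical page number to the set of processes mapping it, and sizes are counted in pages.

```python
from fractions import Fraction


def uss_and_pss(page_owners: dict[int, set[str]], process: str) -> tuple[int, Fraction]:
    """Return (USS, PSS) in pages for one process.

    USS counts pages mapped only by this process; PSS charges each mapped
    page as 1/N, where N is the number of processes sharing it.
    """
    uss = 0
    pss = Fraction(0)
    for owners in page_owners.values():
        if process not in owners:
            continue
        if len(owners) == 1:
            uss += 1
        pss += Fraction(1, len(owners))
    return uss, pss


# Toy data: page 0 is private to "a", page 1 is shared by two processes,
# page 2 by four, and page 3 is private to "b".
pages = {0: {"a"}, 1: {"a", "b"}, 2: {"a", "b", "c", "d"}, 3: {"b"}}
print(uss_and_pss(pages, "a"))
```

Summing PSS over all processes gives the total number of distinct pages, which is what makes it useful for system-wide accounting.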
      
      If all anonymous memory is indistinguishable then figuring out the real
      physical memory usage (PSS) of each heap requires either a pagemap
      walking tool that can understand the heap debugging of every layer, or
      for every layer's heap debugging tools to implement the pagemap walking
      logic, in which case it is hard to get a consistent view of memory
      across the whole system.
      
      Tracking the information in userspace leads to all sorts of problems.
      It either needs to be stored inside the process, which means every
      process has to have an API to export its current heap information upon
      request, or it has to be stored externally in a filesystem that somebody
      needs to clean up on crashes.  It needs to be readable while the process
      is still running, so it has to have some sort of synchronization with
      every layer of userspace.  Efficiently tracking the ranges requires
      reimplementing something like the kernel vma trees, and linking to it
      from every layer of userspace.  It requires more memory, more syscalls,
      more runtime cost, and more complexity to separately track regions that
      the kernel is already tracking.
      
      This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
      userspace-provided name for anonymous vmas.  The names of named
      anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
      [anon:<name>].
      
      Userspace can set the name for a region of memory by calling
      
         prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name)
      
      Setting the name to NULL clears it.  The name length limit is 80 bytes
      including NUL-terminator and is checked to contain only printable ascii
      characters (including space), except '[',']','\','$' and '`'.
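
The naming rules above can be checked in userspace before calling prctl(). The helper below is a hypothetical sketch of those rules as stated in this message, not a copy of the kernel's validation; in particular, how an empty name is treated is not specified here, so the sketch simply lets it through.

```python
MAX_NAME_BYTES = 80          # limit includes the NUL terminator
FORBIDDEN = set(b"[]\\$`")   # byte values of '[', ']', '\', '$' and '`'


def is_valid_anon_vma_name(name: bytes) -> bool:
    """Check a candidate name (without its NUL terminator) against the rules."""
    if len(name) + 1 > MAX_NAME_BYTES:
        return False
    # Printable ASCII runs from space (0x20) to '~' (0x7E).
    return all(0x20 <= byte <= 0x7E and byte not in FORBIDDEN for byte in name)
```

A name that passes would then appear in /proc/pid/maps as `[anon:<name>]`.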
      
      ASCII strings are used to provide descriptive identifiers for vmas,
      which can be understood by the users reading /proc/pid/maps or
      /proc/pid/smaps.  Names can be standardized for a given system and they
      can include some variable parts such as the name of the allocator or a
      library, tid of the thread using it, etc.
      
      The name is stored in a pointer in the shared union in vm_area_struct
      that points to a null-terminated string.  Anonymous vmas that have the
      same name (equivalent strings) and are otherwise mergeable will be merged.
      The name pointers are not shared between vmas even if they contain the
      same name.  The name pointer is stored in a union with fields that are
      only used on file-backed mappings, so it does not increase memory usage.
      
      CONFIG_ANON_VMA_NAME kernel configuration is introduced to enable this
      feature.  It keeps the feature disabled by default to prevent any
      additional memory overhead and to avoid confusing procfs parsers on
      systems which are not ready to support named anonymous vmas.
      
      The patch is based on the original patch developed by Colin Cross, more
      specifically on its latest version [1] posted upstream by Sumit Semwal.
      It used a userspace pointer to store vma names.  In that design, name
      pointers could be shared between vmas.  However during the last
      upstreaming attempt, Kees Cook raised concerns [2] about this approach
      and suggested to copy the name into kernel memory space, perform
      validity checks [3] and store as a string referenced from
      vm_area_struct.
      
      One big concern is about fork() performance which would need to strdup
      anonymous vma names.  Dave Hansen suggested experimenting with
      worst-case scenario of forking a process with 64k vmas having longest
      possible names [4].  I ran this experiment on an ARM64 Android device
      and recorded a worst-case regression of almost 40% when forking such a
      process.
      
      This regression is addressed in the followup patch which replaces the
      pointer to a name with a refcounted structure that allows sharing the
      name pointer between vmas of the same name.  Instead of duplicating the
      string during fork() or when splitting a vma it increments the refcount.
      
      [1] https://lore.kernel.org/linux-mm/20200901161459.11772-4-sumit.semwal@linaro.org/
      [2] https://lore.kernel.org/linux-mm/202009031031.D32EF57ED@keescook/
      [3] https://lore.kernel.org/linux-mm/202009031022.3834F692@keescook/
      [4] https://lore.kernel.org/linux-mm/5d0358ab-8c47-2f5f-8e43-23b89d6a8e95@intel.com/
      
      Changes for prctl(2) manual page (in the options section):
      
      PR_SET_VMA
      	Sets an attribute specified in arg2 for virtual memory areas
      	starting from the address specified in arg3 and spanning the
      	size specified in arg4. arg5 specifies the value of the attribute
      	to be set. Note that assigning an attribute to a virtual memory
      	area might prevent it from being merged with adjacent virtual
      	memory areas due to the difference in that attribute's value.
      
      	Currently, arg2 must be one of:
      
      	PR_SET_VMA_ANON_NAME
      		Set a name for anonymous virtual memory areas. arg5 should
      		be a pointer to a null-terminated string containing the
      		name. The name length including null byte cannot exceed
      		80 bytes. If arg5 is NULL, the name of the appropriate
      		anonymous virtual memory areas will be reset. The name
      		can contain only printable ascii characters (including
      		space), except '[',']','\','$' and '`'.
      
      		This feature is available only if the kernel is built with
      		the CONFIG_ANON_VMA_NAME option enabled.
      
      [surenb@google.com: docs: proc.rst: /proc/PID/maps: fix malformed table]
        Link: https://lkml.kernel.org/r/20211123185928.2513763-1-surenb@google.com
      [surenb: rebased over v5.15-rc6, replaced userpointer with a kernel copy,
       added input sanitization and CONFIG_ANON_VMA_NAME config. The bulk of the
       work here was done by Colin Cross, therefore, with his permission, keeping
       him as the author]
      
      Link: https://lkml.kernel.org/r/20211019215511.3771969-2-surenb@google.com
      Signed-off-by: Colin Cross <ccross@google.com>
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Glauber <jan.glauber@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rob Landley <rob@landley.net>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a10064f
    • vdpa: Support reporting max device capabilities · cd2629f6
      Eli Cohen committed
      Add max_supported_vqs and supported_features fields to struct
      vdpa_mgmt_dev. Upstream drivers need to fill these values according to
      the device capabilities.
      
      These values are reported back in a netlink message when showing management
      devices.
      
      Examples:
      
      $ auxiliary/mlx5_core.sf.1:
        supported_classes net
        max_supported_vqs 257
        dev_features CSUM GUEST_CSUM MTU HOST_TSO4 HOST_TSO6 STATUS CTRL_VQ MQ \
                     CTRL_MAC_ADDR VERSION_1 ACCESS_PLATFORM
      
      $ vdpa -j mgmtdev show
      {"mgmtdev":{"auxiliary/mlx5_core.sf.1":{"supported_classes":["net"], \
        "max_supported_vqs":257,"dev_features":["CSUM","GUEST_CSUM","MTU", \
        "HOST_TSO4","HOST_TSO6","STATUS","CTRL_VQ","MQ","CTRL_MAC_ADDR", \
        "VERSION_1","ACCESS_PLATFORM"]}}}
      
      $ vdpa -jp mgmtdev show
      {
          "mgmtdev": {
              "auxiliary/mlx5_core.sf.1": {
                  "supported_classes": [ "net" ],
                  "max_supported_vqs": 257,
                  "dev_features": ["CSUM","GUEST_CSUM","MTU","HOST_TSO4", \
                                   "HOST_TSO6","STATUS","CTRL_VQ","MQ", \
                                   "CTRL_MAC_ADDR","VERSION_1","ACCESS_PLATFORM"]
              }
          }
      }
      Signed-off-by: Eli Cohen <elic@nvidia.com>
      Link: https://lore.kernel.org/r/20220105114646.577224-11-elic@nvidia.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Si-Wei Liu <si-wei.liu@oracle.com>
      cd2629f6
    • vdpa: Add support for returning device configuration information · 612f330e
      Eli Cohen committed
      Add netlink attribute to store the negotiated features. This can be used
      by userspace to get the current state of the vdpa instance.
      
      Examples:
      
      $ vdpa dev config show vdpa-a
      vdpa-a: mac 00:00:00:00:88:88 link up link_announce false max_vq_pairs 16 mtu 1500
        negotiated_features CSUM GUEST_CSUM MTU MAC HOST_TSO4 HOST_TSO6 STATUS \
        CTRL_VQ MQ CTRL_MAC_ADDR VERSION_1 ACCESS_PLATFORM
      
      $ vdpa -j dev config show vdpa-a
      {"config":{"vdpa-a":{"mac":"00:00:00:00:88:88","link ":"up","link_announce":false, \
       "max_vq_pairs":16,"mtu":1500,"negotiated_features":["CSUM","GUEST_CSUM","MTU","MAC", \
       "HOST_TSO4","HOST_TSO6","STATUS","CTRL_VQ","MQ","CTRL_MAC_ADDR","VERSION_1", \
       "ACCESS_PLATFORM"]}}}
      
      $ vdpa -jp dev config show vdpa-a
      {
          "config": {
              "vdpa-a": {
                  "mac": "00:00:00:00:88:88",
                  "link ": "up",
                  "link_announce ": false,
                  "max_vq_pairs": 16,
                  "mtu": 1500,
                  "negotiated_features": [
      "CSUM","GUEST_CSUM","MTU","MAC","HOST_TSO4","HOST_TSO6","STATUS","CTRL_VQ","MQ", \
      "CTRL_MAC_ADDR","VERSION_1","ACCESS_PLATFORM"
      ]
              }
          }
      }
      Signed-off-by: Eli Cohen <elic@nvidia.com>
      Link: https://lore.kernel.org/r/20220105114646.577224-9-elic@nvidia.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      612f330e
    • kvm: x86: Add support for getting/setting expanded xstate buffer · be50b206
      Guang Zeng committed
      With KVM_CAP_XSAVE, userspace uses a hardcoded 4KB buffer to get/set
      xstate data from/to KVM. This doesn't work when dynamic xfeatures
      (e.g. AMX) are exposed to the guest as they require a larger buffer
      size.
      
      Introduce a new capability (KVM_CAP_XSAVE2). Userspace VMM gets the
      required xstate buffer size via KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2).
      KVM_SET_XSAVE is extended to work with both legacy and new capabilities
      by doing properly-sized memdup_user() based on the guest fpu container.
      KVM_GET_XSAVE is kept unchanged for backward compatibility; instead,
      KVM_GET_XSAVE2 is introduced under KVM_CAP_XSAVE2 as the preferred
      interface for getting the xstate buffer (4KB or larger) from KVM
      (Link: https://lkml.org/lkml/2021/12/15/510)
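
The resulting choice a VMM faces can be sketched as a small decision function. The helper name is hypothetical, and two points are assumptions rather than statements from this message: that an unsupported capability shows up as a result of 0, and that the buffer is never sized below the legacy 4KB.

```python
LEGACY_XSAVE_SIZE = 4096  # the hardcoded 4KB buffer used with KVM_CAP_XSAVE


def choose_xsave_interface(xsave2_size: int) -> tuple[str, int]:
    """Pick the ioctl name and buffer size from the KVM_CAP_XSAVE2 query result.

    xsave2_size stands for the value returned by
    KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2); 0 or less is assumed to mean the
    capability is unavailable.
    """
    if xsave2_size <= 0:
        return "KVM_GET_XSAVE", LEGACY_XSAVE_SIZE
    return "KVM_GET_XSAVE2", max(xsave2_size, LEGACY_XSAVE_SIZE)
```

The same size would then be used for the buffer handed to KVM_SET_XSAVE when restoring state.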
      
      Also, update the api doc with the new KVM_GET_XSAVE2 ioctl.
      Signed-off-by: Guang Zeng <guang.zeng@intel.com>
      Signed-off-by: Wei Wang <wei.w.wang@intel.com>
      Signed-off-by: Jing Liu <jing2.liu@intel.com>
      Signed-off-by: Kevin Tian <kevin.tian@intel.com>
      Signed-off-by: Yang Zhong <yang.zhong@intel.com>
      Message-Id: <20220105123532.12586-19-yang.zhong@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      be50b206
  14. 13 Jan, 2022 1 commit
  15. 12 Jan, 2022 2 commits
    • module: add in-kernel support for decompressing · b1ae6dc4
      Dmitry Torokhov committed
      The current scheme of having userspace decompress kernel modules before
      loading them into the kernel runs afoul of the LoadPin security policy, as
      it loses the link between the source of the kernel module on disk and the
      binary blob that is being loaded into the kernel. To solve this issue,
      implement decompression in the kernel, so that a file descriptor of the
      compressed module file can be passed to finit_module(), which keeps
      LoadPin happy.
      
      To let userspace know which compression/decompression scheme the kernel
      supports, it will create the /sys/module/compression attribute. kmod can
      read this attribute and decide if it can pass a compressed file to
      finit_module(). The new MODULE_INIT_COMPRESSED_DATA flag indicates that
      the kernel should attempt to decompress the data read from the file
      descriptor prior to trying to load the module.
      
      To simplify things, the kernel will only implement a single decompression
      method matching the compression method selected when generating modules.
      This patch implements gzip and xz; more can be added later.
      Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      b1ae6dc4
    • drm/amdkfd: make SPDX License expression more sound · 9b7a4de9
      Lukas Bulwahn committed
      Commit b5f57384 ("drm/amdkfd: Add sysfs bitfields and enums to uAPI")
      adds include/uapi/linux/kfd_sysfs.h with the "GPL-2.0 OR MIT WITH
      Linux-syscall-note" SPDX-License expression.
      
      The command ./scripts/spdxcheck.py warns:
      
        include/uapi/linux/kfd_sysfs.h: 1:48 Exception not valid for license MIT: Linux-syscall-note
      
      For a uapi header, the file under GPLv2 License must be combined with the
      Linux-syscall-note, but combining the MIT License with the
      Linux-syscall-note makes no sense, as the note provides an exception for
      GPL-licensed code, not for permissively licensed code.
      
      So, reorganize the SPDX expression to only combine the note with the GPL
      License condition. This makes spdxcheck happy again.
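
The problem comes from operator binding: in an SPDX expression, WITH attaches to the license immediately before it, so the original reads as "GPL-2.0 OR (MIT WITH Linux-syscall-note)". The toy checker below illustrates that reading. It is not the logic of scripts/spdxcheck.py, it only handles flat OR lists, and the reordered expression in the comment is an assumed reading of "reorganize", not quoted from the patch.

```python
def misplaced_exceptions(expression: str, exception: str = "Linux-syscall-note") -> list[str]:
    """Return the OR-terms that attach `exception` to a non-GPL license."""
    bad = []
    for term in expression.split(" OR "):
        term = term.strip("() ")
        if " WITH " not in term:
            continue
        license_id, attached = term.split(" WITH ", 1)
        if attached.strip() == exception and "GPL" not in license_id:
            bad.append(term)
    return bad


# Original expression: the note binds to MIT, which is flagged.
print(misplaced_exceptions("GPL-2.0 OR MIT WITH Linux-syscall-note"))
# Assumed reordering: the note binds to GPL-2.0, nothing is flagged.
print(misplaced_exceptions("GPL-2.0 WITH Linux-syscall-note OR MIT"))
```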
      
      Fixes: b5f57384 ("drm/amdkfd: Add sysfs bitfields and enums to uAPI")
      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Reviewed-by: kstewart@linuxfoundation.org
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      9b7a4de9
  16. 10 Jan, 2022 1 commit
  17. 08 Jan, 2022 1 commit
  18. 07 Jan, 2022 1 commit
    • KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery · 14243b38
      David Woodhouse committed
      This adds basic support for delivering 2 level event channels to a guest.
      
      Initially, it only supports delivery via the IRQ routing table, triggered
      by an eventfd. In order to do so, it has a kvm_xen_set_evtchn_fast()
      function which will use the pre-mapped shared_info page if it already
      exists and is still valid, while the slow path through the irqfd_inject
      workqueue will remap the shared_info page if necessary.
      
      It sets the bits in the shared_info page but not the vcpu_info; that is
      deferred to __kvm_xen_has_interrupt() which raises the vector to the
      appropriate vCPU.
      
      Add a 'verbose' mode to xen_shinfo_test while adding test cases for this.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20211210163625.2886-5-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      14243b38
  19. 06 Jan, 2022 2 commits
  20. 05 Jan, 2022 2 commits
    • can: netlink: report the CAN controller mode supported flags · 383f0993
      Vincent Mailhol committed
      Currently, the CAN netlink interface provides no easy ways to check
      the capabilities of a given controller. The only method from the
      command line is to try each CAN_CTRLMODE_* individually to check
      whether the netlink interface returns an -EOPNOTSUPP error or not
      (alternatively, one may find it easier to directly check the source
      code of the driver instead...)
      
      This patch introduces a method for the user to check both the
      supported and the static capabilities. The proposed method introduces
      a new IFLA nest: IFLA_CAN_CTRLMODE_EXT which extends the current
      IFLA_CAN_CTRLMODE. This is done to guarantee full forward and
      backward compatibility between the kernel and the user land
      applications.
      
      The IFLA_CAN_CTRLMODE_EXT nest contains one single entry:
      IFLA_CAN_CTRLMODE_SUPPORTED. Because this entry is only used in one
      direction: kernel to userland, no new struct nla_policy are
      introduced.
      
      The table below explains how IFLA_CAN_CTRLMODE_SUPPORTED (hereafter:
      "supported") and can_ctrlmode::flags (hereafter: "flags") allow us to
      identify both the supported and the static capabilities, when masked
      with any of the CAN_CTRLMODE_* bit flags:
      
       supported &	flags &		Controller capabilities
       CAN_CTRLMODE_*	CAN_CTRLMODE_*
       -----------------------------------------------------------------------
       false		false		Feature not supported (always disabled)
       false		true		Static feature (always enabled)
       true		false		Feature supported but disabled
       true		true		Feature supported and enabled
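
The table translates directly into a small classifier. This is an illustrative sketch: `describe_capability` is a hypothetical helper, and the two mode constants are defined locally just to exercise it.

```python
# Two CAN_CTRLMODE_* bit flags, defined locally for the example.
CAN_CTRLMODE_LOOPBACK = 0x01
CAN_CTRLMODE_LISTENONLY = 0x02


def describe_capability(supported: int, flags: int, mode_bit: int) -> str:
    """Classify one CAN_CTRLMODE_* bit using the four rows of the table."""
    is_supported = bool(supported & mode_bit)
    is_set = bool(flags & mode_bit)
    if is_supported:
        return "supported and enabled" if is_set else "supported but disabled"
    return "static (always enabled)" if is_set else "not supported (always disabled)"


# Example: loopback is configurable and currently off, while listen-only
# is a static feature of this made-up controller.
print(describe_capability(0x01, 0x02, CAN_CTRLMODE_LOOPBACK))
print(describe_capability(0x01, 0x02, CAN_CTRLMODE_LISTENONLY))
```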
      
      Link: https://lore.kernel.org/all/20211213160226.56219-5-mailhol.vincent@wanadoo.fr
      Signed-off-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
      Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
      383f0993
    • dmaengine: idxd: change MSIX allocation based on per wq activation · 403a2e23
      Dave Jiang committed
      Change the driver so that the WQ interrupt is requested only when the wq
      is being enabled. This new scheme sets things up so that
      request_threaded_irq() is only called when a kernel wq type is being
      enabled. This also prepares for future interrupt requests where a
      different interrupt handler, such as a wq occupancy interrupt, can be set
      up instead of the wq completion interrupt.
      
      Not calling request_irq() until the WQ actually needs an irq also prevents
      wasting of CPU irq vectors on x86 systems, which is a limited resource.
      
      idxd_flush_pending_descs() is moved to device.c since descriptor flushing
      is now part of wq disable rather than shutdown().
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Link: https://lore.kernel.org/r/163942149487.2412839.6691222855803875848.stgit@djiang5-desk3.ch.intel.com
      Signed-off-by: Vinod Koul <vkoul@kernel.org>
      403a2e23