1. 22 6月, 2021 1 次提交
  2. 27 5月, 2021 1 次提交
  3. 19 5月, 2021 2 次提交
  4. 10 5月, 2021 1 次提交
  5. 07 5月, 2021 1 次提交
  6. 06 5月, 2021 4 次提交
    • D
      mm/vmscan: move RECLAIM* bits to uapi header · b6676de8
      Dave Hansen 提交于
      It is currently not obvious that the RECLAIM_* bits are part of the uapi
      since they are defined in vmscan.c.  Move them to a uapi header to make it
      obvious.
      
      This should have no functional impact.
      
      Link: https://lkml.kernel.org/r/20210219172557.08074910@viggo.jf.intel.comSigned-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: NBen Widawsky <ben.widawsky@intel.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Daniel Wagner <dwagner@suse.de>
      Cc: "Tobin C. Harding" <tobin@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6676de8
    • A
      userfaultfd: add UFFDIO_CONTINUE ioctl · f6191471
      Axel Rasmussen 提交于
      This ioctl is how userspace ought to resolve "minor" userfaults.  The
      idea is, userspace is notified that a minor fault has occurred.  It
      might change the contents of the page using its second non-UFFD mapping,
      or not.  Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
      ensured the page contents are correct, carry on setting up the mapping".
      
      Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
      MINOR registered VMAs.  ZEROPAGE maps the VMA to the zero page; but in
      the minor fault case, we already have some pre-existing underlying page.
      Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
      We'd just use memcpy() or similar instead.
      
      It turns out hugetlb_mcopy_atomic_pte() already does very close to what
      we want, if an existing page is provided via `struct page **pagep`.  We
      already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
      just extend that design: add an enum for the three modes of operation,
      and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
      case.  (Basically, look up the existing page, and avoid adding the
      existing page to the page cache or calling set_page_huge_active() on
      it.)
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.comSigned-off-by: NAxel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: NPeter Xu <peterx@redhat.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f6191471
    • A
      userfaultfd: add minor fault registration mode · 7677f7fd
      Axel Rasmussen 提交于
      Patch series "userfaultfd: add minor fault handling", v9.
      
      Overview
      ========
      
      This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
      When enabled (via the UFFDIO_API ioctl), this feature means that any
      hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
      get events for "minor" faults.  By "minor" fault, I mean the following
      situation:
      
      Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
      memory).  One of the mappings is registered with userfaultfd (in minor
      mode), and the other is not.  Via the non-UFFD mapping, the underlying
      pages have already been allocated & filled with some contents.  The UFFD
      mapping has not yet been faulted in; when it is touched for the first
      time, this results in what I'm calling a "minor" fault.  As a concrete
      example, when working with hugetlbfs, we have huge_pte_none(), but
      find_lock_page() finds an existing page.
      
      We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE.  The idea
      is, userspace resolves the fault by either a) doing nothing if the
      contents are already correct, or b) updating the underlying contents using
      the second, non-UFFD mapping (via memcpy/memset or similar, or something
      fancier like RDMA, or etc...).  In either case, userspace issues
      UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
      correct, carry on setting up the mapping".
      
      Use Case
      ========
      
      Consider the use case of VM live migration (e.g. under QEMU/KVM):
      
      1. While a VM is still running, we copy the contents of its memory to a
         target machine. The pages are populated on the target by writing to the
         non-UFFD mapping, using the setup described above. The VM is still running
         (and therefore its memory is likely changing), so this may be repeated
         several times, until we decide the target is "up to date enough".
      
      2. We pause the VM on the source, and start executing on the target machine.
         During this gap, the VM's user(s) will *see* a pause, so it is desirable to
         minimize this window.
      
      3. Between the last time any page was copied from the source to the target, and
         when the VM was paused, the contents of that page may have changed - and
         therefore the copy we have on the target machine is out of date. Although we
         can keep track of which pages are out of date, for VMs with large amounts of
         memory, it is "slow" to transfer this information to the target machine. We
         want to resume execution before such a transfer would complete.
      
      4. So, the guest begins executing on the target machine. The first time it
         touches its memory (via the UFFD-registered mapping), userspace wants to
         intercept this fault. Userspace checks whether or not the page is up to date,
         and if not, copies the updated page from the source machine, via the non-UFFD
         mapping. Finally, whether a copy was performed or not, userspace issues a
         UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
         are correct, carry on setting up the mapping".
      
      We don't have to do all of the final updates on-demand. The userfaultfd manager
      can, in the background, also copy over updated pages once it receives the map of
      which pages are up-to-date or not.
      
      Interaction with Existing APIs
      ==============================
      
      Because this is a feature, a registered VMA could potentially receive both
      missing and minor faults.  I spent some time thinking through how the
      existing API interacts with the new feature:
      
      UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
      allocate a new page.  If UFFDIO_CONTINUE is used on a non-minor fault:
      
      - For non-shared memory or shmem, -EINVAL is returned.
      - For hugetlb, -EFAULT is returned.
      
      UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
      Without modifications, the existing codepath assumes a new page needs to
      be allocated.  This is okay, since userspace must have a second
      non-UFFD-registered mapping anyway, thus there isn't much reason to want
      to use these in any case (just memcpy or memset or similar).
      
      - If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
      - If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
        in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
      - UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
        -ENOENT in that case (regardless of the kind of fault).
      
      Future Work
      ===========
      
      This series only supports hugetlbfs.  I have a second series in flight to
      support shmem as well, extending the functionality.  This series is more
      mature than the shmem support at this point, and the functionality works
      fully on hugetlbfs, so this series can be merged first and then shmem
      support will follow.
      
      This patch (of 6):
      
      This feature allows userspace to intercept "minor" faults.  By "minor"
      faults, I mean the following situation:
      
      Let there exist two mappings (i.e., VMAs) to the same page(s).  One of the
      mappings is registered with userfaultfd (in minor mode), and the other is
      not.  Via the non-UFFD mapping, the underlying pages have already been
      allocated & filled with some contents.  The UFFD mapping has not yet been
      faulted in; when it is touched for the first time, this results in what
      I'm calling a "minor" fault.  As a concrete example, when working with
      hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
      page.
      
      This commit adds the new registration mode, and sets the relevant flag on
      the VMAs being registered.  In the hugetlb fault path, if we find that we
      have huge_pte_none(), but find_lock_page() does indeed find an existing
      page, then we have a "minor" fault, and if the VMA has the userfaultfd
      registration flag, we call into userfaultfd to handle it.
      
      This is implemented as a new registration mode, instead of an API feature.
      This is because the alternative implementation has significant drawbacks
      [1].
      
      However, doing it this was requires we allocate a VM_* flag for the new
      registration mode.  On 32-bit systems, there are no unused bits, so this
      feature is only supported on architectures with
      CONFIG_ARCH_USES_HIGH_VMA_FLAGS.  When attempting to register a VMA in
      MINOR mode on 32-bit architectures, we return -EINVAL.
      
      [1] https://lore.kernel.org/patchwork/patch/1380226/
      
      [peterx@redhat.com: fix minor fault page leak]
        Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.comSigned-off-by: NAxel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7677f7fd
    • A
      vfio/pci: Revert nvlink removal uAPI breakage · 77b8aeb9
      Alex Williamson 提交于
      Revert the uAPI changes from the below commit with notice that these
      regions and capabilities are no longer provided.
      
      Fixes: b392a198 ("vfio/pci: remove vfio_pci_nvlink2")
      Reported-by: NGreg Kurz <groug@kaod.org>
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      Reviewed-by: NCornelia Huck <cohuck@redhat.com>
      Reviewed-by: NGreg Kurz <groug@kaod.org>
      Tested-by: NGreg Kurz <groug@kaod.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Message-Id: <162014341432.3807030.11054087109120670135.stgit@omen>
      77b8aeb9
  7. 04 5月, 2021 1 次提交
  8. 30 4月, 2021 1 次提交
    • A
      seg6: add counters support for SRv6 Behaviors · 94604548
      Andrea Mayer 提交于
      This patch provides counters for SRv6 Behaviors as defined in [1],
      section 6. For each SRv6 Behavior instance, counters defined in [1] are:
      
       - the total number of packets that have been correctly processed;
       - the total amount of traffic in bytes of all packets that have been
         correctly processed;
      
      In addition, this patch introduces a new counter that counts the number of
      packets that have NOT been properly processed (i.e. errors) by an SRv6
      Behavior instance.
      
      Counters are not only interesting for network monitoring purposes (i.e.
      counting the number of packets processed by a given behavior) but they also
      provide a simple tool for checking whether a behavior instance is working
      as we expect or not.
      Counters can be useful for troubleshooting misconfigured SRv6 networks.
      Indeed, an SRv6 Behavior can silently drop packets for very different
      reasons (i.e. wrong SID configuration, interfaces set with SID addresses,
      etc) without any notification/message to the user.
      
      Due to the nature of SRv6 networks, diagnostic tools such as ping and
      traceroute may be ineffective: paths used for reaching a given router can
      be totally different from the ones followed by probe packets. In addition,
      paths are often asymmetrical and this makes it even more difficult to keep
      up with the journey of the packets and to understand which behaviors are
      actually processing our traffic.
      
      When counters are enabled on an SRv6 Behavior instance, it is possible to
      verify if packets are actually processed by such behavior and what is the
      outcome of the processing. Therefore, the counters for SRv6 Behaviors offer
      an non-invasive observability point which can be leveraged for both traffic
      monitoring and troubleshooting purposes.
      
      [1] https://www.rfc-editor.org/rfc/rfc8986.html#name-counters
      
      Troubleshooting using SRv6 Behavior counters
      --------------------------------------------
      
      Let's make a brief example to see how helpful counters can be for SRv6
      networks. Let's consider a node where an SRv6 End Behavior receives an SRv6
      packet whose Segment Left (SL) is equal to 0. In this case, the End
      Behavior (which accepts only packets with SL >= 1) discards the packet and
      increases the error counter.
      This information can be leveraged by the network operator for
      troubleshooting. Indeed, the error counter is telling the user that the
      packet:
      
        (i) arrived at the node;
       (ii) the packet has been taken into account by the SRv6 End behavior;
      (iii) but an error has occurred during the processing.
      
      The error (iii) could be caused by different reasons, such as wrong route
      settings on the node or due to an invalid SID List carried by the SRv6
      packet. Anyway, the error counter is used to exclude that the packet did
      not arrive at the node or it has not been processed by the behavior at
      all.
      
      Turning on/off counters for SRv6 Behaviors
      ------------------------------------------
      
      Each SRv6 Behavior instance can be configured, at the time of its creation,
      to make use of counters.
      This is done through iproute2 which allows the user to create an SRv6
      Behavior instance specifying the optional "count" attribute as shown in the
      following example:
      
       $ ip -6 route add 2001:db8::1 encap seg6local action End count dev eth0
      
      per-behavior counters can be shown by adding "-s" to the iproute2 command
      line, i.e.:
      
       $ ip -s -6 route show 2001:db8::1
       2001:db8::1 encap seg6local action End packets 0 bytes 0 errors 0 dev eth0
      
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Impact of counters for SRv6 Behaviors on performance
      ====================================================
      
      To determine the performance impact due to the introduction of counters in
      the SRv6 Behavior subsystem, we have carried out extensive tests.
      
      We chose to test the throughput achieved by the SRv6 End.DX2 Behavior
      because, among all the other behaviors implemented so far, it reaches the
      highest throughput which is around 1.5 Mpps (per core at 2.4 GHz on a
      Xeon(R) CPU E5-2630 v3) on kernel 5.12-rc2 using packets of size ~ 100
      bytes.
      
      Three different tests were conducted in order to evaluate the overall
      throughput of the SRv6 End.DX2 Behavior in the following scenarios:
      
       1) vanilla kernel (without the SRv6 Behavior counters patch) and a single
          instance of an SRv6 End.DX2 Behavior;
       2) patched kernel with SRv6 Behavior counters and a single instance of
          an SRv6 End.DX2 Behavior with counters turned off;
       3) patched kernel with SRv6 Behavior counters and a single instance of
          SRv6 End.DX2 Behavior with counters turned on.
      
      All tests were performed on a testbed deployed on the CloudLab facilities
      [2], a flexible infrastructure dedicated to scientific research on the
      future of Cloud Computing.
      
      Results of tests are shown in the following table:
      
      Scenario (1): average 1504764,81 pps (~1504,76 kpps); std. dev 3956,82 pps
      Scenario (2): average 1501469,78 pps (~1501,47 kpps); std. dev 2979,85 pps
      Scenario (3): average 1501315,13 pps (~1501,32 kpps); std. dev 2956,00 pps
      
      As can be observed, throughputs achieved in scenarios (2),(3) did not
      suffer any observable degradation compared to scenario (1).
      
      Thanks to Jakub Kicinski and David Ahern for their valuable suggestions
      and comments provided during the discussion of the proposed RFCs.
      
      [2] https://www.cloudlab.usSigned-off-by: NAndrea Mayer <andrea.mayer@uniroma2.it>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      94604548
  9. 29 4月, 2021 1 次提交
  10. 28 4月, 2021 1 次提交
    • P
      netfilter: nftables: add catch-all set element support · aaa31047
      Pablo Neira Ayuso 提交于
      This patch extends the set infrastructure to add a special catch-all set
      element. If the lookup fails to find an element (or range) in the set,
      then the catch-all element is selected. Users can specify a mapping,
      expression(s) and timeout to be attached to the catch-all element.
      
      This patch adds a catchall list to the set, this list might contain more
      than one single catch-all element (e.g. in case that the catch-all
      element is removed and a new one is added in the same transaction).
      However, most of the time, there will be either one element or no
      elements at all in this list.
      
      The catch-all element is identified via NFT_SET_ELEM_CATCHALL flag and
      such special element has no NFTA_SET_ELEM_KEY attribute. There is a new
      nft_set_elem_catchall object that stores a reference to the dummy
      catch-all element (catchall->elem) whose layout is the same of the set
      element type to reuse the existing set element codebase.
      
      The set size does not apply to the catch-all element, users can define a
      catch-all element even if the set is full.
      
      The check for valid set element flags hava been updates to report
      EOPNOTSUPP in case userspace requests flags that are not supported when
      using new userspace nftables and old kernel.
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      aaa31047
  11. 26 4月, 2021 7 次提交
  12. 23 4月, 2021 3 次提交
    • M
      landlock: Enable user space to infer supported features · 3532b0b4
      Mickaël Salaün 提交于
      Add a new flag LANDLOCK_CREATE_RULESET_VERSION to
      landlock_create_ruleset(2).  This enables to retreive a Landlock ABI
      version that is useful to efficiently follow a best-effort security
      approach.  Indeed, it would be a missed opportunity to abort the whole
      sandbox building, because some features are unavailable, instead of
      protecting users as much as possible with the subset of features
      provided by the running kernel.
      
      This new flag enables user space to identify the minimum set of Landlock
      features supported by the running kernel without relying on a filesystem
      interface (e.g. /proc/version, which might be inaccessible) nor testing
      multiple syscall argument combinations (i.e. syscall bisection).  New
      Landlock features will be documented and tied to a minimum version
      number (greater than 1).  The current version will be incremented for
      each new kernel release supporting new Landlock features.  User space
      libraries can leverage this information to seamlessly restrict processes
      as much as possible while being compatible with newer APIs.
      
      This is a much more lighter approach than the previous
      landlock_get_features(2): the complexity is pushed to user space
      libraries.  This flag meets similar needs as securityfs versions:
      selinux/policyvers, apparmor/features/*/version* and tomoyo/version.
      
      Supporting this flag now will be convenient for backward compatibility.
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Serge E. Hallyn <serge@hallyn.com>
      Signed-off-by: NMickaël Salaün <mic@linux.microsoft.com>
      Link: https://lore.kernel.org/r/20210422154123.13086-14-mic@digikod.netSigned-off-by: NJames Morris <jamorris@linux.microsoft.com>
      3532b0b4
    • M
      landlock: Add syscall implementations · 265885da
      Mickaël Salaün 提交于
      These 3 system calls are designed to be used by unprivileged processes
      to sandbox themselves:
      * landlock_create_ruleset(2): Creates a ruleset and returns its file
        descriptor.
      * landlock_add_rule(2): Adds a rule (e.g. file hierarchy access) to a
        ruleset, identified by the dedicated file descriptor.
      * landlock_restrict_self(2): Enforces a ruleset on the calling thread
        and its future children (similar to seccomp).  This syscall has the
        same usage restrictions as seccomp(2): the caller must have the
        no_new_privs attribute set or have CAP_SYS_ADMIN in the current user
        namespace.
      
      All these syscalls have a "flags" argument (not currently used) to
      enable extensibility.
      
      Here are the motivations for these new syscalls:
      * A sandboxed process may not have access to file systems, including
        /dev, /sys or /proc, but it should still be able to add more
        restrictions to itself.
      * Neither prctl(2) nor seccomp(2) (which was used in a previous version)
        fit well with the current definition of a Landlock security policy.
      
      All passed structs (attributes) are checked at build time to ensure that
      they don't contain holes and that they are aligned the same way for each
      architecture.
      
      See the user and kernel documentation for more details (provided by a
      following commit):
      * Documentation/userspace-api/landlock.rst
      * Documentation/security/landlock.rst
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: NMickaël Salaün <mic@linux.microsoft.com>
      Acked-by: NSerge Hallyn <serge@hallyn.com>
      Link: https://lore.kernel.org/r/20210422154123.13086-9-mic@digikod.netSigned-off-by: NJames Morris <jamorris@linux.microsoft.com>
      265885da
    • M
      landlock: Support filesystem access-control · cb2c7d1a
      Mickaël Salaün 提交于
      Using Landlock objects and ruleset, it is possible to tag inodes
      according to a process's domain.  To enable an unprivileged process to
      express a file hierarchy, it first needs to open a directory (or a file)
      and pass this file descriptor to the kernel through
      landlock_add_rule(2).  When checking if a file access request is
      allowed, we walk from the requested dentry to the real root, following
      the different mount layers.  The access to each "tagged" inodes are
      collected according to their rule layer level, and ANDed to create
      access to the requested file hierarchy.  This makes possible to identify
      a lot of files without tagging every inodes nor modifying the
      filesystem, while still following the view and understanding the user
      has from the filesystem.
      
      Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
      keep the same struct inodes for the same inodes whereas these inodes are
      in use.
      
      This commit adds a minimal set of supported filesystem access-control
      which doesn't enable to restrict all file-related actions.  This is the
      result of multiple discussions to minimize the code of Landlock to ease
      review.  Thanks to the Landlock design, extending this access-control
      without breaking user space will not be a problem.  Moreover, seccomp
      filters can be used to restrict the use of syscall families which may
      not be currently handled by Landlock.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Serge E. Hallyn <serge@hallyn.com>
      Signed-off-by: NMickaël Salaün <mic@linux.microsoft.com>
      Link: https://lore.kernel.org/r/20210422154123.13086-8-mic@digikod.netSigned-off-by: NJames Morris <jamorris@linux.microsoft.com>
      cb2c7d1a
  13. 22 4月, 2021 6 次提交
    • B
      KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command · 15fb7de1
      Brijesh Singh 提交于
      The command is used for copying the incoming buffer into the
      SEV guest memory space.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: NSteve Rutherford <srutherford@google.com>
      Signed-off-by: NBrijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: NAshish Kalra <ashish.kalra@amd.com>
      Message-Id: <c5d0e3e719db7bb37ea85d79ed4db52e9da06257.1618498113.git.ashish.kalra@amd.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      15fb7de1
    • B
      KVM: SVM: Add support for KVM_SEV_RECEIVE_START command · af43cbbf
      Brijesh Singh 提交于
      The command is used to create the encryption context for an incoming
      SEV guest. The encryption context can be later used by the hypervisor
      to import the incoming data into the SEV guest memory space.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: NSteve Rutherford <srutherford@google.com>
      Signed-off-by: NBrijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: NAshish Kalra <ashish.kalra@amd.com>
      Message-Id: <c7400111ed7458eee01007c4d8d57cdf2cbb0fc2.1618498113.git.ashish.kalra@amd.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      af43cbbf
    • S
      KVM: SVM: Add support for KVM_SEV_SEND_CANCEL command · 5569e2e7
      Steve Rutherford 提交于
      After completion of SEND_START, but before SEND_FINISH, the source VMM can
      issue the SEND_CANCEL command to stop a migration. This is necessary so
      that a cancelled migration can restart with a new target later.
      Reviewed-by: NNathan Tempelman <natet@google.com>
      Reviewed-by: NBrijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: NSteve Rutherford <srutherford@google.com>
      Message-Id: <20210412194408.2458827-1-srutherford@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5569e2e7
    • B
      KVM: SVM: Add KVM_SEND_UPDATE_DATA command · d3d1af85
      Brijesh Singh 提交于
      The command is used for encrypting the guest memory region using the encryption
      context created with KVM_SEV_SEND_START.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by : Steve Rutherford <srutherford@google.com>
      Signed-off-by: NBrijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: NAshish Kalra <ashish.kalra@amd.com>
      Message-Id: <d6a6ea740b0c668b30905ae31eac5ad7da048bb3.1618498113.git.ashish.kalra@amd.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d3d1af85
    • B
      KVM: SVM: Add KVM_SEV SEND_START command · 4cfdd47d
      Brijesh Singh 提交于
      The command is used to create an outgoing SEV guest encryption context.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: NSteve Rutherford <srutherford@google.com>
      Reviewed-by: NVenu Busireddy <venu.busireddy@oracle.com>
      Signed-off-by: NBrijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: NAshish Kalra <ashish.kalra@amd.com>
      Message-Id: <2f1686d0164e0f1b3d6a41d620408393e0a48376.1618498113.git.ashish.kalra@amd.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4cfdd47d
    • N
      KVM: x86: Support KVM VMs sharing SEV context · 54526d1f
      Nathan Tempelman 提交于
      Add a capability for userspace to mirror SEV encryption context from
      one vm to another. On our side, this is intended to support a
      Migration Helper vCPU, but it can also be used generically to support
      other in-guest workloads scheduled by the host. The intention is for
      the primary guest and the mirror to have nearly identical memslots.
      
      The primary benefits of this are that:
      1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
      can't accidentally clobber each other.
      2) The VMs can have different memory-views, which is necessary for post-copy
      migration (the migration vCPUs on the target need to read and write to
      pages, when the primary guest would VMEXIT).
      
      This does not change the threat model for AMD SEV. Any memory involved
      is still owned by the primary guest and its initial state is still
      attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
      to circumvent SEV, they could achieve the same effect by simply attaching
      a vCPU to the primary VM.
      This patch deliberately leaves userspace in charge of the memslots for the
      mirror, as it already has the power to mess with them in the primary guest.
      
      This patch does not support SEV-ES (much less SNP), as it does not
      handle handing off attested VMSAs to the mirror.
      
      For additional context, we need a Migration Helper because SEV PSP
      migration is far too slow for our live migration on its own. Using
      an in-guest migrator lets us speed this up significantly.
      Signed-off-by: NNathan Tempelman <natet@google.com>
      Message-Id: <20210408223214.2582277-1-natet@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      54526d1f
  14. 21 4月, 2021 1 次提交
    • S
      capabilities: require CAP_SETFCAP to map uid 0 · db2e718a
      Serge E. Hallyn 提交于
      cap_setfcap is required to create file capabilities.
      
      Since commit 8db6c34f ("Introduce v3 namespaced file capabilities"),
      a process running as uid 0 but without cap_setfcap is able to work
      around this as follows: unshare a new user namespace which maps parent
      uid 0 into the child namespace.
      
      While this task will not have new capabilities against the parent
      namespace, there is a loophole due to the way namespaced file
      capabilities are represented as xattrs.  File capabilities valid in
      userns 1 are distinguished from file capabilities valid in userns 2 by
      the kuid which underlies uid 0.  Therefore the restricted root process
      can unshare a new self-mapping namespace, add a namespaced file
      capability onto a file, then use that file capability in the parent
      namespace.
      
      To prevent that, do not allow mapping parent uid 0 if the process which
      opened the uid_map file does not have CAP_SETFCAP, which is the
      capability for setting file capabilities.
      
      As a further wrinkle: a task can unshare its user namespace, then open
      its uid_map file itself, and map (only) its own uid.  In this case we do
      not have the credential from before unshare, which was potentially more
      restricted.  So, when creating a user namespace, we record whether the
      creator had CAP_SETFCAP.  Then we can use that during map_write().
      
      With this patch:
      
      1. Unprivileged user can still unshare -Ur
      
         ubuntu@caps:~$ unshare -Ur
         root@caps:~# logout
      
      2. Root user can still unshare -Ur
      
         ubuntu@caps:~$ sudo bash
         root@caps:/home/ubuntu# unshare -Ur
         root@caps:/home/ubuntu# logout
      
      3. Root user without CAP_SETFCAP cannot unshare -Ur:
      
         root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
         root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
         unable to set CAP_SETFCAP effective capability: Operation not permitted
         root@caps:/home/ubuntu# unshare -Ur
         unshare: write failed /proc/self/uid_map: Operation not permitted
      
      Note: an alternative solution would be to allow uid 0 mappings by
      processes without CAP_SETFCAP, but to prevent such a namespace from
      writing any file capabilities.  This approach can be seen at [1].
      
      Background history: commit 95ebabde ("capabilities: Don't allow
      writing ambiguous v3 file capabilities") tried to fix the issue by
      preventing v3 fscaps to be written to disk when the root uid would map
      to the same uid in nested user namespaces.  This led to regressions for
      various workloads.  For example, see [2].  Ultimately this is a valid
      use-case we have to support meaning we had to revert this change in
      3b0c2d3e ("Revert 95ebabde ("capabilities: Don't allow writing
      ambiguous v3 file capabilities")").
      
      Link: https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4 [1]
      Link: https://github.com/containers/buildah/issues/3071 [2]
      Signed-off-by: NSerge Hallyn <serge@hallyn.com>
      Reviewed-by: NAndrew G. Morgan <morgan@kernel.org>
      Tested-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Tested-by: NGiuseppe Scrivano <gscrivan@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      db2e718a
  15. 20 4月, 2021 4 次提交
  16. 19 4月, 2021 2 次提交
  17. 17 4月, 2021 3 次提交