1. 09 Dec, 2021 1 commit
  2. 19 Nov, 2021 1 commit
  3. 07 Nov, 2021 1 commit
  4. 27 Oct, 2021 1 commit
    • kprobes: Add a test case for stacktrace from kretprobe handler · 1f6d3a8f
      Masami Hiramatsu committed
      Add a test case for stacktrace from kretprobe handler and
      nested kretprobe handlers.
      
      This test checks both the stack trace taken inside a kretprobe handler
      and the stack trace taken from pt_regs. Both stack traces must include the
      actual function return address instead of the kretprobe trampoline.
      The nested kretprobe stacktrace test checks whether the unwinder
      can correctly unwind the call frame on the stack which has been
      modified by the kretprobe.
      
      Since the stacktrace on kretprobe is correctly fixed only on x86,
      this introduces a meta kconfig ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
      which tells the user whether the stacktrace on kretprobe is correct.
      
      The test results will be shown as below:
      
       TAP version 14
       1..1
           # Subtest: kprobes_test
           1..6
           ok 1 - test_kprobe
           ok 2 - test_kprobes
           ok 3 - test_kretprobe
           ok 4 - test_kretprobes
           ok 5 - test_stacktrace_on_kretprobe
           ok 6 - test_stacktrace_on_nested_kretprobe
       # kprobes_test: pass:6 fail:0 skip:0 total:6
       # Totals: pass:6 fail:0 skip:0 total:6
       ok 1 - kprobes_test
      
      Link: https://lkml.kernel.org/r/163516211244.604541.18350507860972214415.stgit@devnote2
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      1f6d3a8f
  5. 26 Oct, 2021 1 commit
    • x86/signal: Implement sigaltstack size validation · 3aac3ebe
      Thomas Gleixner committed
      For historical reasons MINSIGSTKSZ is a constant which already became too
      small with AVX512 support.
      
      Add a mechanism to enforce strict checking of the sigaltstack size against
      the real size of the FPU frame.
      
      The strict check can be enabled via a config option and can also be
      controlled via the kernel command line option 'strict_sas_size' independent
      of the config switch.
      
      Enabling it might break existing applications which allocate a sigaltstack
      that is too small but 'work' because they never get a signal delivered. It
      can nevertheless be handy for filtering out binaries which are not yet
      aware of AT_MINSIGSTKSZ.
      
      Also the upcoming support for dynamically enabled FPU features requires a
      strict sanity check to ensure that:
      
         - Enabling of a dynamic feature, which changes the sigframe size fits
           into an enabled sigaltstack
      
         - Installing a too small sigaltstack after a dynamic feature has been
           added is not possible.
      
      Implement the base check which is controlled by config and command line
      options.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20211021225527.10184-3-chang.seok.bae@intel.com
      3aac3ebe
  6. 21 Oct, 2021 1 commit
  7. 19 Oct, 2021 1 commit
  8. 15 Oct, 2021 1 commit
    • sched: Add cluster scheduler level for x86 · 66558b73
      Tim Chen committed
      There are x86 CPU architectures (e.g. Jacobsville) where L2 cache is
      shared among a cluster of cores instead of being exclusive to one
      single core.
      
      To prevent oversubscription of L2 cache, load should be balanced
      between such L2 clusters, especially for tasks with no shared data.
      On benchmarks such as the SPECrate mcf test, this change provides a boost
      to performance, especially on a medium-load system on Jacobsville.  On a
      Jacobsville that has 24 Atom cores, arranged into 6 clusters of 4
      cores each, the benchmark numbers are as follows:
      
       Improvement over baseline kernel for mcf_r
       copies    run time    base rate
       1         -0.1%       -0.2%
       6         25.1%       25.1%
       12        18.8%       19.0%
       24        0.3%        0.3%
      
      So this looks pretty good. In terms of the system's task distribution,
      some pretty bad clumping can be seen for the vanilla kernel without
      the L2 cluster domain for the 6 and 12 copies case. With the extra
      domain for cluster, the load does get evened out between the clusters.
      
      Note this patch isn't a universal win as spreading isn't necessarily
      a win, particularly for those workloads which can benefit from packing.
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210924085104.44806-4-21cnbao@gmail.com
      66558b73
  9. 12 Oct, 2021 1 commit
    • x86/Kconfig: Do not enable AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT automatically · 71188590
      Borislav Petkov committed
      This Kconfig option was added initially so that memory encryption is
      enabled by default on machines which support it.
      
      However, devices which have DMA masks that are less than the bit
      position of the encryption bit, aka C-bit, require the use of an IOMMU
      or the use of SWIOTLB.
      
      If the IOMMU is disabled or in passthrough mode, the kernel would switch
      to SWIOTLB bounce-buffering for those transfers.
      
      In order to avoid that,
      
        2cc13bb4 ("iommu: Disable passthrough mode when SME is active")
      
      disables the default IOMMU passthrough mode so that devices for which the
      default 256K DMA is insufficient, can use the IOMMU instead.
      
      However 2, there are cases where the IOMMU is disabled in the BIOS, etc.
      (think the usual hardware folk "oops, I dropped the ball there" cases) or a
      driver doesn't properly use the DMA APIs or a device has a firmware or
      hardware bug, e.g.:
      
        ea68573d ("drm/amdgpu: Fail to load on RAVEN if SME is active")
      
      However 3, in the above GPU use case, there are APIs like Vulkan and
      some OpenGL/OpenCL extensions which are under the assumption that
      user-allocated memory can be passed in to the kernel driver and both the
      GPU and CPU can do coherent and concurrent access to the same memory.
      That cannot work with SWIOTLB bounce buffers, of course.
      
      So, in order for those devices to function, drop the "default y" for the
      SME by default active option so that users who want to have SME enabled,
      will need to either enable it in their config or use "mem_encrypt=on" on
      the kernel command line.
      
       [ tlendacky: Generalize commit message. ]
      
      Fixes: 7744ccdb ("x86/mm: Add Secure Memory Encryption (SME) support")
      Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Alex Deucher <alexander.deucher@amd.com>
      Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/8bbacd0e-4580-3194-19d2-a0ecad7df09c@molgen.mpg.de
      71188590
  10. 07 Oct, 2021 2 commits
  11. 06 Oct, 2021 1 commit
  12. 04 Oct, 2021 2 commits
  13. 20 Sep, 2021 1 commit
  14. 03 Sep, 2021 1 commit
  15. 16 Aug, 2021 1 commit
  16. 30 Jul, 2021 2 commits
  17. 28 Jul, 2021 1 commit
    • x86/mm: Prepare for opt-in based L1D flush in switch_mm() · b5f06f64
      Balbir Singh committed
      The goal of this is to allow tasks that want to protect sensitive
      information, against e.g. the recently found snoop assisted data sampling
      vulnerabilities, to flush their L1D on being switched out.  This protects
      their data from being snooped or leaked via side channels after the task
      has context switched out.
      
      This could also be used to wipe L1D when an untrusted task is switched in,
      but that's not a really well defined scenario while the opt-in variant is
      clearly defined.
      
      The mechanism is default disabled and can be enabled on the kernel command
      line.
      
      Prepare for the actual prctl based opt-in:
      
        1) Provide the necessary setup functionality similar to the other
           mitigations and enable the static branch when the command line option
           is set and the CPU provides support for hardware assisted L1D
           flushing. Software based L1D flush is not supported because it's CPU
           model specific and not really well defined.
      
           This does not come with a sysfs file like the other mitigations
           because it is not bound to any specific vulnerability.
      
           Support has to be queried via the prctl(2) interface.
      
        2) Add TIF_SPEC_L1D_FLUSH next to TIF_SPEC_IB so the two bits can be
           mangled into the mm pointer in one go, which allows reusing the
           existing mechanism in switch_mm() for the conditional IBPB speculation
           barrier efficiently.
      
        3) Add the L1D flush specific functionality which flushes L1D when the
           outgoing task opted in.
      
           Also check whether the incoming task has requested L1D flush and if so
           validate that it is not accidentally running on an SMT sibling as this
           makes the whole exercise moot because SMT siblings share L1D which
           opens tons of other attack vectors. If that happens schedule task work
           which signals the incoming task on return to user/guest with SIGBUS as
           this is part of the paranoid L1D flush contract.
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Balbir Singh <sblbir@amazon.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210108121056.21940-1-sblbir@amazon.com
      b5f06f64
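The commit above only prepares the ground and states that support has to be queried via prctl(2). A minimal sketch of such a query follows, assuming the speculation-control interface as it appears in the mainline uapi header (`PR_GET_SPECULATION_CTRL`, `PR_SET_SPECULATION_CTRL`, `PR_SPEC_L1D_FLUSH`). The fallback `#define`s are assumptions for older headers, and the helper names `l1d_flush_query()` and `l1d_flush_opt_in()` are hypothetical.

```c
#include <sys/prctl.h>

/* Fallbacks for older headers; values assumed from mainline linux/prctl.h. */
#ifndef PR_GET_SPECULATION_CTRL
#define PR_GET_SPECULATION_CTRL 52
#endif
#ifndef PR_SET_SPECULATION_CTRL
#define PR_SET_SPECULATION_CTRL 53
#endif
#ifndef PR_SPEC_L1D_FLUSH
#define PR_SPEC_L1D_FLUSH 2
#endif
#ifndef PR_SPEC_ENABLE
#define PR_SPEC_ENABLE (1UL << 1)
#endif

/* Hypothetical helper: ask the kernel about the L1D flush state of the
 * calling task. Returns a non-negative PR_SPEC_* bitmask on success or
 * -1 with errno set when the control is not available. */
static int l1d_flush_query(void)
{
    return prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH, 0, 0, 0);
}

/* Hypothetical helper: opt the calling task in to L1D flushing on
 * context switch. Returns 0 on success, -1 with errno set otherwise.
 * Per the commit message, an opted-in task that ends up running on an
 * SMT sibling is sent SIGBUS, so this should only be called knowingly. */
static int l1d_flush_opt_in(void)
{
    return prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_L1D_FLUSH,
                 PR_SPEC_ENABLE, 0, 0);
}
```

Because the mechanism is disabled by default and has to be enabled on the kernel command line, a caller should treat a failing or "disabled" answer from `l1d_flush_query()` as the normal case rather than as an error.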
  18. 21 Jul, 2021 1 commit
  19. 01 Jul, 2021 2 commits
  20. 30 Jun, 2021 1 commit
  21. 23 Jun, 2021 1 commit
  22. 15 Jun, 2021 1 commit
  23. 07 Jun, 2021 1 commit
  24. 26 May, 2021 2 commits
  25. 06 May, 2021 6 commits
    • x86/Kconfig: introduce ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE · f91ef222
      Oscar Salvador committed
      Enable x86_64 platform to use the MHP_MEMMAP_ON_MEMORY feature.
      
      Link: https://lkml.kernel.org/r/20210421102701.25051-8-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f91ef222
    • mm: drop redundant ARCH_ENABLE_SPLIT_PMD_PTLOCK · 66f24fa7
      Anshuman Khandual committed
      ARCH_ENABLE_SPLIT_PMD_PTLOCK has duplicate definitions on platforms
      that subscribe to it.  Drop these redundant definitions and instead just
      select it on applicable platforms.
      
      Link: https://lkml.kernel.org/r/1617259448-22529-6-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Acked-by: Heiko Carstens <hca@linux.ibm.com>		[s390]
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Palmer Dabbelt <palmerdabbelt@google.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66f24fa7
    • mm: drop redundant ARCH_ENABLE_[HUGEPAGE|THP]_MIGRATION · 1e866974
      Anshuman Khandual committed
      ARCH_ENABLE_[HUGEPAGE|THP]_MIGRATION configs have duplicate definitions on
      platforms that subscribe to them.  Drop these redundant definitions and
      instead just select them appropriately.
      
      [akpm@linux-foundation.org: s/x86_64/X86_64/, per Oscar]
      
      Link: https://lkml.kernel.org/r/1617259448-22529-5-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Palmer Dabbelt <palmerdabbelt@google.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1e866974
    • mm: generalize ARCH_ENABLE_MEMORY_[HOTPLUG|HOTREMOVE] · 91024b3c
      Anshuman Khandual committed
      ARCH_ENABLE_MEMORY_[HOTPLUG|HOTREMOVE] configs have duplicate
      definitions on platforms that subscribe to them.  Instead, just make them
      generic options which can be selected on applicable platforms.
      
      Link: https://lkml.kernel.org/r/1617259448-22529-4-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Acked-by: Heiko Carstens <hca@linux.ibm.com>		[s390]
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Palmer Dabbelt <palmerdabbelt@google.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91024b3c
    • mm: generalize ARCH_HAS_CACHE_LINE_SIZE · c2280be8
      Anshuman Khandual committed
      Patch series "mm: some config cleanups", v2.
      
      This series contains config cleanup patches which reduce code
      duplication across platforms and also improve maintainability.  There
      is no functional change intended with this series.
      
      This patch (of 6):
      
      ARCH_HAS_CACHE_LINE_SIZE config has duplicate definitions on platforms
      that subscribe to it.  Instead, just make it a generic option which can be
      selected on applicable platforms.  This change reduces code duplication
      and makes it cleaner.
      
      Link: https://lkml.kernel.org/r/1617259448-22529-1-git-send-email-anshuman.khandual@arm.com
      Link: https://lkml.kernel.org/r/1617259448-22529-2-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Acked-by: Vineet Gupta <vgupta@synopsys.com>		[arc]
      Cc: Will Deacon <will@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmerdabbelt@google.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c2280be8
    • userfaultfd: add minor fault registration mode · 7677f7fd
      Axel Rasmussen committed
      Patch series "userfaultfd: add minor fault handling", v9.
      
      Overview
      ========
      
      This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
      When enabled (via the UFFDIO_API ioctl), this feature means that any
      hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
      get events for "minor" faults.  By "minor" fault, I mean the following
      situation:
      
      Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
      memory).  One of the mappings is registered with userfaultfd (in minor
      mode), and the other is not.  Via the non-UFFD mapping, the underlying
      pages have already been allocated & filled with some contents.  The UFFD
      mapping has not yet been faulted in; when it is touched for the first
      time, this results in what I'm calling a "minor" fault.  As a concrete
      example, when working with hugetlbfs, we have huge_pte_none(), but
      find_lock_page() finds an existing page.
      
      We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE.  The idea
      is, userspace resolves the fault by either a) doing nothing if the
      contents are already correct, or b) updating the underlying contents using
      the second, non-UFFD mapping (via memcpy/memset or similar, or something
      fancier like RDMA, etc.).  In either case, userspace issues
      UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
      correct, carry on setting up the mapping".
      
      Use Case
      ========
      
      Consider the use case of VM live migration (e.g. under QEMU/KVM):
      
      1. While a VM is still running, we copy the contents of its memory to a
         target machine. The pages are populated on the target by writing to the
         non-UFFD mapping, using the setup described above. The VM is still running
         (and therefore its memory is likely changing), so this may be repeated
         several times, until we decide the target is "up to date enough".
      
      2. We pause the VM on the source, and start executing on the target machine.
         During this gap, the VM's user(s) will *see* a pause, so it is desirable to
         minimize this window.
      
      3. Between the last time any page was copied from the source to the target, and
         when the VM was paused, the contents of that page may have changed - and
         therefore the copy we have on the target machine is out of date. Although we
         can keep track of which pages are out of date, for VMs with large amounts of
         memory, it is "slow" to transfer this information to the target machine. We
         want to resume execution before such a transfer would complete.
      
      4. So, the guest begins executing on the target machine. The first time it
         touches its memory (via the UFFD-registered mapping), userspace wants to
         intercept this fault. Userspace checks whether or not the page is up to date,
         and if not, copies the updated page from the source machine, via the non-UFFD
         mapping. Finally, whether a copy was performed or not, userspace issues a
         UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
         are correct, carry on setting up the mapping".
      
      We don't have to do all of the final updates on-demand. The userfaultfd manager
      can, in the background, also copy over updated pages once it receives the map of
      which pages are up-to-date or not.
      
      Interaction with Existing APIs
      ==============================
      
      Because this is a feature, a registered VMA could potentially receive both
      missing and minor faults.  I spent some time thinking through how the
      existing API interacts with the new feature:
      
      UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
      allocate a new page.  If UFFDIO_CONTINUE is used on a non-minor fault:
      
      - For non-shared memory or shmem, -EINVAL is returned.
      - For hugetlb, -EFAULT is returned.
      
      UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
      Without modifications, the existing codepath assumes a new page needs to
      be allocated.  This is okay, since userspace must have a second
      non-UFFD-registered mapping anyway, thus there isn't much reason to want
      to use these in any case (just memcpy or memset or similar).
      
      - If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
      - If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
        in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
      - UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
        -ENOENT in that case (regardless of the kind of fault).
      
      Future Work
      ===========
      
      This series only supports hugetlbfs.  I have a second series in flight to
      support shmem as well, extending the functionality.  This series is more
      mature than the shmem support at this point, and the functionality works
      fully on hugetlbfs, so this series can be merged first and then shmem
      support will follow.
      
      This patch (of 6):
      
      This feature allows userspace to intercept "minor" faults.  By "minor"
      faults, I mean the following situation:
      
      Let there exist two mappings (i.e., VMAs) to the same page(s).  One of the
      mappings is registered with userfaultfd (in minor mode), and the other is
      not.  Via the non-UFFD mapping, the underlying pages have already been
      allocated & filled with some contents.  The UFFD mapping has not yet been
      faulted in; when it is touched for the first time, this results in what
      I'm calling a "minor" fault.  As a concrete example, when working with
      hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
      page.
      
      This commit adds the new registration mode, and sets the relevant flag on
      the VMAs being registered.  In the hugetlb fault path, if we find that we
      have huge_pte_none(), but find_lock_page() does indeed find an existing
      page, then we have a "minor" fault, and if the VMA has the userfaultfd
      registration flag, we call into userfaultfd to handle it.
      
      This is implemented as a new registration mode, instead of an API feature.
      This is because the alternative implementation has significant drawbacks
      [1].
      
      However, doing it this way requires that we allocate a VM_* flag for the new
      registration mode.  On 32-bit systems, there are no unused bits, so this
      feature is only supported on architectures with
      CONFIG_ARCH_USES_HIGH_VMA_FLAGS.  When attempting to register a VMA in
      MINOR mode on 32-bit architectures, we return -EINVAL.
      
      [1] https://lore.kernel.org/patchwork/patch/1380226/
      
      [peterx@redhat.com: fix minor fault page leak]
        Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutný" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7677f7fd
  26. 01 May, 2021 1 commit
  27. 20 Apr, 2021 1 commit
    • x86/platform/uv: Fix !KEXEC build failure · c2209ea5
      Ingo Molnar committed
      When KEXEC is disabled, the UV build fails:
      
        arch/x86/platform/uv/uv_nmi.c:875:14: error: ‘uv_nmi_kexec_failed’ undeclared (first use in this function)
      
      Since uv_nmi_kexec_failed is only defined in the KEXEC_CORE #ifdef branch,
      this code cannot ever have been build tested:
      
      	if (main)
      		pr_err("UV: NMI kdump: KEXEC not supported in this kernel\n");
      	atomic_set(&uv_nmi_kexec_failed, 1);
      
      Nor is this use possible in uv_handle_nmi():
      
                      atomic_set(&uv_nmi_kexec_failed, 0);
      
      These bugs were introduced in this commit:
      
          d0a9964e: ("x86/platform/uv: Implement simple dump failover if kdump fails")
      
      Which added the uv_nmi_kexec_failed assignments to !KEXEC code, while making the
      definition KEXEC-only - apparently without testing the !KEXEC case.
      
      Instead of complicating the #ifdef maze, simplify the code by requiring X86_UV
      to depend on KEXEC_CORE. This pattern is present in other architectures as well.
      
      ( We'll remove the untested, 7 years old !KEXEC complications from the file in a
        separate commit. )
      
      Fixes: d0a9964e: ("x86/platform/uv: Implement simple dump failover if kdump fails")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Mike Travis <travis@sgi.com>
      Cc: linux-kernel@vger.kernel.org
      c2209ea5
  28. 19 Apr, 2021 1 commit
  29. 08 Apr, 2021 1 commit
  30. 20 Mar, 2021 1 commit