1. 09 Sep 2021 (2 commits)
    • mm/memory_hotplug: remove nid parameter from remove_memory() and friends · e1c158e4
      By David Hildenbrand
      There is only a single user remaining.  We can simply look up the nid,
      which is only used for node offlining purposes, while walking our memory
      blocks.  We don't expect to remove multi-nid ranges; and if we ever do,
      we most probably don't care about removing multi-nid ranges that
      actually result in empty nodes.
      
      If ever required, we can detect the "multi-nid" scenario and simply try
      offlining all online nodes.
      
      Link: https://lkml.kernel.org/r/20210712124052.26491-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@ionos.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pierre Morel <pmorel@linux.ibm.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Sergei Trofimovich <slyfox@gentoo.org>
      Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: remove nid parameter from arch_remove_memory() · 65a2aa5f
      By David Hildenbrand
      The parameter is unused, let's remove it.
      
      Link: https://lkml.kernel.org/r/20210712124052.26491-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
      Acked-by: Heiko Carstens <hca@linux.ibm.com> [s390]
      Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Sergei Trofimovich <slyfox@gentoo.org>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Michel Lespinasse <michel@lespinasse.org>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Pierre Morel <pmorel@linux.ibm.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 30 Aug 2021 (1 commit)
  3. 26 Aug 2021 (4 commits)
  4. 25 Aug 2021 (5 commits)
  5. 22 Aug 2021 (1 commit)
  6. 21 Aug 2021 (2 commits)
  7. 20 Aug 2021 (2 commits)
  8. 19 Aug 2021 (3 commits)
    • arm64: initialize all of CNTHCTL_EL2 · bde8fff8
      By Mark Rutland
      In __init_el2_timers we initialize CNTHCTL_EL2.{EL1PCEN,EL1PCTEN} with an
      RMW sequence, leaving all other bits UNKNOWN.
      
      In general, we should initialize all bits in a register rather than
      using an RMW sequence, since most bits are UNKNOWN out of reset, and as
      new bits are added to the register their reset value might not result in
      the expected behaviour.
      
      In the case of CNTHCTL_EL2, FEAT_ECV added a number of new control bits
      in previously RES0 bits, which reset to UNKNOWN values, and may cause
      issues for EL1 and EL0:
      
      * CNTHCTL_EL2.ECV enables the CNTPOFF_EL2 offset (which itself resets to
        an UNKNOWN value) at EL0 and EL1. Since the offset could reset to
        distinct values across CPUs, when the control bit resets to 1 this
        could break timekeeping generally.
      
      * CNTHCTL_EL2.{EL1TVT,EL1TVCT} trap EL0 and EL1 accesses to the EL1
        virtual timer/counter registers to EL2. When reset to 1, this could
        cause unexpected traps to EL2.
      
      Initializing these bits to zero avoids these problems, and all other
      bits in CNTHCTL_EL2 other than EL1PCEN and EL1PCTEN can safely be reset
      to zero.
      
      This patch ensures we initialize CNTHCTL_EL2 accordingly, only setting
      EL1PCEN and EL1PCTEN, and setting all other bits to zero.
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Will Deacon <will@kernel.org>
      Reviewed-by: Oliver Upton <oupton@google.com>
      Acked-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20210818161535.52786-1-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
    • powerpc/mm: Fix set_memory_*() against concurrent accesses · 9f7853d7
      By Michael Ellerman
      Laurent reported that STRICT_MODULE_RWX was causing intermittent crashes
      on one of his systems:
      
        kernel tried to execute exec-protected page (c008000004073278) - exploit attempt? (uid: 0)
        BUG: Unable to handle kernel instruction fetch
        Faulting instruction address: 0xc008000004073278
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: drm virtio_console fuse drm_panel_orientation_quirks ...
        CPU: 3 PID: 44 Comm: kworker/3:1 Not tainted 5.14.0-rc4+ #12
        Workqueue: events control_work_handler [virtio_console]
        NIP:  c008000004073278 LR: c008000004073278 CTR: c0000000001e9de0
        REGS: c00000002e4ef7e0 TRAP: 0400   Not tainted  (5.14.0-rc4+)
        MSR:  800000004280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24002822 XER: 200400cf
        ...
        NIP fill_queue+0xf0/0x210 [virtio_console]
        LR  fill_queue+0xf0/0x210 [virtio_console]
        Call Trace:
          fill_queue+0xb4/0x210 [virtio_console] (unreliable)
          add_port+0x1a8/0x470 [virtio_console]
          control_work_handler+0xbc/0x1e8 [virtio_console]
          process_one_work+0x290/0x590
          worker_thread+0x88/0x620
          kthread+0x194/0x1a0
          ret_from_kernel_thread+0x5c/0x64
      
      Jordan, Fabiano & Murilo were able to reproduce and identify that the
      problem is caused by the call to module_enable_ro() in do_init_module(),
      which happens after the module's init function has already been called.
      
      Our current implementation of change_page_attr() is not safe against
      concurrent accesses, because it invalidates the PTE before flushing the
      TLB and then installing the new PTE. That leaves a window in time where
      there is no valid PTE for the page, if another CPU tries to access the
      page at that time we see something like the fault above.
      
      We can't simply switch to set_pte_at()/flush TLB, because our hash MMU
      code doesn't handle a set_pte_at() of a valid PTE. See [1].
      
      But we do have pte_update(), which replaces the old PTE with the new,
      meaning there's no window where the PTE is invalid. And the hash MMU
      version hash__pte_update() deals with synchronising the hash page table
      correctly.
      
      [1]: https://lore.kernel.org/linuxppc-dev/87y318wp9r.fsf@linux.ibm.com/
      
      Fixes: 1f9ad21c ("powerpc/mm: Implement set_memory() routines")
      Reported-by: Laurent Vivier <lvivier@redhat.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Murilo Opsfelder Araújo <muriloo@linux.ibm.com>
      Tested-by: Laurent Vivier <lvivier@redhat.com>
      Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210818120518.3603172-1-mpe@ellerman.id.au
    • powerpc/32s: Fix random crashes by adding isync() after locking/unlocking KUEP · ef486bf4
      By Christophe Leroy
      Commit b5efec00 ("powerpc/32s: Move KUEP locking/unlocking in C")
      removed the 'isync' instruction after adding/removing the NX bit in user
      segments. The reasoning behind this change was that when setting the
      NX bit we don't mind it taking effect with a delay, as the kernel never
      executes text from userspace, and that clearing the NX bit happens on
      return to userspace, where the 'rfi' should synchronise the context.
      
      However, it looks like on book3s/32 with a hash page table, at least
      on the G3 processor, we get an unexpected fault from userspace,
      followed by a failure in the verification of MSR_PR at the end of
      another interrupt.
      
      This is fixed by adding back the removed isync() following the update
      of the NX bit in the user segment registers. Only do it for cores with
      a hash table, as 603 cores don't exhibit that problem and the two
      isyncs increase the ./null_syscall selftest by 6 cycles on an MPC 832x.
      
      First problem: unexpected WARN_ON() for mysterious PROTFAULT
      
        WARNING: CPU: 0 PID: 1660 at arch/powerpc/mm/fault.c:354 do_page_fault+0x6c/0x5b0
        Modules linked in:
        CPU: 0 PID: 1660 Comm: Xorg Not tainted 5.13.0-pmac-00028-gb3c15b60339a #40
        NIP:  c001b5c8 LR: c001b6f8 CTR: 00000000
        REGS: e2d09e40 TRAP: 0700   Not tainted  (5.13.0-pmac-00028-gb3c15b60339a)
        MSR:  00021032 <ME,IR,DR,RI>  CR: 42d04f30  XER: 20000000
        GPR00: c000424c e2d09f00 c301b680 e2d09f40 0000001e 42000000 00cba028 00000000
        GPR08: 08000000 48000010 c301b680 e2d09f30 22d09f30 00c1fff0 00cba000 a7b7ba4c
        GPR16: 00000031 00000000 00000000 00000000 00000000 00000000 a7b7b0d0 00c5c010
        GPR24: a7b7b64c a7b7d2f0 00000004 00000000 c1efa6c0 00cba02c 00000300 e2d09f40
        NIP [c001b5c8] do_page_fault+0x6c/0x5b0
        LR [c001b6f8] do_page_fault+0x19c/0x5b0
        Call Trace:
        [e2d09f00] [e2d09f04] 0xe2d09f04 (unreliable)
        [e2d09f30] [c000424c] DataAccess_virt+0xd4/0xe4
        --- interrupt: 300 at 0xa7a261dc
        NIP:  a7a261dc LR: a7a253bc CTR: 00000000
        REGS: e2d09f40 TRAP: 0300   Not tainted  (5.13.0-pmac-00028-gb3c15b60339a)
        MSR:  0000d032 <EE,PR,ME,IR,DR,RI>  CR: 228428e2  XER: 20000000
        DAR: 00cba02c DSISR: 42000000
        GPR00: a7a27448 afa6b0e0 a74c35c0 a7b7b614 0000001e a7b7b614 00cba028 00000000
        GPR08: 00020fd9 00000031 00cb9ff8 a7a273b0 220028e2 00c1fff0 00cba000 a7b7ba4c
        GPR16: 00000031 00000000 00000000 00000000 00000000 00000000 a7b7b0d0 00c5c010
        GPR24: a7b7b64c a7b7d2f0 00000004 00000002 0000001e a7b7b614 a7b7aff4 00000030
        NIP [a7a261dc] 0xa7a261dc
        LR [a7a253bc] 0xa7a253bc
        --- interrupt: 300
        Instruction dump:
        7c4a1378 810300a0 75278410 83820298 83a300a4 553b018c 551e0036 4082038c
        2e1b0000 40920228 75280800 41820220 <0fe00000> 3b600000 41920214 81420594
      
      Second problem: MSR[PR] is seen unset although the interrupt frame shows it set
      
        kernel BUG at arch/powerpc/kernel/interrupt.c:458!
        Oops: Exception in kernel mode, sig: 5 [#1]
        BE PAGE_SIZE=4K MMU=Hash SMP NR_CPUS=2 PowerMac
        Modules linked in:
        CPU: 0 PID: 1660 Comm: Xorg Tainted: G        W         5.13.0-pmac-00028-gb3c15b60339a #40
        NIP:  c0011434 LR: c001629c CTR: 00000000
        REGS: e2d09e70 TRAP: 0700   Tainted: G        W          (5.13.0-pmac-00028-gb3c15b60339a)
        MSR:  00029032 <EE,ME,IR,DR,RI>  CR: 42d09f30  XER: 00000000
        GPR00: 00000000 e2d09f30 c301b680 e2d09f40 83440000 c44d0e68 e2d09e8c 00000000
        GPR08: 00000002 00dc228a 00004000 e2d09f30 22d09f30 00c1fff0 afa6ceb4 00c26144
        GPR16: 00c25fb8 00c26140 afa6ceb8 90000000 00c944d8 0000001c 00000000 00200000
        GPR24: 00000000 000001fb afa6d1b4 00000001 00000000 a539a2a0 a530fd80 00000089
        NIP [c0011434] interrupt_exit_kernel_prepare+0x10/0x70
        LR [c001629c] interrupt_return+0x9c/0x144
        Call Trace:
        [e2d09f30] [c000424c] DataAccess_virt+0xd4/0xe4 (unreliable)
        --- interrupt: 300 at 0xa09be008
        NIP:  a09be008 LR: a09bdfe8 CTR: a09bdfc0
        REGS: e2d09f40 TRAP: 0300   Tainted: G        W          (5.13.0-pmac-00028-gb3c15b60339a)
        MSR:  0000d032 <EE,PR,ME,IR,DR,RI>  CR: 420028e2  XER: 20000000
        DAR: a539a308 DSISR: 0a000000
        GPR00: a7b90d50 afa6b2d0 a74c35c0 a0a8b690 a0a8b698 a5365d70 a4fa82a8 00000004
        GPR08: 00000000 a09bdfc0 00000000 a5360000 a09bde7c 00c1fff0 afa6ceb4 00c26144
        GPR16: 00c25fb8 00c26140 afa6ceb8 90000000 00c944d8 0000001c 00000000 00200000
        GPR24: 00000000 000001fb afa6d1b4 00000001 00000000 a539a2a0 a530fd80 00000089
        NIP [a09be008] 0xa09be008
        LR [a09bdfe8] 0xa09bdfe8
        --- interrupt: 300
        Instruction dump:
        80010024 83e1001c 7c0803a6 4bffff80 3bc00800 4bffffd0 486b42fd 4bffffcc
        81430084 71480002 41820038 554a0462 <0f0a0000> 80620060 74630001 40820034
      
      Fixes: b5efec00 ("powerpc/32s: Move KUEP locking/unlocking in C")
      Cc: stable@vger.kernel.org # v5.13+
      Reported-by: Stan Johnson <userm57@yahoo.com>
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/4856f5574906e2aec0522be17bf3848a22b2cd0b.1629269345.git.christophe.leroy@csgroup.eu
  9. 18 Aug 2021 (2 commits)
    • powerpc/xive: Do not mark xive_request_ipi() as __init · 3f78c90f
      By Nathan Chancellor
      Compiling ppc64le_defconfig with clang-14 shows a modpost warning:
      
      WARNING: modpost: vmlinux.o(.text+0xa74e0): Section mismatch in
      reference from the function xive_setup_cpu_ipi() to the function
      .init.text:xive_request_ipi()
      The function xive_setup_cpu_ipi() references
      the function __init xive_request_ipi().
      This is often because xive_setup_cpu_ipi lacks a __init
      annotation or the annotation of xive_request_ipi is wrong.
      
      xive_request_ipi() is called from xive_setup_cpu_ipi(), which is not
      __init, so xive_request_ipi() should not be marked __init. Remove the
      attribute so there is no more warning.
      
      Fixes: cbc06f05 ("powerpc/xive: Do not skip CPU-less nodes when creating the IPIs")
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210816185711.21563-1-nathan@kernel.org
    • s390/pci: fix use after free of zpci_dev · 2a671f77
      By Niklas Schnelle
      The struct pci_dev uses reference counting, but zPCI erroneously
      assumed that the last reference would always be the local reference
      after calling pci_stop_and_remove_bus_device(). This is usually the
      case, but it is not how reference counting works, and is thus
      inherently fragile.
      
      In fact, there is one case where this causes a NULL pointer
      dereference: when, on an SR-IOV device, function 0 is hot unplugged
      before another function of the same multi-function device. In this
      case the second function's pdev->sriov->dev reference keeps the
      struct pci_dev of function 0 alive even after the unplug. This bug was
      previously hidden by the fact that we were leaking the struct pci_dev,
      which in turn meant that it always outlived the struct zpci_dev. That
      leak was fixed in commit 0b13525c ("s390/pci: fix leak of PCI device
      structure"), exposing the broken behaviour.
      
      Fix this by accounting for the long-lived reference a struct pci_dev
      has to its underlying struct zpci_dev via the zbus->function[] array,
      and only release that in pcibios_release_device(), ensuring that the
      struct pci_dev is not left with a dangling reference. This is a
      minimal fix; in the future it would probably be better to use
      fine-grained reference counting for struct zpci_dev.
      
      Fixes: 05bc1be6 ("s390/pci: create zPCI bus")
      Cc: stable@vger.kernel.org
      Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
      Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
  10. 16 Aug 2021 (2 commits)
  11. 13 Aug 2021 (10 commits)
    • KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock · ce25681d
      By Sean Christopherson
      Add yet another spinlock for the TDP MMU and take it when marking indirect
      shadow pages unsync.  When using the TDP MMU and L1 is running L2(s) with
      nested TDP, KVM may encounter shadow pages for the TDP entries managed by
      L1 (controlling L2) when handling a TDP MMU page fault.  The unsync logic
      is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
      misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
      which runs with mmu_lock held for read, not write.
      
      Lack of a critical section manifests most visibly as an underflow of
      unsync_children in clear_unsync_child_bit() due to unsync_children being
      corrupted when multiple CPUs write it without a critical section and
      without atomic operations.  But underflow is the best case scenario.  The
      worst case scenario is that unsync_children prematurely hits '0' and
      leads to guest memory corruption due to KVM neglecting to properly sync
      shadow pages.
      
      Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
      would functionally be ok.  Usurping the lock could degrade performance when
      building upper level page tables on different vCPUs, especially since the
      unsync flow could hold the lock for a comparatively long time depending on
      the number of indirect shadow pages and the depth of the paging tree.
      
      For simplicity, take the lock for all MMUs, even though KVM could fairly
      easily know that mmu_lock is held for write.  If mmu_lock is held for
      write, there cannot be contention for the inner spinlock, and marking
      shadow pages unsync across multiple vCPUs will be slow enough that
      bouncing the kvm_arch cacheline should be in the noise.
      
      Note, even though L2 could theoretically be given access to its own EPT
      entries, a nested MMU must hold mmu_lock for write and thus cannot race
      against a TDP MMU page fault.  I.e. the additional spinlock only _needs_ to
      be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
      that is running with the TDP MMU enabled.  Holding mmu_lock for read also
      prevents the indirect shadow page from being freed.  But as above, keep
      it simple and always take the lock.
      
      Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
      effectively disable unsync behavior for nested TDP.  Write protecting leaf
      shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
      VMMs typically don't modify TDP entries, but the same may not hold true for
      non-standard use cases and/or VMMs that are migrating physical pages (from
      L1's perspective).
      
      Alternative #2, the unsync logic could be made thread safe.  In theory,
      simply converting all relevant kvm_mmu_page fields to atomics and using
      atomic bitops for the bitmap would suffice.  However, (a) an in-depth audit
      would be required, (b) the code churn would be substantial, and (c) legacy
      shadow paging would incur additional atomic operations in performance
      sensitive paths for no benefit (to legacy shadow paging).
      
      Fixes: a2855afc ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812181815.3378104-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs · 0103098f
      By Sean Christopherson
      Set the min_level for the TDP iterator at the root level when zapping all
      SPTEs to optimize the iterator's try_step_down().  Zapping a non-leaf
      SPTE will recursively zap all its children, thus there is no need for the
      iterator to attempt to step down.  This avoids rereading the top-level
      SPTEs after they are zapped by causing try_step_down() to short-circuit.
      
      In most cases, optimizing try_step_down() will be in the noise as the cost
      of zapping SPTEs completely dominates the overall time.  The optimization
      is however helpful if the zap occurs with relatively few SPTEs, e.g. if KVM
      is zapping in response to multiple memslot updates when userspace is adding
      and removing read-only memslots for option ROMs.  In that case, the task
      doing the zapping likely isn't a vCPU thread, but it still holds mmu_lock
      for read and thus can be a noisy neighbor of sorts.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812181414.3376143-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs · 524a1e4e
      By Sean Christopherson
      Pass "all ones" as the end GFN to signal "zap all" for the TDP MMU and
      really zap all SPTEs in this case.  As is, zap_gfn_range() skips non-leaf
      SPTEs whose range exceeds the range to be zapped.  If shadow_phys_bits is
      not aligned to the range size of top-level SPTEs, e.g. 512gb with 4-level
      paging, the "zap all" flows will skip top-level SPTEs whose range extends
      beyond shadow_phys_bits and leak their SPs when the VM is destroyed.
      
      Use the current upper bound (based on host.MAXPHYADDR) to detect that the
      caller wants to zap all SPTEs, e.g. instead of using the max theoretical
      gfn, 1 << (52 - 12).  The more precise upper bound allows the TDP iterator
      to terminate its walk earlier when running on hosts with MAXPHYADDR < 52.
      
      Add a WARN on kvm->arch.tdp_mmu_pages when the TDP MMU is destroyed to
      help future debuggers should KVM decide to leak SPTEs again.
      
      The bug is most easily reproduced by running (and unloading!) KVM in a
      VM whose host.MAXPHYADDR < 39, as the SPTE for gfn=0 will be skipped.
      
        =============================================================================
        BUG kvm_mmu_page_header (Not tainted): Objects remaining in kvm_mmu_page_header on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
        Slab 0x000000004d8f7af1 objects=22 used=2 fp=0x00000000624d29ac flags=0x4000000000000200(slab|zone=1)
        CPU: 0 PID: 1582 Comm: rmmod Not tainted 5.14.0-rc2+ #420
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack_lvl+0x45/0x59
         slab_err+0x95/0xc9
         __kmem_cache_shutdown.cold+0x3c/0x158
         kmem_cache_destroy+0x3d/0xf0
         kvm_mmu_module_exit+0xa/0x30 [kvm]
         kvm_arch_exit+0x5d/0x90 [kvm]
         kvm_exit+0x78/0x90 [kvm]
         vmx_exit+0x1a/0x50 [kvm_intel]
         __x64_sys_delete_module+0x13f/0x220
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812181414.3376143-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Use vmx_need_pf_intercept() when deciding if L0 wants a #PF · 18712c13
      By Sean Christopherson
      Use vmx_need_pf_intercept() when determining if L0 wants to handle a #PF
      in L2 or if the VM-Exit should be forwarded to L1.  The current logic fails
      to account for the case where #PF is intercepted to handle
      guest.MAXPHYADDR < host.MAXPHYADDR and ends up reflecting all #PFs into
      L1.  At best, L1 will complain and inject the #PF back into L2.  At
      worst, L1 will eat the unexpected fault and cause L2 to hang on infinite
      page faults.
      
      Note, while the bug was technically introduced by the commit that added
      support for the MAXPHYADDR madness, the shame is all on commit
      a0c13434 ("KVM: VMX: introduce vmx_need_pf_intercept").
      
      Fixes: 1dbf5d68 ("KVM: VMX: Add guest physical address check in EPT violation and misconfig")
      Cc: stable@vger.kernel.org
      Cc: Peter Shier <pshier@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812045615.3167686-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: vmx: Sync all matching EPTPs when injecting nested EPT fault · 85aa8889
      By Junaid Shahid
      When a nested EPT violation/misconfig is injected into the guest,
      the shadow EPT PTEs associated with that address need to be synced.
      This is done by kvm_inject_emulated_page_fault() before it calls
      nested_ept_inject_page_fault(). However, that will only sync the
      shadow EPT PTE associated with the current L1 EPTP. Since the ASID
      is based on EP4TA rather than the full EPTP, so syncing the current
      EPTP is not enough. The SPTEs associated with any other L1 EPTPs
      in the prev_roots cache with the same EP4TA also need to be synced.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Message-Id: <20210806222229.1645356-1-junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: remove dead initialization · ffbe17ca
      By Paolo Bonzini
      hv_vcpu is initialized again a dozen lines below, and at this
      point vcpu->arch.hyperv is not valid.  Remove the initializer.
      Reported-by: kernel test robot <lkp@intel.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Allow guest to set EFER.NX=1 on non-PAE 32-bit kernels · 1383279c
      By Sean Christopherson
      Remove an ancient restriction that disallowed exposing EFER.NX to the
      guest if EFER.NX=0 on the host, even if NX is fully supported by the CPU.
      The motivation of the check, added by commit 2cc51560 ("KVM: VMX:
      Avoid saving and restoring msr_efer on lightweight vmexit"), was to rule
      out the case of host.EFER.NX=0 and guest.EFER.NX=1 so that KVM could run
      the guest with the host's EFER.NX and thus avoid context switching EFER
      if the only divergence was the NX bit.
      
      Fast forward to today, and KVM has long since stopped running the guest
      with the host's EFER.NX.  Not only does KVM context switch EFER if
      host.EFER.NX=1 && guest.EFER.NX=0, KVM also forces host.EFER.NX=0 &&
      guest.EFER.NX=1 when using shadow paging (to emulate SMEP).  Furthermore,
      the entire motivation for the restriction was made obsolete over a decade
      ago when Intel added dedicated host and guest EFER fields in the VMCS
      (Nehalem timeframe), which reduced the overhead of context switching EFER
      from 400+ cycles (2 * WRMSR + 1 * RDMSR) to a mere ~2 cycles.
      
      In practice, the removed restriction only affects non-PAE 32-bit kernels,
      as EFER.NX is set during boot if NX is supported and the kernel will use
      PAE paging (32-bit or 64-bit), regardless of whether or not the kernel
      will actually use NX itself (mark PTEs non-executable).
      
      Alternatively and/or complementarily, startup_32_smp() in head_32.S could
      be modified to set EFER.NX=1 regardless of paging mode, thus eliminating
      the scenario where NX is supported but not enabled.  However, that runs
      the risk of breaking non-KVM non-PAE kernels (though the risk is very,
      very low as there are no known EFER.NX errata), and also eliminates an
      easy-to-use mechanism for stressing KVM's handling of guest vs. host EFER
      across nested virtualization transitions.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210805183804.1221554-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1383279c
    • A
      ARM: ixp4xx: fix building both pci drivers · cbfece75
      Authored by Arnd Bergmann
      When both the old and the new PCI drivers are enabled
      in the same kernel, there are a couple of namespace
      conflicts that cause a build failure:
      
      drivers/pci/controller/pci-ixp4xx.c:38: error: "IXP4XX_PCI_CSR" redefined [-Werror]
         38 | #define IXP4XX_PCI_CSR                  0x1c
            |
      In file included from arch/arm/mach-ixp4xx/include/mach/hardware.h:23,
                       from arch/arm/mach-ixp4xx/include/mach/io.h:15,
                       from arch/arm/include/asm/io.h:198,
                       from include/linux/io.h:13,
                       from drivers/pci/controller/pci-ixp4xx.c:20:
      arch/arm/mach-ixp4xx/include/mach/ixp4xx-regs.h:221: note: this is the location of the previous definition
        221 | #define IXP4XX_PCI_CSR(x) ((volatile u32 *)(IXP4XX_PCI_CFG_BASE_VIRT+(x)))
            |
      drivers/pci/controller/pci-ixp4xx.c:148:12: error: 'ixp4xx_pci_read' redeclared as different kind of symbol
        148 | static int ixp4xx_pci_read(struct ixp4xx_pci *p, u32 addr, u32 cmd, u32 *data)
            |            ^~~~~~~~~~~~~~~
      
      Rename both the ixp4xx_pci_read/ixp4xx_pci_write functions and the
      IXP4XX_PCI_CSR macro. In each case, I went with the version that
      has fewer callers to keep the change small.
      
      Fixes: f7821b49 ("PCI: ixp4xx: Add a new driver for IXP4xx")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Cc: soc@kernel.org
      Link: https://lore.kernel.org/r/20210721151546.2325937-1-arnd@kernel.org
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      cbfece75
    • L
      ARM: configs: Update the nhk8815_defconfig · 813bacf4
      Authored by Linus Walleij
      The platform lost the framebuffer due to a commit solving a
      circular dependency in v5.14-rc1, so add it back in by explicitly
      selecting the framebuffer.
      
      Also fix up some Kconfig options that got dropped or moved around
      while we're at it.
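The change boils down to defconfig fragments of this shape (CONFIG_FB is the symbol named by the Fixes commit; the full set of adjusted options is in the diff):

```
# Framebuffer support must now be selected explicitly, since the DRM
# drivers no longer pull in CONFIG_FB after f611b1e7:
CONFIG_FB=y
```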
      
      Fixes: f611b1e7 ("drm: Avoid circular dependencies for CONFIG_FB")
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Cc: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20210807225518.3607126-1-linus.walleij@linaro.org
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      813bacf4
    • B
      x86/resctrl: Fix default monitoring groups reporting · 064855a6
      Authored by Babu Moger
      Creating a new sub monitoring group in the root /sys/fs/resctrl leads to
      getting the "Unavailable" value for mbm_total_bytes and mbm_local_bytes
      on the entire filesystem.
      
      Steps to reproduce:
      
        1. mount -t resctrl resctrl /sys/fs/resctrl/
      
        2. cd /sys/fs/resctrl/
      
        3. cat mon_data/mon_L3_00/mbm_total_bytes
           23189832
      
        4. Create sub monitor group:
        mkdir mon_groups/test1
      
        5. cat mon_data/mon_L3_00/mbm_total_bytes
           Unavailable
      
      When a new monitoring group is created, a new RMID is assigned to the
      new group. But the RMID is not active yet. When the events are read on
      the new RMID, it is expected to report the status as "Unavailable".
      
      When the user reads the events on the default monitoring group with
      multiple subgroups, the events on all subgroups are consolidated
      together. Currently, if any of the RMID reads report as "Unavailable",
      then everything will be reported as "Unavailable".
      
      Fix the issue by discarding the "Unavailable" reads and reporting the
      sum of the successful RMID reads. This is not a problem on Intel
      systems, as Intel hardware reports 0 for inactive RMIDs.
      
      Fixes: d89b7379 ("x86/intel_rdt/cqm: Add mon_data")
      Reported-by: Paweł Szulik <pawel.szulik@intel.com>
      Signed-off-by: Babu Moger <Babu.Moger@amd.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Reinette Chatre <reinette.chatre@intel.com>
      Cc: stable@vger.kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=213311
      Link: https://lkml.kernel.org/r/162793309296.9224.15871659871696482080.stgit@bmoger-ubuntu
      064855a6
  12. 12 Aug, 2021 6 commits