1. 16 December 2022, 1 commit
    • powerpc/code-patching: Fix oops with DEBUG_VM enabled · 980411a4
      Committed by Michael Ellerman
      Nathan reported that the new per-cpu mm patching oopses if DEBUG_VM is
      enabled:
      
        ------------[ cut here ]------------
        kernel BUG at arch/powerpc/mm/pgtable.c:333!
        Oops: Exception in kernel mode, sig: 5 [#1]
        LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc2+ #1
        Hardware name: IBM PowerNV (emulated by qemu) POWER9 0x4e1200 opal:v7.0 PowerNV
        ...
        NIP assert_pte_locked+0x180/0x1a0
        LR  assert_pte_locked+0x170/0x1a0
        Call Trace:
          0x60000000 (unreliable)
          patch_instruction+0x618/0x6d0
          arch_prepare_kprobe+0xfc/0x2d0
          register_kprobe+0x520/0x7c0
          arch_init_kprobes+0x28/0x3c
          init_kprobes+0x108/0x184
          do_one_initcall+0x60/0x2e0
          kernel_init_freeable+0x1f0/0x3e0
          kernel_init+0x34/0x1d0
          ret_from_kernel_thread+0x5c/0x64
      
      It's caused by the assert_spin_locked() failing in assert_pte_locked().
      The assert fails because the PTE was unlocked in text_area_cpu_up_mm(),
      and never relocked.
      
      The PTE page shouldn't be freed: the patching_mm is only used for
      patching on this CPU, only that single PTE is ever mapped, and it's only
      unmapped at CPU offline.
      
      In fact assert_pte_locked() has a special case to ignore init_mm
      entirely, and the patching_mm is more-or-less like init_mm, so possibly
      the check could be skipped for patching_mm too.
      
      But for now be conservative, and use the proper PTE accessors at
      patching time, so that the PTE lock is held while the PTE is used. That
      also avoids the warning in assert_pte_locked().
      
      With that it's no longer necessary to save the PTE in
      cpu_patching_context for the mm_patch_enabled() case.
      
      Fixes: c28c15b6 ("powerpc/code-patching: Use temporary mm for Radix MMU")
      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20221216125913.990972-1-mpe@ellerman.id.au
      980411a4
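
      A minimal sketch of the locked-PTE pattern described above, assuming the
      generic get_locked_pte()/pte_unmap_unlock() helpers; the wrapper name and
      exact call sequence are illustrative, not the literal upstream diff:

        #include <linux/mm.h>

        static int __patch_with_locked_pte(struct mm_struct *patching_mm,
                                           unsigned long addr, struct page *page)
        {
                spinlock_t *ptl;
                pte_t *ptep;

                /* Takes the PTE lock and keeps it held while the PTE is used. */
                ptep = get_locked_pte(patching_mm, addr, &ptl);
                if (!ptep)
                        return -ENOMEM;

                set_pte_at(patching_mm, addr, ptep, mk_pte(page, PAGE_KERNEL));

                /* ... switch to patching_mm and write the instruction ... */

                pte_clear(patching_mm, addr, ptep);

                /* Drops the PTE lock; assert_pte_locked() is happy throughout. */
                pte_unmap_unlock(ptep, ptl);
                return 0;
        }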
  2. 02 December 2022, 4 commits
    • powerpc/code-patching: Remove protection against patching init addresses after init · 6f3a81b6
      Committed by Christophe Leroy
      Once the init section is freed, attempting to patch init code
      ends up in the weeds.
      
      Commit 51c3c62b ("powerpc: Avoid code patching freed init sections")
      protected patch_instruction() against that, but it is the responsibility
      of the caller to ensure that the patched memory is valid.
      
      All callers have now been verified and fixed so the check
      can be removed.
      
      This improves ftrace activation by about 2% on 8xx.
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/504310828f473d424e2ed229eff57bf075f52796.1669969781.git.christophe.leroy@csgroup.eu
      6f3a81b6
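
      For illustration only, the caller-side responsibility might look like the
      following hedged sketch (the wrapper is hypothetical; kernel_text_address()
      returns false for init text once it has been freed):

        #include <linux/kernel.h>
        #include <linux/kallsyms.h>
        #include <asm/code-patching.h>

        /* Hypothetical guard: with patch_instruction()'s internal check gone,
         * callers must not hand it a freed init address themselves. */
        static int patch_if_kernel_text(u32 *addr, ppc_inst_t insn)
        {
                if (!kernel_text_address((unsigned long)addr))
                        return -EINVAL;

                return patch_instruction(addr, insn);
        }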
    • powerpc/code-patching: Remove #ifdef CONFIG_STRICT_KERNEL_RWX · 84ecfe6f
      Committed by Christophe Leroy
      No need to have one implementation of patch_instruction() for
      CONFIG_STRICT_KERNEL_RWX and one for !CONFIG_STRICT_KERNEL_RWX.
      
      In patch_instruction(), call raw_patch_instruction() when
      !CONFIG_STRICT_KERNEL_RWX.
      
      In poking_init(), bail out immediately; this makes it equivalent
      to the weak default implementation.
      
      Everything else is declared static and will be discarded by
      GCC when !CONFIG_STRICT_KERNEL_RWX.
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/f67d2a109404d03e8fdf1ea15388c8778337a76b.1669969781.git.christophe.leroy@csgroup.eu
      84ecfe6f
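
      The resulting shape is roughly the following sketch (do_patch_instruction()
      stands in for the STRICT_KERNEL_RWX path; everything only used by that path
      stays static so the compiler can drop it):

        int patch_instruction(u32 *addr, ppc_inst_t instr)
        {
                /* Without STRICT_KERNEL_RWX, kernel text is writable in place. */
                if (!IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
                        return raw_patch_instruction(addr, instr);

                return do_patch_instruction(addr, instr);
        }

        void __init poking_init(void)
        {
                /* Equivalent to the weak default implementation: do nothing. */
                if (!IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
                        return;

                /* ... register the CPU hotplug callbacks that set up the area ... */
        }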
    • powerpc/code-patching: Consolidate and cache per-cpu patching context · 2f228ee1
      Committed by Benjamin Gray
      With the temp mm context support, there are CPU local variables to hold
      the patch address and pte. Use these in the non-temp mm path as well
      instead of adding a level of indirection through the text_poke_area
      vm_struct and pointer chasing the pte.
      
      As both paths use these fields now, there is no need to let unreferenced
      variables be dropped by the compiler, so it is cleaner to merge them
      into a single context struct. This has the additional benefit of
      removing a redundant CPU local pointer, as only one of cpu_patching_mm /
      text_poke_area is ever used, while remaining well-typed. It also groups
      each CPU's data into a single cacheline.
      Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
      [mpe: Shorten name to 'area' as suggested by Christophe]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20221109045112.187069-10-bgray@linux.ibm.com
      2f228ee1
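
      The consolidated per-CPU state described above is roughly the following
      (field and type names are assumed from the description; 'area' is the
      shortened member name mentioned in the mpe note):

        struct patching_context {
                union {
                        struct vm_struct *area; /* text_poke_area path */
                        struct mm_struct *mm;   /* mm_patch_enabled() path */
                };
                unsigned long addr;             /* per-CPU patching address */
                pte_t *pte;                     /* PTE backing the patch mapping */
        };

        /* One instance per CPU, keeping each CPU's data in a single cacheline. */
        static DEFINE_PER_CPU_ALIGNED(struct patching_context, cpu_patching_context);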
    • powerpc/code-patching: Use temporary mm for Radix MMU · c28c15b6
      Committed by Christopher M. Riedl
      x86 supports the notion of a temporary mm which restricts access to
      temporary PTEs to a single CPU. A temporary mm is useful for situations
      where a CPU needs to perform sensitive operations (such as patching a
      STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
      said mappings to other CPUs. Another benefit is that other CPU TLBs do
      not need to be flushed when the temporary mm is torn down.
      
      Mappings in the temporary mm can be set in the userspace portion of the
      address-space.
      
      Interrupts must be disabled while the temporary mm is in use. HW
      breakpoints, which may have been set by userspace as watchpoints on
      addresses now within the temporary mm, are saved and disabled when
      loading the temporary mm. The HW breakpoints are restored when unloading
      the temporary mm. All HW breakpoints are indiscriminately disabled while
      the temporary mm is in use - this may include breakpoints set by perf.
      
      Use the `poking_init` init hook to prepare a temporary mm and patching
      address. Initialize the temporary mm using mm_alloc(). Choose a
      randomized patching address inside the temporary mm userspace address
      space. The patching address is randomized between PAGE_SIZE and
      DEFAULT_MAP_WINDOW-PAGE_SIZE.
      
      Bits of entropy with 64K page size on BOOK3S_64:
      
      	bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
      
      	PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
      	bits of entropy = log2(128TB / 64K)
      	bits of entropy = 31
      
      The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
      operates - by default the space above DEFAULT_MAP_WINDOW is not
      available. Currently the Hash MMU does not use a temporary mm so
      technically this upper limit isn't necessary; however, a larger
      randomization range does not further "harden" this overall approach and
      future work may introduce patching with a temporary mm on Hash as well.
      
      Randomization occurs only once during initialization for each CPU as it
      comes online.
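
      A sketch of that selection (the helper name is hypothetical; DEFAULT_MAP_WINDOW,
      PAGE_SIZE/PAGE_SHIFT and get_random_long() come from the usual powerpc and
      random headers):

        static unsigned long random_patching_addr(void)
        {
                /* Page-aligned address in [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE),
                 * giving about log2(DEFAULT_MAP_WINDOW / PAGE_SIZE) bits of entropy. */
                unsigned long pages = (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE) >> PAGE_SHIFT;

                return PAGE_SIZE + (get_random_long() % pages) * PAGE_SIZE;
        }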
      
      The patching page is mapped with PAGE_KERNEL to set EAA[0] for the PTE
      which ignores the AMR (so no need to unlock/lock KUAP) according to
      PowerISA v3.0b Figure 35 on Radix.
      
      Based on x86 implementation:
      
      commit 4fc19708
      ("x86/alternatives: Initialize temporary mm for patching")
      
      and:
      
      commit b3fd8e83
      ("x86/alternatives: Use temporary mm for text poking")
      
      From: Benjamin Gray <bgray@linux.ibm.com>
      
      Synchronisation is done according to ISA 3.1B Book 3 Chapter 13
      "Synchronization Requirements for Context Alterations". Switching the mm
      is a change to the PID, which requires a CSI before and after the change,
      and a hwsync between the change and the last prior instruction that
      performs address translation for an associated storage access.
      
      Instruction fetch is an associated storage access, but the instruction
      address mappings are not being changed, so it should not matter which
      context they use. We must still perform a hwsync to guard arbitrary
      prior code that may have accessed a userspace address.
      
      TLB invalidation is local and VA specific. Local because only this core
      used the patching mm, and VA specific because we only care that the
      writable mapping is purged. Leaving the other mappings intact is more
      efficient, especially when performing many code patches in a row (e.g.,
      as ftrace would).
      Signed-off-by: Christopher M. Riedl <cmr@bluescreens.de>
      Signed-off-by: Benjamin Gray <bgray@linux.ibm.com>
      [mpe: Use mm_alloc() per 107b6828a7cd ("x86/mm: Use mm_alloc() in poking_init()")]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20221109045112.187069-9-bgray@linux.ibm.com
      c28c15b6
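
      Putting the above together, the per-patch flow is roughly the following;
      every helper name here is a hypothetical placeholder for what the message
      describes, not the upstream function names:

        static int patch_via_temp_mm(u32 *exec_addr, ppc_inst_t insn,
                                     struct mm_struct *patching_mm,
                                     unsigned long patching_addr)
        {
                u32 *writable = (u32 *)(patching_addr | offset_in_page(exec_addr));
                int err;

                /* Interrupts are already disabled by the caller. */
                map_patch_pte(patching_mm, patching_addr, virt_to_page(exec_addr));

                /* PID change: hwsync + CSI around it; HW breakpoints are saved
                 * and disabled while the temporary mm is loaded. */
                start_using_temp_mm(patching_mm);

                err = write_insn(writable, insn);  /* store, then flush exec_addr's icache */

                /* Unmap and purge only the writable alias from this CPU's TLB:
                 * local and VA-specific, as described above. */
                unmap_patch_pte(patching_mm, patching_addr);
                local_flush_patch_tlb(patching_mm, patching_addr);

                /* Switch back to the previous mm; HW breakpoints are restored. */
                stop_using_temp_mm(patching_mm);

                return err;
        }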
  3. 30 November 2022, 1 commit
  4. 01 September 2022, 1 commit
  5. 19 May 2022, 3 commits
  6. 11 May 2022, 2 commits
  7. 08 May 2022, 1 commit
  8. 07 March 2022, 1 commit
    • powerpc/code-patching: Pre-map patch area · 591b4b26
      Committed by Michael Ellerman
      Paul reported a warning with DEBUG_ATOMIC_SLEEP=y:
      
        BUG: sleeping function called from invalid context at include/linux/sched/mm.h:256
        in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
        preempt_count: 0, expected: 0
        ...
        Call Trace:
          dump_stack_lvl+0xa0/0xec (unreliable)
          __might_resched+0x2f4/0x310
          kmem_cache_alloc+0x220/0x4b0
          __pud_alloc+0x74/0x1d0
          hash__map_kernel_page+0x2cc/0x390
          do_patch_instruction+0x134/0x4a0
          arch_jump_label_transform+0x64/0x78
          __jump_label_update+0x148/0x180
          static_key_enable_cpuslocked+0xd0/0x120
          static_key_enable+0x30/0x50
          check_kvm_guest+0x60/0x88
          pSeries_smp_probe+0x54/0xb0
          smp_prepare_cpus+0x3e0/0x430
          kernel_init_freeable+0x20c/0x43c
          kernel_init+0x30/0x1a0
          ret_from_kernel_thread+0x5c/0x64
      
      Peter pointed out that this is because do_patch_instruction() has
      disabled interrupts, but then map_patch_area() calls map_kernel_page()
      then hash__map_kernel_page() which does a sleeping memory allocation.
      
      We only see the warning in KVM guests with SMT enabled, which is not
      particularly common, or on other platforms if CONFIG_KPROBES is
      disabled, also not common. The reason we don't see it in most
      configurations is that another path that happens to have interrupts
      enabled has allocated the required page tables for us, eg. there's a
      path in kprobes init that does that. That's just pure luck though.
      
      As Christophe suggested, the simplest solution is to do a dummy
      map/unmap when we initialise the patching, so that any required page
      table levels are pre-allocated before the first call to
      do_patch_instruction(). This works because the unmap doesn't free any
      page tables that were allocated by the map, it just clears the PTE,
      leaving the page table levels there for the next map.
      Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Debugged-by: Peter Zijlstra <peterz@infradead.org>
      Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20220223015821.473097-1-mpe@ellerman.id.au
      591b4b26
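
      A sketch of that initialisation-time dummy map/unmap (placement and wrapper
      simplified; map_patch_area() is the helper named above, unmap_patch_area()
      its assumed counterpart, empty_zero_page a throwaway target):

        static int __init premap_patch_area(struct vm_struct *area)
        {
                unsigned long addr = (unsigned long)area->addr;

                /* May sleep: allocates any missing PUD/PMD/PTE levels up front. */
                if (map_patch_area(empty_zero_page, addr))
                        return -ENOMEM;

                /* Clears only the PTE; the allocated levels stay in place for the
                 * interrupts-off mappings done later by do_patch_instruction(). */
                unmap_patch_area(addr);

                return 0;
        }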
  9. 23 December 2021, 11 commits
  10. 09 December 2021, 1 commit
  11. 29 November 2021, 1 commit
  12. 25 November 2021, 1 commit
  13. 07 October 2021, 1 commit
  14. 21 June 2021, 1 commit
  15. 16 June 2021, 4 commits
  16. 21 April 2021, 1 commit
  17. 26 March 2021, 1 commit
  18. 15 September 2020, 1 commit
  19. 26 July 2020, 1 commit
  20. 10 June 2020, 1 commit
    • mm: don't include asm/pgtable.h if linux/mm.h is already included · e31cf2f4
      Committed by Mike Rapoport
      Patch series "mm: consolidate definitions of page table accessors", v2.
      
      The low level page table accessors (pXY_index(), pXY_offset()) are
      duplicated across all architectures and sometimes more than once.  For
      instance, we have 31 definitions of pgd_offset() for 25 supported
      architectures.
      
      Most of these definitions are actually identical and typically it boils
      down to, e.g.
      
      static inline unsigned long pmd_index(unsigned long address)
      {
              return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
      }
      
      static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
      {
              return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
      }
      
      These definitions can be shared among 90% of the arches provided
      XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
      
      For architectures that really need a custom version there is always the
      possibility to override the generic version with the usual ifdef magic.
      
      These patches introduce include/linux/pgtable.h that replaces
      include/asm-generic/pgtable.h and add the definitions of the page table
      accessors to the new header.
      
      This patch (of 12):
      
      The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
      functions involving page table manipulations, e.g.  pte_alloc() and
      pmd_alloc().  So, there is no point to explicitly include <asm/pgtable.h>
      in the files that include <linux/mm.h>.
      
      The include statements in such cases are removed with a simple loop:
      
      	for f in $(git grep -l "include <linux/mm.h>") ; do
      		sed -i -e '/include <asm\/pgtable.h>/ d' $f
      	done
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
      Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e31cf2f4
  21. 05 June 2020, 1 commit