1. 19 Jun, 2019 1 commit
  2. 28 May, 2019 1 commit
  3. 02 May, 2019 1 commit
  4. 28 Apr, 2019 1 commit
  5. 21 Apr, 2019 5 commits
  6. 13 Mar, 2019 1 commit
    • M
      treewide: add checks for the return value of memblock_alloc*() · 8a7f97b9
      Mike Rapoport authored
      Add checks for the return value of memblock_alloc*() functions and call
      panic() in case of error.  The panic message repeats the one used by
      panicking memblock allocators, with the parameters adjusted to include
      only the relevant ones.
      
      The replacement was mostly automated with semantic patches like the one
      below with manual massaging of format strings.
      
        @@
        expression ptr, size, align;
        @@
        ptr = memblock_alloc(size, align);
        + if (!ptr)
        + 	panic("%s: Failed to allocate %lu bytes align=0x%lx\n", __func__, size, align);
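      In plain C, the shape of each transformed call site looks like the sketch
      below; alloc_or_panic() is a hypothetical userspace analogue that uses
      aligned_alloc() and abort() in place of memblock_alloc() and panic():

```c
#include <stdio.h>
#include <stdlib.h>

/* Userspace analogue of the pattern the semantic patch introduces: every
 * allocation gains an explicit NULL check that reports the function name,
 * size and alignment before aborting. aligned_alloc()/abort() stand in
 * for the kernel's memblock_alloc()/panic(). */
static void *alloc_or_panic(size_t size, size_t align, const char *func)
{
    void *ptr = aligned_alloc(align, size); /* size must be a multiple of align */

    if (!ptr) {
        fprintf(stderr, "%s: Failed to allocate %lu bytes align=0x%lx\n",
                func, (unsigned long)size, (unsigned long)align);
        abort();
    }
    return ptr;
}
```

      A successful call simply returns the buffer; only the failure path
      differs from a bare allocation.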
      
      [anders.roxell@linaro.org: use '%pa' with 'phys_addr_t' type]
        Link: http://lkml.kernel.org/r/20190131161046.21886-1-anders.roxell@linaro.org
      [rppt@linux.ibm.com: fix format strings for panics after memblock_alloc]
        Link: http://lkml.kernel.org/r/1548950940-15145-1-git-send-email-rppt@linux.ibm.com
      [rppt@linux.ibm.com: don't panic if the allocation in sparse_buffer_init fails]
        Link: http://lkml.kernel.org/r/20190131074018.GD28876@rapoport-lnx
      [akpm@linux-foundation.org: fix xtensa printk warning]
      Link: http://lkml.kernel.org/r/1548057848-15136-20-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
      Reviewed-by: Guo Ren <ren_guo@c-sky.com>		[c-sky]
      Acked-by: Paul Burton <paul.burton@mips.com>		[MIPS]
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>	[s390]
      Reviewed-by: Juergen Gross <jgross@suse.com>		[Xen]
      Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Acked-by: Max Filippov <jcmvbkbc@gmail.com>		[xtensa]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a7f97b9
  7. 08 Mar, 2019 1 commit
    • M
      powerpc: prefer memblock APIs returning virtual address · f806714f
      Mike Rapoport authored
      Patch series "memblock: simplify several early memory allocation", v4.
      
      These patches simplify some of the early memory allocations by replacing
      usage of older memblock APIs with newer and shinier ones.
      
      Quite a few places in the arch/ code allocated memory using a memblock
      API that returns a physical address of the allocated area, then
      converted this physical address to a virtual one and then used memset(0)
      to clear the allocated range.
      
      More recent memblock APIs do all three steps in one call and their
      usage simplifies the code.
      
      It's important to note that regardless of the API used, the core allocation
      is nearly identical for any set of memblock allocators: first it tries
      to find free memory with all the constraints specified by the caller
      and then falls back to the allocation with some or all constraints
      disabled.
      
      The first three patches perform the conversion of call sites that have
      exact requirements for the node and the possible memory range.
      
      The fourth patch is a bit one-off as it simplifies openrisc's
      implementation of pte_alloc_one_kernel(), and not only the memblock
      usage.
      
      The fifth patch takes care of simpler cases when the allocation can be
      satisfied with a simple call to memblock_alloc().
      
      The sixth patch removes one-liner wrappers for memblock_alloc on arm and
      unicore32, as suggested by Christoph.
      
      This patch (of 6):
      
      There are several places that allocate memory using memblock APIs that
      return a physical address, convert the returned address to the virtual
      address and frequently also memset(0) the allocated range.
      
      Update these places to use memblock allocators already returning a
      virtual address.  Use memblock functions that clear the allocated memory
      instead of calling memset(0) where appropriate.
      
      The calls to memblock_alloc_base() that were not followed by memset(0)
      are replaced with memblock_alloc_try_nid_raw().  Since the latter does
      not panic() when the allocation fails, the appropriate panic() calls are
      added to the call sites.
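      A minimal userspace sketch of the conversion, with phys_alloc(), to_virt()
      and alloc_zeroed() as hypothetical stand-ins for memblock_alloc_base(),
      __va() and the newer one-call allocators:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins: phys_alloc() plays the role of an allocator
 * returning a "physical" address, to_virt() plays __va(). */
static uintptr_t phys_alloc(size_t size) { return (uintptr_t)malloc(size); }
static void *to_virt(uintptr_t pa)       { return (void *)pa; }

/* Old style: three separate steps at every call site. */
static void *old_style(size_t size)
{
    uintptr_t pa = phys_alloc(size);
    void *va = to_virt(pa);

    memset(va, 0, size);
    return va;
}

/* New style: one call allocates, converts and zeroes. calloc() stands
 * in for a memblock allocator that returns zeroed virtual memory. */
static void *alloc_zeroed(size_t size)
{
    return calloc(1, size);
}
```

      Both paths hand back a zeroed buffer; the new style just removes the
      boilerplate from each call site.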
      
      Link: http://lkml.kernel.org/r/1546248566-14910-2-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Simek <michal.simek@xilinx.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f806714f
  8. 06 Mar, 2019 1 commit
  9. 20 Oct, 2018 6 commits
    • M
      powerpc/mm/radix: Display if mappings are exec or not · afb6d064
      Michael Ellerman authored
      At boot we print the ranges we've mapped for the linear mapping and
      what page size we've used. Also track whether the range is mapped
      executable or not and display that as well.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      afb6d064
    • M
      powerpc/mm/radix: Simplify split mapping logic · 232aa407
      Michael Ellerman authored
      If we look closely at the logic in create_physical_mapping(), when
      we're doing STRICT_KERNEL_RWX, we do the following steps:
        - determine the gap from where we are to the end of the range
        - choose an appropriate mapping_size based on the gap
        - check if that mapping_size would overlap the __init_begin
          boundary, and if not choose an appropriate mapping_size
      
      We can simplify the logic by taking the __init_begin boundary into
      account when we calculate the initial gap.
      
      So add a next_boundary() function which tells us what the next
      boundary is, either the __init_begin boundary or end. In future we can
      add more boundaries.
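      The next_boundary() idea can be sketched as below, with init_begin passed
      as a parameter since the kernel's __init_begin linker symbol is not
      available here (illustrative code, not the exact kernel implementation):

```c
#include <stdint.h>

/* The gap is always computed up to the nearest boundary: either the
 * text/data boundary (init_begin) if it lies inside the range, or the
 * end of the range itself. */
static uint64_t next_boundary(uint64_t addr, uint64_t end, uint64_t init_begin)
{
    if (addr < init_begin && init_begin < end)
        return init_begin;
    return end;
}
```

      With the gap clipped at the boundary up front, the mapping-size choice no
      longer needs a separate overlap check.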
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      232aa407
    • M
      powerpc/mm/radix: Remove the retry in the split mapping logic · 57306c66
      Michael Ellerman authored
      When we have CONFIG_STRICT_KERNEL_RWX enabled, we want to split the
      linear mapping at the text/data boundary so we can map the kernel
      text read only.
      
      The current logic uses a goto inside the for loop, which works, but is
      hard to reason about.
      
      When we hit the goto retry case we set max_mapping_size to PMD_SIZE
      and go back to the start.
      
      Setting max_mapping_size means we skip the PUD case and go to the PMD
      case.
      
      We know we will pass the alignment and gap checks because the only
      reason we are there is we hit the goto retry, and that is guarded by
      mapping_size == PUD_SIZE, which means addr is PUD aligned and gap is
      greater or equal to PUD_SIZE.
      
      So the only part of the check that can fail is the mmu_psize_defs
      check for the 2M page size.
      
      If we just duplicate that check we can avoid the goto, and we get the
      same result.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      57306c66
    • M
      powerpc/mm/radix: Fix small page at boundary when splitting · 81d1b54d
      Michael Ellerman authored
      When we have CONFIG_STRICT_KERNEL_RWX enabled, we want to split the
      linear mapping at the text/data boundary so we can map the kernel
      text read only.
      
      Currently we always use a small page at the text/data boundary, even
      when that's not necessary:
      
        Mapped 0x0000000000000000-0x0000000000e00000 with 2.00 MiB pages
        Mapped 0x0000000000e00000-0x0000000001000000 with 64.0 KiB pages
        Mapped 0x0000000001000000-0x0000000040000000 with 2.00 MiB pages
      
      This is because the check that the mapping crosses the __init_begin
      boundary is too strict; it also returns true when we map exactly up to
      the boundary.
      
      So fix it to check that the mapping would actually map past
      __init_begin, and with that we see:
      
        Mapped 0x0000000000000000-0x0000000040000000 with 2.00 MiB pages
        Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages
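      The two checks can be contrasted in a small self-contained sketch
      (illustrative names, not the kernel's exact code):

```c
#include <stdbool.h>
#include <stdint.h>

/* The too-strict check: a mapping that ends exactly at the boundary is
 * flagged as crossing it, forcing a needless drop to small pages. */
static bool crosses_boundary_strict(uint64_t addr, uint64_t size, uint64_t boundary)
{
    return addr < boundary && addr + size >= boundary;
}

/* The fixed check: only true when the mapping would actually map past
 * the boundary, i.e. genuinely straddle it. */
static bool crosses_boundary_fixed(uint64_t addr, uint64_t size, uint64_t boundary)
{
    return addr < boundary && addr + size > boundary;
}
```

      With the boundary at 16M, a 2M mapping starting at 14M ends exactly at
      the boundary: the strict check rejects it while the fixed one allows it.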
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      81d1b54d
    • M
      powerpc/mm/radix: Fix overuse of small pages in splitting logic · 3b5657ed
      Michael Ellerman authored
      When we have CONFIG_STRICT_KERNEL_RWX enabled, we want to split the
      linear mapping at the text/data boundary so we can map the kernel text
      read only.
      
      But the current logic uses small pages for the entire text section,
      regardless of whether a larger page size would fit. eg. with the
      boundary at 16M we could use 2M pages, but instead we use 64K pages up
      to the 16M boundary:
      
        Mapped 0x0000000000000000-0x0000000001000000 with 64.0 KiB pages
        Mapped 0x0000000001000000-0x0000000040000000 with 2.00 MiB pages
        Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages
      
      This is because the test is checking if addr is < __init_begin
      and addr + mapping_size is >= _stext. But that is true for all pages
      between _stext and __init_begin.
      
      Instead what we want to check is if we are crossing the text/data
      boundary, which is at __init_begin. With that fixed we see:
      
        Mapped 0x0000000000000000-0x0000000000e00000 with 2.00 MiB pages
        Mapped 0x0000000000e00000-0x0000000001000000 with 64.0 KiB pages
        Mapped 0x0000000001000000-0x0000000040000000 with 2.00 MiB pages
        Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages
      
      ie. we're correctly using 2MB pages below __init_begin, but we still
      drop down to 64K pages unnecessarily at the boundary.
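      The difference between the buggy predicate and the intended one can be
      sketched as follows (illustrative names; _stext and __init_begin become
      parameters here):

```c
#include <stdbool.h>
#include <stdint.h>

/* The buggy test: true for every page between _stext and __init_begin,
 * so the whole text section was forced onto small pages. */
static bool overlaps_text_buggy(uint64_t addr, uint64_t size,
                                uint64_t stext, uint64_t init_begin)
{
    return addr < init_begin && addr + size >= stext;
}

/* What we actually want to know: does this mapping cross the text/data
 * boundary at __init_begin? */
static bool crosses_init_begin(uint64_t addr, uint64_t size, uint64_t init_begin)
{
    return addr < init_begin && addr + size > init_begin;
}
```

      With _stext at 0 and the boundary at 16M, a 2M page at address 0 trips
      the buggy test even though it is nowhere near the boundary.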
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      3b5657ed
    • M
      powerpc/mm/radix: Fix off-by-one in split mapping logic · 5c6499b7
      Michael Ellerman authored
      When we have CONFIG_STRICT_KERNEL_RWX enabled, we try to split the
      kernel linear (1:1) mapping so that the kernel text is in a separate
      page to kernel data, so we can mark the former read-only.
      
      We could achieve that just by always using 64K pages for the linear
      mapping, but we try to be smarter. Instead we use huge pages when
      possible, and only switch to smaller pages when necessary.
      
      However we have an off-by-one bug in that logic, which causes us to
      calculate the wrong boundary between text and data.
      
      For example with the end of the kernel text at 16M we see:
      
        radix-mmu: Mapped 0x0000000000000000-0x0000000001200000 with 64.0 KiB pages
        radix-mmu: Mapped 0x0000000001200000-0x0000000040000000 with 2.00 MiB pages
        radix-mmu: Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages
      
      ie. we mapped from 0 to 18M with 64K pages, even though the boundary
      between text and data is at 16M.
      
      With the fix we see we're correctly hitting the 16M boundary:
      
        radix-mmu: Mapped 0x0000000000000000-0x0000000001000000 with 64.0 KiB pages
        radix-mmu: Mapped 0x0000000001000000-0x0000000040000000 with 2.00 MiB pages
        radix-mmu: Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      5c6499b7
  10. 23 Aug, 2018 1 commit
  11. 13 Aug, 2018 1 commit
  12. 16 Jul, 2018 1 commit
  13. 03 Jun, 2018 5 commits
    • N
      powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags · f1cb8f9b
      Nicholas Piggin authored
      The ISA suggests ptesync after setting a pte, to prevent a table walk
      initiated by a subsequent access from missing that store and causing a
      spurious fault. This is an architectural allowance that allows an
      implementation's page table walker to be incoherent with the store
      queue.
      
      However there is no correctness problem in taking a spurious fault in
      userspace -- the kernel copes with these at any time, so the updated
      pte will be found eventually. Spurious kernel faults on vmap memory
      must be avoided, so a ptesync is put into flush_cache_vmap.
      
      On POWER9 so far I have not found a measurable window where this can
      result in more minor faults, so as an optimisation, remove the costly
      ptesync from pte updates. If an implementation benefits from ptesync,
      it would be better to add it back in update_mmu_cache, so it's not
      done for things like fork(2).
      
      fork --fork --exec benchmark improved 5.2% (12400->13100).
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f1cb8f9b
    • N
      powerpc/64s/radix: do not flush TLB when relaxing access · e5f7cb58
      Nicholas Piggin authored
      Radix flushes the TLB when updating ptes to increase permissiveness
      of protection (increase access authority). Book3S does not require
      TLB flushing in this case, and it is not done on hash. This patch
      avoids the flush for radix.
      
      From Power ISA v3.0B, p.1090:
      
          Setting a Reference or Change Bit or Upgrading Access Authority
          (PTE Subject to Atomic Hardware Updates)
      
          If the only change being made to a valid PTE that is subject to
          atomic hardware updates is to set the Reference or Change bit to 1
          or to add access authorities, a simpler sequence suffices because
          the translation hardware will refetch the PTE if an access is
          attempted for which the only problems were reference and/or change
          bits needing to be set or insufficient access authority.
      
      The nest MMU on POWER9 does not re-fetch the PTE after such an access
      attempt before faulting, so address spaces with a coprocessor
      attached will continue to flush in these cases.
      
      This reduces tlbies for a kernel compile workload from 1.28M to 0.95M,
      tlbiels from 20.17M to 19.68M.
      
      fork --fork --exec benchmark improved 2.77% (12000->12300).
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e5f7cb58
    • A
      powerpc/mm/radix: Change pte relax sequence to handle nest MMU hang · bd5050e3
      Aneesh Kumar K.V authored
      When relaxing access (read -> read_write update), the pte needs to be marked
      invalid to handle a nest MMU bug. We also need to do a tlb flush after the
      pte is marked invalid, before updating the pte with the new access bits.

      We also move the tlb flush to the platform specific __ptep_set_access_flags.
      This will help us get rid of unnecessary tlb flushes on BOOK3S 64 later. We
      don't do that in this patch. This also helps in avoiding multiple tlbies
      with a coprocessor attached.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      bd5050e3
    • A
      powerpc/mm: Change function prototype · e4c1112c
      Aneesh Kumar K.V authored
      In a later patch, we use the vma and psize to do the tlb flush. Do the
      prototype update in a separate patch to make the review easy.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e4c1112c
    • A
      powerpc/mm/radix: Move function from radix.h to pgtable-radix.c · 044003b5
      Aneesh Kumar K.V authored
      In a later patch we will update these functions, which requires them to
      be moved to pgtable-radix.c. Keeping the function in radix.h results in
      compile errors as below.
      
      ./arch/powerpc/include/asm/book3s/64/radix.h: In function ‘radix__ptep_set_access_flags’:
      ./arch/powerpc/include/asm/book3s/64/radix.h:196:28: error: dereferencing pointer to incomplete type ‘struct vm_area_struct’
        struct mm_struct *mm = vma->vm_mm;
                                  ^~
      ./arch/powerpc/include/asm/book3s/64/radix.h:204:6: error: implicit declaration of function ‘atomic_read’; did you mean ‘__atomic_load’? [-Werror=implicit-function-declaration]
            atomic_read(&mm->context.copros) > 0) {
            ^~~~~~~~~~~
            __atomic_load
      ./arch/powerpc/include/asm/book3s/64/radix.h:204:21: error: dereferencing pointer to incomplete type ‘struct mm_struct’
            atomic_read(&mm->context.copros) > 0) {
      
      Instead of fixing the header dependencies, we move the function to
      pgtable-radix.c. Also, the function is now too large to be a static
      inline. Doing the move in a separate patch helps in review.
      
      No functional change in this patch. Only code movement.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      044003b5
  14. 15 May, 2018 3 commits
  15. 04 Apr, 2018 1 commit
    • A
      powerpc/mm/radix: Update pte fragment count from 16 to 256 on radix · fb4e5dbd
      Aneesh Kumar K.V authored
      With the split PTL (page table lock) config, we allocate the level
      4 (leaf) page table using the pte fragment framework instead of a slab
      cache like at the other levels. This was done to enable us to have a
      split page table lock at level 4 of the page table. We use page->ptl of
      the page backing all the level 4 pte fragments for the lock.
      
      Currently with Radix, we use only 16 fragments out of the allocated
      page. In radix each fragment is 256 bytes which means we use only 4k
      out of the allocated 64K page wasting 60k of the allocated memory.
      This was done earlier to keep it closer to hash.
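      The arithmetic behind the change can be shown in a small sketch (the
      constant names are illustrative, not the kernel's):

```c
/* Each radix pte fragment is 256 bytes, so a 64K base page holds 256 of
 * them. With only 16 fragments per page, 60K of every page was wasted. */
enum {
    RADIX_PAGE_SIZE = 64 * 1024, /* 64K base page */
    RADIX_FRAG_SIZE = 256,       /* bytes per pte fragment */
};

static unsigned int frag_bytes(unsigned int nr_frags)
{
    return nr_frags * RADIX_FRAG_SIZE;
}
```

      Raising the count from 16 to 256 takes the usage from 4K out of 64K to
      the whole page.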
      
      This patch updates the pte fragment count to 256, thereby using the
      full 64K page and reducing the memory usage. Performance tests show
      really low impact even with THP disabled. With THP disabled we will be
      contending even less on the level 4 ptl and hence the impact should be
      even lower.
      
        256 threads:
          without patch (10 runs of ./ebizzy  -m -n 1000 -s 131072 -S 100)
            median = 15678.5
            stdev = 42.1209
      
          with patch:
            median = 15354
            stdev = 194.743
      
      This is with THP disabled. With THP enabled the impact of the patch
      will be less.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      fb4e5dbd
  16. 30 Mar, 2018 3 commits
  17. 27 Mar, 2018 1 commit
    • M
      powerpc/mm: Fix section mismatch warning in stop_machine_change_mapping() · bde709a7
      Mauricio Faria de Oliveira authored
      Fix the warning messages for stop_machine_change_mapping(), and a number
      of other affected functions in its call chain.
      
      All modified functions are under CONFIG_MEMORY_HOTPLUG, so __meminit
      is okay (keeps them / does not discard them).
      
      Boot-tested on powernv/power9/radix-mmu and pseries/power8/hash-mmu.
      
          $ make -j$(nproc) CONFIG_DEBUG_SECTION_MISMATCH=y vmlinux
          ...
            MODPOST vmlinux.o
          WARNING: vmlinux.o(.text+0x6b130): Section mismatch in reference from the function stop_machine_change_mapping() to the function .meminit.text:create_physical_mapping()
          The function stop_machine_change_mapping() references
          the function __meminit create_physical_mapping().
          This is often because stop_machine_change_mapping lacks a __meminit
          annotation or the annotation of create_physical_mapping is wrong.
      
          WARNING: vmlinux.o(.text+0x6b13c): Section mismatch in reference from the function stop_machine_change_mapping() to the function .meminit.text:create_physical_mapping()
          The function stop_machine_change_mapping() references
          the function __meminit create_physical_mapping().
          This is often because stop_machine_change_mapping lacks a __meminit
          annotation or the annotation of create_physical_mapping is wrong.
          ...
      Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      bde709a7
  18. 13 Feb, 2018 1 commit
    • A
      powerpc/mm: Fix crashes with 16G huge pages · fae22116
      Aneesh Kumar K.V authored
      To support memory keys, we moved the hash pte slot information to the
      second half of the page table. This was ok with PTE entries at level
      4 (PTE page) and level 3 (PMD). We already allocate larger page table
      pages at those levels to accommodate extra details. For level 4 we
      already have the extra space which was used to track 4k hash page
      table entry details and at level 3 the extra space was allocated to
      track the THP details.
      
      With hugetlbfs PTE, we used this extra space at the PMD level to store
      the slot details. But we also support hugetlbfs PTE at PUD level for
      16GB pages and PUD level page didn't allocate extra space. This
      resulted in memory corruption.
      
      Fix this by allocating extra space at PUD level when HUGETLB is
      enabled.
      
      Fixes: bf9a95f9 ("powerpc: Free up four 64K PTE bits in 64K backed HPTE pages")
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Ram Pai <linuxram@us.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      fae22116
  19. 08 Feb, 2018 2 commits
    • B
      powerpc/mm/radix: Split linear mapping on hot-unplug · 4dd5f8a9
      Balbir Singh authored
      This patch splits the linear mapping if the hot-unplug range is
      smaller than the mapping size. The code detects if the mapping needs
      to be split into a smaller size and if so, uses the stop machine
      infrastructure to clear the existing mapping and then remap the
      remaining range using a smaller page size.
      
      The code will skip any region of the mapping that overlaps with kernel
      text and warn about it once. We don't want to remove a mapping where
      the kernel text and the LMB we intend to remove overlap in the same
      TLB mapping as it may affect the currently executing code.
      
      I've tested these changes under a kvm guest with 2 vcpus from a split
      mapping point of view; some of the caveats mentioned above applied to
      the testing I did.
      
      Fixes: 4b5d62ca ("powerpc/mm: add radix__remove_section_mapping()")
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      [mpe: Tweak change log to match updated behaviour]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4dd5f8a9
    • N
      powerpc/64s/radix: Boot-time NULL pointer protection using a guard-PID · eeb715c3
      Nicholas Piggin authored
      This change restores and formalises the behaviour that access to NULL
      or other user addresses by the kernel during boot should fault rather
      than succeed and modify memory. This was inadvertently broken when
      fixing another bug, because it was previously not well defined and
      only worked by chance.
      
      powerpc/64s/radix uses high address bits to select an address space
      "quadrant", which determines which PID and LPID are used to translate
      the rest of the address (effective PID, effective LPID). The kernel
      mapping at 0xC... selects quadrant 3, which uses PID=0 and LPID=0. So
      the kernel page tables are installed in the PID 0 process table entry.
      
      An address at 0x0... selects quadrant 0, which uses PID=PIDR for
      translating the rest of the address (that is, it uses the value of the
      PIDR register as the effective PID). If PIDR=0, then the translation
      is performed with the PID 0 process table entry page tables. This is
      the kernel mapping, so we effectively get another copy of the kernel
      address space at 0. A NULL pointer access will access physical memory
      address 0.
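      The quadrant selection described above can be sketched as a simple
      function of the top two address bits (an illustration of the addressing
      scheme, not kernel code):

```c
#include <stdint.h>

/* The top two bits of a 64-bit effective address pick the quadrant:
 * 0xC... has both bits set (quadrant 3, the kernel mapping, PID=0 and
 * LPID=0), while a NULL or other low address has both clear
 * (quadrant 0, translated with the PIDR value as the effective PID). */
static unsigned int ea_quadrant(uint64_t ea)
{
    return (unsigned int)(ea >> 62);
}
```

      This makes the failure mode concrete: a NULL dereference lands in
      quadrant 0, so whatever PIDR points at decides whether it faults.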
      
      To prevent duplicating the kernel address space in quadrant 0, this
      patch allocates a guard PID containing no translations, and
      initializes PIDR with this during boot, before the MMU is switched on.
      Any kernel access to quadrant 0 will use this guard PID for
      translation and find no valid mappings, and therefore fault.
      
      After boot, this PID will be switched away to user context PIDs, but
      those contain user mappings (and usually NULL pointer protection)
      rather than kernel mapping, which is much safer (and by design). It
      may be in future this is tightened further, which the guard PID could
      be used for.
      
      Commit 371b8044 ("powerpc/64s: Initialize ISAv3 MMU registers before
      setting partition table"), introduced this problem because it zeroes
      PIDR at boot. However previously the value was inherited from firmware
      or kexec, which is not robust and can be zero (e.g., mambo).
      
      Fixes: 371b8044 ("powerpc/64s: Initialize ISAv3 MMU registers before setting partition table")
      Cc: stable@vger.kernel.org # v4.15+
      Reported-by: Florian Weimer <fweimer@redhat.com>
      Tested-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      eeb715c3
  20. 17 Jan, 2018 3 commits