1. 18 Feb 2019 (4 commits)
  2. 04 Feb 2019 (1 commit)
  3. 31 Jan 2019 (1 commit)
    • powerpc/radix: Fix kernel crash with mremap() · 579b9239
      Committed by Aneesh Kumar K.V
      With support for split pmd locks, we use the pmd page's pmd_huge_pte
      pointer to store the deposited page table. In those configs, when we
      move page tables we need to make sure we move the deposited page
      table to the correct pmd page. Otherwise this can result in a crash
      when we withdraw the deposited page table, because we may find
      pmd_huge_pte NULL (a sketch of the required handling follows the
      trace below).
      
      For example:
      
        __split_huge_pmd+0x1070/0x1940
        __split_huge_pmd+0xe34/0x1940 (unreliable)
        vma_adjust_trans_huge+0x110/0x1c0
        __vma_adjust+0x2b4/0x9b0
        __split_vma+0x1b8/0x280
        __do_munmap+0x13c/0x550
        sys_mremap+0x220/0x7e0
        system_call+0x5c/0x70
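
      A minimal sketch of the idea, using the generic
      pgtable_trans_huge_withdraw()/pgtable_trans_huge_deposit() helpers
      (an illustration of the required handling, not the literal patch):

        /*
         * Illustrative sketch: when a THP pmd entry is moved during
         * mremap(), the page table deposited on the old pmd page must
         * follow it to the new pmd page, or a later withdraw finds
         * pmd_huge_pte NULL.
         */
        static void move_deposited_pgtable(struct mm_struct *mm,
                                           pmd_t *old_pmd, pmd_t *new_pmd)
        {
                pgtable_t pgtable;

                /* Take the deposited table off the old pmd page... */
                pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
                /* ...and deposit it on the pmd page holding the entry. */
                pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
        }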
      
      Fixes: 675d9952 ("powerpc/book3s64: Enable split pmd ptlock.")
      Cc: stable@vger.kernel.org # v4.18+
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  4. 30 Jan 2019 (1 commit)
  5. 15 Jan 2019 (1 commit)
  6. 14 Jan 2019 (1 commit)
  7. 05 Jan 2019 (1 commit)
    • mm: treewide: remove unused address argument from pte_alloc functions · 4cf58924
      Committed by Joel Fernandes (Google)
      Patch series "Add support for fast mremap".
      
      This series speeds up the mremap(2) syscall by copying page tables
      at the PMD level even for non-THP systems.  There was concern that
      the extra 'address' argument that mremap passes to pte_alloc might
      do something subtly architecture-specific in the future that would
      break the scheme.  We also found that there is no point in passing
      the 'address' to pte_alloc since it is unused, so this patch removes
      the argument tree-wide, resulting in a nice negative diff as well.
      Along the way we ensured that the enabled architectures do not do
      anything funky with the 'address' argument that would go unnoticed
      by the optimization.  The before/after sketch below illustrates the
      interface change.
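
      As a concrete illustration of the change (a hedged sketch of a
      caller, not a specific call site from the patch):

        /* Before this series: the unused address is threaded through. */
        if (pte_alloc(mm, pmd, address))
                return -ENOMEM;

        /* After: the 'address' argument is gone tree-wide. */
        if (pte_alloc(mm, pmd))
                return -ENOMEM;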
      
      Build and boot tested on x86-64.  Build tested on arm64.  The config
      enablement patch for arm64 will be posted in the future after more
      testing.
      
      The changes were obtained by applying the following Coccinelle
      script (thanks to Julia for answering all the Coccinelle questions!).
      The following fix-ups were done manually:
      * Removal of the address argument from pte_fragment_alloc.
      * Removal of the pte_alloc_one_fast definitions from m68k and
        microblaze.
      
      // Options: --include-headers --no-includes
      // Note: I split the 'identifier fn' line, so if you are manually
      // running it, please unsplit it so it runs for you.
      
      virtual patch
      
      @pte_alloc_func_def depends on patch exists@
      identifier E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      type T2;
      @@
      
       fn(...
      - , T2 E2
       )
       { ... }
      
      @pte_alloc_func_proto_noarg depends on patch exists@
      type T1, T2, T3, T4;
      identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1, T2);
      + T3 fn(T1);
      |
      - T3 fn(T1, T2, T4);
      + T3 fn(T1, T2);
      )
      
      @pte_alloc_func_proto depends on patch exists@
      identifier E1, E2, E4;
      type T1, T2, T3, T4;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1 E1, T2 E2);
      + T3 fn(T1 E1);
      |
      - T3 fn(T1 E1, T2 E2, T4 E4);
      + T3 fn(T1 E1, T2 E2);
      )
      
      @pte_alloc_func_call depends on patch exists@
      expression E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
       fn(...
      -,  E2
       )
      
      @pte_alloc_macro depends on patch exists@
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      identifier a, b, c;
      expression e;
      position p;
      @@
      
      (
      - #define fn(a, b, c) e
      + #define fn(a, b) e
      |
      - #define fn(a, b) e
      + #define fn(a) e
      )
      
      Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 04 Jan 2019 (1 commit)
    • Remove 'type' argument from access_ok() function · 96d4f267
      Committed by Linus Torvalds
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted in this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
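
      For illustration, a typical conversion looks like this (a hedged
      example call site, not taken verbatim from the patch):

        /* Before: */
        if (!access_ok(VERIFY_WRITE, buf, count))
                return -EFAULT;

        /* After: the historical 'type' argument is simply dropped. */
        if (!access_ok(buf, count))
                return -EFAULT;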
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 29 Dec 2018 (1 commit)
  10. 21 Dec 2018 (4 commits)
  11. 20 Dec 2018 (5 commits)
  12. 19 Dec 2018 (8 commits)
  13. 17 Dec 2018 (1 commit)
    • KVM: PPC: Book3S HV: Implement functions to access quadrants 1 & 2 · d7b45615
      Committed by Suraj Jitindar Singh
      The POWER9 radix MMU has the concept of quadrants. The quadrant
      number is the two high bits of the effective address and determines
      the fully qualified address to be used for the translation. The
      fully qualified address consists of the effective lpid, the
      effective pid and the effective address. This gives four possible
      quadrants: 0, 1, 2 and 3.
      
      When accessing these quadrants the fully qualified address is obtained
      as follows:
      
      Quadrant		| Hypervisor		| Guest
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b00	| EA[0:1] = 0b00
      0			| effLPID = 0		| effLPID = LPIDR
      			| effPID  = PIDR	| effPID  = PIDR
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b01	|
      1			| effLPID = LPIDR	| Invalid Access
      			| effPID  = PIDR	|
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b10	|
      2			| effLPID = LPIDR	| Invalid Access
      			| effPID  = 0		|
      --------------------------------------------------------------------------
      			| EA[0:1] = 0b11	| EA[0:1] = 0b11
      3			| effLPID = 0		| effLPID = LPIDR
      			| effPID  = 0		| effPID  = 0
      --------------------------------------------------------------------------
      
      In the guest:
      Quadrant 3 is normally used to address the operating system, since
      this uses effPID=0 and effLPID=LPIDR, meaning the PID register
      doesn't need to be switched.
      Quadrant 0 is normally used to address user space, since the effLPID
      and effPID are taken from the corresponding registers.

      In the host:
      Quadrants 0 and 3 are used as above, however the effLPID is always 0
      to address the host.
      
      Quadrants 1 and 2 can be used by the host to address guest memory using
      a guest effective address. Since the effLPID comes from the LPID register,
      the host loads the LPID of the guest it would like to access (and the
      PID of the process) and can perform accesses to a guest effective
      address.
      
      This means quadrant 1 can be used to address the guest user space and
      quadrant 2 can be used to address the guest operating system from the
      hypervisor, using a guest effective address.
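
      A minimal sketch of the quadrant selection, assuming IBM bit
      numbering where EA[0:1] are the two most significant bits of a
      64-bit effective address (illustration only, not a function from
      the patch):

        static inline unsigned int ea_quadrant(unsigned long ea)
        {
                /* Top two bits of the EA select the quadrant (0..3). */
                return ea >> 62;
        }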
      
      Access to the quadrants can cause a Hypervisor Data Storage
      Interrupt (HDSI) due to being unable to perform partition scoped
      translation. Previously this could only be generated from a guest,
      and so the code path expected us to take the KVM trampoline in the
      interrupt handler. This is no longer the case, so we modify the
      handler to call bad_page_fault() to check whether we were expecting
      this fault, so we can handle it gracefully and just return with an
      error code. In the hash MMU case we still raise an unknown
      exception, since quadrants aren't defined for the hash MMU.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
  14. 09 Dec 2018 (1 commit)
    • powerpc/mm: Fallback to RAM if the altmap is unusable · 9ef34630
      Committed by Oliver O'Halloran
      The "altmap" is used to provide a pool of memory that is reserved for
      the vmemmap backing of hot-plugged memory. This is useful when adding
      large amount of ZONE_DEVICE memory to a system with a limited amount of
      normal memory.
      
      On ppc64 we use huge pages to map the vmemmap which requires the backing
      storage to be contigious and aligned to the hugepage size. The altmap
      implementation allows for the altmap provider to reserve a few PFNs at
      the start of the range for it's own uses and when this occurs the
      first chunk of the altmap is not usable for hugepage mappings. On hash
      there is no sane way to fall back to a normal sized page mapping so we
      fail the allocation. This results in memory hotplug failing with
      ENOMEM when the new range doesn't fall into an existing vmemmap block.
      
      This patch handles this case by falling back to using system memory
      rather than failing if we cannot allocate from the altmap (see the
      sketch below). This fallback should only ever be used for the first
      vmemmap block, so it should not cause excess memory consumption.
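
      A hedged sketch of the resulting allocation order (the helper names
      here are hypothetical placeholders, not the kernel's actual vmemmap
      allocators):

        /* Try the altmap pool first, then fall back to system memory. */
        void *block = NULL;

        if (altmap)
                block = alloc_vmemmap_from_altmap(altmap, page_size, node); /* hypothetical */
        if (!block)
                block = alloc_vmemmap_from_ram(page_size, node);            /* hypothetical */
        if (!block)
                return -ENOMEM;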
      
      Fixes: 7b73d978 ("mm: pass the vmem_altmap to vmemmap_populate")
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  15. 04 Dec 2018 (9 commits)
    • powerpc/mm: dump block address translation on book3s/32 · 7c91efce
      Committed by Christophe Leroy
      This patch adds a debugfs file to dump block address translation:
      
      ~# cat /sys/kernel/debug/powerpc/block_address_translation
      ---[ Instruction Block Address Translations ]---
      0:         -
      1:         -
      2: 0xc0000000-0xcfffffff 0x00000000 Kernel EXEC coherent
      3: 0xd0000000-0xdfffffff 0x10000000 Kernel EXEC coherent
      4:         -
      5:         -
      6:         -
      7:         -
      
      ---[ Data Block Address Translations ]---
      0:         -
      1:         -
      2: 0xc0000000-0xcfffffff 0x00000000 Kernel RW coherent
      3: 0xd0000000-0xdfffffff 0x10000000 Kernel RW coherent
      4:         -
      5:         -
      6:         -
      7:         -
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm: dump segment registers on book3s/32 · 0261a508
      Committed by Christophe Leroy
      This patch creates a debugfs file to show the content of the
      segment registers:
      
        # cat /sys/kernel/debug/segment_registers
        ---[ User Segments ]---
        0x00000000-0x0fffffff Kern key 1 User key 1 VSID 0xade2b0
        0x10000000-0x1fffffff Kern key 1 User key 1 VSID 0xade3c1
        0x20000000-0x2fffffff Kern key 1 User key 1 VSID 0xade4d2
        0x30000000-0x3fffffff Kern key 1 User key 1 VSID 0xade5e3
        0x40000000-0x4fffffff Kern key 1 User key 1 VSID 0xade6f4
        0x50000000-0x5fffffff Kern key 1 User key 1 VSID 0xade805
        0x60000000-0x6fffffff Kern key 1 User key 1 VSID 0xade916
        0x70000000-0x7fffffff Kern key 1 User key 1 VSID 0xadea27
        0x80000000-0x8fffffff Kern key 1 User key 1 VSID 0xadeb38
        0x90000000-0x9fffffff Kern key 1 User key 1 VSID 0xadec49
        0xa0000000-0xafffffff Kern key 1 User key 1 VSID 0xaded5a
        0xb0000000-0xbfffffff Kern key 1 User key 1 VSID 0xadee6b
      
        ---[ Kernel Segments ]---
        0xc0000000-0xcfffffff Kern key 0 User key 1 VSID 0x000ccc
        0xd0000000-0xdfffffff Kern key 0 User key 1 VSID 0x000ddd
        0xe0000000-0xefffffff Kern key 0 User key 1 VSID 0x000eee
        0xf0000000-0xffffffff Kern key 0 User key 1 VSID 0x000fff
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      [mpe: Move it under /sys/kernel/debug/powerpc, make sr_init() __init]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/8xx: Enable 512k hugepage support with HW assistance · 3fb69c6a
      Committed by Christophe Leroy
      To use 512k pages with hardware assistance, the PTEs have to be
      spread every 128 bytes in the L2 table.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/8xx: Enable 8M hugepage support with HW assistance · 22569b88
      Committed by Christophe Leroy
      HW assistance naturally supports 8M huge pages without
      further modifications.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/8xx: Use hardware assistance in TLB handlers · 6a8f911b
      Committed by Christophe Leroy
      Today, on the 8xx, the TLB handlers do a SW tablewalk, doing all
      the calculation in ASM in order to match the Linux page table
      structure.

      The 8xx offers hardware assistance which allows a significant size
      reduction of the TLB handlers, and hence also reduces the time
      spent in the handlers.

      However, using this HW assistance implies some constraints on the
      page table structure (summarised in the sketch after this list):
      - Regardless of the main page size used (4k or 16k), the level 1
      table (PGD) contains 1024 entries and each PGD entry covers a
      4 Mbyte area which is managed by a level 2 table (PTE) also
      containing 1024 entries, each describing a 4k page.
      - 16k pages require 4 identical entries in the L2 table.
      - 512k page PTEs have to be spread every 128 bytes in the L2 table.
      - 8M page PTEs sit at the address pointed to by the L1 entry, and
      each 8M page requires 2 identical entries in the PGD.
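
      A minimal sketch of the implied layout constants (illustrative
      definitions, not taken from the patch):

        /* HW-assisted 8xx page table geometry (illustrative): */
        #define L1_ENTRIES      1024                    /* PGD entries */
        #define L1_SPAN         (4UL << 20)             /* 4 Mbytes per L1 entry */
        #define L2_ENTRIES      1024                    /* PTEs per L2 table, 4k each */
        #define PTES_PER_16K    4                       /* identical L2 entries per 16k page */
        #define STRIDE_512K     (128 / sizeof(pte_t))   /* one 512k PTE every 128 bytes */
        #define PGD_PER_8M      2                       /* identical L1 entries per 8M page */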
      
      This patch modifies the TLB handlers to use HW assistance for 4k
      pages.

      Before this patch, the mean time spent in the TLB miss handlers was:
      - ITLB miss: 80 ticks
      - DTLB miss: 62 ticks
      After this patch, the mean time spent in the TLB miss handlers is:
      - ITLB miss: 72 ticks
      - DTLB miss: 54 ticks
      So the improvement is 10% for ITLB misses and 13% for DTLB misses.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/8xx: Temporarily disable 16k pages and hugepages · 5af543be
      Committed by Christophe Leroy
      In preparation for making use of hardware assistance in the TLB
      handlers, this patch temporarily disables 16k pages and hugepages.
      The reason is that when using HW assistance in 4k pages mode, the
      Linux model fits the HW model for 4k pages and 8M pages.

      However, for 16k pages and 512k mode, some additional work is
      needed to make the Linux model fit the HW model. The 8M pages will
      naturally come back when we switch to HW assistance, without any
      additional handling. In order to keep the following patch smaller,
      the current special handling for 8M pages is removed here as well.

      Therefore the 4k pages mode will be implemented first, without
      support for 512k hugepages. Then the 512k hugepages will be brought
      back, and the 16k pages will be implemented in the following step.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm: remove unnecessary test in pgtable_cache_init() · 32bff4b9
      Committed by Christophe Leroy
      pgtable_cache_add() gracefully handles the case where a cache of
      that size already exists, by returning early with the following
      test:

      	if (PGT_CACHE(shift))
      		return; /* Already have a cache of this size */

      It is therefore unnecessary to test for the existence of the cache
      beforehand.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm: fix a warning when a cache is common to PGD and hugepages · 1e03c7e2
      Committed by Christophe Leroy
      While implementing TLB miss HW assistance on the 8xx, the following
      warning was encountered:
      
      [  423.732965] WARNING: CPU: 0 PID: 345 at mm/slub.c:2412 ___slab_alloc.constprop.30+0x26c/0x46c
      [  423.733033] CPU: 0 PID: 345 Comm: mmap Not tainted 4.18.0-rc8-00664-g2dfff9121c55 #671
      [  423.733075] NIP:  c0108f90 LR: c0109ad0 CTR: 00000004
      [  423.733121] REGS: c455bba0 TRAP: 0700   Not tainted  (4.18.0-rc8-00664-g2dfff9121c55)
      [  423.733147] MSR:  00021032 <ME,IR,DR,RI>  CR: 24224848  XER: 20000000
      [  423.733319]
      [  423.733319] GPR00: c0109ad0 c455bc50 c4521910 c60053c0 007080c0 c0011b34 c7fa41e0 c455be30
      [  423.733319] GPR08: 00000001 c00103a0 c7fa41e0 c49afcc4 24282842 10018840 c079b37c 00000040
      [  423.733319] GPR16: 73f00000 00210d00 00000000 00000001 c455a000 00000100 00000200 c455a000
      [  423.733319] GPR24: c60053c0 c0011b34 007080c0 c455a000 c455a000 c7fa41e0 00000000 00009032
      [  423.734190] NIP [c0108f90] ___slab_alloc.constprop.30+0x26c/0x46c
      [  423.734257] LR [c0109ad0] kmem_cache_alloc+0x210/0x23c
      [  423.734283] Call Trace:
      [  423.734326] [c455bc50] [00000100] 0x100 (unreliable)
      [  423.734430] [c455bcc0] [c0109ad0] kmem_cache_alloc+0x210/0x23c
      [  423.734543] [c455bcf0] [c0011b34] huge_pte_alloc+0xc0/0x1dc
      [  423.734633] [c455bd20] [c01044dc] hugetlb_fault+0x408/0x48c
      [  423.734720] [c455bdb0] [c0104b20] follow_hugetlb_page+0x14c/0x44c
      [  423.734826] [c455be10] [c00e8e54] __get_user_pages+0x1c4/0x3dc
      [  423.734919] [c455be80] [c00e9924] __mm_populate+0xac/0x140
      [  423.735020] [c455bec0] [c00db14c] vm_mmap_pgoff+0xb4/0xb8
      [  423.735127] [c455bf00] [c00f27c0] ksys_mmap_pgoff+0xcc/0x1fc
      [  423.735222] [c455bf40] [c000e0f8] ret_from_syscall+0x0/0x38
      [  423.735271] Instruction dump:
      [  423.735321] 7cbf482e 38fd0008 7fa6eb78 7fc4f378 4bfff5dd 7fe3fb78 4bfffe24 81370010
      [  423.735536] 71280004 41a2ff88 4840c571 4bffff80 <0fe00000> 4bfffeb8 81340010 712a0004
      [  423.735757] ---[ end trace e9b222919a470790 ]---
      
      This warning occurs when calling kmem_cache_zalloc() on a cache
      that has a constructor.

      In this case it happens because the PGD cache and the 512k hugepte
      cache are the same size (4k). While a cache with a constructor is
      created for the PGD, hugepages create their cache without a
      constructor and use kmem_cache_zalloc(). As both expect a cache of
      the same size, the hugepages reuse the cache created for the PGD,
      hence the conflict.
      
      In order to avoid this conflict, this patch:
      - modifies pgtable_cache_add() so that a zeroising constructor is
      added for any cache size (sketched below);
      - replaces calls to kmem_cache_zalloc() with kmem_cache_alloc().
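
      A minimal sketch of the zeroising-constructor approach, assuming a
      fixed 4k object size (the names and the hardcoded size are
      illustrative, not the patch's per-size constructors):

        #include <linux/init.h>
        #include <linux/slab.h>
        #include <linux/string.h>

        /* Zero every object at construction time, so plain
         * kmem_cache_alloc() is enough even when two users alias the
         * same-sized cache. */
        static void pgtable_ctor_4k(void *addr)
        {
                memset(addr, 0, 4096);
        }

        static struct kmem_cache *pgtable_cache_4k;

        static int __init pgtable_cache_4k_init(void)
        {
                pgtable_cache_4k = kmem_cache_create("pgtable-4k", 4096,
                                                     4096, 0,
                                                     pgtable_ctor_4k);
                return pgtable_cache_4k ? 0 : -ENOMEM;
        }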
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm: replace hugetlb_cache by PGT_CACHE(PTE_T_ORDER) · 03566562
      Committed by Christophe Leroy
      Instead of open-coding cache handling for the special case of
      hugepage tables having a single pte_t element, this patch makes use
      of the common pgtable_cache helpers, as sketched below.
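
      Conceptually (a hedged sketch; the surrounding allocation code is
      illustrative, only PGT_CACHE(PTE_T_ORDER) comes from the patch):

        /* Allocate a single-pte hugepage directory from the common
         * pgtable cache sized for one pte_t, instead of a dedicated
         * hugetlb_cache. */
        pte_t *hugepte = kmem_cache_alloc(PGT_CACHE(PTE_T_ORDER),
                                          GFP_KERNEL);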
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>