1. 16 Sep 2020 (5 commits)
    • powerpc/numa: Detect support for coregroup · f9f130ff
      Srikar Dronamraju authored
      Add support for grouping cores based on the device-tree classification.
      - The last domain in the associativity domains always refers to the
      core.
      - If the primary reference domain happens to be the penultimate domain in
      the associativity-domains device-tree property, then there are no
      coregroups. However, if it is not the penultimate domain, then there are
      coregroups. There can be more than one coregroup. For now we are only
      interested in the last (i.e. the smallest) coregroup, one sub-group
      per DIE.
      
      Currently no firmware exposes this grouping, so keep the basis for the
      grouping abstract.  Once firmware starts using this grouping, code will
      be added to detect the type of grouping and adjust the sched-domain
      flags accordingly.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-8-srikar@linux.vnet.ibm.com
    • powerpc/numa: Offline memoryless cpuless node 0 · e75130f2
      Srikar Dronamraju authored
      Currently a Linux kernel with CONFIG_NUMA, on a system with multiple
      possible nodes, marks node 0 as online at boot.  However, in practice
      there are systems where node 0 is memoryless and cpuless.
      
      This can cause numa_balancing to be enabled on systems with only one
      node that has memory and CPUs. The existence of this cpuless,
      memoryless dummy node can confuse users/scripts looking at the output
      of lscpu / numactl.
      
      By marking node 0 as offline, we stop assuming that node 0 is always
      online. If node 0 has CPUs or memory that are online, node 0 will
      again be set online.
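      As a minimal sketch of the intended boot flow (illustrative only;
      numa_cpu_lookup() stands in for powerpc's real CPU-to-node lookup
      helper), nodes are onlined from actual resources rather than assumed:
      
        static void __init online_used_nodes(void)
        {
                unsigned int cpu;
      
                for_each_present_cpu(cpu) {
                        int nid = numa_cpu_lookup(cpu);  /* illustrative */
      
                        if (nid != NUMA_NO_NODE)
                                node_set_online(nid);
                }
                /* memory ranges online their nodes the same way; node 0
                 * stays offline unless something actually maps to it. */
        }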
      
      v5.8
       available: 2 nodes (0,2)
       node 0 cpus:
       node 0 size: 0 MB
       node 0 free: 0 MB
       node 2 cpus: 0 1 2 3 4 5 6 7
       node 2 size: 32625 MB
       node 2 free: 31490 MB
       node distances:
       node   0   2
         0:  10  20
         2:  20  10
      
      proc and sys files
      ------------------
       /sys/devices/system/node/online:            0,2
       /proc/sys/kernel/numa_balancing:            1
       /sys/devices/system/node/has_cpu:           2
       /sys/devices/system/node/has_memory:        2
       /sys/devices/system/node/has_normal_memory: 2
       /sys/devices/system/node/possible:          0-31
      
      v5.8 + patch
      ------------------
       available: 1 nodes (2)
       node 2 cpus: 0 1 2 3 4 5 6 7
       node 2 size: 32625 MB
       node 2 free: 31487 MB
       node distances:
       node   2
         2:  10
      
      proc and sys files
      ------------------
      /sys/devices/system/node/online:            2
      /proc/sys/kernel/numa_balancing:            0
      /sys/devices/system/node/has_cpu:           2
      /sys/devices/system/node/has_memory:        2
      /sys/devices/system/node/has_normal_memory: 2
      /sys/devices/system/node/possible:          0-31
      
      Example of a system with online CPUs/memory on node 0
      (same output with and without the patch):
      numactl -H
      available: 4 nodes (0-3)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
      node 0 size: 32482 MB
      node 0 free: 22994 MB
      node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
      node 1 size: 0 MB
      node 1 free: 0 MB
      node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
      node 2 size: 0 MB
      node 2 free: 0 MB
      node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
      node 3 size: 0 MB
      node 3 free: 0 MB
      node distances:
      node   0   1   2   3
        0:  10  20  40  40
        1:  20  10  40  40
        2:  40  40  10  20
        3:  40  40  20  10
      
      Note: On powerpc, cpu_to_node() of possible but not present CPUs would
      previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
      numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
      queried from vphn"). Without those two commits, a powerpc system might crash.
      
      1. User space applications like numactl and lscpu that parse sysfs tend to
      believe there is an extra online node. This tends to confuse users and
      applications. Other user space applications start believing that the system
      was not able to use all its resources (i.e. resources are missing) or that
      the system was not set up correctly.
      
      2. The existence of the dummy node also leads to inconsistent information:
      the number of online nodes is inconsistent with the information in the
      device tree and resource dumps.
      
      3. When the dummy node is present, single-node non-NUMA systems end up
      showing up as NUMA systems and numa_balancing gets enabled. This means we
      take the hit from unnecessary NUMA hinting faults.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-4-srikar@linux.vnet.ibm.com
    • powerpc/numa: Prefer node id queried from vphn · 6398eaa2
      Srikar Dronamraju authored
      The node id queried from the static device tree may not be correct.
      For example, it may always show 0 on a shared processor. Hence prefer
      the node id queried from VPHN, and fall back to the device-tree-based
      node id if the VPHN query fails.
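      The shape of the lookup, as a hedged sketch (vphn_get_nid() names the
      VPHN query helper; error handling is illustrative):
      
        static int numa_setup_cpu(unsigned long lcpu)
        {
                struct device_node *cpu;
                int nid;
      
                /* Prefer VPHN: the static device tree may report node 0
                 * for every CPU on a shared processor. */
                nid = vphn_get_nid(lcpu);
                if (nid != NUMA_NO_NODE)
                        return nid;
      
                /* Fall back to the device-tree node id. */
                cpu = of_get_cpu_node(lcpu, NULL);
                if (cpu) {
                        nid = of_node_to_nid(cpu);
                        of_node_put(cpu);
                }
                return nid;
        }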
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-3-srikar@linux.vnet.ibm.com
    • powerpc/numa: Set numa_node for all possible cpus · a874f100
      Srikar Dronamraju authored
      A powerpc system with multiple possible nodes and with CONFIG_NUMA
      enabled always used to have a node 0, even if node 0 had no CPUs or
      memory attached to it. As per PAPR, the node affinity of a CPU is only
      available once it is present/online. For all CPUs that are possible
      but not present, cpu_to_node() would point to node 0.
      
      To ensure a cpuless, memoryless dummy node is not onlined, powerpc
      needs to make sure cpu_to_node() for all possible but not present
      CPUs is set to a proper node.
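      A hedged sketch of the idea (first_online_node as the fallback is
      illustrative of the approach, not necessarily the exact committed code):
      
        static void __init setup_possible_cpu_nodes(void)
        {
                unsigned int cpu;
      
                for_each_possible_cpu(cpu) {
                        if (cpu_present(cpu))
                                continue;  /* real affinity known from PAPR */
                        set_cpu_numa_node(cpu, first_online_node);
                }
        }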
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-2-srikar@linux.vnet.ibm.com
    • powerpc/numa: Restrict possible nodes based on platform · 67df7784
      Srikar Dronamraju authored
      As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
      Abstraction Services (RTAS) Node" available at:
        https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
      
      ... there are 2 device tree properties:
      
        "ibm,max-associativity-domains"
         which defines the maximum number of domains that the firmware
         (i.e. PowerVM) can support.
      
      and:
      
        "ibm,current-associativity-domains"
         which defines the maximum number of domains that the current
         platform can support.
      
      The value of "ibm,max-associativity-domains" is always greater than or
      equal to that of "ibm,current-associativity-domains". If the latter
      property is not available, use "ibm,max-associativity-domains" as a
      fallback. In this yet-to-be-released LoPAPR, "ibm,current-associativity-domains"
      is mentioned on page 833 / B.5.3, which is covered under the
      "Appendix B. System Binding" section.
      
      Currently powerpc uses the "ibm,max-associativity-domains" property
      while setting the possible number of nodes. This is currently set at
      32. However the possible number of nodes for a platform may be
      significantly less. Hence set the possible number of nodes based on
      the "ibm,current-associativity-domains" property.
      
      Nathan Lynch had raised a valid concern that post LPM (Live Partition
      Migration), a user could DLPAR add processors and memory after LPM
      with "new" associativity properties:
        https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u
      
      He also pointed out that "ibm,max-associativity-domains" has the same
      contents on all currently available PowerVM systems, unlike
      "ibm,current-associativity-domains" and hence may be better able to
      handle the new NUMA associativity properties.
      
      However with the recent commit dbce4562 ("powerpc/numa: Limit
      possible nodes to within num_possible_nodes"), all new NUMA
      associativity properties are capped to initially set nr_node_ids.
      Hence this commit should be safe with any new DLPAR add post LPM.
      
        $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
        /proc/device-tree/rtas/ibm,current-associativity-domains
        		 00000005 00000001 00000002 00000002 00000002 00000010
        /proc/device-tree/rtas/ibm,max-associativity-domains
        		 00000005 00000001 00000008 00000020 00000020 00000100
      
        $ cat /sys/devices/system/node/possible ##Before patch
        0-31
      
        $ cat /sys/devices/system/node/possible ##After patch
        0-1
      
      Note that the maximum number of nodes this platform can support is
      only 2, but the possible nodes was set to 32.
      
      This is important because a lot of kernel and user space code
      allocates structures for all possible nodes, leading to a lot of
      memory that is allocated but never used.
      
      I ran a simple experiment creating and destroying 100 memory cgroups
      on boot on an 8-node machine (Power8 Alpine).
      
      Before patch:
        free -k at boot
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4106816   518820608       22272      570752   516606720
        Swap:       4194240           0     4194240
      
        free -k after creating 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4628416   518246464       22336      623296   516058688
        Swap:       4194240           0     4194240
      
        free -k after destroying 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4697408   518173760       22400      627008   515987904
        Swap:       4194240           0     4194240
      
      After patch:
        free -k at boot
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     3969472   518933888       22272      594816   516731776
        Swap:       4194240           0     4194240
      
        free -k after creating 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4181888   518676096       22208      640192   516496448
        Swap:       4194240           0     4194240
      
        free -k after destroying 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4232320   518619904       22272      645952   516443264
        Swap:       4194240           0     4194240
      
      Observations:
        Fixed kernel takes 137344 kb (4106816-3969472) less to boot.
        Fixed kernel takes 309184 kb (4628416-4181888-137344) less to create 100 memcgs.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      [mpe: Reformat change log a bit for readability]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200817055257.110873-1-srikar@linux.vnet.ibm.com
  2. 15 Sep 2020 (6 commits)
  3. 14 Sep 2020 (1 commit)
  4. 02 Sep 2020 (3 commits)
    • pseries/drmem: don't cache node id in drmem_lmb struct · e5e179aa
      Scott Cheloha authored
      At memory hot-remove time we can retrieve an LMB's nid from its
      corresponding memory_block.  There is no need to store the nid
      in multiple locations.
      
      Note that lmb_to_memblock() uses find_memory_block() to get the
      corresponding memory_block.  As find_memory_block() runs in sub-linear
      time this approach is negligibly slower than what we do at present.
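      A hedged sketch of the hot-remove-time lookup (mirroring the helpers
      named above; not the verbatim committed code):
      
        static int lmb_nid(struct drmem_lmb *lmb)
        {
                struct memory_block *mem_blk = lmb_to_memblock(lmb);
                int nid;
      
                if (!mem_blk)
                        return NUMA_NO_NODE;
      
                nid = mem_blk->nid;          /* node tracked by the block */
                put_device(&mem_blk->dev);   /* drop find_memory_block() ref */
                return nid;
        }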
      
      In exchange for this lookup at hot-remove time we no longer need to
      call memory_add_physaddr_to_nid() during drmem_init() for each LMB.
      On powerpc, memory_add_physaddr_to_nid() is a linear search, so this
      spares us an O(n^2) initialization during boot.
      
      On systems with many LMBs that initialization overhead is palpable and
      disruptive.  For example, on a box with 249854 LMBs we're seeing
      drmem_init() take upwards of 30 seconds to complete:
      
      [   53.721639] drmem: initializing drmem v2
      [   80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1]
      [   80.604377] Modules linked in:
      [   80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4
      [   80.604397] NIP:  c0000000000a4980 LR: c0000000000a4940 CTR: 0000000000000000
      [   80.604407] REGS: c0002dbff8493830 TRAP: 0901   Not tainted  (5.6.0-rc2+)
      [   80.604412] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44000248  XER: 0000000d
      [   80.604431] CFAR: c0000000000a4a38 IRQMASK: 0
      [   80.604431] GPR00: c0000000000a4940 c0002dbff8493ac0 c000000001904400 c0003cfffffede30
      [   80.604431] GPR04: 0000000000000000 c000000000f4095a 000000000000002f 0000000010000000
      [   80.604431] GPR08: c0000bf7ecdb7fb8 c0000bf7ecc2d3c8 0000000000000008 c00c0002fdfb2001
      [   80.604431] GPR12: 0000000000000000 c00000001e8ec200
      [   80.604477] NIP [c0000000000a4980] hot_add_scn_to_nid+0xa0/0x3e0
      [   80.604486] LR [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0
      [   80.604492] Call Trace:
      [   80.604498] [c0002dbff8493ac0] [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0 (unreliable)
      [   80.604509] [c0002dbff8493b20] [c000000000087c10] memory_add_physaddr_to_nid+0x20/0x60
      [   80.604521] [c0002dbff8493b40] [c0000000010d4880] drmem_init+0x25c/0x2f0
      [   80.604530] [c0002dbff8493c10] [c000000000010154] do_one_initcall+0x64/0x2c0
      [   80.604540] [c0002dbff8493ce0] [c0000000010c4aa0] kernel_init_freeable+0x2d8/0x3a0
      [   80.604550] [c0002dbff8493db0] [c000000000010824] kernel_init+0x2c/0x148
      [   80.604560] [c0002dbff8493e20] [c00000000000b648] ret_from_kernel_thread+0x5c/0x74
      [   80.604567] Instruction dump:
      [   80.604574] 392918e8 e9490000 e90a000a e92a0000 80ea000c 1d080018 3908ffe8 7d094214
      [   80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e9490000 7fbe5040
      [   89.047390] drmem: 249854 LMB(s)
      
      With a patched kernel on the same machine we're no longer seeing the
      soft lockup.  drmem_init() now completes in negligible time, even when
      the LMB count is large.
      
      Fixes: b2d3b5ee ("powerpc/pseries: Track LMB nid instead of using device tree")
      Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
      Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200811015115.63677-1-cheloha@linux.ibm.com
    • powerpc: Rewrite FSL_BOOKE flush_instruction_cache() in C · 704dfe93
      Christophe Leroy authored
      Nothing prevents flush_instruction_cache() from being written in C.
      
      Do it to improve readability and maintainability.
      
      This function is only used by low-level callers; it is not intended
      to be used by modules, so don't export it.
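      A hedged sketch of what the C version plausibly looks like on FSL
      BookE, using the L1CSR1 flash-invalidate bits from asm/reg_booke.h
      (illustrative, not necessarily the exact committed code):
      
        void flush_instruction_cache(void)
        {
                unsigned long tmp;
      
                /* Set the icache flash-invalidate (and clear-lock) bits. */
                tmp = mfspr(SPRN_L1CSR1);
                mtspr(SPRN_L1CSR1, tmp | L1CSR1_ICFI | L1CSR1_ICLFR);
                isync();
        }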
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/f989eff8296800c427622c0985384148404e4f0b.1597384512.git.christophe.leroy@csgroup.eu
  5. 28 Aug 2020 (1 commit)
    • powerpc/book3s64/radix: Fix boot failure with large amount of guest memory · 103a8542
      Aneesh Kumar K.V authored
      If the hypervisor doesn't support hugepages, the kernel ends up
      allocating a large number of page table pages. The early page table
      allocation was wrongly setting the max memblock limit to
      ppc64_rma_size with radix translation, which resulted in the boot
      failure shown below.
      
      Kernel panic - not syncing:
      early_alloc_pgtable: Failed to allocate 16777216 bytes align=0x1000000 nid=-1 from=0x0000000000000000 max_addr=0xffffffffffffffff
       CPU: 0 PID: 0 Comm: swapper Not tainted 5.8.0-24.9-default+ #2
       Call Trace:
       [c0000000016f3d00] [c0000000007c6470] dump_stack+0xc4/0x114 (unreliable)
       [c0000000016f3d40] [c00000000014c78c] panic+0x164/0x418
       [c0000000016f3dd0] [c000000000098890] early_alloc_pgtable+0xe0/0xec
       [c0000000016f3e60] [c0000000010a5440] radix__early_init_mmu+0x360/0x4b4
       [c0000000016f3ef0] [c000000001099bac] early_init_mmu+0x1c/0x3c
       [c0000000016f3f10] [c00000000109a320] early_setup+0x134/0x170
      
      This was because the kernel was checking for the radix feature before
      we enabled the feature via mmu_features. This resulted in the kernel
      applying hash restrictions on radix.
      
      Rework the early init code such that the kernel boots with the
      memblock restrictions imposed by hash; at that point, the kernel still
      hasn't finalized which translation it will end up using.
      
      We have three different ways of detecting radix.
      
      1. dt_cpu_ftrs_scan -> used only in case of PowerNV
      2. ibm,pa-features -> used when we don't use dt_cpu_ftrs_scan
      3. CAS -> Where we negotiate with hypervisor about the supported translation.
      
      We look at 1 or 2 early in the boot and after that, we look at the CAS vector to
      finalize the translation the kernel will use. We also support a kernel command
      line option (disable_radix) to switch to hash.
      
      Update the memblock limit after mmu_early_init_devtree() if the kernel is going
      to use radix translation. This forces some of the memblock allocations we do before
      mmu_early_init_devtree() to be within the RMA limit.
      
      Fixes: 2bfd65e4 ("powerpc/mm/radix: Add radix callbacks for early init routines")
      Reported-by: Shirisha Ganta <shiganta@in.ibm.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: Hari Bathini <hbathini@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200828100852.426575-1-aneesh.kumar@linux.ibm.com
  6. 24 Aug 2020 (3 commits)
  7. 21 Aug 2020 (1 commit)
  8. 18 Aug 2020 (1 commit)
  9. 17 Aug 2020 (1 commit)
  10. 13 Aug 2020 (3 commits)
    • mm: clean up the last pieces of page fault accountings · a2beb5f1
      Peter Xu authored
      Here are the last pieces of page fault accounting that were still done
      outside handle_mm_fault(), where we still have regs == NULL when
      calling handle_mm_fault():
      
      arch/powerpc/mm/copro_fault.c:   copro_handle_mm_fault
      arch/sparc/mm/fault_32.c:        force_user_fault
      arch/um/kernel/trap.c:           handle_page_fault
      mm/gup.c:                        faultin_page
                                       fixup_user_fault
      mm/hmm.c:                        hmm_vma_fault
      mm/ksm.c:                        break_ksm
      
      Some of them have the issue of duplicated accounting for page fault
      retries.  Some of them didn't do the accounting at all.
      
      This patch cleans all these up by letting handle_mm_fault() do the
      per-task page fault accounting even if regs == NULL (though we'll
      still skip the perf event accounting).  With that, we can safely
      remove all the outliers now.
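      The shape of the new common accounting, sketched (close to, but not
      verbatim, the mm/memory.c helper):
      
        static void mm_account_fault(struct pt_regs *regs,
                        unsigned long address, unsigned int flags,
                        vm_fault_t ret)
        {
                bool major;
      
                /* Faults that will be retried later aren't counted yet. */
                if (!(ret & VM_FAULT_ERROR) && (ret & VM_FAULT_RETRY))
                        return;
      
                major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);
      
                /* Per-task counters are bumped even for gup callers. */
                if (major)
                        current->maj_flt++;
                else
                        current->min_flt++;
      
                /* gup and friends pass regs == NULL; skip perf there. */
                if (!regs)
                        return;
      
                if (major)
                        perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1,
                                      regs, address);
                else
                        perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1,
                                      regs, address);
        }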
      
      There's another functional change in that now we account the page faults
      to the caller of gup, rather than the task_struct that passed into the gup
      code.  More information of this can be found at [1].
      
      After this patch, the items below should never be touched again
      outside handle_mm_fault():
      
        - task_struct.[maj|min]_flt
        - PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]
      
      [1] https://lore.kernel.org/lkml/CAHk-=wj_V2Tps2QrMn20_W0OJF9xqNh52XSGA42s-ZJ8Y+GyKw@mail.gmail.com/
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200707225021.200906-25-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/powerpc: use general page fault accounting · 428fdc09
      Peter Xu authored
      Use the general page fault accounting by passing regs into
      handle_mm_fault().
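      A hedged sketch of the call-site change in the powerpc fault handler:
      
        /* in the powerpc page fault path: regs is now threaded through
         * so the generic code can do the per-task and perf accounting. */
        fault = handle_mm_fault(vma, address, flags, regs);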
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Link: http://lkml.kernel.org/r/20200707225021.200906-17-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: do page fault accounting in handle_mm_fault · bce617ed
      Peter Xu authored
      Patch series "mm: Page fault accounting cleanups", v5.
      
      This is v5 of the pf accounting cleanup series.  It originates from
      Gerald Schaefer's report, a week ago, of an issue regarding incorrect
      page fault accounting for retried page faults after commit
      4064b982 ("mm: allow VM_FAULT_RETRY for multiple times"):
      
        https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/
      
      What this series did:
      
        - Correct page fault accounting: we do accounting for a page fault
          (no matter whether it's from #PF handling, or gup, or anything else)
          only with the one that completed the fault.  For example, page fault
          retries should not be counted in page fault counters.  Same for
          the perf events.
      
        - Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
          event is used in an ad-hoc way across different archs.
      
          Case (1): for many archs it's done at the entry of a page fault
          handler, so that it will also cover e.g. erroneous faults.
      
          Case (2): for some other archs, it is only accounted when the page
          fault is resolved successfully.
      
          Case (3): there are still quite a few archs that have not enabled
          this perf event.
      
          Since this series touches nearly all the archs, we unify this
          perf event to always follow case (1), which is the one that makes
          the most sense.  And since we moved the accounting into
          handle_mm_fault, the other two MAJ/MIN perf events are well taken
          care of naturally.
      
        - Unify definition of "major faults": the definition of "major
          fault" is slightly changed when used in accounting (not
          VM_FAULT_MAJOR).  More information in patch 1.
      
        - Always account the page fault onto the one that triggered the page
          fault.  This does not matter much for #PF handlings, but mostly for
          gup.  More information on this in patch 25.
      
      Patchset layout:
      
      Patch 1:     Introduced the accounting in handle_mm_fault(), not enabled.
      Patch 2-23:  Enable the new accounting for arch #PF handlers one by one.
      Patch 24:    Enable the new accounting for the rest outliers (gup, iommu, etc.)
      Patch 25:    Cleanup GUP task_struct pointer since it's not needed any more
      
      This patch (of 25):
      
      This is a preparation patch to move page fault accountings into the
      general code in handle_mm_fault().  This includes both the per task
      flt_maj/flt_min counters, and the major/minor page fault perf events.  To
      do this, the pt_regs pointer is passed into handle_mm_fault().
      
      PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
      handlers.
      
      So far, all the pt_regs pointer that passed into handle_mm_fault() is
      NULL, which means this patch should have no intended functional change.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com
      Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 10 Aug 2020 (1 commit)
  12. 08 Aug 2020 (3 commits)
    • mm/sparse: cleanup the code surrounding memory_present() · c89ab04f
      Mike Rapoport authored
      After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent
      functions that call memory_present() for each region in memblock.memory:
      sparse_memory_present_with_active_regions() and memblocks_present().
      
      Moreover, all architectures have a call to either of these functions
      preceding the call to sparse_init() and in the most cases they are called
      one after the other.
      
      Mark the regions from memblock.memory as present during sparse_init()
      by making sparse_init() call memblocks_present(), make the
      memblocks_present() and memory_present() functions static, and remove
      the redundant sparse_memory_present_with_active_regions() function.
      
      Also remove the no-longer-required HAVE_MEMORY_PRESENT configuration
      option.
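      A hedged sketch of the consolidated flow (close to the generic mm
      code, but illustrative):
      
        /* Now static: walks memblock.memory and marks sections present. */
        static void __init memblocks_present(void)
        {
                unsigned long start, end;
                int i, nid;
      
                for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid)
                        memory_present(nid, start, end);
        }
      
        void __init sparse_init(void)
        {
                memblocks_present();    /* was each arch's responsibility */
                /* ... existing section/memmap setup continues ... */
        }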
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/sparsemem: enable vmem_altmap support in vmemmap_alloc_block_buf() · 56993b4e
      Anshuman Khandual authored
      There are many instances where vmemmap allocation is switched between
      regular memory and device memory just based on whether an altmap is
      available or not.  vmemmap_alloc_block_buf() is used on various
      platforms to allocate vmemmap mappings.  Let's also enable it to
      handle altmap-based device memory allocation along with the existing
      regular memory allocations.  This will help avoid the altmap-based
      allocation switch in many places.  To summarize, there are two
      different ways to call vmemmap_alloc_block_buf():
      
      vmemmap_alloc_block_buf(size, node, NULL)   /* Allocate from system RAM */
      vmemmap_alloc_block_buf(size, node, altmap) /* Allocate from altmap */
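      A hedged sketch of the unified dispatch inside the helper
      (sparse_buffer_alloc() is the regular-memory fast path; treat this as
      illustrative rather than the exact committed code):
      
        void * __meminit vmemmap_alloc_block_buf(unsigned long size,
                        int node, struct vmem_altmap *altmap)
        {
                void *ptr;
      
                if (altmap)     /* device memory backing the vmemmap */
                        return altmap_alloc_block_buf(size, altmap);
      
                ptr = sparse_buffer_alloc(size);
                if (!ptr)
                        ptr = vmemmap_alloc_block(size, node);
                return ptr;
        }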
      
      This converts altmap_alloc_block_buf() into a static function, drops
      its entry from the header, and updates Documentation/vm/memory-model.rst.
      Suggested-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Jia He <justin.he@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Will Deacon <will@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Link: http://lkml.kernel.org/r/1594004178-8861-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove unneeded includes of <asm/pgalloc.h> · ca15ca40
      Mike Rapoport authored
      Patch series "mm: cleanup usage of <asm/pgalloc.h>"
      
      Most architectures have very similar versions of pXd_alloc_one() and
      pXd_free_one() for intermediate levels of page table.  These patches add
      generic versions of these functions in <asm-generic/pgalloc.h> and enable
      use of the generic functions where appropriate.
      
      In addition, functions declared and defined in <asm/pgalloc.h> headers are
      used mostly by core mm and early mm initialization in arch and there is no
      actual reason to have the <asm/pgalloc.h> included all over the place.
      The first patch in this series removes unneeded includes of
      <asm/pgalloc.h>
      
      In the end it didn't work out as neatly as I hoped and moving
      pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
      unnecessary changes to arches that have custom page table allocations, so
      I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
      to mm/.
      
      This patch (of 8):
      
      In most cases <asm/pgalloc.h> header is required only for allocations of
      page table memory.  Most of the .c files that include that header do not
      use symbols declared in <asm/pgalloc.h> and do not require that header.
      
      As for the other header files that used to include <asm/pgalloc.h>, it is
      possible to move that include into the .c file that actually uses symbols
      from <asm/pgalloc.h> and drop the include from the header file.
      
      The process was somewhat automated using
      
      	sed -i -E '/[<"]asm\/pgalloc\.h/d' \
                      $(grep -L -w -f /tmp/xx \
                              $(git grep -E -l '[<"]asm/pgalloc\.h'))
      
      where /tmp/xx contains all the symbols defined in
      arch/*/include/asm/pgalloc.h.
      
      [rppt@linux.ibm.com: fix powerpc warning]
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
      Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 30 Jul 2020 (1 commit)
  14. 29 Jul 2020 (7 commits)
    • powerpc/drmem: Make LMB walk a bit more flexible · adfefc60
      Hari Bathini authored
      Currently, numa & prom are the only users of drmem LMB walk code.
      Loading kdump with kexec_file also needs to walk the drmem LMBs to
      setup the usable memory ranges for kdump kernel. But there are couple
      of issues in using the code as is. One, walk_drmem_lmb() code is built
      into the .init section currently, while kexec_file needs it later.
      Two, there is no scope to pass data to the callback function for
      processing and/or erroring out on certain conditions.
      
      Fix that by moving the drmem LMB walk code out of the .init section,
      adding scope to pass data to the callback function, and bailing out
      when an error is encountered in the callback function.
      Signed-off-by: Hari Bathini <hbathini@linux.ibm.com>
      Tested-by: Pingfan Liu <piliu@redhat.com>
      Reviewed-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/159602282727.575379.3979857013827701828.stgit@hbathini
    • powerpc/book3s64/radix: Add kernel command line option to disable radix GTSE · bf6b7661
      Aneesh Kumar K.V authored
      This adds a kernel command line option that can be used to disable
      GTSE support. Disabling GTSE implies the kernel will make hcalls to
      invalidate TLB entries.
      
      This was done so that we can do VM migration between configs that enable/disable
      GTSE support via hypervisor. To migrate a VM from a system that supports
      GTSE to a system that doesn't, we can boot the guest with
      radix_hcall_invalidate=on, thereby forcing the guest to use hcalls for TLB
      invalidates.
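      For example, the migration-source guest would boot with something
      like this on its command line (placement is illustrative):
      
        root=/dev/sda2 ... radix_hcall_invalidate=on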
      
      The check for hcall availability is done in pSeries_setup_arch so that
      the panic message appears on the console. This should only happen on
      a hypervisor that doesn't force the guest to hash translation even
      though it can't handle the radix GTSE=0 request via CAS. With
      radix_hcall_invalidate=on, if the hypervisor doesn't support the
      hcall_rpt_invalidate hcall, it should force the LPAR to hash translation.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Tested-by: Bharata B Rao <bharata@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200727085908.420806-1-aneesh.kumar@linux.ibm.com
    • powerpc/hugetlb/cma: Allocate gigantic hugetlb pages using CMA · ef26b76d
      Aneesh Kumar K.V authored
      Commit cf11e85f ("mm: hugetlb: optionally allocate gigantic hugepages
      using cma") added support for allocating gigantic hugepages using CMA.
      This patch enables the same for powerpc.
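      A hedged sketch of the arch hook (the radix PUD-order sizing is
      illustrative; hugetlb_cma_reserve() comes from cf11e85f):
      
        void __init gigantic_hugetlb_cma_reserve(void)
        {
                unsigned long order = 0;
      
                if (radix_enabled())
                        order = PUD_SHIFT - PAGE_SHIFT;
      
                if (order)      /* honours the hugetlb_cma= boot option */
                        hugetlb_cma_reserve(order);
        }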
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200713150749.25245-1-aneesh.kumar@linux.ibm.com
    • powerpc/64e: Drop dead BOOK3E_MMU_TLB_STATS code · 07e571ea
      Michael Ellerman authored
      This code was merged 11 years ago in commit 13363ab9 ("powerpc:
      Add definitions used by exception handling on 64-bit Book3E") but
      could never be built because CONFIG_BOOK3E_MMU_TLB_STATS never
      existed. Remove it.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200724131728.1643966-4-mpe@ellerman.id.au
    • powerpc/mm: Limit resize_hpt_for_hotplug() call to hash guests only · 55548a86
      Bharata B Rao authored
      During memory hotplug and unplug, resize_hpt_for_hotplug() gets called
      for both hash and radix guests but it should be called only for hash
      guests. Though the call does nothing in the radix guest case, it is
      cleaner to push this call into hash specific memory hotplug routines.
      Reported-by: Nathan Lynch <nathanl@linux.ibm.com>
      Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200727095704.1432916-1-bharata@linux.ibm.com
    • powerpc/mm: Remove custom stack expansion checking · 773b3e53
      Michael Ellerman authored
      We have powerpc specific logic in our page fault handling to decide if
      an access to an unmapped address below the stack pointer should expand
      the stack VMA.
      
      The logic aims to prevent userspace from doing bad accesses below the
      stack pointer. However as long as the stack is < 1MB in size, we allow
      all accesses without further checks. Adding some debug I see that I
      can do a full kernel build and LTP run, and not a single process has
      used more than 1MB of stack. So for the majority of processes the
      logic never even fires.
      
      We also recently found a nasty bug in this code which could cause
      userspace programs to be killed during signal delivery. It went
      unnoticed presumably because most processes use < 1MB of stack.
      
      The generic mm code has also grown support for stack guard pages since
      this code was originally written, so the most heinous case of the
      stack expanding into other mappings is now handled for us.
      
      Finally although some other arches have special logic in this path,
      from what I can tell none of x86, arm64, arm and s390 impose any extra
      checks other than those in expand_stack().
      
      So drop our complicated logic and, like other architectures, just let
      the stack expand as long as it's within the rlimit.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Tested-by: Daniel Axtens <dja@axtens.net>
      Link: https://lore.kernel.org/r/20200724092528.1578671-4-mpe@ellerman.id.au
    • powerpc: Allow 4224 bytes of stack expansion for the signal frame · 63dee5df
      Michael Ellerman authored
      We have powerpc specific logic in our page fault handling to decide if
      an access to an unmapped address below the stack pointer should expand
      the stack VMA.
      
      The code was originally added in 2004 "ported from 2.4". The rough
      logic is that the stack is allowed to grow to 1MB with no extra
      checking. Over 1MB the access must be within 2048 bytes of the stack
      pointer, or be from a user instruction that updates the stack pointer.
      
      The 2048 byte allowance below the stack pointer is there to cover the
      288 byte "red zone" as well as the "about 1.5kB" needed by the signal
      delivery code.
      
      Unfortunately since then the signal frame has expanded, and is now
      4224 bytes on 64-bit kernels with transactional memory enabled. This
      means if a process has consumed more than 1MB of stack, and its stack
      pointer lies less than 4224 bytes from the next page boundary, signal
      delivery will fault when trying to expand the stack and the process
      will see a SEGV.
      
      The total size of the signal frame is the size of struct rt_sigframe
      (which includes the red zone) plus __SIGNAL_FRAMESIZE (128 bytes on
      64-bit).
      
      The 2048 byte allowance was correct until 2008 as the signal frame
      was:
      
      struct rt_sigframe {
              struct ucontext    uc;                           /*     0  1440 */
              /* --- cacheline 11 boundary (1408 bytes) was 32 bytes ago --- */
              long unsigned int          _unused[2];           /*  1440    16 */
              unsigned int               tramp[6];             /*  1456    24 */
              struct siginfo *           pinfo;                /*  1480     8 */
              void *                     puc;                  /*  1488     8 */
              struct siginfo     info;                         /*  1496   128 */
              /* --- cacheline 12 boundary (1536 bytes) was 88 bytes ago --- */
              char                       abigap[288];          /*  1624   288 */
      
              /* size: 1920, cachelines: 15, members: 7 */
              /* padding: 8 */
      };
      
      1920 + 128 = 2048
      
      Then in commit ce48b210 ("powerpc: Add VSX context save/restore,
      ptrace and signal support") (Jul 2008) the signal frame expanded to
      2304 bytes:
      
      struct rt_sigframe {
              struct ucontext    uc;                           /*     0  1696 */	<--
              /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
              long unsigned int          _unused[2];           /*  1696    16 */
              unsigned int               tramp[6];             /*  1712    24 */
              struct siginfo *           pinfo;                /*  1736     8 */
              void *                     puc;                  /*  1744     8 */
              struct siginfo     info;                         /*  1752   128 */
              /* --- cacheline 14 boundary (1792 bytes) was 88 bytes ago --- */
              char                       abigap[288];          /*  1880   288 */
      
              /* size: 2176, cachelines: 17, members: 7 */
              /* padding: 8 */
      };
      
      2176 + 128 = 2304
      
      At this point we should have been exposed to the bug, though as far as
      I know it was never reported. I no longer have a system old enough to
      easily test on.
      
      Then in 2010 commit 320b2b8d ("mm: keep a guard page below a
      grow-down stack segment") caused our stack expansion code to never
      trigger, as there was always a VMA found for a write up to PAGE_SIZE
      below r1.
      
      That meant the bug was hidden as we continued to expand the signal
      frame in commit 2b0a576d ("powerpc: Add new transactional memory
      state to the signal context") (Feb 2013):
      
      struct rt_sigframe {
              struct ucontext    uc;                           /*     0  1696 */
              /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
              struct ucontext    uc_transact;                  /*  1696  1696 */	<--
              /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
              long unsigned int          _unused[2];           /*  3392    16 */
              unsigned int               tramp[6];             /*  3408    24 */
              struct siginfo *           pinfo;                /*  3432     8 */
              void *                     puc;                  /*  3440     8 */
              struct siginfo     info;                         /*  3448   128 */
              /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
              char                       abigap[288];          /*  3576   288 */
      
              /* size: 3872, cachelines: 31, members: 8 */
              /* padding: 8 */
              /* last cacheline: 32 bytes */
      };
      
      3872 + 128 = 4000
      
      And commit 573ebfa6 ("powerpc: Increase stack redzone for 64-bit
      userspace to 512 bytes") (Feb 2014):
      
      struct rt_sigframe {
              struct ucontext    uc;                           /*     0  1696 */
              /* --- cacheline 13 boundary (1664 bytes) was 32 bytes ago --- */
              struct ucontext    uc_transact;                  /*  1696  1696 */
              /* --- cacheline 26 boundary (3328 bytes) was 64 bytes ago --- */
              long unsigned int          _unused[2];           /*  3392    16 */
              unsigned int               tramp[6];             /*  3408    24 */
              struct siginfo *           pinfo;                /*  3432     8 */
              void *                     puc;                  /*  3440     8 */
              struct siginfo     info;                         /*  3448   128 */
              /* --- cacheline 27 boundary (3456 bytes) was 120 bytes ago --- */
              char                       abigap[512];          /*  3576   512 */	<--
      
              /* size: 4096, cachelines: 32, members: 8 */
              /* padding: 8 */
      };
      
      4096 + 128 = 4224
      
      Then finally in 2017, commit 1be7107f ("mm: larger stack guard
      gap, between vmas") exposed us to the existing bug, because it changed
      the stack VMA to be the correct/real size, meaning our stack expansion
      code is now triggered.
      
      Fix it by increasing the allowance to 4224 bytes.
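      A hedged sketch of the resulting check in the powerpc fault path
      (variable names follow the surrounding description; illustrative, not
      the verbatim diff):
      
        /* 4224 = 512-byte red zone + TM-enabled rt_sigframe +
         * __SIGNAL_FRAMESIZE, per the layout walked through above. */
        if (address + 4224 < uregs->gpr[1] && !store_updates_sp)
                return true;    /* refuse to expand the stack */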
      
      Hard-coding 4224 is obviously unsafe against future expansions of the
      signal frame in the same way as the existing code. We can't easily use
      sizeof() because the signal frame structure is not in a header. We
      will either fix that, or rip out all the custom stack expansion
      checking logic entirely.
      
      Fixes: ce48b210 ("powerpc: Add VSX context save/restore, ptrace and signal support")
      Cc: stable@vger.kernel.org # v2.6.27+
      Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
      Tested-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200724092528.1578671-2-mpe@ellerman.id.au
  15. 27 Jul 2020 (1 commit)
    • powerpc/64s/hash: Fix hash_preload running with interrupts enabled · 909adfc6
      Nicholas Piggin authored
      Commit 2f92447f ("powerpc/book3s64/hash: Use the pte_t address from
      the caller") removed the local_irq_disable() from hash_preload(), but
      it was required for more than just the page table walk: the hash PTE
      busy bit is effectively a lock which may be taken in interrupt
      context, and the local update flag test must not be preempted before
      it's used.
      This solves apparent lockups with perf interrupting __hash_page_64K. If
      get_perf_callchain then also takes a hash fault on the same page while it
      is already locked, it will loop forever taking hash faults, which looks like
      this:
      
        cpu 0x49e: Vector: 100 (System Reset) at [c00000001a4f7d70]
            pc: c000000000072dc8: hash_page_mm+0x8/0x800
            lr: c00000000000c5a4: do_hash_page+0x24/0x38
            sp: c0002ac1cc69ac70
           msr: 8000000000081033
          current = 0xc0002ac1cc602e00
          paca    = 0xc00000001de1f280   irqmask: 0x03   irq_happened: 0x01
            pid   = 20118, comm = pread2_processe
        Linux version 5.8.0-rc6-00345-g1fad14f18bc6
        49e:mon> t
        [c0002ac1cc69ac70] c00000000000c5a4 do_hash_page+0x24/0x38 (unreliable)
        --- Exception: 300 (Data Access) at c00000000008fa60 __copy_tofrom_user_power7+0x20c/0x7ac
        [link register   ] c000000000335d10 copy_from_user_nofault+0xf0/0x150
        [c0002ac1cc69af70] c00032bf9fa3c880 (unreliable)
        [c0002ac1cc69afa0] c000000000109df0 read_user_stack_64+0x70/0xf0
        [c0002ac1cc69afd0] c000000000109fcc perf_callchain_user_64+0x15c/0x410
        [c0002ac1cc69b060] c000000000109c00 perf_callchain_user+0x20/0x40
        [c0002ac1cc69b080] c00000000031c6cc get_perf_callchain+0x25c/0x360
        [c0002ac1cc69b120] c000000000316b50 perf_callchain+0x70/0xa0
        [c0002ac1cc69b140] c000000000316ddc perf_prepare_sample+0x25c/0x790
        [c0002ac1cc69b1a0] c000000000317350 perf_event_output_forward+0x40/0xb0
        [c0002ac1cc69b220] c000000000306138 __perf_event_overflow+0x88/0x1a0
        [c0002ac1cc69b270] c00000000010cf70 record_and_restart+0x230/0x750
        [c0002ac1cc69b620] c00000000010d69c perf_event_interrupt+0x20c/0x510
        [c0002ac1cc69b730] c000000000027d9c performance_monitor_exception+0x4c/0x60
        [c0002ac1cc69b750] c00000000000b2f8 performance_monitor_common_virt+0x1b8/0x1c0
        --- Exception: f00 (Performance Monitor) at c0000000000cb5b0 pSeries_lpar_hpte_insert+0x0/0x160
        [link register   ] c0000000000846f0 __hash_page_64K+0x210/0x540
        [c0002ac1cc69ba50] 0000000000000000 (unreliable)
        [c0002ac1cc69bb00] c000000000073ae0 update_mmu_cache+0x390/0x3a0
        [c0002ac1cc69bb70] c00000000037f024 wp_page_copy+0x364/0xce0
        [c0002ac1cc69bc20] c00000000038272c do_wp_page+0xdc/0xa60
        [c0002ac1cc69bc70] c0000000003857bc handle_mm_fault+0xb9c/0x1b60
        [c0002ac1cc69bd50] c00000000006c434 __do_page_fault+0x314/0xc90
        [c0002ac1cc69be20] c00000000000c5c8 handle_page_fault+0x10/0x2c
        --- Exception: 300 (Data Access) at 00007fff8c861fe8
        SP (7ffff6b19660) is in userspace
      
      Fixes: 2f92447f ("powerpc/book3s64/hash: Use the pte_t address from the caller")
      Reported-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Reported-by: Anton Blanchard <anton@ozlabs.org>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200727060947.10060-1-npiggin@gmail.com
  16. 26 Jul 2020 (2 commits)