1. 06 Oct 2020 (2 commits)
    • pseries/hotplug-memory: hot-add: skip redundant LMB lookup · 72cdd117
      Committed by Scott Cheloha
      During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
      to determine which node id (nid) to use when later calling __add_memory().
      
      This is wasteful.  On pseries, memory_add_physaddr_to_nid() finds an
      appropriate nid for a given address by looking up the LMB containing the
      address and then passing that LMB to of_drconf_to_nid_single() to get the
      nid.  In dlpar_add_lmb() we get this address from the LMB itself.
      
      In short, we have a pointer to an LMB and then we are searching for
      that LMB *again* in order to find its nid.
      
      If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
      can skip the redundant lookup.  The only error handling we need to
      duplicate from memory_add_physaddr_to_nid() is the fallback to the
      default nid when of_drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or
      an invalid nid.
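      A minimal sketch of that pattern, using the function names mentioned in
      this log (the fallback node chosen here is an assumption of the sketch,
      not a quote of the patch):

        static int dlpar_add_lmb(struct drmem_lmb *lmb)
        {
                int nid;
                ...
                /* We already hold the LMB, so query its nid directly. */
                nid = of_drconf_to_nid_single(lmb);
                if (nid < 0 || !node_possible(nid))
                        nid = first_online_node; /* same fallback as memory_add_physaddr_to_nid() */

                return __add_memory(nid, lmb->base_addr, memory_block_size_bytes());
        }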
      
      Skipping the extra lookup makes hot-add operations faster, especially
      on machines with many LMBs.
      
      Consider an LPAR with 126976 LMBs.  In one test, hot-adding 126000
      LMBs on an unpatched kernel took ~3.5 hours while a patched kernel
      completed the same operation in ~2 hours:
      
      Unpatched (12450 seconds):
      Sep  9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
      Sep  9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
      [...]
      Sep  9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
      
      Patched (7065 seconds):
      Sep  8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
      Sep  8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
      [...]
      Sep  8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
      
      It should be noted that the speedup grows more substantial when
      hot-adding LMBs at the end of the drconf range.  This is because we
      are skipping a linear LMB search.
      
      To see the distinction, consider a smaller hot-add test on the same
      LPAR.  A perf-stat run with 10 iterations showed that hot-adding 4096
      LMBs completed less than 1 second faster on a patched kernel:
      
      Unpatched:
       Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
      
              104,753.42 msec task-clock                #    0.992 CPUs utilized            ( +-  0.55% )
                   4,708      context-switches          #    0.045 K/sec                    ( +-  0.69% )
                   2,444      cpu-migrations            #    0.023 K/sec                    ( +-  1.25% )
                     394      page-faults               #    0.004 K/sec                    ( +-  0.22% )
         445,902,503,057      cycles                    #    4.257 GHz                      ( +-  0.55% )  (66.67%)
           8,558,376,740      stalled-cycles-frontend   #    1.92% frontend cycles idle     ( +-  0.88% )  (49.99%)
         300,346,181,651      stalled-cycles-backend    #   67.36% backend cycles idle      ( +-  0.76% )  (50.01%)
         258,091,488,691      instructions              #    0.58  insn per cycle
                                                        #    1.16  stalled cycles per insn  ( +-  0.22% )  (66.67%)
          70,568,169,256      branches                  #  673.660 M/sec                    ( +-  0.17% )  (50.01%)
           3,100,725,426      branch-misses             #    4.39% of all branches          ( +-  0.20% )  (49.99%)
      
                 105.583 +- 0.589 seconds time elapsed  ( +-  0.56% )
      
      Patched:
       Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
      
              104,055.69 msec task-clock                #    0.993 CPUs utilized            ( +-  0.32% )
                   4,606      context-switches          #    0.044 K/sec                    ( +-  0.20% )
                   2,463      cpu-migrations            #    0.024 K/sec                    ( +-  0.93% )
                     394      page-faults               #    0.004 K/sec                    ( +-  0.25% )
         442,951,129,921      cycles                    #    4.257 GHz                      ( +-  0.32% )  (66.66%)
           8,710,413,329      stalled-cycles-frontend   #    1.97% frontend cycles idle     ( +-  0.47% )  (50.06%)
         299,656,905,836      stalled-cycles-backend    #   67.65% backend cycles idle      ( +-  0.39% )  (50.02%)
         252,731,168,193      instructions              #    0.57  insn per cycle
                                                        #    1.19  stalled cycles per insn  ( +-  0.20% )  (66.66%)
          68,902,851,121      branches                  #  662.173 M/sec                    ( +-  0.13% )  (49.94%)
           3,100,242,882      branch-misses             #    4.50% of all branches          ( +-  0.15% )  (49.98%)
      
                 104.829 +- 0.325 seconds time elapsed  ( +-  0.31% )
      
      This is consistent with expectations.  An add-by-count hot-add operation adds LMBs
      greedily, so LMBs near the start of the drconf range are considered
      first.  On an otherwise idle LPAR with so many LMBs we would expect to
      find the LMBs we need near the start of the drconf range, hence the
      smaller speedup.
      Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
      Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200916145122.3408129-1-cheloha@linux.ibm.com
      72cdd117
    • powerpc/64s: Add cp_abort after tlbiel to invalidate copy-buffer address · 05504b42
      Committed by Nicholas Piggin
      The copy buffer is implemented as a real address in the nest which is
      translated from EA by copy, and used for memory access by paste. This
      requires that it be invalidated by TLB invalidation.
      
      TLBIE does invalidate the copy buffer, but TLBIEL does not. Add
      cp_abort to the tlbiel sequence.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Fixup whitespace and comment formatting]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200916030234.4110379-2-npiggin@gmail.com
      05504b42
  2. 18 Sep 2020 (1 commit)
  3. 16 Sep 2020 (8 commits)
    • powerpc/smp: Implement cpu_to_coregroup_id · fa35e868
      Committed by Srikar Dronamraju
      Look up the coregroup id from the associativity array.
      
      If the coregroup id cannot be detected, fall back on the core id. This
      ensures the coregroup sched_domain degenerates and an extra sched
      domain is not created.
      
      Ideally this function should have been implemented in
      arch/powerpc/kernel/smp.c. However, implementing it in mm/numa.c means
      we don't need to find the primary domain again.
      
      If the device-tree mentions more than one coregroup, then the kernel
      implements only the last, i.e. the smallest, coregroup, which currently
      corresponds to the penultimate domain in the device-tree.
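      As an illustration of that lookup-with-fallback, here is a rough sketch
      (the helper names below are assumptions for the sketch, not necessarily
      the ones used by the patch):

        int cpu_to_coregroup_id(int cpu)
        {
                __be32 associativity[VPHN_ASSOC_BUFSIZE] = { 0 };

                /* Read the associativity array for this cpu. */
                if (vphn_get_associativity(cpu, associativity))
                        goto out;

                /* Pick the coregroup entry out of the associativity array. */
                return get_coregroup_id(associativity);
        out:
                /* Fall back on the core id so the extra domain degenerates. */
                return cpu_to_core_id(cpu);
        }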
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-11-srikar@linux.vnet.ibm.com
      fa35e868
    • powerpc/smp: Create coregroup domain · 72730bfc
      Committed by Srikar Dronamraju
      Add percpu coregroup maps and masks to create the coregroup domain.
      If a coregroup doesn't exist, the coregroup domain will be degenerated
      in favour of the SMT/CACHE domain. Note that this patch only creates a
      stub for cpu_to_coregroup_id(); the actual implementation follows in a
      subsequent patch.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-10-srikar@linux.vnet.ibm.com
      72730bfc
    • powerpc/numa: Detect support for coregroup · f9f130ff
      Committed by Srikar Dronamraju
      Add support for grouping cores based on the device-tree classification.
      - The last domain in the associativity domains always refers to the
        core.
      - If the primary reference domain happens to be the penultimate domain
        in the associativity domains device-tree property, then there are no
        coregroups. However, if it is not the penultimate domain, then there
        are coregroups. There can be more than one coregroup; for now we are
        only interested in the last, i.e. the smallest, coregroup: one
        sub-group per DIE.
      
      Currently no firmware exposes this grouping, so keep the basis for the
      grouping abstract.  Once firmware starts using this grouping, code will
      be added to detect the type of grouping and adjust the sched domain
      flags accordingly.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-8-srikar@linux.vnet.ibm.com
      f9f130ff
    • powerpc/numa: Offline memoryless cpuless node 0 · e75130f2
      Committed by Srikar Dronamraju
      Currently a Linux kernel with CONFIG_NUMA, on a system with multiple
      possible nodes, marks node 0 as online at boot.  However, in practice
      there are systems which have node 0 as memoryless and cpuless.
      
      This can cause numa_balancing to be enabled on systems with only one node
      that has memory and CPUs. The existence of this cpuless, memoryless dummy
      node can confuse users/scripts looking at the output of lscpu/numactl.
      
      By marking node 0 as offline, stop assuming that node 0 is always
      online. If node 0 has CPUs or memory that are online, it will be set
      online again.
      
      v5.8
       available: 2 nodes (0,2)
       node 0 cpus:
       node 0 size: 0 MB
       node 0 free: 0 MB
       node 2 cpus: 0 1 2 3 4 5 6 7
       node 2 size: 32625 MB
       node 2 free: 31490 MB
       node distances:
       node   0   2
         0:  10  20
         2:  20  10
      
      proc and sys files
      ------------------
       /sys/devices/system/node/online:            0,2
       /proc/sys/kernel/numa_balancing:            1
       /sys/devices/system/node/has_cpu:           2
       /sys/devices/system/node/has_memory:        2
       /sys/devices/system/node/has_normal_memory: 2
       /sys/devices/system/node/possible:          0-31
      
      v5.8 + patch
      ------------------
       available: 1 nodes (2)
       node 2 cpus: 0 1 2 3 4 5 6 7
       node 2 size: 32625 MB
       node 2 free: 31487 MB
       node distances:
       node   2
         2:  10
      
      proc and sys files
      ------------------
      /sys/devices/system/node/online:            2
      /proc/sys/kernel/numa_balancing:            0
      /sys/devices/system/node/has_cpu:           2
      /sys/devices/system/node/has_memory:        2
      /sys/devices/system/node/has_normal_memory: 2
      /sys/devices/system/node/possible:          0-31
      
      Example of a system with online CPUs/memory on node 0
      (same output with and without the patch):
      numactl -H
      available: 4 nodes (0-3)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
      node 0 size: 32482 MB
      node 0 free: 22994 MB
      node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
      node 1 size: 0 MB
      node 1 free: 0 MB
      node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
      node 2 size: 0 MB
      node 2 free: 0 MB
      node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
      node 3 size: 0 MB
      node 3 free: 0 MB
      node distances:
      node   0   1   2   3
        0:  10  20  40  40
        1:  20  10  40  40
        2:  40  40  10  20
        3:  40  40  20  10
      
      Note: On Powerpc, cpu_to_node of possible but not present cpus would
      previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
      numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
      queried from vphn"). Without the 2 commits, Powerpc system might crash.
      
      1. User space applications like numactl and lscpu that parse sysfs tend to
      believe there is an extra online node. This tends to confuse users and
      applications. Other user space applications start believing that the
      system was not able to use all its resources (i.e. resources are missing)
      or that the system was not set up correctly.
      
      2. The existence of the dummy node also leads to inconsistent information:
      the number of online nodes is inconsistent with the information in the
      device-tree and resource dumps.
      
      3. When the dummy node is present, single-node non-NUMA systems end up
      showing up as NUMA systems and numa_balancing gets enabled. This means we
      take the hit from unnecessary NUMA hinting faults.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-4-srikar@linux.vnet.ibm.com
      e75130f2
    • powerpc/numa: Prefer node id queried from vphn · 6398eaa2
      Committed by Srikar Dronamraju
      The node id queried from the static device tree may not be correct. For
      example, it may always show 0 on a shared processor. Hence prefer the node
      id queried from vphn, and fall back on the device tree based node id if
      the vphn query fails.
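      A rough sketch of that preference order (vphn_get_nid() here stands in
      for whatever helper performs the vphn query; names are assumptions of
      this sketch):

        static int numa_setup_cpu(unsigned long lcpu)
        {
                struct device_node *cpu;
                int nid;

                /* Prefer the node id reported by vphn... */
                nid = vphn_get_nid(lcpu);
                if (nid != NUMA_NO_NODE)
                        return nid;

                /* ...and fall back on the static device tree only if that fails. */
                cpu = of_get_cpu_node(lcpu, NULL);
                if (!cpu)
                        return NUMA_NO_NODE;

                nid = of_node_to_nid_single(cpu);
                of_node_put(cpu);
                return nid;
        }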
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-3-srikar@linux.vnet.ibm.com
      6398eaa2
    • powerpc/numa: Set numa_node for all possible cpus · a874f100
      Committed by Srikar Dronamraju
      A Powerpc system with multiple possible nodes and with CONFIG_NUMA
      enabled always used to have a node 0, even if node 0 does not have any
      cpus or memory attached to it. As per PAPR, node affinity of a cpu is
      only available once it is present / online. For all cpus that are
      possible but not present, cpu_to_node() would point to node 0.
      
      To ensure a cpuless, memoryless dummy node is not online, powerpc needs
      to make sure cpu_to_node() of all possible but not present cpus is set
      to a proper node.
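      Conceptually, the fix amounts to a boot-time loop along these lines (the
      exact node chosen for not-present cpus is an assumption of this sketch):

        /* After present cpus have been mapped to their real nodes... */
        for_each_possible_cpu(cpu) {
                if (cpu_present(cpu))
                        continue;       /* already has a proper node id */
                /* Point possible-but-not-present cpus at an online node. */
                set_cpu_numa_node(cpu, first_online_node);
        }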
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-2-srikar@linux.vnet.ibm.com
      a874f100
    • powerpc/numa: Restrict possible nodes based on platform · 67df7784
      Committed by Srikar Dronamraju
      As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
      Abstraction Services (RTAS) Node" available at:
        https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
      
      ... there are 2 device tree properties:
      
        "ibm,max-associativity-domains"
         which defines the maximum number of domains that the firmware i.e
         PowerVM can support.
      
      and:
      
        "ibm,current-associativity-domains"
         which defines the maximum number of domains that the current
         platform can support.
      
      The value of "ibm,max-associativity-domains" is always greater than or
      equal to "ibm,current-associativity-domains" property. If the latter
      property is not available, use "ibm,max-associativity-domain" as a
      fallback. In this yet to be released LoPAPR, "ibm,current-associativity-domains"
      is mentioned in page 833 / B.5.3 which is covered under under
      "Appendix B. System Binding" section
      
      Currently powerpc uses the "ibm,max-associativity-domains" property
      while setting the possible number of nodes. This is currently set at
      32. However, the possible number of nodes for a platform may be
      significantly less. Hence set the possible number of nodes based on the
      "ibm,current-associativity-domains" property.
      
      Nathan Lynch had raised a valid concern that post LPM (Live Partition
      Migration), a user could DLPAR add processors and memory after LPM
      with "new" associativity properties:
        https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u
      
      He also pointed out that "ibm,max-associativity-domains" has the same
      contents on all currently available PowerVM systems, unlike
      "ibm,current-associativity-domains" and hence may be better able to
      handle the new NUMA associativity properties.
      
      However with the recent commit dbce4562 ("powerpc/numa: Limit
      possible nodes to within num_possible_nodes"), all new NUMA
      associativity properties are capped to initially set nr_node_ids.
      Hence this commit should be safe with any new DLPAR add post LPM.
      
        $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
        /proc/device-tree/rtas/ibm,current-associativity-domains
        		 00000005 00000001 00000002 00000002 00000002 00000010
        /proc/device-tree/rtas/ibm,max-associativity-domains
        		 00000005 00000001 00000008 00000020 00000020 00000100
      
        $ cat /sys/devices/system/node/possible ##Before patch
        0-31
      
        $ cat /sys/devices/system/node/possible ##After patch
        0-1
      
      Note that the maximum number of nodes this platform can support is only
      2, yet the possible nodes were set to 32.
      
      This is important because a lot of kernel and user space code allocates
      structures for all possible nodes, leading to a lot of memory that is
      allocated but never used.
      
      I ran a simple experiment to create and destroy 100 memory cgroups on
      boot on an 8-node machine (Power8 Alpine).
      
      Before patch:
        free -k at boot
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4106816   518820608       22272      570752   516606720
        Swap:       4194240           0     4194240
      
        free -k after creating 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4628416   518246464       22336      623296   516058688
        Swap:       4194240           0     4194240
      
        free -k after destroying 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4697408   518173760       22400      627008   515987904
        Swap:       4194240           0     4194240
      
      After patch:
        free -k at boot
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     3969472   518933888       22272      594816   516731776
        Swap:       4194240           0     4194240
      
        free -k after creating 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4181888   518676096       22208      640192   516496448
        Swap:       4194240           0     4194240
      
        free -k after destroying 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4232320   518619904       22272      645952   516443264
        Swap:       4194240           0     4194240
      
      Observations:
        Fixed kernel takes 137344 kb (4106816-3969472) less to boot.
        Fixed kernel takes 309184 kb (4628416-4181888-137344) less to create 100 memcgs.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      [mpe: Reformat change log a bit for readability]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200817055257.110873-1-srikar@linux.vnet.ibm.com
      67df7784
    • powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm · a665eec0
      Committed by Nicholas Piggin
      Commit 0cef77c7 ("powerpc/64s/radix: flush remote CPUs out of
      single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
      a process under certain conditions. One of the assumptions is that
      mm_users would not be incremented via a reference outside the process
      context with mmget_not_zero() and then go on to kthread_use_mm() via that
      reference.
      
      That invariant was broken by io_uring code (see previous sparc64 fix),
      but I'll point Fixes: to the original powerpc commit because we are
      changing that assumption going forward, so this will make backports
      match up.
      
      Fix this by no longer relying on that assumption, but by having each CPU
      check that the mm is not being used, and clearing its own bit from the
      mask only if the mm hasn't been switched-to by the time the IPI is
      processed.
      
      This relies on commit 38cf307c ("mm: fix kthread_use_mm() vs TLB
      invalidate") and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to disable irqs over mm
      switch sequences.
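      A simplified sketch of the per-CPU check done from the IPI handler
      (reference counting and other details are omitted; treat this as an
      illustration of the approach, not the verbatim patch):

        static void do_exit_flush_lazy_tlb(void *arg)
        {
                struct mm_struct *mm = arg;

                /* The mm may have been switched-to since the trim was decided. */
                if (current->mm == mm)
                        return;

                if (current->active_mm == mm) {
                        /* Lazy-tlb user: move to init_mm before dropping our bit. */
                        switch_mm_irqs_off(mm, &init_mm, current);
                        current->active_mm = &init_mm;
                }
                cpumask_clear_cpu(smp_processor_id(), mm_cpumask(mm));
        }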
      
      Fixes: 0cef77c7 ("powerpc/64s/radix: flush remote CPUs out of single-threaded mm_cpumask")
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
      Depends-on: 38cf307c ("mm: fix kthread_use_mm() vs TLB invalidate")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200914045219.3736466-5-npiggin@gmail.com
      a665eec0
  4. 15 Sep 2020 (6 commits)
  5. 14 Sep 2020 (1 commit)
  6. 02 Sep 2020 (3 commits)
    • pseries/drmem: don't cache node id in drmem_lmb struct · e5e179aa
      Committed by Scott Cheloha
      At memory hot-remove time we can retrieve an LMB's nid from its
      corresponding memory_block.  There is no need to store the nid
      in multiple locations.
      
      Note that lmb_to_memblock() uses find_memory_block() to get the
      corresponding memory_block.  As find_memory_block() runs in sub-linear
      time this approach is negligibly slower than what we do at present.
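      The hot-remove path then recovers the nid on demand, roughly as sketched
      below (based on the description above, not the verbatim patch):

        static int dlpar_remove_lmb(struct drmem_lmb *lmb)
        {
                struct memory_block *mem_block;
                int nid;
                ...
                /* lmb_to_memblock() uses find_memory_block() internally. */
                mem_block = lmb_to_memblock(lmb);
                nid = mem_block->nid;           /* instead of a cached lmb->nid */

                __remove_memory(nid, lmb->base_addr, memory_block_size_bytes());
                ...
        }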
      
      In exchange for this lookup at hot-remove time we no longer need to
      call memory_add_physaddr_to_nid() during drmem_init() for each LMB.
      On powerpc, memory_add_physaddr_to_nid() is a linear search, so this
      spares us an O(n^2) initialization during boot.
      
      On systems with many LMBs that initialization overhead is palpable and
      disruptive.  For example, on a box with 249854 LMBs we're seeing
      drmem_init() take upwards of 30 seconds to complete:
      
      [   53.721639] drmem: initializing drmem v2
      [   80.604346] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1]
      [   80.604377] Modules linked in:
      [   80.604389] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc2+ #4
      [   80.604397] NIP:  c0000000000a4980 LR: c0000000000a4940 CTR: 0000000000000000
      [   80.604407] REGS: c0002dbff8493830 TRAP: 0901   Not tainted  (5.6.0-rc2+)
      [   80.604412] MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 44000248  XER: 0000000d
      [   80.604431] CFAR: c0000000000a4a38 IRQMASK: 0
      [   80.604431] GPR00: c0000000000a4940 c0002dbff8493ac0 c000000001904400 c0003cfffffede30
      [   80.604431] GPR04: 0000000000000000 c000000000f4095a 000000000000002f 0000000010000000
      [   80.604431] GPR08: c0000bf7ecdb7fb8 c0000bf7ecc2d3c8 0000000000000008 c00c0002fdfb2001
      [   80.604431] GPR12: 0000000000000000 c00000001e8ec200
      [   80.604477] NIP [c0000000000a4980] hot_add_scn_to_nid+0xa0/0x3e0
      [   80.604486] LR [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0
      [   80.604492] Call Trace:
      [   80.604498] [c0002dbff8493ac0] [c0000000000a4940] hot_add_scn_to_nid+0x60/0x3e0 (unreliable)
      [   80.604509] [c0002dbff8493b20] [c000000000087c10] memory_add_physaddr_to_nid+0x20/0x60
      [   80.604521] [c0002dbff8493b40] [c0000000010d4880] drmem_init+0x25c/0x2f0
      [   80.604530] [c0002dbff8493c10] [c000000000010154] do_one_initcall+0x64/0x2c0
      [   80.604540] [c0002dbff8493ce0] [c0000000010c4aa0] kernel_init_freeable+0x2d8/0x3a0
      [   80.604550] [c0002dbff8493db0] [c000000000010824] kernel_init+0x2c/0x148
      [   80.604560] [c0002dbff8493e20] [c00000000000b648] ret_from_kernel_thread+0x5c/0x74
      [   80.604567] Instruction dump:
      [   80.604574] 392918e8 e9490000 e90a000a e92a0000 80ea000c 1d080018 3908ffe8 7d094214
      [   80.604586] 7fa94040 419d00dc e9490010 714a0088 <2faa0008> 409e00ac e9490000 7fbe5040
      [   89.047390] drmem: 249854 LMB(s)
      
      With a patched kernel on the same machine we're no longer seeing the
      soft lockup.  drmem_init() now completes in negligible time, even when
      the LMB count is large.
      
      Fixes: b2d3b5ee ("powerpc/pseries: Track LMB nid instead of using device tree")
      Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
      Reviewed-by: Nathan Lynch <nathanl@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200811015115.63677-1-cheloha@linux.ibm.com
      e5e179aa
    • powerpc: Rewrite FSL_BOOKE flush_cache_instruction() in C · 704dfe93
      Committed by Christophe Leroy
      Nothing prevents flush_cache_instruction() from being written in C.
      
      Do it to improve readability and maintainability.
      
      This function is only used by low level callers; it is not intended to
      be used by modules, so don't export it.
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/f989eff8296800c427622c0985384148404e4f0b.1597384512.git.christophe.leroy@csgroup.eu
      704dfe93
  7. 28 8月, 2020 1 次提交
    • powerpc/book3s64/radix: Fix boot failure with large amount of guest memory · 103a8542
      Committed by Aneesh Kumar K.V
      If the hypervisor doesn't support hugepages, the kernel ends up allocating a large
      number of page table pages. The early page table allocation was wrongly
      setting the max memblock limit to ppc64_rma_size with radix translation,
      which resulted in the boot failure shown below.
      
      Kernel panic - not syncing:
      early_alloc_pgtable: Failed to allocate 16777216 bytes align=0x1000000 nid=-1 from=0x0000000000000000 max_addr=0xffffffffffffffff
       CPU: 0 PID: 0 Comm: swapper Not tainted 5.8.0-24.9-default+ #2
       Call Trace:
       [c0000000016f3d00] [c0000000007c6470] dump_stack+0xc4/0x114 (unreliable)
       [c0000000016f3d40] [c00000000014c78c] panic+0x164/0x418
       [c0000000016f3dd0] [c000000000098890] early_alloc_pgtable+0xe0/0xec
       [c0000000016f3e60] [c0000000010a5440] radix__early_init_mmu+0x360/0x4b4
       [c0000000016f3ef0] [c000000001099bac] early_init_mmu+0x1c/0x3c
       [c0000000016f3f10] [c00000000109a320] early_setup+0x134/0x170
      
      This was because the kernel was checking for the radix feature before the
      feature was enabled via mmu_features, which resulted in the kernel
      applying hash restrictions to radix.
      
      Rework the early init code so that the kernel boots with the memblock
      restrictions imposed by hash. At that point, the kernel hasn't yet
      finalized which translation it will end up using.
      
      We have three different ways of detecting radix.
      
      1. dt_cpu_ftrs_scan -> used only in case of PowerNV
      2. ibm,pa-features -> Used when we don't use dt_cpu_ftrs_scan
      3. CAS -> Where we negotiate with hypervisor about the supported translation.
      
      We look at 1 or 2 early in the boot and after that, we look at the CAS vector to
      finalize the translation the kernel will use. We also support a kernel command
      line option (disable_radix) to switch to hash.
      
      Update the memblock limit after mmu_early_init_devtree() if the kernel is going
      to use radix translation. This forces some of the memblock allocations we do before
      mmu_early_init_devtree() to be within the RMA limit.
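      The resulting ordering is roughly the following (illustrative, not the
      literal diff):

        /* Early boot: stay within the RMA, as hash would require. */
        memblock_set_current_limit(ppc64_rma_size);

        mmu_early_init_devtree();       /* dt_cpu_ftrs / ibm,pa-features / CAS */

        /* Only now do we know the MMU mode; radix has no RMA restriction. */
        if (early_radix_enabled())
                memblock_set_current_limit(MEMBLOCK_ALLOC_ANYWHERE);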
      
      Fixes: 2bfd65e4 ("powerpc/mm/radix: Add radix callbacks for early init routines")
      Reported-by: Shirisha Ganta <shiganta@in.ibm.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: Hari Bathini <hbathini@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200828100852.426575-1-aneesh.kumar@linux.ibm.com
      103a8542
  8. 24 Aug 2020 (3 commits)
  9. 21 Aug 2020 (1 commit)
  10. 18 Aug 2020 (1 commit)
  11. 17 Aug 2020 (1 commit)
  12. 13 Aug 2020 (3 commits)
    • mm: clean up the last pieces of page fault accountings · a2beb5f1
      Committed by Peter Xu
      Here are the last pieces of page fault accounting that were still done
      outside handle_mm_fault() where we still have regs==NULL when calling
      handle_mm_fault():
      
      arch/powerpc/mm/copro_fault.c:   copro_handle_mm_fault
      arch/sparc/mm/fault_32.c:        force_user_fault
      arch/um/kernel/trap.c:           handle_page_fault
      mm/gup.c:                        faultin_page
                                       fixup_user_fault
      mm/hmm.c:                        hmm_vma_fault
      mm/ksm.c:                        break_ksm
      
      Some of them have the issue of duplicated accounting for page fault
      retries.  Some of them didn't do the accounting at all.
      
      This patch cleans all these up by letting handle_mm_fault() do per-task
      page fault accounting even if regs==NULL (though we'll still skip the perf
      event accounting).  With that, we can safely remove all the outliers now.
      
      There's another functional change in that now we account the page faults
      to the caller of gup, rather than the task_struct that was passed into
      the gup code.  More information on this can be found at [1].
      
      After this patch, the things below should never be touched again outside
      handle_mm_fault():
      
        - task_struct.[maj|min]_flt
        - PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN]
      
      [1] https://lore.kernel.org/lkml/CAHk-=wj_V2Tps2QrMn20_W0OJF9xqNh52XSGA42s-ZJ8Y+GyKw@mail.gmail.com/
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200707225021.200906-25-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2beb5f1
    • mm/powerpc: use general page fault accounting · 428fdc09
      Committed by Peter Xu
      Use the general page fault accounting by passing regs into
      handle_mm_fault().
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Link: http://lkml.kernel.org/r/20200707225021.200906-17-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      428fdc09
    • mm: do page fault accounting in handle_mm_fault · bce617ed
      Committed by Peter Xu
      Patch series "mm: Page fault accounting cleanups", v5.
      
      This is v5 of the pf accounting cleanup series.  It originates from Gerald
      Schaefer's report a week ago regarding incorrect page fault accounting for
      retried page faults after commit 4064b982 ("mm: allow
      VM_FAULT_RETRY for multiple times"):
      
        https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/
      
      What this series did:
      
        - Correct page fault accounting: we do accounting for a page fault
          (no matter whether it's from #PF handling, or gup, or anything else)
          only with the one that completed the fault.  For example, page fault
          retries should not be counted in page fault counters.  The same
          applies to the perf events.
      
        - Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
          event is used in an ad-hoc way across different archs.
      
          Case (1): for many archs it's done at the entry of a page fault
          handler, so that it will also cover e.g. erroneous faults.
      
          Case (2): for some other archs, it is only accounted when the page
          fault is resolved successfully.
      
          Case (3): there are still quite a few archs that have not enabled
          this perf event.
      
          Since this series touches nearly all the archs, we unify this
          perf event to always follow case (1), which is the one that makes most
          sense.  And since we moved the accounting into handle_mm_fault, the
          other two MAJ/MIN perf events are well taken care of naturally.
      
        - Unify definition of "major faults": the definition of "major
          fault" is slightly changed when used in accounting (not
          VM_FAULT_MAJOR).  More information in patch 1.
      
        - Always account the page fault onto the one that triggered the page
          fault.  This does not matter much for #PF handling, but it does for
          gup.  More information on this is in patch 25.
      
      Patchset layout:
      
      Patch 1:     Introduced the accounting in handle_mm_fault(), not enabled.
      Patch 2-23:  Enable the new accounting for arch #PF handlers one by one.
      Patch 24:    Enable the new accounting for the rest outliers (gup, iommu, etc.)
      Patch 25:    Cleanup GUP task_struct pointer since it's not needed any more
      
      This patch (of 25):
      
      This is a preparation patch to move page fault accountings into the
      general code in handle_mm_fault().  This includes both the per task
      maj_flt/min_flt counters, and the major/minor page fault perf events.  To
      do this, the pt_regs pointer is passed into handle_mm_fault().
      
      PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
      handlers.
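      After the series, the signature and a typical arch call site look like
      this (regs may be NULL, e.g. for gup-style callers):

        vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
                                   unsigned int flags, struct pt_regs *regs);

        /* In an arch page fault handler: */
        fault = handle_mm_fault(vma, address, flags, regs);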
      
      So far, all the pt_regs pointers passed into handle_mm_fault() are
      NULL, which means this patch should have no intended functional change.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com
      Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bce617ed
  13. 10 Aug 2020 (1 commit)
  14. 08 Aug 2020 (3 commits)
    • mm/sparse: cleanup the code surrounding memory_present() · c89ab04f
      Committed by Mike Rapoport
      After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent
      functions that call memory_present() for each region in memblock.memory:
      sparse_memory_present_with_active_regions() and memblocks_present().
      
      Moreover, all architectures have a call to either of these functions
      preceding the call to sparse_init() and in most cases they are called
      one after the other.
      
      Mark the regions from memblock.memory as present during sparse_init() by
      making sparse_init() call memblocks_present(), make the memblocks_present()
      and memory_present() functions static, and remove the redundant
      sparse_memory_present_with_active_regions() function.
      
      Also remove the no longer required HAVE_MEMORY_PRESENT configuration option.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c89ab04f
    • mm/sparsemem: enable vmem_altmap support in vmemmap_alloc_block_buf() · 56993b4e
      Committed by Anshuman Khandual
      There are many instances where vmemmap allocation is switched between
      regular memory and device memory just based on whether an altmap is
      available or not.  vmemmap_alloc_block_buf() is used on various platforms
      to allocate vmemmap mappings.  Let's also enable it to handle altmap based
      device memory allocation along with the existing regular memory
      allocations.  This will help in avoiding the altmap based allocation
      switch in many places.  To summarize, there are two different ways to
      call vmemmap_alloc_block_buf():
      
      vmemmap_alloc_block_buf(size, node, NULL)   /* Allocate from system RAM */
      vmemmap_alloc_block_buf(size, node, altmap) /* Allocate from altmap */
      
      This converts altmap_alloc_block_buf() into a static function, drops its
      entry from the header and updates Documentation/vm/memory-model.rst.
      Suggested-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Jia He <justin.he@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Will Deacon <will@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Hsin-Yi Wang <hsinyi@chromium.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Link: http://lkml.kernel.org/r/1594004178-8861-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56993b4e
    • mm: remove unneeded includes of <asm/pgalloc.h> · ca15ca40
      Committed by Mike Rapoport
      Patch series "mm: cleanup usage of <asm/pgalloc.h>"
      
      Most architectures have very similar versions of pXd_alloc_one() and
      pXd_free_one() for intermediate levels of page table.  These patches add
      generic versions of these functions in <asm-generic/pgalloc.h> and enable
      use of the generic functions where appropriate.
      
      In addition, the functions declared and defined in <asm/pgalloc.h> headers
      are used mostly by core mm and early mm initialization in arch code, and
      there is no actual reason to have <asm/pgalloc.h> included all over the
      place.  The first patch in this series removes unneeded includes of
      <asm/pgalloc.h>.
      
      In the end it didn't work out as neatly as I hoped and moving
      pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
      unnecessary changes to arches that have custom page table allocations, so
      I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
      to mm/.
      
      This patch (of 8):
      
      In most cases <asm/pgalloc.h> header is required only for allocations of
      page table memory.  Most of the .c files that include that header do not
      use symbols declared in <asm/pgalloc.h> and do not require that header.
      
      As for the other header files that used to include <asm/pgalloc.h>, it is
      possible to move that include into the .c file that actually uses symbols
      from <asm/pgalloc.h> and drop the include from the header file.
      
      The process was somewhat automated using
      
      	sed -i -E '/[<"]asm\/pgalloc\.h/d' \
                      $(grep -L -w -f /tmp/xx \
                              $(git grep -E -l '[<"]asm/pgalloc\.h'))
      
      where /tmp/xx contains all the symbols defined in
      arch/*/include/asm/pgalloc.h.
      
      [rppt@linux.ibm.com: fix powerpc warning]
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
      Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca15ca40
  15. 30 Jul 2020 (1 commit)
  16. 29 Jul 2020 (4 commits)