1. 24 Feb, 2017 1 commit
    • N
      sparc64: Multi-page size support · c7d9f77d
      Nitin Gupta authored
      Add support for using multiple hugepage sizes simultaneously
      on mainline. Currently, support for 256M has been added which
      can be used along with 8M pages.
      
      Page tables are set like this (e.g. for 256M page):
          VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31]
      
      and TSB is set similarly:
          VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63]
      
      - Testing
      
      Tested on Sonoma (which supports 256M pages) by running stream
      benchmark instances in parallel: one instance uses 8M pages and
      another uses 256M pages, consuming 48G each.
      
      Boot params used:
      
      default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M
      hugepages=10000
      Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c7d9f77d
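As a rough illustration of the layout described in the commit above, the following sketch counts the 8M page-table entries and 4M TSB entries that back one 256M huge page and computes the address stored in each. The function and macro names are hypothetical and are not taken from the kernel; only the 4M, 8M and 256M sizes and the `base + granule * x` pattern come from the commit message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define SZ_4M   (4ULL << 20)   /* TSB entry granule from the commit message */
#define SZ_8M   (8ULL << 20)   /* page-table entry granule */
#define SZ_256M (256ULL << 20) /* huge page size being mapped */

/* Number of fixed-size sub-entries needed to cover one huge page. */
unsigned int entries_per_hugepage(uint64_t hugepage_sz, uint64_t granule)
{
	assert(granule != 0 && hugepage_sz % granule == 0);
	return (unsigned int)(hugepage_sz / granule);
}

/* Address covered by sub-entry x: base + granule * x. */
uint64_t sub_entry_addr(uint64_t base, uint64_t granule, unsigned int x)
{
	return base + granule * x;
}

/* Print every VA -> PA pair that backs one 256M page at this granule. */
void dump_hugepage_entries(uint64_t va, uint64_t pa, uint64_t granule)
{
	unsigned int n = entries_per_hugepage(SZ_256M, granule);

	for (unsigned int x = 0; x < n; x++)
		printf("VA %#llx -> PA %#llx (sz bit = 256M)\n",
		       (unsigned long long)sub_entry_addr(va, granule, x),
		       (unsigned long long)sub_entry_addr(pa, granule, x));
}
```

With the 8M granule the loop covers x in [0, 31]; with the 4M granule it covers x in [0, 63], matching the two ranges quoted in the commit message.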
  2. 25 Dec, 2016 1 commit
  3. 15 Nov, 2016 1 commit
  4. 11 Nov, 2016 1 commit
    • T
      sparc64: Fix find_node warning if numa node cannot be found · 74a5ed5c
      Thomas Tai authored
      When booting up LDOM, find_node() warns that a physical address
      doesn't match a NUMA node.
      
      WARNING: CPU: 0 PID: 0 at arch/sparc/mm/init_64.c:835
      find_node+0xf4/0x120 find_node: A physical address doesn't
      match a NUMA node rule. Some physical memory will be
      owned by node 0.Modules linked in:
      
      CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.0-rc3 #4
      Call Trace:
       [0000000000468ba0] __warn+0xc0/0xe0
       [0000000000468c74] warn_slowpath_fmt+0x34/0x60
       [00000000004592f4] find_node+0xf4/0x120
       [0000000000dd0774] add_node_ranges+0x38/0xe4
       [0000000000dd0b1c] numa_parse_mdesc+0x268/0x2e4
       [0000000000dd0e9c] bootmem_init+0xb8/0x160
       [0000000000dd174c] paging_init+0x808/0x8fc
       [0000000000dcb0d0] setup_arch+0x2c8/0x2f0
       [0000000000dc68a0] start_kernel+0x48/0x424
       [0000000000dcb374] start_early_boot+0x27c/0x28c
       [0000000000a32c08] tlb_fixup_done+0x4c/0x64
       [0000000000027f08] 0x27f08
      
      This is because Linux uses an internal structure, node_masks[], to
      keep only the node with the best memory latency. However, an LDOM
      mdesc can contain a single latency group with multiple memory
      latency nodes.
      
      If the address doesn't match the best latency node within
      node_masks[], find_node() should check for an alternative via mdesc.
      The warning message should be printed only if the address matches
      neither node_masks[] nor mdesc. To minimize the cost of searching
      mdesc every time, the last matched mask and index are stored in a
      variable.
      Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
      Reviewed-by: Chris Hyser <chris.hyser@oracle.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      74a5ed5c
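The lookup order described in the commit above (best-latency rules first, then a remembered alternative, then a full search, then a warning) might be sketched as follows. This is a simplified userspace model, not the kernel's find_node(): the struct layouts, the `last_match` cache and the function name `find_node_sketch` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rule: an address belongs to nid when (addr & mask) == match. */
struct node_rule {
	uint64_t mask;
	uint64_t match;
	int nid;
};

/* Hypothetical one-entry cache of the last rule found by the slow path. */
struct last_match {
	int valid;
	struct node_rule rule;
};

static int rule_matches(const struct node_rule *r, uint64_t addr)
{
	return (addr & r->mask) == r->match;
}

/*
 * best[] stands in for node_masks[] (best-latency rules only) and all[]
 * for every latency group found in mdesc. Returns the node id, or -1
 * when nothing matches, which is the only case that should warn.
 */
int find_node_sketch(uint64_t addr,
		     const struct node_rule *best, size_t nr_best,
		     const struct node_rule *all, size_t nr_all,
		     struct last_match *cache)
{
	assert(cache != NULL);

	/* Fast path: best-latency rules. */
	for (size_t i = 0; i < nr_best; i++)
		if (rule_matches(&best[i], addr))
			return best[i].nid;

	/* Reuse the previous slow-path result when it still applies. */
	if (cache->valid && rule_matches(&cache->rule, addr))
		return cache->rule.nid;

	/* Slow path: search every latency group and remember the hit. */
	for (size_t i = 0; i < nr_all; i++) {
		if (rule_matches(&all[i], addr)) {
			cache->rule = all[i];
			cache->valid = 1;
			return all[i].nid;
		}
	}

	return -1;
}
```

The cache means that a run of addresses falling in the same non-best latency group pays for the full search only once.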
  5. 06 Oct, 2016 1 commit
  6. 28 Sep, 2016 2 commits
    • A
      sparc64: Fix irq stack bootmem allocation. · ebb99a4c
      Atish Patra authored
      Currently, irq stack bootmem is allocated for all possible cpus
      before the nr_cpus value changes the list of possible cpus. As a
      result, boot memory is wasted unnecessarily.
      
      Move the irq stack bootmem allocation so that it happens after
      the possible cpu list is modified based on the nr_cpus value.
      Signed-off-by: Atish Patra <atish.patra@oracle.com>
      Reviewed-by: Bob Picco <bob.picco@oracle.com>
      Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ebb99a4c
    • P
      sparc64: fix section mismatch in find_numa_latencies_for_group · bdf2f59e
      Paul Gortmaker authored
      To fix:
      
        WARNING: vmlinux.o(.text.unlikely+0x580): Section mismatch in
        reference from the function find_numa_latencies_for_group() to the
        function .init.text:find_mlgroup()
      
        The function find_numa_latencies_for_group() references the
        function __init find_mlgroup().  This is often because
        find_numa_latencies_for_group lacks a __init annotation or the
        annotation of find_mlgroup is wrong.
      
      It turns out find_numa_latencies_for_group is only called from:
          static int __init numa_parse_mdesc(void)
      and hence we can tag find_numa_latencies_for_group with __init.
      
      In doing so we see that find_best_numa_node_for_mlgroup is only
      called from within __init and hence can also be marked with __init.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Nitin Gupta <nitin.m.gupta@oracle.com>
      Cc: Chris Hyser <chris.hyser@oracle.com>
      Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Cc: sparclinux@vger.kernel.org
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bdf2f59e
  7. 30 Jul, 2016 1 commit
  8. 29 Jul, 2016 1 commit
    • M
      sparc64 mm: Fix base TSB sizing when hugetlb pages are used · af1b1a9b
      Mike Kravetz authored
      do_sparc64_fault() calculates both the base and huge page RSS sizes and
      uses this information in calls to tsb_grow().  The calculation for base
      page TSB size is not correct if the task uses hugetlb pages.  hugetlb
      pages are not accounted for in RSS, therefore the call to get_mm_rss(mm)
      does not include hugetlb pages.  However, the number of pages based on
      huge_pte_count (which does include hugetlb pages) is subtracted from
      this value.  This will result in an artificially small and often negative
      RSS calculation.  The base TSB size is then often set to max_tsb_size
      as the passed RSS is unsigned, so a negative value looks really big.
      
      THP pages are also accounted for in huge_pte_count, and THP pages are
      accounted for in RSS so the calculation in do_sparc64_fault() is correct
      if a task only uses THP pages.
      
      A single huge_pte_count is not sufficient for TSB sizing if both hugetlb
      and THP pages can be used.  Instead of a single counter, use two:  one
      for hugetlb and one for THP.
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      af1b1a9b
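The wrap-around described in the commit above can be reproduced in a few lines of userspace C. This is only a sketch: the function names and the example figure of 1024 base pages per huge page (8M / 8K) are illustrative assumptions, and the real calculation lives in do_sparc64_fault().

```c
#include <assert.h>

/*
 * Old scheme: one counter covers hugetlb and THP pages alike. hugetlb
 * pages are not part of RSS, so subtracting them can drive the unsigned
 * result below zero, where it wraps to a very large value.
 */
unsigned long base_rss_single_counter(unsigned long mm_rss,
				      unsigned long huge_pte_count,
				      unsigned long pages_per_huge)
{
	return mm_rss - huge_pte_count * pages_per_huge;
}

/*
 * New scheme: subtract only THP pages, which are included in RSS, so
 * the subtraction cannot go negative.
 */
unsigned long base_rss_two_counters(unsigned long mm_rss,
				    unsigned long thp_pte_count,
				    unsigned long pages_per_huge)
{
	unsigned long thp_pages = thp_pte_count * pages_per_huge;

	assert(thp_pages <= mm_rss); /* THP pages are accounted in RSS */
	return mm_rss - thp_pages;
}
```

A task with 100 resident base pages and 4 hugetlb pages makes the first function return a wrapped value, which is how the base TSB ends up sized at max_tsb_size.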
  9. 25 Jun, 2016 1 commit
    • M
      tree wide: get rid of __GFP_REPEAT for order-0 allocations part I · 32d6bd90
      Michal Hocko authored
      This is the third version of the patchset previously sent [1].  I have
      basically only rebased it on top of 4.7-rc1 tree and dropped "dm: get
      rid of superfluous gfp flags" which went through dm tree.  I am sending
      it now because it is tree wide and chances for conflicts are reduced
      considerably when we want to target rc2.  I plan to send the next step
      and rename the flag and move to a better semantic later during this
      release cycle so we will have a new semantic ready for 4.8 merge window
      hopefully.
      
      Motivation:
      
      While working on something unrelated I've checked the current usage of
      __GFP_REPEAT in the tree.  It seems that a majority of the usage is and
      always has been bogus because __GFP_REPEAT has always been about costly
      high order allocations while we are using it for order-0 or very small
      orders very often.  It seems that a big pile of them is just a
      copy&paste when a code has been adopted from one arch to another.
      
      I think it makes some sense to get rid of them because they are just
      making the semantic more unclear.  Please note that __GFP_REPEAT is
      documented as
      
       * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
       * _might_ fail.  This depends upon the particular VM implementation.
      
      while !costly requests have basically nofail semantic.  So one could
      reasonably expect that order-0 request with __GFP_REPEAT will not loop
      for ever.  This is not implemented right now though.
      
      I would like to move on with __GFP_REPEAT and define a better semantic
      for it.
      
        $ git grep __GFP_REPEAT origin/master | wc -l
        111
        $ git grep __GFP_REPEAT | wc -l
        36
      
      So we are down to a third after this patch series.  The remaining
      places really seem to be relying on __GFP_REPEAT due to large allocation
      requests.  This still needs some double checking which I will do later
      after all the simple ones are sorted out.
      
      I am touching a lot of arch specific code here and I hope I got it right
      but as a matter of fact I didn't even compile test for some archs as I
      do not have a cross compiler for them.  Patches should be quite trivial to
      review for stupid compile mistakes though.  The tricky parts are usually
      hidden by macro definitions and that's where I would appreciate help from
      arch maintainers.
      
      [1] http://lkml.kernel.org/r/1461849846-27209-1-git-send-email-mhocko@kernel.org
      
      This patch (of 19):
      
      __GFP_REPEAT has a rather weak semantic but since it has been introduced
      around 2.6.12 it has been ignored for low order allocations.  Yet we
      have the full kernel tree with its usage for apparently order-0
      allocations.  This is really confusing because __GFP_REPEAT is
      explicitly documented to allow allocation failures which is a weaker
      semantic than the current order-0 has (basically nofail).
      
      Let's simply drop __GFP_REPEAT from those places.  This would allow us to
      identify the places which really need the allocator to retry harder and
      formulate a more specific semantic for what the flag is actually
      supposed to do.
      
      Link: http://lkml.kernel.org/r/1464599699-30131-2-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Crispin <blogic@openwrt.org>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32d6bd90
  10. 26 May, 2016 1 commit
  11. 21 May, 2016 1 commit
  12. 22 Apr, 2016 1 commit
  13. 30 Jan, 2016 1 commit
    • T
      arch: Set IORESOURCE_SYSTEM_RAM flag for System RAM · 35d98e93
      Toshi Kani authored
      Set IORESOURCE_SYSTEM_RAM in flags of resource ranges with
      "System RAM", "Kernel code", "Kernel data", and "Kernel bss".
      
      Note that:
      
       - IORESOURCE_SYSRAM (i.e. modifier bit) is set in flags when
         IORESOURCE_MEM is already set. IORESOURCE_SYSTEM_RAM is defined
         as (IORESOURCE_MEM|IORESOURCE_SYSRAM).
      
       - Some archs do not set 'flags' for children nodes, such as
         "Kernel code".  This patch does not change 'flags' in this
         case.
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-mips@linux-mips.org
      Cc: linux-mm <linux-mm@kvack.org>
      Cc: linux-parisc@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: sparclinux@vger.kernel.org
      Link: http://lkml.kernel.org/r/1453841853-11383-7-git-send-email-bp@alien8.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      35d98e93
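The flag relationship stated in the commit above, with IORESOURCE_SYSTEM_RAM being IORESOURCE_MEM plus a modifier bit, can be shown with a small check. The numeric values below are given for illustration and should be verified against include/linux/ioport.h; `struct resource_sketch` and `is_system_ram()` are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values; check include/linux/ioport.h before relying on them. */
#define IORESOURCE_MEM        0x00000200UL
#define IORESOURCE_SYSRAM     0x01000000UL /* modifier bit, valid with MEM */
#define IORESOURCE_SYSTEM_RAM (IORESOURCE_MEM | IORESOURCE_SYSRAM)

/* Hypothetical, trimmed-down stand-in for struct resource. */
struct resource_sketch {
	const char *name;
	unsigned long flags;
};

/* A range counts as System RAM only when both bits are set. */
bool is_system_ram(const struct resource_sketch *res)
{
	assert(res != NULL);
	return (res->flags & IORESOURCE_SYSTEM_RAM) == IORESOURCE_SYSTEM_RAM;
}
```

Testing both bits together, rather than the modifier alone, reflects the note that IORESOURCE_SYSRAM is only meaningful when IORESOURCE_MEM is already set.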
  14. 15 Jan, 2016 1 commit
  15. 05 Nov, 2015 1 commit
    • N
      sparc64: Fix numa distance values · 52708d69
      Nitin Gupta authored
      Orabug: 21896119
      
      Use machine descriptor (MD) to get node latency
      values instead of just using default values.
      
      Testing:
      On a T5-8 system with:
       - total nodes = 8
       - self latencies = 0x26d18
       - latency to other nodes = 0x3a598
         => latency ratio = ~1.5
      
      output of numactl --hardware
      
       - before fix:
      
      node distances:
      node   0   1   2   3   4   5   6   7
        0:  10  20  20  20  20  20  20  20
        1:  20  10  20  20  20  20  20  20
        2:  20  20  10  20  20  20  20  20
        3:  20  20  20  10  20  20  20  20
        4:  20  20  20  20  10  20  20  20
        5:  20  20  20  20  20  10  20  20
        6:  20  20  20  20  20  20  10  20
        7:  20  20  20  20  20  20  20  10
      
       - after fix:
      
      node distances:
      node   0   1   2   3   4   5   6   7
        0:  10  15  15  15  15  15  15  15
        1:  15  10  15  15  15  15  15  15
        2:  15  15  10  15  15  15  15  15
        3:  15  15  15  10  15  15  15  15
        4:  15  15  15  15  10  15  15  15
        5:  15  15  15  15  15  10  15  15
        6:  15  15  15  15  15  15  10  15
        7:  15  15  15  15  15  15  15  10
      Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
      Reviewed-by: Chris Hyser <chris.hyser@oracle.com>
      Reviewed-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52708d69
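The step from the raw latencies quoted in the commit above to the distances that numactl prints can be worked through with integer arithmetic. The sketch assumes each latency is scaled so that a node's distance to itself is 10; the helper name is hypothetical and the kernel's own normalisation code may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

#define LOCAL_DISTANCE 10 /* distance of a node to itself */

/* Scale a raw latency so that self_latency maps to LOCAL_DISTANCE. */
int latency_to_distance(uint64_t latency, uint64_t self_latency)
{
	assert(self_latency != 0);
	return (int)(latency * LOCAL_DISTANCE / self_latency);
}
```

With the T5-8 figures, 0x26d18 is 159000 and 0x3a598 is 239000, so the remote distance is 2390000 / 159000, which truncates to 15, and the local distance is 10. That agrees with the "after fix" table.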
  16. 25 Jun, 2015 1 commit
    • T
      mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute · fc6daaf9
      Tony Luck authored
      Some high end Intel Xeon systems report uncorrectable memory errors as a
      recoverable machine check.  Linux has included code for some time to
      process these and just signal the affected processes (or even recover
      completely if the error was in a read only page that can be replaced by
      reading from disk).
      
      But we have no recovery path for errors encountered during kernel code
      execution.  Except for some very specific cases, we are unlikely to ever
      be able to recover.
      
      Enter memory mirroring. Actually 3rd generation of memory mirroring.
      
      Gen1: All memory is mirrored
      	Pro: No s/w enabling - h/w just gets good data from other side of the
      	     mirror
      	Con: Halves effective memory capacity available to OS/applications
      
      Gen2: Partial memory mirror - just mirror memory behind some memory controllers
      	Pro: Keep more of the capacity
      	Con: Nightmare to enable. Have to choose between allocating from
      	     mirrored memory for safety vs. NUMA local memory for performance
      
      Gen3: Address range partial memory mirror - some mirror on each memory
            controller
      	Pro: Can tune the amount of mirror and keep NUMA performance
      	Con: I have to write memory management code to implement
      
      The current plan is just to use mirrored memory for kernel allocations.
      This has been broken into two phases:
      
      1) This patch series - find the mirrored memory, use it for boot time
         allocations
      
      2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the
         unused mirrored memory from mm/memblock.c and only give it out to
         select kernel allocations (this is still being scoped because
         page_alloc.c is scary).
      
      This patch (of 3):
      
      Add extra "flags" to memblock to allow selection of memory based on
      attribute.  No functional changes
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc6daaf9
  17. 01 Jun, 2015 1 commit
    • K
      sparc: Resolve conflict between sparc v9 and M7 on usage of bit 9 of TTE · 494e5b6f
      Khalid Aziz authored
      Bit 9 of TTE is CV (Cacheable in V-cache) on sparc v9 processor while
      the same bit 9 is MCDE (Memory Corruption Detection Enable) on M7
      processor. This creates a conflicting usage of the same bit. Kernel
      sets TTE.cv bit on all pages for sun4v architecture which works well
      for sparc v9 but enables memory corruption detection on M7 processor
      which is not the intent. This patch adds code to determine if kernel
      is running on M7 processor and takes steps to not enable memory
      corruption detection in TTE erroneously.
      Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      494e5b6f
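The bit-9 conflict described in the commit above comes down to one conditional mask. The sketch below is a userspace model only: `TTE_BIT9`, the `is_m7` flag and the function name are assumptions for illustration, and the kernel detects the processor type and builds its TTE values differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 9 of the TTE: CV on sparc v9, MCDE on M7 (per the commit message). */
#define TTE_BIT9 (1ULL << 9)

/*
 * Return the TTE with bit 9 adjusted for the running processor: set as
 * "cacheable in V-cache" on older chips, cleared on M7 so that memory
 * corruption detection is not enabled by accident.
 */
uint64_t tte_apply_bit9_policy(uint64_t tte, bool is_m7)
{
	uint64_t out = is_m7 ? (tte & ~TTE_BIT9) : (tte | TTE_BIT9);

	assert(!is_m7 || (out & TTE_BIT9) == 0);
	return out;
}
```

All other TTE bits pass through unchanged, so the policy can be applied to any value after it has been built.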
  18. 19 May, 2015 1 commit
    • D
      mm/fault, arch: Use pagefault_disable() to check for disabled pagefaults in the handler · 70ffdb93
      David Hildenbrand authored
      Introduce faulthandler_disabled() and use it to check for irq context and
      disabled pagefaults (via pagefault_disable()) in the pagefault handlers.
      
      Please note that we keep the in_atomic() checks in place - to detect
      whether in irq context (in which case preemption is always properly
      disabled).
      
      In contrast, preempt_disable() should never be used to disable pagefaults.
      With !CONFIG_PREEMPT_COUNT, preempt_disable() doesn't modify the preempt
      counter, and therefore the result of in_atomic() differs.
      We validate that condition by using might_fault() checks when calling
      might_sleep().
      
      Therefore, add a comment to faulthandler_disabled(), describing why this
      is needed.
      
      faulthandler_disabled() and pagefault_disable() are defined in
      linux/uaccess.h, so let's properly add that include to all relevant files.
      
      This patch is based on a patch from Thomas Gleixner.
      Reviewed-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David.Laight@ACULAB.COM
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: airlied@linux.ie
      Cc: akpm@linux-foundation.org
      Cc: benh@kernel.crashing.org
      Cc: bigeasy@linutronix.de
      Cc: borntraeger@de.ibm.com
      Cc: daniel.vetter@intel.com
      Cc: heiko.carstens@de.ibm.com
      Cc: herbert@gondor.apana.org.au
      Cc: hocko@suse.cz
      Cc: hughd@google.com
      Cc: mst@redhat.com
      Cc: paulus@samba.org
      Cc: ralf@linux-mips.org
      Cc: schwidefsky@de.ibm.com
      Cc: yang.shi@windriver.com
      Link: http://lkml.kernel.org/r/1431359540-32227-7-git-send-email-dahi@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      70ffdb93
  19. 19 Mar, 2015 1 commit
  20. 14 Dec, 2014 1 commit
    • J
      mm/debug-pagealloc: make debug-pagealloc boottime configurable · 031bc574
      Joonsoo Kim authored
      Now that we are prepared to avoid using debug-pagealloc at boot time,
      introduce a new kernel parameter to disable debug-pagealloc at boot
      time, and make the related functions disabled in this case.
      
      The only non-intuitive part is the change to the guard page functions.
      Because guard pages are effective only if debug-pagealloc is enabled,
      turning them off along with debug-pagealloc is a reasonable thing to do.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      031bc574
  21. 06 Oct, 2014 8 commits
  22. 05 Oct, 2014 1 commit
    • D
      sparc64: Fix reversed start/end in flush_tlb_kernel_range() · 473ad7f4
      David S. Miller authored
      When we have to split up a flush request into multiple pieces
      (in order to avoid the firmware range) we don't specify the
      arguments in the right order for the second piece.
      
      Fix the order, or else we get hangs as the code tries to
      flush "a lot" of entries and we get lockups like this:
      
      [ 4422.981276] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [expect:117032]
      [ 4422.996130] Modules linked in: ipv6 loop usb_storage igb ptp sg sr_mod ehci_pci ehci_hcd pps_core n2_rng rng_core
      [ 4423.016617] CPU: 12 PID: 117032 Comm: expect Not tainted 3.17.0-rc4+ #1608
      [ 4423.030331] task: fff8003cc730e220 ti: fff8003d99d54000 task.ti: fff8003d99d54000
      [ 4423.045282] TSTATE: 0000000011001602 TPC: 00000000004521e8 TNPC: 00000000004521ec Y: 00000000    Not tainted
      [ 4423.064905] TPC: <__flush_tlb_kernel_range+0x28/0x40>
      [ 4423.074964] g0: 000000000052fd10 g1: 00000001295a8000 g2: ffffff7176ffc000 g3: 0000000000002000
      [ 4423.092324] g4: fff8003cc730e220 g5: fff8003dfedcc000 g6: fff8003d99d54000 g7: 0000000000000006
      [ 4423.109687] o0: 0000000000000000 o1: 0000000000000000 o2: 0000000000000003 o3: 00000000f0000000
      [ 4423.127058] o4: 0000000000000080 o5: 00000001295a8000 sp: fff8003d99d56d01 ret_pc: 000000000052ff54
      [ 4423.145121] RPC: <__purge_vmap_area_lazy+0x314/0x3a0>
      [ 4423.155185] l0: 0000000000000000 l1: 0000000000000000 l2: 0000000000a38040 l3: 0000000000000000
      [ 4423.172559] l4: fff8003dae8965e0 l5: ffffffffffffffff l6: 0000000000000000 l7: 00000000f7e2b138
      [ 4423.189913] i0: fff8003d99d576a0 i1: fff8003d99d576a8 i2: fff8003d99d575e8 i3: 0000000000000000
      [ 4423.207284] i4: 0000000000008008 i5: fff8003d99d575c8 i6: fff8003d99d56df1 i7: 0000000000530c24
      [ 4423.224640] I7: <free_vmap_area_noflush+0x64/0x80>
      [ 4423.234193] Call Trace:
      [ 4423.239051]  [0000000000530c24] free_vmap_area_noflush+0x64/0x80
      [ 4423.251029]  [0000000000531a7c] remove_vm_area+0x5c/0x80
      [ 4423.261628]  [0000000000531b80] __vunmap+0x20/0x120
      [ 4423.271352]  [000000000071cf18] n_tty_close+0x18/0x40
      [ 4423.281423]  [00000000007222b0] tty_ldisc_close+0x30/0x60
      [ 4423.292183]  [00000000007225a4] tty_ldisc_reinit+0x24/0xa0
      [ 4423.303120]  [0000000000722ab4] tty_ldisc_hangup+0xd4/0x1e0
      [ 4423.314232]  [0000000000719aa0] __tty_hangup+0x280/0x3c0
      [ 4423.324835]  [0000000000724cb4] pty_close+0x134/0x1a0
      [ 4423.334905]  [000000000071aa24] tty_release+0x104/0x500
      [ 4423.345316]  [00000000005511d0] __fput+0x90/0x1e0
      [ 4423.354701]  [000000000047fa54] task_work_run+0x94/0xe0
      [ 4423.365126]  [0000000000404b44] __handle_signal+0xc/0x2c
      
      Fixes: 4ca9a237 ("sparc64: Guard against flushing openfirmware mappings.")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      473ad7f4
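Splitting a flush request around the firmware range, as described in the commit above, might look like the sketch below, where the second piece is the one whose arguments had been reversed. The two address constants are assumed example bounds for the firmware window, and `struct range` and `split_flush_range()` are hypothetical helpers rather than kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bounds of the firmware window, for illustration only. */
#define LOW_OBP_ADDRESS 0x00000000f0000000ULL
#define HI_OBP_ADDRESS  0x0000000100000000ULL

struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Split [start, end) into at most two pieces that skip the firmware
 * window. Returns the number of pieces written to out[].
 */
size_t split_flush_range(uint64_t start, uint64_t end, struct range out[2])
{
	size_t n = 0;

	assert(start <= end);

	if (start < HI_OBP_ADDRESS && end > LOW_OBP_ADDRESS) {
		if (start < LOW_OBP_ADDRESS)
			out[n++] = (struct range){ start, LOW_OBP_ADDRESS };
		/* Second piece: start is HI_OBP_ADDRESS, end is end. */
		if (end > HI_OBP_ADDRESS)
			out[n++] = (struct range){ HI_OBP_ADDRESS, end };
	} else {
		out[n++] = (struct range){ start, end };
	}
	return n;
}
```

Passing (end, HI_OBP_ADDRESS) for the second piece instead would describe a range whose start lies above its end, which is the kind of request that makes the flush loop run over "a lot" of entries.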
  23. 17 Sep, 2014 2 commits
    • B
      sparc64: mem boot option correction · 7c21d533
      bob picco authored
      The "mem" boot option can have many unexpected consequences. This patch
      attempts to prevent boot hangs which have been experienced on T4-4 and T5-8.
      Basically the boot loader allocates vmlinuz and initrd higher in available
      OBP physical memory. For example, on a 2TB T5-8 it isn't possible to boot
      with mem=20G.
      
      The patch utilizes memblock to avoid reserved regions and trim only memory
      which is free. Other improvements are possible for a multi-node machine.
      
      This is a snippet of the boot log with mem=20G on T5-8 with the patch applied:
      MEMBLOCK configuration:	<- before memory reduction
       memory size = 0x1ffad6ce000 reserved size = 0xa1adf44
       memory.cnt  = 0xb
       memory[0x0]    [0x00000030400000-0x00003fdde47fff], 0x3fada48000 bytes
       memory[0x1]    [0x00003fdde4e000-0x00003fdde4ffff], 0x2000 bytes
       memory[0x2]    [0x00080000000000-0x00083fffffffff], 0x4000000000 bytes
       memory[0x3]    [0x00100000000000-0x00103fffffffff], 0x4000000000 bytes
       memory[0x4]    [0x00180000000000-0x00183fffffffff], 0x4000000000 bytes
       memory[0x5]    [0x00200000000000-0x00203fffffffff], 0x4000000000 bytes
       memory[0x6]    [0x00280000000000-0x00283fffffffff], 0x4000000000 bytes
       memory[0x7]    [0x00300000000000-0x00303fffffffff], 0x4000000000 bytes
       memory[0x8]    [0x00380000000000-0x00383fffc71fff], 0x3fffc72000 bytes
       memory[0x9]    [0x00383fffc92000-0x00383fffca1fff], 0x10000 bytes
       memory[0xa]    [0x00383fffcb4000-0x00383fffcb5fff], 0x2000 bytes
       reserved.cnt  = 0x2
       reserved[0x0]  [0x00380000000000-0x0038000117e7f8], 0x117e7f9 bytes
       reserved[0x1]  [0x00380004000000-0x0038000d02f74a], 0x902f74b bytes
      ...
      MEMBLOCK configuration:	<- after reduction of memory
       memory size = 0x50a1adf44 reserved size = 0xa1adf44
       memory.cnt  = 0x4
       memory[0x0]    [0x00380000000000-0x0038000117e7f8], 0x117e7f9 bytes
       memory[0x1]    [0x00380004000000-0x0038050d01d74a], 0x50901d74b bytes
       memory[0x2]    [0x00383fffc92000-0x00383fffca1fff], 0x10000 bytes
       memory[0x3]    [0x00383fffcb4000-0x00383fffcb5fff], 0x2000 bytes
       reserved.cnt  = 0x2
       reserved[0x0]  [0x00380000000000-0x0038000117e7f8], 0x117e7f9 bytes
       reserved[0x1]  [0x00380004000000-0x0038000d02f74a], 0x902f74b bytes
      ...
      Early memory node ranges
        node   7: [mem 0x380000000000-0x38000117dfff]
        node   7: [mem 0x380004000000-0x380f0d01bfff]
        node   7: [mem 0x383fffc92000-0x383fffca1fff]
        node   7: [mem 0x383fffcb4000-0x383fffcb5fff]
      Could not find start_pfn for node 0
      Could not find start_pfn for node 1
      Could not find start_pfn for node 2
      Could not find start_pfn for node 3
      Could not find start_pfn for node 4
      Could not find start_pfn for node 5
      Could not find start_pfn for node 6
      .
      
      The patch was tested on T4-1, T5-8 and Jalapeño.
      
      Cc: sparclinux@vger.kernel.org
      Signed-off-by: Bob Picco <bob.picco@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c21d533
    • B
      sparc64: find_node adjustment · 3dee9df5
      bob picco authored
      We have seen an issue with guest boot into LDOM that causes early boot failures
      because no rules match the node identity of the memory. I analyzed this
      on my T4 and concluded there might not be a solution. I saw the issue in
      mainline too when booting into the control/primary domain with guests
      configured.  Note, this could be a firmware bug on some older machines.
      
      I'll provide a full explanation of the issues below. Should we not find a
      matching BEST latency group for a real address (RA) then we will assume node 0.
      On the T4-2 here with the information provided I can't see an alternative.
      
      Technically the LDOM shown below should match the MBLOCK to the
      favorable latency group. However other factors must be considered too. Were
      the memory controllers configured for "fine" grained or "coarse" grained
      interleave (T4)? Also should a "group" MD node be considered a NUMA
      node?
      
      There has to be at least one Machine Description (MD) "group" and hence one
      NUMA node. The group can have one or more latency groups (lg) - more than one
      memory controller. The current code chooses the smallest latency as the most
      favorable per group. The latency and lg information is in MLGROUP below.
      MBLOCK is the base and size of the RAs for the machine as fetched from OBP
      /memory "available" property. My machine has one MBLOCK but more would be
      possible - with holes?
      
      For a T4-2 the following information has been gathered:
      with LDOM guest
      MEMBLOCK configuration:
       memory size = 0x27f870000
       memory.cnt  = 0x3
       memory[0x0]    [0x00000020400000-0x0000029fc67fff], 0x27f868000 bytes
       memory[0x1]    [0x0000029fd8a000-0x0000029fd8bfff], 0x2000 bytes
       memory[0x2]    [0x0000029fd92000-0x0000029fd97fff], 0x6000 bytes
       reserved.cnt  = 0x2
       reserved[0x0]  [0x00000020800000-0x000000216c15c0], 0xec15c1 bytes
       reserved[0x1]  [0x00000024800000-0x0000002c180c1e], 0x7980c1f bytes
      MBLOCK[0]: base[20000000] size[280000000] offset[0]
      (note: "base" and "size" reported in "MBLOCK" encompass the "memory[X]" values)
      (note: (RA + offset) & mask = val is the formula to detect a match for the
      memory controller. Should find_node find no match, a return value of -1
      resulted for the node - BAD)
      
      There is one group. It has these forward links
      MLGROUP[1]: node[545] latency[1f7e8] match[200000000] mask[200000000]
      MLGROUP[2]: node[54d] latency[2de60] match[0] mask[200000000]
      NUMA NODE[0]: node[545] mask[200000000] val[200000000] (latency[1f7e8])
      (note: "val" is the best lg's (smallest latency) "match")
      
      no LDOM guest - bare metal
      MEMBLOCK configuration:
       memory size = 0xfdf2d0000
       memory.cnt  = 0x3
       memory[0x0]    [0x00000020400000-0x00000fff6adfff], 0xfdf2ae000 bytes
       memory[0x1]    [0x00000fff6d2000-0x00000fff6e7fff], 0x16000 bytes
       memory[0x2]    [0x00000fff766000-0x00000fff771fff], 0xc000 bytes
       reserved.cnt  = 0x2
       reserved[0x0]  [0x00000020800000-0x00000021a04580], 0x1204581 bytes
       reserved[0x1]  [0x00000024800000-0x0000002c7d29fc], 0x7fd29fd bytes
      MBLOCK[0]: base[20000000] size[fe0000000] offset[0]
      
      there are two groups
      group node[16d5]
      MLGROUP[0]: node[1765] latency[1f7e8] match[0] mask[200000000]
      MLGROUP[3]: node[177d] latency[2de60] match[200000000] mask[200000000]
      NUMA NODE[0]: node[1765] mask[200000000] val[0] (latency[1f7e8])
      group node[171d]
      MLGROUP[2]: node[1775] latency[2de60] match[0] mask[200000000]
      MLGROUP[1]: node[176d] latency[1f7e8] match[200000000] mask[200000000]
      NUMA NODE[1]: node[176d] mask[200000000] val[200000000] (latency[1f7e8])
      (note: on this two-group bare metal machine, half of the memory is in
      group one's lg and half is in group two's lg).
      
      Cc: sparclinux@vger.kernel.org
      Signed-off-by: Bob Picco <bob.picco@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3dee9df5
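
The "half and half" note for the two-group bare metal machine can be checked with a little arithmetic. The sketch below is a rough Python model under stated assumptions: it applies the mask/val rule uniformly over the whole MBLOCK[0] range with offset 0 and ignores holes and reserved regions. `bytes_per_node` is a hypothetical helper, not a kernel function.

```python
# Split MBLOCK[0] of the bare metal dump between the two NUMA nodes,
# using "(RA + offset) & mask == val" with offset 0.

MASK = 0x200000000                       # bit 33, shared by both nodes
NODE_VALS = {0: 0x0, 1: 0x200000000}     # NUMA NODE[0] and NUMA NODE[1]
BASE, SIZE = 0x20000000, 0xfe0000000     # MBLOCK[0] base and size


def bytes_per_node(base, size):
    """Sum the bytes of [base, base + size) that match each node."""
    totals = {nid: 0 for nid in NODE_VALS}
    addr, end = base, base + size
    while addr < end:
        # The masked bit can only change at the next multiple of MASK.
        stripe_end = min(end, (addr // MASK + 1) * MASK)
        for nid, val in NODE_VALS.items():
            if (addr & MASK) == val:
                totals[nid] += stripe_end - addr
        addr = stripe_end
    return totals


totals = bytes_per_node(BASE, SIZE)
print({nid: hex(n) for nid, n in totals.items()})
# prints: {0: '0x7e0000000', 1: '0x800000000'}
```

Under these assumptions the range is interleaved in 8 GiB stripes, and node 0 comes up 512 MiB short of node 1 only because the block starts at 0x20000000 rather than 0.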
  24. 06 Aug, 2014 1 commit
  25. 05 Aug, 2014 2 commits
    • D
      sparc64: Guard against flushing openfirmware mappings. · 4ca9a237
      David S. Miller committed
      Based almost entirely upon a patch by Christopher Alexander Tobias
      Schulze.
      
      In commit db64fe02 ("mm: rewrite vmap
      layer") lazy VMAP tlb flushing was added to the vmalloc layer.  This
      causes problems on sparc64.
      
      Sparc64 has two VMAP mapped regions and they are not contiguous with
      each other.  First we have the module mapping area, then another
      unrelated region, then the vmalloc region.
      
      This "another unrelated region" is where the firmware is mapped.
      
      If the lazy TLB flushing logic in the vmalloc code triggers after
      we've had both a module unload and a vfree or similar, it will pass an
      address range that goes from somewhere inside the module region to
      somewhere inside the vmalloc region, thus covering the
      openfirmware area entirely.
      
      The sparc64 kernel learns about openfirmware's dynamic mappings in
      this region early in the boot, and then services TLB misses in this
      area.  But openfirmware has some locked TLB entries which are not
      mentioned in those dynamic mappings and we should thus not disturb
      them.
      
      These huge lazy TLB flush ranges cause those openfirmware locked TLB
      entries to be removed, resulting in all kinds of problems including
      hard hangs and crashes during reboot/reset.
      
      Besides causing problems like this, such huge TLB flush ranges are
      also incredibly inefficient.  A plea has been made with the author of
      the VMAP lazy TLB flushing code, but for now we'll put a safety guard
      into our flush_tlb_kernel_range() implementation.
      
      Since the implementation has become non-trivial, stop defining it as a
      macro and instead make it a function in a C source file.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ca9a237
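
The safety guard described in the commit above amounts to clipping a flush range around the firmware window. The following is a minimal Python sketch of that idea, not the kernel's implementation: `guarded_flush_ranges` is a hypothetical helper, and the two window bounds are assumed values chosen for illustration.

```python
# Model of a flush-range guard that never touches the firmware window.
# LOW_OBP and HI_OBP are assumed bounds for the openfirmware mappings.

LOW_OBP = 0x0f0000000
HI_OBP = 0x100000000


def guarded_flush_ranges(start, end):
    """Return the sub-ranges of [start, end) outside [LOW_OBP, HI_OBP)."""
    if start < HI_OBP and end > LOW_OBP:
        ranges = []
        if start < LOW_OBP:
            ranges.append((start, LOW_OBP))  # part below the window
        if end > HI_OBP:
            ranges.append((HI_OBP, end))     # part above the window
        return ranges
    return [(start, end)]                    # no overlap, flush as requested


# A lazy flush spanning from the module area into the vmalloc area.
for lo, hi in guarded_flush_ranges(0x10000000, 0x140000000):
    print(hex(lo), hex(hi))
```

A range that lies entirely inside the window yields an empty list, so nothing is flushed there, while ranges that do not overlap the window pass through unchanged.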
    • D
      sparc64: Do not insert non-valid PTEs into the TSB hash table. · 18f38132
      David S. Miller committed
      The assumption was that update_mmu_cache() (and the equivalent for PMDs) would
      only be called when the PTE being installed will be accessible by the user.
      
      This is not true for code paths originating from remove_migration_pte().
      
      There are dire consequences for placing a non-valid PTE into the TSB.  The TLB
      miss framework assumes that when a TSB entry matches we can just load it into
      the TLB and return from the TLB miss trap.
      
      So if a non-valid PTE is in there, we will deadlock taking the TLB miss over
      and over, never satisfying the miss.
      
      Just exit early from update_mmu_cache() and friends in this situation.
      
      Based upon a report and patch from Christopher Alexander Tobias Schulze.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      18f38132
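
The early exit described in the commit above can be illustrated with a toy model. This is a hedged Python sketch, not kernel code: the TSB is modeled as a plain dict, `PTE_VALID` is an assumed stand-in for the hardware valid bit, and the function bodies are simplified for illustration.

```python
# Toy model: only valid PTEs may enter the TSB, because the miss
# fast path loads any matching TSB entry without further checks.

PTE_VALID = 1 << 63  # assumption: stands in for the PTE valid bit


def update_mmu_cache(tsb, vaddr, pte):
    """Insert pte into the TSB, exiting early when it is not valid."""
    if not (pte & PTE_VALID):
        return False  # early exit: keep non-valid PTEs out of the TSB
    tsb[vaddr] = pte
    return True


def tlb_miss(tsb, tlb, vaddr):
    """Fast path: a matching TSB entry is loaded into the TLB as-is."""
    if vaddr in tsb:
        tlb[vaddr] = tsb[vaddr]
        return True
    return False  # would fall back to a page table walk (not modeled)


tsb, tlb = {}, {}
update_mmu_cache(tsb, 0x2000, 0x1234)              # non-valid: skipped
update_mmu_cache(tsb, 0x4000, PTE_VALID | 0x5678)  # valid: inserted
print(tlb_miss(tsb, tlb, 0x2000), tlb_miss(tsb, tlb, 0x4000))
# prints: False True
```

Without the check in `update_mmu_cache`, the fast path would keep loading the non-valid entry for 0x2000 and the access would miss again each time, which is the loop the commit message warns about.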
  26. 22 Jul, 2014 1 commit
  27. 19 May, 2014 2 commits
  28. 04 May, 2014 1 commit
  29. 22 Jan, 2014 1 commit
    • T
      memblock: make memblock_set_node() support different memblock_type · e7e8de59
      Tang Chen committed
      [sfr@canb.auug.org.au: fix powerpc build]
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Chen Tang <imtangchen@gmail.com>
      Cc: Gong Chen <gong.chen@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Liu Jiang <jiang.liu@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7e8de59