1. 24 2月, 2017 2 次提交
    • N
      sparc64: Multi-page size support · c7d9f77d
      Nitin Gupta 提交于
      Add support for using multiple hugepage sizes simultaneously
      on mainline. Currently, support for 256M has been added which
      can be used along with 8M pages.
      
      Page tables are set like this (e.g. for 256M page):
          VA + (8M * x) -> PA + (8M * x) (sz bit = 256M) where x in [0, 31]
      
      and TSB is set similarly:
          VA + (4M * x) -> PA + (4M * x) (sz bit = 256M) where x in [0, 63]
      
      - Testing
      
      Tested on Sonoma (which supports 256M pages) by running stream
      benchmark instances in parallel: one instance uses 8M pages and
      another uses 256M pages, consuming 48G each.
      
      Boot params used:
      
      default_hugepagesz=256M hugepagesz=256M hugepages=300 hugepagesz=8M
      hugepages=10000
      Signed-off-by: NNitin Gupta <nitin.m.gupta@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c7d9f77d
    • B
      sparc32: mm: srmmu: add __ro_after_init to sparc32_cachetlb_ops structures · cbc41b43
      Bhumika Goyal 提交于
      The objects viking_ops, viking_sun4d_smp_ops and smp_cachetlb_ops of
      type sparc32_cachetlb_ops are not modified anywhere after getting modified
      in the init functions. Inside init their reference is also stored in a
      pointer of type const struct sparc32_cachetlb_ops *. So these structures
      are never modified after init, therefore add __ro_after to the declaration
      of these structures.
      Signed-off-by: NBhumika Goyal <bhumirks@gmail.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cbc41b43
  2. 23 2月, 2017 1 次提交
  3. 26 1月, 2017 1 次提交
  4. 25 12月, 2016 1 次提交
  5. 15 11月, 2016 1 次提交
  6. 11 11月, 2016 1 次提交
    • T
      sparc64: Fix find_node warning if numa node cannot be found · 74a5ed5c
      Thomas Tai 提交于
      When booting up LDOM, find_node() warns that a physical address
      doesn't match a NUMA node.
      
      WARNING: CPU: 0 PID: 0 at arch/sparc/mm/init_64.c:835
      find_node+0xf4/0x120 find_node: A physical address doesn't
      match a NUMA node rule. Some physical memory will be
      owned by node 0.Modules linked in:
      
      CPU: 0 PID: 0 Comm: swapper Not tainted 4.9.0-rc3 #4
      Call Trace:
       [0000000000468ba0] __warn+0xc0/0xe0
       [0000000000468c74] warn_slowpath_fmt+0x34/0x60
       [00000000004592f4] find_node+0xf4/0x120
       [0000000000dd0774] add_node_ranges+0x38/0xe4
       [0000000000dd0b1c] numa_parse_mdesc+0x268/0x2e4
       [0000000000dd0e9c] bootmem_init+0xb8/0x160
       [0000000000dd174c] paging_init+0x808/0x8fc
       [0000000000dcb0d0] setup_arch+0x2c8/0x2f0
       [0000000000dc68a0] start_kernel+0x48/0x424
       [0000000000dcb374] start_early_boot+0x27c/0x28c
       [0000000000a32c08] tlb_fixup_done+0x4c/0x64
       [0000000000027f08] 0x27f08
      
      It is because linux use an internal structure node_masks[] to
      keep the best memory latency node only. However, LDOM mdesc can
      contain single latency-group with multiple memory latency nodes.
      
      If the address doesn't match the best latency node within
      node_masks[], it should check for an alternative via mdesc.
      The warning message should only be printed if the address
      doesn't match any node_masks[] nor within mdesc. To minimize
      the impact of searching mdesc every time, the last matched
      mask and index is stored in a variable.
      Signed-off-by: NThomas Tai <thomas.tai@oracle.com>
      Reviewed-by: NChris Hyser <chris.hyser@oracle.com>
      Reviewed-by: NLiam Merwick <liam.merwick@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      74a5ed5c
  7. 28 10月, 2016 1 次提交
    • D
      sparc64: Handle extremely large kernel TLB range flushes more gracefully. · a74ad5e6
      David S. Miller 提交于
      When the vmalloc area gets fragmented, and because the firmware
      mapping area sits between where modules live and the vmalloc area, we
      can sometimes receive requests for enormous kernel TLB range flushes.
      
      When this happens the cpu just spins flushing billions of pages and
      this triggers the NMI watchdog and other problems.
      
      We took care of this on the TSB side by doing a linear scan of the
      table once we pass a certain threshold.
      
      Do something similar for the TLB flush, however we are limited by
      the TLB flush facilities provided by the different chip variants.
      
      First of all we use an (mostly arbitrary) cut-off of 256K which is
      about 32 pages.  This can be tuned in the future.
      
      The huge range code path for each chip works as follows:
      
      1) On spitfire we flush all non-locked TLB entries using diagnostic
         acceses.
      
      2) On cheetah we use the "flush all" TLB flush.
      
      3) On sun4v/hypervisor we do a TLB context flush on context 0, which
         unlike previous chips does not remove "permanent" or locked
         entries.
      
      We could probably do something better on spitfire, such as limiting
      the flush to kernel TLB entries or even doing range comparisons.
      However that probably isn't worth it since those chips are old and
      the TLB only had 64 entries.
      Reported-by: NJames Clarke <jrtc27@jrtc27.com>
      Tested-by: NJames Clarke <jrtc27@jrtc27.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a74ad5e6
  8. 27 10月, 2016 2 次提交
  9. 26 10月, 2016 2 次提交
  10. 19 10月, 2016 1 次提交
  11. 06 10月, 2016 1 次提交
  12. 28 9月, 2016 3 次提交
    • A
      sparc64: Fix irq stack bootmem allocation. · ebb99a4c
      Atish Patra 提交于
      Currently, irq stack bootmem is allocated for all possible cpus
      before nr_cpus value changes the list of possible cpus. As a result,
      there is unnecessary wastage of bootmemory.
      
      Move the irq stack bootmem allocation so that it happens after
      possible cpu list is modified based on nr_cpus value.
      Signed-off-by: NAtish Patra <atish.patra@oracle.com>
      Reviewed-by: NBob Picco <bob.picco@oracle.com>
      Reviewed-by: NVijay Kumar <vijay.ac.kumar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ebb99a4c
    • M
      sparc64 mm: Fix more TSB sizing issues · 1e953d84
      Mike Kravetz 提交于
      Commit af1b1a9b ("sparc64 mm: Fix base TSB sizing when hugetlb
      pages are used") addressed the difference between hugetlb and THP
      pages when computing TSB sizes.  The following additional issues
      were also discovered while working with the code.
      
      In order to save memory, THP makes use of a huge zero page.  This huge
      zero page does not count against a task's RSS, but it does consume TSB
      entries.  This is similar to hugetlb pages.  Therefore, count huge
      zero page entries in hugetlb_pte_count.
      
      Accounting of THP pages is done in the routine set_pmd_at().
      Unfortunately, this does not catch the case where a THP page is split.
      To handle this case, decrement the count in pmdp_invalidate().
      pmdp_invalidate is only called when splitting a THP.  However, 'sanity
      checks' are added in case it is ever called for other purposes.
      
      A more general issue exists with HPAGE_SIZE accounting.
      hugetlb_pte_count tracks the number of HPAGE_SIZE (8M) pages.  This
      value is used to size the TSB for HPAGE_SIZE pages.  However,
      each HPAGE_SIZE page consists of two REAL_HPAGE_SIZE (4M) pages.
      The TSB contains an entry for each REAL_HPAGE_SIZE page.  Therefore,
      the number of REAL_HPAGE_SIZE pages should be used to size the huge
      page TSB.  A new compile time constant REAL_HPAGE_PER_HPAGE is used
      to multiply hugetlb_pte_count before sizing the TSB.
      
      Changes from V1
      - Fixed build issue if hugetlb or THP not configured
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1e953d84
    • P
      sparc64: fix section mismatch in find_numa_latencies_for_group · bdf2f59e
      Paul Gortmaker 提交于
      To fix:
      
        WARNING: vmlinux.o(.text.unlikely+0x580): Section mismatch in
        reference from the function find_numa_latencies_for_group() to the
        function .init.text:find_mlgroup()
      
        The function find_numa_latencies_for_group() references the
        function __init find_mlgroup().  This is often because
        find_numa_latencies_for_group lacks a __init annotation or the
        annotation of find_mlgroup is wrong.
      
      It turns out find_numa_latencies_for_group is only called from:
          static int __init numa_parse_mdesc(void)
      and hence we can tag find_numa_latencies_for_group with __init.
      
      In doing so we see that find_best_numa_node_for_mlgroup is only
      called from within __init and hence can also be marked with __init.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Nitin Gupta <nitin.m.gupta@oracle.com>
      Cc: Chris Hyser <chris.hyser@oracle.com>
      Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Cc: sparclinux@vger.kernel.org
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bdf2f59e
  13. 30 7月, 2016 1 次提交
  14. 29 7月, 2016 1 次提交
    • M
      sparc64 mm: Fix base TSB sizing when hugetlb pages are used · af1b1a9b
      Mike Kravetz 提交于
      do_sparc64_fault() calculates both the base and huge page RSS sizes and
      uses this information in calls to tsb_grow().  The calculation for base
      page TSB size is not correct if the task uses hugetlb pages.  hugetlb
      pages are not accounted for in RSS, therefore the call to get_mm_rss(mm)
      does not include hugetlb pages.  However, the number of pages based on
      huge_pte_count (which does include hugetlb pages) is subtracted from
      this value.  This will result in an artificially small and often negative
      RSS calculation.  The base TSB size is then often set to max_tsb_size
      as the passed RSS is unsigned, so a negative value looks really big.
      
      THP pages are also accounted for in huge_pte_count, and THP pages are
      accounted for in RSS so the calculation in do_sparc64_fault() is correct
      if a task only uses THP pages.
      
      A single huge_pte_count is not sufficient for TSB sizing if both hugetlb
      and THP pages can be used.  Instead of a single counter, use two:  one
      for hugetlb and one for THP.
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af1b1a9b
  15. 27 7月, 2016 1 次提交
  16. 25 6月, 2016 1 次提交
    • M
      tree wide: get rid of __GFP_REPEAT for order-0 allocations part I · 32d6bd90
      Michal Hocko 提交于
      This is the third version of the patchset previously sent [1].  I have
      basically only rebased it on top of 4.7-rc1 tree and dropped "dm: get
      rid of superfluous gfp flags" which went through dm tree.  I am sending
      it now because it is tree wide and chances for conflicts are reduced
      considerably when we want to target rc2.  I plan to send the next step
      and rename the flag and move to a better semantic later during this
      release cycle so we will have a new semantic ready for 4.8 merge window
      hopefully.
      
      Motivation:
      
      While working on something unrelated I've checked the current usage of
      __GFP_REPEAT in the tree.  It seems that a majority of the usage is and
      always has been bogus because __GFP_REPEAT has always been about costly
      high order allocations while we are using it for order-0 or very small
      orders very often.  It seems that a big pile of them is just a
      copy&paste when a code has been adopted from one arch to another.
      
      I think it makes some sense to get rid of them because they are just
      making the semantic more unclear.  Please note that GFP_REPEAT is
      documented as
      
      * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
      
      * _might_ fail.  This depends upon the particular VM implementation.
        while !costly requests have basically nofail semantic.  So one could
        reasonably expect that order-0 request with __GFP_REPEAT will not loop
        for ever.  This is not implemented right now though.
      
      I would like to move on with __GFP_REPEAT and define a better semantic
      for it.
      
        $ git grep __GFP_REPEAT origin/master | wc -l
        111
        $ git grep __GFP_REPEAT | wc -l
        36
      
      So we are down to the third after this patch series.  The remaining
      places really seem to be relying on __GFP_REPEAT due to large allocation
      requests.  This still needs some double checking which I will do later
      after all the simple ones are sorted out.
      
      I am touching a lot of arch specific code here and I hope I got it right
      but as a matter of fact I even didn't compile test for some archs as I
      do not have cross compiler for them.  Patches should be quite trivial to
      review for stupid compile mistakes though.  The tricky parts are usually
      hidden by macro definitions and thats where I would appreciate help from
      arch maintainers.
      
      [1] http://lkml.kernel.org/r/1461849846-27209-1-git-send-email-mhocko@kernel.org
      
      This patch (of 19):
      
      __GFP_REPEAT has a rather weak semantic but since it has been introduced
      around 2.6.12 it has been ignored for low order allocations.  Yet we
      have the full kernel tree with its usage for apparently order-0
      allocations.  This is really confusing because __GFP_REPEAT is
      explicitly documented to allow allocation failures which is a weaker
      semantic than the current order-0 has (basically nofail).
      
      Let's simply drop __GFP_REPEAT from those places.  This would allow to
      identify place which really need allocator to retry harder and formulate
      a more specific semantic for what the flag is supposed to do actually.
      
      Link: http://lkml.kernel.org/r/1464599699-30131-2-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com> [for tile]
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Crispin <blogic@openwrt.org>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      32d6bd90
  17. 26 5月, 2016 1 次提交
  18. 21 5月, 2016 3 次提交
  19. 22 4月, 2016 1 次提交
  20. 21 3月, 2016 1 次提交
  21. 18 3月, 2016 1 次提交
  22. 16 2月, 2016 1 次提交
    • D
      mm/gup: Switch all callers of get_user_pages() to not pass tsk/mm · d4edcf0d
      Dave Hansen 提交于
      We will soon modify the vanilla get_user_pages() so it can no
      longer be used on mm/tasks other than 'current/current->mm',
      which is by far the most common way it is called.  For now,
      we allow the old-style calls, but warn when they are used.
      (implemented in previous patch)
      
      This patch switches all callers of:
      
      	get_user_pages()
      	get_user_pages_unlocked()
      	get_user_pages_locked()
      
      to stop passing tsk/mm so they will no longer see the warnings.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: jack@suse.cz
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d4edcf0d
  23. 30 1月, 2016 1 次提交
    • T
      arch: Set IORESOURCE_SYSTEM_RAM flag for System RAM · 35d98e93
      Toshi Kani 提交于
      Set IORESOURCE_SYSTEM_RAM in flags of resource ranges with
      "System RAM", "Kernel code", "Kernel data", and "Kernel bss".
      
      Note that:
      
       - IORESOURCE_SYSRAM (i.e. modifier bit) is set in flags when
         IORESOURCE_MEM is already set. IORESOURCE_SYSTEM_RAM is defined
         as (IORESOURCE_MEM|IORESOURCE_SYSRAM).
      
       - Some archs do not set 'flags' for children nodes, such as
         "Kernel code".  This patch does not change 'flags' in this
         case.
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: linux-arch@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-mips@linux-mips.org
      Cc: linux-mm <linux-mm@kvack.org>
      Cc: linux-parisc@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Cc: linux-sh@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: sparclinux@vger.kernel.org
      Link: http://lkml.kernel.org/r/1453841853-11383-7-git-send-email-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      35d98e93
  24. 16 1月, 2016 2 次提交
  25. 15 1月, 2016 1 次提交
  26. 05 11月, 2015 1 次提交
    • N
      sparc64: Fix numa distance values · 52708d69
      Nitin Gupta 提交于
      Orabug: 21896119
      
      Use machine descriptor (MD) to get node latency
      values instead of just using default values.
      
      Testing:
      On an T5-8 system with:
       - total nodes = 8
       - self latencies = 0x26d18
       - latency to other nodes = 0x3a598
         => latency ratio = ~1.5
      
      output of numactl --hardware
      
       - before fix:
      
      node distances:
      node   0   1   2   3   4   5   6   7
        0:  10  20  20  20  20  20  20  20
        1:  20  10  20  20  20  20  20  20
        2:  20  20  10  20  20  20  20  20
        3:  20  20  20  10  20  20  20  20
        4:  20  20  20  20  10  20  20  20
        5:  20  20  20  20  20  10  20  20
        6:  20  20  20  20  20  20  10  20
        7:  20  20  20  20  20  20  20  10
      
       - after fix:
      
      node distances:
      node   0   1   2   3   4   5   6   7
        0:  10  15  15  15  15  15  15  15
        1:  15  10  15  15  15  15  15  15
        2:  15  15  10  15  15  15  15  15
        3:  15  15  15  10  15  15  15  15
        4:  15  15  15  15  10  15  15  15
        5:  15  15  15  15  15  10  15  15
        6:  15  15  15  15  15  15  10  15
        7:  15  15  15  15  15  15  15  10
      Signed-off-by: NNitin Gupta <nitin.m.gupta@oracle.com>
      Reviewed-by: NChris Hyser <chris.hyser@oracle.com>
      Reviewed-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      52708d69
  27. 25 6月, 2015 3 次提交
    • D
      sparc64: Convert BUG_ON to warning · 2bf7c3ef
      David Ahern 提交于
      Pagefault handling has a BUG_ON path that panics the system. Convert it to
      a warning instead. There is no need to bring down the system for this kind
      of failure.
      
      The following was hit while running:
          perf sched record -g -- make -j 16
      
      [3609412.782801] kernel BUG at /opt/dahern/linux.git/arch/sparc/mm/fault_64.c:416!
      [3609412.782833]               \|/ ____ \|/
      [3609412.782833]               "@'/ .. \`@"
      [3609412.782833]               /_| \__/ |_\
      [3609412.782833]                  \__U_/
      [3609412.782870] cat(4516): Kernel bad sw trap 5 [#1]
      [3609412.782889] CPU: 0 PID: 4516 Comm: cat Tainted: G            E   4.1.0-rc8+ #6
      [3609412.782909] task: fff8000126e31f80 ti: fff8000110d90000 task.ti: fff8000110d90000
      [3609412.782931] TSTATE: 0000004411001603 TPC: 000000000096b164 TNPC: 000000000096b168 Y: 0000004e    Tainted: G            E
      [3609412.782964] TPC: <do_sparc64_fault+0x5e4/0x6a0>
      [3609412.782979] g0: 000000000096abe0 g1: 0000000000d314c4 g2: 0000000000000000 g3: 0000000000000001
      [3609412.783009] g4: fff8000126e31f80 g5: fff80001302d2000 g6: fff8000110d90000 g7: 00000000000000ff
      [3609412.783045] o0: 0000000000aff6a8 o1: 00000000000001a0 o2: 0000000000000001 o3: 0000000000000054
      [3609412.783080] o4: fff8000100026820 o5: 0000000000000001 sp: fff8000110d935f1 ret_pc: 000000000096b15c
      [3609412.783117] RPC: <do_sparc64_fault+0x5dc/0x6a0>
      [3609412.783137] l0: 000007feff996000 l1: 0000000000030001 l2: 0000000000000004 l3: fff8000127bd0120
      [3609412.783174] l4: 0000000000000054 l5: fff8000127bd0188 l6: 0000000000000000 l7: fff8000110d9dba8
      [3609412.783210] i0: fff8000110d93f60 i1: fff8000110ca5530 i2: 000000000000003f i3: 0000000000000054
      [3609412.783244] i4: fff800010000081a i5: fff8000100000398 i6: fff8000110d936a1 i7: 0000000000407c6c
      [3609412.783286] I7: <sparc64_realfault_common+0x10/0x20>
      [3609412.783308] Call Trace:
      [3609412.783329]  [0000000000407c6c] sparc64_realfault_common+0x10/0x20
      [3609412.783353] Disabling lock debugging due to kernel taint
      [3609412.783379] Caller[0000000000407c6c]: sparc64_realfault_common+0x10/0x20
      [3609412.783449] Caller[fff80001002283e4]: 0xfff80001002283e4
      [3609412.783471] Instruction DUMP: 921021a0  7feaff91  901222a8 <91d02005> 82086100  02f87f7b  808a2873  81cfe008  01000000
      [3609412.783542] Kernel panic - not syncing: Fatal exception
      [3609412.784605] Press Stop-A (L1-A) to return to the boot prom
      [3609412.784615] ---[ end Kernel panic - not syncing: Fatal exception
      
      With this patch rather than a panic I occasionally get something like this:
          perf sched record -g -m 1024  -- make -j N
      
      where N is based on number of cpus (128 to 1024 for a T7-4 and 8 for an 8 cpu
      VM on a T5-2).
      
      WARNING: CPU: 211 PID: 52565 at /opt/dahern/linux.git/arch/sparc/mm/fault_64.c:417 do_sparc64_fault+0x340/0x70c()
      address (7feffcd6000) != regs->tpc (fff80001004873c0)
      Modules linked in: ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_reject_ipv6 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 cdc_ether usbnet mii ixgbe mdio igb i2c_algo_bit i2c_core ptp crc32c_sparc64 camellia_sparc64 des_sparc64 des_generic md5_sparc64 sha512_sparc64 sha1_sparc64 uio_pdrv_genirq uio usb_storage mpt3sas scsi_transport_sas raid_class aes_sparc64 sunvnet sunvdc sha256_sparc64(E) sha256_generic(E)
      CPU: 211 PID: 52565 Comm: ld Tainted: G        W   E   4.1.0-rc8+ #19
      Call Trace:
       [000000000045ce30] warn_slowpath_common+0x7c/0xa0
       [000000000045ceec] warn_slowpath_fmt+0x30/0x40
       [000000000098ad64] do_sparc64_fault+0x340/0x70c
       [0000000000407c2c] sparc64_realfault_common+0x10/0x20
      ---[ end trace 62ee02065a01a049 ]---
      ld[52565]: segfault at fff80001004873c0 ip fff80001004873c0 (rpc fff8000100158868) sp 000007feffcd70e1 error 30002 in libc-2.12.so[fff8000100410000+184000]
      
      The segfault is horrible, but better than a system panic.
      
      An 8-cpu VM on a T5-2 also showed the above traces from time to time,
      so it is a general problem and not specific to the T7 or baremetal.
      Signed-off-by: NDavid Ahern <david.ahern@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2bf7c3ef
    • T
      mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute · fc6daaf9
      Tony Luck 提交于
      Some high end Intel Xeon systems report uncorrectable memory errors as a
      recoverable machine check.  Linux has included code for some time to
      process these and just signal the affected processes (or even recover
      completely if the error was in a read only page that can be replaced by
      reading from disk).
      
      But we have no recovery path for errors encountered during kernel code
      execution.  Except for some very specific cases were are unlikely to ever
      be able to recover.
      
      Enter memory mirroring. Actually 3rd generation of memory mirroing.
      
      Gen1: All memory is mirrored
      	Pro: No s/w enabling - h/w just gets good data from other side of the
      	     mirror
      	Con: Halves effective memory capacity available to OS/applications
      
      Gen2: Partial memory mirror - just mirror memory begind some memory controllers
      	Pro: Keep more of the capacity
      	Con: Nightmare to enable. Have to choose between allocating from
      	     mirrored memory for safety vs. NUMA local memory for performance
      
      Gen3: Address range partial memory mirror - some mirror on each memory
            controller
      	Pro: Can tune the amount of mirror and keep NUMA performance
      	Con: I have to write memory management code to implement
      
      The current plan is just to use mirrored memory for kernel allocations.
      This has been broken into two phases:
      
      1) This patch series - find the mirrored memory, use it for boot time
         allocations
      
      2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the
         unused mirrored memory from mm/memblock.c and only give it out to
         select kernel allocations (this is still being scoped because
         page_alloc.c is scary).
      
      This patch (of 3):
      
      Add extra "flags" to memblock to allow selection of memory based on
      attribute.  No functional changes
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc6daaf9
    • Z
      mm/hugetlb: reduce arch dependent code about huge_pmd_unshare · e81f2d22
      Zhang Zhen 提交于
      Currently we have many duplicates in definitions of huge_pmd_unshare.  In
      all architectures this function just returns 0 when
      CONFIG_ARCH_WANT_HUGE_PMD_SHARE is N.
      
      This patch puts the default implementation in mm/hugetlb.c and lets these
      architectures use the common code.
      Signed-off-by: NZhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Yang <James.Yang@freescale.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e81f2d22
  28. 01 6月, 2015 1 次提交
    • K
      sparc: Resolve conflict between sparc v9 and M7 on usage of bit 9 of TTE · 494e5b6f
      Khalid Aziz 提交于
      sparc: Resolve conflict between sparc v9 and M7 on usage of bit 9 of TTE
      
      Bit 9 of TTE is CV (Cacheable in V-cache) on sparc v9 processor while
      the same bit 9 is MCDE (Memory Corruption Detection Enable) on M7
      processor. This creates a conflicting usage of the same bit. Kernel
      sets TTE.cv bit on all pages for sun4v architecture which works well
      for sparc v9 but enables memory corruption detection on M7 processor
      which is not the intent. This patch adds code to determine if kernel
      is running on M7 processor and takes steps to not enable memory
      corruption detection in TTE erroneously.
      Signed-off-by: NKhalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      494e5b6f
  29. 19 5月, 2015 2 次提交