1. 23 Sep, 2015 6 commits
  2. 11 Sep, 2015 1 commit
  3. 09 Sep, 2015 1 commit
    • mem-hotplug: handle node hole when initializing numa_meminfo. · 95cf82ec
      Tang Chen committed
      When parsing SRAT, all memory ranges are added into numa_meminfo.  In
      numa_init(), before entering numa_cleanup_meminfo(), all possible memory
      ranges are in numa_meminfo.  And numa_cleanup_meminfo() removes all
      ranges over max_pfn or empty.
      
      But this only works if the nodes are contiguous.  Let's have a look at
      the following example:
      
      We have an SRAT like this:
      SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
      SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
      SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
      SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
      SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
      SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
      SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
      SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
      SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug
      
      On boot, only nodes 0, 1, 2 and 3 exist.
      
      And the numa_meminfo will look like this:
      numa_meminfo.nr_blks = 9
      1. on node 0: [0, 60000000]
      2. on node 0: [100000000, 20000000000]
      3. on node 1: [20000000000, 40000000000]
      4. on node 4: [40000000000, 60000000000]
      5. on node 5: [60000000000, 80000000000]
      6. on node 2: [80000000000, a0000000000]
      7. on node 3: [a0000000000, a0800000000]
      8. on node 6: [c0000000000, a0800000000]
      9. on node 7: [e0000000000, a0800000000]
      
      And numa_cleanup_meminfo() will merge 1 and 2, and remove 8 and 9 because
      their end addresses are over max_pfn, which is a0800000000.  But 4 and 5
      are not removed because their end addresses are less than max_pfn.  In
      fact, nodes 4 and 5 don't exist.
      
      In short, numa_cleanup_meminfo() is not able to handle holes between nodes.
      
      Since memory ranges in node 4 and 5 are in numa_meminfo, in
      numa_register_memblks(), node 4 and 5 will be mistakenly set to online.
      
      If you run lscpu, it will show:
      NUMA node0 CPU(s):     0-14,128-142
      NUMA node1 CPU(s):     15-29,143-157
      NUMA node2 CPU(s):
      NUMA node3 CPU(s):
      NUMA node4 CPU(s):     62-76,190-204
      NUMA node5 CPU(s):     78-92,206-220
      
      In this patch, we use memblock_overlaps_region() to check if ranges in
      numa_meminfo overlap with ranges in memory_block.  Since memory_block
      contains all available memory at boot time, if they overlap, it means the
      ranges exist.  If not, then remove them from numa_meminfo.
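
The overlap check described above can be modelled in a few lines. This is a simplified userspace sketch, not the kernel code: `Range`, `overlaps()` and `drop_absent_blocks()` are hypothetical names, ranges are treated as half-open, and the `present` list is an assumed stand-in for the memory that memblock would report at boot for nodes 0 to 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    start: int      # inclusive
    end: int        # exclusive
    nid: int = -1   # NUMA node id; -1 when the range carries no node


def overlaps(a, b):
    """True if the two half-open ranges share at least one address."""
    return a.start < b.end and b.start < a.end


def drop_absent_blocks(meminfo, present):
    """Keep only the blocks that overlap memory present at boot."""
    return [blk for blk in meminfo if any(overlaps(blk, r) for r in present)]


# Blocks 1-7 from the example (8 and 9 are already gone: they lie over max_pfn).
numa_meminfo = [
    Range(0x0, 0x60000000, 0),
    Range(0x100000000, 0x20000000000, 0),
    Range(0x20000000000, 0x40000000000, 1),
    Range(0x40000000000, 0x60000000000, 4),
    Range(0x60000000000, 0x80000000000, 5),
    Range(0x80000000000, 0xa0000000000, 2),
    Range(0xa0000000000, 0xa0800000000, 3),
]

# Assumption: memory actually present at boot (nodes 0, 1, 2 and 3 only).
present = [
    Range(0x0, 0x60000000),
    Range(0x100000000, 0x40000000000),
    Range(0x80000000000, 0xa0800000000),
]

kept = drop_absent_blocks(numa_meminfo, present)
print(sorted({blk.nid for blk in kept}))
```

In this model the blocks for nodes 4 and 5 fall entirely inside the hole between the present ranges, so they are dropped, while the blocks for nodes 0 to 3 survive.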
      
      After this patch, lscpu will show:
      NUMA node0 CPU(s):     0-14,128-142
      NUMA node1 CPU(s):     15-29,143-157
      NUMA node4 CPU(s):     62-76,190-204
      NUMA node5 CPU(s):     78-92,206-220
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vladimir Murzin <vladimir.murzin@arm.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 05 Sep, 2015 1 commit
    • x86, mm: trace when an IPI is about to be sent · 5b74283a
      Mel Gorman committed
      When unmapping pages it is necessary to flush the TLB.  If that page was
      accessed by another CPU then an IPI is used to flush the remote CPU.  That
      is a lot of IPIs if kswapd is scanning and unmapping >100K pages per
      second.
      
      There already is a window between when a page is unmapped and when it is
      TLB flushed.  This series increases the window so multiple pages can be
      flushed using a single IPI.  This should be safe or the kernel is hosed
      already.
      
      Patch 1 simply made the rest of the series easier to write as ftrace
              could identify all the senders of TLB flush IPIs.
      
      Patch 2 tracks what CPUs potentially map a PFN and then sends an IPI
              to flush the entire TLB.
      
      Patch 3 tracks when there potentially are writable TLB entries that
              need to be batched differently
      
      Patch 4 increases SWAP_CLUSTER_MAX to further batch flushes
      
      The performance impact is documented in the changelogs but in the optimistic
      case on a 4-socket machine the full series reduces interrupts from 900K
      interrupts/second to 60K interrupts/second.
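
The effect of batching can be illustrated with a back-of-the-envelope model. This is only a sketch under strong assumptions: `flush_ipis()` is a hypothetical helper, every unmapped page is assumed to be cached by the same set of remote CPUs, and the batch size of 32 is illustrative rather than a value taken from the series.

```python
def flush_ipis(pages_unmapped, remote_cpus, batch_size):
    """IPIs sent when one IPI per remote CPU covers `batch_size` unmapped pages."""
    batches = -(-pages_unmapped // batch_size)  # ceiling division on integers
    return batches * remote_cpus


# One IPI per page versus one IPI per batch of 32 pages, single remote CPU.
unbatched = flush_ipis(100_000, remote_cpus=1, batch_size=1)
batched = flush_ipis(100_000, remote_cpus=1, batch_size=32)
print(unbatched, batched)

# Reduction factor for the figures quoted in the changelog (900K -> 60K).
reduction = 900_000 / 60_000
print(reduction)
```

The quoted drop from 900K to 60K interrupts/second corresponds to a 15x reduction.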
      
      This patch (of 4):
      
      It is easy to trace when an IPI is received to flush a TLB but harder to
      detect what event sent it.  This patch makes it easy to identify the
      source of IPIs being transmitted for TLB flushes on x86.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 02 Sep, 2015 1 commit
  6. 28 Aug, 2015 1 commit
    • mm: ZONE_DEVICE for "device memory" · 033fbae9
      Dan Williams committed
      While pmem is usable as a block device or via DAX mappings to userspace
      there are several usage scenarios that cannot target pmem due to its
      lack of struct page coverage. In preparation for "hot plugging" pmem
      into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
      separately from the ones that are subject to standard page allocations.
      Importantly "device memory" can be removed at will by userspace
      unbinding the driver of the device.
      
      Having a separate zone prevents allocation from these pages and otherwise
      marks them as distinct from typical uniform memory.  Device memory has
      different lifetime and performance characteristics than RAM.  However,
      since we have run out of ZONES_SHIFT bits this functionality currently
      depends on sacrificing ZONE_DMA.
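
The bit-budget trade-off can be made concrete with a small calculation. This is a sketch, not kernel code: `zone_bits()` is a hypothetical helper that returns the number of bits needed to encode a zone index, and the zone list is an assumption about a typical x86_64 configuration, not something stated in the commit message.

```python
def zone_bits(nr_zones):
    """Bits needed to store a zone index in the range 0 .. nr_zones - 1."""
    return max(nr_zones - 1, 0).bit_length()


# Assumed zone set for a typical x86_64 build (illustrative only).
zones = ["ZONE_DMA", "ZONE_DMA32", "ZONE_NORMAL", "ZONE_MOVABLE"]

before = zone_bits(len(zones))                      # existing zones
with_device = zone_bits(len(zones) + 1)             # simply adding ZONE_DEVICE
swapped = [z for z in zones if z != "ZONE_DMA"] + ["ZONE_DEVICE"]
after_swap = zone_bits(len(swapped))                # ZONE_DMA traded for ZONE_DEVICE

print(before, with_device, after_swap)
```

Under these assumptions a fifth zone would need one more bit for the zone index, whereas giving up ZONE_DMA keeps the zone count, and therefore the bit budget, unchanged.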
      
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Jerome Glisse <j.glisse@gmail.com>
      [hch: various simplifications in the arch interface]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  7. 25 Aug, 2015 1 commit
  8. 22 Aug, 2015 1 commit
    • x86/kasan, mm: Introduce generic kasan_populate_zero_shadow() · 69786cdb
      Andrey Ryabinin committed
      Introduce generic kasan_populate_zero_shadow(shadow_start,
      shadow_end). This function maps kasan_zero_page to the
      [shadow_start, shadow_end] addresses.
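
The idea of backing a whole shadow range with one shared page can be sketched with a toy model. This is not the kernel implementation: the flat dictionary stands in for the multi-level page tables, `populate_zero_shadow()` and the frame number 42 are hypothetical, and the end address is treated as exclusive for simplicity.

```python
PAGE_SIZE = 4096


def populate_zero_shadow(page_table, shadow_start, shadow_end, zero_page_pfn):
    """Point every page in [shadow_start, shadow_end) at one shared zero page."""
    addr = shadow_start & ~(PAGE_SIZE - 1)   # round down to a page boundary
    while addr < shadow_end:
        page_table[addr] = zero_page_pfn     # many virtual pages, one frame
        addr += PAGE_SIZE
    return page_table


# Three shadow pages all backed by the same (hypothetical) frame number 42.
page_table = populate_zero_shadow({}, 0x1000, 0x4000, zero_page_pfn=42)
print(len(page_table), set(page_table.values()))
```

However large the shadow range is, only one physical page is consumed in this model, which is the point of mapping a zero page.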
      
      This replaces x86_64 specific populate_zero_shadow() and will
      be used for ARM64 in follow on patches.
      
      The main changes from original version are:
      
       * Use p?d_populate*() instead of set_p?d()
       * Use memblock allocator directly instead of vmemmap_alloc_block()
       * __pa() instead of __pa_nodebug(). __pa() causes trouble
         iff we use it before kasan_early_init(). kasan_populate_zero_shadow()
         will be used later, so we are ok with __pa() here.
      Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexey Klimov <klimov.linux@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: David Keitel <dkeitel@codeaurora.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yury <yury.norov@gmail.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/1439444244-26057-3-git-send-email-ryabinin.a.a@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. 21 Aug, 2015 1 commit
  10. 31 Jul, 2015 2 commits
  11. 26 Jul, 2015 1 commit
  12. 24 Jul, 2015 1 commit
  13. 22 Jul, 2015 2 commits
    • x86/mm: Remove region_is_ram() call from ioremap · 9a58eebe
      Toshi Kani committed
      __ioremap_caller() calls region_is_ram() to walk through the
      iomem_resource table to check if a target range is in RAM, which was
      added to improve the lookup performance over page_is_ram() (commit
      906e36c5 "x86: use optimized ioresource lookup in ioremap
      function"). page_is_ram() was no longer used when this change was
      added, though.
      
      __ioremap_caller() then calls walk_system_ram_range(), which had
      replaced page_is_ram() to improve the lookup performance (commit
      c81c8a1e "x86, ioremap: Speed up check for RAM pages").
      
      Since both checks walk through the same iomem_resource table for
      the same purpose, there is no need to call both functions.
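
The kind of table walk both checks perform can be modelled as follows. This is a rough userspace sketch only: `ram_pfns_in_range()` is a hypothetical helper, the resource table is invented for illustration, resource end addresses are inclusive, and rounding each RAM resource inward to whole pages is an assumption of this model.

```python
PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT


def ram_pfns_in_range(resources, start_pfn, nr_pages):
    """Return the sub-ranges of [start_pfn, start_pfn + nr_pages) that are RAM."""
    hits = []
    end_pfn = start_pfn + nr_pages
    for start, end, name in resources:            # `end` is inclusive
        if name != "System RAM":
            continue
        first = (start + PAGE_SIZE - 1) >> PAGE_SHIFT   # first whole page
        last = (end + 1) >> PAGE_SHIFT                  # one past the last whole page
        lo, hi = max(start_pfn, first), min(end_pfn, last)
        if lo < hi:
            hits.append((lo, hi))
    return hits


# Hypothetical resource table in the spirit of /proc/iomem.
iomem = [
    (0x00000000, 0x0009ffff, "System RAM"),
    (0x000a0000, 0x000fffff, "PCI Bus 0000:00"),
    (0x00100000, 0x7fffffff, "System RAM"),
]

mmio_only = ram_pfns_in_range(iomem, 0xa0, 0x60)   # pfns 0xa0..0xff
touches_ram = ram_pfns_in_range(iomem, 0xf0, 0x20)  # pfns 0xf0..0x10f
print(mmio_only, touches_ram)
```

A request that lands only in the non-RAM window yields no hits, while one that spills into RAM reports the overlapping page frames, which is the condition an ioremap caller needs to detect.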
      
      Aside from that, walk_system_ram_range() is the only useful check at the
      moment because region_is_ram() always returns -1 due to an
      implementation bug. That bug in region_is_ram() cannot be fixed
      without breaking existing ioremap callers, which rely on the subtle
      difference of walk_system_ram_range() versus non-page-aligned ranges.
      
      Once these offending callers are fixed we can use region_is_ram() and
      remove walk_system_ram_range().
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Roland Dreier <roland@purestorage.com>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/1437088996-28511-3-git-send-email-toshi.kani@hp.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86/mm: Move warning from __ioremap_check_ram() to the call site · 1c9cf9b2
      Toshi Kani committed
      __ioremap_check_ram() has a WARN_ONCE() which is emitted when the
      given pfn range is not RAM. The warning is bogus in two aspects:
      
      - it never triggers since walk_system_ram_range() only calls
        __ioremap_check_ram() for RAM ranges.
      
      - the warning message is wrong as it says "ioremap on RAM" after it
        established that the pfn range is not RAM.
      
      Move the WARN_ONCE() to __ioremap_caller(), and update the message to
      include the address range so we get an actual warning when something
      tries to ioremap system RAM.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Roland Dreier <roland@purestorage.com>
      Cc: Luis R. Rodriguez <mcgrof@suse.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/1437088996-28511-2-git-send-email-toshi.kani@hp.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  14. 21 Jul, 2015 4 commits
  15. 06 Jul, 2015 4 commits
  16. 25 Jun, 2015 1 commit
    • mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute · fc6daaf9
      Tony Luck committed
      Some high end Intel Xeon systems report uncorrectable memory errors as a
      recoverable machine check.  Linux has included code for some time to
      process these and just signal the affected processes (or even recover
      completely if the error was in a read only page that can be replaced by
      reading from disk).
      
      But we have no recovery path for errors encountered during kernel code
      execution.  Except for some very specific cases, we are unlikely to ever
      be able to recover.
      
      Enter memory mirroring. Actually, the 3rd generation of memory mirroring.
      
      Gen1: All memory is mirrored
      	Pro: No s/w enabling - h/w just gets good data from other side of the
      	     mirror
      	Con: Halves effective memory capacity available to OS/applications
      
      Gen2: Partial memory mirror - just mirror memory behind some memory controllers
      	Pro: Keep more of the capacity
      	Con: Nightmare to enable. Have to choose between allocating from
      	     mirrored memory for safety vs. NUMA local memory for performance
      
      Gen3: Address range partial memory mirror - some mirror on each memory
            controller
      	Pro: Can tune the amount of mirror and keep NUMA performance
      	Con: I have to write memory management code to implement
      
      The current plan is just to use mirrored memory for kernel allocations.
      This has been broken into two phases:
      
      1) This patch series - find the mirrored memory, use it for boot time
         allocations
      
      2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the
         unused mirrored memory from mm/memblock.c and only give it out to
         select kernel allocations (this is still being scoped because
         page_alloc.c is scary).
      
      This patch (of 3):
      
      Add extra "flags" to memblock to allow selection of memory based on
      attribute.  No functional changes.
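
Selecting regions by attribute flags can be sketched as a simple bitmask filter. This is an illustrative model rather than the memblock API: `MemFlag`, its values, `select()` and the region list are all hypothetical, and the rule used here (a region matches when it carries every requested bit) is an assumption of the sketch.

```python
from enum import IntFlag


class MemFlag(IntFlag):
    NONE = 0
    HOTPLUG = 0x1   # hypothetical attribute bits for this sketch
    MIRROR = 0x2


def select(regions, wanted):
    """Return the regions whose flags include every bit set in `wanted`."""
    return [r for r in regions if (r[2] & wanted) == wanted]


# (start, end, flags) triples; addresses are made up for illustration.
regions = [
    (0x0000, 0x1000, MemFlag.NONE),
    (0x1000, 0x2000, MemFlag.MIRROR),
    (0x2000, 0x3000, MemFlag.MIRROR | MemFlag.HOTPLUG),
]

mirrored = select(regions, MemFlag.MIRROR)
everything = select(regions, MemFlag.NONE)
print(len(mirrored), len(everything))
```

Requesting no flags returns every region, so callers that do not care about attributes behave as before, while a caller that wants mirrored memory for boot-time allocations can ask for it explicitly.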
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Xiexiuqi <xiexiuqi@huawei.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 09 Jun, 2015 11 commits