1. 12 12月, 2012 2 次提交
    • W
      memory-hotplug, mm/sparse.c: clear the memory to store struct page · 3ac19f8e
      Wen Congyang 提交于
      If sparse memory vmemmap is enabled, we can't free the memory to store
      struct page when a memory device is hotremoved, because we may store
      struct page in the memory to manage the memory which doesn't belong to
      this memory device.  When we hotadded this memory device again, we will
      reuse this memory to store struct page, and struct page may contain some
      obsolete information, and we will get bad-page state:
      
        init_memory_mapping: [mem 0x80000000-0x9fffffff]
        Built 2 zonelists in Node order, mobility grouping on.  Total pages: 547617
        Policy zone: Normal
        BUG: Bad page state in process bash  pfn:9b6dc
        page:ffffea0002200020 count:0 mapcount:0 mapping:          (null) index:0xfdfdfdfdfdfdfdfd
        page flags: 0x2fdfdfdfd5df9fd(locked|referenced|uptodate|dirty|lru|active|slab|owner_priv_1|private|private_2|writeback|head|tail|swapcache|reclaim|swapbacked|unevictable|uncached|compound_lock)
        Modules linked in: netconsole acpiphp pci_hotplug acpi_memhotplug loop kvm_amd kvm microcode tpm_tis tpm tpm_bios evdev psmouse serio_raw i2c_piix4 i2c_core parport_pc parport processor button thermal_sys ext3 jbd mbcache sg sr_mod cdrom ata_generic virtio_net ata_piix virtio_blk libata virtio_pci virtio_ring virtio scsi_mod
        Pid: 988, comm: bash Not tainted 3.6.0-rc7-guest #12
        Call Trace:
         [<ffffffff810e9b30>] ? bad_page+0xb0/0x100
         [<ffffffff810ea4c3>] ? free_pages_prepare+0xb3/0x100
         [<ffffffff810ea668>] ? free_hot_cold_page+0x48/0x1a0
         [<ffffffff8112cc08>] ? online_pages_range+0x68/0xa0
         [<ffffffff8112cba0>] ? __online_page_increment_counters+0x10/0x10
         [<ffffffff81045561>] ? walk_system_ram_range+0x101/0x110
         [<ffffffff814c4f95>] ? online_pages+0x1a5/0x2b0
         [<ffffffff8135663d>] ? __memory_block_change_state+0x20d/0x270
         [<ffffffff81356756>] ? store_mem_state+0xb6/0xf0
         [<ffffffff8119e482>] ? sysfs_write_file+0xd2/0x160
         [<ffffffff8113769a>] ? vfs_write+0xaa/0x160
         [<ffffffff81137977>] ? sys_write+0x47/0x90
         [<ffffffff814e2f25>] ? async_page_fault+0x25/0x30
         [<ffffffff814ea239>] ? system_call_fastpath+0x16/0x1b
        Disabling lock debugging due to kernel taint
      
      This patch clears the memory to store struct page to avoid unexpected error.
      Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reported-by: NVasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ac19f8e
    • W
      memory-hotplug: update mce_bad_pages when removing the memory · 95a4774d
      Wen Congyang 提交于
      When we hotremove a memory device, we will free the memory to store struct
      page.  If the page is hwpoisoned page, we should decrease mce_bad_pages.
      
      [akpm@linux-foundation.org: cleanup ifdefs]
      Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      95a4774d
  2. 01 12月, 2012 1 次提交
    • J
      mm/vmemmap: fix wrong use of virt_to_page · ae64ffca
      Jianguo Wu 提交于
      I enable CONFIG_DEBUG_VIRTUAL and CONFIG_SPARSEMEM_VMEMMAP, when doing
      memory hotremove, there is a kernel BUG at arch/x86/mm/physaddr.c:20.
      
      It is caused by free_section_usemap()->virt_to_page(), virt_to_page() is
      only used for kernel direct mapping address, but sparse-vmemmap uses
      vmemmap address, so it is going wrong here.
      
        ------------[ cut here ]------------
        kernel BUG at arch/x86/mm/physaddr.c:20!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: acpihp_drv acpihp_slot edd cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf fuse vfat fat loop dm_mod coretemp kvm crc32c_intel ipv6 ixgbe igb iTCO_wdt i7core_edac edac_core pcspkr iTCO_vendor_support ioatdma microcode joydev sr_mod i2c_i801 dca lpc_ich mfd_core mdio tpm_tis i2c_core hid_generic tpm cdrom sg tpm_bios rtc_cmos button ext3 jbd mbcache usbhid hid uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif processor thermal_sys hwmon scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh_emc scsi_dh ata_generic ata_piix libata megaraid_sas scsi_mod
        CPU 39
        Pid: 6454, comm: sh Not tainted 3.7.0-rc1-acpihp-final+ #45 QCI QSSC-S4R/QSSC-S4R
        RIP: 0010:[<ffffffff8103c908>]  [<ffffffff8103c908>] __phys_addr+0x88/0x90
        RSP: 0018:ffff8804440d7c08  EFLAGS: 00010006
        RAX: 0000000000000006 RBX: ffffea0012000000 RCX: 000000000000002c
        ...
      Signed-off-by: NJianguo Wu <wujianguo@huawei.com>
      Signed-off-by: NJiang Liu <jiang.liu@huawei.com>
      Reviewd-by: NWen Congyang <wency@cn.fujitsu.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae64ffca
  3. 01 8月, 2012 4 次提交
  4. 12 7月, 2012 2 次提交
    • Y
      mm: sparse: fix usemap allocation above node descriptor section · 99ab7b19
      Yinghai Lu 提交于
      After commit f5bf18fa ("bootmem/sparsemem: remove limit constraint
      in alloc_bootmem_section"), usemap allocations may easily be placed
      outside the optimal section that holds the node descriptor, even if
      there is space available in that section.  This results in unnecessary
      hotplug dependencies that need to have the node unplugged before the
      section holding the usemap.
      
      The reason is that the bootmem allocator doesn't guarantee a linear
      search starting from the passed allocation goal but may start out at a
      much higher address absent an upper limit.
      
      Fix this by trying the allocation with the limit at the section end,
      then retry without if that fails.  This keeps the fix from f5bf18fa
      of not panicking if the allocation does not fit in the section, but
      still makes sure to try to stay within the section at first.
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[3.3.x, 3.4.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      99ab7b19
    • Y
      mm: sparse: fix section usemap placement calculation · 07b4e2bc
      Yinghai Lu 提交于
      Commit 238305bb ("mm: remove sparsemem allocation details from the
      bootmem allocator") introduced a bug in the allocation goal calculation
      that put section usemaps not in the same section as the node
      descriptors, creating unnecessary hotplug dependencies between them:
      
        node 0 must be removed before remove section 16399
        node 1 must be removed before remove section 16399
        node 2 must be removed before remove section 16399
        node 3 must be removed before remove section 16399
        node 4 must be removed before remove section 16399
        node 5 must be removed before remove section 16399
        node 6 must be removed before remove section 16399
      
      The reason is that it applies PAGE_SECTION_MASK to the physical address
      of the node descriptor when finding a suitable place to put the usemap,
      when this mask is actually intended to be used with PFNs.  Because the
      PFN mask is wider, the target address will point beyond the wanted
      section holding the node descriptor and the node must be offlined before
      the section holding the usemap can go.
      
      Fix this by extending the mask to address width before use.
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      07b4e2bc
  5. 30 5月, 2012 1 次提交
  6. 22 3月, 2012 1 次提交
    • N
      bootmem/sparsemem: remove limit constraint in alloc_bootmem_section · f5bf18fa
      Nishanth Aravamudan 提交于
      While testing AMS (Active Memory Sharing) / CMO (Cooperative Memory
      Overcommit) on powerpc, we tripped the following:
      
        kernel BUG at mm/bootmem.c:483!
        cpu 0x0: Vector: 700 (Program Check) at [c000000000c03940]
            pc: c000000000a62bd8: .alloc_bootmem_core+0x90/0x39c
            lr: c000000000a64bcc: .sparse_early_usemaps_alloc_node+0x84/0x29c
            sp: c000000000c03bc0
           msr: 8000000000021032
          current = 0xc000000000b0cce0
          paca    = 0xc000000001d80000
            pid   = 0, comm = swapper
        kernel BUG at mm/bootmem.c:483!
        enter ? for help
        [c000000000c03c80] c000000000a64bcc
        .sparse_early_usemaps_alloc_node+0x84/0x29c
        [c000000000c03d50] c000000000a64f10 .sparse_init+0x12c/0x28c
        [c000000000c03e20] c000000000a474f4 .setup_arch+0x20c/0x294
        [c000000000c03ee0] c000000000a4079c .start_kernel+0xb4/0x460
        [c000000000c03f90] c000000000009670 .start_here_common+0x1c/0x2c
      
      This is
      
              BUG_ON(limit && goal + size > limit);
      
      and after some debugging, it seems that
      
      	goal = 0x7ffff000000
      	limit = 0x80000000000
      
      and sparse_early_usemaps_alloc_node ->
      sparse_early_usemaps_alloc_pgdat_section calls
      
      	return alloc_bootmem_section(usemap_size() * count, section_nr);
      
      This is on a system with 8TB available via the AMS pool, and as a quirk
      of AMS in firmware, all of that memory shows up in node 0.  So, we end
      up with an allocation that will fail the goal/limit constraints.
      
      In theory, we could "fall-back" to alloc_bootmem_node() in
      sparse_early_usemaps_alloc_node(), but since we actually have HOTREMOVE
      defined, we'll BUG_ON() instead.  A simple solution appears to be to
      unconditionally remove the limit condition in alloc_bootmem_section,
      meaning allocations are allowed to cross section boundaries (necessary
      for systems of this size).
      
      Johannes Weiner pointed out that if alloc_bootmem_section() no longer
      guarantees section-locality, we need check_usemap_section_nr() to print
      possible cross-dependencies between node descriptors and the usemaps
      allocated through it.  That makes the two loops in
      sparse_early_usemaps_alloc_node() identical, so re-factor the code a
      bit.
      
      [akpm@linux-foundation.org: code simplification]
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Anton Blanchard <anton@au1.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ben Herrenschmidt <benh@kernel.crashing.org>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>	[3.3.1]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f5bf18fa
  7. 31 10月, 2011 1 次提交
  8. 26 7月, 2011 1 次提交
  9. 31 3月, 2011 1 次提交
  10. 14 1月, 2011 1 次提交
  11. 25 5月, 2010 1 次提交
    • Y
      sparsemem: on no vmemmap path put mem_map on node high too · e48e67e0
      Yinghai Lu 提交于
      We need to put mem_map high when virtual memmap is not used.
      
      before this patch
      free mem pfn range on first node:
      [    0.000000]  19 - 1f
      [    0.000000]  28 40 - 80 95
      [    0.000000]  702 740 - 1000 1000
      [    0.000000]  347c - 347e
      [    0.000000]  34e7 3500 - 3b80 3b8b
      [    0.000000]  73b8b 73bc0 - 73c00 73c00
      [    0.000000]  73ddd - 73e00
      [    0.000000]  73fdd - 74000
      [    0.000000]  741dd - 74200
      [    0.000000]  743dd - 74400
      [    0.000000]  745dd - 74600
      [    0.000000]  747dd - 74800
      [    0.000000]  749dd - 74a00
      [    0.000000]  74bdd - 74c00
      [    0.000000]  74ddd - 74e00
      [    0.000000]  74fdd - 75000
      [    0.000000]  751dd - 75200
      [    0.000000]  753dd - 75400
      [    0.000000]  755dd - 75600
      [    0.000000]  757dd - 75800
      [    0.000000]  759dd - 75a00
      [    0.000000]  79bdd 79c00 - 7d540 7d550
      [    0.000000]  7f745 - 7f750
      [    0.000000]  10000b 100040 - 2080000 2080000
      so only 79c00 - 7d540 are major free block under 4g...
      
      after this patch, we will get
      [    0.000000]  19 - 1f
      [    0.000000]  28 40 - 80 95
      [    0.000000]  702 740 - 1000 1000
      [    0.000000]  347c - 347e
      [    0.000000]  34e7 3500 - 3600 3600
      [    0.000000]  37dd - 3800
      [    0.000000]  39dd - 3a00
      [    0.000000]  3bdd - 3c00
      [    0.000000]  3ddd - 3e00
      [    0.000000]  3fdd - 4000
      [    0.000000]  41dd - 4200
      [    0.000000]  43dd - 4400
      [    0.000000]  45dd - 4600
      [    0.000000]  47dd - 4800
      [    0.000000]  49dd - 4a00
      [    0.000000]  4bdd - 4c00
      [    0.000000]  4ddd - 4e00
      [    0.000000]  4fdd - 5000
      [    0.000000]  51dd - 5200
      [    0.000000]  53dd - 5400
      [    0.000000]  95dd 9600 - 7d540 7d550
      [    0.000000]  7f745 - 7f750
      [    0.000000]  17000b 170040 - 2080000 2080000
      we will have 9600 - 7d540 for major free block...
      
      sparse-vmemmap path already used __alloc_bootmem_node_high()
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e48e67e0
  12. 30 3月, 2010 1 次提交
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo 提交于
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  13. 02 3月, 2010 1 次提交
  14. 13 2月, 2010 2 次提交
    • Y
      sparsemem: Put mem map for one node together. · 9bdac914
      Yinghai Lu 提交于
      Add vmemmap_alloc_block_buf for mem map only.
      
      It will fallback to the old way if it cannot get a block that big.
      
      Before this patch, when a node have 128g ram installed, memmap are
      split into two parts or more.
      [    0.000000]  [ffffea0000000000-ffffea003fffffff] PMD -> [ffff880100600000-ffff88013e9fffff] on node 1
      [    0.000000]  [ffffea0040000000-ffffea006fffffff] PMD -> [ffff88013ec00000-ffff88016ebfffff] on node 1
      [    0.000000]  [ffffea0070000000-ffffea007fffffff] PMD -> [ffff882000600000-ffff8820105fffff] on node 0
      [    0.000000]  [ffffea0080000000-ffffea00bfffffff] PMD -> [ffff882010800000-ffff8820507fffff] on node 0
      [    0.000000]  [ffffea00c0000000-ffffea00dfffffff] PMD -> [ffff882050a00000-ffff8820709fffff] on node 0
      [    0.000000]  [ffffea00e0000000-ffffea00ffffffff] PMD -> [ffff884000600000-ffff8840205fffff] on node 2
      [    0.000000]  [ffffea0100000000-ffffea013fffffff] PMD -> [ffff884020800000-ffff8840607fffff] on node 2
      [    0.000000]  [ffffea0140000000-ffffea014fffffff] PMD -> [ffff884060a00000-ffff8840709fffff] on node 2
      [    0.000000]  [ffffea0150000000-ffffea017fffffff] PMD -> [ffff886000600000-ffff8860305fffff] on node 3
      [    0.000000]  [ffffea0180000000-ffffea01bfffffff] PMD -> [ffff886030800000-ffff8860707fffff] on node 3
      [    0.000000]  [ffffea01c0000000-ffffea01ffffffff] PMD -> [ffff888000600000-ffff8880405fffff] on node 4
      [    0.000000]  [ffffea0200000000-ffffea022fffffff] PMD -> [ffff888040800000-ffff8880707fffff] on node 4
      [    0.000000]  [ffffea0230000000-ffffea023fffffff] PMD -> [ffff88a000600000-ffff88a0105fffff] on node 5
      [    0.000000]  [ffffea0240000000-ffffea027fffffff] PMD -> [ffff88a010800000-ffff88a0507fffff] on node 5
      [    0.000000]  [ffffea0280000000-ffffea029fffffff] PMD -> [ffff88a050a00000-ffff88a0709fffff] on node 5
      [    0.000000]  [ffffea02a0000000-ffffea02bfffffff] PMD -> [ffff88c000600000-ffff88c0205fffff] on node 6
      [    0.000000]  [ffffea02c0000000-ffffea02ffffffff] PMD -> [ffff88c020800000-ffff88c0607fffff] on node 6
      [    0.000000]  [ffffea0300000000-ffffea030fffffff] PMD -> [ffff88c060a00000-ffff88c0709fffff] on node 6
      [    0.000000]  [ffffea0310000000-ffffea033fffffff] PMD -> [ffff88e000600000-ffff88e0305fffff] on node 7
      [    0.000000]  [ffffea0340000000-ffffea037fffffff] PMD -> [ffff88e030800000-ffff88e0707fffff] on node 7
      
      after patch will get
      [    0.000000]  [ffffea0000000000-ffffea006fffffff] PMD -> [ffff880100200000-ffff88016e5fffff] on node 0
      [    0.000000]  [ffffea0070000000-ffffea00dfffffff] PMD -> [ffff882000200000-ffff8820701fffff] on node 1
      [    0.000000]  [ffffea00e0000000-ffffea014fffffff] PMD -> [ffff884000200000-ffff8840701fffff] on node 2
      [    0.000000]  [ffffea0150000000-ffffea01bfffffff] PMD -> [ffff886000200000-ffff8860701fffff] on node 3
      [    0.000000]  [ffffea01c0000000-ffffea022fffffff] PMD -> [ffff888000200000-ffff8880701fffff] on node 4
      [    0.000000]  [ffffea0230000000-ffffea029fffffff] PMD -> [ffff88a000200000-ffff88a0701fffff] on node 5
      [    0.000000]  [ffffea02a0000000-ffffea030fffffff] PMD -> [ffff88c000200000-ffff88c0701fffff] on node 6
      [    0.000000]  [ffffea0310000000-ffffea037fffffff] PMD -> [ffff88e000200000-ffff88e0701fffff] on node 7
      
      -v2: change buf to vmemmap_buf instead according to Ingo
           also add CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER according to Ingo
      -v3: according to Andrew, use sizeof(name) instead of hard coded 15
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      LKML-Reference: <1265793639-15071-19-git-send-email-yinghai@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: NChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      9bdac914
    • Y
      sparsemem: Put usemap for one node together · a4322e1b
      Yinghai Lu 提交于
      Could save some buffer space instead of applying one by one.
      
      Could help that system that is going to use early_res instead of bootmem
      less entries in early_res make search more faster on system with more memory.
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      LKML-Reference: <1265793639-15071-18-git-send-email-yinghai@kernel.org>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      a4322e1b
  15. 22 9月, 2009 1 次提交
  16. 01 4月, 2009 1 次提交
  17. 01 12月, 2008 1 次提交
  18. 13 8月, 2008 1 次提交
  19. 27 7月, 2008 1 次提交
  20. 25 7月, 2008 2 次提交
    • Y
      memory hotplug: allocate usemap on the section with pgdat · 48c90682
      Yasunori Goto 提交于
      Usemaps are allocated on the section which has pgdat by this.
      
      Because usemap size is very small, many other sections usemaps are
      allocated on only one page.  If a section has usemap, it can't be removed
      until removing other sections.  This dependency is not desirable for
      memory removing.
      
      Pgdat has similar feature.  When a section has pgdat area, it must be the
      last section for removing on the node.  So, if section A has pgdat and
      section B has usemap for section A, Both sections can't be removed due to
      dependency each other.
      
      To solve this issue, this patch collects usemap on same section with pgdat
      as much as possible.  If other sections doesn't have any dependency, this
      section will be able to be removed finally.
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hiroyuki KAMEZAWA <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tony Breeds <tony@bakeyournoodle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      48c90682
    • M
      mm: make defensive checks around PFN values registered for memory usage · 2dbb51c4
      Mel Gorman 提交于
      There are a number of different views to how much memory is currently active.
      There is the arch-independent zone-sizing view, the bootmem allocator and
      memory models view.
      
      Architectures register this information at different times and is not
      necessarily in sync particularly with respect to some SPARSEMEM limitations.
      
      This patch introduces mminit_validate_memmodel_limits() which is able to
      validate and correct PFN ranges with respect to the memory model.  It is only
      SPARSEMEM that currently validates itself.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2dbb51c4
  21. 30 4月, 2008 2 次提交
  22. 28 4月, 2008 5 次提交
    • Y
      memory hotplug: free memmaps allocated by bootmem · 0c0a4a51
      Yasunori Goto 提交于
      This patch is to free memmaps which is allocated by bootmem.
      
      Freeing usemap is not necessary.  The pages of usemap may be necessary for
      other sections.
      
      If removing section is last section on the node, its section is the final user
      of usemap page.  (usemaps are allocated on its section by previous patch.) But
      it shouldn't be freed too, because the section must be logical offline state
      which all pages are isolated against page allocater.  If it is freed, page
      alloctor may use it which will be removed physically soon.  It will be
      disaster.  So, this patch keeps it as it is.
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0c0a4a51
    • Y
      memory hotplug: allocate usemap on the section with pgdat · 86f6dae1
      Yasunori Goto 提交于
      Usemaps are allocated on the section which has pgdat by this.
      
      Because usemap size is very small, many other sections usemaps are allocated
      on only one page.  If a section has usemap, it can't be removed until removing
      other sections.  This dependency is not desirable for memory removing.
      
      Pgdat has similar feature.  When a section has pgdat area, it must be the last
      section for removing on the node.  So, if section A has pgdat and section B
      has usemap for section A, Both sections can't be removed due to dependency
      each other.
      
      To solve this issue, this patch collects usemap on same section with pgdat.
      If other sections doesn't have any dependency, this section will be able to be
      removed finally.
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86f6dae1
    • Y
      memory hotplug: align memmap to page size · 9d99217a
      Yasunori Goto 提交于
      To free memmap easier, this patch aligns it to page size.  Bootmem allocater
      may mix some objects in one pages.  It's not good for freeing memmap of memory
      hot-remove.
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9d99217a
    • Y
      memory hotplug: register section/node id to free · 04753278
      Yasunori Goto 提交于
      This patch set is to free pages which is allocated by bootmem for
      memory-hotremove.  Some structures of memory management are allocated by
      bootmem.  ex) memmap, etc.
      
      To remove memory physically, some of them must be freed according to
      circumstance.  This patch set makes basis to free those pages, and free
      memmaps.
      
      Basic my idea is using remain members of struct page to remember information
      of users of bootmem (section number or node id).  When the section is
      removing, kernel can confirm it.  By this information, some issues can be
      solved.
      
        1) When the memmap of removing section is allocated on other
           section by bootmem, it should/can be free.
        2) When the memmap of removing section is allocated on the
           same section, it shouldn't be freed. Because the section has to be
           logical memory offlined already and all pages must be isolated against
           page allocater. If it is freed, page allocator may use it which will
           be removed physically soon.
        3) When removing section has other section's memmap,
           kernel will be able to show easily which section should be removed
           before it for user. (Not implemented yet)
        4) When the above case 2), the page isolation will be able to check and skip
           memmap's page when logical memory offline (offline_pages()).
           Current page isolation code fails in this case because this page is
           just reserved page and it can't distinguish this pages can be
           removed or not. But, it will be able to do by this patch.
           (Not implemented yet.)
        5) The node information like pgdat has similar issues. But, this
           will be able to be solved too by this.
           (Not implemented yet, but, remembering node id in the pages.)
      
      Fortunately, current bootmem allocator just keeps PageReserved flags,
      and doesn't use any other members of page struct. The users of
      bootmem doesn't use them too.
      
      This patch:
      
      This is to register information which is node or section's id.  Kernel can
      distinguish which node/section uses the pages allcated by bootmem.  This is
      basis for hot-remove sections or nodes.
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04753278
    • B
      hotplug memory remove: generic __remove_pages() support · ea01ea93
      Badari Pulavarty 提交于
      Generic helper function to remove section mappings and sysfs entries for the
      section of the memory we are removing.  offline_pages() correctly adjusted
      zone and marked the pages reserved.
      
      TODO: Yasunori Goto is working on patches to free up allocations from bootmem.
      Signed-off-by: NBadari Pulavarty <pbadari@us.ibm.com>
      Acked-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea01ea93
  23. 27 4月, 2008 2 次提交
    • Y
      x86_64/mm: check and print vmemmap allocation continuous · c2b91e2e
      Yinghai Lu 提交于
      On big systems with lots of memory, don't print out too much during
      bootup, and make it easy to find if it is continuous.
      
      on 256G 8 sockets system will get
       [ffffe20000000000-ffffe20002bfffff] PMD -> [ffff810001400000-ffff810003ffffff] on node 0
      [ffffe2001c700000-ffffe2001c7fffff] potential offnode page_structs
       [ffffe20002c00000-ffffe2001c7fffff] PMD -> [ffff81000c000000-ffff8100255fffff] on node 0
      [ffffe20038700000-ffffe200387fffff] potential offnode page_structs
       [ffffe2001c800000-ffffe200387fffff] PMD -> [ffff810820200000-ffff81083c1fffff] on node 1
       [ffffe20040000000-ffffe2007fffffff] PUD ->ffff811027a00000 on node 2
       [ffffe20038800000-ffffe2003fffffff] PMD -> [ffff811020200000-ffff8110279fffff] on node 2
      [ffffe20054700000-ffffe200547fffff] potential offnode page_structs
       [ffffe20040000000-ffffe200547fffff] PMD -> [ffff811027c00000-ffff81103c3fffff] on node 2
      [ffffe20070700000-ffffe200707fffff] potential offnode page_structs
       [ffffe20054800000-ffffe200707fffff] PMD -> [ffff811820200000-ffff81183c1fffff] on node 3
       [ffffe20080000000-ffffe200bfffffff] PUD ->ffff81202fa00000 on node 4
       [ffffe20070800000-ffffe2007fffffff] PMD -> [ffff812020200000-ffff81202f9fffff] on node 4
      [ffffe2008c700000-ffffe2008c7fffff] potential offnode page_structs
       [ffffe20080000000-ffffe2008c7fffff] PMD -> [ffff81202fc00000-ffff81203c3fffff] on node 4
      [ffffe200a8700000-ffffe200a87fffff] potential offnode page_structs
       [ffffe2008c800000-ffffe200a87fffff] PMD -> [ffff812820200000-ffff81283c1fffff] on node 5
       [ffffe200c0000000-ffffe200ffffffff] PUD ->ffff813037a00000 on node 6
       [ffffe200a8800000-ffffe200bfffffff] PMD -> [ffff813020200000-ffff8130379fffff] on node 6
      [ffffe200c4700000-ffffe200c47fffff] potential offnode page_structs
       [ffffe200c0000000-ffffe200c47fffff] PMD -> [ffff813037c00000-ffff81303c3fffff] on node 6
       [ffffe200c4800000-ffffe200e07fffff] PMD -> [ffff813820200000-ffff81383c1fffff] on node 7
      
      instead of a very long print out...
      Signed-off-by: NYinghai Lu <yhlu.kernel@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      c2b91e2e
    • Y
      mm: make mem_map allocation continuous · e123dd3f
      Yinghai Lu 提交于
      vmemmap allocation currently has this layout:
      
       [ffffe20000000000-ffffe200001fffff] PMD ->ffff810001400000 on node 0
       [ffffe20000200000-ffffe200003fffff] PMD ->ffff810001800000 on node 0
       [ffffe20000400000-ffffe200005fffff] PMD ->ffff810001c00000 on node 0
       [ffffe20000600000-ffffe200007fffff] PMD ->ffff810002000000 on node 0
       [ffffe20000800000-ffffe200009fffff] PMD ->ffff810002400000 on node 0
      ...
      
      note that there is a 2M hole between them - not optimal.
      
      the root cause is that usemap (24 bytes) will be allocated after every 2M
      mem_map, and it will push next vmemmap (2M) to the next (2M) alignment.
      
      solution: try to allocate the mem_map continously.
      
      after the patch, we get:
      
       [ffffe20000000000-ffffe200001fffff] PMD ->ffff810001400000 on node 0
       [ffffe20000200000-ffffe200003fffff] PMD ->ffff810001600000 on node 0
       [ffffe20000400000-ffffe200005fffff] PMD ->ffff810001800000 on node 0
       [ffffe20000600000-ffffe200007fffff] PMD ->ffff810001a00000 on node 0
       [ffffe20000800000-ffffe200009fffff] PMD ->ffff810001c00000 on node 0
      ...
      
      which is the ideal layout.
      
      and usemap will share a page because of they are allocated continuously too:
      
      sparse_early_usemap_alloc: usemap = ffff810024e00000 size = 24
      sparse_early_usemap_alloc: usemap = ffff810024e00080 size = 24
      sparse_early_usemap_alloc: usemap = ffff810024e00100 size = 24
      sparse_early_usemap_alloc: usemap = ffff810024e00180 size = 24
      ...
      
      so we make the bootmem allocation more compact and use less memory
      for usemap => mission accomplished ;-)
      Signed-off-by: NYinghai Lu <yhlu.kernel@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e123dd3f
  24. 16 4月, 2008 1 次提交
    • I
      mm: sparsemem memory_present() fix · bead9a3a
      Ingo Molnar 提交于
      Fix memory corruption and crash on 32-bit x86 systems.
      
      If a !PAE x86 kernel is booted on a 32-bit system with more than 4GB of
      RAM, then we call memory_present() with a start/end that goes outside
      the scope of MAX_PHYSMEM_BITS.
      
      That causes this loop to happily walk over the limit of the sparse
      memory section map:
      
          for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
                      unsigned long section = pfn_to_section_nr(pfn);
                      struct mem_section *ms;
      
                      sparse_index_init(section, nid);
                      set_section_nid(section, nid);
      
                      ms = __nr_to_section(section);
                      if (!ms->section_mem_map)
                              ms->section_mem_map = sparse_encode_early_nid(nid) |
      			                                SECTION_MARKED_PRESENT;
      
      'ms' will be out of bounds and we'll corrupt a small amount of memory by
      encoding the node ID and writing SECTION_MARKED_PRESENT (==0x1) over it.
      
      The corruption might happen when encoding a non-zero node ID, or due to
      the SECTION_MARKED_PRESENT which is 0x1:
      
      	mmzone.h:#define	SECTION_MARKED_PRESENT	(1UL<<0)
      
      The fix is to sanity check anything the architecture passes to
      sparsemem.
      
      This bug seems to be rather old (as old as sparsemem support itself),
      but the exact incarnation depended on random details like configs, which
      made this bug more prominent in v2.6.25-to-be.
      
      An additional enhancement might be to print a warning about ignored or
      trimmed memory ranges.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Tested-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Yinghai Lu <Yinghai.Lu@sun.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bead9a3a
  25. 06 2月, 2008 2 次提交
  26. 18 12月, 2007 1 次提交