1. 31 July 2014, 1 commit
  2. 13 November 2013, 3 commits
    • x86/mem-hotplug: support initialize page tables in bottom-up · b959ed6c
      Authored by Tang Chen
      The Linux kernel cannot migrate pages used by the kernel.  As a result,
      kernel pages cannot be hot-removed.  So we cannot allocate hotpluggable
      memory for the kernel.
      
      In a memory hotplug system, any NUMA node the kernel resides in becomes
      unhotpluggable.  And on a modern server each node can have at least
      16GB of memory, so the memory around the kernel image is very likely
      unhotpluggable.
      
      ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
      info.  But before SRAT is parsed, memblock has already started to allocate
      memory for the kernel.  So we need to prevent memblock from doing this.
      
      The direct memory mapping page table setup is exactly such a case:
      init_mem_mapping() is called before SRAT is parsed.  To keep the page
      tables out of hotpluggable memory, we allocate them in the bottom-up
      direction, from the end of the kernel image toward higher memory (a
      sketch of the resulting flow follows this entry).
      
      Note:
      As for allocating page tables in lower memory, TJ said:
      
      : This is an optional behavior which is triggered by a very specific kernel
      : boot param, which I suspect is gonna need to stick around to support
      : memory hotplug in the current setup unless we add another layer of address
      : translation to support memory hotplug.
      
      As for the concern that page tables may occupy too much low memory when
      4K mappings are used (CONFIG_DEBUG_PAGEALLOC and CONFIG_KMEMCHECK both
      disable the use of >4K pages), TJ said:
      
      : But as I said in the same paragraph, parsing SRAT earlier doesn't solve
      : the problem in itself either.  Ignoring the option if 4k mapping is
      : required and memory consumption would be prohibitive should work, no?
      : Something like that would be necessary if we're gonna worry about cases
      : like this no matter how we implement it, but, frankly, I'm not sure this
      : is something worth worrying about.
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b959ed6c
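      A minimal sketch, in kernel-style C, of the init_mem_mapping() flow this
      commit describes: when memblock is in bottom-up mode (selected by a boot
      option elsewhere in this series), the direct-mapping page tables are built
      upward starting from the end of the kernel image, so they land next to the
      kernel rather than in memory that SRAT may later report as hotpluggable.
      This is a simplified illustration of arch/x86/mm/init.c, not the exact
      upstream code.

      void __init init_mem_mapping(void)
      {
              unsigned long end = max_pfn << PAGE_SHIFT;      /* 64-bit case */

              probe_page_size_mask();

              /* The ISA range [0, 1M) is always mapped, regardless of holes. */
              init_memory_mapping(0, ISA_END_ADDRESS);

              if (memblock_bottom_up()) {
                      unsigned long kernel_end = __pa_symbol(_end);

                      /*
                       * Two calls on purpose: map [kernel_end, end) first so
                       * that memory just above the kernel becomes usable for
                       * page table allocations, then use it to map the rest,
                       * [ISA_END_ADDRESS, kernel_end).
                       */
                      memory_map_bottom_up(kernel_end, end);
                      memory_map_bottom_up(ISA_END_ADDRESS, kernel_end);
              } else {
                      /* Default behavior: map from the top of RAM downward. */
                      memory_map_top_down(ISA_END_ADDRESS, end);
              }

              /* ... trampoline setup, early_memtest(), etc. unchanged ... */
      }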
    • x86/mm: factor out of top-down direct mapping setup · 0167d7d8
      Authored by Tang Chen
      Create a new function, memory_map_top_down(), to factor out the top-down
      direct memory mapping page table setup.  This is also a preparation for
      the following patch, which will introduce bottom-up memory mapping.  That
      is, the two ways of setting up the page tables are put into separate
      functions, and init_mem_mapping() chooses which one to use, which makes
      the code clearer (a simplified sketch of the top-down helper follows this
      entry).
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0167d7d8
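      A simplified sketch of the factored-out helper, assuming
      init_range_memory_mapping() maps one chunk and returns how much RAM it
      mapped, and get_new_step_size() grows the chunk size; the real version in
      arch/x86/mm/init.c carries extra bookkeeping (min_pfn_mapped, skipping a
      reserved region near the end of RAM) that is omitted here.

      static void __init memory_map_top_down(unsigned long map_start,
                                             unsigned long map_end)
      {
              unsigned long last_start = map_end;
              unsigned long step_size = PMD_SIZE;  /* small: BRK pgt_buf must cover it */
              unsigned long mapped_ram_size = 0;

              /*
               * Walk from the top of the range toward map_start in growing
               * chunks.  Each mapped chunk provides directly-mapped memory
               * that the next (lower) chunk can use for its page tables.
               */
              while (last_start > map_start) {
                      unsigned long start, new_size;

                      if (last_start > step_size) {
                              start = round_down(last_start - 1, step_size);
                              if (start < map_start)
                                      start = map_start;
                      } else {
                              start = map_start;
                      }

                      new_size = init_range_memory_mapping(start, last_start);
                      last_start = start;

                      /* Grow the step only after a big range has been mapped. */
                      if (new_size > mapped_ram_size)
                              step_size = get_new_step_size(step_size);
                      mapped_ram_size += new_size;
              }
      }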
  3. 10 September 2013, 1 commit
  4. 20 August 2013, 1 commit
    • x86/mm: Fix boot crash with DEBUG_PAGE_ALLOC=y and more than 512G RAM · 527bf129
      Authored by Yinghai Lu
      Dave Hansen reported that systems with between 500G and 600G of RAM
      crash early if DEBUG_PAGEALLOC is selected.
      
       > [    0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
       > [    0.000000]  [mem 0x00000000-0x000fffff] page 4k
       > [    0.000000] BRK [0x02086000, 0x02086fff] PGTABLE
       > [    0.000000] BRK [0x02087000, 0x02087fff] PGTABLE
       > [    0.000000] BRK [0x02088000, 0x02088fff] PGTABLE
       > [    0.000000] init_memory_mapping: [mem 0xe80ee00000-0xe80effffff]
       > [    0.000000]  [mem 0xe80ee00000-0xe80effffff] page 4k
       > [    0.000000] BRK [0x02089000, 0x02089fff] PGTABLE
       > [    0.000000] BRK [0x0208a000, 0x0208afff] PGTABLE
       > [    0.000000] Kernel panic - not syncing: alloc_low_page: ran out of memory
      
      It turns out that we missed increasing the number of pages needed in
      BRK for mapping the initial 2M and [0,1M) ranges when we switched to
      using the #PF handler to set up the memory mappings:
      
       > commit 8170e6be
       > Author: H. Peter Anvin <hpa@zytor.com>
       > Date:   Thu Jan 24 12:19:52 2013 -0800
       >
       >     x86, 64bit: Use a #PF handler to materialize early mappings on demand
      
      Before that, we had the mapping for [0,512M) in head_64.S and could
      spare two pages for [0,1M).  After that change, we cannot reuse those
      pages anymore.
      
      When we have more than 512G of RAM, we need an extra page for the pgd
      entry covering [512G, 1024G).
      
      Increase the pages reserved in BRK for page tables to solve the boot
      crash (a sketch of the reservation follows this entry).
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Bisected-by: Dave Hansen <dave.hansen@intel.com>
      Tested-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: <stable@vger.kernel.org> # v3.9 and later
      Link: http://lkml.kernel.org/r/1376351004-4015-1-git-send-email-yinghai@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      527bf129
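      For context, the early page table pages come from a small buffer reserved
      in the kernel's BRK area, and this fix bumps the number of pages reserved
      there.  A sketch of the mechanism follows; the page count and comment
      reflect my reading of the fix and may not match the patch verbatim.

      /* pages for the initial PMD_SIZE mapping plus pages for [0, ISA_END_ADDRESS) */
      #define INIT_PGT_BUF_SIZE       (6 * PAGE_SIZE)

      RESERVE_BRK(early_pgt_alloc, INIT_PGT_BUF_SIZE);

      void __init early_alloc_pgt_buf(void)
      {
              unsigned long tables = INIT_PGT_BUF_SIZE;
              phys_addr_t base;

              /*
               * Carve the buffer out of BRK.  alloc_low_pages() hands out
               * pages from [pgt_buf_start, pgt_buf_top) until it is empty and
               * only then falls back to memblock for already-mapped memory;
               * if the buffer is one page short, early boot panics as in the
               * report above.
               */
              base = __pa(extend_brk(tables, PAGE_SIZE));

              pgt_buf_start = base >> PAGE_SHIFT;
              pgt_buf_end   = pgt_buf_start;
              pgt_buf_top   = pgt_buf_start + (tables >> PAGE_SHIFT);
      }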
  5. 04 July 2013, 1 commit
    • mm/x86: use free_reserved_area() to simplify code · c88442ec
      Authored by Jiang Liu
      Use the common helper function free_reserved_area() to simplify the code
      (a before/after sketch follows this entry).
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: <sworddragon2@aol.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c88442ec
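      The simplification, roughly (a hedged before/after sketch; the helper
      poisons, frees and accounts each page in the range, replacing the loop
      that callers used to open-code; check the actual patch for the helper's
      exact signature in this kernel version):

      /* Before: every caller open-coded the same loop. */
      for (addr = begin; addr < end; addr += PAGE_SIZE) {
              memset((void *)addr, POISON_FREE_INITMEM, PAGE_SIZE);
              free_reserved_page(virt_to_page(addr));
      }

      /* After: one call to the common mm helper. */
      free_reserved_area((void *)begin, (void *)end, POISON_FREE_INITMEM, what);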
  6. 01 June 2013, 1 commit
    • x86: Fix adjust_range_size_mask calling position · 7de3d66b
      Authored by Yinghai Lu
      Commit
      
          8d57470d x86, mm: setup page table in top-down
      
      causes a kernel panic when the kernel is booted with mem=2G.
      
           [mem 0x00000000-0x000fffff] page 4k
           [mem 0x7fe00000-0x7fffffff] page 1G
           [mem 0x7c000000-0x7fdfffff] page 1G
           [mem 0x00100000-0x001fffff] page 4k
           [mem 0x00200000-0x7bffffff] page 2M
      
      The last entry is not what we want; we should have:
           [mem 0x00200000-0x3fffffff] page 2M
           [mem 0x40000000-0x7bffffff] page 1G
      
      Actually we merge contiguous ranges with the same page size too early.
      In this case, before merging we have
           [mem 0x00200000-0x3fffffff] page 2M
           [mem 0x40000000-0x7bffffff] page 2M
      and after merging them we get
           [mem 0x00200000-0x7bffffff] page 2M
      even though we could use a 1G page to map
           [mem 0x40000000-0x7bffffff]
      
      That causes a problem, because we have already mapped
           [mem 0x7fe00000-0x7fffffff] page 1G
           [mem 0x7c000000-0x7fdfffff] page 1G
      with 1G pages, i.e. [0x40000000-0x7fffffff] is already mapped with a 1G
      page.  During phys_pud_init() for [0x40000000-0x7bffffff], the existing
      pud page is not reused; a new one is allocated, and the range is then
      mapped with 2M pages instead, because page_size_mask does not include
      PG_LEVEL_1G.  In the end [0x7c000000-0x7fffffff] is left unmapped, since
      the loop in phys_pmd_init() stops mapping at 0x7bffffff.
      
      That is the right behavior: it maps exactly the range, with exactly the
      page size, that we ask for, and we would have to explicitly call it to
      map [0x7c000000-0x7fffffff] before or after mapping
      [0x40000000-0x7bffffff].
      In any case we need to make sure each range's page_size_mask is correct
      and consistent after split_mem_range().
      
      Fix that by calling adjust_range_size_mask() before merging ranges with
      the same page size (a sketch of the reordering follows this entry).
      
      -v2: update the changelog.
      -v3: add more explanation of why [0x7c000000-0x7fffffff] is not mapped and
          why that causes the panic.
      Bisected-by: "Xie, ChanglongX" <changlongx.xie@intel.com>
      Bisected-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Reported-and-tested-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1370015587-20835-1-git-send-email-yinghai@kernel.org
      Cc: <stable@vger.kernel.org> v3.9
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      7de3d66b
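      A sketch of the reordering inside split_mem_range(); in the source the
      helper is spelled adjust_range_page_size_mask(), and the splitting and
      merging logic is abbreviated to comments here.  The point is that the
      per-range page_size_mask must be fixed up while the ranges are still
      split on 2M/1G boundaries, i.e. before adjacent ranges with the same page
      size are merged.

      static int __meminit split_mem_range(struct map_range *mr, int nr_range,
                                           unsigned long start,
                                           unsigned long end)
      {
              int i;

              /* ... split [start, end) into 4K / 2M / 1G aligned pieces ... */

              /*
               * Promote a piece's page_size_mask to 2M/1G when the large
               * page(s) covering it lie entirely in RAM.  Doing this BEFORE
               * merging keeps a 1G-capable piece from being folded into a
               * neighboring 2M-only piece and losing its 1G mapping.
               */
              if (!after_bootmem)
                      adjust_range_page_size_mask(mr, nr_range);

              /* Only now merge contiguous ranges that share a page size. */
              for (i = 0; nr_range > 1 && i < nr_range - 1; i++) {
                      if (mr[i].end != mr[i + 1].start ||
                          mr[i].page_size_mask != mr[i + 1].page_size_mask)
                              continue;
                      /* ... merge mr[i] and mr[i + 1], shift the rest down ... */
              }

              return nr_range;
      }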
  7. 10 May 2013, 1 commit
  8. 30 April 2013, 1 commit
  9. 07 March 2013, 1 commit
  10. 01 February 2013, 1 commit
  11. 30 January 2013, 3 commits
    • x86, kexec, 64bit: Only set ident mapping for ram. · 0e691cf8
      Authored by Yinghai Lu
      We should set up mappings only for usable memory ranges under max_pfn.
      Otherwise we cause the same problem that was fixed by
      
      	x86, mm: Only direct map addresses that are marked as E820_RAM
      
      This patch exposes the pfn_mapped array and only sets the identity
      mapping for the ranges in that array (a sketch follows this entry).
      
      This patch relies on the new kernel_ident_mapping_init(), which can
      handle pre-existing pgd/pud entries across different calls.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1359058816-7615-25-git-send-email-yinghai@kernel.org
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      0e691cf8
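      A sketch of the kexec identity-mapping setup after this change, close to
      arch/x86/kernel/machine_kexec_64.c of that era but simplified; the
      x86_mapping_info field names (e.g. .pmd_flag) and the alloc_pgt_page()
      helper are recalled from that period and may differ in detail.

      static int init_pgtable(struct kimage *image, unsigned long start_pgtable)
      {
              struct x86_mapping_info info = {
                      .alloc_pgt_page = alloc_pgt_page,
                      .context        = image,
                      .pmd_flag       = __PAGE_KERNEL_LARGE_EXEC,
              };
              pgd_t *level4p = (pgd_t *)__va(start_pgtable);
              int result, i;

              clear_page(level4p);

              /*
               * Identity-map only the ranges the kernel itself direct-mapped
               * (the newly exposed pfn_mapped[] array) instead of blindly
               * mapping everything below max_pfn.
               */
              for (i = 0; i < nr_pfn_mapped; i++) {
                      unsigned long mstart = pfn_mapped[i].start << PAGE_SHIFT;
                      unsigned long mend   = pfn_mapped[i].end   << PAGE_SHIFT;

                      result = kernel_ident_mapping_init(&info, level4p,
                                                         mstart, mend);
                      if (result)
                              return result;
              }

              /* ... plus mappings for the kexec segments themselves ... */
              return 0;
      }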
    • x86, 64bit: Use a #PF handler to materialize early mappings on demand · 8170e6be
      Authored by H. Peter Anvin
      Linear mode (CR0.PG = 0) is mutually exclusive with 64-bit mode; all
      64-bit code has to use page tables.  This makes it awkward, before we
      have properly set up all-covering page tables, to access objects that
      are outside the static kernel range.
      
      So far we have dealt with that simply by mapping a fixed amount of
      low memory, but that fails in at least two upcoming use cases:
      
      1. We will support loading and running the kernel, struct boot_params,
         ramdisk, command line, etc. above the 4 GiB mark.
      2. We need to access the ramdisk early so that the microcode can be
         updated as early as possible.
      
      We could use early_iomap to access them too, but it would make the code
      messy and hard to unify with 32-bit.
      
      Hence, set up a #PF handler and use a fixed number of buffer pages to set
      up page tables on demand.  If the buffers fill up, we simply flush them
      and start over.  These buffers are all in __initdata, so they do not
      increase RAM usage at runtime (a simplified sketch of the mechanism
      follows this entry).
      
      Thus, with the help of the #PF handler, we can build the final kernel
      mapping from scratch and switch to init_level4_pgt later.
      
      During the switchover in head_64.S, before the #PF handler is available,
      we use three pages to handle the kernel crossing the 1G and 512G
      boundaries, sharing a page by playing games with page aliasing: the same
      page is mapped twice in the higher-level tables with appropriate
      wraparound.
      The kernel region itself will be properly mapped; other mappings may
      be spurious.
      
      early_make_pgtable() uses the kernel high-mapping addresses to access the
      pages used to build the page tables.
      
      -v4: Add phys_base offset to make kexec happy, and add
      	init_mapping_kernel()   - Yinghai
      -v5: fix compiling with Xen, and add back the ident level3 and level2 for Xen;
           also move init_level4_pgt back from BSS to DATA again,
           because we have to clear it anyway.  - Yinghai
      -v6: switch to init_level4_pgt in init_mem_mapping. - Yinghai
      -v7: remove the unneeded clear_page for init_level4_page;
           it is already filled with 512,8,0 in head_64.S  - Yinghai
      -v8: we need to keep that handler alive until init_mem_mapping and must not
           let early_trap_init trash that early #PF handler.
           So split out early_trap_pf_init and move it down. - Yinghai
      -v9: the switchover only covers kernel space instead of 1G, so it can avoid
           touching possible memory holes. - Yinghai
      -v11: change the far jmp back to a far return to initial_code; that is needed
           to fix a failure reported by Konrad on AMD systems.  - Yinghai
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1359058816-7615-12-git-send-email-yinghai@kernel.org
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      8170e6be
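      A heavily simplified sketch of the on-demand mechanism; the real code is
      early_make_pgtable() in arch/x86/kernel/head64.c plus the early IDT setup
      in head_64.S.  The pool name, its size, and the early_walk_or_alloc_pmd()
      helper below are hypothetical stand-ins for the open-coded pgd/pud walk
      in the real function.

      /* Small pool of page-table pages, usable before memblock is up.
       * __initdata, so it is thrown away after boot. */
      #define NR_EARLY_PGT_PAGES 64           /* illustrative size */

      static char early_pgt_pool[NR_EARLY_PGT_PAGES][PAGE_SIZE]
              __initdata __aligned(PAGE_SIZE);
      static int next_early_pgt __initdata;

      static void *__init alloc_early_pgt_page(void)
      {
              if (next_early_pgt >= NR_EARLY_PGT_PAGES) {
                      /* Pool exhausted: flush the dynamic entries, reuse it. */
                      reset_early_page_tables();
                      next_early_pgt = 0;
              }
              return memset(early_pgt_pool[next_early_pgt++], 0, PAGE_SIZE);
      }

      /*
       * Called from the early #PF handler: build the pgd -> pud -> pmd chain
       * for 'address' on demand, then install a 2M mapping covering it.
       */
      int __init early_make_pgtable(unsigned long address)
      {
              unsigned long physaddr = address - __PAGE_OFFSET;
              pmd_t *pmd;

              /* hypothetical helper standing in for the real open-coded walk */
              pmd = early_walk_or_alloc_pmd(address, alloc_early_pgt_page);
              set_pmd(pmd, __pmd((physaddr & PMD_MASK) | __PAGE_KERNEL_LARGE));
              return 0;
      }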
    • x86, mm: Fix page table early allocation offset checking · c9b3234a
      Authored by Yinghai Lu
      While debugging loading the kernel above 4G, we found that one page in
      the pre-allocated BRK area for early page table allocation was never
      used.
      pgt_buf_top is the first address that cannot be used, so we should check
      whether the new end is above that top; otherwise the last page will
      never be used.
      
      Fix that check and also add a printout for allocations from the
      pre-allocated BRK area to catch possible bugs later (a sketch of the
      check follows this entry).
      
      But after we get that page back for page tables, it triggers a bug in
      page table allocation with Xen: we need to avoid using a page as a page
      table to map a range that overlaps with that page itself.
      
      Add a check for this overlap; when it happens, use a memblock allocation
      instead.  That fixes the crash on a 2G Xen PV guest that Stefano found.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1359058816-7615-2-git-send-email-yinghai@kernel.org
      Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Tested-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      c9b3234a
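      A sketch of the corrected check in alloc_low_pages(), simplified from
      arch/x86/mm/init.c; alloc_from_memblock() is a placeholder for the real
      memblock fallback path.  pgt_buf_top is the first page frame that may NOT
      be used, so the buffer is only exhausted when the new end goes strictly
      past it ('>' rather than the old '>=', which wasted the last page), and
      the buffer must also be skipped while mapping a range that overlaps the
      buffer itself.

      static bool __initdata can_use_brk_pgt = true;

      __ref void *alloc_low_pages(unsigned int num)
      {
              unsigned long pfn;

              if (pgt_buf_end + num > pgt_buf_top || !can_use_brk_pgt) {
                      /* placeholder for the memblock fallback in the real code */
                      pfn = alloc_from_memblock(num);
              } else {
                      pfn = pgt_buf_end;
                      pgt_buf_end += num;
                      /* the new printout: BRK allocations now show up in dmesg */
                      printk(KERN_DEBUG "BRK [%#010lx, %#010lx] PGTABLE\n",
                             pfn << PAGE_SHIFT,
                             (pgt_buf_end << PAGE_SHIFT) - 1);
              }

              /* ... clear the pages ... */
              return __va(pfn << PAGE_SHIFT);
      }

      /* In init_range_memory_mapping(), before mapping [start, end): only use
       * the BRK buffer when the range does not overlap it. */
      can_use_brk_pgt = max(start, (u64)pgt_buf_end << PAGE_SHIFT) >=
                        min(end,   (u64)pgt_buf_top << PAGE_SHIFT);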
  12. 18 November 2012, 25 commits