1. 04 Jul 2013, 5 commits
    • mm/x86: prepare for removing num_physpages and simplify mem_init() · 46a84132
      Authored by Jiang Liu
      Prepare for removing num_physpages and simplify mem_init().
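      A hedged sketch of the direction this prepares for (illustrative only;
      the helper below mirrors the get_num_physpages() style introduced later
      in the same series, and the name example_num_physpages() is made up):

      #include <linux/mmzone.h>

      /* Illustrative only: derive the physical page count from per-node
       * data instead of reading the global num_physpages. */
      static unsigned long example_num_physpages(void)
      {
              int nid;
              unsigned long phys_pages = 0;

              for_each_online_node(nid)
                      phys_pages += node_present_pages(nid);

              return phys_pages;
      }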
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      46a84132
    • mm: concentrate modification of totalram_pages into the mm core · 0c988534
      Authored by Jiang Liu
      Concentrate the code that modifies totalram_pages in the mm core, so
      that arch memory initialization code no longer needs to take care of
      it.  With these changes applied, only the following functions from the
      mm core modify the global variable totalram_pages:
      free_bootmem_late(), free_all_bootmem(), free_all_bootmem_node() and
      adjust_managed_page_count().
      
      With this patch applied, it becomes much easier to keep totalram_pages
      and zone->managed_pages consistent.
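      As an illustration of what this buys arch code (a hedged sketch, not
      taken from the patch; example_mem_init() and the exact call order are
      assumptions), an arch mem_init() can now simply hand its pages to the
      buddy allocator and leave the accounting to the common helpers:

      void __init example_mem_init(void)
      {
              /* hand all low memory to the buddy allocator; after this
               * series free_all_bootmem() itself updates totalram_pages */
              free_all_bootmem();

              /* 32-bit highmem release; accounting again happens in the
               * common mm helpers rather than in arch code */
              set_highmem_pages_init();

              after_bootmem = 1;
      }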
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: <sworddragon2@aol.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0c988534
    • mm: make __free_pages_bootmem() only available at boot time · 170a5a7e
      Authored by Jiang Liu
      In order to simplify management of totalram_pages and
      zone->managed_pages, make __free_pages_bootmem() available only at
      boot time.  With this change applied, __free_pages_bootmem() is used
      only by bootmem.c and nobootmem.c at boot time, so mark it as __init.
      Other callers of __free_pages_bootmem() have been converted to use
      free_reserved_page(), which handles totalram_pages and
      zone->managed_pages in a safer way.
      
      This patch also fixes a bug in free_pagetable() for x86_64, which
      should increase zone->managed_pages instead of zone->present_pages
      when freeing reserved pages.
      
      Now that managed_pages_count_lock protects totalram_pages and
      zone->managed_pages, the redundant ppb_lock in put_page_bootmem() can
      be removed.  This greatly simplifies the locking rules.
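      For reference, a hedged approximation of what free_reserved_page()
      does for those converted callers (simplified; not the exact upstream
      body):

      static inline void free_reserved_page(struct page *page)
      {
              /* return one boot-reserved page to the buddy allocator ... */
              ClearPageReserved(page);
              init_page_count(page);
              __free_page(page);
              /* ... and keep totalram_pages and zone->managed_pages
               * consistent via the mm core accounting helper */
              adjust_managed_page_count(page, 1);
      }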
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: <sworddragon2@aol.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      170a5a7e
    • mm: accurately calculate zone->managed_pages for highmem zones · 7b4b2a0d
      Authored by Jiang Liu
      Commit "mm: introduce new field 'managed_pages' to struct zone" assumes
      that all highmem pages will be freed into the buddy system by function
      mem_init().  But that's not always true, some architectures may reserve
      some highmem pages during boot.  For example PPC may allocate highmem
      pages for giagant HugeTLB pages, and several architectures have code to
      check PageReserved flag to exclude highmem pages allocated during boot
      when freeing highmem pages into the buddy system.
      
      So treat highmem pages in the same way as normal pages, that is to:
      1) reset zone->managed_pages to zero in mem_init().
      2) recalculate managed_pages when freeing pages into the buddy system.
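      A hedged sketch of step 2), modeled on the free_highmem_page() helper
      of that era (the field names follow the 3.10-era struct zone and
      should be treated as an approximation):

      void free_highmem_page(struct page *page)
      {
              ClearPageReserved(page);
              init_page_count(page);
              __free_page(page);
              /* count the page only when it really reaches the buddy
               * system, instead of assuming mem_init() freed every
               * highmem page during boot */
              totalram_pages++;
              page_zone(page)->managed_pages++;
              totalhigh_pages++;
      }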
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: <sworddragon2@aol.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b4b2a0d
    • mm/x86: use free_reserved_area() to simplify code · c88442ec
      Authored by Jiang Liu
      Use the common helper function free_reserved_area() to simplify the code.
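      A hedged sketch of the kind of call site this enables (illustrative;
      the free_reserved_area() signature changed within this same series, so
      treat the exact arguments as an assumption):

      #ifdef CONFIG_BLK_DEV_INITRD
      void free_initrd_mem(unsigned long start, unsigned long end)
      {
              /* one common helper instead of an open-coded loop over the
               * reserved pages; -1 means "do not poison" */
              free_reserved_area((void *)start, (void *)end, -1, "initrd");
      }
      #endif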
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: <sworddragon2@aol.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c88442ec
  2. 28 Jun 2013, 1 commit
  3. 14 Jun 2013, 2 commits
  4. 01 Jun 2013, 1 commit
    • x86: Fix adjust_range_size_mask calling position · 7de3d66b
      Authored by Yinghai Lu
      Commit
      
          8d57470d x86, mm: setup page table in top-down
      
      causes a kernel panic when setting mem=2G:
      
           [mem 0x00000000-0x000fffff] page 4k
           [mem 0x7fe00000-0x7fffffff] page 1G
           [mem 0x7c000000-0x7fdfffff] page 1G
           [mem 0x00100000-0x001fffff] page 4k
           [mem 0x00200000-0x7bffffff] page 2M
      
      The last entry is not what we want; we should have
           [mem 0x00200000-0x3fffffff] page 2M
           [mem 0x40000000-0x7bffffff] page 1G
      
      We actually merge continuous ranges with the same page size too early.
      In this case, before merging we have
           [mem 0x00200000-0x3fffffff] page 2M
           [mem 0x40000000-0x7bffffff] page 2M
      and after merging them we get
           [mem 0x00200000-0x7bffffff] page 2M
      even though we could use a 1G page to map
           [mem 0x40000000-0x7bffffff]
      
      That causes a problem, because we have already mapped
           [mem 0x7fe00000-0x7fffffff] page 1G
           [mem 0x7c000000-0x7fdfffff] page 1G
      with 1G pages, i.e. [0x40000000-0x7fffffff] is already mapped with a
      1G page.  During phys_pud_init() for [0x40000000-0x7bffffff], it will
      not reuse that existing pud page; it allocates a new one and then
      tries to map the range with 2M pages instead, as page_size_mask does
      not include PG_LEVEL_1G.  In the end [0x7c000000-0x7fffffff] is left
      unmapped, because the loop in phys_pmd_init() stops mapping at
      0x7bffffff.
      
      That is the right behavior: it maps exactly the range, with exactly
      the page size, that we ask for, and we would have to explicitly ask
      it to map [0x7c000000-0x7fffffff] before or after mapping
      [0x40000000-0x7bffffff].  In any case we need to make sure each
      range's page_size_mask is correct and consistent after
      split_mem_range().
      
      Fix this by calling adjust_range_size_mask() before merging ranges
      with the same page size.
      
      -v2: update the changelog.
      -v3: add more explanation of why [0x7c000000-0x7fffffff] is not mapped
          and why that causes the panic.
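      To make the ordering point concrete, a hedged userspace-style toy
      model (the adjustment rule below is invented for the example and is
      NOT the kernel's adjust_range_size_mask() logic):

      #include <stdio.h>

      struct range { unsigned long start, end, mask; }; /* mask: 2 = 2M, 4 = 1G */

      /* toy rule: keep the 1G bit only when both ends are 1G-aligned */
      static void toy_adjust(struct range *r)
      {
              if ((r->start | r->end) & ((1UL << 30) - 1))
                      r->mask &= ~4UL;
      }

      static int can_merge(const struct range *a, const struct range *b)
      {
              return a->end == b->start && a->mask == b->mask;
      }

      int main(void)
      {
              struct range a = { 0x00200000, 0x40000000, 4 | 2 };
              struct range b = { 0x40000000, 0x80000000, 4 | 2 };

              /* adjust first, then merge: a drops the 1G bit, b keeps it,
               * so they stay separate and b can still use 1G pages.
               * Merging first would have produced one big 2M-only range. */
              toy_adjust(&a);
              toy_adjust(&b);
              printf("mergeable after adjust: %d\n", can_merge(&a, &b));
              return 0;
      }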
      Bisected-by: N"Xie, ChanglongX" <changlongx.xie@intel.com>
      Bisected-by: NYuanhan Liu <yuanhan.liu@linux.intel.com>
      Reported-and-tested-by: NYuanhan Liu <yuanhan.liu@linux.intel.com>
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/1370015587-20835-1-git-send-email-yinghai@kernel.org
      Cc: <stable@vger.kernel.org> v3.9
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      7de3d66b
  5. 28 May 2013, 1 commit
  6. 10 May 2013, 1 commit
  7. 30 Apr 2013, 11 commits
  8. 15 Apr 2013, 1 commit
  9. 13 Apr 2013, 1 commit
    • x86-32: Fix possible incomplete TLB invalidate with PAE pagetables · 1de14c3c
      Authored by Dave Hansen
      This patch attempts to fix:
      
      	https://bugzilla.kernel.org/show_bug.cgi?id=56461
      
      The symptom is a crash and messages like this:
      
      	chrome: Corrupted page table at address 34a03000
      	*pdpt = 0000000000000000 *pde = 0000000000000000
      	Bad pagetable: 000f [#1] PREEMPT SMP
      
      Ingo guesses this got introduced by commit 611ae8e3 ("x86/tlb:
      enable tlb flush range support for x86") since that code started to free
      unused pagetables.
      
      On x86-32 PAE kernels, that new code has the potential to free an entire
      PMD page and will clear one of the four page-directory-pointer-table
      (aka pgd_t) entries.
      
      The hardware aggressively "caches" these top-level entries and invlpg
      does not actually affect the CPU's copy.  If we clear one we *HAVE* to
      do a full TLB flush, otherwise we might continue using a freed pmd page.
      (note, we do this properly on the population side in pud_populate()).
      
      This patch tracks whenever we clear one of these entries in the 'struct
      mmu_gather', and ensures that we follow up with a full tlb flush.
      
      BTW, I disassembled and checked that:
      
      	if (tlb->fullmm == 0)
      and
      	if (!tlb->fullmm && !tlb->need_flush_all)
      
      generate essentially the same code, so there should be zero impact there
      to the !PAE case.
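      A hedged sketch of the mechanism described above, reconstructed from
      this message (the exact hunks in the patch may differ):

      /* When a PMD page is freed on PAE, one of the four PDPT entries is
       * cleared; invlpg does not invalidate the CPU's cached copy, so mark
       * the gather to force a full flush later. */
      void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
      {
              struct page *page = virt_to_page(pmd);

              paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
      #ifdef CONFIG_X86_PAE
              /* clearing a PDPT entry: invlpg is not enough */
              tlb->need_flush_all = 1;
      #endif
              tlb_remove_page(tlb, page);
      }

      The arch tlb_flush() side then falls back to a full flush_tlb_mm_range()
      whenever need_flush_all is set, which is the fullmm-style check quoted
      above.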
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Artem S Tashkinov <t.artem@mailcity.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1de14c3c
  10. 12 Apr 2013, 2 commits
  11. 11 Apr 2013, 3 commits
  12. 10 Apr 2013, 1 commit
  13. 03 Apr 2013, 1 commit
  14. 08 Mar 2013, 3 commits
    • x86: Do not try to sync identity map for non-mapped pages · 60f583d5
      Authored by Dave Hansen
      kernel_map_sync_memtype() is called from a variety of contexts.  The
      pat.c code that calls it seems to ensure that it is not called for
      non-ram areas by checking via pat_pagerange_is_ram().  It is important
      that it only be called on the actual identity map because there *IS*
      no map to sync for highmem pages, or for memory holes.
      
      The ioremap.c uses are not as careful as those from pat.c, and call
      kernel_map_sync_memtype() on PCI space which is in the middle of the
      kernel identity map _range_, but is not actually mapped.
      
      This patch adds a check to kernel_map_sync_memtype() which probably
      duplicates some of the checks already in pat.c.  But, it is necessary
      for the ioremap.c uses and shouldn't hurt other callers.
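      A hedged sketch of the kind of guard described, assuming page_is_ram()
      as the predicate (the exact check in the patch may differ):

      int kernel_map_sync_memtype(u64 base, unsigned long size, unsigned long flags)
      {
              if (base > __pa(high_memory - 1))
                      return 0;

              /*
               * PCI space and memory holes sit inside the identity-map
               * *range* but carry no identity mapping, so there is
               * nothing to sync for them.
               */
              if (!page_is_ram(base >> PAGE_SHIFT))
                      return 0;

              /* [..] sync the identity-map attributes as before */
              return 0;
      }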
      
      I have reproduced this bug and this patch fixes it for me and the
      original bug reporter:
      
      	https://lkml.org/lkml/2013/2/5/396
      Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/20130307163151.D9B58C4E@kernel.stglabs.ibm.com
      Signed-off-by: Dave Hansen <dave@sr71.net>
      Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      60f583d5
    • context_tracking: Restore correct previous context state on exception exit · 6c1e0256
      Authored by Frederic Weisbecker
      On exception exit, we restore the previous context tracking state based on
      the regs of the interrupted frame. Iff that frame is in user mode as
      stated by user_mode() helper, we restore the context tracking user mode.
      
      However there is a tiny chunk of low-level arch code after we pass through
      user_enter() and until the CPU eventually resumes userspace.
      If an exception happens in this tiny area, exception_enter() correctly
      exits the context tracking user mode but exception_exit() won't restore
      it because of the value returned by user_mode(regs).
      
      As a result we may return to userspace with the wrong context tracking
      state.
      
      To fix this, change exception_enter() to return the context tracking state
      prior to its call and pass this saved state to exception_exit(). This restores
      the real context tracking state of the interrupted frame.
      
      (Maybe this patch was suggested to me; I don't recall exactly. If so,
      sorry for the missing credit.)
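      A hedged sketch of the resulting call pattern in an exception handler
      (the handler itself is hypothetical; exception_enter()/exception_exit()
      are the interfaces this patch changes):

      #include <linux/context_tracking.h>

      void do_example_trap(struct pt_regs *regs, long error_code)
      {
              /* remember the tracking state that was live before the trap */
              enum ctx_state prev_state = exception_enter();

              /* [..] handle the exception */

              /* restore exactly that state instead of guessing from
               * user_mode(regs) */
              exception_exit(prev_state);
      }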
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Mats Liljegren <mats.liljegren@enea.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      6c1e0256
    • context_tracking: Move exception handling to generic code · 56dd9470
      Authored by Frederic Weisbecker
      Exception handling in context tracking should share common
      treatment: on entry we exit user mode if the exception triggered
      in that context. Then on exception exit we return to that previous
      context.
      
      Generalize this to avoid duplication across archs.
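      A hedged sketch of the shared helpers in this regs-based form
      (simplified; the commit listed above later replaces the
      user_mode(regs) check with an explicitly saved state):

      static inline void exception_enter(struct pt_regs *regs)
      {
              /* leave context tracking user mode if we were in it */
              user_exit();
      }

      static inline void exception_exit(struct pt_regs *regs)
      {
              /* return to user-mode tracking when the interrupted frame
               * appears to be a user frame */
              if (user_mode(regs))
                      user_enter();
      }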
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Mats Liljegren <mats.liljegren@enea.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      56dd9470
  15. 07 Mar 2013, 1 commit
  16. 03 Mar 2013, 1 commit
    • x86, ACPI, mm: Revert movablemem_map support · 20e6926d
      Authored by Yinghai Lu
      Tim found:
      
        WARNING: at arch/x86/kernel/smpboot.c:324 topology_sane.isra.2+0x6f/0x80()
        Hardware name: S2600CP
        sched: CPU #1's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
        smpboot: Booting Node   1, Processors  #1
        Modules linked in:
        Pid: 0, comm: swapper/1 Not tainted 3.9.0-0-generic #1
        Call Trace:
          set_cpu_sibling_map+0x279/0x449
          start_secondary+0x11d/0x1e5
      
      Don Morris reproduced it on an HP z620 workstation and bisected it to
      commit e8d19552 ("acpi, memory-hotplug: parse SRAT before memblock
      is ready").
      
      It turns out movable_map has some problems, and it breaks several things:
      
      1. numa_init() is called several times, NOT just for SRAT, so the
      	nodes_clear(numa_nodes_parsed)
      	memset(&numa_meminfo, 0, sizeof(numa_meminfo))
         calls cannot simply be removed.  The sequence numaq, srat, amd,
         dummy needs to be taken into account, and the fallback path has to
         keep working.
      
      2. acpi_numa_init() was simply split into early_parse_srat():
         a. early_parse_srat() is NOT called for ia64, so ia64 is broken.
         b. the loop
      	for (i = 0; i < MAX_LOCAL_APIC; i++)
      	     set_apicid_to_node(i, NUMA_NO_NODE)
            is still left in numa_init(), so it just clears the result from
            early_parse_srat(); it should be moved before that call.
         c. it breaks ACPI_TABLE_OVERIDE, as the ACPI table scan is moved
            early, before the override from the initrd is settled.
      
      3. the patch title is totally misleading: there is NO x86 in the title,
         but it changes critical x86 code.  That caused the x86 people to not
         pay attention and find the problem early.  Those patches really
         should have been routed via tip/x86/mm.
      
      4. after that commit, the following ranges cannot use movable RAM:
        a. real_mode code... well, funny, could legacy Node0 [0,1M) really be
           hot-removed?
        b. initrd... it will be freed after booting, so it could be on
           movable RAM...
        c. crashkernel for kdump: it looks like we cannot put the kdump
           kernel above 4G anymore.
        d. init_mem_mapping: we cannot put the page table high anymore.
        e. initmem_init: vmemmap cannot be high on the local node anymore.
           That is not good.
      
      If a node is hotpluggable, memory-related ranges like the page table
      and vmemmap could be on that node without problems, and should be on
      that node.
      
      We have workaround patches that could fix some of the problems, but
      some cannot be fixed.
      
      So just remove that offending commit and related ones including:
      
       f7210e6c ("mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to
          protect movablecore_map in memblock_overlaps_region().")
      
       01a178a9 ("acpi, memory-hotplug: support getting hotplug info from
          SRAT")
      
       27168d38 ("acpi, memory-hotplug: extend movablemem_map ranges to
          the end of node")
      
       e8d19552 ("acpi, memory-hotplug: parse SRAT before memblock is
          ready")
      
       fb06bc8e ("page_alloc: bootmem limit with movablecore_map")
      
       42f47e27 ("page_alloc: make movablemem_map have higher priority")
      
       6981ec31 ("page_alloc: introduce zone_movable_limit[] to keep
          movable limit for nodes")
      
       34b71f1e ("page_alloc: add movable_memmap kernel parameter")
      
       4d59a751 ("x86: get pg_data_t's memory from other node")
      
      Later we should have patches that make sure the kernel puts the page
      table and vmemmap on local node RAM instead of pushing them down to
      node 0.  We also need to find a way to put other kernel-used RAM on
      the local node.
      Reported-by: Tim Gardner <tim.gardner@canonical.com>
      Reported-by: Don Morris <don.morris@hp.com>
      Bisected-by: Don Morris <don.morris@hp.com>
      Tested-by: Don Morris <don.morris@hp.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20e6926d
  17. 24 Feb 2013, 4 commits
    • x86/mm/pageattr: Prevent PSE and GLOABL leftovers to confuse pmd/pte_present and pmd_huge · a8aed3e0
      Authored by Andrea Arcangeli
      Without this patch, any kernel code that reads kernel memory through
      non-present kernel ptes/pmds (as set by pageattr.c) will crash.
      
      With this kernel code:
      
      static struct page *crash_page;
      static unsigned long *crash_address;
      [..]
      	crash_page = alloc_pages(GFP_KERNEL, 9);
      	crash_address = page_address(crash_page);
      	if (set_memory_np((unsigned long)crash_address, 1))
      		printk("set_memory_np failure\n");
      [..]
      
      The kernel will crash if inside the "crash tool" one would try
      to read the memory at the not present address.
      
      crash> p crash_address
      crash_address = $8 = (long unsigned int *) 0xffff88023c000000
      crash> rd 0xffff88023c000000
      [ *lockup* ]
      
      The lockup happens because _PAGE_GLOBAL and _PAGE_PROTNONE share the
      same bit, and pageattr leaves _PAGE_GLOBAL set on a kernel pte, which
      is then mistaken for _PAGE_PROTNONE (so pte_present returns true by
      mistake and the kernel page fault handler then gets confused and
      loops).
      
      With THP the same can happen after we taught pmd_present to
      check _PAGE_PROTNONE and _PAGE_PSE in commit
      027ef6c8 ("mm: thp: fix pmd_present for
      split_huge_page and PROT_NONE with THP").  THP has the same
      problem with _PAGE_GLOBAL as the 4k pages, but it also has a
      problem with _PAGE_PSE, which must be cleared too.
      
      After the patch is applied, copy_user correctly returns -EFAULT and
      doesn't lock up anymore.
      
      crash> p crash_address
      crash_address = $9 = (long unsigned int *) 0xffff88023c000000
      crash> rd 0xffff88023c000000
      rd: read error: kernel virtual address: ffff88023c000000  type:
      "64-bit KVADDR"
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a8aed3e0
    • Revert "x86, mm: Make spurious_fault check explicitly check the PRESENT bit" · 954f8571
      Authored by Andrea Arcangeli
      I got a report of a minor regression introduced by commit
      027ef6c8 ("mm: thp: fix pmd_present for split_huge_page and
      PROT_NONE with THP").
      
      So the problem is that pageattr creates kernel pagetables (ptes and
      pmds) that break pte_present/pmd_present, and the patch above exposed
      this invariant breakage for pmd_present.
      
      The same problem already existed for ptes and pte_present, and it was
      fixed by commit 660a293e ("x86, mm: Make spurious_fault check
      explicitly check the PRESENT bit") (if it weren't for that commit,
      this wouldn't even be a regression).  That fix avoids using
      pte_present in the page fault path.  I could follow through by also
      stopping the use of pmd_present/pmd_huge.
      
      However I think it's more robust to fix pageattr and to clear the
      PSE/GLOBAL bit flags in addition to the present bit flag.  That way
      the kernel page fault handler can keep using the regular
      pte_present/pmd_present/pmd_huge.
      
      The confusion arises because _PAGE_GLOBAL and _PAGE_PROTNONE share the
      same bit, and in the pmd case we expect _PAGE_PSE to be set only in
      present pmds (to facilitate split_huge_page's final TLB flush).
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      954f8571
    • x86/mm/numa: Don't check if node is NUMA_NO_NODE · 942670d0
      Authored by Wen Congyang
      If we aren't debugging per_cpu maps, the cpu's node is stored in the
      per_cpu variable numa_node.  If `node' is NUMA_NO_NODE, it means the
      caller wants to clear the cpu's node.  So we should also call
      set_cpu_numa_node() in this case.
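      A hedged sketch of the change described, reconstructed from this
      message rather than quoted from the diff:

      void numa_set_node(int cpu, int node)
      {
              /* [..] early-boot and CONFIG_DEBUG_PER_CPU_MAPS handling as before */
              per_cpu(x86_cpu_to_node_map, cpu) = node;

              /*
               * Call this even when node == NUMA_NO_NODE: the caller is
               * asking to clear the cpu's node, and without the debug maps
               * the only copy lives in the per_cpu numa_node variable.
               */
              set_cpu_numa_node(cpu, node);
      }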
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      942670d0
    • acpi, memory-hotplug: support getting hotplug info from SRAT · 01a178a9
      Authored by Tang Chen
      We now provide an option for users who don't want to specify a
      physical memory address on the kernel command line.
      
               /*
                * For movablemem_map=acpi:
                *
                * SRAT:                |_____| |_____| |_________| |_________| ......
                * node id:                0       1         1           2
                * hotpluggable:           n       y         y           n
                * movablemem_map:              |_____| |_________|
                *
                * Using movablemem_map, we can prevent memblock from allocating memory
                * on ZONE_MOVABLE at boot time.
                */
      
      So the user just specifies movablemem_map=acpi, and the kernel will
      use the hotpluggable info in the SRAT to determine which memory ranges
      should be set as ZONE_MOVABLE.
      
      If all the memory ranges in the SRAT are hotpluggable, then no memory
      can be used by the kernel.  But before parsing the SRAT, memblock has
      already reserved some memory ranges for other purposes, such as the
      kernel image.  We cannot prevent the kernel from using this memory, so
      we need to exclude these ranges even if the memory is hotpluggable.
      
      Furthermore, there could be several memory ranges in the single node
      the kernel resides in.  We may skip one range that has memory reserved
      by memblock, but if the rest of the memory is too small, the kernel
      will fail to boot.  So, make the whole node the kernel resides in
      un-hotpluggable; then the kernel has enough memory to use.
      
      NOTE: This approach costs NUMA performance, because the whole node
            will be set as ZONE_MOVABLE and the kernel cannot use memory on
            it.  Users who don't want to lose NUMA performance should
            simply not use it.
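      A usage note (hedged): with this option the administrator simply adds

      	movablemem_map=acpi

      to the kernel command line instead of spelling out physical ranges by
      hand with the explicit-range variant introduced earlier in the same
      series.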
      
      [akpm@linux-foundation.org: fix warning]
      [akpm@linux-foundation.org: use strcmp()]
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Wu Jianguo <wujianguo@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: "Brown, Len" <len.brown@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01a178a9