1. 05 Jun, 2014 5 commits
    • hugetlb: add support for gigantic page allocation at runtime · 944d9fec
      Luiz Capitulino committed
      HugeTLB is limited to allocating hugepages whose size is less than
      MAX_ORDER order.  This is so because HugeTLB allocates hugepages via the
      buddy allocator.  Gigantic pages (that is, pages whose size is greater
      than MAX_ORDER order) have to be allocated at boottime.
      
      However, boottime allocation has at least two serious problems.  First,
      it doesn't support NUMA and second, gigantic pages allocated at boottime
      can't be freed.
      
      This commit solves both issues by adding support for allocating gigantic
      pages during runtime.  It works just like regular sized hugepages,
      meaning that the interface in sysfs is the same, it supports NUMA, and
      gigantic pages can be freed.
      
      For example, on x86_64 gigantic pages are 1GB big. To allocate two 1G
      gigantic pages on node 1, one can do:
      
       # echo 2 > \
         /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
      
      And to free them all:
      
       # echo 0 > \
         /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
      
      The one problem with gigantic page allocation at runtime is that it
      can't be serviced by the buddy allocator.  To overcome that problem,
      this commit scans all zones from a node looking for a large enough
      contiguous region.  When one is found, it's allocated by using CMA, that
      is, we call alloc_contig_range() to do the actual allocation.  For
      example, on x86_64 we scan all zones looking for a 1GB contiguous
      region.  When one is found, it's allocated by alloc_contig_range().
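
The scan described above can be sketched as a small userspace simulation. This is not the kernel code from the commit: `struct sim_zone`, `sim_range_is_free()` and `sim_find_gigantic_range()` are hypothetical names, and a plain boolean array stands in for the page-frame state that alloc_contig_range() would actually inspect and isolate. The sketch only shows the idea of walking a zone in gigantic-page-sized, aligned steps until a fully free run is found.

```c
#include <stdbool.h>
#include <stddef.h>

#define SIM_NO_RANGE ((size_t)-1)

/* Hypothetical stand-in for a memory zone: one flag per page frame. */
struct sim_zone {
    const bool *in_use;  /* in_use[i] describes frame start_pfn + i */
    size_t start_pfn;    /* first page frame number in the zone */
    size_t nr_pages;     /* number of frames in the zone */
};

/* Return true if every frame in [pfn, pfn + nr) is free. */
static bool sim_range_is_free(const struct sim_zone *z, size_t pfn, size_t nr)
{
    for (size_t i = 0; i < nr; i++)
        if (z->in_use[pfn - z->start_pfn + i])
            return false;
    return true;
}

/*
 * Walk the zone in steps of nr frames, starting at the first pfn that is a
 * multiple of nr, and return the first candidate that is entirely free.
 * Returns SIM_NO_RANGE when the zone holds no such run.
 */
static size_t sim_find_gigantic_range(const struct sim_zone *z, size_t nr)
{
    size_t end = z->start_pfn + z->nr_pages;
    size_t pfn = (z->start_pfn + nr - 1) / nr * nr;  /* round up to alignment */

    for (; pfn + nr <= end; pfn += nr)
        if (sim_range_is_free(z, pfn, nr))
            return pfn;
    return SIM_NO_RANGE;
}
```

In a caller, each zone of the chosen node would be tried in turn, and the first zone that yields a range other than `SIM_NO_RANGE` would supply the page. On x86_64 with 4 kB base pages, `nr` would be 262144 frames for a 1 GB page.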
      
      One expected issue with that approach is that such gigantic contiguous
      regions tend to vanish as runtime goes by.  The best way to avoid this
      for now is to make gigantic page allocations very early during system
      boot, say from an init script.  Other possible optimizations include using
      compaction, which is supported by CMA but is not explicitly used by this
      commit.
      
      It's also important to note the following:
      
       1. Gigantic pages allocated at boottime by the hugepages= command-line
          option can be freed at runtime just fine
      
       2. This commit adds support for gigantic pages only to x86_64. The
          reason is that I don't have access to nor experience with other archs.
          The code is arch independent though, so it should be simple to add
          support to different archs
      
       3. I didn't add support for hugepage overcommit, that is, allocating
          a gigantic page on demand when
          /proc/sys/vm/nr_overcommit_hugepages > 0. The reason is that I don't
          think it's reasonable to do the hard and long work required for
          allocating a gigantic page at fault time. But it should be simple
          to add this if wanted
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: move helpers up in the file · 1cac6f2c
      Luiz Capitulino committed
      The next commit will add new code that calls the
      for_each_node_mask_to_alloc() macro.  Move it, its buddy
      for_each_node_mask_to_free() and their dependencies up in the file so the
      new code can use them.  This is just code movement, no logic change.
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: update_and_free_page(): don't clear PG_reserved bit · a7407a27
      Luiz Capitulino committed
      Hugepages never get the PG_reserved bit set, so don't clear it.
      
      However, note that if the bit gets mistakenly set, free_pages_check() will
      catch it.
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: add hstate_is_gigantic() · bae7f4ae
      Luiz Capitulino committed
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
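
The hstate_is_gigantic() commit above carries no description, so here is a minimal sketch of what such a predicate plausibly checks, based on the definition of gigantic pages given elsewhere in this series (page orders the buddy allocator cannot serve). Everything here is an assumption for illustration: `struct sim_hstate`, `sim_hstate_is_gigantic()` and the `SIM_MAX_ORDER` value of 11 are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>

/* Assumed buddy-allocator limit; the real value is configuration dependent. */
#define SIM_MAX_ORDER 11

/* Hypothetical stand-in for the kernel's per-size hugepage state. */
struct sim_hstate {
    unsigned int order;  /* log2 of the number of base pages per hugepage */
};

/* A hugepage size is "gigantic" when the buddy allocator cannot serve it. */
static bool sim_hstate_is_gigantic(const struct sim_hstate *h)
{
    return h->order >= SIM_MAX_ORDER;
}
```

With 4 kB base pages, a 2 MB hugepage has order 9 and a 1 GB hugepage has order 18, so under the assumed limit only the latter is classified as gigantic.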
    • hugetlb: prep_compound_gigantic_page(): drop __init marker · 2906dd52
      Luiz Capitulino committed
      The HugeTLB subsystem uses the buddy allocator to allocate hugepages
      during runtime.  This means that hugepages allocation during runtime is
      limited to MAX_ORDER order.  For archs supporting gigantic pages (that
      is, page sizes greater than MAX_ORDER), this in turn means that those
      pages can't be allocated at runtime.
      
      HugeTLB supports gigantic page allocation during boottime, via the boot
      allocator.  To this end the kernel provides the command-line options
      hugepagesz= and hugepages=, which can be used to instruct the kernel to
      allocate N gigantic pages during boot.
      
      For example, x86_64 supports 2M and 1G hugepages, but only 2M hugepages
      can be allocated and freed at runtime.  If one wants to allocate 1G
      gigantic pages, this has to be done at boot via the hugepagesz= and
      hugepages= command-line options.
      
      Now, gigantic page allocation at boottime has two serious problems:
      
       1. Boottime allocation is not NUMA aware. On a NUMA machine the kernel
          evenly distributes boottime allocated hugepages among nodes.
      
          For example, suppose you have a four-node NUMA machine and want
          to allocate four 1G gigantic pages at boottime. The kernel will
          allocate one gigantic page per node.
      
          On the other hand, we do have users who want to be able to specify
          which NUMA node gigantic pages should be allocated from, so that
          they can place virtual machines on a specific NUMA node.
      
       2. Gigantic pages allocated at boottime can't be freed
      
      At this point it's important to observe that regular hugepages allocated
      at runtime don't have those problems.  This is so because the HugeTLB
      interface for runtime allocation in sysfs supports NUMA and runtime
      allocated pages can be freed just fine via the buddy allocator.
      
      This series adds support for allocating gigantic pages at runtime.  It
      does so by allocating gigantic pages via CMA instead of the buddy
      allocator.  Releasing gigantic pages is also supported via CMA.  As this
      series builds on top of the existing HugeTLB interface, it makes gigantic
      page allocation and releasing just like regular sized hugepages.  This
      also means that NUMA support just works.
      
      For example, to allocate two 1G gigantic pages on node 1, one can do:
      
       # echo 2 > \
         /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
      
      And, to release all gigantic pages on the same node:
      
       # echo 0 > \
         /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
      
      Please, refer to patch 5/5 for full technical details.
      
      Finally, please note that this series is a follow up for a previous series
      that tried to extend the command-line options set to be NUMA aware:
      
       http://marc.info/?l=linux-mm&m=139593335312191&w=2
      
      During the discussion of that series it was agreed that having runtime
      allocation support for gigantic pages was a better solution.
      
      This patch (of 5):
      
      This function is going to be used by non-init code in a future
      commit.
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Reviewed-by: Davidlohr Bueso <davidlohr@hp.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 07 May, 2014 1 commit
    • hugetlb: ensure hugepage access is denied if hugepages are not supported · 457c1b27
      Nishanth Aravamudan committed
      Currently, I am seeing the following when I `mount -t hugetlbfs /none
      /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`.  I think it's
      related to the fact that hugetlbfs is probably not correctly setting
      itself up in this state?:
      
        Unable to handle kernel paging request for data at address 0x00000031
        Faulting instruction address: 0xc000000000245710
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        ....
      
      In KVM guests on Power, in a guest not backed by hugepages, we see the
      following:
      
        AnonHugePages:         0 kB
        HugePages_Total:       0
        HugePages_Free:        0
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:         64 kB
      
      HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
      are not supported at boot-time, but this is only checked in
      hugetlb_init().  Extract the check to a helper function, and use it in a
      few relevant places.
      
      This does make hugetlbfs not supported (not registered at all) in this
      environment.  I believe this is fine, as there are no valid hugepages
      and that won't change at runtime.
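
The "extract the check to a helper" step can be illustrated with a minimal sketch. It is a userspace stand-in, not the patch itself: `sim_hugepages_supported()` and `sim_register_hugetlbfs()` are hypothetical names, the shift is passed as a parameter instead of being read from HPAGE_SHIFT, and the `-1` return value is an arbitrary placeholder for whatever error the real code reports.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: a hugepage shift of 0 means the platform offers no
 * usable hugepage size, as in the Power KVM guest described above.
 */
static bool sim_hugepages_supported(unsigned int hpage_shift)
{
    return hpage_shift != 0;
}

/*
 * Hypothetical registration path: every entry point asks the same helper
 * instead of repeating the comparison, and refuses to register the
 * filesystem when hugepages are unavailable.
 */
static int sim_register_hugetlbfs(unsigned int hpage_shift)
{
    if (!sim_hugepages_supported(hpage_shift))
        return -1;  /* placeholder error code: nothing gets registered */
    return 0;       /* registration would proceed here */
}
```

Centralizing the test in one helper means that later callers, such as mount or sysctl handlers, cannot forget the check that was previously made only in hugetlb_init().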
      
      [akpm@linux-foundation.org: use pr_info(), per Mel]
      [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
      Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 19 Apr, 2014 1 commit
  4. 08 Apr, 2014 5 commits
  5. 04 Apr, 2014 8 commits
  6. 24 Jan, 2014 1 commit
  7. 22 Jan, 2014 5 commits
  8. 22 Nov, 2013 2 commits
    • mm: hugetlbfs: fix hugetlbfs optimization · 27c73ae7
      Andrea Arcangeli committed
      Commit 7cb2ef56 ("mm: fix aio performance regression for database
      caused by THP") can cause dereference of a dangling pointer if
      split_huge_page runs during PageHuge() if there are updates to the
      tail_page->private field.
      
      Also it is repeating compound_head twice for hugetlbfs and it is running
      compound_head+compound_trans_head for THP when a single one is needed in
      both cases.
      
      The new code within the PageSlab() check doesn't need to verify that the
      THP page size is never bigger than the smallest hugetlbfs page size, to
      avoid memory corruption.
      
      A longstanding theoretical race condition was found while fixing the
      above (see the change right after the skip_unlock label, that is
      relevant for the compound_lock path too).
      
      By re-establishing the _mapcount tail refcounting for all compound
      pages, this also fixes the below problem:
      
        echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
      
        BUG: Bad page state in process bash  pfn:59a01
        page:ffffea000139b038 count:0 mapcount:10 mapping:          (null) index:0x0
        page flags: 0x1c00000000008000(tail)
        Modules linked in:
        CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x55/0x76
          bad_page+0xd5/0x130
          free_pages_prepare+0x213/0x280
          __free_pages+0x36/0x80
          update_and_free_page+0xc1/0xd0
          free_pool_huge_page+0xc2/0xe0
          set_max_huge_pages.part.58+0x14c/0x220
          nr_hugepages_store_common.isra.60+0xd0/0xf0
          nr_hugepages_store+0x13/0x20
          kobj_attr_store+0xf/0x20
          sysfs_write_file+0x189/0x1e0
          vfs_write+0xc5/0x1f0
          SyS_write+0x55/0xb0
          system_call_fastpath+0x16/0x1b
      Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: give transparent hugepage code a separate copy_page · 30b0a105
      Dave Hansen committed
      Right now, the migration code in migrate_page_copy() uses copy_huge_page()
      for hugetlbfs and thp pages:
      
             if (PageHuge(page) || PageTransHuge(page))
                      copy_huge_page(newpage, page);
      
      So, yay for code reuse.  But:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
      
      and a non-hugetlbfs page has no page_hstate().  This works 99% of the
      time because page_hstate() determines the hstate from the page order
      alone.  Since the page order of a THP page matches the default hugetlbfs
      page order, it works.
      
      But, if you change the default huge page size on the boot command-line
      (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
      so page_hstate() returns null and copy_huge_page() oopses pretty fast
      since copy_huge_page() dereferences the hstate:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
              if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
        ...
      
      Mel noticed that the migration code is really the only user of these
      functions.  This moves all the copy code over to migrate.c and makes
      copy_huge_page() work for THP by checking for it explicitly.
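
A minimal sketch of the "check for THP explicitly" idea, written as a userspace model and not as the patched migrate.c. The types and names (`enum sim_page_kind`, `struct sim_page`, `sim_nr_pages_to_copy()`) and the PMD order of 9 are assumptions for illustration. The point is that the hstate is consulted only for hugetlbfs pages, so a THP page never reaches the NULL dereference described above.

```c
#include <stddef.h>

/* Assumed THP order: 2 MB pages built from 4 kB base pages. */
#define SIM_HPAGE_PMD_ORDER 9

enum sim_page_kind { SIM_HUGETLB, SIM_THP };

/* Hypothetical stand-in for the kernel's per-size hugepage state. */
struct sim_hstate {
    unsigned int order;
};

/* Hypothetical page descriptor; hstate is NULL for THP pages. */
struct sim_page {
    enum sim_page_kind kind;
    const struct sim_hstate *hstate;
};

/*
 * Decide how many base pages a huge-page copy must cover. Only the
 * hugetlbfs branch touches the hstate; THP uses its fixed order.
 * Returns -1 for a hugetlbfs page that has no hstate.
 */
static long sim_nr_pages_to_copy(const struct sim_page *p)
{
    if (p->kind == SIM_HUGETLB) {
        if (p->hstate == NULL)
            return -1;
        return 1L << p->hstate->order;
    }
    return 1L << SIM_HPAGE_PMD_ORDER;
}
```

With default_hugepagesz=1G, a THP page would have no matching 2 MB hstate; in this model it still resolves to 512 base pages because the THP branch never looks the hstate up.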
      
      I believe the bug was introduced in commit b32967ff ("mm: numa: Add
      THP migration for the NUMA working set scanning fault case")
      
      [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 15 Nov, 2013 1 commit
    • mm, hugetlb: convert hugetlbfs to use split pmd lock · cb900f41
      Kirill A. Shutemov committed
      Hugetlb supports multiple page sizes. We use split lock only for PMD
      level, but not for PUD.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 17 Oct, 2013 2 commits
    • mm: hugetlb: initialize PG_reserved for tail pages of gigantic compound pages · ef5a22be
      Andrea Arcangeli committed
      Commit 11feeb49 ("kvm: optimize away THP checks in
      kvm_is_mmio_pfn()") introduced a memory leak when KVM is run on gigantic
      compound pages.
      
      That commit depends on the assumption that PG_reserved is identical for
      all head and tail pages of a compound page.  So that if get_user_pages
      returns a tail page, we don't need to check the head page in order to
      know if we deal with a reserved page that requires different
      refcounting.
      
      The assumption that PG_reserved is the same for head and tail pages is
      certainly correct for THP and regular hugepages, but gigantic hugepages
      allocated through bootmem don't clear the PG_reserved on the tail pages
      (the clearing of PG_reserved is done later only if the gigantic hugepage
      is freed).
      
      This patch corrects the gigantic compound page initialization so that we
      can retain the optimization in 11feeb49.  The cacheline was already
      modified in order to set PG_tail so this won't affect the boot time of
      large memory systems.
      
      [akpm@linux-foundation.org: tweak comment layout and grammar]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: andy123 <ajs124.ajs124@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb.c: correct missing private flag clearing · 16c794b4
      Joonsoo Kim committed
      We should clear the page's private flag when returning the page to the
      hugepage pool.  Otherwise, a marked hugepage can be allocated to a user
      who tries to allocate a non-reserved hugepage.  If this user fails to
      map the hugepage, they will try to return the page to the hugepage pool.
      Since the page still has the private flag set, resv_huge_pages would
      mistakenly increase.  This patch fixes this situation.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 12 Sep, 2013 9 commits