1. 14 Jan, 2014 1 commit
  2. 29 Dec, 2013 1 commit
    • L
      slub: Fix calculation of cpu slabs · 8afb1474
      Li Zefan authored
        /sys/kernel/slab/:t-0000048 # cat cpu_slabs
        231 N0=16 N1=215
        /sys/kernel/slab/:t-0000048 # cat slabs
        145 N0=36 N1=109
      
      See, the number of slabs is smaller than that of cpu slabs.
      
      The bug was introduced by commit 49e22585
      ("slub: per cpu cache for partial pages").
      
      We should use page->pages instead of page->pobjects when calculating
      the number of cpu partial slabs. This also fixes the mapping of slabs
      and nodes.
      
      As there's no variable storing the number of total/active objects in
      cpu partial slabs, and we don't have user interfaces requiring those
      statistics, I just add WARN_ON for those cases.
      
      Cc: <stable@vger.kernel.org> # 3.2+
      Acked-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
      8afb1474
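The difference between the two counters in the slub fix can be illustrated with a minimal userspace sketch. `struct mock_page`, `push_partial()` and both counting helpers are hypothetical stand-ins, not kernel code; the sketch assumes the head of a per-cpu partial list carries cumulative `pages` and `pobjects` values for the whole list.

```c
#include <stddef.h>

/* Hypothetical stand-in for the struct page fields used on a per-cpu
 * partial list; this is not the kernel's definition. */
struct mock_page {
    int pages;              /* partial slabs from this entry to the list tail */
    int pobjects;           /* approximate free objects in those slabs */
    struct mock_page *next;
};

/* Push a slab with `free_objects` free objects onto the list head and
 * keep the cumulative counters on the new head. */
static struct mock_page *push_partial(struct mock_page *head,
                                      struct mock_page *slab,
                                      int free_objects)
{
    slab->pages = (head ? head->pages : 0) + 1;
    slab->pobjects = (head ? head->pobjects : 0) + free_objects;
    slab->next = head;
    return slab;
}

/* Counts slabs: reads the slab counter. */
static int count_cpu_partial_slabs(const struct mock_page *head)
{
    return head ? head->pages : 0;
}

/* Counts free objects by mistake, so the "slab" count is inflated. */
static int count_cpu_partial_slabs_buggy(const struct mock_page *head)
{
    return head ? head->pobjects : 0;
}
```

With two slabs holding 10 and 7 free objects, the first helper reports 2 slabs while the second reports 17, which is the kind of mismatch visible in the `cpu_slabs` output quoted in the commit message.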
  3. 22 Nov, 2013 3 commits
    • D
      mm, mempolicy: silence gcc warning · b7a9f420
      David Rientjes authored
      Fengguang Wu reports that compiling mm/mempolicy.c results in a warning:
      
        mm/mempolicy.c: In function 'mpol_to_str':
        mm/mempolicy.c:2878:2: error: format not a string literal and no format arguments
      
      Kees says this is because he is using -Wformat-security.
      
      Silence the warning.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Suggested-by: Kees Cook <keescook@chromium.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b7a9f420
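The class of warning reported for mm/mempolicy.c can be sketched in userspace C. The helper `copy_mode_name()` is hypothetical and is not the kernel's `mpol_to_str()`; it only shows one common way to satisfy `-Wformat-security`, which is to pass the string as an argument to a literal `"%s"` format instead of using it as the format itself.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not the kernel's mpol_to_str(). */
static int copy_mode_name(char *buf, size_t len, const char *name)
{
    /*
     * snprintf(buf, len, name) would use `name` as the format string and
     * trigger "format not a string literal and no format arguments" under
     * -Wformat-security. Passing it through "%s" avoids the warning and
     * also keeps any '%' characters in `name` from being interpreted.
     */
    return snprintf(buf, len, "%s", name);
}
```

A string containing a conversion specifier, such as `"100%s"`, is copied verbatim by this helper rather than being parsed as a format.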
    • A
      mm: hugetlbfs: fix hugetlbfs optimization · 27c73ae7
      Andrea Arcangeli authored
      Commit 7cb2ef56 ("mm: fix aio performance regression for database
      caused by THP") can cause dereference of a dangling pointer if
      split_huge_page runs during PageHuge() if there are updates to the
      tail_page->private field.
      
      Also it is repeating compound_head twice for hugetlbfs and it is running
      compound_head+compound_trans_head for THP when a single one is needed in
      both cases.
      
      The new code within the PageSlab() check doesn't need to verify that the
      THP page size is never bigger than the smallest hugetlbfs page size, to
      avoid memory corruption.
      
      A longstanding theoretical race condition was found while fixing the
      above (see the change right after the skip_unlock label, that is
      relevant for the compound_lock path too).
      
      By re-establishing the _mapcount tail refcounting for all compound
      pages, this also fixes the below problem:
      
        echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
      
        BUG: Bad page state in process bash  pfn:59a01
        page:ffffea000139b038 count:0 mapcount:10 mapping:          (null) index:0x0
        page flags: 0x1c00000000008000(tail)
        Modules linked in:
        CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x55/0x76
          bad_page+0xd5/0x130
          free_pages_prepare+0x213/0x280
          __free_pages+0x36/0x80
          update_and_free_page+0xc1/0xd0
          free_pool_huge_page+0xc2/0xe0
          set_max_huge_pages.part.58+0x14c/0x220
          nr_hugepages_store_common.isra.60+0xd0/0xf0
          nr_hugepages_store+0x13/0x20
          kobj_attr_store+0xf/0x20
          sysfs_write_file+0x189/0x1e0
          vfs_write+0xc5/0x1f0
          SyS_write+0x55/0xb0
          system_call_fastpath+0x16/0x1b
      Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27c73ae7
    • D
      mm: thp: give transparent hugepage code a separate copy_page · 30b0a105
      Dave Hansen authored
      Right now, the migration code in migrate_page_copy() uses copy_huge_page()
      for hugetlbfs and thp pages:
      
             if (PageHuge(page) || PageTransHuge(page))
                      copy_huge_page(newpage, page);
      
      So, yay for code reuse.  But:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
      
      and a non-hugetlbfs page has no page_hstate().  This works 99% of the
      time because page_hstate() determines the hstate from the page order
      alone.  Since the page order of a THP page matches the default hugetlbfs
      page order, it works.
      
      But, if you change the default huge page size on the boot command-line
      (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
      so page_hstate() returns null and copy_huge_page() oopses pretty fast
      since copy_huge_page() dereferences the hstate:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
              if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
        ...
      
      Mel noticed that the migration code is really the only user of these
      functions.  This moves all the copy code over to migrate.c and makes
      copy_huge_page() work for THP by checking for it explicitly.
      
      I believe the bug was introduced in commit b32967ff ("mm: numa: Add
      THP migration for the NUMA working set scanning fault case")
      
      [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      30b0a105
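The failure mode described in the copy_huge_page() commit can be modeled outside the kernel. In this sketch `struct mock_hstate`, `find_hstate()`, `pages_to_copy()` and `MOCK_THP_ORDER` are all hypothetical names; the lookup by order follows the commit message's description of page_hstate(), and the order value 9 assumes 2MB THP pages on 4KB base pages.

```c
#include <stddef.h>

#define MOCK_THP_ORDER 9   /* assumed: 2MB THP with 4KB base pages */

struct mock_hstate { unsigned int order; };

/* Find the hstate whose order matches, or NULL when none is registered
 * (for example when only a 1GB hstate exists). */
static const struct mock_hstate *find_hstate(const struct mock_hstate *tbl,
                                             size_t n, unsigned int order)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].order == order)
            return &tbl[i];
    return NULL;
}

/* Number of base pages to copy, or 0 if it cannot be determined.
 * THP pages are handled explicitly and never consult the hstate table. */
static unsigned long pages_to_copy(const struct mock_hstate *tbl, size_t n,
                                   unsigned int order, int is_hugetlbfs)
{
    if (is_hugetlbfs) {
        const struct mock_hstate *h = find_hstate(tbl, n, order);
        return h ? 1UL << h->order : 0;   /* guard the NULL case */
    }
    return 1UL << order;
}
```

With a table that holds only an order-18 entry, looking up `MOCK_THP_ORDER` returns NULL, yet the explicit THP branch still yields 512 base pages without dereferencing anything.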
  4. 21 Nov, 2013 1 commit
  5. 15 Nov, 2013 13 commits
    • S
      kfifo API type safety · 498d319b
      Stefani Seibold authored
      This patch enhances the type safety for the kfifo API.  It is now safe
      to put const data into a non const FIFO and the API will now generate a
      compiler warning when reading from the fifo where the destination
      address is pointing to a const variable.
      
      As a side effect, kfifo_put() now expects the value of an element
      instead of a pointer to the element.  This was suggested by Russell King.  It
      makes kfifo_put() easier to use, since there is no need to
      create a helper variable for getting the address of a pointer or to pass
      integers of different sizes.
      
      IMHO the API break is okay, since there are currently only six users of
      kfifo_put().
      
      The code is also cleaner by kicking out the "if (0)" expressions.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Stefani Seibold <stefani@seibold.net>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Hauke Mehrtens <hauke@hauke-m.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      498d319b
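The put-by-value convention described in the kfifo commit can be shown with a small macro-based FIFO. This is a hypothetical userspace sketch, not the kernel's kfifo implementation: `DECLARE_MINI_FIFO`, `mini_fifo_put` and `mini_fifo_get` are invented names, and the capacity is assumed to be a power of two.

```c
#include <stddef.h>

/* Hypothetical typed FIFO; capacity is assumed to be a power of two. */
#define DECLARE_MINI_FIFO(name, type, size) \
    struct { type buf[size]; unsigned int in, out; } name = { .in = 0, .out = 0 }

#define mini_fifo_capacity(f) (sizeof((f)->buf) / sizeof((f)->buf[0]))

/* Takes the element by value, so const data and literals can be stored
 * directly. Evaluates to 1 on success and 0 when the FIFO is full. */
#define mini_fifo_put(f, val) \
    (((f)->in - (f)->out) < mini_fifo_capacity(f) ? \
     ((f)->buf[(f)->in++ % mini_fifo_capacity(f)] = (val), 1) : 0)

/* Takes a pointer to a writable destination. Evaluates to 1 on success
 * and 0 when the FIFO is empty. */
#define mini_fifo_get(f, dst) \
    (((f)->in != (f)->out) ? \
     (*(dst) = (f)->buf[(f)->out++ % mini_fifo_capacity(f)], 1) : 0)
```

Because `mini_fifo_put()` assigns from a value, a `const int` or an integer literal can be queued without a helper variable. Because `mini_fifo_get()` assigns through `dst`, passing the address of a `const` variable is rejected by the compiler in this sketch.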
    • K
      mm: create a separate slab for page->ptl allocation · ea1e7ed3
      Kirill A. Shutemov authored
      If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled, spinlock_t on x86_64
      is 72 bytes.  For page->ptl they will be allocated from the kmalloc-96 slab,
      so we lose 24 bytes on each.  An average system can easily allocate a few tens
      of thousands of page->ptl, so the overhead is significant.
      
      Let's create a separate slab for page->ptl allocation to solve this.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea1e7ed3
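The rounding loss quoted in the page->ptl slab commit can be worked through in a few lines of C. The size-class list below is an assumption made for illustration, and the helpers `pick_size_class()` and `rounding_waste()` are hypothetical; the sketch ignores alignment and per-slab metadata.

```c
#include <stddef.h>

/* Assumed list of general-purpose size classes, for illustration only. */
static const size_t size_classes[] = { 8, 16, 32, 64, 96, 128, 192, 256 };

/* Smallest class that fits `size`, or 0 if none does. */
static size_t pick_size_class(size_t size)
{
    for (size_t i = 0; i < sizeof(size_classes) / sizeof(size_classes[0]); i++)
        if (size_classes[i] >= size)
            return size_classes[i];
    return 0;
}

/* Bytes lost to rounding when `count` objects of `size` bytes are served
 * from a shared size class instead of a cache sized exactly for them. */
static size_t rounding_waste(size_t size, size_t count)
{
    size_t cls = pick_size_class(size);
    return cls ? (cls - size) * count : 0;
}
```

A 72-byte lock lands in the 96-byte class and loses 24 bytes; for an assumed 30,000 locks that adds up to 720,000 bytes.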
    • P
      mm: properly separate the bloated ptl from the regular case · 539edb58
      Peter Zijlstra authored
      Use kernel/bounds.c to convert build-time spinlock_t size check into a
      preprocessor symbol and apply that to properly separate the page::ptl
      situation.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      539edb58
    • K
      mm: dynamically allocate page->ptl if it cannot be embedded to struct page · 49076ec2
      Kirill A. Shutemov authored
      If split page table lock is in use, we embed the lock into struct page
      of table's page.  We have to disable split lock if spinlock_t is too
      big to be embedded, like when DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC is
      enabled.
      
      This patch adds support for dynamic allocation of split page table lock
      if we can't embed it into struct page.
      
      page->ptl is unsigned long now and we use it as spinlock_t if
      sizeof(spinlock_t) <= sizeof(long), otherwise it's a pointer to spinlock_t.
      
      The spinlock_t is allocated in pgtable_page_ctor() for PTE table and in
      pgtable_pmd_page_ctor() for PMD table.  All other helpers are converted to
      support dynamically allocated page->ptl.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49076ec2
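The "embed if it fits, otherwise allocate" scheme for page->ptl can be sketched in userspace. Everything here is hypothetical: `struct mock_page` and the `ptl_init()`, `ptl_ptr()` and `ptl_free()` helpers are invented, the lock is treated as an opaque blob of `lock_size` bytes, and `uintptr_t` is used in place of the kernel's `unsigned long` so the pointer cast is portable.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the one word of struct page used for the lock. */
struct mock_page {
    uintptr_t ptl;   /* either the lock storage itself or a pointer to it */
};

/* Constructor: returns 1 on success, 0 if the allocation failed. */
static int ptl_init(struct mock_page *page, size_t lock_size)
{
    if (lock_size <= sizeof(page->ptl)) {
        page->ptl = 0;               /* lock lives in place */
        return 1;
    }
    void *lock = calloc(1, lock_size);
    if (!lock)
        return 0;
    page->ptl = (uintptr_t)lock;     /* word holds a pointer to the lock */
    return 1;
}

/* Address of the lock storage, wherever it ended up. */
static void *ptl_ptr(struct mock_page *page, size_t lock_size)
{
    if (lock_size <= sizeof(page->ptl))
        return &page->ptl;
    return (void *)page->ptl;
}

/* Destructor: only the out-of-line case owns heap memory. */
static void ptl_free(struct mock_page *page, size_t lock_size)
{
    if (lock_size > sizeof(page->ptl))
        free((void *)page->ptl);
}
```

Callers always go through `ptl_ptr()`, so the same code path works for a one-byte lock stored in place and for a 72-byte debug lock stored on the heap. Since the constructor can now fail, callers have to check its return value.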
    • K
      mm: implement split page table lock for PMD level · e009bb30
      Kirill A. Shutemov authored
      The basic idea is the same as with PTE level: the lock is embedded into
      struct page of table's page.
      
      We can't use mm->pmd_huge_pte to store pgtables for THP, since we don't
      take mm->page_table_lock anymore.  Let's reuse page->lru of table's page
      for that.
      
      pgtable_pmd_page_ctor() returns true if initialization is successful
      and false otherwise.  The current implementation never fails, but assuming
      that the constructor can fail will help to port it to -rt, where spinlock_t
      is rather huge and cannot be embedded into struct page -- dynamic
      allocation is required.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e009bb30
    • K
      mm: convert the rest to new page table lock api · c4088ebd
      Kirill A. Shutemov authored
      Only trivial cases are left. Let's convert them all together.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4088ebd
    • K
      mm, hugetlb: convert hugetlbfs to use split pmd lock · cb900f41
      Kirill A. Shutemov authored
      Hugetlb supports multiple page sizes. We use split lock only for PMD
      level, but not for PUD.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb900f41
    • K
      mm, thp: do not access mm->pmd_huge_pte directly · c389a250
      Kirill A. Shutemov authored
      Currently mm->pmd_huge_pte is protected by the page table lock.  It will not
      work with split lock.  We have to have per-pmd pmd_huge_pte for proper
      access serialization.
      
      For now, let's just introduce a wrapper to access mm->pmd_huge_pte.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c389a250
    • K
      mm, thp: move ptl taking inside page_check_address_pmd() · 117b0791
      Kirill A. Shutemov authored
      With split page table lock we can't know which lock we need to take
      before we find the relevant pmd.
      
      Let's move lock taking inside the function.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      117b0791
    • K
      mm, thp: change pmd_trans_huge_lock() to return taken lock · bf929152
      Kirill A. Shutemov authored
      With split ptlock it's important to know which lock
      pmd_trans_huge_lock() took.  This patch adds one more parameter to the
      function to return the lock.
      
      In most places migration to the new API is trivial.  The exception is
      move_huge_pmd(): we need to take two locks if pmd tables are different.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf929152
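The idea of returning the taken lock through an extra parameter can be sketched with mock types. This is a simplified, hypothetical model and not the kernel's pmd_trans_huge_lock(): `struct mock_lock` is a flag rather than a real spinlock, each `struct mock_pmd_table` carries its own lock, and the function only distinguishes "huge" from "not huge".

```c
#include <stddef.h>

/* Hypothetical lock: a flag instead of a real spinlock. */
struct mock_lock { int held; };

/* Hypothetical PMD table with its own split lock. */
struct mock_pmd_table {
    struct mock_lock ptl;
    int is_huge;
};

/*
 * Returns 1 with the table's lock held and *ptl pointing at it when the
 * entry is huge. Returns 0 with no lock held and *ptl untouched otherwise.
 */
static int mock_trans_huge_lock(struct mock_pmd_table *table,
                                struct mock_lock **ptl)
{
    table->ptl.held = 1;             /* take this table's lock */
    if (table->is_huge) {
        *ptl = &table->ptl;          /* tell the caller which lock it holds */
        return 1;
    }
    table->ptl.held = 0;             /* not huge: drop the lock again */
    return 0;
}

static void mock_unlock(struct mock_lock *lock)
{
    lock->held = 0;
}
```

The caller unlocks whatever `*ptl` points at, so it never needs to know which table's lock was involved. With a single mm-wide lock that out-parameter would be redundant; with per-table locks it is the only way for the caller to release the right one.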
    • K
      mm: convert mm->nr_ptes to atomic_long_t · e1f56c89
      Kirill A. Shutemov authored
      With split page table lock for PMD level we can't hold mm->page_table_lock
      while updating nr_ptes.
      
      Let's convert it to atomic_long_t to avoid races.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e1f56c89
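A counter that is no longer covered by one lock can be kept consistent with atomic updates. The sketch below uses C11 `<stdatomic.h>` as a userspace analogue of the kernel's atomic_long_t; `struct mock_mm` and the helper functions are hypothetical names.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the one field of mm_struct discussed above. */
struct mock_mm {
    atomic_long nr_ptes;
};

static void mock_mm_init(struct mock_mm *mm)
{
    atomic_init(&mm->nr_ptes, 0);
}

/* Safe to call from threads holding different page table locks. */
static void mock_nr_ptes_inc(struct mock_mm *mm)
{
    atomic_fetch_add(&mm->nr_ptes, 1);
}

static void mock_nr_ptes_dec(struct mock_mm *mm)
{
    atomic_fetch_sub(&mm->nr_ptes, 1);
}

static long mock_nr_ptes_read(struct mock_mm *mm)
{
    return atomic_load(&mm->nr_ptes);
}
```

Each update is a single read-modify-write operation, so two threads that hold different split locks cannot lose an increment the way they could with a plain `long`.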
    • K
      mm: avoid increase sizeof(struct page) due to split page table lock · e9bb18c7
      Kirill A. Shutemov authored
      Alex Thorlton noticed that some massively threaded workloads work poorly
      if THP is enabled.  This patchset fixes this by introducing split page table
      lock for PMD tables.  hugetlbfs is not covered yet.
      
      This patchset is based on work by Naoya Horiguchi.
      
      : akpm result summary:
      :
      : THP off, v3.12-rc2: 18.059261877 seconds time elapsed
      : THP off, patched:   16.768027318 seconds time elapsed
      :
      : THP on, v3.12-rc2:  42.162306788 seconds time elapsed
      : THP on, patched:    8.397885779 seconds time elapsed
      :
      : HUGETLB, v3.12-rc2: 47.574936948 seconds time elapsed
      : HUGETLB, patched:   19.447481153 seconds time elapsed
      
      THP off, v3.12-rc2:
      -------------------
      
       Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
      
          1037072.835207 task-clock                #   57.426 CPUs utilized            ( +-  3.59% )
                  95,093 context-switches          #    0.092 K/sec                    ( +-  3.93% )
                     140 cpu-migrations            #    0.000 K/sec                    ( +-  5.28% )
              10,000,550 page-faults               #    0.010 M/sec                    ( +-  0.00% )
       2,455,210,400,261 cycles                    #    2.367 GHz                      ( +-  3.62% ) [83.33%]
       2,429,281,882,056 stalled-cycles-frontend   #   98.94% frontend cycles idle     ( +-  3.67% ) [83.33%]
       1,975,960,019,659 stalled-cycles-backend    #   80.48% backend  cycles idle     ( +-  3.88% ) [66.68%]
          46,503,296,013 instructions              #    0.02  insns per cycle
                                                   #   52.24  stalled cycles per insn  ( +-  3.21% ) [83.34%]
           9,278,997,542 branches                  #    8.947 M/sec                    ( +-  4.00% ) [83.34%]
              89,881,640 branch-misses             #    0.97% of all branches          ( +-  1.17% ) [83.33%]
      
            18.059261877 seconds time elapsed                                          ( +-  2.65% )
      
      THP on, v3.12-rc2:
      ------------------
      
       Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
      
          3114745.395974 task-clock                #   73.875 CPUs utilized            ( +-  1.84% )
                 267,356 context-switches          #    0.086 K/sec                    ( +-  1.84% )
                      99 cpu-migrations            #    0.000 K/sec                    ( +-  1.40% )
                  58,313 page-faults               #    0.019 K/sec                    ( +-  0.28% )
       7,416,635,817,510 cycles                    #    2.381 GHz                      ( +-  1.83% ) [83.33%]
       7,342,619,196,993 stalled-cycles-frontend   #   99.00% frontend cycles idle     ( +-  1.88% ) [83.33%]
       6,267,671,641,967 stalled-cycles-backend    #   84.51% backend  cycles idle     ( +-  2.03% ) [66.67%]
         117,819,935,165 instructions              #    0.02  insns per cycle
                                                   #   62.32  stalled cycles per insn  ( +-  4.39% ) [83.34%]
          28,899,314,777 branches                  #    9.278 M/sec                    ( +-  4.48% ) [83.34%]
              71,787,032 branch-misses             #    0.25% of all branches          ( +-  1.03% ) [83.33%]
      
            42.162306788 seconds time elapsed                                          ( +-  1.73% )
      
      HUGETLB, v3.12-rc2:
      -------------------
      
       Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):
      
          2588052.787264 task-clock                #   54.400 CPUs utilized            ( +-  3.69% )
                 246,831 context-switches          #    0.095 K/sec                    ( +-  4.15% )
                     138 cpu-migrations            #    0.000 K/sec                    ( +-  5.30% )
                  21,027 page-faults               #    0.008 K/sec                    ( +-  0.01% )
       6,166,666,307,263 cycles                    #    2.383 GHz                      ( +-  3.68% ) [83.33%]
       6,086,008,929,407 stalled-cycles-frontend   #   98.69% frontend cycles idle     ( +-  3.77% ) [83.33%]
       5,087,874,435,481 stalled-cycles-backend    #   82.51% backend  cycles idle     ( +-  4.41% ) [66.67%]
         133,782,831,249 instructions              #    0.02  insns per cycle
                                                   #   45.49  stalled cycles per insn  ( +-  4.30% ) [83.34%]
          34,026,870,541 branches                  #   13.148 M/sec                    ( +-  4.24% ) [83.34%]
              68,670,942 branch-misses             #    0.20% of all branches          ( +-  3.26% ) [83.33%]
      
            47.574936948 seconds time elapsed                                          ( +-  2.09% )
      
      THP off, patched:
      -----------------
      
       Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
      
           943301.957892 task-clock                #   56.256 CPUs utilized            ( +-  3.01% )
                  86,218 context-switches          #    0.091 K/sec                    ( +-  3.17% )
                     121 cpu-migrations            #    0.000 K/sec                    ( +-  6.64% )
              10,000,551 page-faults               #    0.011 M/sec                    ( +-  0.00% )
       2,230,462,457,654 cycles                    #    2.365 GHz                      ( +-  3.04% ) [83.32%]
       2,204,616,385,805 stalled-cycles-frontend   #   98.84% frontend cycles idle     ( +-  3.09% ) [83.32%]
       1,778,640,046,926 stalled-cycles-backend    #   79.74% backend  cycles idle     ( +-  3.47% ) [66.69%]
          45,995,472,617 instructions              #    0.02  insns per cycle
                                                   #   47.93  stalled cycles per insn  ( +-  2.51% ) [83.34%]
           9,179,700,174 branches                  #    9.731 M/sec                    ( +-  3.04% ) [83.35%]
              89,166,529 branch-misses             #    0.97% of all branches          ( +-  1.45% ) [83.33%]
      
            16.768027318 seconds time elapsed                                          ( +-  2.47% )
      
      THP on, patched:
      ----------------
      
       Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
      
           458793.837905 task-clock                #   54.632 CPUs utilized            ( +-  0.79% )
                  41,831 context-switches          #    0.091 K/sec                    ( +-  0.97% )
                      98 cpu-migrations            #    0.000 K/sec                    ( +-  1.66% )
                  57,829 page-faults               #    0.126 K/sec                    ( +-  0.62% )
       1,077,543,336,716 cycles                    #    2.349 GHz                      ( +-  0.81% ) [83.33%]
       1,067,403,802,964 stalled-cycles-frontend   #   99.06% frontend cycles idle     ( +-  0.87% ) [83.33%]
         864,764,616,143 stalled-cycles-backend    #   80.25% backend  cycles idle     ( +-  0.73% ) [66.68%]
          16,129,177,440 instructions              #    0.01  insns per cycle
                                                   #   66.18  stalled cycles per insn  ( +-  7.94% ) [83.35%]
           3,618,938,569 branches                  #    7.888 M/sec                    ( +-  8.46% ) [83.36%]
              33,242,032 branch-misses             #    0.92% of all branches          ( +-  2.02% ) [83.32%]
      
             8.397885779 seconds time elapsed                                          ( +-  0.18% )
      
      HUGETLB, patched:
      -----------------
      
       Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):
      
           395353.076837 task-clock                #   20.329 CPUs utilized            ( +-  8.16% )
                  55,730 context-switches          #    0.141 K/sec                    ( +-  5.31% )
                     138 cpu-migrations            #    0.000 K/sec                    ( +-  4.24% )
                  21,027 page-faults               #    0.053 K/sec                    ( +-  0.00% )
         930,219,717,244 cycles                    #    2.353 GHz                      ( +-  8.21% ) [83.32%]
         914,295,694,103 stalled-cycles-frontend   #   98.29% frontend cycles idle     ( +-  8.35% ) [83.33%]
         704,137,950,187 stalled-cycles-backend    #   75.70% backend  cycles idle     ( +-  9.16% ) [66.69%]
          30,541,538,385 instructions              #    0.03  insns per cycle
                                                   #   29.94  stalled cycles per insn  ( +-  3.98% ) [83.35%]
           8,415,376,631 branches                  #   21.286 M/sec                    ( +-  3.61% ) [83.36%]
              32,645,478 branch-misses             #    0.39% of all branches          ( +-  3.41% ) [83.32%]
      
            19.447481153 seconds time elapsed                                          ( +-  2.00% )
      
      This patch (of 11):
      
      CONFIG_GENERIC_LOCKBREAK increases sizeof(spinlock_t) to 8 bytes.  This
      increases sizeof(struct page) by 4 bytes on 32-bit systems if the split
      page table lock is in use, since page->ptl shares space in a union with
      longs and pointers.
      
      Let's disable the split page table lock on 32-bit systems with
      GENERIC_LOCKBREAK enabled.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9bb18c7
    • K
      mm: drop actor argument of do_generic_file_read() · b77d88d4
      Kirill A. Shutemov committed
      There's only one caller of do_generic_file_read() and the only actor is
      file_read_actor().  No reason to have a callback parameter.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b77d88d4
  6. 13 Nov, 2013 21 commits
    • M
      mm: numa: return the number of base pages altered by protection changes · 72403b4a
      Mel Gorman committed
      Commit 0255d491 ("mm: Account for a THP NUMA hinting update as one
      PTE update") was added to account for the number of PTE updates when
      marking pages prot_numa.  task_numa_work was using the old return value
      to track how much address space had been updated.  Altering the return
      value causes the scanner to do more work than it is configured or
      documented to in a single unit of work.
      
      This patch reverts that commit and accounts for the number of THP
      updates separately in vmstat.  It is up to the administrator to
      interpret the pair of values correctly.  This is a straight-forward
      operation and likely to only be of interest when actively debugging NUMA
      balancing problems.
      
      The impact of this patch is that the NUMA PTE scanner will scan slower
      when THP is enabled and workloads may converge slower as a result.  On
      the flip side, system CPU usage should be lower than recent tests
      reported.  This is an illustrative example of a short single JVM specjbb
      test:
      
      specjbb
                             3.12.0                3.12.0
                            vanilla      acctupdates
      TPut 1      26143.00 (  0.00%)     25747.00 ( -1.51%)
      TPut 7     185257.00 (  0.00%)    183202.00 ( -1.11%)
      TPut 13    329760.00 (  0.00%)    346577.00 (  5.10%)
      TPut 19    442502.00 (  0.00%)    460146.00 (  3.99%)
      TPut 25    540634.00 (  0.00%)    549053.00 (  1.56%)
      TPut 31    512098.00 (  0.00%)    519611.00 (  1.47%)
      TPut 37    461276.00 (  0.00%)    474973.00 (  2.97%)
      TPut 43    403089.00 (  0.00%)    414172.00 (  2.75%)
      
                    3.12.0      3.12.0
                   vanilla acctupdates
      User         5169.64     5184.14
      System        100.45       80.02
      Elapsed       252.75      251.85
      
      Performance is similar but note the reduction in system CPU time.  While
      this showed a performance gain, it will not be universal but at least
      it'll be behaving as documented.  The vmstats are obviously different but
      here is an obvious interpretation of them from mmtests.
      
                                      3.12.0      3.12.0
                                     vanilla acctupdates
      NUMA page range updates        1408326    11043064
      NUMA huge PMD updates                0       21040
      NUMA PTE updates               1408326      291624
      
      "NUMA page range updates" == nr_pte_updates and is the value returned to
      the NUMA pte scanner.  NUMA huge PMD updates were the number of THP
      updates which in combination can be used to calculate how many ptes were
      updated from userspace.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Alex Thorlton <athorlton@sgi.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72403b4a
    • J
      mm: factor commit limit calculation · 00619bcc
      Jerome Marchand committed
      The same calculation is currently done in three different places.
      Factor out that code so future changes have to be made in only one place.
      
      [akpm@linux-foundation.org: uninline vm_commit_limit()]
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00619bcc
    • W
      mm/zswap: refactor the get/put routines · 0ab0abcf
      Weijie Yang committed
      The refcount routines did not exactly fit the kernel's get/put
      semantics.  There were too many conditional checks on the refcount, and
      it could become negative.
      
      This patch does the following:
      
       - move the refcount check into zswap_entry_put() to hide the resource
         free function.
      
       - add a new function zswap_entry_find_get(), so that callers can
         easily use the following pattern:
      
           zswap_entry_find_get
           .../* do something */
           zswap_entry_put
      
       - move some function declarations to eliminate a compile error.
      
      This patch is based on Minchan Kim <minchan@kernel.org>'s idea and suggestion.
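
The find_get/put pattern described above can be sketched in plain userspace C. This is only an illustration under stated assumptions: the `demo_*` names and the struct are hypothetical, the lookup is reduced to a key comparison, and there is no locking, whereas real kernel code would do the lookup and refcount updates under a lock.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted entry; not the kernel's struct. */
struct demo_entry {
	int refcount;
	int key;
	int *freed;	/* set to 1 when the entry is released */
};

/* Create an entry holding one initial reference (the "tree" reference). */
static struct demo_entry *demo_entry_create(int key, int *freed)
{
	struct demo_entry *e = malloc(sizeof(*e));

	if (!e)
		return NULL;
	e->refcount = 1;
	e->key = key;
	e->freed = freed;
	return e;
}

static void demo_entry_get(struct demo_entry *e)
{
	e->refcount++;
}

/*
 * Drop one reference.  The zero check and the release live here, so
 * callers never call the free routine themselves.
 */
static void demo_entry_put(struct demo_entry *e)
{
	assert(e->refcount > 0);	/* the count must never go negative */
	if (--e->refcount == 0) {
		*e->freed = 1;
		free(e);
	}
}

/* Look up an entry and take a reference in one step. */
static struct demo_entry *demo_entry_find_get(struct demo_entry *slot, int key)
{
	if (!slot || slot->key != key)
		return NULL;
	demo_entry_get(slot);
	return slot;
}
```

A caller pairs every successful `demo_entry_find_get()` with exactly one `demo_entry_put()`; whoever drops the last reference triggers the release.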
      Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ab0abcf
    • W
      mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently · 67d13fe8
      Weijie Yang committed
      Consider the following scenario:
      
      thread 0: reclaims entry x (gets a refcount, but has not yet called
      	zswap_get_swap_cache_page)
      thread 1: calls zswap_frontswap_invalidate_page to invalidate entry x.
      	When it finishes, entry x and its zbud are not freed because the
      	refcount != 0; now swap_map[x] = 0
      thread 0: now calls zswap_get_swap_cache_page
      	swapcache_prepare returns -ENOENT because entry x is no longer used
      	zswap_get_swap_cache_page returns ZSWAP_SWAPCACHE_NOMEM
      	zswap_writeback_entry does nothing except put the refcount
      
      Now the memory of zswap_entry x and its zpage is leaked.
      
      Changes:
       - check the refcount in the failure path and free the memory if it is
         no longer referenced.
      
       - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM, as the
         failure path can be caused not only by nomem but also by invalidate.
      Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67d13fe8
    • Q
      memcg, kmem: use cache_from_memcg_idx instead of hard code · 7a67d7ab
      Qiang Huang committed
      Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a67d7ab
    • Q
      memcg, kmem: rename cache_from_memcg to cache_from_memcg_idx · 2ade4de8
      Qiang Huang committed
      We can't see the relationship with memcg from the parameters,
      so the name with memcg_idx would be more reasonable.
      Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ade4de8
    • Q
      memcg, kmem: use is_root_cache instead of hard code · f35c3a8e
      Qiang Huang committed
      Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f35c3a8e
    • A
      mm: ensure get_unmapped_area() returns higher address than mmap_min_addr · 2afc745f
      Akira Takeuchi committed
      This patch fixes a problem where get_unmapped_area() can return an
      illegal address, causing mmap(2) and similar calls to fail.
      
      If /proc/sys/vm/mmap_min_addr is set to an address higher than
      PAGE_SIZE, get_unmapped_area() can return an address lower than
      mmap_min_addr, even if you do not pass any virtual address hint (i.e.
      the second argument).
      
      This is because the current get_unmapped_area() code does not take
      mmap_min_addr into account.
      
      This leads to two actual problems as follows:
      
      1. mmap(2) can fail with EPERM in a process without CAP_SYS_RAWIO,
         although no illegal parameter is passed.
      
      2. The bottom-up search path after the top-down search might not work in
         arch_get_unmapped_area_topdown().
      
      Note: The first and third chunks of my patch, which change the "len"
      check, make the check more precise using mmap_min_addr; they are not
      for solving the above problem.
      
      [How to reproduce]
      
      	--- test.c -------------------------------------------------
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/errno.h>
      
      	int main(int argc, char *argv[])
      	{
      		void *ret = NULL, *last_map;
      		size_t pagesize = sysconf(_SC_PAGESIZE);
      
      		do {
      			last_map = ret;
      			ret = mmap(0, pagesize, PROT_NONE,
      				MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
      	//		printf("ret=%p\n", ret);
      		} while (ret != MAP_FAILED);
      
      		if (errno != ENOMEM) {
      			printf("ERR: unexpected errno: %d (last map=%p)\n",
      			errno, last_map);
      		}
      
      		return 0;
      	}
      	---------------------------------------------------------------
      
      	$ gcc -m32 -o test test.c
      	$ sudo sysctl -w vm.mmap_min_addr=65536
      	vm.mmap_min_addr = 65536
      	$ ./test  (run as a non-privileged user)
      	ERR: unexpected errno: 1 (last map=0x10000)
      Signed-off-by: Akira Takeuchi <takeuchi.akr@jp.panasonic.com>
      Signed-off-by: Kiyoshi Owada <owada.kiyoshi@jp.panasonic.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2afc745f
    • K
      mm: __rmqueue_fallback() should respect pageblock type · 0cbef29a
      KOSAKI Motohiro committed
      When __rmqueue_fallback() doesn't find a free block with the required size
      it splits a larger page and puts the rest of the page onto the free list.
      
      But it has one serious mistake.  When putting back, __rmqueue_fallback()
      always uses start_migratetype if the type is not CMA.  However,
      __rmqueue_fallback() is only called when the whole start_migratetype
      queue is empty.  In other words, __rmqueue_fallback() always puts memory
      back on the wrong queue unless try_to_steal_freepages() changed the
      pageblock type (i.e.  requested size is smaller than half of page block).
      The end result is that the antifragmentation framework increases
      fragmentation instead of decreasing it.
      
      Mel's original anti fragmentation does the right thing.  But commit
      47118af0 ("mm: mmzone: MIGRATE_CMA migration type added") broke it.
      
      This patch restores the old, sane behavior.  It also removes an incorrect
      comment which was introduced by commit fef903ef ("mm/page_alloc.c:
      restructure free-page stealing code and fix a bug").
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0cbef29a
    • K
      mm: get rid of unnecessary overhead of trace_mm_page_alloc_extfrag() · 52c8f6a5
      KOSAKI Motohiro committed
      In general, every tracepoint should have zero overhead if it is disabled.
      However, trace_mm_page_alloc_extfrag() is an exception.  It evaluates
      "new_type == start_migratetype" even if the tracepoint is disabled.
      
      The code can be moved into the tracepoint's TP_fast_assign(), which
      exists for exactly this purpose.  This patch does that.
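
The idea of deferring work until the tracepoint is known to be enabled can be modelled in userspace C. This is a rough sketch, not the kernel's tracepoint machinery: all `demo_*` names are hypothetical, and a plain flag stands in for the tracepoint's enabled state.

```c
#include <assert.h>

static int demo_trace_enabled;	/* stand-in for "tracepoint is enabled" */
static int demo_compare_calls;	/* counts how often the comparison ran */
static int demo_last_changed;	/* last value "recorded" by the tracer */

/* The derived value whose computation we want to avoid when disabled. */
static int demo_types_differ(int new_type, int start_type)
{
	demo_compare_calls++;
	return new_type != start_type;
}

/*
 * Before: the caller computes the flag and passes it in, so the
 * comparison runs at the call site even when tracing is off.
 */
static void demo_trace_before(int changed)
{
	if (demo_trace_enabled)
		demo_last_changed = changed;
}

/*
 * After: the caller passes the raw values and the flag is derived only
 * on the enabled path.
 */
static void demo_trace_after(int new_type, int start_type)
{
	if (!demo_trace_enabled)
		return;
	demo_last_changed = demo_types_differ(new_type, start_type);
}
```

With tracing off, `demo_trace_before(demo_types_differ(a, b))` still pays for the comparison, while `demo_trace_after(a, b)` does not.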
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52c8f6a5
    • K
      mm: fix page_group_by_mobility_disabled breakage · 5d0f3f72
      KOSAKI Motohiro committed
      Currently, set_pageblock_migratetype() screws up MIGRATE_CMA and
      MIGRATE_ISOLATE if page_group_by_mobility_disabled is true.  It rewrites
      the argument to MIGRATE_UNMOVABLE and we lose these attributes.
      
      The problem was introduced by commit 49255c61 ("page allocator: move
      check for disabled anti-fragmentation out of fastpath").  A 4-year-old
      issue may mean that nobody uses page_group_by_mobility_disabled.
      
      But anyway, this patch fixes the problem.
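
One way to avoid losing the special types is to rewrite only the ordinary mobility types. The sketch below illustrates that idea in isolation; the enum and function are hypothetical stand-ins, not the kernel's definitions, and the actual patch may be structured differently.

```c
#include <assert.h>

/* Hypothetical migrate types: ordinary ones first, special ones after. */
enum demo_migratetype {
	DEMO_UNMOVABLE,
	DEMO_RECLAIMABLE,
	DEMO_MOVABLE,
	DEMO_NR_ORDINARY,	/* number of ordinary types */
	DEMO_CMA = DEMO_NR_ORDINARY,
	DEMO_ISOLATE,
};

/*
 * When grouping by mobility is disabled, collapse only the ordinary
 * types to DEMO_UNMOVABLE and pass special types through unchanged.
 */
static enum demo_migratetype
demo_effective_type(int grouping_disabled, enum demo_migratetype type)
{
	if (grouping_disabled && type < DEMO_NR_ORDINARY)
		return DEMO_UNMOVABLE;
	return type;
}

static void demo_migratetype_check(void)
{
	assert(demo_effective_type(1, DEMO_MOVABLE) == DEMO_UNMOVABLE);
	assert(demo_effective_type(1, DEMO_CMA) == DEMO_CMA);
	assert(demo_effective_type(1, DEMO_ISOLATE) == DEMO_ISOLATE);
	assert(demo_effective_type(0, DEMO_MOVABLE) == DEMO_MOVABLE);
}
```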
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d0f3f72
    • D
      readahead: fix sequential read cache miss detection · af248a0c
      Damien Ramonda committed
      The kernel's readahead algorithm sometimes interprets random read
      accesses as sequential and triggers unnecessary data prefetching from
      the storage device (impacting random read average latency).
      
      In order to identify sequential cache read misses, the readahead
      algorithm intends to check whether offset - previous offset == 1
      (trivial sequential reads) or offset - previous offset == 0 (sequential
      reads not aligned on page boundary):
      
        if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)
      
      The current offset is stored in the "offset" variable of type "pgoff_t"
      (unsigned long), while the previous offset is stored in "ra->prev_pos" of
      type "loff_t" (long long).  Therefore, operands of the if statement are
      implicitly converted to type long long.  Consequently, when previous
      offset > current offset (which happens on a random pattern), the if
      condition is true and the access is wrongly interpreted as sequential.
      Unnecessary data prefetching is triggered, impacting the average random
      read latency.
      
      Storing the previous offset value in a "pgoff_t" variable (unsigned
      long) fixes the sequential read detection logic.
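
The conversion pitfall can be reproduced in userspace. The sketch below uses fixed-width types to model a build where "unsigned long" is 32 bits wide and "long long" is 64 bits wide; the `demo_*` names and the 4 KiB page shift are assumptions made for illustration, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/*
 * Buggy check: the 32-bit unsigned page index is converted to the wider
 * signed type, so a backward jump yields a negative difference, and any
 * negative value satisfies "<= 1".
 */
static int demo_is_sequential_buggy(uint32_t offset, int64_t prev_pos)
{
	int64_t diff = offset - (prev_pos >> DEMO_PAGE_SHIFT);

	return diff <= 1;
}

/*
 * Fixed check: keep the previous page index in the same unsigned type as
 * the current one.  A backward jump wraps to a huge value and fails the
 * "<= 1" test.
 */
static int demo_is_sequential_fixed(uint32_t offset, int64_t prev_pos)
{
	uint32_t prev_offset = (uint32_t)(prev_pos >> DEMO_PAGE_SHIFT);

	return (uint32_t)(offset - prev_offset) <= 1u;
}

/* Self-check: page 10 is read after page 100, a random backward jump. */
static void demo_readahead_check(void)
{
	int64_t prev_pos = (int64_t)100 << DEMO_PAGE_SHIFT;

	assert(demo_is_sequential_buggy(10, prev_pos));	/* misclassified */
	assert(!demo_is_sequential_fixed(10, prev_pos));
	assert(demo_is_sequential_fixed(101, prev_pos));	/* next page */
	assert(demo_is_sequential_fixed(100, prev_pos));	/* same page */
}
```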
      Signed-off-by: Damien Ramonda <damien.ramonda@intel.com>
      Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
      Acked-by: Pierre Tardy <pierre.tardy@intel.com>
      Acked-by: David Cohen <david.a.cohen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af248a0c
    • T
      mm: clear N_CPU from node_states at CPU offline · 807a1bd2
      Toshi Kani committed
      vmstat_cpuup_callback() is a CPU notifier callback, which sets N_CPU on a
      node at a CPU online event.  However, it does not update this N_CPU info
      at a CPU offline event.
      
      Change vmstat_cpuup_callback() to clear N_CPU when the last CPU in the
      node goes offline, i.e.  the node no longer has any online CPU.
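
The "clear the flag when the last CPU in the node goes offline" rule can be sketched with a toy topology. Everything here is hypothetical (a fixed 4-CPU, 2-node mapping and `demo_*` names); it only illustrates the bookkeeping, not the kernel's notifier or nodemask APIs.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NR_CPUS	4
#define DEMO_NR_NODES	2

/* Assumed topology: CPUs 0-1 on node 0, CPUs 2-3 on node 1. */
static const int demo_cpu_to_node[DEMO_NR_CPUS] = { 0, 0, 1, 1 };
static bool demo_cpu_online[DEMO_NR_CPUS];
static bool demo_node_has_cpu[DEMO_NR_NODES];

/* Online event: the node now has at least one CPU. */
static void demo_cpu_up(int cpu)
{
	demo_cpu_online[cpu] = true;
	demo_node_has_cpu[demo_cpu_to_node[cpu]] = true;
}

/* Offline event: clear the node flag only if no other CPU remains. */
static void demo_cpu_down(int cpu)
{
	int node = demo_cpu_to_node[cpu];
	int i;

	demo_cpu_online[cpu] = false;
	for (i = 0; i < DEMO_NR_CPUS; i++) {
		if (demo_cpu_online[i] && demo_cpu_to_node[i] == node)
			return;	/* another CPU on this node is still online */
	}
	demo_node_has_cpu[node] = false;
}
```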
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      807a1bd2
    • T
      mm: set N_CPU to node_states during boot · d7e0b37a
      Toshi Kani committed
      After a system has booted, N_CPU is not set on any node, as has_cpu shows
      an empty line.
      
        # cat /sys/devices/system/node/has_cpu
        (show-empty-line)
      
      setup_vmstat() registers its CPU notifier callback,
      vmstat_cpuup_callback(), which sets N_CPU on a node when a CPU is brought
      online.  However, setup_vmstat() is called after all CPUs are
      launched in the boot sequence.
      
      Change setup_vmstat() to set N_CPU on the nodes with online CPUs at
      boot, which is consistent with other operations in
      vmstat_cpuup_callback(), i.e.  start_cpu_timer() and
      refresh_zone_stat_thresholds().
      
      Also added get_online_cpus() to protect the for_each_online_cpu() loop.
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Tested-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7e0b37a
    • T
      mem-hotplug: introduce movable_node boot option · c5320926
      Tang Chen committed
      The Hot-Pluggable field in SRAT specifies which memory is hotpluggable.
      As we mentioned before, if hotpluggable memory is used by the kernel, it
      cannot be hot-removed.  So memory hotplug users may want to set all
      hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
      
      Memory hotplug users may also set a node as a movable node, which has
      ZONE_MOVABLE only, so that the whole node can be hot-removed.
      
      But the kernel cannot use memory in ZONE_MOVABLE.  By doing this, the
      kernel cannot use memory in movable nodes.  This degrades NUMA
      performance, and other users may be unhappy.
      
      So we need a way to allow users to enable and disable this functionality.
      In this patch, we introduce the movable_node boot option to allow users to
      choose not to consume hotpluggable memory at early boot time, so that
      later we can set it as ZONE_MOVABLE.
      
      To achieve this, the movable_node boot option will control the memblock
      allocation direction.  That said, after memblock is ready, before SRAT is
      parsed, we should allocate memory near the kernel image as we explained in
      the previous patches.  So if movable_node boot option is set, the kernel
      does the following:
      
      1. After memblock is ready, make memblock allocate memory bottom up.
      2. After SRAT is parsed, make memblock behave as default, allocate memory
         top down.
      
      Users can specify "movable_node" in kernel commandline to enable this
      functionality.  For those who don't use memory hotplug or who don't want
      to lose their NUMA performance, just don't specify anything.  The kernel
      will work as before.
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c5320926
    • T
      mm/memblock.c: introduce bottom-up allocation mode · 79442ed1
      Tang Chen committed
      The Linux kernel cannot migrate pages used by the kernel.  As a result,
      kernel pages cannot be hot-removed.  So we cannot allocate hotpluggable
      memory for the kernel.
      
      ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
      info.  But before SRAT is parsed, memblock has already started to allocate
      memory for the kernel.  So we need to prevent memblock from doing this.
      
      In a memory hotplug system, any numa node the kernel resides in should be
      unhotpluggable.  And for a modern server, each node could have at least
      16GB memory.  So memory around the kernel image is very likely
      unhotpluggable.
      
      So the basic idea is: allocate memory from the end of the kernel image
      upward toward higher memory.  Since not much memory is allocated before
      SRAT is parsed, it is very likely to be in the same node as the kernel
      image.
      
      The current memblock can only allocate memory top-down.  So this patch
      introduces a new bottom-up allocation mode.  Later, when we use this
      allocation direction, we will limit the start address to above the
      kernel.
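
The difference between the two search directions can be sketched over a sorted array of free regions. This is a simplified model under stated assumptions: the `demo_*` names, the region layout, and the sentinel are hypothetical, alignment is ignored, and it does not reproduce memblock's real data structures or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free region; the array is sorted by ascending base. */
struct demo_region {
	uint64_t base;
	uint64_t size;
};

#define DEMO_NOT_FOUND UINT64_MAX

/* Bottom-up: return the lowest fitting address inside [start, end). */
static uint64_t demo_find_bottom_up(const struct demo_region *regions,
				    size_t n, uint64_t start, uint64_t end,
				    uint64_t size)
{
	size_t i;

	for (i = 0; i < n; i++) {
		uint64_t lo = regions[i].base > start ? regions[i].base : start;
		uint64_t hi = regions[i].base + regions[i].size;

		if (hi > end)
			hi = end;
		if (lo < hi && hi - lo >= size)
			return lo;
	}
	return DEMO_NOT_FOUND;
}

/* Top-down: return the highest fitting address inside [start, end). */
static uint64_t demo_find_top_down(const struct demo_region *regions,
				   size_t n, uint64_t start, uint64_t end,
				   uint64_t size)
{
	size_t i;

	for (i = n; i-- > 0; ) {
		uint64_t lo = regions[i].base > start ? regions[i].base : start;
		uint64_t hi = regions[i].base + regions[i].size;

		if (hi > end)
			hi = end;
		if (lo < hi && hi - lo >= size)
			return hi - size;
	}
	return DEMO_NOT_FOUND;
}
```

Passing the end of the kernel image as `start` to the bottom-up search keeps early allocations just above the kernel, which is the placement the text above argues is most likely to stay in the kernel's own node.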
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79442ed1
    • T
      mm/memblock.c: factor out of top-down allocation · 1402899e
      Tang Chen committed
      [Problem]
      
      Linux currently cannot migrate pages used by the kernel because of the
      kernel direct mapping.  In Linux kernel space, va = pa + PAGE_OFFSET.
      When the pa is changed, we cannot simply update the pagetable and keep the
      va unmodified.  So the kernel pages are not migratable.
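
The fixed relationship va = pa + PAGE_OFFSET can be illustrated with plain arithmetic. The offset value and the `demo_*` helpers below are assumptions chosen only for illustration; the point is that the virtual address is a pure function of the physical address, so moving the data to a new physical page necessarily changes its virtual address.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: an arbitrary direct-map offset, used only for illustration. */
#define DEMO_PAGE_OFFSET UINT64_C(0xffff880000000000)

static uint64_t demo_pa_to_va(uint64_t pa)
{
	return pa + DEMO_PAGE_OFFSET;
}

static uint64_t demo_va_to_pa(uint64_t va)
{
	return va - DEMO_PAGE_OFFSET;
}

static void demo_direct_map_check(void)
{
	uint64_t old_pa = 0x1000;
	uint64_t new_pa = 0x200000;

	/* The mapping is a fixed offset, so it round-trips exactly. */
	assert(demo_va_to_pa(demo_pa_to_va(old_pa)) == old_pa);
	/* A different physical page always has a different virtual address. */
	assert(demo_pa_to_va(new_pa) != demo_pa_to_va(old_pa));
}
```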
      
      There are also some other issues that make kernel pages not
      migratable.  For example, the physical address may be cached somewhere and
      used later.  It is not practical to update all the caches.
      
      When doing memory hotplug in Linux, we first migrate all the pages in one
      memory device somewhere else, and then remove the device.  But if pages
      are used by the kernel, they are not migratable.  As a result, memory used
      by the kernel cannot be hot-removed.
      
      Modifying the kernel direct mapping mechanism is too difficult.  It may
      also degrade kernel performance and stability.  So we use the
      following way to do memory hotplug.
      
      [What we are doing]
      
      In Linux, memory in one numa node is divided into several zones.  One of
      the zones is ZONE_MOVABLE, which the kernel won't use.
      
      In order to implement memory hotplug in Linux, we are going to arrange all
      hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this
      memory.  To do this, we need ACPI's help.
      
      In ACPI, SRAT (System Resource Affinity Table) contains NUMA info.  The
      memory affinities in SRAT record every memory range in the system, and
      also flags specifying whether the memory range is hotpluggable.  (Please
      refer to ACPI spec 5.0, section 5.2.16)
      
      With the help of SRAT, we have to do the following two things to achieve our
      goal:
      
      1. When doing memory hot-add, allow users to arrange hotpluggable memory
         as ZONE_MOVABLE.
         (This has been done by the MOVABLE_NODE functionality in Linux.)
      
      2. When the system is booting, prevent the bootmem allocator from
         allocating hotpluggable memory for the kernel before the memory
         initialization finishes.
      
      Problem 2 is the key problem we are going to solve.  But before solving it,
      we need some preparation.  Please see below.
      
      [Preparation]
      
      Bootloader has to load the kernel image into memory.  And this memory must
      be unhotpluggable.  We cannot prevent this anyway.  So in a memory hotplug
      system, we can assume any node the kernel resides in is not hotpluggable.
      
      Before SRAT is parsed, we don't know which memory ranges are hotpluggable.
      But memblock has already started to work.  In the current kernel,
      memblock allocates the following memory before SRAT is parsed:
      
      setup_arch()
       |->memblock_x86_fill()            /* memblock is ready */
       |......
       |->early_reserve_e820_mpc_new()   /* allocate memory under 1MB */
       |->reserve_real_mode()            /* allocate memory under 1MB */
       |->init_mem_mapping()             /* allocate page tables, about 2MB to map 1GB memory */
       |->dma_contiguous_reserve()       /* specified by user, should be low */
       |->setup_log_buf()                /* specified by user, several mega bytes */
       |->relocate_initrd()              /* could be large, but will be freed after boot, should reorder */
       |->acpi_initrd_override()         /* several mega bytes */
       |->reserve_crashkernel()          /* could be large, should reorder */
       |......
       |->initmem_init()                 /* Parse SRAT */
      
      According to Tejun's advice, before SRAT is parsed, we should try our best
      to allocate memory near the kernel image.  Since the whole node the kernel
      resides in won't be hotpluggable, and for a modern server, a node may have
      at least 16GB memory, allocating several mega bytes memory around the
      kernel image won't cross to hotpluggable memory.
      
      [About this patchset]
      
      So this patchset is the preparation for the problem 2 that we want to
      solve.  It does the following:
      
      1. Make memblock able to allocate memory bottom up.
         1) Keep all the memblock APIs' prototypes unmodified.
         2) When the direction is bottom up, keep the start address greater than the
            end of the kernel image.
      
      2. Improve init_mem_mapping() to support allocating page tables in the
         bottom up direction.
      
      3. Introduce "movable_node" boot option to enable and disable this
         functionality.
      
      This patch (of 6):
      
      Create a new function __memblock_find_range_top_down to factor the
      top-down allocation out of memblock_find_in_range_node.  This is
      preparation, because we will introduce a new bottom-up allocation mode in
      the following patch.
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1402899e
    • H
      mmap: arch_get_unmapped_area(): use proper mmap base for bottom up direction · 4e99b021
      Heiko Carstens committed
      This is more or less the generic variant of commit 41aacc1e ("x86
      get_unmapped_area: Access mmap_legacy_base through mm_struct member").
      
      So effectively architectures which use their own arch_pick_mmap_layout()
      implementation but call the generic arch_get_unmapped_area() now can
      also randomize their mmap_base.
      
      All architectures which have their own arch_pick_mmap_layout() and call
      the generic arch_get_unmapped_area() (arm64, s390, tile) currently set
      mmap_base to TASK_UNMAPPED_BASE.  This is also true for the generic
      arch_pick_mmap_layout() function.  So this change is a no-op currently.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Radu Caragea <sinaelgl@gmail.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e99b021
    • W
      mm/zswap: avoid unnecessary page scanning · b349acc7
      Weijie Yang committed
      Add SetPageReclaim() before __swap_writepage() so that the page can be
      moved to the tail of the inactive list, which can avoid unnecessary page
      scanning as this page was reclaimed by the swap subsystem before.
      Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b349acc7