1. 19 Dec 2013, 9 commits
  2. 13 Dec 2013, 4 commits
    • mm: memcg: do not allow task about to OOM kill to bypass the limit · 1f14c1ac
      Authored by Johannes Weiner
      Commit 49426420 ("mm: memcg: handle non-error OOM situations more
      gracefully") allowed tasks that had already entered a memcg OOM
      condition to bypass the memcg limit on subsequent allocation
      attempts, in the hope that this would expedite finishing the page
      fault and executing the kill.

      David Rientjes is worried that this breaks memcg isolation
      guarantees, and since there is no evidence that the bypass actually
      speeds up fault processing, change it so that these subsequent
      charge attempts fail outright.  The notable exception is
      __GFP_NOFAIL charges, which are required to bypass the limit
      regardless.
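      A minimal sketch of the resulting charge-path behavior (the labels
      and surrounding code are illustrative, not the exact diff):

        /* memcg charge path */
        if (unlikely(task_in_memcg_oom(current)))
        	goto nomem;	/* used to be "goto bypass": now fail outright */

        if (gfp_mask & __GFP_NOFAIL)
        	goto bypass;	/* __GFP_NOFAIL charges must still succeed */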
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: fix race condition between memcg teardown and swapin · 96f1c58d
      Authored by Johannes Weiner
      There is a race condition between a memcg being torn down and a
      swapin triggered from a different memcg of a page that was recorded
      to belong to the exiting memcg on swapout (with the
      CONFIG_MEMCG_SWAP extension).  The result is unreclaimable pages
      pointing to dead memcgs, which can lead to anything from endless
      loops in later memcg teardown (the page is charged to all
      hierarchical parents but is not on any LRU list) to crashes from
      following the dangling memcg pointer.

      Memcgs with tasks in them cannot be torn down, and charges usually
      don't show up in memcgs without tasks.  Swapin with the
      CONFIG_MEMCG_SWAP extension is the notable exception, because it
      charges the cgroup that was recorded as owner during swapout, which
      may be empty and in the process of being torn down when a task in
      another memcg triggers the swapin:
      
        teardown:                 swapin:
      
                                  lookup_swap_cgroup_id()
                                  rcu_read_lock()
                                  mem_cgroup_lookup()
                                  css_tryget()
                                  rcu_read_unlock()
        disable css_tryget()
        call_rcu()
          offline_css()
            reparent_charges()
                                  res_counter_charge() (hierarchical!)
                                  css_put()
                                    css_free()
                                  pc->mem_cgroup = dead memcg
                                  add page to dead lru
      
      Add a final reparenting step into css_free() to make sure any such raced
      charges are moved out of the memcg before it's finally freed.
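      A sketch of the fix, assuming the existing reparenting helper is
      named mem_cgroup_reparent_charges() (simplified):

        static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
        {
        	struct mem_cgroup *memcg = mem_cgroup_from_css(css);

        	/*
        	 * Catch any charge that raced with offlining: a swapin
        	 * whose css_tryget() succeeded before trygets were
        	 * disabled may have charged after reparent_charges() ran.
        	 */
        	mem_cgroup_reparent_charges(memcg);
        	__mem_cgroup_free(memcg);
        }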
      
      In the longer term it would be cleaner to have the css_tryget() and
      the res_counter charge under the same RCU lock section, so that the
      charge reparenting is deferred until the last charge whose tryget
      succeeded is visible.  But this would require more invasive changes
      that are harder to evaluate and backport to stable, so better defer
      them to a separate change set.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • thp: move preallocated PTE page table on move_huge_pmd() · 3592806c
      Authored by Kirill A. Shutemov
      Andrey Vagin reported a crash on the VM_BUG_ON() in
      pgtable_pmd_page_dtor() with the following backtrace:
      
        free_pgd_range+0x2bf/0x410
        free_pgtables+0xce/0x120
        unmap_region+0xe0/0x120
        do_munmap+0x249/0x360
        move_vma+0x144/0x270
        SyS_mremap+0x3b9/0x510
        system_call_fastpath+0x16/0x1b
      
      The crash can be reproduced with this test case:
      
        #define _GNU_SOURCE
        #include <sys/mman.h>
        #include <stdio.h>
        #include <unistd.h>
      
        #define MB (1024 * 1024UL)
        #define GB (1024 * MB)
      
        int main(int argc, char **argv)
        {
      	char *p;
      	int i;
      
      	p = mmap((void *) GB, 10 * MB, PROT_READ | PROT_WRITE,
      			MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
      	for (i = 0; i < 10 * MB; i += 4096)
      		p[i] = 1;
      	mremap(p, 10 * MB, 10 * MB, MREMAP_FIXED | MREMAP_MAYMOVE, 2 * GB);
      	return 0;
        }
      
      With the split PMD lock, preallocated PTE page tables for THP pages
      are now stored per PMD table.  This means they must be moved to the
      other PMD table if the huge PMD is moved there.
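      A sketch of the fix in move_huge_pmd(), assuming the generic
      deposit/withdraw helpers are used (simplified):

        if (new_ptl != old_ptl) {
        	pgtable_t pgtable;

        	/* Move the preallocated PTE table along with the huge pmd. */
        	pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
        	pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
        }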
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Andrey Vagin <avagin@openvz.org>
      Tested-by: Andrey Vagin <avagin@openvz.org>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: do not declare OOM from __GFP_NOFAIL allocations · a0d8b00a
      Authored by Johannes Weiner
      Commit 84235de3 ("fs: buffer: move allocation failure loop into the
      allocator") started recognizing __GFP_NOFAIL in memory cgroups but
      forgot to disable the OOM killer.
      
      Any task that does not fail the allocation will also not enter the
      OOM completion path.  So don't declare an OOM state in this case, or
      it'll be leaked and the task will be able to bypass the limit until
      the next userspace-triggered page fault cleans up the OOM state.
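      A sketch of the intended logic in the charge path (the oom flag and
      its placement are illustrative):

        /* Never declare OOM for a charge that cannot fail: the task
         * would never run the OOM completion path to clean it up. */
        if (!(gfp_mask & __GFP_NOFAIL))
        	oom = true;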
      Reported-by: William Dauchy <wdauchy@gmail.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[3.12.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 02 Dec 2013, 1 commit
    • security: shmem: implement kernel private shmem inodes · c7277090
      Authored by Eric Paris
      We have a problem where the big_key key storage implementation uses
      a shmem-backed inode to hold the key contents.  Because of this
      implementation detail, LSM checks are being done between processes
      trying to read the keys and the tmpfs-backed inode.  The LSM checks
      are already handled at the key interface level and should not be
      enforced at the inode level (since the inode is an implementation
      detail, not a part of the security model).
      
      This patch implements a new function shmem_kernel_file_setup(),
      which returns the equivalent of shmem_file_setup(), only the
      underlying inode has S_PRIVATE set.  This means that all LSM checks
      for the inode in question are skipped.  It should only be used for
      kernel-internal operations where the inode is not exposed to
      userspace without proper LSM checking.  It is possible that some
      other users of shmem_file_setup() should use the new interface, but
      this has not been explored.
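      A sketch of the new interface, assuming an internal helper that
      threads the extra inode flags through (names follow the description
      above):

        static struct file *__shmem_file_setup(const char *name, loff_t size,
        		unsigned long flags, unsigned int i_flags);

        /* Like shmem_file_setup(), but the inode is S_PRIVATE, so LSM
         * checks on it are skipped.  Kernel-internal use only. */
        struct file *shmem_kernel_file_setup(const char *name, loff_t size,
        		unsigned long flags)
        {
        	return __shmem_file_setup(name, size, flags, S_PRIVATE);
        }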
      
      Reproducing this bug is a little bit difficult.  The steps I used on
      Fedora are:
      
       (1) Turn off selinux enforcing:
      
      	setenforce 0
      
       (2) Create a huge key
      
      	k=`dd if=/dev/zero bs=8192 count=1 | keyctl padd big_key test-key @s`
      
       (3) Access the key in another context:
      
      	runcon system_u:system_r:httpd_t:s0-s0:c0.c1023 keyctl print $k >/dev/null
      
       (4) Examine the audit logs:
      
      	ausearch -m AVC -i --subject httpd_t | audit2allow
      
      If the last command's output includes a line that looks like:
      
      	allow httpd_t user_tmpfs_t:file { open read };
      
      then an inode check took place between httpd and the tmpfs
      filesystem.  With this patch no such denial will be seen.
      (NOTE: you should clear your audit log if you have tested for this
      previously.)

      (Please return your box to enforcing.)
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: linux-mm@kvack.org
  4. 22 Nov 2013, 3 commits
    • mm, mempolicy: silence gcc warning · b7a9f420
      Authored by David Rientjes
      Fengguang Wu reports that compiling mm/mempolicy.c results in a warning:
      
        mm/mempolicy.c: In function 'mpol_to_str':
        mm/mempolicy.c:2878:2: error: format not a string literal and no format arguments
      
      Kees says this is because he is using -Wformat-security.
      
      Silence the warning.
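      The warning fires because a non-literal string reaches a format
      argument; a hedged sketch of the usual -Wformat-security fix
      (variable names are illustrative):

        /* Before: the mode name is interpreted as a format string. */
        p += snprintf(p, buffer + maxlen - p, policy_modes[mode]);

        /* After: pass it as data to an explicit "%s" format, so a '%'
         * in the data can never be interpreted as a conversion. */
        p += snprintf(p, buffer + maxlen - p, "%s", policy_modes[mode]);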
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Suggested-by: Kees Cook <keescook@chromium.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: hugetlbfs: fix hugetlbfs optimization · 27c73ae7
      Authored by Andrea Arcangeli
      Commit 7cb2ef56 ("mm: fix aio performance regression for database
      caused by THP") can cause a dereference of a dangling pointer if
      split_huge_page runs during PageHuge() while there are updates to
      the tail_page->private field.

      It also runs compound_head() twice for hugetlbfs, and
      compound_head() plus compound_trans_head() for THP, when a single
      call is needed in both cases.
      
      The new code within the PageSlab() check doesn't need to verify that the
      THP page size is never bigger than the smallest hugetlbfs page size, to
      avoid memory corruption.
      
      A longstanding theoretical race condition was found while fixing the
      above (see the change right after the skip_unlock label, that is
      relevant for the compound_lock path too).
      
      By re-establishing the _mapcount tail refcounting for all compound
      pages, this also fixes the below problem:
      
        echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
      
        BUG: Bad page state in process bash  pfn:59a01
        page:ffffea000139b038 count:0 mapcount:10 mapping:          (null) index:0x0
        page flags: 0x1c00000000008000(tail)
        Modules linked in:
        CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x55/0x76
          bad_page+0xd5/0x130
          free_pages_prepare+0x213/0x280
          __free_pages+0x36/0x80
          update_and_free_page+0xc1/0xd0
          free_pool_huge_page+0xc2/0xe0
          set_max_huge_pages.part.58+0x14c/0x220
          nr_hugepages_store_common.isra.60+0xd0/0xf0
          nr_hugepages_store+0x13/0x20
          kobj_attr_store+0xf/0x20
          sysfs_write_file+0x189/0x1e0
          vfs_write+0xc5/0x1f0
          SyS_write+0x55/0xb0
          system_call_fastpath+0x16/0x1b
      Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: give transparent hugepage code a separate copy_page · 30b0a105
      Authored by Dave Hansen
      Right now, the migration code in migrate_page_copy() uses copy_huge_page()
      for hugetlbfs and thp pages:
      
             if (PageHuge(page) || PageTransHuge(page))
                      copy_huge_page(newpage, page);
      
      So, yay for code reuse.  But:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
      
      and a non-hugetlbfs page has no page_hstate().  This works 99% of the
      time because page_hstate() determines the hstate from the page order
      alone.  Since the page order of a THP page matches the default hugetlbfs
      page order, it works.
      
      But, if you change the default huge page size on the boot command-line
      (say default_hugepagesz=1G), then we might not even *have* a 2MB hstate
      so page_hstate() returns null and copy_huge_page() oopses pretty fast
      since copy_huge_page() dereferences the hstate:
      
        void copy_huge_page(struct page *dst, struct page *src)
        {
              struct hstate *h = page_hstate(src);
              if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) {
        ...
      
      Mel noticed that the migration code is really the only user of these
      functions.  This moves all the copy code over to migrate.c and makes
      copy_huge_page() work for THP by checking for it explicitly.
      
      I believe the bug was introduced in commit b32967ff ("mm: numa: Add
      THP migration for the NUMA working set scanning fault case")
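      A sketch of the THP-aware copy loop, simplified (helper names
      approximate the migration code; the gigantic-page special case is
      omitted):

        static void copy_huge_page(struct page *dst, struct page *src)
        {
        	int i, nr_pages;

        	if (PageHuge(src)) {
        		/* hugetlbfs page: the hstate is valid here */
        		struct hstate *h = page_hstate(src);
        		nr_pages = pages_per_huge_page(h);
        	} else {
        		/* THP page: no hstate involved at all */
        		VM_BUG_ON(!PageTransHuge(src));
        		nr_pages = hpage_nr_pages(src);
        	}

        	for (i = 0; i < nr_pages; i++) {
        		cond_resched();
        		copy_highpage(dst + i, src + i);
        	}
        }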
      
      [akpm@linux-foundation.org: fix coding-style and comment text, per Naoya Horiguchi]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 21 Nov 2013, 1 commit
  6. 15 Nov 2013, 13 commits
    • kfifo API type safety · 498d319b
      Authored by Stefani Seibold
      This patch enhances the type safety of the kfifo API.  It is now
      safe to put const data into a non-const FIFO, and the API will now
      generate a compiler warning when reading from the FIFO into a
      destination that points to const data.

      As a side effect, kfifo_put() now expects the value of an element
      instead of a pointer to it.  This was suggested by Russell King.  It
      makes kfifo_put() easier to use, since there is no need for a helper
      variable just to take an address, or to pass integers of different
      sizes.

      IMHO the API break is okay, since there are currently only six users
      of kfifo_put().

      The code is also made cleaner by kicking out the "if (0)"
      expressions.
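      A sketch of the new calling convention (module boilerplate omitted;
      the FIFO name and size are illustrative):

        #include <linux/kfifo.h>

        static DEFINE_KFIFO(events, int, 32);	/* 32-element FIFO of ints */

        static void example(void)
        {
        	int v;

        	/* New API: pass the element's value, not its address. */
        	if (!kfifo_put(&events, 42))
        		pr_warn("fifo full, event dropped\n");

        	/* Reads still take a pointer -- and it must not be const. */
        	while (kfifo_get(&events, &v))
        		pr_info("got %d\n", v);
        }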
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Stefani Seibold <stefani@seibold.net>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Hauke Mehrtens <hauke@hauke-m.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: create a separate slab for page->ptl allocation · ea1e7ed3
      Authored by Kirill A. Shutemov
      If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled, spinlock_t on
      x86_64 is 72 bytes.  For page->ptl it would be allocated from the
      kmalloc-96 slab, so we lose 24 bytes on each.  An average system can
      easily allocate a few tens of thousands of page->ptl, so the
      overhead is significant.

      Let's create a separate slab for page->ptl allocation to solve this.
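      A sketch of the dedicated slab, close to what the description
      implies (assuming this path is only built when page->ptl is
      dynamically allocated):

        static struct kmem_cache *page_ptl_cachep;

        void __init ptlock_cache_init(void)
        {
        	/* Exactly sizeof(spinlock_t), not the next kmalloc bucket. */
        	page_ptl_cachep = kmem_cache_create("page->ptl",
        			sizeof(spinlock_t), 0, SLAB_PANIC, NULL);
        }

        bool ptlock_alloc(struct page *page)
        {
        	spinlock_t *ptl;

        	ptl = kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL);
        	if (!ptl)
        		return false;
        	page->ptl = ptl;
        	return true;
        }

        void ptlock_free(struct page *page)
        {
        	kmem_cache_free(page_ptl_cachep, page->ptl);
        }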
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: properly separate the bloated ptl from the regular case · 539edb58
      Authored by Peter Zijlstra
      Use kernel/bounds.c to convert the build-time spinlock_t size check
      into a preprocessor symbol, and apply that to properly separate the
      page::ptl situation.
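      A sketch of the mechanism (symbol names follow current kernel
      conventions and may differ from the patch's exact change):

        /* kernel/bounds.c: emit the size into include/generated/bounds.h */
        DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));

        /* include/linux/mm_types.h: the preprocessor can now decide
         * whether the lock fits into the word page->ptl occupies. */
        #define ALLOC_SPLIT_PTLOCKS	(SPINLOCK_SIZE > BITS_PER_LONG/8)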
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: dynamically allocate page->ptl if it cannot be embedded to struct page · 49076ec2
      Authored by Kirill A. Shutemov
      If the split page table lock is in use, we embed the lock into the
      struct page of the table's page.  We have to disable the split lock
      if spinlock_t is too big to be embedded, as when DEBUG_SPINLOCK or
      DEBUG_LOCK_ALLOC is enabled.

      This patch adds support for dynamic allocation of the split page
      table lock if we can't embed it in struct page.

      page->ptl is unsigned long now, and we use it as a spinlock_t if
      sizeof(spinlock_t) <= sizeof(long); otherwise it's a pointer to a
      spinlock_t.

      The spinlock_t is allocated in pgtable_page_ctor() for PTE tables
      and in pgtable_pmd_page_ctor() for PMD tables.  All other helpers
      are converted to support the dynamically allocated page->ptl.
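      A sketch of the accessor implied by the description (the real
      helper may differ in detail):

        static inline spinlock_t *ptlock_ptr(struct page *page)
        {
        	if (sizeof(spinlock_t) > sizeof(page->ptl))
        		return (spinlock_t *) page->ptl;	/* ptl holds a pointer */
        	else
        		return (spinlock_t *) &page->ptl;	/* lock is embedded */
        }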
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: implement split page table lock for PMD level · e009bb30
      Authored by Kirill A. Shutemov
      The basic idea is the same as with the PTE level: the lock is
      embedded into the struct page of the table's page.

      We can't use mm->pmd_huge_pte to store pgtables for THP, since we
      don't take mm->page_table_lock anymore.  Let's reuse page->lru of
      the table's page for that.

      pgtable_pmd_page_ctor() returns true if initialization is successful
      and false otherwise.  The current implementation never fails, but
      the assumption that the constructor can fail will help port it to
      -rt, where spinlock_t is rather huge and cannot be embedded into
      struct page -- dynamic allocation is required.
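      A sketch of the constructor, assuming pmd_huge_pte lives behind
      CONFIG_TRANSPARENT_HUGEPAGE:

        static inline bool pgtable_pmd_page_ctor(struct page *page)
        {
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        	page->pmd_huge_pte = NULL;	/* deposited PTE tables hang here */
        #endif
        	spin_lock_init(&page->ptl);	/* the split PMD lock itself */
        	return true;			/* cannot fail yet; see above */
        }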
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: convert the rest to new page table lock api · c4088ebd
      Authored by Kirill A. Shutemov
      Only trivial cases left. Let's convert them altogether.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: convert hugetlbfs to use split pmd lock · cb900f41
      Authored by Kirill A. Shutemov
      Hugetlb supports multiple page sizes.  We use the split lock only
      for the PMD level, but not for PUD.
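      A sketch of the lock selection, close to what the description
      implies (helper names are illustrative):

        static inline spinlock_t *huge_pte_lockptr(struct hstate *h,
        		struct mm_struct *mm, pte_t *pte)
        {
        	if (huge_page_size(h) == PMD_SIZE)
        		return pmd_lockptr(mm, (pmd_t *) pte);	/* split lock */
        	VM_BUG_ON(huge_page_size(h) == PAGE_SIZE);
        	return &mm->page_table_lock;	/* PUD and others: coarse lock */
        }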
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: do not access mm->pmd_huge_pte directly · c389a250
      Authored by Kirill A. Shutemov
      Currently mm->pmd_huge_pte is protected by the page table lock.
      That will not work with the split lock; we have to have a per-pmd
      pmd_huge_pte for proper access serialization.

      For now, let's just introduce a wrapper to access mm->pmd_huge_pte.
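      A sketch of the wrapper: today it still resolves to the per-mm
      field, and a later patch can redirect it to per-pmd storage without
      touching the callers:

        #define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)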
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: move ptl taking inside page_check_address_pmd() · 117b0791
      Authored by Kirill A. Shutemov
      With the split page table lock we can't know which lock we need to
      take before we find the relevant pmd.

      Let's move the lock taking inside the function.
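      A sketch of the resulting calling convention (the flag name follows
      the existing enum; details may differ):

        spinlock_t *ptl;
        pmd_t *pmd;

        /* Finds the pmd and returns with its lock held via *ptl. */
        pmd = page_check_address_pmd(page, mm, address,
        		PAGE_CHECK_ADDRESS_PMD_FLAG, &ptl);
        if (pmd) {
        	/* ... inspect or modify the pmd under its lock ... */
        	spin_unlock(ptl);
        }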
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, thp: change pmd_trans_huge_lock() to return taken lock · bf929152
      Authored by Kirill A. Shutemov
      With the split ptlock it's important to know which lock
      pmd_trans_huge_lock() took.  This patch adds one more parameter to
      the function to return the lock.

      In most places migration to the new API is trivial.  The exception
      is move_huge_pmd(): we need to take two locks if the pmd tables are
      different.
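      A sketch of a typical caller after the conversion (assuming the
      function still returns 1 when the pmd is huge and stable):

        spinlock_t *ptl;

        if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
        	/* pmd is huge and stable, ptl is held */
        	/* ... operate on the huge pmd ... */
        	spin_unlock(ptl);
        }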
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: convert mm->nr_ptes to atomic_long_t · e1f56c89
      Authored by Kirill A. Shutemov
      With the split page table lock for the PMD level we can't hold
      mm->page_table_lock while updating nr_ptes.

      Let's convert it to atomic_long_t to avoid races.
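      A sketch of the conversion at the use sites:

        long nr;

        atomic_long_inc(&mm->nr_ptes);		/* was: mm->nr_ptes++ */
        atomic_long_dec(&mm->nr_ptes);		/* was: mm->nr_ptes-- */
        nr = atomic_long_read(&mm->nr_ptes);	/* plain reads */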
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: avoid increase sizeof(struct page) due to split page table lock · e9bb18c7
      Authored by Kirill A. Shutemov
      Alex Thorlton noticed that some massively threaded workloads perform
      poorly if THP is enabled.  This patchset fixes that by introducing a
      split page table lock for PMD tables.  hugetlbfs is not covered yet.

      This patchset is based on work by Naoya Horiguchi.
      
      : akpm result summary:
      :
      : THP off, v3.12-rc2: 18.059261877 seconds time elapsed
      : THP off, patched:   16.768027318 seconds time elapsed
      :
      : THP on, v3.12-rc2:  42.162306788 seconds time elapsed
      : THP on, patched:    8.397885779 seconds time elapsed
      :
      : HUGETLB, v3.12-rc2: 47.574936948 seconds time elapsed
      : HUGETLB, patched:   19.447481153 seconds time elapsed
      
      THP off, v3.12-rc2:
      -------------------
      
       Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
      
          1037072.835207 task-clock                #   57.426 CPUs utilized            ( +-  3.59% )
                  95,093 context-switches          #    0.092 K/sec                    ( +-  3.93% )
                     140 cpu-migrations            #    0.000 K/sec                    ( +-  5.28% )
              10,000,550 page-faults               #    0.010 M/sec                    ( +-  0.00% )
       2,455,210,400,261 cycles                    #    2.367 GHz                      ( +-  3.62% ) [83.33%]
       2,429,281,882,056 stalled-cycles-frontend   #   98.94% frontend cycles idle     ( +-  3.67% ) [83.33%]
       1,975,960,019,659 stalled-cycles-backend    #   80.48% backend  cycles idle     ( +-  3.88% ) [66.68%]
          46,503,296,013 instructions              #    0.02  insns per cycle
                                                   #   52.24  stalled cycles per insn  ( +-  3.21% ) [83.34%]
           9,278,997,542 branches                  #    8.947 M/sec                    ( +-  4.00% ) [83.34%]
              89,881,640 branch-misses             #    0.97% of all branches          ( +-  1.17% ) [83.33%]
      
            18.059261877 seconds time elapsed                                          ( +-  2.65% )
      
      THP on, v3.12-rc2:
      ------------------
      
       Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
      
          3114745.395974 task-clock                #   73.875 CPUs utilized            ( +-  1.84% )
                 267,356 context-switches          #    0.086 K/sec                    ( +-  1.84% )
                      99 cpu-migrations            #    0.000 K/sec                    ( +-  1.40% )
                  58,313 page-faults               #    0.019 K/sec                    ( +-  0.28% )
       7,416,635,817,510 cycles                    #    2.381 GHz                      ( +-  1.83% ) [83.33%]
       7,342,619,196,993 stalled-cycles-frontend   #   99.00% frontend cycles idle     ( +-  1.88% ) [83.33%]
       6,267,671,641,967 stalled-cycles-backend    #   84.51% backend  cycles idle     ( +-  2.03% ) [66.67%]
         117,819,935,165 instructions              #    0.02  insns per cycle
                                                   #   62.32  stalled cycles per insn  ( +-  4.39% ) [83.34%]
          28,899,314,777 branches                  #    9.278 M/sec                    ( +-  4.48% ) [83.34%]
              71,787,032 branch-misses             #    0.25% of all branches          ( +-  1.03% ) [83.33%]
      
            42.162306788 seconds time elapsed                                          ( +-  1.73% )
      
      HUGETLB, v3.12-rc2:
      -------------------
      
       Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):
      
          2588052.787264 task-clock                #   54.400 CPUs utilized            ( +-  3.69% )
                 246,831 context-switches          #    0.095 K/sec                    ( +-  4.15% )
                     138 cpu-migrations            #    0.000 K/sec                    ( +-  5.30% )
                  21,027 page-faults               #    0.008 K/sec                    ( +-  0.01% )
       6,166,666,307,263 cycles                    #    2.383 GHz                      ( +-  3.68% ) [83.33%]
       6,086,008,929,407 stalled-cycles-frontend   #   98.69% frontend cycles idle     ( +-  3.77% ) [83.33%]
       5,087,874,435,481 stalled-cycles-backend    #   82.51% backend  cycles idle     ( +-  4.41% ) [66.67%]
         133,782,831,249 instructions              #    0.02  insns per cycle
                                                   #   45.49  stalled cycles per insn  ( +-  4.30% ) [83.34%]
          34,026,870,541 branches                  #   13.148 M/sec                    ( +-  4.24% ) [83.34%]
              68,670,942 branch-misses             #    0.20% of all branches          ( +-  3.26% ) [83.33%]
      
            47.574936948 seconds time elapsed                                          ( +-  2.09% )
      
      THP off, patched:
      -----------------
      
       Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
      
           943301.957892 task-clock                #   56.256 CPUs utilized            ( +-  3.01% )
                  86,218 context-switches          #    0.091 K/sec                    ( +-  3.17% )
                     121 cpu-migrations            #    0.000 K/sec                    ( +-  6.64% )
              10,000,551 page-faults               #    0.011 M/sec                    ( +-  0.00% )
       2,230,462,457,654 cycles                    #    2.365 GHz                      ( +-  3.04% ) [83.32%]
       2,204,616,385,805 stalled-cycles-frontend   #   98.84% frontend cycles idle     ( +-  3.09% ) [83.32%]
       1,778,640,046,926 stalled-cycles-backend    #   79.74% backend  cycles idle     ( +-  3.47% ) [66.69%]
          45,995,472,617 instructions              #    0.02  insns per cycle
                                                   #   47.93  stalled cycles per insn  ( +-  2.51% ) [83.34%]
           9,179,700,174 branches                  #    9.731 M/sec                    ( +-  3.04% ) [83.35%]
              89,166,529 branch-misses             #    0.97% of all branches          ( +-  1.45% ) [83.33%]
      
            16.768027318 seconds time elapsed                                          ( +-  2.47% )
      
      THP on, patched:
      ----------------
      
       Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
      
           458793.837905 task-clock                #   54.632 CPUs utilized            ( +-  0.79% )
                  41,831 context-switches          #    0.091 K/sec                    ( +-  0.97% )
                      98 cpu-migrations            #    0.000 K/sec                    ( +-  1.66% )
                  57,829 page-faults               #    0.126 K/sec                    ( +-  0.62% )
       1,077,543,336,716 cycles                    #    2.349 GHz                      ( +-  0.81% ) [83.33%]
       1,067,403,802,964 stalled-cycles-frontend   #   99.06% frontend cycles idle     ( +-  0.87% ) [83.33%]
         864,764,616,143 stalled-cycles-backend    #   80.25% backend  cycles idle     ( +-  0.73% ) [66.68%]
          16,129,177,440 instructions              #    0.01  insns per cycle
                                                   #   66.18  stalled cycles per insn  ( +-  7.94% ) [83.35%]
           3,618,938,569 branches                  #    7.888 M/sec                    ( +-  8.46% ) [83.36%]
              33,242,032 branch-misses             #    0.92% of all branches          ( +-  2.02% ) [83.32%]
      
             8.397885779 seconds time elapsed                                          ( +-  0.18% )
      
      HUGETLB, patched:
      -----------------
      
       Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):
      
           395353.076837 task-clock                #   20.329 CPUs utilized            ( +-  8.16% )
                  55,730 context-switches          #    0.141 K/sec                    ( +-  5.31% )
                     138 cpu-migrations            #    0.000 K/sec                    ( +-  4.24% )
                  21,027 page-faults               #    0.053 K/sec                    ( +-  0.00% )
         930,219,717,244 cycles                    #    2.353 GHz                      ( +-  8.21% ) [83.32%]
         914,295,694,103 stalled-cycles-frontend   #   98.29% frontend cycles idle     ( +-  8.35% ) [83.33%]
         704,137,950,187 stalled-cycles-backend    #   75.70% backend  cycles idle     ( +-  9.16% ) [66.69%]
          30,541,538,385 instructions              #    0.03  insns per cycle
                                                   #   29.94  stalled cycles per insn  ( +-  3.98% ) [83.35%]
           8,415,376,631 branches                  #   21.286 M/sec                    ( +-  3.61% ) [83.36%]
              32,645,478 branch-misses             #    0.39% of all branches          ( +-  3.41% ) [83.32%]
      
            19.447481153 seconds time elapsed                                          ( +-  2.00% )
      
      This patch (of 11):
      
      CONFIG_GENERIC_LOCKBREAK increases sizeof(spinlock_t) to 8 bytes.
      That increases sizeof(struct page) by 4 bytes on 32-bit systems if
      the split page table lock is in use, since page->ptl shares space in
      a union with longs and pointers.

      Let's disable the split page table lock on 32-bit systems with
      GENERIC_LOCKBREAK enabled.
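      One plausible way to express that guard, as a purely hypothetical
      sketch (the patch's actual change may use Kconfig instead):

        /* Hypothetical: spinlock_t no longer fits in a 32-bit word when
         * GENERIC_LOCKBREAK doubles it, so fall back to the single
         * mm->page_table_lock rather than grow struct page. */
        #if defined(CONFIG_GENERIC_LOCKBREAK) && BITS_PER_LONG == 32
        #define USE_SPLIT_PTLOCKS	0
        #else
        #define USE_SPLIT_PTLOCKS	(NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
        #endif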
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: drop actor argument of do_generic_file_read() · b77d88d4
      Authored by Kirill A. Shutemov
      There's only one caller of do_generic_file_read() and the only actor is
      file_read_actor().  No reason to have a callback parameter.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 13 Nov 2013, 9 commits