1. 16 11月, 2012 1 次提交
  2. 26 10月, 2012 4 次提交
    • D
      mm, numa: avoid setting zone_reclaim_mode unless a node is sufficiently distant · 6b187d02
      David Rientjes 提交于
      Commit 957f822a ("mm, numa: reclaim from all nodes within reclaim
      distance") caused zone_reclaim_mode to be set for all systems where two
      nodes are within RECLAIM_DISTANCE of each other.  This is the opposite
      of what we actually want: zone_reclaim_mode should be set if two nodes
      are sufficiently distant.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NJulian Wollrath <jwollrath@web.de>
      Tested-by: NJulian Wollrath <jwollrath@web.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Patrik Kullman <patrik.kullman@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b187d02
    • G
      mm/mmu_notifier: allocate mmu_notifier in advance · 35cfa2b0
      Gavin Shan 提交于
      While allocating mmu_notifier with parameter GFP_KERNEL, swap would start
      to work in case of tight available memory.  Eventually, that would lead to
      a deadlock while the swap deamon swaps anonymous pages.  It was caused by
      commit e0f3c3f7 ("mm/mmu_notifier: init notifier if necessary").
      
        =================================
        [ INFO: inconsistent lock state ]
        3.7.0-rc1+ #518 Not tainted
        ---------------------------------
        inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
        kswapd0/35 [HC0[0]:SC0[0]:HE1:SE1] takes:
         (&mapping->i_mmap_mutex){+.+.?.}, at: page_referenced+0x9c/0x2e0
        {RECLAIM_FS-ON-W} state was registered at:
           mark_held_locks+0x86/0x150
           lockdep_trace_alloc+0x67/0xc0
           kmem_cache_alloc_trace+0x33/0x230
           do_mmu_notifier_register+0x87/0x180
           mmu_notifier_register+0x13/0x20
           kvm_dev_ioctl+0x428/0x510
           do_vfs_ioctl+0x98/0x570
           sys_ioctl+0x91/0xb0
           system_call_fastpath+0x16/0x1b
        irq event stamp: 825
        hardirqs last  enabled at (825): _raw_spin_unlock_irq+0x30/0x60
        hardirqs last disabled at (824): _raw_spin_lock_irq+0x19/0x80
        softirqs last  enabled at (0): copy_process+0x630/0x17c0
        softirqs last disabled at (0): (null)
        ...
      
      Simply back out the above commit, which was a small performance
      optimization.
      Signed-off-by: NGavin Shan <shangw@linux.vnet.ibm.com>
      Reported-by: NAndrea Righi <andrea@betterlinux.com>
      Tested-by: NAndrea Righi <andrea@betterlinux.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Sagi Grimberg <sagig@mellanox.co.il>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      35cfa2b0
    • B
      mm/page_alloc.c:alloc_contig_range(): return early for err path · 86a595f9
      Bob Liu 提交于
      If start_isolate_page_range() failed, unset_migratetype_isolate() has been
      done inside it.
      Signed-off-by: NBob Liu <lliubbo@gmail.com>
      Cc: Ni zhan Chen <nizhan.chen@gmail.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86a595f9
    • J
      mm: fix XFS oops due to dirty pages without buffers on s390 · ef5d437f
      Jan Kara 提交于
      On s390 any write to a page (even from kernel itself) sets architecture
      specific page dirty bit.  Thus when a page is written to via buffered
      write, HW dirty bit gets set and when we later map and unmap the page,
      page_remove_rmap() finds the dirty bit and calls set_page_dirty().
      
      Dirtying of a page which shouldn't be dirty can cause all sorts of
      problems to filesystems.  The bug we observed in practice is that
      buffers from the page get freed, so when the page gets later marked as
      dirty and writeback writes it, XFS crashes due to an assertion
      BUG_ON(!PagePrivate(page)) in page_buffers() called from
      xfs_count_page_state().
      
      Similar problem can also happen when zero_user_segment() call from
      xfs_vm_writepage() (or block_write_full_page() for that matter) set the
      hardware dirty bit during writeback, later buffers get freed, and then
      page unmapped.
      
      Fix the issue by ignoring s390 HW dirty bit for page cache pages of
      mappings with mapping_cap_account_dirty().  This is safe because for
      such mappings when a page gets marked as writeable in PTE it is also
      marked dirty in do_wp_page() or do_page_fault().  When the dirty bit is
      cleared by clear_page_dirty_for_io(), the page gets writeprotected in
      page_mkclean().  So pagecache page is writeable if and only if it is
      dirty.
      
      Thanks to Hugh Dickins for pointing out mapping has to have
      mapping_cap_account_dirty() for things to work and proposing a cleaned
      up variant of the patch.
      
      The patch has survived about two hours of running fsx-linux on tmpfs
      while heavily swapping and several days of running on out build machines
      where the original problem was triggered.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: <stable@vger.kernel.org>		[3.0+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef5d437f
  3. 25 10月, 2012 1 次提交
  4. 23 10月, 2012 1 次提交
  5. 20 10月, 2012 2 次提交
  6. 17 10月, 2012 1 次提交
    • D
      mm, mempolicy: fix printing stack contents in numa_maps · 32f8516a
      David Rientjes 提交于
      When reading /proc/pid/numa_maps, it's possible to return the contents of
      the stack where the mempolicy string should be printed if the policy gets
      freed from beneath us.
      
      This happens because mpol_to_str() may return an error the
      stack-allocated buffer is then printed without ever being stored.
      
      There are two possible error conditions in mpol_to_str():
      
       - if the buffer allocated is insufficient for the string to be stored,
         and
      
       - if the mempolicy has an invalid mode.
      
      The first error condition is not triggered in any of the callers to
      mpol_to_str(): at least 50 bytes is always allocated on the stack and this
      is sufficient for the string to be written.  A future patch should convert
      this into BUILD_BUG_ON() since we know the maximum strlen possible, but
      that's not -rc material.
      
      The second error condition is possible if a race occurs in dropping a
      reference to a task's mempolicy causing it to be freed during the read().
      The slab poison value is then used for the mode and mpol_to_str() returns
      -EINVAL.
      
      This race is only possible because get_vma_policy() believes that
      mm->mmap_sem protects task->mempolicy, which isn't true.  The exit path
      does not hold mm->mmap_sem when dropping the reference or setting
      task->mempolicy to NULL: it uses task_lock(task) instead.
      
      Thus, it's required for the caller of a task mempolicy to hold
      task_lock(task) while grabbing the mempolicy and reading it.  Callers with
      a vma policy store their mempolicy earlier and can simply increment the
      reference count so it's guaranteed not to be freed.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      32f8516a
  7. 15 10月, 2012 1 次提交
  8. 13 10月, 2012 2 次提交
    • J
      vfs: make path_openat take a struct filename pointer · 669abf4e
      Jeff Layton 提交于
      ...and fix up the callers. For do_file_open_root, just declare a
      struct filename on the stack and fill out the .name field. For
      do_filp_open, make it also take a struct filename pointer, and fix up its
      callers to call it appropriately.
      
      For filp_open, add a variant that takes a struct filename pointer and turn
      filp_open into a wrapper around it.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      669abf4e
    • J
      vfs: define struct filename and have getname() return it · 91a27b2a
      Jeff Layton 提交于
      getname() is intended to copy pathname strings from userspace into a
      kernel buffer. The result is just a string in kernel space. It would
      however be quite helpful to be able to attach some ancillary info to
      the string.
      
      For instance, we could attach some audit-related info to reduce the
      amount of audit-related processing needed. When auditing is enabled,
      we could also call getname() on the string more than once and not
      need to recopy it from userspace.
      
      This patchset converts the getname()/putname() interfaces to return
      a struct instead of a string. For now, the struct just tracks the
      string in kernel space and the original userland pointer for it.
      
      Later, we'll add other information to the struct as it becomes
      convenient.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      91a27b2a
  9. 10 10月, 2012 3 次提交
    • J
      mm, slab: release slab_mutex earlier in kmem_cache_destroy() · 210ed9de
      Jiri Kosina 提交于
      Commit 1331e7a1 ("rcu: Remove _rcu_barrier() dependency on
      __stop_machine()") introduced slab_mutex -> cpu_hotplug.lock dependency
      through kmem_cache_destroy() -> rcu_barrier() -> _rcu_barrier() ->
      get_online_cpus().
      
      Lockdep thinks that this might actually result in ABBA deadlock,
      and reports it as below:
      
      === [ cut here ] ===
       ======================================================
       [ INFO: possible circular locking dependency detected ]
       3.6.0-rc5-00004-g0d8ee37e #143 Not tainted
       -------------------------------------------------------
       kworker/u:2/40 is trying to acquire lock:
        (rcu_sched_state.barrier_mutex){+.+...}, at: [<ffffffff810f2126>] _rcu_barrier+0x26/0x1e0
      
       but task is already holding lock:
        (slab_mutex){+.+.+.}, at: [<ffffffff81176e15>] kmem_cache_destroy+0x45/0xe0
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #2 (slab_mutex){+.+.+.}:
              [<ffffffff810ae1e2>] validate_chain+0x632/0x720
              [<ffffffff810ae5d9>] __lock_acquire+0x309/0x530
              [<ffffffff810ae921>] lock_acquire+0x121/0x190
              [<ffffffff8155d4cc>] __mutex_lock_common+0x5c/0x450
              [<ffffffff8155d9ee>] mutex_lock_nested+0x3e/0x50
              [<ffffffff81558cb5>] cpuup_callback+0x2f/0xbe
              [<ffffffff81564b83>] notifier_call_chain+0x93/0x140
              [<ffffffff81076f89>] __raw_notifier_call_chain+0x9/0x10
              [<ffffffff8155719d>] _cpu_up+0xba/0x14e
              [<ffffffff815572ed>] cpu_up+0xbc/0x117
              [<ffffffff81ae05e3>] smp_init+0x6b/0x9f
              [<ffffffff81ac47d6>] kernel_init+0x147/0x1dc
              [<ffffffff8156ab44>] kernel_thread_helper+0x4/0x10
      
       -> #1 (cpu_hotplug.lock){+.+.+.}:
              [<ffffffff810ae1e2>] validate_chain+0x632/0x720
              [<ffffffff810ae5d9>] __lock_acquire+0x309/0x530
              [<ffffffff810ae921>] lock_acquire+0x121/0x190
              [<ffffffff8155d4cc>] __mutex_lock_common+0x5c/0x450
              [<ffffffff8155d9ee>] mutex_lock_nested+0x3e/0x50
              [<ffffffff81049197>] get_online_cpus+0x37/0x50
              [<ffffffff810f21bb>] _rcu_barrier+0xbb/0x1e0
              [<ffffffff810f22f0>] rcu_barrier_sched+0x10/0x20
              [<ffffffff810f2309>] rcu_barrier+0x9/0x10
              [<ffffffff8118c129>] deactivate_locked_super+0x49/0x90
              [<ffffffff8118cc01>] deactivate_super+0x61/0x70
              [<ffffffff811aaaa7>] mntput_no_expire+0x127/0x180
              [<ffffffff811ab49e>] sys_umount+0x6e/0xd0
              [<ffffffff81569979>] system_call_fastpath+0x16/0x1b
      
       -> #0 (rcu_sched_state.barrier_mutex){+.+...}:
              [<ffffffff810adb4e>] check_prev_add+0x3de/0x440
              [<ffffffff810ae1e2>] validate_chain+0x632/0x720
              [<ffffffff810ae5d9>] __lock_acquire+0x309/0x530
              [<ffffffff810ae921>] lock_acquire+0x121/0x190
              [<ffffffff8155d4cc>] __mutex_lock_common+0x5c/0x450
              [<ffffffff8155d9ee>] mutex_lock_nested+0x3e/0x50
              [<ffffffff810f2126>] _rcu_barrier+0x26/0x1e0
              [<ffffffff810f22f0>] rcu_barrier_sched+0x10/0x20
              [<ffffffff810f2309>] rcu_barrier+0x9/0x10
              [<ffffffff81176ea1>] kmem_cache_destroy+0xd1/0xe0
              [<ffffffffa04c3154>] nf_conntrack_cleanup_net+0xe4/0x110 [nf_conntrack]
              [<ffffffffa04c31aa>] nf_conntrack_cleanup+0x2a/0x70 [nf_conntrack]
              [<ffffffffa04c42ce>] nf_conntrack_net_exit+0x5e/0x80 [nf_conntrack]
              [<ffffffff81454b79>] ops_exit_list+0x39/0x60
              [<ffffffff814551ab>] cleanup_net+0xfb/0x1b0
              [<ffffffff8106917b>] process_one_work+0x26b/0x4c0
              [<ffffffff81069f3e>] worker_thread+0x12e/0x320
              [<ffffffff8106f73e>] kthread+0x9e/0xb0
              [<ffffffff8156ab44>] kernel_thread_helper+0x4/0x10
      
       other info that might help us debug this:
      
       Chain exists of:
         rcu_sched_state.barrier_mutex --> cpu_hotplug.lock --> slab_mutex
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(slab_mutex);
                                      lock(cpu_hotplug.lock);
                                      lock(slab_mutex);
         lock(rcu_sched_state.barrier_mutex);
      
        *** DEADLOCK ***
      === [ cut here ] ===
      
      This is actually a false positive. Lockdep has no way of knowing the fact
      that the ABBA can actually never happen, because of special semantics of
      cpu_hotplug.refcount and its handling in cpu_hotplug_begin(); the mutual
      exclusion there is not achieved through mutex, but through
      cpu_hotplug.refcount.
      
      The "neither cpu_up() nor cpu_down() will proceed past cpu_hotplug_begin()
      until everyone who called get_online_cpus() will call put_online_cpus()"
      semantics is totally invisible to lockdep.
      
      This patch therefore moves the unlock of slab_mutex so that rcu_barrier()
      is being called with it unlocked. It has two advantages:
      
      - it slightly reduces hold time of slab_mutex; as it's used to protect
        the cachep list, it's not necessary to hold it over kmem_cache_free()
        call any more
      - it silences the lockdep false positive warning, as it avoids lockdep ever
        learning about slab_mutex -> cpu_hotplug.lock dependency
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      Signed-off-by: NPekka Enberg <penberg@kernel.org>
      210ed9de
    • H
      tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking · 35c2a7f4
      Hugh Dickins 提交于
      Fuzzing with trinity oopsed on the 1st instruction of shmem_fh_to_dentry(),
      	u64 inum = fid->raw[2];
      which is unhelpfully reported as at the end of shmem_alloc_inode():
      
      BUG: unable to handle kernel paging request at ffff880061cd3000
      IP: [<ffffffff812190d0>] shmem_alloc_inode+0x40/0x40
      Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      Call Trace:
       [<ffffffff81488649>] ? exportfs_decode_fh+0x79/0x2d0
       [<ffffffff812d77c3>] do_handle_open+0x163/0x2c0
       [<ffffffff812d792c>] sys_open_by_handle_at+0xc/0x10
       [<ffffffff83a5f3f8>] tracesys+0xe1/0xe6
      
      Right, tmpfs is being stupid to access fid->raw[2] before validating that
      fh_len includes it: the buffer kmalloc'ed by do_sys_name_to_handle() may
      fall at the end of a page, and the next page not be present.
      
      But some other filesystems (ceph, gfs2, isofs, reiserfs, xfs) are being
      careless about fh_len too, in fh_to_dentry() and/or fh_to_parent(), and
      could oops in the same way: add the missing fh_len checks to those.
      Reported-by: NSasha Levin <levinsasha928@gmail.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Sage Weil <sage@inktank.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      35c2a7f4
    • A
      mm/slob: use min_t() to compare ARCH_SLAB_MINALIGN · baaf1dd4
      Arnd Bergmann 提交于
      The definition of ARCH_SLAB_MINALIGN is architecture dependent
      and can be either of type size_t or int. Comparing that value
      with ARCH_KMALLOC_MINALIGN can cause harmless warnings on
      platforms where they are different. Since both are always
      small positive integer numbers, using the size_t type to compare
      them is safe and gets rid of the warning.
      
      Without this patch, building ARM collie_defconfig results in:
      
      mm/slob.c: In function '__kmalloc_node':
      mm/slob.c:431:152: warning: comparison of distinct pointer types lacks a cast [enabled by default]
      mm/slob.c: In function 'kfree':
      mm/slob.c:484:153: warning: comparison of distinct pointer types lacks a cast [enabled by default]
      mm/slob.c: In function 'ksize':
      mm/slob.c:503:153: warning: comparison of distinct pointer types lacks a cast [enabled by default]
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      baaf1dd4
  10. 09 10月, 2012 24 次提交
    • D
      mm: thp: Use more portable PMD clearing sequenece in zap_huge_pmd(). · f5c8ad47
      David Miller 提交于
      Invalidation sequences are handled in various ways on various
      architectures.
      
      One way, which sparc64 uses, is to let the set_*_at() functions accumulate
      pending flushes into a per-cpu array.  Then the flush_tlb_range() et al.
      calls process the pending TLB flushes.
      
      In this regime, the __tlb_remove_*tlb_entry() implementations are
      essentially NOPs.
      
      The canonical PTE zap in mm/memory.c is:
      
      			ptent = ptep_get_and_clear_full(mm, addr, pte,
      							tlb->fullmm);
      			tlb_remove_tlb_entry(tlb, pte, addr);
      
      With a subsequent tlb_flush_mmu() if needed.
      
      Mirror this in the THP PMD zapping using:
      
      		orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
      		page = pmd_page(orig_pmd);
      		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
      
      And we properly accomodate TLB flush mechanims like the one described
      above.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f5c8ad47
    • D
      mm: Add and use update_mmu_cache_pmd() in transparent huge page code. · b113da65
      David Miller 提交于
      The transparent huge page code passes a PMD pointer in as the third
      argument of update_mmu_cache(), which expects a PTE pointer.
      
      This never got noticed because X86 implements update_mmu_cache() as a
      macro and thus we don't get any type checking, and X86 is the only
      architecture which supports transparent huge pages currently.
      
      Before other architectures can support transparent huge pages properly we
      need to add a new interface which will take a PMD pointer as the third
      argument rather than a PTE pointer.
      
      [akpm@linux-foundation.org: implement update_mm_cache_pmd() for s390]
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b113da65
    • Y
      memory-hotplug: suppress "Trying to free nonexistent resource... · d760afd4
      Yasuaki Ishimatsu 提交于
      memory-hotplug: suppress "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>" warning
      
      When our x86 box calls __remove_pages(), release_mem_region() shows many
      warnings.  And x86 box cannot unregister iomem_resource.
      
        "Trying to free nonexistent resource <XXXXXXXXXXXXXXXX-YYYYYYYYYYYYYYYY>"
      
      release_mem_region() has been changed to be called in each
      PAGES_PER_SECTION by commit de7f0cba ("memory hotplug: release
      memory regions in PAGES_PER_SECTION chunks").  Because powerpc registers
      iomem_resource in each PAGES_PER_SECTION chunk.  But when I hot add
      memory on x86 box, iomem_resource is register in each _CRS not
      PAGES_PER_SECTION chunk.  So x86 box unregisters iomem_resource.
      
      The patch fixes the problem.
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Nathan Fontenot <nfont@austin.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d760afd4
    • A
      mm: document PageHuge somewhat · 7795912c
      Andrew Morton 提交于
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7795912c
    • K
      mm: use %pK for /proc/vmallocinfo · 45ec1690
      Kees Cook 提交于
      In the paranoid case of sysctl kernel.kptr_restrict=2, mask the kernel
      virtual addresses in /proc/vmallocinfo too.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reported-by: NBrad Spengler <spender@grsecurity.net>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45ec1690
    • D
      mm, thp: fix mlock statistics · 8449d21f
      David Rientjes 提交于
      NR_MLOCK is only accounted in single page units: there's no logic to
      handle transparent hugepages.  This patch checks the appropriate number of
      pages to adjust the statistics by so that the correct amount of memory is
      reflected.
      
      Currently:
      
      		$ grep Mlocked /proc/meminfo
      		Mlocked:           19636 kB
      
      	#define MAP_SIZE	(4 << 30)	/* 4GB */
      
      	void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
      			 MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      	mlock(ptr, MAP_SIZE);
      
      		$ grep Mlocked /proc/meminfo
      		Mlocked:           29844 kB
      
      	munlock(ptr, MAP_SIZE);
      
      		$ grep Mlocked /proc/meminfo
      		Mlocked:           19636 kB
      
      And with this patch:
      
      		$ grep Mlock /proc/meminfo
      		Mlocked:           19636 kB
      
      	mlock(ptr, MAP_SIZE);
      
      		$ grep Mlock /proc/meminfo
      		Mlocked:         4213664 kB
      
      	munlock(ptr, MAP_SIZE);
      
      		$ grep Mlock /proc/meminfo
      		Mlocked:           19636 kB
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NHugh Dickens <hughd@google.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NMichel Lespinasse <walken@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8449d21f
    • D
      mm, thp: fix mapped pages avoiding unevictable list on mlock · b676b293
      David Rientjes 提交于
      When a transparent hugepage is mapped and it is included in an mlock()
      range, follow_page() incorrectly avoids setting the page's mlock bit and
      moving it to the unevictable lru.
      
      This is evident if you try to mlock(), munlock(), and then mlock() a
      range again.  Currently:
      
      	#define MAP_SIZE	(4 << 30)	/* 4GB */
      
      	void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
      			 MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      	mlock(ptr, MAP_SIZE);
      
      		$ grep -E "Unevictable|Inactive\(anon" /proc/meminfo
      		Inactive(anon):     6304 kB
      		Unevictable:     4213924 kB
      
      	munlock(ptr, MAP_SIZE);
      
      		Inactive(anon):  4186252 kB
      		Unevictable:       19652 kB
      
      	mlock(ptr, MAP_SIZE);
      
      		Inactive(anon):  4198556 kB
      		Unevictable:       21684 kB
      
      Notice that less than 2MB was added to the unevictable list; this is
      because these pages in the range are not transparent hugepages since the
      4GB range was allocated with mmap() and has no specific alignment.  If
      posix_memalign() were used instead, unevictable would not have grown at
      all on the second mlock().
      
      The fix is to call mlock_vma_page() so that the mlock bit is set and the
      page is added to the unevictable list.  With this patch:
      
      	mlock(ptr, MAP_SIZE);
      
      		Inactive(anon):     4056 kB
      		Unevictable:     4213940 kB
      
      	munlock(ptr, MAP_SIZE);
      
      		Inactive(anon):  4198268 kB
      		Unevictable:       19636 kB
      
      	mlock(ptr, MAP_SIZE);
      
      		Inactive(anon):     4008 kB
      		Unevictable:     4213940 kB
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b676b293
    • W
      memory-hotplug: update memory block's state and notify userspace · e90bdb7f
      Wen Congyang 提交于
      remove_memory() will be called when hot removing a memory device.  But
      even if offlining memory, we cannot notice it.  So the patch updates the
      memory block's state and sends notification to userspace.
      
      Additionally, the memory device may contain more than one memory block.
      If the memory block has been offlined, __offline_pages() will fail.  So we
      should try to offline one memory block at a time.
      
      Thus remove_memory() also check each memory block's state.  So there is no
      need to check the memory block's state before calling remove_memory().
      Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e90bdb7f
    • W
      memory-hotplug: preparation to notify memory block's state at memory hot remove · a16cee10
      Wen Congyang 提交于
      remove_memory() is called in two cases:
      1. echo offline >/sys/devices/system/memory/memoryXX/state
      2. hot remove a memory device
      
      In the 1st case, the memory block's state is changed and the notification
      that memory block's state changed is sent to userland after calling
      remove_memory().  So user can notice memory block is changed.
      
      But in the 2nd case, the memory block's state is not changed and the
      notification is not also sent to userspcae even if calling
      remove_memory().  So user cannot notice memory block is changed.
      
      For adding the notification at memory hot remove, the patch just prepare
      as follows:
      1st case uses offline_pages() for offlining memory.
      2nd case uses remove_memory() for offlining memory and changing memory block's
          state and notifing the information.
      
      The patch does not implement notification to remove_memory().
      Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a16cee10
    • R
      mm: avoid section mismatch warning for memblock_type_name · c2233116
      Raghavendra D Prabhu 提交于
      Following section mismatch warning is thrown during build;
      
          WARNING: vmlinux.o(.text+0x32408f): Section mismatch in reference from the function memblock_type_name() to the variable .meminit.data:memblock
          The function memblock_type_name() references
          the variable __meminitdata memblock.
          This is often because memblock_type_name lacks a __meminitdata
          annotation or the annotation of memblock is wrong.
      
      This is because memblock_type_name makes reference to memblock variable
      with attribute __meminitdata.  Hence, the warning (even if the function is
      inline).
      
      [akpm@linux-foundation.org: remove inline]
      Signed-off-by: NRaghavendra D Prabhu <rprabhu@wnohang.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c2233116
    • M
      cma: decrease cc.nr_migratepages after reclaiming pagelist · beb51eaa
      Minchan Kim 提交于
      reclaim_clean_pages_from_list() reclaims clean pages before migration so
      cc.nr_migratepages should be updated.  Currently, there is no problem but
      it can be wrong if we try to use the value in future.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      beb51eaa
    • M
      CMA: migrate mlocked pages · e46a2879
      Minchan Kim 提交于
      Presently CMA cannot migrate mlocked pages so it ends up failing to allocate
      contiguous memory space.
      
      This patch makes mlocked pages be migrated out.  Of course, it can affect
      realtime processes but in CMA usecase, contiguous memory allocation failing
      is far worse than access latency to an mlocked page being variable while
      CMA is running.  If someone wants to make the system realtime, he shouldn't
      enable CMA because stalls can still happen at random times.
      
      [akpm@linux-foundation.org: tweak comment text, per Mel]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e46a2879
    • R
    • H
      mm: remove unevictable_pgs_mlockfreed · 8befedfe
      Hugh Dickins 提交于
      Simply remove UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed line
      from /proc/vmstat: Johannes and Mel point out that it was very unlikely to
      have been used by any tool, and of course we can restore it easily enough
      if that turns out to be wrong.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8befedfe
    • M
      memory-hotplug: fix zone stat mismatch · 5a883813
      Minchan Kim 提交于
      During memory-hotplug, I found NR_ISOLATED_[ANON|FILE] are increasing,
      causing the kernel to hang.  When the system doesn't have enough free
      pages, it enters reclaim but never reclaim any pages due to
      too_many_isolated()==true and loops forever.
      
      The cause is that when we do memory-hotadd after memory-remove,
      __zone_pcp_update() clears a zone's ZONE_STAT_ITEMS in setup_pageset()
      although the vm_stat_diff of all CPUs still have values.
      
      In addtion, when we offline all pages of the zone, we reset them in
      zone_pcp_reset without draining so we loss some zone stat item.
      Reviewed-by: NWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a883813
    • M
      mm: revert 0def08e3 ("mm/mempolicy.c: check return code of check_range") · 08270807
      Minchan Kim 提交于
      Revert commit 0def08e3 because check_range can't fail in
      migrate_to_node with considering current usecases.
      
      Quote from Johannes
      
      : I think it makes sense to revert.  Not because of the semantics, but I
      : just don't see how check_range() could even fail for this callsite:
      :
      : 1. we pass mm->mmap->vm_start in there, so we should not fail due to
      :    find_vma()
      :
      : 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply
      :    and so can not fail
      :
      : 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will
      :    continue until addr == end, so we never fail with -EIO
      
      And I added a new VM_BUG_ON for checking migrate_to_node's future usecase
      which might pass to MPOL_MF_STRICT.
      Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vasiliy Kulikov <segooon@gmail.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      08270807
    • H
      mm: wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end · 6bdb913f
      Haggai Eran 提交于
      In order to allow sleeping during invalidate_page mmu notifier calls, we
      need to avoid calling when holding the PT lock.  In addition to its direct
      calls, invalidate_page can also be called as a substitute for a change_pte
      call, in case the notifier client hasn't implemented change_pte.
      
      This patch drops the invalidate_page call from change_pte, and instead
      wraps all calls to change_pte with invalidate_range_start and
      invalidate_range_end calls.
      
      Note that change_pte still cannot sleep after this patch, and that clients
      implementing change_pte should not take action on it in case the number of
      outstanding invalidate_range_start calls is larger than one, otherwise
      they might miss a later invalidation.
      Signed-off-by: NHaggai Eran <haggaie@mellanox.com>
      Cc: Andrea Arcangeli <andrea@qumranet.com>
      Cc: Sagi Grimberg <sagig@mellanox.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Liran Liss <liranl@mellanox.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6bdb913f
    • S
      mm: move all mmu notifier invocations to be done outside the PT lock · 2ec74c3e
      Sagi Grimberg 提交于
      In order to allow sleeping during mmu notifier calls, we need to avoid
      invoking them under the page table spinlock.  This patch solves the
      problem by calling invalidate_page notification after releasing the lock
      (but before freeing the page itself), or by wrapping the page invalidation
      with calls to invalidate_range_begin and invalidate_range_end.
      
      To prevent accidental changes to the invalidate_range_end arguments after
      the call to invalidate_range_begin, the patch introduces a convention of
      saving the arguments in consistently named locals:
      
      	unsigned long mmun_start;	/* For mmu_notifiers */
      	unsigned long mmun_end;	/* For mmu_notifiers */
      
      	...
      
      	mmun_start = ...
      	mmun_end = ...
      	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
      
      	...
      
      	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
      
      The patch changes code to use this convention for all calls to
      mmu_notifier_invalidate_range_start/end, except those where the calls are
      close enough so that anyone who glances at the code can see the values
      aren't changing.
      
      This patchset is a preliminary step towards on-demand paging design to be
      added to the RDMA stack.
      
      Why do we want on-demand paging for Infiniband?
      
        Applications register memory with an RDMA adapter using system calls,
        and subsequently post IO operations that refer to the corresponding
        virtual addresses directly to HW.  Until now, this was achieved by
        pinning the memory during the registration calls.  The goal of on demand
        paging is to avoid pinning the pages of registered memory regions (MRs).
         This will allow users the same flexibility they get when swapping any
        other part of their processes address spaces.  Instead of requiring the
        entire MR to fit in physical memory, we can allow the MR to be larger,
        and only fit the current working set in physical memory.
      
      Why should anyone care?  What problems are users currently experiencing?
      
        This can make programming with RDMA much simpler.  Today, developers
        that are working with more data than their RAM can hold need either to
        deregister and reregister memory regions throughout their process's
        life, or keep a single memory region and copy the data to it.  On demand
        paging will allow these developers to register a single MR at the
        beginning of their process's life, and let the operating system manage
        which pages needs to be fetched at a given time.  In the future, we
        might be able to provide a single memory access key for each process
        that would provide the entire process's address as one large memory
        region, and the developers wouldn't need to register memory regions at
        all.
      
      Is there any prospect that any other subsystems will utilise these
      infrastructural changes?  If so, which and how, etc?
      
        As for other subsystems, I understand that XPMEM wanted to sleep in
        MMU notifiers, as Christoph Lameter wrote at
        http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
        perhaps Andrea knows about other use cases.
      
        Scheduling in mmu notifications is required since we need to sync the
        hardware with the secondary page tables change.  A TLB flush of an IO
        device is inherently slower than a CPU TLB flush, so our design works by
        sending the invalidation request to the device, and waiting for an
        interrupt before exiting the mmu notifier handler.
      
      Avi said:
      
        kvm may be a buyer.  kvm::mmu_lock, which serializes guest page
        faults, also protects long operations such as destroying large ranges.
        It would be good to convert it into a spinlock, but as it is used inside
        mmu notifiers, this cannot be done.
      
        (there are alternatives, such as keeping the spinlock and using a
        generation counter to do the teardown in O(1), which is what the "may"
        is doing up there).
      
      [akpm@linux-foundation.orgpossible speed tweak in hugetlb_cow(), cleanups]
      Signed-off-by: NAndrea Arcangeli <andrea@qumranet.com>
      Signed-off-by: NSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: NHaggai Eran <haggaie@mellanox.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Liran Liss <liranl@mellanox.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ec74c3e
    • M
      hugetlb: do not use vma_hugecache_offset() for vma_prio_tree_foreach · 36e4f20a
      Michal Hocko 提交于
      Commit 0c176d52 ("mm: hugetlb: fix pgoff computation when unmapping
      page from vma") fixed pgoff calculation but it has replaced it by
      vma_hugecache_offset() which is not approapriate for offsets used for
      vma_prio_tree_foreach() because that one expects index in page units
      rather than in huge_page_shift.
      
      Johannes said:
      
      : The resulting index may not be too big, but it can be too small: assume
      : hpage size of 2M and the address to unmap to be 0x200000.  This is regular
      : page index 512 and hpage index 1.  If you have a VMA that maps the file
      : only starting at the second huge page, that VMAs vm_pgoff will be 512 but
      : you ask for offset 1 and miss it even though it does map the page of
      : interest.  hugetlb_cow() will try to unmap, miss the vma, and retry the
      : cow until the allocation succeeds or the skipped vma(s) go away.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NHillf Danton <dhillf@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      36e4f20a
    • D
      mm, numa: reclaim from all nodes within reclaim distance · 957f822a
      David Rientjes 提交于
      RECLAIM_DISTANCE represents the distance between nodes at which it is
      deemed too costly to allocate from; it's preferred to try to reclaim from
      a local zone before falling back to allocating on a remote node with such
      a distance.
      
      To do this, zone_reclaim_mode is set if the distance between any two
      nodes on the system is greather than this distance.  This, however, ends
      up causing the page allocator to reclaim from every zone regardless of
      its affinity.
      
      What we really want is to reclaim only from zones that are closer than
      RECLAIM_DISTANCE.  This patch adds a nodemask to each node that
      represents the set of nodes that are within this distance.  During the
      zone iteration, if the bit for a zone's node is set for the local node,
      then reclaim is attempted; otherwise, the zone is skipped.
      
      [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      957f822a
    • H
      mm: remove free_page_mlock · a0c5e813
      Hugh Dickins 提交于
      We should not be seeing non-0 unevictable_pgs_mlockfreed any longer.  So
      remove free_page_mlock() from the page freeing paths: __PG_MLOCKED is
      already in PAGE_FLAGS_CHECK_AT_FREE, so free_pages_check() will now be
      checking it, reporting "BUG: Bad page state" if it's ever found set.
      Comment UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed always 0.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a0c5e813
    • H
      mm: use clear_page_mlock() in page_remove_rmap() · e6c509f8
      Hugh Dickins 提交于
      We had thought that pages could no longer get freed while still marked as
      mlocked; but Johannes Weiner posted this program to demonstrate that
      truncating an mlocked private file mapping containing COWed pages is still
      mishandled:
      
      #include <sys/types.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <stdlib.h>
      #include <unistd.h>
      #include <fcntl.h>
      #include <stdio.h>
      
      int main(void)
      {
      	char *map;
      	int fd;
      
      	system("grep mlockfreed /proc/vmstat");
      	fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR);
      	unlink("chigurh");
      	ftruncate(fd, 4096);
      	map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0);
      	map[0] = 11;
      	mlock(map, sizeof(fd));
      	ftruncate(fd, 0);
      	close(fd);
      	munlock(map, sizeof(fd));
      	munmap(map, 4096);
      	system("grep mlockfreed /proc/vmstat");
      	return 0;
      }
      
      The anon COWed pages are not caught by truncation's clear_page_mlock() of
      the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to
      look out for them there in page_remove_rmap().  Indeed, why should
      truncation or invalidation be doing the clear_page_mlock() when removing
      from pagecache?  mlock is a property of mapping in userspace, not a
      property of pagecache: an mlocked unmapped page is nonsensical.
      Reported-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6c509f8
    • H
      mm: remove vma arg from page_evictable · 39b5f29a
      Hugh Dickins 提交于
      page_evictable(page, vma) is an irritant: almost all its callers pass
      NULL for vma.  Remove the vma arg and use mlocked_vma_newpage(vma, page)
      explicitly in the couple of places it's needed.  But in those places we
      don't even need page_evictable() itself!  They're dealing with a freshly
      allocated anonymous page, which has no "mapping" and cannot be mlocked yet.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      39b5f29a
    • H
      mm: fix invalidate_complete_page2() lock ordering · ec4d9f62
      Hugh Dickins 提交于
      In fuzzing with trinity, lockdep protested "possible irq lock inversion
      dependency detected" when isolate_lru_page() reenabled interrupts while
      still holding the supposedly irq-safe tree_lock:
      
      invalidate_inode_pages2
        invalidate_complete_page2
          spin_lock_irq(&mapping->tree_lock)
          clear_page_mlock
            isolate_lru_page
              spin_unlock_irq(&zone->lru_lock)
      
      isolate_lru_page() is correct to enable interrupts unconditionally:
      invalidate_complete_page2() is incorrect to call clear_page_mlock() while
      holding tree_lock, which is supposed to nest inside lru_lock.
      
      Both truncate_complete_page() and invalidate_complete_page() call
      clear_page_mlock() before taking tree_lock to remove page from radix_tree.
       I guess invalidate_complete_page2() preferred to test PageDirty (again)
      under tree_lock before committing to the munlock; but since the page has
      already been unmapped, its state is already somewhat inconsistent, and no
      worse if clear_page_mlock() moved up.
      Reported-by: NSasha Levin <levinsasha928@gmail.com>
      Deciphered-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec4d9f62