1. 16 Jul, 2014 1 commit
    • sched: Remove proliferation of wait_on_bit() action functions · 74316201
      Committed by NeilBrown
      The current "wait_on_bit" interface requires an 'action'
      function to be provided which does the actual waiting.
      There are over 20 such functions, many of them identical.
      Most cases can be satisfied by one of just two functions, one
      which uses io_schedule() and one which just uses schedule().
      
      So:
       Rename wait_on_bit and        wait_on_bit_lock to
              wait_on_bit_action and wait_on_bit_lock_action
       to make it explicit that they need an action function.
      
       Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
       which are *not* given an action function but implicitly use
       a standard one.
       The decision to error-out if a signal is pending is now made
       based on the 'mode' argument rather than being encoded in the action
       function.
      
       All instances of the old wait_on_bit and wait_on_bit_lock which
       can use the new version have been changed accordingly and their
       action functions have been discarded.
       wait_on_bit{,_lock} does not return any specific error code in the
       event of a signal so the caller must check for non-zero and
       supply their own error code as appropriate.
      
      The wait_on_bit() call in __fscache_wait_on_invalidate() was
      ambiguous as it specified TASK_UNINTERRUPTIBLE but used
      fscache_wait_bit_interruptible as an action function.
      David Howells confirms this should be uniformly
      "uninterruptible"
      
      The main remaining user of wait_on_bit{,_lock}_action is NFS
      which needs to use a freezer-aware schedule() call.
      
      A comment in fs/gfs2/glock.c notes that having multiple 'action'
      functions is useful as they display differently in the 'wchan'
      field of 'ps' (and /proc/$PID/wchan).
      As the new bit_wait{,_io} functions are tagged "__sched", they
      will not show up at all; something higher in the stack will.  So
      the distinction will still be visible, only with different
      function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
      gfs2/glock.c case).
      
      Since the first version of this patch (against 3.15) two new action
      functions appeared, one in NFS and one in CIFS.  CIFS also now
      uses an action function that makes the same freezer aware
      schedule call as NFS.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
      Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steve French <sfrench@samba.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      74316201
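The interface split described in this commit can be pictured with a small user-space model. This is only a sketch under stated assumptions: every `model_*` name is hypothetical, the "scheduler" is faked with a counter, and none of this is the kernel's actual code. It shows three points from the commit message: the `_action` variant takes an explicit action function, the plain variant implicitly uses a standard one, and the decision to bail out on a pending signal comes from the mode argument, with the caller choosing its own error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's task states. */
enum model_mode { MODEL_UNINTERRUPTIBLE, MODEL_INTERRUPTIBLE };

/* Model of one waiter: the watched word plus fake scheduler state. */
struct model_waiter {
    unsigned long word;       /* bit field being waited on */
    int bit;                  /* bit currently waited for */
    bool signal_pending;      /* pretend a signal has arrived */
    int wakeups_until_clear;  /* the bit clears after this many "sleeps" */
};

typedef int model_action_fn(struct model_waiter *w, enum model_mode mode);

/* Standard action: report a pending signal only if the mode allows it,
 * otherwise "sleep" once (here: count down until the bit is cleared). */
static int model_bit_wait(struct model_waiter *w, enum model_mode mode)
{
    if (mode == MODEL_INTERRUPTIBLE && w->signal_pending)
        return 1;
    if (--w->wakeups_until_clear <= 0)
        w->word &= ~(1UL << w->bit);
    return 0;
}

/* Explicit-action variant: loops until the bit clears or the action fails. */
static int model_wait_on_bit_action(struct model_waiter *w, int bit,
                                    model_action_fn *action,
                                    enum model_mode mode)
{
    assert(bit >= 0 && bit < (int)(8 * sizeof w->word));
    w->bit = bit;
    while (w->word & (1UL << bit)) {
        int ret = action(w, mode);
        if (ret)
            return ret;  /* non-zero, but not a specific error code */
    }
    return 0;
}

/* Plain variant: no action argument, the standard action is implied. */
static int model_wait_on_bit(struct model_waiter *w, int bit,
                             enum model_mode mode)
{
    return model_wait_on_bit_action(w, bit, model_bit_wait, mode);
}

/* A caller checks for non-zero and picks its own error code. */
static int model_caller(struct model_waiter *w, int bit)
{
    return model_wait_on_bit(w, bit, MODEL_INTERRUPTIBLE) ? -EINTR : 0;
}
```

In this model an uninterruptible wait ignores the pending signal and returns 0 once the bit clears, while an interruptible wait returns non-zero and `model_caller()` turns that into `-EINTR`.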
  2. 04 Jul, 2014 5 commits
    • shmem: fix init_page_accessed use to stop !PageLRU bug · 66d2f4d2
      Committed by Hugh Dickins
      Under shmem swapping load, I sometimes hit the VM_BUG_ON_PAGE(!PageLRU)
      in isolate_lru_pages() at mm/vmscan.c:1281!
      
      Commit 2457aec6 ("mm: non-atomically mark page accessed during page
      cache allocation where possible") looks like interrupted work-in-progress.
      
      mm/filemap.c's call to init_page_accessed() is fine, but not mm/shmem.c's
      - shmem_write_begin() is clearly wrong to use it after shmem_getpage(),
      when the page is always visible in radix_tree, and often already on LRU.
      
      Revert change to shmem_write_begin(), and use init_page_accessed() or
      mark_page_accessed() appropriately for SGP_WRITE in shmem_getpage_gfp().
      
      SGP_WRITE also covers shmem_symlink(), which did not mark_page_accessed()
      before; but since many other filesystems use [__]page_symlink(), which did
      and does mark the page accessed, consider this as rectifying an oversight.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Prabhakar Lad <prabhakar.csengg@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66d2f4d2
    • hwpoison: fix the handling path of the victimized page frame that belong to non-LRU · 0bc1f8b0
      Committed by Chen Yucong
      Until now, the kernel has had the same policy for handling victimized page
      frames that belong to kernel space (reserved/slab subsystem) or are
      non-LRU (unknown page state).  In other words, the result of handling
      either of these victimized page frames is (IGNORED | FAILED), and the
      return value of memory_failure() is -EBUSY.
      
      This patch prevents memory_failure() from returning early merely because
      (!PageLRU(p)) is true, and it also ensures that
      action_result() can report more precise information ("reserved kernel",
      "kernel slab", and "unknown page state") instead of "non LRU",
      especially for memory errors which are detected by memory scrubbing.
      
      Andi said:
      
      : While running the mcelog test suite on 3.14 I hit the following VM_BUG_ON:
      :
      : soft_offline: 0x56d4: unknown non LRU page type 3ffff800008000
      : page:ffffea000015b400 count:3 mapcount:2097169 mapping:          (null) index:0xffff8800056d7000
      : page flags: 0x3ffff800004081(locked|slab|head)
      : ------------[ cut here ]------------
      : kernel BUG at mm/rmap.c:1495!
      :
      : I think what happened is that a LRU page turned into a slab page in
      : parallel with offlining.  memory_failure initially tests for this case,
      : but doesn't retest later after the page has been locked.
      :
      : ...
      :
      : I ran this patch in a loop over night with some stress plus
      : the mcelog test suite running in a loop. I cannot guarantee it hit it,
      : but it should have given it a good beating.
      :
      : The kernel survived with no messages, although the mcelog test suite
      : got killed at some point because it couldn't fork anymore. Probably
      : some unrelated problem.
      :
      : So the patch is ok for me for .16.
      Signed-off-by: Chen Yucong <slaoub@gmail.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0bc1f8b0
    • msync: fix incorrect fstart calculation · 496a8e68
      Committed by Namjae Jeon
      Fix a regression caused by 7fc34a62 ("mm/msync.c: sync only the
      requested range in msync()").
      
      xfstests generic/075 failed on ext4 in data=journal mode because the
      intended range was not synced due to a wrong fstart calculation.
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
      Reported-by: Eric Whitney <enwlinux@gmail.com>
      Tested-by: Eric Whitney <enwlinux@gmail.com>
      Acked-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Reviewed-by: Lukas Czerner <lczerner@redhat.com>
      Tested-by: Lukas Czerner <lczerner@redhat.com>
      Reviewed-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      496a8e68
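The commit message above does not show the corrected expression, so the following is only a sketch of the general relation between a mapped address and its file offset, not a quote of the patch. It assumes the usual mmap arithmetic: the offset into the file is the offset of the address within the mapping plus the file offset at which the mapping starts. The struct, the `model_*` names and the 4 KiB page size are all illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Hypothetical, trimmed-down view of a mapping. */
struct model_vma {
    uint64_t vm_start;  /* first virtual address of the mapping */
    uint64_t vm_pgoff;  /* file offset of vm_start, in pages */
};

/* File offset that corresponds to virtual address 'start' inside 'vma'. */
static uint64_t model_fstart(const struct model_vma *vma, uint64_t start)
{
    assert(start >= vma->vm_start);
    return (start - vma->vm_start) + (vma->vm_pgoff << MODEL_PAGE_SHIFT);
}
```

For a mapping that begins two pages into the file, an address three pages into the mapping corresponds to byte offset 0x5000 in the file. Adding the raw virtual address instead of the offset within the mapping would give a file range far away from the one that was requested.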
    • slub: fix off by one in number of slab tests · 8a5b20ae
      Committed by Joonsoo Kim
      min_partial means the minimum number of slabs cached in the node partial list.
      So, if nr_partial is less than it, we keep a newly empty slab on the node
      partial list rather than freeing it.  But if nr_partial is equal to or
      greater than it, we have enough partial slabs and so should
      free the newly empty slab.  The current implementation missed the equal case, so
      if min_partial is set to 0, at least one slab could still be cached.
      This is a critical problem for the kmemcg destroying logic because it doesn't
      work properly if some slabs are cached.  This patch fixes this problem.
      
      Fixes 91cb69620284 ("slub: make dead memcg caches discard free slabs
      immediately").
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a5b20ae
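The off-by-one described above comes down to a single comparison. The sketch below is a user-space model with hypothetical `model_*` names, not the SLUB code itself; it assumes the decision is taken when a slab becomes empty and compares the pre-fix test (strictly greater) with the fixed test (greater or equal).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Pre-fix test: misses the case nr_partial == min_partial. */
static bool model_should_free_old(unsigned long nr_partial,
                                  unsigned long min_partial)
{
    return nr_partial > min_partial;
}

/* Fixed test: enough partial slabs are cached once the minimum is reached. */
static bool model_should_free_fixed(unsigned long nr_partial,
                                    unsigned long min_partial)
{
    return nr_partial >= min_partial;
}

/* Simulate a slab becoming empty; returns the new partial-list length. */
static unsigned long model_on_empty_slab(unsigned long nr_partial,
                                         unsigned long min_partial,
                                         bool fixed)
{
    bool free_it = fixed ? model_should_free_fixed(nr_partial, min_partial)
                         : model_should_free_old(nr_partial, min_partial);

    assert(nr_partial < ULONG_MAX);  /* guard the increment below */
    return free_it ? nr_partial : nr_partial + 1;
}
```

With `min_partial` set to 0 and an empty partial list, the old test keeps the slab (the list grows to 1) while the fixed test frees it (the list stays at 0), which is the behaviour the kmemcg destroying logic relies on according to the commit message.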
    • mm: page_alloc: fix CMA area initialisation when pageblock > MAX_ORDER · dc78327c
      Committed by Michal Nazarewicz
      With a kernel configured with ARM64_64K_PAGES && !TRANSPARENT_HUGEPAGE,
      the following is triggered at early boot:
      
        SMP: Total of 8 processors activated.
        devtmpfs: initialized
        Unable to handle kernel NULL pointer dereference at virtual address 00000008
        pgd = fffffe0000050000
        [00000008] *pgd=00000043fba00003, *pmd=00000043fba00003, *pte=00e0000078010407
        Internal error: Oops: 96000006 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc864k+ #44
        task: fffffe03bc040000 ti: fffffe03bc080000 task.ti: fffffe03bc080000
        PC is at __list_add+0x10/0xd4
        LR is at free_one_page+0x270/0x638
        ...
        Call trace:
          __list_add+0x10/0xd4
          free_one_page+0x26c/0x638
          __free_pages_ok.part.52+0x84/0xbc
          __free_pages+0x74/0xbc
          init_cma_reserved_pageblock+0xe8/0x104
          cma_init_reserved_areas+0x190/0x1e4
          do_one_initcall+0xc4/0x154
          kernel_init_freeable+0x204/0x2a8
          kernel_init+0xc/0xd4
      
      This happens because init_cma_reserved_pageblock() calls
      __free_one_page() with pageblock_order as page order but it is bigger
      than MAX_ORDER.  This in turn causes accesses past zone->free_list[].
      
      Fix the problem by changing init_cma_reserved_pageblock() such that it
      splits pageblock into individual MAX_ORDER pages if pageblock is bigger
      than a MAX_ORDER page.
      
      In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all
      architectures except for ia64, powerpc and tile at the moment, the
      “pageblock_order > MAX_ORDER” condition will be optimised out since both
      sides of the operator are constants.  In cases where pageblock size is
      variable, the performance degradation should not be significant anyway
      since init_cma_reserved_pageblock() is called only at boot time at most
      MAX_CMA_AREAS times which by default is eight.
      Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
      Reported-by: Mark Salter <msalter@redhat.com>
      Tested-by: Mark Salter <msalter@redhat.com>
      Tested-by: Christopher Covington <cov@codeaurora.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: <stable@vger.kernel.org>	[3.5+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc78327c
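The splitting strategy described in this commit can be modelled in a few lines of user-space C. This is a sketch with hypothetical `model_*` names, and it assumes (as in kernels of that era) that the largest order the free lists accept is `max_order - 1`. It only counts how many free operations would be issued and at which order; it does not touch real pages.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Model of releasing one pageblock to a buddy-style allocator.
 * Returns the number of free calls; *order_used receives their order.
 */
static unsigned long model_free_pageblock(unsigned int pageblock_order,
                                          unsigned int max_order,
                                          unsigned int *order_used)
{
    unsigned long remaining;
    unsigned long calls = 0;

    assert(order_used != NULL && max_order >= 1);
    assert(pageblock_order < 8 * sizeof(unsigned long));

    if (pageblock_order < max_order) {
        /* The whole pageblock fits into a single free list entry. */
        *order_used = pageblock_order;
        return 1;
    }

    /* Too big: hand it over in chunks of the largest valid order. */
    *order_used = max_order - 1;
    remaining = 1UL << pageblock_order;
    while (remaining) {
        remaining -= 1UL << *order_used;
        calls++;
    }
    return calls;
}
```

For example, a pageblock of order 13 with `max_order` 11 is released as eight chunks of order 10, whereas a pageblock of order 9 is released in one call.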
  3. 25 Jun, 2014 1 commit
    • cpuset,mempolicy: fix sleeping function called from invalid context · 391acf97
      Committed by Gu Zheng
      When running with the kernel (3.15-rc7+), the following bug occurs:
      [ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
      [ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
      [ 9969.441175] INFO: lockdep is turned off.
      [ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G       A      3.15.0-rc7+ #85
      [ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
      [ 9969.706052]  ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
      [ 9969.795323]  ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
      [ 9969.884710]  ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
      [ 9969.974071] Call Trace:
      [ 9970.003403]  [<ffffffff8162f523>] dump_stack+0x4d/0x66
      [ 9970.065074]  [<ffffffff8109995a>] __might_sleep+0xfa/0x130
      [ 9970.130743]  [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0
      [ 9970.200638]  [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210
      [ 9970.272610]  [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140
      [ 9970.344584]  [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
      [ 9970.409282]  [<ffffffff811b1385>] __mpol_dup+0xe5/0x150
      [ 9970.471897]  [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
      [ 9970.536585]  [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40
      [ 9970.613763]  [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10
      [ 9970.683660]  [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50
      [ 9970.759795]  [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40
      [ 9970.834885]  [<ffffffff8106a598>] do_fork+0xd8/0x380
      [ 9970.894375]  [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0
      [ 9970.969470]  [<ffffffff8106a8c6>] SyS_clone+0x16/0x20
      [ 9971.030011]  [<ffffffff81642009>] stub_clone+0x69/0x90
      [ 9971.091573]  [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b
      
      The cause is that cpuset_mems_allowed() tries to take
      mutex_lock(&callback_mutex) under the rcu_read_lock (which is held in
      __mpol_dup()).  In cpuset_mems_allowed(), the access to the cpuset is
      under rcu_read_lock, so in __mpol_dup() we can shrink the rcu_read_lock
      protected region to cover only the access to the cpuset in
      current_cpuset_is_being_rebound(), which avoids this bug.
      
      This patch is a temporary solution that just addresses the bug
      mentioned above; it cannot fix the long-standing issue with cpuset.mems
      rebinding on fork():
      
      "When the forker's task_struct is duplicated (which includes
       ->mems_allowed) and it races with an update to cpuset_being_rebound
       in update_tasks_nodemask() then the task's mems_allowed doesn't get
       updated. And the child task's mems_allowed can be wrong if the
       cpuset's nodemask changes before the child has been added to the
       cgroup's tasklist."
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable <stable@vger.kernel.org>
      391acf97
  4. 24 Jun, 2014 9 commits
    • mm: fix crashes from mbind() merging vmas · d05f0cdc
      Committed by Hugh Dickins
      In v2.6.34 commit 9d8cebd4 ("mm: fix mbind vma merge problem")
      introduced vma merging to mbind(), but it should have also changed the
      convention of passing start vma from queue_pages_range() (formerly
      check_range()) to new_vma_page(): vma merging may have already freed
      that structure, resulting in BUG at mm/mempolicy.c:1738 and probably
      worse crashes.
      
      Fixes: 9d8cebd4 ("mm: fix mbind vma merge problem")
      Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: <stable@vger.kernel.org>	[2.6.34+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d05f0cdc
    • slab: fix oops when reading /proc/slab_allocators · 03787301
      Committed by Joonsoo Kim
      Commit b1cb0982 ("change the management method of free objects of
      the slab") introduced a bug in the slab leak detector
      ('/proc/slab_allocators').  This detector works as follows:
      
       1. Traverse all objects on all the slabs.
       2. Determine whether each object is active or not.
       3. If active, print who allocated the object.
      
      But that commit changed the way free objects are managed, so the logic
      for determining whether an object is active also changed.  Before, we
      regarded an object in the cpu caches as inactive, but with this commit we
      mistakenly regard an object in the cpu caches as active.
      
      This introduces a kernel oops if DEBUG_PAGEALLOC is enabled.  If
      DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who
      corrupts free memory in the slab.  It unmaps the page table mapping if the
      object is free and maps it if the object is active.  When the slab leak
      detector checks an object in the cpu caches, it mistakenly thinks the object
      is active and tries to access the object's memory to retrieve the caller of
      the allocation.  At this point, the page table mapping to this object
      doesn't exist, so an oops occurs.
      
      The following is the oops message reported by Dave.
      
      It blew up when something tried to read /proc/slab_allocators
      (Just cat it, and you should see the oops below)
      
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in:
        [snip...]
        CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131
        task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000
        RIP: 0010:[<ffffffffaa1a8f4a>]  [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180
        RSP: 0018:ffff880076925de0  EFLAGS: 00010002
        RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7
        RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000
        RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000
        R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000
        R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0
        FS:  00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0
        DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602
        Call Trace:
          leaks_show+0xce/0x240
          seq_read+0x28e/0x490
          proc_reg_read+0x3d/0x80
          vfs_read+0x9b/0x160
          SyS_read+0x58/0xb0
          tracesys+0xd4/0xd9
        Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46
        RIP   handle_slab+0x8a/0x180
      
      To fix the problem, I introduce an object status buffer on each slab.
      With this, we can track object status precisely, so the slab leak detector
      will not access an inactive object and no kernel oops will occur.  The memory
      overhead caused by this fix is only imposed on CONFIG_DEBUG_SLAB_LEAK,
      which is mainly used for debugging, so the memory overhead isn't a big
      problem.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: Dave Jones <davej@redhat.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03787301
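The idea of a per-slab object status buffer, combined with the three detector steps listed in this commit, can be sketched as follows. This is a user-space model, not the slab allocator's data layout: the struct, the fixed object count and all `model_*` names are assumptions made for illustration. The point it tries to show is that the walker consults the status buffer first and never reads anything belonging to an object that is not marked active.

```c
#include <assert.h>
#include <stddef.h>

enum model_obj_status { MODEL_OBJ_FREE = 0, MODEL_OBJ_ACTIVE = 1 };

#define MODEL_OBJS_PER_SLAB 4  /* assumption: tiny slab for the example */

/* Hypothetical slab: one status byte and one recorded caller per object. */
struct model_slab {
    unsigned char status[MODEL_OBJS_PER_SLAB];
    unsigned long caller[MODEL_OBJS_PER_SLAB];
};

/*
 * Walk all objects, skip those that are not active, and collect the
 * recorded caller of each active one. Returns the number of active objects
 * whose caller was stored in 'out' (at most 'out_len').
 */
static size_t model_collect_callers(const struct model_slab *slab,
                                    unsigned long *out, size_t out_len)
{
    size_t n = 0;

    assert(slab != NULL && out != NULL);
    for (size_t i = 0; i < MODEL_OBJS_PER_SLAB; i++) {
        if (slab->status[i] != MODEL_OBJ_ACTIVE)
            continue;  /* free or cached object: do not touch it */
        if (n < out_len)
            out[n++] = slab->caller[i];
    }
    return n;
}
```

In the real allocator the status would have to be updated on every allocation and free, which is the memory and bookkeeping overhead the commit message confines to CONFIG_DEBUG_SLAB_LEAK.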
    • shmem: fix faulting into a hole while it's punched · f00cdc6d
      Committed by Hugh Dickins
      Trinity finds that mmap access to a hole while it's punched from shmem
      can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
      from completing, until the reader chooses to stop; with the puncher's
      hold on i_mutex locking out all other writers until it can complete.
      
      It appears that the tmpfs fault path is too light in comparison with its
      hole-punching path, lacking an i_data_sem to obstruct it; but we don't
      want to slow down the common case.
      
      Extend shmem_fallocate()'s existing range notification mechanism, so
      shmem_fault() can refrain from faulting pages into the hole while it's
      punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
      faulting when not).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dave Jones <davej@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f00cdc6d
    • mm: let mm_find_pmd fix buggy race with THP fault · f72e7dcd
      Committed by Hugh Dickins
      Trinity has reported:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
          IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
          CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G        W
                                  3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
          lock_acquire (arch/x86/include/asm/current.h:14
                        kernel/locking/lockdep.c:3602)
          _raw_spin_lock (include/linux/spinlock_api_smp.h:143
                          kernel/locking/spinlock.c:151)
          remove_migration_pte (mm/migrate.c:137)
          rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
          remove_migration_ptes (mm/migrate.c:224)
          migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
          migrate_misplaced_page (mm/migrate.c:1733)
          __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
          handle_mm_fault (mm/memory.c:3948)
          __get_user_pages (mm/memory.c:1851)
          __mlock_vma_pages_range (mm/mlock.c:255)
          __mm_populate (mm/mlock.c:711)
          SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)
      
      I believe this comes about because, whereas collapsing and splitting THP
      functions take anon_vma lock in write mode (which excludes concurrent
      rmap walks), faulting THP functions (write protection and misplaced
      NUMA) do not - and mostly they do not need to.
      
      But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
      an instant (indeed, for a long instant, given the inter-CPU TLB flush in
      there), leaves *pmd neither present nor trans_huge.
      
      Which can confuse a concurrent rmap walk, as when removing migration
      ptes, seen in the dumped trace.  Although that rmap walk has a 4k page
      to insert, anon_vmas containing THPs are in no way segregated from
      4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
      that instant when a trans_huge pmd is temporarily absent.
      
      I don't think we need to strengthen the locking at the THP end: it's easily
      handled with an ACCESS_ONCE() before testing both conditions.
      
      And since mm_find_pmd() had only one caller who wanted a THP rather than
      a pmd, let's slightly repurpose it to fail when it hits a THP or
      non-present pmd, and open code split_huge_page_address() again.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Lameter <cl@gentwo.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f72e7dcd
    • mm: thp: fix DEBUG_PAGEALLOC oops in copy_page_rep() · 5338a937
      Committed by Hugh Dickins
      Trinity has for over a year been reporting a CONFIG_DEBUG_PAGEALLOC oops
      in copy_page_rep() called from copy_user_huge_page() called from
      do_huge_pmd_wp_page().
      
      I believe this is a DEBUG_PAGEALLOC false positive, due to the source
      page being split, and a tail page freed, while copy is in progress; and
      not a problem without DEBUG_PAGEALLOC, since the pmd_same() check will
      prevent a miscopy from being made visible.
      
      Fix by adding get_user_huge_page() and put_user_huge_page(): reducing to
      the usual get_page() and put_page() on head page in the usual config;
      but get and put references to all of the tail pages when
      DEBUG_PAGEALLOC.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5338a937
    • mm, pcp: allow restoring percpu_pagelist_fraction default · 7cd2b0a3
      Committed by David Rientjes
      Oleg reports a division by zero error on zero-length write() to the
      percpu_pagelist_fraction sysctl:
      
          divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
          CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
          Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
          task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
          RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
          RSP: 0018:ffff8800d87a3e78  EFLAGS: 00010246
          RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
          RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
          RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
          R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
          R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
          FS:  00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
          Call Trace:
            proc_sys_call_handler+0xb3/0xc0
            proc_sys_write+0x14/0x20
            vfs_write+0xba/0x1e0
            SyS_write+0x46/0xb0
            tracesys+0xe1/0xe6
      
      However, if the percpu_pagelist_fraction sysctl is set by the user, it
      is also impossible to restore it to the kernel default since the user
      cannot write 0 to the sysctl.
      
      This patch allows the user to write 0 to restore the default behavior.
      It still requires a fraction equal to or larger than 8, however, as
      stated by the documentation for sanity.  If a value in the range [1, 7]
      is written, the sysctl will return EINVAL.
      
      This successfully solves the divide by zero issue at the same time.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Oleg Drokin <green@linuxhacker.ru>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cd2b0a3
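The validation rule stated in this commit (0 restores the default, 1 to 7 is rejected with EINVAL, 8 or more is accepted) can be sketched in user-space C. The `model_*` names, the rejection of negative values and the `default_high` parameter are assumptions made for the example and are not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MIN_FRACTION 8  /* minimum stated in the commit message */

/* Apply a requested value; on error the stored fraction is left untouched. */
static int model_set_fraction(int *fraction, int requested)
{
    assert(fraction != NULL);
    /* Assumption: negative values are rejected like the range [1, 7]. */
    if (requested < 0 || (requested != 0 && requested < MODEL_MIN_FRACTION))
        return -EINVAL;
    *fraction = requested;  /* 0 means "use the kernel default" */
    return 0;
}

/* Per-cpu list high mark: divide only when a non-zero fraction is set. */
static unsigned long model_pcp_high(unsigned long managed_pages, int fraction,
                                    unsigned long default_high)
{
    return fraction ? managed_pages / (unsigned long)fraction : default_high;
}
```

Because 0 is handled as "default" before any division takes place, accepting it both restores the default behaviour and removes the divide-by-zero path mentioned in the report.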
    • hugetlb: fix copy_hugetlb_page_range() to handle migration/hwpoisoned entry · 4a705fef
      Committed by Naoya Horiguchi
      There's a race between fork() and hugepage migration, as a result we try
      to "dereference" a swap entry as a normal pte, causing kernel panic.
      The cause of the problem is that copy_hugetlb_page_range() can't handle
      "swap entry" family (migration entry and hwpoisoned entry) so let's fix
      it.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: <stable@vger.kernel.org>	[2.6.37+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a705fef
    • tmpfs: ZERO_RANGE and COLLAPSE_RANGE not currently supported · 13ace4d0
      Committed by Hugh Dickins
      I was well aware of FALLOC_FL_ZERO_RANGE and FALLOC_FL_COLLAPSE_RANGE
      support being added to fallocate(); but didn't realize until now that I
      had been too stupid to future-proof shmem_fallocate() against new
      additions.  Return -EOPNOTSUPP instead of going on to ordinary fallocation.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Lukas Czerner <lczerner@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.15]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      13ace4d0
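Future-proofing a mode check of this kind usually means rejecting every flag that is not explicitly supported, rather than listing the flags to refuse. The sketch below shows that pattern in user-space C. The `MODEL_FL_*` constants are local stand-ins for the corresponding `FALLOC_FL_*` flags, and the assumption that the supported set is exactly keep-size plus punch-hole is made for the example; the commit message itself only says that the two new flags are answered with -EOPNOTSUPP.

```c
#include <assert.h>
#include <errno.h>

/* Local stand-ins for the fallocate() mode flags. */
#define MODEL_FL_KEEP_SIZE      0x01
#define MODEL_FL_PUNCH_HOLE     0x02
#define MODEL_FL_COLLAPSE_RANGE 0x08
#define MODEL_FL_ZERO_RANGE     0x10

/* Assumption for this sketch: only these two flags are supported. */
#define MODEL_FL_SUPPORTED (MODEL_FL_KEEP_SIZE | MODEL_FL_PUNCH_HOLE)

/* Returns 0 if every requested flag is supported, -EOPNOTSUPP otherwise. */
static int model_shmem_fallocate_check(int mode)
{
    if (mode & ~MODEL_FL_SUPPORTED)
        return -EOPNOTSUPP;
    return 0;
}
```

With an allow-list mask, a flag added to fallocate() later is refused automatically instead of silently falling through to ordinary allocation.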
    • mm: nommu: per-thread vma cache fix · e020d5bd
      Committed by Steven Miao
      mm could be removed from the current task struct, so use the previous vma->vm_mm.
      
      It crashes on Blackfin after updating to Linux 3.15.  The commit "mm:
      per-thread vma caching" caused the crash.  mm could be removed from
      the current task struct before
      
        mmput()->
          exit_mmap()->
            delete_vma_from_mm()
      
      the detailed fault information:
      
          NULL pointer access
          Kernel OOPS in progress
          Deferred Exception context
          CURRENT PROCESS:
          COMM=modprobe PID=278  CPU=0
          invalid mm
          return address: [0x000531de]; contents of:
          0x000531b0:  c727  acea  0c42  181d  0000  0000  0000  a0a8
          0x000531c0:  b090  acaa  0c42  1806  0000  0000  0000  a0e8
          0x000531d0:  b0d0  e801  0000  05b3  0010  e522  0046 [a090]
          0x000531e0:  6408  b090  0c00  17cc  3042  e3ff  f37b  2fc8
      
          CPU: 0 PID: 278 Comm: modprobe Not tainted 3.15.0-ADI-2014R1-pre-00345-gea9f446 #25
          task: 0572b720 ti: 0569e000 task.ti: 0569e000
          Compiled for cpu family 0x27fe (Rev 0), but running on:0x0000 (Rev 0)
          ADSP-BF609-0.0 500(MHz CCLK) 125(MHz SCLK) (mpu off)
          Linux version 3.15.0-ADI-2014R1-pre-00345-gea9f446 (steven@steven-OptiPlex-390) (gcc version 4.3.5 (ADI-trunk/svn-5962) ) #25 Tue Jun 10 17:47:46 CST 2014
      
          SEQUENCER STATUS:		Not tainted
           SEQSTAT: 00000027  IPEND: 8008  IMASK: ffff  SYSCFG: 2806
            EXCAUSE   : 0x27
            physical IVG3 asserted : <0xffa00744> { _trap + 0x0 }
            physical IVG15 asserted : <0xffa00d68> { _evt_system_call + 0x0 }
            logical irq   6 mapped  : <0xffa003bc> { _bfin_coretmr_interrupt + 0x0 }
            logical irq   7 mapped  : <0x00008828> { _bfin_fault_routine + 0x0 }
            logical irq  11 mapped  : <0x00007724> { _l2_ecc_err + 0x0 }
            logical irq  13 mapped  : <0x00008828> { _bfin_fault_routine + 0x0 }
            logical irq  39 mapped  : <0x00150788> { _bfin_twi_interrupt_entry + 0x0 }
            logical irq  40 mapped  : <0x00150788> { _bfin_twi_interrupt_entry + 0x0 }
           RETE: <0x00000000> /* Maybe null pointer? */
           RETN: <0x0569fe50> /* kernel dynamic memory (maybe user-space) */
           RETX: <0x00000480> /* Maybe fixed code section */
           RETS: <0x00053384> { _exit_mmap + 0x28 }
           PC  : <0x000531de> { _delete_vma_from_mm + 0x92 }
          DCPLB_FAULT_ADDR: <0x00000008> /* Maybe null pointer? */
          ICPLB_FAULT_ADDR: <0x000531de> { _delete_vma_from_mm + 0x92 }
          PROCESSOR STATE:
           R0 : 00000004    R1 : 0569e000    R2 : 00bf3db4    R3 : 00000000
           R4 : 057f9800    R5 : 00000001    R6 : 0569ddd0    R7 : 0572b720
           P0 : 0572b854    P1 : 00000004    P2 : 00000000    P3 : 0569dda0
           P4 : 0572b720    P5 : 0566c368    FP : 0569fe5c    SP : 0569fd74
           LB0: 057f523f    LT0: 057f523e    LC0: 00000000
           LB1: 0005317c    LT1: 00053172    LC1: 00000002
           B0 : 00000000    L0 : 00000000    M0 : 0566f5bc    I0 : 00000000
           B1 : 00000000    L1 : 00000000    M1 : 00000000    I1 : ffffffff
           B2 : 00000001    L2 : 00000000    M2 : 00000000    I2 : 00000000
           B3 : 00000000    L3 : 00000000    M3 : 00000000    I3 : 057f8000
          A0.w: 00000000   A0.x: 00000000   A1.w: 00000000   A1.x: 00000000
          USP : 056ffcf8  ASTAT: 02003024
      
          Hardware Trace:
             0 Target : <0x00003fb8> { _trap_c + 0x0 }
               Source : <0xffa006d8> { _exception_to_level5 + 0xa0 } JUMP.L
             1 Target : <0xffa00638> { _exception_to_level5 + 0x0 }
               Source : <0xffa004f2> { _bfin_return_from_exception + 0x6 } RTX
             2 Target : <0xffa004ec> { _bfin_return_from_exception + 0x0 }
               Source : <0xffa00590> { _ex_trap_c + 0x70 } JUMP.S
             3 Target : <0xffa00520> { _ex_trap_c + 0x0 }
               Source : <0xffa0076e> { _trap + 0x2a } JUMP (P4)
             4 Target : <0xffa00744> { _trap + 0x0 }
                FAULT : <0x000531de> { _delete_vma_from_mm + 0x92 } P0 = W[P2 + 2]
               Source : <0x000531da> { _delete_vma_from_mm + 0x8e } P2 = [P4 + 0x18]
             5 Target : <0x000531da> { _delete_vma_from_mm + 0x8e }
               Source : <0x00053176> { _delete_vma_from_mm + 0x2a } IF CC JUMP pcrel
             6 Target : <0x0005314c> { _delete_vma_from_mm + 0x0 }
               Source : <0x00053380> { _exit_mmap + 0x24 } JUMP.L
             7 Target : <0x00053378> { _exit_mmap + 0x1c }
               Source : <0x00053394> { _exit_mmap + 0x38 } IF !CC JUMP pcrel (BP)
             8 Target : <0x00053390> { _exit_mmap + 0x34 }
               Source : <0xffa020e0> { __cond_resched + 0x20 } RTS
             9 Target : <0xffa020c0> { __cond_resched + 0x0 }
               Source : <0x0005338c> { _exit_mmap + 0x30 } JUMP.L
            10 Target : <0x0005338c> { _exit_mmap + 0x30 }
               Source : <0x0005333a> { _delete_vma + 0xb2 } RTS
            11 Target : <0x00053334> { _delete_vma + 0xac }
               Source : <0x0005507a> { _kmem_cache_free + 0xba } RTS
            12 Target : <0x00055068> { _kmem_cache_free + 0xa8 }
               Source : <0x0005505e> { _kmem_cache_free + 0x9e } IF !CC JUMP pcrel (BP)
            13 Target : <0x00055052> { _kmem_cache_free + 0x92 }
               Source : <0x0005501a> { _kmem_cache_free + 0x5a } IF CC JUMP pcrel
            14 Target : <0x00054ff4> { _kmem_cache_free + 0x34 }
               Source : <0x00054fce> { _kmem_cache_free + 0xe } IF CC JUMP pcrel (BP)
            15 Target : <0x00054fc0> { _kmem_cache_free + 0x0 }
               Source : <0x00053330> { _delete_vma + 0xa8 } JUMP.L
          Kernel Stack
          Stack info:
           SP: [0x0569ff24] <0x0569ff24> /* kernel dynamic memory (maybe user-space) */
           Memory from 0x0569ff20 to 056a0000
          0569ff20: 00000001 [04e8da5a] 00008000  00000000  00000000  056a0000  04e8da5a  04e8da5a
          0569ff40: 04eb9eea  ffa00dce  02003025  04ea09c5  057f523f  04ea09c4  057f523e  00000000
          0569ff60: 00000000  00000000  00000000  00000000  00000000  00000000  00000001  00000000
          0569ff80: 00000000  00000000  00000000  00000000  00000000  00000000  00000000  00000000
          0569ffa0: 0566f5bc  057f8000  057f8000  00000001  04ec0170  056ffcf8  056ffd04  057f9800
          0569ffc0: 04d1d498  057f9800  057f8fe4  057f8ef0  00000001  057f928c  00000001  00000001
          0569ffe0: 057f9800  00000000  00000008  00000007  00000001  00000001  00000001 <00002806>
          Return addresses in stack:
              address : <0x00002806> { _show_cpuinfo + 0x2d2 }
          Modules linked in:
          Kernel panic - not syncing: Kernel exception
          [ end Kernel panic - not syncing: Kernel exception
      Signed-off-by: Steven Miao <realmz6@gmail.com>
      Acked-by: Davidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[3.15.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e020d5bd
  5. 15 Jun, 2014 1 commit
    • A
      fix __swap_writepage() compile failure on old gcc versions · 05064084
      Committed by Al Viro
      Tetsuo Handa wrote:
       "Commit 62a8067a ("bio_vec-backed iov_iter") introduced an unnamed
        union inside a struct which gcc-4.4.7 cannot handle.  Name the unnamed
        union as u in order to fix build failure"
      
      Let's do this instead: there is only one place in the entire tree that
      steps into this breakage.  Anon structs and unions work in older gcc
      versions; as a matter of fact, we have those in the tree - see e.g.
      struct ieee80211_tx_info in include/net/mac80211.h.
      
      What doesn't work is handling their initializers:
      
      struct {
      	int a;
      	union {
      		int b;
      		char c;
      	};
      } x[2] = {{.a = 1, .c = 'a'}, {.a = 0, .b = 1}};
      
      is the obvious syntax for the initializer, perfectly fine for C11 and
      handled correctly by gcc-4.7 or later.
      
      Earlier versions, though, break on it - declaration is fine and so's
      access to fields (i.e.  x[0].c = 'a'; would produce the right code), but
      members of the anon structs and unions are not inserted into the right
      namespace.  Tellingly, those older versions will not barf on struct {int
      a; struct {int a;};}; - looks like they just have it hacked up somewhere
      around the handling of .  and -> instead of doing the right thing.
      
      The easiest way to deal with that crap is to turn initialization of
      those fields (in the only place where we have such an initializer of
      iov_iter) into plain assignment.
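
A minimal sketch of that workaround, using a hypothetical `struct item` shaped like the example above rather than the real iov_iter: only the named member goes into the initializer, and the member of the anonymous union is set by a plain assignment afterwards.

```c
/* Hypothetical struct; it mirrors the shape of the example above,
 * not the real iov_iter layout. */
struct item {
	int a;
	union {
		int b;
		char c;
	};
};

/*
 * Initialize only the named member, then assign the anonymous-union
 * member. Field access through . works on the older compilers too,
 * so this avoids the broken designated-initializer path.
 */
struct item make_item(int a, char c)
{
	struct item it = { .a = a };

	it.c = c;
	return it;
}
```

The cost is one extra statement at the single call site, which avoids renaming the union everywhere it is used.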
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      05064084
  6. 12 Jun, 2014 1 commit
  7. 09 Jun, 2014 1 commit
    • L
      Don't trigger congestion wait on dirty-but-not-writeout pages · b738d764
      Committed by Linus Torvalds
      shrink_inactive_list() used to wait 0.1s to avoid congestion when all
      the pages that were isolated from the inactive list were dirty but not
      under active writeback.  That makes no real sense, and apparently causes
      major interactivity issues under some loads since 3.11.
      
      The ostensible reason for it was to wait for kswapd to start writing
      pages, but that seems questionable as well, since the congestion wait
      code seems to trigger for kswapd itself as well.  Also, the logic behind
      delaying anything when we haven't actually started writeback is not
      clear - it only delays actually starting that writeback.
      
      We'll still trigger the congestion waiting if
      
       (a) the process is kswapd, and we hit pages flagged for immediate
           reclaim
      
       (b) the process is not kswapd, and the zone backing dev writeback is
           actually congested.
      
      This probably needs to be revisited, but as it is this fixes a reported
      regression.
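
The two remaining conditions (a) and (b) can be summarized as a small predicate. This is only a sketch of the decision described in the message, not the code in shrink_inactive_list(): the `reclaim_state` struct, its field names, and the function name are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical summary of the state the commit message talks about. */
struct reclaim_state {
	bool is_kswapd;                   /* caller is the kswapd thread */
	bool hit_immediate_reclaim_pages; /* saw pages flagged for immediate reclaim */
	bool backing_dev_congested;       /* zone's backing dev writeback is congested */
};

/*
 * Returns true when a congestion wait is still warranted. Note that
 * "all isolated pages were dirty but not under writeback" is no longer
 * an input at all.
 */
bool should_wait_for_congestion(const struct reclaim_state *s)
{
	if (s->is_kswapd)
		return s->hit_immediate_reclaim_pages;	/* case (a) */
	return s->backing_dev_congested;		/* case (b) */
}
```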
      Reported-by: Felipe Contreras <felipe.contreras@gmail.com>
      Pinpointed-by: Hillf Danton <dhillf@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b738d764
  8. 07 Jun, 2014 14 commits
  9. 06 Jun, 2014 1 commit
  10. 05 Jun, 2014 6 commits