1. Jun 25, 2015 (31 commits)
    • memory-failure: change type of action_result's param 3 to enum · cc3e2af4
      Committed by Xie XiuQi
      Change the type of action_result's param 3 to enum for type consistency,
      and rename mf_outcome to mf_result for clarity.
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc3e2af4
    • memory-failure: export page_type and action result · cc637b17
      Committed by Xie XiuQi
      Export 'outcome' and 'action_page_type' to mm.h, so we can use these
      enums outside.
      
      This patch is preparation for adding trace events for memory-failure
      recovery action.
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc637b17
    • mm, memcg: Try charging a page before setting page up to date · eb3c24f3
      Committed by Mel Gorman
      Historically memcg overhead was high even if memcg was unused.  This has
      improved a lot but it still showed up in a profile summary as being a
      problem.
      
      /usr/src/linux-4.0-vanilla/mm/memcontrol.c                           6.6441   395842
        mem_cgroup_try_charge                                                        2.950%   175781
        __mem_cgroup_count_vm_event                                                  1.431%    85239
        mem_cgroup_page_lruvec                                                       0.456%    27156
        mem_cgroup_commit_charge                                                     0.392%    23342
        uncharge_list                                                                0.323%    19256
        mem_cgroup_update_lru_size                                                   0.278%    16538
        memcg_check_events                                                           0.216%    12858
        mem_cgroup_charge_statistics.isra.22                                         0.188%    11172
        try_charge                                                                   0.150%     8928
        commit_charge                                                                0.141%     8388
        get_mem_cgroup_from_mm                                                       0.121%     7184
      
      That is showing that 6.64% of system CPU cycles were in memcontrol.c and
      dominated by mem_cgroup_try_charge.  The annotation shows that the bulk
      of the cost was checking PageSwapCache which is expected to be cache hot
      but is very expensive.  The problem appears to be that __SetPageUptodate
      is called just before the check which is a write barrier.  It is
      required to make sure struct page and page data is written before the
      PTE is updated and the data is visible to userspace.  memcg charging does
      not need the barrier but gets unfairly hit with its cost, so this patch
      attempts the charging before the barrier.  Aside from the
      accidental cost to memcg there is the added benefit that the barrier is
      avoided if the page cannot be charged.  When applied the relevant
      profile summary is as follows.
      
      /usr/src/linux-4.0-chargefirst-v2r1/mm/memcontrol.c                  3.7907   223277
        __mem_cgroup_count_vm_event                                                  1.143%    67312
        mem_cgroup_page_lruvec                                                       0.465%    27403
        mem_cgroup_commit_charge                                                     0.381%    22452
        uncharge_list                                                                0.332%    19543
        mem_cgroup_update_lru_size                                                   0.284%    16704
        get_mem_cgroup_from_mm                                                       0.271%    15952
        mem_cgroup_try_charge                                                        0.237%    13982
        memcg_check_events                                                           0.222%    13058
        mem_cgroup_charge_statistics.isra.22                                         0.185%    10920
        commit_charge                                                                0.140%     8235
        try_charge                                                                   0.131%     7716
      
      That brings the overhead down to 3.79% and leaves the memcg fault
      accounting to the root cgroup but it's an improvement.  The difference
      in headline performance of the page fault microbench is marginal as
      memcg is such a small component of it.
      
      pft faults
                                             4.0.0                  4.0.0
                                           vanilla            chargefirst
      Hmean    faults/cpu-1 1443258.1051 (  0.00%) 1509075.7561 (  4.56%)
      Hmean    faults/cpu-3 1340385.9270 (  0.00%) 1339160.7113 ( -0.09%)
      Hmean    faults/cpu-5  875599.0222 (  0.00%)  874174.1255 ( -0.16%)
      Hmean    faults/cpu-7  601146.6726 (  0.00%)  601370.9977 (  0.04%)
      Hmean    faults/cpu-8  510728.2754 (  0.00%)  510598.8214 ( -0.03%)
      Hmean    faults/sec-1 1432084.7845 (  0.00%) 1497935.5274 (  4.60%)
      Hmean    faults/sec-3 3943818.1437 (  0.00%) 3941920.1520 ( -0.05%)
      Hmean    faults/sec-5 3877573.5867 (  0.00%) 3869385.7553 ( -0.21%)
      Hmean    faults/sec-7 3991832.0418 (  0.00%) 3992181.4189 (  0.01%)
      Hmean    faults/sec-8 3987189.8167 (  0.00%) 3986452.2204 ( -0.02%)
      
      It's only visible in the single-threaded case.  The overhead is there at
      higher thread counts but other factors dominate.
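
      A minimal sketch of the reordering in an anonymous-fault-style path (the
      surrounding error handling is illustrative, not the actual diff):

        struct mem_cgroup *memcg;
        struct page *page = alloc_zeroed_user_highpage_movable(vma, address);

        if (!page)
                return VM_FAULT_OOM;

        /* Charge first: if this fails, the barrier below is never paid for. */
        if (mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg)) {
                page_cache_release(page);
                return VM_FAULT_OOM;
        }

        /* The write barrier ordering page data before the PTE happens here. */
        __SetPageUptodate(page);

        /* ... set up the PTE, then ... */
        mem_cgroup_commit_charge(page, memcg, false);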
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb3c24f3
    • hugetlb: do not account hugetlb pages as NR_FILE_PAGES · 4165b9b4
      Committed by Michal Hocko
      hugetlb pages use add_to_page_cache to track shared mappings.  This is
      OK from the data structure point of view but less so for the
      NR_FILE_PAGES accounting:
      
      	- huge pages are accounted as 4k which is clearly wrong
      	- this counter is used as the amount of the reclaimable page
      	  cache which is incorrect as well because hugetlb pages are
      	  special and not reclaimable
      	- the counter is then exported to userspace via /proc/meminfo
      	  (in Cached:), /proc/vmstat and /proc/zoneinfo as
      	  nr_file_pages which is confusing at least:
      	  Cached:          8883504 kB
      	  HugePages_Free:     8348
      	  ...
      	  Cached:          8916048 kB
      	  HugePages_Free:      156
      	  ...
      	  that's 8192 huge pages allocated, which is ~16G accounted as 32M
      
      There are usually not that many huge pages in the system for this to
      make any visible difference e.g.  by fooling __vm_enough_memory or
      zone_pagecache_reclaimable.
      
      Fix this by special casing huge pages in both __delete_from_page_cache
      and __add_to_page_cache_locked.  replace_page_cache_page is currently
      only used by fuse and that shouldn't touch hugetlb pages AFAICS but it
      is more robust to check for special casing there as well.
      
      Hugetlb pages shouldn't get to any other paths where we do accounting:
      	- migration - we have a special handling via
      	  hugetlbfs_migrate_page
      	- shmem - doesn't handle hugetlb pages directly even for
      	  SHM_HUGETLB resp. MAP_HUGETLB
      	- swapcache - hugetlb is not swapable
      
      This has a user visible effect but I believe it is reasonable because the
      previously exported number is simply bogus.
      
      An alternative would be to account hugetlb pages with their real size and
      treat them similar to shmem.  But this has some drawbacks.
      
      First we would have to special case in kernel users of NR_FILE_PAGES and
      considering how hugetlb is special we would have to do it everywhere.  We
      do not want Cached exported by /proc/meminfo to include it because the
      value would be even more misleading.
      
      __vm_enough_memory and zone_pagecache_reclaimable would have to do the
      same thing because those pages are simply not reclaimable.  The correction
      is even not trivial because we would have to consider all active hugetlb
      page sizes properly.  Users of the counter outside of the kernel would
      have to do the same.
      
      So the question is why account something that basically needs to be
      excluded for every reasonable usage.  This doesn't make much sense to me.
      
      It seems that this has been broken since hugetlb was introduced but I
      haven't checked the whole history.
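
      A sketch of the special casing described above (illustrative shape, not
      the exact diff): the page cache add/delete paths simply skip the counter
      for hugetlb pages.

        /* in __delete_from_page_cache(): */
        if (!PageHuge(page))
                __dec_zone_page_state(page, NR_FILE_PAGES);

        /* in __add_to_page_cache_locked(), after the radix tree insert: */
        if (!PageHuge(page))
                __inc_zone_page_state(page, NR_FILE_PAGES);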
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4165b9b4
    • mm: page_alloc: inline should_alloc_retry() · 9083905a
      Committed by Johannes Weiner
      The should_alloc_retry() function was meant to encapsulate retry
      conditions of the allocator slowpath, but there are still checks
      remaining in the main function, and much of how the retrying is
      performed also depends on the OOM killer progress.  The physical
      separation of those conditions makes the code hard to follow.
      
      Inline the should_alloc_retry() checks.  Notes:
      
      - The __GFP_NOFAIL check is already done in __alloc_pages_may_oom(),
        replace it with looping on OOM killer progress
      
      - The pm_suspended_storage() check is meant to skip the OOM killer
        when reclaim has no IO available, move to __alloc_pages_may_oom()
      
      - The order <= PAGE_ALLOC_COSTLY_ORDER check is reunited with its original
        counterpart of checking whether reclaim actually made any progress
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9083905a
    • mm: oom_kill: simplify OOM killer locking · dc56401f
      Committed by Johannes Weiner
      The zonelist locking and the oom_sem are two overlapping locks that are
      used to serialize global OOM killing against different things.
      
      The historical zonelist locking serializes OOM kills from allocations with
      overlapping zonelists against each other to prevent killing more tasks
      than necessary in the same memory domain.  Only when neither tasklists nor
      zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
      bound to separate nodes) are OOM kills allowed to execute in parallel.
      
      The younger oom_sem is a read-write lock to serialize OOM killing against
      the PM code trying to disable the OOM killer altogether.
      
      However, the OOM killer is a fairly cold error path, there is really no
      reason to optimize for highly performant and concurrent OOM kills.  And
      the oom_sem is just flat-out redundant.
      
      Replace both locking schemes with a single global mutex serializing OOM
      kills regardless of context.
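
      A sketch of the resulting scheme (assumed shape; the real callers and the
      OOM-disable path carry more detail):

        static DEFINE_MUTEX(oom_lock);

        /* allocation slowpath */
        if (mutex_trylock(&oom_lock)) {
                /* at most one OOM kill proceeds at a time */
                out_of_memory(...);
                mutex_unlock(&oom_lock);
        }

        /* PM path disabling the killer takes the same mutex */
        mutex_lock(&oom_lock);
        oom_killer_disabled = true;
        mutex_unlock(&oom_lock);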
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc56401f
    • mm: oom_kill: remove unnecessary locking in exit_oom_victim() · da51b14a
      Committed by Johannes Weiner
      Disabling the OOM killer needs to exclude allocators from entering, not
      existing victims from exiting.
      
      Right now the only waiter is suspend code, which achieves quiescence by
      disabling the OOM killer.  But later on we want to add waits that hold
      the lock instead to stop new victims from showing up.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da51b14a
    • mm: oom_kill: generalize OOM progress waitqueue · c38f1025
      Committed by Johannes Weiner
      It turns out that the mechanism to wait for exiting OOM victims is less
      generic than it looks: it won't issue wakeups unless the OOM killer is
      disabled.
      
      The reason this check was added was the thought that, since only the OOM
      disabling code would wait on this queue, wakeup operations could be
      saved when that specific consumer is known to be absent.
      
      However, this is quite the handgrenade.  Later attempts to reuse the
      waitqueue for other purposes will lead to completely unexpected bugs and
      the failure mode will appear seemingly illogical.  Generally, providers
      shouldn't make unnecessary assumptions about consumers.
      
      This could have been replaced with waitqueue_active(), but it only saves
      a few instructions in one of the coldest paths in the kernel.  Simply
      remove it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c38f1025
    • mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear · 46402778
      Committed by Johannes Weiner
      exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody else
      can clear it concurrently.  Use clear_thread_flag() directly.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      46402778
    • mm: oom_kill: clean up victim marking and exiting interfaces · 16e95196
      Committed by Johannes Weiner
      Rename unmark_oom_victim() to exit_oom_victim().  Marking and unmarking
      are related in functionality, but the interface is not symmetrical at
      all: one is an internal OOM killer function used during the killing, the
      other is for an OOM victim to signal its own death on exit later on.
      This has locking implications, see follow-up changes.
      
      While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
      is easier on the eye.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16e95196
    • mm: oom_kill: remove unnecessary locking in oom_enable() · 3f5ab8cf
      Committed by Johannes Weiner
      Setting oom_killer_disabled to false is atomic, there is no need for
      further synchronization with ongoing allocations trying to OOM-kill.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f5ab8cf
    • mm/memory hotplug: init the zone's size when calculating node totalpages · febd5949
      Committed by Gu Zheng
      Init the zone's size when calculating node totalpages to avoid duplicated
      operations in free_area_init_core().
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      febd5949
    • mm/hugetlb: introduce minimum hugepage order · 641844f5
      Committed by Naoya Horiguchi
      Currently the initial value of order in dissolve_free_huge_page is 64 or
      32, which leads to the following warning in static checker:
      
        mm/hugetlb.c:1203 dissolve_free_huge_pages()
        warn: potential right shift more than type allows '9,18,64'
      
      This is a potential risk of an infinite loop, because 1 << order (== 0) is
      used in a for-loop like this:
      
        for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order)
            ...
      
      So this patch fixes it by using global minimum_order calculated at boot time.
      
          text    data     bss     dec     hex filename
         28313     469   84236  113018   1b97a mm/hugetlb.o
         28256     473   84236  112965   1b945 mm/hugetlb.o (patched)
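
      A sketch of the idea (illustrative shape): record the smallest hstate
      order at boot and use it as the scan stride, so the stride can never
      degenerate to 1 << 64 == 0.

        static unsigned int minimum_order __read_mostly = UINT_MAX;

        /* while registering each hstate at boot: */
        if (order < minimum_order)
                minimum_order = order;

        /* in dissolve_free_huge_pages(): */
        for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order)
                dissolve_free_huge_page(pfn_to_page(pfn));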
      
      Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      641844f5
    • rmap: fix theoretical race between do_wp_page and shrink_active_list · 414e2fb8
      Committed by Vladimir Davydov
      As noted by Paul the compiler is free to store a temporary result in a
      variable on stack, heap or global unless it is explicitly marked as
      volatile, see:
      
        http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html#sample-optimizations
      
      This can result in a race between do_wp_page() and shrink_active_list()
      as follows.
      
      In do_wp_page() we can call page_move_anon_rmap(), which sets
      page->mapping as follows:
      
        anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
        page->mapping = (struct address_space *) anon_vma;
      
      The page in question may be on an LRU list, because nowhere in
      do_wp_page() do we remove it from the list, nor do we take any
      LRU-related locks.  Although the page is locked, shrink_active_list() can
      still call page_referenced() on it concurrently, because the latter does
      not require an anonymous page to be locked:
      
        CPU0                          CPU1
        ----                          ----
        do_wp_page                    shrink_active_list
         lock_page                     page_referenced
                                        PageAnon->yes, so skip trylock_page
         page_move_anon_rmap
          page->mapping = anon_vma
                                        rmap_walk
                                         PageAnon->no
                                         rmap_walk_file
                                          BUG
          page->mapping += PAGE_MAPPING_ANON
      
      This patch fixes the race by explicitly forbidding the compiler from
      splitting the page->mapping store in page_move_anon_rmap(), with the aid
      of WRITE_ONCE().
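
      A sketch of the fixed store in page_move_anon_rmap() (shape as described
      above):

        anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
        /*
         * Publish anon_vma and the PAGE_MAPPING_ANON bit in a single store:
         * a concurrent rmap walk must never observe a half-updated
         * page->mapping.
         */
        WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);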
      
      [akpm@linux-foundation.org: tweak comment, per Minchan]
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      414e2fb8
    • mm/memory-failure: me_huge_page() does nothing for thp · 2491ffee
      Committed by Naoya Horiguchi
      memory_failure() is not supposed to handle thp itself, but to split it.
      But if something goes wrong and page_action() is called on a thp,
      me_huge_page() (the action routine for hugepages) had better take no
      action, rather than take the wrong action prepared for hugetlb (which
      triggers BUG_ON()).
      
      This change guards against a potential problem, but makes sense to me
      because thp is an actively developed feature and this code path could be
      reached in the future.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2491ffee
    • mm: soft-offline: don't free target page in successful page migration · add05cec
      Committed by Naoya Horiguchi
      Stress testing showed that soft offline events for a process iterating
      "mmap-pagefault-munmap" loop can trigger
      VM_BUG_ON(PAGE_FLAGS_CHECK_AT_PREP) in __free_one_page():
      
        Soft offlining page 0x70fe1 at 0x70100008d000
        Soft offlining page 0x705fb at 0x70300008d000
        page:ffffea0001c3f840 count:0 mapcount:0 mapping:          (null) index:0x2
        flags: 0x1fffff80800000(hwpoison)
        page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 25) - 1))
        ------------[ cut here ]------------
        kernel BUG at /src/linux-dev/mm/page_alloc.c:585!
        invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
        Modules linked in: cfg80211 rfkill crc32c_intel microcode ppdev parport_pc pcspkr serio_raw virtio_balloon parport i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy
        CPU: 3 PID: 1779 Comm: test_base_madv_ Not tainted 4.0.0-v4.0-150511-1451-00009-g82360a3730e6 #139
        RIP: free_pcppages_bulk+0x52a/0x6f0
        Call Trace:
          drain_pages_zone+0x3d/0x50
          drain_local_pages+0x1d/0x30
          on_each_cpu_mask+0x46/0x80
          drain_all_pages+0x14b/0x1e0
          soft_offline_page+0x432/0x6e0
          SyS_madvise+0x73c/0x780
          system_call_fastpath+0x12/0x17
        Code: ff 89 45 b4 48 8b 45 c0 48 83 b8 a8 00 00 00 00 0f 85 e3 fb ff ff 0f 1f 00 0f 0b 48 8b 7d 90 48 c7 c6 e8 95 a6 81 e8 e6 32 02 00 <0f> 0b 8b 45 cc 49 89 47 30 41 8b 47 18 83 f8 ff 0f 85 10 ff ff
        RIP  [<ffffffff811a806a>] free_pcppages_bulk+0x52a/0x6f0
         RSP <ffff88007a117d28>
        ---[ end trace 53926436e76d1f35 ]---
      
      When soft offline successfully migrates a page, the source page is
      supposed to be freed.  But there is a race condition where a source page
      looks isolated (i.e. the refcount is 0 and PageHWPoison is set) but is
      still linked to the pcplist.  Then another soft offline event calls
      drain_all_pages() and tries to free such a hwpoisoned page, which is
      forbidden.
      
      This odd page state seems to happen due to the race between put_page() in
      putback_lru_page() and __pagevec_lru_add_fn().  But I don't want to play
      with tweaking drain code as done in commit 9ab3b598 "mm: hwpoison:
      drop lru_add_drain_all() in __soft_offline_page()", or to change page
      freeing code for this soft offline's purpose.
      
      Instead, let's think about the difference between hard offline and soft
      offline.  There is an interesting difference in how to isolate the in-use
      page between these, that is, hard offline marks PageHWPoison of the target
      page at first, and doesn't free it by keeping its refcount 1.  OTOH, soft
      offline tries to free the target page then marks PageHWPoison.  This
      difference might be the source of complexity and result in bugs like the
      above.  So making soft offline isolate with keeping refcount can be a
      solution for this problem.
      
      We can pass the page migration code a "reason" which identifies the
      caller, so let's use it to avoid calling putback_lru_page() when called
      from soft offline, which effectively does the isolation for soft offline.
      With this change, target pages of soft offline are never reused without
      changing the migratetype, so this patch also removes the related code.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      add05cec
    • mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling · ead07f6a
      Committed by Naoya Horiguchi
      memory_failure() can run in 2 different modes (specified by
      MF_COUNT_INCREASED) from a page refcount perspective.  When
      MF_COUNT_INCREASED is set, memory_failure() assumes that the caller
      takes a refcount on the target page.  And if cleared, memory_failure()
      takes it on its own.
      
      In current code, however, refcounting is done differently in each caller.
      For example, madvise_hwpoison() uses get_user_pages_fast() and
      hwpoison_inject() uses get_page_unless_zero().  So this inconsistent
      refcounting causes refcount failure especially for thp tail pages.
      Typical user visible effects are like memory leak or
      VM_BUG_ON_PAGE(!page_count(page)) in isolate_lru_page().
      
      To fix this refcounting issue, this patch introduces get_hwpoison_page()
      to handle thp tail pages in the same manner for each caller of hwpoison
      code.
      
      memory_failure() might fail to split a thp and in such a case it returns
      without completing page isolation.  This is not good because PageHWPoison
      on the thp is still set and there's no easy way to unpoison such thps.  So
      this patch tries to roll back any action on the thp in the "non anonymous
      thp" case and the "thp split failed" case, expecting that an MCE (SRAR)
      generated by a later access will properly free such thps.
      
      [akpm@linux-foundation.org: fix CONFIG_HWPOISON_INJECT=m]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ead07f6a
    • mm/memory-failure: split thp earlier in memory error handling · 415c64c1
      Committed by Naoya Horiguchi
      memory_failure() doesn't handle thp itself at this time and needs to
      split it before doing isolation.  Currently thp is split in the middle of
      hwpoison_user_mappings(), but there are corner cases where
      memory_failure() wrongly tries to handle thp without splitting.
      
      1) "non anonymous" thp, which is not a normal operating mode of thp,
         but a memory error could hit a thp before anon_vma is initialized.  In
         such case, split_huge_page() fails and me_huge_page() (intended for
         hugetlb) is called for thp, which triggers BUG_ON in page_hstate().
      
      2) !PageLRU case, where hwpoison_user_mappings() returns with
         SWAP_SUCCESS and the result is the same as case 1.
      
      memory_failure() can't avoid splitting, so let's split the thp earlier,
      which also reduces the code that has to be prepared for both normal pages
      and thp.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      415c64c1
    • mm: rename RECLAIM_SWAP to RECLAIM_UNMAP · 95bbc0c7
      Committed by Zhihui Zhang
      The name SWAP implies that we are dealing with anonymous pages only.  In
      fact, the original patch that introduced the min_unmapped_ratio logic
      was to fix an issue related to file pages.  Rename it to RECLAIM_UNMAP
      to match what it does.
      
      Historically, commit a6dc60f8 ("vmscan: rename sc.may_swap to
      may_unmap") renamed .may_swap to .may_unmap, leaving RECLAIM_SWAP
      behind.  commit 2e2e4259 ("vmscan,memcg: reintroduce sc->may_swap")
      reintroduced .may_swap for memory controller.
      Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      95bbc0c7
    • mm: vmscan: do not throttle based on pfmemalloc reserves if node has no reclaimable pages · f012a84a
      Committed by Nishanth Aravamudan
      Based upon 675becce ("mm: vmscan: do not throttle based on pfmemalloc
      reserves if node has no ZONE_NORMAL") from Mel.
      
      We have a system with the following topology:
      
      # numactl -H
      available: 3 nodes (0,2-3)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
      23 24 25 26 27 28 29 30 31
      node 0 size: 28273 MB
      node 0 free: 27323 MB
      node 2 cpus:
      node 2 size: 16384 MB
      node 2 free: 0 MB
      node 3 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
      node 3 size: 30533 MB
      node 3 free: 13273 MB
      node distances:
      node   0   2   3
        0:  10  20  20
        2:  20  10  20
        3:  20  20  10
      
      Node 2 has no free memory, because:
      # cat /sys/devices/system/node/node2/hugepages/hugepages-16777216kB/nr_hugepages
      1
      
      This leads to the following zoneinfo:
      
      Node 2, zone      DMA
        pages free     0
              min      1840
              low      2300
              high     2760
              scanned  0
              spanned  262144
              present  262144
              managed  262144
      ...
        all_unreclaimable: 1
      
      If one then attempts to allocate some normal 16M hugepages via
      
      echo 37 > /proc/sys/vm/nr_hugepages
      
      The echo never returns and kswapd2 consumes CPU cycles.
      
      This is because throttle_direct_reclaim ends up calling
      wait_event(pfmemalloc_wait, pfmemalloc_watermark_ok...).
      pfmemalloc_watermark_ok() in turn checks all zones on the node if there
      are any reserves, and if so, then indicates the watermarks are ok, by
      seeing if there are sufficient free pages.
      
      675becce added a condition already for memoryless nodes.  In this case,
      though, the node has memory, it is just all consumed (and not
      reclaimable).  Effectively, though, the result is the same on this call to
      pfmemalloc_watermark_ok() and thus seems like a reasonable additional
      condition.
      
      With this change, the afore-mentioned 16M hugepage allocation attempt
      succeeds and correctly round-robins between Nodes 1 and 3.
      Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f012a84a
    • mm/page_alloc.c: cleanup obsolete KM_USER* · f4d2897b
      Committed by Anisse Astier
      It's been five years now since the KM_* kmap flags were removed and we
      can call clear_highpage from any context.  So we remove prep_zero_pages
      accordingly.
      Signed-off-by: Anisse Astier <anisse@astier.eu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f4d2897b
    • mm: drop bogus VM_BUG_ON_PAGE assert in put_page() codepath · 73933b33
      Committed by Kirill A. Shutemov
      My commit 8d63d99a ("mm: avoid tail page refcounting on non-THP
      compound pages"), which was merged during the 4.1 merge window, caused a
      regression:
      
        page:ffffea0010a15040 count:0 mapcount:1 mapping:          (null) index:0x0
        flags: 0x8000000000008014(referenced|dirty|tail)
        page dumped because: VM_BUG_ON_PAGE(page_mapcount(page) != 0)
        ------------[ cut here ]------------
        kernel BUG at mm/swap.c:134!
      
      The problem can be reproduced by playing *two* audio files at the same
      time and then stopping one of players.  I used two mplayers to trigger
      this.
      
      The VM_BUG_ON_PAGE() which triggers the bug is bogus:
      
      The sound subsystem uses compound pages for its buffers, but unlike most
      __GFP_COMP users, sound maps compound pages to userspace with PTEs.
      
      In our case the two players map the buffer twice and therefore elevate
      page_mapcount() on the tail pages by two.  When one of the players exits,
      it unmaps the VMA, drops page_mapcount() to one and tries to release a
      reference on the page with put_page().
      
      My commit changes which path is taken under put_compound_page().  It hits
      put_unrefcounted_compound_page(), where the VM_BUG_ON_PAGE() is.  It sees
      page_mapcount() == 1.  The function wrongly assumes that subpages of a
      compound page cannot be mapped by themselves with PTEs.
      
      The solution is to simply drop the VM_BUG_ON_PAGE().
      
      Note: there's no need to move the check under put_page_testzero().
      Allocator will check the mapcount by itself before putting on free list.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Borislav Petkov <bp@alien8.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73933b33
    • mm: only define hashdist variable when needed · a9919c79
      Committed by Rasmus Villemoes
      For !CONFIG_NUMA, hashdist will always be 0, since its setter is
      otherwise compiled out.  So we can save 4 bytes of data and some .text
      (although mostly in __init functions) by only defining it for
      CONFIG_NUMA.
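
      A sketch of the resulting arrangement (assumed shape):

        #ifdef CONFIG_NUMA
        int hashdist = HASHDIST_DEFAULT;

        static int __init set_hashdist(char *str)
        {
                if (!str)
                        return 0;
                hashdist = simple_strtoul(str, &str, 0);
                return 1;
        }
        __setup("hashdist=", set_hashdist);
        #else
        /* !NUMA builds see a compile-time constant 0 */
        #define hashdist (0)
        #endif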
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9919c79
    • mm: new arch_remap() hook · 4abad2ca
      Committed by Laurent Dufour
      Some architectures would like to be triggered when a memory area is moved
      through the mremap system call.
      
      This patch introduces a new arch_remap() mm hook which is placed in the
      path of mremap, and is called before the old area is unmapped (and the
      arch_unmap() hook is called).
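
      A sketch of what the default (no-op) hook and its call site might look
      like; the exact argument list here is an assumption:

        #ifndef arch_remap
        static inline void arch_remap(struct mm_struct *mm,
                                      unsigned long old_start, unsigned long old_end,
                                      unsigned long new_start, unsigned long new_end)
        {
                /* architectures that care override this to fix up per-mm state */
        }
        #endif

        /* in move_vma(), before the old range is unmapped: */
        arch_remap(mm, old_addr, old_addr + old_len,
                   new_addr, new_addr + new_len);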
      Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4abad2ca
    • mm/hugetlb: reduce arch dependent code about huge_pmd_unshare · e81f2d22
      Committed by Zhang Zhen
      Currently we have many duplicates in definitions of huge_pmd_unshare.  In
      all architectures this function just returns 0 when
      CONFIG_ARCH_WANT_HUGE_PMD_SHARE is N.
      
      This patch puts the default implementation in mm/hugetlb.c and lets these
      architectures use the common code.
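
      A sketch of the consolidated fallback (assumed shape of the common
      definition in mm/hugetlb.c):

        #ifndef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
        int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
        {
                return 0;       /* PMD sharing not supported: nothing to unshare */
        }
        #endif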
      Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: James Yang <James.Yang@freescale.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e81f2d22
    • mm: fix mprotect() behaviour on VM_LOCKED VMAs · 36f88188
      Committed by Kirill A. Shutemov
      On mlock(2) we trigger COW on private writable VMA to avoid faults in
      future.
      
      mm/gup.c, populate_vma_page_range():
      
        long populate_vma_page_range(struct vm_area_struct *vma,
                        unsigned long start, unsigned long end, int *nonblocking)
        {
        ...
                 * We want to touch writable mappings with a write fault in order
                 * to break COW, except for shared mappings because these don't COW
                 * and we would not want to dirty them for nothing.
                 */
                if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
                        gup_flags |= FOLL_WRITE;
      
      But we miss this case when we make VM_LOCKED VMA writeable via
      mprotect(2). The test case:
      
      	#define _GNU_SOURCE
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/resource.h>
      	#include <sys/stat.h>
      	#include <sys/time.h>
      	#include <sys/types.h>
      
      	#define PAGE_SIZE 4096
      
      	int main(int argc, char **argv)
      	{
      		struct rusage usage;
      		long before;
      		char *p;
      		int fd;
      
      		/* Create a file and populate first page of page cache */
      		fd = open("/tmp", O_TMPFILE | O_RDWR, S_IRUSR | S_IWUSR);
      		write(fd, "1", 1);
      
      		/* Create a *read-only* *private* mapping of the file */
      		p = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
      
      		/*
      		 * Since the mapping is read-only, mlock() will populate the mapping
      		 * with PTEs pointing to page cache without triggering COW.
      		 */
      		mlock(p, PAGE_SIZE);
      
      		/*
      		 * Mapping became read-write, but it's still populated with PTEs
      		 * pointing to page cache.
      		 */
      		mprotect(p, PAGE_SIZE, PROT_READ | PROT_WRITE);
      
      		getrusage(RUSAGE_SELF, &usage);
      		before = usage.ru_minflt;
      
      		/* Trigger COW: fault in mlock()ed VMA. */
      		*p = 1;
      
      		getrusage(RUSAGE_SELF, &usage);
      		printf("faults: %ld\n", usage.ru_minflt - before);
      
      		return 0;
      	}
      
      	$ ./test
      	faults: 1
      
      Let's fix it by triggering population of the VMA in mprotect_fixup() on
      this condition.  We don't care about population errors, as we don't in
      other similar cases, i.e. mremap.
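
      A sketch of the change in mprotect_fixup() (assumed shape; the exact flag
      test in the upstream patch may differ):

        /*
         * If the VMA is locked and has just become writable, populate it
         * with a write fault so COW is broken up front, as mlock(2) would
         * have done for an already-writable mapping.
         */
        if ((newflags & VM_LOCKED) && (newflags & VM_WRITE) &&
            !(oldflags & VM_WRITE))
                populate_vma_page_range(vma, start, end, NULL);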
      
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      36f88188
    • thp: cleanup how khugepaged enters freezer · cd092411
      Committed by Jiri Kosina
      khugepaged_do_scan() checks in every iteration whether freezing(current)
      is true, and in such case breaks out of the loop, which causes
      try_to_freeze() to be called immediately afterwards in
      khugepaged_wait_work().
      
      If nothing else, this causes an unnecessary freezing(current) test, and
      also makes the way khugepaged enters the freezer a bit less obvious than
      necessary.
      
      Let's just try to freeze directly, instead of splitting it into two
      (directly adjacent) phases.
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd092411
    • mm, hwpoison: remove obsolete "Notebook" todo list · ebb09738
      Committed by Andi Kleen
      All the items mentioned here have been either addressed, or were not
      really needed.  So just remove the comment.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebb09738
    • mm, hwpoison: add comment describing when to add new cases · e0de78df
      Committed by Andi Kleen
      Here's another comment fix for hwpoison.
      
      It describes the "guiding principle" on when to add new
      memory error recovery code.
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0de78df
    • slab: correct size_index table before replacing the bootstrap kmem_cache_node · 34cc6990
      Committed by Daniel Sanders
      This patch moves the initialization of the size_index table slightly
      earlier so that the first few kmem_cache_node's can be safely allocated
      when KMALLOC_MIN_SIZE is large.
      
      There are currently two ways to generate indices into kmalloc_caches (via
      kmalloc_index() and via the size_index table in slab_common.c) and on some
      arches (possibly only MIPS) they potentially disagree with each other
      until create_kmalloc_caches() has been called.  It seems that the
      intention is that the size_index table is a fast equivalent to
      kmalloc_index() and that create_kmalloc_caches() patches the table to
      return the correct value for the cases where kmalloc_index()'s
      if-statements apply.
      
      The failing sequence was:
      * kmalloc_caches contains NULL elements
      * kmem_cache_init initialises the element that 'struct
        kmem_cache_node' will be allocated to. For 32-bit Mips, this is a
        56-byte struct and kmalloc_index returns KMALLOC_SHIFT_LOW (7).
      * init_list is called which calls kmalloc_node to allocate a 'struct
        kmem_cache_node'.
      * kmalloc_slab selects the kmem_caches element using
        size_index[size_index_elem(size)]. For MIPS, size is 56, and the
        expression returns 6.
      * This element of kmalloc_caches is NULL and allocation fails.
      * If it had not already failed, it would have called
        create_kmalloc_caches() at this point which would have changed
        size_index[size_index_elem(size)] to 7.
      
      I don't believe the bug to be LLVM specific but GCC doesn't normally
      encounter the problem.  I haven't been able to identify exactly what GCC
      is doing better (probably inlining) but it seems that GCC is managing to
      optimize to the point that it eliminates the problematic allocations.
      This theory is supported by the fact that GCC can be made to fail in the
      same way by changing inline, __inline, __inline__, and __always_inline in
      include/linux/compiler-gcc.h such that they don't actually inline things.
      Signed-off-by: Daniel Sanders <daniel.sanders@imgtec.com>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34cc6990
    • mm/slab_common: support the slub_debug boot option on specific object size · 4066c33d
      Committed by Gavin Guo
      slub_debug=PU,kmalloc-xx cannot work because in create_kmalloc_caches()
      the s->name is created after create_kmalloc_cache() is called.  The name
      is NULL in create_kmalloc_cache(), so kmem_cache_flags() would not set
      the slub_debug flags in s->flags.  The fix here sets up a kmalloc_names
      string array for initialization purposes and deletes the dynamic name
      creation of kmalloc_caches.
      
      [akpm@linux-foundation.org: s/kmalloc_names/kmalloc_info/, tweak comment text]
      Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4066c33d
  2. Jun 18, 2015 (1 commit)
  3. Jun 11, 2015 (4 commits)
    • zsmalloc: fix a null pointer dereference in destroy_handle_cache() · 02f7b414
      Committed by Sergey Senozhatsky
      If zs_create_pool()->create_handle_cache()->kmem_cache_create() or
      pool->name allocation fails, zs_create_pool()->destroy_handle_cache()
      will dereference the NULL pool->handle_cachep.
      
      Modify destroy_handle_cache() to avoid this.
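
      A sketch of the guarded teardown (assumed shape):

        static void destroy_handle_cache(struct zs_pool *pool)
        {
                if (pool->handle_cachep)
                        kmem_cache_destroy(pool->handle_cachep);
        }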
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      02f7b414
    • mm: memcontrol: fix false-positive VM_BUG_ON() on -rt · f371763a
      Committed by Johannes Weiner
      On -rt, the VM_BUG_ON(!irqs_disabled()) triggers inside the memcg
      swapout path because the spin_lock_irq(&mapping->tree_lock) in the
      caller doesn't actually disable the hardware interrupts - which is fine,
      because on -rt the tophalves run in process context and so we are still
      safe from preemption while updating the statistics.
      
      Remove the VM_BUG_ON() but keep the comment of what we rely on.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Clark Williams <williams@redhat.com>
      Cc: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f371763a
    • memcg: do not call reclaim if !__GFP_WAIT · 7d638093
      Committed by Vladimir Davydov
      When trimming memcg consumption excess (see memory.high), we call
      try_to_free_mem_cgroup_pages without checking if we are allowed to sleep
      in the current context, which can result in a deadlock.  Fix this.
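
      A sketch of the check (assumed shape): only reclaim when the allocation
      context is allowed to sleep.

        if (gfp_mask & __GFP_WAIT)
                try_to_free_mem_cgroup_pages(memcg, nr_pages,
                                             gfp_mask, true);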
      
      Fixes: 241994ed ("mm: memcontrol: default hierarchy interface for memory")
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d638093
    • mm/memory_hotplug.c: set zone->wait_table to null after freeing it · 85bd8399
      Committed by Gu Zheng
      Izumi found the following oops when hot re-adding a node:
      
          BUG: unable to handle kernel paging request at ffffc90008963690
          IP: __wake_up_bit+0x20/0x70
          Oops: 0000 [#1] SMP
          CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
          Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015
          task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000
          RIP: 0010:[<ffffffff810dff80>]  [<ffffffff810dff80>] __wake_up_bit+0x20/0x70
          RSP: 0018:ffff880017b97be8  EFLAGS: 00010246
          RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9
          RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648
          RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000
          R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800
          R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000
          FS:  00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0
          Call Trace:
            unlock_page+0x6d/0x70
            generic_write_end+0x53/0xb0
            xfs_vm_write_end+0x29/0x80 [xfs]
            generic_perform_write+0x10a/0x1e0
            xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs]
            xfs_file_write_iter+0x79/0x120 [xfs]
            __vfs_write+0xd4/0x110
            vfs_write+0xac/0x1c0
            SyS_write+0x58/0xd0
            system_call_fastpath+0x12/0x76
          Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 <48> 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
          RIP  [<ffffffff810dff80>] __wake_up_bit+0x20/0x70
           RSP <ffff880017b97be8>
          CR2: ffffc90008963690
      
      Reproduce method (re-add a node):
        Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic)
      
      This seems to be a use-after-free problem, and the root cause is that
      zone->wait_table was not set to *NULL* after freeing it in
      try_offline_node().
      
      When hot re-adding a node, we reuse its pgdat, and so the zone structs
      too; when adding pages to the target zone, the zone is initialized first
      (including the wait_table) if it is not already initialized.  The
      judgement of whether a zone is initialized is based on zone->wait_table:
      
      	static inline bool zone_is_initialized(struct zone *zone)
      	{
      		return !!zone->wait_table;
      	}
      
      so if we do not set zone->wait_table to *NULL* after freeing it, the
      memory hotplug routine will skip the init of the new zone when hot
      re-adding the node, and the wait_table still points to the freed memory.
      We then access the invalid address when trying to wake up the waiters
      after the I/O operation on the page is done, as in the oops above.
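
      A sketch of the fix in try_offline_node() (assumed shape):

        if (zone->wait_table) {
                vfree(zone->wait_table);
                /* let zone_is_initialized() see the zone as uninitialized again */
                zone->wait_table = NULL;
        }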
      Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Reported-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      85bd8399
  4. May 29, 2015 (1 commit)
  5. May 19, 2015 (1 commit)
    • sched/preempt, mm/fault: Trigger might_sleep() in might_fault() with disabled pagefaults · 9ec23531
      Committed by David Hildenbrand
      Commit 662bbcb2 ("mm, sched: Allow uaccess in atomic with
      pagefault_disable()") removed might_sleep() checks for all user access
      code (that uses might_fault()).
      
      The reason was to disable wrong "sleep in atomic" warnings in the
      following scenario:
      
          pagefault_disable()
          rc = copy_to_user(...)
          pagefault_enable()
      
      Which is valid, as pagefault_disable() increments the preempt counter
      and therefore disables the pagefault handler. copy_to_user() will not
      sleep and return an error code if a page is not available.
      
      However, as all might_sleep() checks are removed,
      CONFIG_DEBUG_ATOMIC_SLEEP would no longer detect the following scenario:
      
          spin_lock(&lock);
          rc = copy_to_user(...)
          spin_unlock(&lock)
      
      If the kernel is compiled with preemption turned on, preempt_disable()
      will make in_atomic() detect disabled preemption. The fault handler would
      correctly never sleep on user access.
      However, with preemption turned off, preempt_disable() is usually a NOP
      (with !CONFIG_PREEMPT_COUNT), therefore in_atomic() will not be able to
      detect disabled preemption nor disabled pagefaults. The fault handler
      could sleep.
      We really want to enable CONFIG_DEBUG_ATOMIC_SLEEP checks for user access
      functions again, otherwise we can end up with horrible deadlocks.
      
      Root of all evil is that pagefault_disable() acts almost as
      preempt_disable(), depending on preemption being turned on/off.
      
      As we now have pagefault_disabled(), we can use it to distinguish
      whether user access functions might sleep.
      
      Convert might_fault() into a macro that calls __might_fault(), to
      allow proper file + line messages in case of a might_sleep() warning.
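
      A sketch of the converted interface (assumed shape; the real
      __might_fault() also keeps the existing mmap_sem lockdep check):

        #define might_fault() __might_fault(__FILE__, __LINE__)

        void __might_fault(const char *file, int line)
        {
                /* uaccess cannot sleep while pagefaults are disabled */
                if (pagefault_disabled())
                        return;
                __might_sleep(file, line, 0);
        }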
      Reviewed-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David.Laight@ACULAB.COM
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: airlied@linux.ie
      Cc: akpm@linux-foundation.org
      Cc: benh@kernel.crashing.org
      Cc: bigeasy@linutronix.de
      Cc: borntraeger@de.ibm.com
      Cc: daniel.vetter@intel.com
      Cc: heiko.carstens@de.ibm.com
      Cc: herbert@gondor.apana.org.au
      Cc: hocko@suse.cz
      Cc: hughd@google.com
      Cc: mst@redhat.com
      Cc: paulus@samba.org
      Cc: ralf@linux-mips.org
      Cc: schwidefsky@de.ibm.com
      Cc: yang.shi@windriver.com
      Link: http://lkml.kernel.org/r/1431359540-32227-3-git-send-email-dahi@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9ec23531
  6. May 15, 2015 (2 commits)
    • mm, numa: really disable NUMA balancing by default on single node machines · b0dc2b9b
      Committed by Mel Gorman
      NUMA balancing is meant to be disabled by default on UMA machines but
      the check is using nr_node_ids (highest node) instead of
      num_online_nodes (online nodes).
      
      The consequences are that a UMA machine with a node ID of 1 or higher
      will enable NUMA balancing.  This will incur useless overhead due to
      minor faults, with the impact depending on the workload.  This is the
      impact on the stats when running a kernel build on a single-node machine
      whose node ID happened to be 1:
      
        			       vanilla     patched
        NUMA base PTE updates          5113158           0
        NUMA huge PMD updates              643           0
        NUMA page range updates        5442374           0
        NUMA hint faults               2109622           0
        NUMA hint local faults         2109622           0
        NUMA hint local percent            100         100
        NUMA pages migrated                  0           0
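
      A sketch of the corrected default (assumed shape): key the decision off
      the number of online nodes, not the highest node ID.

        /* was: if (nr_node_ids > 1) ... */
        if (num_online_nodes() > 1)
                set_numabalancing_state(true);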
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.8+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b0dc2b9b
    • CMA: page_isolation: check buddy before accessing it · 1ae7013d
      Committed by Hui Zhu
      I had an issue:
      
          Unable to handle kernel NULL pointer dereference at virtual address 0000082a
          pgd = cc970000
          [0000082a] *pgd=00000000
          Internal error: Oops: 5 [#1] PREEMPT SMP ARM
          PC is at get_pageblock_flags_group+0x5c/0xb0
          LR is at unset_migratetype_isolate+0x148/0x1b0
          pc : [<c00cc9a0>]    lr : [<c0109874>]    psr: 80000093
          sp : c7029d00  ip : 00000105  fp : c7029d1c
          r10: 00000001  r9 : 0000000a  r8 : 00000004
          r7 : 60000013  r6 : 000000a4  r5 : c0a357e4  r4 : 00000000
          r3 : 00000826  r2 : 00000002  r1 : 00000000  r0 : 0000003f
          Flags: Nzcv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
          Control: 10c5387d  Table: 2cb7006a  DAC: 00000015
          Backtrace:
              get_pageblock_flags_group+0x0/0xb0
              unset_migratetype_isolate+0x0/0x1b0
              undo_isolate_page_range+0x0/0xdc
              __alloc_contig_range+0x0/0x34c
              alloc_contig_range+0x0/0x18
      
      This issue arises because when unset_migratetype_isolate() is called to
      unset a part of CMA memory, it tries to access the buddy page to get its
      status:
      
      		if (order >= pageblock_order) {
      			page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
      			buddy_idx = __find_buddy_index(page_idx, order);
      			buddy = page + (buddy_idx - page_idx);
      
      			if (!is_migrate_isolate_page(buddy)) {
      
      But the beginning address of this part of CMA memory is very close to a
      part of memory that is reserved at boot time (not in the buddy system).
      So add a check before accessing the buddy page.
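
      A sketch of the guard (assumed shape, reusing the names from the excerpt
      above): make sure the buddy PFN has a valid struct page before
      dereferencing it.

        if (order >= pageblock_order) {
                page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
                buddy_idx = __find_buddy_index(page_idx, order);
                buddy = page + (buddy_idx - page_idx);

                if (pfn_valid_within(page_to_pfn(buddy)) &&
                    !is_migrate_isolate_page(buddy)) {
                        /* ... proceed with moving the pageblock ... */
                }
        }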
      
      [akpm@linux-foundation.org: use conventional code layout]
      Signed-off-by: Hui Zhu <zhuhui@xiaomi.com>
      Suggested-by: Laura Abbott <labbott@redhat.com>
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ae7013d