1. 15 1月, 2016 6 次提交
  2. 09 1月, 2016 1 次提交
    • M
      vmstat: allocate vmstat_wq before it is used · 751e5f5c
      Michal Hocko 提交于
      kernel test robot has reported the following crash:
      
        BUG: unable to handle kernel NULL pointer dereference at 00000100
        IP: [<c1074df6>] __queue_work+0x26/0x390
        *pdpt = 0000000000000000 *pde = f000ff53f000ff53 *pde = f000ff53f000ff53
        Oops: 0000 [#1] PREEMPT PREEMPT SMP SMP
        CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 4.4.0-rc4-00139-g373ccbe5 #1
        Workqueue: events vmstat_shepherd
        task: cb684600 ti: cb7ba000 task.ti: cb7ba000
        EIP: 0060:[<c1074df6>] EFLAGS: 00010046 CPU: 0
        EIP is at __queue_work+0x26/0x390
        EAX: 00000046 EBX: cbb37800 ECX: cbb37800 EDX: 00000000
        ESI: 00000000 EDI: 00000000 EBP: cb7bbe68 ESP: cb7bbe38
         DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
        CR0: 8005003b CR2: 00000100 CR3: 01fd5000 CR4: 000006b0
        Stack:
        Call Trace:
          __queue_delayed_work+0xa1/0x160
          queue_delayed_work_on+0x36/0x60
          vmstat_shepherd+0xad/0xf0
          process_one_work+0x1aa/0x4c0
          worker_thread+0x41/0x440
          kthread+0xb0/0xd0
          ret_from_kernel_thread+0x21/0x40
      
      The reason is that start_shepherd_timer schedules the shepherd work item
      which uses vmstat_wq (vmstat_shepherd) before setup_vmstat allocates
      that workqueue so if the further initialization takes more than HZ we
      might end up scheduling on a NULL vmstat_wq.  This is really unlikely
      but not impossible.
      
      Fixes: 373ccbe5 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
      Reported-by: Nkernel test robot <ying.huang@linux.intel.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: NTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: stable@vger.kernel.org
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      751e5f5c
  3. 05 1月, 2016 1 次提交
    • T
      x86/mm/pat: Add untrack_pfn_moved for mremap · d9fe4fab
      Toshi Kani 提交于
      mremap() with MREMAP_FIXED on a VM_PFNMAP range causes the following
      WARN_ON_ONCE() message in untrack_pfn().
      
        WARNING: CPU: 1 PID: 3493 at arch/x86/mm/pat.c:985 untrack_pfn+0xbd/0xd0()
        Call Trace:
        [<ffffffff817729ea>] dump_stack+0x45/0x57
        [<ffffffff8109e4b6>] warn_slowpath_common+0x86/0xc0
        [<ffffffff8109e5ea>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8106a88d>] untrack_pfn+0xbd/0xd0
        [<ffffffff811d2d5e>] unmap_single_vma+0x80e/0x860
        [<ffffffff811d3725>] unmap_vmas+0x55/0xb0
        [<ffffffff811d916c>] unmap_region+0xac/0x120
        [<ffffffff811db86a>] do_munmap+0x28a/0x460
        [<ffffffff811dec33>] move_vma+0x1b3/0x2e0
        [<ffffffff811df113>] SyS_mremap+0x3b3/0x510
        [<ffffffff817793ee>] entry_SYSCALL_64_fastpath+0x12/0x71
      
      MREMAP_FIXED moves a pfnmap from old vma to new vma.  untrack_pfn() is
      called with the old vma after its pfnmap page table has been removed,
      which causes follow_phys() to fail.  The new vma has a new pfnmap to
      the same pfn & cache type with VM_PAT set.  Therefore, we only need to
      clear VM_PAT from the old vma in this case.
      
      Add untrack_pfn_moved(), which clears VM_PAT from a given old vma.
      move_vma() is changed to call this function with the old vma when
      VM_PFNMAP is set.  move_vma() then calls do_munmap(), and untrack_pfn()
      is a no-op since VM_PAT is cleared.
      Reported-by: NStas Sergeev <stsp@list.ru>
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/1450832064-10093-2-git-send-email-toshi.kani@hpe.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      d9fe4fab
  4. 04 1月, 2016 1 次提交
  5. 31 12月, 2015 1 次提交
  6. 30 12月, 2015 3 次提交
    • H
      mm/vmstat: fix overflow in mod_zone_page_state() · 6cdb18ad
      Heiko Carstens 提交于
      mod_zone_page_state() takes a "delta" integer argument.  delta contains
      the number of pages that should be added or subtracted from a struct
      zone's vm_stat field.
      
      If a zone is larger than 8TB this will cause overflows.  E.g.  for a
      zone with a size slightly larger than 8TB the line
      
          mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
      
      in mm/page_alloc.c:free_area_init_core() will result in a negative
      result for the NR_ALLOC_BATCH entry within the zone's vm_stat, since 8TB
      contain 0x8xxxxxxx pages which will be sign extended to a negative
      value.
      
      Fix this by changing the delta argument to long type.
      
      This could fix an early boot problem seen on s390, where we have a 9TB
      system with only one node.  ZONE_DMA contains 2GB and ZONE_NORMAL the
      rest.  The system is trying to allocate a GFP_DMA page but ZONE_DMA is
      completely empty, so it tries to reclaim pages in an endless loop.
      
      This was seen on a heavily patched 3.10 kernel.  One possible
      explaination seem to be the overflows caused by mod_zone_page_state().
      Unfortunately I did not have the chance to verify that this patch
      actually fixes the problem, since I don't have access to the system
      right now.  However the overflow problem does exist anyway.
      
      Given the description that a system with slightly less than 8TB does
      work, this seems to be a candidate for the observed problem.
      Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6cdb18ad
    • A
      mm/memory_hotplug.c: check for missing sections in test_pages_in_a_zone() · 5f0f2887
      Andrew Banman 提交于
      test_pages_in_a_zone() does not account for the possibility of missing
      sections in the given pfn range.  pfn_valid_within always returns 1 when
      CONFIG_HOLES_IN_ZONE is not set, allowing invalid pfns from missing
      sections to pass the test, leading to a kernel oops.
      
      Wrap an additional pfn loop with PAGES_PER_SECTION granularity to check
      for missing sections before proceeding into the zone-check code.
      
      This also prevents a crash from offlining memory devices with missing
      sections.  Despite this, it may be a good idea to keep the related patch
      '[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with
      missing sections' because missing sections in a memory block may lead to
      other problems not covered by the scope of this fix.
      Signed-off-by: NAndrew Banman <abanman@sgi.com>
      Acked-by: NAlex Thorlton <athorlton@sgi.com>
      Cc: Russ Anderson <rja@sgi.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Greg KH <greg@kroah.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f0f2887
    • V
      mm: memcontrol: fix possible memcg leak due to interrupted reclaim · 6df38689
      Vladimir Davydov 提交于
      Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break()
      once enough pages have been reclaimed, in which case, in contrast to a
      full round-trip over a cgroup sub-tree, the current position stored in
      mem_cgroup_reclaim_iter of the target cgroup does not get invalidated
      and so is left holding the reference to the last scanned cgroup.  If the
      target cgroup does not get scanned again (we might have just reclaimed
      the last page or all processes might exit and free their memory
      voluntary), we will leak it, because there is nobody to put the
      reference held by the iterator.
      
      The problem is easy to reproduce by running the following command
      sequence in a loop:
      
          mkdir /sys/fs/cgroup/memory/test
          echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
          echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
          memhog 150M
          echo $$ > /sys/fs/cgroup/memory/cgroup.procs
          rmdir test
      
      The cgroups generated by it will never get freed.
      
      This patch fixes this issue by making mem_cgroup_iter avoid taking
      reference to the current position.  In order not to hit use-after-free
      bug while running reclaim in parallel with cgroup deletion, we make use
      of ->css_released cgroup callback to clear references to the dying
      cgroup in all reclaim iterators that might refer to it.  This callback
      is called right before scheduling rcu work which will free css, so if we
      access iter->position from rcu read section, we might be sure it won't
      go away under us.
      
      [hannes@cmpxchg.org: clean up css ref handling]
      Fixes: 5ac8fb31 ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@kernel.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[3.19+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6df38689
  7. 28 12月, 2015 1 次提交
    • R
      cgroup: Fix uninitialized variable warning · eed67d75
      Ross Zwisler 提交于
      Commit 1f7dd3e5 ("cgroup: fix handling of multi-destination migration
      from subtree_control enabling") introduced the following compiler warning:
      
      mm/memcontrol.c: In function ‘mem_cgroup_can_attach’:
      mm/memcontrol.c:4790:9: warning: ‘memcg’ may be used uninitialized in this function [-Wmaybe-uninitialized]
         mc.to = memcg;
               ^
      
      Fix this by initializing 'memcg' to NULL.
      
      This was found using gcc (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6).
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      eed67d75
  8. 19 12月, 2015 1 次提交
  9. 13 12月, 2015 9 次提交
    • C
      mm/oom_kill.c: avoid attempting to kill init sharing same memory · a2b829d9
      Chen Jie 提交于
      It's possible that an oom killed victim shares an ->mm with the init
      process and thus oom_kill_process() would end up trying to kill init as
      well.
      
      This has been shown in practice:
      
      	Out of memory: Kill process 9134 (init) score 3 or sacrifice child
      	Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB
      	Kill process 1 (init) sharing same memory
      	...
      	Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
      
      And this will result in a kernel panic.
      
      If a process is forked by init and selected for oom kill while still
      sharing init_mm, then it's likely this system is in a recoverable state.
      However, it's better not to try to kill init and allow the machine to
      panic due to unkillable processes.
      
      [rientjes@google.com: rewrote changelog]
      [akpm@linux-foundation.org: fix inverted test, per Ben]
      Signed-off-by: NChen Jie <chenjie6@huawei.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2b829d9
    • H
      tmpfs: fix shmem_evict_inode() warnings on i_blocks · 267a4c76
      Hugh Dickins 提交于
      Dmitry Vyukov provides a little program, autogenerated by syzkaller,
      which races a fault on a mapping of a sparse memfd object, against
      truncation of that object below the fault address: run repeatedly for a
      few minutes, it reliably generates shmem_evict_inode()'s
      WARN_ON(inode->i_blocks).
      
      (But there's nothing specific to memfd here, nor to the fstat which it
      happened to use to generate the fault: though that looked suspicious,
      since a shmem_recalc_inode() had been added there recently.  The same
      problem can be reproduced with open+unlink in place of memfd_create, and
      with fstatfs in place of fstat.)
      
      v3.7 commit 0f3c42f5 ("tmpfs: change final i_blocks BUG to WARNING")
      explains one cause of such a warning (a race with shmem_writepage to
      swap), and possible solutions; but we never took it further, and this
      syzkaller incident turns out to have a different cause.
      
      shmem_getpage_gfp()'s error recovery, when a freshly allocated page is
      then found to be beyond eof, looks plausible - decrementing the alloced
      count that was just before incremented - but in fact can go wrong, if a
      racing thread (the truncator, for example) gets its shmem_recalc_inode()
      in just after our delete_from_page_cache().  delete_from_page_cache()
      decrements nrpages, that shmem_recalc_inode() will balance the books by
      decrementing alloced itself, then our decrement of alloced take it one
      too low: leading to the WARNING when the object is finally evicted.
      
      Once the new page has been exposed in the page cache,
      shmem_getpage_gfp() must leave it to shmem_recalc_inode() itself to get
      the accounting right in all cases (and not fall through from "trunc:" to
      "decused:").  Adjust that error recovery block; and the reinitialization
      of info and sbinfo can be removed too.
      
      While we're here, fix shmem_writepage() to avoid the original issue: it
      will be safe against a racing shmem_recalc_inode(), if it merely
      increments swapped before the shmem_delete_from_page_cache() which
      decrements nrpages (but it must then do its own shmem_recalc_inode()
      before that, while still in balance, instead of after).  (Aside: why do
      we shmem_recalc_inode() here in the swap path? Because its raison d'etre
      is to cope with clean sparse shmem pages being reclaimed behind our
      back: so here when swapping is a good place to look for that case.) But
      I've not now managed to reproduce this bug, even without the patch.
      
      I don't see why I didn't do that earlier: perhaps inhibited by the
      preference to eliminate shmem_recalc_inode() altogether.  Driven by this
      incident, I do now have a patch to do so at last; but still want to sit
      on it for a bit, there's a couple of questions yet to be resolved.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      267a4c76
    • M
      mm/hugetlb.c: fix resv map memory leak for placeholder entries · dbe409e4
      Mike Kravetz 提交于
      Dmitry Vyukov reported the following memory leak
      
      unreferenced object 0xffff88002eaafd88 (size 32):
        comm "a.out", pid 5063, jiffies 4295774645 (age 15.810s)
        hex dump (first 32 bytes):
          28 e9 4e 63 00 88 ff ff 28 e9 4e 63 00 88 ff ff  (.Nc....(.Nc....
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
           kmalloc include/linux/slab.h:458
           region_chg+0x2d4/0x6b0 mm/hugetlb.c:398
           __vma_reservation_common+0x2c3/0x390 mm/hugetlb.c:1791
           vma_needs_reservation mm/hugetlb.c:1813
           alloc_huge_page+0x19e/0xc70 mm/hugetlb.c:1845
           hugetlb_no_page mm/hugetlb.c:3543
           hugetlb_fault+0x7a1/0x1250 mm/hugetlb.c:3717
           follow_hugetlb_page+0x339/0xc70 mm/hugetlb.c:3880
           __get_user_pages+0x542/0xf30 mm/gup.c:497
           populate_vma_page_range+0xde/0x110 mm/gup.c:919
           __mm_populate+0x1c7/0x310 mm/gup.c:969
           do_mlock+0x291/0x360 mm/mlock.c:637
           SYSC_mlock2 mm/mlock.c:658
           SyS_mlock2+0x4b/0x70 mm/mlock.c:648
      
      Dmitry identified a potential memory leak in the routine region_chg,
      where a region descriptor is not free'ed on an error path.
      
      However, the root cause for the above memory leak resides in region_del.
      In this specific case, a "placeholder" entry is created in region_chg.
      The associated page allocation fails, and the placeholder entry is left
      in the reserve map.  This is "by design" as the entry should be deleted
      when the map is released.  The bug is in the region_del routine which is
      used to delete entries within a specific range (and when the map is
      released).  region_del did not handle the case where a placeholder entry
      exactly matched the start of the range range to be deleted.  In this
      case, the entry would not be deleted and leaked.  The fix is to take
      these special placeholder entries into account in region_del.
      
      The region_chg error path leak is also fixed.
      
      Fixes: feba16e2 ("mm/hugetlb: add region_del() to delete a specific range of entries")
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: <stable@vger.kernel.org>	[4.3+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dbe409e4
    • N
      mm: hugetlb: call huge_pte_alloc() only if ptep is null · 0d777df5
      Naoya Horiguchi 提交于
      Currently at the beginning of hugetlb_fault(), we call huge_pte_offset()
      and check whether the obtained *ptep is a migration/hwpoison entry or
      not.  And if not, then we get to call huge_pte_alloc().  This is racy
      because the *ptep could turn into migration/hwpoison entry after the
      huge_pte_offset() check.  This race results in BUG_ON in
      huge_pte_alloc().
      
      We don't have to call huge_pte_alloc() when the huge_pte_offset()
      returns non-NULL, so let's fix this bug with moving the code into else
      block.
      
      Note that the *ptep could turn into a migration/hwpoison entry after
      this block, but that's not a problem because we have another
      !pte_present check later (we never go into hugetlb_no_page() in that
      case.)
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org>	[2.6.36+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0d777df5
    • H
      mm: fix kerneldoc on mem_cgroup_replace_page · 25be6a65
      Hugh Dickins 提交于
      Whoops, I missed removing the kerneldoc comment of the lrucare arg
      removed from mem_cgroup_replace_page; but it's a good comment, keep it.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      25be6a65
    • M
      mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress · 373ccbe5
      Michal Hocko 提交于
      Tetsuo Handa has reported that the system might basically livelock in
      OOM condition without triggering the OOM killer.
      
      The issue is caused by internal dependency of the direct reclaim on
      vmstat counter updates (via zone_reclaimable) which are performed from
      the workqueue context.  If all the current workers get assigned to an
      allocation request, though, they will be looping inside the allocator
      trying to reclaim memory but zone_reclaimable can see stalled numbers so
      it will consider a zone reclaimable even though it has been scanned way
      too much.  WQ concurrency logic will not consider this situation as a
      congested workqueue because it relies that worker would have to sleep in
      such a situation.  This also means that it doesn't try to spawn new
      workers or invoke the rescuer thread if the one is assigned to the
      queue.
      
      In order to fix this issue we need to do two things.  First we have to
      let wq concurrency code know that we are in trouble so we have to do a
      short sleep.  In order to prevent from issues handled by 0e093d99
      ("writeback: do not sleep on the congestion queue if there are no
      congested BDIs or if significant congestion is not being encountered in
      the current zone") we limit the sleep only to worker threads which are
      the ones of the interest anyway.
      
      The second thing to do is to create a dedicated workqueue for vmstat and
      mark it WQ_MEM_RECLAIM to note it participates in the reclaim and to
      have a spare worker thread for it.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Cristopher Lameter <clameter@sgi.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Arkadiusz Miskiewicz <arekm@maven.pl>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      373ccbe5
    • V
      mm: fix swapped Movable and Reclaimable in /proc/pagetypeinfo · 475a2f90
      Vlastimil Babka 提交于
      Commit 016c13da ("mm, page_alloc: use masks and shifts when
      converting GFP flags to migrate types") has swapped MIGRATE_MOVABLE and
      MIGRATE_RECLAIMABLE in the enum definition.  However, migratetype_names
      wasn't updated to reflect that.
      
      As a result, the file /proc/pagetypeinfo shows the counts for Movable as
      Reclaimable and vice versa.
      
      Additionally, commit 0aaa29a5 ("mm, page_alloc: reserve pageblocks
      for high-order atomic allocations on demand") introduced
      MIGRATE_HIGHATOMIC, but did not add a letter to distinguish it into
      show_migration_types(), so it doesn't appear in the listing of free
      areas during page alloc failures or oom kills.
      
      This patch fixes both problems.  The atomic reserves will show with a
      letter 'H' in the free areas listings.
      
      Fixes: 016c13da ("mm, page_alloc: use masks and shifts when converting GFP flags to migrate types")
      Fixes: 0aaa29a5 ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      475a2f90
    • V
      memcg: fix memory.high target · 9516a18a
      Vladimir Davydov 提交于
      When the memory.high threshold is exceeded, try_charge() schedules a
      task_work to reclaim the excess.  The reclaim target is set to the
      number of pages requested by try_charge().
      
      This is wrong, because try_charge() usually charges more pages than
      requested (batch > nr_pages) in order to refill per cpu stocks.  As a
      result, a process in a cgroup can easily exceed memory.high
      significantly when doing a lot of charges w/o returning to userspace
      (e.g.  reading a file in big chunks).
      
      Fix this issue by assuring that when exceeding memory.high a process
      reclaims as many pages as were actually charged (i.e.  batch).
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9516a18a
    • N
      mm: hugetlb: fix hugepage memory leak caused by wrong reserve count · a88c7695
      Naoya Horiguchi 提交于
      When dequeue_huge_page_vma() in alloc_huge_page() fails, we fall back on
      alloc_buddy_huge_page() to directly create a hugepage from the buddy
      allocator.
      
      In that case, however, if alloc_buddy_huge_page() succeeds we don't
      decrement h->resv_huge_pages, which means that successful
      hugetlb_fault() returns without releasing the reserve count.  As a
      result, subsequent hugetlb_fault() might fail despite that there are
      still free hugepages.
      
      This patch simply adds decrementing code on that code path.
      
      I reproduced this problem when testing v4.3 kernel in the following situation:
       - the test machine/VM is a NUMA system,
       - hugepage overcommiting is enabled,
       - most of hugepages are allocated and there's only one free hugepage
         which is on node 0 (for example),
       - another program, which calls set_mempolicy(MPOL_BIND) to bind itself to
         node 1, tries to allocate a hugepage,
       - the allocation should fail but the reserve count is still hold.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org> [3.16+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a88c7695
  10. 10 12月, 2015 1 次提交
  11. 09 12月, 2015 3 次提交
    • A
      teach shmem_get_link() to work in RCU mode · 6a6c9904
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6a6c9904
    • A
      replace ->follow_link() with new method that could stay in RCU mode · 6b255391
      Al Viro 提交于
      new method: ->get_link(); replacement of ->follow_link().  The differences
      are:
      	* inode and dentry are passed separately
      	* might be called both in RCU and non-RCU mode;
      the former is indicated by passing it a NULL dentry.
      	* when called that way it isn't allowed to block
      and should return ERR_PTR(-ECHILD) if it needs to be called
      in non-RCU mode.
      
      It's a flagday change - the old method is gone, all in-tree instances
      converted.  Conversion isn't hard; said that, so far very few instances
      do not immediately bail out when called in RCU mode.  That'll change
      in the next commits.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6b255391
    • A
      don't put symlink bodies in pagecache into highmem · 21fc61c7
      Al Viro 提交于
      kmap() in page_follow_link_light() needed to go - allowing to hold
      an arbitrary number of kmaps for long is a great way to deadlocking
      the system.
      
      new helper (inode_nohighmem(inode)) needs to be used for pagecache
      symlinks inodes; done for all in-tree cases.  page_follow_link_light()
      instrumented to yell about anything missed.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      21fc61c7
  12. 07 12月, 2015 2 次提交
  13. 06 12月, 2015 1 次提交
  14. 03 12月, 2015 1 次提交
    • T
      cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Tejun Heo 提交于
      Consider the following v2 hierarchy.
      
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
             
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B and A's processes should be moved to
      the former and B's processes the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      
      The core cgroup migration logic has been updated accordingly but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated which is the parent of the
      target csses.  pids controller depends on the migration methods to
      move charges and this made the controller attribute charges to the
      wrong csses often triggering the following warning by driving a
      counter negative.
      
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
       ...
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      This patch fixes the bug by removing @css parameter from the three
      migration methods, ->can_attach, ->cancel_attach() and ->attach() and
      updating cgroup_taskset iteration helpers also return the destination
      css in addition to the task being migrated.  All controllers are
      updated accordingly.
      
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      
      * cpuset's current implementation assumes that there's single source
        and destination and thus doesn't support v2 hierarchy already.  The
        only change made by this patchset is how that single destination css
        is obtained.
      
      * memory migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated and the prep stage of
        mem_cgroup_can_attach() is reordered to accomodate the change.
      
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      1f7dd3e5
  15. 23 11月, 2015 6 次提交
    • P
      treewide: Remove old email address · 90eec103
      Peter Zijlstra 提交于
      There were still a number of references to my old Red Hat email
      address in the kernel source. Remove these while keeping the
      Red Hat copyright notices intact.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      90eec103
    • J
      slab/slub: adjust kmem_cache_alloc_bulk API · 865762a8
      Jesper Dangaard Brouer 提交于
      Adjust kmem_cache_alloc_bulk API before we have any real users.
      
      Adjust API to return type 'int' instead of previously type 'bool'.  This
      is done to allow future extension of the bulk alloc API.
      
      A future extension could be to allow SLUB to stop at a page boundary, when
      specified by a flag, and then return the number of objects.
      
      The advantage of this approach, would make it easier to make bulk alloc
      run without local IRQs disabled.  With an approach of cmpxchg "stealing"
      the entire c->freelist or page->freelist.  To avoid overshooting we would
      stop processing at a slab-page boundary.  Else we always end up returning
      some objects at the cost of another cmpxchg.
      
      To keep compatible with future users of this API linking against an older
      kernel when using the new flag, we need to return the number of allocated
      objects with this API change.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      865762a8
    • J
      slub: add missing kmem cgroup support to kmem_cache_free_bulk · 03374518
      Jesper Dangaard Brouer 提交于
      Initial implementation missed support for kmem cgroup support in
      kmem_cache_free_bulk() call, add this.
      
      If CONFIG_MEMCG_KMEM is not enabled, the compiler should be smart enough
      to not add any asm code.
      
      Incoming bulk free objects can belong to different kmem cgroups, and
      object free call can happen at a later point outside memcg context.  Thus,
      we need to keep the orig kmem_cache, to correctly verify if a memcg object
      match against its "root_cache" (s->memcg_params.root_cache).
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      03374518
    • J
      slub: fix kmem cgroup bug in kmem_cache_alloc_bulk · 03ec0ed5
      Jesper Dangaard Brouer 提交于
      The call slab_pre_alloc_hook() interacts with kmemgc and is not allowed to
      be called several times inside the bulk alloc for loop, due to the call to
      memcg_kmem_get_cache().
      
      This would result in hitting the VM_BUG_ON in __memcg_kmem_get_cache.
      
      As suggested by Vladimir Davydov, change slab_post_alloc_hook() to be able
      to handle an array of objects.
      
      A subtle detail is, loop iterator "i" in slab_post_alloc_hook() must have
      same type (size_t) as size argument.  This helps the compiler to easier
      realize that it can remove the loop, when all debug statements inside loop
      evaluates to nothing.  Note, this is only an issue because the kernel is
      compiled with GCC option: -fno-strict-overflow
      
      In slab_alloc_node() the compiler inlines and optimizes the invocation of
      slab_post_alloc_hook(s, flags, 1, &object) by removing the loop and access
      object directly.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Reported-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Suggested-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      03ec0ed5
    • J
      slub: optimize bulk slowpath free by detached freelist · d0ecd894
      Jesper Dangaard Brouer 提交于
      This change focus on improving the speed of object freeing in the
      "slowpath" of kmem_cache_free_bulk.
      
      The calls slab_free (fastpath) and __slab_free (slowpath) have been
      extended with support for bulk free, which amortize the overhead of
      the (locked) cmpxchg_double.
      
      To use the new bulking feature, we build what I call a detached
      freelist.  The detached freelist takes advantage of three properties:
      
       1) the free function call owns the object that is about to be freed,
          thus writing into this memory is synchronization-free.
      
       2) many freelist's can co-exist side-by-side in the same slab-page
          each with a separate head pointer.
      
       3) it is the visibility of the head pointer that needs synchronization.
      
      Given these properties, the brilliant part is that the detached
      freelist can be constructed without any need for synchronization.  The
      freelist is constructed directly in the page objects, without any
      synchronization needed.  The detached freelist is allocated on the
      stack of the function call kmem_cache_free_bulk.  Thus, the freelist
      head pointer is not visible to other CPUs.
      
      All objects in a SLUB freelist must belong to the same slab-page.
      Thus, constructing the detached freelist is about matching objects
      that belong to the same slab-page.  The bulk free array is scanned is
      a progressive manor with a limited look-ahead facility.
      
      Kmem debug support is handled in call of slab_free().
      
      Notice kmem_cache_free_bulk no longer need to disable IRQs. This
      only slowed down single free bulk with approx 3 cycles.
      
      Performance data:
       Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
      
      SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
      
      To get stable and comparable numbers, the kernel have been booted with
      "slab_merge" (this also improve performance for larger bulk sizes).
      
      Performance data, compared against fallback bulking:
      
      bulk -  fallback bulk            - improvement with this patch
         1 -  62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns- improved 21.0%
         2 -  55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
         3 -  53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
         4 -  52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
         8 -  50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
        16 -  49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
        30 -  49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
        32 -  50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
        34 -  96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
        48 -  83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
        64 -  74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
       128 -  90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
       158 -  99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
       250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%
      
      Performance data, compared current in-kernel bulking:
      
      bulk - curr in-kernel  - improvement with this patch
         1 -  46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
         2 -  27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
         3 -  21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
         4 -  18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
         8 -  17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
        16 -  18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1)  5.6%
        30 -  18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0)  0.0%
        32 -  18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0)  0.0%
        34 -  78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
        48 -  60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
        64 -  49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
       128 -  69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
       158 -  79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
       250 -  86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%
      
      Performance with normal SLUB merging is significantly slower for
      larger bulking.  This is believed to (primarily) be an effect of not
      having to share the per-CPU data-structures, as tuning per-CPU size
      can achieve similar performance.
      
      bulk - slab_nomerge   -  normal SLUB merge
         1 -  49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
         2 -  30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
         3 -  23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
         4 -  20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
         8 -  18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
        16 -  17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
        30 -  18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
        32 -  18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
        34 -  23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
        48 -  21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
        64 -  20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
       128 -  27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
       158 -  30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
       250 -  37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19
      
      Joint work with Alexander Duyck.
      
      [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
      
      [akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0ecd894
    • J
      slub: support for bulk free with SLUB freelists · 81084651
      Jesper Dangaard Brouer 提交于
      Make it possible to free a freelist with several objects by adjusting API
      of slab_free() and __slab_free() to have head, tail and an objects counter
      (cnt).
      
      Tail being NULL indicate single object free of head object.  This allow
      compiler inline constant propagation in slab_free() and
      slab_free_freelist_hook() to avoid adding any overhead in case of single
      object free.
      
      This allows a freelist with several objects (all within the same
      slab-page) to be free'ed using a single locked cmpxchg_double in
      __slab_free() and with an unlocked cmpxchg_double in slab_free().
      
      Object debugging on the free path is also extended to handle these
      freelists.  When CONFIG_SLUB_DEBUG is enabled it will also detect if
      objects don't belong to the same slab-page.
      
      These changes are needed for the next patch to bulk free the detached
      freelists it introduces and constructs.
      
      Micro benchmarking showed no performance reduction due to this change,
      when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81084651
  16. 21 11月, 2015 2 次提交