1. 07 May 2014, 17 commits
    • start adding the tag to iov_iter · 71d8e532
      Committed by Al Viro
      For now, just use the same thing we pass to ->direct_IO() - it's all
      iovec-based at the moment.  Pass it explicitly to iov_iter_init() and
      account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO()
      uses.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
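      A minimal sketch of the initializer this describes: the direction argument
      doubles as the type tag, and the NFS-style kludge marks the iterator as
      kvec-backed when the caller runs under KERNEL_DS.  Field and helper names
      follow the iov_iter code of that era from memory and should be read as
      illustrative, not as the patch itself.
        void iov_iter_init(struct iov_iter *i, int direction,
                           const struct iovec *iov, unsigned long nr_segs,
                           size_t count)
        {
                /* the kludge NFS ->direct_IO() used: iovecs built by kernel
                 * callers are really kvecs, so tag the iterator accordingly */
                if (segment_eq(get_fs(), KERNEL_DS))
                        direction |= ITER_KVEC;
                i->type = direction;
                i->iov = iov;
                i->nr_segs = nr_segs;
                i->iov_offset = 0;
                i->count = count;
        }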
    • new helper: generic_file_read_iter() · ed978a81
      Committed by Al Viro
      iov_iter-using variant of generic_file_aio_read().  Some callers
      converted.  Note that it's still not quite there for use as ->read_iter() -
      we depend on having zero iter->iov_offset in the O_DIRECT case.  Fortunately,
      that's true for all converted callers (and for generic_file_aio_read() itself).
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • new primitive: iov_iter_alignment() · 886a3911
      Committed by Al Viro
      Returns a value that is aligned as badly as the worst remaining segment
      in the iov_iter is.  Use it instead of open-coded equivalents.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
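      A sketch of the idea behind the primitive: OR together the start address
      (adjusted by iov_offset) and the length of every segment that still has
      data left, so any misalignment anywhere in the remaining iovec shows up in
      the low bits of the result.  This is an illustrative reimplementation for
      the iovec-only iov_iter of the time, not the exact code from the commit.
        unsigned long iov_iter_alignment(const struct iov_iter *i)
        {
                const struct iovec *iov = i->iov;
                size_t size = i->count;         /* bytes still to transfer */
                size_t skip = i->iov_offset;    /* offset into the first segment */
                unsigned long res = 0;

                while (size) {
                        size_t len = min(iov->iov_len - skip, size);

                        /* a low bit set here means "aligned at most that badly" */
                        res |= (unsigned long)iov->iov_base + skip;
                        res |= len;

                        size -= len;
                        skip = 0;
                        iov++;
                }
                return res;
        }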
    • give ->direct_IO() a copy of iov_iter · 26978b8b
      Committed by Al Viro
      The thing is, we want to advance what's given to ->direct_IO() as we
      are forming the request; however, the callers care about the amount
      of data actually transferred, not the amount we tried to transfer.
      It's more convenient to allow ->direct_IO() instances to use
      iov_iter_advance() on the copy of the iov_iter, leaving the actual
      advancing of the original to the caller.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
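      A sketch of the calling convention this sets up, loosely modeled on a
      generic_file_read_iter()-style caller: hand ->direct_IO() a throwaway copy
      that it may iov_iter_advance() at will, then advance the original iterator
      by the number of bytes actually transferred.  Surrounding variable names
      are illustrative.
        struct iov_iter data = *iter;   /* ->direct_IO() is free to advance this copy */
        ssize_t ret;

        ret = mapping->a_ops->direct_IO(READ, iocb, &data, pos);
        if (ret > 0) {
                pos += ret;
                /* advance the original by what was actually transferred */
                iov_iter_advance(iter, ret);
        }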
    • get rid of pointless iov_length() in ->direct_IO() · a6cbcd4a
      Committed by Al Viro
      All callers have iov_length(iter->iov, iter->nr_segs) == iov_iter_count(iter).
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • pass iov_iter to ->direct_IO() · d8d3d94b
      Committed by Al Viro
      Unmodified, for now.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • kill generic_segment_checks() · cb66a7a1
      Committed by Al Viro
      All callers of ->aio_read() and ->aio_write() have iov/nr_segs already
      checked - calling generic_segment_checks() after that is just an odd way
      to spell iov_length().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • generic_file_direct_write(): switch to iov_iter · f8579f86
      Committed by Al Viro
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • kill iov_iter_copy_from_user() · e7c24607
      Committed by Al Viro
      All callers can use copy_page_from_iter(), and it actually simplifies
      them.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • slub: use sysfs'es release mechanism for kmem_cache · 41a21285
      Committed by Christoph Lameter
      debugobjects warning during netfilter exit:
      
          ------------[ cut here ]------------
          WARNING: CPU: 6 PID: 4178 at lib/debugobjects.c:260 debug_print_object+0x8d/0xb0()
          ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
          Modules linked in:
          CPU: 6 PID: 4178 Comm: kworker/u16:2 Tainted: G        W 3.11.0-next-20130906-sasha #3984
          Workqueue: netns cleanup_net
          Call Trace:
            dump_stack+0x52/0x87
            warn_slowpath_common+0x8c/0xc0
            warn_slowpath_fmt+0x46/0x50
            debug_print_object+0x8d/0xb0
            __debug_check_no_obj_freed+0xa5/0x220
            debug_check_no_obj_freed+0x15/0x20
            kmem_cache_free+0x197/0x340
            kmem_cache_destroy+0x86/0xe0
            nf_conntrack_cleanup_net_list+0x131/0x170
            nf_conntrack_pernet_exit+0x5d/0x70
            ops_exit_list+0x5e/0x70
            cleanup_net+0xfb/0x1c0
            process_one_work+0x338/0x550
            worker_thread+0x215/0x350
            kthread+0xe7/0xf0
            ret_from_fork+0x7c/0xb0
      
      Also during dcookie cleanup:
      
          WARNING: CPU: 12 PID: 9725 at lib/debugobjects.c:260 debug_print_object+0x8c/0xb0()
          ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x20
          Modules linked in:
          CPU: 12 PID: 9725 Comm: trinity-c141 Not tainted 3.15.0-rc2-next-20140423-sasha-00018-gc4ff6c4 #408
          Call Trace:
            dump_stack (lib/dump_stack.c:52)
            warn_slowpath_common (kernel/panic.c:430)
            warn_slowpath_fmt (kernel/panic.c:445)
            debug_print_object (lib/debugobjects.c:262)
            __debug_check_no_obj_freed (lib/debugobjects.c:697)
            debug_check_no_obj_freed (lib/debugobjects.c:726)
            kmem_cache_free (mm/slub.c:2689 mm/slub.c:2717)
            kmem_cache_destroy (mm/slab_common.c:363)
            dcookie_unregister (fs/dcookies.c:302 fs/dcookies.c:343)
            event_buffer_release (arch/x86/oprofile/../../../drivers/oprofile/event_buffer.c:153)
            __fput (fs/file_table.c:217)
            ____fput (fs/file_table.c:253)
            task_work_run (kernel/task_work.c:125 (discriminator 1))
            do_notify_resume (include/linux/tracehook.h:196 arch/x86/kernel/signal.c:751)
            int_signal (arch/x86/kernel/entry_64.S:807)
      
      Sysfs has a release mechanism.  Use that to release the kmem_cache
      structure if CONFIG_SYSFS is enabled.
      
      Only slub is changed - slab currently only supports /proc/slabinfo and
      not /sys/kernel/slab/*.  We talked about adding that and someone was
      working on it.
      
      [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build]
      [akpm@linux-foundation.org: fix CONFIG_SYSFS=n build even more]
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Greg KH <greg@kroah.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
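      A sketch of the mechanism being adopted: give the slab kobj_type a release
      callback so that the final kobject_put(), not sysfs removal itself, frees
      the kmem_cache.  Function and structure names follow the SLUB code of that
      period from memory, so treat them as illustrative.
        /* called once the last sysfs reference to the cache's kobject is gone */
        static void kmem_cache_release(struct kobject *k)
        {
                slab_kmem_cache_release(to_slab(k));    /* kfree(name) + free the struct */
        }

        static struct kobj_type slab_ktype = {
                .sysfs_ops      = &slab_sysfs_ops,
                .release        = kmem_cache_release,
        };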
    • revert "mm: vmscan: do not swap anon pages just because free+file is low" · 62376251
      Committed by Johannes Weiner
      This reverts commit 0bf1457f ("mm: vmscan: do not swap anon pages
      just because free+file is low") because it introduced a regression in
      mostly-anonymous workloads, where reclaim would become ineffective and
      trap every allocating task in direct reclaim.
      
      The problem is that there is a runaway feedback loop in the scan balance
      between file and anon, where the balance tips heavily towards a tiny
      thrashing file LRU and anonymous pages are no longer being looked at.
      The commit in question removed the safeguard that would detect such
      situations and respond with forced anonymous reclaim.
      
      This commit was part of a series to fix premature swapping in loads with
      relatively little cache, and while it made a small difference, the cure
      is obviously worse than the disease.  Revert it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@kernel.org>		[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: filemap: update find_get_pages_tag() to deal with shadow entries · 139b6a6f
      Committed by Johannes Weiner
      Dave Jones reports the following crash when find_get_pages_tag() runs
      into an exceptional entry:
      
        kernel BUG at mm/filemap.c:1347!
        RIP: find_get_pages_tag+0x1cb/0x220
        Call Trace:
          find_get_pages_tag+0x36/0x220
          pagevec_lookup_tag+0x21/0x30
          filemap_fdatawait_range+0xbe/0x1e0
          filemap_fdatawait+0x27/0x30
          sync_inodes_sb+0x204/0x2a0
          sync_inodes_one_sb+0x19/0x20
          iterate_supers+0xb2/0x110
          sys_sync+0x44/0xb0
          ia32_do_call+0x13/0x13
      
        1343                         /*
        1344                          * This function is never used on a shmem/tmpfs
        1345                          * mapping, so a swap entry won't be found here.
        1346                          */
        1347                         BUG();
      
      After commit 0cd6144a ("mm + fs: prepare for non-page entries in
      page cache radix trees") this comment and BUG() are out of date because
      exceptional entries can now appear in all mappings - as shadows of
      recently evicted pages.
      
      However, as Hugh Dickins notes,
      
        "it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably
         any other PAGECACHE_TAG_*) to appear on an exceptional entry.
      
         I expect it comes down to an occasional race in RCU lookup of the
         radix_tree: lacking absolute synchronization, we might sometimes
         catch an exceptional entry, with the tag which really belongs with
         the unexceptional entry which was there an instant before."
      
      And indeed, not only is the tree walk lockless, the tags are also read
      in chunks, one radix tree node at a time.  There is plenty of time for
      page reclaim to swoop in and replace a page that was already looked up
      as tagged with a shadow entry.
      
      Remove the BUG() and update the comment.  While reviewing all other
      lookup sites for whether they properly deal with shadow entries of
      evicted pages, update all the comments and fix memcg file charge moving
      to not miss shmem/tmpfs swapcache pages.
      
      Fixes: 0cd6144a ("mm + fs: prepare for non-page entries in page cache radix trees")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Dave Jones <davej@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
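      A sketch of the lookup-loop fix described above: when a tagged slot turns
      out to hold an exception entry, treat a radix-tree indirect pointer as
      "restart the walk" and a shadow entry as "skip it", instead of hitting
      BUG().  This mirrors the pattern used by the other page-cache lookup
      functions of that time; the surrounding find_get_pages_tag() loop is
      abbreviated.
        page = radix_tree_deref_slot(slot);
        if (unlikely(!page))
                continue;

        if (radix_tree_exception(page)) {
                if (radix_tree_deref_retry(page)) {
                        /* transient state from concurrent node reclaim:
                         * restart the whole lookup */
                        goto restart;
                }
                /*
                 * A shadow entry of a recently evicted page.  Because the
                 * tags are read in lockless, node-sized chunks, a tag that
                 * belonged to the page an instant ago can still be seen
                 * here - just skip over it.
                 */
                continue;
        }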
    • mm/compaction: make isolate_freepages start at pageblock boundary · 49e068f0
      Committed by Vlastimil Babka
      The compaction freepage scanner implementation in isolate_freepages()
      starts by taking the current cc->free_pfn value as the first pfn.  In a
      for loop, it scans from this first pfn to the end of the pageblock, and
      then subtracts pageblock_nr_pages from the first pfn to obtain the first
      pfn for the next for loop iteration.
      
      This means that when cc->free_pfn starts at offset X rather than being
      aligned on a pageblock boundary, the scanner will start at offset X in
      every scanned pageblock, ignoring potentially many free pages.  Currently this
      can happen when
      
       a) zone's end pfn is not pageblock aligned, or
      
       b) through zone->compact_cached_free_pfn with CONFIG_HOLES_IN_ZONE
          enabled and a hole spanning the beginning of a pageblock
      
      This patch fixes the problem by aligning the initial pfn in
      isolate_freepages() to pageblock boundary.  This also permits replacing
      the end-of-pageblock alignment within the for loop with a simple
      pageblock_nr_pages increment.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Heesub Shin <heesub.shin@samsung.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Dongjun Shin <d.j.shin@samsung.com>
      Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
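      A sketch of the alignment fix: round the starting pfn down to a pageblock
      boundary before scanning, so each outer iteration can simply step back by
      pageblock_nr_pages.  Variable names follow the compaction code loosely and
      the loop body is elided.
        unsigned long block_start_pfn;

        /*
         * Round down so the whole pageblock is scanned; otherwise every
         * block would be scanned starting at the same stray offset X.
         */
        block_start_pfn = cc->free_pfn & ~(pageblock_nr_pages - 1);

        for (; block_start_pfn >= low_pfn;
             block_start_pfn -= pageblock_nr_pages) {
                unsigned long block_end_pfn = block_start_pfn + pageblock_nr_pages;
                /* ... isolate free pages in [block_start_pfn, block_end_pfn) ... */
        }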
    • mm/page-writeback.c: fix divide by zero in pos_ratio_polynom · d5c9fde3
      Committed by Rik van Riel
      It is possible for "limit - setpoint + 1" to equal zero after being
      truncated to a 32-bit variable, resulting in a divide-by-zero error.

      Using the fully 64-bit divide functions avoids this problem.  It also
      causes pos_ratio_polynom() to return the correct value when
      (setpoint - limit) exceeds 2^32.
      
      Also uninline pos_ratio_polynom, at Andrew's request.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
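      A sketch of the arithmetic in question: div_s64() takes a 32-bit divisor,
      so "limit - setpoint + 1" can truncate to zero, while the 64-bit variant
      keeps the full divisor.  The body follows mm/page-writeback.c of that era
      from memory; RATELIMIT_CALC_SHIFT and the exact clamp are illustrative.
        static long long pos_ratio_polynom(unsigned long setpoint,
                                           unsigned long dirty,
                                           unsigned long limit)
        {
                long long pos_ratio;
                long x;

                /* div_s64() would truncate the divisor to 32 bits; the 64-bit
                 * divide avoids the truncation-to-zero case and gives the
                 * right answer when limit - setpoint exceeds 2^32 */
                x = div64_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
                              limit - setpoint + 1);
                pos_ratio = x;
                pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
                pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
                pos_ratio += 1 << RATELIMIT_CALC_SHIFT;

                return clamp(pos_ratio, 0LL, 2LL << RATELIMIT_CALC_SHIFT);
        }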
    • hugetlb: ensure hugepage access is denied if hugepages are not supported · 457c1b27
      Committed by Nishanth Aravamudan
      Currently, I am seeing the following when I `mount -t hugetlbfs /none
      /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`.  I think it's
      related to the fact that hugetlbfs is probably not correctly setting
      itself up in this state:
      
        Unable to handle kernel paging request for data at address 0x00000031
        Faulting instruction address: 0xc000000000245710
        Oops: Kernel access of bad area, sig: 11 [#1]
        SMP NR_CPUS=2048 NUMA pSeries
        ....
      
      In KVM guests on Power, in a guest not backed by hugepages, we see the
      following:
      
        AnonHugePages:         0 kB
        HugePages_Total:       0
        HugePages_Free:        0
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:         64 kB
      
      HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
      are not supported at boot-time, but this is only checked in
      hugetlb_init().  Extract the check to a helper function, and use it in a
      few relevant places.
      
      This does make hugetlbfs not supported (not registered at all) in this
      environment.  I believe this is fine, as there are no valid hugepages
      and that won't change at runtime.
      
      [akpm@linux-foundation.org: use pr_info(), per Mel]
      [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
      Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
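      A sketch of the helper being extracted, assuming the architecture signals
      "no hugepages" via HPAGE_SHIFT == 0 as described above; whether the patch
      used a macro or an inline, and the exact message text, are reproduced from
      memory and should be read as illustrative.
        #ifndef hugepages_supported
        /*
         * Some architectures only know at boot time whether the hardware and
         * kernel configuration provide hugepages; HPAGE_SHIFT == 0 means no.
         */
        #define hugepages_supported() (HPAGE_SHIFT != 0)
        #endif

        /* then, in the hugetlbfs / hugetlb entry points: */
        if (!hugepages_supported()) {
                pr_info("hugetlbfs: disabling because there are no supported hugepage sizes\n");
                return -ENOTSUPP;       /* refuse to register/mount hugetlbfs */
        }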
    • slub: fix memcg_propagate_slab_attrs · 93030d83
      Committed by Vladimir Davydov
      After creating a cache for a memcg we should initialize its sysfs attrs
      with the values from its parent.  That's what memcg_propagate_slab_attrs
      is for.  Currently it's broken - we clearly muddled root-vs-memcg caches
      there.  Let's fix it up.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • nick kvfree() from apparmor · 39f1f78d
      Committed by Al Viro
      Too many places open-code it.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  2. 06 May 2014, 2 commits
    • slab: Fix off by one in object max number tests. · 30321c7b
      Committed by David Miller
      If freelist_idx_t is a byte, SLAB_OBJ_MAX_NUM should be 255 not 256, and
      likewise if freelist_idx_t is a short, then it should be 65535 not
      65536.
      
      This was leading to all kinds of random crashes on sparc64 where
      PAGE_SIZE is 8192.  One problem shown was that if spinlock debugging was
      enabled, we'd get deadlocks in copy_pte_range() or do_wp_page() with the
      same cpu already holding a lock it shouldn't hold, or the lock belonging
      to a completely unrelated process.
      
      Fixes: a41adfaa ("slab: introduce byte sized index for the freelist of a slab")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
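      The off-by-one, sketched: with a one-byte freelist_idx_t the largest
      representable index is 255, not 256, so the bound needs a "- 1".  The
      definitions are reproduced from memory of mm/slab.c and are illustrative.
        typedef unsigned char freelist_idx_t;   /* unsigned short on larger PAGE_SIZE */

        /* buggy: permits one object too many (256, or 65536 for a short) */
        /* #define SLAB_OBJ_MAX_NUM (1 << sizeof(freelist_idx_t) * BITS_PER_BYTE) */

        /* fixed: the largest value a freelist_idx_t can actually hold */
        #define SLAB_OBJ_MAX_NUM \
                ((1 << sizeof(freelist_idx_t) * BITS_PER_BYTE) - 1)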
    • slab: fix the type of the index on freelist index accessor · 7cc68973
      Committed by Joonsoo Kim
      Commit a41adfaa ("slab: introduce byte sized index for the freelist
      of a slab") changed the size of the freelist index and also changed the
      prototype of the freelist index accessor functions.  And there was a
      mistake.

      The mistake is that although it changed the size of the freelist entries
      correctly, it incorrectly changed the type of the position used to index
      into the freelist.  With that patch, a freelist entry can be 1 byte or
      2 bytes, which means that the number of objects on a slab can be more
      than 255, so we need more than 1 byte for the position used to look up a
      free object on the freelist.  But the patch made this position type
      1 byte as well, so a slab with more than 255 objects cannot work properly
      and, as a consequence, the system cannot boot.
      
      This issue was reported by Steven King on m68knommu which would use
      2 bytes freelist index:
      
        https://lkml.org/lkml/2014/4/16/433
      
      The fix is easy: changing the type of the index argument of the accessor
      functions is enough.  Although 2 bytes would be enough, I use 4 bytes
      since it has no bad effect and makes things easier.  This fix was
      suggested and tested by Steven in his original report.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-and-acked-by: Steven King <sfking@fdwdc.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Tested-by: James Hogan <james.hogan@imgtec.com>
      Tested-by: David Miller <davem@davemloft.net>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
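      A sketch of the accessor fix: the stored entries stay byte-sized
      (freelist_idx_t), but the position argument used to index into the
      freelist goes back to a plain unsigned int, since a slab can hold more
      than 255 objects.  Reproduced from memory of mm/slab.c, so treat the
      exact shape as illustrative.
        static inline freelist_idx_t get_free_obj(struct page *page, unsigned int idx)
        {
                return ((freelist_idx_t *)page->freelist)[idx];
        }

        static inline void set_free_obj(struct page *page,
                                        unsigned int idx, freelist_idx_t val)
        {
                ((freelist_idx_t *)page->freelist)[idx] = val;
        }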
  3. 29 April 2014, 1 commit
    • mm: don't pointlessly use BUG_ON() for sanity check · 50f5aa8a
      Committed by Linus Torvalds
      BUG_ON() is a big hammer, and should be used _only_ if there is some
      major corruption that you cannot possibly recover from, making it
      imperative that the current process (and possibly the whole machine) be
      terminated with extreme prejudice.
      
      The trivial sanity check in the vmacache code is *not* such a fatal
      error.  Recovering from it is absolutely trivial, and using BUG_ON()
      just makes it harder to debug for no actual advantage.
      
      To make matters worse, the placement of the BUG_ON() (only if the range
      check matched) actually makes it harder to hit the sanity check to begin
      with, so _if_ there is a bug (and we just got a report from Srivatsa
      Bhat that this can indeed trigger), it is harder to debug not just
      because the machine is possibly dead, but because we don't have better
      coverage.
      
      BUG_ON() must *die*.  Maybe we should add a checkpatch warning for it,
      because it is simply just about the worst thing you can ever do if you
      hit some "this cannot happen" situation.
      Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 26 April 2014, 1 commit
    • mm: split 'tlb_flush_mmu()' into tlb flushing and memory freeing parts · 1cf35d47
      Committed by Linus Torvalds
      The mmu-gather operation 'tlb_flush_mmu()' has done two things: the
      actual tlb flush operation, and the batched freeing of the pages that
      the TLB entries pointed at.
      
      This splits the operation into separate phases, so that the forced
      batched flushing done by zap_pte_range() can now do the actual TLB flush
      while still holding the page table lock, but delay the batched freeing
      of all the pages to after the lock has been dropped.
      
      This in turn allows us to avoid a race condition between
      set_page_dirty() (as called by zap_pte_range() when it finds a dirty
      shared memory pte) and page_mkclean(): because we now flush all the
      dirty page data from the TLBs while holding the pte lock,
      page_mkclean() will be held up walking the (recently cleaned) page
      tables until after the TLB entries have been flushed from all CPUs.
      Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Tested-by: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
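      A sketch of the split described above: TLB invalidation and the batched
      page freeing become separate helpers, with tlb_flush_mmu() keeping its old
      behaviour by calling both.  Helper names match the commit's description;
      the bodies are abbreviated from memory of the generic mmu_gather code.
        static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
        {
                if (!tlb->need_flush)
                        return;
                tlb->need_flush = 0;
                tlb_flush(tlb);                 /* the actual TLB invalidation */
        }

        static void tlb_flush_mmu_free(struct mmu_gather *tlb)
        {
                struct mmu_gather_batch *batch;

                for (batch = &tlb->local; batch; batch = batch->next) {
                        free_pages_and_swap_cache(batch->pages, batch->nr);
                        batch->nr = 0;
                }
                tlb->active = &tlb->local;
        }

        void tlb_flush_mmu(struct mmu_gather *tlb)
        {
                tlb_flush_mmu_tlbonly(tlb);     /* may now run under the pte lock */
                tlb_flush_mmu_free(tlb);        /* deferred until the lock is dropped */
        }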
  5. 23 April 2014, 1 commit
    • mm: make fixup_user_fault() check the vma access rights too · 1b17844b
      Committed by Linus Torvalds
      fixup_user_fault() is used by the futex code when the direct user access
      fails, and the futex code wants it to either map in the page in a usable
      form or return an error.  It relied on handle_mm_fault() to map the
      page, and correctly checked the error return from that, but while that
      does map the page, it doesn't actually guarantee that the page will be
      mapped with sufficient permissions to be then accessed.
      
      So do the appropriate tests of the vma access rights by hand.
      
      [ Side note: arguably handle_mm_fault() could just do that itself, but
        we have traditionally done it in the caller, because some callers -
        notably get_user_pages() - have been able to access pages even when
        they are mapped with PROT_NONE.  Maybe we should re-visit that design
        decision, but in the meantime this is the minimal patch. ]
      
      Found by Dave Jones running his trinity tool.
      Reported-by: Dave Jones <davej@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
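      A sketch of the by-hand access check described above, assuming the usual
      mapping of fault flags to vma flags used elsewhere in the fault path; the
      error handling of handle_mm_fault() is abbreviated and illustrative.
        int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
                             unsigned long address, unsigned int fault_flags)
        {
                struct vm_area_struct *vma;
                vm_flags_t vm_flags;
                int ret;

                vma = find_extend_vma(mm, address);
                if (!vma || address < vma->vm_start)
                        return -EFAULT;

                /* new: the vma must actually permit the requested access */
                vm_flags = (fault_flags & FAULT_FLAG_WRITE) ? VM_WRITE : VM_READ;
                if (!(vm_flags & vma->vm_flags))
                        return -EFAULT;

                ret = handle_mm_fault(mm, vma, address, fault_flags);
                if (ret & VM_FAULT_ERROR) {
                        if (ret & VM_FAULT_OOM)
                                return -ENOMEM;
                        return -EFAULT;         /* SIGBUS/SIGSEGV cases, abbreviated */
                }
                return 0;
        }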
  6. 19 April 2014, 4 commits
  7. 14 April 2014, 1 commit
  8. 12 April 2014, 1 commit
  9. 11 April 2014, 1 commit
  10. 09 April 2014, 1 commit
    • mm: vmscan: do not swap anon pages just because free+file is low · 0bf1457f
      Committed by Johannes Weiner
      Page reclaim force-scans / swaps anonymous pages when file cache drops
      below the high watermark of a zone in order to prevent what little cache
      remains from thrashing.
      
      However, on bigger machines the high watermark value can be quite large
      and when the workload is dominated by a static anonymous/shmem set, the
      file set might just be a small window of used-once cache.  In such
      situations, the VM starts swapping heavily when instead it should be
      recycling the no longer used cache.
      
      This is a longer-standing problem, but it's more likely to trigger after
      commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy")
      because file pages can no longer accumulate in a single zone and are
      dispersed into smaller fractions among the available zones.
      
      To resolve this, do not force scan anon when file pages are low but
      instead rely on the scan/rotation ratios to make the right prediction.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: <stable@kernel.org>		[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 08 April 2014, 10 commits
    • mm: create generic early_ioremap() support · 9e5c33d7
      Committed by Mark Salter
      This patch creates a generic implementation of early_ioremap() support
      based on the existing x86 implementation.  early_ioremap() is useful for
      early boot code which needs to temporarily map I/O or memory regions
      before normal mapping functions such as ioremap() are available.

      Some architectures have an optional MMU.  In the no-MMU case, the remap
      functions simply return the passed-in physical address and the unmap
      functions do nothing.
      Signed-off-by: Mark Salter <msalter@redhat.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
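      A sketch of the no-MMU behaviour described in the last paragraph: with no
      MMU there is nothing to map, so the early remap helpers hand back the
      physical address and the unmap side is a no-op.  Signatures follow the
      generic early_ioremap interface loosely and are illustrative.
        #ifndef CONFIG_MMU
        void __init __iomem *early_ioremap(resource_size_t phys_addr, unsigned long size)
        {
                return (__force void __iomem *)phys_addr;       /* identity: no mapping needed */
        }

        void __init *early_memremap(resource_size_t phys_addr, unsigned long size)
        {
                return (void *)phys_addr;
        }

        void __init early_iounmap(void __iomem *addr, unsigned long size)
        {
                /* nothing to tear down */
        }
        #endif /* !CONFIG_MMU */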
    • slub: use raw_cpu_inc for incrementing statistics · 88da03a6
      Committed by Christoph Lameter
      Statistics are not critical to the operation of the allocation but
      should also not cause too much overhead.
      
      When __this_cpu_inc is altered to check whether preemption is disabled,
      this check triggers.  Use raw_cpu_inc to avoid the checks.  Using
      this_cpu_ops may
      cause interrupt disable/enable sequences on various arches which may
      significantly impact allocator performance.
      
      [akpm@linux-foundation.org: add comment]
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
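      A sketch of the change in SLUB's stat() helper: raw_cpu_inc() skips the
      preemption checks (and the interrupt-disable fallback some arches would
      otherwise need), which is acceptable because losing an occasional
      statistics increment is harmless.  Reproduced from memory of mm/slub.c.
        static inline void stat(const struct kmem_cache *s, enum stat_item si)
        {
        #ifdef CONFIG_SLUB_STATS
                /*
                 * The rmw is racy on a preemptible kernel but this is
                 * acceptable: the counters are not tied to the slab operation
                 * itself, and losing the odd increment does not matter.
                 */
                raw_cpu_inc(s->cpu_slab->stat[si]);
        #endif
        }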
    • slub: fix leak of 'name' in sysfs_slab_add · 54b6a731
      Committed by Dave Jones
      The failure paths of sysfs_slab_add don't release the allocation of
      'name' made by create_unique_id() a few lines above the context of the
      diff below.  Create a common exit path to make it more obvious what
      needs freeing.
      
      [vdavydov@parallels.com: free the name only if !unmergeable]
      Signed-off-by: Dave Jones <davej@fedoraproject.org>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
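      A sketch of the common-exit-path shape described above; the sysfs calls in
      the middle are abbreviated, and the "!unmergeable" condition reflects the
      bracketed fix-up note - the id string is only kmalloc'ed when the cache is
      mergeable and create_unique_id() had to build one.  Treat the names and
      exact calls as illustrative.
        static int sysfs_slab_add(struct kmem_cache *s)
        {
                int err;
                const char *name;
                int unmergeable = slab_unmergeable(s);

                if (unmergeable)
                        name = s->name;                 /* not allocated here */
                else
                        name = create_unique_id(s);     /* kmalloc'ed id string */

                err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, "%s", name);
                if (err)
                        goto out;                       /* previously: return err (leak) */

                err = sysfs_create_group(&s->kobj, &slab_attr_group);
                if (err)
                        goto out_del_kobj;              /* previously: return err (leak) */

                /* ... uevent and alias creation as before ... */
        out:
                if (!unmergeable)
                        kfree(name);                    /* the one place the id is freed */
                return err;
        out_del_kobj:
                kobject_del(&s->kobj);
                goto out;
        }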
    • slub: rework sysfs layout for memcg caches · 9a41707b
      Committed by Vladimir Davydov
      Currently, we try to arrange sysfs entries for memcg caches in the same
      manner as for global caches.  Apart from turning /sys/kernel/slab into a
      mess when there are a lot of kmem-active memcgs created, it actually
      does not work properly - we won't create more than one link to a memcg
      cache in case its parent is merged with another cache.  For instance, if
      A is a root cache merged with another root cache B, we will have the
      following sysfs setup:
      
        X
        A -> X
        B -> X
      
      where X is some unique id (see create_unique_id()).  Now if memcgs M and
      N start to allocate from cache A (or B, which is the same), we will get:
      
        X
        X:M
        X:N
        A -> X
        B -> X
        A:M -> X:M
        A:N -> X:N
      
      Since B is an alias for A, we won't get entries B:M and B:N, which is
      confusing.
      
      It is more logical to have entries for memcg caches under the
      corresponding root cache's sysfs directory.  This would allow us to keep
      sysfs layout clean, and avoid such inconsistencies like one described
      above.
      
      This patch does the trick.  It creates a "cgroup" kset in each root
      cache kobject to keep its children caches there.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: adjust memcg caches when creating cache alias · 84d0ddd6
      Committed by Vladimir Davydov
      Otherwise, kzalloc() called from a memcg won't clear the whole object.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: do not destroy children caches if parent has aliases · b8529907
      Committed by Vladimir Davydov
      Currently we destroy children caches at the very beginning of
      kmem_cache_destroy().  This is wrong, because the root cache will not
      necessarily be destroyed in the end - if it has aliases (refcount > 0),
      kmem_cache_destroy() will simply decrement its refcount and return.  In
      this case, at best we will get a bunch of warnings in dmesg, like this
      one:
      
        kmem_cache_destroy kmalloc-32:0: Slab cache still has objects
        CPU: 1 PID: 7139 Comm: modprobe Tainted: G    B   W    3.13.0+ #117
        Call Trace:
          dump_stack+0x49/0x5b
          kmem_cache_destroy+0xdf/0xf0
          kmem_cache_destroy_memcg_children+0x97/0xc0
          kmem_cache_destroy+0xf/0xf0
          xfs_mru_cache_uninit+0x21/0x30 [xfs]
          exit_xfs_fs+0x2e/0xc44 [xfs]
          SyS_delete_module+0x198/0x1f0
          system_call_fastpath+0x16/0x1b
      
      At worst - if kmem_cache_destroy() races with an allocation from a
      memcg cache - the kernel will panic.

      This patch fixes this by moving the destruction of children caches after
      the check for whether the cache has aliases.  Plus, it forbids destroying
      a root cache if it still has children caches, because each child cache
      keeps a reference to its parent.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: unregister cache from memcg before starting to destroy it · 051dd460
      Committed by Vladimir Davydov
      Currently, memcg_unregister_cache(), which deletes the cache being
      destroyed from the memcg_slab_caches list, is called after
      __kmem_cache_shutdown() (see kmem_cache_destroy()), which starts to
      destroy the cache.
      
      As a result, one can access a partially destroyed cache while traversing
      a memcg_slab_caches list, which can have deadly consequences (for
      instance, cache_show() called for each cache on a memcg_slab_caches list
      from mem_cgroup_slabinfo_read() will dereference pointers to already
      freed data).
      
      To fix this, let's move memcg_unregister_cache() to before the beginning
      of the cache destruction process, issuing memcg_register_cache() again on
      failure.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: separate memcg vs root cache creation paths · 794b1248
      Committed by Vladimir Davydov
      Memcg-awareness turned kmem_cache_create() into a dirty interweaving of
      memcg-only and except-for-memcg calls.  To clean this up, let's move the
      code responsible for memcg cache creation to a separate function.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: cleanup memcg cache creation · 5722d094
      Committed by Vladimir Davydov
      This patch cleans up the memcg cache creation path as follows:
      
      - Move memcg cache name creation to a separate function to be called
        from kmem_cache_create_memcg().  This allows us to get rid of the mutex
        protecting the temporary buffer used for the name formatting, because
        the whole cache creation path is protected by the slab_mutex.
      
      - Get rid of memcg_create_kmem_cache().  This function serves as a proxy
        to kmem_cache_create_memcg().  After separating the cache name creation
        path, it would be reduced to a function call, so let's inline it.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: never try to merge memcg caches · a44cb944
      Committed by Vladimir Davydov
      When a kmem cache is created (kmem_cache_create_memcg()), we first try to
      find a compatible cache that already exists and can handle requests from
      the new cache, i.e.  has the same object size, alignment, ctor, etc.  If
      there is such a cache, we do not create any new caches, instead we simply
      increment the refcount of the cache found and return it.
      
      Currently we do this procedure not only when creating root caches, but
      also for memcg caches.  However, there is no point in that: every memcg
      cache has exactly the same parameters as its parent, and cache merging
      cannot be turned off at runtime (only at boot, by passing "slub_nomerge"),
      so the root caches of any two potentially mergeable memcg caches would
      already have been merged, i.e. it must be the same root cache, and
      therefore we could never even get to creating the memcg cache, because it
      would already exist.
      
      The only exception is boot caches - they are explicitly forbidden to be
      merged by setting their refcount to -1.  There are currently only two of
      them - kmem_cache and kmem_cache_node, which are used in slab internals (I
      do not count kmalloc caches as their refcount is set to 1 immediately
      after creation).  Since they are prevented from merging up front, we
      should avoid merging their children too.
      
      So let's remove the useless code responsible for merging memcg caches.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
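      A sketch of where the merge attempt ends up being skipped, assuming the
      kmem_cache_create_memcg() flow discussed in the neighbouring commits; the
      guard and the __kmem_cache_alias() call are illustrative of "only root
      caches go through cache merging", not a quote of the patch.
        mutex_lock(&slab_mutex);
        /* ... argument checks ... */

        /*
         * Only root caches are run through the merge path: a memcg cache
         * always has the same parameters as its (already merged) root cache,
         * so there is nothing further to merge it with.
         */
        if (!memcg) {
                s = __kmem_cache_alias(name, size, align, flags, ctor);
                if (s)
                        goto out_unlock;
        }

        /* otherwise fall through and create a fresh cache for this memcg */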