1. 06 Nov, 2015 3 commits
    • H
      mm: rename mem_cgroup_migrate to mem_cgroup_replace_page · 45637bab
      Hugh Dickins committed
      After v4.3's commit 0610c25d ("memcg: fix dirty page migration")
      mem_cgroup_migrate() doesn't have much to offer in page migration: convert
      migrate_misplaced_transhuge_page() to set_page_memcg() instead.
      
      Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its
      remaining callers are replace_page_cache_page() and shmem_replace_page():
      both of whom passed lrucare true, so just eliminate that argument.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      45637bab
    • J
      mm/filemap.c: make global sync not clear error status of individual inodes · aa750fd7
      Junichi Nomura committed
      filemap_fdatawait() is a function to wait for on-going writeback to
      complete but also consume and clear error status of the mapping set during
      writeback.
      
      The latter functionality is critical for applications to detect writeback
      error with system calls like fsync(2)/fdatasync(2).
      
      However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl,
      which don't check error status of individual mappings.
      
      As a result, fsync() may not be able to detect writeback error if events
      happen in the following order:
      
         Application                    System admin
         ----------------------------------------------------------
         write data on page cache
                                        Run sync command
                                        writeback completes with error
                                        filemap_fdatawait() clears error
         fsync returns success
         (but the data is not on disk)
      
      This patch adds filemap_fdatawait_keep_errors() for call sites where
      writeback error is not handled so that they don't clear error status.
      Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
      Acked-by: Andi Kleen <ak@linux.intel.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Cc: Fengguang Wu <fengguang.wu@gmail.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa750fd7
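
The difference between the two wait variants can be sketched in userspace. This is only a model, not kernel code: `struct mapping_model`, `wait_and_consume()`, `wait_keep_errors()` and `fsync_after_sync()` are hypothetical names standing in for the mapping's error flag, filemap_fdatawait(), filemap_fdatawait_keep_errors() and the fsync path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the error state kept on a mapping. */
struct mapping_model {
    bool io_error;   /* set when writeback failed */
};

/* Models filemap_fdatawait(): report the error and clear it. */
int wait_and_consume(struct mapping_model *m)
{
    assert(m != NULL);
    int ret = m->io_error ? -EIO : 0;
    m->io_error = false;
    return ret;
}

/* Models filemap_fdatawait_keep_errors(): wait only, leave the error set. */
void wait_keep_errors(struct mapping_model *m)
{
    assert(m != NULL);
    /* Nothing is cleared, so a later fsync() can still see the error. */
}

/* Returns what fsync() would report if a global sync ran first. */
int fsync_after_sync(bool sync_keeps_errors)
{
    struct mapping_model m = { .io_error = true };  /* writeback failed */

    if (sync_keeps_errors)
        wait_keep_errors(&m);
    else
        (void)wait_and_consume(&m);  /* sync(2) discards the result */

    return wait_and_consume(&m);     /* fsync(2) reports to the caller */
}
```

With the consuming variant on the sync path the model's fsync reports success even though writeback failed; with the keep-errors variant it reports -EIO.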
    • R
      mm: use only per-device readahead limit · 600e19af
      Roman Gushchin committed
      Maximal readahead size is limited now by two values:
       1) by global 2MB constant (MAX_READAHEAD in max_sane_readahead())
       2) by configurable per-device value* (bdi->ra_pages)
      
      Some devices require a custom readahead limit.
      For instance, for RAIDs it's calculated as the number of devices
      multiplied by the chunk size times 2.
      
      Readahead size can never be larger than bdi->ra_pages * 2 value
      (POSIX_FADV_SEQUENTIAL doubles readahead size).
      
      If so, why do we need two limits?
      I suggest to completely remove this max_sane_readahead() stuff and
      use per-device readahead limit everywhere.
      
      Also, using right readahead size for RAID disks can significantly
      increase i/o performance:
      
      before:
        $ dd if=/dev/md2 of=/dev/null bs=100M count=100
        100+0 records in
        100+0 records out
        10485760000 bytes (10 GB) copied, 12.9741 s, 808 MB/s
      
      after:
        $ dd if=/dev/md2 of=/dev/null bs=100M count=100
        100+0 records in
        100+0 records out
        10485760000 bytes (10 GB) copied, 8.91317 s, 1.2 GB/s
      
      (It's an 8-disk RAID5 array).
      
      This patch doesn't change sys_readahead and madvise(MADV_WILLNEED)
      behavior introduced by 6d2be915 ("mm/readahead.c: fix readahead
      failure for memoryless NUMA nodes and limit readahead pages").
      Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      600e19af
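
The arithmetic behind the two limits can be worked through in a small sketch. The 4 KiB page size and the 512 KiB chunk size used in the usage note are assumptions made for illustration (the commit does not state them), and the function names are hypothetical.

```c
#include <assert.h>

#define PAGE_SIZE_BYTES      4096UL        /* assumed 4 KiB pages */
#define OLD_GLOBAL_CAP_BYTES (2UL << 20)   /* the global 2 MB constant */

/* RAID rule quoted in the commit: devices * chunk size * 2, in pages. */
unsigned long raid_ra_pages(unsigned long ndevices, unsigned long chunk_bytes)
{
    assert(chunk_bytes % PAGE_SIZE_BYTES == 0);
    return ndevices * chunk_bytes * 2 / PAGE_SIZE_BYTES;
}

/* Model of the old behaviour: a request clamped by the global constant. */
unsigned long old_limit_pages(unsigned long requested_pages)
{
    unsigned long cap = OLD_GLOBAL_CAP_BYTES / PAGE_SIZE_BYTES;
    return requested_pages < cap ? requested_pages : cap;
}

/* Upper bound stated in the commit: sequential advice doubles ra_pages. */
unsigned long max_effective_pages(unsigned long ra_pages)
{
    return ra_pages * 2;
}
```

For an 8-device array with the assumed 512 KiB chunk, raid_ra_pages() gives 2048 pages (8 MiB), while the old global clamp would have cut any such request down to 512 pages (2 MB).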
  2. 23 Oct, 2015 1 commit
    • J
      mm: make sendfile(2) killable · 296291cd
      Jan Kara committed
      Currently a simple program below issues a sendfile(2) system call which
      takes about 62 days to complete in my test KVM instance.
      
              int fd;
              off_t off = 0;
      
              fd = open("file", O_RDWR | O_TRUNC | O_SYNC | O_CREAT, 0644);
              ftruncate(fd, 2);
              lseek(fd, 0, SEEK_END);
              sendfile(fd, fd, &off, 0xfffffff);
      
      Now you should not ask the kernel to do stupid stuff like copying 256MB
      in 2-byte chunks and calling fsync(2) after each chunk, but if you do,
      the sysadmin should have a way to stop you.
      
      We actually do have a check for fatal_signal_pending() in
      generic_perform_write() which triggers in this path.  However, because we
      always succeed in writing something before the check is done, we return
      a value > 0 from generic_perform_write() and thus the information about
      the signal gets lost.
      
      Fix the problem by doing the signal check before writing anything.  That
      way generic_perform_write() returns -EINTR, the error gets propagated up
      and the sendfile loop terminates early.
      Signed-off-by: Jan Kara <jack@suse.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      296291cd
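
Why the position of the signal check matters can be shown with a userspace model. This is a sketch, not the kernel implementation: `perform_write_model()` and `outer_loop_calls()` are hypothetical names, and the `check_first` flag selects between the old ordering (write, then check) and the fixed ordering (check, then write).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* One call models a write of `len` bytes done in `chunk`-byte steps. */
long perform_write_model(size_t len, size_t chunk,
                         bool signal_pending, bool check_first)
{
    long written = 0;
    long status = 0;

    assert(chunk > 0);
    while (len > 0) {
        size_t n = len < chunk ? len : chunk;

        if (check_first && signal_pending) {   /* fixed ordering */
            status = -EINTR;
            break;
        }
        written += (long)n;
        len -= n;
        if (!check_first && signal_pending) {  /* old ordering */
            status = -EINTR;
            break;
        }
    }
    /* Progress wins over the error, so a late check loses the signal. */
    return written ? written : status;
}

/* Counts the calls a sendfile-like outer loop makes with a signal pending. */
unsigned long outer_loop_calls(size_t total, size_t per_call, bool check_first)
{
    unsigned long calls = 0;

    assert(per_call > 0);
    while (total > 0) {
        size_t n = total < per_call ? total : per_call;
        long ret = perform_write_model(n, n, true, check_first);

        calls++;
        if (ret < 0)
            break;              /* error propagates, loop ends early */
        total -= (size_t)ret;
    }
    return calls;
}
```

In this model, with a signal pending the old ordering still walks through every 2-byte chunk, while the fixed ordering stops on the first call.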
  3. 07 Oct, 2015 1 commit
    • L
      Revert "fs: do not prefault sys_write() user buffer pages" · 00a3d660
      Linus Torvalds committed
      This reverts commit 998ef75d.
      
      The commit itself does not appear to be buggy per se, but it is exposing
      a bug in ext4 (and Ted thinks ext3 too, but we solved that by getting
      rid of it).  It's too late in the release cycle to really worry about
      this, even if Dave Hansen has a patch that may actually fix the
      underlying ext4 problem.  We can (and should) revisit this for the next
      release.
      
      The problem is that moving the prefaulting later now exposes a special
      case with partially successful writes that isn't handled correctly.  And
      the prefaulting likely isn't normally even that much of a performance
      issue - it looks like at least one reason Dave saw this in his
      performance tests is that he also ran them on Skylake that now supports
      the new SMAP code, which makes the normally very cheap user space
      prefaulting noticeably more expensive.
      Bisected-and-acked-by: Ted Ts'o <tytso@mit.edu>
      Analyzed-and-acked-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00a3d660
  4. 09 Sep, 2015 2 commits
    • V
      mm: rename alloc_pages_exact_node() to __alloc_pages_node() · 96db800f
      Vlastimil Babka committed
      alloc_pages_exact_node() was introduced in commit 6484eb3e ("page
      allocator: do not check NUMA node ID when the caller knows the node is
      valid") as an optimized variant of alloc_pages_node(), that doesn't
      fallback to current node for nid == NUMA_NO_NODE.  Unfortunately the
      name of the function can easily suggest that the allocation is
      restricted to the given node and fails otherwise.  In truth, the node is
      only preferred, unless __GFP_THISNODE is passed among the gfp flags.
      
      The misleading name has led to mistakes in the past, see for example
      commits 5265047a ("mm, thp: really limit transparent hugepage
      allocation to local node") and b360edb4 ("mm, mempolicy:
      migrate_to_node should only migrate to node").
      
      Another issue with the name is that there's a family of
      alloc_pages_exact*() functions where 'exact' means exact size (instead
      of page order), which leads to more confusion.
      
      To prevent further mistakes, this patch effectively renames
      alloc_pages_exact_node() to __alloc_pages_node() to better convey that
      it's an optimized variant of alloc_pages_node() not intended for general
      usage.  Both functions get described in comments.
      
      Providing a real convenience function for allocations restricted to a
      node has also been considered, but the majority opinion seems to be that
      __GFP_THISNODE already provides that functionality and we shouldn't
      duplicate the API needlessly.  The number of users would be small
      anyway.
      
      Existing callers of alloc_pages_exact_node() are simply converted to
      call __alloc_pages_node(), with the exception of sba_alloc_coherent()
      which open-codes the check for NUMA_NO_NODE, so it is converted to use
      alloc_pages_node() instead.  This means it no longer performs some
      VM_BUG_ON checks, and since the current check for nid in
      alloc_pages_node() uses a 'nid < 0' comparison (which includes
      NUMA_NO_NODE), it may hide wrong values which would be previously
      exposed.
      
      Both differences will be rectified by the next patch.
      
      To sum up, this patch makes no functional changes, except temporarily
      hiding potentially buggy callers.  Restricting the checks in
      alloc_pages_node() is left for the next patch which can in turn expose
      more existing buggy callers.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Robin Holt <robinmholt@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cliff Whickman <cpw@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96db800f
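
The naming problem is easier to see with a toy model of node selection. Everything here is a simplification under stated assumptions: the four-node machine, the bitmask of nodes with free memory and all function names are hypothetical, and the fallback walks nodes in index order, whereas the real allocator follows its zonelists.

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE_MODEL (-1)  /* mirrors NUMA_NO_NODE */
#define NR_NODES_MODEL     4     /* hypothetical machine */

/* Models alloc_pages_node(): a negative nid falls back to the current node. */
int resolve_nid(int nid, int current_node)
{
    return nid < 0 ? current_node : nid;
}

/* Models __alloc_pages_node(): the caller guarantees a valid node. */
int resolve_nid_checked(int nid)
{
    assert(nid >= 0 && nid < NR_NODES_MODEL);  /* stands in for VM_BUG_ON */
    return nid;
}

/*
 * The node is only preferred. `thisnode` models __GFP_THISNODE, and bit n
 * of free_mask says node n has free memory. Returns the node used or -1.
 */
int allocate_from(int preferred, bool thisnode, unsigned free_mask)
{
    assert(preferred >= 0 && preferred < NR_NODES_MODEL);
    if (free_mask & (1u << preferred))
        return preferred;
    if (thisnode)
        return -1;               /* restricted: no fallback allowed */
    for (int n = 0; n < NR_NODES_MODEL; n++)
        if (free_mask & (1u << n))
            return n;            /* simplified fallback order */
    return -1;
}
```

In the model, asking for node 0 when only nodes 1 and 2 have memory yields node 1 unless the thisnode restriction is set, which is the behaviour the old "exact" name obscured.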
    • D
      fs: do not prefault sys_write() user buffer pages · 998ef75d
      Dave Hansen committed
      === Short summary ====
      
      iov_iter_fault_in_readable() works around a really rare case and we can
      avoid the deadlock it addresses in another way: disable page faults and
      work around copy failures by faulting after the copy in a slow path
      instead of before in a hot one.
      
      I have a little microbenchmark that does repeated, small writes to tmpfs.
      This patch speeds that micro up by 6.2%.
      
      === Long version ===
      
      When doing a sys_write() we have a source buffer in userspace and then a
      target file page.
      
      If both of those are the same physical page, there is a potential deadlock
      that we avoid.  It would happen something like this:
      
      1. We start the write to the file
      2. Allocate page cache page and set it !Uptodate
      3. Touch the userspace buffer to copy in the user data
      4. Page fault (since source of the write not yet mapped)
      5. Page fault code tries to lock the page and deadlocks
      
      (more details on this below)
      
      To avoid this, we prefault the page to guarantee that this fault does not
      occur.  But, this prefault comes at a cost.  It is one of the most
      expensive things that we do in a hot write() path (especially if we
      compare it to the read path).  It is working around a pretty rare case.
      
      To fix this, it's pretty simple.  We move the "prefault" code to run after
      we attempt the copy.  We explicitly disable page faults _during_ the copy,
      detect the copy failure, then execute the "prefault" outside of where the
      page lock needs to be held.
      
      iov_iter_copy_from_user_atomic() actually already has an implicit
      pagefault_disable() inside of it (at least on x86), but we add an explicit
      one.  I don't think we can depend on every kmap_atomic() implementation to
      pagefault_disable() for eternity.
      
      ===================================================
      
      The stack trace when this happens looks like this:
      
        wait_on_page_bit_killable+0xc0/0xd0
        __lock_page_or_retry+0x84/0xa0
        filemap_fault+0x1ed/0x3d0
        __do_fault+0x41/0xc0
        handle_mm_fault+0x9bb/0x1210
        __do_page_fault+0x17f/0x3d0
        do_page_fault+0xc/0x10
        page_fault+0x22/0x30
        generic_perform_write+0xca/0x1a0
        __generic_file_write_iter+0x190/0x1f0
        ext4_file_write_iter+0xe9/0x460
        __vfs_write+0xaa/0xe0
        vfs_write+0xa6/0x1a0
        SyS_write+0x46/0xa0
        entry_SYSCALL_64_fastpath+0x12/0x6a
        0xffffffffffffffff
      
      (Note, this does *NOT* happen in practice today because
       the kmap_atomic() does a pagefault_disable().  The trace
       above was obtained by taking out the pagefault_disable().)
      
      You can trigger the deadlock with this little code snippet:
      
      	fd = open("foo", O_RDWR);
      	fdmap = mmap(NULL, len, PROT_WRITE|PROT_READ, MAP_SHARED, fd, 0);
      	write(fd, &fdmap[0], 1);
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Paul Cassella <cassella@cray.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      998ef75d
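
The "copy first, fault in only on failure" ordering described in this commit can be sketched as a userspace model. It is not the kernel code path: `struct user_buf`, its `resident` flag and the three functions are hypothetical stand-ins for the user buffer, the atomic copy with page faults disabled, and the slow-path fault-in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user buffer; `resident` models whether its page is mapped. */
struct user_buf {
    const char *data;
    size_t len;
    bool resident;
    unsigned faults;   /* counts slow-path fault-ins */
};

/* Models a copy with page faults disabled: fails if the source is absent. */
size_t copy_atomic(char *dst, const struct user_buf *src)
{
    if (!src->resident)
        return 0;
    memcpy(dst, src->data, src->len);
    return src->len;
}

/* Models faulting the source in, done where no page lock is held. */
void fault_in(struct user_buf *src)
{
    src->resident = true;
    src->faults++;
}

/* Copy first; fault in and retry only when the copy fails. */
size_t write_model(char *dst, struct user_buf *src)
{
    assert(dst != NULL && src != NULL);
    for (;;) {
        /* the page would be locked and faults disabled here */
        size_t copied = copy_atomic(dst, src);
        /* faults re-enabled and the page unlocked here */
        if (copied == src->len)
            return copied;
        fault_in(src);   /* slow path, taken only after a failed copy */
    }
}
```

A buffer that is already resident never takes the slow path, which is where the hot-path saving in the model comes from; a non-resident one pays for exactly one fault-in and a retry.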
  5. 25 Jun, 2015 2 commits
    • M
      mm: do not ignore mapping_gfp_mask in page cache allocation paths · 6afdb859
      Michal Hocko committed
      page_cache_read, do_generic_file_read, __generic_file_splice_read and
      __ntfs_grab_cache_pages currently ignore mapping_gfp_mask when calling
      add_to_page_cache_lru which might cause recursion into fs down in the
      direct reclaim path if the mapping really relies on GFP_NOFS semantic.
      
      This doesn't seem to be the case now because page_cache_read (page fault
      path) doesn't seem to suffer from the reclaim recursion issues and
      do_generic_file_read and __generic_file_splice_read also shouldn't be
      called under fs locks which would deadlock in the reclaim path.  Anyway it
      is better to obey mapping gfp mask and prevent from later breakage.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Anton Altaparmakov <anton@tuxera.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6afdb859
    • M
      hugetlb: do not account hugetlb pages as NR_FILE_PAGES · 4165b9b4
      Michal Hocko committed
      hugetlb pages use add_to_page_cache to track shared mappings.  This is
      OK from the data structure point of view but it is less so from the
      NR_FILE_PAGES accounting:
      
      	- huge pages are accounted as 4k which is clearly wrong
      	- this counter is used as the amount of the reclaimable page
      	  cache which is incorrect as well because hugetlb pages are
      	  special and not reclaimable
      	- the counter is then exported to userspace via /proc/meminfo
      	  (in Cached:), /proc/vmstat and /proc/zoneinfo as
      	  nr_file_pages which is confusing at least:
      	  Cached:          8883504 kB
      	  HugePages_Free:     8348
      	  ...
      	  Cached:          8916048 kB
      	  HugePages_Free:      156
      	  ...
      	  that's 8192 huge pages allocated, which is ~16G accounted as 32M
      
      There are usually not that many huge pages in the system for this to
      make any visible difference e.g.  by fooling __vm_enough_memory or
      zone_pagecache_reclaimable.
      
      Fix this by special casing huge pages in both __delete_from_page_cache
      and __add_to_page_cache_locked.  replace_page_cache_page is currently
      only used by fuse and that shouldn't touch hugetlb pages AFAICS but it
      is more robust to check for special casing there as well.
      
      Hugetlb pages shouldn't get to any other paths where we do accounting:
      	- migration - we have a special handling via
      	  hugetlbfs_migrate_page
      	- shmem - doesn't handle hugetlb pages directly even for
      	  SHM_HUGETLB resp. MAP_HUGETLB
      	- swapcache - hugetlb is not swappable
      
      This has a user visible effect but I believe it is reasonable because the
      previously exported number is simply bogus.
      
      An alternative would be to account hugetlb pages with their real size and
      treat them similar to shmem.  But this has some drawbacks.
      
      First we would have to special case in kernel users of NR_FILE_PAGES and
      considering how hugetlb is special we would have to do it everywhere.  We
      do not want Cached exported by /proc/meminfo to include it because the
      value would be even more misleading.
      
      __vm_enough_memory and zone_pagecache_reclaimable would have to do the
      same thing because those pages are simply not reclaimable.  The correction
      is even not trivial because we would have to consider all active hugetlb
      page sizes properly.  Users of the counter outside of the kernel would
      have to do the same.
      
      So the question is why to account something that needs to be basically
      excluded for each reasonable usage.  This doesn't make much sense to me.
      
      It seems that this has been broken since hugetlb was introduced but I
      haven't checked the whole history.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4165b9b4
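
The numbers quoted in the commit message above can be checked with a few lines of arithmetic. The 2 MB huge page size is not stated in the message; it is inferred here from "8192 huge pages ... ~16G", and the function names are hypothetical.

```c
#include <assert.h>

#define BASE_PAGE_KB 4UL      /* 4k base page, as stated in the commit */
#define HUGE_PAGE_KB 2048UL   /* 2 MB huge page, inferred from the ~16G figure */

/* Huge pages handed out between the two /proc/meminfo samples. */
unsigned long pages_allocated(unsigned long free_before, unsigned long free_after)
{
    assert(free_before >= free_after);
    return free_before - free_after;
}

/* What NR_FILE_PAGES grew by: each huge page counted as one base page. */
unsigned long accounted_kb(unsigned long huge_pages)
{
    return huge_pages * BASE_PAGE_KB;
}

/* What the same pages really occupy. */
unsigned long real_kb(unsigned long huge_pages)
{
    return huge_pages * HUGE_PAGE_KB;
}
```

With HugePages_Free dropping from 8348 to 156, that is 8192 pages: 32768 kB (32M) accounted against 16777216 kB (16G) actually used. The observed Cached: delta of 32544 kB (8916048 - 8883504) is close to the 32M figure.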
  6. 24 Jun, 2015 1 commit
  7. 02 Jun, 2015 3 commits
    • T
      writeback: implement unlocked_inode_to_wb transaction and use it for stat updates · 682aa8e1
      Tejun Heo committed
      The mechanism for detecting whether an inode should switch its wb
      (bdi_writeback) association is now in place.  This patch builds the
      framework for the actual switching.
      
      This patch adds a new inode flag I_WB_SWITCHING, which has two
      functions.  First, the easy one, it ensures that there's only one
      switching in progress for a given inode.  Second, it's used as a
      mechanism to synchronize wb stat updates.
      
      The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
      but track the current number of dirty pages and pages under writeback
      respectively.  As such, when an inode is moved from one wb to another,
      the inode's portion of those stats have to be transferred together;
      unfortunately, this is a bit tricky as those stat updates are percpu
      operations which are performed without holding any lock in some
      places.
      
      This patch solves the problem in a similar way as memcg.  Each such
      lockless stat update is wrapped in a transaction surrounded by
      unlocked_inode_to_wb_begin/end().  During normal operation, they map
      to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
      mapping->tree_lock is grabbed across the transaction.
      
      In turn, the switching path sets I_WB_SWITCHING and waits for a RCU
      grace period to pass before actually starting to switch, which
      guarantees that all stat update paths are synchronizing against
      mapping->tree_lock.
      
      This patch still doesn't implement the actual switching.
      
      v3: Updated on top of the recent cancel_dirty_page() updates.
          unlocked_inode_to_wb_begin() now nests inside
          mem_cgroup_begin_page_stat() to match the locking order.
      
      v2: The i_wb access transaction will be used for !stat accesses too.
          Function names and comments updated accordingly.
      
          s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
          s/switch_wb/switch_wbs/
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      682aa8e1
    • T
      writeback: make writeback_control track the inode being written back · b16b1deb
      Tejun Heo committed
      Currently, for cgroup writeback, the IO submission paths directly
      associate the bio's with the blkcg from inode_to_wb_blkcg_css();
      however, it'd be necessary to keep more writeback context to implement
      foreign inode writeback detection.  wbc (writeback_control) is the
      natural fit for the extra context - it persists throughout the
      writeback of each inode and is passed all the way down to IO
      submission paths.
      
      This patch adds wbc_attach_and_unlock_inode(), wbc_detach_inode(), and
      wbc_attach_fdatawrite_inode() which are used to associate wbc with the
      inode being written back.  IO submission paths now use wbc_init_bio()
      instead of directly associating bio's with blkcg themselves.  This
      leaves inode_to_wb_blkcg_css() w/o any user.  The function is removed.
      
      wbc currently only tracks the associated wb (bdi_writeback).  Future
      patches will add more for foreign inode detection.  The association is
      established under i_lock which will be depended upon when migrating
      foreign inodes to other wb's.
      
      As currently, once established, inode to wb association never changes,
      going through wbc when initializing bio's doesn't cause any behavior
      changes.
      
      v2: submit_blk_blkcg() now checks whether the wbc is associated with a
          wb before dereferencing it.  This can happen when pageout() is
          writing pages directly without going through the usual writeback
          path.  As pageout() path is single-threaded, we don't want it to
          be blocked behind a slow cgroup and ultimately want it to delegate
          actual writing to the usual writeback path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      b16b1deb
    • G
      memcg: add per cgroup dirty page accounting · c4843a75
      Greg Thelen committed
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg  task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat() which returns the memcg later
      needed by mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: Sha Zhengju <handai.szj@gmail.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c4843a75
  8. 15 Apr, 2015 1 commit
    • K
      page_writeback: clean up mess around cancel_dirty_page() · b9ea2515
      Konstantin Khlebnikov committed
      This patch replaces cancel_dirty_page() with a helper function
      account_page_cleaned() which only updates counters.  It's called from
      truncate_complete_page() and from try_to_free_buffers() (hack for ext3).
      Page is locked in both cases, page-lock protects against concurrent
      dirtiers: see commit 2d6d7f98 ("mm: protect set_page_dirty() from
      ongoing truncation").
      
      Delete_from_page_cache() shouldn't be called for dirty pages, they must
      be handled by caller (either written or truncated).  This patch treats
      final dirty accounting fixup at the end of __delete_from_page_cache() as
      a debug check and adds WARN_ON_ONCE() around it.  If something removes
      dirty pages without proper handling that might be a bug and unwritten
      data might be lost.
      
      Hugetlbfs has no dirty pages accounting, ClearPageDirty() is enough
      here.
      
      cancel_dirty_page() in nfs_wb_page_cancel() is redundant.  This is
      helper for nfs_invalidate_page() and it's called only in case complete
      invalidation.
      
      The mess was started in v2.6.20 after commits 46d2277c ("Clean up
      and make try_to_free_buffers() not race with dirty pages") and
      3e67c098 ("truncate: clear page dirtiness before running
      try_to_free_buffers()") first was reverted right in v2.6.20 in commit
      ecdfc978 ("Resurrect 'try_to_free_buffers()' VM hackery"), second in
      v2.6.25 commit a2b34564 ("Fix dirty page accounting leak with ext3
      data=journal").
      
      Custom fixes were introduced between these points.  NFS in v2.6.23, commit
      1b3b4a1a ("NFS: Fix a write request leak in nfs_invalidate_page()").
      Kludge in __delete_from_page_cache() in v2.6.24, commit 3a692790 ("Do
      dirty page accounting when removing a page from the page cache").  Since
      v2.6.25 all of them are redundant.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9ea2515
  9. 12 Apr, 2015 6 commits
  10. 26 Mar, 2015 1 commit
  11. 17 Feb, 2015 2 commits
  12. 11 Feb, 2015 1 commit
  13. 21 Jan, 2015 1 commit
  14. 30 Dec, 2014 1 commit
    • M
      mm: get rid of radix tree gfp mask for pagecache_get_page · 45f87de5
      Michal Hocko committed
      Commit 2457aec6 ("mm: non-atomically mark page accessed during page
      cache allocation where possible") has added a separate parameter for
      specifying gfp mask for radix tree allocations.
      
      Not only is this less than optimal from the API point of view because
      it is error prone, it is also currently buggy:
      grab_cache_page_write_begin uses GFP_KERNEL for the radix tree, so if
      fgp_flags doesn't contain FGP_NOFS (mostly controlled by the fs via the
      AOP_FLAG_NOFS flag) but the mapping_gfp_mask has __GFP_FS cleared, the
      radix tree allocation won't obey the restriction and might recurse
      into the filesystem and cause deadlocks.  Unfortunately this is the
      case for most filesystems, because only ext4 and gfs2 use
      AOP_FLAG_NOFS.
      
      Let's simply remove the radix_gfp_mask parameter because the allocation
      context is the same for both the page cache and the radix tree.  Just
      make sure that the radix tree gets only the sane subset of the mask
      (e.g.  do not pass __GFP_WRITE).
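
      A minimal userspace sketch of the "sane subset" idea, assuming made-up
      flag values: the TOY_GFP_* constants and toy_radix_gfp() are
      hypothetical and only illustrate deriving the radix tree mask from the
      page cache mask, not the kernel's actual definitions or the exact mask
      the patch applies.

```c
#include <assert.h>

/* Hypothetical flag bits; the real __GFP_* values differ. */
#define TOY_GFP_IO    0x1u  /* allocation may start I/O */
#define TOY_GFP_FS    0x2u  /* allocation may recurse into the filesystem */
#define TOY_GFP_WRITE 0x4u  /* hint: the page is about to be dirtied */

/*
 * Derive the mask for the radix tree node from the mask used for the
 * page itself.  Context restrictions (a cleared TOY_GFP_FS, for
 * example) carry over unchanged; the page-only hint is dropped.
 */
static unsigned int toy_radix_gfp(unsigned int page_gfp)
{
    return page_gfp & ~TOY_GFP_WRITE;
}
```

      With a single mask as input, a mapping that clears the FS bit can no
      longer end up with a radix tree allocation that has it set.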
      
      Long term it is preferable to convert the remaining users of
      AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this
      interface even further.
      Reported-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      45f87de5
  15. 17 Dec, 2014 1 commit
  16. 14 Dec, 2014 1 commit
  17. 10 Oct, 2014 1 commit
  18. 25 Sep, 2014 2 commits
  19. 09 Sep, 2014 1 commit
  20. 12 Aug, 2014 1 commit
  21. 09 Aug, 2014 2 commits
    • J
      mm: memcontrol: rewrite uncharge API · 0a31bc97
      Johannes Weiner committed
      The memcg uncharging code that is involved towards the end of a page's
      lifetime - truncation, reclaim, swapout, migration - is impressively
      complicated and fragile.
      
      Because anonymous and file pages were always charged before they had their
      page->mapping established, uncharges had to happen when the page type
      could still be known from the context; as in unmap for anonymous, page
      cache removal for file and shmem pages, and swap cache truncation for swap
      pages.  However, these operations happen well before the page is actually
      freed, and so a lot of synchronization is necessary:
      
      - Charging, uncharging, page migration, and charge migration all need
        to take a per-page bit spinlock as they could race with uncharging.
      
      - Swap cache truncation happens during both swap-in and swap-out, and
        possibly repeatedly before the page is actually freed.  This means
        that the memcg swapout code is called from many contexts that make
        no sense and it has to figure out the direction from page state to
        make sure memory and memory+swap are always correctly charged.
      
      - On page migration, the old page might be unmapped but then reused,
        so memcg code has to prevent untimely uncharging in that case.
        Because this code - which should be a simple charge transfer - is so
        special-cased, it is not reusable for replace_page_cache().
      
      But now that charged pages always have a page->mapping, introduce
      mem_cgroup_uncharge(), which is called after the final put_page(), when we
      know for sure that nobody is looking at the page anymore.
      
      For page migration, introduce mem_cgroup_migrate(), which is called after
      the migration is successful and the new page is fully rmapped.  Because
      the old page is no longer uncharged after migration, prevent double
      charges by decoupling the page's memcg association (PCG_USED and
      pc->mem_cgroup) from the page holding an actual charge.  The new bits
      PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
      to the new page during migration.
      
      mem_cgroup_migrate() is suitable for replace_page_cache() as well,
      which gets rid of mem_cgroup_replace_page_cache().  However, care
      needs to be taken because both the source and the target page can
      already be charged and on the LRU when fuse is splicing: grab the page
      lock on the charge moving side to prevent changing pc->mem_cgroup of a
      page under migration.  Also, the lruvecs of both pages change as we
      uncharge the old and charge the new during migration, and putback may
      race with us, so grab the lru lock and isolate the pages iff on LRU to
      prevent races and ensure the pages are on the right lruvec afterward.
      
      Swap accounting is massively simplified: because the page is no longer
      uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
      transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
      before the final put_page() in page reclaim.
      
      Finally, page_cgroup changes are now protected by whatever protection the
      page itself offers: anonymous pages are charged under the page table lock,
      whereas page cache insertions, swapin, and migration hold the page lock.
      Uncharging happens under full exclusion with no outstanding references.
      Charging and uncharging also ensure that the page is off-LRU, which
      serializes against charge migration.  Remove the very costly page_cgroup
      lock and set pc->flags non-atomically.
      
      [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
      [vdavydov@parallels.com: fix flags definition]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Tested-by: Jet Chen <jet.chen@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Tested-by: Felipe Balbi <balbi@ti.com>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a31bc97
    • J
      mm: memcontrol: rewrite charge API · 00501b53
      Johannes Weiner committed
      These patches rework memcg charge lifetime to integrate more naturally
      with the lifetime of user pages.  This drastically simplifies the code and
      reduces charging and uncharging overhead.  The most expensive part of
      charging and uncharging is the page_cgroup bit spinlock, which is removed
      entirely after this series.
      
      Here are the top-10 profile entries of a stress test that reads a 128G
      sparse file on a freshly booted box, without even a dedicated cgroup
      (i.e. executing in the root memcg).  Before:
      
          15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.31%              cat  [kernel.kallsyms]   [k] memset
          11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
           4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.38%              cat  [kernel.kallsyms]   [k] put_page
           2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
           2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
           1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
      
      After:
      
          15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.48%           cat  [kernel.kallsyms]   [k] memset
          11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
           3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.46%           cat  [kernel.kallsyms]   [k] put_page
           2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
           1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
           1.30%           cat  [kernel.kallsyms]   [k] kfree
      
      As you can see, the memcg footprint has shrunk quite a bit.
      
         text    data     bss     dec     hex filename
        37970    9892     400   48262    bc86 mm/memcontrol.o.old
        35239    9892     400   45531    b1db mm/memcontrol.o
      
      This patch (of 4):
      
      The memcg charge API charges pages before they are rmapped - i.e.  have an
      actual "type" - and so every callsite needs its own set of charge and
      uncharge functions to know what type is being operated on.  Worse,
      uncharge has to happen from a context that is still type-specific, rather
      than at the end of the page's lifetime with exclusive access, and so
      requires a lot of synchronization.
      
      Rewrite the charge API to provide a generic set of try_charge(),
      commit_charge() and cancel_charge() transaction operations, much like
      what's currently done for swap-in:
      
        mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
        pages from the memcg if necessary.
      
        mem_cgroup_commit_charge() commits the page to the charge once it
        has a valid page->mapping and PageAnon() reliably tells the type.
      
        mem_cgroup_cancel_charge() aborts the transaction.
      
      This reduces the charge API and enables subsequent patches to
      drastically simplify uncharging.
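
      The three-step transaction can be sketched in userspace as follows.
      This is an illustration under stated assumptions, not the kernel
      implementation: struct toy_memcg, struct toy_page and every toy_*
      function are hypothetical stand-ins, reclaim is omitted, and each
      charge covers exactly one page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the memcg and the page. */
struct toy_memcg {
    long usage;  /* pages currently charged */
    long limit;  /* maximum pages allowed */
};

struct toy_page {
    struct toy_memcg *memcg;  /* owner, set only once the charge is committed */
};

/* Step 1: reserve one page of charge.  Real code would try reclaim first. */
static int toy_try_charge(struct toy_memcg *memcg)
{
    if (memcg->usage + 1 > memcg->limit)
        return -1;
    memcg->usage++;
    return 0;
}

/* Step 2a: bind the reserved charge to the page once its type is known. */
static void toy_commit_charge(struct toy_page *page, struct toy_memcg *memcg)
{
    page->memcg = memcg;
}

/* Step 2b: abort the transaction and hand the reservation back. */
static void toy_cancel_charge(struct toy_memcg *memcg)
{
    memcg->usage--;
}

/*
 * Example caller.  setup_ok models whether the page could be mapped
 * after the reservation was taken; on failure the charge is cancelled.
 */
static int toy_fault_in(struct toy_page *page, struct toy_memcg *memcg,
                        bool setup_ok)
{
    if (toy_try_charge(memcg))
        return -1;
    if (!setup_ok) {
        toy_cancel_charge(memcg);
        return -2;
    }
    toy_commit_charge(page, memcg);
    return 0;
}
```

      Every caller follows the same try, then commit-or-cancel shape, which is
      what lets one set of functions serve all page types.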
      
      As pages need to be committed after rmap is established but before they
      are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
      additions again.  Revive lru_cache_add_active_or_unevictable().
      
      [hughd@google.com: fix shmem_unuse]
      [hughd@google.com: Add comments on the private use of -EAGAIN]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00501b53
  22. 07 Aug, 2014 2 commits
  23. 31 Jul, 2014 1 commit
  24. 16 Jul, 2014 1 commit
    • N
      sched: Remove proliferation of wait_on_bit() action functions · 74316201
      NeilBrown committed
      The current "wait_on_bit" interface requires an 'action'
      function to be provided which does the actual waiting.
      There are over 20 such functions, many of them identical.
      Most cases can be satisfied by one of just two functions, one
      which uses io_schedule() and one which just uses schedule().
      
      So:
       Rename wait_on_bit and        wait_on_bit_lock to
              wait_on_bit_action and wait_on_bit_lock_action
       to make it explicit that they need an action function.
      
       Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
       which are *not* given an action function but implicitly use
       a standard one.
       The decision to error-out if a signal is pending is now made
       based on the 'mode' argument rather than being encoded in the action
       function.
      
       All instances of the old wait_on_bit and wait_on_bit_lock which
       can use the new version have been changed accordingly and their
       action functions have been discarded.
       wait_on_bit{,_lock} does not return any specific error code in the
       event of a signal so the caller must check for non-zero and
       interpolate their own error code as appropriate.
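
      The interface change can be modelled in userspace roughly as below.
      This is a sketch under assumptions, not the scheduler code: all toy_*
      names, struct toy_waiter and TOY_EINTR are hypothetical, sleeping is
      simulated by a countdown, and the caller must make sure the bit will
      eventually clear when waiting uninterruptibly.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_EINTR 4  /* hypothetical error code chosen by the caller */

enum toy_mode { TOY_UNINTERRUPTIBLE, TOY_INTERRUPTIBLE };

struct toy_waiter {
    unsigned long *word;      /* word containing the bit to wait on */
    int bit;                  /* bit number inside *word */
    int sleeps_until_clear;   /* simulation: bit clears after this many sleeps */
    bool signal_pending;      /* simulation: a signal is waiting for the task */
};

/*
 * The single standard action.  Whether a pending signal aborts the
 * wait is decided by the mode argument, not by which action function
 * was passed in.
 */
static int toy_bit_wait(struct toy_waiter *w, enum toy_mode mode)
{
    if (mode == TOY_INTERRUPTIBLE && w->signal_pending)
        return 1;
    /* "Sleep" once; the simulated owner clears the bit when the count hits 0. */
    if (w->sleeps_until_clear > 0 && --w->sleeps_until_clear == 0)
        *w->word &= ~(1UL << w->bit);
    return 0;
}

/* No action function parameter: the standard one is used implicitly. */
static int toy_wait_on_bit(struct toy_waiter *w, enum toy_mode mode)
{
    while (*w->word & (1UL << w->bit)) {
        int ret = toy_bit_wait(w, mode);
        if (ret)
            return ret;  /* non-zero, but not a specific error code */
    }
    return 0;
}

/* The caller checks for non-zero and substitutes its own error code. */
static int toy_caller(struct toy_waiter *w, enum toy_mode mode)
{
    return toy_wait_on_bit(w, mode) ? -TOY_EINTR : 0;
}
```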
      
      The wait_on_bit() call in __fscache_wait_on_invalidate() was
      ambiguous as it specified TASK_UNINTERRUPTIBLE but used
      fscache_wait_bit_interruptible as an action function.
      David Howells confirms this should be uniformly
      "uninterruptible".
      
      The main remaining user of wait_on_bit{,_lock}_action is NFS
      which needs to use a freezer-aware schedule() call.
      
      A comment in fs/gfs2/glock.c notes that having multiple 'action'
      functions is useful as they display differently in the 'wchan'
      field of 'ps' (and /proc/$PID/wchan).
      As the new bit_wait{,_io} functions are tagged "__sched", they
      will not show up at all, but something higher in the stack.  So
      the distinction will still be visible, only with different
      function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
      gfs2/glock.c case).
      
      Since the first version of this patch (against 3.15) two new action
      functions appeared, one in NFS and one in CIFS.  CIFS also now
      uses an action function that makes the same freezer-aware
      schedule call as NFS.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
      Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steve French <sfrench@samba.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      74316201
  25. 05 Jun, 2014 1 commit
    • M
      mm: avoid unnecessary atomic operations during end_page_writeback() · 888cf2db
      Mel Gorman committed
      If a page is marked for immediate reclaim then it is moved to the tail of
      the LRU list.  This occurs when the system is under enough memory pressure
      for pages under writeback to reach the end of the LRU but we test for this
      using atomic operations on every writeback.  This patch uses an optimistic
      non-atomic test first.  It'll miss some pages in rare cases but the
      consequences are not severe enough to warrant such a penalty.
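
      The pattern of a cheap test in front of an atomic read-modify-write can
      be sketched with C11 atomics as below.  This is an assumption-laden
      analogue rather than the kernel code: TOY_PG_RECLAIM and
      toy_test_clear_reclaim() are hypothetical names, and a relaxed load
      stands in for the kernel's plain non-atomic flag test.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_PG_RECLAIM 0x1UL  /* hypothetical "immediate reclaim" flag bit */

/*
 * Return true if the reclaim bit was set and this call cleared it.
 * The common case (bit not set) costs only a relaxed load; the
 * expensive atomic read-modify-write runs only when the bit looked
 * set.  A bit set concurrently just after the load is missed, which
 * is the rare case the commit message accepts.
 */
static bool toy_test_clear_reclaim(atomic_ulong *flags)
{
    unsigned long old;

    if (!(atomic_load_explicit(flags, memory_order_relaxed) & TOY_PG_RECLAIM))
        return false;

    old = atomic_fetch_and(flags, ~TOY_PG_RECLAIM);
    return (old & TOY_PG_RECLAIM) != 0;
}
```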
      
      While the function does not dominate profiles during a simple dd test the
      cost of it is reduced.
      
      73048     0.7428  vmlinux-3.15.0-rc5-mmotm-20140513 end_page_writeback
      23740     0.2409  vmlinux-3.15.0-rc5-lessatomic     end_page_writeback
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      888cf2db