1. 07 Nov 2015, 3 commits
  2. 06 Nov 2015, 3 commits
    • mm: introduce VM_LOCKONFAULT · de60f5f1
      Eric B Munson committed
      The cost of faulting in all memory to be locked can be very high when
      working with large mappings.  If only portions of the mapping will be
      used, this can incur a high penalty for locking.

      One example is a large file backing a statistical language model (this
      probably applies to other statistical or graphical models as well).  A
      security-oriented example is any application transacting in data that
      cannot be swapped out (credit card data, medical records, etc.).
      
      This patch introduces the ability to request that pages are not
      pre-faulted, but are placed on the unevictable LRU when they are finally
      faulted in.  The VM_LOCKONFAULT flag will be used together with VM_LOCKED
      and has no effect when set without VM_LOCKED.  Setting the VM_LOCKONFAULT
      flag for a VMA will cause pages faulted into that VMA to be added to the
      unevictable LRU when they are faulted or if they are already present, but
      will not cause any missing pages to be faulted in.
      
      Exposing this new lock state means that we cannot overload the meaning of
      the FOLL_POPULATE flag any longer.  Prior to this patch it was used to
      mean that the VMA for a fault was locked.  This means we need the new
      FOLL_MLOCK flag to communicate the locked state of a VMA.  FOLL_POPULATE
      will now only control whether the VMA should be populated, and in the
      case of VM_LOCKONFAULT it will not be set.
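
      For illustration, a minimal userspace sketch of the lock-on-fault usage
      this enables (it assumes the mlock2() syscall and MLOCK_ONFAULT flag
      added elsewhere in this series; the file path is only an example):

          #include <fcntl.h>
          #include <stdio.h>
          #include <sys/mman.h>
          #include <sys/stat.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          #ifndef MLOCK_ONFAULT
          #define MLOCK_ONFAULT 0x01      /* value from uapi mman-common.h */
          #endif

          int main(void)
          {
                  struct stat st;
                  int fd = open("/path/to/large-model.bin", O_RDONLY);  /* example file */

                  if (fd < 0 || fstat(fd, &st) < 0)
                          return 1;

                  void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
                  if (p == MAP_FAILED)
                          return 1;

          #ifdef SYS_mlock2
                  /*
                   * Nothing is pre-faulted here; pages reach the unevictable
                   * LRU only as they are actually touched (VM_LOCKED |
                   * VM_LOCKONFAULT behind the scenes).
                   */
                  if (syscall(SYS_mlock2, p, (size_t)st.st_size, MLOCK_ONFAULT) != 0)
                          perror("mlock2");
          #else
                  fprintf(stderr, "mlock2 is not known to these headers\n");
          #endif

                  /* ... touch only the parts of the mapping that are needed ... */
                  munmap(p, st.st_size);
                  close(fd);
                  return 0;
          }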
      Signed-off-by: Eric B Munson <emunson@akamai.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de60f5f1
    • mm: do not inc NR_PAGETABLE if ptlock_init failed · 706874e9
      Vladimir Davydov committed
      If ALLOC_SPLIT_PTLOCKS is defined, ptlock_init may fail, in which case we
      shouldn't increment NR_PAGETABLE.
      
      Since small allocations, such as ptlock, normally do not fail (currently
      they can fail if kmemcg is used, though), this patch does not really fix
      anything and should be considered a code cleanup.
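
      A minimal sketch of the corrected ordering (assuming the 4.3-era
      pgtable_page_ctor()/ptlock_init() helpers from include/linux/mm.h; not a
      verbatim copy of the patch):

          static inline bool pgtable_page_ctor(struct page *page)
          {
                  if (!ptlock_init(page))
                          return false;   /* no NR_PAGETABLE bump on failure */
                  inc_zone_page_state(page, NR_PAGETABLE);
                  return true;
          }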
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      706874e9
    • mm: use only per-device readahead limit · 600e19af
      Roman Gushchin committed
      Maximum readahead size is currently limited by two values:
       1) a global 2MB constant (MAX_READAHEAD in max_sane_readahead())
       2) a configurable per-device value (bdi->ra_pages)

      Some devices require a custom readahead limit.
      For instance, for RAID arrays it is calculated as the number of devices
      multiplied by the chunk size, times 2.

      Readahead size can never be larger than bdi->ra_pages * 2
      (POSIX_FADV_SEQUENTIAL doubles the readahead size).

      If so, why do we need two limits?
      I suggest completely removing the max_sane_readahead() logic and using
      the per-device readahead limit everywhere.
      
      Also, using the right readahead size for RAID disks can significantly
      increase I/O performance:
      
      before:
        dd if=/dev/md2 of=/dev/null bs=100M count=100
        100+0 records in
        100+0 records out
        10485760000 bytes (10 GB) copied, 12.9741 s, 808 MB/s
      
      after:
        $ dd if=/dev/md2 of=/dev/null bs=100M count=100
        100+0 records in
        100+0 records out
        10485760000 bytes (10 GB) copied, 8.91317 s, 1.2 GB/s
      
      (This is an 8-disk RAID5 array.)
      
      This patch doesn't change sys_readahead and madvise(MADV_WILLNEED)
      behavior introduced by 6d2be915 ("mm/readahead.c: fix readahead
      failure for memoryless NUMA nodes and limit readahead pages").
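
      For illustration, a hedged userspace sketch that relies only on the
      per-device limit (BLKRASET is the ioctl behind "blockdev --setra";
      /dev/md2, the 8-disk/512 KiB-chunk numbers and the root requirement are
      assumptions, not part of this patch):

          #include <fcntl.h>
          #include <linux/fs.h>
          #include <stdio.h>
          #include <sys/ioctl.h>
          #include <unistd.h>

          int main(void)
          {
                  int fd = open("/dev/md2", O_RDONLY);    /* example RAID device */
                  if (fd < 0)
                          return 1;

                  /* devices * chunk size * 2, in 512-byte sectors (8 disks, 512 KiB chunks) */
                  unsigned long ra_sectors = 8UL * (512 * 1024 / 512) * 2;

                  if (ioctl(fd, BLKRASET, ra_sectors) != 0)       /* needs CAP_SYS_ADMIN */
                          perror("BLKRASET");

                  /* sequential hint: the kernel may read ahead up to 2 * bdi->ra_pages */
                  posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

                  close(fd);
                  return 0;
          }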
      Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      600e19af
  3. 02 Oct 2015, 1 commit
    • memcg: fix dirty page migration · 0610c25d
      Greg Thelen committed
      The problem starts with a file-backed dirty page which is charged to a
      memcg.  Then page migration is used to move oldpage to newpage.
      
      Migration:
       - copies the oldpage's data to newpage
       - clears oldpage.PG_dirty
       - sets newpage.PG_dirty
       - uncharges oldpage from memcg
       - charges newpage to memcg
      
      Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
      count.
      
      However, because newpage is not yet charged, setting newpage.PG_dirty
      does not increment the memcg's dirty page count.  After migration
      completes newpage.PG_dirty is eventually cleared, often in
      account_page_cleaned().  At this time newpage is charged to a memcg, so
      the memcg's dirty page count is decremented, which causes an underflow
      because the count was not previously incremented by migration.  This
      underflow causes balance_dirty_pages() to see a very large unsigned
      number of dirty memcg pages, which leads to aggressive throttling of
      buffered writes by processes in non-root memcgs.
      
      This issue:
       - can harm performance of non-root memcg buffered writes.
       - can report too small (even negative) values in
         memory.stat[(total_)dirty] counters of all memcgs, including the root.
      
      To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
      page_memcg() and set_page_memcg() helpers.
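
      The helpers are roughly of the following shape (a sketch; the
      CONFIG_MEMCG=n stubs are what keep migrate.c free of #ifdefs):

          #ifdef CONFIG_MEMCG
          static inline struct mem_cgroup *page_memcg(struct page *page)
          {
                  return page->mem_cgroup;
          }

          static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
          {
                  page->mem_cgroup = memcg;
          }
          #else
          static inline struct mem_cgroup *page_memcg(struct page *page)
          {
                  return NULL;
          }

          static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
          {
          }
          #endif

      Migration can then move the charge pointer before the dirty bit is set
      on newpage, roughly set_page_memcg(newpage, page_memcg(oldpage)) followed
      by set_page_memcg(oldpage, NULL), so the dirty count lands in the right
      memcg.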
      
      Test:
          0) setup and enter limited memcg
          mkdir /sys/fs/cgroup/test
          echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
          echo $$ > /sys/fs/cgroup/test/cgroup.procs
      
          1) buffered writes baseline
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
          2) buffered writes with compaction antagonist to induce migration
          yes 1 > /proc/sys/vm/compact_memory &
          rm -rf /data/tmp/foo
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          kill %
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
          3) buffered writes without antagonist, should match baseline
          rm -rf /data/tmp/foo
          dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
          sync
          grep ^dirty /sys/fs/cgroup/test/memory.stat
      
                             (speed, dirty residue)
                   unpatched                       patched
          1) 841 MB/s 0 dirty pages          886 MB/s 0 dirty pages
          2) 611 MB/s -33427456 dirty pages  793 MB/s 0 dirty pages
          3) 114 MB/s -33427456 dirty pages  891 MB/s 0 dirty pages
      
          Notice that unpatched baseline performance (1) fell after
          migration (3): 841 -> 114 MB/s.  In the patched kernel, post
          migration performance matches baseline.
      
      Fixes: c4843a75 ("memcg: add per cgroup dirty page accounting")
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0610c25d
  4. 11 Sep 2015, 1 commit
  5. 09 Sep 2015, 5 commits
  6. 05 Sep 2015, 3 commits
  7. 22 Aug 2015, 1 commit
    • mm: make page pfmemalloc check more robust · 2f064f34
      Michal Hocko committed
      Commit c48a11c7 ("netvm: propagate page->pfmemalloc to skb") added
      checks for page->pfmemalloc to __skb_fill_page_desc():
      
              if (page->pfmemalloc && !page->mapping)
                      skb->pfmemalloc = true;
      
      It assumes page->mapping == NULL implies that page->pfmemalloc can be
      trusted.  However, __delete_from_page_cache() can set page->mapping to
      NULL and leave the page->index value alone.  Because the two share a
      union, a non-zero page->index will be interpreted as page->pfmemalloc
      being true.
      
      So the assumption is invalid if the networking code can see such a page.
      And it seems it can.  We have encountered this with an NFS-over-loopback
      setup when such a page is attached to a new skbuff.  There is no copying
      going on in this case, so the page confuses __skb_fill_page_desc(),
      which interprets the index as the pfmemalloc flag, and the network stack
      drops packets that have been allocated using the reserves unless they
      are to be queued on sockets handling the swapping.  That is what happens
      here, and it leads to hangs when the NFS client waits for a response
      from the server which has been dropped and thus never arrives.
      
      The struct page is already heavily packed, so rather than finding
      another hole to put the flag in, let's do a trick instead.  We can reuse
      the index again but define it to an impossible value (-1UL).  This is a
      page index, so it should never see a value that large.  Replace all
      direct users of page->pfmemalloc with page_is_pfmemalloc(), which will
      hide this nastiness from unspoiled eyes.
      
      The information will get lost if somebody wants to use page->index
      obviously but that was the case before and the original code expected
      that the information should be persisted somewhere else if that is
      really needed (e.g.  what SLAB and SLUB do).
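
      The replacement helpers look roughly like this (a sketch; page->index
      shares storage with the old flag through the union in struct page):

          static inline bool page_is_pfmemalloc(struct page *page)
          {
                  /* a real page index can never be this large */
                  return page->index == -1UL;
          }

          static inline void set_page_pfmemalloc(struct page *page)
          {
                  page->index = -1UL;
          }

          static inline void clear_page_pfmemalloc(struct page *page)
          {
                  page->index = 0;
          }

      Callers such as __skb_fill_page_desc() then test page_is_pfmemalloc(page)
      instead of poking at page->pfmemalloc directly.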
      
      [akpm@linux-foundation.org: fix blooper in slub]
      Fixes: c48a11c7 ("netvm: propagate page->pfmemalloc to skb")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Debugged-by: Vlastimil Babka <vbabka@suse.com>
      Debugged-by: Jiri Bohac <jbohac@suse.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>	[3.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2f064f34
  8. 17 Aug 2015, 1 commit
  9. 11 Aug 2015, 1 commit
    • mm: enhance region_is_ram() to region_intersects() · 124fe20d
      Dan Williams committed
      region_is_ram() is used to prevent the establishment of aliased mappings
      to physical "System RAM" with incompatible cache settings.  However, it
      uses "-1" to indicate both "unknown" memory ranges (ranges not described
      by platform firmware) and "mixed" ranges (where the parameters describe
      a range that partially overlaps "System RAM").
      
      Fix this up by explicitly tracking the "unknown" vs "mixed" resource
      cases and returning REGION_INTERSECTS, REGION_MIXED, or REGION_DISJOINT.
      This re-write also adds support for detecting when the requested region
      completely eclipses all of a resource.  Note, the implementation treats
      overlaps between "unknown" and the requested memory type as
      REGION_INTERSECTS.
      
      Finally, other memory types can be passed in by name; for now the only
      usage is "System RAM".
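
      A hedged usage sketch (it assumes the post-patch signature
      region_intersects(start, size, "System RAM"); the caller below is purely
      illustrative, and later kernels changed the arguments):

          static int can_remap_range(resource_size_t start, size_t size)
          {
                  switch (region_intersects(start, size, "System RAM")) {
                  case REGION_DISJOINT:
                          return 0;       /* no System RAM in the range: safe */
                  case REGION_INTERSECTS:
                  case REGION_MIXED:
                  default:
                          return -EBUSY;  /* fully or partially overlaps RAM */
                  }
          }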
      Suggested-by: Luis R. Rodriguez <mcgrof@suse.com>
      Reviewed-by: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      124fe20d
  10. 01 Jul 2015, 2 commits
  11. 25 Jun 2015, 4 commits
    • memory-failure: change type of action_result's param 3 to enum · cc3e2af4
      Xie XiuQi committed
      Change the type of action_result's third parameter to an enum for type
      consistency, and rename mf_outcome to mf_result for clarity.
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc3e2af4
    • memory-failure: export page_type and action result · cc637b17
      Xie XiuQi committed
      Export 'outcome' and 'action_page_type' to mm.h, so we can use these
      enums outside.
      
      This patch is preparation for adding trace events for memory-failure
      recovery action.
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc637b17
    • mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling · ead07f6a
      Naoya Horiguchi committed
      memory_failure() can run in two different modes (specified by
      MF_COUNT_INCREASED) from a page refcount perspective.  When
      MF_COUNT_INCREASED is set, memory_failure() assumes that the caller has
      taken a refcount on the target page.  If it is cleared, memory_failure()
      takes the refcount on its own.
      
      In the current code, however, refcounting is done differently in each
      caller.  For example, madvise_hwpoison() uses get_user_pages_fast() and
      hwpoison_inject() uses get_page_unless_zero().  This inconsistent
      refcounting causes refcount failures, especially for thp tail pages.
      Typical user-visible effects are memory leaks or
      VM_BUG_ON_PAGE(!page_count(page)) in isolate_lru_page().
      
      To fix this refcounting issue, this patch introduces get_hwpoison_page()
      to handle thp tail pages in the same manner for each caller of hwpoison
      code.
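
      A hedged sketch of the idea only (the real get_hwpoison_page() in
      mm/memory-failure.c handles more corner cases):

          static int get_hwpoison_page_sketch(struct page *page)
          {
                  struct page *head = compound_head(page);

                  if (PageHuge(head))
                          return get_page_unless_zero(head);

                  if (PageTransHuge(head)) {
                          /*
                           * A thp tail page can have a zero refcount of its
                           * own, so pin the head first and then take a
                           * reference on the tail, giving every hwpoison
                           * caller the same behaviour.
                           */
                          if (!get_page_unless_zero(head))
                                  return 0;
                          if (PageTail(page))
                                  get_page(page);
                          return 1;
                  }

                  return get_page_unless_zero(page);
          }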
      
      memory_failure() might fail to split a thp, and in such a case it
      returns without completing page isolation.  This is not good because
      PageHWPoison on the thp is still set and there's no easy way to unpoison
      such thps.  So this patch tries to roll back any action on the thp in
      the "non anonymous thp" and "thp split failed" cases, expecting that an
      MCE (SRAR) generated by a later access will properly free such thps.
      
      [akpm@linux-foundation.org: fix CONFIG_HWPOISON_INJECT=m]
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ead07f6a
    • mm: avoid tail page refcounting on non-THP compound pages · c761471b
      Kirill A. Shutemov committed
      Reintroduce 8d63d99a ("mm: avoid tail page refcounting on non-THP
      compound pages") after removing bogus VM_BUG_ON_PAGE() in
      put_unrefcounted_compound_page().
      
      THP uses tail page refcounting to be able to split huge pages at any
      time.  Tail page refcounting is not needed for other users of compound
      pages, and it's harmful because of the overhead.
      
      We try to exclude non-THP pages from tail page refcounting using the
      __compound_tail_refcounted() check.  It excludes the most common non-THP
      compound pages: SL*B and hugetlb, but it doesn't catch the rest of the
      __GFP_COMP users -- drivers.
      
      And it's not only about overhead.
      
      Drivers might want to use compound pages to get refcounting semantics
      suitable for mapping high-order pages to userspace.  But tail page
      refcounting breaks it.
      
      Tail page refcounting uses ->_mapcount in tail pages to store GUP pins on
      them.  It means GUP pins would affect page_mapcount() for tail pages.
      It's not a problem for THP, because it never maps tail pages.  But unlike
      THP, drivers map parts of compound pages with PTEs and it makes
      page_mapcount() be called for tail pages.
      
      In particular, GUP pins would shift PSS up and affect /proc/kpagecount for
      such pages.  But I'm not aware of anything which can lead to a crash or
      other serious misbehaviour.
      
      Since currently all THP pages are anonymous and all driver pages are not,
      we can fix the __compound_tail_refcounted() check by requiring PageAnon()
      to enable tail page refcounting.
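
      The tightened check ends up roughly as follows (a sketch; SL*B and
      hugetlb pages are never PageAnon, so the old exclusions are implied):

          static inline bool __compound_tail_refcounted(struct page *page)
          {
                  /* only anonymous (i.e. THP) compound pages need tail refcounting */
                  return PageAnon(page);
          }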
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c761471b
  12. 02 Jun 2015, 3 commits
    • writeback: implement unlocked_inode_to_wb transaction and use it for stat updates · 682aa8e1
      Tejun Heo committed
      The mechanism for detecting whether an inode should switch its wb
      (bdi_writeback) association is now in place.  This patch builds the
      framework for the actual switching.
      
      This patch adds a new inode flag, I_WB_SWITCHING, which has two
      functions.  First, the easy one: it ensures that there's only one switch
      in progress for a given inode.  Second, it's used as a mechanism to
      synchronize wb stat updates.
      
      The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
      but track the current number of dirty pages and pages under writeback,
      respectively.  As such, when an inode is moved from one wb to another,
      the inode's portion of those stats has to be transferred together;
      unfortunately, this is a bit tricky, as those stat updates are percpu
      operations which are performed without holding any lock in some places.
      
      This patch solves the problem in a way similar to memcg.  Each such
      lockless stat update is wrapped in a transaction surrounded by
      unlocked_inode_to_wb_begin/end().  During normal operation, they map
      to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
      mapping->tree_lock is grabbed across the transaction.
      
      In turn, the switching path sets I_WB_SWITCHING and waits for an RCU
      grace period to pass before actually starting to switch, which
      guarantees that all stat update paths are synchronizing against
      mapping->tree_lock.
      
      This patch still doesn't implement the actual switching.
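
      A hedged sketch of how a stat-update path uses the transaction (it
      assumes the bool-"locked" form of the helpers introduced here; the stat
      chosen is only an example):

          static void example_clear_reclaimable(struct inode *inode)
          {
                  struct bdi_writeback *wb;
                  bool locked;

                  wb = unlocked_inode_to_wb_begin(inode, &locked);
                  /*
                   * Inside the transaction inode->i_wb cannot change under us:
                   * either we hold only rcu_read_lock() (fast path) or, while
                   * a switch is pending, mapping->tree_lock as well.
                   */
                  dec_wb_stat(wb, WB_RECLAIMABLE);
                  unlocked_inode_to_wb_end(inode, locked);
          }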
      
      v3: Updated on top of the recent cancel_dirty_page() updates.
          unlocked_inode_to_wb_begin() now nests inside
          mem_cgroup_begin_page_stat() to match the locking order.
      
      v2: The i_wb access transaction will be used for !stat accesses too.
          Function names and comments updated accordingly.
      
          s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
          s/switch_wb/switch_wbs/
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      682aa8e1
    • memcg: add per cgroup dirty page accounting · c4843a75
      Greg Thelen committed
      When modifying PG_Dirty on cached file pages, update the new
      MEM_CGROUP_STAT_DIRTY counter.  This is done in the same places where
      global NR_FILE_DIRTY is managed.  The new memcg stat is visible in the
      per memcg memory.stat cgroupfs file.  The most recent past attempt at
      this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632
      
      The new accounting supports future efforts to add per cgroup dirty
      page throttling and writeback.  It also helps an administrator break
      down a container's memory usage and provides evidence to understand
      memcg oom kills (the new dirty count is included in memcg oom kill
      messages).
      
      The ability to move page accounting between memcg
      (memory.move_charge_at_immigrate) makes this accounting more
      complicated than the global counter.  The existing
      mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
      accounting with stat updates.
      Typical update operation:
      	memcg = mem_cgroup_begin_page_stat(page)
      	if (TestSetPageDirty()) {
      		[...]
      		mem_cgroup_update_page_stat(memcg)
      	}
      	mem_cgroup_end_page_stat(memcg)
      
      Summary of mem_cgroup_end_page_stat() overhead:
      - Without CONFIG_MEMCG it's a no-op
      - With CONFIG_MEMCG and no inter memcg task movement, it's just
        rcu_read_lock()
      - With CONFIG_MEMCG and inter memcg  task movement, it's
        rcu_read_lock() + spin_lock_irqsave()
      
      A memcg parameter is added to several routines because their callers
      now grab mem_cgroup_begin_page_stat(), which returns the memcg later
      needed for mem_cgroup_update_page_stat().
      
      Because mem_cgroup_begin_page_stat() may disable interrupts, some
      adjustments are needed:
      - move __mark_inode_dirty() from __set_page_dirty() to its caller.
        __mark_inode_dirty() locking does not want interrupts disabled.
      - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
        __delete_from_page_cache(), replace_page_cache_page(),
        invalidate_complete_page2(), and __remove_mapping().
      
         text    data     bss      dec    hex filename
      8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
      8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
                                  +192 text bytes
      8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
      8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
                                  +773 text bytes
      
      Performance tests run on v4.0-rc1-36-g4f671fe2.  Lower is better for
      all metrics, they're all wall clock or cycle counts.  The read and write
      fault benchmarks just measure fault time, they do not include I/O time.
      
      * CONFIG_MEMCG not set:
                                  baseline                              patched
        kbuild                 1m25.030000(+-0.088% 3 samples)       1m25.426667(+-0.120% 3 samples)
        dd write 100 MiB          0.859211561 +-15.10%                  0.874162885 +-15.03%
        dd write 200 MiB          1.670653105 +-17.87%                  1.669384764 +-11.99%
        dd write 1000 MiB         8.434691190 +-14.15%                  8.474733215 +-14.77%
        read fault cycles       254.0(+-0.000% 10 samples)            253.0(+-0.000% 10 samples)
        write fault cycles     2021.2(+-3.070% 10 samples)           1984.5(+-1.036% 10 samples)
      
      * CONFIG_MEMCG=y root_memcg:
                                  baseline                              patched
        kbuild                 1m25.716667(+-0.105% 3 samples)       1m25.686667(+-0.153% 3 samples)
        dd write 100 MiB          0.855650830 +-14.90%                  0.887557919 +-14.90%
        dd write 200 MiB          1.688322953 +-12.72%                  1.667682724 +-13.33%
        dd write 1000 MiB         8.418601605 +-14.30%                  8.673532299 +-15.00%
        read fault cycles       266.0(+-0.000% 10 samples)            266.0(+-0.000% 10 samples)
        write fault cycles     2051.7(+-1.349% 10 samples)           2049.6(+-1.686% 10 samples)
      
      * CONFIG_MEMCG=y non-root_memcg:
                                  baseline                              patched
        kbuild                 1m26.120000(+-0.273% 3 samples)       1m25.763333(+-0.127% 3 samples)
        dd write 100 MiB          0.861723964 +-15.25%                  0.818129350 +-14.82%
        dd write 200 MiB          1.669887569 +-13.30%                  1.698645885 +-13.27%
        dd write 1000 MiB         8.383191730 +-14.65%                  8.351742280 +-14.52%
        read fault cycles       265.7(+-0.172% 10 samples)            267.0(+-0.000% 10 samples)
        write fault cycles     2070.6(+-1.512% 10 samples)           2084.4(+-2.148% 10 samples)
      
      As expected anon page faults are not affected by this patch.
      
      tj: Updated to apply on top of the recent cancel_dirty_page() changes.
      Signed-off-by: Sha Zhengju <handai.szj@gmail.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      c4843a75
    • page_writeback: revive cancel_dirty_page() in a restricted form · 11f81bec
      Tejun Heo committed
      cancel_dirty_page() had some issues and b9ea2515 ("page_writeback:
      clean up mess around cancel_dirty_page()") replaced it with
      account_page_cleaned(), which makes the caller responsible for clearing
      the dirty bit; unfortunately, the planned changes for cgroup writeback
      support require synchronization between dirty bit manipulation and
      stat updates.  While we could open-code such synchronization in each
      account_page_cleaned() callsite, that would be unnecessarily awkward
      and verbose.
      
      This patch revives cancel_dirty_page() but in a more restricted form.
      All it does is TestClearPageDirty() followed by account_page_cleaned()
      invocation if the page was dirty.  This helper covers all
      account_page_cleaned() usages except for __delete_from_page_cache()
      which is a special case anyway and left alone.  As this leaves no
      module user for account_page_cleaned(), EXPORT_SYMBOL() is dropped
      from it.
      
      This patch just revives cancel_dirty_page() as a trivial wrapper to
      replace equivalent usages and doesn't introduce any functional
      changes.
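
      In its restricted form the helper is roughly (a sketch; the exact
      account_page_cleaned() argument list shifts over this series, the
      two-argument form is assumed here):

          static inline void cancel_dirty_page(struct page *page)
          {
                  /* clear first, account only if the page really was dirty */
                  if (TestClearPageDirty(page))
                          account_page_cleaned(page, page_mapping(page));
          }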
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      11f81bec
  13. 23 Apr 2015, 1 commit
  14. 16 Apr 2015, 5 commits
    • mm: new pfn_mkwrite same as page_mkwrite for VM_PFNMAP · dd906184
      Boaz Harrosh committed
      This will allow an FS that uses VM_PFNMAP | VM_MIXEDMAP (no page
      structs) to get notified when an access is a write to a read-only PFN.

      This can happen if we mmap() a file, then first mmap-read from it to
      page-in a read-only PFN, and then mmap-write to the same page.
      
      We need this functionality to fix a DAX bug, where in the scenario above
      we fail to set ctime/mtime though we modified the file.  An xfstest is
      attached to this patchset that shows the failure and the fix.  (A DAX
      patch will follow)
      
      This functionality is extra important for us, because upon dirtying of a
      pmem page we also want to RDMA the page to a remote cluster node.
      
      We define a new pfn_mkwrite and do not reuse page_mkwrite because
        1 - The name ;-)
        2 - But mainly because it would take a very long and tedious
            audit of all page_mkwrite functions of VM_MIXEDMAP/VM_PFNMAP
            users to make sure they do not now crash.  For example, the
            current DAX code (which this is for) would crash.
            If we wanted to reuse page_mkwrite, we would need to first
            patch all users so they do not crash when there is no page,
            and only then enable this patch.  But even if I did that I
            would not sleep so well at night.  Adding a new vector is the
            safest thing to do, and it is not that expensive: an extra
            pointer in a static function vector per driver.  The new
            vector is also better for performance, because otherwise we
            would call all current kernel vectors just to check they have
            no page, do nothing, and return.
      
      No need to call it from do_shared_fault because do_wp_page is called to
      change pte permissions anyway.
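
      A hedged driver-side sketch of wiring up the new vector (the handler
      name is made up; the hook signature is the one added by this patch):

          static int example_pfn_mkwrite(struct vm_area_struct *vma,
                                         struct vm_fault *vmf)
          {
                  /* first write to a read-only PFN: update ctime/mtime, mark for sync */
                  file_update_time(vma->vm_file);
                  return 0;       /* let the core wp path make the pte writable */
          }

          static const struct vm_operations_struct example_vm_ops = {
                  .pfn_mkwrite    = example_pfn_mkwrite,
          };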
      Signed-off-by: Yigal Korman <yigal@plexistor.com>
      Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd906184
    • mm: uninline and cleanup page-mapping related helpers · e39155ea
      Kirill A. Shutemov committed
      The most-used page->mapping helper -- page_mapping() -- has already been
      uninlined.  Let's also uninline page_rmapping() and page_anon_vma().
      Depending on configuration, this saves us around 400 bytes of text:
      
         text	   data	    bss	    dec	    hex	filename
       660318	  99254	 410000	1169572	 11d8a4	mm/built-in.o-before
       659854	  99254	 410000	1169108	 11d6d4	mm/built-in.o
      
      I also tried to make the code a bit cleaner.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e39155ea
    • include/linux/mm.h: simplify flag check · cdd7875e
      Borislav Petkov committed
      Flip the flag test so that it is the simplest.  No functional change, just
      a small readability improvement:
      
      No code changed:
      
        # arch/x86/kernel/sys_x86_64.o:
      
         text    data     bss     dec     hex filename
         1551      24       0    1575     627 sys_x86_64.o.before
         1551      24       0    1575     627 sys_x86_64.o.after
      
      md5:
         70708d1b1ad35cc891118a69dc1a63f9  sys_x86_64.o.before.asm
         70708d1b1ad35cc891118a69dc1a63f9  sys_x86_64.o.after.asm
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cdd7875e
    • mm: avoid tail page refcounting on non-THP compound pages · 8d63d99a
      Kirill A. Shutemov committed
      THP uses tail page refcounting to be able to split huge pages at any
      time.  Tail page refcounting is not needed for other users of compound
      pages, and it's harmful because of the overhead.
      
      We try to exclude non-THP pages from tail page refcounting using the
      __compound_tail_refcounted() check.  It excludes the most common non-THP
      compound pages: SL*B and hugetlb, but it doesn't catch the rest of the
      __GFP_COMP users -- drivers.
      
      And it's not only about overhead.
      
      Drivers might want to use compound pages to get refcounting semantics
      suitable for mapping high-order pages to userspace.  But tail page
      refcounting breaks it.
      
      Tail page refcounting uses ->_mapcount in tail pages to store GUP pins on
      them.  It means GUP pins would affect page_mapcount() for tail pages.
      It's not a problem for THP, because it never maps tail pages.  But unlike
      THP, drivers map parts of compound pages with PTEs and it makes
      page_mapcount() be called for tail pages.
      
      In particular, GUP pins would shift PSS up and affect /proc/kpagecount for
      such pages.  But I'm not aware of anything which can lead to a crash or
      other serious misbehaviour.
      
      Since currently all THP pages are anonymous and all driver pages are not,
      we can fix the __compound_tail_refcounted() check by requiring PageAnon()
      to enable tail page refcounting.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d63d99a
    • mm: consolidate all page-flags helpers in <linux/page-flags.h> · e8c6158f
      Kirill A. Shutemov committed
      Currently we take a naive approach to page flags on compound pages - we
      set the flag on the page without considering whether the flag makes
      sense for a tail page or for the compound page in general.  This
      patchset tries to sort this out by defining a per-flag policy on what
      needs to be done when a page-flag helper operates on a compound page.
      
      The last patch in the patchset also sanitizes usage of page->mapping for
      tail pages.  We don't define the meaning of page->mapping for tail
      pages.  Currently it's always NULL, which can be inconsistent with the
      head page and potentially lead to problems.
      
      For now I caught one case of illegal usage of page flags or ->mapping:
      the sound subsystem allocates pages with __GFP_COMP and maps them with
      PTEs.  It leads to setting the dirty bit on tail pages and accessing the
      tail page's ->mapping.  I don't see any bad behaviour caused by this,
      but it is worth fixing anyway.
      
      This patchset makes more sense if you take my THP refcounting into
      account: we will see more compound pages mapped with PTEs and we need to
      define behaviour of flags on compound pages to avoid bugs.
      
      This patch (of 16):
      
      We have page-flags helper function declarations/definitions spread over
      several header files.  Let's consolidate them in <linux/page-flags.h>.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e8c6158f
  15. 15 Apr 2015, 4 commits
  16. 17 Feb 2015, 1 commit
    • mm: allow page fault handlers to perform the COW · 2e4cdab0
      Matthew Wilcox committed
      Currently COW of an XIP file is done by first bringing in a read-only
      mapping, then retrying the fault and copying the page.  It is much more
      efficient to tell the fault handler that a COW is being attempted (by
      passing in the pre-allocated page in the vm_fault structure), and allow
      the handler to perform the COW operation itself.
      
      The handler cannot insert the page itself if there is already a read-only
      mapping at that address, so allow the handler to return VM_FAULT_LOCKED
      and set the fault_page to be NULL.  This indicates to the MM code that the
      i_mmap_lock is held instead of the page lock.
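
      A hedged sketch of a fault handler that performs the COW itself (loosely
      modelled on the DAX follow-up; fill_cow_page() is a hypothetical helper):

          /* hypothetical helper that reads the file data for @pgoff into @dst */
          static int fill_cow_page(struct page *dst, struct address_space *mapping,
                                   pgoff_t pgoff);

          static int example_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
          {
                  struct address_space *mapping = vma->vm_file->f_mapping;

                  if (vmf->cow_page) {
                          /* copy the file data straight into the pre-allocated page */
                          if (fill_cow_page(vmf->cow_page, mapping, vmf->pgoff))
                                  return VM_FAULT_SIGBUS;
                          /*
                           * No struct page backs the read-only mapping, so report
                           * VM_FAULT_LOCKED with vmf->page == NULL, meaning the
                           * i_mmap_lock is held rather than a page lock.
                           */
                          vmf->page = NULL;
                          i_mmap_lock_read(mapping);
                          return VM_FAULT_LOCKED;
                  }

                  /* ... normal read-fault path ... */
                  return VM_FAULT_SIGBUS;
          }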
      Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2e4cdab0
  17. 13 Feb 2015, 1 commit