1. 23 2月, 2017 1 次提交
    • R
      dax: add tracepoint infrastructure, PMD tracing · 282a8e03
      Ross Zwisler 提交于
      Tracepoints are the standard way to capture debugging and tracing
      information in many parts of the kernel, including the XFS and ext4
      filesystems.  Create a tracepoint header for FS DAX and add the first DAX
      tracepoints to the PMD fault handler.  This allows the tracing for DAX to
      be done in the same way as the filesystem tracing so that developers can
      look at them together and get a coherent idea of what the system is doing.
      
      I added both an entry and exit tracepoint because future patches will add
      tracepoints to child functions of dax_iomap_pmd_fault() like
      dax_pmd_load_hole() and dax_pmd_insert_mapping().  We want those messages
      to be wrapped by the parent function tracepoints so the code flow is more
      easily understood.  Having entry and exit tracepoints for faults also
      allows us to easily see what filesystems functions were called during the
      fault.  These filesystem functions get executed via iomap_begin() and
      iomap_end() calls, for example, and will have their own tracepoints.
      
      For PMD faults we primarily want to understand the type of mapping, the
      fault flags, the faulting address and whether it fell back to 4k faults.
      If it fell back to 4k faults the tracepoints should let us understand why.
      
      I named the new tracepoint header file "fs_dax.h" to allow for device DAX
      to have its own separate tracing header in the same directory at some
      point.
      
      Here is an example output for these events from a successful PMD fault:
      
        big-1441  [005] ....    32.582758: xfs_filemap_pmd_fault: dev 259:0 ino 0x1003
      
        big-1441  [005] ....    32.582776: dax_pmd_fault: dev 259:0 ino 0x1003
        shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400
      
        big-1441  [005] ....    32.583292: dax_pmd_fault_done: dev 259:0 ino 0x1003
        shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10505000 vm_start 0x10200000 vm_end 0x10700000 pgoff 0x200 max_pgoff 0x1400 NOPAGE
      
      Link: http://lkml.kernel.org/r/1484085142-2297-3-git-send-email-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      282a8e03
  2. 09 2月, 2017 1 次提交
  3. 04 2月, 2017 1 次提交
    • M
      fs: break out of iomap_file_buffered_write on fatal signals · d1908f52
      Michal Hocko 提交于
      Tetsuo has noticed that an OOM stress test which performs large write
      requests can cause the full memory reserves depletion.  He has tracked
      this down to the following path
      
      	__alloc_pages_nodemask+0x436/0x4d0
      	alloc_pages_current+0x97/0x1b0
      	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
      	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
      	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
      	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
      	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
      	? iomap_write_end+0x80/0x80             fs/iomap.c:150
      	iomap_apply+0xb3/0x130                  fs/iomap.c:79
      	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
      	? iomap_write_end+0x80/0x80
      	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
      	? remove_wait_queue+0x59/0x60
      	xfs_file_write_iter+0x90/0x130 [xfs]
      	__vfs_write+0xe5/0x140
      	vfs_write+0xc7/0x1f0
      	? syscall_trace_enter+0x1d0/0x380
      	SyS_write+0x58/0xc0
      	do_syscall_64+0x6c/0x200
      	entry_SYSCALL64_slow_path+0x25/0x25
      
      the oom victim has access to all memory reserves to make a forward
      progress to exit easier.  But iomap_file_buffered_write and other
      callers of iomap_apply loop to complete the full request.  We need to
      check for fatal signals and back off with a short write instead.
      
      As the iomap_apply delegates all the work down to the actor we have to
      hook into those.  All callers that work with the page cache are calling
      iomap_write_begin so we will check for signals there.  dax_iomap_actor
      has to handle the situation explicitly because it copies data to the
      userspace directly.  Other callers like iomap_page_mkwrite work on a
      single page or iomap_fiemap_actor do not allocate memory based on the
      given len.
      
      Fixes: 68a9f5e7 ("xfs: implement iomap based buffered write path")
      Link: http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d1908f52
  4. 25 1月, 2017 1 次提交
  5. 11 1月, 2017 1 次提交
    • R
      dax: wrprotect pmd_t in dax_mapping_entry_mkclean · f729c8c9
      Ross Zwisler 提交于
      Currently dax_mapping_entry_mkclean() fails to clean and write protect
      the pmd_t of a DAX PMD entry during an *sync operation.  This can result
      in data loss in the following sequence:
      
      1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the
         pmd_t dirty and writeable
      2) fsync, flushing out PMD data and cleaning the radix tree entry. We
         currently fail to mark the pmd_t as clean and write protected.
      3) more mmap writes to the PMD.  These don't cause any page faults since
         the pmd_t is dirty and writeable.  The radix tree entry remains clean.
      4) fsync, which fails to flush the dirty PMD data because the radix tree
         entry was clean.
      5) crash - dirty data that should have been fsync'd as part of 4) could
         still have been in the processor cache, and is lost.
      
      Fix this by marking the pmd_t clean and write protected in
      dax_mapping_entry_mkclean(), which is called as part of the fsync
      operation 2).  This will cause the writes in step 3) above to generate
      page faults where we'll re-dirty the PMD radix tree entry, resulting in
      flushes in the fsync that happens in step 4).
      
      Fixes: 4b4bb46d ("dax: clear dirty entry tags on cache flush")
      Link: http://lkml.kernel.org/r/1482272586-21177-3-git-send-email-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f729c8c9
  6. 27 12月, 2016 4 次提交
  7. 15 12月, 2016 5 次提交
  8. 13 12月, 2016 4 次提交
  9. 21 11月, 2016 1 次提交
  10. 10 11月, 2016 1 次提交
  11. 08 11月, 2016 12 次提交
  12. 08 10月, 2016 1 次提交
    • A
      thp: reduce usage of huge zero page's atomic counter · 6fcb52a5
      Aaron Lu 提交于
      The global zero page is used to satisfy an anonymous read fault.  If
      THP(Transparent HugePage) is enabled then the global huge zero page is
      used.  The global huge zero page uses an atomic counter for reference
      counting and is allocated/freed dynamically according to its counter
      value.
      
      CPU time spent on that counter will greatly increase if there are a lot
      of processes doing anonymous read faults.  This patch proposes a way to
      reduce the access to the global counter so that the CPU load can be
      reduced accordingly.
      
      To do this, a new flag of the mm_struct is introduced:
      MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only need to touch
      the global counter in two cases:
      
       1 The first time it uses the global huge zero page;
       2 The time when mm_user of its mm_struct reaches zero.
      
      Note that right now, the huge zero page is eligible to be freed as soon
      as its last use goes away.  With this patch, the page will not be
      eligible to be freed until the exit of the last process from which it
      was ever used.
      
      And with the use of mm_user, the kthread is not eligible to use huge
      zero page either.  Since no kthread is using huge zero page today, there
      is no difference after applying this patch.  But if that is not desired,
      I can change it to when mm_count reaches zero.
      
      Case used for test on Haswell EP:
      
        usemem -n 72 --readonly -j 0x200000 100G
      
      Which spawns 72 processes and each will mmap 100G anonymous space and
      then do read only access to that space sequentially with a step of 2MB.
      
        CPU cycles from perf report for base commit:
            54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
        CPU cycles from perf report for this commit:
             0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page
      
      Performance(throughput) of the workload for base commit: 1784430792
      Performance(throughput) of the workload for this commit: 4726928591
      164% increase.
      
      Runtime of the workload for base commit: 707592 us
      Runtime of the workload for this commit: 303970 us
      50% drop.
      
      Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.comSigned-off-by: NAaron Lu <aaron.lu@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6fcb52a5
  13. 19 9月, 2016 4 次提交
  14. 27 7月, 2016 1 次提交
    • R
      dax: remote unused fault wrappers · 6b524995
      Ross Zwisler 提交于
      Remove the unused wrappers dax_fault() and dax_pmd_fault().  After this
      removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
      dax_pmd_fault() respectively, and update all callers.
      
      The dax_fault() and dax_pmd_fault() wrappers were initially intended to
      capture some filesystem independent functionality around page faults
      (calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime
      and ctime).
      
      However, the following commits:
      
         5726b27b ("ext2: Add locking for DAX faults")
         ea3d7209 ("ext4: fix races between page faults and hole punching")
      
      added locking to the ext2 and ext4 filesystems after these common
      operations but before __dax_fault() and __dax_pmd_fault() were called.
      This means that these wrappers are no longer used, and are unlikely to
      be used in the future.
      
      XFS has had locking analogous to what was recently added to ext2 and
      ext4 since DAX support was initially introduced by:
      
         6b698ede ("xfs: add DAX file operations support")
      
      Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b524995
  15. 13 7月, 2016 2 次提交