1. 11 Jan, 2017 1 commit
    • dax: wrprotect pmd_t in dax_mapping_entry_mkclean · f729c8c9
      Ross Zwisler committed
      Currently dax_mapping_entry_mkclean() fails to clean and write protect
      the pmd_t of a DAX PMD entry during an *sync operation.  This can result
      in data loss in the following sequence:
      
      1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the
         pmd_t dirty and writeable
      2) fsync, flushing out PMD data and cleaning the radix tree entry. We
         currently fail to mark the pmd_t as clean and write protected.
      3) more mmap writes to the PMD.  These don't cause any page faults since
         the pmd_t is dirty and writeable.  The radix tree entry remains clean.
      4) fsync, which fails to flush the dirty PMD data because the radix tree
         entry was clean.
      5) crash - dirty data that should have been fsync'd as part of 4) could
         still have been in the processor cache, and is lost.
      
      Fix this by marking the pmd_t clean and write protected in
      dax_mapping_entry_mkclean(), which is called as part of the fsync
      operation 2).  This will cause the writes in step 3) above to generate
      page faults where we'll re-dirty the PMD radix tree entry, resulting in
      flushes in the fsync that happens in step 4).
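      
      The sequence above can be replayed in a small userspace model.  This is
      a sketch only: struct pmd_model and the model_* functions are
      hypothetical names invented for illustration, not kernel code, and the
      wrprotect parameter stands in for the behaviour this patch adds to
      dax_mapping_entry_mkclean().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model; none of these names are kernel APIs. */
struct pmd_model {
    bool entry_dirty;    /* radix tree entry tagged dirty */
    bool pmd_dirty;      /* dirty bit of the pmd_t */
    bool pmd_writable;   /* write permission of the pmd_t */
    bool data_unflushed; /* written data not yet flushed to media */
};

/* A store through the mapping. It faults only if the pmd_t is not writable. */
static void model_mmap_write(struct pmd_model *m)
{
    if (!m->pmd_writable) {
        /* Write fault: the handler re-dirties the radix tree entry. */
        m->entry_dirty = true;
        m->pmd_writable = true;
    }
    m->pmd_dirty = true;
    m->data_unflushed = true;
}

/* fsync flushes only entries tagged dirty; wrprotect selects the fixed path. */
static void model_fsync(struct pmd_model *m, bool wrprotect)
{
    if (!m->entry_dirty)
        return; /* nothing tagged dirty, so nothing is flushed */
    m->data_unflushed = false;
    m->entry_dirty = false;
    if (wrprotect) {
        /* The fix: also clean and write protect the pmd_t. */
        m->pmd_dirty = false;
        m->pmd_writable = false;
    }
}

/* Runs steps 1) to 4); returns true if a crash at step 5) would lose data. */
static bool model_loses_data(bool wrprotect)
{
    struct pmd_model m = { false, false, false, false };

    model_mmap_write(&m);       /* step 1 */
    model_fsync(&m, wrprotect); /* step 2 */
    model_mmap_write(&m);       /* step 3 */
    model_fsync(&m, wrprotect); /* step 4 */
    return m.data_unflushed;
}

/* Usage: the unfixed model loses data, the fixed one does not. */
static void model_self_check(void)
{
    assert(model_loses_data(false));
    assert(!model_loses_data(true));
}
```

      In the model, write protecting the pmd_t in step 2) is what forces the
      fault in step 3), which in turn re-tags the entry so that step 4)
      flushes.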
      
      Fixes: 4b4bb46d ("dax: clear dirty entry tags on cache flush")
      Link: http://lkml.kernel.org/r/1482272586-21177-3-git-send-email-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 27 Dec, 2016 4 commits
  3. 15 Dec, 2016 5 commits
  4. 13 Dec, 2016 4 commits
  5. 21 Nov, 2016 1 commit
  6. 10 Nov, 2016 1 commit
  7. 08 Nov, 2016 12 commits
  8. 08 Oct, 2016 1 commit
    • thp: reduce usage of huge zero page's atomic counter · 6fcb52a5
      Aaron Lu committed
      The global zero page is used to satisfy an anonymous read fault.  If
      THP(Transparent HugePage) is enabled then the global huge zero page is
      used.  The global huge zero page uses an atomic counter for reference
      counting and is allocated/freed dynamically according to its counter
      value.
      
      CPU time spent on that counter will greatly increase if there are a lot
      of processes doing anonymous read faults.  This patch proposes a way to
      reduce the access to the global counter so that the CPU load can be
      reduced accordingly.
      
      To do this, a new flag of the mm_struct is introduced:
      MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only needs to touch
      the global counter in two cases:
      
       1 The first time it uses the global huge zero page;
       2 The time when mm_users of its mm_struct reaches zero.
      
      Note that right now, the huge zero page is eligible to be freed as soon
      as its last use goes away.  With this patch, the page will not be
      eligible to be freed until the exit of the last process from which it
      was ever used.
      
      And with the use of mm_users, kthreads are not eligible to use the huge
      zero page either.  Since no kthread uses the huge zero page today, there
      is no difference after applying this patch.  But if that is not desired,
      I can change it to when mm_count reaches zero.
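      
      The effect of the per-mm flag on the global counter can be sketched in
      userspace C.  This is a single-threaded toy under stated assumptions:
      struct mm_model, the model_* functions and global_counter_ops are
      hypothetical names, and the boolean field merely stands in for
      MMF_USED_HUGE_ZERO_PAGE.  It is not the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model; names are illustrative, not the kernel's definitions. */
static atomic_int huge_zero_refcount; /* global counter shared by all mms */
static int global_counter_ops;        /* how often the global counter is touched */

struct mm_model {
    bool used_huge_zero_page; /* stands in for MMF_USED_HUGE_ZERO_PAGE */
};

/* Called on every anonymous read fault that maps the huge zero page. */
static void model_get_huge_zero_page(struct mm_model *mm)
{
    if (mm->used_huge_zero_page)
        return; /* fast path: this mm already holds its one reference */

    atomic_fetch_add(&huge_zero_refcount, 1);
    global_counter_ops++;
    mm->used_huge_zero_page = true;
}

/* Called once, when the last user of the mm goes away. */
static void model_mm_exit(struct mm_model *mm)
{
    if (!mm->used_huge_zero_page)
        return;

    atomic_fetch_sub(&huge_zero_refcount, 1);
    global_counter_ops++;
    mm->used_huge_zero_page = false;
}

/* Simulates one process doing nr_faults read faults, then exiting. */
static int model_run_faults(int nr_faults)
{
    struct mm_model mm = { false };
    int i;

    global_counter_ops = 0;
    for (i = 0; i < nr_faults; i++)
        model_get_huge_zero_page(&mm);
    model_mm_exit(&mm);
    return global_counter_ops;
}

/* Usage: however many faults occur, the global counter is touched twice. */
static void model_self_check(void)
{
    assert(model_run_faults(1000) == 2);
    assert(atomic_load(&huge_zero_refcount) == 0);
}
```

      The model ignores concurrent faults from several threads of the same
      process, which real code would have to handle when setting the flag.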
      
      Case used for test on Haswell EP:
      
        usemem -n 72 --readonly -j 0x200000 100G
      
      Which spawns 72 processes and each will mmap 100G anonymous space and
      then do read only access to that space sequentially with a step of 2MB.
      
        CPU cycles from perf report for base commit:
            54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
        CPU cycles from perf report for this commit:
             0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page
      
      Performance(throughput) of the workload for base commit: 1784430792
      Performance(throughput) of the workload for this commit: 4726928591
      164% increase.
      
      Runtime of the workload for base commit: 707592 us
      Runtime of the workload for this commit: 303970 us
      57% drop.
      
      Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 19 Sep, 2016 4 commits
  10. 27 Jul, 2016 1 commit
    • dax: remote unused fault wrappers · 6b524995
      Ross Zwisler committed
      Remove the unused wrappers dax_fault() and dax_pmd_fault().  After this
      removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
      dax_pmd_fault() respectively, and update all callers.
      
      The dax_fault() and dax_pmd_fault() wrappers were initially intended to
      capture some filesystem independent functionality around page faults
      (calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime
      and ctime).
      
      However, the following commits:
      
         5726b27b ("ext2: Add locking for DAX faults")
         ea3d7209 ("ext4: fix races between page faults and hole punching")
      
      added locking to the ext2 and ext4 filesystems after these common
      operations but before __dax_fault() and __dax_pmd_fault() were called.
      This means that these wrappers are no longer used, and are unlikely to
      be used in the future.
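      
      The call order described above can be mocked in userspace.  Every
      function here is a stub that only records its name: the stub_* helpers,
      model_fs_dax_fault() and the per-filesystem lock are hypothetical
      stand-ins, not the ext2, ext4 or XFS code.  The point is only that the
      filesystem takes its own lock between the common operations and the
      DAX fault routine, leaving no room for a generic wrapper.

```c
#include <assert.h>
#include <string.h>

#define MAX_EVENTS 8

static const char *events[MAX_EVENTS]; /* names of calls, in order */
static int nr_events;

static void record(const char *name)
{
    if (nr_events < MAX_EVENTS)
        events[nr_events++] = name;
}

/* Stubs standing in for the operations named in the commit message. */
static void stub_sb_start_pagefault(void) { record("sb_start_pagefault"); }
static void stub_update_times(void)       { record("update_times"); }
static void stub_fs_lock(void)            { record("fs_lock"); }
static int  stub_dax_fault(void)          { record("dax_fault"); return 0; }
static void stub_fs_unlock(void)          { record("fs_unlock"); }
static void stub_sb_end_pagefault(void)   { record("sb_end_pagefault"); }

/* Hypothetical filesystem fault handler following the described order. */
static int model_fs_dax_fault(void)
{
    int ret;

    nr_events = 0;
    stub_sb_start_pagefault(); /* common operation */
    stub_update_times();       /* common operation: mtime and ctime */
    stub_fs_lock();            /* filesystem-specific locking */
    ret = stub_dax_fault();    /* formerly __dax_fault() */
    stub_fs_unlock();
    stub_sb_end_pagefault();
    return ret;
}

/* Usage: the lock is taken after the common operations, before the fault. */
static void model_self_check(void)
{
    assert(model_fs_dax_fault() == 0);
    assert(strcmp(events[2], "fs_lock") == 0);
    assert(strcmp(events[3], "dax_fault") == 0);
}
```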
      
      XFS has had locking analogous to what was recently added to ext2 and
      ext4 since DAX support was initially introduced by:
      
         6b698ede ("xfs: add DAX file operations support")
      
      Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 13 Jul, 2016 2 commits
  12. 28 Jun, 2016 1 commit
    • dax: fix offset overflow in dax_io · 02395435
      Eric Sandeen committed
      This isn't functionally apparent for some reason, but
      when we test I/O at extreme offsets at the end of the loff_t
      range, such as in fstests xfs/071, the calculation of
      "max" in dax_io() can be wrong due to pos + size overflowing.
      
      For example,
      
      # xfs_io -c "pwrite 9223372036854771712 512" /mnt/test/file
      
      enters dax_io with:
      
      start 0x7ffffffffffff000
      end   0x7ffffffffffff200
      
      and the rounded up "size" variable is 0x1000.  This yields:
      
      pos + size 0x8000000000000000 (overflows loff_t)
             end 0x7ffffffffffff200
      
      Due to the overflow, the min() function picks the wrong
      value for the "max" variable, and when we send (max - pos)
      into e.g. copy_from_iter_pmem() it is also the wrong value.
      
      This somehow(tm) gets magically absorbed without incident,
      probably because iter->count is correct.  But it seems best
      to fix it up properly by comparing the two values as
      unsigned.
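      
      The numbers from the example above can be checked with a short
      userspace sketch.  The helper names are hypothetical and int64_t stands
      in for loff_t.  Signed overflow is undefined in ISO C, so the wrap is
      emulated with unsigned arithmetic; converting the result back to
      int64_t is implementation-defined and is assumed to wrap, as it does
      with GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Two's complement wrapping add and subtract on 64-bit signed values. */
static int64_t wrapping_add(int64_t a, int64_t b)
{
    return (int64_t)((uint64_t)a + (uint64_t)b);
}

static int64_t wrapping_sub(int64_t a, int64_t b)
{
    return (int64_t)((uint64_t)a - (uint64_t)b);
}

/* min() with a signed comparison: the problematic form. */
static int64_t min_signed(int64_t a, int64_t b)
{
    return a < b ? a : b;
}

/* min() comparing the two values as unsigned: the approach of the fix. */
static int64_t min_as_unsigned(int64_t a, int64_t b)
{
    return (uint64_t)a < (uint64_t)b ? a : b;
}

/* Length (max - pos) that results from the signed comparison. */
static int64_t len_signed(int64_t pos, int64_t size, int64_t end)
{
    int64_t max = min_signed(wrapping_add(pos, size), end);
    return wrapping_sub(max, pos);
}

/* Length (max - pos) that results from the unsigned comparison. */
static int64_t len_unsigned(int64_t pos, int64_t size, int64_t end)
{
    int64_t max = min_as_unsigned(wrapping_add(pos, size), end);
    return wrapping_sub(max, pos);
}

/* Usage with the values from the xfs_io example. */
static void model_self_check(void)
{
    const int64_t pos  = INT64_C(0x7ffffffffffff000);
    const int64_t end  = INT64_C(0x7ffffffffffff200);
    const int64_t size = 0x1000;

    assert(len_signed(pos, size, end) == 0x1000);  /* wrong: whole page */
    assert(len_unsigned(pos, size, end) == 0x200); /* right: 512 bytes */
}
```

      With the signed comparison the wrapped sum is negative, so it wins the
      min() and the length comes out as 0x1000 instead of the 0x200 bytes
      actually requested.
      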
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  13. 21 May, 2016 1 commit
  14. 20 May, 2016 2 commits