1. 08 12月, 2015 1 次提交
    • J
      ext4: fix races between page faults and hole punching · ea3d7209
      Jan Kara 提交于
      Currently, page faults and hole punching are completely unsynchronized.
      This can result in page fault faulting in a page into a range that we
      are punching after truncate_pagecache_range() has been called and thus
      we can end up with a page mapped to disk blocks that will be shortly
      freed. Filesystem corruption will shortly follow. Note that the same
      race is avoided for truncate by checking page fault offset against
      i_size but there isn't similar mechanism available for punching holes.
      
      Fix the problem by creating new rw semaphore i_mmap_sem in inode and
      grab it for writing over truncate, hole punching, and other functions
      removing blocks from extent tree and for read over page faults. We
      cannot easily use i_data_sem for this since that ranks below transaction
      start and we need something ranking above it so that it can be held over
      the whole truncate / hole punching operation. Also remove various
      workarounds we had in the code to reduce race window when page fault
      could have created pages with stale mapping information.
      Signed-off-by: NJan Kara <jack@suse.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      ea3d7209
  2. 27 11月, 2015 3 次提交
  3. 25 11月, 2015 1 次提交
    • D
      ext4: Fix handling of extended tv_sec · a4dad1ae
      David Turner 提交于
      In ext4, the bottom two bits of {a,c,m}time_extra are used to extend
      the {a,c,m}time fields, deferring the year 2038 problem to the year
      2446.
      
      When decoding these extended fields, for times whose bottom 32 bits
      would represent a negative number, sign extension causes the 64-bit
      extended timestamp to be negative as well, which is not what's
      intended.  This patch corrects that issue, so that the only negative
      {a,c,m}times are those between 1901 and 1970 (as per 32-bit signed
      timestamps).
      
      Some older kernels might have written pre-1970 dates with 1,1 in the
      extra bits.  This patch treats those incorrectly-encoded dates as
      pre-1970, instead of post-2311, until kernel 4.20 is released.
      Hopefully by then e2fsck will have fixed up the bad data.
      
      Also add a comment explaining the encoding of ext4's extra {a,c,m}time
      bits.
      Signed-off-by: NDavid Turner <novalis@novalis.org>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reported-by: NMark Harris <mh8928@yahoo.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=23732
      Cc: stable@vger.kernel.org
      a4dad1ae
  4. 17 11月, 2015 1 次提交
    • D
      ext2, ext4: warn when mounting with dax enabled · ef83b6e8
      Dan Williams 提交于
      Similar to XFS warn when mounting DAX while it is still considered under
      development.  Also, aspects of the DAX implementation, for example
      synchronization against multiple faults and faults causing block
      allocation, depend on the correct implementation in the filesystem.  The
      maturity of a given DAX implementation is filesystem specific.
      
      Cc: <stable@vger.kernel.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: linux-ext4@vger.kernel.org
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NDave Chinner <david@fromorbit.com>
      Acked-by: NJan Kara <jack@suse.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      ef83b6e8
  5. 14 11月, 2015 1 次提交
  6. 11 11月, 2015 1 次提交
    • R
      vfs: remove unused wrapper block_page_mkwrite() · 5c500029
      Ross Zwisler 提交于
      The function currently called "__block_page_mkwrite()" used to be called
      "block_page_mkwrite()" until a wrapper for this function was added by:
      
      commit 24da4fab ("vfs: Create __block_page_mkwrite() helper passing
      	error values back")
      
      This wrapper, the current "block_page_mkwrite()", is currently unused.
      __block_page_mkwrite() is used directly by ext4, nilfs2 and xfs.
      
      Remove the unused wrapper, rename __block_page_mkwrite() back to
      block_page_mkwrite() and update the comment above block_page_mkwrite().
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5c500029
  7. 10 11月, 2015 1 次提交
    • A
      remove abs64() · 79211c8e
      Andrew Morton 提交于
      Switch everything to the new and more capable implementation of abs().
      Mainly to give the new abs() a bit of a workout.
      
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79211c8e
  8. 07 11月, 2015 2 次提交
    • M
      mm, fs: introduce mapping_gfp_constraint() · c62d2555
      Michal Hocko 提交于
      There are many places which use mapping_gfp_mask to restrict a more
      generic gfp mask which would be used for allocations which are not
      directly related to the page cache but they are performed in the same
      context.
      
      Let's introduce a helper function which makes the restriction explicit and
      easier to track.  This patch doesn't introduce any functional changes.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Suggested-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c62d2555
    • M
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep... · d0164adc
      Mel Gorman 提交于
      mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimisitic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM to identify
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and was depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0164adc
  9. 30 10月, 2015 1 次提交
  10. 21 10月, 2015 1 次提交
    • D
      KEYS: Merge the type-specific data with the payload data · 146aa8b1
      David Howells 提交于
      Merge the type-specific data with the payload data into one four-word chunk
      as it seems pointless to keep them separate.
      
      Use user_key_payload() for accessing the payloads of overloaded
      user-defined keys.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-cifs@vger.kernel.org
      cc: ecryptfs@vger.kernel.org
      cc: linux-ext4@vger.kernel.org
      cc: linux-f2fs-devel@lists.sourceforge.net
      cc: linux-nfs@vger.kernel.org
      cc: ceph-devel@vger.kernel.org
      cc: linux-ima-devel@lists.sourceforge.net
      146aa8b1
  11. 19 10月, 2015 4 次提交
  12. 18 10月, 2015 8 次提交
  13. 17 10月, 2015 1 次提交
    • M
      mm, fs: obey gfp_mapping for add_to_page_cache() · 063d99b4
      Michal Hocko 提交于
      Commit 6afdb859 ("mm: do not ignore mapping_gfp_mask in page cache
      allocation paths") has caught some users of hardcoded GFP_KERNEL used in
      the page cache allocation paths.  This, however, wasn't complete and
      there were others which went unnoticed.
      
      Dave Chinner has reported the following deadlock for xfs on loop device:
      : With the recent merge of the loop device changes, I'm now seeing
      : XFS deadlock on my single CPU, 1GB RAM VM running xfs/073.
      :
      : The deadlocked is as follows:
      :
      : kloopd1: loop_queue_read_work
      :       xfs_file_iter_read
      :       lock XFS inode XFS_IOLOCK_SHARED (on image file)
      :       page cache read (GFP_KERNEL)
      :       radix tree alloc
      :       memory reclaim
      :       reclaim XFS inodes
      :       log force to unpin inodes
      :       <wait for log IO completion>
      :
      : xfs-cil/loop1: <does log force IO work>
      :       xlog_cil_push
      :       xlog_write
      :       <loop issuing log writes>
      :               xlog_state_get_iclog_space()
      :               <blocks due to all log buffers under write io>
      :               <waits for IO completion>
      :
      : kloopd1: loop_queue_write_work
      :       xfs_file_write_iter
      :       lock XFS inode XFS_IOLOCK_EXCL (on image file)
      :       <wait for inode to be unlocked>
      :
      : i.e. the kloopd, with it's split read and write work queues, has
      : introduced a dependency through memory reclaim. i.e. that writes
      : need to be able to progress for reads make progress.
      :
      : The problem, fundamentally, is that mpage_readpages() does a
      : GFP_KERNEL allocation, rather than paying attention to the inode's
      : mapping gfp mask, which is set to GFP_NOFS.
      :
      : The didn't used to happen, because the loop device used to issue
      : reads through the splice path and that does:
      :
      :       error = add_to_page_cache_lru(page, mapping, index,
      :                       GFP_KERNEL & mapping_gfp_mask(mapping));
      
      This has changed by commit aa4d8616 ("block: loop: switch to VFS
      ITER_BVEC").
      
      This patch changes mpage_readpage{s} to follow gfp mask set for the
      mapping.  There are, however, other places which are doing basically the
      same.
      
      lustre:ll_dir_filler is doing GFP_KERNEL from the function which
      apparently uses GFP_NOFS for other allocations so let's make this
      consistent.
      
      cifs:readpages_get_pages is called from cifs_readpages and
      __cifs_readpages_from_fscache called from the same path obeys mapping
      gfp.
      
      ramfs_nommu_expand_for_mapping is hardcoding GFP_KERNEL as well
      regardless it uses mapping_gfp_mask for the page allocation.
      
      ext4_mpage_readpages is the called from the page cache allocation path
      same as read_pages and read_cache_pages
      
      As I've noticed in my previous post I cannot say I would be happy about
      sprinkling mapping_gfp_mask all over the place and it sounds like we
      should drop gfp_mask argument altogether and use it internally in
      __add_to_page_cache_locked that would require all the filesystems to use
      mapping gfp consistently which I am not sure is the case here.  From a
      quick glance it seems that some file system use it all the time while
      others are selective.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NDave Chinner <david@fromorbit.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Andreas Dilger <andreas.dilger@intel.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      063d99b4
  14. 15 10月, 2015 1 次提交
    • T
      ext4: use private version of page_zero_new_buffers() for data=journal mode · b90197b6
      Theodore Ts'o 提交于
      If there is a error while copying data from userspace into the page
      cache during a write(2) system call, in data=journal mode, in
      ext4_journalled_write_end() were using page_zero_new_buffers() from
      fs/buffer.c.  Unfortunately, this sets the buffer dirty flag, which is
      no good if journalling is enabled.  This is a long-standing bug that
      goes back for years and years in ext3, but a combination of (a)
      data=journal not being very common, (b) in many case it only results
      in a warning message. and (c) only very rarely causes the kernel hang,
      means that we only really noticed this as a problem when commit
      998ef75d caused this failure to happen frequently enough to cause
      generic/208 to fail when run in data=journal mode.
      
      The fix is to have our own version of this function that doesn't call
      mark_dirty_buffer(), since we will end up calling
      ext4_handle_dirty_metadata() on the buffer head(s) in questions very
      shortly afterwards in ext4_journalled_write_end().
      
      Thanks to Dave Hansen and Linus Torvalds for helping to identify the
      root cause of the problem.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.com>
      b90197b6
  15. 03 10月, 2015 5 次提交
    • T
      ext4 crypto: fix bugs in ext4_encrypted_zeroout() · 36086d43
      Theodore Ts'o 提交于
      Fix multiple bugs in ext4_encrypted_zeroout(), including one that
      could cause us to write an encrypted zero page to the wrong location
      on disk, potentially causing data and file system corruption.
      Fortunately, this tends to only show up in stress tests, but even with
      these fixes, we are seeing some test failures with generic/127 --- but
      these are now caused by data failures instead of metadata corruption.
      
      Since ext4_encrypted_zeroout() is only used for some optimizations to
      keep the extent tree from being too fragmented, and
      ext4_encrypted_zeroout() itself isn't all that optimized from a time
      or IOPS perspective, disable the extent tree optimization for
      encrypted inodes for now.  This prevents the data corruption issues
      reported by generic/127 until we can figure out what's going wrong.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      36086d43
    • T
      ext4 crypto: replace some BUG_ON()'s with error checks · 687c3c36
      Theodore Ts'o 提交于
      Buggy (or hostile) userspace should not be able to cause the kernel to
      crash.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      687c3c36
    • T
      ext4 crypto: ext4_page_crypto() doesn't need a encryption context · 3684de8c
      Theodore Ts'o 提交于
      Since ext4_page_crypto() doesn't need an encryption context (at least
      not any more), this allows us to simplify a number function signature
      and also allows us to avoid needing to allocate a context in
      ext4_block_write_begin().  It also means we no longer need a separate
      ext4_decrypt_one() function.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      3684de8c
    • T
      ext4: optimize ext4_writepage() for attempted 4k delalloc writes · cccd147a
      Theodore Ts'o 提交于
      In cases where the file system block size is the same as the page
      size, and ext4_writepage() is asked to write out a page which is
      either has the unwritten bit set in the extent tree, or which does not
      yet have a block assigned due to delayed allocation, we can bail out
      early and, unlocking the page earlier and avoiding a round trip
      through ext4_bio_write_page() with the attendant calls to
      set_page_writeback() and redirty_page_for_writeback().
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      cccd147a
    • T
      ext4 crypto: fix memory leak in ext4_bio_write_page() · 937d7b84
      Theodore Ts'o 提交于
      There are times when ext4_bio_write_page() is called even though we
      don't actually need to do any I/O.  This happens when ext4_writepage()
      gets called by the jbd2 commit path when an inode needs to force its
      pages written out in order to provide data=ordered guarantees --- and
      a page is backed by an unwritten (e.g., uninitialized) block on disk,
      or if delayed allocation means the page's backing store hasn't been
      allocated yet.  In that case, we need to skip the call to
      ext4_encrypt_page(), since in addition to wasting CPU, it leads to a
      bounce page and an ext4 crypto context getting leaked.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      937d7b84
  16. 29 9月, 2015 1 次提交
  17. 24 9月, 2015 4 次提交
  18. 09 9月, 2015 3 次提交