1. 25 Nov, 2015 2 commits
    • jbd2: Fix unreclaimed pages after truncate in data=journal mode · bc23f0c8
      Jan Kara committed
      Ted and Namjae have reported that truncated pages are not reclaimed
      in a timely manner in data=journal mode. The following test
      triggers the issue easily:
      
      for (i = 0; i < 1000; i++) {
      	pwrite(fd, buf, 1024*1024, 0);
      	fsync(fd);
      	fsync(fd);
      	ftruncate(fd, 0);
      }
      
      The reason is that journal_unmap_buffer() finds that truncated buffers
      are not journalled (jh->b_transaction == NULL), they are part of the
      checkpoint list of a transaction (jh->b_cp_transaction != NULL) and have
      already been written out (!buffer_dirty(bh)). We clean such buffers but
      we leave them in the checkpoint list. Since the checkpoint transaction holds
      a reference to the journal head, these buffers cannot be released until
      the checkpoint transaction is cleaned up. And at that point we don't
      call release_buffer_page() anymore so pages detached from mapping are
      lingering in the system waiting for reclaim to find them and free them.
      
      Fix the problem by removing buffers from transaction checkpoint lists
      when journal_unmap_buffer() finds out they don't have to be there
      anymore.
      Reported-and-tested-by: Namjae Jeon <namjae.jeon@samsung.com>
      Fixes: de1b7941
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
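The test loop quoted in the commit message above could be wrapped into a standalone helper roughly as follows. This is only a sketch: the function name `run_truncate_loop`, the buffer fill pattern, and the error handling are assumptions, not part of the original report. It drives the same write/fsync/fsync/truncate pattern; the lingering pages themselves would only be visible on an ext4 filesystem mounted with data=journal, and observing them is outside the scope of this snippet.

```c
#define _XOPEN_SOURCE 700   /* expose pwrite(), fsync(), ftruncate() */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper wrapping the loop from the commit message.
 * Returns 0 on success, -1 if any call fails (a short write is
 * treated as a failure to keep the sketch simple).
 */
int run_truncate_loop(const char *path, int iterations)
{
	const size_t len = 1024 * 1024;
	char *buf;
	int fd, i, ret = 0;

	assert(path != NULL && iterations >= 0);

	buf = malloc(len);
	if (!buf)
		return -1;
	memset(buf, 0xab, len);          /* arbitrary fill pattern */

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0) {
		free(buf);
		return -1;
	}

	for (i = 0; i < iterations; i++) {
		/* Same sequence as the reported test case. */
		if (pwrite(fd, buf, len, 0) != (ssize_t)len ||
		    fsync(fd) != 0 ||
		    fsync(fd) != 0 ||
		    ftruncate(fd, 0) != 0) {
			ret = -1;
			break;
		}
	}

	close(fd);
	free(buf);
	return ret;
}
```

Calling `run_truncate_loop(path, 1000)` on a file that lives on the filesystem under test matches the iteration count used in the report.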
    • ext4: Fix handling of extended tv_sec · a4dad1ae
      David Turner committed
      In ext4, the bottom two bits of {a,c,m}time_extra are used to extend
      the {a,c,m}time fields, deferring the year 2038 problem to the year
      2446.
      
      When decoding these extended fields, for times whose bottom 32 bits
      would represent a negative number, sign extension causes the 64-bit
      extended timestamp to be negative as well, which is not what's
      intended.  This patch corrects that issue, so that the only negative
      {a,c,m}times are those between 1901 and 1970 (as per 32-bit signed
      timestamps).
      
      Some older kernels might have written pre-1970 dates with 1,1 in the
      extra bits.  This patch treats those incorrectly-encoded dates as
      pre-1970, instead of post-2311, until kernel 4.20 is released.
      Hopefully by then e2fsck will have fixed up the bad data.
      
      Also add a comment explaining the encoding of ext4's extra {a,c,m}time
      bits.
      Signed-off-by: David Turner <novalis@novalis.org>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reported-by: Mark Harris <mh8928@yahoo.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=23732
      Cc: stable@vger.kernel.org
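The decoding rule described in the commit above can be illustrated with a small userspace sketch. The names `EPOCH_BITS`, `EPOCH_MASK`, `sign_extend32`, `decode_seconds`, `encode_epoch_bits` and the `legacy_compat` parameter are illustrative assumptions and are not taken from the patch. The idea is that the 32-bit seconds field is read as a signed value and the two epoch bits then add a multiple of 2^32, so only dates between 1901 and 1970 come out negative.

```c
#include <assert.h>
#include <stdint.h>

#define EPOCH_BITS 2
#define EPOCH_MASK ((1u << EPOCH_BITS) - 1)   /* bottom two bits of the extra field */

/* Interpret the 32-bit on-disk seconds field as a signed value, portably. */
int64_t sign_extend32(uint32_t v)
{
	return (v & 0x80000000u) ? (int64_t)v - 4294967296LL : (int64_t)v;
}

/*
 * Decode seconds from the 32-bit field plus the extra field.
 * legacy_compat != 0 applies the compatibility rule from the commit
 * message: epoch bits 1,1 on a negative 32-bit value mean "pre-1970".
 */
int64_t decode_seconds(uint32_t disk_sec, uint32_t extra, int legacy_compat)
{
	int64_t sec = sign_extend32(disk_sec);
	uint32_t epoch = extra & EPOCH_MASK;

	if (legacy_compat && epoch == 3 && (disk_sec & 0x80000000u))
		epoch = 0;

	return sec + ((int64_t)epoch << 32);
}

/* Compute the two epoch bits for a 64-bit seconds value. */
uint32_t encode_epoch_bits(int64_t sec)
{
	int64_t base;

	/* Representable range: -2^31 .. 2^34 - 2^31 - 1. */
	assert(sec >= -2147483648LL && sec <= 15032385535LL);

	base = sign_extend32((uint32_t)sec);   /* what the 32-bit field holds */
	return (uint32_t)((sec - base) >> 32) & EPOCH_MASK;
}
```

For example, under these assumptions `decode_seconds(0x80000000u, 1, 0)` yields 2147483648, the first second that no longer fits in a signed 32-bit timestamp, while `decode_seconds(0xFFFFFFFFu, 0, 0)` stays at -1.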
  2. 30 Oct, 2015 1 commit
  3. 19 Oct, 2015 4 commits
  4. 18 Oct, 2015 10 commits
  5. 15 Oct, 2015 3 commits
    • ext4: promote ext4 over ext2 in the default probe order · 9172796b
      Darrick J. Wong committed
      Prevent clean ext3 filesystems from mounting by default with the ext2
      driver (with no journal!) by putting ext4 ahead of ext2 in the default
      probe order.  This will have the effect of mounting ext2 filesystems
      with ext4.ko by default, which is a safer failure than hoping the user
      notices that their journalled ext3 is now running without a journal!
      
      Users who require ext2.ko for ext2 can either disable ext4.ko or
      explicitly request ext2 via "mount -t ext2" or "rootfstype=ext2".
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • jbd2: gate checksum calculations on crc driver presence, not sb flags · 8595798c
      Darrick J. Wong committed
      Change the journal's checksum functions to gate on whether or not the
      crc32c driver is loaded, and gate the loading on the superblock bits.
      This prevents a journal crash if someone loads a journal in no-csum
      mode and then randomizes the superblock, thus flipping on the feature
      bits.
      Tested-By: Nikolay Borisov <kernel@kyup.com>
      Reported-by: Nikolay Borisov <kernel@kyup.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
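The gating change described in the commit above might be pictured with the following userspace sketch. Everything here is hypothetical: `struct journal`, `FEATURE_CSUM`, `journal_load` and `journal_has_csum` are stand-ins invented for illustration and do not reflect the actual jbd2 structures. The point is only that the superblock bits are consulted once, at load time, and every later decision depends on whether a driver was actually loaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FEATURE_CSUM 0x1u   /* hypothetical feature bit */

/* Hypothetical stand-in for the journal state. */
struct journal {
	uint32_t sb_features;      /* on-disk feature bits; may change later */
	const void *csum_driver;   /* non-NULL once a checksum driver is loaded */
};

/* Load time: the superblock bits decide whether the driver is loaded. */
void journal_load(struct journal *j, const void *driver)
{
	assert(j != NULL);
	j->csum_driver = (j->sb_features & FEATURE_CSUM) ? driver : NULL;
}

/* Run time: only the presence of the driver is checked, not the flags. */
bool journal_has_csum(const struct journal *j)
{
	return j->csum_driver != NULL;
}
```

With this arrangement, flipping `sb_features` after a no-checksum load does not make `journal_has_csum()` return true, so no checksum routine is reached without a driver behind it.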
    • ext4: use private version of page_zero_new_buffers() for data=journal mode · b90197b6
      Theodore Ts'o committed
      If there is an error while copying data from userspace into the page
      cache during a write(2) system call, in data=journal mode, in
      ext4_journalled_write_end() we were using page_zero_new_buffers() from
      fs/buffer.c.  Unfortunately, this sets the buffer dirty flag, which is
      no good if journalling is enabled.  This is a long-standing bug that
      goes back for years and years in ext3, but a combination of (a)
      data=journal not being very common, (b) in many cases it only results
      in a warning message, and (c) only very rarely causes the kernel to hang,
      means that we only really noticed this as a problem when commit
      998ef75d caused this failure to happen frequently enough to cause
      generic/208 to fail when run in data=journal mode.
      
      The fix is to have our own version of this function that doesn't call
      mark_buffer_dirty(), since we will end up calling
      ext4_handle_dirty_metadata() on the buffer head(s) in question very
      shortly afterwards in ext4_journalled_write_end().
      
      Thanks to Dave Hansen and Linus Torvalds for helping to identify the
      root cause of the problem.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.com>
  6. 03 Oct, 2015 5 commits
    • ext4 crypto: fix bugs in ext4_encrypted_zeroout() · 36086d43
      Theodore Ts'o committed
      Fix multiple bugs in ext4_encrypted_zeroout(), including one that
      could cause us to write an encrypted zero page to the wrong location
      on disk, potentially causing data and file system corruption.
      Fortunately, this tends to only show up in stress tests, but even with
      these fixes, we are seeing some test failures with generic/127 --- but
      these are now caused by data failures instead of metadata corruption.
      
      Since ext4_encrypted_zeroout() is only used for some optimizations to
      keep the extent tree from being too fragmented, and
      ext4_encrypted_zeroout() itself isn't all that optimized from a time
      or IOPS perspective, disable the extent tree optimization for
      encrypted inodes for now.  This prevents the data corruption issues
      reported by generic/127 until we can figure out what's going wrong.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4 crypto: replace some BUG_ON()'s with error checks · 687c3c36
      Theodore Ts'o committed
      Buggy (or hostile) userspace should not be able to cause the kernel to
      crash.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4 crypto: ext4_page_crypto() doesn't need an encryption context · 3684de8c
      Theodore Ts'o committed
      Since ext4_page_crypto() doesn't need an encryption context (at least
      not any more), this allows us to simplify a number of function signatures
      and also allows us to avoid needing to allocate a context in
      ext4_block_write_begin().  It also means we no longer need a separate
      ext4_decrypt_one() function.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: optimize ext4_writepage() for attempted 4k delalloc writes · cccd147a
      Theodore Ts'o committed
      In cases where the file system block size is the same as the page
      size, and ext4_writepage() is asked to write out a page which
      either has the unwritten bit set in the extent tree, or which does not
      yet have a block assigned due to delayed allocation, we can bail out
      early, unlocking the page earlier and avoiding a round trip
      through ext4_bio_write_page() with the attendant calls to
      set_page_writeback() and redirty_page_for_writeback().
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4 crypto: fix memory leak in ext4_bio_write_page() · 937d7b84
      Theodore Ts'o committed
      There are times when ext4_bio_write_page() is called even though we
      don't actually need to do any I/O.  This happens when ext4_writepage()
      gets called by the jbd2 commit path when an inode needs to force its
      pages written out in order to provide data=ordered guarantees --- and
      a page is backed by an unwritten (e.g., uninitialized) block on disk,
      or if delayed allocation means the page's backing store hasn't been
      allocated yet.  In that case, we need to skip the call to
      ext4_encrypt_page(), since in addition to wasting CPU, it leads to a
      bounce page and an ext4 crypto context getting leaked.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
  7. 24 Sep, 2015 3 commits
  8. 20 Sep, 2015 1 commit
    • fs-writeback: unplug before cond_resched in writeback_sb_inodes · 590dca3a
      Chris Mason committed
      Commit 505a666e ("writeback: plug writeback in wb_writeback() and
      writeback_inodes_wb()") has us holding a plug during writeback_sb_inodes,
      which increases the merge rate when relatively contiguous small files
      are written by the filesystem.  It helps both on flash and spindles.
      
      For an fs_mark workload creating 4K files in parallel across 8 drives,
      this commit improves performance ~9% more by unplugging before calling
      cond_resched().  cond_resched() doesn't trigger an implicit unplug, so
      explicitly getting the IO down to the device before scheduling reduces
      latencies for anyone waiting on clean pages.
      
      It also cuts down on how often we use kblockd to unplug, which means
      less work bouncing from one workqueue to another.
      
      Many more details about how we got here:
      
        https://lkml.org/lkml/2015/9/11/570
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 18 Sep, 2015 1 commit
  10. 16 Sep, 2015 2 commits
  11. 13 Sep, 2015 1 commit
    • writeback: plug writeback in wb_writeback() and writeback_inodes_wb() · 505a666e
      Linus Torvalds committed
      We had to revert the plugging in writeback_sb_inodes() because the
      wb->list_lock is held, but we could easily plug at a higher level before
      taking that lock, and unplug after releasing it.  This does that.
      
      Chris will run performance numbers, just to verify that this approach is
      comparable to the alternative (we could just drop and re-take the lock
      around the blk_finish_plug() rather than these two commits).
      
      I'd have preferred waiting for actual performance numbers before picking
      one approach over the other, but I don't want to release rc1 with the
      known "sleeping function called from invalid context" issue, so I'll
      pick this cleanup version for now.  But if the numbers show that we
      really want to plug just at the writeback_sb_inodes() level, and we
      should just play ugly games with the spinlock, we'll switch to that.
      
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 12 Sep, 2015 4 commits
    • [CIFS] mount option sec=none not displayed properly in /proc/mounts · eda2116f
      Steve French committed
      When the user specifies "sec=none" in a cifs mount, we set
      sec_type as unspecified (and set a flag, and the username will be
      null) rather than setting sectype as "none", so
      cifs_show_security was not properly displaying it in
      cifs /proc/mounts entries.
      Signed-off-by: Steve French <steve.french@primarydata.com>
      Reviewed-by: Jeff Layton <jlayton@poochiereds.net>
    • revert "ocfs2/dlm: use list_for_each_entry instead of list_for_each" · e527b22c
      Andrew Morton committed
      Revert commit f83c7b5e ("ocfs2/dlm: use list_for_each_entry instead
      of list_for_each").
      
      list_for_each_entry() will dereference its `pos' argument, which can be
      NULL in dlm_process_recovery_data().
      Reported-by: Julia Lawall <julia.lawall@lip6.fr>
      Reported-by: Fengguang Wu <fengguang.wu@gmail.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/seq_file: convert int seq_vprint/seq_printf/etc... returns to void · 6798a8ca
      Joe Perches committed
      The seq_<foo> function return values were frequently misused.
      
      See: commit 1f33c41c ("seq_file: Rename seq_overflow() to
           seq_has_overflowed() and make public")
      
      All uses of these return values have been removed, so convert the
      return types to void.
      
      Miscellanea:
      
      o Move seq_put_decimal_<type> and seq_escape prototypes closer to the
        other seq_vprintf prototypes
      o Reorder seq_putc and seq_puts to return early on overflow
      o Add argument names to seq_vprintf and seq_printf
      o Update the seq_escape kernel-doc
      o Convert a couple of leading spaces to tabs in seq_escape
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
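The calling pattern that the seq_file commit above moves toward, where output helpers return void and the caller asks separately whether the buffer overflowed, can be sketched with a userspace mock. The type `struct mock_seq` and the functions `mock_seq_puts` and `mock_seq_has_overflowed` are hypothetical stand-ins written for this illustration; they are only loosely modeled on the kernel helpers and are not the kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for a sequential output buffer. */
struct mock_seq {
	char *buf;      /* backing storage */
	size_t size;    /* capacity of buf */
	size_t count;   /* bytes used; count == size marks overflow */
};

/* Append a string; returns nothing, and returns early on overflow. */
void mock_seq_puts(struct mock_seq *m, const char *s)
{
	size_t len = strlen(s);

	assert(m != NULL && m->count <= m->size);
	if (m->count + len >= m->size) {
		m->count = m->size;   /* record the overflow for later checks */
		return;
	}
	memcpy(m->buf + m->count, s, len);
	m->count += len;
}

/* Callers query overflow explicitly instead of testing a return value. */
int mock_seq_has_overflowed(const struct mock_seq *m)
{
	return m->count == m->size;
}
```

A caller would issue any number of `mock_seq_puts()` calls and then check `mock_seq_has_overflowed()` once, which removes the ambiguity about what a per-call return value meant.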
    • Revert "writeback: plug writeback at a high level" · 0ba13fd1
      Linus Torvalds committed
      This reverts commit d353d758.
      
      Doing the block layer plug/unplug inside writeback_sb_inodes() is
      broken, because that function is actually called with a spinlock held:
      wb->list_lock, as pointed out by Chris Mason.
      
      Chris suggested just dropping and re-taking the spinlock around the
      blk_finish_plug() call (the plugging itself can happen under the
      spinlock), and that would technically work, but is just disgusting.
      
      We do something fairly similar - but not quite as disgusting because we
      at least have a better reason for it - in writeback_single_inode(), so
      it's not like the caller can depend on the lock being held over the
      call, but in this case there just isn't any good reason for that
      "release and re-take the lock" pattern.
      
      [ In general, we should really strive to avoid the "release and retake"
        pattern for locks, because in the general case it can easily cause
        subtle bugs when the caller caches any state around the call that
        might be invalidated by dropping the lock even just temporarily. ]
      
      But in this case, the plugging should be easy to just move up to the
      callers before the spinlock is taken, which should even improve the
      effectiveness of the plug.  So there is really no good reason to play
      games with locking here.
      
      I'll send off a test-patch so that Dave Chinner can verify that that
      plug movement works.  In the meantime this just reverts the problematic
      commit and adds a comment to the function so that we hopefully don't
      make this mistake again.
      Reported-by: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 11 Sep, 2015 3 commits