1. 04 7月, 2015 1 次提交
    • L
      ext4: fix reservation release on invalidatepage for delalloc fs · 9705acd6
      Lukas Czerner 提交于
      On delalloc enabled file system on invalidatepage operation
      in ext4_da_page_release_reservation() we want to clear the delayed
      buffer and remove the extent covering the delayed buffer from the extent
      status tree.
      
      However currently there is a bug where on the systems with page size >
      block size we will always remove extents from the start of the page
      regardless where the actual delayed buffers are positioned in the page.
      This leads to the errors like this:
      
      EXT4-fs warning (device loop0): ext4_da_release_space:1225:
      ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data
      blocks
      
      This however can cause data loss on writeback time if the file system is
      in ENOSPC condition because we're releasing reservation for someones
      else delayed buffer.
      
      Fix this by only removing extents that corresponds to the part of the
      page we want to invalidate.
      
      This problem is reproducible by the following fio receipt (however I was
      only able to reproduce it with fio-2.1 or older.
      
      [global]
      bs=8k
      iodepth=1024
      iodepth_batch=60
      randrepeat=1
      size=1m
      directory=/mnt/test
      numjobs=20
      [job1]
      ioengine=sync
      bs=1k
      direct=1
      rw=randread
      filename=file1:file2
      [job2]
      ioengine=libaio
      rw=randwrite
      direct=1
      filename=file1:file2
      [job3]
      bs=1k
      ioengine=posixaio
      rw=randwrite
      direct=1
      filename=file1:file2
      [job5]
      bs=1k
      ioengine=sync
      rw=randread
      filename=file1:file2
      [job7]
      ioengine=libaio
      rw=randwrite
      filename=file1:file2
      [job8]
      ioengine=posixaio
      rw=randwrite
      filename=file1:file2
      [job10]
      ioengine=mmap
      rw=randwrite
      bs=1k
      filename=file1:file2
      [job11]
      ioengine=mmap
      rw=randwrite
      direct=1
      filename=file1:file2
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      9705acd6
  2. 02 7月, 2015 1 次提交
    • T
      ext4: fix fencepost error in lazytime optimization · 0f0ff9a9
      Theodore Ts'o 提交于
      Commit 8f4d8558: "ext4: fix lazytime optimization" was not a
      complete fix.  In the case where the inode number is a multiple of 16,
      and we could still end up updating an inode with dirty timestamps
      written to the wrong inode on disk.  Oops.
      
      This can be easily reproduced by using generic/005 with a file system
      with metadata_csum and lazytime enabled.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      0f0ff9a9
  3. 22 6月, 2015 2 次提交
  4. 21 6月, 2015 1 次提交
    • T
      ext4: prevent ext4_quota_write() from failing due to ENOSPC · c5e298ae
      Theodore Ts'o 提交于
      In order to prevent quota block tracking to be inaccurate when
      ext4_quota_write() fails with ENOSPC, we make two changes.  The quota
      file can now use the reserved block (since the quota file is arguably
      file system metadata), and ext4_quota_write() now uses
      ext4_should_retry_alloc() to retry the block allocation after a commit
      has completed and released some blocks for allocation.
      
      This fixes failures of xfstests generic/270:
      
      Quota error (device vdc): write_blk: dquota write failed
      Quota error (device vdc): qtree_write_dquot: Error -28 occurred while creating quota
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      c5e298ae
  5. 13 6月, 2015 1 次提交
    • T
      ext4: fix race between truncate and __ext4_journalled_writepage() · bdf96838
      Theodore Ts'o 提交于
      The commit cf108bca: "ext4: Invert the locking order of page_lock
      and transaction start" caused __ext4_journalled_writepage() to drop
      the page lock before the page was written back, as part of changing
      the locking order to jbd2_journal_start -> page_lock.  However, this
      introduced a potential race if there was a truncate racing with the
      data=journalled writeback mode.
      
      Fix this by grabbing the page lock after starting the journal handle,
      and then checking to see if page had gotten truncated out from under
      us.
      
      This fixes a number of different warnings or BUG_ON's when running
      xfstests generic/086 in data=journalled mode, including:
      
      jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7
      c0, 164), jh->b_transaction (  (null), 0), jh->b_next_transaction (  (null), 0), jlist 0
      
      	      	      	  - and -
      
      kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200!
          ...
      Call Trace:
       [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
       [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117
       [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117
       [<c027d883>] ? lock_buffer+0x36/0x36
       [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22
       [<c0229139>] do_invalidatepage+0x22/0x26
       [<c0229198>] truncate_inode_page+0x5b/0x85
       [<c022934b>] truncate_inode_pages_range+0x156/0x38c
       [<c0229592>] truncate_inode_pages+0x11/0x15
       [<c022962d>] truncate_pagecache+0x55/0x71
       [<c02b913b>] ext4_setattr+0x4a9/0x560
       [<c01ca542>] ? current_kernel_time+0x10/0x44
       [<c026c4d8>] notify_change+0x1c7/0x2be
       [<c0256a00>] do_truncate+0x65/0x85
       [<c0226f31>] ? file_ra_state_init+0x12/0x29
      
      	      	      	  - and -
      
      WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396
      irty_metadata+0x14a/0x1ae()
          ...
      Call Trace:
       [<c01b879f>] ? console_unlock+0x3a1/0x3ce
       [<c082cbb4>] dump_stack+0x48/0x60
       [<c0178b65>] warn_slowpath_common+0x89/0xa0
       [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae
       [<c0178bef>] warn_slowpath_null+0x14/0x18
       [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae
       [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d
       [<c02b2f44>] write_end_fn+0x40/0x53
       [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a
       [<c02b59e7>] ext4_writepage+0x354/0x3b8
       [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4
       [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c
       [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
       [<c02b5a5b>] __writepage+0x10/0x2e
       [<c0225956>] write_cache_pages+0x22d/0x32c
       [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8
       [<c02b6ee8>] ext4_writepages+0x102/0x607
       [<c019adfe>] ? sched_clock_local+0x10/0x10e
       [<c01a8a7c>] ? __lock_is_held+0x2e/0x44
       [<c01a8ad5>] ? lock_is_held+0x43/0x51
       [<c0226dff>] do_writepages+0x1c/0x29
       [<c0276bed>] __writeback_single_inode+0xc3/0x545
       [<c0277c07>] writeback_sb_inodes+0x21f/0x36d
          ...
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      bdf96838
  6. 15 5月, 2015 1 次提交
  7. 03 5月, 2015 1 次提交
    • L
      ext4: fix data corruption caused by unwritten and delayed extents · d2dc317d
      Lukas Czerner 提交于
      Currently it is possible to lose whole file system block worth of data
      when we hit the specific interaction with unwritten and delayed extents
      in status extent tree.
      
      The problem is that when we insert delayed extent into extent status
      tree the only way to get rid of it is when we write out delayed buffer.
      However there is a limitation in the extent status tree implementation
      so that when inserting unwritten extent should there be even a single
      delayed block the whole unwritten extent would be marked as delayed.
      
      At this point, there is no way to get rid of the delayed extents,
      because there are no delayed buffers to write out. So when a we write
      into said unwritten extent we will convert it to written, but it still
      remains delayed.
      
      When we try to write into that block later ext4_da_map_blocks() will set
      the buffer new and delayed and map it to invalid block which causes
      the rest of the block to be zeroed loosing already written data.
      
      For now we can fix this by simply not allowing to set delayed status on
      written extent in the extent status tree. Also add WARN_ON() to make
      sure that we notice if this happens in the future.
      
      This problem can be easily reproduced by running the following xfs_io.
      
      xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
                -c "falloc 0 131072" \
                -c "pwrite -S 0xbb 65536 2048" \
                -c "fsync" /mnt/test/fff
      
      echo 3 > /proc/sys/vm/drop_caches
      xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff
      
      This can be theoretically also reproduced by at random by running fsx,
      but it's not very reliable, though on machines with bigger page size
      (like ppc) this can be seen more often (especially xfstest generic/127)
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      d2dc317d
  8. 25 4月, 2015 1 次提交
    • J
      direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Jens Axboe 提交于
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      
      After:
      
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      
      In other setups, Robert Elliott reported seeing good performance
      improvements:
      
      https://lkml.org/lkml/2015/4/3/557
      
      The more applications accessing the device, the worse it gets.
      
      Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fe0f07d0
  9. 16 4月, 2015 3 次提交
  10. 12 4月, 2015 6 次提交
  11. 08 4月, 2015 1 次提交
  12. 03 4月, 2015 1 次提交
  13. 26 3月, 2015 1 次提交
  14. 17 2月, 2015 1 次提交
  15. 13 2月, 2015 1 次提交
  16. 05 2月, 2015 2 次提交
    • T
      ext4: add optimization for the lazytime mount option · a26f4992
      Theodore Ts'o 提交于
      Add an optimization for the MS_LAZYTIME mount option so that we will
      opportunistically write out any inodes with the I_DIRTY_TIME flag set
      in a particular inode table block when we need to update some inode in
      that inode table block anyway.
      
      Also add some temporary code so that we can set the lazytime mount
      option without needing a modified /sbin/mount program which can set
      MS_LAZYTIME.  We can eventually make this go away once util-linux has
      added support.
      
      Google-Bug-Id: 18297052
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a26f4992
    • T
      vfs: add support for a lazytime mount option · 0ae45f63
      Theodore Ts'o 提交于
      Add a new mount option which enables a new "lazytime" mode.  This mode
      causes atime, mtime, and ctime updates to only be made to the
      in-memory version of the inode.  The on-disk times will only get
      updated when (a) if the inode needs to be updated for some non-time
      related change, (b) if userspace calls fsync(), syncfs() or sync(), or
      (c) just before an undeleted inode is evicted from memory.
      
      This is OK according to POSIX because there are no guarantees after a
      crash unless userspace explicitly requests via a fsync(2) call.
      
      For workloads which feature a large number of random write to a
      preallocated file, the lazytime mount option significantly reduces
      writes to the inode table.  The repeated 4k writes to a single block
      will result in undesirable stress on flash devices and SMR disk
      drives.  Even on conventional HDD's, the repeated writes to the inode
      table block will trigger Adjacent Track Interference (ATI) remediation
      latencies, which very negatively impact long tail latencies --- which
      is a very big deal for web serving tiers (for example).
      
      Google-Bug-Id: 18297052
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0ae45f63
  17. 26 11月, 2014 5 次提交
    • W
      ext4: update comments regarding ext4_delete_inode() · 58d86a50
      Wang Shilong 提交于
      ext4_delete_inode() has been renamed for a long time, update
      comments for this.
      Signed-off-by: NWang Shilong <wshilong@ddn.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      58d86a50
    • J
      ext4: move handling of list of shrinkable inodes into extent status code · b0dea4c1
      Jan Kara 提交于
      Currently callers adding extents to extent status tree were responsible
      for adding the inode to the list of inodes with freeable extents. This
      is error prone and puts list handling in unnecessarily many places.
      
      Just add inode to the list automatically when the first non-delay extent
      is added to the tree and remove inode from the list when the last
      non-delay extent is removed.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      b0dea4c1
    • Z
      ext4: change LRU to round-robin in extent status tree shrinker · edaa53ca
      Zheng Liu 提交于
      In this commit we discard the lru algorithm for inodes with extent
      status tree because it takes significant effort to maintain a lru list
      in extent status tree shrinker and the shrinker can take a long time to
      scan this lru list in order to reclaim some objects.
      
      We replace the lru ordering with a simple round-robin.  After that we
      never need to keep a lru list.  That means that the list needn't be
      sorted if the shrinker can not reclaim any objects in the first round.
      
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      edaa53ca
    • Z
      ext4: cache extent hole in extent status tree for ext4_da_map_blocks() · 2f8e0a7c
      Zheng Liu 提交于
      Currently extent status tree doesn't cache extent hole when a write
      looks up in extent tree to make sure whether a block has been allocated
      or not.  In this case, we don't put extent hole in extent cache because
      later this extent might be removed and a new delayed extent might be
      added back.  But it will cause a defect when we do a lot of writes.  If
      we don't put extent hole in extent cache, the following writes also need
      to access extent tree to look at whether or not a block has been
      allocated.  It brings a cache miss.  This commit fixes this defect.
      Also if the inode doesn't have any extent, this extent hole will be
      cached as well.
      
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      2f8e0a7c
    • J
      ext4: fix block reservation for bigalloc filesystems · cbd7584e
      Jan Kara 提交于
      For bigalloc filesystems we have to check whether newly requested inode
      block isn't already part of a cluster for which we already have delayed
      allocation reservation. This check happens in ext4_ext_map_blocks() and
      that function sets EXT4_MAP_FROM_CLUSTER if that's the case. However if
      ext4_da_map_blocks() finds in extent cache information about the block,
      we don't call into ext4_ext_map_blocks() and thus we always end up
      getting new reservation even if the space for cluster is already
      reserved. This results in overreservation and premature ENOSPC reports.
      
      Fix the problem by checking for existing cluster reservation already in
      ext4_da_map_blocks(). That simplifies the logic and actually allows us
      to get rid of the EXT4_MAP_FROM_CLUSTER flag completely.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      cbd7584e
  18. 30 10月, 2014 1 次提交
    • J
      ext4: bail early when clearing inode journal flag fails · 4f879ca6
      Jan Kara 提交于
      When clearing inode journal flag, we call jbd2_journal_flush() to force
      all the journalled data to their final locations. Currently we ignore
      when this fails and continue clearing inode journal flag. This isn't a
      big problem because when jbd2_journal_flush() fails, journal is likely
      aborted anyway. But it can still lead to somewhat confusing results so
      rather bail out early.
      
      Coverity-id: 989044
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      4f879ca6
  19. 13 10月, 2014 1 次提交
  20. 12 10月, 2014 1 次提交
    • E
      ext4: fix reservation overflow in ext4_da_write_begin · 0ff8947f
      Eric Sandeen 提交于
      Delalloc write journal reservations only reserve 1 credit,
      to update the inode if necessary.  However, it may happen
      once in a filesystem's lifetime that a file will cross
      the 2G threshold, and require the LARGE_FILE feature to
      be set in the superblock as well, if it was not set already.
      
      This overruns the transaction reservation, and can be
      demonstrated simply on any ext4 filesystem without the LARGE_FILE
      feature already set:
      
      dd if=/dev/zero of=testfile bs=1 seek=2147483646 count=1 \
      	conv=notrunc of=testfile
      sync
      dd if=/dev/zero of=testfile bs=1 seek=2147483647 count=1 \
      	conv=notrunc of=testfile
      
      leads to:
      
      EXT4-fs: ext4_do_update_inode:4296: aborting transaction: error 28 in __ext4_handle_dirty_super
      EXT4-fs error (device loop0) in ext4_do_update_inode:4301: error 28
      EXT4-fs error (device loop0) in ext4_reserve_inode_write:4757: Readonly filesystem
      EXT4-fs error (device loop0) in ext4_dirty_inode:4876: error 28
      EXT4-fs error (device loop0) in ext4_da_write_end:2685: error 28
      
      Adjust the number of credits based on whether the flag is
      already set, and whether the current write may extend past the
      LARGE_FILE limit.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      Cc: stable@vger.kernel.org
      0ff8947f
  21. 06 10月, 2014 2 次提交
    • T
      ext4: add ext4_iget_normal() which is to be used for dir tree lookups · f4bb2981
      Theodore Ts'o 提交于
      If there is a corrupted file system which has directory entries that
      point at reserved, metadata inodes, prohibit them from being used by
      treating them the same way we treat Boot Loader inodes --- that is,
      mark them to be bad inodes.  This prohibits them from being opened,
      deleted, or modified via chmod, chown, utimes, etc.
      
      In particular, this prevents a corrupted file system which has a
      directory entry which points at the journal inode from being deleted
      and its blocks released, after which point Much Hilarity Ensues.
      Reported-by: NSami Liedes <sami.liedes@iki.fi>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      f4bb2981
    • T
      ext4: don't orphan or truncate the boot loader inode · e2bfb088
      Theodore Ts'o 提交于
      The boot loader inode (inode #5) should never be visible in the
      directory hierarchy, but it's possible if the file system is corrupted
      that there will be a directory entry that points at inode #5.  In
      order to avoid accidentally trashing it, when such a directory inode
      is opened, the inode will be marked as a bad inode, so that it's not
      possible to modify (or read) the inode from userspace.
      
      Unfortunately, when we unlink this (invalid/illegal) directory entry,
      we will put the bad inode on the ophan list, and then when try to
      unlink the directory, we don't actually remove the bad inode from the
      orphan list before freeing in-memory inode structure.  This means the
      in-memory orphan list is corrupted, leading to a kernel oops.
      
      In addition, avoid truncating a bad inode in ext4_destroy_inode(),
      since truncating the boot loader inode is not a smart thing to do.
      Reported-by: NSami Liedes <sami.liedes@iki.fi>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      e2bfb088
  22. 02 10月, 2014 2 次提交
  23. 05 9月, 2014 1 次提交
  24. 02 9月, 2014 1 次提交
  25. 31 8月, 2014 1 次提交