1. 16 4月, 2015 1 次提交
  2. 12 4月, 2015 2 次提交
  3. 08 4月, 2015 1 次提交
  4. 03 4月, 2015 1 次提交
  5. 17 2月, 2015 1 次提交
  6. 13 2月, 2015 1 次提交
  7. 05 2月, 2015 2 次提交
    • T
      ext4: add optimization for the lazytime mount option · a26f4992
      Theodore Ts'o 提交于
      Add an optimization for the MS_LAZYTIME mount option so that we will
      opportunistically write out any inodes with the I_DIRTY_TIME flag set
      in a particular inode table block when we need to update some inode in
      that inode table block anyway.
      
      Also add some temporary code so that we can set the lazytime mount
      option without needing a modified /sbin/mount program which can set
      MS_LAZYTIME.  We can eventually make this go away once util-linux has
      added support.
      
      Google-Bug-Id: 18297052
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      a26f4992
    • T
      vfs: add support for a lazytime mount option · 0ae45f63
      Theodore Ts'o 提交于
      Add a new mount option which enables a new "lazytime" mode.  This mode
      causes atime, mtime, and ctime updates to only be made to the
      in-memory version of the inode.  The on-disk times will only get
      updated when (a) if the inode needs to be updated for some non-time
      related change, (b) if userspace calls fsync(), syncfs() or sync(), or
      (c) just before an undeleted inode is evicted from memory.
      
      This is OK according to POSIX because there are no guarantees after a
      crash unless userspace explicitly requests via a fsync(2) call.
      
      For workloads which feature a large number of random write to a
      preallocated file, the lazytime mount option significantly reduces
      writes to the inode table.  The repeated 4k writes to a single block
      will result in undesirable stress on flash devices and SMR disk
      drives.  Even on conventional HDD's, the repeated writes to the inode
      table block will trigger Adjacent Track Interference (ATI) remediation
      latencies, which very negatively impact long tail latencies --- which
      is a very big deal for web serving tiers (for example).
      
      Google-Bug-Id: 18297052
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0ae45f63
  8. 26 11月, 2014 5 次提交
    • W
      ext4: update comments regarding ext4_delete_inode() · 58d86a50
      Wang Shilong 提交于
      ext4_delete_inode() has been renamed for a long time, update
      comments for this.
      Signed-off-by: NWang Shilong <wshilong@ddn.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      58d86a50
    • J
      ext4: move handling of list of shrinkable inodes into extent status code · b0dea4c1
      Jan Kara 提交于
      Currently callers adding extents to extent status tree were responsible
      for adding the inode to the list of inodes with freeable extents. This
      is error prone and puts list handling in unnecessarily many places.
      
      Just add inode to the list automatically when the first non-delay extent
      is added to the tree and remove inode from the list when the last
      non-delay extent is removed.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      b0dea4c1
    • Z
      ext4: change LRU to round-robin in extent status tree shrinker · edaa53ca
      Zheng Liu 提交于
      In this commit we discard the lru algorithm for inodes with extent
      status tree because it takes significant effort to maintain a lru list
      in extent status tree shrinker and the shrinker can take a long time to
      scan this lru list in order to reclaim some objects.
      
      We replace the lru ordering with a simple round-robin.  After that we
      never need to keep a lru list.  That means that the list needn't be
      sorted if the shrinker can not reclaim any objects in the first round.
      
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      edaa53ca
    • Z
      ext4: cache extent hole in extent status tree for ext4_da_map_blocks() · 2f8e0a7c
      Zheng Liu 提交于
      Currently extent status tree doesn't cache extent hole when a write
      looks up in extent tree to make sure whether a block has been allocated
      or not.  In this case, we don't put extent hole in extent cache because
      later this extent might be removed and a new delayed extent might be
      added back.  But it will cause a defect when we do a lot of writes.  If
      we don't put extent hole in extent cache, the following writes also need
      to access extent tree to look at whether or not a block has been
      allocated.  It brings a cache miss.  This commit fixes this defect.
      Also if the inode doesn't have any extent, this extent hole will be
      cached as well.
      
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      2f8e0a7c
    • J
      ext4: fix block reservation for bigalloc filesystems · cbd7584e
      Jan Kara 提交于
      For bigalloc filesystems we have to check whether newly requested inode
      block isn't already part of a cluster for which we already have delayed
      allocation reservation. This check happens in ext4_ext_map_blocks() and
      that function sets EXT4_MAP_FROM_CLUSTER if that's the case. However if
      ext4_da_map_blocks() finds in extent cache information about the block,
      we don't call into ext4_ext_map_blocks() and thus we always end up
      getting new reservation even if the space for cluster is already
      reserved. This results in overreservation and premature ENOSPC reports.
      
      Fix the problem by checking for existing cluster reservation already in
      ext4_da_map_blocks(). That simplifies the logic and actually allows us
      to get rid of the EXT4_MAP_FROM_CLUSTER flag completely.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      cbd7584e
  9. 30 10月, 2014 1 次提交
    • J
      ext4: bail early when clearing inode journal flag fails · 4f879ca6
      Jan Kara 提交于
      When clearing inode journal flag, we call jbd2_journal_flush() to force
      all the journalled data to their final locations. Currently we ignore
      when this fails and continue clearing inode journal flag. This isn't a
      big problem because when jbd2_journal_flush() fails, journal is likely
      aborted anyway. But it can still lead to somewhat confusing results so
      rather bail out early.
      
      Coverity-id: 989044
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      4f879ca6
  10. 13 10月, 2014 1 次提交
  11. 12 10月, 2014 1 次提交
    • E
      ext4: fix reservation overflow in ext4_da_write_begin · 0ff8947f
      Eric Sandeen 提交于
      Delalloc write journal reservations only reserve 1 credit,
      to update the inode if necessary.  However, it may happen
      once in a filesystem's lifetime that a file will cross
      the 2G threshold, and require the LARGE_FILE feature to
      be set in the superblock as well, if it was not set already.
      
      This overruns the transaction reservation, and can be
      demonstrated simply on any ext4 filesystem without the LARGE_FILE
      feature already set:
      
      dd if=/dev/zero of=testfile bs=1 seek=2147483646 count=1 \
      	conv=notrunc of=testfile
      sync
      dd if=/dev/zero of=testfile bs=1 seek=2147483647 count=1 \
      	conv=notrunc of=testfile
      
      leads to:
      
      EXT4-fs: ext4_do_update_inode:4296: aborting transaction: error 28 in __ext4_handle_dirty_super
      EXT4-fs error (device loop0) in ext4_do_update_inode:4301: error 28
      EXT4-fs error (device loop0) in ext4_reserve_inode_write:4757: Readonly filesystem
      EXT4-fs error (device loop0) in ext4_dirty_inode:4876: error 28
      EXT4-fs error (device loop0) in ext4_da_write_end:2685: error 28
      
      Adjust the number of credits based on whether the flag is
      already set, and whether the current write may extend past the
      LARGE_FILE limit.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      Cc: stable@vger.kernel.org
      0ff8947f
  12. 06 10月, 2014 2 次提交
    • T
      ext4: add ext4_iget_normal() which is to be used for dir tree lookups · f4bb2981
      Theodore Ts'o 提交于
      If there is a corrupted file system which has directory entries that
      point at reserved, metadata inodes, prohibit them from being used by
      treating them the same way we treat Boot Loader inodes --- that is,
      mark them to be bad inodes.  This prohibits them from being opened,
      deleted, or modified via chmod, chown, utimes, etc.
      
      In particular, this prevents a corrupted file system which has a
      directory entry which points at the journal inode from being deleted
      and its blocks released, after which point Much Hilarity Ensues.
      Reported-by: NSami Liedes <sami.liedes@iki.fi>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      f4bb2981
    • T
      ext4: don't orphan or truncate the boot loader inode · e2bfb088
      Theodore Ts'o 提交于
      The boot loader inode (inode #5) should never be visible in the
      directory hierarchy, but it's possible if the file system is corrupted
      that there will be a directory entry that points at inode #5.  In
      order to avoid accidentally trashing it, when such a directory inode
      is opened, the inode will be marked as a bad inode, so that it's not
      possible to modify (or read) the inode from userspace.
      
      Unfortunately, when we unlink this (invalid/illegal) directory entry,
      we will put the bad inode on the ophan list, and then when try to
      unlink the directory, we don't actually remove the bad inode from the
      orphan list before freeing in-memory inode structure.  This means the
      in-memory orphan list is corrupted, leading to a kernel oops.
      
      In addition, avoid truncating a bad inode in ext4_destroy_inode(),
      since truncating the boot loader inode is not a smart thing to do.
      Reported-by: NSami Liedes <sami.liedes@iki.fi>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      e2bfb088
  13. 02 10月, 2014 2 次提交
  14. 05 9月, 2014 1 次提交
  15. 02 9月, 2014 1 次提交
  16. 31 8月, 2014 1 次提交
  17. 30 8月, 2014 2 次提交
  18. 29 8月, 2014 1 次提交
  19. 24 8月, 2014 1 次提交
  20. 15 7月, 2014 2 次提交
    • L
      ext4: fix punch hole on files with indirect mapping · 4f579ae7
      Lukas Czerner 提交于
      Currently punch hole code on files with direct/indirect mapping has some
      problems which may lead to a data loss. For example (from Jan Kara):
      
      fallocate -n -p 10240000 4096
      
      will punch the range 10240000 - 12632064 instead of the range 1024000 -
      10244096.
      
      Also the code is a bit weird and it's not using infrastructure provided
      by indirect.c, but rather creating it's own way.
      
      This patch fixes the issues as well as making the operation to run 4
      times faster from my testing (punching out 60GB file). It uses similar
      approach used in ext4_ind_truncate() which takes advantage of
      ext4_free_branches() function.
      
      Also rename the ext4_free_hole_blocks() to something more sensible, like
      the equivalent we have for extent mapped files. Call it
      ext4_ind_remove_space().
      
      This has been tested mostly with fsx and some xfstests which are testing
      punch hole but does not require unwritten extents which are not
      supported with direct/indirect mapping. Not problems showed up even with
      1024k block size.
      
      CC: stable@vger.kernel.org
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      4f579ae7
    • T
      ext4: remove metadata reservation checks · 71d4f7d0
      Theodore Ts'o 提交于
      Commit 27dd4385 ("ext4: introduce reserved space") reserves 2% of
      the file system space to make sure metadata allocations will always
      succeed.  Given that, tracking the reservation of metadata blocks is
      no longer necessary.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      71d4f7d0
  21. 02 6月, 2014 1 次提交
    • Z
      ext4: handle symlink properly with inline_data · bd9db175
      Zheng Liu 提交于
      This commit tries to fix a bug that we can't read symlink properly with
      inline data feature when the length of symlink is greater than 60 bytes
      but less than extra space.
      
      The key issue is in ext4_inode_is_fast_symlink() that it doesn't check
      whether or not an inode has inline data.  When the user creates a new
      symlink, an inode will be allocated with MAY_INLINE_DATA flag.  Then
      symlink will be stored in ->i_block and extended attribute space.  In
      the mean time, this inode is with inline data flag.  After remounting
      it, ext4_inode_is_fast_symlink() function thinks that this inode is a
      fast symlink so that the data in ->i_block is copied to the user, and
      the data in extra space is trimmed.  In fact this inode should be as a
      normal symlink.
      
      The following script can hit this bug.
      
        #!/bin/bash
      
        cd ${MNT}
        filename=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
        rm -rf test
        mkdir test
        cd test
        echo "hello" >$filename
        ln -s $filename symlinkfile
        cd
        sudo umount /mnt/sda1
        sudo mount -t ext4 /dev/sda1 /mnt/sda1
        readlink /mnt/sda1/test/symlinkfile
      
      After applying this patch, it will break the assumption in e2fsck
      because the original implementation doesn't want to support symlink
      with inline data.
      Reported-by: N"Darrick J. Wong" <darrick.wong@oracle.com>
      Reported-by: NIan Nartowicz <claws@nartowicz.co.uk>
      Cc: Ian Nartowicz <claws@nartowicz.co.uk>
      Cc: Tao Ma <tm@tao.ma>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      bd9db175
  22. 13 5月, 2014 2 次提交
  23. 12 5月, 2014 2 次提交
    • S
      ext4: make local functions static · c197855e
      Stephen Hemminger 提交于
      I have been running make namespacecheck to look for unneeded globals, and
      found these in ext4.
      Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      c197855e
    • N
      ext4: fix data integrity sync in ordered mode · 1c8349a1
      Namjae Jeon 提交于
      When we perform a data integrity sync we tag all the dirty pages with
      PAGECACHE_TAG_TOWRITE at start of ext4_da_writepages.  Later we check
      for this tag in write_cache_pages_da and creates a struct
      mpage_da_data containing contiguously indexed pages tagged with this
      tag and sync these pages with a call to mpage_da_map_and_submit.  This
      process is done in while loop until all the PAGECACHE_TAG_TOWRITE
      pages are synced. We also do journal start and stop in each iteration.
      journal_stop could initiate journal commit which would call
      ext4_writepage which in turn will call ext4_bio_write_page even for
      delayed OR unwritten buffers. When ext4_bio_write_page is called for
      such buffers, even though it does not sync them but it clears the
      PAGECACHE_TAG_TOWRITE of the corresponding page and hence these pages
      are also not synced by the currently running data integrity sync. We
      will end up with dirty pages although sync is completed.
      
      This could cause a potential data loss when the sync call is followed
      by a truncate_pagecache call, which is exactly the case in
      collapse_range.  (It will cause generic/127 failure in xfstests)
      
      To avoid this issue, we can use set_page_writeback_keepwrite instead of
      set_page_writeback, which doesn't clear TOWRITE tag.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NNamjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: NAshish Sangwan <a.sangwan@samsung.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      1c8349a1
  24. 07 5月, 2014 4 次提交
  25. 22 4月, 2014 1 次提交