1. 08 Nov, 2013 1 commit
  2. 04 Nov, 2013 1 commit
  3. 29 Aug, 2013 3 commits
  4. 17 Aug, 2013 5 commits
    • ext4: fix warning in ext4_da_update_reserve_space() · 7d734532
      Jan Kara committed
      reaim workfile.dbase test easily triggers warning in
      ext4_da_update_reserve_space():
      
      EXT4-fs warning (device ram0): ext4_da_update_reserve_space:365:
      ino 12, allocated 1 with only 0 reserved metadata blocks (releasing 1
      blocks with reserved 9 data blocks)
      
      The problem is that one of the tests creates a file and then randomly
      writes to it with O_SYNC. That results in writing back pages of the file
      in random order, so we create extents for written blocks, say 0, 2, 4,
      6, 8 - this last allocation also allocates a new block for extents.
      Then we write out block 1, so we have extents 0-2, 4, 6, 8, and we
      release the indirect extent block because the extents fit in the inode
      again. Then we write out block 10 and need to allocate the indirect
      extent block again, which triggers the warning because we don't have
      the reservation anymore.
      
      Fix the problem by giving back freed metadata blocks resulting from
      extent merging into inode's reservation pool.
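
      The write sequence described above can be replayed with a small
      user-space model. This is only a sketch: it assumes the inode body has
      room for four extents (INODE_EXTENT_SLOTS is an illustrative name), and
      it merges extents on logical adjacency alone, ignoring physical layout.

```python
INODE_EXTENT_SLOTS = 4  # assumption: extents that fit in the inode itself


def extents_for(blocks):
    """Coalesce written logical block numbers into (first, last) runs."""
    runs = []
    for b in sorted(blocks):
        if runs and b == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], b)   # extend the previous run
        else:
            runs.append((b, b))           # start a new run
    return runs


def needs_external_block(blocks):
    """True when the extents no longer fit in the inode."""
    return len(extents_for(blocks)) > INODE_EXTENT_SLOTS


def replay(write_order):
    """Return (block, extent_count, needs_external_block) after each write."""
    written = set()
    history = []
    for b in write_order:
        written.add(b)
        history.append((b, len(extents_for(written)),
                        needs_external_block(written)))
    return history


if __name__ == "__main__":
    for step in replay([0, 2, 4, 6, 8, 1, 10]):
        print(step)
```

      In this model the external extent block is needed after block 8, is no
      longer needed after block 1 merges extents 0 and 2, and is needed again
      after block 10, which is the point where the reservation was missing.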
      Signed-off-by: Jan Kara <jack@suse.cz>
    • ext4: add support for extent pre-caching · 7869a4a6
      Theodore Ts'o committed
      Add a new fiemap flag which forces all of the extents in an inode
      to be cached in the extent_status tree.  This is critically important
      when using AIO to a preallocated file, since if we need to read in
      blocks from the extent tree, the io_submit(2) system call becomes
      synchronous, and the AIO is no longer "A", which is bad.
      
      In addition, for most files which have an external leaf tree block,
      the cost of caching the information in the extent status tree will be
      less than caching the entire 4k block in the buffer cache.  So it is
      generally a win to keep the extent information cached.
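
      A user-space caller might request pre-caching roughly as follows. This
      is a hedged sketch: the commit message does not name the flag, so
      FIEMAP_FLAG_CACHE and its value 0x4 are assumptions taken from the
      upstream linux/fiemap.h header, and the helper names are hypothetical.

```python
import fcntl
import struct

FS_IOC_FIEMAP = 0xC020660B               # _IOWR('f', 11, struct fiemap)
FIEMAP_FLAG_CACHE = 0x00000004           # assumed name and value of the new flag
FIEMAP_MAX_OFFSET = 0xFFFFFFFFFFFFFFFF   # "whole file" length

# struct fiemap header: fm_start, fm_length (u64), then fm_flags,
# fm_mapped_extents, fm_extent_count, fm_reserved (u32 each).
FIEMAP_HEADER = struct.Struct("=QQLLLL")


def build_precache_request(start=0, length=FIEMAP_MAX_OFFSET):
    """Pack a fiemap request that asks for caching and returns no extents."""
    return FIEMAP_HEADER.pack(start, length, FIEMAP_FLAG_CACHE, 0, 0, 0)


def precache_extents(path):
    """Issue the ioctl on `path`; returns the header fields the kernel wrote back."""
    with open(path, "rb") as f:
        buf = bytearray(build_precache_request())
        fcntl.ioctl(f.fileno(), FS_IOC_FIEMAP, buf)  # raises OSError if unsupported
        return FIEMAP_HEADER.unpack(bytes(buf))
```

      An application using AIO on a preallocated file would call something
      like precache_extents() once before submitting I/O, so that
      io_submit(2) does not have to read extent tree blocks.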
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: cache all of an extent tree's leaf block upon reading · 107a7bd3
      Theodore Ts'o committed
      When we read in an extent tree leaf block from disk, arrange to have
      all of its entries cached.  In nearly all cases the in-memory
      representation will be more compact than the on-disk representation in
      the buffer cache, and it allows us to get the information without
      having to traverse the extent tree for successive extents.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
    • ext4: print the block number of invalid extent tree blocks · c349179b
      Theodore Ts'o committed
      When we find an invalid extent tree block, report the block number of
      the bad block for debugging purposes.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
    • ext4: refactor code to read the extent tree block · 7d7ea89e
      Theodore Ts'o committed
      Refactor out the code needed to read the extent tree block into a
      single read_extent_tree_block() function.  In addition to simplifying
      the code, it also makes sure that we call the ext4_ext_load_extent
      tracepoint whenever we need to read an extent tree block from disk.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
  5. 30 Jul, 2013 1 commit
  6. 16 Jul, 2013 2 commits
    • ext4: call ext4_es_lru_add() after handling cache miss · 63b99968
      Theodore Ts'o committed
      If there are no items in the extent status tree, ext4_es_lru_add() is
      a no-op.  So it is not sufficient to call ext4_es_lru_add() before we
      try to lookup an entry in the extent status tree.  We also need to
      call it at the end of ext4_ext_map_blocks(), after items have been
      added to the extent status tree.
      
      This could lead to inodes that have extent status trees but which
      are not in the LRU list, which means they won't get considered for
      eviction by the es_shrinker.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <wenqing.lz@taobao.com>
      Cc: stable@vger.kernel.org
    • ext4: yield during large unlinks · 76828c88
      Theodore Ts'o committed
      During large unlink operations on files with extents, we can use a lot
      of CPU time.  This adds a cond_resched() call when starting to examine
      the next level of a multi-level extent tree.  Multi-level extent trees
      are rare in the first place, and this should rarely be executed.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  7. 15 Jul, 2013 2 commits
    • ext4: simplify calculation of blocks to free on error · c8e15130
      Theodore Ts'o committed
      In ext4_ext_map_blocks(), if we have successfully allocated the data
      blocks, but then run into trouble inserting the extent into the extent
      tree, most likely due to an ENOSPC condition, determine the arguments
      to ext4_free_blocks() in a simpler way which is easier to prove to be
      correct.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix error handling in ext4_ext_truncate() · 8acd5e9b
      Theodore Ts'o committed
      Previously ext4_ext_truncate() was ignoring potential error returns
      from ext4_es_remove_extent() and ext4_ext_remove_space().  This can
      lead to the on-disk extent tree and the extent status tree cache
      getting out of sync, which is particularly bad, and can lead to file
      system corruption and potential data loss.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
  8. 01 Jul, 2013 3 commits
  9. 13 Jun, 2013 1 commit
  10. 12 Jun, 2013 1 commit
    • ext4: don't use EXT4_FREE_BLOCKS_FORGET unnecessarily · 981250ca
      Theodore Ts'o committed
      Commit 18888cf0: "ext4: speed up truncate/unlink by not using
      bforget() unless needed" removed the use of EXT4_FREE_BLOCKS_FORGET in
      the most important codepath for file systems using extents, but a
      similar optimization also can be done for file systems using indirect
      blocks, and for the two special cases in the ext4 extents code.
      
      Cc: Andrey Sidorov <qrxd43@motorola.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  11. 05 Jun, 2013 2 commits
  12. 01 Jun, 2013 1 commit
  13. 28 May, 2013 3 commits
    • ext4: make punch hole code path work with bigalloc · d23142c6
      Lukas Czerner committed
      Currently punch hole is disabled in file systems with bigalloc
      feature enabled. However the recent changes in punch hole patch should
      make it easier to support punching holes on bigalloc enabled file
      systems.
      
      This commit changes partial_cluster handling in ext4_remove_blocks(),
      ext4_ext_rm_leaf() and ext4_ext_remove_space(). Currently
      partial_cluster is unsigned long long type and it makes sure that we
      will free the partial cluster if all extents have been released from that
      cluster. However it has been specifically designed only for truncate.
      
      With punch hole we can be freeing just some extents in the cluster,
      leaving the rest untouched. So we have to make sure that we notice a
      cluster which still has some extents. To do this I've changed
      partial_cluster to be signed long long type. The only scenario where
      this could be a problem is when cluster_size == block size; however, in
      that case there would not be any partial clusters, so we're safe. For
      bigger clusters the signed type is enough. Now we use a negative value
      in partial_cluster to mark such a cluster as used, hence we know that we
      must not free it even if all other extents have been freed from it.
      
      This scenario can be described in simple diagram:
      
      |FFF...FF..FF.UUU|
       ^----------^
        punch hole
      
      . - free space
      | - cluster boundary
      F - freed extent
      U - used extent
      
      Also update respective tracepoints to use signed long long type for
      partial_cluster.
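
      The sign convention can be illustrated with a toy model of the diagram
      above. It is a sketch only: the helper names are hypothetical, the
      cluster is represented by its layout string from the diagram, and 0 is
      assumed to mean "no partial cluster pending".

```python
NO_PARTIAL_CLUSTER = 0  # assumption: zero means nothing is pending


def partial_cluster_marker(cluster_no, layout):
    """Encode the state of a partially punched cluster in a signed value.

    layout uses the diagram's legend: 'F' freed, 'U' used, '.' free space.
    A positive result means "free this cluster once we are done"; a
    negative result means "still in use, must not be freed".
    """
    if cluster_no <= 0:
        raise ValueError("cluster number must be positive to carry a sign")
    return -cluster_no if "U" in layout else cluster_no


def may_free(marker):
    """Only a positive marker allows the cluster to be released."""
    return marker > NO_PARTIAL_CLUSTER


if __name__ == "__main__":
    punched = partial_cluster_marker(7, "FFF...FF..FF.UUU")
    print(punched, may_free(punched))
```

      With an unsigned type the "still in use" state could not be encoded in
      the same variable, which is why the type change was needed for punch
      hole while truncate alone could do without it.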
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: update ext4_ext_remove_space trace point · 61801325
      Lukas Czerner committed
      Add "end" variable.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: remove unused code from ext4_remove_blocks() · 78fb9cdf
      Lukas Czerner committed
      The "head removal" branch in the condition is never used in any code
      path in ext4 since the function's only caller, ext4_ext_rm_leaf(), will make
      sure that the extent is properly split before removing blocks. Note that
      there is a bug in this branch anyway.
      
      This commit removes the unused code completely and makes use of
      ext4_error() instead of printk if a dubious range is provided.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  14. 03 May, 2013 1 commit
    • ext4: fix fio regression · e30b5dca
      Yan, Zheng committed
      We (Linux Kernel Performance project) found a regression introduced
      by commit:
      
        f7fec032 ext4: track all extent status in extent status tree
      
      The commit causes about 20% performance decrease in fio random write
      test. Profiler shows that rb_next() uses a lot of CPU time. The call
      stack is:
      
        rb_next
        ext4_es_find_delayed_extent
        ext4_map_blocks
        _ext4_get_block
        ext4_get_block_write
        __blockdev_direct_IO
        ext4_direct_IO
        generic_file_direct_write
        __generic_file_aio_write
        ext4_file_write
        aio_rw_vect_retry
        aio_run_iocb
        do_io_submit
        sys_io_submit
        system_call_fastpath
        io_submit
        td_io_getevents
        io_u_queued_complete
        thread_main
        main
        __libc_start_main
      
      The cause is that ext4_es_find_delayed_extent() doesn't have an
      upper bound; it keeps searching until a delayed extent is found.
      When there are a lot of non-delayed entries in the extent status
      tree, ext4_es_find_delayed_extent() may use a lot of CPU time.
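
      The cost difference can be seen in a small model. The commit message
      states the cause but not the remedy, so this sketch assumes the natural
      one: giving the search an upper bound. A sorted list stands in for the
      red-black tree, and find_delayed() is a hypothetical name.

```python
from bisect import bisect_left


def find_delayed(entries, start, end=None):
    """Search sorted (lblk, is_delayed) entries for a delayed one.

    Returns (lblk or None, entries_visited). With end=None the search is
    unbounded, as in the behaviour the commit message describes; with an
    end it stops at the first entry past that block.
    """
    i = bisect_left(entries, (start,))   # first entry with lblk >= start
    visited = 0
    while i < len(entries):
        lblk, delayed = entries[i]
        if end is not None and lblk > end:
            break                        # past the range we care about
        visited += 1
        if delayed:
            return lblk, visited
        i += 1                           # the analogue of rb_next()
    return None, visited


if __name__ == "__main__":
    tree = [(n, False) for n in range(1000)] + [(1000, True)]
    print(find_delayed(tree, 0))          # walks every entry
    print(find_delayed(tree, 0, end=7))   # stops after the requested range
```

      For a direct I/O write that maps only a few blocks, the bounded search
      visits a handful of entries no matter how many non-delayed entries the
      tree holds.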
      Reported-by: LKP project <lkp@linux.intel.com>
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
  15. 20 Apr, 2013 1 commit
  16. 11 Apr, 2013 1 commit
  17. 10 Apr, 2013 2 commits
    • ext4: fix big-endian bug in extent migration code · 0b65349e
      Dmitry Monakhov committed
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • ext4: introduce reserved space · 27dd4385
      Lukas Czerner committed
      Currently in an ENOSPC condition, when writing into unwritten space or
      punching a hole, we might need to split the extent and grow the extent
      tree. However, since we can not allocate any new metadata blocks, we
      have to zero out the unwritten part of the extent or the punched-out
      part of the extent, or in the worst case return ENOSPC even though the
      user does not actually allocate any space.
      
      Also in delalloc path we do reserve metadata and data blocks for the
      time we're going to write out, however metadata block reservation is
      very tricky especially since we expect that logical connectivity implies
      physical connectivity, however that might not be the case and hence we
      might end up allocating more metadata blocks than previously reserved.
      So in future, metadata reservation checks should be removed since we can
      not assure that we do not under reserve.
      
      And this is where reserved space comes into the picture. When mounting
      the file system we slice off a little bit of the file system space (2%
      or 4096 clusters, whichever is smaller) which can be then used for the
      cases mentioned above to prevent costly zeroout, or unexpected ENOSPC.
      
      The number of reserved clusters can be set via sysfs, however it can
      never be bigger than number of free clusters in the file system.
      
      Note that this patch fixes the failure of xfstest 274 as expected.
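
      The sizing rule described above amounts to a one-line calculation. The
      sketch below uses hypothetical function names, and rejecting an
      oversized sysfs value with an error (rather than clamping it) is an
      assumption, since the message only says the value can never exceed the
      number of free clusters.

```python
RESERVED_PERCENT = 2
RESERVED_MAX_CLUSTERS = 4096


def default_reserved_clusters(total_clusters):
    """2% of the file system or 4096 clusters, whichever is smaller."""
    return min(total_clusters * RESERVED_PERCENT // 100, RESERVED_MAX_CLUSTERS)


def set_reserved_clusters(requested, free_clusters):
    """Model of the sysfs knob: never more than the free clusters."""
    if requested > free_clusters:
        raise ValueError("reserved clusters cannot exceed free clusters")
    return requested


if __name__ == "__main__":
    print(default_reserved_clusters(1000))        # small file system: 2%
    print(default_reserved_clusters(10_000_000))  # large file system: capped
```

      The cap takes over once the file system has at least 204800 clusters,
      because 2% of that is exactly 4096.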
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
  18. 09 Apr, 2013 1 commit
  19. 04 Apr, 2013 6 commits
    • ext4: try to prepend extent to the existing one · be8981be
      Lukas Czerner committed
      Currently when inserting extent in ext4_ext_insert_extent() we would
      only try to see if we can append new extent to the found extent. If
      we can not, then we proceed with adding new extent into the extent tree,
      but then possibly merging it back again.
      
      We can avoid this situation by trying to append and prepend new extent
      to the existing ones. However since the new extent can be on either
      side of the existing extent, we have to pick the right extent to try to
      append/prepend to.
      
      This patch adds the conditions to pick the right extent to
      append/prepend to and adds the actual prepending condition as well. This
      will also eliminate the need to use "reserved" block for possibly
      growing extent tree.
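
      The append/prepend choice can be sketched on a sorted list of extents.
      This is a simplified model, not the kernel logic: the Extent class,
      the helper names, and the length cap MAX_LEN are assumptions, and
      initialized versus unwritten state is ignored.

```python
from dataclasses import dataclass

MAX_LEN = 32768  # assumed cap on the length of a single extent


@dataclass
class Extent:
    lblk: int     # first logical block
    pblk: int     # first physical block
    length: int   # number of blocks


def try_append(ex, new):
    """Merge `new` onto the end of `ex` if both mappings are contiguous."""
    if (ex.lblk + ex.length == new.lblk and ex.pblk + ex.length == new.pblk
            and ex.length + new.length <= MAX_LEN):
        return Extent(ex.lblk, ex.pblk, ex.length + new.length)
    return None


def try_prepend(ex, new):
    """Merge `new` onto the front of `ex` if both mappings are contiguous."""
    if (new.lblk + new.length == ex.lblk and new.pblk + new.length == ex.pblk
            and ex.length + new.length <= MAX_LEN):
        return Extent(new.lblk, new.pblk, ex.length + new.length)
    return None


def insert_extent(extents, new):
    """Insert into a list sorted by lblk, merging with a neighbour if possible."""
    idx = sum(1 for ex in extents if ex.lblk < new.lblk)
    if idx > 0:                                   # left neighbour: append
        merged = try_append(extents[idx - 1], new)
        if merged:
            return extents[:idx - 1] + [merged] + extents[idx:]
    if idx < len(extents):                        # right neighbour: prepend
        merged = try_prepend(extents[idx], new)
        if merged:
            return extents[:idx] + [merged] + extents[idx + 1:]
    return extents[:idx] + [new] + extents[idx:]  # no merge: real insertion
```

      Only the last branch grows the list, which corresponds to the case
      where the extent tree itself might have to grow.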
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Transfer initialized block to right neighbor if possible · bc2d9db4
      Lukas Czerner committed
      Currently when converting extent to initialized we attempt to transfer
      initialized block to the left neighbor if possible when certain
      criteria are met. However we do not attempt to do the same for the
      right neighbor.
      
      This commit adds the possibility to transfer initialized block to the
      right neighbor if:
      
      1. We're not converting the whole extent
      2. Both extents are stored in the same extent tree node
      3. Right neighbor is initialized
      4. Right neighbor is logically abutting the current one
      5. Right neighbor is physically abutting the current one
      6. Right neighbor would not overflow the length limit
      
      This is basically the same logic as with transferring to the left. This
      will gain us some performance benefits since it is faster than inserting
      extent and then merging it.
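
      The six criteria listed above can be written as a single predicate.
      This is a sketch under stated assumptions: the Extent fields, the
      `node` identifier, the function name, and the value of
      EXT_INIT_MAX_LEN are illustrative, and the extra precondition that the
      written range ends exactly at the end of the current extent is an
      assumption about what "transfer to the right" requires.

```python
from dataclasses import dataclass

EXT_INIT_MAX_LEN = 1 << 15  # assumed length limit of an initialized extent


@dataclass
class Extent:
    lblk: int          # first logical block
    pblk: int          # first physical block
    length: int        # number of blocks
    initialized: bool
    node: int          # id of the extent tree node holding this extent


def can_transfer_right(cur, right, map_lblk, map_len):
    """Check whether the written tail of `cur` can move into `right`."""
    cur_end = cur.lblk + cur.length
    if map_lblk + map_len != cur_end:        # assumed: range is the tail of cur
        return False
    return (
        map_len < cur.length                             # 1. not the whole extent
        and cur.node == right.node                       # 2. same tree node
        and right.initialized                            # 3. right is initialized
        and right.lblk == cur_end                        # 4. logically abutting
        and right.pblk == cur.pblk + cur.length          # 5. physically abutting
        and right.length + map_len <= EXT_INIT_MAX_LEN   # 6. no length overflow
    )
```

      When the predicate holds, the conversion only shortens one extent and
      lengthens its neighbor, so no new extent entry is needed.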
      
      It would also prevent some situation in the delalloc path when we might
      run out of metadata reservation. This is due to the fact that we would
      attempt to split the extent first (possibly allocating a new metadata
      block) even though we did not account for that because it can (and
      will) be merged again. This commit fixes that scenario, because we no
      longer need to split the extent in such a case.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
    • ext4: support simple conversion of extent-mapped inodes to use i_blocks · 996bb9fd
      Theodore Ts'o committed
      In order to make it simpler to test the code which supports
      i_blocks/indirect-mapped inodes, support the conversion of inodes
      which are less than 12 blocks and which are contained in no more than
      a single extent.
      
      The primary intended use of this code is converting freshly created
      zero-length files and empty directories.
      
      Note that the version of chattr in e2fsprogs 1.42.7 and earlier has a
      check that prevents the clearing of the extent flag.  A simple patch
      which allows "chattr -e <file>" to work will be checked into the
      e2fsprogs git repository.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: refactor truncate code · 819c4920
      Theodore Ts'o committed
      Move common code in ext4_ind_truncate() and ext4_ext_truncate() into
      ext4_truncate().  This saves over 60 lines of code.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: refactor punch hole code · 26a4c0c6
      Theodore Ts'o committed
      Move common code in ext4_ind_punch_hole() and ext4_ext_punch_hole()
      into ext4_punch_hole().  This saves over 150 lines of code.
      
      This also fixes a potential bug when the punch_hole() code is racing
      against indirect-to-extents or extents-to-indirect migration.  We are
      currently using i_mutex to protect against changes to the inode flag;
      specifically, the append-only, immutable, and extents inode flags.  So
      we need to take i_mutex before deciding whether to use the
      extents-specific or indirect-specific punch_hole code.
      
      Also, there was a missing call to ext4_inode_block_unlocked_dio() in
      the indirect punch codepath.  This was added in commit 02d262df
      to block DIO readers racing against the punch operation in the
      codepath for extent-mapped inodes, but it was missing for
      indirect-block mapped inodes.  One of the advantages of refactoring
      the code is that it makes such oversights much less likely.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix big-endian bugs which could cause fs corruptions · 8cde7ad1
      Zheng Liu committed
      When an extent was zeroed out, we forgot to convert from cpu to le16.
      It could make us hit a BUG_ON when we try to write dirty pages out.  So
      fix it.
      
      [ Also fix a bug found by Dmitry Monakhov where we were missing
        le32_to_cpu() calls in the new indirect punch hole code.
      
        There are a number of other big endian warnings found by static code
        analyzers, but we'll wait for the next merge window to fix them all
        up.  These fixes are designed to be Obviously Correct by code
        inspection, and easy to demonstrate that it won't make any
        difference (and hence, won't introduce any bugs) on little endian
        architectures such as x86.  --tytso ]
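
      The effect of a missing cpu-to-le16 conversion can be reproduced in
      user space. In this sketch the helper names are hypothetical, and the
      on-disk field is assumed to be a 16-bit little-endian length.

```python
import struct


def store_le16(value):
    """Bytes written when the value is converted to little endian first."""
    return struct.pack("<H", value)


def store_native_on_big_endian(value):
    """Bytes a big-endian CPU writes when the conversion is forgotten."""
    return struct.pack(">H", value)


def read_le16(raw):
    """How the on-disk field is interpreted later, on any architecture."""
    return struct.unpack("<H", raw)[0]


if __name__ == "__main__":
    print(read_le16(store_le16(5)))                  # round-trips correctly
    print(read_le16(store_native_on_big_endian(5)))  # byte-swapped length
```

      On a little-endian machine both stores produce the same bytes, which
      is why such a fix cannot change behaviour on x86.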
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reported-by: CAI Qian <caiqian@redhat.com>
      Reported-by: Christian Kujau <lists@nerdbynature.de>
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
  20. 13 Mar, 2013 1 commit
    • ext4: use s_extent_max_zeroout_kb value as number of kb · 4f42f80a
      Lukas Czerner committed
      Currently when converting extent to initialized, we have to decide
      whether to zeroout part/all of the uninitialized extent in order to
      avoid extent tree growing rapidly.
      
      The decision is made by comparing the size of the extent with the
      configurable value s_extent_max_zeroout_kb which is in kibibytes units.
      
      However when converting it to number of blocks we currently use it as if
      it were in bytes. This is obviously a bug and it results in ext4 _never_
      zeroing out extents, but rather always splitting and converting parts to
      initialized while leaving the rest uninitialized in the default setting.
      
      Fix this by using s_extent_max_zeroout_kb as kibibytes.
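
      The unit mix-up is easy to see with concrete numbers. In this sketch
      the function names are hypothetical, and the sample values (a 32 KiB
      limit and 4 KiB blocks, so blocksize_bits = 12) are assumptions chosen
      for illustration.

```python
def max_zeroout_blocks_buggy(max_zeroout_kb, blocksize_bits):
    """Treats the KiB value as if it were a byte count."""
    return max_zeroout_kb >> blocksize_bits


def max_zeroout_blocks_fixed(max_zeroout_kb, blocksize_bits):
    """Converts KiB to bytes before converting bytes to blocks."""
    return (max_zeroout_kb * 1024) >> blocksize_bits


if __name__ == "__main__":
    print(max_zeroout_blocks_buggy(32, 12))  # limit collapses to zero blocks
    print(max_zeroout_blocks_fixed(32, 12))  # limit of eight 4 KiB blocks
```

      With a limit of zero blocks no extent is ever small enough to zero
      out, which matches the "never zeroout" behaviour described above.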
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
  21. 11 Mar, 2013 1 commit
    • ext4: update reserved space after the 'correction' · 232ec872
      Lukas Czerner committed
      Currently in ext4_ext_map_blocks() in delayed allocation writeback
      we would update the reservation and after that check whether we claimed
      cluster outside of the range of the allocation and if so, we'll give the
      block back to the reservation pool.
      
      However this also means that if the number of reserved data blocks
      dropped to zero before the correction, we would release all the metadata
      reservation as well. We might still need it, because we're not done
      with the delayed allocation and there might be more blocks to come.
      This will result in error messages such as:
      
      EXT4-fs warning (device sdb): ext4_da_update_reserve_space:361: ino 12,
      allocated 1 with only 0 reserved metadata blocks (releasing 1 blocks
      with reserved 1 data blocks)
      
      This will only happen on bigalloc file system and it can be easily
      reproduced using fiemap-tester from xfstests like this:
      
      ./src/fiemap-tester -m DHDHDHDHD -S -p0 /mnt/test/file
      
      Or using xfstests such as 225.
      
      Fix this by doing the correction first and updating the reservation
      after that so that we do not accidentally decrease
      i_reserved_data_blocks to zero.
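
      Why the order matters can be shown with a toy reservation counter. The
      class and function names are hypothetical, and the rule "release all
      metadata reservation when the reserved data count reaches zero" is a
      simplification of the behaviour the message describes.

```python
class Reservation:
    """Toy model of an inode's delayed-allocation reservation."""

    def __init__(self, data, meta):
        self.data = data   # reserved data blocks
        self.meta = meta   # reserved metadata blocks

    def update_reserve_space(self, used):
        """Claim `used` data blocks; drop metadata once no data is reserved."""
        self.data -= used
        if self.data == 0:
            self.meta = 0

    def give_back(self, blocks):
        """The 'correction': return blocks claimed outside the allocation."""
        self.data += blocks


def old_order(res, used, correction):
    res.update_reserve_space(used)   # may hit zero and drop metadata
    res.give_back(correction)        # too late, metadata is already gone
    return res


def new_order(res, used, correction):
    res.give_back(correction)        # correct first
    res.update_reserve_space(used)   # the count no longer dips to zero
    return res


if __name__ == "__main__":
    print(vars(old_order(Reservation(1, 2), used=1, correction=1)))
    print(vars(new_order(Reservation(1, 2), used=1, correction=1)))
```

      Both orders end with the same data count, but only the new order still
      holds the metadata reservation for the blocks that are yet to come.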
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>