1. 04 4月, 2013 2 次提交
    • T
      ext4: refactor punch hole code · 26a4c0c6
      Theodore Ts'o 提交于
      Move common code in ext4_ind_punch_hole() and ext4_ext_punch_hole()
      into ext4_punch_hole().  This saves over 150 lines of code.
      
      This also fixes a potential bug when the punch_hole() code is racing
      against indirect-to-extents or extents-to-indirect migation.  We are
      currently using i_mutex to protect against changes to the inode flag;
      specifically, the append-only, immutable, and extents inode flags.  So
      we need to take i_mutex before deciding whether to use the
      extents-specific or indirect-specific punch_hole code.
      
      Also, there was a missing call to ext4_inode_block_unlocked_dio() in
      the indirect punch codepath.  This was added in commit 02d262df
      to block DIO readers racing against the punch operation in the
      codepath for extent-mapped inodes, but it was missing for
      indirect-block mapped inodes.  One of the advantages of refactoring
      the code is that it makes such oversights much less likely.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      26a4c0c6
    • Z
      ext4: fix big-endian bugs which could cause fs corruptions · 8cde7ad1
      Zheng Liu 提交于
      When an extent was zeroed out, we forgot to do convert from cpu to le16.
      It could make us hit a BUG_ON when we try to write dirty pages out.  So
      fix it.
      
      [ Also fix a bug found by Dmitry Monakhov where we were missing
        le32_to_cpu() calls in the new indirect punch hole code.
      
        There are a number of other big endian warnings found by static code
        analyzers, but we'll wait for the next merge window to fix them all
        up.  These fixes are designed to be Obviously Correct by code
        inspection, and easy to demonstrate that it won't make any
        difference (and hence, won't introduce any bugs) on little endian
        architectures such as x86.  --tytso ]
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reported-by: NCAI Qian <caiqian@redhat.com>
      Reported-by: NChristian Kujau <lists@nerdbynature.de>
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
      8cde7ad1
  2. 13 3月, 2013 1 次提交
    • L
      ext4: use s_extent_max_zeroout_kb value as number of kb · 4f42f80a
      Lukas Czerner 提交于
      Currently when converting extent to initialized, we have to decide
      whether to zeroout part/all of the uninitialized extent in order to
      avoid extent tree growing rapidly.
      
      The decision is made by comparing the size of the extent with the
      configurable value s_extent_max_zeroout_kb which is in kibibytes units.
      
      However when converting it to number of blocks we currently use it as it
      was in bytes. This is obviously bug and it will result in ext4 _never_
      zeroout extents, but rather always split and convert parts to
      initialized while leaving the rest uninitialized in default setting.
      
      Fix this by using s_extent_max_zeroout_kb as kibibytes.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      4f42f80a
  3. 11 3月, 2013 4 次提交
    • L
      ext4: update reserved space after the 'correction' · 232ec872
      Lukas Czerner 提交于
      Currently in ext4_ext_map_blocks() in delayed allocation writeback
      we would update the reservation and after that check whether we claimed
      cluster outside of the range of the allocation and if so, we'll give the
      block back to the reservation pool.
      
      However this also means that if the number of reserved data block
      dropped to zero before the correction, we would release all the metadata
      reservation as well, however we might still need it because the we're
      not done with the delayed allocation and there might be more blocks to
      come. This will result in error messages such as:
      
      EXT4-fs warning (device sdb): ext4_da_update_reserve_space:361: ino 12,
      allocated 1 with only 0 reserved metadata blocks (releasing 1 blocks
      with reserved 1 data blocks)
      
      This will only happen on bigalloc file system and it can be easily
      reproduced using fiemap-tester from xfstests like this:
      
      ./src/fiemap-tester -m DHDHDHDHD -S -p0 /mnt/test/file
      
      Or using xfstests such as 225.
      
      Fix this by doing the correction first and updating the reservation
      after that so that we do not accidentally decrease
      i_reserved_data_blocks to zero.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      232ec872
    • Z
      ext4: fix the wrong number of the allocated blocks in ext4_split_extent() · 3a225670
      Zheng Liu 提交于
      This commit fixes a wrong return value of the number of the allocated
      blocks in ext4_split_extent.  When the length of blocks we want to
      allocate is greater than the length of the current extent, we return a
      wrong number.  Let's see what happens in the following case when we
      call ext4_split_extent().
      
        map: [48, 72]
        ex:  [32, 64, u]
      
      'ex' will be split into two parts:
        ex1: [32, 47, u]
        ex2: [48, 64, w]
      
      'map->m_len' is returned from this function, and the value is 24.  But
      the real length is 16.  So it should be fixed.
      
      Meanwhile in this commit we use right length of the allocated blocks
      when get_reserved_cluster_alloc in ext4_ext_handle_uninitialized_extents
      is called.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
      Cc: stable@vger.kernel.org
      3a225670
    • Z
      ext4: update extent status tree after an extent is zeroed out · adb23551
      Zheng Liu 提交于
      When we try to split an extent, this extent could be zeroed out and mark
      as initialized.  But we don't know this in ext4_map_blocks because it
      only returns a length of allocated extent.  Meanwhile we will mark this
      extent as uninitialized because we only check m_flags.
      
      This commit update extent status tree when we try to split an unwritten
      extent.  We don't need to worry about the status of this extent because
      we always mark it as initialized.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
      adb23551
    • Z
      ext4: fix wrong m_len value after unwritten extent conversion · cdee7843
      Zheng Liu 提交于
      The ext4_ext_handle_uninitialized_extents() function was assuming the
      return value of ext4_ext_map_blocks() is equal to map->m_len.  This
      incorrect assumption was harmless until we started use status tree as
      a extent cache because we need to update status tree according to
      'm_len' value.
      
      Meanwhile this commit marks EXT4_MAP_MAPPED flag after unwritten extent
      conversion.  It shouldn't cause a bug because we update status tree
      according to checking EXT4_MAP_UNWRITTEN flag.  But it should be fixed.
      
      After applied this commit, the following error message from self-testing
      infrastructure disappears.
      
          ...
          kernel: ES len assertation failed for inode: 230 retval 1 !=
          map->m_len 3 in ext4_map_blocks (allocation)
          ...
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
      cdee7843
  4. 04 3月, 2013 4 次提交
  5. 23 2月, 2013 1 次提交
  6. 18 2月, 2013 6 次提交
    • Z
      ext4: remove single extent cache · 69eb33dc
      Zheng Liu 提交于
      Single extent cache could be removed because we have extent status tree
      as a extent cache, and it would be better.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      69eb33dc
    • Z
      ext4: lookup block mapping in extent status tree · d100eef2
      Zheng Liu 提交于
      After tracking all extent status, we already have a extent cache in
      memory.  Every time we want to lookup a block mapping, we can first
      try to lookup it in extent status tree to avoid a potential disk I/O.
      
      A new function called ext4_es_lookup_extent is defined to finish this
      work.  When we try to lookup a block mapping, we always call
      ext4_map_blocks and/or ext4_da_map_blocks.  So in these functions we
      first try to lookup a block mapping in extent status tree.
      
      A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks
      in order not to put a hole into extent status tree because this hole
      will be converted to delayed extent in the tree immediately.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      d100eef2
    • Z
      ext4: track all extent status in extent status tree · f7fec032
      Zheng Liu 提交于
      By recording the phycisal block and status, extent status tree is able
      to track the status of every extents.  When we call _map_blocks
      functions to lookup an extent or create a new written/unwritten/delayed
      extent, this extent will be inserted into extent status tree.
      
      We don't load all extents from disk in alloc_inode() because it costs
      too much memory, and if a file is opened and closed frequently it will
      takes too much time to load all extent information.  So currently when
      we create/lookup an extent, this extent will be inserted into extent
      status tree.  Hence, the extent status tree may not comprehensively
      contain all of the extents found in the file.
      
      Here a condition we need to take care is that an extent might contains
      unwritten and delayed status simultaneously because an extent is delayed
      allocated and could be allocated by fallocate.  At this time we need to
      keep delayed status because later we need to update delayed reservation
      space using it.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      f7fec032
    • Z
      ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag · a25a4e1a
      Zheng Liu 提交于
      This commit lets ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
      because in later commit ext4_map_blocks needs to use this flag to
      determine the extent status.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      a25a4e1a
    • Z
      ext4: rename and improbe ext4_es_find_extent() · be401363
      Zheng Liu 提交于
      This commit renames ext4_es_find_extent with ext4_es_find_delayed_extent
      and improve this function.  First, we split input and output parameter.
      Second, this function never return the first block of the next delayed
      extent after 'es'.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      be401363
    • Z
      ext4: refine extent status tree · 06b0c886
      Zheng Liu 提交于
      This commit refines the extent status tree code.
      
      1) A prefix 'es_' is added to to the extent status tree structure
      members.
      
      2) Refactored es_remove_extent() so that __es_remove_extent() can be
      used by es_insert_extent() to remove the old extent entry(-ies) before
      inserting a new one.
      
      3) Rename extent_status_end() to ext4_es_end()
      
      4) ext4_es_can_be_merged() is define to check whether two extents can
      be merged or not.
      
      5) Update and clarified comments.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      06b0c886
  7. 09 2月, 2013 1 次提交
    • T
      ext4: pass context information to jbd2__journal_start() · 9924a92a
      Theodore Ts'o 提交于
      So we can better understand what bits of ext4 are responsible for
      long-running jbd2 handles, use jbd2__journal_start() so we can pass
      context information for logging purposes.
      
      The recommended way for finding the longer-running handles is:
      
         T=/sys/kernel/debug/tracing
         EVENT=$T/events/jbd2/jbd2_handle_stats
         echo "interval > 5" > $EVENT/filter
         echo 1 > $EVENT/enable
      
         ./run-my-fs-benchmark
      
         cat $T/trace > /tmp/problem-handles
      
      This will list handles that were active for longer than 20ms.  Having
      longer-running handles is bad, because a commit started at the wrong
      time could stall for those 20+ milliseconds, which could delay an
      fsync() or an O_SYNC operation.  Here is an example line from the
      trace file describing a handle which lived on for 311 jiffies, or over
      1.2 seconds:
      
      postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32 
         tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
         dirtied_blocks 0
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      9924a92a
  8. 29 1月, 2013 1 次提交
  9. 28 1月, 2013 1 次提交
    • Z
      ext4: add punching hole support for non-extent-mapped files · 8bad6fc8
      Zheng Liu 提交于
      This patch add supports for indirect file support punching hole.  It
      is almost the same as ext4_ext_punch_hole.  First, we invalidate all
      pages between this hole, and then we try to deallocate all blocks of
      this hole.
      
      A recursive function is used to handle deallocation of blocks.  In
      this function, it iterates over the entries in inode's i_blocks or
      indirect blocks, and try to free the block for each one of them.
      
      After applying this patch, xfstest #255 will not pass w/o extent because
      indirect-based file doesn't support unwritten extents.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      8bad6fc8
  10. 13 1月, 2013 2 次提交
  11. 17 12月, 2012 1 次提交
  12. 11 12月, 2012 4 次提交
  13. 29 11月, 2012 3 次提交
    • T
      ext4: rationalize ext4_extents.h inclusion · 4a092d73
      Theodore Ts'o 提交于
      Previously, ext4_extents.h was being included at the end of ext4.h,
      which was bad for a number of reasons: (a) it was not being included
      in the expected place, and (b) it caused the header to be included
      multiple times.  There were #ifdef's to prevent this from causing any
      problems, but it still was unnecessary.
      
      By moving the function declarations that were in ext4_extents.h to
      ext4.h, which is standard practice for where the function declarations
      for the rest of ext4.h can be found, we can remove ext4_extents.h from
      being included in ext4.h at all, and then we can only include
      ext4_extents.h where it is needed in ext4's source files.
      
      It should be possible to move a few more things into ext4.h, and
      further reduce the number of source files that need to #include
      ext4_extents.h, but that's a cleanup for another day.
      Reported-by: NSachin Kamat <sachin.kamat@linaro.org>
      Reported-by: NWei Yongjun <weiyj.lk@gmail.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      4a092d73
    • L
      ext4: simple cleanup in fiemap codepath · 06348679
      Lukas Czerner 提交于
      This commit is simple cleanup of fiemap codepath which has not been
      included in previous commit to make the changes clearer. In this commit
      we rename cbex variable to newex in ext4_fill_fiemap_extents() because
      callback is no longer present
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      06348679
    • L
      ext4: prevent race while walking extent tree for fiemap · 91dd8c11
      Lukas Czerner 提交于
      Currently ext4_ext_walk_space() only takes i_data_sem for read when
      searching for the extent at given block with ext4_ext_find_extent().
      Then it drops the lock and the extent tree can be changed at will.
      However later on we're searching for the 'next' extent, but the extent
      tree might already have changed, so the information might not be
      accurate.
      
      In fact we can hit BUG_ON(end <= start) if the extent got inserted into
      the tree after the one we found and before the block we were searching
      for. This has been reproduced by running xfstests 225 in loop on s390x
      architecture, but theoretically we could hit this on any other
      architecture as well, but probably not as often.
      
      Moreover the extent currently in delayed allocation might be allocated
      after we search the extent tree and before we search extent status tree
      delayed buffers resulting in those delayed buffers being completely
      missed, even though completely written and allocated.
      
      We fix all those problems in several steps:
      
       1. remove unnecessary callback indirection
       2. rename functions
              ext4_ext_walk_space -> ext4_fill_fiemap_extents
              ext4_ext_fiemap_cb -> ext4_find_delayed_extent
       3. move fiemap_fill_next_extent() into ext4_fill_fiemap_extents()
       4. hold the i_data_sem for:
              ext4_ext_find_extent()
              ext4_ext_next_allocated_block()
              ext4_find_delayed_extent()
       5. call fiemap_fill_next_extent after releasing the i_data_sem
       6. move path reinitialization into the critical section.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      91dd8c11
  14. 09 11月, 2012 6 次提交
  15. 10 10月, 2012 1 次提交
  16. 05 10月, 2012 2 次提交
    • D
      ext4: serialize fallocate with ext4_convert_unwritten_extents · 60d4616f
      Dmitry Monakhov 提交于
      Fallocate should wait for pended ext4_convert_unwritten_extents()
      otherwise following race may happen:
      
      ftruncate( ,12288);
      fallocate( ,0, 4096)
      io_sibmit( ,0, 4096); /* Write to fallocated area, split extent if needed */
      fallocate( ,0, 8192); /* Grow extent and broke assumption about extent */
      
      Later kwork completion will do:
       ->ext4_convert_unwritten_extents (0, 4096)
         ->ext4_map_blocks(handle, inode, &map, EXT4_GET_BLOCKS_IO_CONVERT_EXT);
          ->ext4_ext_map_blocks() /* Will find new extent:  ex = [0,2] !!!!!! */
            ->ext4_ext_handle_uninitialized_extents()
              ->ext4_convert_unwritten_extents_endio()
              /* convert [0,2] extent to initialized, but only[0,1] was written */
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      60d4616f
    • D
      ext4: fix ext4_flush_completed_IO wait semantics · c278531d
      Dmitry Monakhov 提交于
      BUG #1) All places where we call ext4_flush_completed_IO are broken
          because buffered io and DIO/AIO goes through three stages
          1) submitted io,
          2) completed io (in i_completed_io_list) conversion pended
          3) finished  io (conversion done)
          And by calling ext4_flush_completed_IO we will flush only
          requests which were in (2) stage, which is wrong because:
           1) punch_hole and truncate _must_ wait for all outstanding unwritten io
            regardless to it's state.
           2) fsync and nolock_dio_read should also wait because there is
              a time window between end_page_writeback() and ext4_add_complete_io()
              As result integrity fsync is broken in case of buffered write
              to fallocated region:
              fsync                                      blkdev_completion
      	 ->filemap_write_and_wait_range
                                                         ->ext4_end_bio
                                                           ->end_page_writeback
                <-- filemap_write_and_wait_range return
      	 ->ext4_flush_completed_IO
         	 sees empty i_completed_io_list but pended
         	 conversion still exist
                                                           ->ext4_add_complete_io
      
      BUG #2) Race window becomes wider due to the 'ext4: completed_io
      locking cleanup V4' patch series
      
      This patch make following changes:
      1) ext4_flush_completed_io() now first try to flush completed io and when
         wait for any outstanding unwritten io via ext4_unwritten_wait()
      2) Rename function to more appropriate name.
      3) Assert that all callers of ext4_flush_unwritten_io should hold i_mutex to
         prevent endless wait
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      c278531d