1. 11 3月, 2013 2 次提交
    • Z
      ext4: update extent status tree after an extent is zeroed out · adb23551
      Zheng Liu 提交于
      When we try to split an extent, this extent could be zeroed out and mark
      as initialized.  But we don't know this in ext4_map_blocks because it
      only returns a length of allocated extent.  Meanwhile we will mark this
      extent as uninitialized because we only check m_flags.
      
      This commit update extent status tree when we try to split an unwritten
      extent.  We don't need to worry about the status of this extent because
      we always mark it as initialized.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
      adb23551
    • Z
      ext4: fix wrong m_len value after unwritten extent conversion · cdee7843
      Zheng Liu 提交于
      The ext4_ext_handle_uninitialized_extents() function was assuming the
      return value of ext4_ext_map_blocks() is equal to map->m_len.  This
      incorrect assumption was harmless until we started use status tree as
      a extent cache because we need to update status tree according to
      'm_len' value.
      
      Meanwhile this commit marks EXT4_MAP_MAPPED flag after unwritten extent
      conversion.  It shouldn't cause a bug because we update status tree
      according to checking EXT4_MAP_UNWRITTEN flag.  But it should be fixed.
      
      After applied this commit, the following error message from self-testing
      infrastructure disappears.
      
          ...
          kernel: ES len assertation failed for inode: 230 retval 1 !=
          map->m_len 3 in ext4_map_blocks (allocation)
          ...
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
      cdee7843
  2. 04 3月, 2013 4 次提交
  3. 18 2月, 2013 6 次提交
    • Z
      ext4: remove single extent cache · 69eb33dc
      Zheng Liu 提交于
      Single extent cache could be removed because we have extent status tree
      as a extent cache, and it would be better.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      69eb33dc
    • Z
      ext4: lookup block mapping in extent status tree · d100eef2
      Zheng Liu 提交于
      After tracking all extent status, we already have a extent cache in
      memory.  Every time we want to lookup a block mapping, we can first
      try to lookup it in extent status tree to avoid a potential disk I/O.
      
      A new function called ext4_es_lookup_extent is defined to finish this
      work.  When we try to lookup a block mapping, we always call
      ext4_map_blocks and/or ext4_da_map_blocks.  So in these functions we
      first try to lookup a block mapping in extent status tree.
      
      A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks
      in order not to put a hole into extent status tree because this hole
      will be converted to delayed extent in the tree immediately.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      d100eef2
    • Z
      ext4: track all extent status in extent status tree · f7fec032
      Zheng Liu 提交于
      By recording the phycisal block and status, extent status tree is able
      to track the status of every extents.  When we call _map_blocks
      functions to lookup an extent or create a new written/unwritten/delayed
      extent, this extent will be inserted into extent status tree.
      
      We don't load all extents from disk in alloc_inode() because it costs
      too much memory, and if a file is opened and closed frequently it will
      takes too much time to load all extent information.  So currently when
      we create/lookup an extent, this extent will be inserted into extent
      status tree.  Hence, the extent status tree may not comprehensively
      contain all of the extents found in the file.
      
      Here a condition we need to take care is that an extent might contains
      unwritten and delayed status simultaneously because an extent is delayed
      allocated and could be allocated by fallocate.  At this time we need to
      keep delayed status because later we need to update delayed reservation
      space using it.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      f7fec032
    • Z
      ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag · a25a4e1a
      Zheng Liu 提交于
      This commit lets ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
      because in later commit ext4_map_blocks needs to use this flag to
      determine the extent status.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      a25a4e1a
    • Z
      ext4: rename and improbe ext4_es_find_extent() · be401363
      Zheng Liu 提交于
      This commit renames ext4_es_find_extent with ext4_es_find_delayed_extent
      and improve this function.  First, we split input and output parameter.
      Second, this function never return the first block of the next delayed
      extent after 'es'.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      be401363
    • Z
      ext4: refine extent status tree · 06b0c886
      Zheng Liu 提交于
      This commit refines the extent status tree code.
      
      1) A prefix 'es_' is added to to the extent status tree structure
      members.
      
      2) Refactored es_remove_extent() so that __es_remove_extent() can be
      used by es_insert_extent() to remove the old extent entry(-ies) before
      inserting a new one.
      
      3) Rename extent_status_end() to ext4_es_end()
      
      4) ext4_es_can_be_merged() is define to check whether two extents can
      be merged or not.
      
      5) Update and clarified comments.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      06b0c886
  4. 09 2月, 2013 1 次提交
    • T
      ext4: pass context information to jbd2__journal_start() · 9924a92a
      Theodore Ts'o 提交于
      So we can better understand what bits of ext4 are responsible for
      long-running jbd2 handles, use jbd2__journal_start() so we can pass
      context information for logging purposes.
      
      The recommended way for finding the longer-running handles is:
      
         T=/sys/kernel/debug/tracing
         EVENT=$T/events/jbd2/jbd2_handle_stats
         echo "interval > 5" > $EVENT/filter
         echo 1 > $EVENT/enable
      
         ./run-my-fs-benchmark
      
         cat $T/trace > /tmp/problem-handles
      
      This will list handles that were active for longer than 20ms.  Having
      longer-running handles is bad, because a commit started at the wrong
      time could stall for those 20+ milliseconds, which could delay an
      fsync() or an O_SYNC operation.  Here is an example line from the
      trace file describing a handle which lived on for 311 jiffies, or over
      1.2 seconds:
      
      postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32 
         tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
         dirtied_blocks 0
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      9924a92a
  5. 29 1月, 2013 1 次提交
  6. 28 1月, 2013 1 次提交
    • Z
      ext4: add punching hole support for non-extent-mapped files · 8bad6fc8
      Zheng Liu 提交于
      This patch add supports for indirect file support punching hole.  It
      is almost the same as ext4_ext_punch_hole.  First, we invalidate all
      pages between this hole, and then we try to deallocate all blocks of
      this hole.
      
      A recursive function is used to handle deallocation of blocks.  In
      this function, it iterates over the entries in inode's i_blocks or
      indirect blocks, and try to free the block for each one of them.
      
      After applying this patch, xfstest #255 will not pass w/o extent because
      indirect-based file doesn't support unwritten extents.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      8bad6fc8
  7. 13 1月, 2013 2 次提交
  8. 17 12月, 2012 1 次提交
  9. 11 12月, 2012 4 次提交
  10. 29 11月, 2012 3 次提交
    • T
      ext4: rationalize ext4_extents.h inclusion · 4a092d73
      Theodore Ts'o 提交于
      Previously, ext4_extents.h was being included at the end of ext4.h,
      which was bad for a number of reasons: (a) it was not being included
      in the expected place, and (b) it caused the header to be included
      multiple times.  There were #ifdef's to prevent this from causing any
      problems, but it still was unnecessary.
      
      By moving the function declarations that were in ext4_extents.h to
      ext4.h, which is standard practice for where the function declarations
      for the rest of ext4.h can be found, we can remove ext4_extents.h from
      being included in ext4.h at all, and then we can only include
      ext4_extents.h where it is needed in ext4's source files.
      
      It should be possible to move a few more things into ext4.h, and
      further reduce the number of source files that need to #include
      ext4_extents.h, but that's a cleanup for another day.
      Reported-by: NSachin Kamat <sachin.kamat@linaro.org>
      Reported-by: NWei Yongjun <weiyj.lk@gmail.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      4a092d73
    • L
      ext4: simple cleanup in fiemap codepath · 06348679
      Lukas Czerner 提交于
      This commit is simple cleanup of fiemap codepath which has not been
      included in previous commit to make the changes clearer. In this commit
      we rename cbex variable to newex in ext4_fill_fiemap_extents() because
      callback is no longer present
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      06348679
    • L
      ext4: prevent race while walking extent tree for fiemap · 91dd8c11
      Lukas Czerner 提交于
      Currently ext4_ext_walk_space() only takes i_data_sem for read when
      searching for the extent at given block with ext4_ext_find_extent().
      Then it drops the lock and the extent tree can be changed at will.
      However later on we're searching for the 'next' extent, but the extent
      tree might already have changed, so the information might not be
      accurate.
      
      In fact we can hit BUG_ON(end <= start) if the extent got inserted into
      the tree after the one we found and before the block we were searching
      for. This has been reproduced by running xfstests 225 in loop on s390x
      architecture, but theoretically we could hit this on any other
      architecture as well, but probably not as often.
      
      Moreover the extent currently in delayed allocation might be allocated
      after we search the extent tree and before we search extent status tree
      delayed buffers resulting in those delayed buffers being completely
      missed, even though completely written and allocated.
      
      We fix all those problems in several steps:
      
       1. remove unnecessary callback indirection
       2. rename functions
              ext4_ext_walk_space -> ext4_fill_fiemap_extents
              ext4_ext_fiemap_cb -> ext4_find_delayed_extent
       3. move fiemap_fill_next_extent() into ext4_fill_fiemap_extents()
       4. hold the i_data_sem for:
              ext4_ext_find_extent()
              ext4_ext_next_allocated_block()
              ext4_find_delayed_extent()
       5. call fiemap_fill_next_extent after releasing the i_data_sem
       6. move path reinitialization into the critical section.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      91dd8c11
  11. 09 11月, 2012 6 次提交
  12. 10 10月, 2012 1 次提交
  13. 05 10月, 2012 2 次提交
    • D
      ext4: serialize fallocate with ext4_convert_unwritten_extents · 60d4616f
      Dmitry Monakhov 提交于
      Fallocate should wait for pended ext4_convert_unwritten_extents()
      otherwise following race may happen:
      
      ftruncate( ,12288);
      fallocate( ,0, 4096)
      io_sibmit( ,0, 4096); /* Write to fallocated area, split extent if needed */
      fallocate( ,0, 8192); /* Grow extent and broke assumption about extent */
      
      Later kwork completion will do:
       ->ext4_convert_unwritten_extents (0, 4096)
         ->ext4_map_blocks(handle, inode, &map, EXT4_GET_BLOCKS_IO_CONVERT_EXT);
          ->ext4_ext_map_blocks() /* Will find new extent:  ex = [0,2] !!!!!! */
            ->ext4_ext_handle_uninitialized_extents()
              ->ext4_convert_unwritten_extents_endio()
              /* convert [0,2] extent to initialized, but only[0,1] was written */
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      60d4616f
    • D
      ext4: fix ext4_flush_completed_IO wait semantics · c278531d
      Dmitry Monakhov 提交于
      BUG #1) All places where we call ext4_flush_completed_IO are broken
          because buffered io and DIO/AIO goes through three stages
          1) submitted io,
          2) completed io (in i_completed_io_list) conversion pended
          3) finished  io (conversion done)
          And by calling ext4_flush_completed_IO we will flush only
          requests which were in (2) stage, which is wrong because:
           1) punch_hole and truncate _must_ wait for all outstanding unwritten io
            regardless to it's state.
           2) fsync and nolock_dio_read should also wait because there is
              a time window between end_page_writeback() and ext4_add_complete_io()
              As result integrity fsync is broken in case of buffered write
              to fallocated region:
              fsync                                      blkdev_completion
      	 ->filemap_write_and_wait_range
                                                         ->ext4_end_bio
                                                           ->end_page_writeback
                <-- filemap_write_and_wait_range return
      	 ->ext4_flush_completed_IO
         	 sees empty i_completed_io_list but pended
         	 conversion still exist
                                                           ->ext4_add_complete_io
      
      BUG #2) Race window becomes wider due to the 'ext4: completed_io
      locking cleanup V4' patch series
      
      This patch make following changes:
      1) ext4_flush_completed_io() now first try to flush completed io and when
         wait for any outstanding unwritten io via ext4_unwritten_wait()
      2) Rename function to more appropriate name.
      3) Assert that all callers of ext4_flush_unwritten_io should hold i_mutex to
         prevent endless wait
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      c278531d
  14. 01 10月, 2012 2 次提交
    • D
      ext4: fix ext_remove_space for punch_hole case · 6f2080e6
      Dmitry Monakhov 提交于
      Inode is allowed to have empty leaf only if it this is blockless inode.
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      6f2080e6
    • D
      ext4: punch_hole should wait for DIO writers · 02d262df
      Dmitry Monakhov 提交于
      punch_hole is the place where we have to wait for all existing writers
      (writeback, aio, dio), but currently we simply flush pended end_io request
      which is not sufficient. Other issue is that punch_hole performed w/o i_mutex
      held which obviously result in dangerous data corruption due to
      write-after-free.
      
      This patch performs following changes:
      - Guard punch_hole with i_mutex
      - Recheck inode flags under i_mutex
      - Block all new dio readers in order to prevent information leak caused by
        read-after-free pattern.
      - punch_hole now wait for all writers in flight
        NOTE: XXX write-after-free race is still possible because new dirty pages
        may appear due to mmap(), and currently there is no easy way to stop
        writeback while punch_hole is in progress.
      
      [ Fixed error return from ext4_ext_punch_hole() to make sure that we
        release i_mutex before returning EPERM or ETXTBUSY -- Ted ]
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      02d262df
  15. 29 9月, 2012 3 次提交
    • D
      ext4: completed_io locking cleanup · 28a535f9
      Dmitry Monakhov 提交于
      Current unwritten extent conversion state-machine is very fuzzy.
      - For unknown reason it performs conversion under i_mutex. What for?
        My diagnosis:
        We already protect extent tree with i_data_sem, truncate and punch_hole
        should wait for DIO, so the only data we have to protect is end_io->flags
        modification, but only flush_completed_IO and end_io_work modified this
        flags and we can serialize them via i_completed_io_lock.
      
        Currently all these games with mutex_trylock result in the following deadlock
         truncate:                          kworker:
          ext4_setattr                       ext4_end_io_work
          mutex_lock(i_mutex)
          inode_dio_wait(inode)  ->BLOCK
                                   DEADLOCK<- mutex_trylock()
                                              inode_dio_done()
        #TEST_CASE1_BEGIN
        MNT=/mnt_scrach
        unlink $MNT/file
        fallocate -l $((1024*1024*1024)) $MNT/file
        aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
        sleep 2
        truncate -s 0 $MNT/file
        #TEST_CASE1_END
      
      Or use 286's xfstests https://github.com/dmonakhov/xfstests/blob/devel/286
      
      This patch makes state machine simple and clean:
      
      (1) xxx_end_io schedule final extent conversion simply by calling
          ext4_add_complete_io(), which append it to ei->i_completed_io_list
          NOTE1: because of (2A) work should be queued only if
          ->i_completed_io_list was empty, otherwise the work is scheduled already.
      
      (2) ext4_flush_completed_IO is responsible for handling all pending
          end_io from ei->i_completed_io_list
          Flushing sequence consists of following stages:
          A) LOCKED: Atomically drain completed_io_list to local_list
          B) Perform extents conversion
          C) LOCKED: move converted io's to to_free list for final deletion
             	     This logic depends on context which we was called from.
          D) Final end_io context destruction
          NOTE1: i_mutex is no longer required because end_io->flags modification
          is protected by ei->ext4_complete_io_lock
      
      Full list of changes:
      - Move all completion end_io related routines to page-io.c in order to improve
        logic locality
      - Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
      - remove EXT4_IO_END_FSYNC
      - Improve SMP scalability by removing useless i_mutex which does not
        protect io->flags anymore.
      - Reduce lock contention on i_completed_io_lock by optimizing list walk.
      - Rename ext4_end_io_nolock to end4_end_io and make it static
      - Check flush completion status to ext4_ext_punch_hole(). Because it is
        not good idea to punch blocks from corrupted inode.
      
      Changes since V3 (in request to Jan's comments):
        Fall back to active flush_completed_IO() approach in order to prevent
        performance issues with nolocked DIO reads.
      Changes since V2:
        Fix use-after-free caused by race truncate vs end_io_work
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      28a535f9
    • D
      ext4: fix unwritten counter leakage · 82e54229
      Dmitry Monakhov 提交于
      ext4_set_io_unwritten_flag() will increment i_unwritten counter, so
      once we mark end_io with EXT4_END_IO_UNWRITTEN we have to revert it back
      on error path.
      
       - add missed error checks to prevent counter leakage
       - ext4_end_io_nolock() will clear EXT4_END_IO_UNWRITTEN flag to signal
         that conversion finished.
       - add BUG_ON to ext4_free_end_io() to prevent similar leakage in future.
      
      Visible effect of this bug is that unaligned aio_stress may deadlock
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      82e54229
    • D
      ext4: ext4_inode_info diet · f45ee3a1
      Dmitry Monakhov 提交于
      Generic inode has unused i_private pointer which may be used as cur_aio_dio
      storage.
      
      TODO: If cur_aio_dio will be passed as an argument to get_block_t this allow
            to have concurent AIO_DIO requests.
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      f45ee3a1
  16. 27 9月, 2012 1 次提交