1. 11 Dec, 2012 (8 commits)
  2. 29 Nov, 2012 (1 commit)
    • ext4: rationalize ext4_extents.h inclusion · 4a092d73
      Authored by Theodore Ts'o
      Previously, ext4_extents.h was being included at the end of ext4.h,
      which was bad for a number of reasons: (a) it was not being included
      in the expected place, and (b) it caused the header to be included
      multiple times.  There were #ifdef's to prevent this from causing any
      problems, but it still was unnecessary.
      
      By moving the function declarations that were in ext4_extents.h to
      ext4.h, which is where the function declarations for the rest of
      ext4 are normally found, we can stop including ext4_extents.h in
      ext4.h at all, and then include ext4_extents.h only where it is
      needed in ext4's source files.
      
      It should be possible to move a few more things into ext4.h, and
      further reduce the number of source files that need to #include
      ext4_extents.h, but that's a cleanup for another day.
      Reported-by: Sachin Kamat <sachin.kamat@linaro.org>
      Reported-by: Wei Yongjun <weiyj.lk@gmail.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  3. 09 Nov, 2012 (2 commits)
  4. 22 Oct, 2012 (1 commit)
  5. 10 Oct, 2012 (1 commit)
    • ext4: fix metadata checksum calculation for the superblock · 06db49e6
      Authored by Theodore Ts'o
      The function ext4_handle_dirty_super() was calculating the superblock
      checksum on the wrong block data.  As a result, when the superblock is
      modified while the file system is mounted (most commonly, when inodes
      are added or removed from the orphan list), the superblock checksum
      would be wrong.  We didn't notice because the superblock checksum *was*
      being correctly calculated in ext4_commit_super(), and this would get
      called when the file system
      was unmounted.  So the problem only became obvious if the system
      crashed while the file system was mounted.
      
      Fix this by removing the poorly designed function signature for
      ext4_superblock_csum_set(); if it only took a single argument, the
      pointer to a struct superblock, the ambiguity which caused this
      mistake would have been impossible.
      Reported-by: George Spelvin <linux@horizon.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
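
The point about the function signature can be illustrated with a minimal user-space sketch. Everything here is hypothetical: `struct disk_super`, `struct mem_super`, `toy_csum()` and `super_csum_set()` are stand-ins invented for illustration and are not the kernel's types or its crc32c-based checksum. The sketch only shows that when the setter takes a single argument, the buffer being checksummed is always derived from that argument, so a caller cannot pass a mismatched buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the on-disk superblock. */
struct disk_super {
    uint32_t s_magic;
    uint32_t s_free_inodes;
    uint32_t s_checksum;        /* covers every byte before this field */
};

/* Hypothetical stand-in for the in-memory superblock object. */
struct mem_super {
    struct disk_super *es;      /* the live buffer that gets written out */
};

/* Toy checksum (NOT crc32c): folds the bytes that precede s_checksum. */
static uint32_t toy_csum(const struct disk_super *es)
{
    const unsigned char *p = (const unsigned char *)es;
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < offsetof(struct disk_super, s_checksum); i++)
        sum = sum * 31 + p[i];
    return sum;
}

/* Single-argument setter: the buffer is always taken from sb itself. */
static void super_csum_set(struct mem_super *sb)
{
    assert(sb != NULL && sb->es != NULL);
    sb->es->s_checksum = toy_csum(sb->es);
}

/* Returns nonzero when the stored checksum matches the buffer contents. */
static int super_csum_verify(const struct mem_super *sb)
{
    return sb->es->s_checksum == toy_csum(sb->es);
}
```

With this shape, any code path that modifies the buffer and then calls `super_csum_set(sb)` necessarily checksums the same buffer it modified.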
  6. 05 Oct, 2012 (1 commit)
    • ext4: fix ext4_flush_completed_IO wait semantics · c278531d
      Authored by Dmitry Monakhov
      BUG #1) All places where we call ext4_flush_completed_IO are broken
          because buffered io and DIO/AIO go through three stages
          1) submitted io,
          2) completed io (in i_completed_io_list), conversion pending
          3) finished  io (conversion done)
          And by calling ext4_flush_completed_IO we will flush only
          requests which were in stage (2), which is wrong because:
           1) punch_hole and truncate _must_ wait for all outstanding unwritten io
              regardless of its state.
           2) fsync and nolock_dio_read should also wait because there is
              a time window between end_page_writeback() and ext4_add_complete_io()
              As a result, integrity fsync is broken in case of a buffered write
              to a fallocated region:
              fsync                                      blkdev_completion
               ->filemap_write_and_wait_range
                                                         ->ext4_end_bio
                                                           ->end_page_writeback
                <-- filemap_write_and_wait_range return
               ->ext4_flush_completed_IO
                 sees empty i_completed_io_list but pending
                 conversion still exists
                                                           ->ext4_add_complete_io
      
      BUG #2) Race window becomes wider due to the 'ext4: completed_io
      locking cleanup V4' patch series
      
      This patch makes the following changes:
      1) ext4_flush_completed_IO() now first tries to flush completed io and then
         waits for any outstanding unwritten io via ext4_unwritten_wait()
      2) Rename the function to a more appropriate name (ext4_flush_unwritten_io).
      3) Assert that all callers of ext4_flush_unwritten_io hold i_mutex to
         prevent an endless wait
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
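
The difference between the old and new wait semantics can be modelled with a small single-threaded sketch. This is not kernel code: `struct io_state` and all function names are hypothetical, the counters merely stand in for the three stages described in the commit message, and a real implementation would sleep until the counter drops to zero rather than return a boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-inode counters modelling the stages of unwritten io. */
struct io_state {
    int submitted;  /* stage 1: in flight, not yet on the completed list */
    int completed;  /* stage 2: on the completed list, conversion pending */
    int unwritten;  /* stage 1 + stage 2: everything not yet converted */
};

static void io_submit(struct io_state *s)
{
    s->submitted++;
    s->unwritten++;
}

static void io_complete(struct io_state *s)
{
    assert(s->submitted > 0);
    s->submitted--;
    s->completed++;
}

/* Convert everything currently in stage 2 (moves it to stage 3). */
static int flush_completed(struct io_state *s)
{
    int n = s->completed;

    s->completed = 0;
    s->unwritten -= n;
    return n;
}

/* Old semantics: considered done once the completed list is empty. */
static bool old_flush_is_done(struct io_state *s)
{
    flush_completed(s);
    return s->completed == 0;
}

/* New semantics: done only when no unwritten io remains in any stage. */
static bool new_flush_is_done(struct io_state *s)
{
    flush_completed(s);
    return s->unwritten == 0;
}
```

In this model a request that has been submitted but has not yet reached the completed list is invisible to `old_flush_is_done()`, which is the window the commit message describes, while `new_flush_is_done()` keeps reporting outstanding work until that request has been completed and converted.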
  7. 29 Sep, 2012 (4 commits)
    • ext4: serialize dio nonlocked reads with defrag workers · 17335dcc
      Authored by Dmitry Monakhov
      Inode block defragmentation and ext4_change_inode_journal_flag() may
      affect the result of nonlocked DIO reads, so proper synchronization
      is required.

      - Add missing inode_dio_wait() calls where appropriate
      - Check inode state under an extra i_dio_count reference.
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: completed_io locking cleanup · 28a535f9
      Authored by Dmitry Monakhov
      The current unwritten extent conversion state machine is very fuzzy.
      - For an unknown reason it performs conversion under i_mutex. What for?
        My diagnosis:
        We already protect the extent tree with i_data_sem, and truncate and
        punch_hole should wait for DIO, so the only data we have to protect is
        the end_io->flags modification. But only flush_completed_IO and
        end_io_work modify these flags, and we can serialize them via
        i_completed_io_lock.

        Currently all these games with mutex_trylock result in the following deadlock:
         truncate:                          kworker:
          ext4_setattr                       ext4_end_io_work
          mutex_lock(i_mutex)
          inode_dio_wait(inode)  ->BLOCK
                                   DEADLOCK<- mutex_trylock()
                                              inode_dio_done()
        #TEST_CASE1_BEGIN
        MNT=/mnt_scrach
        unlink $MNT/file
        fallocate -l $((1024*1024*1024)) $MNT/file
        aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
        sleep 2
        truncate -s 0 $MNT/file
        #TEST_CASE1_END
      
      Or use xfstests test 286: https://github.com/dmonakhov/xfstests/blob/devel/286
      
      This patch makes the state machine simple and clean:

      (1) xxx_end_io schedules the final extent conversion simply by calling
          ext4_add_complete_io(), which appends it to ei->i_completed_io_list
          NOTE1: because of (2A), work should be queued only if
          ->i_completed_io_list was empty; otherwise the work is already scheduled.

      (2) ext4_flush_completed_IO is responsible for handling all pending
          end_io from ei->i_completed_io_list
          The flushing sequence consists of the following stages:
          A) LOCKED: Atomically drain completed_io_list to local_list
          B) Perform extents conversion
          C) LOCKED: move converted io's to to_free list for final deletion
             This logic depends on the context we were called from.
          D) Final end_io context destruction
          NOTE2: i_mutex is no longer required because end_io->flags modification
          is protected by ei->i_completed_io_lock
      
      Full list of changes:
      - Move all completion end_io related routines to page-io.c in order to improve
        logic locality
      - Move open-coded logic from various xx_end_xx routines to ext4_add_complete_io()
      - Remove EXT4_IO_END_FSYNC
      - Improve SMP scalability by removing the useless i_mutex, which does not
        protect io->flags anymore.
      - Reduce lock contention on i_completed_io_lock by optimizing the list walk.
      - Rename ext4_end_io_nolock to ext4_end_io and make it static
      - Check the flush completion status in ext4_ext_punch_hole(), because it is
        not a good idea to punch blocks from a corrupted inode.
      
      Changes since V3 (in response to Jan's comments):
        Fall back to the active flush_completed_IO() approach in order to prevent
        performance issues with nolocked DIO reads.
      Changes since V2:
        Fix a use-after-free caused by a race between truncate and end_io_work
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
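
The queue-then-drain pattern described in (1) and (2) above can be sketched in user space roughly as follows. This is an illustration under assumptions, not the kernel implementation: `struct io_end`, `struct inode_io` and both functions are hypothetical names, a pthread mutex plays the role of i_completed_io_lock, the list is a simple LIFO so ordering is not preserved, and stage C (the to_free list) is folded into stage D for brevity.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical completed-io record. */
struct io_end {
    struct io_end *next;
    int converted;
};

/* Hypothetical per-inode state. */
struct inode_io {
    pthread_mutex_t lock;       /* stands in for i_completed_io_lock */
    struct io_end *completed;   /* stands in for i_completed_io_list */
};

/*
 * (1) Queue a finished request. Returns 1 if the list was empty, meaning
 * the caller should schedule the worker; 0 if work is already pending.
 */
static int add_complete_io(struct inode_io *ii, struct io_end *io)
{
    int was_empty;

    pthread_mutex_lock(&ii->lock);
    was_empty = (ii->completed == NULL);
    io->next = ii->completed;
    ii->completed = io;
    pthread_mutex_unlock(&ii->lock);
    return was_empty;
}

/* (2) Flush all pending requests; returns how many were handled. */
static int flush_completed_io(struct inode_io *ii)
{
    struct io_end *local, *io;
    int n = 0;

    /* A) LOCKED: atomically drain the shared list into a local one. */
    pthread_mutex_lock(&ii->lock);
    local = ii->completed;
    ii->completed = NULL;
    pthread_mutex_unlock(&ii->lock);

    /* B) Convert without holding the lock (placeholder for real work). */
    for (io = local; io != NULL; io = io->next)
        io->converted = 1;

    /* D) Destroy the converted records. */
    while (local != NULL) {
        io = local;
        local = local->next;
        assert(io->converted);
        free(io);
        n++;
    }
    return n;
}
```

Because the lock is held only while the list head is swapped, producers calling `add_complete_io()` are never blocked behind the conversion work itself.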
    • ext4: give i_aiodio_unwritten a more appropriate name · e27f41e1
      Authored by Dmitry Monakhov
      The AIO/DIO prefix is wrong because the counter accounts for unwritten
      extents, which may also be scheduled from buffered write endio
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: ext4_inode_info diet · f45ee3a1
      Authored by Dmitry Monakhov
      The generic inode has an unused i_private pointer, which may be used as
      cur_aio_dio storage.

      TODO: If cur_aio_dio is passed as an argument to get_block_t, this will
            allow concurrent AIO_DIO requests.
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  8. 05 Sep, 2012 (2 commits)
    • ext4: grow the s_group_info array as needed · 28623c2f
      Authored by Theodore Ts'o
      Previously we allocated the s_group_info array with enough space for
      any future possible growth of the file system via online resize.  This
      is unfortunate because it wastes memory, and it doesn't work for the
      meta_bg scheme, since there is no limit based on the number of
      reserved gdt blocks.  So add the code to grow the s_group_info array
      as needed.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: grow the s_flex_groups array as needed when resizing · 117fff10
      Authored by Theodore Ts'o
      Previously, we allocated the s_flex_groups array to the maximum size
      that the file system could be resized.  There were two problems with
      this approach.  First, it wasted memory in the common case where the
      file system was not resized.  Secondly, once we start allowing online
      resizing using the meta_bg scheme, there is no maximum size that the
      file system can be resized.  So instead, we need to grow the
      s_flex_groups array at online resize time.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
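
The "grow as needed" idea shared by the two commits above can be sketched in user-space C. The names `struct group_table`, `roundup_pow2()` and `table_grow()` are hypothetical, and rounding the allocation up to a power of two is an assumption made for this sketch to limit how often a reallocation happens. The kernel code also has to care about concurrent readers and kernel allocators, which this sketch ignores.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical table holding one slot per block group. */
struct group_table {
    int *slots;     /* per-group data */
    size_t size;    /* number of slots currently allocated */
};

/* Smallest power of two that is >= n (for n >= 1). */
static size_t roundup_pow2(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/*
 * Make sure the table can hold ngroups slots. Allocates only when the
 * current array is too small. Returns 0 on success, -1 if out of memory
 * (in which case the old table is left untouched).
 */
static int table_grow(struct group_table *t, size_t ngroups)
{
    size_t new_size;
    int *new_slots;

    assert(ngroups > 0);
    if (ngroups <= t->size)
        return 0;               /* already large enough */

    new_size = roundup_pow2(ngroups);
    new_slots = calloc(new_size, sizeof(*new_slots));
    if (new_slots == NULL)
        return -1;

    if (t->slots != NULL)
        memcpy(new_slots, t->slots, t->size * sizeof(*new_slots));
    free(t->slots);
    t->slots = new_slots;
    t->size = new_size;
    return 0;
}
```

A resize path would call `table_grow()` with the new group count before touching the new groups, so memory is spent only when the file system actually grows.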
  9. 17 Aug, 2012 (2 commits)
    • ext4: make the zero-out chunk size tunable · 67a5da56
      Authored by Zheng Liu
      Currently in ext4 the length of the zero-out chunk is set to 7 file
      system blocks.  But if an inode has uninitialized extents from using
      fallocate to preallocate space, and the workload issues many random
      writes, this can fragment the extent tree and make it grow
      unnecessarily.

      So create a new sysfs tunable, extent_max_zeroout_kb, which controls
      the maximum size where blocks will be zeroed out instead of creating a
      new uninitialized extent.  The default of this has been set to 32kb.
      
      CC: Zach Brown <zab@zabbo.net>
      CC: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
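
As a worked example of what a kilobyte-based tunable means in file system blocks, the sketch below converts the limit using the block size expressed as a power of two. The function names are hypothetical, and the `<=` comparison used to decide between zeroing and splitting is an assumption for illustration rather than a statement of the kernel's exact condition.

```c
#include <assert.h>

/*
 * Convert a zero-out limit in KiB to a limit in blocks.
 * blocksize_bits is log2 of the block size in bytes (12 for 4 KiB blocks).
 */
static unsigned int max_zeroout_blocks(unsigned int max_zeroout_kb,
                                       unsigned int blocksize_bits)
{
    assert(blocksize_bits >= 10 && blocksize_bits < 32);
    return max_zeroout_kb >> (blocksize_bits - 10);
}

/*
 * Returns 1 if a region of len_blocks should be zeroed out, 0 if a new
 * uninitialized extent should be created instead.
 */
static int should_zeroout(unsigned int len_blocks,
                          unsigned int max_zeroout_kb,
                          unsigned int blocksize_bits)
{
    return len_blocks <= max_zeroout_blocks(max_zeroout_kb, blocksize_bits);
}
```

With the 32kb default and 4 KiB blocks this works out to 8 blocks, while with 1 KiB blocks the same setting allows 32 blocks.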
    • ext4: add max_dir_size_kb mount option · df981d03
      Authored by Theodore Ts'o
      Very large directories can cause significant performance problems, or
      perhaps even invoke the OOM killer, if the process is running in a
      highly constrained memory environment (whether it is a VM with a small
      amount of memory or a small memory cgroup).
      
      So it is useful, in cloud server/data center environments, to be able
      to set a filesystem-wide cap on the maximum size of a directory, to
      ensure that directories never get larger than a sane size.  We do this
      via a new mount option, max_dir_size_kb.  If there is an attempt to
      grow the directory larger than max_dir_size_kb, the system call will
      return ENOSPC instead.
      
      Google-Bug-Id: 6863013
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
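
The directory size cap described above amounts to a check made before a directory is allowed to grow. A minimal sketch follows, in which the function name `dir_may_grow()` is hypothetical, a limit of 0 is assumed to mean "no cap", and the exact comparison is an assumption rather than the kernel's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Decide whether a directory of dir_size_bytes may be extended.
 * Returns 0 if growth is allowed, -ENOSPC if the directory has already
 * reached max_dir_size_kb. A limit of 0 disables the check (assumption).
 */
static int dir_may_grow(uint64_t dir_size_bytes, uint32_t max_dir_size_kb)
{
    if (max_dir_size_kb == 0)
        return 0;
    if ((dir_size_bytes >> 10) >= max_dir_size_kb)
        return -ENOSPC;
    return 0;
}
```

The option itself would be supplied at mount time in the usual `-o name=value` form, for example `-o max_dir_size_kb=8`; a directory that has already reached 8 KiB would then fail to grow with ENOSPC.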
  10. 23 Jul, 2012 (3 commits)
  11. 10 Jul, 2012 (2 commits)
    • ext4: add a new nolock flag in ext4_map_blocks · 729f52c6
      Authored by Zheng Liu
      The EXT4_GET_BLOCKS_NO_LOCK flag is added to indicate that we don't need
      to acquire the i_data_sem lock in ext4_map_blocks.  Meanwhile, this patch
      changes ext4_get_block() to not start a new journal, because when we do
      an overwrite DIO, no metadata needs to be modified.

      We define a new function called ext4_get_block_write_nolock, which is
      used in the DIO overwrite nolock path.  This function doesn't try to
      acquire the i_data_sem lock and doesn't start a new journal, as it only
      does a lookup.
      
      CC: Tao Ma <tm@tao.ma>
      CC: Eric Sandeen <sandeen@redhat.com>
      CC: Robin Dong <hao.bigrat@gmail.com>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: fix overhead calculation used by ext4_statfs() · 952fc18e
      Authored by Theodore Ts'o
      Commit f975d6bc introduced a bug which caused ext4_statfs() to
      miscalculate the number of file system overhead blocks.  This causes
      the f_blocks field in the statfs structure to be larger than it should
      be.  This would in turn cause the "df" output to show the number of
      data blocks in the file system and the number of data blocks used to
      be larger than they should be.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@kernel.org
  12. 01 Jul, 2012 (1 commit)
  13. 31 May, 2012 (1 commit)
  14. 29 May, 2012 (1 commit)
  15. 27 May, 2012 (1 commit)
  16. 16 May, 2012 (1 commit)
  17. 30 Apr, 2012 (8 commits)