1. 05 Sep, 2014 1 commit
  2. 02 Sep, 2014 5 commits
    • Z
      ext4: track extent status tree shrinker delay statistics · eb68d0e2
      Zheng Liu committed
      This commit adds some statistics to the extent status tree shrinker.  We
      add them so that we can collect more details when we encounter a stall
      caused by the extent status tree shrinker.  We track the following
      statistics:
        stats:
          the number of all objects on all extent status trees
          the number of reclaimable objects on lru list
          cache hits/misses
          the last sorted interval
          the number of inodes on lru list
        average:
          scan time for shrinking some objects
          the number of shrunk objects
        maximum:
          the inode that has max nr. of objects on lru list
          the maximum scan time for shrinking some objects
      
      The output looks like below:
        $ cat /proc/fs/ext4/sda1/es_shrinker_info
        stats:
          28228 objects
          6341 reclaimable objects
          5281/631 cache hits/misses
          586 ms last sorted interval
          250 inodes on lru list
        average:
          153 us scan time
          128 shrunk objects
        maximum:
          255 inode (255 objects, 198 reclaimable)
          125723 us max scan time
      
      If the lru list has never been sorted, the following line will not be
      printed:
          586 ms last sorted interval
      If the lru list is empty, the following lines will not be printed
      either:
          250 inodes on lru list
        ...
        maximum:
          255 inode (255 objects, 198 reclaimable)
          0 us max scan time
      
      This commit also defines a new tracepoint that prints some details in
      __ext4_es_shrink().
      
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      eb68d0e2
    • T
      ext4: rename ext4_ext_find_extent() to ext4_find_extent() · ed8a1a76
      Theodore Ts'o committed
      Make the function name less redundant.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      ed8a1a76
    • T
      ext4: drop EXT4_EX_NOFREE_ON_ERR from rest of extents handling code · dfe50809
      Theodore Ts'o committed
      Drop EXT4_EX_NOFREE_ON_ERR from ext4_ext_create_new_leaf(),
      ext4_split_extent(), ext4_convert_unwritten_extents_endio().
      
      This requires fixing all of their callers to cope with
      ext4_ext_find_extent() potentially freeing the struct ext4_ext_path
      object in case of an error, and there are interlocking dependencies all
      the way up to ext4_ext_map_blocks(), ext4_swap_extents(), and
      ext4_ext_remove_space().
      
      Once this is done, we can drop the EXT4_EX_NOFREE_ON_ERR flag since it
      is no longer necessary.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      dfe50809
    • T
      ext4: teach ext4_ext_find_extent() to free path on error · 705912ca
      Theodore Ts'o committed
      Right now, there are places where it is all too easy to leak memory
      on an error path, via a usage like this:

      	struct ext4_ext_path *path = NULL;
      
      	while (...) {
      		...
      		path = ext4_ext_find_extent(inode, block, path, 0);
      		if (IS_ERR(path)) {
      			/* oops, if path was non-NULL before the call to
      			   ext4_ext_find_extent, we've leaked it!  :-(  */
      			...
      			return PTR_ERR(path);
      		}
      		...
      	}
      
      Unfortunately, there are some code paths where we are doing the
      following instead:
      
      	path = ext4_ext_find_extent(inode, block, orig_path, 0);
      
      and where it's important that we _not_ free orig_path in the case
      where ext4_ext_find_extent() returns an error.
      
      So change the function signature of ext4_ext_find_extent() so that it
      takes a struct ext4_ext_path ** for its third argument, and by
      default, on an error, it will free the struct ext4_ext_path, and then
      zero out the struct ext4_ext_path * pointer.  In order to avoid
      causing problems, we add a flag EXT4_EX_NOFREE_ON_ERR which causes
      ext4_ext_find_extent() to use the original behavior of forcing the
      caller to deal with freeing the original path pointer on the error
      case.
      
      The goal is to get rid of EXT4_EX_NOFREE_ON_ERR entirely, but this
      allows for a gentle transition and makes the patches easier to verify.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      
      		
      705912ca
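The ownership change described in 705912ca can be illustrated with a small userspace sketch. It is not the kernel code: `struct path_buf`, `lookup_path()` and `LOOKUP_NOFREE_ON_ERR` are hypothetical stand-ins for `struct ext4_ext_path`, `ext4_ext_find_extent()` and `EXT4_EX_NOFREE_ON_ERR`, and for simplicity the sketch returns an int status instead of an ERR_PTR-encoded pointer. What it does show is the pattern from the message: the callee takes a pointer to the caller's pointer, frees the buffer on error, and clears the caller's pointer unless the opt-out flag is set.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ext4_ext_path. */
struct path_buf {
    int depth;
};

/* Hypothetical flag playing the role of EXT4_EX_NOFREE_ON_ERR. */
#define LOOKUP_NOFREE_ON_ERR 0x1

/*
 * Look up `block`, reusing *ppath if the caller already holds a buffer.
 * Returns 0 on success or a negative errno value on failure.
 * On failure the buffer is freed and *ppath is set to NULL, unless the
 * caller passed LOOKUP_NOFREE_ON_ERR and keeps ownership of it.
 */
static int lookup_path(long block, struct path_buf **ppath, int flags)
{
    assert(ppath != NULL);

    if (block < 0) {                    /* simulated lookup failure */
        if (!(flags & LOOKUP_NOFREE_ON_ERR)) {
            free(*ppath);               /* free(NULL) is a no-op */
            *ppath = NULL;              /* nothing left to leak or reuse */
        }
        return -EIO;
    }
    if (*ppath == NULL) {
        *ppath = calloc(1, sizeof(**ppath));
        if (*ppath == NULL)
            return -ENOMEM;
    }
    (*ppath)->depth = 1;
    return 0;
}
```

With this shape, a loop that calls `lookup_path(block, &path, 0)` repeatedly can simply return the error code on failure, because the buffer from earlier iterations has already been released.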
    • T
      ext4: fix accidental flag aliasing in ext4_map_blocks flags · bd30d702
      Theodore Ts'o committed
      Commit b8a86845 introduced an accidental flag aliasing between
      EXT4_EX_NOCACHE and EXT4_GET_BLOCKS_CONVERT_UNWRITTEN.

      Fortunately, this didn't introduce any untoward side effects --- we
      got lucky.  Nevertheless, fix this and leave a warning to hopefully
      keep this from happening in the future.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      bd30d702
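To make the aliasing problem in bd30d702 concrete, here is a minimal sketch with made-up flag values; none of the names or numbers below are the real ext4 definitions. Two flag families share one `flags` argument, and when a flag from one family reuses a bit from the other, the callee cannot tell them apart. According to its message the commit only leaves a warning; the compile-time assertion shown here is a different, stricter guard added purely for illustration.

```c
#include <assert.h>

/* Hypothetical flag values; these are NOT the real ext4 definitions. */
#define MAP_CREATE            0x0001
#define MAP_CONVERT_UNWRITTEN 0x0100
#define EX_NOCACHE_BAD        0x0100  /* accidentally reuses the bit above */
#define EX_NOCACHE_FIXED      0x0400  /* distinct bit after the fix */

/* Every MAP_* flag that travels in the same argument as the EX_* flags. */
#define MAP_ALL_FLAGS (MAP_CREATE | MAP_CONVERT_UNWRITTEN)

/* Compile-time guard: the build fails if the two flag sets overlap. */
_Static_assert((EX_NOCACHE_FIXED & MAP_ALL_FLAGS) == 0,
               "EX_* flags must not alias MAP_* flags");

/* The callee only sees bits, so an aliased flag is misread. */
static int wants_convert(int flags)
{
    return (flags & MAP_CONVERT_UNWRITTEN) != 0;
}
```

Passing `EX_NOCACHE_BAD` makes `wants_convert()` report a conversion request that the caller never made, while `EX_NOCACHE_FIXED` does not.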
  3. 31 Aug, 2014 2 commits
  4. 30 Aug, 2014 2 commits
  5. 24 Aug, 2014 2 commits
  6. 29 Jul, 2014 1 commit
  7. 15 Jul, 2014 3 commits
    • Z
      ext4: make ext4_has_inline_data() an inline function · 83447ccb
      Zheng Liu committed
      ext4_has_inline_data() is now called from many code paths, so make it
      an inline function to avoid burning CPU cycles on function calls.
      
      Change in text size:
      
               text     data      bss     dec     hex filename
      before: 326110    19258    5528  350896   55ab0 fs/ext4/ext4.o
      after:  326227    19258    5528  351013   55b25 fs/ext4/ext4.o
      
      I use the following script to measure the CPU usage.
      
        #!/bin/bash
      
        shm_base='/dev/shm'
        img=${shm_base}/ext4-img
        mnt=/mnt/loop
      
        e2fsprgs_base=$HOME/e2fsprogs
        mkfs=${e2fsprgs_base}/misc/mke2fs
        fsck=${e2fsprgs_base}/e2fsck/e2fsck
      
        sudo umount $mnt
        dd if=/dev/zero of=$img bs=4k count=3145728
        ${mkfs} -t ext4 -O inline_data -F $img
        sudo mount -t ext4 -o loop $img $mnt
      
        # start testing...
        testdir="${mnt}/testdir"
        mkdir $testdir
        cd $testdir
      
        echo "start testing..."
        for ((cnt=0;cnt<100;cnt++)); do
      
        for ((i=0;i<5;i++)); do
        	for ((j=0;j<5;j++)); do
        		for ((k=0;k<5;k++)); do
        			for ((l=0;l<5;l++)); do
        				mkdir -p $i/$j/$k/$l
        				echo "$i-$j-$k-$l" > $i/$j/$k/$l/testfile
        			done
        		done
        	done
        done
      
        ls -R $testdir > /dev/null
        rm -rf $testdir/*
      
        done
      
      The result of `perf top -G -U` is as below.
      
      vanilla:
       13.92%  [ext4]  [k] ext4_do_update_inode
        9.36%  [ext4]  [k] __ext4_get_inode_loc
        4.07%  [ext4]  [k] ftrace_define_fields_ext4_writepages
        3.83%  [ext4]  [k] __ext4_handle_dirty_metadata
        3.42%  [ext4]  [k] ext4_get_inode_flags
        2.71%  [ext4]  [k] ext4_mark_iloc_dirty
        2.46%  [ext4]  [k] ftrace_define_fields_ext4_direct_IO_enter
        2.26%  [ext4]  [k] ext4_get_inode_loc
        2.22%  [ext4]  [k] ext4_has_inline_data
        [...]
      
      After applying the patch, ext4_has_inline_data() no longer shows up
      because it has been inlined and perf cannot sample it separately.  This
      does not prove that CPU cycles are saved overall, but at least the
      function call overhead is eliminated.  So IMHO we'd better inline this
      function.
      
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      83447ccb
    • L
      ext4: fix punch hole on files with indirect mapping · 4f579ae7
      Lukas Czerner committed
      Currently the punch hole code for files with direct/indirect mapping has
      some problems which may lead to data loss. For example (from Jan Kara):
      
      fallocate -n -p 10240000 4096
      
      will punch the range 10240000 - 12632064 instead of the range 10240000 -
      10244096.

      Also the code is a bit weird and it's not using the infrastructure
      provided by indirect.c, but rather creating its own way.
      
      This patch fixes the issues and also makes the operation run 4 times
      faster in my testing (punching out a 60GB file). It uses an approach
      similar to the one in ext4_ind_truncate(), which takes advantage of the
      ext4_free_branches() function.

      Also rename ext4_free_hole_blocks() to something more sensible, like
      the equivalent we have for extent mapped files. Call it
      ext4_ind_remove_space().

      This has been tested mostly with fsx and the xfstests that exercise
      punch hole but do not require unwritten extents, which are not
      supported with direct/indirect mapping. No problems showed up even with
      1024k block size.
      
      CC: stable@vger.kernel.org
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      4f579ae7
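The expected range in the 4f579ae7 example is just offset plus length: 10240000 + 4096 = 10244096. As a worked illustration of that arithmetic, the sketch below computes which whole blocks lie entirely inside a byte range. It is not taken from the patch: `struct block_range` and `punch_whole_blocks()` are hypothetical names, and the 4096-byte block size in the usage note is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open range of block numbers: [first, end). Hypothetical type. */
struct block_range {
    uint64_t first;
    uint64_t end;
};

/*
 * Return the blocks that lie entirely inside the byte range
 * [offset, offset + len). Partially covered blocks at either edge are
 * excluded. block_size must be non-zero.
 */
static struct block_range punch_whole_blocks(uint64_t offset, uint64_t len,
                                             uint64_t block_size)
{
    struct block_range r;
    uint64_t stop = offset + len;

    r.first = (offset + block_size - 1) / block_size;  /* round up */
    r.end = stop / block_size;                         /* round down */
    if (r.end < r.first)
        r.end = r.first;                               /* empty range */
    return r;
}
```

Assuming 4096-byte blocks, `punch_whole_blocks(10240000, 4096, 4096)` covers exactly one block, number 2500, since 10240000 = 2500 * 4096 and 10244096 = 2501 * 4096.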
    • T
      ext4: remove metadata reservation checks · 71d4f7d0
      Theodore Ts'o committed
      Commit 27dd4385 ("ext4: introduce reserved space") reserves 2% of
      the file system space to make sure metadata allocations will always
      succeed.  Given that, tracking the reservation of metadata blocks is
      no longer necessary.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      71d4f7d0
  8. 12 May, 2014 3 commits
    • S
      ext4: make local functions static · c197855e
      Stephen Hemminger committed
      I have been running make namespacecheck to look for unneeded globals, and
      found these in ext4.
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      c197855e
    • D
      ext4: fix block bitmap initialization under sparse_super2 · 1beeef1b
      Darrick J. Wong committed
      The ext4_bg_has_super() function doesn't know about the new rules for
      where backup superblocks go on a sparse_super2 filesystem.  Therefore,
      block bitmap initialization doesn't know that it shouldn't reserve
      space for backups in groups that are never going to contain backups.
      The result of this is e2fsck complaining about the block bitmap being
      incorrect (fortunately not in a way that results in cross-linked
      files), so fix the whole thing.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      1beeef1b
    • N
      ext4: fix data integrity sync in ordered mode · 1c8349a1
      Namjae Jeon committed
      When we perform a data integrity sync we tag all the dirty pages with
      PAGECACHE_TAG_TOWRITE at the start of ext4_da_writepages.  Later we check
      for this tag in write_cache_pages_da and create a struct
      mpage_da_data containing contiguously indexed pages tagged with this
      tag and sync these pages with a call to mpage_da_map_and_submit.  This
      process is repeated in a while loop until all the PAGECACHE_TAG_TOWRITE
      pages are synced. We also do journal start and stop in each iteration.
      journal_stop could initiate a journal commit which would call
      ext4_writepage which in turn will call ext4_bio_write_page even for
      delayed OR unwritten buffers. When ext4_bio_write_page is called for
      such buffers, it does not sync them, but it still clears the
      PAGECACHE_TAG_TOWRITE tag of the corresponding page, and hence these
      pages are not synced by the currently running data integrity sync
      either. We will end up with dirty pages although the sync has completed.

      This could cause data loss when the sync call is followed
      by a truncate_pagecache call, which is exactly the case in
      collapse_range.  (It causes the generic/127 failure in xfstests.)

      To avoid this issue, we can use set_page_writeback_keepwrite instead of
      set_page_writeback; the former doesn't clear the TOWRITE tag.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      1c8349a1
  9. 07 May, 2014 1 commit
  10. 22 Apr, 2014 1 commit
  11. 21 Apr, 2014 2 commits
    • L
      ext4: rename uninitialized extents to unwritten · 556615dc
      Lukas Czerner committed
      Currently in ext4 there is quite a mess when it comes to naming
      unwritten extents. Sometimes we call it uninitialized and sometimes we
      refer to it as unwritten.
      
      The right name for the extent which has been allocated but does not
      contain any written data is _unwritten_. Other file systems are
      using this name consistently, even the buffer head state refers to it as
      unwritten. We need to fix this confusion in ext4.
      
      This commit changes every reference to an uninitialized extent (meaning
      allocated but unwritten) to unwritten extent. This includes comments,
      function names and variable names. It even covers abbreviation of the
      word uninitialized (such as uninit) and some misspellings.
      
      This commit does not change any of the code paths at all. This has been
      confirmed by comparing md5sums of the assembly code of each object file
      after all the function names were stripped from it.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      556615dc
    • L
      ext4: get rid of EXT4_MAP_UNINIT flag · 090f32ee
      Lukas Czerner committed
      Currently EXT4_MAP_UNINIT is used in the dioread_nolock case to mark the
      cases where we're using dioread_nolock and we're writing into either
      an unallocated or an unwritten extent, because we need to make sure that
      any DIO write into that inode will wait for the extent conversion.

      However, EXT4_MAP_UNINIT is not only an entirely misleading name but also
      unnecessary, because we can check for EXT4_MAP_UNWRITTEN in the
      dioread_nolock case instead.

      This commit removes the EXT4_MAP_UNINIT flag.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      090f32ee
  12. 11 Apr, 2014 1 commit
    • T
      ext4: move ext4_update_i_disksize() into mpage_map_and_submit_extent() · 622cad13
      Theodore Ts'o committed
      The function ext4_update_i_disksize() is used in only one place, in
      the function mpage_map_and_submit_extent().  Move its code to simplify
      the code paths, and also move the call to ext4_mark_inode_dirty() into
      the i_data_sem's critical region, to be consistent with all of the
      other places where we update i_disksize.  That way, we also keep the
      raw_inode's i_disksize protected, to avoid the following race:
      
            CPU #1                                 CPU #2
      
         down_write(&i_data_sem)
         Modify i_disk_size
         up_write(&i_data_sem)
                                              down_write(&i_data_sem)
                                              Modify i_disk_size
                                              Copy i_disk_size to on-disk inode
                                              up_write(&i_data_sem)
         Copy i_disk_size to on-disk inode
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      622cad13
  13. 25 Mar, 2014 2 commits
  14. 19 Mar, 2014 2 commits
    • T
      ext4: each filesystem creates and uses its own mb_cache · 9c191f70
      T Makphaibulchoke committed
      This patch adds new interfaces to create and destroy the cache,
      ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes
      the cache creation and destroy calls from ext4_init_xattr() and
      ext4_exit_xattr() in fs/ext4/xattr.c.

      fs/ext4/super.c has been changed so that when a filesystem is mounted
      a cache is allocated and attached to its ext4_sb_info structure.

      fs/mbcache.c has been changed so that only one slab allocator is
      allocated and used by all mbcache structures.
      Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
      9c191f70
    • L
      ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate · b8a86845
      Lukas Czerner committed
      Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
      functionality as xfs ioctl XFS_IOC_ZERO_RANGE.

      It can be used to convert a range of a file to zeros, preferably without
      issuing data IO. Blocks should be preallocated for the regions that span
      holes in the file, and the entire range is preferably converted to
      unwritten extents.

      This can also be used to preallocate blocks past EOF in the same way as
      with fallocate. The FALLOC_FL_KEEP_SIZE flag should cause the inode
      size to remain the same.

      Also add appropriate tracepoints.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      b8a86845
  15. 24 Feb, 2014 1 commit
  16. 22 Feb, 2014 1 commit
  17. 17 Feb, 2014 1 commit
  18. 20 Dec, 2013 1 commit
  19. 12 Nov, 2013 1 commit
  20. 09 Nov, 2013 1 commit
  21. 18 Oct, 2013 1 commit
    • T
      ext4: add ratelimiting to ext4 messages · efbed4dc
      Theodore Ts'o committed
      A storage device that suddenly disappears, or significant file system
      corruption, can result in a huge flood of messages being sent to the
      console.  This can overflow the file system containing
      /var/log/messages, or if a serial console is configured, it can slow
      down the system so much that a hardware watchdog can end up triggering,
      forcing a system reboot.

      Google-Bug-Id: 7258357
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      efbed4dc
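The idea behind efbed4dc can be sketched as a simple windowed rate limiter: allow at most `burst` messages per `interval`, and count the rest as suppressed. This is a userspace illustration only, not the kernel's ratelimit implementation and not the code from this commit; `struct ratelimit`, `ratelimit_allow()` and the explicit `now` argument are assumptions chosen so the behavior is easy to follow.

```c
#include <assert.h>

/* Hypothetical limiter state; time is in caller-defined units. */
struct ratelimit {
    long interval;   /* length of one window */
    int  burst;      /* messages allowed per window */
    long begin;      /* start time of the current window */
    int  printed;    /* messages allowed in the current window */
    int  missed;     /* messages suppressed in the current window */
};

/* Return 1 if a message may be emitted at time `now`, 0 to suppress it. */
static int ratelimit_allow(struct ratelimit *rs, long now)
{
    if (now - rs->begin >= rs->interval) {
        /* A new window starts: reset the counters. */
        rs->begin = now;
        rs->printed = 0;
        rs->missed = 0;
    }
    if (rs->printed < rs->burst) {
        rs->printed++;
        return 1;
    }
    rs->missed++;
    return 0;
}
```

A caller would wrap each log statement in `if (ratelimit_allow(&rs, now))`, so a flood of identical errors costs at most `burst` console writes per window.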
  22. 04 Sep, 2013 1 commit
    • C
      direct-io: Implement generic deferred AIO completions · 7b7a8665
      Christoph Hellwig committed
      Add support to the core direct-io code to defer AIO completions to user
      context using a workqueue.  This replaces opencoded and less efficient
      code in XFS and ext4 (we save a memory allocation for each direct IO)
      and will be needed to properly support O_(D)SYNC for AIO.
      
      The communication between the filesystem and the direct I/O code requires
      a new buffer head flag, which is a bit ugly but not avoidable until the
      direct I/O code stops abusing the buffer_head structure for communicating
      with the filesystems.
      
      Currently this creates a per-superblock unbound workqueue for these
      completions, which is taken from an earlier patch by Jan Kara.  I'm
      not really convinced about this use and would prefer a "normal" global
      workqueue with a high concurrency limit, but this needs further discussion.
      
      JK: Fixed ext4 part, dynamic allocation of the workqueue.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      7b7a8665
  23. 29 Aug, 2013 4 commits
    • D
      ext4: mark block group as corrupt on inode bitmap error · 87a39389
      Darrick J. Wong committed
      If we detect either a discrepancy between the inode bitmap and the
      inode counts or the inode bitmap fails to pass validation checks, mark
      the block group corrupt and refuse to allocate or deallocate inodes
      from the group.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      87a39389
    • D
      ext4: mark block group as corrupt on block bitmap error · 163a203d
      Darrick J. Wong committed
      When we notice a block-bitmap corruption (because of device failure or
      something else), we should mark this group as corrupt and prevent
      further block allocations/deallocations from it. Currently, we end up
      generating one error message for every block in the bitmap. This
      could potentially make the system unstable, as noticed in some
      bugs. With this patch, the error is printed only the first time
      and the entire block group is marked as corrupted. This prevents future
      allocations/deallocations from it.

      Also tested by corrupting the block bitmap and forcefully introducing
      the mb_free_blocks error:
      (1) create a largefile (2GB)
      $ dd if=/dev/zero of=largefile oflag=direct bs=10485760 count=200
      (2) umount filesystem. use dumpe2fs to see which block-bitmaps
      are in use by largefile and note their block numbers
      (3) use dd to zero-out the used block bitmaps
      $ dd if=/dev/zero of=/dev/hdc4 bs=4096 seek=14 count=8 oflag=direct
      (4) mount the FS and delete the largefile.
      (5) recreate the largefile. verify that the new largefile does not
      get any blocks from the groups marked as bad.
      Without the patch, we will see mb_free_blocks error for each bit in
      each zero'ed out bitmap at (4). With the patch, we only see the error
      once per blockgroup:
      [  309.706803] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 15: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.720824] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 14: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.732858] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
      [  309.748321] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 13: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.760331] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
      [  309.769695] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 12: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.781721] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
      [  309.798166] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 11: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.810184] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
      [  309.819532] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 10: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      
      Google-Bug-Id: 7258357
      
      [darrick.wong@oracle.com]
      Further modifications (by Darrick) to make more obvious that this corruption
      bit applies to blocks only.  Set the corruption flag if the block group bitmap
      verification fails.
      
      Original-author: Aditya Kali <adityakali@google.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      163a203d
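The behavior described in 163a203d (report the inconsistency once, mark the group, refuse further allocations) can be sketched in userspace as follows. This is a simplified model, not the ext4 code: `struct group_info`, `group_check()` and `group_alloc()` are hypothetical, and the consistency check is reduced to comparing two counters.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical per-group state. */
struct group_info {
    unsigned int free_in_bitmap;  /* free clusters counted from the bitmap */
    unsigned int free_in_desc;    /* free clusters recorded in the descriptor */
    int corrupt;                  /* sticky flag, set on first mismatch */
    int errors_logged;            /* number of error messages emitted */
};

/* Validate a group; log and mark it corrupt only on the first failure. */
static int group_check(struct group_info *g, unsigned int group_no)
{
    if (g->corrupt)
        return -EIO;              /* already reported: stay quiet */
    if (g->free_in_bitmap != g->free_in_desc) {
        fprintf(stderr, "group %u: %u clusters in bitmap, %u in gd; "
                "marking group corrupt\n",
                group_no, g->free_in_bitmap, g->free_in_desc);
        g->corrupt = 1;
        g->errors_logged++;
        return -EIO;
    }
    return 0;
}

/* Allocate one cluster, refusing to touch a corrupt group. */
static int group_alloc(struct group_info *g, unsigned int group_no)
{
    int err = group_check(g, group_no);

    if (err)
        return err;
    if (g->free_in_bitmap == 0)
        return -ENOSPC;
    g->free_in_bitmap--;
    g->free_in_desc--;
    return 0;
}
```

Repeated allocation attempts against a mismatched group fail every time but produce a single message, which matches the once-per-group output shown in the log excerpt above.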
    • D
      ext4: fix type declaration of ext4_validate_block_bitmap · dbde0abe
      Darrick J. Wong committed
      The block_group parameter to ext4_validate_block_bitmap is used
      as an ext4_group_t inside the function, and all callers pass in the
      same type.  We might as well use the typedef consistently instead
      of open-coding the 'unsigned int'.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      dbde0abe
    • Z
      ext4: isolate ext4_extents.h file · d7b2a00c
      Zheng Liu committed
      After applying commit 4a092d73, we have reduced the number of
      source files that need to #include ext4_extents.h.  But we can do
      better.

      This commit defines ext4_zeroout_es() in extents.c and moves
      EXT_MAX_BLOCKS into ext4.h so that indirect.c and ioctl.c no longer
      need to include ext4_extents.h.  Meanwhile, extent_status.c only needs
      to include this file when ES_AGGRESSIVE_TEST is defined.  In addition,
      this commit removes a duplicated declaration in trace/events/ext4.h.

      After applying this patch, we only need to include ext4_extents.h
      in {super,migrate,move_extents,extents}.c, which makes it easier for us
      to define a new extent disk layout.
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      d7b2a00c