1. 15 Jul, 2014 3 commits
    • Z
      ext4: make ext4_has_inline_data() as a inline function · 83447ccb
      Zheng Liu committed
      ext4_has_inline_data() is now called from many code paths, so make it
      an inline function to avoid burning CPU cycles on function calls.
      
      Change in text size:
      
               text     data      bss     dec     hex filename
      before: 326110    19258    5528  350896   55ab0 fs/ext4/ext4.o
      after:  326227    19258    5528  351013   55b25 fs/ext4/ext4.o
      
      I use the following script to measure the CPU usage.
      
        #!/bin/bash
      
        shm_base='/dev/shm'
        img=${shm_base}/ext4-img
        mnt=/mnt/loop
      
        e2fsprgs_base=$HOME/e2fsprogs
        mkfs=${e2fsprgs_base}/misc/mke2fs
        fsck=${e2fsprgs_base}/e2fsck/e2fsck
      
        sudo umount $mnt
        dd if=/dev/zero of=$img bs=4k count=3145728
        ${mkfs} -t ext4 -O inline_data -F $img
        sudo mount -t ext4 -o loop $img $mnt
      
        # start testing...
        testdir="${mnt}/testdir"
        mkdir $testdir
        cd $testdir
      
        echo "start testing..."
        for ((cnt=0;cnt<100;cnt++)); do
      
        for ((i=0;i<5;i++)); do
        	for ((j=0;j<5;j++)); do
        		for ((k=0;k<5;k++)); do
        			for ((l=0;l<5;l++)); do
        				mkdir -p $i/$j/$k/$l
        				echo "$i-$j-$k-$l" > $i/$j/$k/$l/testfile
        			done
        		done
        	done
        done
      
        ls -R $testdir > /dev/null
        rm -rf $testdir/*
      
        done
      
      The result of `perf top -G -U` is as below.
      
      vanilla:
       13.92%  [ext4]  [k] ext4_do_update_inode
        9.36%  [ext4]  [k] __ext4_get_inode_loc
        4.07%  [ext4]  [k] ftrace_define_fields_ext4_writepages
        3.83%  [ext4]  [k] __ext4_handle_dirty_metadata
        3.42%  [ext4]  [k] ext4_get_inode_flags
        2.71%  [ext4]  [k] ext4_mark_iloc_dirty
        2.46%  [ext4]  [k] ftrace_define_fields_ext4_direct_IO_enter
        2.26%  [ext4]  [k] ext4_get_inode_loc
        2.22%  [ext4]  [k] ext4_has_inline_data
        [...]
      
      After applying the patch, ext4_has_inline_data() no longer appears in
      the profile because it has been inlined and perf cannot sample it
      separately.  This does not prove that CPU cycles are saved overall, but
      at least the function-call overhead is eliminated.  So IMHO we'd better
      inline this function.

      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • L
      ext4: fix punch hole on files with indirect mapping · 4f579ae7
      Lukas Czerner committed
      Currently the punch hole code for files with direct/indirect mapping has
      some problems which may lead to data loss. For example (from Jan Kara):
      
      fallocate -n -p 10240000 4096
      
      will punch the range 10240000 - 12632064 instead of the range 10240000 -
      10244096.

      Also the code is a bit weird and it's not using the infrastructure
      provided by indirect.c, but rather creating its own way.
      
      This patch fixes the issues and also makes the operation run about 4
      times faster in my testing (punching out a 60GB file). It uses an
      approach similar to ext4_ind_truncate(), which takes advantage of the
      ext4_free_branches() function.

      Also rename ext4_free_hole_blocks() to something more sensible, like
      the equivalent we have for extent mapped files. Call it
      ext4_ind_remove_space().

      This has been tested mostly with fsx and some xfstests which test
      punch hole but do not require unwritten extents, which are not
      supported with direct/indirect mapping. No problems showed up even with
      1024k block size.
      
      CC: stable@vger.kernel.org
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
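The byte range in the example above maps to whole filesystem blocks by rounding the start up and the end down. The following is a minimal sketch of that arithmetic, not the kernel code; the function name and the 4096-byte default block size are assumptions made for illustration.

```python
def whole_blocks_in_range(offset, length, block_size=4096):
    """Return (first_block, end_block) for blocks fully inside a byte range.

    end_block is exclusive. Partial blocks at either edge are left out,
    since a punch-hole operation can only zero them, not free them.
    The 4096-byte default block size is an assumption for illustration.
    """
    end = offset + length
    first_block = -(-offset // block_size)  # round the start up
    end_block = end // block_size           # round the end down
    if end_block < first_block:
        end_block = first_block             # range lies inside one block
    return first_block, end_block


# The range from the commit message: offset 10240000, length 4096.
print(whole_blocks_in_range(10240000, 4096))
```

With these inputs the byte range ends at 10240000 + 4096 = 10244096, which is the end offset the commit message expects.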
    • T
      ext4: remove metadata reservation checks · 71d4f7d0
      Theodore Ts'o committed
      Commit 27dd4385 ("ext4: introduce reserved space") reserves 2% of
      the file system space to make sure metadata allocations will always
      succeed.  Given that, tracking the reservation of metadata blocks is
      no longer necessary.
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  2. 12 May, 2014 3 commits
    • S
      ext4: make local functions static · c197855e
      Stephen Hemminger committed
      I have been running make namespacecheck to look for unneeded globals, and
      found these in ext4.
      Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • D
      ext4: fix block bitmap initialization under sparse_super2 · 1beeef1b
      Darrick J. Wong committed
      The ext4_bg_has_super() function doesn't know about the new rules for
      where backup superblocks go on a sparse_super2 filesystem.  Therefore,
      block bitmap initialization doesn't know that it shouldn't reserve
      space for backups in groups that are never going to contain backups.
      The result of this is e2fsck complaining about the block bitmap being
      incorrect (fortunately not in a way that results in cross-linked
      files), so fix the whole thing.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
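The placement rules this fix depends on can be modelled in a few lines. This is a hedged sketch, assuming the usual sparse_super layout (backups in groups 0, 1 and powers of 3, 5 and 7) and that sparse_super2 records at most two backup groups in the superblock; the function and parameter names are hypothetical and do not mirror the kernel's signature.

```python
def is_power_of(n, base):
    """True if n == base**k for some k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def bg_has_super(group, sparse_super=True, sparse_super2=False,
                 backup_bgs=(0, 0)):
    """Model of whether a block group holds a superblock copy.

    backup_bgs stands in for the (at most two) backup groups that a
    sparse_super2 filesystem records; all names here are illustrative.
    """
    if group == 0:
        return True                      # primary superblock
    if sparse_super2:
        return group in backup_bgs       # only the listed groups
    if group == 1 or not sparse_super:
        return True
    return any(is_power_of(group, b) for b in (3, 5, 7))


# Group 9 carries a backup under sparse_super, but not under
# sparse_super2 unless it is one of the two listed groups.
print(bg_has_super(9), bg_has_super(9, sparse_super2=True, backup_bgs=(1, 99)))
```

Bitmap initialization that still applied the sparse_super rule to a sparse_super2 filesystem would reserve space in groups such as 9 that never hold a backup, which is the mismatch e2fsck reported.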
    • N
      ext4: fix data integrity sync in ordered mode · 1c8349a1
      Namjae Jeon committed
      When we perform a data integrity sync we tag all the dirty pages with
      PAGECACHE_TAG_TOWRITE at the start of ext4_da_writepages.  Later we check
      for this tag in write_cache_pages_da and create a struct
      mpage_da_data containing contiguously indexed pages tagged with this
      tag, and sync these pages with a call to mpage_da_map_and_submit.  This
      process is repeated in a while loop until all the PAGECACHE_TAG_TOWRITE
      pages are synced. We also do a journal start and stop in each iteration.
      journal_stop could initiate a journal commit, which would call
      ext4_writepage, which in turn will call ext4_bio_write_page even for
      delayed OR unwritten buffers. When ext4_bio_write_page is called for
      such buffers, it does not sync them, but it does clear the
      PAGECACHE_TAG_TOWRITE tag of the corresponding page, and hence these
      pages are not synced by the currently running data integrity sync
      either. We end up with dirty pages although the sync has completed.
      
      This could cause data loss when the sync call is followed
      by a truncate_pagecache call, which is exactly the case in
      collapse_range.  (It causes the generic/127 failure in xfstests.)

      To avoid this issue, we can replace set_page_writeback with
      set_page_writeback_keepwrite, which doesn't clear the TOWRITE tag.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
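The lost-page scenario described in this commit can be illustrated with a toy model of page tags. This is only a sketch under simplifying assumptions: tags are plain Python sets, writeback ends immediately, and the helper names are invented for the example rather than taken from the kernel.

```python
TOWRITE = "towrite"
WRITEBACK = "writeback"


def set_page_writeback(tags, keep_write=False):
    """Mark a page as under writeback; drop TOWRITE unless keep_write is set."""
    tags.add(WRITEBACK)
    if not keep_write:
        tags.discard(TOWRITE)


def pages_left_for_sync(keep_write):
    """Return the page indexes the integrity sync will still visit."""
    # Three dirty pages, all tagged at the start of the integrity sync.
    pages = {idx: {TOWRITE} for idx in range(3)}
    # A journal commit in the middle of the sync touches page 1, whose
    # buffers are delayed or unwritten, so no data is actually written.
    set_page_writeback(pages[1], keep_write=keep_write)
    pages[1].discard(WRITEBACK)  # writeback "ends" immediately in this model
    # The sync loop only looks at pages that still carry TOWRITE.
    return sorted(idx for idx, tags in pages.items() if TOWRITE in tags)


print(pages_left_for_sync(keep_write=False))  # page 1 is skipped
print(pages_left_for_sync(keep_write=True))   # page 1 is still visited
```

In the first call page 1 drops out of the sync even though nothing was written for it; keeping the tag, as the fix does, leaves it in the set.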
  3. 07 May, 2014 1 commit
  4. 22 Apr, 2014 1 commit
  5. 21 Apr, 2014 2 commits
    • L
      ext4: rename uninitialized extents to unwritten · 556615dc
      Lukas Czerner committed
      Currently in ext4 there is quite a mess when it comes to naming
      unwritten extents. Sometimes we call them uninitialized and sometimes we
      refer to them as unwritten.
      
      The right name for the extent which has been allocated but does not
      contain any written data is _unwritten_. Other file systems are
      using this name consistently, even the buffer head state refers to it as
      unwritten. We need to fix this confusion in ext4.
      
      This commit changes every reference to an uninitialized extent (meaning
      allocated but unwritten) to unwritten extent. This includes comments,
      function names and variable names. It even covers abbreviation of the
      word uninitialized (such as uninit) and some misspellings.
      
      This commit does not change any of the code paths at all. This has been
      confirmed by comparing md5sums of the assembly code of each object file
      after all the function names were stripped from it.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • L
      ext4: get rid of EXT4_MAP_UNINIT flag · 090f32ee
      Lukas Czerner committed
      Currently EXT4_MAP_UNINIT is used in the dioread_nolock case to mark
      writes into either an unallocated or an unwritten extent, because we
      need to make sure that any DIO write into that inode will wait for the
      extent conversion.

      However, EXT4_MAP_UNINIT is not only an entirely misleading name but
      also unnecessary, because we can check for EXT4_MAP_UNWRITTEN in the
      dioread_nolock case instead.

      This commit removes the EXT4_MAP_UNINIT flag.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  6. 11 Apr, 2014 1 commit
    • T
      ext4: move ext4_update_i_disksize() into mpage_map_and_submit_extent() · 622cad13
      Theodore Ts'o committed
      The function ext4_update_i_disksize() is used in only one place, in
      the function mpage_map_and_submit_extent().  Move its code to simplify
      the code paths, and also move the call to ext4_mark_inode_dirty() into
      the i_data_sem's critical region, to be consistent with all of the
      other places where we update i_disksize.  That way, we also keep the
      raw_inode's i_disksize protected, to avoid the following race:
      
            CPU #1                                 CPU #2
      
         down_write(&i_data_sem)
         Modify i_disksize
         up_write(&i_data_sem)
                                              down_write(&i_data_sem)
                                              Modify i_disksize
                                              Copy i_disksize to on-disk inode
                                              up_write(&i_data_sem)
         Copy i_disksize to on-disk inode
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
  7. 25 Mar, 2014 2 commits
  8. 19 Mar, 2014 2 commits
    • T
      ext4: each filesystem creates and uses its own mb_cache · 9c191f70
      T Makphaibulchoke committed
      This patch adds new interfaces to create and destroy the cache,
      ext4_xattr_create_cache() and ext4_xattr_destroy_cache(), and removes
      the cache creation and destruction calls from ext4_init_xattr() and
      ext4_exit_xattr() in fs/ext4/xattr.c.

      fs/ext4/super.c has been changed so that when a filesystem is mounted
      a cache is allocated and attached to its ext4_sb_info structure.

      fs/mbcache.c has been changed so that only one slab allocator is
      allocated and used by all mbcache structures.
      Signed-off-by: T. Makphaibulchoke <tmac@hp.com>
    • L
      ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate · b8a86845
      Lukas Czerner committed
      Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
      functionality as the xfs ioctl XFS_IOC_ZERO_RANGE.

      It can be used to convert a range of a file to zeros, preferably without
      issuing data IO. Blocks should be preallocated for the regions that span
      holes in the file, and the entire range is preferably converted to
      unwritten extents.

      This can also be used to preallocate blocks past EOF in the same way as
      with fallocate. The FALLOC_FL_KEEP_SIZE flag should cause the inode
      size to remain the same.
      
      Also add appropriate tracepoints.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
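From userspace, the new flag is passed in the mode argument of fallocate(2). The sketch below calls the C library through ctypes, assuming Linux on a 64-bit platform where off_t is 64 bits wide; the flag values are those from linux/falloc.h, while the helper names are made up for this example. Filesystems without support return an error, which surfaces here as OSError.

```python
import ctypes
import os

# Flag values from linux/falloc.h.
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_ZERO_RANGE = 0x10


def zero_range_mode(keep_size=False):
    """Build the fallocate mode for zeroing a range (hypothetical helper)."""
    mode = FALLOC_FL_ZERO_RANGE
    if keep_size:
        mode |= FALLOC_FL_KEEP_SIZE  # do not extend i_size past EOF
    return mode


def zero_range(fd, offset, length, keep_size=False):
    """Zero [offset, offset + length) of an open file via fallocate(2).

    Assumes Linux and a 64-bit off_t; raises OSError if the call fails,
    for example when the filesystem does not support the flag.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_int64, ctypes.c_int64]
    libc.fallocate.restype = ctypes.c_int
    if libc.fallocate(fd, zero_range_mode(keep_size), offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

A caller would open the file read-write with os.open() and pass the descriptor, for example zero_range(fd, 0, 4096, keep_size=True).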
  9. 24 Feb, 2014 1 commit
  10. 22 Feb, 2014 1 commit
  11. 17 Feb, 2014 1 commit
  12. 20 Dec, 2013 1 commit
  13. 12 Nov, 2013 1 commit
  14. 09 Nov, 2013 1 commit
  15. 18 Oct, 2013 1 commit
    • T
      ext4: add ratelimiting to ext4 messages · efbed4dc
      Theodore Ts'o committed
      A storage device that suddenly disappears, or significant file system
      corruption, can result in a huge flood of messages being sent to the
      console.  This can overflow the file system containing
      /var/log/messages, or if a serial console is configured, it can slow
      down the system so much that a hardware watchdog ends up triggering,
      forcing a system reboot.

      Google-Bug-Id: 7258357
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
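The idea behind message ratelimiting is an interval-and-burst window: allow up to a fixed number of messages per interval and count the rest as suppressed. The sketch below is a simplified model in that spirit, not the kernel implementation; the class name, the injected clock and the 5-second/10-message values are illustrative assumptions.

```python
class RateLimit:
    """Allow at most `burst` messages per `interval` seconds (toy model)."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.begin = None   # start time of the current window
        self.printed = 0    # messages allowed in the current window
        self.missed = 0     # messages suppressed in the current window

    def allow(self, now):
        """Return True if a message arriving at time `now` may be printed."""
        if self.begin is None or now - self.begin >= self.interval:
            # Start a new window and reset the counters.
            self.begin = now
            self.printed = 0
            self.missed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        self.missed += 1
        return False


# 25 errors arriving at the same instant: only the first 10 get through.
rl = RateLimit(interval=5, burst=10)
allowed = sum(rl.allow(now=0) for _ in range(25))
print(allowed, rl.missed)
```

Passing the time in explicitly keeps the model deterministic; a real caller would read a monotonic clock instead.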
  16. 04 Sep, 2013 1 commit
    • C
      direct-io: Implement generic deferred AIO completions · 7b7a8665
      Christoph Hellwig committed
      Add support to the core direct-io code to defer AIO completions to user
      context using a workqueue.  This replaces open-coded and less efficient
      code in XFS and ext4 (we save a memory allocation for each direct IO)
      and will be needed to properly support O_(D)SYNC for AIO.
      
      The communication between the filesystem and the direct I/O code requires
      a new buffer head flag, which is a bit ugly but not avoidable until the
      direct I/O code stops abusing the buffer_head structure for communicating
      with the filesystems.
      
      Currently this creates a per-superblock unbound workqueue for these
      completions, which is taken from an earlier patch by Jan Kara.  I'm
      not really convinced about this use and would prefer a "normal" global
      workqueue with a high concurrency limit, but this needs further discussion.
      
      JK: Fixed ext4 part, dynamic allocation of the workqueue.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  17. 29 Aug, 2013 4 commits
    • D
      ext4: mark block group as corrupt on inode bitmap error · 87a39389
      Darrick J. Wong committed
      If we detect either a discrepancy between the inode bitmap and the
      inode counts, or the inode bitmap fails to pass validation checks, mark
      the block group corrupt and refuse to allocate or deallocate inodes
      from the group.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • D
      ext4: mark block group as corrupt on block bitmap error · 163a203d
      Darrick J. Wong committed
      When we notice a block-bitmap corruption (because of device failure or
      something else), we should mark this group as corrupt and prevent
      further block allocations/deallocations from it. Currently, we end up
      generating one error message for every block in the bitmap. This
      potentially could make the system unstable as noticed in some
      bugs. With this patch, the error is printed only the first time
      and the entire block group is marked as corrupted. This prevents future
      allocations/deallocations from it.
      
      Also tested by corrupting the block
      bitmap and forcefully introducing the mb_free_blocks error:
      (1) create a largefile (2Gb)
      $ dd if=/dev/zero of=largefile oflag=direct bs=10485760 count=200
      (2) umount filesystem. use dumpe2fs to see which block-bitmaps
      are in use by largefile and note their block numbers
      (3) use dd to zero-out the used block bitmaps
      $ dd if=/dev/zero of=/dev/hdc4 bs=4096 seek=14 count=8 oflag=direct
      (4) mount the FS and delete the largefile.
      (5) recreate the largefile. verify that the new largefile does not
      get any blocks from the groups marked as bad.
      Without the patch, we will see mb_free_blocks error for each bit in
      each zero'ed out bitmap at (4). With the patch, we only see the error
      once per blockgroup:
      [  309.706803] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 15: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.720824] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 14: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.732858] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
      [  309.748321] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 13: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.760331] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
      [  309.769695] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 12: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.781721] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
      [  309.798166] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 11: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      [  309.810184] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
      [  309.819532] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 10: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
      
      Google-Bug-Id: 7258357
      
      [darrick.wong@oracle.com]
      Further modifications (by Darrick) to make more obvious that this corruption
      bit applies to blocks only.  Set the corruption flag if the block group bitmap
      verification fails.
      
      Original-author: Aditya Kali <adityakali@google.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
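The behaviour described in this commit, report once and then avoid the group, can be sketched with a small model. It assumes corruption is detected by comparing the free count derived from the bitmap with the count in the group descriptor, as the log lines above suggest; the class, function names and log wording are hypothetical.

```python
class BlockGroup:
    """Minimal stand-in for per-group allocator state (illustrative only)."""

    def __init__(self, number, bitmap_free, gd_free):
        self.number = number
        self.bitmap_free = bitmap_free  # free clusters counted in the bitmap
        self.gd_free = gd_free          # free clusters per group descriptor
        self.corrupt = False


def validate_group(group, log):
    """Return True if the group is usable; log a mismatch only once."""
    if group.corrupt:
        return False                    # already reported, stay silent
    if group.bitmap_free != group.gd_free:
        group.corrupt = True
        log.append("group %d: %d clusters in bitmap, %d in gd; "
                   "block group corrupted"
                   % (group.number, group.bitmap_free, group.gd_free))
        return False
    return True


def pick_group(groups, log):
    """Return the number of the first usable group, or None."""
    for group in groups:
        if validate_group(group, log):
            return group.number
    return None


groups = [BlockGroup(10, 32768, 0), BlockGroup(11, 100, 100)]
log = []
first = pick_group(groups, log)
second = pick_group(groups, log)
print(first, second, len(log))
```

The second allocation attempt skips group 10 without adding another log entry, which is the one-message-per-group behaviour the commit describes.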
    • D
      ext4: fix type declaration of ext4_validate_block_bitmap · dbde0abe
      Darrick J. Wong committed
      The block_group parameter to ext4_validate_block_bitmap is both used
      as an ext4_group_t inside the function and passed in as that type
      by all callers.  We might as well use the typedef consistently instead
      of open-coding the 'unsigned int'.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • Z
      ext4: isolate ext4_extents.h file · d7b2a00c
      Zheng Liu committed
      After applying commit 4a092d73, we have reduced the number of
      source files that need to #include ext4_extents.h.  But we can do
      better.

      This commit defines ext4_zeroout_es() in extents.c and moves
      EXT_MAX_BLOCKS into ext4.h so that indirect.c and ioctl.c no longer
      need to include ext4_extents.h.  Meanwhile, extent_status.c only needs
      to include this file when ES_AGGRESSIVE_TEST is defined.  In addition,
      this commit removes a duplicated declaration in trace/events/ext4.h.

      After applying this patch, we only need to include ext4_extents.h
      in {super,migrate,move_extents,extents}.c, which makes it easier for us
      to define a new extent disk layout.
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  18. 17 Aug, 2013 5 commits
    • J
      ext4: fix lost truncate due to race with writeback · 90e775b7
      Jan Kara committed
      The following race can lead to the loss of an i_disksize update from
      truncate, resulting in a wrong inode size if the inode size isn't
      updated again before the inode is reclaimed:
      
      ext4_setattr()				mpage_map_and_submit_extent()
        EXT4_I(inode)->i_disksize = attr->ia_size;
        ...					  ...
      					  disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
      					  /* False because i_size isn't
      					   * updated yet */
      					  if (disksize > i_size_read(inode))
      					  /* True, because i_disksize is
      					   * already truncated */
      					  if (disksize > EXT4_I(inode)->i_disksize)
      					    /* Overwrite i_disksize
      					     * update from truncate */
      					    ext4_update_i_disksize()
        i_size_write(inode, attr->ia_size);
      
      For other places updating i_disksize such a race cannot happen because
      i_mutex prevents it. Writeback is the only place where we do
      not hold i_mutex and we cannot grab it there because of lock ordering.

      We fix the race by doing both the i_disksize and i_size updates in
      truncate atomically under i_data_sem, and in
      mpage_map_and_submit_extent() we move the check against i_size under
      i_data_sem as well.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • J
      ext4: fix warning in ext4_da_update_reserve_space() · 7d734532
      Jan Kara committed
      The reaim workfile.dbase test easily triggers a warning in
      ext4_da_update_reserve_space():

      EXT4-fs warning (device ram0): ext4_da_update_reserve_space:365:
      ino 12, allocated 1 with only 0 reserved metadata blocks (releasing 1
      blocks with reserved 9 data blocks)

      The problem is that one of the tests creates a file and then randomly
      writes to it with O_SYNC. That results in writing back pages of the file
      in random order, so we create extents for written blocks, say 0, 2, 4,
      6, 8 - this last allocation also allocates a new block for extents. Then
      we write out block 1, so we have extents 0-2, 4, 6, 8, and we release the
      indirect extent block because the extents fit in the inode again. Then
      we write out block 10 and we need to allocate the indirect extent block
      again, which triggers the warning because we don't have the reservation
      anymore.

      Fix the problem by giving back freed metadata blocks resulting from
      extent merging into the inode's reservation pool.
      Signed-off-by: Jan Kara <jack@suse.cz>
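The block sequence in this commit message can be replayed with a small model of extent merging. The sketch assumes that logically adjacent blocks are also physically adjacent (so they always merge) and that four extents fit directly in the inode, which matches the standard ext4 inode layout; the function names are made up for the example.

```python
INLINE_EXTENTS = 4  # extents that fit in the inode itself (assumed layout)


def merge_extents(blocks):
    """Collapse a set of written block numbers into (start, end) extents."""
    extents = []
    for block in sorted(set(blocks)):
        if extents and extents[-1][1] + 1 == block:
            extents[-1][1] = block          # extend the previous extent
        else:
            extents.append([block, block])  # start a new extent
    return [tuple(extent) for extent in extents]


def needs_external_block(blocks):
    """True if the extents no longer fit in the inode."""
    return len(merge_extents(blocks)) > INLINE_EXTENTS


written = [0, 2, 4, 6, 8]
print(merge_extents(written), needs_external_block(written))

written.append(1)   # merges 0, 1, 2 into one extent
print(merge_extents(written), needs_external_block(written))

written.append(10)  # a fifth extent appears again
print(merge_extents(written), needs_external_block(written))
```

The external block is needed, released, and needed again across the three steps, which is why the reservation for it has to be returned when merging frees it.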
    • T
      ext4: add support for extent pre-caching · 7869a4a6
      Theodore Ts'o committed
      Add a new fiemap flag which forces all of the extents in an inode
      to be cached in the extent_status tree.  This is critically important
      when using AIO to a preallocated file, since if we need to read in
      blocks from the extent tree, the io_submit(2) system call becomes
      synchronous, and the AIO is no longer "A", which is bad.
      
      In addition, for most files which have an external leaf tree block,
      the cost of caching the information in the extent status tree will be
      less than caching the entire 4k block in the buffer cache.  So it is
      generally a win to keep the extent information cached.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • T
      ext4: cache all of an extent tree's leaf block upon reading · 107a7bd3
      Theodore Ts'o committed
      When we read in an extent tree leaf block from disk, arrange to have
      all of its entries cached.  In nearly all cases the in-memory
      representation will be more compact than the on-disk representation in
      the buffer cache, and it allows us to get the information without
      having to traverse the extent tree for successive extents.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
    • J
      jbd2: Fix oops in jbd2_journal_file_inode() · a361293f
      Jan Kara committed
      Commit 0713ed0c added a
      jbd2_journal_file_inode() call into ext4_block_zero_page_range().
      However, that function gets called from the truncate path, and thus the
      inode needn't have a jinode attached - that happens in ext4_file_open(),
      but the file needn't have been opened since mount. Calling
      jbd2_journal_file_inode() without a jinode attached results in an oops.

      We fix the problem by attaching a jinode to the inode also in
      ext4_truncate() and ext4_punch_hole() when we are going to zero out
      partial blocks.
      Reported-by: majianpeng <majianpeng@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  19. 01 Jul, 2013 3 commits
    • A
      ext4: pass inode pointer instead of file pointer to punch hole · aeb2817a
      Ashish Sangwan committed
      There is no need to pass a file pointer when we can pass the inode
      pointer directly.
      Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • J
      ext4: reduce object size when !CONFIG_PRINTK · e7c96e8e
      Joe Perches committed
      Reducing the object size by ~10% could be useful for embedded systems.
      
      Add #ifdef CONFIG_PRINTK #else #endif blocks to hold formats and
      arguments, passing " " to functions when !CONFIG_PRINTK and still
      verifying format and arguments with no_printk.
      
      $ size fs/ext4/built-in.o*
         text	   data	    bss	    dec	    hex	filename
       239375	    610	    888	 240873	  3ace9	fs/ext4/built-in.o.new
       264167	    738	    888	 265793	  40e41	fs/ext4/built-in.o.old
      
          $ grep -E "CONFIG_EXT4|CONFIG_PRINTK" .config
          # CONFIG_PRINTK is not set
          CONFIG_EXT4_FS=y
          CONFIG_EXT4_USE_FOR_EXT23=y
          CONFIG_EXT4_FS_POSIX_ACL=y
          # CONFIG_EXT4_FS_SECURITY is not set
          # CONFIG_EXT4_DEBUG is not set
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
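The "~10%" figure can be checked against the size output quoted in this commit. The short calculation below uses only the numbers shown above; the helper name is an illustrative choice.

```python
def percent_reduction(old, new):
    """Relative size reduction from old to new, in percent."""
    return (old - new) / old * 100.0


# Values from the `size fs/ext4/built-in.o*` output in the commit message.
text = percent_reduction(264167, 239375)    # text column
total = percent_reduction(265793, 240873)   # dec column
print("text: %.1f%%  total: %.1f%%" % (text, total))
```

Both columns shrink by a little over 9%, consistent with the rounded claim in the message.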
    • Z
      ext4: improve extent cache shrink mechanism to avoid to burn CPU time · d3922a77
      Zheng Liu committed
      Now we maintain a proper in-order LRU list in ext4 to reclaim entries
      from the extent status tree when we are under heavy memory pressure.  To
      keep this order, a spin lock is used to protect the list.  But this
      lock burns a lot of CPU time.  We can use the following steps to trigger
      it.
      
        % cd /dev/shm
        % dd if=/dev/zero of=ext4-img bs=1M count=2k
        % mkfs.ext4 ext4-img
        % mount -t ext4 -o loop ext4-img /mnt
        % cd /mnt
        % for ((i=0;i<160;i++)); do truncate -s 64g $i; done
        % for ((i=0;i<160;i++)); do cp $i /dev/null & done
        % perf record -a -g
        % perf report
      
      This commit tries to fix this problem.  A new member called
      i_touch_when is added to ext4_inode_info to record the last access
      time for an inode, so we no longer need to keep a proper in-order
      LRU list, which avoids burning CPU time.  When we try to
      reclaim some entries from the extent status tree, we use list_sort() to
      get a proper in-order list.  Then we traverse this list to discard some
      entries.  In ext4_sb_info, we use s_es_last_sorted to record the last
      time this list was sorted.  When we traverse the list, we skip any inode
      that is newer than this time, and move this inode to the tail of the LRU
      list.  When the head of the list is newer than s_es_last_sorted, we
      sort the LRU list again.

      In this commit, we break the loop if s_extent_cache_cnt == 0 because
      that means that all extents in the extent status tree have been reclaimed.

      Meanwhile in this commit, the prototype of
      ext4_es_{un}register_shrinker() is changed to save a local variable in
      these functions.
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
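The core trade-off in this commit is to make each access cheap (record a timestamp) and pay for ordering only at reclaim time (sort once). The sketch below models just that part; it deliberately omits the s_es_last_sorted skip-and-requeue logic, and the class, method names and explicit clock values are assumptions for illustration.

```python
class ExtentCacheLRU:
    """Toy model: timestamp on access, sort only when shrinking."""

    def __init__(self):
        self.touch_when = {}  # inode -> last access time (like i_touch_when)
        self.cached = {}      # inode -> number of cached extents

    def touch(self, inode, now, extents=0):
        """Record an access; no list reordering, so no hot lock is needed."""
        self.touch_when[inode] = now
        self.cached[inode] = self.cached.get(inode, 0) + extents

    def shrink(self, nr_to_scan):
        """Reclaim up to nr_to_scan extents, oldest inodes first."""
        reclaimed = 0
        # Sorting happens here, at reclaim time, instead of on every access.
        for inode in sorted(self.touch_when, key=self.touch_when.get):
            if reclaimed >= nr_to_scan:
                break
            take = min(self.cached[inode], nr_to_scan - reclaimed)
            self.cached[inode] -= take
            reclaimed += take
        return reclaimed


lru = ExtentCacheLRU()
lru.touch("a", now=1, extents=3)
lru.touch("b", now=2, extents=3)
lru.touch("a", now=3)            # "a" is now the most recently used
print(lru.shrink(3), lru.cached)
```

After the shrink, the extents of "b" are gone and those of "a" remain, because "a" was touched more recently.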
  20. 29 Jun, 2013 1 commit
  21. 06 Jun, 2013 1 commit
  22. 05 Jun, 2013 3 commits
    • J
      ext4: remove ext4_ioend_wait() · 5dc23bdd
      Jan Kara committed
      Now that we clear PageWriteback after extent conversion, there's no
      need to wait for io_end processing in ext4_evict_inode().  Running
      AIO/DIO keeps a file reference until aio_complete() is called, so
      ext4_evict_inode() cannot be called.  For io_end structures resulting
      from buffered IO, waiting already happens because we wait for
      PageWriteback in truncate_inode_pages().
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • J
      ext4: don't wait for extent conversion in ext4_punch_hole() · c724585b
      Jan Kara committed
      We don't have to wait for extent conversion in ext4_punch_hole() as
      buffered IO for the punched range has been flushed and waited upon
      (thus all extent conversions for that range have completed).  Also we
      wait for all DIO to finish using inode_dio_wait() so there cannot be
      any extent conversions pending due to direct IO.
      
      Also remove ext4_flush_unwritten_io() since it's unused now.
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • J
      ext4: defer clearing of PageWriteback after extent conversion · b0857d30
      Jan Kara committed
      Currently PageWriteback bit gets cleared from put_io_page() called
      from ext4_end_bio().  This is somewhat inconvenient as extent tree is
      not fully updated at that time (unwritten extents are not marked as
      written) so we cannot read the data back yet.  This design was
      dictated by lock ordering as we cannot start a transaction while
      PageWriteback bit is set (we could easily deadlock with
      ext4_da_writepages()).  But now that we use transaction reservation
      for extent conversion, locking issues are solved and we can move
      PageWriteback bit clearing after extent conversion is done.  As a
      result we can remove wait for unwritten extent conversion from
      ext4_sync_file() because it already implicitly happens through
      wait_on_page_writeback().
      
      We implement deferring of PageWriteback clearing by queueing completed
      bios to appropriate io_end and processing all the pages when io_end is
      going to be freed instead of at the moment ext4_io_end() is called.
      Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>