1. 17 8月, 2013 1 次提交
    • J
      jbd2: Fix oops in jbd2_journal_file_inode() · a361293f
      Jan Kara 提交于
      Commit 0713ed0c added
      jbd2_journal_file_inode() call into ext4_block_zero_page_range().
      However that function gets called from truncate path and thus inode
      needn't have jinode attached - that happens in ext4_file_open() but
      the file needn't be ever open since mount. Calling
      jbd2_journal_file_inode() without jinode attached results in the oops.
      
      We fix the problem by attaching jinode to inode also in ext4_truncate()
      and ext4_punch_hole() when we are going to zero out partial blocks.
      Reported-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      a361293f
  2. 01 7月, 2013 3 次提交
    • A
      ext4: pass inode pointer instead of file pointer to punch hole · aeb2817a
      Ashish Sangwan 提交于
      No need to pass file pointer when we can directly pass inode pointer.
      Signed-off-by: NAshish Sangwan <a.sangwan@samsung.com>
      Signed-off-by: NNamjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      aeb2817a
    • J
      ext4: reduce object size when !CONFIG_PRINTK · e7c96e8e
      Joe Perches 提交于
      Reduce the object size ~10% could be useful for embedded systems.
      
      Add #ifdef CONFIG_PRINTK #else #endif blocks to hold formats and
      arguments, passing " " to functions when !CONFIG_PRINTK and still
      verifying format and arguments with no_printk.
      
      $ size fs/ext4/built-in.o*
         text	   data	    bss	    dec	    hex	filename
       239375	    610	    888	 240873	  3ace9	fs/ext4/built-in.o.new
       264167	    738	    888	 265793	  40e41	fs/ext4/built-in.o.old
      
          $ grep -E "CONFIG_EXT4|CONFIG_PRINTK" .config
          # CONFIG_PRINTK is not set
          CONFIG_EXT4_FS=y
          CONFIG_EXT4_USE_FOR_EXT23=y
          CONFIG_EXT4_FS_POSIX_ACL=y
          # CONFIG_EXT4_FS_SECURITY is not set
          # CONFIG_EXT4_DEBUG is not set
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      e7c96e8e
    • Z
      ext4: improve extent cache shrink mechanism to avoid to burn CPU time · d3922a77
      Zheng Liu 提交于
      Now we maintain an proper in-order LRU list in ext4 to reclaim entries
      from extent status tree when we are under heavy memory pressure.  For
      keeping this order, a spin lock is used to protect this list.  But this
      lock burns a lot of CPU time.  We can use the following steps to trigger
      it.
      
        % cd /dev/shm
        % dd if=/dev/zero of=ext4-img bs=1M count=2k
        % mkfs.ext4 ext4-img
        % mount -t ext4 -o loop ext4-img /mnt
        % cd /mnt
        % for ((i=0;i<160;i++)); do truncate -s 64g $i; done
        % for ((i=0;i<160;i++)); do cp $i /dev/null &; done
        % perf record -a -g
        % perf report
      
      This commit tries to fix this problem.  Now a new member called
      i_touch_when is added into ext4_inode_info to record the last access
      time for an inode.  Meanwhile we never need to keep a proper in-order
      LRU list.  So this can avoid to burns some CPU time.  When we try to
      reclaim some entries from extent status tree, we use list_sort() to get
      a proper in-order list.  Then we traverse this list to discard some
      entries.  In ext4_sb_info, we use s_es_last_sorted to record the last
      time of sorting this list.  When we traverse the list, we skip the inode
      that is newer than this time, and move this inode to the tail of LRU
      list.  When the head of the list is newer than s_es_last_sorted, we will
      sort the LRU list again.
      
      In this commit, we break the loop if s_extent_cache_cnt == 0 because
      that means that all extents in extent status tree have been reclaimed.
      
      Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is
      changed to save a local variable in these functions.
      Reported-by: NDave Hansen <dave.hansen@intel.com>
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      d3922a77
  3. 29 6月, 2013 1 次提交
  4. 06 6月, 2013 1 次提交
  5. 05 6月, 2013 10 次提交
    • J
      ext4: remove ext4_ioend_wait() · 5dc23bdd
      Jan Kara 提交于
      Now that we clear PageWriteback after extent conversion, there's no
      need to wait for io_end processing in ext4_evict_inode().  Running
      AIO/DIO keeps file reference until aio_complete() is called so
      ext4_evict_inode() cannot be called.  For io_end structures resulting
      from buffered IO waiting is happening because we wait for
      PageWriteback in truncate_inode_pages().
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      5dc23bdd
    • J
      ext4: don't wait for extent conversion in ext4_punch_hole() · c724585b
      Jan Kara 提交于
      We don't have to wait for extent conversion in ext4_punch_hole() as
      buffered IO for the punched range has been flushed and waited upon
      (thus all extent conversions for that range have completed).  Also we
      wait for all DIO to finish using inode_dio_wait() so there cannot be
      any extent conversions pending due to direct IO.
      
      Also remove ext4_flush_unwritten_io() since it's unused now.
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      c724585b
    • J
      ext4: defer clearing of PageWriteback after extent conversion · b0857d30
      Jan Kara 提交于
      Currently PageWriteback bit gets cleared from put_io_page() called
      from ext4_end_bio().  This is somewhat inconvenient as extent tree is
      not fully updated at that time (unwritten extents are not marked as
      written) so we cannot read the data back yet.  This design was
      dictated by lock ordering as we cannot start a transaction while
      PageWriteback bit is set (we could easily deadlock with
      ext4_da_writepages()).  But now that we use transaction reservation
      for extent conversion, locking issues are solved and we can move
      PageWriteback bit clearing after extent conversion is done.  As a
      result we can remove wait for unwritten extent conversion from
      ext4_sync_file() because it already implicitely happens through
      wait_on_page_writeback().
      
      We implement deferring of PageWriteback clearing by queueing completed
      bios to appropriate io_end and processing all the pages when io_end is
      going to be freed instead of at the moment ext4_io_end() is called.
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      b0857d30
    • J
      ext4: split extent conversion lists to reserved & unreserved parts · 2e8fa54e
      Jan Kara 提交于
      Now that we have extent conversions with reserved transaction, we have
      to prevent extent conversions without reserved transaction (from DIO
      code) to block these (as that would effectively void any transaction
      reservation we did).  So split lists, work items, and work queues to
      reserved and unreserved parts.
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      2e8fa54e
    • J
      ext4: use transaction reservation for extent conversion in ext4_end_io · 6b523df4
      Jan Kara 提交于
      Later we would like to clear PageWriteback bit only after extent
      conversion from unwritten to written extents is performed.  However it
      is not possible to start a transaction after PageWriteback is set
      because that violates lock ordering (and is easy to deadlock).  So we
      have to reserve a transaction before locking pages and sending them
      for IO and later we use the transaction for extent conversion from
      ext4_end_io().
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      6b523df4
    • J
      ext4: remove buffer_uninit handling · 3613d228
      Jan Kara 提交于
      There isn't any need for setting BH_Uninit on buffers anymore.  It was
      only used to signal we need to mark io_end as needing extent
      conversion in add_bh_to_extent() but now we can mark the io_end
      directly when mapping extent.
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      3613d228
    • J
      ext4: restructure writeback path · 4e7ea81d
      Jan Kara 提交于
      There are two issues with current writeback path in ext4.  For one we
      don't necessarily map complete pages when blocksize < pagesize and
      thus needn't do any writeback in one iteration.  We always map some
      blocks though so we will eventually finish mapping the page.  Just if
      writeback races with other operations on the file, forward progress is
      not really guaranteed. The second problem is that current code
      structure makes it hard to associate all the bios to some range of
      pages with one io_end structure so that unwritten extents can be
      converted after all the bios are finished.  This will be especially
      difficult later when io_end will be associated with reserved
      transaction handle.
      
      We restructure the writeback path to a relatively simple loop which
      first prepares extent of pages, then maps one or more extents so that
      no page is partially mapped, and once page is fully mapped it is
      submitted for IO. We keep all the mapping and IO submission
      information in mpage_da_data structure to somewhat reduce stack usage.
      Resulting code is somewhat shorter than the old one and hopefully also
      easier to read.
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      4e7ea81d
    • J
      ext4: better estimate credits needed for ext4_da_writepages() · fffb2739
      Jan Kara 提交于
      We limit the number of blocks written in a single loop of
      ext4_da_writepages() to 64 when inode uses indirect blocks.  That is
      unnecessary as credit estimates for mapping logically continguous run
      of blocks is rather low even for inode with indirect blocks.  So just
      lift this limitation and properly calculate the number of necessary
      credits.
      
      This better credit estimate will also later allow us to always write
      at least a single page in one iteration.
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      fffb2739
    • J
      ext4: improve writepage credit estimate for files with indirect blocks · fa55a0ed
      Jan Kara 提交于
      ext4_ind_trans_blocks() wrongly used 'chunk' argument to decide whether
      blocks mapped are logically contiguous. That is wrong since the argument
      informs whether the blocks are physically contiguous. As the blocks
      mapped are always logically contiguous and that's all
      ext4_ind_trans_blocks() cares about, just remove the 'chunk' argument.
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      fa55a0ed
    • J
      ext4: deprecate max_writeback_mb_bump sysfs attribute · f2d50a65
      Jan Kara 提交于
      This attribute is now unused so deprecate it.  We still show the old
      default value to keep some compatibility but we don't allow writing to
      that attribute anymore.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      f2d50a65
  6. 04 6月, 2013 1 次提交
    • J
      ext4: use io_end for multiple bios · 97a851ed
      Jan Kara 提交于
      Change writeback path to create just one io_end structure for the
      extent to which we submit IO and share it among bios writing that
      extent. This prevents needless splitting and joining of unwritten
      extents when they cannot be submitted as a single bio.
      
      Bugs in ENOMEM handling found by Linux File System Verification project
      (linuxtesting.org) and fixed by Alexey Khoroshilov
      <khoroshilov@ispras.ru>.
      
      CC: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      97a851ed
  7. 28 5月, 2013 3 次提交
    • L
      ext4: remove unused discard_partial_page_buffers · c121ffd0
      Lukas Czerner 提交于
      The discard_partial_page_buffers is no longer used anywhere so we can
      simply remove it including the *_no_lock variant and
      EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED define.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      c121ffd0
    • L
      ext4: use ext4_zero_partial_blocks in punch_hole · a87dd18c
      Lukas Czerner 提交于
      We're doing to get rid of ext4_discard_partial_page_buffers() since it is
      duplicating some code and also partially duplicating work of
      truncate_pagecache_range(), moreover the old implementation was much
      clearer.
      
      Now when the truncate_inode_pages_range() can handle truncating non page
      aligned regions we can use this to invalidate and zero out block aligned
      region of the punched out range and then use ext4_block_truncate_page()
      to zero the unaligned blocks on the start and end of the range. This
      will greatly simplify the punch hole code. Moreover after this commit we
      can get rid of the ext4_discard_partial_page_buffers() completely.
      
      We also introduce function ext4_prepare_punch_hole() to do come common
      operations before we attempt to do the actual punch hole on
      indirect or extent file which saves us some code duplication.
      
      This has been tested on ppc64 with 1k block size with fsx and xfstests
      without any problems.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      a87dd18c
    • L
      Revert "ext4: remove no longer used functions in inode.c" · d863dc36
      Lukas Czerner 提交于
      This reverts commit ccb4d7af.
      
      This commit reintroduces functions ext4_block_truncate_page() and
      ext4_block_zero_page_range() which has been previously removed in favour
      of ext4_discard_partial_page_buffers().
      
      In future commits we want to reintroduce those function and remove
      ext4_discard_partial_page_buffers() since it is duplicating some code
      and also partially duplicating work of truncate_pagecache_range(),
      moreover the old implementation was much clearer.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      d863dc36
  8. 12 5月, 2013 1 次提交
  9. 20 4月, 2013 1 次提交
    • T
      ext4: fix readdir error in the case of inline_data+dir_index · 8af0f082
      Tao Ma 提交于
      Zach reported a problem that if inline data is enabled, we don't
      tell the difference between the offset of '.' and '..'. And a
      getdents will fail if the user only want to get '.' and what's worse,
      if there is a conversion happens when the user calls getdents
      many times, he/she may get the same entry twice.
      
      In theory, a dir block would also fail if it is converted to a
      hashed-index based dir since f_pos will become a hash value, not the
      real one, but it doesn't happen.  And a deep investigation shows that
      we uses a hash based solution even for a normal dir if the dir_index
      feature is enabled.
      
      So this patch just adds a new htree_inlinedir_to_tree for inline dir,
      and if we find that the hash index is supported, we will do like what
      we do for a dir block.
      Reported-by: NZach Brown <zab@redhat.com>
      Signed-off-by: NTao Ma <boyu.mt@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      8af0f082
  10. 12 4月, 2013 2 次提交
    • J
      ext4: use io_end for multiple bios · 4eec708d
      Jan Kara 提交于
      Change writeback path to create just one io_end structure for the
      extent to which we submit IO and share it among bios writing that
      extent. This prevents needless splitting and joining of unwritten
      extents when they cannot be submitted as a single bio.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      4eec708d
    • J
      ext4: make ext4_bio_write_page() use BH_Async_Write flags · 0058f965
      Jan Kara 提交于
      So far ext4_bio_write_page() attached all the pages to ext4_io_end
      structure.  This makes that structure pretty heavy (1 KB for pointers
      + 16 bytes per page attached to the bio).  Also later we would like to
      share ext4_io_end structure among several bios in case IO to a single
      extent needs to be split among several bios and pointing to pages from
      ext4_io_end makes this complex.
      
      We remove page pointers from ext4_io_end and use pointers from bio
      itself instead.  This isn't as easy when blocksize < pagesize because
      then we can have several bios in flight for a single page and we have
      to be careful when to call end_page_writeback().  However this is a
      known problem already solved by block_write_full_page() /
      end_buffer_async_write() so we mimic its behavior here.  We mark
      buffers going to disk with BH_Async_Write flag and in
      ext4_bio_end_io() we check whether there are any buffers with
      BH_Async_Write flag left.  If there are not, we can call
      end_page_writeback().
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      0058f965
  11. 11 4月, 2013 1 次提交
  12. 10 4月, 2013 1 次提交
    • L
      ext4: introduce reserved space · 27dd4385
      Lukas Czerner 提交于
      Currently in ENOSPC condition when writing into unwritten space, or
      punching a hole, we might need to split the extent and grow extent tree.
      However since we can not allocate any new metadata blocks we'll have to
      zero out unwritten part of extent or punched out part of extent, or in
      the worst case return ENOSPC even though use actually does not allocate
      any space.
      
      Also in delalloc path we do reserve metadata and data blocks for the
      time we're going to write out, however metadata block reservation is
      very tricky especially since we expect that logical connectivity implies
      physical connectivity, however that might not be the case and hence we
      might end up allocating more metadata blocks than previously reserved.
      So in future, metadata reservation checks should be removed since we can
      not assure that we do not under reserve.
      
      And this is where reserved space comes into the picture. When mounting
      the file system we slice off a little bit of the file system space (2%
      or 4096 clusters, whichever is smaller) which can be then used for the
      cases mentioned above to prevent costly zeroout, or unexpected ENOSPC.
      
      The number of reserved clusters can be set via sysfs, however it can
      never be bigger than number of free clusters in the file system.
      
      Note that this patch fixes the failure of xfstest 274 as expected.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      27dd4385
  13. 09 4月, 2013 1 次提交
    • D
      ext4: implementation of a new ioctl called EXT4_IOC_SWAP_BOOT · 393d1d1d
      Dr. Tilmann Bubeck 提交于
      Add a new ioctl, EXT4_IOC_SWAP_BOOT which swaps i_blocks and
      associated attributes (like i_blocks, i_size, i_flags, ...) from the
      specified inode with inode EXT4_BOOT_LOADER_INO (#5). This is
      typically used to store a boot loader in a secure part of the
      filesystem, where it can't be changed by a normal user by accident.
      The data blocks of the previous boot loader will be associated with
      the given inode.
      
      This usercode program is a simple example of the usage:
      
      int main(int argc, char *argv[])
      {
        int fd;
        int err;
      
        if ( argc != 2 ) {
          printf("usage: ext4-swap-boot-inode FILE-TO-SWAP\n");
          exit(1);
        }
      
        fd = open(argv[1], O_WRONLY);
        if ( fd < 0 ) {
          perror("open");
          exit(1);
        }
      
        err = ioctl(fd, EXT4_IOC_SWAP_BOOT);
        if ( err < 0 ) {
          perror("ioctl");
          exit(1);
        }
      
        close(fd);
        exit(0);
      }
      
      [ Modified by Theodore Ts'o to fix a number of bugs in the original code.]
      Signed-off-by: NDr. Tilmann Bubeck <t.bubeck@reinform.de>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      393d1d1d
  14. 04 4月, 2013 6 次提交
    • L
      ext4: introduce ext4_get_group_number() · bd86298e
      Lukas Czerner 提交于
      Currently on many places in ext4 we're using
      ext4_get_group_no_and_offset() even though we're only interested in
      knowing the block group of the particular block, not the offset within
      the block group so we can use more efficient way to compute block
      group.
      
      This patch introduces ext4_get_group_number() which computes block
      group for a given block much more efficiently. Use this function
      instead of ext4_get_group_no_and_offset() everywhere where we're only
      interested in knowing the block group.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      bd86298e
    • L
      ext4: make ext4_block_in_group() much more efficient · 68911009
      Lukas Czerner 提交于
      Currently in when getting the block group number for a particular
      block in ext4_block_in_group() we're using
      ext4_get_group_no_and_offset() which uses do_div() to get the block
      group and the remainer which is offset within the group.
      
      We don't need all of that in ext4_block_in_group() as we only need to
      figure out the group number.
      
      This commit changes ext4_block_in_group() to calculate group number
      directly. This shows as a big improvement with regards to cpu
      utilization. Measuring fallocate -l 15T on fresh file system with perf
      showed that 23% of cpu time was spend in the
      ext4_get_group_no_and_offset(). With this change it completely
      disappears from the list only bumping the occurrence of
      ext4_init_block_bitmap() which is the biggest user of
      ext4_block_in_group() by 4%. As the result of this change on my system
      the fallocate call was approx. 10% faster.
      
      However since there is '-g' option in mkfs which allow us setting
      different groups size (mostly for developers) I've introduced new per
      file system flag whether we have a standard block group size or
      not. The flag is used to determine whether we can use the bit shift
      optimization or not.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      68911009
    • T
      ext4: support simple conversion of extent-mapped inodes to use i_blocks · 996bb9fd
      Theodore Ts'o 提交于
      In order to make it simpler to test the code which support
      i_blocks/indirect-mapped inodes, support the conversion of inodes
      which are less than 12 blocks and which are contained in no more than
      a single extent.
      
      The primary intended use of this code is to converting freshly created
      zero-length files and empty directories.
      
      Note that the version of chattr in e2fsprogs 1.42.7 and earlier has a
      check that prevents the clearing of the extent flag.  A simple patch
      which allows "chattr -e <file>" to work will be checked into the
      e2fsprogs git repository.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      996bb9fd
    • T
      ext4: refactor truncate code · 819c4920
      Theodore Ts'o 提交于
      Move common code in ext4_ind_truncate() and ext4_ext_truncate() into
      ext4_truncate().  This saves over 60 lines of code.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      819c4920
    • T
      ext4: refactor punch hole code · 26a4c0c6
      Theodore Ts'o 提交于
      Move common code in ext4_ind_punch_hole() and ext4_ext_punch_hole()
      into ext4_punch_hole().  This saves over 150 lines of code.
      
      This also fixes a potential bug when the punch_hole() code is racing
      against indirect-to-extents or extents-to-indirect migation.  We are
      currently using i_mutex to protect against changes to the inode flag;
      specifically, the append-only, immutable, and extents inode flags.  So
      we need to take i_mutex before deciding whether to use the
      extents-specific or indirect-specific punch_hole code.
      
      Also, there was a missing call to ext4_inode_block_unlocked_dio() in
      the indirect punch codepath.  This was added in commit 02d262df
      to block DIO readers racing against the punch operation in the
      codepath for extent-mapped inodes, but it was missing for
      indirect-block mapped inodes.  One of the advantages of refactoring
      the code is that it makes such oversights much less likely.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      26a4c0c6
    • T
      ext4: collapse handling of data=ordered and data=writeback codepaths · 74d553aa
      Theodore Ts'o 提交于
      The only difference between how we handle data=ordered and
      data=writeback is a single call to ext4_jbd2_file_inode().  Eliminate
      code duplication by factoring out redundant the code paths.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NLukas Czerner <lczerner@redhat.com>
      74d553aa
  15. 20 3月, 2013 1 次提交
    • T
      ext4: fix ext4_evict_inode() racing against workqueue processing code · 1ada47d9
      Theodore Ts'o 提交于
      Commit 84c17543 (ext4: move work from io_end to inode) triggered a
      regression when running xfstest #270 when the file system is mounted
      with dioread_nolock.
      
      The problem is that after ext4_evict_inode() calls ext4_ioend_wait(),
      this guarantees that last io_end structure has been freed, but it does
      not guarantee that the workqueue structure, which was moved into the
      inode by commit 84c17543, is actually finished.  Once
      ext4_flush_completed_IO() calls ext4_free_io_end() on CPU #1, this
      will allow ext4_ioend_wait() to return on CPU #2, at which point the
      evict_inode() codepath can race against the workqueue code on CPU #1
      accessing EXT4_I(inode)->i_unwritten_work to find the next item of
      work to do.
      
      Fix this by calling cancel_work_sync() in ext4_ioend_wait(), which
      will be renamed ext4_ioend_shutdown(), since it is only used by
      ext4_evict_inode().  Also, move the call to ext4_ioend_shutdown()
      until after truncate_inode_pages() and filemap_write_and_wait() are
      called, to make sure all dirty pages have been written back and
      flushed from the page cache first.
      
      BUG: unable to handle kernel NULL pointer dereference at   (null)
      IP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e
      *pdpt = 0000000030bc3001 *pde = 0000000000000000 
      Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      Modules linked in:
      Pid: 6, comm: kworker/u:0 Not tainted 3.8.0-rc3-00013-g84c17543-dirty #91 Bochs Bochs
      EIP: 0060:[<c01dda6a>] EFLAGS: 00010046 CPU: 0
      EIP is at cwq_activate_delayed_work+0x3b/0x7e
      EAX: 00000000 EBX: 00000000 ECX: f505fe54 EDX: 00000000
      ESI: ed5b697c EDI: 00000006 EBP: f64b7e8c ESP: f64b7e84
       DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
      CR0: 8005003b CR2: 00000000 CR3: 30bc2000 CR4: 000006f0
      DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
      DR6: ffff0ff0 DR7: 00000400
      Process kworker/u:0 (pid: 6, ti=f64b6000 task=f64b4160 task.ti=f64b6000)
      Stack:
       f505fe00 00000006 f64b7e9c c01de3d7 f6435540 00000003 f64b7efc c01def1d
       f6435540 00000002 00000000 0000008a c16d0808 c040a10b c16d07d8 c16d08b0
       f505fe00 c16d0780 00000000 00000000 ee153df4 c1ce4a30 c17d0e30 00000000
      Call Trace:
       [<c01de3d7>] cwq_dec_nr_in_flight+0x71/0xfb
       [<c01def1d>] process_one_work+0x5d8/0x637
       [<c040a10b>] ? ext4_end_bio+0x300/0x300
       [<c01e3105>] worker_thread+0x249/0x3ef
       [<c01ea317>] kthread+0xd8/0xeb
       [<c01e2ebc>] ? manage_workers+0x4bb/0x4bb
       [<c023a370>] ? trace_hardirqs_on+0x27/0x37
       [<c0f1b4b7>] ret_from_kernel_thread+0x1b/0x28
       [<c01ea23f>] ? __init_kthread_worker+0x71/0x71
      Code: 01 83 15 ac ff 6c c1 00 31 db 89 c6 8b 00 a8 04 74 12 89 c3 30 db 83 05 b0 ff 6c c1 01 83 15 b4 ff 6c c1 00 89 f0 e8 42 ff ff ff <8b> 13 89 f0 83 05 b8 ff 6c c1
       6c c1 00 31 c9 83
      EIP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e SS:ESP 0068:f64b7e84
      CR2: 0000000000000000
      ---[ end trace a1923229da53d8a4 ]---
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.cz>
      1ada47d9
  16. 12 3月, 2013 1 次提交
  17. 02 3月, 2013 1 次提交
  18. 01 3月, 2013 1 次提交
    • T
      ext4: optimize ext4_es_shrink() · 24630774
      Theodore Ts'o 提交于
      When the system is under memory pressure, ext4_es_srhink() will get
      called very often.  So optimize returning the number of items in the
      file system's extent status cache by keeping a per-filesystem count,
      instead of calculating it each time by scanning all of the inodes in
      the extent status cache.
      
      Also rename the slab used for the extent status cache to be
      "ext4_extent_status" so it's obviousl the slab in question is created
      by ext4.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <gnehzuil.liu@gmail.com>
      24630774
  19. 18 2月, 2013 3 次提交
    • Z
      ext4: reclaim extents from extent status tree · 74cd15cd
      Zheng Liu 提交于
      Although extent status is loaded on-demand, we also need to reclaim
      extent from the tree when we are under a heavy memory pressure because
      in some cases fragmented extent tree causes status tree costs too much
      memory.
      
      Here we maintain a lru list in super_block.  When the extent status of
      an inode is accessed and changed, this inode will be move to the tail
      of the list.  The inode will be dropped from this list when it is
      cleared.  In the inode, a counter is added to count the number of
      cached objects in extent status tree.  Here only written/unwritten/hole
      extent is counted because delayed extent doesn't be reclaimed due to
      fiemap, bigalloc and seek_data/hole need it.  The counter will be
      increased as a new extent is allocated, and it will be decreased as a
      extent is freed.
      
      In this commit we use normal shrinker framework to reclaim memory from
      the status tree.  ext4_es_reclaim_extents_count() traverses the lru list
      to count the number of reclaimable extents.  ext4_es_shrink() tries to
      reclaim written/unwritten/hole extents from extent status tree.  The
      inode that has been shrunk is moved to the tail of lru list.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      74cd15cd
    • Z
      ext4: remove single extent cache · 69eb33dc
      Zheng Liu 提交于
      Single extent cache could be removed because we have extent status tree
      as a extent cache, and it would be better.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      69eb33dc
    • Z
      ext4: lookup block mapping in extent status tree · d100eef2
      Zheng Liu 提交于
      After tracking all extent status, we already have a extent cache in
      memory.  Every time we want to lookup a block mapping, we can first
      try to lookup it in extent status tree to avoid a potential disk I/O.
      
      A new function called ext4_es_lookup_extent is defined to finish this
      work.  When we try to lookup a block mapping, we always call
      ext4_map_blocks and/or ext4_da_map_blocks.  So in these functions we
      first try to lookup a block mapping in extent status tree.
      
      A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks
      in order not to put a hole into extent status tree because this hole
      will be converted to delayed extent in the tree immediately.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      d100eef2