1. 04 9月, 2013 1 次提交
    • C
      direct-io: Implement generic deferred AIO completions · 7b7a8665
      Christoph Hellwig 提交于
      Add support to the core direct-io code to defer AIO completions to user
      context using a workqueue.  This replaces opencoded and less efficient
      code in XFS and ext4 (we save a memory allocation for each direct IO)
      and will be needed to properly support O_(D)SYNC for AIO.
      
      The communication between the filesystem and the direct I/O code requires
      a new buffer head flag, which is a bit ugly but not avoidable until the
      direct I/O code stops abusing the buffer_head structure for communicating
      with the filesystems.
      
      Currently this creates a per-superblock unbound workqueue for these
      completions, which is taken from an earlier patch by Jan Kara.  I'm
      not really convinced about this use and would prefer a "normal" global
      workqueue with a high concurrency limit, but this needs further discussion.
      
      JK: Fixed ext4 part, dynamic allocation of the workqueue.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7b7a8665
  2. 17 8月, 2013 1 次提交
    • J
      jbd2: Fix oops in jbd2_journal_file_inode() · a361293f
      Jan Kara 提交于
      Commit 0713ed0c added
      jbd2_journal_file_inode() call into ext4_block_zero_page_range().
      However that function gets called from truncate path and thus inode
      needn't have jinode attached - that happens in ext4_file_open() but
      the file needn't be ever open since mount. Calling
      jbd2_journal_file_inode() without jinode attached results in the oops.
      
      We fix the problem by attaching jinode to inode also in ext4_truncate()
      and ext4_punch_hole() when we are going to zero out partial blocks.
      Reported-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      a361293f
  3. 30 7月, 2013 1 次提交
  4. 16 7月, 2013 1 次提交
    • T
      ext4: call ext4_es_lru_add() after handling cache miss · 63b99968
      Theodore Ts'o 提交于
      If there are no items in the extent status tree, ext4_es_lru_add() is
      a no-op.  So it is not sufficient to call ext4_es_lru_add() before we
      try to lookup an entry in the extent status tree.  We also need to
      call it at the end of ext4_ext_map_blocks(), after items have been
      added to the extent status tree.
      
      This could lead to inodes with that have extent status trees but which
      are not in the LRU list, which means they won't get considered for
      eviction by the es_shrinker.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <wenqing.lz@taobao.com>
      Cc: stable@vger.kernel.org
      63b99968
  5. 13 7月, 2013 1 次提交
  6. 06 7月, 2013 1 次提交
  7. 01 7月, 2013 6 次提交
    • T
      ext4: fix up error handling for mpage_map_and_submit_extent() · cb530541
      Theodore Ts'o 提交于
      The function mpage_released_unused_page() must only be called once;
      otherwise the kernel will BUG() when the second call to
      mpage_released_unused_page() tries to unlock the pages which had been
      unlocked by the first call.
      
      Also restructure the error handling so that we only give up on writing
      the dirty pages in the case of ENOSPC where retrying the allocation
      won't help.  Otherwise, a transient failure, such as a kmalloc()
      failure in calling ext4_map_blocks() might cause us to give up on
      those pages, leading to a scary message in /var/log/messages plus data
      loss.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      cb530541
    • L
      ext4: only zero partial blocks in ext4_zero_partial_blocks() · e1be3a92
      Lukas Czerner 提交于
      Currently if we pass range into ext4_zero_partial_blocks() which covers
      entire block we would attempt to zero it even though we should only zero
      unaligned part of the block.
      
      Fix this by checking whether the range covers the whole block skip
      zeroing if so.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      e1be3a92
    • T
      ext4: check error return from ext4_write_inline_data_end() · 42c832de
      Theodore Ts'o 提交于
      The function ext4_write_inline_data_end() can return an error.  So we
      need to assign it to a signed integer variable to check for an error
      return (since copied is an unsigned int).
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <wenqing.lz@taobao.com>
      Cc: stable@vger.kernel.org
      42c832de
    • J
      ext4: delete unnecessary C statements · 353eefd3
      jon ernst 提交于
      Comparing unsigned variable with 0 always returns false.
      err = 0 is duplicated and unnecessary.
      
      [ tytso: Also cleaned up error handling in ext4_block_zero_page_range() ]
      Signed-off-by: N"Jon Ernst" <jonernst07@gmx.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      353eefd3
    • A
      ext4: pass inode pointer instead of file pointer to punch hole · aeb2817a
      Ashish Sangwan 提交于
      No need to pass file pointer when we can directly pass inode pointer.
      Signed-off-by: NAshish Sangwan <a.sangwan@samsung.com>
      Signed-off-by: NNamjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      aeb2817a
    • Z
      ext4: improve extent cache shrink mechanism to avoid to burn CPU time · d3922a77
      Zheng Liu 提交于
      Now we maintain an proper in-order LRU list in ext4 to reclaim entries
      from extent status tree when we are under heavy memory pressure.  For
      keeping this order, a spin lock is used to protect this list.  But this
      lock burns a lot of CPU time.  We can use the following steps to trigger
      it.
      
        % cd /dev/shm
        % dd if=/dev/zero of=ext4-img bs=1M count=2k
        % mkfs.ext4 ext4-img
        % mount -t ext4 -o loop ext4-img /mnt
        % cd /mnt
        % for ((i=0;i<160;i++)); do truncate -s 64g $i; done
        % for ((i=0;i<160;i++)); do cp $i /dev/null &; done
        % perf record -a -g
        % perf report
      
      This commit tries to fix this problem.  Now a new member called
      i_touch_when is added into ext4_inode_info to record the last access
      time for an inode.  Meanwhile we never need to keep a proper in-order
      LRU list.  So this can avoid to burns some CPU time.  When we try to
      reclaim some entries from extent status tree, we use list_sort() to get
      a proper in-order list.  Then we traverse this list to discard some
      entries.  In ext4_sb_info, we use s_es_last_sorted to record the last
      time of sorting this list.  When we traverse the list, we skip the inode
      that is newer than this time, and move this inode to the tail of LRU
      list.  When the head of the list is newer than s_es_last_sorted, we will
      sort the LRU list again.
      
      In this commit, we break the loop if s_extent_cache_cnt == 0 because
      that means that all extents in extent status tree have been reclaimed.
      
      Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is
      changed to save a local variable in these functions.
      Reported-by: NDave Hansen <dave.hansen@intel.com>
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      d3922a77
  8. 07 6月, 2013 1 次提交
    • T
      ext4: use ext4_da_writepages() for all modes · 20970ba6
      Theodore Ts'o 提交于
      Rename ext4_da_writepages() to ext4_writepages() and use it for all
      modes.  We still need to iterate over all the pages in the case of
      data=journalling, but in the case of nodelalloc/data=ordered (which is
      what file systems mounted using ext3 backwards compatibility will use)
      this will allow us to use a much more efficient I/O submission path.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      20970ba6
  9. 05 6月, 2013 10 次提交
  10. 04 6月, 2013 1 次提交
    • J
      ext4: use io_end for multiple bios · 97a851ed
      Jan Kara 提交于
      Change writeback path to create just one io_end structure for the
      extent to which we submit IO and share it among bios writing that
      extent. This prevents needless splitting and joining of unwritten
      extents when they cannot be submitted as a single bio.
      
      Bugs in ENOMEM handling found by Linux File System Verification project
      (linuxtesting.org) and fixed by Alexey Khoroshilov
      <khoroshilov@ispras.ru>.
      
      CC: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      97a851ed
  11. 01 6月, 2013 1 次提交
  12. 28 5月, 2013 5 次提交
    • L
      ext4: remove unused discard_partial_page_buffers · c121ffd0
      Lukas Czerner 提交于
      The discard_partial_page_buffers is no longer used anywhere so we can
      simply remove it including the *_no_lock variant and
      EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED define.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      c121ffd0
    • L
      ext4: use ext4_zero_partial_blocks in punch_hole · a87dd18c
      Lukas Czerner 提交于
      We're doing to get rid of ext4_discard_partial_page_buffers() since it is
      duplicating some code and also partially duplicating work of
      truncate_pagecache_range(), moreover the old implementation was much
      clearer.
      
      Now when the truncate_inode_pages_range() can handle truncating non page
      aligned regions we can use this to invalidate and zero out block aligned
      region of the punched out range and then use ext4_block_truncate_page()
      to zero the unaligned blocks on the start and end of the range. This
      will greatly simplify the punch hole code. Moreover after this commit we
      can get rid of the ext4_discard_partial_page_buffers() completely.
      
      We also introduce function ext4_prepare_punch_hole() to do come common
      operations before we attempt to do the actual punch hole on
      indirect or extent file which saves us some code duplication.
      
      This has been tested on ppc64 with 1k block size with fsx and xfstests
      without any problems.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      a87dd18c
    • L
      Revert "ext4: fix fsx truncate failure" · eb3544c6
      Lukas Czerner 提交于
      This reverts commit 189e868f.
      
      This commit reintroduces the use of ext4_block_truncate_page() in ext4
      truncate operation instead of ext4_discard_partial_page_buffers().
      
      The statement in the commit description that the truncate operation only
      zero block unaligned portion of the last page is not exactly right,
      since truncate_pagecache_range() also zeroes and invalidate the unaligned
      portion of the page. Then there is no need to zero and unmap it once more
      and ext4_block_truncate_page() was doing the right job, although we
      still need to update the buffer head containing the last block, which is
      exactly what ext4_block_truncate_page() is doing.
      
      Moreover the problem described in the commit is fixed more properly with
      commit
      
      15291164
      	jbd2: clear BH_Delay & BH_Unwritten in journal_unmap_buffer
      
      This was tested on ppc64 machine with block size of 1024 bytes without
      any problems.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      eb3544c6
    • L
      ext4: Call ext4_jbd2_file_inode() after zeroing block · 0713ed0c
      Lukas Czerner 提交于
      In data=ordered mode we should call ext4_jbd2_file_inode() so that crash
      after the truncate transaction has committed does not expose stall data
      in the tail of the block.
      
      Thanks Jan Kara for pointing that out.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      0713ed0c
    • L
      Revert "ext4: remove no longer used functions in inode.c" · d863dc36
      Lukas Czerner 提交于
      This reverts commit ccb4d7af.
      
      This commit reintroduces functions ext4_block_truncate_page() and
      ext4_block_zero_page_range() which has been previously removed in favour
      of ext4_discard_partial_page_buffers().
      
      In future commits we want to reintroduce those function and remove
      ext4_discard_partial_page_buffers() since it is duplicating some code
      and also partially duplicating work of truncate_pagecache_range(),
      moreover the old implementation was much clearer.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      d863dc36
  13. 22 5月, 2013 3 次提交
    • L
      ext4: use ->invalidatepage() length argument · ca99fdd2
      Lukas Czerner 提交于
      ->invalidatepage() aop now accepts range to invalidate so we can make
      use of it in all ext4 invalidatepage routines.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      ca99fdd2
    • L
      jbd2: change jbd2_journal_invalidatepage to accept length · 259709b0
      Lukas Czerner 提交于
      invalidatepage now accepts range to invalidate and there are two file
      system using jbd2 also implementing punch hole feature which can benefit
      from this. We need to implement the same thing for jbd2 layer in order to
      allow those file system take benefit of this functionality.
      
      This commit adds length argument to the jbd2_journal_invalidatepage()
      and updates all instances in ext4 and ocfs2.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      259709b0
    • L
      mm: change invalidatepage prototype to accept length · d47992f8
      Lukas Czerner 提交于
      Currently there is no way to truncate partial page where the end
      truncate point is not at the end of the page. This is because it was not
      needed and the functionality was enough for file system truncate
      operation to work properly. However more file systems now support punch
      hole feature and it can benefit from mm supporting truncating page just
      up to the certain point.
      
      Specifically, with this functionality truncate_inode_pages_range() can
      be changed so it supports truncating partial page at the end of the
      range (currently it will BUG_ON() if 'end' is not at the end of the
      page).
      
      This commit changes the invalidatepage() address space operation
      prototype to accept range to be invalidated and update all the instances
      for it.
      
      We also change the block_invalidatepage() in the same way and actually
      make a use of the new length argument implementing range invalidation.
      
      Actual file system implementations will follow except the file systems
      where the changes are really simple and should not change the behaviour
      in any way .Implementation for truncate_page_range() which will be able
      to accept page unaligned ranges will follow as well.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      d47992f8
  14. 12 5月, 2013 1 次提交
  15. 08 5月, 2013 1 次提交
  16. 23 4月, 2013 1 次提交
  17. 22 4月, 2013 1 次提交
  18. 12 4月, 2013 1 次提交
  19. 10 4月, 2013 2 次提交
    • D
      ext4: fix big-endian bug in metadata checksum calculations · 171a7f21
      Dmitry Monakhov 提交于
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      171a7f21
    • L
      ext4: introduce reserved space · 27dd4385
      Lukas Czerner 提交于
      Currently in ENOSPC condition when writing into unwritten space, or
      punching a hole, we might need to split the extent and grow extent tree.
      However since we can not allocate any new metadata blocks we'll have to
      zero out unwritten part of extent or punched out part of extent, or in
      the worst case return ENOSPC even though use actually does not allocate
      any space.
      
      Also in delalloc path we do reserve metadata and data blocks for the
      time we're going to write out, however metadata block reservation is
      very tricky especially since we expect that logical connectivity implies
      physical connectivity, however that might not be the case and hence we
      might end up allocating more metadata blocks than previously reserved.
      So in future, metadata reservation checks should be removed since we can
      not assure that we do not under reserve.
      
      And this is where reserved space comes into the picture. When mounting
      the file system we slice off a little bit of the file system space (2%
      or 4096 clusters, whichever is smaller) which can be then used for the
      cases mentioned above to prevent costly zeroout, or unexpected ENOSPC.
      
      The number of reserved clusters can be set via sysfs, however it can
      never be bigger than number of free clusters in the file system.
      
      Note that this patch fixes the failure of xfstest 274 as expected.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      27dd4385