1. 12 4月, 2014 1 次提交
  2. 11 4月, 2014 1 次提交
    • T
      ext4: move ext4_update_i_disksize() into mpage_map_and_submit_extent() · 622cad13
      Theodore Ts'o 提交于
      The function ext4_update_i_disksize() is used in only one place, in
      the function mpage_map_and_submit_extent().  Move its code to simplify
      the code paths, and also move the call to ext4_mark_inode_dirty() into
      the i_data_sem's critical region, to be consistent with all of the
      other places where we update i_disksize.  That way, we also keep the
      raw_inode's i_disksize protected, to avoid the following race:
      
            CPU #1                                 CPU #2
      
         down_write(&i_data_sem)
         Modify i_disk_size
         up_write(&i_data_sem)
                                              down_write(&i_data_sem)
                                              Modify i_disk_size
                                              Copy i_disk_size to on-disk inode
                                              up_write(&i_data_sem)
         Copy i_disk_size to on-disk inode
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      622cad13
  3. 08 4月, 2014 1 次提交
    • T
      ext4: update PF_MEMALLOC handling in ext4_write_inode() · 87f7e416
      Theodore Ts'o 提交于
      The special handling of PF_MEMALLOC callers in ext4_write_inode()
      shouldn't be necessary as there shouldn't be any. Warn about it. Also
      update comment before the function as it seems somewhat outdated.
      
      (Changes modeled on an ext3 patch posted by Jan Kara to the linux-ext4
      mailing list on Februaryt 28, 2014, which apparently never went into
      the ext3 tree.)
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.cz>
      87f7e416
  4. 07 4月, 2014 1 次提交
    • K
      ext4: FIBMAP ioctl causes BUG_ON due to handle EXT_MAX_BLOCKS · 4adb6ab3
      Kazuya Mio 提交于
      When we try to get 2^32-1 block of the file which has the extent
      (ee_block=2^32-2, ee_len=1) with FIBMAP ioctl, it causes BUG_ON
      in ext4_ext_put_gap_in_cache().
      
      To avoid the problem, ext4_map_blocks() needs to check the file logical block
      number. ext4_ext_put_gap_in_cache() called via ext4_map_blocks() cannot
      handle 2^32-1 because the maximum file logical block number is 2^32-2.
      
      Note that ext4_ind_map_blocks() returns -EIO when the block number is invalid.
      So ext4_map_blocks() should also return the same errno.
      Signed-off-by: NKazuya Mio <k-mio@sx.jp.nec.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      4adb6ab3
  5. 04 4月, 2014 1 次提交
    • J
      mm + fs: store shadow entries in page cache · 91b0abe3
      Johannes Weiner 提交于
      Reclaim will be leaving shadow entries in the page cache radix tree upon
      evicting the real page.  As those pages are found from the LRU, an
      iput() can lead to the inode being freed concurrently.  At this point,
      reclaim must no longer install shadow pages because the inode freeing
      code needs to ensure the page tree is really empty.
      
      Add an address_space flag, AS_EXITING, that the inode freeing code sets
      under the tree lock before doing the final truncate.  Reclaim will check
      for this flag before installing shadow pages.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91b0abe3
  6. 31 3月, 2014 1 次提交
  7. 25 3月, 2014 4 次提交
  8. 20 3月, 2014 1 次提交
    • T
      ext4: kill i_version support for Hurd-castrated file systems · c4f65706
      Theodore Ts'o 提交于
      The Hurd file system uses uses the inode field which is now used for
      i_version for its translator block.  This means that ext2 file systems
      that are formatted for GNU Hurd can't be used to support NFSv4.  Given
      that Hurd file systems don't support extents, and a huge number of
      modern file system features, this is no great loss.
      
      If we don't do this, the attempt to update the i_version field will
      stomp over the translator block field, which will cause file system
      corruption for Hurd file systems.  This can be replicated via:
      
      mke2fs -t ext2 -o hurd /dev/vdc
      mount -t ext4 /dev/vdc /vdc
      touch /vdc/bug0000
      umount /dev/vdc
      e2fsck -f /dev/vdc
      
      Addresses-Debian-Bug: #738758
      Reported-By: NGabriele Giacone <1o5g4r8o@gmail.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      c4f65706
  9. 19 3月, 2014 1 次提交
    • L
      ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate · b8a86845
      Lukas Czerner 提交于
      Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
      functionality as xfs ioctl XFS_IOC_ZERO_RANGE.
      
      It can be used to convert a range of file to zeros preferably without
      issuing data IO. Blocks should be preallocated for the regions that span
      holes in the file, and the entire range is preferable converted to
      unwritten extents
      
      This can be also used to preallocate blocks past EOF in the same way as
      with fallocate. Flag FALLOC_FL_KEEP_SIZE which should cause the inode
      size to remain the same.
      
      Also add appropriate tracepoints.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      b8a86845
  10. 04 3月, 2014 1 次提交
    • J
      ext4: Speedup WB_SYNC_ALL pass called from sync(2) · 10542c22
      Jan Kara 提交于
      When doing filesystem wide sync, there's no need to force transaction
      commit (or synchronously write inode buffer) separately for each inode
      because ext4_sync_fs() takes care of forcing commit at the end (VFS
      takes care of flushing buffer cache, respectively). Most of the time
      this slowness doesn't manifest because previous WB_SYNC_NONE writeback
      doesn't leave much to write but when there are processes aggressively
      creating new files and several filesystems to sync, the sync slowness
      can be noticeable. In the following test script sync(1) takes around 6
      minutes when there are two ext4 filesystems mounted on a standard SATA
      drive. After this patch sync takes a couple of seconds so we have about
      two orders of magnitude improvement.
      
            function run_writers
            {
              for (( i = 0; i < 10; i++ )); do
                mkdir $1/dir$i
                for (( j = 0; j < 40000; j++ )); do
                  dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
                done &
              done
            }
      
            for dir in "$@"; do
              run_writers $dir
            done
      
            sleep 40
            time sync
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      10542c22
  11. 21 2月, 2014 2 次提交
    • M
      ext4: avoid exposure of stale data in ext4_punch_hole() · e251f9bc
      Maxim Patlasov 提交于
      While handling punch-hole fallocate, it's useless to truncate page cache
      before removing the range from extent tree (or block map in indirect case)
      because page cache can be re-populated (by read-ahead or read(2) or mmap-ed
      read) immediately after truncating page cache, but before updating extent
      tree (or block map). In that case the user will see stale data even after
      fallocate is completed.
      
      Until the problem of data corruption resulting from pages backed by
      already freed blocks is fully resolved, the simple thing we can do now
      is to add another truncation of pagecache after punch hole is done.
      Signed-off-by: NMaxim Patlasov <mpatlasov@parallels.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      e251f9bc
    • T
      ext4: avoid possible overflow in ext4_map_blocks() · e861b5e9
      Theodore Ts'o 提交于
      The ext4_map_blocks() function returns the number of blocks which
      satisfying the caller's request.  This number of blocks requested by
      the caller is specified by an unsigned integer, but the return value
      of ext4_map_blocks() is a signed integer (to accomodate error codes
      per the kernel's standard error signalling convention).
      
      Historically, overflows could never happen since mballoc() will refuse
      to allocate more than 2048 blocks at a time (which is something we
      should fix), and if the blocks were already allocated, the fact that
      there would be some number of intervening metadata blocks pretty much
      guaranteed that there could never be a contiguous region of data
      blocks that was greater than 2**31 blocks.
      
      However, this is now possible if there is a file system which is a bit
      bigger than 8TB, and is created using the new mke2fs hugeblock
      feature, which can create a perfectly contiguous file.  In that case,
      if a userspace program attempted to call fallocate() on this already
      fully allocated file, it's possible that ext4_map_blocks() could
      return a number large enough that it would overflow a signed integer,
      resulting in a ext4 thinking that the ext4_map_blocks() call had
      failed with some strange error code.
      
      Since ext4_map_blocks() is always free to return a smaller number of
      blocks than what was requested by the caller, fix this by capping the
      number of blocks that ext4_map_blocks() will ever try to map to 2**31
      - 1.  In practice this should never get hit, except by someone
      deliberately trying to provke the above-described bug.
      
      Thanks to the PaX team for asking whethre this could possibly happen
      in some off-line discussions about using some static code checking
      technology they are developing to find bugs in kernel code.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      e861b5e9
  12. 26 1月, 2014 1 次提交
  13. 08 1月, 2014 1 次提交
    • T
      ext4: don't pass freed handle to ext4_walk_page_buffers · 8c9367fd
      Theodore Ts'o 提交于
      This is harmless, since ext4_walk_page_buffers only passes the handle
      onto the callback function, and in this call site the function in
      question, bput_one(), doesn't actually use the handle.  But there's no
      point passing in an invalid handle, and it creates a Coverity warning,
      so let's just clean it up.
      
      Addresses-Coverity-Id: #1091168
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      8c9367fd
  14. 07 1月, 2014 2 次提交
  15. 18 12月, 2013 1 次提交
    • J
      ext4: fix deadlock when writing in ENOSPC conditions · 34cf865d
      Jan Kara 提交于
      Akira-san has been reporting rare deadlocks of his machine when running
      xfstests test 269 on ext4 filesystem. The problem turned out to be in
      ext4_da_reserve_metadata() and ext4_da_reserve_space() which called
      ext4_should_retry_alloc() while holding i_data_sem. Since
      ext4_should_retry_alloc() can force a transaction commit, this is a
      lock ordering violation and leads to deadlocks.
      
      Fix the problem by just removing the retry loops. These functions should
      just report ENOSPC to the caller (e.g. ext4_da_write_begin()) and that
      function must take care of retrying after dropping all necessary locks.
      Reported-and-tested-by: NAkira Fujita <a-fujita@rs.jp.nec.com>
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      34cf865d
  16. 06 12月, 2013 1 次提交
  17. 12 11月, 2013 1 次提交
  18. 30 10月, 2013 1 次提交
  19. 18 10月, 2013 1 次提交
    • M
      ext4: fix performance regression in ext4_writepages · aeac589a
      Ming Lei 提交于
      Commit 4e7ea81d(ext4: restructure writeback path) introduces another
      performance regression on random write:
      
      - one more page may be added to ext4 extent in
        mpage_prepare_extent_to_map, and will be submitted for I/O so
        nr_to_write will become -1 before 'done' is set
      
      - the worse thing is that dirty pages may still be retrieved from page
        cache after nr_to_write becomes negative, so lots of small chunks
        can be submitted to block device when page writeback is catching up
        with write path, and performance is hurted.
      
      On one arm A15 board with sata 3.0 SSD(CPU: 1.5GHz dura core, RAM:
      2GB, SATA controller: 3.0Gbps), this patch can improve below test's
      result from 157MB/sec to 174MB/sec(>10%):
      
      	dd if=/dev/zero of=./z.img bs=8K count=512K
      
      The above test is actually prototype of block write in bonnie++
      utility.
      
      This patch makes sure no more pages than nr_to_write can be added to
      extent for mapping, so that nr_to_write won't become negative.
      
      Cc: linux-ext4@vger.kernel.org
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NMing Lei <ming.lei@canonical.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      aeac589a
  20. 16 10月, 2013 1 次提交
  21. 16 9月, 2013 1 次提交
    • J
      ext4: fix performance regression in writeback of random writes · 9c12a831
      Jan Kara 提交于
      The Linux Kernel Performance project guys have reported that commit
      4e7ea81d introduces a performance regression for the following fio
      workload:
      
      [global]
      direct=0
      ioengine=mmap
      size=1500M
      bs=4k
      pre_read=1
      numjobs=1
      overwrite=1
      loops=5
      runtime=300
      group_reporting
      invalidate=0
      directory=/mnt/
      file_service_type=random:36
      file_service_type=random:36
      
      [job0]
      startdelay=0
      rw=randrw
      filename=data0/f1:data0/f2
      
      [job1]
      startdelay=0
      rw=randrw
      filename=data0/f2:data0/f1
      ...
      
      [job7]
      startdelay=0
      rw=randrw
      filename=data0/f2:data0/f1
      
      The culprit of the problem is that after the commit ext4_writepages()
      are more aggressive in writing back pages. Thus we have less consecutive
      dirty pages resulting in more seeking.
      
      This increased aggressivity is caused by a bug in the condition
      terminating ext4_writepages(). We start writing from the beginning of
      the file even if we should have terminated ext4_writepages() because
      wbc->nr_to_write <= 0.
      
      After fixing the condition the throughput of the fio workload is about 20%
      better than before writeback reorganization.
      Reported-by: N"Yan, Zheng" <zheng.z.yan@intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      9c12a831
  22. 13 9月, 2013 1 次提交
  23. 04 9月, 2013 1 次提交
    • C
      direct-io: Implement generic deferred AIO completions · 7b7a8665
      Christoph Hellwig 提交于
      Add support to the core direct-io code to defer AIO completions to user
      context using a workqueue.  This replaces opencoded and less efficient
      code in XFS and ext4 (we save a memory allocation for each direct IO)
      and will be needed to properly support O_(D)SYNC for AIO.
      
      The communication between the filesystem and the direct I/O code requires
      a new buffer head flag, which is a bit ugly but not avoidable until the
      direct I/O code stops abusing the buffer_head structure for communicating
      with the filesystems.
      
      Currently this creates a per-superblock unbound workqueue for these
      completions, which is taken from an earlier patch by Jan Kara.  I'm
      not really convinced about this use and would prefer a "normal" global
      workqueue with a high concurrency limit, but this needs further discussion.
      
      JK: Fixed ext4 part, dynamic allocation of the workqueue.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7b7a8665
  24. 29 8月, 2013 2 次提交
  25. 17 8月, 2013 6 次提交
    • J
      ext4: fix lost truncate due to race with writeback · 90e775b7
      Jan Kara 提交于
      The following race can lead to a loss of i_disksize update from truncate
      thus resulting in a wrong inode size if the inode size isn't updated
      again before inode is reclaimed:
      
      ext4_setattr()				mpage_map_and_submit_extent()
        EXT4_I(inode)->i_disksize = attr->ia_size;
        ...					  ...
      					  disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
      					  /* False because i_size isn't
      					   * updated yet */
      					  if (disksize > i_size_read(inode))
      					  /* True, because i_disksize is
      					   * already truncated */
      					  if (disksize > EXT4_I(inode)->i_disksize)
      					    /* Overwrite i_disksize
      					     * update from truncate */
      					    ext4_update_i_disksize()
        i_size_write(inode, attr->ia_size);
      
      For other places updating i_disksize such race cannot happen because
      i_mutex prevents these races. Writeback is the only place where we do
      not hold i_mutex and we cannot grab it there because of lock ordering.
      
      We fix the race by doing both i_disksize and i_size update in truncate
      atomically under i_data_sem and in mpage_map_and_submit_extent() we move
      the check against i_size under i_data_sem as well.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      90e775b7
    • J
      ext4: simplify truncation code in ext4_setattr() · 5208386c
      Jan Kara 提交于
      Merge conditions in ext4_setattr() handling inode size changes, also
      move ext4_begin_ordered_truncate() call somewhat earlier because it
      simplifies error recovery in case of failure. Also add error handling in
      case i_disksize update fails.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      5208386c
    • J
      ext4: fix ext4_writepages() in presence of truncate · 5f1132b2
      Jan Kara 提交于
      Inode size can arbitrarily change while writeback is in progress. When
      ext4_writepages() has prepared a long extent for mapping and truncate
      then reduces i_size, mpage_map_and_submit_buffers() will always map just
      one buffer in a page instead of all of them due to lblk < blocks check.
      So we end up not using all blocks we've allocated (thus leaking them)
      and also delalloc accounting goes wrong manifesting as a warning like:
      
      ext4_da_release_space:1333: ext4_da_release_space: ino 12, to_free 1
      with only 0 reserved data blocks
      
      Note that the problem can happen only when blocksize < pagesize because
      otherwise we have only a single buffer in the page.
      
      Fix the problem by removing the size check from the mapping loop. We
      have an extent allocated so we have to use it all before checking for
      i_size. We also rename add_page_bufs_to_extent() to
      mpage_process_page_bufs() and make that function submit the page for IO
      if all buffers (upto EOF) in it are mapped.
      Reported-by: NDave Jones <davej@redhat.com>
      Reported-by: NZheng Liu <gnehzuil.liu@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      5f1132b2
    • J
      ext4: move test whether extent to map can be extended to one place · 09930042
      Jan Kara 提交于
      Currently the logic whether the current buffer can be added to an extent
      of buffers to map is split between mpage_add_bh_to_extent() and
      add_page_bufs_to_extent(). Move the whole logic to
      mpage_add_bh_to_extent() which makes things a bit more straightforward
      and make following i_size fixes easier.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      09930042
    • T
      ext4: use unsigned int for es_status values · 3be78c73
      Theodore Ts'o 提交于
      Don't use an unsigned long long for the es_status flags; this requires
      that we pass 64-bit values around which is painful on 32-bit systems.
      Instead pass the extent status flags around using the low 4 bits of an
      unsigned int, and shift them into place when we are reading or writing
      es_pblk.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      3be78c73
    • J
      jbd2: Fix oops in jbd2_journal_file_inode() · a361293f
      Jan Kara 提交于
      Commit 0713ed0c added
      jbd2_journal_file_inode() call into ext4_block_zero_page_range().
      However that function gets called from truncate path and thus inode
      needn't have jinode attached - that happens in ext4_file_open() but
      the file needn't be ever open since mount. Calling
      jbd2_journal_file_inode() without jinode attached results in the oops.
      
      We fix the problem by attaching jinode to inode also in ext4_truncate()
      and ext4_punch_hole() when we are going to zero out partial blocks.
      Reported-by: Nmajianpeng <majianpeng@gmail.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      a361293f
  26. 30 7月, 2013 1 次提交
  27. 16 7月, 2013 1 次提交
    • T
      ext4: call ext4_es_lru_add() after handling cache miss · 63b99968
      Theodore Ts'o 提交于
      If there are no items in the extent status tree, ext4_es_lru_add() is
      a no-op.  So it is not sufficient to call ext4_es_lru_add() before we
      try to lookup an entry in the extent status tree.  We also need to
      call it at the end of ext4_ext_map_blocks(), after items have been
      added to the extent status tree.
      
      This could lead to inodes with that have extent status trees but which
      are not in the LRU list, which means they won't get considered for
      eviction by the es_shrinker.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <wenqing.lz@taobao.com>
      Cc: stable@vger.kernel.org
      63b99968
  28. 13 7月, 2013 1 次提交
  29. 06 7月, 2013 1 次提交