1. 01 Nov, 2011 3 commits
  2. 26 Oct, 2011 1 commit
  3. 25 Oct, 2011 1 commit
    • ext4: update EOFBLOCKS flag on fallocate properly · a4e5d88b
      Committed by Dmitry Monakhov
      EOFBLOCK_FL should be updated if fallocate is called without
      FALLOC_FL_KEEP_SIZE. Currently this happens only if a new extent was
      allocated.
      
      TESTCASE:
      fallocate test_file -n -l4096
      fallocate test_file -l4096
      The last fallocate command updates the size but keeps EOFBLOCK_FL set,
      and fsck will complain about that.
      
      Also remove the ping-pong in ext4_fallocate() in the case of new extents,
      where ext4_ext_map_blocks() clears the EOFBLOCKS bit, and later
      ext4_falloc_update_inode() restores it again.
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  4. 21 Oct, 2011 2 commits
  5. 18 Oct, 2011 1 commit
  6. 09 Oct, 2011 1 commit
  7. 10 Sep, 2011 8 commits
    • ext4: attempt to fix race in bigalloc code path · 5356f261
      Committed by Aditya Kali
      Currently, there exists a race between delayed allocated writes and
      the writeback when bigalloc feature is in use. The race was because we
      wanted to determine what blocks in a cluster are under delayed
      allocation and we were using buffer_delay(bh) check for it. But, the
      writeback codepath clears this bit without any synchronization which
      resulted in a race and an ext4 warning similar to:
      
      EXT4-fs (ram1): ext4_da_update_reserve_space: ino 13, used 1 with only 0
      		reserved data blocks
      
      The race existed in two places.
      (1) between ext4_find_delalloc_range() and ext4_map_blocks() when called from
          writeback code path.
      (2) between ext4_find_delalloc_range() and ext4_da_get_block_prep() (where
          buffer_delay(bh) is set).
      
      To fix (1), this patch introduces a new buffer_head state bit -
      BH_Da_Mapped.  This bit is set under the protection of
      EXT4_I(inode)->i_data_sem when we have actually mapped the delayed
      allocated blocks during the writeout time. We can now reliably check
      for this bit inside ext4_find_delalloc_range() to determine whether
      the reservation for the blocks has already been claimed or not.
      
      To fix (2), it was necessary to set buffer_delay(bh) under the
      protection of i_data_sem.  So, I extracted the very beginning of
      ext4_map_blocks into a new function - ext4_da_map_blocks() - and
      performed the required setting of the BH_Delay bit and the quota
      reservation under the protection of i_data_sem.  These two fixes make
      the checking of buffer_delay(bh) and buffer_da_mapped(bh) consistent,
      thus removing the race.
      
      Tested: I was able to reproduce the problem by running 'dd' and
      'fsync' in parallel. Also, xfstests sometimes used to reproduce this
      race. After the fix both my test and xfstests were successful and no
      race (warning message) was observed.
      
      Google-Bug-Id: 4997027
      Signed-off-by: Aditya Kali <adityakali@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: add some tracepoints in ext4/extents.c · d8990240
      Committed by Aditya Kali
      This patch adds some tracepoints in ext4/extents.c and updates a tracepoint in
      ext4/inode.c.
      
      Tested: Built and ran the kernel and verified that these tracepoints work.
      Also ran xfstests.
      Signed-off-by: Aditya Kali <adityakali@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: rename ext4_has_free_blocks() to ext4_has_free_clusters() · df55c99d
      Committed by Theodore Ts'o
      Rename the function so it is more clear what is going on.  Also rename
      the various variables so it's clearer what's happening.
      
      Also fix a missing blocks to cluster conversion when reading the
      number of reserved blocks for root.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
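The blocks-to-clusters conversion mentioned in the commit above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct cluster_geom`, `blocks_to_clusters()` and `has_free_clusters()` are hypothetical names invented here, not the kernel's functions, and the check is far simpler than the real one. It assumes a cluster is a power-of-two number of blocks and that a reserved-block count must be rounded up to whole clusters before it is compared with a free-cluster count.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-filesystem cluster geometry. */
struct cluster_geom {
    unsigned int cluster_bits;   /* log2(blocks per cluster) */
};

/* Number of blocks in one cluster. */
uint64_t cluster_ratio(const struct cluster_geom *g)
{
    return (uint64_t)1 << g->cluster_bits;
}

/* Convert a block count to the number of clusters needed to hold it,
 * rounding up so that a partially used cluster still counts. */
uint64_t blocks_to_clusters(const struct cluster_geom *g, uint64_t blocks)
{
    assert(g->cluster_bits < 32);
    return (blocks + cluster_ratio(g) - 1) >> g->cluster_bits;
}

/* Simplified model of a free-space check done in cluster units: the
 * root reservation is given in blocks and must be converted first. */
int has_free_clusters(const struct cluster_geom *g, uint64_t free_clusters,
                      uint64_t needed_clusters, uint64_t root_reserved_blocks)
{
    uint64_t reserved = blocks_to_clusters(g, root_reserved_blocks);

    return free_clusters >= needed_clusters + reserved;
}
```

Comparing `free_clusters` directly against a value still expressed in blocks would overstate the reservation by the cluster ratio, which is the kind of unit mismatch the rename is meant to make visible.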
    • ext4: rename ext4_claim_free_blocks() to ext4_claim_free_clusters() · e7d5f315
      Committed by Theodore Ts'o
      This function really claims a number of free clusters, not blocks, so
      rename it so it's clearer what's going on.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: rename ext4_count_free_blocks() to ext4_count_free_clusters() · 5dee5437
      Committed by Theodore Ts'o
      This function really counts the free clusters reported in the block
      group descriptors, so rename it to reduce confusion.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Fix bigalloc quota accounting and i_blocks value · 7b415bf6
      Committed by Aditya Kali
      With bigalloc changes, the i_blocks value was not correctly set (it was still
      set to number of blocks being used, but in case of bigalloc, we want i_blocks
      to represent the number of clusters being used). Since the quota subsystem sets
      the i_blocks value, this patch fixes the quota accounting and makes sure that
      the i_blocks value is set correctly.
      Signed-off-by: Aditya Kali <adityakali@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: convert s_{dirty,free}blocks_counter to s_{dirty,free}clusters_counter · 57042651
      Committed by Theodore Ts'o
      Convert the percpu counters s_dirtyblocks_counter and
      s_freeblocks_counter in struct ext4_sb_info to be
      s_dirtyclusters_counter and s_freeclusters_counter.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: enforce bigalloc restrictions (e.g., no online resizing, etc.) · bab08ab9
      Committed by Theodore Ts'o
      At least initially if the bigalloc feature is enabled, we will not
      support non-extent mapped inodes, online resizing, online defrag, or
      the FITRIM ioctl.  This simplifies the initial implementation.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  8. 07 Sep, 2011 1 commit
    • ext4: fix partial page writes · 02fac129
      Committed by Allison Henderson
      While running extended fsx tests to verify the preceding patches,
      a similar bug was also found in the write operation.
      
      Whenever a write operation begins or ends in a hole,
      or extends EOF, the partial page contained in the hole
      or beyond EOF needs to be zeroed out.
      
      To correct this the new ext4_discard_partial_page_buffers_no_lock
      routine is used to zero out the partial page, but only for buffer
      heads that are already unmapped.
      Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  9. 06 Sep, 2011 1 commit
    • ext4: only call ext4_jbd2_file_inode when an inode has been extended · decbd919
      Committed by Theodore Ts'o
      In delayed allocation mode, it's important to only call
      ext4_jbd2_file_inode when the file has been extended.  This is
      necessary to avoid a race which first got introduced in commit
      678aaf48, but which was made much more common with the introduction
      of the "punch hole" functionality.  (Especially when dioread_nolock
      was enabled, in which case I could reliably reproduce this problem
      with xfstests #74.)
      
      The race is this: If while trying to writeback a delayed allocation
      inode, there is a need to map delalloc blocks, and we run out of space
      in the journal, *and* at the same time the inode is already on the
      committing transaction's t_inode_list (because for example while doing
      the punch hole operation, ext4_jbd2_file_inode() is called), then the
      commit operation will wait for the inode to finish all of its pending
      writebacks by calling filemap_fdatawait(), but since that inode has
      one or more pages with the PageWriteback flag set, the commit
      operation will wait forever, and so the writeback of the inode can
      never take place, and the kjournald thread and the writeback thread
      end up waiting for each other --- forever.
      
      It's important at this point to recall why an inode is placed on the
      t_inode_list; it is to provide the data=ordered guarantees that we
      don't end up exposing stale data.  In the case where we are truncating
      or punching a hole in the inode, there is no possibility that stale
      data could be exposed in the first place, so we don't need to put the
      inode on the t_inode_list!
      
      The right long-term fix is to get rid of data=ordered mode altogether,
      and only update the extent tree or indirect blocks after the data has
      been written.  Until then, this change will also avoid some
      unnecessary waiting in the commit operation.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Allison Henderson <achender@linux.vnet.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
  10. 03 Sep, 2011 1 commit
    • ext4: Add new ext4_discard_partial_page_buffers routines · 4e96b2db
      Committed by Allison Henderson
      This patch adds two new routines: ext4_discard_partial_page_buffers
      and ext4_discard_partial_page_buffers_no_lock.
      
      The ext4_discard_partial_page_buffers routine is a wrapper
      function to ext4_discard_partial_page_buffers_no_lock.
      The wrapper function locks the page and passes it to
      ext4_discard_partial_page_buffers_no_lock.
      Calling functions that already have the page locked can call
      ext4_discard_partial_page_buffers_no_lock directly.
      
      The ext4_discard_partial_page_buffers_no_lock function
      zeros a specified range in a page, and unmaps the
      corresponding buffer heads.  Only block aligned regions of the
      page will have their buffer heads unmapped.  Non-block-aligned regions
      will be mapped if needed so that they can be updated with the
      partial zero out.  This function is meant to
      be used to update a page and its buffer heads to be zeroed
      and unmapped when the corresponding blocks have been released
      or will be released.
      
      This routine is used in the following scenarios:
      * A hole is punched and the non page aligned regions
        of the head and tail of the hole need to be discarded
      
      * The file is truncated and the partial page beyond EOF needs
        to be discarded
      
      * The end of a hole is in the same page as EOF.  After the
        page is flushed, the partial page beyond EOF needs to be
        discarded.
      
      * A write operation begins or ends inside a hole and the partial
        page appearing before or after the write needs to be discarded
      
      * A write operation extends EOF and the partial page beyond EOF
        needs to be discarded
      
      This function takes a flag EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED
      which is used when a write operation begins or ends in a hole.
      When the EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED flag is used, only
      buffer heads that are already unmapped will have the corresponding
      regions of the page zeroed.
      Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
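The zero-and-unmap behaviour described in the commit above can be sketched as a user-space model. Everything here is an assumption made for illustration: `struct toy_page`, `toy_discard_partial_page()` and `TOY_ZERO_UNMAPPED` are hypothetical names, the page is a plain byte array, and the model never reads blocks from disk (the real routine maps partially covered blocks when needed). It shows only the two rules stated in the message: fully covered blocks lose their mapping, and with the "zero unmapped" flag only already-unmapped blocks are zeroed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u
/* Hypothetical analogue of EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED. */
#define TOY_ZERO_UNMAPPED 0x1

struct toy_page {
    unsigned char data[TOY_PAGE_SIZE];
    bool mapped[TOY_PAGE_SIZE / 512];   /* one flag per block in the page */
    size_t block_size;                  /* 512..TOY_PAGE_SIZE, power of two */
};

/* Zero [from, from + len) within the page, block by block. */
void toy_discard_partial_page(struct toy_page *p, size_t from, size_t len,
                              int flags)
{
    assert(p->block_size >= 512 && p->block_size <= TOY_PAGE_SIZE);
    assert(from <= TOY_PAGE_SIZE);

    size_t end = from + len;
    if (end > TOY_PAGE_SIZE)
        end = TOY_PAGE_SIZE;

    for (size_t blk = 0; blk * p->block_size < TOY_PAGE_SIZE; blk++) {
        size_t bstart = blk * p->block_size;
        size_t bend = bstart + p->block_size;
        size_t zs = from > bstart ? from : bstart;   /* overlap start */
        size_t ze = end < bend ? end : bend;         /* overlap end */

        if (zs >= ze)
            continue;                    /* block not touched by the range */
        if ((flags & TOY_ZERO_UNMAPPED) && p->mapped[blk])
            continue;                    /* flag set: leave mapped blocks alone */

        memset(p->data + zs, 0, ze - zs);
        if (zs == bstart && ze == bend)
            p->mapped[blk] = false;      /* only fully covered blocks are unmapped */
    }
}
```

With a 1024-byte block size, discarding 2048 bytes starting at offset 512 zeroes bytes 512 through 2559 but unmaps only the second block, because the first and third blocks are only partially covered.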
  11. 31 Aug, 2011 2 commits
    • ext4: fake direct I/O mode for data=journal · 84ebd795
      Committed by Theodore Ts'o
      Currently attempts to open a file with O_DIRECT in data=journal mode
      cause the open to fail with -EINVAL.  This makes it very hard to test
      data=journal mode.  So we will let the open succeed, but then always
      fall back to O_DSYNC buffered writes.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: remove i_mutex lock in ext4_evict_inode to fix lockdep complaining · 8c0bec21
      Committed by Jiaying Zhang
      The i_mutex lock and flush_completed_IO() added by commit 2581fdc8
      in ext4_evict_inode() cause lockdep to complain about potential
      deadlock in several places.  In most/all of these LOCKDEP complaints
      it looks like it's a false positive, since many of the potential
      circular locking cases can't take place by the time the
      ext4_evict_inode() is called; but since at the very least it may mask
      real problems, we need to address this.
      
      This change removes the flush_completed_IO() and i_mutex lock in
      ext4_evict_inode().  Instead, we take a different approach to resolve
      the software lockup that commit 2581fdc8 intends to fix.  Rather
      than having the ext4-dio-unwritten thread wait to grab the i_mutex
      lock of an inode, we use mutex_trylock() instead, and simply requeue
      the work item if we fail to grab the inode's i_mutex lock.
      
      This should speed up work queue processing in general and also
      prevents the following deadlock scenario: During page fault,
      shrink_icache_memory is called that in turn evicts another inode B.
      Inode B has some pending io_end work so it calls ext4_ioend_wait()
      that waits for inode B's i_ioend_count to become zero.  However, inode
      B's ioend work was queued behind some of inode A's ioend work on the
      same cpu's ext4-dio-unwritten workqueue.  As the ext4-dio-unwritten
      thread on that cpu is processing inode A's ioend work, it tries to
      grab inode A's i_mutex lock.  Since the i_mutex lock of inode A has
      been held since before the page fault happened, we enter a deadlock.
      Signed-off-by: Jiaying Zhang <jiayingz@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
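The trylock-and-requeue approach described in the commit above can be sketched in user space. This is a model under stated assumptions rather than the kernel code: `struct toy_inode`, `struct work_queue`, `process_one()` and the other names are hypothetical, and a C11 `atomic_flag` stands in for the inode's i_mutex. The point it shows is that a worker which fails to take the lock puts the item back on the queue instead of blocking, so work queued behind it for other inodes can still make progress.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode; the flag models i_mutex. */
struct toy_inode {
    atomic_flag lock;
    int completed;               /* number of finished work items */
};

struct work_item {
    struct toy_inode *inode;
    struct work_item *next;
};

struct work_queue {
    struct work_item *head, *tail;
};

void queue_push(struct work_queue *q, struct work_item *w)
{
    w->next = NULL;
    if (q->tail)
        q->tail->next = w;
    else
        q->head = w;
    q->tail = w;
}

struct work_item *queue_pop(struct work_queue *q)
{
    struct work_item *w = q->head;

    if (w) {
        q->head = w->next;
        if (!q->head)
            q->tail = NULL;
    }
    return w;
}

/* Returns true if the lock was acquired; never blocks. */
bool toy_trylock(struct toy_inode *i)
{
    return !atomic_flag_test_and_set(&i->lock);
}

void toy_unlock(struct toy_inode *i)
{
    atomic_flag_clear(&i->lock);
}

/* Handle the item at the head of the queue. Returns true if it ran,
 * false if the queue was empty or the item had to be requeued. */
bool process_one(struct work_queue *q)
{
    struct work_item *w = queue_pop(q);

    if (!w)
        return false;
    assert(w->inode != NULL);
    if (!toy_trylock(w->inode)) {
        queue_push(q, w);        /* lock busy: requeue instead of waiting */
        return false;
    }
    w->inode->completed++;
    toy_unlock(w->inode);
    return true;
}
```

If inode A's lock is held elsewhere while items for A and B are queued in that order, the first `process_one()` call requeues A's item and the second completes B's item, which is the ordering the deadlock scenario in the message needs.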
  12. 14 Aug, 2011 3 commits
    • ext4: fix nomblk_io_submit option so it correctly converts uninit blocks · 9dd75f1f
      Committed by Theodore Ts'o
      Bug discovered by Jan Kara:
      
      Finally, commit 1449032b returned back
      the old IO submission code but apparently it forgot to return the old
      handling of uninitialized buffers so we unconditionally call
      block_write_full_page() without specifying end_io function. So AFAICS
      we never convert unwritten extents to written in some cases. For
      example when I mount the fs as: mount -t ext4 -o
      nomblk_io_submit,dioread_nolock /dev/ubdb /mnt and do
              int fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0600);
              char buf[1024];
              memset(buf, 'a', sizeof(buf));
              fallocate(fd, 0, 0, 16384);
              write(fd, buf, sizeof(buf));
      
      I get a file full of zeros (after remounting the filesystem so that
      pagecache is dropped) instead of seeing the first KB contain 'a's.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@kernel.org
    • ext4: Resolve the hang of direct i/o read in handling EXT4_IO_END_UNWRITTEN. · 32c80b32
      Committed by Tao Ma
      Setting the EXT4_IO_END_UNWRITTEN flag and incrementing
      i_aiodio_unwritten should be done together, since ext4_end_io_nolock
      always clears the flag and decrements the counter at the same time.
      
      We don't increment i_aiodio_unwritten when setting
      EXT4_IO_END_UNWRITTEN, so it goes negative and causes some processes
      to wait forever.
      
      Part of the patch came from Eric in his e-mail, but it doesn't fix the
      problem met by Michael actually.
      
      http://marc.info/?l=linux-ext4&m=131316851417460&w=2
      
      Reported-and-Tested-by: Michael Tokarev <mjt@tls.msk.ru>
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@kernel.org
    • ext4: call ext4_ioend_wait and ext4_flush_completed_IO in ext4_evict_inode · 2581fdc8
      Committed by Jiaying Zhang
      Flush inode's i_completed_io_list before calling ext4_ioend_wait to
      prevent the following deadlock scenario: A page fault happens while
      some process is writing inode A. During page fault,
      shrink_icache_memory is called that in turn evicts another inode
      B. Inode B has some pending io_end work so it calls ext4_ioend_wait()
      that waits for inode B's i_ioend_count to become zero. However, inode
      B's ioend work was queued behind some of inode A's ioend work on the
      same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
      thread on that cpu is processing inode A's ioend work, it tries to
      grab inode A's i_mutex lock. Since the i_mutex lock of inode A has
      been held since before the page fault happened, we enter a deadlock.
      
      Also moves ext4_flush_completed_IO and ext4_ioend_wait from
      ext4_destroy_inode() to ext4_evict_inode(). During inode deletion,
      ext4_evict_inode() is called before ext4_destroy_inode() and in
      ext4_evict_inode(), we may call ext4_truncate() without holding
      i_mutex lock. As a result, there is a race between flush_completed_IO
      that is called from ext4_ext_truncate() and ext4_end_io_work, which
      may cause corruption on an io_end structure. This change moves
      ext4_flush_completed_IO and ext4_ioend_wait from ext4_destroy_inode()
      to ext4_evict_inode() to resolve the race between ext4_truncate() and
      ext4_end_io_work during inode deletion.
      Signed-off-by: Jiaying Zhang <jiayingz@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@kernel.org
  13. 13 Aug, 2011 1 commit
    • ext4: Fix ext4_should_writeback_data() for no-journal mode · 441c8508
      Committed by Curt Wohlgemuth
      ext4_should_writeback_data() had an incorrect sequence of
      tests to determine if it should return 0 or 1: in
      particular, even in no-journal mode, 0 was being returned
      for a non-regular-file inode.
      
      This meant that, in non-journal mode, we would use
      ext4_journalled_aops for directories, symlinks, and other
      non-regular files.  However, calling journalled aop
      callbacks when there is no valid handle, can cause problems.
      
      This would cause a kernel crash with Jan Kara's commit
      2d859db3 ("ext4: fix data corruption in inodes with
      journalled data"), because we now dereference 'handle' in
      ext4_journalled_write_end().
      
      I also added BUG_ONs to check for a valid handle in the
      obviously journal-only aops callbacks.
      
      I tested this running xfstests with a scratch device in
      these modes:
      
         - no-journal
         - data=ordered
         - data=writeback
         - data=journal
      
      All work fine; the data=journal run has many failures and a
      crash in xfstests 074, but this is no different from a
      vanilla kernel.
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@kernel.org
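The ordering problem described in the commit above can be illustrated with a simplified decision function. This is a sketch, not the kernel source: `struct toy_inode_info`, `enum toy_data_mode` and `toy_should_writeback_data()` are hypothetical names, and the inode and mount state are reduced to four fields. What it shows is that the no-journal test has to come before the regular-file test, so that directories and symlinks on a journal-less filesystem never get the journalled code path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of the mount's data journalling mode. */
enum toy_data_mode { TOY_DATA_JOURNAL, TOY_DATA_ORDERED, TOY_DATA_WRITEBACK };

/* Hypothetical, reduced view of the state the decision depends on. */
struct toy_inode_info {
    bool has_journal;         /* false in no-journal mode */
    bool is_regular_file;
    bool journal_data_flag;   /* per-inode "journal my data" flag */
    enum toy_data_mode mode;
};

/* Returns 1 if the inode should use writeback-style data handling. */
int toy_should_writeback_data(const struct toy_inode_info *i)
{
    assert(i != NULL);
    if (!i->has_journal)
        return 1;             /* must be tested first: nothing to journal */
    if (!i->is_regular_file)
        return 0;
    if (i->journal_data_flag)
        return 0;
    return i->mode == TOY_DATA_WRITEBACK;
}
```

With the first two tests swapped, a directory on a no-journal filesystem would return 0 and fall through to journalled handling, which is the situation the message describes.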
  14. 26 Jul, 2011 1 commit
    • ext4: fix data corruption in inodes with journalled data · 2d859db3
      Committed by Jan Kara
      When journalling data for an inode (either because it is a symlink or
      because the filesystem is mounted in data=journal mode), ext4_evict_inode()
      can discard unwritten data by calling truncate_inode_pages(). This is
      because we don't mark the buffer / page dirty when journalling data but only
      add the buffer to the running transaction and thus mm does not know there
      is still unwritten data.
      
      Fix the problem by carefully tracking transaction containing inode's data,
      committing this transaction, and writing uncheckpointed buffers when inode
      should be reaped.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  15. 21 Jul, 2011 4 commits
    • fs: move inode_dio_done to the end_io handler · 72c5052d
      Committed by Christoph Hellwig
      For filesystems that delay their end_io processing we should keep our
      i_dio_count until the processing is done.  Enable this by moving
      the inode_dio_done call to the end_io handler if one exists.  Note that
      the actual move to the workqueue for ext4 and XFS is not done in
      this patch yet, but left to the filesystem maintainers.  At least
      for XFS it's not needed yet either as XFS has an internal equivalent
      to i_dio_count.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: simplify the blockdev_direct_IO prototype · aacfc19c
      Committed by Christoph Hellwig
      Simple filesystems always pass inode->i_sb->s_bdev as the block device
      argument, and never need an end_io handler.  Let's simplify things for
      them and for my grepping activity by dropping these arguments.  The
      only thing not falling into that scheme is ext4, which passes an
      end_io handler without needing special flags (yet), but given how
      messy the direct I/O code there is use of __blockdev_direct_IO
      in one instead of two out of three cases isn't going to make a large
      difference anyway.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: move inode_dio_wait calls into ->setattr · 562c72aa
      Committed by Christoph Hellwig
      Let filesystems handle waiting for direct I/O requests themselves instead
      of doing it beforehand.  This means filesystem-specific locks to prevent
      new dio references from appearing can be held.  This is important to allow
      generalizing i_dio_count to non-DIO_LOCKING filesystems.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • ext4: Rewrite ext4_page_mkwrite() to use generic helpers · 9ea7df53
      Committed by Jan Kara
      Rewrite ext4_page_mkwrite() to use __block_page_mkwrite() helper. This
      removes the need of using i_alloc_sem to avoid races with truncate which
      seems to be the wrong locking order according to lock ordering documented in
      mm/rmap.c. Also calling ext4_da_write_begin() as used by the old code seems to
      be problematic because we can decide to flush delay-allocated blocks which
      will acquire s_umount semaphore - again creating unpleasant lock dependency
      if not directly a deadlock.
      
      Also add a check for frozen filesystem so that we don't busyloop in page fault
      when the filesystem is frozen.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  16. 28 Jun, 2011 5 commits
  17. 08 Jun, 2011 1 commit
    • writeback: introduce .tagged_writepages for the WB_SYNC_NONE sync stage · 6e6938b6
      Committed by Wu Fengguang
      sync(2) is performed in two stages: the WB_SYNC_NONE sync and the
      WB_SYNC_ALL sync. Identify the first stage with .tagged_writepages and
      do livelock prevention for it, too.
      
      Jan's commit f446daae ("mm: implement writeback livelock avoidance
      using page tagging") is a partial fix in that it only fixed the
      WB_SYNC_ALL phase livelock.
      
      Although ext4 is tested to no longer livelock with commit f446daae,
      it may be due to some "redirty_tail() after pages_skipped" effect which
      is by no means a guarantee for _all_ the file systems.
      
      Note that writeback_inodes_sb() is called not only by sync(); all callers
      are treated the same because the other callers also need livelock prevention.
      
      Impact:  It changes the order in which pages/inodes are synced to disk.
      Now in the WB_SYNC_NONE stage, it won't proceed to write the next inode
      until finished with the current inode.
      Acked-by: Jan Kara <jack@suse.cz>
      CC: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  18. 06 Jun, 2011 1 commit
  19. 27 May, 2011 1 commit
    • fs: pass exact type of data dirties to ->dirty_inode · aa385729
      Committed by Christoph Hellwig
      Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or
      anything else, so that the filesystem can track internally if it
      needs to push out a transaction for fdatasync or not.
      
      This is just the prototype change with no user for it yet.  I plan
      to push large XFS changes for the next merge window, and getting
      this trivial infrastructure in this window would help a lot to avoid
      tree interdependencies.
      
      Also remove incorrect comments that ->dirty_inode can't block.  That
      has been changed a long time ago, and many implementations rely on it.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  20. 26 May, 2011 1 commit