1. 06 Jul, 2013 1 commit
  2. 01 Jul, 2013 16 commits
    • A
      ext4: optimize starting extent in ext4_ext_rm_leaf() · 6ae06ff5
      Authored by Ashish Sangwan
      Both hole punch and truncate use ext4_ext_rm_leaf() for removing
      blocks.  Currently we choose the last extent as the starting
      point for removing blocks:
      
      	ex = EXT_LAST_EXTENT(eh);
      
      This is OK for truncate, but for hole punch we can optimize the extent
      selection because the path is already initialized.  We can use this
      information to select the proper starting extent.  The code change in
      this patch will not affect truncate, because for truncate
      path[depth].p_ext will always be NULL.
      Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
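The selection rule described in this commit can be sketched in user-space C. This is a minimal model, not the kernel code: `struct extent`, `struct leaf`, and `pick_start_extent()` are hypothetical names, and the `hint` argument stands in for `path[depth].p_ext`.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (hypothetical). */
struct extent {
    unsigned int first_block;
    unsigned int len;
};

struct leaf {
    struct extent *extents;   /* sorted by first_block */
    size_t count;             /* number of valid extents */
};

/*
 * Pick the extent at which block removal starts.
 * hint == NULL models truncate, where path[depth].p_ext is NULL;
 * a non-NULL hint models hole punch with an initialized path.
 */
static struct extent *pick_start_extent(struct leaf *leaf, struct extent *hint)
{
    assert(leaf != NULL && leaf->count > 0);
    if (hint != NULL)
        return hint;
    /* Fallback: the last extent, like EXT_LAST_EXTENT(eh). */
    return &leaf->extents[leaf->count - 1];
}
```

With a NULL hint the function behaves as before the change, which matches the claim that truncate is unaffected.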
    • T
      jbd2: invalidate handle if jbd2_journal_restart() fails · 41a5b913
      Authored by Theodore Ts'o
      If jbd2_journal_restart() fails the handle will have been disconnected
      from the current transaction.  In this situation, the handle must not
      be used for any jbd2 function other than jbd2_journal_stop().
      Enforce this by treating a handle which has a NULL transaction
      pointer as an aborted handle, and issue a kernel warning if
      jbd2_journal_extend(), jbd2_journal_get_write_access(),
      jbd2_journal_dirty_metadata(), etc. is called with an invalid handle.
      
      This commit also fixes a bug where jbd2_journal_stop() would trip over
      a kernel jbd2 assertion check when trying to free an invalid handle.
      
      Also move the responsibility of setting current->journal_info to
      start_this_handle(), simplifying the three users of this function.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reported-by: Younger Liu <younger.liu@huawei.com>
      Cc: Jan Kara <jack@suse.cz>
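The rule that a handle without a transaction counts as aborted can be illustrated with a small user-space model. The struct layout and the `journal_op()` helper are assumptions made for this sketch; only the idea of checking for a NULL transaction pointer comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

struct transaction {
    int tid;
};

struct handle {
    struct transaction *h_transaction; /* NULL once disconnected */
    int h_aborted;
};

/* A handle with no transaction is treated the same as an aborted one. */
static int handle_is_aborted(const struct handle *h)
{
    assert(h != NULL);
    return h->h_aborted || h->h_transaction == NULL;
}

/*
 * Hypothetical journal operation: it refuses to run on an invalid
 * handle. The kernel would return an error code and warn instead.
 */
static int journal_op(struct handle *h)
{
    if (handle_is_aborted(h))
        return -1;
    return 0;
}
```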
    • T
      ext4: translate flag bits to strings in tracepoints · 21ddd568
      Authored by Theodore Ts'o
      Translate the bitfields used in various flags arguments to strings to
      make the tracepoint output more human-readable.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • T
      ext4: fix up error handling for mpage_map_and_submit_extent() · cb530541
      Authored by Theodore Ts'o
      The function mpage_release_unused_pages() must only be called once;
      otherwise the kernel will BUG() when the second call to
      mpage_release_unused_pages() tries to unlock the pages which had been
      unlocked by the first call.
      
      Also restructure the error handling so that we only give up on writing
      the dirty pages in the case of ENOSPC where retrying the allocation
      won't help.  Otherwise, a transient failure, such as a kmalloc()
      failure in calling ext4_map_blocks() might cause us to give up on
      those pages, leading to a scary message in /var/log/messages plus data
      loss.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
    • T
      jbd2: fix theoretical race in jbd2__journal_restart · 39c04153
      Authored by Theodore Ts'o
      Once we decrement transaction->t_updates, if this is the last handle
      holding the transaction from closing, and once we release the
      t_handle_lock spinlock, it's possible for the transaction to commit
      and be released.  In practice with normal kernels, this probably won't
      happen, since the commit happens in a separate kernel thread and it's
      unlikely this could all happen within the space of a few CPU cycles.
      
      On the other hand, with a real-time kernel, this could potentially
      happen, so save the tid found in transaction->t_tid before we release
      t_handle_lock.  It would require an insane configuration, such as one
      where the jbd2 thread was set to a very high real-time priority,
      perhaps because a high priority real-time thread is trying to read or
      write to a file system.  But some people who use real-time kernels
      have been known to do insane things, including controlling
      laser-wielding industrial robots.  :-)
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • L
      ext4: only zero partial blocks in ext4_zero_partial_blocks() · e1be3a92
      Authored by Lukas Czerner
      Currently, if we pass a range that covers an entire block into
      ext4_zero_partial_blocks(), we attempt to zero it even though we should
      only zero the unaligned part of a block.
      
      Fix this by checking whether the range covers the whole block and
      skipping the zeroing if so.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
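As a rough illustration of zeroing only partial blocks, the sketch below computes which parts of a byte range cover a block only partly. It is a user-space model under the assumption of a power-of-two block size; `partial_ranges()` and `struct zero_range` are hypothetical names, not the kernel interface.

```c
#include <assert.h>
#include <stdint.h>

struct zero_range {
    uint64_t start;
    uint64_t len;
};

/*
 * Compute the sub-ranges of [offset, offset + length) that cover a block
 * only partly. Fully covered blocks are skipped. Returns the number of
 * ranges written to out (0, 1 or 2).
 */
static int partial_ranges(uint64_t offset, uint64_t length,
                          uint64_t blocksize, struct zero_range out[2])
{
    assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);

    int n = 0;
    uint64_t end = offset + length;            /* exclusive */
    uint64_t head = offset & (blocksize - 1);  /* bytes into first block */
    uint64_t tail = end & (blocksize - 1);     /* bytes into last block */

    if (length == 0)
        return 0;

    /* Unaligned start, and the range ends inside the same block. */
    if (head != 0 && offset / blocksize == (end - 1) / blocksize) {
        out[n].start = offset;
        out[n].len = length;
        return n + 1;
    }
    if (head != 0) {                 /* unaligned start of the range */
        out[n].start = offset;
        out[n].len = blocksize - head;
        n++;
    }
    if (tail != 0) {                 /* unaligned end of the range */
        out[n].start = end - tail;
        out[n].len = tail;
        n++;
    }
    return n;
}
```

A range that is aligned at both ends yields zero sub-ranges, which is the case this commit stops zeroing.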
    • T
      ext4: check error return from ext4_write_inline_data_end() · 42c832de
      Authored by Theodore Ts'o
      The function ext4_write_inline_data_end() can return an error.  So we
      need to assign it to a signed integer variable to check for an error
      return (since copied is an unsigned int).
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <wenqing.lz@taobao.com>
      Cc: stable@vger.kernel.org
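The signed-versus-unsigned pitfall behind this fix can be shown in a short user-space sketch. `write_end()` is a hypothetical stand-in for a function that returns a byte count or a negative errno; it is not the ext4 function itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns the number of bytes copied on success
 * or a negative errno value on failure.
 */
static int write_end(int fail, unsigned int copied)
{
    return fail ? -EIO : (int)copied;
}

/* What the caller sees if the result is stored in an unsigned variable. */
static unsigned int store_unsigned(int ret)
{
    return (unsigned int)ret;   /* a negative errno becomes a huge count */
}

/* Fixed pattern: capture the result in a signed int and test it first. */
static int finish_write(int fail, unsigned int *copied)
{
    int ret;

    assert(copied != NULL);
    ret = write_end(fail, *copied);
    if (ret < 0)
        return ret;             /* propagate the error */
    *copied = (unsigned int)ret;
    return 0;
}
```

Storing the return value directly in the unsigned `copied` would turn the error into a very large positive count, so the failure would go unnoticed.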
    • J
      ext4: delete unnecessary C statements · 353eefd3
      Authored by jon ernst
      Checking whether an unsigned variable is less than 0 is always false.
      The statement err = 0 is duplicated and unnecessary.
      
      [ tytso: Also cleaned up error handling in ext4_block_zero_page_range() ]
      Signed-off-by: "Jon Ernst" <jonernst07@gmx.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • A
      ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree() · 64cb9273
      Authored by Al Viro
      In both ext3 and ext4, htree_dirblock_to_tree() is just filling the
      in-core rbtree for use by call_filldir().  All updates of ->f_pos are
      done by the latter; bumping it here (on error) is obviously wrong - we
      might very well have it nowhere near the block we'd found an error in.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
    • T
      jbd2: move superblock checksum calculation to jbd2_write_superblock() · fe52d17c
      Authored by Theodore Ts'o
      Some of the functions which modify the jbd2 superblock were not
      updating the checksum before calling jbd2_write_superblock().  Move
      the call to jbd2_superblock_csum_set() to jbd2_write_superblock(), so
      that the checksum is calculated consistently.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: stable@vger.kernel.org
    • A
      ext4: pass inode pointer instead of file pointer to punch hole · aeb2817a
      Authored by Ashish Sangwan
      There is no need to pass a file pointer when we can pass the inode
      pointer directly.
      Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
      Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • B
      ext4: improve free space calculation for inline_data · c4932dbe
      Authored by boxi liu
      The ext4 inline_data feature uses the xattr space to store inline
      data in the inode.  When the inline data is accounted for as an xattr,
      padding is added.  But get_max_inline_xattr_value_size() counts the
      free space without the padding.  As a result, some contents are moved
      to a block even though they could be stored in the inode.
      Signed-off-by: liulei <lewis.liulei@huawei.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Tao Ma <boyu.mt@taobao.com>
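The mismatch described above comes from counting padding in one place but not in another. The sketch below shows a consistent pair of helpers; it is not the ext4 code, the 4-byte padding unit is an assumption made for illustration, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed padding unit for xattr values; an assumption for this sketch. */
#define XATTR_PAD   4u
#define XATTR_ROUND (XATTR_PAD - 1u)

/* Space a value of len bytes occupies once padding is added. */
static size_t padded_size(size_t len)
{
    return (len + XATTR_ROUND) & ~(size_t)XATTR_ROUND;
}

/* Largest value length that fits in free_bytes when padding is counted. */
static size_t max_value_len(size_t free_bytes)
{
    return free_bytes & ~(size_t)XATTR_ROUND;
}

/* Store-side check, using the same rounding as max_value_len(). */
static int value_fits(size_t len, size_t free_bytes)
{
    assert(len <= (size_t)-1 - XATTR_ROUND);   /* avoid overflow */
    return padded_size(len) <= free_bytes;
}
```

Because both helpers use the same rounding, a value of `max_value_len(free)` bytes always passes `value_fits()`.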
    • J
      ext4: reduce object size when !CONFIG_PRINTK · e7c96e8e
      Authored by Joe Perches
      Reducing the object size by ~10% could be useful for embedded systems.
      
      Add #ifdef CONFIG_PRINTK #else #endif blocks to hold formats and
      arguments, passing " " to functions when !CONFIG_PRINTK and still
      verifying format and arguments with no_printk.
      
      $ size fs/ext4/built-in.o*
         text	   data	    bss	    dec	    hex	filename
       239375	    610	    888	 240873	  3ace9	fs/ext4/built-in.o.new
       264167	    738	    888	 265793	  40e41	fs/ext4/built-in.o.old
      
          $ grep -E "CONFIG_EXT4|CONFIG_PRINTK" .config
          # CONFIG_PRINTK is not set
          CONFIG_EXT4_FS=y
          CONFIG_EXT4_USE_FOR_EXT23=y
          CONFIG_EXT4_FS_POSIX_ACL=y
          # CONFIG_EXT4_FS_SECURITY is not set
          # CONFIG_EXT4_DEBUG is not set
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • Z
      ext4: improve extent cache shrink mechanism to avoid to burn CPU time · d3922a77
      Authored by Zheng Liu
      Now we maintain a proper in-order LRU list in ext4 to reclaim entries
      from the extent status tree when we are under heavy memory pressure.
      To keep this order, a spin lock is used to protect the list.  But this
      lock burns a lot of CPU time.  We can use the following steps to
      trigger it.
      
        % cd /dev/shm
        % dd if=/dev/zero of=ext4-img bs=1M count=2k
        % mkfs.ext4 ext4-img
        % mount -t ext4 -o loop ext4-img /mnt
        % cd /mnt
        % for ((i=0;i<160;i++)); do truncate -s 64g $i; done
        % for ((i=0;i<160;i++)); do cp $i /dev/null &; done
        % perf record -a -g
        % perf report
      
      This commit tries to fix this problem.  A new member called
      i_touch_when is added to ext4_inode_info to record the last access
      time for an inode, so we no longer need to keep a proper in-order
      LRU list, which avoids burning CPU time.  When we try to reclaim
      some entries from the extent status tree, we use list_sort() to get
      a proper in-order list.  Then we traverse this list to discard some
      entries.  In ext4_sb_info, we use s_es_last_sorted to record the last
      time this list was sorted.  When we traverse the list, we skip any
      inode that is newer than this time and move it to the tail of the LRU
      list.  When the head of the list is newer than s_es_last_sorted, we
      sort the LRU list again.
      
      In this commit, we break the loop if s_extent_cache_cnt == 0 because
      that means that all extents in extent status tree have been reclaimed.
      
      Meanwhile in this commit, ext4_es_{un}register_shrinker()'s prototype is
      changed to save a local variable in these functions.
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
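The core idea, recording a timestamp on access and sorting only at reclaim time, can be sketched in user-space C. This simplified model uses `qsort()` on an array in place of `list_sort()` on a list and leaves out the s_es_last_sorted optimization; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct inode_info {
    unsigned long touch_when;   /* models i_touch_when */
    int cached_extents;         /* reclaimable entries for this inode */
};

/* Order inodes from least to most recently touched. */
static int by_touch_when(const void *a, const void *b)
{
    const struct inode_info *x = a;
    const struct inode_info *y = b;

    return (x->touch_when > y->touch_when) - (x->touch_when < y->touch_when);
}

/*
 * Sort lazily at reclaim time, then drop entries starting with the
 * oldest inode until nr_to_scan entries are freed. Returns the number
 * of entries freed.
 */
static int reclaim(struct inode_info *v, size_t n, int nr_to_scan)
{
    int freed = 0;

    assert(v != NULL || n == 0);
    qsort(v, n, sizeof(*v), by_touch_when);
    for (size_t i = 0; i < n && freed < nr_to_scan; i++) {
        int take = v[i].cached_extents;

        if (take > nr_to_scan - freed)
            take = nr_to_scan - freed;
        v[i].cached_extents -= take;
        freed += take;
    }
    return freed;
}
```

Accesses only update `touch_when`, so no lock-protected list reordering is needed on the hot path; the ordering cost is paid once per reclaim pass.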
    • A
      ext4: implement error handling of ext4_mb_new_preallocation() · 2c00ef3e
      Authored by Alexey Khoroshilov
      If memory allocation in ext4_mb_new_group_pa() fails,
      it returns an error code and ext4_mb_new_preallocation() propagates it,
      but ext4_mb_new_blocks() ignores it.
      
      An observed result was:
      
      - allocation fail means ext4_mb_new_group_pa() does not update
        ext4_allocation_context;
      
      - ext4_mb_new_blocks() sets ext4_allocation_request->len (ar->len =
        ac->ac_b_ex.fe_len;) to number of blocks preallocated (512) instead
        of number of blocks requested (1);
      
      - that activates update cycle in ext4_splice_branch():
          for (i = 1; i < blks; i++) <-- blks is 512 instead of 1 here
            *(where->p + i) = cpu_to_le32(current_block++);
      
      - it iterates 511 times and corrupts a chunk of memory including inode
        structure;
      
      - page fault happens at EXT4_SB(inode->i_sb) in ext4_mark_inode_dirty();
      
      - system hangs with 'scheduling while atomic' BUG.
      
      The patch implements a check for ext4_mb_new_preallocation() error
      code and handles its failure as if ext4_mb_regular_allocator() fails.
      
      Found by Linux File System Verification project (linuxtesting.org).
      
      [ Patch restructured by tytso to make the flow of control easier to follow. ]
      Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • M
      ext4: fix corruption when online resizing a fs with 1K block size · 6ca792ed
      Authored by Maarten ter Huurne
      Subtracting the number of the first data block places the superblock
      backups one block too early, corrupting the file system. When the block
      size is larger than 1K, the first data block is 0, so the subtraction
      has no effect and no corruption occurs.
      Signed-off-by: Maarten ter Huurne <maarten@treewalker.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      CC: stable@vger.kernel.org
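The off-by-one described in this commit can be reproduced with plain arithmetic. The sketch below is a user-space model with hypothetical function names; it assumes the first data block is 1 for a 1K block size and 0 for larger block sizes, as the commit message states.

```c
#include <assert.h>
#include <stdint.h>

/* First block of a block group, where its superblock backup lives. */
static uint64_t group_first_block(uint32_t group, uint32_t blocks_per_group,
                                  uint32_t first_data_block)
{
    assert(blocks_per_group > 0);
    return (uint64_t)group * blocks_per_group + first_data_block;
}

/* The erroneous variant: subtracting the first data block again. */
static uint64_t backup_block_buggy(uint32_t group, uint32_t blocks_per_group,
                                   uint32_t first_data_block)
{
    return group_first_block(group, blocks_per_group, first_data_block)
           - first_data_block;
}
```

With a first data block of 0 both functions agree, which is why the corruption only shows up on file systems with 1K blocks.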
  3. 17 Jun, 2013 1 commit
  4. 13 Jun, 2013 9 commits
    • J
      ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents · 72dac95d
      Authored by Jie Liu
      Return the FIEMAP_EXTENT_UNKNOWN flag in addition to
      FIEMAP_EXTENT_DELALLOC, because the data location of a
      delayed allocation extent is unknown.
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
    • P
      jbd2: remove debug dependency on debug_fs and update Kconfig help text · 75497d06
      Authored by Paul Gortmaker
      Commit b6e96d00 ("jbd2: use module parameters instead of debugfs
      for jbd_debug") removed any need for a dependency on DEBUG_FS.  It
      also moved the /sys variables out from underneath the typical debugfs
      mount point.  Delete the dependency and update the /sys path to where
      the debug settings are currently.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • P
      jbd2: use a single printk for jbd_debug() · 169f1a2a
      Authored by Paul Gortmaker
      Since the jbd_debug() is implemented with two separate printk()
      calls, it can lead to corrupted and misleading debug output like
      the following (see lines marked with "*"):
      
      [  290.339362] (fs/jbd2/journal.c, 203): kjournald2: kjournald2 wakes
      [  290.339365] (fs/jbd2/journal.c, 155): kjournald2: commit_sequence=42103, commit_request=42104
      [  290.339369] (fs/jbd2/journal.c, 158): kjournald2: OK, requests differ
      [* 290.339376] (fs/jbd2/journal.c, 648): jbd2_log_wait_commit:
      [* 290.339379] (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: want 42104, j_commit_sequence=42103
      [* 290.339382] JBD2: starting commit of transaction 42104
      [  290.339410] (fs/jbd2/revoke.c, 566): jbd2_journal_write_revoke_records: Wrote 0 revoke records
      [  290.376555] (fs/jbd2/commit.c, 1088): jbd2_journal_commit_transaction: JBD2: commit 42104 complete, head 42079
      
      i.e. the debug output from log_wait_commit and journal_commit_transaction
      have become interleaved.  The output should have been:
      
      (fs/jbd2/journal.c, 648): jbd2_log_wait_commit: JBD2: want 42104, j_commit_sequence=42103
      (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: starting commit of transaction 42104
      
      It is expected that this is not easy to replicate -- I was only able
      to cause it on preempt-rt kernels, and even then only under heavy
      I/O load.
      Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Suggested-by: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
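The general remedy for interleaved output is to build the whole line first and emit it with one call. The following user-space sketch shows that idea with `snprintf()` and `vsnprintf()`; `format_debug()` is a hypothetical helper, and the kernel change itself may be structured differently.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the location prefix and the message into one buffer so the
 * caller can emit the line with a single output call. Returns the
 * length of the string stored in buf, or a negative value on error.
 */
static int format_debug(char *buf, size_t size, const char *file, int line,
                        const char *func, const char *fmt, ...)
{
    va_list ap;
    size_t used;
    int n, m;

    assert(buf != NULL && size > 0);
    n = snprintf(buf, size, "(%s, %d): %s: ", file, line, func);
    if (n < 0)
        return n;
    used = strlen(buf);          /* shorter than n if the prefix was cut */

    va_start(ap, fmt);
    m = vsnprintf(buf + used, size - used, fmt, ap);
    va_end(ap);
    return m < 0 ? m : (int)strlen(buf);
}
```

A caller would then pass the finished buffer to a single `fputs()` or similar call, so another thread's output cannot land between the prefix and the message.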
    • P
      jbd2: fix duplicate debug label for phase 2 · cfc7bc89
      Authored by Paul Gortmaker
      Currently we see this output:
      
        $ git grep phase fs/jbd2
        fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 1\n");
        fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 2\n");
        fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 2\n");
        fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 3\n");
        fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 4\n");
        [...]
      
      There is clearly a duplicate label for phase 2, and they are
      both active (i.e. not in #if ... #else block).  Rename them to
      be "2a" and "2b" so the debug output is unambiguous.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • P
      jbd2: drop checkpoint mutex when waiting in __jbd2_log_wait_for_space() · 0ef54180
      Authored by Paul Gortmaker
      While trying to debug an issue under extreme I/O loading
      on preempt-rt kernels, the following backtrace was observed
      via SysRQ output:
      
      rm              D ffff8802203afbc0  4600  4878   4748 0x00000000
       ffff8802217bfb78 0000000000000082 ffff88021fc2bb80 ffff88021fc2bb80
       ffff88021fc2bb80 ffff8802217bffd8 ffff8802217bffd8 ffff8802217bffd8
       ffff88021f1d4c80 ffff88021fc2bb80 ffff8802217bfb88 ffff88022437b000
      Call Trace:
       [<ffffffff8172dc34>] schedule+0x24/0x70
       [<ffffffff81225b5d>] jbd2_log_wait_commit+0xbd/0x140
       [<ffffffff81060390>] ? __init_waitqueue_head+0x50/0x50
       [<ffffffff81223635>] jbd2_log_do_checkpoint+0xf5/0x520
       [<ffffffff81223b09>] __jbd2_log_wait_for_space+0xa9/0x1f0
       [<ffffffff8121dc40>] start_this_handle.isra.10+0x2e0/0x530
       [<ffffffff81060390>] ? __init_waitqueue_head+0x50/0x50
       [<ffffffff8121e0a3>] jbd2__journal_start+0xc3/0x110
       [<ffffffff811de7ce>] ? ext4_rmdir+0x6e/0x230
       [<ffffffff8121e0fe>] jbd2_journal_start+0xe/0x10
       [<ffffffff811f308b>] ext4_journal_start_sb+0x5b/0x160
       [<ffffffff811de7ce>] ext4_rmdir+0x6e/0x230
       [<ffffffff811435c5>] vfs_rmdir+0xd5/0x140
       [<ffffffff8114370f>] do_rmdir+0xdf/0x120
       [<ffffffff8105c6b4>] ? task_work_run+0x44/0x80
       [<ffffffff81002889>] ? do_notify_resume+0x89/0x100
       [<ffffffff817361ae>] ? int_signal+0x12/0x17
       [<ffffffff81145d85>] sys_unlinkat+0x25/0x40
       [<ffffffff81735f22>] system_call_fastpath+0x16/0x1b
      
      What is interesting here, is that we call log_wait_commit, from
      within wait_for_space, but we are still holding the checkpoint_mutex
      as it surrounds mostly the whole of wait_for_space.  And then, as we
      are waiting, journal_commit_transaction can run, and if the JBD2_FLUSHED
      bit is set, then we will also try to take the same checkpoint_mutex.
      
      It seems that we need to drop the checkpoint_mutex while sitting in
      jbd2_log_wait_commit, if we want to guarantee that progress can be made
      by jbd2_journal_commit_transaction().  There does not seem to be
      anything preempt-rt specific about this, other than perhaps increasing
      the odds of it happening.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • P
      jbd2: relocate assert after state lock in journal_commit_transaction() · 3ca841c1
      Authored by Paul Gortmaker
      The state lock is taken after we are doing an assert on the state
      value, not before.  So we might in fact be doing an assert on a
      transient value.  Ensure the state check is within the scope of
      the state lock being taken.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • D
      ext4: Fix fsync error handling after filesystem abort · 4418e141
      Authored by Dmitry Monakhov
      If the filesystem was aborted after an inode's writeback completed
      but before its metadata was updated, we may return success, which
      results in data loss.
      In order to handle a filesystem abort correctly, we have to check
      the filesystem state once we discover that it is in the MS_RDONLY state.
      
      Test case: http://patchwork.ozlabs.org/patch/244297
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • D
      ext4: fix data integrity for ext4_sync_fs · 06a407f1
      Authored by Dmitry Monakhov
      An inode's data or non-journaled quota may be written without the
      journal, so we _must_ send a barrier at the end of ext4_sync_fs. But it
      can be skipped if the journal commit will do it for us.
      
      Also fix data integrity for nojournal mode.
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • D
      jbd2: optimize jbd2_journal_force_commit · 9ff86446
      Authored by Dmitry Monakhov
      The current implementation of jbd2_journal_force_commit() is suboptimal
      because it results in empty and useless commits, whereas callers just
      want to force and wait for any unfinished commits. We already have
      jbd2_journal_force_commit_nested(), which does exactly what we want,
      except that here we are guaranteed not to hold a journal transaction open.
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  5. 12 Jun, 2013 2 commits
    • T
      ext4: don't use EXT4_FREE_BLOCKS_FORGET unnecessarily · 981250ca
      Authored by Theodore Ts'o
      Commit 18888cf0: "ext4: speed up truncate/unlink by not using
      bforget() unless needed" removed the use of EXT4_FREE_BLOCKS_FORGET in
      the most important codepath for file systems using extents, but a
      similar optimization also can be done for file systems using indirect
      blocks, and for the two special cases in the ext4 extents code.
      
      Cc: Andrey Sidorov <qrxd43@motorola.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • T
      ext4: add cond_resched() to ext4_free_blocks() & ext4_mb_regular_allocator() · 2ed5724d
      Authored by Theodore Ts'o
      For a file system with a very large number of block groups, if all of
      the block group bitmaps are in memory and the file system is
      relatively badly fragmented, it's possible for
      ext4_mb_regular_allocator() to take a long time trying to find a good
      match.  This is especially true if the tuning parameter mb_max_to_scan
      has been set to a very large number.  So add a cond_resched() to avoid
      soft lockup warnings and to provide better system responsiveness.
      
      For ext4_free_blocks(), if we are deleting a large range of blocks,
      and data=journal is enabled so that EXT4_FREE_BLOCKS_FORGET is passed,
      the loop to call sb_find_get_block() and to call ext4_forget() can
      take over 10-15 milliseconds or more.  So it's better to add a
      cond_resched() here as well.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  6. 07 Jun, 2013 1 commit
    • T
      ext4: use ext4_da_writepages() for all modes · 20970ba6
      Authored by Theodore Ts'o
      Rename ext4_da_writepages() to ext4_writepages() and use it for all
      modes.  We still need to iterate over all the pages in the case of
      data=journalling, but in the case of nodelalloc/data=ordered (which is
      what file systems mounted using ext3 backwards compatibility will use)
      this will allow us to use a much more efficient I/O submission path.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  7. 06 Jun, 2013 4 commits
  8. 05 Jun, 2013 6 commits