1. 10 4月, 2013 3 次提交
    • L
      ext4: introduce reserved space · 27dd4385
      Lukas Czerner 提交于
      Currently in ENOSPC condition when writing into unwritten space, or
      punching a hole, we might need to split the extent and grow extent tree.
      However since we can not allocate any new metadata blocks we'll have to
      zero out unwritten part of extent or punched out part of extent, or in
      the worst case return ENOSPC even though use actually does not allocate
      any space.
      
      Also in delalloc path we do reserve metadata and data blocks for the
      time we're going to write out, however metadata block reservation is
      very tricky especially since we expect that logical connectivity implies
      physical connectivity, however that might not be the case and hence we
      might end up allocating more metadata blocks than previously reserved.
      So in future, metadata reservation checks should be removed since we can
      not assure that we do not under reserve.
      
      And this is where reserved space comes into the picture. When mounting
      the file system we slice off a little bit of the file system space (2%
      or 4096 clusters, whichever is smaller) which can be then used for the
      cases mentioned above to prevent costly zeroout, or unexpected ENOSPC.
      
      The number of reserved clusters can be set via sysfs, however it can
      never be bigger than number of free clusters in the file system.
      
      Note that this patch fixes the failure of xfstest 274 as expected.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      27dd4385
    • J
      ext4: improve credit estimate for EXT4_SINGLEDATA_TRANS_BLOCKS · f45a5ef9
      Jan Kara 提交于
      Estimate of 27 credits for allocation of a block in extent based inode
      is unnecessarily high. We can easily argue 20 is enough.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      f45a5ef9
    • A
      ext4: speed-up releasing blocks on commit · eabe0444
      Andrey Sidorov 提交于
      Improve mb_free_blocks speed by clearing entire range at once instead
      of iterating over each bit. Freeing block-by-block also makes buddy
      bitmap subtree flip twice making most of the work a no-op. Very few
      bits in buddy bitmap require change, e.g. freeing entire group is a 1
      bit flip only.  As a result, releasing blocks of 60G file now takes
      5ms instead of 2.7s.  This is especially good for non-preemptive
      kernels as there is no rescheduling during release.
      Signed-off-by: NAndrey Sidorov <qrxd43@motorola.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      eabe0444
  2. 09 4月, 2013 4 次提交
    • E
      ext4: fix free space estimate in ext4_nonda_switch() · 5c1ff336
      Eric Whitney 提交于
      Values stored in s_freeclusters_counter and s_dirtyclusters_counter
      are both in cluster units.  Remove the cluster to block conversion
      applied to s_freeclusters_counter causing an inflated estimate of
      free space because s_dirtyclusters_counter is not similarly
      converted.  Rename free_blocks and dirty_blocks to better reflect
      the units these variables contain to avoid future confusion.  This
      fix corrects ENOSPC failures for xfstests 127 and 231 on bigalloc
      file systems.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      5c1ff336
    • J
      ext4: fix deadlock with quota feature · bcb13850
      Jan Kara 提交于
      We didn't mark hidden quota files with S_NOQUOTA flag and thus quota was
      accounted even for quota files. Thus we could recurse back to quota code
      when adding new blocks to quota file which can easily deadlock. Mark
      hidden quota files properly.
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      bcb13850
    • D
      ext4: fix incorrect lock ordering for ext4_ind_migrate · e8238f9a
      Dmitry Monakhov 提交于
      existing locking ordering: journal-> i_data_sem, but
      ext4_ind_migrate() grab locks in opposite order which may result in
      deadlock.
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      e8238f9a
    • D
      ext4: implementation of a new ioctl called EXT4_IOC_SWAP_BOOT · 393d1d1d
      Dr. Tilmann Bubeck 提交于
      Add a new ioctl, EXT4_IOC_SWAP_BOOT which swaps i_blocks and
      associated attributes (like i_blocks, i_size, i_flags, ...) from the
      specified inode with inode EXT4_BOOT_LOADER_INO (#5). This is
      typically used to store a boot loader in a secure part of the
      filesystem, where it can't be changed by a normal user by accident.
      The data blocks of the previous boot loader will be associated with
      the given inode.
      
      This usercode program is a simple example of the usage:
      
      int main(int argc, char *argv[])
      {
        int fd;
        int err;
      
        if ( argc != 2 ) {
          printf("usage: ext4-swap-boot-inode FILE-TO-SWAP\n");
          exit(1);
        }
      
        fd = open(argv[1], O_WRONLY);
        if ( fd < 0 ) {
          perror("open");
          exit(1);
        }
      
        err = ioctl(fd, EXT4_IOC_SWAP_BOOT);
        if ( err < 0 ) {
          perror("ioctl");
          exit(1);
        }
      
        close(fd);
        exit(0);
      }
      
      [ Modified by Theodore Ts'o to fix a number of bugs in the original code.]
      Signed-off-by: NDr. Tilmann Bubeck <t.bubeck@reinform.de>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      393d1d1d
  3. 04 4月, 2013 17 次提交
    • L
      ext4: print more info in ext4_print_free_blocks() · f78ee70d
      Lukas Czerner 提交于
      Additionally print i_allocated_meta_blocks information as well.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      f78ee70d
    • L
      ext4: try to prepend extent to the existing one · be8981be
      Lukas Czerner 提交于
      Currently when inserting extent in ext4_ext_insert_extent() we would
      only try to to see if we can append new extent to the found extent. If
      we can not, then we proceed with adding new extent into the extent tree,
      but then possibly merging it back again.
      
      We can avoid this situation by trying to append and prepend new extent
      to the existing ones. However since the new extent can be on either
      sides of the existing extent, we have to pick the right extent to try to
      append/prepend to.
      
      This patch adds the conditions to pick the right extent to
      append/prepend to and adds the actual prepending condition as well. This
      will also eliminate the need to use "reserved" block for possibly
      growing extent tree.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      be8981be
    • L
      ext4: Transfer initialized block to right neighbor if possible · bc2d9db4
      Lukas Czerner 提交于
      Currently when converting extent to initialized we attempt to transfer
      initialized block to the left neighbour if possible when certain
      criteria are met. However we do not attempt to do the same for the
      right neighbor.
      
      This commit adds the possibility to transfer initialized block to the
      right neighbour if:
      
      1. We're not converting the whole extent
      2. Both extents are stored in the same extent tree node
      3. Right neighbor is initialized
      4. Right neighbor is logically abutting the current one
      5. Right neighbor is physically abutting the current one
      6. Right neighbor would not overflow the length limit
      
      This is basically the same logic as with transferring to the left. This
      will gain us some performance benefits since it is faster than inserting
      extent and then merging it.
      
      It would also prevent some situation in delalloc patch when we might run
      out of metadata reservation. This is due to the fact that we would
      attempt to split the extent first (possibly allocating new metadata
      block) even though we did not counted for that because it can (and will)
      be merged again. This commit fix that scenario, because we no longer
      need to split the extent in such case.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      bc2d9db4
    • L
      ext4: introduce ext4_get_group_number() · bd86298e
      Lukas Czerner 提交于
      Currently on many places in ext4 we're using
      ext4_get_group_no_and_offset() even though we're only interested in
      knowing the block group of the particular block, not the offset within
      the block group so we can use more efficient way to compute block
      group.
      
      This patch introduces ext4_get_group_number() which computes block
      group for a given block much more efficiently. Use this function
      instead of ext4_get_group_no_and_offset() everywhere where we're only
      interested in knowing the block group.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      bd86298e
    • L
      ext4: make ext4_block_in_group() much more efficient · 68911009
      Lukas Czerner 提交于
      Currently in when getting the block group number for a particular
      block in ext4_block_in_group() we're using
      ext4_get_group_no_and_offset() which uses do_div() to get the block
      group and the remainer which is offset within the group.
      
      We don't need all of that in ext4_block_in_group() as we only need to
      figure out the group number.
      
      This commit changes ext4_block_in_group() to calculate group number
      directly. This shows as a big improvement with regards to cpu
      utilization. Measuring fallocate -l 15T on fresh file system with perf
      showed that 23% of cpu time was spend in the
      ext4_get_group_no_and_offset(). With this change it completely
      disappears from the list only bumping the occurrence of
      ext4_init_block_bitmap() which is the biggest user of
      ext4_block_in_group() by 4%. As the result of this change on my system
      the fallocate call was approx. 10% faster.
      
      However since there is '-g' option in mkfs which allow us setting
      different groups size (mostly for developers) I've introduced new per
      file system flag whether we have a standard block group size or
      not. The flag is used to determine whether we can use the bit shift
      optimization or not.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      68911009
    • D
      ext4: unregister es_shrinker if mount failed · a75ae78f
      Dmitry Monakhov 提交于
      Otherwise destroyed ext_sb_info will be part of global shinker list
      and result in the following OOPS:
      
      JBD2: corrupted journal superblock
      JBD2: recovery failed
      EXT4-fs (dm-2): error loading journal
      general protection fault: 0000 [#1] SMP
      Modules linked in: fuse acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel microcode sg button sd_mod crc_t10dif ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_\
      mod
      CPU 1
      Pid: 2758, comm: mount Not tainted 3.8.0-rc3+ #136                  /DH55TC
      RIP: 0010:[<ffffffff811bfb2d>]  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
      RSP: 0000:ffff88011d5cbcd8  EFLAGS: 00010207
      RAX: 6b6b6b6b6b6b6b6b RBX: 6b6b6b6b6b6b6b53 RCX: 0000000000000006
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000246
      RBP: ffff88011d5cbce8 R08: 0000000000000002 R09: 0000000000000001
      R10: 0000000000000001 R11: 0000000000000000 R12: ffff88011cd3f848
      R13: ffff88011cd3f830 R14: ffff88011cd3f000 R15: 0000000000000000
      FS:  00007f7b721dd7e0(0000) GS:ffff880121a00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00007fffa6f75038 CR3: 000000011bc1c000 CR4: 00000000000007e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process mount (pid: 2758, threadinfo ffff88011d5ca000, task ffff880116aacb80)
      Stack:
      ffff88011cd3f000 ffffffff8209b6c0 ffff88011d5cbd18 ffffffff812482f1
      00000000000003f3 00000000ffffffea ffff880115f4c200 0000000000000000
      ffff88011d5cbda8 ffffffff81249381 ffff8801219d8bf8 ffffffff00000000
      Call Trace:
      [<ffffffff812482f1>] deactivate_locked_super+0x91/0xb0
      [<ffffffff81249381>] mount_bdev+0x331/0x340
      [<ffffffff81376730>] ? ext4_alloc_flex_bg_array+0x180/0x180
      [<ffffffff81362035>] ext4_mount+0x15/0x20
      [<ffffffff8124869a>] mount_fs+0x9a/0x2e0
      [<ffffffff81277e25>] vfs_kern_mount+0xc5/0x170
      [<ffffffff81279c02>] do_new_mount+0x172/0x2e0
      [<ffffffff8127aa56>] do_mount+0x376/0x380
      [<ffffffff8127ab98>] sys_mount+0x138/0x150
      [<ffffffff818ffed9>] system_call_fastpath+0x16/0x1b
      Code: 8b 05 88 04 eb 00 48 3d 90 ff 06 82 48 8d 58 e8 75 19 4c 89 e7 e8 e4 d7 2c 00 48 c7 c7 00 ff 06 82 e8 58 5f ef ff 5b 41 5c c9 c3 <48> 8b 4b 18 48 8b 73 20 48 89 da 31 c0 48 c7 c7 c5 a0 e4 81 e\
      8
      RIP  [<ffffffff811bfb2d>] unregister_shrinker+0xad/0xe0
      RSP <ffff88011d5cbcd8>
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      a75ae78f
    • D
      ext4: fix journal callback list traversal · 5d3ee208
      Dmitry Monakhov 提交于
      It is incorrect to use list_for_each_entry_safe() for journal callback
      traversial because ->next may be removed by other task:
      ->ext4_mb_free_metadata()
        ->ext4_mb_free_metadata()
          ->ext4_journal_callback_del()
      
      This results in the following issue:
      
      WARNING: at lib/list_debug.c:62 __list_del_entry+0x1c0/0x250()
      Hardware name:
      list_del corruption. prev->next should be ffff88019a4ec198, but was 6b6b6b6b6b6b6b6b
      Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes_x86_64 xts gf128mul ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
      Pid: 16400, comm: jbd2/dm-1-8 Tainted: G        W    3.8.0-rc3+ #107
      Call Trace:
       [<ffffffff8106fb0d>] warn_slowpath_common+0xad/0xf0
       [<ffffffff8106fc06>] warn_slowpath_fmt+0x46/0x50
       [<ffffffff813637e9>] ? ext4_journal_commit_callback+0x99/0xc0
       [<ffffffff8148cae0>] __list_del_entry+0x1c0/0x250
       [<ffffffff813637bf>] ext4_journal_commit_callback+0x6f/0xc0
       [<ffffffff813ca336>] jbd2_journal_commit_transaction+0x23a6/0x2570
       [<ffffffff8108aa42>] ? try_to_del_timer_sync+0x82/0xa0
       [<ffffffff8108b491>] ? del_timer_sync+0x91/0x1e0
       [<ffffffff813d3ecf>] kjournald2+0x19f/0x6a0
       [<ffffffff810ad630>] ? wake_up_bit+0x40/0x40
       [<ffffffff813d3d30>] ? bit_spin_lock+0x80/0x80
       [<ffffffff810ac6be>] kthread+0x10e/0x120
       [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
       [<ffffffff818ff6ac>] ret_from_fork+0x7c/0xb0
       [<ffffffff810ac5b0>] ? __init_kthread_worker+0x70/0x70
      
      This patch fix the issue as follows:
      - ext4_journal_commit_callback() make list truly traversial safe
        simply by always starting from list_head
      - fix race between two ext4_journal_callback_del() and
        ext4_journal_callback_try_del()
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.com
      5d3ee208
    • T
      ext4: support simple conversion of extent-mapped inodes to use i_blocks · 996bb9fd
      Theodore Ts'o 提交于
      In order to make it simpler to test the code which support
      i_blocks/indirect-mapped inodes, support the conversion of inodes
      which are less than 12 blocks and which are contained in no more than
      a single extent.
      
      The primary intended use of this code is to converting freshly created
      zero-length files and empty directories.
      
      Note that the version of chattr in e2fsprogs 1.42.7 and earlier has a
      check that prevents the clearing of the extent flag.  A simple patch
      which allows "chattr -e <file>" to work will be checked into the
      e2fsprogs git repository.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      996bb9fd
    • T
      ext4/jbd2: don't wait (forever) for stale tid caused by wraparound · d76a3a77
      Theodore Ts'o 提交于
      In the case where an inode has a very stale transaction id (tid) in
      i_datasync_tid or i_sync_tid, it's possible that after a very large
      (2**31) number of transactions, that the tid number space might wrap,
      causing tid_geq()'s calculations to fail.
      
      Commit deeeaf13 "jbd2: fix fsync() tid wraparound bug", later modified
      by commit e7b04ac0 "jbd2: don't wake kjournald unnecessarily",
      attempted to fix this problem, but it only avoided kjournald spinning
      forever by fixing the logic in jbd2_log_start_commit().
      
      Unfortunately, in the codepaths in fs/ext4/fsync.c and fs/ext4/inode.c
      that might call jbd2_log_start_commit() with a stale tid, those
      functions will subsequently call jbd2_log_wait_commit() with the same
      stale tid, and then wait for a very long time.  To fix this, we
      replace the calls to jbd2_log_start_commit() and
      jbd2_log_wait_commit() with a call to a new function,
      jbd2_complete_transaction(), which will correctly handle stale tid's.
      
      As a bonus, jbd2_complete_transaction() will avoid locking
      j_state_lock for writing unless a commit needs to be started.  This
      should have a small (but probably not measurable) improvement for
      ext4's scalability.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reported-by: NBen Hutchings <ben@decadent.org.uk>
      Reported-by: NGeorge Barnett <gbarnett@atlassian.com>
      Cc: stable@vger.kernel.org
      
      d76a3a77
    • T
      ext4: add might_sleep() annotations · b10a44c3
      Theodore Ts'o 提交于
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NLukas Czerner <lczerner@redhat.com>
      b10a44c3
    • T
      ext4: add mutex_is_locked() assertion to ext4_truncate() · 19b5ef61
      Theodore Ts'o 提交于
      [ Added fixup from Lukáš Czerner which only checks the assertion when
        the inode is not new and is not being freed. ]
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      19b5ef61
    • T
      ext4: refactor truncate code · 819c4920
      Theodore Ts'o 提交于
      Move common code in ext4_ind_truncate() and ext4_ext_truncate() into
      ext4_truncate().  This saves over 60 lines of code.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      819c4920
    • T
      ext4: refactor punch hole code · 26a4c0c6
      Theodore Ts'o 提交于
      Move common code in ext4_ind_punch_hole() and ext4_ext_punch_hole()
      into ext4_punch_hole().  This saves over 150 lines of code.
      
      This also fixes a potential bug when the punch_hole() code is racing
      against indirect-to-extents or extents-to-indirect migation.  We are
      currently using i_mutex to protect against changes to the inode flag;
      specifically, the append-only, immutable, and extents inode flags.  So
      we need to take i_mutex before deciding whether to use the
      extents-specific or indirect-specific punch_hole code.
      
      Also, there was a missing call to ext4_inode_block_unlocked_dio() in
      the indirect punch codepath.  This was added in commit 02d262df
      to block DIO readers racing against the punch operation in the
      codepath for extent-mapped inodes, but it was missing for
      indirect-block mapped inodes.  One of the advantages of refactoring
      the code is that it makes such oversights much less likely.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      26a4c0c6
    • T
      ext4: fold ext4_alloc_blocks() in ext4_alloc_branch() · 781f143e
      Theodore Ts'o 提交于
      The older code was far more complicated than it needed to be because
      of how we spliced in the ext4's new multiblock allocator into ext3's
      indirect block code.  By folding ext4_alloc_blocks() into
      ext4_alloc_branch(), we make the code far more understable, shave off
      over 130 lines of code and half a kilobyte of compiled object code.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      781f143e
    • Z
      ext4: fold ext4_generic_write_end() into ext4_write_end() · eed4333f
      Zheng Liu 提交于
      After collapsing the handling of data ordered and data writeback
      codepath, ext4_generic_write_end() has only one caller,
      ext4_write_end().  So we fold it into ext4_write_end().
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NLukas Czerner <lczerner@redhat.com>
      eed4333f
    • T
      ext4: collapse handling of data=ordered and data=writeback codepaths · 74d553aa
      Theodore Ts'o 提交于
      The only difference between how we handle data=ordered and
      data=writeback is a single call to ext4_jbd2_file_inode().  Eliminate
      code duplication by factoring out redundant the code paths.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NLukas Czerner <lczerner@redhat.com>
      74d553aa
    • Z
      ext4: fix big-endian bugs which could cause fs corruptions · 8cde7ad1
      Zheng Liu 提交于
      When an extent was zeroed out, we forgot to do convert from cpu to le16.
      It could make us hit a BUG_ON when we try to write dirty pages out.  So
      fix it.
      
      [ Also fix a bug found by Dmitry Monakhov where we were missing
        le32_to_cpu() calls in the new indirect punch hole code.
      
        There are a number of other big endian warnings found by static code
        analyzers, but we'll wait for the next merge window to fix them all
        up.  These fixes are designed to be Obviously Correct by code
        inspection, and easy to demonstrate that it won't make any
        difference (and hence, won't introduce any bugs) on little endian
        architectures such as x86.  --tytso ]
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reported-by: NCAI Qian <caiqian@redhat.com>
      Reported-by: NChristian Kujau <lists@nerdbynature.de>
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
      8cde7ad1
  4. 20 3月, 2013 2 次提交
    • T
      ext4: fix data=journal fast mount/umount hang · 2b405bfa
      Theodore Ts'o 提交于
      In data=journal mode, if we unmount the file system before a
      transaction has a chance to complete, when the journal inode is being
      evicted, we can end up calling into jbd2_log_wait_commit() for the
      last transaction, after the journalling machinery has been shut down.
      
      Arguably we should adjust ext4_should_journal_data() to return FALSE
      for the journal inode, but the only place it matters is
      ext4_evict_inode(), and so to save a bit of CPU time, and to make the
      patch much more obviously correct by inspection(tm), we'll fix it by
      explicitly not trying to waiting for a journal commit when we are
      evicting the journal inode, since it's guaranteed to never succeed in
      this case.
      
      This can be easily replicated via: 
      
           mount -t ext4 -o data=journal /dev/vdb /vdb ; umount /vdb
      
      ------------[ cut here ]------------
      WARNING: at /usr/projects/linux/ext4/fs/jbd2/journal.c:542 __jbd2_log_start_commit+0xba/0xcd()
      Hardware name: Bochs
      JBD2: bad log_start_commit: 3005630206 3005630206 0 0
      Modules linked in:
      Pid: 2909, comm: umount Not tainted 3.8.0-rc3 #1020
      Call Trace:
       [<c015c0ef>] warn_slowpath_common+0x68/0x7d
       [<c02b7e7d>] ? __jbd2_log_start_commit+0xba/0xcd
       [<c015c177>] warn_slowpath_fmt+0x2b/0x2f
       [<c02b7e7d>] __jbd2_log_start_commit+0xba/0xcd
       [<c02b8075>] jbd2_log_start_commit+0x24/0x34
       [<c0279ed5>] ext4_evict_inode+0x71/0x2e3
       [<c021f0ec>] evict+0x94/0x135
       [<c021f9aa>] iput+0x10a/0x110
       [<c02b7836>] jbd2_journal_destroy+0x190/0x1ce
       [<c0175284>] ? bit_waitqueue+0x50/0x50
       [<c028d23f>] ext4_put_super+0x52/0x294
       [<c020efe3>] generic_shutdown_super+0x48/0xb4
       [<c020f071>] kill_block_super+0x22/0x60
       [<c020f3e0>] deactivate_locked_super+0x22/0x49
       [<c020f5d6>] deactivate_super+0x30/0x33
       [<c0222795>] mntput_no_expire+0x107/0x10c
       [<c02233a7>] sys_umount+0x2cf/0x2e0
       [<c02233ca>] sys_oldumount+0x12/0x14
       [<c08096b8>] syscall_call+0x7/0xb
      ---[ end trace 6a954cc790501c1f ]---
      jbd2_log_wait_commit: error: j_commit_request=-1289337090, tid=0
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      2b405bfa
    • T
      ext4: fix ext4_evict_inode() racing against workqueue processing code · 1ada47d9
      Theodore Ts'o 提交于
      Commit 84c17543 (ext4: move work from io_end to inode) triggered a
      regression when running xfstest #270 when the file system is mounted
      with dioread_nolock.
      
      The problem is that after ext4_evict_inode() calls ext4_ioend_wait(),
      this guarantees that last io_end structure has been freed, but it does
      not guarantee that the workqueue structure, which was moved into the
      inode by commit 84c17543, is actually finished.  Once
      ext4_flush_completed_IO() calls ext4_free_io_end() on CPU #1, this
      will allow ext4_ioend_wait() to return on CPU #2, at which point the
      evict_inode() codepath can race against the workqueue code on CPU #1
      accessing EXT4_I(inode)->i_unwritten_work to find the next item of
      work to do.
      
      Fix this by calling cancel_work_sync() in ext4_ioend_wait(), which
      will be renamed ext4_ioend_shutdown(), since it is only used by
      ext4_evict_inode().  Also, move the call to ext4_ioend_shutdown()
      until after truncate_inode_pages() and filemap_write_and_wait() are
      called, to make sure all dirty pages have been written back and
      flushed from the page cache first.
      
      BUG: unable to handle kernel NULL pointer dereference at   (null)
      IP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e
      *pdpt = 0000000030bc3001 *pde = 0000000000000000 
      Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
      Modules linked in:
      Pid: 6, comm: kworker/u:0 Not tainted 3.8.0-rc3-00013-g84c17543-dirty #91 Bochs Bochs
      EIP: 0060:[<c01dda6a>] EFLAGS: 00010046 CPU: 0
      EIP is at cwq_activate_delayed_work+0x3b/0x7e
      EAX: 00000000 EBX: 00000000 ECX: f505fe54 EDX: 00000000
      ESI: ed5b697c EDI: 00000006 EBP: f64b7e8c ESP: f64b7e84
       DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
      CR0: 8005003b CR2: 00000000 CR3: 30bc2000 CR4: 000006f0
      DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
      DR6: ffff0ff0 DR7: 00000400
      Process kworker/u:0 (pid: 6, ti=f64b6000 task=f64b4160 task.ti=f64b6000)
      Stack:
       f505fe00 00000006 f64b7e9c c01de3d7 f6435540 00000003 f64b7efc c01def1d
       f6435540 00000002 00000000 0000008a c16d0808 c040a10b c16d07d8 c16d08b0
       f505fe00 c16d0780 00000000 00000000 ee153df4 c1ce4a30 c17d0e30 00000000
      Call Trace:
       [<c01de3d7>] cwq_dec_nr_in_flight+0x71/0xfb
       [<c01def1d>] process_one_work+0x5d8/0x637
       [<c040a10b>] ? ext4_end_bio+0x300/0x300
       [<c01e3105>] worker_thread+0x249/0x3ef
       [<c01ea317>] kthread+0xd8/0xeb
       [<c01e2ebc>] ? manage_workers+0x4bb/0x4bb
       [<c023a370>] ? trace_hardirqs_on+0x27/0x37
       [<c0f1b4b7>] ret_from_kernel_thread+0x1b/0x28
       [<c01ea23f>] ? __init_kthread_worker+0x71/0x71
      Code: 01 83 15 ac ff 6c c1 00 31 db 89 c6 8b 00 a8 04 74 12 89 c3 30 db 83 05 b0 ff 6c c1 01 83 15 b4 ff 6c c1 00 89 f0 e8 42 ff ff ff <8b> 13 89 f0 83 05 b8 ff 6c c1
       6c c1 00 31 c9 83
      EIP: [<c01dda6a>] cwq_activate_delayed_work+0x3b/0x7e SS:ESP 0068:f64b7e84
      CR2: 0000000000000000
      ---[ end trace a1923229da53d8a4 ]---
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan Kara <jack@suse.cz>
      1ada47d9
  5. 18 3月, 2013 1 次提交
  6. 13 3月, 2013 2 次提交
    • E
      fs: Readd the fs module aliases. · fa7614dd
      Eric W. Biederman 提交于
      I had assumed that the only use of module aliases for filesystems
      prior to "fs: Limit sys_mount to only request filesystem modules."
      was in request_module.  It turns out I was wrong.  At least mkinitcpio
      in Arch linux uses these aliases.
      
      So readd the preexising aliases, to keep from breaking userspace.
      
      Userspace eventually will have to follow and use the same aliases the
      kernel does.  So at some point we may be delete these aliases without
      problems.  However that day is not today.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      fa7614dd
    • L
      ext4: use s_extent_max_zeroout_kb value as number of kb · 4f42f80a
      Lukas Czerner 提交于
      Currently when converting extent to initialized, we have to decide
      whether to zeroout part/all of the uninitialized extent in order to
      avoid extent tree growing rapidly.
      
      The decision is made by comparing the size of the extent with the
      configurable value s_extent_max_zeroout_kb which is in kibibytes units.
      
      However when converting it to number of blocks we currently use it as it
      was in bytes. This is obviously bug and it will result in ext4 _never_
      zeroout extents, but rather always split and convert parts to
      initialized while leaving the rest uninitialized in default setting.
      
      Fix this by using s_extent_max_zeroout_kb as kibibytes.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      4f42f80a
  7. 12 3月, 2013 1 次提交
  8. 11 3月, 2013 10 次提交
    • L
      ext4: reserve metadata block for every delayed write · 386ad67c
      Lukas Czerner 提交于
      Currently we only reserve space (data+metadata) in delayed allocation if
      we're allocating from new cluster (which is always in non-bigalloc file
      system) which is ok for data blocks, because we reserve the whole cluster.
      
      However we have to reserve metadata for every delayed block we're going
      to write because every block could potentially require metedata block
      when we need to grow the extent tree.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      386ad67c
    • L
      ext4: update reserved space after the 'correction' · 232ec872
      Lukas Czerner 提交于
      Currently in ext4_ext_map_blocks() in delayed allocation writeback
      we would update the reservation and after that check whether we claimed
      cluster outside of the range of the allocation and if so, we'll give the
      block back to the reservation pool.
      
      However this also means that if the number of reserved data block
      dropped to zero before the correction, we would release all the metadata
      reservation as well, however we might still need it because the we're
      not done with the delayed allocation and there might be more blocks to
      come. This will result in error messages such as:
      
      EXT4-fs warning (device sdb): ext4_da_update_reserve_space:361: ino 12,
      allocated 1 with only 0 reserved metadata blocks (releasing 1 blocks
      with reserved 1 data blocks)
      
      This will only happen on bigalloc file system and it can be easily
      reproduced using fiemap-tester from xfstests like this:
      
      ./src/fiemap-tester -m DHDHDHDHD -S -p0 /mnt/test/file
      
      Or using xfstests such as 225.
      
      Fix this by doing the correction first and updating the reservation
      after that so that we do not accidentally decrease
      i_reserved_data_blocks to zero.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      232ec872
    • L
      ext4: do not use yield() · bb8b20ed
      Lukas Czerner 提交于
      Using yield() is strongly discouraged (see sched/core.c) especially
      since we can just use cond_resched().
      
      Replace all use of yield() with cond_resched().
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      bb8b20ed
    • L
      ext4: remove unused variable in ext4_free_blocks() · e3d85c36
      Lukas Czerner 提交于
      Remove unused variable 'freed' in ext4_free_blocks().
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      e3d85c36
    • J
      ext4: fix WARN_ON from ext4_releasepage() · e1c36595
      Jan Kara 提交于
      ext4_releasepage() warns when it is passed a page with PageChecked set.
      However this can correctly happen when invalidate_inode_pages2_range()
      invalidates pages - and we should fail the release in that case. Since
      the page was dirty anyway, it won't be discarded and no harm has
      happened but it's good to be safe. Also remove bogus page_has_buffers()
      check - we are guaranteed page has buffers in this function.
      Reported-by: NZheng Liu <gnehzuil.liu@gmail.com>
      Tested-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: NJan Kara <jack@suse.cz>
      e1c36595
    • Z
      ext4: fix the wrong number of the allocated blocks in ext4_split_extent() · 3a225670
      Zheng Liu 提交于
      This commit fixes a wrong return value of the number of the allocated
      blocks in ext4_split_extent.  When the length of blocks we want to
      allocate is greater than the length of the current extent, we return a
      wrong number.  Let's see what happens in the following case when we
      call ext4_split_extent().
      
        map: [48, 72]
        ex:  [32, 64, u]
      
      'ex' will be split into two parts:
        ex1: [32, 47, u]
        ex2: [48, 64, w]
      
      'map->m_len' is returned from this function, and the value is 24.  But
      the real length is 16.  So it should be fixed.
      
      Meanwhile in this commit we use right length of the allocated blocks
      when get_reserved_cluster_alloc in ext4_ext_handle_uninitialized_extents
      is called.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
      Cc: stable@vger.kernel.org
      3a225670
    • Z
      ext4: update extent status tree after an extent is zeroed out · adb23551
      Zheng Liu 提交于
      When we try to split an extent, this extent could be zeroed out and mark
      as initialized.  But we don't know this in ext4_map_blocks because it
      only returns a length of allocated extent.  Meanwhile we will mark this
      extent as uninitialized because we only check m_flags.
      
      This commit update extent status tree when we try to split an unwritten
      extent.  We don't need to worry about the status of this extent because
      we always mark it as initialized.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
      adb23551
    • Z
      ext4: fix wrong m_len value after unwritten extent conversion · cdee7843
      Zheng Liu 提交于
      The ext4_ext_handle_uninitialized_extents() function was assuming the
      return value of ext4_ext_map_blocks() is equal to map->m_len.  This
      incorrect assumption was harmless until we started use status tree as
      a extent cache because we need to update status tree according to
      'm_len' value.
      
      Meanwhile this commit marks EXT4_MAP_MAPPED flag after unwritten extent
      conversion.  It shouldn't cause a bug because we update status tree
      according to checking EXT4_MAP_UNWRITTEN flag.  But it should be fixed.
      
      After applied this commit, the following error message from self-testing
      infrastructure disappears.
      
          ...
          kernel: ES len assertation failed for inode: 230 retval 1 !=
          map->m_len 3 in ext4_map_blocks (allocation)
          ...
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
      cdee7843
    • D
      ext4: add self-testing infrastructure to do a sanity check · 921f266b
      Dmitry Monakhov 提交于
      This commit adds a self-testing infrastructure like extent tree does to
      do a sanity check for extent status tree.  After status tree is as a
      extent cache, we'd better to make sure that it caches right result.
      
      After applied this commit, we will get a lot of messages when we run
      xfstests as below.
      
      ...
      kernel: ES len assertation failed for inode: 230 retval 1 != map->m_len
      3 in ext4_map_blocks (allocation)
      ...
      kernel: ES cache assertation failed for inode: 230 es_cached ex
      [974/2/4781/20] != found ex [974/1/4781/1000]
      ...
      kernel: ES insert assertation failed for inode: 635 ex_status
      [0/45/21388/w] != es_status [44/1/21432/u]
      ...
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      921f266b
    • Z
      ext4: avoid a potential overflow in ext4_es_can_be_merged() · bd384364
      Zheng Liu 提交于
      Check the length of an extent to avoid a potential overflow in
      ext4_es_can_be_merged().
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
      bd384364