1. 27 Sep, 2012 (1 commit)
  2. 20 Sep, 2012 (1 commit)
    • ext4: speed up truncate/unlink by not using bforget() unless needed · 18888cf0
      Andrey Sidorov committed
      Do not iterate over data blocks scanning for bh's to forget, as they
      never exist. This improves the time taken by the unlink / truncate syscalls.
      Tested by continuously truncating a file that is being written by dd.
      Another test is rm -rf of a linux tree while tar unpacks it. With
      ordered data mode, the condition unlikely(!tbh) was always met in
      ext4_free_blocks. With journal data mode, tbh was found only a few times,
      so the optimisation is also possible.
      
      Unlinking fallocated 60G file after doing sync && echo 3 >
      /proc/sys/vm/drop_caches && time rm --help
      
      X86 before (linux 3.6-rc4):
      # time rm -f test1
      real    0m2.710s
      user    0m0.000s
      sys     0m1.530s
      
      X86 after:
      # time rm -f test1
      real    0m0.644s
      user    0m0.003s
      sys     0m0.060s
      
      MIPS before (linux 2.6.37):
      # time rm -f test1
      real    0m 4.93s
      user    0m 0.00s
      sys     0m 4.61s
      
      MIPS after:
      # time rm -f test1
      real    0m 0.16s
      user    0m 0.00s
      sys     0m 0.06s
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrey Sidorov <qrxd43@motorola.com>
  3. 19 Aug, 2012 (2 commits)
  4. 17 Aug, 2012 (3 commits)
    • ext4: make the zero-out chunk size tunable · 67a5da56
      Zheng Liu committed
      Currently in ext4 the length of the zero-out chunk is set to 7 file system
      blocks.  But if an inode has uninitialized extents from using
      fallocate to preallocate space, and the workload issues many random
      writes, this can fragment the extents and unnecessarily grow the
      extent tree.
      
      So create a new sysfs tunable, extent_max_zeroout_kb, which controls
      the maximum size where blocks will be zeroed out instead of creating a
      new uninitialized extent.  The default of this has been set to 32kb.
      
      CC: Zach Brown <zab@zabbo.net>
      CC: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
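The tunable described above is expressed in KB, while extent lengths are counted in file system blocks. The following is a minimal sketch of that conversion and of the resulting zero-out decision. The helper names `max_zeroout_blocks` and `should_zero_out` are hypothetical and are not taken from the kernel source; the "zero out when the extent is no longer than the limit" rule is an assumption drawn from the commit text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Convert a limit given in KB into file system blocks.
 * blocksize_bits is log2 of the block size (10 for 1K, 12 for 4K blocks). */
static uint32_t max_zeroout_blocks(uint32_t max_zeroout_kb, unsigned blocksize_bits)
{
    assert(blocksize_bits >= 10 && blocksize_bits <= 16);
    return max_zeroout_kb >> (blocksize_bits - 10);
}

/* Hypothetical policy helper (assumption, not kernel code):
 * true  -> zero the whole extent and mark it initialized,
 * false -> split it and keep an uninitialized remainder. */
static bool should_zero_out(uint32_t extent_len_blocks,
                            uint32_t max_zeroout_kb,
                            unsigned blocksize_bits)
{
    uint32_t limit = max_zeroout_blocks(max_zeroout_kb, blocksize_bits);

    /* A limit of 0 disables zeroing out entirely in this sketch. */
    return limit != 0 && extent_len_blocks <= limit;
}
```

With the default of 32 KB mentioned in the commit message, this sketch gives a limit of 8 blocks on a 4K-block file system and 32 blocks on a 1K-block file system.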
    • ext4: collapse a single extent tree block into the inode if possible · ecb94f5f
      Theodore Ts'o committed
      If an inode has more than 4 extents, but then later some of the
      extents are merged together, we can optimize the file system by moving
      the extents up into the inode, and discarding the extent tree block.
      This is important, because if there are a large number of inodes with
      external extent tree blocks whose contents could fit in the
      inode, this can significantly increase the fsck time of the file
      system.
      
      Google-Bug-Id: 6801242
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
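The condition described in this commit can be sketched as a simple predicate. This is an illustration only: `struct tree_shape`, `INODE_EXTENT_SLOTS`, and `can_collapse_into_inode` are hypothetical names, and the capacity of 4 in-inode extents is taken from the commit message ("more than 4 extents") rather than from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption from the commit text: the inode itself holds up to 4 extents. */
#define INODE_EXTENT_SLOTS 4

/* Hypothetical summary of an extent tree, for illustration only. */
struct tree_shape {
    unsigned depth;        /* 0 = extents live directly in the inode */
    unsigned root_entries; /* index entries stored in the inode      */
    unsigned leaf_entries; /* extents in the single external block   */
};

/* True when the only external block could be folded back into the inode. */
static bool can_collapse_into_inode(const struct tree_shape *t)
{
    assert(t != NULL);
    return t->depth == 1 &&
           t->root_entries == 1 &&
           t->leaf_entries <= INODE_EXTENT_SLOTS;
}
```

A tree that grew to depth 1 with six extents and later merged down to three would satisfy the predicate, so its external block could be discarded.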
    • ext4: fix kernel BUG on large-scale rm -rf commands · 89a4e48f
      Theodore Ts'o committed
      Commit 968dee77: "ext4: fix hole punch failure when depth is greater
      than 0" introduced a regression in v3.5.1/v3.6-rc1 which caused kernel
      crashes when users ran "rm -rf" on a large directory hierarchy on
      ext4 filesystems on RAID devices:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
      
          Process rm (pid: 18229, threadinfo ffff8801276bc000, task ffff880123631710)
          Call Trace:
           [<ffffffff81236483>] ? __ext4_handle_dirty_metadata+0x83/0x110
           [<ffffffff812353d3>] ext4_ext_truncate+0x193/0x1d0
           [<ffffffff8120a8cf>] ? ext4_mark_inode_dirty+0x7f/0x1f0
           [<ffffffff81207e05>] ext4_truncate+0xf5/0x100
           [<ffffffff8120cd51>] ext4_evict_inode+0x461/0x490
           [<ffffffff811a1312>] evict+0xa2/0x1a0
           [<ffffffff811a1513>] iput+0x103/0x1f0
           [<ffffffff81196d84>] do_unlinkat+0x154/0x1c0
           [<ffffffff8118cc3a>] ? sys_newfstatat+0x2a/0x40
           [<ffffffff81197b0b>] sys_unlinkat+0x1b/0x50
           [<ffffffff816135e9>] system_call_fastpath+0x16/0x1b
          Code: 8b 4d 20 0f b7 41 02 48 8d 04 40 48 8d 04 81 49 89 45 18 0f b7 49 02 48 83 c1 01 49 89 4d 00 e9 ae f8 ff ff 0f 1f 00 49 8b 45 28 <48> 8b 40 28 49 89 45 20 e9 85 f8 ff ff 0f 1f 80 00 00 00
      
          RIP  [<ffffffff81233164>] ext4_ext_remove_space+0xa34/0xdf0
      
      This could be reproduced as follows:
      
      The problem in commit 968dee77 was that it caused the variable 'i' to
      be left uninitialized if the truncate required more space than was
      available in the journal.  This resulted in the function
      ext4_ext_truncate_extend_restart() returning -EAGAIN, which caused
      ext4_ext_remove_space() to restart the truncate operation after
      starting a new jbd2 handle.
      Reported-by: Maciej Żenczykowski <maze@google.com>
      Reported-by: Marti Raudsepp <marti@juffo.org>
      Tested-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
  5. 23 Jul, 2012 (1 commit)
    • ext4: fix hole punch failure when depth is greater than 0 · 968dee77
      Ashish Sangwan committed
      Whether to continue removing extents or not is decided by the return
      value of function ext4_ext_more_to_rm() which checks 2 conditions:
      a) if there are no more indexes to process.
      b) if the number of entries is decreased in the header of "depth - 1".
      
      In case of hole punch, if the last block to be removed is not part of
      the last extent index then this index will not be deleted, hence the
      number of valid entries in the extent header of "depth - 1" will
      remain as it is and ext4_ext_more_to_rm will return 0 although the
      required blocks are not yet removed.
      
      This patch fixes the above mentioned problem: instead of removing
      the extents from the end of file, it starts removing the blocks from
      the particular extent from which removing blocks is actually required
      and continues backward until done.
      Signed-off-by: Ashish Sangwan <ashish.sangwan2@gmail.com>
      Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
      Reviewed-by: Lukas Czerner <lczerner@redhat.com>
      Cc: stable@vger.kernel.org
  6. 10 Jul, 2012 (1 commit)
  7. 01 Jul, 2012 (1 commit)
  8. 01 Jun, 2012 (1 commit)
    • ext4: hole-punch use truncate_pagecache_range · 5e44f8c3
      Hugh Dickins committed
      When truncating a file, we unmap pages from userspace first, as that's
      usually more efficient than relying, page by page, on the fallback in
      truncate_inode_page() - particularly if the file is mapped many times.
      
      Do the same when punching a hole: 3.4 added truncate_pagecache_range()
      to do the unmap and trunc, so use it in ext4_ext_punch_hole(), instead
      of calling truncate_inode_pages_range() directly.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  9. 29 May, 2012 (1 commit)
  10. 30 Apr, 2012 (2 commits)
  11. 17 Apr, 2012 (1 commit)
  12. 13 Apr, 2012 (1 commit)
  13. 22 Mar, 2012 (1 commit)
    • ext4: remove restrictive checks for EOFBLOCKS_FL · afcff5d8
      Lukas Czerner committed
      We are going to remove the EOFBLOCKS_FL flag in the future, so this is
      the first part of the removal. We can not remove it entirely just now,
      since e2fsck is still checking for it and removing it might cause headaches for
      some people. Instead, remove the restrictive checks now and the rest
      later, when the new e2fsck code is out and common enough.
      
      This is also needed because punch hole already breaks the EOFBLOCKS_FL
      semantics, so it might cause some trouble. So simply remove it.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  14. 20 Mar, 2012 (4 commits)
    • 92b97816
    • ext4: give more helpful error message in ext4_ext_rm_leaf() · dc1841d6
      Lukas Czerner committed
      The error message produced by ext4_ext_rm_leaf() when we are
      removing blocks which accidentally end up inside an existing extent
      is not very helpful, because we would also like to know which extent
      we collided with.
      
      This commit changes the error message to get us also the information
      about the extent we are colliding with.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: remove unused code from ext4_ext_map_blocks() · 7877191c
      Lukas Czerner committed
      Since the commit 'Rewrite punch hole to use ext4_ext_remove_space()'
      reworked the punch hole implementation to use ext4_ext_remove_space()
      instead of ext4_ext_map_blocks(), we can remove the code which is no
      longer needed from the ext4_ext_map_blocks().
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: rewrite punch hole to use ext4_ext_remove_space() · 5f95d21f
      Lukas Czerner committed
      This commit rewrites the ext4 punch hole implementation to use
      ext4_ext_remove_space() instead of its home grown way of doing this via
      ext4_ext_map_blocks(). There are several reasons for changing this.
      
      Firstly, it is quite non-obvious that punching a hole needs to use
      ext4_ext_map_blocks(), especially given that this
      function should map blocks, not unmap them. It also required a lot of new
      code in ext4_ext_map_blocks().
      
      Secondly, the design of it is not very efficient. The reason is that we
      are trying to punch out blocks in ext4_ext_punch_hole() in the opposite
      direction to ext4_ext_rm_leaf(), which causes ext4_ext_rm_leaf()
      to iterate through the whole tree from the end to the start to find the
      requested extent for every extent we are going to punch out.
      
      And finally, the current implementation does not use the existing code,
      but brings a lot of new code, which is IMO unnecessary since there
      already is some infrastructure we can use. Specifically
      ext4_ext_remove_space().
      
      This commit changes ext4_ext_remove_space() to accept 'end' parameter so
      we can not only truncate to the end of file, but also remove the space
      in the middle of the file (punch a hole). Moreover, because the last
      block to punch out, might be in the middle of the extent, we have to
      split the extent at 'end + 1' so ext4_ext_rm_leaf() can easily either
      remove the whole first part of the split extent, or change its size.
      
      ext4_ext_remove_space() is then used to actually remove the space
      (extents) from within the hole, instead of ext4_ext_map_blocks().
      
      Note that this also fixes the issue with punch hole, where we would forget
      to remove empty index blocks from the extent tree, resulting in a double
      free block error and file system corruption. This is simply because we
      now use a different code path, where this problem does not exist.
      
      This has been tested with fsx running for several days and xfstests,
      plus xfstest #251 with '-o discard' run on the loop image (which
      converts discard requests into punch hole to the backing file). All of
      it on 1K and 4K file system block sizes.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  15. 12 Mar, 2012 (1 commit)
  16. 09 Jan, 2012 (1 commit)
  17. 29 Dec, 2011 (1 commit)
  18. 19 Dec, 2011 (2 commits)
    • ext4: optimize ext4_find_delalloc_range() in nodelalloc mode · 8c48f7e8
      Robin Dong committed
      We found a performance regression when using bigalloc with "nodelalloc"
      (1MB cluster size):
      
      1. mke2fs -C 1048576 -O ^has_journal,bigalloc /dev/sda
      2. mount -o nodelalloc /dev/sda /test/
      3. time dd if=/dev/zero of=/test/io bs=1048576 count=1024
      
      The "dd" takes about 2 seconds to finish, but if we mke2fs without
      "bigalloc", "dd" takes less than 1 second.
      
      The reason is: when using ext4 with "nodelalloc", it will call
      ext4_find_delalloc_cluster() nearly every time it calls
      ext4_ext_map_blocks(), and ext4_find_delalloc_range() will also scan
      all pages in the cluster because no buffer is "delayed".  A cluster has
      256 pages (1MB cluster), so it will scan 256 * 256k pages when creating
      a 1G file. That severely hurts the performance.
      
      Therefore, we return immediately from ext4_find_delalloc_range() in
      nodelalloc mode, since by definition there can't be any delalloc
      pages.
      Signed-off-by: Robin Dong <sanbai@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
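The "256 * 256k pages" figure in this commit message can be reproduced with a short calculation. The sketch below assumes 4 KB pages (implied by "256 pages" per 1MB cluster) and one cluster scan per page written. The function name `pages_scanned` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate how many page lookups the unoptimized path performs when
 * writing a file: one scan of the whole cluster for every page written.
 * This models the commit's description; it is not kernel code. */
static uint64_t pages_scanned(uint64_t file_bytes,
                              uint64_t cluster_bytes,
                              uint64_t page_bytes)
{
    assert(page_bytes != 0 && cluster_bytes % page_bytes == 0);

    uint64_t pages_in_file = file_bytes / page_bytes;        /* 256k for 1G / 4K */
    uint64_t pages_per_cluster = cluster_bytes / page_bytes; /* 256 for 1M / 4K  */

    return pages_in_file * pages_per_cluster;
}
```

For a 1G file with 1MB clusters this comes to 262144 * 256 = 67108864 page lookups, all of which are skipped once the function returns early in nodelalloc mode.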
    • ext4: remove unused local variable · 14d7f3ef
      Curt Wohlgemuth committed
      In get_implied_cluster_alloc(), rr_cluster_end was being
      defined and set, but was never used.  Removed this.
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  19. 14 Dec, 2011 (1 commit)
  20. 13 Dec, 2011 (1 commit)
    • ext4: Fix crash due to getting bogus eh_depth value on big-endian systems · b4611abf
      Paul Mackerras committed
      Commit 1939dd84 ("ext4: cleanup ext4_ext_grow_indepth code") added a
      reference to ext4_extent_header.eh_depth, but forgot to pass the value
      read through le16_to_cpu.  The result is a crash on big-endian
      machines, such as this crash on a POWER7 server:
      
      attempt to access beyond end of device
      sda8: rw=0, want=776392648163376, limit=168558560
      Unable to handle kernel paging request for data at address 0x6b6b6b6b6b6b6bcb
      Faulting instruction address: 0xc0000000001f5f38
      cpu 0x14: Vector: 300 (Data Access) at [c000001bd1aaecf0]
          pc: c0000000001f5f38: .__brelse+0x18/0x60
          lr: c0000000002e07a4: .ext4_ext_drop_refs+0x44/0x80
          sp: c000001bd1aaef70
         msr: 9000000000009032
         dar: 6b6b6b6b6b6b6bcb
       dsisr: 40000000
        current = 0xc000001bd15b8010
        paca    = 0xc00000000ffe4600
          pid   = 19911, comm = flush-8:0
      enter ? for help
      [c000001bd1aaeff0] c0000000002e07a4 .ext4_ext_drop_refs+0x44/0x80
      [c000001bd1aaf090] c0000000002e0c58 .ext4_ext_find_extent+0x408/0x4c0
      [c000001bd1aaf180] c0000000002e145c .ext4_ext_insert_extent+0x2bc/0x14c0
      [c000001bd1aaf2c0] c0000000002e3fb8 .ext4_ext_map_blocks+0x628/0x1710
      [c000001bd1aaf420] c0000000002b2974 .ext4_map_blocks+0x224/0x310
      [c000001bd1aaf4d0] c0000000002b7f2c .mpage_da_map_and_submit+0xbc/0x490
      [c000001bd1aaf5a0] c0000000002b8688 .write_cache_pages_da+0x2c8/0x430
      [c000001bd1aaf720] c0000000002b8b28 .ext4_da_writepages+0x338/0x670
      [c000001bd1aaf8d0] c000000000157280 .do_writepages+0x40/0x90
      [c000001bd1aaf940] c0000000001ea830 .writeback_single_inode+0xe0/0x530
      [c000001bd1aafa00] c0000000001eb680 .writeback_sb_inodes+0x210/0x300
      [c000001bd1aafb20] c0000000001ebc84 .__writeback_inodes_wb+0xd4/0x140
      [c000001bd1aafbe0] c0000000001ebfec .wb_writeback+0x2fc/0x3e0
      [c000001bd1aafce0] c0000000001ed770 .wb_do_writeback+0x2f0/0x300
      [c000001bd1aafdf0] c0000000001ed848 .bdi_writeback_thread+0xc8/0x340
      [c000001bd1aafed0] c0000000000c5494 .kthread+0xb4/0xc0
      [c000001bd1aaff90] c000000000021f48 .kernel_thread+0x54/0x70
      
      This is due to getting ext_depth(inode) == 0x101 and therefore running
      off the end of the path array in ext4_ext_drop_refs into following
      unallocated structures.
      
      This fixes it by adding the necessary le16_to_cpu.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
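The value 0x101 reported above is consistent with a depth of 1 being incremented without byte-order conversion on a big-endian CPU. The following sketch simulates that in portable C so it behaves the same on any host. All helper names are hypothetical, and the exact shape of the faulty kernel expression is an assumption; only the missing le16_to_cpu is confirmed by the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a 16-bit little-endian on-disk field, independent of host order
 * (the portable equivalent of what le16_to_cpu achieves). */
static uint16_t le16_decode(const uint8_t b[2])
{
    return (uint16_t)(b[0] | (b[1] << 8));
}

/* Encode a CPU value into a little-endian on-disk field. */
static void le16_encode(uint8_t b[2], uint16_t v)
{
    b[0] = (uint8_t)(v & 0xff);
    b[1] = (uint8_t)(v >> 8);
}

/* What a big-endian CPU sees if it loads the two bytes with no conversion. */
static uint16_t raw_load_big_endian(const uint8_t b[2])
{
    return (uint16_t)((b[0] << 8) | b[1]);
}

/* Assumed buggy pattern: increment the unconverted value, then store it. */
static uint16_t grow_depth_buggy(uint8_t field[2])
{
    le16_encode(field, (uint16_t)(raw_load_big_endian(field) + 1));
    return le16_decode(field);
}

/* Fixed pattern: convert to CPU order first, then increment and store. */
static uint16_t grow_depth_fixed(uint8_t field[2])
{
    le16_encode(field, (uint16_t)(le16_decode(field) + 1));
    return le16_decode(field);
}
```

Starting from an on-disk depth of 1 (bytes 01 00), the buggy path yields 0x101 while the fixed path yields 2.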
  21. 02 Nov, 2011 (2 commits)
  22. 01 Nov, 2011 (2 commits)
    • ext4: Don't normalize an falloc request if it can fit in 1 extent. · 3c6fe770
      Greg Harm committed
      If an fallocate request fits in EXT_UNINIT_MAX_LEN, then set the
      EXT4_GET_BLOCKS_NO_NORMALIZE flag. For larger fallocate requests,
      let mballoc.c normalize the request.
      
      This fixes a problem where large requests were being split into
      non-contiguous extents due to commit 556b27ab: ext4: do not
      normalize block requests from fallocate.
      
      Testing: 
      *) Checked that 8.x MB falloc'ed files are still laid down next to
      each other (contiguously).
      *) Checked that the maximum size extent (127.9MB) is allocated as 1
      extent.
      *) Checked that a 1GB file is somewhat contiguous (often 5-6
      non-contiguous extents now).
      *) Checked that a 120MB file can still be falloc'ed even if there are
      no single extents large enough to hold it.
      Signed-off-by: Greg Harm <gharm@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwritten · 0edeb71d
      Tao Ma committed
      Setting the EXT4_IO_END_UNWRITTEN flag and incrementing i_aiodio_unwritten
      should be done together, since ext4_end_io_nolock always clears
      the flag and decrements the counter at the same time.
      
      We have found some bugs where the flag is set while
      i_aiodio_unwritten is left unchanged (commit 32c80b32). So this patch
      creates a helper function that wraps both, to avoid any future bug.
      The idea is inspired by Eric.
      
      Cc: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  23. 29 Oct, 2011 (3 commits)
  24. 27 Oct, 2011 (2 commits)
    • ext4: optimize memmmove lengths in extent/index insertions · 80e675f9
      Eric Gouriou committed
      ext4_ext_insert_extent() (respectively ext4_ext_insert_index())
      was using EXT_MAX_EXTENT() (resp. EXT_MAX_INDEX()) to determine
      how many entries needed to be moved beyond the insertion point.
      In practice this means that (320 - I) * 24 bytes were memmove()'d
      when I is the insertion point, rather than (#entries - I) * 24 bytes.
      
      This patch uses EXT_LAST_EXTENT() (resp. EXT_LAST_INDEX()) instead
      to only move existing entries. The code flow is also simplified
      slightly to highlight similarities and reduce code duplication in
      the insertion logic.
      
      This patch reduces system CPU consumption by over 25% on a 4kB
      synchronous append DIO write workload when used with the
      pre-2.6.39 x86_64 memmove() implementation. With the much faster
      2.6.39 memmove() implementation we still see a decrease in
      system CPU usage between 2% and 7%.
      
      Note that the ext_debug() output changes with this patch, splitting
      some log information between entries. Users of the ext_debug() output
      should note that the "move %d" units changed from reporting the number
      of bytes moved to reporting the number of entries moved.
      Signed-off-by: Eric Gouriou <egouriou@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
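The idea of moving only the existing entries past the insertion point, rather than everything up to the array's capacity, can be sketched on a plain array. `struct entry` and `insert_entry` are hypothetical stand-ins for the on-disk extent and index structures and are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size record standing in for an extent or index entry. */
struct entry {
    unsigned start;
    unsigned len;
};

/* Insert e at index pos in arr[0..*count). Only the entries that actually
 * exist after pos are shifted, i.e. (*count - pos), not (capacity - pos).
 * Returns the number of entries moved. */
static size_t insert_entry(struct entry *arr, size_t *count, size_t capacity,
                           size_t pos, struct entry e)
{
    assert(*count < capacity && pos <= *count);

    size_t to_move = *count - pos;
    if (to_move > 0)
        memmove(&arr[pos + 1], &arr[pos], to_move * sizeof(arr[0]));

    arr[pos] = e;
    (*count)++;
    return to_move;
}
```

With three entries in an eight-slot array, inserting at index 1 shifts two entries instead of seven, which is the kind of saving the commit message describes for nearly empty blocks.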
    • ext4: optimize ext4_ext_convert_to_initialized() · 6f91bc5f
      Eric Gouriou committed
      This patch introduces a fast path in ext4_ext_convert_to_initialized()
      for the case when the conversion can be performed by transferring
      the newly initialized blocks from the uninitialized extent into
      an adjacent initialized extent. Doing so removes the expensive
      invocations of memmove() which occur during extent insertion and
      the subsequent merge.
      
      In practice this should be the common case for clients performing
      append writes into files pre-allocated via
      fallocate(FALLOC_FL_KEEP_SIZE). In such a workload performed via
      direct IO and when using a suboptimal implementation of memmove()
      (x86_64 prior to the 2.6.39 rewrite), this patch reduces kernel CPU
      consumption by 32%.
      
      Two new trace points are added to ext4_ext_convert_to_initialized()
      to offer visibility into its operations. No exit trace point has
      been added due to the multiplicity of return points. This can be
      revisited once the upstream cleanup is backported.
      Signed-off-by: Eric Gouriou <egouriou@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
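The fast path described in this commit, transferring newly written blocks from an uninitialized extent into the adjacent initialized one, can be sketched as follows. `struct extent` and `absorb_into_prev` are hypothetical, and the sketch ignores details such as the maximum extent length and journaling. It also declines the case where the whole uninitialized extent is consumed, which is a simplification chosen here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory extent, for illustration only. */
struct extent {
    uint32_t lblk;   /* first logical block      */
    uint64_t pblk;   /* first physical block     */
    uint32_t len;    /* length in blocks         */
    bool     uninit; /* true if not yet written  */
};

/* Try to move the first n blocks of the uninitialized extent cur into the
 * initialized extent prev that precedes it. Succeeds only when the write
 * starts at the head of cur and the two extents are contiguous both
 * logically and physically, so no entry has to be inserted or shifted. */
static bool absorb_into_prev(struct extent *prev, struct extent *cur,
                             uint32_t write_lblk, uint32_t n)
{
    assert(prev != NULL && cur != NULL);

    if (prev->uninit || !cur->uninit)
        return false;
    if (write_lblk != cur->lblk || n == 0 || n >= cur->len)
        return false;
    if (prev->lblk + prev->len != cur->lblk)
        return false;
    if (prev->pblk + prev->len != cur->pblk)
        return false;

    prev->len += n;   /* grow the initialized extent      */
    cur->lblk += n;   /* shrink the uninitialized extent  */
    cur->pblk += n;
    cur->len  -= n;
    return true;
}
```

In an append workload into a preallocated file, each write lands at the head of the remaining uninitialized extent, so this path adjusts two length fields and needs no memmove().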
  25. 26 Oct, 2011 (3 commits)