1. 20 3月, 2012 1 次提交
    • L
      ext4: rewrite punch hole to use ext4_ext_remove_space() · 5f95d21f
      Lukas Czerner 提交于
      This commit rewrites ext4 punch hole implementation to use
      ext4_ext_remove_space() instead of its home gown way of doing this via
      ext4_ext_map_blocks(). There are several reasons for changing this.
      
      Firstly it is quite non obvious that punching hole needs to
      ext4_ext_map_blocks() to punch a hole, especially given that this
      function should map blocks, not unmap it. It also required a lot of new
      code in ext4_ext_map_blocks().
      
      Secondly the design of it is not very effective. The reason is that we
      are trying to punch out blocks in ext4_ext_punch_hole() in opposite
      direction than in ext4_ext_rm_leaf() which causes the ext4_ext_rm_leaf()
      to iterate through the whole tree from the end to the start to find the
      requested extent for every extent we are going to punch out.
      
      And finally the current implementation does not use the existing code,
      but bring a lot of new code, which is IMO unnecessary since there
      already is some infrastructure we can use. Specifically
      ext4_ext_remove_space().
      
      This commit changes ext4_ext_remove_space() to accept 'end' parameter so
      we can not only truncate to the end of file, but also remove the space
      in the middle of the file (punch a hole). Moreover, because the last
      block to punch out, might be in the middle of the extent, we have to
      split the extent at 'end + 1' so ext4_ext_rm_leaf() can easily either
      remove the whole fist part of split extent, or change its size.
      
      ext4_ext_remove_space() is then used to actually remove the space
      (extents) from within the hole, instead of ext4_ext_map_blocks().
      
      Note that this also fix the issue with punch hole, where we would forget
      to remove empty index blocks from the extent tree, resulting in double
      free block error and file system corruption. This is simply because we
      now use different code path, where this problem does not exist.
      
      This has been tested with fsx running for several days and xfstests,
      plus xfstest #251 with '-o discard' run on the loop image (which
      converts discard requestes into punch hole to the backing file). All of
      it on 1K and 4K file system block size.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      5f95d21f
  2. 12 3月, 2012 1 次提交
  3. 09 1月, 2012 1 次提交
  4. 29 12月, 2011 1 次提交
  5. 19 12月, 2011 2 次提交
    • R
      ext4: optimize ext4_find_delalloc_range() in nodelalloc mode · 8c48f7e8
      Robin Dong 提交于
      We found performance regression when using bigalloc with "nodelalloc"
      (1MB cluster size):
      
      1. mke2fs -C 1048576 -O ^has_journal,bigalloc /dev/sda
      2. mount -o nodelalloc /dev/sda /test/
      3. time dd if=/dev/zero of=/test/io bs=1048576 count=1024
      
      The "dd" will cost about 2 seconds to finish, but if we mke2fs without
      "bigalloc", "dd" will only cost less than 1 second.
      
      The reason is: when using ext4 with "nodelalloc", it will call
      ext4_find_delalloc_cluster() nearly everytime it call
      ext4_ext_map_blocks(), and ext4_find_delalloc_range() will also scan
      all pages in cluster because no buffer is "delayed".  A cluster has
      256 pages (1MB cluster), so it will scan 256 * 256k pags when creating
      a 1G file. That severely hurts the performance.
      
      Therefore, we return immediately from ext4_find_delalloc_range() in
      nodelalloc mode, since by definition there can't be any delalloc
      pages.
      Signed-off-by: NRobin Dong <sanbai@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      8c48f7e8
    • C
      ext4: remove unused local variable · 14d7f3ef
      Curt Wohlgemuth 提交于
      In get_implied_cluster_alloc(), rr_cluster_end was being
      defined and set, but was never used.  Removed this.
      Signed-off-by: NCurt Wohlgemuth <curtw@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      14d7f3ef
  6. 14 12月, 2011 1 次提交
  7. 13 12月, 2011 1 次提交
    • P
      ext4: Fix crash due to getting bogus eh_depth value on big-endian systems · b4611abf
      Paul Mackerras 提交于
      Commit 1939dd84 ("ext4: cleanup ext4_ext_grow_indepth code") added a
      reference to ext4_extent_header.eh_depth, but forget to pass the value
      read through le16_to_cpu.  The result is a crash on big-endian
      machines, such as this crash on a POWER7 server:
      
      attempt to access beyond end of device
      sda8: rw=0, want=776392648163376, limit=168558560
      Unable to handle kernel paging request for data at address 0x6b6b6b6b6b6b6bcb
      Faulting instruction address: 0xc0000000001f5f38
      cpu 0x14: Vector: 300 (Data Access) at [c000001bd1aaecf0]
          pc: c0000000001f5f38: .__brelse+0x18/0x60
          lr: c0000000002e07a4: .ext4_ext_drop_refs+0x44/0x80
          sp: c000001bd1aaef70
         msr: 9000000000009032
         dar: 6b6b6b6b6b6b6bcb
       dsisr: 40000000
        current = 0xc000001bd15b8010
        paca    = 0xc00000000ffe4600
          pid   = 19911, comm = flush-8:0
      enter ? for help
      [c000001bd1aaeff0] c0000000002e07a4 .ext4_ext_drop_refs+0x44/0x80
      [c000001bd1aaf090] c0000000002e0c58 .ext4_ext_find_extent+0x408/0x4c0
      [c000001bd1aaf180] c0000000002e145c .ext4_ext_insert_extent+0x2bc/0x14c0
      [c000001bd1aaf2c0] c0000000002e3fb8 .ext4_ext_map_blocks+0x628/0x1710
      [c000001bd1aaf420] c0000000002b2974 .ext4_map_blocks+0x224/0x310
      [c000001bd1aaf4d0] c0000000002b7f2c .mpage_da_map_and_submit+0xbc/0x490
      [c000001bd1aaf5a0] c0000000002b8688 .write_cache_pages_da+0x2c8/0x430
      [c000001bd1aaf720] c0000000002b8b28 .ext4_da_writepages+0x338/0x670
      [c000001bd1aaf8d0] c000000000157280 .do_writepages+0x40/0x90
      [c000001bd1aaf940] c0000000001ea830 .writeback_single_inode+0xe0/0x530
      [c000001bd1aafa00] c0000000001eb680 .writeback_sb_inodes+0x210/0x300
      [c000001bd1aafb20] c0000000001ebc84 .__writeback_inodes_wb+0xd4/0x140
      [c000001bd1aafbe0] c0000000001ebfec .wb_writeback+0x2fc/0x3e0
      [c000001bd1aafce0] c0000000001ed770 .wb_do_writeback+0x2f0/0x300
      [c000001bd1aafdf0] c0000000001ed848 .bdi_writeback_thread+0xc8/0x340
      [c000001bd1aafed0] c0000000000c5494 .kthread+0xb4/0xc0
      [c000001bd1aaff90] c000000000021f48 .kernel_thread+0x54/0x70
      
      This is due to getting ext_depth(inode) == 0x101 and therefore running
      off the end of the path array in ext4_ext_drop_refs into following
      unallocated structures.
      
      This fixes it by adding the necessary le16_to_cpu.
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      b4611abf
  8. 02 11月, 2011 2 次提交
  9. 01 11月, 2011 2 次提交
    • G
      ext4: Don't normalize an falloc request if it can fit in 1 extent. · 3c6fe770
      Greg Harm 提交于
      If an fallocate request fits in EXT_UNINIT_MAX_LEN, then set the
      EXT4_GET_BLOCKS_NO_NORMALIZE flag. For larger fallocate requests,
      let mballoc.c normalize the request.
      
      This fixes a problem where large requests were being split into
      non-contiguous extents due to commit 556b27ab: ext4: do not
      normalize block requests from fallocate.
      
      Testing: 
      *) Checked that 8.x MB falloc'ed files are still laid down next to
      each other (contiguously).
      *) Checked that the maximum size extent (127.9MB) is allocated as 1
      extent.
      *) Checked that a 1GB file is somewhat contiguous (often 5-6
      non-contiguous extents now).
      *) Checked that a 120MB file can still be falloc'ed even if there are
      no single extents large enough to hold it.
      Signed-off-by: NGreg Harm <gharm@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      3c6fe770
    • T
      ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwritten · 0edeb71d
      Tao Ma 提交于
      EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten
      should be done simultaneously since ext4_end_io_nolock always clear
      the flag and decrease the counter in the same time.
      
      We have found some bugs that the flag is set while leaving
      i_aiodio_unwritten unchanged(commit 32c80b32). So this patch just tries
      to create a helper function to wrap them to avoid any future bug.
      The idea is inspired by Eric.
      
      Cc: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: NTao Ma <boyu.mt@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      0edeb71d
  10. 29 10月, 2011 3 次提交
  11. 27 10月, 2011 2 次提交
    • E
      ext4: optimize memmmove lengths in extent/index insertions · 80e675f9
      Eric Gouriou 提交于
      ext4_ext_insert_extent() (respectively ext4_ext_insert_index())
      was using EXT_MAX_EXTENT() (resp. EXT_MAX_INDEX()) to determine
      how many entries needed to be moved beyond the insertion point.
      In practice this means that (320 - I) * 24 bytes were memmove()'d
      when I is the insertion point, rather than (#entries - I) * 24 bytes.
      
      This patch uses EXT_LAST_EXTENT() (resp. EXT_LAST_INDEX()) instead
      to only move existing entries. The code flow is also simplified
      slightly to highlight similarities and reduce code duplication in
      the insertion logic.
      
      This patch reduces system CPU consumption by over 25% on a 4kB
      synchronous append DIO write workload when used with the
      pre-2.6.39 x86_64 memmove() implementation. With the much faster
      2.6.39 memmove() implementation we still see a decrease in
      system CPU usage between 2% and 7%.
      
      Note that the ext_debug() output changes with this patch, splitting
      some log information between entries. Users of the ext_debug() output
      should note that the "move %d" units changed from reporting the number
      of bytes moved to reporting the number of entries moved.
      Signed-off-by: NEric Gouriou <egouriou@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      80e675f9
    • E
      ext4: optimize ext4_ext_convert_to_initialized() · 6f91bc5f
      Eric Gouriou 提交于
      This patch introduces a fast path in ext4_ext_convert_to_initialized()
      for the case when the conversion can be performed by transferring
      the newly initialized blocks from the uninitialized extent into
      an adjacent initialized extent. Doing so removes the expensive
      invocations of memmove() which occur during extent insertion and
      the subsequent merge.
      
      In practice this should be the common case for clients performing
      append writes into files pre-allocated via
      fallocate(FALLOC_FL_KEEP_SIZE). In such a workload performed via
      direct IO and when using a suboptimal implementation of memmove()
      (x86_64 prior to the 2.6.39 rewrite), this patch reduces kernel CPU
      consumption by 32%.
      
      Two new trace points are added to ext4_ext_convert_to_initialized()
      to offer visibility into its operations. No exit trace point has
      been added due to the multiplicity of return points. This can be
      revisited once the upstream cleanup is backported.
      Signed-off-by: NEric Gouriou <egouriou@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      6f91bc5f
  12. 26 10月, 2011 3 次提交
  13. 25 10月, 2011 2 次提交
  14. 22 10月, 2011 1 次提交
    • D
      ext4: cleanup ext4_ext_grow_indepth code · 1939dd84
      Dmitry Monakhov 提交于
      Currently code make an impression what grow procedure is very complicated
      and some mythical paths, blocks are involved. But in fact grow in depth
      it relatively simple procedure:
       1) Just create new meta block and copy root data to that block.
       2) Convert root from extent to index if old depth == 0
       3) Update root block pointer
      
      This patch does:
       - Reorganize code to make it more self explanatory
       - Do not pass path parameter to new_meta_block() in order to
         provoke allocation from inode's group because top-level block
         should site closer to it's inode, but not to leaf data block.
      
         [ This happens anyway, due to logic in mballoc; we should drop
           the path parameter from new_meta_block() entirely.  -- tytso ]
      Signed-off-by: NDmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      1939dd84
  15. 18 10月, 2011 1 次提交
  16. 17 10月, 2011 1 次提交
  17. 09 10月, 2011 2 次提交
  18. 10 9月, 2011 5 次提交
    • A
      ext4: attempt to fix race in bigalloc code path · 5356f261
      Aditya Kali 提交于
      Currently, there exists a race between delayed allocated writes and
      the writeback when bigalloc feature is in use. The race was because we
      wanted to determine what blocks in a cluster are under delayed
      allocation and we were using buffer_delayed(bh) check for it. But, the
      writeback codepath clears this bit without any synchronization which
      resulted in a race and an ext4 warning similar to:
      
      EXT4-fs (ram1): ext4_da_update_reserve_space: ino 13, used 1 with only 0
      		reserved data blocks
      
      The race existed in two places.
      (1) between ext4_find_delalloc_range() and ext4_map_blocks() when called from
          writeback code path.
      (2) between ext4_find_delalloc_range() and ext4_da_get_block_prep() (where
          buffer_delayed(bh) is set.
      
      To fix (1), this patch introduces a new buffer_head state bit -
      BH_Da_Mapped.  This bit is set under the protection of
      EXT4_I(inode)->i_data_sem when we have actually mapped the delayed
      allocated blocks during the writeout time. We can now reliably check
      for this bit inside ext4_find_delalloc_range() to determine whether
      the reservation for the blocks have already been claimed or not.
      
      To fix (2), it was necessary to set buffer_delay(bh) under the
      protection of i_data_sem.  So, I extracted the very beginning of
      ext4_map_blocks into a new function - ext4_da_map_blocks() - and
      performed the required setting of bh_delay bit and the quota
      reservation under the protection of i_data_sem.  These two fixes makes
      the checking of buffer_delay(bh) and buffer_da_mapped(bh) consistent,
      thus removing the race.
      
      Tested: I was able to reproduce the problem by running 'dd' and
      'fsync' in parallel. Also, xfstests sometimes used to reproduce this
      race. After the fix both my test and xfstests were successful and no
      race (warning message) was observed.
      
      Google-Bug-Id: 4997027
      Signed-off-by: NAditya Kali <adityakali@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      5356f261
    • A
      ext4: add some tracepoints in ext4/extents.c · d8990240
      Aditya Kali 提交于
      This patch adds some tracepoints in ext4/extents.c and updates a tracepoint in
      ext4/inode.c.
      
      Tested: Built and ran the kernel and verified that these tracepoints work.
      Also ran xfstests.
      Signed-off-by: NAditya Kali <adityakali@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
          
      d8990240
    • A
      ext4: Fix bigalloc quota accounting and i_blocks value · 7b415bf6
      Aditya Kali 提交于
      With bigalloc changes, the i_blocks value was not correctly set (it was still
      set to number of blocks being used, but in case of bigalloc, we want i_blocks
      to represent the number of clusters being used). Since the quota subsystem sets
      the i_blocks value, this patch fixes the quota accounting and makes sure that
      the i_blocks value is set correctly.
      Signed-off-by: NAditya Kali <adityakali@google.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      7b415bf6
    • T
      ext4: teach ext4_ext_truncate() about the bigalloc feature · 0aa06000
      Theodore Ts'o 提交于
      When we are truncating (as opposed unlinking) a file, we need to worry
      about partial truncates of a file, especially in the light of sparse
      files.  The changes here make sure that arbitrary truncates of sparse
      files works correctly.  Yeah, it's messy.
      
      Note that these functions will need to be revisted when the punch
      ioctl is integrated --- in fact this commit will probably have merge
      conflicts with the punch changes which Allison Henders and the IBM LTC
      have been working on.  I will need to fix this up when either patch
      hits mainline.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      0aa06000
    • T
      ext4: teach ext4_ext_map_blocks() about the bigalloc feature · 4d33b1ef
      Theodore Ts'o 提交于
      If we need to allocate a new block in ext4_ext_map_blocks(), the
      function needs to see if the cluster has already been allocated.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      4d33b1ef
  19. 07 9月, 2011 1 次提交
    • A
      ext4: fix fsx truncate failure · 189e868f
      Allison Henderson 提交于
      While running extended fsx tests to verify the first
      two patches, a similar bug was also found in the
      truncate operation.
      
      This bug happens because the truncate routine only zeros
      the unblock aligned portion of the last page.  This means
      that the block aligned portions of the page appearing after
      i_size are left unzeroed, and the buffer heads still mapped.
      
      This bug is corrected by using ext4_discard_partial_page_buffers
      in the truncate routine to zero the partial page and unmap
      the buffer headers.
      Signed-off-by: NAllison Henderson <achender@linux.vnet.ibm.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      189e868f
  20. 04 9月, 2011 1 次提交
    • T
      jbd2: add debugging information to jbd2_journal_dirty_metadata() · 9ea7a0df
      Theodore Ts'o 提交于
      Add debugging information in case jbd2_journal_dirty_metadata() is
      called with a buffer_head which didn't have
      jbd2_journal_get_write_access() called on it, or if the journal_head
      has the wrong transaction in it.  In addition, return an error code.
      This won't change anything for ocfs2, which will BUG_ON() the non-zero
      exit code.
      
      For ext4, the caller of this function is ext4_handle_dirty_metadata(),
      and on seeing a non-zero return code, will call __ext4_journal_stop(),
      which will print the function and line number of the (buggy) calling
      function and abort the journal.  This will allow us to recover instead
      of bug halting, which is better from a robustness and reliability
      point of view.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      9ea7a0df
  21. 03 9月, 2011 2 次提交
    • A
      ext4: fix 2nd xfstests 127 punch hole failure · 2be4751b
      Allison Henderson 提交于
      This patch fixes a second punch hole bug found by xfstests 127.
      
      This bug happens because punch hole needs to flush the pages
      of the hole to avoid race conditions.  But if the end of the
      hole is in the same page as i_size, the buffer heads beyond
      i_size need to be unmapped and the page needs to be zeroed
      after it is flushed.
      
      To correct this, the new ext4_discard_partial_page_buffers
      routine is used to zero and unmap the partial page
      beyond i_size if the end of the hole appears in the same
      page as i_size.
      
      The code has also been optimized to set the end of the hole
      to the page after i_size if the specified hole exceeds i_size,
      and the code that flushes the pages has been simplified.
      Signed-off-by: NAllison Henderson <achender@linux.vnet.ibm.com>
      2be4751b
    • A
      ext4: fix xfstests 75, 112, 127 punch hole failure · ba06208a
      Allison Henderson 提交于
      This patch addresses a bug found by xfstests 75, 112, 127
      when blocksize = 1k
      
      This bug happens because the punch hole code only zeros
      out non block aligned regions of the page.  This means that if the
      blocks are smaller than a page, then the block aligned regions of
      the page inside the hole are left un-zeroed, and their buffer heads
      are still mapped.  This bug is corrected by using
      ext4_discard_partial_page_buffers to properly zero the partial page
      at the head and tail of the hole, and unmap the corresponding buffer
      heads
      
      This patch also addresses a bug reported by Lukas while working on a
      new patch to add discard support for loop devices using punch hole.
      The bug happened because of the first and last block number
      needed to be cast to a larger data type before calculating the
      byte offset, but since now we only need the byte offsets of the
      pages, we no longer even need to be calculating the byte offsets
      of the blocks.  The code to do the block offset calculations is
      removed in this patch.
      Signed-off-by: NAllison Henderson <achender@linux.vnet.ibm.com>
      ba06208a
  22. 28 7月, 2011 2 次提交
    • U
      ext4: Fix overflow caused by missing cast in ext4_fallocate() · 29ae07b7
      Utako Kusaka 提交于
      The logical block number in map.l_blk is a __u32, and so before we
      shift it left, by the block size, we neeed cast it to a 64-bit size.
      
      Otherwise i_size can be corrupted on an ENOSPC.
      
      # df -T /mnt/mp1
      Filesystem    Type   1K-blocks      Used Available Use% Mounted on
      /dev/sda6     ext4     9843276    153056   9190200   2% /mnt/mp1
      # fallocate -o 0 -l 2199023251456 /mnt/mp1/testfile
      fallocate: /mnt/mp1/testfile: fallocate failed: No space left on device
      # stat /mnt/mp1/testfile
        File: `/mnt/mp1/testfile'
        Size: 4293656576	Blocks: 19380440   IO Block: 4096   regular file
      Device: 806h/2054d	Inode: 12          Links: 1
      Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
      Access: 2011-07-25 13:01:31.414490496 +0900
      Modify: 2011-07-25 13:01:31.414490496 +0900
      Change: 2011-07-25 13:01:31.454490495 +0900
      Signed-off-by: NUtako Kusaka <u-kusaka@wm.jp.nec.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      --
       fs/ext4/extents.c |    2 +-
       1 files changed, 1 insertions(+), 1 deletions(-)
      29ae07b7
    • R
      ext4: add action of moving index in ext4_ext_rm_idx for Punch Hole · 0e1147b0
      Robin Dong 提交于
      The old function ext4_ext_rm_idx is used only for truncate case
      because it just remove last index in extent-index-block. When punching
      hole, it usually needed to remove "middle" index, therefore we must
      move indexes which after it forward.
      
      (I create a file with 1 depth extent tree and punch hole in the middle
      of it, the last index in index-block strangly gone, so I find out this
      bug)
      Signed-off-by: NRobin Dong <sanbai@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      0e1147b0
  23. 24 7月, 2011 2 次提交