1. 02 3月, 2013 1 次提交
  2. 01 3月, 2013 1 次提交
    • T
      ext4: optimize ext4_es_shrink() · 24630774
      Theodore Ts'o 提交于
      When the system is under memory pressure, ext4_es_srhink() will get
      called very often.  So optimize returning the number of items in the
      file system's extent status cache by keeping a per-filesystem count,
      instead of calculating it each time by scanning all of the inodes in
      the extent status cache.
      
      Also rename the slab used for the extent status cache to be
      "ext4_extent_status" so it's obviousl the slab in question is created
      by ext4.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Zheng Liu <gnehzuil.liu@gmail.com>
      24630774
  3. 28 2月, 2013 1 次提交
  4. 23 2月, 2013 2 次提交
    • L
      ext4: fix free clusters calculation in bigalloc filesystem · 304e220f
      Lukas Czerner 提交于
      ext4_has_free_clusters() should tell us whether there is enough free
      clusters to allocate, however number of free clusters in the file system
      is converted to blocks using EXT4_C2B() which is not only wrong use of
      the macro (we should have used EXT4_NUM_B2C) but it's also completely
      wrong concept since everything else is in cluster units.
      
      Moreover when calculating number of root clusters we should be using
      macro EXT4_NUM_B2C() instead of EXT4_B2C() otherwise the result might be
      off by one. However r_blocks_count should always be a multiple of the
      cluster ratio so doing a plain bit shift should be enough here. We
      avoid using EXT4_B2C() because it's confusing.
      
      As a result of the first problem number of free clusters is much bigger
      than it should have been and ext4_has_free_clusters() would return 1 even
      if there is really not enough free clusters available.
      
      Fix this by removing the EXT4_C2B() conversion of free clusters and
      using bit shift when calculating number of root clusters. This bug
      affects number of xfstests tests covering file system ENOSPC situation
      handling. With this patch most of the ENOSPC problems with bigalloc file
      system disappear, especially the errors caused by delayed allocation not
      having enough space when the actual allocation is finally requested.
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      304e220f
    • E
      ext4: no need to remove extent if len is 0 in ext4_es_remove_extent() · d4381472
      Eryu Guan 提交于
      len is 0 means no extent needs to be removed, so return immediately.
      Otherwise it could trigger the following BUG_ON() in
      ext4_es_remove_extent()
      
      	end = lblk + len - 1;
      	BUG_ON(end < lblk);
      
      This could be reproduced by a simple truncate(1) command by an
      unprivileged user
      
      	truncate -s $(($((2**32 - 1)) * 4096)) /mnt/ext4/testfile
      
      The same is true for __es_insert_extent().
      
      Patched kernel passed xfstests regression test.
      Signed-off-by: NEryu Guan <guaneryu@gmail.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NZheng Liu <wenqing.lz@taobao.com>
      d4381472
  5. 19 2月, 2013 1 次提交
    • L
      ext4: fix xattr block allocation/release with bigalloc · 1231b3a1
      Lukas Czerner 提交于
      Currently when new xattr block is created or released we we would call
      dquot_free_block() or dquot_alloc_block() respectively, among the else
      decrementing or incrementing the number of blocks assigned to the
      inode by one block.
      
      This however does not work for bigalloc file system because we always
      allocate/free the whole cluster so we have to count with that in
      dquot_free_block() and dquot_alloc_block() as well.
      
      Use the clusters-to-blocks conversion EXT4_C2B() when passing number of
      blocks to the dquot_alloc/free functions to fix the problem.
      
      The problem has been revealed by xfstests #117 (and possibly others).
      Signed-off-by: NLukas Czerner <lczerner@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Cc: stable@vger.kernel.org
      1231b3a1
  6. 18 2月, 2013 9 次提交
    • Z
      ext4: reclaim extents from extent status tree · 74cd15cd
      Zheng Liu 提交于
      Although extent status is loaded on-demand, we also need to reclaim
      extent from the tree when we are under a heavy memory pressure because
      in some cases fragmented extent tree causes status tree costs too much
      memory.
      
      Here we maintain a lru list in super_block.  When the extent status of
      an inode is accessed and changed, this inode will be move to the tail
      of the list.  The inode will be dropped from this list when it is
      cleared.  In the inode, a counter is added to count the number of
      cached objects in extent status tree.  Here only written/unwritten/hole
      extent is counted because delayed extent doesn't be reclaimed due to
      fiemap, bigalloc and seek_data/hole need it.  The counter will be
      increased as a new extent is allocated, and it will be decreased as a
      extent is freed.
      
      In this commit we use normal shrinker framework to reclaim memory from
      the status tree.  ext4_es_reclaim_extents_count() traverses the lru list
      to count the number of reclaimable extents.  ext4_es_shrink() tries to
      reclaim written/unwritten/hole extents from extent status tree.  The
      inode that has been shrunk is moved to the tail of lru list.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      74cd15cd
    • Z
      ext4: adjust some functions for reclaiming extents from extent status tree · bdedbb7b
      Zheng Liu 提交于
      This commit changes some interfaces in extent status tree because we
      need to use inode to count the cached objects in a extent status tree.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      bdedbb7b
    • Z
      ext4: remove single extent cache · 69eb33dc
      Zheng Liu 提交于
      Single extent cache could be removed because we have extent status tree
      as a extent cache, and it would be better.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      69eb33dc
    • Z
      ext4: lookup block mapping in extent status tree · d100eef2
      Zheng Liu 提交于
      After tracking all extent status, we already have a extent cache in
      memory.  Every time we want to lookup a block mapping, we can first
      try to lookup it in extent status tree to avoid a potential disk I/O.
      
      A new function called ext4_es_lookup_extent is defined to finish this
      work.  When we try to lookup a block mapping, we always call
      ext4_map_blocks and/or ext4_da_map_blocks.  So in these functions we
      first try to lookup a block mapping in extent status tree.
      
      A new flag EXT4_GET_BLOCKS_NO_PUT_HOLE is used in ext4_da_map_blocks
      in order not to put a hole into extent status tree because this hole
      will be converted to delayed extent in the tree immediately.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      d100eef2
    • Z
      ext4: track all extent status in extent status tree · f7fec032
      Zheng Liu 提交于
      By recording the phycisal block and status, extent status tree is able
      to track the status of every extents.  When we call _map_blocks
      functions to lookup an extent or create a new written/unwritten/delayed
      extent, this extent will be inserted into extent status tree.
      
      We don't load all extents from disk in alloc_inode() because it costs
      too much memory, and if a file is opened and closed frequently it will
      takes too much time to load all extent information.  So currently when
      we create/lookup an extent, this extent will be inserted into extent
      status tree.  Hence, the extent status tree may not comprehensively
      contain all of the extents found in the file.
      
      Here a condition we need to take care is that an extent might contains
      unwritten and delayed status simultaneously because an extent is delayed
      allocated and could be allocated by fallocate.  At this time we need to
      keep delayed status because later we need to update delayed reservation
      space using it.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      f7fec032
    • Z
      ext4: let ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag · a25a4e1a
      Zheng Liu 提交于
      This commit lets ext4_ext_map_blocks return EXT4_MAP_UNWRITTEN flag
      because in later commit ext4_map_blocks needs to use this flag to
      determine the extent status.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      a25a4e1a
    • Z
      ext4: rename and improbe ext4_es_find_extent() · be401363
      Zheng Liu 提交于
      This commit renames ext4_es_find_extent with ext4_es_find_delayed_extent
      and improve this function.  First, we split input and output parameter.
      Second, this function never return the first block of the next delayed
      extent after 'es'.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Jan kara <jack@suse.cz>
      be401363
    • Z
      ext4: add physical block and status member into extent status tree · fdc0212e
      Zheng Liu 提交于
      This commit adds two members in extent_status structure to let it record
      physical block and extent status.  Here es_pblk is used to record both
      of them because physical block only has 48 bits.  So extent status could
      be stashed into it so that we can save some memory.  Now written,
      unwritten, delayed and hole are defined as status.
      
      Due to new member is added into extent status tree, all interfaces need
      to be adjusted.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      fdc0212e
    • Z
      ext4: refine extent status tree · 06b0c886
      Zheng Liu 提交于
      This commit refines the extent status tree code.
      
      1) A prefix 'es_' is added to to the extent status tree structure
      members.
      
      2) Refactored es_remove_extent() so that __es_remove_extent() can be
      used by es_insert_extent() to remove the old extent entry(-ies) before
      inserting a new one.
      
      3) Rename extent_status_end() to ext4_es_end()
      
      4) ext4_es_can_be_merged() is define to check whether two extents can
      be merged or not.
      
      5) Update and clarified comments.
      Signed-off-by: NZheng Liu <wenqing.lz@taobao.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      06b0c886
  7. 15 2月, 2013 4 次提交
  8. 10 2月, 2013 6 次提交
    • T
      jbd2: use module parameters instead of debugfs for jbd_debug · b6e96d00
      Theodore Ts'o 提交于
      There are multiple reasons to move away from debugfs.  First of all,
      we are only using it for a single parameter, and it is much more
      complicated to set up (some 30 lines of code compared to 3), and one
      more thing that might fail while loading the jbd2 module.
      
      Secondly, as a module paramter it can be specified as a boot option if
      jbd2 is built into the kernel, or as a parameter when the module is
      loaded, and it can also be manipulated dynamically under
      /sys/module/jbd2/parameters/jbd2_debug.  So it is more flexible.
      
      Ultimately we want to move away from using jbd_debug() towards
      tracepoints, but for now this is still a useful simplification of the
      code base.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      b6e96d00
    • T
      ext4: use module parameters instead of debugfs for mballoc_debug · a0b30c12
      Theodore Ts'o 提交于
      There are multiple reasons to move away from debugfs.  First of all,
      we are only using it for a single parameter, and it is much more
      complicated to set up (some 30 lines of code compared to 3), and one
      more thing that might fail while loading the ext4 module.
      
      Secondly, as a module paramter it can be specified as a boot option if
      ext4 is built into the kernel, or as a parameter when the module is
      loaded, and it can also be manipulated dynamically under
      /sys/module/ext4/parameters/mballoc_debug.  So it is more flexible.
      
      Ultimately we want to move away from using mb_debug() towards
      tracepoints, but for now this is still a useful simplification of the
      code base.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      a0b30c12
    • T
      ext4: start handle at the last possible moment when creating inodes · 1139575a
      Theodore Ts'o 提交于
      In ext4_{create,mknod,mkdir,symlink}(), don't start the journal handle
      until the inode has been succesfully allocated.  In order to do this,
      we need to start the handle in the ext4_new_inode().  So create a new
      variant of this function, ext4_new_inode_start_handle(), so the handle
      can be created at the last possible minute, before we need to modify
      the inode allocation bitmap block.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      1139575a
    • T
      ext4: fix the number of credits needed for acl ops with inline data · 95eaefbd
      Theodore Ts'o 提交于
      Operations which modify extended attributes may need extra journal
      credits if inline data is used, since there is a chance that some
      extended attributes may need to get pushed to an external attribute
      block.
      
      Changes to reflect this was made in xattr.c, but they were missed in
      fs/ext4/acl.c.  To fix this, abstract the calculation of the number of
      credits needed for xattr operations to an inline function defined in
      ext4_jbd2.h, and use it in acl.c and xattr.c.
      
      Also move the function declarations used in inline.c from xattr.h
      (where they are non-obviously hidden, and caused problems since
      ext4_jbd2.h needs to use the function ext4_has_inline_data), and move
      them to ext4.h.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NTao Ma <boyu.mt@taobao.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      95eaefbd
    • T
      ext4: fix the number of credits needed for ext4_unlink() and ext4_rmdir() · 64044abf
      Theodore Ts'o 提交于
      The ext4_unlink() and ext4_rmdir() don't actually release the blocks
      associated with the file/directory.  This gets done in a separate jbd2
      handle called via ext4_evict_inode().  Thus, we don't need to reserve
      lots of journal credits for the truncate.
      
      Note that using too many journal credits is non-optimal because it can
      leading to the journal transmit getting closed too early, before it is
      strictly necessary.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      64044abf
    • T
      ext4: fix the number of credits needed for ext4_ext_migrate() · 4b217630
      Theodore Ts'o 提交于
      The migration ioctl creates a temporary inode.  Since this inode is
      never linked to a directory, we don't need to reserve journal credits
      required for modifying the directory.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      4b217630
  9. 09 2月, 2013 6 次提交
    • T
      ext4: start handle at the last possible moment in ext4_rmdir() · 8dcfaad2
      Theodore Ts'o 提交于
      Don't start the jbd2 transaction handle until after the directory
      entry has been found, to minimize the amount of time that a handle is
      held active.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      8dcfaad2
    • T
      ext4: start handle at the last possible moment in ext4_unlink() · 931b6864
      Theodore Ts'o 提交于
      Don't start the jbd2 transaction handle until after the directory
      entry has been found, to minimize the amount of time that a handle is
      held active.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      931b6864
    • T
      ext4: grab page before starting transaction handle in write_begin() · 47564bfb
      Theodore Ts'o 提交于
      The grab_cache_page_write_begin() function can potentially sleep for a
      long time, since it may need to do memory allocation which can block
      if the system is under significant memory pressure, and because it may
      be blocked on page writeback.  If it does take a long time to grab the
      page, it's better that we not hold an active jbd2 handle.
      
      So grab a handle on the page first, and _then_ start the transaction
      handle.
      
      This commit fixes the following long transaction handle hold time:
      
      postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32
         tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
         dirtied_blocks 0
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Reviewed-by: NJan Kara <jack@suse.cz>
      47564bfb
    • T
      ext4: pass context information to jbd2__journal_start() · 9924a92a
      Theodore Ts'o 提交于
      So we can better understand what bits of ext4 are responsible for
      long-running jbd2 handles, use jbd2__journal_start() so we can pass
      context information for logging purposes.
      
      The recommended way for finding the longer-running handles is:
      
         T=/sys/kernel/debug/tracing
         EVENT=$T/events/jbd2/jbd2_handle_stats
         echo "interval > 5" > $EVENT/filter
         echo 1 > $EVENT/enable
      
         ./run-my-fs-benchmark
      
         cat $T/trace > /tmp/problem-handles
      
      This will list handles that were active for longer than 20ms.  Having
      longer-running handles is bad, because a commit started at the wrong
      time could stall for those 20+ milliseconds, which could delay an
      fsync() or an O_SYNC operation.  Here is an example line from the
      trace file describing a handle which lived on for 311 jiffies, or over
      1.2 seconds:
      
      postmark-2917  [000] ....   196.435786: jbd2_handle_stats: dev 254,32 
         tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
         dirtied_blocks 0
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      9924a92a
    • T
      ext4: move the jbd2 wrapper functions out of super.c · 722887dd
      Theodore Ts'o 提交于
      Move the jbd2 wrapper functions which start and stop handles out of
      super.c, where they don't really logically belong, and into
      ext4_jbd2.c.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      722887dd
    • T
      jbd2: add tracepoints which provide per-handle statistics · 343d9c28
      Theodore Ts'o 提交于
      Handles which stay open a long time are problematic when it comes time
      to close down a transaction so it can be committed.  These tracepoints
      will help us determine which ones are the problematic ones, and to
      validate whether changes makes things better or worse.
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      343d9c28
  10. 07 2月, 2013 2 次提交
  11. 05 2月, 2013 1 次提交
    • T
      ext4: optimize mballoc for large allocations · 40ae3487
      Theodore Ts'o 提交于
      The ext4 block allocator only maintains buddy bitmaps for chunks which
      are less than or equal to one quarter of a block group.  That is, for
      a file aystem with a 1k blocksize, and where the number of blocks in a
      block group is 8192 blocks, the largest chunk size tracked by buddy
      bitmaps is 2048 blocks.
      
      For a file system with a 4k blocksize, and where the number of blocks
      in a block group is 32768 blocks, the largest chunk size tracked by
      buddy bitmaps is 8192 blocks.
      
      To work around this code, mballoc.c before this commit would truncate
      allocation requests to the number of blocks in a block group minus 10.
      Why 10?  Aside from being a completely arbitrary number, it avoids
      block allocation to be a power of two larger than 25% of the block
      group.  If you try to explicitly fallocate 50% of the block group
      size, this will demonstrate the problem; the block allocation code
      will scan the all of the blocks in the file system with cr==0 (since
      the request is for a natural power of two), but then completely fail
      for all blocks groups, since the buddy bitmaps don't track chunk sizes
      of 50% of the block group.
      
      To fix this, in these we use ext4_mb_complex_scan_group() instead of
      ext4_mb_simple_scan_group().
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger@dilger.ca>
      40ae3487
  12. 03 2月, 2013 4 次提交
  13. 02 2月, 2013 2 次提交