1. 30 Sep, 2009 1 commit
  2. 28 Sep, 2009 1 commit
  3. 17 Sep, 2009 1 commit
    • ext4: limit block allocations for indirect-block files to < 2^32 · fb0a387d
      Eric Sandeen committed
      Today, the ext4 allocator will happily allocate blocks past
      2^32 for indirect-block files, which results in the block
      numbers getting truncated, and corruption ensues.
      
      This patch limits such allocations to < 2^32, and adds
      BUG_ONs if we do get blocks larger than that.
      
      This should address RH Bug 519471, ext4 bitmap allocator 
      must limit blocks to < 2^32
      
      * ext4_find_goal() is modified to choose a goal < UINT_MAX,
        so that our starting point is in an acceptable range.
      
      * ext4_xattr_block_set() is modified such that the goal block
        is < UINT_MAX, as above.
      
      * ext4_mb_regular_allocator() is modified so that the group
        search does not continue into groups which are too high.
      
      * ext4_mb_use_preallocated() has a check that we don't use
        preallocated space which is too far out.
      
      * ext4_alloc_blocks() and ext4_xattr_block_set() add some BUG_ONs.
      
      No attempt has been made to limit inode locations to < 2^32,
      so we may wind up with blocks far from their inodes.  Doing
      this much already will lead to some odd ENOSPC issues when the
      "lower 32" gets full, and further restricting inodes could
      make that even weirder.
      
      For high inodes, choosing a goal equal to the original goal
      % UINT_MAX may be a bit odd, but then we're in an odd situation
      anyway, and I don't know of a better heuristic.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
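The truncation described in this commit, and the goal heuristic it mentions, can be sketched in user-space C. This is only an illustration under stated assumptions: the `demo_` names are hypothetical, `assert()` stands in for the kernel's `BUG_ON()`, and the modulo follows the "% UINT_MAX" wording of the message rather than the exact expression used in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Largest block number that fits in a 32-bit indirect-block pointer.
 * The name is hypothetical; the limit is what the commit message
 * calls UINT_MAX. */
#define DEMO_MAX_BLOCK_32 UINT32_MAX

/* Without any limit, storing a 64-bit block number into a 32-bit
 * on-disk slot silently drops the upper 32 bits. */
static uint32_t demo_store_unchecked(uint64_t block)
{
	return (uint32_t)block;
}

/* Goal heuristic as worded in the message: fold an out-of-range
 * goal back into range with "% UINT_MAX". */
static uint64_t demo_limit_goal(uint64_t goal)
{
	if (goal > DEMO_MAX_BLOCK_32)
		goal %= DEMO_MAX_BLOCK_32;
	return goal;
}

/* Checked store: refuse block numbers that would be truncated. */
static uint32_t demo_store_checked(uint64_t block)
{
	assert(block <= DEMO_MAX_BLOCK_32);	/* BUG_ON() stand-in */
	return (uint32_t)block;
}
```

With these helpers, block 0x100000005 stored unchecked comes back as block 5, which is exactly the corruption the message describes; limiting the goal first keeps the checked store from tripping.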
  4. 10 Sep, 2009 3 commits
  5. 05 Sep, 2009 1 commit
  6. 26 Aug, 2009 2 commits
  7. 18 Aug, 2009 2 commits
    • simplify some logic in ext4_mb_normalize_request · 38877f4e
      Eric Sandeen committed
      While reading through some of the mballoc code it seems that a couple
      spots in the size normalization function could be streamlined.
      
      The test for non-overlapping PAs can be or'd for the start & end
      conditions, and the tests for adjacent PAs can be else-if'd - 
      it's essentially independently testing:
      
      	if (A + B <= C)
      		...
      	if (A > C)
      		...
      
      These cannot both be true so it seems like the else-if might
      be slightly more efficient and/or informative.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
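The else-if observation above can be illustrated with a small stand-alone function. This is a sketch, not the kernel code: `demo_classify()` and its enum are hypothetical, with `a`, `b` and `c` playing the roles of A, B and C from the message. As long as `b` is positive and `a + b` does not overflow, `a + b <= c` implies `a < c`, so the second test can never succeed once the first one has.

```c
#include <assert.h>

enum demo_side { DEMO_PA_BEFORE, DEMO_PA_AFTER, DEMO_PA_OVERLAPS };

/* Classify a range [a, a + b) relative to a point c. */
static enum demo_side demo_classify(unsigned long a, unsigned long b,
				    unsigned long c)
{
	assert(b > 0);	/* a range covers at least one block */

	if (a + b <= c)
		return DEMO_PA_BEFORE;	/* range ends at or before c */
	else if (a > c)
		return DEMO_PA_AFTER;	/* range starts after c */
	return DEMO_PA_OVERLAPS;	/* neither: c lies inside the range */
}
```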
    • ext4: open-code ext4_mb_update_group_info · 0373130d
      Eric Sandeen committed
      ext4_mb_update_group_info is only called in one place, and it's
      extremely simple.  There's no reason to have it in a separate function
      in a separate file as far as I can tell; it just obfuscates what's
      really going on.
      
      Perhaps it was intended to keep the grp->bb_* manipulation local to
      mballoc.c but we're already accessing other grp-> fields in balloc.c
      directly so this seems ok.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  8. 19 Sep, 2009 1 commit
    • ext4: Avoid group preallocation for closed files · 50797481
      Theodore Ts'o committed
      Currently the group preallocation code tries to find a large (512-block)
      free extent from which to do per-cpu group allocation for small files.
      The problem with this scheme is that it leaves the filesystem horribly
      fragmented.  In the worst case, if the filesystem is unmounted and
      remounted (after a system shutdown, for example) we forget the fact
      that we were using a particular (now partially filled) 512-block
      extent.  So the next time we try to allocate space for a small file,
      we will find *another* completely free 512-block chunk to allocate
      small files.  Given that there are 32,768 blocks in a block group,
      after 64 iterations of "mount, write one 4k file in a directory,
      unmount", the block group will have 64 files, each separated by 511
      blocks, and the block group will no longer have any completely free
      512-block chunks for group preallocation space.
      
      So if we try to allocate blocks for a file that has been closed, such
      that we know the final size of the file, and the filesystem is not
      busy, avoid using group preallocation.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
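The worst case in this message is simple arithmetic (32,768 / 512 = 64), and a toy simulation makes it concrete. The model below is an assumption-laden sketch: it tracks only how many blocks are used in each 512-block chunk of one block group, and treats every "mount, write one 4k file, unmount" cycle as claiming one block from a fresh, completely free chunk. All `demo_` names are hypothetical.

```c
#include <assert.h>

#define DEMO_BLOCKS_PER_GROUP 32768
#define DEMO_CHUNK_BLOCKS     512
#define DEMO_NR_CHUNKS        (DEMO_BLOCKS_PER_GROUP / DEMO_CHUNK_BLOCKS)

struct demo_group {
	int used_in_chunk[DEMO_NR_CHUNKS];	/* blocks used per chunk */
};

/* Index of the first completely free chunk, or -1 if none is left. */
static int demo_find_free_chunk(const struct demo_group *g)
{
	int i;

	for (i = 0; i < DEMO_NR_CHUNKS; i++)
		if (g->used_in_chunk[i] == 0)
			return i;
	return -1;
}

/* Each cycle forgets the previous chunk and grabs a fresh one for a
 * single one-block file.  Returns the number of cycles that succeed. */
static int demo_remount_cycles(struct demo_group *g)
{
	int cycles = 0, chunk;

	while ((chunk = demo_find_free_chunk(g)) >= 0) {
		assert(g->used_in_chunk[chunk] == 0);
		g->used_in_chunk[chunk] = 1;
		cycles++;
	}
	return cycles;
}

/* Total free blocks remaining in the group. */
static int demo_free_blocks(const struct demo_group *g)
{
	int i, free_blocks = 0;

	for (i = 0; i < DEMO_NR_CHUNKS; i++)
		free_blocks += DEMO_CHUNK_BLOCKS - g->used_in_chunk[i];
	return free_blocks;
}
```

After 64 cycles the group in this model still has 32,704 free blocks, yet no completely free 512-block chunk remains.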
  9. 10 Aug, 2009 2 commits
    • ext4: Fix bugs in mballoc's stream allocation mode · 4ba74d00
      Theodore Ts'o committed
      The logic around sbi->s_mb_last_group and sbi->s_mb_last_start was all
      screwed up.  These fields were getting set unconditionally, even when
      stream allocation had not taken place, and they were being used even
      when the file was smaller than s_mb_stream_request, which is when the
      allocator should _not_ be doing stream allocation.

      Fix this by determining once whether or not stream allocation should
      take place, in ext4_mb_group_or_file(), and setting a flag which
      gets used in ext4_mb_regular_allocator() and ext4_mb_use_best_found().
      This simplifies the code and assures that we are consistently using
      (or not using) the stream allocation logic.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Display the mballoc flags in mb_history in hex instead of decimal · 0ef90db9
      Theodore Ts'o committed
      Displaying the flags in base 16 makes it easier to see which flags
      have been set.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  10. 19 Sep, 2009 1 commit
  11. 13 Jul, 2009 1 commit
    • ext4: Fix ext4_mb_initialize_context() to initialize all fields · 833576b3
      Theodore Ts'o committed
      Pavel Roskin pointed out that kmemcheck indicated that
      ext4_mb_store_history() was accessing uninitialized values of
      ac->ac_tail and ac->ac_buddy leading to garbage in the mballoc
      history.  Fix this by initializing the entire structure to all zeros
      first.
      
      Also, two fields were getting doubly initialized by the caller of
      ext4_mb_initialize_context, so remove them for efficiency's sake.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
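The fix described here, zeroing the whole structure before filling in the fields that are known, is a common C idiom. The sketch below uses a hypothetical `struct demo_alloc_context` rather than the real allocation context; only the field names `tail` and `buddy` echo the `ac_tail` and `ac_buddy` fields named in the message.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for an allocation context. */
struct demo_alloc_context {
	unsigned int goal_block;
	unsigned int flags;
	unsigned short tail;	/* only assigned on some code paths */
	unsigned short buddy;	/* only assigned on some code paths */
};

/* Zero everything first, so fields that no later path assigns are
 * read back as 0 instead of stale stack or heap contents. */
static void demo_initialize_context(struct demo_alloc_context *ac,
				    unsigned int goal, unsigned int flags)
{
	assert(ac != NULL);
	memset(ac, 0, sizeof(*ac));
	ac->goal_block = goal;
	ac->flags = flags;
}
```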
  12. 06 Jul, 2009 2 commits
  13. 17 Jul, 2009 1 commit
  14. 15 Jun, 2009 1 commit
  15. 06 Jul, 2009 2 commits
    • ext4: Use rcu_barrier() on module unload. · 3e03f9ca
      Jesper Dangaard Brouer committed
      The ext4 module uses call_rcu(), thus it should use rcu_barrier() on
      module unload.
      
      The kmem cache ext4_pspace_cachep is sometimes freed using
      call_rcu() callbacks.  Thus, we must wait for completion of call_rcu()
      before doing kmem_cache_destroy().
      Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: mark several more functions in mballoc.c as noinline · 089ceecc
      Eric Sandeen committed
      Ted noticed a stack-deep callchain through
      writepages->ext4_mb_regular_allocator->ext4_mb_init_cache->submit_bh ...
      
      With all the static functions in mballoc.c, gcc helpfully
      inlines for us, and we get something like this:
      
      ext4_mb_regular_allocator	(232 bytes stack)
      	ext4_mb_init_cache	(232 bytes stack)
      		submit_bh	(starts 464 deeper)
      
      The two ext4 functions here get several others inlined; by telling
      gcc not to inline them, we can save stack space for when we
      head off into submit_bh land and associated block layer callchains.
      The following noinlined functions are only called once, so this
      won't impact any other callchains:
      
      ext4_mb_regular_allocator 			(104) (was 232)
      	ext4_mb_find_by_goal			 (56) (noinlined)
      	ext4_mb_init_group			 (24) (noinlined)
      		ext4_mb_init_cache		(136) (was 232)
      			ext4_mb_generate_buddy	 (88) (noinlined)
      			ext4_mb_generate_from_pa (40) (noinlined)
      			submit_bh
      	ext4_mb_simple_scan_group		 (24) (noinlined)
      	ext4_mb_scan_aligned			 (56) (noinlined)
      	ext4_mb_complex_scan_group		 (40) (noinlined)
      	ext4_mb_try_best_found			 (24) (noinlined)
      
      Now when we head off into submit_bh() we're only 264 bytes deeper
      in stack than when we entered ext4_mb_regular_allocator()
      (vs. 464 bytes before).  Every 200 bytes helps.  :)
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
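The stack figures in this message can be checked by summing the frames on the path to submit_bh(): 232 + 232 = 464 before, and 104 + 24 + 136 = 264 after. The sketch below does that sum and also shows how a function can be kept out of its caller's frame with GCC's `noinline` attribute. The `demo_` names are hypothetical, the frame sizes are copied from the message, and the kernel itself uses its own `noinline` macros rather than the raw attribute.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__)
#define demo_noinline __attribute__((noinline))
#else
#define demo_noinline	/* no-op on compilers without the attribute */
#endif

struct demo_frame {
	const char *name;
	int bytes;
};

/* Frames between entering the allocator and calling submit_bh(). */
static const struct demo_frame demo_before[] = {
	{ "ext4_mb_regular_allocator", 232 },
	{ "ext4_mb_init_cache",        232 },
};

static const struct demo_frame demo_after[] = {
	{ "ext4_mb_regular_allocator", 104 },
	{ "ext4_mb_init_group",         24 },
	{ "ext4_mb_init_cache",        136 },
};

static int demo_path_depth(const struct demo_frame *path, size_t n)
{
	size_t i;
	int depth = 0;

	assert(path != NULL);
	for (i = 0; i < n; i++)
		depth += path[i].bytes;
	return depth;
}

/* A helper with sizeable locals.  If the compiler inlined it, the
 * 128-byte buffer would be added to the caller's frame and stay
 * there for the caller's whole lifetime; noinline keeps it in a
 * frame that is released as soon as the helper returns. */
static demo_noinline int demo_scan_helper(int seed)
{
	unsigned char scratch[128];
	int i, sum = 0;

	for (i = 0; i < 128; i++)
		scratch[i] = (unsigned char)((seed + i) & 0xff);
	for (i = 0; i < 128; i++)
		sum += scratch[i];
	return sum;
}
```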
  16. 18 May, 2009 1 commit
    • ext4: Add a comprehensive block validity check to ext4_get_blocks() · 6fd058f7
      Theodore Ts'o committed
      To catch filesystem bugs or corruption which could lead to the
      filesystem getting severely damaged, this patch adds a facility for
      tracking all of the filesystem metadata blocks by contiguous regions
      in a red-black tree.  This allows quick searching of the tree to
      locate extents which might overlap with filesystem metadata blocks.
      
      This facility is also used by the multi-block allocator to assure that
      it is not allocating blocks out of the system zone, as well as by the
      routines used when reading indirect blocks and extents information
      from disk to make sure their contents are valid.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
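The overlap test that this facility performs can be sketched without the kernel's red-black tree. The version below swaps in a sorted array searched by bisection, which gives the same logarithmic lookup for a fixed set of regions; it is not the patch's data structure. `struct demo_zone` and `demo_blocks_valid()` are hypothetical names, and the zones are assumed to be sorted by start block and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous run of metadata blocks: [start, start + count). */
struct demo_zone {
	uint64_t start;
	uint64_t count;
};

/* Return 1 if the extent [start, start + count) touches no metadata
 * zone, 0 if it overlaps one. */
static int demo_blocks_valid(const struct demo_zone *zones, size_t nr,
			     uint64_t start, uint64_t count)
{
	size_t lo = 0, hi = nr;

	assert(count > 0);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const struct demo_zone *z = &zones[mid];

		if (start + count <= z->start)
			hi = mid;	/* extent lies entirely below z */
		else if (start >= z->start + z->count)
			lo = mid + 1;	/* extent lies entirely above z */
		else
			return 0;	/* extent overlaps z */
	}
	return 1;
}
```

An allocator could call such a check on each candidate extent before handing it out, and a reader could call it on block numbers loaded from indirect blocks or extent records.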
  17. 15 May, 2009 1 commit
  18. 03 May, 2009 1 commit
  19. 02 May, 2009 2 commits
    • ext4: Make the length of the mb_history file tunable · f4033903
      Curt Wohlgemuth committed
      In memory-constrained systems with many partitions, the ~68K for each
      partition for the mb_history buffer can be excessive.
      
      This patch adds a new mount option, mb_history_length, as well as a
      way of setting the default via a module parameter (or via a sysfs
      parameter in /sys/module/ext4/parameters/default_mb_history_length).
      If the mb_history_length is set to zero, the mb_history facility is
      disabled entirely.
      Signed-off-by: Curt Wohlgemuth <curtw@google.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    • ext4: Don't avoid using BLOCK_UNINIT block groups in mballoc · 75507efb
      Theodore Ts'o committed
      By avoiding the use of not-yet-used block groups (i.e., block groups
      with the BLOCK_UNINIT flag), mballoc had a tendency to create large
      files with large non-contiguous gaps.  In addition, avoiding the use of
      new block groups had a tendency to push regular file data into the
      first block group in a flex_bg group, which slows down e2fsck pass 2,
      since it has a tendency to seek much more.  For
      example:
      
                     Before Patch                       After Patch
                    Time in seconds                   Time in seconds
                  Real /  User/  Sys   MB/s      Real /  User/  Sys    MB/s
      Pass 1      8.52 / 2.21 / 0.46  20.43      8.84 / 4.97 / 1.11   19.68
      Pass 2     21.16 / 1.02 / 1.86  11.30      6.54 / 1.77 / 1.78   36.39
      Pass 3      0.01 / 0.00 / 0.00 139.00      0.01 / 0.01 / 0.00  128.90
      Pass 4      0.16 / 0.15 / 0.00   0.00      0.17 / 0.17 / 0.00    0.00
      Pass 5      2.52 / 1.99 / 0.09   0.79      2.31 / 1.78 / 0.06    0.86
      Total      32.40 / 5.11 / 2.49  12.81     17.99 / 8.75 / 2.98   23.01
      
      This was on a sample 80 gig root filesystem which was approximately
      50% full.  Note the improved e2fsck pass 2 performance, by over a
      factor of 3, due to a decreased number of seeks.  (The total amount of
      I/O in pass 2 was unchanged; the layout of the directory blocks was
      simply much better from e2fsck's perspective.)
      
      Other changes as a result of this patch on this sample filesystem:
      
                                   Before Patch    After Patch
      # of non-contig files           762             779
      # of non-contig directories     571             570
      # of BLOCK_UNINIT bg's          307             293
      # of INODE_UNINIT bg's          503             503
      
      Out of 640 block groups, of which 333 were in use, this patch caused
      an extra 14 block groups to be utilized.  The number of non-contiguous
      files did go up slightly, but when measured against the 99.9% of the
      files (603,154) which were contiguously allocated, this is pretty
      insignificant.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andreas Dilger <adilger@sun.com>
  20. 17 Jun, 2009 1 commit
  21. 01 May, 2009 1 commit
    • ext4: Avoid races caused by on-line resizing and SMP memory reordering · 8df9675f
      Theodore Ts'o committed
      Ext4's on-line resizing adds a new block group and then, only at the
      last step, adjusts s_groups_count.  However, it's possible on SMP
      systems that another CPU could see the updated s_groups_count and
      not see the newly initialized data structures for the just-added block
      group.  For this reason, it's important to insert an SMP read barrier
      after reading s_groups_count and before reading any data (for example,
      the new block group descriptors) allowed by the increased value of
      s_groups_count.
      
      Unfortunately, we rather blatantly violate this locking protocol
      documented in fs/ext4/resize.c.  Fortunately, (1) on-line resizes
      happen relatively rarely, and (2) it seems rare that the filesystem
      code will immediately try to use a just-added block group before any
      memory ordering issues resolve themselves.  So apparently problems
      here are relatively hard to hit, since ext3 has been vulnerable to the
      same issue for years with no one apparently complaining.
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
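The ordering requirement in this message has a close user-space analogue in C11 atomics, sketched below. This is not kernel code: where the kernel pairs a write barrier in the resizer with a read barrier in readers, the sketch uses a release store and an acquire load on the counter. `struct demo_fs` and the `demo_` functions are hypothetical, and a single writer is assumed.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_MAX_GROUPS 8

struct demo_fs {
	int group_desc[DEMO_MAX_GROUPS];	/* stand-in for per-group data */
	atomic_uint groups_count;
};

static void demo_fs_init(struct demo_fs *fs)
{
	atomic_init(&fs->groups_count, 0);
}

/* Resizer side: fill in the new group's data first, and only then
 * publish the larger count.  The release store keeps the data write
 * from being reordered after the count update. */
static void demo_add_group(struct demo_fs *fs, int desc)
{
	unsigned int n = atomic_load_explicit(&fs->groups_count,
					      memory_order_relaxed);

	assert(n < DEMO_MAX_GROUPS);
	fs->group_desc[n] = desc;
	atomic_store_explicit(&fs->groups_count, n + 1,
			      memory_order_release);
}

/* Reader side: the acquire load guarantees that any group data
 * covered by the count just read is visible before it is used. */
static int demo_read_last_group(struct demo_fs *fs)
{
	unsigned int n = atomic_load_explicit(&fs->groups_count,
					      memory_order_acquire);

	return n ? fs->group_desc[n - 1] : -1;
}
```

Dropping the acquire on the reader side recreates the bug class described above: the reader could observe the new count and still read stale descriptor contents.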
  22. 28 Mar, 2009 3 commits
  23. 26 Mar, 2009 2 commits
  24. 17 Mar, 2009 1 commit
    • ext4: fix bb_prealloc_list corruption due to wrong group locking · d33a1976
      Eric Sandeen committed
      This is for Red Hat bug 490026: EXT4 panic, list corruption in
      ext4_mb_new_inode_pa
      
      ext4_lock_group(sb, group) is supposed to protect this list for
      each group, and a common code flow to remove a pa is like
      this:
      
          ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
          ext4_lock_group(sb, grp);
          list_del(&pa->pa_group_list);
          ext4_unlock_group(sb, grp);
      
      so it's critical that we get the right group number back for
      this prealloc context, to lock the right group (the one 
      associated with this pa) and prevent concurrent list manipulation.
      
      However, ext4_mb_put_pa() passes in (pa->pa_pstart - 1) with a
      comment, "-1 is to protect from crossing allocation group".
      
      This makes sense for the group_pa, where pa_pstart is advanced
      by the length which has been used (in ext4_mb_release_context()),
      and when the entire length has been used, pa_pstart has been
      advanced to the first block of the next group.
      
      However, for inode_pa, pa_pstart is never advanced; it's just
      set once to the first block in the group and not moved after
      that.  So in this case, if we subtract one in ext4_mb_put_pa(),
      we are actually locking the *previous* group, and opening the
      race with the other threads which do not subtract off the extra
      block.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
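The off-by-one-group effect is easy to reproduce with plain integer arithmetic. The sketch below assumes 32,768 blocks per group and a first data block of 0, so a block's group is simply `block / 32768`. The `demo_` functions are hypothetical, and applying the "- 1" only for group preallocations is one way to express the distinction the message draws, not necessarily the exact form of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BLOCKS_PER_GROUP 32768u

/* Group number of a block, assuming the first data block is 0. */
static uint32_t demo_group_of(uint64_t block)
{
	return (uint32_t)(block / DEMO_BLOCKS_PER_GROUP);
}

/* Which group to lock when releasing a preallocation.  A fully used
 * group pa has had pa_pstart advanced, possibly onto the first block
 * of the next group, so it needs the "- 1".  An inode pa never moves
 * pa_pstart, so subtracting one would select the previous group. */
static uint32_t demo_lock_group_for_pa(uint64_t pa_pstart, int is_group_pa)
{
	assert(!is_group_pa || pa_pstart > 0);
	return demo_group_of(is_group_pa ? pa_pstart - 1 : pa_pstart);
}
```

For a pa starting at block 98,304, the first block of group 3, the unconditional "- 1" yields group 2, so two threads working on the same list would take different locks.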
  25. 14 Mar, 2009 1 commit
    • ext4: fix bogus BUG_ONs in mballoc code · 8d03c7a0
      Eric Sandeen committed
      Thiemo Nagel reported that:
      
      # dd if=/dev/zero of=image.ext4 bs=1M count=2
      # mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \
        -O large_file,dir_index,flex_bg,extent,sparse_super image.ext4
      # mount -o loop image.ext4 mnt/
      # dd if=/dev/zero of=mnt/file
      
      oopsed, with a BUG_ON in ext4_mb_normalize_request because
      size == EXT4_BLOCKS_PER_GROUP
      
      It appears to me (esp. after talking to Andreas) that the BUG_ON
      is bogus; a request of exactly EXT4_BLOCKS_PER_GROUP should
      be allowed, though larger sizes do indicate a problem.
      
      Fix that and another (apparently rare) codepath with a similar check.
      Reported-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
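The boundary condition at issue can be shown in a few lines. This sketch assumes the bogus check rejected `size >= EXT4_BLOCKS_PER_GROUP` and that the corrected one rejects only `size > EXT4_BLOCKS_PER_GROUP`, which matches the description above but is not copied from the patch. The `demo_` names are hypothetical, the constant matches the `-g 512` in the reproducer, and `assert()` stands in for `BUG_ON()`.

```c
#include <assert.h>

#define DEMO_BLOCKS_PER_GROUP 512u	/* as in "mkfs.ext4 -g 512" */

/* Assumed old behaviour: a request for exactly one full group trips. */
static int demo_old_check_trips(unsigned int size)
{
	return size >= DEMO_BLOCKS_PER_GROUP;
}

/* Assumed new behaviour: only requests larger than a group trip. */
static int demo_new_check_trips(unsigned int size)
{
	return size > DEMO_BLOCKS_PER_GROUP;
}

/* Accept a normalized request size under the relaxed check. */
static unsigned int demo_accept_request(unsigned int size)
{
	assert(size > 0 && !demo_new_check_trips(size));
	return size;
}
```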
  26. 05 Mar, 2009 1 commit
  27. 31 Mar, 2009 1 commit
  28. 13 Mar, 2009 1 commit
    • ext4: New inode/block allocation algorithms for flex_bg filesystems · a4912123
      Theodore Ts'o committed
      The find_group_flex() inode allocator is now only used if the
      filesystem is mounted using the "oldalloc" mount option.  It is
      replaced with the original Orlov allocator that has been updated for
      flex_bg filesystems (it should behave the same way if flex_bg is
      disabled).  The inode allocator now functions by taking into account
      each flex_bg group, instead of each block group, when deciding whether
      or not it's time to allocate a new directory into a fresh flex_bg.
      
      The block allocator has also been changed so that the first block
      group in each flex_bg is preferred for use for storing directory
      blocks.  This keeps directory blocks close together, which is good for
      speeding up e2fsck since large directories are more likely to look
      like this:
      
      debugfs:  stat /home/tytso/Maildir/cur
      Inode: 1844562   Type: directory    Mode:  0700   Flags: 0x81000
      Generation: 1132745781    Version: 0x00000000:0000ad71
      User: 15806   Group: 15806   Size: 1060864
      File ACL: 0    Directory ACL: 0
      Links: 2   Blockcount: 2072
      Fragment:  Address: 0    Number: 0    Size: 0
       ctime: 0x499c0ff4:164961f4 -- Wed Feb 18 08:41:08 2009
       atime: 0x499c0ff4:00000000 -- Wed Feb 18 08:41:08 2009
       mtime: 0x49957f51:00000000 -- Fri Feb 13 09:10:25 2009
      crtime: 0x499c0f57:00d51440 -- Wed Feb 18 08:38:31 2009
      Size of extra inode fields: 28
      BLOCKS:
      (0):7348651, (1-258):7348654-7348911
      TOTAL: 259
      Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
  29. 14 Feb, 2009 1 commit