1. 25 3月, 2009 14 次提交
    • C
      Btrfs: tree logging unlink/rename fixes · 12fcfd22
      Chris Mason 提交于
      The tree logging code allows individual files or directories to be logged
      without including operations on other files and directories in the FS.
      It tries to commit the minimal set of changes to disk in order to
      fsync the single file or directory that was sent to fsync or O_SYNC.
      
      The tree logging code was allowing files and directories to be unlinked
      if they were part of a rename operation where only one directory
      in the rename was in the fsync log.  This patch adds a few new rules
      to the tree logging.
      
      1) on rename or unlink, if the inode being unlinked isn't in the fsync
      log, we must force a full commit before doing an fsync of the directory
      where the unlink was done.  The commit isn't done during the unlink,
      but it is forced the next time we try to log the parent directory.
      
      Solution: record transid of last unlink/rename per directory when the
      directory wasn't already logged.  For renames this is only done when
      renaming to a different directory.
      
      mkdir foo/some_dir
      normal commit
      rename foo/some_dir foo2/some_dir
      mkdir foo/some_dir
      fsync foo/some_dir/some_file
      
      The fsync above will unlink the original some_dir without recording
      it in its new location (foo2).  After a crash, some_dir will be gone
      unless the fsync of some_file forces a full commit
      
      2) we must log any new names for any file or dir that is in the fsync
      log.  This way we make sure not to lose files that are unlinked during
      the same transaction.
      
      2a) we must log any new names for any file or dir during rename
      when the directory they are being removed from was logged.
      
      2a is actually the more important variant.  Without the extra logging
      a crash might unlink the old name without recreating the new one
      
      3) after a crash, we must go through any directories with a link count
      of zero and redo the rm -rf
      
      mkdir f1/foo
      normal commit
      rm -rf f1/foo
      fsync(f1)
      
      The directory f1 was fully removed from the FS, but fsync was never
      called on f1, only its parent dir.  After a crash the rm -rf must
      be replayed.  This must be able to recurse down the entire
      directory tree.  The inode link count fixup code takes care of the
      ugly details.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      12fcfd22
    • C
      Btrfs: Make sure i_nlink doesn't hit zero too soon during log replay · a74ac322
      Chris Mason 提交于
      During log replay, inodes are copied from the log to the main filesystem
      btrees.  Sometimes they have a zero link count in the log but they actually
      gain links during the replay or have some in the main btree.
      
      This patch updates the link count to be at least one after copying the
      inode out of the log.  This makes sure the inode is deleted during an
      iput while the rest of the replay code is still working on it.
      
      The log replay has fixup code to make sure that link counts are correct
      at the end of the replay, so we could use any non-zero number here and
      it would work fine.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a74ac322
    • C
      Btrfs: limit balancing work while flushing delayed refs · a4b6e07d
      Chris Mason 提交于
      The delayed reference mechanism is responsible for all updates to the
      extent allocation trees, including those updates created while processing
      the delayed references.
      
      This commit tries to limit the amount of work that gets created during
      the final run of delayed refs before a commit.  It avoids cowing new blocks
      unless it is required to finish the commit, and so it avoids new allocations
      that were not really required.
      
      The goal is to avoid infinite loops where we are always making more work
      on the final run of delayed refs.  Over the long term we'll make a
      special log for the last delayed ref updates as well.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      a4b6e07d
    • C
      Btrfs: readahead checksums during btrfs_finish_ordered_io · 5d13a98f
      Chris Mason 提交于
      This reads in blocks in the checksum btree before starting the
      transaction in btrfs_finish_ordered_io.  It makes it much more likely
      we'll be able to do operations inside the transaction without
      needing any btree reads, which limits transaction latencies overall.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5d13a98f
    • C
      Btrfs: leave btree locks spinning more often · b9473439
      Chris Mason 提交于
      btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
      for the buffers it was dirtying.  This may require a kmalloc and it
      was not atomic.  So, anyone who called btrfs_mark_buffer_dirty had to
      set any btree locks they were holding to blocking first.
      
      This commit changes dirty tracking for extent buffers to just use a flag
      in the extent buffer.  Now that we have one and only one extent buffer
      per page, this can be safely done without losing dirty bits along the way.
      
      This also introduces a path->leave_spinning flag that callers of
      btrfs_search_slot can use to indicate they will properly deal with a
      path returned where all the locks are spinning instead of blocking.
      
      Many of the btree search callers now expect spinning paths,
      resulting in better btree concurrency overall.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b9473439
    • C
      Btrfs: Only let very young transactions grow during commit · 89573b9c
      Chris Mason 提交于
      Commits are fairly expensive, and so btrfs has code to sit around for a while
      during the commit and let new writers come in.
      
      But, while we're sitting there, new delayed refs might be added, and those
      can be expensive to process as well.  Unless the transaction is very very
      young, it makes sense to go ahead and let the commit finish without hanging
      around.
      
      The commit grow loop isn't as important as it used to be, the fsync logging
      code handles most performance critical syncs now.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      89573b9c
    • C
      Btrfs: Check for a blocking lock before taking the spin · 66d7e85e
      Chris Mason 提交于
      This reduces contention on the extent buffer spin locks by testing for a
      blocking lock before trying to take the spinlock.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      66d7e85e
    • C
      Btrfs: reduce stack in cow_file_range · 7f366cfe
      Chris Mason 提交于
      The fs/btrfs/inode.c code to run delayed allocation during writout
      needed some stack usage optimization.  This is the first pass, it does
      the check for compression earlier on, which allows us to do the common
      (no compression) case higher up in the call chain.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      7f366cfe
    • C
      Btrfs: reduce stalls during transaction commit · b7ec40d7
      Chris Mason 提交于
      To avoid deadlocks and reduce latencies during some critical operations, some
      transaction writers are allowed to jump into the running transaction and make
      it run a little longer, while others sit around and wait for the commit to
      finish.
      
      This is a bit unfair, especially when the callers that jump in do a bunch
      of IO that makes all the others procs on the box wait.  This commit
      reduces the stalls this produces by pre-reading file extent pointers
      during btrfs_finish_ordered_io before the transaction is joined.
      
      It also tunes the drop_snapshot code to politely wait for transactions
      that have started writing out their delayed refs to finish.  This avoids
      new delayed refs being flooded into the queue while we're trying to
      close off the transaction.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b7ec40d7
    • C
      Btrfs: process the delayed reference queue in clusters · c3e69d58
      Chris Mason 提交于
      The delayed reference queue maintains pending operations that need to
      be done to the extent allocation tree.  These are processed by
      finding records in the tree that are not currently being processed one at
      a time.
      
      This is slow because it uses lots of time searching through the rbtree
      and because it creates lock contention on the extent allocation tree
      when lots of different procs are running delayed refs at the same time.
      
      This commit changes things to grab a cluster of refs for processing,
      using a cursor into the rbtree as the starting point of the next search.
      This way we walk smoothly through the rbtree.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      c3e69d58
    • C
      Btrfs: try to cleanup delayed refs while freeing extents · 1887be66
      Chris Mason 提交于
      When extents are freed, it is likely that we've removed the last
      delayed reference update for the extent.  This checks the delayed
      ref tree when things are freed, and if no ref updates area left it
      immediately processes the delayed ref.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      1887be66
    • C
      Btrfs: reduce stack usage in some crucial tree balancing functions · 44871b1b
      Chris Mason 提交于
      Many of the tree balancing functions follow the same pattern.
      
      1) cow a block
      2) do something to the result
      
      This commit breaks them up into two functions so the variables and
      code required for part two don't suck down stack during part one.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      44871b1b
    • C
      Btrfs: do extent allocation and reference count updates in the background · 56bec294
      Chris Mason 提交于
      The extent allocation tree maintains a reference count and full
      back reference information for every extent allocated in the
      filesystem.  For subvolume and snapshot trees, every time
      a block goes through COW, the new copy of the block adds a reference
      on every block it points to.
      
      If a btree node points to 150 leaves, then the COW code needs to go
      and add backrefs on 150 different extents, which might be spread all
      over the extent allocation tree.
      
      These updates currently happen during btrfs_cow_block, and most COWs
      happen during btrfs_search_slot.  btrfs_search_slot has locks held
      on both the parent and the node we are COWing, and so we really want
      to avoid IO during the COW if we can.
      
      This commit adds an rbtree of pending reference count updates and extent
      allocations.  The tree is ordered by byte number of the extent and byte number
      of the parent for the back reference.  The tree allows us to:
      
      1) Modify back references in something close to disk order, reducing seeks
      2) Significantly reduce the number of modifications made as block pointers
      are balanced around
      3) Do all of the extent insertion and back reference modifications outside
      of the performance critical btrfs_search_slot code.
      
      #3 has the added benefit of greatly reducing the btrfs stack footprint.
      The extent allocation tree modifications are done without the deep
      (and somewhat recursive) call chains used in the past.
      
      These delayed back reference updates must be done before the transaction
      commits, and so the rbtree is tied to the transaction.  Throttling is
      implemented to help keep the queue of backrefs at a reasonable size.
      
      Since there was a similar mechanism in place for the extent tree
      extents, that is removed and replaced by the delayed reference tree.
      
      Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      56bec294
    • C
      Btrfs: don't preallocate metadata blocks during btrfs_search_slot · 9fa8cfe7
      Chris Mason 提交于
      In order to avoid doing expensive extent management with tree locks held,
      btrfs_search_slot will preallocate tree blocks for use by COW without
      any tree locks held.
      
      A later commit moves all of the extent allocation work for COW into
      a delayed update mechanism, and this preallocation will no longer be
      required.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      9fa8cfe7
  2. 23 3月, 2009 3 次提交
  3. 20 3月, 2009 3 次提交
  4. 18 3月, 2009 2 次提交
    • B
      NFSD: provide encode routine for OP_OPENATTR · 84f09f46
      Benny Halevy 提交于
      Although this operation is unsupported by our implementation
      we still need to provide an encode routine for it to
      merely encode its (error) status back in the compound reply.
      
      Thanks for Bill Baker at sun.com for testing with the Sun
      OpenSolaris' client, finding, and reporting this bug at
      Connectathon 2009.
      
      This bug was introduced in 2.6.27
      Signed-off-by: NBenny Halevy <bhalevy@panasas.com>
      Cc: stable@kernel.org
      Signed-off-by: NJ. Bruce Fields <bfields@citi.umich.edu>
      84f09f46
    • L
      Avoid 64-bit "switch()" statements on 32-bit architectures · ee568b25
      Linus Torvalds 提交于
      Commit ee6f779b ("filp->f_pos not
      correctly updated in proc_task_readdir") changed the proc code to use
      filp->f_pos directly, rather than through a temporary variable.  In the
      process, that caused the operations to be done on the full 64 bits, even
      though the offset is never that big.
      
      That's all fine and dandy per se, but for some unfathomable reason gcc
      generates absolutely horrid code when using 64-bit values in switch()
      statements.  To the point of actually calling out to gcc helper
      functions like __cmpdi2 rather than just doing the trivial comparisons
      directly the way gcc does for normal compares.  At which point we get
      link failures, because we really don't want to support that kind of
      crazy code.
      
      Fix this by just casting the f_pos value to "unsigned long", which
      is plenty big enough for /proc, and avoids the gcc code generation issue.
      Reported-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Zhang Le <r0bertz@gentoo.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee568b25
  5. 17 3月, 2009 1 次提交
    • E
      ext4: fix bb_prealloc_list corruption due to wrong group locking · d33a1976
      Eric Sandeen 提交于
      This is for Red Hat bug 490026: EXT4 panic, list corruption in
      ext4_mb_new_inode_pa
      
      ext4_lock_group(sb, group) is supposed to protect this list for
      each group, and a common code flow to remove an album is like
      this:
      
          ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
          ext4_lock_group(sb, grp);
          list_del(&pa->pa_group_list);
          ext4_unlock_group(sb, grp);
      
      so it's critical that we get the right group number back for
      this prealloc context, to lock the right group (the one 
      associated with this pa) and prevent concurrent list manipulation.
      
      however, ext4_mb_put_pa() passes in (pa->pa_pstart - 1) with a 
      comment, "-1 is to protect from crossing allocation group".
      
      This makes sense for the group_pa, where pa_pstart is advanced
      by the length which has been used (in ext4_mb_release_context()),
      and when the entire length has been used, pa_pstart has been
      advanced to the first block of the next group.
      
      However, for inode_pa, pa_pstart is never advanced; it's just
      set once to the first block in the group and not moved after
      that.  So in this case, if we subtract one in ext4_mb_put_pa(),
      we are actually locking the *previous* group, and opening the
      race with the other threads which do not subtract off the extra
      block.
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      d33a1976
  6. 16 3月, 2009 1 次提交
    • Z
      filp->f_pos not correctly updated in proc_task_readdir · ee6f779b
      Zhang Le 提交于
      filp->f_pos only get updated at the end of the function. Thus d_off of those
      dirents who are in the middle will be 0, and this will cause a problem in
      glibc's readdir implementation, specifically endless loop. Because when overflow
      occurs, f_pos will be set to next dirent to read, however it will be 0, unless
      the next one is the last one. So it will start over again and again.
      
      There is a sample program in man 2 gendents. This is the output of the program
      running on a multithread program's task dir before this patch is applied:
      
        $ ./a.out /proc/3807/task
        --------------- nread=128 ---------------
        i-node#  file type  d_reclen  d_off   d_name
          506442  directory    16          1  .
          506441  directory    16          0  ..
          506443  directory    16          0  3807
          506444  directory    16          0  3809
          506445  directory    16          0  3812
          506446  directory    16          0  3861
          506447  directory    16          0  3862
          506448  directory    16          8  3863
      
      This is the output after this patch is applied
      
        $ ./a.out /proc/3807/task
        --------------- nread=128 ---------------
        i-node#  file type  d_reclen  d_off   d_name
          506442  directory    16          1  .
          506441  directory    16          2  ..
          506443  directory    16          3  3807
          506444  directory    16          4  3809
          506445  directory    16          5  3812
          506446  directory    16          6  3861
          506447  directory    16          7  3862
          506448  directory    16          8  3863
      Signed-off-by: NZhang Le <r0bertz@gentoo.org>
      Acked-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee6f779b
  7. 15 3月, 2009 5 次提交
  8. 14 3月, 2009 1 次提交
    • E
      ext4: fix bogus BUG_ONs in in mballoc code · 8d03c7a0
      Eric Sandeen 提交于
      Thiemo Nagel reported that:
      
      # dd if=/dev/zero of=image.ext4 bs=1M count=2
      # mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \
        -O large_file,dir_index,flex_bg,extent,sparse_super image.ext4
      # mount -o loop image.ext4 mnt/
      # dd if=/dev/zero of=mnt/file
      
      oopsed, with a BUG_ON in ext4_mb_normalize_request because
      size == EXT4_BLOCKS_PER_GROUP
      
      It appears to me (esp. after talking to Andreas) that the BUG_ON
      is bogus; a request of exactly EXT4_BLOCKS_PER_GROUP should
      be allowed, though larger sizes do indicate a problem.
      
      Fix that an another (apparently rare) codepath with a similar check.
      Reported-by: NThiemo Nagel <thiemo.nagel@ph.tum.de>
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
      8d03c7a0
  9. 13 3月, 2009 9 次提交
  10. 12 3月, 2009 1 次提交
    • P
      Squashfs: Valid filesystems are flagged as bad by the corrupted fs patch · 363911d0
      Phillip Lougher 提交于
      The corrupted filesystem patch added a check against zlib trying to
      output too much data in the presence of data corruption.  This check
      triggered if zlib_inflate asked to be called again (Z_OK) with
      avail_out == 0 and no more output buffers available.  This check proves
      to be rather dumb, as it incorrectly catches the case where zlib has
      generated all the output, but there are still input bytes to be processed.
      
      This patch does a number of things.  It removes the original check and
      replaces it with code to not move to the next output buffer if there
      are no more output buffers available, relying on zlib to error if it
      wants an extra output buffer in the case of data corruption.  It
      also replaces the Z_NO_FLUSH flag with the more correct Z_SYNC_FLUSH
      flag, and makes the error messages more understandable to
      non-technical users.
      Signed-off-by: NPhillip Lougher <phillip@lougher.demon.co.uk>
      Reported-by: NStefan Lippers-Hollmann <s.L-H@gmx.de>
      363911d0