1. 30 Aug 2014, 4 commits
  2. 29 Aug 2014, 4 commits
    • ext4: fix same-dir rename when inline data directory overflows · d80d448c
      Darrick J. Wong committed
      When performing a same-directory rename, it's possible that adding or
      setting the new directory entry will cause the directory to overflow
      the inline data area, which causes the directory to be converted to an
      extent-based directory.  Under this circumstance it is necessary to
      re-read the directory when deleting the old dirent because the "old
      directory" context still points to i_block in the inode table, which
      is now an extent tree root!  The delete fails with an FS error, and
      the subsequent fsck complains about incorrect link counts and
      hardlinked directories.
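
      A rough sketch of the shape of the fix, with hypothetical field and
      helper names (the real change lives in fs/ext4/namei.c):

          /* Illustrative sketch only, not the actual patch. */
          static int rename_delete_old_entry(handle_t *handle,
                                             struct ext4_renament *old)
          {
                  /*
                   * If inserting the new dirent converted the directory from
                   * inline data to extents, old->bh still points into i_block
                   * in the inode table, which is now an extent tree root.
                   * Re-read the old dirent from its new location first.
                   */
                  if (old->dir_was_converted) {       /* hypothetical flag */
                          brelse(old->bh);
                          old->bh = ext4_find_entry(old->dir,
                                                    &old->dentry->d_name,
                                                    &old->de, NULL);
                          if (!old->bh)
                                  return -ENOENT;
                  }
                  return ext4_delete_entry(handle, old->dir, old->de, old->bh);
          }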
      
      Test case (originally found with flat_dir_test in the metadata_csum
      test program):
      
      # mkfs.ext4 -O inline_data /dev/sda
      # mount /dev/sda /mnt
      # mkdir /mnt/x
      # touch /mnt/x/changelog.gz /mnt/x/copyright /mnt/x/README.Debian
      # sync
      # for i in /mnt/x/*; do mv $i $i.longer; done
      # ls -la /mnt/x/
      total 0
      -rw-r--r-- 1 root root 0 Aug 25 12:03 changelog.gz.longer
      -rw-r--r-- 1 root root 0 Aug 25 12:03 copyright
      -rw-r--r-- 1 root root 0 Aug 25 12:03 copyright.longer
      -rw-r--r-- 1 root root 0 Aug 25 12:03 README.Debian.longer
      
      (Hey!  Why are there four files now??)
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      d80d448c
    • jbd2: fix descriptor block size handling errors with journal_csum · db9ee220
      Darrick J. Wong committed
      It turns out that there are some serious problems with the on-disk
      format of journal checksum v2.  The foremost is that the function to
      calculate descriptor tag size returns sizes that are too big.  This
      causes alignment issues on some architectures and is compounded by the
      fact that some parts of jbd2 use the structure size (incorrectly) to
      determine the presence of a 64bit journal instead of checking the
      feature flags.
      
      Therefore, introduce journal checksum v3, which enlarges the
      descriptor block tag format to allow for full 32-bit checksums of
      journal blocks, fix the journal tag function to return the correct
      sizes, and fix the jbd2 recovery code to use feature flags to
      determine 64bitness.
      
      Add a few function helpers so we don't have to open-code quite so
      many pieces.
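
      As a sketch of what such a helper can look like (feature-test names
      below are approximations of the jbd2 ones, not verbatim code):

          /* Size an on-disk descriptor tag from feature flags only. */
          static size_t journal_tag_bytes(journal_t *journal)
          {
                  size_t sz;

                  if (jbd2_journal_has_csum_v3(journal))  /* approximated name */
                          return sizeof(journal_block_tag3_t); /* 16B, u32 csum */

                  sz = sizeof(journal_block_tag_t);
                  if (jbd2_journal_has_csum_v2(journal))
                          sz += sizeof(__u16);    /* truncated v2 checksum */
                  if (jbd2_journal_has_64bit(journal))    /* feature flag,   */
                          return sz;                      /* not struct size */
                  return sz - sizeof(__u32);      /* drop t_blocknr_high */
          }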
      
      Switching to the 16-byte descriptor block tag was found to increase
      journal size overhead by a maximum of 0.1%, when converting a 32-bit
      journal with no checksumming to a 32-bit journal with checksum v3
      enabled.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reported-by: TR Reardon <thomas_reardon@hotmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      db9ee220
    • jbd2: fix infinite loop when recovering corrupt journal blocks · 022eaa75
      Darrick J. Wong committed
      When recovering the journal, don't fall into an infinite loop if we
      encounter a corrupt journal block.  Instead, just skip the block and
      return an error, which fails the mount and thus forces the user to run
      a full filesystem fsck.
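
      A sketch of the replay-loop behaviour, approximating the code in
      fs/jbd2/recovery.c rather than quoting it:

          /* A corrupt block is skipped instead of retried forever, and the
           * error is remembered so the mount ultimately fails. */
          if (!jbd2_block_tag_csum_verify(journal, tag, obh->b_data,
                                          be32_to_cpu(tmp->h_sequence))) {
                  brelse(obh);
                  block_error = 1;   /* remember: replay is incomplete */
                  goto skip_write;   /* skip this block, keep scanning */
          }
          /* ...at the end of the pass... */
          if (block_error && success == 0)
                  success = -EIO;    /* failed mount forces a full fsck */
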
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      022eaa75
    • ext4: update i_disksize coherently with block allocation on error path · 6603120e
      Dmitry Monakhov committed
      With delayed allocation, i_disksize may be less than i_size, so we
      have to update i_disksize each time we allocate and submit blocks
      beyond i_disksize.  We weren't doing this on the error paths, so fix
      this.
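
      A sketch of the pattern on the writeback error path, loosely following
      ext4's mpage_map_and_submit_extent() (exact context may differ):

          /* Blocks beyond i_disksize were allocated and submitted above, so
           * i_disksize must move forward even when a later step failed. */
          disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT;
          if (disksize > EXT4_I(inode)->i_disksize) {
                  int err2;

                  ext4_update_i_disksize(inode, disksize);
                  err2 = ext4_mark_inode_dirty(handle, inode);
                  if (err2)
                          ext4_error(inode->i_sb,
                                     "Failed to mark inode %lu dirty",
                                     inode->i_ino);
                  if (!err)
                          err = err2;     /* report the first failure */
          }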
      
      testcase: xfstest generic/019
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      6603120e
  3. 28 Aug 2014, 2 commits
  4. 27 Aug 2014, 3 commits
  5. 25 Aug 2014, 1 commit
    • aio: fix reqs_available handling · d856f32a
      Benjamin LaHaise committed
      As reported by Dan Aloni, commit f8567a38 ("aio: fix aio request
      leak when events are reaped by userspace") introduces a regression when
      user code attempts to perform io_submit() with more events than are
      available in the ring buffer.  Reverting that commit would reintroduce a
      regression when user space event reaping is used.
      
      Fixing this bug is a bit more involved than the previous attempts to fix
      this regression.  Since we do not have a single point at which we can
      count events as being reaped by user space and io_getevents(), we have
      to track event completion by looking at the number of events left in the
      event ring.  So long as there are as many events in the ring buffer as
      there have been completion events generated, we cannot call
      put_reqs_available().  The code to check for this is now placed in
      refill_reqs_available().
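
      A sketch of that accounting, mirroring the helper this patch adds to
      fs/aio.c (treat it as illustrative, not the verbatim kernel code):

          /* Recredit request slots only for events already reaped. */
          static void refill_reqs_available(struct kioctx *ctx, unsigned head,
                                            unsigned tail)
          {
                  unsigned events_in_ring, completed;

                  head %= ctx->nr_events;     /* clamp to the ring size */
                  if (head <= tail)
                          events_in_ring = tail - head;
                  else
                          events_in_ring = ctx->nr_events - (head - tail);

                  completed = ctx->completed_events;
                  if (completed > events_in_ring)
                          completed -= events_in_ring; /* reaped by user space */
                  else
                          completed = 0;               /* all still in the ring */

                  if (!completed)
                          return;

                  ctx->completed_events -= completed;
                  put_reqs_available(ctx, completed);
          }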
      
      A test program from Dan, modified by me to verify this bug, is
      available at http://www.kvack.org/~bcrl/20140824-aio_bug.c .
      Reported-by: Dan Aloni <dan@kernelim.com>
      Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
      Acked-by: Dan Aloni <dan@kernelim.com>
      Cc: Kent Overstreet <kmo@daterainc.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Petr Matousek <pmatouse@redhat.com>
      Cc: stable@vger.kernel.org      # v3.16 and anything that f8567a38 was backported to
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d856f32a
  6. 24 Aug 2014, 4 commits
    • Btrfs: fix task hang under heavy compressed write · 9e0af237
      Liu Bo committed
      This has been reported and discussed for a long time, and the hang
      occurs in both 3.15 and 3.16.
      
      Btrfs has migrated to the kernel workqueue, and that migration
      introduced this hang problem.
      
      Btrfs has a kind of work that is queued in an ordered way, which means
      that its ordered_func() must be processed FIFO, so it usually looks
      like:
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is a rare case: first, when we look for free space, we get an
      uncached block group, then we go to read its free space cache inode for
      the free space information, so it will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                            queue a work(named work Y) for the 2nd
                                            also the real endio()
      
      So the hang occurs when work Y's work_struct and work X's work_struct
      happen to share the same address.
      
      A bit more explanation:
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
      				      ...   <-- (the above readhead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      Since work A sits early in wq->ordered_list and more ordered works,
      such as B, are queued after it, A's memory can already have been freed
      before normal_work_helper() returns. That means the kernel workqueue
      code in worker_thread() still has worker->current_work pointing at
      work A->normal_work, i.e. arg's address.
      
      Meanwhile, work C is allocated after work A is freed, so work
      C->normal_work and work A->normal_work are likely to share the same
      address (I confirmed this with ftrace output, so I'm not just
      guessing; it's rare, though).
      
      When another kthread picks up work C->normal_work to process, and
      finds that our kthread is processing it (see
      find_worker_executing_work()), it treats work C as a collision and
      skips it, which ends up with nobody processing work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there are other cases that can lead to this deadlock, but the
      real problem is that all btrfs workqueues share one work->func,
      normal_work_helper. So this patch makes each workqueue have its own
      helper function, which is only a wrapper of normal_work_helper (see
      the sketch below).
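
      A minimal sketch of that wrapper scheme; the macro shape and queue
      names below are approximations:

          /* One distinct work->func symbol per btrfs workqueue, so that
           * worker->current_func can never alias across different queues,
           * even when a btrfs_work's memory is reused. */
          #define BTRFS_WORK_HELPER(name)                                       \
          void btrfs_##name(struct work_struct *arg)                            \
          {                                                                     \
                  struct btrfs_work *work = container_of(arg,                   \
                                  struct btrfs_work, normal_work);              \
                  normal_work_helper(work);                                     \
          }

          BTRFS_WORK_HELPER(endio_helper);      /* illustrative queue names */
          BTRFS_WORK_HELPER(delalloc_helper);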
      
      With this patch, I no longer hit the above hang.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9e0af237
    • ext4: move i_size,i_disksize update routines to helper function · 4631dbf6
      Dmitry Monakhov committed
      Cc: stable@vger.kernel.org # needed for bug fix patches
      Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      4631dbf6
    • ext4: fix BUG_ON in mb_free_blocks() · c99d1e6e
      Theodore Ts'o committed
      If we suffer a block allocation failure (for example due to a memory
      allocation failure), it's possible that we will call
      ext4_discard_allocated_blocks() before we've actually allocated any
      blocks.  In that case, fe_len and fe_start in ac->ac_f_ex will still
      be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0)
      triggering the BUG_ON in mb_free_blocks():
      
      	BUG_ON(last >= (sb->s_blocksize << 3));
      
      Fix this by bailing out of ext4_discard_allocated_blocks() if fe_len
      is zero.
      
      Also fix a missing ext4_mb_unload_buddy() call in
      ext4_discard_allocated_blocks().
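
      A sketch of both fixes in ext4_discard_allocated_blocks(),
      approximating the patched flow:

          /* Nothing was allocated (fe_len == 0): nothing to discard. */
          if (ac->ac_f_ex.fe_len == 0)
                  return;
          err = ext4_mb_load_buddy(ac->ac_sb, ac->ac_f_ex.fe_group, &e4b);
          if (err) {
                  ext4_error(ac->ac_sb, "mb_load_buddy failed (%d)", err);
                  return;
          }
          ext4_lock_group(ac->ac_sb, ac->ac_f_ex.fe_group);
          mb_free_blocks(ac->ac_inode, &e4b, ac->ac_f_ex.fe_start,
                         ac->ac_f_ex.fe_len);
          ext4_unlock_group(ac->ac_sb, ac->ac_f_ex.fe_group);
          ext4_mb_unload_buddy(&e4b);     /* the call that was missing */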
      
      Google-Bug-Id: 16844242
      
      Fixes: 86f0afd4
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      c99d1e6e
    • ext4: propagate errors up to ext4_find_entry()'s callers · 36de9286
      Theodore Ts'o committed
      If we run into some kind of error, such as ENOMEM, while calling
      ext4_getblk() or ext4_dx_find_entry(), we need to make sure this error
      gets propagated up to ext4_find_entry() and then to its callers.  This
      way, transient errors such as ENOMEM can get propagated to the VFS.
      This is important so that the system calls return the appropriate
      error, and also so that in the case of ext4_lookup(), we return an
      error instead of a NULL inode, since that will result in a negative
      dentry cache entry that will stick around long past the OOM condition
      which caused a transient ENOMEM error.
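
      The calling convention after the fix looks roughly like this (an
      illustrative excerpt in the style of ext4_lookup(), not the full
      diff):

          bh = ext4_find_entry(dir, &dentry->d_name, &de, NULL);
          if (IS_ERR(bh))
                  return (struct dentry *)bh;  /* e.g. ERR_PTR(-ENOMEM) */
          if (!bh)
                  return d_splice_alias(NULL, dentry); /* genuine miss */
          /* ...otherwise read the inode from *de as before... */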
      
      Google-Bug-Id: #17142205
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      36de9286
  7. 23 Aug 2014, 8 commits
  8. 21 Aug 2014, 10 commits
    • Btrfs: fix filemap_flush call in btrfs_file_release · f6dc45c7
      Chris Mason committed
      We should only be flushing on close if the file was flagged as needing
      it during truncate.  I broke this with my ordered data vs transaction
      commit deadlock fix.
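
      A sketch of the intended behaviour in btrfs_release_file(), assuming
      the runtime flag name used elsewhere in btrfs:

          static int btrfs_release_file(struct inode *inode, struct file *filp)
          {
                  if (filp->private_data)
                          btrfs_ioctl_trans_end(filp);
                  /* Flush on close only if truncate flagged this inode. */
                  if (test_and_clear_bit(BTRFS_INODE_ORDERED_DATA_CLOSE,
                                         &BTRFS_I(inode)->runtime_flags))
                          filemap_flush(inode->i_mapping);
                  return 0;
          }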
      
      Thanks to Miao Xie for catching this.
      Signed-off-by: Chris Mason <clm@fb.com>
      Reported-by: Miao Xie <miaox@cn.fujitsu.com>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      f6dc45c7
    • Btrfs: fix crash on endio of reading corrupted block · 38c1c2e4
      Liu Bo committed
      The crash is
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/extent_io.c:2124!
      [...]
      Workqueue: btrfs-endio normal_work_helper [btrfs]
      RIP: 0010:[<ffffffffa02d6055>]  [<ffffffffa02d6055>] end_bio_extent_readpage+0xb45/0xcd0 [btrfs]
      
      This is in fact a regression.
      
      It is because we forgot to increase @offset properly when reading a
      corrupted block, so @offset stays stale. This leads to checksum errors
      while reading the remaining blocks queued up in the same bio, and we
      end up hitting the above BUG_ON.
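
      A sketch of the one-line nature of the fix inside
      end_bio_extent_readpage()'s per-page loop (surrounding code
      abbreviated and approximate):

          if (likely(uptodate))
                  goto readpage_ok;
          /* read-error path for a corrupted block: */
          ret = bio_readpage_error(bio, offset, page, start, end, mirror);
          if (ret == 0) {
                  offset += len;  /* the missing increment: keeps later
                                   * pages' checksum offsets correct */
                  continue;
          }
          readpage_ok:
          /* ...normal path, which already does offset += len... */
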
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      38c1c2e4
    • btrfs: fix leak in qgroup_subtree_accounting() error path · a3c10895
      Eric Sandeen committed
      Coverity pointed this out; in the newly added
      qgroup_subtree_accounting(), if btrfs_find_all_roots()
      returns an error, we leak at least the parents pointer,
      and possibly the roots pointer, depending on what failure
      occurs.
      
      If btrfs_find_all_roots() returns an error, we need to
      free up all allocations before we return.  "roots" is
      initialized to NULL, so it should be safe to free
      it unconditionally (ulist_free() handles that case).
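
      A sketch of the corrected error path (abridged; the real function is
      fs/btrfs/qgroup.c:qgroup_subtree_accounting()):

          parents = ulist_alloc(GFP_NOFS);
          if (!parents)
                  return -ENOMEM;

          ret = btrfs_find_all_roots(NULL, fs_info, bytenr, elem.seq, &roots);
          if (ret < 0)
                  goto out;   /* previously leaked parents (maybe roots) */
          /* ...accounting work... */
          out:
                  ulist_free(parents);
                  ulist_free(roots);  /* roots starts NULL; free(NULL) is ok */
                  return ret;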
      
      Cc: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: Chris Mason <clm@fb.com>
      a3c10895
    • btrfs: Use right extent length when inserting overlap extent map. · 51f395ad
      Qu Wenruo committed
      When current btrfs finds that a new extent map is going to be inserted
      but fails with -EEXIST, it will try again to insert the extent map,
      but with the length of sectorsize.
      This is OK if we don't enable the 'no-holes' feature, since all extent
      space is contiguous and we will not go into the not-found->insert
      routine.
      
      But if we enable the 'no-holes' feature, it will make things go out of
      control, e.g. with a 4K sectorsize, we pass the following args to
      btrfs_get_extent():
      btrfs_get_extent() args: start: 27874 len 4100
      28672		  27874		28672	27874+4100	32768
                          |-----------------------|
      |---------hole--------------------|---------data----------|
      
      1) not found and insert
      Since no extent map contains the range, btrfs_get_extent() will go
      into the not_found and insert routine, which will try to insert the
      extent map (27874, 27874 + 4100).
      
      2) first overlap
      But it overlaps with (28672, 32768) extent, so -EEXIST will be returned
      by add_extent_mapping().
      
      3) retry but still overlap
      After catching the -EEXIST, then btrfs_get_extent() will try insert it
      again but with 4K length, which still overlaps, so -EEXIST will be
      returned.
      
      This makes the following patch fail to punch a hole:
      d7781546 btrfs: Avoid trucating page or punching hole in a already existed hole.
      
      This patch uses the right length to insert, namely (existing->start -
      em->start), making the above patch work in 'no-holes' mode.
      Some small code style problems in the above patch are fixed too.
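
      A sketch of the length calculation on the -EEXIST retry path (names
      follow btrfs_get_extent()/the extent map API, but treat the excerpt
      as approximate and abridged):

          ret = add_extent_mapping(em_tree, em, 0);
          if (ret == -EEXIST) {
                  existing = search_extent_mapping(em_tree, start, len);
                  if (existing && existing->start > start) {
                          /* Retry with the real gap before the overlapping
                           * extent, not a blind sectorsize-sized insert. */
                          em->start = start;
                          em->len = existing->start - start;
                          ret = add_extent_mapping(em_tree, em, 0);
                          free_extent_map(existing);
                  }
          }
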
      Reported-by: Filipe David Manana <fdmanana@gmail.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: Filipe David Manana <fdmanana@suse.com>
      Tested-by: Filipe David Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      51f395ad
    • Btrfs: clone, don't create invalid hole extent map · 62e2390e
      Filipe Manana committed
      When cloning a file that consists of an inline extent, we were creating
      an extent map that represents a non-existing trailing hole, starting at
      a file offset that isn't a multiple of the sector size. This happened
      because, when processing an inline extent, we weren't aligning the
      extent's length to the sector size, and therefore incorrectly treated
      the range [inline_extent_length, sector_size) as a hole.
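
      The essence of the fix is a single alignment, roughly (excerpted in
      spirit from the inline-extent branch of btrfs_clone()):

          /* Round the inline extent's end up to the sector size; only
           * bytes past aligned_end can legitimately be a hole. */
          u64 aligned_end = ALIGN(new_key.offset + datal, root->sectorsize);

          btrfs_drop_extent_cache(inode, drop_start, aligned_end - 1, 0);
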
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      62e2390e
    • Btrfs: don't monopolize a core when evicting inode · 7064dd5c
      Filipe Manana committed
      If an inode has a very large number of extent maps, we can spend
      a lot of time freeing them, which triggers a soft lockup warning.
      Therefore reschedule if we need to when freeing the extent maps
      while evicting the inode.
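
      A sketch of the eviction loop with the added rescheduling point
      (close in spirit to the patched btrfs_evict_inode() path):

          write_lock(&map_tree->lock);
          while (!RB_EMPTY_ROOT(&map_tree->map)) {
                  struct rb_node *node = rb_first(&map_tree->map);
                  struct extent_map *em = rb_entry(node, struct extent_map,
                                                   rb_node);

                  remove_extent_mapping(map_tree, em);
                  free_extent_map(em);
                  if (need_resched()) {   /* don't monopolize the CPU */
                          write_unlock(&map_tree->lock);
                          cond_resched();
                          write_lock(&map_tree->lock);
                  }
          }
          write_unlock(&map_tree->lock);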
      
      I could trigger this all the time by running xfstests/generic/299 on
      a file system with the no-holes feature enabled. That test creates
      an inode with 11386677 extent maps.
      
          $ mkfs.btrfs -f -O no-holes $TEST_DEV
          $ MKFS_OPTIONS="-O no-holes" ./check generic/299
          generic/299 382s ...
          Message from syslogd@debian-vm3 at Aug  7 10:44:29 ...
           kernel:[85304.208017] BUG: soft lockup - CPU#0 stuck for 22s! [umount:25330]
           384s
          Ran: generic/299
          Passed all 1 tests
      
          $ dmesg
          (...)
          [86304.300017] BUG: soft lockup - CPU#0 stuck for 23s! [umount:25330]
          (...)
          [86304.300036] Call Trace:
          [86304.300036]  [<ffffffff81698ba9>] __slab_free+0x54/0x295
          [86304.300036]  [<ffffffffa02ee9cc>] ? free_extent_map+0x5c/0xb0 [btrfs]
          [86304.300036]  [<ffffffff811a6cd2>] kmem_cache_free+0x282/0x2a0
          [86304.300036]  [<ffffffffa02ee9cc>] free_extent_map+0x5c/0xb0 [btrfs]
          [86304.300036]  [<ffffffffa02e3775>] btrfs_evict_inode+0xd5/0x660 [btrfs]
          [86304.300036]  [<ffffffff811e7c8d>] ? __inode_wait_for_writeback+0x6d/0xc0
          [86304.300036]  [<ffffffff816a389b>] ? _raw_spin_unlock+0x2b/0x40
          [86304.300036]  [<ffffffff811d8cbb>] evict+0xab/0x180
          [86304.300036]  [<ffffffff811d8dce>] dispose_list+0x3e/0x60
          [86304.300036]  [<ffffffff811d9b04>] evict_inodes+0xf4/0x110
          [86304.300036]  [<ffffffff811bd953>] generic_shutdown_super+0x53/0x110
          [86304.300036]  [<ffffffff811bdaa6>] kill_anon_super+0x16/0x30
          [86304.300036]  [<ffffffffa02a78ba>] btrfs_kill_super+0x1a/0xa0 [btrfs]
          [86304.300036]  [<ffffffff811bd3a9>] deactivate_locked_super+0x59/0x80
          [86304.300036]  [<ffffffff811be44e>] deactivate_super+0x4e/0x70
          [86304.300036]  [<ffffffff811dec14>] mntput_no_expire+0x174/0x1f0
          [86304.300036]  [<ffffffff811deab7>] ? mntput_no_expire+0x17/0x1f0
          [86304.300036]  [<ffffffff811e0517>] SyS_umount+0x97/0x100
          (...)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      7064dd5c
    • Btrfs: fix hole detection during file fsync · 74121f7c
      Filipe Manana committed
      The file hole detection logic during a file fsync wasn't correct:
      it didn't look back (in a previous leaf) for the last file extent
      item that can live in a leaf to the left of our leaf and that has a
      generation lower than the current transaction id. This made it
      assume that a hole exists when it really doesn't exist in the file.
      
      Such false positive hole detection happens in the following scenario:
      
      * We have a file that has many file extent items, covering 3 or more
        btree leafs (the first leaf must contain non file extent items too).
      
      * Two ranges of the file are modified, with their extent items being
        located at 2 different leafs and those leafs aren't consecutive.
      
      * When processing the second modified leaf, we weren't checking if
        some file extent item exists that is located in some leaf that is
        between our 2 modified leafs, and therefore assumed the range defined
        between the last file extent item in the first leaf and the first file
        extent item in the second leaf matched a hole.
      
      Fortunately this didn't result in overwriting the log with wrong data;
      instead it made the last loop in copy_items() attempt to insert a
      duplicated key (for a hole file extent item), which makes the file
      fsync code return -EEXIST to file.c:btrfs_sync_file(), which in
      turn ends up doing a full transaction commit. That is much more
      expensive than writing only to the log tree and waiting for it to be
      durably persisted (as well as the file's modified extents/pages).
      Therefore fix the hole detection logic, so that we don't pay the
      cost of doing full transaction commits.
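
      As a sketch, the added check looks back one leaf before concluding a
      hole exists (approximate; the helper name below is hypothetical and
      the real logic lives in the fsync/tree-log code):

          /* Before logging [prev_extent_end, first_key.offset) as a hole,
           * make sure no file extent item in the previous leaf covers it. */
          ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
          if (ret > 0 && path->slots[0] == 0) {
                  ret = btrfs_prev_leaf(root, path);  /* step left */
                  if (ret == 0)
                          check_last_extent_item(path);  /* hypothetical */
          }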
      
      I could trigger this issue with the following test for xfstests (which
      never fails, either without or with this patch). The last fsync call
      results in a full transaction commit, due to the -EEXIST error mentioned
      above. I could also observe this behaviour happening frequently when
      running xfstests/generic/075 in a loop.
      
      Test:
      
          _cleanup()
          {
              _cleanup_flakey
              rm -fr $tmp
          }
      
          # get standard environment, filters and checks
          . ./common/rc
          . ./common/filter
          . ./common/dmflakey
      
          # real QA test starts here
          _supported_fs btrfs
          _supported_os Linux
          _require_scratch
          _require_dm_flakey
          _need_to_be_root
      
          rm -f $seqres.full
      
          # Create a file with many file extent items, each representing a 4Kb extent.
          # These items span 3 btree leaves, of 16Kb each (default mkfs.btrfs leaf size
          # as of btrfs-progs 3.12).
          _scratch_mkfs -l 16384 >/dev/null 2>&1
          _init_flakey
          SAVE_MOUNT_OPTIONS="$MOUNT_OPTIONS"
          MOUNT_OPTIONS="$MOUNT_OPTIONS -o commit=999"
          _mount_flakey
      
          # First fsync, inode has BTRFS_INODE_NEEDS_FULL_SYNC flag set.
          $XFS_IO_PROG -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" \
                  $SCRATCH_MNT/foo | _filter_xfs_io
      
          # For any of the following fsync calls, inode doesn't have the flag
          # BTRFS_INODE_NEEDS_FULL_SYNC set.
          for ((i = 1; i <= 500; i++)); do
              OFFSET=$((4096 * i))
              LEN=4096
              $XFS_IO_PROG -c "pwrite -S 0x01 $OFFSET $LEN" -c "fsync" \
                      $SCRATCH_MNT/foo | _filter_xfs_io
          done
      
          # Commit transaction and bump next transaction's id (to 7).
          sync
      
          # Truncate will set the BTRFS_INODE_NEEDS_FULL_SYNC flag in the btrfs's
          # inode runtime flags.
          $XFS_IO_PROG -c "truncate 2048000" $SCRATCH_MNT/foo
      
          # Commit transaction and bump next transaction's id (to 8).
          sync
      
          # Touch 1 extent item from the first leaf and 1 from the last leaf. The leaf
          # in the middle, containing only file extent items, isn't touched. So the
          # next fsync, when calling btrfs_search_forward(), won't visit that middle
          # leaf. First and 3rd leaf have now a generation with value 8, while the
          # middle leaf remains with a generation with value 6.
          $XFS_IO_PROG \
              -c "pwrite -S 0xee -b 4096 0 4096" \
              -c "pwrite -S 0xff -b 4096 2043904 4096" \
              -c "fsync" \
              $SCRATCH_MNT/foo | _filter_xfs_io
      
          _load_flakey_table $FLAKEY_DROP_WRITES
          md5sum $SCRATCH_MNT/foo | _filter_scratch
          _unmount_flakey
      
          _load_flakey_table $FLAKEY_ALLOW_WRITES
          # During mount, we'll replay the log created by the fsync above, and the file's
          # md5 digest should be the same we got before the unmount.
          _mount_flakey
          md5sum $SCRATCH_MNT/foo | _filter_scratch
          _unmount_flakey
          MOUNT_OPTIONS="$SAVE_MOUNT_OPTIONS"
      
          status=0
          exit
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      74121f7c
    • Btrfs: ensure tmpfile inode is always persisted with link count of 0 · 5762b5c9
      Filipe Manana committed
      If we open a file with O_TMPFILE, don't do any further operation on
      it (so that the inode item isn't updated) and then force a transaction
      commit, we get a persisted inode item with a link count of 1, and not 0
      as it should be.
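
      A sketch of the resulting flow in btrfs_tmpfile() (abridged; error
      handling omitted):

          /* Create the inode with a link count of 0, so 0 is what any
           * transaction commit persists for an unused O_TMPFILE inode. */
          inode = btrfs_new_inode(trans, root, dir, NULL, 0,
                                  btrfs_ino(dir), objectid, mode, &index);

          ret = btrfs_orphan_add(trans, inode);   /* clean up on crash */

          /* d_tmpfile() does inode_dec_link_count(); bump to 1 in memory
           * so the drop brings us back to 0 instead of underflowing. */
          set_nlink(inode, 1);
          d_tmpfile(dentry, inode);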
      
      Steps to reproduce it (requires a modern xfs_io with -T support):
      
          $ mkfs.btrfs -f /dev/sdd
          $ mount /dev/sdd /mnt
          $ xfs_io -T /mnt &
          $ sync
      
      Then btrfs-debug-tree shows the inode item with a link count of 1:
      
          $ btrfs-debug-tree /dev/sdd
          (...)
          fs tree key (FS_TREE ROOT_ITEM 0)
          leaf 29556736 items 4 free space 15851 generation 6 owner 5
          fs uuid f164d01b-1b92-481d-a4e4-435fb0f843d0
          chunk uuid 0e3d0e56-bcca-4a1c-aa5f-cec2c6f4f7a6
          	item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
      		inode generation 3 transid 6 size 0 block group 0 mode 40755 links 1
          	item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
          		inode ref index 0 namelen 2 name: ..
          	item 2 key (257 INODE_ITEM 0) itemoff 15951 itemsize 160
          		inode generation 6 transid 6 size 0 block group 0 mode 100600 links 1
          	item 3 key (ORPHAN ORPHAN_ITEM 257) itemoff 15951 itemsize 0
      		orphan item
          checksum tree key (CSUM_TREE ROOT_ITEM 0)
          (...)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      5762b5c9
    • Btrfs: race free update of commit root for ro snapshots · 9c3b306e
      Filipe Manana committed
      This is a better solution for the problem addressed in the following
      commit:
      
          Btrfs: update commit root on snapshot creation after orphan cleanup
          (3821f348)
      
      The previous solution wasn't the best because of 2 reasons:
      
          1) It added another full transaction commit, which is more expensive
             than just swapping the commit root with the root;
      
          2) If a reboot happened after the first transaction commit (the one
             that creates the snapshot) and before the second transaction commit,
             then we would end up with the same problem if a send using that
             snapshot was requested before the first transaction commit after
             the reboot.
      
      This change addresses those 2 issues. The second issue is addressed by
      switching the commit root in the dentry lookup VFS callback, which is
      also called by the snapshot/subvol creation ioctl and performs orphan
      cleanup if needed. Like the vfs, the ioctl locks the parent inode too,
      preventing race issues between a dentry lookup and snapshot creation.
      
      Cc: Alex Lyakas <alex.btrfs@zadarastorage.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9c3b306e
    • Btrfs: fix regression of btrfs device replace · 87fa3bb0
      Liu Bo committed
      Commit 49c6f736 ("btrfs: dev replace should replace the sysfs entry")
      added the missing sysfs entry in the process of device replace, but
      didn't take missing devices into account, so now we have
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
      IP: [<ffffffffa0268551>] btrfs_kobj_rm_device+0x21/0x40 [btrfs]
      ...
      
      To reproduce it,
      1. mkfs.btrfs -f disk1 disk2
      2. mkfs.ext4 disk1
      3. mount disk2 /mnt -odegraded
      4. btrfs replace start -B 1 disk3 /mnt
      
      This fixes the problem.
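
      The shape of the fix is a NULL check for devices without a backing
      bdev (a sketch of btrfs_kobj_rm_device(), not the verbatim diff):

          /* A missing device has no bdev, hence no sysfs link to remove. */
          if (one_device && one_device->bdev) {
                  disk_kobj = &part_to_dev(one_device->bdev->bd_part)->kobj;
                  sysfs_remove_link(fs_info->device_dir_kobj,
                                    disk_kobj->name);
          }
          return 0;
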
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      87fa3bb0
  9. 20 Aug 2014, 3 commits
  10. 19 Aug 2014, 1 commit
    • Btrfs: don't consider the missing device when allocating new chunks · 95669976
      Miao Xie committed
      The original code allocated new chunks by the number of writable
      devices plus missing devices, to make sure that any RAID level on a
      degraded FS continues to be honored, but this introduced a problem:
      it stopped us from allocating new chunks. The steps to reproduce are
      as follows:
      
       # mkfs.btrfs -m raid1 -d raid1 -f <dev0> <dev1>
       # mkfs.btrfs -f <dev1>			//Removing <dev1> from the original fs
       # mount -o degraded <dev0> <mnt>
       # dd if=/dev/zero of=<mnt>/tmpfile bs=1M
      
      It is because we allocate new chunks only on the writable devices; if
      we take the number of missing devices into account, and want to
      allocate new chunks with a higher RAID level, we will fail because we
      don't have enough writable devices. Fix it by ignoring the number of
      missing devices when allocating new chunks.
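
      In sketch form, the change is to size the device check by writable
      devices only (illustrative; the real diff is in fs/btrfs/volumes.c):

          /* Was: rw_devices + missing_devices, which over-counted on a
           * degraded mount and made higher-RAID chunk allocation fail. */
          num_devices = fs_devices->rw_devices;
          if (num_devices < devs_min)
                  return -ENOSPC;
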
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      95669976