1. 09 Aug, 2015 34 commits
    • Btrfs: add support for blkio controllers · da2f0f74
      Chris Mason committed
      This attaches accounting information to bios as we submit them so the
      new blkio controllers can throttle on btrfs filesystems.
      
      Not much is required, we're just associating bios with blkcgs during clone,
      calling wbc_init_bio()/wbc_account_io() during writepages submission,
      and attaching the bios to the current context during direct IO.
      
      Finally if we are splitting bios during btrfs_map_bio, this attaches
      accounting information to the split.
      
      The end result is able to throttle nicely on single disk filesystems.  A
      little more work is required for multi-device filesystems.
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: remove unused mutex from struct 'btrfs_fs_info' · a4027a20
      Byongho Lee committed
      The code using the 'ordered_extent_flush_mutex' mutex was removed by the
      commit below:
       - 8d875f95
         btrfs: disable strict file flushes for renames and truncates
      But the mutex still lives in struct 'btrfs_fs_info'.
      
      So, this patch removes the mutex from struct 'btrfs_fs_info' and its
      initialization code.
      Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix parity scrub of RAID 5/6 with missing device · 4a770891
      Omar Sandoval committed
      When testing the previous patch, Zhao Lei reported a similar bug when
      attempting to scrub a degraded RAID 5/6 filesystem with a missing
      device, leading to NULL pointer dereferences from the RAID 5/6 parity
      scrubbing code.
      
      The first cause was the same as in the previous patch: attempting to
      call bio_add_page() on a missing block device. To fix this,
      scrub_extent_for_parity() can just mark the sectors on the missing
      device as errors instead of attempting to read from it.
      
      Additionally, the code uses scrub_remap_extent() to map the extent of
      the corresponding data stripe, but the extent wasn't already mapped. If
      scrub_remap_extent() finds a missing block device, it doesn't initialize
      extent_dev, so we're left with a NULL struct btrfs_device. The solution
      is to use btrfs_map_block() directly.
      Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix device replace of a missing RAID 5/6 device · 73ff61db
      Omar Sandoval committed
      The original implementation of device replace on RAID 5/6 seems to have
      missed support for replacing a missing device. When this is attempted,
      we end up calling bio_add_page() on a bio with a NULL ->bi_bdev, which
      crashes when we try to dereference it. This happens because
      btrfs_map_block() has no choice but to return us the missing device
      because RAID 5/6 don't have any alternate mirrors to read from, and a
      missing device has a NULL bdev.
      
      The idea implemented here is to handle the missing device case
      separately, which better only happen when we're replacing a missing RAID
      5/6 device. We use the new BTRFS_RBIO_REBUILD_MISSING operation to
      reconstruct the data from parity, check it with
      scrub_recheck_block_checksum(), and write it out with
      scrub_write_block_to_dev_replace().
      Reported-by: Philip <bugzilla@philip-seeger.de>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=96141
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
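
The commit above reconstructs the missing device's data from parity. As a rough user-space illustration of the underlying idea, for the RAID 5 case only (RAID 6 uses a second syndrome that this sketch does not cover), a missing stripe equals the XOR of all surviving stripes, parity included. The function name and buffer layout are assumptions made for this example and are not the kernel's rbio code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Rebuild one missing RAID 5 stripe by XOR-ing every surviving stripe
 * (data and parity). "stripes" holds nr_stripes buffers of "len" bytes;
 * the entry at index "missing" is never read. The result goes to "out".
 */
static void raid5_rebuild_missing(const uint8_t *const *stripes,
                                  size_t nr_stripes, size_t missing,
                                  size_t len, uint8_t *out)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t acc = 0;

        for (size_t s = 0; s < nr_stripes; s++) {
            if (s != missing)
                acc ^= stripes[s][i];
        }
        out[i] = acc;
    }
}
```

In the real code the rebuilt data is additionally verified against its checksum before being written to the replace target, as the commit message describes.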
    • Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation · b4ee1782
      Omar Sandoval committed
      The current RAID 5/6 recovery code isn't quite prepared to handle
      missing devices. In particular, it expects a bio that we previously
      attempted to use in the read path, meaning that it has valid pages
      allocated. However, missing devices have a NULL blkdev, and we can't
      call bio_add_page() on a bio with a NULL blkdev. We could do manual
      manipulation of bio->bi_io_vec, but that's pretty gross. So instead, add
      a separate path that allows us to manually add pages to the rbio.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: count devices correctly in readahead during RAID 5/6 replace · 7cb2c420
      Omar Sandoval committed
      Commit 5fbc7c59 ("Btrfs: fix unfinished readahead thread for raid5/6
      degraded mounting") fixed a problem where we would skip a missing device
      when we shouldn't have because there are no other mirrors to read from
      in RAID 5/6. After commit 2c8cdd6e ("Btrfs, replace: write dirty
      pages into the replace target device"), the fix doesn't work when we're
      doing a missing device replace on RAID 5/6 because the replace device is
      counted as a mirror so we're tricked into thinking we can safely skip
      the missing device. The fix is to count only the real stripes and decide
      based on that.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: remove misleading handling of missing device scrub · 03679ade
      Omar Sandoval committed
      scrub_submit() claims that it can handle a bio with a NULL block device,
      but this is misleading, as calling bio_add_page() on a bio with a NULL
      ->bi_bdev would've already crashed. Delete this, as we're about to
      properly handle a missing block device.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: fix clone / extent-same deadlocks · 293a8489
      Mark Fasheh committed
      Clone and extent same lock their source and target inodes in opposite order.
      In addition to this, the range locking in clone doesn't take ordering into
      account. Fix this by having clone use the same locking helpers as
      btrfs-extent-same.
      
      In addition, I do a small cleanup of the locking helpers, removing a case
      (both inodes being the same) which was poorly accounted for and never
      actually used by the callers.
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
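
The deadlock described above is the classic opposite-order locking problem: one path locks source then target, the other target then source. A minimal sketch of the usual remedy is to sort the pair by a fixed global criterion before locking. Ordering by address and the `toy_` names are assumptions for illustration; the commit message does not say which criterion the btrfs helpers use.

```c
#include <stdint.h>

/* Hypothetical stand-in for an inode; only its identity matters here. */
struct toy_inode {
    unsigned long ino;
};

/*
 * Decide the order in which two inodes must be locked. Whichever way
 * the caller passes them, *first is always the lower address, so two
 * tasks working on the same pair can never lock in opposite order.
 */
static void toy_lock_order(struct toy_inode *a, struct toy_inode *b,
                           struct toy_inode **first,
                           struct toy_inode **second)
{
    if ((uintptr_t)a <= (uintptr_t)b) {
        *first = a;
        *second = b;
    } else {
        *first = b;
        *second = a;
    }
}
```

A caller would then take the lock of `*first` followed by the lock of `*second`, and release them in any order.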
    • Btrfs: fix defrag to merge tail file extent · 4a3560c4
      Liu Bo committed
      The file layout is
      
      [extent 1]...[extent n][4k extent][HOLE][extent x]
      
      Extents 1~n and the 4k extent can be merged during defrag, and the total
      number of bytes to defrag is larger than our defrag threshold (256k).
      However, the 4k extent at the tail is left unmerged, because we only
      check whether its next extent can be merged (the next one is a hole, so
      the check fails). The resulting layout can thus be
      
      [new extent][4k extent][HOLE][extent x]
       (1~n)
      
      To fix it, besides looking at the next extent, this also looks at the
      previous one by checking @defrag_end, which is set to 0 when we
      decide to stop merging contiguous extents; otherwise, we can merge
      the previous one with our extent.
      
      Also, this makes btrfs behave consistently with xfs and ext4.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix warning in backref walking · acdf898d
      Liu Bo committed
      When we do backref walking, we first search the queued delayed refs
      and then the on-disk backrefs, but we parse shared references
      differently: for delayed refs we also add 'ref->root', while for on-disk
      backrefs we don't. This can prevent us from merging refs indexed
      by the same bytenr and cause find_parent_nodes() to throw a warning at
      'WARN_ON(ref->count < 0)'. This happens, for example, when we have a
      shared data extent with 'ref_cnt=1' and a delayed shared data ref with
      BTRFS_DROP_DELAYED_REF.
      
      For shared references, whether delayed or on-disk, ref->root is not
      used at all; it is ref->parent that really matters. So this patch
      handles delayed refs the same way as on-disk refs.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
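
To see why the extra 'ref->root' matters, here is a toy model of ref merging. It is only a sketch: the struct, the `ignore_root` switch, and the quadratic merge loop are all assumptions for illustration and do not mirror the kernel's data structures. An on-disk shared ref (+1, no root recorded) and a delayed drop (-1, root recorded) only cancel out if they are matched on the parent alone.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified backref entry for a shared reference. */
struct toy_ref {
    uint64_t root;    /* not meaningful for shared refs */
    uint64_t parent;  /* the field that identifies a shared ref */
    int count;        /* +1 for an add, -1 for a drop */
};

/*
 * Fold refs with an equal key into the first matching entry. With
 * ignore_root set, refs are matched on parent alone. Returns how many
 * entries are left with a negative count (the WARN_ON condition).
 */
static size_t toy_merge_refs(struct toy_ref *refs, size_t n, int ignore_root)
{
    size_t negative = 0;

    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (refs[i].parent != refs[j].parent)
                continue;
            if (!ignore_root && refs[i].root != refs[j].root)
                continue;
            refs[i].count += refs[j].count;
            refs[j].count = 0;
        }
    }
    for (size_t i = 0; i < n; i++) {
        if (refs[i].count < 0)
            negative++;
    }
    return negative;
}
```

With the root taken into account, the two entries stay separate and the drop keeps its count of -1; matched on parent only, they merge to 0.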
    • btrfs: Add WARN_ON() for double lock in btrfs_tree_lock() · 166f66d0
      Zhaolei committed
      When a task tries to double lock an extent buffer, there is no
      lockdep warning about it because the lock may be in the "blocking_lock"
      state, which makes this hard to debug.
      
      This patch adds a WARN_ON() for the above condition. It cannot report
      all deadlock cases (such as locks between tasks), but it at least helps
      with some.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
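
The check described above amounts to remembering who holds the lock and complaining when the same task asks again. A minimal user-space sketch follows; the struct, the task id parameter, and the return codes are hypothetical, and a real lock would block instead of returning a "would block" code.

```c
#include <stdio.h>

/* Hypothetical user-space stand-in for an extent buffer lock. */
struct toy_eb {
    int locked;   /* non-zero while some task holds the lock */
    long owner;   /* id of the holding task, valid while locked */
};

enum toy_lock_result {
    TOY_LOCKED = 0,       /* lock was free and is now held by self */
    TOY_DOUBLE_LOCK = 1,  /* self already holds it: would self-deadlock */
    TOY_WOULD_BLOCK = 2   /* another task holds it */
};

static enum toy_lock_result toy_tree_lock(struct toy_eb *eb, long self)
{
    if (eb->locked && eb->owner == self) {
        /* Plays the role of the WARN_ON() in the commit. */
        fprintf(stderr, "WARNING: task %ld already holds this lock\n", self);
        return TOY_DOUBLE_LOCK;
    }
    if (eb->locked)
        return TOY_WOULD_BLOCK;
    eb->locked = 1;
    eb->owner = self;
    return TOY_LOCKED;
}
```

As the commit notes, this only catches a task deadlocking against itself; a cycle between two different tasks looks like ordinary contention to this check.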
    • btrfs: Remove root argument in extent_data_ref_count() · 9ed0dea0
      Zhaolei committed
      Because it is never used.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Fix wrong comment of btrfs_alloc_tree_block() · d0220751
      Zhaolei committed
      These wrong comments were copied from another (now obsolete) function
      when the code was first written; this patch fixes them.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: abort transaction on btrfs_reloc_cow_block() · 93314e3b
      Zhaolei committed
      When btrfs_reloc_cow_block() fails in __btrfs_cow_block(), the current
      code just returns an error value to the caller, but leaves the newly
      created extent buffer in place and locked.
      
      Subsequent code (in relocation) then tries to lock that eb again,
      causing a deadlock without any dmesg output.
      (The eb lock uses wait_event(), so there is no lockdep message.)
      
      It is hard to recover in __btrfs_cow_block() at this error
      point, but we can abort the transaction to avoid the deadlock and to
      avoid operating on an unstable state.
      
      It also helps developers find the failing place quickly
      (better than a frozen fs without any dmesg, as before the patch).
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Remove unnecessary variants in relocation.c · 147d256e
      Zhaolei committed
      These arguments are not used in the functions; remove them as a cleanup
      and to reduce kernel stack usage.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Cleanup: Remove chunk_objectid argument from btrfs_relocate_chunk() · dc2ee4e2
      Zhaolei committed
      Remove the chunk_objectid argument from btrfs_relocate_chunk() because
      it is not necessary. This also lets the callers drop the code that
      prepared its value.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Cleanup: Remove objectid's init-value in create_reloc_inode() · 4624900d
      Zhaolei committed
      objectid's initial value is never used, so remove it.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Error handle for get_ref_objectid_v0() in relocate_block_group() · 4b3576e4
      Zhaolei committed
      We need error checking for get_ref_objectid_v0() in
      relocate_block_group() to avoid unpredictable results, especially
      from accessing an uninitialized value (when the function fails) after
      this line.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Fix data checksum error cause by replace with io-load. · 55e3a601
      Zhaolei committed
      xfstests btrfs/070 sometimes fails.
      On my test machine, its failure rate is about 30%.
      In another VM (vmware), its failure rate is about 50%.
      
      Reason:
        btrfs/070 runs replace and defrag simultaneously with fsstress;
        after these operations, scrub finds checksum errors.
      
        Actually, this has no relationship with the defrag operation;
        replace with fsstress alone can trigger this bug.
      
        New data written to the target device can be overwritten with
        old data from the source device by the replace code (as observed
        while debugging). To avoid this problem, we can set the target
        block group to readonly for the duration of the replace, so new
        data requested by other operations will not be written to the same
        place as the replace code.
      
        Before patch (4.1-rc3):
          30% failed in 100 xfstests runs.
        After patch:
          0% failed in 300 xfstests runs.
      
      It also happened in btrfs/071, as it is another scrub-with-IO-load test.
      Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: use scrub_pause_on/off() to reduce code in scrub_enumerate_chunks() · b708ce96
      Zhaolei committed
      Using the newly introduced scrub_pause_on/off() makes this code block
      cleaner and more readable.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Separate scrub_blocked_if_needed() to scrub_pause_on/off() · 0e22be89
      Zhaolei committed
      This reduces the current duplicated code, which is similar to
      scrub_blocked_if_needed() but cannot call it because of small
      differences.
      It is also used by my next patch, which is in the same situation.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Use ref_cnt for set_block_group_ro() · 868f401a
      Zhaolei committed
      More than one code path calls set_block_group_ro() and restores rw on
      failure.
      
      The old code uses a bool bit to save the block group's ro state, which
      cannot support the parallel case (confirmed to exist in my debug log).
      
      This patch uses a ref count to store the ro state, and renames
      set_block_group_ro/set_block_group_rw
      to
      inc_block_group_ro/dec_block_group_ro.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
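
The problem with a single bool is that the first user to restore rw also undoes the read-only state for every other user still relying on it. A counter avoids that. The following is a minimal sketch with hypothetical `toy_` names and no locking, not the kernel's inc_block_group_ro()/dec_block_group_ro() implementation.

```c
#include <assert.h>

/* Hypothetical block group carrying only the read-only counter. */
struct toy_block_group {
    unsigned int ro;  /* number of users that currently need it read-only */
};

static void toy_inc_ro(struct toy_block_group *bg)
{
    bg->ro++;
}

static void toy_dec_ro(struct toy_block_group *bg)
{
    assert(bg->ro > 0);  /* every decrement must pair with an increment */
    bg->ro--;
}

/* The block group is writable again only once the last user lets go. */
static int toy_is_ro(const struct toy_block_group *bg)
{
    return bg->ro > 0;
}
```

With two overlapping users, the block group stays read-only after the first `toy_dec_ro()` and becomes writable only after the second.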
    • btrfs: Bypass unrelated items before accessing its contents in scrub · d7cad238
      Zhao Lei committed
      When we access extent_root in scrub_stripe() and
      scrub_raid56_parity(), we need to skip unrelated tree items first,
      before using their contents in other conditions.
      
      This is not a bug fix; it only puts the code in a logical order.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Load only necessary csums into list in scrub · fe8cf654
      Zhao Lei committed
      We need not load the csums of the whole stripe in scrub because the
      stripe is trimmed before use; that is to say, what we really need
      csums for is the data in [extent_logical, extent_logical + extent_len).
      
      This patch changes scrub_stripe() to use the above segment for
      btrfs_lookup_csums_range().
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Fix calculate typo caused by ambiguous meaning of logic_end · a0dd59de
      Zhao Lei committed
      For example, in scrub_raid56_parity(), the following lines are used
      to judge whether all data has been processed:
       place1: if (key.objectid > logic_end) ...
       place2: if (logic_start >= logic_end) ...
       ...
       (place2 is a typo, it should be ">"; it was copied from another
        place, where logic_end's meaning is different, long story...)
      
      We could fix the above typo directly, but the root cause is the
      ambiguous meaning of logic_end in scrub raid56 parity.
      
      In other places, XXX_end points to data which is not included,
      and we need to process the segment [XXX_start, XXX_end).
      
      But for scrub raid56 parity, logic_end points to the last data
      that needs processing, which introduced many "+ 1" and "- 1" in the
      code, as below:
       length = sparity->logic_end - sparity->logic_start + 1
       logic_end - logic_start + 1
       stripe_logical + increment - 1
      
      This patch changes logic_end's meaning to follow the usual convention
      in the raid56 parity functions and data structures, along with the
      above bugfix.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
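
The half-open convention [start, end) that the commit adopts removes the "+ 1" and "- 1" adjustments and makes ">=" the correct termination test. A small sketch under that convention is shown below; the helper names are made up for this example.

```c
#include <stdint.h>

/* Length of the half-open range [start, end): no "+ 1" needed. */
static uint64_t range_len(uint64_t start, uint64_t end)
{
    return end - start;
}

/* With an exclusive end, the range is exhausted once cursor >= end. */
static int range_done(uint64_t cursor, uint64_t end)
{
    return cursor >= end;
}

/* Count how many steps of size "step" (must be > 0) cover [start, end). */
static unsigned int count_steps(uint64_t start, uint64_t end, uint64_t step)
{
    unsigned int n = 0;

    for (uint64_t cur = start; !range_done(cur, end); cur += step)
        n++;
    return n;
}
```

Under the old inclusive meaning, the same termination test would have to be "cursor > end", which is exactly the mix-up the commit describes at place2.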
    • btrfs: Free checksum list on scrub_extent() fail · 6fa96d72
      Zhao Lei committed
      When scrub_extent() fails, we need to free the previously created
      checksum list.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Check cancel and pause in interval of scrub operation · f2f66a2f
      Zhao Lei committed
      The old code checks for cancel and pause requests inside the scrub
      stripe operation, like:
        loop() {
          if (parity) {
            scrub_parity_stripe();
            continue;
          }
      
          check_cancel_and_pause()
      
          scrub_normal_stripe();
        }
      
      The reason is that when raid56 stripe scrub was introduced, the new
      code was simply inserted at the front of the loop.
      
      It would be better as:
        loop() {
          check_cancel_and_pause()
      
          if (parity)
            scrub_parity_stripe();
          else
            scrub_normal_stripe();
        }
      
      This patch adjusts the code placement to implement the above sequence.
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
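
The effect of the two loop shapes above can be simulated in user space. In this sketch (all names are hypothetical), a cancel request becomes pending at iteration `cancel_at`, and the function reports how many stripes were still scrubbed before the request was noticed. With the old ordering, a run of parity stripes skips the check entirely.

```c
#include <stddef.h>

/*
 * Simulate scrubbing n stripes; is_parity[i] != 0 marks a parity stripe.
 * A cancel request is pending from iteration cancel_at onwards.
 * check_first selects the new ordering (check before any stripe work);
 * otherwise the check only runs on the non-parity path, as in the old code.
 * Returns the number of stripes scrubbed before the loop stopped.
 */
static size_t toy_scrub(const int *is_parity, size_t n,
                        size_t cancel_at, int check_first)
{
    size_t done = 0;

    for (size_t i = 0; i < n; i++) {
        int cancel = i >= cancel_at;

        if (check_first && cancel)
            break;
        if (is_parity[i]) {
            done++;      /* parity stripe scrubbed */
            continue;    /* old code: skips the check below */
        }
        if (!check_first && cancel)
            break;
        done++;          /* normal stripe scrubbed */
    }
    return done;
}
```

For three parity stripes followed by one normal stripe, with a cancel arriving at the second iteration, the old ordering still scrubs all three parity stripes, while the new ordering stops after the first.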
    • btrfs: Show detail information when mount failed on missing devices · 78fa1770
      Zhao Lei committed
      When mount fails because of missing devices, we can see the following
      dmesg:
       [ 1060.267743] BTRFS: too many missing devices, writeable mount is not allowed
       [ 1060.273158] BTRFS: open_ctree failed
      
      This patch adds missing_device_number and tolerated_missing_device_number
      to the above output, to let the user know what really happened, and to
      help with bug reporting and debugging.
      
      dmesg after patch:
       [  127.050367] BTRFS: missing devices(1) exceeds the limit(0), writeable mount is not allowed
       [  127.056099] BTRFS: open_ctree failed
      
      Changelog v1->v2:
      1: Changed to more clear description, suggested-by:
         Anand Jain <anand.jain@oracle.com>
      Suggested-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: Fix scrub panic when leaf crosses stripes · a323e813
      Zhao Lei committed
      Scrub panics in the following operation:
        mkfs.ext4 /dev/vdh
        btrfs-convert /dev/vdh
        mount /dev/vdh /mnt/tmp1
        btrfs scrub start -B /dev/vdh
        (panic)
      
      Reason:
        1: In some cases, a leaf created by btrfs-convert is split across 2
           stripes.
        2: Scrub bypassed part of the above wrong leaf data, but the remaining
           data caused a panic in scrub_checksum_tree_block().
      
      For reason 1:
        we can get the following information after some simple operations.
        a. mkfs.ext4 /dev/vdh
           btrfs-convert /dev/vdh
        b. btrfs-debug-tree /dev/vdh
           we can see the following item in the extent tree:
           item 25 key (27054080 METADATA_ITEM 0) itemoff 15083 itemsize 33
           Its logical address is [27054080, 27070464)
           and it crosses 2 stripes:
           [27000832, 27066368)
           [27066368, 27131904)
        This will be fixed in btrfs-progs (btrfs-convert, btrfsck, ...)
      
      For reason 2:
        Scrub tries to do a "bypass" in this case, but the result is a
        "panic", because the current code lacks some conditions in the bypass
        and lets some wrong leaf data escape.
      
      This patch fixes the above scrub code.
      
      Before patch:
        # btrfs scrub start -B /dev/vdh
        (panic)
      
      After patch:
        # btrfs scrub start -B /dev/vdh
        scrub done for 353cec8f-da31-4a94-aa35-be72d997b06e
        ...
        # dmesg
        ...
        [   59.088697] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27000832
        [   59.089929] BTRFS error (device vdh): scrub: tree block 27054080 spanning stripes, ignored. logical=27066368
        #
      Reported-by: Chris Murphy <lists@colorremedies.com>
      Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
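
The numbers in the commit message can be checked with a simple containment test: the tree block [27054080, 27070464) is 16384 bytes long and straddles the boundary at 27066368 between two 65536-byte stripes. The helper below is an illustrative sketch with a made-up name, not the condition the patch adds to scrub.

```c
#include <stdint.h>

/*
 * Return non-zero if the block [start, start + len) is not fully
 * contained in the stripe [stripe_start, stripe_start + stripe_len).
 */
static int crosses_stripe(uint64_t start, uint64_t len,
                          uint64_t stripe_start, uint64_t stripe_len)
{
    return start < stripe_start ||
           start + len > stripe_start + stripe_len;
}
```

For the block from the commit, the test is true for both stripes it touches, which matches the two "spanning stripes, ignored" messages in the dmesg output after the patch.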
    • Btrfs: fix stale dir entries after removing a link and fsync · 18aa0922
      Filipe Manana committed
      We have one more case where after a log tree is replayed we get
      inconsistent metadata leading to stale directory entries, due to
      some directories having entries pointing to some inode while the
      inode does not have a matching BTRFS_INODE_[REF|EXTREF]_KEY item.
      
      To trigger the problem we need to have a file with multiple hard links
      belonging to different parent directories. Then if one of those hard
      links is removed and we fsync the file using one of its other links
      that belongs to a different parent directory, we end up not logging
      the fact that the removed hard link doesn't exist anymore in the
      parent directory.
      
      Simple reproducer:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test directory and file.
        mkdir $SCRATCH_MNT/testdir
        touch $SCRATCH_MNT/foo
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/testdir/foo2
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/testdir/foo3
      
        # Make sure everything done so far is durably persisted.
        sync
      
        # Now we remove one of our file's hardlinks in the directory testdir.
        unlink $SCRATCH_MNT/testdir/foo3
      
        # We now fsync our file using the "foo" link, which has a parent that
        # is not the directory "testdir".
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        # Silently drop all writes and unmount to simulate a crash/power
        # failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again, mount to trigger journal/log replay.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # After the journal/log is replayed we expect to not see the "foo3"
        # link anymore and we should be able to remove all names in the
        # directory "testdir" and then remove it (no stale directory entries
        # left after the journal/log replay).
        echo "Entries in testdir:"
        ls -1 $SCRATCH_MNT/testdir
      
        rm -f $SCRATCH_MNT/testdir/*
        rmdir $SCRATCH_MNT/testdir
      
        _unmount_flakey
      
        status=0
        exit
      
      The test fails with:
      
        $ ./check generic/107
        FSTYP         -- btrfs
        PLATFORM      -- Linux/x86_64 debian3 4.1.0-rc6-btrfs-next-11+
        MKFS_OPTIONS  -- /dev/sdc
        MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
      
        generic/107 3s ... - output mismatch (see .../results/generic/107.out.bad)
          --- tests/generic/107.out	2015-08-01 01:39:45.807462161 +0100
          +++ /home/fdmanana/git/hub/xfstests/results//generic/107.out.bad
          @@ -1,3 +1,5 @@
           QA output created by 107
           Entries in testdir:
           foo2
          +foo3
          +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
          ...
          _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent \
            (see /home/fdmanana/git/hub/xfstests/results//generic/107.full)
          _check_dmesg: something found in dmesg (see .../results/generic/107.dmesg)
        Ran: generic/107
        Failures: generic/107
        Failed 1 of 1 tests
      
        $ cat /home/fdmanana/git/hub/xfstests/results//generic/107.full
        (...)
        checking fs roots
        root 5 inode 257 errors 200, dir isize wrong
      	unresolved ref dir 257 index 3 namelen 4 name foo3 filetype 1 errors 5, no dir item, no inode ref
        (...)
      
      And produces the following warning in dmesg:
      
        [127298.759064] BTRFS info (device dm-0): failed to delete reference to foo3, inode 258 parent 257
        [127298.762081] ------------[ cut here ]------------
        [127298.763311] WARNING: CPU: 10 PID: 7891 at fs/btrfs/inode.c:3956 __btrfs_unlink_inode+0x182/0x35a [btrfs]()
        [127298.767327] BTRFS: Transaction aborted (error -2)
        (...)
        [127298.788611] Call Trace:
        [127298.789137]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
        [127298.790090]  [<ffffffff81095de5>] ? console_unlock+0x356/0x3a2
        [127298.791157]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
        [127298.792323]  [<ffffffffa065ad09>] ? __btrfs_unlink_inode+0x182/0x35a [btrfs]
        [127298.793633]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
        [127298.794699]  [<ffffffffa065ad09>] __btrfs_unlink_inode+0x182/0x35a [btrfs]
        [127298.797640]  [<ffffffffa065be8f>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
        [127298.798876]  [<ffffffffa065bf11>] btrfs_unlink+0x60/0x9b [btrfs]
        [127298.800154]  [<ffffffff8116fb48>] vfs_unlink+0x9c/0xed
        [127298.801303]  [<ffffffff81173481>] do_unlinkat+0x12b/0x1fb
        [127298.802450]  [<ffffffff81253855>] ? lockdep_sys_exit_thunk+0x12/0x14
        [127298.803797]  [<ffffffff81174056>] SyS_unlinkat+0x29/0x2b
        [127298.805017]  [<ffffffff81465197>] system_call_fastpath+0x12/0x6f
        [127298.806310] ---[ end trace bbfddacb7aaada7b ]---
        [127298.807325] BTRFS warning (device dm-0): __btrfs_unlink_inode:3956: Aborting unused transaction(No such entry).
      
      So fix this by logging all parent inodes, current and old ones, to make
      sure we do not get stale entries after log replay. This is not a simple
      solution such as triggering a full transaction commit because it would
      imply full transaction commit when an inode is fsynced in the same
      transaction that modified it and reloaded it after eviction (because its
      last_unlink_trans is set to the same value as its last_trans as of the
      commit with the title "Btrfs: fix stale dir entries after unlink, inode
      eviction and fsync"), and it would also make fstest generic/066 fail
      since one of the fsyncs triggers a full commit and the next fsync will
      not find the inode in the log anymore (therefore not removing the xattr).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • btrfs: fix search key advancing condition · dd81d459
      Naohiro Aota committed
      The search key advancing condition used in copy_to_sk() is loose. It can
      advance the key even if it reaches sk->max_*: e.g. when the max key = (512,
      1024, -1) and the current key = (512, 1025, 10), it increments the
      offset by 1 and continues a hopeless search from (512, 1025, 11). This
      issue makes ioctl() take an unexpectedly long time scanning all the leaf
      blocks one by one.
      
      This commit fixes the problem using the standard way of key comparison:
      btrfs_comp_cpu_keys()
      Signed-off-by: Naohiro Aota <naota@elisp.net>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
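
The "standard way of key comparison" is a lexicographic compare over (objectid, type, offset). The sketch below is a user-space stand-in with hypothetical names, not the kernel's btrfs_comp_cpu_keys(). The type field is widened to 64 bits here purely so that the illustrative values 1024 and 1025 from the commit message fit.

```c
#include <stdint.h>

/* Hypothetical key; "type" is widened for this example only. */
struct toy_key {
    uint64_t objectid;
    uint64_t type;
    uint64_t offset;
};

/* Lexicographic comparison: returns -1, 0 or 1 like a typical comparator. */
static int toy_comp_keys(const struct toy_key *a, const struct toy_key *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* The search may continue only while the current key is <= the max key. */
static int toy_key_in_range(const struct toy_key *cur,
                            const struct toy_key *max)
{
    return toy_comp_keys(cur, max) <= 0;
}
```

For the example in the commit, (512, 1025, 10) compares greater than (512, 1024, -1) because the type field decides before the offset is ever looked at, so the search should stop instead of advancing the offset.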
    • Btrfs: teach backref walking about backrefs with underflowed offset values · d6589101
      Filipe Manana committed
      When cloning/deduplicating file extents (through the clone and extent_same
      ioctls) we can get data back references with offset values that are a
      result of an unsigned integer arithmetic underflow, that is, values that
      are much larger than they could be otherwise.
      
      This is not a problem when decrementing or dropping the back references
      (happens when we overwrite the extents or punch a hole for example, through
      __btrfs_drop_extents()), since we compute the same too large offset value,
      but it is a problem for the backref walking code, used by an incremental
      send and the ioctls that are used by the btrfs tool "inspect-internal"
      commands, as it makes it miss the corresponding file extent items because
      the search key is set for an extent item that starts at an offset matching
      the exceptionally large offset value of the data back reference. For an
      incremental send this causes the send ioctl to fail with -EIO.
      
      So teach the backref walking code to deal with these cases by setting the
      search key's offset to 0 if the backref's offset value is larger than
      LLONG_MAX (the largest possible file offset). This makes sure the backref
      walking code finds the corresponding file extent items at the expense of
      scanning more items and leafs in the btree.
      
      Fixing the clone/dedup ioctls to not produce such underflowed results would
      require major changes breaking backward compatibility, updating user space
      tools, etc.
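
The underflow and the workaround described above can be reproduced with plain 64-bit unsigned arithmetic. The helper names below are made up for this sketch; the two large values are the ones quoted in the reproducer's comments (0 - 16K and 16K - 48K).

```c
#include <limits.h>
#include <stdint.h>

/* Data backref offset as computed by clone: file_offset - data_offset. */
static uint64_t backref_offset(uint64_t file_offset, uint64_t data_offset)
{
    return file_offset - data_offset;  /* wraps around when negative */
}

/*
 * Offset to use in the search key: anything above LLONG_MAX cannot be a
 * real file offset, so fall back to 0 and scan forward from there.
 */
static uint64_t search_key_offset(uint64_t backref_off)
{
    return backref_off > (uint64_t)LLONG_MAX ? 0 : backref_off;
}
```

Starting the search at offset 0 costs extra items and leaves to scan, which is the trade-off the commit message accepts in exchange for not changing the on-disk values.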
      
      Simple reproducer case for fstests:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
      
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -fr $send_files_dir
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_cloner
        _need_to_be_root
      
        send_files_dir=$TEST_DIR/btrfs-test-$seq
      
        rm -f $seqres.full
        rm -fr $send_files_dir
        mkdir $send_files_dir
      
        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount
      
        # Create our test file with a single extent of 64K starting at file
        # offset 128K.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 128K 64K" $SCRATCH_MNT/foo \
            | _filter_xfs_io
      
        _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
            $SCRATCH_MNT/mysnap1
      
        # Now clone parts of the original extent into lower offsets of the file.
        #
        # The first clone operation adds a file extent item to file offset 0
        # that points to our initial extent with a data offset of 16K. The
        # corresponding data back reference in the extent tree has an offset of
        # 18446744073709535232, which is the result of file_offset - data_offset
        # = 0 - 16K.
        #
        # The second clone operation adds a file extent item to file offset 16K
        # that points to our initial extent with a data offset of 48K. The
        # corresponding data back reference in the extent tree has an offset of
        # 18446744073709518848, which is the result of file_offset - data_offset
        # = 16K - 48K.
        #
        # Those large back reference offsets (result of unsigned arithmetic
        # underflow) confused the back reference walking code (used by an
        # incremental send and the multiple inspect-internal ioctls) and made it
        # miss the back references, which for the case of an incremental send it
        # made it fail with -EIO and print a message like the following to
        # dmesg:
        #
        # "BTRFS error (device sdc): did not find backref in send_root. \
        #  inode=257, offset=0, disk_byte=12845056 found extent=12845056"
        #
        $CLONER_PROG -s $(((128 + 16) * 1024)) -d 0 -l $((16 * 1024)) \
            $SCRATCH_MNT/foo $SCRATCH_MNT/foo
        $CLONER_PROG -s $(((128 + 48) * 1024)) -d $((16 * 1024)) \
            -l $((16 * 1024)) $SCRATCH_MNT/foo $SCRATCH_MNT/foo
      
        _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
            $SCRATCH_MNT/mysnap2
      
        _run_btrfs_util_prog send $SCRATCH_MNT/mysnap1 -f $send_files_dir/1.snap
        _run_btrfs_util_prog send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \
            -f $send_files_dir/2.snap
      
        echo "File digest in the original filesystem:"
        md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch
      
        # Now recreate the filesystem by receiving both send streams and verify
        # we get the same file contents that the original filesystem had.
        _scratch_unmount
        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount
      
        _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap
        _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/2.snap
      
        echo "File digest in the new filesystem:"
        md5sum $SCRATCH_MNT/mysnap2/foo | _filter_scratch
      
        status=0
        exit
      
      The test's expected golden output is:
      
        wrote 65536/65536 bytes at offset 131072
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File digest in the original filesystem:
        6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo
        File digest in the new filesystem:
        6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo
      
      But it failed with:
      
          (...)
          @@ -1,7 +1,5 @@
           QA output created by 097
           wrote 65536/65536 bytes at offset 131072
           XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
          -File digest in the original filesystem:
          -6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo
          -File digest in the new filesystem:
          -6c6079335cff141b8a31233ead04cbff  SCRATCH_MNT/mysnap2/foo
          ...
      
        $ cat /home/fdmanana/git/hub/xfstests/results//btrfs/097.full
        (...)
        ERROR: send ioctl failed with -5: Input/output error
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      d6589101
    • F
      Btrfs: fix stale dir entries after unlink, inode eviction and fsync · bde6c242
      Filipe Manana committed
      If we remove a hard link from an inode, the inode gets evicted, then
      we fsync the inode and then power fail/crash, when the log tree is
      replayed, the parent directory inode still has entries pointing to
      the name that no longer exists, while our inode no longer has the
      BTRFS_INODE_REF_KEY item matching the deleted hard link (as expected),
      leaving the filesystem in an inconsistent state. The stale directory
      entries can not be deleted (an attempt to delete them causes -ESTALE
      errors), which makes it impossible to delete the parent directory.
      
      This happens because we track the id of the transaction where the last
      unlink operation for the inode happened (last_unlink_trans) in an
      in-memory only field of the inode, that is, a value that is never
      persisted in the inode item stored on the fs/subvol btree. So if an
      inode is evicted and loaded again, the value for last_unlink_trans is
      set to 0, which prevents the fsync from logging the parent directory
      at btrfs_log_inode_parent(). So fix this by setting last_unlink_trans
      to the id of the transaction that last modified the inode when we
      load the inode. This is a pessimistic approach, but it always ensures
      correctness. The trade-off is an occasional full transaction commit
      when our inode is a directory and an fsync is done against it in the
      same transaction where it was evicted and reloaded, and often logging
      the parent unnecessarily when our inode is not a directory.
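
      The effect of the in-memory field being reset can be illustrated with
      a small model. Everything here is a simplification: the class and
      function names are hypothetical, and the rule "log the parent when
      last_unlink_trans is newer than the last committed transaction" is an
      assumed stand-in for the real check in btrfs_log_inode_parent().

```python
from dataclasses import dataclass


@dataclass
class InodeItem:
    """Persisted part of the inode (hypothetical, heavily simplified)."""
    transid: int  # id of the transaction that last modified the inode


@dataclass
class InMemoryInode:
    """In-memory only state that is lost when the inode is evicted."""
    last_unlink_trans: int


def load_inode(item, with_fix):
    # Before the fix the field starts at 0 after a reload; with the fix it
    # is seeded from the transaction that last modified the inode.
    return InMemoryInode(last_unlink_trans=item.transid if with_fix else 0)


def fsync_logs_parent(inode, last_committed):
    # Assumed simplification of the decision made at fsync time.
    return inode.last_unlink_trans > last_committed


last_committed = 10            # state after the initial sync
item = InodeItem(transid=11)   # the unlink happened in transaction 11

for with_fix in (False, True):
    reloaded = load_inode(item, with_fix)  # inode was evicted, then loaded again
    print(with_fix, fsync_logs_parent(reloaded, last_committed))
```

      Seeding from the persisted transaction id is pessimistic because any
      modification, not only an unlink, makes the reloaded inode look as if
      it had a recent unlink.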
      
      The following test case for fstests triggers the problem:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test file with 2 hard links.
        mkdir $SCRATCH_MNT/testdir
        touch $SCRATCH_MNT/testdir/foo
        ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar
      
        # Make sure everything done so far is durably persisted.
        sync
      
        # Now remove one of the links, trigger inode eviction and then fsync
        # our inode.
        unlink $SCRATCH_MNT/testdir/bar
        echo 2 > /proc/sys/vm/drop_caches
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/foo
      
        # Silently drop all writes on our scratch device to simulate a power failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again and mount the fs to trigger log/journal replay.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Now verify our directory entries.
        echo "Entries in testdir:"
        ls -1 $SCRATCH_MNT/testdir
      
        # If we remove our inode, its parent should become empty and therefore we should
        # be able to remove the parent.
        rm -f $SCRATCH_MNT/testdir/*
        rmdir $SCRATCH_MNT/testdir
      
        _unmount_flakey
      
        # The fstests framework will call fsck against our filesystem which will verify
        # that all metadata is in a consistent state.
      
        status=0
        exit
      
      The test failed on btrfs with:
      
        generic/098 4s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad)
          --- tests/generic/098.out	2015-07-23 18:01:12.616175932 +0100
          +++ /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad	2015-07-23 18:04:58.924138308 +0100
          @@ -1,3 +1,6 @@
           QA output created by 098
           Entries in testdir:
          +bar
           foo
          +rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo': Stale file handle
          +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
          ...
          (Run 'diff -u tests/generic/098.out /home/fdmanana/git/hub/xfstests/results//generic/098.out.bad'  to see the entire diff)
        _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent (see /home/fdmanana/git/hub/xfstests/results//generic/098.full)
      
        $ cat /home/fdmanana/git/hub/xfstests/results//generic/098.full
        (...)
        checking fs roots
        root 5 inode 258 errors 2001, no inode item, link count wrong
           unresolved ref dir 257 index 0 namelen 3 name foo filetype 1 errors 6, no dir index, no inode ref
           unresolved ref dir 257 index 3 namelen 3 name bar filetype 1 errors 5, no dir item, no inode ref
        Checking filesystem on /dev/sdc
        (...)
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      bde6c242
    • F
      Btrfs: fix stale directory entries after fsync log replay · bb53eda9
      Filipe Manana committed
      We have another case where after an fsync log replay we get an inode with
      a wrong link count (smaller than it should be) and a number of directory
      entries greater than its link count. This happens when we add a new
      hard link to our inode A and then we fsync some other inode B that has
      the side effect of logging the parent directory inode too. In this case
      at log replay time we add the new hard link to our inode (the item with
      key BTRFS_INODE_REF_KEY) when processing the parent directory but we
      never adjust the link count of our inode A. As a result we get stale dir
      entries for our inode A that can never be deleted and therefore it makes
      it impossible to remove the parent directory (as its i_size can never
      decrease back to 0).
      
      A simple reproducer for fstests that triggers this issue:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs generic
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test directory and files.
        mkdir $SCRATCH_MNT/testdir
        touch $SCRATCH_MNT/testdir/foo
        touch $SCRATCH_MNT/testdir/bar
      
        # Make sure everything done so far is durably persisted.
        sync
      
        # Create one hard link for file foo and another one for file bar. After
        # that fsync only the file bar.
        ln $SCRATCH_MNT/testdir/bar $SCRATCH_MNT/testdir/bar_link
        ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/foo_link
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/bar
      
        # Silently drop all writes on scratch device to simulate power failure.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Allow writes again and mount the fs to trigger log/journal replay.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Now verify both our files have a link count of 2.
        echo "Link count for file foo: $(stat --format=%h $SCRATCH_MNT/testdir/foo)"
        echo "Link count for file bar: $(stat --format=%h $SCRATCH_MNT/testdir/bar)"
      
        # We should be able to remove all the links of our files in testdir, and
        # after that the parent directory should become empty and therefore
        # possible to remove it.
        rm -f $SCRATCH_MNT/testdir/*
        rmdir $SCRATCH_MNT/testdir
      
        _unmount_flakey
      
        # The fstests framework will call fsck against our filesystem which will verify
        # that all metadata is in a consistent state.
      
        status=0
        exit
      
      The test fails with:
      
       -Link count for file foo: 2
       +Link count for file foo: 1
        Link count for file bar: 2
       +rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/testdir/foo_link': Stale file handle
       +rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/testdir': Directory not empty
       (...)
       _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
      
      And fsck's output:
      
        (...)
        checking fs roots
        root 5 inode 258 errors 2001, no inode item, link count wrong
            unresolved ref dir 257 index 5 namelen 8 name foo_link filetype 1 errors 4, no inode ref
        Checking filesystem on /dev/sdc
        (...)
      
      So fix this by marking inodes for link count fixup at log replay time
      whenever a directory entry is replayed if the entry was created in the
      transaction where the fsync was made and if it points to a non-directory
      inode.
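
      A rough model of that rule follows. It is a sketch under stated
      assumptions: the dictionary layout and function names are hypothetical,
      and "fixup" is reduced to recounting the names that refer to the inode.

```python
def replay_directory(inodes, logged_entries, log_transid, fixup):
    """Replay logged directory entries into a toy inode table.

    inodes: ino -> {"is_dir": bool, "nlink": int, "refs": set of names}
    logged_entries: (name, ino, transid the entry was created in)
    """
    pending = set()
    for name, ino, entry_transid in logged_entries:
        inode = inodes[ino]
        inode["refs"].add(name)  # the new name is added, nlink is left untouched
        if fixup and entry_transid == log_transid and not inode["is_dir"]:
            pending.add(ino)     # mark the inode for link count fixup
    for ino in pending:
        inodes[ino]["nlink"] = len(inodes[ino]["refs"])
    return inodes


def on_disk_state():
    # File foo as persisted by the sync in the reproducer: one link.
    return {258: {"is_dir": False, "nlink": 1, "refs": {"foo"}}}


entries = [("foo_link", 258, 7)]  # logged as a side effect of fsyncing bar
before = replay_directory(on_disk_state(), entries, log_transid=7, fixup=False)
after = replay_directory(on_disk_state(), entries, log_transid=7, fixup=True)
print(before[258]["nlink"], after[258]["nlink"])
```

      Without the fixup the inode ends up with two names but a link count
      of 1, which matches the failure shown by the test above.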
      
      This isn't a new problem/regression; the issue has existed for a long time,
      possibly since the log tree feature was added (2008).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      bb53eda9
  2. 02 Aug, 2015 1 commit
    • A
      link_path_walk(): be careful when failing with ENOTDIR · 97242f99
      Al Viro committed
      In RCU mode we might end up with the dentry evicted just as we check
      that it's a directory.  In such a case we should return ECHILD
      rather than ENOTDIR, so that the pathwalk would be retried in non-RCU
      mode.
      
      Breakage had been introduced in commit b18825a7 - prior to that
      we were looking at nd->inode, which had been fetched before
      verifying that ->d_seq was still valid.  That form of check
      would only be satisfied if at some point the pathname prefix
      would indeed have resolved to a non-directory.  The fix consists
      of checking ->d_seq after we'd run into a non-directory dentry,
      and failing with ECHILD in case of mismatch.
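
      The decision described above can be modelled as follows. This is only
      a sketch: the Dentry class and check_component() are hypothetical
      stand-ins, and the sequence counter is a plain integer rather than a
      real seqcount.

```python
import errno


class Dentry:
    """Toy dentry with a sequence counter that changes on eviction."""

    def __init__(self, is_dir):
        self.is_dir = is_dir
        self.d_seq = 0

    def evict(self):
        # After eviction the dentry no longer looks like a directory and
        # its sequence counter has moved on.
        self.is_dir = False
        self.d_seq += 2


def check_component(dentry, seq_snapshot, rcu_mode):
    """Return 0, -ECHILD (retry in non-RCU mode) or -ENOTDIR."""
    if dentry.is_dir:
        return 0
    if rcu_mode and dentry.d_seq != seq_snapshot:
        return -errno.ECHILD  # what we saw may be stale, so retry
    return -errno.ENOTDIR     # the prefix really is not a directory


regular_file = Dentry(is_dir=False)
print(check_component(regular_file, regular_file.d_seq, rcu_mode=True))

directory = Dentry(is_dir=True)
snapshot = directory.d_seq
directory.evict()  # races with the lookup
print(check_component(directory, snapshot, rcu_mode=True))
```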
      
      Note that all branches since 3.12 have that problem...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      97242f99
  3. 29 Jul, 2015 3 commits
    • D
      xfs: remote attributes need to be considered data · df150ed1
      Dave Chinner committed
      We don't log remote attribute contents, and instead write them
      synchronously before we commit the block allocation and attribute
      tree update transaction. As a result we are writing to the allocated
      space before the allocation has been made permanent.
      
      As a result, we cannot consider this allocation to be a metadata
      allocation. Metadata allocation can take blocks from the free list
      and so reuse them before the transaction that freed the block is
      committed to disk. This behaviour is perfectly fine for journalled
      metadata changes as log recovery will ensure the free operation is
      replayed before the overwrite, but for remote attribute writes this
      is not the case.
      
      Hence we have to consider the remote attribute blocks to contain
      data and allocate accordingly. We do this by dropping the
      XFS_BMAPI_METADATA flag from the block allocation. This means the
      allocation will not use blocks that are on the busy list without
      first ensuring that the freeing transaction has been committed to
      disk and the blocks removed from the busy list. This ensures we will
      never overwrite a freed block without first ensuring that it is
      really free.
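
      The difference between the two allocation modes can be pictured with
      a toy allocator. This is an assumption-laden sketch: pick_block() and
      its arguments are hypothetical, and skipping busy blocks stands in for
      the real behaviour of waiting until the freeing transaction is on disk.

```python
def pick_block(free_blocks, busy_blocks, metadata):
    """Return the first usable block from the free list, or None.

    busy_blocks holds blocks whose freeing transaction is not yet on disk.
    Metadata allocations may reuse them because log recovery replays the
    free before the overwrite; data allocations must not.
    """
    for blk in free_blocks:
        if blk in busy_blocks and not metadata:
            continue  # not safe to write here until the free is committed
        return blk
    return None


free_list = [7, 9]
busy = {7}  # block 7 was freed by a transaction that is still uncommitted
print(pick_block(free_list, busy, metadata=True))
print(pick_block(free_list, busy, metadata=False))
```

      Treating remote attribute blocks as data means they take the second
      path, so a synchronous write can never land on a block whose free has
      not reached the log.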
      
      cc: <stable@vger.kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      df150ed1
    • D
      xfs: remote attribute headers contain an invalid LSN · e3c32ee9
      Dave Chinner committed
      In recent testing, a system that crashed failed log recovery on
      restart with a bad symlink buffer magic number:
      
      XFS (vda): Starting recovery (logdev: internal)
      XFS (vda): Bad symlink block magic!
      XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 2060
      
      On examination of the log via xfs_logprint, none of the symlink
      buffers in the log had a bad magic number, nor were any other types
      of buffer log format headers mis-identified as symlink buffers.
      Tracing was used to find the buffer the kernel was tripping over,
      and xfs_db identified its contents as:
      
      000: 5841524d 00000000 00000346 64d82b48 8983e692 d71e4680 a5f49e2c b317576e
      020: 00000000 00602038 00000000 006034ce d0020000 00000000 4d4d4d4d 4d4d4d4d
      040: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
      060: 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d 4d4d4d4d
      .....
      
      This is a remote attribute buffer. Such buffers are notable in that they
      are not logged but are instead written synchronously by the remote
      attribute code so that they exist on disk before the attribute
      transactions are committed to the journal.
      
      The above remote attribute block has an invalid LSN in it - cycle
      0xd0020000, block 0 - which means when log recovery comes along to
      determine if the transaction that writes to the underlying block
      should be replayed, it sees a block that has a future LSN and so
      does not replay the buffer data in the transaction. Instead, it
      validates the buffer magic number and attaches the buffer verifier
      to it.  It is this buffer magic number check that is failing in the
      above assert, indicating that we skipped replay due to the LSN of
      the underlying buffer.
      
      The problem here is that the remote attribute buffers cannot have a
      valid LSN placed into them, because the transaction that contains 
      the attribute tree pointer changes and the block allocation that the
      attribute data is being written to hasn't yet been committed. Hence
      the LSN field in the attribute block is completely unwritten,
      thereby leaving the underlying contents of the block in the LSN
      field. It could have any value, and hence a future overwrite of the
      block by log recovery may or may not work correctly.
      
      Fix this by always writing an invalid LSN to the remote attribute
      block, so that any buffer replay in log recovery that needs to write
      over the remote attribute does occur. We are protected from having old data
      written over the attribute by the fact that freeing the block before
      the remote attribute is written will result in the buffer being
      marked stale in the log and so all changes prior to the buffer stale
      transaction will be cancelled by log recovery.
      
      Hence it is safe to ignore the LSN in the case of synchronously
      written, unlogged metadata such as remote attribute blocks, and to
      ensure we do that correctly, we need to write an invalid LSN to all
      remote attribute blocks to trigger immediate recovery of metadata
      that is written over the top.
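
      The replay decision described above can be sketched like this. The
      names make_lsn(), should_replay() and INVALID_LSN are hypothetical,
      the sentinel value is an assumption of this sketch, and the comparison
      is a simplified version of what log recovery does.

```python
INVALID_LSN = -1  # assumed sentinel meaning "no usable LSN in this block"


def make_lsn(cycle, block):
    """Pack a cycle number and block number into one 64-bit LSN."""
    return (cycle << 32) | block


def should_replay(buf_lsn, trans_lsn):
    """Replay logged changes unless the on-disk buffer is already newer."""
    if buf_lsn == INVALID_LSN:
        return True             # unlogged block: always recover over it
    return buf_lsn < trans_lsn  # skip replay when the buffer looks newer


trans = make_lsn(5, 100)        # an arbitrary transaction being recovered
stale = make_lsn(0xd0020000, 0) # the leftover value seen in the dump above
print(should_replay(stale, trans))
print(should_replay(INVALID_LSN, trans))
```

      With the leftover value the block appears to come from the future and
      replay is skipped, which is the failure described in this commit.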
      
      As a further protection for filesystems that may already have remote
      attribute blocks with bad LSNs on disk, change the log recovery code
      to always trigger immediate recovery of metadata over remote
      attribute blocks.
      
      cc: <stable@vger.kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      e3c32ee9
    • D
      xfs: call dax_fault on read page faults for DAX · b2442c5a
      Dave Chinner committed
      When modifying the patch series to handle the XFS MMAP_LOCK nesting
      of page faults, I botched the conversion of the read page fault
      path, and so it is only ever calling through the page cache. Re-add
      the necessary __dax_fault() call for such files.
      
      Because the get_blocks callback on read faults may not set up the
      mapping buffer correctly to allow unwritten extent completion to be
      run, we need to allow callers of __dax_fault() to pass a null
      complete_unwritten() callback. The DAX code always zeros the
      unwritten page when it is read faulted so there are no stale data
      exposure issues with not doing the conversion. The only downside
      will be the potential for increased CPU overhead on repeated read
      faults of the same page. If this proves to be a problem, then the
      filesystem needs to fix its get_block callback and provide a
      convert_unwritten() callback to the read fault path.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      b2442c5a
  4. 28 Jul, 2015 2 commits
    • K
      nfs: Fix an oops caused by using other thread's stack space in ASYNC mode · a49c2691
      Kinglong Mee committed
      Fix an oops caused by using another thread's stack space in the sunrpc ASYNC sending thread.
      
      [ 9839.007187] ------------[ cut here ]------------
      [ 9839.007923] kernel BUG at fs/nfs/nfs4xdr.c:910!
      [ 9839.008069] invalid opcode: 0000 [#1] SMP
      [ 9839.008069] Modules linked in: blocklayoutdriver rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm joydev iosf_mbi crct10dif_pclmul snd_timer crc32_pclmul crc32c_intel ghash_clmulni_intel snd soundcore ppdev pvpanic parport_pc i2c_piix4 serio_raw virtio_balloon parport acpi_cpufreq nfsd nfs_acl lockd grace auth_rpcgss sunrpc qxl drm_kms_helper virtio_net virtio_console virtio_blk ttm drm virtio_pci virtio_ring virtio ata_generic pata_acpi
      [ 9839.008069] CPU: 0 PID: 308 Comm: kworker/0:1H Not tainted 4.0.0-0.rc4.git1.3.fc23.x86_64 #1
      [ 9839.008069] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [ 9839.008069] Workqueue: rpciod rpc_async_schedule [sunrpc]
      [ 9839.008069] task: ffff8800d8b4d8e0 ti: ffff880036678000 task.ti: ffff880036678000
      [ 9839.008069] RIP: 0010:[<ffffffffa0339cc9>]  [<ffffffffa0339cc9>] reserve_space.part.73+0x9/0x10 [nfsv4]
      [ 9839.008069] RSP: 0018:ffff88003667ba58  EFLAGS: 00010246
      [ 9839.008069] RAX: 0000000000000000 RBX: 000000001fc15e18 RCX: ffff8800c0193800
      [ 9839.008069] RDX: ffff8800e4ae3f24 RSI: 000000001fc15e2c RDI: ffff88003667bcd0
      [ 9839.008069] RBP: ffff88003667ba58 R08: ffff8800d9173008 R09: 0000000000000003
      [ 9839.008069] R10: ffff88003667bcd0 R11: 000000000000000c R12: 0000000000010000
      [ 9839.008069] R13: ffff8800d9173350 R14: 0000000000000000 R15: ffff8800c0067b98
      [ 9839.008069] FS:  0000000000000000(0000) GS:ffff88011fc00000(0000) knlGS:0000000000000000
      [ 9839.008069] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 9839.008069] CR2: 00007f988c9c8bb0 CR3: 00000000d99b6000 CR4: 00000000000407f0
      [ 9839.008069] Stack:
      [ 9839.008069]  ffff88003667bbc8 ffffffffa03412c5 00000000c6c55680 ffff880000000003
      [ 9839.008069]  0000000000000088 00000010c6c55680 0001000000000002 ffffffff816e87e9
      [ 9839.008069]  0000000000000000 00000000477290e2 ffff88003667bab8 ffffffff81327ba3
      [ 9839.008069] Call Trace:
      [ 9839.008069]  [<ffffffffa03412c5>] encode_attrs+0x435/0x530 [nfsv4]
      [ 9839.008069]  [<ffffffff816e87e9>] ? inet_sendmsg+0x69/0xb0
      [ 9839.008069]  [<ffffffff81327ba3>] ? selinux_socket_sendmsg+0x23/0x30
      [ 9839.008069]  [<ffffffff8164c1df>] ? do_sock_sendmsg+0x9f/0xc0
      [ 9839.008069]  [<ffffffff8164c278>] ? kernel_sendmsg+0x58/0x70
      [ 9839.008069]  [<ffffffffa011acc0>] ? xdr_reserve_space+0x20/0x170 [sunrpc]
      [ 9839.008069]  [<ffffffffa011acc0>] ? xdr_reserve_space+0x20/0x170 [sunrpc]
      [ 9839.008069]  [<ffffffffa0341b40>] ? nfs4_xdr_enc_open_noattr+0x130/0x130 [nfsv4]
      [ 9839.008069]  [<ffffffffa03419a5>] encode_open+0x2d5/0x340 [nfsv4]
      [ 9839.008069]  [<ffffffffa0341b40>] ? nfs4_xdr_enc_open_noattr+0x130/0x130 [nfsv4]
      [ 9839.008069]  [<ffffffffa011ab89>] ? xdr_encode_opaque+0x19/0x20 [sunrpc]
      [ 9839.008069]  [<ffffffffa0339cfb>] ? encode_string+0x2b/0x40 [nfsv4]
      [ 9839.008069]  [<ffffffffa0341bf3>] nfs4_xdr_enc_open+0xb3/0x140 [nfsv4]
      [ 9839.008069]  [<ffffffffa0110a4c>] rpcauth_wrap_req+0xac/0xf0 [sunrpc]
      [ 9839.008069]  [<ffffffffa01017db>] call_transmit+0x18b/0x2d0 [sunrpc]
      [ 9839.008069]  [<ffffffffa0101650>] ? call_decode+0x860/0x860 [sunrpc]
      [ 9839.008069]  [<ffffffffa0101650>] ? call_decode+0x860/0x860 [sunrpc]
      [ 9839.008069]  [<ffffffffa010caa0>] __rpc_execute+0x90/0x460 [sunrpc]
      [ 9839.008069]  [<ffffffffa010ce85>] rpc_async_schedule+0x15/0x20 [sunrpc]
      [ 9839.008069]  [<ffffffff810b452b>] process_one_work+0x1bb/0x410
      [ 9839.008069]  [<ffffffff810b47d3>] worker_thread+0x53/0x470
      [ 9839.008069]  [<ffffffff810b4780>] ? process_one_work+0x410/0x410
      [ 9839.008069]  [<ffffffff810b4780>] ? process_one_work+0x410/0x410
      [ 9839.008069]  [<ffffffff810ba7b8>] kthread+0xd8/0xf0
      [ 9839.008069]  [<ffffffff810ba6e0>] ? kthread_worker_fn+0x180/0x180
      [ 9839.008069]  [<ffffffff81786418>] ret_from_fork+0x58/0x90
      [ 9839.008069]  [<ffffffff810ba6e0>] ? kthread_worker_fn+0x180/0x180
      [ 9839.008069] Code: 00 00 48 c7 c7 21 fa 37 a0 e8 94 1c d6 e0 c6 05 d2 17 05 00 01 8b 03 eb d7 66 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 <0f> 0b 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 89 f3
      [ 9839.008069] RIP  [<ffffffffa0339cc9>] reserve_space.part.73+0x9/0x10 [nfsv4]
      [ 9839.008069]  RSP <ffff88003667ba58>
      [ 9839.071114] ---[ end trace cc14c03adb522e94 ]---
      Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      a49c2691
    • J
      nfs: plug memory leak when ->prepare_layoutcommit fails · 3471648a
      Jeff Layton committed
      "data" is currently leaked when the prepare_layoutcommit operation
      returns an error. Put the cred before taking the spinlock in that
      case, take the lock and then goto out_unlock which will drop the
      lock and then free "data".
      Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      3471648a