1. 05 Apr, 2016 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Committed by Kirill A. Shutemov
      The PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the page
      cache with bigger chunks than PAGE_SIZE.

      This promise never materialized, and it is unlikely it ever will.

      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion on whether a
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.

      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
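
As an illustration of why these substitutions are safe, here is a minimal user-space sketch. It assumes PAGE_SHIFT is 12 (a common value, chosen here only for the example) and models PAGE_CACHE_SHIFT as a plain alias; `old_index()` and `new_index()` are hypothetical helper names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative user-space values; the kernel defines these per architecture. */
#define PAGE_SHIFT        12
#define PAGE_SIZE         (1UL << PAGE_SHIFT)
#define PAGE_CACHE_SHIFT  PAGE_SHIFT   /* the old alias that the patch removes */

/* Old style: scale an index by the difference between the two shifts. */
static unsigned long old_index(unsigned long idx)
{
    return idx << (PAGE_CACHE_SHIFT - PAGE_SHIFT);   /* always a shift by 0 */
}

/* New style after the patch: the shift is dropped entirely. */
static unsigned long new_index(unsigned long idx)
{
    return idx;
}
```

Because both shifts are equal, the old expression is a no-op and both helpers return the same value for any index.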
      
      This patch contains automated changes generated with coccinelle using the
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.

      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.

      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 04 Apr, 2016 1 commit
  3. 12 Mar, 2016 1 commit
  4. 02 Mar, 2016 3 commits
    • Btrfs: fix extent_same allowing destination offset beyond i_size · f4dfe687
      Committed by Filipe Manana
      When using the same file as the source and destination for a dedup
      (extent_same ioctl) operation we were allowing it to dedup to a
      destination offset beyond the file's size, which doesn't make sense and
      it's not allowed for the case where the source and destination files are
      not the same file. This made the deduplication operation successful only
      when the source range corresponded to a hole, a prealloc extent or an
      extent with all bytes having a value of 0x00. This was also leaving a
      file hole (between i_size and destination offset) without the
      corresponding file extent items, which can be reproduced with the
      following steps for example:
      
        $ mkfs.btrfs -f /dev/sdi
        $ mount /dev/sdi /mnt/sdi
      
        $ xfs_io -f -c "pwrite -S 0xab 304457 404990" /mnt/sdi/foobar
        wrote 404990/404990 bytes at offset 304457
        395 KiB, 99 ops; 0.0000 sec (31.150 MiB/sec and 7984.5149 ops/sec)
      
        $ /git/hub/duperemove/btrfs-extent-same 24576 /mnt/sdi/foobar 28672 /mnt/sdi/foobar 929792
        Deduping 2 total files
        (28672, 24576): /mnt/sdi/foobar
        (929792, 24576): /mnt/sdi/foobar
        1 files asked to be deduped
        i: 0, status: 0, bytes_deduped: 24576
        24576 total bytes deduped in this operation
      
        $ umount /mnt/sdi
        $ btrfsck /dev/sdi
        Checking filesystem on /dev/sdi
        UUID: 98c528aa-0833-427d-9403-b98032ffbf9d
        checking extents
        checking free space cache
        checking fs roots
        root 5 inode 257 errors 100, file extent discount
        Found file extent holes:
                start: 712704, len: 217088
        found 540673 bytes used err is 1
        total csum bytes: 400
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 123675
        file data blocks allocated: 671744
          referenced 671744
        btrfs-progs v4.2.3
      
      So fix this by not allowing the destination to go beyond the file's size,
      just as we do for the case where the source and destination files are not
      the same.
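
The check described above can be sketched as follows. This is a simplified user-space model, not the code from the patch: `range_within_size()` and `dedup_ranges_ok()` are hypothetical names, and the real ioctl performs further validation that is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if [off, off + len) lies within `size` bytes. */
static bool range_within_size(uint64_t off, uint64_t len, uint64_t size)
{
    /* Written this way to avoid overflow in off + len. */
    return len <= size && off <= size - len;
}

/* Both the source and the destination range must end at or before i_size. */
static bool dedup_ranges_ok(uint64_t src_off, uint64_t dst_off,
                            uint64_t len, uint64_t i_size)
{
    return range_within_size(src_off, len, i_size) &&
           range_within_size(dst_off, len, i_size);
}
```

With the numbers from the reproducer, the file size is 304457 + 404990 = 709447 bytes, so a destination offset of 929792 with a length of 24576 is rejected by this check.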
      
      A test for xfstests follows.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix file loss on log replay after renaming a file and fsync · 2be63d5c
      Committed by Filipe Manana
      We have two cases where we end up deleting a file at log replay time
      when we should not. For this to happen the file must have been renamed
      and a directory inode must have been fsynced/logged.
      
      Two examples that exercise these two cases are listed below.
      
        Case 1)
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ mkdir -p /mnt/a/b
        $ mkdir /mnt/c
        $ touch /mnt/a/b/foo
        $ sync
        $ mv /mnt/a/b/foo /mnt/c/
        # Create file bar just to make sure the fsync on directory a/ does
        # something and it's not a no-op.
        $ touch /mnt/a/bar
        $ xfs_io -c "fsync" /mnt/a
        < power fail / crash >
      
        The next time the filesystem is mounted, the log replay procedure
        deletes file foo.
      
        Case 2)
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ mkdir /mnt/a
        $ mkdir /mnt/b
        $ mkdir /mnt/c
        $ touch /mnt/a/foo
        $ ln /mnt/a/foo /mnt/b/foo_link
        $ touch /mnt/b/bar
        $ sync
        $ unlink /mnt/b/foo_link
        $ mv /mnt/b/bar /mnt/c/
        $ xfs_io -c "fsync" /mnt/a/foo
        < power fail / crash >
      
        The next time the filesystem is mounted, the log replay procedure
        deletes file bar.
      
      The files are deleted because when we log inodes
      other than the fsync target inode, we ignore their last_unlink_trans
      value and leave the log without enough information to later replay the
      rename operations. So we need to look at the last_unlink_trans values
      and fall back to a transaction commit if they are greater than the
      id of the last committed transaction.
      
      So fix this by looking at the last_unlink_trans values and falling back to
      transaction commits when needed. Also, when logging other inodes (for
      case 1 we logged descendants of the fsync target inode while for case 2
      we logged ascendants) we need to care about concurrent tasks updating
      the last_unlink_trans of inodes we are logging (which was already an
      existing problem in check_parent_dirs_for_sync()). Since we can not
      acquire their inode mutex (vfs' struct inode ->i_mutex), as that causes
      deadlocks with other concurrent operations that acquire the i_mutex of
      2 inodes (other fsyncs or renames for example), we need to serialize on
      the log_mutex of the inode we are logging. A task setting a new value for
      an inode's last_unlink_trans must acquire the inode's log_mutex and it
      must do this update before doing the actual unlink operation (which is
      already the case except when deleting a snapshot). Conversely the task
      logging the inode must first log the inode and then check the inode's
      last_unlink_trans value while holding its log_mutex, as if its value is
      not greater than the id of the last committed transaction it means it
      logged a safe state of the inode's items, while if its value is not
      smaller than the id of the last committed transaction it means the inode
      state it has logged might not be safe (the concurrent task might have
      just updated last_unlink_trans but hasn't done yet the unlink operation)
      and therefore a transaction commit must be done.
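
The decision rule above can be sketched in a few lines. This is a simplified model under stated assumptions: `struct logged_inode` and `needs_transaction_commit()` are hypothetical names, and the log_mutex that the kernel holds while reading the value is only mentioned in the comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the per-inode state in the message. */
struct logged_inode {
    uint64_t last_unlink_trans;   /* transaction id of the last unlink/rename */
};

/*
 * Called after the inode has been logged (in the kernel, while holding the
 * inode's log_mutex). Returns true when the log alone is not safe and a full
 * transaction commit is needed instead.
 */
static bool needs_transaction_commit(const struct logged_inode *inode,
                                     uint64_t last_committed_transid)
{
    return inode->last_unlink_trans > last_committed_transid;
}
```

The order matters: the inode is logged first and the value is checked afterwards, so a concurrent update to last_unlink_trans cannot be missed.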
      
      Test cases for xfstests follow in separate patches.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: fix unreplayable log after snapshot delete + parent dir fsync · 1ec9a1ae
      Committed by Filipe Manana
      If we delete a snapshot, fsync its parent directory and crash/power fail
      before the next transaction commit, on the next mount when we attempt to
      replay the log tree of the root containing the parent directory we will
      fail and prevent the filesystem from mounting, which is solvable by wiping
      out the log trees with the btrfs-zero-log tool but very inconvenient as
      we will lose any data and metadata fsynced before the parent directory
      was fsynced.
      
      For example:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
        $ mkdir /mnt/testdir
        $ btrfs subvolume snapshot /mnt /mnt/testdir/snap
        $ btrfs subvolume delete /mnt/testdir/snap
        $ xfs_io -c "fsync" /mnt/testdir
        < crash / power failure and reboot >
        $ mount /dev/sdc /mnt
        mount: mount(2) failed: No such file or directory
      
      And in dmesg/syslog we get the following message and trace:
      
      [192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257
      [192066.363010] ------------[ cut here ]------------
      [192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x17a/0x354 [btrfs]()
      [192066.367250] BTRFS: Transaction aborted (error -2)
      [192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev sha256_generic xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq tpm_tis aes_x86_64 tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 psmouse lrw parport i2c_core pcspkr gf128mul processor serio_raw glue_helper button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
      [192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: G        W       4.4.0-rc6-btrfs-next-20+ #1
      [192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [192066.380889]  0000000000000000 ffff880143923670 ffffffff81257570 ffff8801439236b8
      [192066.382561]  ffff8801439236a8 ffffffff8104ec07 ffffffffa039dc2c 00000000fffffffe
      [192066.384191]  ffff8801ed31d000 ffff8801b9fc9c88 ffff8801086875e0 ffff880143923710
      [192066.385827] Call Trace:
      [192066.386373]  [<ffffffff81257570>] dump_stack+0x4e/0x79
      [192066.387387]  [<ffffffff8104ec07>] warn_slowpath_common+0x99/0xb2
      [192066.388429]  [<ffffffffa039dc2c>] ? __btrfs_unlink_inode+0x17a/0x354 [btrfs]
      [192066.389236]  [<ffffffff8104ec68>] warn_slowpath_fmt+0x48/0x50
      [192066.389884]  [<ffffffffa039dc2c>] __btrfs_unlink_inode+0x17a/0x354 [btrfs]
      [192066.390621]  [<ffffffff81184b55>] ? iput+0xb0/0x266
      [192066.391200]  [<ffffffffa039ea25>] btrfs_unlink_inode+0x1c/0x3d [btrfs]
      [192066.391930]  [<ffffffffa03ca623>] check_item_in_log+0x1fe/0x29b [btrfs]
      [192066.392715]  [<ffffffffa03ca827>] replay_dir_deletes+0x167/0x1cf [btrfs]
      [192066.393510]  [<ffffffffa03cccc7>] replay_one_buffer+0x417/0x570 [btrfs]
      [192066.394241]  [<ffffffffa03ca164>] walk_up_log_tree+0x10e/0x1dc [btrfs]
      [192066.394958]  [<ffffffffa03cac72>] walk_log_tree+0xa5/0x190 [btrfs]
      [192066.395628]  [<ffffffffa03ce8b8>] btrfs_recover_log_trees+0x239/0x32c [btrfs]
      [192066.396790]  [<ffffffffa03cc8b0>] ? replay_one_extent+0x50a/0x50a [btrfs]
      [192066.397891]  [<ffffffffa0394041>] open_ctree+0x1d8b/0x2167 [btrfs]
      [192066.398897]  [<ffffffffa03706e1>] btrfs_mount+0x5ef/0x729 [btrfs]
      [192066.399823]  [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf
      [192066.400739]  [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3
      [192066.401700]  [<ffffffff811714b9>] mount_fs+0x67/0x131
      [192066.402482]  [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde
      [192066.403930]  [<ffffffffa03702bd>] btrfs_mount+0x1cb/0x729 [btrfs]
      [192066.404831]  [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf
      [192066.405726]  [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3
      [192066.406621]  [<ffffffff811714b9>] mount_fs+0x67/0x131
      [192066.407401]  [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde
      [192066.408247]  [<ffffffff8118ae36>] do_mount+0x893/0x9d2
      [192066.409047]  [<ffffffff8113009b>] ? strndup_user+0x3f/0x8c
      [192066.409842]  [<ffffffff8118b187>] SyS_mount+0x75/0xa1
      [192066.410621]  [<ffffffff8147e517>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [192066.411572] ---[ end trace 2de42126c1e0a0f0 ]---
      [192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: errno=-2 No such entry
      [192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 No such entry (Failed to recover log tree)
      [192066.415458] BTRFS error (device dm-0): cleaner transaction attach returned -30
      [192066.444613] BTRFS: open_ctree failed
      
      This happens because when we are replaying the log and processing the
      directory entry pointing to the snapshot in the subvolume tree, we treat
      its btrfs_dir_item item as having a location with a key type matching
      BTRFS_INODE_ITEM_KEY, which is wrong because the type matches
      BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the
      object id refers to a root number and not to an inode in the root
      containing the parent directory.
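
The distinction between the two key types can be sketched as below. The two key type values are those of the btrfs on-disk format; `enum dir_target` and `classify_dir_entry()` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Key type values as defined by the btrfs on-disk format. */
#define BTRFS_INODE_ITEM_KEY  1
#define BTRFS_ROOT_ITEM_KEY   132

enum dir_target { TARGET_INODE, TARGET_SUBVOLUME, TARGET_UNKNOWN };

/* Hypothetical classifier for the location key of a directory entry. */
static enum dir_target classify_dir_entry(uint8_t location_key_type)
{
    switch (location_key_type) {
    case BTRFS_INODE_ITEM_KEY:
        return TARGET_INODE;      /* objectid is an inode number in this root */
    case BTRFS_ROOT_ITEM_KEY:
        return TARGET_SUBVOLUME;  /* objectid is a root (subvolume) id */
    default:
        return TARGET_UNKNOWN;
    }
}
```

A directory entry that points to a snapshot falls into the second case, which is why treating it as an inode during log replay fails.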
      
      So fix this by triggering a transaction commit if an fsync against the
      parent directory is requested after deleting a snapshot. This is the
      simplest approach for a rare use case. An alternative that avoids the
      transaction commit would require more code to explicitly delete the
      snapshot at log replay time (factoring out common code from ioctl.c:
      btrfs_ioctl_snap_destroy()), special care at fsync time to remove the
      log tree of the snapshot's root from the log root of the root of tree
      roots, amongst other steps.
      
      A test case for xfstests that triggers the issue follows.
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            cd /
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_dm_target flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create a snapshot at the root of our filesystem (mount point path), delete it,
        # fsync the mount point path, crash and mount to replay the log. This should
        # succeed and after the filesystem is mounted the snapshot should not be visible
        # anymore.
        _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1
        _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT
        _flakey_drop_and_remount
        [ -e $SCRATCH_MNT/snap1 ] && \
            echo "Snapshot snap1 still exists after log replay"
      
        # Similar scenario as above, but this time the snapshot is created inside a
        # directory and not directly under the root (mount point path).
        mkdir $SCRATCH_MNT/testdir
        _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2
        _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir
        _flakey_drop_and_remount
        [ -e $SCRATCH_MNT/testdir/snap2 ] && \
            echo "Snapshot snap2 still exists after log replay"
      
        _unmount_flakey
      
        echo "Silence is golden"
        status=0
        exit
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Tested-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  5. 23 Feb, 2016 3 commits
  6. 18 Feb, 2016 2 commits
  7. 11 Feb, 2016 2 commits
  8. 04 Feb, 2016 2 commits
    • Btrfs: fix page reading in extent_same ioctl leading to csum errors · 31314002
      Committed by Filipe Manana
      In the extent_same ioctl, we were grabbing the pages (locked) and
      attempting to read them without bothering about any concurrent IO
      against them. That is, we were not checking for any ongoing ordered
      extents nor waiting for them to complete, which leads to a race where
      the extent_same() code gets a checksum verification error when it
      reads the pages, producing a message like the following in dmesg
      and making the operation fail to user space with -ENOMEM:
      
      [18990.161265] BTRFS warning (device sdc): csum failed ino 259 off 495616 csum 685204116 expected csum 1515870868
      
      Fix this by using btrfs_readpage() for reading the pages instead of
      extent_read_full_page_nolock(), which waits for any concurrent ordered
      extents to complete and locks the io range. Also do better error handling
      and don't treat all failures as -ENOMEM, as that's clearly misleading,
      so that the error handling becomes identical to the checks and operation
      of prepare_uptodate_page().
      
      The use of extent_read_full_page_nolock() was required before
      commit f4414602 ("btrfs: fix deadlock with extent-same and readpage"),
      as we had the range locked in an inode's io tree before attempting to
      read the pages.
      
      Fixes: f4414602 ("btrfs: fix deadlock with extent-same and readpage")
      Cc: stable@vger.kernel.org   # 4.2+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
    • Btrfs: fix invalid page accesses in extent_same (dedup) ioctl · e0bd70c6
      Committed by Filipe Manana
      In the extent_same ioctl we are getting the pages for the source and
      target ranges and unlocking them immediately after, which is incorrect
      because later we attempt to map them (with kmap_atomic) and access their
      contents at btrfs_cmp_data(). When we do such access the pages might have
      been relocated or removed from memory, which leads to an invalid memory
      access. This issue is detected on a kernel with CONFIG_DEBUG_PAGEALLOC=y
      which produces a trace like the following:
      
      [186736.677437] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [186736.680382] Modules linked in: btrfs dm_flakey dm_mod ppdev xor raid6_pq sha256_generic hmac drbg ansi_cprng acpi_cpufreq evdev sg aesni_intel aes_x86_64 parport_pc ablk_helper tpm_tis psmouse parport i2c_piix4 tpm cryptd i2c_core lrw processor button serio_raw pcspkr gf128mul glue_helper loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
      [186736.681319] CPU: 13 PID: 10222 Comm: duperemove Tainted: G        W       4.4.0-rc6-btrfs-next-18+ #1
      [186736.681319] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [186736.681319] task: ffff880132600400 ti: ffff880362284000 task.ti: ffff880362284000
      [186736.681319] RIP: 0010:[<ffffffff81264d00>]  [<ffffffff81264d00>] memcmp+0xb/0x22
      [186736.681319] RSP: 0018:ffff880362287d70  EFLAGS: 00010287
      [186736.681319] RAX: 000002c002468acf RBX: 0000000012345678 RCX: 0000000000000000
      [186736.681319] RDX: 0000000000001000 RSI: 0005d129c5cf9000 RDI: 0005d129c5cf9000
      [186736.681319] RBP: ffff880362287d70 R08: 0000000000000000 R09: 0000000000001000
      [186736.681319] R10: ffff880000000000 R11: 0000000000000476 R12: 0000000000001000
      [186736.681319] R13: ffff8802f91d4c88 R14: ffff8801f2a77830 R15: ffff880352e83e40
      [186736.681319] FS:  00007f27b37fe700(0000) GS:ffff88043dda0000(0000) knlGS:0000000000000000
      [186736.681319] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [186736.681319] CR2: 00007f27a406a000 CR3: 0000000217421000 CR4: 00000000001406e0
      [186736.681319] Stack:
      [186736.681319]  ffff880362287ea0 ffffffffa048d0bd 000000000009f000 0000000000001000
      [186736.681319]  0100000000000000 ffff8801f2a77850 ffff8802f91d49b0 ffff880132600400
      [186736.681319]  00000000000004f8 ffff8801c1efbe41 0000000000000000 0000000000000038
      [186736.681319] Call Trace:
      [186736.681319]  [<ffffffffa048d0bd>] btrfs_ioctl+0x24cb/0x2731 [btrfs]
      [186736.681319]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
      [186736.681319]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
      [186736.681319]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
      [186736.681319]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
      [186736.681319]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
      [186736.681319]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      [186736.681319] Code: 0a 3c 6e 74 0d 3c 79 74 04 3c 59 75 0c c6 06 01 eb 03 c6 06 00 31 c0 eb 05 b8 ea ff ff ff 5d c3 55 31 c9 48 89 e5 48 39 d1 74 13 <0f> b6 04 0f 44 0f b6 04 0e 48 ff c1 44 29 c0 74 ea eb 02 31 c0
      
      (gdb) list *(btrfs_ioctl+0x24cb)
      0x5e0e1 is in btrfs_ioctl (fs/btrfs/ioctl.c:2972).
      2967                    dst_addr = kmap_atomic(dst_page);
      2968
      2969                    flush_dcache_page(src_page);
      2970                    flush_dcache_page(dst_page);
      2971
      2972                    if (memcmp(addr, dst_addr, cmp_len))
      2973                            ret = BTRFS_SAME_DATA_DIFFERS;
      2974
      2975                    kunmap_atomic(addr);
      2976                    kunmap_atomic(dst_addr);
      
      So fix this by making sure we keep the pages locked and respect the same
      locking order as everywhere else: get and lock the pages first and then
      lock the range in the inode's io tree (like for example at
      __btrfs_buffered_write() and extent_readpages()). If an ordered extent
      is found after locking the range in the io tree, unlock the range,
      unlock the pages, wait for the ordered extent to complete and repeat the
      entire locking process until no overlapping ordered extents are found.
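
The retry loop described above can be sketched as a small model. This is not the kernel code: `struct range_state` and `lock_pages_then_range()` are hypothetical, locks are modelled as flags, and waiting for an ordered extent is modelled by decrementing a counter.

```c
#include <assert.h>

/* Hypothetical model of the state around one dedup range. */
struct range_state {
    int pending_ordered;   /* ordered extents still overlapping the range */
    int pages_locked;
    int range_locked;
};

/* Returns how many attempts were needed before both locks could be kept. */
static int lock_pages_then_range(struct range_state *s)
{
    int attempts = 0;

    for (;;) {
        attempts++;
        s->pages_locked = 1;      /* 1) get and lock the pages first */
        s->range_locked = 1;      /* 2) then lock the range in the io tree */
        if (s->pending_ordered == 0)
            return attempts;      /* no ordered extent: keep both locks */
        s->range_locked = 0;      /* unlock the range, then the pages */
        s->pages_locked = 0;
        s->pending_ordered--;     /* stand-in for waiting on one ordered extent */
    }
}
```

The point of the model is the ordering: pages are always locked before the range, and both are released before waiting, so the loop never waits while holding a lock.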
      
      Cc: stable@vger.kernel.org   # 4.2+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
  9. 02 Feb, 2016 1 commit
  10. 30 Jan, 2016 1 commit
  11. 23 Jan, 2016 1 commit
    • wrappers for ->i_mutex access · 5955102c
      Committed by Al Viro
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
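
The wrapper pattern can be sketched in user space as follows. The `struct mutex`, `struct inode` and `mutex_*()` definitions here are minimal stand-ins written only so the example compiles; they are not the kernel types, and only three of the wrappers are shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal user-space stand-ins so the sketch compiles; not the kernel types. */
struct mutex { bool locked; };
struct inode { struct mutex i_mutex; };

static void mutex_lock(struct mutex *m)      { m->locked = true; }
static void mutex_unlock(struct mutex *m)    { m->locked = false; }
static bool mutex_is_locked(struct mutex *m) { return m->locked; }

/* The wrappers: inode_foo(inode) is mutex_foo(&inode->i_mutex). */
static inline void inode_lock(struct inode *inode)
{
    mutex_lock(&inode->i_mutex);
}

static inline void inode_unlock(struct inode *inode)
{
    mutex_unlock(&inode->i_mutex);
}

static inline bool inode_is_locked(struct inode *inode)
{
    return mutex_is_locked(&inode->i_mutex);
}
```

Callers that go through the wrappers do not name the lock type, which is what allows ->i_mutex to be replaced later without touching them.
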
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  12. 22 Jan, 2016 1 commit
  13. 16 Jan, 2016 1 commit
    • Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots · f32e48e9
      Committed by Chandan Rajendra
      The following call trace is seen when the btrfs/031 test is executed in a loop:
      
      [  158.661848] ------------[ cut here ]------------
      [  158.662634] WARNING: CPU: 2 PID: 890 at /home/chandan/repos/linux/fs/btrfs/ioctl.c:558 create_subvol+0x3d1/0x6ea()
      [  158.664102] BTRFS: Transaction aborted (error -2)
      [  158.664774] Modules linked in:
      [  158.665266] CPU: 2 PID: 890 Comm: btrfs Not tainted 4.4.0-rc6-g511711af #2
      [  158.666251] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      [  158.667392]  ffffffff81c0a6b0 ffff8806c7c4f8e8 ffffffff81431fc8 ffff8806c7c4f930
      [  158.668515]  ffff8806c7c4f920 ffffffff81051aa1 ffff880c85aff000 ffff8800bb44d000
      [  158.669647]  ffff8808863b5c98 0000000000000000 00000000fffffffe ffff8806c7c4f980
      [  158.670769] Call Trace:
      [  158.671153]  [<ffffffff81431fc8>] dump_stack+0x44/0x5c
      [  158.671884]  [<ffffffff81051aa1>] warn_slowpath_common+0x81/0xc0
      [  158.672769]  [<ffffffff81051b27>] warn_slowpath_fmt+0x47/0x50
      [  158.673620]  [<ffffffff813bc98d>] create_subvol+0x3d1/0x6ea
      [  158.674440]  [<ffffffff813777c9>] btrfs_mksubvol.isra.30+0x369/0x520
      [  158.675376]  [<ffffffff8108a4aa>] ? percpu_down_read+0x1a/0x50
      [  158.676235]  [<ffffffff81377a81>] btrfs_ioctl_snap_create_transid+0x101/0x180
      [  158.677268]  [<ffffffff81377b52>] btrfs_ioctl_snap_create+0x52/0x70
      [  158.678183]  [<ffffffff8137afb4>] btrfs_ioctl+0x474/0x2f90
      [  158.678975]  [<ffffffff81144b8e>] ? vma_merge+0xee/0x300
      [  158.679751]  [<ffffffff8115be31>] ? alloc_pages_vma+0x91/0x170
      [  158.680599]  [<ffffffff81123f62>] ? lru_cache_add_active_or_unevictable+0x22/0x70
      [  158.681686]  [<ffffffff813d99cf>] ? selinux_file_ioctl+0xff/0x1d0
      [  158.682581]  [<ffffffff8117b791>] do_vfs_ioctl+0x2c1/0x490
      [  158.683399]  [<ffffffff813d3cde>] ? security_file_ioctl+0x3e/0x60
      [  158.684297]  [<ffffffff8117b9d4>] SyS_ioctl+0x74/0x80
      [  158.685051]  [<ffffffff819b2bd7>] entry_SYSCALL_64_fastpath+0x12/0x6a
      [  158.685958] ---[ end trace 4b63312de5a2cb76 ]---
      [  158.686647] BTRFS: error (device loop0) in create_subvol:558: errno=-2 No such entry
      [  158.709508] BTRFS info (device loop0): forced readonly
      [  158.737113] BTRFS info (device loop0): disk space caching is enabled
      [  158.738096] BTRFS error (device loop0): Remounting read-write after error is not allowed
      [  158.851303] BTRFS error (device loop0): cleaner transaction attach returned -30
      
      This occurs because of the following sequence:
      
      Mount filesystem
      Create subvol with ID 257
      Unmount filesystem
      Mount filesystem
      Delete subvol with ID 257
        btrfs_drop_snapshot()
          Add root corresponding to subvol 257 into
          btrfs_transaction->dropped_roots list
      Create new subvol (i.e. create_subvol())
        257 is returned as the next free objectid
        btrfs_read_fs_root_no_name()
          Finds the btrfs_root instance corresponding to the old subvol with ID 257
          in btrfs_fs_info->fs_roots_radix.
          Returns error since btrfs_root_item->refs has the value of 0.
      
      To fix the issue, this commit initializes the tree root's and subvolume
      roots' highest_objectid when loading the roots from disk.
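
The effect of the fix can be sketched with a simplified allocator model. `struct root_model`, `init_highest_objectid()` and `next_free_objectid()` are hypothetical names; the only value taken from btrfs is BTRFS_FIRST_FREE_OBJECTID (256), and the model assumes the allocator simply hands out highest_objectid + 1.

```c
#include <assert.h>
#include <stdint.h>

#define BTRFS_FIRST_FREE_OBJECTID 256ULL

/* Hypothetical, simplified model of a root's objectid allocator. */
struct root_model {
    uint64_t highest_objectid;
};

/* Model of loading a root: start from the highest objectid found on disk. */
static void init_highest_objectid(struct root_model *r, uint64_t highest_on_disk)
{
    r->highest_objectid = highest_on_disk < BTRFS_FIRST_FREE_OBJECTID
                              ? BTRFS_FIRST_FREE_OBJECTID
                              : highest_on_disk;
}

/* Hands out the next unused objectid. */
static uint64_t next_free_objectid(struct root_model *r)
{
    return ++r->highest_objectid;
}
```

In this model, a root loaded with 257 already on disk hands out 258 next, whereas a stale value of 256 would hand out 257 again, which is the reuse described in the sequence above.
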
      Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  14. 07 Jan, 2016 6 commits
  15. 01 Jan, 2016 1 commit
  16. 08 Dec, 2015 1 commit
    • vfs: pull btrfs clone API to vfs layer · 04b38d60
      Committed by Christoph Hellwig
      The btrfs clone ioctls are now adopted by other file systems, with NFS
      and CIFS already having support for them, and XFS being under active
      development.  To avoid growth of various slightly incompatible
      implementations, add one to the VFS.  Note that clones are different from
      file copies in several ways:
      
       - they are atomic vs other writers
       - they support whole file clones
       - they support 64-bit length clones
       - they do not allow partial success (aka short writes)
       - clones are expected to be a fast metadata operation
      
      Because of that it would be rather cumbersome to try to piggyback them on
      top of the recent copy_file_range infrastructure.  The converse isn't
      true and the copy_file_range system call could try clone file range as
      a first attempt to copy, something that further patches will enable.
      
      Based on earlier work from Peng Tao.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  17. 03 Dec, 2015 2 commits
  18. 02 Dec, 2015 1 commit
  19. 27 Oct, 2015 1 commit
    • btrfs: check unsupported filters in balance arguments · 849ef928
      Committed by David Sterba
      We don't verify that all the balance filter arguments supplemented by
      the flags are actually known to the kernel. Thus we let it silently pass
      and do nothing.
      
      At the moment this means only the 'limit' filter, but we're going to add
      a few more soon so it's better to have that fixed. This should also go to
      older stable kernels so that they work with newer userspace tools.
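
The kind of check this commit describes can be sketched as a mask test. The flag names and bit values below are hypothetical and chosen only for illustration; they are not the kernel's balance filter definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration; not the kernel's actual values. */
#define FILTER_A      (1ULL << 0)
#define FILTER_B      (1ULL << 1)
#define FILTER_LIMIT  (1ULL << 2)
#define KNOWN_FILTERS (FILTER_A | FILTER_B | FILTER_LIMIT)

/* Reject any request that sets a filter bit this implementation does not know. */
static int check_filter_flags(uint64_t flags)
{
    if (flags & ~KNOWN_FILTERS)
        return -EINVAL;
    return 0;
}
```

Rejecting unknown bits lets newer userspace tools detect that a filter is unsupported instead of having it silently ignored.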
      
      Cc: stable@vger.kernel.org # 3.16+
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  20. 26 Oct, 2015 1 commit
    • Btrfs: fix regression running delayed references when using qgroups · b06c4bf5
      Committed by Filipe Manana
      In the kernel 4.2 merge window we had big changes to the implementation
      of delayed references and qgroups which made the no_quota field of delayed
      references not used anymore. More specifically the no_quota field is not
      used anymore as of:
      
        commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.")
      
      Leaving the no_quota field actually prevents delayed references from
      getting merged, which in turn causes the following BUG_ON(), at
      fs/btrfs/extent-tree.c, to be hit when qgroups are enabled:
      
        static int run_delayed_tree_ref(...)
        {
           (...)
           BUG_ON(node->ref_mod != 1);
           (...)
        }
      
      This happens in a scenario like the following:
      
        1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
      
        2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota.
      
        3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref2 is incompatible
           due to Ref2->no_quota != Ref3->no_quota.
      
        4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added.
           It's not merged with the reference at the tail of the list of refs
           for bytenr X because the reference at the tail, Ref3 is incompatible
           due to Ref3->no_quota != Ref4->no_quota.
      
        5) We run delayed references, trigger merging of delayed references,
           through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs().
      
        6) Ref1 and Ref3 are merged as Ref1->no_quota == Ref3->no_quota and
           all other conditions are satisfied too. So Ref1 gets a ref_mod
           value of 2.

        7) Ref2 and Ref4 are merged as Ref2->no_quota == Ref4->no_quota and
           all other conditions are satisfied too. So Ref2 gets a ref_mod
           value of 2.
      
        8) Ref1 and Ref2 aren't merged, because they have different values
           for their no_quota field.
      
        9) Delayed reference Ref1 is picked for running (select_delayed_ref()
           always prefers references with an action == BTRFS_ADD_DELAYED_REF).
           So run_delayed_tree_ref() is called for Ref1 which triggers the
           BUG_ON because Ref1->ref_mod != 1 (it equals 2).
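
      The scenario above can be replayed with a small user-space model. This
      is only a sketch: struct toy_ref, merge_into() and head_add_ref_mod()
      are hypothetical names, and the merge pass is far simpler than
      btrfs_merge_delayed_refs(), which checks more conditions than no_quota.
      It shows that comparing no_quota leaves an ADD ref with ref_mod == 2,
      while ignoring it lets the four refs cancel out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ref_action { ADD_REF, DROP_REF };

/* Hypothetical, much reduced stand-in for a delayed reference node. */
struct toy_ref {
	enum ref_action action;
	int no_quota;
	int ref_mod;	/* 0 means the ref was merged away */
};

/* Signed contribution of a ref: adds count up, drops count down. */
static int signed_mod(const struct toy_ref *r)
{
	return r->action == ADD_REF ? r->ref_mod : -r->ref_mod;
}

/* Fold src into dst and mark src as consumed. */
static void merge_into(struct toy_ref *dst, struct toy_ref *src)
{
	int total;

	assert(dst != src);
	total = signed_mod(dst) + signed_mod(src);
	dst->action = total >= 0 ? ADD_REF : DROP_REF;
	dst->ref_mod = total >= 0 ? total : -total;
	src->ref_mod = 0;
}

/*
 * Build Ref1..Ref4 from the scenario, run one merge pass and return the
 * ref_mod of the first remaining ADD ref (0 if none is left).
 */
int head_add_ref_mod(bool compare_no_quota)
{
	struct toy_ref refs[] = {
		{ ADD_REF,  1, 1 },	/* Ref1 */
		{ DROP_REF, 0, 1 },	/* Ref2 */
		{ ADD_REF,  1, 1 },	/* Ref3 */
		{ DROP_REF, 0, 1 },	/* Ref4 */
	};
	size_t n = sizeof(refs) / sizeof(refs[0]);

	for (size_t i = 0; i < n; i++) {
		for (size_t j = i + 1; j < n && refs[i].ref_mod != 0; j++) {
			if (refs[j].ref_mod == 0)
				continue;	/* already merged away */
			if (compare_no_quota &&
			    refs[i].no_quota != refs[j].no_quota)
				continue;	/* treated as incompatible */
			merge_into(&refs[i], &refs[j]);
		}
	}

	for (size_t i = 0; i < n; i++)
		if (refs[i].action == ADD_REF && refs[i].ref_mod != 0)
			return refs[i].ref_mod;
	return 0;
}
```

      In this model head_add_ref_mod(true) corresponds to the pre-fix
      behaviour that trips BUG_ON(node->ref_mod != 1), and
      head_add_ref_mod(false) to merging without the no_quota comparison.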
      
      So fix this by removing the no_quota field, as it's not used anymore as
      of commit 0ed4792a ("btrfs: qgroup: Switch to new extent-oriented
      qgroup mechanism.").
      
      The use of no_quota was also buggy in at least two places:
      
      1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting
         no_quota to 0 instead of 1 when the following condition was true:
         is_fstree(ref_root) || !fs_info->quota_enabled
      
      2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to
         reset a node's no_quota when the condition "!is_fstree(root_objectid)
         || !root->fs_info->quota_enabled" was true but we did it only in
         an unused local stack variable, that is, we never reset the no_quota
         value in the node itself.
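
      The second problem is the classic "reset a local copy" mistake. A
      minimal illustration, using a hypothetical struct toy_node rather than
      the real delayed ref node, and not the code from
      __btrfs_inc_extent_ref():

```c
#include <assert.h>

/* Hypothetical stand-in for a delayed ref node; only no_quota matters. */
struct toy_node {
	int no_quota;
};

/* Buggy pattern: the reset only touches a local copy of the field. */
static void reset_local_copy(const struct toy_node *node)
{
	int no_quota = node->no_quota;

	no_quota = 0;
	(void)no_quota;	/* never written back to the node */
}

/* What the reset would have needed to do to take effect. */
static void reset_in_node(struct toy_node *node)
{
	node->no_quota = 0;
}

/* Returns the node's no_quota after the local-only reset. */
int no_quota_after_local_reset(void)
{
	struct toy_node node = { .no_quota = 1 };

	reset_local_copy(&node);
	return node.no_quota;
}

/* Returns the node's no_quota after resetting the field itself. */
int no_quota_after_node_reset(void)
{
	struct toy_node node = { .no_quota = 1 };

	reset_in_node(&node);
	assert(node.no_quota == 0);
	return node.no_quota;
}
```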
      
      This fixes the remainder of problems several people have been having when
      running delayed references, mostly while a balance is running in parallel,
      on a 4.2+ kernel.
      
      Very special thanks to Stéphane Lesimple for helping debug this issue
      and testing this fix on his multi-terabyte filesystem (which took more
      than one day to balance alone, plus fsck, etc.).
      
      Also, this fixes a deadlock issue when using the clone ioctl with qgroups
      enabled, as reported by Elias Probst on the mailing list. The deadlock
      happens because after calling btrfs_insert_empty_item() our path holds
      a write lock on a leaf of the fs/subvol tree, and before releasing the
      path we called check_ref(), which does backref walking when qgroups are
      enabled and tries to read lock the same leaf. The trace for this case is
      the following:
      
        INFO: task systemd-nspawn:6095 blocked for more than 120 seconds.
        (...)
        Call Trace:
          [<ffffffff86999201>] schedule+0x74/0x83
          [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea
          [<ffffffff86137ed7>] ? wait_woken+0x74/0x74
          [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810
          [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce
          [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127
          [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667
          [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe
          [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6
          [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0
          [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65
          [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88
          [<ffffffff863e852e>] check_ref+0x64/0xc4
          [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d
          [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb
          [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77
          [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb
        (...)
      
      The problem goes away by eliminating check_ref(), which is no longer
      needed as its purpose was to get a value for the no_quota field of
      a delayed reference (this patch removes the no_quota field as mentioned
      earlier).
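
      The self-deadlock pattern can be sketched with a toy non-recursive
      reader/writer lock. struct toy_rwlock and its helpers are hypothetical
      and use try-lock semantics so the example terminates; the real
      btrfs_tree_read_lock() blocks instead of returning, which is what the
      hung task trace shows. The second helper is only there for contrast,
      since the patch removes the read lock attempt altogether.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical non-recursive reader/writer lock; not the btrfs tree lock. */
struct toy_rwlock {
	int readers;
	bool writer;
};

static bool try_write_lock(struct toy_rwlock *l)
{
	if (l->writer || l->readers > 0)
		return false;
	l->writer = true;
	return true;
}

static bool try_read_lock(struct toy_rwlock *l)
{
	if (l->writer)
		return false;	/* a blocking lock would sleep here */
	l->readers++;
	return true;
}

static void write_unlock(struct toy_rwlock *l)
{
	assert(l->writer);
	l->writer = false;
}

/* Same task: write lock the leaf, then try to read lock it again. */
bool read_lock_succeeds_while_write_locked(void)
{
	struct toy_rwlock leaf = { 0, false };
	bool locked = try_write_lock(&leaf);

	assert(locked);
	(void)locked;
	return try_read_lock(&leaf);
}

/* Releasing the write lock first lets the read lock go through. */
bool read_lock_succeeds_after_release(void)
{
	struct toy_rwlock leaf = { 0, false };
	bool locked = try_write_lock(&leaf);

	assert(locked);
	(void)locked;
	write_unlock(&leaf);
	return try_read_lock(&leaf);
}
```
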
      Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
      Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr>
      Reported-by: Elias Probst <mail@eliasprobst.eu>
      Reported-by: Peter Becker <floyd.net@gmail.com>
      Reported-by: Malte Schröder <malte@tnxip.de>
      Reported-by: Derek Dongray <derek@valedon.co.uk>
      Reported-by: Erkki Seppala <flux-btrfs@inside.org>
      Cc: stable@vger.kernel.org  # 4.2+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
  21. 22 Oct, 2015 4 commits
  22. 14 Oct, 2015 2 commits
    • F
      Btrfs: fix file corruption and data loss after cloning inline extents · 8039d87d
      Filipe Manana authored
      Currently the clone ioctl allows cloning an inline extent from one file
      to another that already has other (non-inlined) extents. This is a problem
      because btrfs is not designed to deal with files having both inline and
      regular extents: if a file has an inline extent then it must be the only
      extent in the file and must start at file offset 0. Having a file with an
      inline extent followed by regular extents results in EIO errors when doing
      reads or writes against the first 4K of the file.
      
      Also, the clone ioctl allows one to lose data if the source file consists
      of a single inline extent, with a size of N bytes, and the destination
      file consists of a single inline extent with a size of M bytes, where we
      have M > N. In this case the clone operation removes the inline extent
      from the destination file and then copies the inline extent from the
      source file into the destination file. We lose the last M - N bytes of
      the destination file: a read operation gets the value 0x00 for any byte
      in the range [N, M) (the destination inode's i_size remains M, which is
      why we can read past N bytes).
      
      So fix this by not allowing such destructive operations to happen and
      return errno EOPNOTSUPP to user space.
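
      As a rough model of which destinations are rejected, the outcomes
      expected by the test case in this message (foo through foo8, with a
      50 byte source inline extent) are consistent with the check below.
      This is a sketch under that assumption, not the kernel implementation:
      toy_clone_inline_check() and its parameters are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return 0 if cloning a source inline extent of src_inline_size bytes to
 * offset 0 of the destination is acceptable in this model, or -EOPNOTSUPP.
 *
 * dst_size:                  i_size of the destination file in bytes
 * dst_has_non_inline_extent: destination has a regular or prealloc extent
 */
int toy_clone_inline_check(uint64_t dst_size, bool dst_has_non_inline_extent,
			   uint64_t src_inline_size)
{
	/* The model assumes a non-empty source inline extent. */
	assert(src_inline_size > 0);

	/* An inline extent must be the only extent in the file. */
	if (dst_has_non_inline_extent)
		return -EOPNOTSUPP;

	/* Replacing M bytes with N < M bytes would zero the range [N, M). */
	if (dst_size > src_inline_size)
		return -EOPNOTSUPP;

	return 0;
}
```

      For example, a destination with a 40 byte inline extent passes
      (40 <= 50), while one with a 60 byte inline extent, or with any
      prealloc extent, is refused.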
      
      Currently the fstest btrfs/035 exercises the data loss case but it totally
      ignores it, i.e. it expects the operation to succeed and does not check
      that we got data loss.
      
      The following test case for fstests exercises all these cases that result
      in file corruption and data loss:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
      
        # real QA test starts here
        _need_to_be_root
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_cloner
        _require_btrfs_fs_feature "no_holes"
        _require_btrfs_mkfs_feature "no-holes"
      
        rm -f $seqres.full
      
        test_cloning_inline_extents()
        {
            local mkfs_opts=$1
            local mount_opts=$2
      
            _scratch_mkfs $mkfs_opts >>$seqres.full 2>&1
            _scratch_mount $mount_opts
      
            # File bar, the source for all the following clone operations, consists
            # of a single inline extent (50 bytes).
            $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \
                | _filter_xfs_io
      
            # Test cloning into a file with an extent (non-inlined) where the
            # destination offset overlaps that extent. It should not be possible to
            # clone the inline extent from file bar into this file.
            $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \
                | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo
      
            # Doing IO against any range in the first 4K of the file should work.
            # Due to a past clone ioctl bug which allowed cloning the inline extent,
            # these operations resulted in EIO errors.
            echo "File foo data after clone operation:"
            # All bytes should have the value 0xaa (clone operation failed and did
            # not modify our file).
            od -t x1 $SCRATCH_MNT/foo
            $XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io
      
            # Test cloning the inline extent against a file which has a hole in its
            # first 4K followed by a non-inlined extent. It should also not be
            # possible to clone the inline extent from file bar into this file.
            $XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \
                | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2
      
            # Doing IO against any range in the first 4K of the file should work.
            # Due to a past clone ioctl bug which allowed cloning the inline extent,
            # these operations resulted in EIO errors.
            echo "File foo2 data after clone operation:"
            # All bytes should have the value 0x00 (clone operation failed and did
            # not modify our file).
            od -t x1 $SCRATCH_MNT/foo2
            $XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io
      
            # Test cloning the inline extent against a file which has a size of zero
            # but has a prealloc extent. It should also not be possible to clone
            # the inline extent from file bar into this file.
            $XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3
      
            # Doing IO against any range in the first 4K of the file should work.
            # Due to a past clone ioctl bug which allowed cloning the inline extent,
            # these operations resulted in EIO errors.
            echo "First 50 bytes of foo3 after clone operation:"
            # Should not be able to read any bytes, file has 0 bytes i_size (the
            # clone operation failed and did not modify our file).
            od -t x1 $SCRATCH_MNT/foo3
            $XFS_IO_PROG -c "pwrite -S 0xff 0 90" $SCRATCH_MNT/foo3 | _filter_xfs_io
      
            # Test cloning the inline extent against a file which consists of a
            # single inline extent that has a size not greater than the size of
            # bar's inline extent (40 < 50).
            # It should be possible to do the extent cloning from bar to this file.
            $XFS_IO_PROG -f -c "pwrite -S 0x01 0 40" $SCRATCH_MNT/foo4 \
                | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo4
      
            # Doing IO against any range in the first 4K of the file should work.
            echo "File foo4 data after clone operation:"
            # Must match file bar's content.
            od -t x1 $SCRATCH_MNT/foo4
            $XFS_IO_PROG -c "pwrite -S 0x02 0 90" $SCRATCH_MNT/foo4 | _filter_xfs_io
      
            # Test cloning the inline extent against a file which consists of a
            # single inline extent that has a size greater than the size of bar's
            # inline extent (60 > 50).
            # It should not be possible to clone the inline extent from file bar
            # into this file.
            $XFS_IO_PROG -f -c "pwrite -S 0x03 0 60" $SCRATCH_MNT/foo5 \
                | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo5
      
            # Reading the file should not fail.
            echo "File foo5 data after clone operation:"
            # Must have a size of 60 bytes, with all bytes having a value of 0x03
            # (the clone operation failed and did not modify our file).
            od -t x1 $SCRATCH_MNT/foo5
      
            # Test cloning the inline extent against a file which has no extents but
            # has a size greater than bar's inline extent (16K > 50).
            # It should not be possible to clone the inline extent from file bar
            # into this file.
            $XFS_IO_PROG -f -c "truncate 16K" $SCRATCH_MNT/foo6 | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo6
      
            # Reading the file should not fail.
            echo "File foo6 data after clone operation:"
            # Must have a size of 16K, with all bytes having a value of 0x00 (the
            # clone operation failed and did not modify our file).
            od -t x1 $SCRATCH_MNT/foo6
      
            # Test cloning the inline extent against a file which has no extents but
            # has a size not greater than bar's inline extent (30 < 50).
            # It should be possible to clone the inline extent from file bar into
            # this file.
            $XFS_IO_PROG -f -c "truncate 30" $SCRATCH_MNT/foo7 | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo7
      
            # Reading the file should not fail.
            echo "File foo7 data after clone operation:"
            # Must have a size of 50 bytes, with all bytes having a value of 0xbb.
            od -t x1 $SCRATCH_MNT/foo7
      
            # Test cloning the inline extent against a file which has a size not
            # greater than the size of bar's inline extent (20 < 50) but has
            # a prealloc extent that goes beyond the file's size. It should not be
            # possible to clone the inline extent from bar into this file.
            $XFS_IO_PROG -f -c "falloc -k 0 1M" \
                            -c "pwrite -S 0x88 0 20" \
                            $SCRATCH_MNT/foo8 | _filter_xfs_io
            $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo8
      
            echo "File foo8 data after clone operation:"
            # Must have a size of 20 bytes, with all bytes having a value of 0x88
            # (the clone operation did not modify our file).
            od -t x1 $SCRATCH_MNT/foo8
      
            _scratch_unmount
        }
      
        echo -e "\nTesting without compression and without the no-holes feature...\n"
        test_cloning_inline_extents
      
        echo -e "\nTesting with compression and without the no-holes feature...\n"
        test_cloning_inline_extents "" "-o compress"
      
        echo -e "\nTesting without compression and with the no-holes feature...\n"
        test_cloning_inline_extents "-O no-holes" ""
      
        echo -e "\nTesting with compression and with the no-holes feature...\n"
        test_cloning_inline_extents "-O no-holes" "-o compress"
      
        status=0
        exit
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
    • D
      btrfs: check unsupported filters in balance arguments · 8eb93459
      David Sterba authored
      We don't verify that all the balance filter arguments requested through
      the flags are actually known to the kernel. Thus we let unknown filters
      pass silently and do nothing.
      
      At the moment this means only the 'limit' filter, but we're going to add
      a few more soon so it's better to have that fixed. This should also go to
      older stable kernels so that they work with newer userspace tools.
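
      The kind of check described here is a plain bitmask test. A minimal
      sketch follows; the TOY_FILTER_* bits, their values, the function name
      and the choice of -EINVAL as the error are all assumptions made for
      illustration, not the definitions used by the btrfs ioctl.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical filter bits; the values are made up for this sketch. */
#define TOY_FILTER_PROFILES	(1ULL << 0)
#define TOY_FILTER_USAGE	(1ULL << 1)
#define TOY_FILTER_LIMIT	(1ULL << 5)

/* Every filter bit this imaginary kernel knows how to handle. */
#define TOY_FILTER_KNOWN_MASK \
	(TOY_FILTER_PROFILES | TOY_FILTER_USAGE | TOY_FILTER_LIMIT)

/* Return 0 if all requested filters are known, -EINVAL otherwise. */
int toy_check_balance_flags(uint64_t flags)
{
	uint64_t unknown = flags & ~TOY_FILTER_KNOWN_MASK;

	/* By construction, no known bit can show up in "unknown". */
	assert((unknown & TOY_FILTER_KNOWN_MASK) == 0);

	if (unknown)
		return -EINVAL;
	return 0;
}
```

      With a check of this shape, a filter bit the kernel does not know is
      rejected up front instead of being silently ignored.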
      
      Cc: stable@vger.kernel.org # 3.16+
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
  23. 08 Oct, 2015 1 commit