1. 05 3月, 2016 1 次提交
  2. 04 3月, 2016 2 次提交
    • F
      Btrfs: fix loading of orphan roots leading to BUG_ON · 909c3a22
      Filipe Manana 提交于
      When looking for orphan roots during mount we can end up hitting a
      BUG_ON() (at root-item.c:btrfs_find_orphan_roots()) if a log tree is
      replayed and qgroups are enabled. This is because after a log tree is
      replayed, a transaction commit is made, which triggers qgroup extent
      accounting which in turn does backref walking which ends up reading and
      inserting all roots in the radix tree fs_info->fs_root_radix, including
      orphan roots (deleted snapshots). So after the log tree is replayed, when
      finding orphan roots we hit the BUG_ON with the following trace:
      
      [118209.182438] ------------[ cut here ]------------
      [118209.183279] kernel BUG at fs/btrfs/root-tree.c:314!
      [118209.184074] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [118209.185123] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic ppdev xor raid6_pq evdev sg parport_pc parport acpi_cpufreq tpm_tis tpm psmouse
      processor i2c_piix4 serio_raw pcspkr i2c_core button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata
      virtio_pci virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs]
      [118209.186318] CPU: 14 PID: 28428 Comm: mount Tainted: G        W       4.5.0-rc5-btrfs-next-24+ #1
      [118209.186318] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
      [118209.186318] task: ffff8801ec131040 ti: ffff8800af34c000 task.ti: ffff8800af34c000
      [118209.186318] RIP: 0010:[<ffffffffa04237d7>]  [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
      [118209.186318] RSP: 0018:ffff8800af34faa8  EFLAGS: 00010246
      [118209.186318] RAX: 00000000ffffffef RBX: 00000000ffffffef RCX: 0000000000000001
      [118209.186318] RDX: 0000000080000000 RSI: 0000000000000001 RDI: 00000000ffffffff
      [118209.186318] RBP: ffff8800af34fb08 R08: 0000000000000001 R09: 0000000000000000
      [118209.186318] R10: ffff8800af34f9f0 R11: 6db6db6db6db6db7 R12: ffff880171b97000
      [118209.186318] R13: ffff8801ca9d65e0 R14: ffff8800afa2e000 R15: 0000160000000000
      [118209.186318] FS:  00007f5bcb914840(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000
      [118209.186318] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [118209.186318] CR2: 00007f5bcaceb5d9 CR3: 00000000b49b5000 CR4: 00000000000006e0
      [118209.186318] Stack:
      [118209.186318]  fffffbffffffffff 010230ffffffffff 0101000000000000 ff84000000000000
      [118209.186318]  fbffffffffffffff 30ffffffffffffff 0000000000000101 ffff880082348000
      [118209.186318]  0000000000000000 ffff8800afa2e000 ffff8800afa2e000 0000000000000000
      [118209.186318] Call Trace:
      [118209.186318]  [<ffffffffa042e2db>] open_ctree+0x1e37/0x21b9 [btrfs]
      [118209.186318]  [<ffffffffa040a753>] btrfs_mount+0x97e/0xaed [btrfs]
      [118209.186318]  [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
      [118209.186318]  [<ffffffff8117b87e>] mount_fs+0x67/0x131
      [118209.186318]  [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
      [118209.186318]  [<ffffffffa0409f81>] btrfs_mount+0x1ac/0xaed [btrfs]
      [118209.186318]  [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf
      [118209.186318]  [<ffffffff8108c26b>] ? lockdep_init_map+0xb9/0x1b3
      [118209.186318]  [<ffffffff8117b87e>] mount_fs+0x67/0x131
      [118209.186318]  [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde
      [118209.186318]  [<ffffffff81195637>] do_mount+0x8a6/0x9e8
      [118209.186318]  [<ffffffff8119598d>] SyS_mount+0x77/0x9f
      [118209.186318]  [<ffffffff81493017>] entry_SYSCALL_64_fastpath+0x12/0x6b
      [118209.186318] Code: 64 00 00 85 c0 89 c3 75 24 f0 41 80 4c 24 20 20 49 8b bc 24 f0 01 00 00 4c 89 e6 e8 e8 65 00 00 85 c0 89 c3 74 11 83 f8 ef 75 02 <0f> 0b
      4c 89 e7 e8 da 72 00 00 eb 1c 41 83 bc 24 00 01 00 00 00
      [118209.186318] RIP  [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs]
      [118209.186318]  RSP <ffff8800af34faa8>
      [118209.230735] ---[ end trace 83938f987d85d477 ]---
      
      So fix this by not treating the error -EEXIST, returned when attempting
      to insert a root already inserted by the backref walking code, as an error.
      
      The following test case for xfstests reproduces the bug:
      
        seq=`basename $0`
        seqres=$RESULT_DIR/$seq
        echo "QA output created by $seq"
        tmp=/tmp/$$
        status=1	# failure is the default!
        trap "_cleanup; exit \$status" 0 1 2 3 15
      
        _cleanup()
        {
            _cleanup_flakey
            cd /
            rm -f $tmp.*
        }
      
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
      
        # real QA test starts here
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_dm_target flakey
        _require_metadata_journaling $SCRATCH_DEV
      
        rm -f $seqres.full
      
        _scratch_mkfs >>$seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        _run_btrfs_util_prog quota enable $SCRATCH_MNT
      
        # Create 2 directories with one file in one of them.
        # We use these just to trigger a transaction commit later, moving the file from
        # directory a to directory b and doing an fsync against directory a.
        mkdir $SCRATCH_MNT/a
        mkdir $SCRATCH_MNT/b
        touch $SCRATCH_MNT/a/f
        sync
      
        # Create our test file with 2 4K extents.
        $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io
      
        # Create a snapshot and delete it. This doesn't really delete the snapshot
        # immediately, just makes it inaccessible and invisible to user space, the
        # snapshot is deleted later by a dedicated kernel thread (cleaner kthread)
        # which is woke up at the next transaction commit.
        # A root orphan item is inserted into the tree of tree roots, so that if a
        # power failure happens before the dedicated kernel thread does the snapshot
        # deletion, the next time the filesystem is mounted it resumes the snapshot
        # deletion.
        _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap
        _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap
      
        # Now overwrite half of the extents we wrote before. Because we made a snapshpot
        # before, which isn't really deleted yet (since no transaction commit happened
        # after we did the snapshot delete request), the non overwritten extents get
        # referenced twice, once by the default subvolume and once by the snapshot.
        $XFS_IO_PROG -c "pwrite -S 0xbb 4K 8K" $SCRATCH_MNT/foobar | _filter_xfs_io
      
        # Now move file f from directory a to directory b and fsync directory a.
        # The fsync on the directory a triggers a transaction commit (because a file
        # was moved from it to another directory) and the file fsync leaves a log tree
        # with file extent items to replay.
        mv $SCRATCH_MNT/a/f $SCRATCH_MNT/a/b
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar
      
        echo "File digest before power failure:"
        md5sum $SCRATCH_MNT/foobar | _filter_scratch
      
        # Now simulate a power failure and mount the filesystem to replay the log tree.
        # After the log tree was replayed, we used to hit a BUG_ON() when processing
        # the root orphan item for the deleted snapshot. This is because when processing
        # an orphan root the code expected to be the first code inserting the root into
        # the fs_info->fs_root_radix radix tree, while in reallity it was the second
        # caller attempting to do it - the first caller was the transaction commit that
        # took place after replaying the log tree, when updating the qgroup counters.
        _flakey_drop_and_remount
      
        echo "File digest before after failure:"
        # Must match what he got before the power failure.
        md5sum $SCRATCH_MNT/foobar | _filter_scratch
      
        _unmount_flakey
        status=0
        exit
      
      Fixes: 2d9e9776 ("Btrfs: use btrfs_get_fs_root in resolve_indirect_ref")
      Cc: stable@vger.kernel.org  # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      909c3a22
    • T
      writeback: flush inode cgroup wb switches instead of pinning super_block · a1a0e23e
      Tejun Heo 提交于
      If cgroup writeback is in use, inodes can be scheduled for
      asynchronous wb switching.  Before 5ff8eaac ("writeback: keep
      superblock pinned during cgroup writeback association switches"), this
      could race with umount leading to super_block being destroyed while
      inodes are pinned for wb switching.  5ff8eaac fixed it by bumping
      s_active while wb switches are in flight; however, this allowed
      in-flight wb switches to make umounts asynchronous when the userland
      expected synchronosity - e.g. fsck immediately following umount may
      fail because the device is still busy.
      
      This patch removes the problematic super_block pinning and instead
      makes generic_shutdown_super() flush in-flight wb switches.  wb
      switches are now executed on a dedicated isw_wq so that they can be
      flushed and isw_nr_in_flight keeps track of the number of in-flight wb
      switches so that flushing can be avoided in most cases.
      
      v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE
          check in inode_switch_wbs() as Jan an Al suggested.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NTahsin Erdogan <tahsin@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com
      Fixes: 5ff8eaac ("writeback: keep superblock pinned during cgroup writeback association switches")
      Cc: stable@vger.kernel.org #v4.5
      Reviewed-by: NJan Kara <jack@suse.cz>
      Tested-by: NTahsin Erdogan <tahsin@google.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      a1a0e23e
  3. 03 3月, 2016 1 次提交
    • L
      userfaultfd: don't block on the last VM updates at exit time · 39680f50
      Linus Torvalds 提交于
      The exit path will do some final updates to the VM of an exiting process
      to inform others of the fact that the process is going away.
      
      That happens, for example, for robust futex state cleanup, but also if
      the parent has asked for a TID update when the process exits (we clear
      the child tid field in user space).
      
      However, at the time we do those final VM accesses, we've already
      stopped accepting signals, so the usual "stop waiting for userfaults on
      signal" code in fs/userfaultfd.c no longer works, and the process can
      become an unkillable zombie waiting for something that will never
      happen.
      
      To solve this, just make handle_userfault() abort any user fault
      handling if we're already in the exit path past the signal handling
      state being dead (marked by PF_EXITING).
      
      This VM special case is pretty ugly, and it is possible that we should
      look at finalizing signals later (or move the VM final accesses
      earlier).  But in the meantime this is a fairly minimally intrusive fix.
      Reported-and-tested-by: NDmitry Vyukov <dvyukov@google.com>
      Acked-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      39680f50
  4. 01 3月, 2016 2 次提交
  5. 29 2月, 2016 3 次提交
    • Y
      Fix cifs_uniqueid_to_ino_t() function for s390x · 1ee9f4bd
      Yadan Fan 提交于
      This issue is caused by commit 02323db1 ("cifs: fix
      cifs_uniqueid_to_ino_t not to ever return 0"), when BITS_PER_LONG
      is 64 on s390x, the corresponding cifs_uniqueid_to_ino_t()
      function will cast 64-bit fileid to 32-bit by using (ino_t)fileid,
      because ino_t (typdefed __kernel_ino_t) is int type.
      
      It's defined in arch/s390/include/uapi/asm/posix_types.h
      
          #ifndef __s390x__
      
          typedef unsigned long   __kernel_ino_t;
          ...
          #else /* __s390x__ */
      
          typedef unsigned int    __kernel_ino_t;
      
      So the #ifdef condition is wrong for s390x, we can just still use
      one cifs_uniqueid_to_ino_t() function with comparing sizeof(ino_t)
      and sizeof(u64) to choose the correct execution accordingly.
      Signed-off-by: NYadan Fan <ydfan@suse.com>
      CC: stable <stable@vger.kernel.org>
      Signed-off-by: NSteve French <smfrench@gmail.com>
      1ee9f4bd
    • P
      CIFS: Fix SMB2+ interim response processing for read requests · 6cc3b242
      Pavel Shilovsky 提交于
      For interim responses we only need to parse a header and update
      a number credits. Now it is done for all SMB2+ command except
      SMB2_READ which is wrong. Fix this by adding such processing.
      Signed-off-by: NPavel Shilovsky <pshilovsky@samba.org>
      Tested-by: NShirish Pargaonkar <shirishpargaonkar@gmail.com>
      CC: Stable <stable@vger.kernel.org>
      Signed-off-by: NSteve French <smfrench@gmail.com>
      6cc3b242
    • J
      cifs: fix out-of-bounds access in lease parsing · deb7deff
      Justin Maggard 提交于
      When opening a file, SMB2_open() attempts to parse the lease state from the
      SMB2 CREATE Response.  However, the parsing code was not careful to ensure
      that the create contexts are not empty or invalid, which can lead to out-
      of-bounds memory access.  This can be seen easily by trying
      to read a file from a OSX 10.11 SMB3 server.  Here is sample crash output:
      
      BUG: unable to handle kernel paging request at ffff8800a1a77cc6
      IP: [<ffffffff8828a734>] SMB2_open+0x804/0x960
      PGD 8f77067 PUD 0
      Oops: 0000 [#1] SMP
      Modules linked in:
      CPU: 3 PID: 2876 Comm: cp Not tainted 4.5.0-rc3.x86_64.1+ #14
      Hardware name: NETGEAR ReadyNAS 314          /ReadyNAS 314          , BIOS 4.6.5 10/11/2012
      task: ffff880073cdc080 ti: ffff88005b31c000 task.ti: ffff88005b31c000
      RIP: 0010:[<ffffffff8828a734>]  [<ffffffff8828a734>] SMB2_open+0x804/0x960
      RSP: 0018:ffff88005b31fa08  EFLAGS: 00010282
      RAX: 0000000000000015 RBX: 0000000000000000 RCX: 0000000000000006
      RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffff88007eb8c8b0
      RBP: ffff88005b31fad8 R08: 666666203d206363 R09: 6131613030383866
      R10: 3030383866666666 R11: 00000000000002b0 R12: ffff8800660fd800
      R13: ffff8800a1a77cc2 R14: 00000000424d53fe R15: ffff88005f5a28c0
      FS:  00007f7c8a2897c0(0000) GS:ffff88007eb80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: ffff8800a1a77cc6 CR3: 000000005b281000 CR4: 00000000000006e0
      Stack:
       ffff88005b31fa70 ffffffff88278789 00000000000001d3 ffff88005f5a2a80
       ffffffff00000003 ffff88005d029d00 ffff88006fde05a0 0000000000000000
       ffff88005b31fc78 ffff88006fde0780 ffff88005b31fb2f 0000000100000fe0
      Call Trace:
       [<ffffffff88278789>] ? cifsConvertToUTF16+0x159/0x2d0
       [<ffffffff8828cf68>] smb2_open_file+0x98/0x210
       [<ffffffff8811e80c>] ? __kmalloc+0x1c/0xe0
       [<ffffffff882685f4>] cifs_open+0x2a4/0x720
       [<ffffffff88122cef>] do_dentry_open+0x1ff/0x310
       [<ffffffff88268350>] ? cifsFileInfo_get+0x30/0x30
       [<ffffffff88123d92>] vfs_open+0x52/0x60
       [<ffffffff88131dd0>] path_openat+0x170/0xf70
       [<ffffffff88097d48>] ? remove_wait_queue+0x48/0x50
       [<ffffffff88133a29>] do_filp_open+0x79/0xd0
       [<ffffffff8813f2ca>] ? __alloc_fd+0x3a/0x170
       [<ffffffff881240c4>] do_sys_open+0x114/0x1e0
       [<ffffffff881241a9>] SyS_open+0x19/0x20
       [<ffffffff8896e257>] entry_SYSCALL_64_fastpath+0x12/0x6a
      Code: 4d 8d 6c 07 04 31 c0 4c 89 ee e8 47 6f e5 ff 31 c9 41 89 ce 44 89 f1 48 c7 c7 28 b1 bd 88 31 c0 49 01 cd 4c 89 ee e8 2b 6f e5 ff <45> 0f b7 75 04 48 c7 c7 31 b1 bd 88 31 c0 4d 01 ee 4c 89 f6 e8
      RIP  [<ffffffff8828a734>] SMB2_open+0x804/0x960
       RSP <ffff88005b31fa08>
      CR2: ffff8800a1a77cc6
      ---[ end trace d9f69ba64feee469 ]---
      Signed-off-by: NJustin Maggard <jmaggard@netgear.com>
      Signed-off-by: NSteve French <smfrench@gmail.com>
      CC: Stable <stable@vger.kernel.org>
      deb7deff
  6. 28 2月, 2016 14 次提交
  7. 25 2月, 2016 3 次提交
  8. 23 2月, 2016 2 次提交
  9. 20 2月, 2016 4 次提交
    • M
      fs/pnode.c: treat zero mnt_group_id-s as unequal · 7ae8fd03
      Maxim Patlasov 提交于
      propagate_one(m) calculates "type" argument for copy_tree() like this:
      
      >    if (m->mnt_group_id == last_dest->mnt_group_id) {
      >        type = CL_MAKE_SHARED;
      >    } else {
      >        type = CL_SLAVE;
      >        if (IS_MNT_SHARED(m))
      >           type |= CL_MAKE_SHARED;
      >   }
      
      The "type" argument then governs clone_mnt() behavior with respect to flags
      and mnt_master of new mount. When we iterate through a slave group, it is
      possible that both current "m" and "last_dest" are not shared (although,
      both are slaves, i.e. have non-NULL mnt_master-s). Then the comparison
      above erroneously makes new mount shared and sets its mnt_master to
      last_source->mnt_master. The patch fixes the problem by handling zero
      mnt_group_id-s as though they are unequal.
      
      The similar problem exists in the implementation of "else" clause above
      when we have to ascend upward in the master/slave tree by calling:
      
      >    last_source = last_source->mnt_master;
      >    last_dest = last_source->mnt_parent;
      
      proper number of times. The last step is governed by
      "n->mnt_group_id != last_dest->mnt_group_id" condition that may lie if
      both are zero. The patch fixes this case in the same way as the former one.
      
      [AV: don't open-code an obvious helper...]
      Signed-off-by: NMaxim Patlasov <mpatlasov@virtuozzo.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      7ae8fd03
    • A
      affs_do_readpage_ofs(): just use kmap_atomic() around memcpy() · 0bacbe52
      Al Viro 提交于
      It forgets kunmap() on a failure exit, but there's really no point keeping
      the page kmapped at all - after all, what we are doing is a bunch of memcpy()
      into the parts of page, so kmap_atomic()/kunmap_atomic() just around those
      memcpy() is enough.
      Spotted-by: NInsu Yun <wuninsu@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0bacbe52
    • M
      xattr handlers: plug a lock leak in simple_xattr_list · 0e9a7da5
      Mateusz Guzik 提交于
      The code could leak xattrs->lock on error.
      
      Problem introduced with 786534b9 "tmpfs: listxattr should
      include POSIX ACL xattrs".
      Signed-off-by: NMateusz Guzik <mguzik@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      0e9a7da5
    • W
      fs: allow no_seek_end_llseek to actually seek · 2feb55f8
      Wouter van Kesteren 提交于
      The user-visible impact of the issue is for example that without this
      patch sensors-detect breaks when trying to seek in /dev/cpu/0/cpuid.
      
      '~0ULL' is a 'unsigned long long' that when converted to a loff_t,
      which is signed, gets turned into -1. later in vfs_setpos we have
      'if (offset > maxsize)', which makes it always return EINVAL.
      
      Fixes: b25472f9 ("new helpers: no_seek_end_llseek{,_size}()")
      Signed-off-by: NWouter van Kesteren <woutershep@gmail.com>
      Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2feb55f8
  10. 19 2月, 2016 4 次提交
    • J
      ext4: fix crashes in dioread_nolock mode · 74dae427
      Jan Kara 提交于
      Competing overwrite DIO in dioread_nolock mode will just overwrite
      pointer to io_end in the inode. This may result in data corruption or
      extent conversion happening from IO completion interrupt because we
      don't properly set buffer_defer_completion() when unlocked DIO races
      with locked DIO to unwritten extent.
      
      Since unlocked DIO doesn't need io_end for anything, just avoid
      allocating it and corrupting pointer from inode for locked DIO.
      A cleaner fix would be to avoid these games with io_end pointer from the
      inode but that requires more intrusive changes so we leave that for
      later.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      74dae427
    • J
      ext4: fix bh->b_state corruption · ed8ad838
      Jan Kara 提交于
      ext4 can update bh->b_state non-atomically in _ext4_get_block() and
      ext4_da_get_block_prep(). Usually this is fine since bh is just a
      temporary storage for mapping information on stack but in some cases it
      can be fully living bh attached to a page. In such case non-atomic
      update of bh->b_state can race with an atomic update which then gets
      lost. Usually when we are mapping bh and thus updating bh->b_state
      non-atomically, nobody else touches the bh and so things work out fine
      but there is one case to especially worry about: ext4_finish_bio() uses
      BH_Uptodate_Lock on the first bh in the page to synchronize handling of
      PageWriteback state. So when blocksize < pagesize, we can be atomically
      modifying bh->b_state of a buffer that actually isn't under IO and thus
      can race e.g. with delalloc trying to map that buffer. The result is
      that we can mistakenly set / clear BH_Uptodate_Lock bit resulting in the
      corruption of PageWriteback state or missed unlock of BH_Uptodate_Lock.
      
      Fix the problem by always updating bh->b_state bits atomically.
      
      CC: stable@vger.kernel.org
      Reported-by: NNikolay Borisov <kernel@kyup.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      ed8ad838
    • J
      fsnotify: turn fsnotify reaper thread into a workqueue job · 0918f1c3
      Jeff Layton 提交于
      We don't require a dedicated thread for fsnotify cleanup.  Switch it
      over to a workqueue job instead that runs on the system_unbound_wq.
      
      In the interest of not thrashing the queued job too often when there are
      a lot of marks being removed, we delay the reaper job slightly when
      queueing it, to allow several to gather on the list.
      Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
      Tested-by: NEryu Guan <guaneryu@gmail.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0918f1c3
    • J
      Revert "fsnotify: destroy marks with call_srcu instead of dedicated thread" · 13d34ac6
      Jeff Layton 提交于
      This reverts commit c510eff6 ("fsnotify: destroy marks with
      call_srcu instead of dedicated thread").
      
      Eryu reported that he was seeing some OOM kills kick in when running a
      testcase that adds and removes inotify marks on a file in a tight loop.
      
      The above commit changed the code to use call_srcu to clean up the
      marks.  While that does (in principle) work, the srcu callback job is
      limited to cleaning up entries in small batches and only once per jiffy.
      It's easily possible to overwhelm that machinery with too many call_srcu
      callbacks, and Eryu's reproduer did just that.
      
      There's also another potential problem with using call_srcu here.  While
      you can obviously sleep while holding the srcu_read_lock, the callbacks
      run under local_bh_disable, so you can't sleep there.
      
      It's possible when putting the last reference to the fsnotify_mark that
      we'll end up putting a chain of references including the fsnotify_group,
      uid, and associated keys.  While I don't see any obvious ways that that
      could occurs, it's probably still best to avoid using call_srcu here
      after all.
      
      This patch reverts the above patch.  A later patch will take a different
      approach to eliminated the dedicated thread here.
      Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
      Reported-by: NEryu Guan <guaneryu@gmail.com>
      Tested-by: NEryu Guan <guaneryu@gmail.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Eric Paris <eparis@parisplace.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13d34ac6
  11. 18 2月, 2016 3 次提交
    • K
      pnfs/blocklayout: fix a memeory leak when using,vmalloc_to_page · c8975706
      Kinglong Mee 提交于
      unreferenced object 0xffffc90000abf000 (size 16900):
        comm "fsync02", pid 15765, jiffies 4297431627 (age 423.772s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 a0 c2 19 00 88 ff ff  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff8174d54e>] kmemleak_alloc+0x4e/0xb0
          [<ffffffff811b9b91>] __vmalloc_node_range+0x231/0x280
          [<ffffffff811b9c2a>] __vmalloc+0x4a/0x50
          [<ffffffffa02c9ec1>] ext_tree_prepare_commit+0x231/0x2e0 [blocklayoutdriver]
          [<ffffffffa02c700e>] bl_prepare_layoutcommit+0xe/0x10 [blocklayoutdriver]
          [<ffffffffa0596a6c>] pnfs_layoutcommit_inode+0x29c/0x330 [nfsv4]
          [<ffffffffa0596b13>] pnfs_generic_sync+0x13/0x20 [nfsv4]
          [<ffffffffa0585188>] nfs4_file_fsync+0x58/0x150 [nfsv4]
          [<ffffffff81228e5b>] vfs_fsync_range+0x4b/0xb0
          [<ffffffff81228f1d>] do_fsync+0x3d/0x70
          [<ffffffff812291d0>] SyS_fsync+0x10/0x20
          [<ffffffff81757def>] entry_SYSCALL_64_fastpath+0x12/0x76
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      v2, add missing include header
      Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      c8975706
    • C
      nfs4: fix stateid handling for the NFS v4.2 operations · 4bdf87eb
      Christoph Hellwig 提交于
      The newly added NFS v4.2 operations (ALLOCATE, DEALLOCATE, SEEK and CLONE)
      use a helper called nfs42_set_rw_stateid to select a stateid that is sent
      to the server.  But they don't set the inode and state fields in the
      nfs4_exception structure, and this don't partake in the stateid recovery
      protocol.  Because of this they will simply return errors insted of trying
      to recover a stateid when the server return a BAD_STATEID error.
      
      Additionally CLONE has the problem that it operates on two files and thus
      two stateids, and thus needs to call the exception handler twice to
      recover stateids.
      
      While we're at it stop grabbing an addititional reference to the open
      context in all these operations - having the file open guarantees that
      the open context won't go away.
      
      All this can be produces with the generic/168 and generic/170 tests in
      xfstests which stress the CLONE stateid handling.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      4bdf87eb
    • B
      NFSv4: Fix a dentry leak on alias use · d9dfd8d7
      Benjamin Coddington 提交于
      In the case where d_add_unique() finds an appropriate alias to use it will
      have already incremented the reference count.  An additional dget() to swap
      the open context's dentry is unnecessary and will leak a reference.
      Signed-off-by: NBenjamin Coddington <bcodding@redhat.com>
      Fixes: 275bb307 ("NFSv4: Move dentry instantiation into the NFSv4-...")
      Cc: stable@vger.kernel.org # 3.10+
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      d9dfd8d7
  12. 17 2月, 2016 1 次提交