1. 16 4月, 2015 1 次提交
    • J
      aio: fix serial draining in exit_aio() · dc48e56d
      Jens Axboe 提交于
      exit_aio() currently serializes killing io contexts. Each context
      killing ends up having to do percpu_ref_kill(), which in turns has
      to wait for an RCU grace period. This can take a long time, depending
      on the number of contexts. And there's no point in doing them serially,
      when we could be waiting for all of them in one fell swoop.
      
      This patches makes my fio thread offload test case exit 0.2s instead
      of almost 6s.
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      dc48e56d
  2. 13 3月, 2015 3 次提交
    • S
      fanotify: fix event filtering with FAN_ONDIR set · b3c1030d
      Suzuki K. Poulose 提交于
      With FAN_ONDIR set, the user can end up getting events, which it hasn't
      marked.  This was revealed with fanotify04 testcase failure on
      Linux-4.0-rc1, and is a regression from 3.19, revealed with 66ba93c0
      ("fanotify: don't set FAN_ONDIR implicitly on a marks ignored mask").
      
         # /opt/ltp/testcases/bin/fanotify04
         [ ... ]
        fanotify04    7  TPASS  :  event generated properly for type 100000
        fanotify04    8  TFAIL  :  fanotify04.c:147: got unexpected event 30
        fanotify04    9  TPASS  :  No event as expected
      
      The testcase sets the adds the following marks : FAN_OPEN | FAN_ONDIR for
      a fanotify on a dir.  Then does an open(), followed by close() of the
      directory and expects to see an event FAN_OPEN(0x20).  However, the
      fanotify returns (FAN_OPEN|FAN_CLOSE_NOWRITE(0x10)).  This happens due to
      the flaw in the check for event_mask in fanotify_should_send_event() which
      does:
      
      	if (event_mask & marks_mask & ~marks_ignored_mask)
      		return true;
      
      where, event_mask == (FAN_ONDIR | FAN_CLOSE_NOWRITE),
             marks_mask == (FAN_ONDIR | FAN_OPEN),
             marks_ignored_mask == 0
      
      Fix this by masking the outgoing events to the user, as we already take
      care of FAN_ONDIR and FAN_EVENT_ON_CHILD.
      Signed-off-by: NSuzuki K. Poulose <suzuki.poulose@arm.com>
      Tested-by: NLino Sanfilippo <LinoSanfilippo@gmx.de>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3c1030d
    • R
      nilfs2: fix deadlock of segment constructor during recovery · 283ee148
      Ryusuke Konishi 提交于
      According to a report from Yuxuan Shui, nilfs2 in kernel 3.19 got stuck
      during recovery at mount time.  The code path that caused the deadlock was
      as follows:
      
        nilfs_fill_super()
          load_nilfs()
            nilfs_salvage_orphan_logs()
              * Do roll-forwarding, attach segment constructor for recovery,
                and kick it.
      
              nilfs_segctor_thread()
                nilfs_segctor_thread_construct()
                 * A lock is held with nilfs_transaction_lock()
                   nilfs_segctor_do_construct()
                     nilfs_segctor_drop_written_files()
                       iput()
                         iput_final()
                           write_inode_now()
                             writeback_single_inode()
                               __writeback_single_inode()
                                 do_writepages()
                                   nilfs_writepage()
                                     nilfs_construct_dsync_segment()
                                       nilfs_transaction_lock() --> deadlock
      
      This can happen if commit 7ef3ff2f ("nilfs2: fix deadlock of segment
      constructor over I_SYNC flag") is applied and roll-forward recovery was
      performed at mount time.  The roll-forward recovery can happen if datasync
      write is done and the file system crashes immediately after that.  For
      instance, we can reproduce the issue with the following steps:
      
       < nilfs2 is mounted on /nilfs (device: /dev/sdb1) >
       # dd if=/dev/zero of=/nilfs/test bs=4k count=1 && sync
       # dd if=/dev/zero of=/nilfs/test conv=notrunc oflag=dsync bs=4k
       count=1 && reboot -nfh
       < the system will immediately reboot >
       # mount -t nilfs2 /dev/sdb1 /nilfs
      
      The deadlock occurs because iput() can run segment constructor through
      writeback_single_inode() if MS_ACTIVE flag is not set on sb->s_flags.  The
      above commit changed segment constructor so that it calls iput()
      asynchronously for inodes with i_nlink == 0, but that change was
      imperfect.
      
      This fixes the another deadlock by deferring iput() in segment constructor
      even for the case that mount is not finished, that is, for the case that
      MS_ACTIVE flag is not set.
      Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Reported-by: NYuxuan Shui <yshuiv7@gmail.com>
      Tested-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      283ee148
    • M
      ocfs2: make append_dio an incompat feature · 18d585f0
      Mark Fasheh 提交于
      It turns out that making this feature ro_compat isn't quite enough to
      prevent accidental corruption on mount from older kernels.  Ocfs2 (like
      other file systems) will process orphaned inodes even when the user mounts
      in 'ro' mode.  So for the case of a filesystem not knowing the append_dio
      feature, mounting the filesystem could result in orphaned-for-dio files
      being deleted, which we clearly don't want.
      
      So instead, turn this into an incompat flag.
      
      Btw, this is kind of my fault - initially I asked that we add a flag to
      cover the feature and even suggested that we use an ro flag.  It wasn't
      until I was looking through our commits for v4.0-rc1 that I realized we
      actually want this to be incompat.
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      18d585f0
  3. 06 3月, 2015 3 次提交
    • Q
      Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref. · dd9ef135
      Quentin Casasnovas 提交于
      Improper arithmetics when calculting the address of the extended ref could
      lead to an out of bounds memory read and kernel panic.
      Signed-off-by: NQuentin Casasnovas <quentin.casasnovas@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      cc: stable@vger.kernel.org # v3.7+
      Signed-off-by: NChris Mason <clm@fb.com>
      dd9ef135
    • F
      Btrfs: fix data loss in the fast fsync path · 3a8b36f3
      Filipe Manana 提交于
      When using the fast file fsync code path we can miss the fact that new
      writes happened since the last file fsync and therefore return without
      waiting for the IO to finish and write the new extents to the fsync log.
      
      Here's an example scenario where the fsync will miss the fact that new
      file data exists that wasn't yet durably persisted:
      
      1. fs_info->last_trans_committed == N - 1 and current transaction is
         transaction N (fs_info->generation == N);
      
      2. do a buffered write;
      
      3. fsync our inode, this clears our inode's full sync flag, starts
         an ordered extent and waits for it to complete - when it completes
         at btrfs_finish_ordered_io(), the inode's last_trans is set to the
         value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
         btrfs_set_inode_last_trans);
      
      4. transaction N is committed, so fs_info->last_trans_committed is now
         set to the value N and fs_info->generation remains with the value N;
      
      5. do another buffered write, when this happens btrfs_file_write_iter
         sets our inode's last_trans to the value N + 1 (that is
         fs_info->generation + 1 == N + 1);
      
      6. transaction N + 1 is started and fs_info->generation now has the
         value N + 1;
      
      7. transaction N + 1 is committed, so fs_info->last_trans_committed
         is set to the value N + 1;
      
      8. fsync our inode - because it doesn't have the full sync flag set,
         we only start the ordered extent, we don't wait for it to complete
         (only in a later phase) therefore its last_trans field has the
         value N + 1 set previously by btrfs_file_write_iter(), and so we
         have:
      
             inode->last_trans <= fs_info->last_trans_committed
                 (N + 1)              (N + 1)
      
         Which made us not log the last buffered write and exit the fsync
         handler immediately, returning success (0) to user space and resulting
         in data loss after a crash.
      
      This can actually be triggered deterministically and the following excerpt
      from a testcase I made for xfstests triggers the issue. It moves a dummy
      file across directories and then fsyncs the old parent directory - this
      is just to trigger a transaction commit, so moving files around isn't
      directly related to the issue but it was chosen because running 'sync' for
      example does more than just committing the current transaction, as it
      flushes/waits for all file data to be persisted. The issue can also happen
      at random periods, since the transaction kthread periodicaly commits the
      current transaction (about every 30 seconds by default).
      The body of the test is:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our main test file 'foo', the one we check for data loss.
        # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
        # bit from its flags (btrfs inode specific flags).
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                        -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now create one other file and 2 directories. We will move this second file
        # from one directory to the other later because it forces btrfs to commit its
        # currently open transaction if we fsync the old parent directory. This is
        # necessary to trigger the data loss bug that affected btrfs.
        mkdir $SCRATCH_MNT/testdir_1
        touch $SCRATCH_MNT/testdir_1/bar
        mkdir $SCRATCH_MNT/testdir_2
      
        # Make sure everything is durably persisted.
        sync
      
        # Write more 8Kb of data to our file.
        $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Move our 'bar' file into a new directory.
        mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
      
        # Fsync our first directory. Because it had a file moved into some other
        # directory, this made btrfs commit the currently open transaction. This is
        # a condition necessary to trigger the data loss bug.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
      
        # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
        # data we wrote previously to be persisted and available if a crash happens.
        # This did not happen with btrfs, because of the transaction commit that
        # happened when we fsynced the parent directory.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Now check that all data we wrote before are available.
        echo "File content after log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        status=0
        exit
      
      The expected golden output for the test, which is what we get with this
      fix applied (or when running against ext3/4 and xfs), is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 8192/8192 bytes at offset 8192
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
        *
        0040000
      
      Without this fix applied, the output shows the test file does not have
      the second 8Kb extent that we successfully fsynced:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 8192/8192 bytes at offset 8192
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
      
      So fix this by skipping the fsync only if we're doing a full sync and
      if the inode's last_trans is <= fs_info->last_trans_committed, or if
      the inode is already in the log. Also remove setting the inode's
      last_trans in btrfs_file_write_iter since it's useless/unreliable.
      
      Also because btrfs_file_write_iter no longer sets inode->last_trans to
      fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
      bail out if last_trans is 0, otherwise something as simple as the following
      example wouldn't log the second write on the last fsync:
      
        1. write to file
      
        2. fsync file
      
        3. fsync file
             |--> btrfs_inode_in_log() returns true and it set last_trans to 0
      
        4. write to file
             |--> btrfs_file_write_iter() no longers sets last_trans, so it
                  remained with a value of 0
        5. fsync
             |--> inode->last_trans == 0, so it bails out without logging the
                  second write
      
      A test case for xfstests will be sent soon.
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3a8b36f3
    • J
      Btrfs: remove extra run_delayed_refs in update_cowonly_root · f5c0a122
      Josef Bacik 提交于
      This got added with my dirty_bgs patch, it's not needed.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f5c0a122
  4. 05 3月, 2015 1 次提交
    • J
      locks: fix fasync_struct memory leak in lease upgrade/downgrade handling · 0164bf02
      Jeff Layton 提交于
      Commit 8634b51f (locks: convert lease handling to file_lock_context)
      introduced a regression in the handling of lease upgrade/downgrades.
      
      In the event that we already have a lease on a file and are going to
      either upgrade or downgrade it, we skip doing any list insertion or
      deletion and simply re-call lm_setup on the existing lease.
      
      As of commit 8634b51f however, we end up calling lm_setup on the
      lease that was passed in, instead of on the existing lease. This causes
      us to leak the fasync_struct that was allocated in the event that there
      was not already an existing one (as it always appeared that there
      wasn't one).
      
      Fixes: 8634b51f (locks: convert lease handling to file_lock_context)
      Reported-and-Tested-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
      0164bf02
  5. 04 3月, 2015 4 次提交
  6. 03 3月, 2015 13 次提交
    • T
      eCryptfs: don't pass fs-specific ioctl commands through · 6d65261a
      Tyler Hicks 提交于
      eCryptfs can't be aware of what to expect when after passing an
      arbitrary ioctl command through to the lower filesystem. The ioctl
      command may trigger an action in the lower filesystem that is
      incompatible with eCryptfs.
      
      One specific example is when one attempts to use the Btrfs clone
      ioctl command when the source file is in the Btrfs filesystem that
      eCryptfs is mounted on top of and the destination fd is from a new file
      created in the eCryptfs mount. The ioctl syscall incorrectly returns
      success because the command is passed down to Btrfs which thinks that it
      was able to do the clone operation. However, the result is an empty
      eCryptfs file.
      
      This patch allows the trim, {g,s}etflags, and {g,s}etversion ioctl
      commands through and then copies up the inode metadata from the lower
      inode to the eCryptfs inode to catch any changes made to the lower
      inode's metadata. Those five ioctl commands are mostly common across all
      filesystems but the whitelist may need to be further pruned in the
      future.
      
      https://bugzilla.kernel.org/show_bug.cgi?id=93691
      https://launchpad.net/bugs/1305335Signed-off-by: NTyler Hicks <tyhicks@canonical.com>
      Cc: Rocko <rockorequin@hotmail.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: stable@vger.kernel.org # v2.6.36+: c43f7b8f eCryptfs: Handle ioctl calls with unlocked and compat functions
      6d65261a
    • T
      NFSv4: Ensure we skip delegations that are already being returned · ec3ca4e5
      Trond Myklebust 提交于
      In nfs_client_return_marked_delegations() and nfs_delegation_reap_unclaimed()
      we want to optimise the loop traversal by skipping delegations that are
      already in the process of being returned.
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      ec3ca4e5
    • T
      NFSv4: Pin the superblock while we're returning the delegation · 9f0f8e12
      Trond Myklebust 提交于
      This patch ensures that the superblock doesn't go ahead and disappear
      underneath us while the state manager thread is returning delegations.
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      9f0f8e12
    • T
      NFSv4: Ensure we honour NFS_DELEGATION_RETURNING in nfs_inode_set_delegation() · ade04647
      Trond Myklebust 提交于
      Ensure that nfs_inode_set_delegation() doesn't inadvertently detach a
      delegation that is already in the process of being returned.
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      ade04647
    • T
    • A
      NFS: Fix stateid used for NFS v4 closes · 369d6b7f
      Anna Schumaker 提交于
      After 566fcec6 the client uses the "current stateid" from the
      nfs4_state structure to close a file.  This could potentially contain a
      delegation stateid, which is disallowed by the protocol and causes
      servers to return NFS4ERR_BAD_STATEID.  This patch restores the
      (correct) behavior of sending the open stateid to close a file.
      Reported-by: NOlga Kornievskaia <kolga@netapp.com>
      Fixes: 566fcec6 (NFSv4: Fix an atomicity problem in CLOSE)
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      369d6b7f
    • F
      Btrfs: incremental send, don't rename a directory too soon · 84471e24
      Filipe Manana 提交于
      There's one more case where we can't issue a rename operation for a
      directory as soon as we process it. We used to delay directory renames
      only if they have some ancestor directory with a higher inode number
      that got renamed too, but there's another case where we need to delay
      the rename too - when a directory A is renamed to the old name of a
      directory B but that directory B has its rename delayed because it
      has now (in the send root) an ancestor with a higher inode number that
      was renamed. If we don't delay the directory rename in this case, the
      receiving end of the send stream will attempt to rename A to the old
      name of B before B got renamed to its new name, which results in a
      "directory not empty" error. So fix this by delaying directory renames
      for this case too.
      
      Steps to reproduce:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/a
        $ mkdir /mnt/b
        $ mkdir /mnt/c
        $ touch /mnt/a/file
      
        $ btrfs subvolume snapshot -r /mnt /mnt/snap1
      
        $ mv /mnt/c /mnt/x
        $ mv /mnt/a /mnt/x/y
        $ mv /mnt/b /mnt/a
      
        $ btrfs subvolume snapshot -r /mnt /mnt/snap2
      
        $ btrfs send /mnt/snap1 -f /tmp/1.send
        $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/2.send
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt2
        $ btrfs receive /mnt2 -f /tmp/1.send
        $ btrfs receive /mnt2 -f /tmp/2.send
        ERROR: rename b -> a failed. Directory not empty
      
      A test case for xfstests follows soon.
      Reported-by: NAmes Cornish <ames@cornishes.net>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      84471e24
    • D
      btrfs: fix lost return value due to variable shadowing · 1932b7be
      David Sterba 提交于
      A block-local variable stores error code but btrfs_get_blocks_direct may
      not return it in the end as there's a ret defined in the function scope.
      
      CC: <stable@vger.kernel.org>	# 3.6+
      Fixes: d187663e ("Btrfs: lock extents as we map them in DIO")
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NChris Mason <clm@fb.com>
      1932b7be
    • F
      Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr · 5cdf83ed
      Filipe Manana 提交于
      The return value from btrfs_lookup_xattr() can be a pointer encoding an
      error, therefore deal with it. This fixes commit 5f5bc6b1
      ("Btrfs: make xattr replace operations atomic").
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5cdf83ed
    • F
      Btrfs: fix off-by-one logic error in btrfs_realloc_node · 5dfe2be7
      Filipe Manana 提交于
      The end_slot variable actually matches the number of pointers in the
      node and not the last slot (which is 'nritems - 1'). Therefore in order
      to check that the current slot in the for loop doesn't match the last
      one, the correct logic is to check if 'i' is less than 'end_slot - 1'
      and not 'end_slot - 2'.
      
      Fix this and set end_slot to be 'nritems - 1', as it's less confusing
      since the variable name implies it's inclusive rather then exclusive.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      5dfe2be7
    • F
      Btrfs: add missing inode update when punching hole · e8c1c76e
      Filipe Manana 提交于
      When punching a file hole if we endup only zeroing parts of a page,
      because the start offset isn't a multiple of the sector size or the
      start offset and length fall within the same page, we were not updating
      the inode item. This prevented an fsync from doing anything, if no other
      file changes happened in the current transaction, because the fields
      in btrfs_inode used to check if the inode needs to be fsync'ed weren't
      updated.
      
      This issue is easy to reproduce and the following excerpt from the
      xfstest case I made shows how to trigger it:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our test file.
        $XFS_IO_PROG -f -c "pwrite -S 0x22 -b 16K 0 16K" \
            $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Fsync the file, this makes btrfs update some btrfs inode specific fields
        # that are used to track if the inode needs to be written/updated to the fsync
        # log or not. After this fsync, the new values for those fields indicate that
        # a subsequent fsync does not need to touch the fsync log.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        # Force a commit of the current transaction. After this point, any operation
        # that modifies the data or metadata of our file, should update those fields in
        # the btrfs inode with values that make the next fsync operation write to the
        # fsync log.
        sync
      
        # Punch a hole in our file. This small range affects only 1 page.
        # This made the btrfs hole punching implementation write only some zeroes in
        # one page, but it did not update the btrfs inode fields used to determine if
        # the next fsync needs to write to the fsync log.
        $XFS_IO_PROG -c "fpunch 8000 4K" $SCRATCH_MNT/foo
      
        # Another variation of the previously mentioned case.
        $XFS_IO_PROG -c "fpunch 15000 100" $SCRATCH_MNT/foo
      
        # Now fsync the file. This was a no-operation because the previous hole punch
        # operation didn't update the inode's fields mentioned before, so they remained
        # with the values they had after the first fsync - that is, they indicate that
        # it is not needed to write to fsync log.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        echo "File content before:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        # Enable writes and mount the fs. This makes the fsync log replay code run.
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Because the last fsync didn't do anything, here the file content matched what
        # it was after the first fsync, before the holes were punched, and not what it
        # was after the holes were punched.
        echo "File content after:"
        od -t x1 $SCRATCH_MNT/foo
      
      This issue has been around since 2012, when the punch hole implementation
      was added, commit 2aaa6655 ("Btrfs: add hole punching").
      
      A test case for xfstests follows soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e8c1c76e
    • J
      Btrfs: abort the transaction if we fail to update the free space cache inode · 0c0ef4bc
      Josef Bacik 提交于
      Our gluster boxes were hitting a problem where they'd run out of space when
      updating the block group cache and therefore wouldn't be able to update the free
      space inode.  This is a problem because this is how we invalidate the cache and
      protect ourselves from errors further down the stack, so if this fails we have
      to abort the transaction so we make sure we don't end up with stale free space
      cache.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0c0ef4bc
    • F
      Btrfs: fix fsync race leading to ordered extent memory leaks · 4d884fce
      Filipe Manana 提交于
      We can have multiple fsync operations against the same file during the
      same transaction and they can collect the same ordered extents while they
      don't complete (still accessible from the inode's ordered tree). If this
      happens, those ordered extents will never get their reference counts
      decremented to 0, leading to memory leaks and inode leaks (an iput for an
      ordered extent's inode is scheduled only when the ordered extent's refcount
      drops to 0). The following sequence diagram explains this race:
      
               CPU 1                                         CPU 2
      
      btrfs_sync_file()
      
                                                       btrfs_sync_file()
      
        mutex_lock(inode->i_mutex)
        btrfs_log_inode()
          btrfs_get_logged_extents()
            --> collects ordered extent X
            --> increments ordered
                extent X's refcount
          btrfs_submit_logged_extents()
        mutex_unlock(inode->i_mutex)
      
                                                         mutex_lock(inode->i_mutex)
        btrfs_sync_log()
           btrfs_wait_logged_extents()
             --> list_del_init(&ordered->log_list)
                                                           btrfs_log_inode()
                                                             btrfs_get_logged_extents()
                                                               --> Adds ordered extent X
                                                                   to logged_list because
                                                                   at this point:
                                                                   list_empty(&ordered->log_list)
                                                                   && test_bit(BTRFS_ORDERED_LOGGED,
                                                                               &ordered->flags) == 0
                                                               --> Increments ordered extent
                                                                   X's refcount
             --> check if ordered extent's io is
                 finished or not, start it if
                 necessary and wait for it to finish
             --> sets bit BTRFS_ORDERED_LOGGED
                 on ordered extent X's flags
                 and adds it to trans->ordered
        btrfs_sync_log() finishes
      
                                                             btrfs_submit_logged_extents()
                                                           btrfs_log_inode() finishes
                                                         mutex_unlock(inode->i_mutex)
      
      btrfs_sync_file() finishes
      
                                                         btrfs_sync_log()
                                                            btrfs_wait_logged_extents()
                                                              --> Sees ordered extent X has the
                                                                  bit BTRFS_ORDERED_LOGGED set in
                                                                  its flags
                                                              --> X's refcount is untouched
                                                         btrfs_sync_log() finishes
      
                                                       btrfs_sync_file() finishes
      
      btrfs_commit_transaction()
        --> called by transaction kthread for e.g.
        btrfs_wait_pending_ordered()
          --> waits for ordered extent X to
              complete
          --> decrements ordered extent X's
              refcount by 1 only, corresponding
              to the increment done by the fsync
              task ran by CPU 1
      
      In the scenario of the above diagram, after the transaction commit,
      the ordered extent will remain with a refcount of 1 forever, leaking
      the ordered extent structure and preventing the i_count of its inode
      from ever decreasing to 0, since the delayed iput is scheduled only
      when the ordered extent's refcount drops to 0, preventing the inode
      from ever being evicted by the VFS.
      
      Fix this by using the flag BTRFS_ORDERED_LOGGED differently. Use it to
      mean that an ordered extent is already being processed by an fsync call,
      which will attach it to the current transaction, preventing it from being
      collected by subsequent fsync operations against the same inode.
      
      This race was introduced with the following change (added in 3.19 and
      backported to stable 3.18 and 3.17):
      
        Btrfs: make sure logged extents complete in the current transaction V3
        commit 50d9aa99
      
      I ran into this issue while running xfstests/generic/113 in a loop, which
      failed about 1 out of 10 runs with the following warning in dmesg:
      
      [ 2612.440038] WARNING: CPU: 4 PID: 22057 at fs/btrfs/disk-io.c:3558 free_fs_root+0x36/0x133 [btrfs]()
      [ 2612.442810] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop processor parport_pc parport psmouse therma
      l_sys i2c_piix4 serio_raw pcspkr evdev microcode button i2c_core ext4 crc16 jbd2 mbcache sd_mod sg sr_mod cdrom virtio_scsi ata_generic virtio_pci ata_piix virtio_ring libata virtio flo
      ppy e1000 scsi_mod [last unloaded: btrfs]
      [ 2612.452711] CPU: 4 PID: 22057 Comm: umount Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
      [ 2612.454921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [ 2612.457709]  0000000000000009 ffff8801342c3c78 ffffffff8142425e ffff88023ec8f2d8
      [ 2612.459829]  0000000000000000 ffff8801342c3cb8 ffffffff81045308 ffff880046460000
      [ 2612.461564]  ffffffffa036da56 ffff88003d07b000 ffff880046460000 ffff880046460068
      [ 2612.463163] Call Trace:
      [ 2612.463719]  [<ffffffff8142425e>] dump_stack+0x4c/0x65
      [ 2612.464789]  [<ffffffff81045308>] warn_slowpath_common+0xa1/0xbb
      [ 2612.466026]  [<ffffffffa036da56>] ? free_fs_root+0x36/0x133 [btrfs]
      [ 2612.467247]  [<ffffffff810453c5>] warn_slowpath_null+0x1a/0x1c
      [ 2612.468416]  [<ffffffffa036da56>] free_fs_root+0x36/0x133 [btrfs]
      [ 2612.469625]  [<ffffffffa036f2a7>] btrfs_drop_and_free_fs_root+0x93/0x9b [btrfs]
      [ 2612.471251]  [<ffffffffa036f353>] btrfs_free_fs_roots+0xa4/0xd6 [btrfs]
      [ 2612.472536]  [<ffffffff8142612e>] ? wait_for_completion+0x24/0x26
      [ 2612.473742]  [<ffffffffa0370bbc>] close_ctree+0x1f3/0x33c [btrfs]
      [ 2612.475477]  [<ffffffff81059d1d>] ? destroy_workqueue+0x148/0x1ba
      [ 2612.476695]  [<ffffffffa034e3da>] btrfs_put_super+0x19/0x1b [btrfs]
      [ 2612.477911]  [<ffffffff81153e53>] generic_shutdown_super+0x73/0xef
      [ 2612.479106]  [<ffffffff811540e2>] kill_anon_super+0x13/0x1e
      [ 2612.480226]  [<ffffffffa034e1e3>] btrfs_kill_super+0x17/0x23 [btrfs]
      [ 2612.481471]  [<ffffffff81154307>] deactivate_locked_super+0x3b/0x50
      [ 2612.482686]  [<ffffffff811547a7>] deactivate_super+0x3f/0x43
      [ 2612.483791]  [<ffffffff8116b3ed>] cleanup_mnt+0x59/0x78
      [ 2612.484842]  [<ffffffff8116b44c>] __cleanup_mnt+0x12/0x14
      [ 2612.485900]  [<ffffffff8105d019>] task_work_run+0x8f/0xbc
      [ 2612.486960]  [<ffffffff810028d8>] do_notify_resume+0x5a/0x6b
      [ 2612.488083]  [<ffffffff81236e5b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [ 2612.489333]  [<ffffffff8142a17f>] int_signal+0x12/0x17
      [ 2612.490353] ---[ end trace 54a960a6bdcb8d93 ]---
      [ 2612.557253] VFS: Busy inodes after unmount of sdb. Self-destruct in 5 seconds.  Have a nice day...
      
      Kmemleak confirmed the ordered extent leak (and btrfs inode specific
      structures such as delayed nodes):
      
      $ cat /sys/kernel/debug/kmemleak
      unreferenced object 0xffff880154290db0 (size 576):
        comm "btrfsck", pid 21980, jiffies 4295542503 (age 1273.412s)
        hex dump (first 32 bytes):
          01 40 00 00 01 00 00 00 b0 1d f1 4e 01 88 ff ff  .@.........N....
          00 00 00 00 00 00 00 00 c8 0d 29 54 01 88 ff ff  ..........)T....
        backtrace:
          [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a
          [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83
          [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190
          [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac
          [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
          [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
          [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs]
          [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
          [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs]
          [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed
          [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa
          [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b
          [<ffffffff81429e92>] system_call_fastpath+0x12/0x17
          [<ffffffffffffffff>] 0xffffffffffffffff
      unreferenced object 0xffff88014ef11db0 (size 576):
        comm "rm", pid 22009, jiffies 4295542593 (age 1273.052s)
        hex dump (first 32 bytes):
          02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 c8 1d f1 4e 01 88 ff ff  ...........N....
        backtrace:
          [<ffffffff8141d74d>] kmemleak_update_trace+0x4c/0x6a
          [<ffffffff8122f2c0>] radix_tree_node_alloc+0x6d/0x83
          [<ffffffff8122fb26>] __radix_tree_create+0x109/0x190
          [<ffffffff8122fbdd>] radix_tree_insert+0x30/0xac
          [<ffffffffa03b9bde>] btrfs_get_or_create_delayed_node+0x130/0x187 [btrfs]
          [<ffffffffa03bb82d>] btrfs_delayed_delete_inode_ref+0x32/0xac [btrfs]
          [<ffffffffa0379dae>] __btrfs_unlink_inode+0xee/0x288 [btrfs]
          [<ffffffffa037c715>] btrfs_unlink_inode+0x1e/0x40 [btrfs]
          [<ffffffffa037c797>] btrfs_unlink+0x60/0x9b [btrfs]
          [<ffffffff8115d7f0>] vfs_unlink+0x9c/0xed
          [<ffffffff8115f5de>] do_unlinkat+0x12c/0x1fa
          [<ffffffff811601a7>] SyS_unlinkat+0x29/0x2b
          [<ffffffff81429e92>] system_call_fastpath+0x12/0x17
          [<ffffffffffffffff>] 0xffffffffffffffff
      unreferenced object 0xffff8800336feda8 (size 584):
        comm "aio-stress", pid 22031, jiffies 4295543006 (age 1271.400s)
        hex dump (first 32 bytes):
          00 40 3e 00 00 00 00 00 00 00 8f 42 00 00 00 00  .@>........B....
          00 00 01 00 00 00 00 00 00 00 01 00 00 00 00 00  ................
        backtrace:
          [<ffffffff8114eb34>] create_object+0x172/0x29a
          [<ffffffff8141d790>] kmemleak_alloc+0x25/0x41
          [<ffffffff81141ae6>] kmemleak_alloc_recursive.constprop.52+0x16/0x18
          [<ffffffff81145288>] kmem_cache_alloc+0xf7/0x198
          [<ffffffffa0389243>] __btrfs_add_ordered_extent+0x43/0x309 [btrfs]
          [<ffffffffa038968b>] btrfs_add_ordered_extent_dio+0x12/0x14 [btrfs]
          [<ffffffffa03810e2>] btrfs_get_blocks_direct+0x3ef/0x571 [btrfs]
          [<ffffffff81181349>] do_blockdev_direct_IO+0x62a/0xb47
          [<ffffffff8118189a>] __blockdev_direct_IO+0x34/0x36
          [<ffffffffa03776e5>] btrfs_direct_IO+0x16a/0x1e8 [btrfs]
          [<ffffffff81100373>] generic_file_direct_write+0xb8/0x12d
          [<ffffffffa038615c>] btrfs_file_write_iter+0x24b/0x42f [btrfs]
          [<ffffffff8118bb0d>] aio_run_iocb+0x2b7/0x32e
          [<ffffffff8118c99a>] do_io_submit+0x26e/0x2ff
          [<ffffffff8118ca3b>] SyS_io_submit+0x10/0x12
          [<ffffffff81429e92>] system_call_fastpath+0x12/0x17
      
      CC: <stable@vger.kernel.org> # 3.19, 3.18 and 3.17
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4d884fce
  7. 02 3月, 2015 13 次提交
  8. 01 3月, 2015 1 次提交
    • R
      nilfs2: fix potential memory overrun on inode · 957ed60b
      Ryusuke Konishi 提交于
      Each inode of nilfs2 stores a root node of a b-tree, and it turned out to
      have a memory overrun issue:
      
      Each b-tree node of nilfs2 stores a set of key-value pairs and the number
      of them (in "bn_nchildren" member of nilfs_btree_node struct), as well as
      a few other "bn_*" members.
      
      Since the value of "bn_nchildren" is used for operations on the key-values
      within the b-tree node, it can cause memory access overrun if a large
      number is incorrectly set to "bn_nchildren".
      
      For instance, nilfs_btree_node_lookup() function determines the range of
      binary search with it, and too large "bn_nchildren" leads
      nilfs_btree_node_get_key() in that function to overrun.
      
      As for intermediate b-tree nodes, this is prevented by a sanity check
      performed when each node is read from a drive, however, no sanity check
      has been done for root nodes stored in inodes.
      
      This patch fixes the issue by adding missing sanity check against b-tree
      root nodes so that it's called when on-memory inodes are read from ifile,
      inode metadata file.
      Signed-off-by: NRyusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      957ed60b
  9. 28 2月, 2015 1 次提交