1. 24 4月, 2015 5 次提交
    • P
      nfs: fix DIO good bytes calculation · 1ccbad9f
      Peng Tao 提交于
      For direct read that has IO size larger than rsize, we'll split
      it into several READ requests and nfs_direct_good_bytes() would
      count completed bytes incorrectly by eating last zero count reply.
      
      Fix it by handling mirror and non-mirror cases differently such that
      we only count mirrored writes differently.
      
      This fixes 5fadeb47("nfs: count DIO good bytes correctly with mirroring").
      Reported-by: NJean Spector <jean@primarydata.com>
      Cc: <stable@vger.kernel.org> # v3.19+
      Signed-off-by: NPeng Tao <tao.peng@primarydata.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      1ccbad9f
    • A
      nfs: Fetch MOUNTED_ON_FILEID when updating an inode · ea96d1ec
      Anna Schumaker 提交于
      2ef47eb1 (NFS: Fix use of nfs_attr_use_mounted_on_fileid()) was a good
      start to fixing a circular directory structure warning for NFS v4
      "junctioned" mountpoints.  Unfortunately, further testing continued to
      generate this error.
      
      My server is configured like this:
      
      anna@nfsd ~ % df
      Filesystem      Size  Used Avail Use% Mounted on
      /dev/vda1       9.1G  2.0G  6.5G  24% /
      /dev/vdc1      1014M   33M  982M   4% /exports
      /dev/vdc2      1014M   33M  982M   4% /exports/vol1
      /dev/vdc3      1014M   33M  982M   4% /exports/vol1/vol2
      
      anna@nfsd ~ % cat /etc/exports
      /exports/          *(rw,async,no_subtree_check,no_root_squash)
      /exports/vol1/     *(rw,async,no_subtree_check,no_root_squash)
      /exports/vol1/vol2 *(rw,async,no_subtree_check,no_root_squash)
      
      I've been running chown across the entire mountpoint twice in a row to
      hit this problem.  The first run succeeds, but the second one fails with
      the circular directory warning along with:
      
      anna@client ~ % dmesg
      [Apr 3 14:28] NFS: server 192.168.100.204 error: fileid changed
                    fsid 0:39: expected fileid 0x100080, got 0x80
      
      WHere 0x80 is the mountpoint's fileid and 0x100080 is the mounted-on
      fileid.
      
      This patch fixes the issue by requesting an updated mounted-on fileid
      from the server during nfs_update_inode(), and then checking that the
      fileid stored in the nfs_inode matches either the fileid or mounted-on
      fileid returned by the server.
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      ea96d1ec
    • J
      nfs: fix high load average due to callback thread sleeping · 5d05e54a
      Jeff Layton 提交于
      Chuck pointed out a problem that crept in with commit 6ffa30d3 (nfs:
      don't call blocking operations while !TASK_RUNNING). Linux counts tasks
      in uninterruptible sleep against the load average, so this caused the
      system's load average to be pinned at at least 1 when there was a
      NFSv4.1+ mount active.
      
      Not a huge problem, but it's probably worth fixing before we get too
      many complaints about it. This patch converts the code back to use
      TASK_INTERRUPTIBLE sleep, simply has it flush any signals on each loop
      iteration. In practice no one should really be signalling this thread at
      all, so I think this is reasonably safe.
      
      With this change, there's also no need to game the hung task watchdog so
      we can also convert the schedule_timeout call back to a normal schedule.
      
      Cc: <stable@vger.kernel.org>
      Reported-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
      Tested-by: NChuck Lever <chuck.lever@oracle.com>
      Fixes: commit 6ffa30d3 (“nfs: don't call blocking . . .”)
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      5d05e54a
    • A
      NFS: Reduce time spent holding the i_mutex during fallocate() · f830f7dd
      Anna Schumaker 提交于
      At the very least, we should not be taking the i_mutex until after
      checking if the server even supports ALLOCATE or DEALLOCATE, allowing
      v4.0 or v4.1 to exit without potentially waiting on a lock.
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      f830f7dd
    • A
      NFS: Don't zap caches on fallocate() · 9a51940b
      Anna Schumaker 提交于
      This patch adds a GETATTR to the end of ALLOCATE and DEALLOCATE
      operations so we can set the updated inode size and change attribute
      directly.  DEALLOCATE will still need to release pagecache pages, so
      nfs42_proc_deallocate() now calls truncate_pagecache_range() before
      contacting the server.
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      9a51940b
  2. 28 3月, 2015 16 次提交
  3. 13 3月, 2015 2 次提交
  4. 06 3月, 2015 3 次提交
    • Q
      Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref. · dd9ef135
      Quentin Casasnovas 提交于
      Improper arithmetics when calculting the address of the extended ref could
      lead to an out of bounds memory read and kernel panic.
      Signed-off-by: NQuentin Casasnovas <quentin.casasnovas@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      cc: stable@vger.kernel.org # v3.7+
      Signed-off-by: NChris Mason <clm@fb.com>
      dd9ef135
    • F
      Btrfs: fix data loss in the fast fsync path · 3a8b36f3
      Filipe Manana 提交于
      When using the fast file fsync code path we can miss the fact that new
      writes happened since the last file fsync and therefore return without
      waiting for the IO to finish and write the new extents to the fsync log.
      
      Here's an example scenario where the fsync will miss the fact that new
      file data exists that wasn't yet durably persisted:
      
      1. fs_info->last_trans_committed == N - 1 and current transaction is
         transaction N (fs_info->generation == N);
      
      2. do a buffered write;
      
      3. fsync our inode, this clears our inode's full sync flag, starts
         an ordered extent and waits for it to complete - when it completes
         at btrfs_finish_ordered_io(), the inode's last_trans is set to the
         value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
         btrfs_set_inode_last_trans);
      
      4. transaction N is committed, so fs_info->last_trans_committed is now
         set to the value N and fs_info->generation remains with the value N;
      
      5. do another buffered write, when this happens btrfs_file_write_iter
         sets our inode's last_trans to the value N + 1 (that is
         fs_info->generation + 1 == N + 1);
      
      6. transaction N + 1 is started and fs_info->generation now has the
         value N + 1;
      
      7. transaction N + 1 is committed, so fs_info->last_trans_committed
         is set to the value N + 1;
      
      8. fsync our inode - because it doesn't have the full sync flag set,
         we only start the ordered extent, we don't wait for it to complete
         (only in a later phase) therefore its last_trans field has the
         value N + 1 set previously by btrfs_file_write_iter(), and so we
         have:
      
             inode->last_trans <= fs_info->last_trans_committed
                 (N + 1)              (N + 1)
      
         Which made us not log the last buffered write and exit the fsync
         handler immediately, returning success (0) to user space and resulting
         in data loss after a crash.
      
      This can actually be triggered deterministically and the following excerpt
      from a testcase I made for xfstests triggers the issue. It moves a dummy
      file across directories and then fsyncs the old parent directory - this
      is just to trigger a transaction commit, so moving files around isn't
      directly related to the issue but it was chosen because running 'sync' for
      example does more than just committing the current transaction, as it
      flushes/waits for all file data to be persisted. The issue can also happen
      at random periods, since the transaction kthread periodicaly commits the
      current transaction (about every 30 seconds by default).
      The body of the test is:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our main test file 'foo', the one we check for data loss.
        # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
        # bit from its flags (btrfs inode specific flags).
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                        -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now create one other file and 2 directories. We will move this second file
        # from one directory to the other later because it forces btrfs to commit its
        # currently open transaction if we fsync the old parent directory. This is
        # necessary to trigger the data loss bug that affected btrfs.
        mkdir $SCRATCH_MNT/testdir_1
        touch $SCRATCH_MNT/testdir_1/bar
        mkdir $SCRATCH_MNT/testdir_2
      
        # Make sure everything is durably persisted.
        sync
      
        # Write more 8Kb of data to our file.
        $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Move our 'bar' file into a new directory.
        mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
      
        # Fsync our first directory. Because it had a file moved into some other
        # directory, this made btrfs commit the currently open transaction. This is
        # a condition necessary to trigger the data loss bug.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
      
        # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
        # data we wrote previously to be persisted and available if a crash happens.
        # This did not happen with btrfs, because of the transaction commit that
        # happened when we fsynced the parent directory.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Now check that all data we wrote before are available.
        echo "File content after log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        status=0
        exit
      
      The expected golden output for the test, which is what we get with this
      fix applied (or when running against ext3/4 and xfs), is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 8192/8192 bytes at offset 8192
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
        *
        0040000
      
      Without this fix applied, the output shows the test file does not have
      the second 8Kb extent that we successfully fsynced:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 8192/8192 bytes at offset 8192
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
      
      So fix this by skipping the fsync only if we're doing a full sync and
      if the inode's last_trans is <= fs_info->last_trans_committed, or if
      the inode is already in the log. Also remove setting the inode's
      last_trans in btrfs_file_write_iter since it's useless/unreliable.
      
      Also because btrfs_file_write_iter no longer sets inode->last_trans to
      fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
      bail out if last_trans is 0, otherwise something as simple as the following
      example wouldn't log the second write on the last fsync:
      
        1. write to file
      
        2. fsync file
      
        3. fsync file
             |--> btrfs_inode_in_log() returns true and it set last_trans to 0
      
        4. write to file
             |--> btrfs_file_write_iter() no longers sets last_trans, so it
                  remained with a value of 0
        5. fsync
             |--> inode->last_trans == 0, so it bails out without logging the
                  second write
      
      A test case for xfstests will be sent soon.
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3a8b36f3
    • J
      Btrfs: remove extra run_delayed_refs in update_cowonly_root · f5c0a122
      Josef Bacik 提交于
      This got added with my dirty_bgs patch, it's not needed.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f5c0a122
  5. 05 3月, 2015 1 次提交
    • J
      locks: fix fasync_struct memory leak in lease upgrade/downgrade handling · 0164bf02
      Jeff Layton 提交于
      Commit 8634b51f (locks: convert lease handling to file_lock_context)
      introduced a regression in the handling of lease upgrade/downgrades.
      
      In the event that we already have a lease on a file and are going to
      either upgrade or downgrade it, we skip doing any list insertion or
      deletion and simply re-call lm_setup on the existing lease.
      
      As of commit 8634b51f however, we end up calling lm_setup on the
      lease that was passed in, instead of on the existing lease. This causes
      us to leak the fasync_struct that was allocated in the event that there
      was not already an existing one (as it always appeared that there
      wasn't one).
      
      Fixes: 8634b51f (locks: convert lease handling to file_lock_context)
      Reported-and-Tested-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
      0164bf02
  6. 04 3月, 2015 4 次提交
  7. 03 3月, 2015 9 次提交