1. 21 Mar, 2015 1 commit
  2. 20 Mar, 2015 1 commit
    • C
      Subject: nfsd: don't recursively call nfsd4_cb_layout_fail · 133d5582
      Christoph Hellwig authored
      Due to a merge error when creating c5c707f9 ("nfsd: implement pNFS
      layout recalls"), we recursively call nfsd4_cb_layout_fail from itself,
      leading to stack overflows.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Fixes: c5c707f9 ("nfsd: implement pNFS layout recalls")
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      ---
       fs/nfsd/nfs4layouts.c | 2 --
       1 file changed, 2 deletions(-)
      
      diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c
      index 3c1bfa1..1028a06 100644
      --- a/fs/nfsd/nfs4layouts.c
      +++ b/fs/nfsd/nfs4layouts.c
      @@ -587,8 +587,6 @@ nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
      
       	rpc_ntop((struct sockaddr *)&clp->cl_addr, addr_str, sizeof(addr_str));
      
      -	nfsd4_cb_layout_fail(ls);
      -
       	printk(KERN_WARNING
       		"nfsd: client %s failed to respond to layout recall. "
       		"  Fencing..\n", addr_str);
      --
      1.9.1
  3. 19 Mar, 2015 1 commit
    • T
      fuse: explicitly set /dev/fuse file's private_data · 94e4fe2c
      Tom Van Braeckel authored
      The misc subsystem (which is used for /dev/fuse) initializes private_data to
      point to the misc device when a driver has registered a custom open file
      operation, and initializes it to NULL when a custom open file operation has
      *not* been provided.
      
      This subtle quirk is confusing, to the point where kernel code registers
      *empty* file open operations to have private_data point to the misc device
      structure. And it leads to bugs, where the addition or removal of a custom open
      file operation surprisingly changes the initial contents of a file's
      private_data structure.
      
      So to simplify things in the misc subsystem, a patch [1] has been proposed to
      *always* set the private_data to point to the misc device, instead of only
      doing this when a custom open file operation has been registered.
      
      But before this patch can be applied we need to modify drivers that make the
      assumption that a misc device file's private_data is initialized to NULL
      because they didn't register a custom open file operation, so they don't rely
      on this assumption anymore. FUSE uses private_data to store the fuse_conn and
      errors out if this is not initialized to NULL at mount time.
      
      Hence, we now set a file's private_data to NULL explicitly, to be independent
      of whatever value the misc subsystem initializes it to by default.
      
      [1] https://lkml.org/lkml/2014/12/4/939

      Reported-by: Giedrius Statkevicius <giedriuswork@gmail.com>
      Reported-by: Thierry Reding <thierry.reding@gmail.com>
      Signed-off-by: Tom Van Braeckel <tomvanbraeckel@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
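The idea behind the fuse change can be sketched in userspace. This is only a model under stated assumptions: `struct file_model`, `dev_open_model()` and `mount_check_model()` are hypothetical stand-ins invented for illustration, not kernel APIs, and the mount-time check is reduced to a NULL test.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct file; only the field
 * discussed in the commit message is modelled. */
struct file_model {
	void *private_data;
};

/* Models an open handler that does not trust the caller's initial value:
 * private_data is reset to NULL whatever the framework stored there. */
static int dev_open_model(struct file_model *file)
{
	file->private_data = NULL;
	return 0;
}

/* Models the mount-time check: mounting is refused when private_data is
 * not NULL, because that would look like an existing connection. */
static int mount_check_model(const struct file_model *file)
{
	return file->private_data == NULL ? 0 : -1;
}

/* Usage example; every assertion below holds. */
static void fuse_private_data_demo(void)
{
	int misc_device = 0; /* placeholder for whatever the framework stored */
	struct file_model f = { .private_data = &misc_device };

	assert(mount_check_model(&f) == -1); /* a stale value would break mount */
	assert(dev_open_model(&f) == 0);
	assert(mount_check_model(&f) == 0);  /* the explicit reset makes it safe */
}
```

With the explicit reset in the open handler, the result no longer depends on what the framework chose as the default.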
  4. 18 Mar, 2015 8 commits
    • H
      ovl: upper fs should not be R/O · 71cbad7e
      hujianyang authored
      After multi-lower layer support was introduced, users can mount a r/o
      partition as the leftmost lowerdir instead of using it as upperdir.
      A r/o upperdir may cause an error like
      
      	overlayfs: failed to create directory ./workdir/work
      
      during mount.
      
      This patch checks the *s_flags* of the upper fs and returns an error if
      it is a r/o partition. The check of *upper_mnt->mnt_sb->s_flags*
      can be removed now.
      
      This patch also removes
      
      	/* FIXME: workdir is not needed for a R/O mount */
      
      from ovl_fill_super() because:
      
      1) for upper fs r/o case
      Setting a r/o partition as upper is prevented, no need to care about
      workdir in this case.
      
      2) for "mount overlay -o ro" with a r/w upper fs case
      Users could remount overlayfs to r/w in this case, so workdir should
      not be omitted.
      Signed-off-by: hujianyang <hujianyang@huawei.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
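A minimal sketch of the read-only check described in this commit, assuming a simplified superblock model. `struct sb_model`, `check_upper_writable()` and `SKETCH_MS_RDONLY` are hypothetical names; the constant mirrors the value of the kernel's MS_RDONLY bit so the sketch builds without kernel headers.

```c
#include <assert.h>
#include <errno.h>

/* Assumption: same bit value as MS_RDONLY, redefined under another name. */
#define SKETCH_MS_RDONLY 1UL

/* Hypothetical stand-in for the superblock of the upper filesystem. */
struct sb_model {
	unsigned long s_flags;
};

/* Returns 0 when the upper filesystem is writable, -EINVAL when it is
 * read-only and therefore unusable as an upperdir. */
static int check_upper_writable(const struct sb_model *upper_sb)
{
	if (upper_sb->s_flags & SKETCH_MS_RDONLY)
		return -EINVAL;
	return 0;
}

/* Usage example; every assertion below holds. */
static void ovl_upper_ro_demo(void)
{
	struct sb_model ro = { .s_flags = SKETCH_MS_RDONLY };
	struct sb_model rw = { .s_flags = 0 };

	assert(check_upper_writable(&ro) == -EINVAL);
	assert(check_upper_writable(&rw) == 0);
}
```

Rejecting the mount up front gives the user a clear error instead of the later workdir creation failure.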
    • H
      ovl: check lowerdir amount for non-upper mount · 6be4506e
      hujianyang authored
      The recently added multi-lower layer mount support allows upperdir and
      workdir to be omitted, which means overlayfs can be mounted with only one
      lowerdir directory. This makes no sense and is potentially risky.
      
      This patch checks the total number of lower directories to prevent
      mounting overlayfs with only one directory.
      
      Also, an error message is added to indicate that the lower directories
      exceed the OVL_MAX_STACK limit.
      Signed-off-by: hujianyang <hujianyang@huawei.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
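The two conditions described above can be sketched as a small validation function. This is an illustration only: `check_lower_count()` is a hypothetical helper, and `SKETCH_OVL_MAX_STACK` uses an assumed value of 500, since the commit message names the OVL_MAX_STACK limit without stating its value.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumption: illustrative limit; the commit message does not give it. */
#define SKETCH_OVL_MAX_STACK 500

/* Validates the number of lower directories for a mount request. */
static int check_lower_count(unsigned int numlower, bool have_upper)
{
	if (numlower > SKETCH_OVL_MAX_STACK)
		return -EINVAL; /* too many lower directories */
	if (!have_upper && numlower == 1)
		return -EINVAL; /* a single directory is not an overlay */
	return 0;
}

/* Usage example; every assertion below holds. */
static void ovl_lower_count_demo(void)
{
	assert(check_lower_count(1, false) == -EINVAL);
	assert(check_lower_count(1, true) == 0);
	assert(check_lower_count(2, false) == 0);
	assert(check_lower_count(SKETCH_OVL_MAX_STACK + 1, true) == -EINVAL);
}
```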
    • H
      ovl: print error message for invalid mount options · bead55ef
      hujianyang authored
      Like other filesystems, overlayfs should print an error message when it
      encounters an incorrect mount option.
      
      With this patch, an invalid option is reported clearly.
      Reported-by: Fabian Sturm <fabian.sturm@aduu.de>
      Signed-off-by: hujianyang <hujianyang@huawei.com>
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
    • J
      Btrfs: fix outstanding_extents accounting in DIO · e1cbbfa5
      Josef Bacik authored
      We are keeping track of how many extents we need to reserve properly based on
      the amount we want to write, but we were still incrementing outstanding_extents
      if we wrote less than what we requested.  This isn't quite right since we will
      be limited to our max extent size.  So instead let's do something horrible!  Keep
      track of how many outstanding_extents we reserved, and decrement each time we
      allocate an extent.  If we use our entire reserve make sure to jack up
      outstanding_extents on the inode so the accounting works out properly.  Thanks,
      Reported-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • J
      Btrfs: add sanity test for outstanding_extents accounting · 6a3891c5
      Josef Bacik authored
      I introduced a regression wrt outstanding_extents accounting.  These are tricky
      areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE
      at any time.  So add sanity tests to cover the various conditions that are
      tricky in order to make sure we don't introduce regressions in the future.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • J
      Btrfs: just free dummy extent buffers · bcb7e449
      Josef Bacik authored
      If we fail during our sanity tests we could get NULL deref's because we unload
      the module before the dummy extent buffers are free'd via RCU.  So check for
      this case and just free the things directly.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
    • J
      Btrfs: account merges/splits properly · ba117213
      Josef Bacik authored
      My fix
      
      Btrfs: fix merge delalloc logic
      
      only fixed half of the problems, it didn't fix the case where we have two large
      extents on either side and then join them together with a new small extent.  We
      need to instead keep track of how many extents we have accounted for with each
      side of the new extent, and then see how many extents we need for the new large
      extent.  If they match then we know we need to keep our reservation, otherwise
      we need to drop our reservation.  This shows up with a case like this
      
      [BTRFS_MAX_EXTENT_SIZE+4K][4K HOLE][BTRFS_MAX_EXTENT_SIZE+4K]
      
      Previously the logic would have said that the number of extents required for the
      new size (3) is larger than the number of extents required for the largest side
      (2) therefore we need to keep our reservation.  But this isn't the case, since
      both sides require a reservation of 2 which leads to 4 for the whole range
      currently reserved, but we only need 3, so we need to drop one of the
      reservations.  The same problem existed for splits, we'd think we only need 3
      extents when creating the hole but in reality we need 4.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
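The arithmetic in the example above can be worked through in a few lines. This is a sketch, assuming a maximum extent size of 128 MiB (`SKETCH_MAX_EXTENT_SIZE` is an illustrative constant, and `extents_needed()` is a hypothetical helper that rounds a byte range up to whole extents); the counts 2, 4 and 3 come out the same for any maximum larger than 12K.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: illustrative value for the maximum extent size. */
#define SKETCH_MAX_EXTENT_SIZE (128ULL * 1024 * 1024)
#define SKETCH_4K 4096ULL

/* Number of extents needed to cover `size` bytes, rounding up. */
static uint64_t extents_needed(uint64_t size)
{
	return (size + SKETCH_MAX_EXTENT_SIZE - 1) / SKETCH_MAX_EXTENT_SIZE;
}

/* Usage example; every assertion below holds. */
static void merge_accounting_demo(void)
{
	uint64_t side = SKETCH_MAX_EXTENT_SIZE + SKETCH_4K; /* one large side */
	uint64_t merged = side + SKETCH_4K + side;          /* hole filled in */

	assert(extents_needed(side) == 2);                        /* per side */
	assert(extents_needed(side) + extents_needed(side) == 4); /* reserved */
	assert(extents_needed(merged) == 3);                      /* needed   */
}
```

Four extents are accounted for across the two sides, yet the merged range needs only three, so one reservation has to be dropped.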
    • K
      pagemap: do not leak physical addresses to non-privileged userspace · ab676b7d
      Kirill A. Shutemov authored
      As pointed out by a recent post [1] on exploiting DRAM physical imperfection,
      /proc/PID/pagemap exposes sensitive information which can be used to mount
      attacks.
      
      This disallows anybody without CAP_SYS_ADMIN from reading the pagemap.
      
      [1] http://googleprojectzero.blogspot.com/2015/03/exploiting-dram-rowhammer-bug-to-gain.html
      
      [ Eventually we might want to do anything more finegrained, but for now
        this is the simple model.   - Linus ]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mark Seaborn <mseaborn@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 17 Mar, 2015 2 commits
  6. 14 Mar, 2015 7 commits
  7. 13 Mar, 2015 3 commits
    • S
      fanotify: fix event filtering with FAN_ONDIR set · b3c1030d
      Suzuki K. Poulose authored
      With FAN_ONDIR set, the user can end up getting events that it hasn't
      marked.  This was revealed by a fanotify04 testcase failure on
      Linux-4.0-rc1, and is a regression from 3.19, revealed with 66ba93c0
      ("fanotify: don't set FAN_ONDIR implicitly on a marks ignored mask").
      
         # /opt/ltp/testcases/bin/fanotify04
         [ ... ]
        fanotify04    7  TPASS  :  event generated properly for type 100000
        fanotify04    8  TFAIL  :  fanotify04.c:147: got unexpected event 30
        fanotify04    9  TPASS  :  No event as expected
      
      The testcase adds the following marks: FAN_OPEN | FAN_ONDIR for
      a fanotify on a dir.  It then does an open(), followed by a close(), of the
      directory and expects to see an event FAN_OPEN(0x20).  However,
      fanotify returns (FAN_OPEN|FAN_CLOSE_NOWRITE(0x10)).  This happens due to
      a flaw in the check for event_mask in fanotify_should_send_event(), which
      does:
      
      	if (event_mask & marks_mask & ~marks_ignored_mask)
      		return true;
      
      where, event_mask == (FAN_ONDIR | FAN_CLOSE_NOWRITE),
             marks_mask == (FAN_ONDIR | FAN_OPEN),
             marks_ignored_mask == 0
      
      Fix this by masking the outgoing events to the user, as we already take
      care of FAN_ONDIR and FAN_EVENT_ON_CHILD.
      Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com>
      Tested-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
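The mask arithmetic from this commit message can be reproduced in userspace. The sketch below assumes the fix amounts to excluding the FAN_ONDIR and FAN_EVENT_ON_CHILD flag bits before testing the masks; `SKETCH_OUTGOING` and both helper functions are hypothetical names, and the flag values are redefined locally so that no fanotify header is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values redefined locally for the sketch. */
#define FAN_CLOSE_NOWRITE  0x00000010u
#define FAN_OPEN           0x00000020u
#define FAN_EVENT_ON_CHILD 0x08000000u
#define FAN_ONDIR          0x40000000u

/* Assumption: "outgoing" events are everything except the two flag bits. */
#define SKETCH_OUTGOING (~(FAN_ONDIR | FAN_EVENT_ON_CHILD))

/* The check quoted in the commit message: FAN_ONDIR alone can match. */
static bool should_send_buggy(uint32_t event_mask, uint32_t marks_mask,
			      uint32_t marks_ignored_mask)
{
	return (event_mask & marks_mask & ~marks_ignored_mask) != 0;
}

/* Same check with the flag bits masked out first. */
static bool should_send_fixed(uint32_t event_mask, uint32_t marks_mask,
			      uint32_t marks_ignored_mask)
{
	return (event_mask & SKETCH_OUTGOING & marks_mask &
		~marks_ignored_mask) != 0;
}

/* Usage example; every assertion below holds. */
static void fanotify_mask_demo(void)
{
	uint32_t marks = FAN_ONDIR | FAN_OPEN;
	uint32_t close_ev = FAN_ONDIR | FAN_CLOSE_NOWRITE;
	uint32_t open_ev = FAN_ONDIR | FAN_OPEN;

	assert(should_send_buggy(close_ev, marks, 0));  /* unwanted event */
	assert(!should_send_fixed(close_ev, marks, 0)); /* filtered out   */
	assert(should_send_fixed(open_ev, marks, 0));   /* still reported */
}
```

In the buggy variant the only overlapping bit is FAN_ONDIR, which is a modifier rather than an event, so the close event slips through.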
    • R
      nilfs2: fix deadlock of segment constructor during recovery · 283ee148
      Ryusuke Konishi authored
      According to a report from Yuxuan Shui, nilfs2 in kernel 3.19 got stuck
      during recovery at mount time.  The code path that caused the deadlock was
      as follows:
      
        nilfs_fill_super()
          load_nilfs()
            nilfs_salvage_orphan_logs()
              * Do roll-forwarding, attach segment constructor for recovery,
                and kick it.
      
              nilfs_segctor_thread()
                nilfs_segctor_thread_construct()
                 * A lock is held with nilfs_transaction_lock()
                   nilfs_segctor_do_construct()
                     nilfs_segctor_drop_written_files()
                       iput()
                         iput_final()
                           write_inode_now()
                             writeback_single_inode()
                               __writeback_single_inode()
                                 do_writepages()
                                   nilfs_writepage()
                                     nilfs_construct_dsync_segment()
                                       nilfs_transaction_lock() --> deadlock
      
      This can happen if commit 7ef3ff2f ("nilfs2: fix deadlock of segment
      constructor over I_SYNC flag") is applied and roll-forward recovery was
      performed at mount time.  The roll-forward recovery can happen if datasync
      write is done and the file system crashes immediately after that.  For
      instance, we can reproduce the issue with the following steps:
      
       < nilfs2 is mounted on /nilfs (device: /dev/sdb1) >
       # dd if=/dev/zero of=/nilfs/test bs=4k count=1 && sync
       # dd if=/dev/zero of=/nilfs/test conv=notrunc oflag=dsync bs=4k
       count=1 && reboot -nfh
       < the system will immediately reboot >
       # mount -t nilfs2 /dev/sdb1 /nilfs
      
      The deadlock occurs because iput() can run segment constructor through
      writeback_single_inode() if MS_ACTIVE flag is not set on sb->s_flags.  The
      above commit changed segment constructor so that it calls iput()
      asynchronously for inodes with i_nlink == 0, but that change was
      imperfect.
      
      This fixes another deadlock by deferring iput() in the segment constructor
      even when the mount is not finished, that is, when the
      MS_ACTIVE flag is not set.
      Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Reported-by: Yuxuan Shui <yshuiv7@gmail.com>
      Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      ocfs2: make append_dio an incompat feature · 18d585f0
      Mark Fasheh authored
      It turns out that making this feature ro_compat isn't quite enough to
      prevent accidental corruption on mount from older kernels.  Ocfs2 (like
      other file systems) will process orphaned inodes even when the user mounts
      in 'ro' mode.  So for the case of a filesystem not knowing the append_dio
      feature, mounting the filesystem could result in orphaned-for-dio files
      being deleted, which we clearly don't want.
      
      So instead, turn this into an incompat flag.
      
      Btw, this is kind of my fault - initially I asked that we add a flag to
      cover the feature and even suggested that we use an ro flag.  It wasn't
      until I was looking through our commits for v4.0-rc1 that I realized we
      actually want this to be incompat.
      Signed-off-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
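The difference between the two flag classes can be sketched with a toy mount check. This assumes the conventional semantics (an unknown incompat bit refuses the mount, an unknown ro_compat bit still permits a read-only mount); `check_features()`, `enum mount_verdict` and the bit value `SKETCH_FEATURE_APPEND_DIO` are hypothetical and are not taken from the ocfs2 sources.

```c
#include <assert.h>
#include <stdint.h>

enum mount_verdict { MOUNT_RW_OK, MOUNT_RO_ONLY, MOUNT_REFUSED };

/* Hypothetical feature bit used only for this illustration. */
#define SKETCH_FEATURE_APPEND_DIO 0x1u

/* Compares the feature bits stored on disk with the bits a given kernel
 * understands and decides how the filesystem may be mounted. */
static enum mount_verdict check_features(uint32_t fs_incompat,
					 uint32_t fs_ro_compat,
					 uint32_t known_incompat,
					 uint32_t known_ro_compat)
{
	if (fs_incompat & ~known_incompat)
		return MOUNT_REFUSED;
	if (fs_ro_compat & ~known_ro_compat)
		return MOUNT_RO_ONLY;
	return MOUNT_RW_OK;
}

/* Usage example; every assertion below holds. */
static void ocfs2_feature_demo(void)
{
	/* Older kernel that knows neither flag: ro_compat lets it mount 'ro',
	 * and orphan processing would still run on that mount. */
	assert(check_features(0, SKETCH_FEATURE_APPEND_DIO, 0, 0) == MOUNT_RO_ONLY);

	/* As an incompat flag, the same older kernel refuses the mount. */
	assert(check_features(SKETCH_FEATURE_APPEND_DIO, 0, 0, 0) == MOUNT_REFUSED);

	/* A kernel that knows the feature mounts normally. */
	assert(check_features(SKETCH_FEATURE_APPEND_DIO, 0,
			      SKETCH_FEATURE_APPEND_DIO, 0) == MOUNT_RW_OK);
}
```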
  8. 06 Mar, 2015 3 commits
    • Q
      Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref. · dd9ef135
      Quentin Casasnovas authored
      Improper arithmetic when calculating the address of the extended ref could
      lead to an out of bounds memory read and kernel panic.
      Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.cz>
      cc: stable@vger.kernel.org # v3.7+
      Signed-off-by: Chris Mason <clm@fb.com>
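The commit message does not quote the faulty expression, so the following only illustrates one common way such address arithmetic goes wrong in C: casting a byte base to a struct pointer before adding a byte offset, which scales the offset by the struct size. `struct extref_model` and both helpers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record type standing in for an extended ref entry. */
struct extref_model {
	uint64_t parent;
	uint64_t index;
	uint16_t name_len;
};

/* Cast first, then add: the offset is scaled by sizeof(struct).
 * The caller must pass an array with more than cur_offset elements. */
static size_t scaled_distance(const struct extref_model *array,
			      size_t cur_offset)
{
	const unsigned char *base = (const unsigned char *)array;
	const struct extref_model *p =
		(const struct extref_model *)base + cur_offset;

	return (size_t)((const unsigned char *)p - base);
}

/* Add to the byte pointer: the offset is taken in bytes, as intended. */
static size_t byte_distance(const struct extref_model *array,
			    size_t cur_offset)
{
	const unsigned char *base = (const unsigned char *)array;
	const unsigned char *p = base + cur_offset;

	return (size_t)(p - base);
}

/* Usage example; every assertion below holds. */
static void extref_offset_demo(void)
{
	struct extref_model refs[4] = { { 0, 0, 0 } };

	assert(byte_distance(refs, 3) == 3);
	assert(scaled_distance(refs, 3) == 3 * sizeof(struct extref_model));
}
```

With a scaled offset, the computed address can land far beyond the item being parsed, which matches the kind of out of bounds read the message describes.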
    • F
      Btrfs: fix data loss in the fast fsync path · 3a8b36f3
      Filipe Manana authored
      When using the fast file fsync code path we can miss the fact that new
      writes happened since the last file fsync and therefore return without
      waiting for the IO to finish and write the new extents to the fsync log.
      
      Here's an example scenario where the fsync will miss the fact that new
      file data exists that wasn't yet durably persisted:
      
      1. fs_info->last_trans_committed == N - 1 and current transaction is
         transaction N (fs_info->generation == N);
      
      2. do a buffered write;
      
      3. fsync our inode, this clears our inode's full sync flag, starts
         an ordered extent and waits for it to complete - when it completes
         at btrfs_finish_ordered_io(), the inode's last_trans is set to the
         value N (via btrfs_update_inode_fallback -> btrfs_update_inode ->
         btrfs_set_inode_last_trans);
      
      4. transaction N is committed, so fs_info->last_trans_committed is now
         set to the value N and fs_info->generation remains with the value N;
      
      5. do another buffered write, when this happens btrfs_file_write_iter
         sets our inode's last_trans to the value N + 1 (that is
         fs_info->generation + 1 == N + 1);
      
      6. transaction N + 1 is started and fs_info->generation now has the
         value N + 1;
      
      7. transaction N + 1 is committed, so fs_info->last_trans_committed
         is set to the value N + 1;
      
      8. fsync our inode - because it doesn't have the full sync flag set,
         we only start the ordered extent, we don't wait for it to complete
         (only in a later phase) therefore its last_trans field has the
         value N + 1 set previously by btrfs_file_write_iter(), and so we
         have:
      
             inode->last_trans <= fs_info->last_trans_committed
                 (N + 1)              (N + 1)
      
         Which made us not log the last buffered write and exit the fsync
         handler immediately, returning success (0) to user space and resulting
         in data loss after a crash.
      
      This can actually be triggered deterministically and the following excerpt
      from a testcase I made for xfstests triggers the issue. It moves a dummy
      file across directories and then fsyncs the old parent directory - this
      is just to trigger a transaction commit, so moving files around isn't
      directly related to the issue but it was chosen because running 'sync' for
      example does more than just committing the current transaction, as it
      flushes/waits for all file data to be persisted. The issue can also happen
      at random periods, since the transaction kthread periodically commits the
      current transaction (about every 30 seconds by default).
      The body of the test is:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our main test file 'foo', the one we check for data loss.
        # By doing an fsync against our file, it makes btrfs clear the 'needs_full_sync'
        # bit from its flags (btrfs inode specific flags).
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" \
                        -c "fsync" $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Now create one other file and 2 directories. We will move this second file
        # from one directory to the other later because it forces btrfs to commit its
        # currently open transaction if we fsync the old parent directory. This is
        # necessary to trigger the data loss bug that affected btrfs.
        mkdir $SCRATCH_MNT/testdir_1
        touch $SCRATCH_MNT/testdir_1/bar
        mkdir $SCRATCH_MNT/testdir_2
      
        # Make sure everything is durably persisted.
        sync
      
        # Write more 8Kb of data to our file.
        $XFS_IO_PROG -c "pwrite -S 0xbb 8K 8K" $SCRATCH_MNT/foo | _filter_xfs_io
      
        # Move our 'bar' file into a new directory.
        mv $SCRATCH_MNT/testdir_1/bar $SCRATCH_MNT/testdir_2/bar
      
        # Fsync our first directory. Because it had a file moved into some other
        # directory, this made btrfs commit the currently open transaction. This is
        # a condition necessary to trigger the data loss bug.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir_1
      
        # Now fsync our main test file. If the fsync succeeds, we expect the 8Kb of
        # data we wrote previously to be persisted and available if a crash happens.
        # This did not happen with btrfs, because of the transaction commit that
        # happened when we fsynced the parent directory.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Now check that all data we wrote before are available.
        echo "File content after log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        status=0
        exit
      
      The expected golden output for the test, which is what we get with this
      fix applied (or when running against ext3/4 and xfs), is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 8192/8192 bytes at offset 8192
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
        *
        0040000
      
      Without this fix applied, the output shows the test file does not have
      the second 8Kb extent that we successfully fsynced:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 8192/8192 bytes at offset 8192
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
      
      So fix this by skipping the fsync only if we're doing a full sync and
      if the inode's last_trans is <= fs_info->last_trans_committed, or if
      the inode is already in the log. Also remove setting the inode's
      last_trans in btrfs_file_write_iter since it's useless/unreliable.
      
      Also because btrfs_file_write_iter no longer sets inode->last_trans to
      fs_info->generation + 1, don't set last_trans to 0 if we bail out and don't
      bail out if last_trans is 0, otherwise something as simple as the following
      example wouldn't log the second write on the last fsync:
      
        1. write to file
      
        2. fsync file
      
        3. fsync file
             |--> btrfs_inode_in_log() returns true and it set last_trans to 0
      
        4. write to file
             |--> btrfs_file_write_iter() no longer sets last_trans, so it
                  remained with a value of 0
        5. fsync
             |--> inode->last_trans == 0, so it bails out without logging the
                  second write
      
      A test case for xfstests will be sent soon.
      
      CC: <stable@vger.kernel.org>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
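The change to the skip condition can be modelled with a few fields. This is a sketch based only on the description above: `struct inode_model`, `skip_fsync_old()` and `skip_fsync_new()` are hypothetical names, and the old condition is reconstructed from the comparison shown in step 8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-inode state discussed in the message. */
struct inode_model {
	uint64_t last_trans; /* transaction that last touched the inode */
	bool in_log;         /* inode already present in the fsync log  */
};

/* Old behaviour: skip whenever last_trans is already committed. */
static bool skip_fsync_old(const struct inode_model *inode,
			   uint64_t last_trans_committed)
{
	return inode->in_log || inode->last_trans <= last_trans_committed;
}

/* New behaviour: the last_trans shortcut applies only to a full sync. */
static bool skip_fsync_new(const struct inode_model *inode,
			   uint64_t last_trans_committed, bool full_sync)
{
	return inode->in_log ||
	       (full_sync && inode->last_trans <= last_trans_committed);
}

/* Usage example following step 8 of the scenario; assertions hold. */
static void fast_fsync_demo(void)
{
	const uint64_t n = 10; /* arbitrary transaction id N */
	struct inode_model inode = { .last_trans = n + 1, .in_log = false };
	uint64_t committed = n + 1;

	assert(skip_fsync_old(&inode, committed));         /* write is lost  */
	assert(!skip_fsync_new(&inode, committed, false)); /* write is logged */
	assert(skip_fsync_new(&inode, committed, true));   /* full sync case  */
}
```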
    • J
      Btrfs: remove extra run_delayed_refs in update_cowonly_root · f5c0a122
      Josef Bacik authored
      This got added with my dirty_bgs patch, it's not needed.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  9. 05 Mar, 2015 1 commit
    • J
      locks: fix fasync_struct memory leak in lease upgrade/downgrade handling · 0164bf02
      Jeff Layton authored
      Commit 8634b51f (locks: convert lease handling to file_lock_context)
      introduced a regression in the handling of lease upgrade/downgrades.
      
      In the event that we already have a lease on a file and are going to
      either upgrade or downgrade it, we skip doing any list insertion or
      deletion and simply re-call lm_setup on the existing lease.
      
      As of commit 8634b51f however, we end up calling lm_setup on the
      lease that was passed in, instead of on the existing lease. This causes
      us to leak the fasync_struct that was allocated in the event that there
      was not already an existing one (as it always appeared that there
      wasn't one).
      
      Fixes: 8634b51f (locks: convert lease handling to file_lock_context)
      Reported-and-Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
  10. 04 Mar, 2015 4 commits
  11. 03 Mar, 2015 9 commits