1. 26 7月, 2019 15 次提交
    • D
      xfs: rename m_inotbt_nores to m_finobt_nores · 3a895cc0
      Darrick J. Wong 提交于
      commit e1f6ca11381588e3ef138c10de60eeb34cb8466a upstream.
      
      Rename this flag variable to imply more strongly that it's related to
      the free inode btree (finobt) operation.  No functional changes.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Suggested-by: NAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NLuis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      3a895cc0
    • D
      xfs: don't overflow xattr listent buffer · 2ab62234
      Darrick J. Wong 提交于
      commit 3b50086f0c0d78c144d9483fa292c1509c931b70 upstream.
      
      For VFS listxattr calls, xfs_xattr_put_listent calls
      __xfs_xattr_put_listent twice if it sees an attribute
      "trusted.SGI_ACL_FILE": once for that name, and again for
      "system.posix_acl_access".  Unfortunately, if we happen to run out of
      buffer space while emitting the first name, we set count to -1 (so that
      we can feed ERANGE to the caller).  The second invocation doesn't check that
      the context parameters make sense and overwrites the byte before the
      buffer, triggering a KASAN report:
      
      ==================================================================
      BUG: KASAN: slab-out-of-bounds in strncpy+0xb3/0xd0
      Write of size 1 at addr ffff88807fbd317f by task syz/1113
      
      CPU: 3 PID: 1113 Comm: syz Not tainted 5.0.0-rc6-xfsx #rc6
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
      Call Trace:
       dump_stack+0xcc/0x180
       print_address_description+0x6c/0x23c
       kasan_report.cold.3+0x1c/0x35
       strncpy+0xb3/0xd0
       __xfs_xattr_put_listent+0x1a9/0x2c0 [xfs]
       xfs_attr_list_int_ilocked+0x11af/0x1800 [xfs]
       xfs_attr_list_int+0x20c/0x2e0 [xfs]
       xfs_vn_listxattr+0x225/0x320 [xfs]
       listxattr+0x11f/0x1b0
       path_listxattr+0xbd/0x130
       do_syscall_64+0x139/0x560
      
      While we're at it we add an assert to the other put_listent to avoid
      this sort of thing ever happening to the attrlist_by_handle code.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Suggested-by: NAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NLuis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      2ab62234
    • D
      xfs: flush removing page cache in xfs_reflink_remap_prep · 1dc8b13c
      Dave Chinner 提交于
      commit 2c307174ab77e34645e75e12827646e044d273c3 upstream.
      
      On a sub-page block size filesystem, fsx is failing with a data
      corruption after a series of operations involving copying a file
      with the destination offset beyond EOF of the destination of the file:
      
      8093(157 mod 256): TRUNCATE DOWN        from 0x7a120 to 0x50000 ******WWWW
      8094(158 mod 256): INSERT 0x25000 thru 0x25fff  (0x1000 bytes)
      8095(159 mod 256): COPY 0x18000 thru 0x1afff    (0x3000 bytes) to 0x2f400
      8096(160 mod 256): WRITE    0x5da00 thru 0x651ff        (0x7800 bytes) HOLE
      8097(161 mod 256): COPY 0x2000 thru 0x5fff      (0x4000 bytes) to 0x6fc00
      
      The second copy here is beyond EOF, and it is to sub-page (4k) but
      block aligned (1k) offset. The clone runs the EOF zeroing, landing
      in a pre-existing post-eof delalloc extent. This zeroes the post-eof
      extents in the page cache just fine, dirtying the pages correctly.
      
      The problem is that xfs_reflink_remap_prep() now truncates the page
      cache over the range that it is copying it to, and rounds that down
      to cover the entire start page. This removes the dirty page over the
      delalloc extent from the page cache without having written it back.
      Hence later, when the page cache is flushed, the page at offset
      0x6f000 has not been written back and hence exposes stale data,
      which fsx trips over less than 10 operations later.
      
      Fix this by changing xfs_reflink_remap_prep() to use
      xfs_flush_unmap_range().
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NLuis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      1dc8b13c
    • D
      xfs: fix pagecache truncation prior to reflink · 788920d1
      Darrick J. Wong 提交于
      commit 4918ef4ea008cd2ff47eb852894e3f9b9047f4f3 upstream.
      
      Prior to remapping blocks, it is necessary to remove pages from the
      destination file's page cache.  Unfortunately, the truncation is not
      aggressive enough -- if page size > block size, we'll end up zeroing
      subpage blocks instead of removing them.  So, round the start offset
      down and the end offset up to page boundaries.  We already wrote all
      the dirty data so the larger range shouldn't be a problem.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NLuis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      788920d1
    • J
      coda: pass the host file in vma->vm_file on mmap · afa3e571
      Jan Harkes 提交于
      commit 7fa0a1da3dadfd9216df7745a1331fdaa0940d1c upstream.
      
      Patch series "Coda updates".
      
      The following patch series is a collection of various fixes for Coda,
      most of which were collected from linux-fsdevel or linux-kernel but
      which have as yet not found their way upstream.
      
      This patch (of 22):
      
      Various file systems expect that vma->vm_file points at their own file
      handle, several use file_inode(vma->vm_file) to get at their inode or
      use vma->vm_file->private_data.  However the way Coda wrapped mmap on a
      host file broke this assumption, vm_file was still pointing at the Coda
      file and the host file systems would scribble over Coda's inode and
      private file data.
      
      This patch fixes the incorrect expectation and wraps vm_ops->open and
      vm_ops->close to allow Coda to track when the vm_area_struct is
      destroyed so we still release the reference on the Coda file handle at
      the right time.
      
      Link: http://lkml.kernel.org/r/0e850c6e59c0b147dc2dcd51a3af004c948c3697.1558117389.git.jaharkes@cs.cmu.eduSigned-off-by: NJan Harkes <jaharkes@cs.cmu.edu>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Mikko Rapeli <mikko.rapeli@iki.fi>
      Cc: Sam Protsenko <semen.protsenko@linaro.org>
      Cc: Yann Droneaud <ydroneaud@opteya.com>
      Cc: Zhouyang Jia <jiazhouyang09@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      afa3e571
    • F
      Btrfs: add missing inode version, ctime and mtime updates when punching hole · 4bd95324
      Filipe Manana 提交于
      commit 179006688a7e888cbff39577189f2e034786d06a upstream.
      
      If the range for which we are punching a hole covers only part of a page,
      we end up updating the inode item but we skip the update of the inode's
      iversion, mtime and ctime. Fix that by ensuring we update those properties
      of the inode.
      
      A patch for fstests test case generic/059 that tests this as been sent
      along with this fix.
      
      Fixes: 2aaa6655 ("Btrfs: add hole punching")
      Fixes: e8c1c76e ("Btrfs: add missing inode update when punching hole")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4bd95324
    • F
      Btrfs: fix fsync not persisting dentry deletions due to inode evictions · fffedf5c
      Filipe Manana 提交于
      commit 803f0f64d17769071d7287d9e3e3b79a3e1ae937 upstream.
      
      In order to avoid searches on a log tree when unlinking an inode, we check
      if the inode being unlinked was logged in the current transaction, as well
      as the inode of its parent directory. When any of the inodes are logged,
      we proceed to delete directory items and inode reference items from the
      log, to ensure that if a subsequent fsync of only the inode being unlinked
      or only of the parent directory when the other is not fsync'ed as well,
      does not result in the entry still existing after a power failure.
      
      That check however is not reliable when one of the inodes involved (the
      one being unlinked or its parent directory's inode) is evicted, since the
      logged_trans field is transient, that is, it is not stored on disk, so it
      is lost when the inode is evicted and loaded into memory again (which is
      set to zero on load). As a consequence the checks currently being done by
      btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log() always
      return true if the inode was evicted before, regardless of the inode
      having been logged or not before (and in the current transaction), this
      results in the dentry being unlinked still existing after a log replay
      if after the unlink operation only one of the inodes involved is fsync'ed.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/foo
        $ xfs_io -c fsync /mnt/dir/foo
      
        # Keep an open file descriptor on our directory while we evict inodes.
        # We just want to evict the file's inode, the directory's inode must not
        # be evicted.
        $ ( cd /mnt/dir; while true; do :; done ) &
        $ pid=$!
      
        # Wait a bit to give time to background process to chdir to our test
        # directory.
        $ sleep 0.5
      
        # Trigger eviction of the file's inode.
        $ echo 2 > /proc/sys/vm/drop_caches
      
        # Unlink our file and fsync the parent directory. After a power failure
        # we don't expect to see the file anymore, since we fsync'ed the parent
        # directory.
        $ rm -f $SCRATCH_MNT/dir/foo
        $ xfs_io -c fsync /mnt/dir
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ ls /mnt/dir
        foo
        $
         --> file still there, unlink not persisted despite explicit fsync on dir
      
      Fix this by checking if the inode has the full_sync bit set in its runtime
      flags as well, since that bit is set everytime an inode is loaded from
      disk, or for other less common cases such as after a shrinking truncate
      or failure to allocate extent maps for holes, and gets cleared after the
      first fsync. Also consider the inode as possibly logged only if it was
      last modified in the current transaction (besides having the full_fsync
      flag set).
      
      Fixes: 3a5f1d45 ("Btrfs: Optimize btree walking while logging inodes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fffedf5c
    • F
      Btrfs: fix data loss after inode eviction, renaming it, and fsync it · 110850ff
      Filipe Manana 提交于
      commit d1d832a0b51dd9570429bb4b81b2a6c1759e681a upstream.
      
      When we log an inode, regardless of logging it completely or only that it
      exists, we always update it as logged (logged_trans and last_log_commit
      fields of the inode are updated). This is generally fine and avoids future
      attempts to log it from having to do repeated work that brings no value.
      
      However, if we write data to a file, then evict its inode after all the
      dealloc was flushed (and ordered extents completed), rename the file and
      fsync it, we end up not logging the new extents, since the rename may
      result in logging that the inode exists in case the parent directory was
      logged before. The following reproducer shows and explains how this can
      happen:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/dir
        $ touch /mnt/dir/foo
        $ touch /mnt/dir/bar
      
        # Do a direct IO write instead of a buffered write because with a
        # buffered write we would need to make sure dealloc gets flushed and
        # complete before we do the inode eviction later, and we can not do that
        # from user space with call to things such as sync(2) since that results
        # in a transaction commit as well.
        $ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar
      
        # Keep the directory dir in use while we evict inodes. We want our file
        # bar's inode to be evicted but we don't want our directory's inode to
        # be evicted (if it were evicted too, we would not be able to reproduce
        # the issue since the first fsync below, of file foo, would result in a
        # transaction commit.
        $ ( cd /mnt/dir; while true; do :; done ) &
        $ pid=$!
      
        # Wait a bit to give time for the background process to chdir.
        $ sleep 0.1
      
        # Evict all inodes, except the inode for the directory dir because it is
        # currently in use by our background process.
        $ echo 2 > /proc/sys/vm/drop_caches
      
        # fsync file foo, which ends up persisting information about the parent
        # directory because it is a new inode.
        $ xfs_io -c fsync /mnt/dir/foo
      
        # Rename bar, this results in logging that this inode exists (inode item,
        # names, xattrs) because the parent directory is in the log.
        $ mv /mnt/dir/bar /mnt/dir/baz
      
        # Now fsync baz, which ends up doing absolutely nothing because of the
        # rename operation which logged that the inode exists only.
        $ xfs_io -c fsync /mnt/dir/baz
      
        <power failure>
      
        $ mount /dev/sdb /mnt
        $ od -t x1 -A d /mnt/dir/baz
        0000000
      
          --> Empty file, data we wrote is missing.
      
      Fix this by not updating last_sub_trans of an inode when we are logging
      only that it exists and the inode was not yet logged since it was loaded
      from disk (full_sync bit set), this is enough to make btrfs_inode_in_log()
      return false for this scenario and make us log the inode. The logged_trans
      of the inode is still always setsince that alone is used to track if names
      need to be deleted as part of unlink operations.
      
      Fixes: 257c62e1 ("Btrfs: avoid tree log commit when there are no changes")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      110850ff
    • R
      fs/proc/proc_sysctl.c: fix the default values of i_uid/i_gid on /proc/sys inodes. · 2c7b50c7
      Radoslaw Burny 提交于
      commit 5ec27ec735ba0477d48c80561cc5e856f0c5dfaf upstream.
      
      Normally, the inode's i_uid/i_gid are translated relative to s_user_ns,
      but this is not a correct behavior for proc.  Since sysctl permission
      check in test_perm is done against GLOBAL_ROOT_[UG]ID, it makes more
      sense to use these values in u_[ug]id of proc inodes.  In other words:
      although uid/gid in the inode is not read during test_perm, the inode
      logically belongs to the root of the namespace.  I have confirmed this
      with Eric Biederman at LPC and in this thread:
        https://lore.kernel.org/lkml/87k1kzjdff.fsf@xmission.com
      
      Consequences
      ============
      
      Since the i_[ug]id values of proc nodes are not used for permissions
      checks, this change usually makes no functional difference.  However, it
      causes an issue in a setup where:
      
       * a namespace container is created without root user in container -
         hence the i_[ug]id of proc nodes are set to INVALID_[UG]ID
      
       * container creator tries to configure it by writing /proc/sys files,
         e.g. writing /proc/sys/kernel/shmmax to configure shared memory limit
      
      Kernel does not allow to open an inode for writing if its i_[ug]id are
      invalid, making it impossible to write shmmax and thus - configure the
      container.
      
      Using a container with no root mapping is apparently rare, but we do use
      this configuration at Google.  Also, we use a generic tool to configure
      the container limits, and the inability to write any of them causes a
      failure.
      
      History
      =======
      
      The invalid uids/gids in inodes first appeared due to 81754357 (fs:
      Update i_[ug]id_(read|write) to translate relative to s_user_ns).
      However, AFAIK, this did not immediately cause any issues.  The
      inability to write to these "invalid" inodes was only caused by a later
      commit 0bd23d09 (vfs: Don't modify inodes with a uid or gid unknown
      to the vfs).
      
      Tested: Used a repro program that creates a user namespace without any
      mapping and stat'ed /proc/$PID/root/proc/sys/kernel/shmmax from outside.
      Before the change, it shows the overflow uid, with the change it's 0.
      The overflow uid indicates that the uid in the inode is not correct and
      thus it is not possible to open the file for writing.
      
      Link: http://lkml.kernel.org/r/20190708115130.250149-1-rburny@google.com
      Fixes: 0bd23d09 ("vfs: Don't modify inodes with a uid or gid unknown to the vfs")
      Signed-off-by: NRadoslaw Burny <rburny@google.com>
      Acked-by: NLuis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Seth Forshee <seth.forshee@canonical.com>
      Cc: John Sperbeck <jsperbeck@google.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2c7b50c7
    • T
      pnfs: Fix a problem where we gratuitously start doing I/O through the MDS · 0b174bac
      Trond Myklebust 提交于
      commit 58bbeab425c6c5e318f5b6ae31d351331ddfb34b upstream.
      
      If the client has to stop in pnfs_update_layout() to wait for another
      layoutget to complete, it currently exits and defaults to I/O through
      the MDS if the layoutget was successful.
      
      Fixes: d03360aa ("pNFS: Ensure we return the error if someone kills...")
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      Cc: stable@vger.kernel.org # v4.20+
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0b174bac
    • T
      pNFS: Fix a typo in pnfs_update_layout · f64ff591
      Trond Myklebust 提交于
      commit 400417b05f3ec0531544ca5f94e64d838d8b8849 upstream.
      
      We're supposed to wait for the outstanding layout count to go to zero,
      but that got lost somehow.
      
      Fixes: d03360aa ("pNFS: Ensure we return the error if someone...")
      Reported-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f64ff591
    • T
      pnfs/flexfiles: Fix PTR_ERR() dereferences in ff_layout_track_ds_error · 603e7497
      Trond Myklebust 提交于
      commit 8e04fdfadda75a849c649f7e50fe7d97772e1fcb upstream.
      
      mirror->mirror_ds can be NULL if uninitialised, but can contain
      a PTR_ERR() if call to GETDEVICEINFO failed.
      
      Fixes: 65990d1a ("pNFS/flexfiles: Fix a deadlock on LAYOUTGET")
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      Cc: stable@vger.kernel.org # 4.10+
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      603e7497
    • T
      NFSv4: Handle the special Linux file open access mode · 5347e619
      Trond Myklebust 提交于
      commit 44942b4e457beda00981f616402a1a791e8c616e upstream.
      
      According to the open() manpage, Linux reserves the access mode 3
      to mean "check for read and write permission on the file and return
      a file descriptor that can't be used for reading or writing."
      
      Currently, the NFSv4 code will ask the server to open the file,
      and will use an incorrect share access mode of 0. Since it has
      an incorrect share access mode, the client later forgets to send
      a corresponding close, meaning it can leak stateids on the server.
      
      Fixes: ce4ef7c0 ("NFS: Split out NFS v4 file operations")
      Cc: stable@vger.kernel.org # 3.6+
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5347e619
    • T
      blkcg, writeback: dead memcgs shouldn't contribute to writeback ownership arbitration · b7d66bbc
      Tejun Heo 提交于
      [ Upstream commit 6631142229005e1b1c311a09efe9fb3cfdac8559 ]
      
      wbc_account_io() collects information on cgroup ownership of writeback
      pages to determine which cgroup should own the inode.  Pages can stay
      associated with dead memcgs but we want to avoid attributing IOs to
      dead blkcgs as much as possible as the association is likely to be
      stale.  However, currently, pages associated with dead memcgs
      contribute to the accounting delaying and/or confusing the
      arbitration.
      
      Fix it by ignoring pages associated with dead memcgs.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b7d66bbc
    • E
      fscrypt: clean up some BUG_ON()s in block encryption/decryption · 71b029a5
      Eric Biggers 提交于
      [ Upstream commit eeacfdc68a104967162dfcba60f53f6f5b62a334 ]
      
      Replace some BUG_ON()s with WARN_ON_ONCE() and returning an error code,
      and move the check for len divisible by FS_CRYPTO_BLOCK_SIZE into
      fscrypt_crypt_block() so that it's done for both encryption and
      decryption, not just encryption.
      Reviewed-by: NChandan Rajendra <chandan@linux.ibm.com>
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      71b029a5
  2. 21 7月, 2019 1 次提交
    • D
      afs: Fix uninitialised spinlock afs_volume::cb_break_lock · fdfff855
      David Howells 提交于
      [ Upstream commit 90fa9b64523a645a97edc0bdcf2d74759957eeee ]
      
      Fix the cb_break_lock spinlock in afs_volume struct by initialising it when
      the volume record is allocated.
      
      Also rename the lock to cb_v_break_lock to distinguish it from the lock of
      the same name in the afs_server struct.
      
      Without this, the following trace may be observed when a volume-break
      callback is received:
      
        INFO: trying to register non-static key.
        the code is fine but needs lockdep annotation.
        turning off the locking correctness validator.
        CPU: 2 PID: 50 Comm: kworker/2:1 Not tainted 5.2.0-rc1-fscache+ #3045
        Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
        Workqueue: afs SRXAFSCB_CallBack
        Call Trace:
         dump_stack+0x67/0x8e
         register_lock_class+0x23b/0x421
         ? check_usage_forwards+0x13c/0x13c
         __lock_acquire+0x89/0xf73
         lock_acquire+0x13b/0x166
         ? afs_break_callbacks+0x1b2/0x3dd
         _raw_write_lock+0x2c/0x36
         ? afs_break_callbacks+0x1b2/0x3dd
         afs_break_callbacks+0x1b2/0x3dd
         ? trace_event_raw_event_afs_server+0x61/0xac
         SRXAFSCB_CallBack+0x11f/0x16c
         process_one_work+0x2c5/0x4ee
         ? worker_thread+0x234/0x2ac
         worker_thread+0x1d8/0x2ac
         ? cancel_delayed_work_sync+0xf/0xf
         kthread+0x11f/0x127
         ? kthread_park+0x76/0x76
         ret_from_fork+0x24/0x30
      
      Fixes: 68251f0a ("afs: Fix whole-volume callback handling")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      fdfff855
  3. 14 7月, 2019 4 次提交
  4. 10 7月, 2019 4 次提交
    • P
      nfsd: Fix overflow causing non-working mounts on 1 TB machines · 8129a10c
      Paul Menzel 提交于
      commit 3b2d4dcf71c4a91b420f835e52ddea8192300a3b upstream.
      
      Since commit 10a68cdf (nfsd: fix performance-limiting session
      calculation) (Linux 5.1-rc1 and 4.19.31), shares from NFS servers with
      1 TB of memory cannot be mounted anymore. The mount just hangs on the
      client.
      
      The gist of commit 10a68cdf is the change below.
      
          -avail = clamp_t(int, avail, slotsize, avail/3);
          +avail = clamp_t(int, avail, slotsize, total_avail/3);
      
      Here are the macros.
      
          #define min_t(type, x, y)       __careful_cmp((type)(x), (type)(y), <)
          #define clamp_t(type, val, lo, hi) min_t(type, max_t(type, val, lo), hi)
      
      `total_avail` is 8,434,659,328 on the 1 TB machine. `clamp_t()` casts
      the values to `int`, which for 32-bit integers can only hold values
      −2,147,483,648 (−2^31) through 2,147,483,647 (2^31 − 1).
      
      `avail` (in the function signature) is just 65536, so that no overflow
      was happening. Before the commit the assignment would result in 21845,
      and `num = 4`.
      
      When using `total_avail`, it is causing the assignment to be
      18446744072226137429 (printed as %lu), and `num` is then 4164608182.
      
      My next guess is, that `nfsd_drc_mem_used` is then exceeded, and the
      server thinks there is no memory available any more for this client.
      
      Updating the arguments of `clamp_t()` and `min_t()` to `unsigned long`
      fixes the issue.
      
      Now, `avail = 65536` (before commit 10a68cdf `avail = 21845`), but
      `num = 4` remains the same.
      
      Fixes: c54f24e338ed (nfsd: fix performance-limiting session calculation)
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8129a10c
    • J
      f2fs: don't access node/meta inode mapping after iput · e2379b04
      Jaegeuk Kim 提交于
      [ Upstream commit 7c77bf7de1574ac7a31a2b76f4927404307d13e7 ]
      
      This fixes wrong access of address spaces of node and meta inodes after iput.
      
      Fixes: 60aa4d5536ab ("f2fs: fix use-after-free issue when accessing sbi->stat_info")
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      e2379b04
    • N
      btrfs: Ensure replaced device doesn't have pending chunk allocation · fb814f21
      Nikolay Borisov 提交于
      commit debd1c065d2037919a7da67baf55cc683fee09f0 upstream.
      
      Recent FITRIM work, namely bbbf7243d62d ("btrfs: combine device update
      operations during transaction commit") combined the way certain
      operations are recoded in a transaction. As a result an ASSERT was added
      in dev_replace_finish to ensure the new code works correctly.
      Unfortunately I got reports that it's possible to trigger the assert,
      meaning that during a device replace it's possible to have an unfinished
      chunk allocation on the source device.
      
      This is supposed to be prevented by the fact that a transaction is
      committed before finishing the replace oepration and alter acquiring the
      chunk mutex. This is not sufficient since by the time the transaction is
      committed and the chunk mutex acquired it's possible to allocate a chunk
      depending on the workload being executed on the replaced device. This
      bug has been present ever since device replace was introduced but there
      was never code which checks for it.
      
      The correct way to fix is to ensure that there is no pending device
      modification operation when the chunk mutex is acquire and if there is
      repeat transaction commit. Unfortunately it's not possible to just
      exclude the source device from btrfs_fs_devices::dev_alloc_list since
      this causes ENOSPC to be hit in transaction commit.
      
      Fixing that in another way would need to add special cases to handle the
      last writes and forbid new ones. The looped transaction fix is more
      obvious, and can be easily backported. The runtime of dev-replace is
      long so there's no noticeable delay caused by that.
      Reported-by: NDavid Sterba <dsterba@suse.com>
      Fixes: 391cd9df ("Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace")
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      fb814f21
    • E
      fs/userfaultfd.c: disable irqs for fault_pending and event locks · 052b3181
      Eric Biggers 提交于
      commit cbcfa130a911c613a1d9d921af2eea171c414172 upstream.
      
      When IOCB_CMD_POLL is used on a userfaultfd, aio_poll() disables IRQs
      and takes kioctx::ctx_lock, then userfaultfd_ctx::fd_wqh.lock.
      
      This may have to wait for userfaultfd_ctx::fd_wqh.lock to be released by
      userfaultfd_ctx_read(), which in turn can be waiting for
      userfaultfd_ctx::fault_pending_wqh.lock or
      userfaultfd_ctx::event_wqh.lock.
      
      But elsewhere the fault_pending_wqh and event_wqh locks are taken with
      IRQs enabled.  Since the IRQ handler may take kioctx::ctx_lock, lockdep
      reports that a deadlock is possible.
      
      Fix it by always disabling IRQs when taking the fault_pending_wqh and
      event_wqh locks.
      
      Commit ae62c16e105a ("userfaultfd: disable irqs when taking the
      waitqueue lock") didn't fix this because it only accounted for the
      fd_wqh lock, not the other locks nested inside it.
      
      Link: http://lkml.kernel.org/r/20190627075004.21259-1-ebiggers@kernel.org
      Fixes: bfe4037e ("aio: implement IOCB_CMD_POLL")
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Reported-by: syzbot+fab6de82892b6b9c6191@syzkaller.appspotmail.com
      Reported-by: syzbot+53c0b767f7ca0dc0c451@syzkaller.appspotmail.com
      Reported-by: syzbot+a3accb352f9c22041cfa@syzkaller.appspotmail.com
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.19+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      052b3181
  5. 03 7月, 2019 4 次提交
  6. 25 6月, 2019 8 次提交
    • S
      SMB3: retry on STATUS_INSUFFICIENT_RESOURCES instead of failing write · 5293c79c
      Steve French 提交于
      commit 8d526d62db907e786fd88948c75d1833d82bd80e upstream.
      
      Some servers such as Windows 10 will return STATUS_INSUFFICIENT_RESOURCES
      as the number of simultaneous SMB3 requests grows (even though the client
      has sufficient credits).  Return EAGAIN on STATUS_INSUFFICIENT_RESOURCES
      so that we can retry writes which fail with this status code.
      
      This (for example) fixes large file copies to Windows 10 on fast networks.
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      CC: Stable <stable@vger.kernel.org>
      Reviewed-by: NRonnie Sahlberg <lsahlber@redhat.com>
      Reviewed-by: NPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5293c79c
    • N
      btrfs: start readahead also in seed devices · c592b1c3
      Naohiro Aota 提交于
      commit c4e0540d0ad49c8ceab06cceed1de27c4fe29f6e upstream.
      
      Currently, btrfs does not consult seed devices to start readahead. As a
      result, if readahead zone is added to the seed devices, btrfs_reada_wait()
      indefinitely wait for the reada_ctl to finish.
      
      You can reproduce the hung by modifying btrfs/163 to have larger initial
      file size (e.g. xfs_io pwrite 4M instead of current 256K).
      
      Fixes: 7414a03f ("btrfs: initial readahead code and prototypes")
      Cc: stable@vger.kernel.org # 3.2+: ce7791ff: Btrfs: fix race between readahead and device replace/removal
      Cc: stable@vger.kernel.org # 3.2+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c592b1c3
    • A
      ovl: fix bogus -Wmaybe-unitialized warning · 0319ef1d
      Arnd Bergmann 提交于
      [ Upstream commit 1dac6f5b0ed2601be21bb4e27a44b0c3e667b7f4 ]
      
      gcc gets a bit confused by the logic in ovl_setup_trap() and
      can't figure out whether the local 'trap' variable in the caller
      was initialized or not:
      
      fs/overlayfs/super.c: In function 'ovl_fill_super':
      fs/overlayfs/super.c:1333:4: error: 'trap' may be used uninitialized in this function [-Werror=maybe-uninitialized]
          iput(trap);
          ^~~~~~~~~~
      fs/overlayfs/super.c:1312:17: note: 'trap' was declared here
      
      Reword slightly to make it easier for the compiler to understand.
      
      Fixes: 146d62e5a586 ("ovl: detect overlapping layers")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      0319ef1d
    • M
      ovl: don't fail with disconnected lower NFS · 639e8c2f
      Miklos Szeredi 提交于
      [ Upstream commit 9179c21dc6ed1c993caa5fe4da876a6765c26af7 ]
      
      NFS mounts can be disconnected from fs root.  Don't fail the overlapping
      layer check because of this.
      
      The check is not authoritative anyway, since topology can change during or
      after the check.
      Reported-by: NAntti Antinoja <antti@fennosys.fi>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Fixes: 146d62e5a586 ("ovl: detect overlapping layers")
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      639e8c2f
    • A
      ovl: detect overlapping layers · f1c5aa5e
      Amir Goldstein 提交于
      [ Upstream commit 146d62e5a5867fbf84490d82455718bfb10fe824 ]
      
      Overlapping overlay layers are not supported and can cause unexpected
      behavior, but overlayfs does not currently check or warn about these
      configurations.
      
      User is not supposed to specify the same directory for upper and
      lower dirs or for different lower layers and user is not supposed to
      specify directories that are descendants of each other for overlay
      layers, but that is exactly what this zysbot repro did:
      
          https://syzkaller.appspot.com/x/repro.syz?x=12c7a94f400000
      
      Moving layer root directories into other layers while overlayfs
      is mounted could also result in unexpected behavior.
      
      This commit places "traps" in the overlay inode hash table.
      Those traps are dummy overlay inodes that are hashed by the layers
      root inodes.
      
      On mount, the hash table trap entries are used to verify that overlay
      layers are not overlapping.  While at it, we also verify that overlay
      layers are not overlapping with directories "in-use" by other overlay
      instances as upperdir/workdir.
      
      On lookup, the trap entries are used to verify that overlay layers
      root inodes have not been moved into other layers after mount.
      
      Some examples:
      
      $ ./run --ov --samefs -s
      ...
      ( mkdir -p base/upper/0/u base/upper/0/w base/lower lower upper mnt
        mount -o bind base/lower lower
        mount -o bind base/upper upper
        mount -t overlay none mnt ...
              -o lowerdir=lower,upperdir=upper/0/u,workdir=upper/0/w)
      
      $ umount mnt
      $ mount -t overlay none mnt ...
              -o lowerdir=base,upperdir=upper/0/u,workdir=upper/0/w
      
        [   94.434900] overlayfs: overlapping upperdir path
        mount: mount overlay on mnt failed: Too many levels of symbolic links
      
      $ mount -t overlay none mnt ...
              -o lowerdir=upper/0/u,upperdir=upper/0/u,workdir=upper/0/w
      
        [  151.350132] overlayfs: conflicting lowerdir path
        mount: none is already mounted or mnt busy
      
      $ mount -t overlay none mnt ...
              -o lowerdir=lower:lower/a,upperdir=upper/0/u,workdir=upper/0/w
      
        [  201.205045] overlayfs: overlapping lowerdir path
        mount: mount overlay on mnt failed: Too many levels of symbolic links
      
      $ mount -t overlay none mnt ...
              -o lowerdir=lower,upperdir=upper/0/u,workdir=upper/0/w
      $ mv base/upper/0/ base/lower/
      $ find mnt/0
        mnt/0
        mnt/0/w
        find: 'mnt/0/w/work': Too many levels of symbolic links
        find: 'mnt/0/u': Too many levels of symbolic links
      
      Reported-by: syzbot+9c69c282adc4edd2b540@syzkaller.appspotmail.com
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      f1c5aa5e
    • A
      ovl: make i_ino consistent with st_ino in more cases · a00f405e
      Amir Goldstein 提交于
      [ Upstream commit 6dde1e42f497b2d4e22466f23019016775607947 ]
      
      Relax the condition that overlayfs supports nfs export, to require
      that i_ino is consistent with st_ino/d_ino.
      
      It is enough to require that st_ino and d_ino are consistent.
      
      This fixes the failure of xfstest generic/504, due to mismatch of
      st_ino to inode number in the output of /proc/locks.
      
      Fixes: 12574a9f ("ovl: consistent i_ino for non-samefs with xino")
      Cc: <stable@vger.kernel.org> # v4.19
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      a00f405e
    • A
      ovl: fix wrong flags check in FS_IOC_FS[SG]ETXATTR ioctls · d6623379
      Amir Goldstein 提交于
      [ Upstream commit 941d935ac7636911a3fd8fa80e758e52b0b11e20 ]
      
      The ioctl argument was parsed as the wrong type.
      
      Fixes: b21d9c435f93 ("ovl: support the FS_IOC_FS[SG]ETXATTR ioctls")
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d6623379
    • A
      ovl: support the FS_IOC_FS[SG]ETXATTR ioctls · 3cb5d7fa
      Amir Goldstein 提交于
      [ Upstream commit b21d9c435f935014d3e3fa6914f2e4fbabb0e94d ]
      
      They are the extended version of FS_IOC_FS[SG]ETFLAGS ioctls.
      xfs_io -c "chattr <flags>" uses the new ioctls for setting flags.
      
      This used to work in kernel pre v4.19, before stacked file ops
      introduced the ovl_ioctl whitelist.
      Reported-by: NDave Chinner <david@fromorbit.com>
      Fixes: d1d04ef8 ("ovl: stack file ops")
      Cc: <stable@vger.kernel.org> # v4.19
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      3cb5d7fa
  7. 22 6月, 2019 3 次提交
  8. 19 6月, 2019 1 次提交
    • R
      f2fs: fix to avoid accessing xattr across the boundary · ae3787d4
      Randall Huang 提交于
      [ Upstream commit 2777e654371dd4207a3a7f4fb5fa39550053a080 ]
      
      When we traverse xattr entries via __find_xattr(),
      if the raw filesystem content is faked or any hardware failure occurs,
      out-of-bound error can be detected by KASAN.
      Fix the issue by introducing boundary check.
      
      [   38.402878] c7   1827 BUG: KASAN: slab-out-of-bounds in f2fs_getxattr+0x518/0x68c
      [   38.402891] c7   1827 Read of size 4 at addr ffffffc0b6fb35dc by task
      [   38.402935] c7   1827 Call trace:
      [   38.402952] c7   1827 [<ffffff900809003c>] dump_backtrace+0x0/0x6bc
      [   38.402966] c7   1827 [<ffffff9008090030>] show_stack+0x20/0x2c
      [   38.402981] c7   1827 [<ffffff900871ab10>] dump_stack+0xfc/0x140
      [   38.402995] c7   1827 [<ffffff9008325c40>] print_address_description+0x80/0x2d8
      [   38.403009] c7   1827 [<ffffff900832629c>] kasan_report_error+0x198/0x1fc
      [   38.403022] c7   1827 [<ffffff9008326104>] kasan_report_error+0x0/0x1fc
      [   38.403037] c7   1827 [<ffffff9008325000>] __asan_load4+0x1b0/0x1b8
      [   38.403051] c7   1827 [<ffffff90085fcc44>] f2fs_getxattr+0x518/0x68c
      [   38.403066] c7   1827 [<ffffff90085fc508>] f2fs_xattr_generic_get+0xb0/0xd0
      [   38.403080] c7   1827 [<ffffff9008395708>] __vfs_getxattr+0x1f4/0x1fc
      [   38.403096] c7   1827 [<ffffff9008621bd0>] inode_doinit_with_dentry+0x360/0x938
      [   38.403109] c7   1827 [<ffffff900862d6cc>] selinux_d_instantiate+0x2c/0x38
      [   38.403123] c7   1827 [<ffffff900861b018>] security_d_instantiate+0x68/0x98
      [   38.403136] c7   1827 [<ffffff9008377db8>] d_splice_alias+0x58/0x348
      [   38.403149] c7   1827 [<ffffff900858d16c>] f2fs_lookup+0x608/0x774
      [   38.403163] c7   1827 [<ffffff900835eacc>] lookup_slow+0x1e0/0x2cc
      [   38.403177] c7   1827 [<ffffff9008367fe0>] walk_component+0x160/0x520
      [   38.403190] c7   1827 [<ffffff9008369ef4>] path_lookupat+0x110/0x2b4
      [   38.403203] c7   1827 [<ffffff900835dd38>] filename_lookup+0x1d8/0x3a8
      [   38.403216] c7   1827 [<ffffff900835eeb0>] user_path_at_empty+0x54/0x68
      [   38.403229] c7   1827 [<ffffff9008395f44>] SyS_getxattr+0xb4/0x18c
      [   38.403241] c7   1827 [<ffffff9008084200>] el0_svc_naked+0x34/0x38
      Signed-off-by: NRandall Huang <huangrandall@google.com>
      [Jaegeuk Kim: Fix wrong ending boundary]
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ae3787d4