1. 20 Dec, 2018 6 commits
  2. 17 Dec, 2018 17 commits
    • dax: Check page->mapping isn't NULL · 384f1811
      Matthew Wilcox committed
      commit c93db7bb6ef3251e0ea48ade311d3e9942748e1c upstream.
      
      If we race with inode destroy, it's possible for page->mapping to be
      NULL before we even enter this routine, as well as after having slept
      waiting for the dax entry to become unlocked.
      
      Fixes: c2a7d2a1 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
      Cc: <stable@vger.kernel.org>
      Reported-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • flexfiles: enforce per-mirror stateid only for v4 DSes · 111758f7
      Tigran Mkrtchyan committed
      commit 320f35b7bf8cccf1997ca3126843535e1b95e9c4 upstream.
      
      Since commit bb21ce0ad227 we always enforce per-mirror stateid.
      However, this makes sense only for v4+ servers.
      Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ocfs2: fix potential use after free · a31da26a
      Pan Bian committed
      [ Upstream commit 164f7e586739d07eb56af6f6d66acebb11f315c8 ]
      
      ocfs2_get_dentry() calls iput(inode) to drop the reference count of
      inode, and if the reference count hits 0, inode is freed.  However, in
      this function, it then reads inode->i_generation, which may result in a
      use after free bug.  Move the put operation later.
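The ordering problem described above can be illustrated with a minimal userspace sketch. It assumes a hypothetical refcounted `struct obj` standing in for the inode; `obj_new()`, `obj_put()` and `obj_generation_and_put()` are illustrative names, not ocfs2 or VFS functions. The only point is that any field needed later is copied out before the last reference is dropped.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted object standing in for an inode. */
struct obj {
    int refcount;
    unsigned int generation;
};

/* Counts freed objects; exists only so the sketch can be checked. */
static int freed_count;

static struct obj *obj_new(unsigned int generation)
{
    struct obj *o = malloc(sizeof(*o));

    if (o) {
        o->refcount = 1;
        o->generation = generation;
    }
    return o;
}

/* Drop one reference; the object is freed when the count reaches zero. */
static void obj_put(struct obj *o)
{
    assert(o->refcount > 0);
    if (--o->refcount == 0) {
        free(o);
        freed_count++;
    }
}

/*
 * Safe ordering: read the field first, then drop the reference.
 * Reading o->generation after obj_put(o) could touch freed memory.
 */
static unsigned int obj_generation_and_put(struct obj *o)
{
    unsigned int gen = o->generation;

    obj_put(o);
    return gen;
}
```

The same "copy, then put" ordering applies to any refcounted object whose last reference may be the one being dropped.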
      
      Link: http://lkml.kernel.org/r/1543109237-110227-1-git-send-email-bianpan2016@163.com
      Fixes: 781f200c ("ocfs2: Remove masklog ML_EXPORT.")
      Signed-off-by: Pan Bian <bianpan2016@163.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • hfsplus: do not free node before using · ab31765e
      Pan Bian committed
      [ Upstream commit c7d7d620dcbd2a1c595092280ca943f2fced7bbd ]
      
      hfs_bmap_free() frees node via hfs_bnode_put(node).  However it then
      reads node->this when dumping error message on an error path, which may
      result in a use-after-free bug.  This patch frees node only when it is
      never used.
      
      Link: http://lkml.kernel.org/r/1543053441-66942-1-git-send-email-bianpan2016@163.com
      Signed-off-by: Pan Bian <bianpan2016@163.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Viacheslav Dubeyko <slava@dubeyko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • hfs: do not free node before using · f7cbec75
      Pan Bian committed
      [ Upstream commit ce96a407adef126870b3f4a1b73529dd8aa80f49 ]
      
      hfs_bmap_free() frees the node via hfs_bnode_put(node).  However, it
      then reads node->this when dumping error message on an error path, which
      may result in a use-after-free bug.  This patch frees the node only when
      it is never again used.
      
      Link: http://lkml.kernel.org/r/1542963889-128825-1-git-send-email-bianpan2016@163.com
      Fixes: a1185ffa2fc ("HFS rewrite")
      Signed-off-by: Pan Bian <bianpan2016@163.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
      Cc: Viacheslav Dubeyko <slava@dubeyko.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • ocfs2: fix deadlock caused by ocfs2_defrag_extent() · 6aab48ae
      Larry Chen committed
      [ Upstream commit e21e57445a64598b29a6f629688f9b9a39e7242a ]
      
      ocfs2_defrag_extent may fall into deadlock.
      
      ocfs2_ioctl_move_extents
          ocfs2_ioctl_move_extents
            ocfs2_move_extents
              ocfs2_defrag_extent
                ocfs2_lock_allocators_move_extents
      
                  ocfs2_reserve_clusters
                    inode_lock GLOBAL_BITMAP_SYSTEM_INODE
      
      	  __ocfs2_flush_truncate_log
                    inode_lock GLOBAL_BITMAP_SYSTEM_INODE
      
      As the backtrace above shows, ocfs2_reserve_clusters() will call inode_lock
      against the global bitmap if the local allocator does not have sufficient
      clusters.  Once the global bitmap can meet the demand,
      ocfs2_reserve_clusters() will return success with the global bitmap locked.

      After ocfs2_reserve_clusters(), if the truncate log is full,
      __ocfs2_flush_truncate_log() will definitely fall into deadlock because
      it needs to inode_lock the global bitmap, which has already been locked.
      
      To fix this bug, we could remove from
      ocfs2_lock_allocators_move_extents() the code which intends to lock
      global allocator, and put the removed code after
      __ocfs2_flush_truncate_log().
      
      ocfs2_lock_allocators_move_extents() is called from two places: one is
      here, and the other does not need the data allocator context, which means
      this patch does not affect that other caller.
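The lock-ordering issue can be sketched in userspace as follows. This is a simplified model under stated assumptions: `struct simple_lock` is a hypothetical non-recursive lock standing in for the global bitmap inode lock, `lock_acquire()` returns false where a real blocking lock would deadlock on itself, and the truncate log is assumed to be full so the flush always runs. None of these helpers are ocfs2 functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical non-recursive lock standing in for the global bitmap lock. */
struct simple_lock {
    bool held;
};

/* Returns false where a real blocking lock would deadlock on itself. */
static bool lock_acquire(struct simple_lock *l)
{
    if (l->held)
        return false;
    l->held = true;
    return true;
}

static void lock_release(struct simple_lock *l)
{
    assert(l->held);
    l->held = false;
}

/* Models reserving clusters: on success the lock stays held. */
static bool reserve_clusters(struct simple_lock *bitmap)
{
    return lock_acquire(bitmap);
}

/* Models flushing a full truncate log: takes and releases the same lock. */
static bool flush_truncate_log(struct simple_lock *bitmap)
{
    if (!lock_acquire(bitmap))
        return false;
    lock_release(bitmap);
    return true;
}

/* Old order: reserve (lock stays held), then flush, which cannot proceed. */
static bool defrag_old_order(struct simple_lock *bitmap)
{
    return reserve_clusters(bitmap) && flush_truncate_log(bitmap);
}

/* New order: flush first, then reserve with the lock free. */
static bool defrag_new_order(struct simple_lock *bitmap)
{
    return flush_truncate_log(bitmap) && reserve_clusters(bitmap);
}
```

In this model the old order fails at the flush step, while the new order completes and leaves the bitmap lock held for the subsequent allocation.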
      
      Link: http://lkml.kernel.org/r/20181101071422.14470-1-lchen@suse.com
      Signed-off-by: Larry Chen <lchen@suse.com>
      Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • fscache, cachefiles: remove redundant variable 'cache' · 1f925643
      Colin Ian King committed
      [ Upstream commit 31ffa563833576bd49a8bf53120568312755e6e2 ]
      
      Variable 'cache' is being assigned but is never used hence it is
      redundant and can be removed.
      
      Cleans up clang warning:
      warning: variable 'cache' set but not used [-Wunused-but-set-variable]
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • cachefiles: Explicitly cast enumerated type in put_object · d8bf97a0
      Nathan Chancellor committed
      [ Upstream commit b7e768b7e3522695ed36dcb48ecdcd344bd30a9b ]
      
      Clang warns when one enumerated type is implicitly converted to another.
      
      fs/cachefiles/namei.c:247:50: warning: implicit conversion from
      enumeration type 'enum cachefiles_obj_ref_trace' to different
      enumeration type 'enum fscache_obj_ref_trace' [-Wenum-conversion]
              cache->cache.ops->put_object(&xobject->fscache,
      cachefiles_obj_put_wait_retry);
      
      Silence this warning by explicitly casting to fscache_obj_ref_trace,
      which is also done in put_object.
      Reported-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • fscache: fix race between enablement and dropping of object · 02bd7b74
      NeilBrown committed
      [ Upstream commit c5a94f434c82529afda290df3235e4d85873c5b4 ]
      
      It was observed that a process blocked indefinitely in
      __fscache_read_or_alloc_page(), waiting for FSCACHE_COOKIE_LOOKING_UP
      to be cleared via fscache_wait_for_deferred_lookup().

      At this time, ->backing_objects was empty, which would normally prevent
      __fscache_read_or_alloc_page() from getting to the point of waiting.
      This implies that ->backing_objects was cleared *after*
      __fscache_read_or_alloc_page() was entered.
      
      When an object is "killed" and then "dropped",
      FSCACHE_COOKIE_LOOKING_UP is cleared in fscache_lookup_failure(), then
      KILL_OBJECT and DROP_OBJECT are "called" and only in DROP_OBJECT is
      ->backing_objects cleared.  This leaves a window where
      something else can set FSCACHE_COOKIE_LOOKING_UP and
      __fscache_read_or_alloc_page() can start waiting, before
      ->backing_objects is cleared.
      
      There is some uncertainty in this analysis, but it seems to fit the
      observations.  Adding the wake in this patch will be handled correctly
      by __fscache_read_or_alloc_page(), as it checks if ->backing_objects
      is empty again, after waiting.

      The customer who reported the hang also reports that the hang cannot be
      reproduced with this fix.
      
      The backtrace for the blocked process looked like:
      
      PID: 29360  TASK: ffff881ff2ac0f80  CPU: 3   COMMAND: "zsh"
       #0 [ffff881ff43efbf8] schedule at ffffffff815e56f1
       #1 [ffff881ff43efc58] bit_wait at ffffffff815e64ed
       #2 [ffff881ff43efc68] __wait_on_bit at ffffffff815e61b8
       #3 [ffff881ff43efca0] out_of_line_wait_on_bit at ffffffff815e625e
       #4 [ffff881ff43efd08] fscache_wait_for_deferred_lookup at ffffffffa04f2e8f [fscache]
       #5 [ffff881ff43efd18] __fscache_read_or_alloc_page at ffffffffa04f2ffe [fscache]
       #6 [ffff881ff43efd58] __nfs_readpage_from_fscache at ffffffffa0679668 [nfs]
       #7 [ffff881ff43efd78] nfs_readpage at ffffffffa067092b [nfs]
       #8 [ffff881ff43efda0] generic_file_read_iter at ffffffff81187a73
       #9 [ffff881ff43efe50] nfs_file_read at ffffffffa066544b [nfs]
      #10 [ffff881ff43efe70] __vfs_read at ffffffff811fc756
      #11 [ffff881ff43efee8] vfs_read at ffffffff811fccfa
      #12 [ffff881ff43eff18] sys_read at ffffffff811fda62
      #13 [ffff881ff43eff50] entry_SYSCALL_64_fastpath at ffffffff815e986e
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • afs: Fix validation/callback interaction · 52da87f0
      David Howells committed
      [ Upstream commit ae3b7361dc0ee9a425bf7d77ce211f533500b39b ]
      
      When afs_validate() is called to validate a vnode (inode), there are two
      unhandled cases in the fastpath at the top of the function:
      
       (1) If the vnode is promised (AFS_VNODE_CB_PROMISED is set), the break
           counters match and the data has expired, then there's an implicit case
           in which the vnode needs revalidating.
      
           This has no consequences since the default "valid = false" set at the
           top of the function happens to do the right thing.
      
       (2) If the vnode is not promised and it hasn't been deleted
           (AFS_VNODE_DELETED is not set) then there's a default case we're not
           handling in which the vnode is invalid.  If the vnode is invalid, we
           need to bring cb_s_break and cb_v_break up to date before we refetch
           the status.
      
           As a consequence, once the server loses track of the client
           (ie. sufficient time has passed since we last sent it an operation),
           it will send us a CB.InitCallBackState* operation when we next try to
           talk to it.  This calls afs_init_callback_state() which increments
           afs_server::cb_s_break, but this then doesn't propagate to the
           afs_vnode record.
      
           The result being that every afs_validate() call thereafter sends a
           status fetch operation to the server.
      
      Clarify and fix this by:
      
       (A) Setting valid in all the branches rather than initialising it at the
           top so that the compiler catches where we've missed.
      
       (B) Restructuring the logic in the 'promised' branch so that we set valid
           to false if the callback is due to expire (or has expired) and so that
           the final case is that the vnode is still valid.
      
       (C) Adding an else-statement that ups cb_s_break and cb_v_break if the
           promised and deleted cases don't match.
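The branch structure described in (A) to (C) can be sketched as below. This is a heavily simplified, hypothetical model, not the afs code: `struct vnode_model` and `vnode_is_valid()` are illustrative names, expiry is reduced to a single timestamp comparison, and treating a deleted vnode as needing no refetch is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the state the fastpath inspects. */
struct vnode_model {
    bool promised;            /* models AFS_VNODE_CB_PROMISED */
    bool deleted;             /* models AFS_VNODE_DELETED */
    unsigned int cb_s_break;  /* vnode's copy of the server break counter */
    unsigned int cb_v_break;  /* vnode's copy of the volume break counter */
    long cb_expires_at;       /* callback expiry time, arbitrary units */
};

static bool vnode_is_valid(struct vnode_model *v,
                           unsigned int server_s_break,
                           unsigned int volume_v_break,
                           long now)
{
    bool valid; /* (A) not initialised here: every branch must set it */

    if (v->promised) {
        /* (B) mismatched counters or an expired callback mean invalid */
        if (v->cb_s_break != server_s_break ||
            v->cb_v_break != volume_v_break)
            valid = false;
        else if (v->cb_expires_at <= now)
            valid = false;
        else
            valid = true;
    } else if (v->deleted) {
        /* Assumption of this sketch: nothing to refetch for a deleted vnode */
        valid = true;
    } else {
        /* (C) bring the break counters up to date before refetching */
        v->cb_s_break = server_s_break;
        v->cb_v_break = volume_v_break;
        valid = false;
    }
    return valid;
}
```

Because the counters are updated in the final branch, a later call with a fresh promise compares against current values instead of triggering a status fetch every time.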
      
      Fixes: c435ee34 ("afs: Overhaul the callback handling")
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • pstore/ram: Correctly calculate usable PRZ bytes · ce469db0
      Kees Cook committed
      [ Upstream commit 89d328f6 ]
      
      The actual number of bytes stored in a PRZ is smaller than the
      bytes requested by platform data, since there is a header on each
      PRZ. Additionally, if ECC is enabled, there are trailing bytes used
      as well. Normally this mismatch doesn't matter since PRZs are circular
      buffers and the leading "overflow" bytes are just thrown away. However, in
      the case of a compressed record, this rather badly corrupts the results.
      
      This corruption was visible with "ramoops.mem_size=204800 ramoops.ecc=1".
      Any stored crashes could not be decompressed (producing a pstorefs
      "dmesg-*.enc.z" file), and triggered errors at boot:
      
        [    2.790759] pstore: crypto_comp_decompress failed, ret = -22!
      
      Backporting this depends on commit 70ad35db ("pstore: Convert console
      write to use ->write_buf")
      Reported-by: Joel Fernandes <joel@joelfernandes.org>
      Fixes: b0aad7a9 ("pstore: Add compression support to pstore")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • cachefiles: Fix page leak in cachefiles_read_backing_file while vmscan is active · eee2269f
      Kiran Kumar Modukuri committed
      [ Upstream commit 9a24ce5b ]
      
      [Description]
      
      In a heavily loaded system where the system pagecache is nearing memory
      limits and fscache is enabled, pages can be leaked by fscache while trying
      to read pages from the cachefiles backend.  This can happen because two
      applications can be reading the same page from a single mount, so two
      threads can be trying to read the backing page at the same time.  This
      results in one of the threads finding that a page for the backing file or
      netfs file is already in the radix tree.  During the error handling,
      cachefiles does not clean up the reference on the backing page, leading to
      a page leak.
      
      [Fix]
      The fix is straightforward: decrement the reference when an error is
      encountered.
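A minimal userspace sketch of that error-path rule follows. It assumes a hypothetical `struct page_model` with a plain reference counter and a one-entry `struct cache_slot` standing in for the radix tree; `read_backing_page()` is an illustrative name and not the cachefiles function.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical page with a plain reference counter. */
struct page_model {
    int refcount;
};

/* Hypothetical one-entry cache standing in for the radix tree. */
struct cache_slot {
    bool occupied;
};

static void page_get(struct page_model *p)
{
    p->refcount++;
}

static void page_put(struct page_model *p)
{
    assert(p->refcount > 0);
    p->refcount--;
}

/* Returns 0 on success, -EEXIST if another reader inserted first. */
static int cache_insert(struct cache_slot *slot)
{
    if (slot->occupied)
        return -EEXIST;
    slot->occupied = true;
    return 0;
}

static int read_backing_page(struct page_model *backpage,
                             struct cache_slot *slot)
{
    int ret;

    page_get(backpage);
    ret = cache_insert(slot);
    if (ret < 0) {
        /* Error path: drop the reference taken above so it is not leaked. */
        page_put(backpage);
        return ret;
    }
    return 0; /* on success the slot now owns the extra reference */
}
```

Without the `page_put()` on the error path, every lost race would leave the counter one higher than it should be, which is the leak the commit describes.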
      
        [dhowells: Note that I've removed the clearance and put of newpage as
         they aren't attested in the commit message and don't appear to actually
         achieve anything since a new page is only allocated is newpage!=NULL and
         any residual new page is cleared before returning.]
      
      [Testing]
      I have tested the fix using the following method for 12+ hrs.

      1) mkdir -p /mnt/nfs ; mount -o vers=3,fsc <server_ip>:/export /mnt/nfs
      2) create 10000 files of 2.8MB in an NFS mount.
      3) start a thread to simulate heavy VM pressure
         (while true ; do echo 3 > /proc/sys/vm/drop_caches ; sleep 1 ; done)&
      4) start multiple parallel readers for the data set at the same time
         find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
         find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
         find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
         ..
         ..
         find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
         find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
      5) finally check using cat /proc/fs/fscache/stats | grep -i pages ;
         free -h , cat /proc/meminfo and page-types -r -b lru
         to ensure all pages are freed.
      Reviewed-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Shantanu Goel <sgoel01@yahoo.com>
      Signed-off-by: Kiran Kumar Modukuri <kiran.modukuri@gmail.com>
      [dja: forward ported to current upstream]
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • cachefiles: Fix an assertion failure when trying to update a failed object · 5132f913
      David Howells committed
      [ Upstream commit e6bc06fa ]
      
      If cachefiles gets an error other than ENOENT when trying to look up an
      object in the cache (in this case, EACCES), the object state machine will
      eventually transition to the DROP_OBJECT state.
      
      This state invokes fscache_drop_object() which tries to sync the auxiliary
      data with the cache (this is done lazily since commit 402cb8dd) on an
      incomplete cache object struct.
      
      The problem comes when cachefiles_update_object_xattr() is called to
      rewrite the xattr holding the data.  There's an assertion there that the
      cache object points to a dentry as we're going to update its xattr.  The
      assertion trips, however, as dentry didn't get set.
      
      Fix the problem by skipping the update in cachefiles if the object doesn't
      refer to a dentry.  A better way to do it could be to skip the update from
      the DROP_OBJECT state handler in fscache, but that might deny the cache the
      opportunity to update intermediate state.
      
      If this error occurs, the kernel log includes lines that look like the
      following:
      
       CacheFiles: Lookup failed error -13
       CacheFiles:
       CacheFiles: Assertion failed
       ------------[ cut here ]------------
       kernel BUG at fs/cachefiles/xattr.c:138!
       ...
       Workqueue: fscache_object fscache_object_work_func [fscache]
       RIP: 0010:cachefiles_update_object_xattr.cold.4+0x18/0x1a [cachefiles]
       ...
       Call Trace:
        cachefiles_update_object+0xdd/0x1c0 [cachefiles]
        fscache_update_aux_data+0x23/0x30 [fscache]
        fscache_drop_object+0x18e/0x1c0 [fscache]
        fscache_object_work_func+0x74/0x2b0 [fscache]
        process_one_work+0x18d/0x340
        worker_thread+0x2e/0x390
        ? pwq_unbound_release_workfn+0xd0/0xd0
        kthread+0x112/0x130
        ? kthread_bind+0x30/0x30
        ret_from_fork+0x35/0x40
      
      Note that there are actually two issues here: (1) EACCES happened on a
      cache object and (2) an oops occurred.  I think that the second is a
      consequence of the first (it certainly looks like it ought to be).  This
      patch only deals with the second.
      
      Fixes: 402cb8dd ("fscache: Attach the index key and aux data to the cookie")
      Reported-by: Zhibin Li <zhibli@redhat.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • exportfs: do not read dentry after free · ad374d10
      Pan Bian committed
      [ Upstream commit 2084ac6c505a58f7efdec13eba633c6aaa085ca5 ]
      
      The function dentry_connected calls dput(dentry) to drop the previously
      acquired reference to dentry. In this case, dentry can be released.
      After that, IS_ROOT(dentry) checks the condition
      (dentry == dentry->d_parent), which may result in a use-after-free bug.
      This patch directly compares dentry with its parent obtained before
      dropping the reference.
      
      Fixes: a056cc89 ("exportfs: stop retrying once we race with rename/remove")
      Signed-off-by: Pan Bian <bianpan2016@163.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Btrfs: send, fix infinite loop due to directory rename dependencies · 91f6a9aa
      Robbie Ko committed
      [ Upstream commit a4390aee72713d9e73f1132bcdeb17d72fbbf974 ]
      
      When doing an incremental send, due to the need to delay directory move
      (rename) operations, we can end up in an infinite loop at
      apply_children_dir_moves().
      
      An example scenario that triggers this problem is described below, where
      directory names correspond to the numbers of their respective inodes.
      
      Parent snapshot:
      
       .
       |--- 261/
             |--- 271/
                   |--- 266/
                         |--- 259/
                         |--- 260/
                         |     |--- 267
                         |
                         |--- 264/
                         |     |--- 258/
                         |           |--- 257/
                         |
                         |--- 265/
                         |--- 268/
                         |--- 269/
                         |     |--- 262/
                         |
                         |--- 270/
                         |--- 272/
                         |     |--- 263/
                         |     |--- 275/
                         |
                         |--- 274/
                               |--- 273/
      
      Send snapshot:
      
       .
       |-- 275/
            |-- 274/
                 |-- 273/
                      |-- 262/
                           |-- 269/
                                |-- 258/
                                     |-- 271/
                                          |-- 268/
                                               |-- 267/
                                                    |-- 270/
                                                         |-- 259/
                                                         |    |-- 265/
                                                         |
                                                         |-- 272/
                                                              |-- 257/
                                                                   |-- 260/
                                                                   |-- 264/
                                                                        |-- 263/
                                                                             |-- 261/
                                                                                  |-- 266/
      
      When processing inode 257 we delay its move (rename) operation because its
      new parent in the send snapshot, inode 272, was not yet processed. Then
      when processing inode 272, we delay the move operation for that inode
      because inode 274 is its ancestor in the send snapshot. Finally we delay
      the move operation for inode 274 when processing it because inode 275 is
      its new parent in the send snapshot and was not yet moved.
      
      When finishing processing inode 275, we start to do the move operations
      that were previously delayed (at apply_children_dir_moves()), resulting in
      the following iterations:
      
      1) We issue the move operation for inode 274;
      
      2) Because inode 262 depended on the move operation of inode 274 (it was
         delayed because 274 is its ancestor in the send snapshot), we issue the
         move operation for inode 262;
      
      3) We issue the move operation for inode 272, because it was delayed by
         inode 274 too (ancestor of 272 in the send snapshot);
      
      4) We issue the move operation for inode 269 (it was delayed by 262);
      
      5) We issue the move operation for inode 257 (it was delayed by 272);
      
      6) We issue the move operation for inode 260 (it was delayed by 272);
      
      7) We issue the move operation for inode 258 (it was delayed by 269);
      
      8) We issue the move operation for inode 264 (it was delayed by 257);
      
      9) We issue the move operation for inode 271 (it was delayed by 258);
      
      10) We issue the move operation for inode 263 (it was delayed by 264);
      
      11) We issue the move operation for inode 268 (it was delayed by 271);
      
      12) We verify if we can issue the move operation for inode 270 (it was
          delayed by 271). We detect a path loop in the current state, because
          inode 267 needs to be moved first before we can issue the move
          operation for inode 270. So we delay again the move operation for
          inode 270, this time we will attempt to do it after inode 267 is
          moved;
      
      13) We issue the move operation for inode 261 (it was delayed by 263);
      
      14) We verify if we can issue the move operation for inode 266 (it was
          delayed by 263). We detect a path loop in the current state, because
          inode 270 needs to be moved first before we can issue the move
          operation for inode 266. So we delay again the move operation for
          inode 266, this time we will attempt to do it after inode 270 is
          moved (its move operation was delayed in step 12);
      
      15) We issue the move operation for inode 267 (it was delayed by 268);
      
      16) We verify if we can issue the move operation for inode 266 (it was
          delayed by 270). We detect a path loop in the current state, because
          inode 270 needs to be moved first before we can issue the move
          operation for inode 266. So we delay again the move operation for
          inode 266, this time we will attempt to do it after inode 270 is
          moved (its move operation was delayed in step 12). So here we added
          again the same delayed move operation that we added in step 14;
      
      17) We attempt again to see if we can issue the move operation for inode
          266, and as in step 16, we realize we can not due to a path loop in
          the current state due to a dependency on inode 270. Again we delay
          inode 266's rename to happen after inode 270's move operation, adding
          the same dependency to the empty stack that we did in steps 14 and 16.
          The next iteration will pick the same move dependency on the stack
          (the only entry) and realize again there is still a path loop and then
          add again the same dependency to the stack, over and over, resulting in
          an infinite loop.
      
      So fix this by preventing adding the same move dependency entries to the
      stack by removing each pending move record from the red black tree of
      pending moves. This way the next call to get_pending_dir_moves() will
      not return anything for the current parent inode.
      
      A test case for fstests, with this reproducer, follows soon.
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      [Wrote changelog with example and more clear explanation]
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • aio: fix failure to put the file pointer · df66ef67
      Jens Axboe committed
      [ Upstream commit 53fffe29a9e664a999dd3787e4428da8c30533e0 ]
      
      If the ioprio capability check fails, we return without putting
      the file pointer.
      
      Fixes: d9a08a9e ("fs: Add aio iopriority support")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sysv: return 'err' instead of 0 in __sysv_write_inode · f6168a80
      YueHaibing committed
      [ Upstream commit c4b7d1ba7d263b74bb72e9325262a67139605cde ]
      
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      fs/sysv/inode.c: In function '__sysv_write_inode':
      fs/sysv/inode.c:239:6: warning:
       variable 'err' set but not used [-Wunused-but-set-variable]
      
      __sysv_write_inode should return 'err' instead of 0
      
      Fixes: 05459ca8 ("repair sysv_write_inode(), switch sysv to simple_fsync()")
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  3. 13 Dec, 2018 4 commits
  4. 08 Dec, 2018 1 commit
    • btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable · b0234f15
      Qu Wenruo committed
      commit 10950929 upstream.
      
      [BUG]
      A completely valid btrfs will refuse to mount, with error message like:
        BTRFS critical (device sdb2): corrupt leaf: root=2 block=239681536 slot=172 \
          bg_start=12018974720 bg_len=10888413184, invalid block group size, \
          have 10888413184 expect (0, 10737418240]
      
      This has been reported several times as the 4.19 kernel is now being
      used. The filesystem refuses to mount, but is otherwise ok and booting
      4.18 is a workaround.
      
      Btrfs check returns no error, and all kernels used on this fs are later
      than 2011, which should all have the 10G size limit commit.

      [CAUSE]
      For a 12-device btrfs, we could allocate a chunk larger than 10G due to
      the stripe size bump up.
      
      __btrfs_alloc_chunk()
      |- max_stripe_size = 1G
      |- max_chunk_size = 10G
      |- data_stripe = 11
      |- if (1G * 11 > 10G) {
             stripe_size = 976128930;
             stripe_size = round_up(976128930, SZ_16M) = 989855744
      
      However the final stripe_size (989855744) * 11 = 10888413184, which is
      still larger than 10G.
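The arithmetic above can be reproduced with a short sketch. The helper names `round_up_to()` and `chunk_len_after_roundup()` are hypothetical and only mirror the steps quoted in the trace (divide the 10G cap by 11 data stripes, round up to 16M, multiply back); this is not the btrfs allocator.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_16M (16ULL * 1024 * 1024)
#define SZ_1G  (1024ULL * 1024 * 1024)

/* Round value up to the next multiple of align (align must be non-zero). */
static uint64_t round_up_to(uint64_t value, uint64_t align)
{
    return (value + align - 1) / align * align;
}

/*
 * Chunk length after capping the stripe size to max_chunk_size / data_stripes
 * and then rounding that stripe size up to 16M.
 */
static uint64_t chunk_len_after_roundup(uint64_t max_chunk_size,
                                        uint64_t data_stripes)
{
    /* 10737418240 / 11 = 976128930 (integer division) */
    uint64_t stripe_size = max_chunk_size / data_stripes;

    /* 976128930 rounded up to a 16M multiple = 989855744 */
    stripe_size = round_up_to(stripe_size, SZ_16M);

    /* 989855744 * 11 = 10888413184, which exceeds the 10G cap */
    return stripe_size * data_stripes;
}
```

The rounding step adds up to 16M per stripe after the cap was applied, so the product can overshoot the cap, as it does here by roughly 144M.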
      
      [FIX]
      For the comprehensive check, we need to do the full check at chunk read
      time, and rely on bg <-> chunk mapping to do the check.
      
      We could just skip the length check for now.
      
      Fixes: fce466ea ("btrfs: tree-checker: Verify block_group_item")
      Cc: stable@vger.kernel.org # v4.19+
      Reported-by: Wang Yugui <wangyugui@e16-tech.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 06 Dec, 2018 10 commits
    • ext2: fix potential use after free · ffaaaf68
      Pan Bian committed
      commit ecebf55d27a11538ea84aee0be643dd953f830d5 upstream.
      
      The function ext2_xattr_set calls brelse(bh) to drop the reference count
      of bh. After that, bh may be freed. However, following brelse(bh),
      it reads bh->b_data via macro HDR(bh). This may result in a
      use-after-free bug. This patch moves brelse(bh) after reading the field.
      
      CC: stable@vger.kernel.org
      Signed-off-by: Pan Bian <bianpan2016@163.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext2: initialize opts.s_mount_opt as zero before using it · 1666cf8c
      xingaopeng committed
      commit e5f5b717983bccfa033282e9886811635602510e upstream.
      
      We need to initialize opts.s_mount_opt as zero before using it, else we
      may get some unexpected mount options.
      
      Fixes: 08851957 ("ext2: Parse mount options into a dedicated structure")
      CC: stable@vger.kernel.org
      Signed-off-by: xingaopeng <xingaopeng@huawei.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1666cf8c
    • M
      fs: fix lost error code in dio_complete · adcd35a3
      Maximilian Heyne authored
      commit 41e817bc upstream.
      
      commit e2592217 ("fs: simplify the
      generic_write_sync prototype") reworked callers of generic_write_sync(),
      and ended up dropping the error return for the directio path. Prior to
      that commit, in dio_complete(), an error would be bubbled up the stack,
      but after that commit, errors passed on to dio_complete were eaten up.
      
      This was reported on the list earlier, and a fix was proposed in
      https://lore.kernel.org/lkml/20160921141539.GA17898@infradead.org/, but
      never followed up with.  We recently hit this bug in our testing where
      fencing io errors, which were previously erroring out with EIO, were
      being returned as successful operations after this commit.
      
      The fix proposed on the list earlier was a little short -- it would have
      still called generic_write_sync() in case `ret` already contained an
      error. This fix ensures generic_write_sync() is only called when there's
      no pending error in the write. Additionally, transferred is replaced
      with ret to bring this code in line with other callers.
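
The control flow described above can be sketched as a simplified model. `complete_write` and `fake_sync` are hypothetical names for illustration and are not the kernel's functions; the only point is that the sync step runs when no error is pending, and that its own error is not lost.

```python
EIO = 5


def complete_write(ret, write_sync):
    """Model of the fixed completion path.

    ret is the number of bytes written (> 0) or a negative errno.
    write_sync(count) returns count on success or a negative errno.
    """
    if ret > 0:
        # Only sync when the write itself succeeded, and keep the result
        # so that a sync failure is reported to the caller.
        ret = write_sync(ret)
    return ret


calls = []


def fake_sync(count):
    calls.append(count)
    return count


ok = complete_write(4096, fake_sync)
failed = complete_write(-EIO, fake_sync)
sync_failed = complete_write(4096, lambda count: -EIO)
print(ok, failed, sync_failed, calls)
```

In this model a pending error passes through untouched, and `fake_sync` is invoked only for the successful write.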
      
      Fixes: e2592217 ("fs: simplify the generic_write_sync prototype")
      Reported-by: Ravi Nankani <rnankani@amazon.com>
      Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      CC: Torsten Mehlan <tomeh@amazon.de>
      CC: Uwe Dannowski <uwed@amazon.de>
      CC: Amit Shah <aams@amazon.de>
      CC: David Woodhouse <dwmw@amazon.co.uk>
      CC: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      adcd35a3
    • P
      btrfs: relocation: set trans to be NULL after ending transaction · 59065765
      Pan Bian authored
      commit 42a657f57628402c73237547f0134e083e2f6764 upstream.
      
      The function relocate_block_group calls btrfs_end_transaction to release
      trans when update_backref_cache returns 1, and then continues the loop
      body. If btrfs_block_rsv_refill fails this time, it will jump out of the
      loop and the freed trans will be accessed. This may result in a
      use-after-free bug. The patch assigns NULL to trans after trans is
      released so that it will not be accessed.
      
      Fixes: 0647bf56 ("Btrfs: improve forever loop when doing balance relocation")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Pan Bian <bianpan2016@163.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      59065765
    • F
      Btrfs: fix race between enabling quotas and subvolume creation · 172a94eb
      Filipe Manana authored
      commit 552f0329c75b3e1d7f9bb8c9e421d37403f192cd upstream.
      
      We have a race between enabling quotas and subvolume creation that causes
      subvolume creation to fail with -EINVAL, and the following diagram shows
      how it happens:
      
                    CPU 0                                          CPU 1
      
       btrfs_ioctl()
        btrfs_ioctl_quota_ctl()
         btrfs_quota_enable()
          mutex_lock(fs_info->qgroup_ioctl_lock)
      
                                                        btrfs_ioctl()
                                                         create_subvol()
                                                          btrfs_qgroup_inherit()
                                                           -> save fs_info->quota_root
                                                              into quota_root
                                                           -> stores a NULL value
                                                           -> tries to lock the mutex
                                                              qgroup_ioctl_lock
                                                              -> blocks waiting for
                                                                 the task at CPU0
      
         -> sets BTRFS_FS_QUOTA_ENABLED in fs_info
         -> sets quota_root in fs_info->quota_root
            (non-NULL value)
      
         mutex_unlock(fs_info->qgroup_ioctl_lock)
      
                                                           -> checks quota enabled
                                                              flag is set
                                                           -> returns -EINVAL because
                                                              fs_info->quota_root was
                                                              NULL before it acquired
                                                              the mutex
                                                              qgroup_ioctl_lock
                                                         -> ioctl returns -EINVAL
      
      Returning -EINVAL to user space will be confusing if all the arguments
      passed to the subvolume creation ioctl were valid.
      
      Fix it by grabbing the value from fs_info->quota_root after acquiring
      the mutex.
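
The difference between reading the pointer before and after taking the mutex can be shown with a deterministic model. All names here are hypothetical. The `acquire_lock` callback stands in for blocking on qgroup_ioctl_lock, and running the enabler inside it models the task on CPU 0 finishing while CPU 1 waits.

```python
EINVAL = 22


class FsInfo:
    """Hypothetical stand-in for the relevant fs_info fields."""

    def __init__(self):
        self.quota_enabled = False
        self.quota_root = None


def enable_quotas(fs_info):
    fs_info.quota_enabled = True
    fs_info.quota_root = "quota tree"


def inherit_buggy(fs_info, acquire_lock):
    quota_root = fs_info.quota_root  # read before taking the mutex
    acquire_lock()                   # quotas get enabled while we wait
    if not fs_info.quota_enabled:
        return 0
    if quota_root is None:           # stale NULL value
        return -EINVAL
    return 0


def inherit_fixed(fs_info, acquire_lock):
    acquire_lock()
    if not fs_info.quota_enabled:
        return 0
    quota_root = fs_info.quota_root  # read after taking the mutex
    if quota_root is None:
        return -EINVAL
    return 0


fs_a = FsInfo()
buggy_result = inherit_buggy(fs_a, lambda: enable_quotas(fs_a))
fs_b = FsInfo()
fixed_result = inherit_fixed(fs_b, lambda: enable_quotas(fs_b))
print(buggy_result, fixed_result)
```

With the same interleaving, the early read yields -EINVAL while the late read sees the value that is consistent with the enabled flag.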
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      172a94eb
    • F
      Btrfs: fix rare chances for data loss when doing a fast fsync · 715608db
      Filipe Manana authored
      commit aab15e8e upstream.
      
      After the simplification of the fast fsync patch done recently by commit
      b5e6c3e1 ("btrfs: always wait on ordered extents at fsync time") and
      commit e7175a69 ("btrfs: remove the wait ordered logic in the
      log_one_extent path"), we got a very short time window where we can get
      extents logged without writeback completing first or extents logged
      without logging the respective data checksums. Both issues can only happen
      when doing a non-full (fast) fsync.
      
      As soon as we enter btrfs_sync_file() we trigger writeback, then lock the
      inode and then wait for the writeback to complete before starting to log
      the inode. However before we acquire the inode's lock and after we started
      writeback, it's possible that more writes happened and dirtied more pages.
      If that happened and those pages get writeback triggered while we are
      logging the inode (for example, the VM subsystem triggering it due to
      memory pressure, or another concurrent fsync), we end up seeing the
      respective extent maps in the inode's list of modified extents and will
      log matching file extent items without waiting for the respective
      ordered extents to complete, meaning that either of the following will
      happen:
      
      1) We log an extent after its writeback finishes but before its checksums
         are added to the csum tree, leading to -EIO errors when attempting to
         read the extent after a log replay.
      
      2) We log an extent before its writeback finishes.
         Therefore after the log replay we will have a file extent item pointing
         to an unwritten extent (and without the respective data checksums as
         well).
      
      This could not happen before the fast fsync patch simplification, because
      for any extent we found in the list of modified extents, we would wait for
      its respective ordered extent to finish writeback or collect its checksums
      for logging if it did not complete yet.
      
      Fix this by triggering writeback again after acquiring the inode's lock
      and before waiting for ordered extents to complete.
      
      Fixes: e7175a69 ("btrfs: remove the wait ordered logic in the log_one_extent path")
      Fixes: b5e6c3e1 ("btrfs: always wait on ordered extents at fsync time")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      715608db
    • F
      Btrfs: ensure path name is null terminated at btrfs_control_ioctl · 78a2890f
      Filipe Manana authored
      commit f505754f upstream.
      
      We were using the path name received from user space without checking that
      it is null terminated. While btrfs-progs is well behaved and does proper
      validation and null termination, someone could call the ioctl and pass
      a non-null terminated path, leading to buffer overrun problems in the
      kernel.  The ioctl is protected by CAP_SYS_ADMIN.
      
      So just set the last byte of the path to a null character, similar to what
      we do in other ioctls (add/remove/resize device, snapshot creation, etc).
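
As a rough illustration of the idea, the sketch below models a fixed-size name buffer as a `bytearray`. Python cannot overrun the buffer, so this only shows the effect of forcing the last byte to a null character; `BUF_LEN` and `terminate_path` are hypothetical names, not kernel identifiers.

```python
BUF_LEN = 16  # hypothetical buffer size, chosen small for the example


def terminate_path(buf):
    """Force null termination, then return the bytes before the first null."""
    buf[-1] = 0
    return bytes(buf[:buf.index(0)])


# A caller that filled the whole buffer without a terminator.
unterminated = bytearray(b"A" * BUF_LEN)
# A well-behaved caller.
well_formed = bytearray(b"/dev/sdb".ljust(BUF_LEN, b"\0"))

clipped = terminate_path(unterminated)
unchanged = terminate_path(well_formed)
print(len(clipped), unchanged)
```

A properly terminated name is unaffected, while an unterminated one is clipped to fit inside the buffer.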
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      78a2890f
    • N
      btrfs: Always try all copies when reading extent buffers · aaf249e3
      Nikolay Borisov authored
      commit f8397d69daef06d358430d3054662fb597e37c00 upstream.
      
      When a metadata read is served the endio routine btree_readpage_end_io_hook
      is called which eventually runs the tree-checker. If tree-checker fails
      to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
      leads to btree_read_extent_buffer_pages wrongly assuming that all
      available copies of this extent buffer are wrong and failing prematurely.
      Fix this by modifying btree_read_extent_buffer_pages to read all copies of
      the data.
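
A minimal sketch of the "try every copy" behavior, under the assumption that copies are numbered from 1 and that validation is a simple predicate. The function and parameter names are hypothetical and are not taken from the btrfs source.

```python
def read_with_all_copies(read_copy, validate, num_copies):
    """Return (mirror, data) for the first copy that validates, else None."""
    for mirror in range(1, num_copies + 1):
        data = read_copy(mirror)
        # A copy that fails validation does not end the search; the
        # remaining copies are still tried.
        if data is not None and validate(data):
            return mirror, data
    return None


copies = {1: b"stale", 2: b"good"}
result = read_with_all_copies(copies.get, lambda d: d == b"good", 2)
no_good_copy = read_with_all_copies(copies.get, lambda d: False, 2)
print(result, no_good_copy)
```

Here the first copy fails validation, yet the read still succeeds from the second copy instead of failing prematurely.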
      
      This failure was exhibited in xfstests btrfs/124 which would
      spuriously fail its balance operations. The reason was that when balance
      was run following re-introduction of the missing raid1 disk
      __btrfs_map_block would map the read request to stripe 0, which
      corresponded to devid 2 (the disk which is being removed in the test):
      
          item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
      	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
      	io_align 65536 io_width 65536 sector_size 4096
      	num_stripes 2 sub_stripes 1
      		stripe 0 devid 2 offset 2156920832
      		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
      		stripe 1 devid 1 offset 3553624064
      		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
      
      This caused read requests for a checksum item to be routed to the
      stale disk which triggered the aforementioned logic involving
      EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
      the balance operation.
      
      Fixes: a826d6dc ("Btrfs: check items for correctness as we search")
      CC: stable@vger.kernel.org # 4.4+
      Suggested-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      aaf249e3
    • J
      udf: Allow mounting volumes with incorrect identification strings · 949ddf80
      Jan Kara authored
      commit b54e41f5 upstream.
      
      Commit c26f6c61 ("udf: Fix conversion of 'dstring' fields to UTF8")
      started to be more strict when checking whether converted strings are
      properly formatted. Sudip reports that there are DVDs where the volume
      identification string is actually too long - UDF reports:
      
      [  632.309320] UDF-fs: incorrect dstring lengths (32/32)
      
      during mount and fails the mount. This is a mostly harmless failure as we
      don't need volume identification (and even less volume set
      identification) for anything. So just truncate the volume identification
      string if it is too long and replace it with 'Invalid' if we just cannot
      convert it for other reasons. This keeps slightly incorrect media still
      mountable.
      
      CC: stable@vger.kernel.org
      Fixes: c26f6c61 ("udf: Fix conversion of 'dstring' fields to UTF8")
      Reported-and-tested-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      949ddf80
    • A
      userfaultfd: shmem/hugetlbfs: only allow to register VM_MAYWRITE vmas · 34b7a7cc
      Andrea Arcangeli authored
      commit 29ec90660d68bbdd69507c1c8b4e33aa299278b1 upstream.
      
      After the VMA to register the uffd onto is found, check that it has
      VM_MAYWRITE set before allowing registration.  This way we inherit all
      common code checks before allowing to fill file holes in shmem and
      hugetlbfs with UFFDIO_COPY.
      
      The userfaultfd memory model is not applicable for readonly files unless
      it's a MAP_PRIVATE.
      
      Link: http://lkml.kernel.org/r/20181126173452.26955-4-aarcange@redhat.com
      Fixes: ff62a342 ("hugetlb: implement memfd sealing")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Hugh Dickins <hughd@google.com>
      Reported-by: Jann Horn <jannh@google.com>
      Fixes: 4c27fe4c ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
      Cc: <stable@vger.kernel.org>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      34b7a7cc
  6. 01 Dec, 2018 2 commits