1. 10 1月, 2019 11 次提交
  2. 29 12月, 2018 4 次提交
    • I
      proc/sysctl: don't return ENOMEM on lookup when a table is unregistering · 6bb41321
      Ivan Delalande 提交于
      commit ea5751cc upstream.
      
      proc_sys_lookup can fail with ENOMEM instead of ENOENT when the
      corresponding sysctl table is being unregistered. In our case we see
      this upon opening /proc/sys/net/*/conf files while network interfaces
      are being deleted, which confuses our configuration daemon.
      
      The problem was successfully reproduced and this fix tested on v4.9.122
      and v4.20-rc6.
      
      v2: return ERR_PTRs in all cases when proc_sys_make_inode fails instead
      of mixing them with NULL. Thanks Al Viro for the feedback.
      
      Fixes: ace0c791 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
      Cc: stable@vger.kernel.org
      Signed-off-by: NIvan Delalande <colona@arista.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6bb41321
    • R
      ubifs: Handle re-linking of inodes correctly while recovery · 07364588
      Richard Weinberger 提交于
      commit e58725d51fa8da9133f3f1c54170aa2e43056b91 upstream.
      
      UBIFS's recovery code strictly assumes that a deleted inode will never
      come back, therefore it removes all data which belongs to that inode
      as soon it faces an inode with link count 0 in the replay list.
      Before O_TMPFILE this assumption was perfectly fine. With O_TMPFILE
      it can lead to data loss upon a power-cut.
      
      Consider a journal with entries like:
      0: inode X (nlink = 0) /* O_TMPFILE was created */
      1: data for inode X /* Someone writes to the temp file */
      2: inode X (nlink = 0) /* inode was changed, xattr, chmod, … */
      3: inode X (nlink = 1) /* inode was re-linked via linkat() */
      
      Upon replay of entry #2 UBIFS will drop all data that belongs to inode X,
      this will lead to an empty file after mounting.
      
      As solution for this problem, scan the replay list for a re-link entry
      before dropping data.
      
      Fixes: 474b9370 ("ubifs: Implement O_TMPFILE")
      Cc: stable@vger.kernel.org
      Cc: Russell Senior <russell@personaltelco.net>
      Cc: Rafał Miłecki <zajec5@gmail.com>
      Reported-by: NRussell Senior <russell@personaltelco.net>
      Reported-by: NRafał Miłecki <zajec5@gmail.com>
      Tested-by: NRafał Miłecki <rafal@milecki.pl>
      Signed-off-by: NRichard Weinberger <richard@nod.at>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      07364588
    • C
      Revert "vfs: Allow userns root to call mknod on owned filesystems." · 9c5ccadb
      Christian Brauner 提交于
      commit 94f82008ce30e2624537d240d64ce718255e0b80 upstream.
      
      This reverts commit 55956b59.
      
      commit 55956b59 ("vfs: Allow userns root to call mknod on owned filesystems.")
      enabled mknod() in user namespaces for userns root if CAP_MKNOD is
      available. However, these device nodes are useless since any filesystem
      mounted from a non-initial user namespace will set the SB_I_NODEV flag on
      the filesystem. Now, when a device node s created in a non-initial user
      namespace a call to open() on said device node will fail due to:
      
      bool may_open_dev(const struct path *path)
      {
              return !(path->mnt->mnt_flags & MNT_NODEV) &&
                      !(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
      }
      
      The problem with this is that as of the aforementioned commit mknod()
      creates partially functional device nodes in non-initial user namespaces.
      In particular, it has the consequence that as of the aforementioned commit
      open() will be more privileged with respect to device nodes than mknod().
      Before it was the other way around. Specifically, if mknod() succeeded
      then it was transparent for any userspace application that a fatal error
      must have occured when open() failed.
      
      All of this breaks multiple userspace workloads and a widespread assumption
      about how to handle mknod(). Basically, all container runtimes and systemd
      live by the slogan "ask for forgiveness not permission" when running user
      namespace workloads. For mknod() the assumption is that if the syscall
      succeeds the device nodes are useable irrespective of whether it succeeds
      in a non-initial user namespace or not. This logic was chosen explicitly
      to allow for the glorious day when mknod() will actually be able to create
      fully functional device nodes in user namespaces.
      A specific problem people are already running into when running 4.18 rc
      kernels are failing systemd services. For any distro that is run in a
      container systemd services started with the PrivateDevices= property set
      will fail to start since the device nodes in question cannot be
      opened (cf. the arguments in [1]).
      
      Full disclosure, Seth made the very sound argument that it is already
      possible to end up with partially functional device nodes. Any filesystem
      mounted with MS_NODEV set will allow mknod() to succeed but will not allow
      open() to succeed. The difference to the case here is that the MS_NODEV
      case is transparent to userspace since it is an explicitly set mount option
      while the SB_I_NODEV case is an implicit property enforced by the kernel
      and hence opaque to userspace.
      
      [1]: https://github.com/systemd/systemd/pull/9483Signed-off-by: NChristian Brauner <christian@brauner.io>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Seth Forshee <seth.forshee@canonical.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9c5ccadb
    • D
      iomap: Revert "fs/iomap.c: get/put the page in iomap_page_create/release()" · 38d072a4
      Dave Chinner 提交于
      [ Upstream commit a837eca2412051628c0529768c9bc4f3580b040e ]
      
      This reverts commit 61c6de667263184125d5ca75e894fcad632b0dd3.
      
      The reverted commit added page reference counting to iomap page
      structures that are used to track block size < page size state. This
      was supposed to align the code with page migration page accounting
      assumptions, but what it has done instead is break XFS filesystems.
      Every fstests run I've done on sub-page block size XFS filesystems
      has since picking up this commit 2 days ago has failed with bad page
      state errors such as:
      
      # ./run_check.sh "-m rmapbt=1,reflink=1 -i sparse=1 -b size=1k" "generic/038"
      ....
      SECTION       -- xfs
      FSTYP         -- xfs (debug)
      PLATFORM      -- Linux/x86_64 test1 4.20.0-rc6-dgc+
      MKFS_OPTIONS  -- -f -m rmapbt=1,reflink=1 -i sparse=1 -b size=1k /dev/sdc
      MOUNT_OPTIONS -- /dev/sdc /mnt/scratch
      
      generic/038 454s ...
       run fstests generic/038 at 2018-12-20 18:43:05
       XFS (sdc): Unmounting Filesystem
       XFS (sdc): Mounting V5 Filesystem
       XFS (sdc): Ending clean mount
       BUG: Bad page state in process kswapd0  pfn:3a7fa
       page:ffffea0000ccbeb0 count:0 mapcount:0 mapping:ffff88800d9b6360 index:0x1
       flags: 0xfffffc0000000()
       raw: 000fffffc0000000 dead000000000100 dead000000000200 ffff88800d9b6360
       raw: 0000000000000001 0000000000000000 00000000ffffffff
       page dumped because: non-NULL mapping
       CPU: 0 PID: 676 Comm: kswapd0 Not tainted 4.20.0-rc6-dgc+ #915
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
       Call Trace:
        dump_stack+0x67/0x90
        bad_page.cold.116+0x8a/0xbd
        free_pcppages_bulk+0x4bf/0x6a0
        free_unref_page_list+0x10f/0x1f0
        shrink_page_list+0x49d/0xf50
        shrink_inactive_list+0x19d/0x3b0
        shrink_node_memcg.constprop.77+0x398/0x690
        ? shrink_slab.constprop.81+0x278/0x3f0
        shrink_node+0x7a/0x2f0
        kswapd+0x34b/0x6d0
        ? node_reclaim+0x240/0x240
        kthread+0x11f/0x140
        ? __kthread_bind_mask+0x60/0x60
        ret_from_fork+0x24/0x30
       Disabling lock debugging due to kernel taint
      ....
      
      The failures are from anyway that frees pages and empties the
      per-cpu page magazines, so it's not a predictable failure or an easy
      to debug failure.
      
      generic/038 is a reliable reproducer of this problem - it has a 9 in
      10 failure rate on one of my test machines. Failure on other
      machines have been at random points in fstests runs but every run
      has ended up tripping this problem. Hence generic/038 was used to
      bisect the failure because it was the most reliable failure.
      
      It is too close to the 4.20 release (not to mention holidays) to
      try to diagnose, fix and test the underlying cause of the problem,
      so reverting the commit is the only option we have right now. The
      revert has been tested against a current tot 4.20-rc7+ kernel across
      multiple machines running sub-page block size XFs filesystems and
      none of the bad page state failures have been seen.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Cc: Piotr Jaroszynski <pjaroszynski@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Brian Foster <bfoster@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      38d072a4
  3. 21 12月, 2018 3 次提交
    • O
      Btrfs: fix missing delayed iputs on unmount · b4c7c826
      Omar Sandoval 提交于
      [ Upstream commit d6fd0ae2 ]
      
      There's a race between close_ctree() and cleaner_kthread().
      close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
      sees it set, but this is racy; the cleaner might have already checked
      the bit and could be cleaning stuff. In particular, if it deletes unused
      block groups, it will create delayed iputs for the free space cache
      inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
      longer running delayed iputs after a commit. Therefore, if the cleaner
      creates more delayed iputs after delayed iputs are run in
      btrfs_commit_super(), we will leak inodes on unmount and get a busy
      inode crash from the VFS.
      
      Fix it by parking the cleaner before we actually close anything. Then,
      any remaining delayed iputs will always be handled in
      btrfs_commit_super(). This also ensures that the commit in close_ctree()
      is really the last commit, so we can get rid of the commit in
      cleaner_kthread().
      
      The fstest/generic/475 followed by 476 can trigger a crash that
      manifests as a slab corruption caused by accessing the freed kthread
      structure by a wake up function. Sample trace:
      
      [ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
      [ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
      [ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
      [ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G        W         4.19.0-rc8-default+ #323
      [ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
      [ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
      [ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
      [ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
      [ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
      [ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
      [ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
      [ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
      [ 5657.099689] FS:  00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
      [ 5657.101032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
      [ 5657.103159] Call Trace:
      [ 5657.103776]  shrink_inactive_list+0x194/0x410
      [ 5657.104671]  shrink_node_memcg.constprop.84+0x39a/0x6a0
      [ 5657.105750]  shrink_node+0x62/0x1c0
      [ 5657.106529]  try_to_free_pages+0x1a4/0x500
      [ 5657.107408]  __alloc_pages_slowpath+0x2c9/0xb20
      [ 5657.108418]  __alloc_pages_nodemask+0x268/0x2b0
      [ 5657.109348]  kmalloc_large_node+0x37/0x90
      [ 5657.110205]  __kmalloc_node+0x236/0x310
      [ 5657.111014]  kvmalloc_node+0x3e/0x70
      
      Fixes: 30928e9b ("btrfs: don't run delayed_iputs in commit")
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add trace ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b4c7c826
    • S
      cifs: In Kconfig CONFIG_CIFS_POSIX needs depends on legacy (insecure cifs) · b5a8028c
      Steve French 提交于
      [ Upstream commit 6e785302 ]
      
      Missing a dependency.  Shouldn't show cifs posix extensions
      in Kconfig if CONFIG_CIFS_ALLOW_INSECURE_DIALECTS (ie SMB1
      protocol) is disabled.
      Signed-off-by: NSteve French <stfrench@microsoft.com>
      Reviewed-by: NPavel Shilovsky <pshilov@microsoft.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b5a8028c
    • D
      nfs: don't dirty kernel pages read by direct-io · de956d40
      Dave Kleikamp 提交于
      [ Upstream commit ad3cba223ac02dc769c3bbe88efe277bbb457566 ]
      
      When we use direct_IO with an NFS backing store, we can trigger a
      WARNING in __set_page_dirty(), as below, since we're dirtying the page
      unnecessarily in nfs_direct_read_completion().
      
      To fix, replicate the logic in commit 53cbf3b1 ("fs: direct-io:
      don't dirtying pages for ITER_BVEC/ITER_KVEC direct read").
      
      Other filesystems that implement direct_IO handle this; most use
      blockdev_direct_IO(). ceph and cifs have similar logic.
      
      mount 127.0.0.1:/export /nfs
      dd if=/dev/zero of=/nfs/image bs=1M count=200
      losetup --direct-io=on -f /nfs/image
      mkfs.btrfs /dev/loop0
      mount -t btrfs /dev/loop0 /mnt/
      
      kernel: WARNING: CPU: 0 PID: 8067 at fs/buffer.c:580 __set_page_dirty+0xaf/0xd0
      kernel: Modules linked in: loop(E) nfsv3(E) rpcsec_gss_krb5(E) nfsv4(E) dns_resolver(E) nfs(E) fscache(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) fuse(E) tun(E) ip6t_rpfilter(E) ipt_REJECT(E) nf_
      kernel:  snd_seq(E) snd_seq_device(E) snd_pcm(E) video(E) snd_timer(E) snd(E) soundcore(E) ip_tables(E) xfs(E) libcrc32c(E) sd_mod(E) sr_mod(E) cdrom(E) ata_generic(E) pata_acpi(E) crc32c_intel(E) ahci(E) li
      kernel: CPU: 0 PID: 8067 Comm: kworker/0:2 Tainted: G            E     4.20.0-rc1.master.20181111.ol7.x86_64 #1
      kernel: Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      kernel: Workqueue: nfsiod rpc_async_release [sunrpc]
      kernel: RIP: 0010:__set_page_dirty+0xaf/0xd0
      kernel: Code: c3 48 8b 02 f6 c4 04 74 d4 48 89 df e8 ba 05 f7 ff 48 89 c6 eb cb 48 8b 43 08 a8 01 75 1f 48 89 d8 48 8b 00 a8 04 74 02 eb 87 <0f> 0b eb 83 48 83 e8 01 eb 9f 48 83 ea 01 0f 1f 00 eb 8b 48 83 e8
      kernel: RSP: 0000:ffffc1c8825b7d78 EFLAGS: 00013046
      kernel: RAX: 000fffffc0020089 RBX: fffff2b603308b80 RCX: 0000000000000001
      kernel: RDX: 0000000000000001 RSI: ffff9d11478115c8 RDI: ffff9d11478115d0
      kernel: RBP: ffffc1c8825b7da0 R08: 0000646f6973666e R09: 8080808080808080
      kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff9d11478115d0
      kernel: R13: ffff9d11478115c8 R14: 0000000000003246 R15: 0000000000000001
      kernel: FS:  0000000000000000(0000) GS:ffff9d115ba00000(0000) knlGS:0000000000000000
      kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      kernel: CR2: 00007f408686f640 CR3: 0000000104d8e004 CR4: 00000000000606f0
      kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      kernel: Call Trace:
      kernel:  __set_page_dirty_buffers+0xb6/0x110
      kernel:  set_page_dirty+0x52/0xb0
      kernel:  nfs_direct_read_completion+0xc4/0x120 [nfs]
      kernel:  nfs_pgio_release+0x10/0x20 [nfs]
      kernel:  rpc_free_task+0x30/0x70 [sunrpc]
      kernel:  rpc_async_release+0x12/0x20 [sunrpc]
      kernel:  process_one_work+0x174/0x390
      kernel:  worker_thread+0x4f/0x3e0
      kernel:  kthread+0x102/0x140
      kernel:  ? drain_workqueue+0x130/0x130
      kernel:  ? kthread_stop+0x110/0x110
      kernel:  ret_from_fork+0x35/0x40
      kernel: ---[ end trace 01341980905412c9 ]---
      Signed-off-by: NDave Kleikamp <dave.kleikamp@oracle.com>
      Signed-off-by: NSantosh Shilimkar <santosh.shilimkar@oracle.com>
      
      [forward-ported to v4.20]
      Signed-off-by: NCalum Mackay <calum.mackay@oracle.com>
      Reviewed-by: NDave Kleikamp <dave.kleikamp@oracle.com>
      Reviewed-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      de956d40
  4. 20 12月, 2018 6 次提交
  5. 17 12月, 2018 16 次提交
    • M
      dax: Check page->mapping isn't NULL · 384f1811
      Matthew Wilcox 提交于
      commit c93db7bb upstream.
      
      If we race with inode destroy, it's possible for page->mapping to be
      NULL before we even enter this routine, as well as after having slept
      waiting for the dax entry to become unlocked.
      
      Fixes: c2a7d2a1 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
      Cc: <stable@vger.kernel.org>
      Reported-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NMatthew Wilcox <willy@infradead.org>
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      384f1811
    • T
      flexfiles: enforce per-mirror stateid only for v4 DSes · 111758f7
      Tigran Mkrtchyan 提交于
      commit 320f35b7 upstream.
      
      Since commit bb21ce0a we always enforce per-mirror stateid.
      However, this makes sense only for v4+ servers.
      Signed-off-by: NTigran Mkrtchyan <tigran.mkrtchyan@desy.de>
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      111758f7
    • P
      ocfs2: fix potential use after free · a31da26a
      Pan Bian 提交于
      [ Upstream commit 164f7e58 ]
      
      ocfs2_get_dentry() calls iput(inode) to drop the reference count of
      inode, and if the reference count hits 0, inode is freed.  However, in
      this function, it then reads inode->i_generation, which may result in a
      use after free bug.  Move the put operation later.
      
      Link: http://lkml.kernel.org/r/1543109237-110227-1-git-send-email-bianpan2016@163.com
      Fixes: 781f200c("ocfs2: Remove masklog ML_EXPORT.")
      Signed-off-by: NPan Bian <bianpan2016@163.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      a31da26a
    • P
      hfsplus: do not free node before using · ab31765e
      Pan Bian 提交于
      [ Upstream commit c7d7d620 ]
      
      hfs_bmap_free() frees node via hfs_bnode_put(node).  However it then
      reads node->this when dumping error message on an error path, which may
      result in a use-after-free bug.  This patch frees node only when it is
      never used.
      
      Link: http://lkml.kernel.org/r/1543053441-66942-1-git-send-email-bianpan2016@163.comSigned-off-by: NPan Bian <bianpan2016@163.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Viacheslav Dubeyko <slava@dubeyko.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ab31765e
    • P
      hfs: do not free node before using · f7cbec75
      Pan Bian 提交于
      [ Upstream commit ce96a407 ]
      
      hfs_bmap_free() frees the node via hfs_bnode_put(node).  However, it
      then reads node->this when dumping error message on an error path, which
      may result in a use-after-free bug.  This patch frees the node only when
      it is never again used.
      
      Link: http://lkml.kernel.org/r/1542963889-128825-1-git-send-email-bianpan2016@163.com
      Fixes: a1185ffa2fc ("HFS rewrite")
      Signed-off-by: NPan Bian <bianpan2016@163.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
      Cc: Viacheslav Dubeyko <slava@dubeyko.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      f7cbec75
    • L
      ocfs2: fix deadlock caused by ocfs2_defrag_extent() · 6aab48ae
      Larry Chen 提交于
      [ Upstream commit e21e5744 ]
      
      ocfs2_defrag_extent may fall into deadlock.
      
      ocfs2_ioctl_move_extents
          ocfs2_ioctl_move_extents
            ocfs2_move_extents
              ocfs2_defrag_extent
                ocfs2_lock_allocators_move_extents
      
                  ocfs2_reserve_clusters
                    inode_lock GLOBAL_BITMAP_SYSTEM_INODE
      
      	  __ocfs2_flush_truncate_log
                    inode_lock GLOBAL_BITMAP_SYSTEM_INODE
      
      As backtrace shows above, ocfs2_reserve_clusters() will call inode_lock
      against the global bitmap if local allocator has not sufficient cluters.
      Once global bitmap could meet the demand, ocfs2_reserve_cluster will
      return success with global bitmap locked.
      
      After ocfs2_reserve_cluster(), if truncate log is full,
      __ocfs2_flush_truncate_log() will definitely fall into deadlock because
      it needs to inode_lock global bitmap, which has already been locked.
      
      To fix this bug, we could remove from
      ocfs2_lock_allocators_move_extents() the code which intends to lock
      global allocator, and put the removed code after
      __ocfs2_flush_truncate_log().
      
      ocfs2_lock_allocators_move_extents() is referred by 2 places, one is
      here, the other does not need the data allocator context, which means
      this patch does not affect the caller so far.
      
      Link: http://lkml.kernel.org/r/20181101071422.14470-1-lchen@suse.comSigned-off-by: NLarry Chen <lchen@suse.com>
      Reviewed-by: NChangwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      6aab48ae
    • C
      fscache, cachefiles: remove redundant variable 'cache' · 1f925643
      Colin Ian King 提交于
      [ Upstream commit 31ffa563 ]
      
      Variable 'cache' is being assigned but is never used hence it is
      redundant and can be removed.
      
      Cleans up clang warning:
      warning: variable 'cache' set but not used [-Wunused-but-set-variable]
      Signed-off-by: NColin Ian King <colin.king@canonical.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      1f925643
    • N
      cachefiles: Explicitly cast enumerated type in put_object · d8bf97a0
      Nathan Chancellor 提交于
      [ Upstream commit b7e768b7e3522695ed36dcb48ecdcd344bd30a9b ]
      
      Clang warns when one enumerated type is implicitly converted to another.
      
      fs/cachefiles/namei.c:247:50: warning: implicit conversion from
      enumeration type 'enum cachefiles_obj_ref_trace' to different
      enumeration type 'enum fscache_obj_ref_trace' [-Wenum-conversion]
              cache->cache.ops->put_object(&xobject->fscache,
      cachefiles_obj_put_wait_retry);
      
      Silence this warning by explicitly casting to fscache_obj_ref_trace,
      which is also done in put_object.
      Reported-by: NNick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: NNathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d8bf97a0
    • N
      fscache: fix race between enablement and dropping of object · 02bd7b74
      NeilBrown 提交于
      [ Upstream commit c5a94f434c82529afda290df3235e4d85873c5b4 ]
      
      It was observed that a process blocked indefintely in
      __fscache_read_or_alloc_page(), waiting for FSCACHE_COOKIE_LOOKING_UP
      to be cleared via fscache_wait_for_deferred_lookup().
      
      At this time, ->backing_objects was empty, which would normaly prevent
      __fscache_read_or_alloc_page() from getting to the point of waiting.
      This implies that ->backing_objects was cleared *after*
      __fscache_read_or_alloc_page was was entered.
      
      When an object is "killed" and then "dropped",
      FSCACHE_COOKIE_LOOKING_UP is cleared in fscache_lookup_failure(), then
      KILL_OBJECT and DROP_OBJECT are "called" and only in DROP_OBJECT is
      ->backing_objects cleared.  This leaves a window where
      something else can set FSCACHE_COOKIE_LOOKING_UP and
      __fscache_read_or_alloc_page() can start waiting, before
      ->backing_objects is cleared
      
      There is some uncertainty in this analysis, but it seems to be fit the
      observations.  Adding the wake in this patch will be handled correctly
      by __fscache_read_or_alloc_page(), as it checks if ->backing_objects
      is empty again, after waiting.
      
      Customer which reported the hang, also report that the hang cannot be
      reproduced with this fix.
      
      The backtrace for the blocked process looked like:
      
      PID: 29360  TASK: ffff881ff2ac0f80  CPU: 3   COMMAND: "zsh"
       #0 [ffff881ff43efbf8] schedule at ffffffff815e56f1
       #1 [ffff881ff43efc58] bit_wait at ffffffff815e64ed
       #2 [ffff881ff43efc68] __wait_on_bit at ffffffff815e61b8
       #3 [ffff881ff43efca0] out_of_line_wait_on_bit at ffffffff815e625e
       #4 [ffff881ff43efd08] fscache_wait_for_deferred_lookup at ffffffffa04f2e8f [fscache]
       #5 [ffff881ff43efd18] __fscache_read_or_alloc_page at ffffffffa04f2ffe [fscache]
       #6 [ffff881ff43efd58] __nfs_readpage_from_fscache at ffffffffa0679668 [nfs]
       #7 [ffff881ff43efd78] nfs_readpage at ffffffffa067092b [nfs]
       #8 [ffff881ff43efda0] generic_file_read_iter at ffffffff81187a73
       #9 [ffff881ff43efe50] nfs_file_read at ffffffffa066544b [nfs]
      #10 [ffff881ff43efe70] __vfs_read at ffffffff811fc756
      #11 [ffff881ff43efee8] vfs_read at ffffffff811fccfa
      #12 [ffff881ff43eff18] sys_read at ffffffff811fda62
      #13 [ffff881ff43eff50] entry_SYSCALL_64_fastpath at ffffffff815e986e
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      02bd7b74
    • D
      afs: Fix validation/callback interaction · 52da87f0
      David Howells 提交于
      [ Upstream commit ae3b7361dc0ee9a425bf7d77ce211f533500b39b ]
      
      When afs_validate() is called to validate a vnode (inode), there are two
      unhandled cases in the fastpath at the top of the function:
      
       (1) If the vnode is promised (AFS_VNODE_CB_PROMISED is set), the break
           counters match and the data has expired, then there's an implicit case
           in which the vnode needs revalidating.
      
           This has no consequences since the default "valid = false" set at the
           top of the function happens to do the right thing.
      
       (2) If the vnode is not promised and it hasn't been deleted
           (AFS_VNODE_DELETED is not set) then there's a default case we're not
           handling in which the vnode is invalid.  If the vnode is invalid, we
           need to bring cb_s_break and cb_v_break up to date before we refetch
           the status.
      
           As a consequence, once the server loses track of the client
           (ie. sufficient time has passed since we last sent it an operation),
           it will send us a CB.InitCallBackState* operation when we next try to
           talk to it.  This calls afs_init_callback_state() which increments
           afs_server::cb_s_break, but this then doesn't propagate to the
           afs_vnode record.
      
           The result being that every afs_validate() call thereafter sends a
           status fetch operation to the server.
      
      Clarify and fix this by:
      
       (A) Setting valid in all the branches rather than initialising it at the
           top so that the compiler catches where we've missed.
      
       (B) Restructuring the logic in the 'promised' branch so that we set valid
           to false if the callback is due to expire (or has expired) and so that
           the final case is that the vnode is still valid.
      
       (C) Adding an else-statement that ups cb_s_break and cb_v_break if the
           promised and deleted cases don't match.
      
      Fixes: c435ee34 ("afs: Overhaul the callback handling")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      52da87f0
    • K
      pstore/ram: Correctly calculate usable PRZ bytes · ce469db0
      Kees Cook 提交于
      [ Upstream commit 89d328f637b9904b6d4c9af73c8a608b8dd4d6f8 ]
      
      The actual number of bytes stored in a PRZ is smaller than the
      bytes requested by platform data, since there is a header on each
      PRZ. Additionally, if ECC is enabled, there are trailing bytes used
      as well. Normally this mismatch doesn't matter since PRZs are circular
      buffers and the leading "overflow" bytes are just thrown away. However, in
      the case of a compressed record, this rather badly corrupts the results.
      
      This corruption was visible with "ramoops.mem_size=204800 ramoops.ecc=1".
      Any stored crashes would not be uncompressable (producing a pstorefs
      "dmesg-*.enc.z" file), and triggering errors at boot:
      
        [    2.790759] pstore: crypto_comp_decompress failed, ret = -22!
      
      Backporting this depends on commit 70ad35db ("pstore: Convert console
      write to use ->write_buf")
      Reported-by: NJoel Fernandes <joel@joelfernandes.org>
      Fixes: b0aad7a9 ("pstore: Add compression support to pstore")
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NJoel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ce469db0
    • K
      cachefiles: Fix page leak in cachefiles_read_backing_file while vmscan is active · eee2269f
      Kiran Kumar Modukuri 提交于
      [ Upstream commit 9a24ce5b66f9c8190d63b15f4473600db4935f1f ]
      
      [Description]
      
      In a heavily loaded system where the system pagecache is nearing memory
      limits and fscache is enabled, pages can be leaked by fscache while trying
      read pages from cachefiles backend.  This can happen because two
      applications can be reading same page from a single mount, two threads can
      be trying to read the backing page at same time.  This results in one of
      the threads finding that a page for the backing file or netfs file is
      already in the radix tree.  During the error handling cachefiles does not
      clean up the reference on backing page, leading to page leak.
      
      [Fix]
      The fix is straightforward, to decrement the reference when error is
      encountered.
      
        [dhowells: Note that I've removed the clearance and put of newpage as
         they aren't attested in the commit message and don't appear to actually
         achieve anything since a new page is only allocated is newpage!=NULL and
         any residual new page is cleared before returning.]
      
      [Testing]
      I have tested the fix using following method for 12+ hrs.
      
      1) mkdir -p /mnt/nfs ; mount -o vers=3,fsc <server_ip>:/export /mnt/nfs
      2) create 10000 files of 2.8MB in a NFS mount.
      3) start a thread to simulate heavy VM presssure
         (while true ; do echo 3 > /proc/sys/vm/drop_caches ; sleep 1 ; done)&
      4) start multiple parallel reader for data set at same time
         find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
         find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
         find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
         ..
         ..
         find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
         find /mnt/nfs -type f | xargs -P 80 cat > /dev/null &
      5) finally check using cat /proc/fs/fscache/stats | grep -i pages ;
         free -h , cat /proc/meminfo and page-types -r -b lru
         to ensure all pages are freed.
      Reviewed-by: NDaniel Axtens <dja@axtens.net>
      Signed-off-by: NShantanu Goel <sgoel01@yahoo.com>
      Signed-off-by: NKiran Kumar Modukuri <kiran.modukuri@gmail.com>
      [dja: forward ported to current upstream]
      Signed-off-by: NDaniel Axtens <dja@axtens.net>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      eee2269f
    • D
      cachefiles: Fix an assertion failure when trying to update a failed object · 5132f913
      David Howells 提交于
      [ Upstream commit e6bc06faf64a83384cc0abc537df954c9d3ff942 ]
      
      If cachefiles gets an error other then ENOENT when trying to look up an
      object in the cache (in this case, EACCES), the object state machine will
      eventually transition to the DROP_OBJECT state.
      
      This state invokes fscache_drop_object() which tries to sync the auxiliary
      data with the cache (this is done lazily since commit 402cb8dd) on an
      incomplete cache object struct.
      
      The problem comes when cachefiles_update_object_xattr() is called to
      rewrite the xattr holding the data.  There's an assertion there that the
      cache object points to a dentry as we're going to update its xattr.  The
      assertion trips, however, as dentry didn't get set.
      
      Fix the problem by skipping the update in cachefiles if the object doesn't
      refer to a dentry.  A better way to do it could be to skip the update from
      the DROP_OBJECT state handler in fscache, but that might deny the cache the
      opportunity to update intermediate state.
      
      If this error occurs, the kernel log includes lines that look like the
      following:
      
       CacheFiles: Lookup failed error -13
       CacheFiles:
       CacheFiles: Assertion failed
       ------------[ cut here ]------------
       kernel BUG at fs/cachefiles/xattr.c:138!
       ...
       Workqueue: fscache_object fscache_object_work_func [fscache]
       RIP: 0010:cachefiles_update_object_xattr.cold.4+0x18/0x1a [cachefiles]
       ...
       Call Trace:
        cachefiles_update_object+0xdd/0x1c0 [cachefiles]
        fscache_update_aux_data+0x23/0x30 [fscache]
        fscache_drop_object+0x18e/0x1c0 [fscache]
        fscache_object_work_func+0x74/0x2b0 [fscache]
        process_one_work+0x18d/0x340
        worker_thread+0x2e/0x390
        ? pwq_unbound_release_workfn+0xd0/0xd0
        kthread+0x112/0x130
        ? kthread_bind+0x30/0x30
        ret_from_fork+0x35/0x40
      
      Note that there are actually two issues here: (1) EACCES happened on a
      cache object and (2) an oops occurred.  I think that the second is a
      consequence of the first (it certainly looks like it ought to be).  This
      patch only deals with the second.
      
      Fixes: 402cb8dd ("fscache: Attach the index key and aux data to the cookie")
      Reported-by: NZhibin Li <zhibli@redhat.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      5132f913
    • P
      exportfs: do not read dentry after free · ad374d10
      Pan Bian 提交于
      [ Upstream commit 2084ac6c ]
      
      The function dentry_connected calls dput(dentry) to drop the previously
      acquired reference to dentry. In this case, dentry can be released.
      After that, IS_ROOT(dentry) checks the condition
      (dentry == dentry->d_parent), which may result in a use-after-free bug.
      This patch directly compares dentry with its parent obtained before
      dropping the reference.
      
      Fixes: a056cc89("exportfs: stop retrying once we race with
      rename/remove")
      Signed-off-by: NPan Bian <bianpan2016@163.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ad374d10
    • R
      Btrfs: send, fix infinite loop due to directory rename dependencies · 91f6a9aa
      Robbie Ko 提交于
      [ Upstream commit a4390aee72713d9e73f1132bcdeb17d72fbbf974 ]
      
      When doing an incremental send, due to the need of delaying directory move
      (rename) operations we can end up in infinite loop at
      apply_children_dir_moves().
      
      An example scenario that triggers this problem is described below, where
      directory names correspond to the numbers of their respective inodes.
      
      Parent snapshot:
      
       .
       |--- 261/
             |--- 271/
                   |--- 266/
                         |--- 259/
                         |--- 260/
                         |     |--- 267
                         |
                         |--- 264/
                         |     |--- 258/
                         |           |--- 257/
                         |
                         |--- 265/
                         |--- 268/
                         |--- 269/
                         |     |--- 262/
                         |
                         |--- 270/
                         |--- 272/
                         |     |--- 263/
                         |     |--- 275/
                         |
                         |--- 274/
                               |--- 273/
      
      Send snapshot:
      
       .
       |-- 275/
            |-- 274/
                 |-- 273/
                      |-- 262/
                           |-- 269/
                                |-- 258/
                                     |-- 271/
                                          |-- 268/
                                               |-- 267/
                                                    |-- 270/
                                                         |-- 259/
                                                         |    |-- 265/
                                                         |
                                                         |-- 272/
                                                              |-- 257/
                                                                   |-- 260/
                                                                   |-- 264/
                                                                        |-- 263/
                                                                             |-- 261/
                                                                                  |-- 266/
      
      When processing inode 257 we delay its move (rename) operation because its
      new parent in the send snapshot, inode 272, was not yet processed. Then
      when processing inode 272, we delay the move operation for that inode
      because inode 274 is its ancestor in the send snapshot. Finally we delay
      the move operation for inode 274 when processing it because inode 275 is
      its new parent in the send snapshot and was not yet moved.
      
      When finishing processing inode 275, we start to do the move operations
      that were previously delayed (at apply_children_dir_moves()), resulting in
      the following iterations:
      
      1) We issue the move operation for inode 274;
      
      2) Because inode 262 depended on the move operation of inode 274 (it was
         delayed because 274 is its ancestor in the send snapshot), we issue the
         move operation for inode 262;
      
      3) We issue the move operation for inode 272, because it was delayed by
         inode 274 too (ancestor of 272 in the send snapshot);
      
      4) We issue the move operation for inode 269 (it was delayed by 262);
      
      5) We issue the move operation for inode 257 (it was delayed by 272);
      
      6) We issue the move operation for inode 260 (it was delayed by 272);
      
      7) We issue the move operation for inode 258 (it was delayed by 269);
      
      8) We issue the move operation for inode 264 (it was delayed by 257);
      
      9) We issue the move operation for inode 271 (it was delayed by 258);
      
      10) We issue the move operation for inode 263 (it was delayed by 264);
      
      11) We issue the move operation for inode 268 (it was delayed by 271);
      
      12) We verify if we can issue the move operation for inode 270 (it was
          delayed by 271). We detect a path loop in the current state, because
          inode 267 needs to be moved first before we can issue the move
          operation for inode 270. So we delay again the move operation for
          inode 270, this time we will attempt to do it after inode 267 is
          moved;
      
      13) We issue the move operation for inode 261 (it was delayed by 263);
      
      14) We verify if we can issue the move operation for inode 266 (it was
          delayed by 263). We detect a path loop in the current state, because
          inode 270 needs to be moved first before we can issue the move
          operation for inode 266. So we delay again the move operation for
          inode 266, this time we will attempt to do it after inode 270 is
          moved (its move operation was delayed in step 12);
      
      15) We issue the move operation for inode 267 (it was delayed by 268);
      
      16) We verify if we can issue the move operation for inode 266 (it was
          delayed by 270). We detect a path loop in the current state, because
          inode 270 needs to be moved first before we can issue the move
          operation for inode 266. So we delay again the move operation for
          inode 266, this time we will attempt to do it after inode 270 is
          moved (its move operation was delayed in step 12). So here we added
          again the same delayed move operation that we added in step 14;
      
      17) We attempt again to see if we can issue the move operation for inode
          266, and as in step 16, we realize we can not due to a path loop in
          the current state due to a dependency on inode 270. Again we delay
          inode's 266 rename to happen after inode's 270 move operation, adding
          the same dependency to the empty stack that we did in steps 14 and 16.
          The next iteration will pick the same move dependency on the stack
          (the only entry) and realize again there is still a path loop and then
          again the same dependency to the stack, over and over, resulting in
          an infinite loop.
      
      So fix this by preventing adding the same move dependency entries to the
      stack by removing each pending move record from the red black tree of
      pending moves. This way the next call to get_pending_dir_moves() will
      not return anything for the current parent inode.
      
      A test case for fstests, with this reproducer, follows soon.
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      [Wrote changelog with example and more clear explanation]
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      91f6a9aa
    • J
      aio: fix failure to put the file pointer · df66ef67
      Jens Axboe 提交于
      [ Upstream commit 53fffe29a9e664a999dd3787e4428da8c30533e0 ]
      
      If the ioprio capability check fails, we return without putting
      the file pointer.
      
      Fixes: d9a08a9e ("fs: Add aio iopriority support")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      df66ef67