1. 17 9月, 2019 3 次提交
  2. 16 9月, 2019 1 次提交
    • L
      Revert "ext4: make __ext4_get_inode_loc plug" · 72dbcf72
      Linus Torvalds 提交于
      This reverts commit b03755ad.
      
      This is sad, and done for all the wrong reasons.  Because that commit is
      good, and does exactly what it says: avoids a lot of small disk requests
      for the inode table read-ahead.
      
      However, it turns out that it causes an entirely unrelated problem: the
      getrandom() system call was introduced back in 2014 by commit
      c6e9d6f3 ("random: introduce getrandom(2) system call"), and people
      use it as a convenient source of good random numbers.
      
      But part of the current semantics for getrandom() is that it waits for
      the entropy pool to fill at least partially (unlike /dev/urandom).  And
      at least ArchLinux apparently has a systemd that uses getrandom() at
      boot time, and the improvements in IO patterns means that existing
      installations suddenly start hanging, waiting for entropy that will
      never happen.
      
      It seems to be an unlucky combination of not _quite_ enough entropy,
      together with a particular systemd version and configuration.  Lennart
      says that the systemd-random-seed process (which is what does this early
      access) is supposed to not block any other boot activity, but sadly that
      doesn't actually seem to be the case (possibly due bogus dependencies on
      cryptsetup for encrypted swapspace).
      
      The correct fix is to fix getrandom() to not block when it's not
      appropriate, but that fix is going to take a lot more discussion.  Do we
      just make it act like /dev/urandom by default, and add a new flag for
      "wait for entropy"? Do we add a boot-time option? Or do we just limit
      the amount of time it will wait for entropy?
      
      So in the meantime, we do the revert to give us time to discuss the
      eventual fix for the fundamental problem, at which point we can re-apply
      the ext4 inode table access optimization.
      Reported-by: NAhmed S. Darwish <darwish.07@gmail.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Alexander E. Patrakov <patrakov@gmail.com>
      Cc: Lennart Poettering <mzxreary@0pointer.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72dbcf72
  3. 12 9月, 2019 2 次提交
    • F
      Btrfs: fix unwritten extent buffers and hangs on future writeback attempts · 18dfa711
      Filipe Manana 提交于
      The lock_extent_buffer_io() returns 1 to the caller to tell it everything
      went fine and the callers needs to start writeback for the extent buffer
      (submit a bio, etc), 0 to tell the caller everything went fine but it does
      not need to start writeback for the extent buffer, and a negative value if
      some error happened.
      
      When it's about to return 1 it tries to lock all pages, and if a try lock
      on a page fails, and we didn't flush any existing bio in our "epd", it
      calls flush_write_bio(epd) and overwrites the return value of 1 to 0 or
      an error. The page might have been locked elsewhere, not with the goal
      of starting writeback of the extent buffer, and even by some code other
      than btrfs, like page migration for example, so it does not mean the
      writeback of the extent buffer was already started by some other task,
      so returning a 0 tells the caller (btree_write_cache_pages()) to not
      start writeback for the extent buffer. Note that epd might currently have
      either no bio, so flush_write_bio() returns 0 (success) or it might have
      a bio for another extent buffer with a lower index (logical address).
      
      Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the
      extent buffer and writeback is never started for the extent buffer,
      future attempts to writeback the extent buffer will hang forever waiting
      on that bit to be cleared, since it can only be cleared after writeback
      completes. Such hang is reported with a trace like the following:
      
        [49887.347053] INFO: task btrfs-transacti:1752 blocked for more than 122 seconds.
        [49887.347059]       Not tainted 5.2.13-gentoo #2
        [49887.347060] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [49887.347062] btrfs-transacti D    0  1752      2 0x80004000
        [49887.347064] Call Trace:
        [49887.347069]  ? __schedule+0x265/0x830
        [49887.347071]  ? bit_wait+0x50/0x50
        [49887.347072]  ? bit_wait+0x50/0x50
        [49887.347074]  schedule+0x24/0x90
        [49887.347075]  io_schedule+0x3c/0x60
        [49887.347077]  bit_wait_io+0x8/0x50
        [49887.347079]  __wait_on_bit+0x6c/0x80
        [49887.347081]  ? __lock_release.isra.29+0x155/0x2d0
        [49887.347083]  out_of_line_wait_on_bit+0x7b/0x80
        [49887.347084]  ? var_wake_function+0x20/0x20
        [49887.347087]  lock_extent_buffer_for_io+0x28c/0x390
        [49887.347089]  btree_write_cache_pages+0x18e/0x340
        [49887.347091]  do_writepages+0x29/0xb0
        [49887.347093]  ? kmem_cache_free+0x132/0x160
        [49887.347095]  ? convert_extent_bit+0x544/0x680
        [49887.347097]  filemap_fdatawrite_range+0x70/0x90
        [49887.347099]  btrfs_write_marked_extents+0x53/0x120
        [49887.347100]  btrfs_write_and_wait_transaction.isra.4+0x38/0xa0
        [49887.347102]  btrfs_commit_transaction+0x6bb/0x990
        [49887.347103]  ? start_transaction+0x33e/0x500
        [49887.347105]  transaction_kthread+0x139/0x15c
      
      So fix this by not overwriting the return value (ret) with the result
      from flush_write_bio(). We also need to clear the EXTENT_BUFFER_WRITEBACK
      bit in case flush_write_bio() returns an error, otherwise it will hang
      any future attempts to writeback the extent buffer, and undo all work
      done before (set back EXTENT_BUFFER_DIRTY, etc).
      
      This is a regression introduced in the 5.2 kernel.
      
      Fixes: 2e3c2513 ("btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()")
      Fixes: f4340622 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
      Reported-by: NZdenek Sojka <zsojka@seznam.cz>
      Link: https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@seznam.cz/T/#uReported-by: NStefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Link: https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@profihost.ag/T/#tReported-by: NDrazen Kacar <drazen.kacar@oradian.com>
      Link: https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@DB8PR03MB5628.eurprd03.prod.outlook.com/
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      18dfa711
    • F
      Btrfs: fix assertion failure during fsync and use of stale transaction · 410f954c
      Filipe Manana 提交于
      Sometimes when fsync'ing a file we need to log that other inodes exist and
      when we need to do that we acquire a reference on the inodes and then drop
      that reference using iput() after logging them.
      
      That generally is not a problem except if we end up doing the final iput()
      (dropping the last reference) on the inode and that inode has a link count
      of 0, which can happen in a very short time window if the logging path
      gets a reference on the inode while it's being unlinked.
      
      In that case we end up getting the eviction callback, btrfs_evict_inode(),
      invoked through the iput() call chain which needs to drop all of the
      inode's items from its subvolume btree, and in order to do that, it needs
      to join a transaction at the helper function evict_refill_and_join().
      However because the task previously started a transaction at the fsync
      handler, btrfs_sync_file(), it has current->journal_info already pointing
      to a transaction handle and therefore evict_refill_and_join() will get
      that transaction handle from btrfs_join_transaction(). From this point on,
      two different problems can happen:
      
      1) evict_refill_and_join() will often change the transaction handle's
         block reserve (->block_rsv) and set its ->bytes_reserved field to a
         value greater than 0. If evict_refill_and_join() never commits the
         transaction, the eviction handler ends up decreasing the reference
         count (->use_count) of the transaction handle through the call to
         btrfs_end_transaction(), and after that point we have a transaction
         handle with a NULL ->block_rsv (which is the value prior to the
         transaction join from evict_refill_and_join()) and a ->bytes_reserved
         value greater than 0. If after the eviction/iput completes the inode
         logging path hits an error or it decides that it must fallback to a
         transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a
         non-zero value from btrfs_log_dentry_safe(), and because of that
         non-zero value it tries to commit the transaction using a handle with
         a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes
         the transaction commit hit an assertion failure at
         btrfs_trans_release_metadata() because ->bytes_reserved is not zero but
         the ->block_rsv is NULL. The produced stack trace for that is like the
         following:
      
         [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816
         [192922.917553] ------------[ cut here ]------------
         [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532!
         [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
         [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G        W         5.1.4-btrfs-next-47 #1
         [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
         [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs]
         (...)
         [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286
         [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000
         [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838
         [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000
         [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980
         [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8
         [192922.923200] FS:  00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000
         [192922.923579] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0
         [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         [192922.925105] Call Trace:
         [192922.925505]  btrfs_trans_release_metadata+0x10c/0x170 [btrfs]
         [192922.925911]  btrfs_commit_transaction+0x3e/0xaf0 [btrfs]
         [192922.926324]  btrfs_sync_file+0x44c/0x490 [btrfs]
         [192922.926731]  do_fsync+0x38/0x60
         [192922.927138]  __x64_sys_fdatasync+0x13/0x20
         [192922.927543]  do_syscall_64+0x60/0x1c0
         [192922.927939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
         (...)
         [192922.934077] ---[ end trace f00808b12068168f ]---
      
      2) If evict_refill_and_join() decides to commit the transaction, it will
         be able to do it, since the nested transaction join only increments the
         transaction handle's ->use_count reference counter and it does not
         prevent the transaction from getting committed. This means that after
         eviction completes, the fsync logging path will be using a transaction
         handle that refers to an already committed transaction. What happens
         when using such a stale transaction can be unpredictable, we are at
         least having a use-after-free on the transaction handle itself, since
         the transaction commit will call kmem_cache_free() against the handle
         regardless of its ->use_count value, or we can end up silently losing
         all the updates to the log tree after that iput() in the logging path,
         or using a transaction handle that in the meanwhile was allocated to
         another task for a new transaction, etc, pretty much unpredictable
         what can happen.
      
      In order to fix both of them, instead of using iput() during logging, use
      btrfs_add_delayed_iput(), so that the logging path of fsync never drops
      the last reference on an inode, that step is offloaded to a safe context
      (usually the cleaner kthread).
      
      The assertion failure issue was sporadically triggered by the test case
      generic/475 from fstests, which loads the dm error target while fsstress
      is running, which lead to fsync failing while logging inodes with -EIO
      errors and then trying later to commit the transaction, triggering the
      assertion failure.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      410f954c
  4. 05 9月, 2019 1 次提交
    • A
      configfs: provide exclusion between IO and removals · b0841eef
      Al Viro 提交于
      Make sure that attribute methods are not called after the item
      has been removed from the tree.  To do so, we
      	* at the point of no return in removals, grab ->frag_sem
      exclusive and mark the fragment dead.
      	* call the methods of attributes with ->frag_sem taken
      shared and only after having verified that the fragment is still
      alive.
      
      	The main benefit is for method instances - they are
      guaranteed that the objects they are accessing *and* all ancestors
      are still there.  Another win is that we don't need to bother
      with extra refcount on config_item when opening a file -
      the item will be alive for as long as it stays in the tree, and
      we won't touch it/attributes/any associated data after it's
      been removed from the tree.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      b0841eef
  5. 03 9月, 2019 4 次提交
  6. 28 8月, 2019 4 次提交
  7. 27 8月, 2019 8 次提交
  8. 25 8月, 2019 1 次提交
  9. 23 8月, 2019 2 次提交
    • D
      xfs: fix missing ILOCK unlock when xfs_setattr_nonsize fails due to EDQUOT · 1fb254aa
      Darrick J. Wong 提交于
      Benjamin Moody reported to Debian that XFS partially wedges when a chgrp
      fails on account of being out of disk quota.  I ran his reproducer
      script:
      
      # adduser dummy
      # adduser dummy plugdev
      
      # dd if=/dev/zero bs=1M count=100 of=test.img
      # mkfs.xfs test.img
      # mount -t xfs -o gquota test.img /mnt
      # mkdir -p /mnt/dummy
      # chown -c dummy /mnt/dummy
      # xfs_quota -xc 'limit -g bsoft=100k bhard=100k plugdev' /mnt
      
      (and then as user dummy)
      
      $ dd if=/dev/urandom bs=1M count=50 of=/mnt/dummy/foo
      $ chgrp plugdev /mnt/dummy/foo
      
      and saw:
      
      ================================================
      WARNING: lock held when returning to user space!
      5.3.0-rc5 #rc5 Tainted: G        W
      ------------------------------------------------
      chgrp/47006 is leaving the kernel with locks still held!
      1 lock held by chgrp/47006:
       #0: 000000006664ea2d (&xfs_nondir_ilock_class){++++}, at: xfs_ilock+0xd2/0x290 [xfs]
      
      ...which is clearly caused by xfs_setattr_nonsize failing to unlock the
      ILOCK after the xfs_qm_vop_chown_reserve call fails.  Add the missing
      unlock.
      
      Reported-by: benjamin.moody@gmail.com
      Fixes: 253f4911 ("xfs: better xfs_trans_alloc interface")
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Tested-by: NSalvatore Bonaccorso <carnil@debian.org>
      1fb254aa
    • J
      io_uring: add need_resched() check in inner poll loop · 08f5439f
      Jens Axboe 提交于
      The outer poll loop checks for whether we need to reschedule, and
      returns to userspace if we do. However, it's possible to get stuck
      in the inner loop as well, if the CPU we are running on needs to
      reschedule to finish the IO work.
      
      Add the need_resched() check in the inner loop as well. This fixes
      a potential hang if the kernel is configured with
      CONFIG_PREEMPT_VOLUNTARY=y.
      Reported-by: NSagi Grimberg <sagi@grimberg.me>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Tested-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      08f5439f
  10. 22 8月, 2019 11 次提交
    • L
      ubifs: Limit the number of pages in shrink_liability · 0af83abb
      Liu Song 提交于
      If the number of dirty pages to be written back is large,
      then writeback_inodes_sb will block waiting for a long time,
      causing hung task detection alarm. Therefore, we should limit
      the maximum number of pages written back this time, which let
      the budget be completed faster. The remaining dirty pages
      tend to rely on the writeback mechanism to complete the
      synchronization.
      
      Fixes: b6e51316 ("writeback: separate starting of sync vs opportunistic writeback")
      Signed-off-by: NLiu Song <liu.song11@zte.com.cn>
      Signed-off-by: NRichard Weinberger <richard@nod.at>
      0af83abb
    • R
      ubifs: Correctly initialize c->min_log_bytes · 377e208f
      Richard Weinberger 提交于
      Currently on a freshly mounted UBIFS, c->min_log_bytes is 0.
      This can lead to a log overrun and make commits fail.
      
      Recent kernels will report the following assert:
      UBIFS assert failed: c->lhead_lnum != c->ltail_lnum, in fs/ubifs/log.c:412
      
      c->min_log_bytes can have two states, 0 and c->leb_size.
      It controls how much bytes of the log area are reserved for non-bud
      nodes such as commit nodes.
      
      After a commit it has to be set to c->leb_size such that we have always
      enough space for a commit. While a commit runs it can be 0 to make the
      remaining bytes of the log available to writers.
      
      Having it set to 0 right after mount is wrong since no space for commits
      is reserved.
      
      Fixes: 1e51764a ("UBIFS: add new flash file system")
      Reported-and-tested-by: NUwe Kleine-König <u.kleine-koenig@pengutronix.de>
      Signed-off-by: NRichard Weinberger <richard@nod.at>
      377e208f
    • R
      ubifs: Fix double unlock around orphan_delete() · 4dd75b33
      Richard Weinberger 提交于
      We unlock after orphan_delete(), so no need to unlock
      in the function too.
      Reported-by: NHan Xu <han.xu@nxp.com>
      Fixes: 8009ce95 ("ubifs: Don't leak orphans on memory during commit")
      Signed-off-by: NRichard Weinberger <richard@nod.at>
      4dd75b33
    • Y
      afs: use correct afs_call_type in yfs_fs_store_opaque_acl2 · 7533be85
      YueHaibing 提交于
      It seems that 'yfs_RXYFSStoreOpaqueACL2' should be use in
      yfs_fs_store_opaque_acl2().
      
      Fixes: f5e45463 ("afs: Implement YFS ACL setting")
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      7533be85
    • M
      afs: Fix possible oops in afs_lookup trace event · c4c613ff
      Marc Dionne 提交于
      The afs_lookup trace event can cause the following:
      
      [  216.576777] BUG: kernel NULL pointer dereference, address: 000000000000023b
      [  216.576803] #PF: supervisor read access in kernel mode
      [  216.576813] #PF: error_code(0x0000) - not-present page
      ...
      [  216.576913] RIP: 0010:trace_event_raw_event_afs_lookup+0x9e/0x1c0 [kafs]
      
      If the inode from afs_do_lookup() is an error other than ENOENT, or if it
      is ENOENT and afs_try_auto_mntpt() returns an error, the trace event will
      try to dereference the error pointer as a valid pointer.
      
      Use IS_ERR_OR_NULL to only pass a valid pointer for the trace, or NULL.
      
      Ideally the trace would include the error value, but for now just avoid
      the oops.
      
      Fixes: 80548b03 ("afs: Add more tracepoints")
      Signed-off-by: NMarc Dionne <marc.dionne@auristor.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      c4c613ff
    • D
      afs: Fix leak in afs_lookup_cell_rcu() · a5fb8e6c
      David Howells 提交于
      Fix a leak on the cell refcount in afs_lookup_cell_rcu() due to
      non-clearance of the default error in the case a NULL cell name is passed
      and the workstation default cell is used.
      
      Also put a bit at the end to make sure we don't leak a cell ref if we're
      going to be returning an error.
      
      This leak results in an assertion like the following when the kafs module is
      unloaded:
      
      	AFS: Assertion failed
      	2 == 1 is false
      	0x2 == 0x1 is false
      	------------[ cut here ]------------
      	kernel BUG at fs/afs/cell.c:770!
      	...
      	RIP: 0010:afs_manage_cells+0x220/0x42f [kafs]
      	...
      	 process_one_work+0x4c2/0x82c
      	 ? pool_mayday_timeout+0x1e1/0x1e1
      	 ? do_raw_spin_lock+0x134/0x175
      	 worker_thread+0x336/0x4a6
      	 ? rescuer_thread+0x4af/0x4af
      	 kthread+0x1de/0x1ee
      	 ? kthread_park+0xd4/0xd4
      	 ret_from_fork+0x24/0x30
      
      Fixes: 989782dc ("afs: Overhaul cell database management")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      a5fb8e6c
    • J
      ceph: don't try fill file_lock on unsuccessful GETFILELOCK reply · 28a28261
      Jeff Layton 提交于
      When ceph_mdsc_do_request returns an error, we can't assume that the
      filelock_reply pointer will be set. Only try to fetch fields out of
      the r_reply_info when it returns success.
      
      Cc: stable@vger.kernel.org
      Reported-by: NHector Martin <hector@marcansoft.com>
      Signed-off-by: NJeff Layton <jlayton@kernel.org>
      Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      28a28261
    • E
      ceph: clear page dirty before invalidate page · c95f1c5f
      Erqi Chen 提交于
      clear_page_dirty_for_io(page) before mapping->a_ops->invalidatepage().
      invalidatepage() clears page's private flag, if dirty flag is not
      cleared, the page may cause BUG_ON failure in ceph_set_page_dirty().
      
      Cc: stable@vger.kernel.org
      Link: https://tracker.ceph.com/issues/40862Signed-off-by: NErqi Chen <chenerqi@gmail.com>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      c95f1c5f
    • L
      ceph: fix buffer free while holding i_ceph_lock in fill_inode() · af8a85a4
      Luis Henriques 提交于
      Calling ceph_buffer_put() in fill_inode() may result in freeing the
      i_xattrs.blob buffer while holding the i_ceph_lock.  This can be fixed by
      postponing the call until later, when the lock is released.
      
      The following backtrace was triggered by fstests generic/070.
      
        BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
        in_atomic(): 1, irqs_disabled(): 0, pid: 3852, name: kworker/0:4
        6 locks held by kworker/0:4/3852:
         #0: 000000004270f6bb ((wq_completion)ceph-msgr){+.+.}, at: process_one_work+0x1b8/0x5f0
         #1: 00000000eb420803 ((work_completion)(&(&con->work)->work)){+.+.}, at: process_one_work+0x1b8/0x5f0
         #2: 00000000be1c53a4 (&s->s_mutex){+.+.}, at: dispatch+0x288/0x1476
         #3: 00000000559cb958 (&mdsc->snap_rwsem){++++}, at: dispatch+0x2eb/0x1476
         #4: 000000000d5ebbae (&req->r_fill_mutex){+.+.}, at: dispatch+0x2fc/0x1476
         #5: 00000000a83d0514 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: fill_inode.isra.0+0xf8/0xf70
        CPU: 0 PID: 3852 Comm: kworker/0:4 Not tainted 5.2.0+ #441
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
        Workqueue: ceph-msgr ceph_con_workfn
        Call Trace:
         dump_stack+0x67/0x90
         ___might_sleep.cold+0x9f/0xb1
         vfree+0x4b/0x60
         ceph_buffer_release+0x1b/0x60
         fill_inode.isra.0+0xa9b/0xf70
         ceph_fill_trace+0x13b/0xc70
         ? dispatch+0x2eb/0x1476
         dispatch+0x320/0x1476
         ? __mutex_unlock_slowpath+0x4d/0x2a0
         ceph_con_workfn+0xc97/0x2ec0
         ? process_one_work+0x1b8/0x5f0
         process_one_work+0x244/0x5f0
         worker_thread+0x4d/0x3e0
         kthread+0x105/0x140
         ? process_one_work+0x5f0/0x5f0
         ? kthread_park+0x90/0x90
         ret_from_fork+0x3a/0x50
      Signed-off-by: NLuis Henriques <lhenriques@suse.com>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      af8a85a4
    • L
      ceph: fix buffer free while holding i_ceph_lock in __ceph_build_xattrs_blob() · 12fe3dda
      Luis Henriques 提交于
      Calling ceph_buffer_put() in __ceph_build_xattrs_blob() may result in
      freeing the i_xattrs.blob buffer while holding the i_ceph_lock.  This can
      be fixed by having this function returning the old blob buffer and have
      the callers of this function freeing it when the lock is released.
      
      The following backtrace was triggered by fstests generic/117.
      
        BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
        in_atomic(): 1, irqs_disabled(): 0, pid: 649, name: fsstress
        4 locks held by fsstress/649:
         #0: 00000000a7478e7e (&type->s_umount_key#19){++++}, at: iterate_supers+0x77/0xf0
         #1: 00000000f8de1423 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: ceph_check_caps+0x7b/0xc60
         #2: 00000000562f2b27 (&s->s_mutex){+.+.}, at: ceph_check_caps+0x3bd/0xc60
         #3: 00000000f83ce16a (&mdsc->snap_rwsem){++++}, at: ceph_check_caps+0x3ed/0xc60
        CPU: 1 PID: 649 Comm: fsstress Not tainted 5.2.0+ #439
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x67/0x90
         ___might_sleep.cold+0x9f/0xb1
         vfree+0x4b/0x60
         ceph_buffer_release+0x1b/0x60
         __ceph_build_xattrs_blob+0x12b/0x170
         __send_cap+0x302/0x540
         ? __lock_acquire+0x23c/0x1e40
         ? __mark_caps_flushing+0x15c/0x280
         ? _raw_spin_unlock+0x24/0x30
         ceph_check_caps+0x5f0/0xc60
         ceph_flush_dirty_caps+0x7c/0x150
         ? __ia32_sys_fdatasync+0x20/0x20
         ceph_sync_fs+0x5a/0x130
         iterate_supers+0x8f/0xf0
         ksys_sync+0x4f/0xb0
         __ia32_sys_sync+0xa/0x10
         do_syscall_64+0x50/0x1c0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7fc6409ab617
      Signed-off-by: NLuis Henriques <lhenriques@suse.com>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      12fe3dda
    • L
      ceph: fix buffer free while holding i_ceph_lock in __ceph_setxattr() · 86968ef2
      Luis Henriques 提交于
      Calling ceph_buffer_put() in __ceph_setxattr() may end up freeing the
      i_xattrs.prealloc_blob buffer while holding the i_ceph_lock.  This can be
      fixed by postponing the call until later, when the lock is released.
      
      The following backtrace was triggered by fstests generic/117.
      
        BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
        in_atomic(): 1, irqs_disabled(): 0, pid: 650, name: fsstress
        3 locks held by fsstress/650:
         #0: 00000000870a0fe8 (sb_writers#8){.+.+}, at: mnt_want_write+0x20/0x50
         #1: 00000000ba0c4c74 (&type->i_mutex_dir_key#6){++++}, at: vfs_setxattr+0x55/0xa0
         #2: 000000008dfbb3f2 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: __ceph_setxattr+0x297/0x810
        CPU: 1 PID: 650 Comm: fsstress Not tainted 5.2.0+ #437
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x67/0x90
         ___might_sleep.cold+0x9f/0xb1
         vfree+0x4b/0x60
         ceph_buffer_release+0x1b/0x60
         __ceph_setxattr+0x2b4/0x810
         __vfs_setxattr+0x66/0x80
         __vfs_setxattr_noperm+0x59/0xf0
         vfs_setxattr+0x81/0xa0
         setxattr+0x115/0x230
         ? filename_lookup+0xc9/0x140
         ? rcu_read_lock_sched_held+0x74/0x80
         ? rcu_sync_lockdep_assert+0x2e/0x60
         ? __sb_start_write+0x142/0x1a0
         ? mnt_want_write+0x20/0x50
         path_setxattr+0xba/0xd0
         __x64_sys_lsetxattr+0x24/0x30
         do_syscall_64+0x50/0x1c0
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7ff23514359a
      Signed-off-by: NLuis Henriques <lhenriques@suse.com>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
      86968ef2
  11. 21 8月, 2019 2 次提交
    • J
      io_uring: don't enter poll loop if we have CQEs pending · a3a0e43f
      Jens Axboe 提交于
      We need to check if we have CQEs pending before starting a poll loop,
      as those could be the events we will be spinning for (and hence we'll
      find none). This can happen if a CQE triggers an error, or if it is
      found by eg an IRQ before we get a chance to find it through polling.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a3a0e43f
    • J
      io_uring: fix potential hang with polled IO · 500f9fba
      Jens Axboe 提交于
      If a request issue ends up being punted to async context to avoid
      blocking, we can get into a situation where the original application
      enters the poll loop for that very request before it has been issued.
      This should not be an issue, except that the polling will hold the
      io_uring uring_ctx mutex for the duration of the poll. When the async
      worker has actually issued the request, it needs to acquire this mutex
      to add the request to the poll issued list. Since the application
      polling is already holding this mutex, the workqueue sleeps on the
      mutex forever, and the application thus never gets a chance to poll for
      the very request it was interested in.
      
      Fix this by ensuring that the polling drops the uring_ctx occasionally
      if it's not making any progress.
      Reported-by: NJeffrey M. Birnbaum <jmbnyc@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      500f9fba
  12. 20 8月, 2019 1 次提交