1. 27 Feb, 2020 1 commit
    • io_uring: drop file set ref put/get on switch · dd3db2a3
      Jens Axboe committed
      Dan reports that he triggered a warning on ring exit doing some testing:
      
      percpu ref (io_file_data_ref_zero) <= 0 (0) after switching to atomic
      WARNING: CPU: 3 PID: 0 at lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0xe8/0xf0
      Modules linked in:
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc3+ #5648
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      RIP: 0010:percpu_ref_switch_to_atomic_rcu+0xe8/0xf0
      Code: e7 ff 55 e8 eb d2 80 3d bd 02 d2 00 00 75 8b 48 8b 55 d8 48 c7 c7 e8 70 e6 81 c6 05 a9 02 d2 00 01 48 8b 75 e8 e8 3a d0 c5 ff <0f> 0b e9 69 ff ff ff 90 55 48 89 fd 53 48 89 f3 48 83 ec 28 48 83
      RSP: 0018:ffffc90000110ef8 EFLAGS: 00010292
      RAX: 0000000000000045 RBX: 7fffffffffffffff RCX: 0000000000000000
      RDX: 0000000000000045 RSI: ffffffff825be7a5 RDI: ffffffff825bc32c
      RBP: ffff8881b75eac38 R08: 000000042364b941 R09: 0000000000000045
      R10: ffffffff825beb40 R11: ffffffff825be78a R12: 0000607e46005aa0
      R13: ffff888107dcdd00 R14: 0000000000000000 R15: 0000000000000009
      FS:  0000000000000000(0000) GS:ffff8881b9d80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f49e6a5ea20 CR3: 00000001b747c004 CR4: 00000000001606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       rcu_core+0x1e4/0x4d0
       __do_softirq+0xdb/0x2f1
       irq_exit+0xa0/0xb0
       smp_apic_timer_interrupt+0x60/0x140
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      RIP: 0010:default_idle+0x23/0x170
      Code: ff eb ab cc cc cc cc 0f 1f 44 00 00 41 54 55 53 65 8b 2d 10 96 92 7e 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 21 d0 51 00 fb f4 <65> 8b 2d f6 95 92 7e 0f 1f 44 00 00 5b 5d 41 5c c3 65 8b 05 e5 95
      
      Turns out that this is due to percpu_ref_switch_to_atomic() only
      grabbing a reference to the percpu refcount if it's not already in
      atomic mode. io_uring drops a ref and re-gets it when switching back to
      percpu mode. We attempt to protect against this with the FFD_F_ATOMIC
      bit, but that isn't reliable.
      
      We don't actually need to juggle these refcounts between atomic and
      percpu switch, we can just do them when we've switched to atomic mode.
      This removes the need for FFD_F_ATOMIC, which wasn't reliable.
      
      Fixes: 05f3fb3c ("io_uring: avoid ring quiesce for fixed file set unregister and update")
      Reported-by: Dan Melnic <dmm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 26 Feb, 2020 3 commits
    • io_uring: import_single_range() returns 0/-ERROR · 3a901598
      Jens Axboe committed
      Unlike the other core import helpers, import_single_range() returns 0 on
      success, not the length imported. This means that links that depend on
      the result of non-vec based IORING_OP_{READ,WRITE} that were added for
      5.5 get errored when they should not be.
      
      Fixes: 3a6820f2 ("io_uring: add non-vectored read/write commands")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
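      The return-value mismatch described in this commit can be sketched in
      plain C. This is a minimal userspace model, not the kernel code:
      model_import_single() and model_import() are hypothetical names, and the
      only behaviour taken from the commit message is that the single-range
      helper returns 0 on success while its callers expect the imported length.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a helper with import_single_range()-style
 * semantics: 0 on success, negative errno on failure.
 */
int model_import_single(const void *buf, size_t len)
{
	(void)len;
	return buf ? 0 : -EFAULT;
}

/*
 * Caller-side normalization: turn "0 on success" into "bytes imported",
 * the convention the commit message attributes to the other helpers.
 */
long model_import(const void *buf, size_t len)
{
	int ret;

	assert(len <= (size_t)LONG_MAX);
	ret = model_import_single(buf, len);
	return ret < 0 ? ret : (long)len;
}
```

      A caller that compares the result with the requested length sees a
      match after normalization, whereas the raw 0 would look like a short
      transfer.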
    • io_uring: pick up link work on submit reference drop · 2a44f467
      Jens Axboe committed
      If work completes inline, then we should pick up a dependent link item
      in __io_queue_sqe() as well. If we don't do so, we're forced to go async
      with that item, which is suboptimal.
      
      This also fixes an issue with io_put_req_find_next(), which always looks
      up the next work item. That should only be done if we're dropping the
      last reference to the request, to prevent multiple lookups of the same
      work item.
      
      Outside of being a fix, this also enables a good cleanup series for 5.7,
      where we never have to pass 'nxt' around or into the work handlers.
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
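      The "only look up the next item on the last reference" rule from this
      commit can be illustrated with a small model. The struct and function
      below are hypothetical and far simpler than the real request handling;
      they only show why tying the lookup to the final put prevents the same
      linked item from being found twice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced request: a refcount plus one linked request. */
struct model_req {
	int refs;
	struct model_req *link;
	int lookups;	/* how many times the link was looked up */
};

/*
 * Drop one reference. The linked request is looked up (and detached) only
 * when this was the last reference, so it can be handed out at most once.
 */
struct model_req *model_put_req_find_next(struct model_req *req)
{
	struct model_req *nxt;

	assert(req->refs > 0);
	if (--req->refs > 0)
		return NULL;

	req->lookups++;
	nxt = req->link;
	req->link = NULL;
	return nxt;
}
```

      With two references held, the first put returns NULL and the second
      returns the linked request exactly once.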
    • io-wq: ensure work->task_pid is cleared on init · 2d141dd2
      Jens Axboe committed
      We use ->task_pid for exit cancellation, but we need to ensure it's
      cleared to zero for io_req_work_grab_env() to do the right thing. Take
      a suggestion from Bart and clear the whole thing, just setting the
      function passed in. This makes it more future proof as well.
      
      Fixes: 36282881 ("io-wq: add io_wq_cancel_pid() to cancel based on a specific pid")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
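      "Clear the whole thing, just setting the function passed in" can be
      sketched as follows. The struct layout is an assumption made for
      illustration (only ->task_pid and the handler are named in the commit
      message), and model_init_work() is a hypothetical helper, not the
      kernel's initializer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work item; field names other than task_pid are made up. */
struct model_work {
	void (*func)(struct model_work *);
	int task_pid;
	unsigned int flags;
	void *data;
};

/*
 * Reset every field and set only the handler. Assigning a compound literal
 * zero-initializes all members that are not named in it, so fields added
 * later are cleared automatically as well.
 */
void model_init_work(struct model_work *work,
		     void (*func)(struct model_work *))
{
	assert(work != NULL);
	*work = (struct model_work){ .func = func };
}

/* Example handler, present only so there is something to pass in. */
void model_noop(struct model_work *work)
{
	(void)work;
}
```

      Compared with clearing individual fields, this form does not need to
      be revisited when the struct grows, which matches the "future proof"
      remark in the message.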
  3. 25 Feb, 2020 2 commits
    • io-wq: remove spin-for-work optimization · 3030fd4c
      Jens Axboe committed
      Andres reports that buffered IO seems to suck up more cycles than we
      would like, and he narrowed it down to the fact that the io-wq workers
      will briefly spin for more work on completion of a work item. This was
      a win on the networking side, but apparently some other cases take a
      hit because of it. Remove the optimization to avoid burning more CPU
      than we have to for disk IO.
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL · bdcd3eab
      Xiaoguang Wang committed
      After making ext4 support iopoll method:
        let ext4_file_operations's iopoll method be iomap_dio_iopoll(),
      we found fio can easily hang in fio_ioring_getevents() with below fio
      job:
          rm -f testfile; sync;
          sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread
      -rw=write -ioengine=io_uring  -hipri=1 -sqthread_poll=1 -direct=1
      -bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting
      with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled.
      
      Two issues result in this hang. First, when IORING_SETUP_SQPOLL and
      IORING_SETUP_IOPOLL are both enabled, fio does not use io_uring_enter
      to get completed events; it relies on the kernel io_sq_thread to poll
      for completed events.
      
      Second, there is a race: when io_submit_sqes() in io_sq_thread()
      submits a batch of sqes, the variable 'inflight' records the number of
      submitted reqs, and io_sq_thread then polls for reqs that have been
      added to poll_list. But if some earlier reqs have been punted to an io
      worker, these reqs won't be in poll_list in time. io_sq_thread() will
      only poll for part of the previously submitted reqs, then find
      poll_list empty and reset 'inflight' to zero. If the app just waits
      for these deferred reqs and does not wake up io_sq_thread again, a
      hang happens.
      
      For apps that rely entirely on io_sq_thread to poll completed
      requests, let io_iopoll_req_issued() wake up io_sq_thread properly
      when adding a new element to poll_list, and when io_sq_thread prepares
      to sleep, check again whether poll_list is empty; if it is not,
      continue to poll.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 24 Feb, 2020 2 commits
  5. 22 Feb, 2020 7 commits
    • io_uring: fix __io_iopoll_check deadlock in io_sq_thread · c7849be9
      Xiaoguang Wang committed
      Since commit a3a0e43f ("io_uring: don't enter poll loop if we have
      CQEs pending"), if we already have events pending, we won't enter the
      poll loop. If SETUP_IOPOLL and SETUP_SQPOLL are both enabled, and the
      app has been terminated without reaping pending events that are
      already in the cq ring while there are still reqs in poll_list,
      io_sq_thread will enter __io_iopoll_check(), find pending events, and
      return, so this loop never has a chance to exit.
      
      I have seen this issue in fio stress tests. To fix it, let
      io_sq_thread call io_iopoll_getevents() with argument 'min' being zero,
      and remove __io_iopoll_check().
      
      Fixes: a3a0e43f ("io_uring: don't enter poll loop if we have CQEs pending")
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • ext4: fix mount failure with quota configured as module · 9db176bc
      Jan Kara committed
      When CONFIG_QFMT_V2 is configured as a module, the test in
      ext4_feature_set_ok() fails and so mount of filesystems with quota or
      project features fails. Fix the test to use IS_ENABLED macro which
      works properly even for modules.
      
      Link: https://lore.kernel.org/r/20200221100835.9332-1-jack@suse.cz
      Fixes: d65d87a0 ("ext4: improve explanation of a mount failure caused by a misconfigured kernel")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
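      Why a plain defined() test misses a modular option can be shown with a
      reduced userspace imitation of the IS_ENABLED machinery. Everything
      prefixed MODEL_ or model_ is hypothetical, and the sketch assumes that a
      "=m" configuration defines CONFIG_QFMT_V2_MODULE instead of
      CONFIG_QFMT_V2.

```c
#include <assert.h>

/* Assumption: this is what a "CONFIG_QFMT_V2=m" build would define. */
#define CONFIG_QFMT_V2_MODULE 1

/* Reduced imitation of the preprocessor trick behind IS_ENABLED(). */
#define MODEL_ARG_PLACEHOLDER_1 0,
#define MODEL_SECOND_ARG(ignored, val, ...) val
#define MODEL_IS_DEFINED(x) MODEL_IS_DEFINED_1(x)
#define MODEL_IS_DEFINED_1(val) MODEL_IS_DEFINED_2(MODEL_ARG_PLACEHOLDER_##val)
#define MODEL_IS_DEFINED_2(arg1_or_junk) MODEL_SECOND_ARG(arg1_or_junk 1, 0, 0)
#define MODEL_IS_ENABLED(option) \
	(MODEL_IS_DEFINED(option) || MODEL_IS_DEFINED(option##_MODULE))

/* The fragile test: only true when the option itself is defined (=y). */
int model_quota_via_defined(void)
{
#if defined(CONFIG_QFMT_V2)
	return 1;
#else
	return 0;
#endif
}

/* The robust test: true for both built-in (=y) and modular (=m) options. */
int model_quota_via_is_enabled(void)
{
	return MODEL_IS_ENABLED(CONFIG_QFMT_V2);
}

static_assert(MODEL_IS_ENABLED(CONFIG_QFMT_V2) == 1,
	      "a modular option must count as enabled");
```

      In this configuration the defined() variant reports 0 and the
      IS_ENABLED-style variant reports 1, which mirrors the failing and the
      fixed test described in the message.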
    • jbd2: fix ocfs2 corrupt when clearing block group bits · 8eedabfd
      wangyan committed
      I found a NULL pointer dereference in ocfs2_block_group_clear_bits().
      The running environment:
      	kernel version: 4.19
      	A cluster with two nodes, 5 LUNs mounted on both nodes, running
      	file operations like dd/fallocate/truncate/rm on every LUN while the
      	storage network is disconnected.
      
      The fallocate operation on dm-23-45 caused a NULL pointer dereference.
      
      The NULL pointer dereference output is as follows:
      	[577992.878282] JBD2: Error -5 detected when updating journal superblock for dm-23-45.
      	[577992.878290] Aborting journal on device dm-23-45.
      	...
      	[577992.890778] JBD2: Error -5 detected when updating journal superblock for dm-24-46.
      	[577992.890908] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890916] (fallocate,88392,52):ocfs2_extend_trans:474 ERROR: status = -30
      	[577992.890918] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890920] (fallocate,88392,52):ocfs2_rotate_tree_right:2500 ERROR: status = -30
      	[577992.890922] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890924] (fallocate,88392,52):ocfs2_do_insert_extent:4382 ERROR: status = -30
      	[577992.890928] (fallocate,88392,52):ocfs2_insert_extent:4842 ERROR: status = -30
      	[577992.890928] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890930] (fallocate,88392,52):ocfs2_add_clusters_in_btree:4947 ERROR: status = -30
      	[577992.890933] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890939] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890949] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
      	[577992.890950] Mem abort info:
      	[577992.890951]   ESR = 0x96000004
      	[577992.890952]   Exception class = DABT (current EL), IL = 32 bits
      	[577992.890952]   SET = 0, FnV = 0
      	[577992.890953]   EA = 0, S1PTW = 0
      	[577992.890954] Data abort info:
      	[577992.890955]   ISV = 0, ISS = 0x00000004
      	[577992.890956]   CM = 0, WnR = 0
      	[577992.890958] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000f8da07a9
      	[577992.890960] [0000000000000020] pgd=0000000000000000
      	[577992.890964] Internal error: Oops: 96000004 [#1] SMP
      	[577992.890965] Process fallocate (pid: 88392, stack limit = 0x00000000013db2fd)
      	[577992.890968] CPU: 52 PID: 88392 Comm: fallocate Kdump: loaded Tainted: G        W  OE     4.19.36 #1
      	[577992.890969] Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019
      	[577992.890971] pstate: 60400009 (nZCv daif +PAN -UAO)
      	[577992.891054] pc : _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2]
      	[577992.891082] lr : _ocfs2_free_suballoc_bits+0x618/0x968 [ocfs2]
      	[577992.891084] sp : ffff0000c8e2b810
      	[577992.891085] x29: ffff0000c8e2b820 x28: 0000000000000000
      	[577992.891087] x27: 00000000000006f3 x26: ffffa07957b02e70
      	[577992.891089] x25: ffff807c59d50000 x24: 00000000000006f2
      	[577992.891091] x23: 0000000000000001 x22: ffff807bd39abc30
      	[577992.891093] x21: ffff0000811d9000 x20: ffffa07535d6a000
      	[577992.891097] x19: ffff000001681638 x18: ffffffffffffffff
      	[577992.891098] x17: 0000000000000000 x16: ffff000080a03df0
      	[577992.891100] x15: ffff0000811d9708 x14: 203d207375746174
      	[577992.891101] x13: 73203a524f525245 x12: 20373439343a6565
      	[577992.891103] x11: 0000000000000038 x10: 0101010101010101
      	[577992.891106] x9 : ffffa07c68a85d70 x8 : 7f7f7f7f7f7f7f7f
      	[577992.891109] x7 : 0000000000000000 x6 : 0000000000000080
      	[577992.891110] x5 : 0000000000000000 x4 : 0000000000000002
      	[577992.891112] x3 : ffff000001713390 x2 : 2ff90f88b1c22f00
      	[577992.891114] x1 : ffff807bd39abc30 x0 : 0000000000000000
      	[577992.891116] Call trace:
      	[577992.891139]  _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2]
      	[577992.891162]  _ocfs2_free_clusters+0x100/0x290 [ocfs2]
      	[577992.891185]  ocfs2_free_clusters+0x50/0x68 [ocfs2]
      	[577992.891206]  ocfs2_add_clusters_in_btree+0x198/0x5e0 [ocfs2]
      	[577992.891227]  ocfs2_add_inode_data+0x94/0xc8 [ocfs2]
      	[577992.891248]  ocfs2_extend_allocation+0x1bc/0x7a8 [ocfs2]
      	[577992.891269]  ocfs2_allocate_extents+0x14c/0x338 [ocfs2]
      	[577992.891290]  __ocfs2_change_file_space+0x3f8/0x610 [ocfs2]
      	[577992.891309]  ocfs2_fallocate+0xe4/0x128 [ocfs2]
      	[577992.891316]  vfs_fallocate+0x11c/0x250
      	[577992.891317]  ksys_fallocate+0x54/0x88
      	[577992.891319]  __arm64_sys_fallocate+0x28/0x38
      	[577992.891323]  el0_svc_common+0x78/0x130
      	[577992.891325]  el0_svc_handler+0x38/0x78
      	[577992.891327]  el0_svc+0x8/0xc
      
      My analysis is as follows:
      ocfs2_fallocate
        __ocfs2_change_file_space
          ocfs2_allocate_extents
            ocfs2_extend_allocation
              ocfs2_add_inode_data
                ocfs2_add_clusters_in_btree
                  ocfs2_insert_extent
                    ocfs2_do_insert_extent
                      ocfs2_rotate_tree_right
                        ocfs2_extend_rotate_transaction
                          ocfs2_extend_trans
                            jbd2_journal_restart
                              jbd2__journal_restart
                                /* handle->h_transaction is NULL,
                                 * is_handle_aborted(handle) is true
                                 */
                                handle->h_transaction = NULL;
                                start_this_handle
                                  return -EROFS;
                  ocfs2_free_clusters
                    _ocfs2_free_clusters
                      _ocfs2_free_suballoc_bits
                        ocfs2_block_group_clear_bits
                          ocfs2_journal_access_gd
                            __ocfs2_journal_access
                              jbd2_journal_get_undo_access
                                /* I think jbd2_write_access_granted() will
                                 * return true, because do_get_write_access()
                                 * will return -EROFS.
                                 */
                                if (jbd2_write_access_granted(...)) return 0;
                                do_get_write_access
                                  /* handle->h_transaction is NULL, it will
                                   * return -EROFS here, so do_get_write_access()
                                   * was not called.
                                   */
                                  if (is_handle_aborted(handle)) return -EROFS;
                          /* bh2jh(group_bh) is NULL, caused NULL
                             pointer dereference */
                          undo_bg = (struct ocfs2_group_desc *)
                                      bh2jh(group_bh)->b_committed_data;
      
      If handle->h_transaction == NULL, then jbd2_write_access_granted()
      does not really guarantee that journal_head will stay around,
      let alone its b_committed_data. The bh2jh(group_bh) can be removed
      after ocfs2_journal_access_gd() and before the dereference of
      "bh2jh(group_bh)->b_committed_data". So, we should move the
      is_handle_aborted() check from do_get_write_access() into
      jbd2_journal_get_undo_access() and jbd2_journal_get_write_access(),
      before the call to jbd2_write_access_granted().
      
      Link: https://lore.kernel.org/r/f72a623f-b3f1-381a-d91d-d22a1c83a336@huawei.com
      Signed-off-by: Yan Wang <wangyan122@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@kernel.org
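      The reordering proposed in this commit can be reduced to a small model.
      The names below are hypothetical, and the `access_granted` parameter
      stands in for the result of jbd2_write_access_granted(); the sketch only
      contrasts the two check orders and does not reproduce any journal
      logic.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, reduced handle: only the "aborted" state matters here. */
struct model_handle {
	int aborted;
};

/* Stand-in for the slow path, which rejects aborted handles. */
int model_do_get_write_access(const struct model_handle *handle)
{
	return handle->aborted ? -EROFS : 0;
}

/* Old ordering: the fast path can report success for an aborted handle. */
int model_get_access_old(const struct model_handle *handle, int access_granted)
{
	assert(handle);
	if (access_granted)
		return 0;
	return model_do_get_write_access(handle);
}

/* New ordering: reject an aborted handle before trying the fast path. */
int model_get_access_new(const struct model_handle *handle, int access_granted)
{
	assert(handle);
	if (handle->aborted)
		return -EROFS;
	if (access_granted)
		return 0;
	return model_do_get_write_access(handle);
}
```

      With an aborted handle and the fast path reporting "granted", the old
      order returns 0 and lets the caller continue to the dereference, while
      the new order returns -EROFS first.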
    • ext4: fix race between writepages and enabling EXT4_EXTENTS_FL · cb85f4d2
      Eric Biggers committed
      If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running
      on it, the following warning in ext4_add_complete_io() can be hit:
      
      WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120
      
      Here's a minimal reproducer (not 100% reliable) (root isn't required):
      
              while true; do
                      sync
              done &
              while true; do
                      rm -f file
                      touch file
                      chattr -e file
                      echo X >> file
                      chattr +e file
              done
      
      The problem is that in ext4_writepages(), ext4_should_dioread_nolock()
      (which only returns true on extent-based files) is checked once to set
      the number of reserved journal credits, and also again later to select
      the flags for ext4_map_blocks() and copy the reserved journal handle to
      ext4_io_end::handle.  But if EXT4_EXTENTS_FL is being concurrently set,
      the first check can see dioread_nolock disabled while the later one can
      see it enabled, causing the reserved handle to unexpectedly be NULL.
      
      Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races
      related to doing so as well, fix this by synchronizing changing
      EXT4_EXTENTS_FL with ext4_writepages() via the existing
      s_writepages_rwsem (previously called s_journal_flag_rwsem).
      
      This was originally reported by syzbot without a reproducer at
      https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf,
      but now that dioread_nolock is the default I also started seeing this
      when running syzkaller locally.
      
      Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org
      Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com
      Fixes: 6b523df4 ("ext4: use transaction reservation for extent conversion in ext4_end_io")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@kernel.org
    • ext4: rename s_journal_flag_rwsem to s_writepages_rwsem · bbd55937
      Eric Biggers committed
      In preparation for making s_journal_flag_rwsem synchronize
      ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA
      flags (rather than just JOURNAL_DATA as it does currently), rename it to
      s_writepages_rwsem.
      
      Link: https://lore.kernel.org/r/20200219183047.47417-2-ebiggers@kernel.org
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@kernel.org
    • ext4: fix potential race between s_flex_groups online resizing and access · 7c990728
      Suraj Jitindar Singh committed
      During an online resize an array of s_flex_groups structures gets replaced
      so it can get enlarged. If there is a concurrent access to the array and
      this memory has been reused then this can lead to an invalid memory access.
      
      The s_flex_group array has been converted into an array of pointers rather
      than an array of structures. This is to ensure that the information
      contained in the structures cannot get out of sync during a resize due to
      an accessor updating the value in the old structure after it has been
      copied but before the array pointer is updated. Since the structures
      themselves are no longer copied but only the pointers to them this case is
      mitigated.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
      Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu
      Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
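      The benefit of an array of pointers over an array of structures during
      a resize can be shown in userspace. This sketch is an assumption-laden
      model: model_flex_group and model_grow() are hypothetical, and the
      deferred freeing of the old array that a concurrent kernel
      implementation needs is deliberately left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-group counters; the field name is illustrative only. */
struct model_flex_group {
	long free_clusters;
};

/*
 * Build a larger array of pointers. Existing pointers are copied, so the
 * old and new arrays refer to the same structures; only new slots are
 * allocated. Returns NULL on allocation failure, leaving `old` untouched.
 */
struct model_flex_group **model_grow(struct model_flex_group **old,
				     size_t old_n, size_t new_n)
{
	struct model_flex_group **grown;
	size_t i;

	assert(new_n >= old_n);
	grown = calloc(new_n, sizeof(*grown));
	if (!grown)
		return NULL;
	if (old_n)
		memcpy(grown, old, old_n * sizeof(*grown));

	for (i = old_n; i < new_n; i++) {
		grown[i] = calloc(1, sizeof(*grown[i]));
		if (!grown[i]) {
			/* Undo only the slots allocated by this call. */
			while (i > old_n)
				free(grown[--i]);
			free(grown);
			return NULL;
		}
	}
	return grown;
}
```

      An update made through the old array after the copy is still visible
      through the new one, because both point at the same structure. With an
      array of structures the same update would land in the stale copy.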
    • io_uring: prevent sq_thread from spinning when it should stop · 7143b5ac
      Stefano Garzarella committed
      This patch drops 'cur_mm' before calling cond_resched(), to prevent
      the sq_thread from spinning even when the user process is finished.
      
      Before this patch, if the user process ended without closing the
      io_uring fd, the sq_thread continues to spin until the
      'sq_thread_idle' timeout ends.
      
      In the worst case where the 'sq_thread_idle' parameter is bigger than
      INT_MAX, the sq_thread will spin forever.
      
      Fixes: 6c271ce2 ("io_uring: add submission polling")
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 21 Feb, 2020 3 commits
    • Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof · a5ae50de
      Filipe Manana committed
      While logging the prealloc extents of an inode during a fast fsync we call
      btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
      holding a read lock on a leaf of the inode's root (not the log root, the
      fs/subvol root), and then that function locks the file range in the inode's
      iotree. This can lead to a deadlock when:
      
      * the fsync is ranged
      
      * the file has prealloc extents beyond eof
      
      * writeback for a range different from the fsync range starts
        during the fsync
      
      * the size of the file is not sector size aligned
      
      Because when finishing an ordered extent we lock first a file range and
      then try to COW the fs/subvol tree to insert an extent item.
      
      The following diagram shows how the deadlock can happen.
      
                 CPU 1                                        CPU 2
      
        btrfs_sync_file()
          --> for range [0, 1MiB)
      
          --> inode has a size of
              1MiB and has 1 prealloc
              extent beyond the
              i_size, starting at offset
              4MiB
      
          flushes all delalloc for the
          range [0MiB, 1MiB) and waits
          for the respective ordered
          extents to complete
      
                                                    --> before task at CPU 1 locks the
                                                        inode, a write into file range
                                                        [1MiB, 2MiB + 1KiB) is made
      
                                                    --> i_size is updated to 2MiB + 1KiB
      
                                                    --> writeback is started for that
                                                        range, [1MiB, 2MiB + 4KiB)
                                                        --> end offset rounded up to
                                                            be sector size aligned
      
          btrfs_log_dentry_safe()
            btrfs_log_inode_parent()
              btrfs_log_inode()
      
                btrfs_log_changed_extents()
                  btrfs_log_prealloc_extents()
                    --> does a search on the
                        inode's root
                    --> holds a read lock on
                        leaf X
      
                                                    btrfs_finish_ordered_io()
                                                      --> locks range [1MiB, 2MiB + 4KiB)
                                                          --> end offset rounded up
                                                              to be sector size aligned
      
                                                      --> tries to cow leaf X, through
                                                          insert_reserved_file_extent()
                                                          --> already locked by the
                                                              task at CPU 1
      
                    btrfs_truncate_inode_items()
      
                      --> gets an i_size of
                          2MiB + 1KiB, which is
                          not sector size
                          aligned
      
                      --> tries to lock file
                          range [2MiB, (u64)-1)
                          --> the start range
                              is rounded down
                              from 2MiB + 1K
                              to 2MiB to be sector
                              size aligned
      
                          --> but the subrange
                              [2MiB, 2MiB + 4KiB) is
                              already locked by
                              task at CPU 2 which
                              is waiting to get a
                              write lock on leaf X
                              for which we are
                              holding a read lock
      
                                      *** deadlock ***
      
      This results in a stack trace like the following, triggered by test case
      generic/561 from fstests:
      
        [ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
        [ 2779.979536]       Not tainted 5.6.0-rc2-btrfs-next-53 #1
        [ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 2779.990136] kworker/u8:6    D    0   247      2 0x80004000
        [ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
        [ 2779.990466] Call Trace:
        [ 2779.990491]  ? __schedule+0x384/0xa30
        [ 2779.990521]  schedule+0x33/0xe0
        [ 2779.990616]  btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
        [ 2779.990632]  ? remove_wait_queue+0x60/0x60
        [ 2779.990730]  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
        [ 2779.990782]  btrfs_search_slot+0x510/0x1000 [btrfs]
        [ 2779.990869]  btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
        [ 2779.990944]  __btrfs_drop_extents+0x161/0x1060 [btrfs]
        [ 2779.990987]  ? mark_held_locks+0x6d/0xc0
        [ 2779.990994]  ? __slab_alloc.isra.49+0x99/0x100
        [ 2779.991060]  ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
        [ 2779.991145]  insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
        [ 2779.991222]  ? start_transaction+0xdd/0x5c0 [btrfs]
        [ 2779.991291]  btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
        [ 2779.991405]  btrfs_work_helper+0xaa/0x720 [btrfs]
        [ 2779.991432]  process_one_work+0x26d/0x6a0
        [ 2779.991460]  worker_thread+0x4f/0x3e0
        [ 2779.991481]  ? process_one_work+0x6a0/0x6a0
        [ 2779.991489]  kthread+0x103/0x140
        [ 2779.991499]  ? kthread_create_worker_on_cpu+0x70/0x70
        [ 2779.991515]  ret_from_fork+0x3a/0x50
        (...)
        [ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
        [ 2780.027480]       Not tainted 5.6.0-rc2-btrfs-next-53 #1
        [ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 2780.030035] fsstress        D    0 17375  17373 0x00004000
        [ 2780.030038] Call Trace:
        [ 2780.030044]  ? __schedule+0x384/0xa30
        [ 2780.030052]  schedule+0x33/0xe0
        [ 2780.030075]  lock_extent_bits+0x20c/0x320 [btrfs]
        [ 2780.030094]  ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
        [ 2780.030098]  ? rcu_read_lock_sched_held+0x59/0xa0
        [ 2780.030102]  ? remove_wait_queue+0x60/0x60
        [ 2780.030122]  btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
        [ 2780.030151]  ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
        [ 2780.030165]  ? btrfs_search_slot+0x379/0x1000 [btrfs]
        [ 2780.030195]  btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
        [ 2780.030202]  ? do_raw_spin_unlock+0x49/0xc0
        [ 2780.030215]  ? btrfs_get_num_csums+0x10/0x10 [btrfs]
        [ 2780.030239]  btrfs_log_inode+0xf83/0x1124 [btrfs]
        [ 2780.030251]  ? __mutex_unlock_slowpath+0x45/0x2a0
        [ 2780.030275]  btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
        [ 2780.030282]  ? dget_parent+0xa1/0x370
        [ 2780.030309]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        [ 2780.030329]  btrfs_sync_file+0x3f3/0x490 [btrfs]
        [ 2780.030339]  do_fsync+0x38/0x60
        [ 2780.030343]  __x64_sys_fdatasync+0x13/0x20
        [ 2780.030345]  do_syscall_64+0x5c/0x280
        [ 2780.030348]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
        [ 2780.030361] Code: Bad RIP value.
        [ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
        [ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
        [ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
        [ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
        [ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
        [ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90
      
      So fix this by making btrfs_truncate_inode_items() not lock the range in
      the inode's iotree when the target root is a log root, since it's not
      needed to lock the range for log roots as the protection from the inode's
      lock and log_mutex are all that's needed.
      
      Fixes: 28553fa9 ("Btrfs: fix race between shrinking truncate and fiemap")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
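      The range arithmetic in the Btrfs deadlock diagram can be checked with
      a few lines of C. The helper names are hypothetical, and the 4 KiB
      sector size is an assumption inferred from the diagram, where
      2MiB + 1KiB is rounded up to 2MiB + 4KiB.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_KIB UINT64_C(1024)
#define MODEL_MIB (UINT64_C(1024) * MODEL_KIB)

/* Round x down to a power-of-two alignment. */
uint64_t model_round_down(uint64_t x, uint64_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return x & ~(align - 1);
}

/* Round x up to a power-of-two alignment. */
uint64_t model_round_up(uint64_t x, uint64_t align)
{
	return model_round_down(x + align - 1, align);
}

/* Do the half-open ranges [a_start, a_end) and [b_start, b_end) overlap? */
int model_ranges_overlap(uint64_t a_start, uint64_t a_end,
			 uint64_t b_start, uint64_t b_end)
{
	return a_start < b_end && b_start < a_end;
}
```

      For an i_size of 2MiB + 1KiB, the writeback range ends at
      2MiB + 4KiB and the truncate range starts at 2MiB, so the two ranges
      share [2MiB, 2MiB + 4KiB), which is the subrange both tasks in the
      diagram contend for.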
    • ext4: fix potential race between s_group_info online resizing and access · df3da4ea
      Suraj Jitindar Singh committed
      During an online resize an array of pointers to s_group_info gets replaced
      so it can get enlarged. If there is a concurrent access to the array in
      ext4_get_group_info() and this memory has been reused then this can lead to
      an invalid memory access.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
      Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.edu
      Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Balbir Singh <sblbir@amazon.com>
      Cc: stable@kernel.org
    • ext4: fix potential race between online resizing and write operations · 1d0c3924
      Theodore Ts'o committed
      During an online resize an array of pointers to buffer heads gets
      replaced so it can get enlarged.  If there is a racing block
      allocation or deallocation which uses the old array, and the old array
      has gotten reused this can lead to a GPF or some other random kernel
      memory getting modified.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
      Link: https://lore.kernel.org/r/20200221053458.730016-2-tytso@mit.edu
      Reported-by: Suraj Jitindar Singh <surajjs@amazon.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      1d0c3924
  7. 20 Feb, 2020 2 commits
    • S
      ext4: add cond_resched() to __ext4_find_entry() · 9424ef56
      Shijie Luo committed
      We hit a soft lockup problem in Linux 4.19 which can also be
      reproduced in Linux 5.x.
      
      When a directory inode takes up a large number of blocks and the
      directory keeps growing while we are searching it, the restart
      branch can be taken many times, and the do-while loop can hold the
      CPU for a long time.
      
      Here is the call trace in linux 4.19.
      
      [  473.756186] Call trace:
      [  473.756196]  dump_backtrace+0x0/0x198
      [  473.756199]  show_stack+0x24/0x30
      [  473.756205]  dump_stack+0xa4/0xcc
      [  473.756210]  watchdog_timer_fn+0x300/0x3e8
      [  473.756215]  __hrtimer_run_queues+0x114/0x358
      [  473.756217]  hrtimer_interrupt+0x104/0x2d8
      [  473.756222]  arch_timer_handler_virt+0x38/0x58
      [  473.756226]  handle_percpu_devid_irq+0x90/0x248
      [  473.756231]  generic_handle_irq+0x34/0x50
      [  473.756234]  __handle_domain_irq+0x68/0xc0
      [  473.756236]  gic_handle_irq+0x6c/0x150
      [  473.756238]  el1_irq+0xb8/0x140
      [  473.756286]  ext4_es_lookup_extent+0xdc/0x258 [ext4]
      [  473.756310]  ext4_map_blocks+0x64/0x5c0 [ext4]
      [  473.756333]  ext4_getblk+0x6c/0x1d0 [ext4]
      [  473.756356]  ext4_bread_batch+0x7c/0x1f8 [ext4]
      [  473.756379]  ext4_find_entry+0x124/0x3f8 [ext4]
      [  473.756402]  ext4_lookup+0x8c/0x258 [ext4]
      [  473.756407]  __lookup_hash+0x8c/0xe8
      [  473.756411]  filename_create+0xa0/0x170
      [  473.756413]  do_mkdirat+0x6c/0x140
      [  473.756415]  __arm64_sys_mkdirat+0x28/0x38
      [  473.756419]  el0_svc_common+0x78/0x130
      [  473.756421]  el0_svc_handler+0x38/0x78
      [  473.756423]  el0_svc+0x8/0xc
      [  485.755156] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [tmp:5149]
      
      Add cond_resched() to avoid the soft lockup and to improve system
      responsiveness.
      
      Link: https://lore.kernel.org/r/20200215080206.13293-1-luoshijie1@huawei.com
      Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      9424ef56
    • Q
      ext4: fix a data race in EXT4_I(inode)->i_disksize · 35df4299
      Qian Cai committed
      EXT4_I(inode)->i_disksize could be accessed concurrently as noticed by
      KCSAN,
      
       BUG: KCSAN: data-race in ext4_write_end [ext4] / ext4_writepages [ext4]
      
       write to 0xffff91c6713b00f8 of 8 bytes by task 49268 on cpu 127:
        ext4_write_end+0x4e3/0x750 [ext4]
        ext4_update_i_disksize at fs/ext4/ext4.h:3032
        (inlined by) ext4_update_inode_size at fs/ext4/ext4.h:3046
        (inlined by) ext4_write_end at fs/ext4/inode.c:1287
        generic_perform_write+0x208/0x2a0
        ext4_buffered_write_iter+0x11f/0x210 [ext4]
        ext4_file_write_iter+0xce/0x9e0 [ext4]
        new_sync_write+0x29c/0x3b0
        __vfs_write+0x92/0xa0
        vfs_write+0x103/0x260
        ksys_write+0x9d/0x130
        __x64_sys_write+0x4c/0x60
        do_syscall_64+0x91/0xb47
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
       read to 0xffff91c6713b00f8 of 8 bytes by task 24872 on cpu 37:
        ext4_writepages+0x10ac/0x1d00 [ext4]
        mpage_map_and_submit_extent at fs/ext4/inode.c:2468
        (inlined by) ext4_writepages at fs/ext4/inode.c:2772
        do_writepages+0x5e/0x130
        __writeback_single_inode+0xeb/0xb20
        writeback_sb_inodes+0x429/0x900
        __writeback_inodes_wb+0xc4/0x150
        wb_writeback+0x4bd/0x870
        wb_workfn+0x6b4/0x960
        process_one_work+0x54c/0xbe0
        worker_thread+0x80/0x650
        kthread+0x1e0/0x200
        ret_from_fork+0x27/0x50
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 37 PID: 24872 Comm: kworker/u261:2 Tainted: G        W  O L 5.5.0-next-20200204+ #5
       Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
       Workqueue: writeback wb_workfn (flush-7:0)
      
      Since only the read is lockless (it runs outside of "i_data_sem"),
      load tearing could introduce a logic bug. Fix it by adding
      READ_ONCE() for the read and WRITE_ONCE() for the write.
      
      Signed-off-by: Qian Cai <cai@lca.pw>
      Link: https://lore.kernel.org/r/1581085751-31793-1-git-send-email-cai@lca.pw
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      35df4299
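      The READ_ONCE()/WRITE_ONCE() pairing described in the commit above can
      be sketched in user space as follows. This is only an illustration
      under stated assumptions: the two macros are simplified stand-ins
      written for this example (they rely on the GCC/Clang __typeof__
      extension and are not the kernel's definitions), and demo_inode and
      the demo_* helpers are hypothetical names, not ext4 code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified stand-ins for the kernel macros (assumption: the real
 * definitions in the kernel headers do more checking). The volatile
 * access asks the compiler to emit a single access to the object
 * instead of splitting, repeating, or caching it.
 */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical miniature of the inode field discussed above. */
struct demo_inode {
	int64_t i_disksize;
};

/* Writer side: in the real code this runs with a lock held (omitted). */
static void demo_update_disksize(struct demo_inode *inode, int64_t newsize)
{
	if (newsize > READ_ONCE(inode->i_disksize))
		WRITE_ONCE(inode->i_disksize, newsize);
}

/* Lockless reader side: take one snapshot and use only that value. */
static int64_t demo_read_disksize(struct demo_inode *inode)
{
	return READ_ONCE(inode->i_disksize);
}
```

      The point of the sketch is the pairing: the lockless reader and the
      writer both go through the annotated accessors, so a data-race
      detector sees the accesses as intentional.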
  8. 19 Feb, 2020 9 commits
    • P
      io_uring: fix use-after-free by io_cleanup_req() · 929a3af9
      Pavel Begunkov committed
      io_cleanup_req() should be called before req->io is freed, so it
      must not run after __io_free_req() -> __io_req_aux_free(). It is
      also skipped entirely in io_free_req_many(), which uses
      __io_req_aux_free().
      
      Move the io_cleanup_req() call into __io_req_aux_free().
      
      Fixes: 99bc4c38 ("io_uring: fix iovec leaks")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      929a3af9
    • F
      Btrfs: fix btrfs_wait_ordered_range() so that it waits for all ordered extents · e75fd33b
      Filipe Manana committed
      In btrfs_wait_ordered_range() once we find an ordered extent that has
      finished with an error we exit the loop and don't wait for any other
      ordered extents that might be still in progress.
      
      All the users of btrfs_wait_ordered_range() expect that there are no more
      ordered extents in progress after that function returns. So past fixes
      such as the ones from the two following commits:
      
        ff612ba7 ("btrfs: fix panic during relocation after ENOSPC before
                         writeback happens")
      
        28aeeac1 ("Btrfs: fix panic when starting bg cache writeout after
                         IO error")
      
      don't work when there are multiple ordered extents in the range.
      
      Fix that by making btrfs_wait_ordered_range() wait for all ordered extents
      even after it finds one that had an error.
      
      Link: https://github.com/kdave/btrfs-progs/issues/228#issuecomment-569777554
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e75fd33b
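      The behaviour described in the commit above (keep waiting for every
      ordered extent even after one reports an error) can be sketched as a
      plain loop. This is a minimal illustration, not the btrfs code:
      demo_ordered and demo_wait_all() are hypothetical, and returning the
      first error seen is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an ordered extent. */
struct demo_ordered {
	int error;   /* 0 on success, negative errno-style value on failure */
	int waited;  /* set once we have waited for this extent */
};

/*
 * Wait for every extent in the range. An error is remembered but does
 * not end the loop, so no extent is left in progress on return.
 */
static int demo_wait_all(struct demo_ordered *ext, size_t n)
{
	int ret = 0;

	for (size_t i = 0; i < n; i++) {
		ext[i].waited = 1;  /* stands in for blocking until completion */
		if (ext[i].error && !ret)
			ret = ext[i].error;  /* record the error, keep looping */
	}
	return ret;
}
```

      The buggy shape would break out of the loop at the first error,
      leaving later extents unwaited.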
    • J
      btrfs: fix bytes_may_use underflow in prealloc error condition · b778cf96
      Josef Bacik committed
      I hit the following warning while running my error injection stress
      testing:
      
        WARNING: CPU: 3 PID: 1453 at fs/btrfs/space-info.h:108 btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
        RIP: 0010:btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
        Call Trace:
        btrfs_free_reserved_data_space+0x4f/0x70 [btrfs]
        __btrfs_prealloc_file_range+0x378/0x470 [btrfs]
        elfcorehdr_read+0x40/0x40
        ? elfcorehdr_read+0x40/0x40
        ? btrfs_commit_transaction+0xca/0xa50 [btrfs]
        ? dput+0xb4/0x2a0
        ? btrfs_log_dentry_safe+0x55/0x70 [btrfs]
        ? btrfs_sync_file+0x30e/0x420 [btrfs]
        ? do_fsync+0x38/0x70
        ? __x64_sys_fdatasync+0x13/0x20
        ? do_syscall_64+0x5b/0x1b0
        ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This happens if we fail to insert our reserved file extent.  At this
      point we've already converted our reservation from ->bytes_may_use to
      ->bytes_reserved.  However once we break we will attempt to free
      everything from [cur_offset, end] from ->bytes_may_use, but our extent
      reservation will overlap part of this.
      
      Fix this problem by adding ins.offset (our extent allocation size) to
      cur_offset so we remove the actual remaining part from ->bytes_may_use.
      
      I validated this fix using my inject-error.py script:
      
      python inject-error.py -o should_fail_bio -t cache_save_setup -t \
      	__btrfs_prealloc_file_range \
      	-t insert_reserved_file_extent.constprop.0 \
      	-r "-5" ./run-fsstress.sh
      
      where run-fsstress.sh simply mounts and runs fsstress on a disk.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b778cf96
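      The accounting problem in the commit above comes down to simple
      arithmetic, sketched here under stated assumptions: demo_space and the
      demo_* helpers are hypothetical, "end" is treated as an inclusive
      offset, and ext_len plays the role of ins.offset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical miniature of the two space_info counters involved. */
struct demo_space {
	uint64_t bytes_may_use;
	uint64_t bytes_reserved;
};

/* Allocating an extent converts len bytes from may_use to reserved. */
static void demo_reserve_extent(struct demo_space *s, uint64_t len)
{
	assert(s->bytes_may_use >= len);
	s->bytes_may_use -= len;
	s->bytes_reserved += len;
}

/*
 * Bytes of [cur_offset, end] still accounted in bytes_may_use after an
 * extent of ext_len bytes starting at cur_offset has been converted.
 * Freeing from cur_offset itself would over-free by ext_len.
 */
static uint64_t demo_remaining_may_use(uint64_t cur_offset, uint64_t end,
				       uint64_t ext_len)
{
	return end - (cur_offset + ext_len) + 1;
}
```

      With a 1 MiB range and a 256 KiB extent already converted, only
      768 KiB remains in bytes_may_use, so releasing the full 1 MiB would
      underflow the counter.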
    • J
      btrfs: handle logged extent failure properly · bd727173
      Josef Bacik committed
      If we're allocating a logged extent we attempt to insert an extent
      record for the file extent directly.  We increase
      space_info->bytes_reserved, because the extent entry addition will call
      btrfs_update_block_group(), which will convert the ->bytes_reserved to
      ->bytes_used.  However if we fail at any point while inserting the
      extent entry we will bail and leave space on ->bytes_reserved, which
      will trigger a WARN_ON() on umount.  Fix this by pinning the space if we
      fail to insert, which is what happens in every other failure case that
      involves adding the extent entry.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bd727173
    • J
      btrfs: do not check delayed items are empty for single transaction cleanup · 1e903151
      Josef Bacik committed
      btrfs_assert_delayed_root_empty() will check if the delayed root is
      completely empty, but this is a filesystem-wide check.  On cleanup we
      may have allowed other transactions to begin, for whatever reason, and
      thus the delayed root is not empty.
      
      So remove this check from cleanup_one_transaction().  This however can
      stay in btrfs_cleanup_transaction(), because it checks only after all of
      the transactions have been properly cleaned up, and thus is valid.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1e903151
    • J
      btrfs: reset fs_root to NULL on error in open_ctree · 315bf8ef
      Josef Bacik committed
      While running my error injection script I hit a panic when we tried to
      free the fs_root during cleanup.  This is because
      fs_info->fs_root == ERR_PTR(-EIO), which isn't great.  Fix this by
      setting fs_info->fs_root = NULL; if we fail to read the root.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      315bf8ef
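      The failure mode in the commit above (an error-encoded pointer left in
      a field that cleanup code only compares against NULL) can be sketched
      in user space. This is a hedged illustration: the demo_* helpers mimic
      the kernel's error-pointer convention but are written for this
      example, and demo_fs_info and demo_read_root_fail() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer (top of address space). */
static void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* True if the pointer value falls in the error-encoding range. */
static int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Recover the negative errno value from an error pointer. */
static long demo_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

struct demo_fs_info {
	void *fs_root;
};

/* Hypothetical root reader that always fails with -EIO. */
static void *demo_read_root_fail(void)
{
	return demo_err_ptr(-EIO);
}

static int demo_open(struct demo_fs_info *info)
{
	info->fs_root = demo_read_root_fail();
	if (demo_is_err(info->fs_root)) {
		int err = (int)demo_ptr_err(info->fs_root);

		/* Cleanup paths only test for NULL, so reset the field. */
		info->fs_root = NULL;
		return err;
	}
	return 0;
}
```

      Without the reset, a later "if (info->fs_root)" check would pass and
      the cleanup code would dereference the encoded error value.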
    • J
      btrfs: destroy qgroup extent records on transaction abort · 81f7eb00
      Jeff Mahoney committed
      We clean up the delayed references when we abort a transaction but we
      leave the pending qgroup extent records behind, leaking memory.
      
      This patch destroys the extent records when we destroy the delayed refs
      and makes sure they're gone before releasing the transaction.
      
      Fixes: 3368d001 ("btrfs: qgroup: Record possible quota-related extent for qgroup.")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      [ Rebased to latest upstream, remove to_qgroup() helper, use
        rbtree_postorder_for_each_entry_safe() wrapper ]
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      81f7eb00
    • L
      pipe: make sure to wake up everybody when the last reader/writer closes · 6551d5c5
      Linus Torvalds committed
      Andrei Vagin reported that commit 0ddad21d ("pipe: use exclusive
      waits when reading or writing") broke one of the CRIU tests.  He even
      has a trivial reproducer:
      
          #include <unistd.h>
          #include <sys/types.h>
          #include <sys/wait.h>
      
          int main()
          {
                  int p[2];
                  pid_t p1, p2;
                  int status;
      
                  if (pipe(p) == -1)
                          return 1;
      
                  p1 = fork();
                  if (p1 == 0) {
                          close(p[1]);
                          read(p[0], &status, sizeof(status));
                          return 0;
                  }
                  p2 = fork();
                  if (p2 == 0) {
                          close(p[1]);
                          read(p[0], &status, sizeof(status));
                          return 0;
                  }
                  sleep(1);
                  close(p[1]);
                  wait(&status);
                  wait(&status);
      
                  return 0;
          }
      
      and the problem - once he points it out - is obvious.  We use these nice
      exclusive waits, but when the last writer goes away, it then needs to
      wake up _every_ reader (and conversely, the last reader disappearing
      needs to wake every writer, of course).
      
      In fact, when going through this, we had several small oddities around
      how to wake things.  We did in fact wake every reader when we changed
      the size of the pipe buffers.  But that's entirely pointless, since that
      just acts as a possible source of new space - no new data to read.
      
      And when we change the size of the buffer, we don't need to wake all
      writers even when we add space - that case acts just as if somebody made
      space by reading, and any writer that finds itself not filling it up
      entirely will wake the next one.
      
      On the other hand, on the exit path, we tried to limit the wakeups with
      the proper poll keys etc, which is entirely pointless, because at that
      point we obviously need to wake up everybody.  So don't do that: just
      wake up everybody - but only do that if the counts changed to zero.
      
      So fix those non-IO wakeups to be more proper: space change doesn't add
      any new data, but it might make room for writers, so it wakes up a
      writer.  And the actual changes to reader/writer counts should wake up
      everybody, since everybody is affected (ie readers will all see EOF if
      the writers have gone away, and writers will all get EPIPE if all
      readers have gone away).
      
      Fixes: 0ddad21d ("pipe: use exclusive waits when reading or writing")
      Reported-and-tested-by: Andrei Vagin <avagin@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6551d5c5
    • D
      io_uring: remove unnecessary NULL checks · 297a31e3
      Dan Carpenter committed
      The "kmsg" pointer can't be NULL and we have already dereferenced it so
      a check here would be useless.
      
      Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      297a31e3
  9. 17 Feb, 2020 2 commits
    • J
      btrfs: don't set path->leave_spinning for truncate · 52e29e33
      Josef Bacik committed
      The only time we actually leave the path spinning is if we're truncating
      a small amount and don't actually free an extent, which is not a common
      occurrence.  We have to set the path blocking in order to add the
      delayed ref anyway, so the first extent we find we set the path to
      blocking and stay blocking for the duration of the operation.  With the
      upcoming file extent map stuff there will be another case that we have
      to have the path blocking, so just swap to blocking always.
      
      Note: this patch also fixes a warning after 28553fa9 ("Btrfs: fix
      race between shrinking truncate and fiemap") got merged that inserts
      extent locks around truncation so the path must not leave spinning locks
      after btrfs_search_slot.
      
        [70.794783] BUG: sleeping function called from invalid context at mm/slab.h:565
        [70.794834] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1141, name: rsync
        [70.794863] 5 locks held by rsync/1141:
        [70.794876]  #0: ffff888417b9c408 (sb_writers#17){.+.+}, at: mnt_want_write+0x20/0x50
        [70.795030]  #1: ffff888428de28e8 (&type->i_mutex_dir_key#13/1){+.+.}, at: lock_rename+0xf1/0x100
        [70.795051]  #2: ffff888417b9c608 (sb_internal#2){.+.+}, at: start_transaction+0x394/0x560
        [70.795124]  #3: ffff888403081768 (btrfs-fs-01){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
        [70.795203]  #4: ffff888403086568 (btrfs-fs-00){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
        [70.795222] CPU: 5 PID: 1141 Comm: rsync Not tainted 5.6.0-rc2-backup+ #2
        [70.795362] Call Trace:
        [70.795374]  dump_stack+0x71/0xa0
        [70.795445]  ___might_sleep.part.96.cold.106+0xa6/0xb6
        [70.795459]  kmem_cache_alloc+0x1d3/0x290
        [70.795471]  alloc_extent_state+0x22/0x1c0
        [70.795544]  __clear_extent_bit+0x3ba/0x580
        [70.795557]  ? _raw_spin_unlock_irq+0x24/0x30
        [70.795569]  btrfs_truncate_inode_items+0x339/0xe50
        [70.795647]  btrfs_evict_inode+0x269/0x540
        [70.795659]  ? dput.part.38+0x29/0x460
        [70.795671]  evict+0xcd/0x190
        [70.795682]  __dentry_kill+0xd6/0x180
        [70.795754]  dput.part.38+0x2ad/0x460
        [70.795765]  do_renameat2+0x3cb/0x540
        [70.795777]  __x64_sys_rename+0x1c/0x20
      
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Fixes: 28553fa9 ("Btrfs: fix race between shrinking truncate and fiemap")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      52e29e33
    • P
      io_uring: add missing io_req_cancelled() · 7fbeb95d
      Pavel Begunkov committed
      fallocate_finish() is missing a cancellation check. Add it.
      This is safe, as only flag setup and sqe field copying are done
      before it gets into __io_fallocate().
      
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7fbeb95d
  10. 15 Feb, 2020 5 commits
  11. 14 Feb, 2020 4 commits