1. 03 Mar, 2020 10 commits
  2. 02 Mar, 2020 1 commit
  3. 01 Mar, 2020 2 commits
    • ext4: potential crash on allocation error in ext4_alloc_flex_bg_array() · 37b0b6b8
      Committed by Dan Carpenter
      If sbi->s_flex_groups_allocated is zero and the first allocation fails
      then this code will crash.  The problem is that "i--" will set "i" to
      -1 but when we compare "i >= sbi->s_flex_groups_allocated" then the -1
      is type promoted to unsigned and becomes UINT_MAX.  Since UINT_MAX
      is more than zero, the condition is true so we call kvfree(new_groups[-1]).
      The loop will carry on freeing invalid memory until it crashes.
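
The conversion described above can be reproduced in a few lines of userspace C. This is only a sketch: the function names are hypothetical, and the count-up loop is one way to avoid the problem, not necessarily what the actual patch does.

```c
#include <assert.h>

/*
 * Models the unwind step "i--; if (i >= allocated) free(new_groups[i])".
 * Returns nonzero when the loop body would run for the decremented index.
 */
int demo_buggy_loop_enters(int failed_index, unsigned int allocated)
{
    int i = failed_index;

    i--;                    /* step back from the slot that failed */
    /* 'i' is converted to unsigned int here, so -1 compares as UINT_MAX */
    return i >= allocated;
}

/*
 * Count-up unwind: visits only the slots in [allocated, failed_index),
 * so the index can never wrap below zero. Returns the number of slots
 * that would be freed.
 */
unsigned int demo_forward_unwind_count(unsigned int failed_index,
                                       unsigned int allocated)
{
    unsigned int j, freed = 0;

    assert(failed_index >= allocated);
    for (j = allocated; j < failed_index; j++)
        freed++;            /* kvfree(new_groups[j]) would go here */
    return freed;
}
```

With both values at zero, demo_buggy_loop_enters() reports that the loop body would run for index -1, while demo_forward_unwind_count() visits no slot at all.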
      
      Fixes: 7c990728 ("ext4: fix potential race between s_flex_groups online resizing and access")
      Reviewed-by: Suraj Jitindar Singh <surajjs@amazon.com>
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: stable@kernel.org
      Link: https://lore.kernel.org/r/20200228092142.7irbc44yaz3by7nb@kili.mountain
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • jbd2: fix data races at struct journal_head · 6c5d9112
      Committed by Qian Cai
      journal_head::b_transaction and journal_head::b_next_transaction could
      be accessed concurrently as noticed by KCSAN,
      
       LTP: starting fsync04
       /dev/zero: Can't open blockdev
       EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem
       EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
       ==================================================================
       BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2]
      
       write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70:
        __jbd2_journal_refile_buffer+0xdd/0x210 [jbd2]
        __jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569
        jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2]
        (inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034
        kjournald2+0x13b/0x450 [jbd2]
        kthread+0x1cd/0x1f0
        ret_from_fork+0x27/0x50
      
       read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68:
        jbd2_write_access_granted+0x1b2/0x250 [jbd2]
        jbd2_write_access_granted at fs/jbd2/transaction.c:1155
        jbd2_journal_get_write_access+0x2c/0x60 [jbd2]
        __ext4_journal_get_write_access+0x50/0x90 [ext4]
        ext4_mb_mark_diskspace_used+0x158/0x620 [ext4]
        ext4_mb_new_blocks+0x54f/0xca0 [ext4]
        ext4_ind_map_blocks+0xc79/0x1b40 [ext4]
        ext4_map_blocks+0x3b4/0x950 [ext4]
        _ext4_get_block+0xfc/0x270 [ext4]
        ext4_get_block+0x3b/0x50 [ext4]
        __block_write_begin_int+0x22e/0xae0
        __block_write_begin+0x39/0x50
        ext4_write_begin+0x388/0xb50 [ext4]
        generic_perform_write+0x15d/0x290
        ext4_buffered_write_iter+0x11f/0x210 [ext4]
        ext4_file_write_iter+0xce/0x9e0 [ext4]
        new_sync_write+0x29c/0x3b0
        __vfs_write+0x92/0xa0
        vfs_write+0x103/0x260
        ksys_write+0x9d/0x130
        __x64_sys_write+0x4c/0x60
        do_syscall_64+0x91/0xb05
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
       5 locks held by fsync04/25724:
        #0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260
        #1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4]
        #2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2]
        #3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4]
        #4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2]
       irq event stamp: 1407125
       hardirqs last  enabled at (1407125): [<ffffffff980da9b7>] __find_get_block+0x107/0x790
       hardirqs last disabled at (1407124): [<ffffffff980da8f9>] __find_get_block+0x49/0x790
       softirqs last  enabled at (1405528): [<ffffffff98a0034c>] __do_softirq+0x34c/0x57c
       softirqs last disabled at (1405521): [<ffffffff97cc67a2>] irq_exit+0xa2/0xc0
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7
       Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
      
      The plain reads are outside of the jh->b_state_lock critical section,
      which results in data races. Fix them by adding pairs of
      READ_ONCE()/WRITE_ONCE().
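
The pairing of marked reads and writes can be sketched in userspace as follows. The DEMO_ macros are simplified stand-ins built on a volatile access and the GCC/Clang __typeof__ extension; the kernel's real READ_ONCE()/WRITE_ONCE() are more elaborate. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins: force exactly one access through a volatile lvalue. */
#define DEMO_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define DEMO_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct demo_journal_head {
    void *b_transaction;
    void *b_next_transaction;
};

/* Writer side: in the kernel this would run under the state lock. */
void demo_refile(struct demo_journal_head *jh)
{
    assert(jh != NULL);
    DEMO_WRITE_ONCE(jh->b_transaction, jh->b_next_transaction);
    DEMO_WRITE_ONCE(jh->b_next_transaction, NULL);
}

/* Lockless reader: each field is loaded once with a marked read. */
int demo_belongs_to(struct demo_journal_head *jh, void *transaction)
{
    assert(jh != NULL);
    return DEMO_READ_ONCE(jh->b_transaction) == transaction ||
           DEMO_READ_ONCE(jh->b_next_transaction) == transaction;
}
```

The marked accesses do not add ordering or locking; they only stop the compiler from tearing, fusing, or repeating the loads and stores, which is what a data-race detector such as KCSAN expects for intentionally lockless accesses.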

      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Qian Cai <cai@lca.pw>
      Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pw
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  4. 28 Feb, 2020 1 commit
  5. 27 Feb, 2020 2 commits
    • io_uring: define and set show_fdinfo only if procfs is enabled · bebdb65e
      Committed by Tobias Klauser
      Follow the pattern used with other *_show_fdinfo functions and only
      define and use io_uring_show_fdinfo and its helper functions if
      CONFIG_PROC_FS is set.
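
The pattern can be illustrated with a small stand-alone sketch. Everything prefixed with demo_ or DEMO_ is hypothetical: DEMO_CONFIG_PROC_FS stands in for CONFIG_PROC_FS and the struct is a two-member stand-in for struct file_operations.

```c
#include <assert.h>
#include <stddef.h>

/* Define this to model a build with procfs support. */
/* #define DEMO_CONFIG_PROC_FS 1 */

struct demo_file_operations {
    int (*release)(void);
    int (*show_fdinfo)(void);   /* optional hook, NULL when not provided */
};

int demo_release(void)
{
    return 0;
}

#ifdef DEMO_CONFIG_PROC_FS
/* Compiled only when something can actually call the hook. */
int demo_show_fdinfo(void)
{
    return 1;
}
#endif

const struct demo_file_operations demo_fops = {
    .release = demo_release,
#ifdef DEMO_CONFIG_PROC_FS
    .show_fdinfo = demo_show_fdinfo,
#endif
};

/* Caller side: the hook is simply absent when the option is off. */
int demo_call_show_fdinfo(const struct demo_file_operations *fops)
{
    assert(fops != NULL);
    if (!fops->show_fdinfo)
        return -1;
    return fops->show_fdinfo();
}
```

Guarding both the definition and the initializer keeps the unused function and its helpers out of builds that cannot reach them.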
      
      Fixes: 87ce955b ("io_uring: add ->show_fdinfo() for the io_uring file descriptor")
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: drop file set ref put/get on switch · dd3db2a3
      Committed by Jens Axboe
      Dan reports that he triggered a warning on ring exit doing some testing:
      
      percpu ref (io_file_data_ref_zero) <= 0 (0) after switching to atomic
      WARNING: CPU: 3 PID: 0 at lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0xe8/0xf0
      Modules linked in:
      CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc3+ #5648
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      RIP: 0010:percpu_ref_switch_to_atomic_rcu+0xe8/0xf0
      Code: e7 ff 55 e8 eb d2 80 3d bd 02 d2 00 00 75 8b 48 8b 55 d8 48 c7 c7 e8 70 e6 81 c6 05 a9 02 d2 00 01 48 8b 75 e8 e8 3a d0 c5 ff <0f> 0b e9 69 ff ff ff 90 55 48 89 fd 53 48 89 f3 48 83 ec 28 48 83
      RSP: 0018:ffffc90000110ef8 EFLAGS: 00010292
      RAX: 0000000000000045 RBX: 7fffffffffffffff RCX: 0000000000000000
      RDX: 0000000000000045 RSI: ffffffff825be7a5 RDI: ffffffff825bc32c
      RBP: ffff8881b75eac38 R08: 000000042364b941 R09: 0000000000000045
      R10: ffffffff825beb40 R11: ffffffff825be78a R12: 0000607e46005aa0
      R13: ffff888107dcdd00 R14: 0000000000000000 R15: 0000000000000009
      FS:  0000000000000000(0000) GS:ffff8881b9d80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f49e6a5ea20 CR3: 00000001b747c004 CR4: 00000000001606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <IRQ>
       rcu_core+0x1e4/0x4d0
       __do_softirq+0xdb/0x2f1
       irq_exit+0xa0/0xb0
       smp_apic_timer_interrupt+0x60/0x140
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      RIP: 0010:default_idle+0x23/0x170
      Code: ff eb ab cc cc cc cc 0f 1f 44 00 00 41 54 55 53 65 8b 2d 10 96 92 7e 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 21 d0 51 00 fb f4 <65> 8b 2d f6 95 92 7e 0f 1f 44 00 00 5b 5d 41 5c c3 65 8b 05 e5 95
      
      Turns out that this is due to percpu_ref_switch_to_atomic() only
      grabbing a reference to the percpu refcount if it's not already in
      atomic mode. io_uring drops a ref and re-gets it when switching back to
      percpu mode. We attempt to protect against this with the FFD_F_ATOMIC
      bit, but that isn't reliable.
      
      We don't actually need to juggle these refcounts between atomic and
      percpu switch, we can just do them when we've switched to atomic mode.
      This removes the need for FFD_F_ATOMIC, which wasn't reliable.
      
      Fixes: 05f3fb3c ("io_uring: avoid ring quiesce for fixed file set unregister and update")
      Reported-by: Dan Melnic <dmm@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 26 Feb, 2020 5 commits
  7. 25 Feb, 2020 2 commits
    • io-wq: remove spin-for-work optimization · 3030fd4c
      Committed by Jens Axboe
      Andres reports that buffered IO seems to suck up more cycles than we
      would like, and he narrowed it down to the fact that the io-wq workers
      will briefly spin for more work on completion of a work item. This was
      a win on the networking side, but apparently some other cases take a
      hit because of it. Remove the optimization to avoid burning more CPU
      than we have to for disk IO.

      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL · bdcd3eab
      Committed by Xiaoguang Wang
      After making ext4 support the iopoll method, i.e. letting
      ext4_file_operations's iopoll method be iomap_dio_iopoll(), we found
      that fio can easily hang in fio_ioring_getevents() with the fio job
      below, which enables both IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL:

          rm -f testfile; sync;
          sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread \
              -rw=write -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 \
              -bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting
      
      Two issues result in this hang. The first is that when
      IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are both enabled, fio
      does not use io_uring_enter() to get completed events; it relies on
      the kernel's io_sq_thread to poll for completed events.

      The second is a race: when io_submit_sqes() in io_sq_thread()
      submits a batch of sqes, the variable 'inflight' records the number
      of submitted reqs, and io_sq_thread then polls for reqs that have
      been added to poll_list. But if some previous reqs have been punted
      to an io worker, those reqs will not be in poll_list in time.
      io_sq_thread() will only poll for part of the previously submitted
      reqs, then find poll_list empty and reset 'inflight' to zero. If the
      app just waits for these deferred reqs and does not wake up
      io_sq_thread again, a hang happens.

      For an app that relies entirely on io_sq_thread to poll completed
      requests, let io_iopoll_req_issued() wake up io_sq_thread properly
      when adding a new element to poll_list, and when io_sq_thread
      prepares to sleep, check again whether poll_list is empty; if it is
      not, continue to poll.

      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 24 Feb, 2020 2 commits
  9. 22 Feb, 2020 7 commits
    • io_uring: fix __io_iopoll_check deadlock in io_sq_thread · c7849be9
      Committed by Xiaoguang Wang
      Since commit a3a0e43f ("io_uring: don't enter poll loop if we have
      CQEs pending"), if we already have events pending, we won't enter the
      poll loop. When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if the
      app has been terminated without reaping pending events that are
      already in the cq ring, and there are some reqs in poll_list,
      io_sq_thread will enter __io_iopoll_check(), find pending events, and
      return, so this loop never has a chance to exit.

      I have seen this issue in fio stress tests. To fix it, let
      io_sq_thread call io_iopoll_getevents() with argument 'min' being zero,
      and remove __io_iopoll_check().
      
      Fixes: a3a0e43f ("io_uring: don't enter poll loop if we have CQEs pending")
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • ext4: fix mount failure with quota configured as module · 9db176bc
      Committed by Jan Kara
      When CONFIG_QFMT_V2 is configured as a module, the test in
      ext4_feature_set_ok() fails and so mount of filesystems with quota or
      project features fails. Fix the test to use IS_ENABLED macro which
      works properly even for modules.
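
Why a plain #ifdef misses the modular case can be shown with a reduced model. Kconfig defines CONFIG_FOO for a built-in option and CONFIG_FOO_MODULE for a modular one, so only a test that checks both sees "=m". The DEMO_ macros below are a simplified re-creation for illustration, not the kernel's own definitions, and DEMO_CONFIG_QFMT_V2 is a hypothetical stand-in for the real option.

```c
#include <assert.h>

/* Model "option=m": only the _MODULE variant is defined. */
#define DEMO_CONFIG_QFMT_V2_MODULE 1

/* Expands to 1 if the argument is a macro defined to 1, else to 0. */
#define DEMO_ARG_PLACEHOLDER_1 0,
#define DEMO_TAKE_SECOND(ignored, val, ...) val
#define DEMO_IS_DEFINED(x)     DEMO_IS_DEFINED_1(x)
#define DEMO_IS_DEFINED_1(v)   DEMO_IS_DEFINED_2(DEMO_ARG_PLACEHOLDER_##v)
#define DEMO_IS_DEFINED_2(a)   DEMO_TAKE_SECOND(a 1, 0, 0)

#define DEMO_IS_BUILTIN(opt)   DEMO_IS_DEFINED(opt)
#define DEMO_IS_MODULE(opt)    DEMO_IS_DEFINED(opt##_MODULE)
#define DEMO_IS_ENABLED(opt)   (DEMO_IS_BUILTIN(opt) || DEMO_IS_MODULE(opt))

/* Test written with #ifdef: blind to the modular build. */
int demo_quota_ok_ifdef(void)
{
#ifdef DEMO_CONFIG_QFMT_V2
    return 1;
#else
    return 0;   /* taken here: only DEMO_CONFIG_QFMT_V2_MODULE exists */
#endif
}

/* Test written with the IS_ENABLED-style macro: sees built-in or module. */
int demo_quota_ok_is_enabled(void)
{
    /* sanity check: an option that is defined nowhere reads as 0 */
    assert(DEMO_IS_ENABLED(DEMO_CONFIG_NOT_SET_ANYWHERE) == 0);
    return DEMO_IS_ENABLED(DEMO_CONFIG_QFMT_V2);
}
```

In this model the #ifdef version reports the feature as missing while the IS_ENABLED-style version reports it as available, which mirrors the mount failure described above.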
      
      Link: https://lore.kernel.org/r/20200221100835.9332-1-jack@suse.cz
      Fixes: d65d87a0 ("ext4: improve explanation of a mount failure caused by a misconfigured kernel")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
    • jbd2: fix ocfs2 corrupt when clearing block group bits · 8eedabfd
      Committed by wangyan
      I found a NULL pointer dereference in ocfs2_block_group_clear_bits().
      The running environment:
      	kernel version: 4.19
      	A cluster with two nodes and 5 LUNs mounted on both nodes, running
      	file operations such as dd/fallocate/truncate/rm on every LUN while
      	the storage network is disconnected.
      
      The fallocate operation on dm-23-45 caused a NULL pointer dereference.
      
      The details of the NULL pointer dereference are as follows:
      	[577992.878282] JBD2: Error -5 detected when updating journal superblock for dm-23-45.
      	[577992.878290] Aborting journal on device dm-23-45.
      	...
      	[577992.890778] JBD2: Error -5 detected when updating journal superblock for dm-24-46.
      	[577992.890908] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890916] (fallocate,88392,52):ocfs2_extend_trans:474 ERROR: status = -30
      	[577992.890918] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890920] (fallocate,88392,52):ocfs2_rotate_tree_right:2500 ERROR: status = -30
      	[577992.890922] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890924] (fallocate,88392,52):ocfs2_do_insert_extent:4382 ERROR: status = -30
      	[577992.890928] (fallocate,88392,52):ocfs2_insert_extent:4842 ERROR: status = -30
      	[577992.890928] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890930] (fallocate,88392,52):ocfs2_add_clusters_in_btree:4947 ERROR: status = -30
      	[577992.890933] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890939] __journal_remove_journal_head: freeing b_committed_data
      	[577992.890949] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
      	[577992.890950] Mem abort info:
      	[577992.890951]   ESR = 0x96000004
      	[577992.890952]   Exception class = DABT (current EL), IL = 32 bits
      	[577992.890952]   SET = 0, FnV = 0
      	[577992.890953]   EA = 0, S1PTW = 0
      	[577992.890954] Data abort info:
      	[577992.890955]   ISV = 0, ISS = 0x00000004
      	[577992.890956]   CM = 0, WnR = 0
      	[577992.890958] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000f8da07a9
      	[577992.890960] [0000000000000020] pgd=0000000000000000
      	[577992.890964] Internal error: Oops: 96000004 [#1] SMP
      	[577992.890965] Process fallocate (pid: 88392, stack limit = 0x00000000013db2fd)
      	[577992.890968] CPU: 52 PID: 88392 Comm: fallocate Kdump: loaded Tainted: G        W  OE     4.19.36 #1
      	[577992.890969] Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019
      	[577992.890971] pstate: 60400009 (nZCv daif +PAN -UAO)
      	[577992.891054] pc : _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2]
      	[577992.891082] lr : _ocfs2_free_suballoc_bits+0x618/0x968 [ocfs2]
      	[577992.891084] sp : ffff0000c8e2b810
      	[577992.891085] x29: ffff0000c8e2b820 x28: 0000000000000000
      	[577992.891087] x27: 00000000000006f3 x26: ffffa07957b02e70
      	[577992.891089] x25: ffff807c59d50000 x24: 00000000000006f2
      	[577992.891091] x23: 0000000000000001 x22: ffff807bd39abc30
      	[577992.891093] x21: ffff0000811d9000 x20: ffffa07535d6a000
      	[577992.891097] x19: ffff000001681638 x18: ffffffffffffffff
      	[577992.891098] x17: 0000000000000000 x16: ffff000080a03df0
      	[577992.891100] x15: ffff0000811d9708 x14: 203d207375746174
      	[577992.891101] x13: 73203a524f525245 x12: 20373439343a6565
      	[577992.891103] x11: 0000000000000038 x10: 0101010101010101
      	[577992.891106] x9 : ffffa07c68a85d70 x8 : 7f7f7f7f7f7f7f7f
      	[577992.891109] x7 : 0000000000000000 x6 : 0000000000000080
      	[577992.891110] x5 : 0000000000000000 x4 : 0000000000000002
      	[577992.891112] x3 : ffff000001713390 x2 : 2ff90f88b1c22f00
      	[577992.891114] x1 : ffff807bd39abc30 x0 : 0000000000000000
      	[577992.891116] Call trace:
      	[577992.891139]  _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2]
      	[577992.891162]  _ocfs2_free_clusters+0x100/0x290 [ocfs2]
      	[577992.891185]  ocfs2_free_clusters+0x50/0x68 [ocfs2]
      	[577992.891206]  ocfs2_add_clusters_in_btree+0x198/0x5e0 [ocfs2]
      	[577992.891227]  ocfs2_add_inode_data+0x94/0xc8 [ocfs2]
      	[577992.891248]  ocfs2_extend_allocation+0x1bc/0x7a8 [ocfs2]
      	[577992.891269]  ocfs2_allocate_extents+0x14c/0x338 [ocfs2]
      	[577992.891290]  __ocfs2_change_file_space+0x3f8/0x610 [ocfs2]
      	[577992.891309]  ocfs2_fallocate+0xe4/0x128 [ocfs2]
      	[577992.891316]  vfs_fallocate+0x11c/0x250
      	[577992.891317]  ksys_fallocate+0x54/0x88
      	[577992.891319]  __arm64_sys_fallocate+0x28/0x38
      	[577992.891323]  el0_svc_common+0x78/0x130
      	[577992.891325]  el0_svc_handler+0x38/0x78
      	[577992.891327]  el0_svc+0x8/0xc
      
      My analysis is as follows:
      ocfs2_fallocate
        __ocfs2_change_file_space
          ocfs2_allocate_extents
            ocfs2_extend_allocation
              ocfs2_add_inode_data
                ocfs2_add_clusters_in_btree
                  ocfs2_insert_extent
                    ocfs2_do_insert_extent
                      ocfs2_rotate_tree_right
                        ocfs2_extend_rotate_transaction
                          ocfs2_extend_trans
                            jbd2_journal_restart
                              jbd2__journal_restart
                                /* handle->h_transaction is NULL,
                                 * is_handle_aborted(handle) is true
                                 */
                                handle->h_transaction = NULL;
                                start_this_handle
                                  return -EROFS;
                  ocfs2_free_clusters
                    _ocfs2_free_clusters
                      _ocfs2_free_suballoc_bits
                        ocfs2_block_group_clear_bits
                          ocfs2_journal_access_gd
                            __ocfs2_journal_access
                              jbd2_journal_get_undo_access
                                /* I think jbd2_write_access_granted() will
                                 * return true, because do_get_write_access()
                                 * will return -EROFS.
                                 */
                                if (jbd2_write_access_granted(...)) return 0;
                                do_get_write_access
                                  /* handle->h_transaction is NULL, it will
                                   * return -EROFS here, so do_get_write_access()
                                   * was not called.
                                   */
                                  if (is_handle_aborted(handle)) return -EROFS;
                          /* bh2jh(group_bh) is NULL, caused NULL
                             pointer dereference */
                          undo_bg = (struct ocfs2_group_desc *)
                                      bh2jh(group_bh)->b_committed_data;
      
      If handle->h_transaction == NULL, then jbd2_write_access_granted()
      does not really guarantee that the journal_head will stay around,
      let alone its b_committed_data. The bh2jh(group_bh) can be removed
      after ocfs2_journal_access_gd() and before the access to
      "bh2jh(group_bh)->b_committed_data". So we should move the
      is_handle_aborted() check from do_get_write_access() into
      jbd2_journal_get_undo_access() and jbd2_journal_get_write_access(),
      before the call to jbd2_write_access_granted().
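
The effect of reordering the two checks can be modelled in isolation. In this sketch the demo_ names are hypothetical, the abort test is a simplification of is_handle_aborted(), and the access_granted flag stands in for the result of jbd2_write_access_granted().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_handle {
    void *h_transaction;    /* NULL after a failed restart */
    int h_aborted;
};

/* Simplified abort test: an explicit flag or a missing transaction. */
int demo_is_handle_aborted(const struct demo_handle *h)
{
    assert(h != NULL);
    return h->h_aborted || h->h_transaction == NULL;
}

/* Old ordering: the fast path returns before the abort check is reached. */
int demo_get_access_old(const struct demo_handle *h, int access_granted)
{
    if (access_granted)
        return 0;
    if (demo_is_handle_aborted(h))
        return -EROFS;
    return 0;               /* slow path would do the real work here */
}

/* New ordering: an aborted handle fails before the fast path is tried. */
int demo_get_access_new(const struct demo_handle *h, int access_granted)
{
    if (demo_is_handle_aborted(h))
        return -EROFS;
    if (access_granted)
        return 0;
    return 0;               /* slow path would do the real work here */
}
```

With an aborted handle and the fast path reporting success, the old ordering returns 0 and lets the caller go on to dereference the journal head, while the new ordering returns -EROFS first.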
      
      Link: https://lore.kernel.org/r/f72a623f-b3f1-381a-d91d-d22a1c83a336@huawei.com
      Signed-off-by: Yan Wang <wangyan122@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@kernel.org
    • ext4: fix race between writepages and enabling EXT4_EXTENTS_FL · cb85f4d2
      Committed by Eric Biggers
      If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running
      on it, the following warning in ext4_add_complete_io() can be hit:
      
      WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120
      
      Here's a minimal reproducer (not 100% reliable) (root isn't required):
      
              while true; do
                      sync
              done &
              while true; do
                      rm -f file
                      touch file
                      chattr -e file
                      echo X >> file
                      chattr +e file
              done
      
      The problem is that in ext4_writepages(), ext4_should_dioread_nolock()
      (which only returns true on extent-based files) is checked once to set
      the number of reserved journal credits, and also again later to select
      the flags for ext4_map_blocks() and copy the reserved journal handle to
      ext4_io_end::handle.  But if EXT4_EXTENTS_FL is being concurrently set,
      the first check can see dioread_nolock disabled while the later one can
      see it enabled, causing the reserved handle to unexpectedly be NULL.
      
      Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races
      related to doing so as well, fix this by synchronizing changing
      EXT4_EXTENTS_FL with ext4_writepages() via the existing
      s_writepages_rwsem (previously called s_journal_flag_rwsem).
      
      This was originally reported by syzbot without a reproducer at
      https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf,
      but now that dioread_nolock is the default I also started seeing this
      when running syzkaller locally.
      
      Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org
      Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com
      Fixes: 6b523df4 ("ext4: use transaction reservation for extent conversion in ext4_end_io")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@kernel.org
    • ext4: rename s_journal_flag_rwsem to s_writepages_rwsem · bbd55937
      Committed by Eric Biggers
      In preparation for making s_journal_flag_rwsem synchronize
      ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA
      flags (rather than just JOURNAL_DATA as it does currently), rename it to
      s_writepages_rwsem.
      
      Link: https://lore.kernel.org/r/20200219183047.47417-2-ebiggers@kernel.org
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@kernel.org
    • ext4: fix potential race between s_flex_groups online resizing and access · 7c990728
      Committed by Suraj Jitindar Singh
      During an online resize an array of s_flex_groups structures gets replaced
      so it can get enlarged. If there is a concurrent access to the array and
      this memory has been reused then this can lead to an invalid memory access.
      
      The s_flex_group array has been converted into an array of pointers rather
      than an array of structures. This is to ensure that the information
      contained in the structures cannot get out of sync during a resize due to
      an accessor updating the value in the old structure after it has been
      copied but before the array pointer is updated. Since the structures
      themselves are no longer copied but only the pointers to them, this
      case is mitigated.
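
The benefit of an array of pointers during a resize can be sketched in userspace. The names below are hypothetical and the model is deliberately small: it copies only the pointers into a larger array, so an element reached through the old array is the same object as the one reached through the new array. How and when the old pointer array is released (the kernel has to wait for concurrent readers) is outside this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct demo_flex_group {
    long free_clusters;
};

/*
 * Build a pointer array of new_n entries. The first old_n entries reuse
 * the existing elements; the rest are freshly allocated and zeroed.
 * Returns NULL on allocation failure, leaving 'old' untouched.
 */
struct demo_flex_group **demo_grow_groups(struct demo_flex_group **old,
                                          size_t old_n, size_t new_n)
{
    struct demo_flex_group **new_groups;
    size_t i;

    assert(new_n >= old_n);
    new_groups = calloc(new_n, sizeof(*new_groups));
    if (!new_groups)
        return NULL;

    for (i = old_n; i < new_n; i++) {
        new_groups[i] = calloc(1, sizeof(**new_groups));
        if (!new_groups[i]) {
            /* unwind only the slots allocated in this call */
            while (i > old_n)
                free(new_groups[--i]);
            free(new_groups);
            return NULL;
        }
    }
    if (old_n)
        memcpy(new_groups, old, old_n * sizeof(*new_groups));
    return new_groups;
}
```

An update made through a pointer taken from the old array is visible through the new array as well, because no structure contents were copied.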
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
      Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu
      Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
    • io_uring: prevent sq_thread from spinning when it should stop · 7143b5ac
      Committed by Stefano Garzarella
      This patch drops 'cur_mm' before calling cond_resched(), to prevent
      the sq_thread from spinning even when the user process is finished.
      
      Before this patch, if the user process ended without closing the
      io_uring fd, the sq_thread continues to spin until the
      'sq_thread_idle' timeout ends.
      
      In the worst case where the 'sq_thread_idle' parameter is bigger than
      INT_MAX, the sq_thread will spin forever.
      
      Fixes: 6c271ce2 ("io_uring: add submission polling")
      Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 21 Feb, 2020 3 commits
    • Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof · a5ae50de
      Committed by Filipe Manana
      While logging the prealloc extents of an inode during a fast fsync we call
      btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
      holding a read lock on a leaf of the inode's root (not the log root, the
      fs/subvol root), and then that function locks the file range in the inode's
      iotree. This can lead to a deadlock when:
      
      * the fsync is ranged
      
      * the file has prealloc extents beyond eof
      
      * writeback for a range different from the fsync range starts
        during the fsync
      
      * the size of the file is not sector size aligned
      
      The deadlock arises because, when finishing an ordered extent, we first
      lock a file range and then try to COW the fs/subvol tree to insert an
      extent item.
      
      The following diagram shows how the deadlock can happen.
      
                 CPU 1                                        CPU 2
      
        btrfs_sync_file()
          --> for range [0, 1MiB)
      
          --> inode has a size of
              1MiB and has 1 prealloc
              extent beyond the
              i_size, starting at offset
              4MiB
      
          flushes all delalloc for the
          range [0MiB, 1MiB) and waits
          for the respective ordered
          extents to complete
      
                                                    --> before task at CPU 1 locks the
                                                        inode, a write into file range
                                                        [1MiB, 2MiB + 1KiB) is made
      
                                                    --> i_size is updated to 2MiB + 1KiB
      
                                                    --> writeback is started for that
                                                        range, [1MiB, 2MiB + 4KiB)
                                                        --> end offset rounded up to
                                                            be sector size aligned
      
          btrfs_log_dentry_safe()
            btrfs_log_inode_parent()
              btrfs_log_inode()
      
                btrfs_log_changed_extents()
                  btrfs_log_prealloc_extents()
                    --> does a search on the
                        inode's root
                    --> holds a read lock on
                        leaf X
      
                                                    btrfs_finish_ordered_io()
                                                      --> locks range [1MiB, 2MiB + 4KiB)
                                                          --> end offset rounded up
                                                              to be sector size aligned
      
                                                      --> tries to cow leaf X, through
                                                          insert_reserved_file_extent()
                                                          --> already locked by the
                                                              task at CPU 1
      
                    btrfs_truncate_inode_items()
      
                      --> gets an i_size of
                          2MiB + 1KiB, which is
                          not sector size
                          aligned
      
                      --> tries to lock file
                          range [2MiB, (u64)-1)
                          --> the start range
                              is rounded down
                              from 2MiB + 1K
                              to 2MiB to be sector
                              size aligned
      
                          --> but the subrange
                              [2MiB, 2MiB + 4KiB) is
                              already locked by
                              task at CPU 2 which
                              is waiting to get a
                              write lock on leaf X
                              for which we are
                              holding a read lock
      
                                      *** deadlock ***
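
The rounding steps in the diagram can be checked with a little arithmetic. This sketch assumes a 4 KiB sector size, which matches the diagram's rounding of 2MiB + 1KiB up to 2MiB + 4KiB; the helper names are hypothetical and are not btrfs functions.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_KIB UINT64_C(1024)
#define DEMO_MIB (UINT64_C(1024) * DEMO_KIB)

/* Round x down to a multiple of align; align must be a power of two. */
uint64_t demo_round_down(uint64_t x, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return x & ~(align - 1);
}

/* Round x up to a multiple of align; align must be a power of two. */
uint64_t demo_round_up(uint64_t x, uint64_t align)
{
    return demo_round_down(x + align - 1, align);
}

/* Do the half-open ranges [a_start, a_end) and [b_start, b_end) overlap? */
int demo_ranges_overlap(uint64_t a_start, uint64_t a_end,
                        uint64_t b_start, uint64_t b_end)
{
    return a_start < b_end && b_start < a_end;
}
```

For an i_size of 2MiB + 1KiB, the writeback range ends at the rounded-up offset 2MiB + 4KiB and the truncate range starts at the rounded-down offset 2MiB, so the two ranges share the sector [2MiB, 2MiB + 4KiB). If the file size were sector aligned, both roundings would land on the same offset and the ranges would not overlap.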
      
      This results in a stack trace like the following, triggered by test case
      generic/561 from fstests:
      
        [ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
        [ 2779.979536]       Not tainted 5.6.0-rc2-btrfs-next-53 #1
        [ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 2779.990136] kworker/u8:6    D    0   247      2 0x80004000
        [ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
        [ 2779.990466] Call Trace:
        [ 2779.990491]  ? __schedule+0x384/0xa30
        [ 2779.990521]  schedule+0x33/0xe0
        [ 2779.990616]  btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
        [ 2779.990632]  ? remove_wait_queue+0x60/0x60
        [ 2779.990730]  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
        [ 2779.990782]  btrfs_search_slot+0x510/0x1000 [btrfs]
        [ 2779.990869]  btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
        [ 2779.990944]  __btrfs_drop_extents+0x161/0x1060 [btrfs]
        [ 2779.990987]  ? mark_held_locks+0x6d/0xc0
        [ 2779.990994]  ? __slab_alloc.isra.49+0x99/0x100
        [ 2779.991060]  ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
        [ 2779.991145]  insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
        [ 2779.991222]  ? start_transaction+0xdd/0x5c0 [btrfs]
        [ 2779.991291]  btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
        [ 2779.991405]  btrfs_work_helper+0xaa/0x720 [btrfs]
        [ 2779.991432]  process_one_work+0x26d/0x6a0
        [ 2779.991460]  worker_thread+0x4f/0x3e0
        [ 2779.991481]  ? process_one_work+0x6a0/0x6a0
        [ 2779.991489]  kthread+0x103/0x140
        [ 2779.991499]  ? kthread_create_worker_on_cpu+0x70/0x70
        [ 2779.991515]  ret_from_fork+0x3a/0x50
        (...)
        [ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
        [ 2780.027480]       Not tainted 5.6.0-rc2-btrfs-next-53 #1
        [ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [ 2780.030035] fsstress        D    0 17375  17373 0x00004000
        [ 2780.030038] Call Trace:
        [ 2780.030044]  ? __schedule+0x384/0xa30
        [ 2780.030052]  schedule+0x33/0xe0
        [ 2780.030075]  lock_extent_bits+0x20c/0x320 [btrfs]
        [ 2780.030094]  ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
        [ 2780.030098]  ? rcu_read_lock_sched_held+0x59/0xa0
        [ 2780.030102]  ? remove_wait_queue+0x60/0x60
        [ 2780.030122]  btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
        [ 2780.030151]  ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
        [ 2780.030165]  ? btrfs_search_slot+0x379/0x1000 [btrfs]
        [ 2780.030195]  btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
        [ 2780.030202]  ? do_raw_spin_unlock+0x49/0xc0
        [ 2780.030215]  ? btrfs_get_num_csums+0x10/0x10 [btrfs]
        [ 2780.030239]  btrfs_log_inode+0xf83/0x1124 [btrfs]
        [ 2780.030251]  ? __mutex_unlock_slowpath+0x45/0x2a0
        [ 2780.030275]  btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
        [ 2780.030282]  ? dget_parent+0xa1/0x370
        [ 2780.030309]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        [ 2780.030329]  btrfs_sync_file+0x3f3/0x490 [btrfs]
        [ 2780.030339]  do_fsync+0x38/0x60
        [ 2780.030343]  __x64_sys_fdatasync+0x13/0x20
        [ 2780.030345]  do_syscall_64+0x5c/0x280
        [ 2780.030348]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
        [ 2780.030361] Code: Bad RIP value.
        [ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
        [ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
        [ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
        [ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
        [ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
        [ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90
      
      Fix this by making btrfs_truncate_inode_items() skip locking the range in
      the inode's iotree when the target root is a log root. Locking the range
      is unnecessary for log roots because the inode's lock and log_mutex
      already provide all the protection that is needed.
      
      Fixes: 28553fa9 ("Btrfs: fix race between shrinking truncate and fiemap")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a5ae50de
    • S
      ext4: fix potential race between s_group_info online resizing and access · df3da4ea
      Suraj Jitindar Singh committed
      During an online resize an array of pointers to s_group_info gets replaced
      so it can get enlarged. If there is a concurrent access to the array in
      ext4_get_group_info() and this memory has been reused then this can lead to
      an invalid memory access.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
      Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.edu
      Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Balbir Singh <sblbir@amazon.com>
      Cc: stable@kernel.org
      df3da4ea
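
The race described above (a pointer array replaced while a reader still indexes the old copy) can be sketched in userspace C. This is only an illustration under stated assumptions, not the ext4 patch: `struct table`, `table_get()`, `table_grow()` and `table_reclaim()` are hypothetical names, C11 atomics stand in for the kernel's pointer publication, and the caller is assumed to call `table_reclaim()` only once no reader can still hold the old array.

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-group data the array points to. */
struct group_info {
    int id;
};

struct table {
    _Atomic(struct group_info **) slots; /* currently published array */
    struct group_info **retired;         /* old array awaiting deferred free */
    size_t count;                        /* written only by the resizer */
};

static void table_init(struct table *t)
{
    atomic_init(&t->slots, NULL);
    t->retired = NULL;
    t->count = 0;
}

/* Reader: take one snapshot of the array pointer and index only that. */
static struct group_info *table_get(struct table *t, size_t idx)
{
    struct group_info **snap =
        atomic_load_explicit(&t->slots, memory_order_acquire);

    return snap[idx];
}

/*
 * Resizer: build a larger copy, publish it, and park the old array in
 * t->retired instead of freeing it right away. Returns 0 on success and
 * -1 on failure (allocation error, shrink request, or an old array that
 * has not been reclaimed yet).
 */
static int table_grow(struct table *t, size_t new_count)
{
    struct group_info **old =
        atomic_load_explicit(&t->slots, memory_order_relaxed);
    struct group_info **bigger;

    if (t->retired || new_count < t->count)
        return -1;
    bigger = calloc(new_count, sizeof(*bigger));
    if (!bigger)
        return -1;
    if (old)
        memcpy(bigger, old, t->count * sizeof(*old));
    atomic_store_explicit(&t->slots, bigger, memory_order_release);
    t->count = new_count;
    t->retired = old;
    return 0;
}

/* Call only when no reader can still be using the retired array. */
static void table_reclaim(struct table *t)
{
    free(t->retired);
    t->retired = NULL;
}
```

The point of the split between `table_grow()` and `table_reclaim()` is that freeing the old array is deferred; freeing it immediately inside the resizer is what would allow a concurrent reader to touch reused memory.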
    • T
      ext4: fix potential race between online resizing and write operations · 1d0c3924
      Theodore Ts'o committed
      During an online resize an array of pointers to buffer heads gets
      replaced so it can get enlarged.  If there is a racing block
      allocation or deallocation which uses the old array, and the old array
      has gotten reused this can lead to a GPF or some other random kernel
      memory getting modified.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
      Link: https://lore.kernel.org/r/20200221053458.730016-2-tytso@mit.edu
      Reported-by: Suraj Jitindar Singh <surajjs@amazon.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      1d0c3924
  11. 20 Feb, 2020 2 commits
    • S
      ext4: add cond_resched() to __ext4_find_entry() · 9424ef56
      Shijie Luo committed
      We observed a soft lockup on Linux 4.19; the same problem can also
      occur on Linux 5.x.
      
      When a directory inode spans a large number of blocks and the
      directory is growing while it is being searched, the restart branch
      can be taken many times, so the do-while loop can hold the CPU for a
      long time.
      
      Here is the call trace from Linux 4.19:
      
      [  473.756186] Call trace:
      [  473.756196]  dump_backtrace+0x0/0x198
      [  473.756199]  show_stack+0x24/0x30
      [  473.756205]  dump_stack+0xa4/0xcc
      [  473.756210]  watchdog_timer_fn+0x300/0x3e8
      [  473.756215]  __hrtimer_run_queues+0x114/0x358
      [  473.756217]  hrtimer_interrupt+0x104/0x2d8
      [  473.756222]  arch_timer_handler_virt+0x38/0x58
      [  473.756226]  handle_percpu_devid_irq+0x90/0x248
      [  473.756231]  generic_handle_irq+0x34/0x50
      [  473.756234]  __handle_domain_irq+0x68/0xc0
      [  473.756236]  gic_handle_irq+0x6c/0x150
      [  473.756238]  el1_irq+0xb8/0x140
      [  473.756286]  ext4_es_lookup_extent+0xdc/0x258 [ext4]
      [  473.756310]  ext4_map_blocks+0x64/0x5c0 [ext4]
      [  473.756333]  ext4_getblk+0x6c/0x1d0 [ext4]
      [  473.756356]  ext4_bread_batch+0x7c/0x1f8 [ext4]
      [  473.756379]  ext4_find_entry+0x124/0x3f8 [ext4]
      [  473.756402]  ext4_lookup+0x8c/0x258 [ext4]
      [  473.756407]  __lookup_hash+0x8c/0xe8
      [  473.756411]  filename_create+0xa0/0x170
      [  473.756413]  do_mkdirat+0x6c/0x140
      [  473.756415]  __arm64_sys_mkdirat+0x28/0x38
      [  473.756419]  el0_svc_common+0x78/0x130
      [  473.756421]  el0_svc_handler+0x38/0x78
      [  473.756423]  el0_svc+0x8/0xc
      [  485.755156] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [tmp:5149]
      
      Add cond_resched() to avoid the soft lockup and to improve system
      responsiveness.
      
      Link: https://lore.kernel.org/r/20200215080206.13293-1-luoshijie1@huawei.com
      Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@kernel.org
      9424ef56
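
The idea of yielding inside a long-running scan loop can be sketched in userspace C. This is a rough analogy, not the kernel change: `scan_blocks()` and its `yields` counter are hypothetical, and `sched_yield()` stands in for `cond_resched()` even though it always yields, whereas `cond_resched()` only reschedules when the scheduler has asked for it.

```c
#include <sched.h>
#include <stddef.h>

/*
 * Hypothetical block scan: returns the index of the first block equal
 * to key, or -1 if there is none. After every block that does not match,
 * the loop offers the CPU to other tasks so that a very long scan cannot
 * monopolize it.
 */
static long scan_blocks(const int *blocks, size_t nblocks, int key,
                        unsigned long *yields)
{
    for (size_t i = 0; i < nblocks; i++) {
        if (blocks[i] == key)
            return (long)i;
        sched_yield(); /* userspace stand-in for cond_resched() */
        if (yields)
            (*yields)++;
    }
    return -1;
}
```

The yield point sits inside the loop body rather than around the whole search, because the loop itself is what can run for a long time.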
    • Q
      ext4: fix a data race in EXT4_I(inode)->i_disksize · 35df4299
      Qian Cai committed
      EXT4_I(inode)->i_disksize could be accessed concurrently as noticed by
      KCSAN,
      
       BUG: KCSAN: data-race in ext4_write_end [ext4] / ext4_writepages [ext4]
      
       write to 0xffff91c6713b00f8 of 8 bytes by task 49268 on cpu 127:
        ext4_write_end+0x4e3/0x750 [ext4]
        ext4_update_i_disksize at fs/ext4/ext4.h:3032
        (inlined by) ext4_update_inode_size at fs/ext4/ext4.h:3046
        (inlined by) ext4_write_end at fs/ext4/inode.c:1287
        generic_perform_write+0x208/0x2a0
        ext4_buffered_write_iter+0x11f/0x210 [ext4]
        ext4_file_write_iter+0xce/0x9e0 [ext4]
        new_sync_write+0x29c/0x3b0
        __vfs_write+0x92/0xa0
        vfs_write+0x103/0x260
        ksys_write+0x9d/0x130
        __x64_sys_write+0x4c/0x60
        do_syscall_64+0x91/0xb47
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
       read to 0xffff91c6713b00f8 of 8 bytes by task 24872 on cpu 37:
        ext4_writepages+0x10ac/0x1d00 [ext4]
        mpage_map_and_submit_extent at fs/ext4/inode.c:2468
        (inlined by) ext4_writepages at fs/ext4/inode.c:2772
        do_writepages+0x5e/0x130
        __writeback_single_inode+0xeb/0xb20
        writeback_sb_inodes+0x429/0x900
        __writeback_inodes_wb+0xc4/0x150
        wb_writeback+0x4bd/0x870
        wb_workfn+0x6b4/0x960
        process_one_work+0x54c/0xbe0
        worker_thread+0x80/0x650
        kthread+0x1e0/0x200
        ret_from_fork+0x27/0x50
      
       Reported by Kernel Concurrency Sanitizer on:
       CPU: 37 PID: 24872 Comm: kworker/u261:2 Tainted: G        W  O L 5.5.0-next-20200204+ #5
       Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
       Workqueue: writeback wb_workfn (flush-7:0)
      
      Since only the read is lockless (it happens outside of "i_data_sem"),
      load tearing could introduce a logic bug. Fix it by adding READ_ONCE()
      for the read and WRITE_ONCE() for the write.
      
      Signed-off-by: Qian Cai <cai@lca.pw>
      Link: https://lore.kernel.org/r/1581085751-31793-1-git-send-email-cai@lca.pw
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      35df4299
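
A minimal userspace sketch of the READ_ONCE()/WRITE_ONCE() pattern follows. It is not the kernel's implementation: the `_SKETCH` macros are simplified volatile-cast versions that rely on the GCC/Clang `__typeof__` extension, and `struct inode_sketch`, `update_disksize()` and `bytes_beyond_disksize()` are hypothetical names. The volatile access only keeps the compiler from splitting or repeating the load or store; it does not replace the lock on the writer side.

```c
#include <stdint.h>

/* Simplified stand-ins for the kernel macros (GCC/Clang extension). */
#define READ_ONCE_SKETCH(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE_SKETCH(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical stand-in for the inode field discussed above. */
struct inode_sketch {
    int64_t disksize;
};

/* Writer side (assumed to run under a lock): the size only grows. */
static void update_disksize(struct inode_sketch *ei, int64_t newsize)
{
    if (newsize > ei->disksize)
        WRITE_ONCE_SKETCH(ei->disksize, newsize);
}

/*
 * Lockless reader: load the field exactly once, then make the decision
 * and compute the result from that single snapshot.
 */
static int64_t bytes_beyond_disksize(struct inode_sketch *ei, int64_t end)
{
    int64_t disksize = READ_ONCE_SKETCH(ei->disksize);

    return end > disksize ? end - disksize : 0;
}
```

Reading the field once into a local matters because a reader that loads it twice could compare against one value and subtract another if the writer runs in between.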
  12. 19 Feb, 2020 3 commits
    • P
      io_uring: fix use-after-free by io_cleanup_req() · 929a3af9
      Pavel Begunkov committed
      io_cleanup_req() must be called before req->io is freed, so it must
      not run after __io_free_req() -> __io_req_aux_free(). In addition, it
      is currently skipped in io_free_req_many(), which uses
      __io_req_aux_free().
      
      Place io_cleanup_req() into __io_req_aux_free().
      
      Fixes: 99bc4c38 ("io_uring: fix iovec leaks")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      929a3af9
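
The ordering problem can be illustrated with a small userspace sketch. Everything here is hypothetical (`struct req_sketch`, `struct io_sketch`, the `cleanups_run` counter, and the assumption that the data released by cleanup lives inside `req->io`); it only shows why putting the cleanup call into the shared helper makes every free path run it before `req->io` goes away.

```c
#include <stdlib.h>

/* Hypothetical per-request async data; cleanup releases iovec. */
struct io_sketch {
    void *iovec;
};

struct req_sketch {
    int needs_cleanup;
    struct io_sketch *io;
};

static int cleanups_run; /* counts cleanup calls, for illustration only */

/* Touches req->io, so it must run while req->io is still allocated. */
static void cleanup_req(struct req_sketch *req)
{
    free(req->io->iovec);
    req->io->iovec = NULL;
    req->needs_cleanup = 0;
    cleanups_run++;
}

/* Shared helper: cleanup first, then release req->io. */
static void req_aux_free(struct req_sketch *req)
{
    if (req->needs_cleanup)
        cleanup_req(req);
    free(req->io);
    req->io = NULL;
}

/* Single-request free path. */
static void free_req(struct req_sketch *req)
{
    req_aux_free(req);
    free(req);
}

/* Batched free path: gets the cleanup through the same helper. */
static void free_req_many(struct req_sketch **reqs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        req_aux_free(reqs[i]);
        free(reqs[i]);
    }
}

/* Allocate a request that needs cleanup; returns NULL on failure. */
static struct req_sketch *new_req(void)
{
    struct req_sketch *req = calloc(1, sizeof(*req));

    if (!req)
        return NULL;
    req->io = calloc(1, sizeof(*req->io));
    if (!req->io) {
        free(req);
        return NULL;
    }
    req->io->iovec = malloc(16);
    req->needs_cleanup = 1;
    return req;
}
```

Because both `free_req()` and `free_req_many()` go through `req_aux_free()`, neither path can forget the cleanup or run it after `req->io` has been freed.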
    • F
      Btrfs: fix btrfs_wait_ordered_range() so that it waits for all ordered extents · e75fd33b
      Filipe Manana committed
      In btrfs_wait_ordered_range() once we find an ordered extent that has
      finished with an error we exit the loop and don't wait for any other
      ordered extents that might be still in progress.
      
      All the users of btrfs_wait_ordered_range() expect that there are no more
      ordered extents in progress after that function returns. So past fixes,
      such as the ones from the following two commits:
      
        ff612ba7 ("btrfs: fix panic during relocation after ENOSPC before
                         writeback happens")
      
        28aeeac1 ("Btrfs: fix panic when starting bg cache writeout after
                         IO error")
      
      don't work when there are multiple ordered extents in the range.
      
      Fix that by making btrfs_wait_ordered_range() wait for all ordered extents
      even after it finds one that had an error.
      
      Link: https://github.com/kdave/btrfs-progs/issues/228#issuecomment-569777554
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e75fd33b
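
The behavior change can be sketched as a loop that records the first error but keeps waiting. This is a simplified illustration, not the btrfs code: `struct ordered_sketch` and `wait_all()` are hypothetical, and setting `waited` stands in for blocking until an ordered extent completes.

```c
#include <stddef.h>

/* Hypothetical ordered extent: error is 0 or a negative errno value. */
struct ordered_sketch {
    int error;
    int waited;
};

/*
 * Wait for every extent in the range. The first error seen is what gets
 * returned, but the loop does not stop there, so nothing is still in
 * progress when the function returns.
 */
static int wait_all(struct ordered_sketch *ext, size_t n)
{
    int ret = 0;

    for (size_t i = 0; i < n; i++) {
        ext[i].waited = 1; /* stand-in for blocking until completion */
        if (ext[i].error && !ret)
            ret = ext[i].error;
    }
    return ret;
}
```

A version that returned as soon as it saw an error would leave the later extents unwaited, which is the situation the commit message describes.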
    • J
      btrfs: fix bytes_may_use underflow in prealloc error condtition · b778cf96
      Josef Bacik committed
      I hit the following warning while running my error injection stress
      testing:
      
        WARNING: CPU: 3 PID: 1453 at fs/btrfs/space-info.h:108 btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
        RIP: 0010:btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
        Call Trace:
        btrfs_free_reserved_data_space+0x4f/0x70 [btrfs]
        __btrfs_prealloc_file_range+0x378/0x470 [btrfs]
        elfcorehdr_read+0x40/0x40
        ? elfcorehdr_read+0x40/0x40
        ? btrfs_commit_transaction+0xca/0xa50 [btrfs]
        ? dput+0xb4/0x2a0
        ? btrfs_log_dentry_safe+0x55/0x70 [btrfs]
        ? btrfs_sync_file+0x30e/0x420 [btrfs]
        ? do_fsync+0x38/0x70
        ? __x64_sys_fdatasync+0x13/0x20
        ? do_syscall_64+0x5b/0x1b0
        ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This happens if we fail to insert our reserved file extent.  At this
      point we've already converted our reservation from ->bytes_may_use to
      ->bytes_reserved.  However once we break we will attempt to free
      everything from [cur_offset, end] from ->bytes_may_use, but our extent
      reservation will overlap part of this.
      
      Fix this problem by adding ins.offset (our extent allocation size) to
      cur_offset so we remove the actual remaining part from ->bytes_may_use.
      
      I validated this fix using my inject-error.py script:
      
      python inject-error.py -o should_fail_bio -t cache_save_setup -t \
      	__btrfs_prealloc_file_range \
      	-t insert_reserved_file_extent.constprop.0 \
      	-r "-5" ./run-fsstress.sh
      
      where run-fsstress.sh simply mounts and runs fsstress on a disk.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b778cf96
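
The accounting in this fix can be checked with a small worked example. The sketch below is an assumption-laden model, not the btrfs code: `struct space_sketch`, `convert_to_reserved()`, `release_may_use()` and `cleanup_after_failure()` are hypothetical, the range end is treated as inclusive, and an underflow is reported as -1 where the kernel would emit a warning.

```c
#include <stdint.h>

/* Hypothetical model of the two counters named in the commit message. */
struct space_sketch {
    uint64_t bytes_may_use;
    uint64_t bytes_reserved;
};

/* An extent of len bytes was allocated: move it from may_use to reserved. */
static void convert_to_reserved(struct space_sketch *s, uint64_t len)
{
    s->bytes_may_use -= len;
    s->bytes_reserved += len;
}

/* Release len bytes from may_use; return -1 instead of underflowing. */
static int release_may_use(struct space_sketch *s, uint64_t len)
{
    if (len > s->bytes_may_use)
        return -1;
    s->bytes_may_use -= len;
    return 0;
}

/*
 * Error path after the extent insert failed. An extent of ins_len bytes
 * starting at cur_offset has already been converted to reserved. With
 * skip_reserved set, the release starts after that extent (the fix);
 * without it, the release covers [cur_offset, end] and overlaps the
 * extent (the bug).
 */
static int cleanup_after_failure(struct space_sketch *s, uint64_t cur_offset,
                                 uint64_t end, uint64_t ins_len,
                                 int skip_reserved)
{
    if (skip_reserved)
        cur_offset += ins_len;
    if (cur_offset < end)
        return release_may_use(s, end - cur_offset + 1);
    return 0;
}
```

For a 1 MiB range with a 256 KiB extent already reserved, bytes_may_use is 786432 at the point of failure: releasing the whole 1048576 bytes would underflow, while releasing only the remaining 786432 bytes brings the counter to exactly zero.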