- 03 3月, 2020 10 次提交
-
-
由 Pavel Begunkov 提交于
Deduplicate call to io_cqring_fill_event(), plain and easy Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Add support for splice(2). - output file is specified as sqe->fd, so it's handled by generic code - hash_reg_file handled by generic code as well - len is 32bit, but should be fine - the fd_in is registered file, when SPLICE_F_FD_IN_FIXED is set, which is a splice flag (i.e. sqe->splice_flags). Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Preparation without functional changes. Adds io_get_file(), that allows to grab files not only into req->file. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Make do_splice(), so other kernel parts can reuse it Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
req->in_async is not really needed, it only prevents propagation of @nxt for fast not-blocked submissions. Remove it. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
io_prep_async_worker() called io_wq_assign_next() do many useless checks: io_req_work_grab_env() was already called during prep, and @do_hashed is not ever used. Add io_prep_next_work() -- simplified version, that can be called io-wq. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Many operations define custom work.func before getting into an io-wq. There are several points against: - it calls io_wq_assign_next() from outside io-wq, that may be confusing - sync context would go unnecessary through io_req_cancelled() - prototypes are quite different, so work!=old_work looks strange - makes async/sync responsibilities fuzzy - adds extra overhead Don't call generic path and io-wq handlers from each other, but use helpers instead Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Don't drop an early reference, hang on to it and let the caller drop it. This makes it behave more like "regular" requests. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
If the -EAGAIN happens because of a static condition, then a poll or later retry won't fix it. We must call it again from blocking condition. Play it safe and ensure that any -EAGAIN condition from read or write must retry from async context. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
io_wq_flush() is buggy, during cancelation of a flush, the associated work may be passed to the caller's (i.e. io_uring) @match callback. That callback is expecting it to be embedded in struct io_kiocb. Cancelation of internal work probably doesn't make a lot of sense to begin with. As the flush helper is no longer used, just delete it and the associated work flag. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 02 3月, 2020 1 次提交
-
-
由 Pavel Begunkov 提交于
To cancel a work, io-wq sets IO_WQ_WORK_CANCEL and executes the callback. However, IO_WQ_WORK_NO_CANCEL works will just execute and may return next work, which will be ignored and lost. Cancel the whole link. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 01 3月, 2020 2 次提交
-
-
由 Dan Carpenter 提交于
If sbi->s_flex_groups_allocated is zero and the first allocation fails then this code will crash. The problem is that "i--" will set "i" to -1 but when we compare "i >= sbi->s_flex_groups_allocated" then the -1 is type promoted to unsigned and becomes UINT_MAX. Since UINT_MAX is more than zero, the condition is true so we call kvfree(new_groups[-1]). The loop will carry on freeing invalid memory until it crashes. Fixes: 7c990728 ("ext4: fix potential race between s_flex_groups online resizing and access") Reviewed-by: NSuraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20200228092142.7irbc44yaz3by7nb@kili.mountainSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
-
由 Qian Cai 提交于
journal_head::b_transaction and journal_head::b_next_transaction could be accessed concurrently as noticed by KCSAN, LTP: starting fsync04 /dev/zero: Can't open blockdev EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null) ================================================================== BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2] write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70: __jbd2_journal_refile_buffer+0xdd/0x210 [jbd2] __jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569 jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2] (inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034 kjournald2+0x13b/0x450 [jbd2] kthread+0x1cd/0x1f0 ret_from_fork+0x27/0x50 read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68: jbd2_write_access_granted+0x1b2/0x250 [jbd2] jbd2_write_access_granted at fs/jbd2/transaction.c:1155 jbd2_journal_get_write_access+0x2c/0x60 [jbd2] __ext4_journal_get_write_access+0x50/0x90 [ext4] ext4_mb_mark_diskspace_used+0x158/0x620 [ext4] ext4_mb_new_blocks+0x54f/0xca0 [ext4] ext4_ind_map_blocks+0xc79/0x1b40 [ext4] ext4_map_blocks+0x3b4/0x950 [ext4] _ext4_get_block+0xfc/0x270 [ext4] ext4_get_block+0x3b/0x50 [ext4] __block_write_begin_int+0x22e/0xae0 __block_write_begin+0x39/0x50 ext4_write_begin+0x388/0xb50 [ext4] generic_perform_write+0x15d/0x290 ext4_buffered_write_iter+0x11f/0x210 [ext4] ext4_file_write_iter+0xce/0x9e0 [ext4] new_sync_write+0x29c/0x3b0 __vfs_write+0x92/0xa0 vfs_write+0x103/0x260 ksys_write+0x9d/0x130 __x64_sys_write+0x4c/0x60 do_syscall_64+0x91/0xb05 entry_SYSCALL_64_after_hwframe+0x49/0xbe 5 locks held by fsync04/25724: #0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260 #1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4] #2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2] #3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4] #4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2] irq event stamp: 1407125 hardirqs last enabled at (1407125): [<ffffffff980da9b7>] __find_get_block+0x107/0x790 hardirqs last disabled at (1407124): [<ffffffff980da8f9>] __find_get_block+0x49/0x790 softirqs last enabled at (1405528): [<ffffffff98a0034c>] __do_softirq+0x34c/0x57c softirqs last disabled at (1405521): [<ffffffff97cc67a2>] irq_exit+0xa2/0xc0 Reported by Kernel Concurrency Sanitizer on: CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 The plain reads are outside of jh->b_state_lock critical section which result in data races. Fix them by adding pairs of READ|WRITE_ONCE(). Reviewed-by: NJan Kara <jack@suse.cz> Signed-off-by: NQian Cai <cai@lca.pw> Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pwSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
-
- 28 2月, 2020 1 次提交
-
-
由 Jens Axboe 提交于
We must set MSG_CMSG_COMPAT if we're in compatability mode, otherwise the iovec import for these commands will not do the right thing and fail the command with -EINVAL. Found by running the test suite compiled as 32-bit. Cc: stable@vger.kernel.org Fixes: aa1fa28f ("io_uring: add support for recvmsg()") Fixes: 0fa03c62 ("io_uring: add support for sendmsg()") Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 27 2月, 2020 2 次提交
-
-
由 Tobias Klauser 提交于
Follow the pattern used with other *_show_fdinfo functions and only define and use io_uring_show_fdinfo and its helper functions if CONFIG_PROC_FS is set. Fixes: 87ce955b ("io_uring: add ->show_fdinfo() for the io_uring file descriptor") Signed-off-by: NTobias Klauser <tklauser@distanz.ch> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
Dan reports that he triggered a warning on ring exit doing some testing: percpu ref (io_file_data_ref_zero) <= 0 (0) after switching to atomic WARNING: CPU: 3 PID: 0 at lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0xe8/0xf0 Modules linked in: CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc3+ #5648 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:percpu_ref_switch_to_atomic_rcu+0xe8/0xf0 Code: e7 ff 55 e8 eb d2 80 3d bd 02 d2 00 00 75 8b 48 8b 55 d8 48 c7 c7 e8 70 e6 81 c6 05 a9 02 d2 00 01 48 8b 75 e8 e8 3a d0 c5 ff <0f> 0b e9 69 ff ff ff 90 55 48 89 fd 53 48 89 f3 48 83 ec 28 48 83 RSP: 0018:ffffc90000110ef8 EFLAGS: 00010292 RAX: 0000000000000045 RBX: 7fffffffffffffff RCX: 0000000000000000 RDX: 0000000000000045 RSI: ffffffff825be7a5 RDI: ffffffff825bc32c RBP: ffff8881b75eac38 R08: 000000042364b941 R09: 0000000000000045 R10: ffffffff825beb40 R11: ffffffff825be78a R12: 0000607e46005aa0 R13: ffff888107dcdd00 R14: 0000000000000000 R15: 0000000000000009 FS: 0000000000000000(0000) GS:ffff8881b9d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f49e6a5ea20 CR3: 00000001b747c004 CR4: 00000000001606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> rcu_core+0x1e4/0x4d0 __do_softirq+0xdb/0x2f1 irq_exit+0xa0/0xb0 smp_apic_timer_interrupt+0x60/0x140 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:default_idle+0x23/0x170 Code: ff eb ab cc cc cc cc 0f 1f 44 00 00 41 54 55 53 65 8b 2d 10 96 92 7e 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 21 d0 51 00 fb f4 <65> 8b 2d f6 95 92 7e 0f 1f 44 00 00 5b 5d 41 5c c3 65 8b 05 e5 95 Turns out that this is due to percpu_ref_switch_to_atomic() only grabbing a reference to the percpu refcount if it's not already in atomic mode. io_uring drops a ref and re-gets it when switching back to percpu mode. We attempt to protect against this with the FFD_F_ATOMIC bit, but that isn't reliable. We don't actually need to juggle these refcounts between atomic and percpu switch, we can just do them when we've switched to atomic mode. This removes the need for FFD_F_ATOMIC, which wasn't reliable. Fixes: 05f3fb3c ("io_uring: avoid ring quiesce for fixed file set unregister and update") Reported-by: NDan Melnic <dmm@fb.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 26 2月, 2020 5 次提交
-
-
由 Jens Axboe 提交于
Unlike the other core import helpers, import_single_range() returns 0 on success, not the length imported. This means that links that depend on the result of non-vec based IORING_OP_{READ,WRITE} that were added for 5.5 get errored when they should not be. Fixes: 3a6820f2 ("io_uring: add non-vectored read/write commands") Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
If work completes inline, then we should pick up a dependent link item in __io_queue_sqe() as well. If we don't do so, we're forced to go async with that item, which is suboptimal. This also fixes an issue with io_put_req_find_next(), which always looks up the next work item. That should only be done if we're dropping the last reference to the request, to prevent multiple lookups of the same work item. Outside of being a fix, this also enables a good cleanup series for 5.7, where we never have to pass 'nxt' around or into the work handlers. Reviewed-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Johannes Thumshirn 提交于
Zonefs makes use of iomap internally, so it should also select iomap in Kconfig. Signed-off-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
-
由 Christoph Hellwig 提交于
IOCB_NOWAIT can't just be ignored as it breaks applications expecting it not to block. Just refuse the operation as applications must handle that (e.g. by falling back to a thread pool). Fixes: 8dcc1a9d ("fs: New zonefs file system") Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
-
由 Jens Axboe 提交于
We use ->task_pid for exit cancellation, but we need to ensure it's cleared to zero for io_req_work_grab_env() to do the right thing. Take a suggestion from Bart and clear the whole thing, just setting the function passed in. This makes it more future proof as well. Fixes: 36282881 ("io-wq: add io_wq_cancel_pid() to cancel based on a specific pid") Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 25 2月, 2020 2 次提交
-
-
由 Jens Axboe 提交于
Andres reports that buffered IO seems to suck up more cycles than we would like, and he narrowed it down to the fact that the io-wq workers will briefly spin for more work on completion of a work item. This was a win on the networking side, but apparently some other cases take a hit because of it. Remove the optimization to avoid burning more CPU than we have to for disk IO. Reported-by: NAndres Freund <andres@anarazel.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Xiaoguang Wang 提交于
After making ext4 support iopoll method: let ext4_file_operations's iopoll method be iomap_dio_iopoll(), we found fio can easily hang in fio_ioring_getevents() with below fio job: rm -f testfile; sync; sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread -rw=write -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled. There are two issues that results in this hang, one reason is that when IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio does not use io_uring_enter to get completed events, it relies on kernel io_sq_thread to poll for completed events. Another reason is that there is a race: when io_submit_sqes() in io_sq_thread() submits a batch of sqes, variable 'inflight' will record the number of submitted reqs, then io_sq_thread will poll for reqs which have been added to poll_list. But note, if some previous reqs have been punted to io worker, these reqs will won't be in poll_list timely. io_sq_thread() will only poll for a part of previous submitted reqs, and then find poll_list is empty, reset variable 'inflight' to be zero. If app just waits these deferred reqs and does not wake up io_sq_thread again, then hang happens. For app that entirely relies on io_sq_thread to poll completed requests, let io_iopoll_req_issued() wake up io_sq_thread properly when adding new element to poll_list, and when io_sq_thread prepares to sleep, check whether poll_list is empty again, if not empty, continue to poll. Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 24 2月, 2020 2 次提交
-
-
由 Jens Axboe 提交于
We somehow never free the idr, even though we init it for every ctx. Free it when the rest of the ring data is freed. Fixes: 071698e1 ("io_uring: allow registering credentials") Reviewed-by: NStefano Garzarella <sgarzare@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
If we have a chain of requests and they don't all use the same credentials, then the head of the chain will be issued with the credentails of the tail of the chain. Ensure __io_queue_sqe() overrides the credentials, if they are different. Once we do that, we can clean up the creds handling as well, by only having io_submit_sqe() do the lookup of a personality. It doesn't need to assign it, since __io_queue_sqe() now always does the right thing. Fixes: 75c6a039 ("io_uring: support using a registered personality for commands") Reported-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 22 2月, 2020 7 次提交
-
-
由 Xiaoguang Wang 提交于
Since commit a3a0e43f ("io_uring: don't enter poll loop if we have CQEs pending"), if we already events pending, we won't enter poll loop. In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if app has been terminated and don't reap pending events which are already in cq ring, and there are some reqs in poll_list, io_sq_thread will enter __io_iopoll_check(), and find pending events, then return, this loop will never have a chance to exit. I have seen this issue in fio stress tests, to fix this issue, let io_sq_thread call io_iopoll_getevents() with argument 'min' being zero, and remove __io_iopoll_check(). Fixes: a3a0e43f ("io_uring: don't enter poll loop if we have CQEs pending") Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jan Kara 提交于
When CONFIG_QFMT_V2 is configured as a module, the test in ext4_feature_set_ok() fails and so mount of filesystems with quota or project features fails. Fix the test to use IS_ENABLED macro which works properly even for modules. Link: https://lore.kernel.org/r/20200221100835.9332-1-jack@suse.cz Fixes: d65d87a0 ("ext4: improve explanation of a mount failure caused by a misconfigured kernel") Signed-off-by: NJan Kara <jack@suse.cz> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
-
由 wangyan 提交于
I found a NULL pointer dereference in ocfs2_block_group_clear_bits(). The running environment: kernel version: 4.19 A cluster with two nodes, 5 luns mounted on two nodes, and do some file operations like dd/fallocate/truncate/rm on every lun with storage network disconnection. The fallocate operation on dm-23-45 caused an null pointer dereference. The information of NULL pointer dereference as follows: [577992.878282] JBD2: Error -5 detected when updating journal superblock for dm-23-45. [577992.878290] Aborting journal on device dm-23-45. ... [577992.890778] JBD2: Error -5 detected when updating journal superblock for dm-24-46. [577992.890908] __journal_remove_journal_head: freeing b_committed_data [577992.890916] (fallocate,88392,52):ocfs2_extend_trans:474 ERROR: status = -30 [577992.890918] __journal_remove_journal_head: freeing b_committed_data [577992.890920] (fallocate,88392,52):ocfs2_rotate_tree_right:2500 ERROR: status = -30 [577992.890922] __journal_remove_journal_head: freeing b_committed_data [577992.890924] (fallocate,88392,52):ocfs2_do_insert_extent:4382 ERROR: status = -30 [577992.890928] (fallocate,88392,52):ocfs2_insert_extent:4842 ERROR: status = -30 [577992.890928] __journal_remove_journal_head: freeing b_committed_data [577992.890930] (fallocate,88392,52):ocfs2_add_clusters_in_btree:4947 ERROR: status = -30 [577992.890933] __journal_remove_journal_head: freeing b_committed_data [577992.890939] __journal_remove_journal_head: freeing b_committed_data [577992.890949] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020 [577992.890950] Mem abort info: [577992.890951] ESR = 0x96000004 [577992.890952] Exception class = DABT (current EL), IL = 32 bits [577992.890952] SET = 0, FnV = 0 [577992.890953] EA = 0, S1PTW = 0 [577992.890954] Data abort info: [577992.890955] ISV = 0, ISS = 0x00000004 [577992.890956] CM = 0, WnR = 0 [577992.890958] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000f8da07a9 [577992.890960] [0000000000000020] pgd=0000000000000000 [577992.890964] Internal error: Oops: 96000004 [#1] SMP [577992.890965] Process fallocate (pid: 88392, stack limit = 0x00000000013db2fd) [577992.890968] CPU: 52 PID: 88392 Comm: fallocate Kdump: loaded Tainted: G W OE 4.19.36 #1 [577992.890969] Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019 [577992.890971] pstate: 60400009 (nZCv daif +PAN -UAO) [577992.891054] pc : _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2] [577992.891082] lr : _ocfs2_free_suballoc_bits+0x618/0x968 [ocfs2] [577992.891084] sp : ffff0000c8e2b810 [577992.891085] x29: ffff0000c8e2b820 x28: 0000000000000000 [577992.891087] x27: 00000000000006f3 x26: ffffa07957b02e70 [577992.891089] x25: ffff807c59d50000 x24: 00000000000006f2 [577992.891091] x23: 0000000000000001 x22: ffff807bd39abc30 [577992.891093] x21: ffff0000811d9000 x20: ffffa07535d6a000 [577992.891097] x19: ffff000001681638 x18: ffffffffffffffff [577992.891098] x17: 0000000000000000 x16: ffff000080a03df0 [577992.891100] x15: ffff0000811d9708 x14: 203d207375746174 [577992.891101] x13: 73203a524f525245 x12: 20373439343a6565 [577992.891103] x11: 0000000000000038 x10: 0101010101010101 [577992.891106] x9 : ffffa07c68a85d70 x8 : 7f7f7f7f7f7f7f7f [577992.891109] x7 : 0000000000000000 x6 : 0000000000000080 [577992.891110] x5 : 0000000000000000 x4 : 0000000000000002 [577992.891112] x3 : ffff000001713390 x2 : 2ff90f88b1c22f00 [577992.891114] x1 : ffff807bd39abc30 x0 : 0000000000000000 [577992.891116] Call trace: [577992.891139] _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2] [577992.891162] _ocfs2_free_clusters+0x100/0x290 [ocfs2] [577992.891185] ocfs2_free_clusters+0x50/0x68 [ocfs2] [577992.891206] ocfs2_add_clusters_in_btree+0x198/0x5e0 [ocfs2] [577992.891227] ocfs2_add_inode_data+0x94/0xc8 [ocfs2] [577992.891248] ocfs2_extend_allocation+0x1bc/0x7a8 [ocfs2] [577992.891269] ocfs2_allocate_extents+0x14c/0x338 [ocfs2] [577992.891290] __ocfs2_change_file_space+0x3f8/0x610 [ocfs2] [577992.891309] ocfs2_fallocate+0xe4/0x128 [ocfs2] [577992.891316] vfs_fallocate+0x11c/0x250 [577992.891317] ksys_fallocate+0x54/0x88 [577992.891319] __arm64_sys_fallocate+0x28/0x38 [577992.891323] el0_svc_common+0x78/0x130 [577992.891325] el0_svc_handler+0x38/0x78 [577992.891327] el0_svc+0x8/0xc My analysis process as follows: ocfs2_fallocate __ocfs2_change_file_space ocfs2_allocate_extents ocfs2_extend_allocation ocfs2_add_inode_data ocfs2_add_clusters_in_btree ocfs2_insert_extent ocfs2_do_insert_extent ocfs2_rotate_tree_right ocfs2_extend_rotate_transaction ocfs2_extend_trans jbd2_journal_restart jbd2__journal_restart /* handle->h_transaction is NULL, * is_handle_aborted(handle) is true */ handle->h_transaction = NULL; start_this_handle return -EROFS; ocfs2_free_clusters _ocfs2_free_clusters _ocfs2_free_suballoc_bits ocfs2_block_group_clear_bits ocfs2_journal_access_gd __ocfs2_journal_access jbd2_journal_get_undo_access /* I think jbd2_write_access_granted() will * return true, because do_get_write_access() * will return -EROFS. */ if (jbd2_write_access_granted(...)) return 0; do_get_write_access /* handle->h_transaction is NULL, it will * return -EROFS here, so do_get_write_access() * was not called. */ if (is_handle_aborted(handle)) return -EROFS; /* bh2jh(group_bh) is NULL, caused NULL pointer dereference */ undo_bg = (struct ocfs2_group_desc *) bh2jh(group_bh)->b_committed_data; If handle->h_transaction == NULL, then jbd2_write_access_granted() does not really guarantee that journal_head will stay around, not even speaking of its b_committed_data. The bh2jh(group_bh) can be removed after ocfs2_journal_access_gd() and before call "bh2jh(group_bh)->b_committed_data". So, we should move is_handle_aborted() check from do_get_write_access() into jbd2_journal_get_undo_access() and jbd2_journal_get_write_access() before the call to jbd2_write_access_granted(). Link: https://lore.kernel.org/r/f72a623f-b3f1-381a-d91d-d22a1c83a336@huawei.comSigned-off-by: NYan Wang <wangyan122@huawei.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NJun Piao <piaojun@huawei.com> Reviewed-by: NJan Kara <jack@suse.cz> Cc: stable@kernel.org
-
由 Eric Biggers 提交于
If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running on it, the following warning in ext4_add_complete_io() can be hit: WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120 Here's a minimal reproducer (not 100% reliable) (root isn't required): while true; do sync done & while true; do rm -f file touch file chattr -e file echo X >> file chattr +e file done The problem is that in ext4_writepages(), ext4_should_dioread_nolock() (which only returns true on extent-based files) is checked once to set the number of reserved journal credits, and also again later to select the flags for ext4_map_blocks() and copy the reserved journal handle to ext4_io_end::handle. But if EXT4_EXTENTS_FL is being concurrently set, the first check can see dioread_nolock disabled while the later one can see it enabled, causing the reserved handle to unexpectedly be NULL. Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races related to doing so as well, fix this by synchronizing changing EXT4_EXTENTS_FL with ext4_writepages() via the existing s_writepages_rwsem (previously called s_journal_flag_rwsem). This was originally reported by syzbot without a reproducer at https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf, but now that dioread_nolock is the default I also started seeing this when running syzkaller locally. Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com Fixes: 6b523df4 ("ext4: use transaction reservation for extent conversion in ext4_end_io") Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NJan Kara <jack@suse.cz> Cc: stable@kernel.org
-
由 Eric Biggers 提交于
In preparation for making s_journal_flag_rwsem synchronize ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA flags (rather than just JOURNAL_DATA as it does currently), rename it to s_writepages_rwsem. Link: https://lore.kernel.org/r/20200219183047.47417-2-ebiggers@kernel.orgSigned-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NJan Kara <jack@suse.cz> Cc: stable@kernel.org
-
由 Suraj Jitindar Singh 提交于
During an online resize an array of s_flex_groups structures gets replaced so it can get enlarged. If there is a concurrent access to the array and this memory has been reused then this can lead to an invalid memory access. The s_flex_group array has been converted into an array of pointers rather than an array of structures. This is to ensure that the information contained in the structures cannot get out of sync during a resize due to an accessor updating the value in the old structure after it has been copied but before the array pointer is updated. Since the structures them- selves are no longer copied but only the pointers to them this case is mitigated. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.eduSigned-off-by: NSuraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
-
由 Stefano Garzarella 提交于
This patch drops 'cur_mm' before calling cond_resched(), to prevent the sq_thread from spinning even when the user process is finished. Before this patch, if the user process ended without closing the io_uring fd, the sq_thread continues to spin until the 'sq_thread_idle' timeout ends. In the worst case where the 'sq_thread_idle' parameter is bigger than INT_MAX, the sq_thread will spin forever. Fixes: 6c271ce2 ("io_uring: add submission polling") Signed-off-by: NStefano Garzarella <sgarzare@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 21 2月, 2020 3 次提交
-
-
由 Filipe Manana 提交于
While logging the prealloc extents of an inode during a fast fsync we call btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while holding a read lock on a leaf of the inode's root (not the log root, the fs/subvol root), and then that function locks the file range in the inode's iotree. This can lead to a deadlock when: * the fsync is ranged * the file has prealloc extents beyond eof * writeback for a range different from the fsync range starts during the fsync * the size of the file is not sector size aligned Because when finishing an ordered extent we lock first a file range and then try to COW the fs/subvol tree to insert an extent item. The following diagram shows how the deadlock can happen. CPU 1 CPU 2 btrfs_sync_file() --> for range [0, 1MiB) --> inode has a size of 1MiB and has 1 prealloc extent beyond the i_size, starting at offset 4MiB flushes all delalloc for the range [0MiB, 1MiB) and waits for the respective ordered extents to complete --> before task at CPU 1 locks the inode, a write into file range [1MiB, 2MiB + 1KiB) is made --> i_size is updated to 2MiB + 1KiB --> writeback is started for that range, [1MiB, 2MiB + 4KiB) --> end offset rounded up to be sector size aligned btrfs_log_dentry_safe() btrfs_log_inode_parent() btrfs_log_inode() btrfs_log_changed_extents() btrfs_log_prealloc_extents() --> does a search on the inode's root --> holds a read lock on leaf X btrfs_finish_ordered_io() --> locks range [1MiB, 2MiB + 4KiB) --> end offset rounded up to be sector size aligned --> tries to cow leaf X, through insert_reserved_file_extent() --> already locked by the task at CPU 1 btrfs_truncate_inode_items() --> gets an i_size of 2MiB + 1KiB, which is not sector size aligned --> tries to lock file range [2MiB, (u64)-1) --> the start range is rounded down from 2MiB + 1K to 2MiB to be sector size aligned --> but the subrange [2MiB, 2MiB + 4KiB) is already locked by task at CPU 2 which is waiting to get a write lock on leaf X for which we are holding a read lock *** deadlock *** This results in a stack trace like the following, triggered by test case generic/561 from fstests: [ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds. [ 2779.979536] Not tainted 5.6.0-rc2-btrfs-next-53 #1 [ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 2779.990136] kworker/u8:6 D 0 247 2 0x80004000 [ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs] [ 2779.990466] Call Trace: [ 2779.990491] ? __schedule+0x384/0xa30 [ 2779.990521] schedule+0x33/0xe0 [ 2779.990616] btrfs_tree_read_lock+0x19e/0x2e0 [btrfs] [ 2779.990632] ? remove_wait_queue+0x60/0x60 [ 2779.990730] btrfs_read_lock_root_node+0x2f/0x40 [btrfs] [ 2779.990782] btrfs_search_slot+0x510/0x1000 [btrfs] [ 2779.990869] btrfs_lookup_file_extent+0x4a/0x70 [btrfs] [ 2779.990944] __btrfs_drop_extents+0x161/0x1060 [btrfs] [ 2779.990987] ? mark_held_locks+0x6d/0xc0 [ 2779.990994] ? __slab_alloc.isra.49+0x99/0x100 [ 2779.991060] ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs] [ 2779.991145] insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs] [ 2779.991222] ? start_transaction+0xdd/0x5c0 [btrfs] [ 2779.991291] btrfs_finish_ordered_io+0x4f4/0x840 [btrfs] [ 2779.991405] btrfs_work_helper+0xaa/0x720 [btrfs] [ 2779.991432] process_one_work+0x26d/0x6a0 [ 2779.991460] worker_thread+0x4f/0x3e0 [ 2779.991481] ? process_one_work+0x6a0/0x6a0 [ 2779.991489] kthread+0x103/0x140 [ 2779.991499] ? kthread_create_worker_on_cpu+0x70/0x70 [ 2779.991515] ret_from_fork+0x3a/0x50 (...) [ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds. [ 2780.027480] Not tainted 5.6.0-rc2-btrfs-next-53 #1 [ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 2780.030035] fsstress D 0 17375 17373 0x00004000 [ 2780.030038] Call Trace: [ 2780.030044] ? __schedule+0x384/0xa30 [ 2780.030052] schedule+0x33/0xe0 [ 2780.030075] lock_extent_bits+0x20c/0x320 [btrfs] [ 2780.030094] ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs] [ 2780.030098] ? rcu_read_lock_sched_held+0x59/0xa0 [ 2780.030102] ? remove_wait_queue+0x60/0x60 [ 2780.030122] btrfs_truncate_inode_items+0x133/0x1150 [btrfs] [ 2780.030151] ? btrfs_set_path_blocking+0xb2/0x160 [btrfs] [ 2780.030165] ? btrfs_search_slot+0x379/0x1000 [btrfs] [ 2780.030195] btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs] [ 2780.030202] ? do_raw_spin_unlock+0x49/0xc0 [ 2780.030215] ? btrfs_get_num_csums+0x10/0x10 [btrfs] [ 2780.030239] btrfs_log_inode+0xf83/0x1124 [btrfs] [ 2780.030251] ? __mutex_unlock_slowpath+0x45/0x2a0 [ 2780.030275] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs] [ 2780.030282] ? dget_parent+0xa1/0x370 [ 2780.030309] btrfs_log_dentry_safe+0x4a/0x70 [btrfs] [ 2780.030329] btrfs_sync_file+0x3f3/0x490 [btrfs] [ 2780.030339] do_fsync+0x38/0x60 [ 2780.030343] __x64_sys_fdatasync+0x13/0x20 [ 2780.030345] do_syscall_64+0x5c/0x280 [ 2780.030348] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 2780.030356] RIP: 0033:0x7f2d80f6d5f0 [ 2780.030361] Code: Bad RIP value. [ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b [ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0 [ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003 [ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c [ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4 [ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90 So fix this by making btrfs_truncate_inode_items() not lock the range in the inode's iotree when the target root is a log root, since it's not needed to lock the range for log roots as the protection from the inode's lock and log_mutex are all that's needed. Fixes: 28553fa9 ("Btrfs: fix race between shrinking truncate and fiemap") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Suraj Jitindar Singh 提交于
During an online resize an array of pointers to s_group_info gets replaced so it can get enlarged. If there is a concurrent access to the array in ext4_get_group_info() and this memory has been reused then this can lead to an invalid memory access. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.eduSigned-off-by: NSuraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NBalbir Singh <sblbir@amazon.com> Cc: stable@kernel.org
-
由 Theodore Ts'o 提交于
During an online resize an array of pointers to buffer heads gets replaced so it can get enlarged. If there is a racing block allocation or deallocation which uses the old array, and the old array has gotten reused this can lead to a GPF or some other random kernel memory getting modified. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-2-tytso@mit.eduReported-by: NSuraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
-
- 20 2月, 2020 2 次提交
-
-
由 Shijie Luo 提交于
We tested a soft lockup problem in linux 4.19 which could also be found in linux 5.x. When dir inode takes up a large number of blocks, and if the directory is growing when we are searching, it's possible the restart branch could be called many times, and the do while loop could hold cpu a long time. Here is the call trace in linux 4.19. [ 473.756186] Call trace: [ 473.756196] dump_backtrace+0x0/0x198 [ 473.756199] show_stack+0x24/0x30 [ 473.756205] dump_stack+0xa4/0xcc [ 473.756210] watchdog_timer_fn+0x300/0x3e8 [ 473.756215] __hrtimer_run_queues+0x114/0x358 [ 473.756217] hrtimer_interrupt+0x104/0x2d8 [ 473.756222] arch_timer_handler_virt+0x38/0x58 [ 473.756226] handle_percpu_devid_irq+0x90/0x248 [ 473.756231] generic_handle_irq+0x34/0x50 [ 473.756234] __handle_domain_irq+0x68/0xc0 [ 473.756236] gic_handle_irq+0x6c/0x150 [ 473.756238] el1_irq+0xb8/0x140 [ 473.756286] ext4_es_lookup_extent+0xdc/0x258 [ext4] [ 473.756310] ext4_map_blocks+0x64/0x5c0 [ext4] [ 473.756333] ext4_getblk+0x6c/0x1d0 [ext4] [ 473.756356] ext4_bread_batch+0x7c/0x1f8 [ext4] [ 473.756379] ext4_find_entry+0x124/0x3f8 [ext4] [ 473.756402] ext4_lookup+0x8c/0x258 [ext4] [ 473.756407] __lookup_hash+0x8c/0xe8 [ 473.756411] filename_create+0xa0/0x170 [ 473.756413] do_mkdirat+0x6c/0x140 [ 473.756415] __arm64_sys_mkdirat+0x28/0x38 [ 473.756419] el0_svc_common+0x78/0x130 [ 473.756421] el0_svc_handler+0x38/0x78 [ 473.756423] el0_svc+0x8/0xc [ 485.755156] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [tmp:5149] Add cond_resched() to avoid soft lockup and to provide a better system responding. Link: https://lore.kernel.org/r/20200215080206.13293-1-luoshijie1@huawei.comSigned-off-by: NShijie Luo <luoshijie1@huawei.com> Signed-off-by: NTheodore Ts'o <tytso@mit.edu> Reviewed-by: NJan Kara <jack@suse.cz> Cc: stable@kernel.org
-
由 Qian Cai 提交于
EXT4_I(inode)->i_disksize could be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in ext4_write_end [ext4] / ext4_writepages [ext4] write to 0xffff91c6713b00f8 of 8 bytes by task 49268 on cpu 127: ext4_write_end+0x4e3/0x750 [ext4] ext4_update_i_disksize at fs/ext4/ext4.h:3032 (inlined by) ext4_update_inode_size at fs/ext4/ext4.h:3046 (inlined by) ext4_write_end at fs/ext4/inode.c:1287 generic_perform_write+0x208/0x2a0 ext4_buffered_write_iter+0x11f/0x210 [ext4] ext4_file_write_iter+0xce/0x9e0 [ext4] new_sync_write+0x29c/0x3b0 __vfs_write+0x92/0xa0 vfs_write+0x103/0x260 ksys_write+0x9d/0x130 __x64_sys_write+0x4c/0x60 do_syscall_64+0x91/0xb47 entry_SYSCALL_64_after_hwframe+0x49/0xbe read to 0xffff91c6713b00f8 of 8 bytes by task 24872 on cpu 37: ext4_writepages+0x10ac/0x1d00 [ext4] mpage_map_and_submit_extent at fs/ext4/inode.c:2468 (inlined by) ext4_writepages at fs/ext4/inode.c:2772 do_writepages+0x5e/0x130 __writeback_single_inode+0xeb/0xb20 writeback_sb_inodes+0x429/0x900 __writeback_inodes_wb+0xc4/0x150 wb_writeback+0x4bd/0x870 wb_workfn+0x6b4/0x960 process_one_work+0x54c/0xbe0 worker_thread+0x80/0x650 kthread+0x1e0/0x200 ret_from_fork+0x27/0x50 Reported by Kernel Concurrency Sanitizer on: CPU: 37 PID: 24872 Comm: kworker/u261:2 Tainted: G W O L 5.5.0-next-20200204+ #5 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 Workqueue: writeback wb_workfn (flush-7:0) Since only the read is operating as lockless (outside of the "i_data_sem"), load tearing could introduce a logic bug. Fix it by adding READ_ONCE() for the read and WRITE_ONCE() for the write. Signed-off-by: NQian Cai <cai@lca.pw> Link: https://lore.kernel.org/r/1581085751-31793-1-git-send-email-cai@lca.pwSigned-off-by: NTheodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org
-
- 19 2月, 2020 3 次提交
-
-
由 Pavel Begunkov 提交于
io_cleanup_req() should be called before req->io is freed, and so shouldn't be after __io_free_req() -> __io_req_aux_free(). Also, it will be ignored for in io_free_req_many(), which use __io_req_aux_free(). Place cleanup_req() into __io_req_aux_free(). Fixes: 99bc4c38 ("io_uring: fix iovec leaks") Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Filipe Manana 提交于
In btrfs_wait_ordered_range() once we find an ordered extent that has finished with an error we exit the loop and don't wait for any other ordered extents that might be still in progress. All the users of btrfs_wait_ordered_range() expect that there are no more ordered extents in progress after that function returns. So past fixes such like the ones from the two following commits: ff612ba7 ("btrfs: fix panic during relocation after ENOSPC before writeback happens") 28aeeac1 ("Btrfs: fix panic when starting bg cache writeout after IO error") don't work when there are multiple ordered extents in the range. Fix that by making btrfs_wait_ordered_range() wait for all ordered extents even after it finds one that had an error. Link: https://github.com/kdave/btrfs-progs/issues/228#issuecomment-569777554 CC: stable@vger.kernel.org # 4.4+ Reviewed-by: NQu Wenruo <wqu@suse.com> Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
由 Josef Bacik 提交于
I hit the following warning while running my error injection stress testing: WARNING: CPU: 3 PID: 1453 at fs/btrfs/space-info.h:108 btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs] RIP: 0010:btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs] Call Trace: btrfs_free_reserved_data_space+0x4f/0x70 [btrfs] __btrfs_prealloc_file_range+0x378/0x470 [btrfs] elfcorehdr_read+0x40/0x40 ? elfcorehdr_read+0x40/0x40 ? btrfs_commit_transaction+0xca/0xa50 [btrfs] ? dput+0xb4/0x2a0 ? btrfs_log_dentry_safe+0x55/0x70 [btrfs] ? btrfs_sync_file+0x30e/0x420 [btrfs] ? do_fsync+0x38/0x70 ? __x64_sys_fdatasync+0x13/0x20 ? do_syscall_64+0x5b/0x1b0 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 This happens if we fail to insert our reserved file extent. At this point we've already converted our reservation from ->bytes_may_use to ->bytes_reserved. However once we break we will attempt to free everything from [cur_offset, end] from ->bytes_may_use, but our extent reservation will overlap part of this. Fix this problem by adding ins.offset (our extent allocation size) to cur_offset so we remove the actual remaining part from ->bytes_may_use. I validated this fix using my inject-error.py script python inject-error.py -o should_fail_bio -t cache_save_setup -t \ __btrfs_prealloc_file_range \ -t insert_reserved_file_extent.constprop.0 \ -r "-5" ./run-fsstress.sh where run-fsstress.sh simply mounts and runs fsstress on a disk. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-