- 26 Nov, 2019 32 commits
-
-
Committed by Jens Axboe
We currently pass in 4 arguments outside of the bounded size. In preparation for adding one more argument, let's bundle them up in a struct to make it more readable. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
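The refactoring pattern this commit describes can be sketched in plain C. This is only an illustration: `pool`, `pool_data`, and `pool_init` are hypothetical names invented here, not the kernel's actual types or functions.

```c
#include <stddef.h>

/* Hypothetical bundle of the arguments that used to be passed one by one. */
struct pool_data {
    void *mm;                        /* stand-in for a memory context */
    void *user;                      /* stand-in for an accounting handle */
    void (*get_work)(void *work);    /* optional callback */
    void (*put_work)(void *work);    /* optional callback */
};

struct pool {
    unsigned bounded;
    struct pool_data data;
};

/*
 * Before: pool_init(p, bounded, mm, user, get_work, put_work).
 * After: the four trailing arguments travel together, so adding a fifth
 * member later does not change this signature.
 */
static int pool_init(struct pool *p, unsigned bounded,
                     const struct pool_data *data)
{
    if (p == NULL || data == NULL || bounded == 0)
        return -1;
    p->bounded = bounded;
    p->data = *data;   /* copy the whole bundle in one assignment */
    return 0;
}
```

Callers fill a `struct pool_data` with designated initializers, which also makes each argument's meaning visible at the call site.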
-
Committed by Pavel Begunkov
Read/write requests to devices without implemented read/write_iter using fixed buffers can cause a general protection fault, which totally hangs a machine. io_import_fixed() initialises iov_iter with bvec, but loop_rw_iter() accesses it as iovec, dereferencing a random address. kmap() page by page in this case. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
This allows an application to call connect() in an async fashion. Like other opcodes, we first try a non-blocking connect, then punt to async context if we have to. Note that we can still return -EINPROGRESS, and in that case the caller should use IORING_OP_POLL_ADD to do an async wait for completion of the connect request (just like for regular connect(2), except we can do it async here too). Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
We return -EBUSY on submit when we have a CQ ring overflow backlog, but that can be a bit problematic if the application is using pure userspace poll of the CQ ring. For that case, if the ring briefly overflowed and we have pending entries in the backlog, the submit flushes the backlog successfully but still returns -EBUSY. If we're able to fully flush the CQ ring backlog, let the submission proceed. Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
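The behaviour change described above can be modelled in a few lines of userspace C. This is a toy model under stated assumptions, not io_uring code: `model_ring`, `model_flush_backlog`, and `model_submit` are hypothetical names, and the ring is reduced to two counters.

```c
#include <errno.h>

/* Toy model: a backlog of overflowed completions and free CQ ring slots. */
struct model_ring {
    unsigned backlog;   /* completions waiting for CQ ring space */
    unsigned cq_free;   /* free slots in the CQ ring */
};

/* Move as many backlog entries into the CQ ring as fit; return what is left. */
static unsigned model_flush_backlog(struct model_ring *r)
{
    unsigned n = r->backlog < r->cq_free ? r->backlog : r->cq_free;

    r->backlog -= n;
    r->cq_free -= n;
    return r->backlog;
}

/*
 * Submit path after the fix: fail with -EBUSY only if the backlog could
 * not be flushed completely; otherwise let the submission proceed.
 */
static int model_submit(struct model_ring *r, unsigned to_submit)
{
    if (r->backlog != 0 && model_flush_backlog(r) != 0)
        return -EBUSY;
    return (int)to_submit;
}
```

In this model, the old behaviour corresponds to returning -EBUSY whenever `backlog` was non-zero on entry, even when the flush emptied it.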
-
Committed by Pavel Begunkov
Pass only non-null @nxt to io_issue_sqe() and handle it at the caller's side. And propagate it.
  - kiocb_done() is only called from io_read() and io_write(), which are only called from io_issue_sqe(), so it's @nxt != NULL
  - io_put_req_find_next() is called either with explicitly non-null local nxt, or from one of the functions in io_issue_sqe() switch (or their callees).
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
"if (nxt)" is always true, as it was checked in the while's condition. io_wq_current_is_worker() is unnecessary, as non-async callers don't pass nxt, so io_queue_async_work() will be called for them anyway. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
Make io_req_find_next() and io_req_link_next() accept only non-null nxt, and handle it in callers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
There is only one one-liner user of io_free_req_find_next(). Inline it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
The number of SQEs to submit is specified by a user, so io_get_sqring() in most of the cases succeeds. Hint compilers about that. Checking ASM generated by gcc 9.2.0 for x64, there is one branch misprediction. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
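A branch hint of the kind this commit mentions can be sketched in userspace with GCC's `__builtin_expect`. The `likely`/`unlikely` macros below mirror the common kernel idiom; the `sq_ring` type and the two functions are hypothetical and only stand in for a submission queue.

```c
#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

#define RING_SIZE 8u   /* power of two, so masking wraps the index */

/* Hypothetical submission ring holding plain integers. */
struct sq_ring {
    unsigned head;
    unsigned tail;
    int entries[RING_SIZE];
};

/* Fetch one entry. An empty ring is the rare case, so hint it as unlikely. */
static int sq_ring_get(struct sq_ring *r, int *out)
{
    if (unlikely(r->head == r->tail))
        return 0;
    *out = r->entries[r->head & (RING_SIZE - 1)];
    r->head++;
    return 1;
}

/* Consume up to nr entries, accumulating them into *sum. */
static unsigned submit_n(struct sq_ring *r, unsigned nr, long *sum)
{
    unsigned done = 0;
    int v;

    while (done < nr && sq_ring_get(r, &v)) {
        *sum += v;
        done++;
    }
    return done;
}
```

The hint changes only code layout and static prediction, never the result, so it is safe to apply wherever the expected outcome is known from how callers behave.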
-
Committed by Pavel Begunkov
__io_submit_sqe() is issuing requests, so call it as such. Moreover, it ends by calling io_iopoll_req_issued(). Rename it and make terminology clearer. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
We don't have shadow requests anymore, so get rid of the shadow argument. Add the user_data argument, as that's often useful to easily match up requests, instead of having to look at request pointers. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
There's an issue with the shadow drain logic in that we drop the completion lock after deciding to defer a request, then re-grab it later and assume that the state is still the same. In the meantime, someone else completing a request could have found and issued it. This can cause a stall in the queue, by having a shadow request inserted that nobody is going to drain. Additionally, if we fail allocating the shadow request, we simply ignore the drain. Instead of using a shadow request, defer the next request/link instead. This also has the following advantages:
  - removes semi-duplicated code
  - doesn't allocate memory for shadows
  - works better if only the head is marked for drain
  - doesn't need complex synchronisation
On the flip side, it removes the shadow->seq == last_drain_in_in_link->seq optimization. That shouldn't be a common case, and can always be added back, if needed. Fixes: 4fe2c963 ("io_uring: add support for link with drain") Cc: Jackie Liu <liuyun01@kylinos.cn> Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
When we find new work to process within the work handler, we queue the linked timeout before we have issued the new work. This can be problematic for very short timeouts, as we have a window where the new work isn't visible. Allow the work handler to store a callback function for this in the work item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that is set, then io-wq will call the callback when it has set up the new work item. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
We currently try and start the next link when we put the request, and only if we were going to free it. This means that the optimization to continue executing requests from the same context often fails, as we're not putting the final reference. Add REQ_F_LINK_NEXT to keep track of this, and allow io_uring to find the next request more efficiently. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
We currently rely on the ring destroy to clean things up in case of failure, but io_allocate_scq_urings() can leave things half initialized if only parts of it fail. Be nice and return with either everything set up in success, or return an error with things nicely cleaned up. Reported-by: syzbot+0d818c0d39399188f393@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
Always mark requests with allocated sqe and deallocate it in __io_free_req(). It's easier to follow and doesn't add edge cases. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
We currently clear the linked timeout field if we cancel such a timeout, but we should only attempt to cancel if it's the first one we see. Others should simply be freed like other requests, as they haven't been started yet. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
Let's have a dependent link: REQ -> LINK_TIMEOUT -> LINK_TIMEOUT
  1. submission stage: submission references for REQ and LINK_TIMEOUT are dropped. So, references respectively (1,1,2)
  2. io_put(REQ) + FAIL_LINKS stage: calls io_fail_links(), which for all linked timeouts will call cancel_timeout() and drop 1 reference. So, references after: (0,0,1). That's a leak.
Make it treat only the first linked timeout as such, and pass others through __io_double_put_req(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
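The reference arithmetic in this message can be checked with a toy model. The sketch below is an assumption-laden simplification, not kernel code: it tracks only the three counts from the message, and the function names are hypothetical. Index 0 is REQ, index 1 the first linked timeout, index 2 the second.

```c
#define NR_REQS 3

/* Old behaviour: every linked timeout is treated as started, one put each. */
static void model_fail_links_old(int refs[NR_REQS])
{
    refs[0] -= 1;                 /* io_put(REQ) */
    for (int i = 1; i < NR_REQS; i++)
        refs[i] -= 1;             /* cancel path drops a single reference */
}

/*
 * Fixed behaviour: only the first linked timeout takes the cancel path;
 * later ones were never started, so both of their references are dropped.
 */
static void model_fail_links_new(int refs[NR_REQS])
{
    refs[0] -= 1;                 /* io_put(REQ) */
    refs[1] -= 1;                 /* first linked timeout: cancel, one put */
    for (int i = 2; i < NR_REQS; i++)
        refs[i] -= 2;             /* double put for the rest */
}

/* Count requests that still hold references, i.e. would be leaked. */
static int model_leaked(const int refs[NR_REQS])
{
    int n = 0;

    for (int i = 0; i < NR_REQS; i++)
        if (refs[i] > 0)
            n++;
    return n;
}
```

Starting from (1,1,2), the old path ends at (0,0,1) and the fixed path at (0,0,0), matching the counts given in the message.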
-
Committed by Pavel Begunkov
Pass any IORING_OP_LINK_TIMEOUT request further, where it will eventually fail in io_issue_sqe(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
If io_req_defer() failed, it needs to cancel a dependent link. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Dan Carpenter
These lines are indented an extra space character. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
We currently have a race where if setup is really slow, we can be calling io_wq_destroy() before we're done setting up. This will cause the caller to get stuck waiting for the manager to set things up, but the manager already exited. Fix this by doing a sync setup of the manager. This also fixes the case where if we failed creating workers, we'd also get stuck. In practice this race window was really small, as we already wait for the manager to start. Hence someone would have to call io_wq_destroy() after the task has started, but before it started the first loop. The reported test case forked tons of these, which is why it became an issue. Reported-by: syzbot+0f1cc17f85154f400465@syzkaller.appspotmail.com Fixes: 771b53d0 ("io-wq: small threadpool implementation for io_uring") Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
We currently don't explicitly break links if a request is cancelled, but we should. Add explicit link breakage for all types of request cancellations that we support. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
Currently a poll request fills a completion entry of 0, even if it got cancelled. This is odd, and it makes it harder to support with chains. Ensure that it returns -ECANCELED in the completion events if it got cancelled, and furthermore ensure that the linked timeout that triggered it completes with -ETIME if we did indeed trigger the completions through a timeout. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
With the conversion to io-wq, we no longer use that flag. Kill it. Fixes: 561fb04a ("io_uring: replace workqueue usage with io-wq") Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
We have an issue with timeout links that are deeper in the submit chain, because we only handle it upfront, not from later submissions. Move the prep + issue of the timeout link to the async work prep handler, and do it normally for non-async queue. If we validate and prepare the timeout links upfront when we first see them, there's nothing stopping us from supporting any sort of nesting. Fixes: 2665abfd ("io_uring: add support for linked SQE timeouts") Reported-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
There are a few reasons for this:
  - As a prep to improving the linked timeout logic
  - io_timeout is the biggest member in the io_kiocb opcode union
This also enables a few cleanups, like unifying the timer setup between IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT, and not needing multiple arguments to the link/prep helpers. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
If we don't use the normal completion path, we may skip killing links that should be errored and freed. Add __io_double_put_req() for use within the completion path itself; other calls should just use io_double_put_req(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
__io_queue_sqe(), io_queue_sqe(), io_queue_link_head() all return 0/err, but the caller doesn't care since the errors are handled inline. Clean these up and just make them void. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
If we have a linked request, this enables us to pass it back directly without having to go through async context. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Linus Torvalds
fdget_pos() is used by file operations that will read and update f_pos: things like "read()", "write()" and "lseek()" (but not, for example, "pread()/pwrite()" that get their file positions elsewhere). However, it had two separate escape clauses for this, because not everybody wants or needs serialization of the file position.
The first and most obvious case is the "file descriptor doesn't have a position at all", ie a stream-like file. Except we didn't actually use FMODE_STREAM, but instead used FMODE_ATOMIC_POS. The reason for that was that FMODE_STREAM didn't exist back in the days, but also that we didn't want to mark all the special cases, so we only marked the ones that _required_ position atomicity according to POSIX - regular files and directories. The first case was intentionally lazy, but now that we _do_ have FMODE_STREAM we could and should just use it. With the change to use FMODE_STREAM, there are no remaining uses for FMODE_ATOMIC_POS, and all the code to set it is deleted. Any cases where we don't want the serialization because the driver (or subsystem) doesn't use the file position should just be updated to do "stream_open()". We've done that for all the obvious and common situations, we may need a few more.
Quoting Kirill Smelkov in the original FMODE_STREAM thread (see link below for full email): "And I appreciate if people could help at least somehow with "getting rid of mixed case entirely" (i.e. always lock f_pos_lock on !FMODE_STREAM), because this transition starts to diverge from my particular use-case too far. To me it makes sense to do that transition as follows:
  - convert nonseekable_open -> stream_open via stream_open.cocci;
  - audit other nonseekable_open calls and convert left users that truly don't depend on position to stream_open;
  - extend stream_open.cocci to analyze alloc_file_pseudo as well (this will cover pipes and sockets), or maybe convert pipes and sockets to FMODE_STREAM manually;
  - extend stream_open.cocci to analyze file_operations that use no_llseek or noop_llseek, but do not use nonseekable_open or alloc_file_pseudo. This might find files that have stream semantic but are opened differently;
  - extend stream_open.cocci to analyze file_operations whose .read/.write do not use ppos at all (independently of how file was opened);
  - ...
  - after that remove FMODE_ATOMIC_POS and always take f_pos_lock if !FMODE_STREAM;
  - gather bug reports for deadlocked read/write and convert missed cases to FMODE_STREAM, probably extending stream_open.cocci along the road to catch similar cases
i.e. always take f_pos_lock unless a file is explicitly marked as being stream, and try to find and cover all files that are streams"
We have not done the "extend stream_open.cocci to analyze alloc_file_pseudo" as well, but the previous commit did manually handle the case of pipes and sockets.
The other case where we can avoid locking f_pos is the "this file descriptor only has a single user and it is us, and thus there is no need to lock it". The second test was correct, although a bit subtle and worth just re-iterating here. There are two kinds of other sources of references to the same file descriptor: file descriptors that have been explicitly shared across fork() or with dup(), and file tables having elevated reference counts due to threading (or explicit file sharing with clone()). The first case would have incremented the file count explicitly, and in the second case the previous __fdget() would have incremented it for us and set the FDPUT_FPUT flag. But in both cases the file count would be greater than one, so the "file_count(file) > 1" test catches both situations. Also note that if file_count is 1, that also means that no other thread can have access to the file table, so there also cannot be races with concurrent calls to dup()/fork()/clone() that would increment the file count any other way.
Link: https://lore.kernel.org/linux-fsdevel/20190413184404.GA13490@deco.navytux.spb.ru Cc: Kirill Smelkov <kirr@nexedi.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Marco Elver <elver@google.com> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Paul McKenney <paulmck@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
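The two escape clauses described in this message reduce to a small predicate. The sketch below is a userspace model under stated assumptions: `model_file`, `MODEL_FMODE_STREAM` (including its bit value), and `model_needs_pos_lock` are hypothetical stand-ins, not the kernel's `struct file`, `FMODE_STREAM`, or `__fdget_pos()`.

```c
#include <stdbool.h>

#define MODEL_FMODE_STREAM 0x1u   /* hypothetical bit value for this model */

/* Minimal stand-in for the two fields the decision depends on. */
struct model_file {
    unsigned f_mode;   /* mode flags */
    long f_count;      /* reference count, as file_count() would report */
};

/*
 * Decide whether f_pos must be serialized:
 *  - stream-like files have no position, so never lock;
 *  - a file with a single reference has no concurrent user, so no lock;
 *  - everything else takes the position lock.
 */
static bool model_needs_pos_lock(const struct model_file *f)
{
    if (f->f_mode & MODEL_FMODE_STREAM)
        return false;
    return f->f_count > 1;
}
```

The single count test covers both sharing routes named in the message, since dup()/fork() and threaded file tables each raise the count above one.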
-
Committed by Linus Torvalds
In commit 3975b097 ("convert stream-like files -> stream_open, even if they use noop_llseek") Kirill used a coccinelle script to change "nonseekable_open()" to "stream_open()", which changed the trivial cases of stream-like file descriptors to the new model with FMODE_STREAM. However, the two big cases - sockets and pipes - don't actually have that trivial pattern at all, and were thus never converted to FMODE_STREAM even though it makes lots of sense to do so. That's particularly true when looking forward to the next change: getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to decide whether f_pos updates are needed or not. And if they are, we'll always do them atomically.
This came up because KCSAN (correctly) noted that the non-locked f_pos updates are data races: they are clearly benign for the case where we don't care, but it would be good to just not have that issue exist at all. Note that the reason we used FMODE_ATOMIC_POS originally is that only doing it for the minimal required case is "safer" in that it's possible that the f_pos locking can cause unnecessary serialization across the whole write() call. And in the worst case, that kind of serialization can cause deadlock issues: think writers that need readers to empty the state using the same file descriptor.
[ Note that the locking is per-file descriptor - because it protects "f_pos", which is obviously per-file descriptor - so it only affects cases where you literally use the same file descriptor to both read and write. So a regular pipe that has separate reading and writing file descriptors doesn't really have this situation even though it's the obvious case of "reader empties what a big writer concurrently fills". But we want to mark pipes as being stream-like anyway, because we don't want the unnecessary overhead of locking, and because a named pipe can be (ab-)used by reading and writing to the same file descriptor. ]
There are likely a lot of other cases that might want FMODE_STREAM, and looking for ".llseek = no_llseek" users and other cases that don't have an lseek file operation at all and making them use "stream_open()" might be a good idea. But pipes and sockets are likely to be the two main cases.
Cc: Kirill Smelkov <kirr@nexedi.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Marco Elver <elver@google.com> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Paul McKenney <paulmck@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 24 Nov, 2019 1 commit
-
-
Committed by Maxime Bizon
When both CONFIG_CRAMFS_MTD and CONFIG_CRAMFS_BLOCKDEV are enabled, if we fail to mount on MTD, we don't try on block device. Note: this relies upon cramfs_mtd_fill_super() leaving no side effects on fc state in case of failure; in general, failing get_tree_...() does *not* mean "fine to try again"; e.g. parsed options might've been consumed by fill_super callback and freed on failure. Fixes: 74f78fc5 ("vfs: Convert cramfs to use the new mount API") Signed-off-by: Maxime Bizon <mbizon@freebox.fr> Signed-off-by: Nicolas Pitre <nico@fluxnic.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
- 23 Nov, 2019 3 commits
-
-
Committed by Marc Dionne
By default s_maxbytes is set to MAX_NON_LFS, which limits the usable file size to 2GB, enforced by the vfs. Commit b9b1f8d5 ("AFS: write support fixes") added support for the 64-bit fetch and store server operations, but did not change this value. As a result, attempts to write past the 2G mark result in EFBIG errors:
  $ dd if=/dev/zero of=foo bs=1M count=1 seek=2048
  dd: error writing 'foo': File too large
Set s_maxbytes to MAX_LFS_FILESIZE. Fixes: b9b1f8d5 ("AFS: write support fixes") Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
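The arithmetic behind the dd failure can be worked through in a short sketch. It assumes MAX_NON_LFS is 2^31 - 1 and that a write is refused when its starting offset is at or beyond the limit; the `MODEL_` constant and both helper functions are hypothetical names for this simplified model, not the VFS code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value of the non-LFS limit: 2^31 - 1 bytes. */
#define MODEL_MAX_NON_LFS ((((int64_t)1) << 31) - 1)

/* Offset at which dd starts writing for a given block size and seek count. */
static int64_t model_dd_offset(int64_t block_size, int64_t seek_blocks)
{
    return block_size * seek_blocks;
}

/* Simplified limit check: a write starting at or past the limit fails. */
static bool model_write_too_large(int64_t pos, int64_t max_bytes)
{
    return pos >= max_bytes;
}
```

With bs=1M and seek=2048 the write starts at 1048576 * 2048 = 2147483648 bytes, one byte past the assumed limit of 2147483647, which is why the first block is already rejected.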
-
Committed by Marc Dionne
Servers sending callback breaks to the YFS_CM_SERVICE service may send up to YFSCBMAX (1024) fids in a single RPC. Anything over AFSCBMAX (50) will cause the assert in afs_break_callbacks to trigger. Remove the assert, as the count has already been checked against the appropriate max values in afs_deliver_cb_callback and afs_deliver_yfs_cb_callback. Fixes: 35dbfba3 ("afs: Implement the YFS cache manager service") Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Joseph Qi
This reverts commit 56e94ea1. Commit 56e94ea1 ("fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()") introduces a regression that causes directory creation to fail with the mount options user_xattr and acl. Actually the reported NULL pointer dereference case can be correctly handled by loc->xl_ops->xlo_add_entry(), so revert it. Link: http://lkml.kernel.org/r/1573624916-83825-1-git-send-email-joseph.qi@linux.alibaba.com Fixes: 56e94ea1 ("fs: ocfs2: fix possible null-pointer dereferences in ocfs2_xa_prepare_entry()") Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reported-by: Thomas Voegtle <tv@lio96.de> Acked-by: Changwei Ge <gechangwei@live.cn> Cc: Jia-Ju Bai <baijiaju1990@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 20 Nov, 2019 1 commit
-
-
Committed by David Howells
In afs_wait_for_call_to_complete(), rather than immediately aborting an operation if a signal occurs, the code attempts to wait for it to complete, using a schedule timeout of 2*RTT (or min 2 jiffies) and a check that we're still receiving relevant packets from the server before we consider aborting the call. We may even ping the server to check on the status of the call. However, there's a missing timeout reset in the event that we do actually get a packet to process, such that if we then get a couple of short stalls, we then time out when progress is actually being made. Fix this by resetting the timeout any time we get something to process. If it's the failure of the call then the call state will get changed and we'll exit the loop shortly thereafter. A symptom of this is data fetches and stores failing with EINTR when they really shouldn't. Fixes: bc5e3a54 ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
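The effect of the missing timeout reset can be reproduced with a toy simulation. This is a sketch under simplifying assumptions, not the AFS wait loop: events are given as a string where 'P' means a packet was processed and 'S' means one timeout unit elapsed with no progress, and `model_call_survives` is a hypothetical name.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Walk a sequence of events and report whether the call survives.
 * With reset_on_progress == false the budget is shared by all stalls
 * (the old behaviour); with true, each processed packet restores the
 * full budget (the fixed behaviour).
 */
static bool model_call_survives(const char *events, int budget,
                                bool reset_on_progress)
{
    int remaining = budget;

    for (size_t i = 0; events[i] != '\0'; i++) {
        if (events[i] == 'P') {
            if (reset_on_progress)
                remaining = budget;   /* progress: start the timer over */
        } else {
            remaining--;              /* a stall consumes the budget */
            if (remaining <= 0)
                return false;         /* timed out: the call is aborted */
        }
    }
    return true;
}
```

For the sequence "SPSPS" with a budget of 2, the old behaviour aborts at the second stall although packets keep arriving, while the fixed behaviour only aborts when stalls occur back to back.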
-
- 19 Nov, 2019 3 commits
-
-
Committed by David Sterba
After previous patches removing bdev being passed around to set it to bio, it has become unused in submit_extent_page. So it now has "only" 13 parameters. Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by David Sterba
We can now remove the bdev from extent_map. Previous patches made sure that bio_set_dev is called correctly in all places and that we don't need to grab it from latest_bdev or pass it around inside the extent map. Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by David Sterba
bio_set_dev sets a bdev to a bio and is not only setting a pointer but also changing some state bits if there was a different bdev set before. This is one thing that's not needed. Another thing is that setting a bdev at bio allocation time is too early and actually does not work with plain redundancy profiles, where each time we submit a bio to a device, the bdev is set correctly. In many places the bio bdev is set to latest_bdev that seems to serve as a stub pointer "just to put something to bio". But we don't have to do that. Where do we know which bdev to set:
  * for regular IO: submit_stripe_bio that's called by btrfs_map_bio
  * repair IO: repair_io_failure, read or write from specific device
  * super block write (using buffer_heads but uses raw bdev) and barriers
  * scrub: this does not use all regular IO paths as it needs to reach all copies, verify and fixup eventually, and for that all bdev management is independent
  * raid56: rbio_add_io_page, for the RMW write
  * integrity-checker: does its own low-level block tracking
Signed-off-by: David Sterba <dsterba@suse.com>
-