- 12 4月, 2020 5 次提交
-
-
由 Pavel Begunkov 提交于
Requests initialisation is scattered across several functions, namely io_init_req(), io_submit_sqes(), io_submit_sqe(). Put it in io_init_req() for better data locality and code clarity. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
It's a good idea to not read sqe->flags twice, as it's prone to security bugs. Instead of passing it around, embeed them in req->flags. It's already so except for IOSQE_IO_LINK. 1. rename former REQ_F_LINK -> REQ_F_LINK_HEAD 2. introduce and copy REQ_F_LINK, which mimics IO_IOSQE_LINK And leave req_set_fail_links() using new REQ_F_LINK, because it's more sensible. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Having only one place for cleaning up a request after a link assembly/ submission failure will play handy in the future. At least it allows to remove duplicated cleanup sequence. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
As a preparation for extracting request init bits, remove self-coded mm tracking from io_submit_sqes(), but rely on current->mm. It's more convenient, than passing this piece of state in other functions. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
If io_submit_sqes() can't grab an mm, it fails and exits right away. There is no need to track the fact of the failure. Remove @mm_fault. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 11 4月, 2020 5 次提交
-
-
由 Vasily Averin 提交于
Patch series "seq_file .next functions should increase position index". In Aug 2018 NeilBrown noticed commit 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code and interface") "Some ->next functions do not increment *pos when they return NULL... Note that such ->next functions are buggy and should be fixed. A simple demonstration is dd if=/proc/swaps bs=1000 skip=1 Choose any block size larger than the size of /proc/swaps. This will always show the whole last line of /proc/swaps" Described problem is still actual. If you make lseek into middle of last output line following read will output end of last line and whole last line once again. $ dd if=/proc/swaps bs=1 # usual output Filename Type Size Used Priority /dev/dm-0 partition 4194812 97536 -2 104+0 records in 104+0 records out 104 bytes copied $ dd if=/proc/swaps bs=40 skip=1 # last line was generated twice dd: /proc/swaps: cannot skip to specified offset v/dm-0 partition 4194812 97536 -2 /dev/dm-0 partition 4194812 97536 -2 3+1 records in 3+1 records out 131 bytes copied There are lot of other affected files, I've found 30+ including /proc/net/ip_tables_matches and /proc/sysvipc/* I've sent patches into maillists of affected subsystems already, this patch-set fixes the problem in files related to pstore, tracing, gcov, sysvipc and other subsystems processed via linux-kernel@ mailing list directly https://bugzilla.kernel.org/show_bug.cgi?id=206283 This patch (of 4): Add debug code to seq_read() to detect missed or out-of-tree incorrect .next seq_file functions. [akpm@linux-foundation.org: s/pr_info/pr_info_ratelimited/, per Qian Cai] https://bugzilla.kernel.org/show_bug.cgi?id=206283Signed-off-by: NVasily Averin <vvs@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Cc: NeilBrown <neilb@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Peter Oberparleiter <oberpar@linux.ibm.com> Cc: Waiman Long <longman@redhat.com> Link: http://lkml.kernel.org/r/244674e5-760c-86bd-d08a-047042881748@virtuozzo.com Link: http://lkml.kernel.org/r/7c24087c-e280-e580-5b0c-0cdaeb14cd18@virtuozzo.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pali Rohár 提交于
For security reasons I stopped using gmail account and kernel address is now up-to-date alias to my personal address. People periodically send me emails to address which they found in source code of drivers, so this change reflects state where people can contact me. [ Added .mailmap entry as per Joe Perches - Linus ] Signed-off-by: NPali Rohár <pali@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Joe Perches <joe@perches.com> Link: http://lkml.kernel.org/r/20200307104237.8199-1-pali@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Eric Biggers 提交于
After request_module(), nothing is stopping the module from being unloaded until someone takes a reference to it via try_get_module(). The WARN_ONCE() in get_fs_type() is thus user-reachable, via userspace running 'rmmod' concurrently. Since WARN_ONCE() is for kernel bugs only, not for user-reachable situations, downgrade this warning to pr_warn_once(). Keep it printed once only, since the intent of this warning is to detect a bug in modprobe at boot time. Printing the warning more than once wouldn't really provide any useful extra information. Fixes: 41124db8 ("fs: warn in case userspace lied about modprobe return") Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJessica Yu <jeyu@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Jessica Yu <jeyu@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: NeilBrown <neilb@suse.com> Cc: <stable@vger.kernel.org> [4.13+] Link: http://lkml.kernel.org/r/20200312202552.241885-3-ebiggers@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Changwei Ge 提交于
Linux fallocate(2) with FALLOC_FL_PUNCH_HOLE mode set, its offset can exceed the inode size. Ocfs2 now doesn't allow that offset beyond inode size. This restriction is not necessary and violates fallocate(2) semantics. If fallocate(2) offset is beyond inode size, just return success and do nothing further. Otherwise, ocfs2 will crash the kernel. kernel BUG at fs/ocfs2//alloc.c:7264! ocfs2_truncate_inline+0x20f/0x360 [ocfs2] ocfs2_remove_inode_range+0x23c/0xcb0 [ocfs2] __ocfs2_change_file_space+0x4a5/0x650 [ocfs2] ocfs2_fallocate+0x83/0xa0 [ocfs2] vfs_fallocate+0x148/0x230 SyS_fallocate+0x48/0x80 do_syscall_64+0x79/0x170 Signed-off-by: NChangwei Ge <chge@linux.alibaba.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200407082754.17565-1-chge@linux.alibaba.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Simon Gander 提交于
When removing files containing extended attributes, the hfsplus driver may remove the wrong entries from the attributes b-tree, causing major filesystem damage and in some cases even kernel crashes. To remove a file, all its extended attributes have to be removed as well. The driver does this by looking up all keys in the attributes b-tree with the cnid of the file. Each of these entries then gets deleted using the key used for searching, which doesn't contain the attribute's name when it should. Since the key doesn't contain the name, the deletion routine will not find the correct entry and instead remove the one in front of it. If parent nodes have to be modified, these become corrupt as well. This causes invalid links and unsorted entries that not even macOS's fsck_hfs is able to fix. To fix this, modify the search key before an entry is deleted from the attributes b-tree by copying the found entry's key into the search key, therefore ensuring that the correct entry gets removed from the tree. Signed-off-by: NSimon Gander <simon@tuxera.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAnton Altaparmakov <anton@tuxera.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200327155541.1521-1-simon@tuxera.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 4月, 2020 2 次提交
-
-
由 Jens Axboe 提交于
We can't reliably wait in io_ring_ctx_wait_and_kill(), since the task_works list isn't ordered (in fact it's LIFO ordered). We could either fix this with a separate task_works list for io_uring work, or just punt the wait-and-free to async context. This ensures that task_work that comes in while we're shutting down is processed correctly. If we don't go async, we could have work past the fput() work for the ring that depends on work that won't be executed until after we're done with the wait-and-free. But as this operation is blocking, it'll never get a chance to run. This was reproduced with hundreds of thousands of sockets running memcached, haven't been able to reproduce this synthetically. Reported-by: NDan Melnic <dmm@fb.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Eric W. Biederman 提交于
syzbot wrote: > ======================================================== > WARNING: possible irq lock inversion dependency detected > 5.6.0-syzkaller #0 Not tainted > -------------------------------------------------------- > swapper/1/0 just changed the state of lock: > ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840 > but this lock took another, SOFTIRQ-unsafe lock in the past: > (&pid->wait_pidfd){+.+.}-{2:2} > > > and interrupts could create inverse lock ordering between them. > > > other info that might help us debug this: > Possible interrupt unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(&pid->wait_pidfd); > local_irq_disable(); > lock(tasklist_lock); > lock(&pid->wait_pidfd); > <Interrupt> > lock(tasklist_lock); > > *** DEADLOCK *** > > 4 locks held by swapper/1/0: The problem is that because wait_pidfd.lock is taken under the tasklist lock. It must always be taken with irqs disabled as tasklist_lock can be taken from interrupt context and if wait_pidfd.lock was already taken this would create a lock order inversion. Oleg suggested just disabling irqs where I have added extra calls to wait_pidfd.lock. That should be safe and I think the code will eventually do that. It was rightly pointed out by Christian that sharing the wait_pidfd.lock was a premature optimization. It is also true that my pre-merge window testing was insufficient. So remove the premature optimization and give struct pid a dedicated lock of it's own for struct pid things. I have verified that lockdep sees all 3 paths where we take the new pid->lock and lockdep does not complain. It is my current day dream that one day pid->lock can be used to guard the task lists as well and then the tasklist_lock won't need to be held to deliver signals. That will require taking pid->lock with irqs disabled. Acked-by: NChristian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/ Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com Fixes: 7bc3e6e5 ("proc: Use a list of inodes to flush from proc") Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
- 09 4月, 2020 1 次提交
-
-
由 Pavel Begunkov 提交于
If completion queue overflow occurs, __io_cqring_fill_event() will update req->cflags, which is in a union with req->work and happens to be aliased to req->work.fs. Following io_free_req() -> io_req_work_drop_env() may get a bunch of different problems (miscount fs->users, segfault, etc) on cleaning @fs. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 08 4月, 2020 26 次提交
-
-
由 Pavel Begunkov 提交于
Don't re-read userspace-shared sqe->flags, it can be exploited. sqe->flags are copied into req->flags in io_submit_sqe(), check them there instead. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
io_get_req() do two different things: io_kiocb allocation and initialisation. Move init part out of it and rename into io_alloc_req(). It's simpler this way and also have better data locality. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
As io_get_sqe() split into 2 stage get/consume, get an sqe before allocating io_kiocb, so no free_req*() for a failure case is needed, and inline back __io_req_do_free(), which has only 1 user. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Pavel Begunkov 提交于
Make io_get_sqring() care only about sqes themselves, not initialising the io_kiocb. Also, split it into get + consume, that will be helpful in the future. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Xiaoguang Wang 提交于
In io_read_prep() or io_write_prep(), io_req_map_rw() takes struct io_async_rw's fast_iov as argument to call io_import_iovec(), and if io_import_iovec() uses struct io_async_rw's fast_iov as valid iovec array, later indeed io_req_map_rw() does not need to do the memcpy operation, because they are same pointers. Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Jens Axboe 提交于
OPENAT2 correctly sets O_LARGEFILE if it has to, but that escaped the OPENAT opcode. Dmitry reports that his test case that compares openat() and IORING_OP_OPENAT sees failures on large files: *** sync openat openat succeeded sync write at offset 0 write succeeded sync write at offset 4294967296 write succeeded *** sync openat openat succeeded io_uring write at offset 0 write succeeded io_uring write at offset 4294967296 write succeeded *** io_uring openat openat succeeded sync write at offset 0 write succeeded sync write at offset 4294967296 write failed: File too large *** io_uring openat openat succeeded io_uring write at offset 0 write succeeded io_uring write at offset 4294967296 write failed: File too large Ensure we set O_LARGEFILE, if force_o_largefile() is true. Cc: stable@vger.kernel.org # v5.6 Fixes: 15b71abe ("io_uring: add support for IORING_OP_OPENAT") Reported-by: NDmitry Kadashev <dkadashev@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Mike Marshall 提交于
Christoph Hellwig noticed that we were doing some unnecessary work in orangefs_flush: orangefs_flush just writes out data on every close(2) call. There is no need to change anything about the dirty state, especially as orangefs doesn't treat I_DIRTY_TIMES special in any way. The code seems to come from partially open coding vfs_fsync. He sent in a patch with the above commit message and also a patch that was a reversion of another Orangefs patch I had sent upstream a while ago. I had to fix his reversion patch so that it would compile which caused his "don't mess with I_DIRTY_TIMES" patch to fail to apply. So here I have just remade his patch and applied it after the fixed reversion patch. Signed-off-by: NMike Marshall <hubcap@omnibond.com>
-
由 Mike Marshall 提交于
Christoph Hellwig sent in a reversion of "orangefs: remember count when reading." because: ->read_iter calls can race with each other and one or more ->flush calls. Remove the the scheme to store the read count in the file private data as is is completely racy and can cause use after free or double free conditions Christoph's reversion caused Orangefs not to work or to compile. I added a patch that fixed that, but intel's kbuild test robot pointed out that sending Christoph's patch followed by my patch upstream, it would break bisection because of the failure to compile. So I have combined the reversion plus my patch... here's the commit message that was in my patch: Logically, optimal Orangefs "pages" are 4 megabytes. Reading large Orangefs files 4096 bytes at a time is like trying to kick a dead whale down the beach. Before Christoph's "Revert orangefs: remember count when reading." I tried to give users a knob whereby they could, for example, use "count" in read(2) or bs with dd(1) to get whatever they considered an appropriate amount of bytes at a time from Orangefs and fill as many page cache pages as they could at once. Without the racy code that Christoph reverted Orangefs won't even compile, much less work. So this replaces the logic that used the private file data that Christoph reverted with a static number of bytes to read from Orangefs. I ran tests like the following to determine what a reasonable static number of bytes might be: dd if=/pvfsmnt/asdf of=/dev/null count=128 bs=4194304 dd if=/pvfsmnt/asdf of=/dev/null count=256 bs=2097152 dd if=/pvfsmnt/asdf of=/dev/null count=512 bs=1048576 . . . dd if=/pvfsmnt/asdf of=/dev/null count=4194304 bs=128 Reads seem faster using the static number, so my "knob code" wasn't just racy, it wasn't even a good idea... Signed-off-by: NMike Marshall <hubcap@omnibond.com> Reported-by: Nkbuild test robot <lkp@intel.com>
-
由 Colin Ian King 提交于
There are several places where code is indented incorrectly. Fix these. Signed-off-by: NColin Ian King <colin.king@canonical.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200325135018.113431-1-colin.king@canonical.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexey Dobriyan 提交于
Static executables don't need to free NULL pointer. It doesn't matter really because static executable is not common scenario but do it anyway out of pedantry. Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200219185330.GA4933@avx2Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexey Dobriyan 提交于
PT_INTERP ELF header can be spared if executable is static. Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200219185012.GB4871@avx2Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexey Dobriyan 提交于
"loc" variable became just a wrapper for PT_INTERP ELF header after main ELF header was moved to "bprm->buf". Delete it. Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200219184847.GA4871@avx2Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jason Baron 提交于
Davidlohr Bueso pointed out that when CONFIG_DEBUG_LOCK_ALLOC is set ep_poll_safewake() can take several non-raw spinlocks after disabling interrupts. Since a spinlock can block in the -rt kernel, we can't take a spinlock after disabling interrupts. So let's re-work how we determine the nesting level such that it plays nicely with the -rt kernel. Let's introduce a 'nests' field in struct eventpoll that records the current nesting level during ep_poll_callback(). Then, if we nest again we can find the previous struct eventpoll that we were called from and increase our count by 1. The 'nests' field is protected by ep->poll_wait.lock. I've also moved the visited field to reduce the size of struct eventpoll from 184 bytes to 176 bytes on x86_64 for !CONFIG_DEBUG_LOCK_ALLOC, which is typical for a production config. Reported-by: NDavidlohr Bueso <dbueso@suse.de> Signed-off-by: NJason Baron <jbaron@akamai.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NDavidlohr Bueso <dbueso@suse.de> Cc: Roman Penyaev <rpenyaev@suse.de> Cc: Eric Wong <normalperson@yhbt.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/1582739816-13167-1-git-send-email-jbaron@akamai.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox (Oracle) 提交于
It's clearer to just put this inline. Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200317193201.9924-5-adobriyan@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox (Oracle) 提交于
The process maps file was the only user of version (introduced back in 2005). Now that it uses ppos instead, we can remove it. Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200317193201.9924-4-adobriyan@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox (Oracle) 提交于
The ppos is a private cursor, just like m->version. Use the canonical cursor, not a special one. Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200317193201.9924-3-adobriyan@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox (Oracle) 提交于
Instead of setting m->version in the show method, set it in m_next(), where it should be. Also remove the fallback code for failing to find a vma, or version being zero. Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200317193201.9924-2-adobriyan@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matthew Wilcox (Oracle) 提交于
Instead of calling vma_stop() from m_start() and m_next(), do its work in m_stop(). Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200317193201.9924-1-adobriyan@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexey Dobriyan 提交于
top(1) reads all /proc/*/statm files but kernel threads will always have zeros. Print those zeroes directly without going through seq_put_decimal_ull(). Speed up reading /proc/2/statm (which is kthreadd) is like 3%. My system has more kernel threads than normal processes after booting KDE. Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200307154435.GA2788@avx2Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexey Dobriyan 提交于
Now that "struct proc_ops" exist we can start putting there stuff which could not fly with VFS "struct file_operations"... Most of fs/proc/inode.c file is dedicated to make open/read/.../close reliable in the event of disappearing /proc entries which usually happens if module is getting removed. Files like /proc/cpuinfo which never disappear simply do not need such protection. Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such "permanent" files. Enable "permanent" flag for /proc/cpuinfo /proc/kmsg /proc/modules /proc/slabinfo /proc/stat /proc/sysvipc/* /proc/swaps More will come once I figure out foolproof way to prevent out module authors from marking their stuff "permanent" for performance reasons when it is not. This should help with scalability: benchmark is "read /proc/cpuinfo R times by N threads scattered over the system". N R t, s (before) t, s (after) ----------------------------------------------------- 64 4096 1.582458 1.530502 -3.2% 256 4096 6.371926 6.125168 -3.9% 1024 4096 25.64888 24.47528 -4.6% Benchmark source: #include <chrono> #include <iostream> #include <thread> #include <vector> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN); int N; const char *filename; int R; int xxx = 0; int glue(int n) { cpu_set_t m; CPU_ZERO(&m); CPU_SET(n, &m); return sched_setaffinity(0, sizeof(cpu_set_t), &m); } void f(int n) { glue(n % NR_CPUS); while (*(volatile int *)&xxx == 0) { } for (int i = 0; i < R; i++) { int fd = open(filename, O_RDONLY); char buf[4096]; ssize_t rv = read(fd, buf, sizeof(buf)); asm volatile ("" :: "g" (rv)); close(fd); } } int main(int argc, char *argv[]) { if (argc < 4) { std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R "; return 1; } N = atoi(argv[1]); filename = argv[2]; R = atoi(argv[3]); for (int i = 0; i < NR_CPUS; i++) { if (glue(i) == 0) break; } std::vector<std::thread> T; T.reserve(N); for (int i = 0; i < N; i++) { T.emplace_back(f, i); } auto t0 = std::chrono::system_clock::now(); { *(volatile int *)&xxx = 1; for (auto& t: T) { t.join(); } } auto t1 = std::chrono::system_clock::now(); std::chrono::duration<double> dt = t1 - t0; std::cout << dt.count() << ' '; return 0; } P.S.: Explicit randomization marker is added because adding non-function pointer will silently disable structure layout randomization. [akpm@linux-foundation.org: coding style fixes] Reported-by: Nkbuild test robot <lkp@intel.com> Reported-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Joe Perches <joe@perches.com> Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jules Irenge 提交于
Fix sparse locking imbalance warning: warning: context imbalance in close_pdeo() - unexpected unlock Signed-off-by: NJules Irenge <jbi.octave@gmail.com> Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200227201538.GA30462@avx2Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Xu 提交于
Only declare _UFFDIO_WRITEPROTECT if the user specified UFFDIO_REGISTER_MODE_WP and if all the checks passed. Then when the user registers regions with shmem/hugetlbfs we won't expose the new ioctl to them. Even with complete anonymous memory range, we'll only expose the new WP ioctl bit if the register mode has MODE_WP. Signed-off-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-18-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Peter Xu 提交于
It does not make sense to try to wake up any waiting thread when we're write-protecting a memory region. Only wake up when resolving a write protected page fault. Signed-off-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-16-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
Introduce the new uffd-wp APIs for userspace. Firstly, we'll allow to do UFFDIO_REGISTER with write protection tracking using the new UFFDIO_REGISTER_MODE_WP flag. Note that this flag can co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in which case the userspace program can not only resolve missing page faults, and at the same time tracking page data changes along the way. Secondly, we introduced the new UFFDIO_WRITEPROTECT API to do page level write protection tracking. Note that we will need to register the memory region with UFFDIO_REGISTER_MODE_WP before that. [peterx@redhat.com: write up the commit message] [peterx@redhat.com: remove useless block, write commit message, check against VM_MAYWRITE rather than VM_WRITE when register] Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJerome Glisse <jglisse@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-14-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
This allows UFFDIO_COPY to map pages write-protected. [peterx@redhat.com: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and commit messages] Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NJerome Glisse <jglisse@redhat.com> Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-6-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Anshuman Khandual 提交于
This replaces all remaining open encodings with is_vm_hugetlb_page(). Signed-off-by: NAnshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Will Deacon <will@kernel.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Nick Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guo Ren <guoren@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Paul Burton <paulburton@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/1582520593-30704-4-git-send-email-anshuman.khandual@arm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 4月, 2020 1 次提交
-
-
由 Xiaoguang Wang 提交于
syzbot reports below warning: INFO: trying to register non-static key. the code is fine but needs lockdep annotation. turning off the locking correctness validator. CPU: 1 PID: 7099 Comm: syz-executor897 Not tainted 5.6.0-next-20200406-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x188/0x20d lib/dump_stack.c:118 assign_lock_key kernel/locking/lockdep.c:913 [inline] register_lock_class+0x1664/0x1760 kernel/locking/lockdep.c:1225 __lock_acquire+0x104/0x4e00 kernel/locking/lockdep.c:4223 lock_acquire+0x1f2/0x8f0 kernel/locking/lockdep.c:4923 __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline] _raw_spin_lock_irqsave+0x8c/0xbf kernel/locking/spinlock.c:159 io_sqe_files_register fs/io_uring.c:6599 [inline] __io_uring_register+0x1fe8/0x2f00 fs/io_uring.c:8001 __do_sys_io_uring_register fs/io_uring.c:8081 [inline] __se_sys_io_uring_register fs/io_uring.c:8063 [inline] __x64_sys_io_uring_register+0x192/0x560 fs/io_uring.c:8063 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x440289 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffff1bbf558 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440289 RDX: 0000000020000280 RSI: 0000000000000002 RDI: 0000000000000003 RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8 R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000401b10 R13: 0000000000401ba0 R14: 0000000000000000 R15: 0000000000000000 Initialize struct fixed_file_data's lock to fix this issue. Reported-by: syzbot+e6eeca4a035da76b3065@syzkaller.appspotmail.com Fixes: 05589553 ("io_uring: refactor file register/unregister/update handling") Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-