- 19 5月, 2018 3 次提交
-
-
由 Arnd Bergmann 提交于
In a move to make ktime_get_*() the preferred driver interface into the timekeeping code, sanitizes ktime_get_real_ts64() to be a proper exported symbol rather than an alias for getnstimeofday64(). The internal __getnstimeofday64() is no longer used, so remove that and merge it into ktime_get_real_ts64(). Signed-off-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: y2038@lists.linaro.org Cc: John Stultz <john.stultz@linaro.org> Link: https://lkml.kernel.org/r/20180427134016.2525989-3-arnd@arndb.de
-
由 Arnd Bergmann 提交于
At this point, we have converted most of the kernel to use timespec64 consistently in place of timespec, so it seems it's time to make timespec64 the native structure and define timespec in terms of that one on 64-bit architectures. Starting with gcc-5, the compiler can completely optimize away the timespec_to_timespec64 and timespec64_to_timespec functions on 64-bit architectures. With older compilers, we introduce a couple of extra copies of local variables, but those are easily avoided by using the timespec64 based interfaces consistently, as we do in most of the important code paths already. The main upside of removing the hack is that printing the tv_sec field of a timespec64 structure can now use the %lld format string on all architectures without a cast to time64_t. Without this patch, the field is a 'long' type and would have to be printed using %ld on 64-bit architectures. Signed-off-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Stephen Boyd <sboyd@kernel.org> Cc: y2038@lists.linaro.org Cc: John Stultz <john.stultz@linaro.org> Link: https://lkml.kernel.org/r/20180427134016.2525989-2-arnd@arndb.de
-
由 Souptick Joarder 提交于
Many places in drivers/ file systems, error was handled in a common way like below: ret = (ret == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS; vmf_error() will replace this and return vm_fault_t type err. A lot of drivers and filesystems currently have a rather complex mapping of errno-to-VM_FAULT code. We have been able to eliminate a lot of it by just returning VM_FAULT codes directly from functions which are called exclusively from the fault handling path. Some functions can be called both from the fault handler and other context which are expecting an errno, so they have to continue to return an errno. Some users still need to choose different behaviour for different errnos, but vmf_error() captures the essential error translation that's common to all users, and those that need to handle additional errors can handle them first. Link: http://lkml.kernel.org/r/20180510174826.GA14268@jordon-HP-15-Notebook-PCSigned-off-by: NSouptick Joarder <jrdr.linux@gmail.com> Reviewed-by: NMatthew Wilcox <mawilcox@microsoft.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 18 5月, 2018 1 次提交
-
-
由 Willy Tarreau 提交于
proc_pid_cmdline_read() and environ_read() directly access the target process' VM to retrieve the command line and environment. If this process remaps these areas onto a file via mmap(), the requesting process may experience various issues such as extra delays if the underlying device is slow to respond. Let's simply refuse to access file-backed areas in these functions. For this we add a new FOLL_ANON gup flag that is passed to all calls to access_remote_vm(). The code already takes care of such failures (including unmapped areas). Accesses via /proc/pid/mem were not changed though. This was assigned CVE-2018-1120. Note for stable backports: the patch may apply to kernels prior to 4.11 but silently miss one location; it must be checked that no call to access_remote_vm() keeps zero as the last argument. Reported-by: NQualys Security Advisory <qsa@qualys.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: NWilly Tarreau <w@1wt.eu> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 5月, 2018 1 次提交
-
-
由 Geert Uytterhoeven 提交于
The __DIVIDE() macro checks whether it is called with a 32-bit or 64-bit dividend, to select the appropriate divide-and-round-up routine. As the check uses the ternary operator, the result will always be promoted to a type that can hold both results, i.e. unsigned long long. When using this result in a division on a 32-bit system, this may lead to link errors like: ERROR: "__udivdi3" [drivers/mtd/nand/raw/nand.ko] undefined! Fix this by casting the result of the division to the type of the dividend. Fixes: 8878b126 ("mtd: nand: add ->exec_op() implementation") Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NBoris Brezillon <boris.brezillon@bootlin.com>
-
- 14 5月, 2018 1 次提交
-
-
由 Ben Hutchings 提交于
Commit 9e343e87 ("mtd: cfi: convert inline functions to macros") changed map_word_andequal() into a macro, but also changed the right hand side of the comparison from val3 to val2. Change it back to use val3 on the right hand side. Thankfully this did not cause a regression because all callers currently pass the same argument for val2 and val3. Fixes: 9e343e87 ("mtd: cfi: convert inline functions to macros") Signed-off-by: NBen Hutchings <ben@decadent.org.uk> Signed-off-by: NBoris Brezillon <boris.brezillon@bootlin.com>
-
- 12 5月, 2018 2 次提交
-
-
Since commit c1adf200 ("Introduce rb_replace_node_rcu()") rbtree_augmented.h uses RCU related data structures but does not include the header file. It works as long as it gets somehow included before that and fails otherwise. Link: http://lkml.kernel.org/r/20180504103159.19938-1-bigeasy@linutronix.deSigned-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
Since exit_mmap() is done without the protection of mm->mmap_sem, it is possible for the oom reaper to concurrently operate on an mm until MMF_OOM_SKIP is set. This allows munlock_vma_pages_all() to concurrently run while the oom reaper is operating on a vma. Since munlock_vma_pages_range() depends on clearing VM_LOCKED from vm_flags before actually doing the munlock to determine if any other vmas are locking the same memory, the check for VM_LOCKED in the oom reaper is racy. This is especially noticeable on architectures such as powerpc where clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd is zapped by the oom reaper during follow_page_mask() after the check for pmd_none() is bypassed, this ends up deferencing a NULL ptl or a kernel oops. Fix this by manually freeing all possible memory from the mm before doing the munlock and then setting MMF_OOM_SKIP. The oom reaper can not run on the mm anymore so the munlock is safe to do in exit_mmap(). It also matches the logic that the oom reaper currently uses for determining when to set MMF_OOM_SKIP itself, so there's no new risk of excessive oom killing. This issue fixes CVE-2018-1000200. Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com Fixes: 21292580 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") Signed-off-by: NDavid Rientjes <rientjes@google.com> Suggested-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [4.14+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 5月, 2018 1 次提交
-
-
由 Wanpeng Li 提交于
Our virtual machines make use of device assignment by configuring 12 NVMe disks for high I/O performance. Each NVMe device has 129 MSI-X Table entries: Capabilities: [50] MSI-X: Enable+ Count=129 Masked-Vector table: BAR=0 offset=00002000 The windows virtual machines fail to boot since they will map the number of MSI-table entries that the NVMe hardware reported to the bus to msi routing table, this will exceed the 1024. This patch extends MAX_IRQ_ROUTES to 4096 for all archs, in the future this might be extended again if needed. Reviewed-by: NCornelia Huck <cohuck@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim KrÄmář <rkrcmar@redhat.com> Cc: Cornelia Huck <cohuck@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: NWanpeng Li <wanpengli@tencent.com> Signed-off-by: NTonny Lu <tonnylu@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 10 5月, 2018 1 次提交
-
-
由 Ilya Dryomov 提交于
... and store num_bvecs for client code's convenience. Signed-off-by: NIlya Dryomov <idryomov@gmail.com> Reviewed-by: NJeff Layton <jlayton@redhat.com> Reviewed-by: N"Yan, Zheng" <zyan@redhat.com>
-
- 05 5月, 2018 1 次提交
-
-
由 Bhadram Varka 提交于
It adds support for BCM89610 (Single-Port 10/100/1000BASE-T) transceiver which is used in P3310 Tegra186 platform. Signed-off-by: NBhadram Varka <vbhadram@nvidia.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 5月, 2018 1 次提交
-
-
由 Peter Zijlstra 提交于
Gaurav reported a perceived problem with TASK_PARKED, which turned out to be a broken wait-loop pattern in __kthread_parkme(), but the reported issue can (and does) in fact happen for states that do not do condition based sleeps. When the 'current->state = TASK_RUNNING' store of a previous (concurrent) try_to_wake_up() collides with the setting of a 'special' sleep state, we can loose the sleep state. Normal condition based wait-loops are immune to this problem, but for sleep states that are not condition based are subject to this problem. There already is a fix for TASK_DEAD. Abstract that and also apply it to TASK_STOPPED and TASK_TRACED, both of which are also without condition based wait-loop. Reported-by: NGaurav Kohli <gkohli@codeaurora.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NOleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 03 5月, 2018 2 次提交
-
-
由 Tetsuo Handa 提交于
syzbot is reporting hung tasks at wait_on_bit(WB_shutting_down) in wb_shutdown() [1]. This seems to be because commit 5318ce7d ("bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()") forgot to call wake_up_bit(WB_shutting_down) after clear_bit(WB_shutting_down). Introduce a helper function clear_and_wake_up_bit() and use it, in order to avoid similar errors in future. [1] https://syzkaller.appspot.com/bug?id=b297474817af98d5796bc544e1bb806fc3da0e5eSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Nsyzbot <syzbot+c0cf869505e03bdf1a24@syzkaller.appspotmail.com> Fixes: 5318ce7d ("bdi: Shutdown writeback on all cgwbs in cgwb_bdi_destroy()") Cc: Tejun Heo <tj@kernel.org> Reviewed-by: NJan Kara <jack@suse.cz> Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Peter Zijlstra 提交于
Even with the wait-loop fixed, there is a further issue with kthread_parkme(). Upon hotplug, when we do takedown_cpu(), smpboot_park_threads() can return before all those threads are in fact blocked, due to the placement of the complete() in __kthread_parkme(). When that happens, sched_cpu_dying() -> migrate_tasks() can end up migrating such a still runnable task onto another CPU. Normally the task will have hit schedule() and gone to sleep by the time we do kthread_unpark(), which will then do __kthread_bind() to re-bind the task to the correct CPU. However, when we loose the initial TASK_PARKED store to the concurrent wakeup issue described previously, do the complete(), get migrated, it is possible to either: - observe kthread_unpark()'s clearing of SHOULD_PARK and terminate the park and set TASK_RUNNING, or - __kthread_bind()'s wait_task_inactive() to observe the competing TASK_RUNNING store. Either way the WARN() in __kthread_bind() will trigger and fail to correctly set the CPU affinity. Fix this by only issuing the complete() when the kthread has scheduled out. This does away with all the icky 'still running' nonsense. The alternative is to promote TASK_PARKED to a special state, this guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING and we'll end up doing the right thing, but this preserves the whole icky business of potentially migating the still runnable thing. Reported-by: NGaurav Kohli <gkohli@codeaurora.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 29 4月, 2018 1 次提交
-
-
由 Amir Goldstein 提交于
The comment claims that this helper will try not to loose bits, but for 64bit long it looses the high bits before hashing 64bit long into 32bit int. Use the helper hash_long() to do the right thing for 64bit long. For 32bit long, there is no change. All the callers of end_name_hash() either assign the result to qstr->hash, which is u32 or return the result as an int value (e.g. full_name_hash()). Change the helper return type to int to conform to its users. [ It took me a while to apply this, because my initial reaction to it was - incorrectly - that it could make for slower code. After having looked more at it, I take back all my complaints about the patch, Amir was right and I was mis-reading things or just being stupid. I also don't worry too much about the possible performance impact of this on 64-bit, since most architectures that actually care about performance end up not using this very much (the dcache code is the most performance-critical, but the word-at-a-time case uses its own hashing anyway). So this ends up being mostly used for filesystems that do their own degraded hashing (usually because they want a case-insensitive comparison function). A _tiny_ worry remains, in that not everybody uses DCACHE_WORD_ACCESS, and then this potentially makes things more expensive on 64-bit architectures with slow or lacking multipliers even for the normal case. That said, realistically the only such architecture I can think of is PA-RISC. Nobody really cares about performance on that, it's more of a "look ma, I've got warts^W an odd machine" platform. So the patch is fine, and all my initial worries were just misplaced from not looking at this properly. - Linus ] Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 4月, 2018 2 次提交
-
-
由 Joel Pepper 提交于
The composite framework allows us to create gadgets composed from many different functions, which need to fit into a single configuration descriptor. Some functions (like uvc) can produce configuration descriptors upwards of 2500 bytes on their own. This patch increases the limit from 1024 bytes to 4096. Signed-off-by: NJoel Pepper <joel.pepper@rwth-aachen.de> Signed-off-by: NFelipe Balbi <felipe.balbi@linux.intel.com>
-
由 Israel Rukshin 提交于
Adding the vector offset when calling to mlx5_vector2eqn() is wrong. This is because mlx5_vector2eqn() checks if EQ index is equal to vector number and the fact that the internal completion vectors that mlx5 allocates don't get an EQ index. The second problem here is that using effective_affinity_mask gives the same CPU for different vectors. This leads to unmapped queues when calling it from blk_mq_rdma_map_queues(). This doesn't happen when using affinity_hint mask. Fixes: 2572cf57 ("mlx5: fix mlx5_get_vector_affinity to start from completion vector 0") Fixes: 05e0cc84 ("net/mlx5: Fix get vector affinity helper function") Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
-
- 26 4月, 2018 4 次提交
-
-
由 Omar Sandoval 提交于
When the blk-mq inflight implementation was added, /proc/diskstats was converted to use it, but /sys/block/$dev/inflight was not. Fix it by adding another helper to count in-flight requests by data direction. Fixes: f299b7c7 ("blk-mq: provide internal in-flight variant") Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Thomas Gleixner 提交于
Revert commits 92af4dcb ("tracing: Unify the "boot" and "mono" tracing clocks") 127bfa5f ("hrtimer: Unify MONOTONIC and BOOTTIME clock behavior") 7250a404 ("posix-timers: Unify MONOTONIC and BOOTTIME clock behavior") d6c7270e ("timekeeping: Remove boot time specific code") f2d6fdbf ("Input: Evdev - unify MONOTONIC and BOOTTIME clock behavior") d6ed449a ("timekeeping: Make the MONOTONIC clock behave like the BOOTTIME clock") 72199320 ("timekeeping: Add the new CLOCK_MONOTONIC_ACTIVE clock") As stated in the pull request for the unification of CLOCK_MONOTONIC and CLOCK_BOOTTIME, it was clear that we might have to revert the change. As reported by several folks systemd and other applications rely on the documented behaviour of CLOCK_MONOTONIC on Linux and break with the above changes. After resume daemons time out and other timeout related issues are observed. Rafael compiled this list: * systemd kills daemons on resume, after >WatchdogSec seconds of suspending (Genki Sky). [Verified that that's because systemd uses CLOCK_MONOTONIC and expects it to not include the suspend time.] * systemd-journald misbehaves after resume: systemd-journald[7266]: File /var/log/journal/016627c3c4784cd4812d4b7e96a34226/system.journal corrupted or uncleanly shut down, renaming and replacing. (Mike Galbraith). * NetworkManager reports "networking disabled" and networking is broken after resume 50% of the time (Pavel). [May be because of systemd.] * MATE desktop dims the display and starts the screensaver right after system resume (Pavel). * Full system hang during resume (me). [May be due to systemd or NM or both.] That happens on debian and open suse systems. It's sad, that these problems were neither catched in -next nor by those folks who expressed interest in this change. Reported-by: NRafael J. Wysocki <rjw@rjwysocki.net> Reported-by: Genki Sky <sky@genki.is>, Reported-by: NPavel Machek <pavel@ucw.cz> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kevin Easton <kevin@guarana.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org>
-
由 Arnaud Pouliquen 提交于
Fix rproc_add_subdev parameter name and inverse the crashed logic. Fixes: 880f5b38 ("remoteproc: Pass type of shutdown to subdev remove") Reviewed-by: NAlex Elder <elder@linaro.org> Signed-off-by: NArnaud Pouliquen <arnaud.pouliquen@st.com> Signed-off-by: NBjorn Andersson <bjorn.andersson@linaro.org>
-
由 Michael S. Tsirkin 提交于
For cleanup it's helpful to be able to simply scan all vqs and discard all data. Add an iterator to do that. Cc: stable@vger.kernel.org Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
- 25 4月, 2018 2 次提交
-
-
由 Linus Walleij 提交于
As it came up in discussion on the mailing list that the semantic meaning of 'blk_mq_ctx' and 'blk_mq_hw_ctx' isn't completely obvious to everyone, let's add some minimal kerneldoc for a starter. Signed-off-by: NLinus Walleij <linus.walleij@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Florian Fainelli 提交于
While adding support for ethtool::get_fecparam and set_fecparam, kernel doc for these functions was missed, add those. Fixes: 1a5f3da2 ("net: ethtool: add support for forward error correction modes") Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com> Acked-by: NRoopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 24 4月, 2018 3 次提交
-
-
由 Joakim Tjernlund 提交于
Currently it is possible to read and/or write to suspend EB's. Writing /dev/mtdX or /dev/mtdblockX from several processes may break the flash state machine. Signed-off-by: NJoakim Tjernlund <joakim.tjernlund@infinera.com> Cc: <stable@vger.kernel.org> Reviewed-by: NRichard Weinberger <richard@nod.at> Signed-off-by: NBoris Brezillon <boris.brezillon@bootlin.com>
-
由 John Fastabend 提交于
Relying on map_release hook to decrement the reference counts when a map is removed only works if the map is not being pinned. In the pinned case the ref is decremented immediately and the BPF programs released. After this BPF programs may not be in-use which is not what the user would expect. This patch moves the release logic into bpf_map_put_uref() and brings sockmap in-line with how a similar case is handled in prog array maps. Fixes: 3d9e9526 ("bpf: sockmap, fix leaking maps with attached but not detached progs") Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
由 Roman Gushchin 提交于
Running bpf programs requires disabled preemption, however at least some* of the BPF_PROG_RUN_ARRAY users do not follow this rule. To fix this bug, and also to make it not happen in the future, let's add explicit preemption disabling/re-enabling to the __BPF_PROG_RUN_ARRAY code. * for example: [ 17.624472] RIP: 0010:__cgroup_bpf_run_filter_sk+0x1c4/0x1d0 ... [ 17.640890] inet6_create+0x3eb/0x520 [ 17.641405] __sock_create+0x242/0x340 [ 17.641939] __sys_socket+0x57/0xe0 [ 17.642370] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 17.642944] SyS_socket+0xa/0x10 [ 17.643357] do_syscall_64+0x79/0x220 [ 17.643879] entry_SYSCALL_64_after_hwframe+0x42/0xb7 Signed-off-by: NRoman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 23 4月, 2018 3 次提交
-
-
由 Hans de Goede 提交于
Move the declarations of functions from vboxguest_utils.c which are only meant for vboxguest internal use from include/linux/vbox_utils.h to drivers/virt/vboxguest/vboxguest_core.h. Cc: stable@vger.kernel.org Signed-off-by: NHans de Goede <hdegoede@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Tetsuo Handa 提交于
syzbot is reporting kernel panic [1] triggered by memory allocation failure at tty_ldisc_get() from tty_ldisc_init(). But since both tty_ldisc_get() and caller of tty_ldisc_init() can cleanly handle errors, tty_ldisc_init() does not need to call panic() when tty_ldisc_get() failed. [1] https://syzkaller.appspot.com/bug?id=883431818e036ae6a9981156a64b821110f39187Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Slaby <jslaby@suse.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Daniel Kurtz 提交于
Commit 99492c39 ("earlycon: Fix __earlycon_table stride") tried to fix __earlycon_table stride by forcing the earlycon_id struct alignment to 32 and asking the linker to 32-byte align the __earlycon_table symbol. This fix was based on commit 07fca0e5 ("tracing: Properly align linker defined symbols") which tried a similar fix for the tracing subsystem. However, this fix doesn't quite work because there is no guarantee that gcc will place structures packed into an array format. In fact, gcc 4.9 chooses to 64-byte align these structs by inserting additional padding between the entries because it has no clue that they are supposed to be in an array. If we are unlucky, the linker will assign symbol "__earlycon_table" to a 32-byte aligned address which does not correspond to the 64-byte aligned contents of section "__earlycon_table". To address this same problem, the fix to the tracing system was subsequently re-implemented using a more robust table of pointers approach by commits: 3d56e331 ("tracing: Replace syscall_meta_data struct array with pointer array") 65498646 ("tracepoints: Fix section alignment using pointer array") e4a9ea5e ("tracing: Replace trace_event struct array with pointer array") Let's use this same "array of pointers to structs" approach for EARLYCON_TABLE. Fixes: 99492c39 ("earlycon: Fix __earlycon_table stride") Signed-off-by: NDaniel Kurtz <djkurtz@chromium.org> Suggested-by: NAaron Durbin <adurbin@chromium.org> Reviewed-by: NRob Herring <robh@kernel.org> Tested-by: NGuenter Roeck <groeck@chromium.org> Reviewed-by: NGuenter Roeck <groeck@chromium.org> Cc: stable <stable@vger.kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 21 4月, 2018 3 次提交
-
-
由 Andrey Konovalov 提交于
KASAN uses the __no_sanitize_address macro to disable instrumentation of particular functions. Right now it's defined only for GCC build, which causes false positives when clang is used. This patch adds a definition for clang. Note, that clang's revision 329612 or higher is required. [andreyknvl@google.com: remove redundant #ifdef CONFIG_KASAN check] Link: http://lkml.kernel.org/r/c79aa31a2a2790f6131ed607c58b0dd45dd62a6c.1523967959.git.andreyknvl@google.com Link: http://lkml.kernel.org/r/4ad725cc903f8534f8c8a60f0daade5e3d674f8d.1523554166.git.andreyknvl@google.comSigned-off-by: NAndrey Konovalov <andreyknvl@google.com> Acked-by: NAndrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Paul Lawrence <paullawrence@google.com> Cc: Sandipan Das <sandipan@linux.vnet.ibm.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Greg Thelen 提交于
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if the page's memcg is undergoing move accounting, which occurs when a process leaves its memcg for a new one that has memory.move_charge_at_immigrate set. unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the given inode is switching writeback domains. Switches occur when enough writes are issued from a new domain. This existing pattern is thus suspicious: lock_page_memcg(page); unlocked_inode_to_wb_begin(inode, &locked); ... unlocked_inode_to_wb_end(inode, locked); unlock_page_memcg(page); If both inode switch and process memcg migration are both in-flight then unlocked_inode_to_wb_end() will unconditionally enable interrupts while still holding the lock_page_memcg() irq spinlock. This suggests the possibility of deadlock if an interrupt occurs before unlock_page_memcg(). truncate __cancel_dirty_page lock_page_memcg unlocked_inode_to_wb_begin unlocked_inode_to_wb_end <interrupts mistakenly enabled> <interrupt> end_page_writeback test_clear_page_writeback lock_page_memcg <deadlock> unlock_page_memcg Due to configuration limitations this deadlock is not currently possible because we don't mix cgroup writeback (a cgroupv2 feature) and memory.move_charge_at_immigrate (a cgroupv1 feature). If the kernel is hacked to always claim inode switching and memcg moving_account, then this script triggers lockup in less than a minute: cd /mnt/cgroup/memory mkdir a b echo 1 > a/memory.move_charge_at_immigrate echo 1 > b/memory.move_charge_at_immigrate ( echo $BASHPID > a/cgroup.procs while true; do dd if=/dev/zero of=/mnt/big bs=1M count=256 done ) & while true; do sync done & sleep 1h & SLEEP=$! while true; do echo $SLEEP > a/cgroup.procs echo $SLEEP > b/cgroup.procs done The deadlock does not seem possible, so it's debatable if there's any reason to modify the kernel. I suggest we should to prevent future surprises. And Wang Long said "this deadlock occurs three times in our environment", so there's more reason to apply this, even to stable. Stable 4.4 has minor conflicts applying this patch. For a clean 4.4 patch see "[PATCH for-4.4] writeback: safer lock nesting" https://lkml.org/lkml/2018/4/11/146 Wang Long said "this deadlock occurs three times in our environment" [gthelen@google.com: v4] Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com [akpm@linux-foundation.org: comment tweaks, struct initialization simplification] Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613 Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com Fixes: 682aa8e1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates") Signed-off-by: NGreg Thelen <gthelen@google.com> Reported-by: NWang Long <wanglong19@meituan.com> Acked-by: NWang Long <wanglong19@meituan.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: <stable@vger.kernel.org> [v4.2+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kees Cook 提交于
One of the classes of kernel stack content leaks[1] is exposing the contents of prior heap or stack contents when a new process stack is allocated. Normally, those stacks are not zeroed, and the old contents remain in place. In the face of stack content exposure flaws, those contents can leak to userspace. Fixing this will make the kernel no longer vulnerable to these flaws, as the stack will be wiped each time a stack is assigned to a new process. There's not a meaningful change in runtime performance; it almost looks like it provides a benefit. Performing back-to-back kernel builds before: Run times: 157.86 157.09 158.90 160.94 160.80 Mean: 159.12 Std Dev: 1.54 and after: Run times: 159.31 157.34 156.71 158.15 160.81 Mean: 158.46 Std Dev: 1.46 Instead of making this a build or runtime config, Andy Lutomirski recommended this just be enabled by default. [1] A noisy search for many kinds of stack content leaks can be seen here: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=linux+kernel+stack+leak I did some more with perf and cycle counts on running 100,000 execs of /bin/true. before: Cycles: 218858861551 218853036130 214727610969 227656844122 224980542841 Mean: 221015379122.60 Std Dev: 4662486552.47 after: Cycles: 213868945060 213119275204 211820169456 224426673259 225489986348 Mean: 217745009865.40 Std Dev: 5935559279.99 It continues to look like it's faster, though the deviation is rather wide, but I'm not sure what I could do that would be less noisy. I'm open to ideas! Link: http://lkml.kernel.org/r/20180221021659.GA37073@beastSigned-off-by: NKees Cook <keescook@chromium.org> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 4月, 2018 2 次提交
-
-
由 Arnd Bergmann 提交于
This is a preparatation for changing over __kernel_timespec to 64-bit times, which involves assigning new system call numbers for mq_timedsend(), mq_timedreceive() and semtimedop() for compatibility with future y2038 proof user space. The existing ABIs will remain available through compat code. Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
由 Robert Kolchmeyer 提交于
fsnotify() acquires a reference to a fsnotify_mark_connector through the SRCU-protected pointer to_tell->i_fsnotify_marks. However, it appears that no precautions are taken in fsnotify_put_mark() to ensure that fsnotify() drops its reference to this fsnotify_mark_connector before assigning a value to its 'destroy_next' field. This can result in fsnotify_put_mark() assigning a value to a connector's 'destroy_next' field right before fsnotify() tries to traverse the linked list referenced by the connector's 'list' field. Since these two fields are members of the same union, this behavior results in a kernel panic. This issue is resolved by moving the connector's 'destroy_next' field into the object pointer union. This should work since the object pointer access is protected by both a spinlock and the value of the 'flags' field, and the 'flags' field is cleared while holding the spinlock in fsnotify_put_mark() before 'destroy_next' is updated. It shouldn't be possible for another thread to accidentally read from the object pointer after the 'destroy_next' field is updated. The offending behavior here is extremely unlikely; since fsnotify_put_mark() removes references to a connector (specifically, it ensures that the connector is unreachable from the inode it was formerly attached to) before updating its 'destroy_next' field, a sizeable chunk of code in fsnotify_put_mark() has to execute in the short window between when fsnotify() acquires the connector reference and saves the value of its 'list' field. On the HEAD kernel, I've only been able to reproduce this by inserting a udelay(1) in fsnotify(). However, I've been able to reproduce this issue without inserting a udelay(1) anywhere on older unmodified release kernels, so I believe it's worth fixing at HEAD. References: https://bugzilla.kernel.org/show_bug.cgi?id=199437 Fixes: 08991e83 CC: stable@vger.kernel.org Signed-off-by: NRobert Kolchmeyer <rkolchmeyer@google.com> Signed-off-by: NJan Kara <jack@suse.cz>
-
- 19 4月, 2018 6 次提交
-
-
由 Mathieu Poirier 提交于
Move CoreSight headers to the SPDX identifier. Signed-off-by: NMathieu Poirier <mathieu.poirier@linaro.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: linux-arm-kernel@lists.infradead.org Link: http://lkml.kernel.org/r/1524089118-27595-1-git-send-email-mathieu.poirier@linaro.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
-
由 Deepa Dinamani 提交于
Change over clock_nanosleep syscalls to use y2038 safe __kernel_timespec times. This will enable changing over of these syscalls to use new y2038 safe syscalls when the architectures define the CONFIG_64BIT_TIME. Note that nanosleep syscall is deprecated and does not have a plan for making it y2038 safe. But, the syscall should work as before on 64 bit machines and on 32 bit machines, the syscall works correctly until y2038 as before using the existing compat syscall version. There is no new syscall for supporting 64 bit time_t on 32 bit architectures. Cc: linux-api@vger.kernel.org Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
由 Deepa Dinamani 提交于
Change over clock_settime, clock_gettime and clock_getres syscalls to use __kernel_timespec times. This will enable changing over of these syscalls to use new y2038 safe syscalls when the architectures define the CONFIG_64BIT_TIME. Cc: linux-api@vger.kernel.org Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
由 Deepa Dinamani 提交于
get/put_timespec64() interfaces will eventually be used for conversions between the new y2038 safe struct __kernel_timespec and struct timespec64. The new y2038 safe syscalls have a common entry for native and compat interfaces. On compat interfaces, the high order bits of nanoseconds should be zeroed out. This is because the application code or the libc do not guarantee zeroing of these. If used without zeroing, kernel might be at risk of using timespec values incorrectly. Note that clearing of bits is dependent on CONFIG_64BIT_TIME for now. This is until COMPAT_USE_64BIT_TIME has been handled correctly. x86 will be the first architecture that will use the CONFIG_64BIT_TIME. Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
由 Deepa Dinamani 提交于
The new struct __kernel_timespec is similar to current internal kernel struct timespec64 on 64 bit architecture. The compat structure however is similar to below on little endian systems (padding and tv_nsec are switched for big endian systems): typedef s32 compat_long_t; typedef s64 compat_kernel_time64_t; struct compat_kernel_timespec { compat_kernel_time64_t tv_sec; compat_long_t tv_nsec; compat_long_t padding; }; This allows for both the native and compat representations to be the same and syscalls using this type as part of their ABI can have a single entry point to both. Note that the compat define is not included anywhere in the kernel explicitly to avoid confusion. These types will be used by the new syscalls that will be introduced in the consequent patches. Most of the new syscalls are just an update to the existing native ones with this new type. Hence, put this new type under an ifdef so that the architectures can define CONFIG_64BIT_TIME when they are ready to handle this switch. Cc: linux-arch@vger.kernel.org Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
由 Deepa Dinamani 提交于
These functions are used in the repurposed compat syscalls to provide backward compatibility for using 32 bit time_t on 32 bit systems. Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-