- 27 12月, 2019 40 次提交
-
-
由 Qian Cai 提交于
mainline inclusion from mainline-5.0-rc1 commit fed84c78527009d4f799a3ed9a566502fa026d82 category: bugfix bugzilla: 7440 CVE: NA ------------------------------------------------- Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and Huawei TaiShan 2280 aarch64 servers). After calling start_kernel()->setup_arch()->kasan_init(), kmemleak early log buffer went from something like 280 to 260000 which caused kmemleak disabled and crash dump memory reservation failed. The multitude of kmemleak_alloc() calls is from nested loops while KASAN is setting up full memory mappings, so let early kmemleak allocations skip those memblock_alloc_internal() calls came from kasan_init() given that those early KASAN memory mappings should not reference to other memory. Hence, no kmemleak false positives. kasan_init kasan_map_populate [1] kasan_pgd_populate [2] kasan_pud_populate [3] kasan_pmd_populate [4] kasan_pte_populate [5] kasan_alloc_zeroed_page memblock_alloc_try_nid memblock_alloc_internal kmemleak_alloc [1] for_each_memblock(memory, reg) [2] while (pgdp++, addr = next, addr != end) [3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp))) [4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp))) [5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep))) Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.usSigned-off-by: NQian Cai <cai@gmx.us> Acked-by: NCatalin Marinas <catalin.marinas@arm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NZengruan Ye <yezengruan@huawei.com> Reviewed-by: NHanjun Guo <guohanjun@huawei.com> [yyl: adjust context] Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Frank Rowand 提交于
commit 144552c786925314c1e7cb8f91a71dae1aca8798 upstream. Add checks: - attempted kfree due to refcount reaching zero before overlay is removed - properties linked to an overlay node when the node is removed - node refcount > one during node removal in a changeset destroy, if the node was created by the changeset After applying this patch, several validation warnings will be reported from the devicetree unittest during boot due to pre-existing devicetree bugs. The warnings will be similar to: OF: ERROR: of_node_release(), unexpected properties in /testcase-data/overlay-node/test-bus/test-unittest11 OF: ERROR: memory leak, expected refcount 1 instead of 2, of_node_get()/of_node_put() unbalanced - destroy cset entry: attach overlay node /testcase-data-2/substation@100/ hvac-medium-2 Tested-by: NAlan Tull <atull@kernel.org> Signed-off-by: NFrank Rowand <frank.rowand@sony.com> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Tetsuo Handa 提交于
commit 9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3 upstream. Arkadiusz reported that enabling memcg's group oom killing causes strange memcg statistics where there is no task in a memcg despite the number of tasks in that memcg is not 0. It turned out that there is a bug in wake_oom_reaper() which allows enqueuing same task twice which makes impossible to decrease the number of tasks in that memcg due to a refcount leak. This bug existed since the OOM reaper became invokable from task_will_free_mem(current) path in out_of_memory() in Linux 4.7, T1@P1 |T2@P1 |T3@P1 |OOM reaper ----------+----------+----------+------------ # Processing an OOM victim in a different memcg domain. try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) try_charge() mem_cgroup_out_of_memory() mutex_lock(&oom_lock) out_of_memory() oom_kill_process(P1) do_send_sig_info(SIGKILL, @P1) mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T2@P1) wake_oom_reaper(T2@P1) # T2@P1 is enqueued. mutex_unlock(&oom_lock) out_of_memory() mark_oom_victim(T1@P1) wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL. mutex_unlock(&oom_lock) # Completed processing an OOM victim in a different memcg domain. spin_lock(&oom_reaper_lock) # T1P1 is dequeued. spin_unlock(&oom_reaper_lock) but memcg's group oom killing made it easier to trigger this bug by calling wake_oom_reaper() on the same task from one out_of_memory() request. Fix this bug using an approach used by commit 855b0183 ("oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task"). As a side effect of this patch, this patch also avoids enqueuing multiple threads sharing memory via task_will_free_mem(current) path. Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp Fixes: af8e15cc ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head") Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: NArkadiusz Miskiewicz <arekm@maven.pl> Tested-by: NArkadiusz Miskiewicz <arekm@maven.pl> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NRoman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Aleksa Sarai <asarai@suse.de> Cc: Jay Kamat <jgkamat@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Daniel Borkmann 提交于
[ Upstream commit d5256083f62e2720f75bb3c5a928a0afe47d6bc3 ] While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin, I ran into the issue that while l3 mode is working fine, l3s mode does not have any connectivity to kube-apiserver and hence all pods end up in Error state as well. The ipvlan master device sits on top of a bond device and hostns traffic to kube-apiserver (also running in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573 where the latter is the address of the bond0. While in l3 mode, a curl to https://10.152.183.1:443 or to https://139.178.29.207:37573 works fine from hostns, neither of them do in case of l3s. In the latter only a curl to https://127.0.0.1:37573 appeared to work where for local addresses of bond0 I saw kernel suddenly starting to emit ARP requests to query HW address of bond0 which remained unanswered and neighbor entries in INCOMPLETE state. These ARP requests only happen while in l3s. Debugging this further, I found the issue is that l3s mode is piggy- backing on l3 master device, and in this case local routes are using l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant") and 5f02ce24 ("net: l3mdev: Allow the l3mdev to be a loopback"). I found that reverting them back into using the net->loopback_dev fixed ipvlan l3s connectivity and got everything working for the CNI. Now judging from 4fbae7d8 ("ipvlan: Introduce l3s mode") and the l3mdev paper in [0] the only sole reason why ipvlan l3s is relying on l3 master device is to get the l3mdev_ip_rcv() receive hook for setting the dst entry of the input route without adding its own ipvlan specific hacks into the receive path, however, any l3 domain semantics beyond just that are breaking l3s operation. Note that ipvlan also has the ability to dynamically switch its internal operation from l3 to l3s for all ports via ipvlan_set_port_mode() at runtime. In any case, l3 vs l3s soley distinguishes itself by 'de-confusing' netfilter through switching skb->dev to ipvlan slave device late in NF_INET_LOCAL_IN before handing the skb to L4. Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which, if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook without any additional l3mdev semantics on top. This should also have minimal impact since dev->priv_flags is already hot in cache. With this set, l3s mode is working fine and I also get things like masquerading pod traffic on the ipvlan master properly working. [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant") Fixes: 5f02ce24 ("net: l3mdev: Allow the l3mdev to be a loopback") Fixes: 4fbae7d8 ("ipvlan: Introduce l3s mode") Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Cc: Mahesh Bandewar <maheshb@google.com> Cc: David Ahern <dsa@cumulusnetworks.com> Cc: Florian Westphal <fw@strlen.de> Cc: Martynas Pumputis <m@lambda.lt> Acked-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Daniel Borkmann 提交于
[ commit d3bd7413e0ca40b60cf60d4003246d067cafdeda upstream ] While 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") took care of rejecting alu op on pointer when e.g. pointer came from two different map values with different map properties such as value size, Jann reported that a case was not covered yet when a given alu op is used in both "ptr_reg += reg" and "numeric_reg += reg" from different branches where we would incorrectly try to sanitize based on the pointer's limit. Catch this corner case and reject the program instead. Fixes: 979d63d50c0c ("bpf: prevent out of bounds speculation on pointer arithmetic") Reported-by: NJann Horn <jannh@google.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Daniel Borkmann 提交于
[ commit 979d63d50c0c0f7bc537bf821e056cc9fe5abd38 upstream ] Jann reported that the original commit back in b2157399 ("bpf: prevent out-of-bounds speculation") was not sufficient to stop CPU from speculating out of bounds memory access: While b2157399 only focussed on masking array map access for unprivileged users for tail calls and data access such that the user provided index gets sanitized from BPF program and syscall side, there is still a more generic form affected from BPF programs that applies to most maps that hold user data in relation to dynamic map access when dealing with unknown scalars or "slow" known scalars as access offset, for example: - Load a map value pointer into R6 - Load an index into R7 - Do a slow computation (e.g. with a memory dependency) that loads a limit into R8 (e.g. load the limit from a map for high latency, then mask it to make the verifier happy) - Exit if R7 >= R8 (mispredicted branch) - Load R0 = R6[R7] - Load R0 = R6[R0] For unknown scalars there are two options in the BPF verifier where we could derive knowledge from in order to guarantee safe access to the memory: i) While </>/<=/>= variants won't allow to derive any lower or upper bounds from the unknown scalar where it would be safe to add it to the map value pointer, it is possible through ==/!= test however. ii) another option is to transform the unknown scalar into a known scalar, for example, through ALU ops combination such as R &= <imm> followed by R |= <imm> or any similar combination where the original information from the unknown scalar would be destroyed entirely leaving R with a constant. The initial slow load still precedes the latter ALU ops on that register, so the CPU executes speculatively from that point. Once we have the known scalar, any compare operation would work then. A third option only involving registers with known scalars could be crafted as described in [0] where a CPU port (e.g. Slow Int unit) would be filled with many dependent computations such that the subsequent condition depending on its outcome has to wait for evaluation on its execution port and thereby executing speculatively if the speculated code can be scheduled on a different execution port, or any other form of mistraining as described in [1], for example. Given this is not limited to only unknown scalars, not only map but also stack access is affected since both is accessible for unprivileged users and could potentially be used for out of bounds access under speculation. In order to prevent any of these cases, the verifier is now sanitizing pointer arithmetic on the offset such that any out of bounds speculation would be masked in a way where the pointer arithmetic result in the destination register will stay unchanged, meaning offset masked into zero similar as in array_index_nospec() case. With regards to implementation, there are three options that were considered: i) new insn for sanitation, ii) push/pop insn and sanitation as inlined BPF, iii) reuse of ax register and sanitation as inlined BPF. Option i) has the downside that we end up using from reserved bits in the opcode space, but also that we would require each JIT to emit masking as native arch opcodes meaning mitigation would have slow adoption till everyone implements it eventually which is counter-productive. Option ii) and iii) have both in common that a temporary register is needed in order to implement the sanitation as inlined BPF since we are not allowed to modify the source register. While a push / pop insn in ii) would be useful to have in any case, it requires once again that every JIT needs to implement it first. While possible, amount of changes needed would also be unsuitable for a -stable patch. Therefore, the path which has fewer changes, less BPF instructions for the mitigation and does not require anything to be changed in the JITs is option iii) which this work is pursuing. The ax register is already mapped to a register in all JITs (modulo arm32 where it's mapped to stack as various other BPF registers there) and used in constant blinding for JITs-only so far. It can be reused for verifier rewrites under certain constraints. The interpreter's tmp "register" has therefore been remapped into extending the register set with hidden ax register and reusing that for a number of instructions that needed the prior temporary variable internally (e.g. div, mod). This allows for zero increase in stack space usage in the interpreter, and enables (restricted) generic use in rewrites otherwise as long as such a patchlet does not make use of these instructions. The sanitation mask is dynamic and relative to the offset the map value or stack pointer currently holds. There are various cases that need to be taken under consideration for the masking, e.g. such operation could look as follows: ptr += val or val += ptr or ptr -= val. Thus, the value to be sanitized could reside either in source or in destination register, and the limit is different depending on whether the ALU op is addition or subtraction and depending on the current known and bounded offset. The limit is derived as follows: limit := max_value_size - (smin_value + off). For subtraction: limit := umax_value + off. This holds because we do not allow any pointer arithmetic that would temporarily go out of bounds or would have an unknown value with mixed signed bounds where it is unclear at verification time whether the actual runtime value would be either negative or positive. For example, we have a derived map pointer value with constant offset and bounded one, so limit based on smin_value works because the verifier requires that statically analyzed arithmetic on the pointer must be in bounds, and thus it checks if resulting smin_value + off and umax_value + off is still within map value bounds at time of arithmetic in addition to time of access. Similarly, for the case of stack access we derive the limit as follows: MAX_BPF_STACK + off for subtraction and -off for the case of addition where off := ptr_reg->off + ptr_reg->var_off.value. Subtraction is a special case for the masking which can be in form of ptr += -val, ptr -= -val, or ptr -= val. In the first two cases where we know that the value is negative, we need to temporarily negate the value in order to do the sanitation on a positive value where we later swap the ALU op, and restore original source register if the value was in source. The sanitation of pointer arithmetic alone is still not fully sufficient as is, since a scenario like the following could happen ... PTR += 0x1000 (e.g. K-based imm) PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON PTR += 0x1000 PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON [...] ... which under speculation could end up as ... PTR += 0x1000 PTR -= 0 [ truncated by mitigation ] PTR += 0x1000 PTR -= 0 [ truncated by mitigation ] [...] ... and therefore still access out of bounds. To prevent such case, the verifier is also analyzing safety for potential out of bounds access under speculative execution. Meaning, it is also simulating pointer access under truncation. We therefore "branch off" and push the current verification state after the ALU operation with known 0 to the verification stack for later analysis. Given the current path analysis succeeded it is likely that the one under speculation can be pruned. In any case, it is also subject to existing complexity limits and therefore anything beyond this point will be rejected. In terms of pruning, it needs to be ensured that the verification state from speculative execution simulation must never prune a non-speculative execution path, therefore, we mark verifier state accordingly at the time of push_stack(). If verifier detects out of bounds access under speculative execution from one of the possible paths that includes a truncation, it will reject such program. Given we mask every reg-based pointer arithmetic for unprivileged programs, we've been looking into how it could affect real-world programs in terms of size increase. As the majority of programs are targeted for privileged-only use case, we've unconditionally enabled masking (with its alu restrictions on top of it) for privileged programs for the sake of testing in order to check i) whether they get rejected in its current form, and ii) by how much the number of instructions and size will increase. We've tested this by using Katran, Cilium and test_l4lb from the kernel selftests. For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb we've used test_l4lb.o as well as test_l4lb_noinline.o. We found that none of the programs got rejected by the verifier with this change, and that impact is rather minimal to none. balancer_kern.o had 13,904 bytes (1,738 insns) xlated and 7,797 bytes JITed before and after the change. Most complex program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated and 18,538 bytes JITed before and after and none of the other tail call programs in bpf_lxc.o had any changes either. For the older bpf_lxc_opt_-DUNKNOWN.o object we found a small increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed after the change. Other programs from that object file had similar small increase. Both test_l4lb.o had no change and remained at 6,544 bytes (817 insns) xlated and 3,401 bytes JITed and for test_l4lb_noinline.o constant at 5,080 bytes (634 insns) xlated and 3,313 bytes JITed. This can be explained in that LLVM typically optimizes stack based pointer arithmetic by using K-based operations and that use of dynamic map access is not overly frequent. However, in future we may decide to optimize the algorithm further under known guarantees from branch and value speculation. Latter seems also unclear in terms of prediction heuristics that today's CPUs apply as well as whether there could be collisions in e.g. the predictor's Value History/Pattern Table for triggering out of bounds access, thus masking is performed unconditionally at this point but could be subject to relaxation later on. We were generally also brainstorming various other approaches for mitigation, but the blocker was always lack of available registers at runtime and/or overhead for runtime tracking of limits belonging to a specific pointer. Thus, we found this to be minimally intrusive under given constraints. With that in place, a simple example with sanitized access on unprivileged load at post-verification time looks as follows: # bpftool prog dump xlated id 282 [...] 28: (79) r1 = *(u64 *)(r7 +0) 29: (79) r2 = *(u64 *)(r7 +8) 30: (57) r1 &= 15 31: (79) r3 = *(u64 *)(r0 +4608) 32: (57) r3 &= 1 33: (47) r3 |= 1 34: (2d) if r2 > r3 goto pc+19 35: (b4) (u32) r11 = (u32) 20479 | 36: (1f) r11 -= r2 | Dynamic sanitation for pointer 37: (4f) r11 |= r2 | arithmetic with registers 38: (87) r11 = -r11 | containing bounded or known 39: (c7) r11 s>>= 63 | scalars in order to prevent 40: (5f) r11 &= r2 | out of bounds speculation. 41: (0f) r4 += r11 | 42: (71) r4 = *(u8 *)(r4 +0) 43: (6f) r4 <<= r1 [...] For the case where the scalar sits in the destination register as opposed to the source register, the following code is emitted for the above example: [...] 16: (b4) (u32) r11 = (u32) 20479 17: (1f) r11 -= r2 18: (4f) r11 |= r2 19: (87) r11 = -r11 20: (c7) r11 s>>= 63 21: (5f) r2 &= r11 22: (0f) r2 += r0 23: (61) r0 = *(u32 *)(r2 +0) [...] JIT blinding example with non-conflicting use of r10: [...] d5: je 0x0000000000000106 _ d7: mov 0x0(%rax),%edi | da: mov $0xf153246,%r10d | Index load from map value and e0: xor $0xf153259,%r10 | (const blinded) mask with 0x1f. e7: and %r10,%rdi |_ ea: mov $0x2f,%r10d | f0: sub %rdi,%r10 | Sanitized addition. Both use r10 f3: or %rdi,%r10 | but do not interfere with each f6: neg %r10 | other. (Neither do these instructions f9: sar $0x3f,%r10 | interfere with the use of ax as temp fd: and %r10,%rdi | in interpreter.) 100: add %rax,%rdi |_ 103: mov 0x0(%rdi),%eax [...] Tested that it fixes Jann's reproducer, and also checked that test_verifier and test_progs suite with interpreter, JIT and JIT with hardening enabled on x86-64 and arm64 runs successfully. [0] Speculose: Analyzing the Security Implications of Speculative Execution in CPUs, Giorgi Maisuradze and Christian Rossow, https://arxiv.org/pdf/1801.04084.pdf [1] A Systematic Evaluation of Transient Execution Attacks and Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz, Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens, Dmitry Evtyushkin, Daniel Gruss, https://arxiv.org/pdf/1811.05441.pdf Fixes: b2157399 ("bpf: prevent out-of-bounds speculation") Reported-by: NJann Horn <jannh@google.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Daniel Borkmann 提交于
[ commit 9b73bfdd08e73231d6a90ae6db4b46b3fbf56c30 upstream ] Right now we are using BPF ax register in JIT for constant blinding as well as in interpreter as temporary variable. Verifier will not be able to use it simply because its use will get overridden from the former in bpf_jit_blind_insn(). However, it can be made to work in that blinding will be skipped if there is prior use in either source or destination register on the instruction. Taking constraints of ax into account, the verifier is then open to use it in rewrites under some constraints. Note, ax register already has mappings in every eBPF JIT. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Daniel Borkmann 提交于
[ commit 144cd91c4c2bced6eb8a7e25e590f6618a11e854 upstream ] This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run() into the hidden ax register. The latter is currently only used in JITs for constant blinding as a temporary scratch register, meaning the BPF interpreter will never see the use of ax. Therefore it is safe to use it for the cases where tmp has been used earlier. This is needed to later on allow restricted hidden use of ax in both interpreter and JITs. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Daniel Borkmann 提交于
[ commit c08435ec7f2bc8f4109401f696fd55159b4b40cb upstream ] Move prev_insn_idx and insn_idx from the do_check() function into the verifier environment, so they can be read inside the various helper functions for handling the instructions. It's easier to put this into the environment rather than changing all call-sites only to pass it along. insn_idx is useful in particular since this later on allows to hold state in env->insn_aux_data[env->insn_idx]. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Dexuan Cui 提交于
commit ba50bf1ce9a51fc97db58b96d01306aa70bc3979 upstream. fc96df16a1ce is good and can already fix the "return stack garbage" issue, but let's also improve hv_ringbuffer_get_debuginfo(), which would silently return stack garbage, if people forget to check channel->state or ring_info->ring_buffer, when using the function in the future. Having an error check in the function would eliminate the potential risk. Add a Fixes tag to indicate the patch depdendency. Fixes: fc96df16a1ce ("Drivers: hv: vmbus: Return -EINVAL for the sys files for unopened channels") Cc: stable@vger.kernel.org Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: NStephen Hemminger <sthemmin@microsoft.com> Signed-off-by: NDexuan Cui <decui@microsoft.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Ross Lagerwall 提交于
[ Upstream commit 6c57f0458022298e4da1729c67bd33ce41c14e7a ] In certain cases, pskb_trim_rcsum() may change skb pointers. Reinitialize header pointers afterwards to avoid potential use-after-frees. Add a note in the documentation of pskb_trim_rcsum(). Found by KASAN. Signed-off-by: NRoss Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Xie XiuQi 提交于
hulk inclusion category: feature bugzilla: 5510 CVE: NA Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com> Reviewed-by: NHanjun Guo <guohanjun@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Yang Yingliang 提交于
hulk inclusion category: feature bugzilla: 5510 CVE: NA Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com> Reviewed-by: NHanjun Guo <guohanjun@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Xie XiuQi 提交于
hulk inclusion category: feature bugzilla: 5510 CVE: NA Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com> Reviewed-by: NHanjun Guo <guohanjun@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Xiongfeng Wang 提交于
euler inclusion category: feature Bugzilla: 5515 CVE: N/A ---------------------------------------- NMI Watchdog need to enable the event for each core individually. But the existing public api 'sdei_event_enable' enable events for all cores when the event type is private. Signed-off-by: NXiongfeng Wang <wangxiongfeng2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Xiongfeng Wang 提交于
euler inclusion category: feature Bugzilla: 5515 CVE: N/A ---------------------------------------- This patch add a interrupt binding api function which returns the binded event number. Signed-off-by: NXiongfeng Wang <wangxiongfeng2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Xiongfeng Wang 提交于
euler inclusion category: feature Bugzilla: 5515 CVE: N/A ---------------------------------------- In current code, the hardlockup detect code is contained by CONFIG_HARDLOCKUP_DETECTOR_PERF. This patch makes this code public so that other arch hardlockup detector can use it. Signed-off-by: NXiongfeng Wang <wangxiongfeng2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Aaron Lu 提交于
[ Upstream commit 66f71da9dd38af17dc17209cdde7987d4679a699 ] Since a2468cc9 ("swap: choose swap device according to numa node"), avail_lists field of swap_info_struct is changed to an array with MAX_NUMNODES elements. This made swap_info_struct size increased to 40KiB and needs an order-4 page to hold it. This is not optimal in that: 1 Most systems have way less than MAX_NUMNODES(1024) nodes so it is a waste of memory; 2 It could cause swapon failure if the swap device is swapped on after system has been running for a while, due to no order-4 page is available as pointed out by Vasily Averin. Solve the above two issues by using nr_node_ids(which is the actual possible node number the running system has) for avail_lists instead of MAX_NUMNODES. nr_node_ids is unknown at compile time so can't be directly used when declaring this array. What I did here is to declare avail_lists as zero element array and allocate space for it when allocating space for swap_info_struct. The reason why keep using array but not pointer is plist_for_each_entry needs the field to be part of the struct, so pointer will not work. This patch is on top of Vasily Averin's fix commit. I think the use of kvzalloc for swap_info_struct is still needed in case nr_node_ids is really big on some systems. Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.comSigned-off-by: NAaron Lu <aaron.lu@intel.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: Huang Ying <ying.huang@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Andrey Ignatov 提交于
[ Upstream commit 46f53a65d2de3e1591636c22b626b09d8684fd71 ] Currently BPF verifier allows narrow loads for a context field only with offset zero. E.g. if there is a __u32 field then only the following loads are permitted: * off=0, size=1 (narrow); * off=0, size=2 (narrow); * off=0, size=4 (full). On the other hand LLVM can generate a load with offset different than zero that make sense from program logic point of view, but verifier doesn't accept it. E.g. tools/testing/selftests/bpf/sendmsg4_prog.c has code: #define DST_IP4 0xC0A801FEU /* 192.168.1.254 */ ... if ((ctx->user_ip4 >> 24) == (bpf_htonl(DST_IP4) >> 24) && where ctx is struct bpf_sock_addr. Some versions of LLVM can produce the following byte code for it: 8: 71 12 07 00 00 00 00 00 r2 = *(u8 *)(r1 + 7) 9: 67 02 00 00 18 00 00 00 r2 <<= 24 10: 18 03 00 00 00 00 00 fe 00 00 00 00 00 00 00 00 r3 = 4261412864 ll 12: 5d 32 07 00 00 00 00 00 if r2 != r3 goto +7 <LBB0_6> where `*(u8 *)(r1 + 7)` means narrow load for ctx->user_ip4 with size=1 and offset=3 (7 - sizeof(ctx->user_family) = 3). This load is currently rejected by verifier. Verifier code that rejects such loads is in bpf_ctx_narrow_access_ok() what means any is_valid_access implementation, that uses the function, works this way, e.g. bpf_skb_is_valid_access() for __sk_buff or sock_addr_is_valid_access() for bpf_sock_addr. The patch makes such loads supported. Offset can be in [0; size_default) but has to be multiple of load size. E.g. for __u32 field the following loads are supported now: * off=0, size=1 (narrow); * off=1, size=1 (narrow); * off=2, size=1 (narrow); * off=3, size=1 (narrow); * off=0, size=2 (narrow); * off=2, size=2 (narrow); * off=0, size=4 (full). Reported-by: NYonghong Song <yhs@fb.com> Signed-off-by: NAndrey Ignatov <rdna@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Anders Roxell 提交于
[ Upstream commit 347a28b586802d09604a149c1a1f6de5dccbe6fa ] This happened while running in qemu-system-aarch64, the AMBA PL011 UART driver when enabling CONFIG_DEBUG_TEST_DRIVER_REMOVE. arch_initcall(pl011_init) came before subsys_initcall(default_bdi_init), devtmpfs' handle_remove() crashes because the reference count is a NULL pointer only because wb->bdi hasn't been initialized yet. Rework so that wb_put have an extra check if wb->bdi before decrement wb->refcnt and also add a WARN_ON_ONCE to get a warning if it happens again in other drivers. Fixes: 52ebea74 ("writeback: make backing_dev_info host cgroup-specific bdi_writebacks") Co-developed-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NAnders Roxell <anders.roxell@linaro.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Badhri Jagan Sridharan 提交于
[ Upstream commit 23b5f73266e59a598c1e5dd435d87651b5a7626b ] During HARD_RESET the data link is disconnected. For self powered device, the spec is advising against doing that. >From USB_PD_R3_0 7.1.5 Response to Hard Resets Device operation during and after a Hard Reset is defined as follows: Self-powered devices Should Not disconnect from USB during a Hard Reset (see Section 9.1.2). Bus powered devices will disconnect from USB during a Hard Reset due to the loss of their power source. Tackle this by letting TCPM know whether the device is self or bus powered. This overcomes unnecessary port disconnections from hard reset. Also, speeds up the enumeration time when connected to Type-A ports. Signed-off-by: NBadhri Jagan Sridharan <badhri@google.com> Reviewed-by: NHeikki Krogerus <heikki.krogerus@linux.intel.com> --------- Version history: V3: Rebase on top of usb-next V2: Based on feedback from heikki.krogerus@linux.intel.com - self_powered added to the struct tcpm_port which is populated from a. "connector" node of the device tree in tcpm_fw_get_caps() b. "self_powered" node of the tcpc_config in tcpm_copy_caps Based on feedbase from linux@roeck-us.net - Code was refactored - SRC_HARD_RESET_VBUS_OFF sets the link state to false based on self_powered flag V1 located here: https://lkml.org/lkml/2018/9/13/94Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Yufen Yu 提交于
commit 94a2c3a32b62e868dc1e3d854326745a7f1b8c7a upstream. We recently got a stack by syzkaller like this: BUG: sleeping function called from invalid context at mm/slab.h:361 in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid INFO: lockdep is turned off. CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194 0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00 0000000000000000 0000000000000001 0000000000000004 0000000000000000 Call Trace: <IRQ> [<ffffffff81cb2194>] __dump_stack lib/dump_stack.c:15 [inline] <IRQ> [<ffffffff81cb2194>] dump_stack+0x114/0x1a0 lib/dump_stack.c:51 [<ffffffff8129a981>] ___might_sleep+0x291/0x490 kernel/sched/core.c:7675 [<ffffffff8129ac33>] __might_sleep+0xb3/0x270 kernel/sched/core.c:7637 [<ffffffff81794c13>] slab_pre_alloc_hook mm/slab.h:361 [inline] [<ffffffff81794c13>] slab_alloc_node mm/slub.c:2610 [inline] [<ffffffff81794c13>] slab_alloc mm/slub.c:2692 [inline] [<ffffffff81794c13>] kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709 [<ffffffff81cbe9a7>] kmalloc include/linux/slab.h:479 [inline] [<ffffffff81cbe9a7>] kzalloc include/linux/slab.h:623 [inline] [<ffffffff81cbe9a7>] kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227 [<ffffffff81cbf84f>] kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374 [<ffffffff81cbb5b9>] kobject_cleanup lib/kobject.c:633 [inline] [<ffffffff81cbb5b9>] kobject_release+0x229/0x440 lib/kobject.c:675 [<ffffffff81cbb0a2>] kref_sub include/linux/kref.h:73 [inline] [<ffffffff81cbb0a2>] kref_put include/linux/kref.h:98 [inline] [<ffffffff81cbb0a2>] kobject_put+0x72/0xd0 lib/kobject.c:692 [<ffffffff8216f095>] put_device+0x25/0x30 drivers/base/core.c:1237 [<ffffffff81c4cc34>] delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232 [<ffffffff813c08bc>] __rcu_reclaim kernel/rcu/rcu.h:118 [inline] [<ffffffff813c08bc>] rcu_do_batch kernel/rcu/tree.c:2705 [inline] [<ffffffff813c08bc>] invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline] [<ffffffff813c08bc>] __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline] [<ffffffff813c08bc>] rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957 [<ffffffff8120f509>] __do_softirq+0x299/0xe20 kernel/softirq.c:273 [<ffffffff81210496>] invoke_softirq kernel/softirq.c:350 [inline] [<ffffffff81210496>] irq_exit+0x216/0x2c0 kernel/softirq.c:391 [<ffffffff82c2cd7b>] exiting_irq arch/x86/include/asm/apic.h:652 [inline] [<ffffffff82c2cd7b>] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926 [<ffffffff82c2bc25>] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746 <EOI> [<ffffffff814cbf40>] ? audit_kill_trees+0x180/0x180 [<ffffffff8187d2f7>] fd_install+0x57/0x80 fs/file.c:626 [<ffffffff8180989e>] do_sys_open+0x45e/0x550 fs/open.c:1043 [<ffffffff818099c2>] SYSC_open fs/open.c:1055 [inline] [<ffffffff818099c2>] SyS_open+0x32/0x40 fs/open.c:1050 [<ffffffff82c299e1>] entry_SYSCALL_64_fastpath+0x1e/0x9a In softirq context, we call rcu callback function delete_partition_rcu_cb(), which may allocate memory by kzalloc with GFP_KERNEL flag. If the allocation cannot be satisfied, it may sleep. However, That is not allowed in softirq contex. Although we found this problem on linux 4.4, the latest kernel version seems to have this problem as well. And it is very similar to the previous one: https://lkml.org/lkml/2018/7/9/391 Fix it by using RCU workqueue, which allows sleep. Reviewed-by: NPaul E. McKenney <paulmck@linux.ibm.com> Signed-off-by: NYufen Yu <yuyufen@huawei.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Rafał Miłecki 提交于
commit 321c46b91550adc03054125fa7a1639390608e1a upstream. So far we never had any device registered for the SoC. This resulted in some small issues that we kept ignoring like: 1) Not working GPIOLIB_IRQCHIP (gpiochip_irqchip_add_key() failing) 2) Lack of proper tree in the /sys/devices/ 3) mips_dma_alloc_coherent() silently handling empty coherent_dma_mask Kernel 4.19 came with a lot of DMA changes and caused a regression on bcm47xx. Starting with the commit f8c55dc6 ("MIPS: use generic dma noncoherent ops for simple noncoherent platforms") DMA coherent allocations just fail. Example: [ 1.114914] bgmac_bcma bcma0:2: Allocation of TX ring 0x200 failed [ 1.121215] bgmac_bcma bcma0:2: Unable to alloc memory for DMA [ 1.127626] bgmac_bcma: probe of bcma0:2 failed with error -12 [ 1.133838] bgmac_bcma: Broadcom 47xx GBit MAC driver loaded The bgmac driver also triggers a WARNING: [ 0.959486] ------------[ cut here ]------------ [ 0.964387] WARNING: CPU: 0 PID: 1 at ./include/linux/dma-mapping.h:516 bgmac_enet_probe+0x1b4/0x5c4 [ 0.973751] Modules linked in: [ 0.976913] CPU: 0 PID: 1 Comm: swapper Not tainted 4.19.9 #0 [ 0.982750] Stack : 804a0000 804597c4 00000000 00000000 80458fd8 8381bc2c 838282d4 80481a47 [ 0.991367] 8042e3ec 00000001 804d38f0 00000204 83980000 00000065 8381bbe0 6f55b24f [ 0.999975] 00000000 00000000 80520000 00002018 00000000 00000075 00000007 00000000 [ 1.008583] 00000000 80480000 000ee811 00000000 00000000 00000000 80432c00 80248db8 [ 1.017196] 00000009 00000204 83980000 803ad7b0 00000000 801feeec 00000000 804d0000 [ 1.025804] ... [ 1.028325] Call Trace: [ 1.030875] [<8000aef8>] show_stack+0x58/0x100 [ 1.035513] [<8001f8b4>] __warn+0xe4/0x118 [ 1.039708] [<8001f9a4>] warn_slowpath_null+0x48/0x64 [ 1.044935] [<80248db8>] bgmac_enet_probe+0x1b4/0x5c4 [ 1.050101] [<802498e0>] bgmac_probe+0x558/0x590 [ 1.054906] [<80252fd0>] bcma_device_probe+0x38/0x70 [ 1.060017] [<8020e1e8>] really_probe+0x170/0x2e8 [ 1.064891] [<8020e714>] __driver_attach+0xa4/0xec [ 1.069784] [<8020c1e0>] bus_for_each_dev+0x58/0xb0 [ 1.074833] [<8020d590>] bus_add_driver+0xf8/0x218 [ 1.079731] [<8020ef24>] driver_register+0xcc/0x11c [ 1.084804] [<804b54cc>] bgmac_init+0x1c/0x44 [ 1.089258] [<8000121c>] do_one_initcall+0x7c/0x1a0 [ 1.094343] [<804a1d34>] kernel_init_freeable+0x150/0x218 [ 1.099886] [<803a082c>] kernel_init+0x10/0x104 [ 1.104583] [<80005878>] ret_from_kernel_thread+0x14/0x1c [ 1.110107] ---[ end trace f441c0d873d1fb5b ]--- This patch setups a "struct device" (and passes it to the bcma) which allows fixing all the mentioned problems. It'll also require a tiny bcma patch which will follow through the wireless tree & its maintainer. Fixes: f8c55dc6 ("MIPS: use generic dma noncoherent ops for simple noncoherent platforms") Signed-off-by: NRafał Miłecki <rafal@milecki.pl> Signed-off-by: NPaul Burton <paul.burton@mips.com> Acked-by: NHauke Mehrtens <hauke@hauke-m.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: linux-wireless@vger.kernel.org Cc: Ralf Baechle <ralf@linux-mips.org> Cc: James Hogan <jhogan@kernel.org> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Ard Biesheuvel 提交于
mainline inclusion from mainline-5.0 commit 80424b02d42bb22f8ff8839cb93a84ade53b39c0 category: bugfix bugzilla: 5501 CVE: NA ------------------------------------------------- The current implementation of efi_mem_reserve_persistent() is rather naive, in the sense that for each invocation, it creates a separate linked list entry to describe the reservation. Since the linked list entries themselves need to persist across subsequent kexec reboots, every reservation created this way results in two memblock_reserve() calls at the next boot. On arm64 systems with 100s of CPUs, this may result in a excessive number of memblock reservations, and needless fragmentation. So instead, make use of the newly updated struct linux_efi_memreserve layout to put multiple reservations into a single linked list entry. This should get rid of the numerous tiny memblock reservations, and effectively cut the total number of reservations in half on arm64 systems with many CPUs. [ mingo: build warning fix. ] Tested-by: NMarc Zyngier <marc.zyngier@arm.com> Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arend van Spriel <arend.vanspriel@broadcom.com> Cc: Bhupesh Sharma <bhsharma@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Eric Snowberg <eric.snowberg@oracle.com> Cc: Hans de Goede <hdegoede@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jon Hunter <jonathanh@nvidia.com> Cc: Julien Thierry <julien.thierry@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: YiFei Zhu <zhuyifei1999@gmail.com> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20181129171230.18699-11-ard.biesheuvel@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
-
由 Ard Biesheuvel 提交于
mainline inclusion from mainline-5.0 commit 5f0b0ecf043a5319e729c11a53bc8294df12dab3 category: bugfix bugzilla: 5501 CVE: NA ------------------------------------------------- In preparation of updating efi_mem_reserve_persistent() to cause less fragmentation when dealing with many persistent reservations, update the struct definition and the code that handles it currently so it can describe an arbitrary number of reservations using a single linked list entry. The actual optimization will be implemented in a subsequent patch. Tested-by: NMarc Zyngier <marc.zyngier@arm.com> Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arend van Spriel <arend.vanspriel@broadcom.com> Cc: Bhupesh Sharma <bhsharma@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Eric Snowberg <eric.snowberg@oracle.com> Cc: Hans de Goede <hdegoede@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Jon Hunter <jonathanh@nvidia.com> Cc: Julien Thierry <julien.thierry@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Nathan Chancellor <natechancellor@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: YiFei Zhu <zhuyifei1999@gmail.com> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20181129171230.18699-10-ard.biesheuvel@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
-
由 Ard Biesheuvel 提交于
mainline inclusion from mainline-4.20 commit eff896288872d687d9662000ec9ae11b6d61766f category: bugfix bugzilla: 5501 CVE: NA ------------------------------------------------- The new memory EFI reservation feature we introduced to allow memory reservations to persist across kexec may trigger an unbounded number of calls to memblock_reserve(). The memblock subsystem can deal with this fine, but not before memblock resizing is enabled, which we can only do after paging_init(), when the memory we reallocate the array into is actually mapped. So break out the memreserve table processing into a separate routine and call it after paging_init() on arm64. On ARM, because of limited reviewing bandwidth of the maintainer, we cannot currently fix this, so instead, disable the EFI persistent memreserve entirely on ARM so we can fix it later. Tested-by: NMarc Zyngier <marc.zyngier@arm.com> Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-efi@vger.kernel.org Link: http://lkml.kernel.org/r/20181114175544.12860-5-ard.biesheuvel@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
-
由 Cheng Jian 提交于
euler inclusion category: feature Bugzilla: 5507 CVE: N/A ---------------------------------------- In the previous version we forced the association between livepatch wo_ftrace and stop_machine. This is unwise and obviously confusing. commit d83a7cb3 ("livepatch: change to a per-task consistency model") introduce a PER-TASK consistency model. It's a hybrid of kGraft and kpatch: it uses kGraft's per-task consistency and syscall barrier switching combined with kpatch's stack trace switching. There are also a number of fallback options which make it quite flexible. So we split livepatch consistency for without ftrace to two model: [1] PER-TASK consistency model. per-task consistency and syscall barrier switching combined with kpatch's stack trace switching. [2] STOP-MACHINE consistency model. stop-machine consistency and kpatch's stack trace switching. Signed-off-by: NCheng Jian <cj.chengjian@huawei.com> Reviewed-by: NLi Bin <huawei.libin@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Cheng Jian 提交于
euler inclusion category: feature Bugzilla: 5507 CVE: N/A ---------------------------------------- front-tools kpatch-build support load and unload hooks, but the kernel does not fit this feature. just implement it. Signed-off-by: NCheng Jian <cj.chengjian@huawei.com> Reviewed-by: NLi Bin <huawei.libin@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Cheng Jian 提交于
euler inclusion category: feature Bugzilla: 5507 CVE: N/A ---------------------------------------- Some functions in the kernel are always on the stack of some thread in the system. Attempts to patch these function will currently always fail the activeness safety check. However, through human inspection, it can be determined that, for a particular function, consistency is maintained even if the old and new versions of the function run concurrently. commit 2e93c5e1e3dc ("support forced patching") in kpatch-build introduces a KPATCH_FORCE_UNSAFE() macro to define patched functions that such be exempted from the activeness safety check. now kernel implement this feature. Signed-off-by: NCheng Jian <cj.chengjian@huawei.com> Reviewed-by: NLi Bin <huawei.libin@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Cheng Jian 提交于
euler inclusion category: feature Bugzilla: 5507 CVE: N/A ---------------------------------------- support for livepatch without ftrace mode new config for WO_FTRACE CONFIG_LIVEPATCH_WO_FTRACE=y CONFIG_LIVEPATCH_STACK=y Implements livepatch without ftrace by direct jump, we directly modify the first few instructions(usually one, but four for long jumps under ARM64) of the old function as jump instructions by stop_machine, so it will jump to the first address of the new function when livepatch enable KERNEL/MODULE call/bl A---------------old_A------------ | jump new_A----+--------| | | | | | | ----------------- | | | | livepatch_module------------- | | | | |new_A <--------------------+--------------------| | | | | |---------------------------| | .plt | | ......PLTS for livepatch | ----------------------------- something we need to consider under different architectures: 1. jump instruction 2. partial relocation in new function requires for livepatch. 3. long jumps may be required if the jump address exceeds the offset. both for livepatch relocation and livepatch enable. Signed-off-by: NCheng Jian <cj.chengjian@huawei.com> Reviewed-by: NLi Bin <huawei.libin@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Vasily Averin 提交于
commit d4b09acf924b84bae77cad090a9d108e70b43643 upstream. if node have NFSv41+ mounts inside several net namespaces it can lead to use-after-free in svc_process_common() svc_process_common() /* Setup reply header */ rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp); <<< HERE svc_process_common() can use incorrect rqstp->rq_xprt, its caller function bc_svc_process() takes it from serv->sv_bc_xprt. The problem is that serv is global structure but sv_bc_xprt is assigned per-netnamespace. According to Trond, the whole "let's set up rqstp->rq_xprt for the back channel" is nothing but a giant hack in order to work around the fact that svc_process_common() uses it to find the xpt_ops, and perform a couple of (meaningless for the back channel) tests of xpt_flags. All we really need in svc_process_common() is to be able to run rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr() Bruce J Fields points that this xpo_prep_reply_hdr() call is an awfully roundabout way just to do "svc_putnl(resv, 0);" in the tcp case. This patch does not initialiuze rqstp->rq_xprt in bc_svc_process(), now it calls svc_process_common() with rqstp->rq_xprt = NULL. To adjust reply header svc_process_common() just check rqstp->rq_prot and calls svc_tcp_prep_reply_hdr() for tcp case. To handle rqstp->rq_xprt = NULL case in functions called from svc_process_common() patch intruduces net namespace pointer svc_rqst->rq_bc_net and adjust SVC_NET() definition. Some other function was also adopted to properly handle described case. Signed-off-by: NVasily Averin <vvs@virtuozzo.com> Cc: stable@vger.kernel.org Fixes: 23c20ecd ("NFS: callback up - users counting cleanup") Signed-off-by: NJ. Bruce Fields <bfields@redhat.com> v2: added lost extern svc_tcp_prep_reply_hdr() Signed-off-by: NVasily Averin <vvs@virtuozzo.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 WANG Chao 提交于
commit e4f358916d528d479c3c12bd2fd03f2d5a576380 upstream. Commit 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support") replaced the RETPOLINE define with CONFIG_RETPOLINE checks. Remove the remaining pieces. [ bp: Massage commit message. ] Fixes: 4cd24de3a098 ("x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support") Signed-off-by: NWANG Chao <chao.wang@ucloud.cn> Signed-off-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NZhenzhong Duan <zhenzhong.duan@oracle.com> Reviewed-by: NMasahiro Yamada <yamada.masahiro@socionext.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Jessica Yu <jeyu@kernel.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Cc: Michal Marek <michal.lkml@markovi.net> Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: linux-kbuild@vger.kernel.org Cc: srinivas.eeda@oracle.com Cc: stable <stable@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20181210163725.95977-1-chao.wang@ucloud.cnSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Dan Williams 提交于
commit 58ef15b765af0d2cbe6799ec564f1dc485010ab8 upstream. devm semantics arrange for resources to be torn down when device-driver-probe fails or when device-driver-release completes. Similar to devm_memremap_pages() there is no need to support an explicit remove operation when the users properly adhere to devm semantics. Note that devm_kzalloc() automatically handles allocating node-local memory. Link: http://lkml.kernel.org/r/154275559545.76910.9186690723515469051.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJérôme Glisse <jglisse@redhat.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Dan Williams 提交于
commit a95c90f1e2c253b280385ecf3d4ebfe476926b28 upstream. The last step before devm_memremap_pages() returns success is to allocate a release action, devm_memremap_pages_release(), to tear the entire setup down. However, the result from devm_add_action() is not checked. Checking the error from devm_add_action() is not enough. The api currently relies on the fact that the percpu_ref it is using is killed by the time the devm_memremap_pages_release() is run. Rather than continue this awkward situation, offload the responsibility of killing the percpu_ref to devm_memremap_pages_release() directly. This allows devm_memremap_pages() to do the right thing relative to init failures and shutdown. Without this change we could fail to register the teardown of devm_memremap_pages(). The likelihood of hitting this failure is tiny as small memory allocations almost always succeed. However, the impact of the failure is large given any future reconfiguration, or disable/enable, of an nvdimm namespace will fail forever as subsequent calls to devm_memremap_pages() will fail to setup the pgmap_radix since there will be stale entries for the physical address range. An argument could be made to require that the ->kill() operation be set in the @pgmap arg rather than passed in separately. However, it helps code readability, tracking the lifetime of a given instance, to be able to grep the kill routine directly at the devm_memremap_pages() call site. Link: http://lkml.kernel.org/r/154275558526.76910.7535251937849268605.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com> Fixes: e8d51348 ("memremap: change devm_memremap_pages interface...") Reviewed-by: N"Jérôme Glisse" <jglisse@redhat.com> Reported-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Taehee Yoo 提交于
[ Upstream commit 4c05ec47384ab3627b62814e8f886e90cc38ce15 ] basechain->stats is rcu protected data which is updated from nft_chain_stats_replace(). This function is executed from the commit phase which holds the pernet nf_tables commit mutex - not the global nfnetlink subsystem mutex. Test commands to reproduce the problem are: %iptables-nft -I INPUT %iptables-nft -Z %iptables-nft -Z This patch uses RCU calls to handle basechain->stats updates to fix a splat that looks like: [89279.358755] ============================= [89279.363656] WARNING: suspicious RCU usage [89279.368458] 4.20.0-rc2+ #44 Tainted: G W L [89279.374661] ----------------------------- [89279.379542] net/netfilter/nf_tables_api.c:1404 suspicious rcu_dereference_protected() usage! [...] [89279.406556] 1 lock held by iptables-nft/5225: [89279.411728] #0: 00000000bf45a000 (&net->nft.commit_mutex){+.+.}, at: nf_tables_valid_genid+0x1f/0x70 [nf_tables] [89279.424022] stack backtrace: [89279.429236] CPU: 0 PID: 5225 Comm: iptables-nft Tainted: G W L 4.20.0-rc2+ #44 [89279.430135] Call Trace: [89279.430135] dump_stack+0xc9/0x16b [89279.430135] ? show_regs_print_info+0x5/0x5 [89279.430135] ? lockdep_rcu_suspicious+0x117/0x160 [89279.430135] nft_chain_commit_update+0x4ea/0x640 [nf_tables] [89279.430135] ? sched_clock_local+0xd4/0x140 [89279.430135] ? check_flags.part.35+0x440/0x440 [89279.430135] ? __rhashtable_remove_fast.constprop.67+0xec0/0xec0 [nf_tables] [89279.430135] ? sched_clock_cpu+0x126/0x170 [89279.430135] ? find_held_lock+0x39/0x1c0 [89279.430135] ? hlock_class+0x140/0x140 [89279.430135] ? is_bpf_text_address+0x5/0xf0 [89279.430135] ? check_flags.part.35+0x440/0x440 [89279.430135] ? __lock_is_held+0xb4/0x140 [89279.430135] nf_tables_commit+0x2555/0x39c0 [nf_tables] Fixes: f102d66b ("netfilter: nf_tables: use dedicated mutex to guard transactions") Signed-off-by: NTaehee Yoo <ap420073@gmail.com> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Miquel Raynal 提交于
commit 81b1e6e6a8590a19257e37a1633bec098d499c57 upstream. Since the addition of platform MSI support, there were two helpers supposed to allocate/free IRQs for a device: platform_msi_domain_alloc_irqs() platform_msi_domain_free_irqs() In these helpers, IRQ descriptors are allocated in the "alloc" routine while they are freed in the "free" one. Later, two other helpers have been added to handle IRQ domains on top of MSI domains: platform_msi_domain_alloc() platform_msi_domain_free() Seen from the outside, the logic is pretty close with the former helpers and people used it with the same logic as before: a platform_msi_domain_alloc() call should be balanced with a platform_msi_domain_free() call. While this is probably what was intended to do, the platform_msi_domain_free() does not remove/free the IRQ descriptor(s) created/inserted in platform_msi_domain_alloc(). One effect of such situation is that removing a module that requested an IRQ will let one orphaned IRQ descriptor (with an allocated MSI entry) in the device descriptors list. Next time the module will be inserted back, one will observe that the allocation will happen twice in the MSI domain, one time for the remaining descriptor, one time for the new one. It also has the side effect to quickly overshoot the maximum number of allocated MSI and then prevent any module requesting an interrupt in the same domain to be inserted anymore. This situation has been met with loops of insertion/removal of the mvpp2.ko module (requesting 15 MSIs each time). Fixes: 552c494a ("platform-msi: Allow creation of a MSI-based stacked irq domain") Cc: stable@vger.kernel.org Signed-off-by: NMiquel Raynal <miquel.raynal@bootlin.com> Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Cong Wang 提交于
[ Upstream commit aff6db454599d62191aabc208930e891748e4322 ] __ptr_ring_swap_queue() tries to move pointers from the old ring to the new one, but it forgets to check if ->producer is beyond the new size at the end of the operation. This leads to an out-of-bound access in __ptr_ring_produce() as reported by syzbot. Reported-by: syzbot+8993c0fa96d57c399735@syzkaller.appspotmail.com Fixes: 5d49de53 ("ptr_ring: resize support") Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Jason Wang <jasowang@redhat.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Acked-by: NMichael S. Tsirkin <mst@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Yang Yingliang 提交于
euler inclusion category: bugfix bugzilla: 5359 CVE: NA Drop this patch, but use the patch in LTS kernel. ------------------- This reverts commit 698ac96606173d99c1b2f8fded0990c30047ee3c. Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Yang Yingliang 提交于
This reverts commit 3cbd5bf6892f0c8c6a12d5848af1e1b46ec50eef. Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> -
由 Yang Yingliang 提交于
This reverts commit 3f5824d84929e058b82a0dcd28a6a8b13339cc00. Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-