- 07 4月, 2020 1 次提交
-
-
由 Dexuan Cui 提交于
Fixed the punctuation and some typos. Improved some sentences with minor changes. No change of semantics or code. Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NRandy Dunlap <rdunlap@infradead.org> Signed-off-by: NDexuan Cui <decui@microsoft.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 31 3月, 2020 9 次提交
-
-
由 Andrii Nakryiko 提交于
Add new operation (LINK_UPDATE), which allows to replace active bpf_prog from under given bpf_link. Currently this is only supported for bpf_cgroup_link, but will be extended to other kinds of bpf_links in follow-up patches. For bpf_cgroup_link, implemented functionality matches existing semantics for direct bpf_prog attachment (including BPF_F_REPLACE flag). User can either unconditionally set new bpf_prog regardless of which bpf_prog is currently active under given bpf_link, or, optionally, can specify expected active bpf_prog. If active bpf_prog doesn't match expected one, no changes are performed, old bpf_link stays intact and attached, operation returns a failure. cgroup_bpf_replace() operation is resolving race between auto-detachment and bpf_prog update in the same fashion as it's done for bpf_link detachment, except in this case update has no way of succeeding because of target cgroup marked as dying. So in this case error is returned. Signed-off-by: NAndrii Nakryiko <andriin@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200330030001.2312810-3-andriin@fb.com
-
由 Andrii Nakryiko 提交于
Implement new sub-command to attach cgroup BPF programs and return FD-based bpf_link back on success. bpf_link, once attached to cgroup, cannot be replaced, except by owner having its FD. Cgroup bpf_link supports only BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI attachments can be freely intermixed. To prevent bpf_cgroup_link from keeping cgroup alive past the point when no BPF program can be executed, implement auto-detachment of link. When cgroup_bpf_release() is called, all attached bpf_links are forced to release cgroup refcounts, but they leave bpf_link otherwise active and allocated, as well as still owning underlying bpf_prog. This is because user-space might still have FDs open and active, so bpf_link as a user-referenced object can't be freed yet. Once last active FD is closed, bpf_link will be freed and underlying bpf_prog refcount will be dropped. But cgroup refcount won't be touched, because cgroup is released already. The inherent race between bpf_cgroup_link release (from closing last FD) and cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So the only additional check required is when bpf_cgroup_link attempts to detach itself from cgroup. At that time we need to check whether there is still cgroup associated with that link. And if not, exit with success, because bpf_cgroup_link was already successfully detached. Signed-off-by: NAndrii Nakryiko <andriin@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NRoman Gushchin <guro@fb.com> Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com
-
由 Scott Mayhew 提交于
When using NFSv4.2, the security label for the root inode should be set via a call to nfs_setsecurity() during the mount process, otherwise the inode will appear as unlabeled for up to acdirmin seconds. Currently the label for the root inode is allocated, retrieved, and freed entirely witin nfs4_proc_get_root(). Add a field for the label to the nfs_fattr struct, and allocate & free the label in nfs_get_root(), where we also add a call to nfs_setsecurity(). Note that for the call to nfs_setsecurity() to succeed, it's necessary to also move the logic calling security_sb_{set,clone}_security() from nfs_get_tree_common() down into nfs_get_root()... otherwise the SBLABEL_MNT flag will not be set in the super_block's security flags and nfs_setsecurity() will silently fail. Reported-by: NRichard Haines <richard_c_haines@btinternet.com> Signed-off-by: NScott Mayhew <smayhew@redhat.com> Acked-by: NStephen Smalley <sds@tycho.nsa.gov> Tested-by: NStephen Smalley <sds@tycho.nsa.gov> [PM: fixed 80-char line width problems] Signed-off-by: NPaul Moore <paul@paul-moore.com>
-
由 John Fastabend 提交于
It is not possible for the current verifier to track ALU32 and JMP ops correctly. This can result in the verifier aborting with errors even though the program should be verifiable. BPF codes that hit this can work around it by changin int variables to 64-bit types, marking variables volatile, etc. But this is all very ugly so it would be better to avoid these tricks. But, the main reason to address this now is do_refine_retval_range() was assuming return values could not be negative. Once we fixed this code that was previously working will no longer work. See do_refine_retval_range() patch for details. And we don't want to suddenly cause programs that used to work to fail. The simplest example code snippet that illustrates the problem is likely this, 53: w8 = w0 // r8 <- [0, S32_MAX], // w8 <- [-S32_MIN, X] 54: w8 <s 0 // r8 <- [0, U32_MAX] // w8 <- [0, X] The expected 64-bit and 32-bit bounds after each line are shown on the right. The current issue is without the w* bounds we are forced to use the worst case bound of [0, U32_MAX]. To resolve this type of case, jmp32 creating divergent 32-bit bounds from 64-bit bounds, we add explicit 32-bit register bounds s32_{min|max}_value and u32_{min|max}_value. Then from branch_taken logic creating new bounds we can track 32-bit bounds explicitly. The next case we observed is ALU ops after the jmp32, 53: w8 = w0 // r8 <- [0, S32_MAX], // w8 <- [-S32_MIN, X] 54: w8 <s 0 // r8 <- [0, U32_MAX] // w8 <- [0, X] 55: w8 += 1 // r8 <- [0, U32_MAX+1] // w8 <- [0, X+1] In order to keep the bounds accurate at this point we also need to track ALU32 ops. To do this we add explicit ALU32 logic for each of the ALU ops, mov, add, sub, etc. Finally there is a question of how and when to merge bounds. The cases enumerate here, 1. MOV ALU32 - zext 32-bit -> 64-bit 2. MOV ALU64 - copy 64-bit -> 32-bit 3. op ALU32 - zext 32-bit -> 64-bit 4. op ALU64 - n/a 5. jmp ALU32 - 64-bit: var32_off | upper_32_bits(var64_off) 6. jmp ALU64 - 32-bit: (>> (<< var64_off)) Details for each case, For "MOV ALU32" BPF arch zero extends so we simply copy the bounds from 32-bit into 64-bit ensuring we truncate var_off and 64-bit bounds correctly. See zext_32_to_64. For "MOV ALU64" copy all bounds including 32-bit into new register. If the src register had 32-bit bounds the dst register will as well. For "op ALU32" zero extend 32-bit into 64-bit the same as move, see zext_32_to_64. For "op ALU64" calculate both 32-bit and 64-bit bounds no merging is done here. Except we have a special case. When RSH or ARSH is done we can't simply ignore shifting bits from 64-bit reg into the 32-bit subreg. So currently just push bounds from 64-bit into 32-bit. This will be correct in the sense that they will represent a valid state of the register. However we could lose some accuracy if an ARSH is following a jmp32 operation. We can handle this special case in a follow up series. For "jmp ALU32" mark 64-bit reg unknown and recalculate 64-bit bounds from tnum by setting var_off to ((<<(>>var_off)) | var32_off). We special case if 64-bit bounds has zero'd upper 32bits at which point we can simply copy 32-bit bounds into 64-bit register. This catches a common compiler trick where upper 32-bits are zeroed and then 32-bit ops are used followed by a 64-bit compare or 64-bit op on a pointer. See __reg_combine_64_into_32(). For "jmp ALU64" cast the bounds of the 64bit to their 32-bit counterpart. For example s32_min_value = (s32)reg->smin_value. For tnum use only the lower 32bits via, (>>(<<var_off)). See __reg_combine_64_into_32(). Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/158560419880.10843.11448220440809118343.stgit@john-Precision-5820-Tower
-
由 Russell King 提交于
Add a separate set of PCS operations, which MAC drivers can use to couple phylink with their associated MAC PCS layer. The PCS operations include: - pcs_get_state() - reads the link up/down, resolved speed, duplex and pause from the PCS. - pcs_config() - configures the PCS for the specified mode, PHY interface type, and setting the advertisement. - pcs_an_restart() - restarts 802.3 in-band negotiation with the link partner - pcs_link_up() - informs the PCS that link has come up, and the parameters of the link. Link parameters are used to program the PCS for fixed speed and non-inband modes. Reviewed-by: NAndrew Lunn <andrew@lunn.ch> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NRussell King <rmk+kernel@armlinux.org.uk> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Russell King 提交于
Rename the bland 'ops' member of struct phylink to be a more descriptive 'mac_ops' - this is necessary as we're about to introduce another set of operations. Reviewed-by: NAndrew Lunn <andrew@lunn.ch> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NRussell King <rmk+kernel@armlinux.org.uk> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Russell King 提交于
Change phylink_mii_c22_pcs_set_advertisement() to take only the PHY interface and advertisement mask, rather than the full phylink state. Reviewed-by: NAndrew Lunn <andrew@lunn.ch> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NRussell King <rmk+kernel@armlinux.org.uk> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuval Basson 提交于
The qed_chain data structure was modified in commit 1a4a6975 ("qed: Chain support for external PBL") to support receiving an external pbl (due to iWARP FW requirements). The pages pointed to by the pbl are allocated in qed_chain_alloc and their virtual address are stored in an virtual addresses array to enable accessing and freeing the data. The physical addresses however weren't stored and were accessed directly from the external-pbl during free. Destroy-qp flow, leads to freeing the external pbl before the chain is freed, when the chain is freed it tries accessing the already freed external pbl, leading to a use-after-free. Therefore we need to store the physical addresses in additional to the virtual addresses in a new data structure. Fixes: 1a4a6975 ("qed: Chain support for external PBL") Signed-off-by: NMichal Kalderon <mkalderon@marvell.com> Signed-off-by: NYuval Bason <ybason@marvell.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Richard Cochran 提交于
The PTP Hardware Clock (PHC) subsystem offers an API for configuring programmable pins. User space sets or gets the settings using ioctls, and drivers verify dialed settings via a callback. Drivers may also query pin settings by calling the ptp_find_pin() method. Although the core subsystem protects concurrent access to the pin settings, the implementation places illogical restrictions on how drivers may call ptp_find_pin(). When enabling an auxiliary function via the .enable(on=1) callback, drivers may invoke the pin finding method, but when disabling with .enable(on=0) drivers are not permitted to do so. With the exception of the mv88e6xxx, all of the PHC drivers do respect this restriction, but still the locking pattern is both confusing and unnecessary. This patch changes the locking implementation to allow PHC drivers to freely call ptp_find_pin() from their .enable() and .verify() callbacks. V2 ChangeLog: - fixed spelling in the kernel doc - add Vladimir's tested by tag Signed-off-by: NRichard Cochran <richardcochran@gmail.com> Reported-by: NYangbo Lu <yangbo.lu@nxp.com> Tested-by: NVladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 3月, 2020 9 次提交
-
-
由 Alexander Aring 提交于
This patch adds rpl source routing receive handling. Everything works only if sysconf "rpl_seg_enabled" and source routing is enabled. Mostly the same behaviour as IPv6 segmentation routing. To handle compression and uncompression a rpl.c file is created which contains the necessary functionality. The receive handling will also care about IPv6 encapsulated so far it's specified as possible nexthdr in RFC 6554. Signed-off-by: NAlexander Aring <alex.aring@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Peter Krystad 提交于
Process the MP_JOIN option in a SYN packet with the same flow as MP_CAPABLE but when the third ACK is received add the subflow to the MPTCP socket subflow list instead of adding it to the TCP socket accept queue. The subflow is added at the end of the subflow list so it will not interfere with the existing subflows operation and no data is expected to be transmitted on it. Co-developed-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NFlorian Westphal <fw@strlen.de> Co-developed-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NPeter Krystad <peter.krystad@linux.intel.com> Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Peter Krystad 提交于
Add handling for sending and receiving the ADD_ADDR, ADD_ADDR6, and RM_ADDR suboptions. Co-developed-by: NMatthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: NMatthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NPeter Krystad <peter.krystad@linux.intel.com> Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Cambda Zhu 提交于
The SKB_SGO_CB_OFFSET should be SKB_GSO_CB_OFFSET which means the offset of the GSO in skb cb. This patch fixes the typo. Fixes: 9207f9d4 ("net: preserve IP control block during GSO segmentation") Signed-off-by: NCambda Zhu <cambda@linux.alibaba.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 KP Singh 提交于
JITed BPF programs are dynamically attached to the LSM hooks using BPF trampolines. The trampoline prologue generates code to handle conversion of the signature of the hook to the appropriate BPF context. The allocated trampoline programs are attached to the nop functions initialized as LSM hooks. BPF_PROG_TYPE_LSM programs must have a GPL compatible license and and need CAP_SYS_ADMIN (required for loading eBPF programs). Upon attachment: * A BPF fexit trampoline is used for LSM hooks with a void return type. * A BPF fmod_ret trampoline is used for LSM hooks which return an int. The attached programs can override the return value of the bpf LSM hook to indicate a MAC Policy decision. Signed-off-by: NKP Singh <kpsingh@google.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NBrendan Jackman <jackmanb@google.com> Reviewed-by: NFlorent Revest <revest@google.com> Acked-by: NAndrii Nakryiko <andriin@fb.com> Acked-by: NJames Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/bpf/20200329004356.27286-5-kpsingh@chromium.org
-
由 KP Singh 提交于
When CONFIG_BPF_LSM is enabled, nop functions, bpf_lsm_<hook_name>, are generated for each LSM hook. These functions are initialized as LSM hooks in a subsequent patch. Signed-off-by: NKP Singh <kpsingh@google.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NBrendan Jackman <jackmanb@google.com> Reviewed-by: NFlorent Revest <revest@google.com> Reviewed-by: NKees Cook <keescook@chromium.org> Acked-by: NYonghong Song <yhs@fb.com> Acked-by: NJames Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/bpf/20200329004356.27286-4-kpsingh@chromium.org
-
由 KP Singh 提交于
The information about the different types of LSM hooks is scattered in two locations i.e. union security_list_options and struct security_hook_heads. Rather than duplicating this information even further for BPF_PROG_TYPE_LSM, define all the hooks with the LSM_HOOK macro in lsm_hook_defs.h which is then used to generate all the data structures required by the LSM framework. The LSM hooks are defined as: LSM_HOOK(<return_type>, <default_value>, <hook_name>, args...) with <default_value> acccessible in security.c as: LSM_RET_DEFAULT(<hook_name>) Signed-off-by: NKP Singh <kpsingh@google.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NBrendan Jackman <jackmanb@google.com> Reviewed-by: NFlorent Revest <revest@google.com> Reviewed-by: NKees Cook <keescook@chromium.org> Reviewed-by: NCasey Schaufler <casey@schaufler-ca.com> Acked-by: NJames Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/bpf/20200329004356.27286-3-kpsingh@chromium.org
-
由 KP Singh 提交于
Introduce types and configs for bpf programs that can be attached to LSM hooks. The programs can be enabled by the config option CONFIG_BPF_LSM. Signed-off-by: NKP Singh <kpsingh@google.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NBrendan Jackman <jackmanb@google.com> Reviewed-by: NFlorent Revest <revest@google.com> Reviewed-by: NThomas Garnier <thgarnie@google.com> Acked-by: NYonghong Song <yhs@fb.com> Acked-by: NAndrii Nakryiko <andriin@fb.com> Acked-by: NJames Morris <jamorris@linux.microsoft.com> Link: https://lore.kernel.org/bpf/20200329004356.27286-2-kpsingh@chromium.org
-
由 Roman Gushchin 提交于
Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the space for task stacks can be allocated using __vmalloc_node_range(), alloc_pages_node() and kmem_cache_alloc_node(). In the first and the second cases page->mem_cgroup pointer is set, but in the third it's not: memcg membership of a slab page should be determined using the memcg_from_slab_page() function, which looks at page->slab_cache->memcg_params.memcg . In this case, using mod_memcg_page_state() (as in account_kernel_stack()) is incorrect: page->mem_cgroup pointer is NULL even for pages charged to a non-root memory cgroup. It can lead to kernel_stack per-memcg counters permanently showing 0 on some architectures (depending on the configuration). In order to fix it, let's introduce a mod_memcg_obj_state() helper, which takes a pointer to a kernel object as a first argument, uses mem_cgroup_from_obj() to get a RCU-protected memcg pointer and calls mod_memcg_state(). It allows to handle all possible configurations (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without spilling any memcg/kmem specifics into fork.c . Note: This is a special version of the patch created for stable backports. It contains code from the following two patches: - mm: memcg/slab: introduce mem_cgroup_from_obj() - mm: fork: fix kernel_stack memcg stats for various stack implementations [guro@fb.com: introduce mem_cgroup_from_obj()] Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages") Signed-off-by: NRoman Gushchin <guro@fb.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Bharata B Rao <bharata@linux.ibm.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 29 3月, 2020 1 次提交
-
-
由 Toke Høiland-Jørgensen 提交于
While it is currently possible for userspace to specify that an existing XDP program should not be replaced when attaching to an interface, there is no mechanism to safely replace a specific XDP program with another. This patch adds a new netlink attribute, IFLA_XDP_EXPECTED_FD, which can be set along with IFLA_XDP_FD. If set, the kernel will check that the program currently loaded on the interface matches the expected one, and fail the operation if it does not. This corresponds to a 'cmpxchg' memory operation. Setting the new attribute with a negative value means that no program is expected to be attached, which corresponds to setting the UPDATE_IF_NOEXIST flag. A new companion flag, XDP_FLAGS_REPLACE, is also added to explicitly request checking of the EXPECTED_FD attribute. This is needed for userspace to discover whether the kernel supports the new attribute. Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Reviewed-by: NJakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/158515700640.92963.3551295145441017022.stgit@toke.dk
-
- 28 3月, 2020 10 次提交
-
-
由 Thomas Gleixner 提交于
Bit spinlocks are problematic if PREEMPT_RT is enabled, because they disable preemption, which is undesired for latency reasons and breaks when regular spinlocks are taken within the bit_spinlock locked region because regular spinlocks are converted to 'sleeping spinlocks' on RT. PREEMPT_RT replaced the bit spinlocks with regular spinlocks to avoid this problem. The replacement was done conditionaly at compile time, but Christoph requested to do an unconditional conversion. Jan suggested to move the spinlock into a existing padding hole which avoids a size increase of struct buffer_head on production kernels. As a benefit the lock gains lockdep coverage. [ bigeasy: Remove the wrapper and use always spinlock_t and move it into the padding hole ] Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NJan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Link: https://lkml.kernel.org/r/20191118132824.rclhrbujqh4b4g4d@linutronix.de
-
由 Thomas Gleixner 提交于
A recent change to freeze_secondary_cpus() which added an early abort if a wakeup is pending missed the fact that the function is also invoked for shutdown, reboot and kexec via disable_nonboot_cpus(). In case of disable_nonboot_cpus() the wakeup event needs to be ignored as the purpose is to terminate the currently running kernel. Add a 'suspend' argument which is only set when the freeze is in context of a suspend operation. If not set then an eventually pending wakeup event is ignored. Fixes: a66d955e ("cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending") Reported-by: NBoqun Feng <boqun.feng@gmail.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Pavankumar Kondeti <pkondeti@codeaurora.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/874kuaxdiz.fsf@nanos.tec.linutronix.de
-
由 Daniel Borkmann 提交于
Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(), recvmsg() and bind-related hooks in order to retrieve the cgroup v2 context which can then be used as part of the key for BPF map lookups, for example. Given these hooks operate in process context 'current' is always valid and pointing to the app that is performing mentioned syscalls if it's subject to a v2 cgroup. Also with same motivation of commit 77236281 ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper") enable retrieval of ancestor from current so the cgroup id can be used for policy lookups which can then forbid connect() / bind(), for example. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
-
由 Daniel Borkmann 提交于
In Cilium we're mainly using BPF cgroup hooks today in order to implement kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*), ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic between Cilium managed nodes. While this works in its current shape and avoids packet-level NAT for inter Cilium managed node traffic, there is one major limitation we're facing today, that is, lack of netns awareness. In Kubernetes, the concept of Pods (which hold one or multiple containers) has been built around network namespaces, so while we can use the global scope of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing NodePort ports on loopback addresses), we also have the need to differentiate between initial network namespaces and non-initial one. For example, ExternalIP services mandate that non-local service IPs are not to be translated from the host (initial) network namespace as one example. Right now, we have an ugly work-around in place where non-local service IPs for ExternalIP services are not xlated from connect() and friends BPF hooks but instead via less efficient packet-level NAT on the veth tc ingress hook for Pod traffic. On top of determining whether we're in initial or non-initial network namespace we also have a need for a socket-cookie like mechanism for network namespaces scope. Socket cookies have the nice property that they can be combined as part of the key structure e.g. for BPF LRU maps without having to worry that the cookie could be recycled. We are planning to use this for our sessionAffinity implementation for services. Therefore, add a new bpf_get_netns_cookie() helper which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would provide the cookie for the initial network namespace while passing the context instead of NULL would provide the cookie from the application's network namespace. We're using a hole, so no size increase; the assignment happens only once. Therefore this allows for a comparison on initial namespace as well as regular cookie usage as we have today with socket cookies. We could later on enable this helper for other program types as well as we would see need. (*) Both externalTrafficPolicy={Local|Cluster} types [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.cSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
-
The BCM7XX PHY family requires special configuration to pass jumbo frames. Do that during initial PHY setup. Signed-off-by: NMurali Krishna Policharla <murali.policharla@broadcom.com> Reviewed-by: NScott Branden <scott.branden@broadcom.com> Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com> Acked-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vasundhara Volam 提交于
This patch adds a new macro for serial number keyword. Acked-by: NBjorn Helgaas <bhelgaas@google.com> Signed-off-by: NVasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: NMichael Chan <michael.chan@broadcom.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Chaitanya Kulkarni 提交于
Add a helper to stringify the zone conditions. We use this helper in the next patch to track zone conditions in tracepoints. Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com> Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
The bio_map_* helpers are just the low-level helpers for the blk_rq_map_* APIs. Move them together for better logical grouping, as no there isn't much overlap with other code in bio.c. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Current make_request based drivers use either blk_alloc_queue_node or blk_alloc_queue to allocate a queue, and then set up the make_request_fn function pointer and a few parameters using the blk_queue_make_request helper. Simplify this by passing the make_request pointer to blk_alloc_queue, and while at it merge the _node variant into the main helper by always passing a node_id, and remove the superfluous gfp_mask parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
This allows a driver to pass a queuedata member before ->init_hctx is called. null_blk currently open codes this logic, but I'd rather have it in the core to ease future maintainance. Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 27 3月, 2020 10 次提交
-
-
由 Christoph Hellwig 提交于
There really isn't any good reason to stash a method directly into struct gendisk. Move it together with the other block device operations. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Benjamin Thiel 提交于
... in order to fix a -Wmissing-ptototypes warning: arch/x86/platform/efi/quirks.c:245:13: warning: no previous prototype for ‘efi_arch_mem_reserve’ [-Wmissing-prototypes] \ void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size) Signed-off-by: NBenjamin Thiel <b.thiel@posteo.de> Signed-off-by: NBorislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200326135041.3264-1-b.thiel@posteo.de
-
由 Antoine Tenart 提交于
This patch adds a reference to MACsec ops to the net_device structure, allowing net device drivers to implement offloading operations for MACsec. Signed-off-by: NAntoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: NMark Starovoytov <mstarovoitov@marvell.com> Signed-off-by: NIgor Russkikh <irusskikh@marvell.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Antoine Tenart 提交于
This patch introduce a new netdev feature, which will be used by drivers to state they can perform MACsec transformations in hardware. The patchset was gathered by Mark, macsec functinality itself was implemented by Dmitry, Mark and Pavel Belous. Signed-off-by: NAntoine Tenart <antoine.tenart@bootlin.com> Signed-off-by: NMark Starovoytov <mstarovoitov@marvell.com> Signed-off-by: NIgor Russkikh <irusskikh@marvell.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Al Viro 提交于
Straightforward, except for save_altstack_ex() stuck in those. Replace that thing with an analogue that would use unsafe_put_user() instead of put_user_ex() (called compat_save_altstack()) and be done with that. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
* move ata_eh_analyze_ncq_error() and ata_eh_read_log_10h() to libata-sata.c * add static inline for ata_eh_analyze_ncq_error() for CONFIG_SATA_HOST=n case (link->sactive is non-zero only if NCQ commands are actually queued so empty function body is sufficient) Code size savings on m68k arch using (modified) atari_defconfig: text data bss dec hex filename before: 16164 18 0 16182 3f36 drivers/ata/libata-eh.o after: 15446 18 0 15464 3c68 drivers/ata/libata-eh.o Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
Start separating SATA specific code from libata-eh.c: * move sata_async_notification() to libata-sata.c: Code size savings on m68k arch using (modified) atari_defconfig: text data bss dec hex filename before: 16243 18 0 16261 3f85 drivers/ata/libata-eh.o after: 16164 18 0 16182 3f36 drivers/ata/libata-eh.o Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
* un-inline: - ata_scsi_dump_cdb() - __ata_scsi_queuecmd() * un-static: - ata_scsi_sdev_config() - ata_scsi_dev_config() - ata_scsi_dump_cdb() - __ata_scsi_queuecmd() * move ata_sas_*() to libata-sata.c: * add static inlines for CONFIG_SATA_HOST=n case for ata_sas_{allocate,free}_tag() Code size savings on m68k arch using (modified) atari_defconfig: text data bss dec hex filename before: 19137 23 576 19736 4d18 drivers/ata/libata-scsi.o after: 18330 23 576 18929 49f1 drivers/ata/libata-scsi.o Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
Start separating SATA specific code from libata-scsi.c: * un-static ata_scsi_find_dev() * move following code to libata-sata.c: - SATA only sysfs device attributes handling - __ata_change_queue_depth() - ata_scsi_change_queue_depth() * cover with CONFIG_SATA_HOST ifdef SATA only sysfs device attributes handling code and ATA_SHT_NCQ() macro in <linux/libata.h> Code size savings on m68k arch using (modified) atari_defconfig: text data bss dec hex filename before: 20702 105 576 21383 5387 drivers/ata/libata-scsi.o after: 19137 23 576 19736 4d18 drivers/ata/libata-scsi.o Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
* move sata_deb_timing_*() to libata-sata.c * add static inline for sata_ehc_deb_timing() for CONFIG_SATA_HOST=n case Code size savings on m68k arch using (modified) atari_defconfig: text data bss dec hex filename before: 32158 572 40 32770 8002 drivers/ata/libata-core.o after: 32015 572 40 32627 7f73 drivers/ata/libata-core.o Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-