提交 · db1f00fb8ff793889e83f2e37e0c7bbb6fc9934e · openeuler / Kernel

07 4月, 2020 1 次提交

skbuff.h: Improve the checksum related comments · db1f00fb

由 Dexuan Cui 提交于 4月 05, 2020

Fixed the punctuation and some typos.
Improved some sentences with minor changes.

No change of semantics or code.
Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NRandy Dunlap <rdunlap@infradead.org>
Signed-off-by: NDexuan Cui <decui@microsoft.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

db1f00fb

31 3月, 2020 9 次提交

bpf: Implement bpf_prog replacement for an active bpf_cgroup_link · 0c991ebc

由 Andrii Nakryiko 提交于 3月 29, 2020

Add new operation (LINK_UPDATE), which allows to replace active bpf_prog from
under given bpf_link. Currently this is only supported for bpf_cgroup_link,
but will be extended to other kinds of bpf_links in follow-up patches.

For bpf_cgroup_link, implemented functionality matches existing semantics for
direct bpf_prog attachment (including BPF_F_REPLACE flag). User can either
unconditionally set new bpf_prog regardless of which bpf_prog is currently
active under given bpf_link, or, optionally, can specify expected active
bpf_prog. If active bpf_prog doesn't match expected one, no changes are
performed, old bpf_link stays intact and attached, operation returns
a failure.

cgroup_bpf_replace() operation is resolving race between auto-detachment and
bpf_prog update in the same fashion as it's done for bpf_link detachment,
except in this case update has no way of succeeding because of target cgroup
marked as dying. So in this case error is returned.
Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200330030001.2312810-3-andriin@fb.com

0c991ebc

bpf: Implement bpf_link-based cgroup BPF program attachment · af6eea57

由 Andrii Nakryiko 提交于 3月 29, 2020

Implement new sub-command to attach cgroup BPF programs and return FD-based
bpf_link back on success. bpf_link, once attached to cgroup, cannot be
replaced, except by owner having its FD. Cgroup bpf_link supports only
BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI
attachments can be freely intermixed.

To prevent bpf_cgroup_link from keeping cgroup alive past the point when no
BPF program can be executed, implement auto-detachment of link. When
cgroup_bpf_release() is called, all attached bpf_links are forced to release
cgroup refcounts, but they leave bpf_link otherwise active and allocated, as
well as still owning underlying bpf_prog. This is because user-space might
still have FDs open and active, so bpf_link as a user-referenced object can't
be freed yet. Once last active FD is closed, bpf_link will be freed and
underlying bpf_prog refcount will be dropped. But cgroup refcount won't be
touched, because cgroup is released already.

The inherent race between bpf_cgroup_link release (from closing last FD) and
cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So
the only additional check required is when bpf_cgroup_link attempts to detach
itself from cgroup. At that time we need to check whether there is still
cgroup associated with that link. And if not, exit with success, because
bpf_cgroup_link was already successfully detached.
Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NRoman Gushchin <guro@fb.com>
Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com

af6eea57

NFS: Ensure security label is set for root inode · 779df6a5

由 Scott Mayhew 提交于 3月 03, 2020

When using NFSv4.2, the security label for the root inode should be set
via a call to nfs_setsecurity() during the mount process, otherwise the
inode will appear as unlabeled for up to acdirmin seconds.  Currently
the label for the root inode is allocated, retrieved, and freed entirely
witin nfs4_proc_get_root().

Add a field for the label to the nfs_fattr struct, and allocate & free
the label in nfs_get_root(), where we also add a call to
nfs_setsecurity().  Note that for the call to nfs_setsecurity() to
succeed, it's necessary to also move the logic calling
security_sb_{set,clone}_security() from nfs_get_tree_common() down into
nfs_get_root()... otherwise the SBLABEL_MNT flag will not be set in the
super_block's security flags and nfs_setsecurity() will silently fail.
Reported-by: NRichard Haines <richard_c_haines@btinternet.com>
Signed-off-by: NScott Mayhew <smayhew@redhat.com>
Acked-by: NStephen Smalley <sds@tycho.nsa.gov>
Tested-by: NStephen Smalley <sds@tycho.nsa.gov>
[PM: fixed 80-char line width problems]
Signed-off-by: NPaul Moore <paul@paul-moore.com>

779df6a5

bpf: Verifier, do explicit ALU32 bounds tracking · 3f50f132

由 John Fastabend 提交于 3月 30, 2020

It is not possible for the current verifier to track ALU32 and JMP ops
correctly. This can result in the verifier aborting with errors even though
the program should be verifiable. BPF codes that hit this can work around
it by changin int variables to 64-bit types, marking variables volatile,
etc. But this is all very ugly so it would be better to avoid these tricks.

But, the main reason to address this now is do_refine_retval_range() was
assuming return values could not be negative. Once we fixed this code that
was previously working will no longer work. See do_refine_retval_range()
patch for details. And we don't want to suddenly cause programs that used
to work to fail.

The simplest example code snippet that illustrates the problem is likely
this,

 53: w8 = w0                    // r8 <- [0, S32_MAX],
                                // w8 <- [-S32_MIN, X]
 54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                // w8 <- [0, X]

The expected 64-bit and 32-bit bounds after each line are shown on the
right. The current issue is without the w* bounds we are forced to use
the worst case bound of [0, U32_MAX]. To resolve this type of case,
jmp32 creating divergent 32-bit bounds from 64-bit bounds, we add explicit
32-bit register bounds s32_{min|max}_value and u32_{min|max}_value. Then
from branch_taken logic creating new bounds we can track 32-bit bounds
explicitly.

The next case we observed is ALU ops after the jmp32,

 53: w8 = w0                    // r8 <- [0, S32_MAX],
                                // w8 <- [-S32_MIN, X]
 54: w8 <s 0                    // r8 <- [0, U32_MAX]
                                // w8 <- [0, X]
 55: w8 += 1                    // r8 <- [0, U32_MAX+1]
                                // w8 <- [0, X+1]

In order to keep the bounds accurate at this point we also need to track
ALU32 ops. To do this we add explicit ALU32 logic for each of the ALU
ops, mov, add, sub, etc.

Finally there is a question of how and when to merge bounds. The cases
enumerate here,

1. MOV ALU32   - zext 32-bit -> 64-bit
2. MOV ALU64   - copy 64-bit -> 32-bit
3. op  ALU32   - zext 32-bit -> 64-bit
4. op  ALU64   - n/a
5. jmp ALU32   - 64-bit: var32_off | upper_32_bits(var64_off)
6. jmp ALU64   - 32-bit: (>> (<< var64_off))

Details for each case,

For "MOV ALU32" BPF arch zero extends so we simply copy the bounds
from 32-bit into 64-bit ensuring we truncate var_off and 64-bit
bounds correctly. See zext_32_to_64.

For "MOV ALU64" copy all bounds including 32-bit into new register. If
the src register had 32-bit bounds the dst register will as well.

For "op ALU32" zero extend 32-bit into 64-bit the same as move,
see zext_32_to_64.

For "op ALU64" calculate both 32-bit and 64-bit bounds no merging
is done here. Except we have a special case. When RSH or ARSH is
done we can't simply ignore shifting bits from 64-bit reg into the
32-bit subreg. So currently just push bounds from 64-bit into 32-bit.
This will be correct in the sense that they will represent a valid
state of the register. However we could lose some accuracy if an
ARSH is following a jmp32 operation. We can handle this special
case in a follow up series.

For "jmp ALU32" mark 64-bit reg unknown and recalculate 64-bit bounds
from tnum by setting var_off to ((<<(>>var_off)) | var32_off). We
special case if 64-bit bounds has zero'd upper 32bits at which point
we can simply copy 32-bit bounds into 64-bit register. This catches
a common compiler trick where upper 32-bits are zeroed and then
32-bit ops are used followed by a 64-bit compare or 64-bit op on
a pointer. See __reg_combine_64_into_32().

For "jmp ALU64" cast the bounds of the 64bit to their 32-bit
counterpart. For example s32_min_value = (s32)reg->smin_value. For
tnum use only the lower 32bits via, (>>(<<var_off)). See
__reg_combine_64_into_32().
Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158560419880.10843.11448220440809118343.stgit@john-Precision-5820-Tower

3f50f132

net: phylink: add separate pcs operations structure · 4c0d6d3a

由 Russell King 提交于 3月 30, 2020

Add a separate set of PCS operations, which MAC drivers can use to
couple phylink with their associated MAC PCS layer.  The PCS
operations include:

- pcs_get_state() - reads the link up/down, resolved speed, duplex
   and pause from the PCS.
- pcs_config() - configures the PCS for the specified mode, PHY
   interface type, and setting the advertisement.
- pcs_an_restart() - restarts 802.3 in-band negotiation with the
   link partner
- pcs_link_up() - informs the PCS that link has come up, and the
   parameters of the link. Link parameters are used to program the
   PCS for fixed speed and non-inband modes.
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NRussell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4c0d6d3a

net: phylink: rename 'ops' to 'mac_ops' · e7765d63

由 Russell King 提交于 3月 30, 2020

Rename the bland 'ops' member of struct phylink to be a more
descriptive 'mac_ops' - this is necessary as we're about to introduce
another set of operations.
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NRussell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e7765d63

net: phylink: change phylink_mii_c22_pcs_set_advertisement() prototype · 0bd27406

由 Russell King 提交于 3月 30, 2020

Change phylink_mii_c22_pcs_set_advertisement() to take only the PHY
interface and advertisement mask, rather than the full phylink state.
Reviewed-by: NAndrew Lunn <andrew@lunn.ch>
Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NRussell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0bd27406

qed: Fix use after free in qed_chain_free · 8063f761

由 Yuval Basson 提交于 3月 29, 2020

The qed_chain data structure was modified in
commit 1a4a6975 ("qed: Chain support for external PBL") to support
receiving an external pbl (due to iWARP FW requirements).
The pages pointed to by the pbl are allocated in qed_chain_alloc
and their virtual address are stored in an virtual addresses array to
enable accessing and freeing the data. The physical addresses however
weren't stored and were accessed directly from the external-pbl
during free.

Destroy-qp flow, leads to freeing the external pbl before the chain is
freed, when the chain is freed it tries accessing the already freed
external pbl, leading to a use-after-free. Therefore we need to store
the physical addresses in additional to the virtual addresses in a
new data structure.

Fixes: 1a4a6975 ("qed: Chain support for external PBL")
Signed-off-by: NMichal Kalderon <mkalderon@marvell.com>
Signed-off-by: NYuval Bason <ybason@marvell.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8063f761

ptp: Avoid deadlocks in the programmable pin code. · 62582a7e

由 Richard Cochran 提交于 3月 29, 2020

The PTP Hardware Clock (PHC) subsystem offers an API for configuring
programmable pins.  User space sets or gets the settings using ioctls,
and drivers verify dialed settings via a callback.  Drivers may also
query pin settings by calling the ptp_find_pin() method.

Although the core subsystem protects concurrent access to the pin
settings, the implementation places illogical restrictions on how
drivers may call ptp_find_pin().  When enabling an auxiliary function
via the .enable(on=1) callback, drivers may invoke the pin finding
method, but when disabling with .enable(on=0) drivers are not
permitted to do so.  With the exception of the mv88e6xxx, all of the
PHC drivers do respect this restriction, but still the locking pattern
is both confusing and unnecessary.

This patch changes the locking implementation to allow PHC drivers to
freely call ptp_find_pin() from their .enable() and .verify()
callbacks.

V2 ChangeLog:
- fixed spelling in the kernel doc
- add Vladimir's tested by tag
Signed-off-by: NRichard Cochran <richardcochran@gmail.com>
Reported-by: NYangbo Lu <yangbo.lu@nxp.com>
Tested-by: NVladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

62582a7e

30 3月, 2020 9 次提交

net: ipv6: add support for rpl sr exthdr · 8610c7c6

由 Alexander Aring 提交于 3月 27, 2020

This patch adds rpl source routing receive handling. Everything works
only if sysconf "rpl_seg_enabled" and source routing is enabled. Mostly
the same behaviour as IPv6 segmentation routing. To handle compression
and uncompression a rpl.c file is created which contains the necessary
functionality. The receive handling will also care about IPv6
encapsulated so far it's specified as possible nexthdr in RFC 6554.
Signed-off-by: NAlexander Aring <alex.aring@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8610c7c6

mptcp: Add handling of incoming MP_JOIN requests · f296234c

由 Peter Krystad 提交于 3月 27, 2020

Process the MP_JOIN option in a SYN packet with the same flow
as MP_CAPABLE but when the third ACK is received add the
subflow to the MPTCP socket subflow list instead of adding it to
the TCP socket accept queue.

The subflow is added at the end of the subflow list so it will not
interfere with the existing subflows operation and no data is
expected to be transmitted on it.
Co-developed-by: NFlorian Westphal <fw@strlen.de>
Signed-off-by: NFlorian Westphal <fw@strlen.de>
Co-developed-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NPeter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f296234c

mptcp: Add ADD_ADDR handling · 3df523ab

由 Peter Krystad 提交于 3月 27, 2020

Add handling for sending and receiving the ADD_ADDR, ADD_ADDR6,
and RM_ADDR suboptions.
Co-developed-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
Signed-off-by: NPeter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3df523ab

net: Fix typo of SKB_SGO_CB_OFFSET · a08e7fd9

由 Cambda Zhu 提交于 3月 26, 2020

The SKB_SGO_CB_OFFSET should be SKB_GSO_CB_OFFSET which means the
offset of the GSO in skb cb. This patch fixes the typo.

Fixes: 9207f9d4 ("net: preserve IP control block during GSO segmentation")
Signed-off-by: NCambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a08e7fd9

bpf: lsm: Implement attach, detach and execution · 9e4e01df

由 KP Singh 提交于 3月 29, 2020

JITed BPF programs are dynamically attached to the LSM hooks
using BPF trampolines. The trampoline prologue generates code to handle
conversion of the signature of the hook to the appropriate BPF context.

The allocated trampoline programs are attached to the nop functions
initialized as LSM hooks.

BPF_PROG_TYPE_LSM programs must have a GPL compatible license and
and need CAP_SYS_ADMIN (required for loading eBPF programs).

Upon attachment:

* A BPF fexit trampoline is used for LSM hooks with a void return type.
* A BPF fmod_ret trampoline is used for LSM hooks which return an
  int. The attached programs can override the return value of the
  bpf LSM hook to indicate a MAC Policy decision.
Signed-off-by: NKP Singh <kpsingh@google.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Reviewed-by: NBrendan Jackman <jackmanb@google.com>
Reviewed-by: NFlorent Revest <revest@google.com>
Acked-by: NAndrii Nakryiko <andriin@fb.com>
Acked-by: NJames Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/bpf/20200329004356.27286-5-kpsingh@chromium.org

9e4e01df

bpf: lsm: Provide attachment points for BPF LSM programs · 9d3fdea7

由 KP Singh 提交于 3月 29, 2020

When CONFIG_BPF_LSM is enabled, nop functions, bpf_lsm_<hook_name>, are
generated for each LSM hook. These functions are initialized as LSM
hooks in a subsequent patch.
Signed-off-by: NKP Singh <kpsingh@google.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Reviewed-by: NBrendan Jackman <jackmanb@google.com>
Reviewed-by: NFlorent Revest <revest@google.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Acked-by: NYonghong Song <yhs@fb.com>
Acked-by: NJames Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/bpf/20200329004356.27286-4-kpsingh@chromium.org

9d3fdea7

security: Refactor declaration of LSM hooks · 98e828a0

由 KP Singh 提交于 3月 29, 2020

The information about the different types of LSM hooks is scattered
in two locations i.e. union security_list_options and
struct security_hook_heads. Rather than duplicating this information
even further for BPF_PROG_TYPE_LSM, define all the hooks with the
LSM_HOOK macro in lsm_hook_defs.h which is then used to generate all
the data structures required by the LSM framework.

The LSM hooks are defined as:

  LSM_HOOK(<return_type>, <default_value>, <hook_name>, args...)

with <default_value> acccessible in security.c as:

  LSM_RET_DEFAULT(<hook_name>)
Signed-off-by: NKP Singh <kpsingh@google.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Reviewed-by: NBrendan Jackman <jackmanb@google.com>
Reviewed-by: NFlorent Revest <revest@google.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Reviewed-by: NCasey Schaufler <casey@schaufler-ca.com>
Acked-by: NJames Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/bpf/20200329004356.27286-3-kpsingh@chromium.org

98e828a0

bpf: Introduce BPF_PROG_TYPE_LSM · fc611f47

由 KP Singh 提交于 3月 29, 2020

Introduce types and configs for bpf programs that can be attached to
LSM hooks. The programs can be enabled by the config option
CONFIG_BPF_LSM.
Signed-off-by: NKP Singh <kpsingh@google.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Reviewed-by: NBrendan Jackman <jackmanb@google.com>
Reviewed-by: NFlorent Revest <revest@google.com>
Reviewed-by: NThomas Garnier <thgarnie@google.com>
Acked-by: NYonghong Song <yhs@fb.com>
Acked-by: NAndrii Nakryiko <andriin@fb.com>
Acked-by: NJames Morris <jamorris@linux.microsoft.com>
Link: https://lore.kernel.org/bpf/20200329004356.27286-2-kpsingh@chromium.org

fc611f47

mm: fork: fix kernel_stack memcg stats for various stack implementations · 8380ce47

由 Roman Gushchin 提交于 3月 28, 2020

Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the
space for task stacks can be allocated using __vmalloc_node_range(),
alloc_pages_node() and kmem_cache_alloc_node().

In the first and the second cases page->mem_cgroup pointer is set, but
in the third it's not: memcg membership of a slab page should be
determined using the memcg_from_slab_page() function, which looks at
page->slab_cache->memcg_params.memcg .  In this case, using
mod_memcg_page_state() (as in account_kernel_stack()) is incorrect:
page->mem_cgroup pointer is NULL even for pages charged to a non-root
memory cgroup.

It can lead to kernel_stack per-memcg counters permanently showing 0 on
some architectures (depending on the configuration).

In order to fix it, let's introduce a mod_memcg_obj_state() helper,
which takes a pointer to a kernel object as a first argument, uses
mem_cgroup_from_obj() to get a RCU-protected memcg pointer and calls
mod_memcg_state().  It allows to handle all possible configurations
(CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without
spilling any memcg/kmem specifics into fork.c .

Note: This is a special version of the patch created for stable
backports.  It contains code from the following two patches:
  - mm: memcg/slab: introduce mem_cgroup_from_obj()
  - mm: fork: fix kernel_stack memcg stats for various stack implementations

[guro@fb.com: introduce mem_cgroup_from_obj()]
  Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
Signed-off-by: NRoman Gushchin <guro@fb.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Bharata B Rao <bharata@linux.ibm.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8380ce47

29 3月, 2020 1 次提交

xdp: Support specifying expected existing program when attaching XDP · 92234c8f

由 Toke Høiland-Jørgensen 提交于 3月 25, 2020

While it is currently possible for userspace to specify that an existing
XDP program should not be replaced when attaching to an interface, there is
no mechanism to safely replace a specific XDP program with another.

This patch adds a new netlink attribute, IFLA_XDP_EXPECTED_FD, which can be
set along with IFLA_XDP_FD. If set, the kernel will check that the program
currently loaded on the interface matches the expected one, and fail the
operation if it does not. This corresponds to a 'cmpxchg' memory operation.
Setting the new attribute with a negative value means that no program is
expected to be attached, which corresponds to setting the UPDATE_IF_NOEXIST
flag.

A new companion flag, XDP_FLAGS_REPLACE, is also added to explicitly
request checking of the EXPECTED_FD attribute. This is needed for userspace
to discover whether the kernel supports the new attribute.
Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Reviewed-by: NJakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/158515700640.92963.3551295145441017022.stgit@toke.dk

92234c8f

28 3月, 2020 10 次提交

fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t · f1e67e35

由 Thomas Gleixner 提交于 11月 18, 2019

Bit spinlocks are problematic if PREEMPT_RT is enabled, because they
disable preemption, which is undesired for latency reasons and breaks when
regular spinlocks are taken within the bit_spinlock locked region because
regular spinlocks are converted to 'sleeping spinlocks' on RT.

PREEMPT_RT replaced the bit spinlocks with regular spinlocks to avoid this
problem. The replacement was done conditionaly at compile time, but
Christoph requested to do an unconditional conversion.

Jan suggested to move the spinlock into a existing padding hole which
avoids a size increase of struct buffer_head on production kernels.

As a benefit the lock gains lockdep coverage.

[ bigeasy: Remove the wrapper and use always spinlock_t and move it into
           the padding hole ]
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Link: https://lkml.kernel.org/r/20191118132824.rclhrbujqh4b4g4d@linutronix.de

f1e67e35

cpu/hotplug: Ignore pm_wakeup_pending() for disable_nonboot_cpus() · e98eac6f

由 Thomas Gleixner 提交于 3月 27, 2020

A recent change to freeze_secondary_cpus() which added an early abort if a
wakeup is pending missed the fact that the function is also invoked for
shutdown, reboot and kexec via disable_nonboot_cpus().

In case of disable_nonboot_cpus() the wakeup event needs to be ignored as
the purpose is to terminate the currently running kernel.

Add a 'suspend' argument which is only set when the freeze is in context of
a suspend operation. If not set then an eventually pending wakeup event is
ignored.

Fixes: a66d955e ("cpu/hotplug: Abort disabling secondary CPUs if wakeup is pending")
Reported-by: NBoqun Feng <boqun.feng@gmail.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/874kuaxdiz.fsf@nanos.tec.linutronix.de

e98eac6f

bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id · 0f09abd1

由 Daniel Borkmann 提交于 3月 27, 2020

Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(),
recvmsg() and bind-related hooks in order to retrieve the cgroup v2
context which can then be used as part of the key for BPF map lookups,
for example. Given these hooks operate in process context 'current' is
always valid and pointing to the app that is performing mentioned
syscalls if it's subject to a v2 cgroup. Also with same motivation of
commit 77236281 ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper")
enable retrieval of ancestor from current so the cgroup id can be used
for policy lookups which can then forbid connect() / bind(), for example.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net

0f09abd1

bpf: Add netns cookie and enable it for bpf cgroup hooks · f318903c

由 Daniel Borkmann 提交于 3月 27, 2020

In Cilium we're mainly using BPF cgroup hooks today in order to implement
kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
between Cilium managed nodes. While this works in its current shape and avoids
packet-level NAT for inter Cilium managed node traffic, there is one major
limitation we're facing today, that is, lack of netns awareness.

In Kubernetes, the concept of Pods (which hold one or multiple containers)
has been built around network namespaces, so while we can use the global scope
of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
NodePort ports on loopback addresses), we also have the need to differentiate
between initial network namespaces and non-initial one. For example, ExternalIP
services mandate that non-local service IPs are not to be translated from the
host (initial) network namespace as one example. Right now, we have an ugly
work-around in place where non-local service IPs for ExternalIP services are
not xlated from connect() and friends BPF hooks but instead via less efficient
packet-level NAT on the veth tc ingress hook for Pod traffic.

On top of determining whether we're in initial or non-initial network namespace
we also have a need for a socket-cookie like mechanism for network namespaces
scope. Socket cookies have the nice property that they can be combined as part
of the key structure e.g. for BPF LRU maps without having to worry that the
cookie could be recycled. We are planning to use this for our sessionAffinity
implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
provide the cookie for the initial network namespace while passing the context
instead of NULL would provide the cookie from the application's network namespace.
We're using a hole, so no size increase; the assignment happens only once.
Therefore this allows for a comparison on initial namespace as well as regular
cookie usage as we have today with socket cookies. We could later on enable
this helper for other program types as well as we would see need.

  (*) Both externalTrafficPolicy={Local|Cluster} types
  [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.cSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net

f318903c

net: phy: bcm7xx: add jumbo frame configuration to PHY · ab41ca34

由 Murali Krishna Policharla 提交于 3月 27, 2020

The BCM7XX PHY family requires special configuration to pass jumbo
frames. Do that during initial PHY setup.
Signed-off-by: NMurali Krishna Policharla <murali.policharla@broadcom.com>
Reviewed-by: NScott Branden <scott.branden@broadcom.com>
Signed-off-by: NVladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: NFlorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ab41ca34

PCI: Add new PCI_VPD_RO_KEYWORD_SERIALNO macro · 16efafa3

由 Vasundhara Volam 提交于 3月 27, 2020

This patch adds a new macro for serial number keyword.
Acked-by: NBjorn Helgaas <bhelgaas@google.com>
Signed-off-by: NVasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: NMichael Chan <michael.chan@broadcom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

16efafa3

block: add a zone condition debug helper · 02694e86

由 Chaitanya Kulkarni 提交于 3月 25, 2020

Add a helper to stringify the zone conditions. We use this helper in the
next patch to track zone conditions in tracepoints.
Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

02694e86

block: move bio_map_* to blk-map.c · 130879f1

由 Christoph Hellwig 提交于 3月 27, 2020

The bio_map_* helpers are just the low-level helpers for the
blk_rq_map_* APIs.  Move them together for better logical grouping,
as no there isn't much overlap with other code in bio.c.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

130879f1

block: simplify queue allocation · 3d745ea5

由 Christoph Hellwig 提交于 3月 27, 2020

Current make_request based drivers use either blk_alloc_queue_node or
blk_alloc_queue to allocate a queue, and then set up the make_request_fn
function pointer and a few parameters using the blk_queue_make_request
helper. Simplify this by passing the make_request pointer to
blk_alloc_queue, and while at it merge the _node variant into the main
helper by always passing a node_id, and remove the superfluous gfp_mask
parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

3d745ea5

block: add a blk_mq_init_queue_data helper · 2f227bb9

由 Christoph Hellwig 提交于 3月 27, 2020

This allows a driver to pass a queuedata member before ->init_hctx is
called.  null_blk currently open codes this logic, but I'd rather have
it in the core to ease future maintainance.
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2f227bb9

27 3月, 2020 10 次提交

block: move the ->devnode callback to struct block_device_operations · 348e114b

由 Christoph Hellwig 提交于 3月 27, 2020

There really isn't any good reason to stash a method directly into
struct gendisk.  Move it together with the other block device
operations.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

348e114b

x86/efi: Add a prototype for efi_arch_mem_reserve() · 860f89e6

由 Benjamin Thiel 提交于 3月 26, 2020

... in order to fix a -Wmissing-ptototypes warning:

  arch/x86/platform/efi/quirks.c:245:13: warning:
  no previous prototype for ‘efi_arch_mem_reserve’ [-Wmissing-prototypes] \
	  void __init efi_arch_mem_reserve(phys_addr_t addr, u64 size)
Signed-off-by: NBenjamin Thiel <b.thiel@posteo.de>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200326135041.3264-1-b.thiel@posteo.de

860f89e6

net: add a reference to MACsec ops in net_device · 30e9bb84

由 Antoine Tenart 提交于 3月 25, 2020

This patch adds a reference to MACsec ops to the net_device structure,
allowing net device drivers to implement offloading operations for
MACsec.
Signed-off-by: NAntoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: NMark Starovoytov <mstarovoitov@marvell.com>
Signed-off-by: NIgor Russkikh <irusskikh@marvell.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

30e9bb84

net: introduce the MACSEC netdev feature · 5908220b

由 Antoine Tenart 提交于 3月 25, 2020

This patch introduce a new netdev feature, which will be used by drivers
to state they can perform MACsec transformations in hardware.

The patchset was gathered by Mark, macsec functinality itself
was implemented by Dmitry, Mark and Pavel Belous.
Signed-off-by: NAntoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: NMark Starovoytov <mstarovoitov@marvell.com>
Signed-off-by: NIgor Russkikh <irusskikh@marvell.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5908220b

x86: get rid of put_user_try in __setup_rt_frame() (both 32bit and 64bit) · 119cd59f

由 Al Viro 提交于 2月 15, 2020

Straightforward, except for save_altstack_ex() stuck in those.
Replace that thing with an analogue that would use unsafe_put_user()
instead of put_user_ex() (called compat_save_altstack()) and be done
with that.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

119cd59f

ata: move ata_eh_analyze_ncq_error() & co. to libata-sata.c · a0ccd251

由 Bartlomiej Zolnierkiewicz 提交于 3月 26, 2020

* move ata_eh_analyze_ncq_error() and ata_eh_read_log_10h() to
  libata-sata.c

* add static inline for ata_eh_analyze_ncq_error() for
  CONFIG_SATA_HOST=n case (link->sactive is non-zero only if
  NCQ commands are actually queued so empty function body is
  sufficient)

Code size savings on m68k arch using (modified) atari_defconfig:

   text    data     bss     dec     hex filename
before:
  16164      18       0   16182    3f36 drivers/ata/libata-eh.o
after:
  15446      18       0   15464    3c68 drivers/ata/libata-eh.o
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a0ccd251

ata: start separating SATA specific code from libata-eh.c · a695de27

由 Bartlomiej Zolnierkiewicz 提交于 3月 26, 2020

Start separating SATA specific code from libata-eh.c:

* move sata_async_notification() to libata-sata.c:

Code size savings on m68k arch using (modified) atari_defconfig:

   text    data     bss     dec     hex filename
before:
  16243      18       0   16261    3f85 drivers/ata/libata-eh.o
after:
  16164      18       0   16182    3f36 drivers/ata/libata-eh.o
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

a695de27

ata: move ata_sas_*() to libata-sata.c · 15964ff7

由 Bartlomiej Zolnierkiewicz 提交于 3月 26, 2020

* un-inline:
  - ata_scsi_dump_cdb()
  - __ata_scsi_queuecmd()

* un-static:
  - ata_scsi_sdev_config()
  - ata_scsi_dev_config()
  - ata_scsi_dump_cdb()
  - __ata_scsi_queuecmd()

* move ata_sas_*() to libata-sata.c:

* add static inlines for CONFIG_SATA_HOST=n case for
  ata_sas_{allocate,free}_tag()

Code size savings on m68k arch using (modified) atari_defconfig:

   text    data     bss     dec     hex filename
before:
  19137      23     576   19736    4d18 drivers/ata/libata-scsi.o
after:
  18330      23     576   18929    49f1 drivers/ata/libata-scsi.o
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

15964ff7

ata: start separating SATA specific code from libata-scsi.c · ec811a94

由 Bartlomiej Zolnierkiewicz 提交于 3月 26, 2020

Start separating SATA specific code from libata-scsi.c:

* un-static ata_scsi_find_dev()

* move following code to libata-sata.c:
  - SATA only sysfs device attributes handling
  - __ata_change_queue_depth()
  - ata_scsi_change_queue_depth()

* cover with CONFIG_SATA_HOST ifdef SATA only sysfs device
  attributes handling code and ATA_SHT_NCQ() macro in
  <linux/libata.h>

Code size savings on m68k arch using (modified) atari_defconfig:

   text    data     bss     dec     hex filename
before:
  20702     105     576   21383    5387 drivers/ata/libata-scsi.o
after:
  19137      23     576   19736    4d18 drivers/ata/libata-scsi.o
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

ec811a94

ata: move sata_deb_timing_*() to libata-sata.c · 2b384ede

由 Bartlomiej Zolnierkiewicz 提交于 3月 26, 2020

* move sata_deb_timing_*() to libata-sata.c

* add static inline for sata_ehc_deb_timing() for
  CONFIG_SATA_HOST=n case

Code size savings on m68k arch using (modified) atari_defconfig:

   text    data     bss     dec     hex filename
before:
  32158     572      40   32770    8002 drivers/ata/libata-core.o
after:
  32015     572      40   32627    7f73 drivers/ata/libata-core.o
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2b384ede

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功