提交 · afdb09c720b62b8090584c11151d856df330e57d · openanolis / cloud-kernel

20 10月, 2017 3 次提交

security: bpf: Add LSM hooks for bpf object related syscall · afdb09c7

由 Chenbo Feng 提交于 10月 18, 2017

Introduce several LSM hooks for the syscalls that will allow the
userspace to access to eBPF object such as eBPF programs and eBPF maps.
The security check is aimed to enforce a per object security protection
for eBPF object so only processes with the right priviliges can
read/write to a specific map or use a specific eBPF program. Besides
that, a general security hook is added before the multiplexer of bpf
syscall to check the cmd and the attribute used for the command. The
actual security module can decide which command need to be checked and
how the cmd should be checked.
Signed-off-by: NChenbo Feng <fengc@google.com>
Acked-by: NJames Morris <james.l.morris@oracle.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

afdb09c7

bpf: Add file mode configuration into bpf maps · 6e71b04a

由 Chenbo Feng 提交于 10月 18, 2017

Introduce the map read/write flags to the eBPF syscalls that returns the
map fd. The flags is used to set up the file mode when construct a new
file descriptor for bpf maps. To not break the backward capability, the
f_flags is set to O_RDWR if the flag passed by syscall is 0. Otherwise
it should be O_RDONLY or O_WRONLY. When the userspace want to modify or
read the map content, it will check the file mode to see if it is
allowed to make the change.
Signed-off-by: NChenbo Feng <fengc@google.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6e71b04a

net: Add extack to validator_info structs used for address notifier · de95e047

由 David Ahern 提交于 10月 18, 2017

Add extack to in_validator_info and in6_validator_info. Update the one
user of each, ipvlan, to return an error message for failures.

Only manual configuration of an address is plumbed in the IPv6 code path.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

de95e047

19 10月, 2017 1 次提交

dql: make dql_init return void · 7a0947e7

由 Stephen Hemminger 提交于 10月 17, 2017

dql_init always returned 0, and the only place that uses it
in network core code didn't care about the return value anyway.
Signed-off-by: NStephen Hemminger <sthemmin@microsoft.com>
Acked-by: NHiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7a0947e7

18 10月, 2017 7 次提交

bpf: move knowledge about post-translation offsets out of verifier · 4f9218aa

由 Jakub Kicinski 提交于 10月 16, 2017

Use the fact that verifier ops are now separate from program
ops to define a separate set of callbacks for verification of
already translated programs.

Since we expect the analyzer ops to be defined only for
a small subset of all program types initialize their array
by hand (don't use linux/bpf_types.h).
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4f9218aa

bpf: remove the verifier ops from program structure · 00176a34

由 Jakub Kicinski 提交于 10月 16, 2017

Since the verifier ops don't have to be associated with
the program for its entire lifetime we can move it to
verifier's struct bpf_verifier_env.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

00176a34

bpf: split verifier and program ops · 7de16e3a

由 Jakub Kicinski 提交于 10月 16, 2017

struct bpf_verifier_ops contains both verifier ops and operations
used later during program's lifetime (test_run).  Split the runtime
ops into a different structure.

BPF_PROG_TYPE() will now append ## _prog_ops or ## _verifier_ops
to the names.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7de16e3a

bpf: cpumap xdp_buff to skb conversion and allocation · 1c601d82

由 Jesper Dangaard Brouer 提交于 10月 16, 2017

This patch makes cpumap functional, by adding SKB allocation and
invoking the network stack on the dequeuing CPU.

For constructing the SKB on the remote CPU, the xdp_buff in converted
into a struct xdp_pkt, and it mapped into the top headroom of the
packet, to avoid allocating separate mem.  For now, struct xdp_pkt is
just a cpumap internal data structure, with info carried between
enqueue to dequeue.

If a driver doesn't have enough headroom it is simply dropped, with
return code -EOVERFLOW.  This will be picked up the xdp tracepoint
infrastructure, to allow users to catch this.

V2: take into account xdp->data_meta

V4:
 - Drop busypoll tricks, keeping it more simple.
 - Skip RPS and Generic-XDP-recursive-reinjection, suggested by Alexei

V5: correct RCU read protection around __netif_receive_skb_core.

V6: Setting TASK_RUNNING vs TASK_INTERRUPTIBLE based on talk with Rik van Riel
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1c601d82

bpf: XDP_REDIRECT enable use of cpumap · 9c270af3

由 Jesper Dangaard Brouer 提交于 10月 16, 2017

This patch connects cpumap to the xdp_do_redirect_map infrastructure.

Still no SKB allocation are done yet.  The XDP frames are transferred
to the other CPU, but they are simply refcnt decremented on the remote
CPU.  This served as a good benchmark for measuring the overhead of
remote refcnt decrement.  If driver page recycle cache is not
efficient then this, exposes a bottleneck in the page allocator.

A shout-out to MST's ptr_ring, which is the secret behind is being so
efficient to transfer memory pointers between CPUs, without constantly
bouncing cache-lines between CPUs.

V3: Handle !CONFIG_BPF_SYSCALL pointed out by kbuild test robot.

V4: Make Generic-XDP aware of cpumap type, but don't allow redirect yet,
 as implementation require a separate upstream discussion.

V5:
 - Fix a maybe-uninitialized pointed out by kbuild test robot.
 - Restrict bpf-prog side access to cpumap, open when use-cases appear
 - Implement cpu_map_enqueue() as a more simple void pointer enqueue

V6:
 - Allow cpumap type for usage in helper bpf_redirect_map,
   general bpf-prog side restriction moved to earlier patch.
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9c270af3

bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP · 6710e112

由 Jesper Dangaard Brouer 提交于 10月 16, 2017

The 'cpumap' is primarily used as a backend map for XDP BPF helper
call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.

This patch implement the main part of the map.  It is not connected to
the XDP redirect system yet, and no SKB allocation are done yet.

The main concern in this patch is to ensure the datapath can run
without any locking.  This adds complexity to the setup and tear-down
procedure, which assumptions are extra carefully documented in the
code comments.

V2:
 - make sure array isn't larger than NR_CPUS
 - make sure CPUs added is a valid possible CPU

V3: fix nitpicks from Jakub Kicinski <kubakici@wp.pl>

V5:
 - Restrict map allocation to root / CAP_SYS_ADMIN
 - WARN_ON_ONCE if queue is not empty on tear-down
 - Return -EPERM on memlock limit instead of -ENOMEM
 - Error code in __cpu_map_entry_alloc() also handle ptr_ring_cleanup()
 - Moved cpu_map_enqueue() to next patch

V6: all notice by Daniel Borkmann
 - Fix err return code in cpu_map_alloc() introduced in V5
 - Move cpu_possible() check after max_entries boundary check
 - Forbid usage initially in check_map_func_compatibility()

V7:
 - Fix alloc error path spotted by Daniel Borkmann
 - Did stress test adding+removing CPUs from the map concurrently
 - Fixed refcnt issue on cpu_map_entry, kthread started too soon
 - Make sure packets are flushed during tear-down, involved use of
   rcu_barrier() and kthread_run only exit after queue is empty
 - Fix alloc error path in __cpu_map_entry_alloc() for ptr_ring

V8:
 - Nitpicking comments and gramma by Edward Cree
 - Fix missing semi-colon introduced in V7 due to rebasing
 - Move struct bpf_cpu_map_entry members cpu+map_id to tracepoint patch
Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6710e112

ethtool: add ethtool_intersect_link_masks · 5a6cd6de

由 Alan Brady 提交于 10月 05, 2017

This function provides a way to intersect two link masks together to
find the common ground between them. For example in i40e, the driver
first generates link masks for what is supported by the PHY type. The
driver then gets the link masks for what the NVM supports. The
resulting intersection between them yields what can truly be supported.
Signed-off-by: NAlan Brady <alan.brady@intel.com>
Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

5a6cd6de

15 10月, 2017 1 次提交

net/mlx5: PTP code migration to driver core section · 7c39afb3

由 Feras Daoud 提交于 8月 15, 2017

PTP code is moved to core section of mlx5 driver in order to share
it between ethernet and infiniband. This movement involves the following
changes:
- Change mlx5e_ prefix to be mlx5_
- Add clock structs to Core
- Add clock object to mlx5_core_dev
- Call Init/Uninit clock from core init/cleanup
- Rename mlx5e_tstamp to be mlx5_clock
Signed-off-by: NFeras Daoud <ferasda@mellanox.com>
Signed-off-by: NEitan Rabin <rabin@mellanox.com>
Signed-off-by: NSaeed Mahameed <saeedm@mellanox.com>

7c39afb3

14 10月, 2017 1 次提交

i40e/i40evf: don't trust VF to reset itself · 17a9422d

由 Alan Brady 提交于 10月 11, 2017

When using 'ethtool -L' on a VF to change number of requested queues
from PF, we shouldn't trust the VF to reset itself after making the
request.  Doing it that way opens the door for a potentially malicious
VF to do nasty things to the PF which should never be the case.

This makes it such that after VF makes a successful request, PF will
then reset the VF to institute required changes.  Only if the request
fails will PF send a message back to VF letting it know the request was
unsuccessful.

Testing-hints:
There should be no real functional changes.  This is simply hardening
against a potentially malicious VF.
Signed-off-by: NAlan Brady <alan.brady@intel.com>
Tested-by: NAndrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>

17a9422d

13 10月, 2017 1 次提交

net: phy: broadcom: support new device flag for setting master mode · 2355a654

由 Rafał Miłecki 提交于 10月 12, 2017

Some of Broadcom's PHYs run by default in slave mode with Automatic
Slave/Master configuration disabled. It stops them from working properly
with some devices.

So far it has been verified for BCM54210E and BCM50212E which don't
work well with Intel's I217-LM and I218-LM:
http://ark.intel.com/products/60019/Intel-Ethernet-Connection-I217-LM
http://ark.intel.com/products/71307/Intel-Ethernet-Connection-I218-LM
I was told there is massive ping loss.

This commit adds support for a new flag which can be set by an ethernet
driver to fixup PHY setup.
Signed-off-by: NRafał Miłecki <rafal@milecki.pl>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2355a654

11 10月, 2017 3 次提交

bpf: write back the verifier log buffer as it gets filled · a2a7d570

由 Jakub Kicinski 提交于 10月 09, 2017

Verifier log buffer can be quite large (up to 16MB currently).
As Eric Dumazet points out if we allow multiple verification
requests to proceed simultaneously, malicious user may use the
verifier as a way of allocating large amounts of unswappable
memory to OOM the host.

Switch to a strategy of allocating a smaller buffer (1024B)
and writing it out into the user buffer after every print.

While at it remove the old BUG_ON().

This is in preparation of the global verifier lock removal.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: NSimon Horman <simon.horman@netronome.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a2a7d570

bpf: move global verifier log into verifier environment · 61bd5218

由 Jakub Kicinski 提交于 10月 09, 2017

The biggest piece of global state protected by the verifier lock
is the verifier_log.  Move that log to struct bpf_verifier_env.
struct bpf_verifier_env has to be passed now to all invocations
of verbose().
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: NSimon Horman <simon.horman@netronome.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

61bd5218

bpf: encapsulate verifier log state into a structure · e7bf8249

由 Jakub Kicinski 提交于 10月 09, 2017

Put the loose log_* variables into a structure.  This will make
it simpler to remove the global verifier state in following patches.
Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: NSimon Horman <simon.horman@netronome.com>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e7bf8249

10 10月, 2017 5 次提交

once: switch to new jump label API · cf4c950b

由 Eric Biggers 提交于 10月 09, 2017

Switch the DO_ONCE() macro from the deprecated jump label API to the new
one.  The new one is more readable, and for DO_ONCE() it also makes the
generated code more icache-friendly: now the one-time initialization
code is placed out-of-line at the jump target, rather than at the inline
fallthrough case.
Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cf4c950b

qed: Add LL2 slowpath handling · 6f34a284

由 Michal Kalderon 提交于 10月 09, 2017

For iWARP unaligned MPA flow, a slowpath event of flushing an
MPA connection that entered an unaligned state is required.
The flush ramrod is received on the ll2 queue, and a pre-registered
callback function is called to handle the flush event.
Signed-off-by: NMichal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: NAriel Elior <Ariel.Elior@cavium.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6f34a284

qed: Add ll2 option for dropping a tx packet · 77caa792

由 Michal Kalderon 提交于 10月 09, 2017

The option of sending a packet on the ll2 and dropping it exists in
hardware and was not used until now, thus not exposed.
The iWARP unaligned MPA flow requires this functionality for
flushing the tx queue.
Signed-off-by: NMichal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: NAriel Elior <Ariel.Elior@cavium.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

77caa792

qed: Add ll2 ability of opening a secondary queue · ed468ebe

由 Michal Kalderon 提交于 10月 09, 2017

When more than one ll2 queue is opened ( that is not an OOO queue )
ll2 code does not have enough information to determine whether
the queue is the main one or not, so a new field is added to the
acquire input data to expose the control of determining whether
the queue is the main queue or a secondary queue.
Signed-off-by: NMichal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: NAriel Elior <Ariel.Elior@cavium.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ed468ebe

net: bridge: Export bridge multicast router state · 0912bda4

由 Yotam Gigi 提交于 10月 09, 2017

Add an access function that, given a bridge netdevice, returns whether the
bridge device is currently an mrouter or not. The function uses the already
existing br_multicast_is_router function to check that.

This function is needed in order to allow ports that join an already
existing bridge to know the current mrouter state of the bridge device.
Together with the bridge device mrouter ports switchdev notifications, it
is possible to have full offloading of the semantics of the bridge device
mcast router state.

Due to the fact that the bridge multicast router status can change in
packet RX path, take the multicast_router bridge spinlock to protect the
read.
Signed-off-by: NYotam Gigi <yotamg@mellanox.com>
Reviewed-by: NNogah Frankel <nogahf@mellanox.com>
Reviewed-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: NJiri Pirko <jiri@mellanox.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0912bda4

09 10月, 2017 2 次提交

netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1' · 98589a09

由 Shmulik Ladkani 提交于 10月 09, 2017

Commit 2c16d603 ("netfilter: xt_bpf: support ebpf") introduced
support for attaching an eBPF object by an fd, with the
'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
IPT_SO_SET_REPLACE call.

However this breaks subsequent iptables calls:

 # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
 # iptables -A INPUT -s 5.6.7.8 -j ACCEPT
 iptables: Invalid argument. Run `dmesg' for more information.

That's because iptables works by loading existing rules using
IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
the replacement set.

However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
(from the initial "iptables -m bpf" invocation) - so when 2nd invocation
occurs, userspace passes a bogus fd number, which leads to
'bpf_mt_check_v1' to fail.

One suggested solution [1] was to hack iptables userspace, to perform a
"entries fixup" immediatley after IPT_SO_GET_ENTRIES, by opening a new,
process-local fd per every 'xt_bpf_info_v1' entry seen.

However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested to
depricate the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.

This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
'.fd' and instead perform an in-kernel lookup for the bpf object given
the provided '.path'.

It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
expected to provide the path of the pinned object.

Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.

References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
            [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2Reported-by: NRafael Buchbinder <rafi@rbk.ms>
Signed-off-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
Acked-by: NWillem de Bruijn <willemb@google.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

98589a09

bridge: add new BR_NEIGH_SUPPRESS port flag to suppress arp and nd flood · 821f1b21

由 Roopa Prabhu 提交于 10月 06, 2017

This patch adds a new bridge port flag BR_NEIGH_SUPPRESS to
suppress arp and nd flood on bridge ports. It implements
rfc7432, section 10.
https://tools.ietf.org/html/rfc7432#section-10
for ethernet VPN deployments. It is similar to the existing
BR_PROXYARP* flags but has a few semantic differences to conform
to EVPN standard. Unlike the existing flags, this new flag suppresses
flood of all neigh discovery packets (arp and nd) to tunnel ports.
Supports both vlan filtering and non-vlan filtering bridges.

In case of EVPN, it is mainly used to avoid flooding
of arp and nd packets to tunnel ports like vxlan.

This patch adds netlink and sysfs support to set this bridge port
flag.
Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

821f1b21

08 10月, 2017 3 次提交

bpf: Use char in prog and map name · 067cae47

由 Martin KaFai Lau 提交于 10月 05, 2017

Instead of u8, use char for prog and map name.  It can avoid the
userspace tool getting compiler's signess warning.  The
bpf_prog_aux, bpf_map, bpf_attr, bpf_prog_info and
bpf_map_info are changed.
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@fb.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

067cae47

net: phonet: mark header_ops as const · 64237470

由 Lin Zhang 提交于 10月 06, 2017

Signed-off-by: NLin Zhang <xiaolou4617@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

64237470

bpf: perf event change needed for subsequent bpf helpers · 97562633

由 Yonghong Song 提交于 10月 05, 2017

This patch does not impact existing functionalities.
It contains the changes in perf event area needed for
subsequent bpf_perf_event_read_value and
bpf_perf_prog_read_value helpers.
Signed-off-by: NYonghong Song <yhs@fb.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

97562633

07 10月, 2017 1 次提交

net: add rb_to_skb() and other rb tree helpers · 18a4c0ea

由 Eric Dumazet 提交于 10月 05, 2017

Geeralize private netem_rb_to_skb()

TCP rtx queue will soon be converted to rb-tree,
so we will need skb_rbtree_walk() helpers.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

18a4c0ea

06 10月, 2017 1 次提交

tcp: new list for sent but unacked skbs for RACK recovery · e2080072

由 Eric Dumazet 提交于 10月 04, 2017

This patch adds a new queue (list) that tracks the sent but not yet
acked or SACKed skbs for a TCP connection. The list is chronologically
ordered by skb->skb_mstamp (the head is the oldest sent skb).

This list will be used to optimize TCP Rack recovery, which checks
an skb's timestamp to judge if it has been lost and needs to be
retransmitted. Since TCP write queue is ordered by sequence instead
of sent time, RACK has to scan over the write queue to catch all
eligible packets to detect lost retransmission, and iterates through
SACKed skbs repeatedly.

Special cares for rare events:
1. TCP repair fakes skb transmission so the send queue needs adjusted
2. SACK reneging would require re-inserting SACKed skbs into the
   send queue. For now I believe it's not worth the complexity to
   make RACK work perfectly on SACK reneging, so we do nothing here.
3. Fast Open: currently for non-TFO, send-queue correctly queues
   the pure SYN packet. For TFO which queues a pure SYN and
   then a data packet, send-queue only queues the data packet but
   not the pure SYN due to the structure of TFO code. This is okay
   because the SYN receiver would never respond with a SACK on a
   missing SYN (i.e. SYN is never fast-retransmitted by SACK/RACK).

In order to not grow sk_buff, we use an union for the new list and
_skb_refdst/destructor fields. This is a bit complicated because
we need to make sure _skb_refdst and destructor are properly zeroed
before skb is cloned/copied at transmit, and before being freed.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e2080072

05 10月, 2017 6 次提交

net: Add extack to upper device linking · 42ab19ee

由 David Ahern 提交于 10月 04, 2017

Add extack arg to netdev_upper_dev_link and netdev_master_upper_dev_link
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

42ab19ee

net: Add extack to ndo_add_slave · 33eaf2a6

由 David Ahern 提交于 10月 04, 2017

Pass extack to do_set_master and down to ndo_add_slave
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

33eaf2a6

net: Add extack to netdev_notifier_info · 51d0c047

由 David Ahern 提交于 10月 04, 2017

Add netlink_ext_ack to netdev_notifier_info to allow notifier
handlers to return errors to userspace.

Clean up the initialization in dev.c such that extack is easily
added in subsequent patches where relevant. Specifically, remove
the init call in call_netdevice_notifiers_info and have callers
initalize on stack when info is declared.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

51d0c047

dev: advertise the new nsid when the netns iface changes · 6621dd29

由 Nicolas Dichtel 提交于 10月 03, 2017

x-netns interfaces are bound to two netns: the link netns and the upper
netns. Usually, this kind of interfaces is created in the link netns and
then moved to the upper netns. At the end, the interface is visible only
in the upper netns. The link nsid is advertised via netlink in the upper
netns, thus the user always knows where is the link part.

There is no such mechanism in the link netns. When the interface is moved
to another netns, the user cannot "follow" it.
This patch adds a new netlink attribute which helps to follow an interface
which moves to another netns. When the interface is unregistered, the new
nsid is advertised. If the interface is a x-netns interface (ie
rtnl_link_ops->get_link_net is defined), the nsid is allocated if needed.

CC: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

6621dd29

bpf: introduce BPF_PROG_QUERY command · 468e2f64

由 Alexei Starovoitov 提交于 10月 02, 2017

introduce BPF_PROG_QUERY command to retrieve a set of either
attached programs to given cgroup or a set of effective programs
that will execute for events within a cgroup
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
for cgroup bits
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

468e2f64

bpf: multi program support for cgroup+bpf · 324bda9e

由 Alexei Starovoitov 提交于 10月 02, 2017

introduce BPF_F_ALLOW_MULTI flag that can be used to attach multiple
bpf programs to a cgroup.

The difference between three possible flags for BPF_PROG_ATTACH command:
- NONE(default): No further bpf programs allowed in the subtree.
- BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
  the program in this cgroup yields to sub-cgroup program.
- BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
  that cgroup program gets run in addition to the program in this cgroup.

NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't
change their behavior. It only clarifies the semantics in relation
to new flag.

Only one program is allowed to be attached to a cgroup with
NONE or BPF_F_ALLOW_OVERRIDE flag.
Multiple programs are allowed to be attached to a cgroup with
BPF_F_ALLOW_MULTI flag. They are executed in FIFO order
(those that were attached first, run first)
The programs of sub-cgroup are executed first, then programs of
this cgroup and then programs of parent cgroup.
All eligible programs are executed regardless of return code from
earlier programs.

To allow efficient execution of multiple programs attached to a cgroup
and to avoid penalizing cgroups without any programs attached
introduce 'struct bpf_prog_array' which is RCU protected array
of pointers to bpf programs.
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NMartin KaFai Lau <kafai@fb.com>
for cgroup bits
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

324bda9e

04 10月, 2017 5 次提交

powerpc/watchdog: Make use of watchdog_nmi_probe() · 34ddaa3e

由 Thomas Gleixner 提交于 10月 03, 2017

The rework of the core hotplug code triggers the WARN_ON in start_wd_cpu()
on powerpc because it is called multiple times for the boot CPU.

The first call is via:

  start_wd_on_cpu+0x80/0x2f0
  watchdog_nmi_reconfigure+0x124/0x170
  softlockup_reconfigure_threads+0x110/0x130
  lockup_detector_init+0xbc/0xe0
  kernel_init_freeable+0x18c/0x37c
  kernel_init+0x2c/0x160
  ret_from_kernel_thread+0x5c/0xbc

And then again via the CPU hotplug registration:

  start_wd_on_cpu+0x80/0x2f0
  cpuhp_invoke_callback+0x194/0x620
  cpuhp_thread_fun+0x7c/0x1b0
  smpboot_thread_fn+0x290/0x2a0
  kthread+0x168/0x1b0
  ret_from_kernel_thread+0x5c/0xbc

This can be avoided by setting up the cpu hotplug state with nocalls and
move the initialization to the watchdog_nmi_probe() function. That
initializes the hotplug callbacks without invoking the callback and the
following core initialization function then configures the watchdog for the
online CPUs (in this case CPU0) via softlockup_reconfigure_threads().
Reported-and-tested-by: NMichael Ellerman <mpe@ellerman.id.au>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org

34ddaa3e

watchdog/core, powerpc: Replace watchdog_nmi_reconfigure() · 6b9dc480

由 Thomas Gleixner 提交于 10月 02, 2017

The recent cleanup of the watchdog code split watchdog_nmi_reconfigure()
into two stages. One to stop the NMI and one to restart it after
reconfiguration. That was done by adding a boolean 'run' argument to the
code, which is functionally correct but not necessarily a piece of art.

Replace it by two explicit functions: watchdog_nmi_stop() and
watchdog_nmi_start().

Fixes: 6592ad2f ("watchdog/core, powerpc: Make watchdog_nmi_reconfigure() two stage")
Requested-by: NLinus 'Nursing his pet-peeve' Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: NThomas 'Mopping up garbage' Gleixner <tglx@linutronix.de>
Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1710021957480.2114@nanos

6b9dc480

mmc: Delete bounce buffer handling · de3ee99b

由 Linus Walleij 提交于 9月 20, 2017

In may, Steven sent a patch deleting the bounce buffer handling
and the CONFIG_MMC_BLOCK_BOUNCE option.

I chose the less invasive path of making it a runtime config
option, and we merged that successfully for kernel v4.12.

The code is however just standing in the way and taking up
space for seemingly no gain on any systems in wide use today.

Pierre says the code was there to improve speed on TI SDHCI
controllers on certain HP laptops and possibly some Ricoh
controllers as well. Early SDHCI controllers lacked the
scatter-gather feature, which made software bounce buffers
a significant speed boost.

We are clearly talking about the list of SDHCI PCI-based
MMC/SD card readers found in the pci_ids[] list in
drivers/mmc/host/sdhci-pci-core.c.

The TI SDHCI derivative is not supported by the upstream
kernel. This leaves the Ricoh.

What we can however notice is that the x86 defconfigs in the
kernel did not enable CONFIG_MMC_BLOCK_BOUNCE option, which
means that any such laptop would have to have a custom
configured kernel to actually take advantage of this
bounce buffer speed-up. It simply seems like there was
a speed optimization for the Ricoh controllers that noone
was using. (I have not checked the distro defconfigs but
I am pretty sure the situation is the same there.)

Bounce buffers increased performance on the OMAP HSMMC
at one point, and was part of the original submission in
commit a45c6cb8 ("[ARM] 5369/1: omap mmc: Add new
   omap hsmmc controller for 2430 and 34xx, v3")

This optimization was removed in
commit 0ccd76d4 ("omap_hsmmc: Implement scatter-gather
   emulation")
which found that scatter-gather emulation provided even
better performance.

The same was introduced for SDHCI in
commit 2134a922 ("sdhci: scatter-gather (ADMA) support")

I am pretty positively convinced that software
scatter-gather emulation will do for any host controller what
the bounce buffers were doing. Essentially, the bounce buffer
was a reimplementation of software scatter-gather-emulation in
the MMC subsystem, and it should be done away with.

Cc: Pierre Ossman <pierre@ossman.eu>
Cc: Juha Yrjola <juha.yrjola@solidboot.com>
Cc: Steven J. Hill <Steven.Hill@cavium.com>
Cc: Shawn Lin <shawn.lin@rock-chips.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Suggested-by: NSteven J. Hill <Steven.Hill@cavium.com>
Suggested-by: NShawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: NLinus Walleij <linus.walleij@linaro.org>
Signed-off-by: NUlf Hansson <ulf.hansson@linaro.org>

de3ee99b

include/linux/fs.h: fix comment about struct address_space · 32e57c29

由 Mike Rapoport 提交于 10月 03, 2017

Before commit 9c5d760b ("mm: split gfp_mask and mapping flags into
separate fields") the private_* fields of struct adrress_space were
grouped together and using "ditto" in comments describing the last
fields was correct.

With introduction of gpf_mask between private_lock and private_list
"ditto" references the wrong description.

Fix it by using the elaborate description.

Link: http://lkml.kernel.org/r/1507009987-8746-1-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

32e57c29

mm/memory_hotplug: change pfn_to_section_nr/section_nr_to_pfn macro to inline function · 1dd2bfc8

由 YASUAKI ISHIMATSU 提交于 10月 03, 2017

pfn_to_section_nr() and section_nr_to_pfn() are defined as macro.
pfn_to_section_nr() has no issue even if it is defined as macro.  But
section_nr_to_pfn() has overflow issue if sec is defined as int.

section_nr_to_pfn() just shifts sec by PFN_SECTION_SHIFT.  If sec is
defined as unsigned long, section_nr_to_pfn() returns pfn as 64 bit value.
But if sec is defined as int, section_nr_to_pfn() returns pfn as 32 bit
value.

__remove_section() calculates start_pfn using section_nr_to_pfn() and
scn_nr defined as int.  So if hot-removed memory address is over 16TB,
overflow issue occurs and section_nr_to_pfn() does not calculate correct
pfn.

To make callers use proper arg, the patch changes the macros to inline
functions.

Fixes: 815121d2 ("memory_hotplug: clear zone when removing the memory")
Link: http://lkml.kernel.org/r/e643a387-e573-6bbf-d418-c60c8ee3d15e@gmail.comSigned-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1dd2bfc8

openanolis / cloud-kernel 接近 2 年 前同步成功

openanolis / cloud-kernel
接近 2 年前同步成功