Commits · bb98f2c5ac5db92d34908dbac81a8de7c47c8353 · openanolis / cloud-kernel

08 Jun, 2018 · 3 commits

mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct · 88aa7cc6

Committed by Yang Shi on Jun 07, 2018

mmap_sem is on the hot path of the kernel and is very contended, but it
is abused too.  It is used to protect arg_start|end and env_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but that doesn't make
sense: those proc files just expect to read 4 values atomically, the
values are not related to VM, and they could be set to arbitrary values
by C/R (checkpoint/restore).

And the mmap_sem contention may cause unexpected issues like the one below:

INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 ps              D    0 14018      1 0x00000004
 Call Trace:
   schedule+0x36/0x80
   rwsem_down_read_failed+0xf0/0x150
   call_rwsem_down_read_failed+0x18/0x30
   down_read+0x20/0x40
   proc_pid_cmdline_read+0xd9/0x4e0
   __vfs_read+0x37/0x150
   vfs_read+0x96/0x130
   SyS_read+0x55/0xc0
   entry_SYSCALL_64_fastpath+0x1a/0xc5

Both Alexey Dobriyan and Michal Hocko suggested using a dedicated lock
for them to mitigate the abuse of mmap_sem.

So, introduce a new spinlock in mm_struct to protect concurrent access
to arg_start|end, env_start|end and others.  Also take mmap_sem for read
instead of write in prctl: that still protects against the race between
prctl and sys_brk, which might break check_data_rlimit(), and it makes
prctl more friendly to other VM operations.

This patch just eliminates the abuse of mmap_sem, but it can't resolve
the above hung task warning completely, since the later
access_remote_vm() call needs to acquire mmap_sem.  The mmap_sem
scalability issue will be solved in the future.
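
The dedicated-lock idea can be sketched in user space. This is a minimal illustration under stated assumptions, not the kernel code: `struct mm_sketch`, `struct arg_snapshot` and all helper names are hypothetical, and a C11 `atomic_flag` stands in for the kernel spinlock. It shows readers copying the four values under a small lock of their own instead of a heavily contended one.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the mm_struct fields guarded by arg_lock. */
struct mm_sketch {
    atomic_flag arg_lock; /* tiny spinlock; the kernel uses spinlock_t */
    unsigned long arg_start, arg_end, env_start, env_end;
};

/* A consistent copy of the four values. */
struct arg_snapshot {
    unsigned long arg_start, arg_end, env_start, env_end;
};

static void mm_sketch_init(struct mm_sketch *mm)
{
    atomic_flag_clear(&mm->arg_lock);
    mm->arg_start = mm->arg_end = 0;
    mm->env_start = mm->env_end = 0;
}

static void arg_lock_acquire(struct mm_sketch *mm)
{
    while (atomic_flag_test_and_set_explicit(&mm->arg_lock,
                                             memory_order_acquire))
        ; /* spin until the previous holder releases the lock */
}

static void arg_lock_release(struct mm_sketch *mm)
{
    atomic_flag_clear_explicit(&mm->arg_lock, memory_order_release);
}

/* Writer side (a prctl-like path): publish all four values together. */
static void mm_sketch_set(struct mm_sketch *mm, const struct arg_snapshot *v)
{
    assert(v->arg_start <= v->arg_end && v->env_start <= v->env_end);
    arg_lock_acquire(mm);
    mm->arg_start = v->arg_start;
    mm->arg_end = v->arg_end;
    mm->env_start = v->env_start;
    mm->env_end = v->env_end;
    arg_lock_release(mm);
}

/* Reader side (a /proc-like path): copy under the lock, work on the copy. */
static struct arg_snapshot mm_sketch_read(struct mm_sketch *mm)
{
    struct arg_snapshot s;

    arg_lock_acquire(mm);
    s.arg_start = mm->arg_start;
    s.arg_end = mm->arg_end;
    s.env_start = mm->env_start;
    s.env_end = mm->env_end;
    arg_lock_release(mm);
    return s;
}
```

The lock is held only for four loads or stores, so a reader never waits behind unrelated long-running VM work.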

[yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock]
  Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com
Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

slab: clean up the code comment in slab kmem_cache struct · 05fec35e

Committed by Baoquan He on Jun 07, 2018

In commit 3b0efdfa ("mm, sl[aou]b: Extract common fields from struct
kmem_cache") the variable 'obj_size' was moved above, however the
related code comment was not updated accordingly.  Do it here.

Link: http://lkml.kernel.org/r/20180603032402.27526-1-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs/dax.c: use new return type vm_fault_t · ab77dab4

Committed by Souptick Joarder on Jun 07, 2018

Use new return type vm_fault_t for fault handler.  For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno.  Once all instances are converted, vm_fault_t will become a
distinct type.

See commit 1c8f4220 ("mm: change return type to vm_fault_t").

There was an existing bug inside dax_load_hole(): if vm_insert_mixed()
failed to allocate a page table, we'd return VM_FAULT_NOPAGE instead of
VM_FAULT_OOM.  With the new vmf_insert_mixed() this issue is addressed.

vm_insert_mixed_mkwrite() is inefficient: when it returns an error value,
the driver has to convert it to the vm_fault_t type.  With the new
vmf_insert_mixed_mkwrite() this limitation will be addressed.
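
The errno-to-fault conversion that the message refers to can be sketched as below. This is an illustration only: the `SKETCH_FAULT_*` values and both function names are hypothetical, and treating `-EBUSY` as "nothing to report" is an assumption of this sketch, not something the commit message states.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative fault codes; the values are not the kernel's. */
enum sketch_fault {
    SKETCH_FAULT_NOPAGE = 1,
    SKETCH_FAULT_OOM = 2,
    SKETCH_FAULT_SIGBUS = 3
};

/* The buggy pattern described above: the errno is dropped on the floor. */
static enum sketch_fault sketch_ignore_errno(int err)
{
    (void)err;
    return SKETCH_FAULT_NOPAGE;
}

/* A conversion that keeps allocation failures visible to the caller. */
static enum sketch_fault sketch_errno_to_fault(int err)
{
    assert(err <= 0); /* 0 on success, negative errno on failure */
    if (err == -ENOMEM)
        return SKETCH_FAULT_OOM;
    if (err < 0 && err != -EBUSY)
        return SKETCH_FAULT_SIGBUS;
    return SKETCH_FAULT_NOPAGE;
}
```

Doing the conversion in one helper means each fault handler no longer has to repeat it, which is the inefficiency the last paragraph of the message mentions.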

Link: http://lkml.kernel.org/r/20180510181121.GA15239@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

07 Jun, 2018 · 1 commit

strparser: Add __strp_unpause and use it in ktls. · 7170e604

Committed by Doron Roberts-Kedes on Jun 06, 2018

strp_unpause queues strp_work in order to parse any messages that
arrived while the strparser was paused. However, the process invoking
strp_unpause could eagerly parse a buffered message itself if it held
the sock lock.

__strp_unpause is an alternative to strp_unpause that avoids the scheduling
overhead that results when a receiving thread unpauses the strparser
and waits for the next message to be delivered by the workqueue thread.

This patch more than doubled the IOPS achieved in a benchmark of NBD
traffic encrypted using ktls.
Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

06 Jun, 2018 · 6 commits

device: Use overflow helpers for devm_kmalloc() · 2509b561

Committed by Kees Cook on May 08, 2018

Use the overflow helpers both in existing multiplication-using inlines as
well as the addition-overflow case in the core allocation routine.
Signed-off-by: Kees Cook <keescook@chromium.org>

mm: Use overflow helpers in kvmalloc() · 3b3b1a29

Committed by Kees Cook on May 08, 2018

Instead of open-coded multiplication and bounds checking, use the new
overflow helper. Additionally prepare for vmalloc() users to add
array_size()-family helpers in the future.
Signed-off-by: Kees Cook <keescook@chromium.org>

mm: Use overflow helpers in kmalloc_array*() · 49b7f898

Committed by Kees Cook on May 08, 2018

Instead of open-coded multiplication and bounds checking, use the new
overflow helper.
Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Kees Cook <keescook@chromium.org>

overflow.h: Add allocation size calculation helpers · 610b15c5

Committed by Kees Cook on May 07, 2018

In preparation for replacing unchecked overflows for memory allocations,
this creates helpers for the 3 most common calculations:

array_size(a, b): 2-dimensional array
array3_size(a, b, c): 3-dimensional array
struct_size(ptr, member, n): struct followed by n-many trailing members

Each of these returns SIZE_MAX on overflow instead of wrapping around.

(Additionally renames a variable named "array_size" to avoid future
collision.)
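
The saturating behaviour of these helpers can be sketched in portable C. This is not the kernel implementation: the `sketch_*` functions are hypothetical, and `sketch_struct_size()` takes plain sizes instead of the `(ptr, member, n)` arguments of the real macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* a * b, saturating at SIZE_MAX instead of wrapping around. */
static size_t sketch_array_size(size_t a, size_t b)
{
    if (b != 0 && a > SIZE_MAX / b)
        return SIZE_MAX;
    return a * b;
}

/* a * b * c, built from two saturating multiplications. */
static size_t sketch_array3_size(size_t a, size_t b, size_t c)
{
    return sketch_array_size(sketch_array_size(a, b), c);
}

/* header + n * elem: a struct followed by n trailing elements. */
static size_t sketch_struct_size(size_t header, size_t elem, size_t n)
{
    size_t tail = sketch_array_size(elem, n);

    if (tail > SIZE_MAX - header)
        return SIZE_MAX;
    return header + tail;
}
```

Because no allocator can satisfy a request for SIZE_MAX bytes, an overflowed size turns into a failed allocation rather than an undersized buffer.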
Co-developed-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Kees Cook <keescook@chromium.org>

devlink: Add extack to reload and port_{un, }split operations · ac0fc8a1

Committed by David Ahern on Jun 05, 2018

Add extack argument to reload, port_split and port_unsplit operations.
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipmr: fix error path when ipmr_new_table fails · e783bb00

Committed by Sabrina Dubroca on Jun 05, 2018

commit 0bbbf0e7 ("ipmr, ip6mr: Unite creation of new mr_table")
refactored ipmr_new_table, so that it now returns NULL when
mr_table_alloc fails. Unfortunately, all callers of ipmr_new_table
expect an ERR_PTR.

This can result in a NULL dereference, for example when ipmr_rules_exit
calls ipmr_free_table with a NULL net->ipv4.mrt in the
!CONFIG_IP_MROUTE_MULTIPLE_TABLES version.

This patch makes mr_table_alloc return errors, and changes
ip6mr_new_table and its callers to return/expect error pointers as
well. It also removes the version of mr_table_alloc defined under
!CONFIG_IP_MROUTE_COMMON, since it is never used.
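
The mismatch between a NULL return and callers that expect an error pointer can be illustrated with a user-space sketch of the error-pointer convention. All `sketch_*` names and the `simulate_oom` parameter are hypothetical, and the 4095 limit mirrors the usual kernel convention as an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno in a pointer value. */
static void *sketch_err_ptr(long err)
{
    assert(err < 0 && err >= -SKETCH_MAX_ERRNO);
    return (void *)(intptr_t)err;
}

/* Decode the errno from an error pointer. */
static long sketch_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* True only for the top SKETCH_MAX_ERRNO addresses; NULL is NOT an error. */
static int sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

struct sketch_table {
    int id;
};

/* Allocator following the fixed convention: failure is never NULL. */
static struct sketch_table *sketch_new_table(int id, int simulate_oom)
{
    struct sketch_table *t = simulate_oom ? NULL : malloc(sizeof(*t));

    if (!t)
        return sketch_err_ptr(-ENOMEM);
    t->id = id;
    return t;
}
```

A caller that tests only `sketch_is_err()` would accept a NULL return as a valid table, which is the kind of NULL dereference the commit describes.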

Fixes: 0bbbf0e7 ("ipmr, ip6mr: Unite creation of new mr_table")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

05 Jun, 2018 · 21 commits

qed*: Utilize FW 8.37.2.0 · d52c89f1

Committed by Michal Kalderon on Jun 05, 2018

This FW contains several fixes and features.

RDMA
- Several modifications and fixes for Memory Windows
- Drop VLAN and TCP timestamp from MSS calculation in driver for
  this FW
- Fix SQ completion flow when local ack timeout is infinite
- Modifications in T10DIF support

ETH
- Fix aRFS for tunneled traffic without inner IP.
- Fix chip configuration which may fail under heavy traffic conditions.
- Support receiving any-VNI in VXLAN and GENEVE RX classification.

iSCSI / FCoE
- Fix iSCSI recovery flow
- Drop VLAN and TCP timestamp from MSS calculation for FW 8.37.2.0

Misc
- Several registers (split registers) won't read correctly with
  ethtool -d
Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com>
Signed-off-by: Manish Rangankar <manish.rangankar@cavium.com>
Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-tcp: remove useless tw_timeout field · 95358a95

Committed by Maciej Żenczykowski on Jun 05, 2018

Tested: 'git grep tw_timeout' comes up empty and it builds :-)
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

xsk: wire upp Tx zero-copy functions · ac98d8aa

Committed by Magnus Karlsson on Jun 04, 2018

Here we add the functionality required to support zero-copy Tx, and
also expose various zero-copy related functions for the netdevs.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net: added netdevice operation for Tx · e3760c7e

Committed by Magnus Karlsson on Jun 04, 2018

Added ndo_xsk_async_xmit. This ndo "kicks" the netdev to start to pull
userland AF_XDP Tx frames from a NAPI context.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

xsk: add zero-copy support for Rx · 173d3adb

Committed by Björn Töpel on Jun 04, 2018

Extend xsk_rcv to support the new MEM_TYPE_ZERO_COPY memory, and
wire up the ndo_bpf call in bind.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

xdp: add MEM_TYPE_ZERO_COPY · 02b55e56

Committed by Björn Töpel on Jun 04, 2018

Here, a new type of allocator support is added to the XDP return
API. A zero-copy allocated xdp_buff cannot be converted to an
xdp_frame. Instead, the buff has to be copied. This is not supported
at all in this commit.

Also, an opaque "handle" is added to xdp_buff. This can be used as a
context for the zero-copy allocator implementation.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net: xdp: added bpf_netdev_command XDP_{QUERY, SETUP}_XSK_UMEM · 74515c57

Committed by Björn Töpel on Jun 04, 2018

Extend ndo_bpf with two new commands, used to query zero-copy support
and to register a UMEM to a queue_id of a netdev.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

xsk: introduce xdp_umem_page · 8aef7340

Committed by Björn Töpel on Jun 04, 2018

The xdp_umem_page holds the address for a page. Trade memory for
faster lookup. Later, we'll add the DMA address here as well.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

xsk: moved struct xdp_umem definition · e61e62b9

Committed by Björn Töpel on Jun 04, 2018

Moved struct xdp_umem to xdp_sock.h, in order to prepare for zero-copy
support.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

net: phy: broadcom: Enable 125 MHz clock on LED4 pin for BCM54612E by default. · 69e2eccc

Committed by Kun Yi on Jun 04, 2018

BCM54612E has 4 multi-functional LED pins that can be configured
through register settings; the LED4 pin can be configured to a 125MHz
reference clock output by setting the spare register. Since the dedicated
CLK125 reference clock pin is not brought out on the 48-Pin MLP, the LED4
pin is the only pin to provide such function in this package, and therefore
it is beneficial to just enable the reference clock by default.
Signed-off-by: Kun Yi <kunyi@google.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: remove PM ops from MDIO bus · 9107c05e

Committed by Heiner Kallweit on Jun 02, 2018

The current implementation of MDIO bus PM ops doesn't actually implement
bus-specific PM ops but just calls PM ops defined on a device level,
which doesn't seem to be fully in line with the core PM model.

When looking e.g. at __device_suspend() the PM core looks for PM ops
of a device in a specific order:
1. device PM domain
2. device type
3. device class
4. device bus

I think there is a good reason why there are no PM ops on the device level.

Now that a device type representation of PHYs as a special type of MDIO
device was added (the only user of MDIO bus PM ops), the MDIO bus
PM ops can be removed, including member pm of struct mdio_device.

If for some other type of MDIO device PM ops are needed, it should be
modeled as struct device_type as well.
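
The lookup order listed in this message (PM domain, then type, then class, then bus) can be sketched as a first-match search. The structures and the function below are hypothetical stand-ins, not the PM core's data types.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical set of PM callbacks, identified only by a label here. */
struct sketch_pm_ops {
    const char *name;
};

/* Hypothetical device: one optional ops pointer per level. */
struct sketch_device {
    const struct sketch_pm_ops *pm_domain;
    const struct sketch_pm_ops *type;
    const struct sketch_pm_ops *class_;
    const struct sketch_pm_ops *bus;
};

/* Return the first ops found, in the order given in the message. */
static const struct sketch_pm_ops *
sketch_pick_pm_ops(const struct sketch_device *dev)
{
    assert(dev != NULL);
    if (dev->pm_domain)
        return dev->pm_domain;
    if (dev->type)
        return dev->type;
    if (dev->class_)
        return dev->class_;
    return dev->bus; /* may be NULL: no ops at any of these levels */
}
```

With this order, ops attached to the device type take precedence over bus ops, so PHYs modeled as a device type no longer need the bus-level ops.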
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: remove net_device operation ndo_xdp_flush · 189454e8

Committed by Jesper Dangaard Brouer on Jun 05, 2018

All drivers are cleaned up and no references to ndo_xdp_flush
are left in drivers, so it is time to remove the net_device_ops
operation ndo_xdp_flush.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

branch-check: fix long->int truncation when profiling branches · 2026d357

Committed by Mikulas Patocka on May 30, 2018

The function __builtin_expect returns long type (see the gcc
documentation), and so do the macros likely and unlikely. Unfortunately, when
CONFIG_PROFILE_ANNOTATED_BRANCHES is selected, the macros likely and
unlikely expand to __branch_check__ and __branch_check__ truncates the
long type to int. This unintended truncation may cause bugs in various
kernel code (we found a bug in dm-writecache because of it), so it's
better to fix __branch_check__ to return long.
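
The effect of such a truncation can be shown with a small model. This is not the kernel macro: both functions are hypothetical, `long long` is used so the example behaves the same on 32-bit and 64-bit hosts, and the narrowing conversion is implementation-defined in C (GCC and Clang keep the low 32 bits).

```c
#include <assert.h>

/* Models a check that squeezes a wide condition through an int. */
static int sketch_check_truncating(long long cond)
{
    int narrowed = (int)cond; /* high bits are lost here */

    return narrowed != 0;
}

/* Models a check that tests the value at its full width. */
static int sketch_check_full_width(long long cond)
{
    return cond != 0;
}
```

A condition such as `1LL << 32` is non-zero, yet the truncating variant reports it as false, so a branch guarded by it would silently take the wrong path.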

Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1805300818140.24812@file01.intranet.prod.int.rdu2.redhat.com

Cc: Ingo Molnar <mingo@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 1f0d69a9 ("tracing: profile likely and unlikely annotations")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

ring-buffer: Fix typo in comment · a9235b54

Committed by Vasyl Gomonovych on May 18, 2018

Fix typo of the word 'been'

Link: http://lkml.kernel.org/r/20180518203130.2011-1-gomonovych@gmail.com
Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

qed: Add srq core support for RoCE and iWARP · 39dbc646

Committed by Yuval Bason on Jun 03, 2018

This patch adds support for configuring SRQ and provides the necessary
APIs for the RDMA upper layer driver (qedr) to enable the SRQ feature.
Signed-off-by: Michal Kalderon <michal.kalderon@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: Yuval Bason <yuval.bason@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: skbuff.h: drop unneeded <linux/slab.h> · 1f4c7413

Committed by Randy Dunlap on Jun 02, 2018

<linux/skbuff.h> does not use nor need <linux/slab.h>, so drop this
header file from skbuff.h.

<linux/skbuff.h> is currently #included in around 1200 C source and
header files, making it the 31st most-used header file.

Build tested [allmodconfig] on 20 arch-es.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

rxrpc: Fix handling of call quietly cancelled out on server · 1a025028

Committed by David Howells on Jun 03, 2018

Sometimes an in-progress call will stop responding on the fileserver when
the fileserver quietly cancels the call with an internally marked abort
(RX_CALL_DEAD), without sending an ABORT to the client.

This causes the client's call to eventually expire from lack of incoming
packets directed its way, which currently leads to it being cancelled
locally with ETIME.  Note that it's not currently clear as to why this
happens as it's really hard to reproduce.

The rotation policy implemented by kAFS, however, doesn't differentiate
between ETIME meaning we didn't get any response from the server and ETIME
meaning the call got cancelled mid-flow.  The latter leads to an oops when
fetching data as the rotation partially resets the afs_read descriptor,
which can result in a cleared page pointer being dereferenced because that
page has already been filled.

Handle this by the following means:

 (1) Set a flag on a call when we receive a packet for it.

 (2) Store the highest packet serial number so far received for a call
     (bearing in mind this may wrap).

 (3) If, when the "not received anything recently" timeout expires on a
     call, we've received at least one packet for a call and the connection
     as a whole has received packets more recently than that call, then
     cancel the call locally with ECONNRESET rather than ETIME.

     This indicates that the call was definitely in progress on the server.

 (4) In kAFS, if the rotation algorithm sees ECONNRESET rather than ETIME,
     don't try the next server, but rather abort the call.

     This avoids the oops as we don't try to reuse the afs_read struct.
     Rather, as-yet ungotten pages will be reread at a later date.

Also:

 (5) Add an rxrpc tracepoint to log detection of the call being reset.

Without this, I occasionally see an oops like the following:

    general protection fault: 0000 [#1] SMP PTI
    ...
    RIP: 0010:_copy_to_iter+0x204/0x310
    RSP: 0018:ffff8800cae0f828 EFLAGS: 00010206
    RAX: 0000000000000560 RBX: 0000000000000560 RCX: 0000000000000560
    RDX: ffff8800cae0f968 RSI: ffff8800d58b3312 RDI: 0005080000000000
    RBP: ffff8800cae0f968 R08: 0000000000000560 R09: ffff8800ca00f400
    R10: ffff8800c36f28d4 R11: 00000000000008c4 R12: ffff8800cae0f958
    R13: 0000000000000560 R14: ffff8800d58b3312 R15: 0000000000000560
    FS:  00007fdaef108080(0000) GS:ffff8800ca680000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fb28a8fa000 CR3: 00000000d2a76002 CR4: 00000000001606e0
    Call Trace:
     skb_copy_datagram_iter+0x14e/0x289
     rxrpc_recvmsg_data.isra.0+0x6f3/0xf68
     ? trace_buffer_unlock_commit_regs+0x4f/0x89
     rxrpc_kernel_recv_data+0x149/0x421
     afs_extract_data+0x1e0/0x798
     ? afs_wait_for_call_to_complete+0xc9/0x52e
     afs_deliver_fs_fetch_data+0x33a/0x5ab
     afs_deliver_to_call+0x1ee/0x5e0
     ? afs_wait_for_call_to_complete+0xc9/0x52e
     afs_wait_for_call_to_complete+0x12b/0x52e
     ? wake_up_q+0x54/0x54
     afs_make_call+0x287/0x462
     ? afs_fs_fetch_data+0x3e6/0x3ed
     ? rcu_read_lock_sched_held+0x5d/0x63
     afs_fs_fetch_data+0x3e6/0x3ed
     afs_fetch_data+0xbb/0x14a
     afs_readpages+0x317/0x40d
     __do_page_cache_readahead+0x203/0x2ba
     ? ondemand_readahead+0x3a7/0x3c1
     ondemand_readahead+0x3a7/0x3c1
     generic_file_buffered_read+0x18b/0x62f
     __vfs_read+0xdb/0xfe
     vfs_read+0xb2/0x137
     ksys_read+0x50/0x8c
     do_syscall_64+0x7d/0x1a0
     entry_SYSCALL_64_after_hwframe+0x49/0xbe

Note the weird value in RDI which is a result of trying to kmap() a NULL
page pointer.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

swait: strengthen language to discourage use · c5e7a7ea

Committed by Linus Torvalds on Jun 04, 2018

We already earlier discouraged people from using this interface in
commit 88796e7e ("sched/swait: Document it clearly that the swait
facilities are special and shouldn't be used"), but I just got a pull
request with a new broken user.

So make the comment *really* clear.

The swait interfaces are bad, and should not be used unless you have
some *very* strong reasons that include tons of hard performance numbers
on just why you want to use them, and you show that you actually
understand that they aren't at all like the normal wait/wakeup
interfaces.

So far, every single user has been suspect.  The main user is KVM, which
is completely pointless (there is only ever one waiter, which avoids the
interface subtleties, but also means that having a queue instead of a
pointer is counter-productive and certainly not an "optimization").

So make the comments much stronger.

Not that anybody likely reads them anyway, but there's always some
slight hope that it will cause somebody to think twice.

I'd like to remove this interface entirely, but there is the theoretical
possibility that it's actually the right thing to use in some situation,
most likely some deep RT use.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

ipv6: omit traffic class when calculating flow hash · fa1be7e0

Committed by Michal Kubecek on Jun 04, 2018

Some of the code paths calculating flow hash for IPv6 use the flowlabel member
of struct flowi6 which, despite its name, encodes both flow label and
traffic class. If traffic class changes within a TCP connection (as e.g.
ssh does), ECMP route can switch between paths. It's also inconsistent with
other code paths where ip6_flowlabel() (returning only flow label) is used
to feed the key.

Use only flow label everywhere, including one place where hash key is set
using ip6_flowinfo().
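
The difference between hashing on flowinfo and hashing on the flow label alone can be sketched with the IPv6 header layout: 8 bits of traffic class followed by 20 bits of flow label. The masks below are written in host byte order for readability and all `sketch_*` names are hypothetical; the kernel works on big-endian values.

```c
#include <assert.h>
#include <stdint.h>

/* Host-order masks over traffic class (8 bits) and flow label (20 bits). */
#define SKETCH_FLOWINFO_MASK  0x0FFFFFFFu
#define SKETCH_FLOWLABEL_MASK 0x000FFFFFu

/* Pack traffic class and flow label the way flowinfo carries them. */
static uint32_t sketch_flowinfo(uint8_t tclass, uint32_t label)
{
    assert(label <= SKETCH_FLOWLABEL_MASK);
    return ((uint32_t)tclass << 20) | label;
}

/* Key that still contains the traffic class (the problematic input). */
static uint32_t sketch_key_with_tclass(uint32_t flowinfo)
{
    return flowinfo & SKETCH_FLOWINFO_MASK;
}

/* Key restricted to the flow label (the behaviour the fix asks for). */
static uint32_t sketch_key_label_only(uint32_t flowinfo)
{
    return flowinfo & SKETCH_FLOWLABEL_MASK;
}
```

Two packets of one connection that differ only in traffic class produce different keys with the first function and identical keys with the second, which is why the first can make an ECMP route flip between paths.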

Fixes: 51ebd318 ("ipv6: add support of equal cost multipath (ECMP)")
Fixes: f70ea018 ("net: Add functions to get skb->hash based on flow structures")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "ipv6: omit traffic class when calculating flow hash" · a925ab48

Committed by David S. Miller on Jun 04, 2018

This reverts commit 87ae68c8.

Applied the wrong version of this fix, correct version
coming up.
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: omit traffic class when calculating flow hash · 87ae68c8

Committed by Michal Kubecek on Jun 02, 2018

Some of the code paths calculating flow hash for IPv6 use the flowlabel member
of struct flowi6 which, despite its name, encodes both flow label and
traffic class. If traffic class changes within a TCP connection (as e.g.
ssh does), ECMP route can switch between paths. It's also inconsistent with
other code paths where ip6_flowlabel() (returning only flow label) is used
to feed the key.

Use only flow label everywhere, including one place where hash key is set
using ip6_flowinfo().

Fixes: 51ebd318 ("ipv6: add support of equal cost multipath (ECMP)")
Fixes: f70ea018 ("net: Add functions to get skb->hash based on flow structures")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

04 Jun, 2018 · 3 commits

xsk: new descriptor addressing scheme · bbff2f32

Committed by Björn Töpel on Jun 04, 2018

Currently, AF_XDP only supports a fixed frame-size memory scheme where
each frame is referenced via an index (idx). A user passes the frame
index to the kernel, and the kernel acts upon the data.  Some NICs,
however, do not have a fixed frame-size model, instead they have a
model where a memory window is passed to the hardware and multiple
frames are filled into that window (referred to as the "type-writer"
model).

By changing the descriptor format from the current frame index
addressing scheme, AF_XDP can in the future be extended to support
these kinds of NICs.

In the index-based model, an idx refers to a frame of size
frame_size. Addressing a frame in the UMEM is done by offsetting the
UMEM starting address by a global offset, idx * frame_size + offset.
Communicating via the fill- and completion-rings is done by means of
idx.

In this commit, the idx is removed in favor of an address (addr),
which is a relative address ranging over the UMEM. Converting an
idx-based address to the new addr is simple: addr = idx * frame_size +
offset.

We also stop referring to the UMEM "frame" as a frame. Instead it is
simply called a chunk.

To transfer ownership of a chunk to the kernel, the addr of the chunk
is passed in the fill-ring. Note that the kernel will mask addr to
make it chunk aligned, so there is no need for userspace to do
that. E.g., for a chunk size of 2k, passing an addr of 2048, 2050 or
3000 to the fill-ring will refer to the same chunk.
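
The address arithmetic described above can be written out as a short sketch. The function names are hypothetical, and the masking assumes a power-of-two chunk size, which is what makes a simple bit mask sufficient.

```c
#include <assert.h>
#include <stdint.h>

/* Old scheme to new scheme: addr = idx * frame_size + offset. */
static uint64_t sketch_idx_to_addr(uint32_t idx, uint32_t frame_size,
                                   uint32_t offset)
{
    return (uint64_t)idx * frame_size + offset;
}

/* Mask an addr down to the start of the chunk that contains it. */
static uint64_t sketch_chunk_base(uint64_t addr, uint64_t chunk_size)
{
    /* The mask trick only works for power-of-two chunk sizes. */
    assert(chunk_size != 0 && (chunk_size & (chunk_size - 1)) == 0);
    return addr & ~(chunk_size - 1);
}
```

With a chunk size of 2048, the addresses 2048, 2050 and 3000 all mask down to 2048, matching the example in the message.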

On the completion-ring, the addr will match that of the Tx descriptor,
passed to the kernel.

Changing the descriptor format to use chunks/addr will allow for
future changes to move to a type-writer based model, where multiple
frames can reside in one chunk. In this model passing one single chunk
into the fill-ring, would potentially result in multiple Rx
descriptors.

This commit changes the uapi of AF_XDP sockets, and updates the
documentation.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

bpf: flowlabel in bpf_fib_lookup should be flowinfo · bd3a08aa

Committed by David Ahern on Jun 03, 2018

As Michal noted, the flow struct takes both the flow label and priority.
Update the bpf_fib_lookup API to note that it is flowinfo and not just
the flow label.

Cc: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: implement bpf_get_current_cgroup_id() helper · bf6fa2c8

Committed by Yonghong Song on Jun 03, 2018

bpf has been used extensively for tracing. For example, bcc
contains an almost full set of bpf-based tools to trace kernel
and user functions/events. Most tracing tools are currently
either filtered based on pid or system-wide.

Containers have been used quite extensively in industry and
cgroup is often used together to provide resource isolation
and protection. Several processes may run inside the same
container. It is often desirable to get container-level tracing
results as well, e.g. syscall count, function count, I/O
activity, etc.

This patch implements a new helper, bpf_get_current_cgroup_id(),
which will return cgroup id based on the cgroup within which
the current task is running.

A later patch will provide an example to show that
userspace can get the same cgroup id so it could
configure a filter or policy in the bpf program based on
task cgroup id.

The helper is currently implemented for tracing. It can
be added to other program types as well when needed.
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

03 Jun, 2018 · 6 commits

xdp: done implementing ndo_xdp_xmit flush flag for all drivers · 73de5717

Committed by Jesper Dangaard Brouer on May 31, 2018

Removing XDP_XMIT_FLAGS_NONE as all drivers now implement
a flush operation in their ndo_xdp_xmit call.  The compiler
will catch if any users of XDP_XMIT_FLAGS_NONE remain.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

xdp: add flags argument to ndo_xdp_xmit API · 42b33468

Committed by Jesper Dangaard Brouer on May 31, 2018

This patch only changes the API and rejects any use of flags. This is an
intermediate step that allows us to implement the flush flag operation
later, for each individual driver in a separate patch.

The plan is to implement the flush operation via the XDP_XMIT_FLUSH flag
and then remove XDP_XMIT_FLAGS_NONE when done.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

vlan: use non-archaic spelling of failes · 8051ac76

Committed by Thadeu Lima de Souza Cascardo on May 31, 2018

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: fix context access in tracing progs on 32 bit archs · bc23105c

Committed by Daniel Borkmann on Jun 02, 2018

Wang reported that all the testcases for BPF_PROG_TYPE_PERF_EVENT
program type in test_verifier report the following errors on x86_32:

  172/p unpriv: spill/fill of different pointers ldx FAIL
  Unexpected error message!
  0: (bf) r6 = r10
  1: (07) r6 += -8
  2: (15) if r1 == 0x0 goto pc+3
  R1=ctx(id=0,off=0,imm=0) R6=fp-8,call_-1 R10=fp0,call_-1
  3: (bf) r2 = r10
  4: (07) r2 += -76
  5: (7b) *(u64 *)(r6 +0) = r2
  6: (55) if r1 != 0x0 goto pc+1
  R1=ctx(id=0,off=0,imm=0) R2=fp-76,call_-1 R6=fp-8,call_-1 R10=fp0,call_-1 fp-8=fp
  7: (7b) *(u64 *)(r6 +0) = r1
  8: (79) r1 = *(u64 *)(r6 +0)
  9: (79) r1 = *(u64 *)(r1 +68)
  invalid bpf_context access off=68 size=8

  378/p check bpf_perf_event_data->sample_period byte load permitted FAIL
  Failed to load prog 'Permission denied'!
  0: (b7) r0 = 0
  1: (71) r0 = *(u8 *)(r1 +68)
  invalid bpf_context access off=68 size=1

  379/p check bpf_perf_event_data->sample_period half load permitted FAIL
  Failed to load prog 'Permission denied'!
  0: (b7) r0 = 0
  1: (69) r0 = *(u16 *)(r1 +68)
  invalid bpf_context access off=68 size=2

  380/p check bpf_perf_event_data->sample_period word load permitted FAIL
  Failed to load prog 'Permission denied'!
  0: (b7) r0 = 0
  1: (61) r0 = *(u32 *)(r1 +68)
  invalid bpf_context access off=68 size=4

  381/p check bpf_perf_event_data->sample_period dword load permitted FAIL
  Failed to load prog 'Permission denied'!
  0: (b7) r0 = 0
  1: (79) r0 = *(u64 *)(r1 +68)
  invalid bpf_context access off=68 size=8

The reason is that struct pt_regs on x86_32 doesn't fully align to an 8 byte
boundary due to its size of 68 bytes. Therefore, bpf_ctx_narrow_access_ok()
will then bail out saying that off & (size_default - 1) which is 68 & 7
doesn't cleanly align in the case of sample_period access from struct
bpf_perf_event_data, hence verifier wrongly thinks we might be doing an
unaligned access here though underlying arch can handle it just fine.
Therefore adjust this down to machine size and check and rewrite the
offset for narrow access on that basis. We also need to fix corresponding
pe_prog_is_valid_access(), since we hit the check for off % size != 0
(e.g. 68 % 8 -> 4) in the first and last test. With that in place, progs
for tracing work on x86_32.
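
The arithmetic behind the two failing checks can be reproduced in isolation. The helper names are hypothetical; only the numbers (offset 68, default size 8, machine word size 4 on x86_32) come from the message.

```c
#include <assert.h>
#include <stdint.h>

/* Mask-style check: off & (size_default - 1) must be zero. */
static int sketch_aligned_mask(uint32_t off, uint32_t size_default)
{
    /* Only meaningful for power-of-two sizes. */
    assert(size_default != 0 && (size_default & (size_default - 1)) == 0);
    return (off & (size_default - 1)) == 0;
}

/* Modulo-style check: off % size must be zero. */
static int sketch_aligned_mod(uint32_t off, uint32_t size)
{
    assert(size != 0);
    return off % size == 0;
}
```

Offset 68 fails both checks against 8 (68 & 7 and 68 % 8 are both 4) but passes against the 4-byte machine word, which is why adjusting the check down to machine size lets the access through.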
Reported-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: make sure to clear unused fields in tunnel/xfrm state fetch · 1fbc2e0c

Committed by Daniel Borkmann on Jun 02, 2018

Since the remaining bits are not filled in struct bpf_tunnel_key
resp. struct bpf_xfrm_state and originate from uninitialized stack
space, we should make sure to clear them before handing control
back to the program.

Also add a padding element to struct bpf_xfrm_state for future use,
similar to what we have in struct bpf_tunnel_key, and clear it as well.

  struct bpf_xfrm_state {
      __u32                      reqid;            /*     0     4 */
      __u32                      spi;              /*     4     4 */
      __u16                      family;           /*     8     2 */

      /* XXX 2 bytes hole, try to pack */

      union {
          __u32              remote_ipv4;          /*           4 */
          __u32              remote_ipv6[4];       /*          16 */
      };                                           /*    12    16 */

      /* size: 28, cachelines: 1, members: 4 */
      /* sum members: 26, holes: 1, sum holes: 2 */
      /* last cacheline: 28 bytes */
  };
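
The layout above and the clearing it calls for can be sketched in user space. This is an illustration, not the kernel structure: `struct sketch_xfrm_state`, the padding member name `ext`, `SKETCH_AF_INET` and the fill function are assumptions of this sketch, and the sizes hold on common ABIs where `uint32_t` is 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_AF_INET 2 /* illustrative address-family value */

/* Same field order as the dump, with the 2-byte hole made explicit. */
struct sketch_xfrm_state {
    uint32_t reqid;
    uint32_t spi;
    uint16_t family;
    uint16_t ext; /* explicit padding; the name is an assumption */
    union {
        uint32_t remote_ipv4;
        uint32_t remote_ipv6[4];
    };
};

/* Fill an IPv4 result without leaking stale bytes to the caller. */
static void sketch_fill_ipv4(struct sketch_xfrm_state *to, uint32_t reqid,
                             uint32_t spi, uint32_t addr)
{
    assert(to != NULL);
    memset(to, 0, sizeof(*to)); /* clears padding and unused union words */
    to->reqid = reqid;
    to->spi = spi;
    to->family = SKETCH_AF_INET;
    to->remote_ipv4 = addr;
}
```

Without the `memset()`, the padding and the three unused words of `remote_ipv6` would keep whatever was previously in that memory.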
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: add bpf_skb_cgroup_id helper · cb20b08e

Committed by Daniel Borkmann on Jun 02, 2018

Add a new bpf_skb_cgroup_id() helper that allows retrieving the
cgroup id from the skb's socket. This is useful in particular to
enable bpf_get_cgroup_classid()-like behavior for cgroup v1 in
cgroup v2 by allowing ID based matching on egress. This can in
particular be used in combination with applying policy e.g. from
map lookups, and also complements the older bpf_skb_under_cgroup()
interface. In user space the cgroup id for a given path can be
retrieved through the f_handle as demonstrated in [0] recently.

  [0] https://lkml.org/lkml/2018/5/22/1190

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

openanolis / cloud-kernel: synced successfully almost 2 years ago
