- 29 May 2018, 11 commits
-
-
By David Ahern
Add support for IFA_RT_PRIORITY to ipv4 addresses. If the metric is changed on an existing address, the new route is inserted before removing the old one. Since the metric is one of the route keys, the prefix route cannot be replaced in place.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By David Ahern
Currently, if two interfaces have addresses in the same connected route, the order of the prefix route entries is determined by the order in which the addresses are assigned or the links brought up. Add IFA_RT_PRIORITY to allow the user to specify the metric of the prefix route associated with an address, giving interface managers better control over the order of prefix routes.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
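As an illustration of using the new attribute from userspace, a minimal hedged rtnetlink sketch (not from the patch itself): the ifindex, address and metric are placeholders, error handling is omitted, and it assumes a kernel with this series applied so that <linux/if_addr.h> defines IFA_RT_PRIORITY.

  #include <stdint.h>
  #include <string.h>
  #include <unistd.h>
  #include <arpa/inet.h>
  #include <sys/socket.h>
  #include <linux/netlink.h>
  #include <linux/rtnetlink.h>
  #include <linux/if_addr.h>

  /* Append one attribute to an under-construction netlink message. */
  static void add_attr(struct nlmsghdr *nlh, unsigned short type,
                       const void *data, int len)
  {
      struct rtattr *rta =
          (struct rtattr *)((char *)nlh + NLMSG_ALIGN(nlh->nlmsg_len));

      rta->rta_type = type;
      rta->rta_len = RTA_LENGTH(len);
      memcpy(RTA_DATA(rta), data, len);
      nlh->nlmsg_len = NLMSG_ALIGN(nlh->nlmsg_len) + RTA_ALIGN(rta->rta_len);
  }

  int main(void)
  {
      char buf[256] = {0};
      struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
      struct ifaddrmsg *ifa = NLMSG_DATA(nlh);
      struct sockaddr_nl dst = { .nl_family = AF_NETLINK };
      struct in_addr addr;
      uint32_t metric = 50;        /* becomes the prefix-route metric */
      int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

      nlh->nlmsg_len = NLMSG_LENGTH(sizeof(*ifa));
      nlh->nlmsg_type = RTM_NEWADDR;
      /* replace semantics: change the metric on an existing address */
      nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_REPLACE;
      ifa->ifa_family = AF_INET;
      ifa->ifa_prefixlen = 24;
      ifa->ifa_index = 2;          /* placeholder ifindex */

      inet_pton(AF_INET, "192.0.2.10", &addr);
      add_attr(nlh, IFA_LOCAL, &addr, sizeof(addr));
      add_attr(nlh, IFA_RT_PRIORITY, &metric, sizeof(metric));

      sendto(fd, nlh, nlh->nlmsg_len, 0,
             (struct sockaddr *)&dst, sizeof(dst));
      close(fd);
      return 0;
  }

Later iproute2 versions expose the same attribute via a metric keyword on ip-address commands.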
-
By David Ahern
Move the config parameters for adding an ipv6 address into a struct. The struct name stems from inet6_rtm_newaddr, which is the modern handler for adding an address. Start the conversion to ifa6_config with ipv6_add_addr. This is an argument move only; no functional change intended.

Mapping of variable changes:

  addr      --> cfg->pfx
  peer_addr --> cfg->peer_pfx
  pfxlen    --> cfg->plen
  flags     --> cfg->ifa_flags

scope, valid_lft and prefered_lft keep the same names within cfg (with the spelling corrected).

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
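A hedged approximation of the resulting struct, reconstructed from the mapping above (the exact member order and types in the kernel header may differ):

  struct ifa6_config {
      const struct in6_addr  *pfx;           /* was: addr */
      const struct in6_addr  *peer_pfx;      /* was: peer_addr */
      unsigned int           plen;           /* was: pfxlen */
      u32                    ifa_flags;      /* was: flags */
      u32                    preferred_lft;  /* spelling corrected */
      u32                    valid_lft;
      u16                    scope;
  };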
-
By Jakub Kicinski
MQ doesn't hold any statistics on its own; however, statistics from offloads are requested starting from the root, hence MQ will read the old values for its sums. Call into the drivers: because of the additive nature of the stats, drivers are aware of how many "pending updates" they have for the children of the MQ. Since MQ resets its stats on every dump, we can simply offset the stats, predicting how the stats of offloaded children will change.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Jakub Kicinski
mq offload is trivial; we just need to let the device know that the root qdisc is mq. An alternative approach would be to export qdisc_lookup() and make drivers check the root type themselves, but notification via ndo_setup_tc is more in line with other qdiscs. Note that mq doesn't hold any stats on its own; it just adds up the stats of its children.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Jakub Kicinski
AFAICT struct gnet_stats_queue.qlen is not used in Qdiscs. It may, however, be useful for offloads to report the HW queue length there. Add that value to the result of qdisc_qlen_sum().

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
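A hedged sketch of what qdisc_qlen_sum() looks like with the change folded in (simplified from the kernel source):

  static inline int qdisc_qlen_sum(const struct Qdisc *q)
  {
      __u32 qlen = q->qstats.qlen;  /* new: offloads can report HW qlen here */
      int i;

      if (q->flags & TCQ_F_NOLOCK) {
          for_each_possible_cpu(i)
              qlen += per_cpu_ptr(q->cpu_qstats, i)->qlen;
      } else {
          qlen += q->q.qlen;
      }
      return qlen;
  }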
-
By Paolo Abeni
The struct Qdisc has a lot of holes, especially after commit a53851e2 ("net: sched: explicit locking in gso_cpu fallback"), which, as a side effect, moved the fields just after 'busylock' onto a new cacheline. Since both 'padded' and 'refcnt' are not updated frequently, and there is a hole before 'gso_skb', we can move such fields there, saving a cacheline without any performance side effect.

Before this commit:

  pahole -C Qdisc net/sched/sch_generic.o
  # ...
  /* size: 384, cachelines: 6, members: 25 */
  /* sum members: 236, holes: 3, sum holes: 92 */
  /* padding: 56 */

After this commit:

  pahole -C Qdisc net/sched/sch_generic.o
  # ...
  /* size: 320, cachelines: 5, members: 25 */
  /* sum members: 236, holes: 2, sum holes: 28 */
  /* padding: 56 */

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Yangbo Lu
Move some definitions in ptp_qoriq.c to the header file.

Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Sridhar Samudrala
This feature bit can be used by the hypervisor to indicate that the virtio_net device should act as a standby for another device with the same MAC address. VIRTIO_NET_F_STANDBY is defined as bit 62, as it is a device feature bit.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
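A minimal, hedged sketch of how a guest driver would consume the bit; virtio_has_feature() is the standard virtio feature test, only the bit number (62) comes from this commit, and the function name is hypothetical:

  #include <linux/virtio_config.h>

  #define VIRTIO_NET_F_STANDBY 62  /* device is standby for a VF with the same MAC */

  static bool virtnet_wants_failover(struct virtio_device *vdev)
  {
      /* negotiated only if both device and driver advertise the bit */
      return virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY);
  }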
-
By Sridhar Samudrala
The net_failover driver provides an automated failover mechanism via APIs to create and destroy a failover master netdev, and manages a primary and a standby slave netdev that get registered via the generic failover infrastructure.

The failover netdev acts as a master device and controls two slave devices. The original paravirtual interface is registered as the 'standby' slave netdev, and a passthru/VF device with the same MAC is registered as the 'primary' slave netdev. Both the 'standby' and 'failover' netdevs are associated with the same 'pci' device. The user accesses the network interface via the 'failover' netdev. The 'failover' netdev chooses the 'primary' netdev as the default for transmits when it is available, with link up and running.

This can be used by paravirtual drivers to enable an alternate low-latency datapath. It also enables hypervisor-controlled live migration of a VM with a directly attached VF, by failing over to the paravirtual datapath when the VF is unplugged.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Sridhar Samudrala
The failover module provides a generic interface for paravirtual drivers to register a netdev and a set of ops with a failover instance. The ops are used as event handlers that get called to handle netdev register/unregister/link change/name change events on slave pci ethernet devices with the same MAC address as the failover netdev. This enables paravirtual drivers to use a VF as an accelerated low-latency datapath. It also allows migration of VMs with directly attached VFs by failing over to the paravirtual datapath when the VF is unplugged.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
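A hedged sketch of how a paravirtual driver would plug into this infrastructure; the callback names are taken from the event list in the commit text, the my_pv_* handlers are hypothetical, and the exact ops layout in the final API may differ:

  /* Hypothetical handlers; a real driver enslaves/releases the VF here. */
  static struct failover_ops my_pv_failover_ops = {
      .slave_register    = my_pv_slave_register,
      .slave_unregister  = my_pv_slave_unregister,
      .slave_link_change = my_pv_slave_link_change,
      .slave_name_change = my_pv_slave_name_change,
  };

  /* Ties the paravirtual netdev to register/unregister/link/name events
   * on any pci ethernet device that appears with the same MAC address. */
  failover = failover_register(pv_netdev, &my_pv_failover_ops);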
-
- 26 May 2018, 8 commits
-
-
By Jonathan Cameron
The case of a new numa node got missed in avoiding using the node info from page_struct during hotplug. In this path we have a call to register_mem_sect_under_node (which allows us to specify that it is hotplug, so don't change the node), via link_mem_sections, which unfortunately does not. The fix is to pass check_nid through link_mem_sections as well and disable it in the new numa node path.

Note the bug only 'sometimes' manifests, depending on what happens to be in the struct page structures - there are lots of them and it only needs to match one of them.

The result of the bug is that (with a new memory-only node) we never successfully call register_mem_sect_under_node, so we don't get the memory associated with the node in sysfs, and meminfo for the node doesn't report it.

It came up whilst testing some arm64 hotplug patches, but appears to be universal. Whilst I'm triggering it by removing then reinserting memory to a node with no other elements (thus making the node disappear then appear again), it appears it would happen on hotplugging memory where there was none before, and it doesn't seem to be related to the arm64 patches. These patches call __add_pages (where most of the issue was fixed by Pavel's patch). If there is a node at the time of the __add_pages call then all is well, as it calls register_mem_sect_under_node from there with check_nid set to false. Without a node, that function returns having not done the sysfs-related work, as there is no node to use. This is expected, but it is the resulting path that fails...

Exact path to the problem is as follows:

  mm/memory_hotplug.c: add_memory_resource()
    The node is not online so we enter the 'if (new_node)' block twice;
    in the second such block there is a call to link_mem_sections, which
    calls into
  drivers/node.c: link_mem_sections(), which calls
  drivers/node.c: register_mem_sect_under_node(), which calls
    get_nid_for_pfn and keeps trying until the output of that matches the
    expected node (passed all the way down from add_memory_resource).

It is effectively the same fix as the one referred to in the fixes tag, just in the code path for a new node, where the comments point out we have to rerun the link creation because it will have failed in register_new_memory (as there was no node at the time). (Actually that comment is outdated now: register_new_memory no longer exists; it got renamed to hotplug_memory_register in Pavel's patch.)

Link: http://lkml.kernel.org/r/20180504085311.1240-1-Jonathan.Cameron@huawei.com
Fixes: fc44f7f9 ("mm/memory_hotplug: don't read nid from struct page during hotplug")
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Michal Hocko
Oscar has noticed that we splat

  WARNING: CPU: 0 PID: 64 at ./include/linux/gfp.h:467 vmemmap_alloc_block+0x4e/0xc9
  [...]
  CPU: 0 PID: 64 Comm: kworker/u4:1 Tainted: G W E 4.17.0-rc5-next-20180517-1-default+ #66
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
  Workqueue: kacpi_hotplug acpi_hotplug_work_fn
  Call Trace:
   vmemmap_populate+0xf2/0x2ae
   sparse_mem_map_populate+0x28/0x35
   sparse_add_one_section+0x4c/0x187
   __add_pages+0xe7/0x1a0
   add_pages+0x16/0x70
   add_memory_resource+0xa3/0x1d0
   add_memory+0xe4/0x110
   acpi_memory_device_add+0x134/0x2e0
   acpi_bus_attach+0xd9/0x190
   acpi_bus_scan+0x37/0x70
   acpi_device_hotplug+0x389/0x4e0
   acpi_hotplug_work_fn+0x1a/0x30
   process_one_work+0x146/0x340
   worker_thread+0x47/0x3e0
   kthread+0xf5/0x130
   ret_from_fork+0x35/0x40

when adding memory to a node that is currently offline.

The VM_WARN_ON is just too loud without a good reason. In this particular case we are doing

  alloc_pages_node(node, GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN, order)

so we do not insist on allocating from the given node (it is more of a hint), so we can fall back to any other populated node, and moreover we explicitly ask not to warn on allocation failure.

Soften the warning to fire only when somebody asks for the given node explicitly via __GFP_THISNODE.

Link: http://lkml.kernel.org/r/20180523125555.30039-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Oscar Salvador <osalvador@techadventures.net>
Tested-by: Oscar Salvador <osalvador@techadventures.net>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
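The softened check then looks roughly like this; a hedged reconstruction of the gfp.h hunk, not the verbatim diff:

  static inline struct page *
  __alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order)
  {
      VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
      /* warn only when the caller insists on this exact node */
      VM_WARN_ON((gfp_mask & __GFP_THISNODE) && !node_online(nid));

      return __alloc_pages(gfp_mask, order, nid);
  }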
-
By Tariq Toukan
Complete the transition of all WQ types to use fragmented order-0 coherent memory instead of high-order allocations. CQ-WQ already uses order-0; here we do the same for cyclic and linked-list WQs. This allows the driver to load cleanly on systems with highly fragmented coherent memory.

Performance tests:
  ConnectX-5 100Gbps, CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
  Packet rate of 64B packets, single transmit ring, size 8K.
  No degradation is sensed.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
By Chris Mi
If set, the FDB table supports the forward action with a destination list that includes a flow table.

Signed-off-by: Chris Mi <chrism@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
By Yi-Hung Wei
Define netlink messages and attributes to support user-kernel communication for the conntrack limit feature.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Manish Chopra
With this patch, the user can configure supported flows to be dropped. A stat, "gft_filter_drop", is added as well, populated in ethtool for the dropped flows.

For example:

  ethtool -N p5p1 flow-type udp4 dst-port 8000 action -1
  ethtool -N p5p1 flow-type tcp4 src-ip 192.168.8.1 action -1

Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: Shahed Shaikh <shahed.shaikh@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Manish Chopra
Currently, the driver supports flow classification to PF receive queues based on TCP/UDP 4-tuples [src_ip, dst_ip, src_port, dst_port] only. This patch enables configuring different flow profiles [for example, only UDP dest port, or src_ip based] on the adapter, so that classification can be done according to just those fields as well. However, only one type of flow configuration is supported at a time, due to the limited number of flow profiles available on the device.

For example:

  ethtool -N enp7s0f0 flow-type udp4 dst-port 45762 action 2
  ethtool -N enp7s0f0 flow-type tcp4 src-ip 192.16.4.10 action 1
  ethtool -N enp7s0f0 flow-type udp6 dst-port 45762 action 3

Signed-off-by: Manish Chopra <manish.chopra@cavium.com>
Signed-off-by: Shahed Shaikh <shahed.shaikh@cavium.com>
Signed-off-by: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Nikolay Aleksandrov
This patch adds support for a new port flag - BR_ISOLATED. If it is set, isolated ports cannot communicate with each other, but they can still communicate with non-isolated ports. The same can be achieved via ACLs, but those can't scale with a large number of ports, and the complexity of the rules grows. This feature can also be used to achieve isolated-vlan functionality (similar to pvlan), though currently it is port-wide (for all vlans on the port). The new test in should_deliver uses data that is already cache-hot, and the new boolean is used to avoid an additional source-port test in should_deliver.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
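A hedged sketch of the new test, close to the kernel helper as described above: the ingress port's isolated state is cached in the skb control block, so the egress side needs only one extra flag test.

  static inline bool br_skb_isolated(const struct net_bridge_port *to,
                                     const struct sk_buff *skb)
  {
      /* true when both ingress and egress ports are isolated,
       * in which case should_deliver() refuses delivery */
      return BR_INPUT_SKB_CB(skb)->src_port_isolated &&
             (to->flags & BR_ISOLATED);
  }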
-
- 25 May 2018, 18 commits
-
-
By John Hurley
LAG upper event notifiers contain the tx type used by the LAG device. Extend this to also include the hash policy used for tx types that utilize hashing.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By David Ahern
The tracepoint does not add value: the call to fib_lookup that follows it shows the same information plus the fib lookup result.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By David Ahern
Commit bb0ad198 ("ipv6: fib6_rules: support for match on sport, dport and ip proto") added support for protocol and ports to FIB rules. Update the FIB lookup tracepoint to dump the parameters.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By David Ahern
Commit 4a2d73a4 ("ipv4: fib_rules: support match on sport, dport and ip proto") added support for protocol and ports to FIB rules. Update the FIB lookup tracepoint to dump the parameters. In addition, make the IPv4 tracepoint similar to the IPv6 one, where the lookup parameters and result are dumped in one event. It is much easier to use and to understand the outcome of the lookup.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Cong Wang
Commit 05f0fe6b ("RCU, workqueue: Implement rcu_work") introduces new APIs for dispatching work in an RCU callback. Now we can just switch to the new APIs for tc filters. This gets rid of a lot of code.

Cc: Tejun Heo <tj@kernel.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
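A hedged sketch of the conversion pattern; cls_example and its fields are hypothetical, while INIT_RCU_WORK/queue_rcu_work/to_rcu_work are the APIs from the referenced commit:

  #include <linux/workqueue.h>
  #include <linux/slab.h>

  struct cls_example_filter {
      /* ... match/action state ... */
      struct rcu_work rwork;
  };

  static void cls_example_destroy_work(struct work_struct *work)
  {
      struct cls_example_filter *f =
          container_of(to_rcu_work(work), struct cls_example_filter, rwork);

      kfree(f);
  }

  static void cls_example_delete(struct cls_example_filter *f)
  {
      /* runs the work item only after an RCU grace period elapses,
       * replacing the old open-coded call_rcu() + work trampoline */
      INIT_RCU_WORK(&f->rwork, cls_example_destroy_work);
      queue_rcu_work(system_wq, &f->rwork);
  }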
-
By Eric Biggers
The PPPIOCDETACH ioctl effectively tries to "close" the given ppp file before f_count has reached 0, which is fundamentally a bad idea. It does check 'f_count < 2', which excludes concurrent operations on the file since they would only be possible with a shared fd table, in which case each fdget() would take a file reference. However, it fails to account for the fact that even with 'f_count == 1' the file can still be linked into epoll instances. As reported by syzbot, this can trivially be used to cause a use-after-free.

Yet, the only known user of PPPIOCDETACH is pppd versions older than ppp-2.4.2, which was released almost 15 years ago (November 2003). Also, PPPIOCDETACH apparently stopped working reliably at around the same time, when the f_count check was added to the kernel, e.g. see https://lkml.org/lkml/2002/12/31/83. Also, the current 'f_count < 2' check makes PPPIOCDETACH only work in single-threaded applications; it always fails if called from a multithreaded application. All pppd versions released in the last 15 years just close() the file descriptor instead.

Therefore, instead of hacking around this bug by exporting epoll internals to modules, and probably missing other related bugs, just remove the PPPIOCDETACH ioctl and see if anyone actually notices. Leave a stub in place that prints a one-time warning and returns EINVAL.

Reported-by: syzbot+16363c99d4134717c05b@syzkaller.appspotmail.com
Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Guillaume Nault <g.nault@alphalink.fr>
Tested-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
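A hedged sketch of the stub, paraphrasing the fix rather than quoting the exact hunk:

  /* early in ppp_ioctl(): PPPIOCDETACH is no longer supported,
   * warn once and reject */
  if (cmd == PPPIOCDETACH) {
      pr_warn_once("%s (%d) used obsolete PPPIOCDETACH ioctl\n",
                   current->comm, current->pid);
      err = -EINVAL;
      goto out;
  }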
-
By Jesper Dangaard Brouer
Extending the xdp:xdp_devmap_xmit tracepoint in devmap with an err code allows people to more easily identify why the ndo_xdp_xmit call to a given driver is failing.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
By Jesper Dangaard Brouer
This patch changes the API for ndo_xdp_xmit to support bulking xdp_frames.

When the kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown. Most of the slowdown is caused by DMA API indirect function calls, but also by the net_device->ndo_xdp_xmit() call.

Benchmarking the patch with CONFIG_RETPOLINE, using xdp_redirect_map with a single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed performance improved:

  for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
  for driver i40e:  6,187,169 pps -> 6,724,519 pps = +537,350 pps

With frames available as a bulk inside the driver's ndo_xdp_xmit call, further optimizations are possible, like bulk DMA-mapping for TX.

Testing without CONFIG_RETPOLINE showed the same performance for physical NIC drivers. The virtual NIC driver tun sees a huge performance boost, as it can avoid doing per-frame producer locking and instead amortize the locking cost over the bulk.

V2: Fix compile errors reported by kbuild test robot <lkp@intel.com>
V4: Isolated ndo, driver changes and callers.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
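A hedged sketch of the reshaped ndo; the signature follows the commit description and the exact parameter names may differ:

  /* in struct net_device_ops: the driver receives a bulk of frames and
   * returns how many it consumed; the caller handles the remainder */
  int (*ndo_xdp_xmit)(struct net_device *dev, int n,
                      struct xdp_frame **frames);

A driver can then map, ring and kick the doorbell once per bulk instead of once per frame.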
-
By Jesper Dangaard Brouer
When sending an xdp_frame through an xdp_do_redirect call, error cases can happen where the xdp_frame needs to be dropped, and returning an -errno code isn't sufficient/possible any longer (e.g. for the cpumap case). This is already fully supported by simply calling xdp_return_frame. This patch is an optimization: it provides xdp_return_frame_rx_napi, a faster variant for these error cases. It takes advantage of the protection provided by XDP RX running under NAPI protection. This change is mostly relevant for drivers using the page_pool allocator, as they can take advantage of this. (Tested with mlx5.)

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
By Jesper Dangaard Brouer
Notice how this allows us to get XDP statistics without affecting XDP performance, as the tracepoint is no longer activated on a per-packet basis.

V5: Spotted by John Fastabend: 'sent' also counted 'drops' in this patch; a later patch corrected this, but it was a mistake in this intermediate step.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
By Jesper Dangaard Brouer
Functionality is the same, but the ndo_xdp_xmit call is now simply invoked from inside the devmap.c code.

V2: Fix compile issue reported by kbuild test robot <lkp@intel.com>
V5: Cleanups requested by Daniel:
  - Newlines before func definition
  - Use BUILD_BUG_ON checks
  - Remove unnecessary return value store in dev_map_enqueue

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
By Yonghong Song
Currently, suppose a userspace application has loaded a bpf program and attached it to a tracepoint/kprobe/uprobe, and a bpf introspection tool, e.g. bpftool, wants to show which bpf program is attached to which tracepoint/kprobe/uprobe. Such attachment information is really useful for understanding the overall bpf deployment in the system.

There is a name field (16 bytes) for each program, which could be used to encode the attachment point. There are some drawbacks to this approach. First, a bpftool user (e.g. an admin) may not really understand the association between the name and the attachment point. Second, if one program is attached to multiple places, encoding a proper name which can imply all these attachments becomes difficult.

This patch introduces a new bpf subcommand, BPF_TASK_FD_QUERY. Given a pid and fd, if the <pid, fd> is associated with a tracepoint/kprobe/uprobe perf event, BPF_TASK_FD_QUERY will return

  . prog_id
  . tracepoint name, or
  . k[ret]probe funcname + offset or kernel addr, or
  . u[ret]probe filename + offset

to userspace. The user can use "bpftool prog" to find more information about the bpf program itself via prog_id.

Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
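A hedged userspace sketch of issuing the query, assuming uapi headers that already carry the new subcommand; buf receives the tracepoint/probe name and buf_len is in/out:

  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <linux/bpf.h>

  static long task_fd_query(__u32 pid, __u32 fd, char *buf, __u32 buf_len,
                            __u32 *prog_id)
  {
      union bpf_attr attr;

      memset(&attr, 0, sizeof(attr));
      attr.task_fd_query.pid = pid;
      attr.task_fd_query.fd = fd;
      attr.task_fd_query.buf = (__u64)(unsigned long)buf;
      attr.task_fd_query.buf_len = buf_len;

      long err = syscall(__NR_bpf, BPF_TASK_FD_QUERY, &attr, sizeof(attr));
      if (!err)
          *prog_id = attr.task_fd_query.prog_id;  /* feed to "bpftool prog" */
      return err;
  }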
-
By Yonghong Song
A new extern function, perf_get_event(), is added to return a perf event given a struct file. This function will be used in later patches.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
-
By Huy Nguyen
Add a firmware command interface to read and write the PPTB and PBMC registers. The PPTB register enables mapping a priority to a specific receive buffer. The PBMC register enables changing the receive buffer's configuration, such as buffer size, xon/xoff thresholds, the buffer's lossy property and the buffer's shared property.

Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
By Huy Nguyen
Add pbmc and pptb to the port_access_reg_cap_mask. These two bits determine whether the device supports receive buffer configuration.

Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
By Huy Nguyen
In this patch, we add a dcbnl buffer attribute to allow the user to change the NIC's buffer configuration, such as the priority-to-buffer mapping and the size of each individual buffer. This attribute, combined with the pfc attribute, allows an advanced user to fine-tune the qos settings for specific priority queues. For example, the user can give a dedicated buffer to one or more priorities, or give a large buffer to certain priorities.

The dcb buffer configuration will be controlled by lldptool:

  lldptool -T -i eth2 -V BUFFER prio 0,2,5,7,1,2,3,6
    maps priorities 0,1,2,3,4,5,6,7 to receive buffers 0,2,5,7,1,2,3,6

  lldptool -T -i eth2 -V BUFFER size 87296,87296,0,87296,0,0,0,0
    sets the receive buffer size for buffers 0,1,2,3,4,5,6,7 respectively

After discussion on the mailing list with Jakub, Jiri, Ido and John, we agreed to choose dcbnl over the devlink interface, since this feature is intended to set port attributes which are governed by the netdev instance of that port, whereas the devlink API is more suitable for global ASIC configuration.

We present a use-case scenario where the dcbnl buffer attribute, configured by an advanced user, helps reduce the latency of messages of different sizes.

Scenario description: On ConnectX-5, we run latency-sensitive traffic with small/medium message sizes ranging from 64B to 256KB, and bandwidth-sensitive traffic with large message sizes of 512KB and 1MB. We group small, medium and large message sizes onto their own pfc-enabled priorities as follows:

  Priorities 1 & 2 (64B, 256B and 1KB)
  Priorities 3 & 4 (4KB, 8KB, 16KB, 64KB, 128KB and 256KB)
  Priorities 5 & 6 (512KB and 1MB)

By default, ConnectX-5 maps all pfc-enabled priorities to a single lossless fixed buffer sized at 50% of the total available buffer space; the other 50% is assigned to the lossy buffer. Using the dcbnl buffer attribute, we create three equally sized lossless buffers, each with 25% of the total available buffer space, so the lossy buffer shrinks to 25%. Priority-to-lossless-buffer mappings are set as follows:

  Priorities 1 & 2 on lossless buffer #1
  Priorities 3 & 4 on lossless buffer #2
  Priorities 5 & 6 on lossless buffer #3

We observe improvements in latency for small and medium message sizes as follows. Note that large-message bandwidth is reduced, but the total bandwidth remains the same.

  256B message size (42% latency reduction)
  4K message size (21% latency reduction)
  64K message size (16% latency reduction)

CC: Ido Schimmel <idosch@idosch.org>
CC: Jakub Kicinski <jakub.kicinski@netronome.com>
CC: Jiri Pirko <jiri@resnulli.us>
CC: Or Gerlitz <gerlitz.or@gmail.com>
CC: Parav Pandit <parav@mellanox.com>
CC: Aron Silverton <aron.silverton@oracle.com>
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
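For reference, a hedged sketch of the uapi structure such an attribute would carry; member names follow the commit intent, and the array bounds in the real header may be spelled via dcbnl constants rather than literals:

  struct dcbnl_buffer {
      __u8    prio2buffer[8];   /* priority -> receive buffer map */
      __u32   buffer_size[8];   /* per-buffer size in bytes */
      __u32   total_size;       /* total port buffer size */
  };

The two arrays mirror the two lldptool commands above: one sets the mapping, the other the per-buffer sizes.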
-
By Heiner Kallweit
In struct phy_device we have a number of flags defined as type bool. Similar to e.g. struct pci_dev, we can save some space by using bit-fields.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
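A hedged before/after sketch of the idea, on a subset of the flags:

  /* before: each bool occupies at least one byte */
  bool is_c45;
  bool is_internal;
  bool has_fixups;
  bool suspended;

  /* after: adjacent one-bit fields share a single storage unit */
  unsigned is_c45:1;
  unsigned is_internal:1;
  unsigned has_fixups:1;
  unsigned suspended:1;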
-
By Joonsoo Kim
This reverts the following commits that changed the CMA design in MM:

  3d2054ad ("ARM: CMA: avoid double mapping to the CMA area if CONFIG_HIGHMEM=y")
  1d47a3ec ("mm/cma: remove ALLOC_CMA")
  bad8c6c0 ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")

Ville reported the following error on i386:

  Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
  microcode: microcode updated early to revision 0x4, date = 2013-06-28
  Initializing CPU#0
  Initializing HighMem for node 0 (000377fe:00118000)
  Initializing Movable for node 0 (00000001:00118000)
  BUG: Bad page state in process swapper  pfn:377fe
  page:f53effc0 count:0 mapcount:-127 mapping:00000000 index:0x0
  flags: 0x80000000()
  raw: 80000000 00000000 00000000 ffffff80 00000000 00000100 00000200 00000001
  page dumped because: nonzero mapcount
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0-rc5-elk+ #145
  Hardware name: Dell Inc. Latitude E5410/03VXMC, BIOS A15 07/11/2013
  Call Trace:
   dump_stack+0x60/0x96
   bad_page+0x9a/0x100
   free_pages_check_bad+0x3f/0x60
   free_pcppages_bulk+0x29d/0x5b0
   free_unref_page_commit+0x84/0xb0
   free_unref_page+0x3e/0x70
   __free_pages+0x1d/0x20
   free_highmem_page+0x19/0x40
   add_highpages_with_active_regions+0xab/0xeb
   set_highmem_pages_init+0x66/0x73
   mem_init+0x1b/0x1d7
   start_kernel+0x17a/0x363
   i386_start_kernel+0x95/0x99
   startup_32_smp+0x164/0x168

The reason for this error is that the span of MOVABLE_ZONE was extended to the whole node span for future CMA initialization, and normal memory was wrongly freed here.

I submitted a fix and it seems to work, but another problem then happened. It is too late in the cycle to fix that later problem, so I decided to revert the series.

Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Acked-by: Laura Abbott <labbott@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 24 May 2018, 3 commits
-
-
By Daniel Borkmann
While reviewing the verifier code, I recently noticed that the following two program variants in relation to tail calls can be loaded.

Variant 1:

  # bpftool p d x i 15
   0: (15) if r1 == 0x0 goto pc+3
   1: (18) r2 = map[id:5]
   3: (05) goto pc+2
   4: (18) r2 = map[id:6]
   6: (b7) r3 = 7
   7: (35) if r3 >= 0xa0 goto pc+2
   8: (54) (u32) r3 &= (u32) 255
   9: (85) call bpf_tail_call#12
  10: (b7) r0 = 1
  11: (95) exit

  # bpftool m s i 5
  5: prog_array flags 0x0 key 4B value 4B max_entries 4 memlock 4096B
  # bpftool m s i 6
  6: prog_array flags 0x0 key 4B value 4B max_entries 160 memlock 4096B

Variant 2:

  # bpftool p d x i 20
   0: (15) if r1 == 0x0 goto pc+3
   1: (18) r2 = map[id:8]
   3: (05) goto pc+2
   4: (18) r2 = map[id:7]
   6: (b7) r3 = 7
   7: (35) if r3 >= 0x4 goto pc+2
   8: (54) (u32) r3 &= (u32) 3
   9: (85) call bpf_tail_call#12
  10: (b7) r0 = 1
  11: (95) exit

  # bpftool m s i 8
  8: prog_array flags 0x0 key 4B value 4B max_entries 160 memlock 4096B
  # bpftool m s i 7
  7: prog_array flags 0x0 key 4B value 4B max_entries 4 memlock 4096B

In both cases the index masking inserted by the verifier in order to control out-of-bounds speculation from a CPU via b2157399 ("bpf: prevent out-of-bounds speculation") seems to be incorrect in what it is enforcing. In the 1st variant, the mask is applied from the map with the significantly larger number of entries, where we would allow a certain degree of out-of-bounds speculation for the smaller map; in the 2nd variant, where the mask is applied from the map with the smaller number of entries, we get buggy behavior since we truncate the index of the larger map.

The original intent of commit b2157399 is to reject such occasions where two or more different tail call maps are used in the same tail call helper invocation. However, the check on BPF_MAP_PTR_POISON is never hit, since we never poisoned the saved pointer in the first place! We do this explicitly for map lookups, but in the case of tail calls we basically used the tail call map in insn_aux_data that was processed on the most recent path which the verifier walked. Thus any prior path that stored a pointer in insn_aux_data at the helper location was always overridden.

Fix it by moving the map pointer poison logic into a small helper that covers both BPF helpers with the same logic. After that, in fixup_bpf_calls() the poison check is then hit for tail calls and the program rejected. The latter only happens in the unprivileged case, since this is the *only* occasion where a rewrite needs to happen, and where such a rewrite is specific to the map (max_entries, index_mask). In the privileged case the rewrite is generic for the insn->imm / insn->code update, so multiple maps from different paths can be handled just fine, since all the remaining logic happens in the instruction processing itself.

This is similar to the case of map lookups: in case there is a collision of maps in fixup_bpf_calls(), we must skip the inlined rewrite, since it would turn the generic instruction sequence into a non-generic one. Thus patch_call_imm will simply update the insn->imm location, where bpf_map_lookup_elem() will later take care of the dispatch. Given we need this 'poison' state as a check, the information of whether a map is an unpriv_array gets lost, so enforcing it prior to that needs an additional state. In general this check is needed since there are some complex and tail-call-intensive BPF programs out there where LLVM tends to generate such code occasionally. We therefore convert map_ptr into map_state to store all this without extra memory overhead, and the bit indicating that one of the maps involved in the collision was from an unpriv_array is retained there as well.

Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
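For context, a hedged sketch of the per-map masking rewrite from commit b2157399 that makes the collision unfixable by a single rewrite; both constants below are specific to one map, and the helper is a conceptual stand-in for the verifier-emitted instruction sequence:

  static u32 clamp_tail_call_index(u32 index, u32 max_entries, u32 index_mask)
  {
      if (index >= max_entries)    /* architectural bounds check */
          return max_entries;      /* treated as a miss by the caller */
      return index & index_mask;   /* clamps speculative out-of-bounds access */
  }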
-
By Mathieu Xhonneux
This patch adds the End.BPF action to the LWT seg6local infrastructure. This action works like any other seg6local End action, meaning that an IPv6 header with an SRH is needed, whose DA has to be equal to the SID of the action. It will also advance the SRH to the next segment; the BPF program does not have to take care of this.

Since the BPF program must not be a source of instability in the kernel, it is important to ensure that the integrity of the packet is maintained before yielding it back to the IPv6 layer. The hook hence keeps track of whether the SRH has been altered through the helpers, and re-validates its content if needed with seg6_validate_srh. The state kept for validation is stored in a per-CPU buffer. The BPF program is not allowed to write directly into the packet; only some fields of the SRH can be altered through the helper bpf_lwt_seg6_store_bytes. Performance profiling has shown that the SRH re-validation does not induce a significant overhead. If the altered SRH is deemed invalid, the packet is dropped. This validation is also done before executing any action through bpf_lwt_seg6_action, and will not be performed again if the SRH is not modified after calling the action.

The BPF program may return 3 types of return codes:

  - BPF_OK: the End.BPF action will look up the next destination through seg6_lookup_nexthop.
  - BPF_REDIRECT: if an action has been executed through the bpf_lwt_seg6_action helper, the BPF program should return this value, as the skb's destination is already set and the default lookup should not be performed.
  - BPF_DROP: the packet will be dropped.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
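A minimal, hedged skeleton of an End.BPF program; the section name follows the seg6local selftest convention, and the build assumes libbpf's bpf_helpers.h:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("lwt_seg6local")
  int do_end_bpf(struct __sk_buff *skb)
  {
      /* SRH fields may only be touched via the bpf_lwt_seg6_* helpers;
       * direct packet writes are rejected in this hook */
      return BPF_OK;  /* continue via seg6_lookup_nexthop() */
  }

  char _license[] SEC("license") = "GPL";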
-
By Mathieu Xhonneux
The new bpf_lwt_push_encap helper should only be accessible within the LWT BPF IN hook, and not the OUT one, as the latter may lead to a panic when handling the skb. At the moment, both LWT BPF IN and OUT share the same list of helpers, whose calls are authorized by the verifier. This patch separates the verifier ops for the IN and OUT hooks, and allows the IN hook to call the bpf_lwt_push_encap helper.

This patch is also the occasion to put all lwt_*_func_proto functions together for clarity; at the moment, sock_ops_func_proto sits in the middle of lwt_inout_func_proto and lwt_xmit_func_proto.

Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-