- 11 December 2018, 6 commits
-
By Or Gerlitz

This isn't used anywhere across the mlx5 driver stack; remove it.

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
By Eli Britstein

Update the flow steering command formatting according to the extended destination API. Note that the FW dictates that multi-destination FTEs that involve at least one encap must use the extended destination format, while single-destination ones must use the legacy format. Using the extended destination format requires FW support: check the capability and return an error if it is not supported.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
By Eli Britstein

Change the driver flow destination struct to use bit flags, with the VHCA ID valid bit as the first one. A flags field is more extensible and will be used in a downstream patch.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
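A minimal sketch of the change with illustrative names (the exact flag and field names in the mlx5 driver are assumptions here):

enum {
    MLX5_FLOW_DEST_VHCA_ID_VALID = BIT(0), /* first flag, replaces the old bool */
    /* room for more destination flags in downstream patches */
};

struct mlx5_flow_destination {
    u32 type;
    u32 vport_num;
    u16 vhca_id;
    u8 flags; /* bit flags instead of a standalone vhca_id_valid field */
};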
-
By Eli Britstein

Extended destinations provide the ability to configure different encapsulation properties per destination on a single FTE. This is needed for use cases such as remote mirroring over tunneled networks.

Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
By Oz Shlomo

The GRE RFC defines a 32-bit key field. The NVGRE RFC splits the 32-bit key field into a 24-bit VSID (gre_key_h) and an 8-bit flow entropy (gre_key_l). Define the two key parsing alternatives in a union, thus enabling both access methods.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
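A minimal C sketch of the union described above (names are illustrative; the mlx5 fields are gre_key_h/gre_key_l). In network byte order the NVGRE VSID occupies the upper 24 bits and the flow entropy the lower 8:

union gre_key {
    __be32 key; /* plain GRE: full 32-bit key */
    struct {
        u8 vsid[3];      /* NVGRE: 24-bit VSID (gre_key_h) */
        u8 flow_entropy; /* NVGRE: 8-bit flow entropy (gre_key_l) */
    } nvgre;
};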
-
By Eyal Davidovich

Will be used in a downstream patch to let the HCA monitor counter changes and report them to the driver via an event. The driver will update its cached counter data accordingly.

Signed-off-by: Eyal Davidovich <eyald@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
- 10 December 2018, 1 commit
-
By Tariq Toukan

Introduce and use a helper that extracts the opcode from a CQE (completion queue entry) structure.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
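The opcode lives in the high nibble of the CQE op_own byte, so the helper presumably amounts to this sketch:

static inline u8 get_cqe_opcode(const struct mlx5_cqe64 *cqe)
{
    return cqe->op_own >> 4; /* low nibble holds format/owner bits */
}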
-
- 09 December 2018, 3 commits
-
By Andrew Lunn

The Marvell 6390 Ethernet switch family does not perform MDIO turnaround correctly. Many hardware MDIO bus masters don't care about this, but the bit-banging implementation in Linux does by default. Add phy_ignore_ta_mask to the platform data so that the bit-banging code can be told which devices are known to get TA wrong.

v2

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Andrew Lunn

It is sometimes necessary to instantiate a bit-banging MDIO bus as a platform device, without the aid of device tree. When device tree is being used, the bus is not scanned for devices; only those devices which are in device tree are probed. Without device tree, by default, all addresses on the bus are scanned. This may then find a device which is not a PHY, e.g. a switch, and the switch may have registers containing values which look like a PHY. So during the scan, a PHY device is wrongly created.

After the bus has been registered, a search is made for mdio_board_info structures which indicate devices on the bus and the driver which should be used for them. This is typically used to instantiate Ethernet switches from platform drivers. However, if the scanning of the bus has created a PHY device at the same location as indicated in the board info for a switch, the switch device is not created, since the address is already busy.

This can be avoided by setting the phy_mask of the mdio bus. This mask prevents addresses on the bus being scanned.

v2

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
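A sketch of the platform data a board file might register, covering both this patch and the phy_ignore_ta_mask change above (struct and field names are assumptions based on the descriptions):

struct mdio_gpio_platform_data {
    u32 phy_mask;           /* addresses to skip when scanning the bus */
    u32 phy_ignore_ta_mask; /* devices known to get turnaround wrong */
};

static struct mdio_gpio_platform_data mdio_pdata = {
    .phy_mask = ~0,               /* scan nothing: the switch is instantiated
                                   * via mdio_board_info instead */
    .phy_ignore_ta_mask = BIT(4), /* e.g. a 6390 switch at address 4 */
};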
-
By David Rientjes

This reverts commit 89c83fb5. This should have been done as part of 2f0799a0 ("mm, thp: restore node-local hugepage allocations"). The movement of the thp allocation policy from alloc_pages_vma() to alloc_hugepage_direct_gfpmask() was intended to only set __GFP_THISNODE for mempolicies that are not MPOL_BIND, whereas the revert could set this regardless of mempolicy. While the check for MPOL_BIND between alloc_hugepage_direct_gfpmask() and alloc_pages_vma() was racy, it has since been removed since the revert. What is left is the possibility to use __GFP_THISNODE in policy_node() when it is unexpected, because the special handling for hugepages in alloc_pages_vma() was removed as part of the consolidation.

Secondly, prior to 89c83fb5, alloc_pages_vma() implemented a somewhat different policy for hugepage allocations, which were allocated through alloc_hugepage_vma(): for hugepage allocations, if the allocating process's node is in the set of allowed nodes, allocate with __GFP_THISNODE for that node (for MPOL_PREFERRED, use that node with __GFP_THISNODE instead). This was changed for shmem_alloc_hugepage() to allow fallback to other nodes in 89c83fb5, as it did for new_page() in mm/mempolicy.c, which is functionally different behavior and removes the requirement to only allocate hugepages locally.

So this commit does a full revert of 89c83fb5 instead of the partial revert that was done in 2f0799a0. The result is the same thp allocation policy for 4.20 that was in 4.19.

Fixes: 89c83fb5 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
Fixes: 2f0799a0 ("mm, thp: restore node-local hugepage allocations")
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 08 December 2018, 5 commits
-
By Stefano Brivio

While skb_push() makes the kernel panic if the skb headroom is less than the unaligned hardware header size, it will proceed normally in case we copy more than that because of alignment, and we'll silently corrupt adjacent slabs. In the case fixed by the previous patch, "ipv6: Check available headroom in ip6_xmit() even without options", we end up in neigh_hh_output() with 14 bytes of headroom and a 14-byte hardware header, and write 16 bytes, starting 2 bytes before the allocated buffer.

Always check that we're not writing before skb->head and, if the headroom is not enough, warn and drop the packet.

v2:
- instead of panicking with BUG_ON(), WARN_ON_ONCE() and drop the packet (Eric Dumazet)
- if we avoid the panic, though, we need to explicitly check the headroom before the memcpy(), otherwise we'll have corrupted slabs on a running kernel after we warn
- use __skb_push() instead of skb_push(), as the headroom check is already implemented here explicitly (Eric Dumazet)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
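The fix presumably looks like the following sketch in neigh_hh_output(), where hh_len is the unaligned hardware header length and hh_alen its aligned copy length:

unsigned int hh_alen = HH_DATA_ALIGN(hh_len);

if (WARN_ON_ONCE(skb_headroom(skb) < hh_alen)) {
    kfree_skb(skb); /* warn once and drop instead of corrupting slabs */
    return NET_XMIT_DROP;
}
memcpy(skb->data - hh_alen, hh->hh_data, hh_alen);
__skb_push(skb, hh_len); /* headroom already verified above */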
-
By David Ahern

The existing garbage collection algorithm has a number of problems:

1. The gc algorithm will not evict PERMANENT entries, as those entries are managed by userspace, yet the existing algorithm walks the entire hash table, which means it always considers PERMANENT entries when looking for entries to evict. In some use cases (e.g., EVPN) there can be tens of thousands of PERMANENT entries, leading to wasted CPU cycles when gc kicks in. As an example, with 32k permanent entries, neigh_alloc has been observed taking more than 4 msec per invocation.

2. Currently, when the number of neighbor entries hits gc_thresh2 and the last flush for the table was more than 5 seconds ago, gc kicks in and walks the entire hash table, evicting *all* entries not in PERMANENT or REACHABLE state and not marked as externally learned. There is no discriminator on when the neigh entry was created or if it just moved from REACHABLE to another NUD_VALID state (e.g., NUD_STALE). It is possible for entries to be created, or for established neighbor entries to be moved to STALE (e.g., an external node sends an ARP request), right before the 5 second window lapses:

        -----|---------x|----------|-----
            t-5         t         t+5

If that happens, those entries are evicted during gc, causing unnecessary thrashing on neighbor entries and userspace caches trying to track them. Further, this contradicts the description of gc_thresh2, which says "Entries older than 5 seconds will be cleared". One workaround is to make gc_thresh2 == gc_thresh3, but that negates the whole point of having separate thresholds.

3. Clearing *all* non-PERMANENT/REACHABLE/externally-learned neigh entries when gc_thresh2 is exceeded is overkill and contributes to thrashing, especially during startup.

This patch addresses these problems as follows:

1. Use a separate list_head to track entries that can be garbage collected, along with a separate counter. PERMANENT entries are not added to this list. The gc_thresh parameters are only compared to the new counter, not the total entries in the table. The forced_gc function is updated to only walk this new gc_list looking for entries to evict.

2. Entries are added to the list head at the tail and removed from the front.

3. Entries are only evicted if they were last updated more than 5 seconds ago, adhering to the original intent of gc_thresh2.

4. Forced gc is stopped once the number of gc_entries drops below gc_thresh2.

5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped when allocating a new neighbor for a PERMANENT entry. By extension this means there are no explicit limits on the number of PERMANENT entries that can be created, but this is no different from FIB entries or FDB entries.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
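A sketch of the resulting forced-gc walk (names approximate the patch rather than quote it):

static int neigh_forced_gc(struct neigh_table *tbl)
{
    int max_clean = atomic_read(&tbl->gc_entries) - tbl->gc_thresh2;
    unsigned long tref = jiffies - 5 * HZ;
    struct neighbour *n, *tmp;
    int shrunk = 0;

    write_lock_bh(&tbl->lock);
    list_for_each_entry_safe(n, tmp, &tbl->gc_list, gc_list) {
        if (shrunk >= max_clean)
            break; /* back below gc_thresh2, stop evicting */
        /* only evict entries last updated more than 5 seconds ago */
        if (time_after(tref, n->updated) && neigh_remove_one(n, tbl))
            shrunk++;
    }
    write_unlock_bh(&tbl->lock);
    return shrunk;
}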
-
By Petr Machata

When a driver unoffloads all FDB entries en bloc, it's inefficient to send the switchdev notifications one by one. Add a helper that unsets the offload flag on FDB entries on a given bridge port and VLAN.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Petr Machata

When a driver unoffloads all FDB entries en bloc, it's inefficient to send the switchdev notifications one by one. Add a helper that walks the FDB table, unsetting the offload flag on RDSTs with a given VNI.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Petr Machata

When a VXLAN device becomes relevant to a driver (such as when it is attached to an offloaded bridge), the driver will generally need to walk the existing FDB entries and offload them. Add a function, vxlan_fdb_replay(), to call a given notifier block for each FDB entry with a given VNI.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 07 December 2018, 5 commits
-
By Danit Goldberg

The packet-based credit mode bit determines whether the credit mode is done per message or per packet. Expose the QP creation flag and the HCA capability.

Signed-off-by: Danit Goldberg <danitg@mellanox.com>
Reviewed-by: Majd Dibbiny <majd@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
-
By Petr Machata

In order to pass extack together with NETDEV_PRE_UP notifications, it's necessary to route the extack to __dev_open() from diverse (possibly indirect) callers. The last missing API is __dev_change_flags(). Therefore extend __dev_change_flags() with an extra extack argument and update the two existing users. Since the function declaration line is changed anyway, name the struct net_device argument to placate checkpatch.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Petr Machata

In order to pass extack together with NETDEV_PRE_UP notifications, it's necessary to route the extack to __dev_open() from diverse (possibly indirect) callers. One prominent API through which the notification is invoked is dev_change_flags(). Therefore extend dev_change_flags() with an extra extack argument and update all users. Most of the calls end up just passing NULL, but several sites (VLAN, ipvlan, VRF, rtnetlink) do have extack available. Since the function declaration line is changed anyway, name the other function arguments to placate checkpatch.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Petr Machata

In order to pass extack together with NETDEV_PRE_UP notifications, it's necessary to route the extack to __dev_open() from diverse (possibly indirect) callers. One prominent API through which the notification is invoked is dev_open(). Therefore extend dev_open() with an extra extack argument and update all users. Most of the calls end up just passing NULL, but the bond and team drivers have the extack readily available.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
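The resulting signatures across these three patches read roughly as follows (sketch; callers without an extack simply pass NULL):

int dev_open(struct net_device *dev, struct netlink_ext_ack *extack);
int dev_change_flags(struct net_device *dev, unsigned int flags,
                     struct netlink_ext_ack *extack);
int __dev_change_flags(struct net_device *dev, unsigned int flags,
                       struct netlink_ext_ack *extack);

/* typical updated call site with no extack to hand */
err = dev_open(dev, NULL);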
-
By Andrew Lunn

Each DSA tag protocol needs to add additional headers to the Ethernet frame in order to direct it towards a specific switch egress port. It must also remove the header from a frame received from a switch. Indicate the maximum size of these headers in the tag protocol ops structure, so the core can take these overheads into account.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
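A sketch of the ops structure with the new field (surrounding members abbreviated; the field name is an assumption based on the description):

struct dsa_device_ops {
    struct sk_buff *(*xmit)(struct sk_buff *skb, struct net_device *dev);
    struct sk_buff *(*rcv)(struct sk_buff *skb, struct net_device *dev,
                           struct packet_type *pt);
    unsigned int overhead; /* max bytes this tag protocol adds to a frame */
};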
-
- 06 December 2018, 8 commits
-
By Guo Ren

The broken macros cause a glibc compile error. If there is no __NR3264_fstat*, we should also remove the related definitions.

Reported-by: Marcin Juszkiewicz <marcin.juszkiewicz@linaro.org>
Fixes: bf4b6a7d ("y2038: Remove stat64 family from default syscall set")
[arnd: Both Marcin and Guo provided this patch to fix up my clearly broken commit, I applied the version with the better changelog.]
Signed-off-by: Guo Ren <ren_guo@c-sky.com>
Signed-off-by: Mao Han <han_mao@c-sky.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-
By Jakub Audykowicz

If for some reason an association's fragmentation point is zero, sctp_datamsg_from_user will endlessly try to divide a message into zero-sized chunks. This eventually causes a kernel panic due to running out of memory. Although this situation is quite unlikely, it has occurred before, as reported. I propose adding this simple last-ditch sanity check given the severity of the potential consequences.

Signed-off-by: Jakub Audykowicz <jakub.audykowicz@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
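A minimal sketch of the sanity check in sctp_datamsg_from_user (the fallback value and message wording are assumptions, not the exact patch):

max_data = asoc->frag_point;
if (unlikely(!max_data)) {
    /* never chunk with a zero frag point; fall back to a sane minimum */
    max_data = SCTP_DEFAULT_MINSEGMENT;
    pr_warn_ratelimited("%s: asoc:%p frag_point is zero, forcing max_data to a minimum\n",
                        __func__, asoc);
}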
-
By Andrew Lunn

Replace the if/else code structure with a call to the helper linkmode_mod_bit().

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Andrew Lunn

Add a _mod_ variant of mii_lpa_to_linkmode_lpa_t(). Use it to fix genphy_read_status(), where the 1G link partner features were getting lost.

Fixes: c0ec3c27 ("net: phy: Convert u32 phydev->lp_advertising to linkmode")
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Andrew Lunn

Rename mii_stat1000_to_linkmode_lpa_t() to mii_stat1000_mod_linkmode_lpa_t() to indicate it modifies the passed linkmode bitmap without clearing any other bits. Add a helper to set/clear bits in a linkmode. Use this helper to ensure that bits which the stat1000 register indicates should not be set are cleared.

Fixes: c0ec3c27 ("net: phy: Convert u32 phydev->lp_advertising to linkmode")
Suggested-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
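The helper reads as one would expect (sketch mirroring the description above), with a usage fragment assuming `advertising` and `lpa` from the surrounding conversion code:

static inline void linkmode_mod_bit(int nr, volatile unsigned long *addr,
                                    int set)
{
    if (set)
        linkmode_set_bit(nr, addr);
    else
        linkmode_clear_bit(nr, addr);
}

/* e.g. inside mii_stat1000_mod_linkmode_lpa_t(): set or clear 1000HALF
 * according to the link partner status, leaving other bits alone */
linkmode_mod_bit(ETHTOOL_LINK_MODE_1000baseT_Half_BIT, advertising,
                 lpa & LPA_1000HALF);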
-
By Andrew Lunn

mii_adv_to_linkmode_adv_t() clears all bits before setting the ones it needs to set. This means the freshly set Autoneg bit gets cleared. Change the order, and add comments about the function clearing the old contents of the bitmap.

Fixes: c0ec3c27 ("net: phy: Convert u32 phydev->lp_advertising to linkmode")
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By David Rientjes

This is a full revert of ac5b2c18 ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") and a partial revert of 89c83fb5 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask").

By not setting __GFP_THISNODE, applications can allocate remote hugepages when the local node is fragmented or low on memory, when either the thp defrag setting is "always" or the vma has been madvised with MADV_HUGEPAGE. Remote access to hugepages often has much higher latency than local pages of the native page size. On Haswell, ac5b2c18 was shown to have a 13.9% access regression after this commit for binaries that remap their text segment to be backed by transparent hugepages.

The intent of ac5b2c18 is to address an issue where a local node is low on memory or fragmented such that a hugepage cannot be allocated. In every scenario where this was described as a fix, there is abundant and unfragmented remote memory available to allocate from, even with a greater access latency. If remote memory is also low or fragmented, not setting __GFP_THISNODE was also measured on Haswell to have a 40% regression in allocation latency.

Restore __GFP_THISNODE for thp allocations.

Fixes: ac5b2c18 ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
Fixes: 89c83fb5 ("mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask")
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Mathias Payer

When reading an extra descriptor, we need to properly check the minimum and maximum size allowed, to prevent invalid data being sent by a device.

Reported-by: Hui Peng <benquike@gmail.com>
Reported-by: Mathias Payer <mathias.payer@nebelwelt.net>
Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Hui Peng <benquike@gmail.com>
Signed-off-by: Mathias Payer <mathias.payer@nebelwelt.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 05 December 2018, 4 commits
-
By Johan Hovold

The USB-serial console implementation has never reported the actual terminal settings used. Despite storing the corresponding cflags in its struct console, these were never honoured on later tty open(), where the tty termios would be left initialised to the driver defaults.

Unlike the serial console implementation, the USB-serial code calls subdriver open() already at console setup. While calling set_termios() and write() before open() looks like it could work for some USB-serial drivers, others definitely do not expect this, so modelling this after serial core is going to be intrusive, if at all possible.

Instead, use a (renamed) tty helper to save the termios data used at console setup so that the tty termios reflects the actual terminal settings after a subsequent tty open(). Note that the calls to tty_init_termios() (tty_driver_install()) and tty_save_termios() are serialised using the disconnect mutex.

This specifically fixes a regression that was triggered by a recent change adding software flow control to the pl2303 driver: a getty trying to disable flow control while leaving the baud rate unchanged would now also set the baud rate to the driver default (prior to the flow-control change this had been a noop).

Fixes: 7041d9c3 ("USB: serial: pl2303: add support for tx xon/xoff flow control")
Cc: stable <stable@vger.kernel.org> # 4.18
Cc: Florian Zumbiehl <florz@florz.de>
Reported-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Johan Hovold <johan@kernel.org>
-
By Matthew Wilcox

Internal to dax_unlock_mapping_entry(), dax_unlock_entry() is used to store a replacement entry in the Xarray at the given xas-index with the DAX_LOCKED bit clear. When called, dax_unlock_entry() expects the unlocked value of the entry relative to the current Xarray state to be specified.

In most contexts dax_unlock_entry() is operating in the same scope as the matched dax_lock_entry(). However, in the dax_unlock_mapping_entry() case the implementation needs to recall the original entry. In the case where the original entry is a 'pmd' entry, it is possible that the pfn used to do the lookup is misaligned relative to the value retrieved in the Xarray.

Change the API to return the unlock cookie from dax_lock_page() and pass it to dax_unlock_page(). This fixes a bug where dax_unlock_page() was assuming that the page was PMD-aligned if the entry was a PMD entry, with signatures like:

 WARNING: CPU: 38 PID: 1396 at fs/dax.c:340 dax_insert_entry+0x2b2/0x2d0
 RIP: 0010:dax_insert_entry+0x2b2/0x2d0
 [..]
 Call Trace:
  dax_iomap_pte_fault.isra.41+0x791/0xde0
  ext4_dax_huge_fault+0x16f/0x1f0
  ? up_read+0x1c/0xa0
  __do_fault+0x1f/0x160
  __handle_mm_fault+0x1033/0x1490
  handle_mm_fault+0x18b/0x3d0

Link: https://lkml.kernel.org/r/20181130154902.GL10377@bombadil.infradead.org
Fixes: 9f32d221 ("dax: Convert dax_lock_mapping_entry to XArray")
Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
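The changed API presumably reads like this sketch: dax_lock_page() returns an opaque cookie capturing the entry it locked, and dax_unlock_page() takes that cookie back instead of re-deriving the entry from the page:

dax_entry_t cookie;

cookie = dax_lock_page(page);
if (!cookie)
    return -EBUSY; /* page not associated with a DAX mapping */
/* ... operate on the page while the entry is held locked ... */
dax_unlock_page(page, cookie);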
-
By Eric Dumazet

The TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12 as a step towards enabling bigger tcp sndbuf limits. It works reasonably well, but the following happens: once the limit is reached, the TCP stack generates an [E]POLLOUT event for every incoming ACK packet. This causes a high number of context switches.

This patch implements the strategy David Miller added in sock_def_write_space(): if a TCP socket has a notsent_lowat constraint of X bytes, allow sendmsg() to fill up to X bytes, but send [E]POLLOUT only if the number of notsent bytes is below X/2. This considerably reduces TCP_NOTSENT_LOWAT overhead, while allowing to keep the pipe full.

Tested: 100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM

A:/# cat /proc/sys/net/ipv4/tcp_wmem
4096 262144 64000000
A:/# super_netperf 100 -H B -l 1000 -- -K bbr &
A:/# grep TCP /proc/net/sockstat
TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904  # about 54 MB of memory per flow :/
A:/# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd      free   buff  cache   si   so   bi   bo     in    cs us sy id wa st
 0  0      0 256220672  13532 694976    0    0   10    0     28    14  0  1 99  0  0
 2  0      0 256320016  13532 698480    0    0  512    0 715901  5927  0 10 90  0  0
 0  0      0 256197232  13532 700992    0    0  735   13 771161  5849  0 11 89  0  0
 1  0      0 256233824  13532 703320    0    0  512   23 719650  6635  0 11 89  0  0
 2  0      0 256226880  13532 705780    0    0  642    4 775650  6009  0 12 88  0  0
A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat
A:/# grep TCP /proc/net/sockstat
TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411  # 3.5 MB per flow
A:/# vmstat 5 5  # check that context switches have not inflated too much
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd      free   buff  cache   si   so   bi   bo     in    cs us sy id wa st
 2  0      0 260386512  13592 662148    0    0   10    0     17    14  0  1 99  0  0
 0  0      0 260519680  13592 604184    0    0  512   13 726843 12424  0 10 90  0  0
 1  1      0 260435424  13592 598360    0    0  512   25 764645 12925  0 10 90  0  0
 1  0      0 260855392  13592 578380    0    0  512    7 722943 13624  0 11 88  0  0
 1  0      0 260445008  13592 601176    0    0  614   34 772288 14317  0 10 90  0  0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
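For reference, the userspace side of the test above caps unsent data per socket with the long-standing socket option (runnable sketch):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int set_notsent_lowat(int fd)
{
    int lowat = 2097152; /* 2 MB, as echoed into the sysctl above */

    /* with this patch, [E]POLLOUT fires once unsent data drops below lowat/2 */
    return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, &lowat, sizeof(lowat));
}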
-
By Ido Schimmel

Commit abf4bb6b ("skbuff: Add the offload_mr_fwd_mark field") added the 'offload_mr_fwd_mark' field to indicate that a packet has already undergone L3 multicast routing by a capable device. The field is used to prevent the kernel from forwarding a packet through a netdev through which the device has already forwarded the packet.

Currently, no unicast packet is routed by both the device and the kernel, but this is about to change in subsequent patches and we need to be able to mark such packets, so that they will not be forwarded twice.

Instead of adding yet another field to 'struct sk_buff', we can just rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark', as a packet either has a multicast or a unicast destination IP. While at it, add a comment about both 'offload_fwd_mark' and 'offload_l3_fwd_mark'.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
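After the rename, the sk_buff presumably carries both marks side by side (sketch of the relevant bitfield):

struct sk_buff {
    /* ... */
    __u8 offload_fwd_mark:1;    /* L2: already forwarded by the device */
    __u8 offload_l3_fwd_mark:1; /* L3: already routed by the device,
                                 * unicast or multicast */
    /* ... */
};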
-
- 04 December 2018, 8 commits
-
By Yishai Hadas

Expose device capabilities for the DEVX user context, including which capabilities the device supports and a matching bit to set as part of user context creation.

Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
-
By Leon Romanovsky

Transfer initialization and cleanup from the mlx5_priv struct of mlx5_core_dev to be part of mlx5_ib_dev. This completes the removal of SRQ from mlx5_core.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
-
By Leon Romanovsky

There is no need to keep SRQ, which is an RDMA object, in mlx5_core. In this patch, we partially move the execution code, while the next patches will move the table initialization/release logic too.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
-
By Leon Romanovsky

Delete functions which are not called and not needed.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
-
By Leon Romanovsky

Ensure that both the RDMA and netdev parts of the SRQ implementation have the same copyright and license information, annotated by SPDX tags.

Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
-
By Willem de Bruijn

With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info. Release of its last reference triggers a completion notification.

The TCP stack in tcp_sendmsg_locked holds an extra ref independent of the skbs, because it can build, send and free skbs within its loop, possibly reaching refcount zero and freeing the ubuf_info too soon. The UDP stack currently also takes this extra ref, but does not need it, as all skbs are sent after return from __ip(6)_append_data.

Avoid the extra refcount_inc and refcount_dec_and_test, and generally the sock_zerocopy_put in the common path, by passing the initial reference to the first skb. This approach is taken instead of initializing the refcount to 0, as that would generate the error "refcount_t: increment on 0" on the next skb_zcopy_set.

Changes v3 -> v4:
- Move skb_zcopy_set below the only kfree_skb that might cause a premature uarg destroy before skb_zerocopy_put_abort
- Move the entire skb_shinfo assignment block, to keep that cacheline access in one place

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
By Willem de Bruijn

Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and interpret flag MSG_ZEROCOPY.

This patch was previously part of the zerocopy RFC patchsets. Zerocopy is not effective at small MTU. With segmentation offload building larger datagrams, the benefit of page flipping outweighs the cost of generating a completion notification.

tools/testing/selftests/net/msg_zerocopy.sh after applying the follow-on test patch and making skb_orphan_frags_rx the same as skb_orphan_frags:

ipv4 udp -t 1
tx=191312 (11938 MB) txc=0 zc=n
rx=191312 (11938 MB)
ipv4 udp -z -t 1
tx=304507 (19002 MB) txc=304507 zc=y
rx=304507 (19002 MB)
ok
ipv6 udp -t 1
tx=174485 (10888 MB) txc=0 zc=n
rx=174485 (10888 MB)
ipv6 udp -z -t 1
tx=294801 (18396 MB) txc=294801 zc=y
rx=294801 (18396 MB)
ok

Changes v1 -> v2:
- Fixup reverse christmas tree violation

Changes v2 -> v3:
- Split refcount avoidance optimization into separate patch
- Fix refcount leak on error in fragmented case (thanks to Paolo Abeni for pointing this one out!)
- Fix refcount inc on zero
- Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data. This is needed since commit 5cf4a853 ("tcp: really ignore MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
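Userspace usage as enabled by this patch (runnable sketch, error handling trimmed; completion notifications are read from the socket error queue with recvmsg(fd, ..., MSG_ERRQUEUE), as with TCP zerocopy):

#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

ssize_t send_zc(int fd, const void *buf, size_t len)
{
    int one = 1;

    /* opt in on the socket first; MSG_ZEROCOPY is ignored without it */
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
        return -1;
    /* pages stay pinned until the completion notification is read */
    return send(fd, buf, len, MSG_ZEROCOPY);
}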
-
By Xin Long

In sctp_hash_transport/sctp_epaddr_lookup_transport, a transport's asoc is dereferenced under rcu_read_lock while the asoc is freed without waiting for a grace period, which leads to a use-after-free panic. This patch fixes it by calling kfree_rcu to make the asoc be freed after a grace period.

Note that only the asoc's memory is delayed to free in the patch; it won't cause sk to linger longer. Thanks to Neil and Marcelo for making this clear.

Fixes: 7fda702f ("sctp: use new rhlist interface on sctp transport rhashtable")
Fixes: cd2b7087 ("sctp: check duplicate node before inserting a new transport")
Reported-by: syzbot+0b05d8aa7cb185107483@syzkaller.appspotmail.com
Reported-by: syzbot+aad231d51b1923158444@syzkaller.appspotmail.com
Suggested-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
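The essence of the fix as described (sketch): give the association an rcu_head and free it after a grace period instead of immediately:

struct sctp_association {
    /* ... */
    struct rcu_head rcu;
};

/* in the association destroy path, replacing the direct kfree(asoc): */
kfree_rcu(asoc, rcu);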
-