- 30 8月, 2015 1 次提交
-
-
由 Trond Myklebust 提交于
In case the reconnection attempt fails. Cc: stable@vger.kernel.org Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-
- 20 8月, 2015 1 次提交
-
-
由 Trond Myklebust 提交于
Follow up to commit c4a7ca77 ("SUNRPC: Allow waiting on memory allocation"). Allows the RPC socket code to do non-IO blocking. Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-
- 18 8月, 2015 1 次提交
-
-
由 Trond Myklebust 提交于
It is rather pointless to test the value of transport->inet after calling xs_reset_transport(), since it will always be zero, and so we will never see any exponential back off behaviour. Also don't force early connections for SOFTCONN tasks. If the server disconnects us, we should respect the exponential backoff. Cc: stable@vger.kernel.org # 4.0+ Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-
- 13 8月, 2015 1 次提交
-
-
由 Jeff Layton 提交于
The current limit of 32 bytes artificially limits the name string that we end up stuffing into NFSv4.x client ID blobs. If you have multiple hosts with long hostnames that only differ near the end, then this can cause NFSv4 client ID collisions. Linux nodenames are actually limited to __NEW_UTS_LEN bytes (64), so use that as the limit instead. Also, use XDR_QUADLEN to specify the slack length, just for clarity and in case someone in the future changes this to something not evenly divisible by 4. Reported-by: NMichael Skralivetsky <michael.skralivetsky@primarydata.com> Signed-off-by: NJeff Layton <jeff.layton@primarydata.com> Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-
- 06 8月, 2015 14 次提交
-
-
由 Devesh Sharma 提交于
This is a rework of the following patch sent almost a year back: http://www.mail-archive.com/linux-rdma%40vger.kernel.org/msg20730.html In presence of active mount if someone tries to rmmod vendor-driver, the command remains stuck forever waiting for destruction of all rdma-cm-id. in worst case client can crash during shutdown with active mounts. The existing code assumes that ia->ri_id->device cannot change during the lifetime of a transport. xprtrdma do not have support for DEVICE_REMOVAL event either. Lifting that assumption and adding support for DEVICE_REMOVAL event is a long chain of work, and is in plan. The community decided that preventing the hang right now is more important than waiting for architectural changes. Thus, this patch introduces a temporary workaround to acquire HCA driver module reference count during the mount of a nfs-rdma mount point. Signed-off-by: NDevesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Reviewed-by: NSagi Grimberg <sagig@dev.mellanox.co.il> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
RDMA_NOMSG type calls are less efficient than RDMA_MSG. Count NOMSG calls so administrators can tell if they happen to be used more than expected. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
checkpatch.pl complained about the seq_printf() format string split across lines and the use of %Lu. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Repair how rpcrdma_marshal_req() chooses which RDMA message type to use for large non-WRITE operations so that it picks RDMA_NOMSG in the correct situations, and sets up the marshaling logic to SEND only the RPC/RDMA header. Large NFSv2 SYMLINK requests now use RDMA_NOMSG calls. The Linux NFS server XDR decoder for NFSv2 SYMLINK does not handle having the pathname argument arrive in a separate buffer. The decoder could be fixed, but this is simpler and RDMA_NOMSG can be used in a variety of other situations. Ensure that the Linux client continues to use "RDMA_MSG + read list" when sending large NFSv3 SYMLINK requests, which is more efficient than using RDMA_NOMSG. Large NFSv4 CREATE(NF4LNK) requests are changed to use "RDMA_MSG + read list" just like NFSv3 (see Section 5 of RFC 5667). Before, these did not work at all. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Currently xprtrdma appends an extra chunk element to the RPC/RDMA read chunk list of each NFSv4 WRITE compound. The extra element contains the final GETATTR operation in the compound. The result is an extra RDMA READ operation to transfer a very short piece of each NFS WRITE compound (typically 16 bytes). This is inefficient. It is also incorrect. The client is sending the trailing GETATTR at the same Position as the preceding WRITE data payload. Whether or not RFC 5667 allows the GETATTR to appear in a read chunk, RFC 5666 requires that these two separate RPC arguments appear at two distinct Positions. It can also be argued that the GETATTR operation is not bulk data, and therefore RFC 5667 forbids its appearance in a read chunk at all. Although RFC 5667 is not precise about when using a read list with NFSv4 COMPOUND is allowed, the intent is that only data arguments not touched by NFS (ie, read and write payloads) are to be sent using RDMA READ or WRITE. The NFS client constructs GETATTR arguments itself, and therefore is required to send the trailing GETATTR operation as additional inline content, not as a data payload. NB: This change is not backwards compatible. Some older servers do not accept inline content following the read list. The Linux NFS server should handle this content correctly as of commit a97c331f ("svcrdma: Handle additional inline content"). Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Currently Linux always offers a reply chunk, even when the reply can be sent inline (ie. is smaller than 1KB). On the client, registering a memory region can be expensive. A server may choose not to use the reply chunk, wasting the cost of the registration. This is a change only for RPC replies smaller than 1KB which the server constructs in the RPC reply send buffer. Because the elements of the reply must be XDR encoded, a copy-free data transfer has no benefit in this case. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Reviewed-by: NSagi Grimberg <sagig@mellanox.com> Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
The client has been setting up a reply chunk for NFS READs that are smaller than the inline threshold. This is not efficient: both the server and client CPUs have to copy the reply's data payload into and out of the memory region that is then transferred via RDMA. Using the write list, the data payload is moved by the device and no extra data copying is necessary. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Reviewed-by: NDevesh Sharma <devesh.sharma@avagotech.com> Reviewed-By: NSagi Grimberg <sagig@mellanox.com> Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
When the size of the RPC message is near the inline threshold (1KB), the client would allow messages to be sent that were a few bytes too large. When marshaling RPC/RDMA requests, ensure the combined size of RPC/RDMA header and RPC header do not exceed the inline threshold. Endpoints typically reject RPC/RDMA messages that exceed the size of their receive buffers. The two server implementations I test with (Linux and Solaris) use receive buffers that are larger than the client’s inline threshold. Thus so far this has been benign, observed only by code inspection. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Reviewed-by: NDevesh Sharma <devesh.sharma@avagotech.com> Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
RDMA_MSGP type calls insert a zero pad in the middle of the RPC message to align the RPC request's data payload to the server's alignment preferences. A server can then "page flip" the payload into place to avoid a data copy in certain circumstances. However: 1. The client has to have a priori knowledge of the server's preferred alignment 2. Requests eligible for RDMA_MSGP are requests that are small enough to have been sent inline, and convey a data payload at the _end_ of the RPC message Today 1. is done with a sysctl, and is a global setting that is copied during mount. Linux does not support CCP to query the server's preferences (RFC 5666, Section 6). A small-ish NFSv3 WRITE might use RDMA_MSGP, but no NFSv4 compound fits bullet 2. Thus the Linux client currently leaves RDMA_MSGP disabled. The Linux server handles RDMA_MSGP, but does not use any special page flipping, so it confers no benefit. Clean up the marshaling code by removing the logic that constructs RDMA_MSGP type calls. This also reduces the maximum send iovec size from four to just two elements. /proc/sys/sunrpc/rdma_inline_write_padding is a kernel API, and thus is left in place. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Untangle the end of rpcrdma_ia_open() by moving DMA MR set-up, which is different for each registration method, to the .ro_open functions. This is refactoring only. No behavior change is expected. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
All HCA providers have an ib_get_dma_mr() verb. Thus rpcrdma_ia_open() will either grab the device's local_dma_key if one is available, or it will call ib_get_dma_mr(). If ib_get_dma_mr() fails, rpcrdma_ia_open() fails and no transport is created. Therefore execution never reaches the ib_reg_phys_mr() call site in rpcrdma_register_internal(), so it can be removed. The remaining logic in rpcrdma_{de}register_internal() is folded into rpcrdma_{alloc,free}_regbuf(). This is clean up only. No behavior change is expected. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Reviewed-by: NDevesh Sharma <devesh.sharma@avagotech.com> Reviewed-By: NSagi Grimberg <sagig@mellanox.com> Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
PHYSICAL memory registration uses a single rkey for all of the client's memory, thus is insecure. It is still useful in some cases for testing. Retain the ability to select PHYSICAL memory registration capability via /proc/sys/sunrpc/rdma_memreg_strategy, but don't fall back to it if the HCA does not support FRWR or FMR. This means amso1100 no longer works out of the box with NFS/RDMA. When using amso1100 HCAs, set the memreg_strategy sysctl to 6 before performing NFS/RDMA mounts. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
The point of larger rsize and wsize is to reduce the per-byte cost of memory registration and deregistration. Modern HCAs can typically handle a megabyte or more with a single registration operation. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Reviewed-by: NDevesh Sharma <devesh.sharma@avagotech.com> Reviewed-By: NSagi Grimberg <sagig@mellanox.com> Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
In particular, recognize when an IPv6 connection is bound. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
- 01 8月, 2015 1 次提交
-
-
由 WANG Cong 提交于
When we share an action within a filter, the bind refcnt should increase, therefore we should not call tcf_hash_release(). Fixes: 1a29321e ("net_sched: act: Dont increment refcnt on replace") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NCong Wang <cwang@twopensource.com> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 31 7月, 2015 2 次提交
-
-
由 Sowmini Varadhan 提交于
The newsk returned by sk_clone_lock should hold a get_net() reference if, and only if, the parent is not a kernel socket (making this similar to sk_alloc()). E.g,. for the SYN_RECV path, tcp_v4_syn_recv_sock->..inet_csk_clone_lock sets up the syn_recv newsk from sk_clone_lock. When the parent (listen) socket is a kernel socket (defined in sk_alloc() as having sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt and should not hold a get_net() reference. Fixes: 26abe143 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") Acked-by: NEric Dumazet <edumazet@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
Since commit 55334a5d ("net_sched: act: refuse to remove bound action outside"), we end up with a wrong reference count for a tc action. Test case 1: FOO="1,6 0 0 4294967295," BAR="1,6 0 0 4294967294," tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 \ action bpf bytecode "$FOO" tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe index 1 ref 1 bind 1 tc actions replace action bpf bytecode "$BAR" index 1 tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe index 1 ref 2 bind 1 tc actions replace action bpf bytecode "$FOO" index 1 tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe index 1 ref 3 bind 1 Test case 2: FOO="1,6 0 0 4294967295," tc filter add dev foo parent 1: bpf bytecode "$FOO" flowid 1:1 action ok tc actions show action gact action order 0: gact action pass random type none pass val 0 index 1 ref 1 bind 1 tc actions add action drop index 1 RTNETLINK answers: File exists [...] tc actions show action gact action order 0: gact action pass random type none pass val 0 index 1 ref 2 bind 1 tc actions add action drop index 1 RTNETLINK answers: File exists [...] tc actions show action gact action order 0: gact action pass random type none pass val 0 index 1 ref 3 bind 1 What happens is that in tcf_hash_check(), we check tcf_common for a given index and increase tcfc_refcnt and conditionally tcfc_bindcnt when we've found an existing action. Now there are the following cases: 1) We do a late binding of an action. In that case, we leave the tcfc_refcnt/tcfc_bindcnt increased and are done with the ->init() handler. This is correctly handeled. 2) We replace the given action, or we try to add one without replacing and find out that the action at a specific index already exists (thus, we go out with error in that case). In case of 2), we have to undo the reference count increase from tcf_hash_check() in the tcf_hash_check() function. Currently, we fail to do so because of the 'tcfc_bindcnt > 0' check which bails out early with an -EPERM error. Now, while commit 55334a5d prevents 'tc actions del action ...' on an already classifier-bound action to drop the reference count (which could then become negative, wrap around etc), this restriction only accounts for invocations outside a specific action's ->init() handler. One possible solution would be to add a flag thus we possibly trigger the -EPERM ony in situations where it is indeed relevant. After the patch, above test cases have correct reference count again. Fixes: 55334a5d ("net_sched: act: refuse to remove bound action outside") Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NCong Wang <cwang@twopensource.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 7月, 2015 5 次提交
-
-
由 Daniel Borkmann 提交于
We currently trigger multiple memory leaks when replacing bpf actions, besides others: comm "tc", pid 1909, jiffies 4294851310 (age 1602.796s) hex dump (first 32 bytes): 01 00 00 00 03 00 00 00 00 00 00 00 00 00 00 00 ................ 18 b0 98 6d 00 88 ff ff 00 00 00 00 00 00 00 00 ...m............ backtrace: [<ffffffff817e623e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff8120a22d>] __vmalloc_node_range+0x1bd/0x2c0 [<ffffffff8120a37a>] __vmalloc+0x4a/0x50 [<ffffffff811a8d0a>] bpf_prog_alloc+0x3a/0xa0 [<ffffffff816c0684>] bpf_prog_create+0x44/0xa0 [<ffffffffa09ba4eb>] tcf_bpf_init+0x28b/0x3c0 [act_bpf] [<ffffffff816d7001>] tcf_action_init_1+0x191/0x1b0 [<ffffffff816d70a2>] tcf_action_init+0x82/0xf0 [<ffffffff816d4d12>] tcf_exts_validate+0xb2/0xc0 [<ffffffffa09b5838>] cls_bpf_modify_existing+0x98/0x340 [cls_bpf] [<ffffffffa09b5cd6>] cls_bpf_change+0x1a6/0x274 [cls_bpf] [<ffffffff816d56e5>] tc_ctl_tfilter+0x335/0x910 [<ffffffff816b9145>] rtnetlink_rcv_msg+0x95/0x240 [<ffffffff816df34f>] netlink_rcv_skb+0xaf/0xc0 [<ffffffff816b909e>] rtnetlink_rcv+0x2e/0x40 [<ffffffff816deaaf>] netlink_unicast+0xef/0x1b0 Issue is that the old content from tcf_bpf is allocated and needs to be released when we replace it. We seem to do that since the beginning of act_bpf on the filter and insns, later on the name as well. Example test case, after patch: # FOO="1,6 0 0 4294967295," # BAR="1,6 0 0 4294967294," # tc actions add action bpf bytecode "$FOO" index 2 # tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe index 2 ref 1 bind 0 # tc actions replace action bpf bytecode "$BAR" index 2 # tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967294' default-action pipe index 2 ref 1 bind 0 # tc actions replace action bpf bytecode "$FOO" index 2 # tc actions show action bpf action order 0: bpf bytecode '1,6 0 0 4294967295' default-action pipe index 2 ref 1 bind 0 # tc actions del action bpf index 2 [...] # echo "scan" > /sys/kernel/debug/kmemleak # cat /sys/kernel/debug/kmemleak | grep "comm \"tc\"" | wc -l 0 Fixes: d23b8ad8 ("tc: add BPF based action") Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
This patch is the IPv6 equivalent of commit 6c8b4e3f ("arp: flush arp cache on IFF_NOARP change") Without it, we keep buggy neighbours in the cache, with destination MAC address equal to our own MAC address. Tested: tcpdump -i eth0 -s 0 ip6 -n -e & ip link set dev eth0 arp off ping6 remote // sends buggy frames ip link set dev eth0 arp on ping6 remote // should work once kernel is patched Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NMario Fanelli <mariofanelli@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nikolay Aleksandrov 提交于
Since mdb states were introduced when deleting an entry the state was left as it was set in the delete request from the user which leads to the following output when doing a monitor (for example): $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent (monitor) dev br0 port eth3 grp 239.0.0.1 permanent $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent (monitor) dev br0 port eth3 grp 239.0.0.1 temp ^^^ Note the "temp" state in the delete notification which is wrong since the entry was permanent, the state in a delete is always reported as "temp" regardless of the real state of the entry. After this patch: $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent (monitor) dev br0 port eth3 grp 239.0.0.1 permanent $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 permanent (monitor) dev br0 port eth3 grp 239.0.0.1 permanent There's one important note to make here that the state is actually not matched when doing a delete, so one can delete a permanent entry by stating "temp" in the end of the command, I've chosen this fix in order not to break user-space tools which rely on this (incorrect) behaviour. So to give an example after this patch and using the wrong state: $ bridge mdb add dev br0 port eth3 grp 239.0.0.1 permanent (monitor) dev br0 port eth3 grp 239.0.0.1 permanent $ bridge mdb del dev br0 port eth3 grp 239.0.0.1 temp (monitor) dev br0 port eth3 grp 239.0.0.1 permanent Note the state of the entry that got deleted is correct in the notification. Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Fixes: ccb1c31a ("bridge: add flags to distinguish permanent mdb entires") Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Satish Ashok 提交于
When fast leave is configured on a bridge port and an IGMP leave is received for a group, the group is not deleted immediately if there is a router detected or if multicast querier is configured. Ideally the group should be deleted immediately when fast leave is configured. Signed-off-by: NSatish Ashok <sashok@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Toshiaki Makita 提交于
There are several devices that can receive vlan tagged packets with CHECKSUM_PARTIAL like tap, possibly veth and xennet. When (multiple) vlan tagged packets with CHECKSUM_PARTIAL are forwarded by bridge to a device with the IP_CSUM feature, they end up with checksum error because before entering bridge, the network header is set to ETH_HLEN (not including vlan header length) in __netif_receive_skb_core(), get_rps_cpu(), or drivers' rx functions, and nobody fixes the pointer later. Since the network header is exepected to be ETH_HLEN in flow-dissection and hash-calculation in RPS in rx path, and since the header pointer fix is needed only in tx path, set the appropriate network header on forwarding packets. Signed-off-by: NToshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 7月, 2015 4 次提交
-
-
由 Alexander Drozdov 提交于
tpacket_fill_skb() can return a negative value (-errno) which is stored in tp_len variable. In that case the following condition will be (but shouldn't be) true: tp_len > dev->mtu + dev->hard_header_len as dev->mtu and dev->hard_header_len are both unsigned. That may lead to just returning an incorrect EMSGSIZE errno to the user. Fixes: 52f1454f ("packet: allow to transmit +4 byte in TX_RING slot for VLAN case") Signed-off-by: NAlexander Drozdov <al.drozdov@gmail.com> Acked-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
When arp is off on a device, and ioctl(SIOCGARP) is queried, a buggy answer is given with MAC address of the device, instead of the mac address of the destination/gateway. We filter out NUD_NOARP neighbours for /proc/net/arp, we must do the same for SIOCGARP ioctl. Tested: lpaa23:~# ./arp 10.246.7.190 MAC=00:01:e8:22:cb:1d // correct answer lpaa23:~# ip link set dev eth0 arp off lpaa23:~# cat /proc/net/arp # check arp table is now 'empty' IP address HW type Flags HW address Mask Device lpaa23:~# ./arp 10.246.7.190 MAC=00:1a:11:c3:0d:7f // buggy answer before patch (this is eth0 mac) After patch : lpaa23:~# ip link set dev eth0 arp off lpaa23:~# ./arp 10.246.7.190 ioctl(SIOCGARP) failed: No such device or address Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NVytautas Valancius <valas@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ward 提交于
This notification causes the FIB to be updated, which is not needed because the address already exists, and more importantly it may undo intentional changes that were made to the FIB after the address was originally added. (As a point of comparison, when an address becomes deprecated because its preferred lifetime expired, a notification on this chain is not generated.) The motivation for this commit is fixing an incompatibility between DHCP clients which set and update the address lifetime according to the lease, and a commercial VPN client which replaces kernel routes in a way that outbound traffic is sent only through the tunnel (and disconnects if any further route changes are detected via netlink). Signed-off-by: NDavid Ward <david.ward@ll.mit.edu> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nikolay Aleksandrov 提交于
These should be handled only by the respective STP which is in control. They become problematic for devices with limited resources with many ports because the hold_timer is per port and fires each second and the hello timer fires each 2 seconds even though it's global. While in user-space STP mode these timers are completely unnecessary so it's better to keep them off. Also ensure that when the bridge is up these timers are started only when running with kernel STP. Signed-off-by: NSatish Ashok <sashok@cumulusnetworks.com> Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 7月, 2015 3 次提交
-
-
由 Lars Westerhoff 提交于
When binding a PF_PACKET socket, the use count of the bound interface is always increased with dev_hold in dev_get_by_{index,name}. However, when rebound with the same protocol and device as in the previous bind the use count of the interface was not decreased. Ultimately, this caused the deletion of the interface to fail with the following message: unregister_netdevice: waiting for dummy0 to become free. Usage count = 1 This patch moves the dev_put out of the conditional part that was only executed when either the protocol or device changed on a bind. Fixes: 902fefb8 ('packet: improve socket create/bind latency in some cases') Signed-off-by: NLars Westerhoff <lars.westerhoff@newtec.eu> Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Reviewed-by: NDaniel Borkmann <dborkman@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Trond Myklebust 提交于
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-
由 Alexander Duyck 提交于
It was reported that update_suffix was taking a long time on systems where a large number of leaves were attached to a single node. As it turns out fib_table_flush was calling update_suffix for each leaf that didn't have all of the aliases stripped from it. As a result, on this large node removing one leaf would result in us calling update_suffix for every other leaf on the node. The fix is to just remove the calls to leaf_pull_suffix since they are redundant as we already have a call in resize that will go through and update the suffix length for the node before we exit out of fib_table_flush or fib_table_flush_external. Reported-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com> Tested-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 27 7月, 2015 7 次提交
-
-
由 NeilBrown 提交于
The networking layer does not reliably report the distinction between a non-block write failing because: 1/ the queue is too full already and 2/ a memory allocation attempt failed. The distinction is important because in the first case it is appropriate to retry as soon as the socket reports that it is writable, and in the second case a small delay is required as the socket will most likely report as writable but kmalloc could still fail. sk_stream_wait_memory() exhibits this distinction nicely, setting 'vm_wait' if a small wait is needed. However in the non-blocking case it always returns -EAGAIN no matter the cause of the failure. This -EAGAIN call get all the way to sunrpc. The sunrpc layer expects EAGAIN to indicate the first cause, and ENOBUFS to indicate the second. Various documentation suggests that this is not unreasonable, but does not guarantee the desired error codes. The result of getting -EAGAIN when -ENOBUFS is expected is that the send is tried again in a tight loop and soft lockups are reported. so: add tests after calls to xs_sendpages() to translate -EAGAIN into -ENOBUFS if the socket is writable. This cannot happen inside xs_sendpages() as the test for "is socket writable" is different between TCP and UDP. With this change, the tight loop retrying xs_sendpages() becomes a loop which only retries every 250ms, and so will not trigger a soft-lockup warning. It is possible that the write did fail because the queue was too full and by the time xs_sendpages() completed, the queue was writable again. In this case an extra 250ms delay is inserted that isn't really needed. This circumstance suggests a degree of congestion so a delay is not necessarily a bad thing, and it can only cause a single 250ms delay, not a series of them. Signed-off-by: NNeilBrown <neilb@suse.com> Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-
由 Sabrina Dubroca 提交于
Currently, tcp_recvmsg enters a busy loop in sk_wait_data if called with flags = MSG_WAITALL | MSG_PEEK. sk_wait_data waits for sk_receive_queue not empty, but in this case, the receive queue is not empty, but does not contain any skb that we can use. Add a "last skb seen on receive queue" argument to sk_wait_data, so that it sleeps until the receive queue has new skbs. Link: https://bugzilla.kernel.org/show_bug.cgi?id=99461 Link: https://sourceware.org/bugzilla/show_bug.cgi?id=18493 Link: https://bugzilla.redhat.com/show_bug.cgi?id=1205258Reported-by: NEnrico Scholz <rh-bugzilla@ensc.de> Reported-by: NDan Searle <dan@censornet.com> Signed-off-by: NSabrina Dubroca <sd@queasysnail.net> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nikolay Aleksandrov 提交于
We can simply remove the INET_FRAG_EVICTED flag to avoid all the flags race conditions with the evictor and use a participation test for the evictor list, when we're at that point (after inet_frag_kill) in the timer there're 2 possible cases: 1. The evictor added the entry to its evictor list while the timer was waiting for the chainlock or 2. The timer unchained the entry and the evictor won't see it In both cases we should be able to see list_evictor correctly due to the sync on the chainlock. Joint work with Florian Westphal. Tested-by: NFrank Schreuder <fschreuder@transip.nl> Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
Frank reports 'NMI watchdog: BUG: soft lockup' errors when load is high. Instead of (potentially) unbounded restarts of the eviction process, just skip to the next entry. One caveat is that, when a netns is exiting, a timer may still be running by the time inet_evict_bucket returns. We use the frag memory accounting to wait for outstanding timers, so that when we free the percpu counter we can be sure no running timer will trip over it. Reported-and-tested-by: NFrank Schreuder <fschreuder@transip.nl> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
Followup patch will call it after inet_frag_queue was freed, so q->net doesn't work anymore (but netf = q->net; free(q); mem_limit(netf) would). Tested-by: NFrank Schreuder <fschreuder@transip.nl> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Florian Westphal 提交于
commit 65ba1f1e ("inet: frags: fix a race between inet_evict_bucket and inet_frag_kill") describes the bug, but the fix doesn't work reliably. Problem is that ->flags member can be set on other cpu without chainlock being held by that task, i.e. the RMW-Cycle can clear INET_FRAG_EVICTED bit after we put the element on the evictor private list. We can crash when walking the 'private' evictor list since an element can be deleted from list underneath the evictor. Join work with Nikolay Alexandrov. Fixes: b13d3cbf ("inet: frag: move eviction of queues to work queue") Reported-by: NJohan Schuijt <johan@transip.nl> Tested-by: NFrank Schreuder <fschreuder@transip.nl> Signed-off-by: NNikolay Alexandrov <nikolay@cumulusnetworks.com> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Borkmann 提交于
Back then when we added support for SCTP_SNDINFO/SCTP_RCVINFO from RFC6458 5.3.4/5.3.5, we decided to add a deprecation warning for the (as per RFC deprecated) SCTP_SNDRCV via commit bbbea41d ("net: sctp: deprecate rfc6458, 5.3.2. SCTP_SNDRCV support"), see [1]. Imho, it was not a good idea, and we should just revert that message for a couple of reasons: 1) It's uapi and therefore set in stone forever. 2) To be able to run on older and newer kernels, an SCTP application would need to probe for both, SCTP_SNDRCV, but also SCTP_SNDINFO/ SCTP_RCVINFO support, so that on older kernels, it can make use of SCTP_SNDRCV, and on newer kernels SCTP_SNDINFO/SCTP_RCVINFO. In my (limited) experience, a lot of SCTP appliances are migrating to newer kernels only ve(ee)ry slowly. 3) Some people don't have the chance to change their applications, f.e. due to proprietary legacy stuff. So, they'll hit this warning in fast path and are stuck with older kernels. But i.e. due to point 1) I really fail to see the benefit of a warning. So just revert that for now, the issue was reported up Jamal. [1] http://thread.gmane.org/gmane.linux.network/321960/Reported-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Cc: Michael Tuexen <tuexen@fh-muenster.de> Acked-by: NJamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-