- 20 9月, 2016 22 次提交
-
-
由 Chuck Lever 提交于
Send an RDMA-CM private message on connect, and look for one during a connection-established event. Both sides can communicate their various implementation limits. Implementations that don't support this sideband protocol ignore it. Once the client knows the server's inline threshold maxima, it can adjust the use of Reply chunks, and eliminate most use of Position Zero Read chunks. Moderately-sized I/O can be done using a pure inline RDMA Send instead of RDMA operations that require memory registration. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Clean up: The fields in the recv_wr do not vary. There is no need to initialize them before each ib_post_recv(). This removes a large-ish data structure from the stack. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Clean up: Most of the fields in each send_wr do not vary. There is no need to initialize them before each ib_post_send(). This removes a large-ish data structure from the stack. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Clean up. Since commit fc664485 ("xprtrdma: Split the completion queue"), rpcrdma_ep_post_recv() no longer uses the "ep" argument. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Clean up. The "ia" argument is no longer used. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Currently, each regbuf is allocated and DMA mapped at the same time. This is done during transport creation. When a device driver is unloaded, every DMA-mapped buffer in use by a transport has to be unmapped, and then remapped to the new device if the driver is loaded again. Remapping will have to be done _after_ the connect worker has set up the new device. But there's an ordering problem: call_allocate, which invokes xprt_rdma_allocate which calls rpcrdma_alloc_regbuf to allocate Send buffers, happens _before_ the connect worker can run to set up the new device. Instead, at transport creation, allocate each buffer, but leave it unmapped. Once the RPC carries these buffers into ->send_request, by which time a transport connection should have been established, check to see that the RPC's buffers have been DMA mapped. If not, map them there. When device driver unplug support is added, it will simply unmap all the transport's regbufs, but it doesn't have to deallocate the underlying memory. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
The use of DMA_BIDIRECTIONAL is discouraged by DMA-API.txt. Fortunately, xprtrdma now knows which direction I/O is going as soon as it allocates each regbuf. The RPC Call and Reply buffers are no longer the same regbuf. They can each be labeled correctly now. The RPC Reply buffer is never part of either a Send or Receive WR, but it can be part of Reply chunk, which is mapped and registered via ->ro_map . So it is not DMA mapped when it is allocated (DMA_NONE), to avoid a double- mapping. Since Receive buffers are no longer DMA_BIDIRECTIONAL and their contents are never modified by the host CPU, DMA-API-HOWTO.txt suggests that a DMA sync before posting each buffer should be unnecessary. (See my_card_interrupt_handler). Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Commit 94931746 ("xprtrdma: Limit number of RDMA segments in RPC-over-RDMA headers") capped the number of chunks that may appear in RPC-over-RDMA headers. The maximum header size can be estimated and fixed to avoid allocating buffer space that is never used. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
RPC-over-RDMA needs to separate its RPC call and reply buffers. o When an RPC Call is sent, rq_snd_buf is DMA mapped for an RDMA Send operation using DMA_TO_DEVICE o If the client expects a large RPC reply, it DMA maps rq_rcv_buf as part of a Reply chunk using DMA_FROM_DEVICE The two mappings are for data movement in opposite directions. DMA-API.txt suggests that if these mappings share a DMA cacheline, bad things can happen. This could occur in the final bytes of rq_snd_buf and the first bytes of rq_rcv_buf if the two buffers happen to share a DMA cacheline. On x86_64 the cacheline size is typically 8 bytes, and RPC call messages are usually much smaller than the send buffer, so this hasn't been a noticeable problem. But the DMA cacheline size can be larger on other platforms. Also, often rq_rcv_buf starts most of the way into a page, thus an additional RDMA segment is needed to map and register the end of that buffer. Try to avoid that scenario to reduce the cost of registering and invalidating Reply chunks. Instead of carrying a single regbuf that covers both rq_snd_buf and rq_rcv_buf, each struct rpcrdma_req now carries one regbuf for rq_snd_buf and one regbuf for rq_rcv_buf. Some incidental changes worth noting: - To clear out some spaghetti, refactor xprt_rdma_allocate. - The value stored in rg_size is the same as the value stored in the iov.length field, so eliminate rg_size Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Currently there's a hidden and indirect mechanism for finding the rpcrdma_req that goes with an rpc_rqst. It depends on getting from the rq_buffer pointer in struct rpc_rqst to the struct rpcrdma_regbuf that controls that buffer, and then to the struct rpcrdma_req it goes with. This was done back in the day to avoid the need to add a per-rqst pointer or to alter the buf_free API when support for RPC-over-RDMA was introduced. I'm about to change the way regbuf's work to support larger inline thresholds. Now is a good time to replace this indirect mechanism with something that is more straightforward. I guess this should be considered a clean up. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
For xprtrdma, the RPC Call and Reply buffers are involved in real I/O operations. To start with, the DMA direction of the I/O for a Call is opposite that of a Reply. In the current arrangement, the Reply buffer address is on a four-byte alignment just past the call buffer. Would be friendlier on some platforms if that was at a DMA cache alignment instead. Because the current arrangement allocates a single memory region which contains both buffers, the RPC Reply buffer often contains a page boundary in it when the Call buffer is large enough (which is frequent). It would be a little nicer for setting up DMA operations (and possible registration of the Reply buffer) if the two buffers were separated, well-aligned, and contained as few page boundaries as possible. Now, I could just pad out the single memory region used for the pair of buffers. But frequently that would mean a lot of unused space to ensure the Reply buffer did not have a page boundary. Add a separate pointer to rpc_rqst that points right to the RPC Reply buffer. This makes no difference to xprtsock, but it will help xprtrdma in subsequent patches. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
xprtrdma needs to allocate the Call and Reply buffers separately. TBH, the reliance on using a single buffer for the pair of XDR buffers is transport implementation-specific. Instead of passing just the rq_buffer into the buf_free method, pass the task structure and let buf_free take care of freeing both XDR buffers at once. There's a micro-optimization here. In the common case, both xprt_release and the transport's buf_free method were checking if rq_buffer was NULL. Now the check is done only once per RPC. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
xprtrdma needs to allocate the Call and Reply buffers separately. TBH, the reliance on using a single buffer for the pair of XDR buffers is transport implementation-specific. Transports that want to allocate separate Call and Reply buffers will ignore the "size" argument anyway. Don't bother passing it. The buf_alloc method can't return two pointers. Instead, make the method's return value an error code, and set the rq_buffer pointer in the method itself. This gives call_allocate an opportunity to terminate an RPC instead of looping forever when a permanent problem occurs. If a request is just bogus, or the transport is in a state where it can't allocate resources for any request, there needs to be a way to kill the RPC right there and not loop. This immediately fixes a rare problem in the backchannel send path, which loops if the server happens to send a CB request whose call+reply size is larger than a page (which it shouldn't do yet). One more issue: looks like xprt_inject_disconnect was incorrectly placed in the failure path in call_allocate. It needs to be in the success path, as it is for other call-sites. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Clean up: there is some XDR initialization logic that is common to the forward channel and backchannel. Move it to an XDR header so it can be shared. rpc_rqst::rq_buffer points to a buffer containing big-endian data. Update its annotation as part of the clean up. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Clean up: r_xprt is already available everywhere these macros are invoked, so just dereference that directly. RPCRDMA_INLINE_PAD_VALUE is no longer used, so it can simply be removed. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Andy Adamson 提交于
Use a setup function to call into the NFS layer to test an rpc_xprt for session trunking so as to not leak the rpc_xprt_switch into the nfs layer. Search for the address in the rpc_xprt_switch first so as not to put an unnecessary EXCHANGE_ID on the wire. Signed-off-by: NAndy Adamson <andros@netapp.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Andy Adamson 提交于
Signed-off-by: NAndy Adamson <andros@netapp.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Andy Adamson 提交于
Give the NFS layer access to the rpc_xprt_switch_add_xprt function Signed-off-by: NAndy Adamson <andros@netapp.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Andy Adamson 提交于
Give the NFS layer access to the xprt_switch_put function Signed-off-by: NAndy Adamson <andros@netapp.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Andy Adamson 提交于
rpc_task_set_client is only called from rpc_run_task after rpc_new_task and rpc_task_release_client is not needed as the task is new. When called from rpc_new_task, rpc_task_set_client also removed the assigned rpc_xprt which is not desired. Signed-off-by: NAndy Adamson <andros@netapp.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Trond Myklebust 提交于
Clean up. Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Amitoj Kaur Chawla 提交于
The variable `err` is not used anywhere and just returns the predefined value `0` at the end of the function. Hence, remove the variable and return 0 explicitly. Signed-off-by: NAmitoj Kaur Chawla <amitoj1606@gmail.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
- 13 9月, 2016 1 次提交
-
-
由 Chuck Lever 提交于
rsc_lookup steals the passed-in memory to avoid doing an allocation of its own, so we can't just pass in a pointer to memory that someone else is using. If we really want to avoid allocation there then maybe we should preallocate somwhere, or reference count these handles. For now we should revert. On occasion I see this on my server: kernel: kernel BUG at /home/cel/src/linux/linux-2.6/mm/slub.c:3851! kernel: invalid opcode: 0000 [#1] SMP kernel: Modules linked in: cts rpcsec_gss_krb5 sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd btrfs xor iTCO_wdt iTCO_vendor_support raid6_pq pcspkr i2c_i801 i2c_smbus lpc_ich mfd_core mei_me sg mei shpchp wmi ioatdma ipmi_si ipmi_msghandler acpi_pad acpi_power_meter rpcrdma ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables xfs libcrc32c mlx4_ib mlx4_en ib_core sr_mod cdrom sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel igb mlx4_core ahci libahci libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod kernel: CPU: 7 PID: 145 Comm: kworker/7:2 Not tainted 4.8.0-rc4-00006-g9d06b0b #15 kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015 kernel: Workqueue: events do_cache_clean [sunrpc] kernel: task: ffff8808541d8000 task.stack: ffff880854344000 kernel: RIP: 0010:[<ffffffff811e7075>] [<ffffffff811e7075>] kfree+0x155/0x180 kernel: RSP: 0018:ffff880854347d70 EFLAGS: 00010246 kernel: RAX: ffffea0020fe7660 RBX: ffff88083f9db064 RCX: 146ff0f9d5ec5600 kernel: RDX: 000077ff80000000 RSI: ffff880853f01500 RDI: ffff88083f9db064 kernel: RBP: ffff880854347d88 R08: ffff8808594ee000 R09: ffff88087fdd8780 kernel: R10: 0000000000000000 R11: ffffea0020fe76c0 R12: ffff880853f01500 kernel: R13: ffffffffa013cf76 R14: ffffffffa013cff0 R15: ffffffffa04253a0 kernel: FS: 0000000000000000(0000) GS:ffff88087fdc0000(0000) knlGS:0000000000000000 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 kernel: CR2: 00007fed60b020c3 CR3: 0000000001c06000 CR4: 00000000001406e0 kernel: Stack: kernel: ffff8808589f2f00 ffff880853f01500 0000000000000001 ffff880854347da0 kernel: ffffffffa013cf76 ffff8808589f2f00 ffff880854347db8 ffffffffa013d006 kernel: ffff8808589f2f20 ffff880854347e00 ffffffffa0406f60 0000000057c7044f kernel: Call Trace: kernel: [<ffffffffa013cf76>] rsc_free+0x16/0x90 [auth_rpcgss] kernel: [<ffffffffa013d006>] rsc_put+0x16/0x30 [auth_rpcgss] kernel: [<ffffffffa0406f60>] cache_clean+0x2e0/0x300 [sunrpc] kernel: [<ffffffffa04073ee>] do_cache_clean+0xe/0x70 [sunrpc] kernel: [<ffffffff8109a70f>] process_one_work+0x1ff/0x3b0 kernel: [<ffffffff8109b15c>] worker_thread+0x2bc/0x4a0 kernel: [<ffffffff8109aea0>] ? rescuer_thread+0x3a0/0x3a0 kernel: [<ffffffff810a0ba4>] kthread+0xe4/0xf0 kernel: [<ffffffff8169c47f>] ret_from_fork+0x1f/0x40 kernel: [<ffffffff810a0ac0>] ? kthread_stop+0x110/0x110 kernel: Code: f7 ff ff eb 3b 65 8b 05 da 30 e2 7e 89 c0 48 0f a3 05 a0 38 b8 00 0f 92 c0 84 c0 0f 85 d1 fe ff ff 0f 1f 44 00 00 e9 f5 fe ff ff <0f> 0b 49 8b 03 31 f6 f6 c4 40 0f 85 62 ff ff ff e9 61 ff ff ff kernel: RIP [<ffffffff811e7075>] kfree+0x155/0x180 kernel: RSP <ffff880854347d70> kernel: ---[ end trace 3fdec044969def26 ]--- It seems to be most common after a server reboot where a client has been using a Kerberos mount, and reconnects to continue its workload. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
- 10 9月, 2016 1 次提交
-
-
由 Marcelo Ricardo Leitner 提交于
Previously, without GSO, it was easy to identify it: if the chunk didn't fit and there was no data chunk in the packet yet, we could fragment at IP level. So if there was an auth chunk and we were bundling a big data chunk, it would fragment regardless of the size of the auth chunk. This also works for the context of PMTU reductions. But with GSO, we cannot distinguish such PMTU events anymore, as the packet is allowed to exceed PMTU. So we need another check: to ensure that the chunk that we are adding, actually fits the current PMTU. If it doesn't, trigger a flush and let it be fragmented at IP level in the next round. Signed-off-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 9月, 2016 2 次提交
-
-
由 Artem Germanov 提交于
Commit 76174004 (tcp: do not slow start when cwnd equals ssthresh ) introduced regression in TCP YeAH. Using 100ms delay 1% loss virtual ethernet link kernel 4.2 shows bandwidth ~500KB/s for single TCP connection and kernel 4.3 and above (including 4.8-rc4) shows bandwidth ~100KB/s. That is caused by stalled cwnd when cwnd equals ssthresh. This patch fixes it by proper increasing cwnd in this case. Signed-off-by: NArtem Germanov <agermanov@anchorfree.com> Acked-by: NDmitry Adamushko <d.adamushko@anchorfree.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
When DATA and/or FIN are carried in a SYN/ACK message or SYN message, we append an skb in socket receive queue, but we forget to call sk_forced_mem_schedule(). Effect is that the socket has a negative sk->sk_forward_alloc as long as the message is not read by the application. Josh Hunt fixed a similar issue in commit d22e1537 ("tcp: fix tcp fin memory accounting") Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path") Signed-off-by: NEric Dumazet <edumazet@google.com> Reviewed-by: NJosh Hunt <johunt@akamai.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 07 9月, 2016 5 次提交
-
-
由 Wei Yongjun 提交于
In general, when DAD detected IPv6 duplicate address, ifp->state will be set to INET6_IFADDR_STATE_ERRDAD and DAD is stopped by a delayed work, the call tree should be like this: ndisc_recv_ns -> addrconf_dad_failure <- missing ifp put -> addrconf_mod_dad_work -> schedule addrconf_dad_work() -> addrconf_dad_stop() <- missing ifp hold before call it addrconf_dad_failure() called with ifp refcont holding but not put. addrconf_dad_work() call addrconf_dad_stop() without extra holding refcount. This will not cause any issue normally. But the race between addrconf_dad_failure() and addrconf_dad_work() may cause ifp refcount leak and netdevice can not be unregister, dmesg show the following messages: IPv6: eth0: IPv6 duplicate address fe80::XX:XXXX:XXXX:XX detected! ... unregister_netdevice: waiting for eth0 to become free. Usage count = 1 Cc: stable@vger.kernel.org Fixes: c15b1cca ("ipv6: move DAD and addrconf_verify processing to workqueue") Signed-off-by: NWei Yongjun <weiyongjun1@huawei.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Mark Tomlinson 提交于
When deleting an IP address from an interface, there is a clean-up of routes which refer to this local address. However, there was no check to see that the VRF matched. This meant that deletion wasn't confined to the VRF it should have been. To solve this, a new field has been added to fib_info to hold a table id. When removing fib entries corresponding to a local ip address, this table id is also used in the comparison. The table id is populated when the fib_info is created. This was already done in some places, but not in ip_rt_ioctl(). This has now been fixed. Fixes: 021dd3b8 ("net: Add routes to the table associated with the device") Acked-by: NDavid Ahern <dsa@cumulusnetworks.com> Tested-by: NDavid Ahern <dsa@cumulusnetworks.com> Signed-off-by: NMark Tomlinson <mark.tomlinson@alliedtelesis.co.nz> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Chuck Lever 提交于
An RPC can terminate before its reply arrives, if a credential problem or a soft timeout occurs. After this happens, xprtrdma reports it is out of Receive buffers. A Receive buffer is posted before each RPC is sent, and returned to the buffer pool when a reply is received. If no reply is received for an RPC, that Receive buffer remains posted. But xprtrdma tries to post another when the next RPC is sent. If this happens a few dozen times, there are no receive buffers left to be posted at send time. I don't see a way for a transport connection to recover at that point, and it will spit warnings and unnecessarily delay RPCs on occasion for its remaining lifetime. Commit 1e465fd4 ("xprtrdma: Replace send and receive arrays") removed a little bit of logic to detect this case and not provide a Receive buffer so no more buffers are posted, and then transport operation continues correctly. We didn't understand what that logic did, and it wasn't commented, so it was removed as part of the overhaul to support backchannel requests. Restore it, but be wary of the need to keep extra Receives posted to deal with backchannel requests. Fixes: 1e465fd4 ("xprtrdma: Replace send and receive arrays") Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Reviewed-by: NAnna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-
由 Chuck Lever 提交于
Receive buffer exhaustion, if it were to actually occur, would be catastrophic. However, when there are no reply buffers to post, that means all of them have already been posted and are waiting for incoming replies. By design, there can never be more RPCs in flight than there are available receive buffers. A receive buffer can be left posted after an RPC exits without a received reply; say, due to a credential problem or a soft timeout. This does not result in fewer posted receive buffers than there are pending RPCs, and there is already logic in xprtrdma to deal appropriately with this case. It also looks like the "+ 2" that was removed was accidentally accommodating the number of extra receive buffers needed for receiving backchannel requests. That will need to be addressed by another patch. Fixes: 3d4cf35b ("xprtrdma: Reply buffer exhaustion can be...") Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Reviewed-by: NAnna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-
由 Dave Jones 提交于
Neither the failure or success paths of ping_v6_sendmsg release the dst it acquires. This leads to a flood of warnings from "net/core/dst.c:288 dst_release" on older kernels that don't have 8bf4ada2 backported. That patch optimistically hoped this had been fixed post 3.10, but it seems at least one case wasn't, where I've seen this triggered a lot from machines doing unprivileged icmp sockets. Cc: Martin Lau <kafai@fb.com> Signed-off-by: NDave Jones <davej@codemonkey.org.uk> Acked-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 9月, 2016 3 次提交
-
-
由 Linus Torvalds 提交于
Right now we use the 'readlock' both for protecting some of the af_unix IO path and for making the bind be single-threaded. The two are independent, but using the same lock makes for a nasty deadlock due to ordering with regards to filesystem locking. The bind locking would want to nest outside the VSF pathname locking, but the IO locking wants to nest inside some of those same locks. We tried to fix this earlier with commit c845acb3 ("af_unix: Fix splice-bind deadlock") which moved the readlock inside the vfs locks, but that caused problems with overlayfs that will then call back into filesystem routines that take the lock in the wrong order anyway. Splitting the locks means that we can go back to having the bind lock be the outermost lock, and we don't have any deadlocks with lock ordering. Acked-by: NRainer Weikusat <rweikusat@cyberadapt.com> Acked-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Linus Torvalds 提交于
This reverts commit c845acb3. It turns out that it just replaces one deadlock with another one: we can still get the wrong lock ordering with the readlock due to overlayfs calling back into the filesystem layer and still taking the vfs locks after the readlock. The proper solution ends up being to just split the readlock into two pieces: the bind lock (taken *outside* the vfs locks) and the IO lock (taken *inside* the filesystem locks). The two locks are independent anyway. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Reviewed-by: NShmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Mahesh Bandewar 提交于
Following few steps will crash kernel - (a) Create bonding master > modprobe bonding miimon=50 (b) Create macvlan bridge on eth2 > ip link add link eth2 dev mvl0 address aa:0:0:0:0:01 \ type macvlan (c) Now try adding eth2 into the bond > echo +eth2 > /sys/class/net/bond0/bonding/slaves <crash> Bonding does lots of things before checking if the device enslaved is busy or not. In this case when the notifier call-chain sends notifications, the bond_netdev_event() assumes that the rx_handler /rx_handler_data is registered while the bond_enslave() hasn't progressed far enough to register rx_handler for the new slave. This patch adds a rx_handler check that can be performed right at the beginning of the enslave code to avoid getting into this situation. Signed-off-by: NMahesh Bandewar <maheshb@google.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 9月, 2016 2 次提交
-
-
由 Paolo Abeni 提交于
The commit f9b2ee71 ("SUNRPC: Move UDP receive data path into a workqueue context"), as a side effect, moved the skb_free_datagram() call outside the scope of the related socket lock, but UDP sockets require such lock to be held for proper memory accounting. Fix it by replacing skb_free_datagram() with skb_free_datagram_locked(). Fixes: f9b2ee71 ("SUNRPC: Move UDP receive data path into a workqueue context") Reported-and-tested-by: NJan Stancek <jstancek@redhat.com> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Cc: stable@vger.kernel.org # 4.4+ Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
-
由 Sabrina Dubroca 提交于
Tunnel deletion is delayed by both a workqueue (l2tp_tunnel_delete -> wq -> l2tp_tunnel_del_work) and RCU (sk_destruct -> RCU -> l2tp_tunnel_destruct). By the time l2tp_tunnel_destruct() runs to destroy the tunnel and finish destroying the socket, the private data reserved via the net_generic mechanism has already been freed, but l2tp_tunnel_destruct() actually uses this data. Make sure tunnel deletion for the netns has completed before returning from l2tp_exit_net() by first flushing the tunnel removal workqueue, and then waiting for RCU callbacks to complete. Fixes: 167eb17e ("l2tp: create tunnel sockets in the right namespace") Signed-off-by: NSabrina Dubroca <sd@queasysnail.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 9月, 2016 4 次提交
-
-
由 Eli Cooper 提交于
Commit 8eb30be0 ("ipv6: Create ip6_tnl_xmit") unsets flowi6_proto in ip4ip6_tnl_xmit() and ip6ip6_tnl_xmit(). Since xfrm_selector_match() relies on this info, IPv6 packets sent by an ip6tunnel cannot be properly selected by their protocols after removing it. This patch puts flowi6_proto back. Cc: stable@vger.kernel.org Fixes: 8eb30be0 ("ipv6: Create ip6_tnl_xmit") Signed-off-by: NEli Cooper <elicooper@gmx.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Gao Feng 提交于
The original codes depend on that the function parameters are evaluated from left to right. But the parameter's evaluation order is not defined in C standard actually. When flow_keys_have_l4(&keys) is invoked before ___skb_get_hash(skb, &keys, hashrnd) with some compilers or environment, the keys passed to flow_keys_have_l4 is not initialized. Fixes: 6db61d79 ("flow_dissector: Ignore flow dissector return value from ___skb_get_hash") Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NGao Feng <fgao@ikuai8.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Neal Cardwell 提交于
Yuchung noticed that on the first TFO server data packet sent after the (TFO) handshake, the server echoed the TCP timestamp value in the SYN/data instead of the timestamp value in the final ACK of the handshake. This problem did not happen on regular opens. The tcp_replace_ts_recent() logic that decides whether to remember an incoming TS value needs tp->rcv_wup to hold the latest receive sequence number that we have ACKed (latest tp->rcv_nxt we have ACKed). This commit fixes this issue by ensuring that a TFO server properly updates tp->rcv_wup to match tp->rcv_nxt at the time it sends a SYN/ACK for the SYN/data. Reported-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path") Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nikolay Aleksandrov 提交于
pskb_may_pull may fail due to various reasons (e.g. alloc failure), but the skb isn't changed/dropped and processing continues so we shouldn't increment tx_dropped. CC: Kyeyoon Park <kyeyoonp@codeaurora.org> CC: Roopa Prabhu <roopa@cumulusnetworks.com> CC: Stephen Hemminger <stephen@networkplumber.org> CC: bridge@lists.linux-foundation.org Fixes: 95850116 ("bridge: Add support for IEEE 802.11 Proxy ARP") Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-