- 13 Feb 2018, 22 commits
-
Submitted by Kirill Tkhai

arp_net_ops just adds/removes a /proc entry.

devinet_ops allocates and frees a duplicate of the init_net tables and (un)registers sysctl entries.

fib_net_ops allocates and frees pernet tables, creates/destroys a netlink socket and (un)initializes /proc entries. Foreign pernet_operations do not touch them.

ip_rt_proc_ops only modifies pernet /proc entries.

xfrm_net_ops creates/destroys /proc entries, allocates/frees pernet statistics, hashes and tables, and (un)initializes sysctl files. These are not touched by foreign pernet_operations.

xfrm4_net_ops allocates/frees private pernet memory, and configures sysctls.

sysctl_route_ops creates/destroys sysctls.

rt_genid_ops only initializes fields of a just allocated net.

ipv4_inetpeer_ops allocates/frees net private memory.

igmp_net_ops just creates/destroys /proc files and a socket, which no one else is interested in.

tcp_sk_ops seems to be safe, because tcp_sk_init() does not depend on any other pernet_operations modifications. Iteration over the hash table in inet_twsk_purge() is made under the RCU lock, and it's safe to iterate the table this way. Removing from the table happens in inet_twsk_deschedule_put(), but this function is safe without any external locks, as it's synchronized inside itself. There are many examples; it's used in different contexts. So, it's safe to leave tcp_sk_exit_batch() unlocked.

tcp_net_metrics_ops is synchronized on tcp_metrics_lock and safe.

udplite4_net_ops only creates/destroys a pernet /proc file.

icmp_sk_ops creates percpu sockets, not touched by foreign pernet_operations.

ipmr_net_ops creates/destroys pernet fib tables, and (un)registers fib rules and /proc files. This seems to be safe to execute in parallel with foreign pernet_operations.

af_inet_ops just sets up default parameters of a newly created net.

ipv4_mib_ops creates and destroys pernet percpu statistics.

raw_net_ops, tcp4_net_ops, udp4_net_ops, ping_v4_net_ops and ip_proc_ops only create/destroy pernet /proc files.

ip4_frags_ops creates and destroys a sysctl file.

So, it's safe to make these pernet_operations async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
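All of these conversions follow one mechanical pattern: set the new async flag on the subsystem's pernet_operations definition. A minimal sketch of what each such change looks like (the example_* names are illustrative, not from any of the converted files):

    static int __net_init example_net_init(struct net *net)
    {
            /* set up this netns's private state only */
            return 0;
    }

    static void __net_exit example_net_exit(struct net *net)
    {
            /* tear down this netns's private state only */
    }

    static struct pernet_operations example_net_ops = {
            .init  = example_net_init,
            .exit  = example_net_exit,
            .async = true,  /* may run in parallel with other pernet_operations */
    };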
-
Submitted by Kirill Tkhai

These pernet_operations register and unregister a sysctl directory, which is not interesting for foreign pernet_operations.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

These pernet_operations initialize and purge the net::wext_nlevents queue, and are not touched by foreign pernet_operations. Mark them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

These pernet_operations create and destroy net::genl_sock. Foreign pernet_operations don't touch it.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

psched_net_ops only creates and destroys a /proc entry, and is safe to be executed in parallel with any foreign pernet_operations. tcf_action_net_ops initializes and destructs tcf_action_net::egdev_ht, which is not touched by foreign pernet_operations. So, make them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

Both of them create and initialize lists, which are not touched by foreign pernet_operations.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

There are:

1) dev_proc_ops and dev_mc_net_ops, which create and destroy pernet proc files and are not interesting for other net namespaces;

2) netdev_net_ops, which creates pernet hashes, which are not touched by other pernet_operations.

So, make them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

This patch starts to convert pernet_subsys registered from subsys initcalls. It seems safe to be executed in parallel with others, as it only creates/destroys a proc entry, which nobody else is interested in.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

rtnetlink_net_init() and rtnetlink_net_exit() create and destroy the netlink socket net::rtnl. The socket is used to send rtnl notifications via rtnl_net_notifyid(). There is no problem to create and destroy it in parallel with other pernet operations, as we link the net in setup_net() after the socket is created, and destroy it in cleanup_net() after the net is unhashed from all the lists and there are no RCU references on it.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

The methods of netlink_net_ops create and destroy the "netlink" file, which is not interesting for foreign pernet_operations. So, netlink_net_ops may safely be made async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

net_defaults_ops introduces only the net_defaults_init_net method, and it acts on net::core::sysctl_somaxconn, which is not interesting for the rest of the pernet_subsys and pernet_device lists. So, make it async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

net_inuse_ops methods expose statistics in /proc. No one from the rest of the pernet_subsys or pernet_device lists touches net::core::inuse. So, it's safe to make net_inuse_ops async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

These pernet_operations would have had a problem executing in parallel with others if init_net were able to be released. But it's not, and the rest is safe: there is memory allocation, which nobody else is interested in, and sysctl registration. So, we make them async.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

The methods netfilter_net_init() and netfilter_net_exit() initialize net::nf::hooks and change the net-related proc directory of the net. Other pernet_operations are not interested in foreign net::nf::hooks or proc entries, so it's safe to make them executed in parallel with the methods of other pernet operations.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

This patch starts to convert pernet_subsys registered from core initcalls. The methods sysctl_net_init() and sysctl_net_exit() initialize the net::sysctls table of a namespace. The pernet_operations::init()/exit() methods from the rest of the list do not touch net::sysctls of strangers, so it's safe to execute sysctl_pernet_ops's methods in parallel with any other pernet_operations.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

This patch starts to convert pernet_subsys registered from pure initcalls. The net_ns_ops methods, net_ns_net_init()/net_ns_net_exit(), use only ida_simple_* functions, which do not need external synchronization; they are synchronized by the idr subsystem. So, the net_ns_ops methods are able to be executed in parallel with the methods of other pernet operations.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

This adds a new pernet_operations::async flag to indicate operations whose ->init(), ->exit() and ->exit_batch() methods are allowed to be executed in parallel with the methods of any other pernet_operations. When there are only asynchronous pernet_operations in the system, net_mutex won't be taken for net construction and destruction.

Also, remove BUG_ON(mutex_is_locked()) from net_assign_generic() without replacing it with an equivalent net_sem check, as there is one more lockdep assert below.

v3: Add comment near net_mutex.

Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
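For reference, the flag is a small addition to struct pernet_operations; a sketch of the resulting structure (field list abbreviated to the relevant members):

    struct pernet_operations {
            struct list_head list;
            int (*init)(struct net *net);
            void (*exit)(struct net *net);
            void (*exit_batch)(struct list_head *net_exit_list);
            unsigned int *id;
            size_t size;
            /*
             * The methods above may be executed in parallel with the
             * methods of any other pernet_operations, i.e. they need
             * no serialization via net_mutex.
             */
            bool async;
    };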
-
Submitted by Kirill Tkhai

net_sem protects against pernet_list changes, while ops_free_list() makes a simple kfree() and can't race with other pernet_operations callbacks. So we may release net_mutex earlier than before.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

Currently, the mutex is mostly used to protect the pernet operations list. It orders setup_net() and cleanup_net() with parallel {un,}register_pernet_operations() calls, so that the ->exit{,_batch} methods of the same pernet operations as were used to call the ->init methods are executed for a dying net, even after the net namespace is unlinked from net_namespace_list in cleanup_net().

But there are several problems with scalability. The first one is that more than one net can't be created or destroyed at the same moment on a node. For big machines with many cpus running many containers this is very sensitive. The second one is that it is necessary to synchronize_rcu() after the net is removed from net_namespace_list():

Destroy net_ns:

    cleanup_net()
        mutex_lock(&net_mutex)
        list_del_rcu(&net->list)
        synchronize_rcu()                                  <--- Sleep there for ages
        list_for_each_entry_reverse(ops, &pernet_list, list)
            ops_exit_list(ops, &net_exit_list)
        list_for_each_entry_reverse(ops, &pernet_list, list)
            ops_free_list(ops, &net_exit_list)
        mutex_unlock(&net_mutex)

This primitive is not fast, especially on systems with many processors and/or when preemptible RCU is enabled in the config. So, all the time while cleanup_net() is waiting for an RCU grace period, the creation of new net namespaces is not possible, and the tasks that create them are sleeping on the same mutex:

Create net_ns:

    copy_net_ns()
        mutex_lock_killable(&net_mutex)                    <--- Sleep there for ages

I observed 20-30 second hangs of "unshare -n" on an ordinary 8-cpu laptop with preemptible RCU enabled after a round of CRIU tests finished.

The solution is to convert net_mutex to a rw_semaphore and add fine-grained locks to the really small number of pernet_operations that actually need them. Then pernet_operations::init/::exit methods, which modify net-related data, will require down_read() locking only, while down_write() will be used for changing pernet_list (i.e., when modules are being loaded and unloaded). This gives a significant performance increase after the whole patch set is applied, as you may see here:

    %for i in {1..10000}; do unshare -n bash -c exit; done

    *before*
    real 1m40,377s
    user 0m9,672s
    sys 0m19,928s

    *after*
    real 0m17,007s
    user 0m5,311s
    sys 0m11,779s

    (5.8 times faster)

This patch starts replacing net_mutex with net_sem. It adds the rw_semaphore, describes the variables it protects, and uses it where appropriate. net_mutex is still present, and the next patches will kick it out step by step.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
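A sketch of the locking split the series is heading toward (simplified; during the transition the actual code still takes net_mutex for not-yet-converted operations):

    static DECLARE_RWSEM(net_sem);

    /* net creation/destruction paths (setup_net()/cleanup_net()):
     * shared access, so several namespaces can be (de)initialized
     * concurrently.
     */
    down_read(&net_sem);
    /* ... run pernet_operations ->init()/->exit() methods ... */
    up_read(&net_sem);

    /* pernet_list changes (module load/unload): exclusive access */
    down_write(&net_sem);
    /* ... register/unregister pernet_operations ... */
    up_write(&net_sem);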
-
Submitted by Kirill Tkhai

Line up destructor actions in the reverse order of the constructors. The next patches will add more actions, and it will be convenient to have such an order.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Kirill Tkhai

This patch merges two repeating pieces of code into one, and they will live in setup_net() now. The only change is that the assignment:

    init_net_initialized = true;

becomes reordered with:

    list_add_tail_rcu(&net->list, &net_namespace_list);

The order does not have a visible effect, and this is a simple cleanup, because init_net_initialized is used in the !CONFIG_NET_NS case to order proc_net_ns_ops registration, occurring at boot time via start_kernel()->proc_root_init()->proc_net_init(), with net_ns_init()->setup_net(&init_net, &init_user_ns), which also occurs at boot time from the same init_task. When there are no other tasks to race with them, it does not matter for a single task in which order two sequential independent writes are made. So we make them reordered.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by Denys Vlasenko

Changes since v1: added changes in these files:

    drivers/infiniband/hw/usnic/usnic_transport.c
    drivers/staging/lustre/lnet/lnet/lib-socket.c
    drivers/target/iscsi/iscsi_target_login.c
    drivers/vhost/net.c
    fs/dlm/lowcomms.c
    fs/ocfs2/cluster/tcp.c
    security/tomoyo/network.c

Before: all these functions either return a negative error indicator, or store the length of the sockaddr into an "int *socklen" parameter and return zero on success.

The "int *socklen" parameter is awkward. For example, if the caller does not care, it still needs to provide on-stack storage for a value it does not need. None of the many FOO_getname() functions of the various protocols ever used the old value of *socklen; they always just overwrite it.

This change drops this parameter and makes all these functions, on success, return the length of the sockaddr. It's always >= 0 and can be differentiated from an error. Tests in callers are changed from "if (err)" to "if (err < 0)", where needed.

rpc_sockname() lost its "int buflen" parameter, since its only use was to be passed to kernel_getsockname() as &buflen and subsequently not used in any way.

The userspace API is not changed.

        text     data    bss      dec     hex filename
    30108430  2633624 873672 33615726 200ef6e vmlinux.before.o
    30108109  2633612 873672 33615393 200ee21 vmlinux.o

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: David S. Miller <davem@davemloft.net>
CC: linux-kernel@vger.kernel.org
CC: netdev@vger.kernel.org
CC: linux-bluetooth@vger.kernel.org
CC: linux-decnet-user@lists.sourceforge.net
CC: linux-wireless@vger.kernel.org
CC: linux-rdma@vger.kernel.org
CC: linux-sctp@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-x25@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
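The shape of the conversion, using the kernel_getsockname() helper mentioned above as an example (a sketch; the surrounding error-handling context is illustrative):

    struct sockaddr_storage addr;
    int addrlen, err;

    /* Before: length via an out-parameter, 0 on success */
    addrlen = sizeof(addr);
    err = kernel_getsockname(sock, (struct sockaddr *)&addr, &addrlen);
    if (err)
            goto out;

    /* After: the (non-negative) return value is the length */
    err = kernel_getsockname(sock, (struct sockaddr *)&addr);
    if (err < 0)
            goto out;
    addrlen = err;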
-
- 12 Feb 2018, 1 commit
-
Submitted by Linus Torvalds

This is the mindless scripted replacement of kernel use of POLL* variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$' | grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same values as the POLL* constants do. But the keyword here is "almost". For various bad reasons they aren't the same, and epoll() doesn't actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we should be all done.

Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
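Roughly, the effect on a typical ->poll() method (foo_poll and its waitqueue are illustrative; the __poll_t return-type annotation belongs to the related de-mangling cleanups rather than this script):

    /* Before */
    static unsigned int foo_poll(struct file *file, poll_table *wait)
    {
            poll_wait(file, &foo_waitqueue, wait);
            return POLLIN | POLLRDNORM;
    }

    /* After */
    static __poll_t foo_poll(struct file *file, poll_table *wait)
    {
            poll_wait(file, &foo_waitqueue, wait);
            return EPOLLIN | EPOLLRDNORM;
    }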
-
- 10 Feb 2018, 1 commit
-
Submitted by Alexey Kodanev

When SCTP makes an INIT or INIT_ACK packet, the total chunk length can exceed SCTP_MAX_CHUNK_LEN, which leads to a kernel panic when transmitting these packets, e.g. the crash on sending INIT_ACK:

    [  597.804948] skbuff: skb_over_panic: text:00000000ffae06e4 len:120168 put:120156 head:000000007aa47635 data:00000000d991c2de tail:0x1d640 end:0xfec0 dev:<NULL>
    ...
    [  597.976970] ------------[ cut here ]------------
    [  598.033408] kernel BUG at net/core/skbuff.c:104!
    [  600.314841] Call Trace:
    [  600.345829]  <IRQ>
    [  600.371639]  ? sctp_packet_transmit+0x2095/0x26d0 [sctp]
    [  600.436934]  skb_put+0x16c/0x200
    [  600.477295]  sctp_packet_transmit+0x2095/0x26d0 [sctp]
    [  600.540630]  ? sctp_packet_config+0x890/0x890 [sctp]
    [  600.601781]  ? __sctp_packet_append_chunk+0x3b4/0xd00 [sctp]
    [  600.671356]  ? sctp_cmp_addr_exact+0x3f/0x90 [sctp]
    [  600.731482]  sctp_outq_flush+0x663/0x30d0 [sctp]
    [  600.788565]  ? sctp_make_init+0xbf0/0xbf0 [sctp]
    [  600.845555]  ? sctp_check_transmitted+0x18f0/0x18f0 [sctp]
    [  600.912945]  ? sctp_outq_tail+0x631/0x9d0 [sctp]
    [  600.969936]  sctp_cmd_interpreter.isra.22+0x3be1/0x5cb0 [sctp]
    [  601.041593]  ? sctp_sf_do_5_1B_init+0x85f/0xc30 [sctp]
    [  601.104837]  ? sctp_generate_t1_cookie_event+0x20/0x20 [sctp]
    [  601.175436]  ? sctp_eat_data+0x1710/0x1710 [sctp]
    [  601.233575]  sctp_do_sm+0x182/0x560 [sctp]
    [  601.284328]  ? sctp_has_association+0x70/0x70 [sctp]
    [  601.345586]  ? sctp_rcv+0xef4/0x32f0 [sctp]
    [  601.397478]  ? sctp6_rcv+0xa/0x20 [sctp]
    ...

Here the chunk size for the INIT_ACK packet becomes too big, mostly because of the state cookie (the INIT packet has a large size with many address parameters), plus additional server parameters.

Later this chunk causes the panic in skb_put_data():

    sctp_packet_transmit()
        sctp_packet_pack()
            skb_put_data(nskb, chunk->skb->data, chunk->skb->len);

'nskb' (the head skb) was previously allocated with packet->size from the u16 'chunk->chunk_hdr->length'.

As suggested by Marcelo, we should check the chunk's length in _sctp_make_chunk() before trying to allocate an skb for it, and discard the chunk if its size is bigger than SCTP_MAX_CHUNK_LEN.

Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
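A sketch of the suggested check in _sctp_make_chunk(), performed before the skb allocation (surrounding context abbreviated):

    int chunklen = SCTP_PAD4(sizeof(struct sctp_chunkhdr) + paylen);

    if (chunklen > SCTP_MAX_CHUNK_LEN)
            goto nodata;    /* discard: length would not fit the 16-bit
                             * chunk_hdr->length field
                             */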
-
- 09 Feb 2018, 11 commits
-
Submitted by Trond Myklebust

Calling __UDPX_INC_STATS() from a preemptible context leads to a warning of the form:

    BUG: using __this_cpu_add() in preemptible [00000000] code: kworker/u5:0/31
    caller is xs_udp_data_receive_workfn+0x194/0x270
    CPU: 1 PID: 31 Comm: kworker/u5:0 Not tainted 4.15.0-rc8-00076-g90ea9f1b #2
    Workqueue: xprtiod xs_udp_data_receive_workfn
    Call Trace:
     dump_stack+0x85/0xc1
     check_preemption_disabled+0xce/0xe0
     xs_udp_data_receive_workfn+0x194/0x270
     process_one_work+0x318/0x620
     worker_thread+0x20a/0x390
     ? process_one_work+0x620/0x620
     kthread+0x120/0x130
     ? __kthread_bind_mask+0x60/0x60
     ret_from_fork+0x24/0x30

Since we're taking a spinlock in those functions anyway, let's fix the issue by moving the call so that it occurs under the spinlock.

Reported-by: kernel test robot <fengguang.wu@intel.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Submitted by Olga Kornievskaia

Hi folks,

On a multi-core machine, is it expected that we can have parallel RPCs handled by each of the per-core workqueues? In testing a read workload, I observed via the "top" command that a single "kworker" thread was servicing the requests (no parallelism). It's more prominent while doing these operations over a krb5p mount.

What Bruce suggested is to try this, and in my testing I then see the read workload spread among all the kworker threads.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
-
Submitted by Hoang Le

In commit d618d09a ("tipc: enforce valid ratio between skb truesize and contents") we introduced a test for ensuring that the condition truesize/datasize <= 4 is true for a received buffer. Unfortunately this test has two problems.

- Because of the integer arithmetic, the test

      if (skb->truesize / buf_roundup_len(skb) > 4)

  will miss all ratios [4 < ratio < 5], which was not the intention.

- The buffer returned by skb_copy() inherits skb->truesize of the original buffer, which doesn't help the situation at all.

In this commit, we change the ratio condition and replace skb_copy() with a call to skb_copy_expand() to finally get this right.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
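A sketch of the corrected handling (with integer division, ">= 4" catches the [4 < ratio < 5] range that "> 4" missed, and skb_copy_expand() gives the copy a truesize that matches its real size):

    if (unlikely(skb->truesize / buf_roundup_len(skb) >= 4)) {
            skb = skb_copy_expand(skb, BUF_HEADROOM, 0, GFP_ATOMIC);
            if (!skb)
                    return false;   /* drop on allocation failure */
    }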
-
Submitted by Ivan Vecera

The following sequence is currently broken:

    # tc qdisc add dev foo ingress
    # tc filter replace dev foo protocol all ingress \
      u32 match u8 0 0 action mirred egress mirror dev bar1
    # tc filter replace dev foo protocol all ingress \
      handle 800::800 pref 49152 \
      u32 match u8 0 0 action mirred egress mirror dev bar2
    Error: cls_u32: Key node flags do not match passed flags.
    We have an error talking to the kernel, -1

The error comes from u32_change() when comparing the new and existing flags. The existing ones always contain one of the TCA_CLS_FLAGS_{,NOT}_IN_HW flags, depending on the offloading state. These flags cannot be passed from userspace, so the condition (n->flags != flags) in u32_change() always fails.

Fix the condition so that the flags TCA_CLS_FLAGS_NOT_IN_HW and TCA_CLS_FLAGS_IN_HW are not taken into account.

Fixes: 24d3dc6d ("net/sched: cls_u32: Reflect HW offload status")
Signed-off-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
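A sketch of the repaired comparison in u32_change(): XOR the two flag words and mask off the kernel-maintained in-HW status bits before testing for a mismatch (the extack reporting shown matches the quoted error message and is illustrative):

    if ((n->flags ^ flags) &
        ~(TCA_CLS_FLAGS_IN_HW | TCA_CLS_FLAGS_NOT_IN_HW)) {
            NL_SET_ERR_MSG_MOD(extack,
                               "Key node flags do not match passed flags");
            return -EINVAL;
    }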
-
Submitted by Dan Williams

mpls_label_ok() validates that the 'platform_label' array index from a userspace netlink message payload is valid. Under speculation the mpls_label_ok() result may not resolve in the CPU pipeline until after the index is used to access an array element. Sanitize the index to zero to prevent userspace-controlled arbitrary out-of-bounds speculation, a precursor for a speculative execution side channel vulnerability.

Cc: <stable@vger.kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
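The sanitization uses the array_index_nospec() helper from the Spectre-v1 mitigation work; a simplified sketch of the pattern (the in-tree function carries more context than shown here):

    #include <linux/nospec.h>

    static bool mpls_label_ok(struct net *net, unsigned int *index)
    {
            bool is_ok = *index >= MPLS_LABEL_FIRST_UNRESERVED &&
                         *index < net->mpls.platform_labels;

            /* Clamp to 0 if out of bounds, even under speculation, so a
             * mispredicted bounds check cannot index outside
             * platform_label[].
             */
            *index = array_index_nospec(*index, net->mpls.platform_labels);
            return is_ok;
    }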
-
Submitted by Sowmini Varadhan

rds: tcp: use rds_destroy_pending() to synchronize netns/module teardown and rds connection/workq management

An rds_connection can get added during netns deletion between lines 528 and 529 of

    506 static void rds_tcp_kill_sock(struct net *net)
    :   /* code to pull out all the rds_connections that should be destroyed */
    :
    528         spin_unlock_irq(&rds_tcp_conn_lock);
    529         list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node)
    530                 rds_conn_destroy(tc->t_cpath->cp_conn);

Such an rds_connection would miss out on the rds_conn_destroy() loop (that cancels all pending work) and (if it was scheduled after netns deletion) could trigger the use-after-free.

A similar race window exists for the module unload path in rds_tcp_exit -> rds_tcp_destroy_conns.

Concurrency with netns deletion (rds_tcp_kill_sock()) must be handled by checking check_net() before enqueuing new work or adding new connections. Concurrency with module unload is handled by maintaining a module-specific flag that is set at the start of the module exit function and must be checked before enqueuing new work or adding new connections.

This commit refactors the existing RDS_DESTROY_PENDING checks added by commit 3db6e0d1 ("rds: use RCU to synchronize work-enqueue with connection teardown") and consolidates all the concurrency checks listed above into the function rds_destroy_pending().

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
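A sketch of the consolidated helper, combining the netns check with a per-transport module-unload check (the t_unloading callback name reflects the flag described above; treat the exact names as illustrative):

    static inline bool rds_destroy_pending(struct rds_connection *conn)
    {
            return !check_net(rds_conn_net(conn)) ||
                   (conn->c_trans->t_unloading &&
                    conn->c_trans->t_unloading(conn));
    }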
-
Submitted by Kees Cook

Most callers of put_cmsg() use a "sizeof(foo)" for the length argument. Within put_cmsg(), a copy_to_user() call is made with a dynamic size, as a result of the cmsg header calculations. This means that hardened usercopy will examine the copy, even though it was technically a fixed size and should be implicitly whitelisted. All the put_cmsg() calls being built from values in skbuff_head_cache are coming out of the protocol-defined "cb" field, so whitelist this field entirely instead of creating per-use bounce buffers, for which there are concerns about performance.

The original report was:

    Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLAB object 'skbuff_head_cache' (offset 64, size 16)!
    WARNING: CPU: 0 PID: 3663 at mm/usercopy.c:81 usercopy_warn+0xdb/0x100 mm/usercopy.c:76
    ...
     __check_heap_object+0x89/0xc0 mm/slab.c:4426
     check_heap_object mm/usercopy.c:236 [inline]
     __check_object_size+0x272/0x530 mm/usercopy.c:259
     check_object_size include/linux/thread_info.h:112 [inline]
     check_copy_size include/linux/thread_info.h:143 [inline]
     copy_to_user include/linux/uaccess.h:154 [inline]
     put_cmsg+0x233/0x3f0 net/core/scm.c:242
     sock_recv_errqueue+0x200/0x3e0 net/core/sock.c:2913
     packet_recvmsg+0xb2e/0x17a0 net/packet/af_packet.c:3296
     sock_recvmsg_nosec net/socket.c:803 [inline]
     sock_recvmsg+0xc9/0x110 net/socket.c:810
     ___sys_recvmsg+0x2a4/0x640 net/socket.c:2179
     __sys_recvmmsg+0x2a9/0xaf0 net/socket.c:2287
     SYSC_recvmmsg net/socket.c:2368 [inline]
     SyS_recvmmsg+0xc4/0x160 net/socket.c:2352
     entry_SYSCALL_64_fastpath+0x29/0xa0

Reported-by: syzbot+e2d6cfb305e9f3911dea@syzkaller.appspotmail.com
Fixes: 6d07d1cd ("usercopy: Restrict non-usercopy caches to size 0")
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
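The whitelist itself amounts to creating the slab cache with the usercopy region set to the cb[] field; a sketch of the skb_init() change:

    skbuff_head_cache = kmem_cache_create_usercopy("skbuff_head_cache",
                                    sizeof(struct sk_buff), 0,
                                    SLAB_HWCACHE_ALIGN | SLAB_PANIC,
                                    offsetof(struct sk_buff, cb),
                                    sizeof_field(struct sk_buff, cb),
                                    NULL);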
-
Submitted by Christian Brauner

Since we've added support for IFLA_IF_NETNSID for RTM_{DEL,GET,SET,NEW}LINK, it is possible for userspace to send us requests with three different properties to identify a target network namespace. This affects at least RTM_{NEW,SET}LINK. Each of them could potentially refer to a different network namespace, which is confusing. For legacy reasons the kernel will pick the IFLA_NET_NS_PID property first and then look for the IFLA_NET_NS_FD property, but there is no reason to extend this type of behavior to network namespace ids.

The regression potential is quite minimal, since the rtnetlink requests in question either won't allow IFLA_IF_NETNSID requests before 4.16 is out (RTM_{NEW,SET}LINK) or don't support IFLA_NET_NS_{PID,FD} (RTM_{DEL,GET}LINK) in the first place.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
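A sketch of the resulting rule when resolving the target namespace of a request: an IFLA_IF_NETNSID attribute may not be combined with the legacy pid/fd properties (the surrounding lookup code is illustrative):

    if (tb[IFLA_IF_NETNSID]) {
            /* refuse ambiguous requests rather than guess */
            if (tb[IFLA_NET_NS_PID] || tb[IFLA_NET_NS_FD])
                    return ERR_PTR(-EOPNOTSUPP);
            net = get_net_ns_by_id(sock_net(skb->sk),
                                   nla_get_s32(tb[IFLA_IF_NETNSID]));
    }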
-
Submitted by Nicolas Dichtel

Nowadays, nlmsg_multicast() returns only 0 or -ESRCH, but this was not the case when commit 134e6375 was pushed. However, there was no reason to stop the loop if a netns does not have listeners. Return -ESRCH only if there were no listeners in any netns.

To avoid having the same problem in the future, I didn't take the assumption that nlmsg_multicast() returns only 0 or -ESRCH.

Fixes: 134e6375 ("genetlink: make netns aware")
CC: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
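A simplified sketch of the corrected multicast loop: treat -ESRCH from one namespace as "no listeners there" and keep going, reporting it only if no namespace had listeners (the in-tree version also avoids cloning for the last namespace):

    static int genlmsg_mcast(struct sk_buff *skb, u32 portid,
                             unsigned long group, gfp_t flags)
    {
            bool delivered = false;
            struct net *net;
            int err;

            rcu_read_lock();
            for_each_net_rcu(net) {
                    struct sk_buff *tmp = skb_clone(skb, flags);

                    if (!tmp) {
                            err = -ENOMEM;
                            goto error;
                    }
                    err = nlmsg_multicast(net->genl_sock, tmp,
                                          portid, group, flags);
                    if (!err)
                            delivered = true;
                    else if (err != -ESRCH)
                            goto error;  /* a real failure, not "no listeners" */
            }
            rcu_read_unlock();
            consume_skb(skb);
            return delivered ? 0 : -ESRCH;

    error:
            rcu_read_unlock();
            kfree_skb(skb);
            return err;
    }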
-
Submitted by David Howells

Don't put buffers of data to be handed to crypto on the stack, as this may cause an assertion failure in the kernel (see below). Fix this by using a kmalloc'd buffer instead.

    kernel BUG at ./include/linux/scatterlist.h:147!
    ...
    RIP: 0010:rxkad_encrypt_response.isra.6+0x191/0x1b0 [rxrpc]
    RSP: 0018:ffffbe2fc06cfca8 EFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff989277d59900 RCX: 0000000000000028
    RDX: 0000259dc06cfd88 RSI: 0000000000000025 RDI: ffffbe30406cfd88
    RBP: ffffbe2fc06cfd60 R08: ffffbe2fc06cfd08 R09: ffffbe2fc06cfd08
    R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff7c5f80d9f95
    R13: ffffbe2fc06cfd88 R14: ffff98927a3f7aa0 R15: ffffbe2fc06cfd08
    FS:  0000000000000000(0000) GS:ffff98927fc00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055b1ff28f0f8 CR3: 000000001b412003 CR4: 00000000003606f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     rxkad_respond_to_challenge+0x297/0x330 [rxrpc]
     rxrpc_process_connection+0xd1/0x690 [rxrpc]
     ? process_one_work+0x1c3/0x680
     ? __lock_is_held+0x59/0xa0
     process_one_work+0x249/0x680
     worker_thread+0x3a/0x390
     ? process_one_work+0x680/0x680
     kthread+0x121/0x140
     ? kthread_create_worker_on_cpu+0x70/0x70
     ret_from_fork+0x3a/0x50

Reported-by: Jonathan Billings <jsbillings@jsbillings.org>
Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Jonathan Billings <jsbillings@jsbillings.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
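The pattern of the fix, sketched: scatterlists cannot safely point at stack memory, so the temporary buffer handed to the cipher is heap-allocated (struct rxkad_response is rxkad's on-the-wire response; the GFP_NOFS flag here is an assumption for illustration):

    struct rxkad_response *resp;

    resp = kzalloc(sizeof(*resp), GFP_NOFS);
    if (!resp)
            return -ENOMEM;
    /* ... build the response, sg_init_one(&sg, resp, sizeof(*resp)),
     * then hand it to the cipher ...
     */
    kfree(resp);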
-
Submitted by Chuck Lever

A single NFSv4 WRITE compound can often have three operations: PUTFH, WRITE, then GETATTR.

When the WRITE payload is sent in a Read chunk, the client places the GETATTR in the inline part of the RPC/RDMA message, just after the WRITE operation (sans payload). The position value in the Read chunk enables the receiver to insert the Read chunk at the correct place in the received XDR stream; that is, between the WRITE and GETATTR.

According to RFC 8166, an NFS/RDMA client does not have to add XDR round-up to the Read chunk that carries the WRITE payload. The receiver adds XDR round-up padding if it is absent and the receiver's XDR decoder requires it to be present.

Commit 193bcb7b ("svcrdma: Populate tail iovec when receiving") attempted to add support for receiving such a compound so that just the WRITE payload appears in rq_arg's page list, and the trailing GETATTR is placed in rq_arg's tail iovec. (TCP just strings the whole compound into the head iovec and page list, without regard to the alignment of the WRITE payload.)

The server transport logic also had to accommodate the optional XDR round-up of the Read chunk, which it did simply by lengthening the tail iovec when round-up was needed. This approach is adequate for the NFSv2 and NFSv3 WRITE decoders. Unfortunately it is not sufficient for nfsd4_decode_write. When the Read chunk length is a couple of bytes less than PAGE_SIZE, the computation at the end of nfsd4_decode_write allows argp->pagelen to go negative, which breaks the logic in read_buf that looks for the tail iovec.

The result is that a WRITE operation whose payload length is just less than a multiple of a page succeeds, but the subsequent GETATTR in the same compound fails with NFS4ERR_OP_ILLEGAL because the XDR decoder can't find it. Clients ignore the error, but they must update their attribute cache via a separate round trip.

As nfsd4_decode_write appears to expect the payload itself to always have appropriate XDR round-up, have svc_rdma_build_normal_read_chunk add the Read chunk XDR round-up to the page_len rather than lengthening the tail iovec.

Reported-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: 193bcb7b ("svcrdma: Populate tail iovec when receiving")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
-
- 08 Feb 2018, 5 commits
-
Submitted by Song Liu

The tracepoint tcp_send_reset requires a full socket to work. However, it may be called when in TCP_TIME_WAIT:

    case TCP_TW_RST:
            tcp_v6_send_reset(sk, skb);
            inet_twsk_deschedule_put(inet_twsk(sk));
            goto discard_it;

To avoid this problem, this patch checks the socket with sk_fullsock() before calling trace_tcp_send_reset().

Fixes: c24b14c4 ("tcp: add tracepoint trace_tcp_send_reset")
Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
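The guard, sketched as applied at the tcp_v4_send_reset()/tcp_v6_send_reset() call sites:

    /* sk may be NULL or a timewait/request minisocket here; the
     * tracepoint dereferences state only full sockets carry.
     */
    if (sk && sk_fullsock(sk))
            trace_tcp_send_reset(sk, skb);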
-
Submitted by Md. Islam

In kernel 4.15.0+, netem does not work properly.

Netem setup:

    tc qdisc add dev h1-eth0 root handle 1: netem delay 10ms 2ms

Result:

    PING 172.16.101.2 (172.16.101.2) 56(84) bytes of data.
    64 bytes from 172.16.101.2: icmp_seq=1 ttl=64 time=22.8 ms
    64 bytes from 172.16.101.2: icmp_seq=2 ttl=64 time=10.9 ms
    64 bytes from 172.16.101.2: icmp_seq=3 ttl=64 time=10.9 ms
    64 bytes from 172.16.101.2: icmp_seq=5 ttl=64 time=11.4 ms
    64 bytes from 172.16.101.2: icmp_seq=6 ttl=64 time=11.8 ms
    64 bytes from 172.16.101.2: icmp_seq=4 ttl=64 time=4303 ms
    64 bytes from 172.16.101.2: icmp_seq=10 ttl=64 time=11.2 ms
    64 bytes from 172.16.101.2: icmp_seq=11 ttl=64 time=10.3 ms
    64 bytes from 172.16.101.2: icmp_seq=7 ttl=64 time=4304 ms
    64 bytes from 172.16.101.2: icmp_seq=8 ttl=64 time=4303 ms

Patch: (rnd % (2 * sigma)) - sigma was overflowing s32. After applying the patch, I found the following output, which is desirable:

    PING 172.16.101.2 (172.16.101.2) 56(84) bytes of data.
    64 bytes from 172.16.101.2: icmp_seq=1 ttl=64 time=21.1 ms
    64 bytes from 172.16.101.2: icmp_seq=2 ttl=64 time=8.46 ms
    64 bytes from 172.16.101.2: icmp_seq=3 ttl=64 time=9.00 ms
    64 bytes from 172.16.101.2: icmp_seq=4 ttl=64 time=11.8 ms
    64 bytes from 172.16.101.2: icmp_seq=5 ttl=64 time=8.36 ms
    64 bytes from 172.16.101.2: icmp_seq=6 ttl=64 time=11.8 ms
    64 bytes from 172.16.101.2: icmp_seq=7 ttl=64 time=8.11 ms
    64 bytes from 172.16.101.2: icmp_seq=8 ttl=64 time=10.0 ms
    64 bytes from 172.16.101.2: icmp_seq=9 ttl=64 time=11.3 ms
    64 bytes from 172.16.101.2: icmp_seq=10 ttl=64 time=11.5 ms
    64 bytes from 172.16.101.2: icmp_seq=11 ttl=64 time=10.2 ms

Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
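The problematic expression lives in the uniform-jitter branch of netem's tabledist(); a sketch of the repair, with the arithmetic promoted to 64 bits so that "2 * sigma" and the subtraction cannot wrap a 32-bit int (variable names follow the description above and are illustrative):

    /* uniform jitter: a value in [mu - sigma, mu + sigma) */
    if (dist == NULL)
            return ((s64)rnd % (2 * (s64)sigma)) - sigma + mu;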
-
Submitted by David Ahern

Because of differences in how IPv4 and IPv6 handle fib lookups, verification of nexthops with the onlink flag needs to default to the main table rather than the local table used by IPv4. As it stands, an address within a connected route on device 1 can be used with onlink on device 2. Using the proper table rejects the route due to the egress device mismatch.

Update the extack message as well to show it could be a device mismatch for the nexthop spec.

Fixes: fc1e64e1 ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by David Ahern

Verification of nexthops with the onlink flag needs to handle unreachable routes. The lookup is only intended to validate that the gateway address is not a local address, and, if the gateway resolves, the egress device must match the given device. Hence, hitting any default reject route is ok.

Fixes: fc1e64e1 ("net/ipv6: Add support for onlink flag")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
Submitted by David Howells

AF_RXRPC is incorrectly sending back to the server any abort it receives for a client connection. This is due to the final-ACK offload to the connection event processor patch. The abort code is copied into the last-call information on the connection channel and then the event processor is set.

Instead, the following should be done:

(1) In the case of a final-ACK for a successful call, the ACK should be scheduled as before.

(2) In the case of a locally generated ABORT, the ABORT details should be cached for sending in response to further packets related to that call, and no further action scheduled at call disconnect time.

(3) In the case of an ABORT received from the peer, the call should be considered dead and no ABORT should be transmitted at this time. In response to further non-ABORT packets from the peer relating to this call, an RX_USER_ABORT ABORT should be transmitted.

(4) In the case of a call killed due to network error, an RX_USER_ABORT ABORT should be cached for transmission in response to further packets, but no ABORT should be sent at this time.

Fixes: 3136ef49 ("rxrpc: Delay terminal ACK transmission on a client call")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-