- 22 9月, 2017 5 次提交
-
-
由 Vincent Bernat 提交于
Currently, there is a difference in netlink events received when an interface is modified through bridge ioctl() or through netlink. This patch generates additional events when an interface is added to or removed from a bridge via ioctl(). When adding then removing an interface from a bridge with netlink, we get: 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master bridge0 state UNKNOWN group default link/ether 9e:da:60:ee:cf:c8 brd ff:ff:ff:ff:ff:ff 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN link/ether 9e:da:60:ee:cf:c8 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN link/ether 9e:da:60:ee:cf:c8 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN link/ether 9e:da:60:ee:cf:c8 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN link/ether 9e:da:60:ee:cf:c8 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master bridge0 state UNKNOWN group default link/ether 9e:da:60:ee:cf:c8 brd ff:ff:ff:ff:ff:ff 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master bridge0 state UNKNOWN group default link/ether 9e:da:60:ee:cf:c8 brd ff:ff:ff:ff:ff:ff 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN link/ether 9e:da:60:ee:cf:c8 Deleted 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN link/ether 9e:da:60:ee:cf:c8 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default link/ether 9e:da:60:ee:cf:c8 brd ff:ff:ff:ff:ff:ff When using ioctl(): 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master bridge0 state UNKNOWN group default link/ether 9e:da:60:ee:cf:c8 brd ff:ff:ff:ff:ff:ff 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN link/ether 9e:da:60:ee:cf:c8 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN link/ether 9e:da:60:ee:cf:c8 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN link/ether 9e:da:60:ee:cf:c8 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master bridge0 state UNKNOWN group default link/ether 9e:da:60:ee:cf:c8 brd ff:ff:ff:ff:ff:ff 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master bridge0 state UNKNOWN group default link/ether 9e:da:60:ee:cf:c8 brd ff:ff:ff:ff:ff:ff 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN link/ether 9e:da:60:ee:cf:c8 Deleted 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN link/ether 9e:da:60:ee:cf:c8 5: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default link/ether 9e:da:60:ee:cf:c8 brd ff:ff:ff:ff:ff:ff Without this patch, the last netlink notification is not sent. Signed-off-by: NVincent Bernat <vincent@bernat.im> Reviewed-by: NStephen Hemminger <stephen@networkplumber.org> Reviewed-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Ursula Braun 提交于
Use the smc_connection as first parameter with smc_cdc_get_free_slot(). This is just a small code cleanup, no functional change. Signed-off-by: NUrsula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vivien Didelot 提交于
Dumping a DSA port's FDB entries is not specific to a DSA slave, so add a dsa_port_fdb_dump function, similarly to dsa_port_fdb_add and dsa_port_fdb_del. Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vivien Didelot 提交于
A few DSA slave functions take a dsa_slave_priv pointer as first argument, whereas the scope of the slave.c functions is the slave net_device structure. Fix this and rename dsa_netpoll_send_skb to dsa_slave_netpoll_send_skb. Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Paolo Abeni 提交于
Since commit 1dced6a8 ("ipv4: Restore accept_local behaviour in fib_validate_source()") a full fib lookup is needed even if the rp_filter is disabled, if accept_local is false - which is the default. What we really need in the above scenario is just checking that the source IP address is not local, and in most case we can do that is a cheaper way looking up the ifaddr hash table. This commit adds a helper for such lookup, and uses it to validate the src address when rp_filter is disabled and no 'local' routes are created by the user space in the relevant namespace. A new ipv4 netns flag is added to account for such routes. We need that to preserve the same behavior we had before this patch. It also drops the checks to bail early from __fib_validate_source, added by the commit 1dced6a8 ("ipv4: Restore accept_local behaviour in fib_validate_source()") they do not give any measurable performance improvement: if we do the lookup with are on a slower path. This improves UDP performances for unconnected sockets when rp_filter is disabled by 5% and also gives small but measurable performance improvement for TCP flood scenarios. v1 -> v2: - use the ifaddr lookup helper in __ip_dev_find(), as suggested by Eric - fall-back to full lookup if custom local routes are present Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 9月, 2017 5 次提交
-
-
由 Florian Fainelli 提交于
Instead of open coding the check. Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com> Reviewed-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David S. Miller 提交于
This reverts commit 00ba4cb3. Discussion with David Ahern determined that this change is actually not needed. Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Paolo Abeni 提交于
The commit 6b229cf7 ("udp: add batching to udp_rmem_release()") reduced greatly the cacheline contention between the BH and the US reader batching the rmem updates in most scenarios. Such optimization is explicitly avoided if the US reader is faster then BH processing. My fault, I initially suggested this kind of behavior due to concerns of possible regressions with small sk_rcvbuf values. Tests showed such concerns are misplaced, so this commit relaxes the condition for rmem bulk updates, obtaining small but measurable performance gain in the scenario described above. Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vincent Bernat 提交于
Currently, when an interface is released from a bridge via ioctl(), we get a RTM_DELLINK event through netlink: Deleted 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN link/ether 6e:23:c2:54:3a:b3 Userspace has to interpret that as a removal from the bridge, not as a complete removal of the interface. When an bridged interface is completely removed, we get two events: Deleted 2: dummy0: <BROADCAST,NOARP> mtu 1500 master bridge0 state DOWN link/ether 6e:23:c2:54:3a:b3 Deleted 2: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default link/ether 6e:23:c2:54:3a:b3 brd ff:ff:ff:ff:ff:ff In constrast, when an interface is released from a bond, we get a RTM_NEWLINK with only the new characteristics (no master): 3: dummy1: <BROADCAST,NOARP,SLAVE,UP,LOWER_UP> mtu 1500 qdisc noqueue master bond0 state UNKNOWN group default link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff 3: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff 4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff 3: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noqueue state DOWN group default link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff 3: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noqueue state DOWN group default link/ether ca:c8:7b:66:f8:25 brd ff:ff:ff:ff:ff:ff 4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default link/ether ae:dc:7a:8c:9a:3c brd ff:ff:ff:ff:ff:ff Userland may be confused by the fact we say a link is deleted while its characteristics are only modified. A first solution would have been to turn the RTM_DELLINK event in del_nbp() into a RTM_NEWLINK event. However, maybe some piece of userland is relying on this RTM_DELLINK to detect when a bridged interface is released. Instead, we also emit a RTM_NEWLINK event once the interface is released (without master info). Deleted 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 master bridge0 state UNKNOWN link/ether 8a:bb:e7:94:b1:f8 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default link/ether 8a:bb:e7:94:b1:f8 brd ff:ff:ff:ff:ff:ff This is done only when using ioctl(). When using Netlink, such an event is already automatically emitted in do_setlink(). Signed-off-by: NVincent Bernat <vincent@bernat.im> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Xiang Gao 提交于
In ipv6_skip_exthdr, the lengh of AH header is computed manually as (hp->hdrlen+2)<<2. However, in include/linux/ipv6.h, a macro named ipv6_authlen is already defined for exactly the same job. This commit replaces the manual computation code with the macro. Signed-off-by: NXiang Gao <qasdfgtyuiop@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 9月, 2017 11 次提交
-
-
由 Eric Dumazet 提交于
Implement exit_batch() method to dismantle more devices per round. (rtnl_lock() ... unregister_netdevice_many() ... rtnl_unlock()) Tested: $ cat add_del_unshare.sh for i in `seq 1 40` do (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) & done wait ; grep net_namespace /proc/slabinfo Before patch : $ time ./add_del_unshare.sh net_namespace 126 282 5504 1 2 : tunables 8 4 0 : slabdata 126 282 0 real 1m38.965s user 0m0.688s sys 0m37.017s After patch: $ time ./add_del_unshare.sh net_namespace 135 291 5504 1 2 : tunables 8 4 0 : slabdata 135 291 0 real 0m22.117s user 0m0.728s sys 0m35.328s Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Implement exit_batch() method to dismantle more devices per round. (rtnl_lock() ... unregister_netdevice_many() ... rtnl_unlock()) Tested: $ cat add_del_unshare.sh for i in `seq 1 40` do (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) & done wait ; grep net_namespace /proc/slabinfo Before patch : $ time ./add_del_unshare.sh net_namespace 110 267 5504 1 2 : tunables 8 4 0 : slabdata 110 267 0 real 3m25.292s user 0m0.644s sys 0m40.153s After patch: $ time ./add_del_unshare.sh net_namespace 126 282 5504 1 2 : tunables 8 4 0 : slabdata 126 282 0 real 1m38.965s user 0m0.688s sys 0m37.017s Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
When dealing with a list of dismantling netns, we can scan tcp_metrics once, saving cpu cycles. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Having a global list of labels do not scale to thousands of netns in the cloud era. This causes quadratic behavior on netns creation and deletion. This is time having a per netns list of ~10 labels. Tested: $ time perf record (for f in `seq 1 3000` ; do ip netns add tast$f; done) [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 3.637 MB perf.data (~158898 samples) ] real 0m20.837s # instead of 0m24.227s user 0m0.328s sys 0m20.338s # instead of 0m23.753s 16.17% ip [kernel.kallsyms] [k] netlink_broadcast_filtered 12.30% ip [kernel.kallsyms] [k] netlink_has_listeners 6.76% ip [kernel.kallsyms] [k] _raw_spin_lock_irqsave 5.78% ip [kernel.kallsyms] [k] memset_erms 5.77% ip [kernel.kallsyms] [k] kobject_uevent_env 5.18% ip [kernel.kallsyms] [k] refcount_sub_and_test 4.96% ip [kernel.kallsyms] [k] _raw_read_lock 3.82% ip [kernel.kallsyms] [k] refcount_inc_not_zero 3.33% ip [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 2.11% ip [kernel.kallsyms] [k] unmap_page_range 1.77% ip [kernel.kallsyms] [k] __wake_up 1.69% ip [kernel.kallsyms] [k] strlen 1.17% ip [kernel.kallsyms] [k] __wake_up_common 1.09% ip [kernel.kallsyms] [k] insert_header 1.04% ip [kernel.kallsyms] [k] page_remove_rmap 1.01% ip [kernel.kallsyms] [k] consume_skb 0.98% ip [kernel.kallsyms] [k] netlink_trim 0.51% ip [kernel.kallsyms] [k] kernfs_link_sibling 0.51% ip [kernel.kallsyms] [k] filemap_map_pages 0.46% ip [kernel.kallsyms] [k] memcpy_erms Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Cong Wang 提交于
gen estimator has been rewritten in commit 1c0d32fd ("net_sched: gen_estimator: complete rewrite of rate estimators"), the caller no longer needs to wait for a grace period. So this patch gets rid of it. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vivien Didelot 提交于
DSA overrides the master device ethtool ops, so that it can inject stats from its dedicated switch CPU port as well. The related code is currently split in dsa.c and slave.c, but it only scopes the master net device. Move it to a new master.c DSA core file. This file will be later extented with master net device specific code. Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vivien Didelot 提交于
DSA overrides the master's ethtool ops so that we can inject its CPU port's statistics. Because of that, we need to setup the ethtool ops after the master's dsa_ptr pointer has been assigned, not before. This patch setups the ethtool ops after dsa_ptr is assigned, and restores them before it gets cleared. Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vivien Didelot 提交于
When a DSA switch tree is meant to be applied, it already has a CPU port. Thus remove the condition of dst->cpu_dp. Moreover, the next lines access dst->cpu_dp unconditionally. Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vivien Didelot 提交于
There is no need to store a copy of the master ethtool ops, storing the original pointer in DSA and the new one in the master netdev itself is enough. In the meantime, set orig_ethtool_ops to NULL when restoring the master ethtool ops and check the presence of the master original ethtool ops as well as its needed functions before calling them. Signed-off-by: NVivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
skb->rbnode shares space with skb->next, skb->prev and skb->tstamp Current uses (TCP receive ofo queue and netem) need to save/restore tstamp, while skb->dev is either NULL (TCP) or a constant for a given queue (netem). Since we plan using an RB tree for TCP retransmit queue to speedup SACK processing with large BDP, this patch exchanges skb->dev and skb->tstamp. This saves some overhead in both TCP and netem. v2: removes the swtstamp field from struct tcp_skb_cb Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Wei Wang <weiwan@google.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Corentin Labbe 提交于
net/vmw_vsock/vmci_transport.c does not use any miscdevice so this patch remove this unnecessary inclusion. Signed-off-by: NCorentin Labbe <clabbe.montjoie@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 9月, 2017 4 次提交
-
-
由 Eric Dumazet 提交于
HTB qdisc overlimits counter is properly increased, but we have no per class counter, meaning it is difficult to diagnose HTB problems. This patch adds this counter, visible in "tc -s class show dev eth0", with current iproute2. Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NDenys Fedoryshchenko <nuclearcat@nuclearcat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Colin Ian King 提交于
Pointer tcm is being initialized and is never read, it is only being used to determine the size of struct tcmsg. Clean this up by removing variable tcm and explicitly using the sizeof struct tcmsg rather than *tcm. Cleans up clang warning: warning: Value stored to 'tcm' during its initialization is never read Signed-off-by: NColin Ian King <colin.king@canonical.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Joe Perches 提交于
Global function ipv6_rcv_saddr_equal and static functions ipv6_rcv_saddr_equal and ipv4_rcv_saddr_equal currently return int. bool is slightly more descriptive for these functions so change their return type from int to bool. Signed-off-by: NJoe Perches <joe@perches.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Joe Perches 提交于
Move the #ifdef into the static void function so that the use of DBGUNDO is validated when FASTRETRANS_DEBUG <= 1. Remove the now unnecessary #else and #define DBGUNDO. Signed-off-by: NJoe Perches <joe@perches.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 9月, 2017 1 次提交
-
-
由 Eric Dumazet 提交于
Now skb->mstamp_skb is updated later, we also need to call tcp_rate_skb_sent() after the update is done. Fixes: 8c72c65b ("tcp: update skb->skb_mstamp more carefully") Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 9月, 2017 4 次提交
-
-
由 Xin Long 提交于
sctp_diag would not actually dump out sk/asoc if inet_sctp_diag_fill returns err, in which case it shouldn't mark sk dumped by setting cb->args[3] as 1 in sctp_sock_dump(). Otherwise, it could cause some asocs to have no parent's sk dumped in 'ss --sctp'. So this patch is to not set cb->args[3] when inet_sctp_diag_fill() returns err in sctp_sock_dump(). Fixes: 8f840e47 ("sctp: add the sctp_diag.c file") Signed-off-by: NXin Long <lucien.xin@gmail.com> Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: NNeil Horman <nhorman@tuxdriver.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Xin Long 提交于
Commit 86fdb344 ("sctp: ensure ep is not destroyed before doing the dump") tried to fix an use-after-free issue by checking !sctp_sk(sk)->ep with holding sock and sock lock. But Paolo noticed that endpoint could be destroyed in sctp_rcv without sock lock protection. It means the use-after-free issue still could be triggered when sctp_rcv put and destroy ep after sctp_sock_dump checks !ep, although it's pretty hard to reproduce. I could reproduce it by mdelay in sctp_rcv while msleep in sctp_close and sctp_sock_dump long time. This patch is to add another param cb_done to sctp_for_each_transport and dump ep->assocs with holding tsp after jumping out of transport's traversal in it to avoid this issue. It can also improve sctp diag dump to make it run faster, as no need to save sk into cb->args[5] and keep calling sctp_for_each_transport any more. This patch is also to use int * instead of int for the pos argument in sctp_for_each_transport, which could make postion increment only in sctp_for_each_transport and no need to keep changing cb->args[2] in sctp_sock_filter and sctp_sock_dump any more. Fixes: 86fdb344 ("sctp: ensure ep is not destroyed before doing the dump") Reported-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NXin Long <lucien.xin@gmail.com> Acked-by: NMarcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: NNeil Horman <nhorman@tuxdriver.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
liujian reported a problem in TCP_USER_TIMEOUT processing with a patch in tcp_probe_timer() : https://www.spinics.net/lists/netdev/msg454496.html After investigations, the root cause of the problem is that we update skb->skb_mstamp of skbs in write queue, even if the attempt to send a clone or copy of it failed. One reason being a routing problem. This patch prevents this, solving liujian issue. It also removes a potential RTT miscalculation, since __tcp_retransmit_skb() is not OR-ing TCP_SKB_CB(skb)->sacked with TCPCB_EVER_RETRANS if a failure happens, but skb->skb_mstamp has been changed. A future ACK would then lead to a very small RTT sample and min_rtt would then be lowered to this too small value. Tested: # cat user_timeout.pkt --local_ip=192.168.102.64 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 `ifconfig tun0 192.168.102.64/16; ip ro add 192.0.2.1 dev tun0` +0 < S 0:0(0) win 0 <mss 1460> +0 > S. 0:0(0) ack 1 <mss 1460> +.1 < . 1:1(0) ack 1 win 65530 +0 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_USER_TIMEOUT, [3000], 4) = 0 +0 write(4, ..., 24) = 24 +0 > P. 1:25(24) ack 1 win 29200 +.1 < . 1:1(0) ack 25 win 65530 //change the ipaddress +1 `ifconfig tun0 192.168.0.10/16` +1 write(4, ..., 24) = 24 +1 write(4, ..., 24) = 24 +1 write(4, ..., 24) = 24 +1 write(4, ..., 24) = 24 +0 `ifconfig tun0 192.168.102.64/16` +0 < . 1:2(1) ack 25 win 65530 +0 `ifconfig tun0 192.168.0.10/16` +3 write(4, ..., 24) = -1 # ./packetdrill user_timeout.pkt Signed-off-by: NEric Dumazet <edumazet@googl.com> Reported-by: Nliujian <liujian56@huawei.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 David Ahern 提交于
rt_iif is only set to the actual egress device for the output path. The recent change to consider the l3slave flag when returning IP_PKTINFO works for local traffic (the correct device index is returned), but it broke the more typical use case of packets received from a remote host always returning the VRF index rather than the original ingress device. Update the fixup to consider l3slave and rt_iif actually getting set. Fixes: 1dfa7639 ("net: ipv4: add check for l3slave for index returned in IP_PKTINFO") Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 9月, 2017 1 次提交
-
-
由 Tobias Klauser 提交于
Make the needlessly global function tls_sw_free_resources static to fix a gcc/sparse warning. Signed-off-by: NTobias Klauser <tklauser@distanz.ch> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 9月, 2017 2 次提交
-
-
由 Eric Dumazet 提交于
Denys reported wrong rate estimations with HTB classes. It appears the bug was added in linux-4.10, since my tests where using intervals of one second only. HTB using 4 sec default rate estimators, reported rates were 4x higher. We need to properly scale the bytes/packets samples before integrating them in EWMA. Tested: echo 1 >/sys/module/sch_htb/parameters/htb_rate_est Setup HTB with one class with a rate/cail of 5Gbit Generate traffic on this class tc -s -d cl sh dev eth0 classid 7002:11 class htb 7002:11 parent 7002:1 prio 5 quantum 200000 rate 5Gbit ceil 5Gbit linklayer ethernet burst 80000b/1 mpu 0b cburst 80000b/1 mpu 0b level 0 rate_handle 1 Sent 1488215421648 bytes 982969243 pkt (dropped 0, overlimits 0 requeues 0) rate 5Gbit 412814pps backlog 136260b 2p requeues 0 TCP pkts/rtx 982969327/45 bytes 1488215557414/68130 lended: 22732826 borrowed: 0 giants: 0 tokens: -1684 ctokens: -1684 Fixes: 1c0d32fd ("net_sched: gen_estimator: complete rewrite of rate estimators") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NDenys Fedoryshchenko <nuclearcat@nuclearcat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jiri Pirko 提交于
Recent commit d7fb60b9 ("net_sched: get rid of tcfa_rcu") removed freeing in call_rcu, which changed already existing hard-to-hit race condition into 100% hit: [ 598.599825] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030 [ 598.607782] IP: tcf_action_destroy+0xc0/0x140 Or: [ 40.858924] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030 [ 40.862840] IP: tcf_generic_walker+0x534/0x820 Fix this by storing the ops and use them directly for module_put call. Fixes: a85a970a ("net_sched: move tc_action into tcf_common") Signed-off-by: NJiri Pirko <jiri@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 13 9月, 2017 7 次提交
-
-
由 Haishuang Yan 提交于
In collect_md mode, if the tun dev is down, it still can call __ip6_tnl_rcv to receive on packets, and the rx statistics increase improperly. When the md tunnel is down, it's not neccessary to increase RX drops for the tunnel device, packets would be recieved on fallback tunnel, and the RX drops on fallback device will be increased as expected. Fixes: 8d79266b ("ip6_tunnel: add collect_md mode to IPv6 tunnels") Cc: Alexei Starovoitov <ast@fb.com> Signed-off-by: NHaishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Haishuang Yan 提交于
In collect_md mode, if the tun dev is down, it still can call ip_tunnel_rcv to receive on packets, and the rx statistics increase improperly. When the md tunnel is down, it's not neccessary to increase RX drops for the tunnel device, packets would be recieved on fallback tunnel, and the RX drops on fallback device will be increased as expected. Fixes: 2e15ea39 ("ip_gre: Add support to collect tunnel metadata.") Cc: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: NHaishuang Yan <yanhaishuang@cmss.chinamobile.com> Acked-by: NPravin B Shelar <pshelar@ovn.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Cong Wang 提交于
As pointed out by Jiri, there is still a race condition between tcf_block_put() and tcf_chain_destroy() in a RCU callback. There is no way to make it correct without proper locking or synchronization, because both operate on a shared list. Locking is hard, because the only lock we can pick here is a spinlock, however, in tc_dump_tfilter() we iterate this list with a sleeping function called (tcf_chain_dump()), which makes using a lock to protect chain_list almost impossible. Jiri suggested the idea of holding a refcnt before flushing, this works because it guarantees us there would be no parallel tcf_chain_destroy() during the loop, therefore the race condition is gone. But we have to be very careful with proper synchronization with RCU callbacks. Suggested-by: NJiri Pirko <jiri@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Acked-by: NJiri Pirko <jiri@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Cong Wang 提交于
This patch fixes the following ugliness of tc filter chain refcnt: a) tp proto should hold a refcnt to the chain too. This significantly simplifies the logic. b) Chain 0 is no longer special, it is created with refcnt=1 like any other chains. All the ugliness in tcf_chain_put() can be gone! c) No need to handle the flushing oddly, because block still holds chain 0, it can not be released, this guarantees block is the last user. d) The race condition with RCU callbacks is easier to handle with just a rcu_barrier(). Much easier to understand, nothing to hide. Thanks to the previous patch. Please see also the comments in code. e) Make the code understandable by humans, much less error-prone. Fixes: 744a4cf6 ("net: sched: fix use after free when tcf_chain_destroy is called multiple times") Fixes: 5bc17018 ("net: sched: introduce multichain support for filters") Cc: Jiri Pirko <jiri@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Acked-by: NJiri Pirko <jiri@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Cong Wang 提交于
gen estimator has been rewritten in commit 1c0d32fd ("net_sched: gen_estimator: complete rewrite of rate estimators"), the caller is no longer needed to wait for a grace period. So this patch gets rid of it. This also completely closes a race condition between action free path and filter chain add/remove path for the following patch. Because otherwise the nested RCU callback can't be caught by rcu_barrier(). Please see also the comments in code. Cc: Jiri Pirko <jiri@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Back in linux-4.4, I inadvertently put a call to reqsk_put() in inet_child_forget(), forgetting it could be called from two different points. In the case it is called from inet_csk_reqsk_queue_add(), we want to keep the reference on the request socket, since it is released later by the caller (tcp_v{4|6}_rcv()) This bug never showed up because atomic_dec_and_test() was not signaling the underflow, and SLAB_DESTROY_BY RCU semantic for request sockets prevented the request to be put in quarantine. Recent conversion of socket refcount from atomic_t to refcount_t finally exposed the bug. So move the reqsk_put() to inet_csk_listen_stop() to fix this. Thanks to Shankara Pailoor for using syzkaller and providing a nice set of .config and C repro. WARNING: CPU: 2 PID: 4277 at lib/refcount.c:186 refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:186 Kernel panic - not syncing: panic_on_warn set ... CPU: 2 PID: 4277 Comm: syz-executor0 Not tainted 4.13.0-rc7 #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0xf7/0x1aa lib/dump_stack.c:52 panic+0x1ae/0x3a7 kernel/panic.c:180 __warn+0x1c4/0x1d9 kernel/panic.c:541 report_bug+0x211/0x2d0 lib/bug.c:183 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190 do_trap_no_signal arch/x86/kernel/traps.c:224 [inline] do_trap+0x260/0x390 arch/x86/kernel/traps.c:273 do_error_trap+0x118/0x340 arch/x86/kernel/traps.c:310 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323 invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:846 RIP: 0010:refcount_sub_and_test+0x167/0x1b0 lib/refcount.c:186 RSP: 0018:ffff88006e006b60 EFLAGS: 00010286 RAX: 0000000000000026 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000026 RSI: 1ffff1000dc00d2c RDI: ffffed000dc00d60 RBP: ffff88006e006bf0 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 1ffff1000dc00d6d R13: 00000000ffffffff R14: 0000000000000001 R15: ffff88006ce9d340 refcount_dec_and_test+0x1a/0x20 lib/refcount.c:211 reqsk_put+0x71/0x2b0 include/net/request_sock.h:123 tcp_v4_rcv+0x259e/0x2e20 net/ipv4/tcp_ipv4.c:1729 ip_local_deliver_finish+0x2e2/0xba0 net/ipv4/ip_input.c:216 NF_HOOK include/linux/netfilter.h:248 [inline] ip_local_deliver+0x1ce/0x6d0 net/ipv4/ip_input.c:257 dst_input include/net/dst.h:477 [inline] ip_rcv_finish+0x8db/0x19c0 net/ipv4/ip_input.c:397 NF_HOOK include/linux/netfilter.h:248 [inline] ip_rcv+0xc3f/0x17d0 net/ipv4/ip_input.c:488 __netif_receive_skb_core+0x1fb7/0x31f0 net/core/dev.c:4298 __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4336 process_backlog+0x1c5/0x6d0 net/core/dev.c:5102 napi_poll net/core/dev.c:5499 [inline] net_rx_action+0x6d3/0x14a0 net/core/dev.c:5565 __do_softirq+0x2cb/0xb2d kernel/softirq.c:284 do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:898 </IRQ> do_softirq.part.16+0x63/0x80 kernel/softirq.c:328 do_softirq kernel/softirq.c:176 [inline] __local_bh_enable_ip+0x84/0x90 kernel/softirq.c:181 local_bh_enable include/linux/bottom_half.h:31 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:705 [inline] ip_finish_output2+0x8ad/0x1360 net/ipv4/ip_output.c:231 ip_finish_output+0x74e/0xb80 net/ipv4/ip_output.c:317 NF_HOOK_COND include/linux/netfilter.h:237 [inline] ip_output+0x1cc/0x850 net/ipv4/ip_output.c:405 dst_output include/net/dst.h:471 [inline] ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124 ip_queue_xmit+0x8c6/0x1810 net/ipv4/ip_output.c:504 tcp_transmit_skb+0x1963/0x3320 net/ipv4/tcp_output.c:1123 tcp_send_ack.part.35+0x38c/0x620 net/ipv4/tcp_output.c:3575 tcp_send_ack+0x49/0x60 net/ipv4/tcp_output.c:3545 tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:5795 [inline] tcp_rcv_state_process+0x4876/0x4b60 net/ipv4/tcp_input.c:5930 tcp_v4_do_rcv+0x58a/0x820 net/ipv4/tcp_ipv4.c:1483 sk_backlog_rcv include/net/sock.h:907 [inline] __release_sock+0x124/0x360 net/core/sock.c:2223 release_sock+0xa4/0x2a0 net/core/sock.c:2715 inet_wait_for_connect net/ipv4/af_inet.c:557 [inline] __inet_stream_connect+0x671/0xf00 net/ipv4/af_inet.c:643 inet_stream_connect+0x58/0xa0 net/ipv4/af_inet.c:682 SYSC_connect+0x204/0x470 net/socket.c:1628 SyS_connect+0x24/0x30 net/socket.c:1609 entry_SYSCALL_64_fastpath+0x18/0xad RIP: 0033:0x451e59 RSP: 002b:00007f474843fc08 EFLAGS: 00000216 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 0000000000451e59 RDX: 0000000000000010 RSI: 0000000020002000 RDI: 0000000000000007 RBP: 0000000000000046 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000216 R12: 0000000000000000 R13: 00007ffc040a0f8f R14: 00007f47484409c0 R15: 0000000000000000 Fixes: ebb516af ("tcp/dccp: fix race at listener dismantle phase") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NShankara Pailoor <sp3485@columbia.edu> Tested-by: NShankara Pailoor <sp3485@columbia.edu> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Christophe JAILLET 提交于
All other error handling paths in this function go through the 'error' label. This one should do the same. Fixes: 9cc9a5cb ("datapath: Avoid using stack larger than 1024.") Signed-off-by: NChristophe JAILLET <christophe.jaillet@wanadoo.fr> Acked-by: NPravin B Shelar <pshelar@ovn.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-